Documentation API of the 'cern.hep.aida.bin.QuantileBin1D' Java class
QuantileBin1D
cern.hep.aida.bin

Class QuantileBin1D

  • All Implemented Interfaces:
    DoubleBufferConsumer, Serializable, Cloneable
    Direct Known Subclasses:
    DynamicBin1D


    public class QuantileBin1Dextends MightyStaticBin1D
    1-dimensional non-rebinnable bin holding double elements with scalable quantile operations defined upon;Using little main memory, quickly computes approximate quantiles over very large data sequences with and even without a-priori knowledge of the number of elements to be filled;Conceptually a strongly lossily compressed multiset (or bag);Guarantees to respect the worst case approximation error specified upon instance construction.First see the package summary and javadoc tree view to get the broad picture.

    Motivation and Problem:Intended to help scale applications requiring quantile computation.Quantile computation on very large data sequences is problematic, for the following reasons:Computing quantiles requires sorting the data sequence.To sort a data sequence the entire data sequence needs to be available.Thus, data cannot be thrown away during filling (as done by static bins like StaticBin1D and MightyStaticBin1D).It needs to be kept, either in main memory or on disk.There is often not enough main memory available.Thus, during filling data needs to be streamed onto disk.Sorting disk resident data is prohibitively time consuming.As a consequence, traditional methods either need very large memories (like DynamicBin1D) or time consuming disk based sorting.

    This class proposes to efficiently solve the problem, at the expense of producing approximate rather than exact results.It can deal with infinitely many elements without resorting to disk.The main memory requirements are smaller than for any other known approximate technique by an order of magnitude.They get even smaller if an upper limit on the maximum number of elements ever to be added is known a-priori.

    Approximation error:The approximation guarantees are parametrizable and explicit but probabilistic, and apply for arbitrary value distributions and arrival distributions of the data sequence.In other words, this class guarantees to respect the worst case approximation error specified upon instance construction to a certain probability.Of course, if it is specified that the approximation error should not exceed some number very close to zero,this class will potentially consume just as much memory as any of the traditional exact techniques would do.However, for errors larger than 10-5, its memory requirements are modest, as shown by the table below.

    Main memory requirements:Given in megabytes, assuming a single element (double) takes 8 byte.The number of elements required is then MB*1024*1024/8.

    Parameters:

    • epsilon - the maximum allowed approximation error on quantiles; in [0.0,1.0].To get exact rather than approximate quantiles, set epsilon=0.0;
    • delta - the probability allowed that the approximation error fails to be smaller than epsilon; in [0.0,1.0].To avoid probabilistic answers, set delta=0.0.For example, delta = 0.0001 is equivalent to a confidence of 99.99%.
    • quantiles - the number of quantiles to be computed; in [0,Integer.MAX_VALUE].
    • is N known? - specifies whether the exact size of the dataset over which quantiles are to be computed is known.
    • Nmax - the exact dataset size, if known. Otherwise, an upper limit on the dataset size. If no upper limit is known set to infinity (Long.MAX_VALUE).
    Nmax=inf - we are sure that exactly (known) or less than (unknown) infinity elements will be added.
    Nmax=106 - we are sure that exactly (known) or less than (unknown) 106 elements will be added.
    Nmax=107 - we are sure that exactly (known) or less than (unknown) 107 elements will be added.
    Nmax=108 - we are sure that exactly (known) or less than (unknown) 108 elements will be added.

    Required main memory [MB]
    #quantiles
    epsilon
    delta   N unknown N known
    Nmax=inf Nmax=106 Nmax=107 Nmax=108 Nmax=inf Nmax=106 Nmax=107 Nmax=108
       
    any
    0
    any infinity 7.6 76 762 infinity 7.6 76 762
    any
    10 -1
    0 infinity 0.003 0.005 0.006 0.03 0.003 0.005 0.006
    10 -2
    0.02 0.03 0.05 0.31 0.02 0.03 0.05
    10 -3
    0.12 0.2 0.3 2.7 0.12 0.2 0.3
    10 -4
    0.6 1.2 2.1 26.9 0.6 1.2 2.1
    10 -5
    2.5 6.4 11.6 205 2.5 6.4 11.6
    10 -6
    7.6 25.4 63.6 1758 7.6 25.4 63.6
       
    100
    10 -2
    10 -1 0.033 0.021 0.03 0.03 0.020 0.020 0.020 0.020
    10 -5 0.038 0.021 0.03 0.04 0.024 0.020 0.020 0.020
    10 -3
    10 -1 0.48 0.12 0.2 0.3 0.32 0.12 0.2 0.3
    10 -5 0.54 0.12 0.2 0.3 0.37 0.12 0.2 0.3
    10 -4
    10 -1 6.6 0.6 1.2 2.1 4.6 0.6 1.2 2.1
    10 -5 7.2 0.6 1.2 2.1 5.2 0.6 1.2 2.1
    10 -5
    10 -1 86 2.5 6.4 11.6 63 2.5 6.4 11.6
    10 -5 94 2.5 6.4 11.6 70 2.5 6.4 11.6
       
    10000
    10 -2
    10 -1 0.04 0.02 0.03 0.04 0.02 0.02 0.02 0.02
    10 -5 0.04 0.02 0.03 0.04 0.03 0.02 0.03 0.03
    10 -3
    10 -1 0.52 0.12 0.21 0.3 0.35 0.12 0.21 0.3
    10 -5 0.56 0.12 0.21 0.3 0.38 0.12 0.21 0.3
    10 -4
    10 -1 7.0 0.64 1.2 2.1 5.0 0.64 1.2 2.1
    10 -5 7.5 0.64 1.2 2.1 5.4 0.64 1.2 2.1
    10 -5
    10 -1 90 2.5 6.4 11.6 67 2.5 6.4 11.6
    10 -5 96 2.5 6.4 11.6 71 2.5 6.4 11.6
         
    #quantiles epsilon delta Nmax=inf Nmax=106 Nmax=107 Nmax=108 Nmax=inf Nmax=106 Nmax=107 Nmax=108
    N unknown N known
    Required main memory [MB]

    Implementation:

    After: Gurmeet Singh Manku, Sridhar Rajagopalan and Bruce G. Lindsay,Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets.Proc. of the 1999 ACM SIGMOD Int. Conf. on Management of Data,Paper available here.

    and

    Gurmeet Singh Manku, Sridhar Rajagopalan and Bruce G. Lindsay, Approximate Medians and other Quantiles in One Pass and with Limited Memory,Proc. of the 1998 ACM SIGMOD Int. Conf. on Management of Data,Paper available here.

    The broad picture is as follows. Two concepts are used: Shrinking and Sampling.Shrinking takes a data sequence, sorts it and produces a shrinked data sequence by picking every k-th element and throwing away all the rest.The shrinked data sequence is an approximation to the original data sequence.

    Imagine a large data sequence (residing on disk or being generated in memory on the fly) and a main memory block of n=b*k elements (b is the number of buffers, k is the number of elements per buffer). Fill elements from the data sequence into the block until it is full or the data sequence is exhausted.When the block (or a subset of buffers) is full and the data sequence is not exhausted, apply shrinking to lossily compress a number of buffers into one single buffer.Repeat these steps until all elements of the data sequence have been consumed.Now the block is a shrinked approximation of the original data sequence. Treating it as if it would be the original data sequence, we can determine quantiles in main memory.

    Now, the whole thing boils down to the question of: Can we choose b and k (the number of buffers and the buffer size) such that b*k is minimized, yet quantiles determined upon the block are guaranteed to be away from the true quantiles no more than some epsilon?It turns out, we can. It also turns out that the required main memory block size n=b*k is usually moderate (see the table above).

    The theme can be combined with random sampling to further reduce main memory requirements, at the expense of probabilistic guarantees.Sampling filters the data sequence and feeds only selected elements to the algorithm outlined above.Sampling is turned on or off, depending on the parametrization.

    This quick overview does not go into important details, such as assigning proper weights to buffers, how to choose subsets of buffers to shrink, etc. For more information consult the papers cited above.

    Time Performance:

    Pentium Pro 200 Mhz, SunJDK 1.2.2, NT, java -classic,
    filling 10 4 elements at a time, reading 100 percentiles at a time,
    hasSumOfLogarithms()=false, hasSumOfInversions()=false, getMaxOrderForSumOfPowers()=2
    Performance
    Quantiles Epsilon Delta   Filling
    [#elements/sec]
      Quantile computation
    [#quantiles/sec]
    N unknown,
    Nmax=inf
    N known,
    Nmax=107
    N unknown,
    Nmax=inf
    N known,
    Nmax=107
         
    104 10 -1 10 -1

    1600000

    1300000 250000 130000
    10 -2 360000 1200000 50000 20000
    10 -3 150000 200000 3600 3000
    10 -4 120000 170000 80 1000
    See Also:
    cern.jet.stat.quantile, Serialized Form

Warning: You cannot see the full API documentation of this class since the access to the DatMelt documentation for third-party Java classes is denied. Guests can only view jhplot Java API. To view the complete description of this class and its methods, please request the full DataMelt membership.

If you are already a full member, please login to the DataMelt member area before visiting this documentation.