**'cern.hep.aida.bin.QuantileBin1D'**Java class

## Class QuantileBin1D

- All Implemented Interfaces:
- DoubleBufferConsumer, Serializable, Cloneable

- Direct Known Subclasses:
- DynamicBin1D

public class QuantileBin1Dextends MightyStaticBin1D

1-dimensional non-rebinnable bin holding`double`elements with scalable quantile operations defined upon;Using little main memory, quickly computes approximate quantiles over very large data sequences with and even without a-priori knowledge of the number of elements to be filled;Conceptually a strongly lossily compressed multiset (or bag);Guarantees to respect the worst case approximation error specified upon instance construction.First see the package summary and javadoc tree view to get the broad picture.**Motivation and Problem:**Intended to help scale applications requiring quantile computation.Quantile computation on very large data sequences is problematic, for the following reasons:Computing quantiles requires sorting the data sequence.To sort a data sequence the entire data sequence needs to be available.Thus, data cannot be thrown away during filling (as done by static bins like`StaticBin1D`

and`MightyStaticBin1D`

).It needs to be kept, either in main memory or on disk.There is often not enough main memory available.Thus, during filling data needs to be streamed onto disk.Sorting disk resident data is prohibitively time consuming.As a consequence, traditional methods either need very large memories (like`DynamicBin1D`

) or time consuming disk based sorting.This class proposes to efficiently solve the problem, at the expense of producing approximate rather than exact results.It can deal with infinitely many elements without resorting to disk.The main memory requirements are smaller than for any other known approximate technique by an order of magnitude.They get even smaller if an upper limit on the maximum number of elements ever to be added is known a-priori.

**Approximation error:**The approximation guarantees are parametrizable and explicit but probabilistic, and apply for arbitrary value distributions and arrival distributions of the data sequence.In other words, this class guarantees to respect the worst case approximation error specified upon instance construction to a certain probability.Of course, if it is specified that the approximation error should not exceed some number*very close*to zero,this class will potentially consume just as much memory as any of the traditional exact techniques would do.However, for errors larger than 10^{-5}, its memory requirements are modest, as shown by the table below.**Main memory requirements:**Given in megabytes, assuming a single element (`double`) takes 8 byte.The number of elements required is then`MB*1024*1024/8`.**Parameters:***epsilon*- the maximum allowed approximation error on quantiles; in`[0.0,1.0]`.To get exact rather than approximate quantiles, set`epsilon=0.0`;*delta*- the probability allowed that the approximation error fails to be smaller than epsilon; in`[0.0,1.0]`.To avoid probabilistic answers, set`delta=0.0`.For example,`delta = 0.0001`is equivalent to a confidence of`99.99%`.*quantiles*- the number of quantiles to be computed; in`[0,Integer.MAX_VALUE]`.*is N known?*- specifies whether the exact size of the dataset over which quantiles are to be computed is known.- N
_{max}- the exact dataset size, if known. Otherwise, an upper limit on the dataset size. If no upper limit is known set to infinity (`Long.MAX_VALUE`).

_{max}=inf - we are sure that exactly (*known*) or less than (*unknown*) infinity elements will be added.

N_{max}=10^{6}- we are sure that exactly (*known*) or less than (*unknown*) 10^{6}elements will be added.

N_{max}=10^{7}- we are sure that exactly (*known*) or less than (*unknown*) 10^{7}elements will be added.

N_{max}=10^{8}- we are sure that exactly (*known*) or less than (*unknown*) 10^{8}elements will be added.Required main memory [MB]#quantiles epsilondelta N unknown N known N _{max}=infN _{max}=10^{6}N _{max}=10^{7}N _{max}=10^{8}N _{max}=infN _{max}=10^{6}N _{max}=10^{7}N _{max}=10^{8}any 0any infinity 7.6 76 762 infinity 7.6 76 762 any 10^{ -1}0 infinity 0.003 0.005 0.006 0.03 0.003 0.005 0.006 10^{ -2}0.02 0.03 0.05 0.31 0.02 0.03 0.05 10^{ -3}0.12 0.2 0.3 2.7 0.12 0.2 0.3 10^{ -4}0.6 1.2 2.1 26.9 0.6 1.2 2.1 10^{ -5}2.5 6.4 11.6 205 2.5 6.4 11.6 10^{ -6}7.6 25.4 63.6 1758 7.6 25.4 63.6 100 10^{ -2}10 ^{ -1}0.033 0.021 0.03 0.03 0.020 0.020 0.020 0.020 10 ^{ -5}0.038 0.021 0.03 0.04 0.024 0.020 0.020 0.020 10^{ -3}10 ^{ -1}0.48 0.12 0.2 0.3 0.32 0.12 0.2 0.3 10 ^{ -5}0.54 0.12 0.2 0.3 0.37 0.12 0.2 0.3 10^{ -4}10 ^{ -1}6.6 0.6 1.2 2.1 4.6 0.6 1.2 2.1 10 ^{ -5}7.2 0.6 1.2 2.1 5.2 0.6 1.2 2.1 10^{ -5}10 ^{ -1}86 2.5 6.4 11.6 63 2.5 6.4 11.6 10 ^{ -5}94 2.5 6.4 11.6 70 2.5 6.4 11.6 10000 10^{ -2}10 ^{ -1}0.04 0.02 0.03 0.04 0.02 0.02 0.02 0.02 10 ^{ -5}0.04 0.02 0.03 0.04 0.03 0.02 0.03 0.03 10^{ -3}10 ^{ -1}0.52 0.12 0.21 0.3 0.35 0.12 0.21 0.3 10 ^{ -5}0.56 0.12 0.21 0.3 0.38 0.12 0.21 0.3 10^{ -4}10 ^{ -1}7.0 0.64 1.2 2.1 5.0 0.64 1.2 2.1 10 ^{ -5}7.5 0.64 1.2 2.1 5.4 0.64 1.2 2.1 10^{ -5}10 ^{ -1}90 2.5 6.4 11.6 67 2.5 6.4 11.6 10 ^{ -5}96 2.5 6.4 11.6 71 2.5 6.4 11.6 #quantiles epsilon delta N _{max}=infN _{max}=10^{6}N _{max}=10^{7}N _{max}=10^{8}N _{max}=infN _{max}=10^{6}N _{max}=10^{7}N _{max}=10^{8}N unknown N known Required main memory [MB] **Implementation:**After: Gurmeet Singh Manku, Sridhar Rajagopalan and Bruce G. Lindsay,Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets.Proc. of the 1999 ACM SIGMOD Int. Conf. on Management of Data,Paper available here.

and

Gurmeet Singh Manku, Sridhar Rajagopalan and Bruce G. Lindsay, Approximate Medians and other Quantiles in One Pass and with Limited Memory,Proc. of the 1998 ACM SIGMOD Int. Conf. on Management of Data,Paper available here.

The broad picture is as follows. Two concepts are used:

*Shrinking*and*Sampling*.Shrinking takes a data sequence, sorts it and produces a shrinked data sequence by picking every k-th element and throwing away all the rest.The shrinked data sequence is an approximation to the original data sequence.Imagine a large data sequence (residing on disk or being generated in memory on the fly) and a main memory

*block*of`n=b*k`elements (`b`is the number of buffers,`k`is the number of elements per buffer). Fill elements from the data sequence into the block until it is full or the data sequence is exhausted.When the block (or a subset of buffers) is full and the data sequence is not exhausted, apply shrinking to lossily compress a number of buffers into one single buffer.Repeat these steps until all elements of the data sequence have been consumed.Now the block is a shrinked approximation of the original data sequence. Treating it as if it would be the original data sequence, we can determine quantiles in main memory.Now, the whole thing boils down to the question of: Can we choose

`b`and`k`(the number of buffers and the buffer size) such that`b*k`is minimized, yet quantiles determined upon the block are*guaranteed*to be away from the true quantiles no more than some`epsilon`?It turns out, we can. It also turns out that the required main memory block size`n=b*k`is usually moderate (see the table above).The theme can be combined with random sampling to further reduce main memory requirements, at the expense of probabilistic guarantees.Sampling filters the data sequence and feeds only selected elements to the algorithm outlined above.Sampling is turned on or off, depending on the parametrization.

This quick overview does not go into important details, such as assigning proper

*weights*to buffers, how to choose subsets of buffers to shrink, etc. For more information consult the papers cited above.**Time Performance:**Pentium Pro 200 Mhz, SunJDK 1.2.2, NT, java -classic,

filling 10^{4}elements at a time, reading 100 percentiles at a time,

hasSumOfLogarithms()=false, hasSumOfInversions()=false, getMaxOrderForSumOfPowers()=2Performance Quantiles Epsilon Delta Filling

[#elements/sec]Quantile computation

[#quantiles/sec]N unknown,

N_{max}=infN known,

N_{max}=10^{7}N unknown,

N_{max}=infN known,

N_{max}=10^{7}10 ^{4}10 ^{ -1}10 ^{ -1}1600000

1300000 250000 130000 10 ^{ -2}360000 1200000 50000 20000 10 ^{ -3}150000 200000 3600 3000 10 ^{ -4}120000 170000 80 1000 - See Also:
`cern.jet.stat.quantile`

, Serialized Form

**Warning:**You cannot see the full API documentation of this class since the access to the DatMelt documentation for third-party Java classes is denied. Guests can only view jhplot Java API. To view the complete description of this class and its methods, please request the full DataMelt membership.

If you are already a full member, please login to the DataMelt member area before visiting this documentation.