Table of Content
Search clouds
Licenses
Author's resources
This is an old revision of the document!
Statistics is the study of the collection, analysis, interpretation, presentation and organization of data. It deals with all aspects of data including the planning of data collection in terms of the design of surveys and experiments.
The package jhplot.stat can be used for descriptive analysis of random distributions. Similarly, cern.jet.stat.Descriptive package contains descriptive methods to calculate many statistical characteristics.
Consider also several other packages:
But before using such packages, check again the data containers such as P1D or H1D. They already have many useful methods to access statistical information on data.
Some statistical questions have been already described in the Section random_numbers. Below we will consider several advances topics.
SCaVis can be used to determine statistical characteristics of an arbitrary frequency distribution. Moments of this distribution can be calculated up to the 6th order. Read more here.
1: from jhplot import * 2: from jhplot.math.StatisticSample import * 3: 4: a=randomLogNormal(1000,0,10) # generate random 1000 numbers between 0 and 10 using a LogNormal distribution 5: p0=P0D(a) # convert it to an array 6: print p0.getStatString() # print detailed characteristics
Run this script and you will get a very detailed information about this distribution (rather self-explanatory)
Click here to see the result of this code
Click here to see the result of this code
Size: 1000 Sum: 2.0795326321690155E11 SumOfSquares: 1.722072831288292E22 Min: 4.3681673233597326E-14 Max: 1.187289072883721E11 Mean: 2.0795326321690154E8 RMS: 4.1497865382309628E9 Variance: 1.7194678431631995E19 Standard deviation: 4.14664664899627E9 Standard error: 1.3112848062732975E8 Geometric mean: 0.7193930848008395 Product: 9.252494313364321E-144 Harmonic mean: 2.2976022239249118E-11 Sum of inversions: 4.352363475222163E13 Skew: 25.65476598759878 Kurtosis: 694.7699433878664 Sum of powers(3): 1.839916709064571E33 Sum of powers(4): 2.0782654881247146E44 Sum of powers(5): 2.4093597349729484E55 Sum of powers(6): 2.8286717081193334E66 Moment(0,0): 1.0 Moment(1,0): 2.0795326321690154E8 Moment(2,0): 1.722072831288292E19 Moment(3,0): 1.839916709064571E30 Moment(4,0): 2.0782654881247147E41 Moment(5,0): 2.409359734972948E52 Moment(6,0): 2.8286717081193336E63 Moment(0,mean()): 1.0 Moment(1,mean()): 4.931390285491944E-7 Moment(2,mean()): 1.7177483753200437E19 Moment(3,mean()): 1.8291913748162454E30 Moment(4,mean()): 2.0630054468429083E41 Moment(5,mean()): 2.3878300421487077E52 Moment(6,mean()): 2.798744135044988E63 25%, 50%, 75% Quantiles: 0.0012310644573427145, 0.9530465118707188, 535.0653267374155 quantileInverse(median): 0.5005 Distinct elements & frequencies not printed (too many).
One can access all such values using the method “getStat()” which returns a Java Map (or Jython dictionary) with the key representing statistical characteristics of this array.
You can also visualize the random numbers in the form of a histogram:
from jhplot import * from jhplot.math.StatisticSample import * a=randomLogNormal(1000,0,10) h=H1D("LogNormal",40,-10,10) # make a histogram h.fill(a) c1 = HPlot() c1.setGTitle("LogNormal"); c1.visible() c1.setAutoRange() c1.draw(h)
You can get detailed statistics on P1D data using the method getStat(axis), where axis=0 for X and axis=1 for Y. It returns a map (for JAVA) or Python dictionary (for Jython) where each statistical characteristics can be accessed using a key, such as mean, RMS, variance, error on the mean at. Assuming that P1D is represented by “p1” object, try this code:
stat=p2.getStat(0) # get PYTHON dictionary with statistics for X for key in stat: print key , 't', stat[key]
This will print the following values:
Click here to see the output
Click here to see the output
error 0.996592835069 rms 5.05682000584 mean 4.42857142857 variance 6.95238095238 stddev 2.63673679998
Two distributions (1D and 2D histograms, P1D data points) can be compared by applying several statistical tests. The following statistical comparisons are available
Consider a simple statistical test: compare 2 histograms. You can generate 2 similar histograms using this code snippet:
from java.awt import Color from java.util import Random from jhplot import * c1 = HPlotJa("Canvas") c1.setGTitle("Statistical comparisons") c1.visible() c1.setAutoRange() h1 = H1D("Histo1",20, -2, 2.0) h1.setColor(Color.blue) h2 = H1D("Histo2",20, -2, 2.0) r = Random() for i in range(10000): h1.fill(r.nextGaussian()) h2.fill(r.nextGaussian()) if (i<100): h2.fill(2*r.nextGaussian()+2) h1.setErrAll(1) h2.setErrAll(0) c1.draw(h1) c1.draw(h2)
Here we show statistical uncertainties only for the first (blue) histogram (see the method setErrAll(0)). The output of this code is shown below
Now we can perform a several tests to calculate the degree of similarity of these distributions (including their uncertainties). Below we show a code which compares these two histograms and calculate Chi2 per degree of freedom:
The output of this script is shown here:
AndersonDarling method= 2.21779532164 / 20 Chi2 method= 0.786556311893 / 20 Goodman method= 0.624205522632 / 20 KolmogorovSmirnov method= 0.419524135727 / 20
In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.
There are several tools to perform linear regressions and estimate slope and intercept (with statistical uncertainties), as well as to estimate the prediction and confidence level bands. Let us make a short example generating data in 2D and performing linear regression fit:
1: from jhplot import * 2: from jhplot.stat import LinReg 3: from java.awt import Color 4: from java.util import Random 5: 6: c1 = HPlot("Linear regression") 7: c1.visible() 8: c1.setGTitle("Linear regression") 9: c1.setAutoRange() 10: 11: p1= P1D("data") 12: rand = Random() 13: for i in range(200): 14: x=rand.nextGaussian() 15: y=rand.nextGaussian() 16: p1.add(0.2*x, y) 17: c1.draw(p1) 18: 19: r = LinReg(p1) 20: print "Intercept=",r.getIntercept(), "+/-",r.getInterceptError() 21: print "Slope=",r.getSlope(),"+/-",r.getSlopeError() 22: 23: pP=r.getPredictionBand(Color.red,0.5) # get prediction and show as a band 24: c1.draw(pP) 25: 26: c1.draw( r.getResult() ) # draw F1D function representing the fit result 27: c1.draw( r.getPrediction() ) # draw the prediction interval
The output of this script is the fit values:
Intercept= 0.0637647629564 +/- 0.0656878315703 Slope= 0.1024396794 +/- 0.331874315937
Many useful distribution functions can be found in cern.jet.stat.Probability package. The package contains numerical integration of certain probability distributions. Below we will show how to use the normal distribution which is very useful distribution in many statistical analyses.
In the example below we will compute the probability that our random outcome is within a specified interval using the normal distribution.
The code below returns the area under the normal probability density function, integrated from minus infinity to -1.17 (assumes mean is zero, variance is one).
from cern.jet.stat.Probability import * print normal(-1.17)
For the two-sided case, one can multiply the result by 2.
Normalized factorial moments (NFM) are used to measure deviations of a multiplicity distribution from a Poissonian case. As example, let us consider calculations of normalised factorial moments (NFM) for several distributions. They are defined as
<m 14>F_q = {<n (n-1) .. (n+1-q)>}/{ <n>^{q}}</m>.
where “n” is a random number (integer) number. According to this definition, a Poisson distribution has all moments equal to 1. A broader than a Poisson distribution have moments larger then one. Let us calculate the NFM up to 4th order for a Poisson distribution Binomial and a Negative-binomial distributions
1: # Statistics | P | 1.7 | S.Chekanov | Calculations of normalised factorial moments of several distributions 2: # Definide as F_q=<n*(n-1) .. (n-q+1)/ <n>^q 3: # authors: S.Chekanov 4: 5: 6: from jhplot import * 7: from cern.jet.random.engine import * 8: from cern.jet.random import * 9: from jhplot.shapes import Line 10: from java.awt import Color 11: from jhpro.stat import * 12: 13: 14: c1 = HPlot("Canvas",600,400) 15: c1.visible(1) 16: c1.setNameX("NFM order") 17: c1.setNameY("Values") 18: c1.setRange(0,5,0,2) 19: 20: # show a line in the NDC system 21: line = Line(0.0,1, 5., 1.) 22: line.setPosCoord("USER") 23: line.setColor(Color.gray) 24: line.setTransparency(0.5) 25: c1.add(line) 26: 27: # build a random engine 28: engine=MersenneTwister() 29: poisson=Poisson(10,engine) 30: m=MomentsFacNorm(4) # calculates moments up to 4th order 31: 32: # use a Poissonian random numbers 33: for i in range(100): 34: m.process( poisson.nextInt()) 35: p1=m.getResults() 36: p1.setTitle("NFM for Poissson") 37: p1.setSymbol(4) 38: p1.setSymbolSize(10) 39: c1.draw(p1) 40: print(p1.toString()) 41: 42: 43: ## Binomial distribution 44: binomial=Binomial(10, 0.2, engine) 45: m=MomentsFacNorm(4) # calculates moments up to 4th order 46: for i in range(200): 47: m.process( binomial.nextInt()) 48: p2=m.getResults() 49: p2.setTitle("NFM for Binomial") 50: p2.setSymbol(5) 51: p2.setSymbolSize(10) 52: c1.draw(p2) 53: print(p2.toString()) 54: 55: 56: ## NegativeBinomial distribution (NBD) 57: nbinom=NegativeBinomial(10, 0.4, engine) 58: m=MomentsFacNorm(4) 59: for i in range(300): 60: m.process( nbinom.nextInt()) 61: p3=m.getResults() 62: p3.setTitle("NFM for NBD") 63: p3.setSymbol(6) 64: p3.setSymbolSize(10) 65: c1.draw(p3) 66: print(p3.toString())
The output file shows the NFM for all three distributions together with statistical errors.
Correlation coefficients between two Python lists can can be obtained using several methods.
This section describes how to set a limit on observation of a signal in the presence of a background distribution. We will also consider a situation when the background is affected by a systematic uncertainty. We will consider how to estimate statistical significance of an observation in presence of statistical and systematic al errors (in case of observation) and also how to set the 95% CL exclusion limit (in case of no observation). We will show how to estimate 95% confidence limit with correct treatment of statistical errors.
Click to read more
Click to read more
A complete description of how to use Java, Jython and SCaVis for scientific analysis is described in the book Scientific data analysis using Jython and Java published by Springer Verlag, London, 2010 (by S.V.Chekanov)