You are a guest. Restricted access. Read more.
SCaVis manual

# Statistics

Snippet from Wikipedia: Statistics

Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data. In applying statistics to, e.g., a scientific, industrial, or societal problem, it is conventional to begin with a statistical population or a statistical model process to be studied. Populations can be diverse topics such as "all persons living in a country" or "every atom composing a crystal".

The package `jhplot.stat` can be used for descriptive analysis of random distributions. Similarly, `cern.jet.stat.Descriptive` package contains descriptive methods to calculate many statistical characteristics.

Consider also several other packages:

But before using such packages, check again the data containers such as `jhplot.P1D` or `jhplot.H1D` . They already have many useful methods to access statistical information on data.

Some statistical questions have been already described in the Section random_numbers. Below we will consider several advances topics.

# Moments of a distribution

SCaVis can be used to determine statistical characteristics of an arbitrary frequency distribution. Moments of this distribution can be calculated up to the 6th order. Read more here.

Code example

```1: from jhplot  import *
2: from jhplot.math.StatisticSample import *
3:
4: a=randomLogNormal(1000,0,10) # generate random 1000 numbers between 0 and 10 using a LogNormal distribution
5: p0=P0D(a)                    # convert it to an array
6: print p0.getStatString()     # print detailed characteristics
```

```Size: 1000
Sum: 2.0795326321690155E11
SumOfSquares: 1.722072831288292E22
Min: 4.3681673233597326E-14
Max: 1.187289072883721E11
Mean: 2.0795326321690154E8
RMS: 4.1497865382309628E9
Variance: 1.7194678431631995E19
Standard deviation: 4.14664664899627E9
Standard error: 1.3112848062732975E8
Geometric mean: 0.7193930848008395
Product: 9.252494313364321E-144
Harmonic mean: 2.2976022239249118E-11
Sum of inversions: 4.352363475222163E13
Skew: 25.65476598759878
Kurtosis: 694.7699433878664
Sum of powers(3): 1.839916709064571E33
Sum of powers(4): 2.0782654881247146E44
Sum of powers(5): 2.4093597349729484E55
Sum of powers(6): 2.8286717081193334E66
Moment(0,0): 1.0
Moment(1,0): 2.0795326321690154E8
Moment(2,0): 1.722072831288292E19
Moment(3,0): 1.839916709064571E30
Moment(4,0): 2.0782654881247147E41
Moment(5,0): 2.409359734972948E52
Moment(6,0): 2.8286717081193336E63
Moment(0,mean()): 1.0
Moment(1,mean()): 4.931390285491944E-7
Moment(2,mean()): 1.7177483753200437E19
Moment(3,mean()): 1.8291913748162454E30
Moment(4,mean()): 2.0630054468429083E41
Moment(5,mean()): 2.3878300421487077E52
Moment(6,mean()): 2.798744135044988E63
25%, 50%, 75% Quantiles: 0.0012310644573427145, 0.9530465118707188, 535.0653267374155
quantileInverse(median): 0.5005
Distinct elements & frequencies not printed (too many).```

Let us continue with this example and now we would like to return all statistical characteristics of the sample as a dictionary. We can do this by appending the following lines that 1) create a dictionary “stat” with key/value pairs; 2) retrieve a variance of the sample using the key ``Variance''.

```stat=p0.getStat()
print "Variance=",stat["variance"]```

which will print “Variance= 757.3”. If not sure about the names of the keys, simply print the dictionary as “print stat”.

One can create histograms that catch the most basic characteristics of data. This is especially important if there is no particular reasons to deal with complete data arrays. We can easily do this with above Fibonacci sequence as:

```h=p0.getH1D(10, 0, 100)
print h.getStat()```

The code converts the array into a histogram with 10 equidistant bins in the range 0-100, and then it prints the map with statistical characteristics.

You can also visualize the random numbers in the form of a histogram as shown in this detailed example above. We create random numbers, convert them to histograms and plot them.

Unregistered users have a limited access to this section. You can unlock advanced pages after becoming a full member. You can also request to edit this manual and insert comments.

# Statistics with P1D

You can get detailed statistics on P1D data using the method getStat(axis), where axis=0 for X and axis=1 for Y. It returns a map (for JAVA) or Python dictionary (for Jython) where each statistical characteristics can be accessed using a key, such as mean, RMS, variance, error on the mean at. Assuming that P1D is represented by “p1” object, try this code:

```stat=p2.getStat(0)  # get PYTHON dictionary with statistics for X
for key in  stat:
print key , 't', stat[key]```

This will print the following values:

```error     0.996592835069
rms     5.05682000584
mean     4.42857142857
variance     6.95238095238
stddev     2.63673679998```

# Comparing two histograms

Comparison of two histograms test hypotheses that two histograms represent identical distributions. Both `jhplot.H1D` and `jhplot.H2D` histograms have the method called “compareChi2(h1,h2)” It calculates Chi2 between 2 histograms taking into account errors on the heights of the bins. The number chi2/ndf gives the estimate: values smaller or close to 1 indicates similarity between 2 histograms.

```d=compareChi2(h1,h2) # h1, h2 are H1D or H2D histograms defined above
chi2=d[0] # chi2
ndf =d[1] # number of degrees of freedom
p   =d[2] # probability (p-value)```

Two histograms are identical if chi2=0. Make sure that both histograms have error (or set them to small values).

A similar method also exists for `jhplot.P1D` data points. The comparison is done for Y-values, assuming symmetric errors on Y. However, data should be ordered in X for correct comparison.

# Linear regression analysis

Snippet from Wikipedia: Linear regression

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variable) denoted X. The case of one explanatory variable is called simple linear regression. For more than one explanatory variable, the process is called multiple linear regression.

There are several tools to perform linear regressions and estimate slope and intercept (with statistical uncertainties), as well as to estimate the prediction and confidence level bands. Let us make a short example generating data in 2D and performing linear regression fit:

Code example

``` 1: from jhplot  import *
2: from jhplot.stat  import LinReg
3: from java.awt import Color
4: from java.util import Random
5:
6: c1 = HPlot("Linear regression")
7: c1.visible()
8: c1.setGTitle("Linear regression")
9: c1.setAutoRange()
10:
11: p1= P1D("data")
12: rand = Random()
13: for i in range(200):
14:           x=rand.nextGaussian()
15:           y=rand.nextGaussian()
17: c1.draw(p1)
18:
19: r = LinReg(p1)
20: print "Intercept=",r.getIntercept(), "+/-",r.getInterceptError()
21: print "Slope=",r.getSlope(),"+/-",r.getSlopeError()
22:
23: pP=r.getPredictionBand(Color.red,0.5) # get prediction and show as a band
24: c1.draw(pP)
25:
26: c1.draw(  r.getResult()  )     # draw F1D function representing the fit result
27: c1.draw(  r.getPrediction() )  # draw the prediction interval
```

The output of this script is the fit values:

```Intercept= 0.0637647629564 +/- 0.0656878315703
Slope= 0.1024396794 +/- 0.331874315937```

# Normalised Factorial Moments

Normalized factorial moments (NFM) are used to measure deviations of a multiplicity distribution from a Poissonian case. As example, let us consider calculations of normalised factorial moments (NFM) for several distributions. They are defined as

<m 14>F_q = {<n (n-1) .. (n+1-q)>}/{ <n>^{q}}</m>.

where “n” is a random number (integer) number. According to this definition, a Poisson distribution has all moments equal to 1. A broader than a Poisson distribution have moments larger then one. Let us calculate the NFM up to 4th order for a Poisson distribution Binomial and a Negative-binomial distributions

Code example

``` 1: # Statistics | P | 1.7 | S.Chekanov | Calculations of normalised factorial moments of several distributions
2: # Definide as F_q=<n*(n-1) .. (n-q+1)/ <n>^q
3: # authors:  S.Chekanov
4:
5:
6: from jhplot  import *
7: from cern.jet.random.engine import *
8: from cern.jet.random  import *
9: from jhplot.shapes  import Line
10: from java.awt import Color
11: from jhpro.stat import *
12:
13:
14: c1 = HPlot("Canvas",600,400)
15: c1.visible(1)
16: c1.setNameX("NFM order")
17: c1.setNameY("Values")
18: c1.setRange(0,5,0,2)
19:
20: # show a line in the NDC system
21: line = Line(0.0,1, 5., 1.)
22: line.setPosCoord("USER")
23: line.setColor(Color.gray)
24: line.setTransparency(0.5)
26:
27: # build a random engine
28: engine=MersenneTwister()
29: poisson=Poisson(10,engine)
30: m=MomentsFacNorm(4) # calculates moments up to 4th order
31:
32: # use a Poissonian random numbers
33: for i in range(100):
34:             m.process( poisson.nextInt())
35: p1=m.getResults()
36: p1.setTitle("NFM for Poissson")
37: p1.setSymbol(4)
38: p1.setSymbolSize(10)
39: c1.draw(p1)
40: print(p1.toString())
41:
42:
43: ## Binomial distribution
44: binomial=Binomial(10, 0.2, engine)
45: m=MomentsFacNorm(4) # calculates moments up to 4th order
46: for i in range(200):
47:             m.process( binomial.nextInt())
48: p2=m.getResults()
49: p2.setTitle("NFM for Binomial")
50: p2.setSymbol(5)
51: p2.setSymbolSize(10)
52: c1.draw(p2)
53: print(p2.toString())
54:
55:
56: ## NegativeBinomial distribution (NBD)
57: nbinom=NegativeBinomial(10, 0.4, engine)
58: m=MomentsFacNorm(4)
59: for i in range(300):
60:             m.process( nbinom.nextInt())
61: p3=m.getResults()
62: p3.setTitle("NFM for NBD")
63: p3.setSymbol(6)
64: p3.setSymbolSize(10)
65: c1.draw(p3)
66: print(p3.toString())
```

The output file shows the NFM for all three distributions together with statistical errors.

# Correlation coefficients

Correlation coefficients between two Python lists can can be obtained using several methods.

Unregistered users have a limited access to this section. You can unlock advanced pages after becoming a full member. You can also request to edit this manual and insert comments.

# Setting limits

This section describes how to set a limit on observation of a signal in the presence of a background distribution. We will also consider a situation when the background is affected by a systematic uncertainty. We will consider how to estimate statistical significance of an observation in presence of statistical and systematic al errors (in case of observation) and also how to set the 95% CL exclusion limit (in case of no observation). We will show how to estimate 95% confidence limit with correct treatment of statistical errors.

Unregistered users have a limited access to this section. You can unlock advanced pages after becoming a full member. You can also request to edit this manual and insert comments.

Table of content