DMelt:Numeric/7 PCA Analysis

From HandWiki

PCA analysis

Principal Component Analysis (PCA) is an important technique for many applications. See Principal_component_analysis for background.

Below we show examples of Principal Component Analysis (PCA) data transformation using matrices as input. We consider a situation in which some of the columns in the data matrix are linearly dependent, or in which there are more columns than rows, i.e. more dimensions than samples in the data set. In the example below we build the PCA transformation from training data and then apply it to test data.

from jhplot.math.pca import *
from Jama import Matrix

# build the PCA transformation from a training matrix (rows are samples)
trainingData = Matrix([[1, 2, 3, 4, 5, 6],[6, 5, 4, 3, 2, 1],[2, 2, 2, 2, 2, 2]])
pca = PCA(trainingData)

# new data to be transformed with the trained PCA
testData = Matrix([[1, 2, 3, 4, 5, 6],[1, 2, 1, 2, 1, 2]])

# apply the transformation with whitening (unit variance per component)
transformedData = pca.transform(testData, PCA.TransformationType.WHITENING)
print transformedData.toString()

More details on the corresponding Java code are available in pca_transform.

The output of this code is shown below:

-0.9999999999999998, -0.5773502691896268
-0.08571428571428596, 1.732050807568878
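For readers without DMelt at hand, the same whitening transformation can be sketched in plain NumPy. This is a hypothetical re-implementation for illustration, not the jhplot.math.pca API, and the signs of the resulting components may differ from the output above:

```python
import numpy as np

# Hypothetical NumPy sketch of PCA with whitening (not the DMelt API).
# Rows are samples, columns are features, matching the matrices above.
training = np.array([[1, 2, 3, 4, 5, 6],
                     [6, 5, 4, 3, 2, 1],
                     [2, 2, 2, 2, 2, 2]], dtype=float)

# Center on the training mean, then take the SVD of the centered data.
mean = training.mean(axis=0)
centered = training - mean
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Keep only components with non-negligible variance: the training rows
# are linearly dependent, so some singular values are effectively zero.
keep = s > 1e-9
Vt, s = Vt[keep], s[keep]
n = training.shape[0]
std = s / np.sqrt(n - 1)  # per-component standard deviation

def whiten(X):
    """Project onto the principal axes and scale each axis to unit variance."""
    return (X - mean) @ Vt.T / std

test = np.array([[1, 2, 3, 4, 5, 6],
                 [1, 2, 1, 2, 1, 2]], dtype=float)
print(whiten(test))
```

Only two components survive here, because the three training rows sum to a constant after centering, so each 6-dimensional test row is mapped to a 2-dimensional whitened point.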

Principal_component_analysis is widely used for Dimensionality_reduction, i.e. transforming high-dimensional data into a lower-dimensional representation that conveys similar information.

It was described in [DMelt:Statistics/6_Dimensionality_reduction]. Let us consider the IRIS dataset [1]. The IRIS data set has 4 numerical attributes, so it is difficult for humans to visualize. One can, however, reduce the dimensionality of this dataset down to two. We will use Principal component analysis (PCA), which converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. PCA needs the data samples to have a mean of zero, so we apply a zero-mean transform first to ensure this property.
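The two preprocessing steps just described (a zero-mean transform followed by projection onto the two leading principal components) can be sketched in plain NumPy. This is a hypothetical illustration on a few IRIS-like rows, not the JSAT API used below:

```python
import numpy as np

# A few 4-attribute rows standing in for IRIS samples (hypothetical subset).
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [6.3, 3.3, 6.0, 2.5],
              [5.8, 2.7, 5.1, 1.9],
              [7.0, 3.2, 4.7, 1.4],
              [6.4, 3.2, 4.5, 1.5]])

# Step 1: zero-mean transform (the role ZeroMeanTransform plays below).
Xc = X - X.mean(axis=0)

# Step 2: PCA via SVD, keeping the 2 leading components
# (the role PCA(cData, 2, 1e-9) plays below).
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
reduced = Xc @ Vt[:2].T   # each 4-D sample becomes a 2-D point

print(reduced.shape)
```

The resulting 2-dimensional points can then be plotted directly, which is what the JSAT-based example below does for the full IRIS data.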

Here is the code that uses the Java class jsat.datatransform.PCA to perform this transformation:

from java.io import File
from jsat.classifiers import DataPoint,ClassificationDataSet
from jsat.datatransform import PCA,DataTransform,ZeroMeanTransform
from jsat import ARFFLoader,DataSet

print "Download iris_org.arff"
from jhplot import *
print Web.get("https://datamelt.org/examples/data/iris_org.arff")
fi=File("iris_org.arff")
dataSet = ARFFLoader.loadArffFile(fi)
# We specify '0' as the class we would like to make the target class. 
cData = ClassificationDataSet(dataSet, 0)

# The IRIS data set has 4 numerical attributes, unfortunately humans are not good at visualizing 4 dimensional things.
# Instead, we can reduce the dimensionality down to two. 
# PCA needs the data samples to have a mean of ZERO, so we need a transform to ensure this property as well
zeroMean = ZeroMeanTransform(cData);
cData.applyTransform(zeroMean);

# PCA is a transform that attempts to reduce the dimensionality while maintaining all the variance in the data. 
# PCA also allows us to specify the exact number of dimensions we would like 
pca = PCA(cData, 2, 1e-9);
        
# We can now apply the transformations to our data set
cData.applyTransform(pca);

c1 = SPlot()
c1.visible()
c1.setAutoRange()
c1.setMarksStyle('various')
#c1.setConnected(1, 0)
c1.setNameX('X')
c1.setNameY('Y')
# output
for i in range(cData.getSampleSize()):
  dataPoint = cData.getDataPoint(i)  
  category = cData.getDataPointCategory(i) # get category 
  vec=dataPoint.getNumericalValues();
  c1.addPoint(category,vec.get(0),vec.get(1),1)
  if (i%10==0):
      c1.update()
      print i,dataPoint,category

The output image is shown here:

DMelt example: Dimensionality reduction of IRIS data using PCA and JSAT

  1. Fisher, R.A., "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).