DMelt:AI/Data Clustering

From HandWiki


Data clustering

Cluster analysis is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. DataMelt contains a framework for cluster analysis, i.e. for unsupervised learning in which the classification process does not depend on a priori information. It includes the following algorithms:

  • K-means clustering analysis (single and multi pass)
  • C-means (fuzzy) algorithm
  • Agglomerative hierarchical clustering

All algorithms can be run either in a fixed-cluster mode or in a best-estimate mode, in which the number of clusters is not given a priori but is found from an estimate of the cluster compactness. The data points can be defined in a multidimensional space.
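As a rough illustration of the fixed-cluster mode described above, the sketch below implements plain k-means in pure Python and reports a compactness value. It is not jMinHEP's implementation: the names `kmeans` and `compactness`, the random seeding of the centers, and the definition of compactness as the mean distance of each point to its assigned center are all assumptions made for this example.

```python
import math
import random

def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means with a fixed k (illustrative, not jMinHEP's code)."""
    rng = random.Random(seed)
    # seed the centers with k distinct data points (an assumption of this sketch)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # assignment step: each point joins its nearest center
        new_labels = [min(range(k), key=lambda c: sqdist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # update step: move each center to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / float(len(members))
                              for col in zip(*members)]
    return centers, labels

def compactness(points, centers, labels):
    """Assumed definition: mean distance of each point to its own center."""
    total = sum(math.sqrt(sqdist(p, centers[l]))
                for p, l in zip(points, labels))
    return total / len(points)

# two well-separated groups of 2D points
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers, labels = kmeans(points, 2)
print(labels)
print(compactness(points, centers, labels))
```

A best-estimate mode could repeat such a run for several values of k and compare the resulting compactness values. How jMinHEP makes that choice is not covered by this sketch.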

Data clustering is based on the jMinHEP package. You can run it in a completely stand-alone mode, without DataMelt. DataMelt integrates this Java program and enables Java scripting.

Using GUI

The easiest approach is to use the GUI editor to perform clustering. In the example below, we fill a data holder with Gaussian-distributed points in 3D and then pass it to a GUI for clustering analysis:

from java.util import Random 
from jminhep.cluster    import *
from jhplot import *

# create data for analysis 
data = DataHolder("Example")
# fill 3D data with Gaussian random numbers
rand = Random()
for i in range(100):
      a =[]
      a.append( 10*rand.nextGaussian() )
      a.append( 2*rand.nextGaussian()+1 )
      a.append( 10*rand.nextGaussian()+3 )
      data.add( DataPoint(a) )

# start jMinHEP GUI
c1=HCluster(data)

This brings up a GUI editor which will run a selected algorithm:

DMelt example: Perform a cluster analysis using jMinHEP GUI

Data clustering in JSAT

The package jsat.clustering provides further Java classes for data clustering. The example below loads the IRIS data set, runs several k-means algorithms and prints their scores. It then shows the clusters on a canvas:

from java.io import File
from jsat import ARFFLoader,DataSet
from jsat.classifiers import ClassificationDataSet
from jsat.clustering import Clusterer
from jsat.clustering.kmeans import KMeans,GMeans,HamerlyKMeans,KMeansPDN,XMeans
from jsat.clustering.evaluation import NormalizedMutualInformation
from java.util.stream import IntStream

print "Download iris_org.arff"
from jhplot import *
print Web.get("https://datamelt.org/examples/data/iris_org.arff")
fi=File("iris_org.arff")
dataSet = ARFFLoader.loadArffFile(fi)
# We specify '0' as the class we would like to make the target class. 
data = ClassificationDataSet(dataSet, 0)

"""
We will use the NMI as our evaluation criteria. It compares the
clustering results with the class labels. The class labels aren't
necessarily the best ground truth for clusters. In fact, how to
properly evaluate clustering algorithms is a very open question! But
this is a commonly used method.
         
The ClusterEvaluation interface dictates that values near 0 are
better, and larger values are worse. NMI is usually the opposite, but
obeys the interface. Read the NMI's Javadoc for more details.
"""

evaluator = NormalizedMutualInformation();

"""
We will use a normal k-means algorithm to do clustering
when we specify the number of clusters we want. JSAT implements a
number of different algorithms that all solve the k-means problem,
and are better in different scenarios. This one is likely to be the
best for most users.
"""

simpleKMeans = HamerlyKMeans()

from jarray import zeros
clusteringResults = zeros(data.getSampleSize(), "i")

# try different algorithms now..
def evaluate(methodsToEval):
  methodsToEval.cluster(data, clusteringResults)
  kFound = IntStream.of(clusteringResults).max().getAsInt()+1
  print methodsToEval.toString(), " found=",kFound, " -> ",evaluator.evaluate(clusteringResults, data)

evaluate(KMeansPDN())
evaluate(XMeans())
evaluate(GMeans())

print "Run k-means with a specific value of k, and keep track of cluster assignments"
print "Now evaluate the cluster assignments and print a score.."
print simpleKMeans.toString()
for k in range(2,7):
  clusteringResults = simpleKMeans.cluster(data, k, clusteringResults);
  print "k=",k," score=",evaluator.evaluate(clusteringResults, data)

print "Running for 3 clusters (optimal):"
clusteringResults = simpleKMeans.cluster(data, 3, clusteringResults)
clusters=clusteringResults.tolist()
print "Cluster assignments=", clusters 

c1 = SPlot()
c1.visible()
c1.setAutoRange()
c1.setMarksStyle('various')
c1.setNameX('X')
c1.setNameY('Y')
for i in range(data.getSampleSize()):
  dataPoint = data.getDataPoint(i)
  category = data.getDataPointCategory(i)  # get category
  vec = dataPoint.getNumericalValues()
  c1.addPoint(clusters[i], vec.get(0), vec.get(1), 1)
  if i % 10 == 0:
    c1.update()
    print i, dataPoint, category, " Nr cluster=", clusters[i]

The canvas with the output (shown in 2D, for index 0 and 1) is:

DMelt example: Clustering IRIS data with k-means using JSAT

Note that the IRIS data are multidimensional and cannot be easily visualized.
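The NMI score used as the evaluation criterion in the script above can be sketched in plain Python as follows. This is only an illustration of the idea: the function names are made up for this example, and both the normalization by the arithmetic mean of the two entropies and the final `1 - NMI` flip (so that values near 0 are better, as the script's comments describe) are assumptions. JSAT's Javadoc should be consulted for the exact definition it uses.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (natural log) of a list of labels."""
    n = float(len(labels))
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information between two labelings of the same points."""
    n = float(len(a))
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum((nij / n) * log(nij * n / (count_a[i] * count_b[j]))
               for (i, j), nij in joint.items())

def nmi(a, b):
    """NMI normalized by the mean of the two entropies (assumed convention)."""
    ha, hb = entropy(a), entropy(b)
    if ha + hb == 0:
        return 1.0  # both labelings put everything in a single group
    return 2.0 * mutual_information(a, b) / (ha + hb)

def nmi_score(clusters, classes):
    """Flipped score: 0 means perfect agreement, larger is worse."""
    return 1.0 - nmi(clusters, classes)

# same grouping with swapped cluster ids: perfect agreement
print(nmi_score([0, 0, 1, 1], [1, 1, 0, 0]))
# grouping unrelated to the classes: no shared information
print(nmi_score([0, 1, 0, 1], [0, 0, 1, 1]))
```

The score depends only on how the points are grouped, not on the numeric ids of the clusters, which is why it can compare cluster assignments with class labels.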

Using Jython scripts

Alternatively, one can run any clustering algorithm in batch mode without the GUI. You can use Java or any scripting language.

The code below creates a data sample in 3D and then runs several clustering algorithms in one go. You can optionally print the positions of the clusters and the membership of the data points. The following algorithms will be run:

  • K-means algorithm fixed cluster mode with single seed event
  • K-means algorithm for multiple iterations
  • K-means clustering using exchange method for best estimate
  • K-means clustering using exchange method
  • Hierarchical clustering algorithm
  • Hierarchical clustering algorithm, best estimate

The following modes are available:

  • 111 - standard k-means with single seed
  • 112 - kmeans algorithm for multiple iterations
  • 113 - k-means in exchange mode
  • 114 - k-means multiple pass
  • 121 - Hierarchical -- standard mode
  • 122 - Hierarchical clustering algorithm, best estimate
  • 131 - fuzzy
  • 132 - fuzzy best estimate

from java.awt import Color
from java.util import Random 
from jminhep.cluster import * 

# create a data holder
data = DataHolder("Example")

# fill 3D data with Gaussian random numbers
rand = Random()
for i in range(100):
      a =[]
      a.append( 10*rand.nextGaussian() )
      a.append( 2*rand.nextGaussian()+1 )
      a.append( 10*rand.nextGaussian()+3 )
      data.add( DataPoint(a) )
      del a

# show the data
# HTable(data)

# Print data
# data.print()

# initialize the partitioner
pat = Partition(data);

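# set clustering parameters; in the jMinHEP examples these are the number of
# clusters, precision, fuzziness (used only by fuzzy modes) and number of iterations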
pat.set(3, 0.001, 1.7, 1000);

# probability for membership (only for Fuzzy algorithm)
pat.setProbab(0.68)


# define types of cluster analysis
mode =[]
mode.append(111)
mode.append(112)
mode.append(113)
mode.append(114)
mode.append(121)
mode.append(122)
# mode.append(131)
# mode.append(132)

 

for i in range(len(mode)):
        print "test=",i
        pat.run(mode[i])
        print "algorithm: " +  pat.getName()
        print "Compactness: " + str(pat.getCompactness())
        print "No of final clusters: " + str(pat.getNclusters())
        Centers = pat.getCenters()
#        Centers.Print()

The output of the above script is shown below:

test= 0
algorithm: kmeans algorithm fixed cluster mode with single seed event
Compactness: 1.98271418832
No of final clusters: 3
test= 1
algorithm: kmeans algorithm for multiple iterations
Compactness: 1.31227526642
No of final clusters: 3
test= 2
algorithm: K-means clustering using exchange method for best estimate
Compactness: 1.35529140568
No of final clusters: 5
test= 3
algorithm: K-means clustering using exchange method
Compactness: 1.35529140568
No of final clusters: 5
test= 4
algorithm: Hierarchical clustering algorithm
Compactness: 1.41987639705
No of final clusters: 5
test= 5
algorithm: Hierarchical clustering algorithm, best estimate
Compactness: 1.20134128248
No of final clusters: 6
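To compare runs side by side rather than reading them from the printed log, the results can be collected into a list. The helper below is a minimal sketch that uses only the calls appearing in the script above (`run`, `getName`, `getCompactness`, `getNclusters`). The name `run_modes` and the `StubPartition` class are hypothetical. The stub only replays two entries copied from the output above so that the example runs without DataMelt, and in a real script you would pass the `pat` object instead.

```python
def run_modes(pat, modes):
    """Run each mode code on a Partition-like object and summarize the runs."""
    summary = []
    for mode in modes:
        pat.run(mode)
        summary.append((mode, pat.getName(),
                        pat.getCompactness(), pat.getNclusters()))
    return summary

class StubPartition(object):
    """Hypothetical stand-in for the jMinHEP Partition object.

    It replays two results copied from the output listed above.
    """
    _results = {
        111: ("kmeans algorithm fixed cluster mode with single seed event",
              1.98271418832, 3),
        122: ("Hierarchical clustering algorithm, best estimate",
              1.20134128248, 6),
    }

    def run(self, mode):
        self._current = self._results[mode]

    def getName(self):
        return self._current[0]

    def getCompactness(self):
        return self._current[1]

    def getNclusters(self):
        return self._current[2]

summary = run_modes(StubPartition(), [111, 122])
for mode, name, comp, ncl in summary:
    print("%d  %-60s  %.4f  %d" % (mode, name, comp, ncl))
```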

You can print the centers of the clusters as:

Centers = pat.getCenters()
for i in range(Centers.getSize()):
    g = Centers.getRow(i)
    n = g.getDimension()
    print i, " ", g.getAttribute(0), g.getAttribute(1), g.getAttribute(2)

Here "Centers" is a "DataHolder" container.
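The loop above hard-codes three coordinates. A dimension-independent variant might look like the sketch below, which relies only on the `getSize`, `getRow`, `getDimension` and `getAttribute` calls shown above. The function name `centers_to_lists` is made up for this example, and the two stub classes with placeholder coordinates are hypothetical stand-ins so that the snippet runs without DataMelt. In a real script you would pass the object returned by `pat.getCenters()`.

```python
def centers_to_lists(centers):
    """Copy a DataHolder-like container of centers into plain Python lists."""
    rows = []
    for i in range(centers.getSize()):
        point = centers.getRow(i)
        rows.append([point.getAttribute(j)
                     for j in range(point.getDimension())])
    return rows

class StubPoint(object):
    """Hypothetical stand-in for a jMinHEP data point."""
    def __init__(self, values):
        self._values = list(values)

    def getDimension(self):
        return len(self._values)

    def getAttribute(self, j):
        return self._values[j]

class StubHolder(object):
    """Hypothetical stand-in for a DataHolder of cluster centers."""
    def __init__(self, points):
        self._points = [StubPoint(p) for p in points]

    def getSize(self):
        return len(self._points)

    def getRow(self, i):
        return self._points[i]

# placeholder coordinates, not results of a real clustering run
demo = StubHolder([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
rows = centers_to_lists(demo)
for i, row in enumerate(rows):
    print("%d  %s" % (i, row))
```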