titus.producer.kmeans.KMeans

class titus.producer.kmeans.KMeans(numberOfClusters, dataset, weights=None, metric=Euclidean(AbsDiff()), minPointsInCluster=None, maxPointsForClustering=None)

Bases: object

Represents a k-means optimization by storing a dataset and performing all operations in-place.

Usually, you would construct the object, optionally call stepup, then optimize, and finally export with pfaDocument.

__init__(numberOfClusters, dataset, weights=None, metric=Euclidean(AbsDiff()), minPointsInCluster=None, maxPointsForClustering=None)

Construct a KMeans object, initializing cluster centers to unique, random points from the dataset.

Parameters:
  • numberOfClusters (positive integer) – number of clusters (the “k” in k-means)
  • dataset (2-d Numpy array) – dataset to cluster; dataset.shape[0] is the number of records (rows), dataset.shape[1] is the number of dimensions for each point (columns)
  • weights (1-d Numpy array or None) – how much to weight each point in the dataset: must have shape equal to (dataset.shape[0],); 0 means ignore the point, 1 means normal weight; None generates all ones
  • metric (titus.producer.kmeans.Metric) – metric for Numpy and PFA, such as Euclidean(AbsDiff())
  • minPointsInCluster (non-negative integer or None) – minimum number of points a cluster may have before it jumps (is replaced with a random point during optimization)
  • maxPointsForClustering (positive integer or None) – maximum number of points in an optimization (if dataset.shape[0] exceeds this amount, a random subset is chosen)
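The array shapes the constructor expects can be sketched with plain Numpy (the dataset here is randomly generated purely for illustration):

```python
import numpy as np

# Hypothetical example data: 100 records (rows), 2 dimensions (columns).
dataset = np.random.randn(100, 2)

# One weight per record; passing weights=None would generate all ones.
weights = np.ones(dataset.shape[0])

# The shapes the constructor expects:
assert dataset.ndim == 2
assert weights.shape == (dataset.shape[0],)
```

With these arrays, `KMeans(3, dataset, weights)` would initialize three cluster centers at unique, random rows of `dataset`.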
centers(sort=True)

Get the cluster centers as a Python list, sorted into canonical form by default.

Parameters:sort (bool) – if True, sort the centers for stable results
Return type:list of list of numbers
Returns:the cluster centers as Pythonized JSON
closestCluster(dataset=None, weights=None)

Identify the closest cluster to each element in the dataset.

Parameters:
  • dataset (2-d Numpy array or None) – an input dataset or the built-in dataset if None is passed
  • weights (1-d Numpy array or None) – input weights or the built-in weights if None is passed
Return type:1-d Numpy array of integers
Returns:the indexes of the closest cluster for each datum
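The computation can be sketched in plain Numpy under the default Euclidean metric (a simplified illustration, not Titus's exact implementation):

```python
import numpy as np

def closest_cluster(dataset, centers):
    """Index of the nearest center for each row, under the Euclidean metric."""
    # diffs[i, j, :] = dataset[i] - centers[j]; distances[i, j] is their norm
    diffs = dataset[:, np.newaxis, :] - centers[np.newaxis, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    return distances.argmin(axis=1)

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
data = np.array([[1.0, 1.0], [9.0, 11.0], [-2.0, 0.5]])
closest_cluster(data, centers)  # → array([0, 1, 0])
```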

iterate(dataset, weights, iterationNumber, condition)

Perform one iteration step (in-place; modifies self.clusters).

Parameters:
  • dataset (2-d Numpy array) – an input dataset
  • weights (1-d Numpy array) – input weights
  • iterationNumber (non-negative integer) – the iteration number
  • condition (callable that takes iterationNumber, corrections, values, datasetSize as arguments) – the stopping condition
Return type:bool
Returns:the result of the stopping condition
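The core of one iteration is a Lloyd step: assign each point to its nearest center, then move each center to the weighted mean of its points. A minimal Numpy sketch of that step (omitting the jumping behavior that minPointsInCluster triggers):

```python
import numpy as np

def lloyd_step(dataset, weights, centers):
    """One Lloyd iteration sketch: assign points, then recompute centers
    as weighted means. Not Titus's exact implementation."""
    diffs = dataset[:, None, :] - centers[None, :, :]
    assignments = (diffs ** 2).sum(axis=2).argmin(axis=1)
    new_centers = centers.copy()
    for j in range(len(centers)):
        mask = assignments == j
        if weights[mask].sum() > 0:
            new_centers[j] = np.average(dataset[mask], axis=0, weights=weights[mask])
    return new_centers, assignments

data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
w = np.ones(3)
centers = np.array([[0.0, 1.0], [9.0, 9.0]])
new_centers, _ = lloyd_step(data, w, centers)
# new_centers → [[0.0, 1.0], [10.0, 10.0]]
```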

newCluster()

Pick a random point from the dataset and ensure that it is different from all other cluster centers.

Return type:1-d Numpy array
Returns:a copy of a random point, guaranteed to be different from all other clusters.
optimize(condition)

Run a standard k-means (Lloyd’s algorithm) on the dataset, changing the clusters in-place.

Parameters:condition (callable that takes iterationNumber, corrections, values, datasetSize as arguments) – the stopping condition
Return type:None
Returns:nothing; modifies cluster set in-place
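A hand-written stopping condition only needs the documented four-argument signature. This sketch assumes a truthy return value means "keep iterating" and a falsy one means "stop" (check the Titus source for the exact contract); the function name and cutoff are hypothetical:

```python
def at_most_ten(iterationNumber, corrections, values, datasetSize):
    # Ignore corrections/values here and stop purely on iteration count.
    # Assumption: True continues the optimization, False stops it.
    return iterationNumber < 10

at_most_ten(3, None, None, 1000)   # → True (keep iterating)
at_most_ten(10, None, None, 1000)  # → False (stop)
```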
pfaDocument(clusterTypeName, ids, populations=False, sort=True, preprocess=None, idType='string', dataComponentType='double', centerComponentType='double')

Create a PFA document to score with this cluster set.

Parameters:
  • clusterTypeName (string) – name of the PFA record type
  • ids (list of string) – names of the clusters
  • populations (bool) – if True, include the number of training points as a “population” field
  • sort (bool) – if True, sort the centers for stable results
  • preprocess (PrettyPFA substitution or None) – pre-processing expression
  • idType (Pythonized JSON) – subtype for the id field
  • dataComponentType (Pythonized JSON) – subtype for the data array items
  • centerComponentType (Pythonized JSON) – subtype for the center array items
Return type:Pythonized JSON
Returns:a complete PFA document that performs clustering

pfaType(clusterTypeName, idType='string', centerComponentType='double', populations=False)

Create a PFA type schema representing this cluster set.

Parameters:
  • clusterTypeName (string) – name of the PFA record type
  • idType (Pythonized JSON) – subtype for the id field
  • centerComponentType (Pythonized JSON) – subtype for the center array items
  • populations (bool) – if True, include the number of training points as a “population” field
Return type:Pythonized JSON
Returns:PFA type schema for an array of clusters

pfaValue(ids, populations=False, sort=True)

Create a PFA data structure representing this cluster set.

Parameters:
  • ids (list of string) – names of the clusters
  • populations (bool) – if True, include the number of training points as a “population” field
  • sort (bool) – if True, sort the centers for stable results
Return type:Pythonized JSON
Returns:data structure that should be inserted in the init section of the cell or pool containing the clusters
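The shape of the result can be sketched as plain Python data. The field names below follow PFA cluster-model conventions; the ids and centers are hypothetical, and the exact output depends on the trained cluster set:

```python
# Hypothetical cluster set for ids=["one", "two"]:
cluster_set = [
    {"id": "one", "center": [0.0, 1.0]},
    {"id": "two", "center": [10.0, 10.0]},
]
# With populations=True, each record would also carry a "population" count,
# e.g. {"id": "one", "center": [0.0, 1.0], "population": 2}.
```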

randomPoint()

Pick a random point from the dataset.

Return type:1-d Numpy array
Returns:a copy of a random point
randomSubset(subsetSize)

Return a (dataset, weights) that are randomly chosen to have subsetSize records.

Parameters:subsetSize (positive integer) – size of the sample
Return type:(2-d Numpy array, 1-d Numpy array)
Returns:(dataset, weights) sampled without replacement (if the original dataset is unique, the new one will be, too)
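The sampling can be sketched with Numpy: draw row indexes without replacement and use the same indexes for both arrays so dataset rows and weights stay aligned (an illustration, not Titus's exact implementation):

```python
import numpy as np

def random_subset(dataset, weights, subset_size):
    """Sample subset_size records without replacement, keeping rows
    and weights aligned via shared indexes."""
    indexes = np.random.choice(dataset.shape[0], size=subset_size, replace=False)
    return dataset[indexes], weights[indexes]

data = np.arange(20).reshape(10, 2).astype(float)
w = np.ones(10)
sub_data, sub_w = random_subset(data, w, 4)
assert sub_data.shape == (4, 2) and sub_w.shape == (4,)
```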
stepup(condition, base=2)

Optimize the cluster set in successively larger subsets of the dataset. (This can be viewed as a cluster seeding technique.)

If randomly seeded, optimizing the whole dataset can be slow to converge: a long time per iteration times many iterations.

Optimizing a random subset takes as many iterations, but the time per iteration is short. However, the final cluster centers are only approximate.

Optimizing the whole dataset with approximate cluster starting points takes a long time per iteration but fewer iterations.

This procedure runs the k-means optimization technique on random subsets with exponentially increasing sizes from the smallest base**x that is larger than minPointsInCluster (or numberOfClusters) to the largest base**x that is a subset of the whole dataset.

Parameters:
  • condition (callable that takes iterationNumber, corrections, values, datasetSize as arguments) – the stopping condition
  • base (integer greater than 1) – the factor by which the subset size is increased after each convergence
Return type:None
Returns:nothing; modifies cluster set in-place
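The schedule of subset sizes described above can be sketched directly: powers of base from the smallest one exceeding the minimum up to the largest one that fits in the dataset (the function name is hypothetical):

```python
def stepup_sizes(minimum, dataset_size, base=2):
    """Subset sizes used by stepup: base**x for the smallest x with
    base**x > minimum, up to the largest base**x <= dataset_size."""
    x = 0
    while base ** x <= minimum:
        x += 1
    sizes = []
    while base ** x <= dataset_size:
        sizes.append(base ** x)
        x += 1
    return sizes

stepup_sizes(3, 100)  # → [4, 8, 16, 32, 64]
```

Each size in the list is optimized to convergence before moving on to the next, so the expensive full-dataset pass starts from already-approximate centers.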