titus.producer.kmeans.KMeans

class titus.producer.kmeans.KMeans(numberOfClusters, dataset, weights=None, metric=Euclidean(AbsDiff()), minPointsInCluster=None, maxPointsForClustering=None)

Bases: object

Represents a k-means optimization by storing a dataset and performing all operations in-place.

Usually, you would construct the object, optionally call stepup, then optimize, and finally export with pfaDocument.

__init__(numberOfClusters, dataset, weights=None, metric=Euclidean(AbsDiff()), minPointsInCluster=None, maxPointsForClustering=None)

Construct a KMeans object, initializing cluster centers to unique, random points from the dataset.

Parameters:
  • numberOfClusters (positive integer) – number of clusters (the “k” in k-means)
  • dataset (2-d Numpy array) – dataset to cluster; dataset.shape[0] is the number of records (rows), dataset.shape[1] is the number of dimensions for each point (columns)
  • weights (1-d Numpy array or None) – how much to weight each point in the dataset: must have shape equal to (dataset.shape[0],); 0 means ignore the point, 1 means normal weight; None generates all ones
  • metric (titus.producer.kmeans.Metric) – metric for Numpy and PFA, such as Euclidean(AbsDiff())
  • minPointsInCluster (non-negative integer or None) – minimum number of points a cluster may have before it jumps (is replaced with a random point during optimization)
  • maxPointsForClustering (positive integer or None) – maximum number of points in an optimization (if dataset.shape[0] exceeds this amount, a random subset is chosen)
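The array shapes the constructor expects can be sketched with plain Numpy (the dataset here is randomly generated purely for illustration):

```python
import numpy as np

# Hypothetical example data: 100 records (rows), 2 dimensions (columns).
dataset = np.random.randn(100, 2)

# One weight per record; passing weights=None would generate all ones.
weights = np.ones(dataset.shape[0])

# The shapes the constructor expects:
assert dataset.ndim == 2
assert weights.shape == (dataset.shape[0],)
```

With these arrays, `KMeans(3, dataset, weights)` would initialize three cluster centers at unique, random rows of `dataset`.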
centers(sort=True)

Get the cluster centers as a Python list, sorted into canonical form by default.

Parameters:sort (bool) – if True, sort the centers for stable results
Return type:list of list of numbers
Returns:the cluster centers as Pythonized JSON
closestCluster(dataset=None, weights=None)

Identify the closest cluster to each element in the dataset.

Parameters:
  • dataset (2-d Numpy array or None) – an input dataset or the built-in dataset if None is passed
  • weights (1-d Numpy array or None) – input weights or the built-in weights if None is passed
Return type:1-d Numpy array of integers
Returns:the indexes of the closest cluster for each datum
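The computation can be sketched in plain Numpy under the default Euclidean metric (a simplified illustration, not Titus's exact implementation):

```python
import numpy as np

def closest_cluster(dataset, centers):
    """Index of the nearest center for each row, under the Euclidean metric."""
    # diffs[i, j, :] = dataset[i] - centers[j]; distances[i, j] is their norm
    diffs = dataset[:, np.newaxis, :] - centers[np.newaxis, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    return distances.argmin(axis=1)

centers = np.array([[0.0, 0.0], [10.0, 10.0]])
data = np.array([[1.0, 1.0], [9.0, 11.0], [-2.0, 0.5]])
closest_cluster(data, centers)  # → array([0, 1, 0])
```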

iterate(dataset, weights, iterationNumber, condition)

Perform one iteration step (in-place; modifies self.clusters).

Parameters:
  • dataset (2-d Numpy array) – an input dataset
  • weights (1-d Numpy array) – input weights
  • iterationNumber (non-negative integer) – the iteration number
  • condition (callable that takes iterationNumber, corrections, values, datasetSize as arguments) – the stopping condition
Return type:bool
Returns:the result of the stopping condition
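The core of one iteration is a Lloyd step: assign each point to its nearest center, then move each center to the weighted mean of its points. A minimal Numpy sketch of that step (omitting the jumping behavior that minPointsInCluster triggers):

```python
import numpy as np

def lloyd_step(dataset, weights, centers):
    """One Lloyd iteration sketch: assign points, then recompute centers
    as weighted means. Not Titus's exact implementation."""
    diffs = dataset[:, None, :] - centers[None, :, :]
    assignments = (diffs ** 2).sum(axis=2).argmin(axis=1)
    new_centers = centers.copy()
    for j in range(len(centers)):
        mask = assignments == j
        if weights[mask].sum() > 0:
            new_centers[j] = np.average(dataset[mask], axis=0, weights=weights[mask])
    return new_centers, assignments

data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]])
w = np.ones(3)
centers = np.array([[0.0, 1.0], [9.0, 9.0]])
new_centers, _ = lloyd_step(data, w, centers)
# new_centers → [[0.0, 1.0], [10.0, 10.0]]
```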

newCluster()

Pick a random point from the dataset and ensure that it is different from all other cluster centers.

Return type:1-d Numpy array
Returns:a copy of a random point, guaranteed to be different from all other clusters.
optimize(condition)

Run a standard k-means (Lloyd’s algorithm) on the dataset, changing the clusters in-place.

Parameters:condition (callable that takes iterationNumber, corrections, values, datasetSize as arguments) – the stopping condition
Return type:None
Returns:nothing; modifies cluster set in-place
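A hand-written stopping condition only needs the documented four-argument signature. This sketch assumes a truthy return value means "keep iterating" and a falsy one means "stop" (check the Titus source for the exact contract); the function name and cutoff are hypothetical:

```python
def at_most_ten(iterationNumber, corrections, values, datasetSize):
    # Ignore corrections/values here and stop purely on iteration count.
    # Assumption: True continues the optimization, False stops it.
    return iterationNumber < 10

at_most_ten(3, None, None, 1000)   # → True (keep iterating)
at_most_ten(10, None, None, 1000)  # → False (stop)
```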
pfaDocument(clusterTypeName, ids, populations=False, sort=True, preprocess=None, idType='string', dataComponentType='double', centerComponentType='double')

Create a PFA document to score with this cluster set.

Parameters:
  • clusterTypeName (string) – name of the PFA record type
  • ids (list of string) – names of the clusters
  • populations (bool) – if True, include the number of training points as a “population” field
  • sort (bool) – if True, sort the centers for stable results
  • preprocess (PrettyPFA substitution or None) – pre-processing expression
  • idType (Pythonized JSON) – subtype for the id field
  • dataComponentType (Pythonized JSON) – subtype for the data array items
  • centerComponentType (Pythonized JSON) – subtype for the center array items
Return type:Pythonized JSON
Returns:a complete PFA document that performs clustering

pfaType(clusterTypeName, idType='string', centerComponentType='double', populations=False)

Create a PFA type schema representing this cluster set.

Parameters:
  • clusterTypeName (string) – name of the PFA record type
  • idType (Pythonized JSON) – subtype for the id field
  • centerComponentType (Pythonized JSON) – subtype for the center array items
  • populations (bool) – if True, include the number of training points as a “population” field
Return type:Pythonized JSON
Returns:PFA type schema for an array of clusters

pfaValue(ids, populations=False, sort=True)

Create a PFA data structure representing this cluster set.

Parameters:
  • ids (list of string) – names of the clusters
  • populations (bool) – if True, include the number of training points as a “population” field
  • sort (bool) – if True, sort the centers for stable results
Return type:Pythonized JSON
Returns:data structure that should be inserted in the init section of the cell or pool containing the clusters
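The shape of the result can be sketched as plain Python data. The field names below follow PFA cluster-model conventions; the ids and centers are hypothetical, and the exact output depends on the trained cluster set:

```python
# Hypothetical cluster set for ids=["one", "two"]:
cluster_set = [
    {"id": "one", "center": [0.0, 1.0]},
    {"id": "two", "center": [10.0, 10.0]},
]
# With populations=True, each record would also carry a "population" count,
# e.g. {"id": "one", "center": [0.0, 1.0], "population": 2}.
```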

randomPoint()

Pick a random point from the dataset.

Return type:1-d Numpy array
Returns:a copy of a random point
randomSubset(subsetSize)

Return a (dataset, weights) that are randomly chosen to have subsetSize records.

Parameters:subsetSize (positive integer) – size of the sample
Return type:(2-d Numpy array, 1-d Numpy array)
Returns:(dataset, weights) sampled without replacement (if the original dataset is unique, the new one will be, too)
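The sampling can be sketched with Numpy: draw row indexes without replacement and use the same indexes for both arrays so dataset rows and weights stay aligned (an illustration, not Titus's exact implementation):

```python
import numpy as np

def random_subset(dataset, weights, subset_size):
    """Sample subset_size records without replacement, keeping rows
    and weights aligned via shared indexes."""
    indexes = np.random.choice(dataset.shape[0], size=subset_size, replace=False)
    return dataset[indexes], weights[indexes]

data = np.arange(20).reshape(10, 2).astype(float)
w = np.ones(10)
sub_data, sub_w = random_subset(data, w, 4)
assert sub_data.shape == (4, 2) and sub_w.shape == (4,)
```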
stepup(condition, base=2)

Optimize the cluster set in successively larger subsets of the dataset. (This can be viewed as a cluster seeding technique.)

If randomly seeded, optimizing the whole dataset can be slow to converge: a long time per iteration times many iterations.

Optimizing a random subset takes as many iterations, but the time per iteration is short. However, the final cluster centers are only approximate.

Optimizing the whole dataset with approximate cluster starting points takes a long time per iteration but fewer iterations.

This procedure runs the k-means optimization technique on random subsets with exponentially increasing sizes from the smallest base**x that is larger than minPointsInCluster (or numberOfClusters) to the largest base**x that is a subset of the whole dataset.

Parameters:
  • condition (callable that takes iterationNumber, corrections, values, datasetSize as arguments) – the stopping condition
  • base (integer greater than 1) – the factor by which the subset size is increased after each convergence
Return type:None
Returns:nothing; modifies cluster set in-place
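The schedule of subset sizes described above can be sketched directly: powers of base from the smallest one exceeding the minimum up to the largest one that fits in the dataset (the function name is hypothetical):

```python
def stepup_sizes(minimum, dataset_size, base=2):
    """Subset sizes used by stepup: base**x for the smallest x with
    base**x > minimum, up to the largest base**x <= dataset_size."""
    x = 0
    while base ** x <= minimum:
        x += 1
    sizes = []
    while base ** x <= dataset_size:
        sizes.append(base ** x)
        x += 1
    return sizes

stepup_sizes(3, 100)  # → [4, 8, 16, 32, 64]
```

Each size in the list is optimized to convergence before moving on to the next, so the expensive full-dataset pass starts from already-approximate centers.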