Bases: object
Represents a k-means optimization by storing a dataset and performing all operations in-place.
Typically, you construct the object, optionally call stepup to seed the clusters, then call optimize and export the result with pfaDocument.
Construct a KMeans object, initializing cluster centers to unique, random points from the dataset.
Parameters: | |
---|---|
Get the cluster centers as a Python list, sorted into canonical form when sort is True.
Parameters: | sort (bool) – if True, sort the centers for stable results |
---|---|
Return type: | list of list of numbers |
Returns: | the cluster centers as Pythonized JSON |
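As an illustration of why a canonical form is useful, the sketch below converts a NumPy array of centers into plain lists and optionally sorts them. The function name `sorted_centers` is hypothetical, and lexicographic ordering is an assumption; the source only says the centers are sorted for stable results.

```python
import numpy as np

def sorted_centers(centers, sort=True):
    """Return centers as a plain list of lists of floats, optionally sorted.

    Hypothetical helper: lexicographic order is an assumption, not the
    documented ordering.
    """
    as_lists = [[float(x) for x in center] for center in np.asarray(centers)]
    return sorted(as_lists) if sort else as_lists

centers = np.array([[3.0, 1.0], [1.0, 2.0], [1.0, 0.5]])
canonical = sorted_centers(centers)
unsorted = sorted_centers(centers, sort=False)
```

Sorting means two runs that find the same centers in a different order produce identical output, which makes results easier to compare.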
Identify the closest cluster to each element in the dataset.
Parameters: | |
---|---|
Return type: | 1-d Numpy array of integers |
Returns: | the indexes of the closest cluster for each datum |
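The assignment step can be sketched with NumPy broadcasting. This is a minimal illustration, assuming a Euclidean metric and a hypothetical function name `closest_cluster`; the actual method may accept a different metric or different arguments.

```python
import numpy as np

def closest_cluster(dataset, centers):
    """Return, for each row of dataset, the index of the nearest center.

    Assumes squared Euclidean distance (an assumption for this sketch).
    """
    # diffs[i, j, :] is the vector from center j to datum i
    diffs = dataset[:, np.newaxis, :] - centers[np.newaxis, :, :]
    distances = np.sum(diffs ** 2, axis=2)
    return np.argmin(distances, axis=1)

dataset = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [4.8, 5.1]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
indexes = closest_cluster(dataset, centers)
```

The result is a 1-d integer array with one entry per datum, matching the return type described above.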
Perform one iteration step (in-place; modifies self.clusters).
Parameters: | |
---|---|
Return type: | bool |
Returns: | the result of the stopping condition |
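A single in-place iteration can be sketched as follows. This is an unweighted simplification with a hypothetical name `iterate_once`; it returns whether any center moved, which is only one possible stopping signal and not necessarily what the real stopping condition computes.

```python
import numpy as np

def iterate_once(dataset, centers):
    """One Lloyd step: assign points, then move each center to its members' mean.

    Modifies centers in place and returns True if any center moved.
    """
    diffs = dataset[:, np.newaxis, :] - centers[np.newaxis, :, :]
    assignment = np.argmin(np.sum(diffs ** 2, axis=2), axis=1)
    moved = False
    for k in range(centers.shape[0]):
        members = dataset[assignment == k]
        if len(members) == 0:
            continue  # leave an empty cluster where it is
        new_center = members.mean(axis=0)
        if not np.allclose(new_center, centers[k]):
            moved = True
        centers[k] = new_center
    return moved

dataset = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[1.0, 1.0], [9.0, 1.0]])
first = iterate_once(dataset, centers)   # centers shift toward the group means
second = iterate_once(dataset, centers)  # already converged, nothing moves
```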
Pick a random point from the dataset and ensure that it is different from all other cluster centers.
Return type: | 1-d Numpy array |
---|---|
Returns: | a copy of a random point, guaranteed to be different from all other clusters. |
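One way to guarantee a distinct point is rejection sampling: draw random points until one differs from every existing center. The sketch below assumes that approach; the name `new_center` and the `max_tries` guard are hypothetical.

```python
import numpy as np

def new_center(dataset, centers, rng, max_tries=1000):
    """Return a copy of a random datum that differs from every existing center."""
    for _ in range(max_tries):
        candidate = dataset[rng.integers(len(dataset))].copy()
        if not any(np.array_equal(candidate, c) for c in centers):
            return candidate
    raise RuntimeError("could not find a point distinct from all centers")

rng = np.random.default_rng(0)
dataset = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
centers = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
point = new_center(dataset, centers, rng)
```

Returning a copy matters because the clusters are modified in place during optimization; a view into the dataset would let an update to a center silently alter the data.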
Run a standard k-means (Lloyd’s algorithm) on the dataset, changing the clusters in-place.
Parameters: | condition (callable that takes iterationNumber, corrections, values, datasetSize as arguments) – the stopping condition |
---|---|
Return type: | None |
Returns: | nothing; modifies cluster set in-place |
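To illustrate how a stopping condition with the four arguments named above might drive the loop, here is a self-contained sketch. Several things are assumptions: that the condition returns True to keep iterating, that `corrections` holds the per-center shifts, and that `values` holds the updated centers. The helper names are hypothetical.

```python
import numpy as np

def max_iterations_or_converged(limit, tolerance=1e-9):
    """Build a condition that stops after `limit` steps or when centers stop moving."""
    def condition(iterationNumber, corrections, values, datasetSize):
        # Assumed convention: True means "keep iterating".
        if iterationNumber + 1 >= limit:
            return False
        return bool(np.max(np.abs(corrections)) > tolerance)
    return condition

def optimize(dataset, centers, condition):
    """Run Lloyd's algorithm in place on centers; return the number of steps taken."""
    iterationNumber = 0
    while True:
        diffs = dataset[:, np.newaxis, :] - centers[np.newaxis, :, :]
        assignment = np.argmin(np.sum(diffs ** 2, axis=2), axis=1)
        corrections = np.zeros_like(centers)
        for k in range(len(centers)):
            members = dataset[assignment == k]
            if len(members) > 0:
                corrections[k] = members.mean(axis=0) - centers[k]
        centers += corrections  # in-place update
        if not condition(iterationNumber, corrections, centers, len(dataset)):
            break
        iterationNumber += 1
    return iterationNumber + 1

dataset = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[1.0, 1.0], [9.0, 1.0]])
steps = optimize(dataset, centers, max_iterations_or_converged(limit=100))
```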
Create a PFA document to score with this cluster set.
Parameters: | |
---|---|
Return type: | Pythonized JSON |
Returns: | a complete PFA document that performs clustering |
Create a PFA type schema representing this cluster set.
Parameters: | |
---|---|
Return type: | Pythonized JSON |
Returns: | PFA type schema for an array of clusters |
Create a PFA data structure representing this cluster set.
Parameters: | |
---|---|
Return type: | Pythonized JSON |
Returns: | data structure that should be inserted in the init section of the cell or pool containing the clusters |
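As a rough picture of what "Pythonized JSON" for a cluster set could look like, the sketch below builds a list of records and a matching Avro-style array-of-records schema. The record name `Cluster` and the field names `center` and `id` are hypothetical choices for this illustration, not confirmed by the text above.

```python
import json
import numpy as np

# Hypothetical schema: an array of records, each holding a center and an id.
cluster_type = {
    "type": "array",
    "items": {
        "type": "record",
        "name": "Cluster",
        "fields": [
            {"name": "center", "type": {"type": "array", "items": "double"}},
            {"name": "id", "type": "string"},
        ],
    },
}

def clusters_as_json(centers):
    """Convert an array of centers into JSON-serializable records."""
    return [
        {"center": [float(x) for x in center], "id": "cluster-%d" % i}
        for i, center in enumerate(centers)
    ]

centers = np.array([[0.0, 1.0], [10.0, 1.0]])
cluster_value = clusters_as_json(centers)
serialized = json.dumps(cluster_value)
```

Converting NumPy floats to built-in floats keeps the structure serializable with the standard `json` module.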
Pick a random point from the dataset.
Return type: | 1-d Numpy array |
---|---|
Returns: | a copy of a random point |
Return a (dataset, weights) pair, randomly sampled so that it contains subsetSize records.
Parameters: | subsetSize (positive integer) – size of the sample |
---|---|
Return type: | (2-d Numpy array, 1-d Numpy array) |
Returns: | (dataset, weights) sampled without replacement (if the original dataset is unique, the new one will be, too) |
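Sampling without replacement can be sketched by drawing distinct row indexes and applying them to both arrays. The function name `random_subset` and the explicit `rng` argument are assumptions for this illustration.

```python
import numpy as np

def random_subset(dataset, weights, subset_size, rng):
    """Return (dataset, weights) restricted to subset_size rows chosen without replacement."""
    if not 0 < subset_size <= len(dataset):
        raise ValueError("subset_size must be between 1 and the dataset size")
    indexes = rng.choice(len(dataset), size=subset_size, replace=False)
    # The same indexes select rows and weights, so they stay aligned.
    return dataset[indexes], weights[indexes]

rng = np.random.default_rng(12345)
dataset = np.arange(20, dtype=float).reshape(10, 2)  # 10 unique rows
weights = np.ones(10)
subset, subset_weights = random_subset(dataset, weights, 4, rng)
```

Because each index is drawn at most once, a dataset with unique rows yields a subset with unique rows.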
Optimize the cluster set in successively larger subsets of the dataset. (This can be viewed as a cluster seeding technique.)
If randomly seeded, optimizing the whole dataset can be slow to converge: a long time per iteration times many iterations.
Optimizing a random subset takes as many iterations, but the time per iteration is short. However, the final cluster centers are only approximate.
Optimizing the whole dataset with approximate cluster starting points takes a long time per iteration but fewer iterations.
This procedure runs the k-means optimization on random subsets of exponentially increasing size, from the smallest base**x that is larger than minPointsInCluster (or numberOfClusters) up to the largest base**x that still fits within the whole dataset.
Parameters: | |
---|---|
Return type: | None |
Returns: | nothing; modifies cluster set in-place |
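The schedule of subset sizes described above can be sketched with integer arithmetic. The name `stepup_sizes` is hypothetical, and two details are assumptions: the single `lower_bound` argument stands in for minPointsInCluster (or numberOfClusters), and a size equal to the dataset size is allowed.

```python
def stepup_sizes(dataset_size, lower_bound, base=2):
    """List the powers of base that are > lower_bound and <= dataset_size."""
    if base < 2:
        raise ValueError("base must be at least 2")
    sizes = []
    size = 1
    while size <= dataset_size:
        if size > lower_bound:
            sizes.append(size)
        size *= base
    return sizes

sizes = stepup_sizes(dataset_size=1000, lower_bound=5)
```

With a dataset of 1000 records, a lower bound of 5, and base 2, the schedule runs from 8 to 512. Each stage would optimize on a random subset of that size, using the previous stage's centers as its starting point.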