msmbuilder.cluster.MiniBatchKMeans
-
class msmbuilder.cluster.MiniBatchKMeans(n_clusters=8, init='k-means++', max_iter=100, batch_size=100, verbose=0, compute_labels=True, random_state=None, tol=0.0, max_no_improvement=10, init_size=None, n_init=3, reassignment_ratio=0.01)
Mini-Batch K-Means clustering
Read more in the User Guide.
Parameters: - n_clusters : int, optional, default: 8
The number of clusters to form as well as the number of centroids to generate.
- init : {‘k-means++’, ‘random’ or an ndarray}, default: ‘k-means++’
Method for initialization, defaults to ‘k-means++’:
‘k-means++’ : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
‘random’: choose k observations (rows) at random from data for the initial centroids.
If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
- max_iter : int, optional
Maximum number of iterations over the complete dataset before stopping independently of any early stopping criterion heuristics.
- batch_size : int, optional, default: 100
Size of the mini batches.
- verbose : boolean, optional
Verbosity mode.
- compute_labels : boolean, default=True
Compute label assignment and inertia for the complete dataset once the minibatch optimization has converged in fit.
- random_state : int, RandomState instance or None, optional, default: None
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- tol : float, default: 0.0
Control early stopping based on the relative center changes, as measured by a smoothed, variance-normalized estimate of the mean squared change in center positions. This early stopping heuristic is closer to the one used for the batch variant of the algorithm but induces a slight computational and memory overhead over the inertia heuristic.
To disable convergence detection based on normalized center change, set tol to 0.0 (default).
- max_no_improvement : int, default: 10
Control early stopping based on the number of consecutive mini batches that do not yield an improvement on the smoothed inertia.
To disable convergence detection based on inertia, set max_no_improvement to None.
- init_size : int, optional, default: 3 * batch_size
Number of samples to randomly sample for speeding up the initialization (sometimes at the expense of accuracy): the algorithm is initialized by running a batch KMeans on a random subset of the data. This needs to be larger than n_clusters.
- n_init : int, default=3
Number of random initializations that are tried. In contrast to KMeans, the algorithm is only run once, using the best of the n_init initializations as measured by inertia.
- reassignment_ratio : float, default: 0.01
Control the fraction of the maximum number of counts for a center to be reassigned. A higher value means that low count centers are more easily reassigned, which means that the model will take longer to converge, but should converge to a better clustering.
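The ‘k-means++’ option for init can be illustrated with a minimal pure-Python sketch. It assumes the textbook seeding rule: the first center is drawn uniformly at random, and each later center is drawn with probability proportional to its squared distance from the nearest center already chosen. The helper names kmeans_pp_init and squared_distance are hypothetical, and the library's own routine may differ in detail.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, n_clusters, seed=None):
    """Pick n_clusters seed centers from points (hypothetical helper).

    Assumes points contains at least n_clusters distinct rows.
    """
    rng = random.Random(seed)
    # First center: uniform random choice among the observations.
    centers = [list(rng.choice(points))]
    while len(centers) < n_clusters:
        # Weight every point by its squared distance to the nearest
        # center chosen so far; already-chosen points get weight 0.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centers


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
centers = kmeans_pp_init(points, n_clusters=3, seed=0)
print(centers)
```

Points far from every existing center are the most likely to be picked next, so the seeds tend to be spread across the data.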
See also
KMeans
- The classic implementation of the clustering method based on Lloyd’s algorithm. It consumes the whole set of input data at each iteration.
Notes
See http://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
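As a rough illustration of the kind of update described in the paper referenced above, the sketch below applies mini-batch steps with a per-center learning rate of 1/count. It is a plain-Python sketch under that assumption, not this class's implementation; minibatch_step and nearest are hypothetical names, and the toy data is made up for the example.

```python
import random


def nearest(x, centers):
    """Index of the center closest to x (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])),
    )


def minibatch_step(batch, centers, counts):
    """Update centers in place from one mini-batch (hypothetical helper)."""
    # Cache assignments against the centers as they were before the update.
    assignments = [nearest(x, centers) for x in batch]
    for x, j in zip(batch, assignments):
        counts[j] += 1
        eta = 1.0 / counts[j]  # per-center learning rate
        centers[j] = [(1.0 - eta) * c + eta * xi for c, xi in zip(centers[j], x)]


rng = random.Random(0)
# Toy 1-D data: two well-separated groups around 0 and 10.
data = [[rng.gauss(0.0, 0.1)] for _ in range(200)]
data += [[rng.gauss(10.0, 0.1)] for _ in range(200)]

centers = [[1.0], [9.0]]
counts = [0, 0]
for _ in range(50):
    batch = rng.sample(data, 20)  # batch_size = 20
    minibatch_step(batch, centers, counts)
print(centers, counts)
```

With a learning rate of 1/count, each center is the running mean of every sample assigned to it so far, so later batches move the centers less and less.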
Attributes: - cluster_centers_ : array, [n_clusters, n_features]
Coordinates of cluster centers
- labels_ : list of arrays, each of shape [sequence_length, ]
The label of each point is an integer in [0, n_clusters).
- inertia_ : float
The value of the inertia criterion associated with the chosen partition (if compute_labels is set to True). The inertia is defined as the sum of squared distances of samples to their nearest cluster center.
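To make the relationship between cluster centers, labels, and inertia concrete, here is a small sketch that computes labels and inertia by hand. The function labels_and_inertia is a hypothetical helper written for this example, not part of the library.

```python
def labels_and_inertia(samples, centers):
    """Return (labels, inertia) for samples given fixed centers."""
    labels = []
    inertia = 0.0
    for x in samples:
        # Squared distance from this sample to every center.
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
        j = dists.index(min(dists))
        labels.append(j)      # integer in [0, n_clusters)
        inertia += dists[j]   # accumulate squared distance to nearest center
    return labels, inertia


samples = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
centers = [(0.5, 0.0), (10.0, 0.0)]
labels, inertia = labels_and_inertia(samples, centers)
print(labels, inertia)  # [0, 0, 1] 0.5
```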
Methods
- fit(sequences[, y]) : Fit the clustering on the data
- fit_predict(sequences[, y]) : Performs clustering on the sequences and returns cluster labels.
- fit_transform(sequences[, y]) : Alias for fit_predict
- get_params([deep]) : Get parameters for this estimator.
- partial_fit(X[, y]) : Update k means estimate on a single mini-batch X.
- partial_predict(X[, y]) : Predict the closest cluster each sample in X belongs to.
- partial_transform(X) : Alias for partial_predict
- predict(sequences[, y]) : Predict the closest cluster each sample in each sequence in sequences belongs to.
- score(X[, y]) : Opposite of the value of X on the K-means objective.
- set_params(**params) : Set the parameters of this estimator.
- summarize() : Return some diagnostic summary statistics about this Markov model
- transform(sequences) : Alias for predict
-
__init__(n_clusters=8, init='k-means++', max_iter=100, batch_size=100, verbose=0, compute_labels=True, random_state=None, tol=0.0, max_no_improvement=10, init_size=None, n_init=3, reassignment_ratio=0.01)
Initialize self. See help(type(self)) for accurate signature.
-
fit(sequences, y=None)
Fit the clustering on the data
Parameters: - sequences : list of array-like, each of shape [sequence_length, n_features]
A list of multivariate timeseries. Each sequence may have a different length, but they all must have the same number of features.
Returns: - self
-
fit_predict(sequences, y=None)
Performs clustering on the sequences and returns cluster labels.
Parameters: - sequences : list of array-like, each of shape [sequence_length, n_features]
A list of multivariate timeseries. Each sequence may have a different length, but they all must have the same number of features.
Returns: - Y : list of ndarray, each of shape [sequence_length, ]
Cluster labels
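Because fit_predict accepts a list of sequences and returns one label array per sequence, it can help to picture the bookkeeping as concatenate, cluster, split. The sketch below assumes that scheme purely for illustration; concat_sequences and split_labels are hypothetical helpers, and a simple threshold stands in for the fitted clusterer.

```python
def concat_sequences(sequences):
    """Flatten a list of sequences and remember each sequence's length."""
    lengths = [len(seq) for seq in sequences]
    flat = [frame for seq in sequences for frame in seq]
    return flat, lengths


def split_labels(flat_labels, lengths):
    """Cut a flat label list back into one list per original sequence."""
    per_sequence, start = [], 0
    for n in lengths:
        per_sequence.append(flat_labels[start:start + n])
        start += n
    return per_sequence


# Two sequences of different lengths, each frame with one feature.
sequences = [
    [[0.0], [0.2], [9.8]],
    [[10.1], [0.1]],
]
flat, lengths = concat_sequences(sequences)
# Stand-in for a fitted clusterer: label 0 below 5.0, label 1 otherwise.
flat_labels = [0 if frame[0] < 5.0 else 1 for frame in flat]
labels = split_labels(flat_labels, lengths)
print(labels)  # [[0, 0, 1], [1, 0]]
```

Each returned list has the same length as the sequence it came from, which matches the documented return shape [sequence_length, ].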
-
fit_transform(sequences, y=None)
Alias for fit_predict
-
get_params(deep=True)
Get parameters for this estimator.
Parameters: - deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: - params : mapping of string to any
Parameter names mapped to their values.
-
partial_fit(X, y=None)
Update k means estimate on a single mini-batch X.
Parameters: - X : array-like, shape = [n_samples, n_features]
Coordinates of the data points to cluster.
- y : Ignored
-
partial_predict(X, y=None)
Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters: - X : array-like shape=(n_samples, n_features)
A single timeseries.
Returns: - Y : array, shape=(n_samples,)
Index of the cluster that each sample belongs to
-
partial_transform(X)
Alias for partial_predict
-
predict(sequences, y=None)
Predict the closest cluster each sample in each sequence in sequences belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters: - sequences : list of array-like, each of shape [sequence_length, n_features]
A list of multivariate timeseries. Each sequence may have a different length, but they all must have the same number of features.
Returns: - Y : list of arrays, each of shape [sequence_length,]
Index of the closest center each sample belongs to.
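The code-book view described above can be sketched in a few lines: quantizing a single timeseries means looking up the nearest code for each sample, and the multi-sequence form maps that lookup over the list. The names quantize and quantize_sequences are hypothetical and only mirror the roles that partial_predict and predict play in this documentation.

```python
def quantize(X, code_book):
    """For each sample in X, return the index of the closest code."""
    labels = []
    for x in X:
        dists = [sum((a - b) ** 2 for a, b in zip(x, code)) for code in code_book]
        labels.append(dists.index(min(dists)))
    return labels


def quantize_sequences(sequences, code_book):
    """Apply the single-sequence lookup to every sequence in the list."""
    return [quantize(X, code_book) for X in sequences]


# The code book plays the role of cluster_centers_.
code_book = [[0.0, 0.0], [4.0, 4.0]]
sequences = [
    [[0.5, 0.1], [3.9, 4.2]],
    [[4.1, 3.8]],
]
labels = quantize_sequences(sequences, code_book)
print(labels)  # [[0, 1], [1]]
```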
-
score(X, y=None)
Opposite of the value of X on the K-means objective.
Parameters: - X : {array-like, sparse matrix}, shape = [n_samples, n_features]
New data.
- y : Ignored
Returns: - score : float
Opposite of the value of X on the K-means objective.
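The sign convention is easy to miss: the K-means objective is minimized, so its opposite is a score where higher is better. The sketch below assumes the objective is the sum of squared distances to the nearest center; kmeans_score is a hypothetical stand-in, not the library method.

```python
def kmeans_score(X, centers):
    """Negative sum of squared distances from each sample to its nearest center."""
    total = 0.0
    for x in X:
        total += min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers)
    return -total


X = [[0.0], [1.0], [10.0]]
good = kmeans_score(X, [[0.5], [10.0]])  # centers close to the data
poor = kmeans_score(X, [[5.0], [6.0]])   # centers far from the data
print(good, poor)  # -0.5 -57.0
```

The better-fitting centers give the larger (less negative) score, which is what lets the value be used directly with utilities that maximize a score.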
-
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Returns: - self
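The <component>__<parameter> naming can be illustrated with a toy class that routes a double-underscore key to a nested object. ToyEstimator is entirely hypothetical and much simpler than a real estimator; it only shows how such a key might be split and forwarded.

```python
class ToyEstimator:
    """Hypothetical stand-in that stores its parameters in a dict."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_params(self, **params):
        for key, value in params.items():
            # Split "component__parameter" at the first double underscore.
            name, delim, sub_key = key.partition("__")
            if delim:
                # Forward the remainder to the nested component.
                self.params[name].set_params(**{sub_key: value})
            else:
                self.params[name] = value
        return self  # returning self allows call chaining


pipeline = ToyEstimator(cluster=ToyEstimator(n_clusters=8), verbose=0)
result = pipeline.set_params(cluster__n_clusters=50, verbose=1)
print(pipeline.params["cluster"].params["n_clusters"])  # 50
```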
-
summarize()
Return some diagnostic summary statistics about this Markov model
-
transform(sequences)
Alias for predict