msmbuilder.cluster.KCenters

class msmbuilder.cluster.KCenters(n_clusters=8, metric='euclidean', random_state=None)

K-Centers clustering

Cluster a vector or Trajectory dataset using a simple heuristic to minimize the maximum distance from any data point to its assigned cluster center.

The runtime of this algorithm is O(kN), where k is the number of clusters and N is the size of the dataset, making it one of the least expensive clustering algorithms available.
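The heuristic described above is commonly implemented as a greedy farthest-point search [1]. The following is a minimal pure-Python sketch under that assumption, using Euclidean distance only; the function name `k_centers` and its return values are illustrative and do not reproduce msmbuilder's implementation.

```python
import math
import random


def k_centers(points, n_clusters, seed=None):
    """Greedy farthest-point heuristic on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Start from a randomly chosen data point.
    center_ids = [rng.randrange(len(points))]
    # Distance from every point to its nearest chosen center so far.
    distances = [math.dist(p, points[center_ids[0]]) for p in points]
    labels = [0] * len(points)

    for k in range(1, n_clusters):
        # The next center is the point farthest from all current centers.
        new_id = max(range(len(points)), key=distances.__getitem__)
        center_ids.append(new_id)
        # One O(N) pass per new center gives the O(kN) total cost.
        for i, p in enumerate(points):
            d = math.dist(p, points[new_id])
            if d < distances[i]:
                distances[i] = d
                labels[i] = k

    centers = [points[i] for i in center_ids]
    return centers, labels, distances


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
centers, labels, distances = k_centers(points, n_clusters=3, seed=0)
```

Because every center is itself a data point, the same loop would work with any pairwise distance function in place of `math.dist`.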

Parameters:

n_clusters : int, optional, default: 8: The number of clusters to form as well as the number of centroids to generate.
metric : {“euclidean”, “sqeuclidean”, “cityblock”, “chebyshev”, “canberra”, “braycurtis”, “hamming”, “jaccard”, “rmsd”}: The distance metric to use. metric = “rmsd” requires that sequences passed to fit() be `md.Trajectory`; other distance metrics require ``np.ndarray``s.
random_state : integer or numpy.random.RandomState, optional: The generator used to initialize the centers. If an integer is given, it fixes the seed. Defaults to the global numpy random number generator.

References

[1]	Gonzalez, Teofilo F. “Clustering to minimize the maximum intercluster distance.” Theor. Comput. Sci. 38 (1985): 293-306.

[2]	Beauchamp, Kyle A., et al. “MSMBuilder2: modeling conformational dynamics on the picosecond to millisecond scale.” J. Chem. Theory. Comput. 7.10 (2011): 3412-3419.

Attributes:

`cluster_centers_` : array, [n_clusters, n_features]: Coordinates of cluster centers
`labels_` : list of arrays, each of shape [sequence_length, ]: labels_[i] is an array of the labels of each point in sequence i. The label of each point is an integer in [0, n_clusters).
`distances_` : list of arrays, each of shape [sequence_length, ]: distances_[i] is an array of the distances from each point in sequence i to the cluster center it is assigned to.
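The per-sequence layout of `labels_` and `distances_` can be pictured with a small sketch. It assumes the samples of all sequences are handled as one flat list and then split back by sequence length; `split_by_sequence` is a hypothetical helper, not part of msmbuilder, and the label values are made up.

```python
def split_by_sequence(flat_values, sequences):
    """Split one flat per-sample list into one list per input sequence."""
    result, start = [], 0
    for seq in sequences:
        stop = start + len(seq)
        result.append(flat_values[start:stop])
        start = stop
    return result


# Two sequences of different lengths (3 and 2 samples, 1 feature each).
sequences = [[[0.0], [0.2], [9.0]], [[9.1], [0.1]]]
flat_labels = [0, 0, 1, 1, 0]  # made-up labels, one per sample

labels = split_by_sequence(flat_labels, sequences)
```

Here `labels[i]` has the same length as `sequences[i]`, which mirrors the shape given for `labels_[i]` and `distances_[i]`.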

Methods

`fit`(sequences[, y])	Fit the kcenters clustering on the data
`fit_predict`(sequences[, y])	Performs clustering on sequences and returns cluster labels.
`fit_transform`(sequences[, y])	Alias for fit_predict
`get_params`([deep])	Get parameters for this estimator.
`partial_predict`(X[, y])	Predict the closest cluster each sample in X belongs to.
`partial_transform`(X)	Alias for partial_predict
`predict`(sequences[, y])	Predict the closest cluster each sample in each sequence in sequences belongs to.
`set_params`(**params)	Set the parameters of this estimator.
`summarize`()	Return some diagnostic summary statistics about this model
`transform`(sequences)	Alias for predict

__init__(n_clusters=8, metric='euclidean', random_state=None): Initialize self. See help(type(self)) for accurate signature.

fit(sequences, y=None)

Fit the kcenters clustering on the data

Parameters:	sequences : list of array-like, each of shape [sequence_length, n_features] A list of multivariate timeseries, or `md.Trajectory`. Each sequence may have a different length, but they all must have the same number of features, or the same number of atoms if they are ``md.Trajectory``s.
Returns:	self

fit_predict(sequences, y=None)

Performs clustering on sequences and returns cluster labels.

Parameters:	sequences : list of array-like, each of shape [sequence_length, n_features] A list of multivariate timeseries. Each sequence may have a different length, but they all must have the same number of features.
Returns:	Y : list of ndarray, each of shape [sequence_length, ] Cluster labels

fit_transform(sequences, y=None): Alias for fit_predict

get_params(deep=True)

Get parameters for this estimator.

Parameters:	deep : boolean, optional If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:	params : mapping of string to any Parameter names mapped to their values.

partial_predict(X, y=None)

Predict the closest cluster each sample in X belongs to.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters:	X : array-like shape=(n_samples, n_features) A single timeseries.
Returns:	Y : array, shape=(n_samples,) Index of the cluster that each sample belongs to

partial_transform(X): Alias for partial_predict

predict(sequences, y=None)

Predict the closest cluster each sample in each sequence in sequences belongs to.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters:	sequences : list of array-like, each of shape [sequence_length, n_features] A list of multivariate timeseries. Each sequence may have a different length, but they all must have the same number of features.
Returns:	Y : list of arrays, each of shape [sequence_length,] Index of the closest center each sample belongs to.
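The code-book lookup described for `partial_predict` and `predict` can be sketched in a few lines. This is an illustration under the assumption of a Euclidean metric; the free functions below only borrow the method names and are not msmbuilder's code, and `code_book` stands in for a fitted `cluster_centers_`.

```python
import math


def partial_predict(X, cluster_centers):
    """Index of the nearest center (code) for each sample in one sequence."""
    return [
        min(range(len(cluster_centers)),
            key=lambda j: math.dist(x, cluster_centers[j]))
        for x in X
    ]


def predict(sequences, cluster_centers):
    """Apply partial_predict to every sequence; lengths may differ."""
    return [partial_predict(X, cluster_centers) for X in sequences]


code_book = [(0.0, 0.0), (10.0, 0.0)]
sequences = [
    [(1.0, 0.0), (9.0, 1.0), (4.0, 0.0)],
    [(8.0, -1.0)],
]
assigned = predict(sequences, code_book)
```

The result keeps the list-of-sequences structure of the input, with one integer index per sample.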

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:	self
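The `<component>__<parameter>` naming can be illustrated with a toy class. `ToyEstimator` is a hypothetical stand-in written for this sketch; it shows how a double-underscore key could be routed to a nested object and is not the actual estimator base class.

```python
class ToyEstimator:
    """Hypothetical stand-in that mimics the get_params/set_params convention."""

    def __init__(self, **params):
        self._params = dict(params)

    def get_params(self):
        return dict(self._params)

    def set_params(self, **params):
        for key, value in params.items():
            # Split "component__parameter" at the first double underscore.
            name, sep, rest = key.partition("__")
            if sep:
                # Delegate the remainder of the key to the nested component.
                self._params[name].set_params(**{rest: value})
            else:
                self._params[name] = value
        return self  # returning self allows chained calls


clusterer = ToyEstimator(n_clusters=8, metric="euclidean")
pipeline = ToyEstimator(cluster=clusterer, verbose=False)
result = pipeline.set_params(cluster__n_clusters=20, verbose=True)
```

After the call, the nested `clusterer` holds `n_clusters=20` while `metric` is unchanged, and `result` is the `pipeline` object itself.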

summarize(): Return some diagnostic summary statistics about this model

transform(sequences): Alias for predict