msmbuilder.cluster.SpectralClustering¶
-
class
msmbuilder.cluster.
SpectralClustering
(n_clusters=8, eigen_solver=None, random_state=None, n_init=10, gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol=0.0, assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=1)¶ Apply clustering to a projection to the normalized laplacian.
In practice Spectral Clustering is very useful when the structure of the individual clusters is highly non-convex or more generally when a measure of the center and spread of the cluster is not a suitable description of the complete cluster. For instance when clusters are nested circles on the 2D plan.
If affinity is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.
When calling
fit
, an affinity matrix is constructed using either kernel function such the Gaussian (aka RBF) kernel of the euclidean distancedd(X, X)
:np.exp(-gamma * d(X,X) ** 2)
or a k-nearest neighbors connectivity matrix.
Alternatively, using
precomputed
, a user-provided affinity matrix can be used.Read more in the User Guide.
Parameters: - n_clusters : integer, optional
The dimension of the projection subspace.
- eigen_solver : {None, ‘arpack’, ‘lobpcg’, or ‘amg’}
The eigenvalue decomposition strategy to use. AMG requires pyamg to be installed. It can be faster on very large, sparse problems, but may also lead to instabilities
- random_state : int, RandomState instance or None, optional, default: None
A pseudo random number generator used for the initialization of the lobpcg eigen vectors decomposition when eigen_solver == ‘amg’ and by the K-Means initialization. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
- n_init : int, optional, default: 10
Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.
- gamma : float, default=1.0
Kernel coefficient for rbf, poly, sigmoid, laplacian and chi2 kernels. Ignored for
affinity='nearest_neighbors'
.- affinity : string, array-like or callable, default ‘rbf’
If a string, this may be one of ‘nearest_neighbors’, ‘precomputed’, ‘rbf’ or one of the kernels supported by sklearn.metrics.pairwise_kernels.
Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm.
- n_neighbors : integer
Number of neighbors to use when constructing the affinity matrix using the nearest neighbors method. Ignored for
affinity='rbf'
.- eigen_tol : float, optional, default: 0.0
Stopping criterion for eigendecomposition of the Laplacian matrix when using arpack eigen_solver.
- assign_labels : {‘kmeans’, ‘discretize’}, default: ‘kmeans’
The strategy to use to assign labels in the embedding space. There are two ways to assign labels after the laplacian embedding. k-means can be applied and is a popular choice. But it can also be sensitive to initialization. Discretization is another approach which is less sensitive to random initialization.
- degree : float, default=3
Degree of the polynomial kernel. Ignored by other kernels.
- coef0 : float, default=1
Zero coefficient for polynomial and sigmoid kernels. Ignored by other kernels.
- kernel_params : dictionary of string to any, optional
Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels.
- n_jobs : int, optional (default = 1)
The number of parallel jobs to run. If
-1
, then the number of jobs is set to the number of CPU cores.
Notes
If you have an affinity matrix, such as a distance matrix, for which 0 means identical elements, and high values means very dissimilar elements, it can be transformed in a similarity matrix that is well suited for the algorithm by applying the Gaussian (RBF, heat) kernel:
np.exp(- dist_matrix ** 2 / (2. * delta ** 2))
Where
delta
is a free parameter representing the width of the Gaussian kernel.Another alternative is to take a symmetric version of the k nearest neighbors connectivity matrix of the points.
If the pyamg package is installed, it is used: this greatly speeds up computation.
References
- Normalized cuts and image segmentation, 2000 Jianbo Shi, Jitendra Malik http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.160.2324
- A Tutorial on Spectral Clustering, 2007 Ulrike von Luxburg http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.165.9323
- Multiclass spectral clustering, 2003 Stella X. Yu, Jianbo Shi http://www1.icsi.berkeley.edu/~stellayu/publication/doc/2003kwayICCV.pdf
Attributes: - affinity_matrix_ : array-like, shape (n_samples, n_samples)
Affinity matrix used for clustering. Available only if after calling
fit
.- labels_ : list of arrays, each of shape [sequence_length, ]
The label of each point is an integer in [0, n_clusters).
Methods
fit
(sequences[, y])Fit the clustering on the data fit_predict
(sequences[, y])Performs clustering on X and returns cluster labels. fit_transform
(sequences[, y])Alias for fit_predict get_params
([deep])Get parameters for this estimator. partial_predict
(X[, y])Predict the closest cluster each sample in X belongs to. partial_transform
(X)Alias for partial_predict predict
(sequences[, y])Predict the closest cluster each sample in each sequence in sequences belongs to. set_params
(**params)Set the parameters of this estimator. summarize
()Return some diagnostic summary statistics about this Markov model transform
(sequences)Alias for predict -
__init__
(n_clusters=8, eigen_solver=None, random_state=None, n_init=10, gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol=0.0, assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=1)¶ Initialize self. See help(type(self)) for accurate signature.
Methods
__init__
([n_clusters, eigen_solver, …])Initialize self. fit
(sequences[, y])Fit the clustering on the data fit_predict
(sequences[, y])Performs clustering on X and returns cluster labels. fit_transform
(sequences[, y])Alias for fit_predict get_params
([deep])Get parameters for this estimator. partial_predict
(X[, y])Predict the closest cluster each sample in X belongs to. partial_transform
(X)Alias for partial_predict predict
(sequences[, y])Predict the closest cluster each sample in each sequence in sequences belongs to. set_params
(**params)Set the parameters of this estimator. summarize
()Return some diagnostic summary statistics about this Markov model transform
(sequences)Alias for predict -
fit
(sequences, y=None)¶ Fit the clustering on the data
Parameters: - sequences : list of array-like, each of shape [sequence_length, n_features]
A list of multivariate timeseries. Each sequence may have a different length, but they all must have the same number of features.
Returns: - self
-
fit_predict
(sequences, y=None)¶ Performs clustering on X and returns cluster labels.
Parameters: - sequences : list of array-like, each of shape [sequence_length, n_features]
A list of multivariate timeseries. Each sequence may have a different length, but they all must have the same number of features.
Returns: - Y : list of ndarray, each of shape [sequence_length, ]
Cluster labels
-
fit_transform
(sequences, y=None)¶ Alias for fit_predict
-
get_params
(deep=True)¶ Get parameters for this estimator.
Parameters: - deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: - params : mapping of string to any
Parameter names mapped to their values.
-
partial_predict
(X, y=None)¶ Predict the closest cluster each sample in X belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters: - X : array-like shape=(n_samples, n_features)
A single timeseries.
Returns: - Y : array, shape=(n_samples,)
Index of the cluster that each sample belongs to
-
partial_transform
(X)¶ Alias for partial_predict
-
predict
(sequences, y=None)¶ Predict the closest cluster each sample in each sequence in sequences belongs to.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters: - sequences : list of array-like, each of shape [sequence_length, n_features]
A list of multivariate timeseries. Each sequence may have a different length, but they all must have the same number of features.
Returns: - Y : list of arrays, each of shape [sequence_length,]
Index of the closest center each sample belongs to.
-
set_params
(**params)¶ Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form
<component>__<parameter>
so that it’s possible to update each component of a nested object.Returns: - self
-
summarize
()¶ Return some diagnostic summary statistics about this Markov model
-
transform
(sequences)¶ Alias for predict