msmbuilder.decomposition.SparseTICA

class msmbuilder.decomposition.SparseTICA(n_components, lag_time=1, gamma=0.05, rho=0.01, weighted_transform=True, epsilon=1e-06, tolerance=1e-08, maxiter=10000, max_nc=100, greedy=True, verbose=False)

Sparse time-structure Independent Component Analysis (tICA).

Linear dimensionality reduction which finds sparse linear combinations of the input features which decorrelate most slowly. These can be used for feature selection and/or dimensionality reduction.
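To make "decorrelate most slowly" concrete, the sketch below solves the ordinary (dense, unregularized) tICA generalized eigenproblem with NumPy alone. It is not the SparseTICA solver and does not use cvxpy. The helper name dense_tica, the symmetrized estimators, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def dense_tica(X, lag_time, n_components):
    """Non-sparse, unregularized tICA on a single sequence (illustrative only)."""
    X = np.asarray(X, dtype=float)
    X = X - X.mean(axis=0)
    head, tail = X[:-lag_time], X[lag_time:]
    # Symmetrized covariance (S) and time-lagged correlation (C) estimates.
    S = (head.T @ head + tail.T @ tail) / (2 * len(head))
    C = (head.T @ tail + tail.T @ head) / (2 * len(head))
    # Solve C v = lambda S v by whitening with the Cholesky factor of S.
    L_inv = np.linalg.inv(np.linalg.cholesky(S))
    eigenvalues, U = np.linalg.eigh(L_inv @ C @ L_inv.T)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    components = (L_inv.T @ U[:, order]).T   # rows are components
    return eigenvalues[order], components

# Synthetic data: one slowly decorrelating feature and one white-noise feature.
rng = np.random.default_rng(0)
n = 5000
slow = np.zeros(n)
for t in range(1, n):
    slow[t] = 0.99 * slow[t - 1] + rng.normal()
fast = rng.normal(size=n)
X = np.column_stack([slow, fast])

eigenvalues, components = dense_tica(X, lag_time=1, n_components=1)
```

In this toy case the leading component should track the slow feature. The sparse variant additionally penalizes the number of nonzero loadings in each component.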

This model requires the additional Python package cvxpy, which can be installed from PyPI.

Warning

This model is currently experimental, and may undergo significant changes or bug fixes in upcoming releases.

Parameters:

n_components : int

Number of sparse tICs to find.

lag_time : int

Delay time forward or backward in the input data. The time-lagged correlations are computed between data points X[t] and X[t+lag_time].
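As a minimal sketch of how the pairs X[t] and X[t+lag_time] line up within one sequence, the hypothetical helper lagged_pairs below (not part of msmbuilder) slices a single array into its two aligned halves.

```python
import numpy as np

def lagged_pairs(X, lag_time):
    """Return the aligned arrays (X[t], X[t + lag_time]) for one sequence."""
    X = np.asarray(X)
    if lag_time < 1 or lag_time >= len(X):
        raise ValueError("lag_time must satisfy 1 <= lag_time < n_samples")
    return X[:-lag_time], X[lag_time:]

X = np.arange(20).reshape(10, 2)   # 10 frames, 2 features
head, tail = lagged_pairs(X, lag_time=3)
n_pairs = len(head)                # 10 frames with lag 3 leave 7 pairs
```

A sequence of n_samples frames therefore contributes n_samples - lag_time pairs, so sequences shorter than or equal to the lag time contribute none.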

gamma : nonnegative float, default=0.05

L2 regularization strength. Positive gamma entails incrementing the sample covariance matrix by a constant times the identity, to ensure that it is positive definite. The exact form of the regularized sample covariance matrix is

covariance + (gamma / n_features) * Tr(covariance) * Identity

where \(Tr\) is the trace operator.
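The regularized form above can be written directly in NumPy. This is a sketch of the formula as stated, not a copy of the library code; regularize_covariance is a hypothetical name.

```python
import numpy as np

def regularize_covariance(covariance, gamma):
    """Add (gamma / n_features) * Tr(covariance) * Identity to the covariance."""
    covariance = np.asarray(covariance, dtype=float)
    n_features = covariance.shape[0]
    shift = (gamma / n_features) * np.trace(covariance)
    return covariance + shift * np.eye(n_features)

# A singular 2x2 covariance matrix (eigenvalues 0 and 2).
cov = np.array([[1.0, 1.0],
                [1.0, 1.0]])
cov_reg = regularize_covariance(cov, gamma=0.05)
smallest_eigenvalue = np.linalg.eigvalsh(cov_reg).min()   # shifted up from 0 to 0.05
```

Because the shift scales with the trace, the same gamma has a comparable effect regardless of the overall magnitude of the features.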

rho : positive float

Regularization strength that controls the sparsity of the solutions. Higher values of rho give sparser tICs with nonzero loadings on fewer degrees of freedom. rho=0 corresponds to standard tICA.

weighted_transform : bool, default=True

If True, weight the projections by the implied timescales, giving a quantity that has units [Time].

epsilon : positive float, default=1e-6

epsilon should be a number very close to zero, which is used to construct the approximation to the L_0 penalty function. However, when it gets too close to zero, the solvers may report feasibility problems due to numerical stability issues. 1e-6 is a fairly good balance here.
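One way to see the role of epsilon is the log-based surrogate for the L_0 norm used in the majorization-minimization paper cited in the references. The sketch assumes that form of the surrogate. The function name approx_l0 is hypothetical, and the exact expression used inside the solver may differ.

```python
import numpy as np

def approx_l0(x, epsilon):
    """Log surrogate for the number of nonzero entries of x."""
    magnitudes = np.abs(np.asarray(x, dtype=float))
    return np.sum(np.log1p(magnitudes / epsilon)) / np.log1p(1.0 / epsilon)

loadings = [0.0, 0.3, -1.5, 0.0]            # two nonzero entries
coarse = approx_l0(loadings, epsilon=1e-2)  # looser approximation of the count
fine = approx_l0(loadings, epsilon=1e-6)    # closer to the true count of 2
```

Smaller epsilon brings the surrogate closer to the true count of nonzero entries, at the price of a more sharply curved function that is harder on the numerical solver.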

tolerance : positive float

Convergence criterion for the sparse generalized eigensolver.

maxiter : int

Maximum number of iterations for the sparse generalized eigensolver.

max_nc : int

Maximum number of iterations without any change in the sparsity pattern.

greedy : bool, default=True

Use a greedy heuristic in the sparse generalized eigensolver. This significantly accelerates the solution for high-dimensional problems under moderate to strong regularization.

verbose : bool, default=False

Print verbose information from the sparse generalized eigensolver.

References

[R12] McGibbon, R. T. and V. S. Pande. “Identification of sparse, slow reaction coordinates from molecular dynamics simulations.” In preparation.
[R13] Sriperumbudur, B. K., D. A. Torres, and G. R. Lanckriet. “A majorization-minimization approach to the sparse generalized eigenvalue problem.” Machine Learning 85.1-2 (2011): 3-39.
[R14] Mackey, L. “Deflation Methods for Sparse PCA.” NIPS. Vol. 21. 2008.

Attributes

components_ (array-like, shape (n_components, n_features)) Components with maximum autocorrelation.
offset_correlation_ (array-like, shape (n_features, n_features)) Symmetric time-lagged correlation matrix, C=E[(x_t)^T x_{t+lag}].
eigenvalues_ (array-like, shape (n_features,)) Pseudo-eigenvalues of the tICA generalized eigenproblem, in decreasing order.
eigenvectors_ (array-like, shape (n_components, n_features)) Sparse pseudo-eigenvectors of the tICA generalized eigenproblem. The vectors give a set of “directions” through configuration space along which the system relaxes towards equilibrium.
means_ (array, shape (n_features,)) The mean of the data along each feature
n_observations_ (int) Total number of data points fit by the model. Note that the model is “reset” by calling fit() with new sequences, whereas partial_fit() updates the fit with new data, and is suitable for online learning.
n_sequences_ (int) Total number of sequences fit by the model. Note that the model is “reset” by calling fit() with new sequences, whereas partial_fit() updates the fit with new data, and is suitable for online learning.
timescales_ (array-like, shape (n_components,)) The implied timescales of the tICA model, given by -lag_time / log(eigenvalues_)
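The implied-timescale relation can be checked by hand. The sketch below assumes the numerator is the model's lag time and that the eigenvalues lie strictly between 0 and 1; implied_timescales is a hypothetical helper, not a library function.

```python
import numpy as np

def implied_timescales(eigenvalues, lag_time):
    """Convert eigenvalues in (0, 1) to relaxation timescales."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    return -lag_time / np.log(eigenvalues)

# Eigenvalues closer to 1 correspond to slower processes.
timescales = implied_timescales([0.9, 0.5], lag_time=10)
# approximately [94.9, 14.4], in the same units as lag_time
```

Eigenvalues at or below zero have no real-valued timescale under this formula, so they would need separate handling.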

Methods

fit(sequences[, y]) Fit the model with a collection of sequences.
fit_transform(sequences[, y]) Fit the model with X and apply the dimensionality reduction on X.
get_params([deep]) Get parameters for this estimator.
partial_fit(X) Fit the model with X.
partial_transform(features) Apply the dimensionality reduction on X.
score(sequences[, y]) Score the model on new data using the generalized matrix Rayleigh quotient
set_params(**params) Set the parameters of this estimator.
summarize() Some summary information.
transform(sequences) Apply the dimensionality reduction on X.

summarize()

Some summary information.

fit(sequences, y=None)

Fit the model with a collection of sequences.

This method is not online. Any state accumulated from previous calls to fit() or partial_fit() will be cleared. For online learning, use partial_fit.

Parameters:

sequences: list of array-like, each of shape (n_samples_i, n_features)

Training data, where n_samples_i is the number of samples in sequence i and n_features is the number of features.

y : None

Ignored

Returns:

self : object

Returns the instance itself.

fit_transform(sequences, y=None)

Fit the model with X and apply the dimensionality reduction on X.

This method is not online. Any state accumulated from previous calls to fit() or partial_fit() will be cleared. For online learning, use partial_fit.

Parameters:

sequences: list of array-like, each of shape (n_samples_i, n_features)

Training data, where n_samples_i is the number of samples in sequence i and n_features is the number of features.

y : None

Ignored

Returns:

sequence_new : list of array-like, each of shape (n_samples_i, n_components)

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep: boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : mapping of string to any

Parameter names mapped to their values.

partial_fit(X)

Fit the model with X.

This method is suitable for online learning. The state of the model will be updated with the new data X.

Parameters:

X: array-like, shape (n_samples, n_features)

Training data, where n_samples is the number of samples and n_features is the number of features.

Returns:

self : object

Returns the instance itself.
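The difference between fit() and partial_fit() comes down to whether accumulated statistics are cleared first. The class below is a hypothetical stand-in written for illustration (it is not the SparseTICA implementation). It keeps only a running sum of lagged outer products and the two counters described under Attributes.

```python
import numpy as np

class RunningMoments:
    """Hypothetical accumulator illustrating fit() versus partial_fit()."""

    def __init__(self, lag_time=1):
        self.lag_time = lag_time
        self._reset()

    def _reset(self):
        self.n_observations_ = 0
        self.n_sequences_ = 0
        self._sum_lagged = None

    def partial_fit(self, X):
        # Update the running statistics with one more sequence.
        X = np.asarray(X, dtype=float)
        head, tail = X[:-self.lag_time], X[self.lag_time:]
        update = head.T @ tail
        if self._sum_lagged is None:
            self._sum_lagged = np.zeros_like(update)
        self._sum_lagged += update
        self.n_observations_ += len(X)
        self.n_sequences_ += 1
        return self

    def fit(self, sequences):
        # Discard all previous state, then accumulate from scratch.
        self._reset()
        for X in sequences:
            self.partial_fit(X)
        return self

rng = np.random.default_rng(0)
a, b = rng.normal(size=(50, 3)), rng.normal(size=(30, 3))
model = RunningMoments().partial_fit(a).partial_fit(b)
n_after_updates = model.n_observations_   # 50 + 30 = 80
model.fit([b])
n_after_refit = model.n_observations_     # state was reset: 30
```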

partial_transform(features)

Apply the dimensionality reduction on X.

Parameters:

features: array-like, shape (n_samples, n_features)

Data to transform, where n_samples is the number of samples and n_features is the number of features. This function acts on a single featurized trajectory.

Returns:

sequence_new : array-like, shape (n_samples, n_components)

TICA-projected features

Notes

This function acts on a single featurized trajectory.
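Conceptually, the projection of a single trajectory is a mean subtraction followed by a matrix product with the components, optionally scaled by the timescales when weighted_transform is set. The sketch below assumes exactly that sequence of steps. The function project and its arguments are hypothetical stand-ins for the fitted attributes means_, components_, and timescales_.

```python
import numpy as np

def project(features, means, components, timescales=None):
    """Center one trajectory and project it onto the rows of components."""
    X = np.asarray(features, dtype=float) - means
    projected = X @ components.T            # shape (n_samples, n_components)
    if timescales is not None:
        projected = projected * timescales  # weight each column by its timescale
    return projected

features = np.arange(12.0).reshape(4, 3)    # 4 frames, 3 features
means = features.mean(axis=0)
components = np.array([[1.0, 0.0, 0.0],     # sparse: uses feature 0 only
                       [0.0, 0.0, 2.0]])    # sparse: uses feature 2 only
plain = project(features, means, components)
weighted = project(features, means, components, timescales=np.array([10.0, 2.0]))
```

Applying the same function to each element of a list of trajectories gives the list-valued result described for transform().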

score(sequences, y=None)

Score the model on new data using the generalized matrix Rayleigh quotient

Parameters:

sequences : list of array-like

List of sequences to score. Each sequence should be an array-like of shape (n_samples_i, n_features), in the same format as the sequences passed to fit().

Returns:

gmrq : float

Generalized matrix Rayleigh quotient. This number indicates how well the top n_components eigenvectors of this model perform as slowly decorrelating collective variables for the new data in sequences.
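The generalized matrix Rayleigh quotient from the reference cited for this method can be sketched as a trace of a ratio of projected matrices. The sketch assumes the eigenvectors are stored as columns of V and that C_test and S_test are the time-lagged correlation and covariance matrices estimated on the new data. The function name gmrq and the toy matrices are illustrative only.

```python
import numpy as np

def gmrq(V, C_test, S_test):
    """Trace of (V^T S V)^{-1} (V^T C V) for column eigenvectors V."""
    numerator = V.T @ C_test @ V
    denominator = V.T @ S_test @ V
    return np.trace(np.linalg.solve(denominator, numerator))

# Toy symmetric matrices standing in for test-set estimates.
S = np.array([[2.0, 0.5],
              [0.5, 1.0]])
C = np.array([[1.2, 0.1],
              [0.1, 0.3]])

# Exact generalized eigenvectors of (C, S), obtained by whitening with S.
L_inv = np.linalg.inv(np.linalg.cholesky(S))
eigenvalues, U = np.linalg.eigh(L_inv @ C @ L_inv.T)
V_all = L_inv.T @ U
V_top = V_all[:, [np.argmax(eigenvalues)]]

score_top = gmrq(V_top, C, S)   # equals the largest eigenvalue
score_all = gmrq(V_all, C, S)   # equals the sum of all eigenvalues
```

When the eigenvectors come from the same matrices they are scored on, the quotient reduces to the sum of the corresponding eigenvalues. On held-out data it is generally lower, which is what makes it usable for cross-validation.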

References

[R15] McGibbon, R. T. and V. S. Pande. “Variational cross-validation of slow dynamical modes in molecular kinetics.” J. Chem. Phys. 142, 124105 (2015).

score_

Training score of the model, computed as the generalized matrix Rayleigh quotient (the sum of the first n_components eigenvalues).

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
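The <component>__<parameter> naming convention can be illustrated without any estimator at all. The helper below is a hypothetical sketch of how such keys split into the estimator's own parameters and per-component parameters. The step name "tica" is an assumed pipeline component name, not something defined by this class.

```python
def split_nested_params(params):
    """Group keys of the form '<component>__<parameter>' by component."""
    own, nested = {}, {}
    for key, value in params.items():
        component, delimiter, parameter = key.partition("__")
        if delimiter:
            # Key addresses a parameter of a nested component.
            nested.setdefault(component, {})[parameter] = value
        else:
            # Key addresses a parameter of the outer object itself.
            own[key] = value
    return own, nested

own, nested = split_nested_params(
    {"verbose": True, "tica__rho": 0.05, "tica__lag_time": 10}
)
```

Here own holds the plain keys and nested maps each component name to the parameters destined for it.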

Returns:

self

transform(sequences)

Apply the dimensionality reduction on X.

Parameters:

sequences: list of array-like, each of shape (n_samples_i, n_features)

Data to transform, where n_samples_i is the number of samples in sequence i and n_features is the number of features.

Returns:

sequence_new : list of array-like, each of shape (n_samples_i, n_components)
