msmbuilder.msm.MarkovStateModel¶

class msmbuilder.msm.MarkovStateModel(lag_time=1, n_timescales=None, reversible_type='mle', ergodic_cutoff='on', prior_counts=0, sliding_window=True, verbose=True)¶

Reversible Markov State Model

This model fits a first-order Markov model to a dataset of integer-valued timeseries. The key estimated attribute, transmat_ is a matrix containing the estimated probability of transitioning between pairs of states in the duration specified by lag_time.

Unless otherwise specified, the model is constrained to be reversible (satisfy detailed balance), which is appropriate for equilibrium chemical systems.

Parameters:

lag_time : int: The lag time of the model
n_timescales : int, optional: The number of dynamical timescales to calculate when diagonalizing the transition matrix. If not specified, it will compute n_states - 1
reversible_type : {‘mle’, ‘transpose’, None}: Method by which the reversibility of the transition matrix is enforced. ‘mle’ uses a maximum likelihood method that is solved by numerical optimization, and ‘transpose’ uses a more restrictive (but less computationally complex) direct symmetrization of the expected number of counts.
ergodic_cutoff : float or {‘on’, ‘off’}, default=’on’: Only the maximal strongly ergodic subgraph of the data is used to build an MSM. Ergodicity is determined by ensuring that each state is accessible from each other state via one or more paths involving edges with a number of observed directed counts greater than or equal to ergodic_cutoff. By setting ergodic_cutoff to 0 or ‘off’, this trimming is turned off. Setting it to ‘on’ sets the cutoff to the minimal possible count value.
prior_counts : float, optional: Add a number of “pseudo counts” to each entry in the counts matrix after ergodic trimming. When prior_counts == 0 (default), the assigned transition probability between two states with no observed transitions will be zero, whereas when prior_counts > 0, even this unobserved transitions will be given nonzero probability.
sliding_window : bool, optional: Count transitions using a window of length lag_time, which is slid along the sequences 1 unit at a time, yielding transitions which contain more data but cannot be assumed to be statistically independent. Otherwise, the sequences are simply subsampled at an interval of lag_time.
verbose : bool: Enable verbose printout

References

[1]	Prinz, Jan-Hendrik, et al. “Markov models of molecular kinetics: Generation and validation.” J Chem. Phys. 134.17 (2011): 174105.

[2]	Pande, V. S., K. A. Beauchamp, and G. R. Bowman. “Everything you wanted to know about Markov State Models but were afraid to ask” Methods 52.1 (2010): 99-105.

Attributes:

n_states_ : int: The number of states in the model
mapping_ : dict: Mapping between “input” labels and internal state indices used by the counts and transition matrix for this Markov state model. Input states need not necessarily be integers in (0, …, n_states_ - 1), for example. The semantics of mapping_[i] = j is that state i from the “input space” is represented by the index j in this MSM.
countsmat_ : array_like, shape = (n_states_, n_states_): Number of transition counts between states. countsmat_[i, j] is counted during fit(). The indices i and j are the “internal” indices described above. No correction for reversibility is made to this matrix.
transmat_ : array_like, shape = (n_states_, n_states_): Maximum likelihood estimate of the reversible transition matrix. The indices i and j are the “internal” indices described above.
populations_ : array, shape = (n_states_,): The equilibrium population (stationary eigenvector) of transmat_

Methods

`draw_samples`(sequences, n_samples[, …])	Sample conformations for a sequences of states.
`eigtransform`(sequences[, right, mode])	Transform a list of sequences by projecting the sequences onto the first n_timescales dynamical eigenvectors.
`fit`(sequences[, y])	Estimate model parameters.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_params`([deep])	Get parameters for this estimator.
`inverse_transform`(sequences)	Transform a list of sequences from internal indexing into labels
`partial_transform`(sequence[, mode])	Transform a sequence to internal indexing
`sample_discrete`([state, n_steps, random_state])	Generate a random sequence of states by propagating the model using discrete time steps given by the model lagtime.
`score`(sequences[, y])	Score the model on new data using the generalized matrix Rayleigh quotient
`score_ll`(sequences)	log of the likelihood of sequences with respect to the model
`set_params`(**params)	Set the parameters of this estimator.
`summarize`()	Return some diagnostic summary statistics about this Markov model
`transform`(sequences[, mode])	Transform a list of sequences to internal indexing
`uncertainty_eigenvalues`()	Estimate of the element-wise asymptotic standard deviation in the model eigenvalues.
`uncertainty_timescales`()	Estimate of the element-wise asymptotic standard deviation in the model implied timescales.

sample

__init__(lag_time=1, n_timescales=None, reversible_type='mle', ergodic_cutoff='on', prior_counts=0, sliding_window=True, verbose=True)¶: Initialize self. See help(type(self)) for accurate signature.

Methods

`__init__`([lag_time, n_timescales, …])	Initialize self.
`draw_samples`(sequences, n_samples[, …])	Sample conformations for a sequences of states.
`eigtransform`(sequences[, right, mode])	Transform a list of sequences by projecting the sequences onto the first n_timescales dynamical eigenvectors.
`fit`(sequences[, y])	Estimate model parameters.
`fit_transform`(X[, y])	Fit to data, then transform it.
`get_params`([deep])	Get parameters for this estimator.
`inverse_transform`(sequences)	Transform a list of sequences from internal indexing into labels
`partial_transform`(sequence[, mode])	Transform a sequence to internal indexing
`sample`([state, n_steps, random_state])
`sample_discrete`([state, n_steps, random_state])	Generate a random sequence of states by propagating the model using discrete time steps given by the model lagtime.
`score`(sequences[, y])	Score the model on new data using the generalized matrix Rayleigh quotient
`score_ll`(sequences)	log of the likelihood of sequences with respect to the model
`set_params`(**params)	Set the parameters of this estimator.
`summarize`()	Return some diagnostic summary statistics about this Markov model
`transform`(sequences[, mode])	Transform a list of sequences to internal indexing
`uncertainty_eigenvalues`()	Estimate of the element-wise asymptotic standard deviation in the model eigenvalues.
`uncertainty_timescales`()	Estimate of the element-wise asymptotic standard deviation in the model implied timescales.

Attributes

`eigenvalues_`	Eigenvalues of the transition matrix.
`left_eigenvectors_`	Left eigenvectors, \(\Phi\), of the transition matrix.
`right_eigenvectors_`	Right eigenvectors, \(\Psi\), of the transition matrix.
`score_`	Training score of the model, computed as the generalized matrix, Rayleigh quotient, the sum of the first n_components eigenvalues
`state_labels_`
`timescales_`	Implied relaxation timescales of the model.

draw_samples(sequences, n_samples, random_state=None)¶

Sample conformations for a sequences of states.

Parameters:	sequences : list or list of lists A sequence or list of sequences, in which each element corresponds to a state label. n_samples : int How many samples to return for any given state.
Returns:	selected_pairs_by_state : np.array, dtype=int, shape=(n_states, n_samples, 2) selected_pairs_by_state[state] gives an array of randomly selected (trj, frame) pairs from the specified state.

See also

utils.map_drawn_samples: Extract conformations from MD trajectories by

index.

eigenvalues_¶: Eigenvalues of the transition matrix.

eigtransform(sequences, right=True, mode='clip')¶

Transform a list of sequences by projecting the sequences onto the first n_timescales dynamical eigenvectors.

Parameters:

sequences : list of array-like

List of sequences, or a single sequence. Each sequence should be a 1D iterable of state labels. Labels can be integers, strings, or other orderable objects.

right : bool

Which eigenvectors to map onto. Both the left (\(\Phi\)) and the right (:math`Psi`) eigenvectors of the transition matrix are commonly used, and differ in their normalization. The two sets of eigenvectors are related by the stationary distribution

\Phi_i(x) = \Psi_i(x) * \mu(x)

In the MSM literature, the right vectors (default here) are approximations to the transfer operator eigenfunctions, whereas the left eigenfunction are approximations to the propagator eigenfunctions. For more details, refer to reference [1].

mode : {‘clip’, ‘fill’}

Method by which to treat labels in sequences which do not have a corresponding index. This can be due, for example, to the ergodic trimming step.

clip

Unmapped labels are removed during transform. If they occur at the beginning or end of a sequence, the resulting transformed sequence will be shorted. If they occur in the middle of a sequence, that sequence will be broken into two (or more) sequences. (Default)

fill: Unmapped labels will be replaced with NaN, to signal missing data. [The use of NaN to signal missing data is not fantastic, but it’s consistent with current behavior of the pandas library.]

Returns:

transformed : list of 2d arrays: Each element of transformed is an array of shape (n_samples, n_timescales) containing the transformed data.

References

[1]	Prinz, Jan-Hendrik, et al. “Markov models of molecular kinetics: Generation and validation.” J. Chem. Phys. 134.17 (2011): 174105.

fit(sequences, y=None)¶

Estimate model parameters.

Parameters:	sequences : list of array-like List of sequences, or a single sequence. Each sequence should be a 1D iterable of state labels. Labels can be integers, strings, or other orderable objects.
Returns:	self

Notes

None and NaN are recognized immediately as invalid labels. Therefore, transition counts from or to a sequence item which is NaN or None will not be counted. The mapping_ attribute will not include the NaN or None.

fit_transform(X, y=None, **fit_params)¶

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:	X : numpy array of shape [n_samples, n_features] Training set. y : numpy array of shape [n_samples] Target values.
Returns:	X_new : numpy array of shape [n_samples, n_features_new] Transformed array.

get_params(deep=True)¶

Get parameters for this estimator.

Parameters:	deep : boolean, optional If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:	params : mapping of string to any Parameter names mapped to their values.

inverse_transform(sequences)¶

Transform a list of sequences from internal indexing into labels

Parameters:	sequences : list List of sequences, each of which is one-dimensional array of integers in `0, ..., n_states_ - 1`.
Returns:	sequences : list List of sequences, each of which is one-dimensional array of labels.

left_eigenvectors_¶

Left eigenvectors, \(\Phi\), of the transition matrix.

The left eigenvectors are normalized such that:

lv[:, 0] is the equilibrium populations and is normalized such that sum(lv[:, 0]) == 1`

The eigenvectors satisfy sum(lv[:, i] * lv[:, i] / model.populations_) == 1. In math notation, this is \(<\phi_i, \phi_i>_{\mu^{-1}} = 1\)

Returns:	lv : array-like, shape=(n_states, n_timescales+1) The columns of lv, `lv[:, i]`, are the left eigenvectors of `transmat_`.

partial_transform(sequence, mode='clip')¶

Transform a sequence to internal indexing

Recall that sequence can be arbitrary labels, whereas transmat_ and countsmat_ are indexed with integers between 0 and n_states - 1. This methods maps a set of sequences from the labels onto this internal indexing.

Parameters:

sequence : array-like

A 1D iterable of state labels. Labels can be integers, strings, or other orderable objects.

mode : {‘clip’, ‘fill’}

Method by which to treat labels in sequence which do not have a corresponding index. This can be due, for example, to the ergodic trimming step.

clip: Unmapped labels are removed during transform. If they occur at the beginning or end of a sequence, the resulting transformed sequence will be shorted. If they occur in the middle of a sequence, that sequence will be broken into two (or more) sequences. (Default)
fill: Unmapped labels will be replaced with NaN, to signal missing data. [The use of NaN to signal missing data is not fantastic, but it’s consistent with current behavior of the pandas library.]

Returns:

mapped_sequence : list or ndarray: If mode is “fill”, return an ndarray in internal indexing. If mode is “clip”, return a list of ndarrays each in internal indexing.

right_eigenvectors_¶

Right eigenvectors, \(\Psi\), of the transition matrix.

The right eigenvectors are normalized such that:

Weighted by the stationary distribution, the right eigenvectors are normalized to 1. That is,

sum(rv[:, i] * rv[:, i] * self.populations_) == 1,

or \(<\psi_i, \psi_i>_{\mu} = 1\)

Returns:	rv : array-like, shape=(n_states, n_timescales+1) The columns of lv, `rv[:, i]`, are the right eigenvectors of `transmat_`.

sample_discrete(state=None, n_steps=100, random_state=None)¶

Generate a random sequence of states by propagating the model using discrete time steps given by the model lagtime.

Parameters:

state : {None, ndarray, label}

Specify the starting state for the chain.

None: Choose the initial state by randomly drawing from the model’s stationary distribution.
array-like: If state is a 1D array with length equal to n_states_, then it is is interpreted as an initial multinomial distribution from which to draw the chain’s initial state. Note that the indexing semantics of this array must match the _internal_ indexing of this model.
otherwise: Otherwise, state is interpreted as a particular deterministic state label from which to begin the trajectory.

n_steps : int

Lengths of the resulting trajectory

random_state : int or RandomState instance or None (default)

Pseudo Random Number generator seed control. If None, use the numpy.random singleton.

Returns:

sequence : array of length n_steps: A randomly sampled label sequence

score(sequences, y=None)¶

Score the model on new data using the generalized matrix Rayleigh quotient

Parameters:	sequences : list of array-like List of sequences, or a single sequence. Each sequence should be a 1D iterable of state labels. Labels can be integers, strings, or other orderable objects.
Returns:	gmrq : float Generalized matrix Rayleigh quotient. This number indicates how well the top `n_timescales+1` eigenvectors of this MSM perform as slowly decorrelating collective variables for the new data in `sequences`.

References

[1]	McGibbon, R. T. and V. S. Pande, “Variational cross-validation of slow dynamical modes in molecular kinetics” J. Chem. Phys. 142, 124105 (2015)

score_¶: Training score of the model, computed as the generalized matrix, Rayleigh quotient, the sum of the first n_components eigenvalues

score_ll(sequences)¶

log of the likelihood of sequences with respect to the model

Parameters:	sequences : list of array-like List of sequences, or a single sequence. Each sequence should be a 1D iterable of state labels. Labels can be integers, strings, or other orderable objects.
Returns:	loglikelihood : float The natural log of the likelihood, computed as \(\sum_{ij} C_{ij} \log(P_{ij})\) where C is a matrix of counts computed from the input sequences.

set_params(**params)¶

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:	self

summarize()¶: Return some diagnostic summary statistics about this Markov model

timescales_¶

Implied relaxation timescales of the model.

The relaxation of any initial distribution towards equilibrium is given, according to this model, by a sum of terms – each corresponding to the relaxation along a specific direction (eigenvector) in state space – which decay exponentially in time. See equation 19. from [1].

Returns:	timescales : array-like, shape = (n_timescales,) The longest implied relaxation timescales of the model, expressed in units of time-step between indices in the source data supplied to `fit()`.

References

[1]	Prinz, Jan-Hendrik, et al. “Markov models of molecular kinetics:

Generation and validation.” J. Chem. Phys. 134.17 (2011): 174105.

transform(sequences, mode='clip')¶

Transform a list of sequences to internal indexing

Recall that sequences can be arbitrary labels, whereas transmat_ and countsmat_ are indexed with integers between 0 and n_states - 1. This methods maps a set of sequences from the labels onto this internal indexing.

Parameters:

sequences : list of array-like

List of sequences, or a single sequence. Each sequence should be a 1D iterable of state labels. Labels can be integers, strings, or other orderable objects.

mode : {‘clip’, ‘fill’}

Method by which to treat labels in sequences which do not have a corresponding index. This can be due, for example, to the ergodic trimming step.

clip: Unmapped labels are removed during transform. If they occur at the beginning or end of a sequence, the resulting transformed sequence will be shorted. If they occur in the middle of a sequence, that sequence will be broken into two (or more) sequences. (Default)
fill: Unmapped labels will be replaced with NaN, to signal missing data. [The use of NaN to signal missing data is not fantastic, but it’s consistent with current behavior of the pandas library.]

Returns:

mapped_sequences : list: List of sequences in internal indexing

uncertainty_eigenvalues()¶

Estimate of the element-wise asymptotic standard deviation in the model eigenvalues.

Returns:	sigma_eigs : np.array, shape=(n_timescales+1,) The estimated symptotic standard deviation in the eigenvalues.

References

[1]	Hinrichs, Nina Singhal, and Vijay S. Pande. “Calculation of the distribution of eigenvalues and eigenvectors in Markovian state models for molecular dynamics.” J. Chem. Phys. 126.24 (2007): 244101.

uncertainty_timescales()¶

Estimate of the element-wise asymptotic standard deviation in the model implied timescales.

Returns:	sigma_timescales : np.array, shape=(n_timescales,) The estimated symptotic standard deviation in the implied timescales.

References

[1]	Hinrichs, Nina Singhal, and Vijay S. Pande. “Calculation of the distribution of eigenvalues and eigenvectors in Markovian state models for molecular dynamics.” J. Chem. Phys. 126.24 (2007): 244101.