msmbuilder.msm.BayesianMarkovStateModel

class msmbuilder.msm.BayesianMarkovStateModel(lag_time=1, n_samples=100, n_steps=0, n_chains=None, n_timescales=None, reversible=True, ergodic_cutoff=1, prior_counts=0, sliding_window=True, random_state=None, sampler='metzner', verbose=False)

Bayesian reversible Markov state model.

Variant of MarkovStateModel which estimates a distribution over transition matrices instead of a single transition matrix using Metropolis Markov chain Monte Carlo. This distribution gives information about the statistical uncertainty in the transition matrix (and functions of the transition matrix), and is stored in all_transmats_

Parameters:

lag_time : int

The lag time of the model

n_samples : int, default=100

Total number of transition matrices to sample from the posterior

n_steps : int, default=n_states

Number of MCMC steps to take between sampled transition matrices. By default, we use n_steps=n_states_**2.

n_chains : int, default=n_procs

Number of independent Markov chains to simulate. The requested number of transition matrix samples will be generated from n_chains independent MCMC chains.

n_timescales : int, optional

The number of dynamical timescales to calculate when diagonalizing the transition matrix.

reversible : bool, default=True

Enforce reversibility during transition matrix sampling

ergodic_cutoff : int, default=1

Only the maximal strongly ergodic subgraph of the data is used to build an MSM. Ergodicity is determined by ensuring that each state is accessible from each other state via one or more paths involving edges with a number of observed directed counts greater than or equal to ergodic_cutoff. Not that by setting ergodic_cutoff to 0, this trimming is effectively turned off.

prior_counts : float, optional

Add a number of “pseudo counts” to each entry in the counts matrix. When prior_counts == 0 (default), the assigned transition probability between two states with no observed transitions will be zero, whereas when prior_counts > 0, even this unobserved transitions will be given nonzero probability.

sliding_window : bool, optional

Count transitions using a window of length lag_time, which is slid along the sequences 1 unit at a time, yielding transitions which contain more data but cannot be assumed to be statistically independent. Otherwise, the sequences are simply subsampled at an interval of lag_time.

random_state : int or RandomState instance or None (default)

Pseudo Random Number generator seed control. If None, use the numpy.random singleton.

sampler : {‘metzner’, ‘metzner_py’}

The sampler implementation to use. ‘metzer’ is the sampler from Ref. [1] implemented in C, ‘metzner_py’ is a pure-python reference implementation.

verbose : bool

Enable verbose printout

Notes

Markov chain Monte Carlo can be computationally expensive. To get good (converged) results and acceptable performance, you’ll likely need to play around with the n_samples, n_steps and n_chains parameters. n_samples gives the total number of transition matrices sampled from the posterior. These samples are generated from n_chains different independent MCMC chains, at an interval of n_steps. The total number of iterations of MCMC performed during fit() is n_samples * n_steps. Increasing n_chains therefore does not alter the total number of iterations – instead it controls whether those iterations occur as part of one long chain or multiple shorter chains (which are run in parallel for sampler=='metzner').

References

[R27]P. Metzner, F. Noe and C. Schutte, “Estimating the sampling error: Distribution of transition matrices and functions of transition matrices for given trajectory data.” Phys. Rev. E 80 021106 (2009)

Attributes

n_states_ (int) The number of states in the model
mapping_ (dict) Mapping between “input” labels and internal state indices used by the counts and transition matrix for this Markov state model. Input states need not necessarily be integers in (0, ..., n_states_ - 1), for example. The semantics of mapping_[i] = j is that state i from the “input space” is represented by the index j in this MSM.
countsmat_ (array_like, shape = (n_states_, n_states_)) Number of transition counts between states. countsmat_[i, j] is counted during fit(). The indices i and j are the “internal” indices described above. No correction for reversibility is made to this matrix.
transmats_ (array_like, shape = (n_samples, n_states_, n_states_)) Samples from the posterior ensemble of transition matrices.

Methods

fit(sequences[, y])
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
inverse_transform(sequences) Transform a list of sequences from internal indexing into
set_params(**params) Set the parameters of this estimator.
summarize()
transform(sequences[, mode]) Transform a list of sequences to internal indexing
all_timescales_

Implied relaxation timescales each sample in the ensemble

Returns:

timescales : array-like, shape = (n_samples, n_timescales,)

The longest implied relaxation timescales of the each sample in the ensemble of transition matrices, expressed in units of time-step between indices in the source data supplied to fit().

References

[R28]Prinz, Jan-Hendrik, et al. “Markov models of molecular kinetics:

Generation and validation.” J. Chem. Phys. 134.17 (2011): 174105.

all_eigenvalues_

Eigenvalues of the transition matrices.

Returns:

eigs : array-like, shape = (n_samples, n_timescales+1)

The eigenvalues of each transition matrix in the ensemble

all_left_eigenvectors_

Left eigenvectors, \(\Phi\), of each transition matrix in the ensemble

Each transition matrix’s left eigenvectors are normalized such that:

  • lv[:, 0] is the equilibrium populations and is normalized such that sum(lv[:, 0]) == 1`
  • The eigenvectors satisfy sum(lv[:, i] * lv[:, i] / model.populations_) == 1. In math notation, this is \(<\phi_i, \phi_i>_{\mu^{-1}} = 1\)
Returns:

lv : array-like, shape=(n_samples, n_states, n_timescales+1)

The columns of lv, lv[:, i], are the left eigenvectors of transmat_.

all_right_eigenvectors_

Right eigenvectors, \(\Psi\), of each transition matrix in the ensemble

Each transition matrix’s left eigenvectors are normalized such that:

  • Weighted by the stationary distribution, the right eigenvectors are normalized to 1. That is,

    sum(rv[:, i] * rv[:, i] * self.populations_) == 1,

    or \(<\psi_i, \psi_i>_{\mu} = 1\)

Returns:

rv : array-like, shape=(n_samples, n_states, n_timescales+1)

The columns of lv, rv[:, i], are the right eigenvectors of transmat_.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X : numpy array of shape [n_samples, n_features]

Training set.

y : numpy array of shape [n_samples]

Target values.

Returns:

X_new : numpy array of shape [n_samples, n_features_new]

Transformed array.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep: boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : mapping of string to any

Parameter names mapped to their values.

inverse_transform(sequences)

Transform a list of sequences from internal indexing into labels

Parameters:

sequences : list

List of sequences, each of which is one-dimensional array of integers in 0, ..., n_states_ - 1.

Returns:

sequences : list

List of sequences, each of which is one-dimensional array of labels.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The former have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns:self
transform(sequences, mode='clip')

Transform a list of sequences to internal indexing

Recall that sequences can be arbitrary labels, whereas transmat_ and countsmat_ are indexed with integers between 0 and n_states - 1. This methods maps a set of sequences from the labels onto this internal indexing.

Parameters:

sequences : list of array-like

List of sequences, or a single sequence. Each sequence should be a 1D iterable of state labels. Labels can be integers, strings, or other orderable objects.

mode : {‘clip’, ‘fill’}

Method by which to treat labels in sequences which do not have a corresponding index. This can be due, for example, to the ergodic trimming step.

clip

Unmapped labels are removed during transform. If they occur at the beginning or end of a sequence, the resulting transformed sequence will be shorted. If they occur in the middle of a sequence, that sequence will be broken into two (or more) sequences. (Default)

fill

Unmapped labels will be replaced with NaN, to signal missing data. [The use of NaN to signal missing data is not fantastic, but it’s consistent with current behavior of the pandas library.]

Returns:

mapped_sequences : list

List of sequences in internal indexing

Versions