msmbuilder.msm.BayesianMarkovStateModel
class msmbuilder.msm.BayesianMarkovStateModel(lag_time=1, n_samples=100, n_steps=0, n_chains=None, n_timescales=None, reversible=True, ergodic_cutoff='on', prior_counts=0, sliding_window=True, random_state=None, sampler='metzner', verbose=False)
Bayesian reversible Markov state model.
Variant of MarkovStateModel which estimates a distribution over transition matrices, instead of a single transition matrix, using Metropolis Markov chain Monte Carlo. This distribution gives information about the statistical uncertainty in the transition matrix (and in functions of the transition matrix), and is stored in all_transmats_.
Parameters:
- lag_time : int
The lag time of the model.
- n_samples : int, default=100
Total number of transition matrices to sample from the posterior.
- n_steps : int, default=n_states
Number of MCMC steps to take between sampled transition matrices. By default, we use n_steps=n_states_**2.
- n_chains : int, default=n_procs
Number of independent Markov chains to simulate. The requested number of transition matrix samples will be generated from n_chains independent MCMC chains.
- n_timescales : int, optional
The number of dynamical timescales to calculate when diagonalizing the transition matrix.
- reversible : bool, default=True
Enforce reversibility during transition matrix sampling.
- ergodic_cutoff : int, default=1
Only the maximal strongly ergodic subgraph of the data is used to build an MSM. Ergodicity is determined by ensuring that each state is accessible from each other state via one or more paths involving edges with a number of observed directed counts greater than or equal to ergodic_cutoff. Note that by setting ergodic_cutoff to 0, this trimming is effectively turned off.
- prior_counts : float, optional
Add a number of “pseudo counts” to each entry in the counts matrix. When prior_counts == 0 (default), the assigned transition probability between two states with no observed transitions will be zero, whereas when prior_counts > 0, even these unobserved transitions will be given nonzero probability.
- sliding_window : bool, optional
Count transitions using a window of length lag_time, which is slid along the sequences 1 unit at a time, yielding transitions which contain more data but cannot be assumed to be statistically independent. Otherwise, the sequences are simply subsampled at an interval of lag_time.
- random_state : int or RandomState instance or None (default)
Pseudo-random number generator seed control. If None, use the numpy.random singleton.
- sampler : {‘metzner’, ‘metzner_py’}
The sampler implementation to use. ‘metzner’ is the sampler from Ref. [1] implemented in C; ‘metzner_py’ is a pure-Python reference implementation.
- verbose : bool
Enable verbose printout.
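The effect of sliding_window and prior_counts on the counts matrix can be illustrated with a minimal pure-Python sketch. This is not msmbuilder's implementation: the helper name count_transitions and the list-of-lists count matrix are assumptions made for illustration, and the input is assumed to already use internal indices 0, ..., n_states - 1.

```python
def count_transitions(sequence, n_states, lag_time=1, sliding_window=True,
                      prior_counts=0.0):
    """Count lagged transitions in one sequence of integer state indices.

    Hypothetical helper for illustration only, not part of msmbuilder.
    """
    # Every entry starts at the pseudo count, so prior_counts > 0 gives
    # unobserved transitions a nonzero count.
    counts = [[float(prior_counts)] * n_states for _ in range(n_states)]
    if sliding_window:
        # The window advances one frame at a time: pair frame t with t + lag.
        pairs = zip(sequence[:-lag_time], sequence[lag_time:])
    else:
        # Subsample at an interval of lag_time, then count adjacent pairs.
        sub = sequence[::lag_time]
        pairs = zip(sub[:-1], sub[1:])
    for i, j in pairs:
        counts[i][j] += 1.0
    return counts


seq = [0, 0, 1, 1, 0, 1, 1, 0, 0]
sliding = count_transitions(seq, n_states=2, lag_time=2, sliding_window=True)
strided = count_transitions(seq, n_states=2, lag_time=2, sliding_window=False)
```

With this nine-frame sequence and lag_time=2, the sliding window yields seven (correlated) transition counts, while subsampling yields only four.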
Notes
Markov chain Monte Carlo can be computationally expensive. To get good (converged) results and acceptable performance, you’ll likely need to tune the n_samples, n_steps and n_chains parameters. n_samples gives the total number of transition matrices sampled from the posterior. These samples are generated from n_chains different independent MCMC chains, at an interval of n_steps. The total number of iterations of MCMC performed during fit() is n_samples * n_steps. Increasing n_chains therefore does not alter the total number of iterations; instead, it controls whether those iterations occur as part of one long chain or multiple shorter chains (which are run in parallel for sampler=='metzner').
References
[1] P. Metzner, F. Noe and C. Schutte, “Estimating the sampling error: Distribution of transition matrices and functions of transition matrices for given trajectory data.” Phys. Rev. E 80, 021106 (2009).
Attributes:
- n_states_ : int
The number of states in the model.
- mapping_ : dict
Mapping between “input” labels and internal state indices used by the counts and transition matrix for this Markov state model. Input states need not necessarily be integers in (0, …, n_states_ - 1), for example. The semantics of mapping_[i] = j is that state i from the “input space” is represented by the index j in this MSM.
- countsmat_ : array_like, shape = (n_states_, n_states_)
Number of transition counts between states. countsmat_[i, j] is counted during fit(). The indices i and j are the “internal” indices described above. No correction for reversibility is made to this matrix.
- transmats_ : array_like, shape = (n_samples, n_states_, n_states_)
Samples from the posterior ensemble of transition matrices.
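Because the sampled transition matrices are stacked along the first axis, the uncertainty in any matrix entry can be summarized across samples. The sketch below uses three invented 2x2 row-stochastic matrices as stand-ins for posterior samples; the values and the helper name entry_mean_and_std are hypothetical and are not output from a fitted model.

```python
from statistics import mean, stdev

# Stand-in for the sampled transition matrices: one (n_states, n_states)
# matrix per posterior sample. These values are invented for illustration.
samples = [
    [[0.90, 0.10], [0.20, 0.80]],
    [[0.85, 0.15], [0.25, 0.75]],
    [[0.95, 0.05], [0.15, 0.85]],
]


def entry_mean_and_std(samples, i, j):
    """Mean and sample standard deviation of entry (i, j) across samples."""
    values = [t[i][j] for t in samples]
    return mean(values), stdev(values)


m01, s01 = entry_mean_and_std(samples, 0, 1)
```

The same pattern applies to any function of the transition matrix: evaluate it once per sample, then summarize the resulting values.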
Methods
fit(sequences[, y])
fit_transform(X[, y])                Fit to data, then transform it.
get_params([deep])                   Get parameters for this estimator.
inverse_transform(sequences)         Transform a list of sequences from internal indexing into labels.
partial_transform(sequence[, mode])  Transform a sequence to internal indexing.
set_params(**params)                 Set the parameters of this estimator.
summarize()                          Return some diagnostic summary statistics about this Markov model.
transform(sequences[, mode])         Transform a list of sequences to internal indexing.
__init__(lag_time=1, n_samples=100, n_steps=0, n_chains=None, n_timescales=None, reversible=True, ergodic_cutoff='on', prior_counts=0, sliding_window=True, random_state=None, sampler='metzner', verbose=False)
Initialize self. See help(type(self)) for accurate signature.
Attributes
all_eigenvalues_          Eigenvalues of the transition matrices.
all_left_eigenvectors_    Left eigenvectors, \(\Phi\), of each transition matrix in the ensemble.
all_populations_
all_right_eigenvectors_   Right eigenvectors, \(\Psi\), of each transition matrix in the ensemble.
all_timescales_           Implied relaxation timescales of each sample in the ensemble.
all_eigenvalues_
Eigenvalues of the transition matrices.
Returns:
- eigs : array-like, shape = (n_samples, n_timescales+1)
The eigenvalues of each transition matrix in the ensemble.
all_left_eigenvectors_
Left eigenvectors, \(\Phi\), of each transition matrix in the ensemble.
Each transition matrix’s left eigenvectors are normalized such that:
- lv[:, 0] is the equilibrium populations and is normalized such that sum(lv[:, 0]) == 1.
- The eigenvectors satisfy sum(lv[:, i] * lv[:, i] / model.populations_) == 1. In math notation, this is \(\langle\phi_i, \phi_i\rangle_{\mu^{-1}} = 1\).
Returns:
- lv : array-like, shape=(n_samples, n_states, n_timescales+1)
The columns of lv, lv[:, i], are the left eigenvectors of transmat_.
all_right_eigenvectors_
Right eigenvectors, \(\Psi\), of each transition matrix in the ensemble.
Each transition matrix’s right eigenvectors are normalized such that:
Weighted by the stationary distribution, the right eigenvectors are normalized to 1. That is, sum(rv[:, i] * rv[:, i] * self.populations_) == 1, or \(\langle\psi_i, \psi_i\rangle_{\mu} = 1\).
Returns:
- rv : array-like, shape=(n_samples, n_states, n_timescales+1)
The columns of rv, rv[:, i], are the right eigenvectors of transmat_.
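The left and right eigenvector normalization conventions can be checked by hand on a two-state transition matrix, where the eigenvectors are known in closed form. The sketch below is plain Python and independent of msmbuilder; the off-diagonal probabilities a and b are arbitrary illustrative values.

```python
from math import sqrt

# Two-state row-stochastic matrix T = [[1 - a, a], [b, 1 - b]].
a, b = 0.1, 0.3

# Stationary distribution: the first left eigenvector, normalized to sum to 1.
pi = [b / (a + b), a / (a + b)]

# The second eigenvalue of T is 1 - a - b.
lam = 1.0 - a - b

# Unnormalized second right eigenvector: T applied to [a, -b] gives lam * [a, -b].
psi = [a, -b]
norm = sqrt(sum(p * x * x for p, x in zip(pi, psi)))
psi = [x / norm for x in psi]  # now sum(psi**2 * pi) == 1

# A two-state chain is reversible, so the matching left eigenvector is
# phi = pi * psi, which satisfies sum(phi**2 / pi) == 1.
phi = [p * x for p, x in zip(pi, psi)]

right_norm = sum(p * x * x for p, x in zip(pi, psi))
left_norm = sum(x * x / p for p, x in zip(pi, phi))
```

Both right_norm and left_norm evaluate to 1 up to floating-point rounding, matching the two conventions stated for rv and lv.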
all_timescales_
Implied relaxation timescales of each sample in the ensemble.
Returns:
- timescales : array-like, shape = (n_samples, n_timescales)
The longest implied relaxation timescales of each sample in the ensemble of transition matrices, expressed in units of the time step between indices in the source data supplied to fit().
References
[1] Prinz, Jan-Hendrik, et al. “Markov models of molecular kinetics: Generation and validation.” J. Chem. Phys. 134.17 (2011): 174105.
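Implied timescales are conventionally obtained from the eigenvalues of a transition matrix through t_i = -lag_time / ln(lambda_i), skipping the stationary eigenvalue lambda_0 = 1. The sketch below applies that standard relation to each row of an eigenvalue array; the helper name implied_timescales and the eigenvalues are invented for illustration, and the code assumes real eigenvalues in (0, 1) sorted in descending order.

```python
from math import log


def implied_timescales(eigenvalues, lag_time=1):
    """Convert the eigenvalues of one transition matrix to implied timescales.

    Hypothetical helper: assumes eigenvalues[0] == 1 is the stationary
    eigenvalue (skipped) and the remaining ones lie strictly in (0, 1).
    """
    return [-lag_time / log(lam) for lam in eigenvalues[1:]]


# One row per posterior sample, n_timescales + 1 eigenvalues each
# (invented values standing in for all_eigenvalues_).
all_eigs = [[1.0, 0.9, 0.5], [1.0, 0.8, 0.4]]
all_ts = [implied_timescales(row, lag_time=2) for row in all_eigs]
```

Each row of all_ts then has n_timescales entries, in the same time units as lag_time, with the slowest process first.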
fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters:
- X : numpy array of shape [n_samples, n_features]
Training set.
- y : numpy array of shape [n_samples]
Target values.
Returns:
- X_new : numpy array of shape [n_samples, n_features_new]
Transformed array.
get_params(deep=True)
Get parameters for this estimator.
Parameters:
- deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns:
- params : mapping of string to any
Parameter names mapped to their values.
inverse_transform(sequences)
Transform a list of sequences from internal indexing into labels.
Parameters:
- sequences : list
List of sequences, each of which is a one-dimensional array of integers in 0, ..., n_states_ - 1.
Returns:
- sequences : list
List of sequences, each of which is a one-dimensional array of labels.
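Conceptually, mapping internal indices back to labels amounts to inverting the label-to-index dictionary. The following is a minimal sketch of that idea, not msmbuilder's code; the function name indices_to_labels, the plain-list return type, and the example labels are all hypothetical.

```python
def indices_to_labels(sequences, mapping):
    """Map sequences of internal indices back to input labels.

    Sketch only: `mapping` plays the role of mapping_ (label -> index).
    """
    # Invert label -> index into index -> label.
    inverse = {index: label for label, index in mapping.items()}
    return [[inverse[i] for i in seq] for seq in sequences]


mapping = {"folded": 0, "unfolded": 1}  # hypothetical labels
labels = indices_to_labels([[0, 0, 1], [1, 0]], mapping)
```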
partial_transform(sequence, mode='clip')
Transform a sequence to internal indexing.
Recall that sequence can be arbitrary labels, whereas transmat_ and countsmat_ are indexed with integers between 0 and n_states - 1. This method maps a single sequence of labels onto this internal indexing.
Parameters:
- sequence : array-like
A 1D iterable of state labels. Labels can be integers, strings, or other orderable objects.
- mode : {‘clip’, ‘fill’}
Method by which to treat labels in sequence which do not have a corresponding index. This can be due, for example, to the ergodic trimming step.
clip: Unmapped labels are removed during transform. If they occur at the beginning or end of a sequence, the resulting transformed sequence will be shortened. If they occur in the middle of a sequence, that sequence will be broken into two (or more) sequences. (Default)
fill: Unmapped labels will be replaced with NaN, to signal missing data. [The use of NaN to signal missing data is not fantastic, but it’s consistent with current behavior of the pandas library.]
Returns:
- mapped_sequence : list or ndarray
If mode is “fill”, return an ndarray in internal indexing. If mode is “clip”, return a list of ndarrays each in internal indexing.
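The difference between the two modes is easiest to see on a short sequence containing a label that has no internal index. The sketch below re-implements the described behavior in plain Python under simplifying assumptions: the function name partial_transform_sketch is hypothetical, and it returns plain lists rather than ndarrays.

```python
def partial_transform_sketch(sequence, mapping, mode="clip"):
    """Map one sequence of labels to internal indices.

    Hypothetical re-implementation for illustration; returns plain lists
    instead of ndarrays.
    """
    if mode == "fill":
        # Unmapped labels become NaN; the output keeps the input length.
        return [float(mapping[x]) if x in mapping else float("nan")
                for x in sequence]
    if mode == "clip":
        # Unmapped labels are dropped and split the sequence into pieces.
        pieces, current = [], []
        for x in sequence:
            if x in mapping:
                current.append(mapping[x])
            elif current:
                pieces.append(current)
                current = []
        if current:
            pieces.append(current)
        return pieces
    raise ValueError("mode must be 'clip' or 'fill'")


mapping = {"A": 0, "B": 1}  # label "C" has no index, e.g. it was trimmed
seq = ["C", "A", "B", "C", "B", "A", "C"]
clipped = partial_transform_sketch(seq, mapping, mode="clip")
filled = partial_transform_sketch(seq, mapping, mode="fill")
```

Here the unmapped "C" at both ends shortens the clipped result, and the "C" in the middle splits it into two pieces, whereas the filled result keeps all seven positions and marks three of them as NaN.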
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
Returns:
- self
summarize()
Return some diagnostic summary statistics about this Markov model.
transform(sequences, mode='clip')
Transform a list of sequences to internal indexing.
Recall that sequences can be arbitrary labels, whereas transmat_ and countsmat_ are indexed with integers between 0 and n_states - 1. This method maps a set of sequences from the labels onto this internal indexing.
Parameters:
- sequences : list of array-like
List of sequences, or a single sequence. Each sequence should be a 1D iterable of state labels. Labels can be integers, strings, or other orderable objects.
- mode : {‘clip’, ‘fill’}
Method by which to treat labels in sequences which do not have a corresponding index. This can be due, for example, to the ergodic trimming step.
clip: Unmapped labels are removed during transform. If they occur at the beginning or end of a sequence, the resulting transformed sequence will be shortened. If they occur in the middle of a sequence, that sequence will be broken into two (or more) sequences. (Default)
fill: Unmapped labels will be replaced with NaN, to signal missing data. [The use of NaN to signal missing data is not fantastic, but it’s consistent with current behavior of the pandas library.]
Returns:
- mapped_sequences : list
List of sequences in internal indexing.