Markov state models (MSMs) are a class of time-series models for the long-timescale dynamics of molecular systems. An MSM is essentially a kinetic map of the conformational space a molecule explores. The model consists of (1) a set of conformational states and (2) a matrix of transition probabilities (or, equivalently, transition rates) between each pair of states.
In MSMBuilder, you can use MarkovStateModel to build MSMs from “labeled” trajectories – that is, sequences of integers corresponding to the index of the conformational state occupied by the system at each time point on a trajectory. The Geometric Clustering module provides a number of different methods for clustering the trajectories that you can use to define the states.
MarkovStateModel([lag_time, n_timescales, ...]) | Reversible Markov state model. |
BayesianMarkovStateModel([lag_time, ...]) | Bayesian reversible Markov state model. |
There are two steps in constructing an MSM:
1. Count the number of observed transitions between states. That is, construct \(\mathbf{C}\) such that \(C_{ij}\) is the number of observed transitions from state \(i\) at time \(t\) to state \(j\) at time \(t+\tau\), summed over all times \(t\).
2. Estimate the transition probability matrix, \(\mathbf{T}\), where
\[T_{ij} = P(s_{t+\tau} = j \mid s_t = i).\]
Here \(S = (s_t)\) is a trajectory in state-index space of length \(N\), and \(s_t \in \{1, \ldots, k\}\) is the state index of the trajectory at time \(t\).
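The two steps can be sketched in plain Python. This is a minimal illustration, not MSMBuilder's implementation: the helper names `count_transitions` and `row_normalize` are hypothetical, states are indexed from 0 rather than 1, a sliding window over every frame is assumed, and simple row normalization is the maximum likelihood estimate only when no reversibility constraint is imposed, so a reversible estimator will generally give a different matrix.

```python
def count_transitions(sequences, n_states, lag_time=1):
    """Return C where C[i][j] counts transitions i -> j separated by lag_time frames."""
    counts = [[0] * n_states for _ in range(n_states)]
    for seq in sequences:
        # Sliding window: every frame t for which t + lag_time is still inside the trajectory.
        for t in range(len(seq) - lag_time):
            counts[seq[t]][seq[t + lag_time]] += 1
    return counts


def row_normalize(counts):
    """Divide each row of C by its sum, so that T[i][j] = C[i][j] / sum_j C[i][j]."""
    transmat = []
    for row in counts:
        total = sum(row)
        # A state with no outgoing counts has an undefined row; it is left as zeros here.
        transmat.append([c / total if total else 0.0 for c in row])
    return transmat


# Two short "labeled" trajectories over three states (made-up data).
sequences = [[0, 0, 1, 1, 0, 2, 2, 2, 1, 0], [2, 2, 1, 0, 0, 0]]
C = count_transitions(sequences, n_states=3, lag_time=1)
T = row_normalize(C)
```

Counting each trajectory separately, as above, avoids recording a spurious transition from the last frame of one trajectory to the first frame of the next.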
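The probability that a given transition probability matrix would generate some observed trajectory (the likelihood) is
\[\mathcal{L}(\mathbf{T}) = P(S \mid \mathbf{T}) = \prod_{t=1}^{N-\tau} T_{s_t, s_{t+\tau}} = \prod_{i,j} T_{ij}^{C_{ij}}.\]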
Assuming a prior distribution on \(\mathbf{T}\) of the form \(P(\mathbf{T})=\prod_{ij} T_{ij}^{B_{ij}}\), we then have a posterior distribution
\[P(\mathbf{T} \mid S) \propto \prod_{i,j} T_{ij}^{B_{ij} + C_{ij}}.\]
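Because the likelihood of a trajectory is a product of one factor \(T_{ij}\) per observed transition, its logarithm is \(\sum_{ij} C_{ij} \log T_{ij}\), and a prior of the form above simply adds \(B_{ij}\) to each exponent. The sketch below evaluates both quantities for a two-state example. The function names and the count matrix are hypothetical, and the comparison ignores any reversibility constraint.

```python
import math


def log_likelihood(transmat, counts):
    """log L(T) = sum_ij C_ij * log(T_ij); entries with C_ij <= 0 contribute nothing."""
    total = 0.0
    for t_row, c_row in zip(transmat, counts):
        for t_ij, c_ij in zip(t_row, c_row):
            if c_ij > 0:
                if t_ij <= 0.0:
                    return -math.inf  # an observed transition was assigned zero probability
                total += c_ij * math.log(t_ij)
    return total


def log_posterior(transmat, counts, prior_exponents):
    """Unnormalized log posterior: sum_ij (B_ij + C_ij) * log(T_ij)."""
    combined = [[b + c for b, c in zip(b_row, c_row)]
                for b_row, c_row in zip(prior_exponents, counts)]
    return log_likelihood(transmat, combined)


counts = [[90, 10], [20, 80]]          # made-up transition counts
t_mle = [[0.9, 0.1], [0.2, 0.8]]       # row-normalized counts
t_other = [[0.8, 0.2], [0.3, 0.7]]     # some other row-stochastic matrix
ll_mle = log_likelihood(t_mle, counts)
ll_other = log_likelihood(t_other, counts)
flat_prior = [[0, 0], [0, 0]]          # B_ij = 0 leaves the likelihood unchanged
lp_mle = log_posterior(t_mle, counts, flat_prior)
```

Here `ll_mle` exceeds `ll_other`, as expected for the row-normalized counts when rows are unconstrained.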
MSMBuilder implements two MSM estimators.
- MarkovStateModel performs maximum likelihood estimation. It estimates a single transition matrix, \(\mathbf{T}\), to maximize \(\mathcal{L}(\mathbf{T})\).
- BayesianMarkovStateModel uses Metropolis Markov chain Monte Carlo to (approximately) draw a sample of transition matrices from the posterior distribution \(P(\mathbf{T} | S)\). This sampler is described in Metzner et al. [5]. The sample can be used to estimate the sampling uncertainty in functions of the transition matrix (e.g. relaxation timescales).
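To make the idea of sampling uncertainty concrete, the sketch below draws transition matrices from a posterior and looks at the spread of the slowest relaxation timescale. It deliberately swaps in a simpler technique than the Metropolis sampler of Metzner et al.: it assumes no reversibility constraint, in which case each row of the posterior is an independent Dirichlet distribution with parameters \(C_{ij} + B_{ij} + 1\) and can be sampled directly. All names and the count matrix are hypothetical, and the timescale formula used is specific to two-state models.

```python
import math
import random
import statistics


def sample_row(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_transmat(counts, prior_exponent, rng):
    """Draw one row-stochastic matrix, treating the rows as independent Dirichlet posteriors."""
    return [sample_row([c + prior_exponent + 1.0 for c in row], rng) for row in counts]


def slow_timescale(transmat, lag_time):
    """Relaxation timescale -lag_time / ln(lambda_2) of a 2-state transition matrix."""
    lam2 = transmat[0][0] + transmat[1][1] - 1.0  # eigenvalues of a 2x2 stochastic matrix: 1 and trace - 1
    return -lag_time / math.log(lam2) if 0.0 < lam2 < 1.0 else float("nan")


rng = random.Random(0)
counts = [[900, 100], [200, 800]]      # made-up transition counts
samples = [sample_transmat(counts, 0.0, rng) for _ in range(2000)]
timescales = [slow_timescale(T, lag_time=1) for T in samples]
timescales = [ts for ts in timescales if not math.isnan(ts)]
mean_ts = statistics.mean(timescales)
std_ts = statistics.stdev(timescales)  # spread attributable to finite sampling only
```

The standard deviation `std_ts` reflects finite sampling alone; it says nothing about errors introduced by the choice of states.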
Note
The uncertainty in the transition matrix (and in functions of the transition matrix) that can be estimated from BayesianMarkovStateModel does not account for all sources of error. In particular, the discretization induced by clustering produces a negative bias on the eigenvalues of the transition matrix: even in the limit of infinite sampling, they underestimate the eigenvalues of the propagator / transfer operator. [6] See section 3D (Quantifying the discretization error) of Prinz et al. for more discussion of the discretization error. [1]
The most important tradeoff with MSMs is a bias-variance dilemma on the number of states. We know analytically that the expected value of the relaxation timescales is below the true value when using a finite number of states, and that the magnitude of this bias decreases as the number of states goes up. On the other hand, the statistical error in the MSM (variance) goes up as the number of states increases with a fixed data set, because there are fewer transitions (data) per element of the transition probability matrix.
There are no existing algorithms in the MSM literature which fully balance these competing sources of error in an automatic and practical way, although some partially satisfactory algorithms are available. [3] [4]
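The variance side of this tradeoff can be illustrated with simple arithmetic. The sketch below assumes a fixed, hypothetical number of observed transitions spread uniformly over the states, and approximates the error on a single matrix entry by the binomial standard error; real data are rarely uniform, so the numbers are only indicative. It does not model the discretization bias.

```python
import math


def binomial_standard_error(p, n_obs):
    """Standard error of an estimated probability p from n_obs independent observations."""
    return math.sqrt(p * (1.0 - p) / n_obs)


def variance_summary(n_transitions, n_states, p=0.1):
    """Average count per element of C, and the error on an entry whose true value is p."""
    per_row = n_transitions / n_states            # transitions leaving each state, if visits are uniform
    per_element = n_transitions / n_states ** 2   # average count per entry of the count matrix
    return per_element, binomial_standard_error(p, per_row)


# Same amount of data, increasingly fine state decompositions.
summary = {k: variance_summary(100_000, k) for k in (10, 100, 1000)}
```

With the data held fixed, the average count per element falls as the square of the number of states, and the per-entry standard error grows accordingly.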
A second key parameter is the lag time of the model. The lag time controls a tradeoff between accuracy and descriptive power. [TODO: WRITE MORE]
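One standard way to look at the effect of the lag time is through the relaxation timescales, which are related to the eigenvalues \(\lambda_i\) of a transition matrix estimated at lag time \(\tau\) by \(t_i = -\tau / \ln \lambda_i\). The sketch below is an illustration under stated assumptions rather than part of the text above: it uses a hypothetical two-state chain that is exactly Markovian at lag 1, so the transition matrix at lag \(\tau\) is the \(\tau\)-th matrix power and the timescale comes out the same at every lag. For models built from clustered simulation data this is generally not the case at short lag times.

```python
import math


def matmul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def matrix_power(transmat, power):
    """T^power by repeated multiplication (power >= 1)."""
    result = transmat
    for _ in range(power - 1):
        result = matmul(result, transmat)
    return result


def implied_timescale(transmat_at_lag, lag_time):
    """-lag_time / ln(lambda_2) for a 2-state matrix, where lambda_2 = trace - 1."""
    lam2 = transmat_at_lag[0][0] + transmat_at_lag[1][1] - 1.0
    return -lag_time / math.log(lam2)


T1 = [[0.9, 0.1], [0.2, 0.8]]          # hypothetical transition matrix at lag 1
timescales = {lag: implied_timescale(matrix_power(T1, lag), lag) for lag in (1, 2, 5, 10)}
```

Every value in `timescales` agrees up to floating-point error, which is the behaviour expected when the dynamics are Markovian at the shortest lag considered.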
[1] | Prinz, J.-H., et al. Markov models of molecular kinetics: Generation and validation. J. Chem. Phys. 134.17 (2011): 174105. |
[2] | Pande, V. S., K. A. Beauchamp, and G. R. Bowman. Everything you wanted to know about Markov State Models but were afraid to ask. Methods 52.1 (2010): 99-105. |
[3] | McGibbon, R. T., C. R. Schwantes, and V. S. Pande. Statistical Model Selection for Markov Models of Biomolecular Dynamics. J. Phys. Chem. B (2014). |
[4] | Kellogg, E. H., O. F. Lange, and D. Baker. Evaluation and optimization of discrete state models of protein folding. J. Phys. Chem. B 116.37 (2012): 11405-11413. |
[5] | Metzner, P., F. Noe, and C. Schutte. Estimating the sampling error: Distribution of transition matrices and functions of transition matrices for given trajectory data. Phys. Rev. E 80.2 (2009): 021106. |
[6] | Nuske, F., et al. Variational approach to molecular kinetics. J. Chem. Theory Comput. 10.4 (2014): 1739-1752. |