How should I get started?

  1. If you have a lot of data, use a subset of it to get started. You will be able to iterate much more quickly and explore the impact of different modeling and parameter choices running on your laptop. Once you’ve got a sense for what works, start to scale your analysis up to a full MD dataset.
  2. Use the Anaconda scientific Python distribution and its conda package manager to install Python packages.
  3. Get involved on the GitHub issue tracker.

How do I report a bug?

Post a note on the GitHub issue tracker.

How do I contribute a new feature?

File a pull request on GitHub. If you’re not familiar with GitHub, there are some instructions on the scikit-learn site here.

Where should I start with the MSM literature?

Some of the key PIs involved in research on Markov modeling of biomolecular conformational dynamics include Hans Andersen, Robert Best, Greg Bowman, Amedeo Caflisch, John Chodera, Peter Deuflhard, Ken Dill, Gianni De Fabritiis, Helmut Grubmüller, Xuhui Huang, Ronald Levy, Frank Noé, Vijay Pande, Jed Pitera, Benoit Roux, Christof Schütte, Bill Swope, Eric Vanden-Eijnden, and Marcus Weber.

In 2014, Greg Bowman, Vijay Pande, and Frank Noé edited the book An Introduction to Markov State Models and Their Application to Long Timescale Molecular Simulation, which features contributions from many of the authors above.

Two outstanding reviews (the first is somewhat old, but still very much worth reading) of the field are

Methodologically, some of our favorite recent papers (2013-) include: [1]

Some particularly notable recent applications of MSMs include

What is the relationship between MSMBuilder and other packages?

Another software package that performs similar analyses is EMMA. MSMBuilder inherits a lot of ideas about API design, machine learning, and software engineering from scikit-learn. MSMBuilder also has a number of Python dependencies. See the Installation page for details.

How much MD sampling do I need to build an MSM?

There’s no definitive way to answer this question – in general, reasoning about the convergence of any stochastic sampling is very tricky. We can’t really be certain that there isn’t another free energy minimum that our simulations didn’t find.

An MSM (or tICA, HMM, etc) can help you answer this question. Using the MSM, compare the slowest relaxation timescales in your model with the total amount of aggregate sampling you have. If your system takes hundreds of microseconds to relax to equilibrium according to your model, you probably want at minimum hundreds of microseconds of sampling.
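As a rough illustration of that comparison, the sketch below converts a transition-matrix eigenvalue into a relaxation timescale using the standard relation t = -lag / ln(eigenvalue), then divides the aggregate sampling by it. The function names and all the numbers are hypothetical and are not part of MSMBuilder.

```python
import math

def implied_timescale(eigenvalue, lag_time):
    """Convert a transition-matrix eigenvalue (strictly between 0 and 1)
    into a relaxation timescale, in the same units as lag_time."""
    if not 0.0 < eigenvalue < 1.0:
        raise ValueError("eigenvalue must lie strictly between 0 and 1")
    return -lag_time / math.log(eigenvalue)

def sampling_ratio(trajectory_lengths, slowest_timescale):
    """Aggregate sampling divided by the slowest relaxation timescale."""
    return sum(trajectory_lengths) / slowest_timescale

# Hypothetical numbers, all in nanoseconds.
lag_time_ns = 50.0
slowest_eigenvalue = 0.999               # second-largest eigenvalue of the MSM
trajectory_lengths_ns = [2000.0] * 100   # 100 trajectories of 2 microseconds each

t_slow = implied_timescale(slowest_eigenvalue, lag_time_ns)
ratio = sampling_ratio(trajectory_lengths_ns, t_slow)
print(f"slowest timescale: {t_slow:.0f} ns, aggregate / timescale: {ratio:.1f}")
```

A ratio near or below 1 means the total sampling is comparable to a single relaxation event, which suggests the model is probably not converged.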

Another thing you can do is to split your data set into a few (e.g. 2-10) chunks, and then repeat your analysis on subsets of the data. For example, break your data up into 5 chunks and then build 5 MSMs, each of which is fit using 4/5 of the data (with one chunk left out). If the 5 MSMs are all consistent with one another, you might have very good sampling. If they give totally different results from one another, you don’t have enough sampling.
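The leave-one-chunk-out split can be sketched in plain Python as follows. This assumes each chunk is a group of whole trajectories, assigned round-robin; the helper name and the trajectory labels are made up for illustration, and the model fitting itself is left out.

```python
def leave_one_chunk_out(trajectories, n_chunks=5):
    """Yield (left_out_index, training_subset) pairs, where each training
    subset contains every chunk except one."""
    if not 2 <= n_chunks <= len(trajectories):
        raise ValueError("n_chunks must be between 2 and the number of trajectories")
    # Round-robin assignment: trajectory i goes to chunk i % n_chunks.
    chunks = [trajectories[i::n_chunks] for i in range(n_chunks)]
    for left_out in range(n_chunks):
        training = [
            traj
            for index, chunk in enumerate(chunks)
            if index != left_out
            for traj in chunk
        ]
        yield left_out, training

# Hypothetical trajectory labels standing in for real data.
trajectory_names = [f"traj-{i:02d}" for i in range(10)]
subsets = list(leave_one_chunk_out(trajectory_names, n_chunks=5))
for left_out, training in subsets:
    print(f"chunk {left_out} left out, fitting on {len(training)} trajectories")
```

Each training subset would then be used to fit its own model, and the resulting timescales or populations compared across the 5 fits.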

How can I validate an MSM?

The gold standard is to use your MSM to make predictions about experimental observables for a real molecular system that can be tested in the lab. The relaxation timescales which are calculated by MSMs, tICA, HMMs, and other types of kinetic models correspond to approximations for the relaxation timescales that should be observed in experiments like T-jump spectroscopy. It’s best to look in the literature for this. See for example [2] and [3] for a couple cool connections between MSMs and IR experiments.

One tricky part about validating an MSM by comparing to experiments is that there are multiple possible reasons that an MSM could be “wrong”. The MD forcefield used for the simulations might not be a sufficiently accurate model of reality. You might not have enough sampling. The MSM itself might not resolve the slow degrees of freedom in the system (e.g. because of poor clustering).

Another good idea is to build multiple MSMs, and see if they are consistent with one another. For example, a common thing is to compare the implied timescales of a series of MSMs built with the same clustering but with different lag times (the timescales should converge as the lag time increases). See the validation section of [4].
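A minimal NumPy sketch of an implied-timescales check is shown here. It is not MSMBuilder's estimator: it uses plain sliding-window counts with row normalization, assumes every state is visited, and runs on a synthetic two-state trajectory generated from a hypothetical transition matrix.

```python
import numpy as np

def transition_matrix(dtraj, n_states, lag):
    """Row-normalized transition counts between frames separated by `lag`."""
    counts = np.zeros((n_states, n_states))
    # Sliding-window counting: every pair (frame t, frame t + lag) is one count.
    np.add.at(counts, (dtraj[:-lag], dtraj[lag:]), 1.0)
    # Assumes every state is visited at least once, so no row sums to zero.
    return counts / counts.sum(axis=1, keepdims=True)

def implied_timescales(dtraj, n_states, lag, n_timescales=1):
    """Slowest implied timescales, in frames, at one lag time."""
    eigenvalues = np.linalg.eigvals(transition_matrix(dtraj, n_states, lag))
    # Sort descending; the leading eigenvalue (1.0) is the stationary process.
    eigenvalues = np.sort(eigenvalues.real)[::-1]
    return -lag / np.log(eigenvalues[1:n_timescales + 1])

# Synthetic two-state trajectory from a known (hypothetical) transition matrix.
rng = np.random.default_rng(0)
true_matrix = np.array([[0.95, 0.05], [0.10, 0.90]])
uniform = rng.random(100_000)
dtraj = np.zeros(uniform.size, dtype=int)
for t in range(1, dtraj.size):
    # Move to state 0 with probability true_matrix[previous state, 0].
    dtraj[t] = 0 if uniform[t] < true_matrix[dtraj[t - 1], 0] else 1

timescales = {lag: implied_timescales(dtraj, 2, lag)[0] for lag in (1, 2, 5, 10)}
for lag, timescale in timescales.items():
    print(f"lag {lag:2d} frames -> slowest implied timescale {timescale:.2f} frames")
```

Because this synthetic chain is exactly Markovian, the estimates should stay roughly flat across lag times, near -1 / ln(0.85), or about 6.2 frames. With real MD data, the timescales typically rise with lag time before leveling off.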

How can statistical models like MSMs be used to accelerate MD?

See Bowman, G. R., D. L. Ensign, and V. S. Pande. Enhanced modeling via network theory: Adaptive sampling of Markov state models. J. Chem. Theory Comput. 6.3 (2010): 787-794 and Doerr, S., and G. De Fabritiis. On-the-fly learning and sampling of ligand binding by high-throughput molecular simulations. J. Chem. Theory Comput. (2014).

What are the tradeoffs between running a large number of short MD simulations vs. a few long ones?

That’s a good question.

My simulations use replica exchange, aMD, or metadynamics. Can I use these tools to analyze them?

Yes, but you’re going to have to be careful. Replica exchange, aMD, metadynamics, and other related thermodynamic sampling methods sacrifice physical kinetics to achieve potentially faster thermodynamic sampling. So you’re going to need to be careful about interpreting the time-related quantities from any models you might build using msmbuilder, such as the transition matrix of an MSM or tICA eigenvalues. With clustering you’re fine.

Why am I getting MemoryErrors?

Traceback (most recent call last):
  File "file.py", line 5, in <module>
    np.zeros((N, M))
MemoryError

If you’re running models in msmbuilder and you get a traceback with a MemoryError (like the one above), the reason is that you don’t have enough RAM in your machine to run whatever you’re trying to run. One thing you can do is just get more RAM, but this isn’t going to scale very far.

To debug this kind of issue, you really need to reason about the size of the arrays that are being created, which means thinking about the number of data points in your dataset, the number of features, etc. Some algorithms, like LandmarkHierarchical, let you trade off the memory requirement against accuracy.
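A back-of-the-envelope estimate is often enough to spot the problem. The sketch below assumes dense arrays of 8-byte floats (NumPy's default float64); the helper name and the dataset sizes are hypothetical.

```python
def array_gigabytes(n_rows, n_columns, bytes_per_item=8):
    """Memory needed for a dense n_rows x n_columns array, in gigabytes (1e9 bytes)."""
    return n_rows * n_columns * bytes_per_item / 1e9

# Hypothetical dataset: one million frames, five thousand features per frame.
n_frames = 1_000_000
n_features = 5_000

feature_matrix_gb = array_gigabytes(n_frames, n_features)
# A full frame-by-frame pairwise distance matrix grows quadratically with n_frames.
distance_matrix_gb = array_gigabytes(n_frames, n_frames)

print(f"feature matrix: {feature_matrix_gb:.0f} GB")
print(f"pairwise distance matrix: {distance_matrix_gb:.0f} GB")
```

With these numbers the feature matrix alone needs 40 GB and a full pairwise distance matrix would need 8000 GB, which is why methods that work from a subset of landmark points can be attractive.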

If you’re trying to build models with thousands of features, consider running a dimensionality reduction algorithm like PCA or tICA first. Or if you have milliseconds of MD data sampled at a picosecond frequency, consider subsampling (e.g. only analyze every 100th or 10,000th snapshot from your simulations).
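Subsampling by a fixed stride can be sketched with ordinary slicing, which works the same way on Python lists and NumPy arrays. The helper names are hypothetical, and the frame counts simply work through the millisecond-at-picosecond example above.

```python
def subsample(frames, stride):
    """Keep every `stride`-th frame, starting from the first one."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    return frames[::stride]

def frames_after_stride(n_frames, stride):
    """Number of frames that survive subsampling (ceiling division)."""
    return -(-n_frames // stride)

# One millisecond of data saved every picosecond is 10**9 frames.
n_frames_total = 10**9
for stride in (100, 10_000):
    kept = frames_after_stride(n_frames_total, stride)
    print(f"stride {stride:>6}: {kept:,} frames kept")

# Small demonstration on a stand-in trajectory of 10 frames.
demo_frames = list(range(10))
demo_subsampled = subsample(demo_frames, 3)
print(demo_subsampled)
```

Keep in mind that the stride also sets the shortest lag time a model can resolve, so it should stay well below the timescales you care about.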

How can I cite MSMBuilder?

Please cite MSMBuilder2: Modeling Conformational Dynamics on the Picosecond to Millisecond Scale. Most of the individual methods that are implemented in MSMBuilder were also introduced in published papers. The documentation for each class or command should have the appropriate references listed.


[1] Of course, this is merely an opinion.
[2] Zhuang, W., et al. Simulating the T-jump-triggered unfolding dynamics of trpzip2 peptide and its time-resolved IR and two-dimensional IR signals using the Markov state model approach. J. Phys. Chem. B 115.18 (2011): 5415-5424.
[3] Baiz, C. R., et al. A Molecular Interpretation of 2D IR Protein Folding Experiments with Markov State Models. Biophysical Journal 106.6 (2014): 1359-1370.
[4] Pande, V. S., K. Beauchamp, and G. R. Bowman. Everything you wanted to know about Markov State Models but were afraid to ask. Methods 52.1 (2010): 99-105.