The aim of this package is to provide software tools for predictive modeling of the long timescale dynamics of biomolecular systems using statistical modeling to analyze physical simulations.
Given a dataset of one or more stochastic trajectories tracking the coordinates of every (10,000+) atom in a molecular system at a discrete time interval, how do we understand the slow dynamical processes and make quantitative predictions about the system?
To build a dynamical model, we apply (stepwise) a series of dimensionality reductions. The basic set of steps is outlined below. Note that most steps are optional under certain circumstances. The particulars should become clear as you continue reading the documentation.
- Set up a system for molecular dynamics, and run one or more simulations for as long as you can on as many CPUs or GPUs as you have access. There are a lot of great software packages for running MD, e.g OpenMM, Gromacs, Amber, CHARMM, and many others. MSMBuilder is not one of them.
- Featurize trajectories into an appropriate vector of features. The full \(3N\) set of atomic coordinates is potentially unwieldy and redundant. It likely does not respect the rotational or translational symmetry of your system either. We commonly use backbone dihedral angles as our features, although this depends highly on the system being modeled.
- Decompose your features into a new basis that preserves the relevant information in your data with fewer dimensions. We typically use tICA, which finds linear combinations of input degrees of freedom that maximize autocorrelation or “slowness”.
- Cluster your data to define (micro-)states by grouping similar input data points. At this stage, we’ve reduced the dimensionality of the problem from potentially thousands of \(xyz\) coordinates to a single cluster (state) index.
- Estimate a model from the clustered data. We typically build an MSM, which models the important dynamics of the system.
- Use GMRQ cross-validation to select the best model. There are many hyperparameters (knobs to tweak) in the workflow. This scoring function can help us pick the best values.