Fs Peptide (in RAM)ΒΆ

Modeling dynamics of FS Peptide

This example shows a typical, basic usage of MSMBuilder to model dynamics of a protein system.

MSMBuilder includes example datasets

The following cell loads an example protein dataset. In practice, you will generate your own dataset by running molecular dynamics (MD) on your system of interest. MSMBuilder does not run MD.

In [1]:
# Download example dataset
from msmbuilder.example_datasets import FsPeptide
fs_peptide = FsPeptide(verbose=False)
xyz = fs_peptide.get().trajectories
print(fs_peptide.description())
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/mdtraj/formats/pdb/pdbfile.py:196: UserWarning: Unlikely unit cell vectors detected in PDB file likely resulting from a dummy CRYST1 record. Discarding unit cell vectors.
  warnings.warn('Unlikely unit cell vectors detected in PDB file likely '
This dataset consists of 28 molecular dynamics trajectories of Fs peptide
(Ace-A_5(AAARA)_3A-NME), a widely studied model system for protein folding.
Each trajectory is 500 ns in length, and saved at a 50 ps time interval (14
us aggegrate sampling). The simulations were performed using the AMBER99SB-ILDN
force field with GBSA-OBC implicit solvent at 300K, starting from randomly
sampled conformations from an initial 400K unfolding simulation. The
simulations were performed with OpenMM 6.0.1.

The dataset, including the script used to generate the dataset
is available on figshare at

http://dx.doi.org/10.6084/m9.figshare.1030363

Since the data was saved at 50 ps / frame, we only load every 10th frame (with a new frequency of 0.5/ns).

In [2]:
xyz = [t[::10] for t in xyz]
print("{} trajectories".format(len(xyz)))
# msmbuilder does not keep track of units! You must keep track of your
# data's timestep
to_ns = 0.5
print("with length {} ns".format(set(len(x)*to_ns for x in xyz)))
28 trajectories
with length {500.0} ns

Featurization

The raw (x, y, z) coordinates from the simulation do not respect the translational and rotational symmetry of our problem. A Featurizer transforms cartesian coordinates into other representations. Here we use the DihedralFeaturizer to turn our data into phi and psi dihedral angles. Observe that the 264*3-dimensional space is reduced to 84 dimensions

In [3]:
from msmbuilder.featurizer import DihedralFeaturizer
featurizer = DihedralFeaturizer(types=['phi', 'psi'])
diheds = featurizer.fit_transform(xyz)

print(xyz[0].xyz.shape)
print(diheds[0].shape)
(1000, 264, 3)
(1000, 84)

Preprocessing

Since the range of values in our raw data can vary widely from feature to feature, we can scale values to reduce bias. Here we use the RobustScaler to center and scale our dihedral angles by their respective interquartile ranges.

In [4]:
from msmbuilder.preprocessing import RobustScaler
scaler = RobustScaler()
scaled_diheds = scaler.fit_transform(diheds)

print(diheds[0].shape)
print(scaled_diheds[0].shape)
(1000, 84)
(1000, 84)

Intermediate kinetic model: tICA

tICA is similar to principal component analysis (see "tICA vs. PCA" example). Note that the 84-dimensional space is reduced to 4 dimensions.

In [5]:
from msmbuilder.decomposition import tICA
tica_model = tICA(lag_time=2, n_components=4)
# fit and transform can be done in seperate steps:
tica_model.fit(diheds)
tica_trajs = tica_model.transform(diheds)

print(diheds[0].shape)
print(tica_trajs[0].shape)
(1000, 84)
(1000, 4)

tICA Histogram

We can histogram our data projecting along the two slowest degrees of freedom (as found by tICA)

In [6]:
%matplotlib inline
import msmexplorer as msme
import numpy as np
txx = np.concatenate(tica_trajs)
_ = msme.plot_histogram(txx)
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/seaborn/apionly.py:6: UserWarning: As seaborn no longer sets a default style on import, the seaborn.apionly module is deprecated. It will be removed in a future version.
  warnings.warn(msg, UserWarning)

Clustering

Conformations need to be clustered into states (sometimes written as microstates). We cluster based on the tICA projections to group conformations that interconvert rapidly. Note that we transform our trajectories from the 4-dimensional tICA space into a 1-dimensional cluster index.

In [7]:
from msmbuilder.cluster import MiniBatchKMeans
clusterer = MiniBatchKMeans(n_clusters=100, random_state=42)
clustered_trajs = clusterer.fit_transform(tica_trajs)

print(tica_trajs[0].shape)
print(clustered_trajs[0].shape)
(1000, 4)
(1000,)
In [8]:
from matplotlib import pyplot as plt
plt.hexbin(txx[:,0], txx[:,1], bins='log', mincnt=1, cmap='viridis')
plt.scatter(clusterer.cluster_centers_[:,0],
            clusterer.cluster_centers_[:,1], 
            s=100, c='w')
Out[8]:
<matplotlib.collections.PathCollection at 0x14b3050bfc88>

MSM

We can construct an MSM from the labeled trajectories

In [9]:
from msmbuilder.msm import MarkovStateModel
msm = MarkovStateModel(lag_time=2, n_timescales=20)
msm.fit(clustered_trajs)
MSM contains 1 strongly connected component above weight=0.50. Component 0 selected, with population 100.000000%
Out[9]:
MarkovStateModel(ergodic_cutoff='on', lag_time=2, n_timescales=20,
         prior_counts=0, reversible_type='mle', sliding_window=True,
         verbose=True)
In [10]:
assignments = clusterer.partial_transform(txx)
assignments = msm.partial_transform(assignments)
msme.plot_free_energy(txx, obs=(0, 1), n_samples=10000,
                      pi=msm.populations_[assignments],
                      xlabel='tIC 1', ylabel='tIC 2')
plt.scatter(clusterer.cluster_centers_[msm.state_labels_, 0],
            clusterer.cluster_centers_[msm.state_labels_, 1],
            s=1e4 * msm.populations_,       # size by population
            c=msm.left_eigenvectors_[:, 1], # color by eigenvector
            cmap="coolwarm",
            zorder=3) 
plt.colorbar(label='First dynamical eigenvector')
plt.tight_layout()
In [11]:
msm.timescales_
Out[11]:
array([ 3945.14131925,  1339.41593592,  1066.24252899,   549.83489889,
         297.94676168,   253.55650728,   222.34876895,   179.88927971,
         151.40133055,   129.89272175,   125.00427867,    98.52943851,
          98.31081754,    94.10475336,    90.68019339,    75.94855924,
          73.24168864,    66.74039138,    61.03863276,    51.03848178])
In [12]:
msme.plot_timescales(msm, n_timescales=5,
                     ylabel='Implied Timescales ($ns$)')
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x14b304fb3198>

Macrostate Model

In [13]:
from msmbuilder.lumping import PCCAPlus
pcca = PCCAPlus.from_msm(msm, n_macrostates=4)
macro_trajs = pcca.transform(clustered_trajs)
Optimization terminated successfully.
         Current function value: -3.990911
         Iterations: 11
         Function evaluations: 101
In [14]:
msme.plot_free_energy(txx, obs=(0, 1), n_samples=10000,
                      pi=msm.populations_[assignments],
                      xlabel='tIC 1', ylabel='tIC 2')
plt.scatter(clusterer.cluster_centers_[msm.state_labels_, 0],
            clusterer.cluster_centers_[msm.state_labels_, 1],
            s=50,
            c=pcca.microstate_mapping_,
            zorder=3
           )
plt.tight_layout()

(Fs-Peptide-in-RAM.ipynb; Fs-Peptide-in-RAM.eval.ipynb; Fs-Peptide-in-RAM.py)