Fs Peptide (command line)ΒΆ

Modeling dynamics of FS Peptide

This example shows a typical, basic usage of the MSMBuilder command line to model dynamics of a protein system.

In [1]:
# Work in a temporary directory
import tempfile
import os
os.chdir(tempfile.mkdtemp())
In [2]:
# Since this is running from an IPython notebook,
# we prefix all our commands with "!"
# When running on the command line, omit the leading "!"
! msmb -h
usage: msmb [-h] [-v]  ...

Statistical models for biomolecular dynamics

optional arguments:
 -h, --help             show this help message and exit
 -v, --version          show program's version number and exit

commands:
 
  AtomIndices           Create index file for atoms or distance pairs.
  ConvertChunkedProject
                        Convert an MD dataset with chunked trajectories into a standard format.
  TemplateProject       Set up a new MSMBuilder project
  AlphaAngleFeaturizer  Featurizer to extract alpha (dihedral) angles.
  BinaryContactFeaturizer
                        Featurizer based on residue-residue contacts below a cutoff.
  SASAFeaturizer        Featurizer based on solvent-accessible surface areas.
  SuperposeFeaturizer   Featurizer based on euclidian atom distances to reference structure.
  RawPositionsFeaturizer
                        Featurize an MD trajectory into a vector space with the raw cartesian coordinates.
  AtomPairsFeaturizer   Featurizer based on distances between specified pairs of atoms.
  ContactFeaturizer     Featurizer based on residue-residue distances.
  KappaAngleFeaturizer  Featurizer to extract kappa angles.
  LogisticContactFeaturizer
                        Featurizer based on logistic-transformed residue-residue contacts.
  VonMisesFeaturizer    Featurizer based on dihedral angles soft-binned along the unit circle.
  GaussianSolventFeaturizer
                        Featurizer on weighted pairwise distance between solute and solvent.
  RMSDFeaturizer        Featurizer based on RMSD to one or more reference structures.
  DRIDFeaturizer        Featurizer based on distribution of reciprocal interatomic distances (DRID).
  DihedralFeaturizer    Featurizer based on dihedral angles.
  RobustScaler          Scale features using statistics that are robust to outliers.
  StandardScaler        Standardize features by removing the mean and scaling to unit variance.
  DoubleEWMA            Smooth time-series data using forward and backward exponentially-weighted moving average
                        filters.
  Butterworth           Smooth time-series data using a low-pass, zero-delay Butterworth filter.
  KMeans                K-Means clustering.
  MiniBatchKMeans       Mini-Batch K-Means clustering.
  SpectralClustering    Apply clustering to a projection to the normalized laplacian.
  NDGrid                Discretize continuous data points onto an N-dimensional grid.
  RegularSpatial        Regular spatial clustering.
  GMM                   Gaussian Mixture Model.
  MiniBatchKMedoids     Mini-Batch K-Medoids clustering.
  MeanShift             Mean shift clustering using a flat kernel.
  KCenters              K-Centers clustering.
  KMedoids              K-Medoids clustering.
  AgglomerativeClustering
                        Agglomerative Clustering.
  APM                   APM This Program is a python package which implements Automatic State Partitioning for Multibody
                        Systems (APM) algorithm to cluster trajectories and model the transitions based on a Markov
                        State Model (MSM).
  LandmarkAgglomerative
                        Landmark-based agglomerative hierarchical clustering.
  AffinityPropagation   Perform Affinity Propagation Clustering of data.
  KernelTICA            Time-structure Independent Componenent Analysis (tICA) using the kernel trick.
  SparsePCA             Sparse Principal Components Analysis (SparsePCA).
  tICA                  Time-structure Independent Component Analysis (tICA).
  PCA                   Principal component analysis (PCA).
  MiniBatchSparsePCA    Mini-batch Sparse Principal Components Analysis.
  FactorAnalysis        Factor Analysis (FA).
  FastICA               FastICA: a fast algorithm for Independent Component Analysis.
  SparseTICA            Sparse time-structure Independent Component Analysis (tICA).
  AlanineDipeptide      Download example alanine dipeptide dataset.
  MetEnkephalin         Download example Met-Enkephalin dataset.
  FsPeptide             Download example Fs-peptide dataset.
  DoubleWell            Generate example double well potential dataset.
  QuadWell              Generate example quad-well potential dataset.
  MullerPotential       Generate example Muller potential dataset.
  ImpliedTimescales     Scan the implied timescales of MarkovStateModels with respect to lag time.
  GaussianHMM           Reversible Gaussian Hidden Markov Model L1-Fusion Regularization.
  MarkovStateModel      Reversible Markov State Model.
  BayesianContinuousTimeMSM
                        Bayesian reversible first-order Master equation model.
  ContinuousTimeMSM     Reversible first order master equation model.
  BayesianMarkovStateModel
                        Bayesian reversible Markov state model.
  TransformDataset      Use a pre-fit model to transform a dataset.

Get example data

In [3]:
! msmb FsPeptide --data_home ./
! tree
Copying fs_peptide from msmb_data package to ./
Example dataset saved: ./fs_peptide
/bin/sh: 1: tree: not found

Featurization

The raw (x, y, z) coordinates from the simulation do not respect the translational and rotational symmetry of our problem. A Featurizer transforms cartesian coordinates into other representations. Here we use the DihedralFeaturizer to turn our data into phi and psi dihedral angles. Observe that the 264*3-dimensional space is reduced to 84 dimensions.

In [4]:
# Remember '\' is the line-continuation marker
# You can enter this command on one line
! msmb DihedralFeaturizer \
    --out featurizer.pkl  \
    --transformed diheds  \
    --top fs_peptide/fs-peptide.pdb \
    --trjs "fs_peptide/*.xtc" \
    --stride 10
DihedralFeaturizer(sincos=True, types=['phi', 'psi'])
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/mdtraj/formats/pdb/pdbfile.py:196: UserWarning: Unlikely unit cell vectors detected in PDB file likely resulting from a dummy CRYST1 record. Discarding unit cell vectors.
  warnings.warn('Unlikely unit cell vectors detected in PDB file likely '


Saving transformed dataset to 'diheds/'
To load this dataset interactive inside an IPython
shell or notebook, run

  $ ipython
  >>> from msmbuilder.dataset import dataset
  >>> ds = dataset('diheds/')

Saving "featurizer.pkl"... (<class 'msmbuilder.featurizer.featurizer.DihedralFeaturizer'>)
To load this DihedralFeaturizer object interactively inside an IPython
shell or notebook, run: 

  $ ipython
  >>> from msmbuilder.utils import load
  >>> model = load('featurizer.pkl')

Preprocessing

Since the range of values in our raw data can vary widely from feature to feature, we can scale values to reduce bias. Here we use the RobustScaler to center and scale our dihedral angles by their respective interquartile ranges.

In [5]:
! msmb RobustScaler \
    -i diheds \
    --transformed scaled_diheds.h5
RobustScaler(copy=True, with_centering=True, with_scaling=True)
Fitting model...
*********
*RESULTS*
*********
NotImplemented
--------------------------------------------------------------------------------


Saving transformed dataset to 'scaled_diheds.h5'
To load this dataset interactive inside an IPython
shell or notebook, run

  $ ipython
  >>> from msmbuilder.dataset import dataset
  >>> ds = dataset('scaled_diheds.h5')

Intermediate kinetic model: tICA

tICA is similar to principal component analysis (see "tICA vs. PCA" example). Note that the 84-dimensional space is reduced to 4 dimensions.

In [6]:
! msmb tICA -i scaled_diheds.h5 \
    --out tica_model.pkl \
    --transformed tica_trajs.h5 \
    --n_components 4 \
    --lag_time 2
tICA(kinetic_mapping=False, lag_time=2, n_components=4, shrinkage=None)
Fitting model...
*********
*RESULTS*
*********
time-structure based Independent Components Analysis (tICA)
-----------------------------------------------------------
n_components        : 4
shrinkage           : 0.0011725243789861743
lag_time            : 2
kinetic_mapping     : False

Top 5 timescales :
[ 288.73719263  158.84187298  133.35210428  129.88067247]

Top 5 eigenvalues :
[ 0.99309722  0.9874878   0.98511402  0.9847192 ]

--------------------------------------------------------------------------------


Saving transformed dataset to 'tica_trajs.h5'
To load this dataset interactive inside an IPython
shell or notebook, run

  $ ipython
  >>> from msmbuilder.dataset import dataset
  >>> ds = dataset('tica_trajs.h5')

Saving "tica_model.pkl"... (<class 'msmbuilder.decomposition.tica.tICA'>)
To load this tICA object interactively inside an IPython
shell or notebook, run: 

  $ ipython
  >>> from msmbuilder.utils import load
  >>> model = load('tica_model.pkl')

tICA Histogram

We can histogram our data projecting along the two slowest degrees of freedom (as found by tICA). You have to do this in a python script.

In [7]:
from msmbuilder.dataset import dataset
ds = dataset('tica_trajs.h5')

%matplotlib inline
import msmexplorer as msme
import numpy as np
txx = np.concatenate(ds)
_ = msme.plot_histogram(txx)
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/IPython/html.py:14: ShimWarning: The `IPython.html` package has been deprecated. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)

Clustering

Conformations need to be clustered into states (sometimes written as microstates). We cluster based on the tICA projections to group conformations that interconvert rapidly. Note that we transform our trajectories from the 4-dimensional tICA space into a 1-dimensional cluster index.

In [8]:
! msmb MiniBatchKMeans -i tica_trajs.h5 \
    --transformed labeled_trajs.h5 \
    --out clusterer.pkl \
    --n_clusters 100 \
    --random_state 42
MiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++',
        init_size=None, max_iter=100, max_no_improvement=10,
        n_clusters=100, n_init=3, random_state=42, reassignment_ratio=0.01,
        tol=0.0, verbose=0)
Fitting model...
*********
*RESULTS*
*********
NotImplemented
--------------------------------------------------------------------------------


Saving transformed dataset to 'labeled_trajs.h5'
To load this dataset interactive inside an IPython
shell or notebook, run

  $ ipython
  >>> from msmbuilder.dataset import dataset
  >>> ds = dataset('labeled_trajs.h5')

Saving "clusterer.pkl"... (<class 'msmbuilder.cluster.MiniBatchKMeans'>)
To load this MiniBatchKMeans object interactively inside an IPython
shell or notebook, run: 

  $ ipython
  >>> from msmbuilder.utils import load
  >>> model = load('clusterer.pkl')

MSM

We can construct an MSM from the labeled trajectories

In [9]:
! msmb MarkovStateModel -i labeled_trajs.h5 \
    --out msm.pkl \
    --lag_time 2
MarkovStateModel(ergodic_cutoff='on', lag_time=2, n_timescales=None,
         prior_counts=0, reversible_type='mle', sliding_window=True,
         verbose=True)
MSM contains 1 strongly connected component above weight=0.50. Component 0 selected, with population 100.000000%
*********
*RESULTS*
*********
Markov state model
------------------
Lag time         : 2
Reversible type  : mle
Ergodic cutoff   : on
Prior counts     : 0

Number of states : 100
Number of nonzero entries in counts matrix : 841 (8.41%)
Nonzero counts matrix entries:
    Min.   : 0.5
    1st Qu.: 0.5
    Median : 2.0
    Mean   : 16.6
    3rd Qu.: 8.0
    Max.   : 378.5

Total transition counts :
    13972.0 counts
Total transition counts / lag_time:
    6986.0 units
Timescales:
    [1828.40, 1249.05, 572.64, 311.84, 304.83, 298.62, 229.76, 204.46, 139.46, 132.14, 118.04, 110.47, 105.30, 79.05, 69.94, 66.07, 61.18, 53.25, 48.55, 46.22, 42.69, 41.32, 40.70, 37.78, 32.36, 29.14, 28.35, 24.80, 23.65, 23.04, 21.88, 21.11, 19.13, 18.44, 17.77, 16.34, 15.73, 15.40, 14.70, 13.35, 12.97, 9.94, 9.92, 9.70, 9.09, 8.79, 8.69, 8.60, 8.18, 7.49, 7.08, 7.01, 6.87, 6.72, 6.49, 6.35, 6.22, 6.17, 5.89, 5.61, 5.56, 5.50, 5.33, 5.28, 5.23, 4.91, 4.81, 4.68, 4.60, 4.28, 4.26, 4.07, 4.05, 3.98, 3.71, 3.36, 3.27, 2.96, 2.63, 2.51, 2.47, 2.41, 2.28, 2.27, 2.18, 2.08, 2.07, 1.77, 1.74, 1.32, 1.20, 1.20, 1.18, 1.00, 0.98, 0.70, 0.59, 0.55, 0.36]  units

--------------------------------------------------------------------------------
Saving "msm.pkl"... (<class 'msmbuilder.msm.msm.MarkovStateModel'>)
To load this MarkovStateModel object interactively inside an IPython
shell or notebook, run: 

  $ ipython
  >>> from msmbuilder.utils import load
  >>> model = load('msm.pkl')

Plot Free Energy Landscape

Subsequent plotting and analysis should be done from Python

In [10]:
from msmbuilder.utils import load
msm = load('msm.pkl')
clusterer = load('clusterer.pkl')

assignments = clusterer.partial_transform(txx)
assignments = msm.partial_transform(assignments)

from matplotlib import pyplot as plt
msme.plot_free_energy(txx, obs=(0, 1), n_samples=10000,
                      pi=msm.populations_[assignments],
                      xlabel='tIC 1', ylabel='tIC 2')
plt.scatter(clusterer.cluster_centers_[msm.state_labels_, 0],
            clusterer.cluster_centers_[msm.state_labels_, 1],
            s=1e4 * msm.populations_,       # size by population
            c=msm.left_eigenvectors_[:, 1], # color by eigenvector
            cmap="coolwarm",
            zorder=3
           ) 
plt.colorbar(label='First dynamical eigenvector')
plt.tight_layout()

(Fs-Peptide-command-line.ipynb; Fs-Peptide-command-line.eval.ipynb; Fs-Peptide-command-line.py)