Fs Peptide (command line)ΒΆ

Modeling dynamics of FS Peptide

This example shows a typical, basic usage of the MSMBuilder command line to model dynamics of a protein system.

In [1]:
# Work in a temporary directory
import tempfile
import os
os.chdir(tempfile.mkdtemp())
In [2]:
# Since this is running from an IPython notebook,
# we prefix all our commands with "!"
# When running on the command line, omit the leading "!"
! msmb -h
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
usage: msmb [-h] [-v]  ...

Statistical models for biomolecular dynamics

optional arguments:
 -h, --help             show this help message and exit
 -v, --version          show program's version number and exit

commands:
 
  AtomIndices           Create index file for atoms or distance pairs.
  ConvertChunkedProject
                        Convert an MD dataset with chunked trajectories into a standard format.
  TemplateProject       Set up a new MSMBuilder project
  DihedralFeaturizer    Featurizer based on dihedral angles.
  KappaAngleFeaturizer  Featurizer to extract kappa angles.
  AlphaAngleFeaturizer  Featurizer to extract alpha (dihedral) angles.
  AtomPairsFeaturizer   Featurizer based on distances between specified pairs of atoms.
  RMSDFeaturizer        Featurizer based on RMSD to one or more reference structures.
  LandMarkRMSDFeaturizer
                        Kernel Landmark Featuizer based on RMSD to one or more reference structures.
  SuperposeFeaturizer   Featurizer based on euclidian atom distances to reference structure.
  DRIDFeaturizer        Featurizer based on distribution of reciprocal interatomic distances (DRID).
  ContactFeaturizer     Featurizer based on residue-residue distances.
  AngleFeaturizer       Featurizer based on angles between 3 atoms.
  BinaryContactFeaturizer
                        Featurizer based on residue-residue contacts below a cutoff.
  LogisticContactFeaturizer
                        Featurizer based on logistic-transformed residue-residue contacts.
  GaussianSolventFeaturizer
                        Featurizer on weighted pairwise distance between solute and solvent.
  VonMisesFeaturizer    Featurizer based on dihedral angles soft-binned along the unit circle.
  RawPositionsFeaturizer
                        Featurize an MD trajectory into a vector space with the raw cartesian coordinates.
  SASAFeaturizer        Featurizer based on solvent-accessible surface areas.
  LigandContactFeaturizer
                        Featurizer based on ligand-protein distances.
  BinaryLigandContactFeaturizer
                        Featurizer based on binary ligand-protein contacts.
  LigandRMSDFeaturizer  Featurizer based on RMSD to one or more reference structures.
  Butterworth           Smooth time-series data using a low-pass, zero-delay Butterworth filter.
  DoubleEWMA            Smooth time-series data using forward and backward exponentially-weighted moving average
                        filters.
  StandardScaler        Standardize features by removing the mean and scaling to unit variance.
  RobustScaler          Scale features using statistics that are robust to outliers.
  KMeans                K-Means clustering.
  KCenters              K-Centers clustering.
  KMedoids              K-Medoids clustering.
  MiniBatchKMedoids     Mini-Batch K-Medoids clustering.
  RegularSpatial        Regular spatial clustering.
  LandmarkAgglomerative
                        Landmark-based agglomerative hierarchical clustering.
  GMM                   Gaussian Mixture.
  MeanShift             Mean shift clustering using a flat kernel.
  NDGrid                Discretize continuous data points onto an N-dimensional grid.
  SpectralClustering    Apply clustering to a projection to the normalized laplacian.
  AffinityPropagation   Perform Affinity Propagation Clustering of data.
  APM                   APM This Program is a python package which implements Automatic State Partitioning for Multibody
                        Systems (APM) algorithm to cluster trajectories and model the transitions based on a Markov
                        State Model (MSM).
  AgglomerativeClustering
                        Agglomerative Clustering.
  MiniBatchKMeans       Mini-Batch K-Means clustering.
  tICA                  Time-structure Independent Component Analysis (tICA).
  SparseTICA            Sparse time-structure Independent Component Analysis (tICA).
  FastICA               FastICA: a fast algorithm for Independent Component Analysis.
  FactorAnalysis        Factor Analysis (FA).
  KernelTICA            Time-structure Independent Componenent Analysis (tICA) using the kernel trick.
  PCA                   Principal component analysis (PCA).
  SparsePCA             Sparse Principal Components Analysis (SparsePCA).
  MiniBatchSparsePCA    Mini-batch Sparse Principal Components Analysis.
  KSparseTICA           Sparse tICA where eigenvectors must have k or fewer nonzero values.
  AlanineDipeptide      Download example alanine dipeptide dataset.
  FsPeptide             Download example Fs-peptide dataset.
  MetEnkephalin         Download example Met-Enkephalin dataset.
  DoubleWell            Generate example double well potential dataset.
  QuadWell              Generate example quad-well potential dataset.
  MullerPotential       Generate example Muller potential dataset.
  ImpliedTimescales     Scan the implied timescales of MarkovStateModels with respect to lag time.
  GaussianHMM           Reversible Gaussian Hidden Markov Model L1-Fusion Regularization.
  MarkovStateModel      Reversible Markov State Model.
  BayesianMarkovStateModel
                        Bayesian reversible Markov state model.
  ContinuousTimeMSM     Reversible first order master equation model.
  BayesianContinuousTimeMSM
                        Bayesian reversible first-order Master equation model.
  TransformDataset      Use a pre-fit model to transform a dataset.

Get example data

In [3]:
! msmb FsPeptide --data_home ./
! tree
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
Copying fs_peptide from msmb_data package to ./
Example dataset saved: ./fs_peptide
/bin/sh: 1: tree: not found

Featurization

The raw (x, y, z) coordinates from the simulation do not respect the translational and rotational symmetry of our problem. A Featurizer transforms cartesian coordinates into other representations. Here we use the DihedralFeaturizer to turn our data into phi and psi dihedral angles. Observe that the 264*3-dimensional space is reduced to 84 dimensions.

In [4]:
# Remember '\' is the line-continuation marker
# You can enter this command on one line
! msmb DihedralFeaturizer \
    --out featurizer.pkl  \
    --transformed diheds  \
    --top fs_peptide/fs-peptide.pdb \
    --trjs "fs_peptide/*.xtc" \
    --stride 10
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
DihedralFeaturizer(sincos=True, types=['phi', 'psi'])
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/mdtraj/formats/pdb/pdbfile.py:196: UserWarning: Unlikely unit cell vectors detected in PDB file likely resulting from a dummy CRYST1 record. Discarding unit cell vectors.
  warnings.warn('Unlikely unit cell vectors detected in PDB file likely '


Saving transformed dataset to 'diheds/'
To load this dataset interactive inside an IPython
shell or notebook, run

  $ ipython
  >>> from msmbuilder.dataset import dataset
  >>> ds = dataset('diheds/')

Saving "featurizer.pkl"... (<class 'msmbuilder.featurizer.featurizer.DihedralFeaturizer'>)
To load this DihedralFeaturizer object interactively inside an IPython
shell or notebook, run: 

  $ ipython
  >>> from msmbuilder.utils import load
  >>> model = load('featurizer.pkl')

Preprocessing

Since the range of values in our raw data can vary widely from feature to feature, we can scale values to reduce bias. Here we use the RobustScaler to center and scale our dihedral angles by their respective interquartile ranges.

In [5]:
! msmb RobustScaler \
    -i diheds \
    --transformed scaled_diheds.h5
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)
Fitting model...
*********
*RESULTS*
*********
NotImplemented
--------------------------------------------------------------------------------


Saving transformed dataset to 'scaled_diheds.h5'
To load this dataset interactive inside an IPython
shell or notebook, run

  $ ipython
  >>> from msmbuilder.dataset import dataset
  >>> ds = dataset('scaled_diheds.h5')

Intermediate kinetic model: tICA

tICA is similar to principal component analysis (see "tICA vs. PCA" example). Note that the 84-dimensional space is reduced to 4 dimensions.

In [6]:
! msmb tICA -i scaled_diheds.h5 \
    --out tica_model.pkl \
    --transformed tica_trajs.h5 \
    --n_components 4 \
    --lag_time 2
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
tICA(commute_mapping=False, kinetic_mapping=False, lag_time=2, n_components=4,
   shrinkage=None)
Fitting model...
*********
*RESULTS*
*********
time-structure based Independent Components Analysis (tICA)
-----------------------------------------------------------
n_components        : 4
shrinkage           : 0.0011725244155440773
lag_time            : 2
kinetic_mapping     : False

Top 5 timescales :
[ 288.73718985  158.84187276  133.35210395  129.88067026]

Top 5 eigenvalues :
[ 0.99309722  0.9874878   0.98511402  0.9847192 ]

--------------------------------------------------------------------------------


Saving transformed dataset to 'tica_trajs.h5'
To load this dataset interactive inside an IPython
shell or notebook, run

  $ ipython
  >>> from msmbuilder.dataset import dataset
  >>> ds = dataset('tica_trajs.h5')

Saving "tica_model.pkl"... (<class 'msmbuilder.decomposition.tica.tICA'>)
To load this tICA object interactively inside an IPython
shell or notebook, run: 

  $ ipython
  >>> from msmbuilder.utils import load
  >>> model = load('tica_model.pkl')

tICA Histogram

We can histogram our data projecting along the two slowest degrees of freedom (as found by tICA). You have to do this in a python script.

In [7]:
from msmbuilder.dataset import dataset
ds = dataset('tica_trajs.h5')

%matplotlib inline
import msmexplorer as msme
import numpy as np
txx = np.concatenate(ds)
_ = msme.plot_histogram(txx)
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/seaborn/apionly.py:6: UserWarning: As seaborn no longer sets a default style on import, the seaborn.apionly module is deprecated. It will be removed in a future version.
  warnings.warn(msg, UserWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)

Clustering

Conformations need to be clustered into states (sometimes written as microstates). We cluster based on the tICA projections to group conformations that interconvert rapidly. Note that we transform our trajectories from the 4-dimensional tICA space into a 1-dimensional cluster index.

In [8]:
! msmb MiniBatchKMeans -i tica_trajs.h5 \
    --transformed labeled_trajs.h5 \
    --out clusterer.pkl \
    --n_clusters 100 \
    --random_state 42
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
MiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++',
        init_size=None, max_iter=100, max_no_improvement=10,
        n_clusters=100, n_init=3, random_state=42, reassignment_ratio=0.01,
        tol=0.0, verbose=0)
Fitting model...
*********
*RESULTS*
*********
NotImplemented
--------------------------------------------------------------------------------


Saving transformed dataset to 'labeled_trajs.h5'
To load this dataset interactive inside an IPython
shell or notebook, run

  $ ipython
  >>> from msmbuilder.dataset import dataset
  >>> ds = dataset('labeled_trajs.h5')

Saving "clusterer.pkl"... (<class 'msmbuilder.cluster.MiniBatchKMeans'>)
To load this MiniBatchKMeans object interactively inside an IPython
shell or notebook, run: 

  $ ipython
  >>> from msmbuilder.utils import load
  >>> model = load('clusterer.pkl')

MSM

We can construct an MSM from the labeled trajectories

In [9]:
! msmb MarkovStateModel -i labeled_trajs.h5 \
    --out msm.pkl \
    --lag_time 2
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
MarkovStateModel(ergodic_cutoff='on', lag_time=2, n_timescales=None,
         prior_counts=0, reversible_type='mle', sliding_window=True,
         verbose=True)
MSM contains 1 strongly connected component above weight=0.50. Component 0 selected, with population 100.000000%
*********
*RESULTS*
*********
Markov state model
------------------
Lag time         : 2
Reversible type  : mle
Ergodic cutoff   : on
Prior counts     : 0

Number of states : 100
Number of nonzero entries in counts matrix : 841 (8.41%)
Nonzero counts matrix entries:
    Min.   : 0.5
    1st Qu.: 0.5
    Median : 2.0
    Mean   : 16.6
    3rd Qu.: 8.0
    Max.   : 378.5

Total transition counts :
    13972.0 counts
Total transition counts / lag_time:
    6986.0 units
Timescales:
    [1828.40, 1249.05, 572.64, 311.84, 304.83, 298.62, 229.76, 204.46, 139.46, 132.14, 118.04, 110.47, 105.30, 79.05, 69.94, 66.07, 61.18, 53.25, 48.55, 46.22, 42.69, 41.32, 40.70, 37.78, 32.36, 29.14, 28.35, 24.80, 23.65, 23.04, 21.88, 21.11, 19.13, 18.44, 17.77, 16.34, 15.73, 15.40, 14.70, 13.35, 12.97, 9.94, 9.92, 9.70, 9.09, 8.79, 8.69, 8.60, 8.18, 7.49, 7.08, 7.01, 6.87, 6.72, 6.49, 6.35, 6.22, 6.17, 5.89, 5.61, 5.56, 5.50, 5.33, 5.28, 5.23, 4.91, 4.81, 4.68, 4.60, 4.28, 4.26, 4.07, 4.05, 3.98, 3.71, 3.36, 3.27, 2.96, 2.63, 2.51, 2.47, 2.41, 2.28, 2.27, 2.18, 2.08, 2.07, 1.77, 1.74, 1.32, 1.20, 1.20, 1.18, 1.00, 0.98, 0.70, 0.59, 0.55, 0.36]  units

--------------------------------------------------------------------------------
Saving "msm.pkl"... (<class 'msmbuilder.msm.msm.MarkovStateModel'>)
To load this MarkovStateModel object interactively inside an IPython
shell or notebook, run: 

  $ ipython
  >>> from msmbuilder.utils import load
  >>> model = load('msm.pkl')

Plot Free Energy Landscape

Subsequent plotting and analysis should be done from Python

In [10]:
from msmbuilder.utils import load
msm = load('msm.pkl')
clusterer = load('clusterer.pkl')

assignments = clusterer.partial_transform(txx)
assignments = msm.partial_transform(assignments)

from matplotlib import pyplot as plt
msme.plot_free_energy(txx, obs=(0, 1), n_samples=10000,
                      pi=msm.populations_[assignments],
                      xlabel='tIC 1', ylabel='tIC 2')
plt.scatter(clusterer.cluster_centers_[msm.state_labels_, 0],
            clusterer.cluster_centers_[msm.state_labels_, 1],
            s=1e4 * msm.populations_,       # size by population
            c=msm.left_eigenvectors_[:, 1], # color by eigenvector
            cmap="coolwarm",
            zorder=3
           ) 
plt.colorbar(label='First dynamical eigenvector')
plt.tight_layout()

(Fs-Peptide-command-line.ipynb; Fs-Peptide-command-line.eval.ipynb; Fs-Peptide-command-line.py)