Fs Peptide (command line)ΒΆ

Modeling dynamics of FS Peptide

This example shows a typical, basic usage of the MSMBuilder command line to model dynamics of a protein system.

In [1]:
# Work in a temporary directory
import tempfile
import os
os.chdir(tempfile.mkdtemp())
In [2]:
# Since this is running from an IPython notebook,
# we prefix all our commands with "!"
# When running on the command line, omit the leading "!"
! msmb -h
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
usage: msmb [-h] [-v]  ...

Statistical models for biomolecular dynamics

optional arguments:
 -h, --help             show this help message and exit
 -v, --version          show program's version number and exit

commands:
 
  AtomIndices           Create index file for atoms or distance pairs.
  ConvertChunkedProject
                        Convert an MD dataset with chunked trajectories into a standard format.
  TemplateProject       Set up a new MSMBuilder project
  LogisticContactFeaturizer
                        Featurizer based on logistic-transformed residue-residue contacts.
  VonMisesFeaturizer    Featurizer based on dihedral angles soft-binned along the unit circle.
  RMSDFeaturizer        Featurizer based on RMSD to one or more reference structures.
  LigandContactFeaturizer
                        Featurizer based on ligand-protein distances.
  BinaryContactFeaturizer
                        Featurizer based on residue-residue contacts below a cutoff.
  DRIDFeaturizer        Featurizer based on distribution of reciprocal interatomic distances (DRID).
  RawPositionsFeaturizer
                        Featurize an MD trajectory into a vector space with the raw cartesian coordinates.
  KappaAngleFeaturizer  Featurizer to extract kappa angles.
  SuperposeFeaturizer   Featurizer based on euclidian atom distances to reference structure.
  SASAFeaturizer        Featurizer based on solvent-accessible surface areas.
  GaussianSolventFeaturizer
                        Featurizer on weighted pairwise distance between solute and solvent.
  AtomPairsFeaturizer   Featurizer based on distances between specified pairs of atoms.
  DihedralFeaturizer    Featurizer based on dihedral angles.
  ContactFeaturizer     Featurizer based on residue-residue distances.
  AlphaAngleFeaturizer  Featurizer to extract alpha (dihedral) angles.
  LandMarkRMSDFeaturizer
                        Kernel Landmark Featuizer based on RMSD to one or more reference structures.
  BinaryLigandContactFeaturizer
                        Featurizer based on binary ligand-protein contacts.
  LigandRMSDFeaturizer  Featurizer based on RMSD to one or more reference structures.
  StandardScaler        Standardize features by removing the mean and scaling to unit variance.
  Butterworth           Smooth time-series data using a low-pass, zero-delay Butterworth filter.
  RobustScaler          Scale features using statistics that are robust to outliers.
  DoubleEWMA            Smooth time-series data using forward and backward exponentially-weighted moving average
                        filters.
  KMeans                K-Means clustering.
  MiniBatchKMeans       Mini-Batch K-Means clustering.
  AgglomerativeClustering
                        Agglomerative Clustering.
  GMM                   Gaussian Mixture.
  SpectralClustering    Apply clustering to a projection to the normalized laplacian.
  MiniBatchKMedoids     Mini-Batch K-Medoids clustering.
  KCenters              K-Centers clustering.
  AffinityPropagation   Perform Affinity Propagation Clustering of data.
  KMedoids              K-Medoids clustering.
  RegularSpatial        Regular spatial clustering.
  MeanShift             Mean shift clustering using a flat kernel.
  APM                   APM This Program is a python package which implements Automatic State Partitioning for Multibody
                        Systems (APM) algorithm to cluster trajectories and model the transitions based on a Markov
                        State Model (MSM).
  LandmarkAgglomerative
                        Landmark-based agglomerative hierarchical clustering.
  NDGrid                Discretize continuous data points onto an N-dimensional grid.
  MiniBatchSparsePCA    Mini-batch Sparse Principal Components Analysis.
  SparseTICA            Sparse time-structure Independent Component Analysis (tICA).
  PCA                   Principal component analysis (PCA).
  KernelTICA            Time-structure Independent Componenent Analysis (tICA) using the kernel trick.
  KSparseTICA           Sparse tICA where eigenvectors must have k or fewer nonzero values.
  SparsePCA             Sparse Principal Components Analysis (SparsePCA).
  tICA                  Time-structure Independent Component Analysis (tICA).
  FastICA               FastICA: a fast algorithm for Independent Component Analysis.
  FactorAnalysis        Factor Analysis (FA).
  AlanineDipeptide      Download example alanine dipeptide dataset.
  FsPeptide             Download example Fs-peptide dataset.
  MetEnkephalin         Download example Met-Enkephalin dataset.
  DoubleWell            Generate example double well potential dataset.
  QuadWell              Generate example quad-well potential dataset.
  MullerPotential       Generate example Muller potential dataset.
  ImpliedTimescales     Scan the implied timescales of MarkovStateModels with respect to lag time.
  GaussianHMM           Reversible Gaussian Hidden Markov Model L1-Fusion Regularization.
  MarkovStateModel      Reversible Markov State Model.
  BayesianMarkovStateModel
                        Bayesian reversible Markov state model.
  ContinuousTimeMSM     Reversible first order master equation model.
  BayesianContinuousTimeMSM
                        Bayesian reversible first-order Master equation model.
  TransformDataset      Use a pre-fit model to transform a dataset.

Get example data

In [3]:
! msmb FsPeptide --data_home ./
! tree
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
Copying fs_peptide from msmb_data package to ./
Example dataset saved: ./fs_peptide
/bin/sh: 1: tree: not found

Featurization

The raw (x, y, z) coordinates from the simulation do not respect the translational and rotational symmetry of our problem. A Featurizer transforms cartesian coordinates into other representations. Here we use the DihedralFeaturizer to turn our data into phi and psi dihedral angles. Observe that the 264*3-dimensional space is reduced to 84 dimensions.

In [4]:
# Remember '\' is the line-continuation marker
# You can enter this command on one line
! msmb DihedralFeaturizer \
    --out featurizer.pkl  \
    --transformed diheds  \
    --top fs_peptide/fs-peptide.pdb \
    --trjs "fs_peptide/*.xtc" \
    --stride 10
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
DihedralFeaturizer(sincos=True, types=['phi', 'psi'])
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/mdtraj/formats/pdb/pdbfile.py:196: UserWarning: Unlikely unit cell vectors detected in PDB file likely resulting from a dummy CRYST1 record. Discarding unit cell vectors.
  warnings.warn('Unlikely unit cell vectors detected in PDB file likely '


Saving transformed dataset to 'diheds/'
To load this dataset interactive inside an IPython
shell or notebook, run

  $ ipython
  >>> from msmbuilder.dataset import dataset
  >>> ds = dataset('diheds/')

Saving "featurizer.pkl"... (<class 'msmbuilder.featurizer.featurizer.DihedralFeaturizer'>)
To load this DihedralFeaturizer object interactively inside an IPython
shell or notebook, run: 

  $ ipython
  >>> from msmbuilder.utils import load
  >>> model = load('featurizer.pkl')

Preprocessing

Since the range of values in our raw data can vary widely from feature to feature, we can scale values to reduce bias. Here we use the RobustScaler to center and scale our dihedral angles by their respective interquartile ranges.

In [5]:
! msmb RobustScaler \
    -i diheds \
    --transformed scaled_diheds.h5
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
RobustScaler(copy=True, quantile_range=(25.0, 75.0), with_centering=True,
       with_scaling=True)
Fitting model...
*********
*RESULTS*
*********
NotImplemented
--------------------------------------------------------------------------------


Saving transformed dataset to 'scaled_diheds.h5'
To load this dataset interactive inside an IPython
shell or notebook, run

  $ ipython
  >>> from msmbuilder.dataset import dataset
  >>> ds = dataset('scaled_diheds.h5')

Intermediate kinetic model: tICA

tICA is similar to principal component analysis (see "tICA vs. PCA" example). Note that the 84-dimensional space is reduced to 4 dimensions.

In [6]:
! msmb tICA -i scaled_diheds.h5 \
    --out tica_model.pkl \
    --transformed tica_trajs.h5 \
    --n_components 4 \
    --lag_time 2
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
tICA(commute_mapping=False, kinetic_mapping=False, lag_time=2, n_components=4,
   shrinkage=None)
Fitting model...
*********
*RESULTS*
*********
time-structure based Independent Components Analysis (tICA)
-----------------------------------------------------------
n_components        : 4
shrinkage           : 0.0011725243789861743
lag_time            : 2
kinetic_mapping     : False

Top 5 timescales :
[ 288.73719263  158.84187298  133.35210428  129.88067247]

Top 5 eigenvalues :
[ 0.99309722  0.9874878   0.98511402  0.9847192 ]

--------------------------------------------------------------------------------


Saving transformed dataset to 'tica_trajs.h5'
To load this dataset interactive inside an IPython
shell or notebook, run

  $ ipython
  >>> from msmbuilder.dataset import dataset
  >>> ds = dataset('tica_trajs.h5')

Saving "tica_model.pkl"... (<class 'msmbuilder.decomposition.tica.tICA'>)
To load this tICA object interactively inside an IPython
shell or notebook, run: 

  $ ipython
  >>> from msmbuilder.utils import load
  >>> model = load('tica_model.pkl')

tICA Histogram

We can histogram our data projecting along the two slowest degrees of freedom (as found by tICA). You have to do this in a python script.

In [7]:
from msmbuilder.dataset import dataset
ds = dataset('tica_trajs.h5')

%matplotlib inline
import msmexplorer as msme
import numpy as np
txx = np.concatenate(ds)
_ = msme.plot_histogram(txx)
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/IPython/html.py:14: ShimWarning: The `IPython.html` package has been deprecated. You should import from `notebook` instead. `IPython.html.widgets` has moved to `ipywidgets`.
  "`IPython.html.widgets` has moved to `ipywidgets`.", ShimWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)

Clustering

Conformations need to be clustered into states (sometimes written as microstates). We cluster based on the tICA projections to group conformations that interconvert rapidly. Note that we transform our trajectories from the 4-dimensional tICA space into a 1-dimensional cluster index.

In [8]:
! msmb MiniBatchKMeans -i tica_trajs.h5 \
    --transformed labeled_trajs.h5 \
    --out clusterer.pkl \
    --n_clusters 100 \
    --random_state 42
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
MiniBatchKMeans(batch_size=100, compute_labels=True, init='k-means++',
        init_size=None, max_iter=100, max_no_improvement=10,
        n_clusters=100, n_init=3, random_state=42, reassignment_ratio=0.01,
        tol=0.0, verbose=0)
Fitting model...
*********
*RESULTS*
*********
NotImplemented
--------------------------------------------------------------------------------


Saving transformed dataset to 'labeled_trajs.h5'
To load this dataset interactive inside an IPython
shell or notebook, run

  $ ipython
  >>> from msmbuilder.dataset import dataset
  >>> ds = dataset('labeled_trajs.h5')

Saving "clusterer.pkl"... (<class 'msmbuilder.cluster.MiniBatchKMeans'>)
To load this MiniBatchKMeans object interactively inside an IPython
shell or notebook, run: 

  $ ipython
  >>> from msmbuilder.utils import load
  >>> model = load('clusterer.pkl')

MSM

We can construct an MSM from the labeled trajectories

In [9]:
! msmb MarkovStateModel -i labeled_trajs.h5 \
    --out msm.pkl \
    --lag_time 2
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)
MarkovStateModel(ergodic_cutoff='on', lag_time=2, n_timescales=None,
         prior_counts=0, reversible_type='mle', sliding_window=True,
         verbose=True)
MSM contains 1 strongly connected component above weight=0.50. Component 0 selected, with population 100.000000%
*********
*RESULTS*
*********
Markov state model
------------------
Lag time         : 2
Reversible type  : mle
Ergodic cutoff   : on
Prior counts     : 0

Number of states : 100
Number of nonzero entries in counts matrix : 841 (8.41%)
Nonzero counts matrix entries:
    Min.   : 0.5
    1st Qu.: 0.5
    Median : 2.0
    Mean   : 16.6
    3rd Qu.: 8.0
    Max.   : 378.5

Total transition counts :
    13972.0 counts
Total transition counts / lag_time:
    6986.0 units
Timescales:
    [1828.40, 1249.05, 572.64, 311.84, 304.83, 298.62, 229.76, 204.46, 139.46, 132.14, 118.04, 110.47, 105.30, 79.05, 69.94, 66.07, 61.18, 53.25, 48.55, 46.22, 42.69, 41.32, 40.70, 37.78, 32.36, 29.14, 28.35, 24.80, 23.65, 23.04, 21.88, 21.11, 19.13, 18.44, 17.77, 16.34, 15.73, 15.40, 14.70, 13.35, 12.97, 9.94, 9.92, 9.70, 9.09, 8.79, 8.69, 8.60, 8.18, 7.49, 7.08, 7.01, 6.87, 6.72, 6.49, 6.35, 6.22, 6.17, 5.89, 5.61, 5.56, 5.50, 5.33, 5.28, 5.23, 4.91, 4.81, 4.68, 4.60, 4.28, 4.26, 4.07, 4.05, 3.98, 3.71, 3.36, 3.27, 2.96, 2.63, 2.51, 2.47, 2.41, 2.28, 2.27, 2.18, 2.08, 2.07, 1.77, 1.74, 1.32, 1.20, 1.20, 1.18, 1.00, 0.98, 0.70, 0.59, 0.55, 0.36]  units

--------------------------------------------------------------------------------
Saving "msm.pkl"... (<class 'msmbuilder.msm.msm.MarkovStateModel'>)
To load this MarkovStateModel object interactively inside an IPython
shell or notebook, run: 

  $ ipython
  >>> from msmbuilder.utils import load
  >>> model = load('msm.pkl')

Plot Free Energy Landscape

Subsequent plotting and analysis should be done from Python

In [10]:
from msmbuilder.utils import load
msm = load('msm.pkl')
clusterer = load('clusterer.pkl')

assignments = clusterer.partial_transform(txx)
assignments = msm.partial_transform(assignments)

from matplotlib import pyplot as plt
msme.plot_free_energy(txx, obs=(0, 1), n_samples=10000,
                      pi=msm.populations_[assignments],
                      xlabel='tIC 1', ylabel='tIC 2')
plt.scatter(clusterer.cluster_centers_[msm.state_labels_, 0],
            clusterer.cluster_centers_[msm.state_labels_, 1],
            s=1e4 * msm.populations_,       # size by population
            c=msm.left_eigenvectors_[:, 1], # color by eigenvector
            cmap="coolwarm",
            zorder=3
           ) 
plt.colorbar(label='First dynamical eigenvector')
plt.tight_layout()
/home/travis/miniconda3/envs/docenv/lib/python3.5/site-packages/matplotlib/font_manager.py:1297: UserWarning: findfont: Font family ['sans-serif'] not found. Falling back to DejaVu Sans
  (prop.get_family(), self.defaultFamily[fontext]))

(Fs-Peptide-command-line.ipynb; Fs-Peptide-command-line.eval.ipynb; Fs-Peptide-command-line.py)