Ligand Featurization¶

Featurizing Ligand-Protein trajectories¶

Trajectories containing a protein and ligand can now be featurized in several ways. A reference frame with at least two chains, one of which is the protein and one of which is the ligand, is required for all featurizations. These chains can be manually specified by their indexes or MSMBuilder can guess which trajectory is the protein (by choosing the longest CA-containing chain) and which is the ligand (by choosing the longest chain containing up to 200 atoms; tie goes to the lower index).

Here we explore Ligand-Protein contact featurizations and their binary transforms as well as RMSD calculations with customizable alignment and calcuation indices.

Generate a toy trajectory¶

In [1]:

%matplotlib inline
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import mdtraj as md

top = md.Topology()
c = top.add_chain()

r0 = top.add_residue('HET', c)
r1 = top.add_residue('HET', c)
r2 = top.add_residue('HET', c)
r3 = top.add_residue('HET', c)
r4 = top.add_residue('HET', c)
r5 = top.add_residue('HET', c)
r6 = top.add_residue('HET', c)
r7 = top.add_residue('HET', c)
r8 = top.add_residue('HET', c)
r9 = top.add_residue('HET', c)
residues = [r0,r1,r2,r3,r4,r5,r6,r7,r8,r9]

c_ligand = top.add_chain()
r_ligand = top.add_residue('HET', c_ligand)

for _ in range(10):
    for _, res in enumerate(residues):
        top.add_atom('CA', md.element.carbon, res)
for _ in range(10):
    top.add_atom('CA', md.element.carbon, r_ligand)
traj = md.Trajectory(xyz=np.random.uniform(size=(100, 110, 3)),
                     topology=top,
                     time=np.arange(100))
ref = md.Trajectory(xyz=np.random.uniform(size=(1, 110, 3)),
                    topology=top,
                    time=np.arange(1))

Identify residues characterizing a binding pocket with respect to a reference structure¶

In [2]:

from msmbuilder.featurizer import LigandContactFeaturizer
from msmbuilder.featurizer import BinaryLigandContactFeaturizer

feat = LigandContactFeaturizer(reference_frame=ref, binding_pocket=0.1)
df = pd.DataFrame(feat.describe_features(ref))
df

/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/sklearn/grid_search.py:42: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)

Out[2]:

	atominds	featuregroup	featurizer	otherinfo	resids	resnames	resseqs
0	N/A	closest-heavy	Ligand Contact	N/A	[3, 10]	[HET, HET]	[3, 10]
1	N/A	closest-heavy	Ligand Contact	N/A	[6, 10]	[HET, HET]	[6, 10]
2	N/A	closest-heavy	Ligand Contact	N/A	[8, 10]	[HET, HET]	[8, 10]

In [3]:

pocket_contacts = feat.transform(traj)
print("Number of frames is {}".format(len(pocket_contacts)))
print("Number of features is {}".format(pocket_contacts[0].shape[1]))

Number of frames is 100
Number of features is 3

/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/msmbuilder-3.9.0.dev0-py3.6-linux-x86_64.egg/msmbuilder/featurizer/multichain.py:250: UserWarning: The topology of the trajectory is notthe same as that of the reference frame,which might give meaningless results.
  "which might give meaningless results.")

Create a histogram of instances each residue is within a certain cutoff distance of the ligand¶

In [4]:

feat = BinaryLigandContactFeaturizer(reference_frame=ref, cutoff=0.1)
pocket_bins = feat.transform(traj)
print("Number of residues is {}".format(pocket_bins[0].shape[1]))

/home/travis/miniconda3/envs/docenv/lib/python3.6/site-packages/msmbuilder-3.9.0.dev0-py3.6-linux-x86_64.egg/msmbuilder/featurizer/multichain.py:250: UserWarning: The topology of the trajectory is notthe same as that of the reference frame,which might give meaningless results.
  "which might give meaningless results.")

Number of residues is 10

In [5]:

count_list = []
for res in range(pocket_bins[0].shape[1]):
    count_list.append(sum([pocket_bins[i][0][res]
                       for i in range(len(pocket_bins))]))

In [6]:

fig = plt.figure(figsize=(6,5))
plt.title('Instances within {} nm cutoff'.format(feat.cutoff),fontsize=18)
plt.bar(range(pocket_bins[0].shape[1]),count_list)
plt.ylabel('Counts', fontsize=16)
plt.xlabel('Residue index', fontsize=16)
plt.xticks(np.linspace(0.4,10.4,pocket_bins[0].shape[1]+1),range(pocket_bins[0].shape[1]))
plt.tight_layout()

Using the LigandRMSDFeaturizer¶

Compute the RMSD of each frame in the trajectory to each frame in a reference trajectory for any set of alignment indices and any set of indices to use for the RMSD calculation. By default, structures are aligned by the protein atoms and the RMSD is calculated for ligand atoms.

In [7]:

from msmbuilder.featurizer import LigandRMSDFeaturizer
feat = LigandRMSDFeaturizer(reference_frame=ref, reference_traj=traj[0:2])
rmsds = feat.transform([traj])
print(rmsds[0][:2])

[[  9.92634330e-09   7.18251407e-01]
 [  7.18251348e-01   8.84589753e-08]]

Specific indices of the ligand and protein can be specified for alignment and calculation. If no reference trajectory is provided, the reference frame is used.

In [8]:

feat_indices = LigandRMSDFeaturizer(reference_frame=ref, align_indices=range(50),
                                   calculate_indices=[105])
rmsds_indices = feat_indices.transform([traj])
print(rmsds_indices[0][:2])

[[ 0.54333776]
 [ 0.72819024]]

Custom indices can also be provided. For example, here we have aligned by the protein (this is the default option but has been enumerated here for clarity) but calculated the RMSD based on all atoms in the reference frame.

In [9]:

feat_custom = LigandRMSDFeaturizer(reference_frame=ref, align_by='protein',
                        calculate_for='custom', calculate_indices=range(ref.n_atoms))
rmsds_custom = feat_custom.transform([traj])
print(rmsds_custom[0][:2])

[[ 0.65306544]
 [ 0.66557592]]

Using all atoms for both aligning and calculating RMSD is equivalent to mdtraj's implementation of RMSD calculations.

In [10]:

feat_mdtraj = LigandRMSDFeaturizer(reference_frame=ref, align_by='custom',
                                  align_indices=range(ref.n_atoms), calculate_for='custom',
                                  calculate_indices=range(ref.n_atoms))
rmsds_mdtraj = feat_mdtraj.transform([traj])
real_mdtraj = md.rmsd(traj, ref, frame=0)
print("multichain implementation:\t {}, ...".format((rmsds_mdtraj[0][0][0],
                                                   rmsds_mdtraj[0][1][0])))
print("mdtraj implementation:\t\t {}, ...".format((real_mdtraj[0], real_mdtraj[1])))

multichain implementation:	 (0.65146613121032715, 0.66455864906311035), ...
mdtraj implementation:		 (0.65146607, 0.66455859), ...

(Ligand-Featurization.ipynb; Ligand-Featurization.eval.ipynb; Ligand-Featurization.py)