Geometric Clustering

Background

The goal of clustering MD trajectories is to partition the data [1] into groups (clusters) such that conformations in the same cluster are structurally similar to one another, while conformations in different clusters are structurally distinct.

The two central issues for clustering MD data are

  1. How should “structurally similar” be defined? What distance metric should be used?
  2. Given the distance metric, what algorithm should be used to actually cluster the data?

On point 1, there is no consensus in the protein MD literature. Popular distance metrics include the Cartesian root-mean-square deviation of atomic positions (RMSD) [3], distances based on the number of native contacts formed, and distances based on differences in backbone dihedral angles, among others.
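
To make these metrics concrete, here is a minimal sketch of two of them computed with MDTraj; the trajectory and topology file names are placeholders:

import mdtraj as md
import numpy as np

# load a trajectory (placeholder file names)
t = md.load('trj0.xtc', top='topology.pdb')

# Cartesian RMSD of every frame to frame 0, after optimal superposition
rmsd_to_first = md.rmsd(t, t, frame=0)

# a simple dihedral-based distance: Euclidean distance between the
# phi-angle vectors of two frames (angles are periodic, so a sin/cos
# transform of the raw angles is often preferable in practice)
indices, phi = md.compute_phi(t)
d_01 = np.linalg.norm(phi[0] - phi[1])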

On point 2, “optimal” clustering is NP-hard [2], so there is usually a tradeoff between clustering quality and computational cost. For that reason, MSMBuilder implements a variety of clustering algorithms.

API and Implementation Notes

All clustering algorithms in MSMBuilder share the same basic API. Hyperparameters, including the number of clusters, random seeds, and the distance metric (if applicable), are passed to the class constructor. The heavy lifting is then done by calling fit(sequences). The argument to fit should be a list of molecular dynamics trajectories or a list of 2D numpy arrays, each of shape (length_of_trajectory, n_features).
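
For example, a minimal sketch of this pattern on synthetic data, with each 2D array standing in for one featurized trajectory (KCenters is used here, but the other algorithms below follow the same pattern):

import numpy as np
from msmbuilder.cluster import KCenters

# two synthetic "trajectories", each of shape (length_of_trajectory, n_features)
trj0 = np.random.randn(100, 3)
trj1 = np.random.randn(50, 3)

# hyperparameters are passed to the constructor...
cluster = KCenters(n_clusters=5, metric='euclidean', random_state=0)
# ...and the data are passed to fit(), as a list of sequences
cluster.fit([trj0, trj1])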

Algorithms

KCenters([n_clusters, metric, random_state]) K-Centers clustering
KMeans([n_clusters, init, n_init, max_iter, ...]) K-Means clustering
KMedoids([n_clusters, n_passes, metric, ...]) K-Medoids clustering
MiniBatchKMedoids([n_clusters, max_iter, ...]) Mini-Batch K-Medoids clustering
RegularSpatial(d_min[, metric]) Regular spatial clustering.
LandmarkAgglomerative(n_clusters[, ...]) Landmark-based agglomerative hierarchical clustering
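
Several of these estimators take the distance metric as a hyperparameter. Assuming an 'rmsd' metric is available (as the KCenters signature suggests), trajectories can be clustered directly, without featurizing first; the file names below are placeholders:

import mdtraj as md
from msmbuilder.cluster import KCenters

trajectories = [md.load(f, top='topology.pdb')
                for f in ['trj0.xtc', 'trj1.xtc']]

# cluster raw conformations under a pairwise-RMSD metric
cluster = KCenters(n_clusters=10, metric='rmsd', random_state=0)
cluster.fit(trajectories)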

Additional Algorithms

AffinityPropagation([damping, max_iter, ...]) Affinity propagation clustering
GMM([n_components, covariance_type, ...]) Gaussian mixture model clustering
MeanShift([bandwidth, seeds, bin_seeding, ...]) Mean shift clustering using a flat kernel
MiniBatchKMeans([n_clusters, init, ...]) Mini-Batch K-Means clustering
SpectralClustering([n_clusters, ...]) Spectral clustering via a projection of the normalized Laplacian
Ward([n_clusters, memory, connectivity, ...]) Ward hierarchical clustering: constructs a tree and cuts it
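
These estimators mirror their scikit-learn counterparts but, like the algorithms above, accept a list of sequences in fit(). A minimal sketch with MiniBatchKMeans on synthetic data:

import numpy as np
from msmbuilder.cluster import MiniBatchKMeans

# two synthetic featurized "trajectories"
dataset = [np.random.randn(200, 4), np.random.randn(150, 4)]

cluster = MiniBatchKMeans(n_clusters=8)
cluster.fit(dataset)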

Example

import mdtraj as md
from msmbuilder.cluster import KMeans

# load two trajectories and build a "dataset" of their
# phi dihedral angles, one 2D array per trajectory
dataset = []
for trajectory_file in ['trj0.xtc', 'trj1.xtc']:
    t = md.load(trajectory_file, top='topology.pdb')
    indices, phi = md.compute_phi(t)
    dataset.append(phi)

# fit K-Means with 10 clusters on the combined dataset
cluster = KMeans(n_clusters=10)
cluster.fit(dataset)

# labels_ is a list of 1D arrays, one per input trajectory, giving
# the cluster assignment of each frame
print(cluster.labels_)

References

[1] The “data”, for MD, refers to snapshots of the structure of a molecular system at a given time point, i.e. the set of Cartesian coordinates for all the atoms, or some mathematical transformation thereof.
[2] Aloise, Daniel, et al. “NP-hardness of Euclidean sum-of-squares clustering.” Machine Learning 75.2 (2009): 245-248.
[3] http://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions