Geometric Clustering¶

Background¶

The goal of clustering MD trajectories is to group the the data [1] into a set of groups (clusters) such that conformations in the same cluster are structurally similar to one another, and conformations in different clusters are structurally distinct.

The two central issues for clustering MD data are

How should “structurally similar” be defined? What distance metric should be used?
Given the distance metric, what algorithm should be used to actually cluster the data?

On point 1, there is no consensus in the protein MD literature. Popular distance metrics include cartesian root-mean squared deviation of atomic positions (RMSD) [3], distances based on the number of native contacts formed, distances based on the difference in backbone dihedral angles, and probably others.

On point 2, “Optimal” clustering is NP-hard [2], so there’s usually a tradeoff between clustering quality and computational cost. For that reason MSMBuilder has a bunch of different clustering algorithms implemented.

API and Implementation Notes¶

All clustering algorithms in MSMBuilder follow the following basic API. Hyperparameters, including the number of clusters, random seeds, the distance metric (if applicable), etc are passed to the class constructor. Then, the heavy-lifting is done by calling fit(sequences). The argument to fit should be a list of molecular dynamics trajectories or a list of 2D numpy arrays, each of shape (length_of_trajecotry, n_features).

Algorithms¶

`KCenters`([n_clusters, metric, random_state])	K-Centers clustering
`KMeans`([n_clusters, init, n_init, max_iter, ...])	K-Means clustering
`KMedoids`([n_clusters, n_passes, metric, ...])	K-Medoids clustering
`MinibatchKMedoids`
`RegularSpatial`(d_min[, metric])	Regular spatial clustering.
`LandmarkAgglomerative`(n_clusters[, ...])	Landmark-based agglomerative hierarchical clustering

Additional Algorithms¶

`AffinityPropagation`([damping, max_iter, ...])	Perform Affinity Propagation Clustering of data.
`GMM`([n_components, covariance_type, ...])	Gaussian Mixture Model
`MeanShift`([bandwidth, seeds, bin_seeding, ...])	Mean shift clustering using a flat kernel.
`MiniBatchKMeans`([n_clusters, init, ...])	Mini-Batch K-Means clustering
`SpectralClustering`([n_clusters, ...])	Apply clustering to a projection to the normalized laplacian.
`Ward`([n_clusters, memory, connectivity, ...])	Ward hierarchical clustering: constructs a tree and cuts it.

Example¶

import mdtraj as md

# load two trajectories and create a "dataset" with their
# phi dihedral angles
dataset = []
for trajectory_file in ['trj0.xtc', 'trj1.xtc']:
    t = md.load(trajectory_file, top='topology.pdb')
    dataset.append(md.compute_phi(t))

from msmbuilder.cluster import KMeans
cluster = KMeans(n_clusters=10)
cluster.fit(dataset)

print(cluster.labels_)

References¶

[1]	The “data”, for MD, refers to snapshots of the structure of a molecular system at a given time point – i.e the set of cartesian coordinates for all the atoms, or some mathematical transformation thereof.

[2]	Aloise, Daniel, et al. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning 75.2 (2009): 245-248.

[3]	http://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions