Clustering

Clustering MD trajectories groups the data [1] into a set of clusters such that conformations in the same cluster are structurally similar to one another, and conformations in different clusters are structurally distinct. The questions that arise are

  1. How should “structurally similar” be defined? What distance metric should be used?
  2. Given the distance metric, what algorithm should be used to actually cluster the data?

On point 1, there is no consensus in the protein MD literature. Popular distance metrics include cartesian root-mean-squared deviation of atomic positions (RMSD) [3], distances based on the number of native contacts formed, distances based on the difference in backbone dihedral angles, and probably others.

On point 2, “Optimal” clustering is NP-hard [2], so there’s usually a tradeoff between clustering quality and computational cost. For that reason, MSMBuilder has a variety of different clustering algorithms implemented.

Algorithms

All clustering algorithms in MSMBuilder follow the following basic API. Hyperparameters, including the number of clusters, random seeds, the distance metric (if applicable) are passed to the class constructor. Then, the computation is done by calling fit(sequences). The argument to fit should be a list of molecular dynamics trajectories or a list of 2D numpy arrays, each of shape (length_of_trajecotry, n_features).

KCenters K-Centers clustering
KMeans K-Means clustering
KMedoids K-Medoids clustering
MiniBatchKMedoids Mini-Batch K-Medoids clustering.
RegularSpatial Regular spatial clustering.
LandmarkAgglomerative Landmark-based agglomerative hierarchical clustering
AffinityPropagation Perform Affinity Propagation Clustering of data.
GMM Gaussian Mixture Model
MeanShift Mean shift clustering using a flat kernel.
MiniBatchKMeans Mini-Batch K-Means clustering
SpectralClustering Apply clustering to a projection to the normalized laplacian.
Ward

References

[1]The “data”, for MD, refers to snapshots of the structure of a molecular system at a given time point – i.e the set of cartesian coordinates for all the atoms, or some mathematical transformation thereof.
[2]Aloise, Daniel, et al. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning 75.2 (2009): 245-248.
[3]http://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions