Many analyses require that the input data be vectors in a (Euclidean) vector space. This includes KMeans clustering, tICA, and others. Other analyses, like KCenters clustering, can operate on data that are not vectors, provided that a pairwise distance metric is supplied.
One complication in featurizing molecular dynamics trajectories is that, during a simulation, the system is generally free to tumble (rotate) in 3D, and this tumbling happens on a fast timescale. For a protein in bulk solvent, there is also no privileged rotational reference frame. It is therefore often desirable to remove rotational motion, either through the featurization itself or through a distance metric that is insensitive to rotations. One way to do this is to featurize with internal coordinates.
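To make the point about internal coordinates concrete, here is a minimal NumPy sketch, independent of the library's featurizers. It checks that distances between chosen atom pairs do not change when a toy frame is rotated and translated, while the raw coordinates do. The helper names `random_rotation` and `pair_distances` are hypothetical and exist only for this illustration.

```python
import numpy as np

def random_rotation(rng):
    """Return a random 3x3 proper rotation matrix (determinant +1)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix the sign ambiguity of the QR factorization
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]           # flip one axis to turn a reflection into a rotation
    return q

def pair_distances(xyz, pairs):
    """Distances between the atom pairs in `pairs`, an array of shape (n_pairs, 2)."""
    diff = xyz[pairs[:, 0]] - xyz[pairs[:, 1]]
    return np.linalg.norm(diff, axis=1)

rng = np.random.default_rng(0)
xyz = rng.normal(size=(5, 3))                 # one toy frame with 5 atoms
pairs = np.array([[0, 1], [0, 4], [2, 3]])    # illustrative atom pairs

# Tumble and translate the frame, as free diffusion in solvent would.
rotation = random_rotation(rng)
moved = xyz @ rotation.T + rng.normal(size=3)

raw_changed = not np.allclose(xyz, moved)
features_unchanged = np.allclose(pair_distances(xyz, pairs),
                                 pair_distances(moved, pairs))
print(raw_changed, features_unchanged)
```

Any feature built only from interatomic distances or angles inherits this invariance, which is why internal coordinates sidestep the tumbling problem without an explicit alignment step.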
AtomPairsFeaturizer(pair_indices[, ...]) | Featurizer based on distances between specified pairs of atoms. |
ContactFeaturizer([contacts, scheme, ...]) | Featurizer based on residue-residue distances |
DRIDFeaturizer([atom_indices]) | Featurizer based on distribution of reciprocal interatomic |
DihedralFeaturizer([types, sincos]) | Featurizer based on dihedral angles. |
GaussianSolventFeaturizer(solute_indices, ...) | Featurizer on weighted pairwise distance between solute and solvent. |
RMSDFeaturizer(trj0[, atom_indices]) | Featurizer based on RMSD to a series of reference frames. |
RawPositionsFeaturizer([atom_indices, ref_traj]) | Featurize an MD trajectory into a vector space with the raw |
SuperposeFeaturizer(atom_indices, reference_traj) | Featurizer based on Euclidean atom distances to reference structure. |
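The `sincos` argument listed for DihedralFeaturizer points at a general issue with angular features: an angle is periodic, so its raw value is a poor coordinate in a Euclidean space. The sketch below assumes that such a flag maps each angle to its sine and cosine; that is an assumption about the option, and the code does not call the library. `sincos_features` is a hypothetical helper.

```python
import numpy as np

def sincos_features(angles):
    """Map angles of shape (n_frames, n_angles) to shape (n_frames, 2 * n_angles)."""
    angles = np.asarray(angles, dtype=float)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

# Two frames whose single dihedral sits on either side of the +/-180 degree wrap.
angles = np.radians([[179.0], [-179.0]])

# As raw numbers the two angles look almost 2*pi apart.
raw_gap = np.abs(angles[0] - angles[1]).item()

# After encoding, the frames are close, matching their true 2 degree separation.
encoded = sincos_features(angles)
encoded_gap = float(np.linalg.norm(encoded[0] - encoded[1]))

print(raw_gap, encoded_gap)
```

The cost of this encoding is that each angle contributes two features instead of one, which matters when choosing how many dimensions to keep in a later step such as tICA.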
Some clustering methods let you pass in a custom distance metric. In that case, the input to fit() may be a list of MD trajectories instead of a list of NumPy arrays. Clustering methods that currently allow this include KCenters and LandmarkAgglomerative. See their documentation for details.
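To illustrate why a distance metric alone can be enough, the following is a minimal sketch of greedy farthest-point ("k-centers"-style) clustering over arbitrary items. It is not the library's KCenters implementation. The function `k_centers`, the metric `internal_distance`, and the toy frames are all hypothetical names and data introduced for this example. The only thing the algorithm asks of the data is a callable `metric(a, b)`.

```python
import numpy as np

def k_centers(items, n_clusters, metric, seed_index=0):
    """Greedy farthest-point clustering; `metric(a, b)` returns a distance >= 0."""
    centers = [seed_index]
    # Distance from every item to its nearest center chosen so far.
    nearest = np.array([metric(items[seed_index], x) for x in items])
    labels = np.zeros(len(items), dtype=int)
    for k in range(1, n_clusters):
        new_center = int(np.argmax(nearest))   # item farthest from all current centers
        centers.append(new_center)
        d_new = np.array([metric(items[new_center], x) for x in items])
        closer = d_new < nearest
        labels[closer] = k                     # reassign items that prefer the new center
        nearest = np.where(closer, d_new, nearest)
    return centers, labels

def internal_distance(a, b):
    """Rotation-insensitive metric: compare the full interatomic distance matrices."""
    da = np.linalg.norm(a[:, None] - a[None, :], axis=-1)
    db = np.linalg.norm(b[:, None] - b[None, :], axis=-1)
    return float(np.linalg.norm(da - db))

def rot_z(theta):
    """Rotation by `theta` radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Two toy conformations of 4 atoms, each seen in three different orientations.
compact = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
extended = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]], dtype=float)
frames = [shape @ rot_z(t).T for shape in (compact, extended) for t in (0.0, 0.7, 2.1)]

centers, labels = k_centers(frames, n_clusters=2, metric=internal_distance)
print(centers, labels)
```

Because the metric ignores orientation, the rotated copies of each conformation land in the same cluster even though their raw coordinates differ. Each new center costs one pass of metric evaluations over all items, so an expensive metric dominates the run time.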