Featurization

Many algorithms require that the input data be vectors in a (Euclidean) vector space. Examples include KMeans clustering and tICA.

Since an MD simulation usually has no special rotational or translational reference frame, it’s often desirable to remove rotational and translational motion by choosing a featurization that is insensitive to rotations and translations.
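To make the invariance concrete, here is a minimal NumPy sketch (not the library's own code) of a distance-based featurization. The helper names pair_distances and random_rigid_motion and the synthetic coordinates are assumptions made up for this illustration. It computes distances between chosen atom pairs for every frame, then checks that rotating and translating the coordinates leaves the features unchanged.

```python
import numpy as np

def pair_distances(xyz, pairs):
    """Distances between selected atom pairs for every frame.

    xyz   : array of shape (n_frames, n_atoms, 3)
    pairs : array of shape (n_pairs, 2) holding atom indices
    Returns an array of shape (n_frames, n_pairs).
    """
    xyz = np.asarray(xyz, dtype=float)
    pairs = np.asarray(pairs, dtype=int)
    delta = xyz[:, pairs[:, 0], :] - xyz[:, pairs[:, 1], :]
    return np.linalg.norm(delta, axis=-1)

def random_rigid_motion(rng):
    """Return a random proper rotation matrix and a translation vector."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1  # flip one axis so det = +1 (a rotation, not a reflection)
    return q, rng.normal(size=3)

rng = np.random.default_rng(0)
xyz = rng.normal(size=(5, 4, 3))            # 5 frames, 4 atoms (synthetic data)
pairs = np.array([[0, 1], [0, 2], [1, 3]])  # hypothetical atom pairs

rotation, shift = random_rigid_motion(rng)
moved = xyz @ rotation.T + shift            # same rigid motion applied to each frame

features = pair_distances(xyz, pairs)
features_moved = pair_distances(moved, pairs)
print(np.allclose(features, features_moved))
```

The raw coordinates differ between xyz and moved, but the feature arrays agree to floating-point precision, which is the property a rotation- and translation-insensitive featurization is meant to provide.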

Featurizations

AtomPairsFeaturizer(pair_indices[, ...]) Featurizer based on distances between specified pairs of atoms.
ContactFeaturizer([contacts, scheme, ...]) Featurizer based on residue-residue distances.
DRIDFeaturizer([atom_indices]) Featurizer based on distribution of reciprocal interatomic distances (DRID).
DihedralFeaturizer([types, sincos]) Featurizer based on dihedral angles.
GaussianSolventFeaturizer(solute_indices, ...) Featurizer on weighted pairwise distance between solute and solvent.
RMSDFeaturizer([reference_traj, ...]) Featurizer based on RMSD to one or more reference structures.
RawPositionsFeaturizer([atom_indices, ref_traj]) Featurize an MD trajectory into a vector space with the raw
SuperposeFeaturizer(atom_indices, reference_traj) Featurizer based on Euclidean atom distances to a reference structure.
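
The sincos option listed for DihedralFeaturizer is worth a short illustration. Angles are periodic, so two nearly identical conformations can have dihedral values close to +pi and -pi that look far apart as raw numbers. The sketch below is plain NumPy, not the library's implementation; the helper name sincos_features and the column ordering are assumptions. It maps each angle to its sine and cosine and compares the resulting distances.

```python
import numpy as np

def sincos_features(angles):
    """Map angles in radians, shape (n_frames, n_angles), to
    [sin, cos] features of shape (n_frames, 2 * n_angles)."""
    angles = np.asarray(angles, dtype=float)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

# Two frames whose single dihedral differs by only 0.02 rad
# across the +pi / -pi boundary.
angles = np.array([[np.pi - 0.01], [-np.pi + 0.01]])

raw_gap = np.linalg.norm(angles[0] - angles[1])          # nearly 2*pi
features = sincos_features(angles)
sincos_gap = np.linalg.norm(features[0] - features[1])   # roughly 0.02

print(raw_gap, sincos_gap)
```

With raw angles the two frames appear almost 2*pi apart. After the sine/cosine mapping their Euclidean distance is about the true angular difference, which is what a downstream method such as KMeans or tICA needs.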

Alternative to Featurization

Many algorithms require vectorizable data. Other algorithms require only a pairwise distance metric, e.g. RMSD between two protein conformations. In general, you can always define a pairwise distance among vectorized data, but pairwise distances alone are generally not enough to embed the data into a vector space.
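
As an illustration of a metric that works directly on conformations, here is a small NumPy sketch of minimal RMSD after optimal superposition (the Kabsch approach). It is an assumption-laden example rather than the routine the library uses: the function name minimal_rmsd and the synthetic coordinates are made up for this sketch, and it assumes both conformations list the same atoms in the same order.

```python
import numpy as np

def minimal_rmsd(a, b):
    """RMSD between two conformations after optimal superposition.

    a, b : arrays of shape (n_atoms, 3) with atoms in the same order.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a0 = a - a.mean(axis=0)   # remove translation
    b0 = b - b.mean(axis=0)
    # Singular values of the covariance matrix give the best achievable overlap.
    u, s, vt = np.linalg.svd(a0.T @ b0)
    if np.linalg.det(u @ vt) < 0:
        s[-1] = -s[-1]        # allow proper rotations only, no reflections
    msd = (np.sum(a0 ** 2) + np.sum(b0 ** 2) - 2.0 * s.sum()) / len(a)
    return float(np.sqrt(max(msd, 0.0)))

rng = np.random.default_rng(1)
conf_a = rng.normal(size=(10, 3))   # synthetic 10-atom conformation
conf_c = rng.normal(size=(10, 3))   # an unrelated conformation

# Build a rotated and translated copy of conf_a.
theta = 0.7
rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0, 0.0, 1.0]])
conf_b = conf_a @ rot_z.T + np.array([1.0, -2.0, 0.5])

d_ab = minimal_rmsd(conf_a, conf_b)   # essentially zero: same shape
d_ac = minimal_rmsd(conf_a, conf_c)   # clearly positive: different shapes
d_ca = minimal_rmsd(conf_c, conf_a)   # same value as d_ac
print(d_ab, d_ac, d_ca)
```

The function returns a single number per pair of conformations and never produces a feature vector, which is the situation the paragraph above describes.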

Some clustering methods let you use an arbitrary distance metric, including RMSD. In this case, the input to fit() may be a list of MD trajectories instead of a list of NumPy arrays. Clustering methods that currently allow this include KCenters and KMedoids.
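
To show why such methods need only a metric, here is a generic greedy farthest-point sketch of k-centers clustering. It is not the library's KCenters implementation; the function name k_centers, its arguments, and the toy one-dimensional data are assumptions for illustration. The metric is any callable that takes two items, so the items could equally be conformations compared by RMSD.

```python
import numpy as np

def k_centers(items, n_clusters, metric, seed_index=0):
    """Greedy k-centers: repeatedly promote the item farthest from
    all current centers to be a new center.

    Returns (center_indices, labels); labels[i] is the position in
    center_indices of the center nearest to items[i].
    """
    centers = [seed_index]
    dist_to_center = np.array(
        [metric(items[seed_index], x) for x in items], dtype=float)
    labels = np.zeros(len(items), dtype=int)
    while len(centers) < n_clusters:
        new = int(np.argmax(dist_to_center))      # farthest item so far
        centers.append(new)
        d_new = np.array([metric(items[new], x) for x in items], dtype=float)
        closer = d_new < dist_to_center           # items that switch to the new center
        labels[closer] = len(centers) - 1
        dist_to_center = np.minimum(dist_to_center, d_new)
    return centers, labels

# Toy data: numbers on a line, compared by absolute difference.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
centers, labels = k_centers(points, n_clusters=3,
                            metric=lambda a, b: abs(a - b))
print(centers, labels.tolist())
```

The algorithm only ever calls metric(item, other), so it never requires the items to be vectors. Its cost is one pass of metric evaluations per center, which matters when the metric is as expensive as RMSD.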