Featurization and Distance Metrics

Background

Many analyses require that the input data be vectors in a (Euclidean) vector space. These include KMeans clustering and tICA, among others. Other analyses, such as KCenters clustering, can work with data that are not vectors, provided that a pairwise distance metric is supplied.

One complication in featurizing molecular dynamics trajectories is that the system is generally free to tumble (rotate) in 3D during a simulation, and this tumbling can be fast relative to the conformational dynamics of interest. For a protein in bulk solvent, there’s no special rotational reference frame either. It is therefore often desirable to remove rotational motion, either with a featurization or with a distance metric that is insensitive to rotations. One way to do this is to featurize with internal coordinates, such as interatomic distances or dihedral angles, which do not change when the whole system rotates.
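
The rotational insensitivity of internal coordinates can be checked numerically. The following is a minimal sketch using only NumPy, not MSMBuilder's own code: the helper names `pairwise_distances` and `rotation_about_z` and the random 5-atom "conformation" are assumptions made for illustration.

```python
import numpy as np

def pairwise_distances(xyz):
    """Return the unique interatomic distances of an (n_atoms, 3) array."""
    diffs = xyz[:, None, :] - xyz[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(xyz), k=1)  # each atom pair once
    return dist[i, j]

def rotation_about_z(theta):
    """3x3 matrix for a rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
frame = rng.normal(size=(5, 3))            # hypothetical 5-atom conformation
tumbled = frame @ rotation_about_z(1.2).T  # same conformation, rotated

# The raw Cartesian coordinates differ after the rotation...
raw_changed = not np.allclose(frame, tumbled)
# ...but the distance-based features are identical.
features_match = np.allclose(pairwise_distances(frame),
                             pairwise_distances(tumbled))
```

A clustering run on the raw coordinates would treat `frame` and `tumbled` as different points, whereas one run on the distance features would treat them as the same conformation.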

Featurizations

AtomPairsFeaturizer(pair_indices[, ...]) Featurizer based on distances between specified pairs of atoms.
ContactFeaturizer([contacts, scheme, ...]) Featurizer based on residue-residue distances.
DRIDFeaturizer([atom_indices]) Featurizer based on distribution of reciprocal interatomic distances.
DihedralFeaturizer([types, sincos]) Featurizer based on dihedral angles.
GaussianSolventFeaturizer(solute_indices, ...) Featurizer on weighted pairwise distance between solute and solvent.
RMSDFeaturizer(trj0[, atom_indices]) Featurizer based on RMSD to a series of reference frames.
RawPositionsFeaturizer([atom_indices, ref_traj]) Featurize an MD trajectory into a vector space with the raw
SuperposeFeaturizer(atom_indices, reference_traj) Featurizer based on Euclidean atom distances to reference structure.
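
The `sincos` option listed for DihedralFeaturizer suggests encoding each angle by its sine and cosine. The sketch below illustrates why such an encoding is useful for periodic quantities. It is a NumPy-only illustration under stated assumptions: `sincos_features` is a hypothetical helper, and its column ordering is not claimed to match the library's output.

```python
import numpy as np

def sincos_features(angles):
    """Map (n_frames, n_angles) radians to (n_frames, 2 * n_angles) features."""
    angles = np.asarray(angles, dtype=float)
    return np.hstack([np.sin(angles), np.cos(angles)])

# Two hypothetical frames whose single dihedral differs by only 2 degrees,
# but which sit on opposite sides of the +/-180 degree wrap-around.
a = np.deg2rad([[179.0]])
b = np.deg2rad([[-179.0]])

# Comparing raw angles makes the frames look almost maximally different.
raw_gap = np.abs(a - b).item()

# After the sin/cos encoding, the Euclidean distance is small, as expected
# for two nearly identical conformations.
encoded_gap = np.linalg.norm(sincos_features(a) - sincos_features(b))
```

Because methods such as KMeans and tICA treat their inputs as Euclidean vectors, an encoding like this keeps nearby angles nearby in feature space.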

Distance Metrics

Some clustering methods let you pass in a custom distance metric. In that case, the input to fit() may be a list of MD trajectories instead of a list of NumPy arrays. The clustering methods that currently allow this include KCenters and LandmarkAgglomerative. See their documentation for details.
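
To show why a pairwise metric is enough for this kind of clustering, here is a minimal sketch of a greedy k-centers procedure written in plain Python. It is not MSMBuilder's implementation: the function `k_centers`, the toy metric `circular_distance`, and the angle data are all assumptions for illustration. With trajectories, the items would be frames and the metric would be something rotation-insensitive.

```python
def k_centers(items, n_clusters, metric):
    """Greedy k-centers: each new center is the item farthest from all
    centers chosen so far. Only metric(a, b) is ever called on the items,
    so they do not need to be vectors."""
    if not 1 <= n_clusters <= len(items):
        raise ValueError("n_clusters must be between 1 and len(items)")
    centers = [0]  # arbitrary seed: the first item
    dist_to_nearest = [metric(items[0], x) for x in items]
    labels = [0] * len(items)
    while len(centers) < n_clusters:
        # Pick the item that is worst served by the current centers.
        new = max(range(len(items)), key=dist_to_nearest.__getitem__)
        centers.append(new)
        # Reassign any item that is strictly closer to the new center.
        for i, x in enumerate(items):
            d = metric(items[new], x)
            if d < dist_to_nearest[i]:
                dist_to_nearest[i] = d
                labels[i] = len(centers) - 1
    return centers, labels

def circular_distance(a, b):
    """Hypothetical metric: shortest angular separation in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

angles = [0.0, 5.0, 355.0, 180.0, 175.0, 90.0]
centers, labels = k_centers(angles, n_clusters=3, metric=circular_distance)
```

Here 355 degrees is grouped with 0 and 5 degrees because the metric accounts for wrap-around, which a plain Euclidean comparison of the raw numbers would not do.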
