Featurization and Distance Metrics

Background

Many analyses, including KMeans clustering and tICA, require the input data to be vectors in a Euclidean vector space. Other analyses, such as KCenters clustering, can also work with data that are not vectors, provided that a pairwise distance metric is supplied.

One complication in featurizing molecular dynamics trajectories is that the system is generally free to tumble (rotate) in 3D during a simulation, and this tumbling happens on a fast timescale. For a protein in bulk solvent, there is also no special rotational reference frame. It is therefore often desirable to remove rotational motion, either through the featurization itself or through a distance metric that is insensitive to rotations. On the featurization side, this can be done by using internal coordinates, such as interatomic distances or dihedral angles.
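The effect of tumbling on a featurization can be illustrated with a small NumPy sketch. It is not MSMBuilder code: the helper names `random_rotation` and `pair_distances`, the toy 5-atom frame, and the chosen atom pairs are all assumptions made for illustration. The sketch rotates and translates one conformation and then compares raw coordinates with interatomic distances.

```python
import numpy as np

def random_rotation(rng):
    """Return a random proper 3x3 rotation matrix (determinant +1)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # fix the sign ambiguity of the QR factors
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]        # flip one axis to turn a reflection into a rotation
    return q

def pair_distances(xyz, pairs):
    """Distances between the given atom pairs for one frame of shape (n_atoms, 3)."""
    diff = xyz[pairs[:, 0]] - xyz[pairs[:, 1]]
    return np.linalg.norm(diff, axis=1)

rng = np.random.default_rng(0)
frame = rng.normal(size=(5, 3))                 # toy 5-atom conformation (hypothetical)
pairs = np.array([[0, 1], [0, 4], [2, 3]])      # hypothetical atom pairs of interest
rotated = frame @ random_rotation(rng).T + 1.5  # same conformation, tumbled and shifted

# Raw Cartesian coordinates change under tumbling; internal coordinates do not.
raw_changed = not np.allclose(frame, rotated)
internal_same = np.allclose(pair_distances(frame, pairs),
                            pair_distances(rotated, pairs))
print(raw_changed, internal_same)
```

In this sketch the raw coordinates of the two frames differ even though they describe the same conformation, while the pair distances agree to within floating-point error.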

Featurizations

`AtomPairsFeaturizer`(pair_indices[, ...])	Featurizer based on distances between specified pairs of atoms.
`ContactFeaturizer`([contacts, scheme, ...])	Featurizer based on residue-residue distances.
`DRIDFeaturizer`([atom_indices])	Featurizer based on distribution of reciprocal interatomic distances.
`DihedralFeaturizer`([types, sincos])	Featurizer based on dihedral angles.
`GaussianSolventFeaturizer`(solute_indices, ...)	Featurizer on weighted pairwise distance between solute and solvent.
`RMSDFeaturizer`(trj0[, atom_indices])	Featurizer based on RMSD to a series of reference frames.
`RawPositionsFeaturizer`([atom_indices, ref_traj])	Featurize an MD trajectory into a vector space with the raw Cartesian coordinates.
`SuperposeFeaturizer`(atom_indices, reference_traj)	Featurizer based on Euclidean atom distances to a reference structure.
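The `sincos` argument listed for `DihedralFeaturizer` points at a common issue with angular features: angles are periodic, so two nearly identical dihedrals such as 179° and -179° look far apart as raw numbers. The sketch below is a plain NumPy illustration of a sine/cosine encoding under that assumption. It is not the MSMBuilder implementation, and the helper names `dihedral` and `sincos_features` are hypothetical.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in radians defined by four points."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    v = b0 - np.dot(b0, b1) * b1      # part of b0 perpendicular to the central bond
    w = b2 - np.dot(b2, b1) * b1      # part of b2 perpendicular to the central bond
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.arctan2(y, x)

def sincos_features(angles):
    """Encode angles as [sin(angles), cos(angles)] so that periodicity is respected."""
    angles = np.asarray(angles, dtype=float)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

# Two almost identical dihedral angles on either side of the +/-180 degree boundary.
a = np.radians([179.0])
b = np.radians([-179.0])

raw_gap = np.abs(a - b)[0]                                         # large as raw numbers
enc_gap = np.linalg.norm(sincos_features(a) - sincos_features(b))  # small after encoding
print(raw_gap, enc_gap)
```

With the raw angles, a Euclidean method such as KMeans would treat these two frames as distant. After the sine/cosine encoding they are close, at the cost of doubling the number of features.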

Distance Metrics

Some clustering methods let you pass in a custom distance metric. In that case, the input to fit() may be a list of MD trajectories instead of a list of NumPy arrays. Clustering methods that currently allow this include KCenters and LandmarkAgglomerative. See their documentation for details.
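To show why a pairwise metric is enough for this kind of clustering, here is a minimal NumPy sketch of a greedy k-centers procedure driven by a rotation-insensitive distance (RMSD after optimal superposition, computed with the Kabsch algorithm). It is a generic illustration, not MSMBuilder's KCenters class. The functions `kabsch_rmsd`, `k_centers`, and `random_rotation` and the toy conformations are assumptions, and plain arrays stand in for MD trajectory frames.

```python
import numpy as np

def random_rotation(rng):
    """Return a random proper 3x3 rotation matrix (determinant +1)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def kabsch_rmsd(a, b):
    """RMSD between two (n_atoms, 3) frames after optimal translation and rotation."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    u, s, vt = np.linalg.svd(a.T @ b)
    d = np.sign(np.linalg.det(u @ vt))          # guard against improper rotations
    total = (a ** 2).sum() + (b ** 2).sum()
    msd = (total - 2.0 * (s[0] + s[1] + d * s[2])) / len(a)
    return np.sqrt(max(msd, 0.0))               # clip tiny negative rounding errors

def k_centers(items, n_clusters, metric, seed_index=0):
    """Greedy k-centers: promote the item farthest from its nearest center.

    Only pairwise distances are used, so the items need not be vectors.
    """
    centers = [seed_index]
    dist = np.array([metric(items[seed_index], x) for x in items])
    labels = np.zeros(len(items), dtype=int)
    while len(centers) < n_clusters:
        new = int(np.argmax(dist))
        centers.append(new)
        d_new = np.array([metric(items[new], x) for x in items])
        closer = d_new < dist
        labels[closer] = len(centers) - 1       # reassign items nearer to the new center
        dist = np.minimum(dist, d_new)
    return centers, labels

rng = np.random.default_rng(0)
conf_a = rng.normal(size=(6, 3))                # two distinct toy conformations
conf_b = 3.0 * rng.normal(size=(6, 3))
frames = [c @ random_rotation(rng).T            # each frame is a tumbled copy
          for c in (conf_a, conf_a, conf_b, conf_b, conf_a)]

centers, labels = k_centers(frames, n_clusters=2, metric=kabsch_rmsd)
print(centers, labels)
```

Because the metric removes rotation before comparing frames, tumbled copies of the same conformation end up in the same cluster without any explicit featurization step.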