# Featurization

Many algorithms require that the input data be vectors in a (Euclidean)
vector space. This includes `KMeans` clustering, `tICA`, and others.

Since there’s usually no special rotational or translational reference frame in an MD simulation, it’s often desirable to remove rotational and translational motion via featurization that is insensitive to rotations and translations.
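As a small illustration of this point, the sketch below checks that interatomic distances are unchanged when a conformation is rotated and translated, even though the raw coordinates change. It uses only the Python standard library; the coordinates and the helper names `pairwise_distances` and `rotate_z_then_translate` are hypothetical and not part of any library API.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Return the distances between all atom pairs for one frame.

    coords is a list of (x, y, z) tuples, one per atom.
    """
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def rotate_z_then_translate(coords, angle, shift):
    """Rotate each atom about the z axis by `angle` radians, then add `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]


# A toy three-atom "conformation" (hypothetical coordinates, arbitrary units).
frame = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 1.0)]
moved = rotate_z_then_translate(frame, angle=0.7, shift=(5.0, -3.0, 2.0))

# The raw coordinates differ after the rigid-body motion...
raw_changed = any(
    not math.isclose(p, q) for a, b in zip(frame, moved) for p, q in zip(a, b)
)
# ...but the distance-based features agree up to floating-point error.
features_same = all(
    math.isclose(d0, d1)
    for d0, d1 in zip(pairwise_distances(frame), pairwise_distances(moved))
)
```

Raw Cartesian coordinates would treat `frame` and `moved` as different points, whereas a distance-based feature vector treats them as the same conformation.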

## Featurizations

| Featurizer | Description |
| --- | --- |
| `AtomPairsFeaturizer` (pair_indices[, ...]) | Featurizer based on distances between specified pairs of atoms. |
| `ContactFeaturizer` ([contacts, scheme, ...]) | Featurizer based on residue-residue distances. |
| `DRIDFeaturizer` ([atom_indices]) | Featurizer based on distribution of reciprocal interatomic |
| `DihedralFeaturizer` ([types, sincos]) | Featurizer based on dihedral angles. |
| `GaussianSolventFeaturizer` (solute_indices, ...) | Featurizer on weighted pairwise distance between solute and solvent. |
| `RMSDFeaturizer` ([reference_traj, ...]) | Featurizer based on RMSD to one or more reference structures. |
| `RawPositionsFeaturizer` ([atom_indices, ref_traj]) | Featurize an MD trajectory into a vector space with the raw |
| `SuperposeFeaturizer` (atom_indices, reference_traj) | Featurizer based on Euclidean atom distances to reference structure. |
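To make the idea of a distance-based featurization concrete, here is a minimal standard-library sketch of the kind of output such a featurizer yields: one vector of pair distances per frame. The function `featurize_atom_pairs` and the toy trajectory are hypothetical illustrations, not the library's implementation; the real classes operate on MD trajectory objects and return numpy arrays.

```python
import math


def featurize_atom_pairs(trajectory, pair_indices):
    """Map each frame to a vector of distances between the given atom pairs.

    trajectory   -- list of frames; each frame is a list of (x, y, z) tuples
    pair_indices -- list of (i, j) atom index pairs
    Returns one feature vector (a list of floats) per frame.
    """
    return [
        [math.dist(frame[i], frame[j]) for i, j in pair_indices]
        for frame in trajectory
    ]


# Two toy frames of a three-atom system (hypothetical coordinates).
trajectory = [
    [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 2.0)],
    [(0.0, 0.0, 0.0), (6.0, 8.0, 0.0), (0.0, 0.0, 1.0)],
]
pair_indices = [(0, 1), (0, 2)]

# The result has one row per frame and one column per atom pair.
features = featurize_atom_pairs(trajectory, pair_indices)
```

Each trajectory becomes a two-dimensional table of shape (number of frames, number of features), which is the form that vector-space algorithms expect.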

## Alternative to Featurization

Many algorithms require vectorizable data. Other algorithms require only a pairwise distance metric, e.g. the RMSD between two protein conformations. In general, you can always define a pairwise distance on vectorized data, but you cannot embed data into a vector space given only pairwise distances.
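The first half of that statement is easy to see in code: given feature vectors, a pairwise distance matrix follows directly. The sketch below uses the Euclidean distance and hypothetical two-dimensional vectors; `distance_matrix` is an illustrative helper, not a library function.

```python
import math


def distance_matrix(vectors):
    """Return the symmetric matrix of Euclidean distances between vectors."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Fill both triangles at once, since d(i, j) == d(j, i).
            matrix[i][j] = matrix[j][i] = math.dist(vectors[i], vectors[j])
    return matrix


# Three hypothetical feature vectors (e.g. one per conformation).
vectors = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
D = distance_matrix(vectors)
```

Going the other way, from a matrix like `D` back to coordinates, is not possible in general without further assumptions about the metric.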

Some clustering methods let you use an arbitrary distance
metric, including RMSD. In this case, the input to `fit()` may be a list
of MD trajectories instead of a list of numpy arrays. Clustering methods
that allow this currently include `KCenters` and `KMedoids`.
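To show why such methods need only a distance function, here is a minimal sketch of greedy farthest-point (k-centers style) clustering that never touches coordinates. It is an illustration under simplifying assumptions, not the library's `KCenters` implementation: the function `k_centers`, the choice of the first item as the initial center, and the toy Hamming metric on strings (standing in for RMSD on conformations) are all hypothetical.

```python
def k_centers(items, k, metric):
    """Greedy farthest-point clustering using only pairwise distances.

    items  -- sequence of arbitrary objects (they need not be vectors)
    metric -- callable returning a distance between two items
    Returns (center_indices, labels), where labels[i] is the position in
    center_indices of the center nearest to items[i].
    """
    centers = [0]  # arbitrary choice: start from the first item
    # Distance from every item to its nearest center chosen so far.
    nearest = [metric(item, items[0]) for item in items]
    labels = [0] * len(items)
    while len(centers) < k:
        # The next center is the item farthest from all current centers.
        new = max(range(len(items)), key=nearest.__getitem__)
        centers.append(new)
        for i, item in enumerate(items):
            d = metric(item, items[new])
            if d < nearest[i]:
                nearest[i] = d
                labels[i] = len(centers) - 1
    return centers, labels


def hamming(a, b):
    """Toy metric: number of positions at which two strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Strings stand in for conformations; only hamming() ever compares them.
items = ["aaaa", "aaab", "bbbb", "bbba", "aaaa"]
centers, labels = k_centers(items, k=2, metric=hamming)
```

Swapping `hamming` for a function that computes the RMSD between two conformations would leave `k_centers` unchanged, which is what allows trajectories to be clustered without featurizing them first.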