.. _apipatterns:
.. currentmodule:: msmbuilder

API Patterns
============

Models in msmbuilder inherit from base classes in scikit-learn and follow
a similar API. Like sklearn, each type of model is a Python class. Models
are "fit" to data, and can then "transform" data into a different
representation. Unlike sklearn, the data here is a *list* (or
:ref:`dataset`) of time-series arrays or trajectories.

Hyperparameters
---------------

Hyperparameters are passed in via the ``__init__`` method and set as
instance attributes.

.. code-block:: python

    from msmbuilder.decomposition import tICA
    tica = tICA(gamma=0.05)
    tica.fit(...)

Fit
---

Model parameters are estimated in ``fit()``. In msmbuilder, the ``fit()``
method always accepts a ``list`` or :func:`~msmbuilder.dataset.dataset` of
2-dimensional arrays as input data, where each array represents a single
time series (trajectory) and has shape
``(length_of_trajectory, n_features)``. Some models can also accept a list
of MD trajectories (:class:`~md.Trajectory`) instead of a list of arrays.

.. code-block:: python

    import numpy as np
    from msmbuilder.cluster import KCenters

    # One 2D feature array per trajectory
    features = [np.load('traj-1-features.npy'), np.load('traj-2-features.npy')]
    assert features[0].ndim == 2 and features[1].ndim == 2

    clusterer = KCenters(n_clusters=100)
    clusterer.fit(features)

.. note::

    This is different from sklearn. In sklearn, estimators take a
    **single** 2D array as input in ``fit()``. Here we use a list of arrays
    or trajectories. However, for many models it is still easy to convert
    between sklearn-style input and msmbuilder-style input, as shown in
    the following code block.

.. todo: move to example notebook?

.. code-block:: python

    import numpy as np
    import msmbuilder.cluster
    import sklearn.cluster

    # sklearn-style input: (n_samples, n_features)
    X_sklearn = np.random.normal(size=(100, 10))
    # MSMBuilder-style input: list of (n_samples, n_features)
    X_msmb = [X_sklearn]

    clusterer_sklearn = sklearn.cluster.KMeans(n_clusters=5)
    clusterer_sklearn.fit(X_sklearn)

    clusterer_msmb = msmbuilder.cluster.KMeans(n_clusters=5)
    clusterer_msmb.fit(X_msmb)

Some models, like :class:`~tica.tICA`, only require a single pass over the
data. In this case, you can use the ``partial_fit`` method, which learns
the model incrementally, one trajectory at a time, and is more
memory-efficient.

Attributes
----------

Parameters of the model which are **learned or estimated** during
``fit()`` are always set as instance attributes that are named with a
trailing underscore. This is merely a convention, not special Python
syntax.

.. code-block:: python

    from msmbuilder.decomposition import tICA
    tica = tICA(gamma=0.05)
    tica.fit(...)

    # timescales_ is an estimated quantity, so it ends in an underscore
    print(tica.timescales_)

Transform
---------

Many models also implement a ``transform()`` method, which converts an
input dataset to an alternative representation. For example, the
``transform`` method of :ref:`featurizers` takes as input a list of
trajectories and returns a list of 2D feature arrays. :ref:`Clustering`
takes a list of 2D feature arrays and returns a list of 1D sequences of
cluster labels.

Pipelines
---------

The models in msmbuilder are designed to work together as part of a
:class:`sklearn.pipeline.Pipeline`.

.. code-block:: python

    from msmbuilder.cluster import KMeans
    from msmbuilder.msm import MarkovStateModel
    from sklearn.pipeline import Pipeline

    pipeline = Pipeline([
        ('cluster', KMeans(n_clusters=100)),
        ('msm', MarkovStateModel())
    ])
    # dataset: a list (or dataset) of 2D feature arrays, one per trajectory
    pipeline.fit(dataset)

.. vim: tw=75
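
The conventions described on this page (hyperparameters set in
``__init__``, ``fit()`` and ``partial_fit()`` operating on trajectories,
learned attributes with a trailing underscore, and ``transform()``
returning one array per trajectory) can be condensed into a toy estimator.
The following is a minimal sketch, not msmbuilder code:
``ToyMeanCenterer`` and its ``scale`` hyperparameter are hypothetical names
invented for illustration, and the class depends only on NumPy.

.. code-block:: python

    import numpy as np


    class ToyMeanCenterer(object):
        """Hypothetical model for illustration; not part of msmbuilder."""

        def __init__(self, scale=1.0):
            # Hyperparameters are stored as plain instance attributes.
            self.scale = scale

        def partial_fit(self, X):
            """Update running statistics with one (n_frames, n_features) array."""
            X = np.asarray(X, dtype=float)
            if X.ndim != 2:
                raise ValueError("each trajectory must be a 2D array")
            if getattr(self, "_sum", None) is None:
                self._sum = np.zeros(X.shape[1])
                self.n_observations_ = 0
            self._sum += X.sum(axis=0)
            self.n_observations_ += X.shape[0]
            # Learned quantities carry a trailing underscore.
            self.means_ = self._sum / self.n_observations_
            return self

        def fit(self, sequences):
            """Fit from a list of 2D arrays, one array per trajectory."""
            self._sum = None  # discard state from any previous fit
            for X in sequences:
                self.partial_fit(X)
            return self

        def transform(self, sequences):
            """Return a list with one centered, scaled array per trajectory."""
            return [self.scale * (np.asarray(X, dtype=float) - self.means_)
                    for X in sequences]


    # Two "trajectories" of different lengths with the same number of features.
    trajectories = [np.arange(6.0).reshape(3, 2), np.ones((5, 2))]
    model = ToyMeanCenterer(scale=2.0).fit(trajectories)
    centered = model.transform(trajectories)

Because ``fit()`` is just a loop over ``partial_fit()``, only the running
sum and frame count are kept while the model is being learned, so the
trajectories never need to be concatenated into one large array. That is
the memory advantage of a single-pass model in this sketch.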