Models in msmbuilder inherit from base classes in scikit-learn, and follow a similar API. Like sklearn, each type of model is a Python class. Models are “fit” to data, and can then “transform” data into a different representation. Unlike sklearn, the data here is a list (or dataset) of time-series arrays or trajectories.
Hyperparameters are passed in via the __init__ method and set as instance attributes.
from msmbuilder.decomposition import tICA
tica = tICA(gamma=0.05)
tica.fit(...)
The estimation of model parameters is done in fit(). In msmbuilder, the fit() method always accepts a list (or dataset()) of 2-dimensional arrays as input data, where each array represents a single time series (trajectory) and has shape (length_of_trajectory, n_features). Some models can also accept a list of MD trajectories (Trajectory) as opposed to a list of arrays.
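import numpy as np
from msmbuilder.cluster import KCenters

# Each file holds the 2D feature array for one trajectory
features = [np.load('traj-1-features.npy'), np.load('traj-2-features.npy')]
assert features[0].ndim == 2 and features[1].ndim == 2

clusterer = KCenters(n_clusters=100)
clusterer.fit(features)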
This is different from sklearn. In sklearn, estimators take a single
2D array as input in
fit(). Here we use a list of arrays or
trajectories. However, for many models, it’s still quite easy to go
between sklearn-style input and msmbuilder-style input, as shown in
the following code block.
import numpy as np
import msmbuilder.cluster
import sklearn.cluster

X_sklearn = np.random.normal(size=(100, 10))  # sklearn style input: (n_samples, n_features)
X_msmb = [X_sklearn]  # MSMBuilder style input: list of (n_samples, n_features)

clusterer_sklearn = sklearn.cluster.KMeans(n_clusters=5)
clusterer_sklearn.fit(X_sklearn)

clusterer_msmb = msmbuilder.cluster.KMeans(n_clusters=5)
clusterer_msmb.fit(X_msmb)
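With more than one trajectory, going between the two styles means concatenating the arrays and later splitting the result back along the original trajectory boundaries. The following is only a rough sketch of that round trip, using plain nested lists in place of NumPy arrays so that it runs without dependencies. The helper names flatten_trajectories and split_by_lengths are hypothetical and are not part of msmbuilder or sklearn.

```python
from itertools import chain


def flatten_trajectories(trajectories):
    """Concatenate a list of (n_frames, n_features) sequences into one.

    Returns the concatenated rows plus the length of each trajectory,
    which is needed to undo the concatenation later.
    """
    lengths = [len(traj) for traj in trajectories]
    rows = list(chain.from_iterable(trajectories))
    return rows, lengths


def split_by_lengths(rows, lengths):
    """Split a flat sequence back into per-trajectory pieces."""
    pieces, start = [], 0
    for n in lengths:
        pieces.append(rows[start:start + n])
        start += n
    return pieces


# Two toy trajectories with 3 and 2 frames, 2 features per frame.
X_msmb = [
    [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]],
    [[1.0, 1.1], [1.2, 1.3]],
]

X_flat, lengths = flatten_trajectories(X_msmb)  # sklearn style: one 2D table
X_back = split_by_lengths(X_flat, lengths)      # msmbuilder style again
```

Keeping the lengths matters because concatenation discards the trajectory boundaries, and a time-series model should not treat the last frame of one trajectory as adjacent to the first frame of the next.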
Some models, like tICA, only require a single pass over the data. In this case, you can use the partial_fit method, which incrementally learns the model one trajectory at a time and is therefore more memory-efficient.
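To illustrate why a single-pass model can be fit incrementally, here is a minimal sketch with a toy estimator that only tracks running sums. RunningMean is a hypothetical class written for this example, not an msmbuilder model, and it uses nested lists instead of arrays to stay dependency-free.

```python
class RunningMean:
    """Toy single-pass estimator (hypothetical, not an msmbuilder class)."""

    def __init__(self):
        self._n_frames = 0
        self._sums = None

    def partial_fit(self, trajectory):
        """Update the running sums with one (n_frames, n_features) trajectory."""
        for frame in trajectory:
            if self._sums is None:
                self._sums = [0.0] * len(frame)
            for i, value in enumerate(frame):
                self._sums[i] += value
            self._n_frames += 1
        if self._n_frames:
            # Learned quantity, exposed with a trailing underscore.
            self.mean_ = [s / self._n_frames for s in self._sums]
        return self

    def fit(self, trajectories):
        """Fit from scratch on a list of trajectories."""
        self._n_frames = 0
        self._sums = None
        for trajectory in trajectories:
            self.partial_fit(trajectory)
        return self


trajectories = [
    [[1.0, 10.0], [3.0, 30.0]],
    [[5.0, 50.0]],
]

incremental = RunningMean()
for traj in trajectories:  # only one trajectory needs to be in memory at a time
    incremental.partial_fit(traj)

batch = RunningMean().fit(trajectories)
```

Both routes end up with the same estimate, because the model only depends on sums that can be accumulated trajectory by trajectory.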
Parameters of the model which are learned or estimated during fit() are always set as instance attributes that are named with a trailing underscore. This is merely a convention, and not a special Python syntax.
tica = tICA(gamma=0.05)
tica.fit(...)

# timescales is an estimated quantity, so it ends in an underscore
print(tica.timescales_)
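The split between hyperparameters and estimated quantities can be sketched with a small stand-alone class. RangeScaler and its margin parameter are hypothetical names made up for this illustration; the point is only where each kind of attribute gets set.

```python
class RangeScaler:
    """Toy model (hypothetical) following the attribute naming convention."""

    def __init__(self, margin=0.0):
        # Hyperparameter: chosen by the user, stored without an underscore.
        self.margin = margin

    def fit(self, sequences):
        """Learn the value range from a list of 1D sequences."""
        values = [v for seq in sequences for v in seq]
        # Estimated quantities: set only in fit(), named with a trailing underscore.
        self.min_ = min(values) - self.margin
        self.max_ = max(values) + self.margin
        return self


model = RangeScaler(margin=0.5)
fitted_before = hasattr(model, "min_")  # False: nothing has been estimated yet

model.fit([[1.0, 4.0], [2.0, 3.0]])
fitted_after = hasattr(model, "min_")   # True
```

One side effect of the convention is that checking for a trailing-underscore attribute is a cheap way to tell whether a model has been fit.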
Many models also implement a
transform() method, which converts an
input dataset to an alternative representation. For example, the
transform method of featurizers takes as input a
list of trajectories and returns a list of 2D feature arrays.
Clusterers take a list of 2D feature arrays and return
a list of 1D sequences of cluster labels.
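The shape contract of a clusterer-style transform can be sketched as follows. NearestCenterLabeler is a hypothetical toy class: a real clusterer would learn its centers in fit(), whereas here they are passed in directly to keep the example short and dependency-free.

```python
class NearestCenterLabeler:
    """Toy clusterer-like transformer with fixed centers (hypothetical)."""

    def __init__(self, centers):
        self.centers = centers

    def _label(self, frame):
        # Index of the center with the smallest squared Euclidean distance.
        distances = [
            sum((x - c) ** 2 for x, c in zip(frame, center))
            for center in self.centers
        ]
        return distances.index(min(distances))

    def transform(self, sequences):
        """Map a list of 2D feature sequences to a list of 1D label sequences."""
        return [[self._label(frame) for frame in seq] for seq in sequences]


features = [
    [[0.1, 0.0], [0.9, 1.1], [1.0, 0.8]],  # trajectory 1: 3 frames
    [[0.0, 0.2], [0.2, 0.1]],              # trajectory 2: 2 frames
]

labeler = NearestCenterLabeler(centers=[[0.0, 0.0], [1.0, 1.0]])
labels = labeler.transform(features)
```

The output keeps one sequence per input trajectory, with one label per frame, so the time ordering within each trajectory is preserved.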
The models in msmbuilder are designed to work together as part of a sklearn.pipeline.Pipeline.
from msmbuilder.cluster import KMeans
from msmbuilder.msm import MarkovStateModel
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('cluster', KMeans(n_clusters=100)),
    ('msm', MarkovStateModel())
])
pipeline.fit(dataset)
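Conceptually, fitting a pipeline fits each intermediate step, transforms the data with it, and hands the result to the next step, with the final step only being fit. The sketch below imitates that flow with two toy models. It is not sklearn's actual implementation, and Binner, TransitionCounter, and fit_pipeline are hypothetical names used only for illustration.

```python
class Binner:
    """Toy transformer (hypothetical): bins 1D values into integer states."""

    def __init__(self, width=1.0):
        self.width = width

    def fit(self, sequences):
        self.min_ = min(v for seq in sequences for v in seq)
        return self

    def transform(self, sequences):
        return [[int((v - self.min_) // self.width) for v in seq]
                for seq in sequences]


class TransitionCounter:
    """Toy final estimator (hypothetical): counts observed state transitions."""

    def fit(self, sequences):
        self.counts_ = {}
        for seq in sequences:
            # Pairs are formed within a trajectory, never across trajectories.
            for a, b in zip(seq[:-1], seq[1:]):
                self.counts_[(a, b)] = self.counts_.get((a, b), 0) + 1
        return self


def fit_pipeline(steps, data):
    """Fit and transform every step but the last, then fit the last one."""
    for _, model in steps[:-1]:
        data = model.fit(data).transform(data)
    steps[-1][1].fit(data)
    return steps


steps = [("bin", Binner(width=1.0)), ("count", TransitionCounter())]
fit_pipeline(steps, [[0.2, 1.4, 1.6, 0.1], [2.5, 2.7]])
counts = steps[-1][1].counts_
```

Because every step consumes and produces a list of sequences, the steps can be chained without any glue code, and no transition is ever counted across a trajectory boundary.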