GMRQ Model Selection


We use cross-validation with the generalized matrix Rayleigh quotient (GMRQ) to select MSM hyperparameters. The GMRQ is a criterion that "scores" how well the MSM eigenvectors estimated on the training dataset serve as slow coordinates for the test dataset [1].

[1] McGibbon, R. T. and V. S. Pande, Variational cross-validation of slow dynamical modes in molecular kinetics (2014)
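For orientation, the score has roughly the following form (a sketch loosely following the notation of [1]; msmbuilder's exact conventions may differ slightly). Let $A$ hold the expansion coefficients of the first $m$ eigenvectors estimated from the training data, and let $C$ and $S$ be the time-lagged correlation matrix and the overlap (instantaneous covariance) matrix estimated from the test data. Then

$$\mathrm{GMRQ} = \operatorname{Tr}\left[\left(A^{T} S A\right)^{-1} A^{T} C A\right],$$

which, by the variational principle, cannot exceed the sum of the first $m$ true eigenvalues of the propagator (in the limit of abundant test data). In this example two eigenvectors are scored (the stationary one plus one slow process, since n_timescales=1), so a perfect model would score at most $2$.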

Get Data

This example uses the doublewell dataset, which consists of ten trajectories in 1D with $x \in [-\pi, \pi]$.

In [1]:
from msmbuilder.example_datasets import DoubleWell
trajectories = DoubleWell(random_state=0).get().trajectories
# sub-sample by taking only every 100th data point in each trajectory.
trajectories = [t[::100] for t in trajectories]
print([t.shape for t in trajectories])
[(1001, 1), (1001, 1), (1001, 1), (1001, 1), (1001, 1), (1001, 1), (1001, 1), (1001, 1), (1001, 1), (1001, 1)]

Set up pipeline

A Pipeline chains multiple estimators together so that we can build a custom model that performs a sequence of steps. This model is relatively simple: it first discretizes the trajectory data onto an evenly spaced grid between $-\pi$ and $\pi$, and then builds an MSM.

In [2]:
from sklearn.pipeline import Pipeline
from msmbuilder.cluster import NDGrid
from msmbuilder.msm import MarkovStateModel
import numpy as np

model = Pipeline([
    ('grid', NDGrid(min=-np.pi, max=np.pi)),
    ('msm', MarkovStateModel(n_timescales=1, lag_time=1, reversible_type='transpose', verbose=False))
])

Cross validation

To get an accurate indication of how well our MSMs find the dominant eigenfunctions of the underlying stochastic process, we need to account for the tendency of statistical models to overfit their training data: an MSM might build a transition matrix that fits the noise in the training data rather than the underlying signal. A data-efficient way to combat overfitting is cross-validation. This example uses 5-fold cross-validation.

In [3]:
from sklearn.model_selection import KFold
n_states = [5, 10, 25, 50, 100, 200, 500, 750]
cv = KFold(n_splits=5)
results = []

for n in n_states:
    model.set_params(grid__n_bins_per_feature=n)
    for fold, (train_index, test_index) in enumerate(cv.split(trajectories)):
        train_data = [trajectories[i] for i in train_index]
        test_data = [trajectories[i] for i in test_index]

        # fit model with a subset of the data (training data).
        # then we'll score it on both this training data (which
        # will give an overly-rosy picture of its performance)
        # and on the test data.
        model.fit(train_data)
        train_score = model.score(train_data)
        test_score = model.score(test_data)

        results.append({
            'train_score': train_score,
            'test_score': test_score,
            'n_states': n,
            'fold': fold})

Use pandas to query our data

In [4]:
import pandas as pd
results = pd.DataFrame(results)
results.head()
Out[4]:
   fold  n_states  test_score  train_score
0     0         5    1.980255     1.965947
1     1         5    1.974135     1.967516
2     2         5    1.959164     1.970776
3     3         5    1.962523     1.969898
4     4         5    1.963001     1.970182

Find the average for each fold

We use the median because it is robust to outliers; the mean works too.

In [5]:
avgs = (results
         .groupby('n_states')
         .aggregate(np.median)
         .drop('fold', axis=1))
avgs
Out[5]:
          test_score  train_score
n_states
5           1.963001     1.969898
10          1.982145     1.984224
25          1.985601     1.987711
50          1.985942     1.988318
100         1.985649     1.988545
200         1.985076     1.988789
500         1.979656     1.989989
750         1.974589     1.990888
In [6]:
best_n = avgs['test_score'].idxmax()
best_score = avgs.loc[best_n, 'test_score']
print(best_n, "states gives the best score:", best_score)
50 states gives the best score: 1.98594154417

Plot

This plot is very similar to Figure 1 of McGibbon and Pande [1]. It shows that performance on the training set keeps improving as we increase the number of states (with the amount of data fixed), whereas the test performance peaks and then starts going down.

We should pick the model with the highest average test set performance. In this example, we're only choosing over the number of MSM states, but this method can also be used to evaluate the clustering method and any pre-processing such as tICA.

However, you do need to fix the number of dynamical processes to "score" (this is the n_timescales attribute for MarkovStateModel), as well as the lag time.
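As an illustration (not part of the original notebook), the same cross-validation loop can be wrapped around a richer pipeline. Below is a minimal sketch that selects over the tICA lag time and the number of cluster centers while holding n_timescales and the MSM lag time fixed; featurized_trajs is a hypothetical list of featurized trajectories, and the parameter grids are arbitrary.

# Sketch only: extends the protocol above to a tICA + clustering + MSM pipeline.
# `featurized_trajs` is a hypothetical list of per-trajectory feature arrays.
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from msmbuilder.decomposition import tICA
from msmbuilder.cluster import MiniBatchKMeans
from msmbuilder.msm import MarkovStateModel

pipe = Pipeline([
    ('tica', tICA(n_components=4)),
    ('cluster', MiniBatchKMeans()),
    ('msm', MarkovStateModel(n_timescales=3, lag_time=10, verbose=False)),
])

cv = KFold(n_splits=5)
scores = []
for tica_lag in [1, 10, 100]:          # candidate tICA lag times
    for n_clusters in [50, 100, 200]:  # candidate numbers of microstates
        pipe.set_params(tica__lag_time=tica_lag, cluster__n_clusters=n_clusters)
        for fold, (train_idx, test_idx) in enumerate(cv.split(featurized_trajs)):
            train = [featurized_trajs[i] for i in train_idx]
            test = [featurized_trajs[i] for i in test_idx]
            pipe.fit(train)
            scores.append({'tica_lag': tica_lag, 'n_clusters': n_clusters,
                           'fold': fold, 'test_score': pipe.score(test)})

The best-scoring parameter combination could then be summarized with pandas exactly as above.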

In [7]:
%matplotlib inline
from matplotlib import pyplot as plt

plt.scatter(results['n_states'], results['train_score'], c='b', lw=0, label=None)
plt.scatter(results['n_states'], results['test_score'],  c='r', lw=0, label=None)

plt.plot(avgs.index, avgs['test_score'], c='r', lw=2, label='Mean test')
plt.plot(avgs.index, avgs['train_score'], c='b', lw=2, label='Mean train')

plt.plot(best_n, best_score, c='w', 
         marker='*', ms=20, label='{} states'.format(best_n))

plt.xscale('log')
plt.xlim((min(n_states)*.5, max(n_states)*5))
plt.ylabel('Generalized Matrix Rayleigh Quotient (Score)')
plt.xlabel('Number of states')

plt.legend(loc='lower right', numpoints=1)
plt.tight_layout()

(GMRQ-Model-Selection.ipynb; GMRQ-Model-Selection.eval.ipynb; GMRQ-Model-Selection.py)