Bayesian Estimation of MSMs


BayesianMarkovStateModel

This example demonstrates the class BayesianMarkovStateModel, which uses Metropolis Markov chain Monte Carlo (MCMC) to sample from the posterior distribution of transition matrices, given the transitions observed in your dataset. This can be useful for evaluating the uncertainty that finite sampling in your dataset introduces into the model.
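To make the idea concrete, here is a minimal sketch of what "sampling from the posterior over transition matrices" means. It assumes independent rows with a symmetric Dirichlet prior, so each row's posterior given the observed counts is again Dirichlet and can be sampled directly. This is only an illustration, not the Metropolis scheme that BayesianMarkovStateModel actually uses, and the count matrix below is made up.

import numpy as np

# Hypothetical transition counts between 3 states (rows: "from", columns: "to").
counts = np.array([[90., 10.,  0.],
                   [ 8., 80., 12.],
                   [ 0., 15., 85.]])

def sample_transmat(counts, prior=1.0, random_state=None):
    # Draw one transition matrix from the row-wise Dirichlet posterior.
    # Assumes independent rows and a symmetric Dirichlet(prior) prior;
    # an illustration only, not the reversible Metropolis sampler used
    # by BayesianMarkovStateModel.
    rng = np.random.RandomState(random_state)
    return np.vstack([rng.dirichlet(row + prior) for row in counts])

samples = np.array([sample_transmat(counts, random_state=i) for i in range(1000)])
print(samples[:, 0, 0].mean(), samples[:, 0, 0].std())  # posterior mean and spread of t_00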

In [1]:
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
from mdtraj.utils import timing
from msmbuilder.example_datasets import load_doublewell
from msmbuilder.cluster import NDGrid
from msmbuilder.msm import BayesianMarkovStateModel, MarkovStateModel
loading from /home/rmcgibbo/miniconda/lib/plugins

Load some double-well data

In [2]:
trjs = load_doublewell(random_state=0)['trajectories']
plt.hist(np.concatenate(trjs), bins=50, log=True)
plt.ylabel('Frequency')
plt.show()
328374 steps/s
330706 steps/s
338695 steps/s
339474 steps/s
334440 steps/s
331847 steps/s
340088 steps/s
338045 steps/s
334445 steps/s
338929 steps/s
Saving "/home/rmcgibbo/msmbuilder_data/doublewell/version-1_random-state-0.pkl"... (<type 'list'>)

We'll discretize the space using 10 states

Then we'll build one MSM using the MLE transition matrix estimator, and one with the Bayesian estimator

In [3]:
clusterer = NDGrid(n_bins_per_feature=10)
mle_msm = MarkovStateModel(lag_time=100)
b_msm = BayesianMarkovStateModel(lag_time=100, n_samples=10000, n_steps=1000)

states = clusterer.fit_transform(trjs)
with timing('running mcmc'):
    b_msm.fit(states)

mle_msm.fit(states)
running mcmc: 1.303 seconds
MSM contains 1 strongly connected component above weight=0.01. Component 0 selected, with population 100.000000%
Out[3]:
MarkovStateModel(ergodic_cutoff=0.01, lag_time=100, n_timescales=10,
         prior_counts=0, reversible_type='mle', sliding_window=True,
         verbose=True)
In [4]:
plt.subplot(2, 1, 1)
plt.plot(b_msm.all_transmats_[:, 0, 0])
plt.axhline(mle_msm.transmat_[0, 0], c='k')
plt.ylabel('t_00')

plt.subplot(2, 1, 2)
plt.ylabel('t_23')
plt.xlabel('MCMC Iteration')
plt.plot(b_msm.all_transmats_[:, 2, 3])
plt.axhline(mle_msm.transmat_[2, 3], c='k')
plt.show()
In [5]:
plt.plot(b_msm.all_timescales_[:, 0], label='MCMC')
plt.axhline(mle_msm.timescales_[0], c='k', label='MLE')
plt.legend(loc='best')
plt.ylabel('Longest timescale')
plt.xlabel('MCMC iteration')
plt.show()
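The spread of the MCMC samples gives a direct estimate of the uncertainty. For example, a rough 95% credible interval for the longest timescale can be read off the sampled values; the burn-in length below is an arbitrary choice for illustration.

burn_in = 100  # arbitrary burn-in length, chosen only for illustration
ts_samples = b_msm.all_timescales_[burn_in:, 0]
# nanpercentile in case any sample produced a NaN timescale
lo, hi = np.nanpercentile(ts_samples, [2.5, 97.5])
print('longest timescale: MLE = %.1f, 95%% credible interval = [%.1f, %.1f]'
      % (mle_msm.timescales_[0], lo, hi))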

Now let's try using 50 states

With this many states, the MCMC sampling takes much longer to converge

In [6]:
clusterer = NDGrid(n_bins_per_feature=50)
mle_msm = MarkovStateModel(lag_time=100)
b_msm = BayesianMarkovStateModel(lag_time=100, n_samples=1000, n_steps=100000)

states = clusterer.fit_transform(trjs)
with timing('running mcmc (50 states)'):
    b_msm.fit(states)

mle_msm.fit(states)
running mcmc (50 states): 8.922 seconds
MSM contains 1 strongly connected component above weight=0.01. Component 0 selected, with population 100.000000%
Out[6]:
MarkovStateModel(ergodic_cutoff=0.01, lag_time=100, n_timescales=10,
         prior_counts=0, reversible_type='mle', sliding_window=True,
         verbose=True)
In [7]:
plt.plot(b_msm.all_timescales_[:, 0], label='MCMC')
plt.axhline(mle_msm.timescales_[0], c='k', label='MLE')
plt.legend(loc='best')
plt.ylabel('Longest timescale')
plt.xlabel('MCMC iteration')
/home/rmcgibbo/miniconda/lib/python2.7/site-packages/msmbuilder-3.2.0-py2.7-linux-x86_64.egg/msmbuilder/msm/bayesmsm.py:339: RuntimeWarning: invalid value encountered in log
  timescales = - self.lag_time / np.log(us[:, 1:])
Out[7]:
<matplotlib.text.Text at 0x2b2917738bd0>
In [8]:
plt.plot(b_msm.all_transmats_[:, 0, 0], label='MCMC')
plt.axhline(mle_msm.transmat_[0, 0], c='k', label='MLE')
plt.legend(loc='best')
plt.ylabel('t_00')
plt.xlabel('MCMC iteration')
Out[8]:
<matplotlib.text.Text at 0x2b29177c9190>
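(The RuntimeWarning above most likely comes from np.log being applied to negative sampled eigenvalues, which produces NaN timescales for those samples.) Since the chain mixes slowly with 50 states, it is worth an informal convergence check before trusting the samples. One simple option, sketched below, is to compare summary statistics from the two halves of the sampled trace; this is a generic diagnostic, not a feature of BayesianMarkovStateModel.

# Compare summary statistics from the two halves of the sampled t_00 trace;
# a large disagreement suggests the chain has not yet converged.
trace = b_msm.all_transmats_[:, 0, 0]
first_half, second_half = np.array_split(trace, 2)
print('first half:  mean = %.4f, std = %.4f' % (first_half.mean(), first_half.std()))
print('second half: mean = %.4f, std = %.4f' % (second_half.mean(), second_half.std()))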

(bayesian-msm.ipynb; bayesian-msm_evaluated.ipynb; bayesian-msm.py)
