msmbuilder.hmm.GaussianFusionHMM

class msmbuilder.hmm.GaussianFusionHMM(n_states, n_features, n_init=10, n_em_iter=10, n_lqa_iter=10, fusion_prior=0.01, thresh=0.01, reversible_type='mle', transmat_prior=None, vars_prior=0.001, vars_weight=1, random_state=None, params='tmv', init_params='tmv', precision='mixed', n_jobs=1, timing=False, n_hotstart='all', init_algo='kmeans')

Reversible Gaussian Hidden Markov Model with L1-Fusion Regularization

This model estimates a Hidden Markov model with Gaussian emission distributions for a vector dataset, constrained to be reversible (i.e., to satisfy detailed balance). It is similar to a MarkovStateModel, but without “hard” assignments of conformations to clusters. Optionally, it can apply L1-fusion regularization to the positions of the Gaussians. See [R24] for details.
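
A minimal usage sketch (the random trajectories below stand in for real featurized data, and the parameter values are illustrative only):

>>> import numpy as np
>>> from msmbuilder.hmm import GaussianFusionHMM
>>> # two-dimensional feature space; five synthetic trajectories of 1000 frames each
>>> n_features = 2
>>> sequences = [np.random.randn(1000, n_features) for _ in range(5)]
>>> model = GaussianFusionHMM(n_states=3, n_features=n_features, fusion_prior=0.01)
>>> model.fit(sequences)
>>> model.means_       # Gaussian means, one row per state
>>> model.transmat_    # reversible transition matrix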

Parameters:

n_states : int

The number of components (states) in the model

n_features : int

The dimensionality of each observation vector, i.e., the size of the trailing axis of the input sequences.

n_init : int

Number of times the EM algorithm will be run with different random seeds. The final result will be the best output of the n_init runs, as measured by log-likelihood.

n_em_iter : int

The maximum number of iterations of expectation-maximization to run during each fitting round.

n_lqa_iter : int

The number of iterations of the local quadratic approximation fixed point equations to solve when computing the new means with a nonzero L1 fusion penalty.

thresh : float

Convergence threshold for the log-likelihood during expectation-maximization. When the increase in the log-likelihood between successive rounds of E-M falls below thresh, fitting stops.

fusion_prior : float

The strength of the L1 fusion prior.

reversible_type : str

Method by which the reversibility of the transition matrix is enforced. ‘mle’ uses a maximum-likelihood method solved by numerical optimization (BFGS), while ‘transpose’ uses a more restrictive (but computationally cheaper) direct symmetrization of the expected transition counts.

transmat_prior : float, optional

A prior on the transition matrix entries. If supplied, a pseudocount of transmat_prior - 1 is added to each entry in the expected number of observed transitions from each state to each other state, which acts like a uniform Dirichlet alpha.

vars_prior : float, optional

A prior on the variances. This can be useful in the undersampled regime, where states may collapse onto a single point, but is generally not needed.

vars_weight : float, optional

Weight of the variance prior (vars_prior).

random_state : int, optional

Random state, used during sampling.

params : str

A string specifying which parameters to optimize during fitting. If ‘t’ is in params, the transition matrix will be optimized. If ‘m’ is in params, the state means will be optimized. If ‘v’ is in params, the state variances will be optimized.

init_params : str

A string specifying which parameters to initialize prior to fitting. If ‘t’ is in init_params, the transition matrix will be set. If ‘m’ is in init_params, the state means will be set. If ‘v’ is in init_params, the state variances will be set.

timing : bool, default=False

Print detailed timing information about the fitting process.

n_hotstart : {int, ‘all’}

Number of sequences to use when hotstarting the EM with kmeans. Default=’all’

init_algo : str

Use this algorithm to hotstart the means and covariances. Must be one of “kmeans” or “GMM”

References

[R24] McGibbon, Robert T., et al. “Understanding Protein Dynamics with L1-Regularized Reversible Hidden Markov Models.” Proc. 31st Intl. Conf. on Machine Learning (ICML), 2014.

Attributes

means_ : array, shape=(n_states, n_features). Means of the Gaussian emission distribution for each state.
vars_ : array, shape=(n_states, n_features). Variances of the Gaussian emission distribution for each state.
transmat_ : array, shape=(n_states, n_states). Reversible state-to-state transition probability matrix.
populations_ : array, shape=(n_states,). Equilibrium (stationary) population of each state.
fit_logprob_ : array. Log-likelihood of the data at each E-M iteration of the best fitting round.

Methods

draw_centroids(sequences) Find conformations most representative of model means.
draw_samples(sequences, n_samples[, scheme, ...]) Sample conformations from each state.
fit(sequences[, y]) Estimate model parameters.
get_params([deep]) Get parameters for this estimator.
predict(sequences) Find most likely hidden-state sequence corresponding to each data timeseries.
score(sequences) Log-likelihood of sequences under the model
set_params(**params) Set the parameters of this estimator.
summarize()

fit(sequences, y=None)

Estimate model parameters.

An initialization step is performed before entering the EM algorithm. To avoid this step, pass an appropriate init_params keyword argument to the estimator’s constructor.

Parameters:

sequences : list

List of 2-dimensional array observation sequences, each of which has shape (n_samples_i, n_features), where n_samples_i is the length of the i_th observation.

y : unused

Needed for sklearn API consistency.
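
For example, trajectories of different lengths can be passed together, as long as each has the same number of features (a sketch continuing the example above, where model is an already-constructed GaussianFusionHMM with n_features=2):

>>> import numpy as np
>>> # trajectories may differ in length; the trailing axis must equal n_features
>>> sequences = [np.random.randn(500, 2), np.random.randn(750, 2)]
>>> model.fit(sequences)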

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params : mapping of string to any

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter>, so it is possible to update each component of a nested object.

Returns:

self
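
Both get_params and set_params follow the usual scikit-learn conventions; a short sketch (model is a GaussianFusionHMM instance):

>>> model.get_params()['fusion_prior']            # inspect a hyperparameter
>>> model.set_params(thresh=1e-3, n_em_iter=50)   # update hyperparameters in place
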
timescales_

The implied relaxation timescales of the hidden Markov transition matrix

By diagonalizing the transition matrix, its propagation of an arbitrary initial probability vector can be written as a sum of the eigenvectors of the transition matrix, each weighted by a term that decays exponentially with time. Each of these eigenvectors describes a “dynamical mode” of the transition matrix and has a characteristic timescale, which gives the time on which that mode decays toward equilibrium. These timescales are given by \(-1/\log(u_i)\), where \(u_i\) are the eigenvalues of the transition matrix. In a reversible HMM with N states, the number of timescales is at most N-1: the stationary distribution of the chain is associated with an eigenvalue of 1 and an infinite characteristic timescale. The number of timescales can be less than N-1, because negative eigenvalues of the transition matrix (which are allowable under detailed balance) do not contribute a timescale.

Returns:

timescales : array, shape=[n_timescales]

The characteristic timescales of the transition matrix. If the model has not been fit or does not have a transition matrix, the return value will be None.
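
A sketch of the relationship described above, computed directly from the eigenvalues of a fitted model's transmat_ (the manual values should agree with what timescales_ reports):

>>> import numpy as np
>>> # eigenvalues of the reversible transition matrix are real; the leading one
>>> # is 1 and corresponds to the stationary distribution (infinite timescale)
>>> u = np.sort(np.real(np.linalg.eigvals(model.transmat_)))[::-1]
>>> positive = u[1:][u[1:] > 0]     # drop the stationary and any negative eigenvalues
>>> manual = -1.0 / np.log(positive)
>>> manual
>>> model.timescales_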

score(sequences)

Log-likelihood of sequences under the model

Parameters:

sequences : list

List of 2-dimensional array observation sequences, each of which has shape (n_samples_i, n_features), where n_samples_i is the length of the i_th observation.
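
For example (test_sequences is assumed to be a list of (n_samples_i, n_features) arrays, laid out exactly as in fit):

>>> # higher (less negative) values indicate a better fit to the held-out data
>>> model.score(test_sequences)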

predict(sequences)

Find most likely hidden-state sequence corresponding to each data timeseries.

Uses the Viterbi algorithm.

Parameters:

sequences : list

List of 2-dimensional array observation sequences, each of which has shape (n_samples_i, n_features), where n_samples_i is the length of the i_th observation.

Returns:

viterbi_logprob : float

Log probability of the maximum likelihood path through the HMM.

hidden_sequences : list of np.ndarrays[dtype=int, shape=n_samples_i]

Index of the most likely states for each observation.
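
A sketch of Viterbi decoding with a fitted model (sequences laid out as in fit):

>>> logprob, hidden = model.predict(sequences)
>>> # hidden[i] is an integer array aligned frame-by-frame with sequences[i];
>>> # logprob is the log probability of the Viterbi paths
>>> hidden[0][:10]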

draw_centroids(sequences)

Find conformations most representative of model means.

Parameters:

sequences : list

List of 2-dimensional array observation sequences, each of which has shape (n_samples_i, n_features), where n_samples_i is the length of the i_th observation.

Returns:

centroid_pairs_by_state : np.ndarray, dtype=int, shape = (n_states, 1, 2)

centroid_pairs_by_state[state, 0] = (trj, frame) gives the trajectory and frame index associated with the mean of state

mean_approx : np.ndarray, dtype=float, shape = (n_states, 1, n_features)

mean_approx[state, 0] gives the features at the representative point for state

See also

utils.map_drawn_samples
Extract conformations from MD trajectories by index.
GaussianFusionHMM.draw_samples
Draw samples from GHMM
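
A sketch of locating the frame that best represents each state mean (sequences laid out as in fit):

>>> centroid_pairs, mean_approx = model.draw_centroids(sequences)
>>> trj, frame = centroid_pairs[0, 0]   # representative (trajectory, frame) for state 0
>>> mean_approx[0, 0]                   # features of that representative frame
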
draw_samples(sequences, n_samples, scheme='even', match_vars=False)

Sample conformations from each state.

Parameters:

sequences : list

List of 2-dimensional array observation sequences, each of which has shape (n_samples_i, n_features), where n_samples_i is the length of the i_th observation.

n_samples : int

How many samples to return from each state

scheme : str, optional, default=’even’

Must be one of [‘even’, ‘maxent’].

match_vars : bool, default=False

Flag for matching variances in the maxent discrete approximation.

Returns:

selected_pairs_by_state : np.array, dtype=int, shape=(n_states, n_samples, 2)

selected_pairs_by_state[state] gives an array of randomly selected (trj, frame) pairs from the specified state.

sample_features : np.ndarray, dtype=float, shape = (n_states, n_samples, n_features)

sample_features[state, sample] gives the features for the given sample of state

See also

utils.map_drawn_samples
Extract conformations from MD trajectories by index.
GaussianFusionHMM.draw_centroids
Draw centers from GHMM

Notes

With scheme=’even’, this function assigns frames to states crisply, then samples from the uniform distribution over the frames belonging to each state. With scheme=’maxent’, a maximum entropy method is used to determine a discrete distribution over samples whose mean (and, optionally, variance) matches the GHMM means.
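
A sketch of drawing frames from each state; the returned (trajectory, frame) indices can then be passed, together with the corresponding MD trajectories, to utils.map_drawn_samples to extract conformations:

>>> pairs, features = model.draw_samples(sequences, n_samples=10, scheme='even')
>>> # pairs[k] holds (trajectory, frame) index pairs drawn from state k;
>>> # features[k, j] is the feature vector of the j-th drawn frame for state k
>>> pairs.shape, features.shape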
