Getting Started

Introduction

Getting started with Osprey requires only a single YAML configuration file. This file specifies your model estimator (estimator), hyperparameter search strategy (strategy), hyperparameter search space (search_space), dataset information (dataset_loader), cross-validation strategy (cv), and the path to a SQL-like database where results are stored (trials). This page goes over how to set up a basic Osprey toy project and then a more realistic example for a molecular dynamics dataset.

scikit-learn Example

First, we’ll begin with a simple C-Support Vector Classification example using scikit-learn to introduce the basic YAML fields for Osprey. To tell Osprey that we want to use sklearn’s SVC as our estimator, we can type:

estimator:
  entry_point: sklearn.svm.SVC

If we want to use random search to decide where to search next in hyperparameter space, we can add:

strategy:
  name: random

The search space can be defined for any hyperparameter available in the estimator class. Here we adjust the value range of the C and gamma hyperparameters. We’ll search over a range of 0.1 to 10 for C and over 1E-5 to 1 in log-space (note: warp: log) for gamma.

search_space:
  C:
    min: 0.1
    max: 10
    type: float

  gamma:
    min: 1e-5
    max: 1
    warp: log
    type: float
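
To see what warp: log changes, the following Python sketch draws random gamma values in two ways: uniformly over the range, and uniformly over the logarithm of the range. It only illustrates log-uniform sampling and is not Osprey's own implementation; the function names sample_uniform and sample_log_uniform are hypothetical.

```python
import math
import random


def sample_uniform(low, high, rng):
    """Draw a value uniformly between low and high."""
    return rng.uniform(low, high)


def sample_log_uniform(low, high, rng):
    """Draw a value whose base-10 logarithm is uniform over the range."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent


rng = random.Random(0)  # fixed seed so the sketch is repeatable
linear = [sample_uniform(1e-5, 1, rng) for _ in range(1000)]
warped = [sample_log_uniform(1e-5, 1, rng) for _ in range(1000)]

# Fraction of draws below 1e-2 for each sampling scheme.
frac_linear = sum(v < 1e-2 for v in linear) / len(linear)
frac_warped = sum(v < 1e-2 for v in warped) / len(warped)
print(frac_linear, frac_warped)
```

With linear sampling only about 1% of draws are expected to fall below 1e-2, whereas log-uniform sampling places roughly 60% of them there (three of the five decades between 1e-5 and 1). That is why a log warp suits parameters such as gamma whose plausible values span several orders of magnitude.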

To perform 5-fold cross validation, we add:

cv: 5
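
As a reminder of what 5-fold cross-validation means, here is a minimal Python sketch that partitions sample indices into five folds, each used once as the test set. It shows the generic k-fold idea only; the helper kfold_indices is hypothetical, and this page does not say which splitter Osprey builds from cv: 5.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


for train, test in kfold_indices(n_samples=10, n_folds=5):
    print("train:", train, "test:", test)
```

Every sample appears in exactly one test set, so each model is scored on data it was not trained on.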

To load the digits classification example dataset from scikit-learn, we write:

dataset_loader:
  name: sklearn_dataset
  params:
    method: load_digits

And finally we need to list the SQL database where our cross-validation results will be saved:

trials:
  uri: sqlite:///osprey-trials.db

Once all of this has been written to a YAML file (e.g. config.yaml), we can start an Osprey job from the command line by invoking:

$ osprey worker config.yaml
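
Because the trials uri points at an ordinary SQLite file, the results can be inspected with Python's standard sqlite3 module. The sketch below makes no assumption about Osprey's table layout; it only lists whichever tables the file contains, and the helper name list_tables is hypothetical.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist.
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # File name taken from the trials uri above (sqlite:///osprey-trials.db).
    print(list_tables("osprey-trials.db"))
```

From there, a SELECT on any listed table shows the columns that were actually recorded for each trial.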

msmbuilder Example

Now that we understand the basics, we can move on to a more practical example. This section will go over how to set up an Osprey configuration for cross-validating Markov state models from protein simulations. Our model will be constructed by first calculating torsion angles, then performing dimensionality reduction using tICA, clustering using mini-batch k-means, and, finally, fitting a maximum-likelihood estimated Markov state model.

We begin by defining a Pipeline which will construct our desired model:

estimator:
    eval: |
        Pipeline([
                ('featurizer', DihedralFeaturizer()),
                ('tica', tICA()),
                ('cluster', MiniBatchKMeans()),
                ('msm', MarkovStateModel(n_timescales=5, verbose=False)),
        ])
    eval_scope: msmbuilder

Notice that we can easily set default parameters (e.g. msm.n_timescales) in our Pipeline even if we don’t plan on optimizing them.

If we wish to use Gaussian process prediction to decide where to search in hyperparameter space, we can add:

strategy:
    name: gp
    params:
      seeds: 50

In this example, we’ll be optimizing the type of featurization, the number of cluster centers and the number of independent components:

search_space:
  featurizer__types:
    choices:
      - ['phi', 'psi']
      - ['phi', 'psi', 'chi1']
    type: enum

  tica__n_components:
    min: 2
    max: 5
    type: int

  cluster__n_clusters:
    min: 10
    max: 100
    type: int

As seen in the previous example, we’ll set tica__n_components and cluster__n_clusters as integers with a set range. Notice that we can change which torsion angles to use in our featurization by creating an enum which contains a list of different dihedral angle types.
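
The search-space keys follow scikit-learn's naming convention for pipeline parameters: the text before the double underscore is the step name given in the Pipeline, and the rest is a parameter of that step. The short Python sketch below makes the mapping explicit; split_param is a hypothetical helper written for illustration, not part of Osprey or scikit-learn.

```python
def split_param(name):
    """Split 'step__param' into (step, param); param may itself contain '__'."""
    step, separator, param = name.partition("__")
    if not separator:
        raise ValueError(f"{name!r} has no '__' separator")
    return step, param


for key in ["featurizer__types", "tica__n_components", "cluster__n_clusters"]:
    step, param = split_param(key)
    print(f"{key}: step={step!r}, parameter={param!r}")
```

So tica__n_components refers to the n_components parameter of the step named 'tica', which is why the step names chosen in the Pipeline must match the prefixes used in search_space.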

In this example, we’ll be using 50-50 shufflesplit cross-validation. This method is optimal for Markov state model cross-validation, as it maximizes the amount of unique data available in your training and test sets:

cv:
  name: shufflesplit
  params:
    n_iter: 5
    test_size: 0.5
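
To make the n_iter and test_size settings concrete, here is a minimal pure-Python sketch of shuffle-split cross-validation: each iteration shuffles the sample indices independently and holds out a fixed fraction as the test set. It illustrates the general technique rather than the implementation Osprey calls, and the shuffle_split helper and its seed argument are hypothetical.

```python
import random


def shuffle_split(n_samples, n_iter, test_size, seed=None):
    """Yield (train, test) index lists from n_iter independent random splits."""
    rng = random.Random(seed)
    n_test = int(round(n_samples * test_size))
    for _ in range(n_iter):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The first n_test shuffled indices form the test set, the rest train.
        yield sorted(indices[n_test:]), sorted(indices[:n_test])


for train, test in shuffle_split(n_samples=8, n_iter=5, test_size=0.5, seed=0):
    print("train:", train, "test:", test)
```

Unlike k-fold splitting, the test sets from different iterations may overlap, because every split is drawn from scratch.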

We’ll be using MDTraj to load our trajectories. Osprey already includes an mdtraj dataset loader to make it easy to list your trajectory and topology files as a glob-string:

dataset_loader:
  name: mdtraj
  params:
    trajectories: ~/local/msmbuilder/Tutorial/XTC/*/*.xtc
    topology: ~/local/msmbuilder/Tutorial/native.pdb
    stride: 1

And finally we need to list the SQL database where our cross-validation results will be saved:

trials:
  uri: sqlite:///osprey-trials.db

Just as before, once all of this has been written to a YAML file, we can start an Osprey job from the command line by invoking:

$ osprey worker config.yaml