Getting Started¶
Introduction¶
Getting started with Osprey is as simple as setting up a single YAML configuration file. This configuration file will contain your model estimators, hyperparameter search strategy, hyperparameter search space, dataset information, cross-validation strategy, and a path to a SQL-like database. You can use the command osprey skeleton to generate an example configuration file.
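For example, running the skeleton command in your working directory will write out a template configuration that you can then edit to match your own project (the exact template contents may vary between Osprey versions):

$ osprey skeleton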
First, we will describe how to prepare your dataset for Osprey. Then, we will show how to use Osprey for a simple scikit-learn classification task. We’ll also demonstrate how one might use Osprey to model a molecular dynamics (MD) dataset. And finally, we’ll show how to query and use your final Osprey results.
Formatting Your Dataset¶
Osprey supports a wide variety of file formats (see here for a full list); however, some of these offer more flexibility than others. In general, your data should be formatted as a two-dimensional array, where columns represent different features or variables and rows are individual observations. This is a fairly natural format for delimiter-separated value files (e.g. .csv, .tsv), which Osprey handles natively using DSVDatasetLoader. If you choose to save your dataset as a .pkl, .npz, or .npy file, it's as simple as saving your datasets as 2D NumPy arrays. Note that each file should only contain a single NumPy array. If you'd like to store multiple arrays in a single file for Osprey to read, we recommend storing your data in an HDF5 file.
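As a minimal sketch (the array contents and filenames here are placeholders, not part of Osprey's API), preparing a feature matrix in these formats might look like this:

import numpy as np
import h5py  # assumed available; only needed for the HDF5 example below

# A toy dataset: 100 observations (rows) by 4 features (columns)
X = np.random.rand(100, 4)

# Option 1: one 2D NumPy array per file
np.save('features.npy', X)

# Option 2: keep several arrays together in a single HDF5 file
with h5py.File('features.h5', 'w') as f:
    f.create_dataset('X', data=X)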
When working with datasets with labels or a response variable, there are slight differences in how your data should be stored. With delimiter-separated value, NumPy, and HDF5 files, you can simply append the labels as an additional column and then select its index as the y_col parameter in the corresponding dataset loader. With Pickle and Joblib files, you should instead save each as a separate value in a dict object and declare the corresponding keys (x_name and y_name) in the JoblibDatasetLoader. Please note that if you wish to use multiple response variables, the JoblibDatasetLoader is the only dataset loader currently equipped to do so.
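For instance, a labelled dataset for the JoblibDatasetLoader could be saved roughly like this; the 'X' and 'y' keys and the filename are arbitrary placeholders and just need to match the x_name and y_name values in your configuration:

import joblib
import numpy as np

X = np.random.rand(100, 4)              # features: 100 observations by 4 variables
y = np.random.randint(0, 2, size=100)   # binary labels

# Store both arrays as separate values in a single dict; the keys must
# match the x_name and y_name settings in the dataset_loader section
joblib.dump({'X': X, 'y': y}, 'dataset.joblib')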
SVM Classification with scikit-learn¶
Let's train a basic C-Support Vector Classification example using scikit-learn and introduce the basic YAML fields for Osprey. To tell Osprey that we want to use sklearn's SVC as our estimator, we can type:
estimator:
  entry_point: sklearn.svm.SVC
If we want to use random search to decide where to search next in hyperparameter space, we can add:
strategy:
  name: random
The search space can be defined for any hyperparameter available in the estimator class. Here we can adjust the value ranges of the C and gamma hyperparameters. We'll search over a range of 0.1 to 10 for C and over 1e-5 to 1 in log-space (note: warp: log) for gamma.
search_space:
  C:
    min: 0.1
    max: 10
    type: float
  gamma:
    min: 1e-5
    max: 1
    warp: log
    type: float
To perform 5-fold cross-validation, we add:
cv: 5
To load the digits classification example dataset from scikit-learn, we write:
dataset_loader:
  name: sklearn_dataset
  params:
    method: load_digits
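For reference, this loader pulls in roughly the same data you would get by calling scikit-learn directly:

from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target   # 8x8 pixel features and class labels
print(X.shape, y.shape)             # (1797, 64) (1797,)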
And finally we need to list the SQLite database where our cross-validation results will be saved:
trials:
  uri: sqlite:///osprey-trials.db
Once this has all been written to a YAML file (in this example, config.yaml), we can start an Osprey job from the command line by invoking:
$ osprey worker --n-iters 10 --seed 42 config.yaml
The --n-iters option allows you to specify how many hyperparameter optimization iterations this particular worker should perform. The --seed option allows you to define a random seed to produce a fully reproducible Osprey worker (note: this overrides the random_seed option in the configuration file).
Molecular Dynamics with msmbuilder¶
Now that we understand the basics, we can move on to a more practical example. This section will go over how to set up an Osprey configuration for cross-validating Markov state models from protein simulations. Our model will be constructed by first calculating torsion angles, performing dimensionality reduction using tICA, clustering using mini-batch k-means, and, finally, estimating a maximum-likelihood Markov state model.
We begin by defining a Pipeline which will construct our desired model:
estimator:
  eval: |
    Pipeline([
        ('featurizer', DihedralFeaturizer()),
        ('tica', tICA()),
        ('cluster', MiniBatchKMeans()),
        ('msm', MarkovStateModel(n_timescales=5, verbose=False)),
    ])
  eval_scope: msmbuilder
Notice that we can easily set default parameters (e.g. msm.n_timescales) in our Pipeline even if we don't plan on optimizing them.
If we wish to use Gaussian process prediction to decide where to search in hyperparameter space, we can add:
strategy:
  name: gp
  params:
    seeds: 50
In this example, we’ll be optimizing the type of featurization, the number of cluster centers and the number of independent components:
search_space:
  featurizer__types:
    choices:
      - ['phi', 'psi']
      - ['phi', 'psi', 'chi1']
    type: enum
  tica__n_components:
    min: 2
    max: 5
    type: int
  cluster__n_clusters:
    min: 10
    max: 100
    type: int
As seen in the previous example, we'll set tica__n_components and cluster__n_clusters as integers with a set range. Notice that we can change which torsion angles to use in our featurization by creating an enum which contains a list of different dihedral angle types.
In this example, we'll be using 50-50 shufflesplit cross-validation. This method is optimal for Markov state model cross-validation, as it maximizes the amount of unique data available in your training and test sets:
cv:
  name: shufflesplit
  params:
    n_splits: 5
    test_size: 0.5
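For comparison, this mirrors what scikit-learn's ShuffleSplit cross-validator would do if you were constructing the splits yourself (shown only as a sketch; Osprey handles this for you):

from sklearn.model_selection import ShuffleSplit

# Five random splits, each holding out 50% of the data for testing
cv = ShuffleSplit(n_splits=5, test_size=0.5)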
We'll be using MDTraj to load our trajectories. Osprey already includes an mdtraj dataset loader to make it easy to list your trajectory and topology files as a glob-string:
dataset_loader:
  name: mdtraj
  params:
    trajectories: ~/local/msmbuilder/Tutorial/XTC/*/*.xtc
    topology: ~/local/msmbuilder/Tutorial/native.pdb
    stride: 1
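For intuition, this loader does roughly what you would otherwise do by hand with MDTraj; the paths below simply reuse the placeholders from the configuration above:

import glob
import os

import mdtraj as md

top = os.path.expanduser('~/local/msmbuilder/Tutorial/native.pdb')
traj_files = sorted(glob.glob(os.path.expanduser('~/local/msmbuilder/Tutorial/XTC/*/*.xtc')))

# Load each trajectory against the shared topology, keeping every frame (stride=1)
trajectories = [md.load(f, top=top, stride=1) for f in traj_files]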
And finally we need to list the SQLite database where our cross-validation results will be saved:
trials:
  uri: sqlite:///osprey-trials.db
Just as before, once this has all been written to a YAML file, we can start an Osprey job from the command line by invoking:
$ osprey worker --n-iters 10 --seed 42 config.yaml
Working with Osprey Results¶
As mentioned before, all Osprey results are stored in an SQL-like database, as defined by the trials field in the configuration file. This makes querying and reproducing Osprey results fairly simple.
Osprey provides two command-line tools to quickly digest your results: current_best and plot. current_best, as the name suggests, prints out the best scoring model currently in your trials database, as well as the parameters used to create it. Here's some example output from our SVM classification tutorial above:
$ osprey current_best config.yaml
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Best Current Model = 0.975515 +- 0.013327
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
C 7.957695018309156
gamma 0.0004726222555749291
This is useful if you want to get a sense of how well your trials are doing or just want to quickly grab the best current result from Osprey.
The plot functionality provides interactive HTML charts using bokeh (note that bokeh must be installed to use osprey plot).
$ osprey plot config.yaml
The command above opens a browser window with a variety of plots; one such plot traces the running best SVM model score over many iterations.
An alternative way to access trial data is to use the Python API to directly access the SQL-like database. Here's an example of loading your Osprey results as a pandas.DataFrame:
# Imports
from osprey.config import Config
# Load Configuration File
my_config = 'path/to/config.yaml'
config = Config(my_config)
# Retrieve Trial Results
df = config.trial_results()
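From here, standard pandas operations apply. For example, assuming the results table exposes a mean_test_score column (as reported by current_best; column names may differ between Osprey versions), you could pull out the top-scoring trials:

# Sort trials by mean cross-validation test score, best first
# (assumes a 'mean_test_score' column; check df.columns if it differs)
top_trials = df.sort_values('mean_test_score', ascending=False)
print(top_trials.head())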