Configuration File¶
osprey jobs are configured via a small configuration file, written in hand-editable YAML. The command osprey skeleton will create an example config.yaml file for you to get started with. The sections of the file are described below.
Estimator¶
The estimator section describes the model that osprey is tasked with optimizing. It can be specified as a Python entry point, a pickle file, or a raw string that is passed to Python's eval(). However it is specified, the estimator should be an instance or subclass of scikit-learn's BaseEstimator.
Examples:

estimator:
    entry_point: sklearn.linear_model.LinearRegression

estimator:
    eval: Pipeline([('vectorizer', TfidfVectorizer()), ('logistic', LogisticRegression())])
    eval_scope: sklearn

estimator:
    pickle: my-model.pkl  # path to pickle file on disk
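If you use the pickle option, the file should contain a serialized estimator. Here is a minimal sketch of producing such a file, assuming scikit-learn is installed; the filename mirrors the example above:

```python
# Serialize an estimator to disk so osprey can load it via the
# `pickle:` option. Any BaseEstimator instance (or subclass) works.
import pickle

from sklearn.linear_model import LinearRegression

model = LinearRegression()

with open("my-model.pkl", "wb") as f:
    pickle.dump(model, f)

# Round-trip check: loading the file recovers the estimator
with open("my-model.pkl", "rb") as f:
    restored = pickle.load(f)
```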
Search Space¶
The search space describes the space of hyperparameters to search over to find the best model. It is specified as the product space of bounded intervals for different variables, each of type int, float, or enum. Variables of type float can also be warped into log-space, which means that the optimization is performed on the log of the parameter instead of the parameter itself.
Example:

search_space:
    logistic__C:
        min: 1e-3
        max: 1e3
        type: float
        warp: log

    logistic__penalty:
        choices:
            - l1
            - l2
        type: enum
You can also transform float and int variables into enumerables by declaring a jump variable:
Example:

search_space:
    logistic__C:
        min: 1e-3
        max: 1e3
        num: 10
        type: jump
        var_type: float
        warp: log
In the example above, we have declared a jump variable C for the logistic estimator. This variable is essentially an enum with 10 possible float values, evenly spaced in log-space within the given min and max range.
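The candidate grid such a jump variable enumerates can be sketched with NumPy (this illustrates the spacing; osprey's internal code may differ in detail):

```python
# num=10 values evenly spaced in log-space between min=1e-3 and max=1e3,
# matching the jump variable declared in the example above.
import numpy as np

candidates = np.logspace(np.log10(1e-3), np.log10(1e3), num=10)

# Even spacing in log-space means consecutive values share a constant ratio
ratios = candidates[1:] / candidates[:-1]
```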
Strategy¶
Three probabilistic search strategies and grid search are supported. First, random search (strategy: {name: random}) samples hyperparameters randomly from the search space at each model-building iteration. Random search has been shown to be significantly more efficient than pure grid search. Example:

strategy:
    name: random
strategy: {name: hyperopt_tpe} is an alternative strategy that uses a Tree of Parzen Estimators, described in this paper. This algorithm requires that the external package hyperopt be installed. Example:

strategy:
    name: hyperopt_tpe
osprey supports a Gaussian process expected improvement search strategy, using the package GPy, with strategy: {name: gp}. Example:

strategy:
    name: gp
Finally, and perhaps simplest of all, is the grid search strategy (strategy: {name: grid}). Example:

strategy:
    name: grid
Please note that grid search only supports enum and jump variables.
Dataset Loader¶
Osprey supports a wide variety of file formats. These include pickle files, numpy files, delimiter-separated value files (e.g. .csv, .tsv), hdf5 files, and most molecular trajectory file formats (see mdtraj.org for reference). For more information about formatting your dataset for use with Osprey, please refer to our “Getting Started” page.
Below is an example of using the dsv loader to load multiple .csv files into Osprey:

Example:

dataset_loader:
    name: dsv
    params:
        filenames: /path/to/files/*.csv, /another/path/to/myfile.csv
        delimiter: ','
        skip_header: 2
        skip_footer: 1
        y_col: 42
        usecols: 0, 1, 2, 3, 4, 5
        concat: True
Notice that we can pass a glob string and/or a comma-separated list of paths to filenames to tell Osprey where our data is located. delimiter defines the separator pattern used to parse the data files (default: ','). skip_header and skip_footer tell Osprey how many lines to ignore at the beginning and end of each file, respectively (default: 0). y_col specifies which column to select as the response variable (default: None). usecols specifies which columns to use as explanatory variables (default: all columns). And finally, concat specifies whether or not to treat all loaded files as a single dataset (default: False).
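Conceptually, those parameters fit together as in the following sketch. This is an illustration of the parsing logic only, not osprey's actual implementation, and the sample data is made up:

```python
# Illustrative DSV parsing: skip header/footer lines, pull the response
# column (y_col) out, and keep only the requested explanatory columns.
import csv
import io

def load_dsv(text, delimiter=",", skip_header=0, skip_footer=0,
             y_col=None, usecols=None):
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    rows = rows[skip_header:len(rows) - skip_footer]
    data = [[float(v) for v in row] for row in rows]
    y = [row[y_col] for row in data] if y_col is not None else None
    if usecols is not None:
        data = [[row[i] for i in usecols] for row in data]
    return data, y

# Two header lines, one footer line, response in column 2
text = "col names\nunits\n1,2,3\n4,5,6\nend-of-file\n"
X, y = load_dsv(text, skip_header=2, skip_footer=1, y_col=2, usecols=[0, 1])
```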
Here’s a complete list of supported file formats, along with their loader name mappings:

numpy: NumPy format
msmbuilder: MSMBuilder dataset format
hdf5: HDF5 format
dsv: Delimiter-separated value (DSV) format
joblib: Pickle and Joblib formats
In addition, we provide two extra loaders:

sklearn_dataset: Allows users to load any scikit-learn dataset
filename: Allows users to pass a set of filenames to the Osprey estimator. Useful for custom dataset loading.
Cross Validation¶
Many types of cross-validation iterators are supported. The simplest option is to pass an int, which sets up k-fold cross-validation.

Example:

cv: 5
To access the other iterators, use the name and params keywords:

cv:
    name: shufflesplit
    params:
        n_splits: 5
        test_size: 0.5
        random_state: 42
Here’s a complete list of supported iterators, along with their name mappings:

kfold: KFold
shufflesplit: ShuffleSplit
loo: LeaveOneOut
stratifiedkfold: StratifiedKFold
stratifiedshufflesplit: StratifiedShuffleSplit
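These names map onto scikit-learn's cross-validation classes; for instance, the shufflesplit config shown earlier corresponds to building a ShuffleSplit iterator like this (assumes scikit-learn is installed):

```python
# The equivalent of cv: {name: shufflesplit, params: {n_splits: 5,
# test_size: 0.5, random_state: 42}} as a scikit-learn object.
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=5, test_size=0.5, random_state=42)

# With 10 samples, each split holds out half of them for testing
splits = list(cv.split(list(range(10))))  # 5 (train_idx, test_idx) pairs
```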
Random Seed¶
In case you need reproducible Osprey trials, you can also include an optional random seed as seen below:
Example:
random_seed: 42
Please note that this makes parallel trials redundant and is thus not recommended when scaling across multiple jobs. A workaround is to create one copy of the configuration file per independent worker, each with a unique random seed.
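That workaround can be scripted in a few lines; the base config and filenames below are illustrative, not a convention osprey itself defines:

```python
# Write one config copy per worker, each with a unique random_seed.
base_config = (
    "estimator:\n"
    "    entry_point: sklearn.linear_model.LinearRegression\n"
)

for worker in range(4):
    with open("config-worker%d.yaml" % worker, "w") as f:
        f.write(base_config)
        f.write("random_seed: %d\n" % (42 + worker))  # unique per worker
```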
Max Parameter Suggestion Retries¶
By default, Osprey may create trials that have already been tested; this can occur, for example, when restarting a grid search. By setting the optional max_param_suggestion_retries parameter, Osprey will instead exit if it fails to generate a parameter set that is not already in the database after max_param_suggestion_retries attempts.
Example:
max_param_suggestion_retries: 10
Trials Storage¶
Example:

trials:
    # Path to a database in which the results of each hyperparameter fit
    # are stored. Any SQL database is supported, but we recommend using
    # SQLite, which is simple and stores the results in a file on disk.
    # The string format for connecting to other databases is described here:
    # http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html#database-urls
    uri: sqlite:///osprey-trials.db