Datasets and Persistence
MSMBuilder provides utility functions for saving data to disk and loading it back. Two types of objects can be persisted: datasets and models.
Datasets
In MSMBuilder, a dataset is a collection of timeseries, or “sequences”. Each timeseries usually represents a single molecular dynamics trajectory, and may be represented in one of several formats:
- A sequence may be an instance of mdtraj.Trajectory, a molecular dynamics trajectory object.
- A sequence may be a numpy 2D array with shape n_frames x n_features, representing the projection of each frame of a molecular dynamics trajectory into the vector space \(\mathbb{R}^{n_{features}}\). The leading dimension, of length n_frames, indexes over the timeseries. For example, featurization takes a list of trajectories and returns a list of feature arrays.
- A sequence may be an integer-valued 1D array with shape n_frames. For example, clustering takes a list of feature arrays and returns a list of sequences of state indices.
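The array shapes described above can be illustrated with plain NumPy. This is only a sketch: the data are random, and `assign_states` and `centers` are hypothetical stand-ins for a real clustering step, not part of MSMBuilder.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy dataset: three sequences of different lengths, each frame described
# by the same number of features (shape n_frames x n_features).
n_features = 4
feature_seqs = [rng.normal(size=(n_frames, n_features)) for n_frames in (10, 25, 7)]

# Hypothetical cluster centers, used only to illustrate the output shape.
centers = np.array([[-1.0] * n_features, [1.0] * n_features])


def assign_states(X, centers):
    """Label each frame with the index of its nearest center (illustrative only)."""
    # Pairwise distances between frames and centers, shape (n_frames, n_centers).
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)


# One integer-valued 1D array of length n_frames per input sequence.
state_seqs = [assign_states(X, centers) for X in feature_seqs]

for X, s in zip(feature_seqs, state_seqs):
    print(X.shape, s.shape, s.dtype)
```

Note that the dataset is a list of arrays rather than a single 3D array, because the sequences generally have different lengths.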
Datasets on Disk
MSMBuilder can read and write datasets on disk in two formats: hdf5 and dir-npy. From the Python API, you must choose which format to write. The command-line application chooses the most sensible option for you.
With HDF5, the entire dataset, including all of its trajectories, is stored in a single file on disk. This is generally the most convenient format, but it can be unwieldy for large datasets. The transformed output of the msmb tICA, msmb PCA, and clustering commands is stored in HDF5 format.
The dir-npy format stores the dataset as a collection of uncompressed numpy .npy files in a directory on disk. This is the most suitable format for large datasets, because it enables features like memory-mapped IO. The transformed output of the msmb Featurizer commands is stored in dir-npy format.
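To make the memory-mapping point concrete, here is a minimal sketch of the idea behind a directory of .npy files, written with plain NumPy rather than MSMBuilder's own writer. The zero-padded file names mimic the directory listing shown under Provenance Information; the temporary directory and the toy arrays are assumptions for illustration.

```python
import os
import tempfile

import numpy as np

# Two toy sequences with the same number of features but different lengths.
sequences = [np.arange(12, dtype=float).reshape(6, 2), np.ones((3, 2))]

# Write one uncompressed .npy file per sequence into a scratch directory.
dataset_dir = tempfile.mkdtemp()
for i, seq in enumerate(sequences):
    np.save(os.path.join(dataset_dir, "%08d.npy" % i), seq)

# Reopen the files memory-mapped: data are read from disk only when accessed,
# so a large dataset does not have to fit in memory at once.
filenames = sorted(f for f in os.listdir(dataset_dir) if f.endswith(".npy"))
loaded = [np.load(os.path.join(dataset_dir, f), mmap_mode="r") for f in filenames]

print(filenames)
print([a.shape for a in loaded])
```

Because each sequence lives in its own file, individual trajectories can be opened without touching the rest of the dataset.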
Trajectory Datasets - Read only
Trajectory datasets are loaded using MDTraj. This requires specifying a glob pattern for the trajectories, as well as the topology. MSMBuilder does not write trajectory datasets.
Provenance Information
When MSMBuilder saves a dataset, it also saves information that can be used to trace the dataset's provenance. For example, a featurizer command writes a PROVENANCE.txt file alongside the data:
$ msmb AtomPairsFeaturizer --out atom_pairs --trjs '*.dcd' --pair_indices atom_indices.txt --top top.pdb
[...]
$ ls atom_pairs
00000000.npy 00000002.npy 00000004.npy 00000006.npy 00000008.npy PROVENANCE.txt
00000001.npy 00000003.npy 00000005.npy 00000007.npy 00000009.npy
$ cat atom_pairs/PROVENANCE.txt
MSMBuilder Dataset:
MSMBuilder: 3.0.0-beta.dev-99bc8a9
Command: msmb AtomPairsFeaturizer --out atom_pairs --trjs '*.dcd' --pair_indices
Path: atom_pairs/
Username: rmcgibbo
Hostname: Computer-3.local
Date: December 01, 2014 12:16 AM
Comments:
== Derived from ==
MDTraj dataset:
path: *.dcd
topology: /Users/rmcgibbo/msmbuilder_data/alanine_dipeptide/top.pdb
stride: 1
atom_indices None
Models
MSMBuilder models can be losslessly persisted to disk using Python’s pickle infrastructure. We recommend using the functions msmbuilder.utils.load() and msmbuilder.utils.dump() to load and save models, respectively. The pickle format is not secure: unpickling a file can execute arbitrary code. Do not load MSMBuilder models from untrusted sources.
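The round trip that pickle-based persistence provides can be sketched with the standard library alone. The `model_state` dictionary below is a hypothetical stand-in for a fitted model, and this sketch calls `pickle` directly rather than the MSMBuilder helpers.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model's state (not an MSMBuilder object).
model_state = {
    "n_states": 2,
    "lag_time": 10,
    "transmat": [[0.9, 0.1], [0.2, 0.8]],
}

path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Save: serialize the object to a binary file.
with open(path, "wb") as f:
    pickle.dump(model_state, f)

# Load: only do this for files you created or otherwise trust.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_state)
```

With MSMBuilder installed, the signatures documented under Functions suggest the equivalent calls would be msmbuilder.utils.dump(model, filename) and msmbuilder.utils.load(filename).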
Functions
- msmbuilder.dataset.dataset(path, mode='r', fmt=None, verbose=False, **kwargs)
  Open a dataset object.
  MSMBuilder supports several dataset ‘formats’ for storing lists of sequences on disk.
  This function can also be used as a context manager.
  Parameters:
    path : str
      The path to the dataset on the filesystem.
    mode : {‘r’, ‘w’, ‘a’}
      Open a dataset for reading, writing, or appending. Note that some formats only support a subset of these modes.
    fmt : {‘dir-npy’, ‘hdf5’, ‘mdtraj’}
      The format of the data on disk.
      dir-npy: A directory of binary numpy files, one file per sequence.
      hdf5: A single hdf5 file with each sequence as an array node.
      mdtraj: A read-only set of trajectory files that can be loaded with mdtraj.
    verbose : bool
      Whether to print information about the dataset.
- msmbuilder.utils.dump(value, filename, compress=None, cache_size=None)
  Save an arbitrary Python object using pickle.
  Parameters:
    value : any Python object
      The object to store to disk using pickle.
    filename : string
      The name of the file in which it is to be stored.
    compress : None
      No longer used.
    cache_size : positive number, optional
      No longer used.
  See also:
    load : corresponding loader
- msmbuilder.utils.load(filename)
  Load an object that has been saved with dump.
  We try to open it using the pickle protocol. As a fallback, we use joblib.load. Joblib was the default prior to msmbuilder v3.2.
  Parameters:
    filename : string
      The name of the file to load.