datastream - datastream class

The file yann.modules.datastream.py contains the definition for the datastream:

class yann.modules.datastream.datastream(dataset_init_args, borrow=True, verbose=1)

This module initializes the dataset for the network class and provides all dataset-related functionality. It also supports dynamically loading and caching dataset batches. add_layer will use this module to initialize.

Parameters:
  • dataset_init_args – A dictionary of the form:

    dataset_init_args = {
                "dataset":  <location>
                "svm"    :  False or True
                     ``svm`` if ``True``, a one-hot label set will also be set up.
                "n_classes": <int>
                    ``n_classes`` if ``svm`` is ``True``, we need to know how
                     many ``n_classes`` are present.
                "id": id of the datastream
        }

  • borrow – Theano’s borrow. Default value is True.
  • verbose – Similar to verbose throughout the toolbox.
Returns:

A dataset module object that has the details of loader and other things.

Return type:

dataset
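As a minimal sketch of the dictionary described above, the following builds a dataset_init_args with all four documented keys. The location string, the class count, and the id value are hypothetical placeholders, and the constructor call is left as a comment because it needs yann and a prepared dataset on disk.

```python
# Only the four keys come from the documentation; every value is a placeholder.
dataset_init_args = {
    "dataset": "path/to/dataset",  # hypothetical location of a prepared dataset
    "svm": True,                   # also set up a one-hot label set
    "n_classes": 10,               # needed because "svm" is True
    "id": "data",                  # hypothetical id for this datastream
}

# With yann installed, the arguments would be passed along these lines,
# following the signature documented above:
# from yann.modules.datastream import datastream
# stream = datastream(dataset_init_args=dataset_init_args, borrow=True, verbose=1)
```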

Todo

  • Datastream should perhaps work with Fuel?
  • Perhaps support HDF5.
initialize_dataset(verbose=1)

Loads the initial training batch of data into the data_x and data_y variables and creates the shared memories.

Todo

I am assuming that the training set has the largest amount of data. This is immaterial when caching, but I need to be careful in the set_data routine.

Parameters:verbose – Toolbox style verbose.
load_data(type='train', batch=0, verbose=2)

Loads the data from the file and returns it. Note that all datasets in yann require a y, that is, a variable to predict. In the case of an auto-encoder, for instance, the thing to predict is the image itself. Set up the dataset accordingly.

Parameters:
  • type – train, test or valid. Default is train.
  • batch – Supply an integer.
  • verbose – Similar to verbose in the toolbox.

Todo

Create and load dataset for type = ‘x’

Returns:data_x, data_y
Return type:numpy.ndarray
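
The requirement that every dataset carries a variable to predict can be illustrated with a small NumPy sketch. It does not use yann's own dataset format. The array shapes and label values are made up, and the pairs only mimic the (data_x, data_y) return shape described above.

```python
import numpy

rng = numpy.random.default_rng(0)

# Hypothetical batch: four flattened 8x8 "images" with integer class labels.
data_x = rng.random((4, 64)).astype("float32")
labels = numpy.array([0, 2, 1, 2], dtype="int32")

# Classification: the variable to predict is the label vector.
classification_pair = (data_x, labels)

# Auto-encoder: the variable to predict is the image itself.
autoencoder_pair = (data_x, data_x.copy())
```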
one_hot_labels(y, verbose=1)

Function takes in labels and returns a one-hot encoding. Used for max-margin loss.

Parameters:
  • y – Labels to be encoded.
  • verbose – Typical as in the rest of the toolbox.

Notes

self.n_classes: Number of unique classes in the labels.

This could be found out using the following:

    import numpy
    n_classes = len(numpy.unique(y))

This is potentially dangerous with a cached dataset. Although this is the default if n_classes is not provided as input to this module, I discourage anyone from using it.
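
A short sketch of why inferring the class count from the labels can go wrong with cached batches: a single batch need not contain every class. The label arrays here are invented for illustration.

```python
import numpy

# Hypothetical labels for a full dataset with five classes.
full_labels = numpy.array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4])

# Hypothetical cached batch that happens to contain only three of the classes.
cached_batch = full_labels[:3]

n_classes_full = len(numpy.unique(full_labels))    # counts all five classes
n_classes_batch = len(numpy.unique(cached_batch))  # counts only the classes seen in this batch
```

An encoding sized from n_classes_batch would have too few columns for labels that appear in later batches, which is one reason to pass n_classes explicitly.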

Returns:one-hot encoded label list.
Return type:numpy.ndarray
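
For readers unfamiliar with the idea, here is a standalone NumPy sketch of one-hot encoding with an explicitly supplied class count. It is not the toolbox's implementation: the helper name one_hot is hypothetical, and the 0/1 values are an assumption, since the method above may encode labels differently for the max-margin loss.

```python
import numpy

def one_hot(y, n_classes):
    """Return an array of shape (len(y), n_classes) with a 1 in each label's column."""
    y = numpy.asarray(y, dtype="int64")
    encoded = numpy.zeros((y.shape[0], n_classes), dtype="float32")
    # Row i gets a 1 in column y[i]; all other entries stay 0.
    encoded[numpy.arange(y.shape[0]), y] = 1.0
    return encoded

encoded = one_hot([0, 2, 1], n_classes=3)
```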
set_data(type='train', batch=0, verbose=2)

This works only after the network is cooked.

Parameters:
  • batch – which batch of data to load and set
  • verbose – as usual