Cooking a matlab dataset for Yann.

By virtue of being here, it is assumed that you have gone through the Quick Start.

This tutorial will help you convert a dataset from the MATLAB workspace to yann. To begin, let us acquire Google’s Street View House Numbers (SVHN) dataset in MATLAB [1]. Download the three .mat files test_32x32.mat, train_32x32.mat and extra_32x32.mat from the dataset’s website. Once downloaded, this mat dump of data needs to be divided into training, testing and validation mini-batches, as used by yann. This can be accomplished by following the steps outlined in yann\pantry\matlab\make_svhn.m, which will create data with 500 samples per mini-batch: 56 training batches, 42 testing batches and 28 validation batches.

Once the .mat files are set up appropriately, they are ready for yann to load and convert into yann data. For data that is not from SVHN, you can open one of the ‘batch’ files in MATLAB to understand how the data is laid out. Typically, the x variable holds vectorized images, in this case 500x3072 (500 images per batch, 32*32*3 pixels per image), and y is an integer vector of labels, going from 0-10 in this case.
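To make that layout concrete, here is a small numpy sketch (the array contents are synthetic stand-ins, not real SVHN data) of how a 500-image mini-batch vectorizes into the 500x3072 x matrix described above:

```python
import numpy as np

# Synthetic stand-in for one mini-batch of 500 RGB images of size 32x32.
batch = np.zeros((500, 32, 32, 3), dtype=np.uint8)

# Vectorize: each image flattens into a 32*32*3 = 3072-element row,
# giving the 500x3072 'x' matrix described above.
x = batch.reshape(500, -1)

# 'y' is an integer label vector with one entry per image.
y = np.zeros(500, dtype=int)

print(x.shape)  # (500, 3072)
print(y.shape)  # (500,)
```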

References

[1] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

To convert the data into yann, we can use the setup_dataset class in the yann.utils.dataset module. Simply call the initializer as,

dataset = setup_dataset(dataset_init_args = data_params,
                        save_directory = save_directory,
                        preprocess_init_args = preprocess_params,
                        verbose = 3 )

where data_params contains information about the dataset, as follows:

data_params = {
               "source"           : 'matlab',
               # "name"           : 'yann_svhn', # some name.
               "location"         : location,    # some location to load from.
               "height"           : 32,
               "width"            : 32,
               "channels"         : 3,
               "batches2test"     : 42,
               "batches2train"    : 56,
               "batches2validate" : 28,
               "mini_batch_size"  : 500  }
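As a quick sanity check (this arithmetic is an illustration, not part of the toolbox), the batch counts above determine the total number of samples in each split:

```python
# Each split holds (number of batches) * (mini_batch_size) samples.
mini_batch_size = 500
train_samples = 56 * mini_batch_size
test_samples  = 42 * mini_batch_size
valid_samples = 28 * mini_batch_size

print(train_samples, test_samples, valid_samples)  # 28000 21000 14000
```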

and preprocess_params contains information on how to preprocess the images, as follows:

preprocess_params = {
                        "normalize"     : True,
                        "ZCA"           : False,
                        "grayscale"     : False,
                        "zero_mean"     : False,
                    }
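The two simplest of these flags can be pictured with a short numpy sketch (a rough illustration of the idea on synthetic data, not yann’s actual preprocessing code): "normalize" rescales pixel values, and the related "zero_mean" option centers each feature:

```python
import numpy as np

# Synthetic batch of vectorized images with raw pixel values in [0, 255].
x = np.random.randint(0, 256, size=(500, 3072)).astype('float64')

# "normalize": rescale pixel values into [0, 1].
x_norm = x / 255.0

# "zero_mean": subtract the per-feature mean so each column is centered.
x_zero = x_norm - x_norm.mean(axis=0)
```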

save_directory is simply a location to save the yann dataset. Customarily, it is save_directory = '_datasets'.

The full code for this tutorial, with additional commentary, can be found in the file pantry/tutorials/mat2yann.py.

If you have the toolbox cloned or downloaded, or just the tutorials downloaded, run the code using,

pantry.tutorials.mat2yann.cook_svhn_normalized(location, verbose=1, **kwargs)

This method demonstrates how to cook a dataset for yann from MATLAB. Refer to the pantry/matlab/setup_svhn.m file first to set up the dataset and make it ready for use with yann.

Parameters:
  • location – the location where the dataset is created and stored. Refer to the prepare_svhn.m file to understand how to prepare a dataset.
  • save_directory – the directory to save the cooked dataset in.
  • dataset_parms – defaults to the dictionary above. Refer to setup_dataset.
  • preprocess_params – defaults to the dictionary above. Refer to setup_dataset.

Notes

By default, this will create a dataset that is not mean-subtracted.

class yann.utils.dataset.setup_dataset(dataset_init_args, save_directory='_datasets', verbose=1, **kwargs)

The setup_dataset class is used to create and assemble datasets that are friendly to the Yann toolbox.

Todo

  • An images option for the source.
  • skdata pascal isn’t working.
  • The imagenet and coco datasets need to be set up.

Parameters:
  • dataset_init_args

is a dictionary of the form:

    data_init_args = {
    
        "source" : <where to get the dataset from>
                    'pkl' : A theano tutorial style 'pkl' file.
                    'skdata' : Download and setup from skdata
                    'matlab' : Data is created and is being used from Matlab
                    'images-only' : Data is created from a directory of images. This
                            will be an unsupervised dataset with no labels.
        "name" : necessary only for skdata
                  supports
                    * ``'mnist'``
                    * ``'mnist_noise1'``
                    * ``'mnist_noise2'``
                    * ``'mnist_noise3'``
                    * ``'mnist_noise4'``
                    * ``'mnist_noise5'``
                    * ``'mnist_noise6'``
                    * ``'mnist_bg_images'``
                    * ``'mnist_bg_rand'``
                    * ``'mnist_rotated'``
                    * ``'mnist_rotated_bg'``.
                    * ``'cifar10'``
                    * ``'caltech101'``
                    * ``'caltech256'``
    
Refer to the original paper by Hugo Larochelle [2] for details of these datasets.
    
        "location"                  : necessary for 'pkl' and 'matlab' and
                                        'images-only'
        "mini_batch_size"           : 500, # some batch size
    "mini_batches_per_batch"    : (100, 20, 20), # training, testing, validation
        "batches2train"             : 1, # number of files will be created.
        "batches2test"              : 1,
        "batches2validate"          : 1,
        "height"                    : 28, # After pre-processing
        "width"                     : 28,
        "channels"                  : 1 , # color (3) or grayscale (1) ...
    
            }
    
  • preprocess_init_args

    provide preprocessing arguments. This is a dictionary:

    args =  {
        "normalize" : <bool> True for normalize across batches
        "GCN"       : True for global contrast normalization
        "ZCA"       : True, kind of like a PCA representation (not fully tested)
        "grayscale" : Convert the image to grayscale
            }
    
  • save_directory – <string> a location where the dataset is going to be saved.

[2] Larochelle H, Erhan D, Courville A, Bergstra J, Bengio Y. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, 2007 (pp. 473-480). ACM.

Notes

Yann toolbox takes datasets in a .pkl format. The dataset requires a directory structure such as the following:

location/_dataset_XXXXX
|_ data_params.pkl
|_ train
    |_ batch_0.pkl
    |_ batch_1.pkl
    .
    .
    .
|_ valid
    |_ batch_0.pkl
    |_ batch_1.pkl
    .
    .
    .
|_ test
    |_ batch_0.pkl
    |_ batch_1.pkl
    .
    .
    .

The location id (XXXXX) is generated by this class. The five digits produced are the unique id of the dataset.

The file data_params.pkl contains one variable, dataset_args, used by datastream.
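The directory layout above can be sketched in plain Python (a hypothetical example with a made-up five-digit id and empty placeholder batches, not the toolbox’s own code):

```python
import os
import pickle
import tempfile

# Build the expected structure under a temporary root, with a made-up
# dataset id (12345) and two placeholder batch files per split.
root = tempfile.mkdtemp()
dataset_dir = os.path.join(root, '_dataset_12345')
for split in ('train', 'valid', 'test'):
    os.makedirs(os.path.join(dataset_dir, split))
    for i in range(2):
        with open(os.path.join(dataset_dir, split, 'batch_%d.pkl' % i), 'wb') as f:
            pickle.dump({'x': None, 'y': None}, f)  # placeholder contents

# data_params.pkl sits at the top level and holds the dataset arguments.
with open(os.path.join(dataset_dir, 'data_params.pkl'), 'wb') as f:
    pickle.dump({'mini_batch_size': 500}, f)

print(sorted(os.listdir(dataset_dir)))  # ['data_params.pkl', 'test', 'train', 'valid']
```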