Cooking a Matlab dataset for Yann
By virtue of being here, it is assumed that you have gone through the Quick Start.
This tutorial will help you convert a dataset from the Matlab workspace to yann. To begin, let us acquire Google's Street View House Numbers (SVHN) dataset in Matlab format [1]. From the dataset's page, download the three .mat files: test_32x32.mat, train_32x32.mat and extra_32x32.mat. Once downloaded, we need to divide this mat dump of data into training, testing and validation minibatches appropriately, as used by yann. This can be accomplished by the steps outlined in the code yann/pantry/matlab/make_svhn.m. This will create data with 500 samples per minibatch, with 56 training batches, 42 testing batches and 28 validation batches.
Once the mat files are set up appropriately, they are ready for yann to load and convert into yann data. For data that is not from SVHN, you can open one of the 'batch' files in Matlab to understand how the data is laid out. Typically, the x variable holds the vectorized images, in this case 500 x 3072 (500 images per batch, 32*32*3 pixels per image). y is an integer vector of labels, going from 0 to 10 in this case.
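To sanity-check this layout, a vectorized batch can be reshaped back into image form with numpy. The unravel order used below (height, width, channels) is an assumption; verify it against how your Matlab script vectorized the images:

```python
import numpy as np

# One batch as described above: 500 vectorized images of 32*32*3 = 3072
# pixels each, plus an integer label vector.
x = np.random.rand(500, 3072)
y = np.random.randint(0, 11, size=500)

# Recover the image tensor from the vectorized form.
images = x.reshape(500, 32, 32, 3)
print(images.shape)  # (500, 32, 32, 3)
```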
References
[1] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
To convert the dataset into yann, we can use the setup_dataset module in the file yann/utils/dataset.py. Simply call the initializer as,
    dataset = setup_dataset(dataset_init_args = data_params,
                            save_directory = save_directory,
                            preprocess_init_args = preprocess_params,
                            verbose = 3)
where data_params contains information about the dataset, as follows:
    data_params = {
            "source"            : 'matlab',
            # "name"            : 'yann_svhn', # some name.
            "location"          : location,    # some location to load from.
            "height"            : 32,
            "width"             : 32,
            "channels"          : 3,
            "batches2test"      : 42,
            "batches2train"     : 56,
            "batches2validate"  : 28,
            "mini_batch_size"   : 500 }
and preprocess_params contains information on how to process the images, as follows:
    preprocess_params = {
            "normalize" : True,
            "ZCA"       : False,
            "grayscale" : False,
            "zero_mean" : False,
        }
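Numerically, these flags correspond roughly to the operations sketched below. This is a hedged approximation; the actual preprocessing lives inside yann.utils.dataset and may differ in detail (for example, in whether statistics are computed per batch or across the whole dataset):

```python
import numpy as np

x = np.random.rand(500, 3072) * 255.0   # one batch of raw pixel values

# "normalize": scale pixel values into [0, 1].
x_norm = x / 255.0

# "zero_mean" (disabled in the dictionary above): additionally centre
# every feature at zero by subtracting the per-feature mean.
x_centered = x_norm - x_norm.mean(axis=0)

print(x_norm.min() >= 0.0 and x_norm.max() <= 1.0)   # True
print(abs(x_centered.mean()) < 1e-6)                 # True
```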
save_directory is simply the location where the yann dataset will be saved. Customarily, it is

    save_directory = '_datasets'
The full code for this tutorial, with additional commentary, can be found in the file pantry/tutorials/mat2yann.py.
If you have the toolbox cloned or downloaded, or just the tutorials downloaded, run the code using:

    pantry.tutorials.mat2yann.cook_svhn_normalized(location, verbose=1, **kwargs)

This method demonstrates how to cook a dataset for yann from Matlab. Refer to the pantry/matlab/setup_svhn.m file first to set up the dataset and make it ready for use with yann.

Parameters:

- location – the location where the dataset is created and stored. Refer to the prepare_svhn.m file to understand how to prepare a dataset.
- save_directory – the directory to save the cooked dataset to.
- dataset_params – default is the dictionary. Refer to setup_dataset.
- preprocess_params – default is the dictionary. Refer to setup_dataset.
Notes
By default, this will create a dataset that is not mean-subtracted.
class yann.utils.dataset.setup_dataset(dataset_init_args, save_directory='_datasets', verbose=1, **kwargs)

The setup_dataset class is used to create and assemble datasets that are friendly to the Yann toolbox.
Todo

- images option for the source.
- skdata pascal isn't working.
- imagenet and coco datasets need to be set up.

Parameters:

- dataset_init_args – is a dictionary of the form:
    data_init_args = {
            "source" : <where to get the dataset from>
                       'pkl'         : A theano tutorial style 'pkl' file.
                       'skdata'      : Download and setup from skdata.
                       'matlab'      : Data is created and is being used from Matlab.
                       'images-only' : Data is created from a directory of images.
                                       This will be an unsupervised dataset with
                                       no labels.
            "name" : necessary only for skdata. Supports:
                       'mnist', 'mnist_noise1', 'mnist_noise2', 'mnist_noise3',
                       'mnist_noise4', 'mnist_noise5', 'mnist_noise6',
                       'mnist_bg_images', 'mnist_bg_rand', 'mnist_rotated',
                       'mnist_rotated_bg', 'cifar10', 'caltech101', 'caltech256'.
                       Refer to the original paper by Hugo Larochelle [2] for
                       details of these datasets.
            "location"               : necessary for 'pkl', 'matlab' and 'images-only'
            "mini_batch_size"        : 500,            # some batch size
            "mini_batches_per_batch" : (100, 20, 20),  # training, testing, validation
            "batches2train"          : 1,              # number of files that will be created.
            "batches2test"           : 1,
            "batches2validate"       : 1,
            "height"                 : 28,             # after pre-processing
            "width"                  : 28,
            "channels"               : 1,              # color (3) or grayscale (1)
            ...
            }
- preprocess_init_args – provides preprocessing arguments. This is a dictionary:

    args = {
            "normalize"  : <bool> True to normalize across batches
            "GCN"        : True for global contrast normalization
            "ZCA"        : True, kind of like a PCA representation (not fully tested)
            "grayscale"  : convert the image to grayscale
            }
- save_directory – <string> a location where the dataset is going to be saved.
[2] Larochelle H., Erhan D., Courville A., Bergstra J., Bengio Y. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, 2007 (pp. 473-480). ACM.

Notes
The Yann toolbox takes datasets in a .pkl format. The dataset requires a directory structure such as the following:

    location/_dataset_XXXXX
    |_ data_params.pkl
    |_ train
        |_ batch_0.pkl
        |_ batch_1.pkl
        .
        .
    |_ valid
        |_ batch_0.pkl
        |_ batch_1.pkl
        .
        .
    |_ test
        |_ batch_0.pkl
        |_ batch_1.pkl
        .
        .
The location id (XXXXX) is generated by this class. The five digits produced are the unique id of the dataset. The file data_params.pkl contains one variable, dataset_args, which is used by the datastream.
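Given that directory structure, reading one cooked batch back is a few lines of pickle. The helper below is a hypothetical sketch, not part of the yann API; the _dataset_XXXXX directory name and the exact contents of each batch_*.pkl depend on what setup_dataset produced for your run:

```python
import os
import pickle

def load_batch(dataset_dir, split='train', batch=0):
    # dataset_dir is the location/_dataset_XXXXX directory created by
    # setup_dataset; split is one of 'train', 'valid' or 'test'.
    path = os.path.join(dataset_dir, split, 'batch_%d.pkl' % batch)
    with open(path, 'rb') as f:
        return pickle.load(f)
```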