dataset - provides ports to benchmark and Matlab-based datasets

The file yann.utils.dataset.py contains the definitions of the dataset ports. It contains support for various benchmark datasets through skdata. There is also support for datasets that can be imported from Matlab.

Todo

  • None of the PASCAL dataset retrievers from skdata is working. These need to be coded in.
  • Need a method to create a dataset from a directory of images, in preparation for ImageNet and COCO.
  • See if support can be added for fuel.
yann.utils.dataset.create_shared_memory_dataset(data_xy, borrow=True, verbose=1, **kwargs)[source]

This function creates a shared theano memory to be used for dataset purposes.

Parameters:
  • data_xy – [data_x, data_y] that will be assigned to shared_x and shared_y on output.
  • borrow – default value is True. This is a theano shared memory type variable.
  • verbose – Similar to verbose everywhere else.
  • svm – default is False. If True, we also return a shared_svm_y for max-margin type last layer.
Returns:

shared_x, shared_y if svm is False; otherwise shared_x, shared_y, shared_svm_y

Return type:

theano.shared
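
The preparation this function performs can be sketched as follows. This is a hedged sketch: theano is not imported here, and the theano.shared call shown in the comment is the assumed usage, not executed code.

```python
import numpy as np

# Toy dataset: 10 vectorized 4-pixel images and their integer labels.
data_x = np.random.rand(10, 4)
data_y = np.random.randint(0, 2, size=10)

# Theano shared variables hold float32 data; labels are stored as floats
# and cast back to int32 when used. This mirrors that preparation step:
shared_x = np.asarray(data_x, dtype='float32')
shared_y = np.asarray(data_y, dtype='float32')
labels = shared_y.astype('int32')

# With theano installed, the actual call would look like (assumed usage):
#   shared_x = theano.shared(np.asarray(data_x, dtype='float32'), borrow=True)
```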

yann.utils.dataset.download_data(url, location)[source]
yann.utils.dataset.load_cifar100()[source]

Function that downloads the CIFAR-100 dataset and returns the dataset in full

TODO: Need to implement this.

yann.utils.dataset.load_data_mat(height, width, channels, location, batch=0, type_set='train', load_z=False)[source]

Use this code if the data was created in Matlab in the right format and needs to be loaded. The file should contain variables x, y and optionally z. x is assumed to be the data in double matrix format, with each row being one image in vectorized fashion, and y is assumed to be labels in int or double format.

The files are stored in the following format: loc/type/batch_0.mat. This code needs scipy to run.

Parameters:
  • height – The height of each image in the dataset.
  • width – The width of each image in the dataset.
  • channels – 3 if RGB, 1 if grayscale and so on.
  • location – Location of the dataset.
  • batch – If the data is in multiple batches, the batch number to load; use 1 otherwise.
Returns:

Tuple (data_x, data_y); if requested, also (data_x, data_y, data_z)

Return type:

float32 tuple

Todo

Need to add preprocessing in this.
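
A minimal sketch of the vectorized x layout this loader expects (variable names and sizes here are illustrative, not taken from the toolbox):

```python
import numpy as np

height, width, channels = 28, 28, 1
n_images = 5

# Five grayscale images as a 4-D array ...
images = np.random.rand(n_images, height, width, channels)

# ... flattened so that each row of x is one image in vectorized fashion,
# matching the double-matrix format expected inside batch_0.mat.
x = images.reshape(n_images, height * width * channels)

# y holds one integer label per row of x.
y = np.random.randint(0, 10, size=n_images)
```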

yann.utils.dataset.load_images_only(batch_size, location, n_train_images, n_test_images, n_valid_images, rand_perm, batch=1, type_set='train', height=218, width=178, channels=3, verbose=False)[source]

Function that loads the dataset from a directory of images and returns it in full.

Parameters:
  • batch_size – What is the size of the batch.
  • n_train_images – number of training images.
  • n_test_images – number of testing images.
  • n_valid_images – number of validating images.
  • rand_perm – Create a random permutation list of images to be sampled into batches.
  • type_set – Which dataset you need: test, train or valid.
  • height – Height of the image.
  • width – Width of the image.
  • verbose – Similar to verbose everywhere else.
Returns:

data_x

Return type:

list

yann.utils.dataset.load_skdata_caltech101(batch_size, n_train_images, n_test_images, n_valid_images, rand_perm, batch=1, type_set='train', height=256, width=256, verbose=False)[source]

Function that downloads the dataset from skdata and returns the dataset in part

Parameters:
  • batch_size – What is the size of the batch.
  • n_train_images – number of training images.
  • n_test_images – number of testing images.
  • n_valid_images – number of validating images.
  • rand_perm – Create a random permutation list of images to be sampled into batches.
  • type_set – Which dataset you need: test, train or valid.
  • height – Height of the image.
  • width – Width of the image.
  • verbose – Similar to verbose everywhere else.

Todo

This is not a finished function.

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_caltech256(batch_size, n_train_images, n_test_images, n_valid_images, rand_perm, batch=1, type_set='train', height=256, width=256, verbose=False)[source]

Function that downloads the dataset from skdata and returns the dataset in part

Parameters:
  • batch_size – What is the size of the batch.
  • n_train_images – number of training images.
  • n_test_images – number of testing images.
  • n_valid_images – number of validating images.
  • rand_perm – Create a random permutation list of images to be sampled into batches.
  • type_set – Which dataset you need: test, train or valid.
  • height – Height of the image.
  • width – Width of the image.
  • verbose – Similar to verbose everywhere else.

Todo

This is not a finished function.

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_cifar10()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_bg_images()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_bg_rand()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_noise1()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_noise2()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_noise3()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_noise4()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_noise5()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_noise6()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y),(test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_rotated()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_rotated_bg()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.pickle_dataset(loc, batch, data)[source]

Function that stores an object down as a pickle file, given its location and batch number

Parameters:
  • loc – Provide the location to save to, as a string.
  • batch – Provide a batch number to save the file as.
  • data – Pass the data that needs to be pickled down. Could also be a tuple.
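
The batch_N.pkl naming used throughout the toolbox can be sketched with the standard library (the helper name pickle_batch and the temporary path are illustrative):

```python
import os
import pickle
import tempfile

def pickle_batch(loc, batch, data):
    """Store `data` as <loc>/batch_<batch>.pkl (a sketch of pickle_dataset)."""
    path = os.path.join(loc, 'batch_' + str(batch) + '.pkl')
    with open(path, 'wb') as f:
        pickle.dump(data, f)
    return path

# Usage: pickle a (data_x, data_y) tuple as batch 0 in a temporary directory.
loc = tempfile.mkdtemp()
path = pickle_batch(loc, 0, ([1.0, 2.0], [0, 1]))
with open(path, 'rb') as f:
    restored = pickle.load(f)
```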
class yann.utils.dataset.setup_dataset(dataset_init_args, save_directory='_datasets', verbose=1, **kwargs)[source]

The setup_dataset class is used to create and assemble datasets that are friendly to the Yann toolbox.

Todo

Add an images option for the source. The skdata PASCAL retriever isn't working; the ImageNet and COCO datasets need to be set up.

Parameters:
  • dataset_init_args

    is a dictionary of the form:

    data_init_args = {
    
        "source" : <where to get the dataset from>
                    'pkl' : A theano tutorial style 'pkl' file.
                    'skdata' : Download and setup from skdata
                    'matlab' : Data is created and is being used from Matlab
                    'images-only' : Data is created from a directory of images. This
                            will be an unsupervised dataset with no labels.
        "name" : necessary only for skdata
                  supports
                    * ``'mnist'``
                    * ``'mnist_noise1'``
                    * ``'mnist_noise2'``
                    * ``'mnist_noise3'``
                    * ``'mnist_noise4'``
                    * ``'mnist_noise5'``
                    * ``'mnist_noise6'``
                    * ``'mnist_bg_images'``
                    * ``'mnist_bg_rand'``
                    * ``'mnist_rotated'``
                    * ``'mnist_rotated_bg'``.
                    * ``'cifar10'``
                    * ``'caltech101'``
                    * ``'caltech256'``
    
            Refer to original paper by Hugo Larochelle [1] for these dataset details.
    
        "location"                  : necessary for 'pkl' and 'matlab' and
                                        'images-only'
        "mini_batch_size"           : 500, # some batch size
        "mini_batches_per_batch"    : (100, 20, 20), # training, testing, validation
        "batches2train"             : 1, # number of files that will be created.
        "batches2test"              : 1,
        "batches2validate"          : 1,
        "height"                    : 28, # After pre-processing
        "width"                     : 28,
        "channels"                  : 1 , # color (3) or grayscale (1) ...
    
            }
    
  • preprocess_init_args

    provide preprocessing arguments. This is a dictionary:

    args =  {
        "normalize" : <bool> True for normalize across batches
        "GCN"       : True for global contrast normalization
        "ZCA"       : True, kind of like a PCA representation (not fully tested)
        "grayscale" : Convert the image to grayscale
            }
    
  • save_directory – <string> a location where the dataset is going to be saved.
[1] Larochelle H, Erhan D, Courville A, Bergstra J, Bengio Y. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, 2007 (pp. 473-480). ACM.
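
As an example, a dataset_init_args for MNIST through skdata might look like the following (the values are illustrative defaults, and the setup_dataset call in the comment is the assumed usage, not executed code):

```python
# Illustrative arguments for building an MNIST dataset through skdata.
dataset_init_args = {
    "source": "skdata",
    "name": "mnist",
    "mini_batch_size": 500,
    "mini_batches_per_batch": (100, 20, 20),  # training, testing, validation
    "batches2train": 1,
    "batches2test": 1,
    "batches2validate": 1,
    "height": 28,
    "width": 28,
    "channels": 1,  # grayscale
}

# With the toolbox installed, this would be passed as (assumed usage):
#   ds = setup_dataset(dataset_init_args=dataset_init_args, verbose=1)
```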

Notes

Yann toolbox takes datasets in a .pkl format. The dataset requires a directory structure such as the following:

location/_dataset_XXXXX
|_ data_params.pkl
|_ train
    |_ batch_0.pkl
    |_ batch_1.pkl
    .
    .
    .
|_ valid
    |_ batch_0.pkl
    |_ batch_1.pkl
    .
    .
    .
|_ test
    |_ batch_0.pkl
    |_ batch_1.pkl
    .
    .
    .

The location id (XXXXX) is generated by this class. The five digits produced are the unique id of the dataset.

The file data_params.pkl contains one variable dataset_args used by datastream.
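
The directory tree above can be sketched with the standard library (a fixed toy id stands in for the generated five-digit one):

```python
import os
import tempfile

root = tempfile.mkdtemp()
dataset_dir = os.path.join(root, '_dataset_12345')  # toy id; normally generated

# Create the train/valid/test subdirectories the toolbox expects.
for split in ('train', 'valid', 'test'):
    os.makedirs(os.path.join(dataset_dir, split))

created = sorted(os.listdir(dataset_dir))
```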

dataset_location()[source]

Use this function to return the location of the dataset.

yann.utils.dataset.shuffle(data, verbose=1)[source]

Method that shuffles a dataset's x and y together

Parameters:data – Either a tuple or an ndarray. If a tuple, x and y are assumed.
Returns:Shuffled version of the same.
Return type:data

Notes

Only tuples work at the moment.
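
For the tuple case, the behaviour can be sketched with numpy (the helper name shuffle_xy is illustrative):

```python
import numpy as np

def shuffle_xy(data, seed=None):
    """Shuffle x and y with one shared permutation (a sketch of shuffle)."""
    x, y = data
    rng = np.random.RandomState(seed)
    perm = rng.permutation(len(x))
    return x[perm], y[perm]

# Three two-feature samples and their labels; rows stay paired after shuffling.
x = np.arange(6).reshape(3, 2)
y = np.array([0, 1, 2])
sx, sy = shuffle_xy((x, y), seed=42)
```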