dataset - provides ports to benchmark and Matlab-based datasets

The file yann.utils.dataset.py contains the definitions of the dataset ports. It contains support for various benchmark datasets through skdata. There is also support for datasets that can be imported from Matlab.

Todo

  • None of the PASCAL dataset retrievers from skdata is working. These need to be coded in.
  • Need a method to create a dataset from a directory of images, in preparation for ImageNet and COCO.
  • See if support can be added for fuel.
yann.utils.dataset.create_shared_memory_dataset(data_xy, borrow=True, verbose=1, **kwargs)[source]

This function creates a shared theano memory to be used for dataset purposes.

Parameters:
  • data_xy – [data_x, data_y] that will be assigned to shared_x and shared_y on output.
  • borrow – default value is True. This is a theano shared memory type variable.
  • verbose – Similar to verbose everywhere else.
  • svm – default is False. If True, we also return a shared_svm_y for max-margin type last layer.
Returns:

shared_x, shared_y if svm is False; otherwise shared_x, shared_y, shared_svm_y

Return type:

theano.shared
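
The preparation this function performs can be sketched as follows. This is a hedged sketch: theano is not imported here, and the theano.shared call shown in the comment is the assumed usage, not executed code.

```python
import numpy as np

# Toy dataset: 10 vectorized 4-pixel images and their integer labels.
data_x = np.random.rand(10, 4)
data_y = np.random.randint(0, 2, size=10)

# Theano shared variables hold float32 data; labels are stored as floats
# and cast back to int32 when used. This mirrors that preparation step:
shared_x = np.asarray(data_x, dtype='float32')
shared_y = np.asarray(data_y, dtype='float32')
labels = shared_y.astype('int32')

# With theano installed, the actual call would look like (assumed usage):
#   shared_x = theano.shared(np.asarray(data_x, dtype='float32'), borrow=True)
```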

yann.utils.dataset.download_data(url, location)[source]
yann.utils.dataset.load_cifar100()[source]

Function that downloads the CIFAR-100 dataset and returns the dataset in full

TODO: Need to implement this.

yann.utils.dataset.load_data_mat(height, width, channels, location, batch=0, type_set='train', load_z=False)[source]

Use this code if the data was created in Matlab in the right format and needs to be loaded. The file should contain variables x, y and optionally z. x is assumed to be the data in double matrix format, with each row being one image in vectorized fashion, and y is assumed to be labels in int or double format.

The files are stored in the following format: loc/type/batch_0.mat. This code needs scipy to run.

Parameters:
  • height – The height of each image in the dataset.
  • width – The width of each image in the dataset.
  • channels – 3 if RGB, 1 if grayscale and so on.
  • location – Location of the dataset.
  • batch – If the data is in multiple batches, the batch number to load; use 1 otherwise.
Returns:

Tuple (data_x, data_y); if requested, also (data_x, data_y, data_z)

Return type:

float32 tuple

Todo

Need to add preprocessing in this.
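
A minimal sketch of the vectorized x layout this loader expects (variable names and sizes here are illustrative, not taken from the toolbox):

```python
import numpy as np

height, width, channels = 28, 28, 1
n_images = 5

# Five grayscale images as a 4-D array ...
images = np.random.rand(n_images, height, width, channels)

# ... flattened so that each row of x is one image in vectorized fashion,
# matching the double-matrix format expected inside batch_0.mat.
x = images.reshape(n_images, height * width * channels)

# y holds one integer label per row of x.
y = np.random.randint(0, 10, size=n_images)
```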

yann.utils.dataset.load_images_only(batch_size, location, n_train_images, n_test_images, n_valid_images, rand_perm, batch=1, type_set='train', height=218, width=178, channels=3, verbose=False)[source]

Function that loads the dataset from a directory of images and returns it in full.

Parameters:
  • batch_size – What is the size of the batch.
  • n_train_images – number of training images.
  • n_test_images – number of testing images.
  • n_valid_images – number of validating images.
  • rand_perm – Create a random permutation list of images to be sampled into batches.
  • type_set – Which dataset you need: test, train or valid.
  • height – Height of the image.
  • width – Width of the image.
  • verbose – Similar to verbose everywhere else.
Returns:

data_x

Return type:

list

yann.utils.dataset.load_skdata_caltech101(batch_size, n_train_images, n_test_images, n_valid_images, rand_perm, batch=1, type_set='train', height=256, width=256, verbose=False)[source]

Function that downloads the dataset from skdata and returns the dataset in part

Parameters:
  • batch_size – What is the size of the batch.
  • n_train_images – number of training images.
  • n_test_images – number of testing images.
  • n_valid_images – number of validating images.
  • rand_perm – Create a random permutation list of images to be sampled into batches.
  • type_set – Which dataset you need: test, train or valid.
  • height – Height of the image.
  • width – Width of the image.
  • verbose – Similar to verbose everywhere else.

Todo

This is not a finished function.

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_caltech256(batch_size, n_train_images, n_test_images, n_valid_images, rand_perm, batch=1, type_set='train', height=256, width=256, verbose=False)[source]

Function that downloads the dataset from skdata and returns the dataset in part

Parameters:
  • batch_size – What is the size of the batch.
  • n_train_images – number of training images.
  • n_test_images – number of testing images.
  • n_valid_images – number of validating images.
  • rand_perm – Create a random permutation list of images to be sampled into batches.
  • type_set – Which dataset you need: test, train or valid.
  • height – Height of the image.
  • width – Width of the image.
  • verbose – Similar to verbose everywhere else.

Todo

This is not a finished function.

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_cifar10()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_bg_images()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_bg_rand()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_noise1()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_noise2()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_noise3()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_noise4()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_noise5()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_noise6()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y),(test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_rotated()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.load_skdata_mnist_rotated_bg()[source]

Function that downloads the dataset from skdata and returns the dataset in full

Returns:[(train_x, train_y, train_y),(valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type:list
yann.utils.dataset.pickle_dataset(loc, batch, data)[source]

Function that stores an object down as a pickle file, given its location and batch number

Parameters:
  • loc – Provide the location to save to, as a string.
  • batch – Provide a batch number to save the file as.
  • data – Pass the data that needs to be pickled down. Could also be a tuple.
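
The batch_N.pkl naming used throughout the toolbox can be sketched with the standard library (the helper name pickle_batch and the temporary path are illustrative):

```python
import os
import pickle
import tempfile

def pickle_batch(loc, batch, data):
    """Store `data` as <loc>/batch_<batch>.pkl (a sketch of pickle_dataset)."""
    path = os.path.join(loc, 'batch_' + str(batch) + '.pkl')
    with open(path, 'wb') as f:
        pickle.dump(data, f)
    return path

# Usage: pickle a (data_x, data_y) tuple as batch 0 in a temporary directory.
loc = tempfile.mkdtemp()
path = pickle_batch(loc, 0, ([1.0, 2.0], [0, 1]))
with open(path, 'rb') as f:
    restored = pickle.load(f)
```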
class yann.utils.dataset.setup_dataset(dataset_init_args, save_directory='_datasets', verbose=1, **kwargs)[source]

The setup_dataset class is used to create and assemble datasets that are friendly to the Yann toolbox.

Todo

Add an images option for the source. The skdata PASCAL retriever isn't working; the ImageNet and COCO datasets need to be set up.

Parameters:
  • dataset_init_args

    is a dictionary of the form:

    data_init_args = {
    
        "source" : <where to get the dataset from>
                    'pkl' : A theano tutorial style 'pkl' file.
                    'skdata' : Download and setup from skdata
                    'matlab' : Data is created and is being used from Matlab
                    'images-only' : Data is created from a directory of images. This
                            will be an unsupervised dataset with no labels.
        "name" : necessary only for skdata
                  supports
                    * ``'mnist'``
                    * ``'mnist_noise1'``
                    * ``'mnist_noise2'``
                    * ``'mnist_noise3'``
                    * ``'mnist_noise4'``
                    * ``'mnist_noise5'``
                    * ``'mnist_noise6'``
                    * ``'mnist_bg_images'``
                    * ``'mnist_bg_rand'``
                    * ``'mnist_rotated'``
                    * ``'mnist_rotated_bg'``.
                    * ``'cifar10'``
                    * ``'caltech101'``
                    * ``'caltech256'``
    
            Refer to original paper by Hugo Larochelle [1] for these dataset details.
    
        "location"                  : necessary for 'pkl' and 'matlab' and
                                        'images-only'
        "mini_batch_size"           : 500, # some batch size
        "mini_batches_per_batch"    : (100, 20, 20), # training, testing, validation
        "batches2train"             : 1, # number of files that will be created.
        "batches2test"              : 1,
        "batches2validate"          : 1,
        "height"                    : 28, # After pre-processing
        "width"                     : 28,
        "channels"                  : 1 , # color (3) or grayscale (1) ...
    
            }
    
  • preprocess_init_args

    provide preprocessing arguments. This is a dictionary:

    args =  {
        "normalize" : <bool> True for normalize across batches
        "GCN"       : True for global contrast normalization
        "ZCA"       : True, kind of like a PCA representation (not fully tested)
        "grayscale" : Convert the image to grayscale
            }
    
  • save_directory – <string> a location where the dataset is going to be saved.
[1] Larochelle H, Erhan D, Courville A, Bergstra J, Bengio Y. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, 2007 (pp. 473-480). ACM.
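
As an example, a dataset_init_args for MNIST through skdata might look like the following (the values are illustrative defaults, and the setup_dataset call in the comment is the assumed usage, not executed code):

```python
# Illustrative arguments for building an MNIST dataset through skdata.
dataset_init_args = {
    "source": "skdata",
    "name": "mnist",
    "mini_batch_size": 500,
    "mini_batches_per_batch": (100, 20, 20),  # training, testing, validation
    "batches2train": 1,
    "batches2test": 1,
    "batches2validate": 1,
    "height": 28,
    "width": 28,
    "channels": 1,  # grayscale
}

# With the toolbox installed, this would be passed as (assumed usage):
#   ds = setup_dataset(dataset_init_args=dataset_init_args, verbose=1)
```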

Notes

Yann toolbox takes datasets in a .pkl format. The dataset requires a directory structure such as the following:

location/_dataset_XXXXX
|_ data_params.pkl
|_ train
    |_ batch_0.pkl
    |_ batch_1.pkl
    .
    .
    .
|_ valid
    |_ batch_0.pkl
    |_ batch_1.pkl
    .
    .
    .
|_ test
    |_ batch_0.pkl
    |_ batch_1.pkl
    .
    .
    .

The location id (XXXXX) is generated by this class. The five digits produced are the unique id of the dataset.

The file data_params.pkl contains one variable dataset_args used by datastream.
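
The directory tree above can be sketched with the standard library (a fixed toy id stands in for the generated five-digit one):

```python
import os
import tempfile

root = tempfile.mkdtemp()
dataset_dir = os.path.join(root, '_dataset_12345')  # toy id; normally generated

# Create the train/valid/test subdirectories the toolbox expects.
for split in ('train', 'valid', 'test'):
    os.makedirs(os.path.join(dataset_dir, split))

created = sorted(os.listdir(dataset_dir))
```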

dataset_location()[source]

Use this function to return the location of the dataset.

yann.utils.dataset.shuffle(data, verbose=1)[source]

Method that shuffles a dataset's x and y together

Parameters:data – Either a tuple or an ndarray. If a tuple, x and y are assumed.
Returns:Shuffled version of the same.
Return type:data

Notes

Only tuples work at the moment.
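
For the tuple case, the behaviour can be sketched with numpy (the helper name shuffle_xy is illustrative):

```python
import numpy as np

def shuffle_xy(data, seed=None):
    """Shuffle x and y with one shared permutation (a sketch of shuffle)."""
    x, y = data
    rng = np.random.RandomState(seed)
    perm = rng.permutation(len(x))
    return x[perm], y[perm]

# Three two-feature samples and their labels; rows stay paired after shuffling.
x = np.arange(6).reshape(3, 2)
y = np.array([0, 1, 2])
sx, sy = shuffle_xy((x, y), seed=42)
```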