dataset
- provides a port to benchmark and MATLAB-based datasets¶
The file yann.utils.dataset.py
contains the definitions for the dataset ports. It contains
support for various benchmark datasets
through skdata. There is also support for datasets that can be imported from MATLAB.
Todo
- None of the PASCAL dataset retrievers from skdata are working. These need to be coded in.
- Need a method to create a dataset from a directory of images.
- Prepare for ImageNet and COCO.
- See if support can be added for fuel.
This function creates a shared theano memory to be used for dataset purposes.
Parameters: - data_xy – [data_x, data_y] that will be assigned to shared_x and shared_y on output.
- borrow – default value is True. This is a theano shared memory type variable.
- verbose – Similar to verbose everywhere else.
- svm – default is False. If True, we also return a shared_svm_y for a max-margin type last layer.
Returns: shared_x, shared_y if svm is False. If not, shared_x, shared_y, shared_svm_y
Return type: theano.shared
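Setting theano aside, the memory layout this helper produces can be sketched in plain numpy. The function name to_shared_layout and the ±1 one-hot encoding for shared_svm_y are illustrative assumptions, not yann's exact implementation:

```python
# Conceptual sketch, in numpy only, of the layout described above: both x
# and y are stored as float32 so they can live in GPU shared memory (labels
# are cast back to int downstream), and an optional +/-1 one-hot
# shared_svm_y is built for a max-margin last layer. The encoding here is
# an assumption, not yann's exact code.
import numpy

def to_shared_layout(data_xy, svm=False):
    data_x, data_y = data_xy
    shared_x = numpy.asarray(data_x, dtype="float32")
    shared_y = numpy.asarray(data_y, dtype="float32")  # cast back to int when used
    if not svm:
        return shared_x, shared_y
    labels = numpy.asarray(data_y, dtype="int64")
    n_classes = int(labels.max()) + 1
    shared_svm_y = -numpy.ones((len(labels), n_classes), dtype="float32")
    shared_svm_y[numpy.arange(len(labels)), labels] = 1.0  # +1 on the true class
    return shared_x, shared_y, shared_svm_y
```

In the real helper these arrays would be wrapped in theano.shared(..., borrow=borrow) before being returned.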
-
yann.utils.dataset.load_cifar100()[source]¶
Function that downloads the CIFAR-100 dataset and returns the dataset in full.
TODO: Need to implement this.
-
yann.utils.dataset.load_data_mat(height, width, channels, location, batch=0, type_set='train', load_z=False)[source]¶
Use this code if the data was created in MATLAB in the right format and needs to be loaded. The way to create it is to have variables x, y, z, with z being optional data to load. x is assumed to be the data in matrix double format, with rows being each image in vectorized fashion, and y is assumed to be labels in int or double format. The files are stored in the following format: loc/type/batch_0.mat. This code needs scipy to run.
Parameters: - height – The height of each image in the dataset.
- width – The width of each image in the dataset.
- channels – 3 if RGB, 1 if grayscale, and so on.
- location – Location of the dataset.
- batch – if multi-batch, then how many batches of data are present; if not, use 1
Returns: Tuple (data_x, data_y); if requested, also (data_x, data_y, data_z)
Return type: float32 tuple
Todo
Need to add preprocessing in this.
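A minimal sketch of reading one such file with scipy, assuming the loc/type/batch_0.mat layout and the x, y, z variable convention described above. read_mat_batch is a hypothetical name; yann's actual loader may differ in details:

```python
# Hypothetical reader for the MATLAB batch layout described above:
# loc/type/batch_<n>.mat containing variables x (rows = vectorized images),
# y (labels), and optionally z.
import os
import numpy
import scipy.io

def read_mat_batch(location, type_set="train", batch=0, load_z=False):
    path = os.path.join(location, type_set, "batch_%d.mat" % batch)
    mat = scipy.io.loadmat(path)
    data_x = numpy.asarray(mat["x"], dtype="float32")          # image matrix
    data_y = numpy.asarray(mat["y"]).ravel().astype("int32")   # label vector
    if load_z:
        return (data_x, data_y, numpy.asarray(mat["z"], dtype="float32"))
    return (data_x, data_y)
```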
-
yann.utils.dataset.load_images_only(batch_size, location, n_train_images, n_test_images, n_valid_images, rand_perm, batch=1, type_set='train', height=218, width=178, channels=3, verbose=False)[source]¶
Function that downloads the dataset and returns the dataset in full.
Parameters: - batch_size – What is the size of the batch.
- n_train_images – number of training images.
- n_test_images – number of testing images.
- n_valid_images – number of validating images.
- rand_perm – Create a random permutation list of images to be sampled into batches.
- type_set – Which dataset you need: test, train or valid.
- height – Height of the image.
- width – Width of the image.
- verbose – similar to dataset.
Returns: data_x
Return type: list
-
yann.utils.dataset.load_skdata_caltech101(batch_size, n_train_images, n_test_images, n_valid_images, rand_perm, batch=1, type_set='train', height=256, width=256, verbose=False)[source]¶
Function that downloads the dataset from skdata and returns the dataset in part.
Parameters: - batch_size – What is the size of the batch.
- n_train_images – number of training images.
- n_test_images – number of testing images.
- n_valid_images – number of validating images.
- rand_perm – Create a random permutation list of images to be sampled into batches.
- type_set – Which dataset you need: test, train or valid.
- height – Height of the image.
- width – Width of the image.
- verbose – similar to dataset.
Todo
This is not a finished function.
Returns: [(train_x, train_y, train_y), (valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type: list
-
yann.utils.dataset.load_skdata_caltech256(batch_size, n_train_images, n_test_images, n_valid_images, rand_perm, batch=1, type_set='train', height=256, width=256, verbose=False)[source]¶
Function that downloads the dataset from skdata and returns the dataset in part.
Parameters: - batch_size – What is the size of the batch.
- n_train_images – number of training images.
- n_test_images – number of testing images.
- n_valid_images – number of validating images.
- rand_perm – Create a random permutation list of images to be sampled into batches.
- type_set – Which dataset you need: test, train or valid.
- height – Height of the image.
- width – Width of the image.
- verbose – similar to dataset.
Todo
This is not a finished function.
Returns: [(train_x, train_y, train_y), (valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type: list
-
yann.utils.dataset.load_skdata_cifar10()[source]¶
Function that downloads the dataset from skdata and returns the dataset in full.
Returns: [(train_x, train_y, train_y), (valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type: list
-
yann.utils.dataset.load_skdata_mnist()[source]¶
Function that downloads the dataset from skdata and returns the dataset in full.
Returns: [(train_x, train_y, train_y), (valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type: list
-
yann.utils.dataset.load_skdata_mnist_bg_images()[source]¶
Function that downloads the dataset from skdata and returns the dataset in full.
Returns: [(train_x, train_y, train_y), (valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type: list
-
yann.utils.dataset.load_skdata_mnist_bg_rand()[source]¶
Function that downloads the dataset from skdata and returns the dataset in full.
Returns: [(train_x, train_y, train_y), (valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type: list
-
yann.utils.dataset.load_skdata_mnist_noise1()[source]¶
Function that downloads the dataset from skdata and returns the dataset in full.
Returns: [(train_x, train_y, train_y), (valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type: list
-
yann.utils.dataset.load_skdata_mnist_noise2()[source]¶
Function that downloads the dataset from skdata and returns the dataset in full.
Returns: [(train_x, train_y, train_y), (valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type: list
-
yann.utils.dataset.load_skdata_mnist_noise3()[source]¶
Function that downloads the dataset from skdata and returns the dataset in full.
Returns: [(train_x, train_y, train_y), (valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type: list
-
yann.utils.dataset.load_skdata_mnist_noise4()[source]¶
Function that downloads the dataset from skdata and returns the dataset in full.
Returns: [(train_x, train_y, train_y), (valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type: list
-
yann.utils.dataset.load_skdata_mnist_noise5()[source]¶
Function that downloads the dataset from skdata and returns the dataset in full.
Returns: [(train_x, train_y, train_y), (valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type: list
-
yann.utils.dataset.load_skdata_mnist_noise6()[source]¶
Function that downloads the dataset from skdata and returns the dataset in full.
Returns: [(train_x, train_y, train_y), (valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type: list
-
yann.utils.dataset.load_skdata_mnist_rotated()[source]¶
Function that downloads the dataset from skdata and returns the dataset in full.
Returns: [(train_x, train_y, train_y), (valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type: list
-
yann.utils.dataset.load_skdata_mnist_rotated_bg()[source]¶
Function that downloads the dataset from skdata and returns the dataset in full.
Returns: [(train_x, train_y, train_y), (valid_x, valid_y, valid_y), (test_x, test_y, test_y)]
Return type: list
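All of the load_skdata_* functions above document the same triple-of-triples return value. A small illustration of consuming it, with placeholder arrays standing in for the real downloaded data:

```python
# Illustration of unpacking the [(train), (valid), (test)] list that the
# load_skdata_* functions are documented to return. Each split is an
# (x, y, y) triple per the docstrings; the arrays here are placeholders.
import numpy

def fake_split(n):
    x = numpy.zeros((n, 784), dtype="float32")  # flattened 28x28 images
    y = numpy.zeros(n, dtype="int32")           # labels
    return (x, y, y)

dataset = [fake_split(50000), fake_split(10000), fake_split(10000)]
(train_x, train_y, _), (valid_x, valid_y, _), (test_x, test_y, _) = dataset
```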
-
yann.utils.dataset.pickle_dataset(loc, batch, data)[source]¶
Function that stores down an object as a pickle file given its filename and obj.
Parameters: - loc – Provide the location to save, as a string
- batch – provide a batch number to save the file as
- data – Pass the data that needs to be pickled down. Could also be a tuple
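A sketch of a pickling helper in this spirit; save_batch is a hypothetical name, and the batch_<n>.pkl file naming follows the directory layout shown in the notes for setup_dataset, which is an assumption here:

```python
# Hypothetical helper that pickles one batch of data into loc/batch_<n>.pkl.
import os
import pickle

def save_batch(loc, batch, data):
    path = os.path.join(loc, "batch_" + str(batch) + ".pkl")
    with open(path, "wb") as f:
        pickle.dump(data, f, protocol=2)  # protocol 2 for py2/py3 compatibility
    return path
```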
-
class yann.utils.dataset.setup_dataset(dataset_init_args, save_directory='_datasets', verbose=1, **kwargs)[source]¶
The setup_dataset class is used to create and assemble datasets that are friendly to the Yann toolbox.
Todo
- images option for the source.
- skdata pascal isn't working.
- The imagenet and coco datasets need to be set up.
Parameters: - dataset_init_args –
is a dictionary of the form:
data_init_args = {
    "source" : <where to get the dataset from>
        'pkl' : A theano tutorial style 'pkl' file.
        'skdata' : Download and setup from skdata
        'matlab' : Data is created and is being used from Matlab
        'images-only' : Data is created from a directory of images. This will be
                        an unsupervised dataset with no labels.
    "name" : necessary only for skdata; supports
        'mnist', 'mnist_noise1', 'mnist_noise2', 'mnist_noise3', 'mnist_noise4',
        'mnist_noise5', 'mnist_noise6', 'mnist_bg_images', 'mnist_bg_rand',
        'mnist_rotated', 'mnist_rotated_bg', 'cifar10', 'caltech101', 'caltech256'
        Refer to the original paper by Hugo Larochelle [1] for these dataset details.
    "location" : necessary for 'pkl', 'matlab' and 'images-only'
    "mini_batch_size" : 500,                  # some batch size
    "mini_batches_per_batch" : (100, 20, 20), # training, testing, validation
    "batches2train" : 1,                      # number of files that will be created.
    "batches2test" : 1,
    "batches2validate" : 1,
    "height" : 28,                            # After pre-processing
    "width" : 28,
    "channels" : 1,                           # color (3) or grayscale (1)
    ...
}
- preprocess_init_args –
provide preprocessing arguments. This is a dictionary:
args = {
    "normalize" : <bool> True to normalize across batches
    "GCN" : True for global contrast normalization
    "ZCA" : True, kind of like a PCA representation (not fully tested)
    "grayscale" : Convert the image to grayscale
}
- save_directory – <string> a location where the dataset is going to be saved.
[1] Larochelle H, Erhan D, Courville A, Bergstra J, Bengio Y. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, 2007 (pp. 473-480). ACM.
Notes
Yann toolbox takes datasets in a .pkl format. The dataset requires a directory structure such as the following:
location/_dataset_XXXXX
    |_ data_params.pkl
    |_ train
        |_ batch_0.pkl
        |_ batch_1.pkl
        .
        .
    |_ valid
        |_ batch_0.pkl
        |_ batch_1.pkl
        .
        .
    |_ test
        |_ batch_0.pkl
        |_ batch_1.pkl
        .
        .
The location id (XXXXX) is generated by this class file. The five digits produced are the unique id of the dataset. The file data_params.pkl contains one variable, dataset_args, used by datastream.
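A hedged usage sketch: building the init-args dictionary described above and handing it to setup_dataset. The key names and values follow this documentation; the actual call is left commented out because it would download and write data:

```python
# Assembling the dataset_init_args dictionary per the documentation above.
dataset_init_args = {
    "source": "skdata",
    "name": "mnist",
    "mini_batch_size": 500,
    "mini_batches_per_batch": (100, 20, 20),  # training, testing, validation
    "batches2train": 1,
    "batches2test": 1,
    "batches2validate": 1,
    "height": 28,
    "width": 28,
    "channels": 1,
}
# from yann.utils.dataset import setup_dataset
# data = setup_dataset(dataset_init_args=dataset_init_args,
#                      save_directory="_datasets", verbose=1)
```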
used by datastream.- dataset_init_args –