How to Prepare a Data Set
You can provide your own data set for training or prediction with SmallTrain.
Overview
In the operation file, you can check and edit the settings for the data set:
{
...
"data_dir_path": "/var/data/cifar-10-image/",
"data_set_def_path": "/var/data/cifar-10-image/data_set_def/train_cifar10_classification.csv",
"cache_data_set_id": "train_cifar10_classification",
...
}
All you have to do is:
- Put the data files in the data directory, the local directory given by data_dir_path.
- Create a data set definition in CSV format at the local path given by data_set_def_path.
- Set cache_data_set_id to identify your data set.
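Before starting a run, it can help to confirm that the three settings point at things that exist. The following is a minimal sketch, assuming the operation file is plain JSON with these keys at the top level; the function name check_data_set_settings is hypothetical and not part of SmallTrain.

```python
import json
from pathlib import Path


def check_data_set_settings(operation_file_path):
    """Return a list of problems found in the data set settings (empty if none).

    Assumption: the operation file is plain JSON and the three keys
    are top-level entries, as in the excerpt above.
    """
    with open(operation_file_path, encoding="utf-8") as f:
        settings = json.load(f)

    problems = []

    # All three settings must be present and non-empty.
    for key in ("data_dir_path", "data_set_def_path", "cache_data_set_id"):
        if not settings.get(key):
            problems.append(f"missing setting: {key}")

    # The data directory must exist locally.
    data_dir = settings.get("data_dir_path")
    if data_dir and not Path(data_dir).is_dir():
        problems.append(f"data directory not found: {data_dir}")

    # The data set definition file must exist locally.
    def_path = settings.get("data_set_def_path")
    if def_path and not Path(def_path).is_file():
        problems.append(f"data set definition file not found: {def_path}")

    return problems
```

Calling check_data_set_settings with the path to your operation file returns an empty list when all three steps above have been completed.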
Data Set Structure
As an example, suppose you are accessing the server after running Getting Started (CIFAR-10 image classification). Your local directory /var/data/cifar-10-image/ has the following structure:
/var/data/cifar-10-image/
├── data_batch_1 // One of the training data directories
├── ...
├── data_batch_5 // One of the training data directories
├── data_set_def // Data set definition directory
│   └── train_cifar10_classification.csv // Data set definition file for training and prediction
└── test_batch // The testing data directory
The data set definition file (in this case /var/data/cifar-10-image/data_set_def/train_cifar10_classification.csv) is:
data_set_id,label,sub_label,test,group
/var/data/cifar-10-image/data_batch_1/data_batch_1_i0_c6.png,6,6,0,TRAIN
/var/data/cifar-10-image/data_batch_1/data_batch_1_i1_c9.png,9,9,0,TRAIN
/var/data/cifar-10-image/data_batch_1/data_batch_1_i2_c9.png,9,9,0,TRAIN
...
/var/data/cifar-10-image/data_batch_5/data_batch_5_i9999_c1.png,1,1,0,TRAIN
/var/data/cifar-10-image/test_batch/test_batch_i0_c3.png,3,3,1,TRAIN
...
/var/data/cifar-10-image/test_batch/test_batch_i9997_c5.png,5,5,1,TRAIN
/var/data/cifar-10-image/test_batch/test_batch_i9998_c1.png,1,1,1,TRAIN
/var/data/cifar-10-image/test_batch/test_batch_i9999_c7.png,7,7,1,TRAIN
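Writing thousands of such rows by hand is impractical, so a definition file like this is usually generated. The following is a minimal sketch, assuming the naming seen in the example: each image name ends in _c<label>.png and images under a directory named test_batch are testing data. The function names, the test_dir_name parameter, and the rule of copying label into sub_label are assumptions for illustration, not SmallTrain behavior.

```python
import csv
import re
from pathlib import Path

# Assumed naming convention: "..._c<label>.png", e.g. data_batch_1_i0_c6.png
LABEL_PATTERN = re.compile(r"_c(\d+)\.png$")


def build_definition_rows(data_dir, test_dir_name="test_batch", group="TRAIN"):
    """Collect one definition row per image found one level below data_dir."""
    rows = []
    for image_path in sorted(Path(data_dir).glob("*/*.png")):
        match = LABEL_PATTERN.search(image_path.name)
        if match is None:
            continue  # skip files that do not follow the assumed naming
        label = int(match.group(1))
        # Files in the testing directory get test=1, all others test=0.
        test_flag = 1 if image_path.parent.name == test_dir_name else 0
        # sub_label mirrors label here, as in the example rows above.
        rows.append([str(image_path), label, label, test_flag, group])
    return rows


def write_definition_file(rows, def_path):
    """Write the rows as CSV with the header used by the definition file."""
    def_path = Path(def_path)
    def_path.parent.mkdir(parents=True, exist_ok=True)
    with open(def_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["data_set_id", "label", "sub_label", "test", "group"])
        writer.writerows(rows)
```

If your files do not encode the class in their names, replace the label lookup with your own source of labels; only the five output columns matter.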
If you want to add a new data file /var/data/cifar-10-image/data_batch_6/data_batch_6_i10000_c9.png as training data labeled with class 9:
- Put the new file at the path:
/var/data/cifar-10-image/data_batch_6/data_batch_6_i10000_c9.png
- Add the following row to the data set definition file:
/var/data/cifar-10-image/data_batch_6/data_batch_6_i10000_c9.png,9,9,0,TRAIN
As another example, if you want to add a new data file /var/data/cifar-10-image/test_batch/test_batch_i10000_c0.png as testing data labeled with class 0:
- Put the new file at the path:
/var/data/cifar-10-image/test_batch/test_batch_i10000_c0.png
- Add the following row to the data set definition file:
/var/data/cifar-10-image/test_batch/test_batch_i10000_c0.png,0,0,1,TRAIN
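The two steps for adding a file can also be scripted. The following is a minimal sketch, assuming the data file has already been put in place; append_data_row is a hypothetical helper, not a SmallTrain function. It refuses to add a path that is already listed, since data_set_id is meant to be unique.

```python
import csv
from pathlib import Path


def append_data_row(def_path, data_file_path, label, is_test,
                    group="TRAIN", sub_label=None):
    """Append one row for data_file_path to the data set definition file."""
    data_file_path = str(data_file_path)
    if not Path(data_file_path).is_file():
        raise FileNotFoundError(data_file_path)

    text = Path(def_path).read_text(encoding="utf-8")
    existing_ids = {row["data_set_id"] for row in csv.DictReader(text.splitlines())}
    if data_file_path in existing_ids:
        raise ValueError(f"already defined: {data_file_path}")

    if sub_label is None:
        sub_label = label  # same as label, as in the example rows

    with open(def_path, "a", newline="", encoding="utf-8") as f:
        # Make sure the new row starts on its own line.
        if text and not text.endswith("\n"):
            f.write("\n")
        csv.writer(f, lineterminator="\n").writerow(
            [data_file_path, label, sub_label, 1 if is_test else 0, group]
        )
```

For the testing example above, this would be called with label=0 and is_test=True; for the training example, with label=9 and is_test=False.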
Data Set Specifications
- operation file
  - data_dir_path: String, the path of the directory that contains the data files.
  - data_set_def_path: String, the path of the data set definition file.
  - cache_data_set_id: String, the identifier of the data set.
  - target_group: String, the identifier of the group to use as the data set (see also group in the data set definition).
- data set definition file
  - format: csv
  - columns:
    - data_set_id: String, the file path of the data file. It also serves as the unique id of the data file.
    - label: Integer, the label that represents the class of the data.
    - sub_label: Integer, the sub label, used when you want to label data with a combination of label and sub_label.
    - test: Integer, the flag that indicates whether the data is used as testing data. If 1, the data is used as testing data.
    - group: String, the group identifier. To exclude data, set group to a value not equal to target_group in the operation file.
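To preview which rows a given target_group would select, the columns can be read as described above. The following is a minimal sketch of that reading, not SmallTrain's actual loader: it keeps rows whose group equals target_group and splits them by the test flag. The function name select_rows is hypothetical.

```python
import csv


def select_rows(def_path, target_group):
    """Return (training_rows, testing_rows) for the rows in target_group."""
    training_rows, testing_rows = [], []
    with open(def_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Rows in any other group are excluded from the data set.
            if row["group"] != target_group:
                continue
            record = {
                "data_set_id": row["data_set_id"],
                "label": int(row["label"]),
                "sub_label": int(row["sub_label"]),
            }
            # test == 1 marks testing data; anything else is training data.
            if int(row["test"]) == 1:
                testing_rows.append(record)
            else:
                training_rows.append(record)
    return training_rows, testing_rows
```

Comparing the lengths of the two returned lists against the number of files you expect is a quick way to catch a mistyped group value.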