Getting Started
How to run SmallTrain on a Linux server using Docker
As an example, this guide shows how to install and set up SmallTrain on an NVIDIA DGX Station, working from a macOS local machine. It includes a training demo that uses CIFAR-10 data.
Adjust the settings to suit your own environment, such as your Linux server.
Environment example: Linux server (NVIDIA DGX Station on Ubuntu 18.04), local machine (macOS)
(NVIDIA Docker is assumed to be already installed.)
Check docker-compose
on the host
$ docker-compose -v
docker-compose version 1.22.0, build f46880fe
Install docker-compose if it is not installed
on the host, as a user with sudo privileges
$ sudo curl -L "https://github.com/docker/compose/releases/download/1.22.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose
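The check above can also be scripted. The following is a minimal sketch, not part of SmallTrain: `have_cmd` is a hypothetical helper name, and it relies only on the standard `command -v` shell builtin.

```shell
# Hypothetical helper: succeed if the given command is available on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Print the version when docker-compose is present, otherwise a reminder.
if have_cmd docker-compose; then
    docker-compose -v
else
    echo "docker-compose not found; install it as shown above"
fi
```

The installation itself is left as a manual step here because it needs sudo privileges.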
Clone SmallTrain repository
on the host
$ mkdir -p ~/github/geek-guild/
$ cd  ~/github/geek-guild/
$ git clone https://github.com/geek-guild/smalltrain.git
Clone GGUtils repository
on the host
$ mkdir -p ~/github/geek-guild/
$ cd  ~/github/geek-guild/
# Authenticate with your GitHub account for the github.com/geek-guild repository
$ git clone https://github.com/geek-guild/ggutils.git
Start Docker if it is not running. This requires sudo privileges.
on the host, as a user with sudo privileges
$ sudo service docker start
Create a Docker bridge network for SmallTrain
Containers connected to the same Docker bridge network can communicate with each other.
- Bridge network name: smalltrain_network
- Subnet: 172.28.0.0/24
- Gateway: 172.28.0.1
on the host
$ docker network create -d bridge smalltrain_network --gateway=172.28.0.1 --subnet=172.28.0.0/24
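Running `docker network create` a second time fails because the network name is already taken. A minimal sketch of an idempotent variant follows; `ensure_smalltrain_network` is a hypothetical helper name, and it assumes that `docker network inspect` returns a non-zero status when the network does not exist.

```shell
# Hypothetical helper: create the bridge network only if it does not exist yet.
ensure_smalltrain_network() {
    if docker network inspect smalltrain_network >/dev/null 2>&1; then
        echo "smalltrain_network already exists"
    else
        # Same name, gateway and subnet as in the step above.
        docker network create -d bridge smalltrain_network \
            --gateway=172.28.0.1 --subnet=172.28.0.0/24
    fi
}
```

Calling `ensure_smalltrain_network` on the host can then be repeated safely, for example from a setup script.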
Build and run the Docker image
Run docker-compose to build the Docker image and start the containers. (This step sets up SmallTrain on Docker.)
on the host
# SmallTrain
$ cd ~/github/geek-guild/smalltrain/docker/
$ docker-compose up -d
Building smalltrain
Step 1/18 : FROM nvcr.io/nvidia/tensorflow:19.10-py3
...
Creating smalltrain ... done
Check that the new SmallTrain container is running and note its CONTAINER ID
on the host
$ docker ps -a
CONTAINER ID        IMAGE                     COMMAND                  CREATED             STATUS              PORTS                                              NAMES
YYYYYYYYYYYY        docker_smalltrain-redis   "docker-entrypoint.s…"   15 minutes ago      Up 15 minutes       0.0.0.0:6379->6379/tcp, 0.0.0.0:16379->16379/tcp   smalltrain-redis
XXXXXXXXXXXX        docker_smalltrain         "/usr/local/bin/entr…"   15 minutes ago      Up 15 minutes       0.0.0.0:6006->6006/tcp                             smalltrain
Check the log of the running SmallTrain container
on the host
$ CONTAINER_ID=XXXXXXXXXXXX
$ docker logs $CONTAINER_ID
...
Exec operation id: IR_2D_CNN_V2_l49-c64_20200109-TRAIN
nohup: appending output to 'nohup.out'
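Instead of copying the CONTAINER ID by hand, it can be looked up by container name. This is a sketch under assumptions: `get_container_id` is a hypothetical helper, and the anchored pattern `^smalltrain$` is used so that `smalltrain-redis` is not matched as well. Depending on the Docker version, the anchored name may need a leading slash (`^/smalltrain$`).

```shell
# Hypothetical helper: print the ID of the running container whose name
# matches exactly the given name (the name filter is a regular expression).
get_container_id() {
    docker ps -q --filter "name=^$1\$"
}
```

Usage on the host: `CONTAINER_ID=$(get_container_id smalltrain)` followed by `docker logs $CONTAINER_ID`.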
Check GPU usage
on the host
$ watch -n 1 nvidia-smi
Every 1.0s: nvidia-smi                                                                                                                                                                       gg-sta-20200116-volta: Tue Jan 21 10:42:27 2020
Tue Jan 21 10:42:27 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-DGXS...  On   | 00000000:07:00.0  On |                    0 |
| N/A   32C    P0    37W / 300W |    316MiB / 32475MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-DGXS...  On   | 00000000:08:00.0 Off |                    0 |
| N/A   33C    P0    35W / 300W |      1MiB / 32478MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-DGXS...  On   | 00000000:0E:00.0 Off |                    0 |
| N/A   34C    P0   104W / 300W |   2893MiB / 32478MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-DGXS...  On   | 00000000:0F:00.0 Off |                    0 |
| N/A   32C    P0    36W / 300W |      1MiB / 32478MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      6903      G   /usr/lib/xorg/Xorg                            40MiB |
|    0      7058      G   /usr/bin/gnome-shell                         148MiB |
|    0     40041      G   /usr/lib/xorg/Xorg                            39MiB |
|    0     40082      G   /usr/bin/gnome-shell                          86MiB |
|    2      1441      C   python                                      2879MiB |
+-----------------------------------------------------------------------------+
- Check that the GPU selected by the environment variable (e.g. NVIDIA_VISIBLE_DEVICES=2) is in use.
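For a more compact view of a single GPU, `nvidia-smi` can print selected fields as CSV. The sketch below is an illustration only: `gpu_status` and the `NVIDIA_SMI` override variable are hypothetical names, and the default of GPU 2 simply mirrors the NVIDIA_VISIBLE_DEVICES=2 example.

```shell
# NVIDIA_SMI is a hypothetical override so that a different binary path can be used.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

# Hypothetical helper: print index, GPU utilization and used memory
# for one GPU (default: GPU 2) as a single CSV line.
gpu_status() {
    "$NVIDIA_SMI" --query-gpu=index,utilization.gpu,memory.used \
        --format=csv,noheader --id="${1:-2}"
}
```

Usage on the host: `gpu_status 2`, or `watch -n 1 nvidia-smi --id=2` to keep monitoring only that device.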
Log in to the SmallTrain container
on the host
$ docker exec -it $CONTAINER_ID /bin/bash
Check the log of the tutorial operation
on the container
# check log
$ less /var/smalltrain/logs/IR_2D_CNN_V2_l49-c64_20200109-TRAIN.log
2020-01-20 14:34:45.125276: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
...
========================================
step 49, training loss 0.103139
========================================
test cross entropy 0.44906
save model to save_file_path:/var/model/image_recognition/tutorials/tensorflow/model/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/model-nn_lr-0.001_bs-128.ckpt
DONE train data
====================
Run TensorBoard
on the container
$ nohup tensorboard --logdir /var/model/image_recognition/tutorials/tensorflow/logs/ &
Check the result of the tutorial operation
on the container
# Report directory
$ ls -l /var/model/image_recognition/tutorials/tensorflow/report/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/
total 424
-rw-r--r-- 1 root root  38074 Jan 20 15:12 all_variables_names.csv
-rw-r--r-- 1 root root  77687 Jan 20 15:13 prediction_e49_all.csv
-rw-r--r-- 1 root root  77687 Jan 20 15:13 prediction_e9_all.csv
-rw-r--r-- 1 root root     28 Jan 20 15:12 summary_layers_9.json
-rw-r--r-- 1 root root 109286 Jan 20 15:12 test_plot__.png
-rw-r--r-- 1 root root  55458 Jan 20 15:13 test_plot_e49_all.png
-rw-r--r-- 1 root root  54259 Jan 20 15:13 test_plot_e9_all.png
-rw-r--r-- 1 root root   6406 Jan 20 15:12 trainable_variables_names.csv
# Prediction after 49 steps of training
$ less /var/model/image_recognition/tutorials/tensorflow/report/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/prediction_e49_all.csv
DateTime,Estimated,MaskedEstimated,True
/var/data/cifar-10-image/test_batch/test_batch_i9_c1.png_0,1,0.0,1
/var/data/cifar-10-image/test_batch/test_batch_i90_c0.png_0,0,0.0,0
/var/data/cifar-10-image/test_batch/test_batch_i91_c3.png_0,6,0.0,3
/var/data/cifar-10-image/test_batch/test_batch_i92_c8.png_0,8,0.0,8
How to read the result:
Each line has four comma-separated fields: “DateTime”, “Estimated”, “MaskedEstimated”, and “True”. In this tutorial the “DateTime” field holds the test image path, “Estimated” is the predicted label, and “True” is the true label.
- /var/data/cifar-10-image/test_batch/test_batch_i91_c3.png is estimated as 6 but the true label is 3 (incorrect).
- /var/data/cifar-10-image/test_batch/test_batch_i92_c8.png is estimated as 8 and the true label is also 8 (correct).
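Comparing the predicted label with the true label line by line can be automated with awk. The following is a minimal sketch, assuming the four-column layout shown above with a single header row; `prediction_accuracy` is a hypothetical helper, not a SmallTrain command.

```shell
# Hypothetical helper: count the rows where Estimated (column 2)
# equals True (column 4), skipping the header row.
prediction_accuracy() {
    awk -F, 'NR > 1 { total++; if ($2 == $4) correct++ }
             END { if (total > 0) printf "%d/%d correct\n", correct, total }' "$1"
}
```

Usage on the container: `prediction_accuracy /var/model/image_recognition/tutorials/tensorflow/report/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/prediction_e49_all.csv`. For the four sample rows shown above, it prints `3/4 correct`.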
Done!