Getting Started
How to run SmallTrain on a Linux server using Docker
SmallTrain trains on small data; this guide runs it in Docker on a Linux server.
As an example, this guide shows how to install and set up SmallTrain on an NVIDIA DGX Station, operated from a local macOS machine, and how to run a training demo using CIFAR-10 data.
Adjust the settings to suit your own environment, such as your Linux server.
Environment example: Linux server (NVIDIA DGX Station on Ubuntu 18.04), local machine (macOS)
(NVIDIA Docker is already installed)
Check docker-compose
on the host
$ docker-compose -v
docker-compose version 1.22.0, build f46880fe
Install docker-compose if it is not installed
on the host
as a host sudoer
$ sudo curl -L "https://github.com/docker/compose/releases/download/1.22.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
$ sudo chmod +x /usr/local/bin/docker-compose
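The check and the install steps above can be combined so that the download only runs when docker-compose is missing. The following is a minimal sketch, not part of SmallTrain: the helper names has_command and install_compose_if_missing are hypothetical, and version 1.22.0 is simply the one used in this guide.

```shell
# has_command is a hypothetical helper: succeeds if the given command is on PATH.
has_command() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical wrapper: print the version if docker-compose is present,
# otherwise download the 1.22.0 binary exactly as in the steps above.
install_compose_if_missing() {
    if has_command docker-compose; then
        docker-compose -v
    else
        sudo curl -L "https://github.com/docker/compose/releases/download/1.22.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
        sudo chmod +x /usr/local/bin/docker-compose
    fi
}
```

After defining the functions, run install_compose_if_missing on the host as a sudoer.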
Clone SmallTrain repository
on the host
$ mkdir -p ~/github/geek-guild/
$ cd ~/github/geek-guild/
$ git clone https://github.com/geek-guild/smalltrain.git
Clone GGUtils repository
on the host
$ mkdir -p ~/github/geek-guild/
$ cd ~/github/geek-guild/
# Authenticate with your GitHub account for the github.com/geek-guild repository
$ git clone https://github.com/geek-guild/ggutils.git
Start Docker if it is not running. This requires sudo privileges.
on the host
as a host sudoer
$ sudo service docker start
Create a Docker bridge network for SmallTrain
Containers attached to the same Docker bridge network can communicate with each other.
- Bridge network name: smalltrain_network
- Subnet: 172.28.0.0/24
- Gateway: 172.28.0.1
on the host
$ docker network create -d bridge smalltrain_network --gateway=172.28.0.1 --subnet=172.28.0.0/24
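Running the create command a second time fails because the network already exists. A minimal sketch of a re-runnable variant follows; the helper name ensure_network is hypothetical, and it relies only on docker network inspect returning a non-zero status when the network is not found.

```shell
# ensure_network is a hypothetical helper: create the bridge network only if
# "docker network inspect" cannot find a network with that name.
ensure_network() {
    name="$1"
    subnet="$2"
    gateway="$3"
    if docker network inspect "$name" >/dev/null 2>&1; then
        echo "network $name already exists"
    else
        docker network create -d bridge "$name" --gateway="$gateway" --subnet="$subnet"
    fi
}
```

With the values used in this guide, the call would be: ensure_network smalltrain_network 172.28.0.0/24 172.28.0.1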
Build and run the Docker image
Run docker-compose to build the Docker image and start the containers. (This step sets up SmallTrain on Docker.)
on the host
# SmallTrain
$ cd ~/github/geek-guild/smalltrain/docker/
$ docker-compose up -d
Building smalltrain
Step 1/18 : FROM nvcr.io/nvidia/tensorflow:19.10-py3
...
Creating smalltrain ... done
Check that the new SmallTrain container is running and note its CONTAINER ID
on the host
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
YYYYYYYYYYYY docker_smalltrain-redis "docker-entrypoint.s…" 15 minutes ago Up 15 minutes 0.0.0.0:6379->6379/tcp, 0.0.0.0:16379->16379/tcp smalltrain-redis
XXXXXXXXXXXX docker_smalltrain "/usr/local/bin/entr…" 15 minutes ago Up 15 minutes 0.0.0.0:6006->6006/tcp smalltrain
Check the log of the running SmallTrain container
on the host
$ CONTAINER_ID=XXXXXXXXXXXX
$ docker logs $CONTAINER_ID
...
Exec operation id: IR_2D_CNN_V2_l49-c64_20200109-TRAIN
nohup: appending output to 'nohup.out'
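Instead of copying the CONTAINER ID by hand, it can be looked up from the container name. This is a minimal sketch; the helper name container_id_by_name is hypothetical, and it assumes the container is named smalltrain as in the docker ps output shown in this guide.

```shell
# container_id_by_name is a hypothetical helper: print the ID of the running
# container whose name matches exactly, so that "smalltrain" does not also
# match "smalltrain-redis".
container_id_by_name() {
    docker ps --format '{{.ID}} {{.Names}}' |
        awk -v name="$1" '$2 == name { print $1 }'
}
```

For example, set CONTAINER_ID=$(container_id_by_name smalltrain) and then run docker logs "$CONTAINER_ID".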
Check GPU usage
on the host
$ watch -n 1 nvidia-smi
Every 1.0s: nvidia-smi gg-sta-20200116-volta: Tue Jan 21 10:42:27 2020
Tue Jan 21 10:42:27 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-DGXS... On | 00000000:07:00.0 On | 0 |
| N/A 32C P0 37W / 300W | 316MiB / 32475MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-DGXS... On | 00000000:08:00.0 Off | 0 |
| N/A 33C P0 35W / 300W | 1MiB / 32478MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-DGXS... On | 00000000:0E:00.0 Off | 0 |
| N/A 34C P0 104W / 300W | 2893MiB / 32478MiB | 23% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-DGXS... On | 00000000:0F:00.0 Off | 0 |
| N/A 32C P0 36W / 300W | 1MiB / 32478MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 6903 G /usr/lib/xorg/Xorg 40MiB |
| 0 7058 G /usr/bin/gnome-shell 148MiB |
| 0 40041 G /usr/lib/xorg/Xorg 39MiB |
| 0 40082 G /usr/bin/gnome-shell 86MiB |
| 2 1441 C python 2879MiB |
+-----------------------------------------------------------------------------+
- Check that the GPU device set with the environment variable (e.g. NVIDIA_VISIBLE_DEVICES=2) is in use.
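The same check can be done without reading the full table, using the CSV query mode of nvidia-smi. This is a minimal sketch; the helper name busy_gpus is hypothetical, and it assumes the "index, utilization" CSV layout that nvidia-smi prints for the query shown in the comment.

```shell
# busy_gpus is a hypothetical helper: read "index, utilization" CSV lines on
# stdin and print the index of every GPU whose utilization is above 0%.
# Non-numeric values such as "[N/A]" are treated as 0.
#
# Intended use on the host:
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits | busy_gpus
busy_gpus() {
    awk -F', ' '$2 + 0 > 0 { print $1 }'
}
```

With NVIDIA_VISIBLE_DEVICES=2, the index 2 should appear in the output while training is running.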
Log in to the SmallTrain container
on the host
$ docker exec -it $CONTAINER_ID /bin/bash
Check the log of the tutorial operation
on the container
# check log
$ less /var/smalltrain/logs/IR_2D_CNN_V2_l49-c64_20200109-TRAIN.log
2020-01-20 14:34:45.125276: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
...
========================================
step 49, training loss 0.103139
========================================
test cross entropy 0.44906
save model to save_file_path:/var/model/image_recognition/tutorials/tensorflow/model/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/model-nn_lr-0.001_bs-128.ckpt
DONE train data
====================
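To follow the training loss without paging through the whole log, the loss lines can be filtered out. This is a minimal sketch; the helper name training_losses is hypothetical, and it assumes the log lines have the form "step N, training loss X" as in the excerpt in this guide.

```shell
# training_losses is a hypothetical helper: read a SmallTrain log on stdin and
# print "step loss" pairs taken from lines such as
# "step 49, training loss 0.103139". All other lines are ignored.
training_losses() {
    sed -n 's/^step \([0-9][0-9]*\), training loss \(.*\)$/\1 \2/p'
}
```

On the container, it can be used as: training_losses < /var/smalltrain/logs/IR_2D_CNN_V2_l49-c64_20200109-TRAIN.log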
Run TensorBoard
on the container
$ nohup tensorboard --logdir /var/model/image_recognition/tutorials/tensorflow/logs/ &
Check the result of the tutorial operation
on the container
# Report directory
$ ls -l /var/model/image_recognition/tutorials/tensorflow/report/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/
total 424
-rw-r--r-- 1 root root 38074 Jan 20 15:12 all_variables_names.csv
-rw-r--r-- 1 root root 77687 Jan 20 15:13 prediction_e49_all.csv
-rw-r--r-- 1 root root 77687 Jan 20 15:13 prediction_e9_all.csv
-rw-r--r-- 1 root root 28 Jan 20 15:12 summary_layers_9.json
-rw-r--r-- 1 root root 109286 Jan 20 15:12 test_plot__.png
-rw-r--r-- 1 root root 55458 Jan 20 15:13 test_plot_e49_all.png
-rw-r--r-- 1 root root 54259 Jan 20 15:13 test_plot_e9_all.png
-rw-r--r-- 1 root root 6406 Jan 20 15:12 trainable_variables_names.csv
# Prediction after 49 steps of training
$ less /var/model/image_recognition/tutorials/tensorflow/report/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/prediction_e49_all.csv
DateTime,Estimated,MaskedEstimated,True
/var/data/cifar-10-image/test_batch/test_batch_i9_c1.png_0,1,0.0,1
/var/data/cifar-10-image/test_batch/test_batch_i90_c0.png_0,0,0.0,0
/var/data/cifar-10-image/test_batch/test_batch_i91_c3.png_0,6,0.0,3
/var/data/cifar-10-image/test_batch/test_batch_i92_c8.png_0,8,0.0,8
How to read the result:
Each result line has four comma-separated fields: "DateTime", "Estimated", "MaskedEstimated" and "True". In this demo the "DateTime" field holds the image path. The second field ("Estimated") is the predicted label and the last field ("True") is the true label. Therefore:
- /var/data/cifar-10-image/test_batch/test_batch_i91_c3.png is estimated as 6 but the true label is 3 (incorrect).
- /var/data/cifar-10-image/test_batch/test_batch_i92_c8.png is estimated as 8 and the true label is also 8 (correct).
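Comparing the estimated label with the true label can also be done for the whole file at once. This is a minimal sketch; the helper name prediction_accuracy is hypothetical, and it assumes the four-column CSV layout with one header line shown in this guide.

```shell
# prediction_accuracy is a hypothetical helper: read a prediction CSV on stdin,
# skip the header line, and count the rows where field 2 (Estimated) equals
# field 4 (True).
prediction_accuracy() {
    awk -F',' 'NR > 1 { total++; if ($2 == $4) correct++ }
               END { if (total > 0) printf "%d/%d correct\n", correct, total }'
}
```

On the container, it can be used as: prediction_accuracy < /var/model/image_recognition/tutorials/tensorflow/report/IR_2D_CNN_V2_l49-c64_20200109-TRAIN/prediction_e49_all.csv. For only the four rows quoted in this guide it would print "3/4 correct"; the result for the full file will differ.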
Done!