# horovod-environment

A yml file and a set of instructions to build a functioning horovod environment for distributed learning using keras and tensorflow (and torch) on Cheaha.

# clone this repo

and cd to the working directory

```
git clone git@gitlab.rc.uab.edu:wsmonroe/horovod-environment.git
cd horovod-environment
```

# request gpu resources (one way of doing it), this needs to be done every time

```
sinteractive --ntasks=8 --time=08:00:00 --exclusive --partition=pascalnodes -N2 --gres=gpu:4
```

# load modules, this needs to be done every time
```
module load Anaconda3/5.2.0
module load cuda91
module load OpenMPI/3.1.2-gcccuda-2018b
```

# create anaconda environment
Download distribLearn2.yml from this repo

```
conda env create -f distribLearn2.yml --name distributedLearning
```

## source activate env needs to be done every time
```
source activate distributedLearning
```
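Since the module loads and the environment activation have to be repeated in every session, they can be grouped into one shell function. This is only a convenience sketch: the function name `horovod_session` is hypothetical and not part of this repo, and it assumes the `distributedLearning` environment has already been created. Put it in a file you source (for example your `~/.bashrc`) and call `horovod_session` after requesting resources.

```shell
# Hypothetical helper (not part of this repo): groups the per-session steps
# from this README. Defining the function runs nothing; call it once per session.
horovod_session() {
    module load Anaconda3/5.2.0
    module load cuda91
    module load OpenMPI/3.1.2-gcccuda-2018b
    source activate distributedLearning
}
```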

These next 3 commands only need to be run once, when setting up the env

```
conda update automat
pip uninstall horovod
pip install --no-cache-dir horovod
```

# Download examples
The examples can be downloaded from https://github.com/uber/horovod

As always, it is recommended to download data and scripts to your data directory if you would like them to remain persistent
```
/data/user/$USER/horovod-master/examples/
```

# Run the mnist example
While it is possible to run the `mpirun` command from the command line, errors are much less likely to occur if the code is run via a job script. If the examples are downloaded at the above location, the following example should work
```
sbatch horovod-mnist-training.job
```
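The `horovod-mnist-training.job` file in this repo is the one to use. For readers who want to see what such a job script typically contains, here is a minimal sketch that writes a hypothetical one. Everything in it is an assumption for illustration: the file name `my-horovod-mnist.job`, the time limit, the output file pattern, and the example script name `keras_mnist.py` (use whichever example you actually downloaded). It requests 2 nodes with 4 GPUs each and starts one MPI rank per GPU.

```shell
# Write a hypothetical job script; the file name and all values are illustrative
# and are not taken from the job scripts shipped in this repo.
cat > my-horovod-mnist.job <<'EOF'
#!/bin/bash
#SBATCH --job-name=horovod-mnist
#SBATCH --partition=pascalnodes
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --time=02:00:00
#SBATCH --output=horovod-mnist-%j.out

# Same per-session setup as in the interactive instructions
module load Anaconda3/5.2.0
module load cuda91
module load OpenMPI/3.1.2-gcccuda-2018b
source activate distributedLearning

cd /data/user/$USER/horovod-master/examples/

# One MPI rank per GPU: 2 nodes x 4 GPUs = 8 ranks.
# keras_mnist.py is an assumed example name; substitute the script you downloaded.
mpirun -np 8 -bind-to none -map-by slot -x LD_LIBRARY_PATH -x PATH python keras_mnist.py
EOF
```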
Make sure you are running your sbatch command from the login node.

# Run benchmarks

```
cd /data/user/$USER
git clone -b cnn_tf_v1.10_compatible https://github.com/tensorflow/benchmarks
```

Assuming the benchmarks were downloaded at the above location, the following job script should run a benchmark test
```
sbatch horovod-benchmark.job
```
Make sure you are running your sbatch command from the login node.

For the resnet101 benchmark test:

running using 4 GPUs across 1 node gives: total images/sec: 491.34

running using 8 GPUs across 2 nodes gives: total images/sec: 915.31

running using 12 GPUs across 3 nodes gives: total images/sec: 1450.00
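
To put these numbers in context, one rough way to read them is as scaling efficiency: per-GPU throughput of each run divided by per-GPU throughput of the single-node run, with perfectly linear scaling counting as 100%. The sketch below does that arithmetic with `awk`. The function name `scaling_efficiency` is hypothetical, and the choice of the 4-GPU run as the baseline is an assumption, not something measured separately.

```shell
# Hypothetical helper: reads "gpus images_per_sec" pairs on stdin and prints
# per-GPU throughput relative to the first line (treated as the baseline).
scaling_efficiency() {
    awk 'NR == 1 { base = $2 / $1 }
         { printf "%d GPUs: %.1f%%\n", $1, 100 * ($2 / $1) / base }'
}

# Figures from the resnet101 runs listed above
printf '4 491.34\n8 915.31\n12 1450.00\n' | scaling_efficiency
# prints:
# 4 GPUs: 100.0%
# 8 GPUs: 93.1%
# 12 GPUs: 98.4%
```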