CM-DARE: Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers

About

This repo contains the code and data for the paper Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers, published in the proceedings of ICDCS'20.

In this paper, we tackle the problem of achieving cost savings and speedups in distributed training on cloud-based GPU servers by fully exploiting the transient options offered by cloud vendors. Transient cloud servers are just as powerful as regular ones, but can be revoked by the vendor at any time. We streamline distributed training deployment, execution, and measurement with CM-DARE, a framework built on top of TensorFlow and Tensor2Tensor. Our empirical datasets collected with CM-DARE include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers. In addition, we demonstrate the feasibility of predicting training speed and overhead using regression-based models, and we discuss potential use cases of our performance modeling, such as detecting and mitigating performance bottlenecks.

Citation

If you use our CM-DARE data or code, please cite either the arXiv version or the ICDCS version:

BibTeX for the ICDCS version

@INPROCEEDINGS{li2019cmdare_icdcs,
  author={Li, Shijian and Walls, Robert J. and Guo, Tian},
  booktitle={2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)},
  title={Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers},
  year={2020}
}

BibTeX for the arXiv version

@article{li2019cmdare_arxiv,
  title={Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers},
  author={Shijian Li and Robert J. Walls and Tian Guo},
  journal={arXiv preprint arXiv:2004.03072},
  year={2020}
}

Highlights

  • Several performance models that predict distributed training behavior on transient servers, e.g., the training speed and fault-tolerance overhead of GPU clusters, with mean absolute percentage error as low as 3.4%. We identified appropriate deployment scenarios for each performance model, including heterogeneous environments.

  • A large-scale study of transient server lifetimes and revocation patterns that identifies several important factors for using transient servers on Google Compute Engine more efficiently.

  • Potential use cases that further demonstrate the value of our performance modeling, particularly shedding light on the bottleneck problem in training.

Fig 1. Predicting training speed for a particular GPU using model complexity.

Fig 2. Lifetime CDF of transient K80 GPU servers in 4 different regions over 156 instances.

Fig 3. Use case: identifying a bottleneck imposed by parameter servers.

How to use the code

Dependency and cloud image

The code uses custom cloud images for both GPU and CPU servers; these images are currently not public. To run the code, you therefore need to create two custom images first.

We ran the code on cloud servers running Ubuntu 18.04 LTS, with 4 vCPU cores and 24 GB of memory for parameter servers and 8 vCPU cores and 51 GB for workers, each with 100 GB of disk space. Ubuntu 16.04 LTS might work, but it will have unexpected behavior such as shutdown scripts not working properly.

First create a VM and ssh into it.
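
For reference, here is a hedged example of creating and connecting to a GPU VM with gcloud; the instance name, zone, and accelerator count are illustrative placeholders, not values prescribed by the paper.

### Illustrative: create a K80 worker VM matching the specs above
gcloud compute instances create instance-gpu \
    --zone=us-west1-b \
    --custom-cpu=8 --custom-memory=51GB \
    --accelerator=type=nvidia-tesla-k80,count=1 \
    --image-family=ubuntu-1804-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=100GB \
    --maintenance-policy=TERMINATE
### SSH into the new VM
gcloud compute ssh instance-gpu --zone=us-west1-b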

Then update apt-get and install the dependencies.

sudo apt-get update
sudo apt-get install -y openjdk-8-jdk git python-dev python3-dev python-numpy python3-numpy build-essential python-pip python3-pip python-virtualenv swig python-wheel libcurl3-dev curl g++ freeglut3-dev libx11-dev libxmu-dev libxi-dev libglu1-mesa libglu1-mesa-dev parallel

Install the NVIDIA driver; the code is based on CUDA 9.0. Note: the CUDA- and cuDNN-related dependencies are not required for the CPU image.

### Install NVIDIA driver
sudo apt install nvidia-384 nvidia-384-dev
### CUDA 9.0 requires gcc 6.0
sudo apt install gcc-6
sudo apt install g++-6
### Get CUDA 9.0 files and install
wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda_9.0.176_384.81_linux-run
chmod +x cuda_9.0.176_384.81_linux-run
sudo ./cuda_9.0.176_384.81_linux-run --override

Reboot the VM, then check that the GPU driver is installed properly.

sudo reboot
nvidia-smi

An operational GPU would return something like:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8    27W / 149W |     15MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1658      G   /usr/lib/xorg/Xorg                            14MiB |
+-----------------------------------------------------------------------------+

Install cuDNN 7.5: you need to register on the NVIDIA website, then download the tar file and install it.
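
For example (the exact archive name depends on the build you download; the one below is illustrative for cuDNN 7.5 with CUDA 9.0), extracting the tar produces a local cuda/ directory that the copy commands below assume:

### Extract the cuDNN archive; this creates a local cuda/ directory
tar -xzvf cudnn-9.0-linux-x64-v7.5.0.56.tgz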

Add the CUDA paths to ~/.bashrc and reload it.

echo 'export PATH=/usr/local/cuda-9.0/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
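
If the paths took effect, the CUDA toolchain should now be on your PATH; a quick check:

### Should report "Cuda compilation tools, release 9.0"
nvcc --version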

Don't forget to copy the cuDNN files into the CUDA folder.

sudo cp -P cuda/include/cudnn.h /usr/local/cuda-9.0/include
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda-9.0/lib64/
sudo chmod a+r /usr/local/cuda-9.0/lib64/libcudnn*
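
As a sanity check (assuming your cuDNN 7.x release keeps its version macros in cudnn.h), you can print the installed version:

### Should show CUDNN_MAJOR 7 and CUDNN_MINOR 5
grep -A 2 "define CUDNN_MAJOR" /usr/local/cuda-9.0/include/cudnn.h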

The last step is to install TensorFlow 1.10 and the modified Tensor2Tensor, which can be found in the code folder.

sudo pip install tensorflow-gpu==1.10
## for CPU servers install tensorflow==1.10 instead
pip install -e ~/code/tensor2tensor
sudo pip install mesh-tensorflow==0.0.5
sudo pip install --upgrade google-api-python-client
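
On GPU servers, a minimal check that TensorFlow can see the GPU:

### Should print True on a correctly configured GPU server
python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"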

After installing the dependencies, create two images, one for workers and one for parameter servers. The example commands below create the gpu-ubuntu18 and cpu-ubuntu18 images used in the code, from VMs named instance-gpu and instance-cpu:

gcloud compute instances set-disk-auto-delete instance-gpu \
--disk instance-gpu --no-auto-delete

gcloud compute instances set-disk-auto-delete instance-cpu \
--disk instance-cpu --no-auto-delete

gcloud compute images create gpu-ubuntu18 \
--source-disk instance-gpu

gcloud compute images create cpu-ubuntu18 \
--source-disk instance-cpu
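
To confirm that both images were created, you can list them, for example:

### Both custom images should appear in the list
gcloud compute images list --filter="name~ubuntu18"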

Running the code

The code supports training models implemented in the Tensor2Tensor library. For the paper we mainly used ResNet models. The code currently supports Google Compute Engine.

To run the code, simply enter the following command. It sets up a cluster with 1 parameter server and 4 workers equipped with K80 GPUs, and trains ResNet-32 on the CIFAR-10 dataset for 64k steps. The trained model is written to the specified cloud bucket.

python main.py --proj-name=YOUR_PROJ_NAME --cred-path=YOUR_GCE_CREDENTIAL_PATH --job-name=res32 --num-ps=1 --ps-core-num=4 --num-worker=4 --num-shard=1 --bucket-dir=gs://YOUR_BUCKET/ --model=resnet --hparam-set=resnet_cifar_32 --problem=image_cifar10 --train-steps=64000 --ckpt-frequency=100000 --automation-test=0 --setSlot=1 --maxWorker=8 --zone=us-west1-b --gpu=k80 --hetero=0
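
Once training is underway, checkpoints and event files accumulate in the bucket; assuming gsutil is configured for your project, you can inspect them, and optionally point TensorBoard at the bucket, since TensorFlow can read GCS paths:

### List training output (checkpoints, event files) in the bucket
gsutil ls gs://YOUR_BUCKET/
### Optionally monitor training curves
tensorboard --logdir=gs://YOUR_BUCKET/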

Explanations of the other parameters can be found below; they are experimental and not the core focus of the paper:

  • ps-core-num: determines the number of vCPU cores for parameter servers.

  • num-shard: how many shards to partition the parameter set into.

  • ckpt-frequency: how frequently to checkpoint during training.

  • automation-test: only used in combination with a monitor, currently not supported.

  • setSlot: part of test for dynamic learning rate.

  • maxWorker: part of test for dynamic learning rate.

Alternatively, if you want to test a heterogeneous cluster configuration, for example 4 workers and 1 parameter server, with the parameter server in us-west1-b, 2 K80 GPU servers in us-west1-b, 1 P100 server in us-central1-a, and 1 V100 server in us-east1-b, you can use the following command:

python main.py --proj-name=YOUR_PROJ_NAME --cred-path=YOUR_GCE_CREDENTIAL_PATH --job-name=res32 --num-ps=1 --ps-core-num=4 --num-worker=4 --num-shard=1 --bucket-dir=gs://YOUR_BUCKET/ --model=resnet --hparam-set=resnet_cifar_32 --problem=image_cifar10 --train-steps=64000 --ckpt-frequency=100000 --automation-test=0 --setSlot=1 --maxWorker=8 --hetero=1 --gpu_array=k80 k80 p100 v100 --zone_array=us-west1-b us-west1-b us-west1-b us-central1-a us-east1-b

Acknowledgement

This work is supported in part by National Science Foundation grants #1755659 and #1815619, and Google Cloud Platform Research credits.

Contact

More project information can be found on our lab's project site.
