📄 Sequence Parallel

📌 Introduction

This codebase implements BERT with sequence parallelism. Sequence parallelism splits the input tensor and the intermediate activations along the sequence dimension, so each device holds only part of every sequence. This improves memory efficiency and allows training with larger batch sizes and longer sequence lengths.
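
The splitting described above can be illustrated with a minimal sketch in plain Python. The helper name `split_sequence` and the requirement that the sequence length be divisible by the number of devices are assumptions made for this illustration; they are not taken from this codebase.

```python
def split_sequence(token_ids, world_size, rank):
    """Return the contiguous slice of one sequence that `rank` would hold.

    Hypothetical helper for illustration only.
    """
    seq_len = len(token_ids)
    if seq_len % world_size != 0:
        # Assumption of this sketch: equal-sized chunks on every device.
        raise ValueError("sequence length must be divisible by world_size")
    chunk = seq_len // world_size
    return token_ids[rank * chunk:(rank + 1) * chunk]


# One sequence of 512 token ids shared by 4 devices.
tokens = list(range(512))
shards = [split_sequence(tokens, world_size=4, rank=r) for r in range(4)]
print([len(s) for s in shards])  # [128, 128, 128, 128]
```

Each device stores activations for 128 tokens instead of 512, which is where the memory saving comes from.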

Paper: Sequence Parallelism: Long Sequence Training from System Perspective

🛠 Environment Setup

To run this codebase, the following environment is required:

CUDA: 11.3
Python: > 3.7
PyTorch: 1.11.0

You can follow the script below to set up the environment for this codebase.

# create conda environment
conda create -n seq python=3.9
conda activate seq

# install PyTorch
pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113

# install nvidia apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" git+https://github.com/NVIDIA/apex.git@22.03

# install the remaining dependencies
pip install -v -r requirements.txt

🪅 Dataset Preparation

This codebase uses the Wikipedia dataset, processed with Megatron-LM's preprocessing script.

Step 1: Download and extract the Wikipedia Dataset

You can use the following script to download and extract the Wikipedia dataset. The extracted corpus will be stored in ./dataset/raw/corpus.json.

Run the commands below starting from the directory that contains this codebase.

# go to the root directory
cd Sequence-Parallelism

# create dataset workspace
mkdir dataset && cd ./dataset

# download
mkdir raw && cd ./raw 
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

# install wiki extractor
pip install git+https://github.com/FrankLeeeee/wikiextractor.git@v3.0.6

# extract text
wikiextractor --json enwiki-latest-pages-articles.xml.bz2
cat text/*/* > ./corpus.json
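
Before moving on, it can be worth sanity-checking the extracted corpus. The sketch below assumes that wikiextractor's --json mode writes one JSON object per line with a "text" field, which is the field the preprocessing step is expected to read. The function name `summarize_corpus` is hypothetical.

```python
import json


def summarize_corpus(path, limit=1000):
    """Inspect up to `limit` non-empty lines of a JSON-lines corpus.

    Returns (lines_checked, lines_with_text). Hypothetical helper.
    """
    checked = 0
    usable = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if checked >= limit:
                break
            checked += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # counted as checked, but not usable
            if isinstance(record, dict) and record.get("text"):
                usable += 1
    return checked, usable


# Example call, run from the root directory of this codebase:
# print(summarize_corpus("./dataset/raw/corpus.json"))
```

If the second number is much smaller than the first, the extraction step probably did not produce the expected format.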

Step 2: Process the Wikipedia dataset

The following commands can be used to process the Wikipedia dataset. Many thanks to Megatron-LM for providing a preprocessing script to generate the corpus file.

# go to the root directory
cd Sequence-Parallelism

# download vocab file
mkdir vocab && cd ./vocab
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt
cd ..

# clone the Megatron-LM repository
git clone https://github.com/NVIDIA/Megatron-LM.git
cd ./Megatron-LM
git checkout b037a69eb2622bd824d88cc230ada94077b3716f

# run processing
python tools/preprocess_data.py \
    --input ../dataset/raw/corpus.json \
    --output-prefix my-bert \
    --vocab ../vocab/bert-large-uncased-vocab.txt \
    --dataset-impl mmap \
    --tokenizer-type BertWordPieceLowerCase \
    --split-sentences \
    --workers 48

# move the processed data to the dataset directory
cd ..
mkdir -p dataset/processed
mv Megatron-LM/my-bert_text_sentence.* ./dataset/processed

After running these commands, you should see the following files in dataset/processed:

my-bert_text_sentence.bin
my-bert_text_sentence.idx

🚀 Run the codebase

We provide train.py to run training. Before invoking the script, complete the following steps.

Step 1. Set data path and vocab path

At the top of config.py, you can see two global variables DATA_PATH and VOCAB_FILE_PATH.

DATA_PATH = './dataset/processed/my-bert_text_sentence'
VOCAB_FILE_PATH = './vocab/bert-large-uncased-vocab.txt'

DATA_PATH is the path prefix of the data files generated by Megatron-LM's preprocessing script. The section above produces two data files (my-bert_text_sentence.bin and my-bert_text_sentence.idx). Set DATA_PATH to the path of the bin file without the file extension.

VOCAB_FILE_PATH is the path to the vocabulary file downloaded during dataset preparation (e.g. bert-large-uncased-vocab.txt).
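
A quick way to confirm both settings before training is to check that the files they imply exist. This is a minimal sketch and not part of the codebase; the function name `check_paths` is hypothetical, and it assumes only the .bin/.idx naming described above.

```python
from pathlib import Path


def check_paths(data_path, vocab_file_path):
    """Return the expected files that are missing (empty list if all exist).

    `data_path` is the prefix without extension, as in config.py.
    Hypothetical helper for illustration only.
    """
    expected = [
        Path(f"{data_path}.bin"),
        Path(f"{data_path}.idx"),
        Path(vocab_file_path),
    ]
    return [str(p) for p in expected if not p.is_file()]


# Example call with the default values shown above:
# print(check_paths('./dataset/processed/my-bert_text_sentence',
#                   './vocab/bert-large-uncased-vocab.txt'))
```

An empty list means both data files and the vocabulary file were found.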

Step 2. Make Dataset Helper

Build the BERT dataset helper. This requires CUDA, g++, pybind11 and make.

cd ./data/datasets
make

Step 3. Configure your parameters

The provided config.py defines a set of parameters, including the training scheme and the model. You can also modify the ColossalAI settings. For example, to parallelize over the sequence dimension on 8 GPUs, change size=4 to size=8. To use pipeline parallelism, set pipeline=<num_of_pipeline_stages>.
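
As a rough illustration, the relevant part of the configuration might look like the fragment below. Only size and pipeline come from the description above. The surrounding `parallel` dictionary, the `tensor` key, the `mode='sequence'` entry and the two constant names are assumptions based on the usual ColossalAI config layout, so check them against the actual config.py.

```python
# Hypothetical excerpt; the real config.py may name or nest these keys differently.
SEQ_PARALLEL_SIZE = 8  # number of GPUs that share each sequence (was 4)
PIPELINE_STAGES = 1    # 1 means no pipeline parallelism

parallel = dict(
    pipeline=PIPELINE_STAGES,
    tensor=dict(size=SEQ_PARALLEL_SIZE, mode='sequence'),
)
```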

Step 4. Invoke parallel training

Lastly, you can start training with sequence parallelism with the following command. You need to replace <num-of-gpus> with the number of GPUs you want to use.

python -m torch.distributed.launch --nproc_per_node <num-of-gpus> --master_addr localhost --master_port 29500 train.py
