This repo contains the training and evaluation scripts, experiment data, and model outputs for our paper: "On Learning to Summarize with Large Language Models as References".
This repo is intended and licensed for research use only. The data and model outputs are licensed under CC BY-NC 4.0 (allowing only non-commercial use).
## Contents

- Requirements
- Description of Codes
- Training
- Evaluation
- Data
- Model Outputs
## Requirements

We use Python 3.8, PyTorch 1.12.1, and Transformers 4.21.2 for our experiments. Please refer to `requirements.txt` for the full list of dependencies.
## Description of Codes

- `config.py` -> model configuration
- `data_utils.py` -> dataloader
- `main.py` -> training script of contrastive learning (BRIO)
- `main_mle.py` -> training script of MLE
- `model.py` -> models and loss functions
- `utils.py` -> utility functions
- `test.py` -> evaluation script
- `gpt_score.py` -> GPTScore evaluation script
- `gpt_rank.py` -> GPTRank evaluation script
- `gpt.py` -> OpenAI's GPT-3 API and other utility functions
### Workspace

The directory `./cache` should be created before running our experiments; it will store all the experiment results and model checkpoints. `./prompt` stores the LLM prompts, `./cnndm` stores the experiment data, `./outputs/system` stores the model outputs, and `./outputs/llm` stores the LLM outputs.
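A minimal setup sketch (assuming the repo root as the working directory; the other directories ship with the repo, so only `./cache` needs to be created):

```python
import os

# Create the cache workspace; experiment results and model
# checkpoints will be written here by the training scripts.
os.makedirs("./cache", exist_ok=True)
```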
## Training

In this section, we describe how to run the code for our experiments. Please note that while our scripts support multi-GPU training, all the configurations assume single-GPU training, so you may need to modify them (e.g., the per-GPU batch size) for multi-GPU training; for example, when moving from one GPU to four, divide the per-GPU batch size by four to keep the effective batch size unchanged.
For the MLE training, any GPU with 16GB of memory should be sufficient. For the contrastive learning, at least 24GB of memory is required.
### Warmup Training

As the first step, we train the model with MLE as a warmup, using the ChatGPT-generated summaries. Please run the following command:

```
python main_mle.py --cuda --gpuid [list of gpuid] --config chatgpt_warmup -l
```

The checkpoint selected by the validation MLE loss (`cache/*/model_mle`) will be used for the next step.
### GPT-3 Experiments

For the MLE training, please run the following command:

```
python main_mle.py --cuda --gpuid [list of gpuid] --config gpt3_mle --model_pt [path to the checkpoint from the warmup training] -l
```

The checkpoint selected by the validation MLE loss (`cache/*/model_mle`) will be used for evaluation.
We also need to train another checkpoint with less training data for the next step:

```
python main_mle.py --cuda --gpuid [list of gpuid] --config gpt3_mle_small --model_pt [path to the checkpoint from the warmup training] -l
```
For the contrastive learning, please run the following command:

```
python main.py --cuda --gpuid [list of gpuid] --config gpt3_brio --model_pt [path to the checkpoint from the MLE training] -l
```

The checkpoint selected by the validation contrastive loss (`cache/*/model_ranking`) will be used for evaluation.
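For intuition, here is a hedged sketch of the BRIO-style contrastive (ranking) loss that `main.py` optimizes: candidate summaries are ordered by quality, and each better-ranked candidate's score must exceed that of each worse-ranked one by a rank-dependent margin. This is an illustration of the technique, not the exact code in `model.py`:

```python
import torch

def ranking_loss(scores: torch.Tensor, margin: float = 0.001) -> torch.Tensor:
    """Pairwise margin ranking loss over candidate scores.

    scores: model scores (e.g., length-normalized log-probabilities) of the
    candidate summaries, sorted from best-ranked to worst-ranked.
    """
    loss = scores.new_zeros(())
    n = scores.size(0)
    for i in range(n):
        for j in range(i + 1, n):
            # The better candidate i should outscore the worse candidate j
            # by a margin that grows with their rank distance.
            loss = loss + torch.relu(scores[j] - scores[i] + margin * (j - i))
    return loss
```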
### ChatGPT Experiments

For the MLE training, please run the following command:

```
python main_mle.py --cuda --gpuid [list of gpuid] --config chatgpt_mle --model_pt [path to the checkpoint from the warmup training] -l
```

This is similar to the MLE training in the warmup step, but we use more training data here. The checkpoint selected by the validation MLE loss (`cache/*/model_mle`) will be used for evaluation.
For the contrastive learning, please run the following command:

```
python main.py --cuda --gpuid [list of gpuid] --config chatgpt_brio --model_pt [path to the checkpoint from the warmup training] -l
```

The checkpoint selected by the validation contrastive loss (`cache/*/model_ranking`) will be used for evaluation.
### GPT-4 Experiments

For the MLE training, please run the following command:

```
python main_mle.py --cuda --gpuid [list of gpuid] --config gpt4_mle --model_pt [path to the checkpoint from the warmup training] -l
```

The checkpoint selected by the validation MLE loss (`cache/*/model_mle`) will be used for evaluation.
For the contrastive learning, please run the following command (note that we use the checkpoint from the warmup training here):

```
python main.py --cuda --gpuid [list of gpuid] --config gpt4_brio --model_pt [path to the checkpoint from the warmup training] -l
```

The checkpoint selected by the validation contrastive loss (`cache/*/model_ranking`) will be used for evaluation.
## Evaluation

To generate the model outputs, please run the following command:

```
python test.py --gpuid [gpuid] --src_dir [path to the source article file] --tgt_dir [path to the system output file] --ref_dir [path to the reference summary file] --model_dir [path to the model checkpoint] --batch_size [batch size] --num_beams [beam size] --length_penalty [length penalty]
```
Please note that the length of the summaries can have a large impact on the evaluation results (ref). Therefore, we tried to keep the system output length similar to the reference summary length, and tuned the beam search parameters on the validation set to achieve this goal.
We recommend using the following parameters for the beam search:

| Model | Beam Size | Length Penalty |
|---|---|---|
| BART.GPT3D3 | 4 | 0 |
| BRIO.GPT3D3 | 64 | 0.6 |
| BART.ChatGPT | 4 | 0.8 |
| BRIO.ChatGPT | 64 | 0.8 |
| BART.GPT-4 | 4 | 0.8 |
| BRIO.GPT-4 | 128 | 0 |
Note that for checkpoints trained with contrastive learning we use a large beam size, since previous work has found this to be beneficial (ref).
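For reference, here is a minimal decoding sketch showing how these parameters map onto Hugging Face Transformers' `generate()` API. This is an illustration, not the repo's `test.py`; the checkpoint name and `max_length` are placeholders:

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# Placeholder checkpoint; substitute your trained model directory.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = "..."  # e.g., one line of cnndm/test.source
inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,         # beam size from the table above
    length_penalty=0.8,  # length penalty from the table above
    max_length=256,      # assumption; pick a budget matching the references
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```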
### LLM-based Evaluation

For LLM-based evaluation, you will need to use OpenAI's APIs. In `gpt.py`, please replace `openai.organization` and `openai.api_key` with your own credentials.
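Concretely, the relevant configuration looks like the following (the placeholder values stand in for your own credentials):

```python
import openai

# Replace these placeholders with your own OpenAI credentials,
# as referenced in gpt.py.
openai.organization = "org-..."  # your organization ID
openai.api_key = "sk-..."        # your API key
```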
#### GPTScore

To evaluate the model outputs with GPTScore, please first obtain the raw scores using the following command:

```
python gpt_score.py -r --src_dir [path to the source article file] --tgt_dir [path to the system output file] --output_dir [path to the output file]
```

The output will be saved as a jsonl file at `output_dir`.
Then, please run the following command to compute the actual scores:

```
python gpt_score.py -s --length_penalty [length penalty] --output_dir [path to the output file]
```

To get the original GPTScore, please set the length penalty to 1.
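As a rough illustration of what the `-s` step computes: GPTScore is a length-normalized log-likelihood of the summary under the LLM. A hedged sketch, assuming the raw jsonl stores per-token log-probabilities under a `"logprobs"` key (a hypothetical field name, not taken from `gpt_score.py`):

```python
import json

def gpt_score(token_logprobs, length_penalty=1.0):
    # length_penalty = 1 recovers the original GPTScore
    # (the average per-token log-probability).
    return sum(token_logprobs) / (len(token_logprobs) ** length_penalty)

with open("raw_scores.jsonl") as f:  # path produced by the -r step
    scores = [gpt_score(json.loads(line)["logprobs"]) for line in f]
print(f"mean GPTScore: {sum(scores) / len(scores):.4f}")
```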
#### GPTRank

To evaluate the model outputs with GPTRank (pair-wise comparison), please run the following command to obtain the raw outputs:

```
python gpt_rank.py -r --src_dir [path to the source article file] --cand_dir_1 [path to the system1 output file] --cand_dir_2 [path to the system2 output file] --tgt_json_dir [path to the output jsonl file] --tgt_txt_dir [path to the output txt file] --model [OpenAI model name]
```

The outputs will be saved as a jsonl file at `tgt_json_dir` and a txt file at `tgt_txt_dir`.
Then, please run the following command to compute the actual scores:

```
python gpt_rank.py -s --tgt_json_dir [path to the output jsonl file] --system_1 [system1 name] --system_2 [system2 name] --output_dir [path to the output file]
```

The result will be saved as a json file at `output_dir`.
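For intuition, aggregating the pair-wise judgments amounts to counting wins for each system. A hedged sketch, assuming each jsonl record stores the verdict under a `"winner"` key (a hypothetical field name, not taken from `gpt_rank.py`):

```python
import json
from collections import Counter

wins = Counter()
with open("rank_raw.jsonl") as f:  # path produced by the -r step
    for line in f:
        wins[json.loads(line)["winner"]] += 1  # e.g., "system_1" or "system_2"

total = sum(wins.values())
for system, count in wins.most_common():
    print(f"{system}: {count}/{total} ({count / total:.1%})")
```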
In addition to the pair-wise comparison, we also provide the code for the list-wise comparison used in our contrastive training. Please refer to the function `gpt_rank` in `gpt_rank.py` for more details.
## Data

Please find the training data under the `cnndm` directory. Each subdirectory contains the training and validation data for the corresponding model. The input articles used for testing can be found at `cnndm/test.source`.
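A minimal loading sketch, assuming `test.source` stores one article per line (the common convention for this data format; not verified against the repo):

```python
# Load the test articles, one per line (assumed format).
with open("cnndm/test.source", encoding="utf-8") as f:
    articles = [line.strip() for line in f]
print(f"loaded {len(articles)} test articles")
```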
Here is a description of the different data sets:

- `./chatgpt`: data used for the warmup training
- `./gpt3`: data used for the MLE training with GPT-3, which is used for the next step of contrastive learning
- `./gpt3_all`: data used for the MLE training with GPT-3, which is used for the evaluation
- `./gpt_brio_gptscore`: data used for the contrastive learning with GPT-3 and GPTScore
- `./chatgpt_all`: data used for the MLE training with ChatGPT, which is used for the evaluation
- `./brio_chatgpt`: data used for the contrastive learning with ChatGPT and GPTRank
- `./gpt4`: data used for the MLE training with GPT-4
- `./brio_gpt4`: data used for the contrastive learning with GPT-4 and GPTRank
## Model Outputs

The model outputs on the test set can be found under the `outputs` directory. The LLM outputs are in the `outputs/llm` directory, and the system outputs are in the `outputs/system` directory.