
Apply BRIO to other generation tasks #9

Open
HillZhang1999 opened this issue Jun 22, 2022 · 8 comments

Comments


HillZhang1999 commented Jun 22, 2022

Hi, thanks for this fantastic work.
Here is my question: I tried to use BRIO on another generation task, re-implementing it in Fairseq. However, I found that performance was relatively poor after incorporating BRIO.
Looking further into the generation results, I found that many outputs are just a single period. Moreover, after training with the contrastive loss (I set the hyperparameters following the CNN/DM setting in your paper), the candidate scores seem to collapse to nearly identical values, as in the example below:

before training with the contrastive loss (16 candidates, sorted):
[-0.2314, -0.2862, -0.2660, -0.2471, -0.2442, -0.2796, -0.2611, -0.2617, -0.2608, -0.2984, -0.2622, -0.5395, -0.5655, -0.4688, -0.5250, -0.5317],

after:
[-1.1421, -1.1402, -1.1290, -1.1524, -1.1554, -1.1483, -1.1415, -1.1476, -1.1527, -1.1472, -1.1538, -1.1437, -1.1555, -1.1722, -1.1440, -1.1427]
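The collapse above can be checked against the pairwise margin ranking loss described in the BRIO paper, where each higher-ranked candidate should outscore a lower-ranked one by a margin scaled by the rank gap. A minimal sketch in plain Python, using the scores quoted above (the margin value here is illustrative, not the paper's setting):

```python
def brio_contrastive_loss(scores, margin=0.001):
    """BRIO-style ranking loss over candidates sorted best-to-worst:
    for each pair (i, j) with i < j, candidate i should score higher
    than candidate j by at least (j - i) * margin."""
    loss = 0.0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            loss += max(0.0, (j - i) * margin - (scores[i] - scores[j]))
    return loss

before = [-0.2314, -0.2862, -0.2660, -0.2471, -0.2442, -0.2796, -0.2611,
          -0.2617, -0.2608, -0.2984, -0.2622, -0.5395, -0.5655, -0.4688,
          -0.5250, -0.5317]
after = [-1.1421, -1.1402, -1.1290, -1.1524, -1.1554, -1.1483, -1.1415,
         -1.1476, -1.1527, -1.1472, -1.1538, -1.1437, -1.1555, -1.1722,
         -1.1440, -1.1427]
print(brio_contrastive_loss(before), brio_contrastive_loss(after))
```

If the "after" scores were truly learned rankings, the loss on them would be near zero; nonzero values on nearly identical scores indicate the ranking has not been learned.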

Can you give me any advice?

yixinL7 (Owner) commented Jun 24, 2022

Hi, thank you for your interest in our work. I'd recommend several things:

  1. Following the CNN/DM setting may not always be suitable, depending on the dataset you are working on. Several hyperparameters need to be tuned (among others):
    • the margin of the contrastive loss
    • the scale of the contrastive loss
    • the length penalty used for calculating the model-predicted probability
      These hyperparameters can be sensitive (e.g., they are very different for CNN/DM and XSum).
  2. For the length penalty, you can start your search from the length penalty used in the original beam search for the MLE-trained baseline.
  3. For the others, you may need to try a few different values. A rule of thumb is to watch the MLE loss during training: if it becomes too large, you haven't found an appropriate setting.
  4. You may also start by training BRIO as a re-ranker, setting the weight of the MLE loss to zero. This should give you some idea of how to set the hyperparameters and whether BRIO will work on your dataset at all. But please note that the hyperparameters used for training BRIO as a re-ranker can differ from those for training it as a generation model - I found that training it as a generation model is more sensitive to the hyperparameters.
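On the length-penalty point, a minimal sketch of the common formulation (candidate score = sum of token log-probabilities divided by length raised to a penalty exponent α; the α values and numbers below are illustrative, not BRIO's defaults):

```python
def candidate_score(token_logprobs, alpha=2.0):
    """Length-penalized sequence score: a larger alpha boosts longer
    outputs, countering the bias toward degenerate short candidates
    (e.g., a lone period)."""
    return sum(token_logprobs) / (len(token_logprobs) ** alpha)

short = [-0.1]         # a degenerate one-token candidate
longer = [-0.3] * 10   # a fluent ten-token candidate
print(candidate_score(short), candidate_score(longer))
```

With α = 2 the longer candidate scores higher (-0.03 vs -0.1); with α = 0 (no penalty) the single period wins, which matches the failure mode described above.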

Please let me know if you have more questions. Good luck!

@Hannibal046

Hi @yixinL7, could you please share some insight into the scale parameter? Why is it needed, and how should it be set? Thanks!

@Hannibal046

By the way, how did you come up with this eval function for different datasets? Is there any criterion? Thanks so much for your amazing work!

BRIO/main.py

Lines 411 to 416 in 135f0e5

```python
if args.dataset == "xsum":
    # XSum: 1 minus the harmonic mean of ROUGE-1 and ROUGE-2
    def eval_fn(rouge1, rouge2, rougeLsum):
        return 1 - 2 * rouge1 * rouge2 / (rouge1 + rouge2)
else:
    # CNN/DM: 1 minus the arithmetic mean of ROUGE-1, ROUGE-2, ROUGE-Lsum
    def eval_fn(rouge1, rouge2, rougeLsum):
        return 1 - (rouge1 + rouge2 + rougeLsum) / 3
```
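As a usage sketch: lower `eval_fn` output means a better candidate, so candidates can be ranked by sorting ascending on it. The candidate names and ROUGE values below are invented for illustration, and the CNN/DM-style mean-of-ROUGEs definition is paraphrased rather than copied from the repo:

```python
def eval_fn(rouge1, rouge2, rougeLsum):
    # CNN/DM-style: 1 minus the mean of the three ROUGE scores,
    # so lower values correspond to better candidates
    return 1 - (rouge1 + rouge2 + rougeLsum) / 3

# hypothetical (candidate, ROUGE-1, ROUGE-2, ROUGE-Lsum) tuples
candidates = [
    ("cand_a", 0.44, 0.21, 0.41),
    ("cand_b", 0.38, 0.17, 0.35),
    ("cand_c", 0.47, 0.24, 0.44),
]
ranked = sorted(candidates, key=lambda c: eval_fn(*c[1:]))
print([name for name, *_ in ranked])  # best candidate first
```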

@HillZhang1999 (Author)


Thank you for your advice and kind words!
Indeed, I have tried a few different hyperparameters but still couldn't get positive results. I think the reason may be a characteristic of my task, grammatical error correction (GEC). Since GEC is a local sequence transduction task, many candidates in beam search differ only minimally, which may make the contrastive loss hard to optimize. I also noticed that you used diverse beam search, but I found this technique performs poorly for GEC.
Can you provide any further advice on using BRIO for local sequence transduction tasks like GEC and text simplification?
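One way to quantify (and mitigate) the near-duplicate candidate problem described above is to filter beam outputs by word-level edit distance before training. This is a hedged sketch of that idea, not something from the BRIO codebase; the threshold and example sentences are invented:

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # delete wa
                           cur[j - 1] + 1,       # insert wb
                           prev[j - 1] + (wa != wb)))  # substitute
        prev = cur
    return prev[-1]

def filter_near_duplicates(candidates, min_dist=2):
    """Greedily keep candidates that differ from every kept one
    by at least min_dist word edits."""
    kept = []
    for c in candidates:
        if all(edit_distance(c, k) >= min_dist for k in kept):
            kept.append(c)
    return kept

beams = ["he goes to school .", "he go to school .",
         "he goes to school", "she went to the school ."]
print(filter_near_duplicates(beams))
```

On the example beams this keeps only the first and last candidates; if most of a beam collapses like this, the contrastive signal over the surviving candidates is correspondingly weak.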


Hannibal046 commented Jun 28, 2022

Hi @HillZhang1999, I think these may help:
yixinL7/SimCLS#14
https://arxiv.org/pdf/1512.02433.pdf (since NMT is also a 1-to-n generation task where n is relatively small)


HillZhang1999 commented Jun 28, 2022

@Hannibal046 Thanks a lot!


yixinL7 commented Jul 20, 2022

Thanks @Hannibal046 for the comment.

Hi @HillZhang1999, I'm not very familiar with GEC, but I think your observation makes sense. It's critical to have diverse candidates so that the model can learn something meaningful. I'd recommend trying some other decoding algorithms; there are actually several new papers on this. For example:
• Massive-scale Decoding for Text Generation using Lattices
• A Well-Composed Text is Half Done! Composition Sampling for Diverse Conditional Generation

@HillZhang1999 (Author)


Dear Yixin, thank you for your help, I will check it out.
