
Apply BRIO to other generation tasks #9

Open
HillZhang1999 opened this issue Jun 22, 2022 · 8 comments

Comments


HillZhang1999 commented Jun 22, 2022

Hi, thanks for this fantastic work.
Here is my question: I tried to use BRIO on another generation task, re-implementing it in Fairseq. However, I found that performance was relatively poor after incorporating BRIO.
Looking further into the generation results, I found that many outputs are just a single period. Moreover, after training with the contrastive loss (I set the hyperparameters following the CNN/DM setting in your paper), the candidate scores seem to collapse to nearly identical values, as in the example below:

before training with the contrastive loss (16 candidates, sorted):
[-0.2314, -0.2862, -0.2660, -0.2471, -0.2442, -0.2796, -0.2611, -0.2617, -0.2608, -0.2984, -0.2622, -0.5395, -0.5655, -0.4688, -0.5250, -0.5317],

after:
[-1.1421, -1.1402, -1.1290, -1.1524, -1.1554, -1.1483, -1.1415, -1.1476, -1.1527, -1.1472, -1.1538, -1.1437, -1.1555, -1.1722, -1.1440, -1.1427]
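The collapse above can be checked against the pairwise margin ranking loss described in the BRIO paper, where each higher-ranked candidate should outscore a lower-ranked one by a margin scaled by the rank gap. A minimal sketch in plain Python, using the scores quoted above (the margin value here is illustrative, not the paper's setting):

```python
def brio_contrastive_loss(scores, margin=0.001):
    """BRIO-style ranking loss over candidates sorted best-to-worst:
    for each pair (i, j) with i < j, candidate i should score higher
    than candidate j by at least (j - i) * margin."""
    loss = 0.0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            loss += max(0.0, (j - i) * margin - (scores[i] - scores[j]))
    return loss

before = [-0.2314, -0.2862, -0.2660, -0.2471, -0.2442, -0.2796, -0.2611,
          -0.2617, -0.2608, -0.2984, -0.2622, -0.5395, -0.5655, -0.4688,
          -0.5250, -0.5317]
after = [-1.1421, -1.1402, -1.1290, -1.1524, -1.1554, -1.1483, -1.1415,
         -1.1476, -1.1527, -1.1472, -1.1538, -1.1437, -1.1555, -1.1722,
         -1.1440, -1.1427]
print(brio_contrastive_loss(before), brio_contrastive_loss(after))
```

If the "after" scores were truly learned rankings, the loss on them would be near zero; nonzero values on nearly identical scores indicate the ranking has not been learned.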

Can you give me any advice?

yixinL7 (Owner) commented Jun 24, 2022

Hi, thank you for your interest in our work. I'd recommend several things:

  1. Following the CNN/DM setting may not always be suitable, depending on the dataset you are working on. Several hyperparameters need to be tuned (among others):
    • the margin of the contrastive loss
    • the scale of the contrastive loss
    • the length penalty used for calculating the model-predicted probability
      These hyperparameters can be sensitive (e.g., they are very different for CNN/DM and XSum).
  2. For the length penalty, you can start your search from the length penalty used in the original beam search for the MLE-trained baseline.
  3. For the others, you may need to try a few different values. A rule of thumb is to watch the MLE loss during training: if it becomes too large, you haven't found an appropriate setting.
  4. You may also start by training BRIO as a re-ranker, setting the weight of the MLE loss to zero. This should give you some idea of how to set the hyperparameters and whether BRIO will work on your dataset at all. But please note that the hyperparameters used for training BRIO as a re-ranker can differ from those for training it as a generation model - I found that training it as a generation model is more sensitive to the hyperparameters.
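On the length-penalty point, a minimal sketch of the common formulation (candidate score = sum of token log-probabilities divided by length raised to a penalty exponent α; the α values and numbers below are illustrative, not BRIO's defaults):

```python
def candidate_score(token_logprobs, alpha=2.0):
    """Length-penalized sequence score: a larger alpha boosts longer
    outputs, countering the bias toward degenerate short candidates
    (e.g., a lone period)."""
    return sum(token_logprobs) / (len(token_logprobs) ** alpha)

short = [-0.1]         # a degenerate one-token candidate
longer = [-0.3] * 10   # a fluent ten-token candidate
print(candidate_score(short), candidate_score(longer))
```

With α = 2 the longer candidate scores higher (-0.03 vs -0.1); with α = 0 (no penalty) the single period wins, which matches the failure mode described above.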

Please let me know if you have more questions. Good luck!

@Hannibal046

Hi @yixinL7, could you please share some insight into the scale parameter? Why is it needed, and how should it be set? Thanks!

@Hannibal046

By the way, how did you come up with this eval function for different datasets? Is there any criterion? Thanks so much for your amazing work!

BRIO/main.py

Lines 411 to 416 in 135f0e5

```python
if args.dataset == "xsum":
    # XSum: 1 minus the harmonic mean of ROUGE-1 and ROUGE-2
    def eval_fn(rouge1, rouge2, rougeLsum):
        return 1 - 2 * rouge1 * rouge2 / (rouge1 + rouge2)
else:
    # CNN/DM: 1 minus the arithmetic mean of ROUGE-1, ROUGE-2, ROUGE-Lsum
    def eval_fn(rouge1, rouge2, rougeLsum):
        return 1 - (rouge1 + rouge2 + rougeLsum) / 3
```
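As a usage sketch: lower `eval_fn` output means a better candidate, so candidates can be ranked by sorting ascending on it. The candidate names and ROUGE values below are invented for illustration, and the CNN/DM-style mean-of-ROUGEs definition is paraphrased rather than copied from the repo:

```python
def eval_fn(rouge1, rouge2, rougeLsum):
    # CNN/DM-style: 1 minus the mean of the three ROUGE scores,
    # so lower values correspond to better candidates
    return 1 - (rouge1 + rouge2 + rougeLsum) / 3

# hypothetical (candidate, ROUGE-1, ROUGE-2, ROUGE-Lsum) tuples
candidates = [
    ("cand_a", 0.44, 0.21, 0.41),
    ("cand_b", 0.38, 0.17, 0.35),
    ("cand_c", 0.47, 0.24, 0.44),
]
ranked = sorted(candidates, key=lambda c: eval_fn(*c[1:]))
print([name for name, *_ in ranked])  # best candidate first
```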

@HillZhang1999 (Author)


Thank you for your advice and kind words!
Indeed, I have tried a few different hyperparameters but still couldn't get positive results. I think the reason may be a characteristic of my task, grammatical error correction (GEC). Since GEC is a local sequence transduction task, many candidates in beam search differ only minimally, which may make the contrastive loss hard to optimize. I also noticed that you used diverse beam search, but I found this technique performs poorly for GEC.
Can you provide any further advice on using BRIO for local sequence transduction tasks like GEC and text simplification?
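One way to quantify (and mitigate) the near-duplicate candidate problem described above is to filter beam outputs by word-level edit distance before training. This is a hedged sketch of that idea, not something from the BRIO codebase; the threshold and example sentences are invented:

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # delete wa
                           cur[j - 1] + 1,       # insert wb
                           prev[j - 1] + (wa != wb)))  # substitute
        prev = cur
    return prev[-1]

def filter_near_duplicates(candidates, min_dist=2):
    """Greedily keep candidates that differ from every kept one
    by at least min_dist word edits."""
    kept = []
    for c in candidates:
        if all(edit_distance(c, k) >= min_dist for k in kept):
            kept.append(c)
    return kept

beams = ["he goes to school .", "he go to school .",
         "he goes to school", "she went to the school ."]
print(filter_near_duplicates(beams))
```

On the example beams this keeps only the first and last candidates; if most of a beam collapses like this, the contrastive signal over the surviving candidates is correspondingly weak.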


Hannibal046 commented Jun 28, 2022

Hi @HillZhang1999, I think these may help:
yixinL7/SimCLS#14
https://arxiv.org/pdf/1512.02433.pdf (since NMT is also a 1-to-n generation task where n is relatively small)


HillZhang1999 commented Jun 28, 2022

@Hannibal046 Thanks a lot!


yixinL7 commented Jul 20, 2022

Thanks @Hannibal046 for the comment.

Hi @HillZhang1999, I'm not very familiar with GEC, but I think your observation makes sense. It's critical to have diverse candidates so that the model can learn something meaningful. I'd recommend trying some other decoding algorithms; there are actually several new papers on this. For example:
• Massive-scale Decoding for Text Generation using Lattices
• A Well-Composed Text is Half Done! Composition Sampling for Diverse Conditional Generation

@HillZhang1999 (Author)


Dear Yixin, thank you for your help, I will check it out.
