Is factCC reliable for factual correctness evaluation? #6

Open
nightdessert opened this issue Dec 22, 2020 · 9 comments

Comments

@nightdessert

I really appreciate the excellent paper.
I tested factCC on the CNN/DM dataset, using the gold reference sentences as claims (each reference summary was split into single sentences).
I strictly followed the README and used the official pre-trained factCC checkpoint.
I labeled all the claims as 'CORRECT' (because they are gold references).
The accuracy output by factCC is around 42%, which means the model considers only 42% of the reference sentences factually correct.
Is this reasonable, or am I using the metric incorrectly?
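
For reference, a minimal sketch of how such an evaluation file could be prepared, assuming a JSONL input with one object per claim holding "id", "text", "claim", and "label" fields (the exact field names and the make_eval_file helper are illustrative assumptions, not the repo's own code):

```python
import json
from nltk import sent_tokenize  # requires nltk with the 'punkt' data downloaded

def make_eval_file(articles, gold_summaries, out_path="data-dev.jsonl"):
    """Write one JSONL record per gold-summary sentence, all labeled CORRECT."""
    with open(out_path, "w") as f:
        for i, (article, summary) in enumerate(zip(articles, gold_summaries)):
            for j, sentence in enumerate(sent_tokenize(summary)):
                record = {
                    "id": f"{i}-{j}",
                    "text": article,      # full source document
                    "claim": sentence,    # a single gold-reference sentence
                    "label": "CORRECT",   # gold sentences assumed factually correct
                }
                f.write(json.dumps(record) + "\n")
```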

@nightdessert (Author)

I noticed the metric is based on uncased BERT, so I did use lower-cased inputs.
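
(One note on the lower-casing: the uncased BERT tokenizer lower-cases its input itself, so pre-lowercasing the claims should be harmless but is unlikely to explain the low score. A quick check, assuming the standard Hugging Face tokenizer is used:)

```python
from transformers import BertTokenizer

# The uncased tokenizer lower-cases during tokenization, so mixed-case and
# pre-lowercased claims yield identical word pieces.
tok = BertTokenizer.from_pretrained("bert-base-uncased")
assert tok.tokenize("New York") == tok.tokenize("new york")
```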

@nightdessert nightdessert changed the title Is factCC reliable ? Is factCC reliable for factual correctness evaluation? Dec 22, 2020
@sanghyuk-choi

sanghyuk-choi commented Feb 3, 2021

I've got the same result...
When I used the 'generated data' and the 'annotated data', it worked well, but the gold data (CNN/DM) gives strange results.

I used the summary sentences as claims.
(Each summary has several sentences; I split it and used each sentence as a separate claim.)

@gaozhiguang

In fact, I also encountered this problem. Following the approach mentioned above, I used the gold summaries for evaluation and got the following result:
***** Eval results *****
bacc = 0.41546565056595314
f1 = 0.41546565056595314
loss = 3.5899247798612546

On the authors' annotated dataset, the results are as follows:
***** Eval results *****
bacc = 0.7611692646110668
f1 = 0.8614393125671321
loss = 0.8623681812816171
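
For anyone reproducing these numbers outside the repo's run script, the scoring step is a BERT pair classification over (article, claim). A minimal sketch, assuming the released checkpoint loads as a standard Hugging Face BertForSequenceClassification and that label index 0 corresponds to CORRECT (both assumptions should be verified against the repo's data processors):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

CHECKPOINT = "path/to/factcc-checkpoint"  # hypothetical local path to the released checkpoint

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

def predict(article: str, claim: str) -> str:
    # Encode the (article, claim) pair as a single BERT sequence pair, truncating long articles.
    enc = tokenizer(article, claim, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    # Assumed label order: 0 = CORRECT, 1 = INCORRECT.
    return "CORRECT" if logits.argmax(dim=-1).item() == 0 else "INCORRECT"
```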

@YipingNUS

YipingNUS commented Jul 8, 2021

Some of my observations below:

  1. The manual test set is strongly label-imbalanced: only 62 out of 503 examples are incorrect. A majority-voting baseline would give a balanced accuracy of 0.5 and an F1 of about 0.88, basically the same as the MNLI or FEVER baselines (see the quick check after this list).
  2. The FactCC model performs much worse on the incorrect class: the F1 scores for the correct and incorrect classes are 0.92 and 0.49.
  3. The CNN/DM models that generated the predictions are highly extractive (validated by many papers). So the "correct" cases are mostly trivial, since an almost exact copy of the claim sentence is contained in the source article.
  4. I'm not surprised that it performed poorly on the gold summaries, because 1) the dataset contains noise: if you take individual sentences out of a summary, they may not make sense on their own (e.g. unresolved pronouns); and 2) as I mentioned in point 3, the correct cases in the manual test set are often trivial, whereas the gold summaries contain a lot of paraphrasing and abstraction. There is also considerable "hallucination" (the summary contains information not mentioned in the article). Therefore, the model is very likely to predict "incorrect".
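
A quick sanity check on the majority-vote numbers in point 1, using the 62/503 split quoted above; treating the reported bacc as balanced accuracy and f1 as micro F1 is an assumption about the repo's metrics:

```python
from sklearn.metrics import balanced_accuracy_score, f1_score

# Manual test set as described above: 441 CORRECT (0) and 62 INCORRECT (1) examples.
y_true = [0] * 441 + [1] * 62
y_pred = [0] * 503  # majority-vote baseline: always predict CORRECT

print(balanced_accuracy_score(y_true, y_pred))    # 0.5
print(f1_score(y_true, y_pred, average="micro"))  # ~0.877
```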

In summary, I think FactCC can identify local errors like swapping entities or numbers. However, don't count on it to solve the hard NLI problem. Overall, it's still one of the better metrics. You can also check out the following paper.

Goyal, Tanya, and Greg Durrett. "Evaluating factuality in generation with dependency-level entailment." arXiv preprint arXiv:2010.05478 (2020).

@Ricardokevins

I greatly appreciate the discussion above.
Has anyone retrained or fine-tuned the model to get results on CNN/DM or another dataset?
Would that help produce a more precise factuality evaluation? If not, is FactCC reliable enough to be reported as a metric in a paper?

@Ricardokevins

I notice that some papers use FactCC as a metric.
If FactCC still has these problems, then results reported with it as a metric in a paper may not be reliable.

@YipingNUS

@Ricardokevins, you can take a look at the following two comprehensive surveys on factuality metrics. What's disturbing is that they reach very different conclusions. If you're writing a paper, the best you can do is to pick 1-2 metrics from each category (e.g., entailment, QA, optionally IE) and report the results of all of them. You should also do a small-scale human evaluation on, say, 50-100 summaries.

Gabriel, Saadia, et al. "Go figure! a meta evaluation of factuality in summarization." arXiv preprint arXiv:2010.12834 (2020).

Pagnoni, Artidoro, Vidhisha Balachandran, and Yulia Tsvetkov. "Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics." arXiv preprint arXiv:2104.13346 (2021).

@Ricardokevins

thanks a lot <3

@xeniaqian94

My result on the authors' annotated dataset is the same as the one reported above (bacc ≈ 0.76, f1 ≈ 0.86). This result, however, is not consistent with the Table 3 F1 score for FactCC in the paper. Does anyone have an intuition for why?
