Using questions to summarize large amounts of textual data.

unicamp-dl/corpus2question

corpus2question

This repository presents corpus2question, a method for summarizing and exploring datasets based on the latent questions in their documents. It also contains the reference implementation for the paper "Can questions summarize a corpus? Using question generation for characterizing COVID-19 research".

The method

corpus2question relies on the question generation network used in doc2query and frequency aggregations. Check our tutorial for a small example.
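
As a rough illustration of the frequency aggregation step only, the sketch below counts how often each generated question appears and keeps the most frequent ones. It is not the repository's implementation: the question generation network is omitted, the function names `normalize` and `top_questions` are hypothetical, the lowercase/whitespace normalization is an assumption, and the sample questions are made-up placeholders rather than output from the CORD-19 run.

```python
from collections import Counter


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so near-identical questions merge.

    This normalization is an assumption made for the sketch.
    """
    return " ".join(question.lower().split())


def top_questions(questions, k=10):
    """Return the k most frequent normalized questions with their counts."""
    counts = Counter(normalize(q) for q in questions)
    return counts.most_common(k)


# Placeholder questions standing in for the output of a question
# generation model run over every document in a corpus.
generated = [
    "What is the incubation period of COVID-19?",
    "what is the incubation period of  covid-19?",
    "How is the virus transmitted?",
]

summary = top_questions(generated, k=2)
for question, count in summary:
    print(count, question)
```

Questions that many documents give rise to end up at the top of the list, which is what lets a short list of questions act as a summary of the corpus.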

Results over the CORD-19 dataset

All raw questions generated over the CORD-19 dataset are available at this link in CSV format. You can also find the aggregated top 10k questions at this link. The reference implementation for the paper is available in this notebook.

Citing this work

If you use corpus2question in your academic work, or use the questions generated over the CORD-19 dataset, please cite us with:

@misc{surita2020questions,
    title={Can questions summarize a corpus? Using question generation for characterizing COVID-19 research},
    author={Gabriela Surita and Rodrigo Nogueira and Roberto Lotufo},
    year={2020},
    eprint={2009.09290},
    archivePrefix={arXiv},
    primaryClass={cs.IR}
}