CNN/DailyMail Dataset as SQLite

Supported Python versions: 3.9 and 3.10.

Creates a SQLite database of the CNN and DailyMail summarization dataset.

Documentation

See the full documentation. The API reference is also available.

Obtaining

The easiest way to install the command line program is via the pip installer:

pip3 install zensols.cnndmdb

Binaries are also available on PyPI.

Usage

This package can be used from the command line with the cnndmdb command, or as a Python API.

Install

  1. Install the Python dependencies: pip install -r src/python/requirements.txt
  2. Create the SQLite database file: cnndmdb load. This takes a while since the entire corpus is first downloaded and then inserted into the SQLite file.
  3. Check to make sure the file data/cnndm.sqlite3 was created.
  4. Optionally create a ~/.cnndmdbrc to relocate the data/cnndm.sqlite3 file.
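
Step 3 can also be checked from Python. The following is a minimal sketch that uses only the standard library and assumes nothing about the schema that cnndmdb load creates: it confirms that the file exists and lists whatever tables it contains. The function name list_tables is hypothetical and is not part of this package.

```python
import sqlite3
from pathlib import Path
from typing import List


def list_tables(db_path: Path) -> List[str]:
    """Return the table names found in the SQLite file at ``db_path``.

    ``list_tables`` is a hypothetical helper for this README, not part of
    the zensols.cnndmdb API.
    """
    if not db_path.is_file():
        raise FileNotFoundError(f"no database file at {db_path}")
    # Open the file read-only so the check can never modify the corpus.
    uri = db_path.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Example (default location from step 3):
#   list_tables(Path("data/cnndm.sqlite3"))
```

An empty list would suggest that the load step did not finish, since a populated database should contain at least one table.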

To relocate the SQLite file, add the following to the ~/.cnndmdbrc file:

[cnndmdb_default]
db_file = ~/path/to/cnndm.sqlite3
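
To illustrate how such an entry maps to a file location, here is a stand-alone sketch that parses the same INI text with the standard library configparser. It is not how the package itself reads ~/.cnndmdbrc (its parser may handle interpolation differently); the function name resolve_db_file and the fallback constant are assumptions made for the example.

```python
from configparser import ConfigParser
from pathlib import Path

# Default location named in the install steps above.
DEFAULT_DB_FILE = Path("data/cnndm.sqlite3")


def resolve_db_file(rc_text: str) -> Path:
    """Return the database path configured in ``rc_text``.

    Hypothetical helper: falls back to the default location when the
    ``[cnndmdb_default]`` section or its ``db_file`` option is absent.
    """
    parser = ConfigParser()
    parser.read_string(rc_text)
    value = parser.get("cnndmdb_default", "db_file", fallback=None)
    if value is None:
        return DEFAULT_DB_FILE
    # Expand a leading "~" to the current user's home directory.
    return Path(value).expanduser()


rc_example = """
[cnndmdb_default]
db_file = ~/path/to/cnndm.sqlite3
"""
print(resolve_db_file(rc_example))
```

Reading the real file would amount to passing Path("~/.cnndmdbrc").expanduser().read_text() to the same function.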

Command Line

To list the keys stored in the SQLite database:

cnndmdb keys

The command line can also print an article, given its key:

cnndmdb show -t org 3b07f5102c69e3e609d73b2ccb0dc5549d4fbaf6

The -t org option tells the command to treat the argument as an original corpus key. The same option also allows selecting an article by SQLite rowid key or as the Kth smallest article.
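
The show command can also be driven from a script. The sketch below wraps the exact invocation shown above with subprocess; the function names are hypothetical, and only the org key type is taken from this README, so any other value passed as key_type is an assumption to check against the command's help output.

```python
import shutil
import subprocess
from typing import List


def show_command(key: str, key_type: str = "org") -> List[str]:
    """Build the argument list for ``cnndmdb show -t <key_type> <key>``."""
    return ["cnndmdb", "show", "-t", key_type, key]


def show_article(key: str, key_type: str = "org") -> str:
    """Run the command and return whatever it prints to standard output."""
    if shutil.which("cnndmdb") is None:
        raise RuntimeError("cnndmdb is not on PATH; install zensols.cnndmdb first")
    result = subprocess.run(
        show_command(key, key_type),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


# Example, using the key from the command above:
#   print(show_article("3b07f5102c69e3e609d73b2ccb0dc5549d4fbaf6"))
```

Passing the arguments as a list avoids shell quoting issues, and for anything beyond printing, the Python API described below is likely the better route.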

API

The corpus objects are accessible as mapped Python objects. For example:

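# assumption: these names are exported by the package's top-level module;
# see the API reference if the import fails
from zensols.cnndmdb import ApplicationFactory, Article, Corpus

corpus: Corpus = ApplicationFactory.get_corpus()
# take the first article in the stash and print its text
art: Article = next(iter(corpus.stash.values()))
print(art.text)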

Data Source

The data is sourced from a TensorFlow dataset, which in turn uses Abigail See's GitHub repository.

@article{DBLP:journals/corr/SeeLM17,
  author    = {Abigail See and
               Peter J. Liu and
               Christopher D. Manning},
  title     = {Get To The Point: Summarization with Pointer-Generator Networks},
  journal   = {CoRR},
  volume    = {abs/1704.04368},
  year      = {2017},
  url       = {http://arxiv.org/abs/1704.04368},
  archivePrefix = {arXiv},
  eprint    = {1704.04368},
  timestamp = {Mon, 13 Aug 2018 16:46:08 +0200},
  biburl    = {https://dblp.org/rec/bib/journals/corr/SeeLM17},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
@inproceedings{hermann2015teaching,
  title={Teaching machines to read and comprehend},
  author={Hermann, Karl Moritz and Kocisky, Tomas and Grefenstette, Edward and Espeholt, Lasse and Kay, Will and Suleyman, Mustafa and Blunsom, Phil},
  booktitle={Advances in neural information processing systems},
  pages={1693--1701},
  year={2015}
}

Changelog

An extensive changelog is available here.

License

MIT License

Copyright (c) 2023 Paul Landes