Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP)

niderhoff/nlp-datasets

nlp-datasets

Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP). Most of the entries here are raw, unstructured text data; if you are looking for annotated corpora or treebanks, refer to the sources at the bottom.

Datasets (English, multilang)

Sources

Datasets (Albanian)

  • Albanian News Articles Dataset: Over 3 million Albanian news articles along with metadata, extracted from various Albanian news sources (see list in link).

Datasets (Arabic)

  • SaudiNewsNet: 31,030 Arabic newspaper articles along with metadata, extracted from various online Saudi newspapers. (2 MB)

Datasets (Urdu)

Datasets (German)

  • German Political Speeches Corpus: A collection of recent speeches given by top German representatives. (25 MB, 11 MTokens)

  • NEGRA: A syntactically annotated corpus of German newspaper texts. Available free of charge to universities and non-profit organizations; a signed form must be sent in to obtain it. (on request)

  • Ten Thousand German News Articles Dataset: 10,273 German-language news articles categorized into nine classes for topic classification. (26.1 MB)

  • 100k German Court Decisions: Open Legal Data releases a dataset of 100,000 German court decisions and 444,000 citations (772 MB)

Datasets (Kinyarwanda and Kirundi)

  • KINNEWS and KIRNEWS: Two annotated and cleaned datasets of more than 20k Kinyarwanda and 4k Kirundi news articles. (65 MB)
