hindi2vec

State-of-the-Art Language Modeling and Text Classification in Hindi Language

Results

We achieved a state-of-the-art perplexity of 46.81 for Hindi, compared with 40.68 for English (lower is better).

This is state of the art to the best of my knowledge as of September 18, 2018.

Update: nlp-for-hindi uses sentencepiece instead of the word-based spaCy tokenizer that I use. On those tokens, the measured perplexity for that LM is ~35. I encourage you to check out that work as well.
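For readers unfamiliar with the metric, here is a minimal sketch of how perplexity relates to per-token log-probabilities and to the cross-entropy loss reported during training. The function names and the toy inputs are illustrative assumptions, not code from the notebooks. Because perplexity is averaged per token, values measured on different tokenizations (word-based versus sentencepiece) are not directly comparable.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_log_probs` holds the natural-log probability the model
    assigned to each token of the evaluation text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def perplexity_from_loss(cross_entropy_loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(cross_entropy_loss)


# Toy example (hypothetical numbers): a model that assigns
# probability 1/4 to every token has a perplexity of about 4.
toy_log_probs = [math.log(0.25)] * 3
print(perplexity(toy_log_probs))

# Going the other way: the loss that corresponds to a given perplexity.
print(math.log(46.81))
```

The same total log-likelihood spread over more (shorter) tokens yields a lower per-token perplexity, which is one reason the two figures above should be read side by side rather than ranked.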

Downloads

Pretrained language models that you can use for transfer learning in your own classification tasks
EXCLUSIVE: BBC Hindi data of 4335 documents for text classification and text summarization. Release Notes
Raw data for the language model shared above: Hindi Wikipedia, with about 21k unique tokens at a minimum token frequency (minfreq) of 50
- Wikipedia Processed Data - please use this to train your model
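To illustrate what the minfreq = 50 cutoff means, here is a minimal sketch of building a vocabulary that keeps only tokens seen at least `min_freq` times. It is not the code used in the notebooks; the function name and the special tokens `_unk_` and `_pad_` are assumptions made for this example.

```python
from collections import Counter


def build_vocab(tokenized_docs, min_freq=50, specials=("_unk_", "_pad_")):
    """Return (itos, stoi) keeping tokens that occur at least `min_freq` times.

    `tokenized_docs` is an iterable of token lists. The special tokens are
    hypothetical placeholders for unknown words and padding.
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # most_common() orders tokens from most to least frequent
    kept = [tok for tok, count in counts.most_common() if count >= min_freq]
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


# Toy usage with a low threshold so the effect is visible
docs = [["यह", "एक", "यह"], ["यह", "वाक्य"]]
itos, stoi = build_vocab(docs, min_freq=2)
print(itos)
```

Tokens below the threshold would map to the unknown token at lookup time, for example with `stoi.get(tok, stoi["_unk_"])`.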

TODO

Language modeling based on the Wikipedia dump
Release Language Models: Hindi Language Model
Create Text classification Datasets: BBC Hindi
Benchmark text classification with FastText
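As a starting point for the FastText benchmark, here is a minimal sketch that writes labeled documents in the plain-text format fastText's supervised mode reads, one example per line with a `__label__` prefix. The helper names and the sample labels are assumptions for illustration; the actual label set of the BBC Hindi data is not shown here.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    clean_label = "_".join(label.split())  # labels must not contain spaces
    clean_text = " ".join(text.split())    # collapse newlines and extra spaces
    return f"__label__{clean_label} {clean_text}"


def write_fasttext_file(examples, path):
    """Write (label, text) pairs to `path`, one example per line, in UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        for label, text in examples:
            handle.write(to_fasttext_line(label, text) + "\n")


# Hypothetical example
print(to_fasttext_line("india", "यह एक\nउदाहरण है"))
```

A file written this way can then be passed to the official Python bindings, for example `fasttext.train_supervised(input=path)`, assuming the `fasttext` package is installed.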

Idea Dump

Change the custom head to perform transliteration instead of classification, from Hindi script (Devanagari) to English script (Roman)
Multi-task learning (MTL) tasks for training and inference using custom heads
Text to Speech - using datasets from news recordings or Hindi subtitles of dubbed movies

FastAI Installation

This version of the notebook uses v0.7 of the fastai library, as used in their Part 2 v2 course in Summer 2018. The best way to install it is via conda, as mentioned here.

Special thanks to Jeremy, Rachel and other contributors to fastai. This work reproduces for Hindi what they did for English. Thanks to @cstorm125 for thai2vec, which inspired this work.
