
PictoBERT: Transformers for Next Pictogram Prediction

Original code implementation of the paper "PictoBERT: Transformers for Next Pictogram Prediction".

Pictogram is the term used by the Augmentative and Alternative Communication (AAC) community for a labeled image that represents a place, person, action, object, or animal. AAC systems such as the one shown below allow message construction and communication by arranging pictograms in sequence.

[Image: example of an AAC system interface]

Pictogram prediction is an important task for AAC systems because it can facilitate communication. Previous works used n-gram statistical models or knowledge bases to accomplish this task. Our proposal is an adaptation of the BERT (Bidirectional Encoder Representations from Transformers) model to perform pictogram prediction. We changed the BERT vocabulary and input embeddings to allow the use of word-senses, considering that a word-sense better represents a pictogram. We call our version PictoBERT.

[Image: PictoBERT usage flow]

We trained the model using the CHILDES (Child Language Data Exchange System) corpora as a dataset. We annotated the North American English portion of CHILDES with word-senses using supWSD. PictoBERT's performance was compared to that of n-gram models and achieved good results, as shown in the table below.

[Image: comparison table of PictoBERT and n-gram model results]

PictoBERT is capable of predicting pictograms in different contexts. Its main characteristic is its transfer-learning ability, which allows other models focused on users' specific needs to be trained from it.

[Image: examples of PictoBERT predictions]
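To make the prediction flow above concrete, here is a minimal sketch of next-pictogram prediction with a masked language model, assuming a HuggingFace-style checkpoint and tokenizer; the paths and sense keys are placeholders, not the released artifacts.

```python
# Minimal sketch of next-pictogram prediction with a masked language model.
# Checkpoint path, tokenizer settings, and sense keys are placeholders.
import torch
from transformers import BertForMaskedLM, PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("path/to/pictobert")  # hypothetical
model = BertForMaskedLM.from_pretrained("path/to/pictobert").eval()       # hypothetical

# Append [MASK] to the word-sense sequence built so far and rank the vocabulary.
context = "i%... want%... [MASK]"  # sense keys abbreviated; real keys look like dog%1:05:00::
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top5))  # candidate next pictograms
```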

Software requirements

Execution

You can run the PictoBERT scripts using Google Colab or clone the repository to your machine and open the notebooks locally:

git clone https://github.com/jayralencar/pictoBERT.git

We present each notebook below along with its relationship to the paper's content. You may execute the notebooks in the sequence given below; however, downloadable versions of the resources produced at each step are also available.

1. PictoBERT

In the paper, we present PictoBERT construction (Section 4.1) in three steps: corpus construction, BERT adaptation and pretraining.

1.2 Dataset Creation

The dataset creation is described in Section 4.1.1 of the paper and consists of downloading and annotating the North American English part of the CHILDES dataset.

SemCHILDES.ipynb Run in Google Colab View source on GitHub NA-EN SemCHILDES
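The annotation itself is done with supWSD; purely to illustrate the target format, the snippet below uses NLTK's WordNet interface to show what a word-sense key looks like (this is not part of the SemCHILDES pipeline).

```python
# Illustration of WordNet sense keys, the kind of label a WSD tool such as
# supWSD attaches to each token. Requires nltk and nltk.download("wordnet").
from nltk.corpus import wordnet as wn

lemma = wn.synsets("dog")[0].lemmas()[0]
print(lemma.key())                   # e.g. dog%1:05:00::
print(lemma.synset().definition())   # the sense's gloss (definition)
```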



We also annotated the British English part of CHILDES with semantic roles, which we use for fine-tuning PictoBERT to perform pictogram prediction based on a grammatical structure.

Create_SRL_semCHILDES.ipynb Run in Google Colab View source on GitHub UK-EN SemCHILDES



1.3 Updating BERT Vocabulary and Embeddings Layer

To update the BERT vocabulary and embeddings layer, as described in Section 4.1.2 of the paper, we first trained a word-level tokenizer and prepared the dataset for the subsequent training.

Train_Tokenizer_and_Prepare_Dataset.ipynb Run in Google Colab View source on GitHub PictoBERT Tokenizer
Train dataset
Test dataset
Val dataset
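For reference, here is a minimal sketch of the tokenizer-training step using the HuggingFace tokenizers library; the special tokens and corpus file name are assumptions, and the released tokenizer is linked above.

```python
# Sketch: train a word-level tokenizer over whitespace-separated word-sense tokens.
# The corpus file name and special-token set are assumptions.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers.trainers import WordLevelTrainer

tokenizer = Tokenizer(WordLevel(unk_token="[UNK]"))
tokenizer.pre_tokenizer = WhitespaceSplit()  # each sense key is one token

trainer = WordLevelTrainer(special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
tokenizer.train(files=["semchildes_train.txt"], trainer=trainer)  # hypothetical file
tokenizer.save("pictobert-tokenizer.json")
```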



Then, we created the models by changing the BERT embeddings and vocabulary:

Create_Models.ipynb Run in Google Colab View source on GitHub PictoBERT contextualized
PictoBERT gloss-based
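The mechanics of that step can be sketched as follows: load a pretrained BERT, resize its token embeddings to the word-sense vocabulary, and overwrite each row with an initialization vector (the two variants linked above differ in how those vectors are computed). The vocabulary size, token id, and paths below are placeholders.

```python
# Sketch of swapping BERT's vocabulary and input embeddings for word-senses.
# Vocabulary size, token id, and the initialization vector are placeholders.
import torch
from transformers import BertForMaskedLM

model = BertForMaskedLM.from_pretrained("bert-base-uncased")

new_vocab_size = 12000                         # hypothetical word-sense vocabulary size
model.resize_token_embeddings(new_vocab_size)  # resize the (tied) embedding matrices

# Each row of the new matrix corresponds to one word-sense and is overwritten
# with a precomputed vector (e.g. derived from contextual embeddings or glosses).
sense_id = 42                                  # hypothetical id from the new tokenizer
init_vector = torch.randn(model.config.hidden_size)  # stand-in for the real vector
with torch.no_grad():
    model.get_input_embeddings().weight[sense_id] = init_vector

model.save_pretrained("pictobert-init")        # hypothetical output directory
```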



1.4 Pre-Training PictoBERT

As described in Section 4.1.3 of the paper, we split SemCHILDES 98/1/1 into training, validation, and test sets. We used a batch size of 128 sequences of 32 tokens each. Each data batch was collated to choose 15% of the tokens for prediction. We used a learning rate of $1 \times 10^{-4}$, with $\beta_1 = 0.9$, $\beta_2 = 0.999$, L2 weight decay of 0.01, and linear decay of the learning rate. PictoBERT was trained on a single 16GB NVIDIA Tesla V100 GPU for 500 epochs for each version.

Training_PictoBERT.ipynb Run in Google Colab View source on GitHub PictoBERT contextualized
PictoBERT gloss-based
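A rough sketch of that setup with the HuggingFace Trainer is shown below; the notebook may organize the training loop differently, and the file names, checkpoint names, and special tokens are assumptions.

```python
# Sketch of the pre-training setup: 15% masking, AdamW (lr=1e-4, betas 0.9/0.999,
# weight decay 0.01), linear LR decay. File and checkpoint names are hypothetical.
from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    PreTrainedTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = PreTrainedTokenizerFast(
    tokenizer_file="pictobert-tokenizer.json",  # from the tokenizer step above
    unk_token="[UNK]", pad_token="[PAD]", cls_token="[CLS]",
    sep_token="[SEP]", mask_token="[MASK]",
)
model = BertForMaskedLM.from_pretrained("pictobert-init")  # adapted model from step 1.3

raw = load_dataset("text", data_files={"train": "train.txt", "validation": "val.txt"})
dataset = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=32),
    batched=True, remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="pictobert-pretrained",
    per_device_train_batch_size=128,   # 128 sequences of 32 tokens
    learning_rate=1e-4,
    adam_beta1=0.9, adam_beta2=0.999,
    weight_decay=0.01,
    lr_scheduler_type="linear",        # linear decay of the learning rate
    num_train_epochs=500,
)

Trainer(
    model=model, args=args, data_collator=collator,
    train_dataset=dataset["train"], eval_dataset=dataset["validation"],
).train()
```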



1.5 Training n-gram models

As mentioned in the paper (Section 5.1), we compare PictoBERT's performance with that of n-gram models. Using the notebook below, we trained n-gram models with orders varying from 2 to 7.

N-gram models.ipynb Run in Google Colab View source on GitHub N-gram models
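As an illustration of this baseline, the sketch below fits models of orders 2 to 7 with NLTK; the smoothing choice and corpus file name are assumptions and may differ from the notebook.

```python
# Sketch of n-gram baselines (orders 2..7) over the sense-annotated corpus.
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

with open("semchildes_train.txt") as f:  # hypothetical corpus file
    sentences = [line.split() for line in f if line.strip()]

models = {}
for order in range(2, 8):
    train_data, vocab = padded_everygram_pipeline(order, sentences)
    lm = KneserNeyInterpolated(order)
    lm.fit(train_data, vocab)
    models[order] = lm

# P(next sense | previous sense), using tokens taken from the corpus itself
first = next(s for s in sentences if len(s) >= 2)
print(models[2].score(first[1], [first[0]]))
```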



2. Fine-tuning PictoBERT

As described in Section 5.2 of PictoBERT's paper, we fine-tuned two versions of the model: one for pictogram prediction based on a grammatical structure and the other for making predictions based on the ARASAAC vocabulary.

2.1. Pictogram Prediction Based on a Grammatical Structure

This section refers to Section 5.2.1 of the PictoBERT paper.

For fine-tuning the model, we used the UK-EN SemCHILDES presented in Section 1.2 of this document as a basis.

All the procedures for fine-tuning are described in the following notebook:

Fine_tuning_PictoBERT_(colourful_semantics).ipynb Run in Google Colab View source on GitHub Fine-tuned PictoBERT (contextualized) Fine-tuned PictoBERT (gloss-based) Tokenizer



In addition, we replicated the method proposed by Pereira et al. (2020) for constructing semantic grammars to compare with PictoBERT. Semantic grammars are generally represented using OWL ontologies; we opted to represent them using relational databases to enable faster queries.

Semantic_Grammar.ipynb Run in Google Colab View source on GitHub Semantic Grammars (db versions)
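To illustrate the relational representation, here is a hypothetical SQLite sketch; the table and column names are illustrative and the actual schema in the notebook above may differ.

```python
# Hypothetical sketch: a semantic grammar stored as relational tables in SQLite,
# so that "which senses can fill role R after this sense?" is one indexed query.
import sqlite3

con = sqlite3.connect("semantic_grammar.db")
con.executescript("""
CREATE TABLE IF NOT EXISTS concept  (id INTEGER PRIMARY KEY, sense TEXT UNIQUE);
CREATE TABLE IF NOT EXISTS relation (
    head INTEGER REFERENCES concept(id),
    role TEXT,                                   -- e.g. agent, action, object
    tail INTEGER REFERENCES concept(id)
);
CREATE INDEX IF NOT EXISTS idx_relation ON relation(head, role);
""")

def successors(sense: str, role: str):
    """Senses that can follow `sense` in the given role."""
    rows = con.execute(
        """SELECT c2.sense
           FROM relation r
           JOIN concept c1 ON r.head = c1.id
           JOIN concept c2 ON r.tail = c2.id
           WHERE c1.sense = ? AND r.role = ?""",
        (sense, role),
    )
    return [s for (s,) in rows]
```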



2.2 Using ARASAAC vocabulary

This section refers to Section 5.2.2 of the PictoBERT paper.

The notebook presents:

  1. The procedure for mapping ARASAAC pictograms to WordNet word-senses
  2. The procedure for filtering SemCHILDES to keep only sentences in which all tokens are also in the vocabulary generated in step 1 (a sketch of this step follows the resource list below)
  3. Training the tokenizer
  4. Training the models

ARASAAC_fine_tuned_PictoBERT.ipynb Run in Google Colab View source on GitHub Pictogram to word-sense mappings
Reduced SemCHILDES (corpus)
Tokenizer
ARASAAC PictoBERT (contextualized)
ARASAAC PictoBERT (gloss-based)
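A minimal sketch of the filtering in step 2: keep only sentences whose tokens all have an ARASAAC mapping. The file names and the mapping file format are assumptions.

```python
# Sketch of step 2: filter SemCHILDES down to sentences fully covered by the
# ARASAAC-mapped vocabulary. File names and the mapping format are hypothetical.
import json

with open("arasaac_sense_mapping.json") as f:  # hypothetical: sense -> pictogram id
    arasaac_vocab = set(json.load(f))

kept = []
with open("semchildes_train.txt") as f:        # one sense-annotated sentence per line
    for line in f:
        tokens = line.split()
        if tokens and all(tok in arasaac_vocab for tok in tokens):
            kept.append(line.rstrip("\n"))

with open("semchildes_arasaac.txt", "w") as f:
    f.write("\n".join(kept))
```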



We also trained n-gram models to compare with the fine-tuned models.

N-gram models.ipynb Run in Google Colab View source on GitHub N-gram models



Evaluation

The evaluation scripts, as well as the results, are in the following notebook.

evaluation_codeocean.ipynb Run in Google Colab View source on GitHub



Cite

@article{PEREIRA2022117231,
	title = {Picto{BERT}: Transformers for next pictogram prediction},
	journal = {Expert Systems with Applications},
	volume = {202},
	pages = {117231},
	year = {2022},
	issn = {0957-4174},
	doi = {10.1016/j.eswa.2022.117231},
	url = {https://www.sciencedirect.com/science/article/pii/S095741742200611X},
	author = {Jayr Alencar Pereira and David Macêdo and Cleber Zanchettin and Adriano Lorena Inácio {de Oliveira} and Robson do Nascimento Fidalgo},
	keywords = {Augmentative and alternative communication, Language modeling, Pictogram prediction},
}
