
GMB_corpus_ner

This is the repository for the Kaggle notebook https://www.kaggle.com/shoumikgoswami/ner-using-random-forest-and-crf. It uses an ensemble model that combines XGBoost, CRF, random forest, and BiLSTM+Attn+CRF, as well as a model based on BERT. Since my laptop is only a MacBook and I have to use an old computer at home for CUDA, I do not have the full BERT results; the code works fine, but training is very slow, so it takes a long time to see a proper result. In the BiLSTM, I use GloVe vectors and concatenate each one with a 50-dim vector that describes the word's syntax label, so the word embedding fed to the BiLSTM is actually 150-dim.
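For illustration, here is a minimal sketch (not the repo's exact code) of how a pretrained 100-dim GloVe embedding can be concatenated with a trainable 50-dim syntax-label embedding to produce the 150-dim BiLSTM input described above; the class name and arguments are hypothetical:

```python
import torch
import torch.nn as nn

class WordSyntaxEmbedding(nn.Module):
    """Hypothetical module: 100-dim GloVe + 50-dim syntax embedding = 150-dim input."""
    def __init__(self, glove_weights, num_syntax_labels, syntax_dim=50):
        super().__init__()
        # glove_weights: (vocab_size, 100) tensor of pretrained GloVe vectors
        self.word_emb = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.syntax_emb = nn.Embedding(num_syntax_labels, syntax_dim)

    def forward(self, word_ids, syntax_ids):
        # (batch, seq, 100) concatenated with (batch, seq, 50) -> (batch, seq, 150)
        return torch.cat([self.word_emb(word_ids), self.syntax_emb(syntax_ids)], dim=-1)
```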


Running Procedure:
1. Run make_data.py, where you can adjust the sizes of the training, validation, and test datasets; in the experiment I use 2349 training sentences, 150 validation sentences, and 350 test sentences.
2. Run each model's training script (ner_Xgboost.py, Rf.py, CRF.py, Copy_Attn.py); each will save its model automatically.
3. Run vote_classifer.py to get the hard-voting result on the test dataset (a minimal sketch of the voting logic follows this list).
4. To balance the dataset, I pass the 'balanced' class-weight parameter to the sklearn models, and I also give a small value to the O-labeled word embeddings.
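As a rough illustration of step 3, hard voting over per-token predictions can look like the sketch below; the function and variable names are assumptions, since the actual logic lives in vote_classifer.py:

```python
from collections import Counter

def hard_vote(preds_per_model):
    """Majority vote over aligned per-token label sequences, one list per model."""
    voted = []
    for token_labels in zip(*preds_per_model):
        # pick the label chosen by the most models; ties break by first occurrence
        voted.append(Counter(token_labels).most_common(1)[0][0])
    return voted

# e.g. hard_vote([xgb_preds, crf_preds, rf_preds, bilstm_preds])
```

For step 4, the 'balanced' option corresponds to constructing the sklearn estimators with class_weight='balanced', which reweights classes inversely to their frequency.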
Result:
On the test set, the results are as follows:

| Labels | precision | recall | f1-score | support |
|--------|-----------|--------|----------|---------|
| B-art | 0.500 | 0.100 | 0.167 | 10 |
| B-eve | 0.750 | 0.273 | 0.400 | 11 |
| B-geo | 0.726 | 0.869 | 0.791 | 335 |
| B-gpe | 0.877 | 0.758 | 0.813 | 198 |
| B-nat | 0.500 | 0.500 | 0.500 | 2 |
| B-org | 0.733 | 0.664 | 0.697 | 211 |
| B-per | 0.749 | 0.753 | 0.751 | 182 |
| B-tim | 0.916 | 0.840 | 0.877 | 169 |
| I-art | 0.000 | 0.000 | 0.000 | 6 |
| I-eve | 1.000 | 0.111 | 0.200 | 9 |
| I-geo | 0.753 | 0.753 | 0.743 | 75 |
| I-gpe | 0.000 | 0.000 | 0.000 | 6 |
| I-nat | 1.000 | 1.000 | 1.000 | 1 |
| I-org | 0.742 | 0.685 | 0.712 | 168 |
| I-per | 0.813 | 0.880 | 0.845 | 217 |
| I-tim | 0.750 | 0.488 | 0.592 | 43 |
| O | 0.990 | 0.994 | 0.992 | 9591 |
| acc | 0.958 | 0.957 | 0.960 | 11234 |
| macro avg | 0.694 | 0.567 | 0.593 | 11234 |
| weighted avg | 0.959 | 0.960 | 0.959 | 11234 |

We see that the classifier:
1. reaches a weighted accuracy of 0.959 and an unweighted accuracy of 0.958;
2. reaches an F1 score of 0.96;
3. performs poorly on the severely unbalanced labels: classes like I-art or I-nat occur only about 20~40 times in a dataset of more than 60,000 words, so even though I tried to balance the dataset with upsampling and downsampling, the results did not change much; due to the lack of data, none of the models perform well on these extremely rare labels.
Since I still have to finish my graduate paper and complete at least one implementation, I did not go deeper into solving this problem, but I already have several ideas, such as adapting focal loss from computer vision to this NER problem (focal loss is designed for unbalanced classification; see the sketch below). Additionally, with syntax-tree analysis we could give a different score to each syntax label, which could help us better recognize named entities. Using a knowledge base like YAGO as a resource would also improve the results, and pretrained models like BERT and XLNet could improve them further.
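As a sketch of the focal-loss idea mentioned above (Lin et al.'s loss from dense object detection, adapted here to token classification), the following down-weights easy, frequent labels such as O so that rare labels contribute more to the loss; gamma is an assumed hyperparameter, not a value from this repo:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    # logits: (num_tokens, num_labels); targets: (num_tokens,) gold label ids
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets, reduction="none")  # per-token cross-entropy
    pt = torch.exp(-ce)  # model's probability for the true label
    # (1 - pt)^gamma shrinks the loss on well-classified (easy) tokens
    return ((1.0 - pt) ** gamma * ce).mean()
```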


Script and File Descriptions:
1. models_func.py contains all the models and functions used in the NER process.
2. make_data.py splits the data according to the sizes you give.
3. CRF, ner_Xgboost, Copy_Attn, Rf, and Bert_BiLSTM_CRF are the different scripts you can run.
4. event_tensors is the GloVe vector package; to use it, you also need to download a GloVe vector file (I use the 100d GloVe vectors). You can also use random word vectors, but judging by the BiLSTM's accuracy, using GloVe vectors raises accuracy by several points.
5. GMB_dataset.txt is the dataset used in this task.
6. non_O_BiLSTM_CRF_constrained.py is a BiLSTM that does not use O-labeled words; perhaps due to the lack of context, its performance is bad, so I did not use it as a model.
7. crf1, crf2, xgb, rf, and attn_bilstm_crf are saved model files, which you can load with sklearn's joblib or torch.load (see the loading sketch after this list).
8. syntax_embeds is the embedding trained to describe the syntax labels, which is concatenated with the word vectors.
9. train data, val data, and test data are the splits produced by make_data.py.
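For item 7, loading the saved models might look like the sketch below (file names follow item 7; note that torch.load on a pickled model needs the defining classes from models_func.py to be importable):

```python
import joblib
import torch

crf = joblib.load("crf1")   # sklearn-style models saved with joblib
rf = joblib.load("rf")
xgb = joblib.load("xgb")
# full pickled PyTorch model; requires the model classes to be on the import path
bilstm = torch.load("attn_bilstm_crf", map_location="cpu")
```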


Due to their size, I cannot upload the GloVe vectors to GitHub; you can download them at https://nlp.stanford.edu/projects/glove/ (a loading sketch follows below).
I am still training a deeper BiLSTM model; since I did not tune any hyperparameters (only a 1-layer BiLSTM with 256 hidden dims), there is still room to improve the results by improving the BiLSTM.
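For reference, a plain-text GloVe file (e.g. glove.6B.100d.txt from the Stanford page; the exact file name is an assumption) can be read into a word-to-vector dict like this:

```python
import numpy as np

def load_glove(path):
    """Read a GloVe .txt file into a {word: np.ndarray} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

# glove = load_glove("glove.6B.100d.txt")  # hypothetical path
```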
