Non-NER tags are missing one letter #75

Open
versae opened this issue Apr 15, 2021 · 7 comments

@versae

versae commented Apr 15, 2021

How to reproduce the behaviour

If I run the following code with POS tags:

from seqeval.metrics import classification_report

y_pred = [['ADJ', 'CONJ', 'VERB', 'AUX', 'NOUN', 'ADJ', 'SCONJ'], ['CONJ', 'SCONJ', 'X']]
y_true = [['ADJ', 'DET', 'VERB', 'AUX', 'NOUN', 'ADJ', 'SCONJ'], ['CONJ', 'ART', 'X']]
print(classification_report(y_true, y_pred))

What I get is:

              precision    recall  f1-score   support

        CONJ       0.50      1.00      0.67         1
          DJ       1.00      1.00      1.00         2
         ERB       1.00      1.00      1.00         1
          ET       0.00      0.00      0.00         1
         ONJ       0.50      1.00      0.67         1
         OUN       1.00      1.00      1.00         1
          RT       0.00      0.00      0.00         1
          UX       1.00      1.00      1.00         1

   micro avg       0.78      0.78      0.78         9
   macro avg       0.62      0.75      0.67         9
weighted avg       0.67      0.78      0.70         9

Every tag is missing its first letter. If I pass suffix=True, the tags lose their last letter instead:

              precision    recall  f1-score   support

          AD       1.00      1.00      1.00         2
          AR       0.00      0.00      0.00         1
          AU       1.00      1.00      1.00         1
         CON       0.50      1.00      0.67         1
          DE       0.00      0.00      0.00         1
         NOU       1.00      1.00      1.00         1
        SCON       0.50      1.00      0.67         1
         VER       1.00      1.00      1.00         1

   micro avg       0.78      0.78      0.78         9
   macro avg       0.62      0.75      0.67         9
weighted avg       0.67      0.78      0.70         9

Moreover, one-letter tags such as X are left out of the report entirely.

Your Environment

  • Operating System: Ubuntu 20.10
  • Python Version: Python 3.8.6
  • Package Version: seqeval==1.2.2
@IssamAssafi

Can confirm I have the same issue.

@mirfan899

Same issue here. It does not work with POS tags.

@DuyguA

DuyguA commented Jun 17, 2021

I have the same issue here!

@liaeh

liaeh commented Jun 24, 2021

same issue!!

@liaeh

liaeh commented Aug 16, 2021

This problem only occurs if your labels lack IOB-style prefixes, e.g. ENTITY instead of B-ENTITY, I-ENTITY...
I think it is caused by line 189, which removes the first character of the tag name because it assumes the tag has a prefix.
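
To make the symptom concrete, here is a minimal sketch of the kind of stripping described above. It is not the library's actual code: `strip_assumed_marker` is a hypothetical helper that only mimics the behaviour seen in the reports (first character dropped by default, last character dropped with suffix=True).

```python
def strip_assumed_marker(tag: str, suffix: bool = False) -> str:
    """Hypothetical illustration, NOT seqeval's implementation.

    Assumes every tag carries a one-character IOB marker and removes it,
    together with the hyphen next to it if there is one.
    """
    if suffix:
        body = tag[:-1]  # drop the assumed trailing marker, e.g. "PER-B" -> "PER-"
        return body[:-1] if body.endswith("-") else body
    body = tag[1:]  # drop the assumed leading marker, e.g. "B-PER" -> "-PER"
    return body[1:] if body.startswith("-") else body


# IOB-style tags survive intact, plain POS tags lose a letter.
print(strip_assumed_marker("B-PER"))             # PER
print(strip_assumed_marker("ADJ"))               # DJ
print(strip_assumed_marker("ADJ", suffix=True))  # AD
print(repr(strip_assumed_marker("X")))           # ''
```

A one-letter tag ends up with an empty name under this assumption, which would be consistent with X not showing up in the report.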

@versae
Author

versae commented Aug 30, 2021

Thanks for finding the key line, @liaeh! As I see it then, we only have two options:

  1. We re-label our datasets if not IOB-style to start each label with B.
  2. We add an option to the library to not remove the first character if not IOB-style.

@liaeh
Copy link

liaeh commented Sep 2, 2021

> Thanks for finding the key line, @liaeh! As I see it then, we only have two options:
>
> 1. We re-label our datasets if not IOB-style to start each label with `B`.
> 2. We add an option to the library to not remove the first character if not IOB-style.

Option 2 would make most sense! I've been using option 1 as a workaround though :)
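
For anyone who needs the workaround, a minimal sketch of option 1 could look like this. `add_b_prefix` is a hypothetical helper name, not part of seqeval, and it assumes none of the labels already carries an IOB prefix.

```python
from typing import List


def add_b_prefix(sequences: List[List[str]]) -> List[List[str]]:
    """Prefix every label with "B-" so each token counts as its own one-token chunk.

    Assumption: the input labels are plain tags (e.g. POS tags) without IOB prefixes.
    """
    return [[f"B-{tag}" for tag in sequence] for sequence in sequences]


y_true = [['ADJ', 'DET', 'VERB', 'AUX', 'NOUN', 'ADJ', 'SCONJ'], ['CONJ', 'ART', 'X']]
y_pred = [['ADJ', 'CONJ', 'VERB', 'AUX', 'NOUN', 'ADJ', 'SCONJ'], ['CONJ', 'SCONJ', 'X']]

y_true_iob = add_b_prefix(y_true)
y_pred_iob = add_b_prefix(y_pred)
print(y_true_iob[1])  # ['B-CONJ', 'B-ART', 'B-X']

# The relabelled lists can then be passed to seqeval as usual:
# from seqeval.metrics import classification_report
# print(classification_report(y_true_iob, y_pred_iob))
```

With the prefix in place, the first character that gets removed should be the `B` marker rather than part of the tag name, so the report should list the full labels.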
