[PAD_TOKEN] is not used, but just adding 0 #97

Open
goonbamm opened this issue Jan 27, 2023 · 0 comments

goonbamm commented Jan 27, 2023

Thanks to your code, I am growing every day. Thank you very much.

In every dataloader, the special tokens are initialized as follows:

self.SPECIAL_TOKEN = {"CLS_TOKEN": "<|startoftext|>", "SEP_TOKEN": "<|endoftext|>",
                      "MASK_TOKEN": "[MASK]", "UNK_TOKEN": "[UNK]", "PAD_TOKEN": "[PAD]"}

However, I found that [MASK], [UNK], and [PAD] are not actually used anywhere in the code. The problem arises because padding simply appends 0 as the pad token, like below:

while len(input_ids) < self.max_words:
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)

In the vocab, there is no id assigned to [PAD]; token id 0 is paired with '!'.

vocab = {'!': 0, '"': 1, '#': 2, '$': 3, '%': 4, '&': 5, ... }

If a caption contains '!' and is shorter than max_words, the embeddings of the '!' token and the pad token will be exactly the same, because tokens are embedded with nn.Embedding:

self.vocab_size = vocab_size
self.token_embedding = nn.Embedding(vocab_size, transformer_width)
self.positional_embedding = nn.Parameter(torch.empty(self.context_length, transformer_width))
self.ln_final = LayerNorm(transformer_width)
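
To make the collision concrete, here is a minimal standalone sketch (hypothetical vocab_size and embedding dim, not the repo's actual model) showing that nn.Embedding returns the identical vector for a padded position and a real '!' token when both carry id 0:

import torch
import torch.nn as nn

torch.manual_seed(0)

# hypothetical sizes, for illustration only
token_embedding = nn.Embedding(num_embeddings=49408, embedding_dim=512)

exclamation_id = 0  # '!' is id 0 in the vocab shown above
pad_id = 0          # padding also appends 0

same = torch.equal(token_embedding(torch.tensor(exclamation_id)),
                   token_embedding(torch.tensor(pad_id)))
print(same)  # True: the embedding layer cannot tell a pad position from a real '!'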

Example

caption1 = 'The boy is crying ! ! ! [PAD] [PAD]'
caption2 = 'The boy is crying [PAD] [PAD] [PAD] [PAD] [PAD]'

I think there is no way to differentiate between the two captions.
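
As a toy illustration (the word ids other than '!' are made up; only '!' -> 0 matches the vocab above), both captions end up with exactly the same input_ids after padding:

max_words = 9

# hypothetical ids for 'The boy is crying'; only 0 ('!') is taken from the real vocab
caption1_ids = [320, 1929, 533, 7042, 0, 0, 0]  # 'The boy is crying ! ! !'
caption2_ids = [320, 1929, 533, 7042]           # 'The boy is crying'

def pad(ids, max_words):
    ids = list(ids)
    while len(ids) < max_words:
        ids.append(0)  # same padding scheme as the dataloader
    return ids

print(pad(caption1_ids, max_words) == pad(caption2_ids, max_words))  # True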
