
data preprocessing #23

Open
xjw-star opened this issue Sep 27, 2022 · 3 comments
Comments

@xjw-star

Hi Yixin, thank you for this fantastic work.
I am reproducing the BRIO model and would like to understand the difference between the `data` and `data.tokenized` files, since there seems to be no code in the repository that distinguishes them.

@xjw-star
Author

In fact, I want to preprocess the NYT dataset, but there is no off-the-shelf code for that.

@mrasyadc

mrasyadc commented May 2, 2023

Hi, it's been a year and I'd like to know: how did you end up differentiating them? Or what other solution did you use?

@mrasyadc

mrasyadc commented May 2, 2023

I'm sorry, I wasn't reading it carefully enough. I think "tokenized" means the files processed with the PTB tokenizer, right?

QUOTE --
We use the PTB tokenizer provided by Stanford CoreNLP (download here). Please note that tokenized texts are only used for evaluation. To tokenize a file, you may run (using test.source as an example)

export CLASSPATH=/your_path/stanford-corenlp-3.8.0.jar
cat test.source | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.source.tokenized

We have provided the example files in ./examples/raw_data.
--QUOTE
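Since the quoted command uses `-preserveLines`, the tokenized file should contain exactly one output line per input line, so `test.source` and `test.source.tokenized` stay line-aligned. Here is a minimal sanity-check sketch (not from the BRIO repo; the file names and contents below are made up for illustration) that verifies this invariant:

```python
import os
import tempfile


def check_alignment(raw_path, tokenized_path):
    """Return the shared line count, or raise if the two files are misaligned."""
    with open(raw_path, encoding="utf-8") as f:
        raw_lines = f.readlines()
    with open(tokenized_path, encoding="utf-8") as f:
        tok_lines = f.readlines()
    if len(raw_lines) != len(tok_lines):
        raise ValueError(
            f"line count mismatch: {len(raw_lines)} raw vs {len(tok_lines)} tokenized"
        )
    return len(raw_lines)


# Tiny demo with synthetic files standing in for test.source and
# test.source.tokenized (hypothetical contents, for illustration only).
tmpdir = tempfile.mkdtemp()
raw = os.path.join(tmpdir, "test.source")
tok = os.path.join(tmpdir, "test.source.tokenized")
with open(raw, "w", encoding="utf-8") as f:
    f.write("Mr. Smith's dog barked.\nIt rained.\n")
with open(tok, "w", encoding="utf-8") as f:
    f.write("Mr. Smith 's dog barked .\nIt rained .\n")

print(check_alignment(raw, tok))  # both files have 2 lines
```

A check like this can catch a tokenization run that silently dropped or split lines before you feed the `.tokenized` files into evaluation.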
