
data preprocessing #23

Open
xjw-star opened this issue Sep 27, 2022 · 3 comments
Comments

@xjw-star

Hi Yixin, thank you for this fantastic work.
I am reproducing the BRIO model and would like to understand the difference between the `data` and `data.tokenized` files, since there seems to be no code in the repository that distinguishes them.

@xjw-star
Author

In fact, I want to preprocess the NYT dataset, but there is no off-the-shelf code for that.

@mrasyadc

mrasyadc commented May 2, 2023

Hi, it's been a year and I'd like to know: how did you end up differentiating them? Or what other solution did you use?

@mrasyadc

mrasyadc commented May 2, 2023

I'm sorry, I wasn't reading it carefully enough. I think "tokenized" means the files processed with the PTB tokenizer, right?

QUOTE --
We use the PTB tokenizer provided by Stanford CoreNLP (download here). Please note that tokenized texts are only used for evaluation. To tokenize a file, you may run (using test.source as an example)

export CLASSPATH=/your_path/stanford-corenlp-3.8.0.jar
cat test.source | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > test.source.tokenized

We have provided the example files in ./examples/raw_data.
--QUOTE
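Since the quoted command uses `-preserveLines`, the tokenized file should contain exactly one output line per input line, so `test.source` and `test.source.tokenized` stay line-aligned. Here is a minimal sanity-check sketch (not from the BRIO repo; the file names and contents below are made up for illustration) that verifies this invariant:

```python
import os
import tempfile


def check_alignment(raw_path, tokenized_path):
    """Return the shared line count, or raise if the two files are misaligned."""
    with open(raw_path, encoding="utf-8") as f:
        raw_lines = f.readlines()
    with open(tokenized_path, encoding="utf-8") as f:
        tok_lines = f.readlines()
    if len(raw_lines) != len(tok_lines):
        raise ValueError(
            f"line count mismatch: {len(raw_lines)} raw vs {len(tok_lines)} tokenized"
        )
    return len(raw_lines)


# Tiny demo with synthetic files standing in for test.source and
# test.source.tokenized (hypothetical contents, for illustration only).
tmpdir = tempfile.mkdtemp()
raw = os.path.join(tmpdir, "test.source")
tok = os.path.join(tmpdir, "test.source.tokenized")
with open(raw, "w", encoding="utf-8") as f:
    f.write("Mr. Smith's dog barked.\nIt rained.\n")
with open(tok, "w", encoding="utf-8") as f:
    f.write("Mr. Smith 's dog barked .\nIt rained .\n")

print(check_alignment(raw, tok))  # both files have 2 lines
```

A check like this can catch a tokenization run that silently dropped or split lines before you feed the `.tokenized` files into evaluation.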
