process for adding new bilara root texts #2421

sujato · 2023-05-16T06:35:35Z

There is no standard way of adding texts to Bilara. In the past, we processed the whole Pali canon, and the process evolved over time. Obviously, so long as the data format is good, it doesn't matter how the data is created. But we should offer a reasonable story for adding new texts.

Most of the fundamentals of this have been built, so it is a matter of lining them all up and testing the whole pipeline.

preparing HTML files

There is a spec for creating HTML files. It is designed for Sanskrit but will work for anything.

https://github.com/suttacentral/bilara-data/wiki/Sanskrit-text-preparation

Here is an explainer for certain details:

https://github.com/suttacentral/bilara-data/wiki/Overlapping-(text-critical)-markup-in-Bilara

The basic idea is that we use punctuation to segment the text, then wrap it up in HTML as specified.
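As a rough illustration of the punctuation-based segmentation step, here is a minimal Python sketch. The boundary rule (split after `.`, `;`, `?`, `!`, or a danda, when followed by whitespace) and the function name `segment` are assumptions made for this example; the actual rules and the HTML wrapping are defined by the wiki spec linked above, not by this code.

```python
import re

# Hypothetical boundary rule: a segment ends at . ; ? ! or a danda (। / ॥)
# that is followed by whitespace. The real rules are in the wiki spec.
_BOUNDARY = re.compile(r"(?<=[.;?!।॥])\s+")


def segment(text: str) -> list[str]:
    """Split running text into segments, keeping the closing punctuation."""
    return [part for part in _BOUNDARY.split(text.strip()) if part]


if __name__ == "__main__":
    sample = "Evaṁ me sutaṁ. Ekaṁ samayaṁ bhagavā; sāvatthiyaṁ viharati."
    for number, part in enumerate(segment(sample), start=1):
        print(number, part)
```

Each resulting segment would then be wrapped in HTML by hand or by a script, following the spec.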

converting html to tsv

Next, we convert the HTML file to TSV. There's a script for this already, although it is not bug-free.

https://github.com/sc-voice/bilara-html-tsv

The basic point of this is to cleanly separate the data types, add segment number, and ready it for the next step.
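To make "separate the data types, add segment number" concrete, here is a simplified Python sketch. It is not the bilara-html-tsv script. It assumes, purely for illustration, that each `<h1>` or `<p>` element holds exactly one segment, that segment ids look like `uid:1.N`, and that the TSV columns are named `segment_id`, `html`, and `root`; all of these are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class SegmentExtractor(HTMLParser):
    """Collect (segment_id, html_template, text) rows from simple HTML."""

    BLOCKS = {"h1", "p"}  # assumption: one segment per block element

    def __init__(self, uid: str):
        super().__init__()
        self.uid = uid
        self.rows: list[tuple[str, str, str]] = []
        self._tag: str | None = None
        self._text: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCKS:
            self._tag = tag
            self._text = []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            segment_id = f"{self.uid}:1.{len(self.rows) + 1}"
            template = f"<{tag}>{{}}</{tag}>"  # markup with a {} slot for the text
            self.rows.append((segment_id, template, "".join(self._text).strip()))
            self._tag = None


def html_to_tsv(html: str, uid: str) -> str:
    """Return TSV text with the markup and the root text in separate columns."""
    parser = SegmentExtractor(uid)
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["segment_id", "html", "root"])
    writer.writerows(parser.rows)
    return out.getvalue()


if __name__ == "__main__":
    print(html_to_tsv("<h1>Title</h1><p>First segment.</p>", "demo1"))
```

Keeping the markup as a template with a `{}` placeholder means the text column contains only the root text, which is what makes the later conversion step straightforward.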

convert tsv to bilara-data via bilara i/o

We then use our Bilara i/o utility to convert the tsv files to bilara-data.

https://github.com/suttacentral/bilara-data/wiki/Bilara-io

It consumes a properly formed TSV file and exports it directly as JSON to the relevant bilara-data folders.
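The core of that conversion can be sketched in a few lines of Python. This is not Bilara i/o itself: the `segment_id` column name and the function `tsv_to_json` are assumptions for illustration, and the sketch only returns JSON strings per column rather than routing files into bilara-data folders.

```python
import csv
import io
import json


def tsv_to_json(tsv_text: str, id_column: str = "segment_id") -> dict[str, str]:
    """Turn each non-id TSV column into a JSON object keyed by segment id."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    columns = [name for name in (reader.fieldnames or []) if name != id_column]
    data: dict[str, dict[str, str]] = {name: {} for name in columns}
    for row in reader:
        for name in columns:
            if row[name]:  # skip empty or missing cells
                data[name][row[id_column]] = row[name]
    # ensure_ascii=False keeps diacritics readable in the output files
    return {
        name: json.dumps(segments, ensure_ascii=False, indent=2)
        for name, segments in data.items()
    }


if __name__ == "__main__":
    sample = "segment_id\thtml\troot\ndemo1:1.1\t<p>{}</p>\tEvaṁ me sutaṁ.\n"
    for column, text in tsv_to_json(sample).items():
        print(column)
        print(text)
```

One JSON object per column mirrors the idea that markup and root text end up in separate files, each keyed by the same segment ids.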

why tsv?

This is basically because it is what Bilara i/o was designed to use. There's no particular reason there needs to be an intermediate step here; we could go directly from HTML to JSON. One advantage of TSV, however, is in debugging. When things go wrong, we can inspect and edit the file in a spreadsheet, which is super handy for this sort of thing.
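The same intermediate file also lends itself to quick automated checks before conversion. The sketch below is only an example of the kind of check that is easy on TSV; the `segment_id` column name and the function `check_tsv` are assumptions, not part of any existing tool.

```python
import csv
import io
from collections import Counter


def check_tsv(tsv_text: str, id_column: str = "segment_id") -> list[str]:
    """Report duplicate or missing segment ids in a TSV file."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    ids = [(row.get(id_column) or "").strip() for row in rows]
    problems = []
    for segment_id, count in Counter(i for i in ids if i).items():
        if count > 1:
            problems.append(f"duplicate segment id: {segment_id} ({count} rows)")
    # Line 1 is the header, so data rows start at line 2
    # (assumes no cell contains an embedded newline).
    for line_no, segment_id in enumerate(ids, start=2):
        if not segment_id:
            problems.append(f"line {line_no}: missing segment id")
    return problems


if __name__ == "__main__":
    sample = "segment_id\troot\na:1.1\tx\na:1.1\ty\n\tz\n"
    for problem in check_tsv(sample):
        print(problem)
```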

pipeline

I suggest that we use a new dedicated repo, such as /suttacentral/bilara-data-preparation.

Use the same unpublished/published branches as on bilara-data.
The user adds texts in bilara-HTML to the unpublished branch.
The user makes a PR when they are ready.
When the PR is accepted, it triggers a GitHub Action (GA).
The GA runs bilara-html-tsv.js to convert the HTML to TSV, then Bilara i/o to convert the TSV to JSON for bilara-data, and adds the result to the relevant repo.

Note that bilara-html-tsv runs on Node (I think) and Bilara i/o is Python. Let's see how complex it would be to rewrite them to work together more nicely. Maybe do both in Go?


sujato added the enhancement, major change, and back end labels and removed the major change label on May 16, 2023

sujato changed the title from "process for adding new bilara texts" to "process for adding new bilara root texts" on Jun 7, 2023
