process for adding new bilara root texts #2421

Open
sujato opened this issue May 16, 2023 · 0 comments
Labels: back end, enhancement (New feature or request)

sujato commented May 16, 2023

There is no standard way of adding texts to Bilara. In the past, we processed the whole Pali canon, and the process evolved over time. So long as the data format is good, it doesn't matter how the data is created, but we should offer a reasonable story for adding new texts.

Most of the fundamentals of this have been built, so it is a matter of lining them all up and testing the whole pipeline.

preparing HTML files

There is a spec for creating HTML files. It is designed for Sanskrit but will work for anything.

https://github.com/suttacentral/bilara-data/wiki/Sanskrit-text-preparation

Here is an explainer for certain details:

https://github.com/suttacentral/bilara-data/wiki/Overlapping-(text-critical)-markup-in-Bilara

The basic idea is that we use punctuation to segment the text, then wrap it up in HTML as specified.
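As a rough illustration of segmenting by punctuation, here is a minimal Python sketch. The set of segment-ending characters, the function names, and the `<p>` wrapper are all assumptions made for the example; the actual markup is whatever the spec linked above requires.

```python
import re
from html import escape

# Assumption: a segment ends at one of these punctuation marks followed by
# whitespace. The danda characters are included for Sanskrit; adjust as needed.
SEGMENT_END = re.compile(r"(?<=[.?!;।॥|])\s+")

def segment(text: str) -> list[str]:
    """Split running text into segments, keeping punctuation with each segment."""
    return [part.strip() for part in SEGMENT_END.split(text) if part.strip()]

def wrap_html(segments: list[str]) -> str:
    """Wrap segments one per line in a placeholder <p> (hypothetical markup)."""
    body = "\n".join(escape(s) for s in segments)
    return f"<p>\n{body}\n</p>"

if __name__ == "__main__":
    print(wrap_html(segment("evaṁ me sutaṁ. ekaṁ samayaṁ bhagavā. kiṁ nu kho?")))
```

Keeping one segment per line means the later conversion step can treat line breaks as segment boundaries.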

converting html to tsv

Next we convert the HTML file to TSV. There's a script for this already, although it is not bug free.

https://github.com/sc-voice/bilara-html-tsv

The basic point of this is to cleanly separate the data types, add segment numbers, and get the data ready for the next step.
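To make "separate the data types, add segment numbers" concrete, here is a small Python sketch. It is not the bilara-html-tsv script. The column names (`segment_id`, `html`, and the text column passed in by the caller), the `{}` placeholder in the markup, and the simple `uid:1.N` numbering are assumptions for illustration only.

```python
import csv
import io

def to_tsv(uid: str, rows: list[tuple[str, str]], text_column: str) -> str:
    """Write (markup, text) pairs as TSV, numbering segments in document order.

    Assumed layout: one row per segment, with the text and the markup
    template in separate columns so each data type can be exported on its own.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["segment_id", text_column, "html"])
    for n, (markup, text) in enumerate(rows, start=1):
        # Hypothetical numbering: everything goes in section 1 of the text.
        writer.writerow([f"{uid}:1.{n}", text, markup])
    return out.getvalue()

if __name__ == "__main__":
    rows = [("<p>{}", "evaṁ me sutaṁ."), ("{}</p>", "ekaṁ samayaṁ bhagavā.")]
    print(to_tsv("xx1", rows, text_column="root-example"))
```

A real converter would derive the section and segment numbers from the structure of the HTML rather than a flat counter.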

convert tsv to bilara-data via bilara i/o

We then use our Bilara i/o utility to convert the tsv files to bilara-data.

https://github.com/suttacentral/bilara-data/wiki/Bilara-io

It consumes a properly formed tsv file and exports it directly as json to the relevant bilara-data folders.
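The general shape of that conversion can be sketched in Python as below. This is not Bilara i/o itself: it assumes a `segment_id` first column and one output JSON object per remaining column, keyed by segment id, and it returns strings rather than writing into bilara-data folders.

```python
import csv
import io
import json

def tsv_to_json(tsv_text: str) -> dict[str, str]:
    """Return {column_name: JSON text}, one JSON object per data column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    names = [n for n in (reader.fieldnames or []) if n != "segment_id"]
    columns: dict[str, dict[str, str]] = {name: {} for name in names}
    for row in reader:
        for name in names:
            if row[name]:  # skip empty or missing cells
                columns[name][row["segment_id"]] = row[name]
    return {
        name: json.dumps(segments, ensure_ascii=False, indent=2)
        for name, segments in columns.items()
    }

if __name__ == "__main__":
    tsv = "segment_id\troot-example\thtml\nxx1:1.1\ta.\t<p>{}\nxx1:1.2\tb.\t{}</p>\n"
    for name, text in tsv_to_json(tsv).items():
        print(name)
        print(text)
```

Using `ensure_ascii=False` keeps diacritics readable in the output instead of escaping them.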

why tsv?

This is basically because it is what bilara i/o was designed to use. There's no particular reason there needs to be an intermediate step here; we could go directly from HTML to JSON. One advantage of tsv, however, is in debugging. When things go wrong, we can inspect and edit the file in a spreadsheet, which is super handy for this sort of thing.
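The same debugging advantage applies to scripted checks. Here is a minimal Python sketch of a sanity check that could run before conversion; the `segment_id` header name and the particular checks are assumptions, not part of any existing tool.

```python
import csv
import io

def check_tsv(tsv_text: str) -> list[str]:
    """Return a list of human-readable problems found in a TSV file."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, None)
    if not header or header[0] != "segment_id":
        return ["row 1: first column should be segment_id"]
    problems = []
    seen = set()
    for row_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_no}: expected {len(header)} columns, found {len(row)}"
            )
            continue
        seg_id = row[0]
        if not seg_id:
            problems.append(f"row {row_no}: missing segment id")
        elif seg_id in seen:
            problems.append(f"row {row_no}: duplicate segment id {seg_id}")
        seen.add(seg_id)
    return problems

if __name__ == "__main__":
    bad = "segment_id\troot\nxx1:1.1\ta.\nxx1:1.1\tb.\nxx1:1.3\n"
    for problem in check_tsv(bad):
        print(problem)
```

Each message points at a row number, so the offending row is easy to find in a spreadsheet.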

pipeline

I suggest that we use a new dedicated repo, such as /suttacentral/bilara-data-preparation

  • Use the same unpublished/published branches as on bilara-data.
  • The user adds texts in bilara-HTML to the unpublished branch.
  • The user makes a PR when they are ready.
  • When the PR is accepted, it triggers a GitHub Action (GA).
  • The GA runs bilara-html-tsv.js to convert to tsv, then bilara i/o to convert to json for bilara-data, and adds the result to the relevant repo.

Note that bilara-html-tsv runs on Node (I think) and bilara i/o is Python. Let's see how complex it would be to rewrite them to work together more nicely. Maybe do both in Go?

@sujato added the enhancement (New feature or request), major change (requires major work on both front and back end), and back end labels, and removed the major change label, May 16, 2023
@sujato changed the title from "process for adding new bilara texts" to "process for adding new bilara root texts", Jun 7, 2023