Sanskrit text preparation

sujato edited this page Apr 15, 2022 · 15 revisions

Here is an outline of the steps for preparing a Sanskrit text for translation on Bilara.

define terms

Let's first define some terms:

  • properHTML is the HTML that ends up in the Bilara HTML files.
  • inlineHTML is any HTML included inside segments, such as <i>, <supplied>, etc. These tags end up inside the Bilara root files.
    • Lists of possible inlineHTML tags are found here and here. However, always use custom tags rather than classes.
  • fakeHTML is any HTML that is not included in the final output but merely structures content.
    • Use these custom tags: <segment>, <root>, <translation>, <reference>, <comment>, <variant>.

prepare HTML: responsibility of Sanskrit editor

Note: Keep the git repo clean. As a general rule, commit only the source files and the final product, nothing in between.

  1. Select a source text.
    • Let’s assume our text is the Candrasūtra.
  2. If the text is already on SC, identify it by its project and UID.
    • project = sf, UID = sf276
    • If it is not on SC, assign a project and UID.
  3. Add a folder named with the SC UID to the appropriate project in publication-sources.
  4. Copy the source file or files to the folder.
    • Keep the original file name: sa_candrasUtra.xml
    • Any kind of content can be added to this folder.
  5. Make an HTML file from a local copy of the text.
  6. Delete all front and end matter, including metadata etc.
  7. Ensure the HTML file is well-structured with appropriate heading and <p> tags. Occasionally other semantic tags such as lists might be used. Ensure each text is wrapped in <article id='uid'>, and each <h1> is wrapped in <header>.
  8. Niceties: add the following where appropriate.
    • wrap in span: <span class='evam'>evam mayā śrutam</span>
    • add class to paragraph for remarks at end of sutra, etc: <p class='end'>śarabha iti sūtraṃ</p>
    • likewise for verse of homage at start of sutra, etc: <p class='namo'>namo buddhāya</p>
  9. Make sure all HTML uses 'single quotes'.
  10. Any text-critical markup or plain-text marks must be replaced with inlineHTML.
    • Where the meaning of the markup is unclear, refer back to the original printed edition if possible; otherwise consult old SC versions.
  11. Create segments.
    • Typically, use punctuation as the basis, then refine it by an initial reading of the text. It is much more efficient to get the segmenting right now than fix it later!
  12. Wrap segments in <segment>.
    • All properHTML is outside <segment>.
  13. Make sure all content inside <segment> is wrapped in fakeHTML tags as <root>, <translation>, <reference>, <comment>, or <variant>.
  14. Follow instructions here for running HTML tidy and eliminating overlapping markup.
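Several of the steps above lend themselves to a quick automated check before moving on. The following is a minimal sketch, not part of any existing Bilara tooling: the function name and the regular expressions are assumptions made for illustration. It flags double-quoted attributes (step 9), segment content that is not wrapped in a fakeHTML tag (step 13), and fakeHTML tags that have strayed outside a <segment>, which we assume is never intended.

```python
import re

# fakeHTML tags allowed directly inside <segment> (from the terms above)
FAKE_TAGS = ("root", "translation", "reference", "comment", "variant")

SEGMENT_RE = re.compile(r"<segment>(.*?)</segment>", re.DOTALL)
FAKE_RE = re.compile(r"<(%s)>.*?</\1>" % "|".join(FAKE_TAGS), re.DOTALL)
DOUBLE_QUOTED_ATTR_RE = re.compile(r'<[^<>]*="[^<>]*>')


def check_prepared_html(html):
    """Return a list of human-readable problems found in prepared HTML."""
    problems = []
    # Step 9: attributes should use single quotes.
    for match in DOUBLE_QUOTED_ATTR_RE.finditer(html):
        problems.append("double-quoted attribute: " + match.group(0))
    # Step 13: nothing inside <segment> may sit outside a fakeHTML tag.
    for match in SEGMENT_RE.finditer(html):
        leftover = FAKE_RE.sub("", match.group(1)).strip()
        if leftover:
            problems.append("unwrapped segment content: " + leftover)
    # Assumption: fakeHTML tags only make sense inside <segment>.
    outside = SEGMENT_RE.sub("", html)
    for tag in FAKE_TAGS:
        if "<%s>" % tag in outside:
            problems.append("<%s> found outside <segment>" % tag)
    return problems
```

An empty list means none of these particular problems were found; it does not replace an initial reading of the text.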

Here is an example HTML file for sf276.

<article id='sf276'>
<header>
<h1>
<segment><root>Candrasūtra</root></segment>
</h1>
</header>
<p>
<segment><root>evaṃ mayā śrutam</root></segment>
<segment><root>ekasama<supplied>yaṃ bhagavāñ</supplied> śrāvastyāṃ viharati jet<supplied>a</supplied>v<supplied>a</supplied>n<supplied>a</supplied> anāthapiṇḍad<supplied>ā</supplied>r<supplied>ā</supplied>m<supplied>e /</supplied></root></segment>
</p>
<p>
<segment><root>tena khalu samayena rāhuṇā asurendreṇa sarvaṃ candramaṇḍalam āvṛtam* <supplied>/</supplied></root></segment>
</p>
<p>
<segment><root><supplied>atha</supplied> yā devatā tasmiṃ<supplied>ś</supplied> candramaṇḍala adhyuṣitā sā bhītā trast<supplied>ā</supplied> saṃvignā āhṛṣṭaromakūpā yena bhagavāṃs teno<supplied>pajagāma /</supplied> upetya bha<supplied>ga</supplied>v<supplied>a</supplied>tpādau śirasā <supplied>vanditvaikāṃ</supplied>te 'sthād ekāntasthitā sā devatā tasyāṃ velāyāṃ gāthā babhāṣe //</root></segment>
</p>
<p>
<segment><root>buddhavīra namas te 'stu vipramuktāya sarvataḥ</root><comment>Ed. bhitā but MS reads bhītā</comment></segment>
<segment><root>saṃbādhapratipannāsmi tasya me śaraṇaṃ bhava :</root><comment>Ed. buddha vīra</comment></segment>
</p>
<blockquote class='gatha'>
<p>
<span class='verse-line'><segment><root>arhantaṃ sugataṃ loke candramāḥ śaraṇaṃ gataḥ</root></segment></span>
<span class='verse-line'><segment><root>rāhoś candramasaṃ muñca buddhā lokānukampakāḥ //</root></segment></span>
</p>
</blockquote>
<p>
<segment><root>bhagavān āha //</root></segment>
</p>
<p>
<segment><root>tamonudaṃ taṃ nabhasi prabhākaraṃ virocanaṃ śukla<supplied>v</supplied>iśuddhavarcasam*</root></segment>
<segment><root>rāho ś<supplied>a</supplied>śāṅkaṃ grasa māntarīkṣe praj<supplied>ā</supplied>pr<supplied>a</supplied>dīpaṃ drutam utsṛjainam* //</root></segment>
</p>
<p>
<segment><root>atha rāhuṇā as<supplied>u</supplied>rendreṇa tvaritatvaritaṃ candramaṇḍalam utsṛṣṭam* ⟨/⟩</root></segment>
<segment><root>tataḥ sa<supplied>ṃ</supplied>tvaramāṇo 'sau rāhuś candram avāsṛ<supplied>jat*</supplied></root></segment>
<segment><root><supplied>saṃsvinnagātro vya</supplied>thitaḥ saṃbhr<supplied>ānta āturo ya</supplied>thā //</root></segment>
</p>
<p>
<segment><root>adrākṣīd baḍir vairocano <supplied>rāhuṇā</supplied> asurendreṇa tvaritatvaritaṃ candr<supplied>a</supplied>maṇḍala<supplied>m utsṛṣṭam* / dṛṣṭvā ca baḍi</supplied>r gāthāṃ babhāṣe //</root></segment>
</p>
<p>
<segment><root>ki<supplied>ṃ</supplied> nu sa<supplied>ṃ</supplied>tv<supplied>aramāṇas</supplied> tv<supplied>aṃ</supplied> rāhuś candraṃ vimuñcasi ·</root></segment>
<segment><root>saṃsvinnagātro vyathitaḥ saṃ<supplied>bhrānta āturo yathā</supplied> <supplied>//</supplied></root><comment>Cf. Pelliot Sanskrit bleu 449 Ac: /// ro yathā //</comment></segment>
</p>
<p>
<segment><root><supplied>rāhur avocat* //</supplied></root></segment>
</p>
<p>
<segment><root><supplied>sa</supplied>ptadhā me sphalen mūrdhā <supplied>jīvan na sukha</supplied>m āp<supplied>nu</supplied>yāṃ</root></segment>
<segment><root>ta<supplied>tra buddh</supplied>ābhigītena muñceyaṃ śaśinaṃ na cet*</root><comment>Cf. Pelliot Sanskrit bleu 449 Ac: rāhu prāha // saptadhā me sphal[e] mūrdhā</comment></segment>
</p>
<p>
<segment><root><supplied>baḍir vairocano 'vocat* /</supplied></root></segment>
<segment><root>x x x x x - - - x x x x madarśi<supplied>nāṃ</supplied></root></segment>
<segment><root><supplied>teṣāṃ gāthābhigītena rāhuś candraṃ vimuñcati //</supplied></root><comment>Cf. Pelliot Sanskrit bleu 449 Ad: + + + + + .. .. .. .. .. .. .. (bh)i(g)itena muñce</comment></segment>
</p>
<p>
<segment><root><supplied>candrasūtraṃ samāptam* //</supplied></root></segment>
</p>
</article>
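Overlapping or unclosed markup is easy to miss in a file like this, for example an extra opening <span> that never gets closed. As a rough complement to HTML tidy, here is a small sketch using Python's standard html.parser. The class and function names are made up for this example, and it assumes the file contains no void elements such as <br>, which would be reported as unclosed.

```python
from html.parser import HTMLParser


class NestingChecker(HTMLParser):
    """Track open tags on a stack and record tags that close out of order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # (tag, line) for every tag still open
        self.problems = []   # (line, closing tag, tag that was actually open)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        line = self.getpos()[0]
        if self.open_tags and self.open_tags[-1][0] == tag:
            self.open_tags.pop()
            return
        expected = self.open_tags[-1][0] if self.open_tags else None
        self.problems.append((line, tag, expected))
        # Recover: drop everything above the matching open tag, if there is one.
        names = [name for name, _ in self.open_tags]
        if tag in names:
            last = len(names) - 1 - names[::-1].index(tag)
            del self.open_tags[last:]


def find_nesting_errors(html):
    """Return (line, closing_tag_or_None, open_tag) tuples for bad nesting."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    # Tags still open at the end of the file were never closed.
    unclosed = [(line, None, tag) for tag, line in checker.open_tags]
    return checker.problems + unclosed
```

Each tuple gives the line number, the closing tag that was met, and the tag that was still open at that point, which is usually enough to locate the culprit.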

convert HTML to TSV

The next step is to create a TSV file. This will allow the data to be separated into its different types.

For this we use Karl's bilara-html-tsv script, which is currently found here:

https://github.com/sc-voice/bilara-html-tsv

  1. Get rid of document-level HTML.
  2. Run bilara-html-tsv. This creates a TSV file.
  3. The first row has column headers:
    • The first column header is segment_id.
    • The second column header is html. This contains the properHTML with {} as placeholder for <segment> content. If there is no properHTML, still use {}.
    • The remaining column headers are identical to the names of the fakeHTML custom tags.

That gives us something like:

segment_id	html	root	comment
sf276:0.1	<article id='sf276'><header><h1>{}</h1></header>	Candrasūtra	
sf276:1.1	<p>{}	evaṃ mayā śrutam	
sf276:1.2	{}</p>	ekasama<supplied>yaṃ bhagavāñ</supplied> śrāvastyāṃ viharati jet<supplied>a</supplied>v<supplied>a</supplied>n<supplied>a</supplied> anāthapiṇḍad<supplied>ā</supplied>r<supplied>ā</supplied>m<supplied>e /</supplied>	
sf276:2.1	<p>{}</p>	tena khalu samayena rāhuṇā asurendreṇa sarvaṃ candramaṇḍalam āvṛtam* <supplied>/</supplied>	
sf276:3.1	<p>{}</p>	<supplied>atha</supplied> yā devatā tasmiṃ<supplied>ś</supplied> candramaṇḍala adhyuṣitā sā bhītā trast<supplied>ā</supplied> saṃvignā āhṛṣṭaromakūpā yena bhagavāṃs teno<supplied>pajagāma /</supplied> upetya bha<supplied>ga</supplied>v<supplied>a</supplied>tpādau śirasā <supplied>vanditvaikāṃ</supplied>te 'sthād ekāntasthitā sā devatā tasyāṃ velāyāṃ gāthā babhāṣe //	Ed. bhitā but MS reads bhītā
sf276:4.1	<p>{}	buddhavīra namas te 'stu vipramuktāya sarvataḥ	Ed. buddha vīra
sf276:4.2	{}</p>	saṃbādhapratipannāsmi tasya me śaraṇaṃ bhava :	
sf276:5.1	<blockquote class='gatha'><p><span class='verse-line'>{}</span>	arhantaṃ sugataṃ loke candramāḥ śaraṇaṃ gataḥ	
sf276:5.2	<span class='verse-line'>{}</span></p></blockquote>	rāhoś candramasaṃ muñca buddhā lokānukampakāḥ //	
sf276:6.1	<p>{}</p>	bhagavān āha //	
sf276:7.1	<p>{}	tamonudaṃ taṃ nabhasi prabhākaraṃ virocanaṃ śukla<supplied>v</supplied>iśuddhavarcasam*	
sf276:7.2	{}</p>	rāho ś<supplied>a</supplied>śāṅkaṃ grasa māntarīkṣe praj<supplied>ā</supplied>pr<supplied>a</supplied>dīpaṃ drutam utsṛjainam* //	
sf276:8.1	<p>{}	atha rāhuṇā as<supplied>u</supplied>rendreṇa tvaritatvaritaṃ candramaṇḍalam utsṛṣṭam* ⟨/⟩	
sf276:8.2	{}	tataḥ sa<supplied>ṃ</supplied>tvaramāṇo 'sau rāhuś candram avāsṛ<supplied>jat*</supplied>	
sf276:8.3	{}</p>	<supplied>saṃsvinnagātro vya</supplied>thitaḥ saṃbhr<supplied>ānta āturo ya</supplied>thā //	
sf276:9.1	<p>{}</p>	adrākṣīd baḍir vairocano <supplied>rāhuṇā</supplied> asurendreṇa tvaritatvaritaṃ candr<supplied>a</supplied>maṇḍala<supplied>m utsṛṣṭam* / dṛṣṭvā ca baḍi</supplied>r gāthāṃ babhāṣe //	
sf276:10.1	<p>{}	ki<supplied>ṃ</supplied> nu sa<supplied>ṃ</supplied>tv<supplied>aramāṇas</supplied> tv<supplied>aṃ</supplied> rāhuś candraṃ vimuñcasi ·	Cf. Pelliot Sanskrit bleu 449 Ac: /// ro yathā //
sf276:10.2	{}</p>	saṃsvinnagātro vyathitaḥ saṃ<supplied>bhrānta āturo yathā</supplied> <supplied>//</supplied>	
sf276:11.1	<p>{}</p>	<supplied>rāhur avocat* //</supplied>	
sf276:12.1	<p>{}	<supplied>sa</supplied>ptadhā me sphalen mūrdhā <supplied>jīvan na sukha</supplied>m āp<supplied>nu</supplied>yāṃ	Cf. Pelliot Sanskrit bleu 449 Ac: rāhu prāha // saptadhā me sphal[e] mūrdhā
sf276:12.2	{}</p>	ta<supplied>tra buddh</supplied>ābhigītena muñceyaṃ śaśinaṃ na cet*	
sf276:13.1	<p>{}	<supplied>baḍir vairocano 'vocat* /</supplied>	
sf276:13.2	{}	x x x x x - - - x x x x madarśi<supplied>nāṃ</supplied>	Cf. Pelliot Sanskrit bleu 449 Ad: + + + + + .. .. .. .. .. .. .. (bh)i(g)itena muñce
sf276:13.3	{}</p>	<supplied>teṣāṃ gāthābhigītena rāhuś candraṃ vimuñcati //</supplied>	
sf276:14.1	<p>{}</p></article>	<supplied>candrasūtraṃ samāptam* //</supplied>	
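To make the relationship between the prepared HTML and rows like these concrete, here is a simplified sketch of the idea. It is not Karl's bilara-html-tsv script and makes no claim to match its behaviour. The function names are hypothetical, and the numbering rule (a new section whenever a <p> or heading opens, starting from 0 for the title) is an assumption inferred from the example above.

```python
import re

FAKE_TAGS = ["root", "translation", "reference", "comment", "variant"]
SEGMENT_RE = re.compile(r"(<segment>.*?</segment>)", re.DOTALL)
LEADING_CLOSERS_RE = re.compile(r"(?:\s*</[^<>]+>)*")
BLOCK_OPEN_RE = re.compile(r"<(?:p|h[1-6])\b")  # assumed section starters


def squeeze(markup):
    """Trim the markup and remove whitespace between adjacent tags."""
    return re.sub(r">\s+<", "><", markup.strip())


def html_to_rows(html, uid):
    """Turn prepared HTML into one dict per <segment>."""
    parts = SEGMENT_RE.split(html)  # [properHTML, segment, properHTML, ...]
    rows, section, index = [], -1, 0
    opening = parts[0]
    for i in range(1, len(parts), 2):
        segment, following = parts[i], parts[i + 1]
        # Closing tags right after a segment stay with that segment's row.
        closers = LEADING_CLOSERS_RE.match(following).group(0)
        if BLOCK_OPEN_RE.search(opening):
            section, index = section + 1, 1
        else:
            index += 1
        row = {
            "segment_id": "%s:%d.%d" % (uid, section, index),
            "html": squeeze(opening) + "{}" + squeeze(closers),
        }
        for tag in FAKE_TAGS:
            found = re.search(r"<%s>(.*?)</%s>" % (tag, tag), segment, re.DOTALL)
            row[tag] = found.group(1) if found else ""
        rows.append(row)
        # Whatever is left opens the next segment's row.
        opening = following[len(closers):]
    return rows


def rows_to_tsv(rows, columns):
    """Serialise the chosen columns as tab-separated text with a header row."""
    lines = ["\t".join(columns)]
    lines += ["\t".join(row[name] for name in columns) for row in rows]
    return "\n".join(lines) + "\n"
```

The point of the sketch is the split itself: everything outside <segment> becomes the html template around {}, and each fakeHTML tag becomes its own column.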

convert TSV to Bilara JSON

From here, we hand it over to bilara i/o. If creating a new collection, we need to add the details to a config file.

https://github.com/suttacentral/bilara-data/tree/unpublished/.scripts/bilara-io/config

  • Important: make sure the TSV file has a column for each header in the config; otherwise the import will throw an error.
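A quick pre-flight check can catch a missing column before the import fails. The sketch below is an illustration only: the config layout is not shown on this page, so the expected column names are simply passed in by hand, and the function name is hypothetical.

```python
import csv
import io


def missing_columns(tsv_text, expected_columns):
    """Return the expected column names that are absent from the TSV header."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    header = next(reader, [])  # first row holds the column headers
    return [name for name in expected_columns if name not in header]
```

For instance, calling it with the text of sf276.tsv and the list of headers you put in the config returns an empty list when every header has a matching column.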

Once the config is defined, save the TSV file in .scripts/bilara-io/ and run:

./sheet_import.py sf276.tsv -c

This will separate the content types and place them in the correct folders. Ready to translate!
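Conceptually, the separation amounts to turning each TSV column into its own mapping from segment ID to content. The following sketch shows that idea only. It is not the code of sheet_import.py, the function name is made up, and skipping empty cells is an assumption based on the comment file containing only segments that have a comment.

```python
import csv
import io


def split_tsv(tsv_text):
    """Group TSV cells by column: {column: {segment_id: value}}."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    files = {}
    for row in reader:
        segment_id = row["segment_id"]
        for column, value in row.items():
            # Skip the ID column, surplus cells (key None) and empty cells.
            if column in (None, "segment_id") or not value:
                continue
            files.setdefault(column, {})[segment_id] = value
    return files
```

Each inner mapping could then be written out with json.dump(..., ensure_ascii=False, indent=2) so that the diacritics stay readable in the resulting file.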

/html/sf276.json

{
  "sf276:0.1": "<article id='sf276'><header><h1>{}</h1></header>",
  "sf276:1.1": "<p>{}",
  "sf276:1.2": "{}</p>",
  "sf276:2.1": "<p>{}</p>",
  "sf276:3.1": "<p>{}</p>",
  "sf276:4.1": "<p>{}",
  "sf276:4.2": "{}</p>",
  "sf276:5.1": "<blockquote class='gatha'><p><span class='verse-line'>{}</span>",
  "sf276:5.2": "<span class='verse-line'>{}</span></p></blockquote>",
  "sf276:6.1": "<p>{}</p>",
  "sf276:6.2": "<p>{}",
  "sf276:6.3": "{}</p>",
  "sf276:7.1": "<p>{}",
  "sf276:7.2": "{}",
  "sf276:7.3": "{}</p>",
  "sf276:8.1": "<p>{}</p>",
  "sf276:9.1": "<p>{}",
  "sf276:9.2": "{}</p>",
  "sf276:10.1": "<p>{}</p>",
  "sf276:10.2": "<p>{}",
  "sf276:10.3": "{}</p>",
  "sf276:11.1": "<p>{}",
  "sf276:11.2": "{}",
  "sf276:11.3": "{}</p>",
  "sf276:12.1": "<p>{}</p></article>"
}

/root/sf276.json

{
  "sf276:0.1": "Candrasūtra",
  "sf276:1.1": "evaṃ mayā śrutam",
  "sf276:1.2": "ekasama<supplied>yaṃ bhagavāñ</supplied> śrāvastyāṃ viharati jet<supplied>a</supplied>v<supplied>a</supplied>n<supplied>a</supplied> anāthapiṇḍad<supplied>ā</supplied>r<supplied>ā</supplied>m<supplied>e /</supplied>",
  "sf276:2.1": "tena khalu samayena rāhuṇā asurendreṇa sarvaṃ candramaṇḍalam āvṛtam* <supplied>/</supplied>",
  "sf276:3.1": "<supplied>atha</supplied> yā devatā tasmiṃ<supplied>ś</supplied> candramaṇḍala adhyuṣitā sā bhītā trast<supplied>ā</supplied> saṃvignā āhṛṣṭaromakūpā yena bhagavāṃs teno<supplied>pajagāma /</supplied> upetya bha<supplied>ga</supplied>v<supplied>a</supplied>tpādau śirasā <supplied>vanditvaikāṃ</supplied>te 'sthād ekāntasthitā sā devatā tasyāṃ velāyāṃ gāthā babhāṣe //",
  "sf276:4.1": "buddhavīra namas te 'stu vipramuktāya sarvataḥ",
  "sf276:4.2": "saṃbādhapratipannāsmi tasya me śaraṇaṃ bhava :",
  "sf276:5.1": "arhantaṃ sugataṃ loke candramāḥ śaraṇaṃ gataḥ ",
  "sf276:5.2": "rāhoś candramasaṃ muñca buddhā lokānukampakāḥ //",
  "sf276:6.1": "bhagavān āha //",
  "sf276:6.2": "tamonudaṃ taṃ nabhasi prabhākaraṃ virocanaṃ śukla<supplied>v</supplied>iśuddhavarcasam*",
  "sf276:6.3": "rāho ś<supplied>a</supplied>śāṅkaṃ grasa māntarīkṣe praj<supplied>ā</supplied>pr<supplied>a</supplied>dīpaṃ drutam utsṛjainam* //",
  "sf276:7.1": "atha rāhuṇā as<supplied>u</supplied>rendreṇa tvaritatvaritaṃ candramaṇḍalam utsṛṣṭam* /",
  "sf276:7.2": "tataḥ sa<supplied>ṃ</supplied>tvaramāṇo 'sau rāhuś candram avāsṛ<supplied>jat*</supplied>",
  "sf276:7.3": "<supplied>saṃsvinnagātro vya</supplied>thitaḥ saṃbhr<supplied>ānta āturo ya</supplied>thā //",
  "sf276:8.1": "adrākṣīd baḍir vairocano <supplied>rāhuṇā</supplied> asurendreṇa tvaritatvaritaṃ candr<supplied>a</supplied>maṇḍala<supplied>m utsṛṣṭam* / dṛṣṭvā ca baḍi</supplied>r gāthāṃ babhāṣe //",
  "sf276:9.1": "ki<supplied>ṃ</supplied> nu sa<supplied>ṃ</supplied>tv<supplied>aramāṇas</supplied> tv<supplied>aṃ</supplied> rāhuś candraṃ vimuñcasi ·",
  "sf276:9.2": "saṃsvinnagātro vyathitaḥ saṃ<supplied>bhrānta āturo yathā</supplied> <supplied>//</supplied>",
  "sf276:10.1": "<supplied>rāhur avocat* //</supplied>",
  "sf276:10.2": "<supplied>sa</supplied>ptadhā me sphalen mūrdhā <supplied>jīvan na sukha</supplied>m āp<supplied>nu</supplied>yāṃ",
  "sf276:10.3": "ta<supplied>tra buddh</supplied>ābhigītena muñceyaṃ śaśinaṃ na cet*",
  "sf276:11.1": "<supplied>baḍir vairocano 'vocat* /</supplied>",
  "sf276:11.2": "x x x x x - - - x x x x madarśi<supplied>nāṃ</supplied>",
  "sf276:11.3": "<supplied>teṣāṃ gāthābhigītena rāhuś candraṃ vimuñcati //</supplied>",
  "sf276:12.1": "<supplied>candrasūtraṃ samāptam* //</supplied>"
}

/comment/sf276.json

{
  "sf276:4.1": "Ed. bhitā but MS reads bhītā",
  "sf276:4.2": "Ed. buddha vīra",
  "sf276:9.2": "Cf. Pelliot Sanskrit bleu 449 Ac: /// ro yathā //",
  "sf276:10.3": "Cf. Pelliot Sanskrit bleu 449 Ac: rāhu prāha // saptadhā me sphal[e] mūrdhā",
  "sf276:11.3": "Cf. Pelliot Sanskrit bleu 449 Ad: + + + + + .. .. .. .. .. .. .. (bh)i(g)itena muñce"
}
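To see why the html file keeps the {} placeholders, it helps to put the pieces back together. This is a minimal sketch under stated assumptions: the function name is hypothetical, segments are joined with a single space, and the order of segments is taken from the order of keys in the html mapping (which json.load preserves). How SuttaCentral itself assembles the page may differ.

```python
def render(html_map, text_map):
    """Fill each {} placeholder with the text of the matching segment."""
    pieces = []
    for segment_id, template in html_map.items():
        # Segments without text in this file (e.g. untranslated) stay empty.
        pieces.append(template.replace("{}", text_map.get(segment_id, "")))
    return " ".join(pieces)
```

Passing the contents of /html/sf276.json and /root/sf276.json gives back the root text in its properHTML; passing a translation file instead gives the translated page with the same structure.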