Overlapping (text critical) markup in Bilara

sujato edited this page Mar 29, 2022 · 10 revisions

The problem of nested hierarchical tags is a thorny one, and one of the key reasons for preferring standoff properties.

https://en.wikipedia.org/wiki/Overlapping_markup

Nonetheless, even if a full standoff spec for text is employed, it will still have to be converted to HTML for the web. So we need to think about how to support overlapping tags now.

tl;dr

When preparing texts with inline text-critical markup:

  1. mark each segment with <segment> … </segment>
  2. do not use <p> or <hX> tags, instead use <paragraph> … </paragraph> and <heading1>, etc.
  3. for verses, use <vl> instead of `
  4. run tidy --doctype html5 --output-html 1 --tidy-mark 0 --quiet 1 --output-encoding utf8 -w 0 --show-warnings 0 --new-blocklevel-tags paragraph,segment,vl,heading1,heading2,heading3,heading4,heading5,heading6 --new-inline-tags supplied,comment -m *.html
  5. convert <paragraph> back to <p> and <headingX> back to <hX>.
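The renaming in steps 2 and 5 can be scripted. The following is a minimal sketch, assuming lowercase tag names and simple attributes; the function names `to_custom_tags` and `to_standard_tags` are made up for illustration and are not part of Bilara or Tidy.

```python
import re

# <p> and <h1>..<h6>, opening or closing, with optional attributes.
# The trailing (\s...)?> keeps tags such as <pre> or <hr> from matching.
_STANDARD = re.compile(r"<(/?)(p|h([1-6]))(\s[^<>]*)?>")
_CUSTOM = re.compile(r"<(/?)(paragraph|heading([1-6]))(\s[^<>]*)?>")


def to_custom_tags(html: str) -> str:
    """Step 2: rename <p> to <paragraph> and <hX> to <headingX>."""
    def repl(m):
        slash, name, level, attrs = m.group(1), m.group(2), m.group(3), m.group(4) or ""
        new_name = "paragraph" if name == "p" else f"heading{level}"
        return f"<{slash}{new_name}{attrs}>"
    return _STANDARD.sub(repl, html)


def to_standard_tags(html: str) -> str:
    """Step 5: rename <paragraph> back to <p> and <headingX> back to <hX>."""
    def repl(m):
        slash, name, level, attrs = m.group(1), m.group(2), m.group(3), m.group(4) or ""
        new_name = "p" if name == "paragraph" else f"h{level}"
        return f"<{slash}{new_name}{attrs}>"
    return _CUSTOM.sub(repl, html)
```

Run `to_custom_tags` on each file before Tidy and `to_standard_tags` afterwards. A regex is used only because these tags are simple and never nested inside attribute values; an HTML parser would be the safer choice for messier input.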

This method works for all inline markup, whether standard tags such as <i>, spans with classes, or custom tags such as <supplied>, so long as any custom tags are declared to HTML tidy (--new-inline-tags supplied). I recommend using custom tags; the browser will recognize them as “unknown elements” and treat them like a span.


y tho?

legacy HTML

Consider the following case (using a paragraph from sf78 as an example). Here <supplied> is a special custom HTML tag. Use these to help disambiguate.

<p>
<supplied>athāyuṣmān pūrṇiko ‘ciraprakrāntān saṃbahulān anyatīrthikaparivrājakān vi</supplied>ditvā yena bhagavāṃs tenopa<supplied>jagāma. upetya bhaga</supplied>vatpā<supplied>dau śirasā vanditvā, ekānte nyaṣīdat. ekāntaniṣanna āyuṣmān pūrṇiko yāvad e</supplied>vāsyābhūt.
</p>

This is unproblematic, because the inline <supplied> tags do not cross the <p> boundaries. (There are, to be sure, cases where markup does cross <p> boundaries.)

Overlapping tags in Bilara

For Bilara texts, when presented on the web, each segment is wrapped in a <span class='segment'>. In fact each segment has multiple spans, for example, “segment”, “root”, “text”. But for simplicity let’s just consider it wrapped in a single inline element.

If we divide the paragraph we get the following segments.

<p>
<segment><supplied>athāyuṣmān pūrṇiko ‘ciraprakrāntān saṃbahulān anyatīrthikaparivrājakān vi</supplied>ditvā yena bhagavāṃs tenopa<supplied>jagāma.</segment>
<segment>upetya bhaga</supplied>vatpā<supplied>dau śirasā vanditvā, ekānte nyaṣīdat.</segment>
<segment>ekāntaniṣanna āyuṣmān pūrṇiko yāvad e</supplied>vāsyābhūt.</segment>
</p>

Here the first <supplied> tag is contained within the segment, no problems. The second <supplied> tag opens near the end of the first segment, and continues into the second segment. The next <supplied> tag likewise overlaps the segment boundary.
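To find such cases before they reach the browser, a small checker can walk the tags and report any inline tag that is still open at a </segment>, or that closes without having been opened in the same segment. This is only a sketch: `find_overlaps` and its return format are hypothetical, and it assumes well-formed tag syntax with no comments or scripts in the input.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>")
VOID = {"br", "hr", "img", "wbr"}  # never closed, so never tracked


def find_overlaps(html, segment_tag="segment"):
    """Return (kind, tag, segment_index) for tags crossing a segment boundary."""
    problems = []
    open_inline = []      # tags opened inside the current segment
    seg_index = -1
    in_segment = False
    for m in TAG.finditer(html):
        closing, name = m.group(1) == "/", m.group(2)
        if name == segment_tag:
            if not closing:
                seg_index += 1
                in_segment = True
                open_inline = []
            else:
                # Anything still open runs past the end of this segment.
                problems += [("unclosed", t, seg_index) for t in open_inline]
                in_segment = False
        elif in_segment and name not in VOID:
            if not closing:
                open_inline.append(name)
            elif open_inline and open_inline[-1] == name:
                open_inline.pop()
            else:
                # Closed here but opened in an earlier segment.
                problems.append(("unopened", name, seg_index))
    return problems
```

On the three segments above, this reports an unclosed <supplied> in segments 0 and 1 and an unopened </supplied> in segments 1 and 2 (counting from zero).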

Let’s apply some styles for fun.

supplied {
  color: green;
  background-color: beige;
}
segment {
  outline: 2px dotted red;
}

If you open that up in a browser, you’ll see that it doesn’t handle it correctly. The <supplied> styles carry across the segment boundaries, but the dotted segment outline is missing from any text that follows a </supplied> within a segment. Yikes!

Here’s what the browser does to cause this problem.

<p>
<segment><supplied>athāyuṣmān pūrṇiko ‘ciraprakrāntān saṃbahulān anyatīrthikaparivrājakān vi</supplied>ditvā yena bhagavāṃs tenopa<supplied>jagāma.</supplied></segment><supplied>
<segment>upetya bhaga</segment></supplied>vatpā<supplied>dau śirasā vanditvā, ekānte nyaṣīdat.
<segment>ekāntaniṣanna āyuṣmān pūrṇiko yāvad e</segment></supplied>vāsyābhūt.
</p>

As you can see, it closes the segment before the </supplied>, leaving the rest of the text outside any segment: <supplied><segment>upetya bhaga</segment></supplied>vatpā<supplied>dau…

Obviously we need to create markup that is not subject to such brittleness.

HTML tidy

HTML tidy will wrap up overlapping tags, like a browser. But unlike the browser, we can tell Tidy to treat new tags as block level elements. When we do that, it does the wrapping just fine.

<segment><supplied>athāyuṣmān pūrṇiko ‘ciraprakrāntān saṃbahulān anyatīrthikaparivrājakān vi</supplied>ditvā yena bhagavāṃs tenopa<supplied>jagāma.</supplied></segment>
<segment><supplied>upetya bhaga</supplied>vatpā<supplied>dau śirasā vanditvā, ekānte nyaṣīdat.</supplied></segment>
<segment><supplied>ekāntaniṣanna āyuṣmān pūrṇiko yāvad e</supplied>vāsyābhūt.</segment>

Each <supplied> is closed before the </segment> and reopened at the start of the next segment, and there’s no funny business with unwrapped content or extra tags. Cool.
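The effect can be pictured as "close whatever is open at </segment>, then reopen it after the next <segment>". The sketch below reproduces that result in a few lines of Python. It is an illustration of the outcome only, not how Tidy is implemented, and `close_and_reopen` is a made-up name.

```python
import re

TOKEN = re.compile(r"(</?[A-Za-z][A-Za-z0-9]*[^<>]*>)")
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>")


def close_and_reopen(html, segment_tag="segment"):
    """Close inline tags left open at </segment>; reopen them in the next segment."""
    out = []
    stack = []    # (opening tag text, tag name) open inside the current segment
    carry = []    # tags to reopen at the start of the next segment
    inside = False
    for tok in TOKEN.split(html):
        m = TAG.fullmatch(tok)
        if not m:
            out.append(tok)            # plain text between tags
            continue
        closing, name = m.group(1) == "/", m.group(2)
        if name == segment_tag and not closing:
            out.append(tok)
            out.extend(text for text, _ in carry)   # reopen carried tags
            stack, carry, inside = list(carry), [], True
        elif name == segment_tag and closing:
            out.extend(f"</{n}>" for _, n in reversed(stack))  # close open tags
            carry, stack, inside = stack, [], False
            out.append(tok)
        elif not inside:
            out.append(tok)            # tags outside segments pass through
        elif closing:
            if stack and stack[-1][1] == name:
                stack.pop()
            out.append(tok)
        else:
            stack.append((tok, name))
            out.append(tok)
    return "".join(out)
```

Because the full opening tag text is carried over, attributes such as classes are repeated on the reopened tag. Real input with void elements or comments would need more care, which is one reason to leave the job to Tidy.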

But wait! Now the <p> tags are a problem, because you can’t nest a block-level element inside <p>. It’s a whole thing. Block-level containers such as list items and <blockquote> are okay, because their content is not limited to phrasing content. <h1>, <h2>, etc. are unfortunately not okay, so change them to <heading1>, <heading2>, etc.

Let us then embrace the chaos: change <p> to <paragraph>, then run:

tidy --doctype html5 --output-html 1 --tidy-mark 0 --quiet 1 --output-encoding utf8 -w 0 --show-warnings 0 --new-blocklevel-tags paragraph,segment,heading1,heading2,heading3,heading4,heading5,heading6 --new-inline-tags supplied,comment -m test.html

This outputs:

<paragraph>
<segment><supplied>athāyuṣmān pūrṇiko ‘ciraprakrāntān saṃbahulān anyatīrthikaparivrājakān vi</supplied>ditvā yena bhagavāṃs tenopa<supplied>jagāma.</supplied></segment>
<segment><supplied>upetya bhaga</supplied>vatpā<supplied>dau śirasā vanditvā, ekānte nyaṣīdat.</supplied></segment>
<segment><supplied>ekāntaniṣanna āyuṣmān pūrṇiko yāvad e</supplied>vāsyābhūt.</segment>
</paragraph>
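For batch runs, the Tidy call could be wrapped in a short script so that the lists of custom tags live in one place. This is a sketch under assumptions: it expects a `tidy` binary on the PATH, uses the same options as the command on this page, and the names `tidy_argv` and `run_tidy` are invented here.

```python
import subprocess

# Custom tags declared to Tidy, as listed in the tl;dr command.
BLOCK_TAGS = ["paragraph", "segment", "vl"] + [f"heading{n}" for n in range(1, 7)]
INLINE_TAGS = ["supplied", "comment"]


def tidy_argv(files, block_tags=BLOCK_TAGS, inline_tags=INLINE_TAGS):
    """Build the argument list for the Tidy call used on this page."""
    return [
        "tidy",
        "--doctype", "html5",
        "--output-html", "1",
        "--tidy-mark", "0",
        "--quiet", "1",
        "--output-encoding", "utf8",
        "-w", "0",
        "--show-warnings", "0",
        "--new-blocklevel-tags", ",".join(block_tags),
        "--new-inline-tags", ",".join(inline_tags),
        "-m", *files,          # -m modifies the files in place
    ]


def run_tidy(files):
    """Run Tidy on the given files and return its exit status."""
    # check=True is deliberately not used: Tidy exits non-zero on warnings too,
    # so the caller decides how to treat the status.
    return subprocess.run(tidy_argv(files)).returncode
```

Any new custom inline tag then only needs to be added to `INLINE_TAGS`. Since `-m` overwrites its input, it is worth running this on copies or under version control.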

Now we just have to change back to regular <p> tags. Let’s keep the other custom tags.

<p>
<segment><supplied>athāyuṣmān pūrṇiko ‘ciraprakrāntān saṃbahulān anyatīrthikaparivrājakān vi</supplied>ditvā yena bhagavāṃs tenopa<supplied>jagāma.</supplied></segment>
<segment><supplied>upetya bhaga</supplied>vatpā<supplied>dau śirasā vanditvā, ekānte nyaṣīdat.</supplied></segment>
<segment><supplied>ekāntaniṣanna āyuṣmān pūrṇiko yāvad e</supplied>vāsyābhūt.</segment>
</p>

Now, even when set to display: block, everything just works. Also, the HTML in the browser is identical to what we served, so there are no unexpected side effects.