Glossary not support non-english character #312

lindeer · 2022-11-07T02:22:09Z

Summary

More precisely, the glossary cannot match a term that has a non-word character at its start or end. For example, Connétable is matched, but Brancaleone von Andalò is not. A term that starts with ( or ends with ) is not matched either. Terms with special characters are very common in glossaries, especially in books written in other languages (French, German, Japanese).

This is also an old bug dating back to GitBook; I hope it can be fixed. I think the root cause of #233 is the same.

HonKit version: 4.0.4

Steps to reproduce

Create a term Brancaleone von Andalò in GLOSSARY.md
Insert Brancaleone von Andalò into a paragraph of a page file

Link to code example:

Der Papst gehört nicht nach Anagni oder Lyon, nicht nach Perugia oder Assisi, sondern nach Rom.« Ein kraftvoller Mann gab den Römern diese Sprache ein, Brancaleone von Andalò, ihr damaliger Senator.

Expected results

Brancaleone von Andalò in the article should become a glossary link.

Actual results

No link is created.

azu · 2022-11-07T03:11:17Z

Maybe this \b causes the error.
(toLowerCase is also a bit suspicious.)

honkit/packages/honkit/src/output/modifiers/annotateText.ts

Line 69 in 8a47d19

    
           const searchRegex = new RegExp(`\\b(${pregQuote(name.toLowerCase())})\\b`, "gi");

function pregQuote(str) {
    return `${str}`.replace(/([\\\.\+\*\?\[\^\]\$\(\)\{\}\=\!\<\>\|\:])/g, "\\$1");
}

const name = "Brancaleone von Andalò";
const searchRegex = new RegExp(`\\b(${pregQuote(name.toLowerCase())})\\b`, "gi");
console.log(searchRegex.test("test Brancaleone von Andalò test")); // => false

\b only recognizes ASCII word characters ([A-Za-z0-9_]), so it finds no boundary between ò and the following space.
We need to use a Unicode-safe word boundary.

regex - Javascript RegExp + Word boundaries + unicode characters - Stack Overflow
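
One possible shape for a Unicode-safe boundary, as a minimal sketch rather than the actual HonKit fix: replace each \b with a lookaround that checks the neighbouring character against Unicode letter, mark, and number classes. The names escapeRegExp and buildTermRegex are hypothetical helpers introduced here for illustration, and the sketch assumes a runtime that supports lookbehind and the u flag.

```javascript
// Escape regex syntax characters so the term is matched literally.
// Only syntax characters are escaped, which keeps the pattern valid under the "u" flag.
function escapeRegExp(str) {
    return String(str).replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a glossary matcher with Unicode-aware "boundaries".
// Instead of \b, require that the term is not preceded or followed by a
// Unicode letter, combining mark, number, or underscore.
function buildTermRegex(name) {
    const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
    return new RegExp(`(?<!${wordChar})(${escapeRegExp(name)})(?!${wordChar})`, "giu");
}

// The failing case from this issue now matches.
console.log(buildTermRegex("Brancaleone von Andalò").test("test Brancaleone von Andalò test")); // => true

// A term embedded in a longer word is still rejected.
console.log(buildTermRegex("Connétable").test("les Connétables")); // => false
```

Note that the sketch uses its own escape function: pregQuote also escapes characters such as = and !, which, as far as I know, are not accepted as escapes once the u flag is set, so it would likely need adjusting before being combined with \p{...}.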

lindeer · 2022-11-07T03:37:35Z

@azu Great! \b in a regex indeed only bounds ASCII word characters, and a Unicode word boundary would cover most cases. But what about a term like Henry (VII), which has a parenthesis at the end of the word?
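
A small sketch of why the trailing parenthesis is a problem for \b, and one way it might be handled. The idea is to test the characters around the term instead of the term's own first and last characters, so it does not matter whether the term ends in a letter or in ). The helpers escapeForRegex and matchesTerm are hypothetical names used only for this illustration, not HonKit APIs.

```javascript
// Escape regex syntax characters so the term is matched literally.
function escapeForRegex(str) {
    return String(str).replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: true if `term` occurs in `text` without a Unicode
// letter, combining mark, number, or underscore directly before or after it.
function matchesTerm(term, text) {
    const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
    const re = new RegExp(`(?<!${wordChar})${escapeForRegex(term)}(?!${wordChar})`, "iu");
    return re.test(text);
}

const sample = "King Henry (VII) ruled";

// With \b there is no boundary between ")" and " " (both are non-word characters).
const withWordBoundary = new RegExp(`\\b${escapeForRegex("Henry (VII)")}\\b`, "i");
console.log(withWordBoundary.test(sample)); // => false

// The lookaround version only inspects the neighbours of the term.
console.log(matchesTerm("Henry (VII)", sample)); // => true

// It still rejects the term when a letter follows immediately.
console.log(matchesTerm("Henry (VII)", "King Henry (VII)th")); // => false
```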

azu added the Type: Bug Bug or Bug fixes label Nov 7, 2022

azu added the Status: PR Welcome Welcome to Pull Request label Jul 15, 2023
