Glossary not support non-english character #312

Open
lindeer opened this issue Nov 7, 2022 · 2 comments
Labels: Status: PR Welcome, Type: Bug

lindeer commented Nov 7, 2022

Summary

More precisely, glossary terms that have a non-word character at the start or end are not matched. For example, Connétable is matched, but Brancaleone von Andalò is not. A term that starts with ( or ends with ) is not matched either. Terms with special characters are very common in glossaries, especially in books written in other languages (French, German, Japanese).

This is also an old bug that dates back to GitBook; I hope it can be fixed. I think the root cause of #233 is the same.

  • HonKit version: 4.0.4

Steps to reproduce

  1. Create a term Brancaleone von Andalò in GLOSSARY.md
  2. Insert Brancaleone von Andalò into a paragraph of a page file
  • Link to code example:
Der Papst gehört nicht nach Anagni oder Lyon, nicht nach Perugia oder Assisi, sondern nach Rom.« Ein kraftvoller Mann gab den Römern diese Sprache ein, Brancaleone von Andalò, ihr damaliger Senator. 

Expected results

Brancaleone von Andalò in the article should become a glossary link.

Actual results

No link is created.

azu added the Type: Bug label Nov 7, 2022
azu (Member) commented Nov 7, 2022

The \b in this line may be the cause of the error.
(The toLowerCase call is also a bit suspicious.)

const searchRegex = new RegExp(`\\b(${pregQuote(name.toLowerCase())})\\b`, "gi");

// Escape regex metacharacters in the glossary term.
function pregQuote(str) {
    return `${str}`.replace(/([\\\.\+\*\?\[\^\]\$\(\)\{\}\=\!\<\>\|\:])/g, "\\$1");
}

const name = "Brancaleone von Andalò";
const searchRegex = new RegExp(`\\b(${pregQuote(name.toLowerCase())})\\b`, "gi");
// The trailing \b fails: neither "ò" nor the following space is an ASCII word character.
console.log(searchRegex.test("test Brancaleone von Andalò test")); // => false

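\b only treats ASCII word characters ([A-Za-z0-9_]) as part of a word, so it sees no boundary between ò and the following space.
We need to use a Unicode-safe word boundary.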
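
One possible shape for a Unicode-safe boundary, offered as a sketch and not as the fix HonKit adopted, is to replace \b with lookarounds built on Unicode property escapes. The names buildTermRegex and escapeRegExp are hypothetical. The sketch uses its own narrower escape function because the "u" flag rejects some of the escapes that pregQuote produces (for example \= and \:). It also assumes a JavaScript engine that supports lookbehind assertions.

```javascript
// Hypothetical helper: escape only regex syntax characters,
// which keeps the pattern valid under the "u" flag.
function escapeRegExp(str) {
    return String(str).replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: match `name` only when it is not surrounded by
// Unicode letters, combining marks, digits, or underscores.
function buildTermRegex(name) {
    const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
    return new RegExp(`(?<!${wordChar})(${escapeRegExp(name)})(?!${wordChar})`, "giu");
}

const text = "test Brancaleone von Andalò test";
console.log(buildTermRegex("Brancaleone von Andalò").test(text)); // => true
console.log(buildTermRegex("Andal").test(text)); // => false ("ò" follows, so this is mid-word)
```

The "i" flag already makes the match case-insensitive, so the term does not need to be lowercased first.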

lindeer (Author) commented Nov 7, 2022

@azu Great! \b in a regex indeed only delimits ASCII word characters, and a Unicode word boundary would cover most cases. But what about a term like Henry (VII), which ends with a parenthesis?
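
To illustrate the parenthesis case: \b needs a word character on one side, so it can never succeed between ")" and a following space. One possible approach, sketched here under assumptions and not taken from HonKit, is to add a boundary check only on the side where the term itself starts or ends with a word character. The names buildBoundedRegex, escapeRegExp, and WORD_CHAR are hypothetical, and the sketch assumes an engine with lookbehind and Unicode property escape support.

```javascript
// Hypothetical helper: escape only regex syntax characters (valid under the "u" flag).
function escapeRegExp(str) {
    return String(str).replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Assumed definition of a "word character": letters, marks, digits, underscore.
const WORD_CHAR = "[\\p{L}\\p{M}\\p{N}_]";

// Hypothetical helper: guard a side of the term only if that side is a word character.
function buildBoundedRegex(name) {
    const startsWithWord = new RegExp(`^${WORD_CHAR}`, "u").test(name);
    const endsWithWord = new RegExp(`${WORD_CHAR}$`, "u").test(name);
    const before = startsWithWord ? `(?<!${WORD_CHAR})` : "";
    const after = endsWithWord ? `(?!${WORD_CHAR})` : "";
    return new RegExp(`${before}(${escapeRegExp(name)})${after}`, "giu");
}

const sentence = "Henry (VII) was crowned.";
// The \b-based pattern fails because ")" and " " are both non-word characters.
console.log(new RegExp("\\b(henry \\(vii\\))\\b", "gi").test(sentence)); // => false
console.log(buildBoundedRegex("Henry (VII)").test(sentence)); // => true
```

The design choice here is that a term ending in punctuation is treated as already delimited by that punctuation, so no check is made on that side.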

azu added the Status: PR Welcome label Jul 15, 2023