Include language directives? [proposed label: enhancement] #715

Open
Mrodent opened this issue May 17, 2024 · 0 comments
Mrodent commented May 17, 2024

Feature request

I'm often examining documents which are not written in English, or where I have a mixture of languages.

I'm doing a project where identifying the language is important because I'm putting the text in an Elasticsearch index. Stemming using an English analyser on French text, for example, makes absolutely no sense, and in fact will tend to deliver worse results than no stemmer at all. So identifying the languages correctly matters.

Sometimes the language of a document, or of a fragment within it, is properly indicated because the text (or fragment) has been set to the right language. In the .docx this typically shows up as a w:lang element, either in styles.xml or in document.xml. Here is the output of grep on a decompressed .docx file (a French document in this case, hence "fr-FR"):

./word/document.xml:w:hAnsi="Arial" w:cs="Arial"/><w:lang w:eastAsia="fr-FR"/></w:rPr><w:t>elle a subi du fait des livraisons
./word/document.xml:w:hAnsi="Arial" w:cs="Arial"/><w:lang w:eastAsia="fr-FR"/></w:rPr><w:t> ;</w:t></w:r></w:p><w:p w14:paraI
./word/document.xml:w:hAnsi="Arial" w:cs="Arial"/><w:lang w:eastAsia="fr-FR"/></w:rPr></w:pPr></w:p><w:p w14:paraId="5C118616
./word/document.xml:w:hAnsi="Arial" w:cs="Arial"/><w:lang w:eastAsia="fr-FR"/></w:rPr></w:pPr><w:r w:rsidRPr="00D000FB"><w:rP
./word/document.xml:" w:cs="Arial"/><w:b/><w:bCs/><w:lang w:eastAsia="fr-FR"/></w:rPr><w:t xml:space="preserve">CONDAMNER </w
./word/styles.xml:sz w:val="22"/><w:szCs w:val="22"/><w:lang w:val="fr-FR" w:eastAsia="en-US" w:bidi="ar-SA"/><w14:ligature
./word/styles.xml:hanging="283"/></w:pPr><w:rPr><w:lang w:eastAsia="fr-FR"/></w:rPr></w:style><w:style w:type="paragraph" w
./word/styles.xml:val="32"/><w:szCs w:val="32"/><w:lang w:eastAsia="fr-FR"/><w14:ligatures w14:val="none"/></w:rPr></w:styl
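
To make the request concrete, here is a minimal sketch of pulling the w:lang attributes out of a chunk of XML like the one above. It uses only the standard library and plain string scanning rather than a real XML parser, so it assumes attributes are written with double quotes and separated by a space, as in the grep output. The names `LangDirective`, `attr` and `lang_directives` are made up for illustration and are not part of this crate or of docx-rust.

```rust
/// Language attributes carried by one `<w:lang .../>` element.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Default, PartialEq)]
pub struct LangDirective {
    pub val: Option<String>,       // w:val
    pub east_asia: Option<String>, // w:eastAsia
    pub bidi: Option<String>,      // w:bidi
}

/// Return the value of ` name="..."` inside the text of a single tag, if present.
fn attr(tag: &str, name: &str) -> Option<String> {
    // The leading space keeps `w:val` from matching the tail of a longer name.
    let needle = format!(" {name}=\"");
    let start = tag.find(&needle)? + needle.len();
    let end = tag[start..].find('"')? + start;
    Some(tag[start..end].to_string())
}

/// Collect every `<w:lang ...>` element found in `xml`, in document order.
pub fn lang_directives(xml: &str) -> Vec<LangDirective> {
    let mut found = Vec::new();
    let mut rest = xml;
    while let Some(pos) = rest.find("<w:lang ") {
        let after = &rest[pos..];
        // Stop if the tag is cut off (as in truncated grep output).
        let Some(end) = after.find('>') else { break };
        let tag = &after[..=end];
        found.push(LangDirective {
            val: attr(tag, "w:val"),
            east_asia: attr(tag, "w:eastAsia"),
            bidi: attr(tag, "w:bidi"),
        });
        rest = &after[end + 1..];
    }
    found
}
```

Note that in the sample above the runs in document.xml carry only w:eastAsia, while w:val appears in styles.xml, so anything consuming these directives probably needs all three attributes rather than just one.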

Solution I'd like

It'd be nice if language indications, both for the document as a whole (i.e. from styles.xml) and for individual text runs (as found in document.xml), were exposed in the JSON object delivered.

Alternatives?

It's fairly practical to find the document's global language setting by examining styles.xml. The crate docx-rust lets you do this, for example. But this only gives you the "globally set" language for the docx; in fact it appears to be the "lang" property of the default character style.

But getting indications concerning individual runs of text seems currently to be impossible using either that crate or this one.
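
If both levels were available, combining them could look roughly like the sketch below: the run's own attributes are overlaid on the document-wide default, attribute by attribute. This is an assumption about how a consumer might merge the two, not a description of what Word or any crate does, and it ignores any named paragraph or character styles that sit between the default and the run. `Lang` and `effective_lang` are hypothetical names.

```rust
/// The three attributes of a `w:lang` element (hypothetical type).
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Lang {
    pub val: Option<String>,       // w:val
    pub east_asia: Option<String>, // w:eastAsia
    pub bidi: Option<String>,      // w:bidi
}

/// Overlay a run's attributes on the document default.
/// An attribute set on the run wins; otherwise the default is kept.
pub fn effective_lang(doc_default: &Lang, run: &Lang) -> Lang {
    Lang {
        val: run.val.clone().or_else(|| doc_default.val.clone()),
        east_asia: run
            .east_asia
            .clone()
            .or_else(|| doc_default.east_asia.clone()),
        bidi: run.bidi.clone().or_else(|| doc_default.bidi.clone()),
    }
}
```

With the styles.xml default from the sample (w:val="fr-FR", w:eastAsia="en-US", w:bidi="ar-SA") and a run that sets only w:eastAsia="fr-FR", this yields fr-FR, fr-FR and ar-SA respectively.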

Having said that, there is a crate, lingua, which is intended to identify languages from fragments of text. It's pretty good, but usually the directives actually found in Word documents will be better (at least when these directives state a language other than English).
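
The preference described above (trust a non-English directive, otherwise fall back to detection) could be sketched as follows. The detector is passed in as a closure so that no particular detection API is assumed; in practice it might wrap lingua. `choose_language` and `is_english` are hypothetical names, and treating "en" as the only suspect default is my assumption.

```rust
/// True if the tag's primary subtag is "en" (e.g. "en", "en-US", "EN-gb").
fn is_english(tag: &str) -> bool {
    let primary = tag.split('-').next().unwrap_or(tag);
    primary.eq_ignore_ascii_case("en")
}

/// Pick a language for `text`, given an optional directive from the document
/// and a fallback detector.
pub fn choose_language<F>(directive: Option<&str>, text: &str, detect: F) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    match directive {
        // A non-English directive was probably set on purpose: trust it.
        Some(tag) if !is_english(tag) => Some(tag.to_string()),
        // English (possibly just a default) or no directive at all: ask the
        // detector, and keep the directive only if detection yields nothing.
        other => detect(text).or_else(|| other.map(String::from)),
    }
}
```

The result could then be used to pick the Elasticsearch analyser for each fragment.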

@Mrodent Mrodent changed the title Include language directives? Include language directives? [proposed label: enhancement] May 17, 2024