Bug 115856 - Docx export: In non-English versions, styles.xml does not contain reference to international styles
Status: RESOLVED WORKSFORME
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 5.0 all versions (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords: filter:docx
Depends on:
Blocks: DOCX-Styles
Reported: 2018-02-19 14:30 UTC by fralau
Modified: 2020-02-15 18:27 UTC (History)
6 users (show)

See Also:
Crash report or crash signature:


Description fralau 2018-02-19 14:30:12 UTC
Description:
In general, MS Word apps provide localized styles for non-English languages (e.g. Titre1, Titre2... in French, instead of Heading1, Heading2...) and they are exported verbatim in the document.xml file.

However, the styles.xml file should contain a reference to the standard "latent" styles so that any application receiving those localized styles can recognize what they actually mean, and can process them correctly.

Fixing that aspect would allow non-English versions of OpenOffice to "speak" not only to Ms Word, but also to a host of libraries that expect docx documents.

Steps to Reproduce:
1. Create a file with a **non-English version of LibreOffice**, with a heading.
2. Export the file as .docx
3. Give the file to pandoc: pandoc xxxx.docx. The headings will be missed and converted into plain text.



Actual Results:  
In styles.xml:

<w:style w:type="paragraph" w:styleId="Titre1">
  <w:name w:val="Titre 1"/>
  <w:basedOn w:val="Titre"/>
  <w:next w:val="Corpsdetexte"/>
  <w:pPr>
    <w:numPr>
      <w:ilvl w:val="0"/>
      <w:numId w:val="1"/>
    </w:numPr>
    <w:spacing w:before="240" w:after="120"/>
    <w:outlineLvl w:val="0"/>
    <w:outlineLvl w:val="0"/>
  </w:pPr>
  <w:rPr>
    <w:b/>
    <w:bCs/>
    <w:sz w:val="36"/>
    <w:szCs w:val="36"/>
  </w:rPr>
</w:style>

Expected Results:
<w:style w:type="paragraph" w:styleId="Titre1">
   <w:name w:val="heading 1"/>
   <w:basedOn w:val="Titre"/>
   <w:next w:val="Corpsdetexte"/>
   <w:pPr>
     <w:numPr>
       <w:numId w:val="1"/>
     </w:numPr>
     <w:outlineLvl w:val="0"/>
   </w:pPr>
   <w:rPr>
     <w:b/>
     <w:bCs/>
     <w:sz w:val="36"/>
     <w:szCs w:val="36"/>
   </w:rPr>
 </w:style>


Reproducible: Always


User Profile Reset: No



Additional Info:
1. Word (as an application) is itself tolerant toward that kind of omission and it will spontaneously correct it.

2. In general, feeding a docx file to pandoc is a reliable litmus test.

3. I have checked the release notes to see whether this bug has been fixed in later versions, but I did not find any mention of it.


User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:58.0) Gecko/20100101 Firefox/58.0
Comment 1 Alex Thurgood 2018-02-23 08:24:36 UTC
@fralau: pandoc is only available via homebrew, which is not something that most OSX users commonly install on their OSX boxes.

What other easily installable tool can one use to test your claim?

From your description, it sounds like the problem lies with pandoc, not with LO.

If I create a test docx document by following your description and then open it in Word 16.10 (180210), I see the correct heading.
Comment 2 Alex Thurgood 2018-02-23 08:25:33 UTC
Tested with

Version: 6.0.1.1
Build ID: 60bfb1526849283ce2491346ed2aa51c465abfe6
Threads CPU : 4; OS : Mac OS X 10.13.3; UI Render : par défaut; 
Locale : fr-FR (fr_FR.UTF-8); Calc: group
Comment 3 fralau 2018-02-23 08:45:58 UTC
I used pandoc to illustrate the problem, but the problem lies in the XML files generated by LO. Any other tool might make the point.
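
For example, the style names can be read straight out of the archive without pandoc. The following is only a minimal sketch using the Python standard library. The function name list_paragraph_styles is made up for illustration, and it assumes the package keeps its styles in word/styles.xml, as the files discussed here do.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the "w:" prefix in styles.xml)
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_paragraph_styles(docx):
    """Return (styleId, name) pairs for the paragraph styles of a docx.

    `docx` may be a file path or a binary file-like object.
    Assumes the styles part is stored as word/styles.xml.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/styles.xml"))

    pairs = []
    for style in root.iter(f"{{{W}}}style"):
        if style.get(f"{{{W}}}type") != "paragraph":
            continue  # skip character, table and numbering styles
        name = style.find(f"{{{W}}}name")
        value = name.get(f"{{{W}}}val") if name is not None else None
        pairs.append((style.get(f"{{{W}}}styleId"), value))
    return pairs
```

Calling list_paragraph_styles("xxxx.docx") on the exported file should show whether the heading style carries a localized name such as "Titre 1" or a standard one.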


The fact that Word copes well with the styles in docx files generated by LO is a facility of Word (it translates the foreign style names back into standard styles). It can do that, in my understanding, thanks to "latent styles", i.e. an underlying model that is not in the docx/XML file. In other words, MS Word has a "fault-tolerant" feature that is not part of the ISO spec for docx files. Unfortunately, other apps that rely on the standard spec for docx files (as they should) will fail.


I found this explanation useful:
http://python-docx.readthedocs.io/en/latest/user/styles-understanding.html


But the underlying issue is HOW one would define a compliant docx file. There is a difference between impunity and legality: "Compliant with ISO/IEC 29500" and "loads satisfactorily into Ms Word" are NOT interchangeable definitions. In the spirit of OO, the first is safer: it is a relatively stable definition, it has general agreement, and if it changes, everyone will be notified in time. By contrast, doing something illegal but with impunity may elicit controversy: indeed, this feature of correcting stylesheets in Word is largely undocumented, and they might change or alter it without notice.
Comment 4 fralau 2018-02-23 09:05:41 UTC
I might add another point: while it is of course essential to make sure that Ms Word can read the docx files produced, it is *also* important to meet the specification of Office Open XML concerning styles, so that other open source projects can benefit from a docx file produced by LO.

Ignoring that requirement might exclude other open source software from the LO ecosystem and indirectly favor closed source software (Ms Word).
Comment 5 Alex Thurgood 2018-02-23 11:37:22 UTC
OK, so I'm not a developer, merely a volunteer QAer, so how do I go about confirming the problem you experience ? Therein lies the immediate issue, irrespective of its merits.


I would add from a personal viewpoint that what you suggest sounds like it implies including extra xml information that is currently not stored in ODF documents - this can only therefore make them even larger still and more verbose when that information gets mapped to docx open xml. Surely, that is something we would wish to avoid for performance reasons (it is bad enough already) ?
Comment 6 Alex Thurgood 2018-02-23 11:38:14 UTC
@Mikos : any thoughts on this ?
Comment 7 Alex Thurgood 2018-02-23 11:39:10 UTC
(In reply to Alex Thurgood from comment #6)
> @Mikos : any thoughts on this ?

@Miklos
Comment 8 Xisco Faulí 2018-02-23 11:50:41 UTC
Hi,
Thanks for reporting the issue.
I think it's a dupe of bug 44451.

*** This bug has been marked as a duplicate of bug 44451 ***
Comment 9 fralau 2018-02-23 14:34:13 UTC
I don't think it's a duplicate. While there is a rough similarity in functionality (and in the general question), the other ones have to do with references and tables, while this one is very narrow in its technical scope: styles.

The place in the XML file where the bug occurred has been identified and a solution has been proposed.

Marking it as a duplicate of broad questions would result in the diagnostic information on this bug being lost.
Comment 10 fralau 2018-02-23 14:39:37 UTC
(In reply to Alex Thurgood from comment #5)

> I would add from a personal viewpoint that what you suggest sounds like it
> implies including extra xml information that is currently not stored in ODF
> documents - this can only therefore make them even larger still and more
> verbose when that information gets mapped to docx open xml. Surely, that is
> something we would wish to avoid for performance reasons (it is bad enough
> already) ?

This is a valid objection in principle, but the additional information required in practice is very minimal (a mere indirection).


Instead of saying:

" Here is 'Titre 1' ..."

You basically have to say:

1. "'Titre 1' => 'Heading 1'"
2. "Here is 'Heading 1': ..."
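
As a rough illustration of that indirection (not LO's actual export code), the sketch below keeps the localized styleId but writes a standard name into w:name. The lookup table CANONICAL_NAMES and the helper style_element are hypothetical, the table covers only two entries, and the lowercase spelling simply follows the Expected Results above.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)

# Hypothetical lookup table; a real exporter would need every built-in style.
CANONICAL_NAMES = {
    "Titre 1": "heading 1",
    "Titre 2": "heading 2",
}


def style_element(localized_name):
    """Build a <w:style> whose id stays localized but whose name is standard."""
    style = ET.Element(f"{{{W}}}style")
    style.set(f"{{{W}}}type", "paragraph")
    # The styleId is what document.xml refers to, so it is left localized.
    style.set(f"{{{W}}}styleId", localized_name.replace(" ", ""))
    name = ET.SubElement(style, f"{{{W}}}name")
    # Unknown (user-defined) styles keep their own name.
    name.set(f"{{{W}}}val", CANONICAL_NAMES.get(localized_name, localized_name))
    return style
```

Serializing style_element("Titre 1") with ET.tostring gives a style with styleId "Titre1" and name "heading 1", which is the one extra step described above.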
Comment 11 fralau 2018-02-23 14:47:39 UTC
Another possibility, if you don't want to touch the structure of the XML file produced, might be to simply have a conversion table in LO (I guess there already is one) and convert all standard styles to their English names.

I surmise LO has a similar issue with style names in native mode, so it might have some translation mechanism already implemented?

This shouldn't change anything for ordinary users, since Word would automatically "re-localize" their styles upon next opening. I haven't tested whether LO does it as well, but I guess it would.
Comment 12 eisa01 2018-04-08 10:37:32 UTC
I changed my UI language to German, and the export from 6.0.2.1 seems to be compliant with your expected results?

I extracted the styles.xml from the exported docx, opened it in BBEdit and reflowed it

Can you retest in a current version of LibreOffice and paste the info in About LibreOffice?

  <w:style w:type="paragraph" w:styleId="Berschrift1">
    <w:name w:val="Heading 1" />
    <w:basedOn w:val="Berschrift" />
    <w:next w:val="Textkrper" />
    <w:qFormat />
    <w:pPr>
      <w:numPr>
        <w:ilvl w:val="0" />
        <w:numId w:val="1" />
      </w:numPr>
      <w:spacing w:before="240" w:after="120" />
      <w:outlineLvl w:val="0" />
    </w:pPr>
    <w:rPr>
      <w:b />
      <w:bCs />
      <w:sz w:val="36" />
      <w:szCs w:val="36" />
    </w:rPr>
  </w:style>

Version: 6.0.2.1
Build-ID: f7f06a8f319e4b62f9bc5095aa112a65d2f3ac89
CPU-Threads: 2; BS: Mac OS X 10.12.6; UI-Render: Standard; 
Gebietsschema: en-US (en_US.UTF-8); Calc: group
Comment 13 QA Administrators 2018-11-05 16:08:02 UTC Comment hidden (obsolete)
Comment 14 fralau 2018-11-05 21:09:59 UTC
I have checked (on 6.1.3.2) and, indeed, the output seems the expected one, with the reference to the canonic style Heading1.

This is how Microsoft Word presents it:

  <w:style w:type="paragraph" w:styleId="Titre1">
    <w:name w:val="heading 1"/>
    <w:basedOn w:val="Normal"/>
    <w:next w:val="Normal"/>
    <w:link w:val="Titre1Car"/>
    <w:uiPriority w:val="9"/>
    <w:qFormat/>
    <w:rsid w:val="00D363B2"/>
    <w:pPr>
      <w:keepNext/>
      <w:keepLines/>
      <w:spacing w:before="240"/>
      <w:outlineLvl w:val="0"/>
    </w:pPr>
    <w:rPr>
      <w:rFonts w:asciiTheme="majorHAnsi" w:eastAsiaTheme="majorEastAsia" w:hAnsiTheme="majorHAnsi" w:cstheme="majorBidi"/>
      <w:color w:val="365F91" w:themeColor="accent1" w:themeShade="BF"/>
      <w:sz w:val="32"/>
      <w:szCs w:val="32"/>
    </w:rPr>
  </w:style>

And this is how LibreOffice presents it:

  <w:style w:type="paragraph" w:styleId="Titre1">
    <w:name w:val="Heading 1"/>
    <w:basedOn w:val="Normal"/>
    <w:next w:val="Normal"/>
    <w:link w:val="Titre1Car"/>
    <w:uiPriority w:val="9"/>
    <w:qFormat/>
    <w:rsid w:val="00d363b2"/>
    <w:pPr>
      <w:keepNext w:val="true"/>
      <w:keepLines/>
      <w:spacing w:before="240" w:after="0"/>
      <w:outlineLvl w:val="0"/>
    </w:pPr>
    <w:rPr>
      <w:rFonts w:ascii="Calibri" w:hAnsi="Calibri" w:eastAsia="MS ゴシック" w:cs="" w:asciiTheme="majorHAnsi" w:cstheme="majorBidi" w:eastAsiaTheme="majorEastAsia" w:hAnsiTheme="majorHAnsi"/>
      <w:color w:val="365F91" w:themeColor="accent1" w:themeShade="bf"/>
      <w:sz w:val="32"/>
      <w:szCs w:val="32"/>
    </w:rPr>
  </w:style>

It's pretty much identical!

This should make it possible for pandoc to process it correctly... And yet pandoc finds a difference between the two. It correctly converts the docx produced by Word:

$ pandoc simple.docx
<h1 id="this-is-the-title-1">This is the title 1</h1>
<h2 id="this-is-the-title-2">This is the title 2</h2>
<p>Hello this is the paragraph hello. …</p>

while for the same docx produced by LibreOffice:

$ pandoc simple.docx
<p>This is the title 1</p>
<p>This is the title 2</p>
<p>Hello this is the paragraph hello.</p>

Admittedly, passing the pandoc conversion might not be in the spec of LibreOffice Writer. But, as Data would say, "This is intriguing."
Comment 15 Dieter 2018-11-09 07:30:12 UTC
(In reply to fralau from comment #14)
> I have checked (on 6.1.3.2) and, indeed, the output seems the expected one,
> with the reference to the canonic style Heading1.

Does that mean that we can close this bug? If not, what problem remains? => NEEDINFO
Comment 16 fralau 2018-12-26 19:37:32 UTC
I have checked this issue again by comparing the styles.xml for the same document generated by MS Word and by LibreOffice, and I may have found what is wrong! The problem is not with document.xml.

The predefined styles are defined, in this extract from the styles.xml generated by MS Word, with lower case:

  <w:latentStyles w:defLockedState="0" w:defUIPriority="99" w:defSemiHidden="0" w:defUnhideWhenUsed="0" w:defQFormat="0" w:count="375">
    <w:lsdException w:name="Normal" w:uiPriority="0" w:qFormat="1"/>
    <w:lsdException w:name="heading 1" w:uiPriority="9" w:qFormat="1"/>
    <w:lsdException w:name="heading 2" w:semiHidden="1" w:uiPriority="9" w:unhideWhenUsed="1" w:qFormat="1"/>
...

Whereas LibreOffice used capitals:
  <w:style w:type="paragraph" w:styleId="Titre1">
    <w:name w:val="Heading 1"/>
    <w:basedOn w:val="Titre"/>
    <w:next w:val="Corpsdetexte"/>
    <w:qFormat/>
  ...

Is that the cause of the issue? Just to verify this hypothesis, I manually changed 'Heading 1' and 'Heading 2' into 'heading 1' and 'heading 2' in styles.xml, regenerated the docx file (by zipping, etc.), and then ran that document through pandoc. And it worked, pandoc recognized the standard headings!

Conclusion: in order to make the styles.xml file really standard, the w:val attribute of the w:name tag should use lowercase, e.g.:

<w:name w:val="heading 1"/>

It seems that would fix the issue.
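
For anyone who wants to repeat the manual experiment, it could be scripted roughly as follows. This is only a sketch of the workaround I applied by hand: the function name lowercase_heading_names is made up, and it assumes that styles.xml uses the w: prefix and writes the name as w:val="Heading N".

```python
import re
import zipfile

# Matches <w:name w:val="Heading N" ... (assumes the usual "w:" prefix)
_HEADING_NAME = re.compile(rb'(<w:name\s+w:val=")Heading (\d")')


def lowercase_heading_names(src, dst):
    """Copy the docx `src` to `dst`, lowercasing heading names in styles.xml.

    `src` and `dst` may be file paths or binary file-like objects.
    Every other part of the package is copied unchanged.
    """
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/styles.xml":
                # "Heading 1" -> "heading 1", "Heading 2" -> "heading 2", ...
                data = _HEADING_NAME.sub(rb"\1heading \2", data)
            zout.writestr(item, data)
```

Running lowercase_heading_names("simple.docx", "simple-fixed.docx") and then passing the new file to pandoc reproduces the test described above.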
Comment 17 Dieter 2020-02-14 14:15:59 UTC
Hello Fralau, a new major release of LibreOffice has become available since this bug was reported. Could you please try to reproduce it with the latest version of LibreOffice from https://www.libreoffice.org/download/libreoffice-fresh/ ? I have set the bug's status to 'NEEDINFO'. Please change it back to 'UNCONFIRMED' if the bug is still present in the latest version.
Comment 18 eisa01 2020-02-15 18:27:11 UTC
Ok, so based on comment #14 and comment #16 this is fixed.

The remaining issue seems to be that pandoc doesn't recognize the "Heading 1" style as equivalent to "heading 1"
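
A consumer that wants to accept both spellings could compare the names without regard to case. The snippet below is only a sketch of that idea in Python, not pandoc's actual code (pandoc is written in Haskell), and the helper name heading_level is made up.

```python
import re

# Accepts "heading 1" ... "heading 9" in any letter case
_HEADING = re.compile(r"heading ([1-9])", re.IGNORECASE)


def heading_level(style_name):
    """Return the heading level for a w:name value, or None if not a heading."""
    match = _HEADING.fullmatch(style_name.strip())
    return int(match.group(1)) if match else None
```

With this check, both "Heading 1" and "heading 1" map to level 1, while a localized name such as "Titre 1" is still not treated as a heading.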

Briefly googling, the reference spec seems to use "Heading 1" as an example

To me, this seems like a bug in how pandoc parses the style names. Word has picked up my custom formats for Heading 1 correctly and doesn't show a duplicate default

Setting as resolved works for me as it was a bug in the LO behavior, but this new issue seems to be a parsing bug in pandoc

fralau, I would submit the bug to pandoc

Version: 7.0.0.0.alpha0+
Build ID: 0cb4f304abf6f8dd6b40eb800788d2fe80581813
CPU threads: 4; OS: Mac OS X 10.14.6; UI render: default; VCL: osx; 
Locale: en-US (en_US.UTF-8); UI-Language: en-US
Calc: threaded