Bug 161509 - Built-in Microsoft Word style names are remapped on import but not on export
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 7.4.7.2 release (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: Mike Kaganski
URL:
Whiteboard: target:25.2.0 target:24.8.0.0.beta2
Keywords:
Depends on:
Blocks:
 
Reported: 2024-06-11 12:31 UTC by David Huggins-Daines
Modified: 2024-06-13 20:36 UTC (History)
CC List: 0 users

See Also:
Crash report or crash signature:


Description David Huggins-Daines 2024-06-11 12:31:15 UTC
Description:
See https://github.com/python-openxml/python-docx/issues/494#issuecomment-2160619976 for some context.

LibreOffice will remap Microsoft Word's style names to its own when importing DOCX.  For example, "heading 1" becomes "Heading 1".  The full table appears to be in core/sw/source/writerfilter/dmapper/StyleSheetTable.cxx:1555 (StyleSheetTable::ConvertStyleName)

The comment in that function implies that the names will be converted back on export.  Unfortunately this is not entirely correct.  They are remapped, but not to the right names, for instance, "Heading 1" remains "Heading 1" and not "heading 1".  See core/sw/source/filter/ww8/styles.cxx (GetStiNames).  Again the comment does not match the code, it says "keep in sync with StyleSheetTable::ConvertStyleName" but it is not in sync.

Surprisingly for a Microsoft product, the names are case-sensitive.  This doesn't cause any actual issues when loading the files in Word since the style definitions are copied into the document.  However it causes unexpected errors when using third-party code such as the python-docx module that expect the Microsoft internal names for built-in styles.

Steps to Reproduce:
1. Create a file in Microsoft Word with a "Heading 1"
2. Load the file in LibreOffice
3. Re-export the file as DOCX with LibreOffice

Actual Results:
word/styles.xml in the output has:

  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="Heading 1"/>
    <w:basedOn w:val="Normal"/>

Expected Results:
It should have (as in the original Word file):

  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/>
    <w:basedOn w:val="Normal"/>


Reproducible: Always


User Profile Reset: No

Additional Info:
Version: 7.4.7.2 / LibreOffice Community
Build ID: 40(Build:2)
CPU threads: 4; OS: Linux 6.1; UI render: default; VCL: gtk3
Locale: en-CA (en_CA.UTF-8); UI: en-US
Debian package version: 4:7.4.7-1+deb12u2
Calc: threaded
Comment 1 Mike Kaganski 2024-06-11 13:39:53 UTC
(In reply to David Huggins-Daines from comment #0)
> Surprisingly for a Microsoft product, the names are case-sensitive.

Sorry, how is that? Are the names case-sensitive in Word, or in the third-party Python library?
Comment 2 Mike Kaganski 2024-06-11 13:56:48 UTC
(In reply to David Huggins-Daines from comment #0)
> The comment in that function implies that the names will be converted back
> on export.  Unfortunately this is not entirely correct.  They are remapped,
> but not to the right names, for instance, "Heading 1" remains "Heading 1"
> and not "heading 1"

This ignores the fact that the mapping on import is many-to-one, while the mapping on export is one-to-one. Do you claim that there is no "Heading 1" -> "Heading 1" mapping on import? Or is there evidence that the lowercase "heading 1" used by the current Word version (when the UI shows "Heading 1", and the attribute we are discussing is not an identifier, but the *UI name*) is the only spelling in use? Otherwise, what is the "not entirely correct" based upon?

> See core/sw/source/filter/ww8/styles.cxx
> (GetStiNames).  Again the comment does not match the code, it says "keep in
> sync with StyleSheetTable::ConvertStyleName" but it is not in sync.

Really? What is not in sync?
Comment 3 David Huggins-Daines 2024-06-11 15:16:29 UTC
Hi, I’m surprised by the defensiveness of your response.  I am just trying to solve a problem here and don’t claim to be an expert in anything, particularly the inner workings of Word or the Open Office XML file format standard.  So, if you are certain that I’m wrong, I’d like to know!

In theory these are UI names, but in practice, built-in styles (the ones that have latent style definitions) are special - the name used by Word in <w:name> is *not* always the same as the one shown in the Word UI, because of internationalization among other things.  

For example if I create a document with Word in French locale, the UI shows "Titre 1" but word/styles.xml still has:

<w:name w:val="heading 1" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"/>

This is why python-docx special-cases them:

https://github.com/python-openxml/python-docx/blob/0cf6d71fb47ede07ecd5de2a8655f9f46c5f083d/src/docx/styles/__init__.py#L8

See also the note here:

https://python-docx.readthedocs.io/en/latest/user/styles-using.html
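
As a rough illustration of why such special-casing breaks on a re-exported file, here is a small sketch of a consumer that translates the UI name to the internal name and then does an exact lookup. It is not python-docx's actual code: the alias table is a hand-picked subset and the function names are hypothetical.

```python
# Hypothetical subset of aliases: (name shown in the UI, name stored in <w:name>).
STYLE_ALIASES = (
    ("Heading 1", "heading 1"),
    ("Heading 2", "heading 2"),
    ("Header", "header"),
    ("Footer", "footer"),
    ("Caption", "caption"),
)
UI_TO_INTERNAL = dict(STYLE_ALIASES)


def ui_to_internal(ui_name):
    """Translate a UI style name to the stored name; other names pass through."""
    return UI_TO_INTERNAL.get(ui_name, ui_name)


def find_style(stored_names, ui_name):
    """Exact, case-sensitive lookup of the translated name; None if absent."""
    wanted = ui_to_internal(ui_name)
    return wanted if wanted in stored_names else None


if __name__ == "__main__":
    word_file = {"Normal", "heading 1"}         # names as written by Word
    reexported_file = {"Normal", "Heading 1"}   # names as reported in this bug
    print(find_style(word_file, "Heading 1"))
    print(find_style(reexported_file, "Heading 1"))
```

With the Word-style names the lookup succeeds, and with the re-exported names it returns None, which is the kind of unexpected error described in this report.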

I don’t really know what you mean by a "many-to-one mapping on input". Again, built-in styles are special, they have conventional names, and that is why LibreOffice does the mapping in the first place.  Sure, in *theory* there is a mapping of "Heading 1" to itself, but in practice, Word will never allow you to create a custom style with "Heading 1" (as opposed to "heading 1") as its <w:name>.  I invite you to try it.

If you think the maintainer of python-docx is wrong I would be happy to know, and to know why.  Otherwise please try to have an open mind.
Comment 4 David Huggins-Daines 2024-06-11 15:18:27 UTC
Perhaps also I should have said "case-preserving" and not "case-sensitive".  Sorry if that wasn't clear.

Anyway there is a workaround, so this isn't a serious problem, but it does appear to be an inconsistency with Word's behaviour when exporting DOCX.
Comment 5 David Huggins-Daines 2024-06-11 15:39:25 UTC
But also, I said "case-sensitive" because *the comment in LibreOffice source code* says it:

    // Also very unclear: at least in DOCX, style names appear to be case
    // sensitive; if Word imports 2 styles that have the same case-insensitive
    // name as a built-in style, it renames one of them by appending a number.

Again, my reading of the comment is that (again, consider that these are *special* "UI Names" for built-in styles, custom styles are not affected) the intent is to map "heading 1" to "Heading 1" and then subsequently to RES_POOLCOLL_HEADLINE1 or ww:sti:stiLev1 (again, please don't flame me if I'm wrong, I am *not* an expert in LibreOffice internals) and then to map those constants back to "heading 1" on DOCX export, but this isn't what happens.

If you understand the comment differently please explain why.
Comment 6 Mike Kaganski 2024-06-11 15:44:12 UTC
(In reply to David Huggins-Daines from comment #3)
> Hi, I’m surprised by the defensiveness of your response.

Sorry, must be something about me being the last one who touched that area, tidied that up, made sure to sync that all together, and documented all the relations between different structures and functions, and then read all those "does not match" and "not correct".

> I don’t really know what you mean by a "many-to-one mapping on input".

I meant the code that you pointed to:

https://github.com/LibreOffice/core/blob/4c179c1e62778274e7604b3317300f6df809cd80/sw/source/writerfilter/dmapper/StyleSheetTable.cxx#L1560

It maps the possible inputs from DOCX: both "heading 1" and "Heading 1" in OOXML are mapped to "Heading 1" in Writer. Thus, on export, it is not known which of the two was mapped initially.
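
A toy model of that asymmetry, for illustration only. The dictionaries below stand in for the real tables in StyleSheetTable.cxx and styles.cxx and contain a single entry each; their names are invented for this sketch.

```python
# Import: several DOCX spellings collapse onto one Writer name (many-to-one).
IMPORT_MAP = {"heading 1": "Heading 1", "Heading 1": "Heading 1"}

# Export: each Writer name gets exactly one DOCX spelling (one-to-one).
EXPORT_MAP_REPORTED = {"Heading 1": "Heading 1"}   # behaviour described in this bug
EXPORT_MAP_LOWERCASE = {"Heading 1": "heading 1"}  # spelling found in Word's own files


def round_trip(docx_name, export_map):
    """Import a DOCX style name into Writer, then export it again."""
    writer_name = IMPORT_MAP.get(docx_name, docx_name)
    return export_map.get(writer_name, writer_name)


if __name__ == "__main__":
    for source in ("heading 1", "Heading 1", "My Style"):
        print(source, "->", round_trip(source, EXPORT_MAP_REPORTED),
              "/", round_trip(source, EXPORT_MAP_LOWERCASE))
```

Because the import step discards the original spelling, the export table has to pick one spelling for every input, so the only open question is which spelling that should be.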

Anyway, I didn't dismiss your report (if I thought it is plain wrong, I'd close it e.g. NOTABUG). I need to check it more.
Comment 7 Mike Kaganski 2024-06-11 15:52:19 UTC
(In reply to David Huggins-Daines from comment #5)
>     // Also very unclear: at least in DOCX, style names appear to be case
>     // sensitive; if Word imports 2 styles that have the same
> case-insensitive
>     // name as a built-in style, it renames one of them by appending a
> number.

As far as I read it, Michael wrote about some inconsistency of OOXML documentation (*claiming* the case sensitivity) vs. Word implementation (actually treating case-different names as identical, and thus auto-renaming).

Anyway, the task is to find out whether the lowercase special names are the "canonical" ones.
Comment 8 David Huggins-Daines 2024-06-11 15:53:48 UTC
Thanks! I understand, my tone was a bit overconfident in the initial report.  Sorry for my overreaction.

Another data point, definitely "case-sensitive" isn't the right term for Word's behaviour.  For example, I notice this sequence:

- create a document in Word: <w:name w:val="heading 1">
- re-export with LibreOffice: <w:name w:val="Heading 1">
- re-load the exported document in Word and save: <w:name w:val="heading 1">

So Word (at least the Office 365 version) is able to match "Heading 1"/"heading 1".

I'll let you recheck at your leisure, again, it's not an urgent problem.
Comment 9 Mike Kaganski 2024-06-12 06:03:20 UTC
(In reply to David Huggins-Daines from comment #4)
> Perhaps also I should have said "case-preserving" and not "case-sensitive". 
> Sorry if that wasn't clear.

(In reply to David Huggins-Daines from comment #8)
> Another data point, definitely "case-sensitive" isn't the right term for
> Word's behaviour.  For example, I notice this sequence:
> 
> - create a document in Word: <w:name w:val="heading 1">
> - re-export with LibreOffice: <w:name w:val="Heading 1">
> - re-load the exported document in Word and save: <w:name w:val="heading 1">

:-D So even "case-preserving" would not be correct in relation to Word (as well as to Writer, sure) ;-D

(In reply to David Huggins-Daines from comment #3)
> In theory these are UI names, but in practice, built-in styles (the ones
> that have latent style definitions) are special

Let me use the w:latentStyles element of a Word-generated document as a reference, to fix the export mapping part. Using Word 2016 (that I have access to), I have this list:

Normal
heading 1
heading 2
heading 3
heading 4
heading 5
heading 6
heading 7
heading 8
heading 9
index 1
index 2
index 3
index 4
index 5
index 6
index 7
index 8
index 9
toc 1
toc 2
toc 3
toc 4
toc 5
toc 6
toc 7
toc 8
toc 9
Normal Indent
footnote text
annotation text
header
footer
index heading
caption
table of figures
envelope address
envelope return
footnote reference
annotation reference
line number
page number
endnote reference
endnote text
table of authorities
macro
toa heading
List
List Bullet
List Number
List 2
List 3
List 4
List 5
List Bullet 2
List Bullet 3
List Bullet 4
List Bullet 5
List Number 2
List Number 3
List Number 4
List Number 5
Title
Closing
Signature
Default Paragraph Font
Body Text
Body Text Indent
List Continue
List Continue 2
List Continue 3
List Continue 4
List Continue 5
Message Header
Subtitle
Salutation
Date
Body Text First Indent
Body Text First Indent 2
Note Heading
Body Text 2
Body Text 3
Body Text Indent 2
Body Text Indent 3
Block Text
Hyperlink
FollowedHyperlink
Strong
Emphasis
Document Map
Plain Text
E-mail Signature
HTML Top of Form
HTML Bottom of Form
Normal (Web)
HTML Acronym
HTML Address
HTML Cite
HTML Code
HTML Definition
HTML Keyboard
HTML Preformatted
HTML Sample
HTML Typewriter
HTML Variable
Normal Table
annotation subject
No List
Outline List 1
Outline List 2
Outline List 3
Table Simple 1
Table Simple 2
Table Simple 3
Table Classic 1
Table Classic 2
Table Classic 3
Table Classic 4
Table Colorful 1
Table Colorful 2
Table Colorful 3
Table Columns 1
Table Columns 2
Table Columns 3
Table Columns 4
Table Columns 5
Table Grid 1
Table Grid 2
Table Grid 3
Table Grid 4
Table Grid 5
Table Grid 6
Table Grid 7
Table Grid 8
Table List 1
Table List 2
Table List 3
Table List 4
Table List 5
Table List 6
Table List 7
Table List 8
Table 3D effects 1
Table 3D effects 2
Table 3D effects 3
Table Contemporary
Table Elegant
Table Professional
Table Subtle 1
Table Subtle 2
Table Web 1
Table Web 2
Table Web 3
Balloon Text
Table Grid
Table Theme
Placeholder Text
No Spacing
Light Shading
Light List
Light Grid
Medium Shading 1
Medium Shading 2
Medium List 1
Medium List 2
Medium Grid 1
Medium Grid 2
Medium Grid 3
Dark List
Colorful Shading
Colorful List
Colorful Grid
Light Shading Accent 1
Light List Accent 1
Light Grid Accent 1
Medium Shading 1 Accent 1
Medium Shading 2 Accent 1
Medium List 1 Accent 1
Revision
List Paragraph
Quote
Intense Quote
Medium List 2 Accent 1
Medium Grid 1 Accent 1
Medium Grid 2 Accent 1
Medium Grid 3 Accent 1
Dark List Accent 1
Colorful Shading Accent 1
Colorful List Accent 1
Colorful Grid Accent 1
Light Shading Accent 2
Light List Accent 2
Light Grid Accent 2
Medium Shading 1 Accent 2
Medium Shading 2 Accent 2
Medium List 1 Accent 2
Medium List 2 Accent 2
Medium Grid 1 Accent 2
Medium Grid 2 Accent 2
Medium Grid 3 Accent 2
Dark List Accent 2
Colorful Shading Accent 2
Colorful List Accent 2
Colorful Grid Accent 2
Light Shading Accent 3
Light List Accent 3
Light Grid Accent 3
Medium Shading 1 Accent 3
Medium Shading 2 Accent 3
Medium List 1 Accent 3
Medium List 2 Accent 3
Medium Grid 1 Accent 3
Medium Grid 2 Accent 3
Medium Grid 3 Accent 3
Dark List Accent 3
Colorful Shading Accent 3
Colorful List Accent 3
Colorful Grid Accent 3
Light Shading Accent 4
Light List Accent 4
Light Grid Accent 4
Medium Shading 1 Accent 4
Medium Shading 2 Accent 4
Medium List 1 Accent 4
Medium List 2 Accent 4
Medium Grid 1 Accent 4
Medium Grid 2 Accent 4
Medium Grid 3 Accent 4
Dark List Accent 4
Colorful Shading Accent 4
Colorful List Accent 4
Colorful Grid Accent 4
Light Shading Accent 5
Light List Accent 5
Light Grid Accent 5
Medium Shading 1 Accent 5
Medium Shading 2 Accent 5
Medium List 1 Accent 5
Medium List 2 Accent 5
Medium Grid 1 Accent 5
Medium Grid 2 Accent 5
Medium Grid 3 Accent 5
Dark List Accent 5
Colorful Shading Accent 5
Colorful List Accent 5
Colorful Grid Accent 5
Light Shading Accent 6
Light List Accent 6
Light Grid Accent 6
Medium Shading 1 Accent 6
Medium Shading 2 Accent 6
Medium List 1 Accent 6
Medium List 2 Accent 6
Medium Grid 1 Accent 6
Medium Grid 2 Accent 6
Medium Grid 3 Accent 6
Dark List Accent 6
Colorful Shading Accent 6
Colorful List Accent 6
Colorful Grid Accent 6
Subtle Emphasis
Intense Emphasis
Subtle Reference
Intense Reference
Book Title
Bibliography
TOC Heading
Plain Table 1
Plain Table 2
Plain Table 3
Plain Table 4
Plain Table 5
Grid Table Light
Grid Table 1 Light
Grid Table 2
Grid Table 3
Grid Table 4
Grid Table 5 Dark
Grid Table 6 Colorful
Grid Table 7 Colorful
Grid Table 1 Light Accent 1
Grid Table 2 Accent 1
Grid Table 3 Accent 1
Grid Table 4 Accent 1
Grid Table 5 Dark Accent 1
Grid Table 6 Colorful Accent 1
Grid Table 7 Colorful Accent 1
Grid Table 1 Light Accent 2
Grid Table 2 Accent 2
Grid Table 3 Accent 2
Grid Table 4 Accent 2
Grid Table 5 Dark Accent 2
Grid Table 6 Colorful Accent 2
Grid Table 7 Colorful Accent 2
Grid Table 1 Light Accent 3
Grid Table 2 Accent 3
Grid Table 3 Accent 3
Grid Table 4 Accent 3
Grid Table 5 Dark Accent 3
Grid Table 6 Colorful Accent 3
Grid Table 7 Colorful Accent 3
Grid Table 1 Light Accent 4
Grid Table 2 Accent 4
Grid Table 3 Accent 4
Grid Table 4 Accent 4
Grid Table 5 Dark Accent 4
Grid Table 6 Colorful Accent 4
Grid Table 7 Colorful Accent 4
Grid Table 1 Light Accent 5
Grid Table 2 Accent 5
Grid Table 3 Accent 5
Grid Table 4 Accent 5
Grid Table 5 Dark Accent 5
Grid Table 6 Colorful Accent 5
Grid Table 7 Colorful Accent 5
Grid Table 1 Light Accent 6
Grid Table 2 Accent 6
Grid Table 3 Accent 6
Grid Table 4 Accent 6
Grid Table 5 Dark Accent 6
Grid Table 6 Colorful Accent 6
Grid Table 7 Colorful Accent 6
List Table 1 Light
List Table 2
List Table 3
List Table 4
List Table 5 Dark
List Table 6 Colorful
List Table 7 Colorful
List Table 1 Light Accent 1
List Table 2 Accent 1
List Table 3 Accent 1
List Table 4 Accent 1
List Table 5 Dark Accent 1
List Table 6 Colorful Accent 1
List Table 7 Colorful Accent 1
List Table 1 Light Accent 2
List Table 2 Accent 2
List Table 3 Accent 2
List Table 4 Accent 2
List Table 5 Dark Accent 2
List Table 6 Colorful Accent 2
List Table 7 Colorful Accent 2
List Table 1 Light Accent 3
List Table 2 Accent 3
List Table 3 Accent 3
List Table 4 Accent 3
List Table 5 Dark Accent 3
List Table 6 Colorful Accent 3
List Table 7 Colorful Accent 3
List Table 1 Light Accent 4
List Table 2 Accent 4
List Table 3 Accent 4
List Table 4 Accent 4
List Table 5 Dark Accent 4
List Table 6 Colorful Accent 4
List Table 7 Colorful Accent 4
List Table 1 Light Accent 5
List Table 2 Accent 5
List Table 3 Accent 5
List Table 4 Accent 5
List Table 5 Dark Accent 5
List Table 6 Colorful Accent 5
List Table 7 Colorful Accent 5
List Table 1 Light Accent 6
List Table 2 Accent 6
List Table 3 Accent 6
List Table 4 Accent 6
List Table 5 Dark Accent 6
List Table 6 Colorful Accent 6
List Table 7 Colorful Accent 6
Mention
Smart Hyperlink
Hashtag
Unresolved Mention
Smart Link

Using an Online Word (starting from OneDrive.com) in the hope that it generates an up-to-date version, I created an empty document, downloaded a copy, and checked its respective element. It gave me ~the same, except for absent five last entries (Mention to Smart Link), so I think it's a good list to use as the reference.
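
For anyone who wants to reproduce such a list, a sketch along these lines should work on the word/styles.xml extracted from a Word-generated document. It assumes the names sit in the w:name attribute of the w:lsdException children of w:latentStyles; the function name and the three-entry sample are illustrative only.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def latent_style_names(styles_xml):
    """Return the w:name of each w:lsdException under w:latentStyles, in order."""
    root = ET.fromstring(styles_xml)
    latent = root.find(f"{{{W_NS}}}latentStyles")
    if latent is None:
        return []
    return [
        exc.get(f"{{{W_NS}}}name")
        for exc in latent.findall(f"{{{W_NS}}}lsdException")
    ]


# Cut-down sample; a real styles.xml carries several hundred entries.
SAMPLE = (
    f'<w:styles xmlns:w="{W_NS}"><w:latentStyles>'
    '<w:lsdException w:name="Normal"/>'
    '<w:lsdException w:name="heading 1"/>'
    '<w:lsdException w:name="toc 1"/>'
    "</w:latentStyles></w:styles>"
)

if __name__ == "__main__":
    for name in latent_style_names(SAMPLE):
        print(name)
```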
Comment 10 David Huggins-Daines 2024-06-12 12:21:35 UTC
(In reply to Mike Kaganski from comment #9)
> 
> :-D So even "case-preserving" would not be correct in relation to Word (as
> well as to Writer, sure) ;-D

Indeed I was all mixed up!  It's quite the opposite, case-insensitive with a canonical casing.  The only surprise being that the canonical case is lowercase :-)
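
That behaviour can be modelled roughly as follows. This is a sketch of what was observed in comment 8, not a description of Word's actual algorithm; the name table is a small subset of the list in comment 9 and the function name is made up.

```python
# Small subset of the built-in names listed in comment 9, in their stored casing.
CANONICAL_NAMES = ("Normal", "heading 1", "toc 1", "List Bullet", "Title")

# Index the canonical spellings by their case-folded form.
_BY_FOLDED = {name.casefold(): name for name in CANONICAL_NAMES}


def canonical_style_name(name):
    """Match a built-in name ignoring case; leave any other name untouched."""
    return _BY_FOLDED.get(name.casefold(), name)


if __name__ == "__main__":
    for candidate in ("Heading 1", "HEADING 1", "list bullet", "My Style"):
        print(candidate, "->", canonical_style_name(candidate))
```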

> Let me use the w:latentStyles element of a Word-generated document as a
> reference, to fix the export mapping part.

> Using an Online Word (starting from OneDrive.com) in the hope that it
> generates an up-to-date version, I created an empty document, downloaded a
> copy, and checked its respective element. It gave me ~the same, except for
> absent five last entries (Mention to Smart Link), so I think it's a good
> list to use as the reference.

Yes, this seems like the right approach, since the latent styles in an empty document should correspond to the built-in styles.  I get the same list from Office365 in French, so these should be the canonical names (which get localized after the fact...)

I can ask someone for a file from a desktop Word to double-check if necessary.
Comment 11 Mike Kaganski 2024-06-12 12:50:49 UTC
Hmm, but now, without further work, the style ids would have wrong case:

  <w:style w:type="paragraph" w:styleId="heading1">
    <w:name w:val="heading 1"/>
    <w:basedOn w:val="Normal"/>
Comment 12 David Huggins-Daines 2024-06-12 13:01:35 UTC
(In reply to Mike Kaganski from comment #11)
> Hmm, but now, without further work, the style ids would have wrong case:
> 
>   <w:style w:type="paragraph" w:styleId="heading1">
>     <w:name w:val="heading 1"/>
>     <w:basedOn w:val="Normal"/>

Hmm... is the styleId derived from the <w:name> in LibreOffice?  My feeling is that the styleId *is* case-sensitive, though I don't know if the built-in styles have canonical styleIds.
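
To make the concern concrete: if the identifier were derived from the name by simply dropping the characters that are not letters or digits, a lowercase name would yield a lowercase identifier. The sketch below is purely hypothetical and does not reflect how LibreOffice actually computes identifiers; the lookup table contains only the two pairs visible in this report.

```python
import re

# (w:name -> w:styleId) pairs seen in the Word output quoted in this report.
BUILTIN_STYLE_IDS = {"heading 1": "Heading1", "Normal": "Normal"}


def naive_style_id(name):
    """Hypothetical derivation: drop everything but ASCII letters and digits."""
    return re.sub(r"[^A-Za-z0-9]", "", name)


def style_id(name):
    """Prefer a known built-in identifier, otherwise fall back to the naive rule."""
    return BUILTIN_STYLE_IDS.get(name, naive_style_id(name))


if __name__ == "__main__":
    print(naive_style_id("heading 1"))  # lowercase identifier
    print(style_id("heading 1"))        # identifier as written by Word
    print(style_id("My Style"))
```

Under these assumptions, switching the exported name to lowercase would need a matching table of identifiers for the built-in styles.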

I think this is why python-docx ends up special-casing a subset of styles to give them lower-case <w:name>.  My intuition is that these styles seem to correspond to the built-in styles in the old binary .DOC format, which you can see in https://github.com/LibreOffice/core/blob/b0aff34ccb12e1af815a059957d7c4f6a14eeaea/sw/source/filter/ww8/wrtw8sty.cxx#L147
Comment 13 Mike Kaganski 2024-06-12 20:36:12 UTC
https://gerrit.libreoffice.org/c/core/+/168755
Comment 14 Commit Notification 2024-06-13 05:49:57 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/e74c94c1a6ae47eb507eec610e231ebb6b02a8be

tdf#161509: Output the same special style names and identifiers as Word

It will be available in 25.2.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 15 Commit Notification 2024-06-13 20:36:38 UTC
Mike Kaganski committed a patch related to this issue.
It has been pushed to "libreoffice-24-8":

https://git.libreoffice.org/core/commit/b3f503c5d88b2314fca9fc9124f918090c8c427b

tdf#161509: Output the same special style names and identifiers as Word

It will be available in 24.8.0.0.beta2.
