Bug 55820 - FILESAVE: Invalid XML produced as hyperlink tag outputted in wrong order
Status: RESOLVED WORKSFORME
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version: 3.5.4 release (earliest affected)
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Not Assigned
URL: http://bugs.debian.org/cgi-bin/bugrep...
Whiteboard:
Keywords:
Depends on:
Blocks: DOCX
Reported: 2012-10-09 20:46 UTC by Rene Engelhard
Modified: 2017-05-14 00:00 UTC

See Also:
Crash report or crash signature:


Attachments
the example file from the debian bug (96.61 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2012-10-09 20:46 UTC, Rene Engelhard
file created with save-as (from bar.docx) (49.41 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2012-10-09 20:47 UTC, Rene Engelhard
Minimal word/ subdirectory triggering the bug. (2.88 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document)
2012-11-30 20:09 UTC, nicolas.boulenguez
clarification of hyperlink closing steps (2.63 KB, patch)
2013-01-13 17:09 UTC, nicolas.boulenguez

Description Rene Engelhard 2012-10-09 20:46:22 UTC
Created attachment 68361
the example file from the debian bug

Reported in http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=690066:

--- snip ---
Package: libreoffice-writer
Version: 1:3.5.4+dfsg-2
Severity: grave
Justification: causes non-serious data loss

The data loss scenario is as follows: 

1) Open the attached docx file, 
2) Edit it, save as docx.

The file is now un-openable by MS Office, and only the first 3 pages
are visible in libreoffice.

The corruption actually does not rely on editing the file; this can be
confirmed using "save-as" to a second docx file; the same apparent
truncation happens.

Unzipping the truncated file, it looks like the user data (i.e. text
of paragraphs) is actually still there, but according to xmllint
word/document.xml does not parse.

    word/document.xml:2: parser error : Opening and ending tag mismatch: hyperlink line 2 and p
    ="18"/><w:szCs w:val="20"/></w:rPr><w:t xml:space="preserve"> </w:t></w:r></w:p>
										   ^
    word/document.xml:2: parser error : Opening and ending tag mismatch: p line 2 and body
    docGrid w:charSpace="0" w:linePitch="360" w:type="default"/></w:sectPr></w:body>
										   ^
    word/document.xml:2: parser error : Opening and ending tag mismatch: body line 2 and document
    rSpace="0" w:linePitch="360" w:type="default"/></w:sectPr></w:body></w:document>
										   ^
    word/document.xml:2: parser error : Premature end of data in tag document line 2
    rSpace="0" w:linePitch="360" w:type="default"/></w:sectPr></w:body></w:document>

I suppose it might in principle be possible to recover the data from
the corrupted XML file. That seems daunting enough that it still seems
to be an RC bug to me.

FWIW, I get this message in the terminal where I started lowriter

  /tmp/buildd/libreoffice-3.5.4+dfsg/writerfilter/source/dmapper/GraphicImport.cxx:1486 failed. Message :GraphicCrop

-- System Information:
Debian Release: wheezy/sid
  APT prefers testing
  APT policy: (900, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-3-amd64 (SMP w/8 CPU cores)
Locale: LANG=en_CA.UTF-8, LC_CTYPE=en_CA.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages libreoffice-writer depends on:
ii  libc6                  2.13-35
ii  libgcc1                1:4.7.1-7
ii  libicu48               4.8.1.1-9
ii  libreoffice-base-core  1:3.5.4+dfsg-2
ii  libreoffice-core       1:3.5.4+dfsg-2
ii  libstdc++6             4.7.1-7
ii  libwpd-0.9-9           0.9.4-3
ii  libwpg-0.2-2           0.2.1-1
ii  libwps-0.2-2           0.2.7-1
ii  libxml2                2.8.0+dfsg1-5
ii  uno-libs3              3.5.4+dfsg-2
ii  ure                    3.5.4+dfsg-2
ii  zlib1g                 1:1.2.7.dfsg-13

Versions of packages libreoffice-writer recommends:
ii  default-jre [java5-runtime]    1:1.6-47
ii  libreoffice-emailmerge         1:3.5.4+dfsg-2
ii  libreoffice-filter-binfilter   1:3.5.4+dfsg-2
ii  libreoffice-java-common        1:3.5.4+dfsg-2
ii  libreoffice-math               1:3.5.4+dfsg-2
ii  openjdk-6-jre [java5-runtime]  6b24-1.11.4-3

Versions of packages libreoffice-writer suggests:
pn  libreoffice-base  <none>
pn  libreoffice-gcj   <none>

Versions of packages libreoffice-core depends on:
ii  fontconfig                       2.9.0-7
ii  fonts-opensymbol                 2:102.2+LibO3.5.4+dfsg-2
ii  libc6                            2.13-35
ii  libcairo2                        1.12.2-2
ii  libcmis-0.2-0                    0.1.0-1+b1
ii  libcurl3-gnutls                  7.26.0-1
ii  libdb5.1                         5.1.29-5
ii  libexpat1                        2.1.0-1
ii  libexttextcat0                   3.2.0-2
ii  libfontconfig1                   2.9.0-7
ii  libfreetype6                     2.4.9-1
ii  libgcc1                          1:4.7.1-7
ii  libglib2.0-0                     2.32.3-1
ii  libgraphite2-2.0.0               1.1.3-1
ii  libgstreamer-plugins-base0.10-0  0.10.36-1
ii  libgstreamer0.10-0               0.10.36-1
ii  libhunspell-1.3-0                1.3.2-4
ii  libhyphen0                       2.8.3-2
ii  libice6                          2:1.0.8-2
ii  libicu48                         4.8.1.1-9
ii  libjpeg8                         8d-1
ii  libmythes-1.2-0                  2:1.2.2-1
ii  libneon27-gnutls                 0.29.6-3
ii  libnspr4                         2:4.9.2-1
ii  libnspr4-0d                      2:4.9.2-1
ii  libnss3                          2:3.13.6-1
ii  libnss3-1d                       2:3.13.6-1
ii  libpng12-0                       1.2.49-1
ii  librdf0                          1.0.15-1+b1
ii  libreoffice-common               1:3.5.4+dfsg-2
ii  librsvg2-2                       2.36.1-1
ii  libsm6                           2:1.2.1-2
ii  libssl1.0.0                      1.0.1c-4
ii  libstdc++6                       4.7.1-7
ii  libx11-6                         2:1.5.0-1
ii  libxext6                         2:1.3.1-2
ii  libxinerama1                     2:1.1.2-1
ii  libxml2                          2.8.0+dfsg1-5
ii  libxrandr2                       2:1.3.2-2
ii  libxrender1                      1:0.9.7-1
ii  libxslt1.1                       1.1.26-14
ii  uno-libs3                        3.5.4+dfsg-2
ii  ure                              3.5.4+dfsg-2
ii  zlib1g                           1:1.2.7.dfsg-13
--- snip ---

I can reproduce this with master as of 20120927
Comment 1 Rene Engelhard 2012-10-09 20:47:44 UTC
Created attachment 68362
file created with save-as (from bar.docx)
Comment 2 Rene Engelhard 2012-10-09 20:48:42 UTC
Confirming this myself (I see it).
Comment 3 Roman Eisele 2012-10-17 11:10:25 UTC
Comment on attachment 68361
the example file from the debian bug


Fixed MIME type.
Comment 4 Roman Eisele 2012-10-17 11:10:44 UTC
Comment on attachment 68362
file created with save-as (from bar.docx)

Fixed MIME type.
Comment 5 nicolas.boulenguez 2012-11-30 00:12:25 UTC
The bug disappears if the
  <w:hyperlink r:id="rId15" w:history="1"/>
markup is removed from word/document.xml.
Comment 6 nicolas.boulenguez 2012-11-30 20:09:21 UTC
Created attachment 70843
Minimal word/ subdirectory triggering the bug.

Please find attached a minimal test case containing only a few suspect lines. Everything outside the word/ subdirectory is identical to what lowriter produces for a freshly created empty document.
You may reproduce quickly with
# unoconv --format=docx --output=converted.docx bar.docx
# unzip -p converted.docx word/document.xml | xmllint --noout -
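For anyone without xmllint at hand, the same well-formedness check can be sketched in Python using only the standard library. This is a rough equivalent of the pipeline above, not part of any LibreOffice tooling; the function name find_xml_error is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET


def find_xml_error(docx, member="word/document.xml"):
    """Return None if `member` inside the docx parses, else the parser message.

    `docx` may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        data = archive.read(member)
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        return str(err)
    return None


# Typical use after the unoconv step:
#   print(find_xml_error("converted.docx") or "well-formed")
```

It only checks that the XML parses, which is what fails here; it does not validate against the OOXML schema.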
Comment 7 nicolas.boulenguez 2013-01-11 23:48:44 UTC
Converting to odt instead of docx does not trigger the bug.
In converted.docx/word/document.xml, the problem is caused by
    </w:hyperlink><w:hyperlink r:id="rId2">
The docx output filter writes these items in the wrong order.
sw/source/filter/ww8/docxattributeoutput.hxx declares two private booleans:
    // close of hyperlink needed
    bool m_closeHyperlinkInThisRun;
    bool m_closeHyperlinkInPreviousRun;
The body uses them to store persistent information across DOM callbacks.
Writing the pair as two letters (T = true, F = false), initialization sets them to FF.
EndURL() sets the former to T.
TF -> EndRun() -> FF (serialize an end element late)
TF -> RunText() -> FT
FT -> EndRun() -> FF (serialize an end element quickly)
I do not understand the details yet, but I strongly suspect that:
- since the hyperlink contains no text, RunText() is never called.
- m_closeHyperlinkInPreviousRun never replaces m_closeHyperlinkInThisRun.
- in EndRun(), serialization of the end element occurs too quickly.
Good night.
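To make the transitions listed in this comment easier to follow, here is a toy model of the two flags in Python. It is only a sketch of the table above under the assumption that the table is complete. It is not the LibreOffice code, and the names HyperlinkFlags and replay are hypothetical.

```python
class HyperlinkFlags:
    """Toy model of the two booleans; hypothetical, not the real exporter."""

    def __init__(self):
        # Both flags start out false.
        self.close_in_this_run = False
        self.close_in_previous_run = False

    def end_url(self):
        # EndURL() sets the first flag.
        self.close_in_this_run = True

    def run_text(self):
        # RunText() moves a pending close from "this run" to "previous run".
        if self.close_in_this_run:
            self.close_in_this_run = False
            self.close_in_previous_run = True

    def end_run(self):
        # EndRun() reports which flag it found set, then clears both.
        if self.close_in_this_run:
            seen = "this_run"
        elif self.close_in_previous_run:
            seen = "previous_run"
        else:
            seen = "none"
        self.close_in_this_run = False
        self.close_in_previous_run = False
        return seen


def replay(calls):
    """Apply callback names in order and collect what each end_run() saw."""
    flags = HyperlinkFlags()
    results = (getattr(flags, name)() for name in calls)
    return [seen for seen in results if seen is not None]
```

In this model, replay(["end_url", "run_text", "end_run"]) reaches EndRun() with only the previous-run flag set, while an empty hyperlink, replay(["end_url", "end_run"]), reaches it with the this-run flag still set. The model says nothing about which of the two paths writes the end tag in the wrong place.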
Comment 8 nicolas.boulenguez 2013-01-13 14:16:53 UTC
My last post was based on the 3.5.4 sources.
The bug is reproducible with 3.6.4.
At least two similar bugs (Bug 52610 and Bug 53175) have been patched in the meantime.
Comment 9 nicolas.boulenguez 2013-01-13 16:58:54 UTC
The attached patch slightly improves readability. This is useful for code that has already caused at least 3 bugs.
If I understand correctly, for each element:
- EndURL() is called at most once,
- then RunText() is called an arbitrary number of times,
- then EndRun() is called exactly once.
If so, the variable m_startedHyperlink may be read before it is set for the
current element. One solution would be to make it local to EndRun().
It is possible that the bug is caused by this sequence:
- EndURL, then RunText, then EndRun are called for Element 1,
  assuming m_pHyperlinkAttrList, m_startedHyperlink is set when we exit
- EndURL, then RunText, then EndRun are called for Element 2;
  closing of Element 2 happens too quickly.
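The per-element call order assumed in this comment (EndURL at most once, then any number of RunText, then exactly one EndRun) can be written down as a small trace checker. This is a sketch in Python for checking logged callback sequences against that assumption. The function name follows_assumed_order is made up, and whether the exporter really guarantees this order is exactly what is not known here.

```python
import re

# One element: optional EndURL, any number of RunText, exactly one EndRun.
# A trace is any number of such elements back to back.
_TRACE = re.compile(r"(?:(?:EndURL )?(?:RunText )*EndRun )*")


def follows_assumed_order(calls):
    """Return True if the list of callback names matches the assumed order."""
    trace = "".join(name + " " for name in calls)
    return _TRACE.fullmatch(trace) is not None
```

For example, ["EndURL", "EndRun"] (a hyperlink with no text) is accepted, while ["RunText", "EndURL", "EndRun"] is rejected because EndURL arrives after RunText within the same element.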
Comment 10 nicolas.boulenguez 2013-01-13 17:09:11 UTC
Created attachment 72963
clarification of hyperlink closing steps
Comment 11 nicolas.boulenguez 2013-01-13 17:18:13 UTC
Bug 47669 is also related.
Comment 12 Michael Stahl (allotropia) 2013-03-28 12:12:47 UTC
there's a patch here, can somebody who knows the WW8 export review this?

i can reproduce the bug on current master.

Nicolas, could you please send a mail with a text like 
http://permalink.gmane.org/gmane.comp.documentfoundation.libreoffice.devel/38402
to the mailing list libreoffice@lists.freedesktop.org ?

also in the future it's best to send a patch either to mailing list or to
gerrit (http://wiki.documentfoundation.org/Development/gerrit) because
developers look there far more often than at bugzilla for patches.
Comment 13 nicolas.boulenguez 2013-03-28 19:34:36 UTC
Be warned that the diff simplifies the faulty code, but does not solve the problem. I only posted each step of progress to help the next bug squasher.
A true patch for these bugs would need a clear specification of the order in which callbacks are called, and I have not found time for that yet.
Comment 14 Manuel Widmer 2013-06-15 11:36:16 UTC
I can also confirm with the current release:
Version 4.0.3.3
Comment 15 Benjamin Herr 2014-01-04 03:15:37 UTC
I'm able to reproduce this by creating a new document, typing http://example.com blah, enter, blah, saving as .docx, closing libreoffice and then opening that file in libreoffice again, and the second blah has disappeared. "Version: 4.1.4.2", "Build ID: Gentoo official package"
Comment 16 retired 2014-01-04 11:00:47 UTC
Tried Benjamin's steps but NoRepro:4.2.0.1:Ubuntu13.10

the document shows fine. Since I'm not sure if that means the bug is fixed in 4.2.0.1 I ask for more tests with 4.2.0.1. Anybody?
Comment 17 QA Administrators 2014-08-04 16:17:58 UTC Comment hidden (obsolete)
Comment 18 nicolas.boulenguez 2014-08-04 18:49:16 UTC
I checked every test described in this bug log with 4.2.5.2 on Debian,
and all was OK. It seems that this bug is fixed. Congratulations.
Comment 19 Michael Stahl (allotropia) 2014-08-04 19:09:46 UTC
please note the "version" should be *earliest* one that *has* the bug :)

resolving WFM per comment #18