38311 – headless conversion of some documents have missing Chinese characters

Status:	RESOLVED INVALID

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	LibreOffice
Version: (earliest affected)	3.4.0 release
Hardware:	x86 (IA32) Linux (All)

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2011-06-14 12:24 UTC by Brandon Simmons
Modified:	2012-08-31 10:05 UTC
CC List:	2 users

See Also:
Crash report or crash signature:

Attachments
five documents illustrating chinese character conversion issues (183.55 KB, application/x-gzip) 2011-06-14 12:24 UTC, Brandon Simmons
two simpler files illustrating the issue (2.17 KB, application/x-gzip) 2011-06-22 12:07 UTC, Brandon Simmons
Google docs export that converts incorrectly (11.00 KB, application/msword) 2011-06-23 10:52 UTC, Brandon Simmons

Description Brandon Simmons 2011-06-14 12:24:00 UTC

Created attachment 47967
five documents illustrating chinese character conversion issues

$ /opt/libreoffice3.4/program/soffice --version
LibreOffice 3.4  340m1(Build:12)

Converting from the command line produces a PDF with missing Chinese characters (see below).

Exporting via the GUI with default options produces a good PDF. Copying the text from the original into a new document and converting that document from the CLI also produces a good PDF.

Here is a list of the attached files, descriptions and command line used to generate them:

--------

chinese_problem_public.doc

    Original file exhibiting missing characters in headless 
    converted PDF

chinese_problem_public.copy-pasted.doc

    New document created by copying and pasting text from original
    to new LO document, saving as .doc

chinese_problem_public.gui.pdf

    Good pdf, exported from GUI from original, using default 
    options.

chinese_problem_public.copy-pasted.headless.pdf

    Good pdf, created from copy/pasted doc with:
      $> soffice --headless --convert-to pdf \
         chinese_problem_public.copy-pasted.doc
    
chinese_problem_public.headless.pdf

    Bad pdf, with missing characters, created as above from original

------

Thanks for any guidance if a workaround is possible.

Sincerely,
Brandon Simmons
http://coder.bsimmons.name

p.s. thanks to all the developers working on LibreOffice. It was exciting to see the improvements to the soffice binary :)

Comment 1 Brandon Simmons 2011-06-16 14:22:43 UTC

I've just come across this bug:

    https://bugs.freedesktop.org/show_bug.cgi?id=36313

This may be related or a duplicate.

Comment 2 Brandon Simmons 2011-06-22 12:05:39 UTC

I've added another attachment containing two files that are a simplified version of the original problem:

    chars.fails.doc - created by taking the original problem file and replacing text with an example of problem characters

    chars.converts.doc - created new file in LO and pasted in the same text

I assume this is some encoding issue in the template of the original problem file, but I really don't know.

Comment 3 Brandon Simmons 2011-06-22 12:07:08 UTC

Created attachment 48300
two simpler files illustrating the issue

Comment 4 Brandon Simmons 2011-06-23 10:52:48 UTC

Created attachment 48349
Google docs export that converts incorrectly

The same characters as in the previous attachment set, but from a document created with Google Docs and exported as a Word file (they use Aspose.Words under the hood). 

This converts to PDF with missing characters as well.

Comment 5 Brandon Simmons 2011-07-08 11:51:07 UTC

Another clue: when doing a conversion from the chars.fails.doc (attached previously) to a text file using the UNO API (via pyODConverter) the characters convert correctly.

Comment 6 Brandon Simmons 2011-07-08 12:01:00 UTC

(In reply to comment #5)
> Another clue: when doing a conversion from the chars.fails.doc (attached
> previously) to a text file using the UNO API (via pyODConverter) the characters
> convert correctly.

I wanted to add that the resulting converted file is identified by 'file' as:

    /tmp/chars.txt: UTF-8 Unicode (with BOM) text

I'm not sure if it's relevant.

Comment 7 Simos Xenitellis 2011-07-11 07:18:42 UTC

This appears to be a font issue. You can compare the fonts in the different PDF documents, and you can see that the problematic PDF does not have the Chinese font that the good PDFs use.
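
One way to make this font comparison concrete, sketched in shell. It assumes `pdffonts` from poppler-utils is installed (that tool is not mentioned anywhere in this report), and `pdf_font_names` and `fonts_only_in_first` are hypothetical helper names. It also assumes font names contain no spaces.

```shell
#!/bin/sh
# Sketch only: compare the fonts referenced by two PDFs.
# Assumption: `pdffonts` (poppler-utils) is available on PATH.

# Read `pdffonts` output on stdin and print one font name per line.
# The first two lines are the header and a dashed rule, so skip them.
# A subset prefix such as "ABCDEF+" is stripped from each name.
pdf_font_names() {
    awk 'NR > 2 { sub(/^[A-Z]+[+]/, "", $1); print $1 }' | sort -u
}

# Print fonts used by the first PDF but not by the second.
fonts_only_in_first() {
    a=$(mktemp) && b=$(mktemp) || return 1
    pdffonts "$1" | pdf_font_names > "$a"
    pdffonts "$2" | pdf_font_names > "$b"
    comm -23 "$a" "$b"    # lines unique to the first (sorted) list
    rm -f "$a" "$b"
}
```

For example, `fonts_only_in_first chinese_problem_public.gui.pdf chinese_problem_public.headless.pdf` would list fonts that only the GUI export references.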

Comment 8 Brandon Simmons 2011-07-12 07:45:16 UTC

(In reply to comment #7)
> This appears to be a font issue. You can compare the fonts in the different PDF
> documents, and you can see that the problematic PDF does not have the Chinese
> font that the good PDFs use.

Thanks for looking into this, Simos. I didn't notice that the fonts in my small test document were different. It looks like the original doc is using "SimSun" which I guess I don't have installed (it's an MS font). Is the font embedded in the .doc or something? 

Is there a workaround you could suggest? I'm quite lost.

Comment 9 Brandon Simmons 2011-07-13 08:19:35 UTC

After installing the SimSun font, the document converted correctly in headless mode. I suppose that if there's still a bug here, it is that headless mode behaves inconsistently with GUI mode.

Thanks for the help.
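
A quick way to check for a missing font before converting, sketched in shell. It assumes fontconfig's `fc-list` and `fc-match` are available on the system, and `has_font_family` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: test whether a font family is known to fontconfig.
# Assumption: input is the output of `fc-list : family`, where one
# line may carry several comma-separated names for the same font.

# Succeeds (exit 0) if family "$1" appears in the list on stdin.
# The match is case-insensitive and must cover the whole name.
has_font_family() {
    tr ',' '\n' | grep -Fixq -- "$1"
}

# Usage (not run here):
#   fc-list : family | has_font_family "SimSun" \
#       || echo "SimSun is not installed"
#   fc-match "SimSun"   # shows which font fontconfig would pick instead
```

If the check fails, `fc-match` shows which substitute font would be chosen, which may or may not contain the glyphs the document needs.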

Comment 10 Björn Michaelsen 2011-12-23 12:25:20 UTC

[This is an automated message.]
This bug was filed before the changes to Bugzilla on 2011-10-16. Thus it
started right out as NEW without ever being explicitly confirmed. The bug is
changed to state NEEDINFO for this reason. To move this bug from NEEDINFO back
to NEW please check if the bug still persists with the 3.5.0 beta1 or beta2 prereleases.
Details on how to test the 3.5.0 beta1 can be found at:
http://wiki.documentfoundation.org/QA/BugHunting_Session_3.5.0.-1

More detail on this bulk operation: http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html

Comment 11 Florian Reisinger 2012-08-14 14:00:54 UTC

Dear bug submitter!

Because there are a lot of NEEDINFO bugs with no answer within the last six months, we are closing all of these bugs.

To keep this message short, more information is available at https://wiki.documentfoundation.org/QA/NeedinfoClosure#Statement

Thanks for understanding and hopefully updating your bug, so that everything is prepared for developers to fix your problem.

Yours!

Florian
