Bug 141709 - Opening Chinese PDF files generated by XeLaTeX loses Chinese characters
Summary: Opening Chinese PDF files generated by XeLaTeX loses Chinese characters
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version (earliest affected): 6.1 all versions
Hardware: All All
Importance: medium normal
Assignee: Michael Warner
URL:
Whiteboard: target:7.3.0 target:7.2.0.0.beta2 tar...
Keywords:
Duplicates: 128735
Depends on:
Blocks:
 
Reported: 2021-04-16 09:05 UTC by Icenowy Zheng
Modified: 2022-03-12 03:02 UTC

See Also:
Crash report or crash signature:


Attachments
PDF file generated by XeLaTeX (10.86 KB, application/pdf)
2021-04-16 09:05 UTC, Icenowy Zheng
The same LaTeX source generated by LuaLaTeX (11.79 KB, application/pdf)
2021-04-16 09:06 UTC, Icenowy Zheng
Original LaTeX source (138 bytes, text/x-tex)
2021-04-16 09:06 UTC, Icenowy Zheng
Build Log (1.38 MB, text/x-log)
2021-11-12 13:07 UTC, Michael Warner

Description Icenowy Zheng 2021-04-16 09:05:38 UTC
Created attachment 171234 [details]
PDF file generated by XeLaTeX

When opening a PDF file generated with XeLaTeX (using the CTeX macros, with xeCJK underneath) that contains Chinese characters, the Chinese characters are missing, although the English characters are kept. Both Writer and Draw were tested.

Opening a PDF file generated by LuaLaTeX (also using CTeX, but with LuaTeX-ja underneath) works fine, however.
Comment 1 Icenowy Zheng 2021-04-16 09:06:22 UTC
Created attachment 171235 [details]
The same LaTeX source generated by LuaLaTeX
Comment 2 Icenowy Zheng 2021-04-16 09:06:53 UTC
Created attachment 171236 [details]
Original LaTeX source
Comment 3 Michael Warner 2021-04-18 22:49:16 UTC
Repro in latest master:
Version: 7.2.0.0.alpha0+ / LibreOffice Community
Build ID: ab4a244d980061d8f68766c1b9662e07c268d62c
CPU threads: 12; OS: Linux 4.15; UI render: default; VCL: gtk3
Locale: en-US (en_US.UTF-8); UI: en-US
Calc: CL

One difference I noticed between opening the two PDFs is that when opening xe.pdf, but not when opening lua.pdf, I get this message on the console:

warn:legacy.osl:10089:10089:unotools/source/config/moduleoptions.cxx:472: unknown factory
Comment 4 Michael Warner 2021-04-19 04:09:05 UTC
The Chinese characters are in the temporary PDF file written to /tmp, but don't make it as far as the call to drawGlyphs in wrapper.cxx:384. They probably get dropped in the PDF Import extension, maybe somewhere in pdfparse.cxx.
Comment 5 Michael Warner 2021-05-03 12:48:37 UTC
Icenowy, are you running this on Linux?

The PDF generated by LuaLaTeX sets the collection for the FandolSong-Regular font to Adobe-Identity. The PDF generated by XeLaTeX sets the collection for the FandolSong-Regular font to Adobe-GB1.

In order to display text using a font, Poppler needs a character code to Unicode mapping. For Adobe-Identity, it generates this on the fly. For Adobe-GB1 (and others), it tries to load it from the cidToUnicode file.
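A simplified sketch of the lookup described above, written in Python rather than Poppler's C++. The function name `find_cid_to_unicode` and the injectable `exists` check are hypothetical, the `<data dir>/cidToUnicode/<collection>` layout is an assumption based on how the poppler-data package is organized, and Poppler's real logic handles more cases than this model does.

```python
import os


def find_cid_to_unicode(collection, data_dir, exists=os.path.isfile):
    """Model where the CID-to-Unicode mapping for a font collection comes from.

    Returns "generated" when no data file is needed, the path of the
    mapping file when one is found, or None when the mapping is missing
    (the situation in which the Chinese text is lost).
    """
    if collection == "Adobe-Identity":
        # No data file needed: the mapping is built on the fly.
        return "generated"
    # Assumed layout: <data_dir>/cidToUnicode/<collection>
    candidate = os.path.join(data_dir, "cidToUnicode", collection)
    return candidate if exists(candidate) else None


# Pretend that only the distribution's directory is populated.
installed = {os.path.join("/usr/share/poppler", "cidToUnicode", "Adobe-GB1")}
on_disk = installed.__contains__

lua = find_cid_to_unicode("Adobe-Identity", "/usr/local/share/poppler", on_disk)
xe = find_cid_to_unicode("Adobe-GB1", "/usr/local/share/poppler", on_disk)
print("lua.pdf font mapping:", lua)
print("xe.pdf font mapping:", xe)
```

In this model the LuaLaTeX file never touches the data directory, which would explain why only the XeLaTeX file is affected when that directory is empty or absent.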

On Linux, Poppler sets the default path for its data files (POPPLER_DATADIR) to /usr/share/poppler in its config.h, and this is where my distribution places those files, when I install Poppler from the package manager. 

However, external/poppler/poppler-config.patch.1 sets POPPLER_DATADIR to /usr/local/share/poppler, and this is the directory searched when I run xpdfimport from the working directory of my checkout. That directory is not created or populated when I run "sudo make install". The Debian packages offered for download from libreoffice.org create an /opt/libreoffice7.1/share directory, but there is no poppler subdirectory in it. Neither the debs nor the working directory contain a cidToUnicode file.
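To check a given machine for this mismatch, a small diagnostic along the following lines may help. It is only a sketch: the helper name `dirs_with_mapping` is hypothetical, and it assumes the mapping file lives at `cidToUnicode/Adobe-GB1` below the data directory.

```python
from pathlib import Path


def dirs_with_mapping(candidates, collection="Adobe-GB1"):
    """Return the candidate data directories that contain the mapping file."""
    return [
        d for d in candidates
        if (Path(d) / "cidToUnicode" / collection).is_file()
    ]


# The two directories discussed in this comment.
candidates = ["/usr/share/poppler", "/usr/local/share/poppler"]
found = dirs_with_mapping(candidates)
for d in candidates:
    status = "found" if d in found else "missing"
    print(f"{d}: cidToUnicode/Adobe-GB1 {status}")
```

If the file is reported only under /usr/share/poppler, a build that searches /usr/local/share/poppler will not find it.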

This makes me think that the Poppler library is an external dependency that LO expects to be installed by the system package manager. If so, I don't see why the patch would set POPPLER_DATADIR to /usr/local/share/poppler.

The patch file was added in https://gerrit.libreoffice.org/c/core/+/56228.

It's possible to set the poppler data directory at runtime, by providing an argument to the GlobalParams constructor, but xpdfwrapper does not do this in wrapper_gpl.cxx. 

Michael Stahl, is there any downside to just changing the POPPLER_DATADIR to /usr/share/poppler in poppler-config.patch.1?
Comment 6 Michael Warner 2021-05-03 16:11:21 UTC
> Is there any downside to just changing the POPPLER_DATADIR to
> /usr/share/poppler in poppler-config.patch.1?

It won't be correct on MS Windows. I'm actually not sure how this is packaged and distributed for Windows or Mac. The easiest solution would probably be to bundle these files with LO: put them in the LO/share/xpdfimport directory and pass the GlobalParams constructor a data directory that is relative to the program installation directory. I don't know whether redistributing those files would raise some other concern, though. Poppler itself is GPL and LO is not (that is the reason xpdfimport exists as a separate executable, after all), but these files are just data; we aren't building or linking with them.

I would appreciate some input here.
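The idea of a data directory relative to the installation directory can be sketched as follows. This is an illustration in Python, not the xpdfimport code: the function name `bundled_data_dir` is hypothetical, the helper executable is assumed to live in `<install>/program/`, and the `poppler_data` subdirectory name is taken from the later comments in this report.

```python
from pathlib import PurePosixPath, PureWindowsPath


def bundled_data_dir(executable):
    """Derive <install>/share/xpdfimport/poppler_data from the helper's path.

    Assumes the helper executable sits in <install>/program/, which is an
    assumption about the layout rather than something this report states.
    """
    install_root = executable.parent.parent
    return install_root / "share" / "xpdfimport" / "poppler_data"


linux_dir = bundled_data_dir(
    PurePosixPath("/opt/libreoffice7.1/program/xpdfimport"))
windows_dir = bundled_data_dir(
    PureWindowsPath(r"C:\Program Files\LibreOffice\program\xpdfimport.exe"))
print(linux_dir)
print(windows_dir)
```

Because the result depends only on where the executable is installed, the same rule would apply on Linux, Windows and Mac without a hard-coded /usr path.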
Comment 7 Icenowy Zheng 2021-05-03 16:45:12 UTC
(In reply to Michael Warner from comment #5)
> Icenowy, are you running this on Linux?

Yes, and a version packaged by my distro (although this package uses shipped poppler inside LO, not system poppler).

> 
> The PDF generated by LuaLaTeX sets the collection for the FandolSong-Regular
> font to Adobe-Identity. The PDF generated by XeLaTeX sets the collection for
> the FandolSong-Regular font to Adobe-GB1. 
> 
> In order to display text using a font, Poppler needs a character code to
> Unicode mapping. For Adobe-Identity, it generates this on the fly. For
> Adobe-GB1 (and others), it tries to load it from the cidToUnicode file.
> 
> On Linux, Poppler sets the default path for its data files (POPPLER_DATADIR)
> to /usr/share/poppler in its config.h, and this is where my distribution
> places those files, when I install Poppler from the package manager. 
> 
> However, in external/poppler/poppler-config.patch.1, the POPPLER_DATADIR is
> set to /usr/local/share/poppler. This is the directory it searches when I
> run xpdfimport from the working directory of my checkout.  That directory is
> not created or populated when I run "sudo make install". If I look at the
> Linux Debian packages offered for download from libreoffice.org, it creates
> an /opt/libreoffice7.1/share directory, but there is no poppler subdir
> there. Neither the debs nor the working directory contain a cidToUnicode
> file. 

Thanks for this information. I copied /usr/share/poppler to /usr/local/share/, and it now works.

> 
> This makes me think that poppler library is an external dependency that LO
> expects to be installed by the system package manager. So, I don't see why
> the patch would set the POPPLER_DATADIR to /usr/local/share/poppler. 

This seems to be mysterious, yes.

> 
> The patch file was added in https://gerrit.libreoffice.org/c/core/+/56228.
> 
> It's possible to set the poppler data directory at runtime, by providing an
> argument to the GlobalParams constructor, but xpdfwrapper does not do this
> in wrapper_gpl.cxx. 
> 
> Michael Stahl, is there any downside to just changing the POPPLER_DATADIR to
> /usr/share/poppler in poppler-config.patch.1?
Comment 8 Icenowy Zheng 2021-05-03 16:46:25 UTC
(In reply to Michael Warner from comment #6)
> > Is there any downside to just changing the POPPLER_DATADIR to
> > /usr/share/poppler in poppler-config.patch.1?
> 
> It won't be correct on MS Windows. I'm actually not sure how this is
> packaged and distributed for Windows or Mac. Easiest solution would probably
> be to just bundle these files in with LO; put them in the
> LO/share/xpdfimport directory; pass a data dir into the GlobalParams
> constructor that is relative to the program installation directory. But IDK
> if redistributing those files would cause some other concern. Poppler itself
> is GPL and LO is not (that is the reason xpdfimport exists as a separate
> executable, after all), but these files are just data, we aren't building or
> linking with them.

Setting it to /usr/local/share/poppler is just as incorrect on Windows as /usr/share/poppler, right?

So setting it to /usr/share/poppler at least fixes Linux.

> 
> I would appreciate some input here.
Comment 9 Michael Stahl (allotropia) 2021-05-03 17:19:17 UTC
i guess if any data files are missing they need to be bundled with LO.

there is no guarantee that any such files in /usr are compatible with the version of poppler shipped in LO.

apparently there's a separate "poppler-data" source package, maybe that contains those files.
Comment 10 Michael Warner 2021-06-23 03:00:25 UTC
*** Bug 128735 has been marked as a duplicate of this bug. ***
Comment 11 Commit Notification 2021-06-23 09:22:09 UTC
Michael Warner committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/648e4106cc002ff5b8184a8c104f93cb06e4b540

tdf#141709: Use poppler_data

It will be available in 7.3.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 12 Commit Notification 2021-06-23 11:09:33 UTC
Michael Warner committed a patch related to this issue.
It has been pushed to "libreoffice-7-2":

https://git.libreoffice.org/core/commit/98be6ca36a6e509303b69514d85471032d0dffce

tdf#141709: Use poppler_data

It will be available in 7.2.0.0.beta2.

Comment 13 Michael Warner 2021-10-29 17:07:37 UTC
I just installed:

Version: 7.2.2.2 (x64) / LibreOffice Community
Build ID: 02b2acce88a210515b4a5bb2e46cbfb63fe97d56
CPU threads: 6; OS: Windows 6.1 Service Pack 1 Build 7601; UI render: Skia/Raster; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded


The poppler_data directory isn't present in C:\Program Files\LibreOffice\share\xpdfimport for some reason. I checked the HEAD in master and the build files still point to that directory, so nobody else moved it to some other location. 

When I re-test opening xe.pdf, the characters don't appear. Perhaps Windows builds have the SYSTEM_POPPLER flag enabled for some reason? Whatever, I have to reopen this bug now.
Comment 14 Michael Warner 2021-11-06 16:09:19 UTC
poppler_data is also not present in the Linux 7.2.2.2 build that I installed from LibreOffice_7.2.2_Linux_x86-64_deb.tar.gz, downloaded from libreoffice.org.
Comment 15 Michael Stahl (allotropia) 2021-11-11 15:37:37 UTC
i think what is missing is that the poppler_data package isn't added to the installation set.

try to add something like this in RepositoryExternal.mk same place as commit 648e4106cc002ff5b8184a8c104f93cb06e4b540


$(eval $(call gb_Helper_register_packages_for_install,pdfimport,\
	poppler_data \
))


then try it with autogen.input containing --with-package-format=archive (or msi/rpm/your platform format) for testing.
Comment 16 Michael Warner 2021-11-12 13:07:08 UTC
Created attachment 176208 [details]
Build Log

Finds poppler_data.filelist on line 649
Starts creating directories on line 2132
Starts copying files on line 3809

Get an error on line 6198:
ERROR: Could not copy /media/data/libreoffice/libreoffice/instdir/share/extensions to /media/data/libreoffice/libreoffice/workdir/installation/LibreOfficeDev/archive/install/en-US_inprogress/LibreOfficeDev_7.3.0.0.alpha1_Linux_x86-64_archive/./share/extensions Is a directory
Comment 17 Michael Warner 2021-11-12 13:51:12 UTC
Comment 16 is what happens when using --with-package-format=archive.

If I instead use:
--enable-epm
--with-package-format=deb

It seems to work. No errors in packaging, and I see the poppler_data directory in

lodevbasis7.3-extension-pdf-import_7.3.0.0.alpha1-1_amd64.deb
Comment 18 Commit Notification 2021-11-15 15:29:05 UTC
Michael Warner committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/6ea7ca45782a7e1b46e18e994534ec0a7c71951b

tdf#141709 Register poppler_data for install

It will be available in 7.3.0.

Comment 19 Commit Notification 2021-11-15 17:36:40 UTC
Michael Warner committed a patch related to this issue.
It has been pushed to "libreoffice-7-2":

https://git.libreoffice.org/core/commit/b635846280c8fb4fb4d68f95af383ef1337eb430

tdf#141709 Register poppler_data for install

It will be available in 7.2.4.

Comment 20 Christian Lohmaier 2021-12-06 13:24:09 UTC
7.2.4 was a hotfix release, updating target in status-whiteboard
Comment 21 Michael Warner 2022-03-12 03:02:46 UTC
I was able to open xe.pdf and see the Chinese characters in Windows, Mac, and Linux versions of 7.3 so I will mark this as resolved.