Bug 143095 - Draw and Writer PDF filter import misinterpret font family names
Summary: Draw and Writer PDF filter import misinterpret font family names
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage (show other bugs)
Version:
(earliest affected)
Inherited From OOo
Hardware: All All
: medium normal
Assignee: Not Assigned
URL: https://gerrit.libreoffice.org/c/core...
Whiteboard:
Keywords:
: 104264 113124 151247 153907 (view as bug list)
Depends on:
Blocks: PDF-Import-Draw
  Show dependency treegraph
 
Reported: 2021-06-27 13:16 UTC by Kevin Suo
Modified: 2023-04-24 18:06 UTC (History)
10 users (show)

See Also:
Crash report or crash signature:


Attachments
test odt file (8.75 KB, application/vnd.oasis.opendocument.text)
2021-06-27 13:16 UTC, Kevin Suo
Details
pdf file exported from the test odt file (21.46 KB, application/pdf)
2021-06-27 13:17 UTC, Kevin Suo
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Kevin Suo 2021-06-27 13:16:30 UTC
Created attachment 173232 [details]
test odt file

Steps to Reproduce:

1. Type "English Normal Liberation Serif" in Writer. 
The default font used should be "Liberation Serif" (note there is a space). Manually set to this font if needed.
2. Export to PDF.
3. Open the saved PDF file with Draw.

Expected Result:
The font for the characters "English Normal Liberation Serif" should be "Liberation Serif", with a space in the font name, so that the correct system font is used to render the content.

Current Result
The font is "LiberationSerif", without a space. As there is no such font, a fallback font is used in Draw for rendering.

Version: 7.1.5.0.0+ / LibreOffice Community
Build ID: 8619e743564a241eb951866616aec82e1ab3965f
CPU threads: 4; OS: Linux 5.12; UI render: default; VCL: gtk3
Locale: zh-CN (zh_CN.UTF-8); UI: zh-CN
Calc: threaded

Fedora 33 x64.
Comment 1 Kevin Suo 2021-06-27 13:17:03 UTC
Created attachment 173233 [details]
pdf file exported from the test odt file
Comment 2 V Stuart Foote 2021-06-27 18:54:30 UTC
Observation is valid

Version: 7.3.0.0.alpha0+ (x64) / LibreOffice Community
Build ID: 1db375e06516d0532f01f9585986617aa3079866
CPU threads: 8; OS: Windows 10.0 Build 19042; UI render: Skia/Vulkan; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded

But is this a bug? I don't think so...

LibreOffice is not a PDF "editor", we filter import a PDF (to Draw by default, but also to Writer, or Imrpess).

While font names held within PDF (including our filter exports to PDF) are never the installed system font names. Rather they are the PDF parsed names for the font that may or maynot be embedded--with subset or fully into the PDF.

We should handle the subsetting better--bug 82163 or bug 101220, needed for touchup of a PDF--as other 'viewers' support. But since we do not 'edit' a PDF the font names are irrelevant--and correctly receive a fall back font assignment.

Withing LO filter import, the fallback font assignment could be improved--to better match what we exported into the PDF against the internal PDF subset fonts.  But that is an enhancement.

Our other import filter is pdfium based. Used for the Insert -> Image dialog, it recomposes the PDF page(s) using the internal font glyph paths to render the text runs described. If the fonts are not embedded it just uses the paths. Fidelity is high (though the bitmap rendering remains low resolution, bug 115811).
Comment 3 V Stuart Foote 2021-06-27 19:30:39 UTC
*** Bug 104264 has been marked as a duplicate of this bug. ***
Comment 4 Kevin Suo 2021-06-27 23:44:55 UTC
(In reply to V Stuart Foote from comment #2)
We changed the font to LiberationSerif on pdf export, so why not change it back to Liberation Serif on pdf import? I think a pdfToDrawFontMap can do this. We can convert the commonly known fonts back to the corrent names, and leave those unknown fonts "as is".

I understand Draw is not a pdf editor, but since we have the pdf import filter, we surely should make this filter work better.
Comment 5 Callegar 2021-06-28 07:56:50 UTC
At least on Linux, it should be possible to query the system font machinery for the font that actually uses that postscript name.
Comment 6 Kevin Suo 2021-06-29 12:57:52 UTC
See the function familyNameOverride in codes prior to the following commit;
https://cgit.freedesktop.org/libreoffice/core/commit/?id=abe4d7bd0a1ec4b5a31cc5622080952e4cd53ebf

This map was removed in that commit, but this is perfect to resolve this bug.

The related code for the setting of font family name is in the following line
https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/tree/drawtreevisiting.cxx?r=e6dfaf9f#830

At this moment, the family name as represened by;
aFontProps[ "fo:font-family" ] = rFont.familyName;
is the wrong name without space. 

This family name was passed here after been processed by LineParser::parseFontFamilyName in:
https://opengrok.libreoffice.org/xref/core/sdext/source/pdfimport/wrapper/wrapper.cxx?r=12362fc4#511

After this line, the PDF font names is interpreted as the following and is then shown on Draw:
LiberationSerif (should be "Liberation Serif" in ODF XML stream)
TimesNewRommanBold (should be "Times New Roman" and the isBold should be true)
SimSun (should be Chinese name "宋体")
...
Comment 7 Kevin Suo 2021-06-29 15:34:35 UTC
*** Bug 113124 has been marked as a duplicate of this bug. ***
Comment 8 Kevin Suo 2021-06-29 15:37:14 UTC
Mark as New since there is at least one duplicates.

I did some debugging on this issue these days. I already found out how those related codes work, but I don't think I have the ability to revolve this issue.
Comment 10 Kevin Suo 2021-07-12 09:55:03 UTC
Just for those who are interested:

rResult.familyName, as returned by:
LineParser::parseFontFamilyName( FontAttributes& rResult )
in: sdext/source/pdfimport/wrapper/wrapper.cxx

is eventually passed to:

void DrawXmlFinalizer::visit( TextElement& elem, const std::list< std::unique_ptr<Element> >::const_iterator& )
in:sdext/source/pdfimport/tree/drawtreevisiting.cxx
and:
void WriterXmlFinalizer::visit( TextElement& elem, const std::list< std::unique_ptr<Element> >::const_iterator& )
in:sdext/source/pdfimport/tree/writertreevisiting.cxx

and eventually is written to an xml stream of ODF XML format and is then imported to Draw or Writer by
bool xpdf_ImportFromStream(...)
in: sdext/source/pdfimport/wrapper/wrapper.cxx

So, before it is written to the xml stream:
* In case rResult.familyName is "TimesNewRoman", it should be converted to "Times New Roman";
* In case rResult.familyName is "SimSun", it should be converted to the localized name "宋体" if the currently locale of the user is Chinese; and should be "SimSun" if the currently locale is not Chinese.
Comment 11 Kevin Suo 2022-09-30 10:31:50 UTC
*** Bug 151247 has been marked as a duplicate of this bug. ***
Comment 12 Kevin Suo 2022-12-05 04:34:29 UTC
I have submitted a change in:
https://gerrit.libreoffice.org/c/core/+/143659

Could someone who can build by yourself, pull that change, build and test, and provide your feedback?
Comment 13 Eyal Rozenberg 2022-12-05 09:42:49 UTC
(In reply to Kevin Suo from comment #12)
I don't do builds, but I will ask that you also try the reproduction instructions on my dupe 151247.
Comment 14 V Stuart Foote 2022-12-05 13:33:48 UTC
@خالد, could you compare notes with 琨珑?
Comment 15 ⁨خالد حسني⁩ 2022-12-05 19:35:54 UTC
If you have acess the the actual font data embedded in the PDF, most likely the full name is their as well, either in the “name” table of the TTF fonts or in the TopDICT of CFF/Type1 fonts. We probably have code to parse these somewhere in vcl/source/fontsubset/.
Comment 16 V Stuart Foote 2022-12-05 22:57:34 UTC
(In reply to خالد حسني from comment #15)
> If you have acess the the actual font data embedded in the PDF, most likely
> the full name is their as well, either in the “name” table of the TTF fonts
> or in the TopDICT of CFF/Type1 fonts. We probably have code to parse these
> somewhere in vcl/source/fontsubset/.

I've uncompressed a few PDFs and all pretty consistently have a /BaseFont that collapses the font name to the PS flavor.  There is usually a single /Fontfamily stanza, but not clear that is granular enough for parsing the correct font details out of the CMAP or toUnicode entries we depend on.

AFAICS Kevin's font equiv lookup seems necessary...
Comment 17 ⁨خالد حسني⁩ 2022-12-05 23:01:56 UTC
(In reply to V Stuart Foote from comment #16)
> (In reply to خالد حسني from comment #15)
> > If you have acess the the actual font data embedded in the PDF, most likely
> > the full name is their as well, either in the “name” table of the TTF fonts
> > or in the TopDICT of CFF/Type1 fonts. We probably have code to parse these
> > somewhere in vcl/source/fontsubset/.
> 
> I've uncompressed a few PDFs and all pretty consistently have a /BaseFont
> that collapses the font name to the PS flavor.  There is usually a single
> /Fontfamily stanza, but not clear that is granular enough for parsing the
> correct font details out of the CMAP or toUnicode entries we depend on.
> 
> AFAICS Kevin's font equiv lookup seems necessary...

The font data is stored in binary format inside the PDF file, you wont see much in the uncompressed PDF. Try opening the PDF in FontForge, it might be able to extract the embedded fonts, or use a tool like mutool to extract the fonts if you want to manually examine them.
Comment 18 Eyal Rozenberg 2022-12-05 23:12:53 UTC
(In reply to V Stuart Foote from comment #16)

> AFAICS Kevin's font equiv lookup seems necessary...

As I've said in my review content - that can't really solve the problem. The static list cannot possibly know the names of fonts which will be release _in the future_. Plus, the chance that we could cover enough of the fonts existing out there in world seems low.
Comment 19 ⁨خالد حسني⁩ 2022-12-25 00:18:02 UTC
(In reply to خالد حسني from comment #17)
> (In reply to V Stuart Foote from comment #16)
> > (In reply to خالد حسني from comment #15)
> > > If you have acess the the actual font data embedded in the PDF, most likely
> > > the full name is their as well, either in the “name” table of the TTF fonts
> > > or in the TopDICT of CFF/Type1 fonts. We probably have code to parse these
> > > somewhere in vcl/source/fontsubset/.
> > 
> > I've uncompressed a few PDFs and all pretty consistently have a /BaseFont
> > that collapses the font name to the PS flavor.  There is usually a single
> > /Fontfamily stanza, but not clear that is granular enough for parsing the
> > correct font details out of the CMAP or toUnicode entries we depend on.
> > 
> > AFAICS Kevin's font equiv lookup seems necessary...
> 
> The font data is stored in binary format inside the PDF file, you wont see
> much in the uncompressed PDF. Try opening the PDF in FontForge, it might be
> able to extract the embedded fonts, or use a tool like mutool to extract the
> fonts if you want to manually examine them.

The PDF font descriptor can also (optionally) include font family and font weight, so if present they should be used over the PostScript name that is currently used. Apparently not many tools write these to the PDF, including LibreOffice itself.
Comment 20 Eyal Rozenberg 2022-12-26 00:02:52 UTC
(In reply to Kevin Suo from comment #4)
> I understand Draw is not a pdf editor

LibreOffice is a PDF editor. Comments to the contrary are, well, mistaken - both in principle and in practice. See the discussions of this point in bug 49705 and bug 32249. The main problem with LO as a PDF editor is that its import filter is too rudimentary/simplistic and still somewhat buggy.

In particular, the import filters must make a significant effort to determine which font families are used in a PDF. The absolute very least is to apply the inverse of the font-name-mangling the PDF export filter uses, and perhaps  to verify that the resulting name does correspond to what's in the file.

I would go well beyond that and say that stronger heuristics need to be applied, covering the common schemes of naming font (families) within PDFs, and again, perhaps some matching up of such fonts against those available on the system.
Comment 21 Eyal Rozenberg 2022-12-26 00:03:51 UTC
And at least the basic ability to get the font family right in an ODT-PDF-ODT roundtrip is an absolute requirement, and the lack of it is a bug. So, enhancement -> normal.
Comment 22 V Stuart Foote 2022-12-26 18:07:55 UTC
(In reply to Eyal Rozenberg from comment #20)

Doers decide, and the PDF filters are most definitely not core to the project.

So, looking forward to read your code contribution--since this will require a complete refactoring--including likely replacement of the poppler/cairo and xpdf object import framework with a pdfium based implementation direct to vcl canvas and ODF archive.

And as a reality check reminder, we do not edit PDFs, we can only apply filters to parse their content. And after manipulation in an LO module we generate a new PDF via export filter.
Comment 23 Eyal Rozenberg 2022-12-26 23:15:50 UTC
(In reply to V Stuart Foote from comment #22)
> (In reply to Eyal Rozenberg from comment #20)
> 
> Doers decide, and the PDF filters are most definitely not core to the
> project.
> 
> So, looking forward to read your code contribution

Actually, I had given some thought to may dipping my feet into the LO code; not on this issue, elsewhere; but I was rather offput by the constraint of relying on a lot of old and outdated fundamental code in the name of a frozen interface vis-a-vis extensions. Regardless, it's only half-true that doers decide: There are user needs and interests; there's the TDF; the ESC; there is the design team and perhaps other bodies... those all affect decisions.

Also, who knows? Maybe this could be a tender.

> --since this will require
> a complete refactoring--including likely replacement of the poppler/cairo
> and xpdf object import framework with a pdfium based implementation direct
> to vcl canvas and ODF archive.

I didn't say this is an "easy hack" or something I expect to happen overnight. I mean, even the import filters for MSO formats isn't properly sorted out w.r.t. tables and RTL and a bunch of other things. But it's important that we point clearly in the direction in which to progress.

> And as a reality check reminder, we do not edit PDFs, we can only apply
> filters to parse their content. And after manipulation in an LO module we
> generate a new PDF via export filter.

That's how we edit DOC files, and that's also how we edit PDF files.
Comment 24 Callegar 2022-12-27 10:45:27 UTC
My two cents as one of the initial reporters of this issue (as bug 104264).

I understand that there is disagreement on whether the issue can be addressed, because even if there is a solution path, following it would take a significant amount of resources (taking them away from something else, I suppose).

IMHO, providing the ability to read PDFs and convert them to an editable odg format is a notable selling point for LibO representing something that LibO can do and its competition cannot. Doing this as accurately as possible can help capitalize this advantage as long as it lasts. PDF support has for long been a distinct trait of LibO that came first to out of the box PDF export and import.

Said that, this is issue not the only pain point with PDF, nor the major one, IMHO. Currently, the biggest annoyances with PDF are:

- Mismanagement of background. PDF exports have a solid background unless one exports a selection even if in the document properties the background is set to None, in opposition to solid white color (bug 114725, as said it can be worked around by selecting all and exporting a selection, but this is not quite discoverable).

- No option to export a PDF whose size reflects that of the drawing (bug 40163, can be worked around using the pdfcrop utility from TeXLive, but not everybody has PDF cropping utilities around. An alternate and possibly better solution would be to modify the page properties dialog to have a button to adapt the page size to the drawing).

- PDF images used as backgrounds in documents get exported to PDFs as bitmaps (bug 150076, this is the most annoying one IMHO. You use vector formats such as PDFs precisely because you do not want the limitations of bitmaps).

So I would say, please prioritize those as the resource requirements may be lower while keeping this one on the radar.
Comment 25 Eyal Rozenberg 2022-12-27 18:26:43 UTC
(In reply to Callegar from comment #24)
> I understand that there is disagreement on whether the issue can be
> addressed, because even if there is a solution path, following it would
> take a significant amount of resources

Actually, it should really not be a big deal to solve this problem 90% of the way. A heuristic could be applied ex-post-facto, i.e. after import, to consider possible split-ups of phrases like "LiberationSerif" or "DavidCLMNormal"; to see whether any of them match an actual font family; and to prefer the split-up form if the original form is not a known font family. This could in principle get some cases wrong (maybe there are font families FooBold and Foo with a Bold variant for example) - but the errors will be minor relative to the majority of cases. A more fundamental solution may require a lot of work.

> IMHO, providing the ability to read PDFs and convert them to an editable odg
> format

Don't forget the ODT format...

> Currently, the biggest annoyances with PDF are: etc. etv.

I think a "prioritizing list" of annoyances belongs on the tracker/meta-bug page of PDF import and export filters. I mean, I'm not saying you're wrong, or that this issue is the most important one regarding PDFs (it isn't); but - this is the page for a single issue, not for everything else, even if our discussion has veered off track :-)
Comment 26 Eyal Rozenberg 2023-03-01 21:12:51 UTC
Bug 153907 is closed related: It's the need to add or remove spaces from font family names when looking up just the font family (even before the extra complication of variants-as-part-of-the-name is considered).
Comment 27 Eyal Rozenberg 2023-03-02 08:38:19 UTC
*** Bug 153907 has been marked as a duplicate of this bug. ***