Bug 78427 - FILEOPEN PDF Import: sometimes bold and italic font properties are imported incorrectly (see comment 34 for TODO list)
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Draw
Version (earliest affected): 4.3.0.0.alpha0+ Master
Hardware: Other All
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:7.3.0 target:7.2.0.2
Keywords:
Duplicates: 116059
Depends on:
Blocks: PDF-Import-Draw
Reported: 2014-05-08 08:49 UTC by vvort
Modified: 2021-10-11 08:02 UTC (History)
6 users (show)

See Also:
Crash report or crash signature:
Regression By:


Attachments
Test file 1 (23.21 KB, application/pdf), 2014-05-08 08:49 UTC, vvort
Test file 2 (4.98 KB, application/pdf), 2014-05-08 08:50 UTC, vvort
info fonts acrobat reader of file 1 (18.33 KB, image/png), 2018-08-25 11:04 UTC, paulystefan
info fonts acrobat reader of file 2 (16.65 KB, image/png), 2018-08-25 11:05 UTC, paulystefan
506_Vorsorgevollmacht.pdf (73.78 KB, application/pdf), 2019-11-19 10:17 UTC, DerMartin
Times.pdf (sdext.pdfimport failed to detect) (75.09 KB, application/pdf), 2021-07-04 09:13 UTC, Kevin Suo
Times MS.pdf (sdext.pdfimport success to detect) (286.71 KB, application/pdf), 2021-07-04 09:20 UTC, Kevin Suo

Description vvort 2014-05-08 08:49:56 UTC
Created attachment 98668 [details]
Test file 1

Here are two examples:
1. Bold property is specified, but text is imported as normal.
2. Bold italic text is imported as italic text.
Comment 1 vvort 2014-05-08 08:50:50 UTC
Created attachment 98669 [details]
Test file 2
Comment 2 vvort 2014-05-08 08:59:54 UTC
Example #2 is fixed here:
https://gerrit.libreoffice.org/9276
Comment 3 Commit Notification 2014-05-08 09:05:24 UTC
Vort committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=e0bde4c53b1b8412833d4b84a214da8b8fc1f6e7

fdo#78427 PDF Import: Improve detection of bold italic font



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 4 Kevin Suo 2014-07-24 12:09:40 UTC
added bug 81484 as see also.
Comment 5 QA Administrators 2015-09-04 02:49:35 UTC Comment hidden (obsolete)
Comment 6 Alexander Tselikov 2016-02-05 10:32:45 UTC
(In reply to QA Administrators from comment #5)
> ** Please read this message in its entirety before responding **
> 
> To make sure we're focusing on the bugs that affect our users today,
> LibreOffice QA is asking bug reporters and confirmers to retest open,
> confirmed bugs which have not been touched for over a year.
> 
> There have been thousands of bug fixes and commits since anyone checked on
> this bug report. During that time, it's possible that the bug has been
> fixed, or the details of the problem have changed. We'd really appreciate
> your help in getting confirmation that the bug is still present.
> 
> If you have time, please do the following:
> 
> Test to see if the bug is still present on a currently supported version of
> LibreOffice (5.0.0.5 or later)
>    https://www.libreoffice.org/download/
> 
>    If the bug is present, please leave a comment that includes the version
> of LibreOffice and your operating system, and any changes you see in the bug
> behavior
>  
>  If the bug is NOT present, please set the bug's Status field to
> RESOLVED-WORKSFORME and leave a short comment that includes your version of
> LibreOffice and Operating System
> 
> Please DO NOT
> 
> Update the version field
> Reply via email (please reply directly on the bug tracker)
> Set the bug's Status field to RESOLVED - FIXED (this status has a particular
> meaning that is not appropriate in this case)
> 
> 
> If you want to do more to help you can test to see if your issue is a
> REGRESSION. To do so: 
> 
> 1. Download and install oldest version of LibreOffice (usually 3.3 unless
> your bug pertains to a feature added after 3.3)
> 
> http://downloadarchive.documentfoundation.org/libreoffice/old/
> 
> 2. Test your bug 
> 3. Leave a comment with your results. 
> 
> 4a. If the bug was present with 3.3 - set version to "inherited from OOo"; 
> 4b. If the bug was not present in 3.3 - add "regression" to keyword
> 
> 
> Feel free to come ask questions or to say hello in our QA chat:
> http://webchat.freenode.net/?channels=libreoffice-qa
> 
> Thank you for your help!
> 
> -- The LibreOffice QA Team This NEW Message was generated on: 2015-09-03

Case 1 is still reproducible.

Version: 5.0.4.2
Build ID: 1:5.0.4~rc2-0ubuntu1~trusty1
Locale: en-US (en_US.UTF-8)
Linux Mint 17.3
Comment 7 Heiko Tietze 2016-05-10 10:03:21 UTC
Confirmed

Version: 5.2.0.0.alpha0+
Build ID: 6b232aeecc55f1715bc111e636e36a8e24827efb
CPU Threads: 4; OS Version: Windows 6.1; UI Render: default; 
TinderBox: Win-x86@39, Branch:master, Time: 2016-01-26_07:40:04
Locale: de-DE (de_DE)
Comment 8 QA Administrators 2017-09-01 11:20:10 UTC Comment hidden (obsolete)
Comment 9 Kevin Suo 2017-09-02 00:11:59 UTC
Bug still exists in the most recent version.
Comment 10 paulystefan 2017-11-27 20:55:07 UTC
5.4.3.2 x64 win 10

bug present in test file 1
Comment 11 paulystefan 2018-08-25 11:04:35 UTC
Created attachment 144421 [details]
info fonts acrobat reader of file 1

The two files use different fonts: TrueType and Type 1 in the first file, TrueType only in the second file.
Comment 12 paulystefan 2018-08-25 11:05:10 UTC
Created attachment 144422 [details]
info fonts acrobat reader of file 2

The two files use different fonts: TrueType and Type 1 in the first file, TrueType only in the second file.
Comment 13 paulystefan 2019-08-23 21:28:21 UTC
Same problem in 6.3.0.4: the imported fonts differ from those reported by Acrobat Reader.
Comment 14 DerMartin 2019-11-19 10:15:22 UTC
The attached file "506_Vorsorgevollmacht.pdf" probably has the same issue.
Comment 15 DerMartin 2019-11-19 10:17:06 UTC
Created attachment 155937 [details]
506_Vorsorgevollmacht.pdf
Comment 16 DerMartin 2019-11-19 10:20:02 UTC
LO version 6.3.3.2.0+
Comment 17 Kevin Suo 2021-06-29 15:10:05 UTC
I think I can fix the bold/italic issue, but the fix only works for specific fonts (e.g., Arial, Calibri, etc., which have exactly the same name in the PDF and on the system).

It does not work for some other fonts: fonts whose names have no spaces in the PDF but do have spaces on your system (like "LiberationSerif" in the PDF but "Liberation Serif" on your system), and fake-bold fonts like "SimSun", which uses fill+stroke as a bold effect in the PDF but is imported with the "outline" effect in Draw. These two situations are discussed in bug 81484 and bug 143095.
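
To illustrate the name mismatch described above, here is a minimal Python sketch (not the sdext.pdfimport code; the function name `match_system_family` and the `installed` list are hypothetical). It compares names with spaces and case ignored, rather than inserting spaces at capital letters, because the latter would wrongly turn a name like "SimSun" into "Sim Sun".

```python
def match_system_family(pdf_family, installed):
    """Return the installed family equal to the PDF name once spaces
    and case are ignored, or None if there is no such family."""
    key = pdf_family.replace(" ", "").lower()
    for family in installed:
        if family.replace(" ", "").lower() == key:
            return family
    return None


# Hypothetical list of families installed on the system.
installed = ["Liberation Serif", "Arial", "Calibri", "SimSun"]

print(match_system_family("LiberationSerif", installed))
print(match_system_family("SimSun", installed))
print(match_system_family("CairoFont", installed))
```

This only addresses the spacing problem; it does nothing for the fake-bold (fill+stroke) case.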
Comment 18 Kevin Suo 2021-06-29 15:12:21 UTC
I have submitted a patch in gerrit for review:
https://gerrit.libreoffice.org/c/core/+/118122
Comment 19 vvort 2021-06-30 14:01:12 UTC
I do not know why this code was added, so I can't say what will happen if it is removed.
Fixing one problem while potentially creating an unknown number of other problems is not something I like to do.
(Also, I do not develop for LO nowadays, and even Gerrit does not want to authorize me, so my help will be _very_ limited.)
Comment 20 Kevin Suo 2021-07-03 14:30:34 UTC
Reinvestigated and resubmitted a patch at https://gerrit.libreoffice.org/c/core/+/118354
Comment 21 Kevin Suo 2021-07-04 09:13:39 UTC
Created attachment 173336 [details]
Times.pdf (sdext.pdfimport failed to detect)

sdext.pdfimport failed to detect the font in this PDF file.
In the function OpenTTFontBuffer in vcl/source/fontsubset/sft.cxx, the call (*ttf)->open(facenum) returned vcl::SFErrCodes::TtFormat, which means "incorrect TrueType font format".

However, the fonts in this PDF are TrueType:

$ pdffonts ./Times.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
BAAAAA+TimesNewRomanPSMT             TrueType          WinAnsi          yes yes yes     19  0
CAAAAA+TimesNewRomanPS-BoldMT        TrueType          WinAnsi          yes yes yes     24  0
DAAAAA+TimesNewRomanPS-ItalicMT      TrueType          WinAnsi          yes yes yes     34  0
EAAAAA+TimesNewRomanPS-BoldItalicMT  TrueType          WinAnsi          yes yes yes     29  0
FAAAAA+ArialMT                       TrueType          WinAnsi          yes yes yes     14  0
GAAAAA+Arial-BoldMT                  TrueType          WinAnsi          yes yes yes      9  0

Note that "uni" is "yes" for these fonts, which means there is an explicit "ToUnicode" map in the PDF file.

This PDF file was created using Writer. It seems this "ToUnicode" map was added by the following code:
https://opengrok.libreoffice.org/xref/core/vcl/source/gdi/pdfwriter_impl.cxx?r=07556be5#2830

From the code we can see that createToUnicodeCMap is called, so there must be a ToUnicode map in the PDF file.
Comment 22 Kevin Suo 2021-07-04 09:20:25 UTC
Created attachment 173337 [details]
Times MS.pdf (sdext.pdfimport success to detect)

This PDF file was created using MS Word. sdext.pdfimport (which calls "Font::identifyFont") successfully detected the fonts as TrueType, extracted the font attributes from the font files and applied them, so the correct fonts are shown in Draw.

$ pdffonts ./Times\ MS.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
ABCDEE+Times New Roman               TrueType          WinAnsi          yes yes no       5  0
ABCDEE+Times New Roman,Bold          TrueType          WinAnsi          yes yes no       7  0
ABCDEE+Times New Roman,Italic        TrueType          WinAnsi          yes yes no       9  0
ABCDEE+Times New Roman,BoldItalic    TrueType          WinAnsi          yes yes no      11  0
ABCDEE+Arial,Bold                    TrueType          WinAnsi          yes yes no      13  0
ABCDEE+Arial                         TrueType          WinAnsi          yes yes no      15  0

From the above we see that "uni" is "no", meaning that there is no explicit "ToUnicode" map in the PDF file.
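
The font names in the two pdffonts listings follow different conventions ("TimesNewRomanPS-BoldItalicMT" from Writer, "Times New Roman,BoldItalic" from MS Word), and both carry a subset tag of six uppercase letters plus "+". As a rough Python sketch of how bold/italic could be guessed from such names alone (this is an illustration, not what sdext.pdfimport does; `parse_pdf_font_name` is a hypothetical helper):

```python
import re

# Subset tag of an embedded subset font: six uppercase letters and '+'.
SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def parse_pdf_font_name(raw):
    """Return (family, bold, italic) guessed from a PDF font name."""
    name = SUBSET_TAG.sub("", raw)
    # Everything before the first '-' or ',' is treated as the family,
    # everything after it as style keywords.
    parts = re.split(r"[-,]", name, maxsplit=1)
    family = parts[0]
    style = parts[1].lower() if len(parts) == 2 else ""
    bold = "bold" in style
    italic = "italic" in style or "oblique" in style
    return family, bold, italic


for raw in ("CAAAAA+TimesNewRomanPS-BoldMT",
            "EAAAAA+TimesNewRomanPS-BoldItalicMT",
            "ABCDEE+Times New Roman,Italic",
            "FAAAAA+ArialMT"):
    print(raw, "->", parse_pdf_font_name(raw))
```

Note that the family part still differs between the two producers ("TimesNewRomanPS" versus "Times New Roman"), so mapping it to an installed font would need a separate step.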
Comment 23 Kevin Suo 2021-07-04 10:09:06 UTC
Adding Khaled Hosny to cc as he has worked on commit b94a66ebc8db6c5ca9c7dcfdfbb06b49deae4939.
Would you please take a look?

I am removing myself as assignee because the ToUnicode code is beyond my ability. The above patch is still valid, but it is not a complete fix.
Comment 24 Commit Notification 2021-07-12 08:58:23 UTC
Kevin Suo committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/da59686672fd2bc98f8cb28d5f04dc978b50ac13

tdf#78427 sdext.pdfimport: No need to read a font file for the purpose of...

It will be available in 7.3.0.

Comment 25 Commit Notification 2021-07-12 16:47:49 UTC
Kevin Suo committed a patch related to this issue.
It has been pushed to "libreoffice-7-2":

https://git.libreoffice.org/core/commit/ac3207d3b2c3b6580de14132fd12e9c6fedc6502

tdf#78427 sdext.pdfimport: No need to read a font file for the purpose of...

It will be available in 7.2.0.2.

Comment 26 Commit Notification 2021-07-12 18:22:52 UTC
Kevin Suo committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/cffd97193f7468f770368559d5a5c58bd0bb2327

tdf#78427 sdext.pdfimport: refactor the conversion of font family names

It will be available in 7.3.0.

Comment 27 Kevin Suo 2021-07-13 06:45:31 UTC
*** Bug 116059 has been marked as a duplicate of this bug. ***
Comment 28 Commit Notification 2021-07-14 16:22:58 UTC
Kevin Suo committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/6a1de4f74e2510029313771d2751b6cd59141acf

tdf#78427 sdext.pdfimport: more bold/italic/Oblique fixes

It will be available in 7.3.0.

Comment 29 Commit Notification 2021-07-14 16:56:22 UTC
Kevin Suo committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/12b57e43563a643dd653d78f3e2877ef75998d82

tdf#78427 tdf#81481 sdext.pdfimport: added unittest

It will be available in 7.3.0.

Comment 30 Kevin Suo 2021-07-16 05:29:41 UTC
Fixed on master. Please test.
Comment 31 Timur 2021-07-16 07:06:38 UTC
In attachment 98668 [details] I see bold (font is Arial-BoldMT - Embedded Subset), but not italic (font is CairoFont-1-0 - Embedded Subset) and bold italic (font is CairoFont-3-0 - Embedded Subset). 

attachment 98669 [details] is OK .

attachment 155937 [details] (apart from a separate wrap problem) got worse: headings went from bold to not bold (for the font substituted for MyriadPro-Semibold - Embedded Subset).

attachment 140174 [details] from duplicate bug is OK.

There's a separate bug 82163 for not reading embedded font.  Not sure if Comment 17 is sufficient explanation. But apart from that, this seems like an immediate regression. Or 

Note: attachment from bug 59870 not affected.
Comment 32 V Stuart Foote 2021-08-22 13:51:44 UTC
(In reply to Timur from comment #31)
 
> There's a separate bug 82163 for not reading embedded font.  Not sure if
> Comment 17 is sufficient explanation.

Also have the broader see also of bug 101220 (a dupe of bug 82163?)
Comment 33 Kevin Suo 2021-08-23 06:19:04 UTC
(In reply to Timur from comment #31)
> In attachment 98668 [details] I see bold (font is Arial-BoldMT - Embedded Subset), but not italic (font is CairoFont-1-0 - Embedded Subset) and bold italic (font is CairoFont-3-0 - Embedded Subset). 

CairoFont is a Type 1C font, which LibreOffice cannot detect. Currently, as far as I know, LibreOffice can only detect TrueType and Type 1 fonts, see:
https://opengrok.libreoffice.org/s?refs=identifyFont&project=core

Because it fails to detect the font attributes (family name, bold/italic, etc.) from the embedded font file, the only option is to guess them from the font name. However, the only clue we have is that the font names are "CairoFont-1-0" and "CairoFont-3-0". I assume one of them is the normal Cairo font while the other is bold or italic Cairo, but we cannot tell exactly from the embedded font name.
Meanwhile, the font information returned by the xpdfimport script is incorrect. The bold/italic flags are wrong, e.g. for the bold italic font it returned:
updateFont 8 0 0 0 0 3840.000000 0 CairoFont-3-0

So the solution would be either to add Type 1C font detection in vcl/source/font/font.cxx, or to tweak the xpdfimport script to properly detect the bold/italic properties in the PDF.
Comment 34 Kevin Suo 2021-10-08 15:30:31 UTC
Despite the various commits merged above, here is a TODO list of remaining items related to this bug:

1. To correctly parse documents like attachment 155937 [details] (which contains fonts with Semibold / Light features), the current font handling in sdext/pdfimport needs to be reworked. E.g., the FontAttributes model in sdext/source/pdfimport/inc/contentsink.hxx needs to be extended to include semibold, light, etc., but I think a better way would be to replace FontAttributes with the more feature-rich vcl::Font type.

2. To correctly parse attachment 98668 [details], we need to be able to detect Type 1C fonts. My understanding is that currently LibreOffice can only detect TrueType and Type 1 fonts, but cannot detect Type 1C ("C" means compressed?) fonts.
The italic font in attachment 98668 [details] is named "CairoFont-1-0". Although we can guess that the font name is CairoFont, nothing in the name indicates that it is italic, so the only way to show it as italic is to detect this from the embedded Type 1C font.

While I will continue to try to fix #1, I don't think I can handle #2.
As a result, I am setting the status back to NEW and the assignee to default, so that others can continue to contribute to fixing this bug.
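
As an illustration of TODO item 1, the following Python sketch maps weight keywords in a font name to numeric weights instead of a single bold flag. It is only a sketch under stated assumptions: the keyword table and the helper `guess_weight` are hypothetical and are not taken from sdext.pdfimport; the numeric values follow the usual OpenType weight classes (400 regular, 700 bold).

```python
import re

# Longer compound keywords come first so that "semibold" is not
# mistaken for plain "bold", nor "extralight" for plain "light".
WEIGHT_KEYWORDS = [
    ("extralight", 200), ("ultralight", 200),
    ("semibold", 600), ("demibold", 600),
    ("extrabold", 800), ("ultrabold", 800),
    ("thin", 100), ("light", 300), ("medium", 500),
    ("bold", 700), ("black", 900), ("heavy", 900),
]


def guess_weight(font_name):
    """Return a numeric weight guessed from the style part of a font
    name, or None when the name carries no weight keyword."""
    parts = re.split(r"[-,]", font_name, maxsplit=1)
    style = parts[1].lower() if len(parts) == 2 else ""
    for keyword, weight in WEIGHT_KEYWORDS:
        if keyword in style:
            return weight
    return None


for name in ("MyriadPro-Semibold", "MyriadPro-Light",
             "Arial-BoldMT", "CairoFont-1-0"):
    print(name, "->", guess_weight(name))
```

For "CairoFont-1-0" the sketch returns None, which matches item 2: the name carries no style information, so it would have to come from the embedded font itself.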
Comment 35 Commit Notification 2021-10-11 07:08:16 UTC
Kevin Suo committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/4eef83dc4a8879f21ee6c98226510ac728bc317a

sdext.pdfimport tdf#78427: Add support for more Font Weight features

It will be available in 7.3.0.

Comment 36 Kevin Suo 2021-10-11 08:01:39 UTC
#1 is fixed on master. Please test.

#2 has been split off into bug 145061.