Bug 67370 - PDF: Hyphenation not visible when text is exported as tagged PDF (applies to PDF/A-1 as well)
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 4.1.0.4 release
Hardware: Other Linux (All)
Importance: high major
Assignee: Khaled Hosny
URL:
Whiteboard: BSA Confirmed:4.1.2.3:Ubuntu target:4...
Keywords: bibisected, regression
Duplicates: 68836
Depends on:
Blocks: mab4.1
Reported: 2013-07-26 19:56 UTC by kivi
Modified: 2015-12-17 07:16 UTC (History)
CC List: 6 users

See Also:
Crash report or crash signature:


Attachments
A document, PDF exports and an image of the PDF export form of LO (139.67 KB, application/zip)
2013-08-07 08:22 UTC, kivi
.ODT source and PDF/A-1a output (21.50 KB, application/zip)
2013-12-11 15:32 UTC, ozerski
Patch to use ActualText for the hyphens (1.87 KB, patch)
2013-12-11 18:25 UTC, Khaled Hosny

Description kivi 2013-07-26 19:56:38 UTC
Problem description: 

Steps to reproduce:
1. Open Writer and write some Cyrillic text
2. Hyphenate this text
3. Export this text to PDF

Current behavior:
The hyphenation signs "-" are not visible in the PDF

Expected behavior:
The hyphenation signs "-" must be visible in the PDF

              
Operating System: openSUSE
Version: 4.1.0.4 release
Last worked in: 3.6.6.2 release
Comment 1 Joel Madero 2013-08-06 20:46:57 UTC
Please attach a test document so we can test easily.

Marking as NEEDINFO - once attachment is done mark as UNCONFIRMED and we'll verify the bug. Thanks!
Comment 2 kivi 2013-08-07 08:22:15 UTC
Created attachment 83760
A document, PDF exports and an image of the PDF export form of LO

While testing the same problem further, I found that the "-" signs are missing only when exporting to PDF/A-1.
When exporting without this option, the hyphenation in the exported PDF is OK.
I have attached a document with some text to test this.
Please export it as PDF/A-1 and also without this option. You will get the same 2 PDF files as attached: one without the "-" signs and one with them.
Comment 3 Robinson Tryon (qubit) 2013-10-16 00:42:49 UTC
Test documents attached, so I'm changing status from NEEDINFO -> UNCONFIRMED.

I'll try to confirm this bug now.
Comment 4 Robinson Tryon (qubit) 2013-10-16 01:22:36 UTC
Hi Krasimir,

I'm testing on Ubuntu 12.04.3 + LO 4.1.2.3.

The exported PDF does appear to be missing the hyphens as you describe, so I'm going to mark this as NEW.

Testing on Ubuntu 12.04.3 + LO 3.5.7.2 confirms that this is a regression in export. As such, I'm adding this 'bibisectrequest' to the Whiteboard.

Note: I got a warning about transparency when I exported the document to PDF in 4.1 but not in 3.5. Per the message in 4.1, PDF/A prohibits transparent objects, so something transparent was painted opaque. I'm not sure if this is a clue to the problem, but it does appear that something was tweaked in PDF export between these versions.

As this bug is preventing you from achieving proper PDF output and it is a regression, I agree with the Priority: High and Severity: Major.
Comment 5 Mirosław Zalewski 2013-10-21 07:33:11 UTC
*** Bug 68836 has been marked as a duplicate of this bug. ***
Comment 6 Mirosław Zalewski 2013-10-21 16:37:31 UTC
I can confirm this bug on 4.1.2, Debian testing amd64.
I can NOT confirm this behavior on Windows 7 (i386). I have tested 4.1.0 (Build ID: 89ea49ddacd9aa532507cbf852f2bb22b1ace28) and 4.1.2 (Build ID: 40b2d7fde7e8d2d7bc5a449dc65df4d08a7dd38). When exporting the test document to PDF, LO does complain about transparency, but the resulting files contain hyphenation as expected.

Also, I have found it does not affect PDF/A-1 per se. In fact, PDF will be broken when "Create tagged PDF" is checked. Since checking PDF/A-1 automatically checks "tagged PDF", PDF/A-1 files are affected as well, but bug seems to be in tagging code.

@Krasimir, Qubit: could you verify that bug exists when you check "tagged PDF", but leave "PDF/A-1" unchecked?

I am thinking about adding this issue to 4.1 MAB. On one hand, it breaks the only reliable way of interchanging electronic documents, which is rather serious. On the other hand, this seems to affect only Linux and more or less convenient workaround exists (just create untagged PDF).

Best regards
Mirosław Zalewski
Comment 7 kivi 2013-10-22 06:40:30 UTC
Dear Miroslav,

I will try it later today and report the results afterwards.

With best regards,
Comment 8 Robinson Tryon (qubit) 2013-10-22 18:15:04 UTC
(In reply to comment #6)
> 
> Also, I have found it does not affect PDF/A-1 per se. In fact, PDF will be
> broken when "Create tagged PDF" is checked. Since checking PDF/A-1
> automatically checks "tagged PDF", PDF/A-1 files are affected as well, but
> bug seems to be in tagging code.
> 
> @Krasimir, Qubit: could you verify that bug exists when you check "tagged
> PDF", but leave "PDF/A-1" unchecked?

Testing again on Ubuntu 12.04.3 + LO 4.1.2.3

STEPS:
1) Open Example_file-for_PDF_export.doc
2) File -> Export as PDF
3) The only values under 'General' that are checked are

[See TEST1, TEST2 below]

4) Click 'Export'

TEST1: 
The only values checked are
  - Create PDF form
  - Export bookmarks

This PDF appears to contain the proper hyphens.

TEST2:
The only values checked are
  - Tagged PDF
  - Create PDF form
  - Export bookmarks

This PDF does NOT contain the proper hyphens.

So yes -- the bug appears to be in the PDF tagging code.

> I am thinking about adding this issue to 4.1 MAB. On one hand, it breaks the
> only reliable way of interchanging electronic documents, which is rather
> serious. On the other hand, this seems to affect only Linux and more or less
> convenient workaround exists (just create untagged PDF).

I wasn't sure what tagged PDFs included, so I did a little research:
http://www.pdfa.org/2011/10/the-value-of-tagged-pdf/

Tagged PDFs have semantic information, may be reflowed, etc. Tagging is necessary for 508 compliance and other accessibility guidelines.

I think that consistent, high-fidelity output of accessible PDF documents is key for any of our users -- business, government, or individual alike. I'd support adding this to 4.1 MAB.
Comment 9 kivi 2013-10-23 09:24:14 UTC
Hi everyone!
I have tested again with LibreOffice version 4.1.2.3 (Build ID: 40b2d7fde7e8d2d7bc5a449dc65df4d08a7dd38), downloaded from the Document Foundation and installed on openSUSE 12.3.
I can confirm that checking the tagged PDF check box is the cause of the missing hyphenation signs in the exported PDF file.
Sorry for my earlier misinformation. At that time I did not realize that the tagged format, not PDF/A-1, is the reason for the missing hyphenation signs.
Comment 10 Mirosław Zalewski 2013-10-24 12:19:02 UTC
Hi

I have managed to get a bibisect setup running and bibisected this issue. It seems that it was introduced somewhere during alpha testing. I am attaching the full bibisect log below.

#v+
a46ea509c2186592f3705702573dbedaea50feeb is the first bad commit
commit a46ea509c2186592f3705702573dbedaea50feeb
Author: Jean-Baptiste Lallement <jean-baptiste.lallement@canonical.com>
Date:   Tue May 7 08:31:26 2013 +0000

    source-hash-9a7603187eb5cc580d33212ee147f9ac89de55f4
    
    commit 9a7603187eb5cc580d33212ee147f9ac89de55f4
    Author:     Michael Stahl <mstahl@redhat.com>
    AuthorDate: Mon May 6 17:19:41 2013 +0200
    Commit:     Michael Stahl <mstahl@redhat.com>
    CommitDate: Tue May 7 01:41:23 2013 +0200
    
        dbaccess: remove Package_inc
    
        Change-Id: I8e6748eef04f25603851a33d049cb9585fa04cc6

:100644 100644 eb867a4a16e11b4240c83c10d384a164eb9dd4ab 3b24de4bd63bea22705de8af055542e8db93492f M      autogen.log
:100644 100644 4c3702cf21398a18e40bfba26c2ec201c6a8b673 15ac2000952656b846178ecaed08dead6f937aaa M      ccache.log
:100644 100644 5aa01a110764ff8bbb8416aa0c88f1bdd5c22c07 c8fd26b288b93c193bcd50054f6b9bebe3e01100 M      commitmsg
:100644 100644 22d21e9210fa0a62e366a3ca4559ff5072930649 01770ad39635e540938502edbb97b86e2a96c409 M      dev-install.log
:100644 100644 fb7e0f36b841d2209ebc3a763b79ebe170e2604f 074d89851f9d5e5914463c912f9fe88c9a0c5c7c M      make.log
:040000 040000 f12af32a62495cefa9ea554e86db43ba0ae13914 88461f4da1abc8f13b4996a04428e2ee6ea3107b M      opt
minio@pingwin binrepo $ git bisect log 
# bad: [d31848bf3b700a22d127d7c775a0f910a7e133d0] source-hash-86cbe18a6143bf054c31f69dc97368dfdd3ad374
# good: [3e7462bd65e692bf0592d5b080b7716341b62a47] source-hash-1eddfce9894fd05315173744f495619189093dc7
git bisect start 'latest' 'oldest'
# good: [578fb08152ad11454e2f09ad6f8c8e527da817de] source-hash-4e3e171262aed0e52fa76158950d5be770249e80
git bisect good 578fb08152ad11454e2f09ad6f8c8e527da817de
# bad: [efb04c1c794ef7fc4cda1eb80880d333ca969a5e] source-hash-7908692490120350f2ad45241f7b19ba52dc0489
git bisect bad efb04c1c794ef7fc4cda1eb80880d333ca969a5e
# bad: [b46b5a58fcaec85eefb31b23afb0fc389a0c5334] source-hash-34c1b7bdd0bca4753f66a7d17ef46647a64a319e
git bisect bad b46b5a58fcaec85eefb31b23afb0fc389a0c5334
# bad: [5bc142137acfc7a70e919009ef5b64fc7163b75f] source-hash-a140350dae5db298094583763daf0a8bed8480cb
git bisect bad 5bc142137acfc7a70e919009ef5b64fc7163b75f
# bad: [097c8cd2e7db185e437f7d2d193975f908ffac75] source-hash-a71b30f6c20197eb07249aa91a85c83eb3d4fb2d
git bisect bad 097c8cd2e7db185e437f7d2d193975f908ffac75
# good: [d72b7e89dc951ec2565b2050a8497b719cfe10b5] source-hash-cdfad2dbbf180d3c556964c7aa8e0bb3b299d5e3
git bisect good d72b7e89dc951ec2565b2050a8497b719cfe10b5
# good: [73ab2a75583e16f3365877d5e3faf73ce3b8fd63] source-hash-66e47a5cfd177571dc0811abda4e0a5a6ae8c56a
git bisect good 73ab2a75583e16f3365877d5e3faf73ce3b8fd63
# bad: [a46ea509c2186592f3705702573dbedaea50feeb] source-hash-9a7603187eb5cc580d33212ee147f9ac89de55f4
git bisect bad a46ea509c2186592f3705702573dbedaea50feeb
# good: [32187e43966523317ef55793a968ceb78710ffa1] source-hash-3def5194ddaf9c4d766b71527874bd1a973b43e5
git bisect good 32187e43966523317ef55793a968ceb78710ffa1
# first bad commit: [a46ea509c2186592f3705702573dbedaea50feeb] source-hash-9a7603187eb5cc580d33212ee147f9ac89de55f4
#v-
Comment 11 Robinson Tryon (qubit) 2013-11-06 08:34:54 UTC
Adding my earlier Confirmation of Repro (on Ubuntu 12.04 + LO 4.1.2.3) to the whiteboard
Comment 12 Khaled Hosny 2013-12-06 11:47:50 UTC
This is an old bug that got exposed by recent changes in the underlying layout code.

Basically, what is happening is that when we export to tagged PDF, U+00AD (soft hyphen) is used as the hyphen character (probably to allow for text re-flow). This is wrong, since U+00AD is a control character and should not have any visible output: its sole purpose is to indicate a possible hyphenation point. Our new (more Unicode compliant) layout engine strips such control characters and replaces them with a zero width space. The old engine did not do this, and it worked to some degree (as long as the font has a visible glyph in that position, which most fonts do, although there is no requirement to).

I need to find the proper way to tag an automatically inserted hyphen in tagged PDF and implement it, so I would appreciate it if anyone could research this.
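
To illustrate the explanation above, here is a minimal Python sketch. It is not LibreOffice code: `strip_format_controls` is a hypothetical stand-in for what the new layout engine is described as doing, and the only library facts used are those of Python's standard `unicodedata` module. Note that Unicode formally classifies U+00AD as a format character (general category `Cf`) rather than a control character proper.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"       # U+00AD, marks a possible hyphenation point
ZERO_WIDTH_SPACE = "\u200b"  # U+200B, has no visible output


def strip_format_controls(text: str) -> str:
    """Hypothetical model of the new engine: replace every format
    character (Unicode general category Cf) with a zero width space."""
    return "".join(
        ZERO_WIDTH_SPACE if unicodedata.category(ch) == "Cf" else ch
        for ch in text
    )


# Text as it was handed to the tagged PDF export: the soft hyphen
# itself was being used as the visible hyphen at the line break.
line_end = "hyphen" + SOFT_HYPHEN

print(unicodedata.category(SOFT_HYPHEN))      # general category of U+00AD
print(repr(strip_format_controls(line_end)))  # the soft hyphen is gone
```

Under this model nothing visible is left at the end of the line, which matches the missing hyphens reported in tagged PDF output.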
Comment 13 ozerski 2013-12-11 15:22:28 UTC
I confirm this problem with PDF/A-1
Comment 14 ozerski 2013-12-11 15:27:34 UTC
I also found this effect when only Latin characters were used (LO 4.1.3.2 release / Linux, build ID 70feb7d99726f064edab4605a8ab840c50ec57a).
Comment 15 ozerski 2013-12-11 15:32:24 UTC
Created attachment 90611
.ODT source and PDF/A-1a output
Comment 16 Khaled Hosny 2013-12-11 18:25:31 UTC
Created attachment 90618
Patch to use ActualText for the hyphens

Here is a preliminary patch that tries to use PDF’s ActualText feature to encode the soft hyphen without using its glyph. The hyphen shows up and text extraction/search seems to work, but I’m not sure whether it is a valid tagged PDF, since I have no means to validate the file. Help is appreciated.
Comment 17 Robinson Tryon (qubit) 2013-12-11 22:41:12 UTC
(In reply to comment #16)
> Here is a preliminary patch that tries to use PDF’s ActualText feature to
> encode the soft hyphen without using its glyph. The hyphen shows up and text
> extraction/search seems to work, but I’m not sure whether it is a valid tagged
> PDF, since I have no means to validate the file. Help is appreciated.

I'm not sure about the robustness of this online PDF validator, but it's worth a shot:
http://www.pdf-tools.com/pdf/validate-pdfa-online.aspx

Cheers,
--R
Comment 18 Khaled Hosny 2013-12-13 12:14:47 UTC
I tried this and another online validator, and both are happy even with no ActualText at all, which makes me skeptical about their usefulness (unless PDF/A-1a does not actually require tagged PDF).
Comment 19 Robinson Tryon (qubit) 2013-12-13 12:27:06 UTC
(In reply to comment #18)
> I tried this and another online validator, and both are happy even with no
> ActualText at all, which makes me skeptical about their usefulness (unless
> PDF/A-1a does not actually require tagged PDF).


It appears that PDF/A-1a does require a tagged PDF:
https://en.wikipedia.org/wiki/PDF/A#PDF.2FA-1

"PDF/A-1a includes all the requirements of PDF/A-1b and additionally requires:

* document structure must be included (hierarchy)
* Tagged PDF (use of alternative texts for images, tagging text spans and giving them an ID, replacement texts for symbols)
..."

And ActualText does sound very similar:
http://webaim.org/discussion/mail_thread?thread=4817

"From Adobe's Help for InDesign CS 5.5:

"PDF also supports actual text, in addition to Alt text. Actual text can be
applied to graphic elements that visually look like text. For example, a
scanned TIFF image. Actual text is used to represent words that were
converted to artwork. Actual text is only applicable for tagged PDFs."

Sooo... I'm not sure. Perhaps we could ask someone at Adobe and/or at one of these validator companies?
Comment 20 Khaled Hosny 2013-12-13 12:47:02 UTC
The PDF specification is clear that an automatically inserted hyphen should be mapped to a soft hyphen when tagged PDF is used. Older versions of the spec just suggest using the soft hyphen glyph, which is what we are doing now, but it does not work with our strict layout engine. Later versions of the spec suggest using ActualText, which I’m trying to do now, but I’m not sure if it is actually working. Those two validators (http://www.pdf-tools.com/pdf/validate-pdfa-online.aspx and http://www.validatepdfa.com/online.htm) seem to be happy even with just a regular ASCII hyphen-minus, so I doubt they are actually checking the ActualText tags or even caring about the hyphen at all.
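
As a rough sketch of the ActualText approach described above (not the attached patch, which works on glyphs inside LibreOffice's PDF writer), the Python below builds a marked-content sequence in which a visible hyphen is drawn while the replacement text is U+00AD. The helper name `actual_text_span` is hypothetical, and the sketch assumes a simple font encoding in which the string "-" selects a hyphen glyph; escaping of special characters in the visible string is left out.

```python
def actual_text_span(visible: str, actual: str) -> str:
    """Wrap `visible` in a /Span marked-content sequence whose
    ActualText is `actual`, written as a UTF-16BE hex string with BOM."""
    hex_actual = ("\ufeff" + actual).encode("utf-16-be").hex().upper()
    return f"/Span <</ActualText <{hex_actual}>>> BDC ({visible}) Tj EMC"


# Draw a hyphen-minus glyph, but report a soft hyphen to text extraction.
print(actual_text_span("-", "\u00ad"))
# prints: /Span <</ActualText <FEFF00AD>>> BDC (-) Tj EMC
```

A consumer that honours ActualText would then extract the soft hyphen instead of the drawn glyph, so search and re-flow can treat the line break as a hyphenation point.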
Comment 21 Commit Notification 2013-12-21 23:17:30 UTC
Khaled Hosny committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=4dba6f5837539746293ef6808ea39a764ab7654d

fdo#67370: Hyphens are not visible in tagged PDF



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 22 Khaled Hosny 2013-12-21 23:44:20 UTC
After fixing this patch a bit and doing some tests, things seem to work fine and the PDF is standards compliant. I also checked InDesign and it seems to be using ActualText as well, so I pushed the fixed patch.
Comment 23 Commit Notification 2013-12-23 22:07:34 UTC
Khaled Hosny committed a patch related to this issue.
It has been pushed to "libreoffice-4-2":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=a9fa817b6876814b6ebc45c2534a769e1fa84cac&h=libreoffice-4-2

fdo#67370: Hyphens are not visible in tagged PDF


It will be available in LibreOffice 4.2.

Comment 24 Commit Notification 2013-12-23 22:08:59 UTC
Khaled Hosny committed a patch related to this issue.
It has been pushed to "libreoffice-4-1":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=158027cd7fe1ea2faeb5d2220547b36c9cbfb9d3&h=libreoffice-4-1

fdo#67370: Hyphens are not visible in tagged PDF


It will be available in LibreOffice 4.1.5.

Comment 25 wettererscheinung 2014-01-25 21:20:17 UTC
Sorry, but this bug is not solved in version 4.1.4.2.
Isn't it possible to make that patch for this version too? I guess it will still take some time until it is available in distro packages.

Thanks
(and apologies if I just broke some usual rules I don't know; I don't want to bother anyone, I just need that patch as I am using Persian texts)
Comment 26 Joel Madero 2014-01-25 21:24:18 UTC
Please do not update things without knowing the procedures. Version is the oldest affected version, not a newer one that also shows the behaviour, and the bug is fixed in 4.1.5 as listed (it will not be fixed in 4.1.4). Moving back to Fixed and setting the version back to its previous value.

For future reference when it says: target:4.1.5 it means that's when the patch will be seen by users :) Hope that helps
Comment 27 Robinson Tryon (qubit) 2015-12-17 07:16:37 UTC
Migrating Whiteboard tags to Keywords: (bibisected)
[NinjaEdit]