142359 – ACCESSIBILITY: Language tagging is lost when merging LO generated PDFs with Acrobat

Status:	RESOLVED NOTOURBUG

Alias:	None

Product:	LibreOffice
Classification:	Unclassified
Component:	Printing and PDF export
Version: (earliest affected)	3.3.0 release
Hardware:	All All

Importance:	medium normal
Assignee:	Not Assigned

URL:
Whiteboard:
Keywords:	accessibility, filter:pdf

Depends on:
Blocks:

Reported:	2021-05-18 17:37 UTC by devseppala
Modified:	2024-03-03 19:51 UTC
CC List:	3 users

See Also:
Crash report or crash signature:

Attachments
Example multilingual odt-document (9.21 KB, application/pdf) 2021-05-18 17:41 UTC, devseppala
PDF export of odt-document with working language tagging (17.46 KB, application/pdf) 2021-05-18 17:42 UTC, devseppala
odt -> PDF-export merged using Acrobat (language tags are lost) (18.61 KB, application/pdf) 2021-05-18 17:46 UTC, devseppala
Example multilingual Word-document (13.11 KB, application/vnd.openxmlformats-officedocument.wordprocessingml.document) 2021-05-18 17:46 UTC, devseppala
PDF export of Word-document with working language tagging (19.38 KB, application/pdf) 2021-05-18 17:47 UTC, devseppala
Word -> PDF-export merged using Acrobat (language tags working) (25.85 KB, application/pdf) 2021-05-18 17:49 UTC, devseppala
Example multilingual odt-document (9.21 KB, application/vnd.oasis.opendocument.text) 2021-05-18 17:59 UTC, devseppala
Missing language in the object properties (before and after picture) (227.85 KB, image/gif) 2023-06-21 14:17 UTC, devseppala

Description devseppala 2021-05-18 17:37:00 UTC

When multilingual accessible PDF files generated by LibreOffice are merged using Adobe Acrobat, the language information in the document tag structure is lost.

To my understanding, this happens because there are two ways to do language tagging in PDF files:

https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=619

* Structure elements of any type, through a Lang entry in the structure element dictionary.

* Marked-content sequences that are not in the structure hierarchy, through a Lang entry in a property list attached to the marked-content sequence with a Span tag.

I think that LibreOffice uses the former strategy, whereas Word uses the latter. When Word-generated PDF files are merged with Acrobat, the language information is retained; when LibreOffice-generated files are merged, it is lost.
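
To make the difference concrete, here is a minimal sketch of what the two mechanisms look like in PDF syntax. The helper names `struct_elem_with_lang` and `span_marked_content` are made up for illustration, the structure element dictionary is simplified (a real one also carries parent and kid references), and the content-stream fragment leaves out the surrounding text object and font selection. Which form each exporter really emits is my reading of the attached files, not something confirmed from the source code.

```python
def struct_elem_with_lang(struct_type: str, lang: str) -> str:
    """Simplified structure element dictionary carrying its own Lang entry."""
    return f"<< /Type /StructElem /S /{struct_type} /Lang ({lang}) >>"


def span_marked_content(text: str, lang: str) -> str:
    """Content-stream fragment: a Span marked-content sequence whose
    property list holds the Lang entry."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"/Span << /Lang ({lang}) >> BDC ({escaped}) Tj EMC"


# Method 1: language on the structure element (shown on Acrobat's 'Tag' tab).
print(struct_elem_with_lang("Span", "de-DE"))
# Method 2: language on the marked-content sequence itself.
print(span_marked_content("Guten Tag", "de-DE"))
```

In the first form the language lives in the tag tree; in the second it travels with the page content, which may be why it survives a page-level merge.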

The real problem is of course that Acrobat does not support the PDF standard properly, and Adobe should fix its software.

However, Acrobat is the de facto tool for editing PDF files, and I think many users have to merge their LibreOffice-generated PDF documents with other documents using Acrobat. This Acrobat incompatibility will result in a lot of multilingual documents not being properly accessible. This is also problematic because normal accessibility checkers cannot even detect that multilingual documents are not properly language tagged; they only check that a document-level language property exists. So, in many cases language tagging will be silently lost.

Could LibreOffice also support the language tagging method favoured by Acrobat, in addition to the current method? I think this would resolve this issue.

Comment 1 devseppala 2021-05-18 17:41:17 UTC

Created attachment 172137 [details]
Example multilingual odt-document

Comment 2 devseppala 2021-05-18 17:42:51 UTC

Created attachment 172138 [details]
PDF export of odt-document with working language tagging

Comment 3 devseppala 2021-05-18 17:46:14 UTC

Created attachment 172139 [details]
odt -> PDF-export merged using Acrobat (language tags are lost)

Comment 4 devseppala 2021-05-18 17:46:58 UTC

Created attachment 172140 [details]
Example multilingual Word-document

Comment 5 devseppala 2021-05-18 17:47:34 UTC

Created attachment 172141 [details]
PDF export of Word-document with working language tagging

Comment 6 devseppala 2021-05-18 17:49:14 UTC

Created attachment 172142 [details]
Word -> PDF-export merged using Acrobat (language tags working)

Comment 7 devseppala 2021-05-18 17:59:27 UTC

Created attachment 172143 [details]
Example multilingual odt-document

I accidentally marked the first .odt example document as application/pdf

Comment 8 bunkem 2023-05-17 14:34:25 UTC

(In reply to devseppala from comment #0)
> To my understanding, this happens because there are two ways to do language
> tagging in PDF files:
> 
> https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=619
> 

I'm looking at page 619.  Are you referring to the "Figure 58 - Complex Web Capture file structure"?

Thanks.

Comment 9 devseppala 2023-05-28 19:04:27 UTC

(In reply to bunkem from comment #8)
> I'm looking at page 619.  Are you referring to the "Figure 58 - Complex Web
> Capture file structure"?
> 
> Thanks.

No, in the document footer page numbering I am referring to page 611 (chapter 14.9.2, Natural Language Specification). The link points to page 619 because the first pages are not part of the document footer page numbering, and '#page=xx' links refer to "real" page numbers.
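
As a small sketch of that arithmetic: the offset of 8 is simply 619 minus 611, taken from the two page numbers above. The function name `page_fragment_url` is hypothetical, and a constant offset is only an assumption that holds for pages after the unnumbered front matter.

```python
# 619 (physical page) - 611 (printed footer page) = 8 unnumbered front pages.
FRONT_MATTER_PAGES = 8


def page_fragment_url(base_url: str, printed_page: int,
                      offset: int = FRONT_MATTER_PAGES) -> str:
    """Build a '#page=' link from the page number printed in the footer."""
    return f"{base_url}#page={printed_page + offset}"


print(page_fragment_url(
    "https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf",
    611,
))
```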

I now see that the linked page url has changed. Here is a new working link.

https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf#page=619

Comment 10 bunkem 2023-06-12 21:00:22 UTC

Hi @devseppala,

The way I read page 611, there appear to be two different situations, and so two ways the language can be tagged for text in a document. Please note that the standard says "may" be specified: *"the language may be specified for the following items"*

So I'm not sure whether this is something missing in Acrobat. Could it be that LO has implemented only one of the two situations in the standard?

If I have the right section, here is what I base this on.

Clipped below. **Bold** emphasis added by me.
> Natural language **may** be specified for text in a document or for optional content.
>
> The natural language used for text in a document shall be determined in a hierarchical fashion, based on whether an optional Lang entry (PDF 1.4) is present in any of several possible locations. At the highest level, the document’s default language (which applies to both text strings and text within content streams) may be specified by a Lang entry in the document catalogue (see 7.7.2, “Document Catalog”). Below this, the language may be specified for the following items:
> 
> • Structure elements of any type (see 14.7.2, “Structure Hierarchy”), through a Lang entry in the structure element dictionary.
> • Marked-content sequences that are not in the structure hierarchy (see 14.6, “Marked Content”), through a Lang entry in a property list attached to the marked-content sequence with a Span tag.
> NOTE 1 - Although Span is also a standard structure type, as described under 14.8.4.4, “Inline-Level Structure Elements,” its use here is entirely independent of logical structure.
> NOTE 2 - The natural language used for optional content allows content to be hidden or revealed, based on the Lang entry (PDF 1.5) in the Language dictionary of an optional content usage dictionary.
> NOTE 3 - The following sub-clauses provide details on the value of the Lang entry and the hierarchical manner in which the language for text in a document is determined.
>
> Text strings encoded in Unicode may include an escape sequence or language tag indicating the language of the text and overriding the prevailing Lang entry (see 7.9.2.2, “Text String Type”).
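
The "hierarchical fashion" in the clip above can be sketched roughly as follows. This is only an illustration of the structure-element part under my reading of the text: the nearest element with a Lang entry wins, and the document catalogue's Lang is the fallback. The `StructElem` class and `effective_lang` function are made-up names, and the marked-content and text-string overrides described in the later sub-clauses are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StructElem:
    """Minimal stand-in for a structure element (hypothetical model)."""
    struct_type: str
    lang: Optional[str] = None
    parent: Optional["StructElem"] = None


def effective_lang(elem: Optional[StructElem],
                   catalog_lang: Optional[str]) -> Optional[str]:
    """Walk up the structure hierarchy until a Lang entry is found,
    falling back to the document catalogue's default language."""
    node = elem
    while node is not None:
        if node.lang:
            return node.lang
        node = node.parent
    return catalog_lang


document = StructElem("Document")
paragraph = StructElem("P", parent=document)
german_span = StructElem("Span", lang="de-DE", parent=paragraph)

print(effective_lang(german_span, "en-CA"))  # Lang on the element itself
print(effective_lang(paragraph, "en-CA"))    # falls back to the catalogue
```

If a merge drops the Lang entry from the Span element, this lookup falls through to the document default, which matches the symptom reported here.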

Please confirm I'm looking at the right section.

Then I'll have a look at your docs.

Comment 11 bunkem 2023-06-12 21:08:50 UTC

Could you also explain the steps how you do the "merge" operation in Acrobat?

Thanks

Comment 12 devseppala 2023-06-14 21:32:34 UTC

(In reply to bunkem from comment #10)
> Please confirm I'm looking at the right section.

Yes, you are looking at the right section.

I should say that I am not a PDF expert. During accessibility checking of documents, I just happened to discover that the language markings of LibreOffice-generated PDF files disappear after merging them with Acrobat Pro. I found this out when the JAWS screen reader could no longer dynamically change the reading language according to the content. Then I started to investigate the issue a little further.

I should also mention that there are tools that can merge LO generated PDF files in a way that preserves the language tagging. One such tool is the Foxit PDF editor.
https://www.foxit.com/merge-pdf/

This is probably irrelevant, but the PDF 1.3 reference manual seems to indicate that in the past there was even a third way of marking the language of text.
page 39, https://people.ksp.sk/~vlado/tex/pdfspec.pdf

"The text may also contain an escape sequence to indicate the language of the text. This is useful when the language cannot be determined from the character codes used in the text. The escape sequence uses the Unicode hex value U+001B followed by the two ASCII codes for the language identifiers defined by ISO 639 (see Appendix I), optionally followed by the two ASCII codes for country defined by ISO 3166 (see Appendix J), followed by U+001B."

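As a rough sketch of the escape sequence described in that quote: the functions below work on the decoded Unicode text only and do not deal with how the string is encoded inside the PDF file. The names `with_language_escape` and `split_language_escape` are made up for illustration.

```python
from typing import Optional, Tuple

ESC = "\x1b"  # U+001B, as named in the quoted passage


def with_language_escape(text: str, language: str, country: str = "") -> str:
    """Prefix text with ESC + 2-letter language [+ 2-letter country] + ESC."""
    if len(language) != 2 or (country and len(country) != 2):
        raise ValueError("language and country codes must be two letters each")
    return f"{ESC}{language}{country}{ESC}{text}"


def split_language_escape(value: str) -> Tuple[Optional[str], str]:
    """Return (language code or None, remaining text)."""
    if not value.startswith(ESC):
        return None, value
    end = value.find(ESC, 1)
    if end == -1:
        return None, value
    return value[1:end], value[end + 1:]


tagged = with_language_escape("Guten Tag", "de", "DE")
print(repr(tagged))
print(split_language_escape(tagged))
```
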
Comment 13 devseppala 2023-06-14 21:41:04 UTC

(In reply to bunkem from comment #11)
> Could you also explain the steps how you do the "merge" operation in Acrobat?

I don't own Acrobat Pro myself and I won't have access to Acrobat until next week. I will get back to you with the steps once I have access, because I can't remember the process accurately enough off the top of my head.

Comment 14 bunkem 2023-06-20 12:49:01 UTC

(In reply to devseppala from comment #13)
> (In reply to bunkem from comment #11)
> > Could you also explain the steps how you do the "merge" operation in Acrobat?
> 
> I don't own Acrobat Pro my self and I won't have access to Acrobat until
> next week. I will come back to you to explain the steps once I have access
> to Acrobat, because I can't remember the process accurately enough from the
> top of my head.

Hello @devseppala.  Thank you for your detailed replies.

I look forward to hearing back how you merged the files with Acrobat so I can test it here.  I have a copy of Acrobat Pro DC.

As you mentioned that you don't have access to Acrobat until this week, how did you merge the files before?  

Thanks!

Comment 15 devseppala 2023-06-21 14:14:43 UTC

Hi @bunkem.

Here are the steps that I use to merge files with Acrobat Pro

* Tools  
* Combine Files
   * Add files (Select the first file, example file: accessibility_pdf_lang_tags_LO713.pdf) 
   * Add files (Select the second file, I selected the same file twice)
* Combine
* Save the resulting file

After the merge, the language information is missing from the German-language Span tag; see the picture "Missing language in the object properties".
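
For anyone who wants to compare the files without clicking through Acrobat's dialogs, the check could be sketched as below. It assumes the structure tree of each PDF has already been loaded into plain Python dicts and lists by some PDF library (the loading step is not shown and the sample data is invented); `collect_langs` is a hypothetical helper, not part of any library.

```python
def collect_langs(node, path="StructTreeRoot"):
    """Recursively list (path, Lang) pairs found in a structure tree
    given as nested dicts and lists; other kid types are ignored."""
    found = []
    if isinstance(node, dict):
        label = str(node.get("/S", "")).lstrip("/")
        here = f"{path}/{label}" if label else path
        if "/Lang" in node:
            found.append((here, node["/Lang"]))
        found.extend(collect_langs(node.get("/K"), here))
    elif isinstance(node, list):
        for child in node:
            found.extend(collect_langs(child, path))
    return found


# Invented sample trees standing in for the export and the merged file.
before = {"/K": [{"/S": "/Document", "/K": [
    {"/S": "/P", "/K": [{"/S": "/Span", "/Lang": "de-DE", "/K": 0}]}]}]}
after = {"/K": [{"/S": "/Document", "/K": [
    {"/S": "/P", "/K": [{"/S": "/Span", "/K": 0}]}]}]}

lost = set(collect_langs(before)) - set(collect_langs(after))
print(sorted(lost))
```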

(In reply to bunkem from comment #14)
> As you mentioned that you don't have access to Acrobat until this week, how did you merge the files before?  

I'm not exactly sure what you mean here. Perhaps I should have said "the next time I have access to Acrobat Pro is not until next week."

Thank you for taking a look at this issue!

Comment 16 devseppala 2023-06-21 14:17:11 UTC

Created attachment 188042 [details]
Missing language in the object properties (before and after picture)

Comment 17 devseppala 2023-06-21 14:32:06 UTC

Regarding the previous picture: in addition to the 'Language' input box on the 'Tag' tab, there is another 'Language' input box on the 'Content' tab of the 'Object properties' dialog window. These are two independent ways of marking the language of text. I believe Acrobat prefers the language marking on the 'Content' tab and supports only that one when combining PDF files.

Comment 18 bunkem 2023-06-28 17:40:51 UTC

Hi @devseppala

I apologize as I haven't had a chance to recreate your workflow yet.
However, I just saw that v7.6 includes some changes to Universal Accessibility. Please see the following link: [https://help.libreoffice.org/7.6/en-US/text/shared/01/ref_pdf_export_universal_accessibility.html]

I'll test with the new v7.6 test version and see if the issues you raised are still present.

Comment 19 devseppala 2023-06-30 09:04:45 UTC

Hi @bunkem.

Today I had some time and access to Acrobat Pro, so I tested this issue with the LibreOffice 7.6 Beta build from June 11th. Unfortunately, this issue is present in that version as well.

Please test this issue anyway, so we can have a confirmation of the bug.

Comment 20 bunkem 2023-07-21 20:01:19 UTC

Hi @devseppala,

I tried to recreate your workflow and I think this is a bug for Acrobat not LO.  Here is why.

1) I created one document with three paragraphs.  1st English (Canada), 2nd German, 3rd English (Canada).  LO shows the language properly.
2) I created a PDF with the accessibility setting enabled. The English paragraphs show no language. I'm not sure whether this is because the default language is English, so the English text doesn't need a tag, or because Acrobat doesn't show English (Canada) since it isn't a language in its list. Acrobat seems to offer only English US and English UK. I will check next whether this is the case.
The German language paragraph is tagged correctly.
3) I created another PDF with similar structure.
4) I merged using Acrobat (Cmd+i = insert). The German language tag is OK on page 1 but is stripped out on page 2.
So I believe this is a bug in Acrobat's merging of two accessible documents. LO created the tags in the first and second PDF correctly, but the tags for the second document were lost in the merged PDF.

I will submit a bug with Acrobat.

Next, I will also test whether changing the English text tag to something Acrobat accepts makes a difference.

Comment 21 bunkem 2023-07-21 20:33:47 UTC

Quick update.

I tagged the English portions English-US.  The accessibility enabled PDF created by LO shows the English-US (en-US) tag for the text.

I also created a doc with English Canada, German, English US.  Acrobat shows no tag for the English Canada text, German tag correctly and English US correctly.

So I can only conclude that the only accessibility language tags Acrobat shows in PDF documents are the ones it offers by default. The languages shown in Acrobat's drop-down are: Brazilian, Chinese, Danish, English, English UK, English US, Finnish, French, German, Italian, Japanese, Korean, Norwegian, Spanish and Swedish.

I can't check whether the paragraph in English (Canada) is tagged correctly, as I don't have another PDF reader that shows accessibility tags.

So I still believe this is an Acrobat bug, not an LO bug.

Comment 22 devseppala 2023-07-24 11:58:19 UTC

Hi @bunkem

Thank you for debugging this issue with me, here are some comments on your tests.

1. 

I think you used the "Organize Pages" tool and not the "Combine Files" tool, which I used. I have now also tested the "Organize Pages" method, and it seems that it does retain the language tags in the LO-generated PDF file that was open when the "Organize Pages" tool was started. However, language tags are removed from all the inserted PDF files that are LibreOffice generated. When inserting MS Word generated PDF files, where the language is on the "Content" tab, the language information is preserved in all cases.

2. 

I don't think the problem has to do with the predefined language selection, but with the insertion order when using the "Organize Pages" tool. The "Combine Files" tool removes LO language tagging in all cases.

3.

I also think that this is primarily an Acrobat bug, but I am very sceptical that Adobe would fix it. That is why in the first message I asked whether LO could mark the language information the same way as MS Word, where the language information is shown on the Content tab in Acrobat.

It would be great if you could file an Acrobat bug report on this issue. Thank you again for your interest and help.

By the way, I just thought I should mention that I am using Acrobat Pro 2017 and not the latest 2019 version.

Comment 23 bunkem 2023-07-24 14:20:16 UTC

Hi @devseppala,

I created a bug report for Acrobat last week.

I've been beta testing Acrobat for a long time ... I think since version 2.3. :-)  

Accessibility is quite important for Acrobat being installed in government and large organizations so I'm hoping that the report will be accepted and that they will work to fix.  If LO has been following the ISO PDF standard and Acrobat has missed something, then I strongly feel that they will address the issue. 

As I've been using Acrobat for so long, I don't use the menus very often. I typically use the keyboard shortcuts that I've learned. To answer your question, it appears that "Command+Shift+I" is both "Insert" in "Organize Pages" and "Add Files" in "Combine Files". I'll give it a try from the menus and update the Acrobat bug report.

I am running the most recent version of Acrobat rolling release.

I'll report back if I hear anything more.

Comment 24 bunkem 2023-07-26 18:00:28 UTC

@devseppala, I've tried the combine in Acrobat. Yes, Acrobat strips all the language tags when it creates the "binder" document.

I posted the bug on Jul 21 and got a notice on Jul 24 that one of the devs is investigating.  So let's cross our fingers.

I still believe this is not an LO issue. So perhaps we can close this?

Comment 25 Buovjaga 2023-07-27 04:01:39 UTC

(In reply to bunkem from comment #24)
> @devseppala, I've tried the combine in Acrobat. Yes, Acrobat strips all the
> language tags when it creates the "binder" document.
> 
> I posted the bug on Jul 21 and got a notice on Jul 24 that one of the devs
> is investigating.  So let's cross our fingers.
> 
> I still believe this not a LO issue.  So perhaps we can close this?

Yes, that was the consensus in the dev chat a couple of days ago.

Comment 26 devseppala 2023-09-25 09:32:24 UTC

@bunkem, it has now been about two months since you reported this issue to Adobe, and I was wondering whether you have received any information on whether Adobe developers are going to fix this issue on the Acrobat side.

Also, does Adobe have a public bug database where I could follow this issue on my own?

Comment 27 bunkem 2023-09-25 23:00:56 UTC

Hi.

I'll have a look this evening when I get home.

No, I don't think that Adobe has a public bug database.

B.

Comment 28 bunkem 2023-09-26 01:25:15 UTC

(In reply to bunkem from comment #27)

There is no update other than it is in development and is being investigated.

>
> I'll have a look this evening when I get home.
>

Comment 29 devseppala 2024-03-01 15:16:43 UTC

@bunkem, thank you very much for your last status update. I hate to bother you again, but would you mind checking again how Adobe is progressing on this issue?

In addition, I found a very interesting thread on an Adobe-related forum that is about this very same issue and dates from way back in 2018.
https://acrobat.uservoice.com/forums/590923-acrobat-for-windows-and-mac/suggestions/34078447-bug-language-identifiers-lost-re-post-with-attac

The thread has a very interesting ending:
'This issue was taken up to the higher management and we are not implementing it now. We have kept it in our radar and will implement it in near future.'

Comment 30 bunkem 2024-03-03 19:51:39 UTC

Hi.  I checked but can't see any update.  It has been "In Development" for ~6 months so I asked for a status.  If I get a reply, I'll post the update.