Bug 67866 - ACCESSIBILITY: Missing language information in exported PDF
Summary: ACCESSIBILITY: Missing language information in exported PDF
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version (earliest affected): 3.3.0 release
Hardware: Other All
Importance: medium enhancement
Assignee: Michael Stahl (allotropia)
URL:
Whiteboard: target:7.5.0 target:7.4.4
Keywords: accessibility, filter:pdf
Depends on:
Blocks: a11y, Accessibility PDF-Export PDF-Accessibility
Reported: 2013-08-07 13:08 UTC by Christophe Strobbe
Modified: 2024-03-08 16:23 UTC

See Also:
Crash report or crash signature:


Description Christophe Strobbe 2013-08-07 13:08:02 UTC
When exporting Writer documents to PDF, the resulting PDF file lacks language information. Language information is necessary for accessibility, esp. rendering through text-to-speech synthesis or Braille.

Steps to reproduce the issue:
1. Set the default language for documents under Tools > Options > Language Settings > Languages. (For the purpose of this test, set only the Western language, and set both Asian and CTL to none; you will need to enable "UI elements for East Asian writings" and "UI elements for Bi-Directional writing" temporarily to do that.)
2. Create a document. Notice that the document language is visible in the status bar.
3. Export the file to PDF; in the PDF Options, check "Tagged PDF" (otherwise, the document will not be fully accessible anyway).
4. Open the PDF file in Adobe Acrobat Pro or in Adobe Reader and go to the Advanced tab of the document properties: the language field is empty.

Desired behaviour:
The language field shows the default language from the source document.

Note: this issue is also present in other ODF formats, e.g. presentations made with Impress and exported to PDF. But Impress and Calc do not show language information in the status bar; see bugs 34141 and 34142.

The corresponding Apache OpenOffice bug is https://issues.apache.org/ooo/show_bug.cgi?id=49654
Comment 1 Mike Kaganski 2013-08-08 03:05:37 UTC
According to ISO 32000-1:2008 Document management -- Portable document format -- Part 1: PDF 1.7, section 14.9.2 Natural Language Specification:

"The natural language used for text in a document shall be determined in a hierarchical fashion, based on whether an optional Lang entry (PDF 1.4) is present in any of several possible locations. At the highest level, the document’s default language (which applies to both text strings and text within content streams) may be specified by a Lang entry in the document catalogue (see 7.7.2, “Document Catalog”). Below this, the language may be specified for the following items:
• Structure elements of any type (see 14.7.2, “Structure Hierarchy”), through a Lang entry in the structure element dictionary.
• Marked-content sequences that are not in the structure hierarchy (see 14.6, “Marked Content”), through a Lang entry in a property list attached to the marked-content sequence with a Span tag."

A document in LO, AFAICT, doesn't have a "default language" (a specific LO configuration does have a default language, but that is a GUI concept). Instead, different parts of the document have their own languages. The whole document may happen to use one language, but generally this is not the case. Thus, it is not always possible to determine a "document language" that should be put into the PDF document catalogue (which is displayed in the UI). But please note that there's no requirement that each hierarchy level be set. If you open the LO-generated tagged PDF in a plain text editor, you may see entries for distinct elements like "/Lang(pt-BR)" or "/Lang(ru-RU)". So a screen reader has the necessary information when reading a section.

I close this as NOTABUG, but if you feel that this is incorrect, please reopen it with a comment explaining why. Possibly it could be an enhancement request to fill the highest-level Lang entry (in the document catalogue) when there's only one language in the document.
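
The per-element "/Lang(...)" entries mentioned above can also be counted with a script instead of a text editor. The following is a minimal sketch, not a real PDF parser: it assumes the entries appear as literal strings in uncompressed objects (as described in this comment), and the function name `find_lang_entries` is made up for illustration. Entries stored in compressed object streams or written as hex strings would be missed.

```python
import re
from collections import Counter

# Matches a /Lang entry whose value is a simple PDF literal string,
# e.g. /Lang(pt-BR). Nested or escaped parentheses are not handled.
LANG_ENTRY = re.compile(rb"/Lang\s*\(([^()\\]*)\)")


def find_lang_entries(pdf_bytes: bytes) -> Counter:
    """Count /Lang values visible in the raw, uncompressed bytes of a PDF."""
    return Counter(
        match.group(1).decode("latin-1")
        for match in LANG_ENTRY.finditer(pdf_bytes)
    )


if __name__ == "__main__":
    # Hand-written fragment that only imitates structure element dictionaries.
    sample = (
        b"<</Type/StructElem/S/P/Lang(pt-BR)>>"
        b"<</Type/StructElem/S/P/Lang(ru-RU)>>"
        b"<</Type/StructElem/S/P/Lang(pt-BR)>>"
    )
    for lang, count in find_lang_entries(sample).most_common():
        print(lang, count)
```

For a real file, the bytes could be read with `open(path, "rb").read()`; an empty result then means either that no language is set or that the entries are stored in a form this sketch does not look at.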
Comment 2 V Stuart Foote 2013-09-19 12:48:21 UTC
Reopened as enhancement.

Setting as enhancement request; it is reasonable that, in efforts to produce ISO 14289-1:2012 compliant PDF/UA (see bug 45636), we also start to handle simple PDF document metadata like the default language.
Comment 3 Mike Kaganski 2013-09-19 19:57:20 UTC
(In reply to comment #2)
> reasonable that in efforts to produce ISO 14289-1:2012 compliant
> PDF/UA (see bug 45636) we also start to handle simple PDF document
> metadata like Default language.

Please be more specific. What do you propose? I cannot see the point of this support. What should be the default language of an arbitrary document?

Will it be the UI language? Or maybe the UI locale? If so, how do we handle cases when I (living in Russia) receive a document from a Japanese colleague and convert it to PDF? Should it have Russian as the default language?

Or should it be the language that is used in the document? If so, how would you select the default language for an English-Russian Dictionary of My Favorite Idioms (it would contain roughly equal amounts of both languages)?

Or should the default language be set only for those documents that contain only one language? If so, will all other PDFs be non-compliant with the standard?

Or do you suggest a new UI control to select the default language explicitly (say, in the Save As dialog)? (I myself would prefer that solution, with the program pre-filling a heuristic guess, or the last-used value in difficult cases.)
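
The "heuristic guess" option raised in the last question could look roughly like the sketch below. It is only an illustration of the idea, not how LibreOffice decides anything: the input format (pairs of language tag and text), the function name `guess_default_language`, and the 60% threshold are all assumptions. It returns a language only when one clearly dominates, and returns `None` for cases like the bilingual dictionary, where the user would have to choose.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def guess_default_language(
    runs: Iterable[Tuple[str, str]],
    min_share: float = 0.6,
) -> Optional[str]:
    """Return the dominant language tag, or None if no language dominates.

    `runs` is a sequence of (language_tag, text) pairs, one per text run.
    `min_share` is the fraction of all characters a language must cover.
    """
    totals: Counter = Counter()
    for lang, text in runs:
        totals[lang] += len(text)

    total_chars = sum(totals.values())
    if total_chars == 0:
        return None  # empty document: nothing to guess from

    lang, count = totals.most_common(1)[0]
    return lang if count / total_chars >= min_share else None


if __name__ == "__main__":
    mostly_english = [("en-US", "A long English paragraph."), ("ru-RU", "Да")]
    dictionary = [("en-US", "abcd"), ("ru-RU", "абвг")]
    print(guess_default_language(mostly_english))
    print(guess_default_language(dictionary))
```

Counting characters rather than paragraphs is itself a design choice; a real implementation might weight by words or by the language of the default paragraph style instead.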
Comment 4 V Stuart Foote 2013-09-19 23:35:01 UTC
@Mike,

Kind of an apples and oranges issue. It is not a question of what language the ODF document is prepared in (that is richly customizable in the Tools -> Options -> Language Settings -> Languages tab). Rather, it is an issue of a suitable /Lang entry being generated when the document is exported to PDF. Since this is an accessibility issue, we are dealing exclusively with output of Tagged PDF, or with meeting the more demanding PDF/UA compliance.

As you suggest, that might best be done with addition of a UI Widget in the PDF Options panel to select/override the ODF document's default language.

But in short we have a WCAG 2.0 Level A compliance issue for LibreOffice as document preparation software. See W3C WCAG Technique PDF16 (http://www.w3.org/WAI/GL/WCAG20-TECHS/PDF16.html). As is, the tagged PDFs that LibreOffice exports will not meet the current statutory accessibility requirements that are being derived from WCAG 2.0.

With the publication of ISO 14289-1:2012 for PDF/UA ("Universal Accessibility") we need to progress beyond our current limited capabilities and enhance our "Export to Tagged PDF" to more fully support PDF/UA, which requires more refined control over PDF document structure. Correctly handling the language tag is just an overdue start.

Stuart
Comment 5 V Stuart Foote 2014-11-05 05:48:38 UTC
A little QA housekeeping, enhancement left as REOPENED in error. Setting back to NEW.
Comment 6 Michael Stahl (allotropia) 2022-11-25 16:24:57 UTC
some things i found:

* HTML filter writes a lang attribute on the body element - so the same thing asked here.
  this is done in SwHTMLWriter::WriteStream()

* LO Writer PDF export also writes a /Lang into the document catalog - which was added in 2008 apparently

* Evince/Okular do not display a "language" in document properties or anywhere?

* neither does Firefox, nor Chromium

* PAC3 does display the language from the catalog for a PDF exported from LO Writer

* veraPDF doesn't mention language anywhere

* i don't have Adobe Acrobat or Acrobat Reader. my colleague who has it tells me he can't find a language there.

* Impress doesn't export a language to the catalog, and PAC3 displays "(no language)"
  - only Writer calls the function to set the language

... so i don't understand what is supposed to be wrong here in Writer - which this bug was filed against - but there is obviously something missing in non-Writer applications.
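
The difference described above (a /Lang in the document catalog for Writer exports, none for Impress) could be checked without a viewer roughly as follows. This is a rough sketch under strong assumptions: it treats the PDF as raw bytes, expects the catalog to be an uncompressed object containing "/Type/Catalog", and expects the language as a literal string. The name `catalog_language` is hypothetical. If the catalog sits in a compressed object stream, the function returns `None` even though a language may be set.

```python
import re
from typing import Optional

# An indirect object: "<num> <gen> obj ... endobj".
OBJ = re.compile(rb"\d+\s+\d+\s+obj(.*?)endobj", re.DOTALL)
CATALOG = re.compile(rb"/Type\s*/Catalog\b")
LANG = re.compile(rb"/Lang\s*\(([^()\\]*)\)")


def catalog_language(pdf_bytes: bytes) -> Optional[str]:
    """Return the /Lang value of the first uncompressed catalog object, if any."""
    for match in OBJ.finditer(pdf_bytes):
        body = match.group(1)
        if CATALOG.search(body):
            lang = LANG.search(body)
            return lang.group(1).decode("latin-1") if lang else None
    return None  # no uncompressed catalog object found


if __name__ == "__main__":
    # Hand-written fragments, not real exporter output.
    with_lang = b"1 0 obj\n<</Type/Catalog/Pages 2 0 R/Lang(en-AU)>>\nendobj\n"
    without_lang = b"1 0 obj\n<</Type/Catalog/Pages 2 0 R>>\nendobj\n"
    print(catalog_language(with_lang))
    print(catalog_language(without_lang))
```

Unlike a scan for every /Lang entry, this only looks inside the catalog object, so languages set on individual structure elements do not count.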
Comment 7 Christophe Strobbe 2022-11-25 23:26:05 UTC
(In reply to Michael Stahl (allotropia) from comment #6)
> some things i found:
> (...)
> 
> * i don't have Adobe Acrobat or Acrobat Reader. my colleague who has it
> tells me he can't find a language there.
> 
> (...)
> ... so i don't understand what is supposed to be wrong here in Writer - which
> this bug was filed against - but there is obviously something missing in
> non-Writer applications.

In Adobe Acrobat, go to File > Document Properties > Advanced, and check the Language field in the group "Reading Options". (In German: Datei > Eigenschaften > Erweitert > Leseoptionen.)

When this issue was submitted, the language field was empty. It is possible that this is no longer the case for PDF documents exported from Writer. (I no longer have access to Adobe Acrobat to verify this.)
Comment 8 Commit Notification 2022-11-28 20:08:08 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/0e4ff2261f3c2c9dada5816f11095652e028c3dd

tdf#67866 sc,sd: PDF/UA export: set language in Catalog

It will be available in 7.5.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 9 Michael Stahl (allotropia) 2022-11-28 20:11:22 UTC
fixed for Calc and Impress too
Comment 10 Commit Notification 2022-11-29 09:41:02 UTC
Michael Stahl committed a patch related to this issue.
It has been pushed to "libreoffice-7-4":

https://git.libreoffice.org/core/commit/5afdc179c52d4a71def825eaeb9deef53d57942d

tdf#67866 sc,sd: PDF/UA export: set language in Catalog

It will be available in 7.4.4.

Comment 11 Commit Notification 2022-11-30 00:28:22 UTC
Eike Rathke committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/commit/98d8adc5b377039d5dee0d5046ece721010a960c

Unnecessary to convert to locale, tdf#67866 follow-up

It will be available in 7.5.0.

Comment 12 Stéphane Guillou (stragu) 2022-12-28 12:41:35 UTC
Fix verified in Draw for PDF/UA export, with:

Version: 7.5.0.1 (X86_64) / LibreOffice Community
Build ID: 77cd3d7ad4445740a0c6cf977992dafd8ebad8df
CPU threads: 8; OS: Linux 5.15; UI render: default; VCL: gtk3
Locale: en-AU (en_AU.UTF-8); UI: en-US
Calc: threaded

Thanks Michael!