Bug 67866 - ACCESSIBILITY: Missing language information in exported PDF
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Printing and PDF export
Version: 3.3.0 release (earliest affected)
Hardware: Other All
Importance: medium enhancement
Assignee: Not Assigned
Keywords: accessibility, filter:pdf
Depends on:
Blocks: a11y PDF-Export PDF-Accessibility
Reported: 2013-08-07 13:08 UTC by Christophe Strobbe
Modified: 2021-01-19 20:24 UTC

Description Christophe Strobbe 2013-08-07 13:08:02 UTC
When exporting Writer documents to PDF, the resulting PDF file lacks language information. Language information is necessary for accessibility, esp. rendering through text-to-speech synthesis or Braille.

Steps to reproduce the issue:
1. Set the default language for documents under Tools > Options > Language Settings > Languages. (For the purpose of this test, set only the Western language, and set both Asian and CTL to none; you will need to enable "UI elements for East Asian writings" and "UI elements for Bi-Directional writing" temporarily to do that.)
2. Create a document. Notice that the document language is visible in the status bar.
3. Export the file to PDF; in the PDF Options, check "Tagged PDF" (otherwise, the document will not be fully accessible anyway).
4. Open the PDF file in Adobe Acrobat Pro or in Adobe Reader and go to the Advanced tab of the document properties: the language field is empty.

Desired behaviour:
The language field shows the default language from the source document.

Note: this issue is also present in other ODF formats, e.g. presentations made with Impress and exported to PDF. But Impress and Calc do not show language information in the status bar; see bugs 34141 and 34142.

The corresponding Apache OpenOffice bug is https://issues.apache.org/ooo/show_bug.cgi?id=49654
Comment 1 Mike Kaganski 2013-08-08 03:05:37 UTC
According to ISO 32000-1:2008 Document management -- Portable document format -- Part 1: PDF 1.7, section 14.9.2 Natural Language Specification:

"The natural language used for text in a document shall be determined in a hierarchical fashion, based on whether an optional Lang entry (PDF 1.4) is present in any of several possible locations. At the highest level, the document’s default language (which applies to both text strings and text within content streams) may be specified by a Lang entry in the document catalogue (see 7.7.2, “Document Catalog”). Below this, the language may be specified for the following items:
•Structure elements of any type (see 14.7.2, “Structure Hierarchy”), through a Lang entry in the structure element dictionary.
•Marked-content sequences that are not in the structure hierarchy (see 14.6, “Marked Content”), through a Lang entry in a property list attached to the marked-content sequence with a Span tag."
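
The hierarchy quoted above can be sketched as a lookup that walks from an element up to the document catalog. This is only an illustration: StructElem and effective_lang are hypothetical names, not part of any PDF library, and the sketch assumes that an element without its own Lang entry takes the language of its nearest ancestor, falling back to the catalog entry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StructElem:
    """Hypothetical stand-in for a PDF structure element dictionary."""
    tag: str
    lang: Optional[str] = None              # the optional Lang entry
    parent: Optional["StructElem"] = None   # None for the structure tree root


def effective_lang(elem: Optional[StructElem],
                   catalog_lang: Optional[str]) -> Optional[str]:
    """Return the nearest Lang entry, walking up towards the catalog."""
    while elem is not None:
        if elem.lang:
            return elem.lang
        elem = elem.parent
    # No structure element set a language: use the catalog default, if any.
    return catalog_lang


# Small example tree: a Russian paragraph inside an otherwise untagged document.
doc = StructElem("Document")
para = StructElem("P", lang="ru-RU", parent=doc)
span = StructElem("Span", parent=para)
heading = StructElem("H1", parent=doc)
```

In this model the span resolves to ru-RU through its parent, while the heading has a language only if the catalog supplies one.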

A document in LO, AFAICT, doesn't have a "default language" (a specific LO configuration does have a default language, but this is a GUI concept). Instead, different parts of the document have their own languages. It may happen that the whole document has one language, but generally this is not the case. Thus, it is not always possible to determine a "document language" that should be put into the PDF document catalogue (which is displayed in the UI). But please note that there's no requirement that each hierarchy level be set. If you open the LO-generated tagged PDF in a plain text editor, you may see that there are entries for distinct elements like "/Lang(pt-BR)" or "/Lang(ru-RU)". So a screen reader has the necessary information when reading a section.
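
The text-editor check can also be scripted. The following is a rough sketch, not a PDF parser: it only finds Lang entries that are written as plain literal strings in the file, so entries inside compressed object streams or written as hex strings would be missed. The function name lang_entries and the sample bytes are made up for illustration.

```python
import re
from collections import Counter

# Matches literal-string Lang entries such as /Lang(pt-BR) or /Lang (ru-RU).
LANG_RE = re.compile(rb"/Lang\s*\(([A-Za-z0-9-]+)\)")


def lang_entries(pdf_bytes: bytes) -> Counter:
    """Count the Lang values that appear as plain text in the file."""
    return Counter(m.decode("ascii") for m in LANG_RE.findall(pdf_bytes))


# Made-up fragment imitating what a tagged PDF may contain.
sample = (
    b"<</Type/StructElem/S/P/Lang(pt-BR)>>\n"
    b"<</Type/StructElem/S/P/Lang(ru-RU)>>\n"
    b"<</S/Span/Lang (ru-RU)>>"
)
counts = lang_entries(sample)
```

For a real file, the bytes would come from reading the exported PDF in binary mode and passing them to lang_entries.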

I close this as NOTABUG, but if you feel that this is incorrect, please reopen it with a comment explaining why. Possibly it could be an enhancement request to fill the highest-level Lang entry (in the document catalogue) when there's only one language in the document.
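
The "only one language" rule could look roughly like the sketch below. It is not LibreOffice code: catalog_lang_if_uniform is a hypothetical helper, and it assumes the exporter can already list the language tag of every text portion. Tags are compared as-is, so "en-US" and "en-us" would count as different.

```python
from typing import Iterable, Optional


def catalog_lang_if_uniform(langs: Iterable[str]) -> Optional[str]:
    """Return the single language used in the document, or None if mixed."""
    distinct = {lang for lang in langs if lang}  # ignore empty tags
    if len(distinct) == 1:
        return next(iter(distinct))
    # Zero or several languages: leave the catalog Lang entry unset.
    return None
```

With this rule a purely English document would get a catalog entry, while a bilingual one would keep only its per-element entries.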
Comment 2 V Stuart Foote 2013-09-19 12:48:21 UTC
Reopened as enhancement.

Setting as an enhancement request. It is reasonable that, in efforts to produce ISO 14289-1:2012 compliant PDF/UA (see bug 45636), we also start to handle simple PDF document metadata such as the default language.
Comment 3 Mike Kaganski 2013-09-19 19:57:20 UTC
(In reply to comment #2)
> reasonable that in efforts to produce ISO 14289-1.2012 compliant
> PDF/UA ( see bug 45636 )we also start to handle simple PDF document
> metadata like Default language.

Please be more specific. What do you propose? I cannot see the point of this support. What should be the default language of an arbitrary document?

Will it be the UI language? Or maybe the UI locale? If so, how should we handle the case where I (living in Russia) receive a document from a Japanese colleague and convert it to PDF? Should it have Russian as its default language?

Or should it be the language that is used in the document? If so, how would you select the default language for an English-Russian Dictionary of My Favorite Idioms (it would contain roughly equal amounts of both languages)?

Or should the default language be set only for those documents that contain only one language? If so, will all other PDFs be non-compliant with the standard?

Or do you suggest a new UI control to select the default language explicitly (say, in the Save As dialog)? (I myself would prefer that solution, with the program first making a heuristic guess, or offering the last-used value in difficult cases.)
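
One way such a heuristic guess could work is sketched below, purely as an illustration. The function guess_default_lang, the per-language character counts and the 0.8 dominance threshold are all assumptions, not anything LibreOffice implements. The result would only pre-select a value in the dialog; the user could still override it.

```python
from typing import Mapping, Optional


def guess_default_lang(
    chars_per_lang: Mapping[str, int],
    last_used: Optional[str] = None,
    dominance: float = 0.8,  # assumed threshold for a "clear" majority
) -> Optional[str]:
    """Propose a default language for the export dialog to pre-select."""
    total = sum(chars_per_lang.values())
    if total > 0:
        lang, count = max(chars_per_lang.items(), key=lambda kv: kv[1])
        if count / total >= dominance:
            return lang
    # No clearly dominant language (or an empty document):
    # fall back to the value the user chose last time, if any.
    return last_used
```

For the Japanese document opened in a Russian locale, the guess follows the text rather than the UI; for the half-English, half-Russian dictionary, it falls back to the last-used value.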
Comment 4 V Stuart Foote 2013-09-19 23:35:01 UTC

This is something of an apples-and-oranges issue. It is not a question of what language the ODF document is prepared in (that is richly customizable under Tools > Options > Language Settings > Languages). Rather, it is a question of a suitable /Lang entry being generated when the document is exported to PDF. Since this is an accessibility issue, we are dealing exclusively with the output of Tagged PDF, or with meeting the more demanding PDF/UA requirements.

As you suggest, that might best be done by adding a UI widget to the PDF Options panel to select or override the ODF document's default language.

But in short, we have a WCAG 2.0 Level A compliance issue for LibreOffice as document preparation software. See W3C WCAG Technique PDF16 (http://www.w3.org/WAI/GL/WCAG20-TECHS/PDF16.html). As is, our exported tagged PDFs will not meet the current statutory accessibility requirements that are being derived from WCAG 2.0.
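
For reference, the entry in question is a Lang key in the document catalog dictionary. The sketch below only shows the shape of such a dictionary as a string. The helper catalog_dict and the object reference "2 0 R" are made up for illustration, and the tag check is a rough sanity test rather than a full BCP 47 validator.

```python
import re


def catalog_dict(pages_ref: str, lang: str) -> str:
    """Serialise a minimal document catalog that carries a default language."""
    # Rough sanity check only: letters first, then hyphen-separated subtags.
    if not re.fullmatch(r"[A-Za-z]{2,3}(-[A-Za-z0-9]{1,8})*", lang):
        raise ValueError(f"unexpected language tag: {lang!r}")
    return f"<< /Type /Catalog /Pages {pages_ref} /Lang ({lang}) >>"


catalog = catalog_dict("2 0 R", "en-US")
```

A real exporter would write this as part of its object serialisation; the point here is only that one extra key in the catalog fills the language field.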

With the publication of ISO 14289-1:2012 for PDF/UA ("Universal Accessibility"), we need to progress beyond our current limited capabilities and enhance "Export to Tagged PDF" to support PDF/UA more fully, which requires more refined control over PDF document structure. Correctly handling the language tag is just an overdue start.

Comment 5 V Stuart Foote 2014-11-05 05:48:38 UTC
A little QA housekeeping: the enhancement was left as REOPENED in error. Setting back to NEW.