Bug 90904 - FILESAVE: OOXML export is missing document statistics
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: filters and storage
Version (earliest affected): Inherited From OOo
Hardware: All All
Importance: medium normal
Assignee: alexey.chemichev
URL:
Whiteboard: ToBeReviewed
Keywords: difficultyBeginner, easyHack, filter:ooxml, skillCpp
Depends on:
Blocks:
 
Reported: 2015-04-28 07:02 UTC by Dan
Modified: 2017-02-14 08:57 UTC

See Also:
Crash report or crash signature:


Attachments
Windows Explorer screenshot of metadata 2007 and LO DOCX file (20.59 KB, image/png)
2015-04-28 07:02 UTC, Dan

Description Dan 2015-04-28 07:02:57 UTC
Created attachment 115149
Windows Explorer screenshot of metadata 2007 and LO DOCX file

DOCX files saved using LO are missing standard metadata (page, word, character, line, and paragraph counts).

A screenshot is attached of Windows Explorer comparing files created in Word 2007 and LibreOffice 4.4.2.2. I am also seeing the same results when using Apache Tika 1.8 to pull the metadata.
Comment 1 raal 2015-04-28 10:40:17 UTC
I can confirm with LO 4.4.2 on Windows 7.
Comment 2 Michael Stahl (CIB) 2015-04-28 11:23:49 UTC
it should be quite easy to add this, marking as easy-hack.

the document properties are exported to OOXML here:

oox/source/core/xmlfilterbase.cxx:XmlFilterBase& XmlFilterBase::exportDocumentProperties( Reference< XDocumentProperties > xProperties )

all that is missing is getting the XDocumentProperties::getDocumentStatistics()
and converting that to XML elements or attributes.

in ECMA-376 3rd edition the definition of the "Extended File Properties"
elements starts on page 4254, "22.2.2.1 Application" up to 
"22.2.2.28 Words (Word Count)".

http://www.ecma-international.org/publications/standards/Ecma-376.htm
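
The conversion described above can be sketched as follows. This is a minimal, standalone illustration rather than the code that landed in LibreOffice: the `DocumentStatistics` alias and the function `statisticsToExtendedProperties` are hypothetical, the statistics are modelled as a plain `std::map` instead of the UNO name/value sequence that `getDocumentStatistics()` returns, and the mapping from statistic names to OOXML element names is an assumption.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the name/value pairs returned by
// XDocumentProperties::getDocumentStatistics(); simplified to a std::map.
using DocumentStatistics = std::map<std::string, int>;

// Hypothetical helper: writes one Extended File Properties element per
// statistic that has an (assumed) OOXML counterpart.
std::string statisticsToExtendedProperties(const DocumentStatistics& stats)
{
    // Assumed mapping: statistic name -> OOXML element name.
    static const std::vector<std::pair<std::string, std::string>> mapping = {
        {"PageCount", "Pages"},
        {"WordCount", "Words"},
        {"NonWhitespaceCharacterCount", "Characters"},
        {"CharacterCount", "CharactersWithSpaces"},
        {"ParagraphCount", "Paragraphs"},
    };

    std::ostringstream xml;
    for (const auto& entry : mapping)
    {
        const auto it = stats.find(entry.first);
        if (it == stats.end())
            continue; // statistic not provided: omit the element entirely
        xml << '<' << entry.second << '>' << it->second
            << "</" << entry.second << ">\n";
    }
    return xml.str();
}
```

Elements are emitted in the order of the mapping table, and a statistic that the document does not provide is skipped instead of being written as zero.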
Comment 3 Dan 2015-04-29 06:22:58 UTC
Thank you for your efforts on this, guys :) The fast response is a pleasant surprise.

If I can be of any further help, please don't hesitate to let me know.
Comment 4 Yousuf Philips (jay) (retired) 2015-05-02 09:13:24 UTC
I'd assume this is a duplicate of bug 89775.
Comment 5 alexey.chemichev 2015-11-18 15:16:49 UTC
Hi, guys.
I see no statistics counter for lines of text,

neither here:
    sw/source/filter/xml/xmlmeta.cxx: statistic s_stats []

nor here:
    sw/inc/docstat.hxx: SW_DLLPUBLIC SwDocStat

I could quickly submit a simple patch for
    Pages / Word count / Character count
(Paragraph count is already exposed in MS Properties Explorer)
Comment 6 Commit Notification 2015-11-18 19:40:53 UTC
alexey.chemichev committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=8beea0f6b43b9fe893418687a75d28a6d624ede7

tdf#90904 DOCX export metadata for "Pages", "Word count", "Character count"

It will be available in 5.1.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 7 Commit Notification 2015-11-19 14:46:10 UTC
alexey.chemichev committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=24346dc6630471da65a2c19d767cb9deed73405a

tdf#90904 Sorry, mixed Characters and CharactersWithSpaces at a first time

It will be available in 5.1.0.

Comment 8 alexey.chemichev 2015-11-20 16:18:55 UTC
Trying to define the scope (see also tdf#89775)...

ECMA describes 28 Extended Properties:

+   01. Application (Application Name)
    02. AppVersion (Application Version)
+   03. Characters (Total Number of Characters)
+   04. CharactersWithSpaces (Number of Characters (With Spaces))
    05. Company (Name of Company)
    06. DigSig (Digital Signature)
    07. DocSecurity (Document Security)
    08. HeadingPairs (Heading Pairs)
    09. HiddenSlides (Number of Hidden Slides)
    10. HLinks (Hyperlink List)
    11. HyperlinkBase (Relative Hyperlink Base)
    12. HyperlinksChanged (Hyperlinks Changed)
    13. Lines (Number of Lines)
    14. LinksUpToDate (Links Up-to-Date)
    15. Manager (Name of Manager)
    16. MMClips (Total Number of Multimedia Clips)
    17. Notes (Number of Slides Containing Notes)
+   18. Pages (Total Number of Pages)
+   19. Paragraphs (Total Number of Paragraphs)
    20. PresentationFormat (Intended Format of Presentation)
    21. Properties (Application Specific File Properties)
    22. ScaleCrop (Thumbnail Display Mode)
    23. SharedDoc (Shared Document)
    24. Slides (Slides Metadata Element)
+   25. Template (Name of Document Template)
    26. TitlesOfParts (Part Titles)
+   27. TotalTime (Total Edit Time Metadata Element)
+   28. Words (Word Count)

Could someone please help mark the properties that are actually valid for LO and are present (or can be calculated) in the codebase?
Comment 11 Michael Stahl (CIB) 2016-02-19 22:46:57 UTC
oops, missed that bugzilla mail...

"AppVersion" would sound obvious but iirc i tried to add that once
and found that it really is "Microsoft Office version" - if the
version number isn't formatted exactly like MSO version numbers
are then MSO will complain that the document is invalid.
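
A format check along these lines could guard an "AppVersion" export. It is only a sketch: the function name `looksLikeMsoAppVersion` is hypothetical, and the accepted shape (two digits, a dot, four digits, i.e. the XX.YYYY form from the ECMA-376 description of AppVersion) is an assumption; the rules MSO actually applies are not confirmed in this thread.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true if the string has the assumed
// "XX.YYYY" shape (two digits, a dot, four digits), e.g. "12.0000".
bool looksLikeMsoAppVersion(const std::string& version)
{
    static const std::regex pattern("[0-9]{2}\\.[0-9]{4}");
    // regex_match requires the whole string to match, not just a prefix.
    return std::regex_match(version, pattern);
}
```

Under that assumption, a four-part version string such as "5.1.0.3" would be rejected, so it would have to be mapped to the expected form or left out of the export.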

(also i'm surprised that the "HLinks" anachronism still exists)

one would think that Impress would have a SlideCount statistic
but apparently it doesn't.

so i think we're done here for now, nothing easily implemented left,
thanks Alexey.