Bug 124141 - Create a document analyser for LibreOffice triage and QA
Summary: Create a document analyser for LibreOffice triage and QA
Status: CLOSED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version: Inherited From OOo (earliest affected)
Hardware: All All
Importance: medium normal
Assignee: wingednova
URL:
Whiteboard: target:7.3.0
Keywords: difficultyMedium, easyHack, skillJava, skillPython, skillScript, skillUno, topicQA
Depends on:
Blocks:
 
Reported: 2019-03-17 22:29 UTC by Björn Michaelsen
Modified: 2023-11-21 08:01 UTC
CC List: 8 users

See Also:
Crash report or crash signature:


Attachments
The attachment contains my easy hack to the document analyser for a LibreOffice document (777 bytes, text/plain)
2019-04-09 16:25 UTC, ipshii1609
Script for counting elements in *.odt documents (1.29 KB, text/x-python)
2020-03-30 18:13 UTC, Sebastian O.
Document analyser (2.24 KB, text/x-python)
2020-12-12 08:12 UTC, wingednova
document analyser (modified) (2.26 KB, text/x-python)
2020-12-12 14:03 UTC, wingednova
document analyser (modified) (2.56 KB, text/x-python)
2020-12-22 13:13 UTC, wingednova

Description Björn Michaelsen 2019-03-17 22:29:43 UTC
Description:
Issues, especially performance issues, often arise only with specific documents. When triaging them, it would help LibreOffice QA and developers to have an overview of what might be special about a given document.

This EasyHack is to create a script or LibreOffice extension that provides statistics about a document. For a text document, these could include, for example:

- number of paragraphs
- number of pages
- number of images/embedded media
- number of tracked changes (redlines)
- number of styles, bookmarks, tables, indexes, text frames, OLE objects, sections, hyperlinks, references, comments ...

The extension should produce its output as plain text, so that it can easily be copied and pasted into a bug report. For other document types, other information might be relevant. To keep the scope simple, it is fine to start with basic numbers for text documents.
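
A minimal sketch of the kind of script described above, assuming only the Python standard library and the standard ODF namespaces. It counts raw elements in content.xml, so the numbers can differ from what Writer itself reports (for example, headings are text:h rather than text:p, and paragraphs inside tables or frames are included). The names count_elements, format_report and COUNTED are hypothetical and not taken from any attachment on this bug.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard ODF namespace URIs.
NS = {
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "draw": "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0",
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
}

# Report label -> (namespace prefix, element local name).
# This selection is an assumption; extend it as needed.
COUNTED = {
    "Paragraphs": ("text", "p"),
    "Tables": ("table", "table"),
    "Images": ("draw", "image"),
    "Text frames": ("draw", "text-box"),
    "Sections": ("text", "section"),
    "Hyperlinks": ("text", "a"),
    "Comments": ("office", "annotation"),
    "Tracked changes": ("text", "changed-region"),
}


def count_elements(odt):
    """Return {label: count} for an .odt given as a path or binary file object."""
    with zipfile.ZipFile(odt) as package:
        root = ET.fromstring(package.read("content.xml"))
    counts = {}
    for label, (prefix, name) in COUNTED.items():
        tag = "{%s}%s" % (NS[prefix], name)
        counts[label] = sum(1 for _ in root.iter(tag))
    return counts


def format_report(counts):
    """Render the counts as plain text that can be pasted into a bug report."""
    return "\n".join("%s: %d" % (label, n) for label, n in counts.items())
```

Calling print(format_report(count_elements("example.odt"))) on a hypothetical file example.odt would print one "label: count" line per entry in COUNTED.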


Comment 1 Anuj Agrawal 2019-03-19 13:25:36 UTC
Hi,
I'm Anuj Agrawal. I'd like to work on this issue. Could you please elaborate on the format of the text output you want the script to generate?
Comment 2 ipshii1609 2019-04-09 16:25:28 UTC
Created attachment 150623 [details]
The attachment contains my easy hack to the document analyser for a LibreOffice document
Comment 3 Pankaj Kumar 2019-04-23 07:55:03 UTC
Hi Anuj Agrawal,
Any update on the bug?
Comment 4 Ebrain Mirambeau 2019-07-31 04:33:44 UTC
I took a look at ipshii1609@gmail.com's solution and I think that it's both incomplete and written in Python. I would like to work on it. Any objections?
Comment 5 gs_1001 2020-01-10 17:12:50 UTC
Can somebody please provide an update on this bug? One user did provide a script written in Python. Is this bug still open? If so, please comment with reference to that script.
Comment 6 Piya 2020-01-20 15:10:47 UTC
Hi,
I would like to start working on this bug. Wish me luck!
Piya
Comment 7 Buovjaga 2020-03-14 19:39:12 UTC
It seems Piya abandoned this, so unassigning.
Comment 8 Sebastian O. 2020-03-30 18:13:50 UTC
Created attachment 159166 [details]
Script for counting elements in *.odt documents

Hello everyone!

I fixed some errors in the script from ipshii1609@gmail.com:
- No counting of tables
- No counting of images

I also rewrote it to make future additions easier.

To be done:

- Adding more categories
- Fixing page counting when the document contains manual page breaks, or finding a proper way to count pages.

Will try to add more stuff in the near future.

Greetings
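
One possible way to get a page count without computing it from the content is to read the statistics that the application stored in meta.xml when the document was last saved. The sketch below assumes the standard meta:document-statistic element and uses only the Python standard library; saved_statistics is a hypothetical name. The values reflect the state at the last save and may be missing or stale.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard ODF meta namespace URI.
META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"


def saved_statistics(odf_file):
    """Return the meta:document-statistic attributes as {name: int}.

    odf_file is a path or a binary file object. An empty dict is
    returned when the document carries no statistics element.
    """
    with zipfile.ZipFile(odf_file) as package:
        root = ET.fromstring(package.read("meta.xml"))
    stat = root.find(".//{%s}document-statistic" % META_NS)
    if stat is None:
        return {}
    prefix = "{%s}" % META_NS
    return {
        key[len(prefix):]: int(value)
        for key, value in stat.attrib.items()
        if key.startswith(prefix)
    }
```

For a document saved by Writer, saved_statistics(path).get("page-count") would then give the stored page count, or None if it is absent.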
Comment 9 wingednova 2020-12-12 08:12:41 UTC
Created attachment 168089 [details]
Document analyser

Here is my attempt:

I modified Sebastian's function and extended the script to include the remaining document statistics. In total, the script outputs: bookmark count, cell count, changetracking count, character count, comment count, draw count, frame count, hyperlink count, image count, non-whitespace character count, object count, OLE object count, page count, paragraph count, row count, sentence count, syllable count, table count, textbox count, word count, and paragraph styles. 
Additionally, the script can be run on files other than *.odt.

Please let me know what I can do to extend this.
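
Since the script is meant to accept more than *.odt files, it may help to report which kind of ODF document was loaded. A minimal sketch, assuming the package contains the usual "mimetype" entry; odf_document_type and ODF_TYPES are hypothetical names, and the table lists only a few common media types.

```python
import zipfile

# A few common ODF media types; this table is illustrative, not exhaustive.
ODF_TYPES = {
    "application/vnd.oasis.opendocument.text": "text document",
    "application/vnd.oasis.opendocument.spreadsheet": "spreadsheet",
    "application/vnd.oasis.opendocument.presentation": "presentation",
    "application/vnd.oasis.opendocument.graphics": "drawing",
}


def odf_document_type(odf_file):
    """Return a short description of the ODF package type.

    odf_file is a path or a binary file object.
    """
    with zipfile.ZipFile(odf_file) as package:
        if "mimetype" not in package.namelist():
            return "unknown"
        mimetype = package.read("mimetype").decode("ascii", "replace").strip()
    return ODF_TYPES.get(mimetype, "unknown (%s)" % mimetype)
```

The result could be printed as the first line of the report, so that readers of a bug report can tell which statistics are meaningful for that document type.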
Comment 10 wingednova 2020-12-12 14:00:32 UTC
Comment on attachment 168089 [details]
Document analyser

>"""
>Document analyser uses the odfpy module: https://pypi.org/project/odfpy/
>
>This script prints:
>bookmark count, cell count, changetracking count, character count, 
>comment count, draw count, frame count, hyperlink count, 
>image count, non-whitespace character count, object count, OLE object count, 
>page count, paragraph count, row count, sentence count, 
>syllable count, table count, textbox count, word count, and paragraph styles.
>
>"""
>
>from odf.opendocument import load
>from odf import text,meta,office,draw
>
>print("Enter filename: ")
>filename=input()
>
>doc=load(filename)
>
>print("\nDOCUMENT STATISTICS\n")
>for stat in doc.getElementsByType(meta.DocumentStatistic):
>	print("Cell count:",stat.getAttribute('cellcount'))
>	print("Character count:",stat.getAttribute('charactercount'))
>	print("Draw count:",stat.getAttribute('drawcount'))
>	print("Frame count:",stat.getAttribute('framecount'))	
>	print("Image count:",stat.getAttribute('imagecount'))
>	print("Non-whitespace character count:",stat.getAttribute('nonwhitespacecharactercount'))
>	print("Object count:",stat.getAttribute('objectcount'))
>	print("Object linking and embedding (OLE) object count:",stat.getAttribute('oleobjectcount'))
>	print("Page count:",stat.getAttribute('pagecount'))
>	print("Paragraph count:",stat.getAttribute('paragraphcount'))
>	print("Row count:",stat.getAttribute('rowcount'))
>	print("Sentence count:",stat.getAttribute('sentencecount'))
>	print("Syllable count:",stat.getAttribute('syllablecount'))
>	print("Table count:",stat.getAttribute('tablecount'))
>	print("Word count:",stat.getAttribute('wordcount'))
>
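>#type counter for elements not covered by odf.meta.DocumentStatistic
>def type_counter(doc,element_type):
>	#element_type is an odfpy element factory such as text.Bookmark
>	#(renamed from "type" to avoid shadowing the builtin)
>	return sum(1 for element in doc.getElementsByType(element_type))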
>
>types={
>	'Bookmark':text.Bookmark,
>	#text:changed-region wraps every tracked insertion, deletion and format change
>	'Changetracking':text.ChangedRegion,
>	'Comment':office.Annotation,
>	'Hyperlink':text.A,
>	'Textbox':draw.TextBox
>}
>
>for key,value in types.items():
>	print(key,'count:',type_counter(doc,value))
>
>def paragraph_style(doc):
>	i = 1
>	for paragraph in doc.getElementsByType(text.P):
>		print('Paragraph',i,'style:',paragraph.getAttribute('stylename'))
>		i+=1
>
>paragraph_style(doc)
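
For long documents, printing one line per paragraph can bury the other numbers. A possible alternative is to summarise how often each paragraph style is used. This is a sketch only: summarise_styles is a hypothetical helper, and it assumes the style names were collected beforehand, for example with the getAttribute('stylename') call used in the script above.

```python
from collections import Counter


def summarise_styles(style_names):
    """Return 'style: count' lines, most frequently used style first.

    style_names is any iterable of style names; None or empty names
    are grouped under "(no style)".
    """
    counts = Counter(name if name else "(no style)" for name in style_names)
    return ["%s: %d" % (style, n) for style, n in counts.most_common()]


# With odfpy, the names could be gathered like this (assumption, not run here):
#   names = [p.getAttribute('stylename') for p in doc.getElementsByType(text.P)]
#   print("\n".join(summarise_styles(names)))
```

This keeps the report short enough to paste into a bug comment while still showing which styles dominate the document.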
Comment 11 wingednova 2020-12-12 14:03:03 UTC
Created attachment 168100 [details]
document analyser (modified)
Comment 12 wingednova 2020-12-22 13:13:50 UTC
Created attachment 168410 [details]
document analyser (modified)

Cleaned up indentation a bit.
Comment 13 Xisco Faulí 2021-01-08 18:18:29 UTC
(In reply to wingednova from comment #12)
> Created attachment 168410 [details]
> document analyser (modified)
> 
> Cleaned up indentation a bit.

Hello wingednova,
thanks for working on this.
I think the script should be in the dev-tools repository < https://gerrit.libreoffice.org/admin/repos/dev-tools >; there is a QA folder in there. I can submit the script to the repository on your behalf if you don't want to do it yourself, but first we need the licence statement < https://wiki.documentfoundation.org/Development/gerrit/SubmitPatch#Add_yourself_to_the_contributor_list >.
Could you please send it to the dev mailing list as described in the previous link?
Comment 14 wingednova 2021-01-09 08:55:49 UTC
(In reply to Xisco Faulí from comment #13)
> (In reply to wingednova from comment #12)
> > Created attachment 168410 [details]
> > document analyser (modified)
> > 
> > Cleaned up indentation a bit.
> 
> Hello wingednova,
> thanks for working on this.
> I think the script should be in the dev-tools repository <
> https://gerrit.libreoffice.org/admin/repos/dev-tools >; there is a QA folder
> in there. I can submit the script to the repository on your behalf if you
> don't want to do it yourself, but first we need the licence statement <
> https://wiki.documentfoundation.org/Development/gerrit/
> SubmitPatch#Add_yourself_to_the_contributor_list >.
> Could you please send it to the dev mailing list as described in the
> previous link?

Hello, I'm still finding my way around Gerrit, so I would be super grateful if you could submit this for me. I have sent my license statement to the mailing list. Thank you for your help!
Comment 15 Buovjaga 2021-04-04 10:23:21 UTC
I submitted the patch to Gerrit with Ahlaam as the author and Sebastian as co-author:
https://gerrit.libreoffice.org/c/dev-tools/+/113567

Sebastian: could you send a license statement to the dev list: https://wiki.documentfoundation.org/Development/GetInvolved#License_statement
Comment 16 Commit Notification 2021-09-17 11:25:45 UTC
Ahlaam Rafiq committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/dev-tools/commit/71ffc7eba9137e94a96b72fed762cc1c9a82baeb

tdf#124141 add document analyser
Comment 17 Xisco Faulí 2021-09-24 12:04:27 UTC
Now that the UNO object inspector is included from LibreOffice 7.2 on, I'm wondering if the script is still useful?
Comment 18 Buovjaga 2021-09-24 13:18:15 UTC
(In reply to Xisco Faulí from comment #17)
> Now that the UNO object inspector is included from LibreOffice 7.2 on,
> I'm wondering if the script is still useful?

I don't see the inspector providing such statistics nor having an ability to print a report. Or do you think these features should be added to it?
Comment 19 Xisco Faulí 2021-09-24 13:40:22 UTC
(In reply to Buovjaga from comment #18)
> (In reply to Xisco Faulí from comment #17)
> > Now that the UNO object inspector is included from LibreOffice 7.2 on,
> > I'm wondering if the script is still useful?
> 
> I don't see the inspector providing such statistics nor having an ability to
> print a report. Or do you think these features should be added to it?

On the left, in the Object box, you can see all the elements of the document and then explore them. It also works with any kind of document, whereas the document_analyser only works with ODF text documents.
Comment 20 Buovjaga 2021-09-24 13:50:45 UTC
(In reply to Xisco Faulí from comment #19)
> (In reply to Buovjaga from comment #18)
> > (In reply to Xisco Faulí from comment #17)
> > > Now that the UNO object inspector is included from LibreOffice 7.2 on,
> > > I'm wondering if the script is still useful?
> > 
> > I don't see the inspector providing such statistics nor having an ability to
> > print a report. Or do you think these features should be added to it?
> 
> On the left, in the Object box, you can see all the elements of the document
> and then explore them. It also works with any kind of document, whereas the
> document_analyser only works with ODF text documents.

Yes, but it has no statistics to copy and paste into bug reports, which was Björn's idea.

I see you created bug 142373 for exporting info, but it would still need a statistics feature.