Bug 79845 - Mac OS X mdimporter for .odt files with tabulations doesn't import all text
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: LibreOffice
Version: 4.2.4.2 release (earliest affected)
Hardware: x86-64 (AMD64) macOS (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-06-09 17:25 UTC by Dave Huang
Modified: 2022-12-21 12:53 UTC (History)

See Also:
Crash report or crash signature:


Attachments
Text after the tab ("Column two text") does not get indexed (9.79 KB, application/vnd.oasis.opendocument.text)
2014-06-09 17:25 UTC, Dave Huang
Keep collecting text until the ending <text:p> (829 bytes, patch)
2018-09-14 22:45 UTC, Dave Huang

Description Dave Huang 2014-06-09 17:25:34 UTC
Created attachment 100760
Text after the tab ("Column two text") does not get indexed

Mac OS X 10.9.3
LibreOffice Version: 4.2.4.2 (64-bit)
Build ID: 63150712c6d317d27ce2db16eb94c2f3d7b699f8

It seems that the mdimporter is skipping text on a line after a tab character. E.g., if I create a new Text document with the contents "Column one text[tab]Column two text[paragraph]Some other text" (see attached document), then run:

mdimport -d 2 -g "/Applications/LibreOffice.app/Contents/Library/Spotlight/OOoSpotlightImporter.mdimporter" mdimporter_test.odt

I get kMDItemTextContent = "Column one text Some other text";, and a Spotlight search for "Column two" does not find the document.

However, if I do:

mdimport -d 2 -g "/System/Library/Spotlight/RichText.mdimporter" mdimporter_test.odt

It says kMDItemTextContent = "Column one text\tColumn two text\nSome other text\n\n";, and a search for "Column two" does find the document.
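To check what text the document actually contains without going through Spotlight, something like the following could be used. It is only a rough sketch: it assumes the .odt is a zip archive with a content.xml member, that paragraphs are text:p elements and tabs are text:tab elements in the usual OpenDocument text namespace, and that paragraphs are not nested inside one another. The helper names (odt_paragraphs, build_sample_odt) and the in-memory sample are made up for illustration; the sample is a minimal stand-in, not the attached file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace URIs assumed from the OpenDocument format.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def _collect(elem, out):
    """Append the text of elem and its descendants; <text:tab/> becomes a tab."""
    if elem.tag == f"{{{TEXT_NS}}}tab":
        out.append("\t")
    if elem.text:
        out.append(elem.text)
    for child in elem:
        _collect(child, out)
        # Text that follows a child element (e.g. after <text:tab/>) is its tail.
        if child.tail:
            out.append(child.tail)


def odt_paragraphs(odt_file):
    """Return the text of every <text:p> in content.xml (path or file object)."""
    with zipfile.ZipFile(odt_file) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    paragraphs = []
    for p in root.iter(f"{{{TEXT_NS}}}p"):
        parts = []
        _collect(p, parts)
        paragraphs.append("".join(parts))
    return paragraphs


def build_sample_odt():
    """Build a minimal in-memory stand-in for the test document (hypothetical)."""
    content = (
        f'<office:document-content xmlns:office="{OFFICE_NS}" '
        f'xmlns:text="{TEXT_NS}"><office:body><office:text>'
        "<text:p><text:span>Column one text<text:tab/>"
        "Column two text</text:span></text:p>"
        "<text:p>Some other text</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("content.xml", content)
    buf.seek(0)
    return buf


print(odt_paragraphs(build_sample_odt()))
```

For a real file, odt_paragraphs("mdimporter_test.odt") should work the same way, provided the assumptions above hold for that document.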

As a side note, probably unrelated,

mdimport -d 1 mdimporter_test.odt

says Imported '/Users/khym/Desktop/mdimporter_test.odt' of type 'org.oasis-open.opendocument.text' with plugIn /System/Library/Spotlight/RichText.mdimporter.

I.e., it's not using OOoSpotlightImporter unless I force it to with -g... isn't it supposed to default to OOoSpotlightImporter rather than RichText?
Comment 1 Alex Thurgood 2014-09-30 09:29:42 UTC
Probably, but I have no idea how the SpotlightImporter is supposed to work
Comment 2 QA Administrators 2015-10-14 19:58:02 UTC Comment hidden (obsolete)
Comment 3 jotraeth 2016-07-20 12:08:58 UTC
Bug is still present on a currently supported version of LibreOffice

Mac OSX Version 10.9.5
LibreOffice Version 5.1.4.2 (64-bit)
Build-ID: f99d75f39f1c57ebdd7ffc5f42867c12031db97a

No indexing of text in .odt-files with tabulation in unordered list
Comment 4 QA Administrators 2017-09-01 11:17:05 UTC Comment hidden (obsolete)
Comment 5 Dave Huang 2017-09-13 06:08:51 UTC
I'm still seeing the bug in:

Version: 5.4.1.2
Build ID: ea7cb86e6eeb2bf3a5af73a8f7777ac570321527
CPU threads: 4; OS: Mac OS X 10.12.6; UI render: default; 
Locale: en-US (en_US.UTF-8); Calc: group
Comment 6 QA Administrators 2018-09-14 02:46:16 UTC Comment hidden (obsolete)
Comment 7 Dave Huang 2018-09-14 22:44:07 UTC
I'm still seeing the bug in:

Version: 6.1.1.2
Build ID: 5d19a1bfa650b796764388cd8b33a5af1f5baa1b
CPU threads: 4; OS: Mac OS X 10.13.6; UI render: default; 
Locale: en-US (en_US.UTF-8); Calc: group threaded

So I don't know Objective C, but I do know regular C... looking at the source, from what I can tell, it parses the XML content of the document and when it finds a <text:p> start tag, it starts collecting the text in the element until it finds any end tag.

The relevant part of the attached .odt is basically: <text:p><text:span>Column one text<text:tab />Column two text</text:span></text:p>

So I think the problem is that it sees the <text:p> and starts collecting text, but when it gets to the <text:tab />, it stops and hence ignores "Column two text". What if it only stops collecting text when it finds the ending </text:p>, rather than any end tag?

I'll attach a proposed patch, but as I said, I don't know Objective C. And I don't have an environment where I can try to build and test the change.
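The idea can be modelled outside the importer with a small SAX handler. This is not the importer's Objective-C code and not the attached patch, just a sketch of the behaviour described above: one mode stops collecting at any end tag, the other only at the ending </text:p>. The class and function names are made up for illustration, and emitting a space for <text:tab /> in the second mode is an extra assumption of this sketch so that the two columns do not run together.

```python
import xml.sax

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Simplified stand-in for the relevant part of content.xml.
SAMPLE = (
    f'<doc xmlns:text="{TEXT_NS}">'
    "<text:p><text:span>Column one text<text:tab />"
    "Column two text</text:span></text:p>"
    "<text:p>Some other text</text:p>"
    "</doc>"
).encode("utf-8")


class ParagraphCollector(xml.sax.ContentHandler):
    """Collect paragraph text; the flag selects the described or proposed rule."""

    def __init__(self, stop_on_any_end_tag):
        super().__init__()
        self.stop_on_any_end_tag = stop_on_any_end_tag
        self.collecting = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "text:p":
            self.collecting = True
        elif name == "text:tab" and self.collecting and not self.stop_on_any_end_tag:
            # Assumption of this sketch: treat a tab as a word separator.
            self.chunks.append(" ")

    def endElement(self, name):
        # A self-closing <text:tab /> is reported as a start and an end event.
        if self.stop_on_any_end_tag or name == "text:p":
            if self.collecting:
                self.chunks.append(" ")
            self.collecting = False

    def characters(self, content):
        if self.collecting:
            self.chunks.append(content)


def extract_text(xml_bytes, stop_on_any_end_tag):
    """Parse xml_bytes and return the collected text with whitespace normalized."""
    handler = ParagraphCollector(stop_on_any_end_tag)
    xml.sax.parseString(xml_bytes, handler)
    return " ".join("".join(handler.chunks).split())


print(extract_text(SAMPLE, stop_on_any_end_tag=True))
print(extract_text(SAMPLE, stop_on_any_end_tag=False))
```

With the first rule, the text after the tab is dropped, which matches the symptom reported here; with the second, "Column two text" is kept.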
Comment 8 Dave Huang 2018-09-14 22:45:51 UTC
Created attachment 144885
Keep collecting text until the ending <text:p>
Comment 9 QA Administrators 2019-09-15 02:46:47 UTC Comment hidden (obsolete)
Comment 10 Dave Huang 2019-09-23 01:36:10 UTC
(In reply to QA Administrators from comment #9)
> There have been thousands of bug fixes and commits since anyone checked on
> this bug report. During that time, it's possible that the bug has been
> fixed, or the details of the problem have changed. We'd really appreciate
> your help in getting confirmation that the bug is still present.

But it looks like there have been only two commits that affect the mdimporter, neither of which claims to fix this problem...

Perhaps someone could try out my proposed patch? Or if someone could just build it for me, I can try it myself.
Comment 11 Julien Nabet 2019-10-18 07:55:10 UTC
David: I can submit the patch for you on gerrit, but first, could you send a license statement (see https://wiki.documentfoundation.org/Development/Developers)?
Without a license statement, we can't include your patch in LO.
Comment 12 QA Administrators 2021-10-18 03:48:52 UTC
Dear Dave Huang,

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.
 
If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.

Please DO NOT

Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)


If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install the oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3) from https://downloadarchive.documentfoundation.org/libreoffice/old/

2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to the keywords


Feel free to come ask questions or to say hello in our QA chat: https://web.libera.chat/?settings=#libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug