Bug 35143 - pdf import adds and removes spaces in text
Summary: pdf import adds and removes spaces in text
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Draw
Version: 3.5.6.2 release (earliest affected)
Hardware: x86-64 (AMD64) Windows (All)
Importance: medium normal
Assignee: vvort
URL:
Whiteboard: target:4.3.0
Keywords:
Duplicates: 49909
Depends on:
Blocks:
 
Reported: 2011-03-09 04:56 UTC by matt.reischer
Modified: 2014-04-17 08:59 UTC
CC List: 4 users

See Also:
Crash report or crash signature:


Attachments
pdf before import (44.78 KB, application/pdf)
2011-03-09 04:56 UTC, matt.reischer
odg after import (33.38 KB, application/vnd.oasis.opendocument.graphics)
2011-03-09 04:57 UTC, matt.reischer
Checked whether the bug is resolved in the current version (3.5.6.2); it is not. (272.15 KB, image/jpeg)
2012-08-31 20:14 UTC, matt.reischer
patch v1 by Vort (12.02 KB, patch)
2014-01-20 06:20 UTC, vvort
Example PDF with Spaces removed (23.47 KB, application/force-download)
2014-01-20 10:51 UTC, Samuel Mehrbrodt (CIB)
patch v2 by Vort (13.53 KB, patch)
2014-01-20 13:43 UTC, vvort

Description matt.reischer 2011-03-09 04:56:48 UTC
Created attachment 44271 [details]
pdf before import

Simple text is imported incorrectly: blank spaces are added inside words and removed between words.
Example 1 after import:
"any informati on or t echnical data that is sensi tive material, includi ng"
Example 2 after import:
"authorizedrepresentativesofallparties.ThisAgreementandperformancethereundershallbe"
These examples are from the same document, but from different pages.
The before and after documents are attached.
Comment 1 matt.reischer 2011-03-09 04:57:20 UTC
Created attachment 44272 [details]
odg after import
Comment 2 Björn Michaelsen 2011-12-23 11:48:06 UTC
[This is an automated message.]
This bug was filed before the changes to Bugzilla on 2011-10-16. Thus it
started right out as NEW without ever being explicitly confirmed. The bug is
changed to state NEEDINFO for this reason. To move this bug from NEEDINFO back
to NEW please check if the bug still persists with the 3.5.0 beta1 or beta2 prereleases.
Details on how to test the 3.5.0 beta1 can be found at:
http://wiki.documentfoundation.org/QA/BugHunting_Session_3.5.0.-1

more detail on this bulk operation: http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html
Comment 3 Florian Reisinger 2012-08-14 14:01:10 UTC
Dear bug submitter!

Because there are a lot of NEEDINFO bugs with no answer within the last six months, we are closing all of these bugs.

To keep this message short, more information is available at https://wiki.documentfoundation.org/QA/NeedinfoClosure#Statement

Thanks for understanding, and hopefully for updating your bug so that everything is prepared for developers to fix your problem.

Yours!

Florian
Comment 7 matt.reischer 2012-08-31 20:14:05 UTC
Created attachment 66419 [details]
Checked whether the bug is resolved in the current version (3.5.6.2); it is not.
Comment 8 Buovjaga 2013-02-07 11:19:30 UTC
Financial bounty available for whoever wants to fix this: http://www.freedomsponsors.org/core/offer/153/pdf-import-adds-and-removes-spaces-in-text
Comment 9 Samuel Mehrbrodt (CIB) 2013-02-07 20:51:02 UTC
*** Bug 49909 has been marked as a duplicate of this bug. ***
Comment 10 Dennis Roczek 2013-02-09 21:09:51 UTC
The duplicate bug contains another PDF example with a similar problem (the enwikibooks example; the other one there seems to be fixed in LO 4.0.0!).

I tested the enwikibooks example and this one again on my LO 4.0.0 installation on 64-bit Windows 7. The Mac is in use, but I highly doubt that this is a platform problem.

Interesting side note: the fixed file uses the PDF 1.4 standard (at least according to my PDF viewer);
the broken ones use PDF 1.4 and PDF 1.5, so this doesn't seem to be a problem with the PDF standard version (?)
Comment 11 vvort 2014-01-20 06:20:59 UTC
Created attachment 92423 [details]
patch v1 by Vort

Hello!
The problematic algorithm is located in:
  Module: sdext
  File: pdfimport\tree\pdfiprocessor.cxx
  Function: PDFIProcessor::processGlyphLine
I tried to figure out how it works, but failed.

However, as we can see, it does not actually work.
Because of that, I have reimplemented it.

My version was tested in particular with the files
'Autani - Non-Disclosure Agreement (Mutual with Business) (3)'
'Cascading Style Sheets_Print version - Wikibooks, open books for an open world.pdf'
and it works better than the previous version.

Here is the patch. Please test it.
If you find regressions, let me know: I will look at the problematic PDF file and try to fix the algorithm.
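
To illustrate the kind of decision a whitespace detection function has to make, here is a minimal Python sketch. It is not the attached patch and not the code in PDFIProcessor::processGlyphLine. The names Glyph, join_glyphs and gap_ratio are hypothetical, and the rule used (insert a space when the gap between two glyphs exceeds a fraction of the average glyph width) is an assumed, simplified heuristic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """One positioned character on a text line (hypothetical structure)."""
    char: str
    x: float      # left edge of the glyph
    width: float  # horizontal extent of the glyph


def join_glyphs(glyphs: List[Glyph], gap_ratio: float = 0.25) -> str:
    """Rebuild a line of text from positioned glyphs.

    A space is inserted wherever the horizontal gap between two
    neighbouring glyphs is larger than gap_ratio times the average
    glyph width. This is an assumed heuristic for illustration only.
    """
    if not glyphs:
        return ""
    ordered = sorted(glyphs, key=lambda g: g.x)
    avg_width = sum(g.width for g in ordered) / len(ordered)
    threshold = gap_ratio * avg_width

    out = [ordered[0].char]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur.x - (prev.x + prev.width)
        # Do not double up spaces that are already present as glyphs.
        if gap > threshold and prev.char != " " and cur.char != " ":
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


# Four glyphs: "on" and "or" separated by a visibly wider gap.
line = [
    Glyph("o", 0.0, 5.0),
    Glyph("n", 5.2, 5.0),
    Glyph("o", 13.0, 5.0),
    Glyph("r", 18.1, 5.0),
]
print(join_glyphs(line))                  # on or
print(join_glyphs(line, gap_ratio=0.01))  # o n o r
print(join_glyphs(line, gap_ratio=1.0))   # onor
```

The last two calls show how a badly chosen threshold can reproduce both symptoms from the description: a threshold that is too small splits words, and one that is too large merges them.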
Comment 12 Samuel Mehrbrodt (CIB) 2014-01-20 10:51:40 UTC
Created attachment 92437 [details]
Example PDF with Spaces removed
Comment 13 Samuel Mehrbrodt (CIB) 2014-01-20 10:52:22 UTC
Hi Vort,

Thanks for working on this. It looks good already, but there are still some issues.
In the file above ("Example PDF with Spaces removed"), some spaces are still removed,
e.g. in the first sentence:
  - "Anreise:Gern" instead of "Anreise: Gern"
  - "Hamburgeine" instead of "Hamburg eine"
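
Findings like these can also be collected mechanically when comparing imported text against the text that was expected. The following Python sketch shows one way to do that. The helper name spacing_errors is hypothetical, and it assumes that the two texts differ only in whitespace.

```python
from typing import List


def spacing_errors(expected: str, imported: str) -> List[str]:
    """List the tokens of `imported` that do not occur as words in `expected`.

    Assumes the two texts contain the same characters apart from
    whitespace; raises ValueError otherwise.
    """
    def strip_ws(text: str) -> str:
        return "".join(text.split())

    if strip_ws(expected) != strip_ws(imported):
        raise ValueError("texts differ in more than whitespace")

    expected_words = set(expected.split())
    # Every token that is not a known word was either merged or split.
    return [token for token in imported.split() if token not in expected_words]


print(spacing_errors("Anreise: Gern", "Anreise:Gern"))
# ['Anreise:Gern']
print(spacing_errors("any information or technical", "any informati on or t echnical"))
# ['informati', 'on', 't', 'echnical']
```

An empty result means that the imported text has the same word boundaries as the expected text.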
Comment 14 vvort 2014-01-20 13:43:09 UTC
Created attachment 92448 [details]
patch v2 by Vort

Here is updated version of my algorithm.
Please recheck it.
-- Vort
Comment 15 matt.reischer 2014-01-20 19:56:04 UTC
Thank you for the work on this bug which I submitted.  I do not know how to apply your patch to test it.  If you can point me to info on how to apply your patch, I will try to test it.

Thank you for your efforts.
Comment 16 Samuel Mehrbrodt (CIB) 2014-01-20 19:59:25 UTC
I checked it with a few PDF files and it looks good to me - so thanks for this.
Can you submit the Patch to gerrit.libreoffice.org for a Code Review?
See https://wiki.documentfoundation.org/Development/gerrit for more information.

@Matt Reischner: You can use "patch -i PATCH_FILE.patch" to apply the patch in your LO working copy.
Comment 17 matt.reischer 2014-01-20 20:02:29 UTC
I am running Win7 SP1.  I do not believe that patch command will work from my command prompt (I tried anyway).
Comment 18 Samuel Mehrbrodt (CIB) 2014-01-20 20:23:40 UTC
Oh then you can use "git apply patch_file". If you use a graphic interface, there might be also an option to apply a patch.
Comment 19 matt.reischer 2014-01-20 20:43:56 UTC
(In reply to comment #18)
> Oh then you can use "git apply patch_file". If you use a graphic interface,
> there might be also an option to apply a patch.

I think that for me to test this, LibreOffice would have to be recompiled into a Windows installer.  The only way I can make changes is through the Windows Control Panel (Add/Remove Programs, then the "Uninstall/Change" option), but that uses the installer from the last installation of LibreOffice and wouldn't know about the existence of a patch.
Comment 20 Samuel Mehrbrodt (CIB) 2014-01-20 20:45:46 UTC
Oh, I was assuming you had your own build of LO. So I guess you need to wait until this gets into the main codebase, and then you can try a daily build. I will notify you then.
Comment 21 vvort 2014-01-21 08:02:56 UTC
(In reply to comment #16)
> Can you submit the Patch to gerrit.libreoffice.org for a Code Review?
> See https://wiki.documentfoundation.org/Development/gerrit for more
> information.

Here it is:
https://gerrit.libreoffice.org/#/c/7564/

(I haven't worked with gerrit before, so there may be some mistakes.)
Comment 22 Commit Notification 2014-02-04 15:14:00 UTC
Vort committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=9db3b5585c5fa7fff633672fd32510c4066d035a

fdo#35143 PDF import: Reimplementation of whitespace detection function



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 23 Gerry 2014-04-16 10:23:44 UTC
Is bug 72028 a duplicate of this one?
Comment 24 vvort 2014-04-16 10:59:23 UTC
This is not a duplicate.
But...

When I was fixing this bug, I didn't know that it is possible to open a PDF with Writer.
That option is well hidden, and I thought that the Writer-related code in the importer was actually dead code.
I will think about what to do with this discovery.

For now, I have found that you can just open the PDF with Draw and copy and paste the page contents into Writer.
Comment 25 Gerry 2014-04-17 08:57:59 UTC
Thanks, vvort, for looking into that other related PDF-import-in-Writer bug.

Yes, the problem is still there if you import in Writer via File -> Open and select the file format filter "PDF - Portable Document Format (Writer)".