Created attachment 44271 [details] pdf before import Simple text is imported incorrectly: blank spaces are added inside words and removed between words. Example 1 after import: "any informati on or t echnical data that is sensi tive material, includi ng" Example 2 after import: "authorizedrepresentativesofallparties.ThisAgreementandperformancethereundershallbe" These are from the same document, but from different pages. The before and after documents are attached.
Created attachment 44272 [details] odg after import
[This is an automated message.] This bug was filed before the changes to Bugzilla on 2011-10-16. Thus it started right out as NEW without ever being explicitly confirmed. The bug is changed to state NEEDINFO for this reason. To move this bug from NEEDINFO back to NEW please check if the bug still persists with the 3.5.0 beta1 or beta2 prereleases. Details on how to test the 3.5.0 beta1 can be found at: http://wiki.documentfoundation.org/QA/BugHunting_Session_3.5.0.-1 more detail on this bulk operation: http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html
Dear bug submitter! Due to the fact, that there are a lot of NEEDINFO bugs with no answer within the last six months, we close all of these bugs. To keep this message short, more infos are available @ https://wiki.documentfoundation.org/QA/NeedinfoClosure#Statement Thanks for understanding and hopefully updating your bug, so that everything is prepared for developers to fix your problem. Yours! Florian
Created attachment 66419 [details] Checked whether the bug has been resolved in the current version (3.5.6.2); it has not.
Financial bounty available for whoever wants to fix this: http://www.freedomsponsors.org/core/offer/153/pdf-import-adds-and-removes-spaces-in-text
*** Bug 49909 has been marked as a duplicate of this bug. ***
The duplicate bug contains one other PDF example that shows a similar problem (the enwikibooks example; the other one seems to be fixed in LO 4.0.0!). I tested the enwikibooks file and this example again on my LO 4.0.0 installation with Win7 64-bit. The Mac is in use, but I highly doubt that this is a platform problem. Interesting side note: the fixed file uses the PDF 1.4 standard (at least according to my PDF viewer); the broken ones use PDF 1.4 and PDF 1.5, so this doesn't seem to be a problem with the PDF version (?)
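To repeat the version comparison above without relying on a PDF viewer, the version can be read from the `%PDF-x.y` header at the start of the file. This is a minimal sketch: the function name `pdf_header_version` is hypothetical, and searching the first 1024 bytes is a lenient assumption. Note that in PDF 1.4 and later a `/Version` entry in the document catalog may override the header, which this sketch does not check.

```python
import re


def pdf_header_version(data: bytes):
    """Return the version string from the '%PDF-x.y' header, or None.

    Only the file header is inspected; a /Version entry in the document
    catalog (which may override the header) is deliberately ignored.
    """
    # Assumption: the header appears within the first 1024 bytes.
    match = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    return match.group(1).decode("ascii") if match else None
```

To check a file on disk, pass the result of `open(path, "rb").read(1024)` to the function.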
Created attachment 92423 [details] patch v1 by Vort Hello! The problematic algorithm is located here: Module: sdext File: pdfimport\tree\pdfiprocessor.cxx Function: PDFIProcessor::processGlyphLine I tried to figure out how it works, but failed. As we can see, it does not actually work, so I have reimplemented it. My version was tested in particular with the files 'Autani - Non-Disclosure Agreement (Mutual with Business) (3)' and 'Cascading Style Sheets_Print version - Wikibooks, open books for an open world.pdf', and it works better than the previous version. Here is the patch. Please test it. If you find regressions, let me know; I will look at the problematic PDF file and try to fix the algorithm.
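For readers unfamiliar with the problem: PDF text is often stored as positioned glyphs without explicit space characters, so an importer has to infer word boundaries from the gaps between glyphs. The following is a minimal sketch of gap-based whitespace detection in general. It is not the code from the attached patch or from `PDFIProcessor::processGlyphLine`; the names `Glyph` and `join_glyph_line` and the `gap_ratio` threshold are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    text: str     # character(s) drawn by this glyph
    x: float      # left edge of the glyph box
    width: float  # horizontal extent of the glyph box


def join_glyph_line(glyphs, gap_ratio=0.25):
    """Rebuild one text line from positioned glyphs.

    A space is inserted wherever the horizontal gap between two
    neighbouring glyphs exceeds gap_ratio times the average glyph width.
    The threshold is an illustrative assumption, not a tuned value.
    """
    if not glyphs:
        return ""
    ordered = sorted(glyphs, key=lambda g: g.x)
    avg_width = sum(g.width for g in ordered) / len(ordered)
    threshold = gap_ratio * avg_width
    parts = [ordered[0].text]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur.x - (prev.x + prev.width)
        # Skip insertion if the PDF already supplied a space glyph here.
        already_spaced = prev.text.endswith(" ") or cur.text.startswith(" ")
        if gap > threshold and not already_spaced:
            parts.append(" ")
        parts.append(cur.text)
    return "".join(parts)
```

A threshold that is too low produces the "informati on" symptom (spaces inside words), while one that is too high produces the "authorizedrepresentatives" symptom (spaces removed between words), which is why both effects can appear in the same document.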
Created attachment 92437 [details] Example PDF with Spaces removed
Hi Vort, thanks for working on this. Looks good already, however there are still some issues. See the file above ("Example PDF with Spaces removed"), there are still spaces removed e.g. first sentence: - "Anreise:Gern" instead of "Anreise: Gern" - "Hamburgeine" instead of "Hamburg eine"
Created attachment 92448 [details] patch v2 by Vort Here is updated version of my algorithm. Please recheck it. -- Vort
Thank you for the work on this bug which I submitted. I do not know how to apply your patch to test it. If you can point me to info on how to apply your patch, I will try to test it. Thank you for your efforts.
I checked it with a few PDF files and it looks good to me - so thanks for this. Can you submit the Patch to gerrit.libreoffice.org for a Code Review? See https://wiki.documentfoundation.org/Development/gerrit for more information. @Matt Reischner: You can use "patch -i PATCH_FILE.patch" to apply the patch in your LO working copy.
I am running Win7 SP1. I do not believe that patch command will work from my command prompt (I tried anyway).
Oh then you can use "git apply patch_file". If you use a graphic interface, there might be also an option to apply a patch.
(In reply to comment #18) > Oh then you can use "git apply patch_file". If you use a graphic interface, > there might be also an option to apply a patch. I think that, for me to test this, LibreOffice would have to be recompiled into a Windows installer. The only way I can make changes is from the Windows Control Panel (Add/Remove Programs, then the "Uninstall/Change" option), but that uses the installer from the last installation of LibreOffice and would not know about the existence of a patch.
Oh, I was assuming you had your own build of LO. So I guess you need to wait until this gets into the main codebase, and then you can try a daily build. I will notify you then.
(In reply to comment #16) > Can you submit the Patch to gerrit.libreoffice.org for a Code Review? > See https://wiki.documentfoundation.org/Development/gerrit for more > information. Here it is: https://gerrit.libreoffice.org/#/c/7564/ (I have not worked with gerrit before, so there may be some mistakes)
Vort committed a patch related to this issue. It has been pushed to "master": http://cgit.freedesktop.org/libreoffice/core/commit/?id=9db3b5585c5fa7fff633672fd32510c4066d035a fdo#35143 PDF import: Reimplementation of whitespace detection function The patch should be included in the daily builds available at http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More information about daily builds can be found at: http://wiki.documentfoundation.org/Testing_Daily_Builds Affected users are encouraged to test the fix and report feedback.
Is bug 72028 a duplicate of this one?
This is not a duplicate. But... When I was fixing this bug, I didn't know that it was possible to open a PDF with Writer. The option is well hidden, and I thought that the Writer-related code in the importer was actually dead code. I will think about what to do with this discovery. For now, I have found that you can just open the PDF with Draw and Copy&Paste the page contents into Writer.
Thanks vvort for looking into that other, related PDF-import-in-Writer bug. Yes, the problem is still there if you import into Writer via File -> Open and select the file format filter "PDF - Portable Document Format (Writer)"