Bug 118370 - "Convert selected text frames into one text paragraph" function (for PDF import editing)
Summary: "Convert selected text frames into one text paragraph" function (for PDF impo...
Status: VERIFIED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Draw (show other bugs)
Version:
(earliest affected)
Inherited From OOo
Hardware: All All
: medium enhancement
Assignee: Justin L
URL:
Whiteboard: target:6.4.0
Keywords: filter:pdf
Depends on:
Blocks: PDF-Import-Draw
  Show dependency treegraph
 
Reported: 2018-06-25 12:15 UTC by Roman Kuznetsov
Modified: 2023-09-18 20:14 UTC (History)
10 users (show)

See Also:
Crash report or crash signature:


Attachments
Example PDF (10.29 KB, application/pdf)
2018-06-25 12:15 UTC, Roman Kuznetsov
Details
PDF from MS Word 2010 (82.21 KB, application/pdf)
2018-06-28 10:47 UTC, Roman Kuznetsov
Details
PDF from Scribus 1.5.3 (17.24 KB, application/pdf)
2018-06-28 10:48 UTC, Roman Kuznetsov
Details
Macro for getting of text from PDF (3.46 KB, text/plain)
2018-12-04 09:33 UTC, Roman Kuznetsov
Details
Some text documents (104.18 KB, application/x-zip-compressed)
2019-07-16 00:52 UTC, Regina Henschel
Details
PDF_import_testDoc2.odg: document I'm using to test patch functionality. (33.53 KB, application/vnd.oasis.opendocument.graphics)
2019-07-18 09:22 UTC, Justin L
Details
PDF_import_testDoc.odg: the original testing document during development (21.61 KB, application/vnd.oasis.opendocument.graphics)
2019-08-03 07:01 UTC, Justin L
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Roman Kuznetsov 2018-06-25 12:15:36 UTC
Description:
place text when importing from PDF in one text box instead many text boxes with one string

Steps to Reproduce:
1. Open PDF from attach
2. Try select any text. You can select only one string of text instead all text in one big text box

Actual Results:
You can select only one string of text

Expected Results:
You can select all text,that will be in one big text box


Reproducible: Always


User Profile Reset: No



Additional Info:
Comment 1 Roman Kuznetsov 2018-06-25 12:15:58 UTC
Created attachment 143096 [details]
Example PDF
Comment 2 Aron Budea 2018-06-25 12:31:10 UTC
Doesn't this depend on how the text is exported in the PDF?
Comment 3 Roman Kuznetsov 2018-06-25 13:07:45 UTC
(In reply to Aron Budea from comment #2)
> Doesn't this depend on how the text is exported in the PDF?

I don't know =(

of course if i have PDF from scanner, then i will don't have any text when import.

Need check PDF from MS Word or may be from Scribus or Publisher.
Comment 4 V Stuart Foote 2018-06-26 23:13:32 UTC
Understand that the "Text runs" inside the PDF are not organized in any lexical sense as PDF is a presentation format--content of a PDF can be split apart and structured with no regard for the source content.

But we already are looking at improving fidelity of the Unicode text run composition for CTL and embedding strings of "Actual text" in our PDF export (bug 11748)--IMHO would expect we can implement that same filter handling for import or break.  

And believe attempts to assemble the "runs" into some minimum set of text boxes on import (or break) would be a reasonable filter enhancement--and would in general improve preparation of structured accessible documents.


However beyond that we must continue to reject any perception of LO as a PDF "editor".
Comment 5 Roman Kuznetsov 2018-06-28 10:47:51 UTC
Created attachment 143184 [details]
PDF from MS Word 2010
Comment 6 Roman Kuznetsov 2018-06-28 10:48:14 UTC
Created attachment 143185 [details]
PDF from Scribus 1.5.3
Comment 7 Roman Kuznetsov 2018-06-28 11:01:52 UTC
I added else two PDF examples - one from Scribus and one from MS Word/

It opens in LO Draw and selecting of text works same as for first PDF (from one string to other string)
Comment 8 Matěj Cepl 2018-12-04 09:08:58 UTC
Looking at https://ask.libreoffice.org/en/question/53238/ it seems like a step in the right direction, but I think I would hope at least for function "Convert selected text frames into one text paragraph", that would help a lot in meanwhile.
Comment 9 Roman Kuznetsov 2018-12-04 09:33:44 UTC
Created attachment 147265 [details]
Macro for getting of text from PDF
Comment 10 V Stuart Foote 2019-07-03 13:56:12 UTC
Justin Luth has a patch up for review...

https://gerrit.libreoffice.org/#/c/75043

Adds a UNO action to merge selected Draw Text boxes imported from PDF runs into a single Draw Text box.
Comment 11 Regina Henschel 2019-07-12 15:59:32 UTC
How do you determine "imported from PDF"? As far as I see, it will become a general command. And that would raise some questions:

Text boxes might be on different layers. What is the resulting layer?

What is size and position of resulting text box?

Is the resulting text box a new text box? If yes, what happens with the previous shapes?

The text boxes have styles, automatic or custom. Are they used? If you yes, how they are used? If not, what style is used?

Each of the text boxes can have a numbered list. Are the lists joined?
Comment 12 Justin L 2019-07-12 17:50:19 UTC
(In reply to Regina Henschel from comment #11)
> How do you determine "imported from PDF"? As far as I see, it will become a
> general command.
Yes, there is nothing automatic here. The user has to select an area and then command "Consolidate Text". After that it is the user's responsibility to verify proper paragraph content and set paragraph properties. Nothing requires that it can only be used for PDF imports, that is just the design-scope.

> Text boxes might be on different layers. What is the resulting layer?
Good question. I have no idea about layers The layer inserted would likely be the same layer as the first textbox encountered, since xpInsOL = pObj->getParentSdrObjListFromSdrObject().

> What is size and position of resulting text box?
It is the size of all of the marked text objects.
  rReplacement->SetSnapRect(GetMarkedObjRect());

> Is the resulting text box a new text box? If yes, what happens with the
> previous shapes?
Yes, it is a new textbox. SdrRectObj* rReplacement = new SdrRectObj( getSdrModelFromSdrView(), OBJ_TEXT ).
The old textframes are deleted.
In terms of shapes, I would expect the text portion to just be removed from the shapes, but the shapes themselves would not be deleted. (The initial implementation ignores shapes and only deals with textframes, but I'm interested in exploring the utility of supporting general SdrObjs that HaveText.)

> The text boxes have styles, automatic or custom. Are they used? If you yes,
> how they are used? If not, what style is used?
No, the textboxes styles would not be used for the new textbox. It would just be the default style that any new SdrRectObj gets. I expect that a PDF import doesn't use textbox styles. The only way I could imagine applying a style would be if all textboxes uses an identical style, and in that case it could conceivably be applied to the new textbox as well. 

> Each of the text boxes can have a numbered list. Are the lists joined?
No.  Other than guessing at sentence-end, there is no content analysis done during the consolidation.
Comment 13 Justin L 2019-07-12 19:15:00 UTC
(In reply to Justin L from comment #12)
> > Text boxes might be on different layers. What is the resulting layer?
Testing shows that it is always created in the first layer.

> > Each of the text boxes can have a numbered list. Are the lists joined?
Actually testing indicates yes. Whatever rDrawOutliner.AddText(OutlinerParaObject*) does.
Comment 14 Regina Henschel 2019-07-12 21:00:06 UTC
Hi Justin, I have made some tests too:

For the desired use case, it works fine. It preserves character formatting and hyperlinks. It works with vertical text too.
But because it is a general command, you should consider to explain some details in the release notes and/or help text. So that users do not have unrealistic expectations.

I have found one error, you should address: Draw two text boxes with text and group them. "Consolidate Text" is not available on the group. That is OK. A user might then enter the group and mark the text boxes. Then "Consolidate Text" is available. But if you use it, the entire group including the text vanishes.

There is another error. But I think, that should be treated at a different place, not in your patch. If the layer "Layout" is locked, the new text box is inserted in this layer anyway.

Further, not severe problems are:
* "Consolidate Text" works on Callouts too. I guess, that is not intended.
* The command works in Impress too, if you use a macro. If you use it with a presentation outline object and a text box, it generates a text box and deletes the presentation object. Presentation objects should be excluded like other shapes. 
* If the single text box objects are named, the resulting text box is unnamed despite of that.
* Title and description of the single text boxes are lost.
* Numbered list in one text box followed by numbered list in the next text box is turned into a bullet list in my tests.
* If one text box is marked together with another shape with text, e.g. a rectangle, then the text box is replaced by a new one, which has only the text of the text box. Because name, description and style are lost, it would be better to keep the text box, in case only one text box remains after un-marking all the other shapes.
* If the user marks a shape in addition, the size and position for the new text box is taken from the entire marked area. I would expect, that it is only taken from the area of the text boxes.
* Text boxes in "text along path"-mode are treated too. I consider such shapes as decorative and would not include them.



I would not consider to implement something similar for arbitrary shapes. They are too different. There are primitive shapes (rectangle, circle, ellipse), paths and curves, callouts, measure lines, "text along path"-mode, custom shapes (with intrinsic text area), custom shapes in Fontwork-mode, images and Math-OLE. And in Writer, there are in addition custom shapes with text box, and frames (which are text boxes too in file format). And in Impress there are in addition presentation objects.
Comment 15 Justin L 2019-07-15 17:02:36 UTC
I updated my proposal to patch set 3.

(In reply to Regina Henschel from comment #14)
Thanks for giving your opinion that this should NOT be opened up to general shapes, but should be strictly for textboxes. I've limited it to OBJ_TEXT items.

> Grouped textboxes: enter group, consolidate, and everything vanishes.
Kinda fixed with another sledgehammer approach to just ungrouping everything. I have no idea what is going on behind the scenes here. I suspect that either objects are being freed, or SdrOutliner is being reset.

> If the layer is locked, the new text box is inserted in this layer anyway.
I changed the code so that the new text box is inserted into the current layer. I wasn't able to discover the locked status of the layer though, since the only lock function [IsLockedODF()] reports on something different.

> "Consolidate Text" works on Callouts and presentation objects too.
I can't reproduce. In my testing, callouts are !IsTextFrame(), so they shouldn't have been included. However, since the code is now modified to only include OBJ_TEXT, perhaps it works as you expect now.

> Numbered list in one text box followed by numbered list in the next text
> box is turned into a bullet list in my tests.
I can't reproduce.

> * Text boxes in "text along path"-mode are treated too.
I couldn't figure out how to create these.
It sounds like you may have developed a fairly comprehensive test document. Perhaps you could attach it to this report. That would be helpful.

> If only one text box is marked together with another shape, then the
> text box is replaced by a new one, losing name, description and style.
Now that only OBJ_TEXT are supported, this was easily fixed.

> * If the user marks a shape in addition, the size and position for the new
> text box is taken from the entire marked area.
GetMarkedObjRect was returning a cached size, so it needed to be invalidated and re-calculated.
Comment 16 Regina Henschel 2019-07-16 00:52:26 UTC
Created attachment 152799 [details]
Some text documents

Hi Justin, here are some test documents. Please do not try to address all such edge cases. They will not appear in PDF import. But you should be aware, that they exist. Instead clearly describe the purpose of your tool in the release notes.

I cannot test your new version of the patch before Friday. Perhaps someone else has time to test it earlier. So please be patient.

"Text along path" is a feature for paths and curves, but works too on text boxes, circles and rectangles. It is named "Fontwork". But that term conflicts with the "Fontwork" from the "Fontwork Gallery". Therefore I prefer to say "Text along path". With this term you will find it in other applications like Corel Draw. You need a shape with text (not a custom shape!) and then click on one of the right four icons in the first line of the Fontwork-dialog.
The uno-command for opening and closing the dialog for "Text along path" is .uno:FontWork. You need to customize the UI to get it. It has the name "Fontwork" and is in category "Format".
 
The shapes from the "Fontwork Gallery" are those, which are named "WordArt" in Microsoft Office and are a special mode of custom shapes. Inside the code you need to be careful, what feature is meant, if a variable or method contains "fontwork" in its name.

A similar confusion is about "callout". There exists a group of custom shapes, which are named "Callouts". But on the other hand there exists the icon "Callouts" in the Text toolbox. You need to customize your UI to get this toolbox. The command for the toolbox is in category Drawing.
Comment 17 Justin L 2019-07-18 09:22:25 UTC
Created attachment 152855 [details]
PDF_import_testDoc2.odg: document I'm using to test patch functionality.

(In reply to Regina Henschel from comment #16)
Thanks for your excellent tests. They helped a lot in finalizing and verifying my proposed patch.

> "Text along path" is a feature for paths and curves.
IsFontwork() is reporting false for this object, so it is still being consolidated. I didn't see any way to differentiate it. Perhaps IsFontwork() is broken.
Comment 18 Regina Henschel 2019-07-18 20:07:49 UTC
(In reply to Justin L from comment #17)
> 
> > "Text along path" is a feature for paths and curves.
> IsFontwork() is reporting false for this object, so it is still being
> consolidated. I didn't see any way to differentiate it. Perhaps IsFontwork()
> is broken.

You are right. Not really broken, but not usable in your context. IsFontwork() is defined as 
bool SdrTextObj::IsFontwork() const
{
   return !bTextFrame // Default is FALSE
     && GetObjectItemSet().Get(XATTR_FORMTXTSTYLE).GetValue() != XFormTextStyle::NONE;
}
in svx/source/svdraw/svdotext.cxx. I see it similar already in OOo1.1.5. That was correct at that time, because Fontwork was not possible on text boxes. Fontwork was changed sometime between OOo2.2 and OOo3.2 (I have only those to test) so that it is applicable to text boxes too. But the function was not adapted. The function is available as property for macros and people might rely on it in its current form. Therefore I would not change the function.
Comment 19 Regina Henschel 2019-07-20 00:44:37 UTC
(In reply to Justin L from comment #17)
> IsFontwork() is reporting false for this object, so it is still being
> consolidated. I didn't see any way to differentiate it. Perhaps IsFontwork()
> is broken.

Perhaps:

 && pTextObj->GetMergedItem(XATTR_FORMTXTSTYLE).GetValue() == XFormTextStyle::NONE // not fancy text

(and #include <svx/xtextit0.hxx>)

Consolidation of lists fails. All items go to one paragraph, if the list items do not end with punctuation. In case of numbering of style "1.", list items are split after the dot. But I have no idea how to do it better. The user can repair the lists, because the items are now all inside the same text box. So that is not an obstacle for me.

Do you will update the help yourself?
Comment 20 Commit Notification 2019-07-22 04:08:21 UTC
Justin Luth committed a patch related to this issue.
It has been pushed to "master":

https://git.libreoffice.org/core/+/82c3e68642de445064313e353812b54df76c7fe9%5E%21

tdf#118370 Draw: add option to consolidate multiple textObjs

It will be available in 6.4.0.

The patch should be included in the daily builds available at
https://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
https://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 21 Justin L 2019-07-22 04:53:14 UTC
Treating this issue as limited to comment 8's enhancement request. Wider discussion of import filter improvements should use duplicate bug 32249.

Considering this bug as fixed with comment 20's patch, although additional comments about better implementation are welcome here.

One improvement is better end of sentence detection. For that, I'd guess that UNO XBreakIterator needs to be extended to include an isBoundary(pos) function - to test if the chosen position is a legitimate boundary for the selected break. (I don't know how to go about changing UNO stuff.) Currently beginOfSentence/endOfSentence treat the end of any string as legitimate start/end of a sentence which isn't valid in our text fragment situation.

(In reply to Regina Henschel from comment #19)
> Will you update the help?
No. I don't know how to modify the help documentation.
Comment 22 Regina Henschel 2019-07-22 12:04:38 UTC
(In reply to Justin L from comment #21)
> > Will you update the help?
> No. I don't know how to modify the help documentation.

I have written bug 126507 to track the missing documentation.
Comment 23 Roman Kuznetsov 2019-07-24 09:07:48 UTC
Verified in

Version: 6.4.0.0.alpha0+ (x86)
Build ID: 29fbb2512c741bc34221b7d13b9958c832f0a3f7
CPU threads: 4; OS: Windows 6.1; UI render: default; VCL: win; 
TinderBox: Win-x86@42, Branch:master, Time: 2019-07-24_03:19:14
Locale: ru-RU (ru_RU); UI-Language: en-US
Calc: threaded

Works fine, but:

1. It need select all text boxes before consolidate => user can be confuse
2. Doesn't save a table structure (it's another enhancement of course)

Thank you, Justin!
Comment 24 Justin L 2019-07-24 09:44:31 UTC
(In reply to Roman Kuznetsov from comment #23)
> 1. It need select all text boxes before consolidate => user can be confused
Yes. The only way the "order" is know is by the order of creation. First textbox created is the first paragraph in the consolidated textbox. Obviously a consolidated textbox becomes the last one created, so it will end up at the bottom if multiple consolidates are done.
Comment 25 Justin L 2019-08-03 07:01:15 UTC
Created attachment 153113 [details]
PDF_import_testDoc.odg: the original testing document during development
Comment 26 Cor Nouws 2020-01-14 16:01:11 UTC
ah wow - amazing that this is done.
Huge thanks to all who helped here :) !