Bug 132773 - Copying an underlined word from an html document loses underline property
Status: NEW
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Writer
Version (earliest affected): 6.4.3.2 release
Hardware: All All
Importance: medium normal
Assignee: Not Assigned
Blocks: HTML-Paste
Reported: 2020-05-06 13:46 UTC by Konstantin Kharlamov
Modified: 2020-12-18 06:11 UTC (History)
Description Konstantin Kharlamov 2020-05-06 13:46:59 UTC
I don't know whether this has the same cause as report #132770, so I'm reporting it separately; feel free to mark it as a duplicate if that's the case.

# Steps to reproduce

1. Create a test.html file with the following content
    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
    <head>
      <meta charset="utf-8" />
      <meta name="generator" content="pandoc" />
      <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
      <title>test2</title>
      <style>
        code{white-space: pre-wrap;}
        span.smallcaps{font-variant: small-caps;}
        span.underline{text-decoration: underline;}
        div.column{display: inline-block; vertical-align: top; width: 50%;}
        div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
        ul.task-list{list-style: none;}
      </style>
      <!--[if lt IE 9]>
        <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
      <![endif]-->
    </head>
    <body>
    <ins>
    Test
    </ins>
    </body>
    </html>
2. Open it in a browser (tested with Qutebrowser, Chromium, Firefox)
3. Copy the underlined word "Test" from a browser
4. Paste in a LO Writer document

## Expected

The word "Test" pasted into LO Writer is underlined (the same as what you see in the browser)

## Actual

"Test" is not underlined
Comment 1 Konstantin Kharlamov 2020-05-06 14:52:10 UTC
It is worth noting that copying underlined text from Google Docs works fine.

I compared the lists of clipboard formats offered by Google Docs and by the plain HTML document; here they are:

    From Google-docs           | From plain html
    --------------------------------------------
    chromium/x-web-custom-data |
    text/html                  | text/html
    MULTIPLE                   | MULTIPLE
    SAVE_TARGETS               | SAVE_TARGETS
    STRING                     | STRING
    TARGETS                    | TARGETS
    TEXT                       | TEXT
    text/plain                 | text/plain
    TIMESTAMP                  | TIMESTAMP
    UTF8_STRING                | UTF8_STRING

As you can see, the only difference in formats is `x-web-custom-data`. I'm not sure LO Writer uses it, though, so it's interesting to see why there is such a difference.
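
The comparison in the table above can be reproduced with a plain set difference. This is only a sketch: the two sets are typed in by hand from the table, and the variable names are illustrative.

```python
# Clipboard targets copied by hand from the table above.
google_docs = {
    "chromium/x-web-custom-data", "text/html", "MULTIPLE", "SAVE_TARGETS",
    "STRING", "TARGETS", "TEXT", "text/plain", "TIMESTAMP", "UTF8_STRING",
}
plain_html = {
    "text/html", "MULTIPLE", "SAVE_TARGETS",
    "STRING", "TARGETS", "TEXT", "text/plain", "TIMESTAMP", "UTF8_STRING",
}

# Formats offered by one source but not by the other.
only_in_google_docs = sorted(google_docs - plain_html)
only_in_plain_html = sorted(plain_html - google_docs)

print("only from Google Docs:", only_in_google_docs)
print("only from plain html: ", only_in_plain_html)
```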
Comment 2 Konstantin Kharlamov 2020-05-06 15:26:17 UTC
I suspect LO uses the text/html clipboard format, so I extracted the contents of that format for Google Docs (where copying works) and for the plain HTML document (where it doesn't).

Google docs: b'<meta charset="utf-8"><b style="font-weight:normal;" id="docs-internal-guid-cfdd430f-7fff-9177-bba3-5001b84d8ed7"><span style="font-size:10pt;font-family:Arial;color:#000000;background-color:#ffffff;font-weight:400;font-style:normal;font-variant:normal;text-decoration:underline;-webkit-text-decoration-skip:none;text-decoration-skip-ink:none;vertical-align:baseline;white-space:pre;white-space:pre-wrap;">Test</span></b>'

Plain html: b'<span style="color: rgb(0, 0, 0); font-family: &quot;Bitstream Vera Serif&quot;; font-size: medium; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: underline; display: inline !important; float: none;">Test</span>'

What is interesting is that both cases have `text-decoration:underline`. LO Writer just ignores it in one of the cases. So the bug is probably different from #132770.
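
A quick way to double-check that both payloads carry the underline declaration is to split the `style` attribute into name/value pairs. This is a minimal sketch: the two style strings are abbreviated excerpts of the payloads above, and `parse_inline_style` is a hypothetical helper, not anything LO uses.

```python
def parse_inline_style(style: str) -> dict[str, str]:
    """Split an inline style attribute into a {property: value} mapping."""
    declarations = {}
    for chunk in style.split(";"):
        name, sep, value = chunk.partition(":")
        if sep:  # skip empty chunks such as the one after the final ';'
            declarations[name.strip().lower()] = value.strip()
    return declarations


# Abbreviated excerpts of the two style attributes quoted above.
google_docs_style = (
    "font-size:10pt;font-family:Arial;font-style:normal;"
    "text-decoration:underline;-webkit-text-decoration-skip:none;"
    "vertical-align:baseline;"
)
plain_html_style = (
    "color: rgb(0, 0, 0); font-size: medium; "
    "-webkit-text-stroke-width: 0px; text-decoration: underline; "
    "display: inline !important; float: none;"
)

for label, style in [("Google Docs", google_docs_style), ("plain html", plain_html_style)]:
    print(label, "->", parse_inline_style(style).get("text-decoration"))
```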
Comment 3 V Stuart Foote 2020-05-06 17:12:42 UTC
It's not clear that this is a LibreOffice import filter issue. Judging by a clipboard content viewer (Nirsoft's InsideClipboard), it seems more likely that the browser's copy-to-clipboard function affects the usefulness of the inline CSS applied to the copied content.

The results of pasting from the clipboard after opening the sample HTML in each of Chrome, Edge, and Firefox are quite different. Paste Special content from the MS Edge browser receives the underline decoration.


Clipboard contents for Chrome's (81.0.4044.129) HTML Format 49414 include the CSS for the span.

00000240   30 70 78 3B 20 74 65 78 74 2D 64 65 63 6F 72 61    0px; text-decora
00000250   74 69 6F 6E 3A 20 75 6E 64 65 72 6C 69 6E 65 3B    tion: underline;

but the text does not receive the underline decoration.
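
For readers without a clipboard viewer, the two hex rows from the Chrome dump above can be decoded directly. This sketch only re-reads the bytes quoted here; `hex_rows` is an illustrative name.

```python
# The two rows of the Chrome dump quoted above (offsets and ASCII column dropped).
hex_rows = [
    "30 70 78 3B 20 74 65 78 74 2D 64 65 63 6F 72 61",
    "74 69 6F 6E 3A 20 75 6E 64 65 72 6C 69 6E 65 3B",
]

# bytes.fromhex() ignores the spaces between byte pairs.
decoded = b"".join(bytes.fromhex(row) for row in hex_rows).decode("ascii")
print(decoded)
```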

By contrast, the MS Edge (44.18362.449.0) HTML Format 49414 also includes the CSS for the span, and the pasted text is styled with underline.

00000180   20 6C 65 66 74 3B 20 74 65 78 74 2D 64 65 63 6F     left; text-deco
00000190   72 61 74 69 6F 6E 3A 20 75 6E 64 65 72 6C 69 6E    ration: underlin
000001A0   65 3B 20 74 65 78 74 2D 69 6E 64 65 6E 74 3A 20    e; text-indent: 

The Firefox (75.0) HTML Format 49414, meanwhile, does not include the CSS for the span, and the pasted text has no decoration, as I'd expect.

Also, the index (acting as priority for import) is sequenced differently in each browser's clipboard export.
Comment 4 Konstantin Kharlamov 2020-05-06 17:43:04 UTC
FTR, if anybody is interested, I made a Python script that prints clipboard content: https://github.com/Hi-Angel/scripts/blob/master/print_clipboard_content.py
Comment 5 Mike Kaganski 2020-05-06 17:44:45 UTC
Playing with InsideClipboard and its Clp file, I was able to make it work with Chrome by replacing in this clipboard HTML:

> <html>
> <body>
> <!--StartFragment--><span style="color: rgb(0, 0, 0); font-family: &quot;Times New Roman&quot;; font-size: medium; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: underline; display: inline !important; float: none;">Test</span><!--EndFragment-->
> </body>
> </html>

the string "-webkit-text-stroke-width" with " webkit-text-stroke-width", i.e. removing the leading hyphen.
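
The same edit can be sketched in Python. The `clipboard_html` string is a shortened stand-in for the payload quoted above, and `extract_fragment` is a hypothetical helper. Replacing the hyphen with a space (rather than deleting it) keeps the fragment length unchanged, which presumably matters when editing a saved clipboard file in place.

```python
import re

# Shortened stand-in for the Chrome clipboard HTML quoted above.
clipboard_html = (
    "<html>\n<body>\n"
    '<!--StartFragment--><span style="word-spacing: 0px; '
    "-webkit-text-stroke-width: 0px; text-decoration: underline; "
    'float: none;">Test</span><!--EndFragment-->\n'
    "</body>\n</html>"
)


def extract_fragment(html: str) -> str:
    """Return the text between the StartFragment and EndFragment markers."""
    match = re.search(r"<!--StartFragment-->(.*?)<!--EndFragment-->", html, re.DOTALL)
    if match is None:
        raise ValueError("fragment markers not found")
    return match.group(1)


fragment = extract_fragment(clipboard_html)
# Swap the leading hyphen for a space, as described above.
patched = fragment.replace("-webkit-text-stroke-width", " webkit-text-stroke-width")
print(patched)
```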
Comment 6 Konstantin Kharlamov 2020-05-06 20:59:49 UTC
(In reply to Mike Kaganski from comment #5)

Thank you for the extensive analysis!

Worth noting: a leading hyphen or underscore is allowed in CSS property names and is typically used for browser-specific extensions. So the LO CSS parser should not bail out upon seeing such syntax.
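
To illustrate the hypothesis, here is a toy model, not LO's actual parser (which is not shown in this report). It assumes a parser that stops at the first property name it rejects. Under that assumption, the Chrome payload loses the underline because the `-webkit-` property comes before `text-decoration`, while the Google Docs payload keeps it because the `-webkit-` property comes after. Both patterns and `parse_declarations` are hypothetical, and the style strings are abbreviated from the payloads in comment 2.

```python
import re

# Tolerant rule: names may start with a hyphen or underscore.
TOLERANT_NAME = re.compile(r"-{0,2}[A-Za-z_][A-Za-z0-9_-]*")
# Hypothetical overly strict rule: names must start with a letter.
STRICT_NAME = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def parse_declarations(style: str, name_pattern: re.Pattern) -> dict[str, str]:
    """Parse 'name: value;' pairs, giving up at the first rejected name."""
    result = {}
    for chunk in style.split(";"):
        if not chunk.strip():
            continue
        name, sep, value = chunk.partition(":")
        name = name.strip()
        if not sep or not name_pattern.fullmatch(name):
            break  # models a parser that bails out on unexpected syntax
        result[name.lower()] = value.strip()
    return result


# Abbreviated style attributes from the two payloads in comment 2.
chrome_style = (
    "word-spacing: 0px; -webkit-text-stroke-width: 0px; "
    "text-decoration: underline; float: none;"
)
gdocs_style = (
    "font-style:normal;text-decoration:underline;"
    "-webkit-text-decoration-skip:none;vertical-align:baseline;"
)

for label, style in [("Chrome", chrome_style), ("Google Docs", gdocs_style)]:
    strict = parse_declarations(style, STRICT_NAME)
    tolerant = parse_declarations(style, TOLERANT_NAME)
    print(label, "strict:", "text-decoration" in strict,
          "tolerant:", "text-decoration" in tolerant)
```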
Comment 7 Xisco Faulí 2020-05-11 10:07:32 UTC
(In reply to Konstantin Kharlamov from comment #6)

@Julien, I thought you might be interested in this issue
Comment 8 Julien Nabet 2020-05-11 11:41:19 UTC
Reading tdf#132770, it seems any change in this area may introduce a regression because of CSS.

So I can't help here and will un-CC myself.
Comment 9 Konstantin Kharlamov 2020-05-11 12:36:51 UTC
(In reply to Julien Nabet from comment #8)

While both issues mention CSS, I see no relation beyond that.

The current issue is about the CSS parser being overly strict about property names. I don't see how relaxing it to allow names that start with a hyphen or underscore could lead to a regression.

The issue you referred to is ultimately about the <ins> tag being parsed differently from the <u> tag. Changing it to be parsed like the <u> tag would indeed lead to a regression, due to the bug reported at tdf#132914.

So the issue you referred to is not about CSS. The issue at tdf#132914 seems to be about the HTML parser not applying CSS rules to some tags (CSS parsing per se works fine; the HTML parser just does not apply its result to some tags). I.e. that issue is not about the CSS parser either.