Description:
The XLS format has a maximum record length of 8224 bytes. The maximum string length is 32767 characters (a character whose UTF-16 representation requires a surrogate pair counts as two characters). Consequently, long strings must be split across multiple records using "continue records" (https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/999fae21-d3d9-42e8-8290-639782460c67).

Strings are represented as "XLUnicodeRichExtendedString" objects (https://docs.microsoft.com/en-us/openspecs/office_file_formats/ms-xls/173d9f51-e5d3-43da-8de2-be7f22e119b9). They may use either narrow (8-bit) or wide (UTF-16LE) characters; a flag indicates which encoding a particular string uses. For whatever reason (blame some nameless dev in the 1990s), the flag is repeated in each continue record. Consequently, it is valid for a string to start off using narrow characters and be continued by a wide-character block. Yes, this is perverse.

In order to test some other software that parses XLS, I used Excel to create an XLS file with a 32767-character narrow-character string ("aaa....aaa"), then opened it in an OLE compound document hex editor ("Compound File Explorer", though the tool that you use should not matter). My string was split across four records, as expected (in the "Workbook" OLE stream). I changed the narrow/wide character flag byte to 0x01 (indicating wide-character data) on the 2nd and 4th blocks. Since XLS uses UTF-16 for wide characters, each pair of 0x61 bytes in those blocks is now read as U+6161, which changes the string to "aaa...aaa慡慡慡...慡慡慡aaa...aaa慡慡慡...慡慡慡".

However, I did *not* update the string length. Since those two blocks now hold wide characters but I did not add any additional data, the string should be shorter than the declared length. This makes the document invalid. Excel goes into recovery mode when trying to load it.
However, Calc loads the following string:

aaa...aaa慡慡慡...慡慡慡aaa...aaa慡慡慡...慡慡慡一浡ե?慖畬ť?ɡ?慡愀慡愀慡ա?慡慡ୡ?敄捳楲瑰潩੮?桓牯⁴慮敭 䰀湯慮敭䄀瑬牥慮整搠獥牣灩楴湯?潓敭桴湩

Copying the extraneous data into a text file, saving it as UTF-16LE and opening it in a hex editor reveals 0x76 bytes of file data following the end of the last string block:

04 00 00 4E 61 6D 65 05 3F 00 56 61 6C 75 65 01
3F 00 61 02 3F 00 61 61 03 00 00 61 61 61 04 00
00 61 61 61 61 05 3F 00 61 61 61 61 61 0B 3F 00
44 65 73 63 72 69 70 74 69 6F 6E 0A 3F 00 53 68
6F 72 74 20 6E 61 6D 65 09 00 00 4C 6F 6E 67 20
6E 61 6D 65 15 00 00 41 6C 74 65 72 6E 61 74 65
20 64 65 73 63 72 69 70 74 69 6F 6E 3F 00 53 6F
6D 65 74 68 69 6E

I didn't try debugging into Calc to see where/how it got this data. There might be security implications depending on how/where the over-read occurs.

I created a second version of the XLS file in which I corrected the string length. Calc appeared to handle that file correctly.

I tested this using release 7.3.1.3 on Windows 10 amd64. I expect that the same will occur on other platforms and versions since XLS is a rather old format.

Steps to Reproduce:
1. Create a malformed XLS file as described above
2. Open in Calc

Actual Results:
Over-read file data is displayed in the document as described above

Expected Results:
No over-read file data should appear.

Reproducible: Always

User Profile Reset: No

Additional Info:
Version: 7.3.1.3 (x64) / LibreOffice Community
Build ID: a69ca51ded25f3eefd52d7bf9a5fad8c90b87951
CPU threads: 2; OS: Windows 10.0 Build 19042; UI render: Skia/Raster; VCL: win
Locale: en-US (en_US); UI: en-US
Calc: threaded
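For illustration, here is a minimal Python sketch of the kind of bounds check that would reject the malformed file instead of reading past the last string block. It is not taken from Calc's importer. The names `read_split_string` and `TruncatedString`, the `(flag, data)` block representation, and the tiny 4-byte blocks are all hypothetical and chosen only to keep the example short. Real records also carry headers and optional rich-text/phonetic data that are ignored here.

```python
from typing import List, Tuple

# Hypothetical model of one string block: (flag, raw character bytes).
# flag 0 = narrow (1 byte per character), flag 1 = wide (UTF-16LE).
Block = Tuple[int, bytes]


class TruncatedString(ValueError):
    """Raised when the declared length exceeds the available character data."""


def read_split_string(declared_chars: int, blocks: List[Block]) -> str:
    """Decode a string split across blocks, honoring each block's own flag."""
    parts = []
    remaining = declared_chars
    for flag, data in blocks:
        if remaining == 0:
            break
        width = 2 if flag & 0x01 else 1
        available = len(data) // width      # whole characters in this block
        take = min(remaining, available)    # never read past the block
        raw = data[: take * width]
        parts.append(raw.decode("utf-16-le" if width == 2 else "latin-1"))
        remaining -= take
    if remaining:
        # The blocks ran out before the declared length was satisfied.
        raise TruncatedString(f"{remaining} characters missing")
    return "".join(parts)


# Four narrow blocks of "a", as in the original (valid) file, scaled down.
narrow_blocks = [(0, b"aaaa")] * 4
# Same bytes, but blocks 2 and 4 are flagged as wide, as in the malformed file.
mixed_blocks = [(0, b"aaaa"), (1, b"aaaa"), (0, b"aaaa"), (1, b"aaaa")]

valid = read_split_string(16, narrow_blocks)
corrected = read_split_string(12, mixed_blocks)   # length fixed: 4 + 2 + 4 + 2

try:
    read_split_string(16, mixed_blocks)           # length left unchanged
    rejected = False
except TruncatedString:
    rejected = True
```

With the flags flipped and the length left at 16, only 12 characters are available, so the sketch raises an error rather than continuing into whatever bytes follow the last block. Whether Calc should reject the file or truncate the string is a separate design decision.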
Created attachment 178786 [details] File that reproduces the bug
Created attachment 178787 [details] XLS file with mixed string, corrected length
Use attachment 178786 [details] to reproduce the bug. Attachment 178787 [details] is a version of the file with the string length corrected; Calc appears to handle it correctly.
Created attachment 178788 [details] String block 4 header This screen capture from my OLE hex editor shows the beginning of string block 4. The selected byte is the narrow/wide character flag. 0 indicates narrow character data, 1 indicates wide.
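To make the effect of that flag byte concrete, here is a small Python sketch that decodes the same bytes both ways. The helper name `decode_block` is hypothetical, and narrow data is treated as Latin-1 on the assumption that each byte is the low byte of a UTF-16 code unit whose high byte is zero.

```python
def decode_block(data: bytes, flag: int) -> str:
    """Decode character data according to the narrow/wide flag byte."""
    if flag == 0:
        # Narrow: one byte per character.
        return data.decode("latin-1")
    # Wide: two bytes per character, UTF-16LE.
    return data.decode("utf-16-le")


block = b"a" * 8                      # eight 0x61 bytes, as stored in the file

as_narrow = decode_block(block, 0)    # eight characters: "aaaaaaaa"
as_wide = decode_block(block, 1)      # four characters, each U+6161

print(as_narrow, len(as_narrow))
print(as_wide, len(as_wide))
```

The same bytes yield half as many characters once the flag is set to 1, which is why the declared string length no longer matches the data in the modified file.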
Created attachment 178789 [details] String block 4 end This screen capture from my OLE hex editor shows the end of string block 4 with the additional file data that Calc loads as part of the string.
Created attachment 178790 [details] Bug in Calc This screen capture of Calc shows the end of the string that it loads with the extraneous data.
Also confirmed on:
Version: 6.4.7.2
Build ID: 1:6.4.7-0ubuntu0.20.04.2
CPU threads: 2; OS: Linux 5.4; UI render: default; VCL: kf5;
Locale: en-US (en_US.UTF-8); UI-Language: en-US
Calc: threaded

For comparison, Gnumeric 1.12.46 loads the file without displaying an error to the user, but it appears to fail to load the file's string table and prints a couple of warning messages to the console.
Also confirmed on the oldest release that I had installed on an old VM: Version: 5.1.6.2.0+ Build ID: 5.1.6.2-8.fc24 CPU Threads: 1; OS Version: Linux 4.11; UI Render: default; Local: en-US (en_US.UTF-8); Calc: group Apache OpenOffice 4.1.11 has the same problem. This bug is probably very old.
Confirmed Arch Linux 64-bit, X11 Version: 7.6.0.0.alpha0+ (X86_64) / LibreOffice Community Build ID: 8389048cb41291917449e87b2901d6133bce3373 CPU threads: 8; OS: Linux 6.0; UI render: default; VCL: kf5 (cairo+xcb) Locale: fi-FI (fi_FI.UTF-8); UI: en-US Calc: threaded Jumbo Built on 21 December 2022