Created attachment 129559: Publisher 98 test files + PDFs. I have attached 2 PUB files with PDF versions for control purposes. The filter currently doesn't recognise any graphical elements, only text, which is badly placed and formatted because the necessary font substitution is not applied on import. Scribus does slightly better but doesn't find any graphical elements either.
I forgot: Both LO and Scribus add an empty page at the beginning of the imported doc, where in the original and the PDF there is none.
Let this bug report be about "pub98t1.pub" in the archive. Please open a new one for the other file. It's probably not worth further separating all the different bugs into different bug reports at this point, but at least let's have separate ones for the separate files. I can confirm there are many different issues with the file, tested with LibreOffice 5.3beta2 / Windows 7. I leave it to the person who makes the fixes to decide when it's worth closing the report and tracking the remaining issues separately.
Both files suffer from the same major problems: missing graphics elements and an added extra page at the beginning. The text layout is suffering from a lack of font substitution and probably missing support for the concept of text frames. Scribus 1.5.3svn gets the latter part right, but LibreOffice doesn't.
Confirming with:
Version: 5.4.0.0.alpha0+ Build ID: d538d3d84172a74dfe97d59a6d3daf9a45459cab CPU Threads: 4; OS Version: Windows 6.19; UI Render: default; TinderBox: Win-x86@39, Branch:master, Time: 2016-12-14_00:28:59 Locale: nl-NL (nl_NL); Calc: CL
and with:
Version: 4.4.6.3 Build ID: e8938fd3328e95dcf59dd64e7facd2c7d67c704d Locale: nl_NL
and with:
Version 4.0.0.3 (Build ID: 7545bee9c2a0782548772a21bc84a9dcc583b89)
To elaborate: Publisher 98 (and older) documents only appear to be supported. This is because the physical structure of a Publisher file (OLE2 container, name of the "main" file, format of records) hasn't changed since v.2 (probably since v.1), but the logical structure (i.e., where things like text, shapes etc. are stored) has changed several times. That means the parser can read the top-level structure of a document, but some (most) parts of it are lost, because they are not where the parser expects them.
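To illustrate the "physical structure" point: an OLE2 (compound file) container can be recognised by its fixed 8-byte signature, whatever the Publisher version inside. A minimal sketch of that check; the function name is hypothetical and is not taken from libmspub.

```python
# Well-known 8-byte signature found at the start of every OLE2 compound file.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole2(data: bytes) -> bool:
    """Return True if the buffer starts with the OLE2 signature.

    Hypothetical helper: this only identifies the container, it says
    nothing about where text or shapes live inside the document.
    """
    return data[:8] == OLE2_MAGIC
```

Passing this check only means the top-level container is readable; the logical layout inside still depends on the Publisher version.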
I have already put some patches to improve the reading of mspub 97 files in https://github.com/fosnola/libmspub and I am currently trying to improve the conversion of mspub 98 files...
Just a few notes:
- mspub v1 files are not stored in an OLE container,
- the "size" of the root/document block seems to be different in each version (v2: 5e, v3: 78, 97: 9e, 98: d2, 2000?: de); I must simplify the code, but I will use this information to differentiate the versions (as each version stores the styles differently, except v3 and 97, which seem to share the same code).
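The version detection described above could be sketched roughly as follows. This assumes the listed sizes are hexadecimal and that the root block size has already been read from the file; the table and function names are hypothetical, not libmspub code.

```python
from typing import Optional

# Root/document block size -> Publisher version, as listed in the comment.
# Sizes are assumed to be hexadecimal; the 2000 entry was marked uncertain.
ROOT_BLOCK_SIZE_TO_VERSION = {
    0x5E: "2",
    0x78: "3",
    0x9E: "97",
    0xD2: "98",
    0xDE: "2000?",
}

# v3 and 97 seem to share the same style code, so they map to one group.
STYLE_GROUP = {"2": "v2", "3": "v3/97", "97": "v3/97", "98": "98", "2000?": "2000?"}


def guess_version(root_block_size: int) -> Optional[str]:
    """Return the presumed version for a root block size, or None if unknown."""
    return ROOT_BLOCK_SIZE_TO_VERSION.get(root_block_size)


def style_group(version: str) -> Optional[str]:
    """Return which style-parsing variant a version would use, or None."""
    return STYLE_GROUP.get(version)
```

An unknown size returns None rather than a guess, so the caller can decide how to handle unrecognised files.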
Created attachment 143812: the current results. I have managed to retrieve most graphic elements, but many problems remain to be solved:
- no picture wrapping; in fact, this is very difficult to reproduce in Draw,
- linked text boxes are not retrieved in 98, but should be retrieved in 97,
- many text style problems, and only very basic table retrieval in 98; the conversion should be better in 97,
- ...
Note:
- I have not tried to improve the import of the Quill stream (which stores the text, the table content and their styles in 98 files), so some things can still be improved...