Bugzilla – Attachment 82038 Details for
Bug 66597
Problems with copying and extracting text from generated PDF
More thorough description of the problem.
Hindi-text-copy-PDF.txt (text/plain), 6.64 KB, created by Steve White on 2013-07-04 19:23:51 UTC
>Hi,
>
>LOWriter performs the best conversion of text to PDF, in the
>sense of making the text copyable from the PDF file, of any
>of the systems I've tried in Linux.
>
>In particular, in copying Hindi text, I know there are
>a lot of hoops to jump through, not the least being that the
>letters are often re-ordered by the font layout layer.
>
>However, in producing a PDF file containing Hindi text,
>there remain some problems.
> ----------------------------------------------------
>The tests I ran were the following.
>In a LOWriter doc, put several copies of the lines (Article 1 of the UDHR)
>सभी मनुष्यों को गौरव और अधिकारों के मामले में जन्मजात स्वतन्त्रता और समानता प्राप्त है ।
>उन्हें बुद्धि और अन्तरात्मा की देन प्राप्त है और परस्पर उन्हें भाईचारे के भाव से बर्ताव करना चाहिए ।
>Format each copy with a different font supporting Hindi.
>
>Next, "Export as PDF".
>Open the resulting file with Adobe Reader.
>
>Now select and copy the text from the PDF file,
>and paste it into a text editor.
>(Oh, let's hope the system's default encoding is UTF-8!)
>
>It should appear like the original text above.
> ----------------------------------------------------
>Note that I've done analogous tests starting from Firefox
>(using the CUPS PDF printer), and with XeLaTeX.
> ----------------------------------------------------
>Here I compared the distro fonts Lohit Hindi and Gargi, as well
>as GNU FreeSerif and GNU FreeSans (latest versions from SVN).
>This is LOWriter 4.0.2.2 on Ubuntu.
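Because the differences between the original and the pasted text are often a single combining sign, comparing code points is more reliable than comparing rendered glyphs. The following is a minimal sketch of such a comparison using only the Python standard library; the function names `describe` and `codepoint_diff` are hypothetical helpers introduced here, not part of the report or of any tool it mentions.

```python
import difflib
import unicodedata


def describe(text):
    """Return 'U+XXXX NAME' for each code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}" for ch in text]


def codepoint_diff(original, extracted):
    """List the code-point-level edits that turn original into extracted."""
    matcher = difflib.SequenceMatcher(a=original, b=extracted, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((tag, describe(original[i1:i2]), describe(extracted[j1:j2])))
    return edits


# Example: MA + vowel sign E + anusvara, versus the same word pasted
# from a PDF with the anusvara (U+0902) missing.
original = "\u092e\u0947\u0902"
extracted = "\u092e\u0947"
for edit in codepoint_diff(original, extracted):
    print(edit)
```

Running the same comparison over each pasted line would show exactly which signs were dropped, duplicated, or moved for a given font.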
>
>Lohit Hindi
>सभी मनुष्यों को गौरव और अधिधिकारों के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है ।
>उन्हे बुि �द्धि और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बतार्ताव करना चाि �हए ।
>FreeSerif
>सभी मनुष्यो को गौरव और अधिधिकारो के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है ।
>उन्हे बुिद और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बतारव करना चािहए ।
>FreeSans
>सभी मनुष्यों को गौरव और अधिधिकारों के मामले में जन्मजात स्वतन्त्रता और समानता प्ाप्त है ।
>उन्हें बुिद्धि और अधन्तरात्मा की देन प्ाप्त है और परस्पर उन्हें भाईचारे के भाव से बताव करना चािहए ।
>Gargi
>सभी मनुष्यो को गौरव और अधिधिकारो के मामले मे जन्मजात स्वतन्त्रता और समानता प्राप्त है ।
>उन्हे बुिद और अधन्तरात्मा की देन प्राप्त है और परस्पर उन्हे भाईचारे के भाव से बताव करना चािहए ।
>
>They're *close*. I'm very impressed that it re-orders the vowel signs!!!
>Nobody else does that!!!
>
>
>1) All tests show a duplication of letters in the word
> अधिकारों
>producing
> अधिधिकारों
>
>2) The anusvara--the dot on top in में--is lost in most cases.
>But for this one word, in FreeSans, it isn't lost, and in another,
>it is lost in FreeSerif and Gargi, but not in Lohit or FreeSans!
>What's the pattern???
>Gargi has a ligature of 094B+0902; it's named uni0972 and is not in the PUA, which is just a pity.
>FreeSerif also has a ligature, dev_o_anusvara.abvs.
>FreeSans does not use this... it relies on mark placement.
>Lohit names the same glyph u094B_u0902.abvs.
>I modified the FreeSerif font, placing the anusvara using GPOS positioning.
>Then the anusvara reappeared in the test.
>But something is wrong here.
> Guess: maybe for these 'abvs' replacements, the AGLFN is being used?
> Otherwise, for unknown reasons, the anusvara isn't being extracted
> properly from the ligature glyphs.
>
>3) The word बुद्धि is screwed up in every case but in different ways.
>
> The last glyph is a ligature, da-dha. The ligature is decomposed
> into consonants, but the dha is lost. Then it has trouble re-ordering
> the vowel sign.
> (it should be repositioned after the decomposed consonants)
>
>4) Another strange duplication in Lohit only: बतार्ताव
>
>5) FreeSerif, Sans, Gargi: the reph in बर्ताव should be transformed back to ra-virama,
>but instead it's lost entirely: बताव
> This information should be obtained from the font's 'rphf' feature.
> All of the fonts have it.
> The only difference I see is: Lohit uses an AGLFN name.
>
>6) Weird: अन्तरात्मा gets a ध (dha) in every case! अधन्तरात्मा
> Could this be the dha lost in (3)?
> ----------------------------------------------------
>My take on this:
>
>Clearly your code uses the font's feature tables to construct the PDF file's ToUnicode.
>This is good and right, although hard to do right (and in fact, since the tables
>constitute a many-to-many mapping of glyphs to character strings and ToUnicode is many-to-one, it's impossible to do perfectly).
>
>I suspect you are using the AGLFN. If that's so, please reconsider.
>In the presence of OpenType tables, glyph names should be ignored.
>The glyph names can at best duplicate what's in the tables. Otherwise
>following the AGLFN *only* causes problems.
>Oh, hell, maybe you could use it to break a tie, when the feature tables
>map the same glyph to two different character strings.
>But please, usually, ignore the AGLFN. It is not a good thing.
>
>I think this is only a few bugs away from working adequately.
>
>Further things to consider:
>1) the rearrangement of vowels really is necessary.
>2) the OpenType standards for Indic scripts changed in 2005,
> introducing new script tags such as 'dev2' to replace 'deva',
> in which the behavior of several features changed; notably,
> the order of inputs to 'akhn', 'abvf', 'blwf' is altered.
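The many-to-one limitation of ToUnicode described above can be illustrated with a small sketch. It inverts a list of substitution rules (character string to glyph) into a glyph-to-string map and records the glyphs that more than one string can produce. Everything here is an assumption for illustration: the rule list is hand-written rather than read from any real font, glyph names other than `dev_o_anusvara.abvs` are hypothetical, and the "first rule wins" tie-break is an arbitrary policy, not what LibreOffice does.

```python
from collections import defaultdict

# Hypothetical substitution rules: input character string -> output glyph name.
# Only dev_o_anusvara.abvs is a name taken from the report; the rest are made up.
RULES = [
    ("\u092f", "dev_ya"),                       # letter YA
    ("\u094b", "dev_o"),                        # vowel sign O alone
    ("\u094b\u0902", "dev_o_anusvara.abvs"),    # O + anusvara ligature
    ("\u0958", "dev_qa"),                       # precomposed QA
    ("\u0915\u093c", "dev_qa"),                 # KA + nukta, same glyph
]


def build_to_unicode(rules):
    """Invert the rules into glyph -> string, collecting ambiguous glyphs."""
    candidates = defaultdict(list)
    for chars, glyph in rules:
        if chars not in candidates[glyph]:
            candidates[glyph].append(chars)
    # ToUnicode can hold one string per glyph: keep the first (assumed policy).
    to_unicode = {glyph: strings[0] for glyph, strings in candidates.items()}
    conflicts = {g: s for g, s in candidates.items() if len(s) > 1}
    return to_unicode, conflicts


def extract_text(glyphs, to_unicode):
    """Map a glyph run back to text; unknown glyphs become U+FFFD."""
    return "".join(to_unicode.get(g, "\ufffd") for g in glyphs)


to_unicode, conflicts = build_to_unicode(RULES)
# The ligature glyph carries both code points, so the anusvara survives.
print(extract_text(["dev_ya", "dev_o_anusvara.abvs"], to_unicode))
print(conflicts)
```

The `conflicts` entries are the cases where some tie-break is unavoidable; a glyph name could serve as that tie-break, as suggested above, without being consulted anywhere else.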