Bug 59728 - Python 3 loads utf-8 encoded files as CP-1252 encoded files
Summary: Python 3 loads utf-8 encoded files as CP-1252 encoded files
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Extensions
Version: 4.0.0.1 rc (earliest affected)
Hardware: Other Windows (All)
Importance: medium normal
Assignee: Not Assigned
URL:
Whiteboard: target:4.1.0 target:4.0.0
Keywords:
Depends on:
Blocks:
 
Reported: 2013-01-22 18:20 UTC by Olivier R.
Modified: 2013-01-28 15:43 UTC

See Also:
Crash report or crash signature:


Attachments
Badly displayed non-ASCII characters (36.15 KB, image/png)
2013-01-22 18:22 UTC, Olivier R.

Description Olivier R. 2013-01-22 18:20:33 UTC
On Windows, LO 4 is unable to install the attached extension (written in Python).

Here is the error I get when installing it:

(com.sun.star.uno.RuntimeException) { { Message = "<class 'UnicodeDecodeError'>: 'charmap' codec can't decode byte 0x9d in position 3782: character maps to <undefined>, traceback follows\X000a  C:\\Program Files (x86)\\LOdev 4.0\\program\\python-core-3.3.0\\lib\\encodings\\cp1252.py:23 in function decode() [return codecs.charmap_decode(input,self.errors,decoding_table)[0]]\X000a  C:\\Program Files (x86)\\LOdev 4.0\\program\\pythonloader.py:94 in function getModuleFromUrl() [src = fileHandle.read().replace(\"\\r\",\"\")]\X000a  C:\\Program Files (x86)\\LOdev 4.0\\program\\pythonloader.py:146 in function writeRegistryInfo() [mod = self.getModuleFromUrl( locationUrl )]\X000a\X000a", Context = (com.sun.star.uno.XInterface) @0 } } 

This error is due to the characters “” in some strings of the code in DictionarySwitcher.py.

If these characters are removed, the extension can be installed, but non-ASCII characters are not treated properly and are badly displayed. See the picture attached.

On Windows, pythonloader.py does not load the file as UTF-8 but as CP-1252.
Linux is not affected by this issue.
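
The failure mode described above can be reproduced outside LibreOffice. The following is a minimal sketch, assuming the offending character is U+201D (the right double quotation mark, whose UTF-8 encoding ends in byte 0x9d); the helper name `try_decode` is made up for illustration and is not part of pythonloader.py.

```python
# U+201D (right double quotation mark) is the byte sequence E2 80 9D in UTF-8.
data = u"\u201d".encode("utf-8")


def try_decode(raw, encoding):
    """Return the decoded text, or a short error string if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: %s" % exc.reason


# Byte 0x9d has no mapping in Python's cp1252 codec, so this decode fails.
result_cp1252 = try_decode(data, "cp1252")
# The same bytes decode cleanly as UTF-8.
result_utf8 = try_decode(data, "utf-8")
# Bytes that do have a cp1252 mapping decode "successfully" but wrongly,
# which is how the garbled display arises (e-acute becomes two characters).
mojibake = u"\u00e9".encode("utf-8").decode("cp1252")

print(result_cp1252)
print(repr(result_utf8))
print(repr(mojibake))
```

This would explain both symptoms: a hard error when a byte is undefined in CP-1252, and garbled text when every byte happens to be defined.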


Note: this extension installs the 4 French dictionaries and offers an interface to switch between them, in Menu Tools > Language > French spelling dictionaries…
Comment 1 Olivier R. 2013-01-22 18:22:15 UTC
Created attachment 73467
Badly displayed non-ASCII characters
Comment 2 Olivier R. 2013-01-22 18:29:17 UTC
Extensions are too big to be uploaded on this website, so here it is:

- the extension which cannot be installed:
http://dicollecte.org/_misc/lo-oo-ressources-linguistiques-fr-v4.9-buggy.oxt

- the extension that can be installed, but non-ASCII characters are badly displayed:
http://dicollecte.org/_misc/lo-oo-ressources-linguistiques-fr-v4.9-wrongdisplay.oxt
(Go to Menu Tools > Language > French spelling dictionaries…)
Comment 3 Urmas 2013-01-22 19:33:54 UTC
Confirming.
Comment 4 Stephan Bergmann 2013-01-23 08:01:14 UTC
See the corresponding mail thread starting at <http://lists.freedesktop.org/archives/libreoffice/2013-January/044413.html> "Python extension issue with LO 4 on Windows" for further details.

What's presumably needed is something like changing line 93 of pyuno/source/loader/pythonloader.py from

  fileHandle = open( filename )

to

  fileHandle = open( filename, encoding="utf_8" )

but that would probably break the compatibility with Python 2 we still try to abide by.
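
One way to pass an explicit encoding without giving up Python 2 could be `io.open`, which accepts `encoding=` on Python 2.6+ as well as on Python 3. The sketch below is only an illustration of that idea, not the patch that was eventually committed; `read_source` is a hypothetical helper name.

```python
import io
import os
import tempfile


def read_source(filename):
    """Read a Python source file as UTF-8 on both Python 2.6+ and Python 3."""
    # io.open takes an encoding argument on both major versions, unlike the
    # Python 2 builtin open().
    with io.open(filename, encoding="utf-8") as handle:
        # Mirror the loader's existing carriage-return stripping.
        return handle.read().replace(u"\r", u"")


# Small demonstration with a temporary file containing a non-ASCII string.
fd, demo_path = tempfile.mkstemp(suffix=".py")
try:
    with os.fdopen(fd, "wb") as out:
        out.write(u'label = "\u201cfran\u00e7ais\u201d"\r\n'.encode("utf-8"))
    demo_text = read_source(demo_path)
finally:
    os.remove(demo_path)

print(repr(demo_text))
```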
Comment 5 Urmas 2013-01-23 10:20:08 UTC
We should just use BOM or coding keyword to define the source file encoding, as described here:

http://www.python.org/dev/peps/pep-0263/
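
For reference, Python 3's standard library already implements the PEP 263 rules in `tokenize.detect_encoding`. A rough sketch of how a loader might use it follows; it assumes Python 3 only, and the function names `detect_source_encoding` and `decode_source` are invented for this example.

```python
import io
import tokenize


def detect_source_encoding(raw_bytes):
    """Return the encoding declared by a BOM or coding cookie (PEP 263)."""
    # detect_encoding reads at most two lines through the readline callable
    # and falls back to UTF-8 when nothing is declared.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(raw_bytes).readline)
    return encoding


def decode_source(raw_bytes):
    """Decode source bytes using the detected encoding."""
    return raw_bytes.decode(detect_source_encoding(raw_bytes))


plain = b"x = 1\n"
with_bom = b"\xef\xbb\xbfx = 1\n"
with_cookie = b"# -*- coding: latin-1 -*-\nx = 1\n"

print(detect_source_encoding(plain))
print(detect_source_encoding(with_bom))
print(detect_source_encoding(with_cookie))
```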
Comment 6 Olivier R. 2013-01-23 10:54:52 UTC
(In reply to comment #4)

>   fileHandle = open( filename, encoding="utf_8" )
> 
> but that would probably break the compatibility with Python 2 we still try
> to abide by.

I tried. It works.
Well, the compatibility is already broken. ;) It may not be a good long-term solution, but it looks better than opening files in CP-1252 encoding.


I also tried without specifying any encoding and using the BOM in the source code. Then I got the error:

(com.sun.star.uno.RuntimeException) { { Message = "<class 'SyntaxError'>: invalid character in identifier (DictionarySwitcher.py, line 1), traceback follows\X000a  C:\\Program Files (x86)\\LOdev 4.0\\program\\pythonloader.py:100 in function getModuleFromUrl() [codeobject = compile( src, encfile(filename), \"exec\" )]\X000a  C:\\Program Files (x86)\\LOdev 4.0\\program\\pythonloader.py:147 in function writeRegistryInfo() [mod = self.getModuleFromUrl( locationUrl )]\X000a\X000a", Context = (com.sun.star.uno.XInterface) @0 } }
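
One plausible reading of this SyntaxError (an assumption, not confirmed in this report) is that the three BOM bytes were themselves decoded as CP-1252 and ended up as stray characters at the start of line 1. A minimal sketch of that effect, with `compiles` as a hypothetical helper:

```python
import codecs

# A tiny source file that starts with the UTF-8 byte order mark.
bom_source = codecs.BOM_UTF8 + b"x = 1\n"

# Decoded as cp1252, the BOM bytes EF BB BF turn into three visible
# characters in front of the first statement.
as_cp1252 = bom_source.decode("cp1252")
# The utf-8-sig codec recognises the BOM and drops it.
as_utf8_sig = bom_source.decode("utf-8-sig")


def compiles(source):
    """Return True if the source text compiles, False on SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


print(repr(as_cp1252))
print(compiles(as_cp1252))
print(compiles(as_utf8_sig))
```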
Comment 7 Not Assigned 2013-01-23 17:04:45 UTC
Stephan Bergmann committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=c2445b03f4d27bbd7e14c4322704ce89b582839b

fdo#59728: Fix encoding of .py files as UTF-8 for Python 3



The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.
Comment 8 Stephan Bergmann 2013-01-23 17:12:31 UTC
On Windows 7, I could not reproduce the failure to install <http://extensions.libreoffice.org/extension-center/dictionnaires-francais/releases/4.9/lo-oo-ressources-linguistiques-fr-v4.9.oxt>, but once installed, in Writer, "Tools - Language - French spelling dictionaries..." opened an "Orthographe française" dialog with all the non-ASCII characters garbled, and the fix from comment 7 fixes that.  Requested backport to libreoffice-4-0 (and subsequently libreoffice-4-0-0, too) as <https://gerrit.libreoffice.org/#/c/1829/>.  Thanks to mstahl for the "if sys.version >= '3':" idiom.

(In reply to comment #5)
> We should just use BOM or coding keyword to define the source file encoding

I would hope that just supporting UTF-8 is enough these days.  But if trouble remains, we could indeed go in that direction.
Comment 9 Olivier R. 2013-01-23 17:33:18 UTC
(In reply to comment #8)
> On Windows 7, I could not reproduce the failure to install
> <http://extensions.libreoffice.org/extension-center/dictionnaires-francais/
> releases/4.9/lo-oo-ressources-linguistiques-fr-v4.9.oxt>

That is because I removed the specific characters that caused the installation to fail.
Comment 10 Not Assigned 2013-01-23 17:40:55 UTC
Stephan Bergmann committed a patch related to this issue.
It has been pushed to "libreoffice-4-0":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=85e7a0f6cd9b311e6734e747b03ad0a736ff6dbd&h=libreoffice-4-0

fdo#59728: Fix encoding of .py files as UTF-8 for Python 3


It will be available in LibreOffice 4.0.1.

Comment 11 Stephan Bergmann 2013-01-24 07:48:10 UTC
requested backport to libreoffice-4-0-0 as <https://gerrit.libreoffice.org/#/c/1836/>
Comment 12 Not Assigned 2013-01-28 15:43:18 UTC
Stephan Bergmann committed a patch related to this issue.
It has been pushed to "libreoffice-4-0-0":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=355f644fb1d987947d63f2aa7e6a8f59d8337324&h=libreoffice-4-0-0

fdo#59728: Fix encoding of .py files as UTF-8 for Python 3


It will already be available in LibreOffice 4.0.0.
