Bug 97439 - Update (merge) Autocorrect pt_PT to 2016-01-29
Summary: Update (merge) Autocorrect pt_PT to 2016-01-29
Status: RESOLVED FIXED
Alias: None
Product: LibreOffice
Classification: Unclassified
Component: Linguistic
Version (earliest affected): 5.1.0.0.alpha0+ Master
Hardware: All All
Importance: medium normal
Assignee: Marco A.G.Pinto
URL:
Whiteboard: target:5.2.0 target:5.1.1 target:5.3.0
Keywords:
Depends on:
Blocks: AutoCorrect-Complete
Reported: 2016-01-29 16:31 UTC by Marco A.G.Pinto
Modified: 2017-06-25 01:00 UTC (History)
CC List: 8 users

See Also:
Crash report or crash signature:


Attachments
LibreOffice 5.2 - AutoCorrection pt-PT (32.42 KB, application/zip)
2016-08-12 00:01 UTC, Tiago Santos
Spreadsheet template attached. Possibly useful for future reference or use. (197.51 KB, application/vnd.oasis.opendocument.spreadsheet)
2016-08-13 13:34 UTC, Tiago Santos
Template spreadsheet for future use or reference (171.68 KB, application/vnd.oasis.opendocument.spreadsheet)
2016-08-13 14:26 UTC, Tiago Santos
Updated patch with the changes introduced by Marco in 5.2.1 (35.15 KB, application/zip)
2016-08-22 16:51 UTC, Tiago Santos
Updated patch with the changes introduced by Marco in 5.2.1 (35.15 KB, application/zip)
2016-08-22 16:53 UTC, Tiago Santos
DocumentList.xml diff from 5.2.1 version and this patch (410.80 KB, text/xml)
2016-08-22 17:00 UTC, Tiago Santos
Processing spreadsheet v2 (428.82 KB, application/vnd.oasis.opendocument.spreadsheet)
2016-08-22 17:04 UTC, Tiago Santos

Description Marco A.G.Pinto 2016-01-29 16:31:09 UTC
Hello!

I have grabbed the AOO autocorrect XML from 2010 and added around 168 new words to it:
https://bz.apache.org/ooo/show_bug.cgi?id=126815

It will be shipped with AOO 4.2.0 to be released in February.

I notice that the LO version has 1000+ new strings but they are emoticons and the rest seems to be the 2010 version.

I was wondering if you could merge my DocumentList.xml into LO's one, taking care to remove the duplicate words.
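
A minimal sketch of such a merge, in Python. It assumes that DocumentList.xml uses the `http://openoffice.org/2001/block-list` namespace with one `block` element per entry, keyed by its `abbreviated-name` attribute; the function name `merge_document_lists` is hypothetical and this is not the procedure actually used for the patch.

```python
import xml.etree.ElementTree as ET

# Assumption: DocumentList.xml uses this namespace, with one
# <block-list:block> element per autocorrect entry.
NS = "http://openoffice.org/2001/block-list"
ET.register_namespace("block-list", NS)

BLOCK = f"{{{NS}}}block"
ABBR = f"{{{NS}}}abbreviated-name"  # the text the user types
NAME = f"{{{NS}}}name"              # the replacement text


def merge_document_lists(base_xml: str, extra_xml: str) -> str:
    """Return base_xml plus every entry of extra_xml whose key is new."""
    base = ET.fromstring(base_xml)
    extra = ET.fromstring(extra_xml)
    seen = {block.get(ABBR) for block in base.iter(BLOCK)}
    for block in extra.iter(BLOCK):
        key = block.get(ABBR)
        if key not in seen:  # duplicate keys keep the entry from base_xml
            base.append(block)
            seen.add(key)
    return ET.tostring(base, encoding="unicode")
```

When both files define the same key, the entry from the first file wins; a reviewer may prefer to list such conflicts instead of resolving them silently.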

PS-> I tried to add the Portuguese team to Cc but I got an error saying the e-mails aren't recognised.

Thanks!

Kind regards,
     >Marco A.G.Pinto
      ---------------
Comment 1 Commit Notification 2016-02-02 10:59:57 UTC
Marco A. G. Pinto committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=33003d0fed4a5aaef4b631fcc3c0941f0eca34c9

tdf#97439 Enhance pt-PT autocorrect file

It will be available in 5.2.0.

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds

Affected users are encouraged to test the fix and report feedback.
Comment 2 Commit Notification 2016-02-02 19:29:34 UTC
Marco A. G. Pinto committed a patch related to this issue.
It has been pushed to "libreoffice-5-1":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=21f66a09625c207adab459287df5a84b00be017d&h=libreoffice-5-1

tdf#97439 Enhance pt-PT autocorrect file

It will be available in 5.1.1.

Comment 3 Tiago Santos 2016-08-12 00:01:22 UTC
Created attachment 126766
LibreOffice 5.2 - AutoCorrection pt-PT

Good evening,

I am submitting an update to the Portuguese (Portugal) Auto Correction files. 
The files changed in the package are:

DocumentList.xml
SentenceExceptList.xml
WordExceptList.xml

This was made by adding the changes from the Brazilian Portuguese AutoCorrection files (many more corrected strings, better symbol support and fuller exception lists). 
The merge was semi-automated after correcting to Portugal Portuguese or eliminating the strings that didn't make sense in this language variant.
The file can be tested on Linux by placing it in /home/user/.config/libreoffice/4/user/autocorr
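
The test installation described above can be sketched in Python as follows. The helper name `install_for_testing` is hypothetical; the profile path is the one quoted in this comment and may differ on other systems. Any file that is already in place is backed up first.

```python
import shutil
from pathlib import Path


def install_for_testing(dat_file: Path, profile_dir: Path) -> Path:
    """Copy an autocorrect file into <profile_dir>/autocorr, backing up any existing copy."""
    target_dir = profile_dir / "autocorr"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / dat_file.name
    if target.exists():
        # Keep the previous file as <name>.bak so the test is easy to undo.
        shutil.copy2(target, target.with_name(target.name + ".bak"))
    shutil.copy2(dat_file, target)
    return target
```

For the path mentioned above, the second argument would be `Path.home() / ".config/libreoffice/4/user"`.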

Do not hesitate to contact me if anyone requires assistance repeating the same procedure for other language variants.
Hope this patch is helpful to others.

Best regards.
Comment 4 jani 2016-08-12 11:17:57 UTC
Added to gerrit:
https://gerrit.libreoffice.org/28078
Comment 5 Tiago Santos 2016-08-13 13:34:59 UTC
Created attachment 126782
Spreadsheet template attached. Possibly useful for future reference or use.
Comment 6 Tiago Santos 2016-08-13 14:26:25 UTC
Created attachment 126783
Template spreadsheet for future use or reference

Simplified template spreadsheet for future use or reference. Former attachment set as obsolete.
Comment 7 Commit Notification 2016-08-13 14:53:10 UTC
Tiago Santos committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=5ad876c7d801466bca3864c625da46c3ff313d1e

tdf#97439 autocorrect pt-PT

It will be available in 5.3.0.

Comment 8 Marco A.G.Pinto 2016-08-13 22:05:52 UTC
Damn... your Calc file doesn't help much...

What we needed was similar to this:
2016-07-11
AUTOCORRECT MOST RECENT:
[19:03] <@erAck> marcoagpinto: whatever, the relevant file is https://gerrit.libreoffice.org/gitweb?p=core.git;a=blob_plain;f=extras/source/autocorr/lang/pt/DocumentList.xml;hb=HEAD

Uploading the whole wordlist in one line is a bad idea and doesn't allow us to check it.

You could also have read my site, where I explain how to put each word on its own line using Notepad++:
http://marcoagpinto.cidadevirtual.pt/proofingtoolgui.html

The simplest way would be to take the file from Gerrit just like erAck suggested to me above.
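
For reference, splitting a single-line word list into one entry per line can also be scripted. This is a minimal Python sketch, not the Notepad++ procedure from the site above; it assumes the file contains no comments or CDATA sections in which `><` could appear, and the function name `one_entry_per_line` is hypothetical.

```python
import re


def one_entry_per_line(xml_text: str) -> str:
    """Insert a newline between adjacent tags so each element sits on its own line."""
    # A literal '<' cannot occur inside attribute values in well-formed XML,
    # so '>' followed by '<' always marks a boundary between two tags.
    return re.sub(r">\s*<", ">\n<", xml_text.strip()) + "\n"
```

The result is still the same XML document, but a line-based diff or a review tool can now show each entry separately.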
Comment 9 Marco A.G.Pinto 2016-08-13 22:12:10 UTC
Guys?!

I will talk with Arón on Monday again about how to "refresh" the Gerrit repository on my HDD.

I can then have access to the most up-to-date autocorrect pt_PT, right?

Another way would be to download the file just like ErAck suggested, as pasted in my previous comment.

On Monday I will take a look at it.

Then, I will open the file in my Hunspell tool (Proofing Tool GUI) and see if improvements/changes need to be done to the present wordlist!

Thanks!

Kind regards from your brother and friend,
       >Marco A.G.Pinto
        ---------------
Comment 10 Marco A.G.Pinto 2016-08-13 23:07:48 UTC
@Tiago,

Sorry for being too hard on you... I just had a very tough day at work, and another situation regarding L10n at Mozilla made me very angry.

I know you have tried your best!

Thanks!
Comment 11 Tiago Santos 2016-08-14 15:20:14 UTC
@Marco

I do not want to be the guy to tell you how you should spend your time, but:

- It is very simple and fast to make the file conform to your standards via replacements. And you actually have a tutorial to guide you, right?
- It is faster to do the changes yourself than to complain to others about it. That's why I did it in the first place!
- Though simple batch manipulations are enough for this task, after reading your comment on Gerrit, I checked your blog. 
     BUG REPORT: The tool introduces destructive changes on many tools when creating and manipulating a thesaurus. One of the reasons is that the sorting mechanism places shorter words at the end. For example: 'recrutar', 'recrutamento' and 'recruta' are sorted in this order. MyThes is unable to read the last word (in this example 'recruta') due to this. As such, .idx files are also buggy.
- As in LibreOffice, this submission is only for those who are interested in it. And it works. I am using it and Jan Iversen also managed to use it before committing to the repos. Anyway, like LibreOffice, this "patch" is volunteer work and is served as is. No guarantees and absolutely no promises of extra work (though I don't mind doing it in good cooperation).
- Last but not least, I haven't tried my best. I spent 3 hours merging the thesaurus and polishing. I spent 4 hours back and forth with you. I really appreciate feedback and that my stuff is reviewed but, please, manage your temper. 
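
The ordering problem in the bug report above can be illustrated with a short Python sketch. It assumes that the index reader expects plain lexicographic (code-point) order, in which a word sorts before any longer word that starts with it; the helper name `out_of_order` is hypothetical.

```python
def out_of_order(entries: list[str]) -> list[tuple[str, str]]:
    """Return adjacent pairs that violate plain code-point ordering."""
    return [(a, b) for a, b in zip(entries, entries[1:]) if a > b]


# Order quoted in the bug report: the shortest word comes last.
reported = ["recrutar", "recrutamento", "recruta"]

# Plain lexicographic order puts the common prefix first.
expected = sorted(reported)
```

Here `expected` is `['recruta', 'recrutamento', 'recrutar']`, and `out_of_order(reported)` flags both adjacent pairs.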

In the last couple of days I have constructed a thesaurus with approximately 90000 lines. Though not mandatory, it would be great if we could work together on it and iron out some things in my thesaurus and in your tool, so we both can make European Portuguese spell-checking 'great again';)

Best regards
Comment 12 Aron Budea 2016-08-22 13:47:21 UTC
(In reply to Commit Notification from comment #7)
> Tiago Santos committed a patch related to this issue.
> It has been pushed to "master":
> 
> http://cgit.freedesktop.org/libreoffice/core/commit/
> ?id=5ad876c7d801466bca3864c625da46c3ff313d1e
> 
> tdf#97439 autocorrect pt-PT

Note that these are new files in 'extras/source/autocorr/lang/pt-PT/', while the original Portuguese autocorrect is in 'extras/source/autocorr/lang/pt/'.

-Are the new autocorrect files used?
-If they are, don't they clash with the original one?
-Either way, I don't think the new files will be shipped, as CustomTarget_autocorr.mk in extras refers to the original.
Comment 13 Tiago Santos 2016-08-22 16:47:20 UTC
I know nothing about compile paths, but on Windows the path for the files is Program Files/LibreOffice 5/share/extensions/dict-pt-PT, and on Linux the name of the file is suffixed with ‘_pt-PT’, so ‘./lang/pt-PT/’ would make sense. 

I have recently read about the changes made in 5.2.1.1. 
https://wiki.documentfoundation.org/Releases/5.2.1/RC1

Now I understand why this commit created some fuss, since it contains another update from Marco:
https://bugs.documentfoundation.org/show_bug.cgi?id=100960

I didn’t mean to disrespect his work when submitting to this thread, but I didn’t find that report before seeing this on this thread, and I thought the autocorrection files were not being actively worked on.

As such, I have recreated the files from the latest version in the repos (using the gerrit link posted by Marco) and merged them in a formatting style that is recognized by ProofingTools. The patch also has even more emojis and emoji keywords, added from the Brazilian keywords that could be used by Portuguese users. The combined total of new entries is roughly 2000. 

In order to assist review, I have also made a diff file to track the changes. Proofingtools is very good but in case there is a need for it, I also attach an updated version of the spreadsheet.
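
An entry-level comparison of two DocumentList.xml versions can be sketched in Python as follows. This is an illustration only, not how the attached diff was produced; it assumes the `http://openoffice.org/2001/block-list` namespace and the `abbreviated-name` and `name` attributes, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace and attribute names used by DocumentList.xml.
NS = "http://openoffice.org/2001/block-list"
BLOCK = f"{{{NS}}}block"
ABBR = f"{{{NS}}}abbreviated-name"
NAME = f"{{{NS}}}name"


def load_entries(xml_text: str) -> dict[str, str]:
    """Map each typed key to its replacement text."""
    root = ET.fromstring(xml_text)
    return {block.get(ABBR): block.get(NAME) for block in root.iter(BLOCK)}


def diff_entries(old_xml: str, new_xml: str) -> dict[str, list[str]]:
    """List the keys that were added, removed or given a new replacement."""
    old, new = load_entries(old_xml), load_entries(new_xml)
    return {
        "added": sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "changed": sorted(k for k in new if k in old and old[k] != new[k]),
    }
```

Comparing entries by key is independent of line breaks and entry order, so it still works when one of the files stores the whole list on a single line.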

I would also suggest marking this old thread as a duplicate (of 100960) after reviewing this patch.
Comment 14 Tiago Santos 2016-08-22 16:51:29 UTC
Created attachment 126968
Updated patch with the changes introduced by Marco in 5.2.1
Comment 15 Tiago Santos 2016-08-22 16:53:44 UTC
Created attachment 126970
Updated patch with the changes introduced by Marco in 5.2.1
Comment 16 Tiago Santos 2016-08-22 17:00:31 UTC
Created attachment 126971
DocumentList.xml diff from 5.2.1 version and this patch
Comment 17 Tiago Santos 2016-08-22 17:04:09 UTC
Created attachment 126972
Processing spreadsheet v2
Comment 18 Aron Budea 2016-08-22 17:41:18 UTC
(In reply to Tiago Santos from comment #13)
> I know nothing about compile paths, but on Windows the path for the files is
> Program Files/LibreOffice 5/share/extensions/dict-pt-PT, and on Linux the
> name of the file is suffixed with ‘_pt-PT’, so ‘./lang/pt-PT/’ would make
> sense. 

I don't know the exact details either; I haven't worked with autocorrect files before. All I know is that there was no 'extras/source/autocorr/lang/pt-PT' directory before, and the Portuguese autocorrect entries were in 'extras/source/autocorr/lang/pt'. I'd assume CustomTarget_autocorr.mk runs when preparing the release, zips the files in that directory, and creates the .dat file with the proper name that is to be installed in 'share/autocorr/'.
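
The packaging step assumed above can be sketched in Python. This only illustrates the idea that a .dat file is a zip archive of one language directory; it does not reproduce what CustomTarget_autocorr.mk actually does, and the function name `pack_autocorr` is hypothetical.

```python
import zipfile
from pathlib import Path


def pack_autocorr(source_dir: Path, dat_path: Path) -> list[str]:
    """Zip every file under source_dir into dat_path and return the member names."""
    # Collect the members before the archive is created, in case dat_path
    # lies inside source_dir.
    members = sorted(p for p in source_dir.rglob("*") if p.is_file())
    with zipfile.ZipFile(dat_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in members:
            archive.write(path, path.relative_to(source_dir).as_posix())
    return [p.relative_to(source_dir).as_posix() for p in members]
```

If the build works this way, only the directory named in the makefile ends up in the shipped archive, which is why files in a new sibling directory would be left out.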

Therefore my suggestion is, when you prepare the changes for gerrit, put your updated files in 'extras/source/autocorr/lang/pt', and remove the previously added 'extras/source/autocorr/lang/pt-PT' ones.

Confirmation from someone knowing how this works is more than welcome.
Comment 19 jani 2016-08-22 20:07:07 UTC
(In reply to Tiago Santos from comment #17)
> Created attachment 126972 [details]
> Processing spreadsheet v2

However much I would like to help you, it is impossible to see which of the patches are relevant and should go into the code; they are to some extent conflicting.

We normally do not accept patches attached to issues, but request they are submitted as gerrit patches. 

Of course no rules without exception, and from time to time I do take a look at NON-programming patches, to help the people, but only where it is clearly documented.

rgds
jan I.
Comment 20 Tiago Santos 2016-08-23 14:12:21 UTC
Hi Jan,

I will reply in line for easier reading.

> However much I would like to help you, it is impossible to see which of the
> patches are relevant and should go into the code, they are to some extent
> conflicting.

This last update resolves the conflict, since it also incorporates the changes that were incorporated into LO 5.2.1. It is explained more extensively in comment 13.

> We normally do not accept patches attached to issues, but request they are
> submitted as gerrit patches. 

Though I have said it before on gerrit, I thank you again for checking on this improvement suggestion. I never thought that this would require so much time from all of you.

I am obviously failing to follow the protocol, though this venue seemed appropriate given that I am new here and I have just ‘jumped in’ to make an offer. 

IMHO, it makes sense for someone new to post a patch suggestion here before it is proposed in gerrit, since this allows easier debate about the needs and features. I am not arguing, I am showing a newcomer's point of view.

> Of course no rules without exception, and from time to time I do take a look
> at NON-programming patches, to help the people, (...)

All the patch files are XML. 
I know you know this, and that you did not mean this for me, but AFAIK XMLs are non-coding documents by definition. Maybe somewhere else an exception exists.

> (...)but only where it is clearly 
> documented.

The attachment that contains the second update to DocumentList.xml is in comments 14 or 15 (I accidentally resubmitted it). 

Other files changed inside the acor_pt_PT.dat are:
- SentenceExceptList.xml from the acor_pt_BR.dat;
- WordExceptList.xml from the acor_pt_BR.dat.
Both files are perfectly suited for the European Portuguese variant.

The other files are just to assist in the review of the extensive additions, just in case there are any doubts about their validity or quality.


Best regards,

Tiago Santos
Comment 21 jani 2016-08-23 20:33:43 UTC
> I am obviously failing to follow the protocol, though this venue seemed
> appropriate given that I am new here and I have just ‘jumped in’ to make an
> offer. 
Well, we actually have a guide for new contributors, which is meant to make it easier to participate.

https://wiki.documentfoundation.org/Development/GetInvolved

Everyone who sends a license statement (see wiki page) gets a welcome mail with, among other things, a link to this page.


> IMHO, it makes sense for someone new to post a patch suggestion here before
> it is proposed in gerrit, since this allows easier debate about the needs
> and features. I am not arguing, I am showing a newcomer point of view.

Actually not, we use gerrit to debate the patches and bugzilla to define the bugs. In gerrit we can e.g. comment on single lines (or files) of your patch, something which is not possible in Bugzilla.

Bugzilla is intended to describe bugs, not solutions, and the discussions in here are concentrated on defining the bug, while the discussions in gerrit are solely about how the bug is solved (the patch).

> 
> 
> All the patch files are XML. 
> I know you know this, and that you have not referred this for me, but AFAIK
> XMLs are non-coding documents by definition. Maybe somewhere else an
> exception exists.

The only way to merge XML files like yours is through gerrit, so seen from that point of view it is "code". 


> The attachment that contains the second update to DocumentList.xml is in
> comments 14 or 15 (I accidentally resubmitted it). 

But the patches are distributed across comments 15, 16 and 17.

> 
> Other files changed inside the acor_pt_PT.dat are:
> - SentenceExceptList.xml from the acor_pt_BR.dat;
> - WordExceptList.xml from the acor_pt_BR.dat.
> Both files are perfectly suited for the European Portuguese variant.
> 
> The other files are just to assist in the review of the extensive additions,
> just in case there are any doubts about their validity or quality.

The problem with attachments is that I can only either add the whole attachment or none of it, since it represents your work. 

"the other files" would not be accepted in a gerrit review.

We try to make life as easy as possible for new people, but we focus a lot on
- keeping the code stable (for our millions of users).
- making sure your work is credited to you (and thus not modified by us)

rgds
jan I.
Comment 22 Tiago Santos 2016-08-24 14:28:42 UTC
(In reply to jan iversen from comment #21)

> The problem with attachments is that I can only either add the whole
> attachment or none of it, since it represents your work. 

The packages were made also for testing convenience. Just extract on Windows or just replace on Linux. 
 
> "the other files" would not be accepted in a gerrit review.

Regarding the ‘other files’ on the zip, their source was credited from the beginning in comment 3. 

> We try to make life as easy as possible for new people, but we focus a lot on
> - keeping the code stable (for our millions of users).
> - making sure your work is credited to you (and thus not modified by us)

I understand your concerns. 
I would not have shared any work here, if I believed that it would interfere with those principles, or that LibreOffice operated without these considerations. 
I also do not like to make my things other people's burden. It seemed to me that no extra work, apart from maybe another spell-check review, would be required. That is the reason for the submitted ods.

> https://wiki.documentfoundation.org/Development/GetInvolved
> 
> Everyone that sends a license statement (see wiki page), gets a welcome mail
> with a.o. a link to this pages
(...)
> Bugzilla is intended to describe bugs, not solutions, and the discussions in
> here are concentrated on defining the bug, while the discussions in gerrit are
> solely about how the bug is solved (the patch).

Thank you for taking an interest in this matter and for taking the time to explain to me how I should proceed. I will read the introductory guide you pointed out, and I will follow up via gerrit or, if I may, contact you via e-mail in case of doubt.


Best regards,

Tiago Santos
Comment 23 Tiago Santos 2016-09-28 13:48:42 UTC
All changes merged with master.