Hebrew spell-checking in OpenOffice

Nadav Har'El nyh at math.technion.ac.il
Tue Nov 2 11:40:57 IST 2010


Recently I noticed that (thanks to Lior Kaplan, it seems) it is now trivial
to get Hebrew spellchecking (based on Hspell 1.1) in OpenOffice.
The Hebrew localized version (now available on the official OpenOffice site!)
comes with Hebrew spell-checking pre-bundled, and there's an extension [1]
for those who use the English version of OpenOffice.

However, when I actually used this spell checker, and observed my wife using
it, I noticed two annoying problems in the way it works. I'm not sure if these
are OpenOffice problems per se, or perhaps problems that should be solved in
the context of hunspell, OpenOffice's spell-checking library. It is possible
that changes to the dictionary file are all that is needed to solve these
problems, but it is also possible that OpenOffice code needs to be changed.
I simply don't know. I was hoping that someone here could help me figure this
out, or at least point me to the right place to report these problems.

The first issue is acronyms (rashei tevot) and abbreviations. In Hebrew,
these use the geresh and gershaim (or single or double quotes), which are
part of the word. OpenOffice does not understand that these quotes are part
of the Hebrew word, and splits the word on them. As a result, all acronyms are
marked as spelling mistakes. This is really annoying, especially for certain
types of documents where acronyms are common.
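
To make the first issue concrete, here is a minimal Python sketch of the
two tokenization behaviours. It is not OpenOffice's or hunspell's actual
word-breaking code; the function names and the regular expressions are
assumptions made up for illustration. The naive version splits on any
non-letter, while the second keeps a quote mark that sits between Hebrew
letters, plus an optional trailing geresh.

```python
import re

# Hebrew letters occupy U+05D0..U+05EA (final forms included).
HEBREW = "\u05d0-\u05ea"
# Geresh (U+05F3) and gershayim (U+05F4), plus the ASCII quotes
# that are commonly typed in their place.
QUOTES = "'\"\u05f3\u05f4"
# Only a single-quote-like mark is accepted at the end of a word.
TRAILING = "'\u05f3"

# Naive rule: a word is a run of letters, so quotes split the word.
NAIVE_RE = re.compile(f"[{HEBREW}]+")

# Hypothetical acronym-aware rule: a quote mark between letters, or one
# trailing geresh, stays inside the word. Caveat: a word wrapped in
# single quotes would wrongly keep its closing quote.
WORD_RE = re.compile(f"[{HEBREW}]+(?:[{QUOTES}][{HEBREW}]+)*[{TRAILING}]?")


def naive_words(text):
    """Split the way the spell checker appears to: on every non-letter."""
    return NAIVE_RE.findall(text)


def hebrew_words(text):
    """Split while keeping geresh/gershayim as part of the word."""
    return WORD_RE.findall(text)
```

With the naive rule, the acronym צה"ל comes out as two fragments, each
of which is then checked (and flagged) separately; with the second rule
it reaches the dictionary lookup as a single word.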

The second issue is the correction suggestions for spelling errors. All
the suggestions indeed appear to be valid words, but their order is
terrible: it appears little or no attention was paid to trying to provide
the most likely suggestions first. The screenshot on the extension page [1]
provides an excellent example: given the misspelling עיברי, the most likely
correction, עברי, appears only as the 8th suggestion, while the first
suggestions are highly unlikely. The sixth suggestion is especially
unlikely (requiring one accidental transposition and one moved letter):
ערביי. I'd like OpenOffice to use common-sense edit-distance
based heuristics to decide which suggestion to give first (i.e., one typing
mistake is more likely than two), but also Hebrew-specific rules regarding
the "cost" of these edits, e.g., that in Hebrew omitting or adding a vowel
letter (em kri'a) is more likely than omitting or adding just any random
letter. Hebrew also has letters that sound the same (e.g., tav and tet) or
nearly the same, and there are a number of other rules I'd like to see.
I believe that hunspell's dictionary in fact has a way to give such correction
rules, but I don't know how to correctly write them, or how to make OpenOffice
use them.
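
The kind of ordering described above can be sketched in Python as a
weighted edit distance. This is only an illustration, not how hunspell
or OpenOffice rank suggestions. The function names are hypothetical, the
weights (0.5 and 1.0) are arbitrary, and the letter pairs other than
tav/tet are assumptions added for the example.

```python
# Vowel letters (immot kri'a): cheap to add or omit.
VOWEL_LETTERS = set("אהוי")
# Letters that sound alike; only tav/tet comes from the text above,
# the other pairs are illustrative assumptions.
SIMILAR = {frozenset("תט"), frozenset("כק"), frozenset("אע")}


def indel_cost(ch):
    """Cost of inserting or deleting one letter."""
    return 0.5 if ch in VOWEL_LETTERS else 1.0


def sub_cost(a, b):
    """Cost of replacing letter a with letter b."""
    if a == b:
        return 0.0
    return 0.5 if frozenset((a, b)) in SIMILAR else 1.0


def weighted_distance(src, dst):
    """Edit distance with adjacent transpositions and Hebrew-aware costs."""
    rows, cols = len(src) + 1, len(dst) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + indel_cost(src[i - 1])
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + indel_cost(dst[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = src[i - 1], dst[j - 1]
            best = min(
                d[i - 1][j] + indel_cost(a),      # delete a
                d[i][j - 1] + indel_cost(b),      # insert b
                d[i - 1][j - 1] + sub_cost(a, b), # match or substitute
            )
            # Swap of two adjacent, different letters costs one full edit.
            if i > 1 and j > 1 and a != b and a == dst[j - 2] and src[i - 2] == b:
                best = min(best, d[i - 2][j - 2] + 1.0)
            d[i][j] = best
    return d[-1][-1]


def rank_suggestions(misspelled, candidates):
    """Order candidate corrections from most to least likely."""
    return sorted(candidates, key=lambda w: weighted_distance(misspelled, w))
```

Under these assumed weights, rank_suggestions("עיברי", ["ערביי", "עברי"])
puts עברי first, because dropping a single vowel letter is cheaper than
any combination of edits that produces ערביי.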

I (and thousands of other OpenOffice users in Israel) would be grateful
if someone could look into these issues.

Nadav.

[1] http://extensions.services.openoffice.org/en/project/dict-he


-- 
Nadav Har'El                        |    Tuesday, Nov  2 2010, 25 Heshvan 5771
nyh at math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |The person who knows how to laugh at
http://nadav.harel.org.il           |himself will never cease to be amused.


