Unicode in C

Nadav Har'El nyh at math.technion.ac.il
Tue Mar 13 10:25:03 IST 2012


On Tue, Mar 13, 2012, kobi zamir wrote about "Re: Unicode in C":
> imho because hspell only use hebrew, it can internally continue to use
> hebrew only charset without nikud iso-8859-8 (or with nikud win-1255).

I agree, and this has been my feeling all along. Because Hspell uses
ISO-8859-8 internally (and, for the basic word lookup, an even more
compact 5-bit encoding) instead of UTF-8, its memory usage is at least
halved: each Hebrew letter takes two bytes in UTF-8 but only one in
ISO-8859-8.
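
To make the arithmetic concrete, here is a minimal sketch of what a 5-bit
packing could look like. It is only an illustration: the function name
pack5, the code assignment (alef = 1 ... tav = 27) and the bit order are
assumptions made for this example, and Hspell's actual internal layout may
well differ. For a four-letter word such as חתול it gives 8 bytes in UTF-8,
4 bytes in ISO-8859-8, and 3 bytes packed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from Hspell's source: pack a word of
 * ISO-8859-8 Hebrew letters (bytes 0xE0..0xFA) into 5-bit codes 1..27,
 * least significant bits first. Returns the number of bytes written,
 * or -1 if a byte is not a Hebrew letter or `out` is too small. */
int pack5(const unsigned char *word, size_t len,
          unsigned char *out, size_t outsize)
{
    size_t need = (len * 5 + 7) / 8;    /* ceil(5 * len / 8) */
    size_t i;

    assert(word != NULL && out != NULL);
    if (need > outsize)
        return -1;
    memset(out, 0, need);
    for (i = 0; i < len; i++) {
        unsigned code, value;
        size_t bit = i * 5, byte = bit / 8;
        unsigned shift = (unsigned)(bit % 8);

        if (word[i] < 0xE0 || word[i] > 0xFA)
            return -1;
        code = word[i] - 0xE0u + 1u;    /* alef = 1 ... tav = 27 */
        value = code << shift;          /* may spill into the next byte */
        out[byte] |= (unsigned char)(value & 0xFFu);
        if (shift > 3)
            out[byte + 1] |= (unsigned char)(value >> 8);
    }
    return (int)need;
}
```

The 27 codes cover the 22 letters plus the 5 final forms, which is why 5
bits (32 values) are enough for a Hebrew-only dictionary.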

> it will be helpful if hspell will give the user convenience functions. this
> functions will that take utf-8 and return utf-8. the functions will convert
> the utf-8 to the hebrew only coding that hspell will use internally.

So I guess that you're also in the UTF-8 camp. That's also the direction
I'm leaning. But the question is: the day after Hspell gets a UTF-8 API,
will people start complaining that it doesn't have a UTF-16, UTF-32, or
some other sort of API? And don't answer "if they want UTF-16, let them
use iconv to convert UTF-16 to UTF-8 and back": after all, they can do
this now with ISO-8859-8 (and like you said, Enchant is doing exactly
that) and people still complain ;-)
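
For illustration, the conversion that such UTF-8 convenience functions would
need is small. The sketch below is not part of libhspell; the names
utf8_to_iso8859_8 and iso8859_8_to_utf8 are made up for this example, and
it assumes input limited to ASCII and the plain Hebrew letters
U+05D0..U+05EA (no nikud), rejecting anything else. A real wrapper might
prefer iconv for full coverage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert NUL-terminated UTF-8 to ISO-8859-8.
 * Returns the number of bytes written (without the NUL), or -1 on
 * unsupported input or if dst is too small. */
int utf8_to_iso8859_8(const char *src, char *dst, size_t dstsize)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*s) {
        unsigned char out;
        if (*s < 0x80) {
            out = *s++;                          /* ASCII passes through */
        } else if (s[0] == 0xD7 && s[1] >= 0x90 && s[1] <= 0xAA) {
            /* U+05D0..U+05EA is D7 90..D7 AA in UTF-8 and E0..FA in
             * ISO-8859-8, so the second byte is shifted by 0x50. */
            out = (unsigned char)(s[1] + 0x50);
            s += 2;
        } else {
            return -1;
        }
        if (n + 1 >= dstsize)                    /* keep room for the NUL */
            return -1;
        dst[n++] = (char)out;
    }
    if (dstsize == 0)
        return -1;
    dst[n] = '\0';
    return (int)n;
}

/* Hypothetical helper: the reverse direction, same conventions. */
int iso8859_8_to_utf8(const char *src, char *dst, size_t dstsize)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (; *s; s++) {
        if (*s < 0x80) {
            if (n + 1 >= dstsize)
                return -1;
            dst[n++] = (char)*s;
        } else if (*s >= 0xE0 && *s <= 0xFA) {
            if (n + 2 >= dstsize)
                return -1;
            dst[n++] = (char)0xD7;
            dst[n++] = (char)(*s - 0x50);
        } else {
            return -1;
        }
    }
    if (dstsize == 0)
        return -1;
    dst[n] = '\0';
    return (int)n;
}
```

A UTF-8 front end would call the first function, pass the result to the
existing ISO-8859-8 API, and run any returned words (for example,
spelling suggestions) through the second.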

> p.s.
> i will be happy if hspell will give easy to use functions for using the
> library lingual info. in current version of hspell using lingual info is
> very hard. see:
> http://code.google.com/p/hspell-gir/source/browse/src/hspell-gir.vala

I agree that the linginfo (aka morphological analyzer) C API needs an
overhaul. Out of embarrassment, it's not even documented in hspell(3) :-)
It could also have been implemented more efficiently (memory-wise) than
it is. But following the maxim "If it ain't broke, don't fix it",
we haven't touched this code in years :(

P.S. 

Looking at http://code.google.com/p/hspell-gir/, I see that hspell-gir
has a bug: it claims that החתול might mean ה+חתול with the second word
being in construct form (סמיכות). But this isn't a valid split - the
construct form cannot be preceded by the definite article (ה) - and
Hspell knows this (try running hspell -al or going to the demo at
http://www.cs.technion.ac.il/~danken/cgi-bin/hspell.cgi to check).
Similarly, הירוק only has one legal meaning ("the green") and the two other
meanings listed in the png on your site are *wrong*. So it appears that
something is wrong with your word-splitting code? This is surprising if
you're using libhspell... I didn't look at your code to see where it went
wrong.

Nadav.

-- 
Nadav Har'El                        |                   Tuesday, Mar 13 2012, 
nyh at math.technion.ac.il             |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |And now for some feedback:
http://nadav.harel.org.il           |EEEEEEEEEEEEEEEEEEEEEEEEEEE


