Unicode in C
Nadav Har'El
nyh at math.technion.ac.il
Tue Mar 13 10:25:03 IST 2012
On Tue, Mar 13, 2012, kobi zamir wrote about "Re: Unicode in C":
> imho because hspell only use hebrew, it can internally continue to use
> hebrew only charset without nikud iso-8859-8 (or with nikud win-1255).
I agree, and this has been my feeling all along. By using iso-8859-8
internally (and for the basic word lookup, an even more optimized 5-bit
encoding) instead of utf-8, Hspell's memory usage is at least halved.
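To make the size argument concrete, here is a rough sketch (not Hspell's actual code). Every Hebrew letter takes two bytes in UTF-8 but one byte in ISO-8859-8, and because ISO-8859-8 puts the 27 letters (including final forms) in the contiguous range 0xE0-0xFA, each one also fits in 5 bits. The helper name letter_to_5bit and the 1..27 numbering are assumptions made up for this illustration; the real internal encoding may differ.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only: map an ISO-8859-8 Hebrew
 * letter (0xE0 alef .. 0xFA tav, 27 code points including final forms)
 * to a value 1..27, which fits in 5 bits. Returns 0 for any other byte. */
static unsigned letter_to_5bit(unsigned char c)
{
    return (c >= 0xE0 && c <= 0xFA) ? (unsigned)(c - 0xE0 + 1) : 0;
}

/* The word "shalom" (shin, lamed, vav, final mem) in UTF-8:
 * two bytes per letter, 8 bytes in total. */
static const char shalom_utf8[] = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D";

/* The same word in ISO-8859-8: one byte per letter, 4 bytes in total. */
static const char shalom_8859_8[] = "\xF9\xEC\xE5\xED";
```

Calling strlen() on the two strings gives 8 and 4 respectively, which is where the "at least halved" figure for Hebrew-only text comes from.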
> it will be helpful if hspell will give the user convenience functions. this
> functions will that take utf-8 and return utf-8. the functions will convert
> the utf-8 to the hebrew only coding that hspell will use internally.
So I guess that you're also in the UTF-8 camp. That's also the direction
I'm leaning. But the question is - the day after Hspell gets a UTF-8
API, will people start complaining that it doesn't have a UTF-16,
UTF-32, or some other sort of API? And don't answer "if they want
UTF-16, let them use iconv to convert UTF-16 to UTF-8 and back" - after
all they can do this now with ISO-8859-8 (and like you said, Enchant is
doing exactly that) and still people complain ;-)
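For reference, the kind of convenience wrapper being discussed could look roughly like the sketch below. It only shows the conversion step, using the standard iconv(3) calls; the function name utf8_to_iso8859_8 is hypothetical and is not part of libhspell. It assumes the platform's iconv knows the "ISO-8859-8" and "UTF-8" charset names (glibc does).

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of libhspell: convert a NUL-terminated
 * UTF-8 string to ISO-8859-8. Returns 0 on success, or -1 if the input
 * is malformed, contains a character that ISO-8859-8 cannot represent,
 * or does not fit in the output buffer (including the trailing NUL). */
static int utf8_to_iso8859_8(const char *in, char *out, size_t outsize)
{
    iconv_t cd;
    char *inp = (char *)in;   /* iconv() takes char **, but does not modify the input */
    char *outp = out;
    size_t inleft = strlen(in);
    size_t outleft;
    size_t rc;

    if (outsize == 0)
        return -1;
    outleft = outsize - 1;    /* reserve room for the trailing NUL */

    cd = iconv_open("ISO-8859-8", "UTF-8");   /* (tocode, fromcode) */
    if (cd == (iconv_t)-1)
        return -1;

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    /* Treat errors and any lossy (non-reversible) conversion as failure. */
    if (rc != 0)
        return -1;

    *outp = '\0';
    return 0;
}
```

A UTF-8 front end would call something like this on each word and hand the result to the existing ISO-8859-8 functions such as hspell_check_word(). A UTF-16 or UTF-32 variant would differ only in the charset name passed to iconv_open(), which is exactly why the "just use iconv" answer is always available.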
> p.s.
> i will be happy if hspell will give easy to use functions for using the
> library lingual info. in current version of hspell using lingual info is
> very hard. see:
> http://code.google.com/p/hspell-gir/source/browse/src/hspell-gir.vala
I agree that the linginfo (aka morphological analyzer) C API needs an
overhaul. Out of embarrassment, it's not even documented in hspell(3) :-)
It could also have been implemented more efficiently (memory-wise) than
it is. But following the maxim "If it ain't broken, don't fix it",
we haven't touched this code in years :(
P.S.
Looking at http://code.google.com/p/hspell-gir/, I see that hspell-gui
has a bug: it claims that החתול might mean ה+חתול with the second word
being in construct form (סמיכות). But this isn't a valid split - the
construct form cannot be preceded by the definite article (ה) - and
Hspell knows this (try running hspell -al or going to the demo at
http://www.cs.technion.ac.il/~danken/cgi-bin/hspell.cgi to check).
Similarly, הירוק only has one legal meaning ("the green") and the two other
meanings listed in the png on your site are *wrong*. So it appears something
is wrong with your word splitting code? This is surprising if you're using
libhspell... I didn't look at your code to see where it went wrong.
Nadav.
--
Nadav Har'El | Tuesday, Mar 13 2012,
nyh at math.technion.ac.il |-----------------------------------------
Phone +972-523-790466, ICQ 13349191 |And now for some feedback:
http://nadav.harel.org.il |EEEEEEEEEEEEEEEEEEEEEEEEEEE