Unicode in C

Omer Zak w1 at zak.co.il
Mon Mar 12 15:20:12 IST 2012


It depends upon your tradeoffs.
If your text is mostly in Western scripts (Latin, Hebrew, etc.) and you
want to economize on memory use, use UTF-8.  However, for Chinese, UTF-8
costs more memory than it saves: most Chinese characters take 3 bytes in
UTF-8 versus 2 bytes in UTF-16.
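
To make the memory tradeoff concrete, here is a minimal sketch that
counts how many bytes a sequence of code points needs in each encoding.
The function names (utf8_bytes, utf16_bytes, encoded_sizes) are made up
for this illustration and do not come from any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to encode one code point in UTF-8. */
static size_t utf8_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;   /* ASCII */
    if (cp < 0x800)   return 2;   /* includes Hebrew, U+0590..U+05FF */
    if (cp < 0x10000) return 3;   /* rest of the BMP, includes most CJK */
    return 4;                     /* supplementary planes */
}

/* Bytes needed to encode one code point in UTF-16. */
static size_t utf16_bytes(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4;  /* surrogate pair above the BMP */
}

/* Total encoded size of n code points in UTF-8, UTF-16 and UTF-32. */
static void encoded_sizes(const uint32_t *cps, size_t n,
                          size_t *utf8, size_t *utf16, size_t *utf32)
{
    *utf8 = 0;
    *utf16 = 0;
    *utf32 = 4 * n;               /* always 4 bytes per code point */
    for (size_t i = 0; i < n; i++) {
        *utf8  += utf8_bytes(cps[i]);
        *utf16 += utf16_bytes(cps[i]);
    }
}
```

For the four Hebrew letters U+05E9 U+05DC U+05D5 U+05DD this gives 8, 8
and 16 bytes; for the two Chinese characters U+4E2D U+6587 it gives 6, 4
and 8 bytes.  So for Hebrew letters alone UTF-8 ties with UTF-16, and
its saving comes from the ASCII spaces, digits and punctuation mixed
into the text.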

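If you need Far Eastern scripts and/or random access to your text, use a
fixed-size wide character encoding (16 or 32 bits per character).  Note
that 16 bits are a fixed size only for characters in the Basic
Multilingual Plane; beyond it, UTF-16 needs two 16-bit units.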
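
The random access point can be sketched as follows.  Both function
names are hypothetical, and the UTF-8 version assumes its input is
valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return a pointer to the first byte of the n-th code point (0-based)
 * of a UTF-8 string, or NULL if the string has fewer code points.
 * The string must be scanned from the start, so the cost is O(n). */
static const char *utf8_at(const char *s, size_t n)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        /* Continuation bytes have the form 10xxxxxx; skip them. */
        if (((unsigned char)*s & 0xC0) == 0x80)
            continue;
        if (n-- == 0)
            return s;
    }
    return NULL;
}

/* With 32-bit characters the same lookup is a plain array index, O(1).
 * The caller must make sure n is within the string. */
static uint32_t utf32_at(const uint32_t *s, size_t n)
{
    assert(s != NULL);
    return s[n];
}
```

Whether the O(n) scan matters depends on the caller: code that only
walks a word from start to end pays nothing extra for UTF-8.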

My suggestion for the particular case of libhspell is as follows.
1. Is there any standard API for spellchecking libraries?  If yes, try
to use it.
2. Otherwise, specify two such APIs - one UTF-8 based, one fixed
size wide character based.  Create two binary variants of libhspell
and optimize each one for the corresponding API.  Hopefully, it'll be
possible to use essentially the same code base for 16 bit and 32 bit
characters.
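
One possible way to share a code base between 16 bit and 32 bit
characters is to generate both variants from a single function body with
a macro.  The sketch below is only an illustration: the demo_* names are
hypothetical and are not part of libhspell, and the "check" merely tests
whether a word consists of the letters alef..tav (U+05D0..U+05EA),
where a real spellchecker would look the word up in its dictionary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One function body, instantiated once per character width. */
#define DEFINE_IS_HEBREW_WORD(NAME, CHAR_T)                           \
    static int NAME(const CHAR_T *word)                               \
    {                                                                 \
        assert(word != NULL);                                         \
        if (*word == 0)                                               \
            return 0;                  /* empty string is no word */  \
        for (; *word != 0; word++)                                    \
            if (*word < 0x05D0 || *word > 0x05EA)                     \
                return 0;              /* outside alef..tav */        \
        return 1;                                                     \
    }

DEFINE_IS_HEBREW_WORD(demo_is_hebrew_word_u16, uint16_t)
DEFINE_IS_HEBREW_WORD(demo_is_hebrew_word_u32, uint32_t)

/* UTF-8 variant, written separately so it can work on bytes directly:
 * alef..tav are encoded as 0xD7 followed by 0x90..0xAA. */
static int demo_is_hebrew_word_utf8(const char *word)
{
    const unsigned char *p = (const unsigned char *)word;
    assert(word != NULL);
    if (*p == 0)
        return 0;                      /* empty string is no word */
    while (*p != 0) {
        if (p[0] != 0xD7 || p[1] < 0x90 || p[1] > 0xAA)
            return 0;
        p += 2;                        /* every letter is two bytes */
    }
    return 1;
}
```

The UTF-8 variant never decodes to code points, which is the kind of
per-API optimization that separate binary variants would allow.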

The rationale is that different wordprocessors may need either API, and
that they need to run spellchecking as fast as possible.


--- Omer


On Mon, 2012-03-12 at 15:05 +0200, Nadav Har'El wrote:
> Hi, I have a question that I was sort of sad that I couldn't readily
> find the answer to...
> 
> Let's say I want to create a C API (a C library), with functions which
> take strings as arguments. What am I supposed to use if I want these strings
> to be in any language? Obviously the answer is "Unicode", but that
> doesn't really answer the question... How is Unicode used in C?
> 
> As far as I can see, there are two major approaches to this problem.
> 
> One approach, used in the Win32 C APIs on MS-Windows, and also in Java and
> other languages, is to use "wide characters" - characters of 16 or 32 bit
> size, and strings are an array of such characters.
> 
> The second approach, proposed by Plan 9, is to use UTF-8.
> 
> I personally prefer the UTF-8 approach, because it naturally fits
> with C's "char *" type and with Linux's system calls (which take char*,
> not any sort of wide characters), but I'm completely unsure that this is
> what users actually want. If not, then I wonder, why?
> 
> Some background on this question: People have been complaining for years
> that Hspell, and in particular the libhspell functions, use ISO-8859-8
> instead of "unicode". But if one wants to add unicode to libhspell, what
> should it be? UTF-8? Wide chars (UTF-16 or UTF-32)?

-- 
$ python
>>> type(type(type))
<type 'type'>          My own blog is at http://www.zak.co.il/tddpirate/
My opinions, as expressed in this E-mail message, are mine alone.
They do not represent the official policy of any organization with which
I may be affiliated in any way.
WARNING TO SPAMMERS:  at http://www.zak.co.il/spamwarning.html



