Unicode in C

Dan Kenigsberg danken at cs.technion.ac.il
Tue Mar 13 16:35:44 IST 2012


On Mon, Mar 12, 2012 at 03:05:56PM +0200, Nadav Har'El wrote:
> Hi, I have a question that I was sort of sad that I couldn't readily
> find the answer to...
> 
> Let's say I want to create a C API (a C library), with functions which
> take strings as arguments. What am I supposed to use if I want these strings
> to be in any language? Obviously the answer is "Unicode", but that
> doesn't really answer the question... How is Unicode used in C?
> 
> As far as I can see, there are two major approaches to this problem.
> 
> One approach, used in the Win32 C APIs on MS-Windows, and also in Java and
> other languages, is to use "wide characters" - characters of 16 or 32 bit
> size, and strings are an array of such characters.
> 
> The second approach, proposed by Plan 9, is to use UTF-8.
> 
> I personally like better the UTF-8 approach, because it naturally fits
> with C's "char *" type and with Linux's system calls (which take char*,
> not any sort of wide characters), but I'm completely unsure that this is
> what users actually want. If not, then I wonder, why?
> 
> Some background on this question: People have been complaining for years
> that Hspell, and in particular the libhspell functions, use ISO-8859-8
> instead of "unicode". But if one wants to add unicode to libhspell, what
> should it be? UTF-8? Wide chars (UTF-16 or UTF-32)?

I think this background is most important. The real question is the motivation
of the people complaining. If it is something beyond "yuck, 8bit is old!", we
should ask them which encoding is good for their use case.

When I compiled hspell for a (paying!) customer who used Windows, I wrote my own
wrapper functions to convert Windows' wide chars to hspell's 8bit (and vice
versa). I bet that if there's anyone using libhspell in a Unix-like environment,
he would prefer utf-8.
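
To illustrate what such wrappers involve, here is a minimal sketch. It is not
the customer code mentioned above; the function names are hypothetical and not
part of libhspell, and it assumes the input holds only ASCII and the Hebrew
letters alef..tav (U+05D0..U+05EA, bytes 0xE0..0xFA in ISO-8859-8). Anything
else is rejected.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper (not a libhspell function): convert a NUL-terminated
 * wide string to ISO-8859-8. Handles only ASCII and the Hebrew letters
 * U+05D0..U+05EA. Returns 0 on success, -1 on an unsupported character or
 * if 'out' (of 'outsize' bytes) is too small. */
int wide_to_iso88598(const wchar_t *in, char *out, size_t outsize)
{
    size_t o = 0;

    for (; *in != L'\0'; in++) {
        unsigned long cp = (unsigned long)*in;
        unsigned char b;

        if (cp < 0x80)
            b = (unsigned char)cp;                    /* ASCII maps to itself */
        else if (cp >= 0x05D0 && cp <= 0x05EA)
            b = (unsigned char)(0xE0 + (cp - 0x05D0)); /* alef..tav */
        else
            return -1;                                /* not representable here */

        if (o + 1 >= outsize)                         /* keep room for the NUL */
            return -1;
        out[o++] = (char)b;
    }
    if (o >= outsize)
        return -1;
    out[o] = '\0';
    return 0;
}

/* Hypothetical helper for the opposite direction, under the same assumptions.
 * 'outlen' is the capacity of 'out' in wchar_t units. */
int iso88598_to_wide(const char *in, wchar_t *out, size_t outlen)
{
    size_t o = 0;

    for (; *in != '\0'; in++) {
        unsigned char b = (unsigned char)*in;
        wchar_t wc;

        if (b < 0x80)
            wc = (wchar_t)b;
        else if (b >= 0xE0 && b <= 0xFA)
            wc = (wchar_t)(0x05D0 + (b - 0xE0));
        else
            return -1;

        if (o + 1 >= outlen)
            return -1;
        out[o++] = wc;
    }
    if (o >= outlen)
        return -1;
    out[o] = L'\0';
    return 0;
}
```

On Windows, where wchar_t is a 16-bit UTF-16 unit, this is enough for Hebrew
letters since they all sit in the Basic Multilingual Plane. Niqqud, punctuation
outside ASCII and the directional marks would need extra table entries.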

In my opinion, it is nice to conform to the modern standard of your major target
environment (read: utf8), but not necessary to cater to all encodings.
Would you even consider supplying a hspell_iso88598_to_utf8 function to help
your client app do the conversion itself? I'm not sure this is our beeswax.
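
For a sense of how small such a helper could be, here is a rough sketch. The
name hspell_iso88598_to_utf8 is only the hypothetical one floated above, and
the signature and error convention are assumptions. It covers ASCII and the
Hebrew letters 0xE0..0xFA only; every other high byte is treated as an error
rather than guessed at.

```c
#include <stddef.h>

/* Hypothetical function, not an existing libhspell API.
 * Converts a NUL-terminated ISO-8859-8 string to UTF-8.
 * Returns 0 on success, -1 on an unsupported byte or if 'out'
 * (of 'outsize' bytes, including the terminating NUL) is too small. */
int hspell_iso88598_to_utf8(const char *in, char *out, size_t outsize)
{
    size_t o = 0;

    for (; *in != '\0'; in++) {
        unsigned char b = (unsigned char)*in;

        if (b < 0x80) {
            /* ASCII is identical in both encodings: one byte out. */
            if (o + 1 >= outsize)
                return -1;
            out[o++] = (char)b;
        } else if (b >= 0xE0 && b <= 0xFA) {
            /* Hebrew letters alef..tav are U+05D0..U+05EA, which UTF-8
             * encodes as the two bytes 0xD7, 0x90..0xAA. */
            if (o + 2 >= outsize)
                return -1;
            out[o++] = (char)0xD7;
            out[o++] = (char)(0x90 + (b - 0xE0));
        } else {
            return -1;  /* byte not handled by this sketch */
        }
    }
    if (o >= outsize)
        return -1;
    out[o] = '\0';
    return 0;
}
```

A caller needs at most 2*strlen(in)+1 bytes of output, since each input byte
becomes one or two UTF-8 bytes.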

However, this is only me and my bets. If anyone needs another encoding, let him
speak now or use his own iconv calls forever.


Dan.



