Unicode in C

Dov Grobgeld dov.grobgeld at gmail.com
Mon Mar 12 16:30:24 IST 2012


My suggestion is to take the glib/gtk approach: use UTF-8 everywhere and
have the API accept char*, i.e. there is no typedef for Unicode character
strings. If this is not acceptable because of speed (this is its only
tradeoff), then use UCS-4 internally and provide two external interfaces,
one for UCS-4 and one for UTF-8. For backwards compatibility you can
provide your own ISO-8859-8 to UTF-8 conversion functions. I suggest that
you don't add an iconv dependency but let the user take care of other
character set conversions, which you don't really care about.
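
To make the "own conversion functions" idea concrete, here is a minimal
sketch of an ISO-8859-8 to UTF-8 converter with no iconv dependency. The
function names are hypothetical (they are not part of libhspell), and the
byte table reflects my understanding of the ISO-8859-8 mapping, so check it
against the official mapping table before relying on it. Unassigned bytes
are turned into U+FFFD here, which is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Map one ISO-8859-8 byte to a Unicode code point.
 * Bytes with no assigned character become U+FFFD (an assumed policy). */
static uint32_t iso8859_8_to_ucs4(unsigned char c)
{
    if (c < 0xA1) return c;                 /* ASCII, C1 controls, NBSP */
    if (c == 0xAA) return 0x00D7;           /* multiplication sign */
    if (c == 0xBA) return 0x00F7;           /* division sign */
    if (c >= 0xA2 && c <= 0xBE) return c;   /* same positions as Latin-1 */
    if (c == 0xDF) return 0x2017;           /* double low line */
    if (c >= 0xE0 && c <= 0xFA)             /* Hebrew letters alef..tav */
        return 0x05D0 + (uint32_t)(c - 0xE0);
    if (c == 0xFD) return 0x200E;           /* left-to-right mark */
    if (c == 0xFE) return 0x200F;           /* right-to-left mark */
    return 0xFFFD;                          /* unassigned byte */
}

/* Encode a code point below U+10000 as UTF-8 (enough for the table
 * above). Returns the number of bytes written to out (1 to 3). */
static size_t ucs4_to_utf8(uint32_t cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (cp >> 12));
    out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Convert a NUL-terminated ISO-8859-8 string to NUL-terminated UTF-8.
 * Returns the number of bytes written (excluding the NUL), or
 * (size_t)-1 if dst_size is too small to hold the result. */
size_t iso8859_8_to_utf8(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        char buf[3];
        size_t len = ucs4_to_utf8(iso8859_8_to_ucs4((unsigned char)*src), buf);

        if (n + len + 1 > dst_size)         /* keep room for the NUL */
            return (size_t)-1;
        memcpy(dst + n, buf, len);
        n += len;
    }
    if (n + 1 > dst_size)
        return (size_t)-1;
    dst[n] = '\0';
    return n;
}
```

Each input byte expands to at most three UTF-8 bytes, so a destination
buffer of 3 * strlen(src) + 1 bytes is always large enough. The rest of
the API can then take and return plain char* holding UTF-8.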

Regards,
Dov

2012/3/12 Elazar Leibovich <elazarl at gmail.com>

> The simplest option is to accept a StringPiece-like structure (pointer to
> buffer + size) and an encoding, then convert the data internally to your
> encoding (say, ISO-8859-8, replacing illegal characters with whitespace),
> and convert the output back to the caller's encoding.
>
> Would you mind using an iconv-like library?
>
>
> On Mon, Mar 12, 2012 at 3:05 PM, Nadav Har'El <nyh at math.technion.ac.il> wrote:
>
>> Hi, I have a question that I was sort of sad that I couldn't readily
>> find the answer to...
>>
>> Let's say I want to create a C API (a C library), with functions which
>> take strings as arguments. What am I supposed to use if I want these
>> strings to be in any language? Obviously the answer is "Unicode", but
>> that doesn't really answer the question... How is Unicode used in C?
>>
>> As far as I can see, there are two major approaches to this problem.
>>
>> One approach, used in the Win32 C APIs on MS-Windows, and also in Java and
>> other languages, is to use "wide characters": characters 16 or 32 bits
>> wide, with strings being arrays of such characters.
>>
>> The second approach, proposed by Plan 9, is to use UTF-8.
>>
>> I personally prefer the UTF-8 approach, because it naturally fits
>> with C's "char *" type and with Linux's system calls (which take char*,
>> not any sort of wide characters), but I'm not at all sure that this is
>> what users actually want. If not, then I wonder, why?
>>
>> Some background on this question: People have been complaining for years
>> that Hspell, and in particular the libhspell functions, use ISO-8859-8
>> instead of "unicode". But if one wants to add unicode to libhspell, what
>> should it be? UTF-8? Wide chars (UTF-16 or UTF-32)?
>>
>> Thanks,
>> Nadav.
>>
>> --
>> Nadav Har'El                        |                   Monday, Mar 12 2012,
>> nyh at math.technion.ac.il          |-----------------------------------------
>> Phone +972-523-790466, ICQ 13349191 |We could wipe out world hunger if we knew
>> http://nadav.harel.org.il           |how to make AOL's Free CD's edible!
>>
>> _______________________________________________
>> Linux-il mailing list
>> Linux-il at cs.huji.ac.il
>> http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il
>>
>