Unicode in C

Mon Mar 12 17:39:02 IST 2012

What's the advantage of using UCS-4 internally, especially if the program
needs to save memory? (Embedded devices are pretty common these days.)
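
To put rough numbers on the memory question, here is a minimal sketch in C
that compares the storage one string needs as UTF-8 and as UCS-4. The function
names are made up for illustration, and the code assumes valid, NUL-terminated
UTF-8 input (it does no validation).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count code points in a NUL-terminated UTF-8 string by counting every byte
 * that is not a continuation byte (10xxxxxx). Assumes valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Bytes the text occupies as UTF-8 (excluding the terminating NUL). */
size_t utf8_bytes(const char *s)
{
    return strlen(s);
}

/* Bytes the same text would occupy as UCS-4: one 32-bit unit per code point. */
size_t ucs4_bytes(const char *utf8)
{
    return utf8_codepoints(utf8) * sizeof(uint32_t);
}
```

For the four-letter Hebrew word "שלום" this gives 8 bytes as UTF-8 and 16
bytes as UCS-4, against 4 bytes in ISO-8859-8.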

Ely

2012/3/12 Dov Grobgeld <dov.grobgeld at gmail.com>

> My suggestion is to go the glib/gtk route: use UTF-8 everywhere and have
> the API accept char*, i.e. there is no typedef for Unicode character
> strings. If this is not acceptable because of speed (its only tradeoff),
> then use UCS-4 internally and provide two external interfaces, one for
> UCS-4 and one for UTF-8. For backwards compatibility you can provide your
> own ISO-8859-8 to UTF-8 conversion functions. I suggest that you don't add
> an iconv dependency but let the user take care of character set
> conversions, which you don't really care about.
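
A minimal sketch of what such a conversion function could look like, without
iconv. The name and signature are hypothetical, not existing libhspell API.
As a loud simplification it maps only ASCII and the Hebrew letters
(0xE0-0xFA to U+05D0-U+05EA); every other byte becomes U+FFFD, so a complete
converter would still need the punctuation in 0xA0-0xBE, 0xDF and the
0xFD/0xFE direction marks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert `len` bytes of ISO-8859-8 text to a
 * NUL-terminated UTF-8 string in `dst` (capacity `dstsize` bytes).
 * Returns the number of bytes written, excluding the NUL, or (size_t)-1
 * if `dst` is too small. Only ASCII and Hebrew letters are mapped. */
size_t iso8859_8_to_utf8(const char *src, size_t len, char *dst, size_t dstsize)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    if (dstsize == 0)
        return (size_t)-1;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)src[i];
        unsigned cp;
        unsigned char buf[3];
        size_t need;

        if (c < 0x80)
            cp = c;                       /* ASCII maps to itself */
        else if (c >= 0xE0 && c <= 0xFA)
            cp = 0x05D0u + (c - 0xE0u);   /* alef .. tav */
        else
            cp = 0xFFFDu;                 /* unmapped in this sketch */

        /* Encode the code point as 1, 2 or 3 UTF-8 bytes. */
        if (cp < 0x80) {
            buf[0] = (unsigned char)cp;
            need = 1;
        } else if (cp < 0x800) {
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            need = 2;
        } else {
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            need = 3;
        }

        if (dstsize - out <= need)        /* keep one byte for the NUL */
            return (size_t)-1;
        memcpy(dst + out, buf, need);
        out += need;
    }
    dst[out] = '\0';
    return out;
}
```

A buffer of 3 * len + 1 bytes is always large enough for this sketch, since
no input byte expands to more than three output bytes.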
>
> Regards,
> Dov
>
> 2012/3/12 Elazar Leibovich <elazarl at gmail.com>
>
>> The simplest option is to accept a StringPiece-like structure (pointer to
>> buffer + size) and an encoding, then convert the data internally to your
>> encoding (say, ISO-8859-8, replacing illegal characters with whitespace),
>> and convert the output back.
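
The idea above can be sketched in C roughly as follows. All names here
(string_piece, text_encoding, to_iso8859_8) are hypothetical and not part of
any existing library. The sketch only recognises ASCII and Hebrew letters in
UTF-8 input; anything else, including malformed sequences, becomes a space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types for illustration only. */
enum text_encoding { ENC_ISO_8859_8, ENC_UTF8 };

struct string_piece {
    const char *data;          /* not necessarily NUL-terminated */
    size_t size;               /* length in bytes */
    enum text_encoding enc;    /* how to interpret `data` */
};

/* Convert `in` to ISO-8859-8 in `dst` (capacity `dstsize` bytes) and return
 * the number of bytes written. The output is never longer than the input,
 * so dstsize >= in.size is always enough; shorter buffers truncate. */
size_t to_iso8859_8(struct string_piece in, char *dst, size_t dstsize)
{
    size_t i = 0, out = 0;

    assert(in.data != NULL && dst != NULL);

    if (in.enc == ENC_ISO_8859_8) {
        size_t n = in.size < dstsize ? in.size : dstsize;
        memcpy(dst, in.data, n);
        return n;
    }

    while (i < in.size && out < dstsize) {
        unsigned char c = (unsigned char)in.data[i];

        if (c < 0x80) {
            dst[out++] = (char)c;                 /* ASCII */
            i++;
        } else if (c == 0xD7 && i + 1 < in.size &&
                   (unsigned char)in.data[i + 1] >= 0x90 &&
                   (unsigned char)in.data[i + 1] <= 0xAA) {
            /* U+05D0..U+05EA (alef..tav) is 0xD7 0x90..0xAA in UTF-8. */
            dst[out++] = (char)(0xE0 + ((unsigned char)in.data[i + 1] - 0x90));
            i += 2;
        } else {
            /* Not representable here: skip the lead byte plus the
             * continuation bytes it announces, and emit one space. */
            size_t extra = c >= 0xF0 ? 3 : c >= 0xE0 ? 2 : c >= 0xC0 ? 1 : 0;
            i++;
            while (extra > 0 && i < in.size &&
                   ((unsigned char)in.data[i] & 0xC0) == 0x80) {
                i++;
                extra--;
            }
            dst[out++] = ' ';
        }
    }
    return out;
}
```

Passing the length explicitly means the caller's buffer does not need a
terminating NUL, which is the main attraction of the pointer-plus-size
design.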
>>
>> Do you mind using an iconv-like library?
>>
>>
>> On Mon, Mar 12, 2012 at 3:05 PM, Nadav Har'El <nyh at math.technion.ac.il> wrote:
>>
>>> Hi, I have a question that I was sort of sad that I couldn't readily
>>> find the answer to...
>>>
>>> Let's say I want to create a C API (a C library), with functions which
>>> take strings as arguments. What am I supposed to use if I want these
>>> strings to be in any language? Obviously the answer is "Unicode", but that
>>> doesn't really answer the question... How is Unicode used in C?
>>>
>>> As far as I can see, there are two major approaches to this problem.
>>>
>>> One approach, used in the Win32 C APIs on MS-Windows, and also in Java
>>> and other languages, is to use "wide characters": characters 16 or 32
>>> bits in size, with strings being arrays of such characters.
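
One detail worth keeping in mind about the 16-bit variant: with UTF-16, a
"wide character" is not always a whole character, because code points above
U+FFFF take two 16-bit units (a surrogate pair). A minimal sketch, with a
hypothetical function name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16. Writes 1 or 2 units to `out`
 * and returns how many were written, or 0 if `cp` is not a valid scalar
 * value (a surrogate, or above U+10FFFF). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* fits in a single unit */
        return 1;
    }

    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

Hebrew letters such as U+05D0 fit in one unit, while U+1F600 becomes the pair
0xD83D 0xDE00, so only the 32-bit form gives a fixed one-unit-per-code-point
layout.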
>>>
>>> The second approach, proposed by Plan 9, is to use UTF-8.
>>>
>>> I personally prefer the UTF-8 approach, because it fits naturally
>>> with C's "char *" type and with Linux's system calls (which take char*,
>>> not any sort of wide characters), but I'm not at all sure that this is
>>> what users actually want. If not, then I wonder, why?
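
As an illustration of how UTF-8 fits "char *", here is a minimal sketch (the
function name is made up) that splits UTF-8 text on spaces using only the
byte-oriented <string.h> functions. It works because every byte of a
multi-byte UTF-8 sequence is 0x80 or above, so an ASCII separator can never
match in the middle of a character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count space-separated words in a NUL-terminated UTF-8 string.
 * No decoding is needed: the string is handled as plain bytes. */
size_t count_words(const char *utf8)
{
    size_t words = 0;

    assert(utf8 != NULL);
    while (*utf8 != '\0') {
        size_t len;

        utf8 += strspn(utf8, " ");    /* skip any run of separators */
        len = strcspn(utf8, " ");     /* length of the next word, in bytes */
        if (len > 0)
            words++;
        utf8 += len;
    }
    return words;
}
```

Note that the lengths involved are byte counts, not character counts; code
that needs the latter has to decode the string.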
>>>
>>> Some background on this question: People have been complaining for years
>>> that Hspell, and in particular the libhspell functions, use ISO-8859-8
>>> instead of "Unicode". But if one wants to add Unicode to libhspell, what
>>> should it be? UTF-8? Wide chars (UTF-16 or UTF-32)?
>>>
>>> Thanks,
>>> Nadav.
>>>
>>> --
>>> Nadav Har'El                        | Monday, Mar 12 2012
>>> nyh at math.technion.ac.il             |-----------------------------------------
>>> Phone +972-523-790466, ICQ 13349191 | We could wipe out world hunger if we knew
>>> http://nadav.harel.org.il           | how to make AOL's Free CD's edible!
>>>
>>> _______________________________________________
>>> Linux-il mailing list
>>> Linux-il at cs.huji.ac.il
>>> http://mailman.cs.huji.ac.il/mailman/listinfo/linux-il
>>>