Unicode in C

Tue Mar 13 13:19:01 IST 2012

Hi,

2012/3/13 Elazar Leibovich <elazarl at gmail.com>

> 2012/3/13 kobi zamir <kobi.zamir at gmail.com>
>
>>
>>
>>> So I guess that you're also in the UTF-8 camp.
>>>
>>
>> yes, but my opinion about utf-8 is just my opinion. i like python and
>> python defaults to utf-8.
>>
>
> Python's internal representation is not UTF-8, but UTF-16 or UTF-32,
> depending on build parameters. Thus python doesn't really support code points
> above the BMP.
> Of course, you cannot know the internal representation, since python
> (cleverly) does not allow you to cast a unicode string to a sequence of
> bytes without specifying the result encoding.
>
> http://docs.python.org/c-api/unicode.html
>
> (see also this very good presentation <http://98.245.80.27/tcpc/OSCON2011/gbu.html> on internal unicode representations in various languages).
>
>
Nitpick: It's actually UCS-2/UCS-4 (which preceded UTF-16/UTF-32 but are
compatible with them).

Actually one can know the internal representation by checking
sys.maxunicode [1]. I'm using it in python-bidi to manually handle
surrogate pairs if needed [2].
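
To make that concrete, here is a rough sketch of the idea (not the actual
python-bidi code; the names IS_NARROW_BUILD and iter_code_points are made up
for illustration). It assumes a narrow build reports sys.maxunicode == 0xFFFF
and that each surrogate pair should be joined into a single code point:

```python
import sys

# Assumption: a narrow (UCS-2) build reports 0xFFFF here,
# while a wide (UCS-4) build reports 0x10FFFF.
IS_NARROW_BUILD = sys.maxunicode == 0xFFFF


def iter_code_points(text, narrow=IS_NARROW_BUILD):
    """Yield integer code points, joining surrogate pairs when narrow is true."""
    i = 0
    n = len(text)
    while i < n:
        unit = ord(text[i])
        # A high surrogate followed by a low surrogate encodes one
        # code point above the BMP.
        if narrow and 0xD800 <= unit <= 0xDBFF and i + 1 < n:
            low = ord(text[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
                i += 2
                continue
        # Anything else (including an unpaired surrogate) passes through as is.
        yield unit
        i += 1
```

The narrow flag is a parameter only so the pairing logic can be exercised on
a wide build as well; normally you would leave it at its default.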

[1] http://docs.python.org/dev/library/sys.html#sys.maxunicode
[2] https://github.com/MeirKriheli/python-bidi/blob/master/src/bidi/algorithm.py#L46

Cheers
-- 
Meir