Sunday, January 20, 2008

The Microsoft approach to Unicode

OS X and Linux use UTF-8 to represent Unicode text. The advantage of UTF-8 is that ASCII text needs only one byte per character, so converting apps to Unicode is relatively easy. The downside is that characters are variable width (one to four bytes each), which complicates string manipulation.
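To make the variable-width point concrete, here is a small Python sketch. Python is used purely for illustration, and the four sample characters and the helper name `utf8_len` are my own choices, not anything from a particular API.

```python
# Sample characters chosen for illustration only:
# "A" (U+0041), e-acute (U+00E9), euro sign (U+20AC), an emoji (U+1F600).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def utf8_len(ch):
    """Number of bytes UTF-8 needs to encode a single character."""
    return len(ch.encode("utf-8"))

# Map each character to its encoded width in bytes.
utf8_widths = {ch: utf8_len(ch) for ch in samples}

for ch, width in utf8_widths.items():
    print(f"U+{ord(ch):04X} -> {width} byte(s)")
```

The ASCII letter takes one byte and the others take two, three and four, so finding the n-th character means walking the string from the start.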

For Windows NT (and, to a much more limited extent, Windows 95/98), Microsoft used UCS-2, which uses exactly two bytes for every character, so it's easy to manipulate strings once you convert all your chars to WCHAR and all your string literals to L"string" (or to TCHAR and _T("string") if you use the generic-text macros). But a 16-bit WCHAR can only represent 65,536 distinct values, which is not enough for some Asian scripts and certain Unicode symbols.

Once Microsoft realized that 65,536 characters would not, after all, suit everyone, it moved Windows 2000 from UCS-2 to UTF-16, which still uses 2 bytes for most characters but needs 4 bytes (a pair of 16-bit units) for some.
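A quick way to see the two-versus-four-byte split is to look at the 16-bit code units UTF-16 produces. The sketch below is in Python for illustration only; `utf16_units` is a hypothetical helper name, and the two sample characters are arbitrary picks from inside and outside the first 65,536 code points.

```python
def utf16_units(ch):
    """Return the UTF-16 code units for one character as a list of ints."""
    # "utf-16-le" gives little-endian units without a byte order mark.
    data = ch.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

bmp = utf16_units("\u20ac")          # U+20AC EURO SIGN, fits in one unit
astral = utf16_units("\U0001D11E")   # U+1D11E MUSICAL SYMBOL G CLEF, does not

print([hex(u) for u in bmp])
print([hex(u) for u in astral])
```

The euro sign comes out as a single unit, while the clef comes out as two units in the range 0xD800 to 0xDFFF, the so-called surrogate pair. Code that treats each 16-bit unit as one character will split that pair in half.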

So the Microsoft approach to Unicode now combines the worst of both worlds: you need twice the memory for ASCII text, programs are hard to upgrade, and the encoding is once again variable width, so you can't manipulate strings easily.
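The "twice the memory" claim is easy to check for pure ASCII text. This is a minimal Python sketch with an arbitrary sample string; it only measures encoded size, not any per-string overhead a real runtime adds.

```python
text = "The quick brown fox"  # arbitrary ASCII-only sample

utf8_size = len(text.encode("utf-8"))
# "utf-16-le" avoids counting the 2-byte byte order mark.
utf16_size = len(text.encode("utf-16-le"))

print(f"UTF-8:  {utf8_size} bytes")
print(f"UTF-16: {utf16_size} bytes")
```

For ASCII-only input the UTF-16 size is exactly double the UTF-8 size. For text dominated by other scripts the ratio changes, so this covers only the ASCII case mentioned above.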