UTF-16 Support in Java

Unicode Technical Report #17 says "An example of the first are Java String and char APIs, which use UTF-16 code units [Whistler 99]." However, the current Java specification and implementation do not support UTF-16 properly.
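
As a small illustration of what "UTF-16 code units" means for the String and char APIs, the following sketch lists the code units of a string holding one character outside the Basic Multilingual Plane. The class name CodeUnitDemo and the helper codeUnits are hypothetical names chosen for this example; only String.length(), String.charAt() and Integer.toHexString() are standard library calls.

```java
class CodeUnitDemo {
    // Lists the UTF-16 code units of s in hexadecimal, separated by spaces.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+10000 written as the surrogate pair D800 DC00.
        String s = "\uD800\uDC00";
        System.out.println(s.length());    // prints 2
        System.out.println(codeUnits(s));  // prints d800 dc00
    }
}
```

The single character U+10000 is seen by these APIs as two chars, one per surrogate.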


1, Java Character Encoding Model

Encoding Conversions


2, UTF-8

Differences between UTF-8 format and Java modified UTF-8 format

  1. The null byte (byte)0 is encoded using the 2-byte format rather than the 1-byte format, so that Java virtual machine UTF-8 strings never have embedded nulls.
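  2. Only the 1-byte, 2-byte, and 3-byte formats are used. The Java virtual machine does not recognize the longer UTF-8 formats, which standard UTF-8 uses for characters that UTF-16 represents as surrogate pairs. Instead, each surrogate is encoded separately in the 3-byte format, so a surrogate pair takes 6 bytes in Java modified UTF-8 format.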
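
Both differences can be observed with java.io.DataOutputStream.writeUTF, which is documented to write modified UTF-8 after a 2-byte length prefix. The sketch below assumes a recent JDK (Java 8 or later) whose standard UTF-8 encoder produces the 4-byte format; the class name ModifiedUtf8Demo and the helpers modifiedUtf8 and hex are hypothetical names for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Encodes s with DataOutputStream.writeUTF and drops the 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Formats bytes as upper-case hexadecimal separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        String pair = "\uD800\uDC00"; // U+10000 as a surrogate pair

        System.out.println(hex(modifiedUtf8(nul)));                     // C0 80
        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8)));  // 00
        System.out.println(hex(modifiedUtf8(pair)));                    // ED A0 80 ED B0 80
        System.out.println(hex(pair.getBytes(StandardCharsets.UTF_8))); // F0 90 80 80
    }
}
```

The null character takes two bytes instead of one, and the surrogate pair takes six bytes instead of four.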

UTF-8 Format (The Unicode Standard Version 2.0)

Unicode value        1st Byte      2nd Byte  3rd Byte  4th Byte
000000000xxxxxxx     0xxxxxxx      -         -         -
00000yyyyyxxxxxx     110yyyyy      10xxxxxx  -         -
zzzzyyyyyyxxxxxx     1110zzzz      10yyyyyy  10xxxxxx  -
110110wwwwzzzzyy +
110111yyyyxxxxxx     11110uuu (a)  10uuzzzz  10yyyyyy  10xxxxxx

a. where uuuuu = wwww + 1 (to account for the addition of 10000 hexadecimal, as in Section 3.7, Surrogates)
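
The 4-byte row and its footnote can be worked through in code. The following is a minimal sketch, not a production encoder: it extracts the wwww, zzzz, yy, yyyy and xxxxxx bit fields from a surrogate pair, computes uuuuu = wwww + 1, and assembles the four bytes 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx. The class name SurrogateToUtf8 and the method encode are hypothetical names for this example.

```java
class SurrogateToUtf8 {
    // Encodes a UTF-16 surrogate pair in the 4-byte UTF-8 format.
    static byte[] encode(char high, char low) {
        if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        // High surrogate: 110110 wwww zzzz yy
        int wwww = (high >> 6) & 0xF;
        int zzzz = (high >> 2) & 0xF;
        int yyHigh = high & 0x3;
        // Low surrogate: 110111 yyyy xxxxxx
        int yyyyLow = (low >> 6) & 0xF;
        int xxxxxx = low & 0x3F;
        // Footnote a: uuuuu = wwww + 1 accounts for the offset of 10000 hexadecimal.
        int uuuuu = wwww + 1;

        return new byte[] {
            (byte) (0xF0 | (uuuuu >> 2)),                // 11110uuu
            (byte) (0x80 | ((uuuuu & 0x3) << 4) | zzzz), // 10uuzzzz
            (byte) (0x80 | (yyHigh << 4) | yyyyLow),     // 10yyyyyy
            (byte) (0x80 | xxxxxx)                       // 10xxxxxx
        };
    }

    public static void main(String[] args) {
        // U+10000 is the surrogate pair D800 DC00.
        for (byte b : encode((char) 0xD800, (char) 0xDC00)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // the line printed is: F0 90 80 80
    }
}
```

For D800 DC00 every field is zero, so uuuuu = 1 and the result is F0 90 80 80, the standard UTF-8 encoding of U+10000.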

Java modified UTF-8 Format

Unicode value        1st Byte  2nd Byte  3rd Byte  4th Byte
0000000000000000     11000000  10000000  -         -
000000000xxxxxxx     0xxxxxxx  -         -         -
00000yyyyyxxxxxx     110yyyyy  10xxxxxx  -         -
zzzzyyyyyyxxxxxx     1110zzzz  10yyyyyy  10xxxxxx  -
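
The modified format can be sketched as a per-char encoder. This is an illustrative implementation under the assumption that the null character takes the 2-byte form C0 80 (the 2-byte format applied to an all-zero value, as stated in difference 1) and that every other char, including each surrogate, falls into the 1-byte, 2-byte, or 3-byte case. The class name ModifiedUtf8Encoder and the method encodeChar are hypothetical names for this example.

```java
class ModifiedUtf8Encoder {
    // Encodes one UTF-16 code unit in Java modified UTF-8.
    static byte[] encodeChar(char c) {
        if (c >= 0x0001 && c <= 0x007F) {
            // 1-byte format: 0xxxxxxx
            return new byte[] { (byte) c };
        } else if (c <= 0x07FF) {
            // 2-byte format: 110yyyyy 10xxxxxx (also used for the null character)
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))
            };
        } else {
            // 3-byte format: 1110zzzz 10yyyyyy 10xxxxxx (surrogates included)
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        char[] samples = { 0x0000, 'A', 0x00E9, 0x3042, 0xD800 };
        for (char c : samples) {
            StringBuilder sb = new StringBuilder();
            for (byte b : encodeChar(c)) {
                sb.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(Integer.toHexString(c) + " -> " + sb.toString().trim());
        }
    }
}
```

Because the encoder works on one char at a time, it never produces a 4-byte sequence: a surrogate pair always comes out as two 3-byte sequences.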

Java modified UTF-8 format was designed for Unicode 1.0 and does not follow the changes made between Unicode 1.0 and 2.0.


References

[Whistler & Davis 1999]
Ken Whistler, Mark Davis: Character Encoding Model, Unicode Technical Report #17, 1999.
http://www.unicode.org/unicode/reports/tr17/
[Dürst & Yergeau 1999]
Martin J. Dürst, François Yergeau: Character Model for the World Wide Web, World Wide Web Consortium Working Draft, 1999.
http://www.w3.org/TR/charmod/
[UTF-16]
ISO/IEC JTC1/SC2/WG2: Transformation Format for 16 Planes of Group 00 (UTF-16), 1993.
http://wwwold.dkuug.dk/jtc1/sc2/wg2/docs/n1334
[Lindholm & Yellin 1999]
Tim Lindholm, Frank Yellin: The Java Virtual Machine Specification, Second Edition, 1999.
http://java.sun.com/docs/books/vmspec/index.html

Kazuhiro Kazama