UTF-16 Support in Java
Unicode Technical Report #17 says "An example of the first are
Java String and char APIs, which use UTF-16 code units [Whistler 99]."
However, the current Java specification and implementation do not support
UTF-16 properly.
1, Java Character Encoding Model
- ACR (Abstract Character Repertoire) level
- The set of abstract characters to be encoded.
- Data structures
- char[]
- char[] with its index range
- String
- String with its index range
- Data operations
- java.text.BreakIterator class
- CCS (Coded Character Set) level
- a mapping from an abstract character repertoire to a set of non-negative integers
- Relations between Java and CCS
- JDK 1.0.2 ... Unicode 1.1
- JDK 1.1 - 1.1.6 ... Unicode 2.0
- JDK 1.1.7 and later ... Unicode 2.1
- Data structures
- Unicode 1.0 - 1.1
- char (16 bit)
- char[] with its index number
- String with its index number
- Unicode 2.0 and later
- int (32 bit)
- char[]
- char[] with its index range
- String
- String with its index range
- Data operations
- Unicode 1.0 - 1.1
- java.lang.Character class
- = CEF level operations
- Unicode 2.0 and later
- java.lang.Character class (but its methods still take "char")
- No iterator operation is provided.
- CEF (Character Encoding Form) level
- a mapping from a set of non-negative integers (from a CCS) to a set of sequences of particular code units of some specified width.
- code unit's width = 16 bit
- Relations between Java and CEF
- JDK 1.0.2 ... UCS-2 (fixed-width encoding)
- JDK 1.1 and later ... UTF-16 (surrogate pairs) (variable-width encoding)
- Data Structures
- char
- char[] with its index number
- String with its index number
- Data operations
- index increment & decrement
- java.text.CharacterIterator and its subclasses
- CES (Character Encoding Scheme) level
- A mapping from a set of sequences of code units (from one or more CEFs) to a serialized sequence of bytes.
- Two types of CES
- java.io.DataInputStream & java.io.DataOutputStream classes and their subclasses
- readUTF()/writeUTF()
- Java modified UTF-8 format
- java.io.Reader & java.io.Writer and their subclasses
- byte-to-char & char-to-byte converters in a sun.io package
- UTF-8, UTF-16BE, UTF-16LE, etc.
- Data Structures
Encoding Conversions
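The gap between the CCS level (one integer per character) and the CEF level (16-bit code units, with surrogate pairs) can be illustrated with a minimal sketch. The class CodePointWalk and its methods are hypothetical helpers written for this illustration, not part of the JDK; the sketch assumes that an unpaired surrogate is simply passed through as its own value.

```java
// Hypothetical helper, not a JDK class: walks a String at the CEF level
// (16-bit code units) and produces CCS-level values (one int per character).
final class CodePointWalk {
    static boolean isHighSurrogate(char c) { return c >= 0xD800 && c <= 0xDBFF; }
    static boolean isLowSurrogate(char c)  { return c >= 0xDC00 && c <= 0xDFFF; }

    static int[] toCodePoints(String s) {
        int[] tmp = new int[s.length()];   // never more code points than code units
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isHighSurrogate(c) && i + 1 < s.length()
                    && isLowSurrogate(s.charAt(i + 1))) {
                // Combine the pair into one value in the range 0x10000..0x10FFFF.
                char low = s.charAt(++i);
                tmp[n++] = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
            } else {
                // Ordinary character, or an unpaired surrogate passed through (assumption).
                tmp[n++] = c;
            }
        }
        int[] result = new int[n];
        System.arraycopy(tmp, 0, result, 0, n);
        return result;
    }

    public static void main(String[] args) {
        String s = "A\uD800\uDC00B";       // 'A', one surrogate pair, 'B'
        int[] cps = toCodePoints(s);
        System.out.println("code units:  " + s.length());
        System.out.println("code points: " + cps.length);
        for (int cp : cps) {
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
        }
    }
}
```

The index arithmetic here is what "index increment & decrement" at the CEF level leaves to the caller: advancing by one char is not the same as advancing by one character once surrogate pairs are present.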
2, UTF-8
Differences between UTF-8 format and Java modified UTF-8 format
- The null byte (byte)0 is encoded using the 2-byte format rather
than the 1-byte format, so that Java virtual machine UTF-8 strings
never have embedded nulls.
- Only the 1-byte, 2-byte, and 3-byte formats are used.
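The Java virtual machine does not recognize the longer UTF-8 formats, which standard UTF-8 uses for characters that UTF-16 represents as surrogate pairs.
Instead, each surrogate is encoded separately in the 3-byte format, so a surrogate pair takes 6 bytes in Java modified UTF-8 format!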
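Both differences can be observed directly. The following is a minimal sketch, assuming Java 8 or later (for java.io.UncheckedIOException and java.nio.charset.StandardCharsets); the class ModifiedUtf8Demo and its helper methods are hypothetical names introduced for this illustration. It compares the bytes produced by DataOutputStream.writeUTF() with those of the standard UTF-8 encoder.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares Java modified UTF-8 with standard UTF-8.
final class ModifiedUtf8Demo {
    // Bytes written by DataOutputStream.writeUTF, without its 2-byte length prefix.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            byte[] body = new byte[all.length - 2];
            System.arraycopy(all, 2, body, 0, body.length);
            return body;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\0";                 // the null character
        String pair = "\uD800\uDC00";      // surrogate pair for U+10000
        System.out.println("null, modified: " + hex(modifiedUtf8(nul)));
        System.out.println("null, standard: " + hex(nul.getBytes(StandardCharsets.UTF_8)));
        System.out.println("pair, modified: " + hex(modifiedUtf8(pair)));
        System.out.println("pair, standard: " + hex(pair.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The null character should come out as C0 80 in the modified format against 00 in standard UTF-8, and the surrogate pair as the 6 bytes ED A0 80 ED B0 80 against the 4 bytes F0 90 80 80.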
UTF-8 Format (The Unicode Standard Version 2.0)
| Unicode value                       | 1st Byte     | 2nd Byte | 3rd Byte | 4th Byte |
|-------------------------------------|--------------|----------|----------|----------|
| 000000000xxxxxxx                    | 0xxxxxxx     | -        | -        | -        |
| 00000yyyyyxxxxxx                    | 110yyyyy     | 10xxxxxx | -        | -        |
| zzzzyyyyyyxxxxxx                    | 1110zzzz     | 10yyyyyy | 10xxxxxx | -        |
| 110110wwwwzzzzyy + 110111yyyyxxxxxx | 11110uuu (a) | 10uuzzzz | 10yyyyyy | 10xxxxxx |
a. where uuuuu = wwww + 1 (to account for addition of 10000₁₆ as in Section 3.7, Surrogates)
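The 4-byte row and footnote (a) can be worked through in code. This is a sketch under the assumption that the inputs form a valid surrogate pair; the class SurrogateToUtf8 and the method encodePair are hypothetical names, and the local variables follow the bit-field letters used in the table.

```java
// Hypothetical helper: applies the 4-byte UTF-8 row to one surrogate pair.
final class SurrogateToUtf8 {
    static byte[] encodePair(char high, char low) {
        if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        // High surrogate layout: 110110 wwww zzzz yy
        int wwww   = (high >> 6) & 0x0F;
        int zzzz   = (high >> 2) & 0x0F;
        int yyHigh = high & 0x03;          // upper 2 bits of yyyyyy
        // Low surrogate layout: 110111 yyyy xxxxxx
        int yyLow  = (low >> 6) & 0x0F;    // lower 4 bits of yyyyyy
        int xxxxxx = low & 0x3F;
        // Footnote (a): uuuuu = wwww + 1 accounts for the added 0x10000.
        int uuuuu  = wwww + 1;
        return new byte[] {
            (byte) (0xF0 | (uuuuu >> 2)),                  // 11110uuu
            (byte) (0x80 | ((uuuuu & 0x03) << 4) | zzzz),  // 10uuzzzz
            (byte) (0x80 | (yyHigh << 4) | yyLow),         // 10yyyyyy
            (byte) (0x80 | xxxxxx)                         // 10xxxxxx
        };
    }

    public static void main(String[] args) {
        byte[] utf8 = encodePair('\uD800', '\uDC00');      // U+10000
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(sb.toString().trim());
    }
}
```

For the pair D800 DC00 the fields wwww, zzzz, yyyyyy and xxxxxx are all zero, so uuuuu = 1 and the result is F0 90 80 80.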
Java modified UTF-8 Format
| Unicode value    | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
|------------------|----------|----------|----------|----------|
| 0000000000000000 | 11000000 | 10000000 | -        | -        |
| 000000000xxxxxxx | 0xxxxxxx | -        | -        | -        |
| 00000yyyyyxxxxxx | 110yyyyy | 10xxxxxx | -        | -        |
| zzzzyyyyyyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | -        |
Java modified UTF-8 format was designed for Unicode 1.0 and does not reflect the changes made between Unicode 1.0 and 2.0.
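The Java modified UTF-8 rules can be sketched as a per-char encoder. The class ModifiedUtf8Encoder is a hypothetical illustration, not the JDK's implementation; it omits the 2-byte length prefix that writeUTF() emits, and it assumes the null character takes the 2-byte form 11000000 10000000 as described in the list of differences.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical encoder: applies the Java modified UTF-8 rules to each 16-bit char.
final class ModifiedUtf8Encoder {
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                // 1-byte format: 0xxxxxxx (non-zero values only)
                out.write(c);
            } else if (c <= 0x07FF) {
                // 2-byte format: 110yyyyy 10xxxxxx (also used for the null character)
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                // 3-byte format: 1110zzzz 10yyyyyy 10xxxxxx
                // Surrogates land here too, one at a time, giving 6 bytes per pair.
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("\0A\uD800\uDC00");
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(sb.toString().trim());
    }
}
```

Because the encoder only ever looks at one char at a time, it has no notion of a surrogate pair, which is the Unicode 1.0 assumption described above.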
References
- [Whistler & Davis 1999]
- Ken Whistler, Mark Davis:
Character Encoding Model,
Unicode Technical Report #17,
1999.
http://www.unicode.org/unicode/reports/tr17/
- [Dürst & Yergeau 1999]
- Martin J. Dürst,
François Yergeau:
Character Model for the World Wide Web,
World Wide Web Consortium Working Draft,
1999.
http://www.w3.org/TR/charmod/
- [UTF-16]
- ISO/IEC JTC1/SC2/WG2:
Transformation Format for 16 Planes of Group 00 (UTF-16),
1993.
http://wwwold.dkuug.dk/jtc1/sc2/wg2/docs/n1334
- [Lindholm & Yellin 1999]
- Tim Lindholm,
Frank Yellin:
The Java Virtual Machine Specification Second Edition.
http://java.sun.com/docs/books/vmspec/index.html
Kazuhiro Kazama