Character Set

From Oracle FAQ

Introduction

A character set is the set of codes that represent each supported character. Each such code is called a code point.

There are two kinds of character sets:

  • fixed-length character sets, where each character is represented by the same number of bytes, for instance 1 byte for the West European WE8MSWIN1252 character set and 2 bytes for the Unicode AL16UTF16 one.
  • variable-length character sets, where each character is represented by a variable number of bytes, for instance 1 to 3 bytes for the older Unicode UTF8 or 1 to 4 bytes for the newer Unicode AL32UTF8.
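As a rough illustration of the two kinds, the Python sketch below counts the bytes produced for a few characters. It assumes that Python's cp1252, utf-16-be and utf-8 codecs are acceptable stand-ins for WE8MSWIN1252, AL16UTF16 and AL32UTF8; it does not query an Oracle database, and the helper name byte_lengths is hypothetical.

```python
# Python codecs used as stand-ins for the Oracle character sets (an assumption).
CODECS = {
    "WE8MSWIN1252": "cp1252",
    "AL16UTF16": "utf-16-be",
    "AL32UTF8": "utf-8",
}


def byte_lengths(text, codec):
    """Return the number of bytes used to encode each character of text."""
    return [len(ch.encode(codec)) for ch in text]


sample = "A\u00e9\u20ac"  # U+0041 (A), U+00E9 (e acute), U+20AC (euro sign)
lengths = {name: byte_lengths(sample, codec) for name, codec in CODECS.items()}

for name, sizes in lengths.items():
    print(name, sizes)
```

With this sample, the first two stand-ins use the same number of bytes for every character, while the UTF-8 stand-in uses a different number for each one.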


Unicode character sets

Unicode is a standard, maintained by the Unicode Consortium and kept in sync with ISO/IEC 10646, that assigns a value to each character. Several character sets have been defined from it; the best known are UCS2, AL16UTF16, UTF8 and AL32UTF8.

UCS2 and AL16UTF16 are fixed-length character sets that encode each character on 2 bytes. The difference between them is that UCS2 does not take platform endianness into account whereas AL16UTF16 does. This means that, in a dump of a string, the bytes of each character are swapped between a big-endian and a little-endian platform in UCS2, whereas they are in the same order with AL16UTF16.
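The byte swapping can be observed with Python's utf-16-be and utf-16-le codecs. This is only a sketch of the two byte orders: the codecs are not Oracle's UCS2 or AL16UTF16 implementations, and swap_pairs is a hypothetical helper.

```python
text = "A\u00e9"  # U+0041, U+00E9

big_endian = text.encode("utf-16-be")     # most significant byte first
little_endian = text.encode("utf-16-le")  # least significant byte first


def swap_pairs(data):
    """Swap the two bytes of every 16-bit unit."""
    return b"".join(data[i:i + 2][::-1] for i in range(0, len(data), 2))


print(big_endian.hex(" "))     # 00 41 00 e9
print(little_endian.hex(" "))  # 41 00 e9 00
```

Swapping the bytes of each 2-byte unit turns one dump into the other, which is the difference a reader would see between the two platforms.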

To support more than the 65,536 characters allowed by 2 bytes, these character sets have been extended to use 2 groups of 2 bytes.

AL32UTF8 is an extension of UTF8 that supports more character families, including ones that do not belong to spoken languages.

The following table gives the matching values for the code points:

Unicode code            Unicode representation        UCS2 / AL16UTF16                           (AL32)UTF8
U+0000 – U+007F         00000000 0xxxxxxx             00000000 0xxxxxxx                          0xxxxxxx
U+0080 – U+07FF         00000yyy yyxxxxxx             00000yyy yyxxxxxx                          110yyyyy 10xxxxxx
U+0800 – U+FFFF (*)     zzzzyyyy yyxxxxxx             zzzzyyyy yyxxxxxx                          1110zzzz 10yyyyyy 10xxxxxx
U+10000 – U+10FFFF      000uuuuu zzzzyyyy yyxxxxxx    110110ww wwzzzzyy 110111yy yyxxxxxx (**)   11110uuu 10uuzzzz 10yyyyyy 10xxxxxx

(*) Unicode codes from U+D800 to U+DFFF are not valid. These codes start with the bit string 11011.
(**) Extended Unicode codes (> U+FFFF) are represented with 2 groups of 2 UCS2 bytes; these groups (each starting with the bit string 11011) fall within the invalid range of strict Unicode codes and so cannot be misinterpreted. (Note: in the table, wwww = uuuuu – 1.)
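The bit patterns of the table can be applied by hand, as in the Python sketch below. The function names utf8_bytes and utf16_units are hypothetical, and the code illustrates the table rather than how Oracle implements these character sets.

```python
def utf8_bytes(cp):
    """Encode a code point with the (AL32)UTF8 bit patterns of the table."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("U+D800 to U+DFFF are not valid codes")
    if cp <= 0x7F:       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 110yyyyy 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


def utf16_units(cp):
    """Return the one or two 16-bit units of the UCS2 / AL16UTF16 column."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("U+D800 to U+DFFF are not valid codes")
    if cp <= 0xFFFF:
        return [cp]
    wwww = (cp >> 16) - 1                              # wwww = uuuuu - 1
    high = 0xD800 | (wwww << 6) | ((cp >> 10) & 0x3F)  # 110110ww wwzzzzyy
    low = 0xDC00 | (cp & 0x3FF)                        # 110111yy yyxxxxxx
    return [high, low]


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    units = " ".join(f"{u:04X}" for u in utf16_units(cp))
    print(f"U+{cp:04X}", "|", units, "|", utf8_bytes(cp).hex(" "))
```

For U+1F600, for example, uuuuu is 00001, so wwww is 0000 and the two 16-bit groups are D83D and DE00, both inside the U+D800 to U+DFFF range.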

External links

Unicode Consortium