UTF8(5) BSD File Formats Manual UTF8(5)
NAME
utf8 -- UTF-8, a transformation format of ISO 10646
SYNOPSIS
ENCODING "UTF-8"
DESCRIPTION
The UTF-8 encoding represents UCS-4 characters as a sequence of octets,
using between 1 and 6 for each character. It is backwards compatible
with ASCII, so 0x00-0x7f refer to the ASCII character set. The multibyte
encoding of non-ASCII characters consists entirely of bytes whose
high-order bit is set. The actual encoding is represented by the following
table:
[0x00000000 - 0x0000007f] [00000000.0bbbbbbb] -> 0bbbbbbb
[0x00000080 - 0x000007ff] [00000bbb.bbbbbbbb] -> 110bbbbb, 10bbbbbb
[0x00000800 - 0x0000ffff] [bbbbbbbb.bbbbbbbb] ->
        1110bbbb, 10bbbbbb, 10bbbbbb
[0x00010000 - 0x001fffff] [00000000.000bbbbb.bbbbbbbb.bbbbbbbb] ->
        11110bbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
[0x00200000 - 0x03ffffff] [000000bb.bbbbbbbb.bbbbbbbb.bbbbbbbb] ->
        111110bb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
[0x04000000 - 0x7fffffff] [0bbbbbbb.bbbbbbbb.bbbbbbbb.bbbbbbbb] ->
        1111110b, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb, 10bbbbbb
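The table can be read as an encoding procedure: pick the first row whose
range contains the value, emit a lead byte carrying the row's marker bits and
the most significant payload bits, then emit one 10bbbbbb byte per remaining
six bits. The following is a minimal sketch of that reading in Python, not
the system's implementation; the names encode_ucs4 and _RANGES are
hypothetical and chosen only for illustration.

```python
# Each row of the table: (upper bound of range, lead-byte marker bits,
# number of continuation bytes). Hypothetical helper, for illustration only.
_RANGES = [
    (0x0000007F, 0x00, 0),
    (0x000007FF, 0xC0, 1),
    (0x0000FFFF, 0xE0, 2),
    (0x001FFFFF, 0xF0, 3),
    (0x03FFFFFF, 0xF8, 4),
    (0x7FFFFFFF, 0xFC, 5),
]


def encode_ucs4(value):
    """Encode one UCS-4 value (0 .. 0x7fffffff) as 1 to 6 octets."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit range")
    # The first matching row is the shortest form for this value.
    for upper, marker, ncont in _RANGES:
        if value <= upper:
            break
    # Lead byte: marker bits plus the most significant payload bits.
    out = [marker | (value >> (6 * ncont))]
    # Continuation bytes: 10bbbbbb, six payload bits each, high to low.
    for shift in range(6 * (ncont - 1), -1, -6):
        out.append(0x80 | ((value >> shift) & 0x3F))
    return bytes(out)
```

Because the rows are tried from shortest to longest, the sketch only ever
produces the shortest representation of a value.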
If more than a single representation of a value exists (for example,
0x00; 0xC0 0x80; 0xE0 0x80 0x80), the shortest representation is always
used. Longer ones are rejected as errors because they pose a potential
security risk and destroy the 1:1 character:octet sequence mapping.
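One way to detect a non-shortest form is to decode the sequence and then
check that the result is not smaller than the lowest value of the range for
that sequence length. The sketch below illustrates this check in Python under
that assumption; decode_one is a hypothetical name, and this is not a
description of how the system's decoder is written.

```python
def decode_one(octets):
    """Decode the single character at the start of octets.

    Returns (value, length). Raises ValueError on malformed or
    non-shortest input. Hypothetical helper, for illustration only.
    """
    # (lead-byte mask, marker, continuation count, smallest value
    # that is allowed to use this sequence length)
    forms = [
        (0x80, 0x00, 0, 0x00000000),
        (0xE0, 0xC0, 1, 0x00000080),
        (0xF0, 0xE0, 2, 0x00000800),
        (0xF8, 0xF0, 3, 0x00010000),
        (0xFC, 0xF8, 4, 0x00200000),
        (0xFE, 0xFC, 5, 0x04000000),
    ]
    if not octets:
        raise ValueError("empty input")
    lead = octets[0]
    for mask, marker, ncont, minimum in forms:
        if lead & mask == marker:
            break
    else:
        raise ValueError("invalid lead byte")
    if len(octets) < 1 + ncont:
        raise ValueError("truncated sequence")
    # Payload bits of the lead byte are the ones outside the mask.
    value = lead & ~mask & 0xFF
    for b in octets[1:1 + ncont]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (b & 0x3F)
    # A value below the range minimum could have been encoded in fewer octets.
    if value < minimum:
        raise ValueError("non-shortest form")
    return value, 1 + ncont
```

With this check, decode_one(b"\x00") succeeds, while the longer spellings of
the same value from the example above, b"\xc0\x80" and b"\xe0\x80\x80", raise
ValueError.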
COMPATIBILITY
The utf8 encoding supersedes the utf2(5) encoding. The only differences
between the two are that utf8 handles the full 31-bit character set of
ISO 10646 whereas utf2(5) is limited to a 16-bit character set, and that
utf2(5) accepts redundant, non-"shortest form" representations of
characters.
SEE ALSO
euc(5), utf2(5)
F. Yergeau, UTF-8, a transformation format of ISO 10646, January 1998,
RFC 2279.
STANDARDS
The utf8 encoding is compatible with RFC 2279.
BSD October 10, 2002 BSD