Tuesday, September 28, 2010

Encoding an Unicode Character to UTF-8

Introduction

I'm living in an UTF-8 world. Everthing, from the operating systems I use (Solaris, Linux and Mac OS X) to my terminals. Even the file systems I use must be able to support the UTF-8 names I give to my files.

UTF-8 is a flexible and well supported encoding for Unicode and it's supported, out of the box, on all the operating systems I use. UTF-8 allows me not to worry (ever) about the characters I use in file names or in the files I write. Being a native Italian living in Spain, it's essential for me to have an encoding that supports all of the languages I use in my daily file.

UTF-8 is the perfect solution: it's ASCII backwards compatible, so I need no conversion when writing ASCII files, and can encode every characters in the Unicode character set. No need to worry if one day I had to write some Japanese file name. The only "drawback" it may have it's a little space overhead compared to some other encodings (such as UTF-16) but in this particular case, UTF-8 beats them all.

In such an UTF-8 world, there's no need to worry: just use the tools you have and everything will be fine.

Well, almost. Sometimes, as described in that post, I need to know exactly how UTF-8 encodes a specific Unicode code point. Some other times, since I'm a sed and grep addict, it's really handy knowing what to look for. Some days ago, for example, I had look for file names that contained specific code points to correct an idiosyncrasy between Subversion and Mac OS X (I still cannot believe it...). In such a cases, an UNIX terminal and its utilities are really your best friends. Since the encoding process is very easy, I'm quickly describing it to you in the case you need it.

The UTF-8 Encoding

The UTF-8 encoding is a variable width encoding in which each Unicode code point can be encoded as a sequence of 1 to 4 octects. Each octect is composed by a heading byte and trailing bytes. Since the encoding is a variable width one, you need a way to tell where a sequence starts and where it ends. That information is stored in the head byte.

Heading Byte

The head byte can take one of these forms:
  • 0xxxxxxx: Single byte.
  • 110xxxxx: Head of a sequence of two bytes.
  • 1110xxxx: Head of a sequence of three bytes.
  • 11110xxx: Head of a sequence of four bytes.

You may have noticed that the number of 1s in a heading byte tells how long the sequence is. If the heading byte starts with 0, the sequence is single byte.

The 0xxxxxxx format guarantees that ASCII characters in the [0, 127] range are encoded by the identical function thus guaranteeing the backwards compatibility with ASCII.

Trailing Bytes

Trailing bytes in a multibyte UTF-8 sequence always have the following form:
  • 10xxxxxx

Encoding

The 0s and 1s in the heading and trailing bytes format described in the previous sections are called control bits. In the UTF-8 encoding, the concatenation of the non control bits is the scalar value of the encoded Unicode code point. Because of the structure of the aforementioned forms, the number of required bytes to encode a specific Unicode code point with UTF-8 depends on the Unicode range where it falls (the ranges are given both hexadecimal and binary forms:
  • [U+000, U+007F] [00000000, 01111111]: one byte.
  • [U+0080, U+07FF] [10000000, 111 11111111]: two bytes.
  • [U+0800, U+FFFF] [1000 00000000, 11111111 11111111]: three bytes.
  • [U+10000, U+10FFFF] [1 00000000 00000000, 10000 00000000 00000000]: four bytes.

One Byte Range

In the one byte range, meaningful bits of the Unicode code points are in then [0, 6] range. Since the single byte form has a 1 bit overhead (the leading 0), a single byte is sufficient:

[0xxxxxxx  0xxxxxxxx]
[00000000, 011111111]

Two Bytes Range

In the two bytes range, meaningful bits of the Unicode code points are in the [0, 10] range. Since the head of a sequence of two bytes has a 3 bits overhead and the form of a trailing byte has a 2 bits overhead, there's room for exactly 11 bits.

[     fff ffssssss,      fff ffssssss]
[00000000 10000000, 00000111 11111111]

(* f are bits from the first byte of the encoded sequence)
(** s are bits from the second byte of encoded the sequence)

The encoded sequence is:

[110fffff 10ssssss]

Three Bytes Range

In the three bytes range, meaningful bits of the Unicode code points are in the [0, 15] range. Since the head of a sequence of three bytes has a 4 bits overhead and the form of a trailing byte has a 2 bits overhead, there's room for exactly 16 bits.

[ffffssss sstttttt, ffffssss sstttttt]
[00001000 00000000, 11111111 11111111]

(* f are bits from the first byte of the sequence)
(** s are bits from the second byte of the sequence)
(*** t are bits from the third byte of the sequence)

The encoded sequence is:

[1110ffff 10ssssss 10tttttt]

Four Bytes Range

In the four bytes range, meaningful bits of the Unicode code points are in the [0, 20] range. Since the head of a sequence of four bytes has a 5 bits overhead and the form of a trailing byte has a 2 bits overhead, there's room for exactly 21 bits.

[   fffss sssstttt tthhhhhh,    fffss sssstttt tthhhhhh]
[00000001 00000000 00000000, 00010000 11111111 11111111]

(* f are bits from the first byte of the sequence)
(** s are bits from the second byte of the sequence)
(*** t are bits from the third byte of the sequence)
(**** h are bits from the third byte of the sequence)

The encoded sequence is:

[11110fff 10ssssss 10tttttt 10hhhhhh]

Conclusion

You know see how easy is to convert an Unicode code point to its representation in the UTF-8 encoding. The UTF-8 encoding of the Byte Order Mark (BOM) character, for example, whose code point is U+FEFF, can be easily be computed as explained above.

The U+FEFF code point falls in the three byte range and its binary representation is:

[F    E    F    F   ]
[1111 1110 1111 1111]

According the the aforementioned rule, the conversion gives:

[E   F    B   B    B   F   ]
[11101111 10111011 10111111]


that corresponds to the \xEF\xBB\xBF expression used with sed in our previous example.

No comments: