character encoding scheme (foldoc) | character encoding
character encoding scheme
(Or "character encoding") A mapping between
binary data values and character code positions (or "code
points").
Early systems stored characters in a variety of ways,
e.g. four six-bit characters in a 24-bit word, but during
the 1960s the eight-bit byte became the most common unit of
storage, with each character stored in one byte,
typically in the ASCII character set.
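As a rough illustration of the older layout, the sketch below packs four six-bit character codes into one 24-bit word and unpacks them again. The function names and the sample code values are hypothetical and do not correspond to any particular historical character set.

```python
def pack_six_bit(codes):
    """Pack four 6-bit character codes (0..63) into one 24-bit integer."""
    if len(codes) != 4 or any(not 0 <= c < 64 for c in codes):
        raise ValueError("expected four values in the range 0..63")
    word = 0
    for c in codes:
        # Shift the word left by six bits and append the next code.
        word = (word << 6) | c
    return word


def unpack_six_bit(word):
    """Split a 24-bit word back into its four 6-bit character codes."""
    return [(word >> shift) & 0x3F for shift in (18, 12, 6, 0)]


codes = [1, 2, 3, 63]          # illustrative values only
word = pack_six_bit(codes)
print(f"{word:024b}")          # the 24-bit word in binary
print(unpack_six_bit(word))    # the four original codes
```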
In the case of ASCII, the character encoding is an
identity mapping: code position 65 maps to the byte value
65. This is possible because every ASCII code position fits
in a single byte, which can hold values from 0 to 255.
(In fact, US-ASCII uses only the values 0 to 127.)
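The identity mapping can be checked with Python's built-in "ascii" codec. This is a minimal sketch; the helper name is_identity_under_ascii is made up for this example.

```python
def is_identity_under_ascii(s: str) -> bool:
    """Return True if each character of s encodes to one byte whose
    value equals the character's code position."""
    try:
        data = s.encode("ascii")
    except UnicodeEncodeError:
        # The string contains a character outside the range 0..127.
        return False
    return list(data) == [ord(ch) for ch in s]


print(ord("A"))                 # code position of "A"
print("A".encode("ascii")[0])   # byte value of "A" in ASCII
print(is_identity_under_ascii("Hello"))
print(is_identity_under_ascii("caf\u00e9"))  # contains U+00E9, not ASCII
```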
From the late 1990s, there was increased use of larger
character sets such as Unicode and many CJK {coded
character sets}. These can represent characters from many
languages and more symbols.
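Unicode uses many more than the 256 code positions that can
be represented by one byte. It thus requires more complex
mappings: sometimes the characters are mapped onto pairs of
bytes (see DBCS). In many cases, this breaks programs that
assume a one-to-one mapping of bytes to characters, and so,
for example, treat any occurrence of the byte value 13 as a
carriage return, even when that byte is only one half of a
two-byte character. To avoid this problem, character encodings
such as UTF-8 were devised. UTF-8 encodes each ASCII character
as the same single byte and builds every other character only
from byte values of 128 or more, so a byte value of 13 always
means a carriage return.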
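The byte-13 problem can be reproduced with a single character. The sketch below uses U+010D (Latin small letter c with caron) and big-endian UTF-16 as one example of a two-byte mapping. Both choices are assumptions made for illustration, and contains_byte_13 is a hypothetical stand-in for a program that scans for carriage returns byte by byte.

```python
def contains_byte_13(data: bytes) -> bool:
    """Naive check that treats any byte with value 13 as a carriage return."""
    return 13 in data


ch = "\u010d"  # one character, and not a carriage return

two_byte = ch.encode("utf-16-be")  # bytes 0x01 0x0D: second byte is 13
utf8 = ch.encode("utf-8")          # bytes 0xC4 0x8D: both are 128 or more

print(two_byte.hex(), contains_byte_13(two_byte))  # false positive
print(utf8.hex(), contains_byte_13(utf8))          # no false positive

# A real carriage return is still the single byte 13 in UTF-8.
print("\r".encode("utf-8") == b"\x0d")
```

In UTF-8 the naive byte scan gives the right answer only because no multi-byte sequence contains a byte below 128. Counting characters by counting bytes is still wrong.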
(2015-11-29)