Character sets and localization issues

Characters, glyphs and fonts

We often speak inaccurately of character sets: we may refer to a "Greek character set" or a "Latin character set". But in order to understand how different writing systems are supported by computer systems, we need to be more precise about characters.

Users don't view or print characters: a user views or prints glyphs. A glyph is a representation of a character. The character "Capital Letter A" is represented by the glyph "A" in Times New Roman Bold, and "A" in Arial Bold. A font is a collection of glyphs. A computer is able to retrieve the appropriate glyphs by using mapping information about the keyboard, the language system in use, and the glyphs associated with each character.

Fonts are designed with character sets in mind: a font for use in Russia will include glyphs representing Cyrillic characters. Characters from different language systems are conventionally divided into different "character sets", primarily because, in the past, a limited number of characters could be "addressed" at any one time.

Character codes

Characters are represented by character codes. Character codes are generated and stored when a user inputs a document. Single-Byte character sets (SBCS) provide 256 character codes (2 to the 8th power). This is an adequate number to encode most of the characters needed for Western Europe. For example, the Windows Extended ANSI character set contains 256 characters consisting of Latin letters, Arabic numerals, punctuation, and drawing characters.

However, 256 character codes are not enough to represent all the characters needed by multi-lingual users in a single font, or by users in the Far East, where over 12,000 characters may need to be addressed at any one time. Consequently, Multi-Byte character sets (commonly known as Double-Byte character sets) are necessary. Double-Byte character sets (DBCS) mix Single-Byte and Double-Byte character encodings: common characters such as Latin letters take one byte, while the others take two. Two bytes can in principle address up to 65,536 character codes (2 to the 16th power).
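The mixture of Single-Byte and Double-Byte encodings can be observed directly. The sketch below is only an illustration: it assumes Shift JIS (a Japanese DBCS that the text above does not name) as the example character set, and the variable names are arbitrary.

```python
# Illustration only: Shift JIS is assumed here as an example of a DBCS.
text = "A\u3042"  # Latin capital letter A followed by hiragana letter A

# Encode the whole string, then each character on its own.
encoded = text.encode("shift_jis")
byte_counts = [len(ch.encode("shift_jis")) for ch in text]

print(encoded.hex(" "))  # bytes of the encoded string, in hexadecimal
print(byte_counts)       # number of bytes used by each character
```

The Latin letter keeps its one-byte code, while the hiragana letter needs a lead byte and a trail byte.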

Codepages

A codepage is a list of selected character codes in a certain order. Codepages are usually defined to support specific languages or groups of languages which share common writing systems. For example, codepage 1253 provides character codes required in the Greek writing system.
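As a small illustration of why the codepage matters, the same stored byte can stand for different characters depending on which codepage is used to interpret it. The sketch below uses Python's built-in codecs for codepages 1253 and 1252; codepage 1252 (Western European) is brought in here only for comparison and is not discussed above.

```python
# One stored byte, interpreted under two different codepages.
raw = bytes([0xE1])

greek = raw.decode("cp1253")    # Windows Greek codepage
western = raw.decode("cp1252")  # Windows Western European codepage (for comparison)

print(f"0xE1 in cp1253: {greek!r} (U+{ord(greek):04X})")
print(f"0xE1 in cp1252: {western!r} (U+{ord(western):04X})")
```

A document saved under one codepage and opened under another therefore displays the wrong characters, even though no byte has changed.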

Unicode

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode character set is intended to represent the writing systems of all of the world's major languages. Although early versions could be represented with 16 bits (65,536 characters), by version 2.0 in 1996 that had proved insufficient. The code space has since been fixed at 21 bits, running from U+0000 to U+10FFFF, which gives 1,114,112 possible code points.
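The number of bits a character needs follows directly from its Unicode number. The sketch below is a rough illustration using Python's built-in ord() and sys.maxunicode (the largest code point the interpreter accepts, U+10FFFF in current Python 3); the sample characters are arbitrary choices.

```python
import sys

# Arbitrary sample characters: Latin A, Greek alpha, euro sign, an emoji.
samples = ["A", "\u03b1", "\u20ac", "\U0001F600"]

for ch in samples:
    cp = ord(ch)  # the character's Unicode number (code point)
    print(f"U+{cp:04X} needs {cp.bit_length()} bits")

# Bits needed to address the largest possible code point.
bits_needed = sys.maxunicode.bit_length()
print(f"largest code point U+{sys.maxunicode:X} needs {bits_needed} bits")
```

Characters whose number fits in 16 bits belong to the original 65,536-character range; the last sample does not.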

At Unicode version 2.0, there were 38,885 assigned, named characters. At version 3.0, there were 49,194 assigned characters. At version 3.2, there were 95,156 assigned characters. At version 4.0, there were 96,382 assigned characters. Update: this memo was written in 2003; in 2019, version 12.1 reached 137,929 assigned characters.

While Unicode provides a single number to represent each character, different encoding schemes have been developed to meet different requirements:

- compatibility with files of 7-bit ASCII characters (e.g. UTF-8)

- variable-width encoding schemes that minimize the number of bytes required to store Unicode characters by giving the most commonly used characters the shortest codes, and that may also remain compatible with simpler schemes (e.g. UTF-16, which acts as a "superset" of UCS-2)

- fixed-length encoding schemes that avoid complex processing of strings in memory (e.g. UCS-2)
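The trade-offs in this list can be made concrete by counting the bytes each scheme uses for the same characters. This is a minimal sketch under one assumption: Python has no UCS-2 codec, so UTF-32 stands in here as the fixed-length scheme. The sample characters are arbitrary.

```python
# Arbitrary sample characters: Latin A, e with acute accent, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

# UTF-32 is used as a stand-in for a fixed-length scheme (Python has no UCS-2 codec).
encodings = ["utf-8", "utf-16-be", "utf-32-be"]

# Bytes used by each character under each encoding scheme.
sizes = {ch: [len(ch.encode(enc)) for enc in encodings] for ch in samples}

for ch, counts in sizes.items():
    print(f"U+{ord(ch):04X}: " + ", ".join(
        f"{enc}={n}" for enc, n in zip(encodings, counts)))

# ASCII compatibility: an ASCII character has the same byte in UTF-8.
ascii_compatible = "A".encode("utf-8") == "A".encode("ascii")
print("UTF-8 matches ASCII for 'A':", ascii_compatible)
```

The last sample lies outside the 16-bit range, so UTF-16 needs two 16-bit units for it, and UCS-2 could not represent it at all.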

Each encoding scheme is in effect an "algorithm" for converting the stored bytes back into the actual Unicode number (a plane plus an offset within that plane), which is then displayed as a glyph taken from the selected font.
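As an illustration of such an algorithm, the sketch below decodes one character stored in UTF-16 by hand and splits the resulting number into plane and offset. The function names decode_utf16_pair and plane_and_offset are hypothetical helpers written for this example, not part of any library, and the sample character is an arbitrary one from outside the first plane.

```python
def decode_utf16_pair(high, low):
    """Combine a UTF-16 surrogate pair into a single Unicode number."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits; the result starts above the first plane.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def plane_and_offset(code_point):
    """Split a Unicode number into (plane, offset within the plane)."""
    return code_point >> 16, code_point & 0xFFFF


# Store a sample character in UTF-16 (big-endian), then read the two 16-bit units.
stored = "\U0001F600".encode("utf-16-be")
high = int.from_bytes(stored[0:2], "big")
low = int.from_bytes(stored[2:4], "big")

code_point = decode_utf16_pair(high, low)
plane, offset = plane_and_offset(code_point)
print(f"units {high:04X} {low:04X} -> U+{code_point:X} "
      f"(plane {plane}, offset {offset:04X})")
```

In practice the platform's own decoder (here, bytes.decode("utf-16-be")) does this work; the manual version only shows what that step involves.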