Character encoding notes: ASCII, Unicode and UTF-8

1. ASCII code

Inside a computer, all information is ultimately stored as binary values. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states; a group of eight bits is called a byte. In other words, one byte can represent 256 different states, from 00000000 to 11111111, and each state can be assigned to one symbol, for 256 symbols in total.

In the 1960s, the United States formulated a character encoding that standardized the mapping between English characters and binary values. It is called ASCII, and it is still in use today.

ASCII defines 128 characters in total. For example, the space character (SPACE) is 32 (binary 00100000), and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 non-printing control characters, codes 0-31 and 127) occupy only the lower 7 bits of a byte; the highest bit is always 0.
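The two codes quoted above can be checked with a short sketch. It uses only Python built-ins; the variable names are illustrative.

```python
# Look up the ASCII codes of SPACE and "A" and show them as 8-bit binary strings.
for ch in (" ", "A"):
    code = ord(ch)               # numeric code of the character
    bits = format(code, "08b")   # zero-padded to one full byte
    print(repr(ch), code, bits)

# Every ASCII code is below 128, so the highest bit of the byte is always 0.
top_bit_always_zero = all(code & 0b10000000 == 0 for code in range(128))
print(top_bit_always_zero)
```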

2. Non-ASCII encoding

For English, 128 symbols are enough, but for other languages they are not. In French, for example, letters carrying accent marks cannot be represented in ASCII. Some European countries therefore decided to use the unused highest bit of the byte to encode new symbols. For example, é is encoded as 130 (binary 10000010) in the French encoding. In this way, the encodings used in these European countries can represent up to 256 symbols.

However, this created a new problem. Different countries use different letters, so although all of them use 256-symbol encodings, the same code stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, though, the symbols for 0-127 are the same; only the range 128-255 differs.
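The following sketch decodes the single byte 130 under three legacy encodings. The note does not name the encodings it has in mind, so the choice of the DOS code pages cp850 (Western European), cp862 (Hebrew), and cp866 (Russian) is an assumption made for illustration.

```python
raw = bytes([130])  # the single byte 0x82

# The same byte decodes to a different letter under each code page.
decoded = {codec: raw.decode(codec) for codec in ("cp850", "cp862", "cp866")}
for codec, ch in decoded.items():
    print(codec, ch)

# Codes 0-127 decode identically under all three code pages.
low_half = bytes(range(128))
same_low_half = (
    low_half.decode("cp850") == low_half.decode("cp862") == low_half.decode("cp866")
)
print(same_low_half)
```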

Asian scripts use even more symbols; there are as many as 100,000 Chinese characters. One byte can represent only 256 symbols, which is far from enough, so multiple bytes must be used for one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent up to 256 x 256 = 65,536 symbols.

Chinese encoding deserves an article of its own and is not covered in this note. The only point made here is that although the GB family of encodings also uses multiple bytes per symbol, it is unrelated to the Unicode and UTF-8 discussed below.

3. Unicode

As described in the previous section, many encodings exist in the world, and the same binary value can be interpreted as different symbols. To open a text file you therefore need to know its encoding; decoding it with the wrong one produces garbled characters. Why do emails often appear garbled? Because the sender and the recipient use different encodings.

Imagine a single encoding that includes every symbol in the world and gives each one a unique code; the garbling problem would disappear. This is Unicode: as its name implies, a single encoding for all symbols.
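A minimal sketch of how garbling arises: the same bytes are decoded once with the encoding they were written in and once with a different one. The pairing of UTF-8 and Latin-1 is an arbitrary choice for illustration, not something the note specifies.

```python
text = "\u00e9"                   # the letter é
data = text.encode("utf-8")       # bytes as written by the sender

wrong = data.decode("latin-1")    # recipient assumes the wrong encoding
right = data.decode("utf-8")      # recipient assumes the right encoding

print(data.hex(" "), repr(wrong), repr(right))
```

With the wrong guess, the two bytes of one letter are read as two unrelated characters.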

Unicode is, of course, a very large set; it currently has room for more than one million symbols, each with its own code. For example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严 (yán, "strict"). For the full mapping, consult unicode.org or a dedicated table of Chinese characters.
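The three codes mentioned above can be looked up with Python's standard `unicodedata` module. This is only a quick check, not a substitute for the tables on unicode.org.

```python
import unicodedata

# Map each code point to its character and official Unicode name.
names = {}
for code_point in (0x0639, 0x0041, 0x4E25):
    ch = chr(code_point)
    names[code_point] = unicodedata.name(ch)
    print(f"U+{code_point:04X}", ch, names[code_point])
```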

4. The Unicode problem

Note that Unicode is only a symbol set: it specifies the binary code of each symbol, but not how that code should be stored.

For example, the Unicode code of the Chinese character 严 is hexadecimal 4E25, which in binary is 15 bits long (100111000100101), so representing this symbol takes at least 2 bytes. Symbols with larger codes may need 3 or 4 bytes, or even more.

This raises two serious problems. First, how can Unicode be distinguished from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? Second, English letters need only one byte each. If Unicode required every symbol to be stored in three or four bytes, each English letter would have to be preceded by two or three zero bytes, a great waste of storage: text files would become two or three times larger, which is unacceptable.
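The bit count for U+4E25 (the character 严) can be verified directly. The sketch below uses only integer built-ins.

```python
code_point = 0x4E25

bits = bin(code_point)[2:]         # binary digits without the "0b" prefix
n_bits = code_point.bit_length()   # number of significant bits
min_bytes = (n_bits + 7) // 8      # whole bytes needed to hold those bits

print(bits, n_bits, min_bytes)
```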
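To make the storage cost concrete, the sketch below compares one byte per letter with a fixed four bytes per symbol. Python's `utf-32-be` codec is used here as a stand-in for such a fixed-width scheme; that choice is an assumption for illustration.

```python
text = "Hello"

one_byte_each = text.encode("ascii")         # 1 byte per English letter
four_bytes_each = text.encode("utf-32-be")   # 4 bytes per symbol, fixed width

print(len(one_byte_each), len(four_bytes_each))
print(four_bytes_each[:4].hex(" "))          # three zero bytes, then the letter H
```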

The consequences were: 1) several storage methods for Unicode appeared, that is, there are many different binary formats that can represent Unicode; 2) Unicode could not be widely adopted for a long time, until the rise of the Internet.

5. UTF-8

The spread of the Internet created strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat: UTF-8 is one of the implementations of Unicode.

The most notable feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, and the byte length varies with the symbol.
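A quick way to see the variable length is to encode a few characters and count the bytes. The four sample characters are chosen for illustration only.

```python
samples = {
    "A": "\u0041",          # English letter
    "e-acute": "\u00e9",    # accented Latin letter
    "yan": "\u4e25",        # Chinese character U+4E25
    "emoji": "\U0001F600",  # a symbol beyond U+FFFF
}

# Number of bytes each character occupies in UTF-8.
lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(lengths)
```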

UTF-8's encoding rules are simple; there are only two:

1) For a single-byte symbol, the first bit is set to 0 and the remaining 7 bits are the symbol's Unicode code. For English letters, therefore, UTF-8 is identical to ASCII.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of each following byte are set to 10. All remaining bits hold the symbol's Unicode code.

The following table summarizes the encoding rules; the letter x marks the bits available for the code.

Unicode symbol range | UTF-8 encoding
(hexadecimal)        | (binary)
---------------------+-------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
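The range column of the table translates directly into a lookup of the byte count. The function name `utf8_length` below is hypothetical, and the sketch does not treat the surrogate range specially.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a code point, per the table above."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point <= 0x7F:
        return 1    # 0xxxxxxx
    if code_point <= 0x7FF:
        return 2    # 110xxxxx 10xxxxxx
    if code_point <= 0xFFFF:
        return 3    # 1110xxxx 10xxxxxx 10xxxxxx
    return 4        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


print(utf8_length(0x4E25))
```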

Next, the Chinese character 严 is used as an example of how UTF-8 encoding works.

The Unicode code of 严 is 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 takes three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary digit of 严, fill in the x positions from back to front, and pad any remaining x positions with 0. The resulting UTF-8 encoding of 严 is "11100100 10111000 10100101", or E4B8A5 in hexadecimal.
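The fill-from-the-back procedure can be sketched as a small hand-written encoder. The function name `utf8_encode` is hypothetical; real programs would simply call `str.encode("utf-8")`, which the last line uses as a cross-check.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the x bits of the format from back to front."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point <= 0x7F:
        return bytes([code_point])        # single byte: 0xxxxxxx
    if code_point <= 0x7FF:
        n, lead = 2, 0b11000000
    elif code_point <= 0xFFFF:
        n, lead = 3, 0b11100000
    else:
        n, lead = 4, 0b11110000

    tail = []
    for _ in range(n - 1):
        # Take the lowest 6 bits and put them behind the "10" prefix.
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    first = lead | code_point             # the remaining high bits go in the first byte
    return bytes([first] + tail[::-1])    # tail was built last byte first


encoded = utf8_encode(0x4E25)
print(encoded.hex().upper())
print(encoded == "\u4e25".encode("utf-8"))
```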

6. Conversion between Unicode and UTF-8

As the example in the previous section shows, the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two differ. Programs can convert between them.

On Windows, one of the simplest ways to convert is the built-in Notepad program, Notepad.exe. After opening a file, choose "Save As" from the "File" menu; the dialog box that appears has an "Encoding" drop-down list at the very bottom.
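As one illustration of such a program, the sketch below goes in the reverse direction and recovers the Unicode code from the UTF-8 bytes of a single symbol. The function name `utf8_decode_one` is hypothetical, and the sketch is deliberately simplified: it checks the byte prefixes but does not reject overlong forms or surrogates as a full decoder would.

```python
def utf8_decode_one(data: bytes) -> int:
    """Recover the code point from the UTF-8 bytes of exactly one symbol."""
    if not data:
        raise ValueError("empty input")
    first = data[0]
    if first & 0b10000000 == 0:
        n, value = 1, first                  # 0xxxxxxx
    elif first & 0b11100000 == 0b11000000:
        n, value = 2, first & 0b00011111     # 110xxxxx
    elif first & 0b11110000 == 0b11100000:
        n, value = 3, first & 0b00001111     # 1110xxxx
    elif first & 0b11111000 == 0b11110000:
        n, value = 4, first & 0b00000111     # 11110xxx
    else:
        raise ValueError("invalid leading byte")
    if len(data) != n:
        raise ValueError("wrong number of bytes for this leading byte")
    for byte in data[1:]:
        if byte & 0b11000000 != 0b10000000:
            raise ValueError("continuation byte must start with 10")
        value = (value << 6) | (byte & 0b00111111)   # append the 6 payload bits
    return value


code_point = utf8_decode_one(bytes.fromhex("E4B8A5"))
print(f"{code_point:04X}")
```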

There are four options: ANSI, Unicode, Unicode big endian and UTF-8.

1) ANSI is the default encoding. For English files it is ASCII, and for Simplified Chinese files it is GB2312 (this applies only to the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).

2) Unicode refers to the UCS-2 encoding, which stores each character's Unicode code directly in two bytes. This option uses the little endian format.

3) Unicode big endian is the big endian counterpart of the previous option. The meaning of little endian and big endian is explained in the next section.

4) UTF-8 is the encoding discussed in the previous section.

After selecting the encoding, click the "Save" button, and the file is converted to that encoding immediately.
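The same kind of conversion can also be done in a few lines of code. The sketch below is not what Notepad does internally; the helper name `convert_encoding` and the file names are hypothetical, and the demonstration writes to a temporary directory.

```python
import os
import tempfile


def convert_encoding(src_path, dst_path, src_encoding, dst_encoding):
    """Read src_path as src_encoding and write the same text as dst_encoding."""
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "gb.txt")
    dst = os.path.join(tmp, "utf8.txt")
    with open(src, "wb") as f:
        f.write(bytes.fromhex("D1CF"))    # GB2312 bytes of U+4E25
    convert_encoding(src, dst, "gb2312", "utf-8")
    with open(dst, "rb") as f:
        converted = f.read()

print(converted.hex(" ").upper())
```

Python's `utf-8` codec writes no leading marker bytes; passing `utf-8-sig` as the target encoding writes the three bytes EF BB BF first.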

7. Little endian and Big endian

As mentioned in the previous section, Unicode codes can be stored directly in the UCS-2 format. Take the Chinese character 严 as an example. Its Unicode code is 4E25, which takes two bytes to store: one byte is 4E and the other is 25. Storing 4E first and 25 second is big endian; storing 25 first and 4E second is little endian.

These two odd names come from "Gulliver's Travels" by the British writer Jonathan Swift. In the book, a civil war breaks out in Lilliput because people argue about whether to start eating eggs from the big end (Big-Endian) or from the little end (Little-Endian). Six wars were fought over the matter; one emperor lost his life and another lost his throne.

Accordingly, putting the first byte first is "big endian", and putting the second byte first is "little endian".

This naturally raises a question: how does the computer know which byte order a given file uses?

The Unicode specification defines a character that is placed at the very beginning of a file to indicate its byte order. This character is named "ZERO WIDTH NO-BREAK SPACE" and its code is FEFF. That is exactly two bytes, and FF is 1 greater than FE.

If the first two bytes of a text file are FE FF, the file is big endian; if the first two bytes are FF FE, the file is little endian.
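The two byte orders for the code 4E25 can be produced with `int.to_bytes`. As a cross-check, the sketch compares the result with Python's `utf-16-be` and `utf-16-le` codecs, which give the same two bytes for this character.

```python
code_point = 0x4E25

big = code_point.to_bytes(2, byteorder="big")        # 4E first, then 25
little = code_point.to_bytes(2, byteorder="little")  # 25 first, then 4E

print(big.hex(" ").upper(), "|", little.hex(" ").upper())
print(chr(code_point).encode("utf-16-be") == big)
print(chr(code_point).encode("utf-16-le") == little)
```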
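The check on the first two bytes can be sketched as follows. The function name `detect_byte_order` is hypothetical; the constants come from Python's standard `codecs` module. The sketch covers only the two-byte case described here and ignores other formats that may begin with the same bytes.

```python
import codecs


def detect_byte_order(data: bytes) -> str:
    """Classify a two-byte-per-character file by its leading marker bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little endian"
    return "unknown"


print(detect_byte_order(bytes.fromhex("FEFF4E25")))
print(detect_byte_order(bytes.fromhex("FFFE254E")))
```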

8. Examples

Here is an example.

Open Notepad (Notepad.exe), create a new text file whose content is the single character 严, and save it in ANSI, Unicode, Unicode big endian, and UTF-8 encoding in turn.

Then use the hex mode of the text editor UltraEdit to inspect the bytes of each file.

1) ANSI: the file contains two bytes, "D1 CF", which is the GB2312 encoding of 严; this also implies that GB2312 is stored in big endian order.

2) Unicode: the file contains four bytes, "FF FE 25 4E", where "FF FE" indicates little endian storage and the actual code is 4E25.

3) Unicode big endian: the file contains four bytes, "FE FF 4E 25", where "FE FF" indicates big endian storage.

4) UTF-8: the file contains six bytes, "EF BB BF E4 B8 A5". The first three bytes, "EF BB BF", indicate that this is UTF-8, and the last three, "E4 B8 A5", are the encoding of 严 itself, stored in the same order as the encoding.
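The four byte sequences can be reproduced without Notepad or UltraEdit. The sketch below builds them with Python codecs; mapping the "ANSI" option to `gb2312` assumes a Simplified Chinese system, as the note describes.

```python
import codecs

char = "\u4e25"  # the character used in the example

files = {
    "ANSI": char.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + char.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + char.encode("utf-16-be"),
    "UTF-8": codecs.BOM_UTF8 + char.encode("utf-8"),
}

for name, data in files.items():
    print(f"{name}: {data.hex(' ').upper()}")
```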

Reference: https://cloud.tencent.com/developer/article/1053796