Unicode encoding 2. Prerequisites for the creation and development of Unicode

Hello, dear readers of the blog. Today we will talk about where garbled characters ("krakozyabry") on sites and in programs come from, which text encodings exist, and which ones should be used. We will take a closer look at the history of their development, starting from basic ASCII and its extended versions CP866, KOI8-R and Windows-1251, and ending with the modern Unicode Consortium encodings UTF-16 and UTF-8.

To some, this information may seem redundant, but you would not believe how many questions I get about garbled characters (an unreadable set of symbols) appearing out of nowhere. Now I will be able to refer everyone to this article and let them track down their own mistakes. So get ready to absorb the information and try to follow the story.

ASCII - basic text encoding for Latin

Text encodings developed alongside the IT industry and went through quite a few changes along the way. Historically, it all started with EBCDIC (a name that sounds rather awkward when pronounced in Russian), which made it possible to encode Latin letters, Arabic numerals, and punctuation marks along with control characters.

Still, the starting point for modern text encodings should be considered the famous ASCII (American Standard Code for Information Interchange, usually pronounced "aski" in Russian). It describes the 128 characters most commonly used by English-speaking users: Latin letters, Arabic numerals and punctuation marks.

These 128 ASCII characters also include some service characters such as brackets, bars and asterisks. You can see them for yourself:

It is these 128 characters from the original version of ASCII that became the standard: you will find them, in the same order, in every encoding discussed below.

But one byte can encode not 128 but as many as 256 different values (two to the power of eight equals 256). So, following the basic version of ASCII, a number of extended ASCII encodings appeared, in which, in addition to the 128 basic characters, it was also possible to encode the characters of a national alphabet (for example, Russian).

Here it is probably worth saying a little more about the number systems used in the description. First, as you all know, a computer works only with numbers in the binary system, that is, with zeros and ones ("Boolean algebra", if you studied it at an institute or at school). Second, one byte consists of eight bits, each of which stands for a power of two, from two to the power of zero up to two to the seventh:

It is easy to see that such a construction allows only 256 possible combinations of zeros and ones. Converting a number from binary to decimal is quite simple: just add up all the powers of two that have a one over them.

In our example, this is 1 (two to the power of zero) plus 8 (two to the third), plus 32 (two to the fifth), plus 64 (two to the sixth), plus 128 (two to the seventh). The total is 233 in decimal. As you can see, everything is very simple.
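
The addition above is easy to check in code. Here is a minimal Python sketch; the bit string 11101001 is reconstructed from the powers of two listed in the example, and the variable names are purely illustrative.

```python
# Bits of the example byte, most significant first: 1110 1001
bits = "11101001"

# Add up the powers of two that have a one over them.
total = sum(2 ** power for power, bit in enumerate(reversed(bits)) if bit == "1")

print(total)          # 233
print(int(bits, 2))   # 233, the built-in conversion agrees
```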

But if you take a closer look at the ASCII table, you will see that the characters are listed with hexadecimal codes. For example, the asterisk corresponds to the hexadecimal number 2A in ASCII. As you probably know, the hexadecimal system uses, in addition to Arabic numerals, the Latin letters from A (meaning ten) to F (meaning fifteen).

To convert a binary number to hexadecimal, there is a simple and visual method. Each byte is divided into two halves of four bits, as shown in the screenshot above. Each half byte can hold only sixteen values (two to the fourth power), which is easily written as one hexadecimal digit.

Moreover, in the left half of the byte the powers have to be counted again starting from zero, not as shown in the screenshot. A simple calculation then shows that the number encoded in the screenshot is E9. I hope my reasoning and the solution to this puzzle were clear to you. Now let's get back to text encodings.
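
The same split into half bytes can be sketched in Python. This is only an illustration of the method described above, applied to the example byte 11101001; the names high, low and digits are made up for the sketch.

```python
value = 0b11101001        # the example byte, 233 in decimal

high = value >> 4         # left half:  1110 -> 14
low = value & 0x0F        # right half: 1001 -> 9

digits = "0123456789ABCDEF"
hex_code = digits[high] + digits[low]

print(hex_code)                   # E9
print(format(value, "02X"))       # E9, the built-in formatting agrees
print(format(ord("*"), "02X"))    # 2A, the asterisk from the ASCII table
```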

Extended versions of ASCII - CP866 and KOI8-R encodings with pseudographics

So, we started talking about ASCII, which was the starting point for the development of all modern encodings (Windows-1251, Unicode, UTF-8).

Initially it contained only 128 characters: Latin letters, Arabic numerals and a few other symbols. The extended versions made it possible to use all 256 values that fit in one byte, that is, to add the letters of your own language to ASCII.

Here we need to digress once more to explain why text encodings are needed at all and why they matter so much. The characters on your screen are formed from two things: a set of vector shapes (glyphs) for all kinds of characters (they are stored in font files), and a code that lets a program pull out of this set of vector shapes (the font file) exactly the character that needs to be inserted in the right place.

It is clear that fonts are responsible for the vector shapes themselves, while the operating system and the programs running in it are responsible for the encoding. In other words, any text on your computer is a sequence of bytes, and in the single-byte encodings discussed here each byte encodes exactly one character of that text.

The program that displays this text on the screen (a text editor, a browser, etc.) reads the code of the next character as it parses the text and looks up the corresponding vector shape in the font file used to display this document. It is as simple as that.

This means that in order to encode any character we need (for example, from a national alphabet), two conditions must be met: the vector shape of this character must be present in the font used, and the character must have a one-byte code in the extended ASCII encoding. That is why there is a whole bunch of such encodings. For the Russian language alone there are several varieties of extended ASCII.

For example, initially there was CP866, an extended version of ASCII in which the characters of the Russian alphabet could be used.

That is, its upper part completely coincided with the basic version of ASCII (128 Latin letters, digits and other symbols), shown in the screenshot just above, while the lower part of the CP866 table looked as shown in the screenshot just below and allowed another 128 characters to be encoded (Russian letters and all kinds of pseudographics):

You can see that in the right column the numbers start with 8, because the numbers from 0 to 7 belong to the basic ASCII part (see the first screenshot). Thus, the Russian capital letter "М" in CP866 has the code 8C (it sits at the intersection of row 8 and column C in hexadecimal notation), which fits in one byte, and if there is a suitable font with Russian characters, this letter will be displayed in the text without problems.
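
If you want to check a code from the table yourself, Python's built-in cp866 codec can help. This is a small sketch rather than anything from the original article; the letter is written as an escape to make sure it is the Cyrillic capital and not the Latin M.

```python
letter = "\u041c"   # Cyrillic capital letter "М"

code = letter.encode("cp866")
print(code.hex().upper())                    # 8C

# Bytes below 0x80 are plain ASCII in CP866 as well.
print("*".encode("cp866").hex().upper())     # 2A
```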

Where did so much pseudographics in CP866 come from? The point is that this encoding for Russian text was developed back in those ancient years when graphical operating systems were not nearly as widespread as they are now. In DOS and similar text-mode operating systems, pseudographics made it possible to add at least some variety to the design of texts, which is why CP866 and all its peers among the extended versions of ASCII are full of it.

CP866 was distributed by IBM, but a number of other encodings were developed for Russian characters as well. For example, KOI8-R belongs to the same type (extended ASCII):

It works on the same principle as CP866 described a little earlier: each character of the text is encoded by a single byte. The screenshot shows the second half of the KOI8-R table, because the first half fully matches basic ASCII, shown in the first screenshot in this article.

One notable feature of KOI8-R is that the Russian letters in its table are not in alphabetical order, as they are, for example, in CP866.

If you look at the very first screenshot (the basic part included in all extended encodings), you will notice that in KOI8-R the Russian letters occupy the same table cells as the similar-sounding Latin letters in the first half of the table. This was done so that Russian text could be turned into Latin characters by discarding just one bit (two to the seventh power, or 128).
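
The effect of dropping that one bit is easy to reproduce. In this rough Python sketch the sample word is an arbitrary choice, and the conversion relies on the standard koi8_r codec.

```python
text = "Привет"

koi8 = text.encode("koi8_r")
stripped = bytes(b & 0x7F for b in koi8)   # discard the bit worth 128

print(stripped.decode("ascii"))            # pRIWET
```

Note that the case flips: lowercase Russian letters sit opposite uppercase Latin ones, so the result looks odd but stays readable.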

Windows 1251 - the modern version of ASCII and why garbled characters appear

Text encodings developed further because graphical operating systems were gaining popularity and the need for pseudographics gradually disappeared. As a result, a whole group of encodings arose that were essentially still extended versions of ASCII (one character of text is encoded with exactly one byte), but without pseudographic characters.

They belonged to the so-called ANSI encodings, which were developed by the American National Standards Institute. In common parlance, the name "Cyrillic" was also used for the variant with Russian language support. An example of this is Windows-1251.

It compares favorably with the previously used CP866 and KOI8-R in that the place of the pseudographic symbols was taken by the missing symbols of Russian typography (except for the accent mark), as well as by symbols used in Slavic languages close to Russian (Ukrainian, Belarusian, etc.):

Because of such an abundance of Russian encodings, font makers and software developers had a constant headache, and we, dear readers, often ran into those notorious garbled characters whenever the encoding used for a text got mixed up.

Very often they appeared when sending and receiving e-mail, which led to the creation of very complex conversion tables that could not really solve the problem at its root. Users often resorted to Latin transliteration in their correspondence to avoid the notorious garbled characters that came with Russian encodings such as CP866, KOI8-R or Windows-1251.

In fact, garbled characters appearing instead of Russian text are the result of using the wrong encoding for the language, one that does not match the encoding in which the text message was originally written.

For example, if characters encoded with CP866 are displayed using the Windows-1251 code table, these same garbled characters (a meaningless set of symbols) come out, completely replacing the text of the message.
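
This mismatch can be reproduced in a couple of lines. The sketch below is only an illustration with an arbitrary sample word; it uses Python's standard cp866 and cp1251 codecs and also shows that the damage is reversible as long as the bytes themselves were not changed.

```python
original = "Привет"

raw = original.encode("cp866")       # how the text was actually saved
garbled = raw.decode("cp1251")       # read with the wrong code table

print(garbled)                       # an unreadable set of symbols

# Undo the mistake: back to bytes with the wrong table, then decode properly.
restored = garbled.encode("cp1251").decode("cp866")
print(restored)                      # Привет
```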

A similar situation very often occurs on forums or blogs, when text with Russian characters is mistakenly saved in an encoding other than the one the site uses by default, or in the wrong text editor, which adds extra content, invisible to the naked eye, to the code.

In the end, many people got tired of this situation with so many encodings and constantly appearing garbled characters, and the prerequisites emerged for creating a new universal encoding that would replace all existing ones and finally solve the problem of unreadable texts. In addition, there was the problem of languages like Chinese, which have far more than 256 characters.

Unicode - universal encodings UTF-8, UTF-16 and UTF-32

These thousands of characters of the Southeast Asian language group could not possibly be described in the one byte allocated for a character in the extended versions of ASCII. As a result, the Unicode Consortium was created with the cooperation of many IT industry leaders (those who produce software, build hardware, and create fonts) who were interested in the emergence of a universal text encoding.

The most straightforward encoding defined by the Unicode Consortium is UTF-32. The number in the name of the encoding is the number of bits used to encode one character. 32 bits is 4 bytes, which are needed to encode every single character in this universal encoding.

As a result, the same text file encoded in UTF-32 is four times larger than in an extended version of ASCII. This is bad, but in return 32 bits give room for two to the thirty-second power values (billions of them), which covers any realistically needed number of characters with a huge margin.

But many countries with European languages did not need such a huge number of characters at all, yet with UTF-32 they got a fourfold increase in the size of text documents for nothing, and with it an increase in Internet traffic and the amount of stored data. That is a lot, and nobody could afford such waste.

As a result of the development of Unicode, UTF-16 appeared, which turned out to be so successful that it was adopted as the default base space for all the characters we use. It uses two bytes to encode each character of this base space. Let's see what this looks like.

In the Windows operating system, you can follow the path "Start" - "Programs" - "Accessories" - "System Tools" - "Character Map". A table with the vector shapes of all fonts installed on your system will open. If you select the Unicode character set under "Advanced view", you can see, for each font separately, the entire range of characters it includes.

By the way, by clicking on any of them you can see its two-byte code in UTF-16 format, consisting of four hexadecimal digits:

How many characters can be encoded in UTF-16 using 16 bits? 65,536 (two to the power of sixteen), and it was this number that was adopted as the base space in Unicode. In addition, UTF-16 can combine two 16-bit codes into a pair (a surrogate pair) to reach characters outside the base space; the whole Unicode code space is limited to about 1.1 million code points.

But even this successful version of the Unicode encoding did not bring much satisfaction to those who wrote, say, programs only in English, because after switching from an extended version of ASCII to UTF-16 the size of their documents doubled (one byte per character in ASCII and two bytes for the same character in UTF-16).
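
To get a feel for these sizes, here is a small Python sketch. The sample string is arbitrary, and the "-le" codec variants are used only so that no byte order mark is added to the count.

```python
text = "Hello, world"   # 12 Latin characters

sizes = {name: len(text.encode(name)) for name in ("cp1251", "utf-16-le", "utf-32-le")}
print(sizes)   # {'cp1251': 12, 'utf-16-le': 24, 'utf-32-le': 48}

# A character outside the 65,536-character base space takes a pair of 16-bit codes.
emoji_size = len("\U0001F600".encode("utf-16-le"))
print(emoji_size)   # 4
```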

To satisfy everyone, the Unicode Consortium decided to come up with a variable-length encoding. It is called UTF-8. Despite the eight in its name, it really does have variable length: in the original design each character of text could be encoded as a sequence of one to six bytes.

In practice, UTF-8 uses only sequences of one to four bytes, because the Unicode code space does not extend beyond what four bytes can hold. All Latin characters are encoded in one byte, just as in good old ASCII.

Remarkably, if a text contains only Latin characters, even programs that do not understand Unicode will still read what is encoded in UTF-8. In other words, the basic part of ASCII simply passed into this brainchild of the Unicode Consortium.

Cyrillic characters in UTF-8 are encoded in two bytes, and Georgian characters, for example, in three. By creating UTF-16 and UTF-8, the Unicode Consortium solved the main problem: fonts now share a single code space. Font makers only have to fill it with vector shapes of characters according to their strengths and capabilities.
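
The byte counts mentioned above can be checked directly. A minimal Python sketch follows; the three sample letters are arbitrary picks from each alphabet, written as escapes so that they survive any re-encoding of this page.

```python
samples = {
    "Latin z": "z",
    "Cyrillic ya": "\u044f",   # я
    "Georgian an": "\u10d0",   # ა
}

lengths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}
print(lengths)   # {'Latin z': 1, 'Cyrillic ya': 2, 'Georgian an': 3}

# Pure Latin text is byte-for-byte identical in ASCII and UTF-8.
same_bytes = "plain text".encode("utf-8") == "plain text".encode("ascii")
print(same_bytes)   # True
```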

In the Character Map mentioned above, you can see that different fonts support different numbers of characters. Some fonts with rich Unicode coverage can be very large. But now they differ not because they were created for different encodings, but because the font maker did or did not fill the single code space with particular vector shapes.

Garbled characters instead of Russian letters - how to fix

Let's now see how garbled characters appear instead of text or, in other words, how the correct encoding for Russian text is chosen. It is set in the program in which you create or edit the text, or the code that contains text fragments.

For editing and creating text files I personally use what is, in my opinion, a very good editor: Notepad++. It can highlight the syntax of a good hundred programming and markup languages and can be extended with plugins.

The top menu of Notepad++ has an "Encoding" item, where you can convert an existing file to the encoding used on your site by default:

For a site on Joomla 1.5 and higher, as well as for a blog on WordPress, choose the "UTF-8 without BOM" option to avoid garbled characters. What does the BOM part mean?

When the UTF-16 encoding was developed, it was decided for some reason to allow a character code to be written both in direct byte order (for example, 0A15) and in reverse (150A). So that programs could tell in which order to read the codes, the BOM (Byte Order Mark, or signature) was invented: a few extra bytes added at the very beginning of a document (two bytes in UTF-16, three in UTF-8).

UTF-8 has no byte order to mark, and the Unicode standard neither requires nor recommends a BOM for it, so adding the signature (those notorious three extra bytes at the beginning of the document) simply prevents some programs from reading the code. Therefore, when saving files in UTF-8, we should always choose the option without BOM (without signature). This way you protect yourself in advance from garbled characters.

Remarkably, some Windows programs cannot do this (they cannot save text in UTF-8 without BOM), for example the notorious Windows Notepad. It saves the document in UTF-8 but still adds the signature (three extra bytes) to the beginning. Moreover, these bytes are always the same (EF BB BF). But on servers this little thing can cause a problem: garbled characters come out.
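
To see the signature with your own eyes, you can compare the two ways of saving in Python. This is just a sketch with an arbitrary sample word; "utf-8-sig" is the name Python's standard library gives to UTF-8 with BOM.

```python
import codecs

text = "Привет"

with_bom = text.encode("utf-8-sig")   # UTF-8 with signature
without_bom = text.encode("utf-8")    # UTF-8 without BOM

print(with_bom[:3].hex().upper())         # EFBBBF
print(with_bom[:3] == codecs.BOM_UTF8)    # True
print(with_bom[3:] == without_bom)        # True

# The UTF-16 marks are two bytes long, one for each byte order.
print(codecs.BOM_UTF16_BE.hex().upper(), codecs.BOM_UTF16_LE.hex().upper())   # FEFF FFFE

# Decoding with "utf-8-sig" drops the signature if it is present.
decoded = with_bom.decode("utf-8-sig")
print(decoded == text)                    # True
```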

Therefore, never use the regular Windows Notepad to edit the documents of your site if you do not want garbled characters to appear. I consider the already mentioned Notepad++ editor the best and simplest option, with practically no drawbacks.

In Notepad++, when you select an encoding, you can also convert text to UCS-2, which is by its nature very close to the Unicode standard. Notepad++ can also encode text in ANSI, which for the Russian language means Windows-1251, described a little above. Where does this information come from?

It is stored in the registry of your Windows operating system: which encoding to use for ANSI and which for OEM (for the Russian language it will be CP866). If you set another default language on your computer, these encodings will be replaced with the corresponding ANSI or OEM encodings for that language.

After you save a document in Notepad++ in the encoding you need, or open a document from the site for editing, you can see the encoding's name in the lower right corner of the editor:

To avoid garbled characters, in addition to the steps described above, it is useful to declare the encoding in the header of the source code of all pages of the site, so that there is no confusion on the server or on the local host.

In general, all hypertext markup languages except HTML use a special XML declaration that specifies the text encoding.

Before it starts parsing the code, the browser learns which encoding is used and how exactly the character codes should be interpreted. Notably, if you save the document in a default Unicode encoding, this XML declaration can be omitted (the encoding will be taken to be UTF-8 if there is no BOM, or UTF-16 if there is a BOM).

In an HTML document, the encoding is specified with a Meta element placed between the opening and closing Head tags:

... ...

This entry is quite different from the one used in the older standard, but it fully conforms to the new HTML5 standard that is gradually being adopted, and any browser in use today will understand it correctly.

In theory, the Meta element specifying the encoding of the HTML document should be placed as high as possible in the document header, so that by the time the browser meets the first character outside basic ASCII (which is always read correctly in any variation), it already knows how to interpret the codes of these characters.

Good luck to you! See you soon on the pages of the blog.

  • What is Unicode?

    Unicode is a universal character encoding standard that makes it possible to represent the characters of all the world's languages.

    Unlike single-byte extended ASCII, Unicode originally encoded one character as two bytes, allowing the use of 65,536 characters against 256.

    As you know, one byte is an integer from zero to 255. A byte in turn consists of eight bits that store a numerical value in binary form, where each next bit is worth twice as much as the previous one. Thus, two bytes can store a number from zero to 65,535, which makes it possible to use 65,536 characters (zero plus 65,535: zero is also a number, not nothing).

    Unicode characters are divided into sections. The first 128 characters repeat the ASCII table.

    The Unicode family of encodings (Unicode Transformation Format, UTF) is responsible for representing the characters. The most famous and widely used encoding is UTF-8.

  • How to use the table?

    Symbols are presented 16 per line. Along the top you can see hexadecimal digits from 0 to F. On the left are similar numbers in hexadecimal form from 000 to FFF.
    By joining the number on the left with the digit on the top, you can find the character code. For example, the English letter F is located on line 004, in column 6: 004 + 6 = character code 0046.

    However, you can simply hover over a specific character in the table to see its code, or click on a symbol to copy it or its code in one of the formats.

    You can enter search keywords in the search field, for example: arrows, sun, heart. Or you can specify the character code in any format, for example: 1123, 04BC, چ. Or the symbol itself, if you want to know the symbol code.

    Search by keywords is currently under development, so it may not produce results. But many popular symbols can already be found.
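
The row-plus-column lookup described in the table instructions above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a description of how the table site itself works.

```python
row, column = "004", "6"          # line and column from the table

code = row + column               # "0046"
char = chr(int(code, 16))
print(code, char)                 # 0046 F

# And back again: from the character to its four-digit code.
print(format(ord("F"), "04X"))    # 0046
```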

Believe it or not, there is an image format built into the browser. This format lets you download images before they are needed, renders them on both normal and retina screens, and lets you style images with CSS. OK, that is not entirely true. This is not an image format, although everything else still holds. Using it, you can create resolution-independent icons that take no time to load and can be styled with CSS.

What is Unicode?

Unicode is the ability to correctly display letters and punctuation from different languages on the same page. It is incredibly useful: people all over the world will be able to use your site, and it will show what you intended, whether that is French with diacritics or kanji.

Unicode continues to evolve: at the time of this translation the current version was 8.0, with more than 120 thousand characters (the original article, published in early 2014, referred to version 6.3 and 110 thousand characters).

In addition to letters and numbers, Unicode contains other characters and icons. Recent versions added emoji, which you can see in the iOS messenger.

HTML pages are created from a sequence of Unicode characters and are converted to bytes when sent over the network. Each letter and each symbol of any language has its own unique code and is encoded when the file is saved.

With the UTF-8 encoding you can insert Unicode characters directly into the text, but you can also add them by specifying a numeric character reference. For example, &#9829; is the heart symbol, and you can display it simply by adding this code to the markup.

This numeric reference can be written in either decimal or hexadecimal format. The hexadecimal format requires the letter x at the beginning: the entry &#x2665; gives the same heart (♥) as the previous one (2665 is the hexadecimal version of 9829).

If you're adding a Unicode character with CSS, then you can only use hexadecimal values.

Some of the more commonly used Unicode characters have more memorable names or abbreviations instead of numeric codes, such as the ampersand (&amp; - &). These are called mnemonics (named character references) in HTML; the full list is on Wikipedia.
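
The numeric and named references described above can be checked without a browser. Here is a minimal sketch using the html module from Python's standard library; it only decodes the references and says nothing about how a particular browser renders them.

```python
import html

decimal_ref = html.unescape("&#9829;")     # decimal reference
hex_ref = html.unescape("&#x2665;")        # hexadecimal reference, note the x
named_ref = html.unescape("&amp;")         # named reference (mnemonic)

print(decimal_ref, hex_ref, named_ref)     # ♥ ♥ &
print(int("2665", 16))                     # 9829
```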

Why should you use Unicode?

Good question, here are some reasons:

  1. To use correct characters from different languages.
  2. To replace icons.
  3. To replace icons connected via @font-face .
  4. To set CSS classes

Correct characters

The first of the reasons does not require any additional action. If the HTML is saved in UTF-8 format and its encoding is transmitted over the network as UTF-8, everything should work as it should.

It should, at least. Unfortunately, not all browsers and devices support all Unicode characters equally well (more precisely, not all fonts support the full character set). For example, newly added emoji characters are not supported everywhere.

For UTF-8 support in HTML5 add (if you do not have access to the server settings, you should also add ). The old doctype uses ( ).


The second reason to use Unicode is the large number of useful symbols that can be used as icons. One example is ≡.

Their obvious advantage is that no additional files are needed to add them to the page, which means your site will be faster. You can also change their color or add a shadow with CSS. And by adding transitions (CSS transition) you can smoothly change the color of an icon on hover without any additional images.

Let's say I want to include a rating indicator with stars on my page. I can do it like this:

★ ★ ★ ☆ ☆

You will get the following result:

But if you're unlucky, you'll see something like this:

Same rating on BlackBerry 9000

This happens if the characters used are missing from the font of the browser or device (fortunately, these stars are well supported, and old BlackBerry phones are the only exception here).
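
If the rating from the example above is generated on the server, the string of stars can be built from the two Unicode characters directly. The function below is a hypothetical helper written for illustration, not code from the original article.

```python
def star_rating(score: int, out_of: int = 5) -> str:
    """Build a rating string from BLACK STAR (U+2605) and WHITE STAR (U+2606)."""
    if not 0 <= score <= out_of:
        raise ValueError("score must be between 0 and out_of")
    return " ".join(["\u2605"] * score + ["\u2606"] * (out_of - score))


print(star_rating(3))   # ★ ★ ★ ☆ ☆
```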

If a Unicode character is missing, it may be replaced by anything from an empty square (□) to a diamond with a question mark (�).

But how do you find a Unicode character that might suit your design? You can browse the available characters on a site like Unicodinator, but there is also a better way: a site that lets you draw the icon you are looking for and then offers a list of similar Unicode characters.

Using Unicode with @font-face Icons

If you are using icons that are linked with an external font via @font-face , Unicode characters can be used as a fallback. This way you can show a similar Unicode character on devices or browsers where @font-face is not supported:

On the left are Font Awesome icons in Chrome, and on the right are Unicode replacement characters in Opera Mini.

Many @font-face icon tools use a range of Unicode characters from the Private Use Area. The problem with this approach is that if @font-face is not supported, the user gets character codes that carry no meaning.
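
Whether a given character sits in a private use area can be checked with Python's unicodedata module. This is a small sketch; is_private_use is a made-up helper name, and U+E000 is simply the first code point of the Private Use Area, not a code taken from any particular icon font.

```python
import unicodedata


def is_private_use(ch: str) -> bool:
    """True if the character belongs to a private use area (category Co)."""
    return unicodedata.category(ch) == "Co"


print(is_private_use("\ue000"))   # True, no meaning without the icon font
print(is_private_use("\u2605"))   # False, this is the ordinary star
```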

Some tools for building @font-face icon sets let you choose a suitable Unicode character as the basis for each icon.

But be careful: some browsers and devices do not handle certain individual Unicode characters well when they are used with @font-face. It makes sense to check Unicode character support with Unify: this app will help you determine how safe it is to use a character in an @font-face icon set.

Support for Unicode characters

The main problem with using Unicode characters as a fallback is poor support in screen readers (again, some information about this can be found on Unify), so it's important to choose the characters you use carefully.

If your icon is just a decorative element next to a text label that a screen reader can read, you do not have to worry too much. But if the icon stands on its own, it is worth adding a hidden text label to help screen reader users. Even if a screen reader does read a Unicode character, what it says may be very different from what you intended. For example, ≡ used as a hamburger icon is read as "identical" by VoiceOver on iOS.
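
One way to guess in advance how a character might be announced is to look at its official Unicode name. The sketch below uses Python's unicodedata module; screen readers rely on their own pronunciation tables, so treat the names only as a hint.

```python
import unicodedata

names = {ch: unicodedata.name(ch) for ch in "\u2261\u2605\u2606"}

for ch, name in names.items():
    print(ch, f"U+{ord(ch):04X}", name)
# ≡ U+2261 IDENTICAL TO
# ★ U+2605 BLACK STAR
# ☆ U+2606 WHITE STAR
```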

Unicode in CSS class names

The fact that Unicode can be used in class names and in style sheets has been known since 2007. It was then that Jonathan Snook wrote about using Unicode characters in helper classes for rounded corners. The idea did not catch on, but it is worth knowing that Unicode (special characters or Cyrillic) can be used in class names.

Font selection

Few fonts support the full Unicode character set, so be sure to check for the characters you want when choosing a font.

Segoe UI Symbol and Arial Unicode MS contain many icons. These fonts are available on both PC and Mac; Lucida Grande also has a fair number of Unicode characters. You can add these fonts to the font-family declaration to make as many Unicode characters as possible available to users who have them installed.

Determining Unicode Support

It would be very convenient to be able to check for the presence of a particular Unicode character, but there is no guaranteed way to do this.

Unicode characters can be effective when supported. For example, an emoji in the subject line of an email makes it stand out from the rest in the mailbox.


This article only covers the basics of Unicode. I hope you find it useful and that it helps you better understand Unicode and use it effectively.

