Unicode encoding: the prerequisites for its creation and development

Hello, dear readers of this blog. Today we will talk about where mojibake (unreadable garbled characters) comes from on websites and in programs, what text encodings exist, and which ones should be used. Let's take a closer look at the history of their development, starting with basic ASCII and its extended versions CP866, KOI8-R and Windows-1251, and ending with the modern encodings of the Unicode Consortium, UTF-16 and UTF-8.

To some this information may seem redundant, but you would not believe how many questions I get specifically about mojibake (an unreadable set of characters). Now I will be able to refer everyone to the text of this article so they can track down their own mistakes. Well, get ready to absorb the information and try to follow the course of the story.

ASCII - the basic text encoding for the Latin alphabet

The development of text encodings went hand in hand with the formation of the IT industry, and during this time they managed to undergo quite a few changes. Historically it all started with EBCDIC (which sounds rather awkward when pronounced in Russian), which made it possible to encode the letters of the Latin alphabet, Arabic numerals, punctuation marks and control characters.

Still, the starting point for the development of modern text encodings should be considered the famous ASCII (American Standard Code for Information Interchange, usually pronounced "aski" in Russian). It describes the first 128 characters most commonly used by English-speaking users: Latin letters, Arabic numerals and punctuation marks.

These 128 characters described in ASCII also include service characters like brackets, bars, asterisks and so on. You can see them for yourself in the ASCII table:

It is these 128 characters from the original version of ASCII that became the standard, and in any other encoding you will definitely meet them, and they will appear in exactly that order.

But the fact is that one byte of information can encode not 128 but as many as 256 different values (two to the power of eight equals 256). Therefore, after the basic version of ASCII, a whole series of extended ASCII encodings appeared, in which, in addition to the 128 basic characters, it was also possible to encode characters of a national alphabet (for example, Russian).

Here it is probably worth saying a little more about the number systems used in the description. Firstly, as you all know, a computer works only with numbers in the binary system, namely with zeros and ones ("Boolean algebra", if anyone studied it at a university or at school). A byte consists of eight bits, each of which represents a power of two, starting from zero and going up to two to the seventh:

It is not difficult to see that there can be only 256 possible combinations of zeros and ones in such a construction. Converting a number from binary to decimal is quite simple: you just add up all the powers of two above which there are ones.

In our example this is 1 (two to the power of zero) plus 8 (two to the power of three), plus 32 (two to the fifth), plus 64 (two to the sixth), plus 128 (two to the seventh). The total is 233 in decimal notation. As you can see, everything is very simple.
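If you want to check this arithmetic yourself, here is a minimal Python sketch (my own illustration, not from the original article), using the byte 11101001 from the example above:

    bits = "11101001"
    print(int(bits, 2))    # 233
    # the same sum written out explicitly as powers of two
    print(sum(2 ** i for i, b in enumerate(reversed(bits)) if b == "1"))   # 233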

But if you look closely at a table with ASCII characters, you will see that they are given in hexadecimal notation. For example, the asterisk corresponds in ASCII to the hexadecimal number 2A. You probably know that the hexadecimal number system uses, in addition to the Arabic numerals, the Latin letters from A (meaning ten) to F (meaning fifteen).

Well, to convert a binary number to hexadecimal, people resort to the following simple and visual method: each byte of information is divided into two halves of four bits each, as shown in the screenshot above. Thus, each half of the byte can encode only sixteen values in binary code (two to the fourth power), which can easily be written as a single hexadecimal digit.

Moreover, in the left half of the byte the powers must be counted again starting from zero, not as shown in the screenshot. As a result, by a simple calculation, we get that the number E9 is encoded in the screenshot. I hope that the course of my reasoning and the solution of this puzzle turned out to be clear to you. Well, now let's continue, in fact, talking about text encodings.
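The same nibble trick can be reproduced in a couple of lines of Python (again, just an illustrative sketch of mine):

    value = 0b11101001                    # the byte from the example above
    high, low = value >> 4, value & 0x0F
    print(high, low)                      # 14 9  (that is, E and 9 in hexadecimal)
    print(f"{value:02X}")                 # E9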

Extended versions of ASCII - the CP866 and KOI8-R encodings with pseudographics

So, we started talking about ASCII, which was, as it were, the starting point for the development of all modern encodings (Windows-1251, Unicode, UTF-8).

Initially it contained only 128 characters of the Latin alphabet, Arabic numerals and a few other things, but in the extended version it became possible to use all 256 values that can be encoded in one byte of information. That is, it became possible to add the letters of your own language to ASCII.

Here it is worth digressing once again to explain why text encodings are needed at all and why this is so important. The characters on your computer screen are formed on the basis of two things: sets of vector shapes (representations) of all kinds of characters (they live in font files) and a code that allows you to pull out of that set of vector shapes (the font file) exactly the character that needs to be inserted in the right place.

It is clear that the fonts are responsible for the vector shapes themselves, while the operating system and the programs running in it are responsible for the encoding. That is, any text on your computer is a set of bytes, each of which encodes one single character of this very text.

The program that displays this text on the screen (a text editor, a browser, etc.), when parsing the code, reads the encoding of the next character and looks for the corresponding vector shape in the required font file connected to display this text document. Everything is simple and banal.

This means that in order to encode any character we need (for example, from a national alphabet), two conditions must be met: the vector shape of this character must be present in the font used, and the character must be representable in an extended ASCII encoding in one byte. That is why a whole bunch of such variants exist; for encoding the characters of the Russian language alone there are several varieties of extended ASCII.

For example, initially there was CP866, which allowed the characters of the Russian alphabet to be used and was an extended version of ASCII.

That is, its upper part completely coincided with the basic version of ASCII (128 Latin characters, numbers and other odds and ends), which is shown in the screenshot just above, while the lower part of the CP866 table had the form shown in the screenshot just below and allowed another 128 characters to be encoded (Russian letters and all kinds of pseudographics):

As you can see, the row numbers in the table start with 8, because the numbers from 0 to 7 belong to the ASCII base part (see the first screenshot). Thus, the Russian letter "М" in CP866 has the code 8C (it sits at the intersection of the row labeled 8 and the column labeled C in the hexadecimal number system), which fits into one byte of information, and if a suitable font with Russian characters is available, this letter will be displayed in the text without problems.
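If you have Python at hand, it is easy to confirm that CP866 really does spend exactly one byte per Russian letter (a small sketch of mine; the cp866 codec ships with Python's standard library):

    data = "М".encode("cp866")            # Cyrillic capital letter Em
    print(data.hex().upper(), len(data))  # 8C 1 -- code 8C, exactly one byte
    print(len("Привет".encode("cp866")))  # 6   -- one byte per Russian letter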

Where did this abundance of pseudographics in CP866 come from? The thing is that this encoding for Russian text was developed back in those shaggy years when graphical operating systems were nowhere near as widespread as they are now. And in DOS and similar text-mode operating systems, pseudographics made it possible to at least somehow diversify the look of texts, and so CP866, and all its peers from the category of extended ASCII versions, abound in it.

CP866 was distributed by IBM, but besides it a number of other encodings were developed for Russian characters; for example, KOI8-R can be attributed to the same type (extended ASCII):

The principle of its operation remains the same as that of CP866 described a little earlier: each character of the text is encoded with one single byte. The screenshot shows the second half of the KOI8-R table, because the first half fully corresponds to basic ASCII, shown in the first screenshot of this article.

Among the features of the KOI8-R encoding, it can be noted that the Russian letters in its table do not go in alphabetical order, as was done, for example, in CP866.

If you look at the very first screenshot (of the base part, which is included in all extended encodings), you will notice that in KOI8-R the Russian letters are located in the same cells of the table as the Latin letters that sound like them in the first part of the table. This was done for the convenience of switching between Russian and Latin characters by dropping just one bit (two to the seventh power, i.e. 128).
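This bit trick is easy to demonstrate with a short Python sketch (my example, using the standard koi8-r codec):

    code = "м".encode("koi8-r")[0]   # Cyrillic small letter em in KOI8-R
    print(hex(code))                 # 0xcd
    print(chr(code & 0x7F))          # M -- dropping the eighth bit lands on the similar-sounding Latin letter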

Windows 1251 - a modern version of extended ASCII, and why mojibake crawls out

The further development of text encodings was driven by the fact that graphical operating systems were gaining popularity, and over time the need for pseudographics in them disappeared. As a result, a whole group of encodings arose that were, in essence, still extended versions of ASCII (one text character is encoded with just one byte of information), but without the use of pseudographic characters.

They belonged to the so-called ANSI encodings, developed by the American National Standards Institute. In common parlance, the name Cyrillic was also used for the variant with Russian-language support. An example of this is Windows-1251.

It compared favorably with the previously used CP866 and KOI8-R in that the place of the pseudographic symbols was taken by the missing symbols of Russian typography (apart from the accent mark), as well as by symbols used in Slavic languages close to Russian (Ukrainian, Belarusian, etc.):

Because of such an abundance of Russian-language encodings, font manufacturers and software makers constantly had headaches, while we, dear readers, often got those very notorious mojibake characters whenever there was confusion about which version was used in a text.

Very often they crawled out when sending and receiving messages by e-mail, which led to the creation of very complex conversion tables that, in fact, could not solve the problem at its root; to avoid the notorious mojibake, users often resorted to transliteration in their correspondence instead of Russian encodings like CP866, KOI8-R or Windows-1251.

In fact, the mojibake crawling out instead of Russian text was the result of using the wrong encoding for this language, one that did not match the encoding in which the text message was originally encoded.

Let's say that if you try to display characters encoded with CP866 using the Windows-1251 code table, those same mojibake characters (a meaningless set of characters) will come out, completely replacing the text of the message.
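You can reproduce this effect in a couple of lines of Python (a purely illustrative sketch, not a recommended workflow):

    original = "Привет"
    garbled = original.encode("cp866").decode("cp1251")
    print(garbled)      # something like ЏаЁўҐв -- a meaningless set of characters instead of the word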

A similar situation very often occurs on forums and blogs, when text with Russian characters is mistakenly saved in an encoding other than the one used on the site by default, or in the wrong text editor, which adds garbage to the code that is invisible to the naked eye.

In the end, many people got tired of this situation with the multitude of encodings and the constantly crawling mojibake, and the prerequisites appeared for creating a new universal variation that would replace all existing ones and finally solve the problem of unreadable texts. In addition, there was the problem of languages like Chinese, which have far more than 256 characters.

Unicode - the universal encodings UTF-8, UTF-16 and UTF-32

Those thousands of characters of the Southeast Asian language group could not possibly be described in the one byte of information allocated for encoding characters in extended versions of ASCII. As a result, a consortium called Unicode (the Unicode Consortium) was created with the cooperation of many IT industry leaders (those who produce software, who build hardware, who create fonts), who were interested in the emergence of a universal text encoding.

The first variation released under the auspices of the Unicode Consortium was UTF-32. The number in the name of the encoding is the number of bits used to encode one character. 32 bits are 4 bytes of information, which are needed to encode one single character in the new universal UTF encoding.

As a result, the same text file encoded in extended ASCII and in UTF-32 will in the latter case be four times larger (heavier). This is bad, but now we have the opportunity to encode a number of characters equal to two to the thirty-second power (billions of code points, which covers any realistically needed value with an enormous margin).
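The fourfold difference is easy to see for yourself; here is a minimal Python sketch (my example; the explicit utf-32-le variant is used so that no BOM gets prepended):

    text = "Hello"
    print(len(text.encode("ascii")))       # 5 bytes
    print(len(text.encode("utf-32-le")))   # 20 bytes -- four bytes per character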

But many countries with languages of the European group did not need to use such a huge number of characters in an encoding at all; yet when using UTF-32 they got, for nothing, a fourfold increase in the weight of text documents and, as a result, an increase in the volume of Internet traffic and the amount of stored data. That is a lot, and no one could afford such waste.

As a result of the further development of Unicode, UTF-16 appeared, and it turned out to be so successful that the 65,536-character space it covers directly was adopted as the base space for all the characters we use. It uses two bytes to encode one character. Let's see what this looks like.

In the Windows operating system, you can go along the path "Start" - "Programs" - "Accessories" - "System Tools" - "Character Map". A table with the vector shapes of all the fonts installed on your system will open. If you select the Unicode character set in the advanced view options, you can see, for each font separately, the entire range of characters included in it.

By the way, by clicking on any of them, you can see its double-byte code in UTF-16 format, consisting of four hexadecimal digits:

How many characters can be encoded in UTF-16 using 16 bits? 65,536 (two to the power of sixteen), and it was this number that was adopted as the base space in Unicode. In addition, there are ways (surrogate pairs) to encode about a million more characters beyond it, but the standard limits this extended space to a little over a million code points in total.
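The same codes you see in the Character Map can be checked programmatically; a small Python sketch (the letter and the emoji are arbitrary examples of mine):

    # A character from the 65,536-strong base space takes two bytes in UTF-16
    print("Я".encode("utf-16-be").hex())    # 042f
    # Characters outside the base space are encoded with a surrogate pair (four bytes)
    print("😀".encode("utf-16-be").hex())   # d83dde00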

But even this successful version of the Unicode encoding did not bring much satisfaction to those who wrote, say, programs only in English, because after switching from extended ASCII to UTF-16 the weight of their documents doubled (one byte per character in ASCII versus two bytes for the same character in UTF-16).

It was precisely to satisfy everyone that the Unicode Consortium decided to come up with a variable-length encoding. It was called UTF-8. Despite the eight in the name, it is genuinely variable-length: each character of text can be encoded as a sequence of one to six bytes.

In practice, UTF-8 uses only the range from one to four bytes, because beyond four bytes there is nothing left to represent (the standard caps the code space at a level that fits into four bytes). All Latin characters are encoded in one byte, just like in good old ASCII.

Remarkably, if only Latin characters are encoded, even programs that do not understand Unicode will still read text encoded in UTF-8. That is, the ASCII base part simply carried over into this brainchild of the Unicode Consortium.

Cyrillic characters in UTF-8 are encoded in two bytes and, for example, Georgian characters in three. Having created UTF-16 and UTF-8, the Unicode Consortium solved the main problem: fonts now have a single code space. Their makers can now only fill it with vector shapes of text characters according to their strengths and capabilities.
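This variable length is easy to observe; a minimal Python sketch (the sample characters are my own picks):

    for ch in ("A", "Б", "ა", "😀"):          # Latin, Cyrillic, Georgian, emoji
        print(ch, len(ch.encode("utf-8")), "byte(s) in UTF-8")
    # A 1, Б 2, ა 3, 😀 4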

In the Character Map mentioned above, you can see that different fonts support a different number of characters. Some Unicode-rich fonts can be very heavy. But now they differ not because they were created for different encodings, but because the font maker has or has not filled the single code space with particular vector shapes.

Mojibake instead of Russian letters - how to fix it

Let's now see how mojibake appears instead of text, or, in other words, how the correct encoding for Russian text is chosen. Actually, it is set in the program in which you create or edit the text, or the code that uses text fragments.

For editing and creating text files I personally use what is, in my opinion, a very good editor, Notepad++. It can also highlight the syntax of a good hundred programming and markup languages and can be extended with plugins. Read a detailed overview of this wonderful program at the link provided.

In the top menu of Notepad++ there is an "Encodings" item, where you can convert an existing variant to the one used on your site by default:

In the case of a site on Joomla 1.5 and higher, as well as a blog on WordPress, choose the option UTF-8 without BOM in order to avoid mojibake. And what is this BOM prefix?

The thing is that when the UTF-16 encoding was being developed, for some reason it was decided to allow writing a character's code both in direct order (for example, 0A15) and in reverse (150A). And so that programs could understand in which order to read the codes, the BOM (Byte Order Mark, in other words a signature) was invented, which amounts to adding a few extra service bytes to the very beginning of a document.

For UTF-8 the Unicode Consortium did not provide for any BOM, so adding a signature (those notorious three extra bytes at the beginning of the document) simply prevents some programs from reading the code. Therefore, when saving files in UTF, we should always choose the option without BOM (without signature). This way you protect yourself in advance from crawling mojibake.

Remarkably, some programs in Windows cannot do this (cannot save text in UTF-8 without BOM), for example the notorious Windows Notepad. It saves the document in UTF-8 but still adds the signature (three extra bytes) to the beginning of it. Moreover, these bytes are always the same, since UTF-8 is always read in direct order anyway. But on servers this little thing can cause a problem: mojibake will come out.
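Here is what that signature looks like in practice, in a small Python sketch (my illustration; the utf-8-sig codec produces the same three-byte signature that Notepad adds):

    import codecs
    with_bom = "Привет".encode("utf-8-sig")   # UTF-8 with signature
    plain    = "Привет".encode("utf-8")       # UTF-8 without BOM
    print(with_bom[:3].hex())                 # efbbbf -- the three extra signature bytes
    print(with_bom[3:] == plain)              # True  -- the rest of the data is identical
    print(codecs.BOM_UTF8.hex())              # efbbbf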

Therefore, never use the regular Windows Notepad for editing documents of your site if you do not want mojibake to appear. I consider the already mentioned Notepad++ editor the best and simplest option; it has practically no drawbacks and consists only of advantages.

In Notepad++, when you select an encoding, you will also have the option to convert text to UCS-2 encoding, which is inherently very close to the Unicode standard. Notepad++ can also encode text in ANSI, which for the Russian language means Windows-1251, already described a little above. Where does this information come from?

It is recorded in the registry of your Windows operating system: which encoding to choose in the case of ANSI and which in the case of OEM (for the Russian language it will be CP866). If you set a different default language on your computer, these encodings are replaced with similar ones from the ANSI or OEM category for that language.

After you save the document in Notepad ++ in the encoding you need or open the document from the site for editing, you can see its name in the lower right corner of the editor:

To avoid mojibake, in addition to the actions described above, it is useful to write information about this encoding into the header of the source code of every page of the site, so that there is no confusion on the server or on the local host.

In general, all hypertext markup languages except HTML use a special xml declaration, which specifies the text encoding.

Before parsing the code, the browser thus knows which version is being used and how exactly to interpret the character codes of that language. Notably, if you save the document in the default Unicode, the xml declaration can be omitted (the encoding will be assumed to be UTF-8 if there is no BOM, or UTF-16 if there is one).

In the case of an HTML document, the Meta element is used to specify the encoding; it is written between the opening and closing Head tags:

<head> ... <meta charset="utf-8"> ... </head>

This entry is quite different from the one adopted in the older HTML 4.01 standard, but it fully conforms to the HTML 5 standard that is slowly being introduced, and it will be correctly understood by any browser currently in use.

In theory, it is better to place the Meta element with the encoding of the html document as high as possible in the document header, so that by the time the first character outside the basic ANSI range is encountered (such characters are always read correctly in any variation), the browser already has the information on how to interpret the codes of these characters.

Good luck to you! See you soon on the pages of this blog.


Every Internet user, trying to configure one or another of its functions, has at least once seen the word "Unicode" on the screen. What it is, you will learn by reading this article.

Definition

Unicode is a character encoding standard. It was proposed by the non-profit organization Unicode Inc. in 1991. The standard was developed with the aim of combining as many different types of characters as possible in one document. A page based on it may contain letters and hieroglyphs from different languages ​​(from Russian to Korean) and mathematical symbols. In this case, all characters in this encoding are displayed without problems.

Reasons for creation

Once upon a time, long before the unified Unicode system appeared, an encoding was chosen based on the preferences of the document's author. For this reason it was not uncommon, in order to read a single document, to have to use several different tables. Sometimes this had to be done repeatedly, which significantly complicated the life of the ordinary user. As already mentioned, a solution to this problem was proposed in 1991 by the non-profit organization Unicode Inc., which put forward a new type of character encoding. It was intended to unite obsolete and diverse standards. "Unicode" is an encoding that made it possible to achieve what seemed unthinkable at the time: to create a tool supporting a huge number of characters. The result exceeded many expectations: documents appeared that simultaneously contained English and Russian text, Latin and mathematical expressions.

But the creation of a single encoding was preceded by the need to resolve a number of problems that arose due to the huge variety of standards that already existed at that time. The most common of them:

  • "elvish" scripts, i.e. mojibake;
  • limited set of characters;
  • encoding conversion problem;
  • duplicate fonts.

A small historical excursion

Imagine that it is the 1980s. Computer technology is not yet so widespread and looks different from today. At that time each OS is unique in its own way and is modified by each enthusiast for specific needs. The need to share information turns into endless extra tinkering. An attempt to read a document created under another OS often displays an incomprehensible set of characters on the screen, and the games with encodings begin. It is not always possible to sort this out quickly; sometimes the necessary document can be opened only in six months, or even later. People who exchange information frequently create conversion tables for themselves, and work on them reveals an interesting detail: they need to be created in two directions, "from mine to yours" and back. The machine cannot make a banal inversion of the calculations: for it, the source is in the right column and the result in the left, but not the other way around. If there was a need to use any special characters in a document, they first had to be added, and then you also had to explain to your partner what he needed to do so that these characters did not turn into "gibberish". And let's not forget that for each encoding you had to develop or implement your own fonts, which led to the creation of a huge number of duplicates in the OS.

Imagine also that on the fonts page you would see 10 identical copies of Times New Roman with small marks: for UTF-8, UTF-16, ANSI, UCS-2. Now do you understand why the development of a universal standard was an urgent need?

"Creator Fathers"

The origins of Unicode can be traced back to 1987, when Joe Becker of Xerox, together with Lee Collins and Mark Davis of Apple, began research into the practical creation of a universal character set. In August 1988, Joe Becker published a draft proposal for a 16-bit international multilingual coding system.

A few months later the Unicode working group was expanded to include Ken Whistler and Mike Kernegan of RLG, Glenn Wright of Sun Microsystems, and a few others, which completed the preliminary work on a single encoding standard.

General description

Unicode is based on the concept of a character. This definition refers to an abstract phenomenon that exists in a specific type of writing and is realized through graphemes (its "portraits"). Each character is assigned a unique code in Unicode, belonging to a specific block of the standard. For example, the grapheme B exists in both the English and Russian alphabets, but it corresponds to two different Unicode characters. They can also be mapped to lowercase; that is, each of them is described by a database key, a set of properties, and a full name.

Benefits of Unicode

The Unicode encoding was distinguished from its contemporaries by a huge supply of room for "encrypting" characters. The fact is that its predecessors had 8 bits, that is, they supported 256 characters (two to the eighth power), while the new development already offered 65,536 characters (two to the sixteenth power), which was a giant step forward. This made it possible to encode almost all existing and common alphabets.

With the advent of Unicode there is no longer any need to use conversion tables: as a single standard, it simply made them unnecessary. In the same way, mojibake sank into oblivion: the single standard made it impossible, and it also eliminated the need to create duplicate fonts.

Development of Unicode

Of course, progress does not stand still, and 25 years have passed since the first presentation. Nevertheless, the Unicode encoding stubbornly holds its position in the world. In many ways this became possible because it was easy to implement and it spread widely, being recognized by developers of both proprietary (paid) and open source software.

At the same time, one should not assume that the same Unicode encoding is available to us today as a quarter of a century ago. At the moment its version has changed to 5.x.x, and the number of encodable characters has grown to two to the thirty-first power. The ability to use an even larger supply of characters was given up in order to keep support for UTF-16 (the encoding whose directly addressable space was limited to two to the sixteenth power). From its inception up to version 2.0.0, the Unicode standard almost doubled the number of characters it included. The growth of possibilities continued in subsequent years. By version 4.0.0 there was already a need to enlarge the standard itself, which was done. As a result, Unicode acquired the form in which we know it today.

What else is in Unicode?

In addition to the huge, constantly replenishing number of characters, it has one more useful feature. This is the so-called normalization. Instead of scrolling through the entire document character by character and substituting the corresponding icons from the lookup table, one of the existing normalization algorithms is used. What are we talking about?

Instead of wasting computer resources on regularly checking the same character, which may be similar in different alphabets, a special algorithm is used. It allows you to take out similar characters in a separate column of the substitution table and refer to them already, and not recheck all the data over and over again.

Four such algorithms have been developed and implemented. In each of them, the transformation takes place according to a strictly defined principle that differs from the others, so it is not possible to call any one of them the most effective. Each was developed for specific needs, was implemented and successfully used.
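As a small illustration (my own example, using Python's standard unicodedata module; NFC and NFD are two of the four normalization forms mentioned above):

    import unicodedata
    composed   = "\u00e9"        # é as a single code point
    decomposed = "e\u0301"       # 'e' followed by a combining acute accent
    print(composed == decomposed)                                          # False
    print(unicodedata.normalize("NFC", decomposed) == composed)            # True
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", composed)])   # ['0x65', '0x301']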

Distribution of the standard

Over its 25 years of history, Unicode has become probably the most widely used encoding in the world. Programs and web pages are also written to this standard. The breadth of its application is shown by the fact that today Unicode is used by more than 60% of Internet resources.

Now you know when the Unicode standard appeared. What it is, you also know, and you will be able to appreciate the full significance of the invention made by the group of specialists at Unicode Inc. over 25 years ago.


Unicode character table

(The table lists all Unicode blocks by code range, from 0000-001F: C0 Control Characters and 0020-007F: Basic Latin, through the Cyrillic, Arabic, Indic and CJK ranges and the surrogate-pair areas, up to FFF0-FFFF: Specials.)

  • What is Unicode?

    Unicode is a universal character encoding standard that makes it possible to represent the characters of all the world's languages.

    Unlike ASCII, a character is encoded with two bytes, which allows 65,536 characters to be used instead of 256.

    As you know, one byte is an integer from 0 to 255. In turn, a byte consists of eight bits, which store numeric values in binary form, where each successive bit is worth twice as much as the previous one. Two bytes can therefore store a number from 0 to 65,535, which makes it possible to use 65,536 characters (zero plus 65,535; zero is also a number, it is not nothing).

    Unicode characters are divided into sections. The first 128 characters repeat the ASCII table.

    The Unicode family of encodings (Unicode Transformation Format, UTF) is responsible for how characters are represented. The most famous and widely used of them is UTF-8.

  • How to use the table?

    Characters are presented 16 to a row. Across the top you can see the hexadecimal digits from 0 to F. Down the left side are similar numbers in hexadecimal form, from 000 to FFF.
    By combining the number on the left with the number at the top, you can find out the character code. For example: the English letter F is located in row 004, column 6: 004 + 6 = character code 0046.

    However, you can simply hover over a specific character in the table to find out its code, or click on a character to copy it or its code in one of the formats.

    You can enter keywords in the search field, for example: arrows, sun, heart. Or you can specify the character code in any format, for example: 1123, 04BC, چ. Or the character itself, if you want to find out its code.

    Search by keywords is currently under development, so it may not always produce results, but many popular characters can already be found.

Believe it or not, there is an image format built into the browser. This format lets you load images before they are needed, renders them on both normal and retina screens, and lets you apply CSS to them. OK, that's not entirely true. It is not an image format, although everything else still applies. Using it, you can create resolution-independent icons that take no time to load and can be styled with CSS.

What is Unicode?

Unicode is the ability to correctly display letters and punctuation from different languages on the same page. It is incredibly useful: users all over the world will be able to work with your site, and it will show what you want, whether that's French with diacritics or kanji.

Unicode continues to evolve: now the current version is 8.0, which has more than 120 thousand characters (in the original article published in early 2014, it was about version 6.3 and 110 thousand characters).

In addition to letters and numbers, there are other characters and icons in Unicode. In the latest versions these include emoji, which you can see in the iOS messenger.

HTML pages are created from a sequence of Unicode characters and are converted to bytes when sent over the network. Each letter and each symbol of any language has its own unique code and is encoded when the file is saved.

When using the UTF-8 encoding system, you can insert Unicode characters directly into the text, but you can also add them by specifying a numeric character reference. For example, ♥ is the heart symbol, and you can display it by simply adding the code &#9829; to the markup.

This numeric reference can be specified in either decimal or hexadecimal format. The hexadecimal format requires the letter x at the beginning; the entry &#x2665; gives the same heart (♥) as the previous option (2665 is the hexadecimal version of 9829).
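If you want to double-check the two notations, here is a quick sketch using Python's standard html module (purely illustrative):

    import html
    print(html.unescape("&#9829;"))     # ♥  (decimal numeric reference)
    print(html.unescape("&#x2665;"))    # ♥  (hexadecimal numeric reference)
    print(ord("♥"), hex(ord("♥")))      # 9829 0x2665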

If you're adding a Unicode character with CSS, then you can only use hexadecimal values.

Some of the more commonly used Unicode characters have more memorable textual names or abbreviations instead of numeric codes, such as the ampersand (& - &amp;). Such characters are called mnemonics (named character references) in HTML; their full list is on Wikipedia.

Why should you use Unicode?

Good question, here are some reasons:

  1. To use correct characters from different languages.
  2. To replace icons.
  3. To replace icons connected via @font-face .
  4. To set CSS classes

Correct characters

The first of the reasons does not require any additional action. If the HTML is saved in UTF-8 format and its encoding is transmitted over the network as UTF-8, everything should work as it should.

It should, that is. Unfortunately, not all browsers and devices support all Unicode characters in the same way (more precisely, not all fonts include the full character set). For example, the recently added emoji characters are not supported everywhere.

For UTF-8 support in HTML5, add <meta charset="utf-8"> to the head of the page (if you do not have access to the server settings, where the encoding could also be declared, this is all the more important). With the old doctype, the longer form <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> is used.

Icons

The second reason to use Unicode is the large number of useful symbols that can be used as icons, for example ≡, ★ and many others.

Their obvious advantage is that you do not need any extra files to add them to the page, which means your site will be faster. You can also change their color or add a shadow with CSS, and by adding transitions (css transition) you can smoothly change the color of an icon on hover without any additional images.

Let's say I want to include a rating indicator with stars on my page. I can do it like this:

★ ★ ★ ☆ ☆

You will get the following result:

But if you're unlucky, you'll see something like this:

Same rating on BlackBerry 9000

This happens if the characters used are not in the font of the browser or device (fortunately, these asterisks are supported perfectly and the old Blackberry phones are the only exception here).

If there is no Unicode character, it can be replaced by characters ranging from an empty square (□) to a diamond with a question mark (�).

But how do you find a Unicode character that might be suitable for your design? You can look one up on a site like Unicodinator by browsing the available characters, but there is also a better way: Shape Catcher (see the link list below) is a great site that lets you draw the icon you are looking for and then offers you a list of similar Unicode characters.

Using Unicode with @font-face Icons

If you are using icons that are linked with an external font via @font-face , Unicode characters can be used as a fallback. This way you can show a similar Unicode character on devices or browsers where @font-face is not supported:

On the left are Font Awesome icons in Chrome, and on the right are Unicode replacement characters in Opera Mini.

Many @font-face matching tools use the Unicode character range from the private use area. The problem with this approach is that if @font-face is not supported, character codes are passed to the user without any meaning.

An icon-font generator is great for creating @font-face icon sets and lets you choose a suitable Unicode character as the basis for each icon.

But be careful - some browsers and devices don't like single Unicode characters when used with @font-face . It makes sense to check Unicode character support with Unify - this app will help you determine how safe it is to use a character in the @font-face icon set.

Support for Unicode characters

The main problem with using Unicode characters as a fallback is poor support in screen readers (again, some information about this can be found on Unify), so it's important to choose the characters you use carefully.

If your icon is just a decorative element next to a text label readable by a screen reader, you don't have to worry too much. But if the icon is on its own, it's worth adding a hidden text label to help screen reader users. Even if a Unicode character is read by a screen reader, there is a chance that it will be very different from its intended purpose. For example, ≡ (≡) as a hamburger icon will be read as “identical” by VoiceOver on iOS.

Unicode in CSS class names

The fact that Unicode can be used in class names and in style sheets has been known since 2007. It was then that Jonathan Snook wrote about the use of Unicode characters in helper classes when laying out rounded corners. This idea has not received much distribution, but it is worth knowing about the possibility of using Unicode in class names (special characters or Cyrillic).

Font selection

Few fonts support the full Unicode character set, so be sure to check for the characters you want when choosing a font.

There are lots of icons in Segoe UI Symbol and Arial Unicode MS. These fonts are available on both PC and Mac; Lucida Grande also has a fair number of Unicode characters. You can add these fonts to the font-family declaration to get the maximum number of Unicode characters for users who have them installed.

Determining Unicode Support

It would be very convenient to be able to check for the presence of a particular Unicode character, but there is no guaranteed way to do this.

Unicode characters can be effective when they are supported. For example, an emoji in the subject line of an email makes it stand out from the rest in the mailbox.

Conclusion

This article only covers the basics of Unicode. I hope you find it useful and that it helps you understand Unicode better and use it effectively.

Link List

  • (Unicode based @font-face icon set generator)
  • Shape Catcher (Unicode character recognition tool)
  • Unicodinator (unicode character table)
  • Unify (Check for Unicode character support in browsers)
  • Unitools (Collection of tools for working with Unicode)