What is hashing function. Cryptographic hash functions

The search algorithms we have considered are usually based on an abstract comparison operation. From this series, the distribution search method, described in "Tables of symbols and binary search trees", stands out, in which the element with the key i is stored in the i-th position of the table, which allows accessing it directly. Distributing search uses the values of the keys as array indices, rather than the operands of the comparison operation; the method itself relies on the fact that the keys are distinct integers from the same range as the table indexes. In this chapter, we will look at hashing, an advanced distributional search used in more typical search applications where keys do not have such convenient properties. The end result of the application this approach not at all like comparison-based methods — instead of moving through the dictionary data structures by comparing the lookup keys with the keys in the elements, we try to access the elements in the table directly by performing the arithmetic conversion of the keys to table addresses.

Search algorithms using hashing consist of two separate parts... The first step is to compute a hash function that converts the lookup key to an address in the table. Ideally, different keys would map to different addresses, but often two or more different keys can give the same address in the table. Therefore, the second part of the hashing search is the collision resolution process, which handles such keys. One of the conflict resolution techniques we'll look at in this chapter uses linked lists, so it finds direct use in dynamic situations where it is difficult to predict the number of search keys ahead of time. The other two collision resolution methods achieve high performance search as the elements are stored in a fixed array. We'll look at how these methods can be improved so that they can be used even in cases where the size of the table cannot be predicted in advance.

Hashing - good example balance between time and memory. If there were no limits on the amount of memory used, any search could be performed with just one memory access, simply using the key as a memory address, as in an allocating search. However, this ideal case is usually unattainable, as long keys can require a huge amount of memory. On the other hand, if there were no restrictions on lead time, one could get by with a minimal amount of memory using the sequential search method. Hashing is a way of using an acceptable amount of both memory and time, and striking a balance between these two extremes. In particular, any balance can be maintained by simply changing the size of the table, rather than rewriting the code or choosing other algorithms.

Hashing is one of the classic problems in computer science: its various algorithms have been studied in detail and are widely used. We'll see that with loose assumptions, we can hope to support find and insert operations in constant-time symbol tables, regardless of table size.

This expected value is the theoretical performance optimum for any symbol table implementation, but hashing is still not a panacea for two main reasons. At first, lead time depends on the key length, which can be significant in real applications using long keys. Second, hashing does not provide efficient implementations of other symbol table operations, such as select or sort. In this chapter, we will take a closer look at these and other issues.

Hash functions

First of all, it is necessary to solve the problem of calculating a hash function that converts keys into table addresses. Usually, the implementation of this arithmetic calculation is not difficult, but you still need to be careful not to run into various subtle underwater rocks... If you have a table that can contain M elements, you need a function that converts the keys to integers in the range. An ideal hash function should be easy to compute and resemble a random function: for any arguments, the results should, in a sense, be equally probable.

The hash function depends on the key type. Strictly speaking, a separate hash function is required for each possible kind of keys. To improve efficiency, it is usually desirable to avoid explicit type conversions, and instead go to the idea of looking at the binary representation of keys in machine word as an integer that can be used in arithmetic calculations. Hashing came before languages high level- on early computers, it was common to treat a value as either a string key or an integer. In some high-level languages, it is difficult to create programs that depend on the representation of keys in a particular computer, since such programs are inherently machine-dependent and therefore difficult to transfer to another computer. Hash functions typically depend on the key-to-integer conversion process, so it can be difficult to achieve both machine independence and efficiency in hashing implementations. Typically, simple integer or floating-point keys can be converted with just one machine operation, but string keys and other types of composite keys are more costly and more attention to efficiency.

Probably the simplest situation is when the keys are floating point numbers from a fixed range. For example, if the keys are numbers greater than 0 and less than 1, you can simply multiply them by M, round the result to a lower integer, and get an address in the range between 0 and M - 1; such an example is shown in Fig. 14.1. If the keys are greater than s and less than t, they can be scaled by subtracting s and dividing by t-s, bringing them into the range of values between 0 and 1, and then multiplying by M to get the address in the table.

Rice. 14.1.

To convert floating point numbers in the range between 0 and 1 to indexes of a table whose size is 97, these numbers are multiplied by 97. In this example, there were three collisions: for indexes 17, 53, and 76. Hash values are determined by the highest bits of the key, the least significant bits do not play any role. One of the goals of developing a hash function is to correct this imbalance so that every bit is taken into account during the calculation.

If the keys are w-bit integers, they can be converted to floating point numbers and divided by 2 w to get floating point numbers between 0 and 1, and then multiplied by M as in the previous paragraph. If floating point operations take a long time, and the numbers are not large enough to cause overflow, the same result can be obtained using integer arithmetic operations: you need to multiply the key by M, and then perform a right shift by w digits to divide by 2 w (or, if the multiplication overflows, perform the shift and then the multiplication). Such methods are useless for hashing, unless the keys are evenly distributed over the range, since the hash value is determined only by the leading digits of the key.

More simple and effective method for w-bit integers - one of, perhaps, the most frequently used hashing methods - choosing a prime number table as the size M and calculating the remainder of k by M, i.e. h (k) = k mod M for any integer key k. This function is called a modular hash function. It is very easy to compute (k% M in C ++), and is efficient for achieving an even distribution of key values between values less than M. A small example is shown in Fig. 14.2.

Rice. 14.2.

The three right-hand columns show the result of hashing the 16-bit keys listed on the left using the following functions:

v% 97 (left)

v% 100 (center) and

(int) (a * v)% 100 (right),

where a = .618033. The table sizes for these functions are 97, 100, and 100, respectively. The values look random (since the keys are random). The second function (v% 100) uses only the two right-most digits of the keys and therefore may show poor performance for non-random keys.

Modular hashing is applicable to floating point keys as well. If the keys are in a small range, you can scale them to numbers between 0 and 1.2 w to get w-bit integers and then use a modular hash function. Another option is to simply use the binary representation of the key as the operand of the modular hash function (if available).

Modular hashing is used in all cases when there is access to the bits that make up the keys, regardless of whether they are integers represented by a machine word, a sequence of characters packed into a machine word, or any other. possible option... A sequence of random characters packed into a machine word is not exactly the same as random integer keys, since not all bits are used for encoding. But both of these types (and any other type of key coded to fit in a machine word) can be made to look like random indexes on a small table.

The main reason for choosing a prime hash table as the size M for modular hashing is shown in Fig. 14.3. This example of 7-bit character data treats the key as a base 128 number — one digit for each character in the key. The word now corresponds to the number 1816567, which can also be written as

because in ASCII the characters n, o, and w correspond to the numbers 1568 = 110, 1578 = 111, and 1678 = 119. The choice of the table size M = 64 for this type of key is unsuccessful, because adding values that are multiples of 64 (or 128) to x does not change the value of x mod 64 - for any key, the value of the hash function is the value of the last 6 bits of this key. Of course, a good hash function should consider all the digits of the key, especially for character keys. Similar situations can arise when M contains a power factor of 2. The simplest way to avoid this, choose a prime number as M.

Rice. 14.3.

Each row of this table lists: a 3-letter word, the ASCII representation of that word as a 21-bit number in octal and decimal notation, and standard modular hash functions for table sizes 64 and 31 (the two rightmost columns). Table size 64 leads to undesirable results because only the right-most bits of the key are used to obtain the hash value, and the letters in the words of the common language are unevenly distributed. For example, all words ending with the letter y have a hash value of 57. In contrast, a simple value of 31 causes fewer collisions in a table that is more than half the size.

Modular hashing is very easy to implement, except that the size of the table must be a prime number. For some applications, you can be content with a small known prime, or look in a list of known prime numbers for one that is close to the required table size. For example, numbers equal to 2 t - 1 are prime when t = 2, 3, 5, 7, 13, 17, 19 and 31(and for no other values of t< 31 ): это известные простые числа Мерсенна. Чтобы динамически распределить таблицу нужного размера, нужно вычислить простое число, близкое к этому значению. Такое вычисление нетривиально (хотя для этого и существует остроумный алгоритм, который будет рассмотрен в части 5), поэтому на практике обычно используют таблицу заранее вычисленных значений (см. рис. 14.4). Использование модульного хеширования - не the only reason, by which the size of the table should be made a prime number; another reason is discussed in section 14.4.

Rice. 14.4.

This table of the largest primes less than 2 n for , can be used to dynamically allocate a hash table when you want the table size to be a prime number. For any given positive value in the covered range, this table can be used to determine a prime number that is less than 2 times different from it.

Another way to handle integer keys is to combine the multiplicative and modular methods: you need to multiply the key by a constant between 0 and 1, and then do the division modulo M. In other words, you need to use a function. There is a relationship between the values, M and the effective radix of the key that could theoretically lead to anomalous behavior, but if you use an arbitrary value of a, in real application there is hardly any problem. Often the value φ = 0.618033 ... (golden ratio) is chosen as a.

Many other variations have been explored on this topic, in particular hash functions that can be implemented with efficient machine instructions such as shift and masked highlighting (see the links section).

In many applications that use symbol tables, keys are not numbers and are not necessarily short; more often they are alphanumeric strings, which can be quite long. So how do you calculate the hash function for a word like averylongkey?

In 7-bit ASCII code, this word corresponds to 84-bit number \ begin (align *) 97 \ cdot 128 ^ (11) & + 118 \ cdot 128 ^ (10) + 101 \ cdot 128 ^ (9) + 114 \ cdot 128 ^ (8) + 121 \ cdot 128 ^ (7) \\ & + 108 \ cdot 128 ^ (6) + 111 \ cdot 128 ^ (5) + 110 \ cdot 128 ^ (4) + 103 \ cdot 128 ^ (3) \\ & + 107 \ cdot 128 ^ (2) + 101 \ cdot 128 ^ (1) + 121 \ cdot 128 ^ (0), \ end (align *),

which is too large to perform normal arithmetic functions on most computers. And often you need to handle much longer keys.

To compute the modular hash function for long keys, they are transformed chunk by chunk. You can use the arithmetic properties of the module function and use Horner's algorithm (see section 4.9 "Abstract data types"). This method is based on another way of writing the numbers corresponding to keys. For this example, write the following expression: \ begin (align *) (((((((((97 \ cdot 128 ^ (11) & + 118) \ cdot 128 ^ (10) + 101) \ cdot 128 ^ ( 9) + 114) \ cdot 128 ^ (8) + 121) \ cdot 128 ^ (7) \\ & + 108) \ cdot 128 ^ (6) + 111) \ cdot 128 ^ (5) + 110) \ cdot 128 ^ (4) + 103) \ cdot 128 ^ (3) \\ & + 107) \ cdot 128 ^ (2) + 101) \ cdot 128 ^ (1) + 121. \ end (align *)

That is decimal number, corresponding to the character encoding of the string, can be calculated by looking at it from left to right, multiplying the accumulated value by 128, and then adding the code value of the next character. In the case of a long string, this way of calculating will eventually result in a number larger than what a computer can ever imagine. However, this number is not needed, since only a (small) remainder is required from its division by M. The result can be obtained without even storing a large accumulated value, since at any time in the calculation, you can discard a multiple of M - each time you perform multiplication and addition, you only need to store the remainder of the division modulo M. The result will be the same as if we had the opportunity to calculate a long number and then perform division (see. Exercise 14.10). This observation leads to a straightforward arithmetic way of computing modular hash functions for long strings — see Program 14.1. This program uses one final trick: instead of base 128, it uses the prime number 127. The reason for this change is discussed in the next paragraph.

There are many ways to compute hash functions at roughly the same cost as modular hashing using Horner's method (one or two arithmetic operations for each character in the key). For random keys, these methods are practically the same, but real keys are rarely random. The ability to randomize real keys at a low cost leads to the consideration of randomized hashing algorithms, since we need hash functions that generate random indexes on a table regardless of key distribution. Randomization is easy, since you don't have to literally adhere to the definition of modular hashing - you just need to use all the digits of the key to compute an integer less than M.

M = 96 and a = 128 (top),

M = 97 and a = 128 (center) and

M = 96 and a = 127 (bottom)

The uneven distribution in the first case is a result of the uneven use of letters and persistence of unevenness due to the fact that both the table size and the factor are multiples of 32. The other two examples look random because the table size and factor are coprime numbers.

Program 14.1 shows one way to do this: using a simple base instead of a power of 2 and an integer corresponding to the ASCII representation of the string. In fig. 14.5 fig. Figure 14.5 shows how this change improves the distribution for typical string keys. In theory, the hash values generated by Program 14.1 can give poor results for table sizes that are multiples of 127 (although in practice this will likely be almost invisible); to create a randomized algorithm, one could choose the value of the multiplier at random. An even more efficient approach is to use random values of the coefficients in the calculation and different random values for each key digit. This approach yields a randomized algorithm called universal hashing.

In theory, an ideal universal hash function is one for which the probability of a collision between two different keys in a table of size M is exactly 1 / M. It can be proved that using as the coefficient a in Program 14.1, not a fixed arbitrary value, but a sequence of random different values, transforms modular hashing into a universal hash function. However, the cost of generating a new random number for each character in the key is usually unacceptable. In practice, the tradeoff shown in Program 14.1 can be achieved by not storing an array of different random numbers for each key symbol, but by varying the coefficients by generating a simple pseudo-random sequence.

To summarize, in order to use hashing to implement an abstract symbol table, you first need to extend the interface of the abstract type to include a hash operation, which maps keys to non-negative integers smaller than the size of the table M.

Within the framework of this article, I will tell you what is hash, why it is needed, where and how it is used, as well as the most famous examples.

Many information technology tasks are highly data-critical. For example, if you need to compare two files of 1 KB each and two files of 10 GB each, then this is a completely different time. Therefore, algorithms that allow you to operate with shorter and more capacious values are considered very popular.

One of these technologies is Hashing, which has found its application in solving a lot of problems. But, I think to you, as an ordinary user, it is still not clear what kind of animal this is and what it is for. Therefore, further I will try to explain everything in the most simple words.

Note: The material is intended for ordinary users and does not contain many technical aspects, however, it is more than enough for basic acquaintance.

What is Hash or Hashing?

I'll start with the terms.

Hash Function, Convolution Function is a special kind of function that allows you to convert arbitrary length texts to a fixed length code (usually a short alphanumeric notation).

Hashing is the process of converting source texts itself.

Hash, Hash Code, Hash Value, Hash Sum- This is the output value of the Hash function, that is, the resulting block of fixed length.

As you can see, the terms have a somewhat figurative description, from which it is difficult to understand what all this is for. Therefore, I will immediately give a small example (I will talk about the rest of the applications a little later). Let's say you have 2 10GB files. How can you quickly find out which one you need? The file name can be used, but it is easy to rename it. You can watch dates, but after copying the files, the dates can be the same or in a different sequence. The size, as you yourself understand, can do little to help (especially if the sizes are the same or you have not looked at the exact values of the bytes).

This is where this Hash is needed, which is a short block formed from the source text of the file. These two 10GB files will have two different but short Hashcodes (something like "ACCAC43535" and "BBB3232A42"). Using them, you can quickly find out desired file, even after copying and changing names.

Note: Due to the fact that Hash is a very well-known concept in the computer world and on the Internet, often everything that is related to Hash is shortened to this very word. For example, the phrase "I use an MD5 hash" in translation means that the site or somewhere else uses the hashing algorithm of the MD5 standard.

Hash function properties

Now, I will talk about the properties of Hash functions, so that it is easier for you to understand where Hash is used and what is needed for. But, first, one more definition.

Collision- this is a situation when the same hash sum is obtained for two different texts. As you yourself understand, since a block of fixed length, then it has a limited number of possible values, and therefore repeats are possible.

And now to the very properties of the Hash functions:

1. The input can be a text of any size, and the output is a data block of a fixed length. This follows from the definition.

2. The hash-sum of the same texts must be the same. Otherwise, such functions are simply useless - it's analogous to a random number.

3. Nice function the convolutions must have a good distribution. Agree that if the size of the output Hash is, for example, 16 bytes, then if the function returns only 3 different values for any text, then there is no sense in such a function and these 16 bytes (16 bytes is 2 ^ 128 options, which is approximately 3, 4 * 10 ^ 38 degrees).

4. How well the function reacts to the slightest changes in the source text. A simple example. Changed 1 letter in a 10 GB file, the function value should be different. If this is not the case, then using such a function is very problematic.

5. The likelihood of a collision. A very complex parameter calculated under certain conditions. But, its essence is that what's the point of the hash function if the resulting hash sum will often coincide.

6. Speed of Hash calculation. What is the use of a convolution function if it takes a long time to compute? None, because then it is easier to compare the file data or use a different approach.

7. The complexity of recovering the original data from the Hash value. This characteristic is more specific than general, since this is not always required. However, for the most well-known algorithms, this characteristic is estimated. For example, you can hardly get the original file from this function. However, if there is a collision problem (for example, you need to find any text that matches such a Hash), then such a characteristic can be important. For example, passwords, but more on them later.

8. Open or closed source code of such a function. If the code is not open source, then the complexity of data recovery, namely cryptographic strength, remains in question. This is partly a problem with encryption.

Now you can move on to the question "what is it all for?"

Why is Hash needed?

There are only three main goals of hash functions (or rather, their purpose).

1. Checking data integrity. V this case everything is simple, such a function should be calculated quickly and allow just as quickly to check that, for example, a file downloaded from the Internet was not damaged during transmission.

2. Increase in the speed of data retrieval. A fixed block size allows you to get many advantages in solving search problems. In this case, the point is that, technically, using hash functions can have a positive effect on performance. For such functions, collision probability and good distribution are very important.

3. For cryptographic needs. This view convolution functions are used in those areas of security where it is important that the results are difficult to replace or where it is necessary to make the task of obtaining as difficult as possible useful information from Hash.

Where and how is hash used?

As you probably already guessed, Hash is used in many tasks. Here are a few of them:

1. Passwords are usually stored not in clear text, but in the form of hash-sums, which allows for a higher degree of security. After all, even if an attacker gains access to such a database, he will still have to spend a lot of time to find the appropriate texts for these hash codes. This is where the characteristic "the complexity of recovering the original data from the Hash values" is important.

Note: I advise you to read this article for a couple of tips to improve password security.

2. In programming, including databases. Of course, most often we are talking about data structures that allow quick search... Purely technical aspect.

3. When transferring data over a network (including the Internet). Many protocols, such as TCP / IP, include special check fields that contain the hash of the original message so that if there is a failure somewhere, it will not affect the data transfer.

4. For various security related algorithms. For example, Hash is used in electronic digital signatures.

5. To check the integrity of files. If you paid attention, then often on the Internet you can find files (for example, archives) additional descriptions with hash code. This measure is used not only so that you do not accidentally launch a file that is damaged when downloading from the Internet, but also there are simply failures on hosting. In such cases, you can quickly check the Hash and, if required, re-upload the file.

6. Sometimes, hash functions are used to create unique identifiers (as part of). For example, when saving pictures or just files, they usually use Hash in names together with date and time. This avoids overwriting files with the same name.

In fact, the further you go, the more often hash functions are used in information technology... Mainly due to the fact that the amount of data and the power of the most simple computers have grown a lot. In the first case, it is more about search, and in the second, it is more about security issues.

Notable hash functions

The most famous are the following three hash functions.

Annotation: In this lecture, the concept of a hash function is formulated, and also given short review algorithms for the formation of hash functions. In addition, the possibility of using block encryption algorithms to generate a hash function is considered.

The purpose of the lecture: to get acquainted with the concept of "hash function", as well as with the principles of such functions.

Hash function concept

Hash function is a mathematical or other function that, for a string of arbitrary length, calculates some integer value or some other string of fixed length. Mathematically, it can be written like this:

where M is the original message, sometimes called prototype and h is the result called the hash value (and also hash code or digest messages(from the English. message digest)).

The meaning of the hash function is to determine the characteristic feature of the preimage - the value of the hash function. This value usually has a certain fixed size, such as 64 or 128 bits. The hash code can be further analyzed to solve any problem. So, for example, hashing can be used to compare data: if two data arrays have different hash codes, the arrays are guaranteed to be different; if they are the same, the arrays are most likely the same. In the general case, there is no one-to-one correspondence between the original data and the hash code due to the fact that the number of values of hash functions is always less than the number of variants of the input data. Therefore, there are many input messages giving the same hash codes (such situations are called collisions). The likelihood of collisions plays an important role in assessing the quality of hash functions.

Hash functions are widely used in modern cryptography.

The simplest hash function can be constructed using the "sum modulo 2" operation as follows: we get the input string, add all the bytes modulo 2 and return the result byte as the value of the hash function. The length of the hash value in this case will be 8 bits, regardless of the size of the input message.

For example, suppose the original digitized message was the following (in hexadecimal format):

We translate the message into binary form, write the bytes under each other and add the bits in each column modulo 2:

0011 1110 0101 0100 1010 0000 0001 1111 1101 0100 ---------- 0110 0101

The result (0110 0101 (2) or 65 (16)) will be the hash value.

However, such a hash function cannot be used for cryptographic purposes, such as generating electronic signature, since it is quite easy to change the content of the signed message without changing the checksum value.

Therefore, the considered hash function is not suitable for cryptographic applications. In cryptography, a hash function is considered good if it is difficult to create two preimages with the same value hash function, and also if the output of the function does not explicitly depend on the input.

Let's formulate the basic requirements for cryptographic hash functions:

the hash function must be applicable to messages of any size;
the calculation of the value of the function must be performed quickly enough;
with a known value of the hash function, it should be difficult (almost impossible) to find a suitable preimage of M;
with a known message M, it should be difficult to find another message M 'with the same hash value as the original message;
it should be hard to find any pair of random different messages with the same hash value.

Creating a hash function that meets all of these requirements is not an easy task. It should also be remembered that data of arbitrary size is received at the input of the function, and the hash result should not be the same for data of different sizes.

Currently, in practice, functions are used as hash functions that process the input message block by block and calculate the hash value h i for each block M i of the input message according to dependencies of the form

h i = H (M i, h i-1),

where h i-1 is the result obtained when calculating the hash function for previous block input data.

As a result, the output of the hash function h n is a function of all n blocks of the input message.

Using block encryption algorithms to generate a hash function

You can use a block hash function as a hash function. If the block algorithm used is cryptographically secure, then the hash function based on it will also be reliable.

The simplest way to use the block algorithm to obtain a hash code is to encrypt the message in CBC mode. In this case, the message is presented as a sequence of blocks, the length of which is equal to the length of the encryption algorithm block. If necessary, the last block is padded on the right with zeros to get a block of the desired length. The hash value will be the last encrypted block of text. Provided that a reliable block encryption algorithm is used, the resulting hash value will have the following properties:

it is practically impossible without knowledge of the encryption key to calculate the hash value for a given open array of information;
it is practically impossible without knowledge of the encryption key to select open data for a given value of the hash function.

The hash value formed in this way is usually called imitation insert or authenticator and is used to check the integrity of the message. Thus, insertion of imitation is a control combination that depends on open data and secret key information. The purpose of using a simulated insert is to detect all accidental or deliberate changes in the information array. The value obtained by the hash function when processing an input message is appended to the message when the message is known to be correct. The recipient verifies the integrity of the message by calculating the impersonation of the received message and comparing it with the received hash code, which must be transmitted in a secure manner. One of these safe ways there may be an encryption imitating insert private key the sender, i.e. creating a signature. It is also possible to encrypt the received hash code with a symmetric encryption algorithm, if the sender and the receiver have a common symmetric encryption key.

The specified process for obtaining and using a simulated insert is described in the domestic standard GOST 28147-89. The standard proposes to use the least significant 32 bits of the block obtained at the output of the encryption operation of the entire message in the mode of concatenating cipher blocks to control the integrity of the transmitted message. In the same way, to form an imitating insert, you can use any block symmetric encryption algorithm.

Others possible way The application of a block cipher to generate a hash code is as follows. The original message is processed sequentially in blocks. The last block is padded with zeros if necessary, sometimes the length of the message is added to the last block as a binary number. At each stage, we encrypt the hash value obtained in the previous stage, taking the current message block as the key. The last encrypted value received will be the final hash result.

In fact, there are several more possible schemes for using a block cipher to form a hash function. Let М i be the block of the original message, hi - the value of the hash function at the i-th stage, f - the block encryption algorithm used in the simple replacement mode, - the operation of addition modulo 2. Then, for example, the following schemes for generating the hash function are possible :

In all these schemes, the length of the generated hash value is equal to the length of the encrypted block. All these, as well as some other schemes for using a block cipher algorithm to calculate hash values, can be applied in practice.

The main disadvantage of hash functions designed on the basis of block algorithms is the relatively low speed work. The required cryptographic strength can be achieved in a smaller number of operations on the input data. There are faster hashing algorithms, designed independently, from scratch, based on the requirements of cryptographic strength (the most common of them are MD5, SHA-1, SHA-2 and GOST R 34.11-94).

What is a hash? A hash function is a mathematical transformation of information into a short, specific length string.

Why is this needed? Hash analysis is often used to check the integrity of important files. operating system, important programs, important data. Control can be carried out both as needed and on a regular basis.

How it's done? First, they determine the integrity of which files need to be monitored. For each file, its hash value is calculated using a special algorithm, with the result being saved. After the required time, a similar calculation is made and the results are compared. If the values differ, then the information contained in the file has been changed.

What characteristics should a hash function have?

must be able to convert data of arbitrary length to fixed;
must have an open algorithm so that you can investigate its cryptographic strength;
should be one-sided, that is, there should be no mathematical possibility to determine the initial data by the result;
should "resist" collisions, that is, should not produce the same values for different input data;
should not require large computing resources;
at the slightest change in the input data, the result should change significantly.

What are the popular hashing algorithms? The following hash functions are currently in use:

CRC - Cyclic Redundancy Code or Checksum. The algorithm is quite simple, it has a large number of variations depending on the required output length. Not cryptographic!
MD 5 is a very popular algorithm. Like him previous version MD 4 is a cryptographic function. The hash size is 128 bits.
SHA -1 is also a very popular cryptographic function. The hash size is 160 bits.
GOST R 34.11-94 - Russian cryptographic standard for calculating hash functions. The hash size is 256 bits.

When can a system administrator use these algorithms? Often, when downloading any content, for example, programs from the manufacturer's website, music, films or other information, there is a value checksums calculated according to a certain algorithm. For security reasons, after downloading, it is necessary to independently calculate the hash function and compare the value with what is indicated on the website or in the attachment to the file. Have you ever done this?

Why is it more convenient to calculate the hash? Now there are a large number of such utilities, both paid and free for use. I personally liked HashTab. Firstly, during installation, the utility is built in as a tab in the file properties, secondly, it allows you to choose a large number of hashing algorithms, and thirdly, it is free for private non-commercial use.

What is Russian? As mentioned above, in Russia there is a hashing standard GOST R 34.11-94, which is widely used by many manufacturers of information security products. One of these tools is the program of fixation and control of the initial state. software package FIX. This program is a means of monitoring the effectiveness of the use of information security systems.

FIX (version 2.0.1) for Windows 9x / NT / 2000 / XP

Calculation of checksums of specified files using one of 5 implemented algorithms.
Fixation and subsequent control of the initial state of the software package.
Comparison of versions of the software package.
Fixation and control of directories.
Control of changes in specified files (directories).
Formation of reports in TXT, HTML, SV formats.
The product has a FSTEC certificate for NDV 3 No. 913 until June 01, 2013.

And what about the digital signature? The result of calculating the hash function, together with the user's secret key, goes to the input of the cryptographic algorithm, where the digital signature is calculated. Strictly speaking, the hash function is not part of the EDS algorithm, but it is often done on purpose in order to exclude an attack using a public key.

Nowadays, many e-commerce applications allow you to store the user's secret key in a private token area (ruToken, eToken) without the technical possibility of extracting it from there. The token itself has a very limited memory area, measured in kilobytes. To sign a document, there is no way to transfer the document to the token itself, but it is very easy to transfer the hash of the document to the token and receive an EDS at the output.

Questions:

1. The concept of a hash function.

2. Using block encryption algorithms to form a hash function.

3. Review of algorithms for the formation of hash functions.

1. Hash function concept

Hash function(hash function) is a mathematical or other function that, for a string of arbitrary length, calculates some integer value or some other string of fixed length. Mathematically, it can be written like this:

h = H (M) ,

where M - the original message, sometimes called prototype , a h - the result, called the hash value (and also hash code or digest messages(from the English. message digest)).

The meaning of the hash function is to determine the characteristic feature of the preimage - the value of the hash function. This value usually has a certain fixed size, such as 64 or 128 bits. The hash code can be further analyzed to solve any problem. So, for example, hashing can be used to compare data: if two data arrays have different hash codes, the arrays are guaranteed to be different; if they are the same, the arrays are most likely the same. In the general case, there is no one-to-one correspondence between the original data and the hash code due to the fact that the number of values of hash functions is always less than the number of variants of the input data. Therefore, there are many input messages giving the same hash codes (such situations are called collisions ). The likelihood of collisions plays an important role in assessing the quality of hash functions.

Hash functions are widely used in modern cryptography.

For example, suppose the original digitized message was the following (in hexadecimal format):

2 B1 4 A9 5 FE4

We translate the message into binary form, write the bytes under each other and add the bits in each column modulo 2:

0010 1011

0001 0100

1010 1001

0101 1111

1110 0100

——————-

0010 1101

Result: 0010 1101 or 2 D and will be the value of the hash function.

However, such a hash function cannot be used for cryptographic purposes, for example, for generating an electronic signature, since it is quite easy to change the content of a signed message without changing the checksum value.

Therefore, the considered hash function is not suitable for cryptographic applications. In cryptography, a hash function is considered good if it is difficult to create two preimages with the same hash value, and also if the function's output does not explicitly depend on the input.

Let's formulate the basic requirements for cryptographic hash functions:

· The hash function must be applicable to a message of any size;

· The calculation of the value of the function should be performed quickly enough;

With a known value of the hash function, it should be difficult (almost impossible) to find a suitable preimage M ;

With a known message M it should be hard to find another message M ' with the same hash value as the original message;

· It should be difficult to find any pair of random different messages with the same hash value.

Currently, in practice, functions are used as hash functions that process the input message block by block and calculate the hash value h i for each block M i input message by dependencies of the form

h i = H (M i, h i-1),

where h i-1 - the result obtained when calculating the hash function for the previous block of input data.

As a result, the output of the hash function is h n is a function of all n blocks of the input message.

2. Using block encryption algorithms to generate a hash function.

A symmetric block encryption algorithm can be used as a hash function. If the block algorithm used is cryptographically secure, then the hash function based on it will also be reliable.

The simplest way to use the block algorithm to obtain a hash code is to encrypt the message in CBC mode ( Cipher Block Chaining - Chaining blocks of ciphertext). In this case, the message is presented as a sequence of blocks, the length of which is equal to the length of the encryption algorithm block. If necessary, the last block is padded on the right with zeros to get a block of the desired length. The hash value will be the last encrypted block of text. Provided that a reliable block encryption algorithm is used, the resulting hash value will have the following properties:

· It is almost impossible to calculate the hash value for a given open array of information without knowing the encryption key;

· It is practically impossible to select open data for a given value of the hash function without knowing the encryption key.

The hash value formed in this way is usually called imitation insert or authenticator and is used to check the integrity of the message. Thus, insertion of imitation is a control combination that depends on open data and secret key information. The purpose of using a simulated insert is to detect all accidental or deliberate changes in the information array. The value obtained by the hash function when processing an input message is appended to the message when the message is known to be correct. The recipient verifies the integrity of the message by calculating the impersonation of the received message and comparing it with the received hash code, which must be transmitted in a secure manner. One of these secure methods can be encryption of the impersonation with the sender's private key, i.e. creating a signature. It is also possible to encrypt the received hash code with a symmetric encryption algorithm if the sender and receiver have a common symmetric encryption key.

The specified process for obtaining and using a simulated insert is described in the domestic standard GOST 28147-89. The standard proposes to use the least significant 32 bits of the block received at the output of the encryption operation of the entire message in the mode of concatenating cipher blocks to control the integrity of the transmitted message. In the same way, any block symmetric encryption algorithm can be used to form an imitated insert.

Another possible way to use a block cipher to generate a hash code is as follows. The original message is processed sequentially in blocks. The last block is padded with zeros if necessary, sometimes the length of the message is added to the last block as a binary number. At each stage, we encrypt the hash value obtained at the previous stage, taking the current message block as the key. The last encrypted value received will be the final hash result.

Thus, if the usual message encryption scheme is M using a block cipher f on the key TO we recorded as E = f (M, K) , then the scheme for obtaining the hash code h according to the algorithm described above can be represented as

h i = f ( h i -1 , M )

As the initial hash code h 0 take some constant. Encryption is performed in a simple overwrite mode. Using this method the size of the block is the same as the length of the key and the size of the hash value will be the length of the block.

Another way of using the block cipher in the simple replacement mode is also possible: the message elements are encrypted with the hash values obtained in the previous step:

h i = f ( M , h i -1 ,)

In fact, there are several more possible schemes for using a block cipher to form a hash function. Let be M i - the block of the original message, h i - the value of the hash function on i -th stage, f - block encryption algorithm used in the simple replacement mode; - addition operation modulo 2. Then, for example, the following schemes for generating a hash function are possible:

The main disadvantage of hash functions designed on the basis of block algorithms is the relatively low speed of operation. The required cryptographic strength can be achieved in a smaller number of operations on the input data. There are faster hashing algorithms (the most common of them are MD5, SHA-1, SHA-2 and GOST R 34.11-94).

3. Review of algorithms for the formation of hash functions.

Currently, various special algorithms for calculating the hash function have been proposed and are practically used. The most famous algorithms are MD5, SHA-1, SHA-2 and other versions of SHA, as well as the domestic algorithm set out in GOST R 34.11-94.

Algorithm MD5 appeared in the early 90s of the twentieth century as a result of the improvement of the algorithm for generating the MD4 hash function. The characters in the name "MD" stand for Message Digest - a summary of the message. The author of the MD4 and MD5 algorithms is R. Rivest. As a result of using MD5, a 128-bit hash value is generated for an arbitrary message. Input data is processed in blocks of 512 bits. The algorithm uses elementary logical operations(inversion, conjunction, addition modulo 2, cyclic shifts, etc.), as well as ordinary arithmetic addition. Complex repetition of these elementary functions the algorithm ensures that the result after processing is well mixed. Therefore, it is unlikely that two messages, chosen at random, have the same hash code. The MD5 algorithm has the following property: each bit of the received hash value is a function of each bit of the input. MD5 is considered to be the strongest hash function for a 128-bit hash value.

Algorithm SHA The Secure Hash Algorithm (Secure Hash Algorithm) was developed by the US National Institute of Standards and Technology (NIST) and published as a US Federal Information Standard in 1993. SHA-1, like MD5, is based on the MD4 algorithm. SHA-1 generates a 160-bit hash value based on processing the original message in blocks of 512 bits. The SHA-1 algorithm also uses simple logical and arithmetic operations. The most important difference between SHA-1 and MD5 is that the SHA-1 hash is 32 bits longer than the MD5 hash. If we assume that both algorithms are of the same complexity for cryptanalysis, then SHA-1 is a more robust algorithm. Using a brute force attack (frontal attack), it is more difficult to create an arbitrary message that has a given hash code, and it is also more difficult to create two messages that have the same hash code.

In 2001, the US National Institute of Standards and Technology adopted three hash functions as a standard with a longer hash length than SHA-1. Often these hash functions are called SHA-2 or SHA-256, SHA-384 and SHA-512 (the name indicates the length of the hash code generated by the algorithms). These algorithms differ not only in the length of the generated hash code, but also in the used internal functions and the length of the processed block (for SHA-256, the block length is 512, and for SHA-384 and SHA-512, the block length is 1024 bits). Gradual improvements to the SHA algorithm lead to an increase in its cryptographic strength. Despite the differences of the considered algorithms from each other, they are all further development of SHA-1 and MD4 and have a similar structure.

In Russia, GOST R34.11-94 has been adopted, which is the domestic standard for hash functions. Its structure is quite different from the structure of the SHA-1,2 or MD5 algorithms, which are based on the MD4 algorithm. The length of the hash code generated by the GOST R 34.11-94 algorithm is 256 bits. The algorithm sequentially processes the original message in blocks of 256 bits from right to left. The parameter of the algorithm is the hashing start vector - an arbitrary fixed value with a length of also 256 bits. The GOST R 34.11-94 algorithm uses the operations of permutation, shift, arithmetic addition, addition modulo 2. As auxiliary function in GOST 34.11-94, the algorithm according to GOST 28147-89 is used in the simple replacement mode.

4. Hash requirements

A hash function is a one-way function designed to retrieve a digest or "fingerprint" of a file, message, or some block of data.

The hash code is generated by the function H :

h = H (M)

Where M is a message of arbitrary length and h is a fixed length hash code.

Consider the requirements that a hash function must meet in order for it to be used as a message authenticator. Let's look at a very simple hash function example. Then we will analyze several approaches to building a hash function.

Hash function H used to authenticate messages must have the following properties:

1. Hash function H must be applied to a data block of any length.

2. Hash function H creates a fixed length output.

3. H (M) relatively easy (in polynomial time) is calculated for any value M .

4. For any given hash code value h computationally impossible to find M such that H (M) = h .

5. For any given NS it is computationally impossible to find that

H (y) = H (x).

6. It is computationally impossible to find an arbitrary pair ( NS , y ) such that H (y) = H (x) .

The first three properties require the hash function to generate a hash code for any message.

The fourth property defines the requirement of one-way hash function: it is easy to create a hash code from a given message, but it is impossible to recover a message from a given hash code. This property is important if hash authentication includes a secret value. The secret value itself may not be sent, however, if the hash function is not one-way, the adversary can easily reveal the secret value as follows. When the transmission is intercepted, the attacker receives a message M and hash code C = H (SAB || M) ... If the attacker can invert the hash function, then, therefore, he can get SAB || M = H-1 (C) ... Since the attacker now knows and M and SAB || M , receive SAB quite simple.

The fifth property ensures that no other message can be found whose hash value matches the hash value. of this message... This prevents the authenticator from being tampered with when using an encrypted hash code. In this case, the adversary can read the message and, therefore, generate its hash code. But since the adversary does not own the secret key, he has no way to change the message so that the recipient does not find it. If this property is not met, the attacker can perform the following sequence of actions: intercept the message and its encrypted hash code, calculate the message hash code, create an alternative message with the same hash code, replace the original message with a fake one. Since the hash codes of these messages are the same, the recipient will not detect the spoofing.

A hash function that satisfies the first five properties is called a simple or weak hash function. If, in addition, the sixth property is satisfied, then such a function is called a strong hash function. The sixth property protects against a class of attacks known as the birthday attack.

5. Simple hash functions

All hash functions are performed as follows. An input value (message, file, etc.) is treated as a sequence n -bit blocks. The input value is processed sequentially block by block, and created m -bit hash code value.

One of the simplest examples of a hash function is the bitwise XOR of each block:

C i - i th bit of the hash code, 1 <= i <= n .

k - number n -bit blocks of input.

b ij – i th bit in j th block.

Then the entire message is encrypted, including the hash code, in the CBC mode to create encrypted blocks Y1, Y2, ..., YN + 1. By definition of SHS we have:

But XN + 1 is the hash code:

Since the terms in the previous equality can be calculated in any order, therefore, the hash code will not be changed if the encrypted blocks are rearranged.

The original standard proposed by NIST used simple XOR, which was applied to 64-bit message blocks, then the entire message was encrypted using CBC mode.

"The birthday paradox"

Before looking at more complex hash functions, there is one particular attack on simple hash functions that needs to be analyzed.

The so-called "birthday paradox" is as follows. Suppose the number of output values of the hash function is H equals n ... What should be the number k so that for a specific value X and values Y1, , Yk the probability that at least one Yi satisfies the equality

H (X) = H (Y)

would be greater than 0.5.

For one Y the likelihood that H (X) = H (Y) , is equal to 1 / n .

Accordingly, the probability that , is equal to 1 - 1 / n .

If you create k values, then the probability that there will be no coincidences for none of them is equal to the product of the probabilities corresponding to one value, i.e. (1 - 1 / n) k .

Therefore, the probability of at least one match is

1 - (1 - 1 / n) k

Thus, we found out that for m -bit hash code is enough to choose 2m-1 messages so that the probability of matching hash codes is greater than 0.5.

Now consider the following problem: denote P (n, k) the probability that in a set of k elements, each of which can take n values, there are at least two with the same values. What should be equal k , to P (n, k) would be more 0,5 ?

The number of different ways to select elements in such a way that there are no duplicates is

n (n-1) ... (n-k + 1) = n! / (n-k)!

The total number of possible ways to select elements is n k

The probability that there are no duplicates is n! / (n-k)! n k

The probability that there are duplicates, respectively, is

1 - n! / (N-k)! Nk P (n, k) = 1 - n! / ((n-k)! x nk) = 1 - (n x (n-1) x ... x (n-k-1)) / nk = 1 - [(n-1) / n x (n-2) / n x ... x (n-k + 1) / n] = 1 - [(1- 1 / n) x (1 - 2 / n) x ... x (1 - (k-1) / n)]

If the hashcode is long m bit i.e. takes 2m values, then

This result is called the "birthday paradox" because, according to the above reasoning, in order for two people to have more than 0.5 probability of coinciding birthdays, there must be only 23 people in a group. This result seems surprising, perhaps because for each individual in the group, the likelihood that their birthday will coincide with the birthday of someone else in the group is rather small.

Let's go back to considering the properties of hash functions. Let's assume a 64-bit hash code is used. It can be considered that this is quite sufficient and, therefore, safe length for the hash code. For example, if the encrypted hash code is WITH transmitted with the corresponding unencrypted message M , then the enemy will need to find M ' such that

H (M ") = H (M) ,

in order to spoof the message and deceive the recipient. On average, the adversary must go through 263 messages in order to find one whose hash code is equal to the intercepted message.

However, various kinds of attacks are possible based on the "birthday paradox". The following strategy is possible:

1. The enemy creates 2 m / 2 message options, each of which has some specific meaning. The adversary prepares the same number of messages, each of which is fake and intended to replace the real message.

2. The two sets of messages are compared to find a pair of messages that have the same hash code. The probability of success according to the "birthday paradox" is greater than 0.5. If no matching pair is found, then additional original and fake messages are generated until a match is found.

3. The attacker offers the sender the original message for signature. This signature can then be attached to the fake variant for transmission to the recipient. Since both options have the same hashcode, the same signature will be generated. The adversary will be confident of success even without knowing the encryption key.

Thus, if a 64-bit hash code is used, then the required computational complexity is on the order of 232.

In conclusion, we note that the length of the hash code must be large enough. A length of 64 bits is not considered safe at this time. Preferably, the length is in the order of 100 bits.

Using encrypted block chaining

There are various hash functions based on the creation of a chain of encrypted blocks, but without the use of a secret key. One such hash function was proposed by Rabin. Message M broken into blocks of fixed length M1, M2,. ... ... , МN and uses a symmetric encryption algorithm like DES to compute the hash code G in the following way:

H 0 - initial value H i = E Mi G = H N

This is similar to using encryption in CBC mode, but in this case there is no secret key. As with any simple hash function, this algorithm is susceptible to a "birthday attack", and if the encryption algorithm is DES and only a 64-bit hash code is generated, then the system is considered quite vulnerable.

Other attacks such as "birthday" can be carried out, which are possible even if the adversary has access to only one message and the corresponding encrypted hash code and cannot obtain several pairs of messages and encrypted hash codes. The following scenario is possible: suppose that the adversary intercepted the message with the authenticator in the form of an encrypted hash code, and it is known that the unencrypted hash code has a length m bits. Next, the enemy must perform the following actions:

Using the algorithm described above, compute the unencrypted hash code G .

Create a fake message as Q1, Q2,. ... ... , QN-2 .

Calculate H i = E Qi for 1 <= i <= N-2 .

· Create 2 m / 2 random blocks NS and for each such block NS calculate E X ... Create additional 2 m / 2 random block Y and for each block Y calculate D Y [G] , where D - decryption function corresponding E ... Based on the "birthday paradox", we can say that with a high degree of probability this sequence will contain blocks NS and Y such that E X = D Y [Y] .

Create message Q1, Q2,. ... ... , QN-2, X, Y ... This message has a hash code G and therefore can be used in conjunction with an encrypted authenticator.

This form of attack is known as a meet-in-the-middle attack. Various studies have proposed more sophisticated methods to reinforce the blockchain approach. For example, Davis and Price described the following option:

Another option is possible:

However, both of these schemes are also vulnerable to various attacks. More generally, some form of "birthday attack" can be shown to succeed with any hash algorithm involving the use of a cipher blockchain without the use of a secret key.

Further research was directed towards finding other approaches to creating hashing functions.

Hash function MD5

Consider the MD5 (RFC 1321) message digest algorithm developed by MIT's Ron Rivest.

MD5 execution logic

The algorithm receives a message of arbitrary length as input and creates a 128-bit message digest as output. The algorithm consists of the following steps:

Rice. 8.1. MD5 execution logic

Step 1: adding missing bits

The message is supplemented so that its length becomes 448 modulo 512 (). This means that the length of the added message is 64 bits less than a multiple of 512. The addition is always made, even if the message is of the required length. For example, if a message is 448 bits long, it is padded with 512 bits to 960 bits. Thus, the number of added bits ranges from 1 to 512.

The addition consists of one followed by the required number of zeros.

Step 2: adding length

The 64-bit representation of the length of the original (before appending) message in bits is appended to the result of the first step. If the original length is greater than 2 64, then only the last 64 bits are used. Thus, the field contains the length of the original message modulo 64.

The first two steps create a message that is a multiple of 512 bits. This extended message is represented as a sequence of 512-bit blocks Y 0, Y 1,. ... ., Y L-1, while the total length of the extended message is L * 512 bits. Thus, the length of the received extended message is a multiple of sixteen 32-bit words.

Rice. 8.2. Extended message structure

Step 3: initializing the MD buffer

A 128-bit buffer is used to store the intermediate and final results of the hash function. The buffer can be represented as four 32-bit registers (A, B, C, D). These registers are initialized with the following hexadecimal numbers:

A = 01234567 B = 89ABCDEF C = FEDCBA98 D = 76543210

Step 4: processing a sequence of 512-bit (16-word) blocks

The basis of the algorithm is a module consisting of four cyclic processing, designated as HMD5. The four loops have a similar structure, but each loop uses its own atomic logic function, denoted f F, f G, f H, and f I, respectively.

Rice. 8.3. Processing the next 512-bit block

Each cycle takes as input the current 512-bit block Y q, which is currently being processed, and the 128-bit value of the buffer ABCD, which is an intermediate digest value, and changes the contents of this buffer. Each loop also uses the fourth part of the 64-element table T based on the sin function. The i-th element of T, denoted by T [i], has a value equal to the integer part of 2 32 * abs (sin (i)), i is in radians. Since abs (sin (i)) is a number between 0 and 1, each element of T is an integer that can be represented by 32 bits. The table provides a "random" set of 32-bit values that should eliminate any regularity in the input.

To obtain MD q + 1, the output of four cycles is added modulo 2 32 with MD q. The addition is performed independently for each of the four words in the buffer.

CLS s - circular left shift by s bits of a 32-bit argument.

X [k] - M is the k-th 32-bit word in the q-th 512 message block.

T [i] - i-th 32-bit word in matrix T.

+ - addition mod 2 32.

At each of the four cycles of the algorithm, one of the four elementary logical functions is used. Each atomic function takes three 32-bit words as input and creates one 32-bit word at the output. Each function is a set of bitwise logical operations, i.e. The nth bit of the output is a function of the nth bit of the three inputs. The elementary functions are as follows:

An array of 32-bit words X contains the value of the current 512-bit input block that is currently being processed. Each cycle is executed 16 times, and since each block of the input message is processed in four cycles, each block of the input message is processed according to the scheme shown in Fig. 4, 64 times. If we represent the input 512-bit block in the form of sixteen 32-bit words, then each input 32-bit word is used four times, once in each cycle, and each element of the table T, consisting of 64 32-bit words, is used only one once. After each step of the loop, four words A, B, C, and D are cyclically shifted to the left. At each step, only one of the four words of the buffer ABCD is changed. Therefore, each word in the buffer is changed 16 times, and then the 17th time at the end to get the final output of that block.

digest.

2. Speed: the software implementation of the algorithm should be fast enough. In particular, the algorithm should be fast enough on a 32-bit architecture. Therefore, the algorithm is based on a simple set of elementary operations on 32-bit words.

3. Simplicity and compactness: the algorithm should be simple to describe and easy to program, without large programs or lookup tables. These characteristics not only have obvious software advantages, but are also desirable from a security point of view, because it is better to have a simple algorithm for analyzing possible weaknesses.

4. Little-endian architecture desirable: some processor architectures (such as the Intel 80xxx line) store the left-hand bytes of a word in the position of the least-endian byte addresses. Others (such as SUN Sparcstation) store the right bytes of the word in the position of the least significant byte addresses (big MD4 additional constant in the first loop is not applied. A similar additional constant is used for each of the steps in the second loop. Another additional constant is used for each of the steps in the third loop The hash code is a function of each bit of the input.The complex repetition of the atomic functions f F, f G, f H and f I ensures that the result is well mixed; that is, it is unlikely that two messages are chosen at random, even if they have apparently similar patterns, had the same digests that produce the same output value, which means that running MD5 on a single 512-bit block will result in the same output for two different inputs in buffer ABCD. does not exist on MD5.