AlexNet is a convolutional neural network for image classification.

AlexNet is a convolutional neural network that has had a major impact on the development of machine learning, especially computer vision algorithms. The network won the ImageNet LSVRC-2012 image recognition competition by a large margin (with a top-5 error rate of 15.3% versus 26.2% for the second-place entry).

AlexNet's architecture is similar to that of Yann LeCun's LeNet. However, AlexNet has more filters per layer and stacked convolutional layers. The network uses convolutions, max pooling, dropout, data augmentation, ReLU activation functions, and stochastic gradient descent.

Features of AlexNet

  1. ReLU is used as the activation function instead of the hyperbolic tangent to add nonlinearity to the model. This makes training about six times faster at the same accuracy.
  2. Dropout is used instead of conventional regularization to address overfitting. However, training time doubles with a dropout rate of 0.5.
  3. Overlapping pooling is used to reduce the size of the network. This reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively.

ImageNet dataset

ImageNet is a collection of 15 million labeled high-resolution images divided into 22,000 categories. The images were collected online and labeled manually using Amazon's Mechanical Turk crowdsourcing service. Since 2010, the annual ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held as part of the Pascal Visual Object Challenge. The challenge uses a subset of ImageNet with roughly 1000 images in each of 1000 categories: about 1.2 million training images, 50,000 validation images, and 150,000 test images. ImageNet consists of images with different resolutions, so for the competition they are scaled to a fixed resolution of 256 × 256. If the original image is rectangular, it is cropped to a square at the center of the image.
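To make this preprocessing step concrete, here is a minimal sketch of the rescale-and-center-crop operation described above, written with the Pillow library; the function name and file path are illustrative assumptions, not part of the original pipeline.

```python
from PIL import Image

def resize_and_center_crop(path, size=256):
    """Rescale the shorter side to `size`, then crop a size x size square from the center."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size))

# example usage (hypothetical file name):
# square = resize_and_center_crop("some_imagenet_photo.jpg")
```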

Architecture

Figure 1. The AlexNet architecture

The network architecture is shown in Figure 1. AlexNet contains eight weighted layers: the first five are convolutional and the remaining three are fully connected. The output of the last layer is passed through a 1000-way softmax that produces a distribution over the 1000 class labels. The network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average, over all training cases, of the log-probability of the correct label under the predicted distribution. The kernels of the second, fourth and fifth convolutional layers are connected only to those kernel maps of the previous layer that reside on the same GPU. The kernels of the third convolutional layer are connected to all kernel maps of the second layer. Neurons in the fully connected layers are connected to all neurons in the previous layer.

Thus, AlexNet contains five convolutional layers and three fully connected layers. ReLU is applied after every convolutional and fully connected layer. Dropout is applied before the first and second fully connected layers. The network contains 62.3 million parameters and requires 1.1 billion computations in a forward pass. The convolutional layers, which account for 6% of all parameters, perform 95% of the computations.
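As a rough illustration of this layer stack, a single-GPU PyTorch sketch (without the original two-branch split across GPUs and without local response normalization) could look as follows; it is a simplified reconstruction rather than the reference implementation.

```python
import torch
import torch.nn as nn

class AlexNet(nn.Module):
    """Sketch of the 5-conv + 3-FC AlexNet layout (single-GPU variant)."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

# logits for a batch of 227x227 crops; softmax gives the 1000-class distribution
# probs = torch.softmax(AlexNet()(torch.randn(1, 3, 227, 227)), dim=1)
```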

Training

AlexNet is trained for 90 epochs. Training takes about six days on two Nvidia GeForce GTX 580 GPUs, which is why the network is split into two parts. Stochastic gradient descent is used with a learning rate of 0.01, momentum of 0.9, and weight decay of 0.0005. The learning rate is divided by 10 when accuracy saturates, and this reduction happens three times over the course of training. The weight update rule for w looks like:

v_{i+1} = 0.9 * v_i - 0.0005 * epsilon * w_i - epsilon * <dL/dw | w_i>_{D_i},    w_{i+1} = w_i + v_{i+1}

where i is the iteration number, v is the momentum variable, epsilon is the learning rate, and <dL/dw | w_i>_{D_i} is the average over the i-th batch D_i of the derivative of the objective with respect to w, evaluated at w_i. Throughout training, the learning rate was the same for all layers and was adjusted manually: the heuristic was to divide it by 10 whenever the validation error stopped decreasing.
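Assuming a PyTorch model such as the sketch above, these optimizer settings can be approximated in a few lines; note that PyTorch's built-in weight decay is only an approximation of the exact update rule quoted here.

```python
import torch

model = AlexNet()  # the sketch above (assumption)
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01,            # epsilon, the learning rate
                            momentum=0.9,       # the 0.9 * v_i momentum term
                            weight_decay=5e-4)  # the 0.0005 weight decay

# heuristic from the text: divide the learning rate by 10 when validation error plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)
# after each epoch: scheduler.step(validation_error)
```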

Examples of use and implementation

The results show that a large, deep convolutional neural network is capable of achieving record results on very complex datasets using only supervised learning. Within a year of AlexNet's publication, essentially all ImageNet contestants had switched to convolutional neural networks for the classification problem. AlexNet was one of the first large-scale GPU implementations of a convolutional neural network and ushered in a new era of research. Today it is straightforward to implement AlexNet using deep learning libraries such as PyTorch, TensorFlow, and Keras.
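For example, a pretrained AlexNet can be loaded from torchvision in a few lines. This is a sketch: the exact weights argument depends on the torchvision version, and the input file name is a placeholder.

```python
import torch
from torchvision import models, transforms
from PIL import Image

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# img = Image.open("example.jpg")                                # placeholder image
# probs = torch.softmax(model(preprocess(img).unsqueeze(0)), dim=1)
# print(probs.topk(5))                                           # five most likely classes
```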

Results

The network achieves top-1 and top-5 error rates of 37.5% and 17.0%, respectively. The best performance achieved in the ILSVRC-2010 competition was 47.1% and 28.2%, using an approach that averages the predictions of six sparse-coding models trained on different feature vectors. Since then, error rates of 45.7% and 25.7% have been reported using an approach that averages the predictions of two classifiers trained on Fisher vectors. The ILSVRC-2010 results are shown in Table 1.


Left: eight ILSVRC-2010 test images and the five labels the model considers most likely. The correct label is written under each image, and its probability is shown with a red bar if it is among the top five. Right: five ILSVRC-2010 test images in the first column; the remaining columns show six training images.

A neural network is a mathematical model, together with its software or hardware implementation, built by modeling the activity of biological neural networks, that is, the networks of neurons in a living organism. Scientific interest in this structure arose because studying such a model allows one to obtain information about the system being modeled, so it can find practical application in a number of branches of modern science and technology. This article discusses the use of neural networks for building image identification systems, which are widely used in security systems. Questions related to the image recognition algorithm and its application are examined in detail, and information on the methodology for training neural networks is briefly provided.

Keywords: neural networks, learning with neural networks, image recognition, local perception paradigm, security systems

  1. Yann LeCun, J.S. Denker, S. Solla, R.E. Howard and L.D. Jackel: Optimal Brain Damage, in Touretzky, David (Ed.), Advances in Neural Information Processing Systems 2 (NIPS*89). 2000. 100 p.

  2. Zhigalov K.Yu. Method of photorealistic vectorization of laser ranging data for further use in GIS // Izvestiya vysshikh uchebnykh zavedenii. Geodesy and aerial photography. 2007. No. 6. pp. 285-287.

  3. Ranzato Marc'Aurelio, Christopher Poultney, Sumit Chopra and Yann LeCun: Efficient Learning of Sparse Representations with an Energy-Based Model, in J. Platt et al. (Eds), Advances in Neural Information Processing Systems (NIPS 2006). 2010. 400 p.

  4. Zhigalov K.Yu. Preparation of equipment for use in automated control systems for road construction // Natural and Technical Sciences. Moscow, 2014. No. 1 (69). pp. 285-287.

  5. Y. LeCun and Y. Bengio: Convolutional Networks for Images, Speech, and Time-Series, in Arbib, M.A. (Ed.) // The Handbook of Brain Theory and Neural Networks. 2005. 150 p.

  6. Y. LeCun, L. Bottou, G. Orr and K. Muller: Efficient BackProp, in Orr, G. and Muller, K. (Eds) // Neural Networks: Tricks of the Trade. 2008. 200 p.

Today, technological and research progress is rapidly opening up new horizons. One of them is modeling the surrounding natural world using mathematical algorithms. Some of these tasks are trivial, for example modeling sea waves; others are extremely complex, non-trivial and multicomponent, for example modeling the functioning of the human brain. In the process of studying this question, a separate concept emerged: the neural network. A neural network is a mathematical model, together with its software or hardware implementation, built by modeling the activity of biological neural networks, that is, the networks of neurons in a living organism. Scientific interest in this structure arose because studying such a model allows one to obtain information about the system being modeled, so it can find practical application in a number of branches of modern science and technology.

A brief history of the development of neural networks

It should be noted that the concept of a "neural network" originates in the work of the American scientists W. McCulloch and W. Pitts (1943), who first mentioned it, defined it and made the first attempt to build a model of a neural network. In 1949 D. Hebb proposed the first learning algorithm. A number of studies in the field of neural network learning followed, and the first working prototypes appeared around 1990-1991. Nevertheless, the computing power of the equipment of that time was not sufficient for neural networks to run fast enough. By 2010 the power of GPUs had greatly increased, and the concept of programming directly on video cards appeared, which significantly (3-4 times) increased computer performance. In 2012 neural networks won the ImageNet challenge for the first time, which marked the start of their rapid development and the emergence of the term deep learning.

In the modern world, neural networks have enormous reach, and scientists consider research into the behavior and states of neural networks extremely promising. The list of areas in which neural networks have found application is huge: recognition and classification of patterns, forecasting, approximation problems, some aspects of data compression, data analysis and, of course, security systems of various kinds.

Neural networks are actively studied in the scientific communities of different countries. In this context, a neural network is treated as a special case of pattern recognition methods, discriminant analysis and clustering methods.

It should also be noted that over the past year more funding has been allocated to startups in the field of image recognition systems than over the previous five years, which indicates fairly high demand for this type of development in the end market.

Application of neural networks for image recognition

Consider the standard tasks solved by neural networks when applied to images:

● identification of objects;

● recognition of parts of objects (for example, faces, arms, legs, etc.);

● semantic definition of the boundaries of objects (allows you to leave only the boundaries of objects in the picture);

● semantic segmentation (allows you to split the image into various separate objects);

● estimation of surface normals (allows three-dimensional structure to be recovered from two-dimensional images);

● detection of salient objects (allows you to determine what a person would pay attention to in a given image).

It should be noted that image recognition is a challenging problem, and solving it is a complex and non-trivial process. The object being recognized may be a human face, a handwritten digit, or any of many other objects characterized by a number of unique features, which significantly complicates the identification process.

In this study, an algorithm for building and training a neural network to recognize handwritten characters is considered. The image is read in through the inputs of the neural network, and one of the outputs is used to produce the result.

At this stage, it is necessary to briefly dwell on the classification of neural networks. Today there are three main types:

● convolutional neural networks (CNN);

● recurrent networks (deep learning);

● reinforcement learning.

One of the most common examples of a neural network is the classical neural network topology. Such a network can be represented as a fully connected graph; its characteristic features are forward propagation of information and backward propagation of the error signal. This architecture has no recurrent properties. A neural network with the classical topology is shown in Fig. 1.

Fig. 1. Neural network with the simplest topology

Fig. 2. Neural network with 4 layers of hidden neurons

One clearly significant disadvantage of this network topology is redundancy: input data supplied, for example, as a two-dimensional matrix has to be flattened into a one-dimensional vector. Thus, for the image of a handwritten Latin letter described by a 34x34 matrix, 1156 inputs are required. This means that the computing power spent on a software or hardware implementation of this algorithm would be too large.
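To make the redundancy concrete, the following sketch counts the weights of a small fully connected network fed with a flattened 34x34 image; the hidden-layer size and the 26-class output are illustrative assumptions.

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Flatten(),                   # a 34x34 image becomes a vector of 1156 inputs
    nn.Linear(34 * 34, 256), nn.ReLU(),
    nn.Linear(256, 26),             # e.g. 26 Latin letters
)

n_params = sum(p.numel() for p in net.parameters())
print(n_params)                     # roughly 300 thousand weights for even this small network
```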

The problem was solved by Yann LeCun, who analyzed the work of the Nobel Prize winners in medicine T. Wiesel and D. Hubel, whose object of study was the visual cortex of the cat's brain. Their analysis showed that the cortex contains a number of simple cells as well as a number of complex cells: the simple cells responded to images of straight lines received from the visual receptors, and the complex cells to translational movement in one direction. On this basis, the principle of constructing neural networks called convolutional was developed. The idea of this principle is that the network alternates convolutional layers (C-layers), subsampling layers (S-layers) and, at the output, fully connected layers (F-layers).

At the heart of building a network of this kind are three paradigms - the paradigm of local perception, the paradigm of shared weights and the paradigm of subsampling.

The essence of the local perception paradigm is that each input neuron is fed not the entire image matrix but only a part of it, while the remaining parts are fed to other input neurons. This allows the processing to be parallelized and preserves the topology of the image from layer to layer while it is processed multidimensionally, that is, a number of neural networks can be used during processing.

The shared weights paradigm suggests that a small set of weights can be used for many connections; these sets are also called "kernels". With respect to the final result of image processing, shared weights have a positive effect on the properties of the neural network: they improve its ability to find invariants in images and to filter out noise components without processing them.

Based on the foregoing, we can conclude that convolving the image with a kernel produces an output image whose elements characterize the degree of correspondence to the filter, that is, a feature map is generated. This algorithm is shown in Fig. 3.

Fig. 3. Algorithm for generating a feature map
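A minimal sketch of this step: sliding a kernel over the image and recording how well each patch matches it yields the feature map. The vertical-edge kernel and the random image below are illustrative stand-ins.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image and take dot products (valid convolution)."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            feature_map[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return feature_map

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])            # responds strongly to vertical edges
image = np.random.rand(8, 8)               # stand-in for a grayscale image
print(convolve2d(image, kernel).shape)     # (6, 6) feature map
```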

The subsampling paradigm is that the input image is reduced by decreasing the spatial dimension of its mathematical equivalent, an n-dimensional matrix. Subsampling is needed to provide invariance to the scale of the original image. By alternating these types of layers, new feature maps can be generated from existing ones; in practice this means that a multidimensional matrix is gradually reduced to a vector and then to a scalar value.
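A subsampling (max pooling) step can be sketched in the same spirit: each 2x2 block of the feature map is replaced by its maximum, halving the spatial resolution. The input values are again illustrative.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Reduce spatial dimensions by keeping the maximum of each size x size block."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size                  # drop edge rows/columns that do not fit
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fmap = np.random.rand(6, 6)
print(max_pool(fmap).shape)                            # (3, 3): half the resolution in each axis
```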

Implementing neural network training

Existing networks are divided into 3 classes of architectures in terms of learning:

● supervised learning (perceptron);

● unsupervised learning (adaptive resonance networks);

● hybrid learning (radial basis function networks).

One of the most important criteria for evaluating the performance of a neural network in image recognition is the quality of recognition. It should be noted that, for a quantitative assessment of recognition quality, the mean squared error is most often used:

E = (1/P) * sum_p E_p,  where E_p = (D_p - O(I_p, W))^2    (1)

In this expression, E_p is the recognition error for the p-th training pair, D_p is the expected output of the neural network (ideally the network would achieve 100% recognition, but in practice this does not happen), O(I_p, W) is the actual output of the network, which depends on the p-th input I_p and on the set of weight coefficients W, including both the convolution kernels and the weights of all layers, and P is the number of training pairs. The error is computed as the arithmetic mean over all pairs.
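Under this notation, formula (1) reduces to a few lines of code; the expected and actual outputs below are made-up numbers used only to show the computation.

```python
import numpy as np

D = np.array([1.0, 0.0, 1.0, 1.0])   # expected outputs D_p (made-up values)
O = np.array([0.9, 0.2, 0.7, 1.0])   # network outputs O(I_p, W) (made-up values)

E_p = (D - O) ** 2                   # per-pair errors
E = E_p.mean()                       # arithmetic mean over all pairs, formula (1)
print(E)                             # 0.035
```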

Analysis shows that the weight value at which the error is minimal can be calculated from relation (2):

w_new = w - (dE/dw) / (d^2E/dw^2)    (2)

From this relation it follows that the optimal weight is obtained by subtracting from the current weight the first derivative of the error function with respect to the weight divided by its second derivative.

The given relations make it possible to directly calculate the error in the output layer. The error in the hidden layers of neurons can be calculated using the error backpropagation method. The main idea of the method is to propagate information about the error from the output neurons to the input neurons, that is, in the direction opposite to the propagation of signals through the neural network.
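As an illustration of the idea only (not the implementation described in the article), a tiny two-layer network trained by backpropagation on made-up data can be written in a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 4))                     # 20 made-up input vectors
T = rng.random((20, 1))                     # their "teacher" answers

W1 = rng.standard_normal((4, 8)) * 0.1      # input -> hidden weights
W2 = rng.standard_normal((8, 1)) * 0.1      # hidden -> output weights
lr = 0.1

for epoch in range(1000):
    # forward pass
    h = np.tanh(X @ W1)                     # hidden layer
    y = h @ W2                              # output layer
    err = y - T                             # output-layer error

    # backward pass: propagate the error from the output toward the input
    grad_W2 = h.T @ err / len(X)
    grad_h = err @ W2.T * (1 - h ** 2)      # error signal for the hidden layer
    grad_W1 = X.T @ grad_h / len(X)

    W1 -= lr * grad_W1                      # gradient descent step
    W2 -= lr * grad_W2

print(float((err ** 2).mean()))             # mean squared error after training
```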

It is also worth noting that the network is trained on specially prepared databases of images classified into a large number of classes, and training takes quite a long time.
Today the largest such database is ImageNet (www.image-net.org). It is freely accessible to academic institutions.

Conclusion

Summing up, it should be noted that neural networks, and algorithms built on the principle of their functioning, can be used in fingerprint-card recognition systems for internal affairs bodies. Often the software component of a hardware-software complex aimed at recognizing such a unique and complex image as a fingerprint pattern, which serves as identification data, does not fully solve the tasks assigned to it. A program based on neural network algorithms will be much more efficient.

To summarize:

● neural networks can be applied to the recognition of both images and texts;

● this theory makes it possible to speak of the creation of a new and promising class of models, namely models based on intelligent modeling;

● neural networks are capable of learning, which makes it possible to optimize their operation; this is an extremely important property for the practical implementation of the algorithm;

● the quality of a pattern recognition algorithm based on a neural network can be evaluated quantitatively, and accordingly there are mechanisms for tuning its parameters to the required values by calculating the appropriate weight coefficients.

Today, further study of neural networks appears to be a promising area of research that will be successfully applied in ever more branches of science and technology, as well as in other areas of human activity. The main emphasis in the development of modern recognition systems is now shifting toward semantic segmentation of 3D images in geodesy, medicine, prototyping and other fields. These are rather complex algorithms, and this is due to:

● lack of a sufficient number of databases of reference images;

● lack of a sufficient number of free experts for the initial training of the system;

● the fact that such images are not stored as simple pixel arrays, which requires additional resources from both the computer and the developers.

It should also be noted that today there are a large number of standard architectures for constructing neural networks, which greatly simplifies the task of building a neural network from scratch and reduces it to selecting a network structure suitable for a specific task.

Currently, there are quite a large number of innovative companies on the market engaged in image recognition using neural network learning technologies. They are known to have achieved image recognition accuracy of around 95% using a database of 10,000 images. Nevertheless, all these achievements concern static images; with video sequences everything is currently much more complicated.

Bibliographic reference

Markova S.V., Zhigalov K.Yu. Application of the neural network for creation of the image recognition system // Fundamental Research. 2017. No. 8-1. pp. 60-64.
URL: http://fundamental-research.ru/ru/article/view?id=41621 (date of access: 03.24)

Friends, we continue the story about neural networks that we started last time.

What is a neural network

In the simplest case, a neural network is a mathematical model consisting of several layers of elements that perform parallel computations. Initially, such an architecture was created by analogy with the smallest computing elements of the human brain, neurons; the smallest computational elements of an artificial neural network are also called neurons. Neural networks usually consist of three or more layers: an input layer, one or more hidden layers, and an output layer (Fig. 1). In some cases the input and output layers are not counted, and the number of layers in the network is then given by the number of hidden layers. This type of neural network is called a perceptron.

Fig. 1. The simplest perceptron

An important feature of a neural network is its ability to learn from examples; this is called supervised learning. The network is trained on a large number of examples consisting of input-output pairs (inputs and the outputs corresponding to them). In object recognition problems, such a pair is an input image and the corresponding label, the name of the object. Training a neural network is an iterative process that reduces the deviation of the network output from the given "teacher's answer", the label corresponding to a given image (Fig. 2). This process consists of steps called training epochs (usually numbering in the thousands), at each of which the "weights" of the neural network, the parameters of its hidden layers, are adjusted. Upon completion of training, the quality of the network is usually good enough to perform the task it was trained for, although an optimal set of parameters that recognizes all images perfectly is often impossible to find.


Fig. 2. Training the neural network
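A schematic supervised training loop in PyTorch illustrating this process is given below; the model architecture, the random "images" and the "teacher" labels are placeholders, not a real dataset.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
criterion = nn.CrossEntropyLoss()                       # deviation from the teacher's answer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

images = torch.randn(64, 1, 28, 28)                     # placeholder batch of images
labels = torch.randint(0, 10, (64,))                    # placeholder "teacher" labels

for epoch in range(100):                                # training epochs
    optimizer.zero_grad()
    loss = criterion(model(images), labels)             # how far the output is from the label
    loss.backward()                                     # backpropagate the error
    optimizer.step()                                    # adjust the network's weights
```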

What are deep neural networks

Deep neural networks are neural networks consisting of several hidden layers (Fig. 3). The figure is a depiction of a deep neural network that gives the reader a general idea of what a neural network looks like; the real architecture of deep neural networks is, however, much more complex.


Fig. 3. A neural network with many hidden layers

The creators of convolutional neural networks were, of course, initially inspired by the biological structures of the visual system. The first computational models based on the concept of the hierarchical organization of the primate visual stream are known as Fukushima's Neocognitron (Fig. 4). The modern understanding of the physiology of the visual system is similar to the type of information processing in convolutional networks, at least for fast object recognition.


Fig. 4. Diagram showing connections between layers in the Neocognitron model

Later, this concept was implemented by Yann LeCun in the convolutional neural network he created for recognizing handwritten characters. This network consisted of two types of layers: convolutional layers and subsampling (pooling) layers. Each layer has a topographic structure, that is, each neuron is associated with a fixed point of the original image and with a receptive field, the area of the input image processed by that neuron. At each location in each layer there are a number of different neurons, each with its own set of input weights, connected to neurons in a rectangular fragment of the previous layer. Different input rectangular fragments with the same set of weights are associated with neurons from different locations.

The general architecture of a deep neural network for pattern recognition is shown in Fig. 5. The input image is represented as a set of pixels or small patches of the image (for example, 5-by-5 pixels).


Fig. 5. Convolutional neural network diagram

As a rule, deep neural networks are depicted in a simplified form: as processing stages, which are sometimes called filters. Each stage differs from the others in a number of characteristics, such as the size of the receptive field, the type of features that the network learns to recognize in a given layer, and the type of computation performed at each stage.

The fields of application of deep neural networks, including convolutional networks, are not limited to face recognition. They are widely used for recognizing speech and audio signals, processing readings from various types of sensors, and segmenting complex multilayer images (such as satellite maps) or medical images (X-ray and fMRI images).

Neural networks in biometrics and face recognition

To achieve high recognition accuracy, the neural network is pre-trained on a large array of images, for example, on a database such as MegaFace. This is the main training method for face recognition.


Fig. 6. The MegaFace database contains 1 million images of more than 690 thousand people

After the network has been trained to recognize faces, the face recognition process can be described as follows (Fig. 7). First, the image is processed by a face detector: an algorithm that finds a rectangular portion of the image containing a face. This fragment is normalized so that it is easier for the neural network to process: the best result is achieved if all input images are of the same size, color, and so on. The normalized image is fed to the input of the neural network for processing. This algorithm is usually a company's own development intended to improve recognition quality, although "standard" solutions to this problem also exist. The neural network builds a unique feature vector, which is then transferred to the database. The search engine compares it with all the feature vectors stored in the database and returns the search result as a certain number of names or user profiles with similar facial features, each of which is assigned a number. This number represents the degree of similarity between our feature vector and the one found in the database.


Fig. 7. Face recognition process
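The final comparison step can be sketched as a similarity search over stored feature vectors; the 128-dimensional vectors below are random stand-ins for real embeddings produced by a trained network.

```python
import numpy as np

def similarity(a, b):
    """Cosine similarity: higher means the two feature vectors are more alike."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

database = {                                   # name -> stored feature vector (stand-ins)
    "alice": np.random.rand(128),
    "bob":   np.random.rand(128),
}
query = np.random.rand(128)                    # vector produced by the network for a new face

scores = {name: similarity(query, vec) for name, vec in database.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(score, 3))               # candidates ranked by similarity score
```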

Determining the quality of the algorithm

Accuracy

When we choose which algorithm to apply to an object or face recognition problem, we must have a means of comparing the effectiveness of different algorithms. In this part we describe the tools used to do this.

The quality of a face recognition system is assessed using a set of metrics that correspond to typical scenarios of using the system for biometric authentication.

As a rule, the performance of any neural network can be measured in terms of accuracy: after the parameters are set and the training process is complete, the network is tested on a test set for which we have the teacher's answers but which is separate from the training set. Typically, this is a quantitative measure: a number (often a percentage) that indicates how well the system recognizes new objects. Another common measure is the error rate (expressed as a percentage or as a number). However, there are more precise measures for biometrics.
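For example, accuracy and error rate on a held-out test set reduce to counting correct predictions; the predicted and expected labels below are made up for illustration.

```python
import numpy as np

predicted = np.array([3, 1, 4, 1, 5, 9, 2, 6])    # model outputs on the test set (made up)
expected  = np.array([3, 1, 4, 0, 5, 9, 2, 7])    # teacher's answers

accuracy = float((predicted == expected).mean())  # share of correctly recognized objects
error_rate = 1.0 - accuracy
print(f"accuracy = {accuracy:.0%}, error = {error_rate:.0%}")  # 75%, 25%
```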

In biometrics in general, and in biometrics for face recognition in particular, there are two types of applications: verification and identification. Verification is the process of confirming a claimed identity by comparing an image of an individual (a vector of facial features, or another feature vector, for example of a retina or fingerprints) with one or more previously saved templates. Identification is the process of determining the identity of an individual: biometric samples are collected and compared with all templates in the database. Identification over a closed set assumes that the person is known to be present in the database. Thus, recognition covers one or both of these terms: verification and identification.

Often, in addition to the direct result of the comparison, it is required to assess the level of "confidence" of the system in its decision. This value is called the similarity score. A higher similarity score indicates that the two compared biometric samples are more similar.

There are a number of methods for assessing the quality of a system (both for verification and for identification). We will talk about them next time. Stay with us, and do not hesitate to leave comments and ask questions.

NOTES

  1. Fukushima K. (1980) "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics.
  2. LeCun Y., Boser B., Denker J.S., Henderson D., Howard R.E., Hubbard W. and Jackel L.D. (1989) "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Computation, vol. 1, pp. 541-551.
  3. Jiaxuan You, Xiaocheng Li, Melvin Low, David Lobell, Stefano Ermon. Deep Gaussian Process for Crop Yield Prediction Based on Remote Sensing Data.
  4. Ian Goodfellow, Yoshua Bengio, Aaron Courville (2016) Deep Learning. MIT Press.
  5. Poh, Chan C-H., Kittler J., Fierrez J. and Galbally J. (2012) Description of Metrics For the Evaluation of Biometric Performance.