Quality metrics in machine learning: ranking metrics

While preparing a problem for the entrance test of the GoTo summer school, we found that there is practically no good description of the main ranking metrics in Russian (the problem concerned a special case of the ranking task: building a recommendation algorithm). At E-Contenta we actively use various ranking metrics, so we decided to clear up this misunderstanding by writing this article.

The task of ranking now arises everywhere: sorting web pages according to a given search query, personalizing the news feed, recommending videos, products, music ... In a word, the topic is hot. There is even a special direction in machine learning, which deals with the study of ranking algorithms capable of self-learning - learning to rank. To choose the best from the whole variety of algorithms and approaches, it is necessary to be able to assess their quality quantitatively. The most common ranking quality metrics will be discussed below.

Briefly about the task of ranking

Ranking is the task of sorting a set of elements by their relevance. Most often, relevance is understood with respect to some object. For example, in an information retrieval task the object is a query, the elements are various documents (links to them), and relevance is how well a document matches the query; in a recommendation task the object is a user, the elements are pieces of recommended content (goods, videos, music), and relevance is the probability that the user will use (buy / like / view) this content.

Formally, consider $N$ objects and $M$ elements. The result of running a ranking algorithm for a given object is a mapping that assigns to each element $e$ a weight $\hat{r}(e)$ characterizing the degree of relevance of the element to the object (the higher the weight, the more relevant the element). The set of weights defines a permutation $\hat{\pi}$ of the set of elements (we assume the set of elements is ordered), obtained by sorting the elements in descending order of the weight $\hat{r}$.

In order to assess the quality of a ranking, we need a "gold standard" with which the algorithm's results can be compared. Consider $r(e)$, the reference relevance function, which characterizes the "real" relevance of elements for a given object ($r(e) = 1$ means the element is perfectly relevant, $r(e) = 0$ completely irrelevant), as well as the corresponding permutation $\pi$ (sorting elements in descending order of $r(e)$).

There are two main ways to obtain $r(e)$:
1. Based on historical data. For example, in the case of content recommendations, you can take the user's views (likes, purchases) and assign relevance 1 to the viewed elements ($r(e) = 1$) and 0 to all others.
2. Based on expert judgment. For example, in a search task, for each query you can involve a team of assessors who will manually assess the relevance of documents to the query.

It is worth noting that when $r(e)$ takes only the extreme values 0 and 1, the permutation $\pi$ is usually not considered, and only the set of relevant elements (those with $r(e) = 1$) is taken into account.

The purpose of a ranking quality metric is to determine how well the relevance scores $\hat{r}$ produced by the algorithm and the corresponding permutation $\hat{\pi}$ agree with the true relevance values. Let's consider the main metrics.

Mean average precision

Mean average precision at K (MAP@K) is one of the most commonly used ranking quality metrics. To understand how it works, let's start with the basics.

Note: "* precision" metrics are used in binary problems, where it takes only two values: 0 and 1.

Precision at K

Precision at K (P@K), the precision on the top K elements, is the basic ranking quality metric for a single object. Suppose our ranking algorithm has produced a relevance score $\hat{r}(e)$ for every element. By selecting the first $K$ elements with the largest $\hat{r}(e)$, we can compute the proportion of relevant ones among them. That is exactly what precision at K does:

$$P@K = \frac{1}{K} \sum_{k=1}^{K} r(e_{(k)})$$

Note: by $e_{(k)}$ we mean the element that ended up in the $k$-th position after the permutation $\hat{\pi}$. So $e_{(1)}$ is the element with the largest $\hat{r}$, $e_{(2)}$ the element with the second-largest $\hat{r}$, and so on.

Average precision at K

Precision at K is a metric that is easy to understand and implement, but it has an important drawback: it does not take into account the order of items within the top. So, if out of ten elements we guessed only one, it does not matter whether it stands in the first position or in the last: in either case $P@10 = 0.1$. Yet the first option is clearly better.

This drawback is addressed by the ranking metric average precision at K (AP@K), which equals the sum of $P@k$ over those indices $k$ from 1 to $K$ at which the item is relevant, divided by $K$:

$$AP@K = \frac{1}{K} \sum_{k=1}^{K} r(e_{(k)}) \cdot P@k$$

So, if out of three elements only the one in the last position is relevant, then $AP@3 = \frac{1}{3} \cdot \frac{1}{3} = \frac{1}{9}$; if only the one in the first position was guessed, then $AP@3 = \frac{1}{3}$; and if all of them were guessed, then $AP@3 = 1$.

Now MAP@K is within our reach as well.

Mean average precision at K

Mean average precision at K (MAP@K) is one of the most commonly used ranking quality metrics. In P@K and AP@K, ranking quality is assessed for a single object (user, search query). In practice there are many objects: we deal with hundreds of thousands of users, millions of search queries, and so on. The idea of MAP@K is to compute AP@K for each object and average:

$$MAP@K = \frac{1}{N} \sum_{j=1}^{N} AP@K_j$$

Note: this idea is quite logical under the assumption that all users are equally needed and equally important. If this is not the case, then instead of simple averaging you can use a weighted one, multiplying the AP@K of each object by a weight corresponding to its "importance".
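For concreteness, here is a minimal Python sketch of these three metrics, assuming binary relevance labels given in the order produced by the algorithm (the function names are ours, not taken from any particular library):

import numpy as np

def precision_at_k(relevance, k):
    """P@K: share of relevant items among the top-K of the ranked list."""
    relevance = np.asarray(relevance)[:k]
    return relevance.mean()

def average_precision_at_k(relevance, k):
    """AP@K: sum of P@k over the positions of relevant items, divided by K."""
    relevance = np.asarray(relevance)[:k]
    precisions = [precision_at_k(relevance, i + 1) for i in range(len(relevance)) if relevance[i]]
    return sum(precisions) / k

def mean_average_precision_at_k(relevance_lists, k):
    """MAP@K: AP@K averaged over all objects (users, queries)."""
    return np.mean([average_precision_at_k(r, k) for r in relevance_lists])

# The examples from the text: only the last of three items relevant -> 1/9,
# only the first -> 1/3, all three -> 1.
print(average_precision_at_k([0, 0, 1], 3))  # 0.111...
print(average_precision_at_k([1, 0, 0], 3))  # 0.333...
print(average_precision_at_k([1, 1, 1], 3))  # 1.0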

Normalized Discounted Cumulative Gain

Normalized discounted cumulative gain (nDCG) is another common ranking quality metric. As with MAP@K, let's start with the basics.

Cumulative Gain at K

Consider again one object and the $K$ elements with the largest $\hat{r}$. Cumulative gain at K (CG@K) is a basic ranking metric built on a simple idea: the more relevant the items in this top, the better:

$$CG@K = \sum_{k=1}^{K} r(e_{(k)})$$

This metric has obvious drawbacks: it is not normalized and does not take into account the position of the relevant elements.

Note that, in contrast to P@K, CG@K can also be used when the reference relevance values are non-binary.

Discounted Cumulative Gain at K

Discounted cumulative gain at K (DCG@K) is a modification of cumulative gain at K that takes into account the order of elements in the list by multiplying the relevance of an element by a weight equal to the inverse logarithm of its position number:

$$DCG@K = \sum_{k=1}^{K} \frac{2^{r(e_{(k)})} - 1}{\log_2(k + 1)}$$

Note: if $r(e)$ takes only the values 0 and 1, then $2^{r(e)} - 1 = r(e)$, and the formula takes a simpler form:

$$DCG@K = \sum_{k=1}^{K} \frac{r(e_{(k)})}{\log_2(k + 1)}$$

The use of the logarithm as a discounting function can be explained by the following intuitive reasoning: from the point of view of ranking, positions at the beginning of the list differ much more than positions at the end. In the case of a search engine, there is a whole gulf between positions 1 and 11 (only in a few cases out of a hundred does the user go beyond the first page of search results), while there is not much difference between positions 101 and 111: few people ever reach them. These subjective considerations are nicely captured by the logarithm: $\log 11 - \log 1 \approx 2.4$, while $\log 111 - \log 101 \approx 0.1$.

Discounted cumulative gain solves the problem of accounting for the positions of relevant elements, but it only aggravates the lack of normalization: while $r(e)$ varies within $[0, 1]$, $DCG@K$ takes values on a segment that is not immediately obvious. The following metric is intended to solve this problem.

Normalized Discounted Cumulative Gain at K

As you might guess from the name, normalized discounted cumulative gain at K (nDCG@K) is nothing more than a normalized version of DCG@K:

$$nDCG@K = \frac{DCG@K}{IDCG@K}$$

where $IDCG@K$ is the maximum possible (I for ideal) value of $DCG@K$. Since we agreed that $r(e)$ takes values in $[0, 1]$, $nDCG@K$ takes values in $[0, 1]$ as well.

Thus, $nDCG@K$ inherits from $DCG@K$ the accounting for element positions and, at the same time, takes values in the range from 0 to 1.

Note: by analogy with MAP@K, one can compute nDCG@K averaged over all objects.
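A minimal sketch of DCG@K and nDCG@K under the definitions above (graded relevances are assumed to be given in the order produced by the algorithm; the helper names are ours):

import numpy as np

def dcg_at_k(relevance, k):
    """DCG@K with the (2^r - 1) gain and logarithmic positional discount."""
    r = np.asarray(relevance, dtype=float)[:k]
    discounts = np.log2(np.arange(2, r.size + 2))  # log2(k + 1) for k = 1..K
    return np.sum((2 ** r - 1) / discounts)

def ndcg_at_k(relevance, k):
    """nDCG@K: DCG@K normalized by the ideal DCG@K (IDCG@K)."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

# A list ranked by the algorithm, with true relevances in [0, 1]:
print(ndcg_at_k([1.0, 0.0, 0.5], k=3))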

Mean reciprocal rank

Mean reciprocal rank (MRR) is another commonly used ranking quality metric. It is given by the following formula:

$$MRR = \frac{1}{N} \sum_{j=1}^{N} \frac{1}{rank_j}$$

where $1 / rank_j$ is the reciprocal rank for the $j$-th object, a very simple quantity equal to the inverse of the rank of the first correctly guessed element.

Mean reciprocal rank varies in the range $[0, 1]$ and takes the position of elements into account. Unfortunately, it does so only for one element, the first correctly predicted one, paying no attention to all subsequent ones.
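A minimal sketch of MRR under the same assumptions (binary relevance lists, one per object; the function names are ours):

def reciprocal_rank(relevance):
    """1 / position of the first relevant item (0 if there is none)."""
    for position, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(relevance_lists):
    """MRR: reciprocal rank averaged over all objects."""
    return sum(reciprocal_rank(r) for r in relevance_lists) / len(relevance_lists)

# Two users: the first relevant item is at position 2 and at position 1.
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1) / 2 = 0.75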

Rank correlation metrics

Separately, it is worth highlighting the ranking quality metrics based on one of the rank correlation coefficients. In statistics, a rank correlation coefficient is a correlation coefficient that takes into account not the values themselves but only their rank (order). Let's consider the two most common rank correlation coefficients: Spearman's and Kendall's.

Kendall's rank correlation coefficient

The first of them, Kendall's correlation coefficient, is based on counting concordant (and discordant) pairs in the two permutations, i.e. pairs of elements to which the permutations assigned the same (or different) relative order:

$$\tau = \frac{P - Q}{\frac{1}{2} n (n - 1)}$$

where $P$ is the number of concordant pairs and $Q$ the number of discordant ones.

Spearman's rank correlation coefficient

The second, Spearman's rank correlation coefficient, is in fact nothing more than the Pearson correlation computed on the rank values. There is a fairly convenient formula that expresses it directly through the ranks:

$$\rho = 1 - \frac{6 \sum_{i} d_i^2}{n (n^2 - 1)}$$

where $d_i$ is the difference between the ranks assigned to the $i$-th element by the two permutations; the result coincides with the Pearson correlation coefficient computed on the ranks.
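In practice these are rarely implemented by hand: scipy.stats provides kendalltau and spearmanr. A small sketch comparing the reference relevances with the algorithm's scores (the example values are made up):

from scipy.stats import kendalltau, spearmanr

# Reference relevances and the scores produced by the algorithm for the same five items.
true_relevance = [3, 2, 1, 0, 0]
predicted_score = [2.5, 0.5, 2.0, 0.1, 0.3]

tau, _ = kendalltau(true_relevance, predicted_score)
rho, _ = spearmanr(true_relevance, predicted_score)
print(f"Kendall tau = {tau:.3f}, Spearman rho = {rho:.3f}")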

Rank correlation metrics have a drawback we have already seen: they do not take the positions of elements into account (even worse than P@K, since the correlation is computed over all elements rather than over the K elements with the highest rank). Therefore, in practice they are used extremely rarely.

Cascading Metrics

Up to this point we have not looked into how the user (from now on we consider the special case where the object is a user) examines the elements offered to them. In fact, we implicitly assumed that viewing each element is independent of viewing the other elements, a kind of "naivety". In practice, however, items are often viewed by the user one at a time, and whether the user views the next item depends on their satisfaction with the previous ones. Consider an example: in response to a search query, the ranking algorithm offered the user several documents. If the documents in positions 1 and 2 turned out to be extremely relevant, then the probability that the user will view the document in position 3 is small, because they will be quite satisfied with the first two.

Models of user behavior of this kind, in which the user examines the proposed elements sequentially and the probability of viewing an element depends on the relevance of the previous ones, are called cascade models.

Expected reciprocal rank

Expected reciprocal rank (ERR) is an example of a ranking quality metric based on a cascade model. It is given by the following formula:

$$ERR = \sum_{k=1}^{K} \frac{1}{k} \, P(\text{the user stops at position } k)$$

where rank is understood in descending order of the relevance scores. The most interesting part of this metric are the probabilities. When computing them, the assumptions of the cascade model are used:

$$P(\text{the user stops at position } k) = \prod_{i=1}^{k-1} \bigl(1 - p_{(i)}\bigr) \, p_{(k)}$$

where $p_{(i)}$ is the probability that the user will be satisfied with the element at rank $i$. These probabilities are computed from the values of $r(e)$. Since in our case $r(e) \in [0, 1]$, we can consider a simple option:

$$p_{(k)} = r(e_{(k)})$$

which can be read as: the probability of being satisfied with the element at position $k$ is simply its true relevance.
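A minimal sketch of ERR under this simplifying assumption (satisfaction probability taken equal to the true relevance, relevances given in ranked order; the function name is ours):

def expected_reciprocal_rank(relevance):
    """ERR under the cascade model, with p_k taken equal to the true relevance r in [0, 1]."""
    err = 0.0
    prob_not_satisfied_yet = 1.0
    for position, r in enumerate(relevance, start=1):
        prob_stop_here = prob_not_satisfied_yet * r
        err += prob_stop_here / position
        prob_not_satisfied_yet *= (1.0 - r)
    return err

# A highly relevant first item makes later positions contribute little.
print(expected_reciprocal_rank([0.9, 0.9, 0.9]))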

In conclusion, here are some useful links.

In learning to rank, the training set consists of lists of elements, with a partial order specified on the elements inside each list. The partial order is usually given by assigning a score to each element (for example, "relevant" or "not relevant"; more than two grades are possible). The purpose of the ranking model is to approximate and generalize, in the best way (in some sense), the ranking method of the training set to new data.

Learning to rank is still a fairly young, rapidly developing area of research that emerged in the 2000s, when the information retrieval community became interested in applying machine learning methods to ranking problems.


During the training of a ranking model and during its operation, each document-query pair is translated into a numerical vector of ranking features (also called ranking factors or signals) characterizing the properties of the document, of the query and of their relationship. Such features can be divided into three groups:

The following are some examples of ranking features used in the LETOR dataset, which is widely known in this field:

• Values of the TF, TF-IDF, BM25 and language-model measures of match between the query and various document zones (title, URL, body text, anchor text);
• Lengths and IDF sums of document zones;
• Document ranks obtained by various variants of link-analysis ranking algorithms such as PageRank and HITS.

    Ranking quality metrics

There are several metrics by which the performance of ranking algorithms on a collection labeled by assessors is evaluated and compared. Often the parameters of the ranking model are tuned so as to maximize the value of one of these metrics.

    Examples of metrics:

    Algorithm classification

    In his article "Learning to Rank for Information Retrieval" and speeches at thematic conferences, Tai-Yang Liu from Microsoft Research Asia analyzed the existing methods for solving the problem of teaching ranking and proposed their classification into three approaches, depending on the input representation used. data and penalty functions:

    Pointwise approach

Notes

1. Tie-Yan Liu (2009), Learning to Rank for Information Retrieval, Foundations and Trends in Information Retrieval: Vol. 3: No. 3, pp. 225-331, ISBN 978-1-60198-244-5, DOI 10.1561/1500000016. Slides from T.-Y. Liu's talk at WWW 2009 are available.

    Hello, Habr!

    In machine learning tasks, metrics are used to assess the quality of models and compare various algorithms, and their selection and analysis is an indispensable part of a datasatanist's job.

    In this article, we will look at some quality criteria in classification problems, discuss what is important when choosing a metric and what can go wrong.

    Metrics in classification problems

To demonstrate some useful sklearn functions and give a visual representation of the metrics, we will use our dataset on telecom operator customer churn, which we met in the first article of the course.

    Let's load the necessary libraries and look at the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import itertools
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (precision_recall_curve, classification_report,
                             confusion_matrix, roc_curve)

df = pd.read_csv("../../data/telecom_churn.csv")

df.head(5)

    Data preprocessing

# Map the binary columns
# and encode State with dummy coding (for simplicity; better not to do this for tree-based models)
d = {"Yes": 1, "No": 0}
df["International plan"] = df["International plan"].map(d)
df["Voice mail plan"] = df["Voice mail plan"].map(d)
df["Churn"] = df["Churn"].astype("int64")

le = LabelEncoder()
df["State"] = le.fit_transform(df["State"])

ohe = OneHotEncoder(sparse=False)
encoded_state = ohe.fit_transform(df["State"].values.reshape(-1, 1))
tmp = pd.DataFrame(encoded_state, columns=["state " + str(i) for i in range(encoded_state.shape[1])])
df = pd.concat([df.drop(["State"], axis=1), tmp], axis=1)

    Accuracy, precision and recall

Before moving on to the metrics themselves, we need to introduce an important concept for describing these metrics in terms of classification errors: the confusion matrix (error matrix).
Suppose we have two classes and an algorithm that predicts, for each object, which class it belongs to; then the classification error matrix looks like this:

                      y = 1 (true)            y = 0 (true)
ŷ = 1 (predicted)     True Positive (TP)      False Positive (FP)
ŷ = 0 (predicted)     False Negative (FN)     True Negative (TN)

Here ŷ is the algorithm's prediction for an object, and y is the true class label of that object.
Thus, there are two types of classification errors: False Negative (FN) and False Positive (FP).

    Algorithm training and construction of the error matrix

    X = df.drop ("Churn", axis = 1) y = df ["Churn"] # Divide the sample into train and test, all metrics will be evaluated on the test dataset X_train, X_test, y_train, y_test = train_test_split (X, y , stratify = y, test_size = 0.33, random_state = 42) # Train the native logistic regression lr = LogisticRegression (random_state = 42) lr.fit (X_train, y_train) # Use the function to build the error matrix from the sklearn documentation def plot_confusion_matrix (cm, classes , normalize = False, title = "(! LANG: Confusion matrix", cmap=plt.cm.Blues): """ This function prints and plots the confusion matrix. Normalization can be applied by setting `normalize=True`. """ plt.imshow(cm, interpolation="nearest", cmap=cmap) plt.title(title) plt.colorbar() tick_marks = np.arange(len(classes)) plt.xticks(tick_marks, classes, rotation=45) plt.yticks(tick_marks, classes) if normalize: cm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis] print("Normalized confusion matrix") else: print("Confusion matrix, without normalization") print(cm) thresh = cm.max() / 2. for i, j in itertools.product(range(cm.shape), range(cm.shape)): plt.text(j, i, cm, horizontalalignment="center", color="white" if cm > thresh else "black") plt.tight_layout() plt.ylabel("True label") plt.xlabel("Predicted label") font = {"size" : 15} plt.rc("font", **font) cnf_matrix = confusion_matrix(y_test, lr.predict(X_test)) plt.figure(figsize=(10, 8)) plot_confusion_matrix(cnf_matrix, classes=["Non-churned", "Churned"], title="Confusion matrix") plt.savefig("conf_matrix.png") plt.show()!}

    Accuracy

An intuitive, obvious and yet almost unused metric is accuracy, the share of correct answers of the algorithm:

$$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

This metric is useless in problems with unequal classes, which is easy to show with an example.

Let's say we want to evaluate the performance of a spam filter. We have 100 non-spam emails, 90 of which our classifier identified correctly (True Negative = 90, False Positive = 10), and 10 spam emails, 5 of which the classifier also identified correctly (True Positive = 5, False Negative = 5).
Then accuracy:

$$accuracy = \frac{5 + 90}{5 + 90 + 10 + 5} \approx 86.4\%$$

However, if we simply predict all emails as non-spam, we get a higher accuracy:

$$accuracy = \frac{0 + 100}{0 + 100 + 0 + 10} \approx 90.9\%$$

At the same time, our model has absolutely no predictive power, since we originally wanted to identify the spam emails. Moving from a metric common to all classes to separate per-class quality indicators will help us overcome this.

    Precision, recall and F-measure

To assess the algorithm's performance on each class separately, we introduce the precision and recall metrics:

$$precision = \frac{TP}{TP + FP}, \qquad recall = \frac{TP}{TP + FN}$$

    Precision can be interpreted as the proportion of objects called positive by the classifier and at the same time really positive, and recall shows what proportion of objects of a positive class from all objects of a positive class the algorithm found.

It is the introduction of precision that prevents us from assigning all objects to one class, since doing so increases the number of False Positives. Recall demonstrates the algorithm's ability to find a given class at all, while precision demonstrates its ability to distinguish this class from other classes.

As noted earlier, there are two types of classification errors: False Positive and False Negative. In statistics, the first is called a Type I error and the second a Type II error. In our subscriber churn problem, a Type I error is mistaking a loyal subscriber for a churning one (our null hypothesis is that none of the subscribers is churning, and we wrongly reject this hypothesis). Accordingly, a Type II error is "missing" a churning subscriber and erroneously accepting the null hypothesis.

Precision and recall, unlike accuracy, do not depend on the class ratio and are therefore applicable under imbalanced samples.
In real practice, the task is often to find the optimal (for the customer) balance between these two metrics. A classic example is the customer churn problem.
Obviously, we cannot find all churning customers and only them. But having settled on a retention strategy and resource, we can choose the precision and recall thresholds we need. For example, we can focus on retaining only high-margin customers, or those who are more likely to leave, since we are limited by the call-center resource (a sketch of picking such a threshold is given right below).
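As an illustration, here is one possible way to pick a decision threshold for a target precision using precision_recall_curve; the target value of 0.7 is arbitrary and only for illustration, and the lr model, X_test and y_test come from the code above:

import numpy as np
from sklearn.metrics import precision_recall_curve

probas = lr.predict_proba(X_test)[:, 1]
# precision_recall_curve returns precision/recall pairs for the candidate thresholds
precision, recall, thresholds = precision_recall_curve(y_test, probas)

target_precision = 0.7
ok = np.where(precision[:-1] >= target_precision)[0]  # the last point has no threshold
if ok.size:
    idx = ok[0]
    print(f"threshold={thresholds[idx]:.2f}, precision={precision[idx]:.2f}, recall={recall[idx]:.2f}")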

    Usually, when optimizing the hyperparameters of an algorithm (for example, in the case of iterating over a grid GridSearchCV), one metric is used, the improvement of which we expect to see on the test sample.
There are several different ways to combine precision and recall into an aggregate quality criterion. The F-measure (in general $F_\beta$) is the harmonic mean of precision and recall:

$$F_\beta = (1 + \beta^2) \cdot \frac{precision \cdot recall}{\beta^2 \cdot precision + recall}$$

Here $\beta$ determines the weight of precision in the metric, and for $\beta = 1$ this is the harmonic mean (with a factor of 2, so that precision = 1 and recall = 1 give $F_1 = 1$).
The F-measure reaches its maximum when recall and precision are both equal to one, and it is close to zero if one of the arguments is close to zero.
sklearn has a handy function, metrics.classification_report, that returns the recall, precision and F-measure for each class, as well as the number of instances of each class.

report = classification_report(y_test, lr.predict(X_test), target_names=["Non-churned", "Churned"])
print(report)

    class precision recall f1-score support
    Non-churned 0.88 0.97 0.93 941
    Churned 0.60 0.25 0.35 159
    avg / total 0.84 0.87 0.84 1100

It should be noted that in the case of imbalanced classes, which prevail in real practice, it is often necessary to resort to techniques of artificial dataset modification to equalize the class ratio. There are many of them, and we will not touch on them here; you can look at some methods and choose the one that suits your task.

    AUC-ROC and AUC-PR

When converting the algorithm's real-valued output (as a rule, a probability of class membership; see SVM separately) into a binary label, we must choose some threshold at which 0 becomes 1. A threshold of 0.5 seems natural, but it does not always turn out to be optimal, for example in the aforementioned case of class imbalance.

One way to evaluate the model as a whole, without tying it to a specific threshold, is AUC-ROC (or ROC AUC): the Area Under the Receiver Operating Characteristic Curve. This curve is a line from (0,0) to (1,1) in the coordinates of True Positive Rate (TPR) and False Positive Rate (FPR):

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$

    We already know TPR, this is completeness, and FPR shows what proportion of objects of the negative class the algorithm predicted incorrectly. Ideally, when the classifier makes no mistakes (FPR = 0, TPR = 1), we will get the area under the curve equal to one, otherwise, when the classifier randomly outputs the class probabilities, AUC-ROC will tend to 0.5, since the classifier will issue the same amount of TP and FP.
    Each point on the graph corresponds to the choice of a certain threshold. The area under the curve in this case shows the quality of the algorithm (more is better), in addition, the steepness of the curve itself is important - we want to maximize TPR by minimizing FPR, which means that our curve should ideally tend to the point (0,1).

    ROC curve drawing code

sns.set(font_scale=1.5)
sns.set_color_codes("muted")

plt.figure(figsize=(10, 8))
fpr, tpr, thresholds = roc_curve(y_test, lr.predict_proba(X_test)[:, 1], pos_label=1)
lw = 2
plt.plot(fpr, tpr, lw=lw, label="ROC curve")
plt.plot([0, 1], [0, 1])
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.savefig("ROC.png")
plt.show()

The AUC-ROC criterion is robust to imbalanced classes (spoiler: alas, not everything is so unambiguous) and can be interpreted as the probability that a randomly chosen positive object is ranked by the classifier higher (receives a higher probability of being positive) than a randomly chosen negative object.

    Consider the following problem: we need to select 100 relevant documents from 1 million documents. We have mastered two algorithms:

• Algorithm 1 returns 100 documents, 90 of which are relevant. Thus, TPR = 90/100 = 0.9 and FPR = 10/999,900 ≈ 0.00001.
• Algorithm 2 returns 2000 documents, 90 of which are relevant. Thus, TPR = 90/100 = 0.9 and FPR = 1910/999,900 ≈ 0.0019.

Most likely, we would choose the first algorithm, which produces very few False Positives compared to its competitor. But the difference in False Positive Rate between the two algorithms is extremely small: only 0.0019. This is a consequence of the fact that AUC-ROC measures the share of False Positives relative to True Negatives, and in problems where the second (larger) class is not so important to us it may not give an entirely adequate picture when comparing algorithms.

In order to correct the situation, let's return to precision and recall:

• Algorithm 1: precision = 90/100 = 0.9, recall = 90/100 = 0.9
• Algorithm 2: precision = 90/2000 = 0.045, recall = 90/100 = 0.9

Here a significant difference between the two algorithms is already noticeable: 0.855 in precision!

Precision and recall are also used to construct a curve and, as with AUC-ROC, to find the area under it (AUC-PR).

It can be noted that on small datasets the area under the PR curve may be overly optimistic if it is computed with the trapezoidal rule, but usually there is enough data in such tasks. For details on the relationship between AUC-ROC and AUC-PR, see here.
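For the churn model above, a short sketch of computing both areas with sklearn (lr, X_test and y_test are assumed from the earlier code); note that average_precision_score summarizes the precision-recall curve with a step-wise, non-interpolated rule rather than the trapezoidal one:

from sklearn.metrics import roc_auc_score, average_precision_score

probas = lr.predict_proba(X_test)[:, 1]
print("ROC AUC: %.3f" % roc_auc_score(y_test, probas))
print("PR AUC (average precision): %.3f" % average_precision_score(y_test, probas))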

    Logistic Loss

The logistic loss function stands somewhat apart. It is defined as:

$$logloss = -\frac{1}{\ell} \sum_{i=1}^{\ell} \bigl( y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \bigr)$$

where $\hat{y}_i$ is the algorithm's answer on the $i$-th object, $y_i$ is the true class label on the $i$-th object, and $\ell$ is the sample size.

The mathematical interpretation of the logistic loss function has already been discussed in detail in the post on linear models.
This metric rarely appears in business requirements, but it often does in Kaggle competitions.
Intuitively, minimizing logloss can be thought of as maximizing accuracy by penalizing wrong predictions. However, it must be noted that logloss penalizes the classifier's confidence in a wrong answer extremely heavily.

    Let's consider an example:

def logloss_crutch(y_true, y_pred, eps=1e-15):
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print("Logloss with uncertain classification %f" % logloss_crutch(1, 0.5))
>> Logloss with uncertain classification 0.693147

print("Logloss with confident classification and correct answer %f" % logloss_crutch(1, 0.9))
>> Logloss with confident classification and correct answer 0.105361

print("Logloss with confident classification and wrong answer %f" % logloss_crutch(1, 0.1))
>> Logloss with confident classification and wrong answer 2.302585

Note how dramatically the logloss grows for a confident but wrong answer!
Consequently, an error on a single object can noticeably degrade the overall sample error. Such objects are often outliers, which one must remember to filter out or to treat separately.
Everything falls into place if you plot the logloss curve:

It can be seen that the closer the algorithm's answer is to zero when the ground truth is 1, the higher the error value and the steeper the curve grows.
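A minimal sketch that reproduces such a plot for ground truth = 1 (using the numpy and matplotlib imports from above):

# Logloss curve for y = 1: the loss grows without bound as the predicted probability approaches zero.
x = np.linspace(0.001, 1, 1000)
plt.figure(figsize=(8, 5))
plt.plot(x, -np.log(x), lw=2)
plt.xlabel("Predicted probability for the true class")
plt.ylabel("logloss")
plt.title("Logloss for ground truth = 1")
plt.show()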

    Summing up:

• In the case of multi-class classification, you need to carefully monitor the metrics of each class and follow the logic of solving the business problem rather than of optimizing the metric
    • In the case of unequal classes, it is necessary to select a balance of classes for training and a metric that will correctly reflect the quality of the classification
    • The choice of metric should be done with a focus on the subject area, preprocessing the data and, possibly, segmenting (as in the case of dividing into rich and poor customers)

    useful links

1. Evgeny Sokolov's course: a seminar on model selection (with information on the metrics of regression problems)
2. Problems on AUC-ROC by A. G. Dyakonov
3. You can read more about other metrics on Kaggle; a link to the competition where a metric was used is added to its description
4. Presentation by Bogdan Melnik aka ld86 on learning from imbalanced samples

    UDC 519.816

    S. V. SEMENIKHIN L. A. DENISOVA

    Omsk State Technical University

RANKING MACHINE LEARNING METHOD

BASED ON A MODIFIED GENETIC ALGORITHM FOR THE NDCG METRIC

The problem of ranking documents on a search results page and the machine learning of ranking functions are considered. An approach is proposed for optimizing the ranking function using the NDCG quality metric on the basis of a modified genetic algorithm. The developed algorithms were studied (on the LETOR test collections) and their effectiveness for machine-learned ranking was shown.

    Keywords: information retrieval, machine learning ranking, relevance, optimization, genetic algorithms.

1. Introduction. In modern information retrieval systems (IRS), the volumes of data handled by the system are so large that the key task is the ranking of relevant documents in response to a user's search query. At the current stage of IRS development, machine-learned (ML) ranking is of the greatest interest. Existing approaches to ML based on numerical methods (in particular, gradient methods) or on analytical calculations have a number of drawbacks that significantly affect the quality of information retrieval and the time required to rank relevant documents.

At the beginning of the research, listwise approaches to machine-learned ranking were considered, most of which use gradient descent. In the works considered, ML is reduced to the optimization of search quality metrics, but only metrics represented by continuous functions are used. This limitation often leads to a ranking function that, as a result of optimization, scores lower on many important accepted indicators (DCG, nDCG, Graded Mean Reciprocal Rank, etc.), which are discrete functions. The literature also proposes using genetic algorithms (GA) in learning to rank to minimize Huber's loss function with expert relevance assessments as reference values, as well as an approach to ML based on optimizing discrete information retrieval quality metrics.

2. Statement of the ranking machine learning problem. In most modern information retrieval systems, the ranking function is built on the basis of n simple ranking functions (SRF) and can be written as:

$$RF(q, d) = \sum_{i=1}^{n} WC_i \cdot SRF_i(q, d)$$

where $SRF_i$ is the i-th simple ranking function for document d and query q, $WC_i$ is the weight coefficient of the i-th simple ranking function, and n is the number of SRFs in the ranking system.

In the course of machine learning of ranking, a set of search documents D and queries Q from the LETOR test collection was used. For every query q from Q, a pair is formed with each document d from D. For each such pair, the IRS determines the relevance values used to rank the search results. In order to assess the ranking quality, the system requires reference relevance values E for every document-query pair (d, q). For these purposes, expert relevance assessments are used.

For the study, an IRS was used in which ranking is performed on the basis of N = 5 simple ranking functions SRF_i(WC), i = 1, ..., N, which form a vector optimality criterion:

where WC is the vector of variable parameters, ranging over the parameter space to which the corresponding space of vector criteria is associated.

The use of genetic algorithms for ML ranking makes it possible to maximize discrete quality metrics such as NDCG. The NDCG metric for ranking documents in a search engine is defined by the expression:

$$DCG@n = \sum_{p=1}^{n} \bigl(2^{grade(p)} - 1\bigr) \cdot \frac{1}{\log_2(2 + p)}$$

where grade(p) is the average relevance score given by the experts to the document at position p in the results list, and $1/\log_2(2 + p)$ is a coefficient depending on the document's position (the first documents have greater weight).

Then the normalized version of NDCG is written as

$$NDCG@n = DCG@n / r$$

where r is the normalization factor, equal to the maximum possible value of DCG@n for the given query (i.e. equal to the DCG of the ideal ranking).

Thus, in order to optimize (maximize) the NDCG metric, the objective function is written in the following form:

    3. Metrics of the quality of ranking of search results. When ranking documents in search results, quality metrics act as criteria. From the list of generally accepted metrics for assessing the quality of information retrieval systems, three main ones have been selected that assess the accuracy, relevance and completeness of information retrieval.

1. The criterion of information retrieval accuracy (precision):

$$P = \frac{a}{a + b}$$

where a is the number of relevant documents found, and b is the number of documents mistakenly considered relevant.

2. The Bpref criterion, which evaluates the relevance of information retrieval, is used for processing a run with R relevant documents and is calculated by the formula

$$Bpref = \frac{1}{R} \sum_{r} \left( 1 - \frac{NonRelBefore(r)}{R} \right)$$

Here r denotes a known relevant document, and NonRelBefore(r) is the number of known irrelevant documents ranked higher than r (only the first R judged-irrelevant documents from the run are taken into account in the calculation).

3. The criterion of completeness (recall) of search results:

$$r = \frac{a}{a + c}$$

where a is the number of relevant documents found, and c is the number of relevant documents not found.

4. Test collections. A ranking machine learning problem requires a set of documents and queries with corresponding relevance scores determined by experts. This data is used for machine learning of the ranking function as well as for assessing the quality of the system's ranking of search results. In the ML process, test collections are used as the training set and therefore have a significant impact on the results. The LETOR test collection of documents and queries was used for the research. This collection is used for information retrieval research by Microsoft Research. Table 1 shows the characteristics of the LETOR test collections.

    5. Modified genetic algorithm. To use genetic algorithms in ranking machine learning, the problem must be formulated in such a way that the solution is encoded as a vector (genotype), where each gene can be a bit, number, or other object. In this case, the genotype is represented by a vector of weights for the corresponding ranking factors. The condition for stopping the execution of the genetic algorithm is finding the optimal solution, the exhaustion of the number of generations or the time allotted for evolution.

It should be noted that GAs are most effective at locating the region of the global extremum; however, they can work slowly when a local minimum must be found within that region. The proposed way to avoid this drawback is to create a modified genetic algorithm (MGA) that switches to a local (fast) optimization algorithm after the region of the global optimum has been found by the basic GA. The proposed MGA is a hybrid method based on the classical GA and the Nelder-Mead method (the simplex algorithm). The Nelder-Mead method, a frequently used nonlinear optimization algorithm, is a numerical method for finding the minimum of an objective function in multidimensional space. The hybrid MGA algorithm proposed in this paper switches to the Nelder-Mead method after the GA stopping conditions are satisfied. A block diagram of the MGA algorithm is shown in Fig. 1.

When performing the research, a limit on the number of objective function evaluations (Nrf = 16,000) was adopted for the search of the global extremum region, together with the condition of switching to the local optimization algorithm based on the Nelder-Mead method (after the basic genetic algorithm has performed 75% of the Nrf evaluations).
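The paper does not give its implementation, but the idea of the hybrid scheme can be sketched as follows: a basic GA spends 75% of the evaluation budget, after which the best individual is refined with the Nelder-Mead method via scipy.optimize.minimize. The objective below is an arbitrary placeholder standing in for the negated NDCG of the ranking produced by the weight vector; the population size, mutation scale and crossover scheme are our own illustrative choices, not the paper's:

import numpy as np
from scipy.optimize import minimize

def objective(weights):
    # Placeholder objective: in the paper this would be the negated NDCG@n of the
    # ranking produced by RF(q, d) = sum_i WC_i * SRF_i(q, d) with these weights.
    return float(np.sum((weights - 0.3) ** 2))

def hybrid_mga(n_weights=5, budget=16000, pop_size=50, mutation=0.1, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, n_weights))
    evals = 0
    # Stage 1: a basic GA runs until 75% of the evaluation budget is spent.
    while evals < 0.75 * budget:
        fitness = np.array([objective(ind) for ind in pop])
        evals += pop_size
        order = np.argsort(fitness)
        parents = pop[order[:pop_size // 2]]                    # select the best half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            mask = rng.random(n_weights) < 0.5                  # uniform crossover
            child = np.where(mask, a, b)
            child = child + mutation * rng.normal(size=n_weights)  # Gaussian mutation
            children.append(child)
        pop = np.vstack([parents, np.array(children)])
    best = pop[np.argmin([objective(ind) for ind in pop])]
    # Stage 2: local refinement of the best individual with the Nelder-Mead simplex method.
    result = minimize(objective, best, method="Nelder-Mead",
                      options={"maxfev": int(0.25 * budget)})
    return result.x

print(hybrid_mga())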

    6. Results. As a result of the research carried out using the machine learning algorithm

    Table 1

    Number of documents and queries in test collections

Test collection name  Subsystem name  Number of queries  Number of documents

    LETOR 4.0 MQ2007 1692 69623

    LETOR 4.0 MQ2008 784 15211

    LETOR 3.0 OHSUMED 106 16140

    LETOR 3.0 Gov03td 50 49058

    LETOR 3.0 Gov03np 150 148657

    LETOR 3.0 Gov03hp 150 147606

    LETOR 3.0 Gov04td 75 74146

    LETOR 3.0 Gov04np 75 73834

    LETOR 3.0 Gov04hp 75 74409

Fig. 1. Block diagram of the hybrid MGA algorithm based on genetic algorithms and the Nelder-Mead method

for ranking (LTR-MGA), a vector of weights WC* for the ranking function was obtained. Then, on the data of the LETOR test collection, the ranking quality was assessed by computing the quality metrics. The discrete ranking quality metric NDCG@n evaluates the quality of the first n documents of the system's response. The generally accepted metrics assess ranking quality at a few fixed cutoffs; however, for a more detailed consideration of how the metric changes, the values of NDCG@n were considered for all n from 1 to 10. To compare the effectiveness of the developed algorithm with existing solutions, a comparative analysis was carried out using the ranking algorithms provided with the LETOR 3.0 collections. The results of running the algorithms on the TD2003 and TD2004 test collections for the NDCG metric are shown in Fig. 2. The results show that the LTR-MGA algorithm outperforms the baseline algorithms, with the highest values

being obtained for NDCG@1 (at the level of the first document). The superiority of the LTR-MGA algorithm is due to the fact that, in contrast to the baseline ranking functions considered in the experiments, the proposed approach optimizes the ranking function using the NDCG metric itself as the objective function.

    In order to assess the ranking quality when using the proposed LTR-MGA algorithm, the values ​​of the quality metrics for ranking documents in the search results were calculated (Fig. 3). Comparison of the ranking results (Table 2) when using the basic ranking function, the basic LTR-GA algorithm and the modified LTR-MGA algorithm indicates the advantage of the latter.

In addition, the study estimated the time required for ranking machine learning. This is necessary to confirm that the proposed LTR-MGA method is superior in this respect to the approach based on the use of the traditional

Fig. 2. Comparison of ranking machine learning algorithms by the NDCG metric on the test collections: on the left, the Gov03td dataset; on the right, the Gov04td dataset

Fig. 3. Evaluation of ranking quality metrics for the basic ranking formula and the learning algorithms LTR-GA and LTR-MGA

    Ranking quality metrics for different ranking machine learning algorithms

Table 2

    Ranking quality metric Basic ranking function LTR-GA LTR-MGA Increasing the metric value,%

    Accuracy 0.201 0.251 0.267 26.81

NDCG@5 (first 5 documents) 0.149 0.31 0.339 90.47

NDCG@10 (first 10 documents) 0.265 0.342 0.362 29.14

    Bpref 0.303 0.316 0.446 51.49

    Completeness 0.524 0.542 0.732 39.03

    * The best values ​​for the corresponding metric are highlighted in gray

genetic algorithm (LTR-GA). The results of comparing the time spent executing the LTR-GA and LTR-MGA algorithms are shown in Table 3.

7. Conclusion. Thus, the studies carried out have shown that when the proposed approach is used, the values of the considered ranking metrics in the IRS increase (by 19.55% on average compared to the LTR-GA algorithm). This confirms that LTR-MGA works correctly and significantly improves the ranking function; in other words, it successfully solves the optimization problem.

Owing to the modified algorithm, namely the application of the local optimization method and the introduced limit on the number of objective function evaluations, the machine learning time decreased (by 17.71% on average compared to the use of the traditional genetic algorithm LTR-GA).

The developed LTR-MGA ranking machine learning algorithm can be used in IRSs that employ a ranking model based on a combination of simple ranking functions. However, some restrictions on the applicability of the proposed approach should be taken into account. Based

Table 3

Estimated execution time of ranking machine learning depending on the size of the training sample
(columns: size of the text document collection; LTR-GA execution time; LTR-MGA execution time; reduction in execution time, %; the last row gives the mean)

* The best values for the corresponding test collection size are highlighted in gray

on the results obtained, it was found that after machine learning the greatest increase is observed in the ranking quality metric whose value was taken as the objective function. At the same time, the other metrics may not improve significantly, and in some cases may even get worse. As one possible way to eliminate this shortcoming, it is proposed to solve the optimization problem as a multi-criteria one: to improve several main ranking metrics of the search results uniformly, instead of optimizing a single one. In further studies, it is also planned to develop a methodology for constructing an objective function based on a linear convolution of the main ranking quality metrics in order to improve the information retrieval process.

    Bibliographic list

    1. Tie-Yan Liu. Learning to Rank for Information Retrieval // Journal Foundations and Trends in Information Retrieval. Vol. 3, issue 3. March 2009. P. 225-331.

2. Christopher J. C. Burges, Tal Shaked, Erin Renshaw. Learning to Rank using Gradient Descent // Proceedings of ICML '05, the 22nd International Conference on Machine Learning. 2005. P. 89-96.

3. Semenikhin, S. V. Investigation of approaches to machine learning for ranking documents by a search system based on genetic algorithms / S. V. Semenikhin // Young Russia: Advanced Technologies in Industry. 2013. No. 2. P. 82-85.

4. Denisova, L. A. Multicriteria optimization based on genetic algorithms in the synthesis of control systems: monograph / L. A. Denisova. Omsk: OmSTU Publishing House, 2014. 170 p. ISBN 978-5-8149-1822-2.

5. Denisova, L. A. Automation of parametric synthesis of a control system using a genetic algorithm / L. A. Denisova, V. A. Meshcheryakov // Automation in Industry. 2012. No. 7. P. 34-38.

6. Huber, Peter J. Robust Estimation of a Location Parameter // Annals of Statistics. 1964. No. 53. P. 73-101.

7. Semenikhin, S. V. Automation of information retrieval based on multicriteria optimization and genetic algorithms / S. V. Semenikhin, L. A. Denisova // Dynamics of Systems, Mechanisms and Machines. 2014. No. 3. P. 224-227.

8. Tie-Yan Liu, Jun Xu, Tao Qin, Wenying Xiong and Hang Li. LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval // SIGIR 2007 Workshop on Learning to Rank for Information Retrieval. 2007. P. 3-10.

9. Ageev, M. S. Official metrics of ROMIP'2004 / M. S. Ageev, I. E. Kuralenok // II Russian Seminar on the Evaluation of Information Retrieval Methods (ROMIP 2004), Pushchino, 2004: proceedings / ed. I. Nekrestyanov. St. Petersburg: Research Institute of Chemistry, St. Petersburg State University. P. 142-150.

10. J. A. Nelder, R. Mead. A simplex method for function minimization // The Computer Journal. 1965. Vol. 7. P. 308-313.

    Svyatoslav Vitalievich SEMENIKHIN, post-graduate student of the Department of Automated Information Processing and Control Systems. Correspondence address: [email protected] DENISOVA Lyudmila Albertovna, Doctor of Technical Sciences, Associate Professor of the Department of Automated Information Processing and Control Systems. Correspondence address: [email protected]

    This chapter presents popular methods for assessing the quality of the classification model, which, among other things, are used in other works on this topic. Their description and justification of the metrics used for this assessment are given.

    Quality assessment metrics

Overall accuracy

This metric is one of the simplest and at the same time most universal metrics for assessing the performance of classification algorithms. The value of this coefficient is calculated as the proportion of correctly classified objects out of the total number of objects in the sample. The metric is popular due to its simplicity and its applicability to any number of classes. Its main drawback is that it assigns the same weight to all documents, which may be inappropriate when the training set is strongly skewed towards one or several classes. The metric may have a high value while the classifier shows extremely poor quality within one of the classes, and the metric gives no signal of this.

Precision, Recall and F-Measure

Metrics such as precision and recall were first widely used in assessing the performance of systems solving information retrieval problems. The precision of the system within one class is the proportion of objects that really belong to a certain class among all objects assigned by the system to that class. Recall (completeness) is the proportion of objects found by the classifier that belong to a class relative to all objects of that class. Table 1 is the contingency table for a single class, where TP (true positive) is a true-positive decision, TN (true negative) a true-negative decision, FP (false positive) a false-positive decision, and FN (false negative) a false-negative decision.

    Table 1 - Table of contingency of a class of objects

Thus, precision and recall are calculated as:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}$$

The F-measure combines information about the precision and recall of the evaluated algorithm. It is calculated as the harmonic mean of precision and recall:

$$F = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}$$

Since the F-measure is calculated separately for each class, it is convenient for finding and analyzing specific errors of the algorithm and for evaluating classification with several classes. Moreover, in the case of a large number of classes, a characteristic is needed that aggregates recall and precision over all classes and characterizes the overall behavior of the system. In this work, the following aggregated values are used for this purpose: macro precision, computed as the arithmetic mean of the precision over all classes; macro recall, computed as the arithmetic mean of the recall over all classes; and the macro F-score, the harmonic mean between them.
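A small sketch with sklearn (the labels are hypothetical); note that sklearn's f1_score with average="macro" averages the per-class F-scores, which is close to, but not exactly, the harmonic mean of macro precision and macro recall described above:

from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical true and predicted labels for a three-class problem.
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]

print("macro precision:", precision_score(y_true, y_pred, average="macro"))
print("macro recall:   ", recall_score(y_true, y_pred, average="macro"))
print("macro F-score:  ", f1_score(y_true, y_pred, average="macro"))  # mean of per-class F1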

    Cross validation

Cross-validation is one of the most common methods for full testing and assessment of the performance of various machine learning algorithms. For an independent sample, this method allows one to obtain an unbiased estimate of the error probability, in contrast to the mean error on the training sample, which can be a biased estimate of the error probability due to overfitting. Another advantage of this procedure is the ability to estimate the error probability of an algorithm when no specially designed control sample is available for testing.

Let X be the set of feature descriptions of objects, and let a finite sample of precedents $X^\ell = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$ be given on it, where Y is a finite set of classes. A mapping $\mu$ is given that assigns an algorithm $a = \mu(X^\ell)$ to an arbitrary sample of precedents. Then the quality of the algorithm on an arbitrary sample of precedents $X^k$ is estimated using the quality functional:

$$Q\bigl(a, X^k\bigr) = \frac{1}{k} \sum_{i=1}^{k} L\bigl(a, x_i, y_i\bigr)$$

where $L$ is some non-negative function that returns the error of algorithm $a$ on an object $x_i$ with correct class label $y_i$.
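A minimal sketch of k-fold cross-validation with sklearn; the synthetic data, the estimator and the macro F-score used as the quality functional are illustrative choices:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the sample of precedents.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

clf = LogisticRegression(max_iter=1000)
# 5-fold cross-validation: the quality functional is evaluated on each held-out fold and averaged.
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print(scores, scores.mean())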