Quality Metrics in Machine Learning Problems

A partial order is specified between the elements inside each list. This order is usually given by assigning a score to each element (for example, "relevant" or "not relevant"; more than two grades are possible). The purpose of a ranking model is to approximate and generalize, in the best way (in some sense), the ranking observed in the training set to new data.

Learning to rank is still a fairly young, rapidly developing area of research. It emerged in the 2000s, when the information retrieval community became interested in applying machine learning methods to ranking problems.


During the training of the ranking model and during its operation, each document–query pair is translated into a numerical vector of ranking features (also called ranking factors or signals) characterizing the properties of the document, the query and their relationship. Such features can be divided into three groups: query-independent (static) features that depend only on the document, query-dependent (dynamic) features that depend on both the document and the query, and query-level features that depend only on the query.

Some examples of ranking features used in the well-known LETOR dataset are listed below; a toy sketch of assembling such a feature vector follows the list:

• Values of the TF, TF-IDF, BM25 and language-model measures of how well the query matches various zones of the document (title, URL, body text, anchor text);
• Lengths and IDF-sums of document zones;
• Document ranks obtained by various link-analysis algorithms such as PageRank and HITS.
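To make this concrete, here is a minimal sketch of turning one document–query pair into such a feature vector. The zones, the idf table, the query and the PageRank value are invented for illustration and are not part of LETOR.

# Toy example: turn one document-query pair into a small ranking-feature vector.
# The zones, idf values and tokenization are invented for illustration only.
query = "learning to rank".split()
doc = {
    "title": "learning to rank for information retrieval".split(),
    "body": "this survey covers machine learning methods for ranking documents".split(),
}
idf = {"learning": 1.2, "to": 0.1, "rank": 2.3}  # assumed corpus statistics

def tf(term, zone):
    return zone.count(term)

features = []
for zone_name, zone in doc.items():
    # TF and TF-IDF of the query terms in this zone
    features.append(sum(tf(t, zone) for t in query))
    features.append(sum(tf(t, zone) * idf.get(t, 0.0) for t in query))
    # zone length and its IDF-sum
    features.append(len(zone))
    features.append(sum(idf.get(t, 0.0) for t in zone))

# A static, query-independent feature such as a precomputed PageRank score
features.append(0.15)  # assumed PageRank of the document

print(features)  # this vector is what the ranking model consumes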

Ranking quality metrics

There are several metrics used to evaluate and compare the performance of ranking algorithms on a sample labelled by assessors. The parameters of a ranking model are often tuned so as to maximize the value of one of these metrics.

Examples of such metrics are precision@n, mean average precision (MAP), mean reciprocal rank (MRR) and DCG/nDCG; a sketch of nDCG is given below.
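As a concrete illustration of the last of these, here is a minimal sketch of DCG@n and nDCG@n computed from assessor grades; the example grades and the (2^grade − 1) / log2(1 + p) convention are the common LETOR-style choice, used here only for illustration.

import math

def dcg_at_n(grades, n):
    # DCG@n with the common (2^grade - 1) / log2(1 + p) convention
    return sum((2 ** g - 1) / math.log2(1 + p)
               for p, g in enumerate(grades[:n], start=1))

def ndcg_at_n(grades, n):
    # nDCG@n: DCG of the ranking divided by DCG of the ideal ordering
    ideal = dcg_at_n(sorted(grades, reverse=True), n)
    return dcg_at_n(grades, n) / ideal if ideal > 0 else 0.0

# assessor grades of the documents in the order the model ranked them (labels 0-2)
ranked_grades = [2, 0, 1, 2, 0]
print(ndcg_at_n(ranked_grades, 5))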

    Algorithm classification

In his survey "Learning to Rank for Information Retrieval" and in talks at thematic conferences, Tie-Yan Liu of Microsoft Research Asia analyzed the existing methods for solving the learning-to-rank problem and proposed classifying them into three approaches — pointwise, pairwise and listwise — depending on the input representation and the loss function used:

    Pointwise approach

Notes

1. Tie-Yan Liu (2009), Learning to Rank for Information Retrieval, Foundations and Trends in Information Retrieval: Vol. 3: No. 3, pp. 225–331, ISBN 978-1-60198-244-5, DOI 10.1561/1500000016. Slides from T.-Y. Liu's talk at WWW 2009 are available.


This chapter presents popular methods for assessing the quality of a classification model, which, among other things, are used in other works on this topic. A description of these metrics and a justification of their use for this assessment are given.

    Quality assessment metrics

Overall accuracy

This metric is one of the simplest and at the same time most universal metrics for assessing the performance of classification algorithms. Its value is calculated as the proportion of correctly classified objects out of the total number of objects in the sample. The metric is popular due to its simplicity and its applicability to any number of classes. Its main disadvantage is that it assigns the same weight to every object, which may be inappropriate when the training set is strongly skewed towards one or several classes. The metric can take a high value even though the classifier performs extremely poorly within one of the classes, and the metric does not signal this in any way.

Precision, recall and F-measure

Metrics such as precision and recall were first widely used for assessing systems that solve information retrieval problems. The precision of a system for a single class is the proportion of objects that really belong to the class among all objects the system assigned to that class. Recall is the proportion of objects of the class found by the classifier among all objects of that class. Table 1 is the contingency table for a single class, where TP (true positive) is a true-positive decision, TN (true negative) a true-negative decision, FP (false positive) a false-positive decision, and FN (false negative) a false-negative decision.

Table 1 – Contingency table for a class of objects

True Positive (TP)    False Positive (FP)
False Negative (FN)   True Negative (TN)

Thus, precision and recall are calculated as:

precision = TP / (TP + FP),   recall = TP / (TP + FN)

The F-measure combines information about the precision and recall of the evaluated algorithm. It is calculated as the harmonic mean of precision and recall:

F = 2 · precision · recall / (precision + recall)

Because the F-measure is computed separately for each class, it is convenient for finding and analyzing specific errors of the algorithm and for evaluating classification with several classes. Moreover, in the case of a large number of classes, a characteristic is needed that aggregates recall and precision over all classes and characterizes the overall behavior of the system. In this work, the following aggregated values are used for this purpose: macro precision, computed as the arithmetic mean of the precision over all classes, macro recall, computed as the arithmetic mean of the recall over all classes, and the macro F-score, which is the harmonic mean between them.
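A minimal sketch of these aggregated values with sklearn; the labels are invented, and note that sklearn's average="macro" F1 averages the per-class F-measures, whereas the macro F-score described above is the harmonic mean of macro precision and macro recall, computed explicitly at the end.

from sklearn.metrics import precision_score, recall_score, f1_score

# Toy multi-class labels, invented for illustration
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 0, 2, 2]

# Macro averaging: compute the metric per class, then take the arithmetic mean
macro_p = precision_score(y_true, y_pred, average="macro")
macro_r = recall_score(y_true, y_pred, average="macro")
macro_f1_sklearn = f1_score(y_true, y_pred, average="macro")  # mean of per-class F1

# Macro F-score as defined in the text: harmonic mean of macro precision and macro recall
macro_f_text = 2 * macro_p * macro_r / (macro_p + macro_r)

print(macro_p, macro_r, macro_f1_sklearn, macro_f_text)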

    Cross validation

Cross-validation is one of the most common methods for full testing and assessment of the performance of various machine learning algorithms. For an independent sample, this method gives an unbiased estimate of the error probability, in contrast to the average error on the training sample, which can be a biased estimate of the error probability due to overfitting of the algorithm. Another advantage of the procedure is that it provides an estimate of the algorithm's error probability when no specially designed control sample is available for testing.

Let us assume that X is the set of feature descriptions of objects, on which a finite sample of precedents X^m = {(x_1, y_1), …, (x_m, y_m)} is given, where Y is a finite set of classes. A mapping μ is given that assigns an algorithm a = μ(X^m) to an arbitrary sample of precedents. Then the quality of algorithm a on an arbitrary sample of precedents X^k is estimated using the quality functional

Q(a, X^k) = (1 / k) · Σ_{i=1..k} L(a(x_i), y_i),

where L is some non-negative function that returns the error of the algorithm's answer a(x_i) given the correct class label y_i. A sketch of k-fold cross-validation with sklearn follows.
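A minimal sketch of this procedure using sklearn's k-fold cross-validation; the synthetic sample and the accuracy scorer are assumptions made for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic sample of precedents (X, y), invented for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# 5-fold cross-validation: the model is trained on 4 folds and evaluated on the held-out fold,
# so the averaged score estimates the error probability on unseen data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())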

    Hello, Habr!

In machine learning tasks, metrics are used to assess the quality of models and to compare different algorithms, and their selection and analysis is an indispensable part of a data scientist's job.

    In this article, we will look at some quality criteria in classification problems, discuss what is important when choosing a metric and what can go wrong.

    Metrics in classification problems

To demonstrate some of sklearn's useful features and to illustrate the metrics visually, we will use our dataset on telecom operator customer churn, which we met in the first article of the course.

    Let's load the required libraries and look at the data

import itertools

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import (precision_recall_curve, classification_report,
                             confusion_matrix, roc_curve)

df = pd.read_csv("../../data/telecom_churn.csv")

df.head(5)

    Data preprocessing

# Map the binary columns
# and encode State with dummy coding (for simplicity, better not to do this for tree-based models)
d = {"Yes": 1, "No": 0}

df["International plan"] = df["International plan"].map(d)
df["Voice mail plan"] = df["Voice mail plan"].map(d)
df["Churn"] = df["Churn"].astype("int64")

le = LabelEncoder()
df["State"] = le.fit_transform(df["State"])

ohe = OneHotEncoder(sparse=False)
encoded_state = ohe.fit_transform(df["State"].values.reshape(-1, 1))
tmp = pd.DataFrame(encoded_state, columns=["state " + str(i) for i in range(encoded_state.shape[1])])
df = pd.concat([df.drop(["State"], axis=1), tmp], axis=1)

    Accuracy, precision and recall

Before moving on to the metrics themselves, we need to introduce an important concept for describing these metrics in terms of classification errors — the confusion matrix (error matrix).
Suppose we have two classes and an algorithm that predicts, for each object, which class it belongs to. Then the classification error matrix looks like this:

          y = 1                  y = 0
ŷ = 1     True Positive (TP)     False Positive (FP)
ŷ = 0     False Negative (FN)    True Negative (TN)

Here ŷ is the algorithm's answer on an object, and y is the true class label of that object.
Thus, there are two types of classification errors: False Negative (FN) and False Positive (FP).

    Algorithm training and construction of the error matrix

X = df.drop("Churn", axis=1)
y = df["Churn"]

# Split the sample into train and test; all metrics will be evaluated on the test dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42)

# Train a plain logistic regression
lr = LogisticRegression(random_state=42)
lr.fit(X_train, y_train)

# Use the confusion-matrix plotting function from the sklearn documentation
def plot_confusion_matrix(cm, classes, normalize=False, title="Confusion matrix", cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print("Confusion matrix, without normalization")

    print(cm)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

font = {"size": 15}
plt.rc("font", **font)

cnf_matrix = confusion_matrix(y_test, lr.predict(X_test))
plt.figure(figsize=(10, 8))
plot_confusion_matrix(cnf_matrix, classes=["Non-churned", "Churned"], title="Confusion matrix")
plt.savefig("conf_matrix.png")
plt.show()

    Accuracy

An intuitive, obvious and yet almost never used metric is accuracy — the proportion of correct answers of the algorithm:

accuracy = (TP + TN) / (TP + TN + FP + FN)

This metric is useless in problems with imbalanced classes, which is easy to show with an example.

Let's say we want to evaluate the performance of a spam mail filter. We have 100 non-spam emails, 90 of which our classifier identified correctly (True Negative = 90, False Positive = 10), and 10 spam emails, 5 of which the classifier also identified correctly (True Positive = 5, False Negative = 5).
Then accuracy:

accuracy = (5 + 90) / (5 + 90 + 10 + 5) ≈ 0.864

However, if we simply predict all emails as non-spam, we get a higher accuracy:

accuracy = (0 + 100) / (0 + 100 + 0 + 10) ≈ 0.909

At the same time, our model has absolutely no predictive power, since we originally wanted to identify spam messages. Overcoming this is helped by moving from a metric common to all classes to separate indicators of quality for each class.
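The arithmetic of this example as a short sketch (the counts are taken straight from the text above):

# Counts from the spam-filter example above
TP, TN, FP, FN = 5, 90, 10, 5
accuracy = (TP + TN) / (TP + TN + FP + FN)
print(round(accuracy, 3))               # ~0.864 for the real classifier

# "Predict everything as non-spam": all 10 spam letters become False Negatives
TP, TN, FP, FN = 0, 100, 0, 10
accuracy_all_negative = (TP + TN) / (TP + TN + FP + FN)
print(round(accuracy_all_negative, 3))  # ~0.909, higher, yet the model is useless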

    Precision, recall and F-measure

To assess the performance of the algorithm on each class separately, we introduce the precision and recall metrics:

precision = TP / (TP + FP),   recall = TP / (TP + FN)

Precision can be interpreted as the proportion of objects called positive by the classifier that are really positive, while recall shows what proportion of the objects of the positive class, out of all objects of the positive class, the algorithm has found.

It is precision that prevents us from simply assigning all objects to a single class, since that would increase the number of False Positives. Recall demonstrates the ability of the algorithm to detect the given class at all, and precision demonstrates the ability to distinguish this class from other classes.

As noted earlier, there are two types of classification errors: False Positive and False Negative. In statistics, the first is called a Type I error and the second a Type II error. In our problem of predicting subscriber churn, a Type I error is mistaking a loyal subscriber for a churning one: our null hypothesis is that none of the subscribers churn, and we wrongly reject it. Accordingly, a Type II error is "missing" a churning subscriber and erroneously accepting the null hypothesis.

Precision and recall, unlike accuracy, do not depend on the class ratio and are therefore applicable to imbalanced samples.
In real practice the task is often to find the optimal (for the customer) balance between these two metrics. A classic example is the customer churn problem.
Obviously, we cannot find all churning customers and only them. But having decided on the strategy and resources for customer retention, we can choose the required precision and recall thresholds. For example, we can focus on retaining only high-yield customers, or on those who are more likely to churn, because the call-center resource is limited. A sketch of choosing such a threshold is given below.
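A sketch of how such a threshold could be chosen with precision_recall_curve, continuing the lr, X_test and y_test objects built above; the minimum-precision requirement of 0.7 is an arbitrary stand-in for the business constraint.

# Probabilities of the positive class on the test set (continues the churn model above)
probs = lr.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, probs)

# Hypothetical business constraint: keep precision at 0.7 or higher,
# then take the threshold that gives the best recall under that constraint
min_precision = 0.7
feasible = precisions[:-1] >= min_precision      # precisions has one extra trailing element
best = np.argmax(recalls[:-1] * feasible)        # best recall among feasible thresholds
print("threshold:", thresholds[best],
      "precision:", precisions[best],
      "recall:", recalls[best])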

Usually, when optimizing an algorithm's hyperparameters (for example, when iterating over a grid with GridSearchCV), a single metric is used, whose improvement we expect to see on the test sample.
There are several ways to combine precision and recall into an aggregated quality criterion. The F-measure (in general, F_beta) is the harmonic mean of precision and recall:

F_beta = (1 + beta²) · precision · recall / (beta² · precision + recall)

Here beta determines the weight of precision in the metric; for beta = 1 this is the ordinary harmonic mean (with a factor of 2, so that F_1 = 1 when precision = 1 and recall = 1):

F_1 = 2 · precision · recall / (precision + recall)

The F-measure reaches its maximum when recall and precision are both equal to one, and is close to zero if one of the arguments is close to zero.
sklearn has a handy function, metrics.classification_report, which returns the recall, precision and F-measure for each class, as well as the number of instances of each class.

report = classification_report(y_test, lr.predict(X_test), target_names=["Non-churned", "Churned"])
print(report)

    class precision recall f1-score support
    Non-churned 0.88 0.97 0.93 941
    Churned 0.60 0.25 0.35 159
    avg / total 0.84 0.87 0.84 1100

It should be noted that for problems with imbalanced classes, which prevail in real practice, it is often necessary to resort to techniques of artificial dataset modification to equalize the class ratio. There are many of them, and we will not touch on them here; you can look at some of the methods and choose the one that suits your task.

    AUC-ROC and AUC-PR

When converting the algorithm's real-valued output (as a rule, the probability of belonging to a class; see SVM separately) into a binary label, we must choose a threshold at which 0 becomes 1. A threshold of 0.5 seems natural, but it does not always turn out to be optimal, for example, in the aforementioned case of class imbalance.

One way to evaluate the model as a whole, without being tied to a specific threshold, is AUC-ROC (or ROC AUC) — the Area Under the receiver operating characteristic Curve. This curve is a line from (0,0) to (1,1) in the coordinates of True Positive Rate (TPR) and False Positive Rate (FPR):

TPR = TP / (TP + FN),   FPR = FP / (FP + TN)

We already know TPR — it is recall, while FPR shows what proportion of objects of the negative class the algorithm predicted incorrectly. In the ideal case, when the classifier makes no mistakes (FPR = 0, TPR = 1), the area under the curve equals one; if the classifier outputs class probabilities at random, AUC-ROC tends to 0.5, since the classifier produces the same proportion of TPs and FPs.
Each point on the graph corresponds to the choice of a certain threshold. The area under the curve shows the quality of the algorithm (more is better); in addition, the steepness of the curve itself matters — we want to maximize TPR while minimizing FPR, which means the curve should ideally tend towards the point (0,1).

    ROC curve drawing code

sns.set(font_scale=1.5)
sns.set_color_codes("muted")

plt.figure(figsize=(10, 8))
fpr, tpr, thresholds = roc_curve(y_test, lr.predict_proba(X_test)[:, 1], pos_label=1)
lw = 2
plt.plot(fpr, tpr, lw=lw, label="ROC curve")
plt.plot([0, 1], [0, 1])
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve")
plt.savefig("ROC.png")
plt.show()

The AUC-ROC criterion is robust to imbalanced classes (spoiler: alas, not everything is so clear-cut) and can be interpreted as the probability that a randomly selected positive object will be ranked by the classifier higher (will receive a higher probability of being positive) than a randomly selected negative object. A sketch of this interpretation follows.
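A small sketch of this interpretation, again continuing lr, X_test and y_test from above: compare roc_auc_score with a direct Monte-Carlo estimate of the probability that a random positive object is scored higher than a random negative one (ties are ignored in the estimate).

from sklearn.metrics import roc_auc_score

probs = lr.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, probs))

# Monte-Carlo check of the pairwise interpretation
rng = np.random.RandomState(42)
pos = probs[np.asarray(y_test) == 1]
neg = probs[np.asarray(y_test) == 0]
i = rng.randint(len(pos), size=100_000)
j = rng.randint(len(neg), size=100_000)
print("P(score_pos > score_neg):", np.mean(pos[i] > neg[j]))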

Consider the following problem: we need to select 100 relevant documents from 1 million documents. We have two algorithms:

• Algorithm 1 returns 100 documents, 90 of which are relevant. Thus, TPR = 90/100 = 0.9 and FPR = 10/999900 ≈ 0.00001.
• Algorithm 2 returns 2000 documents, 90 of which are relevant. Thus, TPR = 90/100 = 0.9 and FPR = 1910/999900 ≈ 0.0019.

Most likely, we would choose the first algorithm, which produces very few False Positives compared to its competitor. But the difference in False Positive Rate between the two algorithms is extremely small — only about 0.0019. This is a consequence of the fact that AUC-ROC measures the share of False Positives relative to True Negatives, and in problems where the second (larger) class is not so important to us, it may not give an entirely adequate picture when comparing algorithms.

To correct the situation, let's return to precision and recall:

• Algorithm 1: precision = 90/100 = 0.9, recall = 90/100 = 0.9
• Algorithm 2: precision = 90/2000 = 0.045, recall = 90/100 = 0.9

Here a significant difference between the two algorithms is already noticeable — 0.855 in precision!

Precision and recall are also used to construct a curve and, as with AUC-ROC, to find the area under it; a sketch of this is given below.

It can be noted that on small datasets the area under the PR curve can be overly optimistic, because it is calculated by the trapezoidal method, but usually there is enough data in such tasks. For details on the relationship between AUC-ROC and AUC-PR, see here.
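A sketch of the PR curve and the area under it via the trapezoidal rule, still using lr, X_test and y_test from above:

from sklearn.metrics import auc

probs = lr.predict_proba(X_test)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_test, probs)

# Area under the PR curve via the trapezoidal rule (recall is the x axis)
pr_auc = auc(recalls, precisions)
print("AUC-PR:", pr_auc)

plt.figure(figsize=(10, 8))
plt.plot(recalls, precisions, lw=2, label="PR curve (AUC = %.3f)" % pr_auc)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("PR curve")
plt.legend()
plt.show()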

    Logistic Loss

The logistic loss function stands apart. It is defined as:

logloss = −(1/ℓ) · Σ_{i=1..ℓ} ( y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i) ),

where ŷ_i is the algorithm's answer on the i-th object, y_i is the true class label on the i-th object, and ℓ is the sample size.

Details of the mathematical interpretation of the logistic loss function have already been given in the post about linear models.
This metric rarely appears in business requirements, but it often comes up in Kaggle competitions.
Intuitively, minimizing logloss can be thought of as maximizing accuracy by penalizing wrong predictions. However, note that logloss penalizes confidence in a wrong answer extremely heavily.

    Let's consider an example:

def logloss_crutch(y_true, y_pred, eps=1e-15):
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print("Logloss with uncertain classification %f" % logloss_crutch(1, 0.5))
>> Logloss with uncertain classification 0.693147

print("Logloss with confident classification and correct answer %f" % logloss_crutch(1, 0.9))
>> Logloss with confident classification and correct answer 0.105361

print("Logloss with confident classification and WRONG answer %f" % logloss_crutch(1, 0.1))
>> Logloss with confident classification and WRONG answer 2.302585

    Note how the logloss has grown dramatically with an incorrect answer and a confident classification!
    Consequently, an error at one object can result in a significant degradation of the overall sample error. Such objects are often outliers, which must be remembered to be filtered or considered separately.
    Everything falls into place if you draw a logloss graph:
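A sketch of that graph for ground truth = 1, reusing logloss_crutch and the imports from the code above:

x = np.linspace(0.001, 0.999, 1000)          # predicted probability of class 1
plt.figure(figsize=(10, 8))
plt.plot(x, [logloss_crutch(1, p) for p in x], lw=2)
plt.xlabel("Predicted probability (ground truth = 1)")
plt.ylabel("logloss")
plt.title("Logloss for a single object with true label 1")
plt.show()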

    It can be seen that the closer to zero the response of the algorithm with ground truth = 1, the higher the error value and the steeper the curve grows.

    Summing up:

• In the case of multi-class classification, you need to carefully monitor the metrics of each class and follow the logic of solving the problem, rather than just optimizing the metric
    • In the case of unequal classes, it is necessary to select a balance of classes for training and a metric that will correctly reflect the quality of the classification
    • The choice of metric should be done with a focus on the subject area, preprocessing the data and, possibly, segmenting (as in the case of dividing into rich and poor customers)

Useful links

    1. Evgeny Sokolov's course: Seminar on the choice of models (there is information on the metrics of regression problems)
2. Problems on AUC-ROC from A. G. Dyakonov
    3. You can read more about other metrics on kaggle. A link to the competition where it was used has been added to the description of each metric
    4. Presentation by Bogdan Melnik aka ld86 about training on unbalanced samples

    UDC 519.816

S. V. SEMENIKHIN, L. A. DENISOVA

    Omsk State Technical University

A MACHINE LEARNING METHOD FOR RANKING BASED ON A MODIFIED GENETIC ALGORITHM FOR THE NDCG METRIC

The problem of ranking documents on a search results page and the issues of machine learning for ranking are considered. An approach is proposed for optimizing the ranking function using the NDCG quality metric on the basis of a modified genetic algorithm. The developed algorithms were studied on the LETOR test collections, and their effectiveness for machine learning of ranking was shown.

Keywords: information retrieval, machine learning for ranking, relevance, optimization, genetic algorithms.

1. Introduction. In modern information retrieval systems (IRS), the volumes of data handled by the system are so large that the key task is to rank relevant documents in response to a user's search query. At the current stage of IRS development, ranking based on machine learning (ML) is of the greatest interest. Existing approaches to ML, based on numerical methods (in particular, gradient methods) or on analytical calculations, have a number of shortcomings that significantly affect the quality of information retrieval and the time required to rank relevant documents.

At the beginning of the research, listwise approaches to machine learning for ranking were considered, most of which use the gradient descent method. In those works, ML is reduced to the optimization of search quality metrics, but only metrics represented by continuous functions are used. This limitation often means that, as a result of optimization, the ranking function scores lower on many important accepted indicators (DCG, nDCG, Graded Mean Reciprocal Rank, etc.), which are discrete functions. The paper proposes the use of genetic algorithms (GA) in learning to rank, minimizing Huber's loss function with expert relevance assessments as reference values. An approach to ML based on optimizing discrete information retrieval quality metrics has also been proposed.

2. Statement of the machine learning ranking problem. In most modern information retrieval systems, the ranking function is built on the basis of n simple ranking functions (SRF) and can be written as:

RF(q, d) = Σ_{i=1..n} WC_i · SRF_i(q, d),

where SRF_i is the i-th simple ranking function for document d and query q, WC_i is the weight coefficient of the i-th simple ranking function, and n is the number of SRFs in the ranking system.

In the course of machine learning for ranking, a set of documents D and queries Q from the LETOR test collection was used. For each query q ∈ Q, a pair is formed with each document d ∈ D. For each such pair, the IRS computes the relevance values used to rank search results. In order to evaluate the ranking quality, the system requires reference relevance values E for each document–query pair (d, q). For these purposes, expert assessments of relevance are used.

For the study, we used an IRS in which ranking is based on N = 5 simple ranking functions SRF_i(WC), i = 1, …, N, which form the vector optimality criterion

SRF(WC) = (SRF_1(WC), …, SRF_N(WC)),

where WC ∈ (WC) is the vector of variable parameters, and (WC) and (SRF) are the spaces of parameters and of vector criteria, respectively.

The application of genetic algorithms to ML ranking makes it possible to maximize discrete quality metrics such as nDCG. The DCG metric for ranking documents in the search engine is determined by the expression:

DCG@n = Σ_{p=1..n} (2^grade(p) − 1) / log2(2 + p),

where grade(p) is the average relevance score given by the experts to the document at position p in the result list, and 1/log2(2 + p) is a coefficient that depends on the position of the document (the first documents have more weight).

Then the normalized version, nDCG, is written as

NDCG@n = DCG@n / r,

where r is the normalization factor, equal to the maximum possible value of DCG@n for the given query (i.e., to the DCG of the ideal ranking).

Thus, to optimize (maximize) the NDCG metric, the objective function is written in the following form:

WC* = arg max_WC NDCG@n(WC).

3. Metrics of the quality of ranking of search results. When ranking documents in search results, quality metrics act as the criteria. From the list of generally accepted metrics for assessing IRS quality, three main ones were selected, which assess the precision, relevance and recall of information retrieval.

1. The precision criterion of information retrieval:

P = a / (a + b),

where a is the number of relevant documents found and b is the number of documents mistakenly considered relevant.

2. The Bpref criterion, which evaluates the relevance of information retrieval, is used to evaluate a run with R known relevant documents and is calculated by the formula

Bpref = (1 / R) · Σ_r (1 − NonRelBefore(r) / R).

Here r denotes a known relevant document, and NonRelBefore(r) is the number of known non-relevant documents ranked higher than r (only the first R judged non-relevant documents from the run are taken into account in the calculation).

3. The recall criterion of search results:

r = a / (a + c),

where a is the number of relevant documents found and c is the number of relevant documents not found. A sketch of all three criteria on a toy run is given below.
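A minimal sketch of these three criteria computed for one toy run; the run and the set of relevant documents are invented, and every retrieved non-relevant document is treated as judged non-relevant for the Bpref term.

# Invented example: a ranked run and the set of documents judged relevant
run = ["d3", "d7", "d1", "d9", "d2", "d8"]   # system output, best first
relevant = {"d3", "d1", "d4", "d2"}          # R = 4 known relevant documents
R = len(relevant)

a = sum(1 for d in run if d in relevant)     # relevant documents found
b = len(run) - a                             # documents mistakenly considered relevant
c = len(relevant - set(run))                 # relevant documents not found

precision = a / (a + b)
recall = a / (a + c)

# Bpref: for each retrieved relevant document r, count the judged non-relevant
# documents ranked above it (capped at R), per the formula above
bpref = 0.0
nonrel_before = 0
for d in run:
    if d in relevant:
        bpref += 1 - min(nonrel_before, R) / R
    else:
        nonrel_before += 1
bpref /= R

print(precision, recall, bpref)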

4. Test collections. A machine learning ranking task requires a set of documents and queries with corresponding relevance scores determined by experts. This data is used for machine learning of the ranking function as well as for assessing the quality of the ranking of search results by the system. In the ML process, test collections are used as the training set and therefore have a significant impact on the results. The LETOR collection of documents and queries was used for the research. This collection is used for information retrieval research by Microsoft Research. Table 1 shows the characteristics of the LETOR test collections.

    5. Modified genetic algorithm. To use genetic algorithms in machine learning ranking, the problem must be formulated in such a way that the solution is encoded as a vector (genotype), where each gene can be a bit, number, or other object. In this case, the genotype is represented by a vector of weighting factors for the corresponding ranking factors. The condition for stopping the execution of the genetic algorithm is finding the optimal solution, the exhaustion of the number of generations or the time allotted for evolution.

It should be noted that GAs are most effective at finding the region of a global extremum, but they can be slow when a local minimum within that region has to be found. The proposed way to avoid this drawback is to build a modified genetic algorithm (MGA), which switches to a local (fast) optimization algorithm after the region of the global optimum has been found with the basic GA. The proposed MGA is a hybrid method based on the classical GA and the Nelder–Mead method (simplex algorithm). The Nelder–Mead method, a frequently used nonlinear optimization algorithm, is a numerical method for finding the minimum of an objective function in a multidimensional space. The hybrid MGA algorithm proposed in this work switches to the Nelder–Mead method after the GA stopping conditions are satisfied. A block diagram of the MGA algorithm is shown in Fig. 1.

In the research, a limit on the number of objective function evaluations (Nrf = 16,000) was adopted for the search of the global extremum region, together with the condition of switching to the local optimization algorithm based on the Nelder–Mead method (after the basic genetic algorithm has used 75% of the Nrf evaluations). A compressed sketch of this hybrid scheme is given below.
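A compressed sketch of this hybrid scheme under stated assumptions: a toy fitness function stands in for NDCG on LETOR, the GA is deliberately tiny, and SciPy's Nelder–Mead implementation is used for the local stage; only the 75% budget split mirrors the switching condition above.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
N_RF = 1_600                        # total budget of objective evaluations (scaled-down Nrf)
calls = {"n": 0}

def fitness(wc):
    # Toy stand-in for the ranking quality objective (e.g. NDCG as a function of SRF weights)
    calls["n"] += 1
    target = np.array([0.1, 0.5, 0.2, 0.15, 0.05])   # invented "ideal" weight vector
    return -np.sum((wc - target) ** 2)               # maximize => minimize squared distance

# --- Stage 1: basic GA until 75% of the evaluation budget is spent ---
pop = rng.random((20, 5))
while calls["n"] < 0.75 * N_RF:
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-10:]]                  # selection: keep the best half
    children = []
    for _ in range(len(pop) - len(parents)):
        a, b = parents[rng.integers(len(parents), size=2)]
        child = np.where(rng.random(5) < 0.5, a, b)          # uniform crossover
        child += rng.normal(0, 0.05, size=5)                 # mutation
        children.append(np.clip(child, 0, 1))
    pop = np.vstack([parents, children])

best = max(pop, key=fitness)

# --- Stage 2: switch to the Nelder-Mead simplex method for local refinement ---
res = minimize(lambda wc: -fitness(wc), best, method="Nelder-Mead",
               options={"maxfev": int(0.25 * N_RF)})
print("refined weights:", np.round(res.x, 3), "objective:", -res.fun)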

6. Results.

Table 1 — Number of documents and queries in the test collections

Test collection   Sub-collection   Number of queries   Number of documents
LETOR 4.0         MQ2007           1692                69623
LETOR 4.0         MQ2008           784                 15211
LETOR 3.0         OHSUMED          106                 16140
LETOR 3.0         Gov03td          50                  49058
LETOR 3.0         Gov03np          150                 148657
LETOR 3.0         Gov03hp          150                 147606
LETOR 3.0         Gov04td          75                  74146
LETOR 3.0         Gov04np          75                  73834
LETOR 3.0         Gov04hp          75                  74409

Fig. 1. Block diagram of the hybrid MGA algorithm based on genetic algorithms and the Nelder–Mead method

As a result of the research carried out using the LTR-MGA machine learning algorithm for ranking, a vector of weights WC* for the ranking function was obtained. Then, on the data of the LETOR test collection, the ranking quality was assessed by computing the quality metrics. The discrete ranking quality metric NDCG@n evaluates the quality of the first n documents of the system's response. The generally accepted metrics for assessing ranking quality are NDCG@n for several fixed values of n; however, for a more detailed view of how the metric changes, NDCG@n was considered for all n from 1 to 10. To compare the efficiency of the developed algorithm with existing solutions, a comparative analysis was carried out using the ranking algorithms provided with the LETOR 3.0 collections. The results of running the algorithms on the TD2003 and TD2004 test collections for the NDCG metric are shown in Fig. 2. They show that the LTR-MGA algorithm outperforms the baseline algorithms, with the largest gain at NDCG@1 (at the level of the first document). The superiority of the LTR-MGA algorithm is due to the fact that, in contrast to the baseline ranking functions considered in the experiments, in the proposed approach it is the NDCG metric itself that is used as the objective function for optimizing the ranking function.

    In order to assess the ranking quality when using the proposed LTR-MGA algorithm, the values ​​of the quality metrics for ranking documents in the search results were calculated (Fig. 3). Comparison of the ranking results (Table 2) when using the basic ranking function, the basic LTR-GA algorithm and the modified LTR-MGA algorithm indicates the advantage of the latter.

In addition, the study estimated the time required for ML ranking. This is needed to confirm that the proposed LTR-MGA method is superior in this respect to the approach based on the traditional genetic algorithm (LTR-GA).

Fig. 2. Comparison of machine learning algorithms for ranking by the NDCG metric on the test collections: left — the Gov03td dataset, right — the Gov04td dataset

Fig. 3. Evaluation of ranking quality metrics for the basic ranking formula and the learning algorithms LTR-GA and LTR-MGA

Table 2 — Ranking quality metrics for different machine learning ranking algorithms

Ranking quality metric          Basic ranking function   LTR-GA   LTR-MGA   Metric increase, %
Precision                       0.201                    0.251    0.267     26.81
NDCG@5 (first 5 documents)      0.149                    0.31     0.339     90.47
NDCG@10 (first 10 documents)    0.265                    0.342    0.362     29.14
Bpref                           0.303                    0.316    0.446     51.49
Recall                          0.524                    0.542    0.732     39.03

* The best values for the corresponding metric are highlighted in gray.

The results of comparing the time spent on executing the LTR-GA and LTR-MGA algorithms are shown in Table 3.

7. Conclusion. The studies carried out have shown that with the proposed approach the values of the considered ranking metrics in the IRS increase (on average by 19.55% compared to the LTR-GA algorithm). This confirms that LTR-MGA works correctly and significantly improves the ranking function; in other words, it successfully solves the optimization problem. Thanks to the local optimization method and the introduced restrictions on the number of objective function evaluations, the modified algorithm has also reduced the machine learning time (on average by 17.71% compared with the traditional genetic algorithm LTR-GA).

The developed machine learning algorithm for ranking, LTR-MGA, can be used in IRS that employ a ranking model based on a combination of simple ranking functions. However, some limitations on the application of the proposed approach should be taken into account.

Table 3 — Execution time of machine learning for ranking depending on the size of the training sample

Size of the document collection   LTR-GA execution time   LTR-MGA execution time   Reduction of execution time, %

Average value

* The best values for the corresponding test collection size are highlighted in gray.

Based on the obtained results, it was revealed that after ML the greatest increase is observed for the ranking quality metric whose value was taken as the objective function. At the same time, other metrics may not improve significantly, and in some cases may even get worse. As one possible way to eliminate this shortcoming, it is proposed to treat the optimization problem as a multi-criteria one: to uniformly improve several basic metrics of search result ranking instead of optimizing a single one. In further studies, it is also planned to develop a methodology for constructing an objective function based on a linear convolution of the main ranking quality metrics in order to improve the information retrieval process.

Bibliographic list

1. Tie-Yan Liu. Learning to Rank for Information Retrieval // Foundations and Trends in Information Retrieval. Vol. 3, issue 3. March 2009. P. 225–331.

2. Christopher J. C. Burges, Tal Shaked, Erin Renshaw. Learning to Rank using Gradient Descent // Proceedings of ICML '05, the 22nd International Conference on Machine Learning. 2005. P. 89–96.

3. Semenikhin, S. V. Investigation of approaches to machine learning for ranking documents by a search system based on genetic algorithms / S. V. Semenikhin // Young Russia: advanced technologies in industry. — 2013. — No. 2. — P. 82–85.

4. Denisova, L. A. Multicriteria optimization based on genetic algorithms in the synthesis of control systems: monograph / L. A. Denisova. — Omsk: OmSTU Publishing House, 2014. — 170 p. — ISBN 978-5-8149-1822-2.

5. Denisova, L. A. Automation of parametric synthesis of a control system using a genetic algorithm / L. A. Denisova, V. A. Meshcheryakov // Automation in Industry. — 2012. — No. 7. — P. 34–38.

6. Huber, Peter J. Robust Estimation of a Location Parameter // Annals of Mathematical Statistics. — 1964. — P. 73–101.

7. Semenikhin, S. V. Automation of information retrieval based on multicriteria optimization and genetic algorithms / S. V. Semenikhin, L. A. Denisova // Dynamics of Systems, Mechanisms and Machines. — 2014. — No. 3. — P. 224–227.

8. Tie-Yan Liu, Jun Xu, Tao Qin, Wenying Xiong, Hang Li. LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval // SIGIR 2007 Workshop on Learning to Rank for Information Retrieval. — 2007. — P. 3–10.

9. Ageev, M. S. Official metrics of ROMIP 2004 / M. S. Ageev, I. E. Kuralenok // II Russian Seminar on the Evaluation of Information Retrieval Methods (ROMIP 2004), Pushchino, 2004: proceedings / ed. I. Nekrestyanov. — St. Petersburg: Research Institute of Chemistry, St. Petersburg State University. — P. 142–150.

10. Nelder, J. A. A simplex method for function minimization / J. A. Nelder, R. Mead // The Computer Journal. — 1965. — Vol. 7. — P. 308–313.

SEMENIKHIN, Svyatoslav Vitalievich, post-graduate student of the Department of Automated Information Processing and Control Systems. Correspondence address: [email protected]
DENISOVA, Lyudmila Albertovna, Doctor of Technical Sciences, Associate Professor of the Department of Automated Information Processing and Control Systems. Correspondence address: [email protected]