Rank correlation: Kendall's and Spearman's rank correlation coefficients

KENDALL'S RANK CORRELATION COEFFICIENT

One of the sample measures of dependence between two random variables (features) X and Y, based on the ranking of the sample elements (X_1, Y_1), ..., (X_n, Y_n). Kendall's rank correlation coefficient is therefore a rank statistic and is defined by the formula

$$\tau = \frac{2S}{n(n-1)},$$

where r_i is the rank of Y belonging to the pair (X, Y) for which the rank of X equals i, S = 2N - n(n-1)/2, and N is the number of pairs of sample elements for which simultaneously j > i and r_j > r_i. Always -1 ≤ τ ≤ 1. As a sample measure of dependence, Kendall's rank correlation coefficient was widely used by M. Kendall (see the literature below).

Kendall's rank correlation coefficient is used to test the hypothesis that two random variables are independent. If the independence hypothesis is true, then E τ = 0 and D τ = 2(2n + 5) / (9n(n - 1)). For small samples, the statistical test of the independence hypothesis is carried out using special tables (see the literature below). For n > 10, the normal approximation to the distribution of τ is used: if

$$|\tau| > u_{\alpha/2} \sqrt{\frac{2(2n+5)}{9n(n-1)}},$$

then the hypothesis of independence is rejected; otherwise it is accepted. Here α is the significance level and u_{α/2} is the percentage point of the normal distribution. Like any other rank statistic, Kendall's rank correlation coefficient can be used to detect dependence between two qualitative features, provided the sample elements can be ordered with respect to these features. If X and Y have a joint normal distribution with correlation coefficient ρ, then the relationship between Kendall's rank correlation coefficient and ρ has the form

$$E\tau = \frac{2}{\pi} \arcsin \rho.$$
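The relation for jointly normal X and Y, E τ = (2/π) arcsin ρ, can be tabulated and inverted numerically. The following is a minimal sketch; the function names `expected_tau` and `rho_from_tau` are illustrative and do not come from the source.

```python
import math


def expected_tau(rho: float) -> float:
    """Expected Kendall tau under joint normality: (2 / pi) * arcsin(rho)."""
    return 2.0 / math.pi * math.asin(rho)


def rho_from_tau(tau: float) -> float:
    """Inverse of the relation above: rho = sin(pi * tau / 2)."""
    return math.sin(math.pi * tau / 2.0)


for rho in (0.0, 0.5, 0.9, 1.0):
    print(rho, round(expected_tau(rho), 3))
```

Note that τ is systematically smaller in absolute value than ρ for 0 < |ρ| < 1, so the two coefficients should not be compared on the same numeric scale.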

See also Spearman's rank correlation, Rank test.

Lit.: Kendall M., Rank correlations, trans. from English, Moscow, 1975; Van der Waerden B. L., Mathematical statistics, trans. from German, Moscow, 1960; Bol'shev L. N., Smirnov N. V., Tables of mathematical statistics, Moscow, 1965.

A. V. Prokhorov.


Encyclopedia of Mathematics. Moscow: Soviet Encyclopedia. I. M. Vinogradov. 1977-1985.


Presentation and preprocessing of expert assessments

In practice, several types of assessments are used:

- qualitative (often/rarely, worse/better, yes/no),

- scale estimates (ranges of values: 50-75, 76-90, 91-120, etc.),

- point scores from a given interval (from 2 to 5, from 1 to 10), mutually independent,

- ranked (objects are arranged by an expert in a certain order, and each is assigned a serial number, its rank),

- comparative, obtained by one of the comparison methods: the sequential comparison method or the method of pairwise comparison of factors.

At the next step of processing expert opinions, it is necessary to evaluate the degree of consistency of these opinions.

The estimates obtained from experts can be considered as a random variable, the distribution of which reflects the opinions of experts about the probability of a particular choice of an event (factor). Therefore, to analyze the scatter and consistency of expert estimates, generalized statistical characteristics are used - averages and scatter measures:

- mean square (standard) deviation,

- range of variation, from min to max,

- coefficient of variation V = standard deviation / arithmetic mean (suitable for any type of assessment):

V_i = σ_i / x̄_i

To measure the similarity of the opinions of each pair of experts, a variety of methods can be used:

- association coefficients, which take into account the number of matching and non-matching answers,

- coefficients of inconsistency of expert opinions.

All these measures can be used either to compare the opinions of two experts, or to analyze the relationship between the series of assessments on two grounds.

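Spearman's pairwise rank correlation coefficient:

$$\rho_{ij} = 1 - \frac{6 \sum_{k=1}^{T} c_k^2}{T(T^2 - 1)},$$

where i, j = 1, ..., n, and n is the number of experts,

c_k is the difference between the ranks given to the k-th factor by the i-th and j-th experts, taken over all T factors.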
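As an illustration of the pairwise comparison of two experts, here is a minimal sketch of Spearman's coefficient in its standard form ρ = 1 - 6 Σ c_k² / (T(T² - 1)). It assumes both experts gave strict rankings (no tied ranks); the function name and the sample rankings are hypothetical.

```python
def spearman_rho(ranks_i, ranks_j):
    """Spearman's rho for two strict rankings of the same T factors."""
    if len(ranks_i) != len(ranks_j):
        raise ValueError("both experts must rank the same factors")
    t = len(ranks_i)
    if t < 2:
        raise ValueError("at least two factors are required")
    # Sum of squared rank differences c_k over all factors.
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_i, ranks_j))
    return 1.0 - 6.0 * d2 / (t * (t * t - 1))


# Hypothetical rankings of six factors by two experts.
expert_a = [1, 2, 3, 4, 5, 6]
expert_b = [2, 1, 3, 4, 6, 5]
print(round(spearman_rho(expert_a, expert_b), 3))
```

With tied ranks this shortcut formula is only approximate; in that case the Pearson correlation of the rank vectors is the usual substitute.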

Kendall's coefficient of concordance W gives an overall assessment of the consistency of the opinions of all experts on all factors, but only when rank estimates were used.

It can be proved that S reaches its maximum when all experts give the same estimates of all factors, and that this maximum equals

$$S_{\max} = \frac{m^2 (n^3 - n)}{12},$$

where n is the number of factors,

m is the number of experts.

The coefficient of concordance is equal to the ratio

$$W = \frac{S}{S_{\max}} = \frac{12 S}{m^2 (n^3 - n)},$$

and if W is close to 1, then the experts have given sufficiently consistent estimates; otherwise their opinions do not agree.

The formula for calculating S is shown below:

$$S = \sum_{i=1}^{n} \Big( \sum_{j=1}^{m} r_{ij} - r_{cf} \Big)^2,$$

where r_ij is the rank estimate of the i-th factor by the j-th expert,

r_cf is the average sum of ranks over the entire matrix of estimates, equal to

$$r_{cf} = \frac{m(n+1)}{2}.$$

Therefore the formula for calculating S can take the form:

$$S = \sum_{i=1}^{n} \Big( \sum_{j=1}^{m} r_{ij} - \frac{m(n+1)}{2} \Big)^2.$$

If some assessments of one expert coincide and were standardized during processing (replaced by tied ranks), then a different formula is used to calculate the concordance coefficient:

$$W = \frac{12 S}{m^2 (n^3 - n) - m \sum_{j=1}^{m} T_j}, \qquad (**)$$

where T_j is calculated for each expert (in the event that his assessments were repeated for different objects), taking the repetitions into account according to the following rule:

$$T_j = \sum_{k=1}^{t_j} (h_k^3 - h_k),$$

where t_j is the number of groups of equal ranks for the j-th expert, and

h_k is the number of equal ranks in the k-th group of tied ranks of the j-th expert.
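The calculation of the concordance coefficient with the correction for tied ranks can be sketched as follows. This is a minimal illustration, assuming the usual tie-corrected form W = 12S / (m²(n³ - n) - m ΣT_j) with T_j = Σ(h_k³ - h_k); the function name `kendall_w` and the small rank matrix are hypothetical and are not the data of the example in the text.

```python
from collections import Counter


def kendall_w(ranks):
    """Kendall's W; ranks[j][i] is the rank given by expert j to factor i."""
    m = len(ranks)         # number of experts
    n = len(ranks[0])      # number of factors
    mean_sum = m * (n + 1) / 2
    # Sum of ranks received by each factor from all experts.
    factor_sums = [sum(row[i] for row in ranks) for i in range(n)]
    s = sum((total - mean_sum) ** 2 for total in factor_sums)
    # Tie correction: for every expert, sum h^3 - h over groups of equal ranks.
    t_total = sum(h ** 3 - h for row in ranks for h in Counter(row).values())
    denominator = m ** 2 * (n ** 3 - n) - m * t_total
    if denominator == 0:
        raise ValueError("W is undefined when every expert ties all factors")
    return 12 * s / denominator


# Hypothetical matrix: 3 experts, 4 factors, two experts with tied ranks.
ranks = [
    [1.5, 1.5, 3, 4],
    [1, 2, 3, 4],
    [1, 2.5, 2.5, 4],
]
print(round(kendall_w(ranks), 3))
```

Groups of size one contribute h³ - h = 0, so the same function covers strict rankings without a separate branch.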

EXAMPLE. Suppose 5 experts ranked six factors as shown in Table 3:

Table 3 - Answers of experts

Experts O1 O2 O3 O4 O5 O6 Sum of ranks by expert
E1
E2
E3
E4
E5

Since the ranking obtained is not strict (the experts' assessments repeat, and the sums of ranks are not equal), we transform the estimates into tied ranks (Table 4):

Table 4 - Tied ranks of expert assessments

Experts O1 O2 O3 O4 O5 O6 Sum of ranks by expert
E1 2.5 2.5
E2
E3 1.5 1.5 4.5 4.5
E4 2.5 2.5 4.5 4.5
E5 5.5 5.5
Sum of ranks of the object 7.5 9.5 23.5 29.5

Now let's determine the degree of consistency of the expert opinions using the coefficient of concordance. Since there are tied ranks, we calculate W by formula (**).

Then r_cf = m(n + 1)/2 = 5 · 7/2 = 17.5

S = 10² + 8² + 4.5² + 4.5² + 6² + 12² = 384.5

Let us proceed to the calculation of W. For this, we first calculate the values of T_j. In the example, the assessments are specially selected so that each expert has repeated assessments: the first has two, the second has three, the third and the fourth each have two groups of two, and the fifth has two identical ratings. Hence:

T_1 = 2³ - 2 = 6, T_5 = 6

T_2 = 3³ - 3 = 24

T_3 = (2³ - 2) + (2³ - 2) = 12, T_4 = 12
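Substituting these values gives the coefficient itself. The worked calculation below assumes the usual tie-corrected form of Kendall's W, with m = 5 experts, n = 6 factors and the values of S and T_j obtained above:

```latex
\[
\sum_{j=1}^{5} T_j = 6 + 24 + 12 + 12 + 6 = 60,
\]
\[
W = \frac{12 S}{m^2 (n^3 - n) - m \sum_j T_j}
  = \frac{12 \cdot 384.5}{5^2 (6^3 - 6) - 5 \cdot 60}
  = \frac{4614}{4950} \approx 0.93 .
\]
```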

We see that the agreement of the experts' opinions is quite high and we can proceed to the next stage of the study - substantiation and adoption of the alternative of the decision recommended by the experts.

Otherwise, you need to go back to steps 4-8.

The rank correlation coefficient characterizes the general nature of a nonlinear dependence: whether the response variable increases or decreases as the factor variable increases. It is an indicator of the tightness of a monotonic nonlinear relationship.


The coefficient proposed by Kendall is built on relations of the "more-less" type, whose validity was established when the scales were constructed.
Let's select a pair of objects and compare their ranks on one attribute and on the other. If, on a given attribute, the ranks form a direct order (that is, the order of the natural series), the pair is assigned +1; if the reverse order, -1. For the selected pair, the corresponding plus and minus ones (for attribute X and for attribute Y) are multiplied. The result is +1 if the ranks of the pair are in the same order on both attributes, and -1 if they are in opposite orders.
If the rank orders are the same for all pairs on both attributes, then the sum of the units assigned to all pairs of objects is maximal and equals the number of pairs, C_N^2 = N(N - 1)/2. If the rank orders of all pairs are reversed, the sum equals -C_N^2. In the general case, C_N^2 = P + Q, where P is the number of positive and Q the number of negative units assigned to the pairs when comparing their ranks on both attributes.
The quantity

$$\tau = \frac{P - Q}{C_N^2} = \frac{P - Q}{N(N-1)/2}$$

is called Kendall's coefficient.
It can be seen from the formula that τ is the difference between the proportion of pairs of objects whose order is the same on both attributes (relative to the number of all pairs) and the proportion of pairs whose order differs.
For example, a coefficient value of 0.60 means that 80% of pairs have the same order of objects, while 20% do not (80% + 20% = 100%; 0.80 - 0.20 = 0.60). That is, τ can be interpreted as the difference between the probabilities of coincidence and non-coincidence of the orders on both attributes for a randomly selected pair of objects.
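The pair-by-pair construction described above can be sketched directly: every pair gets the product of its +1/-1 marks on X and on Y, and τ = (P - Q) / (N(N - 1)/2). This is a minimal illustration that assumes no tied values; the function names and the five-object data set are hypothetical.

```python
from itertools import combinations


def sign(value):
    """Return +1, -1 or 0 according to the sign of value."""
    return (value > 0) - (value < 0)


def kendall_tau_pairs(x, y):
    """Return (P, Q, tau) by multiplying the +1/-1 marks of every pair."""
    p = q = 0
    for (xa, ya), (xb, yb) in combinations(zip(x, y), 2):
        mark = sign(xb - xa) * sign(yb - ya)
        if mark > 0:
            p += 1      # same order on both attributes
        elif mark < 0:
            q += 1      # opposite orders
    n_pairs = len(x) * (len(x) - 1) // 2
    return p, q, (p - q) / n_pairs


# Hypothetical ranks of five objects on two attributes.
p, q, tau = kendall_tau_pairs([1, 2, 3, 4, 5], [2, 1, 3, 5, 4])
print(p, q, tau)  # 8 2 0.6
```

Here 8 of the 10 pairs (80%) agree in order and 2 (20%) disagree, which reproduces the 0.80 - 0.20 = 0.60 reading of the coefficient.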
In the general case, the calculation of τ (more precisely, P or Q) even for N of the order of 10 turns out to be cumbersome.
Let's show how to simplify the calculations.


An example. The relationship between the volume of industrial production and investment in fixed assets in 10 regions of one of the federal districts of the Russian Federation in 2003 is characterized by the following data:


Calculate the Spearman and Kendall rank correlation coefficients. Check their significance at α = 0.05. Formulate a conclusion about the relationship between the volume of industrial production and investment in fixed assets in the regions of the Russian Federation under consideration.

Solution. Let's assign ranks to attribute Y and factor X.


Let's sort the data by X.
In the row Y to the right of 3 there are 7 ranks exceeding 3, therefore, 3 will generate a term 7 in P.
To the right of 1 there are 8 ranks exceeding 1 (these are 2, 4, 6, 9, 5, 10, 7, 8), i.e. 8 will enter P, and so on. As a result, Р = 37 and using the formulas we have:

X      Y       rank X, d_x   rank Y, d_y   P    Q
18.4   5.57    1             3             7    2
20.6   2.88    2             1             8    0
21.5   4.12    3             2             7    0
35.7   7.24    4             4             6    0
37.1   9.67    5             6             4    1
39.8   10.48   6             9             1    3
51.1   8.58    7             5             3    0
54.4   14.79   8             10            0    2
64.6   10.22   9             7             1    0
90.6   10.45   10            8             0    0
Total                                      37   8
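The P and Q columns of the table can be reproduced by counting, for each rank of Y, how many ranks to its right are larger and how many are smaller. A minimal sketch, with the Y ranks taken from the table in the order of increasing X; the function name `right_counts` is illustrative.

```python
def right_counts(y_ranks):
    """For each rank, count larger (P term) and smaller (Q term) ranks to its right."""
    p_terms, q_terms = [], []
    for i, rank in enumerate(y_ranks):
        tail = y_ranks[i + 1:]
        p_terms.append(sum(1 for t in tail if t > rank))
        q_terms.append(sum(1 for t in tail if t < rank))
    return p_terms, q_terms


# Ranks of Y after sorting the ten regions by X (from the table above).
y_ranks = [3, 1, 2, 4, 6, 9, 5, 10, 7, 8]
p_terms, q_terms = right_counts(y_ranks)
P, Q = sum(p_terms), sum(q_terms)
n = len(y_ranks)
tau = (P - Q) / (n * (n - 1) / 2)

print(p_terms)              # [7, 8, 7, 6, 4, 1, 3, 0, 1, 0]
print(q_terms)              # [2, 0, 0, 0, 1, 3, 0, 2, 0, 0]
print(P, Q, round(tau, 2))  # 37 8 0.64
```

Because the objects are already sorted by X, only the Y ranks need to be compared, which is what makes this shortcut simpler than checking both attributes for every pair.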


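By the simplified formula:

$$\tau = \frac{P - Q}{n(n-1)/2} = \frac{37 - 8}{45} \approx 0.64.$$

To test the null hypothesis that the population Kendall rank correlation coefficient equals zero at significance level α against the competing hypothesis H_1: τ ≠ 0, the critical point must be calculated:

$$T_{kp} = z_{kp} \sqrt{\frac{2(2n+5)}{9n(n-1)}},$$

where n is the sample size and z_kp is the critical point of the two-sided critical region, found from the table of the Laplace function by the equality Φ(z_kp) = (1 - α)/2.
If |τ| < T_kp, there is no reason to reject the null hypothesis: the rank correlation between the qualitative features is insignificant. If |τ| > T_kp, the null hypothesis is rejected: there is a significant rank correlation between the qualitative features.
Find the critical point z_kp:
Φ(z_kp) = (1 - α)/2 = (1 - 0.05)/2 = 0.475
Using the Laplace table, we find z_kp = 1.96.

Let's find the critical point:

$$T_{kp} = 1.96 \sqrt{\frac{2(2 \cdot 10 + 5)}{9 \cdot 10 \cdot (10 - 1)}} \approx 0.487.$$

Since τ > T_kp, we reject the null hypothesis; the rank correlation between the volume of industrial production and investment in fixed assets is significant.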
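The significance check can be sketched in a few lines. The critical point is taken as T_kp = z_kp · sqrt(2(2n + 5) / (9n(n - 1))), the normal-approximation threshold. Instead of a Laplace table, the sketch assumes that z_kp may be obtained from the standard normal quantile at 1 - α/2, which is equivalent to Φ(z_kp) = (1 - α)/2 for the Laplace function. The function name is illustrative.

```python
import math
from statistics import NormalDist


def kendall_critical_point(n, alpha=0.05):
    """Two-sided critical point for Kendall's tau under the normal approximation."""
    z_kp = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    return z_kp * math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))


# Values from the example: P = 37, Q = 8, n = 10.
tau = (37 - 8) / 45
t_kp = kendall_critical_point(10)

print(round(t_kp, 3))   # 0.487
print(abs(tau) > t_kp)  # True: the null hypothesis is rejected
```

For n of 10 or less the normal approximation is rough, so the result is best read as an approximate check rather than an exact test.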

An example. Based on data on the volume of construction and installation work performed in-house and the number of employees in 10 construction companies in one of the cities of the Russian Federation, determine the relationship between these attributes using Kendall's coefficient.

Solution.
Let's assign ranks to attribute Y and factor X.
Let's arrange the objects so that their X ranks form the natural series. Since the marks assigned to each pair of this series are positive, the values "+1" entering P are generated only by those pairs whose Y ranks form a direct order.
They are easy to count by sequentially comparing the rank of each object in the Y row with the remaining ranks to its right.
Kendall's coefficient:

$$\tau = \frac{P - Q}{N(N-1)/2}$$

or

$$\tau = \frac{4P}{N(N-1)} - 1.$$

Let's sort the data by X.
In the row Y, to the right of 2 there are 8 ranks exceeding 2; therefore, 2 generates the term 8 in P.
To the right of 4 there are 6 ranks exceeding 4 (these are 7, 5, 6, 8, 9, 10), i.e. 6 enters P, and so on. As a result, P = 29, and using the formulas we have:

X     Y      rank X, d_x   rank Y, d_y   P    Q
38    292    1             2             8    1
50    302    2             4             6    2
52    366    3             7             3    4
54    312    4             5             4    2
59    359    5             6             3    2
61    398    6             8             2    2
66    401    7             9             1    2
70    298    8             3             1    1
71    283    9             1             1    0
73    413    10            10            0    0
Total                                    29   16


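By the simplified formula:

$$\tau = \frac{P - Q}{n(n-1)/2} = \frac{29 - 16}{45} \approx 0.29.$$

To test the null hypothesis that the population Kendall rank correlation coefficient equals zero at significance level α against the competing hypothesis H_1: τ ≠ 0, the critical point must be calculated:

$$T_{kp} = z_{kp} \sqrt{\frac{2(2n+5)}{9n(n-1)}},$$

where n is the sample size and z_kp is the critical point of the two-sided critical region, found from the table of the Laplace function by the equality Φ(z_kp) = (1 - α)/2.
If |τ| < T_kp, there is no reason to reject the null hypothesis: the rank correlation between the qualitative features is insignificant. If |τ| > T_kp, the null hypothesis is rejected: there is a significant rank correlation between the qualitative features.
Find the critical point z_kp:
Φ(z_kp) = (1 - α)/2 = (1 - 0.05)/2 = 0.475
Using the Laplace table, we find z_kp = 1.96.
Let's find the critical point:

$$T_{kp} = 1.96 \sqrt{\frac{2(2 \cdot 10 + 5)}{9 \cdot 10 \cdot (10 - 1)}} \approx 0.487.$$

Since τ < T_kp, there is no reason to reject the null hypothesis; the rank correlation between the volume of work performed and the number of employees is insignificant.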

Kendall's correlation coefficient is used when the variables are measured on two ordinal scales, provided that there are no tied ranks. Calculating Kendall's coefficient involves counting the numbers of matches and inversions. Let's consider this procedure using the example of the previous problem.

The algorithm for solving the problem is as follows:

    1. We rearrange the data of Table 8.5 so that one of the rows (in this case, the row x_i) is ranked. In other words, we rearrange the pairs x and y in the required order and enter the data in columns 1 and 2 of Table 8.6.

Table 8.6 (columns: x_i, y_i, matches, inversions)

2. Determine the "degree of ranking" of the 2nd row (y_i). This procedure is carried out in the following sequence:

a) take the first value of the unranked row, "3". Count the number of ranks below this number that are greater than the value being compared. There are 9 such values (6, 7, 4, 9, 5, 11, 8, 12 and 10). Enter the number 9 in the "matches" column. Then count the number of values that are smaller than three. There are 2 such values (ranks 1 and 2); enter the number 2 in the "inversions" column.

b) discard the number 3 (we have already worked with it) and repeat the procedure for the next value, "6": the number of matches is 6 (ranks 7, 9, 11, 8, 12 and 10), the number of inversions is 4 (ranks 1, 2, 4 and 5). Enter the number 6 in the "matches" column and the number 4 in the "inversions" column.

c) the procedure is repeated in the same way until the end of the row; remember that each "processed" value is excluded from further consideration (only the ranks that lie below the given number are counted).

Note

To avoid calculation errors, keep in mind that with each "step" the sum of matches and inversions decreases by one; this is understandable, since each time one value is excluded from consideration.

3. The sum of matches (P) and the sum of inversions (Q) are calculated; the data are entered into one of three interchangeable formulas for the Kendall coefficient (8.10), and the corresponding calculations are carried out.

$$\tau = \frac{P - Q}{N(N-1)/2} = 1 - \frac{4Q}{N(N-1)} = \frac{4P}{N(N-1)} - 1 \qquad (8.10)$$

In our case:

$$\tau = \frac{51 - 15}{12 \cdot 11 / 2} = \frac{36}{66} \approx 0.55$$

From Table XIV of the Appendix we find the critical values of the coefficient for a sample of this size: τ_cr = 0.45; 0.59. The empirically obtained value is compared with the tabular value.

Conclusion

τ = 0.55 > τ_cr = 0.45. The correlation is statistically significant at the 1st level.

Note:

If necessary (for example, when no table of critical values is available), the statistical significance of Kendall's τ can be determined by the following formula:

$$z = \frac{S^*}{\sqrt{N(N-1)(2N+5)/18}} \qquad (8.11)$$

where S* = P - Q + 1 if P < Q, and S* = P - Q - 1 if P > Q.

The values of z for the corresponding significance level are the same as for the Pearson measure and are found from the corresponding tables (not included in the appendix). For the standard significance levels, z_cr = 1.96 (for β_1 = 0.95) and 2.58 (for β_2 = 0.99). Kendall's correlation coefficient is statistically significant if z > z_cr.

In our case S* = P - Q - 1 = 35 and z = 2.40, that is, the initial conclusion is confirmed: the correlation between the attributes is statistically significant at the 1st level of significance.
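The z statistic with the continuity correction S* can be checked numerically. The sketch below assumes the usual normal approximation z = S* / sqrt(N(N - 1)(2N + 5)/18). The values P = 51 and Q = 15 are not quoted from the source table: they are inferred here from P - Q = 36 and from the 66 pairs of a 12-item sample without ties. The function name is illustrative.

```python
import math


def kendall_z(p, q, n):
    """z statistic for Kendall's tau with the continuity correction S*."""
    s = p - q
    if s > 0:
        s_star = s - 1
    elif s < 0:
        s_star = s + 1
    else:
        s_star = 0
    return s_star / math.sqrt(n * (n - 1) * (2 * n + 5) / 18)


# P and Q inferred as described above; N = 12 ranked objects.
z = kendall_z(51, 15, 12)
print(round(z, 2))  # 2.4
print(z > 1.96)     # True: significant at the 0.95 level
print(z > 2.58)     # False: not significant at the 0.99 level
```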

One of the factors limiting the use of tests based on the assumption of normality is the sample size. As long as the sample is large enough (for example, 100 or more observations), you can assume that the sampling distribution is normal, even if you are not sure that the variable is normally distributed in the population. If the sample is small, however, these tests should be used only when there is confidence that the variable is indeed normally distributed, and there is no way to test this assumption on a small sample.

The use of tests based on the assumption of normality is also limited by the measurement scale (see the chapter Basic concepts of data analysis). Statistical methods such as the t-test, regression, etc. assume that the original data are continuous. However, there are situations where the data are simply ranked (measured on an ordinal scale) rather than measured precisely.

A typical example is given by the ratings of sites on the Internet: the first position is taken by the site with the maximum number of visitors, the second position by the site with the maximum number of visitors among the remaining sites (those left after the first site has been removed), and so on. Knowing the ratings, we can say that the number of visitors to one site is greater than the number of visitors to another, but we cannot say by how much. Imagine you have 5 sites, A, B, C, D, E, which occupy the top 5 places. Suppose that in the current month the order was A, B, C, D, E, and in the previous month D, E, A, B, C. The question is: have there been significant changes in the site ratings or not? In this situation we obviously cannot use the t-test to compare these two groups of data, so we move into the area of specific probabilistic calculations (and any statistical test contains a probabilistic calculation!). We reason like this: how likely is it that the difference between the two site orderings is due to purely random causes, or is the difference too large to be explained by pure chance? In this reasoning we use only the ranks or permutations of the sites and do not use the specific form of the distribution of the number of visitors in any way.
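One way to make this reasoning concrete is an exact permutation calculation on the two orderings. The sketch below is an illustration and not a method prescribed by the text. It measures the agreement of the two months with Kendall's S = P - Q and counts how many of the 5! = 120 possible orderings agree or disagree with the current one at least as strongly. All function and variable names are hypothetical.

```python
from itertools import combinations, permutations


def kendall_s(order_a, order_b):
    """S = P - Q for two orderings of the same items (no ties)."""
    position_b = {item: k for k, item in enumerate(order_b)}
    # Positions in order_b, listed in the sequence given by order_a.
    ranks = [position_b[item] for item in order_a]
    s = 0
    for i, j in combinations(range(len(ranks)), 2):
        s += 1 if ranks[i] < ranks[j] else -1
    return s


current = ["A", "B", "C", "D", "E"]
previous = ["D", "E", "A", "B", "C"]

s_obs = kendall_s(current, previous)
n_pairs = len(current) * (len(current) - 1) // 2
tau = s_obs / n_pairs

# Exact two-sided p-value: share of all orderings with |S| >= |S observed|.
all_orders = list(permutations(current))
count = sum(1 for order in all_orders
            if abs(kendall_s(current, order)) >= abs(s_obs))
p_value = count / len(all_orders)

print(tau, round(p_value, 3))  # -0.2 0.817
```

With only five sites, an association of this size is entirely compatible with chance, so these rankings alone give no evidence of a systematic relationship between the two months.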

For the analysis of small samples and of data measured on weak scales (ordinal or nominal), nonparametric methods are used.

A quick tour of nonparametric procedures

Essentially, for every parametric test there is at least one nonparametric alternative.

In general, these procedures fall into one of the following categories:

  • tests for differences between independent samples;
  • tests for differences between dependent samples;
  • assessment of the degree of dependence between variables.

In general, the approach to statistical tests in data analysis should be pragmatic and not burdened with unnecessary theoretical reasoning. With a computer running STATISTICA at your disposal, you can easily apply several tests to your data. Knowing some of the pitfalls of the methods, you will choose the right solution through experimentation. The line of development is quite natural: if you need to compare the values of two variables, you use the t-test. However, it should be remembered that it is based on the assumption of normality and equality of variances in each group. Dropping these assumptions leads to nonparametric tests, which are especially useful for small samples.

The development of the t-test leads to analysis of variance, which is used when the number of compared groups is more than two. The corresponding development of nonparametric procedures leads to a nonparametric analysis of variance, although it is significantly poorer than the classical analysis of variance.

To assess the dependence, or, to put it somewhat pompously, the degree of tightness of the connection, the Pearson correlation coefficient is calculated. Strictly speaking, its application has limitations associated, for example, with the type of scale in which the data are measured and the nonlinearity of the dependence; therefore, alternatively, nonparametric, or so-called rank, correlation coefficients are also used, which are used, for example, for ranked data. If the data are measured on a nominal scale, then it is natural to present them in contingency tables that use Pearson's chi-square test with various variations and corrections for accuracy.

So, in essence, there are only a few types of criteria and procedures that you need to know and be able to use, depending on the specifics of the data. You need to determine which criterion should be applied in a particular situation.

Nonparametric methods are most appropriate when sample sizes are small. If there is a lot of data (for example, n> 100), it often doesn't make sense to use nonparametric statistics.

If the sample size is very small (for example, n = 10 or less), then the significance levels for those nonparametric tests that use the normal approximation can only be considered as rough estimates.

Differences between independent groups. If there are two samples (for example, men and women) that need to be compared with respect to some average value, for example, the mean blood pressure or the number of leukocytes in the blood, then the t-test for independent samples can be used.

Nonparametric alternatives to this test are the Wald-Wolfowitz runs test, the Mann-Whitney U test ...

Geometric mean

The geometric mean is calculated by the formula G = (x_1 · x_2 · ... · x_n)^(1/n) = exp(Σ ln(x_i) / n), where x_i is the i-th value and n is the number of observations. If the variable contains negative values or zero (0), the geometric mean cannot be calculated.

Harmonic mean

The harmonic mean is sometimes used to average frequencies. It is calculated by the formula HM = n / Σ(1/x_i), where HM is the harmonic mean, n is the number of observations, and x_i is the value of the i-th observation. If the variable contains zero (0), the harmonic mean cannot be calculated.
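Both means, together with the restrictions on their input, can be sketched as follows. The geometric mean is computed through logarithms, exp(Σ ln(x_i) / n), which is the standard numerically stable form. The function names are illustrative.

```python
import math


def geometric_mean(values):
    """Geometric mean; defined only for strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """Harmonic mean n / sum(1 / x_i); undefined if any value is zero."""
    if any(v == 0 for v in values):
        raise ValueError("harmonic mean is undefined when a value is zero")
    return len(values) / sum(1 / v for v in values)


data = [1, 2, 4]
print(round(geometric_mean(data), 4))  # 2.0
print(round(harmonic_mean(data), 4))   # 1.7143
```

Python's standard `statistics` module also provides `geometric_mean` and `harmonic_mean`; the hand-written versions above only make the formulas explicit.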

Variance and standard deviation

Sample variance and standard deviation are the most commonly used measures of variability (variation) in data. The variance is calculated as the sum of the squared deviations of the values of the variable from the sample mean, divided by n-1 (not by n). The standard deviation is calculated as the square root of the variance estimate.

Range

The range of a variable is a measure of variability, calculated as the maximum minus the minimum.

Interquartile range

The interquartile range is, by definition, the upper quartile minus the lower quartile (the 75th percentile minus the 25th percentile). Since the 75th percentile (upper quartile) is the value to the left of which 75% of cases are located, and the 25th percentile (lower quartile) is the value to the left of which 25% of cases are located, the interquartile range is the interval around the median that contains 50% of the cases (variable values).

Skewness

Skewness is a characteristic of the shape of the distribution. The distribution is skewed to the left if the skewness value is negative, and skewed to the right if it is positive. The skewness of the standard normal distribution is 0. Skewness is associated with the third moment and is defined as: skewness = n × M_3 / [(n-1) × (n-2) × s³], where M_3 = Σ(x_i - x̄)³, s³ is the standard deviation raised to the third power, and n is the number of observations.
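The skewness formula, skewness = n × M_3 / [(n-1) × (n-2) × s³], translates into code almost term by term. A minimal sketch, where s is the sample standard deviation with the n-1 divisor; the function name and the test data are illustrative.

```python
import math


def sample_skewness(values):
    """Skewness n * M3 / ((n - 1) * (n - 2) * s**3), with s the sample std. deviation."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three observations are required")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)   # sum of squared deviations
    m3 = sum((v - mean) ** 3 for v in values)   # sum of cubed deviations
    s = math.sqrt(m2 / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined for constant data")
    return n * m3 / ((n - 1) * (n - 2) * s ** 3)


print(sample_skewness([1, 2, 3, 4, 5]))     # 0.0 for symmetric data
print(sample_skewness([1, 1, 1, 10]) > 0)   # True: long right tail
```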

Kurtosis

Kurtosis is a characteristic of the shape of a distribution, namely a measure of the sharpness of its peak (relative to the normal distribution, whose kurtosis equals 0). As a rule, distributions with a sharper peak than the normal one have positive kurtosis; distributions whose peak is less sharp than that of the normal distribution have negative kurtosis. Kurtosis is associated with the fourth moment and is determined by the formula:

kurtosis = [n × (n+1) × M_4 - 3 × M_2 × M_2 × (n-1)] / [(n-1) × (n-2) × (n-3) × s⁴], where M_j = Σ(x_i - x̄)^j, s⁴ is the standard deviation raised to the fourth power, and n is the number of observations ...
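A sketch of the kurtosis calculation follows. The numerator is not fully legible in the source, so the code assumes the standard bias-corrected sample excess kurtosis, [n(n+1)M_4 - 3(n-1)M_2²] / [(n-1)(n-2)(n-3)s⁴] with M_j = Σ(x_i - x̄)^j, which has the same denominator as the formula in the text. The function name is illustrative.

```python
def sample_excess_kurtosis(values):
    """Bias-corrected sample excess kurtosis (0 for a normal distribution)."""
    n = len(values)
    if n < 4:
        raise ValueError("at least four observations are required")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)   # sum of squared deviations
    m4 = sum((v - mean) ** 4 for v in values)   # sum of fourth-power deviations
    s4 = (m2 / (n - 1)) ** 2                    # standard deviation to the 4th power
    if s4 == 0:
        raise ValueError("kurtosis is undefined for constant data")
    numerator = n * (n + 1) * m4 - 3 * m2 * m2 * (n - 1)
    return numerator / ((n - 1) * (n - 2) * (n - 3) * s4)


# Evenly spread data have a flatter peak than the normal curve.
print(round(sample_excess_kurtosis([1, 2, 3, 4, 5]), 4))  # -1.2
```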