Spearman's and Kendall's rank correlation coefficients, Fechner's coefficient

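Kendall's rank correlation coefficient is used to identify the relationship between quantitative or qualitative indicators, provided they can be ranked. The values of the X indicator are arranged in ascending order and assigned ranks. The values of the Y indicator are ranked, and the Kendall correlation coefficient is calculated:

$$\tau = \frac{2S}{n(n-1)},$$

where S = P - Q and n is the number of observations.

P is the total number of observations following the current observation that have a larger rank value of Y.

Q is the total number of observations following the current observation that have a smaller rank value of Y (equal ranks do not count).

If the studied data are repeated (have the same ranks), then Kendall's corrected correlation coefficient is used in the calculations:

$$\tau = \frac{S}{\sqrt{\left[\frac{n(n-1)}{2} - T_x\right]\left[\frac{n(n-1)}{2} - T_y\right]}}, \qquad T_x = \frac{1}{2}\sum t_x(t_x - 1), \quad T_y = \frac{1}{2}\sum t_y(t_y - 1),$$

where t_x and t_y are the numbers of tied ranks in each group of equal values in the rows X and Y, respectively.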
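A minimal Python sketch of the tie-corrected coefficient may help here. The source formula did not survive extraction, so the sketch assumes the commonly used form S / sqrt((n(n-1)/2 - T_x)(n(n-1)/2 - T_y)) with T = (1/2) Σ t(t-1); the function name `kendall_tau_b` is chosen only for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's coefficient with a correction for tied values (assumed tau-b form)."""
    n = len(x)
    s = 0  # S = P - Q; a pair tied in X or in Y contributes 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)

    def tie_term(values):
        # T = 1/2 * sum of t * (t - 1) over groups of equal values
        return sum(t * (t - 1) for t in Counter(values).values()) / 2

    n_pairs = n * (n - 1) / 2
    return s / sqrt((n_pairs - tie_term(x)) * (n_pairs - tie_term(y)))


print(kendall_tau_b([1, 2, 3, 4], [1, 1, 2, 2]))
```

Without ties both correction terms vanish and the result reduces to 2S / (n(n-1)).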

19. What should be the starting point when defining the theme, object, subject, goal, objectives and hypothesis of the research?

The research program, as a rule, has two sections: methodological and procedural. The first includes substantiating the relevance of the topic, formulating the problem, defining the object and subject, goals and objectives of the research, formulating the basic concepts (categorical apparatus), preliminary systematic analysis of the research object and putting forward a working hypothesis. The second section reveals the strategic research plan, as well as the plan and basic procedures for collecting and analyzing primary data.

First of all, when choosing a research topic, one must proceed from its relevance. Justifying relevance means showing why the study and solution of the problem are necessary and timely for the further development of the theory and practice of teaching and upbringing. Relevant research answers the most pressing questions of the day, reflects the social order that society places with pedagogical science, and reveals the most important contradictions arising in practice. The criterion of relevance is dynamic: it depends on time and on specific circumstances. In its most general form, relevance characterizes the degree of discrepancy between the demand for scientific ideas and practical recommendations (to meet a particular need) and the proposals that science and practice can currently provide.

The most convincing basis for defining the research topic is the social order, which reflects the most acute, socially significant problems requiring urgent solutions. The social order requires substantiation of a specific topic. Usually this is an analysis of how thoroughly the question has been elaborated in science.

If the social order follows from an analysis of pedagogical practice, the scientific problem itself lies on a different plane. It expresses the main contradiction that must be resolved by means of science. Solving the problem is usually the purpose of the study. The goal is a reformulated problem.

The formulation of the problem entails the selection of the research object. It can be a pedagogical process, an area of pedagogical reality, or some pedagogical relationship that contains a contradiction. In other words, the object can be anything that explicitly or implicitly contains a contradiction and generates a problem situation. The object is what the process of cognition is directed at. The subject of study is a part or side of the object: those properties, aspects and features of the object that are most significant from a practical or theoretical point of view and are subject to direct study.

In accordance with the purpose, object and subject of the research, the research tasks are defined; as a rule, they are aimed at testing the hypothesis. The hypothesis is a set of theoretically grounded assumptions whose truth is subject to verification.

The criterion of scientific novelty can be used to assess the quality of completed studies. It characterizes new theoretical and practical conclusions, patterns of education, its structure and mechanisms, content, principles and technologies that were not previously known or recorded in the pedagogical literature. The novelty of research can have both theoretical and practical significance. The theoretical significance lies in creating a concept, or obtaining a hypothesis, regularity, method or model for identifying a problem, tendency or direction. The practical significance lies in the preparation of proposals, recommendations, etc. The criteria of novelty and of theoretical and practical significance vary with the type of research; they also depend on when the new knowledge was obtained.

Rank correlation coefficient characterizes the general nature of nonlinear dependence: an increase or decrease in the effective trait with an increase in the factor one. This is an indicator of the tightness of a monotonic nonlinear relationship.


The coefficient proposed by Kendall is built on relations of the "greater than / less than" type, whose validity was established when the scales were constructed.
Let's select a pair of objects and compare their ranks on one attribute and on the other. If, on a given attribute, the ranks form a direct order (that is, the order of the natural series), the pair is assigned +1; if the reverse order, -1. For the selected pair, the corresponding plus and minus ones (for attribute X and for attribute Y) are multiplied. The result is +1 if the ranks of the pair are in the same order on both attributes, and -1 if they are in opposite orders.
If the rank orders are the same for all pairs on both attributes, then the sum of the ones assigned to all pairs of objects is maximal and equals the number of pairs, $C_N^2 = N(N-1)/2$. If the rank orders of all pairs are reversed, the sum is $-C_N^2$. In the general case, $C_N^2 = P + Q$, where P is the number of positive and Q the number of negative ones assigned to pairs when comparing their ranks on both attributes.
The quantity $\tau = \dfrac{P - Q}{C_N^2}$ is called Kendall's coefficient.
It can be seen from the formula that the coefficient τ is the difference between the proportion of pairs of objects whose order is the same on both attributes (relative to the number of all pairs) and the proportion of pairs whose order differs.
For example, a coefficient value of 0.60 means that 80% of pairs have the same order of objects, while 20% do not (80% + 20% = 100%; 0.80 - 0.20 = 0.60). That is, τ can be interpreted as the difference between the probabilities of coincidence and non-coincidence of the orders on both attributes for a randomly selected pair of objects.
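The arithmetic behind this example can be written out explicitly. The symbols p and q for the two proportions are introduced here only for illustration, and the derivation assumes there are no tied ranks, so that every pair is counted in either P or Q:

$$\tau = p - q, \qquad p + q = 1 \quad\Longrightarrow\quad p = \frac{1 + \tau}{2}, \qquad q = \frac{1 - \tau}{2}.$$

For τ = 0.60 this gives p = (1 + 0.60)/2 = 0.80 and q = (1 - 0.60)/2 = 0.20, which matches the 80% and 20% quoted above.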
In the general case, the calculation of τ (more precisely, P or Q) even for N of the order of 10 turns out to be cumbersome.
Let's show how to simplify the calculations.


An example. The relationship between the volume of industrial production and investment in fixed assets in 10 regions of one of the federal districts of the Russian Federation in 2003 is characterized by the following data:


Calculate the Spearman and Kendall rank correlation coefficients. Check their significance at α = 0.05. Formulate a conclusion about the relationship between the volume of industrial production and investment in fixed assets in the regions of the Russian Federation under consideration.

Solution. Let's assign ranks to attribute Y and factor X.


Let's sort the data by X.
In the row Y, to the right of 3 there are 7 ranks exceeding 3; therefore, 3 generates a term 7 in P.
To the right of 1 there are 8 ranks exceeding 1 (these are 2, 4, 6, 9, 5, 10, 7, 8), i.e. 8 enters P, and so on. As a result, P = 37, and using the formulas we have:

X       Y       rank X, d_x   rank Y, d_y   P    Q
18.4    5.57    1             3             7    2
20.6    2.88    2             1             8    0
21.5    4.12    3             2             7    0
35.7    7.24    4             4             6    0
37.1    9.67    5             6             4    1
39.8    10.48   6             9             1    3
51.1    8.58    7             5             3    0
54.4    14.79   8             10            0    2
64.6    10.22   9             7             1    0
90.6    10.45   10            8             0    0
Total                                       37   8
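The counting rule described above (for each Y rank, count the larger and the smaller ranks to its right) can be sketched in Python. The function name `count_p_q` is illustrative, and the rank list is the rank Y column of the table taken in order of increasing X.

```python
def count_p_q(y_ranks):
    """For each rank, count the larger (P) and smaller (Q) ranks to its right."""
    p = q = 0
    for i, r in enumerate(y_ranks):
        right = y_ranks[i + 1:]
        p += sum(1 for v in right if v > r)
        q += sum(1 for v in right if v < r)
    return p, q


# Y ranks after sorting the observations by X (rank Y column of the table)
y_ranks = [3, 1, 2, 4, 6, 9, 5, 10, 7, 8]
p, q = count_p_q(y_ranks)
n = len(y_ranks)
tau = (p - q) / (n * (n - 1) / 2)
print(p, q, round(tau, 3))  # 37 8 0.644
```

Because there are no tied ranks, P + Q equals the number of pairs, 10 · 9 / 2 = 45, which is a quick check on the column totals.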


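By the simplified formula:

$$\tau = \frac{P - Q}{n(n-1)/2} = \frac{37 - 8}{45} \approx 0.64$$

To test the null hypothesis that Kendall's general rank correlation coefficient equals zero at significance level α against the competing hypothesis H1: τ ≠ 0, the critical point is calculated:

$$T_{kp} = z_{kp}\sqrt{\frac{2(2n+5)}{9n(n-1)}},$$

where n is the sample size; $z_{kp}$ is the critical point of the two-sided critical region, found from the table of the Laplace function by the equality Φ($z_{kp}$) = (1 - α) / 2.
If |τ| < $T_{kp}$, there is no reason to reject the null hypothesis: the rank correlation between the qualitative features is insignificant. If |τ| > $T_{kp}$, the null hypothesis is rejected: there is a significant rank correlation between the qualitative features.
Find the critical point $z_{kp}$:
Φ($z_{kp}$) = (1 - α) / 2 = (1 - 0.05) / 2 = 0.475
Using the Laplace table, we find $z_{kp}$ = 1.96.
Let's find the critical point:

$$T_{kp} = 1.96\sqrt{\frac{2(2\cdot 10+5)}{9\cdot 10\cdot 9}} \approx 0.49$$

Since τ > $T_{kp}$, we reject the null hypothesis; the rank correlation between the volume of industrial production and investment in fixed assets is significant.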
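The significance check can be sketched in Python as follows. The critical-point formula is not legible in the source, so the sketch assumes the commonly used form T_kp = z_kp · sqrt(2(2n + 5) / (9n(n - 1))). The Laplace-table lookup Φ(z_kp) = (1 - α)/2 is replaced by the equivalent standard normal quantile at 1 - α/2, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def kendall_critical_point(n, alpha=0.05):
    """Critical point T_kp for Kendall's tau (assumed normal-approximation form)."""
    # Laplace function value (1 - alpha) / 2 corresponds to the normal CDF 1 - alpha / 2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))


tau = (37 - 8) / 45          # P and Q from the worked example, n = 10
t_kp = kendall_critical_point(10)
print(round(t_kp, 3), abs(tau) > t_kp)
```

With n = 10 the critical point is about 0.49, so τ ≈ 0.64 falls in the rejection region.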

An example. Based on data on the volume of construction and installation work performed in-house and the number of employees in 10 construction companies in one of the cities of the Russian Federation, determine the relationship between these features using the Kendall coefficient.

Solution.
Let's assign ranks to attribute Y and factor X.
Let's arrange the objects so that their X ranks form the natural series. Since the scores assigned to each pair of this series are positive, the values "+1" included in P are generated only by those pairs whose Y ranks form a direct order.
They are easy to count by sequentially comparing the rank of each object in the Y row with the ranks of the objects that follow it.
Let's sort the data by X.
In the row Y to the right of 2 there are 8 ranks exceeding 2, therefore, 2 will generate a term 8 in P.
To the right of 4 there are 6 ranks exceeding 4 (these are 7, 5, 6, 8, 9, 10), i.e. 6 will enter P, and so on. As a result, P = 29 and using the formulas we have:

X     Y     rank X, d_x   rank Y, d_y   P    Q
38    292   1             2             8    1
50    302   2             4             6    2
52    366   3             7             3    4
54    312   4             5             4    2
59    359   5             6             3    2
61    398   6             8             2    2
66    401   7             9             1    2
70    298   8             3             1    1
71    283   9             1             1    0
73    413   10            10            0    0
Total                                   29   16


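By the simplified formula:

$$\tau = \frac{P - Q}{n(n-1)/2} = \frac{29 - 16}{45} \approx 0.29$$

To test the null hypothesis that Kendall's general rank correlation coefficient equals zero at significance level α against the competing hypothesis H1: τ ≠ 0, the critical point is calculated:

$$T_{kp} = z_{kp}\sqrt{\frac{2(2n+5)}{9n(n-1)}},$$

where n is the sample size; $z_{kp}$ is the critical point of the two-sided critical region, found from the table of the Laplace function by the equality Φ($z_{kp}$) = (1 - α) / 2.
If |τ| < $T_{kp}$, there is no reason to reject the null hypothesis: the rank correlation between the qualitative features is insignificant. If |τ| > $T_{kp}$, the null hypothesis is rejected: there is a significant rank correlation between the qualitative features.
Find the critical point $z_{kp}$:
Φ($z_{kp}$) = (1 - α) / 2 = (1 - 0.05) / 2 = 0.475
Using the Laplace table, we find $z_{kp}$ = 1.96.
Let's find the critical point:

$$T_{kp} = 1.96\sqrt{\frac{2(2\cdot 10+5)}{9\cdot 10\cdot 9}} \approx 0.49$$

Since τ < $T_{kp}$, there is no reason to reject the null hypothesis; the rank correlation between the volume of work performed and the number of employees is insignificant.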

The needs of economic and social practice require the development of methods for the quantitative description of processes that allow accurate registration of not only quantitative but also qualitative factors. Provided that the values of the qualitative characteristics can be ordered, or ranked, by the degree of decrease (increase) of the characteristic, it is possible to assess the closeness of the relationship between them. A qualitative feature is one that cannot be measured exactly but still allows objects to be compared with one another and therefore arranged in decreasing or increasing order of quality. The real content of measurements on rank scales is the order in which objects are arranged according to the degree to which the measured feature is expressed.

For practical purposes, the use of rank correlation is very useful. For example, if a high rank correlation is established between two qualitative features of products, then it is enough to control products only by one of the features, which makes the control cheaper and faster.

As an example, we can consider the existence of a connection between the availability of commercial products of a number of enterprises and overhead costs for sales. In the course of 10 observations, the following table was obtained:

Let us arrange the values of X in ascending order, assigning each value its ordinal number (rank):

In this way,

Let's build the following table, where the pairs X and Y are written, obtained as a result of observation with their ranks:

Denoting the difference in ranks as $d_i = x_i - y_i$, we write the formula for calculating Spearman's sample correlation coefficient:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n^3 - n},$$

where n is the number of observations, which is also the number of pairs of ranks.

Spearman's coefficient has the following properties:

If there is a complete direct relationship between the qualitative features X and Y in the sense that the ranks of the objects coincide for all values of i ($x_i = y_i$), then Spearman's sample correlation coefficient is 1. Indeed, substituting $d_i = 0$ into the formula, we get 1.

If there is a complete inverse relationship between the qualitative features X and Y in the sense that rank $x_1 = 1$ corresponds to rank $y_1 = n$, rank $x_2 = 2$ to rank $y_2 = n - 1$, and so on, then Spearman's sample correlation coefficient is -1.

Indeed, if $d_i = x_i - y_i = 2i - n - 1$, then

$$\sum_{i=1}^{n} d_i^2 = \frac{n^3 - n}{3}.$$

Substituting this value into the Spearman correlation coefficient formula, we get -1.

If there is neither a complete direct nor a complete inverse relationship between the qualitative features, then Spearman's sample correlation coefficient lies between -1 and 1, and the closer its value is to 0, the weaker the relationship between the features.
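The three properties can be checked numerically with a short Python sketch. It assumes the standard form ρ = 1 - 6 Σ d_i² / (n³ - n) for untied ranks; the function name `spearman_rho` is illustrative.

```python
def spearman_rho(x_ranks, y_ranks):
    """Spearman's sample rank correlation for untied ranks (assumed standard form)."""
    n = len(x_ranks)
    d2 = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))  # sum of squared rank differences
    return 1 - 6 * d2 / (n ** 3 - n)


ranks = [1, 2, 3, 4, 5]
print(spearman_rho(ranks, ranks))        # complete direct relationship
print(spearman_rho(ranks, ranks[::-1]))  # complete inverse relationship
print(spearman_rho(ranks, [2, 1, 4, 3, 5]))
```

The first two calls return the boundary values 1 and -1; any other arrangement of ranks gives a value strictly between them.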

For the above example, we find the value of ρ; to do this, we complete the table with the values $d_i$ and $d_i^2$:

Kendall's sample correlation coefficient. The relationship between two qualitative characteristics can also be assessed using Kendall's rank correlation coefficient.

Let the ranks of the objects of a sample of size n be:

on the basis of X: $x_1, x_2, \ldots, x_n$;

on the basis of Y: $y_1, y_2, \ldots, y_n$. Let us assume that to the right of $y_1$ there are $R_1$ ranks greater than $y_1$; to the right of $y_2$ there are $R_2$ ranks greater than $y_2$; and so on, up to $R_{n-1}$ ranks greater than $y_{n-1}$ to the right of $y_{n-1}$. Let us introduce the notation for the sum of these counts:

$$R_+ = R_1 + R_2 + \ldots + R_{n-1}.$$

Similarly, we introduce the notation $R_-$ for the sum of the numbers of ranks lying to the right that are smaller.

Kendall's sample correlation coefficient is written by the formula:

$$\tau = \frac{4R_+}{n(n-1)} - 1 = \frac{2(R_+ - R_-)}{n(n-1)},$$

where n is the sample size (the two forms coincide when there are no tied ranks, since then $R_+ + R_- = n(n-1)/2$).

Kendall's coefficient has the same properties as Spearman's coefficient:

If there is a complete direct relationship between the qualitative features X and Y in the sense that the ranks of the objects coincide for all values of i, then Kendall's sample correlation coefficient is 1. Indeed, to the right of $y_1$ there are n - 1 ranks greater than $y_1$, therefore $R_1 = n - 1$; in the same way we establish that $R_2 = n - 2, \ldots, R_{n-1} = 1$. Then $R_+ = (n-1) + (n-2) + \ldots + 1 = n(n-1)/2$, and Kendall's coefficient is $\tau = \frac{4}{n(n-1)}\cdot\frac{n(n-1)}{2} - 1 = 1$.

If there is a complete inverse relationship between the qualitative features X and Y in the sense that rank $x_1 = 1$ corresponds to rank $y_1 = n$, rank $x_2 = 2$ to rank $y_2 = n - 1$, and so on, then Kendall's sample correlation coefficient is -1. To the right of $y_1$ there are no ranks greater than $y_1$, therefore $R_1 = 0$. Likewise, $R_2 = \ldots = R_{n-1} = 0$. Substituting the value $R_+ = 0$ into the Kendall coefficient formula, we get -1.

With a sufficiently large sample size and with values of the rank correlation coefficients not close to 1, the following approximate equality holds:

$$\rho \approx \frac{3}{2}\tau$$

Kendall's coefficient τ gives a more conservative estimate of correlation than Spearman's coefficient ρ (the numerical value of τ is, as a rule, smaller in absolute value than that of ρ). Although calculating ρ is less laborious than calculating τ, the latter is easier to recalculate if a new term is added to the series.

An important advantage of the coefficient τ is that it can be used to determine the partial rank correlation coefficient, which makes it possible to assess the degree of "pure" interconnection of two rank features X and Y, eliminating the influence of a third feature Z:

$$\tau_{xy.z} = \frac{\tau_{xy} - \tau_{xz}\tau_{yz}}{\sqrt{(1 - \tau_{xz}^2)(1 - \tau_{yz}^2)}}$$
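A short Python sketch of the partial coefficient follows. It assumes the usual form (τ_xy - τ_xz τ_yz) / sqrt((1 - τ_xz²)(1 - τ_yz²)), where x and y are the two features of interest and z is the feature whose influence is removed; the function name and the sample values are made up for illustration.

```python
from math import sqrt


def partial_tau(tau_xy, tau_xz, tau_yz):
    """Partial rank correlation of X and Y with the influence of Z removed (assumed form)."""
    return (tau_xy - tau_xz * tau_yz) / sqrt((1 - tau_xz ** 2) * (1 - tau_yz ** 2))


# Hypothetical pairwise coefficients, not taken from the examples in the text
print(round(partial_tau(0.6, 0.5, 0.5), 3))
```

When the third feature is unrelated to both others (τ_xz = τ_yz = 0), the partial coefficient equals the ordinary one.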

The significance of the rank correlation coefficients. When determining the strength of rank correlation based on sample data, it is necessary to consider the following question: with what degree of reliability can one rely on the conclusion that there is a correlation in the general population if a certain sample coefficient of rank correlation is obtained. In other words, the significance of the observed rank correlations should be checked based on the hypothesis that the two rankings under consideration are statistically independent.

With a relatively large sample size n, the significance of the rank correlation coefficients can be checked using the normal distribution table (Appendix Table 1). To test the significance of the Spearman coefficient ρ (for n > 20), calculate the value

$$t_\rho = \rho\sqrt{\frac{n-2}{1-\rho^2}},$$

and to test the significance of the Kendall coefficient τ (for n > 10), calculate the value

$$t_\tau = \frac{S}{\sqrt{n(n-1)(2n+5)/18}},$$

where S = R+ - R-, n is the sample size.

Next, the significance level α is set, the critical value t_cr(α, k) is determined from the table of critical points of the Student's distribution, and the calculated value $t_\rho$ or $t_\tau$ is compared with it. The number of degrees of freedom is taken to be k = n - 2. If $t_\rho$ or $t_\tau$ exceeds t_cr, then the value of ρ or τ is considered significant.
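The two test statistics can be sketched in Python. The formulas are not legible in the source, so the sketch assumes the commonly used forms t_ρ = ρ · sqrt((n - 2) / (1 - ρ²)) and t_τ = S / sqrt(n(n - 1)(2n + 5) / 18); the function names are illustrative, and the critical value still has to be read from a table.

```python
from math import sqrt


def spearman_t(rho, n):
    """Test statistic for Spearman's rho (assumed form, k = n - 2 degrees of freedom)."""
    return rho * sqrt((n - 2) / (1 - rho ** 2))


def kendall_t(s, n):
    """Test statistic for Kendall's tau built from S = R+ - R- (assumed form)."""
    return s / sqrt(n * (n - 1) * (2 * n + 5) / 18)


# S = 37 - 8 = 29 and n = 10 are the values from the first worked example
print(round(kendall_t(29, 10), 3))
```

The value of `kendall_t` equals τ divided by sqrt(2(2n + 5) / (9n(n - 1))), so comparing it with the normal quantile is the same decision rule as comparing |τ| with a critical point of the form z · sqrt(2(2n + 5) / (9n(n - 1))).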

Fechner's correlation coefficient.

Finally, we should mention the Fechner coefficient, which characterizes the elementary degree of tightness of a connection, which is advisable to use to establish the fact of a connection when there is a small amount of initial information. The basis for its calculation is taking into account the direction of deviations from the arithmetic mean of the variants of each variation series and determining the consistency of the signs of these deviations for two series, the relationship between which is measured.

This coefficient is determined by the formula:

$$K_f = \frac{n_a - n_b}{n_a + n_b},$$

where $n_a$ is the number of coincidences of the signs of the deviations of individual values from their arithmetic mean, and $n_b$ is the number of mismatches.

Fechner's coefficient varies within -1.0 ≤ $K_f$ ≤ +1.0.
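A minimal Python sketch of the sign-counting procedure is given below, assuming the form K_f = (n_a - n_b) / (n_a + n_b). How a deviation exactly equal to zero is treated is a convention; this sketch simply compares signs, so two zero deviations count as a coincidence. The function name is illustrative.

```python
def fechner(x, y):
    """Fechner's sign-agreement coefficient (assumed form (na - nb) / (na + nb))."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)

    def sign(v):
        return (v > 0) - (v < 0)

    n_a = n_b = 0
    for a, b in zip(x, y):
        if sign(a - mean_x) == sign(b - mean_y):
            n_a += 1  # signs of the two deviations coincide
        else:
            n_b += 1  # signs differ
    return (n_a - n_b) / (n_a + n_b)


# X and Y columns of the first worked example
x = [18.4, 20.6, 21.5, 35.7, 37.1, 39.8, 51.1, 54.4, 64.6, 90.6]
y = [5.57, 2.88, 4.12, 7.24, 9.67, 10.48, 8.58, 14.79, 10.22, 10.45]
print(fechner(x, y))  # 0.6
```

For these data the signs coincide in 8 observations and differ in 2, giving (8 - 2) / 10 = 0.6.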

Applied aspects of rank correlation. As already noted, the rank correlation coefficients can be used not only for a qualitative analysis of the relationship between two rank features, but also in determining the strength of the relationship between a rank feature and a quantitative one. In this case, the values of the quantitative characteristic are sorted and the corresponding ranks are assigned to them.

There are a number of situations in which calculating the rank correlation coefficients is also advisable when determining the strength of the relationship between two quantitative features. For example, when the distribution of one of them (or both) deviates significantly from the normal distribution, determining the significance level of the sample correlation coefficient r becomes incorrect, while the rank coefficients ρ and τ are not subject to such restrictions when determining the level of significance.

Another situation of this kind arises when the relationship between two quantitative features is nonlinear (but monotonic). If the number of objects in the sample is small, or if the sign of the relationship is important for the researcher, then the use of the correlation ratio η may be inadequate here. Calculating the rank correlation coefficient makes it possible to get around these difficulties.

Practical part

Task 1. Correlation-regression analysis

Statement and formalization of the problem:

An empirical sample is given, compiled from a series of observations of equipment state (failure) and the number of manufactured products. The sample implicitly characterizes the relationship between the share of failed equipment and the number of manufactured items. From the meaning of the sample it is clear that products are manufactured on the equipment that remains in service, so the higher the percentage of failed equipment, the fewer products are manufactured. It is required to examine the sample for a correlation-regression dependence, that is, to establish the form of the dependence and estimate the regression function (regression analysis), and to identify the relationship between the random variables and assess its tightness (correlation analysis). An additional task of correlation analysis is to estimate the regression equation of one variable on the other. In addition, it is necessary to predict the number of products manufactured at 30% equipment failure.

Let's formalize the given sample in the table, designating the data "Equipment failure,%" as X, data "Number of products" as Y:

Initial data. Table 1

From the physical meaning of the problem it can be seen that the number of manufactured products Y depends directly on the percentage of equipment failure, that is, Y depends on X. Regression analysis requires finding a mathematical relationship (regression) connecting the values of X and Y. Unlike correlation analysis, regression analysis assumes that X acts as the independent variable, or factor, and Y as the variable dependent on it, or the effective indicator. Thus, it is required to build an adequate economic and mathematical model, i.e. to determine (find, select) the function Y = f(X) that characterizes the relationship between X and Y and can be used to predict the value of Y at X = 30. This problem can be solved using correlation-regression analysis.

A brief overview of methods for solving correlation-regression problems and the rationale for the chosen solution method.

Regression analysis methods are subdivided into one-factor and multi-factor methods according to the number of factors affecting the effective indicator. One-factor: the number of independent factors is 1, i.e. Y = F(X);

multi-factor: the number of factors is greater than 1, i.e. $Y = F(X_1, X_2, \ldots, X_n)$.

According to the number of dependent variables (effective indicators) under study, regression problems can also be divided into tasks with one or with many effective indicators. In general, a task with many effective indicators can be written as:

$$(Y_1, Y_2, \ldots, Y_m) = F(X_1, X_2, \ldots, X_n)$$

The method of correlation-regression analysis consists in finding the parameters of an approximating dependence of the form

$$Y = F(X_1, X_2, \ldots, X_n)$$

Since only one independent variable appears in the given problem, that is, the dependence on only one factor influencing the result is investigated, a study of one-factor dependence, or pair regression, should be applied.

If there is only one factor, the dependence is defined as:

$$y = f(x)$$

The form of a specific regression equation depends on the choice of the function that describes the statistical relationship between the factor and the effective indicator. The options include the following (the coefficients are denoted here by $a_0, a_1, a_2, a_3$):

linear regression, equation of the form $y = a_0 + a_1 x$,

parabolic, equation of the form $y = a_0 + a_1 x + a_2 x^2$,

cubic, equation of the form $y = a_0 + a_1 x + a_2 x^2 + a_3 x^3$,

hyperbolic, equation of the form $y = a_0 + a_1 / x$,

semilogarithmic, equation of the form $y = a_0 + a_1 \ln x$,

exponential, equation of the form $y = a_0 a_1^x$,

power-law, equation of the form $y = a_0 x^{a_1}$.

Finding the function is reduced to determining the parameters of the regression equation and assessing the reliability of the equation itself. To determine the parameters, you can use both the least squares method and the least modulus method.

The first consists in requiring that the sum of the squares of the deviations of the empirical values $Y_i$ from the calculated values $\hat{Y}_i$ be minimal: $\sum (Y_i - \hat{Y}_i)^2 \to \min$.

The method of least modulus consists in minimizing the sum of the absolute values of the differences between the empirical values $Y_i$ and the calculated values $\hat{Y}_i$: $\sum |Y_i - \hat{Y}_i| \to \min$.

To solve the problem, we will choose the least squares method, as it is the simplest and gives good estimates in terms of statistical properties.

The technology for solving the problem of regression analysis using the least squares method.

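The type of dependence (linear, quadratic, cubic, etc.) between the variables can be determined by evaluating the deviation of the actual values of y from the calculated ones:

$$S = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$

where $y_i$ are the empirical values and $\hat{y}_i$ are the values calculated by the approximating function. Estimating the values of S for various functions and choosing the smallest of them, we select an approximating function.

The specific function is determined by finding its coefficients, which for each type of function are the solution of a certain system of equations:

linear regression, equation of the form $y = a_0 + a_1 x$, system:

$$\begin{cases} n a_0 + a_1 \sum x = \sum y \\ a_0 \sum x + a_1 \sum x^2 = \sum xy \end{cases}$$

parabolic, equation of the form $y = a_0 + a_1 x + a_2 x^2$, system:

$$\begin{cases} n a_0 + a_1 \sum x + a_2 \sum x^2 = \sum y \\ a_0 \sum x + a_1 \sum x^2 + a_2 \sum x^3 = \sum xy \\ a_0 \sum x^2 + a_1 \sum x^3 + a_2 \sum x^4 = \sum x^2 y \end{cases}$$

cubic, equation of the form $y = a_0 + a_1 x + a_2 x^2 + a_3 x^3$, system:

$$\begin{cases} n a_0 + a_1 \sum x + a_2 \sum x^2 + a_3 \sum x^3 = \sum y \\ a_0 \sum x + a_1 \sum x^2 + a_2 \sum x^3 + a_3 \sum x^4 = \sum xy \\ a_0 \sum x^2 + a_1 \sum x^3 + a_2 \sum x^4 + a_3 \sum x^5 = \sum x^2 y \\ a_0 \sum x^3 + a_1 \sum x^4 + a_2 \sum x^5 + a_3 \sum x^6 = \sum x^3 y \end{cases}$$

Having solved the system, we find the coefficients, which give a specific expression for the analytical function; with it we find the calculated values $\hat{y}_i$. After that, all the data needed to estimate the deviation S and to look for its minimum are available.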
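For the linear case the procedure can be sketched in a few lines of Python. The sketch assumes the model y = a0 + a1·x and solves the two normal equations of the least squares method in closed form; the coefficient names a0 and a1, the function names and the sample data are illustrative (the data of Table 1 are not available here).

```python
def fit_line(x, y):
    """Least-squares coefficients (a0, a1) of y = a0 + a1 * x from the normal equations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    a1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a0 = (sy - a1 * sx) / n
    return a0, a1


def squared_deviation(x, y, a0, a1):
    """S: sum of squared deviations of the empirical values from the fitted ones."""
    return sum((b - (a0 + a1 * a)) ** 2 for a, b in zip(x, y))


# Made-up data lying exactly on y = 2 + 3x
x = [0, 1, 2, 3]
y = [2, 5, 8, 11]
a0, a1 = fit_line(x, y)
print(a0, a1, squared_deviation(x, y, a0, a1))  # 2.0 3.0 0.0
```

The same value S would be computed for each candidate form (parabolic, cubic and so on), and the form with the smallest S would be kept.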

For a linear relationship, we estimate the closeness of the relationship between the factor X and the effective indicator Y by the correlation coefficient r:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{n\,\sigma_x \sigma_y},$$

where $\bar{y}$ is the average value of the indicator;

$\bar{x}$ is the average value of the factor;

y is the experimental value of the indicator;

x is the experimental value of the factor;

$\sigma_x$ is the standard deviation in x;

$\sigma_y$ is the standard deviation in y.

If the correlation coefficient r = 0, the relationship between the features is considered insignificant or absent; if |r| = 1, there is a functional relationship between the features.
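A minimal Python sketch of the pair correlation coefficient follows, assuming the form r = Σ(x - x̄)(y - ȳ) / (n σ_x σ_y) with population standard deviations (division by n). The function name and data are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pair correlation coefficient using population standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sigma_x = sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sigma_y = sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / (n * sigma_x * sigma_y)


# Made-up data: an exact decreasing linear dependence gives r close to -1
print(pearson_r([10, 20, 30, 40], [90, 70, 50, 30]))
```

An exact increasing linear dependence gives a value close to +1 in the same way; values near 0 indicate the absence of a linear relationship.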

Using the Chaddock table, you can qualitatively assess the tightness of the correlation between the signs:

Chaddock table Table 2.

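For a nonlinear dependence, the correlation ratio η (0 ≤ η ≤ 1) and the correlation index R are determined, which are calculated from the following dependences:

$$\eta = \sqrt{\frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}}, \qquad R = \sqrt{1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}},$$

where $\hat{y}_i$ is the value of the indicator calculated from the regression dependence.

As an estimate of the calculation accuracy, we use the value of the average relative approximation error:

$$\bar{\varepsilon} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \cdot 100\%$$

When accuracy is high, it lies in the range of 0-12%.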
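The accuracy estimate can be sketched in Python, assuming the usual definition of the average relative approximation error as the mean of |y_i - ŷ_i| / |y_i| expressed in percent. The function name and the numbers are made up for illustration.

```python
def mean_relative_error(y, y_hat):
    """Average relative approximation error in percent (assumed standard definition)."""
    return 100 * sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)


# Hypothetical empirical and fitted values
print(mean_relative_error([100, 200], [90, 210]))
```

For these values the relative errors are 10% and 5%, so the average is 7.5%, which would fall inside the 0-12% range quoted above.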

To assess the quality of the selected functional dependence, we use the coefficient of determination:

$$R^2 = \frac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$$

The coefficient of determination is used as a "generalized" measure of the quality of the selected functional model, since it expresses the ratio between the factorial and total variance, or more precisely the share of the factorial variance in the total.

To assess the significance of the correlation index R, Fisher's F-test is used. The actual value of the criterion is determined by the formula:

$$F_{fact} = \frac{R^2}{1 - R^2}\cdot\frac{n - m}{m - 1},$$

where m is the number of parameters of the regression equation and n is the number of observations. The value $F_{fact}$ is compared with the critical value $F_{crit}$, which is determined from the F-criterion table, taking into account the accepted significance level α and the numbers of degrees of freedom $k_1 = m - 1$ and $k_2 = n - m$. If $F_{fact} > F_{crit}$, then the value of the correlation index R is considered significant.
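A small Python sketch of the F statistic is given below, assuming the common form F = R² / (1 - R²) · (n - m) / (m - 1); the function name is illustrative. The critical value is still taken from an F table for k1 = m - 1 and k2 = n - m degrees of freedom.

```python
def fisher_f(r_squared, n, m):
    """Actual value of Fisher's F for a regression with m parameters and n observations."""
    return r_squared / (1 - r_squared) * (n - m) / (m - 1)


# R = 0.90 (so R^2 = 0.81), n = 5 observations, m = 2 parameters
print(round(fisher_f(0.81, 5, 2), 2))
```

With these inputs the statistic is about 12.79; it would then be compared with the tabular critical value for the chosen significance level.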

For the selected form of regression, the coefficients of the regression equation are calculated. For convenience, the calculation results are included in the table of the following structure (in general, the number of columns and their appearance change depending on the type of regression):

Table 3

The solution of the problem.

Observations were made of an economic phenomenon: the dependence of product output on the percentage of equipment failure. A set of values was obtained.

The selected values ​​are described in table 1.

We build a graph of empirical dependence for the given sample (Fig. 1)

By the type of the graph, we determine that the analytical dependence can be represented as a linear function:

Let's calculate the pairwise correlation coefficient to assess the relationship between X and Y:

Let's build an auxiliary table:

Table 4

We solve the system of equations to find the coefficients and:

from the first equation, substituting the value

into the second equation, we get:

We find

We get the form of the regression equation:

To assess the tightness of the found relationship, we use the correlation coefficient r:

According to the Chaddock table, we establish that for r = 0.90 the relationship between X and Y is very high; therefore, the reliability of the regression equation is also high. To estimate the accuracy of the calculations, we use the value of the average relative error of approximation:

We consider that this value provides a high degree of reliability of the regression equation.

For a linear relationship between X and Y, the determination index is equal to the square of the correlation coefficient r: $R^2 = r^2 = 0.90^2 = 0.81$. Consequently, 81% of the total variation is explained by a change in the factor characteristic X.

To assess the significance of the correlation index R, which in the case of a linear relationship is equal in absolute value to the correlation coefficient r, Fisher's F-test is used. We determine the actual value using the formula:

where m is the number of parameters of the regression equation and n is the number of observations. That is, n = 5, m = 2.

Taking into account the accepted significance level α = 0.05 and the numbers of degrees of freedom $k_1 = m - 1 = 1$ and $k_2 = n - m = 3$, we obtain the critical tabular value $F_{crit}$. Since $F_{fact} > F_{crit}$, the value of the correlation index R is recognized as significant.

Let's calculate the predicted value Y at X = 30:

Let's build a graph of the found function:

We determine the error of the correlation coefficient from the value of the standard deviation

and then determine the value of the normalized deviation

Since this ratio is greater than 2, we can speak with a probability of 95% of the significance of the obtained correlation coefficient.

Problem 2. Linear optimization

Option 1.

The development plan of the region is supposed to bring into operation 3 oil fields with a total production volume of 9 million tons. At the first field, the volume of production is at least 1 million tons, at the second - 3 million tons, at the third - 5 million tons. To achieve this productivity, it is necessary to drill at least 125 wells. For the implementation of this plan, 25 million rubles have been allocated. capital investments (indicator K) and 80 km of pipes (indicator L).

It is required to determine the optimal (maximum) number of wells to ensure the planned productivity of each field. The initial data on the task are given in the table.

Initial data

The problem statement is given above.

Let us formalize the conditions and constraints specified in the problem. The goal of solving this optimization problem is to find the maximum value of oil production with the optimal number of wells for each field, taking into account the existing constraints on the problem.

The objective function, in accordance with the requirements of the task, will take the form:

where is the number of wells for each field.

Existing restrictions on the task for:

pipe laying length:

number of wells in each field:

construction cost of 1 well:

Linear optimization problems are solved, for example, by the following methods:

Graphically

Simplex method

Using the graphical method is convenient only when solving linear optimization problems with two variables. With a larger number of variables, the use of an algebraic apparatus is necessary. Consider a general method for solving linear optimization problems called the simplex method.

The simplex method is a typical example of the iterative computations used to solve most optimization problems. Iterative procedures of this kind underlie the solution of problems with operations research models.

To solve the optimization problem using the simplex method, it is necessary that the number of unknowns $x_j$ be greater than the number of equations, i.e. that the system of equations

$$\sum_{j=1}^{n} a_{ij} x_j = b_i, \qquad i = 1, \ldots, m \qquad (1)$$

satisfy the relation m < n, and that the rank of the matrix

A = $(a_{ij})$ be equal to m.

Let us denote the j-th column of the matrix A as $A_j$, and the column of free terms as $B$.

A basic solution to system (1) is a solution in which n - m of the unknowns are set to zero and the remaining m (basic) unknowns are determined from system (1).

Briefly, the algorithm of the simplex method is described as follows:

An original constraint written as an inequality of the type ≤ (≥) can be turned into an equality by adding a residual (slack) variable to the left side of the constraint (or subtracting a surplus variable from the left side).

For example, to the left of the original constraint

a residual variable is introduced, as a result of which the original inequality turns into the equality

If the original limitation determines the pipe flow rate, then the variable should be interpreted as the remainder, or the unused part of this resource.

Maximizing the objective function is equivalent to minimizing the same function, taken with the opposite sign. That is, in our case

equivalent to

A simplex table is compiled for the basic solution of the following form:

In this table, it is indicated that after solving the problem in these cells there will be a basic solution. - quotients from dividing a column by one of the columns; - additional multipliers for zeroing the values ​​in the cells of the table related to the resolving column. - min value of the objective function -Z, - the values ​​of the coefficients in the objective function with unknowns.

Any positive value is found among the meanings. If this is not the case, then the problem is considered solved. Any column of the table that is in it is selected, this column is called the "permissive" column. If there are no positive numbers among the elements of the resolving column, then the problem is unsolvable due to the unboundedness of the objective function on the set of its solutions. If positive numbers are present in the resolving column, go to step 5.

The column is filled with fractions, in the numerator of which are the elements of the column, and in the denominator - the corresponding elements of the resolving column. The smallest of all values ​​is selected. The line with the smallest result is called the "enable" line. At the intersection of the resolving line and the resolving column, a resolving element is found, which is highlighted in some way, for example, with color.

Based on the first simplex table, the following is compiled, in which:

Replaces row vector with column vector

the permissive line is replaced by the same line divided by the permissive element

each of the other rows of the table is replaced by the sum of this row with the resolving one, multiplied by a specially selected additional factor in order to obtain 0 in the cell of the resolving column.

With the new table, we turn to point 4.
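The iteration described above can be sketched in Python. This is a minimal illustration rather than the method as implemented in any package: it assumes all constraints have the form "≤" with non-negative right-hand sides, the function name `simplex_max` is hypothetical, and the small example at the end uses invented numbers, not the oil field data of this task. No protection against cycling is included.

```python
def simplex_max(c, A, b):
    """Maximize c.x subject to A x <= b, x >= 0 (all b[i] >= 0 assumed)."""
    m, n = len(A), len(c)
    # Constraint rows: original coefficients, slack columns, free term.
    T = [[float(v) for v in A[i]]
         + [1.0 if k == i else 0.0 for k in range(m)]
         + [float(b[i])] for i in range(m)]
    # Last row is the -Z row: objective coefficients, free term holds -Z.
    T.append([float(v) for v in c] + [0.0] * m + [0.0])
    basis = [n + i for i in range(m)]  # slack variables form the first basis
    eps = 1e-9
    while True:
        # Step 4: look for a positive coefficient in the -Z row.
        col = next((j for j in range(n + m) if T[-1][j] > eps), None)
        if col is None:
            break  # no positive values: the table is final
        # Step 5: quotients of free terms by positive resolving-column entries.
        ratios = [(T[i][-1] / T[i][col], i) for i in range(m) if T[i][col] > eps]
        if not ratios:
            raise ValueError("objective function is unbounded")
        _, row = min(ratios)
        # Step 6: divide the resolving row, then zero the rest of the column.
        pivot = T[row][col]
        T[row] = [v / pivot for v in T[row]]
        for i in range(m + 1):
            if i != row:
                factor = -T[i][col]
                T[i] = [v + factor * p for v, p in zip(T[i], T[row])]
        basis[row] = col
    x = [0.0] * n
    for i, j in enumerate(basis):
        if j < n:
            x[j] = T[i][-1]
    return x, -T[-1][-1]  # the free term of the -Z row is -Z


# Invented example: maximize 3*x1 + 2*x2
# subject to x1 + x2 <= 4, x1 + 3*x2 <= 6, x1 <= 3.
solution, z_max = simplex_max([3, 2], [[1, 1], [1, 3], [1, 0]], [4, 6, 3])
print(solution, z_max)
```

The entering column is simply the first one with a positive coefficient, which matches the "any column" wording of step 4; practical solvers use more careful selection rules.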

The solution of the problem.

Based on the formulation of the problem, we have the following system of inequalities:

and the objective function

We transform the system of inequalities into a system of equations by introducing additional variables:

Let us reduce the objective function to its equivalent:

Let's build the original simplex table:

Let us choose a resolving column and calculate the column of quotients:

We enter the values into the table. The smallest of them, 10, determines the resolving row. At the intersection of the resolving row and the resolving column we find the resolving element, equal to 1. We fill in the part of the table with the additional multipliers, chosen so that the resolving row multiplied by them and added to the other rows of the table gives 0 in the elements of the resolving column.

We compose the second simplex table:

In it we take a resolving column, calculate the quotients and enter them into the table. The minimum quotient gives the resolving row. The resolving element is 1. We find the additional multipliers and fill in the columns.

We create the following simplex table:

Similarly, we find the resolving column, resolving row and resolving element = 2. We build the following simplex table:

Since there are no positive values in the -Z row, this table is final. The first column gives the desired values of the unknowns, i.e. the optimal basic solution:

In this case, the value of the objective function is -Z = -8000, which is equivalent to Zmax = 8000. The problem is solved.

Task 3. Cluster analysis

Formulation of the problem:

Partition the objects into clusters based on the data given in the table. Choose the solution method independently and plot the data.

Option 1.

Initial data

Review of methods for solving this type of problems. Justification of the solution method.

Cluster analysis tasks are solved using the following methods:

Joining (tree clustering) is used to form clusters on the basis of "dissimilarity", or "distance between objects". These distances can be defined in one-dimensional or multidimensional space.

Two-way joining is used (relatively rarely) in circumstances where the data are interpreted not in terms of "objects" and "properties of objects", but in terms of observations and variables. Observations and variables are both expected to contribute to the detection of meaningful clusters at the same time.

K-means method. Used when there is already a hypothesis regarding the number of clusters. You can tell the system to form exactly, for example, three clusters so that they are as different as possible. In general, the K-means method builds exactly K different clusters located at the greatest possible distances from each other.

There are the following ways to measure distances:

Euclidean distance. This is the most common type of distance. It is simply the geometric distance in multidimensional space and is calculated as follows:

$$d(x, y) = \sqrt{\sum_{l=1}^{k} (x_l - y_l)^2},$$

where $x_l$ and $y_l$ are the values of the l-th feature of objects x and y, and k is the number of features.

Note that the Euclidean distance (and its square) is calculated from the original, not the standardized, data.

City-block distance (Manhattan distance). This distance is simply the sum of the absolute coordinate differences. In most cases this measure leads to results similar to those for the ordinary Euclidean distance. Note, however, that for this measure the influence of individual large differences (outliers) is reduced, since they are not squared. The Manhattan distance is calculated by the formula:

$$d(x, y) = \sum_{l=1}^{k} |x_l - y_l|.$$

Chebyshev distance. This distance can be useful when you want to define two objects as "different" if they differ in any one coordinate (any one dimension). The Chebyshev distance is calculated by the formula:

$$d(x, y) = \max_{l} |x_l - y_l|.$$

Power distance. Sometimes one wants to progressively increase or decrease the weight of a dimension in which the corresponding objects are very different. This can be achieved using a power distance, which is calculated by the formula:

$$d(x, y) = \left( \sum_{l=1}^{k} |x_l - y_l|^p \right)^{1/r},$$

where r and p are user-defined parameters. A few calculation examples can show how this measure "works". The parameter p is responsible for the gradual weighting of the differences in individual coordinates, and the parameter r is responsible for the progressive weighting of large distances between objects. If both parameters, r and p, are equal to two, then this distance coincides with the Euclidean distance.

Percent disagreement. This measure is used when the data are categorical. This distance is calculated by the formula:

$$d(x, y) = \frac{\text{number of } l \text{ such that } x_l \ne y_l}{k}.$$
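The distance measures listed above can be written as short Python functions. This is a sketch under the standard definitions of these measures; the function names are illustrative, and objects are assumed to be given as equal-length sequences of feature values.

```python
import math


def euclidean(x, y):
    # Geometric distance in multidimensional space.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    # Largest difference in any single coordinate.
    return max(abs(a - b) for a, b in zip(x, y))


def power_distance(x, y, p, r):
    # p weights individual coordinate differences, r weights large distances.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / r)


def percent_disagreement(x, y):
    # Share of coordinates in which two categorical objects differ.
    return sum(a != b for a, b in zip(x, y)) / len(x)


x, y = (1, 2, 3), (4, 6, 3)
print(euclidean(x, y), manhattan(x, y), chebyshev(x, y))
print(power_distance(x, y, p=2, r=2))  # coincides with the Euclidean distance
print(percent_disagreement(("a", "b", "c"), ("a", "x", "c")))
```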

To solve the problem, we choose the joining method (tree clustering) as the one that best fits the conditions and formulation of the problem (to partition the objects). In turn, the joining method can use several linkage rules:

Single linkage (nearest neighbor method). In this method, the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in different clusters. In other words, two clusters are joined as soon as some pair of objects, one from each cluster, is closer than the current linkage distance. This rule, in a sense, strings objects together to form clusters, and the resulting clusters tend to be long "chains".

Complete linkage (farthest neighbor method). In this method, the distance between clusters is determined by the largest distance between any two objects in different clusters (i.e., the "farthest neighbors").

There are also many other linkage rules of this kind (e.g., unweighted pair-group average, weighted pair-group average, etc.).

Solution method technology. Calculation of indicators.

In the first step, when each object is a separate cluster, the distances between these objects are determined by the selected measure.

Since the task does not specify the units of measurement of the features, they are assumed to be the same. Therefore there is no need to normalize the initial data, and we proceed directly to calculating the distance matrix.

The solution of the problem.

Let's build a graph of dependence according to the initial data (Fig. 2)

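We will take the usual Euclidean distance as the distance between objects. Then according to the formula:

$$d(x, y) = \sqrt{\sum_{l=1}^{k} (x_l - y_l)^2},$$

where l indexes the features, k is the number of features, and $x_l$, $y_l$ are the values of the l-th feature of objects x and y, the distance between objects 1 and 2 is equal to: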

We continue to calculate the remaining distances:

Let's build a table from the obtained values:

We find the smallest distance in the table. This means that we combine elements 3, 6 and 5 into one cluster. We get the following table:

We again find the smallest distance. Elements 3, 6, 5 and 4 are combined into one cluster. We get a table of two clusters:

The minimum distance is the one between elements 3 and 6. This means that elements 3 and 6 are combined into one cluster. As the distance between the newly formed cluster and each of the remaining elements we take the maximum distance. For example, the distance between element 1 and cluster (3, 6) is max(13.34166, 13.60147) = 13.60147. Let us compose the following table:

In it, the minimum distance is the distance between clusters 1 and 2. Combining 1 and 2 into one cluster, we get:

Thus, using the "far neighbor" method, two clusters were obtained: (1, 2) and (3, 4, 5, 6), the distance between which is 13.60147.
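The merging procedure can be sketched in Python as follows. This is a minimal illustration, not the Excel worksheet used in this work: the function name `agglomerate` is hypothetical, the points are invented (the table of initial data is not reproduced here), and objects are numbered from 0. Passing `linkage=max` gives the far neighbor rule and `linkage=min` the nearest neighbor rule.

```python
import math


def agglomerate(points, n_clusters, linkage=max):
    """Join clusters step by step until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # each object starts alone

    def cluster_distance(a, b):
        # max -> farthest neighbors, min -> nearest neighbors
        return linkage(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        _, p, q = min(
            (cluster_distance(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return clusters


# Invented data: two well-separated groups of objects.
data = [(0, 0), (0, 1), (10, 10), (10, 11), (11, 10), (12, 12)]
print(agglomerate(data, 2, linkage=max))
print(agglomerate(data, 2, linkage=min))
```

Recomputing all pairwise distances at every step keeps the sketch short; for larger data sets a stored and updated distance matrix, as in the tables above, is the usual choice.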

The problem has been solved.

Applications. Solving problems using software packages (MS Excel 7.0)

The problem of correlation and regression analysis.

We enter the initial data into the table (Fig. 1)

Select the menu "Tools / Data Analysis". In the window that appears, select the line "Regression" (Fig. 2).

In the next window, let us set the input ranges for X and Y, set the confidence level to 95%, and place the output on a separate sheet named "Report Sheet" (Fig. 3).

After carrying out the calculation, we obtain the final data of the regression analysis on the "Report Sheet" sheet:

The output also includes a scatter plot of the fitted function, the "Line Fit Plot":

The calculated values and deviations are shown in the table in the "Predicted Y" and "Residuals" columns, respectively.

Based on the initial data and deviations, a residual graph is plotted:
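The predicted values and deviations can also be reproduced outside Excel. The sketch below assumes a simple linear regression of Y on a single X fitted by least squares; the function name `fit_line` and the data are illustrative and are not the initial data of this task.

```python
def fit_line(x, y):
    """Least-squares line y = intercept + slope * x with fitted values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * a for a in x]
    # Deviation of each observation from the fitted line.
    residuals = [b - p for b, p in zip(y, predicted)]
    return slope, intercept, predicted, residuals


# Invented data for illustration only.
slope, intercept, predicted, residuals = fit_line([1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8])
print(slope, intercept)
print(residuals)
```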

Optimization task


We enter the initial data as follows:

The unknowns X1, X2, X3 are entered into cells C9, D9, E9, respectively.

The objective function coefficients for X1, X2, X3 are entered into C7, D7, E7, respectively.

Enter the objective function into cell B11 as the formula: = C7 * C9 + D7 * D9 + E7 * E9.

Existing task restrictions

For the length of pipe laying:

we add to cells C5, D5, E5, F5, G5

The number of wells in each field:

X3 ≤ 100; we add to cells C8, D8, E8.

Cost of construction of 1 well:

we add to cells C6, D6, E6, F6, G6.

The formula for calculating the total length C5 * C9 + D5 * D9 + E5 * E9 is placed in cell B5, the formula for calculating the total cost C6 * C9 + D6 * D9 + E6 * E9 is placed in cell B6.


We select "Tools / Solver" in the menu and enter the solver parameters in accordance with the initial data (Fig. 4):

Using the "Options" button, set the following solver options (Fig. 5):


After searching for a solution, we get a report on the results:

Microsoft Excel 8.0e Answer Report

Report Created: 11/17/2002 1:28:30 AM

Target Cell (Maximum): Total production

Adjustable Cells: Number of wells (three cells, one per field)

Constraints (name: status):

- Length: binding

- Project cost: not binding

- Number of wells: not binding

- Number of wells: binding

- Number of wells: binding

The first table shows the initial and final (optimal) values of the target cell, which holds the objective function of the problem being solved. The second table shows the initial and final values of the variables being optimized, which are contained in the adjustable cells. The third table of the answer report contains information about the constraints. The "Value" column contains the optimal values of the required resources and of the variables being optimized. The "Formula" column contains the limits on consumed resources and on the variables being optimized, written as references to the cells containing these data. The "Status" column shows whether each constraint is binding or not binding. Here "binding" constraints are those satisfied in the optimal solution as strict equalities. For resource constraints, the "Difference" column gives the remainder of the resource, i.e. the difference between the available amount of the resource and the amount required.

Similarly, by saving the result of the solver in the "Sensitivity Report" form, we obtain the following tables:

Microsoft Excel 8.0e Sensitivity Report

Worksheet: [Solution of the optimization problem.xls] Solution of the optimization problem

Report Created: 11/17/2002 1:35:16 AM

Adjustable Cells (columns: Final Value, Reduced Cost, Objective Coefficient, Allowable Increase, Allowable Decrease):

- Number of wells

- Number of wells

- Number of wells

Constraints (columns: Final Value, Shadow Price, Constraint R.H. Side, Allowable Increase, Allowable Decrease):

- Length

- Project cost

The sensitivity report contains information about the adjustable (optimized) variables and the constraints of the model. This information is associated with the simplex method used to optimize linear problems, which was described above in the solution of the problem. It allows you to estimate how sensitive the optimal solution found is to possible changes in the model parameters.

The first part of the report contains information about the adjustable cells, which hold the numbers of wells in the fields. The "Final Value" column gives the optimal values of the variables being optimized. The "Objective Coefficient" column contains the original values of the coefficients of the objective function. The next two columns show the allowable increase and decrease of these coefficients that do not change the optimal solution found.

The second part of the sensitivity report contains information on the constraints imposed on the variables being optimized. The first column shows the resource requirements for the optimal solution. The second contains the shadow prices of the resources used. The last two columns contain data on the possible increase or decrease in the amount of available resources.

Clustering problem.

The step-by-step method for solving the problem is given above. Here are Excel tables illustrating the progress of solving the problem:

Nearest neighbor method

Solution of the cluster analysis problem: "NEAREST NEIGHBOR METHOD"

Initial data

where x1 is the volume of output;

x2 is the average annual cost of fixed industrial production assets.

Far neighbor method

Solution of the cluster analysis problem: "FAR NEIGHBOR METHOD"

Initial data

where x1 is the volume of output;

x2 is the average annual cost of fixed industrial production assets.

Presentation and preprocessing of expert assessments

In practice, several types of assessments are used:

- qualitative (often/rarely, worse/better, yes/no),

- scale estimates (ranges of values 50-75, 76-90, 91-120, etc.),

- point scores from a given interval (from 2 to 5, from 1 to 10), mutually independent,

- ranked (objects are arranged by an expert in a certain order, and each is assigned a serial number, its rank),

- comparative, obtained by one of the comparison methods:

  - sequential comparison method,

  - method of pairwise comparison of factors.

At the next step of processing expert opinions, it is necessary to evaluate the degree of consistency of these opinions.

The estimates obtained from experts can be considered as a random variable, the distribution of which reflects the opinions of experts about the probability of a particular choice of an event (factor). Therefore, to analyze the scatter and consistency of expert estimates, generalized statistical characteristics are used - averages and scatter measures:

- mean square (standard) deviation,

- range of variation (from min to max),

- coefficient of variation V = standard deviation / arithmetic mean (suitable for any type of assessment):

$$V_i = \sigma_i / \bar{x}_i$$
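These scatter measures are easy to compute for the scores that a group of experts gave to one factor. The sketch below is illustrative: the function name and the scores are invented, and the population form of the standard deviation (`pstdev`) is an assumption, since the text does not say whether the population or the sample form is meant.

```python
from statistics import mean, pstdev


def consistency_stats(scores):
    """Mean, standard deviation, range and coefficient of variation."""
    avg = mean(scores)
    sd = pstdev(scores)  # population standard deviation (assumption)
    return {
        "mean": avg,
        "std": sd,
        "range": max(scores) - min(scores),
        "cv": sd / avg,  # V = sigma / mean
    }


# Invented scores of eight experts for one factor.
print(consistency_stats([2, 4, 4, 4, 5, 5, 7, 9]))
```

A smaller coefficient of variation indicates less scatter, i.e. closer agreement of the experts on that factor.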

To assess the similarity of the opinions of each pair of experts, a variety of measures can be used:

- association coefficients, which take into account the number of matching and non-matching answers,

- coefficients of inconsistency of expert opinions.

All these measures can be used either to compare the opinions of two experts, or to analyze the relationship between the series of assessments on two grounds.

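Spearman's pairwise rank correlation coefficient:

$$\rho_{ij} = 1 - \frac{6 \sum_{k=1}^{T} c_k^2}{T (T^2 - 1)},$$

where i and j are the numbers of the two experts being compared (out of n experts), T is the number of factors, and

$c_k$ is the difference between the ranks assigned to the k-th factor by the i-th and j-th experts.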
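For two experts who ranked the same factors without ties, the standard Spearman formula can be evaluated as in the following sketch (the function name and the rankings are illustrative):

```python
def spearman_rho(ranks_i, ranks_j):
    """Spearman rank correlation of two untied rankings of the same factors."""
    t = len(ranks_i)  # number of ranked factors
    # Sum of squared rank differences over all factors.
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_i, ranks_j))
    return 1 - 6 * d2 / (t * (t ** 2 - 1))


print(spearman_rho([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # identical rankings
print(spearman_rho([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # opposite rankings
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 3, 5, 4]))
```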

Kendall's coefficient of concordance W gives an overall assessment of the consistency of the opinions of all experts on all factors, but only for cases where rank estimates were used.

It has been proved that when all experts give identical estimates of all factors, the value of S reaches its maximum, equal to

$$S_{max} = \frac{m^2 (n^3 - n)}{12},$$

where n is the number of factors,

m is the number of experts.

The coefficient of concordance is equal to the ratio

$$W = \frac{S}{S_{max}} = \frac{12 S}{m^2 (n^3 - n)},$$

and if W is close to 1, then all experts have given sufficiently consistent estimates; otherwise their opinions are not in agreement.

The formula for calculating S is shown below:

$$S = \sum_{i=1}^{n} \left( \sum_{j=1}^{m} r_{ij} - r_{cf} \right)^2,$$

where $r_{ij}$ is the rank estimate of the i-th factor by the j-th expert,

$r_{cf}$ is the average sum of ranks over the entire matrix of estimates, equal to

$$r_{cf} = \frac{m (n + 1)}{2}.$$

And therefore the formula for calculating S can take the form:

$$S = \sum_{i=1}^{n} \left( \sum_{j=1}^{m} r_{ij} - \frac{m (n + 1)}{2} \right)^2.$$

If some assessments of one expert coincide and were converted to tied (average) ranks during processing, then a different formula is used to calculate the concordance coefficient:

$$W = \frac{12 S}{m^2 (n^3 - n) - m \sum_{j=1}^{m} T_j}, \qquad (**)$$

where $T_j$ is calculated for each expert (if his assessments were repeated for different objects), taking the repetitions into account according to the following rule:

$$T_j = \sum_{k=1}^{t_j} (h_k^3 - h_k),$$

where $t_j$ is the number of groups of equal ranks for the j-th expert, and

$h_k$ is the number of equal ranks in the k-th group of tied ranks of the j-th expert.
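The calculation of the concordance coefficient with the correction for tied ranks can be sketched in Python. The sketch uses the standard form of Kendall's W; the function names and the small rank matrix are illustrative, with one row per expert and one column per factor.

```python
from collections import Counter


def tie_correction(ranks):
    """T_j: sum of h^3 - h over the groups of equal ranks of one expert."""
    return sum(h ** 3 - h for h in Counter(ranks).values() if h > 1)


def kendall_w(rank_matrix):
    """Coefficient of concordance; rank_matrix[j][i] is the rank that
    expert j gave to factor i (tied ranks are already averaged)."""
    m = len(rank_matrix)       # number of experts
    n = len(rank_matrix[0])    # number of factors
    mean_sum = m * (n + 1) / 2
    column_sums = [sum(row[i] for row in rank_matrix) for i in range(n)]
    s = sum((c - mean_sum) ** 2 for c in column_sums)
    t_total = sum(tie_correction(row) for row in rank_matrix)
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * t_total)


# Invented example: three experts, four factors, one tie for the last expert.
ranks = [
    [1, 2, 3, 4],
    [1, 3, 2, 4],
    [1, 2.5, 2.5, 4],
]
print(kendall_w(ranks))
```

When no ranks are tied, the correction term is zero and the function reduces to the uncorrected ratio of S to its maximum value.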

EXAMPLE. Let 5 experts rank six factors as shown in Table 3:

Table 3 - Answers of experts

Experts О1 О2 O3 О4 O5 O6 Sum of ranks by expert
E1
E2
E3
E4
E5

Since the ranking obtained is not strict (the experts' assessments are repeated, and the sums of the ranks are not equal), we transform the estimates and obtain tied ranks (Table 4):

Table 4 - Tied ranks of expert assessments

Experts О1 О2 O3 О4 O5 O6 Sum of ranks by expert
E1 2,5 2,5
E2
E3 1,5 1,5 4,5 4,5
E4 2,5 2,5 4,5 4,5
E5 5,5 5,5
The sum of the ranks of the object 7,5 9,5 23,5 29,5

Now let us determine the degree of consistency of the expert opinions using the coefficient of concordance. Since the ranks are tied, we calculate W by formula (**).

Then $r_{cf} = 7 \cdot 5 / 2 = 17.5$

$S = 10^2 + 8^2 + 4.5^2 + 4.5^2 + 6^2 + 12^2 = 384.5$

Let us proceed to the calculation of W. For this, we first calculate the values of $T_j$. In the example, the assessments are specially selected so that each expert has repeated assessments: the first has two identical ratings, the second has three, the third and the fourth each have two groups of two ratings, and the fifth has two identical ratings. Hence:

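$T_1 = 2^3 - 2 = 6$, $T_5 = 6$,

$T_2 = 3^3 - 3 = 24$,

$T_3 = (2^3 - 2) + (2^3 - 2) = 12$, $T_4 = 12$.

Substituting into formula (**) with m = 5 and n = 6:

$$W = \frac{12 \cdot 384.5}{5^2 (6^3 - 6) - 5 \cdot (6 + 24 + 12 + 12 + 6)} = \frac{4614}{4950} \approx 0.93.$$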

We see that the agreement of the experts' opinions is quite high and we can proceed to the next stage of the study - substantiation and adoption of the alternative of the decision recommended by the experts.

Otherwise, you need to go back to steps 4-8.

KENDALL RANK CORRELATION COEFFICIENT

One of the sample measures of dependence of two random variables (features) X and Y, based on the ranking of the sample elements $(X_1, Y_1), \dots, (X_n, Y_n)$. The Kendall rank correlation coefficient thus belongs to the rank statistics and is defined by the formula

$$\tau = \frac{2S}{n(n-1)},$$

where $r_i$ is the rank of Y belonging to the pair (X, Y) for which the rank of X equals i, $S = 2N - n(n-1)/2$, and N is the number of pairs of sample elements for which simultaneously $j > i$ and $r_j > r_i$. It always holds that $-1 \le \tau \le 1$. As a sample measure of dependence, the Kendall rank correlation coefficient was widely used by M. Kendall (see Lit.).

The Kendall rank correlation coefficient is used to test the hypothesis of independence of random variables. If the independence hypothesis is true, then $E\tau = 0$ and $D\tau = \frac{2(2n+5)}{9n(n-1)}$. For small sample sizes, the statistical hypothesis of independence is tested using special tables (see Lit.). For n > 10, the normal approximation to the distribution of $\tau$ is used: if

$$|\tau| > u_{\alpha/2} \sqrt{\frac{2(2n+5)}{9n(n-1)}},$$

then the hypothesis of independence is rejected, otherwise it is accepted. Here $\alpha$ is the significance level and $u_{\alpha/2}$ is the percentage point of the normal distribution. The Kendall rank correlation coefficient, like any rank statistic, can be used to detect dependence between two qualitative features, provided that the sample elements can be ordered with respect to these features. If X and Y have a joint normal distribution with correlation coefficient $\rho$, then the relationship between the Kendall rank correlation coefficient and $\rho$ has the form:

$$E\tau = \frac{2}{\pi} \arcsin \rho.$$

See also Spearman's rank correlation, Rank test.
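The coefficient and the large-sample test of independence can be sketched in Python as follows. The sketch assumes samples without ties; the function names are illustrative, and the critical value is taken from the standard normal distribution via `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def kendall_tau(x, y):
    """Kendall rank correlation for paired samples without ties."""
    n = len(x)
    s = 0  # concordant pairs minus discordant pairs
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[j] - x[i]) * (y[j] - y[i])
            s += (product > 0) - (product < 0)
    return 2 * s / (n * (n - 1))


def reject_independence(tau, n, alpha=0.05):
    """Normal-approximation test, intended for n > 10."""
    u = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    return abs(tau) > u * sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))


x = list(range(1, 13))
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11]  # invented, mostly concordant
tau = kendall_tau(x, y)
print(tau, reject_independence(tau, len(x)))
```

Counting all pairs directly takes on the order of n squared operations, which is adequate for the small samples typical of expert assessments.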

Lit.: Kendall M., Rank Correlations, trans. from English, Moscow, 1975; Van der Waerden B. L., Mathematical Statistics, trans. from German, Moscow, 1960; Bol'shev L. N., Smirnov N. V., Tables of Mathematical Statistics, Moscow, 1965.

A. V. Prokhorov.


Encyclopedia of Mathematics. - Moscow: Soviet Encyclopedia. I. M. Vinogradov. 1977-1985.
