Kendall's rank correlation coefficient

Brief theory

Kendall's correlation coefficient is used when both variables are measured on ordinal scales, provided that there are no tied ranks. The calculation of Kendall's coefficient involves counting the numbers of matches and inversions.

This coefficient varies within the range −1 ≤ τ ≤ 1 and is calculated by the formula

τ = (P − Q) / [n(n − 1)/2].

For the calculation, all units are ranked by the first attribute; then, following the series of the other attribute, for each rank the number of subsequent ranks exceeding the given one (we denote their count by P) and the number of subsequent ranks below the given one (we denote their count by Q) are counted.

It can be shown that

P + Q = n(n − 1)/2,

and Kendall's rank correlation coefficient can be written as

τ = (P − Q) / (P + Q) = 4P / [n(n − 1)] − 1.

In order to test, at significance level α, the null hypothesis that the population Kendall rank correlation coefficient is equal to zero against the competing hypothesis that it is nonzero, it is necessary to calculate the critical point

T_crit = z_cr · √[2(2n + 5) / (9n(n − 1))],

where n is the sample size and z_cr is the critical point of the two-sided critical region, which is found from the table of the Laplace function by the equality Φ(z_cr) = (1 − α)/2.

If |τ| < T_crit, there is no reason to reject the null hypothesis: the rank correlation between the features is not significant.

If |τ| > T_crit, the null hypothesis is rejected: there is a significant rank correlation between the features.

An example of solving the problem

The task

When recruiting seven candidates for vacant positions, two tests were offered. The test results (in points) are shown in the table:

Candidate    1   2   3   4   5   6   7
Test 1      31  82  25  26  53  30  29
Test 2      21  55   8  27  32  42  26

Calculate Kendall's rank correlation coefficient between the results of the two tests and assess its significance at the given significance level.

The solution of the problem

Calculate Kendall's coefficient

The ranks of the factor attribute are arranged strictly in ascending order, and the corresponding ranks of the resultant attribute are written alongside. For each rank, among the ranks that follow it, the number of larger ranks is counted (entered in column P) and the number of smaller ranks (entered in column Q).

Rank X   Rank Y    P    Q
1        1         6    0
2        4         3    2
3        3         3    1
4        6         1    2
5        2         2    0
6        5         1    0
7        7         0    0
Sum               16    5

Hence P = 16, Q = 5, and

τ = (P − Q) / [n(n − 1)/2] = (16 − 5) / 21 ≈ 0.52.
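For a quick cross-check, the same coefficient can be computed with SciPy; a minimal sketch (the level α = 0.05 in the significance check is an assumption, since the problem statement does not reproduce the level):

```python
import numpy as np
from scipy import stats

# Test scores of the seven candidates from the table above
test1 = np.array([31, 82, 25, 26, 53, 30, 29])
test2 = np.array([21, 55, 8, 27, 32, 42, 26])

# Kendall's tau (no tied ranks here, so the tie correction changes nothing)
tau, p_value = stats.kendalltau(test1, test2)
print(f"tau = {tau:.3f}")          # ~0.524 = 11/21

# Normal-approximation critical point, as in the formula above (alpha assumed)
n = len(test1)
alpha = 0.05
z_cr = stats.norm.ppf(1 - alpha / 2)
t_crit = z_cr * np.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))
print(f"T_crit = {t_crit:.3f}, significant: {abs(tau) > t_crit}")
```

With n = 7 the critical point is about 0.62, so at this assumed level the correlation would not be judged significant.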

One of the factors limiting the application of criteria based on the assumption of normality is the sample size. As long as the sample is large enough (for example, 100 or more observations), you can assume that the sample distribution is normal, even if you are not sure that the distribution of the variable in the population is normal. However, if the sample is small, these criteria should only be used if there is confidence that the variable is indeed normally distributed. However, there is no way to test this assumption in a small sample.

The use of criteria based on the assumption of normality is also limited by the scale of measurement (see the chapter Basic concepts of data analysis). Statistical methods such as the t-test, regression, etc., assume that the original data are continuous. However, there are situations where the data are merely ranked (measured on an ordinal scale) rather than measured exactly.

A typical example is given by the ratings of sites on the Internet: the first position is taken by the site with the maximum number of visitors, the second position by the site with the maximum number of visitors among the remaining sites (that is, with the first site removed), and so on. Knowing the ratings, we can say that the number of visitors to one site is greater than the number of visitors to another, but how much greater is impossible to say. Imagine you have five sites, A, B, C, D, E, occupying the top five places. Suppose that in the current month the arrangement was A, B, C, D, E, and in the previous month D, E, A, B, C. The question is whether there have been significant changes in the site ratings or not. In this situation, obviously, we cannot use the t-test to compare these two groups of data, and we move into the area of specific probabilistic calculations (and any statistical criterion contains a probabilistic calculation!). We reason as follows: how likely is it that the difference between the two site arrangements is due to purely random causes, or is the difference too large to be explained by pure chance? In this reasoning we use only the ranks, or permutations, of the sites and in no way rely on any specific form of the distribution of the number of visitors to them.

For the analysis of small samples and for data measured on poor scales, nonparametric methods are used.

A quick tour of nonparametric procedures

Essentially, for every parametric criterion, there is at least one nonparametric alternative.

In general, these procedures fall into one of the following categories:

  • distinction criteria for independent samples;
  • distinction criteria for dependent samples;
  • assessment of the degree of dependence between the variables.

In general, the approach to statistical criteria in data analysis should be pragmatic and not burdened with unnecessary theoretical reasoning. With a statistics package such as STATISTICA at your disposal, you can easily apply several criteria to your data. Knowing about some of the pitfalls of the methods, you will choose the right solution through experimentation. The development of the subject is quite natural: if you need to compare the values of two variables, you use the t-test. However, it should be remembered that it is based on the assumptions of normality and equality of variances in each group. Freeing ourselves from these assumptions leads to nonparametric tests, which are especially useful for small samples.

The development of the t-test leads to analysis of variance, which is used when the number of compared groups is more than two. The corresponding development of nonparametric procedures leads to a nonparametric analysis of variance, although it is significantly poorer than the classical analysis of variance.

To assess dependence, or, to put it somewhat pompously, the strength of the association, the Pearson correlation coefficient is calculated. Strictly speaking, its application has limitations associated, for example, with the type of scale on which the data are measured and with nonlinearity of the dependence; therefore, as an alternative, nonparametric, or so-called rank, correlation coefficients are also used, applied, for example, to ranked data. If the data are measured on a nominal scale, then it is natural to present them in contingency tables, to which Pearson's chi-square test with various modifications and corrections is applied.

So, in essence, there are only a few types of criteria and procedures that you need to know and be able to use, depending on the specifics of the data. You need to determine which criterion should be applied in a particular situation.

Nonparametric methods are most appropriate when sample sizes are small. If there is a lot of data (for example, n > 100), it often makes no sense to use nonparametric statistics.

If the sample size is very small (for example, n = 10 or less), then the significance levels for those nonparametric tests that use the normal approximation can only be considered as rough estimates.

Differences between independent groups. If there are two samples (for example, men and women) that need to be compared with respect to some mean value, for example, mean blood pressure or white blood cell count, the t-test for independent samples can be used. Nonparametric alternatives to this test are the Wald-Wolfowitz runs test and the Mann-Whitney U test.

Geometric mean

The geometric mean is calculated by the formula GM = exp[Σ ln(x_i) / n], where x_i is the i-th value and n is the number of observations. If the variable contains negative values or zero (0), the geometric mean cannot be calculated.

Harmonic mean

The harmonic mean is sometimes used to average frequencies. It is calculated by the formula HM = n / Σ(1/x_i), where HM is the harmonic mean, n is the number of observations, and x_i is the value of the observation with number i. If the variable contains zero (0), the harmonic mean cannot be calculated.

Variance and standard deviation

Sample variance and standard deviation are the most commonly used measures of variability (variation) in the data. The variance is calculated as the sum of squared deviations of the values of the variable from the sample mean, divided by n − 1 (not by n). The standard deviation is calculated as the square root of the variance estimate.

Range

The range of a variable is a measure of variability, calculated as the maximum minus the minimum.

Interquartile range

The interquartile range is, by definition, the upper quartile minus the lower quartile (the 75th percentile minus the 25th percentile). Since the 75th percentile (upper quartile) is the value to the left of which 75% of the cases lie, and the 25th percentile (lower quartile) is the value to the left of which 25% of the cases lie, the interquartile range is the interval around the median that contains 50% of the cases (values of the variable).

Skewness

Skewness is a characteristic of the shape of the distribution. The distribution is skewed to the left if the skewness value is negative, and skewed to the right if it is positive. The skewness of the standard normal distribution is 0. Skewness is associated with the third moment and is defined as

skewness = n × M₃ / [(n − 1) × (n − 2) × s³],

where M₃ = Σ(x_i − x̄)³, s³ is the standard deviation raised to the third power, and n is the number of observations.

Kurtosis

Kurtosis is a characteristic of the shape of a distribution, namely a measure of the sharpness of its peak (relative to the normal distribution, whose kurtosis equals 0). As a rule, distributions with a sharper peak than the normal have positive kurtosis; distributions whose peak is less sharp than that of the normal distribution have negative kurtosis. Kurtosis is associated with the fourth moment and is determined by the formula

kurtosis = [n(n + 1) × M₄ − 3 × M₂² × (n − 1)] / [(n − 1) × (n − 2) × (n − 3) × s⁴],

where M_j = Σ(x_i − x̄)^j, s⁴ is the standard deviation raised to the fourth power, and n is the number of observations.
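All of these descriptive measures are available in NumPy/SciPy; a minimal sketch on an illustrative data vector (with bias=False, SciPy applies the same (n − 1), (n − 2), (n − 3) corrections as the formulas above):

```python
import numpy as np
from scipy import stats

x = np.array([2.0, 3.5, 4.1, 5.0, 7.3, 8.8, 12.4])  # illustrative data

print("geometric mean:", stats.gmean(x))       # requires all x > 0
print("harmonic mean: ", stats.hmean(x))       # requires no zeros
print("variance:      ", np.var(x, ddof=1))    # divides by n-1, not n
print("std deviation: ", np.std(x, ddof=1))
print("range:         ", x.max() - x.min())
q75, q25 = np.percentile(x, [75, 25])
print("IQR:           ", q75 - q25)
# bias=False applies the small-sample corrections from the formulas above
print("skewness:      ", stats.skew(x, bias=False))
print("kurtosis:      ", stats.kurtosis(x, bias=False))
```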

This coefficient is used to identify the relationship between quantitative or qualitative indicators, provided they can be ranked. The values of the indicator X are arranged in ascending order and assigned ranks. The values of the indicator Y are then ranked, and Kendall's correlation coefficient is calculated:

τ = 2S / [n(n − 1)],

where S = P − Q;

P is the total number of observations following the current observation that have a larger rank value of Y;

Q is the total number of observations following the current observation that have a smaller rank value of Y (equal ranks are not counted!).

If the data under study contain repeated values (tied ranks), then the tie-corrected Kendall correlation coefficient is used in the calculations:

τ = S / √{[n(n − 1)/2 − T_x] · [n(n − 1)/2 − T_y]},  T = ½ Σ t(t − 1),

where t is the number of tied ranks in the series X and Y, respectively (T_x and T_y being the corrections for the two series).
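A direct pair-counting implementation may make these formulas concrete; a sketch (scipy.stats.kendalltau applies the same tie correction and can serve as a check):

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Tie-corrected Kendall coefficient by direct pair counting."""
    n = len(x)
    s = 0            # S = P - Q over all pairs
    tx = ty = 0      # pairs tied in x / in y; equals (1/2) * sum t(t-1)
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        if xi == xj:
            tx += 1
        if yi == yj:
            ty += 1
        if xi != xj and yi != yj:
            s += 1 if (xi - xj) * (yi - yj) > 0 else -1
    n_pairs = n * (n - 1) // 2
    return s / sqrt((n_pairs - tx) * (n_pairs - ty))

print(kendall_tau_b([1, 2, 3, 4, 5], [3, 1, 2, 2, 5]))  # ~0.316
```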

19. What should be the starting point when defining the theme, object, subject, goal, objectives and hypothesis of the research?

The research program, as a rule, has two sections: methodological and procedural. The first includes substantiating the relevance of the topic, formulating the problem, defining the object and subject, goals and objectives of the research, formulating the basic concepts (categorical apparatus), preliminary systematic analysis of the research object and putting forward a working hypothesis. The second section reveals the strategic research plan, as well as the plan and basic procedures for collecting and analyzing primary data.

First of all, when choosing a research topic, one must proceed from its relevance. Justification of relevance includes an indication of the necessity and timeliness of studying and solving the problem for the further development of the theory and practice of teaching and upbringing. Topical research answers the most pressing questions of the time, reflects the social demands placed on pedagogical science by society, and reveals the most important contradictions occurring in practice. The criterion of relevance is dynamic and mobile; it depends on time and on specific circumstances. In its most general form, relevance characterizes the degree of discrepancy between the demand for scientific ideas and practical recommendations (to meet a particular need) and what science and practice can offer at the present time.

The most convincing basis for defining a research topic is the social demand, reflecting the most acute, socially significant problems requiring urgent solution. The social demand, in turn, requires substantiation of the specific topic, usually through an analysis of the degree to which the question has been elaborated in science.

If the social demand follows from the analysis of pedagogical practice, the scientific problem itself lies in a different plane. It expresses the main contradiction that must be resolved by means of science. The solution of the problem is usually the purpose of the study: the goal is the problem reformulated.

The formulation of the problem entails the selection of the research object. This can be a pedagogical process, an area of pedagogical reality, or some pedagogical relation containing a contradiction. In other words, the object can be anything that explicitly or implicitly contains a contradiction and generates a problem situation. The object is that toward which the process of cognition is directed. The subject of the study is a part, or aspect, of the object: the properties, sides, and features of the object, most significant from a practical or theoretical point of view, that are subject to direct study.

In accordance with the purpose, object and subject of the research, the research tasks are formulated; as a rule, they are aimed at testing the hypothesis. The latter is a set of theoretically grounded assumptions whose truth is subject to verification.

The criterion of scientific novelty can be used to assess the quality of completed studies. It characterizes new theoretical and practical conclusions, patterns of education, its structure and mechanisms, content, principles and technologies that were not previously known or recorded in the pedagogical literature. The novelty of a study can have both theoretical and practical significance. The theoretical value of a study lies in the creation of a concept, hypothesis, regularity, method, or model for identifying a problem, tendency, or direction. The practical significance of a study lies in the preparation of proposals, recommendations, etc. The criteria of novelty and of theoretical and practical significance change depending on the type of research and also depend on the time at which the new knowledge is obtained.

KENDALL RANK CORRELATION COEFFICIENT

One of the sample measures of the dependence of two random variables (features) X and Y based on the ranking of the sample elements (X₁, Y₁), …, (X_n, Y_n). The Kendall rank correlation coefficient is thus a rank statistic and is defined by the formula

τ = 2S / [n(n − 1)],

where r_i is the rank of the Y belonging to the pair (X, Y) for which the rank of X equals i; S = 2N − n(n − 1)/2; and N is the number of sample elements for which simultaneously j > i and r_j > r_i. Always −1 ≤ τ ≤ 1. As a sample measure of dependence, the coefficient was widely used by M. Kendall.

The Kendall rank correlation coefficient is used to test the hypothesis of independence of random variables. If the independence hypothesis is true, then Eτ = 0 and Dτ = 2(2n + 5) / [9n(n − 1)]. For small sample sizes, the statistical hypothesis of independence is tested using special tables. For n > 10, the normal approximation for the distribution of τ is used: if

|τ| > u_{α/2} √{2(2n + 5) / [9n(n − 1)]},

then the hypothesis of independence is rejected; otherwise it is accepted. Here α is the significance level and u_{α/2} is the (α/2)-percentage point of the normal distribution. The Kendall rank correlation coefficient, like any other rank correlation coefficient, can be used to detect the dependence of two qualitative features, provided only that the elements of the sample can be ordered with respect to these features. If X and Y have a joint normal distribution with correlation coefficient ρ, then the Kendall coefficient is related to ρ by

τ = (2/π) arcsin ρ.

see also Spearman's rank correlation, Rank test.

References: Kendall M., Rank Correlations, trans. from English, Moscow, 1975; van der Waerden B. L., Mathematical Statistics, trans. from German, Moscow, 1960; Bolshev L. N., Smirnov N. V., Tables of Mathematical Statistics, Moscow, 1965.

A. V. Prokhorov.


Encyclopedia of Mathematics. Moscow: Soviet Encyclopedia, ed. I. M. Vinogradov, 1977-1985.


To calculate Kendall's rank correlation coefficient r_k, it is necessary to rank the data by one of the attributes in ascending order and determine the corresponding ranks for the second attribute. Then, for each rank of the second attribute, the number of subsequent ranks greater in magnitude than the given rank is determined, and the sum of these numbers is found.

Kendall's rank correlation coefficient is determined by the formula

r_k = 4 Σ R_i / [n(n − 1)] − 1,

where R_i is the number of ranks of the second variable, starting from position i + 1, whose magnitude is greater than the magnitude of the i-th rank of this variable.

There are tables of percentage points of the distribution of the coefficient r_k that allow one to test the hypothesis of the significance of the correlation coefficient.

For large sample sizes, critical values of r_k are not tabulated, and they have to be calculated by approximate formulas, which are based on the fact that, under the null hypothesis H₀: r_k = 0 and for large n, the random variable

z = r_k / √{2(2n + 5) / [9n(n − 1)]}

is distributed approximately according to the standard normal law.
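A sketch of this large-sample check (the function name and the illustrative data are assumptions):

```python
import numpy as np
from scipy import stats

def kendall_significance(x, y, alpha=0.05):
    """Normal-approximation significance test for Kendall's coefficient."""
    n = len(x)
    r_k, _ = stats.kendalltau(x, y)
    z = r_k / np.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))
    p_value = 2 * stats.norm.sf(abs(z))   # two-sided
    return r_k, z, p_value, p_value < alpha

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = x + rng.normal(scale=0.8, size=30)   # correlated by construction
print(kendall_significance(x, y))
```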

40. Relationship between traits measured in nominal or ordinal scales

The problem often arises of checking the independence of two features measured on a nominal or ordinal scale.

Suppose that two features, X and Y, with r and s levels respectively, are measured on a number of objects. The results of such observations are conveniently presented in the form of a table called a contingency table.

In the table, u_i (i = 1, …, r) and v_j (j = 1, …, s) are the values taken by the features, and n_ij is the number of objects, out of the total number of objects, for which the feature X took the value u_i and the feature Y took the value v_j.

We introduce the following random variables:

n_{i·} = Σ_{j=1}^{s} n_ij, the number of objects that have the value u_i;

n_{·j} = Σ_{i=1}^{r} n_ij, the number of objects that have the value v_j.

In addition, there are the obvious equalities

Σ_{i=1}^{r} n_{i·} = Σ_{j=1}^{s} n_{·j} = Σ_i Σ_j n_ij = n.

The discrete random variables X and Y are independent if and only if

P(X = u_i, Y = v_j) = P(X = u_i) · P(Y = v_j)

for all pairs i, j.

Therefore, the conjecture about the independence of the discrete random variables X and Y can be written as

H₀: p_ij = p_{i·} · p_{·j} for all i, j.

As the alternative, as a rule, one uses the hypothesis

H₁: p_ij ≠ p_{i·} · p_{·j} for at least one pair i, j.

The validity of the hypothesis H₀ should be judged on the basis of the sample frequencies n_ij of the contingency table. In accordance with the law of large numbers, as n → ∞ the relative frequencies are close to the corresponding probabilities:

n_ij / n ≈ p_ij,  n_{i·} / n ≈ p_{i·},  n_{·j} / n ≈ p_{·j}.

To test the hypothesis H₀, the statistic

χ² = n · Σ_{i=1}^{r} Σ_{j=1}^{s} (n_ij − n_{i·} n_{·j} / n)² / (n_{i·} n_{·j})

is used, which, if the hypothesis is true, has the χ² distribution with rs − (r + s − 1) = (r − 1)(s − 1) degrees of freedom.

The χ² independence test rejects the hypothesis H₀ at significance level α if

χ² > χ²_{α; (r−1)(s−1)},

where χ²_{α; (r−1)(s−1)} is the α-percentage point of the χ² distribution with (r − 1)(s − 1) degrees of freedom.


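In practice the whole computation is available in SciPy; a minimal sketch on a made-up 2 × 3 contingency table:

```python
import numpy as np
from scipy import stats

# Illustrative r x s contingency table: rows = levels of X, cols = levels of Y
table = np.array([[18, 25, 12],
                  [12, 22, 31]])

chi2, p_value, dof, expected = stats.chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
# dof = (r - 1)(s - 1) = 2 here; reject H0 at level alpha if p < alpha
```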
41. Regression analysis. Basic concepts of regression analysis

For a mathematical description of the statistical relationships between the studied variables, the following problems should be solved:

  • choose a class of functions in which it is advisable to seek the best (in a certain sense) approximation of the dependence of interest;
  • find estimates of the unknown values of the parameters entering the equations of the required dependence;
  • establish the adequacy of the obtained equation of the required dependence;
  • identify the most informative input variables.

The totality of the listed tasks is the subject of research in regression analysis.

The regression function (or regression) is the dependence of the mathematical expectation of one random variable on the value taken by another random variable, which forms a two-dimensional system of random variables with the first.

Let there be a system of random variables (X, Y); then the regression function of Y on X is

f(x) = M[Y | X = x],

and the regression function of X on Y is

φ(y) = M[X | Y = y].

The regression functions f(x) and φ(y) are not mutually invertible unless the relationship between X and Y is functional.

For an n-dimensional vector with coordinates X₁, X₂, …, X_n one can consider the conditional mathematical expectation for any component. For example, for X₁,

M[X₁ | X₂ = x₂, …, X_n = x_n],

which is called the regression of X₁ on X₂, …, X_n.

For a complete definition of the regression function, it is necessary to know the conditional distribution of the output variable for fixed values ​​of the input variable.

Since in a real situation such information is not available, one is usually limited to the search for a suitable approximating function f_a(x) for f(x), based on statistical data of the form (x_i, y_i), i = 1, …, n. These data are the results of n independent observations y₁, …, y_n of the random variable Y at the values x₁, …, x_n of the input variable; regression analysis assumes that the values of the input variable are specified exactly.

The problem of choosing the best approximating function f_a(x) is the central one in regression analysis, and it has no formalized procedures for its solution. Sometimes the choice is made on the basis of analysis of the experimental data, more often from theoretical considerations.

If it is assumed that the regression function is sufficiently smooth, then the approximating function f_a(x) can be represented as a linear combination of a set of linearly independent basis functions ψ_k(x), k = 0, 1, …, m − 1, i.e., in the form

f_a(x; θ) = Σ_{k=0}^{m−1} θ_k ψ_k(x),

where m is the number of unknown parameters θ_k (in the general case this value is unknown and is refined during model construction).

Such a function is linear in parameters, therefore, in the case under consideration, we speak of a regression function model that is linear in parameters.

Then the problem of finding the best approximation for the regression line f(x) reduces to finding the parameter values θ for which f_a(x; θ) is most adequate to the available data. One of the methods for solving this problem is the method of least squares.
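For a model linear in its parameters, the least-squares estimates can be computed through the matrix of basis-function values; a sketch with a polynomial basis chosen purely for illustration:

```python
import numpy as np

# Illustrative data
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([1.1, 1.8, 3.1, 4.9, 7.2, 9.8, 13.1])

# Basis functions psi_0(x)=1, psi_1(x)=x, psi_2(x)=x^2  (m = 3)
Psi = np.column_stack([np.ones_like(x), x, x**2])

# Least-squares estimates of theta_0..theta_2
theta, *_ = np.linalg.lstsq(Psi, y, rcond=None)
print("theta =", theta)
print("fitted:", Psi @ theta)
```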

42. The method of least squares

Let the set of points (x_i, y_i), i = 1, …, n, be located on a plane along some straight line.

Then, as the function f_a(x) approximating the regression function f(x) = M[Y | x], it is natural to take a linear function of the argument x:

f_a(x) = θ₀ + θ₁x.

That is, the basis functions chosen here are ψ₀(x) ≡ 1 and ψ₁(x) ≡ x. Such a regression is called simple linear regression.

If the set of points (x_i, y_i), i = 1, …, n, is located along some curve, then as f_a(x) it is natural to try, for example, the family of parabolas

f_a(x) = θ₀ + θ₁x + θ₂x²,

which is still linear in its parameters, or a family such as the exponentials f_a(x) = θ₀e^{θ₁x}. The latter function is nonlinear in the parameters θ₀ and θ₁; however, by a functional transformation (in this case, taking logarithms) it can be reduced to a new function f′_a(x) that is linear in the parameters:

ln f_a(x) = ln θ₀ + θ₁x.
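A sketch of such a linearization, under the assumption of an exponential trend (data illustrative):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 8.2, 16.5, 32.8])   # roughly y = 2 * e^(0.7 x)

# Fit ln(y) = ln(theta0) + theta1 * x by ordinary least squares
slope, intercept = np.polyfit(x, np.log(y), 1)
theta0, theta1 = np.exp(intercept), slope
print(f"y ~ {theta0:.2f} * exp({theta1:.2f} * x)")
```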


43. Simple Linear Regression

The simplest regression model is the simple (one-dimensional, single-factor, paired) linear model, which has the following form:

y_i = a + b x_i + ε_i,  i = 1, …, n,

where ε_i are random variables (errors), uncorrelated with each other, having zero mathematical expectation and the same variance σ², and a and b are constant coefficients (parameters) that need to be estimated from the measured response values y_i.

To find the estimates of the parameters a and b of the linear regression, determining the straight line that best fits the experimental data,

ŷ = a + b x,

the method of least squares is applied.

According to the method of least squares, the parameter estimates a and b are found from the condition of minimizing the sum of squared vertical deviations of the values y_i from the "true" regression line:

D = Σ_{i=1}^{n} (y_i − a − b x_i)² → min.

Let there be, for example, ten observations of the random variable Y at fixed values of the variable X.

To minimize D, we equate to zero its partial derivatives with respect to a and b:

∂D/∂a = −2 Σ (y_i − a − b x_i) = 0,

∂D/∂b = −2 Σ x_i (y_i − a − b x_i) = 0.

As a result, we obtain the following system of equations for finding the estimates a and b:

n a + b Σ x_i = Σ y_i,
a Σ x_i + b Σ x_i² = Σ x_i y_i.

Solving these two equations gives

b = [n Σ x_i y_i − Σ x_i · Σ y_i] / [n Σ x_i² − (Σ x_i)²],  a = ȳ − b x̄.



The expressions for the parameter estimates a and b can also be represented as

b = K_xy / s_x²,  a = ȳ − b x̄,

where K_xy is the sample covariance of x and y and s_x² is the sample variance of x. Then the empirical equation of the regression line of Y on X can be written as

ŷ = a + b x.


An unbiased estimate of the variance σ² of the deviations of the values y_i from the fitted straight regression line is given by the expression

s₀² = Σ_{i=1}^{n} (y_i − ŷ_i)² / (n − 2).
Substituting the data of the example into these formulas yields the parameters of the regression equation, the fitted regression line, and the estimate of the variance of the deviations of the values y_i from it.
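Since the numeric table of the original example is not reproduced above, the following sketch runs the same computation on made-up data (ten observations, matching the setup of the example):

```python
import numpy as np

# Hypothetical data: ten observations of Y at fixed values of X
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7, 10.2, 10.9])

n = len(x)
b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x**2).sum() - x.sum()**2)
a = y.mean() - b * x.mean()

y_hat = a + b * x
s0_sq = ((y - y_hat)**2).sum() / (n - 2)   # unbiased estimate of sigma^2

print(f"y = {a:.3f} + {b:.3f} x,  s0^2 = {s0_sq:.4f}")
```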


44. Checking the Significance of the Regression Line

The found estimate b ≠ 0 can be a realization of a random variable whose mathematical expectation is equal to zero; that is, it may turn out that there is actually no regression dependence.

To deal with this situation, one should test the hypothesis H₀: b = 0 against the competing hypothesis H₁: b ≠ 0.

The test of the significance of the regression line can be carried out using analysis of variance.

Consider the following identity:

y_i − ŷ_i = (y_i − ȳ) − (ŷ_i − ȳ).

The quantity y_i − ŷ_i = e_i is called the residual and is the difference between two quantities:

  • the deviation of the observed value (response) from the overall mean response;
  • the deviation of the predicted response value ŷ_i from the same mean.

The written identity can be rewritten as

y_i − ȳ = (ŷ_i − ȳ) + (y_i − ŷ_i).

Squaring both sides of it and summing over i (the cross term vanishes), we get

Σ_i (y_i − ȳ)² = Σ_i (ŷ_i − ȳ)² + Σ_i (y_i − ŷ_i)²,
where the quantities are named:

SK_n, the total sum of squares, Σ_i (y_i − ȳ)², equal to the sum of squared deviations of the observations from the mean of the observations;

SK_p, the sum of squares due to regression, Σ_i (ŷ_i − ȳ)², equal to the sum of squared deviations of the regression-line values from the mean of the observations;

SK₀, the residual sum of squares, Σ_i (y_i − ŷ_i)², equal to the sum of squared deviations of the observations from the values of the regression line.

Thus, the spread of the y values about their mean can be attributed, to some extent, to the fact that not all observations lie on the regression line. If all of them did, the residual sum of squares would be zero. It follows that the regression is significant if the sum of squares SK_p is large in comparison with the residual sum of squares SK₀.

Regression significance test calculations are performed in the following ANOVA table.

If the errors ε_i are distributed according to the normal law, then, if the hypothesis H₀: b = 0 is valid, the statistic

F = SK_p / [SK₀ / (n − 2)]

is distributed according to Fisher's law with 1 and n − 2 degrees of freedom.

The null hypothesis is rejected at significance level α if the calculated value of the statistic F is greater than the α-percentage point f_{1; n−2; α} of the Fisher distribution.
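A sketch of this F-test, continuing the made-up data of the least-squares example:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7, 10.2, 10.9])
n = len(x)

b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x**2).sum() - x.sum()**2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

sk_p = ((y_hat - y.mean())**2).sum()   # sum of squares due to regression
sk_0 = ((y - y_hat)**2).sum()          # residual sum of squares

F = sk_p / (sk_0 / (n - 2))
p_value = stats.f.sf(F, 1, n - 2)
print(f"F = {F:.1f}, p = {p_value:.2e}")  # reject H0: b = 0 if p < alpha
```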

45. Checking the adequacy of the regression model. Residual method

The adequacy of the constructed regression model is understood as the fact that no other model gives a significant improvement in predicting the response.

If all the response values were obtained at different values of x, i.e., there are no repeated response values at the same x_i, then only a limited check of the adequacy of the linear model can be carried out. The basis for such a check is the residuals

d_i = y_i − ŷ_i,  i = 1, …, n,

the deviations from the fitted relationship.

Since X is a one-dimensional variable, the points (x_i, d_i) can be plotted on a plane in the form of a so-called residual plot. Such a representation sometimes makes it possible to find some regularity in the behavior of the residuals. In addition, analysis of the residuals allows one to check the assumption regarding the distribution of errors.

In the case when the errors are distributed according to the normal law and an a priori estimate of their variance σ² is available (an estimate obtained on the basis of previously performed measurements), a more accurate assessment of the adequacy of the model is possible.

Using Fisher's F-test, one can check whether the residual variance s₀² differs significantly from the a priori estimate. If it is significantly greater, then there is inadequacy and the model should be revised.

If there is no a priori estimate of σ², but the response measurements Y were repeated two or more times at the same values of X, then these repeated observations can be used to obtain another estimate of σ² (the first being the residual variance). Such an estimate is said to represent the "pure" error, since if x is the same for two or more observations, then only random variation can affect the results and create the scatter between them.

The resulting estimate turns out to be a more reliable estimate of the variance than the estimate obtained by other methods. For this reason, when planning experiments, it makes sense to set up experiments with repetitions.

Suppose we have m different values of X: x₁, x₂, …, x_m, and for each of these values x_i there are n_i observations of the response Y. In total,

n = Σ_{i=1}^{m} n_i

observations are obtained. Then the simple linear regression model can be written as

y_ij = a + b x_i + ε_ij,  i = 1, …, m;  j = 1, …, n_i.

Let us find the variance of the "pure" errors. This variance is the pooled estimate of the variance σ² if the response values y_ij at x = x_i are regarded as a sample of size n_i. As a result, the variance of the "pure" errors is

s_e² = Σ_i Σ_j (y_ij − ȳ_i)² / (n − m),

where ȳ_i is the mean of the observations at x_i.

This variance serves as an estimate of σ² regardless of whether the fitted model is correct.

Let us show that the sum of squares of the "pure" errors is a part of the residual sum of squares (the sum of squares entering the expression for the residual variance). The residual for the j-th observation at x_i can be written as

y_ij − ŷ_i = (y_ij − ȳ_i) + (ȳ_i − ŷ_i).

Squaring both sides of this equality and summing over j and over i, we get

Σ_i Σ_j (y_ij − ŷ_i)² = Σ_i Σ_j (y_ij − ȳ_i)² + Σ_i n_i (ȳ_i − ŷ_i)².

On the left of this equality is the residual sum of squares. The first term on the right is the sum of squares of the "pure" errors; the second term can be called the sum of squares of inadequacy (lack of fit). The latter sum has m − 2 degrees of freedom; therefore, the variance of inadequacy is

s_lof² = Σ_i n_i (ȳ_i − ŷ_i)² / (m − 2).

The test statistic for checking the hypothesis H₀ (the simple linear model is adequate) against the hypothesis H₁ (the simple linear model is inadequate) is the random variable

F = s_lof² / s_e².

If the null hypothesis is true, F has the Fisher distribution with m − 2 and n − m degrees of freedom. The hypothesis of linearity of the regression line should be rejected at significance level α if the obtained value of the statistic is greater than the α-percentage point of the Fisher distribution with m − 2 and n − m degrees of freedom.
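A sketch of this lack-of-fit test on made-up data with repeated observations at each x_i:

```python
import numpy as np
from scipy import stats

# Made-up data: m = 4 distinct x values, with repeated observations of Y
x = np.array([1, 1, 2, 2, 2, 3, 3, 4, 4, 4], dtype=float)
y = np.array([2.1, 2.4, 3.9, 4.2, 4.0, 6.1, 5.8, 8.2, 7.9, 8.3])
levels = np.unique(x)
m, n = len(levels), len(x)

# Fit the simple linear model
b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

# Pure-error and lack-of-fit sums of squares
ss_pe = sum(((y[x == xi] - y[x == xi].mean())**2).sum() for xi in levels)
ss_lof = ((y - y_hat)**2).sum() - ss_pe

F = (ss_lof / (m - 2)) / (ss_pe / (n - m))
p_value = stats.f.sf(F, m - 2, n - m)
print(f"F = {F:.2f}, p = {p_value:.3f}")  # large p: no evidence of inadequacy
```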

46. Checking the adequacy of the regression model (see 45). ANOVA

47. Checking the adequacy of the regression model (see 45). Coefficient of determination

Sometimes, to characterize the quality of the regression line, the sample coefficient of determination R² is used, showing what part (fraction) of the total sum of squares SK_n is made up by the sum of squares due to regression, SK_p:

R² = SK_p / SK_n.

The closer R² is to one, the better the regression approximates the experimental data and the closer the observations lie to the regression line. If R² = 0, then the changes in the response are entirely due to the influence of unaccounted factors, and the regression line is parallel to the x-axis. In the case of simple linear regression, the coefficient of determination R² is equal to the square of the correlation coefficient, r².

The maximum value R² = 1 can be attained only when the observations were made at all-different values of x. If the data contain repeated experiments, then R² cannot reach one, no matter how good the model is.
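Continuing the same made-up regression example, R² is computed in one line:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7, 10.2, 10.9])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

r_sq = 1 - ((y - y_hat)**2).sum() / ((y - y.mean())**2).sum()
print(f"R^2 = {r_sq:.4f}")   # equals np.corrcoef(x, y)[0, 1]**2 here
```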

48. Confidence Intervals for Simple Linear Regression Parameters

Just as the sample mean is an estimate of the true mean (the population mean), the sample parameters a and b of the regression equation are nothing more than estimates of the true regression coefficients. Different samples give different estimates of the mean, just as different samples will give different estimates of the regression coefficients.

Assuming that the distribution of the errors ε_i is described by the normal law, the estimate of the parameter b will have a normal distribution with parameters

M[b] = b,  D[b] = σ² / Σ (x_i − x̄)².

Since the estimate of the parameter a is a linear combination of independent normally distributed quantities, it will also have a normal distribution, with mean and variance

M[a] = a,  D[a] = σ² Σ x_i² / [n Σ (x_i − x̄)²].
In this case, the (1 − α) confidence interval for the variance σ², taking into account that the ratio (n − 2)s₀²/σ² is distributed according to the χ² law with n − 2 degrees of freedom, is determined by the expression

(n − 2)s₀² / χ²_{α/2; n−2} < σ² < (n − 2)s₀² / χ²_{1−α/2; n−2}.


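A sketch computing these intervals on the same made-up data (scipy.stats supplies the t and χ² percentage points):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.7, 10.2, 10.9])
n, alpha = len(x), 0.05

b, a = np.polyfit(x, y, 1)
s0_sq = ((y - (a + b * x))**2).sum() / (n - 2)
sxx = ((x - x.mean())**2).sum()

# Confidence interval for the slope b
t = stats.t.ppf(1 - alpha / 2, n - 2)
half = t * np.sqrt(s0_sq / sxx)
print(f"b in [{b - half:.3f}, {b + half:.3f}]")

# Confidence interval for sigma^2
lo = (n - 2) * s0_sq / stats.chi2.ppf(1 - alpha / 2, n - 2)
hi = (n - 2) * s0_sq / stats.chi2.ppf(alpha / 2, n - 2)
print(f"sigma^2 in [{lo:.4f}, {hi:.4f}]")
```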
49. Confidence intervals for the regression line. Confidence interval for dependent variable values

We usually do not know the true values of the regression coefficients a and b; we only know their estimates. In other words, the true regression line may run higher or lower, be steeper or shallower, than the one constructed from the sample data. We have calculated confidence intervals for the regression coefficients. One can also calculate a confidence region for the regression line itself.

Suppose that for the simple linear regression it is required to construct a (1 − α) confidence interval for the mathematical expectation of the response Y at the value X = x₀. This mathematical expectation equals a + bx₀, and its estimate is

ŷ(x₀) = a + b x₀,

where a and b are the least-squares estimates. Since M[ŷ(x₀)] = a + bx₀, this estimate is unbiased.

The obtained estimate of the mathematical expectation is a linear combination of uncorrelated normally distributed quantities and therefore also has a normal distribution, centered at the true value of the conditional mathematical expectation, with variance

D[ŷ(x₀)] = σ² [1/n + (x₀ − x̄)² / Σ (x_i − x̄)²].

Therefore, the confidence interval for the regression line at each value x₀ can be represented as

ŷ(x₀) ± t_{n−2; α/2} · s₀ · √[1/n + (x₀ − x̄)² / Σ (x_i − x̄)²].
As can be seen, the minimum confidence interval is obtained at x₀ equal to the mean value of x, and the interval widens as x₀ "moves away" from the mean in either direction.

To obtain a set of joint confidence intervals suitable for the entire regression function, along its entire length, in the above expression instead of t n −2,α / 2 must be substituted