Definition of multicollinearity. Causes and consequences of multicollinearity

Multicollinearity is a linear relationship between two or more explanatory (factor) variables in a multiple regression equation. If this dependence is functional, one speaks of full multicollinearity; if it is a correlation, of partial multicollinearity. While full multicollinearity is largely a theoretical abstraction (it arises, in particular, if a qualitative variable with k levels is replaced by k dummy variables), partial multicollinearity is very real and is almost always present; one can only speak of the degree of its severity. For example, if the explanatory variables include both disposable income and consumption, the two variables will, of course, be highly correlated.

The absence of multicollinearity is one of the desirable prerequisites of the classical linear multiple regression model. This is due to the following considerations:

1) In the case of complete multicollinearity, it is generally impossible to construct estimates of the parameters of linear multiple regression using OLS.

2) In the case of partial multicollinearity, the estimates of the regression parameters may be unreliable and, in addition, it is difficult to determine the isolated contribution of each factor to the dependent (effective) variable.

The main reason for the occurrence of multicollinearity is the presence in the studied object of processes that simultaneously affect some input variables, but are not taken into account in the model. This may be the result of a poor-quality study of the subject area or the complexity of the interrelationships of the parameters of the studied object.

Multicollinearity is suspected when there are:

- a large number of insignificant factors in the model;

- large standard errors of the regression parameters;

- instability of estimates (a small change in the initial data leads to a significant change).

One approach to determining the presence or absence of multicollinearity is to analyze the correlation matrix between the explanatory variables and to identify pairs of factors with high pair correlation coefficients (usually above 0.7). If such factors exist, there is clear collinearity between them.

However, pair correlation coefficients, considered individually, cannot capture the joint interaction of several factors (rather than just two).

Therefore, to assess the presence of multicollinearity in the model, the determinant of the matrix of pair correlation coefficients between the factors (the determinant of the interfactor correlation matrix) is used.

The closer the determinant of the interfactor correlation matrix to 0, the stronger the multicollinearity, and vice versa, the closer the determinant to 1, the less multicollinearity.


The statistical significance of multicollinearity of the factors is checked by testing the null hypothesis H0: det R = 1 (the factors are uncorrelated) against the alternative H1: det R ≠ 1. The Pearson (chi-square) distribution with m(m − 1)/2 degrees of freedom is used to test the null hypothesis. The observed value of the statistic is found (in the standard Farrar–Glauber form) as χ²obs = −[n − 1 − (2m + 5)/6]·ln(det R), where n is the number of observations and m is the number of factors. For a given significance level, the critical value χ²crit is taken from the table of critical points of the Pearson distribution. If χ²obs > χ²crit, the null hypothesis is rejected and multicollinearity of the factors is considered present in the model.
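A minimal computational sketch of this check (it assumes the observed statistic has the standard Farrar–Glauber form given above; the function name and the data array X are illustrative, not part of the original text):

```python
import numpy as np
from scipy.stats import chi2

def multicollinearity_chi2(X, alpha=0.05):
    """Chi-square test based on the determinant of the interfactor correlation matrix.

    X is an (n, m) array of factor observations (hypothetical example data).
    """
    n, m = X.shape
    R = np.corrcoef(X, rowvar=False)              # interfactor correlation matrix
    chi2_obs = -(n - 1 - (2 * m + 5) / 6) * np.log(np.linalg.det(R))
    df = m * (m - 1) / 2
    chi2_crit = chi2.ppf(1 - alpha, df)
    return chi2_obs, chi2_crit, chi2_obs > chi2_crit   # True -> reject H0, multicollinearity
```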

The factors responsible for multicollinearity can also be identified by analyzing the coefficients of multiple determination R²1, R²2, …, R²m, each calculated by treating the corresponding factor as the dependent variable in a regression on all the other factors. The closer they are to 1, the stronger the multicollinearity of the factors. This means that the factors with the smallest values of the multiple determination coefficient should be kept in the equation.
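A short sketch of this diagnostic (the function name is illustrative; the VIF derived from each R² is a standard companion measure, not mentioned in the original text):

```python
import numpy as np
import statsmodels.api as sm

def factor_r2_and_vif(X):
    """R^2 of each factor regressed on all the others, and the corresponding VIF."""
    m = X.shape[1]
    r2 = np.empty(m)
    for j in range(m):
        others = np.delete(X, j, axis=1)
        r2[j] = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
    vif = 1.0 / (1.0 - r2)        # large VIF (small 1 - R^2) signals multicollinearity
    return r2, vif
```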

As for complete multicollinearity, it should be fought most decisively: variables that are linear combinations of other variables should be removed from the regression equation immediately.

Partial multicollinearity is not necessarily such a serious evil that it must be identified and eliminated; everything depends on the objectives of the study. If the main task of modeling is only to predict the values of the dependent variable, then with a sufficiently large coefficient of determination (R² > 0.9) the presence of multicollinearity does not affect the predictive qualities of the model. If the goal of modeling is also to determine the contribution of each factor to the change in the dependent variable, then the presence of multicollinearity is a serious problem.

The simplest method for eliminating multicollinearity is to exclude one or a number of correlated variables from the model.

Since multicollinearity directly depends on the sample, it is possible that with a different sample, there will be no multicollinearity at all, or it will not be so serious. Therefore, to reduce multicollinearity, in some cases, it is sufficient to increase the sample size.

Sometimes the multicollinearity problem can be solved by changing the model specification: either the shape of the model changes, or factors are added that were not taken into account in the original model, but significantly affect the dependent variable.

In some cases, multicollinearity can be minimized or completely eliminated by transforming factor variables. In this case, the following transformations are most common:

1. A linear combination of the multicollinear variables (for example, replacing x1 and x2 with their sum x1 + x2).

2. Replacing the multicollinear variable by its increment.

3. Division of one collinear variable by another.
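A minimal sketch of these three transformations with made-up numbers (x1 and x2 are hypothetical collinear series):

```python
import numpy as np

x1 = np.array([50.0, 60.0, 72.0, 85.0, 90.0])   # hypothetical collinear factors
x2 = np.array([30.0, 35.0, 44.0, 50.0, 55.0])

z_sum   = x1 + x2        # 1. linear combination of the multicollinear variables
dx1     = np.diff(x1)    # 2. increments (first differences) instead of the level of x1
z_ratio = x1 / x2        # 3. one collinear variable divided by the other
```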

Suppose we are considering a regression equation and the data used to estimate it contain observations on objects of different quality: men and women, or whites and blacks. The question that may interest us here is the following: is it true that the model under consideration is the same for the two samples corresponding to objects of different quality? This question can be answered using the Chow test.

Consider the models:

y_i = α1 + α2·x_i2 + … + αk·x_ik + ε_i,   i = 1, …, N    (1);

y_i = β1 + β2·x_i2 + … + βk·x_ik + ε_i,   i = N + 1, …, N + M    (2).

The first sample contains N observations, the second M observations. Example: Y is wages; the explanatory variables are age, length of service, and level of education. Does it follow from the available data that the model of the dependence of wages on the explanatory variables is the same for men and women?

To test this hypothesis, one can use the general scheme of hypothesis testing by comparing the restricted and the unrestricted regressions. The unrestricted regression here is the union of regressions (1) and (2), i.e. ESS_UR = ESS_1 + ESS_2, with N + M − 2k degrees of freedom. The restricted regression (i.e., the regression under the assumption that the null hypothesis holds) is the regression over the entire available set of observations:

y_i = γ1 + γ2·x_i2 + … + γk·x_ik + ε_i,   i = 1, …, N + M    (3).

Estimating (3), we obtain ESS_R. To test the null hypothesis we use the statistic

F = ((ESS_R − ESS_UR) / k) / (ESS_UR / (N + M − 2k)),

which, if the null hypothesis is true, has the Fisher distribution with k degrees of freedom in the numerator and N + M − 2k in the denominator.

If the null hypothesis is true, we can combine the available samples into one and estimate the model for N+M observations. If we reject the null hypothesis, then we cannot merge the two samples into one, and we will have to evaluate these two models separately.
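A sketch of the Chow test computation under these definitions (the helper name chow_test and the use of statsmodels are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import f

def chow_test(y1, X1, y2, X2, alpha=0.05):
    """Chow test for equality of the regression over two sub-samples.

    X1 (N x k) and X2 (M x k) already include the constant column.
    """
    N, k = X1.shape
    M = X2.shape[0]
    ess_ur = sm.OLS(y1, X1).fit().ssr + sm.OLS(y2, X2).fit().ssr   # ESS_1 + ESS_2
    ess_r = sm.OLS(np.concatenate([y1, y2]),
                   np.vstack([X1, X2])).fit().ssr                  # pooled regression
    F = ((ess_r - ess_ur) / k) / (ess_ur / (N + M - 2 * k))
    F_crit = f.ppf(1 - alpha, k, N + M - 2 * k)
    return F, F_crit, F > F_crit    # True -> the two samples cannot be merged
```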


The study of the general linear model that we considered earlier relies essentially, as we have seen, on the statistical apparatus. However, as in all applications of mathematical statistics, the strength of a method depends on the assumptions underlying it and necessary for its application. For a while we will consider situations in which one or more of the hypotheses underlying the linear model is violated, and we will consider alternative estimation methods for these cases. We will see that the role of some hypotheses is more significant than that of others. We need to understand what consequences the violation of particular assumptions can lead to, be able to check whether they are satisfied, and know what statistical methods can and should be applied when the classical least squares method is not suitable.

1. The relationship between the variables is linear and is expressed by the equation y = β1 + β2·x2 + … + βk·xk + ε — violations are model specification errors (omission of significant explanatory variables, inclusion of unnecessary variables, wrong choice of the functional form of the dependence);

2. X1, …, Xk are deterministic, linearly independent variables — violations are stochastic regressors and full multicollinearity;

4. The error variance is constant across observations, Var(ε_i) = σ² — violation: heteroscedasticity;

5. The errors are uncorrelated, E(ε_i·ε_k) = 0 when i ≠ k — violation: autocorrelation of the errors.

Before proceeding, let us consider the following concepts: the pair correlation coefficient and the partial correlation coefficient.

Suppose we are investigating the influence of one variable on another (Y and X). In order to understand how these variables are related to each other, we calculate the pair correlation coefficient by the formula

r_XY = Σ(x_i − x̄)(y_i − ȳ) / √( Σ(x_i − x̄)² · Σ(y_i − ȳ)² ).

If we get the value of the correlation coefficient close to 1, we conclude that the variables are quite strongly related to each other.

However, even if the correlation coefficient between the two variables of interest is close to 1, they may not actually be causally related. The classic example of the number of mentally ill and the number of radio receivers is an instance of what is called "spurious correlation". A high value of the correlation coefficient may also be due to a third variable that strongly affects the first two and is the reason for their high correlation. This raises the problem of computing the "pure" correlation between the variables X and Y, i.e., a correlation from which the (linear) influence of other variables has been removed. For this, the concept of the partial correlation coefficient is introduced.

So, we want to determine the partial correlation coefficient between the variables X and Y, excluding the linear influence of the variable Z. The following procedure is used to determine it:

1. We estimate the regression of X on Z.

2. We obtain the residuals e_X.

3. We estimate the regression of Y on Z.

4. We obtain the residuals e_Y.

5. The sample partial correlation coefficient r_XY.Z = r(e_X, e_Y) measures the degree of relationship between the variables X and Y cleared of the influence of the variable Z.

Direct calculation:

r_XY.Z = (r_XY − r_XZ·r_YZ) / √((1 − r²_XZ)(1 − r²_YZ)).

Property:

The procedure for constructing the partial correlation coefficient is generalized in case we want to get rid of the influence of two or more variables.
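A minimal sketch of the residual-based procedure described above (variable and function names are illustrative):

```python
import numpy as np
import statsmodels.api as sm

def partial_corr(x, y, z):
    """Sample partial correlation of x and y, net of the linear influence of z."""
    Z = sm.add_constant(z)
    e_x = sm.OLS(x, Z).fit().resid        # residuals of the regression of x on z
    e_y = sm.OLS(y, Z).fit().resid        # residuals of the regression of y on z
    return np.corrcoef(e_x, e_y)[0, 1]    # correlation of the two residual series
```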


1. Perfect multicollinearity.

One of the Gauss-Markov requirements is that the explanatory variables must not be connected by any exact relationship. If such a relationship exists, we say that the model exhibits perfect multicollinearity. Example: consider a model for the average exam score with three explanatory variables: I — parental income, D — the average number of hours spent studying per day, W — the average number of hours spent studying per week. Obviously, W = 7D, and this relationship holds for every student in our sample. The case of complete multicollinearity is easy to detect, since in this case it is impossible to construct estimates by the least squares method.

2. Partial multicollinearity or simply multicollinearity.

A much more common situation is when there is no exact linear relationship between the explanatory variables, but there is a close correlation between them; this case is called partial multicollinearity (or simply multicollinearity) — the existence of close statistical relationships between the variables. It must be said that multicollinearity is more a matter of degree than of kind. Any regression estimate will suffer from it in one form or another unless all the explanatory variables are completely uncorrelated. The problem is considered only when it begins to seriously affect the results of the regression estimation (the presence of statistical relationships between the regressors does not necessarily give unsatisfactory estimates). So multicollinearity is a problem when the close correlation between the regressors leads to unreliable regression estimates.

Consequences of multicollinearity:

Formally, as long as (XᵀX) is non-degenerate, we can construct OLS estimates of the regression coefficients. Recall, however, how the theoretical variances of the coefficient estimates are expressed: Var(b_i) = σ²·a_ii, where a_ii is the i-th diagonal element of the matrix (XᵀX)⁻¹. Since the matrix (XᵀX) is close to degenerate and det(XᵀX) ≈ 0, then

1) the main diagonal of the inverse matrix contains very large numbers, since the elements of the inverse matrix are inversely proportional to det(XᵀX). Therefore the theoretical variance of the i-th coefficient is large, its estimate is also large, the t-statistics are small, and this can lead to statistical insignificance of the i-th coefficient. That is, a variable may in fact have a significant effect on the explained variable, yet we conclude that it is insignificant.

2) Since the coefficient estimates and their variances depend on (XᵀX)⁻¹, whose elements are inversely proportional to det(XᵀX), adding or removing just one or two observations (i.e., one or two rows of the matrix X) can change the estimates substantially, up to a change of sign — the estimation results are unstable.

3) Difficulty in interpreting the regression equation. Suppose the equation contains two related variables, X1 and X2. The regression coefficient on X1 is interpreted as the change in Y per unit change in X1, other things being equal, i.e., with the values of all other variables held fixed. However, since X1 and X2 are related, a change in X1 will cause predictable changes in X2, and the value of X2 will not remain the same.

Example: Y = b0 + b1·X1 + b2·X2, where X1 is the total area and X2 the living area of an apartment. We say: "If the living area increases by 1 sq. m, then, other things being equal, the price of the apartment increases by b2 dollars." However, in this case the total area also increases by 1 sq. m, and the price increase will be b1 + b2. It is no longer possible to separate the influence of each variable on Y. A way out in this apartment-price situation is to include in the model not the total area but the "additional" (auxiliary) area, i.e., the difference between the total and the living area.

Multicollinearity signs.

There are no precise criteria for determining the presence (absence) of multicollinearity. However, there are heuristic recommendations for detecting it:

1) Analyze the matrix of paired correlation coefficients between regressors and if the value of the correlation coefficient is close to 1, then this is considered a sign of multicollinearity.

2) Analysis of the correlation matrix gives only a superficial judgment about the presence (absence) of multicollinearity. A more careful study of the issue is achieved by calculating partial correlation coefficients or by calculating the coefficient of determination of each explanatory variable regressed on all the other explanatory variables.

3) (XᵀX) is a symmetric positive semidefinite matrix, so all of its eigenvalues are nonnegative. If the determinant of the matrix (XᵀX) is equal to zero, then its minimum eigenvalue is also zero, and by continuity the two quantities are close to zero together. Consequently, the proximity of the determinant of (XᵀX) to zero can also be judged from the value of the minimal eigenvalue. Besides this property, the minimum eigenvalue is also important because the standard errors of the coefficients are inversely proportional to it.

4) The presence of multicollinearity can be judged by external signs that are consequences of multicollinearity:

a) some of the estimates have signs that are incorrect from the point of view of economic theory or unjustifiably high values;

b) a small change in the initial economic data leads to a significant change in the estimates of the model coefficients;

c) most of the t-statistics of the coefficients do not differ significantly from zero, while at the same time the model as a whole is significant, as evidenced by a high value of the F-statistic.

How to get rid of multicollinearity, how to eliminate it:

1) Using factor analysis: passing from the original set of regressors, among which there are statistically dependent ones, to new regressors Z1, …, Zm using the method of principal components. Instead of the initial variables we consider some of their linear combinations whose mutual correlation is small or absent altogether (a code sketch of this remedy follows this list). The challenge here is to give a meaningful interpretation to the new variables Z. If that fails, we return to the original variables using the inverse transformations. The estimates obtained in this way will be biased but will have smaller variance.

2) Among all available variables, select the factors most significantly influencing the explained variable. The selection procedures will be discussed below.

3) Transition to biased estimation methods.
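A minimal sketch of the principal-components remedy from point 1); centering, SVD-based components and the way the estimates are mapped back are illustrative choices, not prescribed by the text:

```python
import numpy as np
import statsmodels.api as sm

def principal_components_regression(X, y, n_comp):
    """Regress y on the first n_comp principal components Z of the regressors X."""
    Xc = X - X.mean(axis=0)                          # centre the original regressors
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt.T[:, :n_comp]                        # new, mutually uncorrelated regressors
    fit = sm.OLS(y, sm.add_constant(Z)).fit()
    beta_in_x_space = Vt.T[:, :n_comp] @ fit.params[1:]   # inverse transformation (biased)
    return fit, beta_in_x_space
```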

When we are faced with the problem of multicollinearity, the inexperienced researcher at first has a desire to simply exclude unnecessary regressors that may be causing it. However, it is not always clear which variables are redundant in this sense. In addition, as will be shown below, discarding the so-called significantly influencing variables leads to a bias of the OLS estimates.


Note that in a number of cases multicollinearity is not such a serious "evil" as to make significant efforts to identify and eliminate it. Basically, it all depends on the objectives of the study.
If the main task of the model is to predict future values of the dependent variable, then with a sufficiently large coefficient of determination R² (> 0.9) the presence of multicollinearity usually does not affect the predictive qualities of the model (provided that in the future the correlated variables retain the same relationships as before).
If it is necessary to determine the degree of influence of each of the explanatory variables on the dependent variable, then multicollinearity, leading to an increase in standard errors, is likely to distort the true relationships between the variables. In this situation, multicollinearity is a serious problem.
There is no single method for eliminating multicollinearity that is suitable in any case. This is due to the fact that the causes and consequences of multicollinearity are ambiguous and largely depend on the sample results.
Excluding variable (s) from the model
The simplest method for eliminating multicollinearity is to exclude one or a number of correlated variables from the model. Some caution is required when applying this method. In this situation, specification errors are possible, therefore, in applied econometric models, it is advisable not to exclude explanatory variables until multicollinearity becomes a serious problem.
Retrieving additional data or a new sample
Since multicollinearity depends directly on the sample, it is possible that with a different sample there will be no multicollinearity at all, or that it will not be so serious. Sometimes increasing the sample size is sufficient to reduce multicollinearity; for example, with annual data one can switch to quarterly data. Increasing the amount of data reduces the variances of the regression coefficients and thereby increases their statistical significance. However, obtaining a new sample or expanding the old one is not always possible or may involve serious costs. In addition, this approach can strengthen autocorrelation. These problems limit the use of this method.
Modifying the model specification
In some cases, the problem of multicollinearity can be solved by changing the model specification: either the shape of the model changes, or explanatory variables are added that were not taken into account in the original model, but significantly affect the dependent variable. If this method is justified, then its use reduces the sum of the squares of the deviations, thereby reducing the standard error of the regression. This leads to a reduction in the standard errors of the coefficients.
Using preliminary information about some parameters
Sometimes, when building a multiple regression model, you can use preliminary information, in particular, the known values ​​of some regression coefficients.
It is likely that the values ​​of the coefficients calculated for any preliminary (usually simpler) models or for a similar model based on a previously obtained sample can be used for the model being developed at the moment.
Selection of the most significant explanatory variables. The procedure of sequential inclusion of variables
Moving to fewer explanatory variables can reduce duplication of information delivered by highly interdependent features. This is exactly what we face in the case of multicollinear explanatory variables.
Let R_y.X = R_y(x1, x2, ..., xm) denote the multiple correlation coefficient between the dependent variable Y and the set of explanatory variables X1, X2, ..., Xm. It is defined as the ordinary pair correlation coefficient between Y and the linear regression function Y = b0 + b1X1 + b2X2 + ... + bmXm. Let Q = R⁻¹ be the matrix inverse to the correlation matrix R of the variables y, x1, ..., xm.

Then the squared coefficient R²y.X = R²y(x1, x2, ..., xm) can be calculated by the formula

R²y.X = 1 − 1 / q_yy,

where q_yy is the diagonal element of Q = R⁻¹ corresponding to the variable y.
The estimate R̂²y.X of the coefficient of determination R²y.X corrected for bias has the form

R̂²y.X = 1 − (1 − R²y.X)·(n − 1)/(n − m − 1).    (6.7)

(If formula (6.7) yields a negative number, the corrected coefficient is taken to be equal to zero.)

The lower confidence limit R²min for R²y.X is determined by formula (6.8).

In practice, when deciding which explanatory variables should be included in the model, the procedure of sequential inclusion of variables is often used.

At the first step, the coefficients R²y.xj (j = 1, 2, ..., m) are computed; each of them coincides with the square of the ordinary pair correlation coefficient r²(y, xj). Let

R²y.xp = max_j R²y.xj;

then the variable xp is the most informative. The bias-corrected coefficient (for m = 1) and its lower confidence limit R²min(1) are then calculated.

At the second step, the coefficients R²y.xp,xj are computed for the remaining variables; the pair (xp, xq) giving the maximum value is the most informative pair. The bias-corrected coefficient (for m = 2) and its lower confidence limit R²min(2) are then calculated.

The procedure continues until, at step (k + 1), the condition R²min(k + 1) < R²min(k) is fulfilled, i.e., the lower confidence limit stops increasing. The model then includes the k most informative variables obtained in the first k steps. Note that formulas (6.7) and (6.8) are used in the calculations, with the corresponding step number k taken instead of m.
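Before turning to the worked example below, here is a simplified sketch of the sequential inclusion idea; the adjusted R² is used as the stopping criterion instead of the lower confidence limit R²min, which is an assumption made to keep the sketch self-contained:

```python
import numpy as np
import statsmodels.api as sm

def sequential_inclusion(X, y):
    """Add regressors one at a time while the adjusted R^2 keeps increasing."""
    selected, remaining, best = [], list(range(X.shape[1])), -np.inf
    while remaining:
        step = [(sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit().rsquared_adj, j)
                for j in remaining]
        adj, j_best = max(step)
        if adj <= best:              # the criterion stopped growing -> stop
            break
        best = adj
        selected.append(j_best)
        remaining.remove(j_best)
    return selected                  # indices of the most informative variables
```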
In fact, this method does not guarantee that we will get rid of multicollinearity.
Other methods of eliminating multicollinearity are also used.
Example 6.1. There are the following conditional data (Table 6.1):

Table 6.1. Data for the sequential inclusion method

X1    X2    X3    Y
 1    1.5   0.7   12
 2    2.5   1.2   20
 3    1.0   1.4   15
 4    5.5   1.9   41
 5    3.0   2.5   33
 6    3.0   3.1   35
 7    2.8   3.5   38
 8    0.5   4.0   28
 9    4.0   3.8   47
10    2.0   5.3   40

Let us consider the effect of each explanatory variable on the dependent variable separately. Calculating the pair correlation coefficients, we find that the coefficient for x1 is the largest in absolute value, so x1 is the most informative single variable.

Then the bias-corrected coefficient (for m = 1) and its lower confidence limit R²min(1) are calculated.

Next, consider the effect of the pairs of variables (x1, x2) and (x1, x3) on the dependent variable. First, consider the influence of the pair (x1, x2).

Under the sequential inclusion procedure, two explanatory variables end up being included in the equation, so the theoretical equation takes the form of a regression of y on the two selected variables.
Ridge method
Consider the ridge method (ridge regression) for eliminating multicollinearity. The method was proposed by A. E. Hoerl in 1962 and is applied when the matrix (XᵀX) is close to degenerate. Some small number τ (from 0.1 to 0.4) is added to its diagonal elements. This yields biased estimates of the parameters of the equation, but in the presence of multicollinearity their standard errors are lower than those given by ordinary least squares.
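A minimal sketch of the ridge computation described above; following the text, the increment tau is added to every diagonal element of XᵀX, including the one for the intercept (in other treatments the intercept is usually left unpenalized):

```python
import numpy as np

def ridge_estimates(X, y, tau=0.1):
    """Biased ridge estimates: solve (X'X + tau*I) b = X'y instead of X'X b = X'y."""
    X1 = np.column_stack([np.ones(len(y)), X])        # design matrix with intercept
    XtX = X1.T @ X1
    return np.linalg.solve(XtX + tau * np.eye(XtX.shape[0]), X1.T @ y)
```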
Example 6.2. The initial data are presented in Table 6.2. The correlation coefficient of the explanatory variables is large in absolute value, which indicates strong multicollinearity.
Table 6.2. Data for the study of multicollinearity by the ridge method

x1    x2    Y
1     1.4   7
2     3.1   12

Then we obtain the equation y = 2.63 + 1.37x1 + 1.95x2. The diagonal elements of the inverse matrix decrease significantly: z00 = 0.45264, z11 = 1.57796, z22 = 0.70842, which leads to a decrease in the standard errors of the coefficients.
Summary
Among the main consequences that multicollinearity can lead to, the following can be distinguished:
  1. when testing the hypothesis that the multiple regression coefficients are insignificant using the t-test, in most cases it is accepted; however, the regression equation itself turns out to be significant when tested using the F-test, which indicates an overstated value of the multiple correlation coefficient;
  2. the obtained estimates of the coefficients of the multiple regression equation are generally unjustifiably overestimated or have incorrect signs;
  3. adding or excluding one or two observations from the initial data has a strong influence on the estimates of the model coefficients;
  4. the presence of multicollinearity in a multiple regression model can make it unsuitable for further use (for example, for making forecasts).
Self-test questions
  1. What is multicollinearity?
  2. What indicators indicate the presence of multicollinearity?
  3. What is the determinant of the matrix XTX in the case of perfect multicollinearity?
  4. What can be said about the meaning of the coefficients of the explanatory variables in the case of multicollinearity?
  5. What transformation is performed in the ridge method, and what does it lead to?
  6. What is the order of actions in the method of successively increasing the number of explanatory variables?
  7. What does the correlation coefficient show?
  8. What does the partial correlation coefficient show?

Ministry of Education and Science of the Russian Federation

Federal State Budgetary Educational Institution of Higher Education

TVER STATE TECHNICAL UNIVERSITY

Department of "Accounting and Finance"

COURSE PROJECT
in the discipline "Econometrics"

"Investigating multicollinearity in econometric models: excluding variable (s) from the model"

Work supervisor:

Cand. Tech. Sci., Associate Professor

Konovalova

Executor:

student of group EK-1315 EPO

Tver, 2015

Introduction

1. Analytical part

1.1. Generalized signs of multicollinearity in econometric models

1.2. The main ways to eliminate multicollinearity in econometric models

2. Design part

2.1. Information and methodological support of econometric research

2.2. An example of an econometric study

Conclusion

List of sources used

Introduction

The relevance of the topic of the work "Investigation of multicollinearity in econometric models: exclusion of variable (s) from the model" is due to the fact that nowadays this problem is often encountered in applied econometric models.

The subject of research is the problem of multicollinearity. The object of the research is econometric models.

The main goal of the work is to develop design solutions for information and methodological support of econometric research.

To achieve the goal, the following main research tasks were set and solved:

  1. Generalization of multicollinearity features in econometric models.
  2. Identification of the main ways to eliminate multicollinearity.

3. Development of information and methodological support for econometric research.

  1. Analytical part

1.1. Generalized signs of multicollinearity in econometric models

Multicollinearity — in econometrics (regression analysis) — is the presence of a linear relationship between the explanatory variables (factors) of a regression model. One distinguishes complete collinearity, which means a functional (identical) linear dependence, and partial or simply multicollinearity — the presence of a strong correlation between the factors.

Complete collinearity leads to indeterminacy of the parameters in a linear regression model regardless of the estimation method. Consider this using the following linear model as an example:

y = b1·x1 + b2·x2 + b3·x3 + ε.

Let the factors of this model be identically related as follows: x1 = x2 + x3. Now take the original linear model, add an arbitrary number a to the first coefficient, and subtract the same number from the other two coefficients. Then (ignoring the random error):

(b1 + a)·x1 + (b2 − a)·x2 + (b3 − a)·x3 = b1·x1 + b2·x2 + b3·x3 + a·(x1 − x2 − x3) = b1·x1 + b2·x2 + b3·x3.

Thus, despite the seemingly arbitrary change in the coefficients, we obtain the same model. The model is fundamentally unidentifiable: the uncertainty is built into the model itself. If we consider the three-dimensional space of coefficients, the vector of true coefficients in this case is not unique; there is a whole straight line of such vectors, and any point on this line is a true coefficient vector.
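A tiny numerical illustration of this non-identifiability (the numbers are made up; the relation x1 = x2 + x3 holds exactly):

```python
import numpy as np

x2 = np.array([1.0, 2.0, 3.0, 4.0])
x3 = np.array([0.5, 1.0, 1.5, 2.0])
x1 = x2 + x3                                   # exact collinearity

b     = np.array([2.0, 1.0, 3.0])              # one set of coefficients
a     = 10.0                                   # arbitrary shift
b_alt = np.array([b[0] + a, b[1] - a, b[2] - a])

X = np.column_stack([x1, x2, x3])
print(np.allclose(X @ b, X @ b_alt))           # True: both vectors fit the data equally well
```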

If complete collinearity leads to indeterminacy of the parameter values, partial multicollinearity leads to instability of their estimates. The instability is expressed as an increase in statistical uncertainty — the variance of the estimates. This means that the specific estimation results can vary greatly from sample to sample even when the samples are homogeneous.

As is known, the covariance matrix of the least squares estimates of the multiple regression parameters is V(b̂) = σ²·(XᵀX)⁻¹. Thus, the "smaller" the matrix XᵀX (its determinant), the "larger" the covariance matrix of the parameter estimates and, in particular, the larger the diagonal elements of this matrix, i.e., the variances of the parameter estimates. For clarity, consider a two-factor model:

y = b0 + b1·x1 + b2·x2 + ε.

Then the variance of the parameter estimate for, say, the first factor equals

Var(b̂1) = σ² / ( Σ(x1i − x̄1)² · (1 − r²12) ),

where r12 is the sample correlation coefficient between the factors.

It is clearly seen here that the larger the correlation between the factors in absolute value, the larger the variance of the parameter estimates. As |r12| → 1 (full collinearity), the variance tends to infinity, which corresponds to what was said earlier.
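For illustration (with a made-up value of the correlation): if r12 = 0.9, then 1 − r12² = 0.19 and the variance of b̂1 is 1/0.19 ≈ 5.3 times larger than it would be for uncorrelated factors; for r12 = 0.99 the inflation factor is already about 50.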

Thus, the parameter estimates turn out to be imprecise, which means that it will be difficult to interpret the influence of particular factors on the explained variable. At the same time, multicollinearity does not affect the quality of the model as a whole: the model can be recognized as statistically significant even when all of its coefficients are insignificant (this is one of the signs of multicollinearity).

In linear models, the correlation coefficients between parameters can be positive and negative. In the first case, an increase in one parameter is accompanied by an increase in another parameter. In the second case, when one parameter increases, the other decreases.

Based on this, one can distinguish acceptable and unacceptable multicollinearity. Multicollinearity is unacceptable when there is a significant positive correlation between factors 1 and 2 and the influence of each factor on the dependent variable y is unidirectional, i.e., an increase in both factor 1 and factor 2 leads to an increase (or decrease) in y. In other words, both factors act on y in the same way, and the significant positive correlation between them may allow one of them to be excluded.

Multicollinearity is acceptable when the factors affect y differently. Two cases are possible here:

a) with a significant positive correlation between the factors, their influence on y is multidirectional, i.e., an increase in one factor leads to an increase in y while an increase in the other factor leads to a decrease in y;

b) with a significant negative correlation between the factors, an increase in one factor is accompanied by a decrease in the other, and this makes the effect of the factors ambiguous, so any sign of the influence of the factors on y is possible.

In practice, some of the most characteristic signs of multicollinearity are distinguished:

1. A small change in the initial data (for example, adding new observations) leads to a significant change in the estimates of the model coefficients.

2. The estimates have large standard errors and low significance, while the model as a whole is significant (a high value of the coefficient of determination R² and of the corresponding F-statistic).

3. The estimates of the coefficients have signs that are incorrect from the theoretical point of view or unjustifiably large values.

Indirect signs of multicollinearity are high standard errors of estimates of model parameters, small t-statistics (that is, insignificant coefficients), incorrect signs of estimates, while the model as a whole is recognized as statistically significant (large value of F-statistics). Multicollinearity can also be evidenced by a strong change in parameter estimates from the addition (or removal) of sample data (if the requirements for sufficient sample homogeneity are met).

To detect multicollinearity of the factors, the correlation matrix of the factors can be analyzed directly. The presence of pair correlation coefficients that are large in absolute value (above 0.7-0.8) already indicates possible problems with the quality of the estimates obtained.

However, the analysis of pair correlation coefficients is insufficient. One should also analyze the coefficients of determination R²j of the regressions of each factor on all the other factors. It is recommended to calculate the variance inflation factor VIF_j = 1/(1 − R²j); values of this indicator that are too high indicate the presence of multicollinearity.

Thus, the main criteria for detecting multicollinearity are as follows: a high R² combined with insignificant coefficients, high pair correlation coefficients, and high values of the VIF indicator.

1.2. The main ways to eliminate multicollinearity in econometric models

Before indicating the main methods for eliminating multicollinearity, we note that in a number of cases multicollinearity is not a serious problem that requires significant efforts to identify and eliminate it. Basically, it all depends on the objectives of the study.

If the main task of the model is to predict future values of the regressand, then with a sufficiently large coefficient of determination R² (> 0.9) the presence of multicollinearity usually does not affect the predictive qualities of the model, although this holds only if the correlated regressors retain the same relationships in the future as before. If the goal of the study is to determine the degree of influence of each regressor on the regressand, then multicollinearity, which inflates the standard errors, is likely to distort the true relationships between the variables. In this situation multicollinearity is a serious problem.

Note that there is no single method for eliminating multicollinearity that is suitable in any case. This is due to the fact that the causes and consequences of multicollinearity are ambiguous and largely depend on the sample results.

In practice, the main methods for eliminating multicollinearity are distinguished:

  1. Eliminating regressors from the model The simplest method for eliminating multicollinearity is to exclude one or a number of correlated regressors from the model. However, some caution is needed when applying this method. In this situation, specification errors are possible. For example, when studying the demand for a certain good, the price of this good and the prices of substitutes for this good, which are often correlated with each other, can be used as explanatory variables. By excluding the prices of substitutes from the model, we are more likely to make a specification error. As a result, biased estimates can be obtained and unreasonable conclusions can be drawn. Thus, in applied econometric models, it is desirable not to exclude regressors until their collinearity becomes a serious problem.
  2. Obtaining additional data or a new sample. Since multicollinearity depends directly on the sample, it is possible that with a different sample there will be no multicollinearity at all, or that it will not be so serious. Sometimes increasing the sample size is sufficient to reduce multicollinearity; for example, with annual data one can switch to quarterly data. Increasing the amount of data reduces the variances of the regression coefficients and thereby increases their statistical significance. However, obtaining a new sample or expanding the old one is not always possible or may involve serious costs. In addition, this approach can strengthen autocorrelation. These problems limit the use of this method.

3. Changing the model specification. In some cases the multicollinearity problem can be solved by changing the model specification: either the form of the model is changed, or new regressors are added that were not taken into account in the original model but significantly affect the dependent variable. If this method is justified, its use reduces the sum of squared deviations, thereby reducing the standard error of the regression. This leads to a reduction in the standard errors of the coefficients.

  4. Transformation of variables. In some cases the problem of multicollinearity can be minimized or eliminated altogether by transforming the variables, for example by dividing the original data by the values of one of the interdependent regressors. Applying the method of principal components to the factors of the model makes it possible to transform the initial factors and obtain a set of orthogonal (uncorrelated) factors; in the presence of multicollinearity this allows one to restrict attention to a small number of principal components. Nevertheless, the problem of a meaningful interpretation of the principal components may arise.

If by all indications there is multicollinearity, then among econometricians there are different opinions on this matter. When faced with the problem of multicollinearity, there may be a natural desire to discard the “unnecessary” independent variables that may be causing it. However, it should be remembered that new difficulties may arise in doing so. First, it is far from always clear which variables are redundant in this sense.

Multicollinearity means only an approximate linear relationship between factors, but this does not always highlight the "extra" variables. Second, in many situations, the removal of any independent variables can significantly affect the meaning of the model. Finally, discarding the so-called essential variables, i.e. independent variables that actually affect the studied dependent variable, leads to a bias in the coefficients of the model. In practice, usually when multicollinearity is detected, the least significant factor for the analysis is removed, and then the calculations are repeated.

Thus, in practice the main methods for eliminating multicollinearity are: changing or increasing the sample, excluding one of the variables, and transforming the multicollinear variables (using nonlinear forms, using aggregates — linear combinations of several variables, or using first differences instead of the variables themselves). If multicollinearity cannot be eliminated, it can be ignored, taking into account the advisability of excluding variables.

  2. Design part

2.1. Information and methodological support of econometric research

Information support of econometric research includes the following information:

Input information:

  • statistical data on the socio-economic indicator, defined as a dependent variable (factors - results);
  • statistical data on socio-economic indicators, defined as explanatory variables (factors - signs);

Intermediate information:

  • a model of the regression equation, the estimated regression equation, quality indicators and a conclusion about the quality of the regression equation, a conclusion about the presence (absence) of a multicollinearity problem, recommendations for using the model;

Effective information:

  • the estimated regression equation, the conclusion about the quality of the regression equation, the conclusion about the presence (absence) of the multicollinearity problem, recommendations for the application of the model.

The econometric research methodology is as follows: specification; parameterization, verification, additional research, forecasting.

1. The specification of the regression equation model includes a graphical analysis of the correlation between the dependent variable and each explanatory variable. Based on the results of the graphical analysis, a conclusion is made as to whether the regression equation model is of a linear or a nonlinear type. For the graphical analysis, the MS Excel Scatter Chart tool is most commonly recommended. As a result of this stage, the model of the regression equation is determined and, in the case of a nonlinear form, methods for its linearization are also chosen.

2. Parametrization of the regression equation includes the estimation of the regression parameters and their socio-economic interpretation. For parameterization use the tool "Regression" as part of the add-ins "Data Analysis" MsExcel. Based on the results of automated regression analysis (column "Coefficients"), regression parameters are determined, and their interpretation is also given according to the standard rule:

Bj is the amount by which the value of the variable Y changes on average as the independent variable Xj increases by one, ceteris paribus.

The intercept of the regression equation is equal to the predicted value of the dependent variable Y when all the independent variables are zero.

3. Verification of the regression equation is carried out on the basis of the results of the automated regression analysis (stage 2) using the following indicators: "R-square", "Significance F", and the "P-value" for each parameter of the regression, as well as the line fit plots and residual plots.

The significance of the coefficients is determined and the quality of the model is assessed. For this, "Significance F", the "P-values", and "R-square" are considered. If a "P-value" is less than the chosen significance level, this indicates that the corresponding coefficient is significant. If the "R-square" is greater than 0.6, the regression model describes the behavior of the dependent variable Y well in terms of the factor variables.

If "Significance F" is less than the chosen significance level, then the coefficient of determination (R-square) is considered conditionally statistically significant.

The residual plot allows one to assess the variation in the errors. If there are no marked differences between the errors corresponding to different values of Xi, i.e., the variation of the errors is approximately the same for different values of Xi, it can be assumed that there are no problems. The line fit plot allows one to form a judgment about the initial, predicted, and factor values.

In conclusion, a judgment is formed about the quality of the regression equation.

  1. Additional research.

4.1 Detection of the first sign of multicollinearity. Based on the results of the regression analysis obtained in steps 2-3, a check is made for the situation in which the coefficient of determination is high (R² > 0.7) and statistically significant (Significance F < 0.05), while at least one of the regression coefficients cannot be recognized as statistically significant (P-value > 0.05). When such a situation is detected, it is concluded that multicollinearity may be present.

4.2 Detection of the second sign of multicollinearity. Based on the calculated correlation coefficients between the factor variables, significant relationships between individual factors are identified. For the calculations in MS Excel it is advisable to use the Data Analysis/Correlation tool. Conclusions are drawn from the values of the correlation coefficient: the closer r is to the extreme points (±1), the greater the degree of linear relationship; if the correlation coefficient is less than 0.5, the relationship is considered weak. The presence of multicollinearity is assumed if there is a significant correlation coefficient between at least two variables (i.e., greater than 0.7 in absolute value).

4.3 Detection of the third sign of multicollinearity. Auxiliary regressions are estimated between the factor variables for which a significant correlation coefficient was found (step 4.2); multicollinearity is considered present if at least one auxiliary regression has a high and significant coefficient of determination. The method of auxiliary regressions is as follows: 1) regression equations are constructed relating each of the regressors to all the remaining ones; 2) the coefficient of determination R² is calculated for each such equation; 3) if the equation and its coefficient of determination are statistically significant, the corresponding regressor is considered to lead to multicollinearity.
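A sketch of step 4.3 in Python rather than MS Excel (the function name and the 0.7 threshold on R², which mirrors the correlation threshold used above, are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

def auxiliary_regressions(X, names, alpha=0.05, r2_threshold=0.7):
    """Flag regressors whose auxiliary regression on the others is high and significant."""
    flagged = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        fit = sm.OLS(X[:, j], sm.add_constant(others)).fit()
        if fit.f_pvalue < alpha and fit.rsquared > r2_threshold:
            flagged.append(names[j])           # this regressor leads to multicollinearity
    return flagged
```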

4.4 Generalization of judgments.

On the basis of clauses 4.1-4.3, a judgment is formed about the presence / absence of multicollinearity and regressors leading to multicollinearity.

Further, the directions of using the model are formed (in the case of ignoring or the absence of the problem of multicollinearity) or recommendations for eliminating multicollinearity (in practice, excluding a variable).

When excluding a variable, it is advisable to use the rule:

The coefficient of determination R²1 is determined for the regression equation originally constructed from the n observations;

By excluding the last k variables from consideration, an equation is formed for the remaining factors on the basis of the same n observations, and its coefficient of determination R²2 is determined;

The F-statistic is calculated as F = ((R²1 − R²2)/k) / ((1 − R²1)/(n − m − 1)), where (R²1 − R²2) is the loss of explained variation resulting from dropping the k variables, k is the number of additional degrees of freedom that appear, and (1 − R²1)/(n − m − 1) is the unexplained variance of the initial equation;

The critical value F(a; k; n − m − 1) is determined from the tables of critical points of the Fisher distribution at a given significance level a and degrees of freedom v1 = k, v2 = n − m − 1;

A judgment on the expediency of the exclusion is formed according to the rule: the (simultaneous) exclusion of the k variables from the equation is considered inexpedient if F > F(a; k; n − m − 1); otherwise such an exclusion is admissible.

When a variable is excluded, the resulting model is analyzed in accordance with steps 3-4 and compared with the original model; as a result, the "best" model is selected. In practice, since multicollinearity does not affect the predictive qualities of the model, this problem can also simply be ignored.
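A sketch of the exclusion F-test written out above (scipy is used for the critical value; the function name is illustrative, the parameters follow the notation of the rule):

```python
from scipy.stats import f

def exclusion_f_test(r2_full, r2_reduced, n, m, k, alpha=0.05):
    """F test for dropping k variables from an equation with m regressors and n observations."""
    F = ((r2_full - r2_reduced) / k) / ((1 - r2_full) / (n - m - 1))
    F_crit = f.ppf(1 - alpha, k, n - m - 1)
    return F, F_crit, F < F_crit        # True -> the exclusion is admissible
```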

5. Forecasting is carried out according to the initial / "best" model selected in paragraph 4.4, according to the retrospective forecasting scheme, in which the last 1/3 of observations are used for forecasting.

5.1. Point forecast. The actual values ​​of the factor variables in the forecast period are considered predicted, the predicted values ​​of the resultant variable are determined as predicted by the original / "best" model based on the factor variables in the forecast period. Using the Microsoft Excel "Graph" tool, a graph of the actual and predicted values ​​of the resultant variable is plotted according to observations and a conclusion is made about the proximity of the actual values ​​to the predicted ones.

5.2. Interval forecasting involves calculating prediction standard errors (using Salkever dummy variables) and the upper and lower bounds of the predicted values.

Using the Microsoft Excel Data Analysis / Regression tool, a regression is built for the aggregate dataset of the sample and the forecast period, but with the addition of dummy variables D 1, D 2, ..., D p. In this case, D i = 1 only for the moment of observation (n + i), for all other moments D i = 0. Then the coefficient of the dummy variable D i is equal to the prediction error at time (n + i), and the standard error of the coefficient is equal to the prediction standard error (S i). Thus, an automated regression analysis of the model is carried out, where the aggregate (sample and predicted) values ​​of the factor variables and the values ​​of Salkever dummy variables are used as the X values, and the aggregate (sample and predicted) values ​​of the resultant variable are used as the Y values.

The obtained standard errors of the coefficients for the Salkever dummy variables are equal to the prediction standard errors. The boundaries of the interval forecast are then calculated by the formulas: Ymin(n+i) = Ypred(n+i) − S_i·t_cr, Ymax(n+i) = Ypred(n+i) + S_i·t_cr, where t_cr is the critical value of the Student distribution, determined by the formula "=STYURASPOBR(0.05; n−m−1)" (the Russian-language Excel equivalent of TINV), m is the number of explanatory factors in the model, and Ypred(n+i) are the predicted values of the resultant variable (step 5.1).

Using the Microsoft Excel "Graph" tool, a graph is built according to the actual and predicted values ​​of the resultant variable, the upper and lower bounds of the forecast for observations. A conclusion is made about the fit of the actual values ​​of the resultant variable into the boundaries of the interval forecast.

5.3. The assessment of the stability of the model using the NCO test is carried out as follows:

a) using the Microsoft Excel "Data Analysis / Regression" tool, a regression is built, where the aggregate (sample and predicted) values ​​of the factor variables are taken as the X values, and the aggregate (sample and predicted) values ​​of the resultant variable are taken as the Y values. This regression is used to determine the sum of the squares of the residuals S;

b) according to the regression of clause 5.2 with Salkever dummy variables, the sum of the squares of the residuals Sd is determined;

c) the value of the F-statistic is calculated and assessed by the formula

F = ((S − Sd) / p) / (Sd / (n − m − 1)),

where p is the number of forecast steps. If the obtained value is greater than the critical value F_cr, determined by the formula "=FDISP(0.05; p; n-m-1)", then the hypothesis of the stability of the model in the forecast period is rejected; otherwise it is accepted.

5.4. Generalization of judgments about the predictive qualities of the model on the basis of clauses 5.1-5.3, as a result, a conclusion is formed on the predictive quality of the model and recommendations for using the model for forecasting.

Thus, the developed information and methodological support corresponds to the main objectives of the econometric study of the problem of multicollinearity in multiple regression models.

2.2. An example of an econometric study

The study is carried out on the basis of data reflecting the real macroeconomic indicators of the Russian Federation for the period 2003-2011. (table. 1), according to the method of clause 2.1.

Table 1

Household expenses (billion rubles) [Y]; Population (million people) [X1]; Money supply (billion rubles) [X2]; Unemployment rate (%) [X3]

1. Specification. The regression equation model is specified on the basis of a graphical analysis of the correlation between the dependent variable Y (household expenses) and each explanatory variable: X1 (population) (Fig. 1), X2 (money supply) (Fig. 2), and X3 (unemployment rate) (Fig. 3).

The graph of the correlation dependence between Y and X 1, presented in Figure 1, reflects a significant (R 2 = 0.71) inverse linear dependence of Y on X 1.

The graph of the correlation dependence between Y and X 2, presented in Figure 2, reflects a significant (R 2 = 0.98) direct linear dependence of Y on X 2.

The graph of the correlation dependence between Y and X 3, presented in Figure 3, reflects an insignificant (R 2 = 0.15) inverse linear dependence of Y on X 3.

Figure 1

Figure 2

Figure 3

As a result, a linear multiple regression model can be specified Y = b 0 + b 1 X 1 + b 2 X 2 + b 3 X 3.

2. Parameterization of the regression equation is carried out using the "Regression" tool in the "Data Analysis" add-in of MS Excel (Fig. 4).

Figure 4

The estimated regression equation is:

Y = 233983.8 − 1605.6X1 + 1.0X2 + 396.22X3.

In this case the regression coefficients are interpreted as follows: with an increase in the population by 1 million people, household expenses decrease by 1605.6 billion rubles; with an increase in the money supply by 1 billion rubles, household expenses increase by 1.0 billion rubles; with an increase in the unemployment rate by 1%, household expenses increase by 396.2 billion rubles. With zero values of the factor variables, household expenses would amount to 233,983.8 billion rubles, which perhaps has no economic interpretation.

3. Verification of the regression equation is carried out on the basis of the results of the automated regression analysis (stage 2).

So, "R-square" is equal to 0.998, i.e. the regression equation describes the behavior of the dependent variable by 99%, which indicates a high level of description of the equation. The "significance of F" is 2.14774253442155E-07, which indicates that the "R-square" is significant. The “P-Value” for b 0 is 0.002, which indicates that this parameter is significant. The “P-Value” for b 1 is 0.002, which indicates that this coefficient is significant. The “P-Value” for b 2 is 8.29103190343224E-07, which indicates that this coefficient is significant. The “P-Value” for b 3 is 0.084, which indicates that this coefficient is not significant.

Based on the plots of residuals, the residuals e are random values.

Based on the fitting plots, a conclusion is made about the proximity of the actual and predicted values ​​for the model.

So, the model is of good quality, while b 3 is not significant, so we can assume the presence of multicollinearity.

4. Additional research.

4.1. Detection of the first sign of multicollinearity. According to the regression analysis data (Fig. 5), the first sign of multicollinearity is present: the equation has a high and significant coefficient of determination R², while one of the coefficients is not significant. This suggests the presence of multicollinearity.

4.2 Detection of the second sign of multicollinearity.

Based on the calculations of the correlation coefficients between the factor variables, a significant relationship of individual factors is determined. (Table 2). The presence of multicollinearity is assumed in the following case if there is a significant correlation coefficient between at least two variables (i.e., greater than 0.5 in modulus).

Table 2. Pair correlation coefficients between the factor variables X1, X2, X3 (correlation matrix)

In our case, the correlation coefficient between X1 and X2 is −0.788, which indicates a strong dependence between the variables X1 and X2; the correlation coefficient between X1 and X3 is 0.54, which indicates a dependence above the average level between the variables X1 and X3.

As a result, the presence of multicollinearity can be assumed.

4.3 Detection of the third sign of multicollinearity.

Since in Section 4.2 a strong relationship was found between the variables X 1 and X 2, then auxiliary regression between these variables is analyzed (Fig. 5).

Figure 5

Since the "F Significance" is 0.01, which indicates that the "R-squared" and the auxiliary regression are significant, it can be assumed that the regressor X 2 leads to multicollinearity.

Since in Section 4.2 a relationship between the variables X 1 and X 3 was found above the average level, then auxiliary regression between these variables is analyzed (Fig. 6).

Figure 6

Since the "Significance F" is 0.13, which indicates that the "R-squared" and the auxiliary regression are not significant, it can be assumed that the regressor X 3 does not lead to multicollinearity.

So, according to the third feature, the presence of multicollinearity can be assumed.

4.4 Generalization of judgments.

According to the analysis of paragraphs 4.1-4.3, all three signs of multicollinearity were found, so it can be assumed with a high probability. At the same time, despite the assumption in Section 4.3 regarding the regressor leading to multicollinearity, it is possible to recommend the exclusion of X 3 from the original model, since X 3 has the smallest correlation coefficient with Y and the coefficient of this regressor is insignificant in the original equation. The results of the regression analysis after excluding X 3 are shown in Fig. 7.

Figure 7. Regression results after excluding X3

In this case, the F-statistic is calculated to check whether the exclusion is justified:

F_fact = 4.62,

while F_tab = F_0.05;1;5 = 6.61. Since F_fact < F_tab, the exclusion of the variable X3 is admissible.
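The same exclusion check could be performed as a partial F-test on nested models; the sketch below uses statsmodels' anova_lm and the assumed file and column names from above:

```python
# Sketch of the exclusion check: partial F-test comparing the full model
# (X1, X2, X3) with the restricted model (X1, X2).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("data.csv")                      # hypothetical file, as above
full = smf.ols("Y ~ X1 + X2 + X3", data=df).fit()
restricted = smf.ols("Y ~ X1 + X2", data=df).fit()

# F = ((RSS_restricted - RSS_full) / 1) / (RSS_full / df_resid_full)
print(anova_lm(restricted, full))                 # the F column gives F_fact
```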

Assessment of the quality of the linear multiple regression model Y = b0 + b1X1 + b2X2. The "R-squared" is 0.996, i.e. the regression equation explains about 99.6% of the variation in the dependent variable, which indicates a high descriptive quality of the equation. The "Significance F" is approximately 3.0·10⁻⁸, which indicates that the "R-square" is significant. The "P-Value" for b0 is 0.004, so this parameter is significant. The "P-Value" for b1 is 0.005, so this coefficient is significant. The "P-Value" for b2 is approximately 3.9·10⁻⁷, so this coefficient is significant. The estimated regression equation is:

Ŷ = 201511.7 − 1359.6·X1 + 1.01·X2

In this case, the regression coefficients are interpreted as follows: with an increase in the population by 1 million people, household expenditures decrease by 1359.6 billion rubles (the coefficient of X1 is negative); with an increase in the money supply by one unit, household expenditures increase by 1.01 billion rubles. With zero values of the factor variables, household expenditures would amount to 201511.7 billion rubles, which may have an economic interpretation.

So, the model Ŷ = 201511.7 − 1359.6·X1 + 1.01·X2 is of good quality and is recommended for forecasting as the "best" model in comparison with the original one.

5. Forecasting.

5.1 Point prediction. The actual values of the factor variables in the forecast period are taken as given, and the forecast values of the resultant variable are computed from the "best" model (Ŷ = 201511.7 − 1359.6·X1 + 1.01·X2) using these factor values. Using the Microsoft Excel "Graph" tool, a plot of the actual and predicted values of the resultant variable is built across the observations, and a conclusion is drawn about how close the actual values are to the predicted ones.

The predicted values of the factor variables are presented in Table 3.

Table 3

The forecast values of the resultant variable are computed from the "best" model (Ŷ = 201511.7 − 1359.6·X1 + 1.01·X2) using the factor variables in the forecast period. The forecast values are presented in Table 4; the actual values are added for comparison.

Table 4. Actual (empirical) and forecast values of the resultant variable Y

Figure 8 shows the actual and forecast values of the resultant variable, as well as the lower and upper boundaries of the forecast.

Figure 8

According to Fig. 8, the forecast retains an increasing trend, and all forecast values are close to the actual ones.
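A minimal sketch of the point forecast described in this subsection; the forecast-period values of X1 and X2 in the example are placeholders, not the figures from Table 3:

```python
# Sketch of the point forecast with the "best" model.
coefficients = (201511.7, -1359.6, 1.01)          # b0, b1, b2 from the fitted model

def predict(x1, x2, b=coefficients):
    """Point forecast Y_hat = b0 + b1*X1 + b2*X2."""
    return b[0] + b[1] * x1 + b[2] * x2

for x1, x2 in [(143.0, 24000.0), (142.8, 27000.0)]:   # hypothetical forecast inputs
    print(round(predict(x1, x2), 1))
```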

5.2. Interval forecast.

Using the Microsoft Excel "Data Analysis / Regression" tool, a regression is built on the combined data of the sample and the forecast period, with the addition of the dummy variables D1, D2, ..., Dp. Here Di = 1 only for observation (n + i), and Di = 0 for all other observations. The data are presented in Table 5, and the regression results in Fig. 9.

Table 5. Combined (sample and forecast) dataset with the dummy variables D1, ..., Dp

Figure 9. Regression with the Salkever dummy variables

The standard error of the coefficient of each dummy variable is then equal to the standard prediction error (Si): 738.5 for 2012, 897.1 for 2013, and 1139.4 for 2014.
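A sketch of this dummy-variable construction; the file names data.csv and forecast.csv and the column names are assumptions, and both files are assumed to contain the Y, X1 and X2 columns:

```python
# Sketch: stack the sample and forecast observations and add dummies with
# D_i = 1 only for the i-th forecast observation. The standard errors of the
# dummy coefficients are then the prediction standard errors S_i.
import pandas as pd
import statsmodels.api as sm

sample = pd.read_csv("data.csv")                  # n sample observations (hypothetical file)
forecast = pd.read_csv("forecast.csv")            # p forecast observations (hypothetical file)
combined = pd.concat([sample, forecast], ignore_index=True)

n, p = len(sample), len(forecast)
for i in range(p):
    combined[f"D{i + 1}"] = 0.0
    combined.loc[n + i, f"D{i + 1}"] = 1.0        # D_i = 1 only at observation n + i

X_cols = ["X1", "X2"] + [f"D{i + 1}" for i in range(p)]
reg = sm.OLS(combined["Y"], sm.add_constant(combined[X_cols])).fit()
print(reg.bse[[f"D{i + 1}" for i in range(p)]])   # S_1, ..., S_p
```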

The boundaries of the interval forecast are calculated in Table 6.

Table 6. Calculation of the interval forecast boundaries (actual Y, forecast Y, prediction standard error S_pr)

Based on Table 6 and using the Microsoft Excel "Graph" tool, a plot of the actual and predicted values of the resultant variable and of the upper and lower forecast boundaries is built across the observations (Fig. 10).

Figure 10

According to the graph, the predicted values fall within the boundaries of the interval forecast, which indicates good forecast quality.
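The boundaries in Table 6 have the form (point forecast ± t_crit·S_i); a small sketch of this calculation, assuming 5 residual degrees of freedom (consistent with the critical values used in this section):

```python
# Sketch of the interval forecast half-widths t_crit * S_i, using the
# prediction standard errors quoted in the text.
from scipy import stats

s_pred = {2012: 738.5, 2013: 897.1, 2014: 1139.4}
t_crit = stats.t.ppf(1 - 0.05 / 2, df=5)          # two-sided 95% critical value (df assumed)

for year, s in s_pred.items():
    half_width = t_crit * s
    print(year, round(half_width, 1))             # add to / subtract from the point forecast
```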

5.3. Evaluating the stability of the model using the NCO test is carried out as follows:

a) using the Microsoft Excel "Data Analysis / Regression" tool, a regression is built (Fig. 11) in which the X values are the combined (sample and forecast) values of the factor variables and the Y values are the combined (sample and forecast) values of the resultant variable. This regression gives the sum of squared residuals S = 2058232.333.

Figure 11. Regression on the combined (sample and forecast) data

b) from the regression of item 3.2 with the Salkever dummy variables (Fig. 9), the sum of squared residuals Sd = 1270272.697 is determined.

c) the value of the F statistic is calculated and evaluated:

F_fact = ((S − Sd) / p) / (Sd / (n − m − 1)) = ((2058232.333 − 1270272.697) / 3) / (1270272.697 / 5) ≈ 1.03,

while F_cr = F_0.05;3;5 = 5.40. Since the obtained value is less than the critical value F_cr, the hypothesis about the stability of the model in the forecast period is accepted.
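A sketch of the stability statistic just computed; the values n = 8, m = 2 and p = 3 are assumptions consistent with the critical value F_0.05;3;5:

```python
# Sketch of the predictive-stability F statistic:
# F = ((S - S_d) / p) / (S_d / (n - m - 1)), with the residual sums of squares
# quoted in the text; n, m, p below are assumptions.
S = 2058232.333        # residual sum of squares, pooled regression without dummies
S_d = 1270272.697      # residual sum of squares, regression with Salkever dummies
n, m, p = 8, 2, 3

F_fact = ((S - S_d) / p) / (S_d / (n - m - 1))
print(round(F_fact, 2))                           # compare with F_0.05;3;5 = 5.40 quoted in the text
```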

5.4 Generalization of judgments about the predictive qualities of the model. Based on clauses 5.1-5.3, a conclusion is drawn about the high predictive quality of the model Ŷ = 201511.7 − 1359.6·X1 + 1.01·X2, and recommendations are given on using the model for forecasting.

The technique of clause 2.1 has been successfully tested; it makes it possible to identify the main signs of multicollinearity and can be recommended for similar studies.

Conclusion

Multicollinearity, in econometrics (regression analysis), is the presence of a linear relationship between the explanatory variables (factors) of a regression model. A distinction is made between complete collinearity, which means a functional (identical) linear relationship, and partial collinearity, or simply multicollinearity, which means a strong correlation between the factors.

The main consequences of multicollinearity are: large variances of the estimates; reduced t-statistics of the coefficients; unstable least squares estimates of the coefficients; difficulty in determining the contribution of individual variables; and coefficients with incorrect signs.

The main criteria for detecting multicollinearity are: a high R² with insignificant coefficients; high pairwise correlation coefficients; and high values of the VIF.

The main methods for eliminating multicollinearity are: excluding one or more variables from the model; obtaining additional data or a new sample; changing the model specification; and using prior information about some of the parameters.

The developed information and methodological support corresponds to the main objectives of the econometric study of the problem of multicollinearity in multiple regression models and can be recommended for such studies.





Multicollinearity

Multicollinearity is understood as a high mutual correlation of explanatory variables. Multicollinearity can manifest itself in functional (explicit) and stochastic (latent) forms.

In the functional form of multicollinearity, at least one of the pairwise relationships between the explanatory variables is a linear functional dependence. In this case, the matrix X'X is singular, since it contains linearly dependent column vectors and its determinant is equal to zero. This violates a premise of regression analysis and makes it impossible to solve the corresponding system of normal equations and to obtain estimates of the parameters of the regression model.

However, in economic research, multicollinearity often manifests itself in a stochastic form, when there is a close correlation between at least two explanatory variables. The matrix X'X in this case is nonsingular, but its determinant is very small.

At the same time, the vector of estimates b and its covariance matrix Σ_b are proportional to the inverse matrix (X'X)⁻¹, which means that their elements are inversely proportional to the determinant |X'X|. As a result, large standard deviations (standard errors) of the regression coefficients b0, b1, ..., bp are obtained, and assessing their significance by the t-criterion is meaningless, although the regression model as a whole may turn out to be significant by the F-criterion.

The estimates become very sensitive to small changes in the observations and in the sample size. In this case the regression equations, as a rule, have no real meaning, since some of their coefficients may have signs that are incorrect from the point of view of economic theory and unjustifiably large values.

There are no precise quantitative criteria for determining the presence or absence of multicollinearity. Nevertheless, there are some heuristic approaches to its detection.

One such approach is to analyze the correlation matrix between the explanatory variables X1, X2, ..., Xp and to identify pairs of variables with high correlation coefficients (usually greater than 0.8). If such variables exist, one speaks of multicollinearity between them. It is also useful to compute multiple coefficients of determination between one of the explanatory variables and some group of the others. A high multiple coefficient of determination (usually more than 0.6) indicates multicollinearity.

Another approach is to examine the matrix X'X. If the determinant of X'X or its minimum eigenvalue λ_min is close to zero (for example, of the same order of magnitude as the accumulated computational errors), this indicates the presence of multicollinearity. The same may be evidenced by a large deviation of the maximum eigenvalue λ_max of the matrix X'X from its minimum eigenvalue λ_min.
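A sketch of this determinant / eigenvalue check with numpy, using an artificial, deliberately near-collinear design matrix:

```python
# Sketch of the X'X diagnostics described above (the design matrix X is
# assumed to include a constant column).
import numpy as np

def xtx_diagnostics(X):
    """Return det(X'X) and its smallest and largest eigenvalues."""
    xtx = X.T @ X
    eigenvalues = np.linalg.eigvalsh(xtx)         # X'X is symmetric
    return np.linalg.det(xtx), eigenvalues.min(), eigenvalues.max()

# Example with an artificial, nearly collinear design matrix:
rng = np.random.default_rng(0)
x1 = rng.normal(size=20)
x2 = 2.0 * x1 + rng.normal(scale=0.01, size=20)   # almost a linear function of x1
X = np.column_stack([np.ones(20), x1, x2])
det, lam_min, lam_max = xtx_diagnostics(X)
print(det, lam_min, lam_max, lam_max / lam_min)   # small det / large ratio -> multicollinearity
```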

A number of methods are used to eliminate or reduce multicollinearity. The simplest (but far from always feasible) one is that, of two explanatory variables with a high correlation coefficient (greater than 0.8), one is excluded from consideration. Which variable to keep and which to remove is decided primarily on the basis of economic considerations. If, from an economic point of view, neither variable can be preferred, the one that has the larger correlation coefficient with the dependent variable is kept.

Another method of eliminating or reducing multicollinearity is to move from the unbiased estimates obtained by the least squares method to biased estimates that, however, have less scatter around the estimated parameter, i.e. a smaller expected squared deviation of the estimate b_j from the parameter β_j, M(b_j − β_j)².

The estimates determined by the vector b = (X'X)⁻¹X'Y have, in accordance with the Gauss-Markov theorem, the minimum variances in the class of all linear unbiased estimates; but in the presence of multicollinearity these variances may turn out to be too large, and turning to the corresponding biased estimates can increase the accuracy of estimating the regression parameters. The figure illustrates the case of a biased estimate β̂_j whose sampling distribution is given by the density φ(β̂_j).

Indeed, let the maximum admissible confidence interval for the estimated parameter β_j be (β_j − Δ, β_j + Δ). Then the confidence probability, or reliability of the estimate, determined by the area under the distribution curve over the interval (β_j − Δ, β_j + Δ), will in this case be greater for the estimate β̂_j than for b_j (in the figure these areas are shaded). Accordingly, the mean squared deviation from the estimated parameter is smaller for the biased estimate:

M(β̂_j − β_j)² < M(b_j − β_j)²

When ridge regression is used, instead of the unbiased estimates one takes the biased estimates given by the vector

β̂_τ = (X'X + τE_{p+1})⁻¹ X'Y,

where τ is some positive number called the "ridge",

and E_{p+1} is the identity matrix of order (p + 1).

Adding τ to the diagonal elements of the matrix X'X makes the estimates of the model parameters biased, but at the same time the determinant of the matrix of the system of normal equations increases: instead of |X'X| it becomes

|X'X + τE_{p+1}|.

Thus, it becomes possible to eliminate multicollinearity in the case when the determinant |X'X| is close to zero.
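A sketch of the ridge estimate in this notation (plain numpy; for simplicity the constant column is penalized together with the other columns, which is an assumption of the sketch):

```python
# Sketch of the ridge estimate beta_tau = (X'X + tau*E)^(-1) X'Y.
import numpy as np

def ridge_estimate(X, y, tau):
    """Biased ridge estimate with ridge parameter tau >= 0."""
    p1 = X.shape[1]                                # p + 1 columns, including the constant
    return np.linalg.solve(X.T @ X + tau * np.eye(p1), X.T @ y)

# With tau = 0 this reduces to the ordinary least squares estimate; a small
# positive tau moves |X'X + tau*E| away from zero and stabilizes the estimate.
```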

To eliminate multicollinearity, one can also pass from the original explanatory variables X1, X2, ..., Xn, which are interconnected by a fairly close correlation, to new variables that are linear combinations of the original ones. The new variables should be weakly correlated or uncorrelated. As such variables one can take, for example, the so-called principal components of the vector of the original explanatory variables, studied in component analysis, and consider regression on the principal components, in which the latter act as generalized explanatory variables subject to further meaningful (economic) interpretation.

The orthogonality of the principal components prevents the multicollinearity effect from manifesting itself. In addition, this method makes it possible to restrict attention to a small number of principal components even when the number of original explanatory variables is relatively large.
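A sketch of regression on principal components with scikit-learn and statsmodels; the standardization of the factors and the choice of two retained components are assumptions:

```python
# Sketch of principal components regression (PCR).
import statsmodels.api as sm
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pcr_fit(X_raw, y, n_components=2):
    """Regress y on the first principal components of the (standardized) factors."""
    Z = StandardScaler().fit_transform(X_raw)      # standardize the original factors
    components = PCA(n_components=n_components).fit_transform(Z)
    return sm.OLS(y, sm.add_constant(components)).fit()

# The principal components are orthogonal, so the multicollinearity effect does
# not arise in this regression; the coefficients then have to be interpreted
# (or transformed back) in terms of the original factors.
```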

Multicollinearity is a term used to describe the problem in which a close linear relationship between the explanatory variables leads to unreliable regression estimates. Of course, such a relationship does not necessarily give unsatisfactory estimates. If all other conditions are favorable, that is, if the number of observations and the sample variances of the explanatory variables are large and the variance of the random term is small, then quite good estimates may still be obtained.

Thus, multicollinearity must be caused by a combination of a close dependence and one (or more) unfavorable conditions, and it is a question of the severity of the phenomenon, not of its kind. Any regression estimate will suffer from it to some extent, unless all the explanatory variables are completely uncorrelated. Consideration of this problem begins only when it seriously affects the results of the regression estimation.

This problem is common in time series regressions, that is, when data is composed of a series of observations over a period of time. If two or more explanatory variables have a strong temporal trend, then they will be closely correlated, and this can lead to multicollinearity.


What can be done in this case?

The various methods that can be used to mitigate multicollinearity fall into two categories: the first consists of attempts to improve the degree to which the four conditions that ensure the reliability of the regression estimates are satisfied; the second is the use of external information. If directly obtained data are being used, then, first of all, it would obviously be useful to increase the number of observations.

If you are using time series data, you can do this by shortening the length of each time period. For example, when evaluating the demand function equations in Exercises 5.3 and 5.6, you can switch from using annual data to quarterly data.

Then, instead of 25 observations, there will be 100. This is so obvious and so easy to do that most researchers who use time series almost automatically take quarterly data, if available, instead of annual data, even when the problem of multicollinearity does not arise, simply in order to minimize the theoretical variances of the regression coefficients. There are, however, potential problems with this approach. Autocorrelation may be introduced or strengthened, but it can be neutralized. In addition, bias due to measurement errors may be introduced (or amplified) if the quarterly data are measured less precisely than the corresponding annual data. This problem is not easy to solve, but it may not be significant.