What does a low value of the multiple correlation coefficient mean? Calculating multiple correlation coefficients
Multiple correlation coefficient
If the partial correlation coefficients of the multiple regression model turn out to be significant, i.e. there really is a correlation between the resulting variable and the factor variables of the model, then constructing a multiple correlation coefficient is appropriate.
The multiple correlation coefficient characterizes the combined influence of all factor variables on the resulting variable in the multiple regression model.
The multiple correlation coefficient of the regression equation can be determined through the matrix of paired correlation coefficients:

$$R = \sqrt{1 - \frac{|\Delta r|}{|\Delta r_{11}|}},$$

where $|\Delta r|$ is the determinant of the matrix of paired correlation coefficients (the result together with the factors), and $|\Delta r_{11}|$ is the determinant of the interfactor correlation matrix.
As can be seen from the formulas, the value of the multiple correlation coefficient depends not only on the correlation of the result with each of the factors, but also on the interfactorial correlation. The considered formula makes it possible to determine the cumulative correlation coefficient without referring to the multiple regression equation, but using only paired correlation coefficients.
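As a rough illustration, the following sketch (assuming NumPy is available; the correlation values are purely hypothetical) evaluates the determinant formula above:

```python
import numpy as np

# Hypothetical pairwise correlations: row/column 0 is the result y,
# the remaining rows/columns are the factors x1, x2.
delta_r = np.array([
    [1.00, 0.72, 0.88],   # r(y,y),  r(y,x1),  r(y,x2)
    [0.72, 1.00, 0.92],   # r(x1,y), r(x1,x1), r(x1,x2)
    [0.88, 0.92, 1.00],
])
delta_r11 = delta_r[1:, 1:]   # interfactor correlation matrix

# R = sqrt(1 - |dr| / |dr11|)
R = np.sqrt(1 - np.linalg.det(delta_r) / np.linalg.det(delta_r11))
print(f"multiple correlation coefficient R = {R:.3f}")
```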
Table 17 - Results of calculations of the multiple correlation coefficient
Evaluation of the quality of the constructed model
The coefficient of multiple determination R² is the square of the multiple correlation coefficient.
The coefficient of multiple determination characterizes the share of the variation of the resulting variable, relative to its average level, that the constructed regression model explains, i.e. it shows the share of the total variance of the resulting variable explained by the variation of the factor variables included in the model. The greater the coefficient of multiple determination, the better the regression model characterizes the relationship between the variables.
For the coefficient of multiple determination, an inequality of the following form is always satisfied:

$$R^2_{y(x_1,\dots,x_{m+1})} \ge R^2_{y(x_1,\dots,x_m)}.$$
Therefore, the inclusion of an additional factor variable in the linear regression model does not reduce the value of the multiple determination coefficient.
Table 18 - Calculated coefficients of determination
To prevent exaggerating the tightness of the relationship, an adjusted multiple determination index is applied; it contains a correction for the number of degrees of freedom and is calculated by the formula:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-m-1},$$
where n is the sample size and m is the number of factor variables in the multiple regression equation. With a small number of observations, the unadjusted coefficient of multiple determination R² tends to overestimate the share of the variation in the resulting trait associated with the factors included in the regression model.
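A minimal sketch of this degrees-of-freedom correction (the R², n and m values are hypothetical):

```python
def adjusted_r2(r2: float, n: int, m: int) -> float:
    """Adjusted coefficient of determination:
    1 - (1 - R^2) * (n - 1) / (n - m - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - m - 1)

# Hypothetical fit: R^2 = 0.83 from n = 13 observations and m = 3 factors.
print(adjusted_r2(0.83, n=13, m=3))  # ~0.773, noticeably below 0.83
```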
Table 19 - Adjusted Multiple Determination Index
High values of the coefficient of determination R² indicate that the regression models approximate the initial data well, and such models can be used to predict the values of the resulting indicator.
To check the significance (quality) of a regression equation means to establish whether the mathematical model expressing the relationship between the variables matches the experimental data, and whether there are enough explanatory variables in the equation to describe the dependent variable. To form a general judgment about the quality of the model, the average approximation error is determined from the relative deviations over all observations. The adequacy of the regression equation (model) is checked using this average approximation error, whose value should not exceed 12-15% (the maximum allowable value).
The formula for calculating the average approximation error:

$$\bar{A} = \frac{1}{m}\sum_{i=1}^{m}\left|\frac{y_i - f(x_{i1}, x_{i2}, \dots, x_{in})}{y_i}\right|\cdot 100\%,$$

where m is the sample size; n is the number of variables in the multiple regression equation; f(x_i1, x_i2, …, x_in) is the i-th calculated value of the variable y; y_i is the i-th experimental value of the variable y.
Table 20 - Average approximation error
As the calculation results show, the average approximation errors do not exceed the allowable 12-15%, which indicates the adequacy of the obtained models.
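For illustration, a short sketch of the average approximation error (NumPy assumed; the observed and fitted values are hypothetical):

```python
import numpy as np

def mean_approximation_error(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Average approximation error, percent: mean of |y_i - f(x_i)| / y_i."""
    return float(np.mean(np.abs((y - y_hat) / y)) * 100)

# Hypothetical observed and fitted values:
y     = np.array([10.0, 12.5, 9.8, 14.1])
y_hat = np.array([10.6, 12.0, 10.3, 13.5])
print(f"A = {mean_approximation_error(y, y_hat):.1f}%")  # should stay below 12-15%
```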
Checking the significance of the coefficients of the linear multiple regression equation.
Checking the significance of individual coefficients of the equation means the following: if the coefficient of some variable is insignificant, the influence of that variable on the values of the resulting function y cannot be trusted. An insignificant coefficient should be set to zero, i.e. the corresponding variable should be excluded from further consideration.
To check the significance of each of the coefficients a_0, a_1, …, a_n, Student's t-statistic is used; its experimental value is calculated by the formula:

$$t_{a_i} = \frac{a_i}{S_{a_i}}, \quad (i = 0, 1, \dots, n), \qquad (18)$$

where a_i is the coefficient of the variable x_i and S_{a_i} is the root mean square error of this coefficient:

$$S_{a_i} = \frac{S_y}{S_{x_i}}\,\sqrt{\frac{1 - R^2}{(1 - R_i^2)(m - n - 1)}},$$

where S_y is the standard deviation of the values of the variable y; S_{x_i} is the standard deviation of the values of x_i; R² is the coefficient of multiple determination of the regression equation as a whole; R_i² is the coefficient of multiple determination characterizing the relationship between the factor x_i and the other factors (x_1, x_2, …, x_{i-1}, x_{i+1}, …, x_n) of the regression equation.
Each experimental value of the statistic is compared with the critical value t_cr (i = 1, 2, …, n), found in the Student's distribution table at a given significance level α and k = m − n − 1 degrees of freedom. Here, at the significance level α = 0.05 and k = 13 − 3 − 1 = 9, the critical value is t_cr = 2.26.
Table 21 - Calculated experimental values of t - Student statistics
If |t_{a_i}| > t_cr, the hypothesis about the significance of the coefficient a_i is not rejected, and the corresponding variable x_i remains in the equation. Otherwise, the coefficient a_i is considered insignificant and the corresponding variable should be excluded from the regression equation. Comparing the obtained experimental values with the critical one, we conclude that there are no insignificant coefficients in any of the four equations.
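A sketch of this check (SciPy assumed; the coefficient, its standard error and the sample dimensions are hypothetical):

```python
from scipy import stats

def coefficient_significant(a_i: float, s_ai: float,
                            n_obs: int, n_factors: int,
                            alpha: float = 0.05) -> bool:
    """Compare |a_i / S_ai| with the two-sided Student critical value
    at k = n_obs - n_factors - 1 degrees of freedom."""
    t_exp = abs(a_i / s_ai)
    k = n_obs - n_factors - 1
    t_cr = stats.t.ppf(1 - alpha / 2, k)   # 2.26 for k = 9, alpha = 0.05
    return t_exp > t_cr

# Hypothetical coefficient and its standard error:
print(coefficient_significant(1.8, 0.45, n_obs=13, n_factors=3))  # True
```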
Checking the Significance of a Linear Multiple Regression Equation as a Whole
If it turns out that at a given significance level α the equation is insignificant, it cannot be used, and the found dependence should be discarded.
To check the significance of the regression equation, Fisher's experimental F-statistic is used:

$$F = \frac{\dfrac{1}{n}\displaystyle\sum_{i=1}^{m}\bigl(f(x_{i1}, \dots, x_{in}) - \bar{y}\bigr)^2}{\dfrac{1}{m - n - 1}\displaystyle\sum_{i=1}^{m}\bigl(y_i - f(x_{i1}, \dots, x_{in})\bigr)^2},$$

where m is the sample size; n is the number of variables in the multiple regression equation; f(x_{i1}, x_{i2}, …, x_{in}) is the i-th calculated value of the variable y; ȳ is the average of the experimental values of the random variable Y.
The obtained experimental values of the Fisher criterion are compared with the critical value F_cr = F(α; k_1; k_2) at the selected significance level α. The numbers of degrees of freedom are k_1 = m − n − 1 and k_2 = n.
With the chosen significance level α = 0.05 and the numbers of degrees of freedom k_1 = 13 − 3 − 1 = 9 and k_2 = 3, the critical value is F_cr = 8.81.
Table 22 - Calculated experimental values of the Fisher criterion
Comparing the experimental values of the Fisher criterion with the critical one (at a significance level α = 0.05, F_cr = 8.81), all of them satisfy the inequality F_exp > F_cr, so it is concluded that with probability p = 1 − α = 0.95 all the equations are significant, which gives certain grounds to trust the constructed regression equations.
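An illustrative sketch of this test (NumPy and SciPy assumed; the observed and fitted values are hypothetical, and the critical value is taken in the conventional degrees-of-freedom order):

```python
import numpy as np
from scipy import stats

def f_experimental(y: np.ndarray, y_hat: np.ndarray, n_factors: int) -> float:
    """Experimental F: explained sum of squares per factor over residual
    sum of squares per remaining degree of freedom (m observations)."""
    m = len(y)
    explained = np.sum((y_hat - y.mean()) ** 2) / n_factors
    residual = np.sum((y - y_hat) ** 2) / (m - n_factors - 1)
    return explained / residual

# Hypothetical observed/fitted values for a 3-factor model, m = 13:
y     = np.array([10.0, 12.5, 9.8, 14.1, 11.2, 13.0, 9.5,
                  12.2, 10.8, 13.6, 11.9, 10.1, 12.7])
y_hat = np.array([10.4, 12.1, 10.2, 13.8, 11.0, 12.8, 9.9,
                  12.0, 11.1, 13.2, 11.6, 10.4, 12.4])
f_cr = stats.f.ppf(0.95, 3, 13 - 3 - 1)   # conventional order (n, m - n - 1)
print(f_experimental(y, y_hat, n_factors=3) > f_cr)  # True: equation significant
```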
Estimating Accuracy of a Linear Multiple Regression Equation
The final statistical procedure is the assessment of the accuracy of the constructed regression equations.
The closeness of the experimental values y_i of the random variable Y to its calculated values f(x_i), obtained from the linear regression equation, is estimated by the root mean square error, calculated by the formula:

$$S = \sqrt{\frac{1}{m - n - 1}\sum_{i=1}^{m}\bigl(y_i - f(x_{i1}, \dots, x_{in})\bigr)^2}.$$
Table 23 - Results of calculating the root mean square error of the equations
When studying complex phenomena, more than two random factors must be taken into account. A correct idea of the nature of the connection between these factors can be obtained only if all the considered random factors are examined at once. A joint study of three or more random factors allows the researcher to establish more or less reasonable assumptions about causal relationships between the studied phenomena. A simple form of multiple relationship is a linear relationship between three features. The random factors are denoted X_1, X_2 and X_3. The pairwise correlation coefficient between X_1 and X_2 is denoted r_12; correspondingly, between X_1 and X_3 it is r_13, and between X_2 and X_3 it is r_23. As a measure of the tightness of the linear relationship of three features, the multiple correlation coefficients, denoted R_{1·23}, R_{2·13}, R_{3·12}, and the partial correlation coefficients, denoted r_{12·3}, r_{13·2}, r_{23·1}, are used.
The multiple correlation coefficient R_{1·23} of three factors is an indicator of the closeness of the linear relationship between one of the factors (the index before the dot) and the combination of the two other factors (the indices after the dot).
The values of the coefficient R always lie in the range from 0 to 1. As R approaches one, the degree of linear connection between the three features increases.
Between the multiple correlation coefficient, for example R_{2·13}, and the two pair correlation coefficients r_12 and r_23 there is a relation: neither of the pair coefficients can exceed R_{2·13} in absolute value.
The formulas for calculating the multiple correlation coefficients from known values of the pair correlation coefficients r_12, r_13 and r_23 are:

$$R_{1\cdot23} = \sqrt{\frac{r_{12}^2 + r_{13}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{23}^2}}, \qquad
R_{2\cdot13} = \sqrt{\frac{r_{12}^2 + r_{23}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{13}^2}}, \qquad
R_{3\cdot12} = \sqrt{\frac{r_{13}^2 + r_{23}^2 - 2\,r_{12}\,r_{13}\,r_{23}}{1 - r_{12}^2}}.$$
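A small sketch of the first of these formulas (the pairwise coefficients are hypothetical):

```python
from math import sqrt

def multiple_r(r12: float, r13: float, r23: float) -> float:
    """R_{1.23}: tightness of the linear relationship between X1 and the
    pair (X2, X3), from the three pairwise correlation coefficients."""
    return sqrt((r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2))

# Hypothetical pairwise correlations:
print(round(multiple_r(0.6, 0.5, 0.3), 3))  # 0.687
```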
The square of the multiple correlation coefficient, R², is called the coefficient of multiple determination. It shows the proportion of variation of the dependent variable that is due to the influence of the studied factors.
The significance of multiple correlation is estimated by the F-criterion:

$$F = \frac{R^2}{1 - R^2}\cdot\frac{n - k}{k - 1},$$

where n is the sample size and k is the number of factors; in our case k = 3.
The null hypothesis that the multiple correlation coefficient in the population equals zero (H_0: R = 0) is accepted if F_f < F_t and rejected if F_f ≥ F_t.
The theoretical value of the F-criterion is determined for ν_1 = k − 1 and ν_2 = n − k degrees of freedom and the accepted significance level α (Appendix 1).
An example of calculating the multiple correlation coefficient. When studying the relationship between the factors, the following pair correlation coefficients were obtained (n = 15): r_12 = 0.6; r_13 = 0.3; r_23 = −0.2.
We need to determine the dependence of feature X_2 on features X_1 and X_3, i.e. to calculate the multiple correlation coefficient R_{2·13}:
The table value of the F-criterion at ν_1 = 2 and ν_2 = 15 − 3 = 12 degrees of freedom is F_0.05 = 3.89 at α = 0.05 and F_0.01 = 6.93 at α = 0.01.
Thus, the relationship between the features, R_{2·13} = 0.74, is significant at the 1% significance level, since F_f > F_0.01.
Judging by the coefficient of multiple determination R² = (0.74)² = 0.55, the variation of feature X_2 is 55% related to the influence of the studied factors, while 45% of the variation (1 − R²) cannot be explained by these variables.
Partial Linear Correlation
The partial correlation coefficient is an indicator that measures the degree of association between two features when the influence of the remaining features is eliminated.
Mathematical statistics makes it possible to establish the correlation between two features at a constant value of the third without setting up a special experiment, using the paired correlation coefficients r_12, r_13, r_23.
The partial correlation coefficients are calculated by the formulas:

$$r_{12\cdot3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}, \qquad
r_{13\cdot2} = \frac{r_{13} - r_{12}\,r_{23}}{\sqrt{(1 - r_{12}^2)(1 - r_{23}^2)}}, \qquad
r_{23\cdot1} = \frac{r_{23} - r_{12}\,r_{13}}{\sqrt{(1 - r_{12}^2)(1 - r_{13}^2)}}.$$
The numbers before the dot indicate between which features the dependence is studied, and the number after the dot indicates which feature's influence is excluded (eliminated). The error and the significance criterion of a partial correlation are determined by the same formulas as for pairwise correlation:

$$t_f = \frac{r\,\sqrt{n - 2}}{\sqrt{1 - r^2}}.$$
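A sketch of a first-order partial coefficient and its significance check (SciPy assumed; the degrees of freedom follow the ν = n − 2 convention used here, and the inputs anticipate the worked example below):

```python
from math import sqrt
from scipy import stats

def partial_r(r12: float, r13: float, r23: float) -> float:
    """First-order partial correlation r_{12.3}: the X1-X2 relationship
    with the influence of X3 eliminated."""
    return (r12 - r13 * r23) / sqrt((1 - r13**2) * (1 - r23**2))

def partial_significant(r: float, n: int, alpha: float = 0.05) -> bool:
    """t_f = r * sqrt(n - 2) / sqrt(1 - r^2) against Student's critical value."""
    t_f = abs(r) * sqrt(n - 2) / sqrt(1 - r**2)
    return t_f >= stats.t.ppf(1 - alpha / 2, n - 2)

r = partial_r(0.799, 0.57, 0.507)            # the worked example below: 0.720
print(round(r, 3), partial_significant(r, n=180))
```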
The theoretical value of the t-criterion is determined for ν = n − 2 degrees of freedom and the accepted significance level α (Appendix 1).
The null hypothesis that the partial correlation coefficient in the population equals zero (H_0: ρ = 0) is accepted if t_f < t_t and rejected if t_f ≥ t_t.
Partial correlation coefficients can take values between −1 and +1. Partial determination coefficients are found by squaring the partial correlation coefficients:

d_{12·3} = r²_{12·3}; d_{13·2} = r²_{13·2}; d_{23·1} = r²_{23·1}.
Determining the degree of particular influence of individual factors on the resulting feature, while excluding (eliminating) its connection with other features that distort this correlation, is often of great interest. Sometimes, at a constant value of the eliminated feature, no statistical effect of it on the variability of the other features can be noticed. To understand the technique of calculating the partial correlation coefficient, consider an example. There are three variables X, Y and Z. For a sample of size n = 180, the paired correlation coefficients were determined:

r_xy = 0.799; r_xz = 0.57; r_yz = 0.507.
Let us determine the partial correlation coefficients:

$$r_{xy\cdot z} = \frac{0.799 - 0.57\cdot 0.507}{\sqrt{(1 - 0.57^2)(1 - 0.507^2)}} = 0.720,$$

$$r_{xz\cdot y} = \frac{0.57 - 0.799\cdot 0.507}{\sqrt{(1 - 0.799^2)(1 - 0.507^2)}} = 0.318,$$

$$r_{yz\cdot x} = \frac{0.507 - 0.799\cdot 0.57}{\sqrt{(1 - 0.799^2)(1 - 0.57^2)}} = 0.105.$$
The partial correlation coefficient between parameters X and Y at a constant value of Z (r_{xy·z} = 0.720) shows that only a small part of the relationship between these features in the overall correlation (r_xy = 0.799) is due to the influence of the third feature Z. A similar conclusion must be made about the partial correlation coefficient between parameter X and parameter Z at a constant value of parameter Y (r_{xz·y} = 0.318 versus r_xz = 0.57). On the contrary, the partial correlation coefficient between parameters Y and Z at a constant value of parameter X, r_{yz·x} = 0.105, differs significantly from the overall correlation coefficient r_yz = 0.507. It can be seen that if we select objects with the same value of parameter X, the relationship between features Y and Z will be very weak, since a significant part of this relationship is due to the variation of parameter X.
Under some circumstances, the partial correlation coefficient may be opposite in sign to the paired one.
For example, when studying the relationship between features X, Y and Z, the following paired correlation coefficients were obtained (with n = 100): r_xy = 0.6; r_xz = 0.9; r_yz = 0.4.

The partial correlation coefficients when the influence of the third feature is excluded:

$$r_{xy\cdot z} = \frac{0.6 - 0.9\cdot 0.4}{\sqrt{(1 - 0.9^2)(1 - 0.4^2)}} = 0.60, \qquad
r_{yz\cdot x} = \frac{0.4 - 0.6\cdot 0.9}{\sqrt{(1 - 0.6^2)(1 - 0.9^2)}} = -0.40.$$
The example shows that the pair coefficient r_yz = 0.4 and the partial correlation coefficient r_{yz·x} = −0.40 differ in sign.
The partial correlation method also makes it possible to calculate second-order partial correlation coefficients. Such a coefficient indicates the relationship between the first and second features at constant values of the third and fourth. The second-order partial coefficient is determined from the first-order partial coefficients by the formula:

$$r_{12\cdot34} = \frac{r_{12\cdot4} - r_{13\cdot4}\,r_{23\cdot4}}{\sqrt{(1 - r_{13\cdot4}^2)(1 - r_{23\cdot4}^2)}},$$

where r_{12·4}, r_{13·4}, r_{23·4} are first-order partial coefficients, whose values are determined by the partial coefficient formulas using the pair correlation coefficients r_12, r_13, r_14, r_23, r_24, r_34.
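A sketch of this two-step construction (all pairwise coefficients are hypothetical):

```python
from math import sqrt

def partial_r(r_ab: float, r_ac: float, r_bc: float) -> float:
    """First-order partial correlation of a and b with c eliminated."""
    return (r_ab - r_ac * r_bc) / sqrt((1 - r_ac**2) * (1 - r_bc**2))

def partial_r2(r12_4: float, r13_4: float, r23_4: float) -> float:
    """Second-order partial r_{12.34}, built from first-order partials
    that all eliminate feature 4."""
    return (r12_4 - r13_4 * r23_4) / sqrt((1 - r13_4**2) * (1 - r23_4**2))

# Hypothetical pairwise correlations among X1..X4:
r12, r13, r14, r23, r24, r34 = 0.6, 0.5, 0.4, 0.3, 0.2, 0.1
r12_4 = partial_r(r12, r14, r24)
r13_4 = partial_r(r13, r14, r34)
r23_4 = partial_r(r23, r24, r34)
print(round(partial_r2(r12_4, r13_4, r23_4), 3))  # ~0.525
```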
To determine the degree of dependence between several indicators, multiple correlation coefficients are used. They are then summarized in a separate table called the correlation matrix. The names of the rows and columns of such a matrix are the names of the parameters whose mutual dependence is established; the corresponding correlation coefficients sit at the intersections of the rows and columns. Let's find out how such a calculation can be made using Excel tools.
It is customary to determine the level of relationship between various indicators as follows, depending on the correlation coefficient:
- 0 to 0.3: no connection;
- 0.3 to 0.5: weak connection;
- 0.5 to 0.7: average connection;
- 0.7 to 0.9: high (strong) connection;
- 0.9 to 1: very strong connection.
If the correlation coefficient is negative, then this means that the relationship of the parameters is inverse.
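Purely for illustration, a small helper that maps a coefficient onto this verbal scale:

```python
def connection_strength(r: float) -> str:
    """Verbal scale for |r| as listed above; the sign only marks
    an inverse relationship."""
    a = abs(r)
    if a < 0.3:
        return "no connection"
    if a < 0.5:
        return "weak connection"
    if a < 0.7:
        return "average connection"
    if a < 0.9:
        return "high (strong) connection"
    return "very strong connection"

print(connection_strength(-0.72))  # high (strong) connection, inverse
```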
To compile a correlation matrix in Excel, a single tool from the "Data Analysis" package is used; it is called "Correlation". Let's see how it can be used to calculate multiple correlation scores.
Stage 1: activating the analysis package
It must be said right away that the "Data Analysis" package is disabled by default. Therefore, before proceeding directly to calculating the correlation coefficients, you need to activate it. Unfortunately, not every user knows how to do this, so we will dwell on this issue.
After the specified action, the tool package "Data analysis" will be activated.
Stage 2: coefficient calculation
Now we can proceed directly to calculating the multiple correlation coefficient. Let's calculate the multiple correlation coefficients of the factors using, as an example, a table of indicators of labor productivity, capital-labor ratio and power-to-weight ratio at various enterprises.
Stage 3: analysis of the result
Now let's figure out how to interpret the result we got from processing the data with the "Correlation" tool in Excel.
As we can see from the table, the correlation coefficient between the capital-labor ratio (column 2) and the power-to-weight ratio (column 1) is 0.92, which corresponds to a very strong relationship. Between labor productivity (column 3) and the power-to-weight ratio (column 1) this indicator is 0.72, a high degree of dependence. The correlation coefficient between labor productivity (column 3) and the capital-labor ratio (column 2) is 0.88, which also corresponds to a high degree of dependence. Thus, we can say that the relationship between all the studied factors is quite strong.
As you can see, the package "Data analysis" in Excel is a very convenient and fairly easy-to-use tool for determining the multiple correlation coefficient. It can also be used to calculate the usual correlation between two factors.
The practical significance of the multiple regression equation is assessed using the multiple correlation indicator and its square - the coefficient of determination.
The coefficient of determination shows the share of the variation of the resulting trait that is under the influence of the factor traits, i.e. it determines what share of the variation of the trait y is taken into account in the model and is due to the influence of the factors included in it:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},$$

where ŷ_i are the values of y calculated by the regression model.
The multiple correlation coefficient can be found as the square root of the coefficient of determination. The closer the correlation coefficient is to one, the closer the relationship between the result and all the factors, and the better the regression equation describes the actual data. If the multiple correlation coefficient is close to zero, the regression equation describes the actual data poorly, and the factors have little effect on the result. Unlike the pairwise correlation coefficient, this coefficient cannot be used to interpret the direction of the relationship.
The value of the multiple correlation coefficient is greater than or equal to the value of the maximum pair correlation coefficient:

$$R_{yx_1\dots x_m} \ge \max_i\,|r_{yx_i}|.$$
For linear multiple regression, the multiple correlation coefficient can be calculated from the standardized regression coefficients β_i and the pair correlations of the result with the factors:

$$R_{yx_1\dots x_m} = \sqrt{\sum_i \beta_i\, r_{yx_i}}.$$

Accordingly, the multiple coefficient of determination is:

$$R^2 = \sum_i \beta_i\, r_{yx_i}.$$
There is another formula for calculating the multiple correlation coefficient for linear regression:

$$R = \sqrt{1 - \frac{|\Delta r|}{|\Delta r_{11}|}},$$

where |Δr| is the determinant of the full matrix of linear pair correlation coefficients (i.e. including the paired linear correlation coefficients of the factors with the result and among themselves):

$$\Delta r = \begin{vmatrix}
1 & r_{yx_1} & \cdots & r_{yx_m} \\
r_{yx_1} & 1 & \cdots & r_{x_1 x_m} \\
\vdots & \vdots & \ddots & \vdots \\
r_{yx_m} & r_{x_1 x_m} & \cdots & 1
\end{vmatrix},$$

and |Δr_{11}| is the determinant of the matrix of linear pair correlation coefficients of the factors among themselves:

$$\Delta r_{11} = \begin{vmatrix}
1 & r_{x_1 x_2} & \cdots & r_{x_1 x_m} \\
r_{x_1 x_2} & 1 & \cdots & r_{x_2 x_m} \\
\vdots & \vdots & \ddots & \vdots \\
r_{x_1 x_m} & r_{x_2 x_m} & \cdots & 1
\end{vmatrix}.$$
The adjusted coefficient of determination is also calculated:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - m - 1},$$
where n is the number of observations;
m is the number of parameters of the regression equation, not counting the free term (for linear regression, for example, this number equals the number of factors included in the model).
The adjusted coefficient of determination is used to solve two problems: assessing the real closeness of the relationship between the result and the factors, and comparing models with different numbers of parameters. In the first case, attention is paid to how close the adjusted and unadjusted coefficients of determination are. If these indicators are large and differ only slightly, the model is considered good.
When comparing different models, other things being equal, preference is given to the one that has a larger adjusted coefficient of determination.
It should be noted that the scope of the adjusted coefficient of determination is limited to these two tasks. It cannot be used in formulas where the usual coefficient of determination applies, and it cannot be interpreted as the share of the variance of the result explained by the variance of the factors included in the regression model.
To check the significance of the multiple correlation coefficient, Fisher's F-criterion is used, determined by the formula:

$$F = \frac{R^2}{1 - R^2}\cdot\frac{n - m - 1}{m},$$

where R² is the multiple coefficient of determination; n is the number of observations; m is the number of parameters with factors x in the multiple regression equation (in paired regression m = 1).
The obtained value of the F-criterion is compared with the table value at a given significance level and m and n − m − 1 degrees of freedom. If the calculated value of the F-criterion is greater than the table value, the multiple regression equation is recognized as significant.
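A minimal sketch of this check (SciPy assumed; the R², n and m values are hypothetical, loosely echoing the earlier three-feature example):

```python
from scipy import stats

def multiple_r_significant(r2: float, n: int, m: int,
                           alpha: float = 0.05) -> bool:
    """F = R^2 / (1 - R^2) * (n - m - 1) / m, compared with the table
    value at m and n - m - 1 degrees of freedom."""
    f_exp = r2 / (1 - r2) * (n - m - 1) / m
    f_table = stats.f.ppf(1 - alpha, m, n - m - 1)
    return f_exp > f_table

# Hypothetical: R^2 = 0.55 from n = 15 observations, m = 2 factor parameters.
print(multiple_r_significant(0.55, n=15, m=2))  # True
```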
The correlation coefficient is a measure of the degree of association between two variables. Its calculation gives an idea of whether there is a relationship between two data sets. Unlike regression, correlation does not make it possible to predict values; however, calculating the coefficient is an important step of preliminary statistical analysis. For example, having found that the correlation coefficient between the level of foreign direct investment and GDP growth is high, we get an idea that to ensure prosperity it is necessary to create a favorable climate specifically for foreign entrepreneurs. Not such an obvious conclusion at first glance!
Correlation and causality
Perhaps there is not a single area of statistics so firmly established in our lives. The correlation coefficient is used in all areas of public knowledge. Its main danger is that its high values are often exploited to convince people of some conclusion. However, a strong correlation does not at all indicate a causal relationship between the quantities.
Correlation coefficient: Pearson and Spearman formula
There are several main indicators that characterize the relationship between two variables. Historically, the first is Pearson's linear correlation coefficient, the one taught in school. It was developed by K. Pearson and G. Yule based on the work of F. Galton. This coefficient shows the linear relationship between quantitative variables. It always lies between −1 and 1: a negative value indicates an inverse relationship, a value of zero means there is no linear relationship between the variables, and a positive value means a direct relationship between the studied quantities. Spearman's rank correlation coefficient simplifies the calculations by ranking the values of the variables.
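For illustration, both coefficients can be computed with SciPy (the paired observations are hypothetical):

```python
from scipy import stats

# Hypothetical paired observations:
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8, 12.3]

pearson_r, p_p = stats.pearsonr(x, y)     # linear association
spearman_r, p_s = stats.spearmanr(x, y)   # rank-based association
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_r:.3f}")
```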
Relationships between variables
Correlation helps answer two questions. First, whether the relationship between the variables is positive or negative. Second, how strong the dependence is. Correlation analysis is a powerful tool for obtaining this important information. It is easy to see that household incomes and expenses rise and fall proportionally: such a relationship is considered positive. On the contrary, when the price of a product rises, demand for it falls: such a relationship is called negative. The values of the correlation coefficient lie between −1 and 1. Zero means there is no relationship between the studied values; the closer the indicator is to the extreme values, the stronger the relationship (negative or positive). A coefficient between −0.1 and 0.1 indicates the absence of dependence. It must be understood that such a value only indicates the absence of a linear relationship.
Application features
The use of both indicators is subject to certain assumptions. First, the presence of a strong relationship does not establish that one value determines the other; there may well be a third quantity that determines both. Second, a high Pearson correlation coefficient does not indicate a causal relationship between the studied variables. Third, it reflects an exclusively linear relationship. Correlation can be used to evaluate meaningful quantitative data (e.g., barometric pressure, air temperature) rather than categories such as gender or favorite color.
Multiple correlation coefficient
Pearson and Spearman investigated the relationship between two variables. But what if there are three or even more? This is where the multiple correlation coefficient comes in. For example, the gross national product is affected not only by foreign direct investment but also by the monetary and fiscal policies of the state, as well as the level of exports. The growth rate and volume of GDP are the result of the interaction of a number of factors. However, it should be understood that the multiple correlation model rests on a number of simplifications and assumptions. First, multicollinearity between the quantities is excluded. Second, the relationship between the dependent variable and the variables that affect it is assumed to be linear.
Areas of use of correlation and regression analysis
This method of finding the relationship between quantities is widely used in statistics. It is most often resorted to in three main cases:
- For testing causal relationships between the values of two variables. As a result, the researcher hopes to find a linear relationship and derive a formula that describes these relationships between quantities. Their units of measurement may be different.
- To check for a relationship between values. In this case, no one determines which variable is dependent. It may turn out that the value of both quantities determines some other factor.
- To derive an equation. In this case, you can simply substitute numbers into it and find out the values of the unknown variable.
A man in search of a causal relationship
Consciousness is arranged in such a way that we need to explain the events occurring around us. A person always looks for a connection between the picture of the world he lives in and the information he receives. The brain often creates order out of chaos: it can easily see a causal relationship where there is none. Scientists have to specifically learn to overcome this tendency. The ability to evaluate relationships between data objectively is essential in an academic career.
Media bias
Consider how the presence of a correlation can be misinterpreted. A group of badly behaved British students were asked whether their parents smoked, and the results were then published in a newspaper. They showed a strong correlation between parents' smoking and their children's delinquency. The professor who conducted the study even suggested putting a warning about this on cigarette packs. However, there are a number of problems with this conclusion. First, a correlation does not indicate which of the quantities is independent; it is therefore quite possible that the parents' pernicious habit is caused by the children's disobedience. Second, it is impossible to say with certainty that both problems did not arise from some third factor, for example, low family income. Finally, the emotional aspect of the professor's initial conclusions should be noted: he was an ardent opponent of smoking, so it is not surprising that he interpreted the results of his study this way.
Conclusions
Misinterpreting correlation as a causal relationship between two variables can lead to embarrassing research errors. The problem is that this tendency lies at the very core of human consciousness; many marketing tricks are based on it. Understanding the difference between causation and correlation allows you to analyze information rationally, both in everyday life and in a professional career.