
Correlation Analysis (Correlate)


Correlation and dependence

In statistics, correlation and dependence are any of a broad class of statistical relationships between two or more random variables or observed data values.

Correlation is summarized by what is known as the correlation coefficient, which ranges between -1 and +1. Perfect positive correlation (a correlation coefficient of +1) implies that as one security moves, either up or down, the other security will move in lockstep, in the same direction. Conversely, perfect negative correlation means that if one security moves in either direction, the perfectly negatively correlated security will move by an equal amount in the opposite direction. If the correlation is 0, the movements of the securities are said to have no correlation; they are completely random.
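As a quick numerical illustration (a minimal NumPy sketch; the series are invented), perfectly proportional series give coefficients of exactly +1 or -1:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfect positive correlation: y moves in lockstep with x.
r_pos = np.corrcoef(x, 2 * x + 3)[0, 1]    # +1.0

# Perfect negative correlation: y moves by an equal amount the other way.
r_neg = np.corrcoef(x, -2 * x + 3)[0, 1]   # -1.0

print(r_pos, r_neg)
```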

There are several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The most common of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables (which may be present even when one variable is a nonlinear function of the other).

Other correlation coefficients have been developed to be more robust than the Pearson correlation, or more sensitive to nonlinear relationships. Rank correlation coefficients, such as Spearman's rank correlation coefficient and Kendall's rank correlation coefficient (τ), measure the extent to which, as one variable increases, the other variable tends to increase, without requiring that increase to be represented by a linear relationship. If, as one variable increases, the other decreases, the rank correlation coefficients will be negative. It is common to regard these rank correlation coefficients as alternatives to Pearson's coefficient, used either to reduce the amount of calculation or to make the coefficient less sensitive to non-normality in distributions. However, this view has little mathematical basis, as rank correlation coefficients measure a different type of relationship than the Pearson product-moment correlation coefficient, and are best seen as measures of a different type of association, rather than as an alternative measure of the population correlation coefficient.
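The distinction is easiest to see on a monotonic but nonlinear relationship. In the sketch below (illustrative data; SciPy assumed available), y = exp(x) increases whenever x does, so both rank coefficients equal 1 while Pearson's r is noticeably smaller:

```python
import numpy as np
from scipy import stats

x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)                      # monotonic, but far from linear

r, _ = stats.pearsonr(x, y)        # < 1: only the linear part is captured
rho, _ = stats.spearmanr(x, y)     # 1.0: ranks of y rise with ranks of x
tau, _ = stats.kendalltau(x, y)    # 1.0: every pair of points is concordant

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```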

Common misconceptions

Correlation and causality

The conventional dictum that "correlation does not imply causation" means that correlation cannot be used to infer a causal relationship between the variables.


Correlation and linearity

Figure: Anscombe's quartet, four sets of data with the same correlation of 0.816.

The Pearson correlation coefficient indicates the strength of a linear relationship between two variables, but its value generally does not completely characterize their relationship. In particular, if the conditional mean of Y given X, denoted E(Y|X), is not linear in X, the correlation coefficient will not fully determine the form of E(Y|X).

The figure shows scatterplots of Anscombe's quartet, a set of four different pairs of variables created by Francis Anscombe. The four y variables have the same mean (7.5), standard deviation (4.12), correlation (0.816) and regression line (y = 3 + 0.5x). However, as can be seen in the plots, the distributions of the variables are very different. The first one (top left) seems to be distributed normally, and corresponds to what one would expect when considering two correlated variables under the assumption of normality. The second one (top right) is not distributed normally; while an obvious relationship between the two variables can be observed, it is not linear. In this case the Pearson correlation coefficient does not indicate that there is an exact functional relationship, only the extent to which that relationship can be approximated by a linear one. In the third case (bottom left), the linear relationship is perfect, except for one outlier which exerts enough influence to lower the correlation coefficient from 1 to 0.816. Finally, the fourth example (bottom right) shows another case where one outlier is enough to produce a high correlation coefficient even though the relationship between the two variables is not linear. (An outlier can either lower or raise the apparent correlation of the data; either way it distorts the true correlation, a situation we should strive to avoid.)

These examples indicate that the correlation coefficient, as a summary statistic, cannot replace the individual examination of the data.
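The quartet is also easy to verify numerically. The sketch below uses the standard published Anscombe (1973) values; all four pairs report essentially the same r even though only the first is well described by a line:

```python
import numpy as np

# Anscombe's quartet: four (x, y) pairs with near-identical summary statistics.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4   = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
ys = [
    [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
]
xs = [x123, x123, x123, x4]

for i, (x, y) in enumerate(zip(xs, ys), start=1):
    r = np.corrcoef(x, y)[0, 1]
    print(f"dataset {i}: r = {r:.3f}")   # ~0.816 for all four
```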


Correlation analysis is a statistical method for studying the "correlation relationship" between random variables. A correlation relationship is a non-deterministic one. For example, let X and Y denote a person's height and weight, or the amount of fertilizer applied and the yield per mu (a Chinese unit of area) of farmland; X and Y are clearly related, yet not so tightly that one can be determined precisely from the other. This is a correlation relationship. When X and Y are correlated, knowing the value x of X is not enough to determine the value of Y, but it does determine the conditional distribution of Y given X = x; conversely, a value of Y determines the conditional distribution X | Y = y. This kind of dependence is the essence of correlation.

Correlation analysis and regression analysis are closely related in practice. Regression analysis, however, is concerned with the functional form of the dependence of one random variable Y on another random variable (or set of variables) X. In the language of prediction, X is the predictor and Y is the quantity being predicted, so X and Y do not play symmetric roles. In correlation analysis the variables under discussion are on an equal footing, and the analysis focuses on the various correlation characteristics of the random variables. For example, letting X and Y denote a pupil's mathematics and language scores, the interest lies in how the two are related, not in using X to predict Y.

The main task of correlation analysis is, given a set of observations (Xi, Yi) (i = 1, 2, …, n) on X and Y, to estimate ρXY and to test hypotheses about ρXY, in particular H0: ρXY = 0. In statistics,

$$r = \frac{\sum_{i=1}^{n}(X_i-\bar X)(Y_i-\bar Y)}{\sqrt{\sum_{i=1}^{n}(X_i-\bar X)^2\sum_{i=1}^{n}(Y_i-\bar Y)^2}}$$

is called the sample correlation coefficient and is used to estimate ρXY. In 1915 Fisher derived the sampling distribution of r for the case where the joint distribution of (X, Y) is bivariate normal, making it possible to test the hypothesis ρXY = 0. This was a major advance and marked the establishment of correlation analysis as a statistical method.
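Both steps are easy to carry out directly. A hedged sketch (the height/weight numbers are invented): compute r from the definition above, then test H0: ρXY = 0 with the usual statistic t = r√(n−2)/√(1−r²), which has n − 2 degrees of freedom under bivariate normality:

```python
import numpy as np
from scipy import stats

x = np.array([165.0, 170.0, 172.0, 168.0, 180.0, 175.0])   # e.g., heights
y = np.array([ 55.0,  63.0,  66.0,  60.0,  78.0,  70.0])   # e.g., weights
n = len(x)

# Sample correlation coefficient r, exactly as in the formula above.
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

# Test H0: rho = 0 via t = r * sqrt(n-2) / sqrt(1 - r^2), df = n - 2.
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"r = {r:.3f}, t = {t:.3f}, p = {p:.4f}")
```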

Multiple correlation: the correlation coefficient above involves only two variables X and Y. Given several variables X1, X2, …, Xk, one may consider the correlation between one of them (say X1) and the remaining variables (X2, X3, …, Xk); the basic index is the multiple correlation coefficient R of X1 on (X2, X3, …, Xk). Take arbitrary constants a2, a3, …, ak and compute the correlation coefficient between X1 and the linear combination a2X2 + a3X3 + … + akXk; varying a2, a3, …, ak so that this correlation coefficient reaches its maximum, that maximum value is R. (The greatest possible relationship of one variable with several variables simultaneously.)
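Equivalently, R is the ordinary correlation between X1 and its best linear predictor from X2, …, Xk, i.e., the square root of R² from regressing X1 on the others. A minimal sketch with invented data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
x1 = 2.0 * x2 - 1.0 * x3 + rng.normal(size=n)   # X1 depends on X2, X3 plus noise

# Regress X1 on (1, X2, X3); the fitted values are the maximizing combination.
A = np.column_stack([np.ones(n), x2, x3])
coef, *_ = np.linalg.lstsq(A, x1, rcond=None)
fitted = A @ coef

# The multiple correlation R is the plain correlation of X1 with the fitted values.
R = np.corrcoef(x1, fitted)[0, 1]
print(f"multiple correlation R = {R:.3f}")
```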

Partial correlation: another important concept in correlation analysis. Let X, Y and Z denote, respectively, a person's monthly basic expenses, entertainment expenses and income. Analysis may show a high positive correlation between X and Y, the reason being that both X and Y are influenced by Z. If the influence of Z on both is removed, the degree of correlation of the remaining parts changes, and may even become negative. The latter is the partial correlation of X and Y given Z, measured by the partial correlation coefficient. (The relationship between two variables in the presence of a control variable.)
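A standard way to compute the partial correlation of X and Y given Z is to regress each of X and Y on Z and correlate the residuals. A minimal sketch; the expenses/income numbers below are invented to mimic the example:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
z = rng.normal(loc=5000, scale=1000, size=n)    # income
x = 0.4 * z + rng.normal(scale=200, size=n)     # basic expenses
y = 0.2 * z + rng.normal(scale=200, size=n)     # entertainment expenses

def residuals(v, z):
    """Residuals of a simple regression of v on z (removes z's linear effect)."""
    b, a = np.polyfit(z, v, 1)
    return v - (a + b * z)

r_xy = np.corrcoef(x, y)[0, 1]                  # high: shared driver z
r_xy_given_z = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]  # near 0
print(f"r(X,Y) = {r_xy:.3f},  r(X,Y | Z) = {r_xy_given_z:.3f}")
```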

Sometimes one needs to consider the relationship between one set of variables and another set. For this purpose the canonical correlation coefficient was introduced; the corresponding method is called canonical correlation analysis and belongs to multivariate statistical analysis.

Canonical correlation analysis: seek the pair of linear functions, one formed from each set of variables, whose correlation coefficient is maximal; this is called the first pair of canonical variables. A second pair, a third pair, and so on can then be found, with the pairs mutually uncorrelated. The correlation coefficient of each pair is called a canonical correlation coefficient.
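Numerically, the canonical correlations are the singular values of the cross-covariance between the whitened sets. A minimal sketch with invented data (two variables per set; Cholesky-based whitening is one standard way to do it):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
shared = rng.normal(size=n)
X = np.column_stack([shared + rng.normal(size=n), rng.normal(size=n)])
Y = np.column_stack([shared + rng.normal(size=n), rng.normal(size=n)])

Xc, Yc = X - X.mean(0), Y - Y.mean(0)
# Whiten each set; the singular values of the whitened cross-covariance are
# the canonical correlations, and the singular vectors give the canonical variables.
Kx = np.linalg.inv(np.linalg.cholesky(Xc.T @ Xc / n))
Ky = np.linalg.inv(np.linalg.cholesky(Yc.T @ Yc / n))
M = Kx @ (Xc.T @ Yc / n) @ Ky.T
rhos = np.linalg.svd(M, compute_uv=False)
print("canonical correlations:", np.round(rhos, 3))
```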


Through the practical meaning represented by these canonical variables, some internal connections between the two sets of variables can be found. Canonical correlation analysis had already appeared by the 1930s. (It concerns the relationship between two sets of variables, i.e., between several variables and several other variables.)

1. Differences between correlation analysis and regression analysis

 Data: correlation analysis requires that both variables X and Y follow normal distributions. Regression analysis requires only that the dependent variable Y follow a normal distribution, while X is a variable that can be measured precisely and strictly controlled; this is usually called Type I regression. Regression analysis of bivariate normally distributed data is called Type II regression.

 Application: correlation analysis is used to describe the correlation between two variables; regression analysis is used to describe the quantitative relationship by which one variable depends on the other.

2. Connections between correlation analysis and regression analysis

 For linear correlation and regression (y = a + bx), the correlation coefficient R and the slope b of the regression equation have the same sign. A positive R means the two variables change in the same direction; a positive b means that when x increases (decreases) by one unit, y increases (decreases) by b units on average.

 For linear correlation and regression, the hypothesis tests on the parameters R and b are equivalent. Since the test of R can be done by table lookup while the test of b is computationally more tedious, in practice the test of R is often used in place of the test of b.

H0: β = 1; H1: β ≠ 1; α (significance level) = 0.05.

$$t = \frac{b - 1}{S_b}, \qquad S_b = \frac{S_{y\cdot x}}{\sqrt{l_{xx}}}, \qquad \nu = n - 2$$

$$S_{y\cdot x} = \sqrt{\frac{\sum(y - \hat y)^2}{n - 2}} = \sqrt{\frac{l_{yy} - b\,l_{xy}}{n - 2}}$$

$$l_{xx} = \sum(x - \bar x)^2, \qquad l_{yy} = \sum(y - \bar y)^2, \qquad l_{xy} = \sum(x - \bar x)(y - \bar y), \qquad b = \frac{l_{xy}}{l_{xx}}$$

In words: t = (regression coefficient − 1) / (standard error of the regression coefficient), with degrees of freedom = n − 2.

Both the regression coefficient and its standard error in these formulas are provided by SPSS's linear regression output. Given the t value obtained above, the two-sided probability under H0 can be computed in SPSS as:

P(H0: β = 1) = 2 × (1 − CDF.T(t, df))  (t > 0)
P(H0: β = 1) = 2 × CDF.T(t, df)  (t < 0)
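The same computation is easy to mirror outside SPSS. A hedged Python sketch (the x and y values are invented) that follows the formulas above, including the two-sided p-value that CDF.T supplies in SPSS:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 1.9, 3.3, 3.8, 5.4, 5.9, 7.1, 8.2])
n = len(x)

lxx = np.sum((x - x.mean()) ** 2)
lyy = np.sum((y - y.mean()) ** 2)
lxy = np.sum((x - x.mean()) * (y - y.mean()))

b = lxy / lxx                                 # regression coefficient
s_yx = np.sqrt((lyy - b * lxy) / (n - 2))     # residual standard deviation
s_b = s_yx / np.sqrt(lxx)                     # standard error of b

t = (b - 1) / s_b                             # H0: beta = 1
p = 2 * stats.t.sf(abs(t), df=n - 2)          # two-sided, df = n - 2
print(f"b = {b:.3f}, t = {t:.3f}, p = {p:.4f}")
```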


1-Bivariate Correlations

The Bivariate Correlations procedure computes Pearson's correlation coefficient, Spearman's rho, and Kendall's tau-b with their significance levels. Correlations measure how variables or rank orders are related. Before calculating a correlation coefficient, screen your data for outliers (which can cause misleading results) and evidence of a linear relationship. Pearson's correlation coefficient is a measure of linear association. Two variables can be perfectly related, but if the relationship is not linear, Pearson's correlation coefficient is not an appropriate statistic for measuring their association.

Example. Is the number of games won by a basketball team correlated with the average number of points scored per game? A scatterplot indicates that there is a linear relationship. Analyzing data from the 1994–1995 NBA season shows that Pearson's correlation coefficient (0.581) is significant at the 0.01 level. You might also suspect that the more games won per season, the fewer points the opponents scored. These variables are negatively correlated (–0.401), and the correlation is significant at the 0.05 level.

Data. Use symmetric quantitative variables for Pearson's correlation coefficient and quantitative variables or variables with ordered categories for Spearman's rho and Kendall's tau-b.

Choose correlation coefficients based on the characteristics of your data. If you have scale, symmetrically distributed data, the Pearson correlation coefficient is appropriate. If your data are non-symmetrically distributed or are ordinal in nature (such as ranks), Kendall's tau-b or the Spearman coefficient are more appropriate.

Assumptions. Pearson's correlation coefficient assumes that each pair of variables is bivariate normal.
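The same three coefficients can be computed outside SPSS as well. A minimal sketch with SciPy (the won/points numbers are invented to echo the basketball example above):

```python
import numpy as np
from scipy import stats

games_won  = np.array([57, 47, 60, 36, 21, 45, 50, 33, 26, 42])
points_avg = np.array([101.5, 99.2, 104.8, 95.1, 92.3,
                       98.7, 100.9, 94.0, 93.5, 97.8])

# Pearson, Spearman's rho, and Kendall's tau-b (SciPy's default tau variant).
for name, func in [("Pearson", stats.pearsonr),
                   ("Spearman", stats.spearmanr),
                   ("Kendall tau-b", stats.kendalltau)]:
    stat, p = func(games_won, points_avg)
    print(f"{name}: {stat:.3f} (p = {p:.3f})")
```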

Using Correlations to Study the Association between Motor Vehicle Sales and Fuel Efficiency

The Bivariate Correlations procedure computes the pairwise associations for a set of variables and displays the results in a matrix. It is useful for determining the strength and direction of the association between two scale or ordinal variables.

In order to increase sales, motor vehicle design engineers want to focus their attention on aspects of the vehicle that are important to customers; for example, how important is fuel efficiency with respect to sales? One way to measure this is to compute the correlation between past sales and fuel efficiency.

Information concerning various makes of motor vehicles is collected in car_sales.sav. This data file contains hypothetical sales estimates, plus list prices and physical specifications for various makes and models of vehicles. The list prices and physical specifications were obtained alternately from edmunds.com and manufacturer sites.

Use Bivariate Correlations to measure the importance of fuel efficiency to the salability of a motor vehicle.

► To run a correlations analysis, from the menus choose: Analyze > Correlate > Bivariate...
► Select Sales in thousands and Fuel efficiency as analysis variables.
► Click OK.

These selections produce a correlation matrix for Sales in thousands and Fuel efficiency. The Pearson correlation coefficient measures the linear association between two scale variables.

Correlations
                                           Fuel efficiency   Sales in thousands
Fuel efficiency      Pearson Correlation        1                 -.017
                     Sig. (2-tailed)                               .837
                     N                          154                154
Sales in thousands   Pearson Correlation       -.017                1
                     Sig. (2-tailed)            .837
                     N                          154                157

The correlation reported in the table is negative(!), although not significantly different from 0 because the p-value of 0.837 is greater than 0.10. This suggests that designers should not focus their efforts on making cars more fuel efficient because there isn't an appreciable effect on sales. However, the Pearson correlation coefficient works best when the variables are approximately normally distributed and have no outliers. A scatterplot can reveal these possible problems.

► To produce a scatterplot of Sales in thousands by Fuel efficiency, from the menus choose: Graphs > Chart Builder...
► Select the Scatter/Dot gallery and choose Simple Scatter.
► Select Sales in thousands as the y variable and Fuel efficiency as the x variable.
► Click the Groups/Point ID tab and select Point ID Label.
► Select Model as the variable to label cases by.
► Click OK.

The resulting scatterplot shows two potential outliers, one in the lower right of the plot and one in the upper left.


► To identify these points, activate the graph by double-clicking on it. Click the Data ID Mode tool.
► Select the point in the lower right. It is identified as the Metro.
► Select the point in the upper left. It is identified as the F-Series.

The F-Series is found to be generally representative of the vehicles your design team is working on, so you decide to keep it in the data set for now. This point may appear to be an outlier because of the skew distribution of Sales in thousands, so try replacing it with Log-transformed sales in further analyses. The Metro is not representative of the vehicles that your design team is working on, so you can safely remove it from further analyses.

► To remove the Metro from the correlation computations, from the menus choose: Data > Select Cases...
► Select If condition is satisfied and click If.
► Type model ~= 'Metro' in the text box.
► Click Continue.
► Click OK in the Select Cases dialog box.

A new variable has been created that uses all cases except for the Metro in further computations.

► To analyze the filtered data, recall the Bivariate Correlations dialog box.
► Deselect Sales in thousands as an analysis variable.
► Select Log-transformed sales as an analysis variable.
► Click OK.

Correlations
                                            Fuel efficiency   Log-transformed sales
Fuel efficiency         Pearson Correlation      1                  .136
                        Sig. (2-tailed)                             .093
                        N                        153                153
Log-transformed sales   Pearson Correlation      .136                1
                        Sig. (2-tailed)          .093
                        N                        153                156

After removing the outlier and looking at the log-transformed sales, the correlation is now positive but still not significantly different from 0. However, the customer demographics for trucks and automobiles are different, and the reasons for buying a truck or a car may not be the same. It's worthwhile to look at another scatterplot, this time marking trucks and autos separately.

► To produce a scatterplot of Log-transformed sales by Fuel efficiency, controlling for Vehicle type, recall the Simple Scatterplot dialog box.
► Deselect Sales in thousands and select Log-transformed sales as the y variable.
► Select Vehicle type as the variable to set markers by.
► Click OK.

The scatterplot shows that trucks and automobiles form distinctly different groups. By splitting the data file according to Vehicle type, you might get a more accurate view of the association. Also note that with the log-transformation of sales, the potential outlier in the upper left has disappeared.


► To split the data file according to Vehicle type, from the menus choose: Data > Split File...
► Select Compare groups.
► Select Vehicle type as the variable on which groups should be based.
► Click OK.
► To analyze the split file, recall the Bivariate Correlations dialog box.
► Click OK.

Correlations
Vehicle type: Automobile
                                              Log-transformed sales   Fuel efficiency
Log-transformed sales   Pearson Correlation        1                      .451**
                        Sig. (2-tailed)                                   .000
                        N                          115                    113
Fuel efficiency         Pearson Correlation        .451**                 1
                        Sig. (2-tailed)            .000
                        N                          113                    113

Vehicle type: Truck
                                              Log-transformed sales   Fuel efficiency
Log-transformed sales   Pearson Correlation        1                      .203
                        Sig. (2-tailed)                                   .210
                        N                          41                     40
Fuel efficiency         Pearson Correlation        .203                   1
                        Sig. (2-tailed)            .210
                        N                          40                     40

**. Correlation is significant at the 0.01 level (2-tailed).

Splitting the file on Vehicle type has made the relationship between sales and fuel efficiency much clearer. There is a significant and fairly strong positive correlation between sales and fuel efficiency for automobiles. For trucks, the correlation is positive but not significantly different from 0. Reaching these conclusions has required some work and shown that correlation analysis using the Pearson correlation coefficient is not always straightforward. For comparison, see how you can avoid the difficulty of transforming variables by using nonparametric correlation measures. The Spearman's rho and Kendall's tau-b statistics measure the rank-order association between two scale or ordinal variables. They work regardless of the distributions of the variables.

► Select all cases again.
► To obtain an analysis using Spearman's rho, recall the Bivariate Correlations dialog box.
► Select Sales in thousands as an analysis variable.
► Deselect Pearson and select Spearman.
► Click OK.

Correlations
Vehicle type: Automobile
                                                        Sales in    Log-transformed   Fuel
                                                        thousands   sales             efficiency
Spearman's   Sales in            Correlation Coefficient  1.000       1.000**           .418**
rho          thousands           Sig. (2-tailed)          .           .                 .000
                                 N                        116         116               114
             Log-transformed     Correlation Coefficient  1.000**     1.000             .418**
             sales               Sig. (2-tailed)          .           .                 .000
                                 N                        116         116               114
             Fuel efficiency     Correlation Coefficient  .418**      .418**            1.000
                                 Sig. (2-tailed)          .000        .000              .
                                 N                        114         114               114

**. Correlation is significant at the 0.01 level (2-tailed).

Spearman's rho is reported separately for automobiles and trucks. As with Pearson's correlation coefficient, the association between Log-transformed sales and Fuel efficiency is fairly strong. However, Spearman's rho reports the same correlation for the untransformed sales! This is because rho is based on rank orders, which are unchanged by log transformation. Moreover, outliers have less of an effect on Spearman's rho, so it's possible to save some time and effort by using it as a measure of association.

Using Bivariate Correlations, you produced a correlation matrix for Sales in thousands by Fuel efficiency and, surprisingly, found a negative correlation. Upon removing an outlier and using Log-transformed sales, the correlation became positive, although not significantly different from 0. However, you found that by computing the correlations separately for trucks and autos, there is a positive and statistically significant correlation between sales and fuel efficiency for automobiles. Furthermore, you found similar results without the transformation using Spearman's rho, and perhaps are wondering why you should go through the effort of transforming variables when Spearman's rho is so convenient. The measures of rank order are handy for discovering whether there is any kind of association between two variables, but when they find an association it's a good idea to find a transformation that makes the relationship linear. This is because there are more predictive models available for linear relationships, and the linear models are generally easier to implement and interpret.

The Bivariate Correlations procedure is useful for studying the pairwise associations for a set of scale or ordinal variables.
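Outside SPSS, the split-file analysis amounts to a groupby. A hedged sketch with pandas and SciPy; car_sales.sav is the SPSS sample file, so the small synthetic frame below only imitates its relevant columns:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(3)
n = 150
vtype = rng.choice(["Automobile", "Truck"], size=n, p=[0.75, 0.25])
mpg = rng.normal(loc=np.where(vtype == "Automobile", 27, 18), scale=3)
log_sales = 0.05 * mpg + rng.normal(scale=0.5, size=n)   # weak positive link

df = pd.DataFrame({"type": vtype, "mpg": mpg, "log_sales": log_sales})

# One correlation table per vehicle type, like SPSS's Split File.
for t, g in df.groupby("type"):
    r, p = stats.pearsonr(g["mpg"], g["log_sales"])
    rho, p_rho = stats.spearmanr(g["mpg"], g["log_sales"])
    print(f"{t}: Pearson r = {r:.3f} (p = {p:.3f}), "
          f"Spearman rho = {rho:.3f} (p = {p_rho:.3f})")
```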

 If you have nominal variables, use the Crosstabs procedure to obtain measures of association.
 If you want to model the value of a scale variable based on its linear relationship to other variables, try the Linear Regression procedure.
 If you want to decompose the variation in your data to look for underlying patterns, try the Factor Analysis procedure.

Pearson Correlation. The most widely used type of correlation coefficient is Pearson r (Pearson, 1896), also called linear or product-moment correlation (the term correlation was first used by Galton, 1888). Using non-technical language, we can say that the correlation coefficient determines the extent to which values of two variables are "proportional" to each other. The value of the correlation (i.e., the correlation coefficient) does not depend on the specific measurement units used; for example, the correlation between height and weight will be identical regardless of whether inches and pounds or centimeters and kilograms are used as measurement units. Proportional means linearly related; that is, the correlation is high if the relationship can be approximated by a straight line (sloped upwards or downwards). This line is called the regression line or least squares line, because it is determined such that the sum of the squared distances of all the data points from the line is the lowest possible. Pearson correlation assumes that the two variables are measured on at least interval scales.
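For reference, the least squares line fits intercept a and slope b by minimizing the summed squared vertical distances (standard definitions, stated here for completeness):

$$\min_{a,b}\ \sum_{i=1}^{n}\left(y_i - a - b x_i\right)^2, \qquad b = \frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sum_i (x_i-\bar x)^2}, \qquad a = \bar y - b\,\bar x$$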

Spearman R. Spearman R can be thought of as the regular Pearson product-moment correlation coefficient (Pearson r), interpretable in the same way in terms of the proportion of variability accounted for, except that it is computed from ranks. As mentioned above, Spearman R assumes that the variables under consideration were measured on at least an ordinal (rank-order) scale; that is, the individual observations (cases) can be ranked into two ordered series.

The Spearman rank correlation coefficient is a measure of the relationship between two variables when data in the form of rank orders are available. For instance, the Spearman rank correlation coefficient could be used to determine the degree of agreement between men and women concerning their preference ranking of 10 different television shows. A Spearman rank correlation coefficient of 1 would indicate complete agreement, a coefficient of −1 would indicate complete disagreement, and a coefficient of 0 would indicate that the rankings were unrelated.
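A minimal sketch of that television example (the two preference rankings are invented):

```python
from scipy import stats

# Ranks assigned by men and women to the same 10 television shows.
men   = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
women = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

rho, p = stats.spearmanr(men, women)
print(f"Spearman rho = {rho:.3f} (p = {p:.4f})")   # near +1: strong agreement
```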

Kendall Tau. Kendall tau is equivalent to the Spearman R statistic with regard to the underlying assumptions. It is also comparable in terms of its statistical power. However, Spearman R and Kendall tau are usually not identical in magnitude because their underlying logic, as well as their computational formulas, are very different.

Two different variants of tau are computed, usually called tau-b and tau-c. These measures differ only with regard to how tied ranks are handled. In most cases these values will be fairly similar, and when discrepancies occur, it is probably safest to interpret the lower value.
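In SciPy (1.7 or later), both variants are exposed through the variant argument of scipy.stats.kendalltau; a small sketch with invented, tied ordinal ratings:

```python
from scipy import stats

# Ordinal ratings with ties (e.g., two judges scoring the same items 1-3).
judge_a = [1, 1, 2, 2, 2, 3, 3, 1, 2, 3]
judge_b = [1, 2, 2, 3, 2, 3, 3, 1, 1, 3]

tau_b, _ = stats.kendalltau(judge_a, judge_b, variant="b")  # adjusts for ties (default)
tau_c, _ = stats.kendalltau(judge_a, judge_b, variant="c")  # suited to non-square tables
print(f"tau-b = {tau_b:.3f}, tau-c = {tau_c:.3f}")
```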

Rank Correlation. A rank correlation coefficient is a coefficient of correlation between two random variables that is based on the ranks of the measurements and not the actual values.


2-Partial Correlations

The Partial Correlations procedure computes partial correlation coefficients that describe the linear relationship between two variables while controlling for the effects of one or more additional variables. Correlations are measures of linear association. Two variables can be perfectly related, but if the relationship is not linear, a correlation coefficient is not an appropriate statistic for measuring their association.

Example. Is there a relationship between healthcare funding and disease rates? Although you might expect any such relationship to be a negative one, a study reports a significant positive correlation: as healthcare funding increases, disease rates appear to increase. Controlling for the rate of visits to healthcare providers, however, virtually eliminates the observed positive correlation. Healthcare funding and disease rates only appear to be positively related because more people have access to healthcare when funding increases, which leads to more reported diseases by doctors and hospitals.

Data. Use symmetric, quantitative variables. Use one or more scale variables as control variables. Correlations are adjusted to account for the influence of the control variables.

Assumptions. The Partial Correlations procedure assumes that each pair of variables is bivariate normal.

Using Partial Correlations to Unravel "Relationships"

The Partial Correlations procedure computes partial correlation coefficients that describe the linear relationship between two variables while controlling for the effects of one or more additional variables. All the variables should be scale variables.

A popular radio talk show host has just received the latest government study on public health care funding and has uncovered a startling fact: As health care funding increases, disease rates also increase! Cities that spend more actually seem to be worse off than cities that spend less.

Is health care funding bad for your health? The radio talk show host has the evidence that appears to prove that claim: The data in the government report yield a high, positive correlation between health care funding and disease rates -- which seems to indicate that people would be much healthier if the government simply stopped putting money into health care programs.

But is this really true? It certainly isn't likely that there's a causal relationship between health care funding and disease rates. Assuming the numbers are correct, are there other factors that might create the appearance of a relationship where none actually exists?

This example uses the data file health_funding.sav. This is a hypothetical data file that contains data on health care funding (amount per 100 population), disease rates (rate per 10,000 population), and visits to health care providers (rate per 10,000 population). Each case represents a different city.


► To obtain partial correlations, from the menus choose: Analyze > Correlate > Partial
► Select Health care funding and Disease rate as the variables.
► Select Visits to health care providers as the control variable.
► Click Options.
► Check Zero-order correlations and then click Continue.
► In the main Partial Correlations dialog, click OK to run the procedure.

In this example, the Partial Correlations table shows both the zero-order correlations (correlations without any control variables) of all three variables and the partial correlation of the first two variables controlling for the effects of the third variable.

The zero-order correlation between health care funding and disease rates is, indeed, both fairly high (0.737) and statistically significant (p < 0.001).

The partial correlation controlling for the rate of visits to health care providers, however, is negligible (0.013) and not statistically significant (p = 0.928).

One interpretation of this finding is that the observed positive \"relationship\" between health care funding and disease rates is due to underlying relationships between each of those variables and the rate of visits to health care providers:

Disease rates only appear to increase as health care funding increases because more people have access to health care providers when funding increases, and doctors and hospitals consequently report more occurrences of diseases since more sick people come to see them.

Going back to the zero-order correlations, you can see that both health care funding rates and reported disease rates are highly positively correlated with the control variable, rate of visits to health care providers. Removing the effects of this variable reduces the correlation between the other two variables to almost zero. It's even possible that controlling for the effects of some other relevant variables might actually reveal an underlying negative relationship between health care funding and disease rates. The Partial Correlations procedure is only appropriate for scale variables.
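The funding example can be reproduced end to end with synthetic data of the same shape (funding drives visits, and visits drive reported disease; all numbers below are invented), reusing the residual trick from earlier:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50                                               # one case per city
funding = rng.normal(100, 20, size=n)                # amount per 100 population
visits  = 0.8 * funding + rng.normal(0, 8, size=n)   # rate per 10,000 population
disease = 0.6 * visits  + rng.normal(0, 8, size=n)   # rate per 10,000 population

def resid(v, z):
    """Residuals of a simple regression of v on z."""
    b, a = np.polyfit(z, v, 1)
    return v - (a + b * z)

zero_order = np.corrcoef(funding, disease)[0, 1]      # spuriously high
partial = np.corrcoef(resid(funding, visits), resid(disease, visits))[0, 1]
print(f"zero-order r = {zero_order:.3f}, "
      f"partial r (controlling visits) = {partial:.3f}")
```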


 If you have categorical (nominal or ordinal) data, use the Crosstabs procedure. Layer variables in Crosstabs are similar to control variables in Partial Correlations.
 If you want to model the value of a scale variable based on its linear relationship to other variables, try the Linear Regression procedure.

Partial Correlation. A correlation between two variables that remains after controlling for (e.g., partialling out) one or more other variables. For example, HAIR LENGTH may correlate with HEIGHT (with taller individuals having shorter hair); however, that correlation will likely become smaller or even disappear if the influence of GENDER is removed, since women are generally shorter and are more likely to have long hair than men.

Part Correlation (or Semi-Partial Correlation). The semi-partial or part correlation is similar to the partial correlation statistic. Like the partial correlation, it is a measure of the correlation between two variables that remains after controlling for (i.e., "partialling out") the effects of one or more other predictor variables. However, while the squared partial correlation between a predictor X1 and a response variable Y can be interpreted as the proportion of (unique) variance accounted for by X1, in the presence of other predictors X2, ..., Xk, relative to the residual or unexplained variance that cannot be accounted for by X2, ..., Xk, the squared semi-partial or part correlation is the proportion of (unique) variance accounted for by the predictor X1 relative to the total variance of Y. Thus, the semi-partial or part correlation is a better indicator of the "practical relevance" of a predictor, because it is scaled to (i.e., relative to) the total variability in the dependent (response) variable.

The semipartial (or part) correlation statistic is similar to the partial correlation statistic. Both measure variance after certain factors are controlled for, but to calculate the semipartial correlation one holds the third variable constant for either X or Y, whereas for partial correlations one holds the third variable constant for both. The semipartial correlation measures unique and joint variance, while the partial correlation measures unique variance. The semipartial (or part) correlation can be viewed as more practically relevant "because it is scaled to (i.e., relative to) the total variability in the dependent (response) variable." Conversely, it is less theoretically useful because it is less precise about the unique contribution of the independent variable. Although it may seem paradoxical, the semipartial correlation of X with Y is never larger in magnitude than the partial correlation of X with Y.
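Computationally, the difference is only where the residualizing happens: for the semipartial (part) correlation of Y with X1, only X1 is adjusted for the other predictors. A minimal sketch with invented data and two predictors:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)          # predictors are themselves correlated
y  = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

def resid(v, z):
    """Residuals of a simple regression of v on z."""
    b, a = np.polyfit(z, v, 1)
    return v - (a + b * z)

partial     = np.corrcoef(resid(y, x2), resid(x1, x2))[0, 1]  # both residualized
semipartial = np.corrcoef(y,            resid(x1, x2))[0, 1]  # only X1 residualized
print(f"partial = {partial:.3f}, semipartial = {semipartial:.3f}")
# |semipartial| <= |partial|, as stated above.
```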

Spurious Correlations. A false presumption that two variables are correlated when in reality they are not. Spurious correlation is often a result of a third factor that is not apparent at the time of examination. Spurious comes from the Latin word spurius, which means illegitimate or false.

In statistics, a spurious relationship (or, sometimes, spurious correlation or spurious regression) is a mathematical relationship in which two occurrences have no causal connection, yet it may be inferred that they do, due to a certain third, unseen factor (referred to as a "confounding factor" or "lurking variable"). The spurious relationship gives an impression of a worthy link between two groups that is invalid when objectively examined.

The misleading correlation between two variables is produced through the operation of a third causal variable. In other words, when we find a correlation between A and B, there are three possible relationships:

A causes B,
B causes A, or
C causes both A and B.

(Does the last panel of the four-panel Anscombe figure count as a spurious correlation?)

The last is a spurious correlation. In a regression model where A is regressed on B, but C is found to be the true causal factor for B, this is called specification error. It is therefore often said that "correlation does not imply causation".

An example of a spurious relationship can be seen by examining a city's ice cream sales. These sales are highest when the rate of drownings in city swimming pools is highest. To allege that ice cream sales cause drowning, or vice versa, would be to imply a spurious relationship between the two. In reality, a heat wave may have caused both. The heat wave is an example of a hidden or unseen variable, also known as a confounding variable.

Another example is a correlation between the total amount of losses in a fire and the number of firemen that were putting out the fire; however, what this correlation does not indicate is that if we call fewer firemen, we would lower the losses. There is a third variable (the initial size of the fire) that influences both the amount of losses and the number of firemen. If we "control" for this variable (e.g., consider only fires of a fixed size), the correlation will either disappear or perhaps even change its sign. The main problem with spurious correlations is that we typically do not know what the "hidden" agent is. However, in cases when we know where to look, we can use partial correlations that control for (i.e., partial out) the influence of specified variables.

The term is commonly used in statistics and in particular in experimental research techniques. Experimental research attempts to understand and predict causal relationships (X → Y). A non-causal correlation can be spuriously created by an antecedent which causes both (W → X & Y). Intervening variables (X → W → Y), if undetected, may make indirect causation look direct. Because of this, experimentally identified correlations do not represent causal relationships unless spurious relationships can be ruled out.

In practice, three conditions must be met in order to conclude that X causes Y, directly or indirectly:

 X must precede Y
 Y must not occur when X does not occur
 Y must occur whenever X occurs

Spurious relationships can often be identified by considering whether any of these three conditions have been violated.

Neither regression nor correlation analyses can be interpreted as establishing cause-and-effect relationships. They can indicate only how, or to what extent, variables are associated with each other. The correlation coefficient measures only the degree of linear association between two variables. Any conclusions about a cause-and-effect relationship must be based on the judgment of the analyst. (Professional judgment matters most.)

