A Cautionary Note on the Use of Linear Regression for Hypothesis Testing
This note draws researchers’ attention to the use of linear regression for the purpose of hypothesis testing. Even when multiple explanatory variables are included in a regression equation to preclude the hazard, inherent in a simple regression, of omitting other factors, multicollinearity is unfortunately built into multiple regression simply because the included explanatory variables can share a common parameter domain; that is, they co-vary. Here we show that however one transforms the variance-covariance matrix of the least-squares estimation to reduce the estimation errors, the procedure amounts to affine transformations of the explanatory variables, so that despite the transformation they continue to co-vary, rendering each coefficient invalid as a partial derivative. The root of this problem is that the explanatory variables’ values are obtained by observation rather than predetermined before collecting the corresponding values of the dependent variable. The situation becomes especially disconcerting when a transformed explanatory variable has an estimated coefficient enjoying an exceptional degree of confidence yet no mathematical status as a partial derivative, misleading engineering or medical prescriptions and public policies.
Introduction
It is a well-recognized fact that a simple linear regression can suffer from omitted variables or model misspecification. Thus, multiple regression serves as a better methodology; here, however, one must exercise caution about the choice of explanatory variables and the design of the sample. In empirical work there are mainly two approaches: observational (as in astronomy) or experimental (as in a particle collider). In practical applications of regression it is mostly the former: one draws a sample from the underlying population and observes the input data [1]. This presents problems, as the analyst has no control over the observed values, say, the values $x_{1i}$ and $x_{2i}$ of two explanatory variables $x_{1}$ and $x_{2}$. This note seeks to draw researchers’ attention to a highly likely situation in which $x_{1}$ and $x_{2}$ are both functions of underlying parameters $t$ and $s$ (connoting time and space), i.e., $x_{1}=x_{1}(t,s)$ and $x_{2}=x_{2}(t,s)$, so that the very construct of $\partial y/\partial x_{1}$ (with $x_{2}$ held fixed) is immediately invalid, as such a partial derivative does not exist. In the next section we show that all modifications of the “variance-covariance matrix” $X^{\top}X$ of the least-squares estimation for the purpose of alleviating multicollinearity cannot treat this problem of explanatory variables sharing the same parameter domain. Then in Section 3 we conclude with a summary remark.
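To make the situation concrete, the following minimal simulation sketch (in Python; the functional forms $x_{1}=2t+0.5s$, $x_{2}=1.5t-0.3s$ and all numerical values are illustrative assumptions, not taken from any actual study) generates both explanatory variables from the same parameters $(t,s)$. The two regressors then co-vary strongly, and no part of the sample realizes a change in $x_{1}$ with $x_{2}$ held fixed.

```python
# Minimal sketch (hypothetical functional forms): x1 and x2 are both driven
# by the same underlying parameters t and s, so they co-vary in any
# observational sample, and x2 cannot be held fixed while x1 varies.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
t = rng.uniform(0.0, 1.0, n)              # "time" parameter
s = rng.uniform(0.0, 1.0, n)              # "space" parameter

x1 = 2.0 * t + 0.5 * s                    # x1 = x1(t, s)
x2 = 1.5 * t - 0.3 * s                    # x2 = x2(t, s): same parameter domain
y = 1.0 + 3.0 * x1 - 2.0 * x2 + rng.normal(0.0, 0.1, n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("OLS estimates:", beta_hat)

# The observed regressors co-vary, so "vary x1 while holding x2 fixed" is
# not something the sampled data can realize.
print("corr(x1, x2):", np.corrcoef(x1, x2)[0, 1])
```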
Analysis
All remedial procedures against the problem of multicollinearity are based on modifications of the matrix $X^{\top}X$ of the ordinary least-squares estimation [2]; in the case of two explanatory variables (written in deviation-from-mean form for brevity),

$$X^{\top}X=\begin{pmatrix}\sum_{i}x_{1i}^{2} & \sum_{i}x_{1i}x_{2i}\\ \sum_{i}x_{2i}x_{1i} & \sum_{i}x_{2i}^{2}\end{pmatrix}.\tag{1}$$
Since $X^{\top}X$ is a linear operator, transformations of it are in general linear or affine [3], [4]. Consider first a linear transformation, by a constant $a>0$, of a diagonal entry of the above matrix,

$$\sum_{i}x_{ji}^{2}\;\longmapsto\;a\sum_{i}x_{ji}^{2}=\sum_{i}\bigl(\sqrt{a}\,x_{ji}\bigr)^{2};\tag{2}$$

then each individual observation of $x_{j}$ undergoes the transformation

$$\tilde{x}_{ji}=\sqrt{a}\,x_{ji},\tag{3}$$

which still ends up a function of $(t,s)$. Consider now an affine transformation of a diagonal entry of the above matrix,

$$\sum_{i}x_{ji}^{2}\;\longmapsto\;a\sum_{i}x_{ji}^{2}+b;\tag{4}$$

but then the above (4) can be re-expressed in the form of (2), with the constant $b$ absorbed into the transformed observations, so that each $\tilde{x}_{ji}$ becomes an affine function of $x_{ji}$.
Thus, each diagonal entry of $X^{\top}X$ remains a function of $(t,s)$ following any linear/affine transformation, and therefore each transformed explanatory variable (3) is still a function of $(t,s)$. Next, consider the off-diagonal entries of $X^{\top}X$: denote the transformed $x_{1}$ and $x_{2}$ by $\tilde{x}_{1}$ and $\tilde{x}_{2}$, respectively; then one has

$$\sum_{i}\tilde{x}_{1i}\tilde{x}_{2i}=\sum_{i}\bigl(a_{1}x_{1i}+b_{1}\bigr)\bigl(a_{2}x_{2i}+b_{2}\bigr),\tag{5}$$

also still a function of $(t,s)$.
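A short numerical sketch of this point (continuing the illustrative variables from the Introduction, with arbitrary transformation constants $a_{1},b_{1},a_{2},b_{2}$): an affine transformation rescales and shifts each regressor, but the transformed variables remain functions of $(t,s)$ and their correlation is unchanged.

```python
# Sketch: affine transformations of the regressors (arbitrary constants
# a1, b1, a2, b2) leave their co-variation intact.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
t, s = rng.uniform(0.0, 1.0, n), rng.uniform(0.0, 1.0, n)
x1, x2 = 2.0 * t + 0.5 * s, 1.5 * t - 0.3 * s

a1, b1, a2, b2 = 0.7, 3.0, 1.9, -2.0       # arbitrary affine constants
x1t = a1 * x1 + b1                          # transformed x1, still a function of (t, s)
x2t = a2 * x2 + b2                          # transformed x2, still a function of (t, s)

print(np.corrcoef(x1, x2)[0, 1])            # original correlation
print(np.corrcoef(x1t, x2t)[0, 1])          # unchanged after the transformation
```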
As such, any linear or affine transformation of the matrix $X^{\top}X$ amounts to a transformation of each explanatory variable $x_{j}$ into $\tilde{x}_{j}=a_{j}x_{j}+b_{j}$. Consequently, the transformed regression equation remains of the form

$$y=\tilde{\beta}_{0}+\tilde{\beta}_{1}\tilde{x}_{1}+\tilde{\beta}_{2}\tilde{x}_{2}+\varepsilon,\tag{6}$$

i.e., the root problem of multicollinearity remains: while the t-statistics of $\tilde{\beta}_{1}$ and $\tilde{\beta}_{2}$ can be driven to high (absolute) values, to the extent that the confidence intervals around them become extremely narrow, one still has the problem

$$\tilde{\beta}_{1}\neq\frac{\partial y}{\partial\tilde{x}_{1}},\qquad\tilde{\beta}_{2}\neq\frac{\partial y}{\partial\tilde{x}_{2}},\tag{7}$$

by the simple fact that $\operatorname{Cov}(\tilde{x}_{1},\tilde{x}_{2})\neq 0$; i.e., all the explanatory variables co-vary, being functions of the same $(t,s)$. Therefore, the engineered, highly significant estimates of the coefficients can only lead to a false confidence in their values [5]–[7].
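As a rough numerical check of this claim (a sketch using made-up, nearly collinear data and an arbitrary ridge penalty $k$, in the spirit of [3]): adding $kI$ to $X^{\top}X$ makes the inverted matrix better conditioned, which is what tightens the reported intervals, yet the correlation between the regressors, and hence their co-variation, is untouched.

```python
# Sketch: modifying X'X (here, ridge regression's X'X + k*I) stabilizes the
# estimates, but does nothing to the co-variation of the regressors.
# Data and penalty k are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 200
t, s = rng.uniform(0.0, 1.0, n), rng.uniform(0.0, 1.0, n)
x1 = 2.0 * t + 0.5 * s
x2 = 1.9 * t + 0.4 * s                      # nearly collinear with x1
y = 1.0 + 3.0 * x1 - 2.0 * x2 + rng.normal(0.0, 0.1, n)

Xc = np.column_stack([x1 - x1.mean(), x2 - x2.mean()])
yc = y - y.mean()
k = 0.1                                      # arbitrary ridge penalty

beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
beta_ridge = np.linalg.solve(Xc.T @ Xc + k * np.eye(2), Xc.T @ yc)
print("OLS:  ", beta_ols)
print("ridge:", beta_ridge)

# The matrix being inverted becomes better conditioned (smaller trace of the
# inverse), which is what shrinks the reported standard errors ...
print(np.trace(np.linalg.inv(Xc.T @ Xc)),
      np.trace(np.linalg.inv(Xc.T @ Xc + k * np.eye(2))))

# ... but the regressors co-vary exactly as before.
print("corr(x1, x2):", np.corrcoef(x1, x2)[0, 1])
```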
Summary Remark
From the above analysis, we see that hypothesis testing by a regression equation without an underlying mathematical model based on theory can be problematic, unless one fixes the explanatory variables’ input values $x_{1i}$ and $x_{2i}$ in advance and then collects the corresponding values $y_{i}$ of the dependent variable. Otherwise, no matter how one transforms the matrix $X^{\top}X$, the problem of co-varying explanatory variables persists; the transformation is then not only futile but can also be dangerous, considering a coefficient that carries the wrong sign (in the sense of the direction of the true effect of $x_{j}$ on $y$) yet a high degree of confidence in engineering or medical applications.
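For contrast, here is a minimal sketch of the designed-experiment alternative just described (the grid levels and coefficients are again illustrative assumptions): when the analyst fixes $x_{1}$ and $x_{2}$ on a full factorial grid before observing $y$, the two inputs are uncorrelated by construction, and along each grid line $x_{2}$ really is held fixed while $x_{1}$ varies.

```python
# Sketch of the designed-experiment alternative: x1 and x2 are fixed on a
# full factorial grid in advance (values are illustrative), so they do not
# co-vary, and the coefficient on x1 is a genuine "x2 held fixed" effect.
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
levels = np.linspace(0.0, 1.0, 5)
grid = np.array(list(product(levels, repeat=2)))   # 25 predetermined runs
x1, x2 = grid[:, 0], grid[:, 1]
print("corr(x1, x2):", np.corrcoef(x1, x2)[0, 1])  # ~0 by construction

y = 1.0 + 3.0 * x1 - 2.0 * x2 + rng.normal(0.0, 0.1, len(x1))
X = np.column_stack([np.ones(len(x1)), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimates:", beta_hat)                      # close to (1, 3, -2)
```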
References
[1] Kmenta J. Elements of Econometrics. New York: Macmillan; 1971.
[2] Huffel SV, Vandewalle J. Algebraic Connections between Total Least Squares Estimation and Classical Linear Regression in Multicollinearity Problems. Philadelphia, PA: SIAM; 1991, ch. 9. doi: 10.1137/1.9781611971002.
[3] Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67. doi: 10.2307/1267351.
[4] Liu K. A new class of biased estimate in linear regression. Commun Stat Theory Methods. 1993;22:393–402. doi: 10.1080/03610929308831027.
[5] Liu R, Wang H, Wang S. Functional variable selection via Gram-Schmidt orthogonalization for multiple functional linear regression. J Stat Comput Simul. 2018;88:3664–80. doi: 10.1080/00949655.2018.1530776.
[6] Oman SD. A confidence bound approach to choosing the biasing parameter in ridge regression. JASA. 1981;76:452–61. doi: 10.2307/2287849.
[7] Smith G, Campbell F. A critique of some ridge regression methods. JASA. 1980;75:74–81. doi: 10.1080/01621459.1980.10477428.