A Cautionary Note on the Use of Linear Regression for Hypothesis Testing
This note draws researchers’ attention to the use of linear regression for the purpose of hypothesis testing. Even when multiple explanatory variables are included in a regression equation to preclude the hazard, inherent in a simple regression, of omitting other factors, multicollinearity is unfortunately built into multiple regression simply because the included explanatory variables can share a common parameter domain; that is, they co-vary. Here we show that however one transforms the variance-covariance matrix of the least-squares estimation to reduce the estimation errors, the procedure amounts to affine transformations of the explanatory variables, so that despite the transformation they continue to co-vary, rendering each coefficient invalid as a partial derivative. The root of this problem is that the explanatory variables’ values are obtained by observation rather than predetermined before collecting the corresponding values of the dependent variable. The situation becomes especially disconcerting when a transformed explanatory variable has an estimated coefficient enjoying an exceptional degree of confidence yet no mathematical status as a partial derivative, misleading engineering or medical prescriptions and public policies.
Introduction
It is a well-recognized fact that a simple linear regression can suffer from omitted variables or model misspecification. Thus, multiple regression serves as a better methodology; here, however, one must exercise caution about the choice of explanatory variables and the design of the sample. In empirical work there are mainly two approaches: observational (as in astronomy) or experimental (as in a particle collider). In practical applications of regression it is mostly the former: one draws a sample from the underlying population and observes the input data [1]. This presents problems, as the analyst has no control over the observed values, say, the values $x_{1i}$ and $x_{2i}$ of two explanatory variables $x_{1}$ and $x_{2}$. This note seeks to draw researchers’ attention to a highly likely situation in which $x_{1}$ and $x_{2}$ are both functions of underlying parameters $t$ and $s$ (connoting time and space), i.e., $x_{1}=x_{1}(t,s)$ and $x_{2}=x_{2}(t,s)$, so that the very construct of $\partial y/\partial x_{1}$ (with $x_{2}$ held fixed) is immediately invalid, as such a partial derivative does not exist. In the next section we show that all modifications of the “variance-covariance matrix” $X^{\top}X$ of the least-squares estimation for the purpose of alleviating multicollinearity cannot treat this problem of explanatory variables sharing the same parameter domain. Then in Section 3 we conclude with a summary remark.
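To make the situation concrete, the following minimal simulation sketch (in Python; the functional forms $x_{1}=2t+0.5s$, $x_{2}=1.5t-0.3s$ and all numerical values are illustrative assumptions, not taken from any actual study) generates both explanatory variables from the same parameters $(t,s)$. The two regressors then co-vary strongly, and no part of the sample realizes a change in $x_{1}$ with $x_{2}$ held fixed.

```python
# Minimal sketch (hypothetical functional forms): x1 and x2 are both driven
# by the same underlying parameters t and s, so they co-vary in any
# observational sample, and x2 cannot be held fixed while x1 varies.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
t = rng.uniform(0.0, 1.0, n)              # "time" parameter
s = rng.uniform(0.0, 1.0, n)              # "space" parameter

x1 = 2.0 * t + 0.5 * s                    # x1 = x1(t, s)
x2 = 1.5 * t - 0.3 * s                    # x2 = x2(t, s): same parameter domain
y = 1.0 + 3.0 * x1 - 2.0 * x2 + rng.normal(0.0, 0.1, n)

X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("OLS estimates:", beta_hat)

# The observed regressors co-vary, so "vary x1 while holding x2 fixed" is
# not something the sampled data can realize.
print("corr(x1, x2):", np.corrcoef(x1, x2)[0, 1])
```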
Analysis
All remedial procedures against the problem of multicollinearity are based on modifications of the matrix $X^{\top}X$ of the ordinary least-squares estimation [2]; in the case of two explanatory variables (written in deviation-from-mean form for brevity),

$$X^{\top}X=\begin{pmatrix}\sum_{i}x_{1i}^{2} & \sum_{i}x_{1i}x_{2i}\\ \sum_{i}x_{2i}x_{1i} & \sum_{i}x_{2i}^{2}\end{pmatrix}.\tag{1}$$
Since $X^{\top}X$ is a linear operator, transformations of it are in general linear or affine [3], [4]. Consider first a linear transformation, by a constant $a>0$, of a diagonal entry of the above matrix,

$$\sum_{i}x_{ji}^{2}\;\longmapsto\;a\sum_{i}x_{ji}^{2}=\sum_{i}\bigl(\sqrt{a}\,x_{ji}\bigr)^{2};\tag{2}$$

then each individual observation of $x_{j}$ undergoes the transformation

$$\tilde{x}_{ji}=\sqrt{a}\,x_{ji},\tag{3}$$

which still ends up a function of $(t,s)$. Consider now an affine transformation of a diagonal entry of the above matrix,

$$\sum_{i}x_{ji}^{2}\;\longmapsto\;a\sum_{i}x_{ji}^{2}+b;\tag{4}$$

but then the above (4) can be re-expressed in the form of (2), with the constant $b$ absorbed into the transformed observations, so that each $\tilde{x}_{ji}$ becomes an affine function of $x_{ji}$.
Thus, each diagonal entry of $X^{\top}X$ remains a function of $(t,s)$ following any linear/affine transformation, and therefore each transformed explanatory variable (3) is still a function of $(t,s)$. Next, consider the off-diagonal entries of $X^{\top}X$: denote the transformed $x_{1}$ and $x_{2}$ by $\tilde{x}_{1}$ and $\tilde{x}_{2}$, respectively; then one has

$$\sum_{i}\tilde{x}_{1i}\tilde{x}_{2i}=\sum_{i}\bigl(a_{1}x_{1i}+b_{1}\bigr)\bigl(a_{2}x_{2i}+b_{2}\bigr),\tag{5}$$

also still a function of $(t,s)$.
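A short numerical sketch of this point (continuing the illustrative variables from the Introduction, with arbitrary transformation constants $a_{1},b_{1},a_{2},b_{2}$): an affine transformation rescales and shifts each regressor, but the transformed variables remain functions of $(t,s)$ and their correlation is unchanged.

```python
# Sketch: affine transformations of the regressors (arbitrary constants
# a1, b1, a2, b2) leave their co-variation intact.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
t, s = rng.uniform(0.0, 1.0, n), rng.uniform(0.0, 1.0, n)
x1, x2 = 2.0 * t + 0.5 * s, 1.5 * t - 0.3 * s

a1, b1, a2, b2 = 0.7, 3.0, 1.9, -2.0       # arbitrary affine constants
x1t = a1 * x1 + b1                          # transformed x1, still a function of (t, s)
x2t = a2 * x2 + b2                          # transformed x2, still a function of (t, s)

print(np.corrcoef(x1, x2)[0, 1])            # original correlation
print(np.corrcoef(x1t, x2t)[0, 1])          # unchanged after the transformation
```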
As such, any linear or affine transformation of the matrix $X^{\top}X$ amounts to a transformation of each explanatory variable $x_{j}$ into $\tilde{x}_{j}=a_{j}x_{j}+b_{j}$. Consequently, the transformed regression equation remains of the form

$$y=\tilde{\beta}_{0}+\tilde{\beta}_{1}\tilde{x}_{1}+\tilde{\beta}_{2}\tilde{x}_{2}+\varepsilon,\tag{6}$$

i.e., the root problem of multicollinearity remains: while the t-statistics of $\tilde{\beta}_{1}$ and $\tilde{\beta}_{2}$ can be driven to high (absolute) values, to the extent that the confidence intervals around them become extremely narrow, one still has the problem

$$\tilde{\beta}_{1}\neq\frac{\partial y}{\partial\tilde{x}_{1}},\qquad\tilde{\beta}_{2}\neq\frac{\partial y}{\partial\tilde{x}_{2}},\tag{7}$$

by the simple fact that $\operatorname{Cov}(\tilde{x}_{1},\tilde{x}_{2})\neq 0$; i.e., all the explanatory variables co-vary, being functions of the same $(t,s)$. Therefore, the engineered, highly significant estimates of the coefficients can only lead to a false confidence in their values [5]–[7].
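As a rough numerical check of this claim (a sketch using made-up, nearly collinear data and an arbitrary ridge penalty $k$, in the spirit of [3]): adding $kI$ to $X^{\top}X$ makes the inverted matrix better conditioned, which is what tightens the reported intervals, yet the correlation between the regressors, and hence their co-variation, is untouched.

```python
# Sketch: modifying X'X (here, ridge regression's X'X + k*I) stabilizes the
# estimates, but does nothing to the co-variation of the regressors.
# Data and penalty k are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 200
t, s = rng.uniform(0.0, 1.0, n), rng.uniform(0.0, 1.0, n)
x1 = 2.0 * t + 0.5 * s
x2 = 1.9 * t + 0.4 * s                      # nearly collinear with x1
y = 1.0 + 3.0 * x1 - 2.0 * x2 + rng.normal(0.0, 0.1, n)

Xc = np.column_stack([x1 - x1.mean(), x2 - x2.mean()])
yc = y - y.mean()
k = 0.1                                      # arbitrary ridge penalty

beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)
beta_ridge = np.linalg.solve(Xc.T @ Xc + k * np.eye(2), Xc.T @ yc)
print("OLS:  ", beta_ols)
print("ridge:", beta_ridge)

# The matrix being inverted becomes better conditioned (smaller trace of the
# inverse), which is what shrinks the reported standard errors ...
print(np.trace(np.linalg.inv(Xc.T @ Xc)),
      np.trace(np.linalg.inv(Xc.T @ Xc + k * np.eye(2))))

# ... but the regressors co-vary exactly as before.
print("corr(x1, x2):", np.corrcoef(x1, x2)[0, 1])
```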
Summary Remark
From the above analysis, we see that hypothesis testing by a regression equation without an underlying mathematical model based on theory can be problematic, unless one fixes the explanatory variables’ input values $x_{1i}$ and $x_{2i}$ in advance and then collects the corresponding values $y_{i}$ of the dependent variable. Otherwise, no matter how one transforms the matrix $X^{\top}X$, the problem of co-varying explanatory variables persists; the transformation is then not only futile but can also be dangerous, considering a coefficient that carries the wrong sign (in the sense of the direction of the true effect of $x_{j}$ on $y$) yet a high degree of confidence in engineering or medical applications.
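For contrast, here is a minimal sketch of the designed-experiment alternative just described (the grid levels and coefficients are again illustrative assumptions): when the analyst fixes $x_{1}$ and $x_{2}$ on a full factorial grid before observing $y$, the two inputs are uncorrelated by construction, and along each grid line $x_{2}$ really is held fixed while $x_{1}$ varies.

```python
# Sketch of the designed-experiment alternative: x1 and x2 are fixed on a
# full factorial grid in advance (values are illustrative), so they do not
# co-vary, and the coefficient on x1 is a genuine "x2 held fixed" effect.
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
levels = np.linspace(0.0, 1.0, 5)
grid = np.array(list(product(levels, repeat=2)))   # 25 predetermined runs
x1, x2 = grid[:, 0], grid[:, 1]
print("corr(x1, x2):", np.corrcoef(x1, x2)[0, 1])  # ~0 by construction

y = 1.0 + 3.0 * x1 - 2.0 * x2 + rng.normal(0.0, 0.1, len(x1))
X = np.column_stack([np.ones(len(x1)), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimates:", beta_hat)                      # close to (1, 3, -2)
```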
References
[1] Kmenta J. Elements of Econometrics. New York: Macmillan; 1971.
[2] Huffel SV, Vandewalle J. Algebraic Connections between Total Least Squares Estimation and Classical Linear Regression in Multicollinearity Problems. Philadelphia, PA: SIAM; 1991, ch. 9. doi: 10.1137/1.9781611971002.
[3] Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67. doi: 10.2307/1267351.
[4] Liu K. A new class of biased estimate in linear regression. Commun Stat Theory Methods. 1993;22:393–402. doi: 10.1080/03610929308831027.
[5] Liu R, Wang H, Wang S. Functional variable selection via Gram-Schmidt orthogonalization for multiple functional linear regression. J Stat Comput Simul. 2018;88:3664–80. doi: 10.1080/00949655.2018.1530776.
[6] Oman SD. A confidence bound approach to choosing the biasing parameter in ridge regression. JASA. 1981;76:452–61. doi: 10.2307/2287849.
[7] Smith G, Campbell F. A critique of some ridge regression methods. JASA. 1980;75:74–81. doi: 10.1080/01621459.1980.10477428.