Regression Models with Multiple Regressors

In what follows, linear regression models are introduced that use more than one explanatory variable. We will discuss important key concepts of multiple regression, and as we broaden our scope beyond the relationship between only two variables (the dependent variable and a single regressor), some new issues arise, such as multicollinearity and omitted variable bias (OVB). In particular, this section deals with omitted variables and their implications for the causal interpretation of OLS-estimated coefficients.

Naturally, we will discuss estimation of multiple regression models using R. We will also illustrate the importance of thoughtful usage of multiple regression models via simulation studies that demonstrate the consequences of using highly correlated regressors or misspecified models.

Make sure the packages MASS, mvtnorm, stargazer, sandwich, lmtest, and car are installed before you go ahead and replicate the examples. The safest way to do so is to check whether the following code chunk executes without any issues.

library(MASS)
library(stargazer)
library(lmtest)
library(sandwich)
library(car)
#install.packages("car")
#install.packages("stargazer")
#install.packages("lmtest")
#install.packages("sandwichr")

Omitted Variable Bias

The previous analysis of the relationship between patient score and patient-doctor ratio has a major flaw: we ignored other determinants of the dependent variable (patient score) that correlate with the regressor (patient-doctor ratio). Remember that influences on the dependent variable which are not captured by the model are collected in the error term, which we so far assumed to be uncorrelated with the regressor. However, this assumption is violated if we exclude determinants of the dependent variable which vary with the regressor. This may induce an estimation bias, i.e., the mean of the OLS estimator’s sampling distribution no longer equals the true parameter value. In our example, we therefore wrongly estimate the causal effect on patient scores of a unit change in the patient-doctor ratio, on average. This issue is called omitted variable bias (OVB) and is summarized by Key Concept 23.

Key Concept 23

Omitted Variable Bias in Regression

Omitted variable bias is the bias in the OLS estimator that arises when the regressor, \(X\), is correlated with an omitted variable. For omitted variable bias to occur, two conditions must be fulfilled:

  1. \(X\) is correlated with the omitted variable.

  2. The omitted variable is a determinant of the dependent variable \(Y\).

Together, 1. and 2. result in a violation of the first OLS assumption \(E(u_i\vert X_i) = 0\). Formally, the resulting bias can be expressed as \[ \hat\beta_1 \xrightarrow[]{p} \beta_1 + \rho_{Xu} \frac{\sigma_u}{\sigma_X}. \] This states that OVB is a problem that cannot be solved by increasing the number of observations used to estimate \(\beta_1\), as \(\hat\beta_1\) is inconsistent: OVB prevents the estimator from converging in probability to the true parameter value. Strength and direction of the bias are determined by \(\rho_{Xu}\), the correlation between the error term and the regressor.
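
To get a feel for these mechanics, consider the following minimal simulation sketch (all variable names and parameter values are made up for illustration): the omitted variable \(W\) drives \(Y\) and is correlated with the regressor \(X\), and the bias does not vanish even though the sample is large.

# a minimal sketch of OVB with simulated data (all names and values are made up)
set.seed(123)
n <- 10000                           # a large sample does not remove the bias
W <- rnorm(n)                        # omitted variable
X <- 0.8 * W + rnorm(n)              # regressor, correlated with W
Y <- 1 + 2 * X - 3 * W + rnorm(n)    # the true model includes W
coef(lm(Y ~ X))["X"]                 # biased: clearly below the true value 2
coef(lm(Y ~ X + W))["X"]             # close to 2 once W is included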

In the example of patient score and patient-doctor ratio, it is easy to come up with variables that may cause such a bias if omitted from the model. A highly relevant variable is the health of patients when they enter the hospital: it is highly likely that this is an important determinant of patient outcomes, since patients who enter in poorer health are more likely to suffer complications in hospital. It is also conceivable that the health of the local population is worse in areas served by hospitals with worse patient-doctor ratios: think of poor urban districts of high density.

Let us think about a possible bias induced by omitting the health of the patients when they enter the hospital (\(inpatient\), a disease score computed as the sum of issues recorded in the medical record). When the estimated regression model does not include \(inpatient\) as a regressor although the true data generating process (DGP) is

\[ Score = \beta_0 + \beta_1 \times PDR + \beta_2 \times inpatient \]

where \(PDR\) and \(inpatient\) are correlated, we have

\[\rho_{PDR,inpatient}\neq0.\]

Let us investigate this using R. After defining our variables we may compute the correlation between \(PDR\) and \(inpatient\) as well as the correlation between \(PDR\) and \(Score\).

# load the data set
data <- read.table("~/Desktop/hospitals.txt", header=T)
# define variables
data$PDR <- data$patients/data$doctors
data$score <- (data$outpatient_1 + data$outpatient_2)/2
# compute correlations
cor(data$PDR, data$score)
## [1] -0.2263627
cor(data$PDR, data$inpatient)
## [1] 0.1876424

The fact that \(\widehat{\rho}_{PDR, inpatient} = 0.1876\) is cause for concern that omitting \(inpatient\) leads to a negatively biased estimate \(\hat\beta_1\): since \(inpatient\) is a determinant of patient scores (entering the error term of the short regression with a negative sign) and is positively correlated with \(PDR\), we have \(\rho_{Xu} < 0\). As a consequence we expect \(\hat\beta_1\), the coefficient on \(PDR\), to be too large in absolute value. Put differently, OLS suggests that having more doctors relative to patients improves patient scores, but it overestimates this effect because it also captures the effect of admitting healthier inpatients.

What happens to the magnitude of \(\hat\beta_1\) if we add the variable \(inpatient\) to the regression, that is, if we estimate the model \[ Score = \beta_0 + \beta_1 \times PDR + \beta_2 \times inpatient + u \]

instead? And what do we expect about the sign of \(\hat\beta_2\), the estimated coefficient on \(inpatient\)? Following the reasoning above, we should end up with an estimate \(\hat\beta_1\) that is still negative but smaller in absolute value than before, and with a negative estimate \(\hat\beta_2\).

Let us estimate both regression models and compare. Performing a multiple regression in R is straightforward. One can simply add additional variables to the right hand side of the formula argument of the function lm by using their names and the + operator.

# estimate both regression models
mod <- lm(score ~ PDR, data = data)
mult.mod <- lm(score ~ PDR + inpatient, data = data)
# print the results to the console
mod
##
## Call:
## lm(formula = score ~ PDR, data = data)
##
## Coefficients:
## (Intercept)          PDR
##      698.93        -2.28
mult.mod
##
## Call:
## lm(formula = score ~ PDR + inpatient, data = data)
##
## Coefficients:
## (Intercept)          PDR    inpatient
##    686.0322      -1.1013      -0.6498

We find the outcomes to be consistent with our expectations.

The following section discusses some theory on multiple regression models.

The Multiple Regression Model

The multiple regression model extends the basic concept of the simple regression model. It enables us to estimate the effect on \(Y_i\) of changing a regressor \(X_{1i}\) while the remaining regressors \(X_{2i},X_{3i},\dots,X_{ki}\) do not vary. In fact, we already estimated a multiple regression model using R in the previous section. The interpretation of the coefficient on the patient-doctor ratio is the effect on patient scores of a one-unit change in the patient-doctor ratio if the health of the inpatients is kept constant.

Just like in the simple regression model, we assume the true relationship between \(Y\) and \(X_{1i},X_{2i},\dots,X_{ki}\) to be linear. On average, this relationship is given by the population regression function

\[ E(Y_i\vert X_{1i}=x_1, X_{2i}=x_2, X_{3i}=x_3,\dots, X_{ki}=x_k) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k. \]

As in the simple regression model, the relation \[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \beta_k X_{ki}\] does not hold exactly since there are influences on the dependent variable \(Y\) that we cannot observe and thus cannot include as explanatory variables. Therefore we add an error term \(u\) which represents deviations of the observations from the population regression line. This yields the population multiple regression model \[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \beta_k X_{ki} + u_i, \ i=1,\dots,n. \]

Key Concept 24 summarizes the core concepts of the multiple regression model.

Key Concept 24

The Multiple Regression Model

The multiple regression model is \[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \beta_k X_{ki} + u_i \ \ , \ \ i=1,\dots,n. \]

The designations are similar to those in the simple regression model:

- \(Y_i\) is the \(i^{th}\) observation of the dependent variable. Observations on the \(k\) regressors are denoted by \(X_{1i},X_{2i},\dots,X_{ki}\), and \(u_i\) is the error term.

- The average relationship between \(Y\) and the regressors is given by the population regression line \[ E(Y_i\vert X_{1i}=x_1, X_{2i}=x_2, X_{3i}=x_3,\dots, X_{ki}=x_k) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k. \]

- \(\beta_0\) is the intercept; it is the expected value of \(Y\) when all \(X\)s equal \(0\). \(\beta_j \ , \ j=1,\dots,k\) are the coefficients on \(X_j \ , \ j=1,\dots,k\). \(\beta_1\) measures the expected change in \(Y_i\) that results from a one unit change in \(X_{1i}\) while holding all other regressors constant.

How can we estimate the coefficients of the multiple regression model? We will not go too much into detail on this issue as our focus is on using R. However, it should be pointed out that, similarly to the simple regression model, the coefficients of the multiple regression model can be estimated using OLS. As in the simple model, we seek to minimize the sum of squared mistakes by choosing estimates \(b_0,b_1,\dots,b_k\) for the coefficients \(\beta_0,\beta_1,\dots,\beta_k\) such that

\[\sum_{i=1}^n (Y_i - b_0 - b_1 X_{1i} - b_2 X_{2i} - \dots - b_k X_{ki})^2 \]

is minimized. Note that this is simply an extension of the case with just one regressor and a constant. The estimators that minimize this equation are hence denoted \(\hat\beta_0,\hat\beta_1,\dots,\hat\beta_k\) and, as in the simple regression model, we call them the ordinary least squares estimators of \(\beta_0,\beta_1,\dots,\beta_k\). For the predicted value of \(Y_i\) given the regressors and the estimates \(\hat\beta_0,\hat\beta_1,\dots,\hat\beta_k\) we have

\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_{1i} + \dots +\hat\beta_k X_{ki}. \] The difference between \(Y_i\) and its predicted value \(\hat{Y}_i\) is called the OLS residual of observation \(i\): \(\hat{u}_i = Y_i - \hat{Y}_i\).
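
Although we rely on lm() for estimation throughout, it may help to note that the minimizers have the closed form \(\hat\beta = (X'X)^{-1}X'Y\). The following sketch with simulated toy data (names and values are made up) verifies that this formula reproduces the estimates returned by lm().

# a quick sketch: the closed-form OLS solution reproduces lm() (toy data)
set.seed(42)
n <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)
Y <- 1 + 2 * X1 - 0.5 * X2 + rnorm(n)
X <- cbind(1, X1, X2)            # design matrix including the intercept column
solve(t(X) %*% X, t(X) %*% Y)    # OLS estimates via the normal equations
coef(lm(Y ~ X1 + X2))            # the same estimates from lm()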

Now let us jump back to the example of patient scores and patient-doctor ratios. The estimated model object is mult.mod. As for simple regression models we can use summary() to obtain information on estimated coefficients and model statistics.

summary(mult.mod)$coef
##                Estimate Std. Error    t value      Pr(>|t|)
## (Intercept) 686.0322445 7.41131160  92.565565 3.871327e-280
## PDR          -1.1012956 0.38027827  -2.896026  3.978059e-03
## inpatient    -0.6497768 0.03934254 -16.515882  1.657448e-47

So the estimated multiple regression model is

\[ \widehat{Score} = 686.03 - 1.10 \times PDR - 0.65 \times inpatient \]

Unlike in the simple regression model where the data can be represented by points in the two-dimensional coordinate system, we now have three dimensions. Hence observations can be represented by points in three-dimensional space. Therefore we no longer have a regression line but a regression plane. This idea extends to higher dimensions when we further expand the number of regressors \(k\). We then say that the regression model can be represented by a hyperplane in the \(k+1\) dimensional space. It is already hard to imagine such a space if \(k=3\) and we best stick with the general idea that, in the multiple regression model, the dependent variable is explained by a linear combination of the regressors. However, in the present case we are able to visualize the situation. The following figure is an interactive 3D visualization of the data and the estimated regression plane.

We observe that the estimated regression plane fits the data reasonably well, at least with regard to the shape and spatial position of the points. The color of the markers indicates the absolute deviation from the estimated regression plane: observations colored more reddish lie close to the regression plane, while the color shifts to blue with growing distance. An anomaly that can be seen from the plot is possible heteroskedasticity: the dispersion of the regression errors, i.e., the distance of the observations to the regression plane, tends to decrease as the inpatient disease score increases.
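
Since the 3D plot can be hard to read, a simple two-dimensional check of this pattern is to plot the residuals of mult.mod against the inpatient disease score (a sketch; the plotting options are arbitrary).

# plot residuals against the inpatient disease score
plot(data$inpatient,
     residuals(mult.mod),
     xlab = "Inpatient disease score",
     ylab = "Residual",
     pch = 20,
     col = "steelblue")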

Measures of Fit in Multiple Regression

In multiple regression, common summary statistics are \(SER\), \(R^2\) and the adjusted \(R^2\).

Use summary(mult.mod) to obtain the \(SER\), \(R^2\) and adjusted-\(R^2\). For multiple regression models the \(SER\) is computed as

\[ SER = s_{\hat u} = \sqrt{s_{\hat u}^2} \]

where we modify the denominator of the premultiplied factor in \(s_{\hat u}^2\) in order to accommodate additional regressors. Thus,

\[ s_{\hat u}^2 = \frac{1}{n-k-1} \, SSR \]

with \(k\) denoting the number of regressors excluding the intercept.

While summary() computes the \(R^2\) just as in the case of a single regressor, it is not a reliable measure for multiple regression models. This is because \(R^2\) increases whenever an additional regressor is added to the model: adding a regressor decreases the \(SSR\) unless the respective estimated coefficient is exactly zero, which practically never happens. The adjusted \(R^2\) takes this into consideration by “punishing” the addition of regressors using a correction factor. The adjusted \(R^2\), or simply \(\bar{R}^2\), is thus a modified version of \(R^2\). It is defined as

\[ \bar{R}^2 = 1-\frac{n-1}{n-k-1} \, \frac{SSR}{TSS}. \]

As you may have already suspected, summary() adjusts the formula for the \(SER\) accordingly and computes both \(\bar{R}^2\) and, of course, \(R^2\) by default, thereby leaving the decision which measure to rely on to the user.
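
To see the “punishment” at work, one can add a regressor that is pure noise and compare both measures (a sketch; the variable noise is made up and unrelated to the data).

# add a pure-noise regressor to the model
set.seed(10)
data$noise <- rnorm(nrow(data))
noisy.mod <- lm(score ~ PDR + inpatient + noise, data = data)
# R^2 necessarily (weakly) increases ...
c(summary(mult.mod)$r.squared, summary(noisy.mod)$r.squared)
# ... while the adjusted R^2 applies a penalty and will typically not increase
c(summary(mult.mod)$adj.r.squared, summary(noisy.mod)$adj.r.squared)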

You can find both measures at the bottom of the output produced by calling summary(mult.mod).

summary(mult.mod)
##
## Call:
## lm(formula = score ~ PDR + inpatient, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -48.845 -10.240  -0.308   9.815  43.461
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 686.03224    7.41131  92.566  < 2e-16 ***
## PDR          -1.10130    0.38028  -2.896  0.00398 **
## inpatient    -0.64978    0.03934 -16.516  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared:  0.4264, Adjusted R-squared:  0.4237
## F-statistic:   155 on 2 and 417 DF,  p-value: < 2.2e-16

We can also compute the measures by hand using the formulas above. Let us check that the results coincide with the values provided by summary().

# define the components
n <- nrow(data)                                 # number of observations (rows)
k <- 2                                          # number of regressors
y_mean <- mean(data$score)                      # mean of patient scores
SSR <- sum(residuals(mult.mod)^2)               # sum of squared residuals
TSS <- sum((data$score - y_mean )^2)            # total sum of squares
ESS <- sum((fitted(mult.mod) - y_mean)^2)       # explained sum of squares
# compute the measures
SER <- sqrt(1/(n-k-1) * SSR)                    # standard error of the regression
Rsq <- 1 - (SSR / TSS)                          # R^2
adj_Rsq <- 1 - (n-1)/(n-k-1) * SSR/TSS          # adj. R^2
# print the measures to the console
c("SER" = SER, "R2" = Rsq, "Adj.R2" = adj_Rsq)
##        SER         R2     Adj.R2
## 14.4644831  0.4264315  0.4236805

Now, what can we say about the fit of our multiple regression model for hospital scores with inpatient health as an additional regressor? Does it improve on the simple model including only an intercept and patient-doctor ratio? The answer is yes: compare \(\bar{R}^2\) with that obtained for the simple regression model mod.

Including \(inpatient\) as a regressor improves the \(\bar{R}^2\), which we deem to be more reliable in view of the above discussion. Notice that the difference between \(R^2\) and \(\bar{R}^2\) is small since \(k=2\) and \(n\) is large. In short, the fit improves vastly on the fit of the simple regression model with \(PDR\) as the only regressor.

Comparing regression errors, we find that the precision of the multiple regression model improves upon the simple model: adding \(inpatient\) lowers the \(SER\) from \(18.6\) to \(14.5\) units of patient score.
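
The comparison can be made explicit with a few lines of code (a sketch, assuming the model objects mod and mult.mod from above are still in memory).

# compare the fit of the simple and the multiple regression model
c("Adj.R2 simple" = summary(mod)$adj.r.squared,
  "Adj.R2 multiple" = summary(mult.mod)$adj.r.squared)
c("SER simple" = summary(mod)$sigma,
  "SER multiple" = summary(mult.mod)$sigma)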

As already mentioned, \(\bar{R}^2\) may be used to quantify how well a model fits the data. However, it is rarely a good idea to maximize these measures by stuffing the model with regressors; you will not find any serious study that does so. Instead, it is more useful to include regressors that improve the estimation of the causal effect of interest, which is not assessed by means of the \(R^2\) of the model.

OLS Assumptions in Multiple Regression

In the multiple regression model, we extend the three least squares assumptions of the simple regression model and add a fourth assumption. These assumptions are presented in Key Concept 25. We will not go into the details of assumptions 1-3 since their ideas generalize easily to the case of multiple regressors. We will focus on the fourth assumption, which rules out perfect correlation between regressors.

Key Concept 25

Multiple Regression Assumptions

The multiple regression model is given by \[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i \ , \ i=1,\dots,n. \]

The OLS assumptions in the multiple regression model are an extension of the ones made for the simple regression model:

1. The observations \((X_{1i}, X_{2i}, \dots, X_{ki}, Y_i) \ , \ i=1,\dots,n\), are drawn such that the i.i.d. assumption holds.

2. \(u_i\) is an error term with conditional mean zero given the regressors, i.e., \[ E(u_i\vert X_{1i}, X_{2i}, \dots, X_{ki}) = 0. \]

3. Large outliers are unlikely, formally \(X_{1i},\dots,X_{ki}\) and \(Y_i\) have finite fourth moments.

4. No perfect multicollinearity.

Multicollinearity

Multicollinearity means that two or more regressors in a multiple regression model are strongly correlated. If the correlation between two or more regressors is perfect, that is, one regressor can be written as a linear combination of the other(s), we have perfect multicollinearity. While strong multicollinearity in general is unpleasant as it causes the variance of the OLS estimator to be large (we will discuss this in more detail later), the presence of perfect multicollinearity makes it impossible to solve for the OLS estimator, i.e., the model cannot be estimated in the first place.

The next section presents some examples of perfect multicollinearity and demonstrates how lm() deals with them.

Examples of Perfect Multicollinearity

How does R react if we try to estimate a model with perfectly correlated regressors?

lm will flag this with a note in the first line of the coefficient section of the output (1 not defined because of singularities) and ignore the regressor(s) which is (are) a linear combination of the other(s). Consider the following example where we add another variable, scale_inpatient, to data, whose observations are scaled versions of the observations on inpatient, and use it as a regressor together with PDR and inpatient in a multiple regression model. In this example inpatient and scale_inpatient are perfectly collinear. The R code is as follows.

# define the scaled inpatient values
data$scale_inpatient <- data$inpatient / 100
# estimate the model
mult.mod <- lm(score ~ PDR + inpatient + scale_inpatient, data = data)
# obtain a summary of the model
summary(mult.mod)                                                 
##
## Call:
## lm(formula = score ~ PDR + inpatient + scale_inpatient, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -48.845 -10.240  -0.308   9.815  43.461
##
## Coefficients: (1 not defined because of singularities)
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     686.03224    7.41131  92.566  < 2e-16 ***
## PDR              -1.10130    0.38028  -2.896  0.00398 **
## inpatient        -0.64978    0.03934 -16.516  < 2e-16 ***
## scale_inpatient        NA         NA      NA       NA
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.46 on 417 degrees of freedom
## Multiple R-squared:  0.4264, Adjusted R-squared:  0.4237
## F-statistic:   155 on 2 and 417 DF,  p-value: < 2.2e-16

The row scale_inpatient in the coefficients section of the output consists of NA entries since scale_inpatient was excluded from the model.

If we were to compute OLS by hand, we would run into the same problem but no one would be helping us out! The computation simply fails.

Why is this? Take the following example:

Assume you want to estimate a simple linear regression model with a constant and a single regressor \(X\). As mentioned above, for perfect multicollinearity to be present \(X\) has to be a linear combination of the other regressors. Since the only other regressor is the constant (think of the right-hand side of the model equation as \(\beta_0 \times 1 + \beta_1 X_i + u_i\), so that \(\beta_0\) is always multiplied by \(1\) for every observation), \(X\) has to be constant as well. For \(\hat\beta_1\) we have

\[ \hat\beta_1 = \frac{\sum_{i = 1}^n (X_i - \bar{X})(Y_i - \bar{Y})} { \sum_{i=1}^n (X_i - \bar{X})^2} = \frac{\widehat{Cov}(X,Y)}{\widehat{Var}(X)}. \]

The variance of the regressor \(X\) is in the denominator. Since the variance of a constant is zero, we are not able to compute this fraction and \(\hat{\beta}_1\) is undefined.
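
The same phenomenon can be reproduced directly in R (a sketch with made-up data): when the regressor has no variation, lm() reports an NA coefficient.

# a regressor without any variation
y <- rnorm(10)
x <- rep(3, 10)
lm(y ~ x)$coefficients    # the slope on x cannot be computed and is reported as NA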

Let us consider two further examples where our selection of regressors induces perfect multicollinearity. First, assume that we intend to analyze the effect of the patient-doctor ratio on score by using a dummy variable that identifies hospitals whose ratio is not small (\(NS\)). We define that a hospital has the \(NS\) attribute when its average patient-doctor ratio is at least \(12\),

\[ NS = \begin{cases} 0, \ \ \ \text{if PDR < 12} \\ 1 \ \ \ \text{otherwise.} \end{cases} \]

We add the corresponding column to data and estimate a multiple regression model of score on beds, inpatient and NS, where beds is the number of hospital beds available.

# if PDR smaller 12, NS = 0, else NS = 1
data$NS <- ifelse(data$PDR < 12, 0, 1)
# estimate the model
mult.mod <- lm(score ~ beds + inpatient + NS, data = data)
# obtain a model summary
summary(mult.mod)                                                  
##
## Call:
## lm(formula = score ~ beds + inpatient + NS, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -49.492  -9.976  -0.778   8.761  43.798
##
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 663.704837   0.984259 674.319  < 2e-16 ***
## beds          0.005374   0.001670   3.218  0.00139 **
## inpatient    -0.708947   0.040303 -17.591  < 2e-16 ***
## NS                  NA         NA      NA       NA
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.43 on 417 degrees of freedom
## Multiple R-squared:  0.4291, Adjusted R-squared:  0.4263
## F-statistic: 156.7 on 2 and 417 DF,  p-value: < 2.2e-16

Again, the output of summary(mult.mod) tells us that inclusion of NS in the regression renders the estimation infeasible. What happened here? This is an example where we made a logical mistake when defining the regressor NS: taking a closer look at \(NS\), the dummy derived from the patient-doctor ratio, reveals that there is not a single hospital with \(PDR<12\); hence \(NS\) equals one for all observations. We can check this by printing the contents of NS or by using the function table(), see ?table.

table(data$NS)
##
##   1
## 420

data$NS is a vector of \(420\) ones and our data set includes \(420\) observations. This obviously violates assumption 4 of the Key Concept above: the observations for the intercept are always \(1\),

\[\begin{align*} intercept = \, & \lambda \cdot NS \\ \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = \, & \lambda \cdot \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \\ \Leftrightarrow \, & \lambda = 1. \end{align*}\]

Since the regressors can be written as a linear combination of each other, we face perfect multicollinearity and R excludes NS from the model.

Thus the take-away message is: think carefully about how the regressors in your models relate!

Another example of perfect multicollinearity is known as the dummy variable trap. This may occur when multiple dummy variables are used as regressors. A common case for this is when dummies are used to sort the data into mutually exclusive categories. For example, suppose we have spatial information that indicates whether a hospital is located in the North, West, South or East of Switzerland. This allows us to create the dummy variables

\[\begin{align*} North_i =& \begin{cases} 1 \ \ \text{if located in the north} \\ 0 \ \ \text{otherwise} \end{cases} \\ West_i =& \begin{cases} 1 \ \ \text{if located in the west} \\ 0 \ \ \text{otherwise} \end{cases} \\ South_i =& \begin{cases} 1 \ \ \text{if located in the south} \\ 0 \ \ \text{otherwise} \end{cases} \\ East_i =& \begin{cases} 1 \ \ \text{if located in the east} \\ 0 \ \ \text{otherwise}. \end{cases} \end{align*}\]

Since the regions are mutually exclusive, for every hospital \(i=1,\dots,n\) we have \[ North_i + West_i + South_i + East_i = 1. \]

We run into problems when trying to estimate a model that includes a constant and all four direction dummies in the model, e.g., \[ Score = \beta_0 + \beta_1 \times PDR + \beta_2 \times inpatient + \beta_3 \times North_i + \beta_4 \times West_i + \beta_5 \times South_i + \beta_6 \times East_i + u_i \] since then for all observations \(i=1,\dots,n\) the constant term is a linear combination of the dummies:

\[\begin{align} intercept = \, & \lambda_1 \cdot (North + West + South + East) \\ \begin{pmatrix} 1 \\ \vdots \\ 1\end{pmatrix} = \, & \lambda_1 \cdot \begin{pmatrix} 1 \\ \vdots \\ 1\end{pmatrix} \\ \Leftrightarrow \, & \lambda_1 = 1 \end{align}\]

and we have perfect multicollinearity. Thus the “dummy variable trap” means not paying attention and falsely including exhaustive dummies and a constant in a regression model.

How does lm() handle a regression like this? Let us first generate some artificial categorical data, append a new column named direction to data, and see how lm() behaves when asked to estimate the model.

# set seed for reproducibility
set.seed(1)
# generate artificial data on location
data$direction <- sample(c("West", "North", "South", "East"),
                              420,
                              replace = T)
# estimate the model
mult.mod <- lm(score ~ PDR + inpatient + direction, data = data)
# obtain a model summary
summary(mult.mod)                                                 
##
## Call:
## lm(formula = score ~ PDR + inpatient + direction, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -49.603 -10.175  -0.484   9.524  42.830
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    684.80477    7.54130  90.807  < 2e-16 ***
## PDR             -1.08873    0.38153  -2.854  0.00454 **
## inpatient       -0.65597    0.04018 -16.325  < 2e-16 ***
## directionNorth   1.66314    2.05870   0.808  0.41964
## directionSouth   0.71619    2.06321   0.347  0.72867
## directionWest    1.79351    1.98174   0.905  0.36598
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.5 on 414 degrees of freedom
## Multiple R-squared:  0.4279, Adjusted R-squared:  0.421
## F-statistic: 61.92 on 5 and 414 DF,  p-value: < 2.2e-16

Notice that R solves the problem on its own by generating and including the dummies directionNorth, directionSouth and directionWest but omitting directionEast. Of course, omitting any other of the dummies instead would achieve the same. Another solution would be to exclude the constant and to include all dummies instead.

Does this mean that the information on hospitals located in the East is lost? Fortunately, this is not the case: excluding directionEast just alters the interpretation of the coefficient estimates on the remaining dummies from absolute to relative. For example, the coefficient estimate on directionNorth states that, on average, patient scores in the North are about \(1.66\) points higher than in the East.
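
If, say, North should serve as the omitted reference category instead, the factor can be releveled accordingly (a sketch; the coefficients on the remaining dummies are then relative to North).

# re-estimate the model with North as the omitted reference category
lm(score ~ PDR + inpatient + relevel(factor(direction), ref = "North"),
   data = data)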

Imperfect Multicollinearity

As opposed to perfect multicollinearity, imperfect multicollinearity is — to a certain extent — less of a problem. In fact, imperfect multicollinearity is the reason why we are interested in estimating multiple regression models in the first place: the OLS estimator allows us to isolate influences of correlated regressors on the dependent variable. If it was not for these dependencies, there would not be a reason to resort to a multiple regression approach and we could simply work with a single-regressor model. However, this is rarely the case in applications. We already know that ignoring dependencies among regressors which influence the outcome variable has an adverse effect on estimation results.

So when and why is imperfect multicollinearity a problem? Suppose you have the regression model

\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \]

and you are interested in estimating \(\beta_1\), the effect on \(Y_i\) of a one unit change in \(X_{1i}\), while holding \(X_{2i}\) constant. You do not know that the true model indeed includes \(X_2\). You follow some reasoning and add \(X_2\) as a covariate to the model in order to address a potential omitted variable bias. You are confident that \(E(u_i\vert X_{1i}, X_{2i})=0\) and that there is no reason to suspect a violation of the assumptions 2 and 3 made in the Key Concept above. If \(X_1\) and \(X_2\) are highly correlated, OLS struggles to precisely estimate \(\beta_1\). That means that although \(\hat\beta_1\) is a consistent and unbiased estimator for \(\beta_1\), it has a large variance due to \(X_2\) being included in the model. If the errors are homoskedastic, this issue can be better understood from the formula for the variance of \(\hat\beta_1\) in the model:

\[ \sigma^2_{\hat\beta_1} = \frac{1}{n} \left( \frac{1}{1-\rho^2_{X_1,X_2}} \right) \frac{\sigma^2_u}{\sigma^2_{X_1}}. \]

First, if \(\rho_{X_1,X_2}=0\), i.e., if there is no correlation between both regressors, including \(X_2\) in the model has no influence on the variance of \(\hat\beta_1\). Secondly, if \(X_1\) and \(X_2\) are correlated, \(\sigma^2_{\hat\beta_1}\) is inversely proportional to \(1-\rho^2_{X_1,X_2}\) so the stronger the correlation between \(X_1\) and \(X_2\), the smaller is \(1-\rho^2_{X_1,X_2}\) and thus the bigger is the variance of \(\hat\beta_1\). Thirdly, increasing the sample size helps to reduce the variance of \(\hat\beta_1\). Of course, this is not limited to the case with two regressors: in multiple regressions, imperfect multicollinearity inflates the variance of one or more coefficient estimators. It is an empirical question which coefficient estimates are severely affected by this and which are not. When the sample size is small, one often faces the decision whether to accept the consequence of adding a large number of covariates (higher variance) or to use a model with only few regressors (possible omitted variable bias). This is called bias-variance trade-off.

In sum, undesirable consequences of imperfect multicollinearity are generally not the result of a logical error made by the researcher (as is often the case for perfect multicollinearity) but are rather a problem that is linked to the data used, the model to be estimated and the research question at hand.

Simulation Study: Imperfect Multicollinearity

Let us conduct a simulation study to illustrate the issues sketched above.

  1. We use the equation above as the data generating process and choose \(\beta_0 = 5\), \(\beta_1 = 2.5\) and \(\beta_2 = 3\); the error term \(u_i\) is normally distributed with mean \(0\) and standard deviation \(5\). In a first step, we sample the regressor data from a bivariate normal distribution: \[ X_i = (X_{1i}, X_{2i}) \overset{i.i.d.}{\sim} \mathcal{N} \left[\begin{pmatrix} 50 \\ 100 \end{pmatrix}, \begin{pmatrix} 10 & 2.5 \\ 2.5 & 10 \end{pmatrix} \right] \] It is straightforward to see that the correlation between \(X_1\) and \(X_2\) in the population is rather low:

\[ \rho_{X_1,X_2} = \frac{Cov(X_1,X_2)}{\sqrt{Var(X_1)}\sqrt{Var{(X_2)}}} = \frac{2.5}{10} = 0.25 \]

  2. Next, we estimate the model and save the estimates for \(\beta_1\) and \(\beta_2\). This is repeated \(10000\) times with a for loop so we end up with a large number of estimates that allow us to describe the distributions of \(\hat\beta_1\) and \(\hat\beta_2\).

  3. We repeat steps 1 and 2 but increase the covariance between \(X_1\) and \(X_2\) from \(2.5\) to \(8.5\) such that the correlation between the regressors is high: \[ \rho_{X_1,X_2} = \frac{Cov(X_1,X_2)}{\sqrt{Var(X_1)}\sqrt{Var{(X_2)}}} = \frac{8.5}{10} = 0.85 \]

  4. In order to assess the effect on the precision of the estimators of increasing the collinearity between \(X_1\) and \(X_2\) we estimate the variances of \(\hat\beta_1\) and \(\hat\beta_2\) and compare.

# load packages
library(MASS)
library(mvtnorm)
# set number of observations
n <- 50
# initialize matrices for the coefficient estimates
coefs1 <- cbind("hat_beta_1" = numeric(10000), "hat_beta_2" = numeric(10000))
coefs2 <- coefs1
# set seed
set.seed(1)
# loop sampling and estimation
for (i in 1:10000) {

  # draw sample with cor(X_1,X_2) = 0.25 (covariance 2.5)
  X <- rmvnorm(n, c(50, 100), sigma = cbind(c(10, 2.5), c(2.5, 10)))
  u <- rnorm(n, sd = 5)
  Y <- 5 + 2.5 * X[, 1] + 3 * X[, 2] + u
  coefs1[i, ] <- lm(Y ~ X[, 1] + X[, 2])$coefficients[-1]

  # draw sample with cor(X_1,X_2) = 0.85 (covariance 8.5)
  X <- rmvnorm(n, c(50, 100), sigma = cbind(c(10, 8.5), c(8.5, 10)))
  Y <- 5 + 2.5 * X[, 1] + 3 * X[, 2] + u
  coefs2[i, ] <- lm(Y ~ X[, 1] + X[, 2])$coefficients[-1]

}
# obtain variance estimates
diag(var(coefs1))
## hat_beta_1 hat_beta_2
## 0.05674375 0.05712459
diag(var(coefs2))
## hat_beta_1 hat_beta_2
##  0.1904949  0.1909056

We are interested in the variances which are the diagonal elements. We see that due to the high collinearity, the variances of \(\hat\beta_1\) and \(\hat\beta_2\) have more than tripled, meaning it is more difficult to precisely estimate the true coefficients.
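
This is consistent with the homoskedasticity-only variance formula above: \(\sigma^2_{\hat\beta_1}\) is proportional to \(1/(1-\rho^2_{X_1,X_2})\), so moving from \(\rho = 0.25\) to \(\rho = 0.85\) should inflate the variance by roughly the factor computed below (a quick check using the simulated estimates).

# variance inflation predicted by the formula: proportional to 1/(1 - rho^2)
(1 - 0.25^2) / (1 - 0.85^2)              # roughly 3.4
# ratio of the simulated variances
diag(var(coefs2)) / diag(var(coefs1))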

The Distribution of the OLS Estimators in Multiple Regression

As in simple linear regression, different samples will produce different values of the OLS estimators in the multiple regression model. Again, this variation leads to uncertainty about those estimators, which we seek to describe using their sampling distribution(s). In short, if the assumptions made in Key Concept 25 hold, the large-sample distribution of \(\hat\beta_0,\hat\beta_1,\dots,\hat\beta_k\) is multivariate normal such that the individual estimators themselves are also normally distributed. The Key Concept below summarizes this.

Key Concept 26

Large-sample distribution of \(\hat\beta_0,\hat\beta_1,\dots,\hat\beta_k\)

If the least squares assumptions in the multiple regression model (see Key Concept 25) hold, then, in large samples, the OLS estimators \(\hat\beta_0,\hat\beta_1,\dots,\hat\beta_k\) are jointly normally distributed. We also say that their joint distribution is multivariate normal. Further, each \(\hat\beta_j\) is distributed as \(\mathcal{N}(\beta_j,\sigma_{\beta_j}^2)\).

Essentially, Key Concept 26 states that, if the sample size is large, we can approximate the individual sampling distributions of the coefficient estimators by specific normal distributions and their joint sampling distribution by a multivariate normal distribution.

How can we use R to get an idea of what the joint PDF of the coefficient estimators in multiple regression model looks like? When estimating a model on some data, we end up with a set of point estimates that do not reveal much information on the joint density of the estimators. However, with a large number of estimations using repeatedly randomly sampled data from the same population we can generate a large set of point estimates that allows us to plot an estimate of the joint density function.

The approach we will use to do this in R is as follows:

  • Generate \(10000\) random samples of size \(50\) using the DGP \[ Y_i = 5 + 2.5 \cdot X_{1i} + 3 \cdot X_{2i} + u_i \] where the regressors \(X_{1i}\) and \(X_{2i}\) are sampled for each observation as \[ X_i = (X_{1i}, X_{2i}) \sim \mathcal{N} \left[\begin{pmatrix} 50 \\ 100 \end{pmatrix}, \begin{pmatrix} 10 & 2.5 \\ 2.5 & 10 \end{pmatrix} \right] \] and \[ u_i \sim \mathcal{N}(0,5^2) \] is an error term with standard deviation \(5\).

  • For each of the \(10000\) simulated sets of sample data, we estimate the model \[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \] and save the coefficient estimates \(\hat\beta_1\) and \(\hat\beta_2\).

  • We compute a density estimate of the joint distribution of \(\hat\beta_1\) and \(\hat\beta_2\) in the model above using the function kde2d() from the package MASS, see ?kde2d. This estimate is then plotted using the function persp().

# load packages
library(MASS)
library(mvtnorm)
# set sample size
n <- 50
# initialize vector of coefficients
coefs <- cbind("hat_beta_1" = numeric(10000), "hat_beta_2" = numeric(10000))
# set seed for reproducibility
set.seed(1)
# loop sampling and estimation
for (i in 1:10000) {

  X <- rmvnorm(n, c(50, 100), sigma = cbind(c(10, 2.5), c(2.5, 10)))
  u <- rnorm(n, sd = 5)
  Y <- 5 + 2.5 * X[, 1] + 3 * X[, 2] + u
  coefs[i,] <- lm(Y ~ X[, 1] + X[, 2])$coefficients[-1]

}
# compute density estimate
kde <- kde2d(coefs[, 1], coefs[, 2])
# plot density estimate
persp(kde,
      theta = 310,
      phi = 30,
      xlab = "beta_1",
      ylab = "beta_2",
      zlab = "Est. Density")

From the plot above we can see that the density estimate has some similarity to a bivariate normal distribution, though it is somewhat rough. Furthermore, there is a correlation between the estimates, i.e., \(\rho\neq0\). Also, the distribution deviates from the circular symmetry of the bivariate standard normal distribution: its contours are elliptical instead.

# estimate the correlation between estimators
cor(coefs[, 1], coefs[, 2])
## [1] -0.2503028

Where does this correlation come from? Notice that, due to the way we generated the data, there is correlation between the regressors \(X_1\) and \(X_2\). Correlation between the regressors in a multiple regression model always translates into correlation between the estimators. In our case, the positive correlation between \(X_1\) and \(X_2\) translates to negative correlation between \(\hat\beta_1\) and \(\hat\beta_2\). To get a better idea of the distribution you can vary the point of view in the subsequent smooth interactive 3D plot of the same density estimate used for plotting with persp(). Here you can see that the shape of the distribution is somewhat stretched due to \(\rho=-0.25\), and it is also apparent that both estimators are unbiased since their joint density seems to be centered close to the true parameter vector \((\beta_1,\beta_2) = (2.5,3)\).

Hypothesis Tests and Confidence Intervals in Multiple Regression

We now discuss methods that allow quantification of the sampling uncertainty in the OLS estimator of the coefficients in multiple regression models. The basis for this are hypothesis tests and confidence intervals which, just as for the simple linear regression model, can be computed using basic R functions. We will also tackle the issue of testing joint hypotheses on these coefficients.

Hypothesis Tests and Confidence Intervals for a Single Coefficient

We first discuss how to compute standard errors, how to test hypotheses and how to construct confidence intervals for a single regression coefficient \(\beta_j\) in a multiple regression model. The basic idea is summarized in Key Concept 27.

Key Concept 27

Testing the Hypothesis \(\beta_j = \beta_{j,0}\) Against the Alternative \(\beta_j \neq \beta_{j,0}\)

  1. Compute the standard error of \(\hat{\beta_j}\)
  2. Compute the \(t\)-statistic, \[t^{act} = \frac{\hat{\beta}_j - \beta_{j,0}} {SE(\hat{\beta_j})}\]
  3. Compute the \(p\)-value, \[p\text{-value} = 2 \Phi(-|t^{act}|)\] where \(t^{act}\) is the value of the \(t\)-statistic actually computed. Reject the hypothesis at the \(5\%\) significance level if the \(p\)-value is less than \(0.05\) or, equivalently, if \(|t^{act}| > 1.96\). The standard error and (typically) the \(t\)-statistic and the corresponding \(p\)-value for testing \(\beta_j = 0\) are computed automatically by suitable R functions, e.g., by summary.

Testing a single hypothesis about the significance of a coefficient in the multiple regression model proceeds as in the simple regression model.

You can easily see this by inspecting the (heteroskedasticity-robust) coefficient summary of a regression of score on PDR and the income variable income contained in the data set,

\[ Score = \beta_0 + \beta_1 \times PDR + \beta_2 \times income + u. \]

Let us review this:

##
## t test of coefficients:
##
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 638.72916    7.30124  87.482  < 2e-16 ***
## PDR          -0.64874    0.35334  -1.836  0.06707 .
## income        1.83911    0.11473  16.029  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You may check that these quantities are computed as in the simple regression model by computing the \(t\)-statistics or \(p\)-values by hand using the output above and R as a calculator.

For example, using the definition of the \(p\)-value for a two-sided test as given in Key Concept 27, we can confirm the \(p\)-value reported above for a test of the hypothesis that \(\beta_1\), the coefficient on PDR, is zero.

# estimate the regression underlying the coefficient summary above
model <- lm(score ~ PDR + income, data = data)
# compute two-sided p-value
2 * (1 - pt(abs(coeftest(model, vcov. = vcovHC, type = "HC1")[2, 3]),
            df = model$df.residual))
## [1] 0.06706649

Key Concept 28

Confidence Intervals for a Single Coefficient in Multiple Regression

A \(95\%\) two-sided confidence interval for the coefficient \(\beta_j\) is an interval that contains the true value of \(\beta_j\) with a \(95 \%\) probability; that is, it contains the true value of \(\beta_j\) in \(95 \%\) of all repeated samples. Equivalently, it is the set of values of \(\beta_j\) that cannot be rejected by a \(5 \%\) two-sided hypothesis test. When the sample size is large, the \(95 \%\) confidence interval for \(\beta_j\) is \[\left[\hat{\beta_j}- 1.96 \times SE(\hat{\beta}_j), \hat{\beta_j} + 1.96 \times SE(\hat{\beta_j})\right].\]

Let us take a look at the hospital data regression again.

Computing confidence intervals for individual coefficients in the multiple regression model proceeds as in the simple regression model using the function confint().

model <- lm(score ~ PDR + inpatient, data = data)
confint(model)
##                   2.5 %      97.5 %
## (Intercept) 671.4640580 700.6004311
## PDR          -1.8487969  -0.3537944
## inpatient    -0.7271113  -0.5724424

To obtain confidence intervals at another level, say \(90\%\), just set the argument level in our call of confint() accordingly.

confint(model, level = 0.9)
##                     5 %        95 %
## (Intercept) 673.8145793 698.2499098
## PDR          -1.7281904  -0.4744009
## inpatient    -0.7146336  -0.5849200

The output now reports the desired \(90\%\) confidence intervals for all coefficients.

A disadvantage of confint() is that it does not use robust standard errors to compute the confidence interval. For large-sample confidence intervals, this is quickly done manually as follows.

# compute robust standard errors
rob_se <- diag(vcovHC(model, type = "HC1"))^0.5
# compute robust 95% confidence intervals
rbind("lower" = coef(model) - qnorm(0.975) * rob_se,
      "upper" = coef(model) + qnorm(0.975) * rob_se)
##       (Intercept)        PDR  inpatient
## lower    668.9252 -1.9496606 -0.7105980
## upper    703.1393 -0.2529307 -0.5889557
# compute robust 90% confidence intervals
rbind("lower" = coef(model) - qnorm(0.95) * rob_se,
      "upper" = coef(model) + qnorm(0.95) * rob_se)
##       (Intercept)        PDR  inpatient
## lower    671.6756 -1.8132659 -0.7008195
## upper    700.3889 -0.3893254 -0.5987341

Recall that the \(95\%\) confidence interval computed above does not tell us that the effect on hospital scores of a one-unit increase in the patient-doctor ratio lies between \(-1.9497\) and \(-0.2529\) with a probability of \(95\%\). Once a confidence interval has been computed, such a probabilistic statement is wrong: either the interval contains the true parameter or it does not. We do not know which is true.

Another Augmentation of the Model

What is the average effect on hospital scores of reducing the patient-doctor ratio when the expenditures per patient and the health of the inpatients are held constant?

Let us augment our model by an additional regressor that is a measure for expenditure per patient.

Our model now is \[ Score = \beta_0 + \beta_1 \times PDR + \beta_2 \times inpatient + \beta_3 \times expenditure + u \]

with \(expenditure\) the total amount of expenditure per patient at the hospital (in thousands of CHF).

Let us now estimate the model:
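
The summary below can be reproduced along the following lines (a sketch; the use of HC1 heteroskedasticity-robust standard errors via coeftest() is an assumption).

# estimate the augmented model and obtain a robust coefficient summary
model <- lm(score ~ PDR + inpatient + expenditure, data = data)
coeftest(model, vcov. = vcovHC, type = "HC1")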

##
## t test of coefficients:
##
##               Estimate Std. Error  t value Pr(>|t|)
## (Intercept) 649.577947  15.458344  42.0212  < 2e-16 ***
## PDR          -0.286399   0.482073  -0.5941  0.55277
## inpatient    -0.656023   0.031784 -20.6398  < 2e-16 ***
## expenditure   3.867901   1.580722   2.4469  0.01482 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimated effect of a one unit change in the patient-doctor ratio on hospital scores with expenditure and the health of inpatients held constant is \(-0.29\), which is rather small. What is more, the coefficient on \(PDR\) is not significantly different from zero anymore even at \(10\%\) since \(p\text{-value}=0.55\). Can you come up with an interpretation for these findings?

The insignificance of \(\hat\beta_1\) could be due to a larger standard error of \(\hat{\beta}_1\) resulting from adding \(expenditure\) to the model, so that we estimate the coefficient on \(PDR\) less precisely. This illustrates the issue of strongly correlated regressors (imperfect multicollinearity). The correlation between \(PDR\) and \(expenditure\) can be computed using cor().

# compute the sample correlation between 'PDR' and 'expenditure'
cor(data$PDR, data$expenditure)
## [1] -0.6199822

Altogether, we conclude that the new model provides no evidence that changing the patient-doctor ratio, e.g., by hiring new doctors, has any effect on the hospital scores while keeping expenditures per patient and the health of inpatients constant.

Joint Hypothesis Testing Using the F-Statistic

The estimated model is

\[ \widehat{Score} = \underset{(15.21)}{649.58} -\underset{(0.48)}{0.29} \times PDR - \underset{(0.04)}{0.66} \times inpatient + \underset{(1.41)}{3.87} \times expenditure. \]

Now, can we reject the hypothesis that the coefficient on \(PDR\) and the coefficient on \(expenditure\) are zero? To answer this, we have to resort to joint hypothesis tests. A joint hypothesis imposes restrictions on multiple regression coefficients. This is different from conducting individual \(t\)-tests where a restriction is imposed on a single coefficient.

The homoskedasticity-only \(F\)-Statistic is given by

\[ F = \frac{(SSR_{\text{restricted}} - SSR_{\text{unrestricted}})/q}{SSR_{\text{unrestricted}} / (n-k-1)} \]

with \(SSR_{restricted}\) being the sum of squared residuals from the restricted regression, i.e., the regression in which we impose the restrictions of the null hypothesis. \(SSR_{unrestricted}\) is the sum of squared residuals from the full model, \(q\) is the number of restrictions under the null and \(k\) is the number of regressors in the unrestricted regression.
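
Before turning to a convenience function, the formula can be evaluated by hand (a sketch using the models from this section; compare the result with the output of linearHypothesis() below).

# compute the homoskedasticity-only F-statistic by hand
model_unres <- lm(score ~ PDR + inpatient + expenditure, data = data)  # unrestricted
model_res <- lm(score ~ inpatient, data = data)       # restricted: PDR = expenditure = 0
SSR_unres <- sum(residuals(model_unres)^2)
SSR_res <- sum(residuals(model_res)^2)
q <- 2                                                # number of restrictions
((SSR_res - SSR_unres) / q) / (SSR_unres / model_unres$df.residual)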

It is fairly easy to conduct \(F\)-tests in R.

# estimate the multiple regression model
model <- lm(score ~ PDR + inpatient + expenditure, data = data)
linearHypothesis(model, c("PDR=0", "expenditure=0"))
## Linear hypothesis test
##
## Hypothesis:
## PDR = 0
## expenditure = 0
##
## Model 1: restricted model
## Model 2: score ~ PDR + inpatient + expenditure
##
##   Res.Df   RSS Df Sum of Sq      F   Pr(>F)
## 1    418 89000
## 2    416 85700  2    3300.3 8.0101 0.000386 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output reveals that the \(F\)-statistic for this joint hypothesis test is about \(8.01\) and the corresponding \(p\)-value is \(0.0004\). Thus, we can reject the null hypothesis that both coefficients are zero at any level of significance commonly used in practice.

A heteroskedasticity-robust version of this \(F\)-test (which leads to the same conclusion) can be conducted as follows.

# heteroskedasticity-robust F-test
linearHypothesis(model, c("PDR=0", "expenditure=0"), white.adjust = "hc1")
## Linear hypothesis test
##
## Hypothesis:
## PDR = 0
## expenditure = 0
##
## Model 1: restricted model
## Model 2: score ~ PDR + inpatient + expenditure
##
## Note: Coefficient covariance matrix supplied.
##
##   Res.Df Df      F   Pr(>F)
## 1    418
## 2    416  2 5.4337 0.004682 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The standard output of a model summary also reports an \(F\)-statistic and the corresponding \(p\)-value. The null hypothesis belonging to this \(F\)-test is that all of the population coefficients in the model except for the intercept are zero, so the hypotheses are \[H_0: \beta_1=0, \ \beta_2 =0, \ \beta_3 =0 \quad \text{vs.} \quad H_1: \beta_j \neq 0 \ \text{for at least one} \ j=1,2,3.\]

This is also called the overall regression \(F\)-statistic and the null hypothesis is obviously different from testing if only \(\beta_1\) and \(\beta_3\) are zero.

We now check whether the \(F\)-statistic belonging to the \(p\)-value listed in the model’s summary coincides with the result reported by linearHypothesis().

# execute the function on the model object and provide the restrictions
# to be tested as a character vector
linearHypothesis(model, c("PDR=0", "inpatient=0", "expenditure=0"))
## Linear hypothesis test
##
## Hypothesis:
## PDR = 0
## inpatient = 0
## expenditure = 0
##
## Model 1: restricted model
## Model 2: score ~ PDR + inpatient + expenditure
##
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
## 1    419 152110
## 2    416  85700  3     66410 107.45 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Access the overall F-statistic from the model's summary
summary(model)$fstatistic
##    value    numdf    dendf
## 107.4547   3.0000 416.0000

The entry value is the overall \(F\)-statistic, and it equals the result of linearHypothesis(). The \(F\)-test rejects the null hypothesis that the model has no power in explaining patient scores. It is important to know that the \(F\)-statistic reported by summary() is not robust to heteroskedasticity!
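
If a heteroskedasticity-robust version of the overall test is needed, it can be obtained with linearHypothesis() in the same way as before (a sketch reusing the syntax from above).

# heteroskedasticity-robust overall F-test
linearHypothesis(model,
                 c("PDR=0", "inpatient=0", "expenditure=0"),
                 white.adjust = "hc1")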

Confidence Sets for Multiple Coefficients

Based on the \(F\)-statistic that we have previously encountered, we can specify confidence sets. Confidence sets are analogous to confidence intervals for single coefficients. As such, confidence sets consist of combinations of coefficients that contain the true combination of coefficients in, say, \(95\%\) of all cases if we could repeatedly draw random samples, just like in the univariate case. Put differently, a confidence set is the set of all coefficient combinations for which we cannot reject the corresponding joint null hypothesis tested using an \(F\)-test.

The confidence set for two coefficients is an ellipse which is centered around the point defined by both coefficient estimates. Again, there is a very convenient way to plot the confidence set for two coefficients of model objects, namely the function confidenceEllipse() from the car package.

We now plot the \(95\%\) confidence ellipse for the coefficients on PDR and expenditure from the regression conducted above. By specifying the additional argument fill, the confidence set is colored.

# draw the 95% confidence set for coefficients on PDR and expenditure
confidenceEllipse(model,
                  fill = T,
                  lwd = 0,
                  which.coef = c("PDR", "expenditure"),
                  main = "95% Confidence Set")

We see that the ellipse is centered around \((-0.29, 3.87)\), the pair of coefficient estimates on \(PDR\) and \(expenditure\). What is more, \((0,0)\) is not an element of the \(95\%\) confidence set, so that we can reject \(H_0: \beta_1 = 0, \ \beta_3 = 0\).

By default, confidenceEllipse() uses homoskedasticity-only standard errors. The following code chunk shows how to compute a robust confidence ellipse and how to overlay it with the previous plot.

# draw the robust 95% confidence set for coefficients on PDR and expenditure
confidenceEllipse(model,
                  fill = T,
                  lwd = 0,
                  which.coef = c("PDR", "expenditure"),
                  main = "95% Confidence Sets",
                  vcov. = vcovHC(model, type = "HC1"),
                  col = "red")

# draw the 95% confidence set for coefficients on PDR and expenditure
confidenceEllipse(model,
                  fill = T,
                  lwd = 0,
                  which.coef = c("PDR", "expenditure"),
                  add = T)

As the robust standard errors are slightly larger than those valid under homoskedasticity only in this case, the robust confidence set is slightly larger. This is analogous to the confidence intervals for the individual coefficients.

Model Specification for Multiple Regression

Choosing a regression specification, i.e., selecting the variables to be included in a regression model, is a difficult task. However, there are some guidelines on how to proceed. The goal is clear: obtaining an unbiased and precise estimate of the causal effect of interest. As a starting point, think about potentially omitted variables, that is, avoid possible bias by using suitable control variables. Omitted variable bias in the context of multiple regression is explained in Key Concept 29. A second step could be to compare different specifications by measures of fit. However, as we shall see, one should not rely solely on \(\bar{R}^2\).

Key Concept 29

Omitted Variable Bias in Multiple Regression

Omitted variable bias is the bias in the OLS estimator that arises when regressors correlate with an omitted variable. For omitted variable bias to arise, two things must be true:

  1. At least one of the included regressors must be correlated with the omitted variable.

  2. The omitted variable must be a determinant of the dependent variable, \(Y\).

We now discuss an example where we face a potential omitted variable bias in a multiple regression model:

Consider again the estimated regression equation

\[ \widehat{Score} = \underset{(8.7)}{686.0} - \underset{(0.43)}{1.10} \times PDR - \underset{(0.031)}{0.650} \times inpatient. \]

We are interested in estimating the causal effect of PDR on hospital score. There might be a bias due to omitting “general patient welfare” from our regression since such a measure could be a determinant of the patients’ scores and could also be correlated with both regressors already included in the model (so that both conditions of Key Concept 29 are fulfilled). “General patient welfare” is a complicated concept that is difficult to quantify. A surrogate we can consider instead is the patients’ food choices. We thus augment the model with the variable food, the percentage of unhealthy food choices that patients make, and reestimate the model.
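
A sketch of how the augmented model can be estimated is given below (the robust variance estimator, HC1, used for the coefficient summary that follows is an assumption).

# augment the model with 'food' and obtain a robust coefficient summary
food.mod <- lm(score ~ PDR + inpatient + food, data = data)
coeftest(food.mod, vcov. = vcovHC, type = "HC1")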

##
## t test of coefficients:
##
##               Estimate Std. Error  t value  Pr(>|t|)
## (Intercept) 700.149957   5.568453 125.7351 < 2.2e-16 ***
## PDR          -0.998309   0.270080  -3.6963 0.0002480 ***
## inpatient    -0.121573   0.032832  -3.7029 0.0002418 ***
## food         -0.547345   0.024107 -22.7046 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Thus, the estimated regression line is

\[ \widehat{Score} = \underset{(5.56)}{700.15} - \underset{(0.27)}{1.00} \times PDR - \underset{(0.03)}{0.12} \times inpatient - \underset{(0.02)}{0.55} \times food. \]

We observe no substantial changes in the conclusion about the effect of \(PDR\) on \(Score\): the coefficient on \(PDR\) changes by only \(0.1\) and retains its significance.

Although the difference in estimated coefficients is not big in this case, it is useful to keep food to make the assumption of conditional mean independence more credible.

Model Specification in Theory and in Practice

Key Concept 30 lists some common pitfalls when using \(R^2\) and \(\bar{R}^2\) to evaluate the predictive ability of regression models.

Key Concept 30

\(R^2\) and \(\bar{R}^2\): what they tell you — and what they do not

The \(R^2\) and \(\bar{R}^2\) tell you whether the regressors are good at explaining the variation of the dependent variable in the sample.

If the \(R^2\) (or \(\bar{R}^2\)) is nearly \(1\), then the regressors produce a good prediction of the dependent variable in that sample, in the sense that the variance of OLS residuals is small compared to the variance of the dependent variable.

If the \(R^2\) (or \(\bar{R}^2\)) is nearly \(0\), the opposite is true.

The \(R^2\) and \(\bar{R}^2\) do not tell you whether:

1. An included variable is statistically significant.

2. The regressors are the true cause of the movements in the dependent variable.

3. There is omitted variable bias.

4. You have chosen the most appropriate set of regressors.

For example, think of regressing \(Score\) on \(PLS\) which measures the available parking lot space in thousand square feet. You are likely to observe a significant coefficient of reasonable magnitude and moderate to high values for \(R^2\) and \(\bar{R}^2\). The reason for this is that parking lot space is correlated with many determinants of the patient score like location, PDR, financial situation and so on. Although we do not have observations on \(PLS\), we can use R to generate some relatively realistic data.

# set seed for reproducibility
set.seed(1)
# generate observations for parking lot space
data$PLS <- c(22 * data$income
                   - 15 * data$PDR
                   + 0.2 * data$expenditure
                   + rnorm(nrow(data), sd = 80) + 3000)
# plot parking lot space against test score
plot(data$PLS,
     data$score,
     xlab = "Parking Lot Space",
     ylab = "Score",
     pch = 20,
     col = "steelblue")

# regress test score on PLS
summary(lm(score ~ PLS, data = data))
##
## Call:
## lm(formula = score ~ PLS, data = data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -42.372  -9.742   0.592  10.481  36.867
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.575e+02  1.171e+01   39.07   <2e-16 ***
## PLS         6.453e-02  3.836e-03   16.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.73 on 418 degrees of freedom
## Multiple R-squared:  0.4037, Adjusted R-squared:  0.4022
## F-statistic:   283 on 1 and 418 DF,  p-value: < 2.2e-16

\(PLS\) is generated as a linear function of \(expenditure\), \(income\), \(PDR\) and a random disturbance. Therefore the data suggest that there is some positive relationship between parking lot space and patient score. In fact, when estimating the model \[\begin{align} Score = \beta_0 + \beta_1 \times PLS + u \end{align}\] using lm() we find that the coefficient on \(PLS\) is positive and significantly different from zero. Also \(R^2\) and \(\bar{R}^2\) are about \(0.4\), which is a lot more than the roughly \(0.05\) observed when regressing the patient scores on the PDR only. This suggests that increasing the parking lot space boosts a hospital's patient scores and that this model does even better in explaining heterogeneity in the dependent variable than a model with \(PDR\) as the only regressor. Keeping in mind how \(PLS\) is constructed, this comes as no surprise. It is evident that the high \(R^2\) cannot be used to conclude that the estimated relation between parking lot space and patient scores is causal: the (relatively) high \(R^2\) is due to correlation between \(PLS\) and other determinants and/or control variables. Increasing parking lot space is not an appropriate measure to generate improved outcomes!

Nonlinear Regression Functions

Until now we assumed the regression function to be linear, i.e., we have treated the slope parameter of the regression function as a constant. This implies that the effect on \(Y\) of a one unit change in \(X\) does not depend on the level of \(X\). If, however, the effect of a change in \(X\) on \(Y\) does depend on the value of \(X\), we should use a nonlinear regression function.

Let us have a look at an example where using a nonlinear regression function is better suited for estimating the population relationship between the regressor, \(X\), and the regressand, \(Y\): the relationship between the income of hospital districts and their patient scores.

We start our analysis by computing the correlation between both variables.

cor(data$income, data$score)
## [1] 0.7124308

Here, income and scores are positively related: hospital districts with above average income tend to achieve above average patient scores. Does a linear regression function model the data adequately? Let us plot the data and add a linear regression line.

# fit a simple linear model
linear_model <- lm(score ~ income, data = data)
# plot the observations
plot(data$income, data$score,
     col = "steelblue",
     pch = 20,
     xlab = "District Income (thousands of CHF)",
     ylab = "Patient Score",
     cex.main = 0.9,
     main = "Patient Score vs. District Income and a Linear OLS Regression Function")
# add the regression line to the plot
abline(linear_model,
       col = "red",
       lwd = 2)

The linear regression line seems to overestimate the true relationship when income is very high or very low and underestimates it for the middle income group.

Fortunately, OLS does not only handle linear functions of the regressors. We can for example model patient scores as a function of income and the square of income. The corresponding regression model is

\[Score_i = \beta_0 + \beta_1 \times income_i + \beta_2 \times income_i^2 + u_i,\] called a quadratic regression model. That is, \(income^2\) is treated as an additional explanatory variable. Hence, the quadratic model is a special case of a multiple regression model. When fitting the model with lm() we have to use the ^ operator in conjunction with the function I() to add the quadratic term as an additional regressor to the argument formula. This is because the regression formula we pass to formula is converted to an object of the class formula. For objects of this class, the operators +, -, * and ^ have a non-arithmetic interpretation. I() ensures that they are used as arithmetical operators, see ?I.

# fit the quadratic Model
quadratic_model <- lm(score ~ income + I(income^2), data = data)
# obtain the model summary
coeftest(quadratic_model, vcov. = vcovHC, type = "HC1")
##
## t test of coefficients:
##
##                Estimate  Std. Error  t value  Pr(>|t|)
## (Intercept) 607.3017435   2.9017544 209.2878 < 2.2e-16 ***
## income        3.8509939   0.2680942  14.3643 < 2.2e-16 ***
## I(income^2)  -0.0423084   0.0047803  -8.8505 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output tells us that the estimated regression function is

\[\widehat{Score}_i = \underset{(2.90)}{607.3} + \underset{(0.27)}{3.85} \times income_i - \underset{(0.0048)}{0.0423} \times income_i^2.\]

This model allows us to test the hypothesis that the relationship between patient scores and district income is linear against the alternative that it is quadratic. This corresponds to testing

\[H_0: \beta_2 = 0 \ \ \text{vs.} \ \ H_1: \beta_2\neq0,\]

since \(\beta_2=0\) corresponds to a simple linear equation and \(\beta_2\neq0\) implies a quadratic relationship. We find that \(t=(\hat\beta_2 - 0)/SE(\hat\beta_2) = -0.0423/0.0048 = -8.81\) so the null is rejected at any common level of significance and we conclude that the relationship is nonlinear. This is consistent with the impression gained from the plot.
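As a quick check, the same \(t\)-statistic can be computed 'by hand' from the robust coefficient summary returned by coeftest() (a small sketch; robust_coefs is just a name chosen here):

# compute the robust t-statistic on income^2 manually
robust_coefs <- coeftest(quadratic_model, vcov. = vcovHC, type = "HC1")
robust_coefs["I(income^2)", "Estimate"] / robust_coefs["I(income^2)", "Std. Error"]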

We now draw the same scatterplot as for the linear model and add the regression line for the quadratic model. Because abline() can only draw straight lines, it cannot be used here. lines() is a function which allows us to draw non-straight lines, see ?lines. The most basic call of lines() is lines(x_values, y_values), where x_values and y_values are vectors of the same length that provide the coordinates of the points to be sequentially connected by a line. This makes it necessary to sort the coordinate pairs according to the \(X\)-values. Here we use the function order() to sort the fitted values of score according to the observations of income.

# draw a scatterplot of the observations for income and patient score
plot(data$income, data$score,
     col  = "steelblue",
     pch = 20,
     xlab = "District Income (thousands of dollars)",
     ylab = "Score",
     main = "Estimated Linear and Quadratic Regression Functions")
# add a linear function to the plot
abline(linear_model, col = "black", lwd = 2)
# add the quadratic function to the plot
order_id <- order(data$income)
lines(x = data$income[order_id],
      y = fitted(quadratic_model)[order_id],
      col = "red",
      lwd = 2) 

We see that the quadratic function does fit the data much better than the linear function.

Nonlinear Functions of a Single Independent Variable

Polynomials

The approach used to obtain a quadratic model can be generalized to polynomial models of arbitrary degree \(r\), \[Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \cdots + \beta_r X_i^r + u_i.\]

A cubic model for instance can be estimated in the same way as the quadratic model; we just have to use a polynomial of degree \(r=3\) in income. This is conveniently done using the function poly().

# estimate a cubic model
cubic_model <- lm(score ~ poly(income, degree = 3, raw = TRUE), data = data)

poly() generates orthogonal polynomials which are orthogonal to the constant by default. Here, we set raw = TRUE such that raw polynomials are evaluated, see ?poly.
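Whether raw or orthogonal polynomials are used does not affect the fit of the model, only the parametrization of the coefficients. A short sketch to verify this (the object name cubic_orth is chosen here for illustration):

# fitted values agree whether raw or orthogonal polynomials are used
cubic_orth <- lm(score ~ poly(income, degree = 3), data = data)
max(abs(fitted(cubic_model) - fitted(cubic_orth)))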

In practice the question will arise which polynomial order should be chosen. First, similarly as for \(r=2\), we can test the null hypothesis that the true relation is linear against the alternative hypothesis that the relationship is a polynomial of degree \(r\):

\[ H_0: \beta_2=0, \ \beta_3=0,\dots,\beta_r=0 \ \ \ \text{vs.} \ \ \ H_1: \text{at least one} \ \beta_j\neq0, \ j=2,\dots,r \]

This is a joint null hypothesis with \(r-1\) restrictions so it can be tested using the \(F\)-test. linearHypothesis() can be used to conduct such tests. For example, we may test the null of a linear model against the alternative of a polynomial of a maximal degree \(r=3\) as follows.

# test the hypothesis of a linear model against quadratic or polynomial
# alternatives
# set up hypothesis matrix
R <- rbind(c(0, 0, 1, 0),
            c(0, 0, 0, 1))
# do the test
linearHypothesis(cubic_model,
                 hypothesis.matrix = R,
                 white.adj = "hc1")
## Linear hypothesis test
##
## Hypothesis:
## poly(income, degree = 3, raw = TRUE)2 = 0
## poly(income, degree = 3, raw = TRUE)3 = 0
##
## Model 1: restricted model
## Model 2: score ~ poly(income, degree = 3, raw = TRUE)
##
## Note: Coefficient covariance matrix supplied.
##
##   Res.Df Df      F    Pr(>F)
## 1    418
## 2    416  2 37.691 9.043e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We provide a hypothesis matrix as the argument hypothesis.matrix. This is useful when the coefficients have long names, as is the case here due to using poly(), or when the restrictions include multiple coefficients. How the hypothesis matrix \(\mathbf{R}\) is interpreted by linearHypothesis() is best seen using matrix algebra:

For the two linear constraints above, we have \[\begin{align*} \mathbf{R}\boldsymbol{\beta} =& \mathbf{s} \\ \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \end{pmatrix} =& \begin{pmatrix} 0 \\ 0 \end{pmatrix} \\ \begin{pmatrix} \beta_2 \\ \beta_3 \end{pmatrix}= & \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \end{align*}\] linearHypothesis() uses the zero vector for \(\mathbf{s}\) by default, see ?linearHypothesis.
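As a small illustration, multiplying \(\mathbf{R}\) with the estimated coefficient vector shows which coefficients the restrictions pick out:

# R %*% beta_hat returns the coefficients restricted to zero under the null
R %*% coef(cubic_model)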

The \(p\)-value is very small so that we reject the null hypothesis. However, this does not tell us which \(r\) to choose. In practice, one approach to determine the degree of the polynomial is to use sequential testing:

  1. Estimate a polynomial model for some maximum value \(r\).
  2. Use a \(t\)-test to test \(\beta_r = 0\). Rejection of the null means that \(X^r\) belongs in the regression equation.
  3. If the null is not rejected in step 2, \(X^r\) can be eliminated from the model. Continue by repeating step 1 with order \(r-1\) and test whether \(\beta_{r-1}=0\). If the test rejects, use a polynomial model of order \(r-1\).
  4. If the test in step 3 does not reject either, continue reducing the order in this manner until the coefficient on the highest power is statistically significant.

There is no unambiguous guideline on how to choose \(r\) in step 1. However, small orders like \(2\), \(3\), or \(4\) are usually appropriate.
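The procedure can also be automated. The following sketch, which assumes a \(5\%\) significance level and a (somewhat arbitrary) maximum order of \(4\), repeatedly fits a polynomial model with robust standard errors and reduces the order until the coefficient on the highest power is significant:

# sketch: sequential testing of the polynomial order with robust t-tests
r <- 4                                     # assumed maximum order
repeat {
  poly_mod <- lm(score ~ poly(income, degree = r, raw = TRUE), data = data)
  rob <- coeftest(poly_mod, vcov. = vcovHC, type = "HC1")
  # p-value of the coefficient on the highest power
  p_highest <- rob[nrow(rob), "Pr(>|t|)"]
  if (p_highest < 0.05 || r == 1) break    # stop if significant (or order 1 reached)
  r <- r - 1
}
r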

We will demonstrate how to apply sequential testing by the example of the cubic model.

summary(cubic_model)
##
## Call:
## lm(formula = score ~ poly(income, degree = 3, raw = TRUE), data = data)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -44.28  -9.21   0.20   8.32  31.16
##
## Coefficients:
##                                         Estimate Std. Error t value
## (Intercept)                            6.001e+02  5.830e+00 102.937
## poly(income, degree = 3, raw = TRUE)1  5.019e+00  8.595e-01   5.839
## poly(income, degree = 3, raw = TRUE)2 -9.581e-02  3.736e-02  -2.564
## poly(income, degree = 3, raw = TRUE)3  6.855e-04  4.720e-04   1.452
##                                       Pr(>|t|)
## (Intercept)                            < 2e-16 ***
## poly(income, degree = 3, raw = TRUE)1 1.06e-08 ***
## poly(income, degree = 3, raw = TRUE)2   0.0107 *
## poly(income, degree = 3, raw = TRUE)3   0.1471
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.71 on 416 degrees of freedom
## Multiple R-squared:  0.5584, Adjusted R-squared:  0.5552
## F-statistic: 175.4 on 3 and 416 DF,  p-value: < 2.2e-16

The estimated cubic model stored in cubic_model is

\[ \widehat{Score}_i = \underset{(5.83)}{600.1} + \underset{(0.86)}{5.02} \times income - \underset{(0.04)}{0.096} \times income^2 + \underset{(0.00047)}{0.00069} \times income^3. \]

The \(t\)-statistic on \(income^3\) is about \(1.45\), so the null that the relationship is quadratic cannot be rejected, even at the \(10\%\) level, when homoskedasticity-only standard errors are used.

# test the hypothesis using robust standard errors
coeftest(cubic_model, vcov. = vcovHC, type = "HC1")
##
## t test of coefficients:
##
##                                          Estimate  Std. Error  t value
## (Intercept)                            6.0008e+02  5.1021e+00 117.6150
## poly(income, degree = 3, raw = TRUE)1  5.0187e+00  7.0735e-01   7.0950
## poly(income, degree = 3, raw = TRUE)2 -9.5805e-02  2.8954e-02  -3.3089
## poly(income, degree = 3, raw = TRUE)3  6.8549e-04  3.4706e-04   1.9751
##                                        Pr(>|t|)
## (Intercept)                           < 2.2e-16 ***
## poly(income, degree = 3, raw = TRUE)1 5.606e-12 ***
## poly(income, degree = 3, raw = TRUE)2  0.001018 **
## poly(income, degree = 3, raw = TRUE)3  0.048918 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The reported standard errors have changed. Furthermore, the coefficient for income^3 is now significant at the \(5\%\) level. This means we reject the hypothesis that the regression function is quadratic against the alternative that it is cubic. Furthermore, we can also test if the coefficients for income^2 and income^3 are jointly significant using a robust version of the \(F\)-test.

# perform robust F-test
linearHypothesis(cubic_model,
                 hypothesis.matrix = R,
                 vcov. = vcovHC, type = "HC1")
## Linear hypothesis test
##
## Hypothesis:
## poly(income, degree = 3, raw = TRUE)2 = 0
## poly(income, degree = 3, raw = TRUE)3 = 0
##
## Model 1: restricted model
## Model 2: score ~ poly(income, degree = 3, raw = TRUE)
##
## Note: Coefficient covariance matrix supplied.
##
##   Res.Df Df      F    Pr(>F)
## 1    418
## 2    416  2 29.678 8.945e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a \(p\)-value of \(8.945 \cdot 10^{-13}\), i.e., much less than \(0.05\), the null hypothesis of linearity is rejected in favor of the alternative that the relationship is quadratic or cubic.

Interpretation of Coefficients in Nonlinear Regression Models

The coefficients in polynomial regression do not have a simple interpretation. Why? Think of a quadratic model: it is not helpful to think of the coefficient on \(X\) as the expected change in \(Y\) associated with a change in \(X\) holding the other regressors constant because \(X^2\) changes as \(X\) varies. This is also the case for other deviations from linearity, for example in models where regressors and/or the dependent variable are log-transformed. A way to approach this is to calculate the estimated effect on \(Y\) associated with a change in \(X\) for one or more values of \(X\). This idea is summarized in Key Concept 31.

Key Concept 31

The Expected Effect on \(Y\) of a Change in \(X_1\) in a Nonlinear Regression Model

Consider the nonlinear population regression model \[ Y_i = f(X_{1i}, X_{2i}, \dots, X_{ki}) + u_i \ , \ i=1,\dots,n,\]

where \(f(X_{1i}, X_{2i}, \dots, X_{ki})\) is the population regression function and \(u_i\) is the error term.

Denote by \(\Delta Y\) the expected change in \(Y\) associated with \(\Delta X_1\), the change in \(X_1\) while holding \(X_2, \cdots , X_k\) constant. That is, the expected change in \(Y\) is the difference \[\Delta Y = f(X_1 + \Delta X_1, X_2, \cdots, X_k) - f(X_1, X_2, \cdots, X_k).\]

The estimator of this unknown population difference is the difference between the predicted values for these two cases.

Let \(\hat{f}(X_1, X_2, \cdots, X_k)\) be the predicted value of \(Y\) based on the estimator \(\hat{f}\) of the population regression function. Then the predicted change in \(Y\) is \[\Delta \widehat{Y} = \hat{f}(X_1 + \Delta X_1, X_2, \cdots, X_k) - \hat{f}(X_1, X_2, \cdots, X_k).\]

For example, we may ask the following: what is the predicted change in patient scores associated with a one unit change in income (i.e., \(1000\) CHF), based on the estimated quadratic regression function

\[\widehat{Score} = 607.3 + 3.85 \times income - 0.0423 \times income^2\ ?\]

Because the regression function is quadratic, this effect depends on the initial district income. We therefore consider two cases:

  1. An increase in district income from \(10\) to \(11\) (from \(10000\) CHF per capita to \(11000\) CHF).

  2. An increase in district income from \(40\) to \(41\) (that is, from \(40000\) CHF to \(41000\) CHF).

In order to obtain the \(\Delta \widehat{Y}\) associated with a change in income from \(10\) to \(11\), we use the following formula:

\[\Delta \widehat{Y} = \left(\hat{\beta}_0 + \hat{\beta}_1 \times 11 + \hat{\beta}_2 \times 11^2\right) - \left(\hat{\beta}_0 + \hat{\beta}_1 \times 10 + \hat{\beta}_2 \times 10^2\right) \] To compute the two predicted values \(\widehat{Y}\) using R we may use predict().

# compute and assign the quadratic model
quadratic_model <- lm(score ~ income + I(income^2), data = data)
# set up data for prediction
new_data <- data.frame(income = c(10, 11))
# do the prediction
Y_hat <- predict(quadratic_model, newdata = new_data)
# compute the difference
diff(Y_hat)
##        2
## 2.962517

Analogously we can compute the effect of a change in district income from \(40\) to \(41\):

# set up data for prediction
new_data <- data.frame(income = c(40, 41))
# do the prediction
Y_hat <- predict(quadratic_model, newdata = new_data)
# compute the difference
diff(Y_hat)
##         2
## 0.4240097

So for the quadratic model, the expected change in \(Score\) induced by an increase in \(income\) from \(10\) to \(11\) is about \(2.96\) points but an increase in \(income\) from \(40\) to \(41\) increases the predicted score by only \(0.42\). Hence, the slope of the estimated quadratic regression function is steeper at low levels of income than at higher levels.
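Equivalently, we can look at the derivative of the estimated quadratic function, \(\hat\beta_1 + 2\hat\beta_2 \times income\), which closely matches the discrete changes computed above (a short sketch; q_coefs is just a name chosen here):

# slope of the estimated quadratic function at income = 10 and income = 40
q_coefs <- coef(quadratic_model)
q_coefs[2] + 2 * q_coefs[3] * c(10, 40)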

Logarithms

Another way to specify a nonlinear regression function is to use the natural logarithm of \(Y\) and/or \(X\).

Logarithms convert changes in variables into percentage changes. This is convenient as many relationships are naturally expressed in terms of percentages.
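A quick numerical illustration of this property of the logarithm:

# the difference in logs approximates the relative change for small changes
log(101) - log(100)  # close to 0.01, i.e., a 1% change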

There are three different cases in which logarithms might be used.

  1. Transform \(X\) with its logarithm, but not \(Y\).

  2. Analogously, we could transform \(Y\) to its logarithm but leave \(X\) in levels.

  3. Both \(Y\) and \(X\) are transformed to their logarithms.

The interpretation of the regression coefficients is different in each case.

Case I: \(X\) is in Logarithm, \(Y\) is not.

The regression model then is \[Y_i = \beta_0 + \beta_1 \times \ln(X_i) + u_i \text{, } i=1,...,n. \] Similarly to polynomial regression, we do not have to create a new variable before using lm(). We can simply adjust the formula argument of lm() to tell R that the log-transformation of a variable should be used.

# estimate a level-log model
LinearLog_model <- lm(score ~ log(income), data = data)
# compute robust summary
coeftest(LinearLog_model,
         vcov = vcovHC, type = "HC1")
##
## t test of coefficients:
##
##             Estimate Std. Error t value  Pr(>|t|)
## (Intercept) 557.8323     3.8399 145.271 < 2.2e-16 ***
## log(income)  36.4197     1.3969  26.071 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Hence, the estimated regression function is

\[\widehat{Score} = 557.8 + 36.42 \times \ln(income).\]

Let us draw a plot of this function.

# draw a scatterplot
plot(score ~ income,
     col = "steelblue",
     pch = 20,
     data = data,
     main = "Linear-Log Regression Line")
# add the linear-log regression line
order_id  <- order(data$income)
lines(data$income[order_id],
      fitted(LinearLog_model)[order_id],
      col = "red",
      lwd = 2)

We can interpret \(\hat{\beta}_1\) as follows: a \(1\%\) increase in income is associated with an increase in patient scores of \(0.01 \times 36.42 = 0.36\) points. To obtain the estimated effect of a one unit change in income (that is, a change in the original units, thousands of CHF) on patient scores, we can again compare predicted values, as sketched below.
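A minimal sketch of this computation, reusing predict() as for the quadratic model (the income values 10 and 11 are chosen only for illustration):

# predicted change in score for an increase in income from 10 to 11 (thousands of CHF)
new_data <- data.frame(income = c(10, 11))
diff(predict(LinearLog_model, newdata = new_data))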

Case II: \(Y\) is in Logarithm, \(X\) is not

There are cases where it is useful to model the logarithm of the dependent variable, i.e., to regress \(\ln(Y)\) on \(X\).

The corresponding regression model then is

\[ \ln(Y_i) = \beta_0 + \beta_1 \times X_i + u_i , \ \ i=1,...,n. \]

# estimate a log-linear model
LogLinear_model <- lm(log(score) ~ income, data = data)
# obtain a robust coefficient summary
coeftest(LogLinear_model,
         vcov = vcovHC, type = "HC1")
##
## t test of coefficients:
##
##               Estimate Std. Error  t value  Pr(>|t|)
## (Intercept) 6.43936234 0.00289382 2225.210 < 2.2e-16 ***
## income      0.00284407 0.00017509   16.244 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimated regression function is \[\widehat{\ln(Score)} = 6.439 + 0.00284 \times income.\] An increase in district income by \(1000\) CHF (one unit) is expected to increase patient scores by \(100\times 0.00284 \% = 0.284\%\).

When the dependent variable is in logarithms, one cannot simply use \(e^{(\cdot)}\) to transform predicted values of \(\ln(Y)\) back to the original scale: in general, \(E(Y\vert X) \neq e^{E(\ln Y\vert X)}\).

Case III: \(X\) and \(Y\) are in Logarithms

The log-log regression model is \[\ln(Y_i) = \beta_0 + \beta_1 \times \ln(X_i) + u_i, \ \ i=1,...,n.\]

# estimate the log-log model
LogLog_model <- lm(log(score) ~ log(income), data = data)
# print robust coefficient summary to the console
coeftest(LogLog_model,
         vcov = vcovHC, type = "HC1")
##
## t test of coefficients:
##
##              Estimate Std. Error  t value  Pr(>|t|)
## (Intercept) 6.3363494  0.0059246 1069.501 < 2.2e-16 ***
## log(income) 0.0554190  0.0021446   25.841 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimated regression function hence is \[\widehat{\ln(Score)} = 6.336 + 0.0554 \times \ln(income).\] In a log-log model, a \(1\%\) change in \(X\) is associated with an estimated \(\hat\beta_1 \%\) change in \(Y\).

We now plot the fitted log-linear and log-log regression functions estimated above, together with the observations on \(\ln(Score)\) and income.

# generate a scatterplot
plot(log(score) ~ income,
     col = "steelblue",
     pch = 20,
     data = data,
     main = "Log-Linear Regression Function")
# add the log-linear regression line
order_id  <- order(data$income)
lines(data$income[order_id],
      fitted(LogLinear_model)[order_id],
      col = "red",
      lwd = 2)
# add the log-log regression line
lines(sort(data$income),
      fitted(LogLog_model)[order(data$income)],
      col = "green",
      lwd = 2)
# add a legend
legend("bottomright",
       legend = c("log-linear model", "log-log model"),
       lwd = 2,
       col = c("red", "green"))

Key Concept 32 summarizes the three logarithmic regression models.

Key Concept 32

Logarithms in Regression: Three Cases

Logarithms can be used to transform the dependent variable \(Y\) or the independent variable \(X\), or both (the variable being transformed must be positive). The following table summarizes these three cases and the interpretation of the regression coefficient \(\beta_1\). In each case, \(\beta_1\) can be estimated by applying OLS after taking the logarithm(s) of the dependent and/or the independent variable.

Case \((I)\): \(Y_i = \beta_0 + \beta_1 \ln(X_i) + u_i\). Interpretation of \(\beta_1\): a \(1 \%\) change in \(X\) is associated with a change in \(Y\) of \(0.01 \times \beta_1\).
Case \((II)\): \(\ln(Y_i) = \beta_0 + \beta_1 X_i + u_i\). Interpretation of \(\beta_1\): a change in \(X\) by one unit (\(\Delta X = 1\)) is associated with a \(100 \times \beta_1 \%\) change in \(Y\).
Case \((III)\): \(\ln(Y_i) = \beta_0 + \beta_1 \ln(X_i) + u_i\). Interpretation of \(\beta_1\): a \(1\%\) change in \(X\) is associated with a \(\beta_1\%\) change in \(Y\), so \(\beta_1\) is the elasticity of \(Y\) with respect to \(X\).

Interactions Between Independent Variables

There are research questions where it is interesting to learn how the effect on \(Y\) of a change in an independent variable depends on the value of another independent variable. For example, we may ask whether hospitals whose inpatients are relatively unhealthy benefit differently from a change in the patient-doctor ratio than hospitals with mostly healthy inpatients. To assess this using a multiple regression model, we include an interaction term. We consider three cases:

  1. Interactions between two binary variables.

  2. Interactions between a binary and a continuous variable.

  3. Interactions between two continuous variables.

The following subsections discuss these cases briefly and demonstrate how to perform such regressions in R.

Interactions Between Two Binary Variables

Take two binary variables \(D_1\) and \(D_2\) and the population regression model

\[ Y_i = \beta_0 + \beta_1 \times D_{1i} + \beta_2 \times D_{2i} + u_i. \]

Now assume that

\[\begin{align*} Y_i=& \, \ln(Blood pressure_i),\\ D_{1i} =& \, \begin{cases} 1 & \text{if $i^{th}$ person smokes,} \\ 0 & \text{else}. \end{cases} \\ D_{2i} =& \, \begin{cases} 1 & \text{if $i^{th}$ person is female,} \\ 0 & \text{if $i^{th}$ person is male}. \end{cases} \end{align*}\]

We know that \(\beta_1\) measures the average difference in \(\ln(Blood pressure)\) between individuals that smoke and those that do not, and \(\beta_2\) is the gender differential in \(\ln(Blood pressure)\). This model does not allow us to determine if there is a gender-specific effect of smoking and, if so, how strong this effect is. It is easy to come up with a model specification that allows us to investigate this:

\[ Y_i = \beta_0 + \beta_1 \times D_{1i} + \beta_2 \times D_{2i} + \beta_3 \times (D_{1i} \times D_{2i}) + u_i \]

\((D_{1i} \times D_{2i})\) is called an interaction term and \(\beta_3\) measures the difference in the effect of smoking for women versus men.

Key Concept 33

A Method for Interpreting Coefficients in Regression with Binary Variables

Compute expected values of \(Y\) for each possible case described by the binary variables and compare these expected values. The coefficients can be expressed either as expected values or as the difference between at least two expected values.

Following Key Concept 33 we have

\[\begin{align*} E(Y_i\vert D_{1i}=0, D_{2i} = d_2) =& \, \beta_0 + \beta_1 \times 0 + \beta_2 \times d_2 + \beta_3 \times (0 \times d_2) \\ =& \, \beta_0 + \beta_2 \times d_2. \end{align*}\]

If \(D_{1i}\) switches from \(0\) to \(1\) we obtain

\[\begin{align*} E(Y_i\vert D_{1i}=1, D_{2i} = d_2) =& \, \beta_0 + \beta_1 \times 1 + \beta_2 \times d_2 + \beta_3 \times (1 \times d_2) \\ =& \, \beta_0 + \beta_1 + \beta_2 \times d_2 + \beta_3 \times d_2. \end{align*}\]

Hence, the overall effect is

\[ E(Y_i\vert D_{1i}=1, D_{2i} = d_2) - E(Y_i\vert D_{1i}=0, D_{2i} = d_2) = \beta_1 + \beta_3 \times d_2 \] so the effect is a difference of expected values.

Application to Hospital Data

Now let

\[\begin{align*} HiPDR =& \, \begin{cases} 1, & \text{if $PDR \geq 20$} \\ 0, & \text{else}, \end{cases} \\ \\ HiInp =& \, \begin{cases} 1, & \text{if $inpatient \geq 10$} \\ 0, & \text{else}. \end{cases} \end{align*}\]

We may use R to construct the variables above as follows. Remember that high values of PDR mean more patients per doctor, and high values of inpatient mean that patients come into the hospital with a more complex medical history.

# append HiPDR to data
data$HiPDR <- as.numeric(data$PDR >= 20)
# append HiInp to data
data$HiInp <- as.numeric(data$inpatient >= 10)

We proceed by estimating the model

\[\begin{align} Score_i = \beta_0 + \beta_1 \times HiPDR_i + \beta_2 \times HiInp_i + \beta_3 \times (HiPDR_i \times HiInp_i) + u_i. \end{align}\]

There are several ways to add the interaction term to the formula argument when using lm() but the most intuitive way is to use HiInp * HiPDR. Note that appending HiInp * HiPDR to the formula will add HiInp, HiPDR and their interaction as regressors while HiInp:HiPDR only adds the interaction term.
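The following sketch illustrates the difference between the two operators by comparing the coefficient names of the resulting model objects:

# '*' expands to both main effects plus the interaction ...
names(coef(lm(score ~ HiPDR * HiInp, data = data)))
# ... whereas ':' adds only the interaction term
names(coef(lm(score ~ HiPDR:HiInp, data = data)))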

# estimate the model with a binary interaction term
bi_model <- lm(score ~ HiPDR * HiInp, data = data)
# print a robust summary of the coefficients
coeftest(bi_model, vcov. = vcovHC, type = "HC1")
##
## t test of coefficients:
##
##             Estimate Std. Error  t value  Pr(>|t|)
## (Intercept) 664.1433     1.3881 478.4589 < 2.2e-16 ***
## HiPDR        -1.9078     1.9322  -0.9874    0.3240
## HiInp       -18.3155     2.3340  -7.8472 3.634e-14 ***
## HiPDR:HiInp  -3.2601     3.1189  -1.0453    0.2965
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimated regression model is \[\widehat{Score} = \underset{(1.39)}{664.1} - \underset{(1.93)}{1.9} \times HiPDR - \underset{(2.33)}{18.3} \times HiInp - \underset{(3.12)}{3.3} \times (HiPDR \times HiInp)\] and it predicts that the effect of moving from a hospital with a low patient-doctor ratio to one with a high patient-doctor ratio is \(-1.9-3.3\times HiInp\) points, so it depends on inpatient health. For hospitals with mostly healthy inpatients (\(HiInp = 0\)), the estimated effect is a decrease of \(1.9\) points in patient scores, while for hospitals whose patients have more complex medical histories (\(HiInp = 1\)), the predicted decrease in patient scores amounts to \(1.9 + 3.3 = 5.2\) points.

We can also use the model to estimate the mean patient score for each possible combination of the included binary variables.

# estimate means for all combinations of HiPDR and HiInp
# 1.
predict(bi_model, newdata = data.frame("HiPDR" = 0, "HiInp" = 0))
##        1
## 664.1433
# 2.
predict(bi_model, newdata = data.frame("HiPDR" = 0, "HiInp" = 1))
##        1
## 645.8278
# 3.
predict(bi_model, newdata = data.frame("HiPDR" = 1, "HiInp" = 0))
##        1
## 662.2354
# 4.
predict(bi_model, newdata = data.frame("HiPDR" = 1, "HiInp" = 1))
##        1
## 640.6598

We now verify that these predictions correspond to sums of the coefficient estimates of the model estimated above:

\[\begin{align*} \widehat{Score} = \hat\beta_0 = 664.1 \quad &\Leftrightarrow \quad HiPDR = 0, \, HiInp = 0\\ \widehat{Score} = \hat\beta_0 + \hat\beta_2 = 664.1 - 18.3 = 645.8 \quad &\Leftrightarrow \quad HiPDR = 0, \, HiInp = 1\\ \widehat{Score} = \hat\beta_0 + \hat\beta_1 = 664.1 - 1.9 = 662.2 \quad &\Leftrightarrow \quad HiPDR = 1, \, HiInp = 0\\ \widehat{Score} = \hat\beta_0 + \hat\beta_1 + \hat\beta_2 + \hat\beta_3 = 664.1 - 1.9 - 18.3 - 3.3 = 640.6 \quad &\Leftrightarrow \quad HiPDR = 1, \, HiInp = 1 \end{align*}\]
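These sums of coefficients can also be computed directly in R (a short sketch; bi_coefs is just a name chosen here):

# reproduce the four predicted means from the coefficient estimates
bi_coefs <- coef(bi_model)
c(bi_coefs[1],                    # HiPDR = 0, HiInp = 0
  bi_coefs[1] + bi_coefs[3],      # HiPDR = 0, HiInp = 1
  bi_coefs[1] + bi_coefs[2],      # HiPDR = 1, HiInp = 0
  sum(bi_coefs))                  # HiPDR = 1, HiInp = 1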

Interactions Between a Continuous and a Binary Variable

Now let \(X_i\) denote the hours of exercise per week of person \(i\), which is a continuous variable. We have \[\begin{align*} Y_i =& \, \ln(Blood pressure_i), \\ \\ X_i =& \, \text{hours of exercise per week}_i, \\ \\ D_i =& \, \begin{cases} 1, & \text{if $i^{th}$ person smokes} \\ 0, & \text{else}. \end{cases} \end{align*}\]

The baseline model thus is

\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + u_i, \]

a multiple regression model that allows us to estimate the average effect of smoking holding weekly exercise constant as well as the average effect on blood pressure of a change in weekly exercise holding smoking constant.

By adding the interaction term \(X_i \times D_i\) we allow the effect of an additional hour of exercise per week to differ between individuals who smoke and those that do not,

\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 (X_i \times D_i) + u_i. \]

Here, \(\beta_3\) is the expected difference in the effect of an additional hour of exercise per week for smokers versus non-smokers. Another possible specification is

\[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 (X_i \times D_i) + u_i. \]

This model states that the expected impact of an additional hour of exercise on blood pressure differs for smokers and non-smokers, but that smoking itself does not shift the level of log blood pressure (there is no separate intercept term for smokers).

All three regression functions can be visualized by straight lines. Key Concept 34 summarizes the differences.

Key Concept 34

Interactions Between Binary and Continuous Variables

An interaction term like \(X_i \times D_i\) (where \(X_i\) is continuous and \(D_i\) is binary) allows for the slope to depend on the binary variable \(D_i\).

There are three possibilities:

1. Different intercept and same slope: \[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + u_i \]

2. Different intercept and different slope: \[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 D_i + \beta_3 \times (X_i \times D_i) + u_i \]

3. Same intercept and different slope: \[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 (X_i \times D_i) + u_i \]

The following code chunk demonstrates how to show this using artificial data.

# generate artificial data
set.seed(1)
X <- runif(200,0, 15)
D <- sample(0:1, 200, replace = T)
Y <- 450 +  150 * X + 500 * D + 50 * (X * D) + rnorm(200, sd = 300)
# divide plotting area accordingly
m <- rbind(c(1, 2), c(3, 0))
graphics::layout(m)
# estimate the models and plot the regression lines
# 1. (baseline model)
plot(X, log(Y),
     pch = 20,
     col = "steelblue",
     main = "Different Intercepts, Same Slope")
mod1_coef <- lm(log(Y) ~ X + D)$coefficients
abline(coef = c(mod1_coef[1], mod1_coef[2]),
       col = "red",
       lwd = 1.5)
abline(coef = c(mod1_coef[1] + mod1_coef[3], mod1_coef[2]),
       col = "green",
       lwd = 1.5)

# 2. (baseline model + interaction term)
plot(X, log(Y),
     pch = 20,
     col = "steelblue",
     main = "Different Intercepts, Different Slopes")
mod2_coef <- lm(log(Y) ~ X + D + X:D)$coefficients
abline(coef = c(mod2_coef[1], mod2_coef[2]),
       col = "red",
       lwd = 1.5)
abline(coef = c(mod2_coef[1] + mod2_coef[3], mod2_coef[2] + mod2_coef[4]),
       col = "green",
       lwd = 1.5)
# 3. (omission of D as regressor + interaction term)
plot(X, log(Y),
     pch = 20,
     col = "steelblue",
     main = "Same Intercept, Different Slopes")
mod3_coef <- lm(log(Y) ~ X + X:D)$coefficients
abline(coef = c(mod3_coef[1], mod3_coef[2]),
       col = "red",
       lwd = 1.5)
abline(coef = c(mod3_coef[1], mod3_coef[2] + mod3_coef[3]),
       col = "green",
       lwd = 1.5)

Application to Hospital Data

Using a model specification like the second one discussed in Key Concept 34 (different slope, different intercept) we may answer the question whether the effect on patient scores of decreasing the patient-doctor ratio depends on whether there are unhealthy or healthy inpatients. We estimate the regression model

\[ Score_i = \beta_0 + \beta_1 \times PDR_i + \beta_2 \times HiInp_i + \beta_3 \times (PDR_i \times HiInp_i) + u_i. \]

# estimate the model
bci_model <- lm(score ~ PDR + HiInp + PDR * HiInp, data = data)
# print robust summary of coefficients to the console
coeftest(bci_model, vcov. = vcovHC, type = "HC1")
##
## t test of coefficients:
##
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 682.24584   11.86781 57.4871   <2e-16 ***
## PDR          -0.96846    0.58910 -1.6440   0.1009
## HiInp         5.63914   19.51456  0.2890   0.7727
## PDR:HiInp    -1.27661    0.96692 -1.3203   0.1875
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimated regression model is \[ \widehat{Score} = \underset{(11.87)}{682.2} - \underset{(0.59)}{0.97} \times PDR + \underset{(19.51)}{5.6} \times HiInp - \underset{(0.97)}{1.28} \times (PDR \times HiInp). \] The estimated regression line for hospitals with healthy inpatients (\(HiInp_i=0\)) is \[ \widehat{Score} = 682.2 - 0.97\times PDR_i. \]

For hospitals with less healthy inpatients (\(HiInp_i = 1\)) we have

\[\begin{align*} \widehat{Score} =& \, 682.2 + 5.6 - 0.97\times PDR_i - 1.28 \times PDR_i \\ =& \, 687.8 - 2.25 \times PDR_i. \end{align*}\]

The predicted increase in patient scores following a reduction of the patient-doctor ratio by \(1\) unit is about \(0.97\) points in hospitals with healthy inpatients but \(2.25\) in hospitals with unhealthy inpatients. From the coefficient on the interaction term \(PDR \times HiInp\) we see that the difference between both effects is \(1.28\) points.
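The two slopes and their difference can be read off the estimated coefficients directly (a short sketch):

# slope in PDR for hospitals with HiInp = 0 and HiInp = 1
coef(bci_model)["PDR"]
coef(bci_model)["PDR"] + coef(bci_model)["PDR:HiInp"]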

The next code chunk draws both lines belonging to the model. In order to make observations with \(HiInp = 0\) distinguishable from those with \(HiInp = 1\), we use different colors.

# identify observations with inpatient >= 10
id <- data$inpatient >= 10
# plot observations with HiInp = 0 as red dots
plot(data$PDR[!id], data$score[!id],
     xlim = c(0, 27),
     ylim = c(600, 720),
     pch = 20,
     col = "red",
     main = "",
     xlab = "Class Size",
     ylab = "Test Score")
# plot observations with HiEL = 1 as green dots
points(data$PDR[id], data$score[id],
     pch = 20,
     col = "green")
# read out estimated coefficients of bci_model
coefs <- bci_model$coefficients
# draw the estimated regression line for HiInp = 0
abline(coef = c(coefs[1], coefs[2]),
       col = "red",
       lwd = 1.5)
# draw the estimated regression line for HiInp = 1
abline(coef = c(coefs[1] + coefs[3], coefs[2] + coefs[4]),
       col = "green",
       lwd = 1.5 )
# add a legend to the plot
legend("topright",
       pch = c(20, 20),
       col = c("red", "green"),
       legend = c("HiInp = 0", "HiInp = 1"))

Interactions Between Two Continuous Variables

Consider a regression model with \(Y\) the log blood pressure and two continuous regressors \(X_1\), the hours of exercise per week, and \(X_2\), the pack years of smoking. We want to estimate the effect on blood pressure of an additional hour of exercise per week depending on a given level of smoking. This effect can be assessed by including the interaction term \((X_{1i} \times X_{2i})\) in the model:

\[ Y_i = \beta_0 + \beta_1 \times X_{1i} + \beta_2 \times X_{2i} + \beta_3 \times (X_{1i} \times X_{2i}) + u_i \]

Following the logic of Key Concept 31, we find that the effect on \(Y\) of a change in \(X_1\), for a given value of \(X_2\), is \[ \frac{\Delta Y}{\Delta X_1} = \beta_1 + \beta_3 X_2. \]

In the blood pressure example, a negative \(\beta_3\) implies that the effect on log blood pressure of an additional hour of exercise per week declines linearly with pack years of smoking. Vice versa we have \[ \frac{\Delta Y}{\Delta X_2} = \beta_2 + \beta_3 X_1 \] as the effect on log blood pressure of an additional pack year of smoking holding exercise constant.

Altogether we find that \(\beta_3\) measures the effect of a unit increase in \(X_1\) and \(X_2\) beyond the effects of increasing \(X_1\) alone and \(X_2\) alone by one unit. The overall change in \(Y\) is thus

\[\begin{align} \Delta Y = (\beta_1 + \beta_3 X_2) \Delta X_1 + (\beta_2 + \beta_3 X_1) \Delta X_2 + \beta_3\Delta X_1 \Delta X_2. \end{align}\]

Key Concept 35 summarizes interactions between two regressors in multiple regression.

Key Concept 35

Interactions in Multiple Regression

The interaction term between the two regressors \(X_1\) and \(X_2\) is given by their product \(X_1 \times X_2\). Adding this interaction term as a regressor to the model \[ Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + u_i \] allows the effect on \(Y\) of a change in \(X_2\) to depend on the value of \(X_1\) and vice versa. Thus the coefficient \(\beta_3\) in the model \[ Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \times X_2) + u_i \] measures the effect of a one-unit increase in both \(X_1\) and \(X_2\) above and beyond the sum of both individual effects. This holds for continuous and binary regressors.

Application to Hospital Data

We now examine the interaction between the continuous variables PDR and inpatient health.

# estimate regression model including the interaction between 'inpatient' and 'PDR'
cci_model <- lm(score ~ PDR + inpatient + inpatient * PDR, data = data)
# print a summary to the console
coeftest(cci_model, vcov. = vcovHC, type = "HC1")
##
## t test of coefficients:
##
##                  Estimate  Std. Error t value Pr(>|t|)
## (Intercept)   686.3385268  11.7593466 58.3654  < 2e-16 ***
## PDR            -1.1170184   0.5875136 -1.9013  0.05796 .
## inpatient      -0.6729119   0.3741231 -1.7986  0.07280 .
## PDR:inpatient   0.0011618   0.0185357  0.0627  0.95005
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimated model equation is \[ \widehat{Score} = \underset{(11.76)}{686.3} - \underset{(0.59)}{1.12} \times PDR - \underset{(0.37)}{0.67} \times inpatient + \underset{(0.02)}{0.0012} \times (PDR\times inpatient). \]

For the interpretation, let us consider the quartiles of \(inpatient\).

summary(data$inpatient)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   1.941   8.778  15.768  22.970  85.540

If \(inpatient\) is at its median value of \(8.78\), the slope of the regression function relating patient scores and the patient-doctor ratio is estimated to be \(-1.12 + 0.0012 \times 8.78 = -1.11\). That is, increasing the patient-doctor ratio by one unit is expected to reduce patient scores by about \(1.11\) points. For the \(75\%\) quantile of \(inpatient\), the estimated effect on \(Score\) of a one-unit increase in \(PDR\) is \(-1.12 + 0.0012 \times 23.0 = -1.09\), so the slope is slightly smaller in magnitude. The interpretation is that for a hospital with a relatively poor inpatient health measure of \(23\), a reduction of the patient-doctor ratio by one unit is expected to increase patient scores by only about \(1.09\) points.
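A short sketch of this computation using the estimated coefficients (cci_coefs is just a name chosen here):

# slope in PDR at the median and the 75% quantile of 'inpatient'
cci_coefs <- coef(cci_model)
cci_coefs["PDR"] + cci_coefs["PDR:inpatient"] * quantile(data$inpatient, c(0.5, 0.75))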

However, the output of coeftest() indicates that this difference is not statistically significant: \(H_0: \beta_3 = 0\) cannot be rejected at the \(5\%\) level of significance (the \(p\)-value is \(0.95\)).

Regression with a Binary Dependent Variable

We now discuss a special class of regression models that aim to explain a limited dependent variable. In particular, we consider models where the dependent variable is binary. We will see that in such models, the regression function can be interpreted as a conditional probability function of the binary dependent variable.

We review the linear probability model as well as Probit and Logit regression. Of course, we will also see how to estimate these models using R and discuss an application.

Binary Dependent Variables and the Linear Probability Model

Key Concept 36

The Linear Probability Model

The linear regression model \[Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_k X_{ki} + u_i\] with a binary dependent variable \(Y_i\) is called the linear probability model. In the linear probability model we have \[E(Y\vert X_1,X_2,\dots,X_k) = P(Y=1\vert X_1, X_2,\dots, X_k)\] where \[ P(Y = 1 \vert X_1, X_2, \dots, X_k) = \beta_0 + \beta_1 X_{1} + \beta_2 X_{2} + \dots + \beta_k X_{k}.\] Thus, \(\beta_j\) can be interpreted as the change in the probability that \(Y_i=1\) associated with a one unit change in \(X_j\), holding constant the other \(k-1\) regressors. Just as in common multiple regression, the \(\beta_j\) can be estimated using OLS and the robust standard error formulas can be used for hypothesis testing and computation of confidence intervals. In most linear probability models, \(R^2\) has no meaningful interpretation since the regression line can never fit the data perfectly if the dependent variable is binary and the regressors are continuous. This can be seen in the application below. It is essential to use robust standard errors since the \(u_i\) in a linear probability model are always heteroskedastic. Linear probability models are easily estimated in R using the function lm().

Case/control Data

We start by loading the data set case_control.txt which provides data that relate to type-2 diabetes cases in Lausanne reported to the CHUV last year.

data <- read.table("~/Desktop/case_control.txt", header=T)
head(data)
##   case   WHR    sex
## 1    0 0.221 female
## 2    0 0.265 female
## 3    0 0.372 female
## 4    0 0.320 female
## 5    0 0.360 female
## 6    0 0.240 female

The variable we are interested in modelling is case, an indicator for whether an individual has type-2 diabetes (case = 1) or not (case = 0). A regressor that ought to have power in explaining whether an individual has T2D or not is WHR, the waist-to-hip ratio of the individual. It is straightforward to translate this into the simple regression model

\[\begin{align} case = \beta_0 + \beta_1 \times WHR + u. \end{align}\]

We estimate this model just as any other linear regression model using lm().

# estimate a simple linear probability model
mod1 <- lm(case ~ WHR, data = data)
mod1
##
## Call:
## lm(formula = case ~ WHR, data = data)
##
## Coefficients:
## (Intercept)          WHR
##    -0.07991      0.60353

Next, we plot the data and the regression line.

# plot the data
plot(x = data$WHR,
     y = data$case,
     main = "Case-control status and Waist-Hip ratio",
     xlab = "WH ratio",
     ylab = "Case-control",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.8)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Case")
text(2.5, -0.1, cex= 0.8, "Control")
# add the estimated regression line
abline(mod1,
       lwd = 1.8,
       col = "steelblue")

According to the estimated model, a WHR of \(1\) is associated with an expected probability of disease of roughly \(50\%\). The model indicates that there is a positive relation between the WHR and the probability of disease, so individuals with a high WHR are more likely to have T2D.
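A minimal sketch of this prediction using predict():

# predicted probability of disease at WHR = 1 under the linear probability model
predict(mod1, newdata = data.frame(WHR = 1))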

We may use coeftest() to obtain robust standard errors for both coefficient estimates.

# print robust coefficient summary
coeftest(mod1, vcov. = vcovHC, type = "HC1")
##
## t test of coefficients:
##
##              Estimate Std. Error t value  Pr(>|t|)
## (Intercept) -0.079910   0.031967 -2.4998   0.01249 *
## WHR          0.603535   0.098483  6.1283 1.036e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimated regression line is \[\begin{align} \widehat{case} = -\underset{(0.032)}{0.080} + \underset{(0.098)}{0.604} WHR. \end{align}\] The coefficient on \(WHR\) is statistically significantly different from \(0\) at the \(1\%\) level. Its estimate can be interpreted as follows: an increase in \(WHR\) by \(0.01\) is associated with an increase in the probability of disease of \(0.604 \cdot 0.01 = 0.00604\), i.e., about \(0.6\) percentage points.

We augment the simple model with an additional regressor \(sex\), which equals \(1\) for men and \(0\) for women. Such a specification is the baseline for investigating if there are sex differences in disease prevalence when we control for other factors such as WHR.

# estimate the model
mod2 <- lm(case ~ WHR + sex, data = data)
coeftest(mod2, vcov. = vcovHC)
##
## t test of coefficients:
##
##              Estimate Std. Error t value  Pr(>|t|)
## (Intercept) -0.090514   0.033430 -2.7076  0.006826 **
## WHR          0.559195   0.103671  5.3939 7.575e-08 ***
## sexmale      0.177428   0.025055  7.0815 1.871e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimated regression function is \[\begin{align} \widehat{case} =& \, -\underset{(0.033)}{0.091} + \underset{(0.104)}{0.559} WHR + \underset{(0.025)}{0.177} sex. \end{align}\]

The coefficient on \(sex\) is positive and significantly different from zero at the \(0.01\%\) level. The interpretation is that, holding \(WHR\) constant, being male is associated with an increase in the probability of T2D of about \(17.7\) percentage points.
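The following sketch compares the predicted probabilities for a woman and a man with the same WHR (the value 0.3 is chosen only for illustration); in a linear probability model their difference equals the coefficient on \(sex\):

# predicted disease probabilities for a woman and a man with WHR = 0.3
predict(mod2, newdata = data.frame(WHR = c(0.3, 0.3), sex = c("female", "male")))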

Probit and Logit Regression

The linear probability model has a major flaw: it assumes the conditional probability function to be linear. This does not restrict \(P(Y=1\vert X_1,\dots,X_k)\) to lie between \(0\) and \(1\). We can easily see this in the figure above, where predicted values from the regression line go above \(1\) and below \(0\).

This circumstance calls for an approach that uses a nonlinear function to model the conditional probability function of a binary dependent variable. Commonly used methods are Probit and Logit regression.

Probit Regression

In Probit regression, the cumulative standard normal distribution function \(\Phi(\cdot)\) is used to model the regression function when the dependent variable is binary, that is, we assume \[\begin{align} E(Y\vert X) = P(Y=1\vert X) = \Phi(\beta_0 + \beta_1 X). \end{align}\]

\(\beta_0 + \beta_1 X\) plays the role of a quantile \(z\). Remember that \[\Phi(z) = P(Z \leq z) \ , \ Z \sim \mathcal{N}(0,1)\] such that the Probit coefficient \(\beta_1\) is the change in \(z\) associated with a one unit change in \(X\). Although the effect on \(z\) of a change in \(X\) is linear, the link between \(z\) and the dependent variable \(Y\) is nonlinear since \(\Phi\) is a nonlinear function of \(z\).

Since the dependent variable is a nonlinear function of the regressors, the coefficient on \(X\) has no simple interpretation. The expected change in the probability that \(Y=1\) due to a change in \(X\) can be computed as follows:

  1. Compute the predicted probability that \(Y=1\) for the original value of \(X\).
  2. Compute the predicted probability that \(Y=1\) for \(X + \Delta X\).
  3. Compute the difference between both predicted probabilities.

Of course we can generalize this to Probit regression with multiple regressors to mitigate the risk of facing omitted variable bias. Probit regression essentials are summarized in Key Concept 37.

Key Concept 37

Probit Model, Predicted Probabilities and Estimated Effects

Assume that \(Y\) is a binary variable. The model \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + u \] with \[P(Y = 1 \vert X_1, X_2, \dots ,X_k) = \Phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)\] is the population Probit model with multiple regressors \(X_1, X_2, \dots, X_k\) and \(\Phi(\cdot)\) is the cumulative standard normal distribution function. The predicted probability that \(Y=1\) given \(X_1, X_2, \dots, X_k\) can be calculated in two steps: 1. Compute \(z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k\) 2. Look up \(\Phi(z)\) by calling pnorm(). \(\beta_j\) is the effect on \(z\) of a one unit change in regressor \(X_j\), holding constant all other \(k-1\) regressors. In R, Probit models can be estimated using the function glm() from the package stats. Using the argument family we specify that we want to use a Probit link function.

We now estimate a simple Probit model of the probability of disease.

# estimate the simple probit model
probit_mod <- glm(case ~ WHR,
                  family = binomial(link = "probit"),
                  data = data)
coeftest(probit_mod, vcov. = vcovHC, type = "HC1")
##
## z test of coefficients:
##
##             Estimate Std. Error  z value  Pr(>|z|)
## (Intercept) -2.19415    0.18901 -11.6087 < 2.2e-16 ***
## WHR          2.96787    0.53698   5.5269 3.259e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimated model is

\[\begin{align} \widehat{P(case\vert WHR)} = \Phi(-\underset{(0.19)}{2.19} + \underset{(0.54)}{2.97} WHR). \end{align}\]

Just as in the linear probability model we find that the relation between the probability of disease and the WHR is positive and that the corresponding coefficient is highly significant.

# plot data
plot(x = data$WHR,
     y = data$case,
     main = "Probit Model of the Probability of Disease, Given WHR",
     xlab = "WHR",
     ylab = "Case-control",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.85)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "case")
text(2.5, -0.1, cex= 0.8, "control")
# add estimated regression line
x <- seq(0, 3, 0.01)
y <- predict(probit_mod, list(WHR = x), type = "response")
lines(x, y, lwd = 1.5, col = "steelblue")

The estimated regression function has a stretched “S”-shape which is typical for the CDF of a continuous random variable with a symmetric PDF, like that of a normal random variable. The function is clearly nonlinear and flattens out for large and small values of \(WHR\). The functional form thus also ensures that the predicted conditional probabilities of disease lie between \(0\) and \(1\).

We use predict() to compute the predicted change in the disease probability when \(WHR\) is increased from \(0.3\) to \(0.4\).

# 1. compute predictions for WHR = 0.3 and WHR = 0.4
predictions <- predict(probit_mod,
                       newdata = data.frame("WHR" = c(0.3, 0.4)),
                       type = "response")
# 2. Compute difference in probabilities
diff(predictions)
##          2
## 0.06081433

We find that an increase in the WHR from \(0.3\) to \(0.4\) is predicted to increase the probability of disease by approximately \(6.1\) percentage points.
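The same difference can be computed 'by hand' from the estimated Probit coefficients using pnorm(), following the three steps listed above (a short sketch):

# reproduce the predicted change using the estimated coefficients and pnorm()
b <- coef(probit_mod)
unname(pnorm(b[1] + b[2] * 0.4) - pnorm(b[1] + b[2] * 0.3))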

We continue by using an augmented Probit model to estimate the effect of sex on the probability of disease.

probit_mod2 <- glm(case ~ WHR + sex,
                   family = binomial(link = "probit"),
                   data = data)
coeftest(probit_mod2, vcov. = vcovHC, type = "HC1")
##
## z test of coefficients:
##
##              Estimate Std. Error  z value  Pr(>|z|)
## (Intercept) -2.258787   0.176608 -12.7898 < 2.2e-16 ***
## WHR          2.741779   0.497673   5.5092 3.605e-08 ***
## sexmale      0.708155   0.083091   8.5227 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimated model equation is

\[\begin{align} \widehat{P(case\vert WHR, sex)} = \Phi (-\underset{(0.18)}{2.26} + \underset{(0.50)}{2.74} WHR + \underset{(0.08)}{0.71} sex). \end{align}\]

All coefficients are highly significant, and both the estimated coefficient on the WHR and that on the indicator for sex are positive. Again, the coefficients are difficult to interpret but they indicate that, first, men have a higher probability of disease than women, holding constant the WHR and, second, individuals with a high WHR face a higher risk of disease.

How big is the estimated difference in disease probabilities between a man and a woman with the same WHR? As before, we may use predict() to compute this difference.

# 1. compute predictions for WHR = 0.3
predictions <- predict(probit_mod2,
                       newdata = data.frame("sex" = c("female", "male"),
                                            "WHR" = c(0.3, 0.3)),
                       type = "response")
# 2. compute difference in probabilities
diff(predictions)
##         2
## 0.1578117

In this case, the estimated difference in disease probabilities is about \(15.8\) percentage points.

Logit Regression

Key Concept 38 summarizes the Logit regression function.

Key Concept 38

Logit Regression

The population Logit regression function is \[\begin{align*} P(Y=1\vert X_1, X_2, \dots, X_k) =& \, F(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k) \\ =& \, \frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)}}. \end{align*}\] The idea is similar to Probit regression except that a different CDF is used: \[F(x) = \frac{1}{1+e^{-x}}\] is the CDF of a standard logistically distributed random variable.
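The CDF \(F(x) = \frac{1}{1+e^{-x}}\) from Key Concept 38 is implemented in R as plogis(). The following quick check is purely illustrative; the evaluation points are chosen arbitrarily.

# compare the logistic CDF formula with plogis() at a few arbitrary points
x <- c(-2, 0, 2)
cbind(manual = 1 / (1 + exp(-x)),
      plogis = plogis(x))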

As for Probit regression, there is no simple interpretation of the model coefficients and it is best to consider predicted probabilities or differences in predicted probabilities. Here again, \(t\)-statistics and confidence intervals based on large sample normal approximations can be computed as usual.

It is fairly easy to estimate a Logit regression model using R.

logit_mod <- glm(case ~ WHR,
                 family = binomial(link = "logit"),
                 data = data)
coeftest(logit_mod, vcov. = vcovHC, type = "HC1")
##
## z test of coefficients:
##
##             Estimate Std. Error  z value  Pr(>|z|)
## (Intercept) -4.02843    0.35898 -11.2218 < 2.2e-16 ***
## WHR          5.88450    1.00015   5.8836 4.014e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
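As mentioned above, confidence intervals based on the large-sample normal approximation can be computed as usual. The following sketch uses the heteroskedasticity-robust standard errors from vcovHC(); the \(95\%\) level and the critical value \(1.96\) are our choices for illustration.

# approximate 95% confidence intervals for the Logit coefficients,
# based on robust standard errors and the normal approximation
rob_se <- sqrt(diag(vcovHC(logit_mod, type = "HC1")))
cbind(lower = coef(logit_mod) - 1.96 * rob_se,
      upper = coef(logit_mod) + 1.96 * rob_se)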
# plot data
plot(x = data$WHR,
     y = data$case,
     main = "Probit and Logit Models of the Probability of Disease, Given WHR",
     xlab = "WHR",
     ylab = "Case-control",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.9)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "case")
text(2.5, -0.1, cex= 0.8, "control")
# add estimated regression line of Probit and Logit models
x <- seq(0, 3, 0.01)
y_probit <- predict(probit_mod, list(WHR = x), type = "response")
y_logit <- predict(logit_mod, list(WHR = x), type = "response")
lines(x, y_probit, lwd = 1.5, col = "steelblue")
lines(x, y_logit, lwd = 1.5, col = "black", lty = 2)
# add a legend
legend("topleft",
       horiz = TRUE,
       legend = c("Probit", "Logit"),
       col = c("steelblue", "black"),
       lty = c(1, 2))

Both models produce very similar estimates of the probability of disease depending on the WHR.

We extend the simple Logit model with the additional regressor \(sex\).

# estimate a Logit regression with multiple regressors
logit_mod2 <- glm(case ~ WHR + sex,
                  family = binomial(link = "logit"),
                  data = data)
coeftest(logit_mod2, vcov. = vcovHC, type = "HC1")
##
## z test of coefficients:
##
##             Estimate Std. Error  z value  Pr(>|z|)
## (Intercept) -4.12556    0.34597 -11.9245 < 2.2e-16 ***
## WHR          5.37036    0.96376   5.5723 2.514e-08 ***
## sexmale      1.27278    0.14616   8.7081 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We obtain

\[\begin{align} \widehat{P(case=1 \vert WHR, sex)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} WHR + \underset{(0.15)}{1.27} sex). \end{align}\]

As for the Probit model, all coefficients are highly significant and we obtain positive estimates for the coefficients on \(WHR\) and \(sex\). For comparison we compute the predicted probability of disease for a man and a woman who have a \(WHR\) of \(0.3\).

# 1. compute predictions for WHR = 0.3
predictions <- predict(logit_mod2,
                       newdata = data.frame("sex" = c("female", "male"),
                                            "WHR" = c(0.3, 0.3)),
                       type = "response")
predictions
##          1          2
## 0.07485143 0.22414592
# 2. Compute difference in probabilities
diff(predictions)
##         2
## 0.1492945

We find that the woman faces a disease probability of only \(7.5\%\), while the man has a probability of \(22.4\%\), a difference of about \(14.9\) percentage points.

Comparison of the Models

The Probit model and the Logit model deliver only approximations to the unknown population regression function \(E(Y\vert X)\). It is not obvious how to decide which model to use in practice. The linear probability model has the clear drawback of not being able to capture the nonlinear nature of the population regression function and it may predict probabilities to lie outside the interval \([0,1]\). Probit and Logit models are harder to interpret but capture the nonlinearities better than the linear approach: both models produce predictions of probabilities that lie inside the interval \([0,1]\). Predictions of all three models are often close to each other.
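To illustrate the last point for the models estimated above, we can compare the predicted disease probabilities of the Probit and Logit specifications at a few values of the WHR (the chosen values are arbitrary).

# compare Probit and Logit predictions at a few arbitrary WHR values
newdat <- data.frame(WHR = c(0.2, 0.5, 1.0))
cbind(WHR = newdat$WHR,
      Probit = predict(probit_mod, newdat, type = "response"),
      Logit = predict(logit_mod, newdat, type = "response"))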

Estimation and Inference in the Logit and Probit Models

So far nothing has been said about how Logit and Probit models are estimated by statistical software. The reason why this is interesting is that both models are nonlinear in the parameters and thus cannot be estimated using OLS. Instead one relies on maximum likelihood estimation (MLE). Another approach is estimation by nonlinear least squares (NLS).

Nonlinear Least Squares

Consider the multiple regression Probit model \[\begin{align} E(Y_i\vert X_{1i}, \dots, X_{ki}) = P(Y_i=1\vert X_{1i}, \dots, X_{ki}) = \Phi(\beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}). \end{align}\] Similarly to OLS, NLS estimates the parameters \(\beta_0,\beta_1,\dots,\beta_k\) by minimizing the sum of squared mistakes \[\sum_{i=1}^n\left[ Y_i - \Phi(b_0 + b_1 X_{1i} + \dots + b_k X_{ki}) \right]^2.\] NLS estimation is a consistent approach that produces estimates which are normally distributed in large samples. In R, functions like nls() from the package stats provide algorithms for solving nonlinear least squares problems. However, NLS is inefficient, meaning that there are estimation techniques with a smaller variance, which is why we will not dwell on this topic any further.
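For completeness, here is a sketch of how the simple Probit model from above could be estimated by NLS with nls(). It assumes that case is coded numerically as 0/1 (convert it first if it is stored as a factor); the starting values are taken close to the MLE estimates and may need adjustment for convergence.

# NLS estimation of the simple Probit model (illustrative sketch)
# assumes data$case is a numeric 0/1 variable
probit_nls <- nls(case ~ pnorm(b0 + b1 * WHR),
                  data = data,
                  start = list(b0 = -2, b1 = 3))
coef(probit_nls)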

Maximum Likelihood Estimation

In MLE we seek to estimate the unknown parameters by choosing them such that the likelihood of drawing the observed sample is maximized. This probability is measured by the likelihood function, the joint probability distribution of the data treated as a function of the unknown parameters. Put differently, the maximum likelihood estimates of the unknown parameters are the values that result in the model which is most likely to have produced the data observed. It turns out that MLE is more efficient than NLS.
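To make the idea concrete, the following sketch maximizes the Probit log-likelihood "by hand" using optim(). This is purely illustrative and is not how glm() proceeds internally; it assumes that case is coded numerically as 0/1 and that WHR is numeric.

# maximum likelihood estimation of the simple Probit model via optim()
# (illustrative sketch; assumes data$case is a numeric 0/1 variable)
y <- data$case
X <- cbind(1, data$WHR)
neg_loglik <- function(beta) {
  p <- pnorm(X %*% beta)
  p <- pmin(pmax(p, 1e-12), 1 - 1e-12)  # guard against log(0)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}
mle <- optim(par = c(0, 0), fn = neg_loglik, method = "BFGS")
mle$par  # should be close to coef(probit_mod)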

As maximum likelihood estimates are normally distributed in large samples, statistical inference for coefficients in nonlinear models like Logit and Probit regression can be made using the same tools that are used for linear regression models: we can compute \(t\)-statistics and confidence intervals.

Many software packages use an MLE algorithm for estimation of nonlinear models. The function glm() uses an algorithm named iteratively reweighted least squares.
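As a small illustration, the number of iterations the iteratively reweighted least squares algorithm needed for the augmented Probit model is stored in the fitted glm object.

# number of IRLS iterations used by glm() for the augmented Probit model
probit_mod2$iter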

Measures of Fit

It is important to be aware that the usual \(R^2\) and \(\bar{R}^2\) are invalid for nonlinear regression models. The reason for this is simple: both measures assume that the relation between the dependent and the explanatory variable(s) is linear. This obviously does not hold for Probit and Logit models. Thus \(R^2\) need not lie between \(0\) and \(1\) and there is no meaningful interpretation. However, statistical software sometimes reports these measures anyway.

There are many measures of fit for nonlinear regression models and there is no consensus on which one should be reported. The situation is further complicated by the fact that no single measure of fit is meaningful in general. For models with a binary response variable like \(case\) one could use the following rule: if \(Y_i = 1\) and \(\widehat{P(Y_i = 1\vert X_{i1}, \dots, X_{ik})} > 0.5\) or if \(Y_i = 0\) and \(\widehat{P(Y_i = 1\vert X_{i1}, \dots, X_{ik})} < 0.5\), consider \(Y_i\) correctly predicted; otherwise \(Y_i\) is said to be incorrectly predicted. The measure of fit is then the share of correctly predicted observations. The downside of such an approach is that it does not reflect the quality of the prediction: whether \(\widehat{P(Y_i = 1\vert X_{i1}, \dots, X_{ik})} = 0.51\) or \(\widehat{P(Y_i = 1\vert X_{i1}, \dots, X_{ik})} = 0.99\) makes no difference, we simply predict \(Y_i = 1\). This is in contrast to the case of a numeric dependent variable, where the squared errors are used to assess the quality of the prediction.
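The following sketch computes this share of correctly predicted observations for the augmented Probit model; it assumes that case is coded numerically as 0/1.

# share of correctly predicted observations for the augmented Probit model
# (illustrative sketch; assumes data$case is a numeric 0/1 variable)
pred_prob <- predict(probit_mod2, type = "response")
pred_class <- as.numeric(pred_prob > 0.5)
mean(pred_class == data$case)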

An alternative to the latter is the class of so-called pseudo-\(R^2\) measures. To measure the quality of the fit, these measures compare the maximized (log-)likelihood of the model with all regressors (the full model) to that of a model with no regressors (the null model, a regression on a constant only).

For example, consider a Probit regression. The \(\text{pseudo-}R^2\) is given by \[\text{pseudo-}R^2 = 1 - \frac{\ln(f^{max}_{full})}{\ln(f^{max}_{null})}\] where \(f^{max}_j \in [0,1]\) denotes the maximized likelihood for model \(j\).

The reasoning behind this is that the maximized likelihood increases as additional regressors are added to the model, similarly to the decrease in \(SSR\) when regressors are added in a linear regression model. If the full model has a similar maximized likelihood as the null model, the full model does not really improve upon a model that uses only the information in the dependent variable, so \(\text{pseudo-}R^2 \approx 0\). If the full model fits the data very well, the maximized likelihood should be close to \(1\) such that \(\ln(f^{max}_{full}) \approx 0\) and \(\text{pseudo-}R^2 \approx 1\).

summary() does not report \(\text{pseudo-}R^2\) for models estimated by glm() but we can use the entries residual deviance (deviance) and null deviance (null.deviance) instead. These are computed as \[\text{deviance} = -2 \times \left[\ln(f^{max}_{saturated}) - \ln(f^{max}_{full}) \right]\] and

\[\text{null deviance} = -2 \times \left[\ln(f^{max}_{saturated}) - \ln(f^{max}_{null}) \right]\] where \(f^{max}_{saturated}\) is the maximized likelihood for a model which assumes that each observation has its own parameter (one parameter per observation, which leads to a perfect fit). For models with a binary dependent variable, it holds that \[\text{pseudo-}R^2 = 1 - \frac{\text{deviance}}{\text{null deviance}} = 1- \frac{\ln(f^{max}_{full})}{\ln(f^{max}_{null})}.\]

We now compute the \(\text{pseudo-}R^2\) for the augmented Probit model of disease.

# compute pseudo-R2 for the augmented probit model of disease
pseudoR2 <- 1 - (probit_mod2$deviance) / (probit_mod2$null.deviance)
pseudoR2
## [1] 0.08594259

Another way to obtain the \(\text{pseudo-}R^2\) is to estimate the null model using glm() and extract the maximized log-likelihoods for both the null and the full model using the function logLik().

# compute the null model
probit_null <- glm(formula = case ~ 1,
                       family = binomial(link = "probit"),
                       data = data)
# compute the pseudo-R2 using 'logLik'
1 - logLik(probit_mod2)[1]/logLik(probit_null)[1]
## [1] 0.08594259

Exercise - Application to data

We are now going to analyse a set of data.

On the source file page of https://github.com/complex-trait-genetics/data-analysis you will find the dataset L3_exercise.txt and the file L3_exercise_readme.txt.

The exercise is to analyse these data; you decide what is interesting and what you want to do.

The readme file lists the names of all columns and gives a short description of the data.

You need to submit a PDF document containing only one page. This page should contain a figure, a table, a paragraph describing the methods you used to analyse the data, and a paragraph explaining the results.