Lab 4: Added Variable Plots and Bootstrapping

PSYC 7804 - Regression with Lab

Fabio Setti

Spring 2026

Today’s Packages and Data 🤗

No new packages for today!

library(tidyverse)
library(car)
theme_set(theme_classic(base_size = 16, 
          accent = "#3492eb", 
          base_family = 'serif'))


Data

We’ll continue looking at the 2024 World Happiness report data.

dat <- rio::import("https://fabio-setti.netlify.app/data/World_happiness_2024.csv")

Partial regression coefficients

In Lab 3, we used Corruption and Freedom to predict Happiness_score. We observed that the results were very different when we used both variables together compared to when we looked at them separately.

The intercept and slopes with both predictors in the model

reg_full <- lm(Happiness_score ~ Corruption + Freedom, dat)
coef(reg_full)
(Intercept)  Corruption     Freedom 
   5.093774   -2.441528    4.032015 

(Keep these values in mind; they come back in the next slide!)

The intercepts and slopes from separate models with one predictor each are quite different

reg_free <- lm(Happiness_score ~ Freedom, dat)
coef(reg_free)
(Intercept)     Freedom 
   2.623351    4.684888 
reg_corr <- lm(Happiness_score ~ Corruption, dat)
coef(reg_corr)
(Intercept)  Corruption 
   9.107178   -4.227894 


The catch is that the model that includes both predictors calculates partial regression coefficients.

Partial regression coefficients “By Hand”

Personally, the concept of partial regression coefficients starts making sense once I “compute” them myself and see what is going on behind the scenes. The residuals are once again very important here!

Partial regression coefficient for Corruption

resid_happy_Free <- resid(lm(Happiness_score ~ Freedom, 
               dat))

resid_corr <- resid(lm(Corruption ~ Freedom, dat))

# the 0 + removes the intercept, which is exactly 0 anyway in this case 
coef(lm(resid_happy_Free ~ 0 + resid_corr))
resid_corr 
 -2.441528 

Partial regression coefficient for Freedom

resid_happy_corr <- resid(lm(Happiness_score ~ Corruption, 
               dat))

resid_Free <- resid(lm(Freedom ~ Corruption, dat))

# the 0 + removes the intercept, which is exactly 0 anyway in this case 
coef(lm(resid_happy_corr ~ 0 + resid_Free))
resid_Free 
  4.032015 

So, when you run a multiple regression, each slope is calculated from the residuals of the outcome and of that predictor, after accounting for all the other variables in the model.
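As a quick sanity check (a small sketch reusing the residual objects created above), the "by hand" slopes should match the ones from reg_full up to numerical precision:

# the two "by hand" partial slopes, stacked into one vector
by_hand <- c(coef(lm(resid_happy_Free ~ 0 + resid_corr)),
             coef(lm(resid_happy_corr ~ 0 + resid_Free)))

# compare with the slopes from the full model (should be TRUE)
all.equal(unname(by_hand),
          unname(coef(reg_full)[c("Corruption", "Freedom")]))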

Added variable plots

Added variable plots, then, are simply plots of the residuals of one of the predictors against the residuals of \(Y\), both obtained by removing the variance explained by all the other variables in the model.

ggplot() +
  geom_point(aes(y = resid_happy_Free, 
                 x = resid_corr))

ggplot() +
  geom_point(aes(y = resid_happy_corr, 
                 x = resid_Free))

The slopes that you get when you include both predictors are the slopes of the lines of best fit in these 2 plots.
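If you want to actually see those lines, here is a small sketch: adding geom_smooth(method = "lm") to the plots above draws the line of best fit, whose slope is the partial regression coefficient (-2.44 for Corruption, 4.03 for Freedom).

# added variable plot for Corruption, with its line of best fit
ggplot(mapping = aes(y = resid_happy_Free, 
                     x = resid_corr)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# added variable plot for Freedom, with its line of best fit
ggplot(mapping = aes(y = resid_happy_corr, 
                     x = resid_Free)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)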

The Quick way: car::avPlots()

car::avPlots(reg_full)

To get added variable plots in practice, you would use the avPlots() function.

When you have more than 2 predictors, it is no longer possible to visualize all variables at once with something like a 3D plot. Added variable plots can help you visualize regression results no matter the number of predictors!
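For example, here is a hypothetical sketch with 3 predictors (Generosity is an assumed column name here; check names(dat) for the actual columns in the data):

# hypothetical third predictor: swap "Generosity" for any column in dat
reg_three <- lm(Happiness_score ~ Corruption + Freedom + Generosity, dat)

# one added variable plot per predictor, no matter how many there are
car::avPlots(reg_three)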

Another Look at the 3D plane

You can see that the avPlots() function identifies 4 points per plot: 2 points with the largest residuals, and 2 points with the highest leverage.

Try tilting the plane such that it resembles the added variable plots on the previous slide. If you hover over the dots in the 3D plot, you should see what point you are looking at (“row #”). Use the points on the added variable plots as a reference.
library(FabioFun)

nice_3D_plot(y = dat$Happiness_score,
             x1 = dat$Corruption,
             x2 = dat$Freedom,
             axis_names = c("Happiness", 
                            "Corruption", 
                            "Freedom"),
             reg_plane = TRUE)

Bootstrapping

Bootstrapping (Efron, 1992) is a really smart idea to calculate confidence intervals while avoiding the math! You have probably seen some complicated equations that need to be derived to calculate confidence intervals for statistics (e.g., see here for the 95% CI of \(R^2\)).

It turns out that even more complex math goes into deriving those complicated equations 😱 Usually, that involves figuring out the theoretical sampling distribution of a given statistic over an infinite number of hypothetical experiments.

Instead, bootstrapping says: treat your sample as if it were the population, resample from it with replacement many times, compute your statistic each time, and use the distribution of those statistics as an approximation of the sampling distribution.
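Here is a minimal sketch of that recipe, using the sample mean of Happiness_score just to keep the statistic simple:

set.seed(1)

# resample the data with replacement and compute the statistic many times
boot_means <- replicate(2000, mean(sample(dat$Happiness_score, 
                                          replace = TRUE)))

# the middle 95% of those statistics is the percentile 95% CI
quantile(boot_means, c(.025, .975))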

“By Hand” Example of Bootstrapping

As always, we can do things ourselves to get a better understanding of the process.

Bootstrap \(R^2\) code
# empty element to save R^2 to
r_squared <- c()

reg_vars <- dat[, c("Happiness_score",
                    "Corruption",
                    "Freedom")]

set.seed(34677)

for(i in 1:2000){

  # sample row numbers from the data, with replacement
  boot_rows <- sample(1:nrow(reg_vars), 
                      replace = TRUE)  

  dat_boot <- reg_vars[boot_rows, ]

  # Run regression on the bootstrap sample
  reg_boot <- lm(Happiness_score ~ Corruption + Freedom, 
                 dat_boot)  

  # Save R^2
  r_squared[i] <- summary(reg_boot)$r.squared
}

There is a lot of value in understanding how the code above works, but the main point is that we are running the same regression 2000 times, each time on a different bootstrap sample drawn (with replacement) from our data, and saving the \(R^2\) that we get every time.
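If the sample() call is the confusing part, here is a tiny illustration: sampling row numbers with replacement means some rows show up more than once in a given bootstrap sample, while others do not show up at all.

set.seed(1)

# some of the 10 row numbers repeat, others are left out entirely
sample(1:10, replace = TRUE)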

The \(R^2\) from our regression with both variables was

summary(reg_full)$r.squared
[1] 0.4753323

We mostly care about the 95% confidence interval.

If we look at the distribution of all 2000 \(R^2\) values, we get:

hist(r_squared)

quantile(r_squared, c(.025, .5, .975))
     2.5%       50%     97.5% 
0.3373832 0.4793844 0.6113752 

The Quick way: Boot() function from car

Once again, the car package comes to the rescue.

set.seed(875)
boot_model <- car::Boot(reg_full, R = 2000)

confint(boot_model, type = "perc")
Bootstrap percent confidence intervals

                2.5 %    97.5 %
(Intercept)  3.969337  6.309630
Corruption  -3.303885 -1.572729
Freedom      2.963881  5.004918

The confidence interval for \(R^2\) is a bit annoying to get from car, so here we just have the CIs for the intercept and slopes.
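If you do want it, one way (a sketch using the f argument of Boot(), which lets you bootstrap any function of the refitted model) is:

# tell Boot() to save R^2 instead of the coefficients
set.seed(875)
boot_r2 <- car::Boot(reg_full,
                     f = function(model) summary(model)$r.squared,
                     labels = "R2",
                     R = 2000)

confint(boot_r2, type = "perc")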

These are our normal confidence intervals from the regression:

confint(reg_full)
                2.5 %    97.5 %
(Intercept)  3.734221  6.453328
Corruption  -3.661122 -1.221935
Freedom      3.084526  4.979504

Bootstrapping becomes more and more accurate as sample size increases. Here it may not be as accurate because we only have 140 observations, although it seems to do alright.

Visualizing Bootstrap Results

You can also visualize the distribution of the bootstrapped samples for the regression slopes.

NOTE: You need to provide the names of the regression slopes, as shown in the summary() output, through the parm = argument.
hist(boot_model, 
     parm = c("Corruption", 
              "Freedom"))

Notice that the bootstrap distributions of the slopes look very close to normal. This is almost always the case for regression slopes. In contrast, statistics like \(R^2\) will not be normally distributed, as we saw 2 slides ago.
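One quick way to see this is a normal Q-Q plot of the r_squared vector we saved in the "by hand" bootstrap: the points bend away from the straight line, especially in the tails.

# points that stray from the line indicate a departure from normality
qqnorm(r_squared)
qqline(r_squared)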

References

Efron, B. (1992). Bootstrap Methods: Another Look at the Jackknife. In S. Kotz & N. L. Johnson (Eds.), Breakthroughs in Statistics: Methodology and Distribution (pp. 569–593). Springer. https://doi.org/10.1007/978-1-4612-4380-9_41
Fox, J., Weisberg, S., Price, B., Adler, D., Bates, D., Baud-Bovy, G., Bolker, B., Ellison, S., Firth, D., Friendly, M., Gorjanc, G., Graves, S., Heiberger, R., Krivitsky, P., Laboissiere, R., Maechler, M., Monette, G., Murdoch, D., Nilsson, H., … R-Core. (2024). Car: Companion to Applied Regression.
Wickham, H., & RStudio. (2023). Tidyverse: Easily Install and Load the 'Tidyverse'.