Today we will be diving deeper into correlational analyses by talking about controls, interactions, and coefficient plots.
OLS Regression
Remember that OLS is a technique for estimating coefficients in a linear regression equation, describing the relationship between one or more independent variables and a dependent variable.
OLS fits the line of best fit: the line that minimizes the sum of squared errors, i.e., the squared vertical distances between the points and the line.
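To make the "line of best fit" idea concrete, here is a minimal sketch on simulated data (the numbers are made up and are not the IMDB data used in this lesson). It computes the slope and intercept by hand, checks them against lm(), and shows that nudging the slope away from the OLS value increases the sum of squared errors.

```r
# Simulated data (hypothetical; not the lesson's imdb data)
set.seed(42)
x <- rnorm(200, mean = 120, sd = 25)
y <- 7.5 + 0.002 * x + rnorm(200, sd = 0.3)

# Closed-form OLS estimates for a single predictor
slope_by_hand     <- cov(x, y) / var(x)
intercept_by_hand <- mean(y) - slope_by_hand * mean(x)

# The same estimates from lm()
fit <- lm(y ~ x)
coef(fit)
c(intercept_by_hand, slope_by_hand)

# Sum of squared errors: the quantity OLS minimizes
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)
sse_ols   <- sse(intercept_by_hand, slope_by_hand)
sse_other <- sse(intercept_by_hand, slope_by_hand + 0.001)
sse_ols < sse_other  # any other line does worse
```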
Prepare Code
Let’s run through an example of how you might use a regression to answer one of your research questions.
Step 1: DVs, IVs, and Research Question
Research Q: How does movie length affect audience perceptions of movies?
DV: Audience rating for any particular movie
IV: Movie length in minutes
We’re starting with a fresh R script. What do we do first?
Step 2: Create a Bivariate Regression and Interpret
Here is a regression. Can you practice interpreting this output?
model1 <- lm(imdb_rating ~ runtime, data = imdb)
summary(model1)
Call:
lm(formula = imdb_rating ~ runtime, data = imdb)
Residuals:
     Min       1Q   Median       3Q      Max
-0.53550 -0.20486 -0.03168  0.15282  1.30515
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.6563477  0.0379558 201.718  < 2e-16 ***
runtime     0.0023838  0.0003011   7.917 6.44e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2674 on 998 degrees of freedom
Multiple R-squared: 0.0591, Adjusted R-squared: 0.05815
F-statistic: 62.68 on 1 and 998 DF, p-value: 6.442e-15
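One way to practice interpreting the output is to plug the reported coefficients into the regression equation. The sketch below copies the intercept and slope from the output above; the 120-minute movie is a hypothetical example.

```r
# Coefficients copied from the summary(model1) output
intercept <- 7.6563477
slope     <- 0.0023838

# Predicted rating for a hypothetical 120-minute movie (about 7.94)
pred_120 <- intercept + slope * 120
pred_120

# Predicted change in rating for one extra hour of runtime (about 0.14 points)
extra_hour <- slope * 60
extra_hour
```

The relationship is statistically significant but small in substantive terms, and the R-squared of 0.0591 says runtime explains roughly 6% of the variation in ratings.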
Step 3: Include a Control Variable
Control variables help isolate the relationship between your IV and DV by accounting for other factors that may be related to both. One might think that the relationship between runtime and imdb_rating can be explained by gross revenue.
What’s a story for why that might be?
Step 3: Include a Control Variable
Let’s check it out. We have already regressed imdb_rating on runtime. Now, let’s add in a control for gross.
model2 <- lm(imdb_rating ~ runtime + as.numeric(gross), data = imdb)
summary(model2)
Call:
lm(formula = imdb_rating ~ runtime + as.numeric(gross), data = imdb)
Residuals:
     Min       1Q   Median       3Q      Max
-0.52565 -0.21483 -0.03582  0.15780  1.31556
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       7.632e+00  4.398e-02 173.514  < 2e-16 ***
runtime           2.451e-03  3.495e-04   7.012 4.87e-12 ***
as.numeric(gross) 1.617e-10  8.758e-11   1.847   0.0651 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2742 on 828 degrees of freedom
(169 observations deleted due to missingness)
Multiple R-squared: 0.06474, Adjusted R-squared: 0.06248
F-statistic: 28.66 on 2 and 828 DF, p-value: 9.243e-13
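Gross revenue has a weak positive association with IMDB rating that falls short of conventional significance (p = 0.065), while the runtime coefficient barely changes (from 0.0024 to 0.0025) and remains highly significant. Note that the sample drops to 831 movies because 169 observations are missing gross.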
Step 3: Include a Control Variable
Structurally, nothing here is different from having two independent variables. Theoretically, however, it is different. We expect that some of the variation in this relationship is explained by the gross revenue of a movie, so we want to identify that variation (control for it) so as to better isolate the relationship between runtime and imdb_rating.
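To see why controlling for a variable can change the coefficient on your key IV, here is a small simulation. Everything in it is hypothetical: the variables are standardized, made-up stand-ins, and the "true" runtime effect of 0.2 is chosen by us. It is a sketch of the logic, not an analysis of the IMDB data.

```r
set.seed(1)
n <- 1000

# Hypothetical data-generating process
gross   <- rnorm(n)                                # the control variable
runtime <- 0.6 * gross + rnorm(n)                  # gross also drives the IV
rating  <- 0.2 * runtime + 0.5 * gross + rnorm(n)  # true runtime effect is 0.2

no_control   <- lm(rating ~ runtime)
with_control <- lm(rating ~ runtime + gross)

# Without the control, runtime absorbs part of the effect of gross
b_no   <- unname(coef(no_control)["runtime"])
b_with <- unname(coef(with_control)["runtime"])
c(no_control = b_no, with_control = b_with)
```

In the simulation, the estimate with the control lands much closer to 0.2 than the estimate without it.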
Step 4: Include an Interaction
We employ interactions when we think that the effect of one independent variable on the dependent variable is conditioned on another independent variable.
Motivating Example for Interaction
Suppose you are interested in studying the relationship between social media usage and political engagement. You can include an interaction variable between social media usage (e.g., hours spent on social media platforms) and age group to explore if the effect of social media usage on political engagement varies across different age groups.
social_media_usage x age_group
When examining political participation (e.g., voting behavior, political activism), you might include an interaction variable between gender and education level to investigate whether the relationship between education and political participation differs by gender.
education x gender
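As a rough sketch of the first example, the code below simulates survey data and fits an interaction between social media usage and age group. The data frame, variable names, and effect sizes are all hypothetical. It also shows that the asterisk in an R formula is shorthand for both main effects plus their interaction.

```r
set.seed(11)
n <- 500

# Hypothetical survey data
survey <- data.frame(
  social_media_usage = runif(n, 0, 6),
  age_group = factor(sample(c("18-29", "30-49", "50+"), n, replace = TRUE))
)

# Made-up process: usage matters more for the youngest group
usage_slope <- ifelse(survey$age_group == "18-29", 0.5, 0.1)
survey$engagement <- 2 + usage_slope * survey$social_media_usage + rnorm(n)

# a * b expands to a + b + a:b
fit_star <- lm(engagement ~ social_media_usage * age_group, data = survey)
fit_long <- lm(engagement ~ social_media_usage + age_group +
                 social_media_usage:age_group, data = survey)

same_fit <- isTRUE(all.equal(coef(fit_star), coef(fit_long)))
same_fit
names(coef(fit_star))
```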
Step 4: Include an Interaction
Say we want to know how action movies are rated by audiences, depending on whether the film is appropriate for kids or not.
First, let’s make some dummies for whether or not a movie is suitable for kids and whether it is an action movie.
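A minimal sketch of how these dummies could be built is below. The column names certificate and genre, the toy rows, and the list of kid-friendly certificates are all assumptions made for illustration; adapt them to whatever your copy of the data contains.

```r
# Toy stand-in for the imdb data frame. The columns `certificate` and
# `genre` are assumed names, not confirmed by the lesson.
imdb_toy <- data.frame(
  certificate = c("G", "PG", "R", "PG-13", "U", "R"),
  genre       = c("Animation, Adventure", "Action, Adventure",
                  "Crime, Drama", "Action, Sci-Fi", "Action, Comedy", "Drama"),
  stringsAsFactors = FALSE
)

# Hypothetical rule: treat these certificates as suitable for kids
kid_certificates <- c("G", "PG", "U")

# 1 if the condition holds, 0 otherwise
imdb_toy$kids   <- as.numeric(imdb_toy$certificate %in% kid_certificates)
imdb_toy$action <- as.numeric(grepl("Action", imdb_toy$genre, fixed = TRUE))

# Cross-tabulate to confirm that every combination has observations
table(kids = imdb_toy$kids, action = imdb_toy$action)
```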
Step 4 (If Necessary): Include an Interaction Term
Then, let’s run the regression with an interaction term. Note the asterisk: in an R formula, kids*action is shorthand for kids + action + kids:action, so the model includes both constituent terms plus their product (the interaction term). What do we see here?
model3 <- lm(imdb_rating ~ kids*action, data = imdb)
summary(model3)
Call:
lm(formula = imdb_rating ~ kids * action, data = imdb)
Residuals:
    Min      1Q  Median      3Q     Max
-0.4110 -0.2232 -0.0232  0.1768  1.2890
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.92320    0.01131 700.651  < 2e-16 ***
kids         0.08778    0.02075   4.231 2.54e-05 ***
action       0.05549    0.02716   2.043 0.041333 *
kids:action -0.18847    0.05029  -3.748 0.000189 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2728 on 996 degrees of freedom
Multiple R-squared: 0.02231, Adjusted R-squared: 0.01937
F-statistic: 7.578 on 3 and 996 DF, p-value: 5.161e-05
Step 4 (If Necessary): Include an Interaction Term
The negative and significant kids:action coefficient tells us that the effect of being a kids movie differs for action movies. Adding up the coefficients, the predicted rating for a kids action movie is about 7.88 (7.923 + 0.088 + 0.055 - 0.188), the lowest of the four groups. We can also (carefully) interpret the constituent terms of the interaction term.
The kids term tells us that, among NON-ACTION movies (action == 0), movies for kids are rated about 0.09 points higher than movies not for kids. The action term tells us that, among NON-KIDS movies (kids == 0), action movies are rated about 0.06 points higher than non-action movies.
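Because interaction terms are easy to misread, it can help to compute the predicted rating for each of the four groups. The sketch below copies the coefficients from the summary(model3) output and adds them up for every combination of the two dummies.

```r
# Coefficients copied from the summary(model3) output
b <- c(intercept = 7.92320, kids = 0.08778, action = 0.05549,
       kids_action = -0.18847)

# All four combinations of the two dummies
groups <- expand.grid(kids = c(0, 1), action = c(0, 1))
groups$predicted <- b[["intercept"]] +
  b[["kids"]] * groups$kids +
  b[["action"]] * groups$action +
  b[["kids_action"]] * groups$kids * groups$action
groups

# Effect of kids among ACTION movies = kids + kids:action
kids_effect_action <- b[["kids"]] + b[["kids_action"]]
kids_effect_action
```

Among action movies the kids effect is negative (about -0.10), even though it is positive among non-action movies.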
Regression Recap
Those are your basic steps for the decision-making process behind running a regression model. It is a lot to take in, but with time and practice, you will have no problem mastering it!
Just remember that you should build your regression from the inside out: first, identify your key IV and DV. Then account for any controls. Finally, if you have a theoretically motivated reason, throw in an interaction term.
What Controls Should I Include?
Here we take again from Dr. Brenton Kenkel’s website, using a helpful table to guide our decision to include (or exclude) control variables.
Assumptions of OLS Models
Next, we walk you through the assumptions underlying OLS regression models and the variables used within them.
Assumptions of OLS Models
OLS models assume:
Linearity
Constant variance
Independence of residuals
Normality of residuals
Linearity
OLS models assume linear relationships between the IV and DV.
Constant Variance
OLS models assume that the variance of the residuals is constant (homoskedasticity): the spread of the residuals does not grow or shrink as the fitted values increase or decrease.
Independence of Residuals
OLS models assume that the residuals are independent of one another. In other words, there is no serial autocorrelation (where past residuals predict future residuals).
Normality of Residuals
OLS models assume that residuals are normally distributed.
Assessing Assumptions
We can assess the validity of these assumptions in our specific use cases with a variety of diagnostic tests.
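The sketch below shows a few base R checks on simulated data (hypothetical values, so that it runs on its own). These are only some of the available diagnostics, not a complete procedure.

```r
# Simulated data (hypothetical), so the example is self-contained
set.seed(7)
runtime <- runif(300, 60, 200)
rating  <- 7.6 + 0.002 * runtime + rnorm(300, sd = 0.27)
fit <- lm(rating ~ runtime)

# Linearity and constant variance: residuals vs. fitted values should
# show no curve and no funnel shape
plot(fit, which = 1)

# Normality of residuals: points should follow the reference line
plot(fit, which = 2)

# Formal test of normality (null hypothesis: residuals are normal)
sw <- shapiro.test(resid(fit))
sw$p.value

# Independence: when rows have a meaningful order, the correlation between
# each residual and the previous one should be near zero
r <- resid(fit)
lag1_cor <- cor(r[-1], r[-length(r)])
lag1_cor
```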
Assessing Assumptions
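QQ-plots: plots of the quantiles of one distribution against the quantiles of another. To check normality, we plot the quantiles of our model's residuals against the quantiles of a theoretical normal distribution; points falling along a straight line suggest the residuals are approximately normal.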
Checking Assumptions in R
Here, we provide a quick overview as to how you can check OLS model assumptions using quantile plots.
First, let’s conduct a gut check by visualizing our data. We can use this scatterplot to check for linearity and heteroskedasticity.
imdb %>%
  ggplot(aes(runtime, imdb_rating)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Scatterplot with Regression Line", y = "IMDB Rating", x = "Runtime")
Checking Assumptions in R
We can also use a quantile-quantile plot to see if the residuals of your model are normally distributed or not.
# Compare the residuals of model1 to a theoretical normal distribution
qqnorm(resid(model1))
qqline(resid(model1))
Advanced Visualization Techniques
In the final portion of our final lesson, we will run through how to prepare our regression output for papers and plot coefficients.
Exporting Regression Output
First, let’s run and save a regression model.
model1 =lm(imdb_rating ~ runtime, data = imdb)
Exporting Regression Output
Next, we will load in the stargazer package, which allows us to make nice regression tables. It is pretty easy to do. Just use the stargazer function with the option type = "text" and then manually name your independent variables in the covariate.labels option.
Then, copy and paste this output into Word. You could also use the modelsummary package with output = "html", take a screenshot of the results, and paste it into Word.
# install.packages("stargazer")
library(stargazer)
Exporting Regression Output
First let’s look at what happens without any additional editing or options.
type = "text" specifies the type of output we want
covariate.labels = c(...) specifies how to rename the independent/control variables instead of using the default variable names
omit.stat = c(...) removes statistics we do not need from the table
dep.var.labels = "..." does the same as covariate.labels but for the dependent variable
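Put together, a call using these options might look like the sketch below. It assumes the stargazer package is installed; the data are simulated stand-ins, and the label text and the choice of statistics to omit are illustrative rather than the lesson's own.

```r
library(stargazer)

# Simulated stand-in for the imdb data (hypothetical values)
set.seed(3)
imdb_sim <- data.frame(runtime = runif(200, 60, 200))
imdb_sim$imdb_rating <- 7.6 + 0.002 * imdb_sim$runtime + rnorm(200, sd = 0.27)
fit <- lm(imdb_rating ~ runtime, data = imdb_sim)

# Default table, with no extra options
stargazer(fit, type = "text")

# Relabeled table, dropping the F statistic and residual standard error
table_lines <- capture.output(
  stargazer(fit,
            type = "text",
            covariate.labels = c("Runtime (minutes)"),
            dep.var.labels = "IMDB Rating",
            omit.stat = c("f", "ser"))
)
cat(table_lines, sep = "\n")
```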
Coefficient Plots
Regression tables can be hard to read because they have so much information. However, coefficient plots make it easy to succinctly visualize your key regression results. First, install and load the sjPlot package.
# install.packages("sjPlot")
library(sjPlot)
Coefficient Plots in R
This is a basic coefficient plot.
model2 <- lm(imdb_rating ~ runtime + as.numeric(gross), data = imdb)
plot_model(model2) +
  ylim(-0.01, 0.01) +
  labs(title = "Coefficient Plot for IMDB Rating")
Coefficient Plots in R
The previous plot nicely illustrates the point estimates and confidence intervals of both runtime and gross.
summary(model2)
Call:
lm(formula = imdb_rating ~ runtime + as.numeric(gross), data = imdb)
Residuals:
     Min       1Q   Median       3Q      Max
-0.52565 -0.21483 -0.03582  0.15780  1.31556
Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       7.632e+00  4.398e-02 173.514  < 2e-16 ***
runtime           2.451e-03  3.495e-04   7.012 4.87e-12 ***
as.numeric(gross) 1.617e-10  8.758e-11   1.847   0.0651 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2742 on 828 degrees of freedom
(169 observations deleted due to missingness)
Multiple R-squared: 0.06474, Adjusted R-squared: 0.06248
F-statistic: 28.66 on 2 and 828 DF, p-value: 9.243e-13
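The intervals that a coefficient plot draws can be reconstructed from this output. The sketch below copies the estimates, standard errors, and residual degrees of freedom from the summary and computes 95% confidence intervals by hand; with the fitted model in memory, confint(model2) returns the same kind of interval directly.

```r
# Estimates and standard errors copied from the summary(model2) output
est <- c(runtime = 2.451e-03, gross = 1.617e-10)
se  <- c(runtime = 3.495e-04, gross = 8.758e-11)
df  <- 828  # residual degrees of freedom from the output

# 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
ci

# An interval that contains zero corresponds to a p-value above 0.05
contains_zero <- ci[, "lower"] < 0 & ci[, "upper"] > 0
contains_zero
```

The runtime interval excludes zero, while the gross interval includes it, which matches the p-values in the table.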
Coefficient Plots in R
We can also make coefficient plots with interaction terms.
model3 <- lm(imdb_rating ~ kids*action, data = imdb)
plot_model(model3) +
  ylim(-0.5, 0.5) +
  labs(title = "Coefficient Plot for IMDB Rating") +
  geom_hline(yintercept = 0, linetype = "dashed")
Coefficient Plots in R
And these coefficients are also displayed nicely in the previous plot.
summary(model3)
Call:
lm(formula = imdb_rating ~ kids * action, data = imdb)
Residuals:
    Min      1Q  Median      3Q     Max
-0.4110 -0.2232 -0.0232  0.1768  1.2890
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.92320    0.01131 700.651  < 2e-16 ***
kids         0.08778    0.02075   4.231 2.54e-05 ***
action       0.05549    0.02716   2.043 0.041333 *
kids:action -0.18847    0.05029  -3.748 0.000189 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2728 on 996 degrees of freedom
Multiple R-squared: 0.02231, Adjusted R-squared: 0.01937
F-statistic: 7.578 on 3 and 996 DF, p-value: 5.161e-05
Close Lesson 6
Well, that’s all we have for you! In this lesson, you learned more about how to run a respectable regression, the assumptions underlying regression analyses, what interaction terms are, and how to perform more advanced visualizations using regression output.
We hope you have enjoyed your time in this intro series to statistical programming in R (and learned a bit along the way)!