Introduction to Correlational Analysis II

Jennifer Barnes and Alexander Tripp

Objectives

  • Understand regression modeling
  • Be able to visualize regression output
  • Have a working knowledge of interaction terms

More Correlational Work

Today we will be diving deeper into correlational analyses by talking about controls, interactions, and coefficient plots.

OLS Regression

Remember that OLS is a technique for estimating coefficients in a linear regression equation, describing the relationship between one or more independent variables and a dependent variable.

OLS fits a line of best fit that minimizes the squared error of the points from the line.
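With a single predictor, that line of best fit has a closed form: the slope is cov(x, y)/var(x) and the intercept is mean(y) - slope * mean(x). A minimal sketch with simulated data (not the movie data used below) confirms that lm() recovers exactly these values:

```r
# Minimal sketch with simulated data: the OLS slope is cov(x, y) / var(x)
# and the intercept is mean(y) - slope * mean(x).
set.seed(1)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)

slope     <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

fit <- lm(y ~ x)                 # lm() recovers the same coefficients
all.equal(unname(coef(fit)), c(intercept, slope))  # TRUE
```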

Prepare Code

Let’s run through an example of how you might use a regression to answer one of your research questions.

Step 1: DVs, IVs, and Research Question

Research Q: How does movie length affect audience perceptions of movies?

IV: Movie length in minutes

DV: Audience rating for any particular movie

We’re starting with a fresh R script. What do we do first?

Review: OLS Regressions in R

First, let’s prepare our code.

rm(list = ls())

library(tidyverse)

imdb = read_csv("imdb.csv")

Step 2: Create a Bivariate Regression and Interpret

Here is a regression. Can you practice interpreting this output?

model1 = lm(imdb_rating ~ runtime, data = imdb)
summary(model1)

Call:
lm(formula = imdb_rating ~ runtime, data = imdb)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.53550 -0.20486 -0.03168  0.15282  1.30515 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.6563477  0.0379558 201.718  < 2e-16 ***
runtime     0.0023838  0.0003011   7.917 6.44e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2674 on 998 degrees of freedom
Multiple R-squared:  0.0591,    Adjusted R-squared:  0.05815 
F-statistic: 62.68 on 1 and 998 DF,  p-value: 6.442e-15

Step 3: Include a Control Variable

Control variables improve your models by accounting for factors that may confound the relationship between your IV and DV. One might think that the relationship between runtime and imdb_rating can be explained by gross revenue.

What’s a story for why that might be?
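One way to build intuition here is with simulated data where we know the true relationships. A hedged sketch (invented variables, with z standing in for a confounder like gross revenue):

```r
# Hypothetical sketch: z drives both x and y. Omitting z biases the
# coefficient on x; controlling for z recovers the true effect.
set.seed(42)
z <- rnorm(1000)                       # confounder (think: gross revenue)
x <- z + rnorm(1000)                   # x is partly driven by z
y <- 0.5 * x + 2 * z + rnorm(1000)     # true effect of x is 0.5

coef(lm(y ~ x))["x"]        # biased well above 0.5
coef(lm(y ~ x + z))["x"]    # close to the true 0.5 once z is controlled
```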

Step 3: Include a Control Variable

Let’s check it out. First, we regress imdb_rating on runtime. Then, let’s add in a control for gross.

model2 = lm(imdb_rating ~ runtime + as.numeric(gross), data = imdb)
summary(model2)

Call:
lm(formula = imdb_rating ~ runtime + as.numeric(gross), data = imdb)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.52565 -0.21483 -0.03582  0.15780  1.31556 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       7.632e+00  4.398e-02 173.514  < 2e-16 ***
runtime           2.451e-03  3.495e-04   7.012 4.87e-12 ***
as.numeric(gross) 1.617e-10  8.758e-11   1.847   0.0651 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2742 on 828 degrees of freedom
  (169 observations deleted due to missingness)
Multiple R-squared:  0.06474,   Adjusted R-squared:  0.06248 
F-statistic: 28.66 on 2 and 828 DF,  p-value: 9.243e-13

It seems that gross revenue has a weak, only marginally significant, positive association with IMDB rating, while the effect of runtime remains strong.

Step 3: Include a Control Variable

Structurally, nothing here is different from having two independent variables. Theoretically, however, it is different. We expect that some of the variation in this relationship is explained by the gross revenue of a movie, so we want to identify that variation (control for it) so as to better isolate the relationship between runtime and imdb_rating.

Step 4: Include an Interaction

We employ interactions when we think that the effect of one independent variable on the dependent variable is conditioned on another independent variable.

Motivating Example for Interaction

Suppose you are interested in studying the relationship between social media usage and political engagement. You can include an interaction variable between social media usage (e.g., hours spent on social media platforms) and age group to explore if the effect of social media usage on political engagement varies across different age groups.

social_media_usage x age_group

When examining political participation (e.g., voting behavior, political activism), you might include an interaction variable between gender and education level to investigate whether the relationship between education and political participation differs by gender.

education x gender
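As a hedged sketch of the first example (all variable names invented for illustration), here is what fitting such an interaction looks like with simulated data, where the usage effect genuinely differs by age group:

```r
# Hypothetical sketch: the effect of social media usage on engagement
# is 0.2 for older respondents and 0.8 (0.2 + 0.6) for younger ones.
set.seed(7)
usage <- runif(500, 0, 5)
young <- rbinom(500, 1, 0.5)
engagement <- 0.2 * usage + 0.6 * usage * young + rnorm(500, sd = 0.5)

fit <- lm(engagement ~ usage * young)
coef(fit)   # usage near 0.2; usage:young near 0.6
```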

Step 4: Include an Interaction

Say we want to know how action movies are rated by critics, depending on whether the film is appropriate for kids or not.

First, let’s make some dummies for whether or not a movie is suitable for kids and whether it is an action movie.

imdb$kids = ifelse(imdb$certificate %in% 
              c("A", "Approved", "G", "GP", "Passed", "PG", "TV-PG"), 
              1, 0)
              
imdb$action = ifelse(imdb$genre_main == "Action", 1, 0)

Step 4 (If Necessary): Include an Interaction Term

Then, let’s run the regression with an interaction term. Note the asterisk: in R’s formula syntax, kids*action includes both constituent variables along with their product, the interaction term. What do we see here?

model3 = lm(imdb_rating ~ kids*action, data = imdb)

summary(model3) 

Call:
lm(formula = imdb_rating ~ kids * action, data = imdb)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.4110 -0.2232 -0.0232  0.1768  1.2890 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.92320    0.01131 700.651  < 2e-16 ***
kids         0.08778    0.02075   4.231 2.54e-05 ***
action       0.05549    0.02716   2.043 0.041333 *  
kids:action -0.18847    0.05029  -3.748 0.000189 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2728 on 996 degrees of freedom
Multiple R-squared:  0.02231,   Adjusted R-squared:  0.01937 
F-statistic: 7.578 on 3 and 996 DF,  p-value: 5.161e-05

Step 4 (If Necessary): Include an Interaction Term

The negative interaction term tells us that kids’ action movies are rated lower than the separate kids and action effects alone would predict. However, we can also (carefully) interpret the constituent terms of the interaction term.

The kids term tells us that, among NON-ACTION movies (action == 0), movies for kids are rated more highly than movies not rated for kids. The action term tells us that, among NON-KIDS rated movies (kids == 0), action movies are more highly rated than non-action movies.
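To make this concrete, the four group means implied by the coefficients above can be reconstructed by hand (values taken from the summary output):

```r
# Group means implied by model3's coefficients
b <- c(intercept = 7.92320, kids = 0.08778,
       action = 0.05549, interaction = -0.18847)

b[["intercept"]]                           # non-kids, non-action: 7.923
b[["intercept"]] + b[["kids"]]             # kids, non-action:     8.011
b[["intercept"]] + b[["action"]]           # non-kids, action:     7.979
sum(b)                                     # kids action:          7.878
```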

Regression Recap

Those are your basic steps for the decision-making process behind running a regression model. It is a lot to take in, but with time and practice, you will have no problem mastering it!

Just remember that you should build your regression from the inside out: first, identify your key IV and DV. Then account for any controls. Finally, if you have a theoretically-motivated reason, throw in an interaction term.

What Controls Should I Include?

Here we borrow again from Dr. Brenton Kenkel’s website, using a helpful table to guide our decision to include (or exclude) control variables.

Assumptions of OLS Models

Next, we walk you through the assumptions underlying OLS regression models and the variables used within them.

Assumptions of OLS Models

OLS models assume:

  • Linearity
  • Constant variance
  • Independence of residuals
  • Normality of residuals

Linearity

OLS models assume linear relationships between the IV and DV.

Constant Variance

OLS models assume that the variance of the residuals is constant (and does not get better/worse as values increase or decrease).
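A common visual check for this assumption is to plot residuals against fitted values; a funnel shape suggests non-constant variance. A sketch using simulated (homoskedastic) data, with your own fitted model substituted in practice:

```r
# Residuals-vs-fitted sketch on simulated data; residuals should
# scatter evenly around zero with no funnel shape.
set.seed(3)
x <- rnorm(200)
y <- 1 + 0.5 * x + rnorm(200)
fit <- lm(y ~ x)

plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)
```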

Independence of Residuals

OLS models assume that the residuals are independent of one another. In other words, there is no serial autocorrelation (where past residuals predict future residuals).

Normality of Residuals

OLS models assume that residuals are normally distributed.

Assessing Assumptions

We can assess the validity of these assumptions in our specific use cases with a variety of diagnostic tests.

Assessing Assumptions

QQ-Plots: Plots of the quantiles of one data set against the quantiles of another, i.e., the quantiles of our model’s residuals against those of a theoretical normal distribution

Checking Assumptions in R

Here, we provide a quick overview as to how you can check OLS model assumptions using quantile plots.

First, let’s conduct a gut check by visualizing our data. We can use this scatterplot to check for linearity and heteroskedasticity.

imdb %>% 
  ggplot(aes(runtime, imdb_rating)) + geom_point() + geom_smooth(method = "lm") + 
  labs(title = "Scatterplot with Regression Line", y = "IMDB Rating", x = "Runtime")

Checking Assumptions in R

We can also use a quantile-quantile plot to check whether a model’s residuals are normally distributed. Note that we plot the residuals against a theoretical normal distribution, not the raw variables against each other.

qqnorm(resid(model1))
qqline(resid(model1))

Advanced Visualization Techniques

In the final portion of our final lesson, we will run through how to prepare our regression output for papers and plot coefficients.

Exporting Regression Output

First, let’s run and save a regression model.

model1 = lm(imdb_rating ~ runtime, data = imdb)

Exporting Regression Output

Next, we will load in the stargazer package, which allows us to make nice regression tables. It is pretty easy to do. Just use the stargazer function with the option type = "text" and then manually name your independent variables in the covariate.labels option.

Then, you copy and paste this output into Word. You could also use the modelsummary package with output = "html", taking a screenshot of the results and pasting it into Word.

#install.packages("stargazer")
library(stargazer)

Exporting Regression Output

First let’s look at what happens without any additional editing or options.

stargazer(model1, type = "text")

===============================================
                        Dependent variable:    
                    ---------------------------
                            imdb_rating        
-----------------------------------------------
runtime                      0.002***          
                             (0.0003)          
                                               
Constant                     7.656***          
                              (0.038)          
                                               
-----------------------------------------------
Observations                   1,000           
R2                             0.059           
Adjusted R2                    0.058           
Residual Std. Error      0.267 (df = 998)      
F Statistic           62.682*** (df = 1; 998)  
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

Exporting Regression Output

Now, let’s pretty it up.

stargazer(model1, type = "text", covariate.labels = c("Runtime", "Intercept"), 
          omit.stat = c("f", "ser"), dep.var.labels = "IMDB Rating")

=========================================
                  Dependent variable:    
              ---------------------------
                      IMDB Rating        
-----------------------------------------
Runtime                0.002***          
                       (0.0003)          
                                         
Intercept              7.656***          
                        (0.038)          
                                         
-----------------------------------------
Observations             1,000           
R2                       0.059           
Adjusted R2              0.058           
=========================================
Note:         *p<0.1; **p<0.05; ***p<0.01

Note that covariate.labels renames the coefficients in order, with the intercept last, so the labels must match the terms actually in the model.

Exporting Regression Output

  • type = "text" specifies the type of output we want

  • covariate.labels = c(…) specifies how to rename the independent/control variables instead of using the default variable names

  • omit.stat = c(…) removes statistics we do not need from the table

  • dep.var.labels = "…" does the same as covariate.labels but for the dependent variable
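Stargazer can also write the table directly to a file instead of requiring copy-pasting. A hedged sketch (the filename is just an example):

```r
# Write an HTML version of the table to a file that can be opened
# or inserted into Word (assumes model1 from above is in memory).
stargazer(model1, type = "html", out = "regression_table.html",
          covariate.labels = c("Runtime", "Intercept"),
          dep.var.labels = "IMDB Rating")
```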

Coefficient Plots

Regression tables can be hard to plot because they have so much information. However, coefficient plots make it easy to succinctly visualize your key regression results. First, install and load the sjPlot package.

#install.packages("sjPlot")
library(sjPlot)

Coefficient Plots in R

This is a basic coefficient plot.

model2 = lm(imdb_rating ~ runtime + as.numeric(gross), data = imdb)

plot_model(model2) + ylim(-0.01, 0.01) + 
  labs(title = "Coefficient Plot for IMDB Rating")

Coefficient Plots in R

The previous plot nicely illustrates the point estimates and confidence intervals of both runtime and gross.

summary(model2)

Call:
lm(formula = imdb_rating ~ runtime + as.numeric(gross), data = imdb)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.52565 -0.21483 -0.03582  0.15780  1.31556 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       7.632e+00  4.398e-02 173.514  < 2e-16 ***
runtime           2.451e-03  3.495e-04   7.012 4.87e-12 ***
as.numeric(gross) 1.617e-10  8.758e-11   1.847   0.0651 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2742 on 828 degrees of freedom
  (169 observations deleted due to missingness)
Multiple R-squared:  0.06474,   Adjusted R-squared:  0.06248 
F-statistic: 28.66 on 2 and 828 DF,  p-value: 9.243e-13

Coefficient Plots in R

We can also make coefficient plots with interaction terms.

model3 = lm(imdb_rating ~ kids*action, data = imdb)

plot_model(model3) + ylim(-0.5, 0.5) + 
  labs(title = "Coefficient Plot for IMDB Rating") + 
  geom_hline(yintercept = 0, linetype = "dashed")

Coefficient Plots in R

And these coefficients are also displayed nicely in the previous plot.

summary(model3)

Call:
lm(formula = imdb_rating ~ kids * action, data = imdb)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.4110 -0.2232 -0.0232  0.1768  1.2890 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.92320    0.01131 700.651  < 2e-16 ***
kids         0.08778    0.02075   4.231 2.54e-05 ***
action       0.05549    0.02716   2.043 0.041333 *  
kids:action -0.18847    0.05029  -3.748 0.000189 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2728 on 996 degrees of freedom
Multiple R-squared:  0.02231,   Adjusted R-squared:  0.01937 
F-statistic: 7.578 on 3 and 996 DF,  p-value: 5.161e-05

Close Lesson 6

Well, that’s all we have for you! In this lesson, you learned more about how to run a respectable regression, the assumptions underlying regression analyses, what interaction terms are, and how to perform more advanced visualizations using regression output.

We hope you have enjoyed your time in this intro series to statistical programming in R (and learned a bit along the way)!
