Introduction to Correlational Analysis

Jennifer Barnes and Alexander Tripp

Objectives

Know the warning signs for messy data
Understand when to use and how to interpret correlations and p-values
Be able to identify statistics of interest within t-tests and linear regressions
Present a linear regression and correctly interpret coefficients and significance levels

Running Each R File Independently

Each of your R files should…

Clear your environment (workspace) at the top
Load in your data
Perform any data cleaning or analyses
Save any work if so desired

You should be able to close R (and all of your R scripts), then open one and run it from start to finish with no errors.
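
A minimal sketch of a script that follows this pattern is shown below. The file names and the tiny data set are hypothetical and exist only so the sketch runs on its own; in a real script you would load your own file instead of creating one.

```r
# Start from a clean environment so the script never depends on leftover objects
rm(list = ls())

# Hypothetical file locations (a temporary folder is used so this sketch runs anywhere)
raw_path   = file.path(tempdir(), "raw_data.csv")
clean_path = file.path(tempdir(), "clean_data.csv")

# For this sketch only: write a tiny raw file so there is something to load
write.csv(data.frame(id = 1:3, score = c(10, NA, 30)), raw_path, row.names = FALSE)

# 1. Load in your data
raw = read.csv(raw_path)

# 2. Perform any data cleaning or analyses (here: drop rows with a missing score)
clean = raw[!is.na(raw$score), ]

# 3. Save any work if so desired
write.csv(clean, clean_path, row.names = FALSE)
```

Because everything the script needs is created or loaded inside the script, it should run the same way in a fresh R session.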

Data Cleaning

Is your data clean and ready to work with?

Make sure that you know the range of your variable and what each number within it represents
Check the variable type and make sure that it is compatible with your preferred method of analysis
Know your unit of analysis
Are you missing large amounts of data? If so, do you know why?
Are the variables in a useful format? Are there any capitalization or spelling errors in strings?
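
The checks above might look something like this in practice. The `survey` data frame and its problems are made up for illustration; the functions are all base R.

```r
# Hypothetical survey data with some typical problems built in
survey = data.frame(
  age    = c(34, 51, NA, 29, 999),                     # 999 is a placeholder code, not a real age
  income = c("52000", "61000", "48000", NA, "75000"),  # numbers stored as text
  party  = c("Democrat", "democrat", "Republican ", "REPUBLICAN", "Independent"),
  stringsAsFactors = FALSE
)

# Range: the 999 placeholder shows up as the maximum
range(survey$age, na.rm = TRUE)

# Variable types: income is character, so it cannot go into a regression as is
sapply(survey, class)

# Missing data: count the NAs in each column
colSums(is.na(survey))

# Fixes: recode the placeholder, convert the type, standardize the strings
survey$age[which(survey$age == 999)] = NA
survey$income = as.numeric(survey$income)
survey$party  = tolower(trimws(survey$party))

# Check the cleaned categories
table(survey$party, useNA = "ifany")
```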

Hypothesis Testing

Before you explore relationships within your data, you should have some expectations about what the relationship will look like. These expectations are hypotheses. Hypotheses determine which tests you use and which statistics matter most.

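𝐻_𝑛𝑢𝑙𝑙 or 𝐻_0 (read as “H naught”): the null hypothesis, which states that there is no relationship or effect
𝐻_𝑎𝑙𝑡 or 𝐻_𝑎: the alternative hypothesis, which states the relationship or effect that you expect to find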

Example 1

Research Question: Does the gender of political candidates affect their chances of winning elections?
𝐻_𝑛𝑢𝑙𝑙: The gender of political candidates has no effect on their chances of winning elections.
𝐻_𝑎𝑙𝑡: The gender of political candidates significantly affects their chances of winning elections.

Example 2

Research Question: Does the implementation of stricter gun control laws reduce crime rates?
𝐻_0: The implementation of stricter gun control laws has no effect on crime rates.
𝐻_𝑎: The implementation of stricter gun control laws significantly reduces crime rates.

Correlations

Remember covariance? We don’t either.

Nevertheless, correlation coefficients are just the covariance rescaled to range from -1 to 1!
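
To make the rescaling concrete, here is a small sketch that divides the covariance by the product of the two standard deviations, which bounds the result between -1 and 1. It uses `mtcars`, a data set that ships with R, purely as a stand-in.

```r
# mtcars ships with base R; it is used here only as a stand-in data set
x = mtcars$wt    # car weight
y = mtcars$mpg   # miles per gallon

# Covariance depends on the units of x and y, so it is hard to read on its own
cov(x, y)

# Dividing by both standard deviations removes the units
r_manual  = cov(x, y) / (sd(x) * sd(y))
r_builtin = cor(x, y)

# The two calculations agree
all.equal(r_manual, r_builtin)
```

In this example the correlation is negative: heavier cars tend to get fewer miles per gallon.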

Correlations

Correlations show you how variables change together. However, a correlation does not tell you about the direction of causation. X could cause Y, Y could cause X, or there could be a messier relationship in which they both have causal influence at different times.

First, let’s prepare our code.

library(tidyverse)

imdb = read_csv("imdb.csv")

Correlational Analysis

This week, we will teach you how to perform basic correlational analyses. In research, these are the backbone of most analytical work and are very useful descriptive tools. First, let’s calculate the correlation between two variables.

cor(imdb$runtime, imdb$no_of_votes)

[1] 0.1732638

This positive correlation means that the two variables tend to move in the same direction: as runtime increases, the number of votes also tends to increase, and vice versa. At roughly 0.17, though, the relationship is weak.

P-Values

The p-value helps us judge whether a relationship we observe could plausibly be due to chance alone. It tells you how likely it is to see your observed data (or something more extreme) given that the null hypothesis is true.

Ranges from 0 to 1.

0.05 is the conventional threshold that denotes “statistical significance.”
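
As a sketch of where a p-value comes from in practice, base R's `cor.test()` reports the correlation together with a p-value for the null hypothesis that the true correlation is zero. The example again uses the built-in `mtcars` data as a stand-in.

```r
# Test whether the correlation between weight and fuel economy differs from zero
result = cor.test(mtcars$wt, mtcars$mpg)

result$estimate   # the correlation coefficient
result$p.value    # probability of a correlation at least this extreme if the null were true

# Compare against the conventional 0.05 threshold
result$p.value < 0.05
```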

T-Tests

T-tests are used to test hypotheses about means. There are two types, and both are simple and intuitive.

One-sample (tests whether a single group's mean differs from a particular, predefined value)

Two-sample (tests whether the means of two groups differ from one another)

One-Sample T-Test

A one-sample t-test tells you whether the mean of one sample is statistically different from zero (or some other predefined value).

t.test(imdb$runtime)


    One Sample t-test

data:  imdb$runtime
t = 138.33, df = 999, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 121.1477 124.6343
sample estimates:
mean of x 
  122.891

As we can see here, the mean runtime of movies is about 123 minutes, and because the p-value is less than 0.05, we reject the null hypothesis that the true mean runtime is zero.
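
Testing against zero is rarely the interesting question. The `mu` argument of `t.test()` sets a different comparison value. The sketch below uses the built-in `mtcars` data as a stand-in and a hypothetical benchmark of 20 miles per gallon.

```r
# Sample mean of miles per gallon
mean(mtcars$mpg)

# Null hypothesis: the true mean is 20 (a hypothetical benchmark chosen for illustration)
res = t.test(mtcars$mpg, mu = 20)

res$statistic   # t statistic
res$p.value     # p-value for the test against mu = 20
res$conf.int    # 95 percent confidence interval for the mean
```

Here the sample mean sits very close to 20, so the p-value is large and we cannot reject the null hypothesis.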

Two-Sample T-Test

The two-sample t-test determines whether the means of two groups are significantly different from one another. Here, we test whether action movies and non-action movies differ in runtime. First, we create an indicator for action movies. Then, we compare the mean runtimes across our two groups. R runs Welch's version of the test by default, which does not assume that the two groups have equal variances.

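# Flag action movies: 1 if the main genre is "Action", 0 otherwise
imdb$action = ifelse(imdb$genre_main == "Action", 1, 0)

# Compare mean runtime for action movies (x) against all other movies (y)
t.test(imdb$runtime[imdb$action == 1], imdb$runtime[imdb$action == 0])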


    Welch Two Sample t-test

data:  imdb$runtime[imdb$action == 1] and imdb$runtime[imdb$action == 0]
t = 3.1249, df = 243.61, p-value = 0.001994
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  2.748122 12.120263
sample estimates:
mean of x mean of y 
 129.0465  121.6123

So, we find that action movies have a significantly longer mean runtime than non-action movies (about 129 versus 122 minutes, p = 0.002).
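
The same kind of comparison can also be written with R's formula interface, which avoids subsetting by hand. This sketch uses the built-in `mtcars` data as a stand-in, with transmission type (`am`, coded 0 or 1) as the grouping variable.

```r
# Subsetting approach: pass the two groups as separate vectors
res_subset = t.test(mtcars$mpg[mtcars$am == 0], mtcars$mpg[mtcars$am == 1])

# Formula approach: outcome ~ grouping variable
res_formula = t.test(mpg ~ am, data = mtcars)

# Both run the same Welch two-sample t-test
all.equal(unname(res_subset$statistic), unname(res_formula$statistic))
all.equal(res_subset$p.value, res_formula$p.value)
```

The formula version needs a grouping variable with exactly two levels.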

Linear Regression

Linear regressions are used to explore how one variable is associated with another. They estimate how the dependent variable changes, on average, as the independent variable changes. This is our first step toward assessing causal relationships. We most often use OLS (“ordinary least squares”) regressions.

Bivariate regression: a model that estimates the relationship between one x variable and y

Multivariate regression (often called multiple regression): a model that estimates the relationship between multiple x variables and y

Equations

We can see that the regression equation is pretty similar to the slope-intercept equation.
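
Written out with the symbols defined under “Parts of the Regression Equation,” and assuming the standard textbook forms, the equations are:

```latex
\begin{align*}
\text{Slope-intercept:} \quad & y = mx + b \\
\text{Bivariate regression:} \quad & y = \alpha + \beta x + \varepsilon \\
\text{Multivariate regression:} \quad & y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon
\end{align*}
```

The intercept α plays the role of b, the coefficient β plays the role of the slope m, and the error term ε is the only new piece.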

Parts of the Regression Equation

Intercept (α): The predicted value of y when all x variables are equal to zero. Can be meaningful, but often is not.
Regression coefficient (β): The expected change in y associated with a one-unit increase in your x variable. Interpretable and meaningful, if significant.
Error term (ε): The part of y that the x variables do not explain, i.e., the gap between each observed value of y and the value that the model predicts.
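
A small sketch can show where each of these pieces lives in a fitted model. It uses the built-in `mtcars` data as a stand-in. Strictly speaking, the residuals computed here are the model's estimates of the error term.

```r
# Fit a bivariate regression: miles per gallon on car weight
fit = lm(mpg ~ wt, data = mtcars)

# Intercept and regression coefficient
alpha = coef(fit)[["(Intercept)"]]
beta  = coef(fit)[["wt"]]

# Predictions come from plugging x into alpha + beta * x
predicted = alpha + beta * mtcars$wt

# Residuals are the gaps between observed and predicted y
errors = mtcars$mpg - predicted

# Both match what R stores in the model object
all.equal(predicted, unname(fitted(fit)))
all.equal(errors, unname(residuals(fit)))
```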

Interpreting Regressions

Here is a handy chart from Dr. Brenton Kenkel’s website on how to interpret your x and y variables after log-transforming them.
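
The usual textbook rules for log-transformed variables can be tried out with a short sketch. The comments state the standard approximate interpretations, which may be worded differently in the chart, and the built-in `mtcars` data is only a stand-in.

```r
# Four versions of the same regression, with and without logs
fit_levels = lm(mpg ~ wt, data = mtcars)
fit_logy   = lm(log(mpg) ~ wt, data = mtcars)
fit_logx   = lm(mpg ~ log(wt), data = mtcars)
fit_loglog = lm(log(mpg) ~ log(wt), data = mtcars)

# Neither logged: a one-unit increase in x goes with b units of y
b_levels = coef(fit_levels)[["wt"]]

# Only y logged: a one-unit increase in x goes with roughly a (100 * b) percent change in y
pct_logy = 100 * coef(fit_logy)[["wt"]]

# Only x logged: a 1 percent increase in x goes with roughly b / 100 units of y
per_pct_logx = coef(fit_logx)[["log(wt)"]] / 100

# Both logged: a 1 percent increase in x goes with roughly a b percent change in y
elasticity = coef(fit_loglog)[["log(wt)"]]

c(b_levels, pct_logy, per_pct_logx, elasticity)
```

The percent interpretations are approximations that work best when the coefficient is small.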

Linear Regression in R

To run a regression in R, first create and save the regression object. Then, run a summary of the regression object to see the output.

model1 = lm(imdb_rating ~ runtime, data = imdb)  
summary(model1)


Call:
lm(formula = imdb_rating ~ runtime, data = imdb)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.53550 -0.20486 -0.03168  0.15282  1.30515 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 7.6563477  0.0379558 201.718  < 2e-16 ***
runtime     0.0023838  0.0003011   7.917 6.44e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2674 on 998 degrees of freedom
Multiple R-squared:  0.0591,    Adjusted R-squared:  0.05815 
F-statistic: 62.68 on 1 and 998 DF,  p-value: 6.442e-15

A one-unit increase (in this case, a one-minute increase) in runtime is associated with a 0.002-point increase in the IMDB rating of a movie, and the estimate is statistically significant (p < 0.001). Interesting!
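
If you want to pull specific statistics out of a regression rather than read them off the printed summary, something like the following works. The sketch uses the built-in `mtcars` data as a stand-in for the movie data.

```r
fit = lm(mpg ~ wt, data = mtcars)

# The coefficient table from summary() is a matrix you can index by name
coef_table = coef(summary(fit))

coef_table["wt", "Estimate"]     # regression coefficient
coef_table["wt", "Std. Error"]   # standard error
coef_table["wt", "Pr(>|t|)"]     # p-value

# 95 percent confidence interval for the coefficient
confint(fit, "wt", level = 0.95)

# Share of the variance in y explained by the model
summary(fit)$r.squared
```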

Linear Regression in R

We can also add multiple explanatory variables and see whether the estimate for our predictor of interest changes once we account for them. Holding all else constant (here, gross earnings), we see that a one-minute increase in runtime is still associated with a 0.002-point increase in IMDB rating. Note that R drops the 169 movies with missing gross values before fitting the model.

model2 = lm(imdb_rating ~ runtime + as.numeric(gross), data = imdb)
summary(model2)


Call:
lm(formula = imdb_rating ~ runtime + as.numeric(gross), data = imdb)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.52565 -0.21483 -0.03582  0.15780  1.31556 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       7.632e+00  4.398e-02 173.514  < 2e-16 ***
runtime           2.451e-03  3.495e-04   7.012 4.87e-12 ***
as.numeric(gross) 1.617e-10  8.758e-11   1.847   0.0651 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2742 on 828 degrees of freedom
  (169 observations deleted due to missingness)
Multiple R-squared:  0.06474,   Adjusted R-squared:  0.06248 
F-statistic: 28.66 on 2 and 828 DF,  p-value: 9.243e-13
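
The “observations deleted due to missingness” line is worth checking whenever you add a variable, because the model is then fit on a smaller sample. A sketch of how to see this, using the built-in `mtcars` data with a few values deliberately set to missing (an assumption made only for illustration):

```r
# Copy the data and blank out three horsepower values
cars = mtcars
cars$hp[c(1, 5, 9)] = NA

fit_small = lm(mpg ~ wt, data = cars)        # uses every row
fit_big   = lm(mpg ~ wt + hp, data = cars)   # silently drops rows with missing hp

nobs(fit_small)   # 32
nobs(fit_big)     # 29

# To compare the two models fairly, refit the small model on the complete rows only
complete_rows  = cars[complete.cases(cars[, c("mpg", "wt", "hp")]), ]
fit_small_same = lm(mpg ~ wt, data = complete_rows)
nobs(fit_small_same)   # 29
```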

Linear Regression

Finally, we can also see how categorical independent variables are associated with our dependent variables. Holding runtime constant, the IMDB ratings of action movies do not differ significantly from those of non-action movies (p = 0.431).

model3 = lm(imdb_rating ~ runtime + factor(action), data = imdb)  
summary(model3)


Call:
lm(formula = imdb_rating ~ runtime + factor(action), data = imdb)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.54042 -0.20379 -0.03095  0.15126  1.30164 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      7.6564717  0.0379633 201.681  < 2e-16 ***
runtime          0.0024077  0.0003027   7.955 4.85e-15 ***
factor(action)1 -0.0177559  0.0225206  -0.788    0.431    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2674 on 997 degrees of freedom
Multiple R-squared:  0.05968,   Adjusted R-squared:  0.0578 
F-statistic: 31.64 on 2 and 997 DF,  p-value: 4.759e-14
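
To see what the coefficient on a 0/1 factor means, it helps to fit a model with only that variable. In the sketch below (built-in `mtcars` data as a stand-in, with `am` as the indicator), the intercept is the mean of the reference group and the coefficient is the difference in group means.

```r
# Regression with a single 0/1 categorical predictor
fit_group = lm(mpg ~ factor(am), data = mtcars)

# Group means computed directly
group_means = tapply(mtcars$mpg, mtcars$am, mean)
diff_means  = unname(group_means["1"] - group_means["0"])

# The intercept equals the mean of the reference group (am = 0)
all.equal(coef(fit_group)[["(Intercept)"]], unname(group_means["0"]))

# The factor coefficient equals the difference between the two group means
all.equal(coef(fit_group)[["factor(am)1"]], diff_means)
```

Once other variables are added to the model, the same coefficient becomes the difference between groups holding those variables constant.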

Close Lesson 5

Nice work! We just got through some of the more complicated content of our mini-series. Correlational (regression) analyses are incredibly important for scientific researchers and all sorts of statistics practitioners. In our next (and final!) lesson, we will go a bit further into correlational analysis and tie up some loose ends.
