Each of your R files should…
Clear your workspace (environment) at the top
Load in your data
Perform any data cleaning or analyses
Save any work if so desired
You should be able to close R (and all of your R scripts), then open one and run it from start to finish with no errors.
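As a rough sketch, a script that follows this structure might look like the one below. The data, column names (`runtime`, `genre`), and file paths are made up for illustration, and the toy CSV is written to a temporary file only so that the example runs on its own.

```r
# 1. Clear the environment so the script does not rely on leftover objects
rm(list = ls())

# (Illustration only) write a small toy CSV so this sketch is self-contained
raw_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(runtime = c(120, 95, NA, 142),
             genre   = c("Action", "drama ", "Drama", "action")),
  raw_path, row.names = FALSE
)

# 2. Load in your data
movies <- read.csv(raw_path, stringsAsFactors = FALSE)

# 3. Perform any data cleaning or analyses
movies$genre <- tolower(trimws(movies$genre))  # fix capitalization and stray spaces
movies <- movies[!is.na(movies$runtime), ]     # drop rows with a missing runtime
mean_runtime <- mean(movies$runtime)

# 4. Save any work if so desired
clean_path <- tempfile(fileext = ".csv")
write.csv(movies, clean_path, row.names = FALSE)
```

In a real project you would replace the temporary paths with your own file locations.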
Is your data clean and ready to work with?
Make sure that you know the range of your variable and what each number within it represents
Check the variable type and make sure that it is compatible with your preferred method of analysis
Know your unit of analysis
Are you missing large amounts of data? If so, do you know why?
Are the variables in a useful format? Are there any capitalization or spelling errors in strings?
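The checks above can be sketched with a handful of base R functions. The `survey` data frame and its columns are hypothetical stand-ins for your own data.

```r
# Hypothetical toy data standing in for your own data set
survey <- data.frame(
  age    = c(34, 51, NA, 27, 45),
  party  = c("Democrat", "democrat", "Republican", "Independent ", NA),
  income = c("52000", "61000", "48000", "75000", "39000"),
  stringsAsFactors = FALSE
)

range(survey$age, na.rm = TRUE)  # smallest and largest observed values
class(survey$income)             # stored as text, so convert before analysis
survey$income <- as.numeric(survey$income)

nrow(survey)                     # one row per unit of analysis
colMeans(is.na(survey))          # share of missing values in each column

# Standardize strings so "Democrat" and "democrat" count as one category
survey$party <- tolower(trimws(survey$party))
table(survey$party, useNA = "ifany")
```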
Before you explore relationships within your data, you should have some expectations about what the relationship will look like. These expectations are hypotheses. Hypotheses motivate the tests that you use and the importance of different statistics.
Research Question: Does the gender of political candidates affect their chances of winning elections?
𝐻_𝑛𝑢𝑙𝑙: The gender of political candidates has no effect on their chances of winning elections.
𝐻_𝑎𝑙𝑡: The gender of political candidates significantly affects their chances of winning elections.
Research Question: Does the implementation of stricter gun control laws reduce crime rates?
𝐻_0: The implementation of stricter gun control laws has no effect on crime rates.
𝐻_𝑎: The implementation of stricter gun control laws significantly reduces crime rates.
Remember covariance? We don’t either.
Nevertheless, a correlation coefficient is just the covariance divided by the standard deviations of both variables, which rescales it to run from -1 to 1!
Correlations show you how variables change together. However, this does not tell you about the direction of causation. X could cause Y, Y could cause X, or there could be a messier relationship in which they both have causal influence at different times.
First, let’s prepare our code.
This week, we will teach you how to perform basic correlational analyses. In research, these are the backbone of most analytical work and are very useful descriptive tools. First, let’s calculate the correlation between two variables.
[1] 0.1732638
This positive correlation means that the two variables move in the same direction: as runtime increases, the number of votes also tends to increase, and vice versa. At roughly 0.17, though, the relationship is weak.
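A minimal sketch of a correlation calculation with `cor()`, using simulated data rather than the lesson's data set (the variable names and numbers are illustrative assumptions). It also checks by hand that the coefficient is the covariance divided by both standard deviations.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated stand-in data; names and values are illustrative only
n       <- 200
runtime <- rnorm(n, mean = 120, sd = 25)
votes   <- 1000 * runtime + rnorm(n, mean = 0, sd = 50000)

# Pearson correlation coefficient (the default method)
r <- cor(runtime, votes)
r

# The same number by hand: covariance divided by both standard deviations
r_by_hand <- cov(runtime, votes) / (sd(runtime) * sd(votes))
all.equal(r, r_by_hand)

# Flipping the sign of one variable flips the sign of the correlation
cor(runtime, -votes)

# With missing values, tell cor() to use only the complete pairs
runtime_na <- runtime
runtime_na[1:5] <- NA
cor(runtime_na, votes, use = "complete.obs")
```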
The p-value helps us judge whether an observed relationship could plausibly have arisen by chance alone. It tells you how likely it is to see your observed data (or something more extreme) given that the null hypothesis is true.
Ranges from 0 to 1.
0.05 is the conventional threshold that denotes “statistical significance.”
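One way to get a p-value for a correlation in base R is `cor.test()`. The sketch below uses simulated data, so the specific numbers it prints are not results from the lesson.

```r
set.seed(1)

# Simulated stand-in data: y depends weakly on x
x <- rnorm(100)
y <- 0.3 * x + rnorm(100)

ct <- cor.test(x, y)  # tests the null hypothesis that the true correlation is zero
ct$estimate           # the sample correlation coefficient
ct$p.value            # chance of data at least this extreme if the null is true

# Compare the p-value with the conventional 0.05 threshold
ct$p.value < 0.05
```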
T-tests are used to determine differences in means for two groups. There are two types, and they are simple and intuitive.
One-sample (single group differs from a particular, predefined value)
Two-sample (two groups differ from one another)
A one-sample t-test tells you whether the mean of one sample is statistically different from zero (or some other predefined value).
One Sample t-test
data: imdb$runtime
t = 138.33, df = 999, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
121.1477 124.6343
sample estimates:
mean of x
122.891
As we can see here, the mean runtime of movies is about 123 minutes. Because the p-value is less than 0.05, we reject the null hypothesis that the true mean runtime is zero.
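A sketch of the one-sample test with `t.test()`, again on simulated data. The 120-minute benchmark is an arbitrary illustration of testing against a value other than zero.

```r
set.seed(7)

# Simulated stand-in for a runtime variable (in minutes)
runtime <- rnorm(500, mean = 123, sd = 28)

# Test against zero (the default, mu = 0)
t.test(runtime)

# Test against a more meaningful benchmark, such as a two-hour movie
res <- t.test(runtime, mu = 120)
res$estimate  # sample mean
res$conf.int  # 95 percent confidence interval for the true mean
res$p.value
```

Testing against zero is rarely informative for a variable like runtime, which cannot be zero or negative, so choosing a substantively meaningful `mu` usually makes for a more useful test.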
The two sample t-test determines if the means of two samples are significantly different from one another. Here, we compare whether the runtimes of movies that are or are not action movies differ. First, we create an indicator for action movies. Then, we compare the mean runtimes across our two groups.
imdb$action = ifelse(imdb$genre_main == "Action", 1, 0)
t.test(imdb$runtime[imdb$action == 1], imdb$runtime[imdb$action == 0])
Welch Two Sample t-test
data: imdb$runtime[imdb$action == 1] and imdb$runtime[imdb$action == 0]
t = 3.1249, df = 243.61, p-value = 0.001994
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
2.748122 12.120263
sample estimates:
mean of x mean of y
129.0465 121.6123
So, we find that action movies have a significantly longer mean runtime than non-action movies (about 129 versus 122 minutes, p = 0.002).
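The same comparison can also be written with the formula interface of `t.test()`. The sketch below uses simulated data (not the lesson's data) to show that both forms give the same test, with one caveat about group order.

```r
set.seed(3)

# Simulated stand-in data with a 0/1 group indicator
movies <- data.frame(
  action  = rep(c(1, 0), times = c(80, 220)),
  runtime = c(rnorm(80, mean = 129, sd = 25), rnorm(220, mean = 122, sd = 28))
)

# Subsetting form: x = action movies, y = all other movies
by_subset <- t.test(movies$runtime[movies$action == 1],
                    movies$runtime[movies$action == 0])

# Formula form: groups are ordered by value (0 first), so the sign of t flips
by_formula <- t.test(runtime ~ action, data = movies)

by_subset$statistic
by_formula$statistic
```

Both calls run Welch's test by default, which does not assume equal variances in the two groups.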
Linear regressions are used to explore how one variable is associated with another. They test how the dependent variable responds to changes in the independent variable. This is our first step toward assessing causal relationships. We most often use OLS (“ordinary least squares”) regressions.
Bivariate regression: a model that predicts the effect of one x variable on y
Multivariate regression (more precisely called multiple regression): a model that predicts the effect of multiple x variables on y
We can see that the regression equation is pretty similar to the slope-intercept equation.
Intercept (α): The value of y when all x variables are equal to zero. Can be meaningful, but often is not.
Regression coefficient (β): The effect of a one-unit shift in your x variable on your y variable. Interpretable and meaningful, if significant.
Error term (ε): The part of y that the x variables in the model do not explain, that is, the gap between the model's prediction and the observed value.
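For reference, here are the standard textbook forms written with the symbols defined above, next to the slope-intercept equation. The subscript i (indexing observations) and k (the number of x variables) are notational assumptions.

```latex
\begin{align*}
\text{Slope-intercept:} \quad & y = mx + b \\
\text{Bivariate regression:} \quad & y_i = \alpha + \beta x_i + \varepsilon_i \\
\text{Multiple regression:} \quad & y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i
\end{align*}
```

The intercept α plays the role of b, and the coefficient β plays the role of the slope m.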
Here is a handy chart from Dr. Brenton Kenkel’s website on how to interpret your x and y variables after log-transforming them.
To run a regression in R, first create and save the regression object. Then, run a summary of the regression object to see the output.
Call:
lm(formula = imdb_rating ~ runtime, data = imdb)
Residuals:
Min 1Q Median 3Q Max
-0.53550 -0.20486 -0.03168 0.15282 1.30515
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.6563477 0.0379558 201.718 < 2e-16 ***
runtime 0.0023838 0.0003011 7.917 6.44e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2674 on 998 degrees of freedom
Multiple R-squared: 0.0591, Adjusted R-squared: 0.05815
F-statistic: 62.68 on 1 and 998 DF, p-value: 6.442e-15
A one-unit increase (in this case, a one-minute increase) in runtime is associated with a 0.002-point increase in the IMDB rating of a movie. Interesting!
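The two-step workflow (save the regression object, then summarize it) can be sketched as follows. The data are simulated with a true slope of 0.002 purely for illustration, and the column names are assumptions.

```r
set.seed(10)

# Simulated stand-in data; the true slope is set to 0.002 for illustration
movies <- data.frame(runtime = rnorm(300, mean = 120, sd = 25))
movies$rating <- 7.6 + 0.002 * movies$runtime + rnorm(300, sd = 0.27)

# Step 1: create and save the regression object
fit <- lm(rating ~ runtime, data = movies)

# Step 2: summarize it to see coefficients, standard errors, and p-values
summary(fit)

# Pull out individual pieces when you need them
coef(fit)["runtime"]       # estimated slope (beta)
confint(fit)["runtime", ]  # 95 percent confidence interval for the slope
summary(fit)$r.squared     # share of the variance in y explained by the model
```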
We can also add multiple explanatory variables and see whether the estimate for our predictor of interest changes once we control for them. Holding all else constant (here, the gross variable), a one-minute increase in runtime is still associated with a 0.002-point increase in IMDB rating.
Call:
lm(formula = imdb_rating ~ runtime + as.numeric(gross), data = imdb)
Residuals:
Min 1Q Median 3Q Max
-0.52565 -0.21483 -0.03582 0.15780 1.31556
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.632e+00 4.398e-02 173.514 < 2e-16 ***
runtime 2.451e-03 3.495e-04 7.012 4.87e-12 ***
as.numeric(gross) 1.617e-10 8.758e-11 1.847 0.0651 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2742 on 828 degrees of freedom
(169 observations deleted due to missingness)
Multiple R-squared: 0.06474, Adjusted R-squared: 0.06248
F-statistic: 28.66 on 2 and 828 DF, p-value: 9.243e-13
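A sketch of a regression with two predictors, where one is stored as text and has missing values, as the `as.numeric(gross)` call and the "observations deleted due to missingness" line suggest. The data below are simulated and the column names are assumptions.

```r
set.seed(11)

# Simulated stand-in data; gross is stored as text, with some values missing
n <- 300
movies <- data.frame(runtime = rnorm(n, mean = 120, sd = 25))
gross_num <- round(runif(n, min = 1e6, max = 3e8))
movies$rating <- 7.6 + 0.002 * movies$runtime + 2e-10 * gross_num +
  rnorm(n, sd = 0.27)
movies$gross <- as.character(gross_num)
movies$gross[sample(n, 50)] <- NA

# Converting once, up front, keeps the model formula easy to read
movies$gross_num <- as.numeric(movies$gross)

fit2 <- lm(rating ~ runtime + gross_num, data = movies)
summary(fit2)

# lm() drops rows with missing values by default; check how many were used
nobs(fit2)                    # rows used in the model
sum(!complete.cases(movies))  # rows with at least one missing value
```

Checking `nobs()` is worthwhile because a model fit on a much smaller subset of the data may not be comparable to one fit on all rows.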
Finally, we can also see how categorical independent variables are associated with our dependent variable. Holding runtime constant, action movies do not have significantly different IMDB ratings from non-action movies (p = 0.431).
Call:
lm(formula = imdb_rating ~ runtime + factor(action), data = imdb)
Residuals:
Min 1Q Median 3Q Max
-0.54042 -0.20379 -0.03095 0.15126 1.30164
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.6564717 0.0379633 201.681 < 2e-16 ***
runtime 0.0024077 0.0003027 7.955 4.85e-15 ***
factor(action)1 -0.0177559 0.0225206 -0.788 0.431
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2674 on 997 degrees of freedom
Multiple R-squared: 0.05968, Adjusted R-squared: 0.0578
F-statistic: 31.64 on 2 and 997 DF, p-value: 4.759e-14
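With more than two categories, `factor()` creates one coefficient per category, each measured relative to an omitted reference category. The sketch below uses simulated data with a hypothetical three-level `genre` variable.

```r
set.seed(12)

# Simulated stand-in data with a three-category genre variable
n <- 300
movies <- data.frame(
  runtime = rnorm(n, mean = 120, sd = 25),
  genre   = sample(c("Action", "Comedy", "Drama"), n, replace = TRUE)
)
movies$rating <- 7.6 + 0.002 * movies$runtime + rnorm(n, sd = 0.27)

# factor() turns the text labels into categories; the first level
# (alphabetically, "Action") becomes the omitted reference category
fit3 <- lm(rating ~ runtime + factor(genre), data = movies)
coef(fit3)

# Use relevel() to compare against "Drama" instead
movies$genre_f <- relevel(factor(movies$genre), ref = "Drama")
fit4 <- lm(rating ~ runtime + genre_f, data = movies)
coef(fit4)
```

Changing the reference category changes which comparisons the coefficients report, but not the model's predictions.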
Nice work! We just got through some of the more complicated content of our mini-series. Correlational (regression) analyses are incredibly important for scientific researchers and all sorts of statistics practitioners. In our next (and final!) lesson, we will go a bit further into correlational analysis and tie up some loose ends.