First, we will walk through the types of variables and how they manifest in the real world.
Nominal Variables
Categories with no inherent order or ranking
For example: gender, race, eye color, political party
Ordinal Variables
Categories with an order
For example: socio-economic status, education level, income level
Interval Variables
Numbers where both the order of values and the difference between two values are meaningful
For example: temperature (Fahrenheit or Celsius), SAT score, credit score
Ratio Variables
Interval variable + a clear definition of zero. That is, when the variable equals zero, there is none of the quantity being measured.
For example: medicine dose amount, weight, length, survival time
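As a rough sketch of how these four types might be stored in R (the vectors below are made-up illustrations, not lesson data): nominal variables fit an unordered factor, ordinal variables fit an ordered factor, and interval and ratio variables are both plain numeric vectors, so the distinction between those two lives in how you interpret them rather than in the R type.

```r
# Hypothetical example values, invented for illustration only

# Nominal: categories with no order -> unordered factor
eye_color <- factor(c("brown", "blue", "green", "brown"))

# Ordinal: categories with an order -> ordered factor with explicit levels
education <- factor(c("HS", "BA", "MA", "BA"),
                    levels = c("HS", "BA", "MA"),
                    ordered = TRUE)

# Interval: differences are meaningful, but zero is arbitrary
temp_f <- c(32, 50, 68)

# Ratio: zero means "none of it", so ratios are meaningful too
weight_kg <- c(0, 55.2, 80)

is.ordered(eye_color)   # FALSE
is.ordered(education)   # TRUE
below_ma <- education < "MA"   # comparisons only work on ordered factors
```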
Measures of Central Tendency
There are different ways to describe the center of a dataset, including:
Mean: Average value. Add all values together and then divide by the sample size.
Median: Middle value of a dataset ordered from least to greatest.
Mode: Most frequently occurring value in a dataset.
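The three definitions above can be sketched in R as follows. The vector `x` is a made-up example, and `stat_mode()` is a hypothetical helper written for this sketch: base R has no built-in function for the statistical mode (its `mode()` function reports a storage type instead).

```r
# Hypothetical example data (not one of the exercise groups)
x <- c(2, 3, 3, 5, 12)

# Mean: add all values together, then divide by the sample size
mean_x <- sum(x) / length(x)   # same result as mean(x)

# Median: middle value after ordering from least to greatest
median_x <- median(x)

# Mode: most frequently occurring value(s).
# Hypothetical helper; returns every value tied for the highest count.
stat_mode <- function(v) {
  uniq <- unique(v)
  counts <- tabulate(match(v, uniq))
  uniq[counts == max(counts)]
}
mode_x <- stat_mode(x)
```

Here the single large value (12) pulls the mean above the median, while the mode simply reports the value that repeats.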
Exercise: Calculate measures of central tendency!
For each of the following groups, calculate the mean, median, and mode. Which measure best describes each group, and why? How do the three measures differ?
Group 1: (1, 1, 1, 2, 0, 11)
Group 2: (101, 35, 19, 11, 1000, 21)
Group 3: (1000, 1, 1000, 1, 1000, 1001)
Group 4: (174, 870, 674, 6, 616, 100)
Odds and Ends
What to do with missing data:
Usually, you can just ignore or drop it.
R requires an explicit option to ignore NAs: most summary functions, such as mean(), return NA unless you pass na.rm = TRUE.
Be aware of the NAs in your dataset. If they are systematic, you have bigger issues with the dataset or your design.
Measures of central tendency change based on variable types: For example, how would you summarize a nominal variable that looks like this: (Republican, Republican, Republican, Independent, Democrat, Independent)?
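A minimal sketch of both points, assuming a made-up `ages` vector for the missing-data case and using the party example from the text for the nominal case:

```r
# Hypothetical vector with one missing value
ages <- c(21, 19, NA, 22)

mean_default <- mean(ages)                # NA: missing values propagate by default
mean_dropped <- mean(ages, na.rm = TRUE)  # drops the NA before averaging
n_missing <- sum(is.na(ages))             # always check how many NAs there are

# The nominal variable from the text: a mean is undefined, so tabulate it
party <- c("Republican", "Republican", "Republican",
           "Independent", "Democrat", "Independent")
party_counts <- table(party)

# The modal category is the one with the largest count
modal_party <- names(party_counts)[which.max(party_counts)]
```

For a nominal variable like `party`, the counts and the mode are the natural summaries, since the categories cannot be averaged or ordered.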
Discussion and Questions
Some things to consider as you review these materials:
Why don’t we report the mean of a nominal variable?
Will the mean of a sample always be equal to an actual value that can appear in that sample?
Will the mean of a sample always equal the sample’s median?
Will the mean of a sample always equal the sample’s mode?
If all values of a sample double, will the mean double?
If two samples have the same mean, will they have the same range?
If two samples have the same median, will they have the same range?
Summarizing a Dataset
One important task for researchers is to quickly and effectively summarize datasets for lay audiences.
Common ways to do this:
Summary Statistics: Present the mean or median, alongside the range, and explain what it means in real-world terms
Figures: Create a clean, clear, and quickly interpretable graph for your audience
Source and Abnormalities: Report NAs, outliers, the source of the data, and any other weirdness about the data
Practice Summarizing
The modal respondent believes that immigrants are coming to the US in order to escape crime or violence in their home country.
[Now, what might you add here?]
Respondents were asked “What is the primary reason that immigrants from Central America are coming to the US?” This data comes from a lab survey conducted on 220 Vanderbilt undergraduates in Spring 2023.
Dummy (Indicator) Variables
Dichotomizing or “dummying” a variable is reducing it to 1s and 0s, where 1s indicate the presence of something and 0s the absence. For example, one could dummy a variable measuring gender to be 1s if the respondent is female and 0s if the respondent is NOT female.
PROS: easier interpretation, may be a better fit with substantive significance for the variable
CONS: reduce variation, can mask other categories of a variable
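One way to dummy a variable in R is to test each value against the category of interest and convert the result to integers. The `gender` vector here is invented for illustration, and the name `female` is just a choice for this sketch.

```r
# Hypothetical gender responses (not the lesson's data)
gender <- c("Female", "Male", "Nonbinary", "Female", NA)

# 1 if the respondent is female, 0 if NOT female; missing values stay NA
female <- as.integer(gender == "Female")

table(gender, useNA = "ifany")   # original categories
table(female, useNA = "ifany")   # collapsed to 0/1

# The mean of a dummy variable is the share of 1s among non-missing cases
share_female <- mean(female, na.rm = TRUE)
```

Comparing the two tables shows the trade-off listed above: the 0/1 version is easy to read, but "Male" and "Nonbinary" are no longer distinguishable.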
Let’s return to this figure:
How might we dummy this variable?
What would ‘dummying’ this variable help us with?
And how might it hurt our results?
More Summary Statistics
There are other summary statistics that serve more specific or foundational roles and aren’t as often reported to lay audiences.
Variance: Measures how spread out the data are, as the average squared distance of each value from the mean.
Covariance: Measures how two variables in a dataset vary together. It is positive when they tend to rise and fall together and negative when one tends to rise as the other falls.
Standard Deviation: This statistic is reported much more often than variance and covariance. Notice the similarities between the two formulas: the standard deviation is just the square root of the variance.
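To make the relationship concrete, here is a small sketch that computes all three statistics by hand and checks them against R's built-ins. The vectors `x` and `y` are made up. Note that R's `var()`, `cov()`, and `sd()` use the sample versions, which divide by n - 1 rather than n.

```r
# Hypothetical paired data
x <- c(2, 4, 4, 6, 9)
y <- c(1, 3, 2, 5, 9)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, over n - 1
var_x <- sum((x - mean(x))^2) / (n - 1)

# Sample covariance: sum of products of paired deviations, over n - 1
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Standard deviation: square root of the variance
sd_x <- sqrt(var_x)

# Each hand calculation should agree with the built-in function
all.equal(var_x, var(x))
all.equal(cov_xy, cov(x, y))
all.equal(sd_x, sd(x))
```

Because the standard deviation is in the same units as the variable itself, it is usually easier to explain than the variance, which is in squared units.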
Frequency Tables
Frequency tables present a list of values and how often each unique value appears.
This is a quick and easy way to present the distribution of your data to a lay audience.
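A minimal sketch of building a frequency table in base R, using invented survey responses. `table()` gives the counts, and `prop.table()` converts them to proportions, which are often easier for a lay audience to read.

```r
# Hypothetical survey responses
responses <- c("Agree", "Disagree", "Agree", "Neutral", "Agree", "Disagree")

# Counts of each unique value
freq <- table(responses)

# Sort so the most common response appears first
freq_sorted <- sort(freq, decreasing = TRUE)

# Proportions instead of raw counts, rounded for presentation
freq_share <- round(prop.table(freq), 2)

freq_sorted
freq_share
```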
Starting up in R
Let’s first run our setup chunk. Can you remember what each line in this code does?
You can also use various functions to look at the summary statistics of your data.
mean(imdb$runtime)
[1] 122.891
median(imdb$runtime)
[1] 119
Can you explain why the median is different from the mean?
Describing Variables
Why doesn’t the following code work?
mean(imdb$certificate)
[1] NA
…it’s because your variable needs to be numeric to calculate a measure of central tendency like the mean or median!
str(imdb$certificate)
chr [1:1000] "A" "A" "UA" "A" "U" "U" "A" "A" "UA" "A" "U" "UA" "A" "UA" ...
You can also calculate the mode of your variables, including for categorical variables. But in each case, it’s easiest to look at the distribution and just pick out the mode.
table(imdb$genre_main)
   Action Adventure Animation Biography    Comedy     Crime     Drama    Family
      172        72        82        88       155       107       289         2
  Fantasy Film-Noir    Horror   Mystery  Thriller   Western
        2         3        11        12         1         4
And here is a review on how to calculate some other important descriptive statistics.
var(imdb$runtime)
[1] 789.2544
cov(imdb$runtime, imdb$no_of_votes)
[1] 1593525
sd(imdb$no_of_votes)
[1] 327372.7
Conditional Descriptives
We can also condition our descriptive statistics on certain values or subsets of the data. For example, what is the mean runtime of movies in each main genre? The means column below will show us.
# A tibble: 14 × 2
   genre_main means
   <chr>      <dbl>
 1 Action     129.
 2 Adventure  134.
 3 Animation   99.6
 4 Biography  136.
 5 Comedy     112.
 6 Crime      126.
 7 Drama      125.
 8 Family     108.
 9 Fantasy     85
10 Film-Noir  104
11 Horror     102.
12 Mystery    119.
13 Thriller   108
14 Western    148.
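The code that produced this table is not shown here. The tibble output suggests a dplyr pipeline, but that is an assumption. As a sketch that runs without extra packages, the same kind of grouped mean can be computed in base R with `aggregate()`. The `movies` data frame below is a tiny made-up stand-in for `imdb`, reusing its `genre_main` and `runtime` column names.

```r
# Hypothetical stand-in for the imdb data (values invented)
movies <- data.frame(
  genre_main = c("Drama", "Drama", "Comedy", "Comedy", "Action"),
  runtime    = c(120, 130, 100, 110, 140)
)

# Mean runtime within each main genre
means_by_genre <- aggregate(runtime ~ genre_main, data = movies, FUN = mean)
names(means_by_genre)[2] <- "means"   # match the column name used above
means_by_genre

# A dplyr equivalent, assuming the dplyr package is installed:
# dplyr::summarise(dplyr::group_by(movies, genre_main), means = mean(runtime))
```

Groups with very few observations deserve caution: a group mean based on one or two movies says little about the genre as a whole.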
Close Lesson 2
And that’s all for our second lesson! Those are the basics of running descriptive statistics in R. In the next lesson, we will go more in depth on the statistical foundations of some of these statistics, motivated by some fun examples.