Introduction to Descriptive Statistics

Jennifer Barnes and Alexander Tripp

Objectives

Learn how to use basic summary statistics
Understand the benefits of dichotomizing a variable
Know how to effectively write a descriptive report

Points of Note

You can use R scripts, R Markdown files, or Quarto files; the choice is up to preference!
Typos, misplaced spaces, misplaced parentheses, and weird indents can be your worst enemy.
For .dta data that includes value labels, take note of attr(data$variable, "labels"), which shows them.
Comment your code! Explain what you are doing, and what you learn about your data through it.
ChatGPT and the internet can help, but be discerning.
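
As a minimal sketch of the labels point above, the vector below is built by hand to mimic what a labelled column from an imported .dta file can look like. No Stata file is read here, and the variable name and label values are hypothetical.

```r
# Hypothetical labelled variable, constructed manually for illustration
party <- structure(
  c(1, 2, 1, 3),
  labels = c(Democrat = 1, Republican = 2, Independent = 3)
)

# Retrieve the named vector that maps each label to its numeric code
party_labels <- attr(party, "labels")
party_labels
```

Checking the labels first tells you what each numeric code stands for before you summarize the variable.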

Variable Types: Nominal, Ordinal, Interval, oh my!

First, we will walk through the types of variables and how they manifest in the real world.

Nominal Variables

Categories with no inherent order or ranking

For example: gender, race, eye color, political party

Ordinal Variables

Categories with an order

For example: socio-economic status, education level, income level

Interval Variables

Numbers where both the order and the difference between two values are meaningful

For example: temperature (Fahrenheit or Celsius), SAT score, credit score

Ratio Variables

An interval variable with a true zero. That is, when the variable equals zero, none of the quantity is present.

For example: medicine dose amount, weight, length, survival time

Measures of Central Tendency

There are different ways to describe the center of a dataset, including:

Mean: Average value. Add all values together and then divide by the sample size.

Median: Middle value of a dataset ordered from least to greatest. With an even number of values, it is the average of the two middle values.

Mode: Most frequently occurring value in a dataset.
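
Here is a small sketch of all three measures in R. The vector is made up for illustration, and stat_mode is just a name chosen here. Base R has no built-in function for the statistical mode, so one common workaround is to tabulate the values.

```r
# Hypothetical data, chosen only for illustration
x <- c(2, 3, 3, 5, 12)

mean(x)    # (2 + 3 + 3 + 5 + 12) / 5 = 5
median(x)  # middle value of the sorted data = 3

# Base R's mode() reports the storage type, not the statistical mode,
# so we count each value and keep the most frequent one(s).
counts <- table(x)
stat_mode <- names(counts)[counts == max(counts)]
stat_mode  # "3" (returned as a character label)
```

Notice that the single large value (12) pulls the mean above the median.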

Exercise: Calculate measures of central tendency!

For the following data, calculate the mean, median, and mode. What would be best for describing that data and why? How do they differ?

Group 1: (1, 1, 1, 2, 0, 11)

Group 2: (101, 35, 19, 11, 1000, 21)

Group 3: (1000, 1, 1000, 1, 1000, 1001)

Group 4: (174, 870, 674, 6, 616, 100)

Odds and Ends

What to do with missing data:

Usually, you can just ignore or drop it.
R requires an explicit option to ignore NAs (for example, na.rm = TRUE in mean()).
Be aware of the NAs in your dataset. If they are systematic, you have bigger issues with the dataset or your design.
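
A quick sketch of the points above, using a made-up vector with one missing value:

```r
# Hypothetical runtimes with one missing value
runtime <- c(90, 120, NA, 150)

mean(runtime)                # NA: any missing value makes the result NA
mean(runtime, na.rm = TRUE)  # 120: the NA is dropped before averaging

sum(is.na(runtime))          # 1: how many values are missing
```

Counting the NAs before dropping them is a cheap way to notice when missingness might be systematic.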

Measures of central tendency change based on variable types: For example, how would you summarize a nominal variable that looks like this: (Republican, Republican, Republican, Independent, Democrat, Independent)?
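
One possible way to approach the question above in R is to count the categories and report the most common one. The object names here are chosen only for this sketch.

```r
party <- c("Republican", "Republican", "Republican",
           "Independent", "Democrat", "Independent")

# Count how often each category appears
party_counts <- table(party)
party_counts

# The mode is the category with the largest count
party_mode <- names(party_counts)[which.max(party_counts)]
party_mode  # "Republican"
```

A mean or median would not make sense here, because the categories have no numeric value or order.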

Discussion and Questions

Some things to consider as you review these materials:

Why don’t we report the mean of a nominal variable?
Will the mean of a sample always be equal to an actual value that can appear in that sample?
Will the mean of a sample always equal the sample’s median?
Will the mean of a sample always equal the sample’s mode?
If all values of a sample double, will the mean double?
If two samples have the same mean, will they have the same range?
If two samples have the same median, will they have the same range?

Summarizing a Dataset

One important task for researchers is to quickly and effectively summarize datasets for lay audiences.

Common ways to do this:

Summary Statistics: Present the mean or median, alongside the range, and explain what they mean in real-world terms
Figures: Create a clean, clear, and quickly interpretable graph for your audience
Source and Abnormalities: Report NAs, outliers, the source of the data, and any other weirdness about the data

Practice Summarizing

The modal respondent believes that immigrants are coming to the US in order to escape crime or violence in their home country.

[Now, what might you add here?]

Respondents were asked “What is the primary reason that immigrants from Central America are coming to the US?” This data comes from a lab survey conducted on 220 Vanderbilt undergraduates in Spring 2023.

Dummy (Indicator) Variables

Dichotomizing or “dummying” a variable is reducing it to 1s and 0s, where 1s indicate the presence of something and 0s the absence. For example, one could dummy a variable measuring gender to be 1s if the respondent is female and 0s if the respondent is NOT female.

PROS: easier to interpret, and may better match the distinction that matters substantively

CONS: reduces variation, and can mask differences among the other categories of a variable
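
As a minimal sketch, here is one way to dummy a gender variable in base R. The responses are hypothetical and the variable names are chosen only for this example.

```r
# Hypothetical responses, for illustration only
gender <- c("female", "male", "nonbinary", "female", NA)

# 1 if the respondent is female, 0 otherwise; a missing response stays NA
female <- as.integer(gender == "female")
female  # 1 0 0 1 NA

# The mean of a dummy is the share of 1s among non-missing responses
mean(female, na.rm = TRUE)  # 0.5
```

Note how "male" and "nonbinary" both become 0: the dummy is easy to read, but it no longer distinguishes between those two categories.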

Let’s return to this figure:

How might we dummy this variable?
What would ‘dummying’ this variable help us with?
And how might it hurt our results?

More Summary Statistics

There are other summary statistics that serve more specific or foundational roles and aren’t as often reported to lay audiences.

Variance: Measures how spread out the data are, as the average squared distance of each value from the mean.

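Covariance: Measures how two variables in a dataset vary together. It is positive when they tend to move in the same direction and negative when they tend to move in opposite directions.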

Standard Deviation: The square root of the variance, which puts it back in the same units as the original data. This statistic is reported much more often than variance and covariance.
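
For reference, the usual sample versions of these three statistics can be written as follows. The notation is an assumption made here rather than taken from the original slides; the n - 1 denominator is the one R's var(), cov(), and sd() use.

```latex
s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
\qquad
\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
\qquad
s_x = \sqrt{s_x^2}
```

Setting y equal to x in the covariance gives back the variance, so the variance is the covariance of a variable with itself.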

Frequency Tables

Frequency tables present a list of values and how often each unique value appears.

This is a quick and easy way to present the distribution of your data to a lay audience.
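
A small sketch of a frequency table in base R, using made-up values. Adding proportions alongside the raw counts is optional, but often easier for a lay audience to read.

```r
# Hypothetical film certificates, for illustration only
certificate <- c("A", "U", "UA", "A", "A", "U")

freq <- table(certificate)
freq                              # counts: A = 3, U = 2, UA = 1

shares <- round(prop.table(freq), 2)
shares                            # shares: A = 0.50, U = 0.33, UA = 0.17
```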

Starting up in R

Let’s first run our setup chunk. Can you remember what each line in this code does?

rm(list = ls())

library(tidyverse)

getwd()

imdb <- read_csv("imdb.csv")

Describing Variables

To summarize the structure of and information found within variables, you can use the following functions. You will use these a lot.

table(imdb$genre_main)


   Action Adventure Animation Biography    Comedy     Crime     Drama    Family 
      172        72        82        88       155       107       289         2 
  Fantasy Film-Noir    Horror   Mystery  Thriller   Western 
        2         3        11        12         1         4

str(imdb$genre_main)

 chr [1:1000] "Drama" "Crime" "Action" "Crime" "Crime" "Action" "Crime" ...

If you want to identify all unique values in a variable, you can use the unique() function.

unique(imdb$genre_main)

 [1] "Drama"     "Crime"     "Action"    "Biography" "Western"   "Comedy"   
 [7] "Adventure" "Animation" "Horror"    "Mystery"   "Film-Noir" "Fantasy"  
[13] "Family"    "Thriller"

You can also use various functions to look at the summary statistics of your data.

mean(imdb$runtime)

[1] 122.891

median(imdb$runtime)

[1] 119

Can you explain why the median is different from the mean?

Why doesn’t the following code work?

mean(imdb$certificate)

[1] NA

…it’s because your variable needs to be numeric to calculate a measure of central tendency like the mean or median!

str(imdb$certificate)

 chr [1:1000] "A" "A" "UA" "A" "U" "U" "A" "A" "UA" "A" "U" "UA" "A" "UA" ...
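
One way to guard against this is to check a variable's type before summarizing it. The sketch below uses a short made-up character vector as a stand-in for a column like certificate.

```r
# Hypothetical stand-in for a character column
certificate <- c("A", "A", "UA", "U")

is.numeric(certificate)  # FALSE: a mean is not defined here
class(certificate)       # "character"

# mean() on non-numeric input returns NA and issues a warning;
# the warning is suppressed here only to keep the output tidy
result <- suppressWarnings(mean(certificate))
result                   # NA

# A frequency table is the appropriate summary instead
table(certificate)
```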

You can also calculate the mode of your variables, including for categorical variables. But in each case, it’s easiest to look at the distribution and just pick out the mode.

table(imdb$genre_main)


   Action Adventure Animation Biography    Comedy     Crime     Drama    Family 
      172        72        82        88       155       107       289         2 
  Fantasy Film-Noir    Horror   Mystery  Thriller   Western 
        2         3        11        12         1         4

imdb %>%
  group_by(genre_main) %>%
  summarize(count = n())

# A tibble: 14 × 2
   genre_main count
   <chr>      <int>
 1 Action       172
 2 Adventure     72
 3 Animation     82
 4 Biography     88
 5 Comedy       155
 6 Crime        107
 7 Drama        289
 8 Family         2
 9 Fantasy        2
10 Film-Noir      3
11 Horror        11
12 Mystery       12
13 Thriller       1
14 Western        4
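
If the table is long, sorting the counts can make the mode easier to spot. The sketch below uses dplyr's count() with sort = TRUE on a small made-up data frame, since the real imdb data is not reproduced here. The data frame and its values are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the imdb data
movies <- data.frame(
  genre_main = c("Drama", "Action", "Drama", "Comedy", "Drama", "Action")
)

# Count movies per genre, largest count first
genre_counts <- movies %>%
  count(genre_main, sort = TRUE)

genre_counts
genre_counts$genre_main[1]  # "Drama": the first row is the mode
```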

And here is a review of how to calculate some other important descriptive statistics.

var(imdb$runtime)

[1] 789.2544

cov(imdb$runtime, imdb$no_of_votes)

[1] 1593525

sd(imdb$no_of_votes)

[1] 327372.7
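
To see what these functions are doing, you can reproduce them by hand on a small example. The vectors below are made up, and the calculation assumes the sample (n - 1) versions, which is what var(), cov(), and sd() compute.

```r
# Hypothetical data, for illustration only
x <- c(2, 4, 4, 6)
y <- c(1, 3, 2, 6)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, over n - 1
var_by_hand <- sum((x - mean(x))^2) / (n - 1)

# Sample covariance: sum of the products of paired deviations, over n - 1
cov_by_hand <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

all.equal(var_by_hand, var(x))     # TRUE
all.equal(cov_by_hand, cov(x, y))  # TRUE
all.equal(sd(x), sqrt(var(x)))     # TRUE
```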

Conditional Descriptives

We can also condition our descriptive statistics on certain values or subsets of the data. For example, what is the mean runtime of movies in each main genre? The means column below will show us.

imdb %>%
  group_by(genre_main) %>%
  summarize(means = mean(runtime))

# A tibble: 14 × 2
   genre_main means
   <chr>      <dbl>
 1 Action     129. 
 2 Adventure  134. 
 3 Animation   99.6
 4 Biography  136. 
 5 Comedy     112. 
 6 Crime      126. 
 7 Drama      125. 
 8 Family     108. 
 9 Fantasy     85  
10 Film-Noir  104  
11 Horror     102. 
12 Mystery    119. 
13 Thriller   108  
14 Western    148.
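
The same pattern extends to several statistics at once. As a sketch on a small made-up data frame (a hypothetical stand-in for imdb), you could report the group size and median next to each mean:

```r
library(dplyr)

# Hypothetical stand-in for the imdb data
movies <- data.frame(
  genre_main = c("Drama", "Drama", "Action", "Action", "Action"),
  runtime    = c(120, 140, 100, 110, 150)
)

by_genre <- movies %>%
  group_by(genre_main) %>%
  summarize(
    n       = n(),             # number of movies in the group
    means   = mean(runtime),   # conditional mean
    medians = median(runtime)  # conditional median
  )

by_genre
```

Reporting n alongside the mean is worth the extra column: in the genre counts shown earlier, Thriller has only one movie and Family and Fantasy have two each, so those means rest on very little data.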

Close Lesson 2

And that’s all for our second lesson! Those are the basics of running descriptive statistics in R. In the next lesson, we will go into more depth on the statistical foundations of some of these statistics, motivated by some fun examples.
