Introduction to Descriptive Statistics II

Jennifer Barnes and Alexander Tripp

Objectives

  • Understand the basics of the Central Limit Theorem
  • Know several of the most common types of distributions
  • Be able to identify outliers and skew, and know how to correct for each

Populations and Samples

Definitions

Population: The entire universe of observations about which you want to draw conclusions

Sample: The specific group from which you can collect data and test hypotheses
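
To make the distinction concrete, here is a minimal sketch in R. The population is simulated with made-up parameters, and the object names (population, my_sample) are hypothetical, chosen only for illustration.

```r
set.seed(1221) # fixes the random draws so the results repeat

# Hypothetical population: 100,000 simulated heights in cm (made-up parameters)
population = rnorm(100000, mean = 170, sd = 10)

# A sample: 500 observations drawn at random, without replacement
my_sample = sample(population, size = 500)

mean(population) # the population mean, usually unknown in practice
mean(my_sample)  # the sample mean, our estimate of it
```

In real research we almost never observe the whole population, so the sample mean is what we work with.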

Practice Understanding

What is the population for the following samples?

  • War-related deaths in Ukraine

  • Divorce rate in Nashville

Practice Understanding

What would you need to consider when taking a sample of the following populations?

  • Voting age adults in the US

  • US Congresspeople

Practice Understanding

What is the population and sample for the following research questions?

  • What is the effect of eating potato chips on blood pressure?

  • What is the relationship between time spent on homework and grades?

  • Why do some wars last longer than others?

Data Distributions

Reviewing Common Distributions

We are now going to walk through a series of statistical data distributions. These are important to know for 1) basic data literacy and 2) transforming your data to better fit into a linear regression model (which we will talk about later).

Normal Distribution

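Everyone’s favorite distribution! It is symmetric and bell-shaped, it is fully described by its mean and standard deviation, and about 95% of its values fall within two standard deviations of the mean.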

T Distribution

This is very similar to a normal distribution, but it has fatter tails, which produce wider confidence intervals. As an old statistics professor taught me, imagine Mr. T is punching the normal distribution downward. As the degrees of freedom (number of observations minus number of coefficients in the model) increase, it begins to resemble a normal distribution.
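
One way to see the fatter tails is to compare the cutoffs used for a two-sided 95% confidence interval. This is a small sketch using base R's qnorm() and qt(); the object names are hypothetical.

```r
# 97.5th percentile: the cutoff behind a two-sided 95% confidence interval
z_cut     = qnorm(0.975)         # normal distribution, about 1.96
t_cut_10  = qt(0.975, df = 10)   # t with 10 degrees of freedom, about 2.23
t_cut_100 = qt(0.975, df = 100)  # t with 100 degrees of freedom, about 1.98

c(normal = z_cut, t_df10 = t_cut_10, t_df100 = t_cut_100)
```

The t cutoff shrinks toward the normal cutoff as the degrees of freedom grow.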

Uniform Distribution

In this distribution, every value within the range has the same probability of occurring.

Exponential Distribution

In this distribution, values become exponentially (rather than linearly) less likely as they get larger. Most observations sit near zero, with a long tail to the right. It is often used to model waiting times between events.

Log-Normal Distribution

A variable follows this distribution when its logarithm is normally distributed. It is strongly right-skewed, so taking the log of such a variable is a common way to normalize high skew.
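
As a rough illustration of how logging pulls in the skew, the sketch below compares the mean and median before and after a log transform. The object names (x, log_x) are hypothetical.

```r
set.seed(1221)
x = rlnorm(1000)   # right-skewed, log-normal draws
log_x = log(x)     # the logged values are normally distributed

# In a right-skewed variable, the mean sits above the median
mean(x) > median(x)

# After logging, the mean and median are close together
abs(mean(log_x) - median(log_x))
```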

How do these distributions occur in the wild?

In addition to height, there are many natural examples of normal distributions:

  • Sums of many dice rolls

  • Blood pressure

  • Lengths of teeth shed by sharks over time
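
Dice make a handy illustration: a single roll is uniform, but the sum of several dice is approximately normal. Below is a small sketch; the choice of 10 dice and the object names are assumptions made for illustration.

```r
set.seed(1221)

# 5,000 rolls of a single die
one_die = sample(1:6, size = 5000, replace = TRUE)

n_dice = 10 # hypothetical choice: number of dice added together per roll
totals = replicate(5000, sum(sample(1:6, size = n_dice, replace = TRUE)))

barplot(table(one_die)) # flat: each face is about equally common
hist(totals)            # bell-shaped, centered near 35
```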

Data Distributions in R

Preparing our Code

First, let’s set up our code.

rm(list = ls())

library(tidyverse)

imdb = read_csv("imdb.csv")

Ensuring Replicability

Here, we are going to randomly generate five types of data distributions. First, to ensure replicability, we need to set a seed (which makes sure that the randomization is the same each time). This is a very important concept for more advanced statistical analyses, such as Monte Carlo simulations.

set.seed(1221)
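
To see what the seed does, here is a quick sketch (the object names are hypothetical): resetting the same seed reproduces the same "random" draws.

```r
set.seed(1221)
first_draw = rnorm(3)

set.seed(1221)         # resetting the same seed...
second_draw = rnorm(3) # ...gives the same draws again

identical(first_draw, second_draw) # TRUE

set.seed(1221) # reset once more so the code that follows starts fresh
```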

Preparing our Data

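Now, let’s generate the distributions and save them into objects. By default, rnorm() uses a mean of 0 and a standard deviation of 1 (rlnorm() uses the same values on the log scale), runif() draws values between 0 and 1, and rexp() uses a rate of 1.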

d1 = rnorm(1000) #Normal distribution

d2 = runif(1000) #Uniform distribution

d3 = rexp(1000) #Exponential distribution

d4 = rlnorm(1000) #Log-normal distribution

d5a = rt(1000, 10) #T distribution

d5b = rt(1000, 100) #T distribution with more degrees of freedom
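
The calls above rely on each function's default arguments. For reference, here is a sketch with those defaults written out explicitly; the _explicit object names are hypothetical and are used only to avoid overwriting the objects above.

```r
set.seed(1221)
d1_explicit = rnorm(1000, mean = 0, sd = 1)        # normal
d2_explicit = runif(1000, min = 0, max = 1)        # uniform on [0, 1]
d3_explicit = rexp(1000, rate = 1)                 # exponential, mean = 1/rate
d4_explicit = rlnorm(1000, meanlog = 0, sdlog = 1) # log-normal (log scale)

# Quick check of where each distribution is centered
sapply(list(normal = d1_explicit, uniform = d2_explicit,
            exponential = d3_explicit, lognormal = d4_explicit), mean)
```

Changing any of these arguments shifts or stretches the distribution, for example rnorm(1000, mean = 50, sd = 5).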

Visualizing Distributions: Normal

Now, let’s visualize them!

hist(d1) #Normal distribution

Visualizing Distributions: Uniform

hist(d2) #Uniform distribution

Visualizing Distributions: Exponential

hist(d3) #Exponential distribution

Visualizing Distributions: Log-Normal

hist(d4) #Log-normal distribution

Visualizing Distributions: T (v1)

hist(d5a) #T distribution...notice the longer tails

Visualizing Distributions: T (v2)

hist(d5b) #T distribution with more degrees of freedom...looks normal!

Law of Large Numbers and Central Limit Theorem

Introducing the Concepts

Today, we will be focusing on two of the foundations for statistics: the Law of Large Numbers and the Central Limit Theorem.

Law of Large Numbers

The more observations we have in a dataset, the more closely its summary statistics (such as the mean) will converge to the “true” population values.

Think about flipping a fair coin: it’s not crazy to imagine that you get 3-5 heads in a row. It would be crazy, however, to get 10,000 heads in a row.
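
The coin example can be simulated directly. This is a minimal sketch, assuming a fair coin coded as 1 for heads and 0 for tails; the object names are hypothetical.

```r
set.seed(1221)
flips = rbinom(10000, size = 1, prob = 0.5) # 1 = heads, 0 = tails

# Proportion of heads after each successive flip
running_prop = cumsum(flips) / seq_along(flips)

running_prop[c(10, 100, 1000, 10000)] # settles toward 0.5 as flips accumulate

plot(running_prop, type = "l", ylim = c(0, 1),
     xlab = "Number of flips", ylab = "Proportion of heads")
abline(h = 0.5, lty = 2) # the "true" value
```

Early on the proportion bounces around, but it steadies near 0.5 as the number of flips grows.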

Central Limit Theorem

The sampling distribution of the sample mean approaches a normal distribution as the sample size grows, regardless of the shape of the population distribution or the data generating process (as long as the population variance is finite).

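This allows us to use the normal distribution to quantify our uncertainty about the “true” population mean (for example, with confidence intervals), even when the population itself is not normally distributed.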
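
A short simulation can illustrate the idea. The sketch below assumes a skewed (exponential) population and takes 2,000 samples of 50 observations each; both numbers and the object name sample_means are arbitrary choices for illustration.

```r
set.seed(1221)

# Draw 2,000 samples of size 50 from an exponential population
# and record the mean of each sample
sample_means = replicate(2000, mean(rexp(50)))

hist(rexp(2000))   # the population shape: heavily right-skewed
hist(sample_means) # the sample means: roughly bell-shaped around 1
```

Even though the underlying data are far from normal, the histogram of sample means looks approximately normal.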

We can also visualize this with a website made by Dr. Erin York and PhD candidate Nguyen Ha!

Access it here.

Close Lesson 3

Thus ends our third lesson. By this point, you should be fairly comfortable with descriptive statistics (theoretically) and start to be a bit more confident in your R skills. If not, don’t fret—just spend a bit more time practicing and going over our previous lessons. Next lesson deals with a very important skill: data visualization.
