Population: The entire universe of observations about which you want to draw conclusions
Sample: The specific group from which you can collect data and test hypotheses
What is the population for the following samples?
War-related deaths in Ukraine
Divorce rate in Nashville
What would you need to consider when taking a sample of the following populations?
Voting age adults in the US
US Congresspeople
What is the population and sample for the following research questions?
What is the effect of eating potato chips on blood pressure?
What is the relationship between time spent on homework and grades?
Why do some wars last longer than others?
We are now going to walk through a series of statistical data distributions. These are important to know for 1) basic data literacy and 2) transforming your data so it better fits a linear regression model (which we will talk about later).
Everyone’s favorite distribution: the normal distribution! It is symmetric around its mean, is fully described by its mean and standard deviation, and has other useful statistical properties (for example, about 95 percent of observations fall within two standard deviations of the mean).
The t-distribution is very similar to the normal distribution, but it has fatter tails and therefore produces wider confidence intervals. As an old statistics professor taught me, imagine Mr. T punching the normal distribution downward. As the degrees of freedom (number of observations minus number of coefficients in the model) increase, it comes to resemble the normal distribution.
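To make the fatter-tails point concrete, here is a small sketch (mine, not from the original slides) comparing the probability of falling more than 2 units from zero under t-distributions with increasing degrees of freedom against the same probability under the normal distribution:

# Tail probability beyond +/- 2 for t-distributions with increasing degrees of
# freedom, compared to the normal distribution (illustrative sketch)
for (df in c(2, 5, 30, 100)) {
  cat("df =", df, ": P(|t| > 2) =", round(2 * pt(-2, df = df), 4), "\n")
}
cat("Normal:  P(|Z| > 2) =", round(2 * pnorm(-2), 4), "\n")

The tail probability shrinks toward the normal value of roughly 0.046 as the degrees of freedom grow.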
In the uniform distribution, every value within its range is equally likely to occur.
The rate of increase or decrease in the exponential distribution is (you guessed it) exponential rather than linear, so increasing curves get steeper as values grow, and decreasing curves flatten out as they decay.
The log-normal distribution describes a variable whose logarithm is normally distributed; the raw variable itself is strongly right-skewed. That makes the log transformation a good tool for normalizing a variable with high skew.
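As a quick illustration (mine, not from the original code), log-transforming a right-skewed, log-normally distributed variable yields something approximately normal:

set.seed(123)                                    # for replicability (seed value is arbitrary)
skewed <- rlnorm(1000, meanlog = 0, sdlog = 1)   # right-skewed, log-normally distributed variable
hist(skewed, main = "Raw variable: high skew")
hist(log(skewed), main = "Log-transformed: approximately normal")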
In addition to height, there are many natural examples of (approximately) normal distributions:
Sums of many dice rolls
Blood pressure
Lengths of teeth shed by sharks over time
First, let’s set up our code.
Here, we are going to randomly generate four types of data distributions. To ensure replicability, we first need to set a seed (which makes sure that the random draws are the same each time we run the code). This is a very important concept for more advanced statistical analyses, such as Monte Carlo simulations.
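In R, that looks something like this (the particular seed value is arbitrary and not from the original code; any fixed number works):

set.seed(1234)   # makes every subsequent random draw reproducible
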
Now, let’s generate the distributions and save them into objects. The default means are 0 with standard deviations of 1.
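A sketch of what this step could look like, assuming the four distributions are the normal, uniform, exponential, and log-normal, with 1,000 draws each (both the set of distributions and the sample size are assumptions, since the original code is not shown here):

n <- 1000                     # number of draws per distribution (assumed)
normal_dist      <- rnorm(n)  # defaults: mean = 0, sd = 1
uniform_dist     <- runif(n)  # defaults: min = 0, max = 1
exponential_dist <- rexp(n)   # default: rate = 1
lognormal_dist   <- rlnorm(n) # defaults: meanlog = 0, sdlog = 1
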
Now, let’s visualize them!
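Continuing with the (assumed) objects created in the sketch above, base R histograms are one simple way to do this; the original slides may have used ggplot2 or another tool:

par(mfrow = c(2, 2))          # arrange the four plots in a 2 x 2 grid
hist(normal_dist,      main = "Normal")
hist(uniform_dist,     main = "Uniform")
hist(exponential_dist, main = "Exponential")
hist(lognormal_dist,   main = "Log-normal")
par(mfrow = c(1, 1))          # reset the plotting layout
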
Today, we will be focusing on two of the foundations for statistics: the Law of Large Numbers and the Central Limit Theorem.
The Law of Large Numbers: the more iterations/values we have in a dataset, the more closely its sample statistics (such as the mean) will converge to the “true” population values.
Think about flipping a fair coin: it’s not crazy to imagine that you get 3-5 heads in a row. It would be crazy, however, to get 10,000 heads in a row.
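A small simulation sketch (mine, not from the original slides) makes this concrete: the running proportion of heads drifts toward 0.5 as the number of flips grows.

set.seed(42)                                             # for replicability (seed is arbitrary)
flips <- sample(c(0, 1), size = 10000, replace = TRUE)   # 1 = heads, 0 = tails
running_prop <- cumsum(flips) / seq_along(flips)         # proportion of heads after each flip
plot(running_prop, type = "l",
     xlab = "Number of flips", ylab = "Proportion of heads")
abline(h = 0.5, lty = 2)                                 # the "true" probability of heads
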
The Central Limit Theorem: the sampling distribution of sample means will approach a normal distribution as the sample size increases, regardless of the population distribution or the data-generating process.
This allows us to estimate the “true” distribution or data-generating process.
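Here is an illustrative sketch (not from the original materials): even when the underlying observations come from a heavily skewed exponential distribution, the means of repeated samples pile up in a roughly normal shape.

set.seed(42)                                                # for replicability (seed is arbitrary)
sample_means <- replicate(5000, mean(rexp(30, rate = 1)))   # 5,000 samples of size 30
hist(sample_means, breaks = 40, freq = FALSE,
     main = "Sampling distribution of the mean")
curve(dnorm(x, mean = 1, sd = 1 / sqrt(30)), add = TRUE)    # normal curve implied by the CLT
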
We can also visualize this with a website made by Dr. Erin York and PhD candidate Nguyen Ha!
Thus ends our third lesson. By this point, you should be fairly comfortable with descriptive statistics (at least in theory) and starting to feel a bit more confident in your R skills. If not, don’t fret: just spend a bit more time practicing and reviewing our previous lessons. The next lesson covers a very important skill: data visualization.