Graphing Problems (Healy 2019)
Bad Taste (Aesthetic):
Includes attributes that take away from the main point of the graph, such as extra colors, graphics, lines, and formatting
Bad Data (Substantive):
Misperceptions or mistakes in the data lead to a graph that does not accurately capture the real trend in the data
Bad Perception (Perceptual):
Piling on visual elements, such as odd aspect ratios, unnecessary legends, and 3D effects, can confuse readers
There is way too much information here. You should hesitate to use 3D graphs, and background photos are not a good idea. We think this figure was a joke, but hey, some people don’t have great taste.
The exponential y-axis scale is deceitful and makes it look like the UK is generating more oil than the rest of the world. Beware malevolent statisticians.
The two y-axes here make interpretation really confusing. Usually best to avoid this practice, even though sometimes you’ll really want to.
Pie charts are bad! Plus, the slices do not add up to 100%.
The y-axis jumps from 5 to 5.5 to 6, distorting the scale so that the GDP increase in 2021 looks larger than it is.
The map relied on being printed in color. It wasn’t, so there is no differentiation in the data points. Silly newspaper!
Moving forward, we will go over types of graphs and explain how to make good, informative visualizations. Much of the following discussion is taken from https://r-graphics.org/.
Bar graphs show categories (nominal variables) on the x-axis and numeric values on the y-axis.
Line graphs are typically used for visualizing how one continuous variable (on the y-axis) changes in relation to another continuous variable (on the x-axis). The x-axis variable often represents time.
Scatterplots display the relationship between two continuous variables. Each observation in a dataset is represented by a point.
Histograms show the distribution of a continuous variable. The x-axis is divided into bins of that variable, and the y-axis shows the number of observations that fall into each bin.
Edward Tufte in The Visual Display of Quantitative Information (2001, 51):
“Graphical excellence is the well-designed presentation of interesting data – a matter of substance, of statistics, and of design.
[It] consists of complex ideas communicated with clarity, precision, and efficiency.
[It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.
[It] is nearly always multivariate.
And [it] requires telling the truth about the data.”
First, the usual suspects:
Today we are going to use a powerful visualization tool native to R: ggplot2. The ggplot2 package is included within the tidyverse, and it uses a similar flow logic. Let’s start with barplots. Barplots are best for categorical variables.
Let’s try making a barplot of the genre_main variable.
ggplot(imdb, aes(genre_main)) +
  geom_bar() +
  labs(title = "Barplot", x = "Primary Genres", y = "Count")
You can also flip the axes and order by frequency.
imdb %>%
  count(genre_main) %>%
  ggplot(aes(reorder(genre_main, n), n)) +
  geom_col() +    # geom_col() plots the precomputed counts, like geom_bar(stat = "identity")
  coord_flip() +
  labs(title = "Barplot", x = "Primary Genres", y = "Count")
Sometimes it is helpful to visualize variables as dummies, such as the action variable here.
imdb %>%
  mutate(action_dummy = if_else(genre_main == "Action", "Action", "Not Action")) %>%
  ggplot(aes(action_dummy)) +
  geom_bar() +
  labs(title = "Count of Action vs. Non-Action Movies",
       x = "Genre",
       y = "Count")
Now let’s move to histograms. They are better suited for numeric variables.
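A histogram along these lines might look like the sketch below. The small `ratings` data frame is a made-up stand-in for the imdb data, and the `binwidth` value is an assumption you would tune; with the real data, you would pipe `imdb` into the same `ggplot()` call.

```r
library(ggplot2)

# Hypothetical stand-in for the imdb data: a handful of made-up ratings
ratings <- data.frame(
  imdb_rating = c(7.6, 7.7, 7.7, 7.8, 7.9, 8.0, 8.0, 8.1, 8.3, 8.8)
)

# Bin the numeric variable on the x-axis; bar heights are counts per bin
rating_hist <- ggplot(ratings, aes(x = imdb_rating)) +
  geom_histogram(binwidth = 0.1) +
  labs(title = "Histogram", x = "IMDB Rating", y = "Count")

rating_hist
```

A smaller `binwidth` shows more detail but a noisier shape; a larger one smooths the distribution.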
Box and whisker plots do a good job showing the quartiles of variables. You can also manually change the scales of the x and y axes to better suit your visualization. We show the box and whisker plot on the next page.
imdb %>%
  ggplot(aes(x = meta_score)) +
  geom_boxplot() +
  labs(title = "Box and Whisker Plot",
       x = "Audience Rating", y = "") +
  scale_x_continuous(breaks = seq(0, 100, by = 20), limits = c(0, 100))
Next up, line graphs. Line graphs are a bit different because they pool data, most often over time, to show trends. As such, we will have to pool our variable of interest over time and do some extra data cleaning.
First, we need to create a decade variable (years pooled into decades). To do this, we load in the lubridate package for dealing with date objects. Remember, you’ll only need to install this package once.
The following code 1) makes the annoying “PG” value into an NA, 2) makes the released_year variable into a Date object, 3) rounds each date down to the start of its decade, and 4) extracts the year from that date, leaving a numeric decade such as 1990.
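imdb$decade <- imdb$released_year %>%
  na_if("PG") %>%                     # the stray "PG" value becomes NA
  as.Date(format = "%Y") %>%          # parse the year string as a Date
  floor_date(unit = "10 years") %>%   # round down to the start of the decade
  year()                              # keep only the year number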
unique(imdb$decade)
 [1] 1990 1970 2000 1950 2010 1960 1980 2020 1940 1930 1920   NA
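As a cross-check, the same decade values can be computed with plain integer arithmetic instead of dates. This is an alternative sketch, not the approach used above; the `released_year` vector is a made-up sample that includes the stray "PG" value.

```r
# Hypothetical sample of release years stored as text, including the stray "PG"
released_year <- c("1994", "1972", "2008", "1957", "PG", "2020")

# as.integer() turns "PG" into NA (with a warning, silenced here)
year_num <- suppressWarnings(as.integer(released_year))

# Integer division by 10, then multiplying by 10, rounds down to the decade
decade <- (year_num %/% 10L) * 10L

decade  # decades: 1990, 1970, 2000, 1950, NA, 2020
```

This avoids the lubridate dependency, though the date-based version generalizes better if you later need months or days.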
Now, let’s create a variable that counts the number of top 1000 movies released each decade, then join that variable back into the original dataset. (Stay with us, we know this is a bit more advanced).
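One way to do this is sketched below with dplyr. The `movies` data frame and its `series_title` column are made-up stand-ins for the imdb data; `decade_movie_count` is the variable name used in the line graph code that follows.

```r
library(dplyr)

# Hypothetical stand-in for the imdb data with the decade variable already built
movies <- data.frame(
  series_title = c("A", "B", "C", "D", "E"),
  decade = c(1990, 1990, 2000, 2010, 2010)
)

# Count the movies in each decade and name the count column
decade_counts <- movies %>%
  count(decade, name = "decade_movie_count")

# Join the counts back so every movie row carries its decade's total
movies <- movies %>%
  left_join(decade_counts, by = "decade")

movies
```

dplyr's `add_count()` can do both steps in one call, but spelling out the count and the join makes the logic easier to follow.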
Finally, let’s visualize the line graph!
line_graph_decade <- imdb %>%
  ggplot(aes(decade, decade_movie_count)) +
  geom_line() +
  labs(title = "Line Graph", x = "Decade", y = "")
It is easy enough to save graphs for use outside of R. Just use ggsave() and enter the file name and plot object. You can also specify width and height, but these are optional. Note: the figure will save to your working directory unless otherwise specified with a file path.
First, assign a plot to an object so that you can reference it later, then use ggsave().
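A minimal sketch of that two-step pattern follows. The `decade_counts` data frame is made up, and the file name, width, and height are assumptions chosen for illustration. The example writes to a temporary folder so it runs anywhere; in practice you would pass a plain file name or your own path.

```r
library(ggplot2)

# Hypothetical decade counts standing in for the real data
decade_counts <- data.frame(
  decade = c(1990, 2000, 2010),
  decade_movie_count = c(2, 1, 2)
)

# Step 1: assign the plot to an object so it can be referenced later
line_graph_decade <- ggplot(decade_counts, aes(decade, decade_movie_count)) +
  geom_line() +
  labs(title = "Line Graph", x = "Decade", y = "Count")

# Step 2: save it; width and height are in inches by default
out_file <- file.path(tempdir(), "line_graph_decade.png")
ggsave(filename = out_file, plot = line_graph_decade, width = 6, height = 4)
```

The file extension determines the format, so changing ".png" to ".pdf" saves a PDF instead.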
Now we move to our final visualization of the day: scatterplots. Scatterplots show the raw data of two different variables at the same time and are good at depicting trends in the data.
Let’s see how much money the movies grossed over each decade.
Some of the values are a bit weird, so let’s remove those first.
decade_gross <- imdb %>%
  group_by(decade) %>%
  summarize(gross_mean = mean(as.numeric(gross), na.rm = TRUE))
decade_gross
# A tibble: 12 × 2
   decade gross_mean
    <dbl>      <dbl>
 1   1920   1844978.
 2   1930  18403476.
 3   1940   8293462.
 4   1950  14174694
 5   1960  31634777.
 6   1970  52879527.
 7   1980  62314866.
 8   1990  61229999.
 9   2000  69658081.
10   2010 103720361.
11   2020       NaN
12     NA 173837933
# Drop the 2020 row (NaN) and the NA-decade row
decade_gross <- decade_gross[-(11:12), ]
Now let’s create the scatterplot. This first plot pools by decade.
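The decade-level plot could look something like this sketch. To keep it self-contained, it rebuilds `decade_gross` from the (rounded) means printed in the table above rather than from the imdb data; the plot title and axis labels are our own choices.

```r
library(ggplot2)

# Decade means copied from the printed table, without the 2020 and NA rows
decade_gross <- data.frame(
  decade = seq(1920, 2010, by = 10),
  gross_mean = c(1844978, 18403476, 8293462, 14174694, 31634777,
                 52879527, 62314866, 61229999, 69658081, 103720361)
)

# One point per decade: decade on the x-axis, mean gross on the y-axis
decade_scatter <- ggplot(decade_gross, aes(decade, gross_mean)) +
  geom_point() +
  labs(title = "Scatterplot", x = "Decade", y = "Mean Gross Revenue")

decade_scatter
```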
The second plot pools by year.
imdb %>%
  ggplot(aes(as.numeric(released_year), as.numeric(gross))) +
  geom_point() +
  labs(y = "Gross Revenue", x = "Year") +
  theme(axis.text = element_blank(), axis.ticks = element_blank())
I like to use this website to get a sense of the options available. There are also packages like RColorBrewer and viridis that have prepackaged colors that scale well. Use the “color” and/or “fill” options to specify colors.
imdb %>%
  ggplot(aes(imdb_rating)) +
  geom_histogram(fill = "firebrick") +  # a fixed color goes outside aes(), so no legend is created
  labs(y = "Count", x = "IMDB Rating")
# If fill = "firebrick" were placed inside aes(), ggplot2 would treat it as a data value,
# pick a default color, and add a legend; guides(fill = "none") would then hide that legend.
At its most basic, linear regression is a statistical model that draws the line of best fit in order to predict the relationship between your dependent variable and independent variable.
The line of best fit minimizes error, or the distance from the line to each point. We will go over this in more depth in a later class.
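To make "minimizes error" concrete, here is a small sketch using base R's `lm()`, which fits the line by ordinary least squares, meaning it minimizes the sum of squared vertical distances. The `toy` data and the `sse()` helper are made up for illustration.

```r
# Made-up data that roughly follows y = 1 + 2x
toy <- data.frame(x = c(1, 2, 3, 4, 5), y = c(3.1, 4.9, 7.2, 8.8, 11.0))

# Fit the line of best fit by ordinary least squares
fit <- lm(y ~ x, data = toy)
best <- coef(fit)

# Hypothetical helper: sum of squared vertical distances for any candidate line
sse <- function(intercept, slope) {
  sum((toy$y - (intercept + slope * toy$x))^2)
}

# Error of the fitted line versus a hand-picked alternative
sse_fitted <- sse(best[["(Intercept)"]], best[["x"]])
sse_guess <- sse(1, 2)

c(fitted = sse_fitted, guess = sse_guess)
```

No other straight line can have a smaller sum of squared errors on this data than the one `lm()` returns, so `sse_fitted` is never larger than `sse_guess`.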
Make sure your figures are fully interpretable with clear labels and captions. The reader should not need anything outside of the figure to know what it’s communicating.
Do not just drop a figure into your text without explaining it.
We hope you had a good time running through the good, the bad, and the ugly of data visualization. It can be intense and time-consuming, but nothing feels better than making a good plot. In the next lesson, we will review correlational analyses and how to perform them in R.