Introduction to Data Visualization

Jennifer Barnes and Alexander Tripp

Objectives

Understand the differences between good and bad visualizations
Learn some of the most common types of visualizations
Learn how to create good visualizations

Anatomy of a Figure

What Makes a Bad Figure?

Graphing Problems (Healy 2019)

Bad Taste (Aesthetic):
Includes attributes that take away from the main point of the graph, such as extra colors, graphics, lines, and formatting

Bad Data (Substantive):
Misperceptions or mistakes in the data lead to a graph that does not accurately capture the real trend in the data

Bad Perception (Perceptual):
Design choices that work against how people perceive images can confuse readers, such as odd aspect ratios, unnecessary legends, and 3D effects

In the next slides, where do we see the main graphing problems?

Bad Graph 1

There is way too much information here. You should hesitate to use 3D graphs, and background photos are not a good idea. We think this figure was a joke, but hey, some people don’t have great taste.

Bad Graph 2

The exponential y-axis scale is deceitful and makes it look like the UK is generating more oil than the rest of the world. Beware malevolent statisticians.

Bad Graph 3

The two y-axes here make interpretation really confusing. Usually best to avoid this practice, even though sometimes you’ll really want to.

Bad Graph 4

Pie charts are bad! Plus, the slices do not add up to 100%.

Bad Graph 5

The y-axis changes from 5 to 5.5 to 6, manipulating the scale of the GDP increase in 2021 to look larger.

Bad Graph 6

Map relied on being printed in color. It wasn’t, so there is no differentiation in the data points. Silly newspaper!

Pretty bad, right?

Moving forward, we will go over types of graphs and explain how to make good, informative visualizations. Much of the following discussion is taken from https://r-graphics.org/.

Bar Plot

Shows categories (nominal variables) on the x-axis and numeric values on the y-axis

Line Graph

Typically used for visualizing how one continuous variable (on the y-axis) changes in relation to another continuous variable (on the x-axis). The x-axis variable often represents time.

Scatterplot

Scatterplots display the relationship between two continuous variables. Each observation in a dataset is represented by a point.

Histogram

Histograms show the distribution of a continuous variable. The x-axis is divided into bins (ranges of values), and the y-axis shows the count of observations that fall into each bin.

Some Wise Words…

Edward Tufte in Visual Display of Quantitative Information (2001, 51):

“Graphical excellence is the well-designed presentation of interesting data - a matter of substance, of statistics, and of design.

[It] consists of complex ideas communicated with clarity, precision, and efficiency.

[It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space.

[It] is nearly always multivariate.

And [it] requires telling the truth about the data.”

Best Practices

Interesting in terms of substance and statistics
Complex ideas presented clearly
Multivariate
Truthful
Simple (though see the “Monstrous Costs” chart and the discussion on pp. 5-14 in Kieran Healy’s 2019 book, Data Visualization)

Visualizing in R

First, the usual suspects:

library(tidyverse)

getwd()

imdb <- read_csv("imdb.csv")

ggplot2

Today we are going to use a powerful visualization package for R: ggplot2. The ggplot2 package is included in the tidyverse and follows a similar chained logic, except that plot layers are added with + rather than piped with %>%. Let’s start with barplots. Barplots are best for categorical variables.

Barplot

Let’s try making a barplot of the genre_main variable.

ggplot(imdb, aes(genre_main)) +
  geom_bar() + 
  labs(title = "Barplot", x = "Primary Genres", y = "Count")

Barplot

You can also flip the axes and order by frequency.

imdb %>% 
  count(genre_main) %>% 
  ggplot(aes(reorder(genre_main, n), n)) + 
  geom_col() + # geom_col() is shorthand for geom_bar(stat = "identity")
  coord_flip() +
  labs(title = "Barplot", x = "Primary Genres", y = "Count")

Barplot

Sometimes it is helpful to collapse a variable into a dummy with only two categories. Here we recode genre_main as “Action” vs. “Not Action.”

imdb %>% 
  mutate(action_dummy = if_else(genre_main == "Action", "Action", "Not Action")) %>% 
  ggplot(aes(action_dummy)) + 
    geom_bar() + 
    labs(title = "Count of Action vs. Non-Action Movies", 
          x = "Genre", 
          y = "Count")

Histograms

Now let’s move to histograms. They are better suited for numeric variables.

imdb %>% 
  ggplot(aes(x = runtime)) +
    geom_histogram()
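
The look of a histogram depends heavily on how wide the bins are. Here is a minimal sketch of setting the bin width by hand. The toy_movies data frame and its runtimes are made up for illustration and are not taken from imdb.csv.

```r
library(ggplot2)

# Toy data standing in for imdb (made-up runtimes in minutes)
toy_movies <- data.frame(
  runtime = c(88, 92, 95, 101, 104, 110, 118, 121, 135, 142)
)

# binwidth sets the width of each bin in the units of x (here, minutes);
# boundary lines the bin edges up with 80, 100, 120, ...
runtime_hist <- ggplot(toy_movies, aes(x = runtime)) +
  geom_histogram(binwidth = 20, boundary = 80) +
  labs(title = "Histogram of Runtime", x = "Runtime (minutes)", y = "Count")

# Call the object to draw the plot
runtime_hist
```

Try a few values of binwidth: very narrow bins look noisy, while very wide bins can hide the shape of the distribution.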

Box and Whisker Plots

Box and whisker plots do a good job showing the quartiles of variables. You can also manually change the scales of the x and y axes to better suit your visualization. We show the box and whisker plot on the next page.

Box and Whisker Plots

imdb %>% 
  ggplot(aes(x = meta_score)) + 
    geom_boxplot() + 
    labs(title = "Box and Whisker Plot", 
         x = "Meta Score", y = "") + 
    # Tick labels default to the break values, so they need not be repeated
    scale_x_continuous(breaks = seq(0, 100, by = 20), limits = c(0, 100))
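
To see where the box and whiskers come from, the sketch below computes the same pieces by hand. The scores are made-up values, and the 1.5 x IQR whisker rule is the ggplot2 default for geom_boxplot().

```r
# Toy scores (made-up values, not from imdb.csv)
scores <- c(55, 62, 68, 70, 74, 77, 81, 85, 90, 96, 100)

# The box spans the 25th to 75th percentile; the middle line is the median
quartiles <- quantile(scores, probs = c(0.25, 0.5, 0.75))
print(quartiles)

# Interquartile range: the width of the box
iqr <- quartiles[["75%"]] - quartiles[["25%"]]

# Whiskers reach the most extreme values within 1.5 * IQR of the box;
# anything beyond these fences is drawn as an individual outlier point
lower_fence <- quartiles[["25%"]] - 1.5 * iqr
upper_fence <- quartiles[["75%"]] + 1.5 * iqr
whisker_ends <- range(scores[scores >= lower_fence & scores <= upper_fence])
print(whisker_ends)
```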

Line Graphs

Next up, line graphs. Line graphs are a bit different because they pool data—most often over time—to show trends. As such, we will have to pool our variable of interest over time and do some extra data cleaning.

First, we need to create a decade variable (years pooled into decades). To do this, we load in the lubridate package for dealing with date objects. Remember, you’ll only need to install this package once.

install.packages("lubridate")
library(lubridate)

Line Graphs

The following code 1) turns the annoying “PG” value into an NA, 2) converts the released_year variable into a Date object, 3) rounds each date down to the start of its decade, and 4) extracts the year from that date, leaving the decade as a number.

imdb$decade <- imdb$released_year %>% 
  na_if("PG") %>%
  as.Date(format = "%Y") %>%
  floor_date(unit = "10 years") %>%
  year()

unique(imdb$decade)

 [1] 1990 1970 2000 1950 2010 1960 1980 2020 1940 1930 1920   NA
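
If the date conversion feels roundabout, plain arithmetic may be an easier way to get the same decade values. This is an alternative sketch, not the approach used above, and the released_year vector is a made-up stand-in for the real column.

```r
# Toy release years stored as text, including the stray "PG" value
released_year <- c("1994", "1972", "2008", "PG", "1957", "2010")

# Anything that is not a number becomes NA (the coercion warning is expected)
year_num <- suppressWarnings(as.numeric(released_year))

# Integer division by 10 drops the last digit; multiplying by 10 restores the decade
decade <- (year_num %/% 10) * 10
print(decade)
```

This version needs no extra package and handles the "PG" value in the same step.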

Line Graphs

Now, let’s create a variable that counts the number of top 1000 movies released each decade, then join that variable back into the original dataset. (Stay with us, we know this is a bit more advanced).

decade_movie_count <- imdb %>% 
  group_by(decade) %>% 
  summarize(decade_movie_count = n()) # decade is kept automatically as the grouping variable

imdb <- left_join(imdb, decade_movie_count, by = "decade")
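
The count-then-join pattern can be hard to picture, so here is a small sketch on a toy table (the titles and decades are made up). It also shows add_count() from dplyr, which appears to do both steps at once.

```r
library(dplyr)

# Toy table: one row per movie
toy_movies <- tibble(
  title  = c("A", "B", "C", "D", "E"),
  decade = c(1990, 1990, 2000, 2010, 2010)
)

# count() collapses to one row per decade, with the tally in a named column
decade_counts <- toy_movies %>% 
  count(decade, name = "decade_movie_count")

# left_join() copies each decade's tally back onto every movie row
joined <- left_join(toy_movies, decade_counts, by = "decade")

# add_count() keeps every movie row and attaches the tally in one step
toy_with_counts <- toy_movies %>% 
  add_count(decade, name = "decade_movie_count")
```

Both joined and toy_with_counts still have five rows, one per movie, with the decade total repeated on each row.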

Line Graphs

Finally, let’s visualize the line graph!

line_graph_decade <- imdb %>% 
  ggplot(aes(decade, decade_movie_count)) + 
  geom_line() + 
  labs(title = "Line Graph", x = "Decade", y = "")

# The plot was assigned to an object, so call the object to display it
line_graph_decade

Saving Graphs

It is easy enough to save graphs for use outside of R. Just use ggsave() and enter in the file name and plot object. You can also specify width and height, but these are optional. Note: the figure will save to your working directory unless otherwise specified with a file path.

First, assign a plot to an object so that you can reference it later, then use ggsave().

line_graph_decade <- imdb %>% 
  ggplot(aes(decade, decade_movie_count)) + 
  geom_line() + 
  labs(title = "Line Graph", x = "Decade", y = "")
  
ggsave("line_graph_decade.png", line_graph_decade, width = 7, height = 5)
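
To save somewhere other than the working directory, ggsave() also takes a path argument, along with units and dpi for print-quality output. The sketch below saves a toy plot to a temporary folder; the data, the file name, and the choice of tempdir() are only for illustration.

```r
library(ggplot2)

# Toy plot (made-up counts per decade)
toy_counts <- data.frame(decade = c(1990, 2000, 2010), n = c(2, 1, 2))
toy_plot <- ggplot(toy_counts, aes(decade, n)) + 
  geom_line() + 
  labs(x = "Decade", y = "Count")

# Folder to save into; swap in your own folder, e.g. "figures"
out_dir <- tempdir()

# width and height are in inches here; dpi controls the resolution
ggsave("toy_line_graph.png", plot = toy_plot, path = out_dir,
       width = 7, height = 5, units = "in", dpi = 300)

# Confirm the file was written
saved_file <- file.path(out_dir, "toy_line_graph.png")
file.exists(saved_file)
```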

Scatterplots

Now we move to our final visualization of the day: scatterplots. Scatterplots show the raw data of two different variables at the same time and are good at depicting trends in the data.

Let’s see how much money the movies grossed in each decade.

Scatterplots

Some of the values are a bit weird, so let’s remove those first. The 2020 row has no usable gross data (NaN), and the NA row comes from the release year recorded as “PG”.

decade_gross <- imdb %>% group_by(decade) %>% 
  summarize(gross_mean = mean(as.numeric(gross), na.rm = TRUE))

decade_gross

# A tibble: 12 × 2
   decade gross_mean
    <dbl>      <dbl>
 1   1920   1844978.
 2   1930  18403476.
 3   1940   8293462.
 4   1950  14174694 
 5   1960  31634777.
 6   1970  52879527.
 7   1980  62314866.
 8   1990  61229999.
 9   2000  69658081.
10   2010 103720361.
11   2020       NaN 
12     NA 173837933

# Drop rows 11 and 12 (the NaN mean for 2020 and the NA decade)
decade_gross <- decade_gross[-11:-12,]
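
Dropping rows by position only works while the row order stays the same. A condition-based filter may be more robust. Here is a sketch on a toy table with made-up values that mimics the two problem rows.

```r
library(dplyr)

# Toy summary table with one NaN mean and one NA decade
toy_gross <- tibble(
  decade     = c(2000, 2010, 2020, NA),
  gross_mean = c(69.7, 103.7, NaN, 173.8)
)

# Keep only rows where both columns hold usable values;
# is.na() returns TRUE for NaN as well as NA
clean_gross <- toy_gross %>% 
  filter(!is.na(decade), !is.na(gross_mean))

print(clean_gross)
```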

Scatterplots

Now let’s create the scatterplot. This first plot pools by decade.

decade_gross %>% 
  ggplot(aes(decade, gross_mean)) + geom_point()

Scatterplots

The second plot does not pool at all: it shows one point per movie, placed by release year.

imdb %>% 
  ggplot(aes(as.numeric(released_year), as.numeric(gross))) + 
  geom_point() + 
  labs(y = "Gross Revenue", x = "Year") +
  theme(axis.text = element_blank(), axis.ticks = element_blank())

Coloring Graphs

I like to use this website to get a sense of the options available. There are also packages like RColorBrewer and viridis that have prepackaged colors that scale well. Use the “color” and/or “fill” options to specify colors.

imdb %>% 
  ggplot(aes(imdb_rating)) + 
  # Set a fixed color outside aes(). Inside aes(), "firebrick" is treated as a
  # data value, so the bars get a default color and an unneeded legend.
  geom_histogram(fill = "firebrick") + 
  labs(y = "Count", x = "IMDB Rating")
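
When color should vary with the data, map a variable to fill inside aes() and choose a palette with a scale function. The sketch below uses scale_fill_brewer(), which draws on the RColorBrewer palettes. The toy data and the "Dark2" palette are arbitrary choices for illustration.

```r
library(ggplot2)

# Toy data (made-up genres)
toy_movies <- data.frame(
  genre = c("Action", "Comedy", "Drama", "Drama", "Action", "Drama")
)

# fill = genre inside aes() gives each genre its own color from the palette
genre_bars <- ggplot(toy_movies, aes(genre, fill = genre)) + 
  geom_bar() + 
  scale_fill_brewer(palette = "Dark2") + 
  labs(x = "Genre", y = "Count") + 
  guides(fill = "none") # the x-axis already names each genre

genre_bars
```

Swapping scale_fill_brewer(palette = "Dark2") for scale_fill_viridis_d() gives a viridis palette instead.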

Linear Regression

At its most basic, linear regression is a statistical model that draws the line of best fit to describe the relationship between your dependent variable and independent variable.

The line of best fit minimizes error. In ordinary least squares, that error is the sum of the squared vertical distances from each point to the line. We will go over this in more depth in a later class.
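
As a preview, here is a minimal sketch of fitting and drawing a line of best fit. The x and y values are made up, lm() fits the model by ordinary least squares, and geom_smooth(method = "lm") overlays the fitted line on a scatterplot.

```r
library(ggplot2)

# Toy data (made-up values that roughly follow a straight line)
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Fit y as a linear function of x
fit <- lm(y ~ x, data = toy)

# Intercept and slope of the line of best fit
print(coef(fit))

# Residuals are the vertical distances from each point to the line
print(sum(residuals(fit)^2))

# Scatterplot with the fitted line drawn on top (se = FALSE hides the band)
fit_plot <- ggplot(toy, aes(x, y)) + 
  geom_point() + 
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

fit_plot
```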

Additional Tips for Making Visualizations

Make sure your figures are fully interpretable with clear labels and captions. The reader should not need anything outside of the figure to know what it’s communicating.
Do not just drop a figure into your text without explaining it.

Close Lesson 4

We hope you had a good time running through the good, the bad, and the ugly of data visualization. It can be intense and time-consuming, but nothing feels better than making a good plot. In the next lesson, we will review correlational analyses and how to perform them in R.
