Introduction to Data Processing with R

Jennifer Barnes and Alexander Tripp

Objectives

Get acquainted with R and RStudio
Learn how to write transparent code using .R files
Know how to access and browse a pre-cleaned dataset
Perform basic data manipulation

Why use R?

free and open-source
cross-platform
tons of resources online
regularly updated
extremely flexible and can accomplish a lot

Installing R & RStudio

You will need to separately install R and RStudio (our IDE of choice for R).

https://posit.co/download/rstudio-desktop/

RStudio Layout

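When you open RStudio in its default layout with a script open, you will see four panes: 1) the source editor for your .R files in the top left, 2) the Environment and History tabs in the top right, 3) the console in the bottom left, and 4) the Files, Plots, Packages, and Help tabs in the bottom right.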

Initial Setup

Before anything else, in every R file you make, you should first 1) clear your environment and 2) load any necessary packages.

To clear any data currently loaded into your environment, you can run the following command.

rm(list = ls())
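To see what this command does, here is a minimal sketch that creates two throwaway objects (the names x and y are made up for illustration), lists them with ls(), and then clears them. Note that rm(list = ls()) removes objects only; it does not unload packages or change the working directory.

```r
# Create two throwaway objects (hypothetical names, for illustration only).
x <- 1:5
y <- "hello"

# ls() returns the names of the objects in the current environment.
print(ls())

# rm(list = ls()) removes every object that ls() reports.
rm(list = ls())

# Count the objects that remain after clearing.
print(length(ls()))
```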

Packages

You set up a package in two steps: first install it with the install.packages("packagename") function, then load it with library(packagename). You only need to install a package on your computer once. Afterwards, you load it in each new R session using the library() function. For example, throughout this course we will be using the “tidyverse” package. The following code installs this package and then loads it.

install.packages("tidyverse")

library(tidyverse)
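Because installation only needs to happen once, one possible pattern is to check whether a package is already installed before calling install.packages(). The sketch below is one way to do that; the helper name is_installed is made up for illustration, and the install line is left commented out so that running the sketch never triggers a download.

```r
# Hypothetical helper: TRUE when a package is installed, without loading it.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so it is always available.
print(is_installed("stats"))

# Install tidyverse only when it is missing, then load it.
# (Commented out here; uncomment to use in your own script.)
# if (!is_installed("tidyverse")) install.packages("tidyverse")
# library(tidyverse)
```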

Checking the Working Directory

Before we load in data, it’s important to set your working directory and understand where you are saving files. Your working directory is where R will look for any data that you are trying to load, and where it will save any data, figures, etc. unless directed otherwise. Often, the working directory is the folder where your R file is saved, but you should always check that you are where you intend to be.

You can check where the working directory currently is by using the getwd() command.

getwd()

Changing the Working Directory

If this is not where you want your working directory, you can change it using the setwd() command and directing R to the correct directory. For example:

setwd("C:/Users/atrip/Dropbox/REU 2024 Stats/Shiny App Lessons")

Importantly, if you are working with multiple R scripts open, simply clicking over to another open script will not change your working directory.
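As a rough illustration of how getwd() and setwd() work together, the sketch below saves the current working directory, switches to a scratch folder, and then switches back. Here tempdir() simply stands in for a real project folder.

```r
# Remember where we started so we can come back.
original_dir <- getwd()

# tempdir() is a scratch folder that R creates for each session;
# it stands in here for a real project folder.
setwd(tempdir())
print(getwd())

# File names are now looked up relative to the new working directory.
print(file.exists("imdb.csv"))

# Return to the original location.
setwd(original_dir)
```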

IMDb Data

For class examples this summer, we’ll be using data of the Top 1000 (by user rating) movies on IMDb (as of 2020).

First, let’s take a look at the different variables that are included in this data.

poster_link: link of the poster that IMDb is using
series_title: name of the movie
released_year: year the movie was released
certificate: certificate earned by that movie (e.g. U, PG-13, R)
runtime: total runtime of the movie in minutes
genre: all genres of the movie
genre_main: first listed genre of the movie
imdb_rating: rating of the movie on IMDb by users
overview: written summary of the movie
meta_score: rating of the movie by critics
director: name of the director
star1, star2, star3, star4: names of main actors
no_of_votes: total number of user rating votes
gross: total money earned by the movie in USD

Loading in Data

Now, let’s load in the IMDb data. First, make sure the data file is in the same folder as your working directory. Then, run the following code.

imdb <- read_csv("imdb.csv")

In this command, we first specify the name of the object that we’d like to hold our data. In this case, an object named “imdb” will hold the data we’re reading in from the csv “imdb.csv”.
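If you want to try the read-and-assign pattern without the course file, the following sketch writes a tiny stand-in CSV to a temporary file and reads it back. The two rows are made up for illustration, and base R's read.csv() is swapped in for read_csv() so that the sketch runs without any extra packages.

```r
# Write a tiny stand-in CSV to a temporary file (made-up rows, not the real data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("series_title,runtime",
             "Movie A,120",
             "Movie B,95"), csv_path)

# Confirm the file exists before trying to read it.
stopifnot(file.exists(csv_path))

# read.csv() is the base R counterpart of read_csv().
demo <- read.csv(csv_path)

# Check the dimensions and column names of what was read.
print(dim(demo))
print(names(demo))
```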

Viewing data

Now that the data is loaded, you can preview the dataset by typing the object name. Because imdb is a tibble, R prints only the first 10 rows and as many columns as fit on screen.

imdb

# A tibble: 1,000 × 18
   poster_link   series_title released_year certificate runtime genre_main genre
   <chr>         <chr>        <chr>         <chr>         <dbl> <chr>      <chr>
 1 https://m.me… The Shawsha… 1994          A               142 Drama      Drama
 2 https://m.me… The Godfath… 1972          A               175 Crime      Crim…
 3 https://m.me… The Dark Kn… 2008          UA              152 Action     Acti…
 4 https://m.me… The Godfath… 1974          A               202 Crime      Crim…
 5 https://m.me… 12 Angry Men 1957          U                96 Crime      Crim…
 6 https://m.me… The Lord of… 2003          U               201 Action     Acti…
 7 https://m.me… Pulp Fiction 1994          A               154 Crime      Crim…
 8 https://m.me… Schindler's… 1993          A               195 Biography  Biog…
 9 https://m.me… Inception    2010          UA              148 Action     Acti…
10 https://m.me… Fight Club   1999          A               139 Drama      Drama
# ℹ 990 more rows
# ℹ 11 more variables: imdb_rating <dbl>, no_of_votes <dbl>, overview <chr>,
#   meta_score <dbl>, director <chr>, star1 <chr>, star2 <chr>, star3 <chr>,
#   star4 <chr>, gross <chr>, top10score <dbl>

You can also look at specific variables by typing the name of the dataset and then selecting a particular column.

imdb %>% select(director)

# A tibble: 1,000 × 1
   director            
   <chr>               
 1 Frank Darabont      
 2 Francis Ford Coppola
 3 Christopher Nolan   
 4 Francis Ford Coppola
 5 Sidney Lumet        
 6 Peter Jackson       
 7 Quentin Tarantino   
 8 Steven Spielberg    
 9 Christopher Nolan   
10 David Fincher       
# ℹ 990 more rows
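For comparison, base R offers its own ways to pull out a column. The sketch below uses a small made-up data frame (not the IMDb data) to show the difference between extracting a column as a vector and keeping it as a one-column table.

```r
# A small made-up data frame for illustration.
demo <- data.frame(
  series_title = c("Movie A", "Movie B", "Movie C"),
  director = c("Director X", "Director Y", "Director X")
)

# The $ operator extracts one column as a plain vector.
directors <- demo$director
print(directors)

# Single brackets with a column name keep the result as a data frame,
# similar in spirit to select(director).
director_df <- demo["director"]
print(class(director_df))
```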

Describing Variables

First, let’s use the str() function to list the variables in the dataset along with their types and first few values.

str(imdb)

spc_tbl_ [1,000 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ poster_link  : chr [1:1000] "https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk"| __truncated__ "https://m.media-amazon.com/images/M/MV5BM2MyNjYxNmUtYTAwNi00MTYxLWJmNWYtYzZlODY3ZTk3OTFlXkEyXkFqcGdeQXVyNzkwMjQ"| __truncated__ "https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw@@._V1_UX67_CR0,0,67,98_AL_.jpg" "https://m.media-amazon.com/images/M/MV5BMWMwMGQzZTItY2JlNC00OWZiLWIyMDctNDk2ZDQ2YjRjMWQ0XkEyXkFqcGdeQXVyNzkwMjQ"| __truncated__ ...
 $ series_title : chr [1:1000] "The Shawshank Redemption" "The Godfather" "The Dark Knight" "The Godfather: Part II" ...
 $ released_year: chr [1:1000] "1994" "1972" "2008" "1974" ...
 $ certificate  : chr [1:1000] "A" "A" "UA" "A" ...
 $ runtime      : num [1:1000] 142 175 152 202 96 201 154 195 148 139 ...
 $ genre_main   : chr [1:1000] "Drama" "Crime" "Action" "Crime" ...
 $ genre        : chr [1:1000] "Drama" "Crime, Drama" "Action, Crime, Drama" "Crime, Drama" ...
 $ imdb_rating  : num [1:1000] 9.3 9.2 9 9 9 8.9 8.9 8.9 8.8 8.8 ...
 $ no_of_votes  : num [1:1000] 2343110 1620367 2303232 1129952 689845 ...
 $ overview     : chr [1:1000] "Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency." "An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son." "When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of th"| __truncated__ "The early life and career of Vito Corleone in 1920s New York City is portrayed, while his son, Michael, expands"| __truncated__ ...
 $ meta_score   : num [1:1000] 80 100 84 90 96 94 94 94 74 66 ...
 $ director     : chr [1:1000] "Frank Darabont" "Francis Ford Coppola" "Christopher Nolan" "Francis Ford Coppola" ...
 $ star1        : chr [1:1000] "Tim Robbins" "Marlon Brando" "Christian Bale" "Al Pacino" ...
 $ star2        : chr [1:1000] "Morgan Freeman" "Al Pacino" "Heath Ledger" "Robert De Niro" ...
 $ star3        : chr [1:1000] "Bob Gunton" "James Caan" "Aaron Eckhart" "Robert Duvall" ...
 $ star4        : chr [1:1000] "William Sadler" "Diane Keaton" "Michael Caine" "Diane Keaton" ...
 $ gross        : chr [1:1000] "28341469" "134966411" "534858444" "57300000" ...
 $ top10score   : num [1:1000] 0 1 0 1 1 1 1 1 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   poster_link = col_character(),
  ..   series_title = col_character(),
  ..   released_year = col_character(),
  ..   certificate = col_character(),
  ..   runtime = col_double(),
  ..   genre_main = col_character(),
  ..   genre = col_character(),
  ..   imdb_rating = col_double(),
  ..   no_of_votes = col_double(),
  ..   overview = col_character(),
  ..   meta_score = col_double(),
  ..   director = col_character(),
  ..   star1 = col_character(),
  ..   star2 = col_character(),
  ..   star3 = col_character(),
  ..   star4 = col_character(),
  ..   gross = col_character(),
  ..   top10score = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

To look at a particular variable, you can use the table() function and specify which variable you’d like a table of values for. To specify the column, you can use the $ operator and the name of that column. For example, this will tell us which main movie genres are in our dataset and how many observations fall under each one.

table(imdb$genre_main)


   Action Adventure Animation Biography    Comedy     Crime     Drama    Family 
      172        72        82        88       155       107       289         2 
  Fantasy Film-Noir    Horror   Mystery  Thriller   Western 
        2         3        11        12         1         4

You can also look at how two variables compare with one another. For example, the following code shows how the main genres are distributed across release years, from 2000 onward, in our top 1000 movies dataset.

imdb %>%
  select(genre_main, released_year) %>%
  filter(released_year >= 2000) %>%
  table()

           released_year
genre_main  2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
  Action       5    1    6    5    5    4    6    5    6    8    5    3    6
  Adventure    3    2    1    1    3    1    3    2    0    2    1    1    2
  Animation    1    5    0    3    2    0    2    2    2    5    5    0    2
  Biography    1    2    2    0    4    3    2    4    2    2    3    2    1
  Comedy       2    7    1    6    1    2    3    1    1    3    1    4    6
  Crime        2    2    4    4    3    1    4    3    1    3    0    3    0
  Drama        4    7    5    3   12    6    6    9    9    6    7    5    5
  Horror       0    1    0    0    1    0    0    0    0    0    0    0    0
  Mystery      1    0    0    0    0    0    0    0    0    0    1    0    2
           released_year
genre_main  2013 2014 2015 2016 2017 2018 2019 2020 PG
  Action       2    8    7    6    9    6    3    0  0
  Adventure    4    2    1    1    1    0    2    0  1
  Animation    2    5    2    5    2    3    2    1  0
  Biography    5    4    3    4    1    2    3    1  0
  Comedy       1    4    2    4    1    1    6    2  0
  Crime        4    2    4    1    1    2    1    0  0
  Drama       10    7    6    7    6    5    6    2  0
  Horror       0    0    0    0    1    0    0    0  0
  Mystery      0    0    0    0    0    0    0    0  0

We can also look at basic descriptive statistics of numeric variables using the summary() function.

summary(imdb$runtime)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   45.0   103.0   119.0   122.9   137.0   321.0

What if we look at summary() for the entire dataset, including for non-numeric variables?

summary(imdb)

 poster_link        series_title       released_year      certificate       
 Length:1000        Length:1000        Length:1000        Length:1000       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
    runtime       genre_main           genre            imdb_rating   
 Min.   : 45.0   Length:1000        Length:1000        Min.   :7.600  
 1st Qu.:103.0   Class :character   Class :character   1st Qu.:7.700  
 Median :119.0   Mode  :character   Mode  :character   Median :7.900  
 Mean   :122.9                                         Mean   :7.949  
 3rd Qu.:137.0                                         3rd Qu.:8.100  
 Max.   :321.0                                         Max.   :9.300  
                                                                      
  no_of_votes        overview           meta_score       director        
 Min.   :  25088   Length:1000        Min.   : 28.00   Length:1000       
 1st Qu.:  55526   Class :character   1st Qu.: 70.00   Class :character  
 Median : 138549   Mode  :character   Median : 79.00   Mode  :character  
 Mean   : 273693                      Mean   : 77.97                     
 3rd Qu.: 374161                      3rd Qu.: 87.00                     
 Max.   :2343110                      Max.   :100.00                     
                                      NA's   :157                        
    star1              star2              star3              star4          
 Length:1000        Length:1000        Length:1000        Length:1000       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
    gross             top10score   
 Length:1000        Min.   :0.000  
 Class :character   1st Qu.:0.000  
 Mode  :character   Median :0.000  
                    Mean   :0.318  
                    3rd Qu.:1.000  
                    Max.   :1.000
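The summary above reports 157 NA's for meta_score. Missing values like these affect most calculations, so it helps to know how to count them and how to exclude them. The sketch below uses a short made-up vector rather than the real column.

```r
# A made-up vector with two missing values.
meta_score <- c(80, NA, 84, 90, NA)

# Count the missing values.
n_missing <- sum(is.na(meta_score))
print(n_missing)

# mean() returns NA if any value is missing...
mean_with_na <- mean(meta_score)
print(mean_with_na)

# ...unless we ask it to drop missing values first.
mean_without_na <- mean(meta_score, na.rm = TRUE)
print(mean_without_na)
```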

Creating Variables

We can also create new variables based on the values of existing variables. For example, suppose we want a new variable that indicates whether a film has a user rating of 9 or higher.

imdb$rating_above_nine <- ifelse(imdb$imdb_rating >= 9, 1, 0)
summary(imdb$rating_above_nine)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   0.000   0.005   0.000   1.000

The ifelse() function works by taking a logical statement about a variable and providing corresponding values for if that statement is true or false.
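To make the mechanics concrete, here is a small sketch of ifelse() on a made-up vector of four ratings. Because the result is coded as 0 and 1, its mean is the share of observations that meet the condition, which is why the Mean of 0.005 above corresponds to 5 of the 1,000 films.

```r
# Four made-up ratings for illustration.
imdb_rating <- c(9.3, 8.8, 9.0, 7.6)

# 1 where the condition is TRUE, 0 where it is FALSE.
rating_above_nine <- ifelse(imdb_rating >= 9, 1, 0)
print(rating_above_nine)  # 1 0 1 0

# The mean of a 0/1 variable is the share of 1s.
print(mean(rating_above_nine))  # 0.5
```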

Now, can we do the same thing to indicate if the movie came out after 2000?

imdb$post2000_movie <- ifelse(imdb$released_year > 2000, 1, 0)

Yes! But be careful…the released_year variable is a character variable.

class(imdb$released_year)

[1] "character"
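Why does the type matter? When a character vector is compared with a number, R converts the number to text and compares the two as strings, character by character. The sketch below uses made-up values to show how that can differ from a numeric comparison.

```r
# Made-up values stored as text; "999" is included to expose the problem.
years_as_text <- c("1994", "2008", "999")

# The number 2000 is converted to "2000" and compared as text,
# so "999" counts as greater because "9" sorts after "2".
text_result <- years_as_text > 2000
print(text_result)  # FALSE TRUE TRUE

# Converting to numbers first gives the comparison we actually want.
numeric_result <- as.numeric(years_as_text) > 2000
print(numeric_result)  # FALSE TRUE FALSE
```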

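So, we are lucky that worked. Because released_year is stored as text, R compared the values as character strings rather than as numbers, which happens to order four-digit years correctly. Better practice is to first convert it to a numeric variable.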

imdb$released_year_numeric <- as.numeric(imdb$released_year)

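Uh oh! R warns that NAs were introduced by coercion, which means at least one value could not be converted to a number and is now NA (a missing value). Let’s investigate by looking at the table of the original values again.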

table(imdb$released_year)


1920 1921 1922 1924 1925 1926 1927 1928 1930 1931 1932 1933 1934 1935 1936 1937 
   1    1    1    1    2    1    2    2    1    3    2    3    2    3    1    1 
1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948 1949 1950 1951 1952 1953 
   3    5    7    2    3    1    4    2    5    2    6    3    5    5    4    5 
1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 
   6    6    5    9    4    7   11    5   13    5    7    4    7   10    8    3 
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 
   3    9    8   12    6    9    7    3    7   12    8    4   11    5    9    9 
1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 
   9   12   11   11    8   12   12   23   13   19   10   19   17   17   19   27 
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 
  19   22   31   17   26   26   21   29   23   18   24   28   32   25   28   22 
2018 2019 2020   PG 
  19   23    6    1

Hmm, looks like one of the observations has a value of “PG” instead of a year…that’s an issue, so it’s nice that R converted that to an NA automatically.

However, you should always keep your eye out for weird data changes like that. You don’t want this to be happening to many of your observations, as that’s more indicative of a major data issue than a little error we can fix.
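One way to check how many observations are affected, and which ones, is to compare the values before and after conversion. The following is a minimal sketch on a made-up vector; the same idea could be applied to the real column.

```r
# Made-up values: one non-numeric entry and one value that was already missing.
released_year <- c("1994", "PG", "2008", NA)

# suppressWarnings() hides the coercion warning we already expect here.
converted <- suppressWarnings(as.numeric(released_year))

# Values that were present before conversion but are NA afterwards.
failed <- !is.na(released_year) & is.na(converted)

print(which(failed))          # position of the problem value
print(released_year[failed])  # the original text that could not be converted
print(sum(failed))            # how many values were affected
```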

Other Dataset Operations

We can also change the names of existing variables. Let’s rename the “series_title” variable to “movie_name”.

imdb <- imdb %>% rename(movie_name = series_title)
names(imdb)

 [1] "poster_link"           "movie_name"            "released_year"        
 [4] "certificate"           "runtime"               "genre_main"           
 [7] "genre"                 "imdb_rating"           "no_of_votes"          
[10] "overview"              "meta_score"            "director"             
[13] "star1"                 "star2"                 "star3"                
[16] "star4"                 "gross"                 "top10score"           
[19] "rating_above_nine"     "post2000_movie"        "released_year_numeric"
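Note that in rename() the new name goes on the left of the equals sign and the old name on the right. For comparison, the same change can be sketched in base R on a small made-up data frame by assigning into names():

```r
# A small made-up data frame for illustration.
demo <- data.frame(
  series_title = c("Movie A", "Movie B"),
  runtime = c(120, 95)
)

# Find the column called "series_title" and overwrite its name.
names(demo)[names(demo) == "series_title"] <- "movie_name"

print(names(demo))
```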

Now, let’s save our changes as a .csv file (the most common file type that you’ll be working with).

write.csv(imdb, file = "imdb_practice.csv", row.names = FALSE)
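One detail worth checking when saving with write.csv() is whether row names are written. By default, write.csv() writes R's row names as an extra first column, and passing row.names = FALSE leaves them out. The sketch below shows the difference by writing a small made-up data frame to a temporary file and reading it back.

```r
# A small made-up data frame and a temporary output file.
demo <- data.frame(
  movie_name = c("Movie A", "Movie B"),
  runtime = c(120, 95)
)
out_path <- tempfile(fileext = ".csv")

# Default behavior: row names are saved as an extra first column.
write.csv(demo, file = out_path)
with_row_names <- read.csv(out_path)
print(names(with_row_names))

# With row.names = FALSE, only the real columns are saved.
write.csv(demo, file = out_path, row.names = FALSE)
without_row_names <- read.csv(out_path)
print(names(without_row_names))
```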

Close Lesson 1

That’s it for the first lesson! We know that R can be intimidating at first, but keep practicing and you’ll have it mastered in no time. In the next lesson, we will cover descriptive statistics.
