On your R console, you will see 1) in the top left, 2) in the top right, 3) in the bottom left, and 4) in the bottom right.
Initial Setup
Before anything else, in every R file you make, you should first 1) clear the environment and 2) load any necessary packages.
To clear any data currently loaded into your environment, you can run the following command.
rm(list = ls())
Packages
You load a package in two steps: first install it with the install.packages("packagename") function, then load it with library(packagename). You only need to install a package on your computer once. Afterwards, you can load it in any new session using the library() function. For example, throughout this course we will be using the “tidyverse” package. The following code installs this package and then loads it.
install.packages("tidyverse")
library(tidyverse)
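Because installation is only needed once, one common pattern is to install a package only when it is missing. The following is a minimal sketch of that idea, not part of the course code; the helper names is_installed() and load_or_install() are hypothetical.

```r
# Hypothetical helper: TRUE if a package is already installed on this computer.
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# Hypothetical helper: install a package only if it is missing, then load it.
load_or_install <- function(pkg) {
  if (!is_installed(pkg)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so this check does not need an internet connection.
is_installed("stats")

# Usage in a script (commented out so nothing is installed by accident):
# load_or_install("tidyverse")
```

With this approach, rerunning the script does not reinstall the package every time.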
Checking the Working Directory
Before we load in data, it’s important to set your working directory and understand where you are saving files. Your working directory is where R will go to look for any data that you are trying to load in, and where it will save any data, figures, etc. unless directed otherwise. Typically, the working directory is wherever you have your R file saved, but you always want to check and make sure you are where you intend to be.
You can check where the working directory currently is by using the getwd() command.
getwd()
Changing the Working Directory
If this is not where you want your working directory, you can change it using the setwd() command and directing R to the correct directory. For example:
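A minimal sketch of checking and changing the working directory. It uses R's temporary folder from tempdir() as a stand-in so that it runs on any computer; in your own script you would substitute the path to your project folder (the path shown in the comment is a hypothetical example).

```r
# Remember where we started so we can return later.
original_dir <- getwd()

# Switch to another folder. tempdir() is only a stand-in here; in practice
# you would write something like
#   setwd("C:/Users/yourname/Documents/r_course")   # hypothetical path
setwd(tempdir())

# Confirm that the change took effect.
getwd()

# Return to the original folder.
setwd(original_dir)
```

Note that paths in R are written with forward slashes (or doubled backslashes), even on Windows.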
First, let’s take a look at the different variables that are included in this data.
poster_link: link of the poster that IMDb is using
series_title: name of the movie
released_year: year the movie was released
certificate: certificate earned by that movie (e.g., U, PG-13, R)
runtime: total runtime of the movie in minutes
genre: all genres of the movie
genre_main: first listed genre of the movie
imdb_rating: rating of the movie on IMDb by users
overview: written summary of the movie
meta_score: rating of the movie by critics
director: name of the director
star1, star2, star3, star4: names of main actors
no_of_votes: total number of user rating votes
gross: total money earned by the movie in USD
Loading in Data
Now, let’s load in the IMDb data. First, make sure the data file is in the same folder as your working directory. Then, run the following code.
imdb <- read_csv("imdb.csv")
In this command, we first specify the name of the object that we’d like to hold our data. In this case, an object named “imdb” will hold the data we’re reading in from the csv “imdb.csv”.
Viewing data
Having loaded in our data, you can take a look at the full dataset by typing the object name.
imdb
# A tibble: 1,000 × 18
poster_link series_title released_year certificate runtime genre_main genre
<chr> <chr> <chr> <chr> <dbl> <chr> <chr>
1 https://m.me… The Shawsha… 1994 A 142 Drama Drama
2 https://m.me… The Godfath… 1972 A 175 Crime Crim…
3 https://m.me… The Dark Kn… 2008 UA 152 Action Acti…
4 https://m.me… The Godfath… 1974 A 202 Crime Crim…
5 https://m.me… 12 Angry Men 1957 U 96 Crime Crim…
6 https://m.me… The Lord of… 2003 U 201 Action Acti…
7 https://m.me… Pulp Fiction 1994 A 154 Crime Crim…
8 https://m.me… Schindler's… 1993 A 195 Biography Biog…
9 https://m.me… Inception 2010 UA 148 Action Acti…
10 https://m.me… Fight Club 1999 A 139 Drama Drama
# ℹ 990 more rows
# ℹ 11 more variables: imdb_rating <dbl>, no_of_votes <dbl>, overview <chr>,
# meta_score <dbl>, director <chr>, star1 <chr>, star2 <chr>, star3 <chr>,
# star4 <chr>, gross <chr>, top10score <dbl>
You can also look at specific variables by typing the name of the dataset and then selecting a particular column.
imdb %>% select(director)
# A tibble: 1,000 × 1
director
<chr>
1 Frank Darabont
2 Francis Ford Coppola
3 Christopher Nolan
4 Francis Ford Coppola
5 Sidney Lumet
6 Peter Jackson
7 Quentin Tarantino
8 Steven Spielberg
9 Christopher Nolan
10 David Fincher
# ℹ 990 more rows
Describing Variables
First, let’s look at a table of variables in the dataset, their structures, and their first values using the str() function.
str(imdb)
spc_tbl_ [1,000 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ poster_link : chr [1:1000] "https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk"| __truncated__ "https://m.media-amazon.com/images/M/MV5BM2MyNjYxNmUtYTAwNi00MTYxLWJmNWYtYzZlODY3ZTk3OTFlXkEyXkFqcGdeQXVyNzkwMjQ"| __truncated__ "https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw@@._V1_UX67_CR0,0,67,98_AL_.jpg" "https://m.media-amazon.com/images/M/MV5BMWMwMGQzZTItY2JlNC00OWZiLWIyMDctNDk2ZDQ2YjRjMWQ0XkEyXkFqcGdeQXVyNzkwMjQ"| __truncated__ ...
$ series_title : chr [1:1000] "The Shawshank Redemption" "The Godfather" "The Dark Knight" "The Godfather: Part II" ...
$ released_year: chr [1:1000] "1994" "1972" "2008" "1974" ...
$ certificate : chr [1:1000] "A" "A" "UA" "A" ...
$ runtime : num [1:1000] 142 175 152 202 96 201 154 195 148 139 ...
$ genre_main : chr [1:1000] "Drama" "Crime" "Action" "Crime" ...
$ genre : chr [1:1000] "Drama" "Crime, Drama" "Action, Crime, Drama" "Crime, Drama" ...
$ imdb_rating : num [1:1000] 9.3 9.2 9 9 9 8.9 8.9 8.9 8.8 8.8 ...
$ no_of_votes : num [1:1000] 2343110 1620367 2303232 1129952 689845 ...
$ overview : chr [1:1000] "Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency." "An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son." "When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of th"| __truncated__ "The early life and career of Vito Corleone in 1920s New York City is portrayed, while his son, Michael, expands"| __truncated__ ...
$ meta_score : num [1:1000] 80 100 84 90 96 94 94 94 74 66 ...
$ director : chr [1:1000] "Frank Darabont" "Francis Ford Coppola" "Christopher Nolan" "Francis Ford Coppola" ...
$ star1 : chr [1:1000] "Tim Robbins" "Marlon Brando" "Christian Bale" "Al Pacino" ...
$ star2 : chr [1:1000] "Morgan Freeman" "Al Pacino" "Heath Ledger" "Robert De Niro" ...
$ star3 : chr [1:1000] "Bob Gunton" "James Caan" "Aaron Eckhart" "Robert Duvall" ...
$ star4 : chr [1:1000] "William Sadler" "Diane Keaton" "Michael Caine" "Diane Keaton" ...
$ gross : chr [1:1000] "28341469" "134966411" "534858444" "57300000" ...
$ top10score : num [1:1000] 0 1 0 1 1 1 1 1 0 0 ...
- attr(*, "spec")=
.. cols(
.. poster_link = col_character(),
.. series_title = col_character(),
.. released_year = col_character(),
.. certificate = col_character(),
.. runtime = col_double(),
.. genre_main = col_character(),
.. genre = col_character(),
.. imdb_rating = col_double(),
.. no_of_votes = col_double(),
.. overview = col_character(),
.. meta_score = col_double(),
.. director = col_character(),
.. star1 = col_character(),
.. star2 = col_character(),
.. star3 = col_character(),
.. star4 = col_character(),
.. gross = col_character(),
.. top10score = col_double()
.. )
- attr(*, "problems")=<externalptr>
To look at a particular variable, you can use the table() function and specify which variable you’d like a table of values for. To specify the column, you can use the $ operator and the name of that column. For example, this will tell us which main movie genres are in our dataset and how many observations fall under each one.
table(imdb$genre_main)
   Action Adventure Animation Biography    Comedy     Crime     Drama    Family
      172        72        82        88       155       107       289         2
  Fantasy Film-Noir    Horror   Mystery  Thriller   Western
        2         3        11        12         1         4
You can also look at how two variables relate to one another. For example, the following code tells us how the main genres vary across years for the movies in our top 1,000 dataset that were released from 2000 onward.
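One way to build such a two-way table is to pass two columns to table(). The sketch below uses a small made-up data frame rather than the real imdb data, and it assumes released_year is stored as a number; in the imdb data it is stored as text, so it would need to be converted first.

```r
# Toy stand-in for the imdb data (made-up rows, only the two columns we need).
movies <- data.frame(
  genre_main    = c("Drama", "Crime", "Action", "Drama", "Action", "Drama"),
  released_year = c(1994, 1972, 2008, 2010, 2010, 2008)
)

# Keep only movies released in 2000 or later.
recent <- movies[movies$released_year >= 2000, ]

# Two-way table: rows are genres, columns are years, cells are movie counts.
genre_by_year <- table(recent$genre_main, recent$released_year)
genre_by_year
```

Each cell counts the movies with that combination of genre and year, so the cells add up to the number of rows kept.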
We can also look at basic descriptive statistics of numeric variables using the summary() function.
summary(imdb$runtime)
Min. 1st Qu. Median Mean 3rd Qu. Max.
45.0 103.0 119.0 122.9 137.0 321.0
What if we look at summary() for the entire dataset, including for non-numeric variables?
summary(imdb)
poster_link series_title released_year certificate
Length:1000 Length:1000 Length:1000 Length:1000
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
runtime genre_main genre imdb_rating
Min. : 45.0 Length:1000 Length:1000 Min. :7.600
1st Qu.:103.0 Class :character Class :character 1st Qu.:7.700
Median :119.0 Mode :character Mode :character Median :7.900
Mean :122.9 Mean :7.949
3rd Qu.:137.0 3rd Qu.:8.100
Max. :321.0 Max. :9.300
no_of_votes overview meta_score director
Min. : 25088 Length:1000 Min. : 28.00 Length:1000
1st Qu.: 55526 Class :character 1st Qu.: 70.00 Class :character
Median : 138549 Mode :character Median : 79.00 Mode :character
Mean : 273693 Mean : 77.97
3rd Qu.: 374161 3rd Qu.: 87.00
Max. :2343110 Max. :100.00
NA's :157
star1 star2 star3 star4
Length:1000 Length:1000 Length:1000 Length:1000
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
gross top10score
Length:1000 Min. :0.000
Class :character 1st Qu.:0.000
Mode :character Median :0.000
Mean :0.318
3rd Qu.:1.000
Max. :1.000
Creating Variables
We can also create new variables based on certain values of existing variables. For example, we can create a new variable that indicates whether a film has a user rating of 9 or higher.
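A minimal sketch of such an indicator, using the $ operator and ifelse() on a made-up data frame. The variable name rating_9plus is hypothetical. With the tidyverse loaded, the same step could be written with mutate().

```r
# Toy stand-in for the imdb data (made-up titles, only the columns we need).
movies <- data.frame(
  series_title = c("Movie A", "Movie B", "Movie C"),
  imdb_rating  = c(9.3, 8.9, 9.0)
)

# Add a 0/1 indicator: 1 if the user rating is 9 or higher, 0 otherwise.
# "rating_9plus" is a hypothetical variable name.
movies$rating_9plus <- ifelse(movies$imdb_rating >= 9, 1, 0)

# Check the result by tabulating the new variable.
table(movies$rating_9plus)
```

Tabulating a new variable right after creating it is a quick way to confirm that it was coded as intended.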
Hmm, looks like one of the observations has a value of “PG” instead of a year…that’s an issue, so it’s nice that R converted that to an NA automatically.
However, you should always keep your eye out for weird data changes like that. You don’t want this to be happening to many of your observations, as that’s more indicative of a major data issue than a little error we can fix.
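To see how this kind of conversion behaves, here is a small sketch on a made-up vector of years stored as text. It assumes the conversion is done with as.numeric(), which is one common way to turn text into numbers.

```r
# Toy vector of years stored as text, with one bad entry (made-up values).
released_year <- c("1994", "1972", "PG", "2008")

# as.numeric() turns text that is not a number into NA and prints a warning.
year_num <- as.numeric(released_year)
year_num

# Count how many values became NA so that surprises do not go unnoticed.
sum(is.na(year_num))

# Show the original text of the entries that failed to convert.
released_year[is.na(year_num)]
```

Counting the NA values after a conversion shows whether you are dealing with a single stray entry or a larger problem.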
Other Dataset Operations
We can also change the names of existing variables. Let’s rename the “movie_name” variable to “series_name”.
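A minimal sketch of renaming a column in base R, using a made-up data frame that has a movie_name column (the imdb data shown earlier names this column series_title). With the tidyverse loaded, the equivalent would be rename(series_name = movie_name), where the new name goes on the left.

```r
# Toy data frame with a movie_name column (made-up rows).
movies <- data.frame(
  movie_name  = c("Movie A", "Movie B"),
  imdb_rating = c(9.3, 8.9)
)

# Rename movie_name to series_name, leaving the other columns unchanged.
names(movies)[names(movies) == "movie_name"] <- "series_name"

# Confirm the new column names.
names(movies)
```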
Now, let’s save our changes as a .csv file (the most common file type that you’ll be working with).
write.csv(imdb, file = "imdb_practice.csv")
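By default, write.csv() also writes the row numbers as an extra first column. The sketch below, on a made-up data frame, shows the row.names = FALSE option that leaves them out, then reads the file back in to check the result. It writes to R's temporary folder so nothing is left behind in your working directory.

```r
# Toy data frame standing in for imdb (made-up rows).
movies <- data.frame(
  series_title = c("Movie A", "Movie B"),
  runtime      = c(142, 175)
)

# Write to a temporary folder so the sketch does not clutter your own folder.
out_file <- file.path(tempdir(), "imdb_practice.csv")

# row.names = FALSE stops write.csv() from adding a column of row numbers.
write.csv(movies, file = out_file, row.names = FALSE)

# Read the file back in to confirm it was saved as expected.
check <- read.csv(out_file)
names(check)
```

Reading a file back in after saving it is an easy way to catch unexpected extra columns or changed variable types.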
Close Lesson 1
That’s it for the first lesson! We know that R can be intimidating at first, but keep practicing and you’ll have it mastered in no time. In the next lesson, we will cover descriptive statistics.