Lesson 1: Intro to plotting data in R with ggplot
Functions for Lesson 1
?, str, glimpse, summary, table, min, max, ggplot, geom_point, geom_smooth, theme_minimal, theme_classic, theme_tufte
Packages for Lesson 1
tidyverse, ggplot2, dplyr
Agenda
Data visualisation in R for Data Science, Section 3.1.1.
- Intro to the R environment (IDE)
- Loading packages, e.g. tidyverse
- Using built-in R data: the mpg dataset
- Using ggplot with the built-in data set (to make scatterplots)
- Modifying plot aesthetics
- Reading in outside data: Airbnb data
- Plotting Airbnb data with ggplot
Intro to the R environment (IDE)
The RStudio integrated development environment (IDE) and what you can do with it.
A more complete example of what you can achieve with the interface.
Loading packages, e.g. tidyverse
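How to load packages in R.
install.packages("tidyverse") # install the package (only needed once)
library(tidyverse) # load the package for this session
require(tidyverse) # like library(), but returns FALSE with a warning instead of stopping with an error if the package is missing
# We are typing in an R script. Lines that start with # are comments: notes to ourselves that R ignores
# Press Command + Return (Ctrl + Enter on Windows) to execute the current line, i.e. 'run the code'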
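The difference between library() and require() only shows up when a package is missing. The sketch below illustrates it with a deliberately made-up package name (notARealPackage123 is hypothetical and assumed not to be installed), so it needs nothing beyond base R.

```r
# A made-up package name, assumed not to be installed anywhere
pkg <- "notARealPackage123"

# require() returns FALSE (with a warning) when the package is missing
loaded <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
print(loaded)

# library() stops with an error instead, so we catch it here
result <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "library() raised an error"
)
print(result)
```

In a script, library() is usually the safer choice because a missing package stops the run immediately rather than failing later with a less obvious message.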
Using built-in R data: the mpg dataset
Section 3.2.1
We'll use a built-in tidyverse dataset called mpg, which contains data about cars and their gas mileage.
mpg
# open the help page with '?'
?mpg
- This is a tibble (data frame) that we've "printed" out. It's like R's version of an Excel spreadsheet, but much better.
- Printing a tibble shows the first 10 rows of data, the column names, and the class of data within each column, such as numeric, integer, or character.
Summarising data
str(mpg) # structure of data
glimpse(mpg) # preview of data
summary(mpg) # basic summary stats
table(mpg$manufacturer) # counts of each distinct value in a column
head(mpg) # show the first 6 rows of data
tail(mpg, 10) # show the last 10 (or N) rows of data
names(mpg) # get column names
class(mpg) # class of data frame
class(mpg$manufacturer) # class of data column
mpg$displ # print a column
mpg$hwy # print a column
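The function list for this lesson also includes min and max, which work on a single column. A short sketch, assuming only that ggplot2 is installed (it supplies mpg); the variable names are illustrative:

```r
library(ggplot2)  # provides the mpg dataset

# Smallest and largest highway mileage in the data
lowest_hwy  <- min(mpg$hwy)
highest_hwy <- max(mpg$hwy)

# range() returns both values at once
hwy_range <- range(mpg$hwy)
hwy_range

# Number of rows and columns
dim(mpg)

# Count cars per manufacturer, sorted from most to fewest
manufacturer_counts <- sort(table(mpg$manufacturer), decreasing = TRUE)
head(manufacturer_counts, 3)
```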
Creating a plot with ggplot
Section 3.2.2
ggplot()
Creates a coordinate system for us--basically an empty graph.
geom_point()
Adds a "layer" of points to the plot. There are many other geoms for different kinds of graphs.
Plot two of the data columns
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
Changing the data column inputs for the x and y axis of the plot
ggplot(data = mpg) + geom_point(mapping = aes(x = class, y = drv))
Assign data to variables to create dynamic inputs
my_data <- mpg # create own variable using a name of your choice
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy))
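Layers can be stacked. As a sketch of how geom_smooth (listed among this lesson's functions) combines with geom_point, the example below adds a smoothed trend line to the scatterplot. The choice of method = "loess" is an assumption made here to keep the call explicit; other smoothing methods are available.

```r
library(ggplot2)

# Scatterplot plus a smoothed trend line. Both layers share the same mapping,
# so it is declared once in ggplot() rather than repeated in each geom.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

p  # printing the object draws the plot
```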
Themes
Change plot style. Link for more ggplot themes.
library(ggthemes) # extra themes such as theme_tufte()
# minimal theme
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy)) + theme_minimal()
# Tufte theme (from ggthemes)
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy)) + theme_tufte()
# assign theme to variable
my_theme <- theme_classic()
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy)) + my_theme # apply your chosen theme
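A stored theme can also be adjusted before it is reused. The following is a minimal sketch, not part of the original lesson: my_large_theme is a hypothetical name, and the text size and legend position are arbitrary choices.

```r
library(ggplot2)

# Start from a built-in theme, enlarge the text, and move the legend
my_large_theme <- theme_classic(base_size = 14) +
  theme(legend.position = "bottom")

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = drv)) +
  my_large_theme
p
```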
Aesthetic mapping
Section 3.3
- colour: Change the colour of the data points.
- size: Change the size of the data points.
- alpha: Change the transparency of the data points.
Colour
Colour by colour name.
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy), colour = "light blue") + my_theme
Colour by a hex code in quotes.
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy), colour = "#BB5C42") + my_theme
Colour by data column.
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy, colour = class)) + my_theme
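Inside versus outside the aes
A constant placed inside aes() is treated as data, not as a colour name: the points below are drawn in ggplot's default colour with a legend entry labelled "blue", not in blue.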
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy, colour = "blue")) + my_theme
Size
Size by integer
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy, size = 5)) + my_theme
Size by data column
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy, size = class)) + my_theme
We get a warning because class is a discrete variable and size implies an ordering, but the plot is still drawn.
Transparency
# map class column to different transparencies
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy, alpha = class)) + my_theme
Shape
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy, shape = class)) + my_theme
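Any warnings? Yes: by default ggplot uses at most six shapes at a time, and class has seven distinct values, so the points for the seventh (suv) are not plotted.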
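One way around the shape limit is to supply the shapes yourself. The sketch below uses scale_shape_manual with one shape number per class; the particular shape numbers are an arbitrary choice made for illustration.

```r
library(ggplot2)

# class has more distinct values than the default shape palette covers
n_classes <- length(unique(mpg$class))
n_classes

# Supply one shape number per class so that no group is dropped
p_shapes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = seq_len(n_classes) - 1)  # shapes 0, 1, 2, ...
p_shapes
```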
Manually changing aesthetic properties
But we can set the aesthetic properties manually, instead of having ggplot do the scaling automatically. For example, we can make all of our points blue like this, this time putting colour OUTSIDE the aes argument.
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy), colour = "blue") + my_theme
Using colour both inside and outside the aes
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy, colour = class), colour = "#AE42BB") +
my_theme
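The inner one is overridden: the constant set outside aes() takes precedence, so every point is drawn in #AE42BB and no colour legend appears.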
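The inside-versus-outside distinction can be checked without looking at the plot. This sketch uses layer_data(), which returns the values ggplot2 computes for a layer; p_inside and p_outside are illustrative names.

```r
library(ggplot2)

# Inside aes(): "blue" is treated as data, a variable with one value
p_inside <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = "blue"))

# Outside aes(): "blue" is a fixed setting applied to every point
p_outside <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), colour = "blue")

# layer_data() returns the values ggplot2 actually uses when drawing
unique(layer_data(p_inside)$colour)   # a default palette colour, not "blue"
unique(layer_data(p_outside)$colour)  # "blue"
```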
Putting it all together as a snapshot of what's possible
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy, colour = class, size = class, alpha = class)) +
my_theme
Aesthetics you can manually set
- The name of a colour as a character string.
- The size of a point in mm.
- The shape of a point as a number, as shown in Figure 3.1.
R has 25 built-in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the colour and fill aesthetics. The hollow shapes (0-14) have a border determined by colour; the solid shapes (15-18) are filled with colour; the filled shapes (21-24) have a border of colour and are filled with fill.
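To see colour and fill interact, a filled shape can be set manually. A minimal sketch; the colours, size, and stroke width are arbitrary values picked for illustration.

```r
library(ggplot2)

# Shape 21 is a filled circle: colour sets the border, fill sets the interior
p_filled <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    shape = 21, colour = "black", fill = "lightblue", size = 3, stroke = 1
  )
p_filled
```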
Further plotting examples
Section 3.3.1
The online reference contains further examples of how to visualise your data.
Reading in outside data: NYC Airbnb data
# A tibble: 6 x 74
id listing_url scrape_id last_scraped name description neighborhood_ov… picture_url host_id
<dbl> <chr> <dbl> <date> <chr> <chr> <chr> <chr> <dbl>
1 2595 https://www… 2.02e13 2021-04-09 Skyli… "Beautiful,… Centrally locat… https://a0… 2845
2 3831 https://www… 2.02e13 2021-04-12 Whole… "Enjoy 500 … Just the right … https://a0… 4869
3 5121 https://www… 2.02e13 2021-04-09 Bliss… "<b>The spa… <NA> https://a0… 7356
4 5136 https://www… 2.02e13 2021-04-10 Spaci… "We welcome… <NA> https://a0… 7378
5 5178 https://www… 2.02e13 2021-04-12 Large… "Please don… Theater distric… https://a0… 8967
6 5203 https://www… 2.02e13 2021-04-09 Cozy … "Our best g… Our neighborhoo… https://a0… 7490
# … with 65 more variables: host_url <chr>, host_name <chr>, host_since <date>,
# host_location <chr>, host_about <chr>, host_response_time <chr>, host_response_rate <chr>,
# host_acceptance_rate <chr>, host_is_superhost <lgl>, host_thumbnail_url <chr>,
# host_picture_url <chr>, host_neighbourhood <chr>, host_listings_count <dbl>,
# host_total_listings_count <dbl>, host_verifications <chr>, host_has_profile_pic <lgl>,
# host_identity_verified <lgl>, neighbourhood <chr>, neighbourhood_cleansed <chr>,
# neighbourhood_group_cleansed <chr>, latitude <dbl>, longitude <dbl>, property_type <chr>,
# room_type <chr>, accommodates <dbl>, bathrooms <lgl>, bathrooms_text <chr>, bedrooms <dbl>,
# beds <dbl>, amenities <chr>, price <chr>, minimum_nights <dbl>, maximum_nights <dbl>,
# minimum_minimum_nights <dbl>, maximum_minimum_nights <dbl>, minimum_maximum_nights <dbl>,
# maximum_maximum_nights <dbl>, minimum_nights_avg_ntm <dbl>, maximum_nights_avg_ntm <dbl>,
# calendar_updated <lgl>, has_availability <lgl>, availability_30 <dbl>, availability_60 <dbl>,
# availability_90 <dbl>, availability_365 <dbl>, calendar_last_scraped <date>,
# number_of_reviews <dbl>, number_of_reviews_ltm <dbl>, number_of_reviews_l30d <dbl>,
# first_review <date>, last_review <date>, review_scores_rating <dbl>,
# review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>, review_scores_checkin <dbl>,
# review_scores_communication <dbl>, review_scores_location <dbl>, review_scores_value <dbl>,
# license <lgl>, instant_bookable <lgl>, calculated_host_listings_count <dbl>,
# calculated_host_listings_count_entire_homes <dbl>,
# calculated_host_listings_count_private_rooms <dbl>,
# calculated_host_listings_count_shared_rooms <dbl>, reviews_per_month <dbl>
Using a smaller dataset
# read the listings file, then keep a smaller subset of its rows
url <- "http://data.insideairbnb.com/united-states/ny/new-york-city/2021-04-07/data/listings.csv.gz"
nyc <- read_csv(url)
nyc <- nyc[nyc$id < 20000, ] # get smaller subset of data
length(nyc$id) # print length of 'id' column
head(nyc)
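The printed tibble lists price as <chr>, so it is read as text and not as a number. If the values look like "$150.00" (an assumption; check with head(nyc$price)), one option is readr's parse_number(). The example prices below are made up for illustration.

```r
library(readr)

# Hypothetical prices in the format the price column is assumed to use
example_prices <- c("$150.00", "$75.00", "$1,200.00")

# parse_number() drops the currency symbol and the thousands separator
numeric_prices <- parse_number(example_prices)
numeric_prices

# The same call applied to the lesson's data would be:
# nyc$price <- parse_number(nyc$price)
```

With a numeric price column, ggplot draws a continuous y axis instead of treating every distinct price string as its own category.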
Plotting Airbnb data with ggplot
Using the above plotting functions to visualise the Airbnb data
# plot neighborhood_group vs price
ggplot(data = nyc) + geom_point(mapping = aes(x = neighbourhood_group_cleansed, y = price, colour = neighbourhood_group_cleansed),
shape = 21, stroke = 1) + my_theme
# plot minimum_nights vs price
ggplot(data = nyc) + geom_point(mapping = aes(x = minimum_nights, y = price, colour = neighbourhood_group_cleansed),
shape = 20, size = 3, stroke = 1) + my_theme
# availability_365 vs price
ggplot(data = nyc) + geom_point(mapping = aes(x = availability_365, y = price, colour = neighbourhood_group_cleansed),
shape = 21, stroke = 1) + my_theme
# plot longitude vs price
ggplot(data = nyc) + geom_point(mapping = aes(x = longitude, y = price, colour = neighbourhood_group_cleansed),
shape = 21, stroke = 1) + my_theme
Try your own plot using the other variables in the dataset
# explore the variables available in the data
names(nyc)
glimpse(nyc)
# replace each NULL below with your own choice before building your plot
my_data <- NULL
x <- NULL
y <- NULL
colour <- NULL
shape <- NULL
stroke <- NULL
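One possible way to turn these placeholders into a reusable plot is sketched below. scatter_by is a hypothetical helper, not part of the lesson or of ggplot2; it takes column names as strings and looks them up with the .data pronoun. The example call uses the built-in mpg data so that it runs without a download.

```r
library(ggplot2)

# Hypothetical helper: column names are passed as strings and looked up
# in the data with the .data pronoun
scatter_by <- function(data, x, y, colour) {
  ggplot(data = data) +
    geom_point(
      mapping = aes(x = .data[[x]], y = .data[[y]], colour = .data[[colour]]),
      shape = 21, stroke = 1
    ) +
    labs(x = x, y = y, colour = colour) +
    theme_classic()
}

# Example with the built-in mpg data
p_own <- scatter_by(mpg, x = "cty", y = "hwy", colour = "drv")
p_own
```

The same helper could be called on the Airbnb data, for example with x = "availability_365" and y = "price", once the columns of interest have been chosen.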
