Lesson 1: Intro to plotting data in R with ggplot

 

Functions for Lesson 1
?, str, glimpse, summary, table, min, max, ggplot, geom_point, geom_smooth, theme_minimal, theme_classic, theme_tufte
 

Packages for Lesson 1
tidyverse, ggplot2, dplyr, ggthemes
 

Agenda

Data visualisation in R for Data Science, Section 3.1.1.

  • Intro to the R environment (IDE)
  • Loading packages, e.g. tidyverse
  • Using built-in R data: the mpg dataset
  • Using ggplot with the built-in data set (to make scatterplots)
  • Modifying plot aesthetics
  • Reading in outside data: Airbnb data
  • Plotting Airbnb data with ggplot

 

Intro to the R environment (IDE)

The RStudio integrated development environment (IDE) and what you can do with it.
 

 

A more complete example of what you can achieve with the interface.
 

 

Loading packages, e.g. tidyverse

How to load packages in R.

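install.packages("tidyverse")  # install the package (only needed once)
library(tidyverse)  # load the package; stops with an error if it is not installed
require(tidyverse)  # like library(), but returns FALSE with a warning instead of an error

# We are typing in an R script. Lines starting with # are comments: notes to ourselves that R ignores.
# Press Cmd + Return (Ctrl + Enter on Windows/Linux) to execute the current line, i.e. 'run the code'.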
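
The difference between library() and require() is easiest to see with a package that is not installed. A minimal sketch, in which the package name notARealPackage123 is made up for illustration:

```r
# require() returns TRUE or FALSE instead of stopping with an error.
# "notARealPackage123" is a hypothetical name, chosen because it should not exist.
found <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)
found

# library() stops with an error for the same package; try() lets the script carry on.
result <- try(library("notARealPackage123", character.only = TRUE), silent = TRUE)
inherits(result, "try-error")
```

In everyday scripts library() is usually the safer choice, because a missing package is reported straight away rather than further down the script.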

 

Using built-in R data: the mpg dataset

Section 3.2.1

We'll use mpg, a dataset built into the tidyverse (it ships with ggplot2), with data about cars and their gas mileage.

mpg
# open the help page with '?'
?mpg
  • This is a tibble (a kind of data frame) that we've "printed" out. It's like R's version of an Excel spreadsheet, but much better.
  • Printing a tibble shows the first 10 rows of data, the column names, and the type of data in each column, such as numeric (double), integer, or character.

Summarising data

str(mpg)  # structure of data
glimpse(mpg)  # preview of data 
summary(mpg)  # basic summary stats  
table(mpg$manufacturer)  # count of rows for each manufacturer
head(mpg)  # show the first 6 rows of data
tail(mpg, 10)  # show the last 10 (or N) rows of data
names(mpg)  # get column names
class(mpg)  # class of data frame
class(mpg$manufacturer)  # class of data column
mpg$displ  # print a column
mpg$hwy  # print a column
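
The function list for this lesson also mentions min and max, which summarise a single column. A short sketch using the same mpg columns as above (the variable names hwy_min and hwy_max are just for illustration):

```r
library(ggplot2)  # mpg ships with ggplot2, which library(tidyverse) also loads

dim(mpg)  # number of rows and columns

hwy_min <- min(mpg$hwy)  # lowest highway mileage
hwy_max <- max(mpg$hwy)  # highest highway mileage
c(hwy_min, hwy_max)

range(mpg$displ)  # min and max of engine displacement in one call
```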

 

Creating a plot with ggplot

Section 3.2.2

  • ggplot() creates a coordinate system for us, basically an empty graph.
  • geom_point() adds a "layer" of points to that graph, which gives a scatterplot (there are many other geoms for different kinds of graphs).

Plot two of the data columns

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))
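
Because a plot is built from layers, it can be assembled step by step. The sketch below stores the empty coordinate system in a variable (the names base and p are arbitrary), then adds a point layer and a geom_smooth() layer, which draws a smoothed trend line through the same data:

```r
library(ggplot2)

base <- ggplot(data = mpg)  # empty graph: no layers yet
base

p <- base +
  geom_point(mapping = aes(x = displ, y = hwy)) +   # layer 1: the points
  geom_smooth(mapping = aes(x = displ, y = hwy))    # layer 2: a smoothed trend line
p

length(p$layers)  # number of layers added so far
```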

 

Changing the data column inputs for the x and y axis of the plot

ggplot(data = mpg) + geom_point(mapping = aes(x = class, y = drv))

 

Assign data to variables to create dynamic inputs

my_data <- mpg  # create own variable using a name of your choice  

ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy))

 

Themes

Change the plot style. More ggplot themes are available in the ggthemes package.

require(ggthemes)  # provides theme_tufte()

# minimal theme
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy)) + theme_minimal()

# Tufte theme (from ggthemes)
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy)) + theme_tufte()

# assign theme to variable
my_theme <- theme_classic()
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy)) + my_theme  # apply your chosen theme  
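
Storing a theme in a variable also makes it easy to adjust in one place. A minimal sketch, assuming you want larger text: base_size is an argument of the built-in ggplot2 themes, and the value 14 and the name my_big_theme are arbitrary choices for illustration.

```r
library(ggplot2)

my_big_theme <- theme_classic(base_size = 14)  # classic theme with larger text

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  my_big_theme
```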

 

Aesthetic mapping

Section 3.3

  • colour. Change the colour of the data points.
  • size. Change the size of the data points.
  • alpha. Change the transparency of the data points.

Colour

Colour by colour name.

ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy), colour = "light blue") + my_theme

 

Colour by a hex code in quotes.

ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy), colour = "#BB5C42") + my_theme

 

Colour by data column

ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy, colour = class)) + my_theme

 

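Inside versus outside the aes

With colour = "blue" inside aes(), ggplot treats "blue" as a data value rather than a colour: the points are drawn in the default scale colour and a legend labelled "blue" appears.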

ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy, colour = "blue")) + my_theme

 

Size

Size by integer

ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy), size = 5) + my_theme

 

Size by data column

ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy, size = class)) + my_theme

 

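We get a warning, but this is okay: ggplot2 is pointing out that size implies an order, while class is an unordered categorical variable.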
 

Transparency

# map the class column to different transparencies
ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy, alpha = class)) + my_theme

 

Shape

ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy, shape = class)) + my_theme

 

Any warnings? Yes: by default ggplot2 uses only six shapes at a time, and class has seven levels, so the points for the seventh level (suv) are not plotted.
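
One way around the six-shape limit is to supply the shapes yourself. This is a sketch rather than part of the original lesson: scale_shape_manual() is a ggplot2 function, and the seven shape numbers below are an arbitrary choice, one per level of class.

```r
library(ggplot2)

seven_shapes <- c(0, 1, 2, 3, 4, 5, 6)  # one shape number per class level

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = seven_shapes)
p

# check whether any row was left without a shape
anyNA(layer_data(p)$shape)
```

Even when it works, seven shapes are hard to tell apart, so colour is often the more readable choice for a variable with this many levels.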

Manually changing aesthetic properties

We can also set the aesthetic properties manually, instead of having ggplot do the scaling automatically. For example, we can make all of the points blue by putting colour outside the aes() argument.

ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy), colour = "blue") + my_theme

 

Using colour both inside and outside the aes

ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy, colour = class), colour = "#AE42BB") + 
    my_theme

 

The colour mapped inside aes() is overridden: every point is drawn in the fixed colour set outside it.
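
To check what ggplot actually draws in each case, you can inspect the computed colours with layer_data(), a ggplot2 helper that returns the data used to draw a layer. A small sketch (the variable names inside, outside, and both are illustrative):

```r
library(ggplot2)

# colour inside aes(): "blue" is treated as data and mapped through the colour scale
inside <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = "blue"))

# colour outside aes(): every point is set to blue
outside <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), colour = "blue")

# both: the fixed colour outside aes() wins over the mapping to class
both <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = class), colour = "#AE42BB")

unique(layer_data(inside)$colour)   # a default scale colour, not "blue"
unique(layer_data(outside)$colour)  # "blue"
unique(layer_data(both)$colour)     # "#AE42BB"
```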
 

Putting it all together as a snapshot of what's possible

ggplot(data = my_data) + geom_point(mapping = aes(x = displ, y = hwy, colour = class, size = class, alpha = class)) + 
    my_theme

Aesthetics you can manually set

  • The name of a colour as a character string.
  • The size of a point in mm.
  • The shape of a point as a number, as shown in Figure 3.1.

 

R has 25 built-in shapes that are identified by numbers. There are some seeming duplicates: for example, 0, 15, and 22 are all squares. The difference comes from the interaction of the colour and fill aesthetics. The hollow shapes (0-14) have a border determined by colour; the solid shapes (15-18) are filled with colour; the filled shapes (21-24) have a border of colour and are filled with fill.
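
The interaction between colour and fill described above can be tried with a filled shape. A minimal sketch using shape 21 (a filled circle); the particular colours, size, and stroke width are arbitrary choices:

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    shape = 21,           # filled circle: has both a border and an interior
    colour = "black",     # border colour
    fill = "light blue",  # interior colour
    size = 3,
    stroke = 1            # border thickness
  )
p
```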

Further plotting examples

Section 3.3.1

The online reference contains further examples of how to visualise your data.

Reading in outside data: NYC Airbnb data

# A tibble: 6 x 74
     id listing_url  scrape_id last_scraped name   description  neighborhood_ov… picture_url host_id
  <dbl> <chr>            <dbl> <date>       <chr>  <chr>        <chr>            <chr>         <dbl>
1  2595 https://www…   2.02e13 2021-04-09   Skyli… "Beautiful,… Centrally locat… https://a0…    2845
2  3831 https://www…   2.02e13 2021-04-12   Whole… "Enjoy 500 … Just the right … https://a0…    4869
3  5121 https://www…   2.02e13 2021-04-09   Bliss… "<b>The spa… <NA>             https://a0…    7356
4  5136 https://www…   2.02e13 2021-04-10   Spaci… "We welcome… <NA>             https://a0…    7378
5  5178 https://www…   2.02e13 2021-04-12   Large… "Please don… Theater distric… https://a0…    8967
6  5203 https://www…   2.02e13 2021-04-09   Cozy … "Our best g… Our neighborhoo… https://a0…    7490
# … with 65 more variables: host_url <chr>, host_name <chr>, host_since <date>,
#   host_location <chr>, host_about <chr>, host_response_time <chr>, host_response_rate <chr>,
#   host_acceptance_rate <chr>, host_is_superhost <lgl>, host_thumbnail_url <chr>,
#   host_picture_url <chr>, host_neighbourhood <chr>, host_listings_count <dbl>,
#   host_total_listings_count <dbl>, host_verifications <chr>, host_has_profile_pic <lgl>,
#   host_identity_verified <lgl>, neighbourhood <chr>, neighbourhood_cleansed <chr>,
#   neighbourhood_group_cleansed <chr>, latitude <dbl>, longitude <dbl>, property_type <chr>,
#   room_type <chr>, accommodates <dbl>, bathrooms <lgl>, bathrooms_text <chr>, bedrooms <dbl>,
#   beds <dbl>, amenities <chr>, price <chr>, minimum_nights <dbl>, maximum_nights <dbl>,
#   minimum_minimum_nights <dbl>, maximum_minimum_nights <dbl>, minimum_maximum_nights <dbl>,
#   maximum_maximum_nights <dbl>, minimum_nights_avg_ntm <dbl>, maximum_nights_avg_ntm <dbl>,
#   calendar_updated <lgl>, has_availability <lgl>, availability_30 <dbl>, availability_60 <dbl>,
#   availability_90 <dbl>, availability_365 <dbl>, calendar_last_scraped <date>,
#   number_of_reviews <dbl>, number_of_reviews_ltm <dbl>, number_of_reviews_l30d <dbl>,
#   first_review <date>, last_review <date>, review_scores_rating <dbl>,
#   review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>, review_scores_checkin <dbl>,
#   review_scores_communication <dbl>, review_scores_location <dbl>, review_scores_value <dbl>,
#   license <lgl>, instant_bookable <lgl>, calculated_host_listings_count <dbl>,
#   calculated_host_listings_count_entire_homes <dbl>,
#   calculated_host_listings_count_private_rooms <dbl>,
#   calculated_host_listings_count_shared_rooms <dbl>, reviews_per_month <dbl>

 

Using a smaller dataset

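# listings csv file (gzipped); read_csv() downloads and decompresses it for us
url <- "http://data.insideairbnb.com/united-states/ny/new-york-city/2021-04-07/data/listings.csv.gz"

nyc <- read_csv(url)
nyc <- nyc[nyc$id < 20000, ]  # keep a smaller subset of the data
nyc$price <- parse_number(nyc$price)  # price is read in as text (<chr>), so convert it to a number
length(nyc$id)  # number of rows left, via the length of the 'id' column
head(nyc)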
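
The same read, subset, and convert steps can be tried offline on a tiny table. This is a sketch only: the three rows below are made-up values, not real listings, and wrapping the text in I() tells read_csv() to treat it as literal data rather than a file path. The price column is read as text on purpose, to mirror the <chr> price column in the printed tibble above, and parse_number() from readr then turns it into a number that can go on a numeric axis.

```r
library(readr)

# made-up example rows; prices are quoted because one of them contains a comma
csv_text <- 'id,price,minimum_nights
101,"$150.00",30
202,"$1,200.00",2
30000,"$80.00",1'

toy <- read_csv(
  I(csv_text),
  col_types = cols(
    id = col_double(),
    price = col_character(),  # keep price as text, as in the Airbnb file
    minimum_nights = col_double()
  )
)

toy <- toy[toy$id < 20000, ]          # keep only the rows with smaller ids
toy$price <- parse_number(toy$price)  # drop "$" and "," and convert to numeric
toy
```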

 

Plotting Airbnb data with ggplot

Use the plotting functions above to visualise the Airbnb data.

# plot neighbourhood group vs price
ggplot(data = nyc) + geom_point(mapping = aes(x = neighbourhood_group_cleansed, y = price, colour = neighbourhood_group_cleansed), 
    shape = 21, stroke = 1) + my_theme

# plot minimum_nights vs price
ggplot(data = nyc) + geom_point(mapping = aes(x = minimum_nights, y = price, colour = neighbourhood_group_cleansed), 
    shape = 20, size = 3, stroke = 1) + my_theme

# availability_365 vs price
ggplot(data = nyc) + geom_point(mapping = aes(x = availability_365, y = price, colour = neighbourhood_group_cleansed), 
    shape = 21, stroke = 1) + my_theme

# plot longitude vs price
ggplot(data = nyc) + geom_point(mapping = aes(x = longitude, y = price, colour = neighbourhood_group_cleansed), 
    shape = 21, stroke = 1) + my_theme

Try your own plot using the other variables in the dataset

# list the available columns
names(nyc)
glimpse(nyc)

# placeholders: replace each NULL with your own choice, then build your plot
my_data <- NULL
x <- NULL
y <- NULL
colour <- NULL
shape <- NULL
stroke <- NULL
