Lesson 8: Scraping and manipulating text strings

 

Functions for Lesson 8
str_which, str_detect, str_locate, str_view, str_sub
 

Packages for Lesson 8
stringr
 

Agenda

Use the stringr package to cut, substitute, print, and manipulate character strings in R. This is useful for scraping text from webpages, searching PDFs and text files for given characters and words, mining genomics data, etc.

Cheat sheet for the stringr package.
 

 

Install necessary packages

# install.packages('pacman') # uncomment and install this first
pacman::p_load(stringr, stringi, dplyr, reprex, xml2, rvest)

First, we need some text data. Since this lesson is about strings, our sample will be all the text from the strings chapter of the R for Data Science textbook.

require(xml2)  # read html data
require(rvest)  # select html elements

url <- "https://r4ds.had.co.nz/strings.html"
txt <- url %>% read_html %>% html_text()  # scrape web text from url  
txt %>% str
 chr "14 Strings | R for Data Science\n    window.dataLayer = window.dataLayer || [];\n    function gtag(){dataLayer."| __truncated__
txt %>% str_length()  # number of characters in the string
[1] 49589

Detecting strings

Search for location of string patterns using str_detect, str_which, and str_locate

pat <- "strings"  # string pattern to search for  
txt %>% str_detect(pat)  # returns TRUE for each element that contains the pattern
[1] TRUE
txt %>% str_which(pat)  # index of each element that contains the pattern
[1] 1
txt %>% str_locate(pat)  # show character positions of the first instance of pattern 
     start  end
[1,]  1089 1095
txt %>% str_locate_all(pat)  # show all positions   
[[1]]
      start   end
 [1,]  1089  1095
 [2,]  1257  1263
 [3,]  1381  1387
 [4,]  1734  1740
 [5,]  2935  2941
 [6,]  3072  3078
 [7,]  3276  3282
 [8,]  3787  3793
 [9,]  3819  3825
[10,]  4772  4778
[11,]  4896  4902
[12,]  5429  5435
[13,]  6452  6458
[14,]  7719  7725
[15,]  8275  8281
[16,]  8943  8949
[17,]  9098  9104
[18,]  9182  9188
[19,] 10161 10167
[20,] 10265 10271
[21,] 19511 19517
[22,] 28886 28892
[23,] 36969 36975
[24,] 38179 38185
[25,] 40201 40207
[26,] 44744 44750
[27,] 46332 46338
[28,] 48152 48158
[29,] 48418 48424
[30,] 48444 48450
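
Because txt is a single string, each call above returns one result. As a minimal sketch, the same functions applied to a small made-up vector (fruit is hypothetical and not part of the scraped page) show how they work element by element:

```r
library(stringr)

# Hypothetical toy vector, used only for illustration
fruit <- c("apple pie", "banana", "pineapple", "cherry")

has_apple   <- str_detect(fruit, "apple")  # one TRUE/FALSE per element
which_apple <- str_which(fruit, "apple")   # indices of the matching elements
where_apple <- str_locate(fruit, "apple")  # start/end of first match; NA if none

has_apple
which_apple
where_apple
```

Elements 1 and 3 match, so str_which() returns those two indices, and str_locate() returns a row of NA for each element without a match.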

Subsetting strings

Subset and cut up strings into manageable pieces

# subset string portion based on char position
txt %>% str_sub(txt %>% str_locate(pat))  # use the start/end positions found above
[1] "strings"

# overwrite the characters at positions 1 to 2 with user text (this modifies txt)
str_sub(txt, 1, 2) <- "INSERT TEXT AT POSITION"
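
Note that assigning to str_sub() overwrites the selected characters. A small sketch on a made-up string (x and y are hypothetical names) contrasts this with one possible way of inserting text while keeping the original characters:

```r
library(stringr)

x <- "abcdef"               # hypothetical toy string
y <- x                      # keep a copy for the second approach

str_sub(x, 1, 2) <- "XY-"   # characters 1 to 2 ("ab") are replaced
x

# One way to insert without overwriting: rebuild the string from its parts
y <- str_c(str_sub(y, 1, 2), "-", str_sub(y, 3))
y
```

The first result is "XY-cdef" and the second is "ab-cdef".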

Shorten text with ellipsis to nth character

txt_short <- txt %>% str_trunc(20)  # width counts the 3-character ellipsis, so use a value greater than 3
txt_short
[1] "INSERT TEXT AT PO..."
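
As a brief sketch on a made-up string (letters10 is a hypothetical name), the width passed to str_trunc() is the total length of the result, ellipsis included, and the side argument controls which end is kept:

```r
library(stringr)

letters10 <- "abcdefghij"                            # hypothetical 10-character string

right_cut <- str_trunc(letters10, 6)                 # keep the start, ellipsis at the end
left_cut  <- str_trunc(letters10, 6, side = "left")  # keep the end, ellipsis at the start

right_cut
left_cut
```

Both results are six characters long: three original characters plus the ellipsis.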

Return the elements of a character vector that contain the pattern

txt %>% str_subset(pat)
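
With a single-element vector such as txt, str_subset() simply returns the whole string. A minimal sketch on a made-up vector (fruit is hypothetical) makes the filtering behaviour easier to see:

```r
library(stringr)

# Hypothetical toy vector, used only for illustration
fruit <- c("apple pie", "banana", "pineapple", "cherry")

kept     <- str_subset(fruit, "apple")         # elements containing the pattern
same_way <- fruit[str_detect(fruit, "apple")]  # equivalent base-style subsetting

kept
```

Only "apple pie" and "pineapple" are returned.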

Extract string patterns as characters

txt %>% str_extract(pat)  # pull pattern out of string 
[1] "strings"
txt %>% str_extract_all(pat, simplify = FALSE)  # extract every match as a list; set simplify = TRUE to return a matrix
[[1]]
 [1] "strings" "strings" "strings" "strings" "strings" "strings" "strings" "strings" "strings"
[10] "strings" "strings" "strings" "strings" "strings" "strings" "strings" "strings" "strings"
[19] "strings" "strings" "strings" "strings" "strings" "strings" "strings" "strings" "strings"
[28] "strings" "strings" "strings"
txt %>% str_match(pat)  # extract pattern as matrix 
     [,1]     
[1,] "strings"
txt %>% str_match_all(pat)  # extract all pattern instances as matrix 
[[1]]
      [,1]     
 [1,] "strings"
 [2,] "strings"
 [3,] "strings"
 [4,] "strings"
 [5,] "strings"
 [6,] "strings"
 [7,] "strings"
 [8,] "strings"
 [9,] "strings"
[10,] "strings"
[11,] "strings"
[12,] "strings"
[13,] "strings"
[14,] "strings"
[15,] "strings"
[16,] "strings"
[17,] "strings"
[18,] "strings"
[19,] "strings"
[20,] "strings"
[21,] "strings"
[22,] "strings"
[23,] "strings"
[24,] "strings"
[25,] "strings"
[26,] "strings"
[27,] "strings"
[28,] "strings"
[29,] "strings"
[30,] "strings"

View an HTML rendering of the text using str_view()

# visualise the first 100 characters
txt %>% str_sub(1, 100) %>% str_view(" ")

 

Split the text into separate components at every instance of the pattern and inspect an individual component

# split into matrix at every instance of pattern
txt_split <- txt %>% str_split_fixed(pat, n = Inf)
txt_split %>% dim  # get dimensions of matrix 
[1]  1 31
txt_split[1, 20]  # view 1st row and 20th column
[1] " that represent the regular expression as \"\\\\.\".\n\n\n14.3.1.1 Exercises\n\nExplain why each of these "
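
Splitting on a pattern that occurs k times yields k + 1 pieces, which is why 30 matches give 31 columns above. A minimal sketch on a made-up string (x is hypothetical) compares the list output of str_split() with the matrix output of str_split_fixed():

```r
library(stringr)

x <- "a-b-c"                                # hypothetical toy string with two separators

as_list  <- str_split(x, "-")               # list with one character vector per input
two_cols <- str_split_fixed(x, "-", n = 2)  # matrix; the remainder stays in the last column
padded   <- str_split_fixed(x, "-", n = 4)  # extra columns are filled with ""

as_list[[1]]
two_cols
padded
```

Two separators give three pieces in the list. With n = 2 the second column holds the unsplit remainder "b-c".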

Mutating and joining strings

Replace pattern instances with new pattern

repl <- "when you really need that coffee hit"  # replacement character string
txt %>% str_replace_all(pat, repl)

You'll notice that the first instance of the pattern in the returned text is capitalised, so the case-sensitive replacement skips it. We can tell R to match every instance regardless of case by wrapping the pattern in regex() with ignore_case set to TRUE:

pat_all <- regex(pat, ignore_case = TRUE)  # spell out TRUE; T can be reassigned
pat_all
txt %>% str_replace_all(pat_all, repl)
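
The effect is easier to see on a short example. The sketch below uses a made-up vector (x is hypothetical) to compare the case-sensitive and case-insensitive replacements:

```r
library(stringr)

# Hypothetical toy vector, used only for illustration
x <- c("Strings are fun", "more strings")

strict  <- str_replace_all(x, "strings", "text")   # misses the capitalised "Strings"
relaxed <- str_replace_all(x, regex("strings", ignore_case = TRUE), "text")

strict
relaxed
```

Only the relaxed version changes the first element.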

Further useful functions

Duplicate string

# use the smaller, split text
txt_s <- txt_split[5]
txt_s %>% str_dup(3)  # duplicate string n number of times (3)     
[1] " with either single quotes or double quotes. Unlike other languages, there is no difference in behaviour. I recommend always using \", unless you want to create a string that contains multiple \".\n\nstring1 <- \"This is a string\"\nstring2 <- 'If I want to include a \"quote\" inside a string, I use single quotes'\nIf you forget to close a quote, you’ll see +, the continuation character:\n> \"This is a string without a closing quote\n+ \n+ \n+ HELP I'M STUCK\nIf this happen to you, press Escape and try again!\nTo include a literal single or double quote in a string you can use \\ to “escape” it:\n\ndouble_quote <- \"\\\"\" # or '\"'\nsingle_quote <- '\\'' # or \"'\"\nThat means if you want to include a literal backslash, you’ll need to double it up: \"\\\\\".\nBeware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():\n\nx <- c(\"\\\"\", \"\\\\\")\nx\n#> [1] \"\\\"\" \"\\\\\"\nwriteLines(x)\n#> \"\n#> \\\nThere are a handful of other special characters. The most common are \"\\n\", newline, and \"\\t\", tab, but you can see the complete list by requesting help on \": ?'\"', or ?\"'\". You’ll also sometimes see  with either single quotes or double quotes. Unlike other languages, there is no difference in behaviour. 
I recommend always using \", unless you want to create a string that contains multiple \".\n\nstring1 <- \"This is a string\"\nstring2 <- 'If I want to include a \"quote\" inside a string, I use single quotes'\nIf you forget to close a quote, you’ll see +, the continuation character:\n> \"This is a string without a closing quote\n+ \n+ \n+ HELP I'M STUCK\nIf this happen to you, press Escape and try again!\nTo include a literal single or double quote in a string you can use \\ to “escape” it:\n\ndouble_quote <- \"\\\"\" # or '\"'\nsingle_quote <- '\\'' # or \"'\"\nThat means if you want to include a literal backslash, you’ll need to double it up: \"\\\\\".\nBeware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():\n\nx <- c(\"\\\"\", \"\\\\\")\nx\n#> [1] \"\\\"\" \"\\\\\"\nwriteLines(x)\n#> \"\n#> \\\nThere are a handful of other special characters. The most common are \"\\n\", newline, and \"\\t\", tab, but you can see the complete list by requesting help on \": ?'\"', or ?\"'\". You’ll also sometimes see  with either single quotes or double quotes. Unlike other languages, there is no difference in behaviour. 
I recommend always using \", unless you want to create a string that contains multiple \".\n\nstring1 <- \"This is a string\"\nstring2 <- 'If I want to include a \"quote\" inside a string, I use single quotes'\nIf you forget to close a quote, you’ll see +, the continuation character:\n> \"This is a string without a closing quote\n+ \n+ \n+ HELP I'M STUCK\nIf this happen to you, press Escape and try again!\nTo include a literal single or double quote in a string you can use \\ to “escape” it:\n\ndouble_quote <- \"\\\"\" # or '\"'\nsingle_quote <- '\\'' # or \"'\"\nThat means if you want to include a literal backslash, you’ll need to double it up: \"\\\\\".\nBeware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():\n\nx <- c(\"\\\"\", \"\\\\\")\nx\n#> [1] \"\\\"\" \"\\\\\"\nwriteLines(x)\n#> \"\n#> \\\nThere are a handful of other special characters. The most common are \"\\n\", newline, and \"\\t\", tab, but you can see the complete list by requesting help on \": ?'\"', or ?\"'\". You’ll also sometimes see "

Removing white space and truncating text

txt_s %>% str_replace_all(" ", "")  # remove all spaces  
[1] "witheithersinglequotesordoublequotes.Unlikeotherlanguages,thereisnodifferenceinbehaviour.Irecommendalwaysusing\",unlessyouwanttocreateastringthatcontainsmultiple\".\n\nstring1<-\"Thisisastring\"\nstring2<-'IfIwanttoincludea\"quote\"insideastring,Iusesinglequotes'\nIfyouforgettocloseaquote,you’llsee+,thecontinuationcharacter:\n>\"Thisisastringwithoutaclosingquote\n+\n+\n+HELPI'MSTUCK\nIfthishappentoyou,pressEscapeandtryagain!\nToincludealiteralsingleordoublequoteinastringyoucanuse\\to“escape”it:\n\ndouble_quote<-\"\\\"\"#or'\"'\nsingle_quote<-'\\''#or\"'\"\nThatmeansifyouwanttoincludealiteralbackslash,you’llneedtodoubleitup:\"\\\\\".\nBewarethattheprintedrepresentationofastringisnotthesameasstringitself,becausetheprintedrepresentationshowstheescapes.Toseetherawcontentsofthestring,usewriteLines():\n\nx<-c(\"\\\"\",\"\\\\\")\nx\n#>[1]\"\\\"\"\"\\\\\"\nwriteLines(x)\n#>\"\n#>\\\nThereareahandfulofotherspecialcharacters.Themostcommonare\"\\n\",newline,and\"\\t\",tab,butyoucanseethecompletelistbyrequestinghelpon\":?'\"',or?\"'\".You’llalsosometimessee"
txt_s %>% str_trim(side = "both")  # strip white space from both ends  
[1] "with either single quotes or double quotes. Unlike other languages, there is no difference in behaviour. I recommend always using \", unless you want to create a string that contains multiple \".\n\nstring1 <- \"This is a string\"\nstring2 <- 'If I want to include a \"quote\" inside a string, I use single quotes'\nIf you forget to close a quote, you’ll see +, the continuation character:\n> \"This is a string without a closing quote\n+ \n+ \n+ HELP I'M STUCK\nIf this happen to you, press Escape and try again!\nTo include a literal single or double quote in a string you can use \\ to “escape” it:\n\ndouble_quote <- \"\\\"\" # or '\"'\nsingle_quote <- '\\'' # or \"'\"\nThat means if you want to include a literal backslash, you’ll need to double it up: \"\\\\\".\nBeware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():\n\nx <- c(\"\\\"\", \"\\\\\")\nx\n#> [1] \"\\\"\" \"\\\\\"\nwriteLines(x)\n#> \"\n#> \\\nThere are a handful of other special characters. The most common are \"\\n\", newline, and \"\\t\", tab, but you can see the complete list by requesting help on \": ?'\"', or ?\"'\". You’ll also sometimes see"
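
As a brief sketch on a made-up string (x is hypothetical), str_trim() only touches the ends of a string, whereas stringr's str_squish() also collapses runs of internal white space, including newlines, to a single space:

```r
library(stringr)

x <- "  too   many \n spaces  "  # hypothetical messy string

trimmed  <- str_trim(x)          # strips leading and trailing white space only
squished <- str_squish(x)        # also collapses internal runs of white space

trimmed
squished
```

This is often a gentler alternative to deleting every space when the goal is readable text.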

Alternative functions from stringi package

require(stringi)
txt_s %>% stri_replace_all_charclass("\\p{WHITE_SPACE}", "")  # remove all white space, including newlines
[1] "witheithersinglequotesordoublequotes.Unlikeotherlanguages,thereisnodifferenceinbehaviour.Irecommendalwaysusing\",unlessyouwanttocreateastringthatcontainsmultiple\".string1<-\"Thisisastring\"string2<-'IfIwanttoincludea\"quote\"insideastring,Iusesinglequotes'Ifyouforgettocloseaquote,you’llsee+,thecontinuationcharacter:>\"Thisisastringwithoutaclosingquote+++HELPI'MSTUCKIfthishappentoyou,pressEscapeandtryagain!Toincludealiteralsingleordoublequoteinastringyoucanuse\\to“escape”it:double_quote<-\"\\\"\"#or'\"'single_quote<-'\\''#or\"'\"Thatmeansifyouwanttoincludealiteralbackslash,you’llneedtodoubleitup:\"\\\\\".Bewarethattheprintedrepresentationofastringisnotthesameasstringitself,becausetheprintedrepresentationshowstheescapes.Toseetherawcontentsofthestring,usewriteLines():x<-c(\"\\\"\",\"\\\\\")x#>[1]\"\\\"\"\"\\\\\"writeLines(x)#>\"#>\\Thereareahandfulofotherspecialcharacters.Themostcommonare\"\\n\",newline,and\"\\t\",tab,butyoucanseethecompletelistbyrequestinghelpon\":?'\"',or?\"'\".You’llalsosometimessee"
txt_s %>% str_replace_na()  # stringr again: turn missing values (NA) into the string "NA"
[1] " with either single quotes or double quotes. Unlike other languages, there is no difference in behaviour. I recommend always using \", unless you want to create a string that contains multiple \".\n\nstring1 <- \"This is a string\"\nstring2 <- 'If I want to include a \"quote\" inside a string, I use single quotes'\nIf you forget to close a quote, you’ll see +, the continuation character:\n> \"This is a string without a closing quote\n+ \n+ \n+ HELP I'M STUCK\nIf this happen to you, press Escape and try again!\nTo include a literal single or double quote in a string you can use \\ to “escape” it:\n\ndouble_quote <- \"\\\"\" # or '\"'\nsingle_quote <- '\\'' # or \"'\"\nThat means if you want to include a literal backslash, you’ll need to double it up: \"\\\\\".\nBeware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():\n\nx <- c(\"\\\"\", \"\\\\\")\nx\n#> [1] \"\\\"\" \"\\\\\"\nwriteLines(x)\n#> \"\n#> \\\nThere are a handful of other special characters. The most common are \"\\n\", newline, and \"\\t\", tab, but you can see the complete list by requesting help on \": ?'\"', or ?\"'\". You’ll also sometimes see "

Including vectors within strings

Insert numeric vectors without breaking character string

vect <- 1000
str_interp("For including vectors like this ${vect} when you can't break the character string")
[1] "For including vectors like this 1000 when you can't break the character string"

Useful when the string itself contains quotes that would otherwise have to be broken up, e.g. HTML tags

str_interp("<div style=\"color:#F90F40;\"> <strong> Total count </strong> ${vect}")
[1] "<div style=\"color:#F90F40;\"> <strong> Total count </strong> 1000"

Supply the values from a list

str_interp("First value, ${v1}, Second value, ${v2*2}.", list(v1 = 10, v2 = 20))
[1] "First value, 10, Second value, 40."

And data frames

str_interp("Values are $[.2f]{max(Sepal.Width)} and $[.2f]{min(Sepal.Width)}.", iris)
[1] "Values are 4.40 and 2.00."

Regular expressions, i.e. regex

You can find in-depth info on how to parse character vectors or strings or find specific character patterns using regular expressions in the R for Data Science book. There's also a handy regex tool for live text parsing.
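
As a small taste of what regular expressions add, the sketch below uses a made-up sentence (x is hypothetical) with a few common building blocks: \\d for digits, \\w for word characters, + for "one or more", and ^ for the start of the string.

```r
library(stringr)

# Hypothetical sentence, used only for illustration
x <- "Order 66 shipped 3 items on day 12"

nums              <- str_extract_all(x, "\\d+")[[1]]  # every run of one or more digits
starts_with_order <- str_detect(x, "^Order")          # ^ anchors the match to the start
n_words           <- str_count(x, "\\w+")             # number of word-like tokens

nums
starts_with_order
n_words
```

The backslashes are doubled because the pattern is written as an R string, where a single backslash is itself an escape character.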