Functions for Lesson 8
str_which, str_detect, str_locate, str_view, str_sub
Packages for Lesson 8
stringr
Use the stringr package to cut, substitute, print, and manipulate character and text strings in R. It is useful for scraping text from webpages, searching PDFs and text files for given characters and words, mining genomics data, etc.
Cheat sheet for the stringr package.
Install necessary packages
# install.packages('pacman') # uncomment and install this first
pacman::p_load(stringr, stringi, dplyr, reprex, xml2, rvest)
First, we need some text data. Since this lesson is about strings, we will use all the text from the strings chapter of the R for Data Science textbook as our sample.
require(xml2) # read html data
require(rvest) # select html elements
url <- "https://r4ds.had.co.nz/strings.html"
txt <- url %>% read_html %>% html_text() # scrape web text from url
txt %>% str
chr "14 Strings | R for Data Science\n window.dataLayer = window.dataLayer || [];\n function gtag(){dataLayer."| __truncated__
txt %>% str_length() # get the number of characters in the string
[1] 49589
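Because the scraped page arrives as one long string, it is easy to confuse the number of elements in a vector with the number of characters in each element. A minimal sketch of the difference, using a made-up vector `x` (an assumption for illustration, not part of the lesson data):

```r
library(stringr)

# hypothetical toy vector, not the scraped text
x <- c("a", "bcd", "")

n_elements <- length(x)      # number of elements in the vector
n_chars    <- str_length(x)  # number of characters in each element

n_elements
n_chars
```

Here `length()` counts elements, while `str_length()` returns one count per element.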
Search for location of string patterns using str_detect, str_which, and str_locate
pat <- "strings" # string pattern to search for
txt %>% str_detect(pat) # returns logical if vector contains that pattern
[1] TRUE
txt %>% str_which(pat) # index of the vector elements that contain the pattern
[1] 1
txt %>% str_locate(pat) # show character positions of the first instance of pattern
start end
[1,] 1089 1095
txt %>% str_locate_all(pat) # show all positions
[[1]]
start end
[1,] 1089 1095
[2,] 1257 1263
[3,] 1381 1387
[4,] 1734 1740
[5,] 2935 2941
[6,] 3072 3078
[7,] 3276 3282
[8,] 3787 3793
[9,] 3819 3825
[10,] 4772 4778
[11,] 4896 4902
[12,] 5429 5435
[13,] 6452 6458
[14,] 7719 7725
[15,] 8275 8281
[16,] 8943 8949
[17,] 9098 9104
[18,] 9182 9188
[19,] 10161 10167
[20,] 10265 10271
[21,] 19511 19517
[22,] 28886 28892
[23,] 36969 36975
[24,] 38179 38185
[25,] 40201 40207
[26,] 44744 44750
[27,] 46332 46338
[28,] 48152 48158
[29,] 48418 48424
[30,] 48444 48450
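Since `txt` is a single string, the three functions above each return only one result. Their differences are easier to see on a vector with several elements. A small sketch, assuming a made-up vector `desserts`:

```r
library(stringr)

# hypothetical example vector, not part of the scraped text
desserts <- c("apple pie", "banana split", "cherry pie")

has_pie   <- str_detect(desserts, "pie")  # one logical per element
which_pie <- str_which(desserts, "pie")   # indices of the matching elements
where_pie <- str_locate(desserts, "pie")  # start/end matrix, NA where no match

has_pie
which_pie
where_pie
```

`str_which(x, p)` gives the same indices as `which(str_detect(x, p))`.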
Subset and cut up strings into manageable pieces
# subset string portion based on char position
txt %>% str_sub(txt %>% str_locate(pat)) # use the start/end positions from str_locate above
[1] "strings"
# replace the characters at positions 1 to 2 with user text (this overwrites them and modifies txt)
str_sub(txt, 1, 2) <- "INSERT TEXT AT POSITION"
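Note that assigning to `str_sub()` overwrites the characters in the given range. If the goal is to add text without losing any existing characters, one option is to paste the pieces back together. A minimal sketch on a toy string `s` (the names `s`, `replaced`, and `inserted` are illustrative assumptions):

```r
library(stringr)

s <- "abcdef"  # hypothetical toy string

# replacement: characters 1 to 2 ("ab") are overwritten
replaced <- s
str_sub(replaced, 1, 2) <- "XY-"

# insertion: join the left part, the new text, and the right part
inserted <- str_c(str_sub(s, 1, 2), "--", str_sub(s, 3))

replaced
inserted
```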
Shorten text with ellipsis to nth character
txt_short <- txt %>% str_trunc(20) # width must be at least 3, the length of the ellipsis
txt_short
[1] "INSERT TEXT AT PO..."
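The width passed to `str_trunc()` includes the ellipsis itself, and the `side` argument controls which end is cut. A short sketch, assuming a made-up string `sentence`:

```r
library(stringr)

sentence <- "The quick brown fox"  # hypothetical example string

right_cut <- str_trunc(sentence, 10)                 # keep the start, cut the end
left_cut  <- str_trunc(sentence, 10, side = "left")  # keep the end, cut the start

right_cut
left_cut
```

Both results are exactly 10 characters long: 7 characters of text plus the 3-character ellipsis.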
Return the elements of a character vector that contain the pattern
txt %>% str_subset(pat)
Extract string patterns as characters
txt %>% str_extract(pat) # pull pattern out of string
[1] "strings"
txt %>% str_extract_all(pat, simplify = FALSE) # extract every match as a list of character vectors; set simplify = TRUE to return a matrix
[[1]]
[1] "strings" "strings" "strings" "strings" "strings" "strings" "strings" "strings" "strings"
[10] "strings" "strings" "strings" "strings" "strings" "strings" "strings" "strings" "strings"
[19] "strings" "strings" "strings" "strings" "strings" "strings" "strings" "strings" "strings"
[28] "strings" "strings" "strings"
txt %>% str_match(pat) # extract pattern as matrix
[,1]
[1,] "strings"
txt %>% str_match_all(pat) # extract all pattern instances as matrix
[[1]]
[,1]
[1,] "strings"
[2,] "strings"
[3,] "strings"
[4,] "strings"
[5,] "strings"
[6,] "strings"
[7,] "strings"
[8,] "strings"
[9,] "strings"
[10,] "strings"
[11,] "strings"
[12,] "strings"
[13,] "strings"
[14,] "strings"
[15,] "strings"
[16,] "strings"
[17,] "strings"
[18,] "strings"
[19,] "strings"
[20,] "strings"
[21,] "strings"
[22,] "strings"
[23,] "strings"
[24,] "strings"
[25,] "strings"
[26,] "strings"
[27,] "strings"
[28,] "strings"
[29,] "strings"
[30,] "strings"
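With a fixed pattern such as "strings", `str_extract()` and `str_match()` return the same text. The two diverge once the pattern contains capture groups: `str_match()` adds one matrix column per group. A minimal sketch, assuming a made-up vector `ids`:

```r
library(stringr)

ids <- c("id: 42", "id: 7", "none")  # hypothetical example vector

# str_extract returns only the full match (NA where there is none)
first_number <- str_extract(ids, "\\d+")

# str_match returns the full match in column 1 and each group after it
m <- str_match(ids, "id: (\\d+)")

first_number
m
```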
View an HTML rendering of the text using str_view()
# visualise the first 100 characters
txt %>% str_sub(1, 100) %>% str_view(" ")
Split the text into separate components and apply the str_sub function to each new component
# split into matrix at every instance of pattern
txt_split <- txt %>% str_split_fixed(pat, n = Inf)
txt_split %>% dim # get dimensions of matrix
[1] 1 31
txt_split[1, 20] # view 1st row and 20th column
[1] " that represent the regular expression as \"\\\\.\".\n\n\n14.3.1.1 Exercises\n\nExplain why each of these "
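The same splitting idea is easier to follow on a short string. The sketch below, which assumes a made-up string `path`, contrasts `str_split()` (returns a list) with `str_split_fixed()` (returns a matrix with exactly `n` columns) and then applies `str_sub()` to every piece:

```r
library(stringr)

path <- "alpha-beta-gamma"  # hypothetical example string

pieces_list  <- str_split(path, "-")               # list with one character vector
pieces_fixed <- str_split_fixed(path, "-", n = 2)  # 1 x 2 matrix, remainder in last column

# str_sub is vectorised, so it is applied to each piece in turn
pieces   <- pieces_list[[1]]
prefixes <- str_sub(pieces, 1, 3)

pieces_fixed
prefixes
```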
Replace pattern instances with new pattern
repl <- "when you really need that coffee hit" # replacement character string
txt %>% str_replace_all(pat, repl)
You'll notice that the first instance of the pattern in the text is capitalised, so the case-sensitive replacement does not match it and leaves it unchanged. We can tell R to match all instances of the pattern regardless of case using regex()
pat_all <- regex(pat, ignore_case = TRUE)
pat_all
txt %>% str_replace_all(pat_all, repl)
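The effect of `ignore_case` is hard to spot in the full page text, so here is a small sketch on a made-up sentence `x` (the replacement word is also just an illustration). `str_count()` is used to confirm how many matches each pattern finds:

```r
library(stringr)

x <- "Strings are fun; strings are everywhere"  # hypothetical example sentence

# case-sensitive: the capitalised "Strings" is left alone
case_sensitive <- str_replace_all(x, "strings", "cats")

# case-insensitive: both instances are replaced
pat_any_case     <- regex("strings", ignore_case = TRUE)
case_insensitive <- str_replace_all(x, pat_any_case, "cats")

n_sensitive   <- str_count(x, "strings")
n_insensitive <- str_count(x, pat_any_case)

case_sensitive
case_insensitive
```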
Duplicate string
# use the smaller, split text
txt_s <- txt_split[1, 5] # 5th piece of the split text
txt_s %>% str_dup(3) # duplicate string n number of times (3)
[1] " with either single quotes or double quotes. Unlike other languages, there is no difference in behaviour. I recommend always using \", unless you want to create a string that contains multiple \".\n\nstring1 <- \"This is a string\"\nstring2 <- 'If I want to include a \"quote\" inside a string, I use single quotes'\nIf you forget to close a quote, you’ll see +, the continuation character:\n> \"This is a string without a closing quote\n+ \n+ \n+ HELP I'M STUCK\nIf this happen to you, press Escape and try again!\nTo include a literal single or double quote in a string you can use \\ to “escape” it:\n\ndouble_quote <- \"\\\"\" # or '\"'\nsingle_quote <- '\\'' # or \"'\"\nThat means if you want to include a literal backslash, you’ll need to double it up: \"\\\\\".\nBeware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():\n\nx <- c(\"\\\"\", \"\\\\\")\nx\n#> [1] \"\\\"\" \"\\\\\"\nwriteLines(x)\n#> \"\n#> \\\nThere are a handful of other special characters. The most common are \"\\n\", newline, and \"\\t\", tab, but you can see the complete list by requesting help on \": ?'\"', or ?\"'\". You’ll also sometimes see with either single quotes or double quotes. Unlike other languages, there is no difference in behaviour. 
I recommend always using \", unless you want to create a string that contains multiple \".\n\nstring1 <- \"This is a string\"\nstring2 <- 'If I want to include a \"quote\" inside a string, I use single quotes'\nIf you forget to close a quote, you’ll see +, the continuation character:\n> \"This is a string without a closing quote\n+ \n+ \n+ HELP I'M STUCK\nIf this happen to you, press Escape and try again!\nTo include a literal single or double quote in a string you can use \\ to “escape” it:\n\ndouble_quote <- \"\\\"\" # or '\"'\nsingle_quote <- '\\'' # or \"'\"\nThat means if you want to include a literal backslash, you’ll need to double it up: \"\\\\\".\nBeware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():\n\nx <- c(\"\\\"\", \"\\\\\")\nx\n#> [1] \"\\\"\" \"\\\\\"\nwriteLines(x)\n#> \"\n#> \\\nThere are a handful of other special characters. The most common are \"\\n\", newline, and \"\\t\", tab, but you can see the complete list by requesting help on \": ?'\"', or ?\"'\". You’ll also sometimes see with either single quotes or double quotes. Unlike other languages, there is no difference in behaviour. 
I recommend always using \", unless you want to create a string that contains multiple \".\n\nstring1 <- \"This is a string\"\nstring2 <- 'If I want to include a \"quote\" inside a string, I use single quotes'\nIf you forget to close a quote, you’ll see +, the continuation character:\n> \"This is a string without a closing quote\n+ \n+ \n+ HELP I'M STUCK\nIf this happen to you, press Escape and try again!\nTo include a literal single or double quote in a string you can use \\ to “escape” it:\n\ndouble_quote <- \"\\\"\" # or '\"'\nsingle_quote <- '\\'' # or \"'\"\nThat means if you want to include a literal backslash, you’ll need to double it up: \"\\\\\".\nBeware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():\n\nx <- c(\"\\\"\", \"\\\\\")\nx\n#> [1] \"\\\"\" \"\\\\\"\nwriteLines(x)\n#> \"\n#> \\\nThere are a handful of other special characters. The most common are \"\\n\", newline, and \"\\t\", tab, but you can see the complete list by requesting help on \": ?'\"', or ?\"'\". You’ll also sometimes see "
Removing white space and truncating text
txt_s %>% str_replace_all(" ", "") # remove all spaces
[1] "witheithersinglequotesordoublequotes.Unlikeotherlanguages,thereisnodifferenceinbehaviour.Irecommendalwaysusing\",unlessyouwanttocreateastringthatcontainsmultiple\".\n\nstring1<-\"Thisisastring\"\nstring2<-'IfIwanttoincludea\"quote\"insideastring,Iusesinglequotes'\nIfyouforgettocloseaquote,you’llsee+,thecontinuationcharacter:\n>\"Thisisastringwithoutaclosingquote\n+\n+\n+HELPI'MSTUCK\nIfthishappentoyou,pressEscapeandtryagain!\nToincludealiteralsingleordoublequoteinastringyoucanuse\\to“escape”it:\n\ndouble_quote<-\"\\\"\"#or'\"'\nsingle_quote<-'\\''#or\"'\"\nThatmeansifyouwanttoincludealiteralbackslash,you’llneedtodoubleitup:\"\\\\\".\nBewarethattheprintedrepresentationofastringisnotthesameasstringitself,becausetheprintedrepresentationshowstheescapes.Toseetherawcontentsofthestring,usewriteLines():\n\nx<-c(\"\\\"\",\"\\\\\")\nx\n#>[1]\"\\\"\"\"\\\\\"\nwriteLines(x)\n#>\"\n#>\\\nThereareahandfulofotherspecialcharacters.Themostcommonare\"\\n\",newline,and\"\\t\",tab,butyoucanseethecompletelistbyrequestinghelpon\":?'\"',or?\"'\".You’llalsosometimessee"
txt_s %>% str_trim(side = "both") # strip white space from both ends
[1] "with either single quotes or double quotes. Unlike other languages, there is no difference in behaviour. I recommend always using \", unless you want to create a string that contains multiple \".\n\nstring1 <- \"This is a string\"\nstring2 <- 'If I want to include a \"quote\" inside a string, I use single quotes'\nIf you forget to close a quote, you’ll see +, the continuation character:\n> \"This is a string without a closing quote\n+ \n+ \n+ HELP I'M STUCK\nIf this happen to you, press Escape and try again!\nTo include a literal single or double quote in a string you can use \\ to “escape” it:\n\ndouble_quote <- \"\\\"\" # or '\"'\nsingle_quote <- '\\'' # or \"'\"\nThat means if you want to include a literal backslash, you’ll need to double it up: \"\\\\\".\nBeware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():\n\nx <- c(\"\\\"\", \"\\\\\")\nx\n#> [1] \"\\\"\" \"\\\\\"\nwriteLines(x)\n#> \"\n#> \\\nThere are a handful of other special characters. The most common are \"\\n\", newline, and \"\\t\", tab, but you can see the complete list by requesting help on \": ?'\"', or ?\"'\". You’ll also sometimes see"
stringi package
require(stringi)
txt_s %>% stri_replace_all_charclass("\\p{WHITE_SPACE}", "") # remove all white space, including newlines
[1] "witheithersinglequotesordoublequotes.Unlikeotherlanguages,thereisnodifferenceinbehaviour.Irecommendalwaysusing\",unlessyouwanttocreateastringthatcontainsmultiple\".string1<-\"Thisisastring\"string2<-'IfIwanttoincludea\"quote\"insideastring,Iusesinglequotes'Ifyouforgettocloseaquote,you’llsee+,thecontinuationcharacter:>\"Thisisastringwithoutaclosingquote+++HELPI'MSTUCKIfthishappentoyou,pressEscapeandtryagain!Toincludealiteralsingleordoublequoteinastringyoucanuse\\to“escape”it:double_quote<-\"\\\"\"#or'\"'single_quote<-'\\''#or\"'\"Thatmeansifyouwanttoincludealiteralbackslash,you’llneedtodoubleitup:\"\\\\\".Bewarethattheprintedrepresentationofastringisnotthesameasstringitself,becausetheprintedrepresentationshowstheescapes.Toseetherawcontentsofthestring,usewriteLines():x<-c(\"\\\"\",\"\\\\\")x#>[1]\"\\\"\"\"\\\\\"writeLines(x)#>\"#>\\Thereareahandfulofotherspecialcharacters.Themostcommonare\"\\n\",newline,and\"\\t\",tab,butyoucanseethecompletelistbyrequestinghelpon\":?'\"',or?\"'\".You’llalsosometimessee"
txt_s %>% str_replace_na() # convert missing values (NA) into the literal string "NA"
[1] " with either single quotes or double quotes. Unlike other languages, there is no difference in behaviour. I recommend always using \", unless you want to create a string that contains multiple \".\n\nstring1 <- \"This is a string\"\nstring2 <- 'If I want to include a \"quote\" inside a string, I use single quotes'\nIf you forget to close a quote, you’ll see +, the continuation character:\n> \"This is a string without a closing quote\n+ \n+ \n+ HELP I'M STUCK\nIf this happen to you, press Escape and try again!\nTo include a literal single or double quote in a string you can use \\ to “escape” it:\n\ndouble_quote <- \"\\\"\" # or '\"'\nsingle_quote <- '\\'' # or \"'\"\nThat means if you want to include a literal backslash, you’ll need to double it up: \"\\\\\".\nBeware that the printed representation of a string is not the same as string itself, because the printed representation shows the escapes. To see the raw contents of the string, use writeLines():\n\nx <- c(\"\\\"\", \"\\\\\")\nx\n#> [1] \"\\\"\" \"\\\\\"\nwriteLines(x)\n#> \"\n#> \\\nThere are a handful of other special characters. The most common are \"\\n\", newline, and \"\\t\", tab, but you can see the complete list by requesting help on \": ?'\"', or ?\"'\". You’ll also sometimes see "
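The outputs above are long, which makes the differences between these functions hard to see. A compact sketch on a made-up string `messy`, which also shows `str_squish()` as a middle ground between trimming the ends and removing every space:

```r
library(stringr)

messy <- "  too   many \n spaces  "  # hypothetical example string

trimmed  <- str_trim(messy)    # strips both ends only
squished <- str_squish(messy)  # strips ends and collapses inner runs to one space

# str_replace_na only changes elements that are actually missing
with_na <- str_replace_na(c("a", NA))

trimmed
squished
with_na
```

`txt_s` contains no missing values, which is why `str_replace_na()` returned it unchanged above.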
Insert values into a character string without breaking it up, using str_interp
vect <- 1000
str_interp("For including vectors like this ${vect} when you can't break the character string")
[1] "For including vectors like this 1000 when you can't break the character string"
This is useful when the string itself contains quotes, e.g. HTML tags
str_interp("<div style=\"color:#F90F40;\"> <strong> Total count </strong> ${vect}")
[1] "<div style=\"color:#F90F40;\"> <strong> Total count </strong> 1000"
Pass the values as a named list
str_interp("First value, ${v1}, Second value, ${v2*2}.", list(v1 = 10, v2 = 20))
[1] "First value, 10, Second value, 40."
And data frames
str_interp("Values are $[.2f]{max(Sepal.Width)} and $[.2f]{min(Sepal.Width)}.", iris)
[1] "Values are 4.40 and 2.00."
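To summarise the interpolation syntax: `${...}` inserts the value of an expression, and `$[format]{...}` applies a sprintf-style format first. stringr also provides `str_glue()`, which uses plain `{...}` placeholders. A minimal sketch comparing the two, where the variable names and the value of `m` are assumptions for illustration:

```r
library(stringr)

vect <- 1000  # same value as in the lesson

# str_interp looks up ${vect} in the calling environment
interp <- str_interp("Total count: ${vect}")

# str_glue does the same with {vect}; as.character drops the glue class
glued <- as.character(str_glue("Total count: {vect}"))

# a format in square brackets is applied before insertion
formatted <- str_interp("Mean is $[.1f]{m}", list(m = 3.14159))

interp
glued
formatted
```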
You can find in-depth info on how to parse character vectors or strings or find specific character patterns using regular expressions in the R for Data Science book. There's also a handy regex tool for live text parsing.