Alex Sanchez (asanchez@ub.edu)
Francesc Carmona (fcarmona@ub.edu)
GME Department. Universitat de Barcelona
Statistics and Bioinformatics Unit. Vall d’Hebron Institut de Recerca
March 2020
License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License http://creativecommons.org/licenses/by-nc-sa/4.0/
These slides have been prepared based on multiple sources: websites, blogs, courses. While it is hard to cite them all, I wish to acknowledge those sources that have been particularly useful.
This chapter makes heavy use of strings, so two additional sources are:
The grep(), grepl(), regexpr() and gregexpr() functions are used for searching for matches.
grep(pattern, x, value = FALSE) returns an integer vector of the indices of the elements of x that yielded a match (or not, for invert = TRUE).
## [1] 2 3
grep(pattern, x, value = TRUE) returns the matching elements themselves rather than their indices.
## [1] "expression" "examples of R language"
grepl(pattern, x) returns a TRUE/FALSE vector indicating which elements of the character vector contain a match.
## [1] FALSE TRUE TRUE
regexpr(pattern, text) returns an integer vector of the same length as text giving the starting position of the first match, or -1 if there is none, with attribute "match.length", an integer vector giving the length of the matched text (or -1 for no match).
## [1] -1 1 1
## attr(,"match.length")
## [1] -1 2 2
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
gregexpr(pattern, text) returns a list of the same length as text, each element of which has the same form as the return value of regexpr(), except that the starting positions of every (disjoint) match are given.
## [[1]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1] 4
## attr(,"match.length")
## [1] 4
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
##
## [[3]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
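As a minimal sketch of the four search functions, the snippet below runs them on a small vector. The vector x and the patterns are assumptions chosen for illustration; the slides do not show the original input.

```r
# Hypothetical input vector (not taken from the slides).
x <- c("apple", "expression", "examples of R language")

idx   <- grep("ex", x)                # indices of the elements that match
vals  <- grep("ex", x, value = TRUE)  # the matching elements themselves
flags <- grepl("ex", x)               # one TRUE/FALSE per element
first <- regexpr("ex", x)             # start of the first match, -1 if none
all_e <- gregexpr("e", x)             # starts of every match of "e", as a list

print(idx)                            # 2 3
print(flags)                          # FALSE TRUE TRUE
print(as.integer(first))              # -1 1 1
print(attr(first, "match.length"))    # -1 2 2
print(as.integer(all_e[[2]]))         # 1 5: both "e" positions in "expression"
```

Note that regexpr() and gregexpr() only report positions; the matched text itself is recovered with regmatches().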
sub(pattern, replacement, string) replaces the first match.
## [1] "He is now 5 years old, and weights 130lbs"
gsub(pattern, replacement, string) replaces all matches.
## [1] "He is now years old, and weights lbs"
regmatches(string, regexpr()) extracts the first match.
x <- c("Arkansas", "Alabama", "Calabash", "Washington")
pattern <- "[Aa][^a]*a" # sequences starting with A or a and continuing up to next a.
regmatches(x, regexpr(pattern, x))
## [1] "Arka" "Ala" "ala"
## [[1]]
## [1] "" "nsas"
##
## [[2]]
## [1] "" "bama"
##
## [[3]]
## [1] "C" "bash"
##
## [[4]]
## [1] "Washington"
regmatches(string, gregexpr()) extracts all matches and outputs a list.
x <- c("Arkansas", "Alabama", "Calabash", "Washington")
pattern <- "[Aa][^a]*a" # sequences starting with A or a and continuing up to next a.
regmatches(x, gregexpr(pattern, x))
## [[1]]
## [1] "Arka"
##
## [[2]]
## [1] "Ala" "ama"
##
## [[3]]
## [1] "ala"
##
## [[4]]
## character(0)
## [[1]]
## [1] "" "nsas"
##
## [[2]]
## [1] "" "b" ""
##
## [[3]]
## [1] "C" "bash"
##
## [[4]]
## [1] "Washington"
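The difference between replacing and extracting can be sketched on a single string. The sentence s and the placeholder "N" below are assumptions for illustration, not the input used in the slides.

```r
# Hypothetical example string.
s <- "He is now 25 years old, and weighs 130lbs"

first_only <- sub("[0-9]+", "N", s)    # only the first run of digits is replaced
all_runs   <- gsub("[0-9]+", "N", s)   # every run of digits is replaced

# regmatches() turns the positions found by gregexpr() into the matched text.
numbers <- regmatches(s, gregexpr("[0-9]+", s))[[1]]

# invert = TRUE returns the pieces around the first match instead.
parts <- regmatches(s, regexpr("[0-9]+", s), invert = TRUE)[[1]]

print(first_only)  # "He is now N years old, and weighs 130lbs"
print(all_runs)    # "He is now N years old, and weighs Nlbs"
print(numbers)     # "25" "130"
print(parts)       # "He is now " " years old, and weighs 130lbs"
```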
base R functions
stringr package, developed for extending and simplifying R base functionalities.
stringr at: stringi regular expressions vignette
base R | stringr
grep(..., value = FALSE), grepl() | stringr::str_detect()
regexpr(), gregexpr() | stringr::str_locate(), stringr::str_locate_all()
grep(..., value = TRUE) | stringr::str_extract(), stringr::str_extract_all()
sub(), gsub() | stringr::str_replace(), stringr::str_replace_all()
strsplit() | stringr::str_split()
$ * + . ? [ ] ^ { } | ( ) \.
## [1] NA "an" NA
ignore_case = TRUE:
## [1] TRUE FALSE FALSE
## [1] TRUE TRUE TRUE
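The correspondence between base R and stringr can be checked side by side. This is a sketch assuming the stringr package is installed; the vectors fruit and bananas are illustrative and not necessarily the ones used in the slides.

```r
library(stringr)

fruit <- c("apple", "banana", "pear")   # hypothetical input

# Detect: grepl() and str_detect() return the same logical vector.
same_detect <- identical(grepl("an", fruit), str_detect(fruit, "an"))

# Keep matching elements: grep(value = TRUE) and str_subset().
same_subset <- identical(grep("an", fruit, value = TRUE), str_subset(fruit, "an"))

# Replace the first match: sub() and str_replace().
same_replace <- identical(sub("a", "A", fruit), str_replace(fruit, "a", "A"))

# Extract the matched text itself (NA where there is no match).
extracted <- str_extract(fruit, "an")

# Case-insensitive matching is requested through regex().
bananas <- c("banana", "Banana", "BANANA")
strict  <- str_detect(bananas, "banana")
relaxed <- str_detect(bananas, regex("banana", ignore_case = TRUE))

print(extracted)  # NA "an" NA
print(strict)     # TRUE FALSE FALSE
print(relaxed)    # TRUE TRUE TRUE
```

Note the argument order: stringr functions take the string first and the pattern second, the reverse of grep() and friends.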
str_whatever functions
The stringr package offers both str_whatever() and str_whatever_all() in many instances.
require(stringr)
example.obj <- "1. A small sentence. - 2. Another tiny sentence."
str_extract(example.obj, "e")
## [1] "e"
str_extract_all(example.obj, "e")
## [[1]]
## [1] "e" "e" "e" "e" "e" "e" "e"
str_extract vs grep
## [1] NA "an" NA
## [1] "banana"
Instead of grep() one can use str_subset()
## [1] "banana"
## [1] TRUE FALSE FALSE
## [1] NA "ban" "ear"
The dot, ., can also be made to match \n (new line), by setting dotall = TRUE:
## [1] FALSE
## [1] TRUE
## [1] "small"
## [1] "cat"
## [[1]]
## [1] "cat" "sat" "mat"
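The str_whatever() versus str_whatever_all() distinction and the dotall option can be sketched together. The snippet assumes stringr is installed and reuses the example sentence from these slides; the "\nX\n" test string is an assumption for illustration.

```r
library(stringr)

example_obj <- "1. A small sentence. - 2. Another tiny sentence."

first_e <- str_extract(example_obj, "e")            # first match only
every_e <- str_extract_all(example_obj, "e")[[1]]   # all matches (list element 1)

# By default the dot does not match a newline; regex(dotall = TRUE) changes that.
dot_default <- str_detect("\nX\n", ".X.")
dot_all     <- str_detect("\nX\n", regex(".X.", dotall = TRUE))

print(first_e)          # "e"
print(length(every_e))  # 7
print(dot_default)      # FALSE
print(dot_all)          # TRUE
```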
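Regular expressions use the backslash, “\”, to escape special behaviour.
To match a literal “.”, you need the regular expression “\.”. Unfortunately this creates a problem.
Regular expressions are written as strings, and “\” is also used as an escape symbol in strings.
So to create the regular expression “\.” we need the string “\\.”.
## [1] NA "a.c" NA
If \ is used as an escape character in regular expressions, how do you match a literal \?
You need to escape it, creating the regular expression \\.
To create that regular expression you need a string, which also needs to escape \.
That means that to match a literal \ you need to write “\\\\”: you need four backslashes to match one!
## a\b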
## [1] "\\"
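The two levels of escaping (one for the R string, one for the regular expression) can be checked in base R. The vectors below are illustrative assumptions; fixed = TRUE is shown as an alternative that avoids regular expressions altogether.

```r
x <- c("abc", "a.c", "bef")   # hypothetical input

unescaped <- grepl("a.c", x)                # "." matches any character
escaped   <- grepl("a\\.c", x)              # string "a\\.c" is the regex a\.c
literal   <- grepl("a.c", x, fixed = TRUE)  # no regex: plain substring search

path <- "a\\b"                              # three characters: a, \, b
cat(path, "\n")                             # cat() shows the string as a\b
n_chars <- nchar("\\\\")                    # the pattern string holds 2 backslashes
has_backslash <- grepl("\\\\", path)        # regex \\ matches one literal backslash

print(unescaped)      # TRUE TRUE FALSE
print(escaped)        # FALSE TRUE FALSE
print(has_backslash)  # TRUE
```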
\Q...\E: all the characters in … are treated as exact matches. This is useful if you want to exactly match user input as part of a regular expression.
\': single quote. You don’t need to escape a single quote inside a double-quoted string, so we can also use "'" in the previous example.
\": double quote. Similarly, double quotes can be used inside a single-quoted string, i.e. '"'.
\n: newline.
\r: carriage return.
\t: tab character.
Note: cat() and print() handle escape sequences differently; if you want to print a string out with these sequences interpreted, use cat().
Example
- Let’s say you specify your pattern with single quotes and you want to find countries with the single quote “’”.
- You would have to “escape” the single quote in the pattern, by preceding it with “\”, so it’s clear it is not part of the string-specifying machinery:
library(XML)
library(RCurl)
url <- getURL("https://www.nationsonline.org/oneworld/countries_of_the_world.htm")
df <- readHTMLTable(url, header = TRUE)
countries <- c(levels(df[[1]]$V2),levels(df[[2]]$V2),levels(df[[3]]$V2),
levels(df[[4]]$V2),levels(df[[5]]$V2))
grep("\'", countries, value = TRUE)
## [1] "Côte D'ivoire (Ivory Coast)"
## [2] "Korea, Democratic People's Rep. (North Korea)"
## [3] "Lao, People's Democratic Republic"
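The escape sequences listed above can be tried without downloading anything. In this sketch the inputs user_input, candidates and names_demo are hypothetical; \Q...\E is used with perl = TRUE, where it is part of the Perl-compatible syntax.

```r
# Hypothetical user-supplied text that contains the metacharacter "+".
user_input <- "1+1"
candidates <- c("1+1=2", "11=2")

as_regex <- grepl(user_input, candidates)   # "+" acts as a quantifier here
quoted   <- grepl(paste0("\\Q", user_input, "\\E"), candidates, perl = TRUE)
as_fixed <- grepl(user_input, candidates, fixed = TRUE)   # alternative to \Q...\E

# A single quote: escaped inside single quotes, or written plainly inside double quotes.
names_demo <- c("Lao, People's Democratic Republic", "Spain")
q_single <- grep('\'', names_demo, value = TRUE)
q_double <- grep("'", names_demo, value = TRUE)

print(as_regex)  # FALSE TRUE
print(quoted)    # TRUE FALSE
print(q_single)  # "Lao, People's Democratic Republic"

# print() shows the escape sequence, cat() interprets it.
print("one\ttwo")
cat("one\ttwo\n")
```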
*: matches at least 0 times (0 or more).
+: matches at least 1 time (1 or more).
?: matches at most 1 time (0 or 1).
{n}: matches exactly n times.
{n,}: matches at least n times.
{n,m}: matches between n and m times.
(strings <- c("a", "ab", "acb", "accb", "acccb", "accccb"))
grep("ac*b", strings, value = TRUE) # "ab" "acb" "accb" "acccb" "accccb"
grep("ac+b", strings, value = TRUE) # "acb" "accb" "acccb" "accccb"
grep("ac?b", strings, value = TRUE) # "ab" "acb"
grep("ac{2}b", strings, value = TRUE) # "accb"
grep("ac{2,}b", strings, value = TRUE) # "accb" "acccb" "accccb"
grep("ac{2,3}b", strings, value = TRUE) # "accb" "acccb"
Exercise
Find all countries with ee in their name using quantifiers.
## [1] "Cocos (Keeling) Islands" "Greece"
## [3] "Greenland" "Holy See"
## [5] "Vatican City State (Holy See)"
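One possible solution to this exercise, sketched on a small stand-in vector so that it runs without the web download; countries_demo is a hypothetical subset, not the full list built earlier.

```r
# Hypothetical stand-in for the countries vector.
countries_demo <- c("Greece", "Greenland", "Holy See", "Spain", "Peru")

# e{2} means exactly two consecutive "e" characters.
ee <- grep("e{2}", countries_demo, value = TRUE)
print(ee)  # "Greece" "Greenland" "Holy See"
```

With the full vector, the same call would be grep("e{2}", countries, value = TRUE).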
^ matches the start of the string.
$ matches the end of the string.
\b: matches the empty string at either edge of a word. Don’t confuse it with ^ and $, which mark the edges of a string.
\B: matches the empty string provided it is not at an edge of a word.
(strings <- c("abcd", "cdab", "cabd", "c abd"))
grep("ab", strings, value = TRUE)
grep("^ab", strings, value = TRUE)
grep("ab$", strings, value = TRUE)
grep("\\bab", strings, value = TRUE)
## [1] "_The_ _quick_ _brown_ _fox_"
## [1] "T_h_e q_u_i_c_k b_r_o_w_n f_o_x"
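The position anchors can be made visible by replacing the empty strings they match with an underscore. This is a sketch; the sentence is taken from the output above, and perl = TRUE is an assumption about how such output is obtained, since zero-length matches are handled most predictably by the Perl-compatible engine.

```r
strings <- c("abcd", "cdab", "cabd", "c abd")

at_start <- grep("^ab", strings, value = TRUE)    # "ab" at the start of the string
at_end   <- grep("ab$", strings, value = TRUE)    # "ab" at the end of the string
at_word  <- grep("\\bab", strings, value = TRUE)  # "ab" at the start of a word

words <- "The quick brown fox"
edges  <- gsub("\\b", "_", words, perl = TRUE)    # mark every word edge
inside <- gsub("\\B", "_", words, perl = TRUE)    # mark every non-edge position

print(at_word)  # "abcd" "c abd"
print(edges)    # "_The_ _quick_ _brown_ _fox_"
print(inside)   # "T_h_e q_u_i_c_k b_r_o_w_n f_o_x"
```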
Exercises
Find the countries that end with land.
## [1] "Christmas Island" "Finland" "Greenland" "Iceland"
## [5] "Ireland" "New Zealand" "Pitcairn Island" "Poland"
## [9] "Reunion Island" "Swaziland" "Switzerland" "Thailand"
Find the countries that have the word and in their name.
## [1] "Antigua and Barbuda" "Bosnia and Herzegovina"
## [3] "Saint Kitts and Nevis" "Saint Vincent and the Grenadines"
## [5] "Sao Tome and Principe" "Trinidad and Tobago"
## [7] "Turks and Caicos Islands" "Wallis and Futuna Islands"
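Possible solutions to the two exercises above, sketched on a hypothetical stand-in vector (countries_demo) so that the code runs offline.

```r
# Hypothetical stand-in for the countries vector.
countries_demo <- c("Finland", "Christmas Island", "Netherlands",
                    "Antigua and Barbuda", "Andorra", "Poland")

# "land" followed by the end of the string.
ends_land <- grep("land$", countries_demo, value = TRUE)

# "and" as a whole word: word boundaries on both sides.
word_and <- grep("\\band\\b", countries_demo, value = TRUE)

print(ends_land)  # "Finland" "Christmas Island" "Poland"
print(word_and)   # "Antigua and Barbuda"
```

Without the \\b anchors, the second pattern would also pick up names such as Finland, where "and" is only part of a word.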
.: matches any single character.
[...]: a character list, matches any one of the characters inside the square brackets. Use - inside the brackets to specify a range of characters.
[^...]: an inverted character list, similar to [...], but matches any character except those inside the square brackets.
\: suppresses the special meaning of metacharacters in a regular expression, i.e. $ * + . ? [ ] ^ { } | ( ) \, similar to its usage in escape sequences. Since \ itself needs to be escaped in R, we need to escape these metacharacters with a double backslash, like \\$.
|: an “or” operator, matches patterns on either side of the |.
(...): grouping in regular expressions. This allows you to retrieve the bits that matched various parts of your regular expression so you can alter them or use them for building up a new string. Each group can then be referred to using \\N, with N being the number of the (...) group. This is called a backreference.
(strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12", "acb"))
grep("ab.", strings, value = TRUE)
grep("ab[c-e]", strings, value = TRUE)
grep("ab[^c]", strings, value = TRUE)
grep("^ab", strings, value = TRUE)
grep("\\^ab", strings, value = TRUE)
grep("abc|abd", strings, value = TRUE)
gsub("(ab) 12", "\\1 34", strings)
url <- getURL("https://en.wikipedia.org/wiki/List_of_culinary_fruits")
df <- readHTMLTable(url, header = TRUE)
fruits <- c(levels(df[[1]]$`Common name`), levels(df[[2]]$`Common name`),
levels(df[[3]]$`Common name`), levels(df[[4]]$`Common name`),
levels(df[[5]]$`Common name`), levels(df[[6]]$`Common name`),
levels(df[[7]]$`Common name`), levels(df[[8]]$`Common name`))
pattern <- "(..)\\1"
str_subset(fruits, pattern)
## [1] "Bolivian mountain coconut" "King coconut"
## [3] "Sea coconut" "Salal"
## [5] "Cassabanana" "Banana"
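Backreferences can be tried in base R on a handful of names. The vector fruits_demo and the "hello world" string are illustrative assumptions, not the scraped list.

```r
# Hypothetical stand-in for the fruits vector.
fruits_demo <- c("banana", "coconut", "apple", "salal", "pear")

# (..) captures any two characters; \\1 requires the same two characters again.
repeated <- grep("(..)\\1", fruits_demo, value = TRUE)

# In a replacement, \\1 and \\2 insert the text captured by each group.
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")

print(repeated)  # "banana" "coconut" "salal"
print(swapped)   # "world hello"
```

Here "banana" matches through "anan", "coconut" through "coco" and "salal" through "alal".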
Exercise
Find the countries that contain the letter i or t and end with land, and replace land with LAND.
## [1] "Christmas IsLAND" "FinLAND" "IceLAND" "IreLAND"
## [5] "Pitcairn IsLAND" "Reunion IsLAND" "SwaziLAND" "SwitzerLAND"
## [9] "ThaiLAND"
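One way to approach this exercise is to select first and replace afterwards. The vector countries_demo is a hypothetical stand-in, and this is only one of several possible solutions.

```r
# Hypothetical stand-in for the countries vector.
countries_demo <- c("Finland", "Greenland", "Poland", "Thailand", "Italy")

# A lower-case i or t somewhere before a final "land".
selected <- grep("[it].*land$", countries_demo, value = TRUE)

# Replace only the trailing "land".
result <- sub("land$", "LAND", selected)

print(result)  # "FinLAND" "ThaiLAND"
```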
There are two flavors of character classes, one uses [: and :] around a predefined name inside square brackets and the other uses \ and a special character. They are sometimes interchangeable.
[:digit:] or \d: digits, 0 1 2 3 4 5 6 7 8 9, equivalent to [0-9].
\D: non-digits, equivalent to [^0-9].
[:lower:]: lower-case letters, equivalent to [a-z].
[:upper:]: upper-case letters, equivalent to [A-Z].
[:alpha:]: alphabetic characters, equivalent to [[:lower:][:upper:]] or [A-Za-z].
[:alnum:]: alphanumeric characters, equivalent to [[:alpha:][:digit:]] or [A-Za-z0-9].
[:blank:]: blank characters, i.e. space and tab.
[:space:]: space characters: tab, newline, vertical tab, form feed, carriage return, space.
Other character classes are described below:
\w: word characters, equivalent to [[:alnum:]_] or [A-Za-z0-9_].
\W: not word, equivalent to [^A-Za-z0-9_].
[:xdigit:]: hexadecimal digits (base 16), 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f, equivalent to [0-9A-Fa-f].
\s: space character.
\S: not space.
[:punct:]: punctuation characters, ! " # $ % & ’ ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
[:graph:]: graphical (human readable) characters: equivalent to [[:alnum:][:punct:]].
[:print:]: printable characters, equivalent to [[:graph:] ] (the graphical characters plus the space character).
[:cntrl:]: control characters, like \n or \r, [\x00-\x1F\x7F].
Note:
* [:...:] has to be used inside square brackets, e.g. [[:digit:]].
* \ itself is a special character that needs escape, e.g. \\d. Do not confuse these regular expressions with R escape sequences such as \t.
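The character classes can be exercised on a few strings in base R. The vector x is an illustrative assumption. The last two lines show why the range [A-z] is best avoided: between Z and a the ASCII table contains punctuation such as the underscore (perl = TRUE is used there so that the range is interpreted by code point).

```r
x <- c("abc", "abc123", "  spaced\tout ", "A_1", "!?")   # hypothetical input

has_digit   <- grepl("[[:digit:]]", x)       # POSIX class inside brackets
has_digit2  <- grepl("\\d", x)               # backslash shorthand, same result
only_alpha  <- grepl("^[[:alpha:]]+$", x)    # letters only, start to end
squeezed    <- gsub("\\s+", " ", x[3])       # collapse runs of whitespace
no_punct    <- gsub("[[:punct:]]", "", x[5]) # strip punctuation

wide_range  <- grepl("[A-z]", "_", perl = TRUE)     # TRUE: "_" lies between Z and a
safe_range  <- grepl("[A-Za-z]", "_", perl = TRUE)  # FALSE

print(has_digit)   # FALSE TRUE FALSE TRUE FALSE
print(only_alpha)  # TRUE FALSE FALSE FALSE FALSE
print(squeezed)    # " spaced out "
```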
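Perl-like regular expressions are switched on or off with perl = TRUE/FALSE in base R functions, such as grep() and sub().
For functions in the stringr package, the pattern used to be wrapped with perl(); that helper has been deprecated since stringr 1.0.0 in favour of regex().
Functions in the stringr package
Functions in stringr vs functions in base R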
qdapRegex package: a collection of handy regular expression tools, including handling abbreviations, dates, email addresses, hash tags, phone numbers, times, emoticons, and URL etc.
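As a closing sketch of the perl = TRUE switch mentioned earlier for base R functions: some constructs, such as lookahead and case-changing replacements, are only available in Perl-like mode. The strings below are illustrative assumptions.

```r
x <- c("price: 100USD", "price: 100EUR")   # hypothetical input

# (?=USD) is a lookahead: digits must be followed by "USD", which is not consumed.
has_usd <- grepl("[0-9]+(?=USD)", x, perl = TRUE)
amount  <- regmatches(x, regexpr("[0-9]+(?=USD)", x, perl = TRUE))

# \\U and \\L change the case of what follows in the replacement (perl = TRUE only).
title <- gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", "the quick fox", perl = TRUE)

print(has_usd)  # TRUE FALSE
print(amount)   # "100"
print(title)    # "The Quick Fox"
```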