Character Manipulation in R

gearss 2018-05-09


From "Data Manipulation with R"

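1. Counting characters

The nchar function returns the number of characters in each string. state.name is a built-in vector holding the names of the 50 US states; the example below computes the length of each name: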

head(state.name)

## [1] "Alabama"    "Alaska"     "Arizona"    "Arkansas"   "California"
## [6] "Colorado"

nchar(state.name)

##  [1]  7  6  7  8 10  8 11  8  7  7  6  5  8  7  4  6  8  9  5  8 13  8  9
## [24] 11  8  7  8  6 13 10 10  8 14 12  4  8  6 12 12 14 12  9  5  4  7  8
## [47] 10 13  9  7
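
Because nchar is vectorized, its result can be used directly to index the original vector. The following is a minimal sketch (the names `lens` and `longest` are illustrative and not part of the original text) that picks out the longest state names:

```r
# Lengths of all 50 state names (state.name ships with base R)
lens <- nchar(state.name)

# Keep only the names whose length equals the maximum
longest <- state.name[lens == max(lens)]
longest
```

The same pattern works with any comparison, for example `state.name[lens <= 4]` for the shortest names.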

2. Displaying and concatenating strings

The cat function concatenates its arguments and prints the result to the screen:

x = 7
y = 10
cat("x should be greater than y, but x=", x, "and y=", y, "\n")

## x should be greater than y, but x= 7 and y= 10

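When many strings are concatenated, the fill= argument inserts line breaks automatically so that no output line exceeds the given width. For example: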

cat("Long strings can", "be displayed over", "several lines using", "the fill= argument", 
    fill = 40)

## Long strings can be displayed over 
## several lines using the fill= argument

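The file= argument writes the concatenated output to a file instead of the screen, and append=TRUE adds the output to the end of an existing file rather than overwriting it. For example: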

cat("Long strings can", "be displayed over", "several lines using", "the fill= argument", 
    fill = 40, file = "c:/test.txt")
cat("Long strings can", "be displayed over", "several lines using", "the fill= argument", 
    fill = 40, file = "c:/test.txt", append = TRUE)
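
To try file= and append= without depending on a particular drive or path, a temporary file can be used. This is a small sketch under that assumption; the variable names `out` and `written` are hypothetical:

```r
# tempfile() returns a path inside the session's temporary directory
out <- tempfile(fileext = ".txt")

# The first call creates the file, the second appends to it
cat("first line\n", file = out)
cat("second line\n", file = out, append = TRUE)

# Read the file back to check what was written
written <- readLines(out)
written
```

Without append = TRUE the second call would overwrite the file, leaving only the second line.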

paste also concatenates strings, but it returns the result as a character vector instead of printing it:

paste("one", 2, "three", 4, "five")

## [1] "one 2 three 4 five"

# To join the elements of a character vector into one string, collapse= gives the separator placed between elements
paste(c("one", "two", "three", "four"), collapse = " ")

## [1] "one two three four"

# sep= also sets a separator, but it is placed between arguments, so it has no effect on a single vector
paste(c("one", "two", "three", "four"), sep = " ,")

## [1] "one"   "two"   "three" "four"

# When several arguments are passed, sep= controls what is placed between their corresponding elements
paste("X", 1:5, sep = "")

## [1] "X1" "X2" "X3" "X4" "X5"

paste(c("X", "Y"), 1:5, sep = "")

## [1] "X1" "Y2" "X3" "Y4" "X5"

# Using collapse and sep together
paste(c("X", "Y"), 1:5, sep = "_", collapse = "|")

## [1] "X_1|Y_2|X_3|Y_4|X_5"

paste(c("X", "Y"), 1:5, "^", c("a", "b"), sep = "_", collapse = "|")

## [1] "X_1_^_a|Y_2_^_b|X_3_^_a|Y_4_^_b|X_5_^_a"

paste(c("X", "Y"), 1:5, "^", c("a", "b"), sep = "_")

## [1] "X_1_^_a" "Y_2_^_b" "X_3_^_a" "Y_4_^_b" "X_5_^_a"
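
The examples above rely on two behaviours that are easy to check in isolation: paste0 is shorthand for paste with sep = "", and shorter arguments are recycled to the length of the longest one. A brief sketch (variable names are illustrative):

```r
# paste0(...) is equivalent to paste(..., sep = "")
a <- paste("X", 1:5, sep = "")
b <- paste0("X", 1:5)
same <- identical(a, b)

# c("X", "Y") is recycled to match the four numbers
recycled <- paste0(c("X", "Y"), 1:4)
recycled
```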

3. Extracting substrings

The substring function extracts part of a string:

substring(state.name, 2, 6)

##  [1] "labam" "laska" "rizon" "rkans" "alifo" "olora" "onnec" "elawa"
##  [9] "lorid" "eorgi" "awaii" "daho"  "llino" "ndian" "owa"   "ansas"
## [17] "entuc" "ouisi" "aine"  "aryla" "assac" "ichig" "innes" "issis"
## [25] "issou" "ontan" "ebras" "evada" "ew Ha" "ew Je" "ew Me" "ew Yo"
## [33] "orth " "orth " "hio"   "klaho" "regon" "ennsy" "hode " "outh "
## [41] "outh " "ennes" "exas"  "tah"   "ermon" "irgin" "ashin" "est V"
## [49] "iscon" "yomin"

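The first and last arguments of substring give the start and end positions of the extraction; both may be vectors, so several pieces can be extracted in one call: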

mystring = "dog cat duck"
substring(mystring, c(1, 5, 9), c(3, 7, 12))

## [1] "dog"  "cat"  "duck"

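To find where a particular character occurs in a string, first split the string into a vector of single characters, then look up the positions of the character of interest: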

state = "Mississippi"
ll = nchar(state)
ltrs = substring(state, 1:ll, 1:ll)
ltrs

##  [1] "M" "i" "s" "s" "i" "s" "s" "i" "p" "p" "i"

which(ltrs == "s")

## [1] 3 4 6 7
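
The same positions can be obtained in other ways. As an alternative sketch (not the approach used in the text above), strsplit with an empty separator or gregexpr both locate the letter directly:

```r
state <- "Mississippi"

# Split into single characters, then find the matching positions
pos_split <- which(strsplit(state, "")[[1]] == "s")

# gregexpr returns the start of every match; as.integer drops its attributes
pos_greg <- as.integer(gregexpr("s", state)[[1]])

pos_split
pos_greg
```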

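substring can also appear on the left side of an assignment to replace part of a string. The string never changes length: if the replacement is longer than the target span, only as many characters as fit are used, and if it is shorter, only that many characters of the span are replaced: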

mystring = "dog cat duck"
substring(mystring, 5, 7) = "feline"
mystring

## [1] "dog fel duck"

mystring = "dog cat duck"
substring(mystring, 5, 7) = "a"
mystring

## [1] "dog aat duck"
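
When the replacement should be allowed to change the length of the string, substitution by pattern is one option. A minimal sketch using sub with fixed = TRUE, which treats the pattern as a literal string (the name `longer` is illustrative):

```r
mystring <- "dog cat duck"

# Replace the literal text "cat"; the result may be longer than the input
longer <- sub("cat", "feline", mystring, fixed = TRUE)
longer
```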

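4. Regular expressions in R

Regular expressions in R resemble those of the UNIX grep command. In a regular expression, a backslash "\" causes the special character that follows it to be treated as an ordinary character, so special characters that should match literally must be preceded by "\". Because the backslash is also the escape character in R string literals, it has to be typed twice ("\\"). Printing the string shows both backslashes, but nchar counts, and cat displays, only one. "." is a wildcard in R regular expressions. As an example, consider an expression that matches file names with the .txt extension: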

expr = ".*\\.txt"
nchar(expr)

## [1] 7

cat(expr, "\n")

## .*\.txt
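
To see why the dot needs escaping, compare an escaped and an unescaped pattern on two made-up file names (the vector `files` is a hypothetical example):

```r
files <- c("notes.txt", "notestxt")

# Unescaped: "." matches any character, so both names match
loose <- grepl(".txt", files)

# Escaped: "\\." matches only a literal dot
strict <- grepl("\\.txt", files)

loose
strict
```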

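The readline function reads a line exactly as typed, so a regular expression entered this way needs neither surrounding quotes nor doubled backslashes. Typing .*\.txt at the prompt stores the same 7-character expression as before. (The 0 shown below arises because this document was rendered non-interactively, in which case readline returns an empty string.)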

expr = readline()

nchar(expr)

## [1] 0

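5. Regular expression basics

Regular expressions in R are built from three components: literal characters, which match themselves; character classes; and modifiers, which operate on literal characters and character classes. Many punctuation marks act as modifiers and keep their literal meaning only when preceded by a backslash. The special characters are ". ^ $ + ? * ( ) [ ] { } | \". A character class is written by enclosing characters in square brackets; for example, [ab3] matches any one of a, b or 3. A dash denotes a range: [a-z] matches any lowercase letter, and [5-9] matches 5, 6, 7, 8 or 9. To include a literal dash in a character class, put it first in the class or precede it with a backslash (R's own documentation describes the backslash form only for perl = TRUE, so placing the dash first is the safer choice).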
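
A few character classes in action, as a small sketch with made-up inputs:

```r
# [ab3] matches any string containing a, b or 3
has_ab3 <- grepl("[ab3]", c("apple", "xyz", "123"))

# [5-9] matches any string containing a digit from 5 to 9
has_high <- grepl("[5-9]", c("42", "47"))

# A dash placed first in the class is taken literally
has_sign <- grepl("[-+]", c("a-b", "a+b", "ab"))

has_ab3
has_high
has_sign
```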

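Regular expressions in R are ordinary strings, so they can be built and manipulated like any other string. For example, joining several strings with "|" produces an expression that matches any one of them: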

strs = c("chicken", "dog", "cat")
expr = paste(strs, collapse = "|")
expr

## [1] "chicken|dog|cat"
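
An expression built this way can be passed straight to a matching function. A short sketch, using a made-up test vector `phrases`:

```r
strs <- c("chicken", "dog", "cat")
expr <- paste(strs, collapse = "|")

# TRUE wherever any of the alternatives occurs in the string
phrases <- c("hot dog", "bird", "catfish")
hits <- grepl(expr, phrases)
hits
```

Note that "catfish" matches because the alternatives are not anchored to whole words.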

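6. Splitting strings

6.1 Splitting on a fixed string

The strsplit function splits strings on either a fixed string or a regular expression. For example: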

sentence = "R is a free software environment for statistical computing"
parts = strsplit(sentence, " ")
parts

## [[1]]
## [1] "R"           "is"          "a"           "free"        "software"   
## [6] "environment" "for"         "statistical" "computing"

Note that parts is a list, so its elements must be accessed by indexing:

length(parts)

## [1] 1

length(parts[[1]])

## [1] 9

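When strsplit is applied to a vector of strings, the result is a list with one element per string; sapply can then report the number of pieces in each: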

more = c("R is a free software environment for statistical computing", "It compiles and runs on a wide variety of UNIX platforms")
result = strsplit(more, " ")
sapply(result, length)

## [1]  9 11

If the list structure is not needed, unlist combines all the pieces into a single vector:

allparts = unlist(result)
allparts

##  [1] "R"           "is"          "a"           "free"        "software"   
##  [6] "environment" "for"         "statistical" "computing"   "It"         
## [11] "compiles"    "and"         "runs"        "on"          "a"          
## [16] "wide"        "variety"     "of"          "UNIX"        "platforms"

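6.2 Splitting on a regular expression

The separator given to strsplit is interpreted as a regular expression. Consider a string with several spaces between words, first split on a single space: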

str = "one  two  three  four"
strsplit(str, " ")

## [[1]]
## [1] "one"   ""      "two"   ""      "three" ""      "four"

Using a regular expression that matches one or more spaces (the "+" modifier) avoids the empty strings:

strsplit(str, " +")

## [[1]]
## [1] "one"   "two"   "three" "four"
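
Because the separator is a regular expression, splitting on a special character such as "." needs some care. The sketch below (variable names are illustrative) shows two ways to split on a literal dot, and what happens when the dot is left unescaped:

```r
dotted <- "a.b.c"

# fixed = TRUE treats the separator as a literal string
by_fixed <- strsplit(dotted, ".", fixed = TRUE)[[1]]

# Inside a character class the dot is literal as well
by_class <- strsplit(dotted, "[.]")[[1]]

# Unescaped, "." matches every character, so only empty strings remain
by_regex <- strsplit(dotted, ".")[[1]]

by_fixed
by_class
by_regex
```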

Using an empty string as the separator splits each string into single characters, giving a list of character vectors:

words = c("one two", "three four")
strsplit(words, "")

## [[1]]
## [1] "o" "n" "e" " " "t" "w" "o"
## 
## [[2]]
##  [1] "t" "h" "r" "e" "e" " " "f" "o" "u" "r"

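7. Using regular expressions in R

The grep function takes a regular expression and a string or vector of strings, and returns the indices of the elements that match. With value=TRUE it returns the matching elements themselves instead of their indices. One major use of regular expressions is selecting variables by name, for example extracting the variables pop15 and pop75 from the LifeCycleSavings data set: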

grep("^pop", names(LifeCycleSavings))

## [1] 2 3

grep("^pop", names(LifeCycleSavings), value = TRUE)

## [1] "pop15" "pop75"

Create a data frame consisting of pop15 and pop75:

head(LifeCycleSavings[, grep("^pop", names(LifeCycleSavings))])

##           pop15 pop75
## Australia 29.35  2.87
## Austria   23.32  4.41
## Belgium   23.80  4.43
## Bolivia   41.89  1.67
## Brazil    42.19  0.83
## Canada    31.72  2.85

For case-insensitive matching, use ignore.case=TRUE. For example, to find the word dog in the following character vector regardless of case:

inp = c("run dog run", "work doggedly", "CAT AND DOG")
grep("\\<dog\\>", inp, ignore.case = TRUE)

## [1] 1 3

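"\\<" and "\\>" match the beginning and the end of a word, where words are delimited by spaces, punctuation, or the start or end of the line.

If the regular expression matches nothing, grep returns an empty numeric vector. The any function can therefore be used to test whether any string matches the expression: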
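
The effect of the word delimiters is easiest to see by running the same search with and without them. A minimal sketch reusing the inp vector from the example (the names `anywhere` and `whole` are illustrative):

```r
inp <- c("run dog run", "work doggedly", "CAT AND DOG")

# Without delimiters, "dog" inside "doggedly" also matches
anywhere <- grep("dog", inp, ignore.case = TRUE)

# With delimiters, only the whole word matches
whole <- grep("\\<dog\\>", inp, ignore.case = TRUE)

anywhere
whole
```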

str1 = c("The R Foundation", "is a not for profit organization", "working in the public interest")
str2 = c(" It was founded by the members", "of the R Core Team in order", "to provide support for the R project")
any(grep("profit", str1))

## [1] TRUE

any(grep("profit", str2))

## [1] FALSE
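
An alternative to any(grep(...)) is grepl, which returns a logical vector with one value per input string. This is a sketch of that variant, not the method used in the text above:

```r
str1 <- c("The R Foundation", "is a not for profit organization",
          "working in the public interest")

# One TRUE/FALSE per element of str1
per_element <- grepl("profit", str1)

# Collapse to a single answer
found <- any(per_element)

per_element
found
```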

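The regexpr and gregexpr functions allow more precise matching and make it possible to extract the part of a string that matches a regular expression. They return the starting position of each match, or -1 when there is no match. In addition, the match.length attribute of the result gives the length of each match. regexpr reports only the first match in each string, whereas gregexpr reports all of them.

Since regexpr returns the starting position of the first match (or -1 when there is none), the end position can be computed from the start and the match.length attribute, and substring can then extract the matched text.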

tst = c("one x7 two b1", "three c5 four b9", "five six seven", "a8 eight nine")
wh = regexpr("[a-z][0-9]", tst)
wh

## [1]  5  7 -1  1
## attr(,"match.length")
## [1]  2  2 -1  2
## attr(,"useBytes")
## [1] TRUE

# Extract the matched text
res = substring(tst, wh, wh + attr(wh, "match.length") - 1)
res

## [1] "x7" "c5" ""   "a8"

Strings with no match yield an empty string in the example above; these can be dropped:

res[res != ""]

## [1] "x7" "c5" "a8"
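
Base R also provides regmatches, which performs the position arithmetic internally. As a sketch of this alternative (not the approach shown above), it returns only the strings that actually matched:

```r
tst <- c("one x7 two b1", "three c5 four b9", "five six seven", "a8 eight nine")
wh <- regexpr("[a-z][0-9]", tst)

# Elements without a match are dropped from the result
first_matches <- regmatches(tst, wh)
first_matches
```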

gregexpr works like regexpr, but it returns a list with one element per input string, each holding the positions of all matches:

wh1 = gregexpr("[a-z][0-9]", tst)
wh1

## [[1]]
## [1]  5 12
## attr(,"match.length")
## [1] 2 2
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1]  7 15
## attr(,"match.length")
## [1] 2 2
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
## 
## [[4]]
## [1] 1
## attr(,"match.length")
## [1] 2
## attr(,"useBytes")
## [1] TRUE

To extract the text matched by gregexpr, substring has to be applied to each element of the resulting list, for example with a for loop:

res1 = list()
# seq_along(wh1) is safe even if wh1 is empty, unlike 1:length(wh1)
for (i in seq_along(wh1)) res1[[i]] = substring(tst[i], wh1[[i]], wh1[[i]] + 
    attr(wh1[[i]], "match.length") - 1)
res1

## [[1]]
## [1] "x7" "b1"
## 
## [[2]]
## [1] "c5" "b9"
## 
## [[3]]
## [1] ""
## 
## [[4]]
## [1] "a8"

mapply applies a function to several arguments in parallel, so the loop above can be rewritten with mapply:

getexpr = function(str, greg) substring(str, greg, greg + attr(greg, "match.length") - 
    1)
res2 = mapply(getexpr, tst, wh1)
res2

## $`one x7 two b1`
## [1] "x7" "b1"
## 
## $`three c5 four b9`
## [1] "c5" "b9"
## 
## $`five six seven`
## [1] ""
## 
## $`a8 eight nine`
## [1] "a8"
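
regmatches also accepts the result of gregexpr, returning a list with all matches per string. A minimal sketch of this alternative; note that a string with no match gives an empty character vector here rather than "":

```r
tst <- c("one x7 two b1", "three c5 four b9", "five six seven", "a8 eight nine")

# One list element per input string, holding every match in that string
all_matches <- regmatches(tst, gregexpr("[a-z][0-9]", tst))

# Number of matches found in each string
n_matches <- sapply(all_matches, length)

all_matches
n_matches
```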

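8. Substitution and tagging

The sub and gsub functions replace text: sub replaces only the first match of the regular expression, whereas gsub replaces every match. A common use is cleaning numeric data that contain units or formatting characters, for example: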

values = c("$11,317.35", "$11,234.51", "$11,275.89", "$11,278.93", "$11,294.94")
as.numeric(gsub("[$,]", "", values))

## [1] 11317 11235 11276 11279 11295

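Regular expressions used for substitution often contain tagged parts, which are the parts enclosed in parentheses. In the replacement string, a backslash followed by a number refers to a tagged part: "\\1" stands for the text matched by the first tagged part, "\\2" for the second, and so on. Financial reports, for example, often use parentheses to denote negative numbers; the following converts the parentheses to a minus sign: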

values = c("75.99", "(20.30)", "55.20")
as.numeric(gsub("\\(([0-9.]+)\\)", "-\\1", values))

## [1]  75.99 -20.30  55.20
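
With more than one tagged part, the replacement can rearrange the matched pieces. A small sketch with a made-up input string:

```r
# "\\1" is the first word, "\\2" the second; the replacement swaps them
swapped <- sub("([a-z]+) ([a-z]+)", "\\2 \\1", "hello world")
swapped
```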

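Extracting the text that matches part of a regular expression sometimes requires the anchors "^" and "$" (start and end of the string). For example, to extract the 12 that follows value=, a simple call to sub is not enough, because it replaces only the matched portion and leaves the rest of the string in place: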

str = "report: 17 value=12 time=2:00"
sub("value=([^ ]+)", "\\1", str)

## [1] "report: 17 12 time=2:00"

Extending the regular expression so that it matches the whole string gives the desired result:

sub("^.*value=([^ ]+).*$", "\\1", str)

## [1] "12"

Alternatively, the match can first be located with regexpr or gregexpr and extracted, and sub or gsub can then be applied to the extracted text:

str = "report: 17 value=12 time=2:00"
greg = gregexpr("value=[^ ]+", str)[[1]]
sub("value=([^ ]+)", "\\1", substring(str, greg, greg + attr(greg, "match.length") - 
    1))
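
The locate-then-substitute idea can also be written with regmatches, which avoids the manual position arithmetic. This is a sketch of an equivalent approach; the names `m` and `val` are hypothetical:

```r
str <- "report: 17 value=12 time=2:00"

# Step 1: extract the whole "value=..." token
m <- regmatches(str, regexpr("value=[^ ]+", str))

# Step 2: strip the literal prefix, leaving only the value
val <- sub("value=", "", m, fixed = TRUE)

m
as.numeric(val)
```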