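Welcome to 医科研. These are 白介素2's reading notes. Join me for stories from the clinic and the lab: biomedical data mining, the R language, and TCGA and GEO data mining.
tidyr: a summary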
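gather(data, key = "", value = "")  ## key is the variable name, value is the value
The point of gather is to reshape the variables of a data set. In wide data the column names (X, Y, Z below) are not true variables; they are values of one underlying variable. gather moves those names into a key column and the cell contents into a value column.
An example: the magic of gather
Argument 1: data
Argument 2: the name of the key column
Argument 3: the name of the value column
Argument 4: the columns to gather, where - excludes a column and gathers everything else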
Sys.setlocale('LC_ALL', 'C')
## [1] "C"
library(tidyverse)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## Registered S3 method overwritten by 'rvest':
##   method            from
##   read_xml.response xml2
## -- Attaching packages -------------------------------------------- tidyverse 1.2.1 --
## √ ggplot2 3.1.0     √ purrr   0.3.0
## √ tibble  2.0.1     √ dplyr   0.8.0.1
## √ tidyr   0.8.2     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.4.0
## -- Conflicts ----------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
stocks <- tibble(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)
stocks
## # A tibble: 10 x 4
##    time             X      Y      Z
##    <date>       <dbl>  <dbl>  <dbl>
##  1 2009-01-01 -0.497  -1.20   5.93
##  2 2009-01-02  1.22    1.58  -4.43
##  3 2009-01-03  1.68   -2.50   8.03
##  4 2009-01-04  1.58    0.744 -2.00
##  5 2009-01-05  0.775   1.87  -3.14
##  6 2009-01-06  0.0405  0.629  4.31
##  7 2009-01-07 -1.42   -1.36   9.63
##  8 2009-01-08  1.18    5.21  -0.231
##  9 2009-01-09 -0.581  -1.02  -0.680
## 10 2009-01-10  0.768   0.900  6.43
Gather X, Y and Z from stocks into a newly named key column and a newly named value column, leaving time unchanged.
gather(stocks, stock, price, -time)
## # A tibble: 30 x 3
##    time       stock   price
##    <date>     <chr>   <dbl>
##  1 2009-01-01 X     -0.497
##  2 2009-01-02 X      1.22
##  3 2009-01-03 X      1.68
##  4 2009-01-04 X      1.58
##  5 2009-01-05 X      0.775
##  6 2009-01-06 X      0.0405
##  7 2009-01-07 X     -1.42
##  8 2009-01-08 X      1.18
##  9 2009-01-09 X     -0.581
## 10 2009-01-10 X      0.768
## # ... with 20 more rows
stocks %>% gather(stock, price, -time)  ## time is kept unchanged
## # A tibble: 30 x 3
##    time       stock   price
##    <date>     <chr>   <dbl>
##  1 2009-01-01 X     -0.497
##  2 2009-01-02 X      1.22
##  3 2009-01-03 X      1.68
##  4 2009-01-04 X      1.58
##  5 2009-01-05 X      0.775
##  6 2009-01-06 X      0.0405
##  7 2009-01-07 X     -1.42
##  8 2009-01-08 X      1.18
##  9 2009-01-09 X     -0.581
## 10 2009-01-10 X      0.768
## # ... with 20 more rows
mini_iris <- iris[c(1, 51, 101), ]
mini_iris
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
gather(mini_iris, key = "flower_att", value = "value", Sepal.Length, Sepal.Width, Petal.Length, Petal.Width)
##       Species   flower_att value
## 1      setosa Sepal.Length   5.1
## 2  versicolor Sepal.Length   7.0
## 3   virginica Sepal.Length   6.3
## 4      setosa  Sepal.Width   3.5
## 5  versicolor  Sepal.Width   3.2
## 6   virginica  Sepal.Width   3.3
## 7      setosa Petal.Length   1.4
## 8  versicolor Petal.Length   4.7
## 9   virginica Petal.Length   6.0
## 10     setosa  Petal.Width   0.2
## 11 versicolor  Petal.Width   1.4
## 12  virginica  Petal.Width   2.5
gather(mini_iris, key = "flower_att", value = "value", Sepal.Length:Petal.Width)
##       Species   flower_att value
## 1      setosa Sepal.Length   5.1
## 2  versicolor Sepal.Length   7.0
## 3   virginica Sepal.Length   6.3
## 4      setosa  Sepal.Width   3.5
## 5  versicolor  Sepal.Width   3.2
## 6   virginica  Sepal.Width   3.3
## 7      setosa Petal.Length   1.4
## 8  versicolor Petal.Length   4.7
## 9   virginica Petal.Length   6.0
## 10     setosa  Petal.Width   0.2
## 11 versicolor  Petal.Width   1.4
## 12  virginica  Petal.Width   2.5
- marks the variables that are not gathered
gather(mini_iris, key = "flow_att", value = "value", -Species)
##       Species     flow_att value
## 1      setosa Sepal.Length   5.1
## 2  versicolor Sepal.Length   7.0
## 3   virginica Sepal.Length   6.3
## 4      setosa  Sepal.Width   3.5
## 5  versicolor  Sepal.Width   3.2
## 6   virginica  Sepal.Width   3.3
## 7      setosa Petal.Length   1.4
## 8  versicolor Petal.Length   4.7
## 9   virginica Petal.Length   6.0
## 10     setosa  Petal.Width   0.2
## 11 versicolor  Petal.Width   1.4
## 12  virginica  Petal.Width   2.5
Dropping the key = and value = argument names
gather(mini_iris, flow_att, value, -Species)  ## gives the same result
##       Species     flow_att value
## 1      setosa Sepal.Length   5.1
## 2  versicolor Sepal.Length   7.0
## 3   virginica Sepal.Length   6.3
## 4      setosa  Sepal.Width   3.5
## 5  versicolor  Sepal.Width   3.2
## 6   virginica  Sepal.Width   3.3
## 7      setosa Petal.Length   1.4
## 8  versicolor Petal.Length   4.7
## 9   virginica Petal.Length   6.0
## 10     setosa  Petal.Width   0.2
## 11 versicolor  Petal.Width   1.4
## 12  virginica  Petal.Width   2.5
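As a compact check of the argument order described above, here is a minimal sketch on a small hand-made data frame. The names `wide`, `long`, `id`, `a` and `b` are made up for illustration and do not come from the examples in this post.

```r
library(tidyr)

# Hypothetical toy data: one id column and two measurement columns
wide <- data.frame(id = c("p1", "p2"), a = c(1, 2), b = c(3, 4),
                   stringsAsFactors = FALSE)

# Gather every column except id: the old column names go into `variable`,
# the cell contents go into `value`
long <- gather(wide, key = "variable", value = "value", -id)
long
```

The two gathered columns of 2 rows each become 4 rows, stacked column by column. In tidyr 1.0.0 and later, `pivot_longer()` is the recommended successor to `gather()`, but the call above still works.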
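Now the same thing in a pipeline. Note that when slice is combined with group_by, slice picks rows within each group. If the data is split into 3 groups, slice(1) returns 1 × 3 = 3 rows, and slice(1:2) returns 2 × 3 = 6 observations.
An example follows.
Show row 1 of each group, one for each of the 3 species.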
Note how slice and group_by work together
library(dplyr)
mini_iris <- iris %>%
  group_by(Species) %>%
  slice(1)
mini_iris %>% gather(key = flower_att, value = measurement, -Species)
## # A tibble: 12 x 3
## # Groups:   Species [3]
##    Species    flower_att   measurement
##    <fct>      <chr>              <dbl>
##  1 setosa     Sepal.Length         5.1
##  2 versicolor Sepal.Length         7
##  3 virginica  Sepal.Length         6.3
##  4 setosa     Sepal.Width          3.5
##  5 versicolor Sepal.Width          3.2
##  6 virginica  Sepal.Width          3.3
##  7 setosa     Petal.Length         1.4
##  8 versicolor Petal.Length         4.7
##  9 virginica  Petal.Length         6
## 10 setosa     Petal.Width          0.2
## 11 versicolor Petal.Width          1.4
## 12 virginica  Petal.Width          2.5
One more example
- in the mtcars data set, cyl defines the groups 4, 6 and 8
- slice(1:2) shows the first 2 rows of each of the groups 4, 6 and 8
by_cyl <- group_by(mtcars, cyl)
slice(by_cyl, 1:2)
## # A tibble: 6 x 11
## # Groups:   cyl [3]
##     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
## 2  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
## 3  21       6  160    110  3.9   2.62  16.5     0     1     4     4
## 4  21       6  160    110  3.9   2.88  17.0     0     1     4     4
## 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
## 6  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
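The row counts above can be checked directly. This is a small sketch using only mtcars and dplyr; the variable names `n_groups_cyl`, `first_one` and `first_two` are made up for illustration.

```r
library(dplyr)

# slice() on a grouped data frame picks rows within each group
by_cyl <- group_by(mtcars, cyl)
n_groups_cyl <- n_groups(by_cyl)   # cyl takes the values 4, 6 and 8

first_one <- slice(by_cyl, 1)      # 1 row per group
first_two <- slice(by_cyl, 1:2)    # 2 rows per group

c(groups = n_groups_cyl, one = nrow(first_one), two = nrow(first_two))
```

The number of rows returned is the number of groups times the number of positions requested, provided every group has at least that many rows.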
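The spread function
spread is the inverse operation of gather.
Its arguments are: argument 1: data; argument 2: key; argument 3: value. It spreads the data out into columns named by key and filled with value.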
library(dplyr)
library(tidyr)
stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)
stocks
##          time          X          Y          Z
## 1  2009-01-01 -0.9981963  0.6149012  -1.880305
## 2  2009-01-02  0.9763906  1.2292060  -2.244749
## 3  2009-01-03  1.3475060 -1.6466510   4.497477
## 4  2009-01-04  0.6845907 -2.8694272 -10.145486
## 5  2009-01-05 -0.3132428  0.2366398  -1.401196
## 6  2009-01-06  1.0542915 -0.5094071   1.311380
## 7  2009-01-07 -2.5360015  1.5011045   1.188158
## 8  2009-01-08 -0.2878114  1.6744369  -3.015077
## 9  2009-01-09 -0.3004896 -2.8344579   4.376036
## 10 2009-01-10 -0.1714464  0.8319891  -2.288022
First gather into key and value
stocksm <- stocks %>% gather(stock, price, -time)
stocksm
##          time stock       price
## 1  2009-01-01     X  -0.9981963
## 2  2009-01-02     X   0.9763906
## 3  2009-01-03     X   1.3475060
## 4  2009-01-04     X   0.6845907
## 5  2009-01-05     X  -0.3132428
## 6  2009-01-06     X   1.0542915
## 7  2009-01-07     X  -2.5360015
## 8  2009-01-08     X  -0.2878114
## 9  2009-01-09     X  -0.3004896
## 10 2009-01-10     X  -0.1714464
## 11 2009-01-01     Y   0.6149012
## 12 2009-01-02     Y   1.2292060
## 13 2009-01-03     Y  -1.6466510
## 14 2009-01-04     Y  -2.8694272
## 15 2009-01-05     Y   0.2366398
## 16 2009-01-06     Y  -0.5094071
## 17 2009-01-07     Y   1.5011045
## 18 2009-01-08     Y   1.6744369
## 19 2009-01-09     Y  -2.8344579
## 20 2009-01-10     Y   0.8319891
## 21 2009-01-01     Z  -1.8803050
## 22 2009-01-02     Z  -2.2447486
## 23 2009-01-03     Z   4.4974771
## 24 2009-01-04     Z -10.1454861
## 25 2009-01-05     Z  -1.4011960
## 26 2009-01-06     Z   1.3113796
## 27 2009-01-07     Z   1.1881581
## 28 2009-01-08     Z  -3.0150769
## 29 2009-01-09     Z   4.3760358
## 30 2009-01-10     Z  -2.2880217
Spread the data back out
Spread by stock and price
stocksm %>% spread(stock, price)
##          time          X          Y          Z
## 1  2009-01-01 -0.9981963  0.6149012  -1.880305
## 2  2009-01-02  0.9763906  1.2292060  -2.244749
## 3  2009-01-03  1.3475060 -1.6466510   4.497477
## 4  2009-01-04  0.6845907 -2.8694272 -10.145486
## 5  2009-01-05 -0.3132428  0.2366398  -1.401196
## 6  2009-01-06  1.0542915 -0.5094071   1.311380
## 7  2009-01-07 -2.5360015  1.5011045   1.188158
## 8  2009-01-08 -0.2878114  1.6744369  -3.015077
## 9  2009-01-09 -0.3004896 -2.8344579   4.376036
## 10 2009-01-10 -0.1714464  0.8319891  -2.288022
Spread by time and price
stocksm %>% spread(time, price)
##   stock 2009-01-01 2009-01-02 2009-01-03  2009-01-04 2009-01-05 2009-01-06
## 1     X -0.9981963  0.9763906   1.347506   0.6845907 -0.3132428  1.0542915
## 2     Y  0.6149012  1.2292060  -1.646651  -2.8694272  0.2366398 -0.5094071
## 3     Z -1.8803050 -2.2447486   4.497477 -10.1454861 -1.4011960  1.3113796
##   2009-01-07 2009-01-08 2009-01-09 2009-01-10
## 1  -2.536001 -0.2878114 -0.3004896 -0.1714464
## 2   1.501104  1.6744369 -2.8344579  0.8319891
## 3   1.188158 -3.0150769  4.3760358 -2.2880217
Showing that gather and spread are complementary
stocks
##          time          X          Y          Z
## 1  2009-01-01 -0.9981963  0.6149012  -1.880305
## 2  2009-01-02  0.9763906  1.2292060  -2.244749
## 3  2009-01-03  1.3475060 -1.6466510   4.497477
## 4  2009-01-04  0.6845907 -2.8694272 -10.145486
## 5  2009-01-05 -0.3132428  0.2366398  -1.401196
## 6  2009-01-06  1.0542915 -0.5094071   1.311380
## 7  2009-01-07 -2.5360015  1.5011045   1.188158
## 8  2009-01-08 -0.2878114  1.6744369  -3.015077
## 9  2009-01-09 -0.3004896 -2.8344579   4.376036
## 10 2009-01-10 -0.1714464  0.8319891  -2.288022
stocks %>%
  gather(key = stock, value = price, -time) %>%  ## gather first
  spread(key = stock, value = price) %>%         ## then spread back to the original shape
  identical(stocks)                              ## check whether the result is exactly the original stocks
## [1] TRUE
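The round trip can also be reproduced without random numbers. The sketch below uses a small fixed data frame (the names `wide`, `long`, `back` and `day` are made up for illustration) and compares names and values separately instead of relying on identical() alone.

```r
library(tidyr)

# Hypothetical fixed data, so the result does not depend on rnorm()
wide <- data.frame(day = 1:3, X = c(0.1, 0.2, 0.3), Y = c(1, 2, 3))

long <- gather(wide, key = "stock", value = "price", -day)  # 6 rows
back <- spread(long, key = stock, value = price)            # 3 rows again

# Check the column names and the values one column at a time
same_names  <- identical(names(back), names(wide))
same_values <- all(back$day == wide$day, back$X == wide$X, back$Y == wide$Y)
c(same_names = same_names, same_values = same_values)
```

Checking names and values separately makes it easier to see which part differs if a round trip on other data does not come back exactly as it started.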
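To sum up: with gather and spread you can move freely between wide and long data.
The long format that gather produces is well suited as input to ggplot2 for visualization.
Since we are here, let's draw a plot, although visualization itself will not be covered in detail for now.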
A quick try
library(ggplot2)
p <- stocks %>%
  gather(key = stock, value = price, -time) %>%
  as_tibble() %>%  ## pass the result straight into ggplot2 for plotting
  ggplot2::ggplot(aes(x = stock, y = price, fill = stock)) +
  geom_boxplot()
p
[Figure: box plot of price by stock]
Change the colors
p + scale_fill_brewer(palette = "Dark2")
[Figure: the box plot with the Dark2 palette]
Use colors of your own choosing
p + scale_fill_manual(values = c("#999999", "#E69F00", "#56B4E9"))
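[Figure: the box plot with manually chosen fill colors]
The tidyr::unite function
unite makes it easy to paste several columns together into one.
Argument 1: data, a data frame; argument 2: the name of the new column; then the columns to unite; sep: the separator; remove = T: remove the original columns.
An example follows. The function is handy, and it is simple to use.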
Paste the vs and am columns together
library(dplyr)
library(tidyr)
unite_(mtcars, "vs_am", c("vs", "am"))
##                      mpg cyl  disp  hp drat    wt  qsec vs_am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46   0_1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02   0_1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61   1_1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44   1_0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02   0_0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22   1_0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84   0_0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00   1_0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90   1_0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30   1_0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90   1_0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40   0_0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60   0_0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00   0_0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98   0_0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82   0_0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42   0_0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47   1_1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52   1_1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90   1_1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01   1_0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87   0_0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30   0_0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41   0_0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05   0_0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90   1_1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70   0_1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90   1_1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50   0_1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50   0_1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60   0_1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60   1_1    4    2
Pasting and then splitting again is a reversible operation
mtcars %>%
  unite(vs_am, vs, am) %>%
  separate(vs_am, c("vs", "am"))
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
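One detail of this round trip is worth a closer look: unite pastes values into a character column, so the columns that separate produces are character too unless type conversion is requested. The sketch below uses a small made-up data frame (`df`, `united`, `chr_back` and `num_back` are illustrative names) and the `convert` argument of separate.

```r
library(tidyr)

# Hypothetical toy data with two numeric columns
df <- data.frame(id = c("a", "b"), vs = c(0, 1), am = c(1, 0))

united <- unite(df, "vs_am", vs, am, sep = "_")
united$vs_am   # character values such as "0_1"

# Without convert, the split columns stay character
chr_back <- separate(united, vs_am, c("vs", "am"), sep = "_")

# With convert = TRUE, separate() runs type conversion on the new columns
num_back <- separate(united, vs_am, c("vs", "am"), sep = "_", convert = TRUE)

c(chr = class(chr_back$vs), num = class(num_back$vs))
```

The values are recovered either way; `convert = TRUE` is only needed when the split columns should be numeric again for later calculations.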
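The separate function: the inverse of unite
Argument 1: data; argument 2: the column to split; argument 3: the new variables to split it into; argument 4: sep, the splitting pattern.
This function splits 1 column into several, and its usage closely mirrors unite.
There is no need to repeat much here; a few simple examples will do.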
library(dplyr)
library(tidyr)
df <- data.frame(x = c(NA, "a.b", "a.d", "b.c"))
df
##      x
## 1 <NA>
## 2  a.b
## 3  a.d
## 4  b.c
Split x into A and B
df %>% separate(x, c("A", "B"))
##      A    B
## 1 <NA> <NA>
## 2    a    b
## 3    a    d
## 4    b    c
If you only want to keep the second variable
df %>% separate(x, c(NA, "B"))
##      B
## 1 <NA>
## 2    b
## 3    d
## 4    c
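The examples above rely on the default separator. As a sketch of what the sep argument accepts: a character sep is treated as a regular expression (the default splits on any run of non-alphanumeric characters), and a numeric sep splits by position. The variable names below are made up for illustration.

```r
library(tidyr)

df <- data.frame(x = c("a.b", "a.d", "b.c"), stringsAsFactors = FALSE)

# Default: split on any run of non-alphanumeric characters
by_default <- separate(df, x, c("A", "B"))

# Explicit separator: sep is a regular expression, so the dot is escaped
by_regex <- separate(df, x, c("A", "B"), sep = "\\.")

# A numeric sep splits by position instead (after the first character here)
by_position <- separate(df, x, c("A", "B"), sep = 1)
by_position$B   # the dot stays attached to the second piece
```

An unescaped `sep = "."` would match every character, which is a common source of surprising results.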
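A harder question: what if the column being split does not yield the same number of pieces in every row? separate provides two arguments, extra and fill, to control how the split is handled.
extra controls what happens when there are too many pieces. warn: warn and drop the extras; drop: drop the extras without a warning; merge: keep the extras by merging them into the last column.
fill controls what happens when there are too few pieces. warn: warn and pad with NA on the right; right: pad with NA on the right; left: pad with NA on the left.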
df <- data.frame(x = c("a", "a b", "a b c", NA))
df
##       x
## 1     a
## 2   a b
## 3 a b c
## 4  <NA>
Called this way, separate emits warnings
df %>% separate(x, c("a", "b"))
## Warning: Expected 2 pieces. Additional pieces discarded in 1 rows [3].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].
##      a    b
## 1    a <NA>
## 2    a    b
## 3    a    b
## 4 <NA> <NA>
Drop the extra pieces and pad with NA on the right
df %>% separate(x, c("a", "b"), extra = "drop", fill = "right")
##      a    b
## 1    a <NA>
## 2    a    b
## 3    a    b
## 4 <NA> <NA>
Merge the extra pieces and pad with NA on the left
df
##       x
## 1     a
## 2   a b
## 3 a b c
## 4  <NA>
df %>% separate(x, c("a", "b"), extra = "merge", fill = "left")
##      a    b
## 1 <NA>    a
## 2    a    b
## 3    a  b c
## 4 <NA> <NA>
The same idea with a different separator
df <- data.frame(x = c("x: 123", "y: error: 7"))
df
##             x
## 1      x: 123
## 2 y: error: 7
df %>% separate(x, c("key", "value"), ": ", extra = "merge")
##   key    value
## 1   x      123
## 2   y error: 7