写完之后,发现之前写过一篇类似的:R语言进阶笔记4 | dplyr 汇总统计。看着重复劳动的产品,无比的感慨,昨日重现?历史重演? 自己的思维模式已经固定? 看似灵机一动的想法,其实都是固定的模式?
以为是一个新的idea,其实想过很多次,不过都忘了!
如果历史能重来,还是如此的选择,人生并不会有什么不同。 死循环,一遍又一遍。 没有新鲜事。 从一篇R语言tidyverse的使用,到感悟人生哲理,再为自己日复一日的原地踏步感到可悲,画地为牢,不敢去挑战新的东西,学习新的领域,坐吃山空,羞愧不已。以此篇为终点, 不要把已经掌握的东西咀嚼很多遍,感慨杯水之波澜,向前,向前!
无以为念,写一篇标题最长的博文作为节点!
正文分割线
之前写过一篇博客,介绍R语言的aggregate 函数的汇总统计,最近学习tidyverse ,感觉更好用,对比学习一下。 1 模拟数据这里模拟了4个因子,5个观测值的数据框, 主要介绍了一下几种方法的汇总统计: - 1, 单变量~单因子,单个个统计量, 这里使用平均数mean
- 2 单变量~单因子,多个个统计量, 这里使用自定义的函数func
1.1 模拟数据代码「模拟数据:」 set.seed(123) dat = data.frame(F1=1:24,F2=rep(1:2,12),F3=rep(1:3,8),F4=rep(1:4,6), y1=rnorm(24),y2=rnorm(24),y3=rnorm(24),y4=rnorm(24)) dat
「结果:」 > dat F1 F2 F3 F4 y1 y2 y3 y4 1 1 1 1 1 -0.56047565 -0.62503927 0.77996512 1.005738524 2 2 2 2 2 -0.23017749 -1.68669331 -0.08336907 -0.709200763 3 3 1 3 3 1.55870831 0.83778704 0.25331851 -0.688008616 4 4 2 1 4 0.07050839 0.15337312 -0.02854676 1.025571370 5 5 1 2 1 0.12928774 -1.13813694 -0.04287046 -0.284773007 6 6 2 3 2 1.71506499 1.25381492 1.36860228 -1.220717712 7 7 1 1 3 0.46091621 0.42646422 -0.22577099 0.181303480 8 8 2 2 4 -1.26506123 -0.29507148 1.51647060 -0.138891362 9 9 1 3 1 -0.68685285 0.89512566 -1.54875280 0.005764186 10 10 2 1 2 -0.44566197 0.87813349 0.58461375 0.385280401 11 11 1 2 3 1.22408180 0.82158108 0.12385424 -0.370660032 12 12 2 3 4 0.35981383 0.68864025 0.21594157 0.644376549 13 13 1 1 1 0.40077145 0.55391765 0.37963948 -0.220486562 14 14 2 2 2 0.11068272 -0.06191171 -0.50232345 0.331781964 15 15 1 3 3 -0.55584113 -0.30596266 -0.33320738 1.096839013 16 16 2 1 4 1.78691314 -0.38047100 -1.01857538 0.435181491 17 17 1 2 1 0.49785048 -0.69470698 -1.07179123 -0.325931586 18 18 2 3 2 -1.96661716 -0.20791728 0.30352864 1.148807618 19 19 1 1 3 0.70135590 -1.26539635 0.44820978 0.993503856 20 20 2 2 4 -0.47279141 2.16895597 0.05300423 0.548396960 21 21 1 3 1 -1.06782371 1.20796200 0.92226747 0.238731735 22 22 2 1 2 -0.21797491 -1.12310858 2.05008469 -0.627906076 23 23 1 2 3 -1.02600445 -0.40288484 -0.49103117 1.360652449 24 24 2 3 4 -0.72889123 -0.46665535 -2.30916888 -0.600259587
2 单性状~ 单因子~单统计量❝一个统计量, 使用mean ❞ 2.1 aggregate方法:「代码」 aggregate(y1 ~ F4, data=dat,mean)
「结果」 > aggregate(y1 ~ F4, data=dat,mean) F4 y1 1 1 -0.21454042 2 2 -0.17244730 3 3 0.39386944 4 4 -0.04158475
2.2 tidyverse方法:「代码」 dat %>% group_by(F4) %>% summarise(y1_mean = mean(y1))
「结果」 > dat %>% group_by(F4) %>% summarise(y1_mean = mean(y1)) # A tibble: 4 x 2 F4 y1_mean * <int> <dbl> 1 1 -0.215 2 2 -0.172 3 3 0.394 4 4 -0.0416
3 单性状~ 单因子~多个统计量❝多个统计量, 使用length,mean,sd,cv ❞ 3.1 aggregate方法:「代码」 func <- function(x)(c(n = length(x),mean=mean(x,na.rm = T),sd=sd(x,na.rm = T),cv=sd(x,na.rm = T)/mean(x,na.rm = T)*100)) aggregate(y1 ~ F4, data=dat,func) %>% round(4)
「结果」 > func <- function(x)(c(n = length(x),mean=mean(x,na.rm = T),sd=sd(x,na.rm = T),cv=sd(x,na.rm = T)/mean(x,na.rm = T)*100)) > aggregate(y1 ~ F4, data=dat,func) %>% round(4) F4 y1.n y1.mean y1.sd y1.cv 1 1 6.0000 -0.2145 0.6442 -300.2843 2 2 6.0000 -0.1724 1.1783 -683.2816 3 3 6.0000 0.3939 1.0063 255.4892 4 4 6.0000 -0.0416 1.0651 -2561.3034
3.2 tidyverse方法:「代码」 dat %>% group_by(F4) %>% summarise_at("y1",list(n = ~length(.), mean =~ mean(.), sd = ~sd(.), cv = ~sd(.)/mean(.)*100),na.rm=T)
「结果」 > dat %>% group_by(F4) %>% summarise_at("y1",list(n = ~length(.), mean =~ mean(.), sd = ~sd(.), cv = ~sd(.)/mean(.)*100),na.rm=T) # A tibble: 4 x 5 F4 n mean sd cv * <int> <int> <dbl> <dbl> <dbl> 1 1 6 -0.215 0.644 -300. 2 2 6 -0.172 1.18 -683. 3 3 6 0.394 1.01 255. 4 4 6 -0.0416 1.07 -2561.
「注意:」 - 使用group_by分组后,用summarise_at函数
4 单性状~ 多因子4.1 aggregate方法:「代码」 func <- function(x)(c(n = length(x),mean=mean(x,na.rm = T),sd=sd(x,na.rm = T),cv=sd(x,na.rm = T)/mean(x,na.rm = T)*100)) aggregate(y1 ~ F4 + F3, data=dat,func) %>% round(4)
「结果」 > aggregate(y1 ~ F4 + F3, data=dat,func) %>% round(4) F4 F3 y1.n y1.mean y1.sd y1.cv 1 1 1 2.0000 -0.0799 0.6797 -851.2041 2 2 1 2.0000 -0.3318 0.1610 -48.5202 3 3 1 2.0000 0.5811 0.1700 29.2559 4 4 1 2.0000 0.9287 1.2137 130.6845 5 1 2 2.0000 0.3136 0.2606 83.1119 6 2 2 2.0000 -0.0597 0.2410 -403.4060 7 3 2 2.0000 0.0990 1.5911 1606.4949 8 4 2 2.0000 -0.8689 0.5602 -64.4726 9 1 3 2.0000 -0.8773 0.2694 -30.7050 10 2 3 2.0000 -0.1258 2.6033 -2069.8231 11 3 3 2.0000 0.5014 1.4952 298.1875 12 4 3 2.0000 -0.1845 0.7698 -417.1649
4.2 tidyverse方法:「代码」 dat %>% group_by(F4,F3) %>% summarise_at("y1",list(n = ~length(.), mean =~ mean(.), sd = ~sd(.), cv = ~sd(.)/mean(.)*100),na.rm=T)
「结果」 > dat %>% group_by(F4,F3) %>% summarise_at("y1",list(n = ~length(.), mean =~ mean(.), sd = ~sd(.), cv = ~sd(.)/mean(.)*100),na.rm=T) # A tibble: 12 x 6 # Groups: F4 [4] F4 F3 n mean sd cv <int> <int> <int> <dbl> <dbl> <dbl> 1 1 1 2 -0.0799 0.680 -851. 2 1 2 2 0.314 0.261 83.1 3 1 3 2 -0.877 0.269 -30.7 4 2 1 2 -0.332 0.161 -48.5 5 2 2 2 -0.0597 0.241 -403. 6 2 3 2 -0.126 2.60 -2070. 7 3 1 2 0.581 0.170 29.3 8 3 2 2 0.0990 1.59 1606. 9 3 3 2 0.501 1.50 298. 10 4 1 2 0.929 1.21 131. 11 4 2 2 -0.869 0.560 -64.5 12 4 3 2 -0.185 0.770 -417.
5 多性状~ 多统计量5.1 aggregate方法:「代码」 aggregate(cbind(y1,y2) ~ F4 + F3, data=dat,func) %>% round(4)
「结果」 > aggregate(cbind(y1,y2) ~ F4 + F3, data=dat,func) %>% round(4) F4 F3 y1.n y1.mean y1.sd y1.cv y2.n y2.mean y2.sd y2.cv 1 1 1 2.0000 -0.0799 0.6797 -851.2041 2.0000 -0.0356 0.8336 -2344.2900 2 2 1 2.0000 -0.3318 0.1610 -48.5202 2.0000 -0.1225 1.4151 -1155.2944 3 3 1 2.0000 0.5811 0.1700 29.2559 2.0000 -0.4195 1.1963 -285.2021 4 4 1 2.0000 0.9287 1.2137 130.6845 2.0000 -0.1135 0.3775 -332.4424 5 1 2 2.0000 0.3136 0.2606 83.1119 2.0000 -0.9164 0.3136 -34.2148 6 2 2 2.0000 -0.0597 0.2410 -403.4060 2.0000 -0.8743 1.1489 -131.4069 7 3 2 2.0000 0.0990 1.5911 1606.4949 2.0000 0.2093 0.8658 413.5830 8 4 2 2.0000 -0.8689 0.5602 -64.4726 2.0000 0.9369 1.7423 185.9592 9 1 3 2.0000 -0.8773 0.2694 -30.7050 2.0000 1.0515 0.2212 21.0366 10 2 3 2.0000 -0.1258 2.6033 -2069.8231 2.0000 0.5229 1.0336 197.6485 11 3 3 2.0000 0.5014 1.4952 298.1875 2.0000 0.2659 0.8088 304.1429 12 4 3 2.0000 -0.1845 0.7698 -417.1649 2.0000 0.1110 0.8169 736.0116
5.2 tidyverse方法:「代码」 dat %>% group_by(F4,F3) %>% summarise_at(c("y1","y2"),list(n = ~length(.), mean =~ mean(.), sd = ~sd(.), cv = ~sd(.)/mean(.)*100),na.rm=T)
「结果」 > dat %>% group_by(F4,F3) %>% summarise_at(c("y1","y2"),list(n = ~length(.), mean =~ mean(.), sd = ~sd(.), cv = ~sd(.)/mean(.)*100),na.rm=T) # A tibble: 12 x 10 # Groups: F4 [4] F4 F3 y1_n y2_n y1_mean y2_mean y1_sd y2_sd y1_cv y2_cv <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 1 1 2 2 -0.0799 -0.0356 0.680 0.834 -851. -2344. 2 1 2 2 2 0.314 -0.916 0.261 0.314 83.1 -34.2 3 1 3 2 2 -0.877 1.05 0.269 0.221 -30.7 21.0 4 2 1 2 2 -0.332 -0.122 0.161 1.42 -48.5 -1155. 5 2 2 2 2 -0.0597 -0.874 0.241 1.15 -403. -131. 6 2 3 2 2 -0.126 0.523 2.60 1.03 -2070. 198. 7 3 1 2 2 0.581 -0.419 0.170 1.20 29.3 -285. 8 3 2 2 2 0.0990 0.209 1.59 0.866 1606. 414. 9 3 3 2 2 0.501 0.266 1.50 0.809 298. 304. 10 4 1 2 2 0.929 -0.114 1.21 0.377 131. -332. 11 4 2 2 2 -0.869 0.937 0.560 1.74 -64.5 186. 12 4 3 2 2 -0.185 0.111 0.770 0.817 -417. 736.
6. 实战:一年多点数据「数据预览:」 > library(learnasreml) > data(MET) > str(MET) 'data.frame':400 obs. of 5 variables: $ Year : Factor w/ 2 levels "2009","2010": 1 1 1 1 1 1 1 1 1 1 ... $ Location: Factor w/ 5 levels "CI","FL","KN",..: 3 3 3 3 3 3 3 3 3 3 ... $ Rep : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ... $ Cul : Factor w/ 10 levels "CalhounGray",..: 3 1 9 2 5 4 7 10 6 8 ... $ Yield : num 56.2 74.2 32.6 74.2 64.8 ...
「分析要求:」 - 2,计算每年每个地点产量的平均值,标准差,变异系数
- 3,计算每个品种每个地点产量的平均值,标准差,变异系数
6.1 每个地点的表现> MET %>% group_by(Location) %>% summarise_at("Yield",list(avg = ~mean(.,na.rm=T), std = ~sd(.,na.rm=T), cv = ~std/avg*100)) # A tibble: 5 x 4 Location avg std cv * <fct> <dbl> <dbl> <dbl> 1 CI 61.8 19.5 31.5 2 FL 101. 27.7 27.3 3 KN 63.8 17.5 27.4 4 SC 82.7 27.2 32.9 5 TX 27.5 22.4 81.5
6.2 每个地点每年的表现> MET %>% group_by(Location,Year) %>% summarise_at("Yield",list(avg = ~mean(.,na.rm=T), std = ~sd(.,na.rm=T), cv = ~std/avg*100)) # A tibble: 10 x 5 # Groups: Location [5] Location Year avg std cv <fct> <fct> <dbl> <dbl> <dbl> 1 CI 2009 54.0 20.7 38.3 2 CI 2010 69.4 14.8 21.4 3 FL 2009 99.0 23.5 23.8 4 FL 2010 104. 31.5 30.4 5 KN 2009 66.0 20.5 31.0 6 KN 2010 61.6 13.7 22.3 7 SC 2009 82.9 26.6 32.1 8 SC 2010 82.5 28.2 34.1 9 TX 2009 40.3 25.2 62.5 10 TX 2010 15.1 8.10 53.8
6.3 每个地点每个品种的表现> MET %>% group_by(Location,Cul) %>% summarise_at("Yield",list(avg = ~mean(.,na.rm=T), std = ~sd(.,na.rm=T), cv = ~std/avg*100)) # A tibble: 50 x 5 # Groups: Location [5] Location Cul avg std cv <fct> <fct> <dbl> <dbl> <dbl> 1 CI CalhounGray 73.0 23.3 31.9 2 CI CrimsonSweet 56.5 25.3 44.8 3 CI EarlyCanada 53.1 11.7 22.0 4 CI FiestaF1 75.3 16.4 21.8 5 CI GeorgiaRattlesnake 62.8 12.3 19.6 6 CI Legacy 59.4 17.6 29.6 7 CI Mickylee 54.2 15.9 29.4 8 CI Quetzali 51.8 16.1 31.1 9 CI StarbriteF1 81.2 17.4 21.5 10 CI SugarBaby 50.1 14.4 28.8 # ... with 40 more rows
7. 实战:多个性状的汇总统计「数据:」 > data(fm) > str(fm) 'data.frame':827 obs. of 13 variables: $ TreeID : Factor w/ 827 levels "80001","80002",..: 1 2 3 4 5 6 7 8 9 10 ... $ Spacing: Factor w/ 2 levels "2","3": 2 2 2 2 2 2 2 2 2 2 ... $ Rep : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ... $ Fam : Factor w/ 55 levels "70001","70002",..: 44 44 44 15 15 2 2 10 10 10 ... $ Plot : Factor w/ 4 levels "1","2","3","4": 1 2 4 1 4 2 4 1 2 3 ... $ dj : num 0.334 0.348 0.354 0.335 0.322 0.359 0.368 0.358 0.323 0.298 ... $ dm : num 0.405 0.393 0.429 0.408 0.372 0.45 0.509 0.381 0.393 0.361 ... $ wd : num 0.358 0.365 0.379 0.363 0.332 0.392 0.388 0.369 0.347 0.324 ... $ h1 : int 29 24 19 46 33 30 37 32 34 28 ... $ h2 : int 130 107 82 168 135 132 124 126 153 127 ... $ h3 : int 239 242 180 301 271 258 238 290 251 243 ... $ h4 : int 420 410 300 510 470 390 380 460 430 410 ... $ h5 : int 630 600 500 700 670 570 530 660 600 630 ...
可以看到,性状有dj, dm, wd, h1 h2 h3 h4 h5 可以用summarise_if 批量处理: > fm %>% summarise_if(is.numeric,list(avg = ~mean(.,na.rm=T), std = ~sd(.,na.rm=T))) dj_avg dm_avg wd_avg h1_avg h2_avg h3_avg h4_avg h5_avg dj_std dm_std wd_std h1_std h2_std h3_std h4_std h5_std 1 0.3572712 0.4550073 0.3814172 37.04474 138.5054 254.6977 431.1893 619.7573 0.02440212 0.04504642 0.02743006 8.425672 29.54003 45.16661 81.61072 94.87955
可以用pivot_longer将格式变得友好: > fm %>% summarise_if(is.numeric,list(avg = ~mean(.,na.rm=T), std = ~sd(.,na.rm=T))) %>% + pivot_longer(everything(),names_sep = "_",names_to = c("trait",".value")) # A tibble: 8 x 3 trait avg std <chr> <dbl> <dbl> 1 dj 0.357 0.0244 2 dm 0.455 0.0450 3 wd 0.381 0.0274 4 h1 37.0 8.43 5 h2 139. 29.5 6 h3 255. 45.2 7 h4 431. 81.6 8 h5 620. 94.9
也可以将数据先变为长数据,然后用group和summarise进行汇总统计: > fm %>% pivot_longer(6:13) %>% group_by(name) %>% summarise_if(is.numeric,list(avg = ~mean(.,na.rm=T), std = ~sd(.,na.rm=T),cv =~ std/avg)) # A tibble: 8 x 4 name avg std cv * <chr> <dbl> <dbl> <dbl> 1 dj 0.357 0.0244 0.0683 2 dm 0.455 0.0450 0.0990 3 h1 37.0 8.43 0.227 4 h2 139. 29.5 0.213 5 h3 255. 45.2 0.177 6 h4 431. 81.6 0.189 7 h5 620. 94.9 0.153 8 wd 0.381 0.0274 0.0719
tidyverse包,值的学习,逻辑清楚,体会简洁之美!
|