本文讲的是aglient芯片原始数据的处理,参考资料是limma 的userguide文档。GEO数据库下载的表达矩阵不符合预期,比如是空的,或者是有负值的,那我们就处理一下它的原始数据。aglient的芯片应用也很广泛,举个OSCC的栗子:GSE23558,跟着学习学习。 1.下载和读取数据1.1获取临床信息数据从前,提到GEO数据下载,我们只有GEOquery,神功盖世,但是死于网速。后来就有了中国人寄几的GEO镜像,AnnoProbe包。还没有正式发表,就已经初露锋芒了,因简单易学,下载迅速,在我们的粉丝圈子里很受欢迎。 rm(list = ls()) library(stringr) library(AnnoProbe) library(GEOquery) library(limma) gse = "GSE23558" geoChina(gse)
## you can also use getGEO from GEOquery, by ## getGEO("GSE23558", destdir=".", AnnotGPL = F, getGPL = F)
## $GSE23558_series_matrix.txt.gz ## ExpressionSet (storageMode: lockedEnvironment) ## assayData: 41000 features, 32 samples ## element names: exprs ## protocolData: none ## phenoData ## sampleNames: GSM577914 GSM577915 ... GSM577945 (32 total) ## varLabels: title geo_accession ... tissue:ch1 (42 total) ## varMetadata: labelDescription ## featureData: none ## experimentData: use 'experimentData(object)' ## pubMedIds: 22072328 ## 28433800 ## Annotation: GPL6480
提供一个GSE编号就可以下载啦。因为表达矩阵是处理过的,我们不要,所以只提取临床信息表格,从中获得分组信息。 load("GSE23558_eSet.Rdata") pd <- pData(gset[[1]])
调整pd的行名与文件读取的顺序一致,并定义分组信息。 raw_dir = "rawdata/GSE23558_RAW/" raw_datas = paste0(raw_dir,"/",dir(raw_dir))
#调整pd与rawdata的顺序一致 raw_order = str_extract(raw_datas,"GSM\\d*") pd = pd[match(raw_order,rownames(pd)),]
group_list <- ifelse(stringr::str_detect(pd$title,"HealthyControl"),"normal","tumor") group_list <- factor(group_list, levels=c("normal","tumor"))
1.2 读取原始数据x <- read.maimages(raw_datas, source="agilent", green.only=TRUE, other.columns = "gIsWellAboveBG")
## Read rawdata/GSE23558_RAW//GSM577914.txt ## Read rawdata/GSE23558_RAW//GSM577915.txt ## Read rawdata/GSE23558_RAW//GSM577916.txt ## Read rawdata/GSE23558_RAW//GSM577917.txt ## Read rawdata/GSE23558_RAW//GSM577918.txt ## Read rawdata/GSE23558_RAW//GSM577919.txt ## Read rawdata/GSE23558_RAW//GSM577920.txt ## Read rawdata/GSE23558_RAW//GSM577921.txt ## Read rawdata/GSE23558_RAW//GSM577922.txt ## Read rawdata/GSE23558_RAW//GSM577923.txt ## Read rawdata/GSE23558_RAW//GSM577924.txt ## Read rawdata/GSE23558_RAW//GSM577925.txt ## Read rawdata/GSE23558_RAW//GSM577926.txt ## Read rawdata/GSE23558_RAW//GSM577927.txt ## Read rawdata/GSE23558_RAW//GSM577928.txt ## Read rawdata/GSE23558_RAW//GSM577929.txt ## Read rawdata/GSE23558_RAW//GSM577930.txt ## Read rawdata/GSE23558_RAW//GSM577931.txt ## Read rawdata/GSE23558_RAW//GSM577932.txt ## Read rawdata/GSE23558_RAW//GSM577933.txt ## Read rawdata/GSE23558_RAW//GSM577934.txt ## Read rawdata/GSE23558_RAW//GSM577935.txt ## Read rawdata/GSE23558_RAW//GSM577936.txt ## Read rawdata/GSE23558_RAW//GSM577937.txt ## Read rawdata/GSE23558_RAW//GSM577938.txt ## Read rawdata/GSE23558_RAW//GSM577939.txt ## Read rawdata/GSE23558_RAW//GSM577940.txt ## Read rawdata/GSE23558_RAW//GSM577941.txt ## Read rawdata/GSE23558_RAW//GSM577942.txt ## Read rawdata/GSE23558_RAW//GSM577943.txt ## Read rawdata/GSE23558_RAW//GSM577944.txt ## Read rawdata/GSE23558_RAW//GSM577945.txt
dim(x)
## [1] 45015 32
2.背景校正和标准化y <- backgroundCorrect(x, method="normexp")
## Array 1 corrected ## Array 2 corrected ## Array 3 corrected ## Array 4 corrected ## Array 5 corrected ## Array 6 corrected ## Array 7 corrected ## Array 8 corrected ## Array 9 corrected ## Array 10 corrected ## Array 11 corrected ## Array 12 corrected ## Array 13 corrected ## Array 14 corrected ## Array 15 corrected ## Array 16 corrected ## Array 17 corrected ## Array 18 corrected ## Array 19 corrected ## Array 20 corrected ## Array 21 corrected ## Array 22 corrected ## Array 23 corrected ## Array 24 corrected ## Array 25 corrected ## Array 26 corrected ## Array 27 corrected ## Array 28 corrected ## Array 29 corrected ## Array 30 corrected ## Array 31 corrected ## Array 32 corrected
y <- normalizeBetweenArrays(y, method="quantile") class(y)
## [1] "EList" ## attr(,"package") ## [1] "limma"
3. 基因过滤Control <- y$genes$ControlType==1L;table(Control)
## Control ## FALSE TRUE ## 43529 1486
NoSymbol <- is.na(y$genes$GeneName);table(NoSymbol)
## NoSymbol ## FALSE ## 45015
IsExpr <- rowSums(y$other$gIsWellAboveBG > 0) >= 16;table(IsExpr)
## IsExpr ## FALSE TRUE ## 13088 31927
Isdup <- duplicated(y$genes$GeneName);table(Isdup)
## Isdup ## FALSE TRUE ## 30328 14687
yfilt <- y[!Control & !NoSymbol & IsExpr & !Isdup, ] dim(yfilt)
## [1] 20650 32
可以看到,过滤后剩下了2万多个探针。 4.得到表达矩阵exp = yfilt@.Data[[1]] boxplot(exp)
exp[1:2,1:2]
## rawdata/GSE23558_RAW//GSM577914 rawdata/GSE23558_RAW//GSM577915 ## [1,] 9.284154 11.473334 ## [2,] 7.341236 7.474406
得到的表达矩阵没问题,但行名和列名均有问题。行名应该是探针名,列名是样本名,调整一下。 4.1获得样本名colnames(exp) = str_extract(colnames(exp),"GSM\\d*") exp[1:2,1:2]
## GSM577914 GSM577915 ## [1,] 9.284154 11.473334 ## [2,] 7.341236 7.474406
4.2 获得基因名limma文档里写的是用了注释R包,在本例的原文件是里有探针注释的,这里直接使用。 可以直接将exp的行名转为基因名。行名不能重复,所以先去重 anno = yfilt$genes nrow(anno);nrow(exp)
## [1] 20650
## [1] 20650
rownames(exp)=rownames(anno) ids = unique(anno$GeneName) exp = exp[!duplicated(anno$GeneName),] rownames(exp) = ids exp[1:4,1:4]
## GSM577914 GSM577915 GSM577916 GSM577917 ## APOBEC3B 9.284154 11.473334 10.439071 11.661000 ## A_32_P77178 7.341236 7.474406 7.310818 7.397149 ## ATP11B 9.963452 8.915621 10.193873 9.321954 ## DNAJA1 13.469790 13.201078 12.827357 13.389431
至此,得到了标准的表达矩阵。后面要做什么就看你啦,这就相当于修复了一下数据库里那个被标准化过的表达矩阵。 5.差异分析design <- model.matrix(~group_list) fit <- lmFit(exp,design) fit <- eBayes(fit,trend=TRUE,robust=TRUE) summary(decideTests(fit))
## (Intercept) group_listtumor ## Down 0 2102 ## NotSig 0 16928 ## Up 20650 1620
deg = topTable(fit,coef=2,n=dim(y)[1]) boxplot(exp[rownames(deg)[1],]~group_list)
这里直接走limma的简易流程,可以画差异最显著的那个基因表达量看看,可以看到差异是超级明显了!
save(exp,group_list,deg,gse,file = paste0(gse,"deg.Rdata"))
后面的步骤就是我们GEO数据挖掘课程的标配啦,如果大家对这一系列“骚操作”感兴趣,欢迎报名我们的GEO数据挖掘课程哈,全年滚动开班,直播互动教学以及答疑,下一期是7月6号开课,可以花时间了解一下:
|