【原】TCGA的28篇教程- 指定癌症查看感兴趣基因的表达量

健明 2021-07-14

展开全文

m长期更新列表：
使用R语言的cgdsr包获取TCGA数据（cBioPortal）TCGA的28篇教程- 使用R语言的RTCGA包获取TCGA数据（离线打包版本）TCGA的28篇教程- 使用R语言的RTCGAToolbox包获取TCGA数据（FireBrowse portal）TCGA的28篇教程- 批量下载TCGA所有数据（ UCSC的 XENA）TCGA的28篇教程- 数据下载就到此为止吧

本教程目录:

- 文章来源

- 那么首先需要得到上图的基因列表

- 使用下载这些基因在GBM的芯片数据的表达量。

- 下载感兴趣基因的表达矩阵

- 画散点图

- 相关系数的总结

文章来源

这里我重现一篇CELL文章的某个分析要点，具体见我博客的解读：http://www./3072.html

在TCGA的GBM的芯片表达数据里面可以看到以下正相关关系：

NCR2 expression
upregulation of PDGF-DD-induced NK cell cytokine genes (芯片数据)
downregulation of tumor cell-cycle genes
greater survival

TCGA GBM mRNA gene expression data obtained using the Affymetrix HT Human Genome U133a microarray platform (n = 539 patients)was downloaded through the UCSC data portal (https://) and matched to the gene expression data (Figures 4C and 4D). 7 of 9 cytokine genes (Figure 4C) and 27 of 34 cell cycle genes (Figure 4D) were matched with the GBM cohort.

分别如下图

NCR2基因表达与细胞因子相关基因表达正相关

NCR2表达与细胞周期相关基因表达负相关

NCR2高表达的保护作用

那么首先需要得到上图的基因列表

在KEGG官网可以查看细胞周期基因列表：http://www./kegg-bin/show_pathway?hsa04110 , 而细胞因子是：http://www./kegg-bin/show_pathway?hsa04060

这里可以用R语言本身的包KEGG.db来获取.

library(KEGG.db)
ls("package:KEGG.db")

##  [1] "KEGG" "KEGGENZYMEID2GO"  "KEGGEXTID2PATHID"
##  [4] "KEGGGO2ENZYMEID"  "KEGGMAPCOUNTS" "KEGGPATHID2EXTID"
##  [7] "KEGGPATHID2NAME"  "KEGGPATHNAME2ID"  "KEGG_dbInfo"  
## [10] "KEGG_dbconn"   "KEGG_dbfile"   "KEGG_dbschema"

cellcycle_genes=KEGGPATHID2EXTID[['hsa04110']]
cytokine_genes=KEGGPATHID2EXTID[['hsa04060']]

使用下载这些基因在GBM的芯片数据的表达量。

library(cgdsr)
library(DT)
# Get list of cancer studies at server
## 获取有哪些数据集
# save(all_TCGA_studies,all_dataset,all_tables,cellcycle_expr,cytokine_expr,NCR2_expr,file = 'GBM_microarray_TCGA.Rdata')
load(file = 'GBM_microarray_TCGA.Rdata')
mycgds <- CGDS("http://www./public-portal/")
#all_TCGA_studies <- getCancerStudies(mycgds)
all_TCGA_studies[grepl('gbm',all_TCGA_studies$cancer_study_id),]

##  cancer_study_id name
## 55 gbm_tcga_pub2013  Glioblastoma (TCGA, Cell 2013)
## 56  gbm_tcga_pub   Glioblastoma (TCGA, Nature 2008)
## 57   gbm_tcga Glioblastoma Multiforme (TCGA, Provisional)
## 97  lgggbm_tcga_pub Merged Cohort of LGG and GBM (TCGA, Cell 2016)
##  description
## 55 <a href="http://cancergenome./">The Cancer Genome Atlas (TCGA)</a> Glioblastoma project.<br> <a href="http://tcga-data.nci./tcga/">Raw data via the TCGA Data Portal</a>.
## 56 <a href="http://cancergenome./">The Cancer Genome Atlas (TCGA)</a> Glioblastoma project. 206 primary glioblastoma samples.<br> <a href="https://tcga-data.nci./docs/publications/gbm_2008/">Raw data via the TCGA Data Portal</a>.
## 57  TCGA Glioblastoma Multiforme; raw data at the <A HREF="https://tcga-data.nci./">NCI</A>.
## 97  Whole-exome sequencing from TCGA LGG and GBM cases

带有GBM的项目有4个。

下载感兴趣基因的表达矩阵

很明显那篇文章作者提到了Affymetrix HT Human Genome U133a microarray platform (n = 539 patients)，所以用的是gbm_tcga那个数据集，用下面的代码下载：

## 获取在 "gbm_tcga" 数据集中有哪些表格（每个表格都是一个样本列表）
#all_tables <- getCaseLists(mycgds, "gbm_tcga")
DT::datatable(all_tables[,1:4],
  extensions = 'FixedColumns',
  options = list(
 #dom = 't',
 scrollX = TRUE,
 fixedColumns = TRUE
  ))

## 而后获取可以下载哪几种数据，一般是mutation，CNV和表达量数据
#all_dataset <- getGeneticProfiles(mycgds, "gbm_tcga")
DT::datatable(all_dataset,
  extensions = 'FixedColumns',
  options = list(
 #dom = 't',
 scrollX = TRUE,
 fixedColumns = TRUE
  ))

all_dataset[4,]

##   genetic_profile_id genetic_profile_name
## 4 gbm_tcga_mrna_U133 mRNA expression (U133 microarray only)
##   genetic_profile_description
## 4 mRNA expression data from the Affymetrix U133 microarray.
##   cancer_study_id genetic_alteration_type show_profile_in_analysis_tab
## 4 740   MRNA_EXPRESSION   false

all_tables[9,1:4]

##   case_list_id  case_list_name
## 9 gbm_tcga_mrna_U133 Tumor Samples with mRNA data (U133 microarray only)
##   case_list_description cancer_study_id
## 9 All samples with mRNA expression data (528 samples) 740

my_dataset <- 'gbm_tcga_mrna_U133'
my_table <- "gbm_tcga_mrna_U133" 
#cellcycle_expr <- getProfileData(mycgds, cellcycle_genes, my_dataset, my_table)
dim(cellcycle_expr)

## [1] 528 124

#cytokine_expr <- getProfileData(mycgds, cytokine_genes, my_dataset, my_table)
dim(cytokine_expr)

## [1] 528 265

#NCR2_expr <- getProfileData(mycgds, 'NCR2', my_dataset, my_table)

画散点图

cor_plot <- function(x,y){
  #x=NCR2_expr$NCR2
  #y=cellcycle_expr$CCNB1
  plot(x,y, xlab = 'NCR2', ylab = 'gene')
  model = lm(y ~ x)
  summary(model) 
  int =  model$coefficient["(Intercept)"]
  slope =model$coefficient["x"]
  abline(int, slope,
   lty=1, lwd=2, col="red")   
  r= format(cor(x,y),digits = 4)
  p= format(cor.test(x,y)$p.value,digits = 4)
  title(main = paste0('p value=',p), 
  sub  = paste0('correlation=',r))
}

cor_plot(x=NCR2_expr$NCR2,y=cellcycle_expr$CCNB1)

cor_plot(x=NCR2_expr$NCR2,y=cellcycle_expr$BUB1B)

cor_plot(x=NCR2_expr$NCR2,y=cytokine_expr$CCL1)

cor_plot(x=NCR2_expr$NCR2,y=cytokine_expr$CCL4)

这些图跟作者的差不多了，但是很明显作者的展现方式并不是最好的。因为这样的散点图太多了。其实只需要p值以及相关性系数即可。