GDCRNATools - An R package for downloading, organizing, and integrative analyzing lncRNA, mRNA, and miRNA data in GDC
Introduction
The Genomic Data Commons (GDC) maintains standardized genomic, clinical, and biospecimen data from National Cancer Institute (NCI) programs, including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research To Generate Effective Treatments (TARGET). It also accepts high-quality datasets from non-NCI-supported cancer research programs, such as genomic data from Foundation Medicine.
Many analyses can be performed using GDCRNATools. This user-friendly package allows researchers to perform the analysis by simply running a few functions and to easily integrate their own pipelines, such as molecular subtype classification, weighted correlation network analysis (WGCNA) (Langfelder and Horvath 2008), and TF-miRNA co-regulatory network analysis, into the workflow.
Installation
On Windows systems
On Linux and Mac systems
Run the following command in R:
    install.packages('GDCRNATools_0.99.0.tar.gz', repos = NULL, type='source')
If GDCRNATools cannot be installed because dependencies are missing, install them first, either simultaneously or separately:
    source("https:///biocLite.R")

    ### install packages simultaneously ###
    biocLite(c('limma', 'edgeR', 'DESeq2', 'clusterProfiler', 'DOSE', 'org.Hs.eg.db', 'biomaRt', 'BiocParallel'))
    install.packages(c('shiny', 'jsonlite', 'rjson', 'survival', 'survminer', 'ggplot2', 'gplots', 'Hmisc'))

    ### install packages separately ###
    biocLite('limma')
    biocLite('edgeR')
    biocLite('DESeq2')
    biocLite('clusterProfiler')
    biocLite('DOSE')
    biocLite('org.Hs.eg.db')
    biocLite('biomaRt')
    biocLite('BiocParallel')
    install.packages('shiny')
    install.packages('jsonlite')
    install.packages('rjson')
    install.packages('survival')
    install.packages('survminer')
    install.packages('ggplot2')
    install.packages('gplots')
    install.packages('Hmisc')
Manual
A simple manual of GDCRNATools follows.
1 Data download
Users can also download data from GDC using the API method developed in TCGAbiolinks (Colaprico et al. 2016) or using TCGA-Assembler (Zhu, Qiu, and Ji 2014).
1.1 Manual download
1.1.1 Installation of GDC Data Transfer Tool gdc-client
Download the GDC Data Transfer Tool from the GDC website and unzip the file.
1.1.2 Download manifest file and metadata from GDC Data Portal
1.1.3 Download data
1.2 Automatic download
1.2.1 Download RNAseq/miRNAs data
1.2.2 Download clinical data
2 Data organization
2.1 Parse metadata
Metadata can be parsed either by providing the metadata file downloaded in the data download step or by specifying the project.id and data.type.
2.1.1 Parse metadata by providing the metadata file
2.1.2 Parse metadata by specifying project.id and data.type
2.2 Filter samples
2.2.1 Filter duplicated samples
If a sample has been sequenced more than once, only one copy of it is kept.
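The idea can be illustrated in base R. This is a minimal sketch and not the package's own implementation: the barcodes are made up, the sample is identified by the first 15 characters of the TCGA barcode, and keeping the first aliquot encountered is an assumed tie-break rule.

```r
# Hypothetical aliquot barcodes; the first and third come from the same sample.
barcodes <- c("TCGA-A1-A0SB-01A-11R-A144-07",
              "TCGA-A1-A0SD-01A-11R-A115-07",
              "TCGA-A1-A0SB-01A-12R-A144-07")

# Characters 1-15 hold the participant ID plus the two-digit sample-type code.
sample_id <- substr(barcodes, 1, 15)

# Keep only the first aliquot seen for each sample (assumed rule for this sketch).
kept <- barcodes[!duplicated(sample_id)]
print(kept)
```

A different rule, such as keeping the most recently processed aliquot, would only change the ordering applied before `duplicated()`.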
2.2.2 Filter non-Primary Tumor and non-Solid Tissue Normal samples
Samples that are neither Primary Tumor (code: 01) nor Solid Tissue Normal (code: 11) are filtered out.
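As a rough sketch of how the sample-type code can be used, the base R snippet below reads characters 14 and 15 of each TCGA barcode and keeps only codes 01 and 11. The barcodes are hypothetical, and this is an illustration of the filter rather than the package's code.

```r
# Hypothetical barcodes with three different sample-type codes.
barcodes <- c("TCGA-A1-A0SB-01A-11R-A144-07",  # 01: Primary Tumor
              "TCGA-A1-A0SD-11A-11R-A115-07",  # 11: Solid Tissue Normal
              "TCGA-A1-A0SE-06A-11R-A084-07")  # 06: Metastatic, to be dropped

# The two-digit sample-type code sits at positions 14-15 of the barcode.
type_code <- substr(barcodes, 14, 15)

# Keep Primary Tumor and Solid Tissue Normal samples only.
kept <- barcodes[type_code %in% c("01", "11")]
print(kept)
```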
2.3 Merge data
2.3.1 Merge RNAseq/miRNAs data
2.3.2 Merge clinical data
2.4 TMM normalization and voom transformation
It has been shown repeatedly that normalization is critical for accurate estimation and detection of differential expression (DE) because it removes systematic technical effects in the data (Robinson and Oshlack 2010). TMM normalization is a simple and effective method for estimating relative RNA production levels from RNA-seq data. Voom is faster and more convenient than existing RNA-seq methods, and converts RNA-seq data into a form that can be analyzed with tools similar to those used for microarrays (Law et al. 2014).
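To make the transformation more concrete, the sketch below computes log2 counts per million in base R for a toy count matrix, using the offsets described in the voom paper (0.5 added to each count, 1 added to each library size). It is a deliberate simplification: TMM scaling factors and voom's precision weights are omitted, and the count matrix is made up. In practice the full procedures are provided by edgeR and limma.

```r
# Toy raw count matrix (genes in rows, samples in columns); values are made up.
counts <- matrix(c(10, 0, 200,
                   20, 5, 400),
                 nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("sample1", "sample2")))

# Library size = total counts per sample.
lib_size <- colSums(counts)

# log2-CPM with small offsets so that zero counts stay finite.
# t() is used so that each sample's counts are divided by its own library size.
log_cpm <- log2(t((t(counts) + 0.5) / (lib_size + 1)) * 1e6)
print(round(log_cpm, 2))
```

With TMM, the library sizes would first be multiplied by per-sample normalization factors before the same log-CPM step.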
3 Differential gene expression analysis
3.1 DE analysis
3.2 Report DE genes/miRNAs
All DEGs, DE long non-coding genes, DE protein-coding genes, and DE miRNAs can be reported separately.
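A minimal sketch of this kind of report, in base R, is shown below. Everything in it is an assumption made for illustration: the toy table, the column names `symbol`, `group`, `logFC`, and `FDR`, and the cutoffs (absolute log2 fold change above 1, FDR below 0.01). It does not reproduce the package's own reporting function.

```r
# Toy DE result table; all values and column names are hypothetical.
de <- data.frame(
  symbol = c("GENE1", "LNC1", "GENE2", "MIR1", "LNC2"),
  group  = c("protein_coding", "long_non_coding", "protein_coding",
             "miRNA", "long_non_coding"),
  logFC  = c(2.1, -1.8, 0.3, 1.5, -0.2),
  FDR    = c(0.001, 0.004, 0.200, 0.008, 0.600),
  stringsAsFactors = FALSE
)

# Keep significant entries under the assumed cutoffs.
sig <- de[abs(de$logFC) > 1 & de$FDR < 0.01, ]

# Report the significant symbols separately for each biotype.
by_type <- split(sig$symbol, sig$group)
print(by_type)
```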
3.3 DEG visualization
Volcano plots and bar plots are used to visualize DE analysis results in different ways.
3.3.1 Volcano plot
3.3.2 Barplot
3.3.3 Heatmap
Heatmap is generated based on the
4 Competing endogenous RNAs network analysis
4.1 Hypergeometric test
A hypergeometric test is performed to test whether a lncRNA and an mRNA share a significant number of miRNAs. A newly developed algorithm, spongeScan (Furió-Tarí et al. 2016), is used to predict miRNA response elements (MREs) in lncRNAs acting as ceRNAs. Databases such as starBase v2.0 (J.-H. Li et al. 2014), miRcode (Jeggari, Marks, and Larsson 2012), and miRTarBase release 7.0 (Chou et al. 2017) are used to collect predicted and experimentally validated miRNA-mRNA and/or miRNA-lncRNA interactions. Gene IDs in these databases are updated to the latest Ensembl 90 annotation of the human genome, and miRNA names are updated to the new release miRBase 21 identifiers. Users can also provide their own datasets of miRNA-lncRNA and miRNA-mRNA interactions.
The p-value is the probability of sharing at least $m$ miRNAs by chance:
$$p = 1 - \sum_{k=0}^{m-1} \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$
where $m$ is the number of shared miRNAs, $N$ is the total number of miRNAs in the database, $n$ is the number of miRNAs targeting the lncRNA, and $K$ is the number of miRNAs targeting the protein-coding gene.
4.2 Pearson correlation analysis
The Pearson correlation coefficient measures the strength of a linear association between two variables. miRNAs are negative regulators of gene expression: the more shared miRNAs are occupied by a lncRNA, the fewer are available to bind the target mRNA, which increases the expression level of the mRNA. The expression of the lncRNA and the mRNA in a ceRNA pair should therefore be positively correlated.
4.3 Regulation pattern analysis
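We define a measure, the regulation similarity score, to check the similarity between the miRNA-lncRNA expression correlation and the miRNA-mRNA expression correlation:
$$\text{Regulation similarity score} = 1 - \frac{1}{M}\sum_{k=1}^{M}\left[\frac{|corr(m_k,l) - corr(m_k,g)|}{|corr(m_k,l)| + |corr(m_k,g)|}\right]^{M}$$
where $M$ is the total number of shared miRNAs, $m_k$ is the $k$th shared miRNA, and $corr(m_k,l)$ and $corr(m_k,g)$ represent the Pearson correlation between the $k$th miRNA and the lncRNA and between the $k$th miRNA and the mRNA, respectively.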
Sensitivity correlation is defined by Paci et al. (2014) to measure whether the correlation between a lncRNA and an mRNA is mediated by a miRNA in the lncRNA-miRNA-mRNA triplet. We take the average over all triplets formed by a lncRNA-mRNA pair and their shared miRNAs as the sensitivity correlation between the selected lncRNA and mRNA:
$$\text{Sensitivity correlation} = corr(l,g) - \frac{1}{M}\sum_{k=1}^{M}\frac{corr(l,g) - corr(m_k,l)\,corr(m_k,g)}{\sqrt{1 - corr(m_k,l)^2}\,\sqrt{1 - corr(m_k,g)^2}}$$
where $M$ is the total number of shared miRNAs, $m_k$ is the $k$th shared miRNA, and $corr(l,g)$, $corr(m_k,l)$, and $corr(m_k,g)$ represent the Pearson correlation between the long non-coding RNA and the protein-coding gene, between the $k$th miRNA and the lncRNA, and between the $k$th miRNA and the mRNA, respectively.
The hypergeometric test of shared miRNAs, expression correlation analysis of the lncRNA-mRNA pair, and regulation pattern analysis of shared miRNAs are all implemented in the package.
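The four measures can be sketched for a single lncRNA-mRNA pair in base R. This is an illustration under stated assumptions, not the package's implementation: the expression data are simulated, the miRNA counts `N`, `K`, `n`, and `m` are made up, the hypergeometric p-value is taken as the probability of at least `m` shared miRNAs, the regulation similarity score is assumed to be one minus the mean of the normalized correlation differences raised to the power `M`, and the sensitivity correlation is computed as the Pearson correlation minus the mean partial correlation given each shared miRNA.

```r
set.seed(1)
n_samples <- 50

# Simulated expression: three shared miRNAs repress both the lncRNA and the mRNA.
mirs <- matrix(rnorm(n_samples * 3), ncol = 3,
               dimnames = list(NULL, c("miR-a", "miR-b", "miR-c")))
lnc  <- as.vector(-mirs %*% c(0.6, 0.5, 0.4)) + rnorm(n_samples, sd = 0.5)
mrna <- as.vector(-mirs %*% c(0.5, 0.6, 0.3)) + rnorm(n_samples, sd = 0.5)

# 1. Hypergeometric test on shared miRNAs (all counts are hypothetical).
N <- 500  # miRNAs in the database
K <- 20   # miRNAs targeting the protein-coding gene
n <- 15   # miRNAs targeting the lncRNA
m <- 3    # miRNAs shared by the pair
hyper_p <- phyper(m - 1, K, N - K, n, lower.tail = FALSE)  # P(X >= m)

# 2. Pearson correlation between lncRNA and mRNA expression.
cor_lg <- cor(lnc, mrna)

# 3. Regulation similarity score (assumed form, see the lead-in).
cor_ml <- as.vector(cor(mirs, lnc))   # corr(m_k, l) for each shared miRNA
cor_mg <- as.vector(cor(mirs, mrna))  # corr(m_k, g) for each shared miRNA
M <- ncol(mirs)
reg_sim <- 1 - mean((abs(cor_ml - cor_mg) / (abs(cor_ml) + abs(cor_mg)))^M)

# 4. Sensitivity correlation: correlation minus mean partial correlation.
partial <- (cor_lg - cor_ml * cor_mg) /
  (sqrt(1 - cor_ml^2) * sqrt(1 - cor_mg^2))
sens_cor <- cor_lg - mean(partial)

print(c(hyper_p = hyper_p, cor = cor_lg,
        reg_sim = reg_sim, sens_cor = sens_cor))
```

A pair would be a plausible ceRNA candidate when the hypergeometric p-value is small and the expression correlation is positive; the cutoffs to apply are left to the user.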
4.4 ceRNAs visualization
4.4.1 Correlation plot
4.4.2 Correlation plot on a local webpage by shinyCorplot
Typing and running
4.4.3 Network visualization in Cytoscape
lncRNA-miRNA-mRNA interactions can be reported for network visualization in Cytoscape.
5 Univariate survival analysis
Two methods are provided to perform univariate survival analysis, both based on the survival package: the Cox Proportional-Hazards (CoxPH) model and Kaplan-Meier (KM) analysis.
The CoxPH model treats the expression value as a continuous variable, while KM analysis divides patients into high-expression and low-expression groups by a user-defined threshold such as the median or mean.
5.1 CoxPH analysis
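A minimal sketch of a univariate CoxPH fit with the survival package is given below. The expression values, survival times, and censoring indicators are simulated for illustration, and the snippet calls `coxph()` directly rather than the package's own wrapper.

```r
library(survival)

set.seed(42)
n <- 100
expr   <- rnorm(n)                               # simulated expression of one gene
time   <- rexp(n, rate = 0.1 * exp(0.5 * expr))  # hazard rises with expression
status <- rbinom(n, 1, 0.7)                      # 1 = event observed, 0 = censored

# Expression enters the model as a continuous covariate.
fit <- coxph(Surv(time, status) ~ expr)

# Coefficient table: log hazard ratio, hazard ratio, standard error, z, p-value.
res <- summary(fit)$coefficients
print(res)
```

The `exp(coef)` column is the hazard ratio per unit increase in expression.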
5.2 KM analysis
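The KM approach can be sketched with the survival package as follows. The data are simulated, the median is used as the assumed threshold, and the functions called are `survfit()` and `survdiff()` from survival rather than the package's own wrapper.

```r
library(survival)

set.seed(42)
n <- 100
expr   <- rnorm(n)                               # simulated expression of one gene
time   <- rexp(n, rate = 0.1 * exp(0.5 * expr))  # hazard rises with expression
status <- rbinom(n, 1, 0.7)                      # 1 = event observed, 0 = censored

# Split patients into low- and high-expression groups at the median.
group <- factor(ifelse(expr > median(expr), "high", "low"),
                levels = c("low", "high"))

# Kaplan-Meier curves per group and a log-rank test for their difference.
km <- survfit(Surv(time, status) ~ group)
lr <- survdiff(Surv(time, status) ~ group)
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)

print(km)
print(p_value)
```

Replacing `median(expr)` with `mean(expr)` or another cutoff changes only the grouping step.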
5.3 KM analysis visualization
5.3.1 KM plot
KM survival curves are plotted.
5.3.2 KM plot on a local webpage by shinyKMPlot
6 Functional enrichment analysis
One of the main uses of the Gene Ontology (GO) is to perform enrichment analysis on gene sets. For example, given a set of genes that are up-regulated under certain conditions, an enrichment analysis finds which GO terms are over-represented (or under-represented) using the annotations for that gene set. Pathway enrichment can be applied in the same way.
6.1 GO, KEGG and DO analyses
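The over-representation test behind this kind of enrichment analysis can be sketched in base R as a one-sided hypergeometric test, which is equivalent to a one-sided Fisher's exact test. All counts below are made up for illustration, and the snippet does not call any enrichment package.

```r
# Hypothetical counts for a single GO term.
universe_size <- 20000  # annotated background genes
term_size     <- 400    # background genes annotated to the term
list_size     <- 200    # genes in the up-regulated set
overlap       <- 12     # genes in the set that are annotated to the term

# P(at least `overlap` annotated genes in a random set of `list_size` genes).
p_hyper <- phyper(overlap - 1, term_size, universe_size - term_size,
                  list_size, lower.tail = FALSE)

# Same test as a 2x2 table: rows = in term / not in term,
# columns = in gene set / not in gene set.
tab <- matrix(c(overlap, list_size - overlap,
                term_size - overlap,
                universe_size - term_size - (list_size - overlap)),
              nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

print(c(hypergeometric = p_hyper, fisher = p_fisher))
```

In a real analysis the test is repeated for every term and the p-values are adjusted for multiple testing, for example with `p.adjust()`.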
6.2 Enrichment visualization
The output generated by
6.2.1 GO barplot
6.2.2 GO bubble plot
6.2.3 KEGG/DO barplot
6.2.4 KEGG/DO bubble plot
6.2.5 Pathview
Users can visualize a pathway map with Pathview (Luo and Brouwer 2013).
6.2.6 View pathway maps on a local webpage by shinyPathview
sessionInfo
References
Chou, Chih-Hung, Sirjana Shrestha, Chi-Dung Yang, Nai-Wen Chang, Yu-Ling Lin, Kuang-Wen Liao, Wei-Chi Huang, et al. 2017. “MiRTarBase Update 2018: A Resource for Experimentally Validated MicroRNA-Target Interactions.” Nucleic Acids Research, November, gkx1067–gkx1067. doi:10.1093/nar/gkx1067.
Colaprico, Antonio, Tiago C. Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S. Sabedot, et al. 2016. “TCGAbiolinks: An R/Bioconductor Package for Integrative Analysis of TCGA Data.” Nucleic Acids Research 44 (8): e71. doi:10.1093/nar/gkv1507.
Furió-Tarí, Pedro, Sonia Tarazona, Toni Gabaldón, Anton J. Enright, and Ana Conesa. 2016. “SpongeScan: A Web for Detecting MicroRNA Binding Elements in LncRNA Sequences.” Nucleic Acids Research 44 (Web Server issue): W176–W180. doi:10.1093/nar/gkw443.
Jeggari, Ashwini, Debora S. Marks, and Erik Larsson. 2012. “MiRcode: A Map of Putative MicroRNA Target Sites in the Long Non-Coding Transcriptome.” Bioinformatics 28 (15): 2062–3. doi:10.1093/bioinformatics/bts344.
Langfelder, Peter, and Steve Horvath. 2008. “WGCNA: An R Package for Weighted Correlation Network Analysis.” BMC Bioinformatics 9 (December): 559. doi:10.1186/1471-2105-9-559.
Law, Charity W., Yunshun Chen, Wei Shi, and Gordon K. Smyth. 2014. “Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-Seq Read Counts.” Genome Biology 15 (February): R29. doi:10.1186/gb-2014-15-2-r29.
Li, Jun-Hao, Shun Liu, Hui Zhou, Liang-Hu Qu, and Jian-Hua Yang. 2014. “StarBase V2.0: Decoding MiRNA-CeRNA, MiRNA-NcRNA and Protein–RNA Interaction Networks from Large-Scale CLIP-Seq Data.” Nucleic Acids Research 42 (Database issue): D92–D97. doi:10.1093/nar/gkt1248.
Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15 (December): 550. doi:10.1186/s13059-014-0550-8.
Luo, Weijun, and Cory Brouwer. 2013. “Pathview: An R/Bioconductor Package for Pathway-Based Data Integration and Visualization.” Bioinformatics 29 (14): 1830–1. doi:10.1093/bioinformatics/btt285.
Paci, Paola, Teresa Colombo, and Lorenzo Farina. 2014. “Computational Analysis Identifies a Sponge Interaction Network Between Long Non-Coding RNAs and Messenger RNAs in Human Breast Cancer.” BMC Systems Biology 8 (July): 83. doi:10.1186/1752-0509-8-83.
Ritchie, Matthew E., Belinda Phipson, Di Wu, Yifang Hu, Charity W. Law, Wei Shi, and Gordon K. Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. doi:10.1093/nar/gkv007.
Robinson, Mark D., and Alicia Oshlack. 2010. “A Scaling Normalization Method for Differential Expression Analysis of RNA-Seq Data.” Genome Biology 11 (March): R25. doi:10.1186/gb-2010-11-3-r25.
Robinson, Mark D., Davis J. McCarthy, and Gordon K. Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. doi:10.1093/bioinformatics/btp616.
Yu, Guangchuang, Li-Gen Wang, Yanyan Han, and Qing-Yu He. 2012. “ClusterProfiler: An R Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. doi:10.1089/omi.2011.0118.
Yu, Guangchuang, Li-Gen Wang, Guang-Rong Yan, and Qing-Yu He. 2015. “DOSE: An R/Bioconductor Package for Disease Ontology Semantic and Enrichment Analysis.” Bioinformatics 31 (4): 608–9. doi:10.1093/bioinformatics/btu684.
Zhu, Yitan, Peng Qiu, and Yuan Ji. 2014. “TCGA-Assembler: An Open-Source Pipeline for TCGA Data Downloading, Assembling, and Processing.” Nature Methods 11 (6): 599–600. doi:10.1038/nmeth.2956.