psych igraph：共表达网络构建

生物_医药_科研 2019-03-25

展开全文

WGCNA分析，简单全面的最新教程详细介绍了WGCNA构建共表达网络的原理和过程，但有时由于样本太少，效果不理想，找不到合适的Power值。实验室师弟师妹经常在讨论样本较少的时候如何筛选共表达基因（转录因子的突变体/转基因材料和野生型材料在2-3个时间点进行转录组测序），加上3个生物学重复勉强构成18个样本。这样的实验设计，已经足够算出各个时间点或者处理间的差异基因和任意基因之间的相关系数。本文介绍少样本如何构建共表达无向网络，筛选共表达基因的流程。
授权转载自SimonCat，有增删。

0 实验设计

材料	时间点	重复	数据
野生型	胁迫处理1，6，12h	3	RNA-seq SE 单端测序
突变体	胁迫处理1，6，12h	3	RNA-seq SE 单端测序

思路：（1）得到胁迫处理诱导的基因和突变体野生型间的差异基因作为候选基因集（2）两两计算候选基因的皮尔逊相关系数（3）使用igrah计算各个基因的中心度。

从RNA-seq开始构建共表达网络。

1 使用软件FastQC进行测序质量评估 (NGS基础 - FASTQ格式解释和质量评估)

$ ls *.fastq.gz | xargs -n1 -P2 fastqc $1

备注： -n1 表示一个参数传递给fastqc，-P表示线程数

2 使用trimmomatic质量清理

script.trimmo.sh，对测序数据进行质量控制

$ echo 'nohup java -jar trimmomatic-0.36.jar SE $1 $1.trim.fil.gz LEADING:20 TRAILING:20 AVGQUAL:25 SLIDINGWINDOW:10:30 MINLEN:36 > $1.trim.nohup' > script.trimmo.sh

$ ls *.fastq.gz | xargs -n1 -P2 sh script.trimmo.sh

3 使用hisat2比对 (转录组分析工具哪家强？)

使用比对软件hisat2建立索引和比对

$ hisat2-build TAIR10_Chr.all.fasta TAIR10_Chr.all
$ echo ’nohup $WD/tools/hisat2-2.0.5/hisat2 -x TAIR10_Chr.all -U $1 -S $1.sam > $1.align.stat.txt’ > script.align.sh
$ ls *.trim.fil.gz | xargs -n1 -P2 sh script.align.sh

4使用htseq-count，获得每个基因的counts数目

$ echo 'htseq-count -s no $1 TAIR10.gtf > $1.count ' > script_count.sh
$ ls *sam|xargs -n1 sh script_count.sh
# 注意用grep -v 去除掉htseq-count生成文件中的无用信息
$ csvtk join -t treat_1h_rep1.count treat_1h_rep2.count treat_1h_rep3.count control_1h_rep1.count  control_1h_rep2.count  control_1h_rep3.count > 0h_count.txt
$ csvtk join -t treat_6h_rep1.count treat_6h_rep2.count treat_6h_rep3.count control_6h_rep1.count  control_6h_rep2.count  control_6h_rep3.count > 6h_count.txt
$ csvtk join -t treat_12h_rep1.count treat_12h_rep2.count treat_12h_rep3.count control_12h_rep1.count  control_12h_rep2.count  control_12h_rep3.count > 12h_count.txt
$ csvtk join -t 0h_count.txt 6h_count.txt 12h_count.txt > total_count.txt

备注：PE数据必须先根据序列的名字或者位置进行排序，本次项目是单端，所以随意。

5 使用DESeq2 计算差异基因（DESeq2差异基因分析和批次效应移除）

将0、6、12 h的count的table依次导入,分别计算这3个时间点的差异基因。

过滤掉低表达的基因（所有样本的和 >1），log2FC >1 with adjusted p-values<0.01 筛选差异基因

>library(DESeq2)

>table=read.table('0_H_count.txt',header=F,sep='\t',row.names =1)

> colname(table)=c('0h_c_rep1','0h_c_rep2','0h_c_rep3','0h_t_rep1','0h_t_rep2','0h_t_rep3')

>countData <- countData[rowSums(countData) > 1, ] # 去除掉低表达的基因

>condition <- c('c','c','c','t','t','t')

>coldata <- data.frame(row.names=colnames(countData), condition)

>dds <- DESeqDataSetFromMatrix(countData = countData, colData = group, design = ~condition )

>results <- results(DESeq(dds))

>res <- na.exclude(as.data.frame(results))

> filter <- res[(abs(res$log2FoldChange)>1 & res$padj<0.01),]

> write.table(filter,'DGE_0h.txt', quote=F,sep='\t', col. names = NA)

6 使用EBSeq 对count数目进行标准化

（生信宝典注：标准化后的数据可以也直接从DESeq2获取）

获得每个基因的counts数目之后，就可以计算两两基因的皮尔逊相关系数。使用EBSeq的函数median normalization 对count数目进行标准化，再进行对数化。

> if (!requireNamespace('BiocManager', quietly = TRUE))

install.packages('BiocManager')

BiocManager::install('EBSeq', version = '3.8')

> library(EBSeq)

> NormData <- GetNormalizedMat(counts, MedianNorm(counts))

> NormData.log <- log2(NormData + 1)

> Norm.interest <- NormData.log[rownames(filter),] # filter是步骤5获得的基因集

7 使用psych 计算两两相关性

psych在CRAN官网进行下载。使用corr.test (也可以使用WGCNA的bicor函数，如果用Python更快)函数对差异基因进行两两的相关性计算。如果用所有基因计算数据量将大得可怕。因此我们只挑选差异基因进行分析。如我们对这个2 x 3的实验组进行两两差异计算，得到一个差异基因的总表，挑选这些基因进行计算。以相关性 > 0.9 p小于0.01筛选出共表达基因

>library('psych')

>Norm.interest.corr <- corr.test( t(Norm.interest), method='pearson', ci=F)

> Norm.interest.corr$p[lower.tri( Norm.interest.corr$p,diag=TRUE)]=NA

>Pval.adj <- as.data.frame(as.table(Norm.interest.corr$p))

> Correlation <- as.data.frame(as.table(Norm.interest.corr $r))

> Cor.table <- na.exclude(cbind( Correlation, Pval.adj))[,c (1,2,3,6)]

> colnames(Cor.table) <- c('gene1','gene2','cor','p.adj')

> Cor.table.filt <- Cor.table [(abs(Cor.table[,3])>0.9 & Cor.table[,4] <0.01 ),]

8 使用igraph和Cytoscape可视化网络

使用igrah对前面筛出的互作基因Cor.table.filt进行网络分析，degree 是指节点 (这里指基因)的连接度，即一个点有多少条边相连，degree centrality是某个节点的度除以网络中所有点能构成的连接数目，能反应一个基因的中心度。介度中心性 (betweenness centrality) 是指一个节点充当其它两个节点中间人的次数“被经过”的感觉，用“被经过次数”除以总的ties，即n（n-1）/2，计算方法见下图（来源于易生信陈亮博士的扩增子和宏基因组授课，本期推文同期有陈亮博士的“一文学会网络分析 igraph更详细假设”）。输出的文件Node_nw_st.txt和之前获得的Cor.table.filt两个表格共同用于Cytoscape可视化了。

# By Chen Liang
node_pro <- function(igraph){  
  # 节点度
  igraph.degree <- igraph::degree(igraph)
  # 节点度中心性
  igraph.cen.degree <- round(centralization.degree(igraph)$res,3)
  # 节点介数中心性
  igraph.betweenness <- round(centralization.betweenness(igraph)$res,3)
  # 节点中心性
  igraph.closeness <- round(centralization.closeness(igraph)$res,3)
  #igraph.node.pro <- cbind(igraph.degree,igraph.closeness,igraph.betweenness,igraph.cen.degree)
  #colnames(igraph.node.pro)<-c('igraph.degree','igraph.closeness','igraph.betweenness','igraph.cen.degree')
  igraph.node.pro <- data.frame(Node=names(igraph.degree), igraph.degree,igraph.closeness,igraph.betweenness,igraph.cen.degree)
}
net_node_attr <- node_pro(net)[,1:4]
write.table(net_node_attr, file='Node_nw_st.txt', row.names=F, col.names=T,quote=F,sep='\t')

具体可视化见：

psych igraph：共表达网络构建

0 实验设计

思路：（1）得到胁迫处理诱导的基因和突变体野生型间的差异基因作为候选基因集（2）两两计算候选基因的皮尔逊相关系数（3）使用igrah计算各个基因的中心度。

从RNA-seq开始构建共表达网络。

1 使用软件FastQC进行测序质量评估 (NGS基础 - FASTQ格式解释和质量评估)

2 使用trimmomatic质量清理

3 使用hisat2比对 (转录组分析工具哪家强？)

4使用htseq-count，获得每个基因的counts数目

5 使用DESeq2 计算差异基因（DESeq2差异基因分析和批次效应移除）

6 使用EBSeq 对count数目进行标准化

7 使用psych 计算两两相关性

8 使用igraph和Cytoscape可视化网络

R统计和作图

易生信系列培训课程，扫码获取免费资料

更多阅读

psych igraph：共表达网络构建

0 实验设计

思路：（1）得到胁迫处理诱导的基因和突变体野生型间的差异基因作为候选基因集（2）两两计算候选基因的皮尔逊相关系数（3）使用igrah计算各个基因的中心度。

从RNA-seq开始构建共表达网络。

1 使用软件FastQC进行测序质量评估 (NGS基础 - FASTQ格式解释和质量评估)

2 使用trimmomatic质量清理

3 使用hisat2比对 (转录组分析工具哪家强？)

4使用htseq-count，获得每个基因的counts数目

5 使用DESeq2 计算差异基因 （DESeq2差异基因分析和批次效应移除）

6 使用EBSeq 对count数目进行标准化

7 使用psych 计算两两相关性

8 使用igraph和Cytoscape可视化网络

R统计和作图

易生信系列培训课程，扫码获取免费资料

更多阅读

5 使用DESeq2 计算差异基因（DESeq2差异基因分析和批次效应移除）