分享

Volcano plots of microarray data | The Bioinformat...

 树袋熊zkqgpfpq 2019-05-25

To run the R commands in this post, you should first work through Analysing microarry data in BioConductor.

A volcano plot is a scatter plot that is often used when analysing micro-array data sets to give an overview of interesting genes. The log fold change is plotted on the x-axis and the negative log10 p-value is plotted on the y-axis.

Using the GSE20986 data set, the x-axis shows the log fold change between the HUVEC control cell line and one of the three treatment endothelial cell lines. After running the preliminarily analysis, we construct a table containing the fold change and p-values for all changes:

1
2
3
##Complete list of genes with p-values and fold change
##Coef=1, so we are just looking at huvec_choroid
gene_list <- topTable(huvec_ebFit, coef=1, number=1000000, sort.by='logFC')

The relevant columns can be obtained via the “$” operator, viz:

1
2
3
##The head command prints the first few values of a vector
head(gene_list$logFC)
head(gene_list$P.Value)

The volcano plot can then be generated using the standard plot command:

1
2
3
4
5
##The par command sets 'nice' graphical defaults
par(mar=c(3,3,2,1), mgp=c(2,.7,0), tck=-.01)
plot(gene_list$logFC, -log10(gene_list$P.Value),
     xlim=c(-10, 10), ylim=c(0, 15), #Set limits
     xlab='log2 fold change', ylab='-log10 p-value')#Set axis labels

which gives:

We can use R to determine how many genes are above certain cut-offs. For example, the number of genes that have an absolute fold change greater than 2 and a p-value less than 0.001 can be found using the command:

1
2
##Gives 717
sum(abs(gene_list$logFC) > 2 & gene_list$P.Value < 0.001)

A more principled way of choosing a p-value cut-off is to use a multiple testing rule. Common rules are Bonferroni and FDR. When we carry out a large number of statistical tests, the Bonferroni cut-off is approximately 0.05/#tests. So

1
2
3
4
##no_of_genes=27,306
no_of_genes = dim(probeset.list)[1]
##Gives 203 genes
sum(abs(gene_list$logFC) > 2 & gene_list$P.Value < 0.05/no_of_genes)

If we wanted to use the false discovery rate as a cut-off (FDR), then we would use the adjusted p-value:

1
2
##Gives 933
sum(abs(gene_list$logFC) > 2 & gene_list$adj.P.Val < 0.05)

In many microarray studies, the FDR cut-off is used. However, when we have a large number of statistically significant genes, as in this example, a more conservative rule can be useful.

Using ggplot2 for volcano plots

The problem with the above volcano plot is that since we are plotting around 27,000 points – one point per gene – then the points overlap and the colour becomes dense. An alternative graphical framework to the base graphics is the ggplot2 package. This package can be installed directly from CRAN, in the usual manner:

1
install.packages('ggplot2')

Then to construct the volcano plot, we use the following commands:

1
2
3
4
5
6
7
8
9
10
11
require(ggplot2)
##Highlight genes that have an absolute fold change > 2 and a p-value < Bonferroni cut-off
gene_list$threshold = as.factor(abs(gene_list$logFC) > 2 & gene_list$P.Value < 0.05/no_of_genes)
##Construct the plot object
g = ggplot(data=gene_list, aes(x=logFC, y=-log10(P.Value), colour=threshold)) +
  geom_point(alpha=0.4, size=1.75) +
  opts(legend.position = 'none') +
  xlim(c(-10, 10)) + ylim(c(0, 15)) +
  xlab('log2 fold change') + ylab('-log10 p-value')
g

which yields

By using the “alpha” and “size” options in ggplot2, we can control the transparency and size of the points.

To add text to the plot, we use geom_text. For example,

1
2
3
##Graph not shown
g + geom_text(aes(x=gene1$logFC, y=-log10(gene1$P.Value),
                     label=gene1$ID, size=1.2), colour='black')

You can also pass vectors of text labels.

    本站是提供个人知识管理的网络存储空间,所有内容均由用户发布,不代表本站观点。请注意甄别内容中的联系方式、诱导购买等信息,谨防诈骗。如发现有害或侵权内容,请点击一键举报。
    转藏 分享 献花(0

    0条评论

    发表

    请遵守用户 评论公约

    类似文章 更多