How to calculate a p-value in Gene Ontology (GO) term enrichment analysis


Gene Ontology (GO) term enrichment analysis is a technique for interpreting a set of affected genes generated from high-throughput technologies such as RNA-seq and ChIP-seq. For example, if you have a set of genes that are up-regulated under a certain condition, a GO term enrichment analysis can find which GO terms are over-represented in that gene set. You can perform GO term enrichment analysis using DAVID or other tools.

Part of a result page of DAVID 6.8

In the result page of DAVID, you can see a table showing the over-represented GO terms and their p-values. How these p-values are calculated? In this post, I am going to show you how to calculate them in R.

Here I will randomly sample 500 protein-coding genes from the Drosophila genome and check whether a GO term GO:0003723 (RNA binding) is statistically significantly over-represented in the set. First, we are going to access the data in Ensembl using the biomaRt package in R and obtain the list of the protein-coding genes in Drosophila.


ensembl <- useEnsembl(biomart = "genes", dataset = "dmelanogaster_gene_ensembl")
gene_list = getBM(attributes = c("ensembl_gene_id", "gene_biotype"), mart = ensembl)
gene_list_p <- gene_list %>% 
  filter(gene_biotype == "protein_coding")

There are 13,947 genes in the gene_list_p table. Out of them, 437 genes are associated with the GO ID GO:0003723.

> nrow(gene_list_p)
[1] 13947

> gene_list_rb <- df %>% filter(go_id == "GO:0003723")
> nrow(gene_list_rb)
[1] 437

Next, we will randomly pick up 500 genes and count how many of them are associated with the GO ID GO:0003723.

> set.seed(123)
> your_gene_list = sample(gene_list_p$ensembl_gene_id, size = 500)
> gene_list_rb %>% filter(ensembl_gene_id %in% your_gene_list) %>% nrow()
[1] 14

In your list of 500 genes, 14 genes are associated with the GO ID GO:0003723 and 500 – 14 (= 486) genes are not. Now we want to know whether this GO term is over-represented in this group compared to the background in which 437 – 14 (= 423) genes are associated with the GO term and 13947 – 500 – 423 (= 13024) genes are not. Let’s create a contingency table like this:

> tb <- matrix(c(14, 486, 423, 13024), nrow = 2, ncol = 2)
> colnames(tb) <- c("Your list", "Background")
> rownames(tb) <- c("RNA binding", "Not")
> tb
            Your list Background
RNA binding        14        423
Not               486      13024

You can calculate the p-value using Fisher’s Exact Test.

> fisher.test(tb)$p.value
[1] 0.7934214

As the p-value is much greater than 0.05, the GO term GO:0003723 is not statistically significantly enriched in your list. This is not surprising since we create the list by random sampling from the genes in the genome.

Fisher’s Exact Test is usually performed on each GO term so you have to use multiple testing correction method, such as the Benjamini and Hockberg method.

[1] Gene Ontology overview
[2] GO enrichment analysis
[3] GeneSCF: a real-time based functional enrichment tool with support for multiple organisms
[4] GO::TermFinder—open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes for multiple hypothesis correction
[5] DAVID Help page Check the “Gene-Enrichment and Functional Annotation Analysis” section.


Copied title and URL