Gene Ontology (GO) term enrichment analysis is a technique for interpreting a set of affected genes generated from high-throughput technologies such as RNA-seq and ChIP-seq. For example, if you have a set of genes that are up-regulated under a certain condition, a GO term enrichment analysis can find which GO terms are over-represented in that gene set. You can perform GO term enrichment analysis using DAVID or other tools.
In the result page of DAVID, you can see a table showing the over-represented GO terms and their p-values. How these p-values are calculated? In this post, I am going to show you how to calculate them in R.
Here I will randomly sample 500 protein-coding genes from the Drosophila genome and check whether a GO term GO:0003723 (RNA binding) is statistically significantly over-represented in the set. First, we are going to access the data in Ensembl using the
biomaRt package in R and obtain the list of the protein-coding genes in Drosophila.
library(biomaRt) library(dplyr) ensembl <- useEnsembl(biomart = "genes", dataset = "dmelanogaster_gene_ensembl") gene_list = getBM(attributes = c("ensembl_gene_id", "gene_biotype"), mart = ensembl) gene_list_p <- gene_list %>% filter(gene_biotype == "protein_coding")
There are 13,947 genes in the
gene_list_p table. Out of them, 437 genes are associated with the GO ID
> nrow(gene_list_p)  13947 > gene_list_rb <- df %>% filter(go_id == "GO:0003723") > nrow(gene_list_rb)  437
Next, we will randomly pick up 500 genes and count how many of them are associated with the GO ID
> set.seed(123) > your_gene_list = sample(gene_list_p$ensembl_gene_id, size = 500) > gene_list_rb %>% filter(ensembl_gene_id %in% your_gene_list) %>% nrow()  14
In your list of 500 genes, 14 genes are associated with the GO ID
GO:0003723 and 500 – 14 (= 486) genes are not. Now we want to know whether this GO term is over-represented in this group compared to the background in which 437 – 14 (= 423) genes are associated with the GO term and 13947 – 500 – 423 (= 13024) genes are not. Let’s create a contingency table like this:
> tb <- matrix(c(14, 486, 423, 13024), nrow = 2, ncol = 2) > colnames(tb) <- c("Your list", "Background") > rownames(tb) <- c("RNA binding", "Not") > tb Your list Background RNA binding 14 423 Not 486 13024
You can calculate the p-value using Fisher’s Exact Test.
> fisher.test(tb)$p.value  0.7934214
As the p-value is much greater than 0.05, the GO term
GO:0003723 is not statistically significantly enriched in your list. This is not surprising since we create the list by random sampling from the genes in the genome.
Fisher’s Exact Test is usually performed on each GO term so you have to use multiple testing correction method, such as the Benjamini and Hockberg method.
 Gene Ontology overview http://geneontology.org/docs/ontology-documentation/
 GO enrichment analysis http://geneontology.org/docs/go-enrichment-analysis/
 GeneSCF: a real-time based functional enrichment tool with support for multiple organisms https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1250-z
 GO::TermFinder—open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3037731/ for multiple hypothesis correction
 DAVID Help page https://david.ncifcrf.gov/helps/functional_annotation.html#E3 Check the “Gene-Enrichment and Functional Annotation Analysis” section.