Introduction to R and Bioconductor
This post was created for students taking CISC 875 (Bioinformatics) and has two goals: (1) introduce the R
programming language, and (2) demonstrate how to use some of the important Bioconductor
packages for the analysis of gene expression datasets. R
has become the dominant programming language for statistical computing in many scientific fields due to its open-source nature and extensive ecosystem of packages developed by other users. While most packages are stored on CRAN, packages specifically designed for biology, bioinformatics, and related areas are stored on Bioconductor. In other words, the collection of biologically-related R
packages are said to be Bioconductor packages.
To install packages from the default CRAN repository, one can use the install.packages()
function. To query more details about any function in R
, you can type ?somefunction
in your command line such as ?install.packages
. In the code block labelled “STEP 0” below I set up a list of Bioconductor and CRAN packages to install on your system if they haven’t been already. The function biocLite
is the Bioconductor equivilant of install.packages
- and it also prompts the version updates of any old packages. I load three Bioconductor packages in the code below: GEOquery, genefilter, and hgu133plus2.db. All of these packages contain dependencies (other packages they need in order to work), and these should be automatically installed at the same time. If a package ever has an error when your try to load it with the library
function, check the error output as it may involve some missing packages which need to be installed.
When reading the code throughout this tutorial, if any function or command is unclear, I strongly recommend querying it with ?
. Additionally, you will notice the %>%
operator in this tutorial. This is a pipe operator, and it is part of the magrittr
packages. Piping allows code to be written in the same way that an English sentence is constructed: from left to right. A pipe passes the object on the left-hand side of %>%
into the first argument position of the function on the right-hand side of the piping operator. So iris %>% head(n=3)
is the same as head(iris,n=3)
. Even if you are new to R
, the following code should be fairly intuitive: 1:10 %>% mod(2) %>% equals(0) %>% which
, which reads like it looks: “generate a sequence of numbers from 1 to 10 THEN apply modulo 2 to each THEN test if each element is equal to zero THEN tell me which of these is true”. In other words, find all the even numbers. In contrast, to write the same statement without pipes: which(mod(1:10,2)==0)
, which is more compact, in this instance, but clearly less readable.
Quite a few functions in this tutorial also take advantage of tidyverse
packages such as tidyr
and ggplot2
. You will see functions like mutate
(create a new column), filter
(subset rows by a logical condition), and gather
(put data into long format). It is worth understanding the difference between wide and long formats of data, the latter being necessary for advanced plotting functions such as ggplot
. Below is an example of a wide data set, where variables with numeric expressions are given their own columns.
If instead we want to collapse this data set to matching key-value pairs, i.e. a long-format dataset, we can use the following gather
command (note that we don’t want to collapse the Species column as we want it to match a numeric expression).
Long format data also has the advantage that it is much easier to query key-pair values. For example, I can ask which species has the largest measurement for each measurement type (clearly virginica flowers for most):
On with the tutorial!
Part 1: Loading in microarray datasets from GEO
Most published academic research related to “microarray, next-generation sequencing, and other forms of high-throughput functional genomic datasets” are uploaded to the Gene Expression Omnibus (GEO). This means that you can download data from interesting research to do your own analysis or replicate results. To find a data set for this tutorial, I just searched HER2 as a keyword and found the following data set which contains:
Analysis of FFPE core biopsies from pre-treated patients with HER2+ breast cancer (BC) randomized to receive neoadjuvant chemotherapy only or chemotherapy+trastuzumab (herceptin). Results provide insight into the molecular basis of survival outcomes following trastuzumab-based chemotherapy.
The summary page tells us that it is “DataSet Record GDS5027” and it used the “Affymetrix Human Genome U133 Plus 2.0 Array” platform. The first step is to download the GDS5027 dataset using the getGEO
function from the GEOquery
library we loaded in during STEP 0. Another thing to note: if you see to colons such as GEOquery::getGEO
, I write this deliberately to point out which library this function comes from, with the general notation being somelibrary::somefunction(arguments)
. It will take a minute or so to download the data to your folder and then load it into R
.
You can then use the Meta
or Table
functions to look at some of the propeties of this GDS-class object we called gds5027
.
Next we will want to convert the GDS object into an ExpressionSet object, which is the object type that Bioconductor packages are built around when dealing with micorarray expression data.
This next step is extemely important as we are going to extract the phenotype data with the phenoData
function and the gene level expression data with the exprs
function, as well as dropping any affymetrix control probes.
Part 2: Getting annotation data
As was discussed in our second lecture, getting annotation data is essential for genomic analysis. To get additional information on the genes contained in this dataset, we will use the Affymetrix probe names (contained as the row names of the eData
object) and use the hgu133plus2.db
Bioconductor package, as we know this was our microarray platform. Several functions from Bioconductor packages allow for the translation of the Affymetrix/Illumina gene ID codes into genomic information such as chromosome location or the common gene symbol name.
In the code below we use: mapIds(hgu133plus2.db,keys=probe.ids,column='SYMBOL',keytype='PROBEID',multiVals='first')
, which does the following: “look up our supplied probe ids in the PROBEID column and return the associated gene symbol from the hgu133plus2 annotation database”. I also write some lines of code that will get the chromosome location information and drop duplicate probes. After doing this, it may be worth while to save the data so that you can load it from this point onwards and not have to repeat these steps.
Let’s look at the data we have created. We see that annot.df
contains the probe and HGNC human gene names along with the associated chromosome location, pData
gives up information related to each person (sample) such as whether they received the Trastuzumab drug treatment, and eData
which is a 19149 x 156 matrix with each row being the expression values for a given gene.
Part 3: Making cool charts!
First let us look at the remission rates contained in the phenotype data and see how they vary by the different groups. We will use the ggplot
function throughout the rest of this tutorial to make graphs. I would recommend glancing at the ggplot2 bible or other tutorials to get a feel of what ggplot can do. The plots are basically built with layers, so: ggplot(...) + geom_point(...) + ggtitle(...)
is saying “make a plot with this data THEN add some points THEN add a title”. The chart below suggests that patients which received the control treatment had a low remission rate, regardless of HER status, whereas HER+ patients that received Trastuzumab did much better.
Barplots
As Trastuzumab is designed to treat breast cancer which is HER2 receptor positive, it would worthwhile to see if we can figure out how the HER2 positive label was generated. The HGNC symbol for HER2 is ERBB2. We will query our annotation data, find which probes have a symbol name matching “ERBB”, and then plot the distribution of normalized expression values, coloured by whether the phenotype data says they are HER2+/HER2-. The plot below does indeed show that HER2+ patients have higher than average expression at the the ERBB2 probe site.
Density plots
Biologists may also be interested in whether HER2+/HER2- patients have other genes which are differentially expressed (in addition to ERBB2). To test this, we can perform a t-test on every gene comparing the mean between the two HER2 status groups. T-tests consider whether the differences in the mean of two groups is likely to arrise from chance alone. A p-value provides a quantitative answer to this question, and can be interpreted as the “the probability of seeing this realization of the data from chance alone, assuming the two groups have the same mean”. Thus, if we see a p-value of 1%, this is saying that were the two groups to have identical means, this realization of the data would be observed 1 out of 100 times and therefore the assumption of mean equality is quite unlikely. Usually a threshold of either 1% or 5% is imposed, and genes which have a lower p-value than this are assumed to be differentially expressed.
As we have more than 19K genes, we will get back more than 19K p-values. This introduces the problem of multiple hypothesis testing, and therefore requires us to modify how we interpret our p-values. Correction procedures, such as the Bonferroni method can be employed. With a 5% threshold, we will now require a p-value to be less than \(0.05/19149\) for us to reject the null. We often do a log-transformation of the new theshold value \(-\log_{10}(0.05/19149)\)=5.6 for easier visualization.
After getting the associated t-statistic p-value and mean difference using the genefilter::rowttests
function, we will plot the difference in means on the horizontal axis and the -log10 p-value on the vertical-axis. We see that 34 out of the 19K plus genes are differentially expressed between HER+/HER- patients, and we add text annotations on the plot to the genes with the highest expression differences (with ERBB2 unsurprisingly having the largest). These sorts of plots are known as Volcano plots in genomic analysis.
Volcano plots
To get an intuitive understanding of what the t-test is measuring, we will plot the four genes with the largest mean difference in expression below. We see that the mean expression of ERBB2, MIEN1, MUCL1, and PMAIP1 between HER2+/HER2- labelled patients are (likely) just too far apart to be due to chance alone. The t-test only uses the number of observations, mean, and standard error from both groups in order to generate a p-value.
Visualizing the t-test
Another way to visualize the HER2+/HER2- genotypes is to use multidimensional scaling techniques (MDS) to represent high-dimensional data on a 2-dimensional plane, where the distance between points on two-dimensions “approximates” the distance between points in higher dimensional space. We will plot whether the distances between the genomic expression of the 34 genes we identified naturally “cluster” in the HER2+/HER2- genotypes. To carry this out in R
, we will use the dist
function to calculate the Euclidian distance between points (where there are 156 points (number of patients), each of which is in \(\mathbb{R}^{34}\)), and then use the cmdscale
function to perform classical MDS. In my plot, I turn off the horizontal/vertical axis values, as all that matters in MDS is the relative distance between points, and not the scale of the axis. We see that were we to assign a given genomic expression profile to one of two groups using the k-means clustering algorithm, the 2-dimensional representation of data would correctly identify most of the HER2+/HER2- patients.
Multidimensional scaling plots
The last plot we will produce will be a Manhattan plot, where the horizontal-axis is organized by choromosome position and the vertical axis contains the associated p-values of gene-wise t-tests (just like the Volcano plot).