Loading Gene Expression Data
{dementia, R-stats, ETL}
I’ve recently been working with the openly available Aging, Dementia, and TBI Study data from the Allen Institute for Brain Science as part of a Master’s thesis project. The goal of the project is to construct models of dementia status using the gene expression, pathological, and medical history data provided. In the spirit of “open science”, I’m going to try to do an “open thesis”, which means I’ll be blogging along as I work on the project in posts tagged dementia. There’s also a GitHub repository for this project.
As part of exploring this data, one of the tasks I have set for myself is a differential gene expression (DGE) analysis to see which genes have different patterns of expression under different conditions (dementia versus no dementia, for example). The scientists at the Allen Institute have already done an analysis like this. (Be sure to check out their awesome interactive tool).
There are several R packages available (edgeR, DESeq2, others) for performing differential analysis on RNA-seq data, like we have here. RNA-seq is a means of quantifying the amount of mRNA transcripts produced from the genes in a biological sample. More on that later.
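To give a feel for where this post's output is headed: packages like edgeR expect a matrix of raw counts with one row per gene and one column per sample, plus a factor giving each sample's condition. Here's a minimal sketch with toy made-up counts (not the study data), assuming the edgeR Bioconductor package is installed:

```r
library(edgeR)  # Bioconductor package; install via BiocManager::install('edgeR')

# Toy 5-gene x 4-sample count matrix standing in for the real data
counts <- matrix(c(10, 0, 25, 3, 8,
                   12, 1, 30, 2, 9,
                   50, 0, 5, 40, 7,
                   45, 2, 4, 38, 6), nrow = 5)
group <- factor(c('dementia', 'dementia', 'control', 'control'))

dge <- DGEList(counts = counts, group = group)  # edgeR's container for counts
dge <- calcNormFactors(dge)                     # TMM library-size normalization
```

The count matrix we build below is exactly what would go into `counts` here, once the real dementia-status labels are attached.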
For now, in this post, I’ll discuss:
- A brief data overview; and,
- Some R code that loads the gene expression data from URLs.
There’s a Python version of this load script in the project repo, too.
A Little About the Data
I’ll discuss the data and its origins in greater detail in a future post. For now, I’d just like to touch on some major points:
- The data consist of gene expression levels and neuropathological measurements from 377 brain tissue samples, as well as demographic and medical history information, from 107 participants in the Adult Changes in Thought (ACT) study. This massive longitudinal study aims to track the brain function of a large cohort of individuals as they age.
- The initial sample was n=110 individuals: 55 with a history of at least one TBI and 55 age- and sex-matched controls with no history of TBI. Three subjects were excluded because their data failed to meet quality control standards (see the TBI Overview page on the study website for more details).
- All of the data is freely available from the Allen Institute for Brain Science (big thanks to the Allen Institute for this resource!).
Loading Gene Expression Measurements
The first thing I have to do is get the data. What I need is a dataframe where each column contains the raw expression counts from the RNA-seq experiment for one of the 377 brain tissue samples and each row is a gene. I’ll be using the fread() function from the data.table package to access files on the study data download site via their URLs.
library(data.table)
The first file I want to put into a dataframe is a .csv that contains, for each of the 377 samples, the links to its data files.
data_files <- data.frame(fread('http://aging.brain-map.org/data/tbi_data_files.csv'))
names(data_files)
The URLs for the gene expression data are in the ‘gene_level_fpkm_file_link’ column. I’ll use the ‘rnaseq_profile_id’ to label the columns/samples.
# Series of links to gene expression files
data_links <- data_files['gene_level_fpkm_file_link']
# Grab the list of sample IDs
sample_IDs <- data_files['rnaseq_profile_id']
Behind each of the links in data_links is a file that contains three different measures of gene expression for each of 50,283 genes:
- ‘expected_count’ - the raw estimated count of the number of transcripts;
- ‘TPM’ - transcripts per million; counts are normalized by gene length first, then scaled so each sample sums to one million; and,
- ‘FPKM’ - fragments per kilobase of transcript per million mapped reads; normalized by sequencing depth first, then by gene length.
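The difference between the last two comes down to the order of normalization. A toy illustration with made-up counts and gene lengths (not values from the study files):

```r
# Hypothetical counts and gene lengths for three genes in one sample
counts     <- c(100, 200, 300)   # raw reads per gene
lengths_kb <- c(1, 2, 4)         # gene lengths in kilobases

# FPKM: scale by total reads (per million) first, then by gene length
fpkm <- counts / (sum(counts) / 1e6) / lengths_kb

# TPM: scale by gene length first, then so the sample sums to one million
rpk <- counts / lengths_kb
tpm <- rpk / sum(rpk) * 1e6
```

A handy consequence: TPM values always sum to one million within a sample, which makes them easier to compare across samples; FPKM values carry no such guarantee.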
Here’s a peek at the file for the first sample:
sample000_url <- paste('http://aging.brain-map.org', toString(data_links[1,1]),sep='')
start <- Sys.time() # timing how long it takes to get the file
sample000 <- data.frame(fread(sample000_url))
stop <- Sys.time()
names(sample000)
The ‘gene_id’ column provides the Entrez Gene database number for the gene. Here are the dimensions of each individual table.
dim(sample000)
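As an aside, those Entrez IDs can later be mapped to familiar gene symbols. This isn't part of the loading pipeline; it's purely illustrative, and assumes the Bioconductor packages AnnotationDbi and org.Hs.eg.db are installed:

```r
# Hypothetical lookup of gene symbols for the first few Entrez IDs
library(org.Hs.eg.db)  # Bioconductor annotation package for human genes

symbols <- AnnotationDbi::mapIds(org.Hs.eg.db,
                                 keys = as.character(sample000$gene_id[1:5]),
                                 column = "SYMBOL", keytype = "ENTREZID")
```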
Let’s see how long it took to get the file.
stop-start
Not surprisingly, it takes a while to get a file that big over the interwebz (at least with the connection I have here). Regardless, to get all the data, we’ll loop through all 377 URLs in data_links and add the sample columns to dataframes. Even though I only need the ‘expected_count’ column from each file for the differential analysis, I’ll make ‘TPM’ and ‘FPKM’ dataframes, too.
# Initialize dataframes
TPM <- data.frame(matrix(nrow = 50283))
FPKM <- data.frame(matrix(nrow = 50283))
raw_read_counts <- data.frame(matrix(nrow = 50283))
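The matrix(nrow = 50283) trick gives each dataframe a single placeholder column of NAs, which pins the row count so sample columns can be assigned by name inside the loop; the placeholder gets dropped at the end. A toy version of the pattern:

```r
# Placeholder dataframe with 3 rows and one all-NA column
placeholder <- data.frame(matrix(nrow = 3))
placeholder['values'] <- c(1, 2, 3)             # columns can now be added by name
placeholder <- placeholder[, -1, drop = FALSE]  # drop the NA placeholder column
```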
Now, we populate the dataframes with gene expression data and save them to csv.
# start a timer for DF construction
start <- Sys.time()
# loops through sample files and adds data for each sample to a dataframe with the
# 'rnaseq_profile_id' as header
for (sample in 1:nrow(data_links)){
  # I limited the print updates in this notebook.
  # Take out the 'if' to print an update with each of the 377 samples
  if (sample >= 370){
    print(paste('Sample #', sample, 'being added.'))
    flush.console() # Forces print in loop
  }
  url <- paste('http://aging.brain-map.org', toString(data_links[sample, 1]), sep = '')
  sample_data <- data.frame(fread(url))
  # Add 'gene_id' from the first sample to each dataframe (it's the same for each file)
  if (sample == 1){
    TPM['gene_id'] <- sample_data['gene_id']
    FPKM['gene_id'] <- sample_data['gene_id']
    raw_read_counts['gene_id'] <- sample_data['gene_id']
  }
  TPM[toString(sample_IDs[sample, 1])] <- sample_data['TPM']
  FPKM[toString(sample_IDs[sample, 1])] <- sample_data['FPKM']
  raw_read_counts[toString(sample_IDs[sample, 1])] <- sample_data['expected_count']
}
# stop timer, calculate running time
stop <- Sys.time()
duration <- difftime(stop, start, units = 'mins') # fix units so 'minutes' below is accurate
# drop extra column
TPM <- TPM[,-1]
FPKM <- FPKM[,-1]
raw_read_counts <- raw_read_counts[,-1]
# Write to csv for later
write.csv(raw_read_counts,'raw_read_counts.csv')
write.csv(TPM,'TPM.csv')
write.csv(FPKM,'FPKM.csv')
print(paste('All samples loaded & saved! Loop duration:', round(duration, 2), 'minutes'))
Here’s a look at the raw_read_counts dataframe.
head(raw_read_counts)
Looks good!
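One gotcha for whenever these CSVs get re-loaded: write.csv() writes row names as an unnamed first column by default, so a reader needs to drop that column (or pass row.names = FALSE when writing). A toy round trip, using a temp file rather than the real outputs:

```r
# Hypothetical round trip illustrating the extra row-name column
df <- data.frame(gene_id = c(101, 102), s1 = c(10, 20))
tmp <- tempfile(fileext = '.csv')
write.csv(df, tmp)                        # writes row names as an extra column
reloaded <- read.csv(tmp, row.names = 1)  # re-absorb them as row names
identical(dim(reloaded), dim(df))         # dimensions match again
```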