Update, Changes, and Adopting Agile

Welcome back! It’s been 2,151 days since my last post. A lot has happened. Lucky for you, I won’t talk about the vast majority of it here (who wants to relive the pandemic, amirite?), but I figure I’ll catch you up a little on where I’m at professionally, what it’s like working at a company that’s undergoing a lot of growth and change (Moderna), and a direct consequence of that growth for my team: moving to agile data science. Let’s get started!

Read More

Data Mining with Rattle for R

Data mining tools make it easy to get a quick overview of the data we’re working with, which can save us loads of time, especially if we’ve got many predictors to investigate. One of the best things about these tools is that a lot of them are open-source and available for free (you can find a list of some of those here).

Read More

Cracking Open Clusters with Clouds

“Cloud” visualizations, or “tag clouds”, are great ways of summarizing a bunch of data when a frequency table or list would be too unwieldy or confusing. They’re most commonly encountered in the context of text mining, but we could really use them with anything where there’s a large number of categories or bins. In the most common version of these visualizations, the size of each category or bin name (or word, in the case of word clouds) indicates its frequency. Here’s an example using the text of a favorite poem of mine, The Love Song of J. Alfred Prufrock by T.S. Eliot (made using a free, online word cloud generator available here).

Read More

Using Clustering Results to Define New Features for Modeling

Hi, everyone! In the last post to this blog, we talked about how to use the clValid package for R to try multiple different clustering algorithms with a range of different values for the number of clusters, \(k\), in a single function call. In this post, we go one step further and use the results of a clustering experiment to create new features to put into models.

Read More

Differential Analysis for Dimension Reduction & Exploration of Gene Expression Profiles

We’re back! In this post, we’ll take a look at some of the results from a screen for genes related to dementia using the gene expression data from the Allen Institute for Brain Science’s Aging, Dementia, & TBI Study. This is part of an exploratory analysis I conducted prior to constructing models of donor dementia status using this and other data collected from postmortem brain tissue samples. The idea is to mine this data for information about factors contributing to dementia risk.

Read More

Hypothesis Testing for Differential Gene Expression

Now that we’ve done a little exploration of the gene expression data from the Allen Institute’s Aging, Dementia, & TBI study, we’re ready to get down to the business of identifying genes or groups of genes that could be good predictors in a model of dementia status. To do this, we’ll use hypothesis testing to determine if the expression levels for a given gene differ between Dementia and No Dementia brain tissue samples. And since we’ll be creating a family of these tests - one for each of the 20,000+ genes we have - we have to talk about correcting for multiple comparisons as well.
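
To give a feel for why the correction matters, here’s a minimal Python sketch of one common approach, the Benjamini-Hochberg procedure. This is just an illustration: the choice of Benjamini-Hochberg is an assumption (not necessarily the method used in the post), the function name `benjamini_hochberg` is hypothetical, and the p-values are made up rather than taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values never decrease as the raw p-values increase
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values standing in for five per-gene tests (NOT study results)
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)

# Flag tests whose adjusted p-value stays at or below 0.05
significant = [q <= 0.05 for q in adjusted]
print(adjusted)
print(significant)
```

In this toy example, four of the five raw p-values fall below 0.05, but only the two smallest survive the adjustment. With 20,000+ tests instead of five, the gap between the uncorrected and corrected results can be much larger.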

Read More

Interactive 3D Plots with R and Plotly

In the last post I made, I showed some screenshots of 3-dimensional multidimensional scaling plots I made using the `plotly` library in R. I was kind of bummed that I didn't show interactive plots you could play with, so I went back and figured out how to use `htmlwidgets` to output them as HTML.

To get the plots working on this blog (which is hosted on GitHub and powered by Jekyll), I followed some of the advice in this blog post by Ryan Kuhn. A script that makes both plots and writes the HTML can be found in the repo for this project on GitHub (EDA folder).

Both interactive graphs below show multidimensional scaling (MDS) plots of the 377 gene expression profiles in the Aging, Dementia, and TBI study from the Allen Institute for Brain Science. The first plot is shaded by the sex of the donor. The second is colored by the region of the brain the sample came from.

Have fun!


Read More

Visualizing Gene Expression Profiles with Multidimensional Scaling

Hi everyone! This is the second installment in my quest to blog my way through a Master’s thesis project (Predictive Analytics, Northwestern). In the inaugural post, I loaded and saved the gene expression data for the 377 brain samples in the Allen Institute for Brain Science’s Aging, Dementia, & TBI Study dataset. I’ve messed around with the data quite a bit since then, and it seemed like a good time to share some of the things I’ve learned so far. Just to remind you: the ultimate goal of this project is to (somehow) combine this gene expression data with other kinds of data available for these samples (neuropathological measurements, demographic information) in models of dementia status (“Dementia” versus “No Dementia”). Most likely, these final models will be linear classifiers (i.e., logistic regression).

Read More

Loading Gene Expression Data

I’ve recently been working with the openly available Aging, Dementia, and TBI Study data from the Allen Institute for Brain Science as part of a Master’s degree thesis project. The goal of the project is to construct models of dementia status using the gene expression, pathological, and medical history data provided. In the spirit of “open science”, I’m going to try to do an “open thesis”, which means I’ll be blogging along as I work on the project in posts with the tag dementia. There’s also a GitHub repository for this project.

Read More

Installing TensorFlow-GPU on a Windows 10 Machine

A few months back, I discovered that my three-year-old Dell desktop has a small but nonetheless CUDA-enabled NVIDIA GPU. While the GeForce GT 730 I found - with its itty-bitty set of 384 cores and tiny 2GB stash of VRAM - is not super-great for training huge, gnarly deep neural networks, I figured any GPU is better than no GPU. So, as an aspiring data scientist and big, big fan of neural networks (both real and artificial), I got a little excited, wondering…

Read More