The Impact of the New FDA Guidelines on AI in Drug Discovery

Last month, the FDA issued new guidance focused on AI for drug and biological product development, and for those of us in this space, it’s a clear signal: AI isn’t just a nice-to-have anymore. It’s central to the future of pharma, but as any fellow comic book fan knows: with great power comes great responsibility.

Future-Proofing GenAI Designs

As artificial intelligence continues to evolve at an unprecedented pace, integrating generative AI (genAI) components into software designs has become a key trend. Language models such as GPT and frameworks like LangChain are driving transformative changes across industries. However, while these technologies offer exciting possibilities, the current genAI ecosystem is still highly fragmented. To ensure long-term success, businesses must take steps to future-proof their software systems, making them scalable, flexible, and adaptable to ongoing innovation. Below are essential considerations for building resilient and future-ready software designs with genAI components.

Update, Changes, and Adopting Agile

Welcome back! It’s been 2,151 days since my last post. A lot has happened. Lucky for you, I won’t talk about the vast majority of it here (who wants to relive the pandemic, amirite?) but I figure I’ll catch you up a little on where I’m at professionally, what it’s like working at a company that’s undergoing a lot of growth and changes (Moderna), and a direct consequence of that growth for my team - moving to doing agile data science. Let’s get started!

Data Mining with Rattle for R

Data mining tools make it easy to get a quick overview of the data we’re working with, which can save us loads of time, especially if we’ve got many predictors to investigate. One of the best things about these tools is that a lot of them are open-source and available for free (you can find a list of some of those here).

Cracking Open Clusters with Clouds

“Cloud” visualizations, or “tag clouds”, are great ways of summarizing a bunch of data when a frequency table or list would be too unwieldy or confusing. They’re most commonly encountered in the context of text mining, but we could really use them with anything where there’s a large number of categories or bins. In the most common version of these visualizations, the size of a category or bin name (or word, in the case of word clouds) is indicative of its frequency. Here’s an example using the text of a favorite poem of mine, The Love Song of J. Alfred Prufrock by T.S. Eliot (made using a free, online word cloud generator available here).
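To make the "size follows frequency" idea concrete, here is a minimal Python sketch. It is not how the online generator works; the function name `word_sizes`, the linear scaling, and the 10 to 48 point range are all assumptions chosen for illustration.

```python
from collections import Counter
import re


def word_sizes(text, min_size=10, max_size=48):
    """Map each word to a font size that grows linearly with its frequency.

    Hypothetical helper: real word cloud tools usually also drop stop words
    and may scale sizes non-linearly.
    """
    # Lowercase the text and pull out runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}

    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    sizes = {}
    for word, n in counts.items():
        # If every word is equally frequent, give them all the largest size.
        frac = (n - lo) / span if span else 1.0
        sizes[word] = round(min_size + frac * (max_size - min_size))
    return sizes


if __name__ == "__main__":
    print(word_sizes("go go go stop stop wait"))
```

The most frequent word gets `max_size`, the rarest gets `min_size`, and everything else lands in between.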

Using Clustering Results to Define New Features for Modeling

Hi, everyone! In the last post to this blog, we talked about how to use the clValid package for R to try multiple different clustering algorithms with a range of different values for the number of clusters, \(k\), in a single function call. In this post, we go one step further and use the results of a clustering experiment to create new features to put into models.
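One common way to turn clustering results into model features is to encode each sample's cluster membership as a set of 0/1 indicator columns. The sketch below assumes that approach and uses Python rather than R; the function name `cluster_indicator_features` is hypothetical, and the post itself may build its features differently.

```python
def cluster_indicator_features(labels):
    """Turn a list of cluster labels into one-hot indicator rows.

    Assumption: cluster membership is used directly as a categorical
    feature. Columns are ordered by sorted cluster label.
    """
    clusters = sorted(set(labels))
    return [[1 if label == c else 0 for c in clusters] for label in labels]


if __name__ == "__main__":
    # Four samples assigned to clusters 2, 0, 2, and 1.
    for row in cluster_indicator_features([2, 0, 2, 1]):
        print(row)
```

Each row has exactly one 1, in the column for that sample's cluster, so the rows can be appended to an existing feature matrix.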

Clustering Optimization with the clValid Package for R

Back again! I’m taking a break from thesis writing to share some of the awesome tools I’ve discovered over the course of completing this project.

Differential Analysis for Dimension Reduction & Exploration of Gene Expression Profiles

We’re back! In this post, we’ll take a look at some of the results from a screen for genes related to dementia using the gene expression data from the Allen Institute for Brain Science’s Aging, Dementia, & TBI Study. This is part of an exploratory analysis I conducted prior to constructing models of donor dementia status using this and other data collected from postmortem brain tissue samples. The idea is to mine this data for information about factors contributing to dementia risk.

Hypothesis Testing for Differential Gene Expression

Now that we’ve done a little exploration of the gene expression data from the Allen Institute’s Aging, Dementia, & TBI study, we’re ready to get down to the business of identifying genes or groups of genes that could be good predictors in a model of dementia status. To do this, we’ll use hypothesis testing to determine if the expression levels for a given gene differ between Dementia and No Dementia brain tissue samples. And since we’ll be creating a family of these tests - one for each of the 20,000+ genes we have - we have to talk about correcting for multiple comparisons as well.
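To give a feel for what correcting for multiple comparisons involves, here is a small Python sketch of the Benjamini-Hochberg procedure, which adjusts a family of p-values to control the false discovery rate. This excerpt doesn't say which correction the post uses, so treat the choice of method, the function name `benjamini_hochberg`, and the toy p-values as assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Each p-value is scaled by m / rank (m = number of tests, rank = its
    position when sorted ascending), then a running minimum taken from the
    largest p-value downward keeps the adjusted values monotone.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy p-values standing in for one test per gene.
    raw = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(raw))
```

A gene would then be flagged as differentially expressed when its adjusted p-value falls below the chosen false discovery rate, for example 0.05.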

Interactive 3D Plots with R and Plotly

In the last post I made, I showed some screenshots of 3-dimensional multidimensional scaling plots I made using the `plotly` library in R. I was kind of bummed that I didn't show interactive plots you could play with, so I went back and figured out how to use `htmlwidgets` to output them in HTML.

To get the plots working on this blog (which is hosted on GitHub/powered by Jekyll), I followed some of the advice in this blog post by Ryan Kuhn. A script that makes both plots and writes the HTML can be found in the repo for this project on GitHub (EDA folder).

Both interactive graphs below show multidimensional scaling (MDS) plots of the 377 gene expression profiles in the Aging, Dementia, and TBI study from the Allen Institute for Brain Science. The first plot is shaded by the sex of the donor. The second is colored by the region of the brain the sample came from.

Have fun!

Visualizing Gene Expression Profiles with Multidimensional Scaling

Hi everyone! This is the second installment in my quest to blog my way through a Master’s thesis project (Predictive Analytics, Northwestern). In the inaugural post, I loaded and saved the gene expression data for the 377 brain samples in the Allen Institute for Brain Science’s Aging, Dementia, & TBI Study dataset. I’ve messed around with the data quite a bit since then and it seemed like a good time to share some of the things I’ve learned so far. Just to remind you: The ultimate goal of this project is to (somehow) combine this gene expression data with other kinds of data available for these samples (neuropathological measurements, demographic information) in models of dementia status (“Dementia” versus “No Dementia”). Most likely these final models will be linear (AKA logistic regression).

Loading Gene Expression Data

I’ve recently been working with the openly-available Aging, Dementia, and TBI Study data from the Allen Institute for Brain Science as part of a Master’s degree thesis project. The goal of the project is to construct models of dementia status using the gene expression, pathological, and medical history data provided. In the spirit of “open science”, I’m going to try to do an “open thesis”, which means I’ll be blogging along as I work on the project in posts with the tag, dementia. There’s also a GitHub repository for this project.

Installing TensorFlow-GPU on a Windows 10 Machine

A few months back I discovered that my three-year-old Dell desktop has a small but nonetheless CUDA-enabled NVIDIA GPU. While the GeForce GT 730 I found - with its itty-bitty set of 384 cores and tiny 2GB stash of VRAM - is not super-great for training huge, gnarly deep neural networks, I figured any GPU is better than no GPU. So, as an aspiring data scientist and big, big fan of neural networks (both real and artificial), I got a little excited, wondering…