Welcome to my spot on the web for drafts, supplemental material, and general thoughts about doing a thesis project for the Master of Science in Predictive Analytics (MSPA) degree (now the Master of Science in Data Science (MSDS) program) from Northwestern University. Below the interactive plots, I've developed an "epilogue" containing thoughts about doing a data science Master's, choosing the thesis option, and some of the things I learned along the way.
Thesis Paper
I'll update this section with drafts as they get finished.
2018-11-04: I have a (mostly) completed draft you can check out here on Google Drive. I'm currently awaiting comments from readers, so no doubt it will change substantially. I haven't put in a Table of Contents and I'm still figuring out how to list the supplemental materials you'll find on this page, but everything else is there (hooray!).
2018-12-16: A lot has changed in the last month or so! I've decided to push back my tentative graduation date from this month to the end of the 2019 Winter Quarter, in part due to starting a new position as a data scientist at Highmark Health here in Pittsburgh. I had the thesis draft reviewed by my first reader, who suggested some restructuring for the Conclusions section but otherwise found it to be good.
I spent a few weeks away from the thesis, which allowed me to come back to it with a fresh set of eyes. I made some grammar edits and added the Table of Contents as well as the Appendix listing the supplemental material (links to the GitHub repo and this webpage). The most recent version is v4.0, which can be accessed here. This is a completely formatted draft with all the necessary components as outlined in the Graduate Thesis Handbook.
I'm happy to have some time to finish the process in a way that isn't rushed. I'll be working over the holidays to restructure the Conclusions section and hope to get notes from a second reader by the end of January. Barring any substantial unforeseen issues, I should have everything done by the March 15th deadline to graduate at the end of the Winter 2019 quarter (hooray!).
2019-04-06: It's been a minute but, yes, I finished the degree. As of March 29, 2019, I'm an official graduate of Northwestern's MSPA program. One of the last of the "old guard" since the program has now changed significantly since I started it back in 2016 (including in name as it is now the Master of Science in Data Science).
A copy of the final accepted version of my thesis can be found here.
Many thanks to my readers, Drs. Alianna Maren and Lawrence Fulton, to all my professors, and to all my classmates from whom I learned so much.
I've added to the list of things I've learned at the bottom of this page. Please don't hesitate to reach out using the links at the bottom of the page if you have any questions about doing a data science master's degree, choosing between a capstone and a thesis, or anything else. If you find a broken link here, please let me know 😃
Good luck to everyone on a data science journey!
Code
All the code (mostly in R) for the thesis can be found in the project repo on GitHub.
Supplemental Material
Interactive Multidimensional Scaling Plots
Below are four interactive multidimensional scaling plots of genetic profiles developed from open-source RNA-seq data available from the Aging, Dementia, and TBI Study from the Allen Institute for Brain Science.
Use your mouse to grab them, rotate them, and zoom in and out. Hovering over a data point gives the point's coordinates in the first three MDS dimensions. Each point represents a genetic profile (based on expression levels for 50,000+ genes and gene isoforms) for an individual patient/donor.
These were made using Plotly and htmlwidgets for R. Check out this blog post for more on multidimensional scaling of gene expression level data.
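For anyone curious what multidimensional scaling does under the hood, here is a minimal sketch of classical MDS. It is an illustration only, not the thesis code (which is mostly in R and lives in the repo). It is written in plain Python with no third-party packages, and the function names, the toy expression values, and the use of power iteration to find the leading eigenvectors are all assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_eigenpair(matrix, iters=500, seed=0):
    """Approximate the dominant eigenpair of a symmetric positive
    semi-definite matrix by power iteration."""
    n = len(matrix)
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(row[j] * v[j] for j in range(n)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # nothing left to find: treat as a zero eigenvalue
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
    # For a unit vector v, the Rayleigh quotient v^T M v estimates the eigenvalue.
    mv = [sum(row[j] * v[j] for j in range(n)) for row in matrix]
    return sum(a * b for a, b in zip(v, mv)), v


def classical_mds(profiles, k=3, iters=500):
    """Embed each profile in k dimensions so that pairwise Euclidean
    distances are preserved as well as possible (classical MDS)."""
    n = len(profiles)
    # Squared pairwise distances between profiles.
    d2 = [[euclidean(p, q) ** 2 for q in profiles] for p in profiles]
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    row_mean = [sum(row) / n for row in d2]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * k for _ in range(n)]
    for dim in range(k):
        value, vec = top_eigenpair(b, iters)
        scale = math.sqrt(max(value, 0.0))
        for i in range(n):
            coords[i][dim] = vec[i] * scale
        # Deflate: remove the component just found before the next pass.
        b = [[b[i][j] - value * vec[i] * vec[j] for j in range(n)]
             for i in range(n)]
    return coords


# Hypothetical toy data: 5 donors x 4 genes (made-up expression values).
toy_profiles = [
    [5.1, 0.2, 3.3, 9.0],
    [4.9, 0.1, 3.0, 8.5],
    [1.0, 7.2, 0.4, 2.1],
    [1.2, 6.8, 0.5, 2.4],
    [3.0, 3.5, 2.0, 5.0],
]
for donor, point in enumerate(classical_mds(toy_profiles, k=3), start=1):
    print(donor, [round(c, 3) for c in point])
```

Each printed row is one donor's position in the first three MDS dimensions, which is the kind of coordinate you see when hovering over a point in the plots. A real analysis would lean on a tested library routine rather than a hand-rolled eigen-solver like this one.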
Shaded by Brain Region
HIP = hippocampus
FWM = forebrain white matter
PCx = parietal cortex
TCx = temporal cortex
Shaded by Donor Sex
Shaded by Lifetime Number of Traumatic Brain Injuries (TBIs)
Shaded by Dementia Status
Differential Expression Analysis Filtering & p-Value Cutoff Experiments
A comparison of the numbers of "significant" genes obtained with different filtering parameters and p-value cutoffs for determining differential expression in donors with dementia.
Filtering & P-Value Cutoff Experiment Spreadsheet
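The idea behind this kind of experiment can be sketched in a few lines of Python. This is a toy illustration and not the thesis code: the gene names, expression values, p-values, the "minimum mean expression" filter, and the cutoff values are all hypothetical stand-ins, since the actual parameters are in the spreadsheet and the repo.

```python
from itertools import product

# Hypothetical toy results: (gene, mean expression, p-value). Not real data.
results = [
    ("gene_a", 120.0, 0.0004),
    ("gene_b", 3.0, 0.0100),
    ("gene_c", 45.0, 0.0300),
    ("gene_d", 0.5, 0.0008),
    ("gene_e", 80.0, 0.2000),
]


def count_significant(results, min_expression, alpha):
    """Count genes that pass the expression filter and have p < alpha."""
    return sum(
        1 for _, expression, p_value in results
        if expression >= min_expression and p_value < alpha
    )


def cutoff_grid(results, expression_filters, alphas):
    """Map every (filter, alpha) combination to its count of significant genes."""
    return {
        (min_expression, alpha): count_significant(results, min_expression, alpha)
        for min_expression, alpha in product(expression_filters, alphas)
    }


grid = cutoff_grid(results, expression_filters=[0.0, 1.0, 10.0],
                   alphas=[0.05, 0.01, 0.001])
for (min_expression, alpha), n_genes in sorted(grid.items()):
    print(f"filter >= {min_expression:>5}, p < {alpha:<5}: {n_genes} genes")
```

Laying the counts out as a grid like this makes it easy to see how sensitive the list of "significant" genes is to choices made before any biology gets interpreted.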
Brain Region Intersection Gene Details
As a part of the exploratory analysis of the RNA-seq transcriptome data, I investigated the 29 genes that had altered expression patterns in all four brain regions sampled from donors with dementia (hippocampus, forebrain white matter, parietal cortex, and temporal cortex).
Brain Region Intersection Gene Details
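Finding genes that turn up in every brain region boils down to a set intersection. Here is a minimal Python sketch of that step, assuming you already have one set of differentially expressed genes per region. The gene names and the `shared_genes` helper are hypothetical; only the region abbreviations come from this page.

```python
# Hypothetical per-region hit lists (made-up gene names, not thesis results).
region_hits = {
    "HIP": {"gene_a", "gene_b", "gene_c"},
    "FWM": {"gene_a", "gene_c", "gene_d"},
    "PCx": {"gene_a", "gene_c"},
    "TCx": {"gene_a", "gene_b", "gene_c", "gene_e"},
}


def shared_genes(hits_by_region):
    """Return the genes present in every region's hit set."""
    gene_sets = list(hits_by_region.values())
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


print(sorted(shared_genes(region_hits)))
```

With real data, each set would hold the genes called differentially expressed in that region, and the intersection would be the short list worth looking up one by one.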
Epilogue
Things I've Learned by Doing a Data Science Master's Program & Thesis Project
As things started to wrap up for me, I found myself reflecting on the entire experience of doing the MSPA program. Maybe you stumbled onto this page because you're thinking of pursuing a data science Master's degree. Or maybe you're already in the MSDS program at Northwestern or somewhere else and are trying to make the "thesis or capstone" decision. In this section, I list some of the things I've learned from doing this degree, with a focus on doing a thesis project. Just my $0.02. FWIW, etc. I'm putting it down here as a sort of epilogue to the thesis now that she's all done.
“Life can only be understood backwards; but it must be lived forwards.” - Kierkegaard
- Doing this program was a great decision for me. As someone moving from academia to industry AND changing careers, the ability to talk to and learn from people already doing data science in a variety of industries was exactly what I needed. Classes were challenging and I appreciated the flexibility of an entirely online program. My classmates are incredible people. I learned so much from interacting with them and with our instructors as well. The structure of an actual academic program was good for me because it kept me on track and provided me with a level of accountability, ensuring that I was learning what I needed to learn. Your mileage may vary but, for me, it was well worth the investment I made. As a direct result of the skills I learned as a part of pursuing the Master's, I have been able to start a career in data science.
- You get out of it what you put into it. Most educational experiences are like this, I bet. I'm not saying anything here you probably don't already know. I figure if you (or your company) are going to drop a lot of coin on a program like this, why not go the extra mile, if you can? Show up. Be creative. And don't be afraid to come in last in your class in a Kaggle InClass competition or bomb a technical interview in truly spectacular fashion 😉 It just might be the best thing that ever happens to you.
- Got time and an idea? Not doing data science for a living yet? Do the thesis. If you choose to pursue a data science master's, at some point you'll face the "capstone or thesis" question. I can only speak to my personal experience, but I have learned more in the year or so of self-study for the thesis project than I ever thought I would. I grok more about statistics, clustering, penalized linear models, binary classifiers, and so much more now for having done this thing. I feel like I can talk about those things and be confident in what I'm saying. Personally, I learn best by doing, screwing it up, doing it over, screwing it up some more, etc. If you're not doing data science for a living yet, and don't have the opportunity to work with real data on the reg, the thesis project can be a terrific way to get an understanding beyond the Titanic and MNIST datasets.
- BUT, doing the thesis will take a long time. Maybe not for some people, but on average it does take longer than a quarter. Maybe two. I gave myself a year to do it with everything else going on in my life and it ended up taking even longer than that due to unexpected "life stuff" (starting a new data science job) that came up while I was doing the project. But for me it was worth it. You'll have to weigh the options for yourself. The Northwestern MSDS Canvas site has resources to help you decide if doing a thesis is for you. Also, Dr. Alianna Maren has a very honest flowchart for making the decision that you can check out. Be prepared to work independently but don't pass up opportunities to use University resources like The Writing Place. You/your company are, after all, paying for them 😊
- If you have the time, blog about your journey. I'm writing this in HTML right now, something I never thought I'd learn over the course of doing the program/pivoting to a career in data science. I'm so glad I discovered GitHub Pages and set up this website because I've learned so much bonus stuff in the process. A little web design. A little CSS and HTML. Even a little Ruby. It's a place to showcase your work and maybe (hopefully!) interact with others. And speaking of GitHub...
- GitHub is amazing. Get an account, if only to share code with your classmates, but it is so much more than that. Software developers moving to data science probably already know about this thing but I had no idea. I had an epiphany back in about April/May 2018 when I learned a little about how GitHub is actually used to organize projects and stuff. I mean... I changed all the scripts I had written for the thesis project up to that point so that, when I finished the project, anybody could clone the project repository, run the scripts in order, and more-or-less reproduce any of the results I put in my thesis or on this blog. I mean, wow. It blew my mind when I discovered that. Open source is amazeballs. All our favorite R and Python packages are developed there and we can be a part of them. How cool! I have the zeal of the converted but seriously. Make use of GitHub. Especially if you're doing a thesis project that you can share with others/point potential employers towards if you're on the hunt for a new gig.
- For writing up, use EndNote for referencing and format the entire document as you go. If you're a student, chances are you can get a free copy of EndNote, a software package that will help keep your references organized and that will automatically construct a bibliography for you. If you're a Northwestern student, you can get a copy of EndNote free from IT here. Do yourself a huge solid and learn about the 'cite-while-you-write' feature. The Graduate Thesis Handbook (links to an older copy) suggests using APA or Chicago style but you can use any citation format so long as it's consistent. Read over the formatting requirements and build them right into your document from the start. It saves a lot of time. One thing I learned was that Word will write a Table of Contents for you automatically if you specify headers. I had no idea! It makes a much nicer ToC than doing it by hand.
- One final thing now that I've finished and graduated: Remember that it takes as long as it takes. For me, I needed to give myself time and space to let ideas mature and solutions present themselves. I'm happy that I didn't rush through. The last three months between completing the first draft and finalizing the last version consisted of a lot of "down time". I started a new job in November 2018 so much of that time was dedicated to getting settled in with that. The thesis was never far from my mind, though. I strongly believe that letting the draft marinate and coming back to it after weeks of focusing on something else made for a better final draft.
Feel free to reach out to me if you have any questions!