Creating the "Dropbox of your Genome": Reid Robison Interview

Data Science Weekly Interview with Reid Robison, MD, MBA and CEO at Tute Genomics on how machine learning is transforming genomics

We recently caught up with Reid Robison, MD, MBA and CEO at Tute Genomics. We were keen to learn more about his background, his perspectives on the evolution of genomics, what he's working on now at Tute - and how machine learning is helping…

Hi Reid, firstly thank you for the interview. Let's start with your background and how you became interested in working with data...

Q - What is your 30 second bio?
A - Physician & genetics researcher turned data scientist. Serial entrepreneur. Studied neuroscience as undergrad, while doing mouse genetics studies. Then med school by day, MBA school by night. Completed psychiatry residency then jumped into a genetics fellowship focused on autism gene finding. Did a post-doc in Bioinformatics to try and merge computer science & genetics. Joined the faculty at the University of Utah doing research for a while, running a genetics lab, conducting clinical trials. Then left to start my first company, Anolinx, a health-care data analytics firm, in the early stages of the big data hype. We mined health data from big hospital systems, and queried it to answer big research questions for pharma. Pharmacoepidemiology, pharmacovigilance, virtual registry studies. Cool stuff. Then it was acquired pretty quickly and is still doing quite well. I had started and sold another company in the life sciences space, and was looking for my next thing. That's when Kai Wang, computational biologist at USC, and I started brainstorming about a genetics venture, and Tute Genomics was born.

Q - What drove your transition from being a Physician to a Data Scientist?
A - Late last year, Vinod Khosla said "By 2025, 80 percent of the functions doctors do will be done much better and much more cheaply by machines and machine learned algorithms". Treating patients one by one was personally satisfying, and I got to help a number of people. But it was hard work, and there was never enough time to help everyone. Long waiting lists, people without insurance, imperfect treatments. Our healthcare system is a bit of a mess... a huge mess in fact... I wanted to move to more of a macro level in terms of conducting research, solving big problems with data science, changing the way we practice medicine.

Q - Very inspiring! ... So, how did you get interested in working with data?
A - I was about 11 or 12 years old when we got our first computer. An early Texas Instruments, I think. It didn’t come with any games. Instead, I had to check out a book of code from the library, type in hundreds and hundreds of lines, hit the go button and voila! A game appeared. It was magical, and I’ve had a bit of an obsession with computers and intelligent machines ever since... And then as time went by I started to become more and more fascinated with all the unanswered questions out there, especially when it came to the human brain and the human genome - both are massive unexplored frontiers that are very ripe for discovery.

Q - What was the first data set you remember working with? What did you do with it?
A - During my genetics fellowship in 2009, I was doing 'linkage analysis' on large extended pedigrees. Linkage is basically a statistical approach to finding 'peaks', or regions of interest, in the genomes of large families (and groups of families) with a certain disease or phenotype. Back then, we didn't really sequence 'genomes' per se, but we had panels of markers scattered across genomes, instead of information from every letter in the genome like we can do now. I had 6000 markers on each person from a linkage panel chip, and I had this for hundreds of people across dozens of large families. I set it up using some command-line, open-source linkage analysis software called MCLINK on a Linux box under my desk at the University of Utah, and it sat there processing for, I kid you not, over a month. I would stare at it every day saying "please don't crash". Eventually, it worked, and we got some decent results from it: http://www.molecularautism.com/content/1/1/3

Q - That's amazing - great ingenuity and patience! :) Final background question ... Was there a specific "aha" moment when you realized the power of data?
A - This study I just mentioned was a bit of an 'aha' moment for me. It was impressive what we could do with just 6000 genetic markers… I couldn't help but wonder what we could find if we had access to all 6 billion letters in the human genome. My research focus shifted from these 6000 marker panels to microarrays with 250,000 genetic markers, then 500k, then 1 million... By then, next-generation sequencing was becoming available so I jumped right into trying to figure out how to use whole exome and whole genome sequencing for gene discovery in autism and other neurodevelopmental conditions, as well as helping to develop the tools to make this possible.

Very compelling background - thanks ever so much for sharing! Let's switch gears and talk in more detail about the "genome revolution"...

Q - What have been some of the main advances that have fueled the "genome revolution" in recent years?
A - The cost of sequencing the human genome has dropped over one million fold and this is literally transforming healthcare as we know it. The $1000 genome was announced this year, and it is now cheaper to sequence the entire human genome than it is to order a bunch of single gene tests. Instead of paying thousands of dollars for a few specific genetic tests, why not pay a fraction of that amount to sequence your entire genome and get information on all of your 25,000 genes at once? The problem is that no-one, until now, could handle this massive amount of data.

The advancements in sequencing technology are definitely making it more accessible to use genomics to solve medical problems. This combined with the research insight to use genomic science to design treatments and prevention strategies for major diseases is pushing society to become more accepting of genomics and promote putting resources into this industry...

Q - What are the main types of problems now being addressed in the Genomics space?
A - Our biggest problem in this space is how to quickly translate enormous amounts of sequencing data into information that can be used to fuel discovery. The industry is really advancing with sequencing technologies, and so many researchers and labs have the data, but they don’t have the time and resources to make sense of it at the pace that patients and society would like to see it get done. We are basically delayed in making major strides toward understanding and treating disease. See my post here about the size of the human genome and the problems this causes in terms of bottlenecks in data transfer and processing.

Genomics also faces an issue within the general knowledge base. We need more participation in the collection and distribution of human genomic data to identify disease causing variants, variants responsible for drug response, and so on - and this information needs to be located in a central database which is easily accessed from anywhere, by any researcher.

Q - Who are the big thought leaders?
A - My co-founder, Dr. Kai Wang, is a well-known computational biologist in this space and wrote software called ANNOVAR that quickly became the gold standard in genome analysis and has been cited by over 750 scientific papers now.

Another researcher I admire is Dr. Gholson Lyon (physician scientist at Cold Spring Harbor Laboratory in New York) who led the discovery of Ogden Syndrome, a rare, previously undiagnosed genetic disorder that he named after the families he worked with in Ogden, Utah. You can read his account of the discovery here.

Q - What excites you most about bringing Genomics and Data Science / Machine learning together?
A - It's all about getting things done quickly and accurately. Training our machines to identify novel disease-causing variants is priceless; this alone can eliminate months or more of work from a research project.

Q - What are the biggest areas of opportunity / questions you would like to tackle?
A - Before long, everyone will get his or her genome sequenced. Your genetic blueprint can and should serve as a reference for you and your doctors to query at every important medical event and decision throughout your life. Someone needs to be the keeper of that data, in a secure, accessible and meaningful way. That's what we're working on at Tute Genomics.

On that note, let's talk more about what you're working on at Tute...

Q - What specific problem does Tute Genomics solve? How would you describe it to someone not familiar with it?
A - Tute is kind of like the Dropbox of your genome - we are a big data cloud-based platform that lets researchers & healthcare organizations analyze entire human genomes. By doing so, Tute is opening a new door for personalized medicine by helping researchers and clinicians interpret genetic variants and find disease-related genes.

Q - That sounds great! Could you tell us a little more about the technology - firstly, how does it work?
A - Tute Genomics is a clinical genome interpretation platform that assists researchers in identifying disease genes and biomarkers, and assists clinicians/labs in performing genetic diagnosis. Given sequencing data on a genome or a panel of genes, Tute can return over 125 annotations on variants and genes, perform family-based, case/control or tumor sample analyses to identify causal disease genes, and generate clinical reports for clinicians to focus on clinically relevant and actionable findings.

Q - How is Machine Learning helping?
A - Machine learning enables our software to quickly go from DNA to diagnosis. The Tute platform uses machine-learning algorithms to score and rank all genes and genetic variants in a given genome by their likelihood of causing disease. We call this the Tute Score, and it's used to predict whether a genetic variant is likely to be damaging or disease-causing. This machine learning approach shows much improved predictive power compared to traditional approaches, based on cross-validation of a number of genetic data sets. We have acquired multiple public and proprietary databases, along with commonly utilized genetic scoring algorithms, and we used a Support Vector Machine (SVM) to build & train the predictive models. An SVM is a supervised classifier from the field of machine learning. Classification is formulated as an optimization problem: find the hyperplane that creates the biggest margin between the training points for neutral and deleterious variants/genes. More importantly, data that cannot be separated by a hyperplane in the original input space can become linearly separable in an expanded feature space by using kernel functions. First we identified a set of functional prediction scores that can be assigned to both coding and non-coding variants. We then built and tested SVM prediction models using a variety of kernel functions and other parameters. The SVM models were optimized using known disease-causing mutations from our test data sets.

To comprehensively evaluate the false positive and negative rates of this approach, we've been validating the Tute score on both synthetic and real-world data sets… So far so good, and we've been able to crack undiagnosed genetic diseases in a matter of minutes when you combine our annotation engine and these machine learning algorithms.
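For readers who want to see the SVM idea from this answer in concrete form, here is a minimal Python sketch. It is not Tute's implementation: the two made-up "scores" per variant, the synthetic data, the RBF kernel, the Pegasos-style training loop, and every function name and parameter value are assumptions chosen purely for illustration. It shows a kernel SVM separating "neutral" from "deleterious" points that no straight line could separate, then checks the result on held-out data.

```python
import math
import random


def rbf_kernel(a, b, gamma=5.0):
    """Radial basis function kernel between two feature vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def make_variants(n, rng):
    """Synthetic 'variants' with two hypothetical scores each.

    Label -1 (neutral) falls in an inner disk and +1 (deleterious) in an
    outer ring, so the classes are not linearly separable in the raw scores.
    """
    points, labels = [], []
    for _ in range(n):
        label = rng.choice((-1, 1))
        radius = rng.uniform(0.0, 0.4) if label == -1 else rng.uniform(0.8, 1.0)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
        labels.append(label)
    return points, labels


def train_kernel_svm(points, labels, lam=0.01, steps=3000, rng=None):
    """Kernelized Pegasos: stochastic sub-gradient descent on the SVM objective."""
    rng = rng or random.Random(0)
    n = len(points)
    # Precompute the kernel (Gram) matrix once.
    gram = [[rbf_kernel(points[i], points[j]) for j in range(n)] for i in range(n)]
    alpha = [0] * n  # counts how often each training point violated the margin
    for t in range(1, steps + 1):
        i = rng.randrange(n)
        score = sum(alpha[j] * labels[j] * gram[i][j] for j in range(n)) / (lam * t)
        if labels[i] * score < 1.0:  # inside the margin or misclassified
            alpha[i] += 1
    return alpha


def decision_value(x, points, labels, alpha):
    """Signed score; positive means 'predicted deleterious'.

    The positive 1 / (lam * steps) scale factor is omitted because it does
    not change the sign.
    """
    return sum(a * y * rbf_kernel(p, x) for a, y, p in zip(alpha, labels, points) if a)


def accuracy(test_points, test_labels, points, labels, alpha):
    """Fraction of held-out points whose predicted label matches the true one."""
    correct = sum(
        1
        for x, y in zip(test_points, test_labels)
        if (1 if decision_value(x, points, labels, alpha) > 0 else -1) == y
    )
    return correct / len(test_points)


rng = random.Random(42)
train_x, train_y = make_variants(200, rng)
test_x, test_y = make_variants(100, rng)
alpha = train_kernel_svm(train_x, train_y, rng=rng)
held_out_accuracy = accuracy(test_x, test_y, train_x, train_y, alpha)
print(f"held-out accuracy: {held_out_accuracy:.2f}")
```

The single held-out split here stands in for the cross-validation mentioned in the answer. A real pipeline would most likely use an established SVM library and many more input features than this toy example.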

Q - Very impressive! What further advances could the Tute approach / technology enable going forward?
A - We are excited about the opportunity we have to make a meaningful dent in the universe by accelerating precision medicine: unlocking your genome, personalizing treatment, and powering discovery. This is a massive amount of complex data, and we are making it accessible and useful so that we can all query our genomes for every important medical question throughout our lives.

In terms of next steps, we are already starting to integrate with patient health records, so that genomic data can be accessible where it is most useful and actionable. We are basically sick of our messed up healthcare system and are on a mission to accelerate progress towards patient-centric, precision medicine!

That's a great goal - good luck with next stage of the journey! Finally, let's talk a little about the future...

Q - What does the future of Genomics & Data Science look like?
A - Our healthcare is not yet personalized to each of us as individuals. When you receive a prescription for blood pressure medicine, or cholesterol, or even for cancer, there is a very real chance that it may be the wrong medicine for you, or even the wrong diagnosis. Fast forward a few years to a world where your medical treatment can be 100% unique. Every diagnosis, every treatment, every drug and every dietary change is tailored to you and you alone. Every treatment works in a predictable way. When you walk into the hospital, instead of feeling like a car on an assembly line, you can be treated like the unique human being you are. Instead of seeing a specialist in a certain field of medicine, the information in your genome can turn any doctor into a specialist in YOU. All of this, thanks to one test: whole genome sequencing, and Tute Genomics software technology - an app to unlock your genetic blueprint and enable genome-guided medicine.

Reid - Thank you ever so much for your time! Really enjoyed learning more about your background, your perspectives on the evolution of genomics, what you're working on at Tute - and how machine learning is helping. Tute can be found online at http://tutegenomics.com.

Readers, thanks for joining us!

P.S. If you enjoyed this interview and want to learn more about

what it takes to become a data scientist
what skills do I need
what type of work is currently being done in the field

then check out Data Scientists at Work - a collection of 16 interviews with some of the world's most influential and innovative data scientists, who each address all the above and more! :)