Mathematics and Biology III - Bioinformatics

When I sat down in Summer 2018 to begin my blog, one of my goals was to write approximately 5 definitive articles about Mathematics and Biology. So far, I have been pretty hard on the efforts in both fields to come together. I began with a review of the very different world-views inherent in the two subjects – combined with a call to arms for like-minded people to come and help out. I followed this with a more practical consideration of the repertoire of techniques necessary and the career constraints which actively work against combining these two disciplines. Today I want to consider the shining example of bioinformatics – the one area in which mathematics is clearly being used in biology and which demonstrates a clear career path.

Bioinformatics works

That’s it. I’m not going to deny it. Bioinformatics is a positive force in the world. And it involves the application of mathematics in biology. I do have much more to say on the matter, but if you stop now please understand that I am freely admitting that bioinformatics does exactly what it says on the tin.

The Human Genome Project had a moderate influence on my childhood. I remember my dad being very excited about it. The promise extended to one day having huge databases which would contain the entire genome of every human being on earth. I wasn’t entirely sure what a large database containing my genome would allow me to do, but it seemed a valid idea. I have to admit that I was far more likely to pester my dad with questions about how the genome code actually works (see my Influences article about John Holland) and I was bizarrely interested, considering my age, in how the shotgun sequencing technique worked on a mathematical level.

Like many great projects, the Human Genome Project first under-delivered before over-delivering. It didn’t seem crazily interesting when they finally announced the first genome had been sequenced. The communication of this triumph was probably muddled by the mixed messaging that i) it was a ‘generic’ genome of at least 5 people mixed together, and ii) that Craig Venter may have subverted the process and substituted his personal DNA into the samples at a number of the labs under his control. There was also further controversy due to Venter conducting an independent effort to sequence the genome, using faster but less proven techniques (principally the aforementioned shotgun technique), in competition with the public project.

Happily, the evidence of a long-term payoff from the project has grown immensely in the intervening years. The current efforts with CRISPR-Cas9 are utterly mind-blowing and are only really possible due to the understanding of genomics built up as part of the Human Genome Project.

With the launch of the Human Genome Project there was a tacit agreement that embarking on such a project would bring biology into a new information age. No more stamp-collecting: the genome was an information-rich source and microbiology was to be the first field to tackle it. Throughout the 1990s this led to the creation of numerous programmes around what has come to be called Bioinformatics.

What is Bioinformatics?

When I want to explain Bioinformatics to physicists I typically refer to it as the coupling of biology – which is obsessed with categorising everything in tables and databases – and statistics. You’ll notice I don’t put a subtext to statistics. If I am particularly sure of my audience, I extend my explanation of statistics to point out that this is the study of association not causation.
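The association-versus-causation point can be made concrete with a small simulation. What follows is only a minimal sketch under stated assumptions, not anything taken from a real dataset: the hidden factor `z`, the noise levels, and the function names `pearson` and `simulate` are all hypothetical choices made for illustration. A hidden factor drives both `x` and `y`, while `x` has no causal effect on `y` at all. Observed passively, the two are strongly associated; once `x` is set by intervention, the association should vanish.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=5000, seed=0):
    """Return (observational correlation, interventional correlation).

    Assumed toy model: a hidden factor z drives both x and y.
    x never influences y directly.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]

    # Observational setting: x and y both inherit variation from z.
    x_obs = [zi + rng.gauss(0, 0.5) for zi in z]
    y_obs = [zi + rng.gauss(0, 0.5) for zi in z]

    # Interventional setting: x is assigned at random, independent of z,
    # while y is still generated from z exactly as before.
    x_int = [rng.gauss(0, 1) for _ in range(n)]
    y_int = [zi + rng.gauss(0, 0.5) for zi in z]

    return pearson(x_obs, y_obs), pearson(x_int, y_int)


if __name__ == "__main__":
    r_obs, r_int = simulate()
    print(f"observational correlation:  {r_obs:.2f}")
    print(f"interventional correlation: {r_int:.2f}")
```

A purely statistical summary of the observational data would report a strong relationship between `x` and `y`; only knowledge of the mechanism (here, the hidden `z`) reveals that changing `x` would do nothing to `y`.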

Through this approach Bioinformatics has delivered a great deal. We can now state with relative certainty that we have identified certain genes which code for complex diseases. The current best-practice treatment for HIV is based on our understanding of the major morphologies of HIV, which in turn can be understood through its genetic sequence. Today, these same approaches are being turned to sequencing cancer tumours with a view to developing more targeted therapies. And we are sitting on the verge of a microbiome revolution which may or may not revolutionise human healthcare (Note: I’ve been a follower of microbiome research since 2005; normally I’m a believer in it, but right now there is evidence that this field has over-sold itself).

Each of these frankly magnificent achievements in human health is based, in part, on advances in Bioinformatics.

Is this just a good news story?

Unfortunately, I run out of good things to say about Bioinformatics rather more quickly than I would like. Partly this is a personal bias – it’s easy to criticise – but it’s worth exploring the limitations of this field before trying to learn from it.

Biologists have long had a basic understanding of statistics. The next natural step after gathering data is usually to count it – gathering summary statistics. So it is probably natural that, when they wanted to introduce numerics into their science, they opened up first to statisticians. There is also a timing issue in this relationship; statistics has gone through an explosion, with the widespread adoption of computing technology, since the late 1970s. The computer, the Jackknife, the Bootstrap, all followed in close succession. The later stages of the Human Genome Project overlapped with widespread adoption of Big Data techniques at Internet companies via MapReduce and later Hadoop.

However, for each of the good news stories about Bioinformatics, which I shared above, there are years of PhD-level work. Each of the breakthroughs has proven to be not particularly transferable to other illnesses/paradigms. And it is not entirely clear if Bioinformatics was really necessary for the result in the first place!

My understanding of the development of triple-therapy for AIDS patients is that it was developed via clinical insights and not from mathematical models. Within the biological modelling community one of the most famous models is one which explains why the triple therapy works. But it was not the reason for the development of the treatment protocol.

The sequencing of the tumours is one of the more transferable results. Since we now have a reasonable understanding of the method-of-action of some of our oncology drugs, tumour sequencing allows us to try to automate the line-of-treatment decisions in the clinic. I have to say, however, that I have seen multiple research results (still unpublished) that show that there is no improvement in line-of-treatment decisions from incorporating tumour sequence information. It seems that the heuristics which doctors have derived based on phenotype are so far sufficient.

Are they doing it wrong?

Bioinformatics is slowly influencing biology by introducing mathematical techniques to traditional biologists and by bringing mathematically minded practitioners into biology. Today, we are still only undergoing the first phase of this penetration.

Alongside this, there is a quiet war going on inside of statistics. The application of statistics to human health in epidemiology displays a growing schism along the issue of causation. This is an important discussion which will eventually lead to a better understanding of existing techniques and the introduction of new more specific approaches.

It is hardly surprising then that Bioinformatics is caught up in this war. And since not all bioinformaticians are hardcore statisticians they might not even be aware of the macro-level discussions going on.

Human Health and Biology have two problems which are not the most tractable to statistical approaches:

for every rule in biology there is an exception
the arrow of causation is fundamental in influencing health states

The first of these problems is a rule-of-thumb introduced to me by one of my PhD supervisors. It seems to be the one rule which, itself, does not have an exception. At one level, statistics is a very suitable toolbox for categorising and aggregating biological data. Ultimately it is the formalisation of the human/heuristic driven approach of categorisation which previously was the norm in biology. However, when you reach the (mathematical) limit of biology you realise that individual nuances matter – and statistics cannot sensibly cope with this. Statistics can aggregate these nuances and derive population-level descriptions, but it will not work at the level of individual units (whether individual humans, or individual genes).
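The gap between population-level descriptions and individual units can be illustrated with a toy example. This is a minimal sketch under invented assumptions: the effect sizes of +2 and -2, the even split between responders and non-responders, and the name `individual_effects` are hypothetical and stand in for no real treatment. The population average comes out close to zero, yet not a single individual has an effect anywhere near that average.

```python
import random
from statistics import mean


def individual_effects(n=1000, seed=1):
    """Assumed toy cohort: each individual either benefits (+2) or is harmed (-2)."""
    rng = random.Random(seed)
    return [2.0 if rng.random() < 0.5 else -2.0 for _ in range(n)]


effects = individual_effects()

# Population-level description: the average treatment effect.
average_effect = mean(effects)

# Individual-level check: what share of people have an effect
# within 0.5 units of that average?
share_matching_average = sum(
    abs(e - average_effect) < 0.5 for e in effects
) / len(effects)

if __name__ == "__main__":
    print(f"average effect across the cohort: {average_effect:.2f}")
    print(f"share of individuals near the average: {share_matching_average:.2f}")
```

The aggregate summary ("no effect on average") is accurate for the population and wrong for every individual in it, which is the sense in which the aggregation hides the nuances that matter.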

The issue of association and causation is something with which I have wrestled myself. At one point, I trained in classical statistics. But then I started looking for mechanistic explanations of biological systems. Eventually, the two approaches reach a logical impasse. From my point-of-view, epidemiology has skilfully swept much of this problem under the carpet. Much of the training of industry epidemiologists is focused on mental techniques for balancing the historical influence of R.A. Fisher on statistics against the real-world need to show effect and ideally also mechanism.

Biology is a data-rich discipline

Sometimes people and disciplines know where they’re headed long before they get there. Variations of the phrase, about biology being a data discipline, used to be thrown around in the past without ever really being honoured.

I think, today, that the throwaway comments are finally becoming a reality. I love bioinformatics. I love that it exists. I love what bioinformaticians do. As a discipline, bioinformatics is bringing mathematically inclined researchers into closer and closer contact with biological data. Lab pipelines are being organised around the capture of more data.

In the private sector, apart from the previously mentioned microbiome companies, data-plays are to be seen throughout the health sector. Alphabet own Verily. Roche bought Flatiron Health. Every pharma company I know has multiple teams attempting to apply machine learning to drug-molecule design.

I think all of this is wonderful, but it’s never going to be the solution to all of biology’s problems. Human Health, my main interest, is a field which will always be more dependent on knowledge of means-of-action than statistical techniques are apt to provide.

Series notes

This article is number 3 in a planned series of at least 5 articles on the use of Mathematics in Biology. Previous articles in the series are:
Mathematics in Biology I
Mathematics in Biology II – Practical Considerations

One Reply to “Mathematics and Biology III – Bioinformatics”

David Higgins says:

June 17, 2019 at 6:14 pm

I am not really happy with this article. It was written in too much of a hurry and has too many holes. However, I would rather leave it stand as it is than spend more time on it.