This topic occurred to me following my recent talk at a dental conference at Charité Berlin. Upon hearing that I have a strong interest in inference, my fellow keynote speaker mentioned that it drives him crazy that random forests, and similar algorithms, work so much better than DNNs on genomic data. He challenged me to come up with a reason why this is the case.
I think I know why. The problem is that I suspect I can never prove it. The issue of not being able to prove things in machine learning is probably an equally interesting topic for a future article, but here I want to address my theory of why random forests work better than DNNs for analysing genomic data.
In essence my argument mirrors the shift in thinking that took place in the statistical community around L1 and L2 regularisation. Friends of mine will already be aware that I am a fan of techniques such as Elastic Net and LASSO. If anybody needs an introduction to LASSO, I recently found a really nice tutorial for one of my staff.
The reason I like these methods so much is that they introduced a shift in perspective in the statistical community: from methods with strong mathematical proofs, e.g. L2 (norm) based regularisation with all of the associated matrix and convergence theory, to approaches emphasising a combination of techniques. LASSO still has strong maths behind it, but it expresses an acceptance of the trade-off, or balancing, of multiple strengths.
Facepalm time: I originally wrote this article using LASSO while describing Elastic Net. My bad! I'm going to blame the 40-degree heat for this one. (And special thanks to the friend who pointed it out to me.)
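To make the trade-off concrete, here is a minimal sketch using scikit-learn on synthetic data (the penalty strengths are illustrative values I chose arbitrarily). The L2 penalty (Ridge) shrinks every coefficient but keeps them all non-zero; the L1 penalty (LASSO) drives irrelevant coefficients exactly to zero; Elastic Net blends the two via its `l1_ratio` parameter.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet, Ridge

rng = np.random.default_rng(0)

# Synthetic sparse regression problem: 100 samples, 20 features,
# but only 3 features actually carry signal.
X = rng.normal(size=(100, 20))
true_coef = np.zeros(20)
true_coef[:3] = [2.0, -1.5, 1.0]
y = X @ true_coef + 0.1 * rng.normal(size=100)

# L2 (Ridge) shrinks all coefficients but keeps them non-zero;
# L1 (LASSO) zeroes out irrelevant coefficients entirely;
# Elastic Net blends the two penalties via l1_ratio.
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print("non-zero coefs  ridge:", np.sum(ridge.coef_ != 0))
print("non-zero coefs  lasso:", np.sum(lasso.coef_ != 0))
print("non-zero coefs  enet :", np.sum(enet.coef_ != 0))
```

The sparsity of the LASSO fit is the practical face of the trade-off discussed above: a small amount of bias is accepted in exchange for automatic variable selection.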
When you are training a machine learning model there are three basic methodological constraints on your outcome. The first is of course the basic technique that you are using. The second is the optimisation function. And the third is any regularisation methods you are using. What is not always obvious is that these are really nested requirements rather than completely distinct topics.
The basic techniques, such as random forests or DNNs, are typically capable of extreme expressivity with respect to the underlying data structures. They are, ultimately, just algebraic functions of the data. So unless you insist on low-dimensional techniques, or severely limited network sizes, the result of the fit on something like genome data should be largely independent of the technique.
Similarly, the optimisation function is not especially relevant in distinguishing which model structure will work better on large-scale statistical problems. However, embedded in the optimisation function (I said these technical constraints are nested) is typically also a portion of the regularisation techniques.
"In mathematics, statistics, and computer science, particularly in machine learning and inverse problems, regularization is the process of adding information in order to solve an ill-posed problem or to prevent overfitting." (Wikipedia article on Regularization)
Regularisation can span both the optimisation function and a set of ad-hoc, best-practice techniques which are applied to improve fitting outcomes. Some readers may quarrel with my characterisation of regularisation; perhaps they would prefer to make it non-overlapping with the knowledge embedded in the optimisation function. But my nomenclature is completely in keeping with the statistical literature, and my argument for making this link so explicit is that we need to shift our perspective on what is encompassed by regularisation.
We add information in many ways, some explicit and others largely implicit, when building machine learning models. I believe that the word regularisation needs to be extended to cover all aspects of the design and implementation of the model. Regularisation should not sit at the bottom of a hierarchy running from model style through optimisation function and finally down to regularisation. Rather, it is a perspective on the model which starts with an understanding of the constraints the model type will impose on the fitting process and descends all the way to individual techniques such as dropout. Regularisation, combined with data, dictates the predictive structure of the model.
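As a toy illustration of this extended view (pure NumPy, on a linear model I invented for the purpose), regularisation can live explicitly inside the optimisation function as a penalty term, or purely in the fitting procedure, here as early stopping, with no penalty term anywhere in the objective. Both shrink the solution relative to the unregularised fit:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.5 * rng.normal(size=50)

# Regularisation inside the optimisation function: an explicit L2 penalty
# term added to the squared-error loss (closed-form ridge solution).
lam = 1.0
w_pen = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

# Regularisation as a procedural technique: plain gradient descent on the
# *unpenalised* loss, stopped early. This is an implicit shrinkage with no
# penalty term anywhere in the objective.
w_early = np.zeros(10)
lr = 0.001
for _ in range(25):  # deliberately few steps
    grad = X.T @ (X @ w_early - y)
    w_early -= lr * grad

w_ols = np.linalg.solve(X.T @ X, X.T @ y)  # unregularised fit

print("norm  ols: %.2f  penalty: %.2f  early-stop: %.2f"
      % (np.linalg.norm(w_ols), np.linalg.norm(w_pen), np.linalg.norm(w_early)))
```

The penalised and early-stopped solutions both have smaller norm than the unregularised fit, even though only one of them has "regularisation" written anywhere in its objective, which is exactly the point.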
Implicit in the model
Trees provide an implicit understanding of data which is different from that of DNNs. I stated above that trees and DNNs have similar levels of expressivity with respect to the data. This is true. But given the very different understandings of data embedded in these two models, fitting them to a single data set will follow very different paths towards the eventual fit. Furthermore, if you follow different paths and you don't have infinite data, then you are likely to stop the training at a point at which the two paths do not intersect.
In fitting these models, there is also the issue of getting stuck in different local minima (of the fitness landscape) using the different techniques. These minima are equivalently good fits from a statistical perspective, but they do not express the same function of the data. Judea Pearl has some theorems in his excellent book Causality (compiled from the original papers of which he is an author) which state that for many data sets there are multiple causal models (implicit understandings of the data) which are statistically equivalent and completely indistinguishable from one another without further data.
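Pearl's point can be illustrated with a deliberately artificial example (pure NumPy, with numbers invented for the purpose): two different functions that agree exactly on every observed data point, and therefore cannot be told apart by the data, yet disagree everywhere in between.

```python
import numpy as np

# Five observed points, generated here from y = x^2.
x_obs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_obs = x_obs ** 2

# Fit A: the degree-4 polynomial through the five points.
coef_a = np.polyfit(x_obs, y_obs, 4)

def fit_a(x):
    return np.polyval(coef_a, x)

def fit_b(x):
    # Fit B: the same polynomial plus a term that vanishes at every
    # observed point, so the two fits are identical on the data.
    return fit_a(x) + 3.0 * np.prod([x - xi for xi in x_obs], axis=0)

print("agree on observed data:", np.allclose(fit_a(x_obs), fit_b(x_obs)))
print("fit A at x = 1.5:", fit_a(1.5))
print("fit B at x = 1.5:", fit_b(1.5))
```

No amount of refitting to these five points can distinguish the two models; only further data, collected away from the observed points, could.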
So trees have a very different understanding of data than DNNs. And as a result of this the process towards fitting them will be quite different.
Technically, we don’t know the structure of genomic data. We know a lot about it, but we certainly don’t know the whole structure. This is why we’re using machine learning techniques in the first place.
However, we do know that it is biological data. And we know a hell of a lot about both the biological processes involved and, also, about the selection biases which led to the development of the genome in its current configuration.
Biological processes tend to be hierarchical. Processes interact locally at a given level; the results typically lead to signals which are sent and processed at higher levels, and so on. There is a pretty strict hierarchy of interactions from the molecular level upwards, each level being dependent on the previous. (Actually, there's an interesting point about biology: the real magic typically happens when biology exploits the region where two levels intersect. But that is definitely a different article!) This hierarchical nature of biological processes does not appear to be an observer-bias effect, from us humans observing biology, but rather an inherent property of the systems.
Looking at the genome from the point of view of natural selection, we believe that selection works by a combination of mutation and exploitation of neighbouring traits. Biology rarely invents completely new processes but frequently adapts existing ones, optimising them later on. This adaptation of existing traits has as a prerequisite that the distance from the existing system to the new one be short; otherwise it is exceedingly unlikely to occur. Therefore, it appears that tree-like relationships between the 'traits' (which might mean proteins) might be a particularly efficient genomic encoding on which selection can occur.
In short, I think there is good circumstantial evidence to believe that genomic data contains entire hierarchies of embedded tree-like structures.
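A small, admittedly contrived experiment makes the intuition testable (scikit-learn on synthetic data; the label-generating rule is a hypothetical stand-in for genomic structure, and all hyperparameters are arbitrary choices of mine). Labels are produced by a small hierarchy of threshold rules, i.e. a tree, buried among irrelevant features, and both a random forest and a small MLP are fitted to the same data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Hypothetical stand-in for genomic data: labels generated by a small
# hierarchy of threshold rules (a tree) over 4 of 30 features.
# Real genomic data is of course far messier.
n, p = 600, 30
X = rng.normal(size=(n, p))
y = np.where(X[:, 0] > 0,
             (X[:, 1] > 0.5).astype(int),
             (X[:, 2] * X[:, 3] > 0).astype(int))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500,
                    random_state=0).fit(X_tr, y_tr)

print("random forest accuracy:", rf.score(X_te, y_te))
print("MLP accuracy          :", mlp.score(X_te, y_te))
```

Whichever model wins on any particular run, the two follow very different fitting paths on the same data, which is precisely the implicit-regularisation difference at issue: the forest's axis-aligned splits match the tree-structured generating rule directly, while the MLP must approximate it through smooth compositions.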
You get what you reward for
The longer I work in science, the more I love the phrase, "You get what you reward for." When applying metrics and data analytics to the world, it is vital that you understand this concept. Whether you are attempting to drive behaviour change or just deep-diving for data insights, where you shine your torch (i.e. whatever you reward for) is likely to be where you find what you were looking for.
My answer to the question of why random forests work better than DNNs for analysing genomic data is the following: DNNs and random forests embody very different implicit regularisations of the world. Genomic data is inherently hierarchical and tree-like in its nature, whether by the argument from the nature of biological processes or by that from the coding efficiency on which natural selection can operate. Tree-based methods therefore have better statistical power than other correlative methods for this particular kind of data. This is why random forests work so much better than DNNs on genomic data.
Can we ever prove it?
Statistics for most of the 20th century relied on hard mathematical proofs. But most statistical results can only be proven for certain underlying distributions of data. Once we move away from these idealised distributions we have a weaker understanding of the hard limitations of our techniques.
The explosion of computing power has led to a parallel explosion in the size of data sets and the number of parameters under exploration. Traditional statistics has struggled to adapt to this new paradigm. The LASSO technique, which I mentioned above, is one of the crossover techniques which is accepted (somewhat) by statisticians but mainly used in the field of machine learning. Elastic Net, to the best of my knowledge, resides more firmly in the machine learning culture (e.g. inferring parameters for differential equations) than in statistics.
I am not sure to what extent we will have hard mathematical proofs of convergence, and of the limitations of machine learning techniques, in the future. But I am strongly opposed to a purely heuristic approach to understanding. Hopefully we can continue to apply the statistical physics techniques currently used in computational neuroscience to some of our methods. These, at least, provide guide-rails inside of which we can expect model performance to lie.