I have mentioned, in the past, that I am a huge fan of Nate Silver. Something he used to repeat quite frequently on his podcast is a sort of predictive modelling tautology:
The best prediction of the future is no change.
Nate Silver [Paraphrasing]
This concept has even got a probabilistic and philosophical theory behind it. All other things being equal, over the long history of time, the next moment from now is not likely to be any different from right now. If we repeat this process often enough then we will be right more often than we are wrong. In essence, we are accepting that there is continuity (and perhaps causality) in our experience of the natural world. Political scientist David Runciman even explored the concept in his recent work of political theory.
I originally took this statement in the manner in which, I hope, it was intended. But behind every great phrase there is often an enticing problem. Thinking over this phrase has led me to realise that there are three basic types of predictive models and each one of them has a fundamentally different purpose and indeed parameterisation.
Identify reality
The simplest predictive model is not really looking at the future at all. Its purpose is to identify reality.
The status quo is often harder to measure than we realise. The world is messy. Data is messy. Our data capture devices are noisy.
For example, I am currently supervising a number of projects where we build predictive models based on (medical) real-world data. Each project, in turn, has first hit the stumbling block of how to fill in missing data (data imputation) and subsequently stumbled on transcription errors made when the data set was originally assembled by hand. It always surprises teams that, once you have your ML pipeline properly assembled, the greatest gains are always to be had by going back to clean up the source data.
Data imputation is, by its nature, model based. There are two basic approaches. In the first, you look for ways of genuinely predicting (inferring) what the missing values should be, based on the other data. In the second, you insert values which, if wrong, will interact with the main predictive algorithm in a manner which does not bias it.
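To make the two approaches concrete, here is a minimal sketch in Python with NumPy. It is not taken from any of the projects mentioned above. The column names, the toy values, the straight-line fit for the first approach, and the median fill plus a "was missing" flag for the second are all my own illustrative assumptions.

```python
import numpy as np

def impute_by_regression(x_known, y_with_gaps):
    """Approach 1 (illustrative): infer missing y values from a related column x."""
    missing = np.isnan(y_with_gaps)
    # Fit y = a*x + b using only the rows where y was actually recorded.
    a, b = np.polyfit(x_known[~missing], y_with_gaps[~missing], deg=1)
    filled = y_with_gaps.copy()
    filled[missing] = a * x_known[missing] + b
    return filled

def impute_neutral(y_with_gaps):
    """Approach 2 (illustrative): fill with the median and flag the filled rows."""
    missing = np.isnan(y_with_gaps)
    filled = y_with_gaps.copy()
    filled[missing] = np.nanmedian(y_with_gaps)
    # The flag lets a downstream model learn to discount the filled value.
    return filled, missing.astype(int)

# Hypothetical toy data: one fully observed column and one with a gap.
age = np.array([30.0, 40.0, 50.0, 60.0, 70.0])
blood_pressure = np.array([110.0, 120.0, 130.0, np.nan, 150.0])

by_regression = impute_by_regression(age, blood_pressure)
by_median, was_missing = impute_neutral(blood_pressure)
print(by_regression[3], by_median[3], was_missing)
```

The two fills disagree on this toy data, which is the point: the first trusts the relationship between the columns, while the second deliberately commits to nothing beyond a central value.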
A more advanced example is a classifier. I have a patient who has a long list of lab results: are they sick or are they healthy? The buzzword in healthcare for such a model is a synthetic biomarker. This diagnostic approach to predictive modelling is still trying to identify ‘what is’ – the ground truth of our reality.
Perhaps my favourite example of identifying reality, and certainly the one which others can most easily ascribe to this category, is opinion polling. I am not talking about trying to predict how somebody might vote on some future date. I am talking about asking them how they feel today on a topic or how they have just voted as they are coming out of the polling station.
Sampling theory was one of the topics which initially pulled me into working on stochastic processes. A poll is a sample in which you measure something and try to infer what the overall value for the entire population is. This is an entirely predictive modelling approach, in which you make decisions about how you construct your sample and what you ask them and these then drive a model which predicts reality.
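As a small worked illustration of a poll as a model of reality, the sketch below estimates a population share from a sample and attaches a margin of error. It assumes a simple random sample and uses the normal approximation to the binomial; real polls use weighting and more careful designs, and the counts here are made up.

```python
import math

def poll_estimate(yes_count, sample_size, z=1.96):
    """Estimate a population share and its approximate 95% margin of error.

    Assumes a simple random sample (an idealisation) and the normal
    approximation to the binomial distribution.
    """
    p_hat = yes_count / sample_size
    standard_error = math.sqrt(p_hat * (1.0 - p_hat) / sample_size)
    return p_hat, z * standard_error

# Hypothetical exit poll: 520 of 1000 respondents say they voted "yes".
share, margin = poll_estimate(520, 1000)
print(f"estimated share: {share:.3f} +/- {margin:.3f}")
```

Even in this idealised setting, the estimate of "what is" comes with an uncertainty band that depends on the decisions made when constructing the sample.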
The word ‘predictive’ sort of implies the future and this is the core of predictive modelling. So it is somewhat ironic that I devoted the first section to just identifying reality. And perhaps even more interesting that I put classifiers firmly in that category. Occasionally people get confused by a classifier which predicts that based on your current status you will experience an adverse outcome. The point is, such a prediction is still only based on your current status; you are already over the thresholds, in a hyperspace, which mean the event is likely to happen.
A pattern to the future
Future prediction comes after you have identified your present reality. Here we look not just at whether something might happen in the future, but also at when and how.
If things are not changing much, then the best prediction is no change. If an object is increasing its velocity with constant acceleration, then the best prediction is ongoing constant acceleration i.e. again no change. Notice how the frame of reference is important.
Predictive models in this category rely on discovering patterns which have happened in the past and identifying which of those patterns is most likely to represent the current dataset. The identification part is the hard part. Once you have identified which mix of history is most applicable the actual prediction involves just rolling that historical pattern forward on top of today’s data.
I hide a lot of complexity in the phrase “mix of history”. This is the core of the model. But once you have it the rolling out of a prediction is actually easy.
A simple example of this approach is trendline regression. Just project the most recent history forwards.
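A minimal sketch of such a trendline projection, using NumPy's `polyfit`, might look like the following. The window length and the toy series are arbitrary assumptions of mine, chosen only to show the idea.

```python
import numpy as np

def project_trend(history, steps_ahead, window=5):
    """Fit a straight line to the last `window` points and extend it forwards."""
    recent = np.asarray(history[-window:], dtype=float)
    t = np.arange(len(recent))
    slope, intercept = np.polyfit(t, recent, deg=1)
    # Continue the time index past the end of the fitted window.
    future_t = np.arange(len(recent), len(recent) + steps_ahead)
    return slope * future_t + intercept

# Hypothetical series that rises by 2 at every step.
history = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
forecast = project_trend(history, steps_ahead=3)
print(forecast)
```

Note that the "no change" here is in the slope rather than the value, which is the frame-of-reference point made above.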
Physics-inspired approaches are particularly popular here. The ramifications of Newton’s insights are still only half appreciated today. Derived from these insights is a theory of cognitive learning called slow feature analysis, pioneered by Laurenz Wiskott, which says that the perceptual centres of the brain in particular focus on learning invariants of our environment, so-called slow features. Prediction is all about leveraging consistency – which features of the current situation should we rely upon in order to predict the future?
In financial modelling, time series analysis and chartism are two variants of this type of modelling. Chartism is the explicit analysis of a historical record in order to find archetypes for trends which might be composed and used for present and future predictions.
No matter how it is coded, a future prediction algorithm – based on data – is one which either algorithmically or statistically attempts to encode how much our present situation should contribute, through various functions, to the future reality.
Change of paradigm
Finally, there are change of paradigm models. In practice these exhibit very little difference from models which predict the future, which is why they are often neglected. The difference is largely in the mind of the modeller. However, it is important to distinguish this model type because a future prediction model can either predict a trend or value, or alternatively it can predict that the current trend models are no longer valid, that is, regime change. It can’t do both.
In theory, it is always possible to embed layers of complexity within models. So a trend driven model could, in principle, predict major changes and what happens afterwards. In practice, there is never enough data to constrain such a complex model. Therefore, it is safer to build one model for within-regime trends and a wholly different model to predict the point beyond which it is no longer safe to rely upon the first model.
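The two-model split can be sketched as follows. The within-regime model here is deliberately trivial (a constant level), and the separate change detector is a two-sided CUSUM on that model's residuals. CUSUM is simply one standard change-detection technique I have picked for illustration; the baseline, slack and threshold values are arbitrary assumptions.

```python
def cusum_alarm(residuals, slack=0.5, threshold=4.0):
    """Return the first index where a two-sided CUSUM exceeds threshold, else None.

    `residuals` are observed values minus the within-regime model's predictions.
    """
    up = 0.0
    down = 0.0
    for i, r in enumerate(residuals):
        # Accumulate only the part of each deviation that exceeds the slack.
        up = max(0.0, up + r - slack)
        down = max(0.0, down - r - slack)
        if up > threshold or down > threshold:
            return i
    return None

# Hypothetical within-regime model: the level stays at 10.0 (no change).
baseline = 10.0
observed = [10.2, 9.8, 10.1, 9.9, 10.0, 12.0, 12.1, 11.9, 12.2, 12.0]
residuals = [value - baseline for value in observed]

alarm = cusum_alarm(residuals)
print("stop trusting the trend model at index:", alarm)
```

The first model never has to explain the jump. It only has to be good inside its regime, while the detector's single job is to say when that regime has ended.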
Finance is littered with such models. The major banks operate ‘indicators’ which are supposed to tell when a major economic phase is nearing its end-point. I find the modelling and statistical techniques for detecting change to be quite developed, but their application sorely lacking in other fields. A scientist will often provide a regime change analysis, but not so many algorithms rely upon one for feedback control. I guess that one major application is plant control: the system relies on regime change models to trigger safety shutdown procedures should things deviate too far from the controllable regime.
Monolithic models
Of course, in real-world applications, models often span more than one of the categories I have described above.
I read Gregory Zuckerman’s book about Jim Simons and his investment fund Renaissance Technologies recently. I was initially surprised to find that the team, which is effectively the most successful mathematical trading fund in the world, basically uses only two strategies: trend and change. The book is quite disparaging about how simple the strategies are. I realised afterwards that this approach actually fits quite well with my own classification of predictive models. In fact, they religiously cleaned and harmonised their database in the 1990s in order to first have a perfect understanding of their ‘reality’. The beauty is in how they then use very simple algorithms based on both trend prediction and change detection to trade.
Perhaps the team are not exactly pushing themselves in terms of predicting over longer time horizons. But there is a statistical reason for this. In a noisy model, making many small gains is a much more certain strategy than waiting for one big one. Nassim Taleb would say that the likelihood of outsized events is poorly constrained by historic data. We know they will happen, but predicting even their true frequency requires a very different set of techniques.
In neuroscience, where we model the learning and performance processes in the brain, we often think of (i) perception, (ii) action, and (iii) world-model selection. Admittedly this is my own classification, but a theoretical neuroscientist would not disagree with the validity of my selections. Perception equates to figuring out what is current and real. We carry out actions based on a single model of the current problem or task. We carry many of these problem or task models in our brains. And we are able to switch between them based on a separate model which monitors how we are doing and tries to align our internal sense of what task we are working on with up-to-date perceptions.
In truth, any composite model must still be decomposable into the categories which I have described. I mentioned that a trend predicting model could incorporate greater and greater complexity to cover more and more regimes. But this is not without cost, and the cost grows exponentially. It is much easier to use the change of regime model to decide that it is time to switch to a different trend predicting model.
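The first half of that argument, that many small bets with a slight edge are more dependable than a single one, can be checked with a few lines of arithmetic. The sketch below is a toy of my own and says nothing about how any real fund trades: it assumes independent, equal-stake bets that each win with probability 0.51, and computes the exact binomial probability of winning more than half of them.

```python
from math import comb

def prob_ahead(n_bets, p_win=0.51):
    """Probability of winning more than half of n_bets independent bets.

    Each bet is assumed to win with probability p_win and to carry the same
    stake, so winning a majority of bets means finishing with a profit.
    """
    q = 1.0 - p_win
    min_wins = n_bets // 2 + 1
    return sum(
        comb(n_bets, k) * p_win**k * q ** (n_bets - k)
        for k in range(min_wins, n_bets + 1)
    )

# Odd bet counts avoid ties; the edge is the same tiny 51% every time.
for n in (1, 11, 101, 1001):
    print(n, round(prob_ahead(n), 3))
```

The chance of finishing ahead rises steadily with the number of bets even though the edge on each one never changes. The sketch does not address Taleb's point about outsized events, which is precisely the part that history constrains poorly.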