Randomised controlled trials (RCTs) have been the gold standard for statistical evidence of treatment effect for over 100 years. Their strength lies in their attempt to avoid the major sources of bias in a comparison of treatments. However, they are costly to run, particularly in the domain of personalised medicine, to which medical AI products typically belong.
As a result, enormous efforts have been made to develop statistical methods that allow observational data to be used as a form of real-world evidence (RWE) for a treatment effect. In practice, these approaches move the burden of evidence-building from the thoroughness of the randomisation to the completeness of the assumptions. In an ideal world, the two will (asymptotically) coincide. But we don’t live in an ideal world. In reality, the behavioural selection pressures on RCT development act towards thorough planning, which increases true success probabilities, whereas the same pressures applied to observational data analysis lead towards a search for any supporting evidence, which runs into considerable issues of multiple testing and false associations.
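The multiple-testing problem behind that search for any supporting evidence can be illustrated with a small simulation. This is a minimal sketch, not an analysis of any real data: it assumes normally distributed outcomes with known unit variance and no true treatment effect at all, and the function names (`two_sided_p`, `null_comparison`, `count_false_positives`, `familywise_error`) are hypothetical names chosen for illustration.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


def null_comparison(rng, n_per_group):
    """Compare two groups drawn from the SAME distribution (no true effect)."""
    treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
    control = [rng.gauss(0, 1) for _ in range(n_per_group)]
    diff = sum(treated) / n_per_group - sum(control) / n_per_group
    # Assumption: unit variance is known, so a z-test is appropriate.
    standard_error = math.sqrt(2 / n_per_group)
    return two_sided_p(diff / standard_error)


def count_false_positives(n_outcomes, n_per_group, alpha=0.05, seed=0):
    """Number of 'significant' findings when every outcome is pure noise."""
    rng = random.Random(seed)
    return sum(
        null_comparison(rng, n_per_group) < alpha for _ in range(n_outcomes)
    )


def familywise_error(n_outcomes, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_outcomes


if __name__ == "__main__":
    print("False positives among 200 null outcomes:",
          count_false_positives(200, 50))
    print("Chance of at least one false positive in 20 tests:",
          round(familywise_error(20), 3))
```

With a 5% threshold, roughly one in twenty pure-noise comparisons will look significant, so an analyst who is free to examine many outcomes or subgroups will almost always find something to report.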
Due to the costs of RCTs, regulators are willing to consider observational data as evidence in certain circumstances. Right now, in the USA, this amounts to two cases. The first is transfer of use, i.e. a product that has previously been shown to be both safe and effective in a specific patient population is extended to a different patient group, possibly for a different condition. The second is products which are allowed early to market on the basis of small trial results, and which require long-term surveillance to demonstrate (via RWE) their effects.
The common regulatory factor in these circumstances is essentially a reduced risk of harm. Sometimes this is because of inconsequential treatment effects (e.g. trying to shave 1 day off the treatment of athlete’s foot) or tiny effect sizes (preventing Alzheimer’s); other times it is because the patient population under examination faces horrible outcomes in any case (as with late-stage cancer therapies).
Of course, pharmaceutical companies spend enormous sums of money employing epidemiologists (in part) to conduct post-hoc analyses of clinical trials and to try to rescue the outcomes. Conceptually, this effort is similar to retraining an ML algorithm with slightly different hyperparameters. Why is it that a data dive by an epidemiologist might be allowable, but one to prove a medical AI platform might not? In my opinion, it comes down to two basic factors: (i) the dimensions captured by an RCT vs an ML algorithm, and (ii) a realistic appraisal of sunk costs.
While RCTs are the gold standard for medical evidence, it turns out that it is difficult to randomise correctly against all possible sources of bias. A trained epidemiologist can, using biological insights, rescue a randomisation post hoc and change the outcome of a trial. This is the very definition of the E in RWE. (Note: the correct use of the term RWE does not apply to rescuing clinical trials; I am using this as a thought experiment to examine medical AI.)
The main strength of ML techniques is their ability to find associations in dimensions much higher than a human is capable of considering. They even do this while avoiding some issues of multiple testing. But the big risk is that they are specialised techniques which will, by design, always find an association. Running a validation study of an ML technique against the same data set on which it was trained is essentially meaningless. So, a medical AI algorithm already leverages all possible dimensions in the training set; it is not possible to efficiently change this via a post-hoc analysis of RWE data.
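To see why validation on the training set says nothing, consider a flexible model fitted to data in which there is, by construction, no association at all. The following is a minimal sketch under stated assumptions: the features and labels are pure random noise, the model is a 1-nearest-neighbour classifier chosen only because it is easy to write out, and the helper names (`make_noise_dataset`, `predict_1nn`, `accuracy`) are hypothetical.

```python
import random


def make_noise_dataset(rng, n_rows, n_features):
    """Random features with random binary labels: no real association exists."""
    features = [
        [rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_rows)
    ]
    labels = [rng.randint(0, 1) for _ in range(n_rows)]
    return features, labels


def predict_1nn(train_x, train_y, row):
    """Return the label of the training row closest to `row`."""
    nearest = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], row)),
    )
    return train_y[nearest]


def accuracy(train_x, train_y, eval_x, eval_y):
    """Fraction of evaluation rows whose label the model predicts correctly."""
    hits = sum(
        predict_1nn(train_x, train_y, row) == label
        for row, label in zip(eval_x, eval_y)
    )
    return hits / len(eval_y)


if __name__ == "__main__":
    rng = random.Random(1)
    train_x, train_y = make_noise_dataset(rng, 100, 10)
    fresh_x, fresh_y = make_noise_dataset(rng, 400, 10)
    # Each training row is its own nearest neighbour, so this is perfect.
    print("Accuracy on training data:",
          accuracy(train_x, train_y, train_x, train_y))
    # On data the model has never seen, it can do no better than guessing.
    print("Accuracy on fresh data:",
          accuracy(train_x, train_y, fresh_x, fresh_y))
```

The model scores perfectly on the data it was trained on and close to coin-flipping on fresh data, even though nothing was there to be learned. Only the second number carries any evidential weight.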
Bringing a drug candidate to clinical trial costs in the hundreds of millions of euros/dollars. Bringing a medical AI platform to the market, including carrying out the trials, will cost a fraction of this amount. The subsequent cost of deployment of a medical AI platform, in true computing fashion, approaches zero. Perhaps it is acceptable that regulators are at least willing to discuss more lenient approaches for drugs than for medical AI systems. To be clear, there is no guarantee that any such leniency actually applies to drugs; the effects of the 2016 changes to US law are still being felt in this area.
Medical AI products will transform medicine. However, investors and product leads who ignore the necessary validation steps do so at their peril. A backlash is coming. RWE rules will, in future, be of increasing importance for longitudinal effects – although this will raise awkward questions about valuation of a drug line – but I predict only small impacts on bringing AI products to market.
This article is a hot take that I wrote for a friend in the context of regulatory validation of medical AI products. I am also working on a concrete, related result in statistical theory which addresses the quantification of foreseeable insights from observational data.