Health AI Remains All Buzz and No Honey
Even though it makes all the right noises from its hive of robotic drones
Almost twenty years ago I stood in the sitting room of the imposing house of then-Mayor of Auckland and cereal king, Dick Hubbard. I was having a long conversation with his adult son, Gavin, then a fellow technologist, over some of the philosophies and interesting experiences we had had whilst plying our trade as Infrastructure Architects and Information Systems consultants. Gavin turned the conversation to one maxim that has resonated with me ever since…
In any technology solution you design, buy or implement, there are only three things: Speed, Capacity and Price.
Speed means how fast it performs its core task, essentially how fast it operates. Speed incorporates such measures or concepts as revolutions per minute, gigahertz, bits per second and parallelism as a multiplier. Capacity can mean how much information it can store, transfer or process in a given time period, and how well or accurately we process it. And price. Well, we all know what price means. How much is this going to hurt my wallet… Sir?
However, he said… and here’s the kicker. In any single solution you can only have two.
If you sacrifice capacity you can very often get a fast solution that is cheap, but it won't be capable of holding or processing anywhere near as much data. Think of this solution as a single two-terabyte laptop hard drive connected via a ten-gigabit optic fibre backbone to four large multi-processor database servers. Massive speed and processing capacity, with so little data that it can be entirely consumed in seconds and must be constantly replaced or refreshed. In Health AI, capacity can also incorporate or substitute for the element of accuracy, such that we see machine learning and artificial intelligence solutions willing to sacrifice accuracy in order to achieve speed (in either time taken to develop or overall throughput of the model) at a lower cost.
If you sacrifice speed you can get a capacious and cheap solution that runs like a five-year-old through Boston's Great Molasses Flood of 1919. This solution might be the inverse of the one above - a one-hundred-terabyte Storage Area Network (SAN) connected via an old-fashioned HP ten-megabit network hub to a single uni-processor database server. Massive data being transferred and processed so slowly that you don't know which will come first, your completed data analysis task or the heat death of the universe.
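For a sense of just how bad that second scenario is, here is a quick back-of-envelope sketch (my own illustrative arithmetic, using idealised raw line rates and ignoring protocol overhead, contention and disk limits):

```python
# Back-of-envelope transfer times for the "molasses" SAN scenario above.
# Idealised raw line rates only: no protocol overhead, contention or disk limits.

def transfer_time_seconds(data_terabytes: float, link_megabits_per_sec: float) -> float:
    bits = data_terabytes * 1e12 * 8              # terabytes -> bits
    return bits / (link_megabits_per_sec * 1e6)   # bits / (bits per second)

san_terabytes = 100
for label, mbps in [("ten-megabit hub", 10), ("ten-gigabit backbone", 10_000)]:
    days = transfer_time_seconds(san_terabytes, mbps) / 86_400
    print(f"{san_terabytes} TB over a {label}: ~{days:,.0f} days")
# ~926 days (roughly two and a half years) at hub speed, versus about a day.
```

Roughly two and a half years just to move the data at hub speed, before a single query has been run.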
Therefore, the tough question becomes what price (how much money) you are willing to pay in order to achieve an adequate measure of both speed and capacity.
I believe this maxim is as true today as it was back then, and that it applies equally whether I am building a rack of email, intranet and electronic health records servers and their corresponding SAN, or an intelligent tool to improve clinical decision-making and patient outcomes. For what it's worth, we could just be lazy like everyone else and call all intelligent tools AI for a minute. Or maybe not.
What’s Wrong with Health AI Right Now?
In 2017 I sat with a table of fellow IT Engineers, academics and PhD students at the IEEE International Conference on Healthcare Informatics (ICHI) in Salt Lake City, Utah. We listened intently as a female keynote speaker from IBM extolled the virtues and accuracy of IBM's Watson for Health - that multi-billion-dollar math and/or machine learning man-behind-the-curtain [1] solution IBM developed initially for cancer in partnership with, and at the expense of, Memorial Sloan Kettering Cancer Center. Watson for Health was little more than a machine learning data consumption tool. Their concept was really quite simple: if you give this big expensive computer enough data, including patient medical records, journal articles and clinical guidelines, it will meet and eventually exceed the ability of doctors to make diagnostic or treatment decisions. The speaker waxed lyrical on Watson for Cancer's ability in 2013 to correctly match the decisions of human doctors on 99% of 1,000 supposedly challenging cases, without mentioning the baked-in biases [2] of overfitted data and of only making predictions for patients it had already been trained on.
Naturally, what she left out was that when you presented Watson with a case outside the patient data it had previously been trained on, or from another hospital with different patients and clinicians, it failed miserably. It made clinical decisions and treatment recommendations that were unsafe and, had doctors followed them blindly, could potentially have killed patients. Watson had actually become IBM's Big Blue Embarrassment and, as the senior datacentre engineer from the IBM Watson datacentre in Ireland sitting directly to my right told the group at our table, was already being shuttered and dismantled even as the keynote speaker continued to extol its virtues.
So, what we had was the face of IBM up front telling us how “awesome” and “accurate” their Health “AI” was, while the truth was that behind the curtain they were shutting up shop, repaying tens of millions to hospitals that had bought into the hype, and IBM management were quietly hoping you would forget about their little excursion into Watson fantasy.
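Before moving on, it is worth spelling out why that 99% figure meant so little. The following is a toy sketch of my own (synthetic data and scikit-learn, nothing to do with Watson's actual pipeline) showing how any sufficiently flexible model can look near-perfect when scored on the very patients it was trained on, and considerably less impressive on patients it has never seen:

```python
# Toy illustration: scoring a model on the patients it was trained on versus
# on previously unseen patients. Synthetic data only; not Watson's pipeline.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 25))            # 2,000 synthetic "patients", 25 features
y = (X[:, 0] > 0).astype(int)              # the real signal lives in one feature...
noise = rng.random(2000) < 0.2             # ...and 20% of the labels are simply wrong
y[noise] = 1 - y[noise]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

print("Accuracy on the patients it was trained on:", model.score(X_train, y_train))  # close to 1.00
print("Accuracy on patients it has never seen:    ", model.score(X_test, y_test))    # nearer 0.75
```

The first number is the kind you put in a keynote. The second is the kind your patients actually experience.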
The EPIC Sepsis Model was another example of so-called clinical "AI" that, again, was a math and/or machine learning type model we were told could detect and diagnose instances of sepsis (systemic infection) in patients - based on nothing more than training on large datasets of electronic health records (EHR) of patients who had previously had sepsis, and access to the current patient's in-hospital electronic medical record (EMR). The EPIC Sepsis Model tool was so accurate [3] that, in a large patient population where the incidence rate of sepsis was 7%, it not only missed 67% of those life-threatening sepsis cases despite the supporting information already being present in the record (a 67% false negative rate), it also generated alerts for 18% of all hospitalised patients, most of whom did not have sepsis (approximately a 74% false positive rate). The only thing it did reliably was cause alert fatigue for the doctors and nurses working on wards that used the EPIC Sepsis Model. EPIC's own marketing spin has gone into overdrive to create the impression that the only real error their model made was false alarms, and that they have learned from past modelling mistakes and overhauled their Sepsis Model. Yet it is clear even to the casual observer that the numbers for their 'updated model' still add up to nothing more than a lot of missed sepsis cases and a misplaced sense of safety and confidence in any clinician relying on the tool.
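To make those published rates concrete, here is a rough tally for an illustrative cohort of 10,000 admissions (the cohort size is mine; the percentages are the ones reported above):

```python
# Rough tally of the EPIC Sepsis Model figures quoted above, applied to an
# illustrative cohort of 10,000 admissions (cohort size is mine, rates are theirs).
admissions   = 10_000
sepsis_cases = round(admissions * 0.07)      # 7% incidence  -> ~700 patients with sepsis
missed_cases = round(sepsis_cases * 0.67)    # 67% missed    -> ~469 sepsis cases never flagged
alerts_fired = round(admissions * 0.18)      # alerts on 18% -> ~1,800 alerts for staff to chase

print(f"Sepsis cases in the cohort: {sepsis_cases}")
print(f"Sepsis cases never flagged: {missed_cases}")
print(f"Alerts fired across wards:  {alerts_fired}")
```

Hundreds of missed cases and well over a thousand alerts to chase down: a recipe for exactly the alert fatigue described above.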
Over the last few days an X user, who it appears may (or may not) have some connection with the solution he was directing my attention towards, suggested that Pacmed might be the unicorn Health AI solution I hadn't yet seen. In short, it was not.
For starters, Pacmed's product is not true AI; rather, it is another math and/or machine learning type tool, in this implementation used in two hospitals in the Netherlands to support clinical decisions on whether patients should be discharged from intensive care, on the basis of predicted readmission within seven days. Some of their research reads more like sales and marketing promotion dressed up as 'Science', and is carefully and comically couched with words like "retrain" to gloss over processes that appear for all intents and purposes to be another application of "overfitting" bias. They proclaim that erroneous results for 1-out-of-5 patients (a 21% false positive or false negative rate) are, after this overfitting to a specific hospital patient cohort, infinitely better than their initial error rate of 1-out-of-4 (a 28% false positive or false negative rate). I am sure that is confidence-inspiring for every fourth or fifth patient who gets the wrong prediction and is potentially harmed by either early discharge or a prolonged inappropriate stay [4].
I am also leaving aside that, at at least one point in the paper, the 79% overfitted accuracy score drops confusingly to 78% - which may be neither here nor there unless you happen to be that one patient out of one hundred caught up in what might be a simple typographical mistake. However, ignoring that they, like every other Health AI tool based on complex math and/or machine learning, struggle to get accuracy rates into the 80s, their academic literature does suggest they get one thing right. But more on that in a minute.
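For scale, and purely as arithmetic on the figures they themselves publish (the 1,000-decision cohort below is mine, chosen only for illustration):

```python
# Arithmetic on the Pacmed error rates quoted above, per 1,000 ICU discharge decisions.
decisions     = 1_000
errors_before = round(decisions * 0.28)   # initial model: ~280 wrong predictions
errors_after  = round(decisions * 0.21)   # after retraining/overfitting: ~210 wrong predictions
typo_gap      = round(decisions * 0.01)   # the 79% vs 78% discrepancy: ~10 patients

print(f"Wrong predictions before retraining: {errors_before}")
print(f"Wrong predictions after retraining:  {errors_after}")
print(f"Patients hiding in that 1% 'typo':   {typo_gap}")
```

Seventy fewer errors per thousand decisions is progress of a sort, but it still leaves more than two hundred patients per thousand on the wrong end of the prediction.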
What Can We Do?
I have always been a fan of the Mr Scott School of Engineering. The original Star Trek series chief engineer Mr Scott himself had several sensible maxims, such as:
When you quote how long a task is going to take, double it.
Not only does this give you a buffer if something unexpected happens, but when you invariably deliver in half the quoted time you look like, to use his term, a miracle worker.
However, his most important and most often quoted maxim is this.
Use the right tool for the right job.
The biggest issue I see repeatedly with these so-called Health AI projects is that they focus wholly on the data, to the almost absolute detriment of everything else, including common sense. When their data-centric models eventually and invariably are found to be drilling a dry well, their most oft-heard responses centre on needing either more or better-quality data, overfitting of that data, or a better-tuned propensity score [5] or mathematical model. Because if some data is good, more data is better. If we have an overwhelming amount of data and still can't find anything, it must be because the data is dirty and more time (and money) is needed to clean it. Or we just need a better way to one-to-one match patients and controls in our observational study that is pretending to be a randomised controlled trial, in order to find outliers in data we've already cleaned so much with linear regression smoothers that it is sterilised beyond comprehension.
Oh, and before you ask, I've been in several situations with folks from Health AI startups and heard these things enough times now that I have my own innate cerebral model, call it a bullshit-meter, that with a high degree of certainty predicts when these statements are coming.
No matter what your intelligent health technology is intended to do, be it image recognition of possible tumours in breast scans or prediction of a patient's propensity for a heart attack in two years, starting from the data without any model, or even an adequate one, of the knowledge around the disease, clinical decision or outcome is always going to be the wrong approach. No ifs. No buts. There are so many examples like Watson that have done so and failed… and so few that succeeded that I was unable to find one to mention here in order to balance this sentence fragment.
The first task we should always undertake is expert knowledge elicitation, in order to understand the clinical decision or prediction we seek to make. It seemed at least that the Pacmed AI people understood this important step - or at least their academic literature suggests they did. What aspects does the expert doctor consider when making a diagnostic or treatment decision in their domain? Note that I said expert. I have seen Health AI projects that think it sufficient, for example, to ask a GP to explain how a cardiologist makes critical surgical decisions or a trauma surgeon detects and treats impending coagulopathy. I repeatedly see examples of Health AI startups using a doctor who is a specialist in one medical domain, say ophthalmology, to help construct models for something he or she has never done, such as administration of spinal anaesthesia [6]. If you are constructing a cardiology model you should engage the services of an experienced cardiology specialist. If you are constructing a model to predict pregnancy outcomes, as I did recently, you engage experienced midwives and obstetric consultants.
I am truly sorry, but having a degree in data science or applied math and a hard drive full of innocent and unsuspecting patients' health records is not equivalent to six years of medical training and another six or more years of specialisation. And having six years of specialisation in ophthalmology is not a substitute for six years of specialisation in anaesthetics and pain management. Similarly, and for our clinical friends who wish to delve into machine learning and AI, your years of medical training and specialisation will never be a substitute for the knowledge and experience of the decision scientist you should always engage to create your clinical models.
Second, and only after you have understood the clinical decision you seek to model, you should select the right tool for the right job. There are a variety of reasons that linear regression and similar math models (think SKLearn and Generalised Linear Modelling, my techie friends) and machine learning (think TensorFlow and other neural network and similar models) struggle to get into the 80s for accuracy. The lower accuracy of math models might be okay for creating the nomogram a doctor might carry around as a laminated card as he goes from patient to patient, because the types of decisions these support are often fairly safe and not life-critical. However, patients have been taught to, and rightly should, expect better from so-called intelligent systems solutions.
For any risk, probability or predictive decision where accuracy (for example, due to a need for safety) is important, these lower-accuracy approaches have proven time and time again to be flawed and unreliable and, as I suggested above, to be drilling a dry well. Similarly, I struggle to see the wisdom of some other Health AI startups who claim they will use Large Language Models (LLMs) to undertake these complex clinical decision-support tasks. Do we really want Microsoft/OpenAI's ChatGPT and Facebook/Meta's Llama, which, as Hannah Fry once quipped about the incredibly naive Babylon AI's 'GP at Hand', don't know their arse from their elbow, to be making decisions about whether the red spots on your skin are life-threatening meningitis or simply a benign contact rash?
I would certainly use neural networks for image and pattern recognition and classification problems, and I would use LLMs for language contextualisation, to identify when a doctor is saying the patient does or does not have a particular sign, symptom or disease. But as it stands right now, the most accurate and credible solutions for building models for complex clinical decision support appear to be causal Bayesian probabilistic networks. These are the only solutions at present that appear to truly replicate the full gamut of what we see clinically, that can more accurately reason with missing data and all those unknown unknowns, and that easily, and without bias or overfitting, can reach into and beyond the 80s in accuracy evaluation.
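As a flavour of what I mean, here is a minimal sketch using the open-source pgmpy library and a deliberately toy three-node network. The structure and probabilities are illustrative inventions of mine, not drawn from any clinical source, and class names vary slightly between pgmpy versions:

```python
# Minimal causal Bayesian network sketch (toy structure and numbers, not clinical).
# The point: the network encodes expert knowledge about cause and effect first,
# and inference copes gracefully with missing evidence rather than demanding a
# complete, cleaned data row for every patient.
from pgmpy.models import BayesianNetwork          # newer releases name this DiscreteBayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

model = BayesianNetwork([("Infection", "Fever"), ("Infection", "Tachycardia")])

cpd_infection = TabularCPD("Infection", 2, [[0.93], [0.07]])        # illustrative 7% prior
cpd_fever = TabularCPD("Fever", 2, [[0.90, 0.30], [0.10, 0.70]],
                       evidence=["Infection"], evidence_card=[2])
cpd_tachy = TabularCPD("Tachycardia", 2, [[0.85, 0.40], [0.15, 0.60]],
                       evidence=["Infection"], evidence_card=[2])
model.add_cpds(cpd_infection, cpd_fever, cpd_tachy)
model.check_model()

infer = VariableElimination(model)
# Evidence can be partial: heart rate is unknown here and the network still reasons.
print(infer.query(["Infection"], evidence={"Fever": 1}))
```

Scale the same idea up with structure elicited from the right specialists and probabilities learned or elicited per node, and you have something a clinician can actually interrogate, rather than a black box that demands a perfect data warehouse before it will say anything at all.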
As an example, over the last month I have been involved in demonstrating to yet another Health AI startup, who have spent 18 months or more cleaning and preparing data tables for ingestion by yet-more-of-the-same math and/or machine learning models without AI success, that causal Bayesian probabilistic networks can be used to rapidly prototype a clinical decision or a complete guideline, and to identify where wins can be made in spotting what may be resource misuse or waste in clinical processes. And yet again I have been disappointed to see that they intend to throw that opportunity away. They are doing this possibly due to sunk cost in their existing complicated math-y approaches, but more than likely due to yet another example of the misplaced belief that data is the key, not how you use it, and that if they develop every step of some complex math or machine learning model themselves they can own the IP and rule the (medical) world.
And for what it's worth, my cerebral model yet again correctly predicted that I would hear, and I did hear, the familiar claims: (i) that they simply need more time to create better, cleaner data; (ii) that they need more data (the now all-too-common pipe dream that by incorporating what has repeatedly proven to be useless gene data they can become so accurate and infallible that all human disease will bow to their awesome might); and (iii) that simply overfitting the data for each hospital or health provider will be the key to succeeding where hundreds of others have invariably failed.
However, completely unexpected was the extended claim that, in spite of millions of examples and all common sense to the contrary, they would somehow be the first not to sacrifice any of the three elements of Gavin's maxim…
That they would not have to sacrifice speed, capacity (accuracy) or price.
Yeah… I’ll believe that as I fly my Moller Sky Car past a flock of winged pigs on my way to that Martian holiday resort.
1. Think Wizard of Oz and you'll understand what I mean.
2. While I realise every mainstream media reporter and DEI/EDI wonk will tell you "bias" in AI is about the developers, AI or algorithm being 'racist', this is attention-grabbing headline nonsense. Bias in this context more correctly means that: (i) the data the model is trained on is incomplete, inaccurate, non-representative or skewed in some way; or (ii) the model has some algorithmic bias that over- or under-values some element present or absent in the data. It does not mean the developers or the AI model they created are intentionally for or against black or white or rich or poor people.
3. *Sarcasm Alert*
4. Studies have shown that the chance you will contract a nosocomial (hospital-acquired) infection increases beyond 50% after 72 hours on an acute ward, and the risk of experiencing an iatrogenic harm (a medical mistake) increases beyond 20% after five days on the same acute ward.
5. Don't get me started on the absolute mathematical nonsense that is called full matching with propensity scores. Talk about trying to reduce the complexity of the human organism down to a complicated system that, disingenuously, claims to be addressing all the important confounders when in almost all cases it has no clue what confounders it is trying to address!
6. Sadly folks, this is a real example. I actually saw this!
When I was in undergraduate computer science a decade ago, all the rage was functional programming languages like Haskell. In response to our wearied reactions to having to program in the clunky syntax and unintuitive constructs of Haskell, we were always told, "but the value of pure functional languages is they enable you to *reason mathematically about the behaviour of code*." Lurking in the background all the while was the spectre of Big Data. You are right; the AI vogue is all driven by the illusion that if we just throw enough data at an unpredictable system it will yield useful results. So much for a future of bug-free software with algorithmic behaviour as transparent as a mathematical proof. It's rather a case of an open proclamation that "who cares how it works, as long as it seems like it does." Why this approach is manifesting so soon in precisely the sectors, like health and criminal justice, where robust analysis is most critical, is rather troubling.
Yes, I agree, and not just with health. I have always thought the whole AI/data thing is overhyped - they miss soul, intuition/music of the mind and all those things that make life worth bothering with.