I see your AUC and I'm not impressed ... yet

Eric Topol’s paper on AI/ML in healthcare made a big splash when it came out earlier this year. It’s a broad review of published models that have the potential to impact three different stakeholder groups: clinicians, patients, and the healthcare delivery system.

One of the key takeaways of the paper is that although researchers report pretty good AUC scores for models that predict health outcomes (Table 3 of the paper), this doesn’t really mean the models would provide any clinical benefit if deployed. In other words, who cares that you can predict well if you’re not improving patient well-being? This point also came up in an interview with Marzyeh Ghassemi, a researcher at the University of Toronto.
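To make the distinction concrete, here is a minimal sketch of the kind of retrospective evaluation behind those reported numbers: fit a model on historical data, score a held-out slice, report the AUC. The data and model below are made-up placeholders, not anything from the review.

```python
# A retrospective AUC evaluation on (synthetic) historical data.
# This is the number papers report; it says nothing about patient benefit.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))                          # stand-in patient features
y = (X[:, 0] + rng.normal(size=5000) > 1).astype(int)    # stand-in outcomes

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"retrospective AUC: {auc:.2f}")
```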

Topol stresses that prospective trials are necessary to show actual clinical benefit. This means it’s not enough to get a high AUC on historical data; someone has to deploy the model and then track patient outcomes to see whether they improve significantly after deployment. To date there have been very few of these prospective trials – Topol calls this the “AI chasm”.
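What would the prospective step look like once the model is deployed? One simple version is a two-arm trial: compare a patient-relevant outcome between a model-guided arm and a usual-care arm. A sketch, with invented counts purely for illustration:

```python
# Compare adverse-event rates between a model-guided arm and a usual-care arm.
# The counts are hypothetical; the point is that this comparison, not a
# retrospective AUC, is what speaks to clinical benefit.
from statsmodels.stats.proportion import proportions_ztest

adverse_events = [130, 160]     # model-guided arm, usual-care arm
patients       = [2000, 2000]   # patients enrolled per arm

stat, p_value = proportions_ztest(count=adverse_events, nobs=patients)
print(f"z = {stat:.2f}, p = {p_value:.3f}")
```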

What if we look beyond healthcare – has anyone been doing this kind of prospective trial to evaluate their ML/AI models? Yes, it’s standard practice at Booking.com, as this paper shows.

The authors say:

In Booking.com we are very much concerned with the value a model brings to our customers and our business. Such value is estimated through Randomized Controlled Trials (RCTs) and specific business metrics like conversion, customer service tickets, or cancellations.

They go on to say that in their work there are diminishing returns to model performance (better AUC =/= more $$$), and hypothesize about what might explain this. Except for the uncanny valley effect, I think all of the explanations could apply to ML/AI in healthcare as well.
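If you wanted to check this relationship in your own setting, one way is to pair each model iteration’s offline gain (say, the change in AUC against the current production model) with the lift in a business metric measured in its RCT, and look at the correlation across experiments. The numbers below are invented; only the shape of the analysis matters.

```python
# Relate offline metric gains to RCT-measured business lifts across experiments.
# A weak or absent correlation is exactly the "better AUC =/= more $$$" situation.
import numpy as np
from scipy.stats import pearsonr

delta_auc        = np.array([0.01, 0.02, 0.05, 0.08, 0.12])  # offline gains (hypothetical)
delta_conversion = np.array([0.4, 0.1, 0.6, 0.2, 0.3])       # % conversion lift in RCT (hypothetical)

r, p = pearsonr(delta_auc, delta_conversion)
print(f"correlation between offline gain and RCT lift: r = {r:.2f} (p = {p:.2f})")
```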