In the last couple of years, we have seen a rapid and escalating increase in the use of artificial intelligence to offer advice on diagnosis and treatment to doctors and other health professionals. While the US Food and Drug Administration received fewer than 30 applications to approve medical AI between 1997 and 2015, in 2020 alone it received more than 100 applications. Of the 350 medical AI devices approved by the FDA, GE and Siemens lead the way, with 22 and 18 approvals to date respectively.

The good news about where medical AI is heading

More than 70% of FDA-approved medical AI devices are in radiology, but increasingly AI is being applied to other forms of medicine which rely on medical imaging, such as cardiac events, fractures and neurological conditions.

For example, applying AI to imaging data could help identify the thickening of certain muscle structures or monitor changes in blood flow through the heart and associated arteries.

But the bigger development is that medical AI is moving beyond just providing a ‘keener eye’ for health professionals. Medical AI can combine data from different patient sources and settings to predict the potential risk for future illnesses.

For example, in everyday clinical practice, predicting a heart attack is challenging: predictions are typically based on cardiovascular risk factors and scores which don’t always show the full picture. A recent study by the Cedars-Sinai Medical Center found a substantial improvement in predicting heart attacks when AI jointly analysed clinical data, a PET scan, which assesses disease activity in the coronary arteries, and a CT angiography, which provides a quantitative plaque analysis.

So what’s the worry?

Approval processes for medical treatments and equipment are rightly strict – but they were developed for a world of drugs, vaccines and mechanical equipment (sometimes with some software embedded). So, how are these processes coping with the approval of the very different ‘animal’ of an AI program? Not very well, according to a recent study by Stanford University.

The study examined all of the medical AI devices approved by the FDA between January 2015 and December 2020, 130 medical AI devices in total. The study looked at how each algorithm was evaluated: the number of patients enrolled in the evaluation study; the number of sites used in the evaluation; whether the test data were collected and evaluated concurrently with device deployment (prospective) or the test set was collected before device deployment (retrospective); and whether stratified performance by disease subtypes or across demographic subgroups was reported.

The study found some major deficiencies in the FDA approval process.

First, almost all of the medical AI only underwent retrospective tests and did not involve a side-by-side comparison of clinicians’ performances with and without AI. The researchers considered that prospective testing was essential given the nature of AI because:

“human–computer interaction can deviate substantially from a model’s intended use. For example, most computer-aided detection diagnostic devices are intended to be decision-support tools rather than primary diagnostic tools. A prospective randomized study may reveal that clinicians are misusing this tool for primary diagnosis and that outcomes are different from what would be expected if the tool were used for decision support.”

Second, over 70% of applications failed to state the number of different sites at which the medical AI was tested (although the FDA probably had this data), and of the few applications that did, most were tested at one or a few sites.

Multisite testing is important in understanding how an AI model’s performance can be generalized to a broad and diverse population. To demonstrate this, the researchers used three top-performing AIs for the detection of pneumothorax (collapsed lung) across patient data from three different US hospitals. They found a high degree of variability in performance, both for each individual AI across the three hospitals and between the AIs themselves. Disturbingly, they found the performance disparity between Black and white patients widened at some hospitals.
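The point about stratified, multisite evaluation can be made concrete with a toy calculation. The data, hospital names and numbers below are entirely hypothetical (not from the Stanford study); the sketch simply shows how a single aggregate performance figure can hide sharp site-level and subgroup-level disparities:

```python
# Hypothetical illustration: aggregate vs stratified model performance.
# Each record is (site, subgroup, true_label, model_prediction) for a
# binary detection task. All values here are made up for illustration.

from collections import defaultdict

records = [
    ("hospital_A", "white", 1, 1), ("hospital_A", "white", 1, 1),
    ("hospital_A", "black", 1, 1), ("hospital_A", "black", 1, 0),
    ("hospital_B", "white", 1, 1), ("hospital_B", "white", 1, 0),
    ("hospital_B", "black", 1, 0), ("hospital_B", "black", 1, 0),
]

def sensitivity(rows):
    """Fraction of truly positive cases the model actually flags."""
    positives = [r for r in rows if r[2] == 1]
    hits = sum(1 for r in positives if r[3] == 1)
    return hits / len(positives)

# The aggregate figure is a single, reassuring-looking number...
overall = sensitivity(records)  # 0.5 across all records

# ...but grouping by (site, subgroup) reveals the spread beneath it.
by_stratum = defaultdict(list)
for r in records:
    by_stratum[(r[0], r[1])].append(r)

for (site, group), rows in sorted(by_stratum.items()):
    print(site, group, sensitivity(rows))
# hospital_A white scores 1.0 while hospital_B black scores 0.0 -
# the kind of gap a single-site, unstratified evaluation never surfaces.
```

This is why the study treats the absence of multisite and demographic-subgroup reporting as a deficiency: the aggregate number alone cannot rule out exactly this pattern.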

Third, the published reports for 59 devices (45%) did not include the sample size of the studies (again the FDA would know this but it was not made public). Of the 71 device studies that had this information, the median evaluation sample size was 300. Only 17 device studies reported that demographic subgroup performance was considered in their evaluations.

The following diagram summarises the analysis – it’s a little visually dense but worth working through the legend:

FDA says it will do better

While not directly related to this study, the FDA over the last year has been engaged with industry and other stakeholders in discussions about how to improve its approval processes for medical AI. In January 2021, the FDA published an Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan, which kicked off a consultation process culminating in a workshop in October 2021.

Out of that workshop, the FDA acknowledged “the need for careful oversight to ensure the benefits of these advanced technologies outweigh the risks to patients, while collaborating with stakeholders and building partnerships to accelerate digital health advances.” While this could be dismissed as “bureaucratese”, the FDA accepted that it needed to take concrete steps to, in effect, crack open the approval process for medical AI to a wider group of stakeholders, including by:

  • collecting post-launch, in-market evidence of device performance, including building a process to make real world evidence (RWE) data collection and analysis core components of the approval ecosystem, supporting development and performance monitoring;
  • including diverse patient groups in clinical trials to produce “unbiased” data sets, especially making trial participation accessible for low-income patient populations and minorities;
  • recognising the importance of patient trust in medical AI – both by making it the medical AI developer’s responsibility to build patient trust into the development process and, once the medical AI is deployed, better supporting the role the health care provider plays in building and conveying trust.

What about here in Australia?

Australia’s TGA is starting to move on medical AI, although with a somewhat less thoroughgoing break from the traditional approval model than the FDA. In May 2022, the TGA announced some new initiatives on medical AI. It has produced draft guidelines for consultation and is teaming up with ANDHealth, a leading provider of accelerator, incubator and commercialisation programs for digital health technology companies, to run industry seminars.


Read more: How Medical AI Devices are Evaluated: Limitations and Recommendations from an Analysis of FDA Approvals