An AI-powered transcription tool widely used in the medical field, has been found to hallucinate text, posing potential risks to patient safety, according to a recent academic study.
And that tool is being used in a commercial medical transcription product that, worryingly, deletes the underlying audio from which transcriptions are generated, leaving medical staff no way to verify their accuracy, AP News reported on Saturday.
OpenAI’s Whisper, the underlying AI tool, is integrated into medical transcription services from Nabla, which the company says are used by over 30,000 clinicians at more than 70 organizations. Nabla told AP its product had been used to transcribe around 7 million medical visits.
Whisper is also embedded in Microsoft’s and Oracle’s cloud computing platforms and integrated with certain versions of ChatGPT. Despite its wide adoption, researchers are now raising serious concerns about its accuracy.
In a study conducted by researchers from Cornell University, the University of Washington, and others, researchers discovered that Whisper “hallucinated” in about 1.4% of its transcriptions, sometimes inventing entire sentences, nonsensical phrases, or even dangerous content, including violent and racially charged remarks.
The study, Careless Whisper: Speech-to-Text Hallucination Harms, found that Whisper often inserted phrases during moments of silence in medical conversations, particularly when transcribing patients with aphasia, a condition that affects language and speech patterns.
In these cases, the AI sometimes fabricated unrelated phrases, such as “Thank you for watching!” — likely due to its training on a large dataset of YouTube videos. In more concerning instances, it invented fictional medications like “hyperactivated antibiotics” and even injected racial commentary into transcripts, AP reported.
For example, Whisper correctly transcribed a speaker’s reference to “two other girls and one lady” but added “which were Black,” despite no such racial context in the original conversation.
Whisper is not the only AI model that generates such errors. In a separate study, researchers found that AI models used to help programmers were also prone to hallucinations.
Harmful hallucinations
Whisper’s errors are a result of the AI model creating patterns based on its training data that do not exist in the samples, leading to nonsensical or fabricated outputs. This phenomenon, known as hallucination, has been documented across various AI models. According to the researchers, 40% of Whisper’s hallucinations could have harmful consequences, as the AI misinterpreted or misrepresented the speaker’s intent in several cases.
Although Whisper’s creators have claimed that the tool possesses “human-level robustness and accuracy,” multiple studies have shown otherwise.
In one study of public meetings cited by AP, a researcher from the University of Michigan found hallucinations in eight of every 10 audio transcriptions. Another machine learning engineer reported hallucinations in about half of over 100 hours of transcriptions inspected. A third study identified hallucinations in nearly every one of 26,000 transcripts generated using Whisper, AP said.
Microsoft, which offers Whisper as part of its cloud computing services, advises companies incorporating it in the solutions they offer to “obtain appropriate legal advice to review your solution, particularly if you will use it in sensitive or high-risk applications.”
Despite this, many healthcare providers are already adopting it for transcribing patient consultations.
Nabla, the company integrating Whisper into its medical transcription tools, has acknowledged the hallucination issue and is reportedly working to address it, the AP report said.
With over 4.2 million downloads on the open-source AI platform HuggingFace in the past month, Whisper has become one of the most popular speech recognition models. However, as its usage spreads, researchers are warning against its adoption in critical sectors like healthcare due to the serious implications of its errors.
While other AI transcription tools also make mistakes, the frequency and potential harm caused by Whisper’s hallucinations are raising red flags. Similar AI models, such as Google’s AI Overviews, have faced criticism for producing similarly outlandish outputs, such as recommending non-toxic glue to keep cheese from falling off pizza.
As the healthcare industry increasingly integrates AI solutions, the risks posed by such hallucinations demand immediate attention to avoid harmful consequences for patients.