News: Concerns about medical note-taking tool raised after researcher discovers it invents things no one said — Nabla is powered by OpenAI's Whisper

The model used in the medical note-taking app is a fine-tuned version of Whisper, which is not the same thing as Whisper itself. They can't prove that the model in the note-taking app even hallucinates in a medical context (per this article). So to say

"medical note-taking tool raised after researcher discovers IT invents things no one said"​

is wrong. It is another clickbait, misleading headline by the author, Jowi.

One can also argue that all the notes are reviewed by the doctors who recorded them, and the fact that they keep using it suggests it is accurate, or else they would have ditched it.
 
Part of the reason they can't prove the fine-tuned version in the medical app is having issues is that the original recordings are not retained, for privacy reasons, so comparison is much more difficult than with an app that collects samples and telemetry for study. They'd have to create their own hundreds of hours of authentic-sounding but fake medical conversations.

There are no recorded complaints by patients against providers over inaccurate AI transcriptions, but it's entirely possible that errors have been minor and flown under the radar, or that the errors haven't been linked to AI transcription. It's possible that doctors are making a lot of corrections and haven't kicked up a stink because it's still a net time savings, or that they're accepting transcriptions that are "close enough" and "get the gist of it". Patients might not be shown the outputs in a timely manner, if they are shown them at all, and may not look closely when they do.

If the base model is hallucinating in 80% of transcriptions, though, including in a medical example ("hyperactivated antibiotics"), that feels like cause for concern and investigation in any downstream model that uses it as a foundation. The downstream model might not be proven to hallucinate in the same way and at the same rate... but it hasn't been cleared yet either.
 
You also need to consider that there is a high possibility that not all doctors are checking the transcriptions. A lot of them probably just give it a brief skim, too.
 
Thanks for replying, Jlake. I agree with all your points. It is highly possible that the fine-tuned model also hallucinates and may need investigation. The problem I want to point out is that saying "it hallucinated" in the article title is logically incorrect unless proven otherwise. This is not the first time, either; the author has a habit of pushing clickbait titles that defy facts and logic in many of his previous articles, a practice I'm allergic to.
 
I have been working with EMRs with autocomplete features for 7 years now, and the natural tendency is to skim through and assume that the software has made the correct inputs.
It is quite common for the non-AI autocomplete to add things, modify what the doctors typed, or even delete things.
Currently we deal with this by having the staff routinely double-check and confirm things with the doctor.
 
This kind of stifling and limiting of the AI's creativity is why Skynet will rise against humanity xD.
 
That is just what you have seen; it is entirely possible that other places and people aren't as thorough or don't follow procedures.
 
I use Whisper on my Android phone and my desktop and have encountered the same hallucinations: text that appears when there was silence, or text that is nowhere close to what I was saying. I don't know what the error rate would be; it wouldn't be a 20% rate, but it's more than 1%. As with all things voice recognition, you still need to review what initially comes out, even with a fine-tuned model.
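For anyone scripting that review step, the open-source whisper package returns per-segment confidence hints (`no_speech_prob` and `avg_logprob`) that can pre-flag likely hallucinations before a human reads the transcript. A minimal sketch; the threshold values and the file name are illustrative guesses, not tuned or recommended settings:

```python
# Sketch: flag Whisper output segments that deserve a human look.
# whisper's transcribe() returns a dict whose "segments" list carries
# "no_speech_prob" and "avg_logprob" per segment; the thresholds here
# are illustrative, not recommended values.

def flag_suspect_segments(segments, no_speech_thresh=0.5, logprob_thresh=-1.0):
    """Return segments likely to be hallucinated or low-confidence."""
    suspects = []
    for seg in segments:
        # Non-empty text over probable silence: the model may have
        # invented words where no one was speaking.
        if seg["no_speech_prob"] > no_speech_thresh and seg["text"].strip():
            suspects.append(seg)
        # Very low average log-probability: low-confidence decoding.
        elif seg["avg_logprob"] < logprob_thresh:
            suspects.append(seg)
    return suspects

# Usage with the whisper package (requires a model download and an
# audio file; "visit_recording.mp3" is a made-up name):
# import whisper
# model = whisper.load_model("base")
# result = model.transcribe("visit_recording.mp3")
# for seg in flag_suspect_segments(result["segments"]):
#     print(f"REVIEW [{seg['start']:.1f}s]: {seg['text']}")
```

This only narrows where to look; it doesn't replace reading the whole transcript, since a confidently decoded segment can still be wrong.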