Part of the reason they can't prove the fine-tuned version in the medical app is having issues is that the original recordings aren't retained, for privacy reasons, so comparison is much harder than it would be for an app that collects samples and telemetry for study. They'd have to create their own hundreds of hours of authentic-sounding but fake medical conversations, and the comparison itself is sketched below.
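If someone did build that synthetic corpus, flagging fabricated content is fairly mechanical: you already know the scripted reference text, so anything the model inserts with no counterpart in the script is a hallucination candidate. A minimal sketch with standard-library difflib (the function name and example strings are mine, not from any real evaluation pipeline):

```python
from difflib import SequenceMatcher

def inserted_spans(reference: str, hypothesis: str) -> list[str]:
    """Return word spans present in the hypothesis but absent from the
    reference, i.e. candidate hallucinations rather than mere mis-hearings."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    matcher = SequenceMatcher(a=ref_words, b=hyp_words)
    spans = []
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if op == "insert":  # words with no matching material in the reference
            spans.append(" ".join(hyp_words[j1:j2]))
    return spans

# Hypothetical example: scripted line vs. a transcription that invents a word.
reference = "the patient was prescribed antibiotics for ten days"
hypothesis = "the patient was prescribed hyperactivated antibiotics for ten days"
print(inserted_spans(reference, hypothesis))  # ['hyperactivated']
```

A real study would want something more robust (alignment at the phrase level, normalization of numbers and abbreviations), but the point is that without retained recordings or a synthetic stand-in, there's no reference to diff against at all.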
There are no recorded complaints by patients against providers over inaccurate AI transcriptions, but it's entirely possible that the errors have been minor and flown under the radar, or that they haven't been linked to AI transcription. It's possible that doctors are making a lot of corrections and haven't kicked up a stink because it's still a net time savings, or that they're accepting transcriptions that are "close enough" and "get the gist of it". Patients might not be shown the outputs in a timely manner, if they're shown them at all, and may not be looking closely when they are.
If the base model is hallucinating in 80% of transcriptions though, including in a medical example (“hyperactivated antibiotics”), that feels like cause for concern about, and investigation of, any downstream models that use it as a foundation. The downstream model may not have been proven to hallucinate in the same way or at the same rate... but it hasn't been cleared yet either.