How to transcribe an audio stream from a Raspberry Pi microphone in real time

Aug 11, 2024


I tested OpenAI Whisper audio transcription models on a Raspberry Pi 5. The main goal was to understand whether a Raspberry Pi can transcribe audio from a microphone in real time.

Whisper runs either on a CPU or on Nvidia GPUs, so on a Raspberry Pi the CPU is the only option.


I tested on a Raspberry Pi with only 4GB of memory, so the `medium` and `large` models were out of scope.

Also, you can watch the transcription process in action:


Whisper setup

The setup is trivial: just a few commands and it's ready to use:

Code:
#!/bin/bash

sudo apt update
sudo apt-get install -y ffmpeg sqlite3 portaudio19-dev python3-pyaudio

# Python dependencies for https://github.com/openai/whisper
pip install numpy==1.26.4 --break-system-packages
pip install -U openai-whisper --break-system-packages
pip install pyaudio --break-system-packages
pip install pydub --break-system-packages
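
To verify the installation, a quick sanity check (my own sketch, not from the repo) can load the smallest model and confirm it runs on the CPU:

Python:
import whisper

# Downloads the tiny.en checkpoint to ~/.cache/whisper on first use.
model = whisper.load_model("tiny.en")
print(model.device)  # expect "cpu" on a Raspberry Pi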

Audio Transcription Process


  1. Audio is recorded with a USB microphone.
  2. The audio stream is written to a WAV file.
  3. Every 10 seconds, I start a new file and add the current WAV to a transcription Queue.
  4. The AI process constantly grabs an item from the Queue and transcribes it.
  5. The AI process writes the text to a file/database.
So, if the AI transcription process takes longer than a chunk (10 seconds), a Raspberry Pi will never finish the Queue. I tested different chunk lengths: 5, 10, 15, and 30 seconds. There weren't many differences, so we can assume that 10 seconds is fine.
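
Here is a minimal sketch of that chunk-and-queue loop, assuming PyAudio for capture and the AiWhisper class shown in the next section; the file names, sample rate, and buffer sizes are illustrative, not the repo's exact code:

Python:
import queue
import threading
import wave

import pyaudio

CHUNK_SECONDS = 10       # length of each WAV chunk
RATE = 16000             # Whisper works with 16 kHz mono audio
FRAMES_PER_BUFFER = 1024

wav_queue: "queue.Queue[str]" = queue.Queue()

def record_chunks() -> None:
    """Read from the default microphone and emit a WAV file every CHUNK_SECONDS."""
    audio = pyaudio.PyAudio()
    stream = audio.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                        input=True, frames_per_buffer=FRAMES_PER_BUFFER)
    chunk_index = 0
    while True:
        frames = []
        for _ in range(int(RATE / FRAMES_PER_BUFFER * CHUNK_SECONDS)):
            frames.append(stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False))
        path = f"chunk_{chunk_index}.wav"  # illustrative file name
        with wave.open(path, "wb") as wav:
            wav.setnchannels(1)
            wav.setsampwidth(audio.get_sample_size(pyaudio.paInt16))
            wav.setframerate(RATE)
            wav.writeframes(b"".join(frames))
        wav_queue.put(path)  # hand the finished chunk to the transcriber
        chunk_index += 1

def transcribe_worker() -> None:
    """Pull WAV paths off the Queue and transcribe them one by one."""
    ai = AiWhisper()  # the class shown in the next section
    while True:
        path = wav_queue.get()
        print(ai.transcode(path))
        wav_queue.task_done()

threading.Thread(target=transcribe_worker, daemon=True).start()
record_chunks()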

Whisper usage

The code is straightforward: I pass a WAV file path to the library and receive the transcribed text as a result. I added time tracking to better understand the library's performance.
Python:
import whisper
from time_util import TimeUtil

class AiWhisper:

    _models = ["tiny.en", "base.en", "small.en", "medium.en"]

    _model = None

    def __init__(self, model_index: int = 0):
        TimeUtil.start("AiWhisper init")
        if model_index >= len(self._models):
            raise KeyError(f"Max model index is {len(self._models) - 1}")
        print(f"AiWhisper init. Using {self._models[model_index]}")
        self._model = whisper.load_model(self._models[model_index])
        TimeUtil.end("AiWhisper init")

    def transcode(self, file_path: str):
        TimeUtil.start("AiWhisper transcode")
        # fp16=False: the Raspberry Pi CPU has no fp16 support
        result = self._model.transcribe(file_path, fp16=False, language='English')
        TimeUtil.end("AiWhisper transcode")
        return result["text"]
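
For example, instantiating the class with the default index loads tiny.en and transcribes a file (the file name here is illustrative):

Python:
ai = AiWhisper(model_index=0)  # 0 -> "tiny.en"
print(ai.transcode("chunk_0.wav"))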

Small.EN Model

This model plus the OS consumed 2GB of memory, leaving another 2GB free. Processing took about 3x the audio duration: for 10-second chunks, each transcription took ~30 seconds. In a few minutes, I had ten items waiting in the Queue. Running a live transcription with these timings is impossible.

Base.EN Model

This model plus the OS consumed 850MB of memory, leaving about 3.1GB free. Transcribing a 10-second chunk took around 10 seconds, sometimes less, sometimes more, so overall it was slightly slower than real time. Some time could likely have been won by reading and writing the WAV chunks in memory rather than on the SD card. However, I didn't try to tune the performance, keeping the experiment clean.
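
As an untested sketch of that idea: on Linux (including Raspberry Pi OS), /dev/shm is a RAM-backed tmpfs, so pointing the chunk writer there keeps WAV files off the SD card. The directory name is hypothetical:

Python:
import os

# /dev/shm is a RAM-backed tmpfs on Linux; files here never touch the SD card.
CHUNK_DIR = "/dev/shm/whisper_chunks"  # hypothetical location, not from the repo

def chunk_path(index: int) -> str:
    """Return a RAM-backed path for the next WAV chunk."""
    os.makedirs(CHUNK_DIR, exist_ok=True)
    return os.path.join(CHUNK_DIR, f"chunk_{index}.wav")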

Tiny.EN Model

Unsurprisingly, the smallest model is the fastest. The OS plus Whisper consumed ~700MB of memory, leaving 3.3GB free.

For the same video, the result contained significantly more transcribed text, and the performance was pretty decent: the transcription process took about half the recording time. For a 10-second WAV file, transcription took ~5 seconds, leaving the Queue empty.

The quality of the output text was also good.


In conclusion, real-time transcription is possible on a Raspberry Pi 5 using OpenAI Whisper.


The source code:

https://github.com/Nerdy-Things/openai-whisper-raspberry-pi/tree/master/python