My motivation was to convert the audio from a previous presentation into text so I could write a script for the next one. In the end it was not as useful as I had hoped: the transcription is not very accurate, and the output text contains a lot of extra spacing, which makes it very difficult to edit. Using ChatGPT to reformat and clean up the text could be an option, but I have not tried it.

Installation

sudo apt-get install ffmpeg
conda create -n vosk python
conda activate vosk
pip install vosk
cd github
git clone https://github.com/alphacep/vosk-api.git
cd vosk-api/python/example
wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.3.zip
unzip vosk-model-small-en-us-0.3.zip
mv vosk-model-small-en-us-0.3 model

Usage

cd github/vosk-api/python/example
# copy the video file (here video.mp4) to this folder
ffmpeg -i video.mp4 -ar 16000 -ac 1 audio.wav
python3 test_text.py audio.wav > text.txt
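
Since the extra spacing in text.txt is what makes it hard to edit, a small cleanup pass may help before opening the file. The following is only a minimal sketch: it assumes the problem is runs of spaces, tabs and blank lines (I have not checked the exact shape of the output), and the script name clean_text.py and the function clean_transcript are hypothetical names chosen for this example. It does nothing about transcription accuracy.

```python
import re
import sys


def clean_transcript(raw: str) -> str:
    """Collapse runs of whitespace and drop empty lines.

    Assumption: the unwanted spacing consists of repeated spaces/tabs
    inside lines and blank lines between them.
    """
    lines = []
    for line in raw.splitlines():
        # Replace any run of whitespace with a single space, then trim the ends.
        line = re.sub(r"\s+", " ", line).strip()
        if line:  # skip lines that are empty after trimming
            lines.append(line)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical usage: python3 clean_text.py text.txt > text_clean.txt
    with open(sys.argv[1], encoding="utf-8") as f:
        print(clean_transcript(f.read()))
```

Saved as clean_text.py in the same folder, it could be run as `python3 clean_text.py text.txt > text_clean.txt`, leaving the original transcript untouched.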