Fine-grained Visual Transcription for YouTube videos
vlm-docs.nos.run

TLDR: There are dozens of audio transcription APIs, but nothing for video and visual transcription. So we built one.
If you want visual chaptering, summarization, OCR/text extraction, audio transcription, and sentiment analysis on your videos, there’s really nothing out there. We tried stitching this together with several audio/video understanding APIs but kept running into rate limits, hallucinations, high costs, and poor accuracy.
Analyzing Audio Podcasts: https://vlm-docs.nos.run/guides/guide-audio-podcasts
Understanding Video Podcasts: https://vlm-docs.nos.run/guides/guide-video-podcasts
I'm not sure why you say that current video transcriptions are bad. I use Whisper on NLP Cloud for video transcription (https://docs.nlpcloud.com/#automatic-speech-recognition) and it works very well.
As far as I understand, video transcription is a no-brainer as long as you install ffmpeg.
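For reference, that audio-only path is roughly the following. This is a minimal sketch that runs the open-source whisper package locally rather than NLP Cloud's hosted endpoint; the file names and model size are placeholders.

```python
# Sketch of the audio-only workflow: strip the audio track with ffmpeg,
# then transcribe it with Whisper. File names below are placeholders.
import subprocess
import whisper

# Extract a 16 kHz mono WAV from the video. Whisper resamples internally,
# but this keeps the intermediate file small.
subprocess.run(
    ["ffmpeg", "-y", "-i", "input_video.mp4",
     "-vn", "-ac", "1", "-ar", "16000", "audio.wav"],
    check=True,
)

model = whisper.load_model("base")
result = model.transcribe("audio.wav")
print(result["text"])  # spoken-word transcript only; says nothing about the visuals
```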
Hi Arthur! There's a bit of confusion here. It looks like you're referring to _audio_ transcription; that is, passing the audio track into an ASR pipeline (like Whisper, Otter, etc.) to generate a transcript of any spoken words. Our pipeline is meant for fine-grained 'transcriptions' of the _visual_ content of the video: for instance, any text on screen, the contents of plots and graphs, the clothing worn by any participants, etc. (though we do transcribe the audio as well; it's a multimodal pipeline!).
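To make the distinction concrete, here is a deliberately simplified sketch of the visual side: sample frames and pull any on-screen text with OCR. It assumes opencv-python and pytesseract are installed, and the file name and sampling interval are placeholders. Our actual pipeline runs a VLM over the frames, so it also describes plots, clothing, scenes, etc., not just on-screen text.

```python
# Simplified illustration of "visual transcription": sample one frame every
# few seconds and OCR any text visible on screen, with timestamps.
import cv2
import pytesseract

cap = cv2.VideoCapture("input_video.mp4")
fps = cap.get(cv2.CAP_PROP_FPS) or 30
frame_interval = int(fps * 5)  # sample roughly one frame every 5 seconds

visual_transcript = []
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % frame_interval == 0:
        timestamp = frame_idx / fps
        text = pytesseract.image_to_string(frame).strip()
        if text:
            visual_transcript.append((timestamp, text))
    frame_idx += 1
cap.release()

for ts, text in visual_transcript:
    print(f"[{ts:7.1f}s] {text}")
```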