Fine-grained Visual Transcription for YouTube videos

vlm-docs.nos.run

9 points by EarlyOom 2 years ago · 3 comments


EarlyOomOP 2 years ago

TLDR: There are dozens of audio transcription APIs, but nothing for video and visual transcriptions. So we built one.

If you want visual chaptering, summarization, OCR / text-extraction, audio transcriptions, and sentiment analysis on your videos from a single API, there’s really nothing out there. We tried stitching this together with several audio/video understanding APIs but kept running into rate limits, hallucinations, high costs, and poor accuracy.

Analyzing Audio Podcasts: https://vlm-docs.nos.run/guides/guide-audio-podcasts

Understanding Video Podcasts: https://vlm-docs.nos.run/guides/guide-video-podcasts

arthurdelerue 2 years ago

I'm not sure why you say that current video transcriptions are bad. I use Whisper on NLP Cloud for video transcription (https://docs.nlpcloud.com/#automatic-speech-recognition) and it works very well.
As far as I understand, video transcription is a no-brainer as long as you install ffmpeg.
- EarlyOomOP 2 years ago
  
  Hi Arthur! There's a bit of confusion here. It looks like you're referring to _audio_ transcription; that is, passing the audio component into an ASR pipeline (like Whisper, Otter, etc.) to generate a transcript of any spoken words. Our pipeline is meant for fine-grained 'transcriptions' of the _visual_ content of the video: for instance, any text on screen, the contents of plots and graphs, the clothing worn by any participants, etc. (though we do transcribe the audio as well; it's a multimodal pipeline!).
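
To make the audio/visual distinction concrete, here is a minimal sketch of the two preprocessing paths. It is not the actual pipeline discussed in this thread: the function names, the 16 kHz mono WAV target, and the one-frame-per-second sampling rate are all assumptions chosen for illustration. The functions only build ffmpeg argument lists; running them (e.g. with `subprocess.run`) requires ffmpeg to be installed.

```python
from pathlib import Path


def audio_extract_cmd(video, wav_out, sample_rate=16000):
    """Build an ffmpeg command that drops the video stream (-vn) and writes
    mono 16-bit PCM audio, a common input format for ASR models.
    (Hypothetical helper; the sample rate is an assumed default.)"""
    return [
        "ffmpeg", "-y", "-i", str(video),
        "-vn",                    # ignore the video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample
        "-c:a", "pcm_s16le",      # 16-bit PCM WAV
        str(wav_out),
    ]


def frame_sample_cmd(video, frames_dir, fps=1.0):
    """Build an ffmpeg command that samples still frames at a fixed rate.
    The frames are what a visual model (OCR, captioning, etc.) would consume.
    (Hypothetical helper; the fps value is an assumed default.)"""
    pattern = Path(frames_dir) / "frame_%06d.jpg"
    return [
        "ffmpeg", "-y", "-i", str(video),
        "-vf", f"fps={fps}",      # keep `fps` frames per second of video
        str(pattern),
    ]
```

The audio path ends in a single WAV file for an ASR model, while the visual path ends in a directory of images that still needs a separate model to turn on-screen text, plots, and scenes into text.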
