MiniGPT4-Video

For a comprehensive evaluation of our proposed architecture, we assessed its performance across three benchmark types: Video-ChatGPT, Open-ended Questions, and Multiple-Choice Questions (MCQs). On the Video-ChatGPT benchmark, shown in Table 1, our model is comparable with previous methods when subtitles are not used. When subtitles are added as input, our model achieves state-of-the-art results in all five dimensions, which confirms that it can exploit subtitle information to improve video understanding. In the zero-shot evaluation on the open-ended and multiple-choice question benchmarks, our proposed MiniGPT4-Video significantly outperforms existing state-of-the-art methods, with improvements of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF, and TVQA benchmarks, respectively. The results with and without subtitles further demonstrate that integrating subtitle information alongside visual cues significantly enhances performance, with accuracy rising from 33.9% to 54.21% on TVQA. While subtitles contribute substantially to performance on TVQA, their inclusion does not offer added value for datasets such as MSVD-QA, MSRVTT-QA, TGIF-QA, and ActivityNet, where questions are exclusively vision-based.
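To make the size of the subtitle effect on TVQA concrete, the short sketch below computes the gain from the two accuracies quoted above (33.9% without subtitles, 54.21% with subtitles). The function name `subtitle_gain` is hypothetical and introduced only for illustration. The text does not say whether its reported margins are absolute or relative, so the sketch shows both readings for this one comparison.

```python
def subtitle_gain(acc_without: float, acc_with: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent).

    Hypothetical helper for illustration; inputs are accuracies in percent.
    """
    absolute = acc_with - acc_without            # percentage points
    relative = 100.0 * absolute / acc_without    # percent of the baseline
    return absolute, relative


# TVQA accuracies quoted in the text (without and with subtitles).
abs_gain, rel_gain = subtitle_gain(33.9, 54.21)
print(f"absolute gain: {abs_gain:.2f} points")   # 20.31 points
print(f"relative gain: {rel_gain:.1f}%")         # 59.9%
```

Under these figures, adding subtitles raises TVQA accuracy by about 20.31 percentage points, which is roughly a 59.9% improvement relative to the visual-only baseline.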