The School of Computer Science at the University of Windsor is pleased to present …
Addressing ResNetVLLM’s Multi-Modal Hallucinations in Long-form Videos
PhD Seminar by: Jonathan Khalil
Date: Wednesday, March 19, 2025
Time: 10:00 AM to 11:30 AM
Location: Essex Hall, Room 122
ResNetVLLM integrates visual perception with advanced language understanding and generation, demonstrating strong vision-language capabilities. However, ResNetVLLM faces challenges in maintaining reliability on long-form videos. The evaluation and experiments presented in this seminar reveal that ResNetVLLM is prone to multi-modal hallucinations in long-form videos, where its responses do not align with the corresponding visual information. Such hallucinations can lead to unintended behaviors in real-world applications, necessitating further investigation and mitigation strategies. In this seminar, we present a detection strategy for identifying hallucinations in ResNetVLLM. Specifically, we use a QA-based method to detect inconsistencies between generated captions and video content. We also introduce a mitigation approach focused on ensuring context consistency in generated responses. Our approach encourages the model to stay aligned with the video context, reducing hallucinations through a Context-Aware Decoding (CAD) technique that modifies the model's output distribution to prioritize context-relevant information, enhancing coherence between visual inputs and their corresponding textual descriptions. Our approach improves the performance of ResNetVLLM on the ActivityNet-QA benchmark, increasing its accuracy from 54.8% to 65.3%.
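The CAD adjustment mentioned in the abstract can be sketched as a contrast between a context-conditioned token distribution and a context-free one. The function name, the alpha value, and the toy logits below are illustrative assumptions, not the seminar's actual implementation:

```python
import numpy as np

def context_aware_decoding(logits_with_ctx, logits_without_ctx, alpha=0.5):
    """Contrast context-conditioned logits against context-free logits.

    Tokens supported by the (video) context are amplified, while tokens
    the language model prefers on its own prior are down-weighted; alpha
    controls the contrast strength. This is a minimal sketch of the
    Context-Aware Decoding idea, not ResNetVLLM's implementation.
    """
    adjusted = (1 + alpha) * logits_with_ctx - alpha * logits_without_ctx
    exp = np.exp(adjusted - adjusted.max())  # numerically stable softmax
    return exp / exp.sum()

# Toy vocabulary of 4 tokens: with the video context, token 2 is
# preferred; without context, the LM prior favors token 0 (a likely
# hallucination). The contrast shifts probability mass toward token 2.
with_ctx = np.array([1.0, 0.5, 3.0, 0.2])
without_ctx = np.array([3.0, 0.5, 1.0, 0.2])

probs = context_aware_decoding(with_ctx, without_ctx, alpha=0.5)
print(probs.argmax())  # token 2 wins after the contrast
```

In a real video-LLM decoder, the two logit vectors would come from running the same model with and without the visual tokens in its input at each generation step.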
Internal Reader: Dr. Dan Wu
Internal Reader: Dr. Sherif Saad
External Reader: Dr. Mohammad Hassanzadeh
Advisor(s): Dr. Alioune Ngom