Addressing ResNetVLLM’s Multi-Modal Hallucinations in Long-form Videos - PhD Seminar by: Jonathan Khalil

Wednesday, March 19, 2025 - 10:00

The School of Computer Science at the University of Windsor is pleased to present …

Addressing ResNetVLLM’s Multi-Modal Hallucinations in Long-form Videos

PhD Seminar by: Jonathan Khalil

Date: Wednesday, March 19, 2025

Time: 10:00 AM to 11:30 AM

Location: Essex Hall, Room 122

Abstract:

ResNetVLLM integrates visual perception with advanced language understanding and generation, demonstrating strong vision-language capabilities. However, its reliability on long-form videos remains a challenge. The evaluation and experiments presented in this seminar reveal that ResNetVLLM is prone to multi-modal hallucinations in long-form videos, where its responses do not align with the corresponding visual information. Such hallucinations can lead to unintended behaviors in real-world applications, necessitating further investigation and mitigation strategies. In this seminar, we present a detection strategy for identifying hallucinations in ResNetVLLM. Specifically, we use a QA-based method to detect inconsistencies between generated captions and video content. We also introduce a mitigation approach that focuses on context consistency in generated responses: it employs a Context-Aware Decoding (CAD) technique that encourages the model to align with the video context. This method modifies the model's output distribution to prioritize context-relevant information, enhancing coherence between visual inputs and their corresponding textual descriptions. Our approach improves the accuracy of ResNetVLLM on the ActivityNet-QA benchmark from 54.8% to 65.3%, a gain of 10.5 percentage points.
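To make the idea of modifying the output distribution concrete, here is a minimal sketch of one common contrastive formulation of context-aware decoding. It is an illustration only: the abstract does not state which formulation the seminar uses, and the function names, the toy vocabulary, the logit values, and the `alpha` parameter are all assumptions introduced here. The sketch contrasts next-token logits computed with the video context against logits computed without it, which boosts tokens that the context supports.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def context_aware_distribution(logits_with_context, logits_without_context, alpha=0.5):
    """Contrast context-conditioned logits against context-free logits.

    Assumed formulation: adjusted = (1 + alpha) * with - alpha * without.
    alpha = 0 reduces to ordinary decoding on the context-conditioned logits.
    """
    if len(logits_with_context) != len(logits_without_context):
        raise ValueError("both logit vectors must cover the same vocabulary")
    adjusted = [
        (1 + alpha) * with_ctx - alpha * without_ctx
        for with_ctx, without_ctx in zip(logits_with_context, logits_without_context)
    ]
    return softmax(adjusted)


# Hypothetical toy example: the video shows a dog, but the language prior prefers "cat".
vocab = ["dog", "cat", "car"]
with_video = [2.0, 1.5, 0.1]     # logits conditioned on video and question
without_video = [0.5, 2.0, 0.1]  # logits conditioned on the question alone

plain = softmax(with_video)
cad = context_aware_distribution(with_video, without_video, alpha=1.0)

for token, p_plain, p_cad in zip(vocab, plain, cad):
    print(f"{token}: plain={p_plain:.3f} context-aware={p_cad:.3f}")
```

In this toy case the probability of "dog" rises under the adjusted distribution, because it is the token whose score increases most when the video context is present. A larger `alpha` pushes harder against the context-free prior, at the risk of over-correcting when the prior was already right.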

PhD Doctoral Committee:

Internal Reader: Dr. Dan Wu

Internal Reader: Dr. Sherif Saad

External Reader: Dr. Mohammad Hassanzadeh

Advisor(s): Dr. Aliune Ngom

Registration Link (only MAC students need to pre-register)