Cognitive and Human-Inspired Evaluation of Vision-Language Models in Scene Understanding
Contextual cues are pivotal to human perception and attention, enabling efficient object recognition and scene understanding. Recent advances in vision-language models suggest that context can also support referring expression generation (REG), especially under noisy or partially occluded conditions. Drawing on Melissa Võ's concept of scene grammar, this thesis introduces the Common Objects Out-of-Context (COOCO) dataset: a large, carefully curated collection of real and partially generated images in which objects align with or violate scene semantics to varying degrees. We use this resource to evaluate state-of-the-art multimodal models' capacity to harness contextual information when generating accurate referring expressions, both under typical viewing conditions and under visual degradation. Finally, by analyzing the attention patterns of a top-performing model, we offer insights into how scene information is processed across diverse task settings. Through this work, we aim to deepen our understanding of vision-language models' robustness and flexibility in scene comprehension while bridging the gap between cognitive research on object-context interactions and computational models of visual attention.