With video games now generating the highest revenues in the entertainment industry, optimizing game development workflows has become essential for the sector’s sustained growth. Recent advancements in Vision-Language Models (VLMs) offer considerable potential to automate and enhance various aspects of game development, particularly Quality Assurance (QA), which remains one of the industry’s most labor-intensive processes. Accurately evaluating how well VLMs handle real-world video game QA tasks requires standardized benchmarks, and existing benchmarks fall short: they tend to focus heavily on complex mathematical or textual reasoning tasks while overlooking the visual comprehension tasks fundamental to video game QA.

To bridge this gap, we introduce VideoGameQA-Bench, a comprehensive benchmark that covers a wide array of game QA activities: visual unit testing, visual regression testing, needle-in-a-haystack tasks, glitch detection, and bug report generation, across both images and videos from a variety of games. We evaluate 16 state-of-the-art VLMs on the benchmark.
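To make the task format concrete, the following minimal sketch shows how a single image-based glitch-detection query might be posed to a general-purpose VLM. It uses the OpenAI Python client purely for illustration; the model name, prompt wording, and file path are assumptions, not the benchmark’s actual evaluation harness.

    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def describe_glitches(frame_path: str) -> str:
        """Ask a general-purpose VLM whether a game frame contains visual glitches."""
        with open(frame_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode("utf-8")
        response = client.chat.completions.create(
            model="gpt-4o",  # illustrative model choice
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": ("You are a video game QA tester. Inspect this frame "
                              "and report any visual glitches (e.g., clipping, "
                              "missing textures, floating objects). Answer "
                              "'No glitch' if the frame looks correct.")},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content

    print(describe_glitches("frames/example_frame.png"))  # hypothetical path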

The video game QA process can generally be abstracted into three main types of tasks: 

  • Verifying scene integrity by comparing the visual representation of scenes against intended configurations and known reference states, such as an oracle or previously rendered versions of the same scenes (a minimal baseline sketch follows this list).
  • Detecting glitches through open-ended exploration: these glitches are unintended gameplay behaviors or visual artifacts without specific reference points, so testers must rely on common sense and general knowledge to spot them.
  • Systematically reporting and documenting all identified glitches, ensuring developers receive clear and actionable information to address problems effectively during game development. 
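For the first task type, the classical baseline is a direct pixel-level comparison between a newly rendered frame and a golden reference. The sketch below, built on Pillow and NumPy, illustrates that baseline; the file paths and tolerance value are hypothetical. Its brittleness, where benign changes such as antialiasing or animation offsets trip a fixed threshold, is exactly what makes semantic judgment from a VLM attractive for this task.

    import numpy as np
    from PIL import Image

    def frames_match(candidate_path: str, reference_path: str,
                     tolerance: float = 0.01) -> bool:
        """Return True if the candidate frame stays within a mean-absolute-
        difference tolerance (as a fraction of full scale) of the reference."""
        candidate = np.asarray(Image.open(candidate_path).convert("RGB"),
                               dtype=np.float32)
        reference = np.asarray(Image.open(reference_path).convert("RGB"),
                               dtype=np.float32)
        if candidate.shape != reference.shape:
            return False  # a resolution change is itself a regression
        mad = np.abs(candidate - reference).mean() / 255.0
        return mad <= tolerance

    # Hypothetical usage: compare a new build's render against the oracle frame.
    if not frames_match("build_1042/main_menu.png", "golden/main_menu.png"):
        print("Possible visual regression in the main_menu scene")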

The results highlight that while current VLMs show promising performance in identifying many visual issues and generating useful bug descriptions, they continue to struggle with fine-grained visual details, subtle regressions, and precise pinpointing of glitches in longer video clips.  
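As a concrete illustration of the bug-report-generation setting, one practical pattern is to ask the model for structured JSON and validate it before filing a ticket. This is a hypothetical sketch, not the benchmark’s scoring pipeline; the prompt wording and field names are assumptions.

    import json

    # Hypothetical prompt that could accompany a glitchy frame sent to a VLM.
    BUG_REPORT_PROMPT = (
        "You are a QA tester. Describe the glitch in this frame as a JSON object "
        'with keys "title", "description", "severity" (low/medium/high), and '
        '"affected_elements". Return only the JSON object.'
    )

    REQUIRED_FIELDS = {"title", "description", "severity", "affected_elements"}

    def to_bug_report(vlm_answer: str) -> dict:
        """Parse and validate the model's JSON answer as a structured bug report."""
        report = json.loads(vlm_answer)
        missing = REQUIRED_FIELDS - report.keys()
        if missing:
            raise ValueError(f"Incomplete bug report, missing fields: {missing}")
        return report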

For more details on this work, including complete results and findings, please visit the project page here.