ChineseSimpleVQA

“See” the World, Discover Knowledge

Abstract

The factual accuracy of responses generated by large vision-language models (LVLMs) is a key measure of model capability, reflecting both a model's knowledge capacity and its reliability. To test LVLMs' ability to "see the world and discover knowledge," we introduce ChineseSimpleVQA, the first Chinese knowledge-based visual question-answering benchmark. Its key features include a focus on the Chinese language, diverse knowledge types, a multi-hop question construction mechanism, high-quality data, static consistency, and easy evaluation through short answers. Each item pairs an image object recognition question with a further knowledge-based question, allowing users to probe a model's knowledge boundaries and analyze its performance at each stage. Our ultimate goal is to help developers better understand and analyze the factual capabilities of their models, thereby advancing the development of LVLMs and improving their factual accuracy in real-world applications.

Leaderboard

| # | Model | Organization | Merged Final Q&A: CO | IN↓ | NA↓ | CGA | F-score | Recognition Q&A: CO | IN↓ | NA↓ | CGA | F-score |
|---|-------|--------------|---------------------:|----:|----:|----:|--------:|---------------------:|----:|----:|----:|--------:|
| 1 | o1-preview 🥇 | OpenAI | 68.8 | 24.6 | 6.5 | 73.6 | 71.1 | 79.1 | 13.6 | 7.3 | 85.3 | 82.1 |
| 2 | Gemini-1.5-Pro 🥈 | Google | 56.5 | 34.6 | 8.8 | 62.0 | 59.2 | 70.3 | 25.9 | 3.8 | 73.1 | 71.6 |
| 3 | Gemini-2.0-Pro-flash 🥉 | Google | 64.5 | 29.5 | 5.9 | 68.6 | 66.5 | 76.7 | 19.6 | 3.7 | 79.7 | 78.2 |
| 4 | Claude-3.5-sonnet2 | Anthropic | 63.8 | 30.6 | 5.5 | 67.6 | 65.6 | 77.6 | 17.2 | 5.2 | 81.9 | 79.7 |
| 5 | Claude-3.5-sonnet | Anthropic | 59.5 | 26.4 | 14.2 | 69.4 | 64.0 | 69.5 | 20.2 | 10.3 | 77.5 | 73.3 |
| 6 | GPT-4o | OpenAI | 59.1 | 35.5 | 5.4 | 62.4 | 60.7 | 77.5 | 15.5 | 7.0 | 83.4 | 80.4 |
| 7 | Qwen-VL-max | Alibaba | 56.5 | 39.6 | 3.8 | 58.8 | 57.6 | 72.9 | 24.6 | 2.5 | 74.7 | 73.8 |

Benchmark

Statistics

Knowledge distribution.


Knowledge category distribution.

Construction Pipeline


An overview of the entire construction pipeline of ChineseSimpleVQA.

Experiment Results

Main Results


Performance comparison of closed-source and open-source LVLMs on multi-hop QAs (i.e., Merged Q&A and Recognition Q&A). For the metrics, CO, IN, NA, and CGA denote "Correct", "Incorrect", "Not attempted", and "Correct given attempted", respectively. The highest scores among models in each section are highlighted in green.
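
These metrics are related in a way that can be checked against the leaderboard above: for o1-preview, CGA = 68.8 / (68.8 + 24.6) ≈ 73.6, and the F-score 71.1 is the harmonic mean of CO (68.8) and CGA (73.6). The sketch below aggregates per-question judge verdicts into these metrics; it is a minimal illustration that assumes each question has already been labeled "correct", "incorrect", or "not_attempted", not the official evaluation script.

```python
from collections import Counter

def score(verdicts):
    """Aggregate per-question verdicts into CO, IN, NA, CGA, and F-score (as percentages)."""
    counts = Counter(verdicts)
    total = sum(counts.values())
    co = counts["correct"] / total            # Correct
    inc = counts["incorrect"] / total         # Incorrect
    na = counts["not_attempted"] / total      # Not attempted
    attempted = counts["correct"] + counts["incorrect"]
    cga = counts["correct"] / attempted if attempted else 0.0   # Correct given attempted
    f = 2 * co * cga / (co + cga) if (co + cga) else 0.0        # harmonic mean of CO and CGA
    return {name: round(value * 100, 1)
            for name, value in
            [("CO", co), ("IN", inc), ("NA", na), ("CGA", cga), ("F-score", f)]}
```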

Overall Rankings on ChineseSimpleVQA


Rankings of different models on ChineseSimpleVQA.

Performance on Different Topics


Correctness (CO) metric for eight topics. We show the top 10 models here.

Further Analysis


Top: Calibration of LVLMs based on their stated confidence, for the Recognition and final Q&A.
Bottom: Improvement in accuracy with increased test-time compute using Best-of-N, for the Recognition and final Q&A.
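
As a rough illustration of the Best-of-N setting, the sketch below draws N candidate answers for each question and keeps the most frequent one (majority vote). The `ask_model` callable is a hypothetical stand-in for an LVLM API call, and the selection rule actually used for the figure may differ (e.g., picking the answer with the highest stated confidence).

```python
from collections import Counter

def best_of_n(ask_model, question, image, n=8):
    """Sample n answers for one (image, question) pair and return the
    majority-vote answer together with its agreement rate."""
    answers = [ask_model(question, image) for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n
```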

Data Examples

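For illustration, a ChineseSimpleVQA item can be thought of as an image paired with a recognition question and a follow-up knowledge question, each with a short reference answer. The record below is a hypothetical sketch of such an item; the field names and example content are illustrative only and not taken from the released data.

```python
# Hypothetical item layout (illustrative only, not the released schema).
example_item = {
    "image": "images/0001.jpg",
    "recognition_qa": {
        "question": "What building is shown in the picture?",   # step 1: "see" the world
        "answer": "The Temple of Heaven",
    },
    "knowledge_qa": {
        "question": "In which dynasty was this building first built?",  # step 2: discover knowledge
        "answer": "The Ming dynasty",
    },
}
```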