Accuracy scores on the testmini subset (1,740 examples) of We-Math. Column abbreviations: IK = Insufficient Knowledge, IG = Inadequate Generalization, CM = Complete Mastery, RM = Rote Memorization; raw example counts are given in parentheses.
# | Model | Source | Date | Avg(Strict) | IK(Strict) | IG(Strict) | CM(Strict) | RM(Strict) | Avg(Loose) | IK(Loose) | IG(Loose) | CM(Loose) | RM(Loose) |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
1 | GPT-4o 🥇 | Link | 2024-05 | 42.86% | 31.24% (164) | 15.24% (80) | 35.24% (185) | 34.16% (96) | 60.57% | 31.24% (164) | 15.24% (80) | 52.95% (278) | 1.07% (3) |
2 | GPT-4V | Link | 2024-04 | 31.05% | 39.81% (209) | 14.48% (76) | 23.81% (125) | 47.92% (115) | 51.43% | 39.81% (209) | 14.48% (76) | 44.19% (232) | 3.33% (8) |
3 | Gemini 1.5 Pro | Link | 2024-05 | 26.38% | 42.86% (225) | 11.24% (59) | 20.76% (109) | 54.77% (132) | 46.00% | 42.86% (225) | 11.24% (59) | 40.38% (212) | 12.03% (29) |
4 | Qwen-VL-Max | Link | 2024-01 | 10.48% | 65.14% (342) | 7.62% (40) | 6.67% (35) | 75.52% (108) | 25.52% | 65.14% (342) | 7.62% (40) | 21.71% (114) | 20.28% (29) |
5 | InternVL2-Llama3-76B 🥈 | Link | 2024-07 | 36.86% | 33.90% (178) | 15.81% (83) | 28.95% (152) | 42.42% (112) | 56.29% | 33.90% (178) | 15.81% (83) | 48.38% (254) | 3.79% (10) |
6 | Qwen2-VL-72B 🥉 | Link | 2024-05 | 36.57% | 33.52% (176) | 14.10% (74) | 29.52% (115) | 43.64% (120) | 56.76% | 33.52% (176) | 14.10% (74) | 49.71% (261) | 5.09% (14) |
7 | LLaVA-OneVision-72B | Link | 2024-05 | 28.67% | 41.14% (216) | 16.19% (85) | 20.57% (108) | 51.79% (116) | 49.05% | 41.14% (216) | 16.19% (85) | 40.95% (215) | 4.02% (9) |
8 | InternVL-Chat-V1.5 | Link | 2024-04 | 14.95% | 56.19% (295) | 13.90% (73) | 8.00% (42) | 73.25% (115) | 32.67% | 56.19% (295) | 13.90% (73) | 25.71% (135) | 14.01% (22) |
9 | LLaVA-1.6-13B | Link | 2024-03 | 5.24% | 69.14% (363) | 3.24% (17) | 3.62% (19) | 86.90% (126) | 22.00% | 69.14% (363) | 3.24% (17) | 20.38% (107) | 26.21% (38) |
10 | G-LLaVA-13B | Link | 2024-03 | 6.48% | 64.19% (337) | 4.57% (24) | 4.19% (22) | 86.59% (142) | 22.29% | 64.19% (337) | 4.57% (24) | 20.00% (105) | 35.98% (59) |
11 | GLM-4V-9B | Link | 2024-06 | 14.86% | 52.95% (278) | 9.52% (50) | 10.10% (53) | 73.10% (144) | 35.05% | 52.95% (278) | 9.52% (50) | 30.29% (159) | 19.29% (38) |
12 | InternVL2-8B | Link | 2024-07 | 26.57% | 45.52% (239) | 13.52% (71) | 19.81% (104) | 51.63% (111) | 44.86% | 45.52% (239) | 13.52% (71) | 38.10% (200) | 6.98% (15) |
13 | Qwen2-VL-7B | Link | 2024-09 | 25.62% | 47.05% (247) | 14.67% (77) | 18.29% (96) | 52.24% (105) | 42.95% | 47.05% (247) | 14.67% (77) | 35.62% (187) | 6.97% (14) |
14 | LLaVA-OneVision-7B | Link | 2024-08 | 23.14% | 44.95% (236) | 13.14% (69) | 16.57% (87) | 60.45% (133) | 44.86% | 44.95% (236) | 13.14% (69) | 38.29% (201) | 8.64% (19) |
15 | MiniCPM-LLaMA3-V 2.5 | Link | 2024-05 | 9.52% | 60.19% (316) | 9.14% (48) | 4.95% (26) | 83.85% (135) | 28.00% | 60.19% (316) | 9.14% (48) | 23.43% (123) | 23.60% (38) |
16 | LongVA-7B | Link | 2024-06 | 11.52% | 61.14% (321) | 8.95% (47) | 7.05% (37) | 76.43% (120) | 27.71% | 61.14% (321) | 8.95% (47) | 23.24% (122) | 22.29% (35) |
17 | InternLM-XComposer2-VL-7B | Link | 2024-04 | 12.67% | 56.38% (296) | 10.48% (55) | 7.43% (39) | 77.59% (135) | 30.95% | 56.38% (296) | 10.48% (55) | 25.71% (135) | 22.41% (39) |
18 | LLaVA-1.6-7B | Link | 2024-03 | 3.33% | 78.29% (411) | 2.48% (13) | 2.10% (11) | 89.11% (90) | 13.81% | 78.29% (411) | 2.48% (13) | 12.57% (66) | 34.65% (35) |
19 | Phi3-Vision-4.2B | Link | 2024-05 | 10.57% | 58.86% (309) | 8.95% (47) | 6.10% (32) | 81.07% (137) | 29.81% | 58.86% (309) | 8.95% (47) | 25.33% (133) | 21.30% (36) |
20 | DeepSeek-VL-1.3B | Link | 2024-03 | 5.90% | 71.05% (373) | 2.67% (14) | 4.57% (24) | 82.61% (114) | 21.52% | 71.05% (373) | 2.67% (14) | 20.19% (106) | 23.19% (32) |
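As a rough illustration of the four-way diagnosis behind the IK/IG/CM/RM columns: each multi-step problem is decomposed into sub-problems, and a model's answers to the composite question and its sub-problems determine the category. The sketch below is a minimal, hypothetical implementation; the strict/loose distinction is modeled as requiring all vs. at least one sub-problem correct for CM, which is an assumption rather than the benchmark's exact rule.

```python
def classify(composite_ok: bool, subs_ok: list[bool], strict: bool = True) -> str:
    """Classify one decomposed problem into IK / IG / CM / RM.

    composite_ok -- whether the composite (multi-step) question was answered correctly
    subs_ok      -- correctness of each decomposed sub-problem
    strict       -- strict setting: CM requires every sub-problem correct;
                    the loose relaxation used here (any sub-problem correct)
                    is an assumption for illustration only
    """
    if composite_ok:
        if all(subs_ok) if strict else any(subs_ok):
            return "CM"  # Complete Mastery
        return "RM"      # Rote Memorization: right final answer, wrong steps
    if all(subs_ok):
        return "IG"      # Inadequate Generalization: steps right, composite wrong
    return "IK"          # Insufficient Knowledge: a sub-problem already fails

# Under the loose setting, most strict-RM cases are reclassified as CM,
# which matches the pattern in the table (e.g. GPT-4o: CM 185 + RM 96
# under strict vs. CM 278 + RM 3 under loose, both summing to 281).
print(classify(True, [False, True]))                # RM under strict
print(classify(True, [False, True], strict=False))  # CM under loose
```

Note that IK and IG counts are identical between the strict and loose columns, consistent with the two settings differing only in how composite-correct cases are split between CM and RM.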
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.