We-Math

Does Your Large Multimodal Model
Achieve Human-like Mathematical Reasoning?

1Beijing University of Posts and Telecommunications, 2Wechat, Tencent Inc., 3Huazhong University of Science and Technology, 4Beijing Institute of Technology

*Equal contribution
†Corresponding author

Overview diagram and statistics of We-Math. The left side shows the first two layers of We-Math's category hierarchy; the right side shows information about the samples and terminal nodes.

Introduction

Inspired by human-like mathematical reasoning, we introduce We-Math, the first benchmark specifically designed to explore problem-solving principles beyond end-to-end performance.

We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and 5 layers of knowledge granularity. We first decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete Mastery (CM), and Rote Memorization (RM), to hierarchically assess inherent issues in LMMs' reasoning process.

With We-Math, we conduct a thorough evaluation of existing LMMs in visual mathematical reasoning and reveal a negative correlation between the number of solving steps and problem-specific performance. We confirm that the IK issue of LMMs can be effectively improved via a knowledge augmentation strategy. More notably, the primary challenge of GPT-4o has significantly transitioned from IK to IG, establishing it as the first LMM advancing towards the knowledge generalization stage. In contrast, other LMMs exhibit a marked inclination towards Rote Memorization: they correctly solve composite problems involving multiple knowledge concepts, yet fail to answer the sub-problems.

We anticipate that We-Math will open new pathways for advancements in visual mathematical reasoning for LMMs.



Overview of LMMs' performance on We-Math. The figures, from left to right, illustrate (1) the accuracy of different LMMs across problem-solving steps, (2) the performance in different visual mathematics categories, and (3) the results of the knowledge-based reasoning evaluation.

We-Math Benchmark

Overview

Different from existing benchmarks, We-Math is constructed around textbook knowledge units, decomposing composite problem solutions into sub-problems based on the knowledge concepts.
(1) Hierarchical Knowledge Structure. We-Math strictly adheres to the knowledge presented in mathematics textbooks, featuring a rigorous hierarchical, multi-category architecture. It ensures the independence of knowledge concepts within the same level, while establishing logical relationships among concepts at different hierarchical levels.
(2) Knowledge-based Reasoning Evaluation. To explore how LMMs solve problems, and drawing on the way humans tackle problems incrementally by leveraging fundamental knowledge concepts, we break down complex mathematical problems into more manageable sub-problems. Furthermore, we employ diverse measurement dimensions for meticulous evaluation (a hypothetical sample layout is sketched after this list).
(3) Knowledge Concept Augmentation. To alleviate the inherent issues during the problem-solving process, we heuristically introduce descriptions for 67 knowledge concepts from Wikipedia and textbooks, thereby providing essential knowledge support for the reasoning processes of LMMs.
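To make the structure concrete, below is a minimal sketch of how a We-Math sample could be represented: a composite problem, its required knowledge concepts, the corresponding one-step sub-problems, and the concept descriptions used for knowledge augmentation. The field names and contents are illustrative assumptions, not the released dataset schema.

```python
# Hypothetical layout of a We-Math sample (field names and values are
# illustrative assumptions, not the released dataset schema).
composite_problem = {
    "image": "images/example_001.png",        # assumed path to the figure
    "question": "What is the area of the shaded region?",
    "answer": "B",
    "knowledge_concepts": [                   # leaf nodes of the knowledge hierarchy
        "Area of a Triangle",
        "Area of a Rectangle",
    ],
    "sub_problems": [                         # one-step problems, one per concept
        {"question": "What is the area of triangle ABC?", "answer": "C"},
        {"question": "What is the area of rectangle ABCD?", "answer": "A"},
    ],
    "knowledge_cards": {                      # concept descriptions for augmentation
        "Area of a Triangle": "The area of a triangle is half the base times the height.",
        "Area of a Rectangle": "The area of a rectangle is its length times its width.",
    },
}
```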

Metric for Reasoning Evaluation

Based on the decomposed multi-step problems, we further reveal the inherent issues of LMMs in the problem-solving process. We feed both the M one-step sub-problems and the original problem into LMMs, and classify the responses into four categories (a minimal classification sketch follows this list):
1. Insufficient Knowledge (IK): Some of the one-step problems contain errors, and the multi-step problem is wrong. This is reasonable: the model's insufficient grasp of a single knowledge concept may lead to errors in the multi-step problem.
2. Inadequate Generalization (IG): The one-step problems are all correct, but the multi-step problem is incorrect. This is also considered reasonable: while LMMs are capable of understanding individual knowledge concepts, they may struggle to generalize that knowledge to solve composite problems.
3. Complete Mastery (CM): The one-step problems are all correct, and the multi-step problem is also answered correctly. This result demonstrates that the model's results are both reliable and accurate.
4. Rote Memorization (RM): The one-step problems contain errors, but the multi-step problem is answered correctly, which contradicts human logical thinking. If a model can solve composite multi-step problems but fails to answer the one-step problems needed in the process, it raises doubts about the model's reliability.
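The four categories follow directly from the correctness of the M one-step sub-problems and of the composite (multi-step) problem. The sketch below assumes correctness has already been judged per response; it is a minimal illustration and may differ in details from the official evaluation scripts.

```python
def classify_response(sub_correct: list[bool], multi_correct: bool) -> str:
    """Map one composite problem to IK / IG / CM / RM, following the four
    categories described above. `sub_correct` holds the correctness of the
    M one-step sub-problems; `multi_correct` is the composite problem's result."""
    all_sub_correct = all(sub_correct)
    if not all_sub_correct and not multi_correct:
        return "IK"   # Insufficient Knowledge
    if all_sub_correct and not multi_correct:
        return "IG"   # Inadequate Generalization
    if all_sub_correct and multi_correct:
        return "CM"   # Complete Mastery
    return "RM"       # Rote Memorization: composite correct, some sub-problems wrong
```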

Experiment Results

Leaderboard on We-Math (testmini)

Accuracy scores on the testmini subset (1,740 examples) of We-Math.

| # | Model | Source | Date | Avg(Strict) | IK(Strict) | IG(Strict) | CM(Strict) | RM(Strict) | Avg(Loose) | IK(Loose) | IG(Loose) | CM(Loose) | RM(Loose) |
|---|-------|--------|------|-------------|------------|------------|------------|------------|------------|-----------|-----------|-----------|-----------|
| 1 | GPT-4o 🥇 | Link | 2024-05 | 42.86% | 31.24% (164) | 15.24% (80) | 35.24% (185) | 34.16% (96) | 60.57% | 31.24% (164) | 15.24% (80) | 52.95% (278) | 1.07% (3) |
| 2 | GPT-4V 🥈 | Link | 2024-04 | 31.05% | 39.81% (209) | 14.48% (76) | 23.81% (125) | 47.92% (115) | 51.43% | 39.81% (209) | 14.48% (76) | 44.19% (232) | 3.33% (8) |
| 3 | Gemini-1.5-pro 🥉 | Link | 2024-05 | 26.38% | 42.86% (225) | 11.24% (59) | 20.76% (109) | 54.77% (132) | 46.00% | 42.86% (225) | 11.24% (59) | 40.38% (212) | 12.03% (29) |
| 4 | LLaVA-NeXT-110B | Link | 2024-05 | 19.24% | 50.29% (264) | 14.48% (76) | 12.00% (63) | 65.95% (122) | 37.90% | 50.29% (264) | 14.48% (76) | 30.67% (161) | 12.97% (24) |
| 5 | InternVL-Chat-V1.5 | Link | 2024-04 | 14.95% | 56.19% (295) | 13.90% (73) | 8.00% (42) | 73.25% (115) | 32.67% | 56.19% (295) | 13.90% (73) | 25.71% (135) | 14.01% (22) |
| 6 | GLM-4V-9B | Link | 2024-06 | 14.86% | 52.95% (278) | 9.52% (50) | 10.10% (53) | 73.10% (144) | 35.05% | 52.95% (278) | 9.52% (50) | 30.29% (159) | 19.29% (38) |
| 7 | LLaVA-NeXT-72B | Link | 2024-05 | 13.43% | 58.86% (309) | 7.05% (37) | 9.90% (52) | 70.95% (127) | 31.52% | 58.86% (309) | 7.05% (37) | 28.00% (147) | 17.88% (32) |
| 8 | InternLM-XComposer2-VL-7B | Link | 2024-04 | 12.67% | 56.38% (296) | 10.48% (55) | 7.43% (39) | 77.59% (135) | 30.95% | 56.38% (296) | 10.48% (55) | 25.71% (135) | 22.41% (39) |
| 9 | LongVA | Link | 2024-06 | 11.52% | 61.14% (321) | 8.95% (47) | 7.05% (37) | 76.43% (120) | 27.71% | 61.14% (321) | 8.95% (47) | 23.24% (122) | 22.29% (35) |
| 10 | Phi3-Vision-4.2B | Link | 2024-05 | 10.57% | 58.86% (309) | 8.95% (47) | 6.10% (32) | 81.07% (137) | 29.81% | 58.86% (309) | 8.95% (47) | 25.33% (133) | 21.30% (36) |
| 11 | Qwen-VL-Max | Link | 2024-01 | 10.48% | 65.14% (342) | 7.62% (40) | 6.67% (35) | 75.52% (108) | 25.52% | 65.14% (342) | 7.62% (40) | 21.71% (114) | 20.28% (29) |
| 12 | MiniCPM-LLaMA3-V 2.5 | Link | 2024-05 | 9.52% | 60.19% (316) | 9.14% (48) | 4.95% (26) | 83.85% (135) | 28.00% | 60.19% (316) | 9.14% (48) | 23.43% (123) | 23.60% (38) |
| 13 | G-LLaVA-13B | Link | 2024-03 | 6.48% | 64.19% (337) | 4.57% (24) | 4.19% (22) | 86.59% (142) | 22.29% | 64.19% (337) | 4.57% (24) | 20.00% (105) | 35.98% (59) |
| 14 | DeepSeek-VL-7B | Link | 2024-03 | 6.29% | 69.14% (363) | 4.57% (24) | 4.00% (21) | 84.78% (117) | 20.95% | 69.14% (363) | 4.57% (24) | 18.67% (98) | 28.99% (40) |
| 15 | DeepSeek-VL-1.3B | Link | 2024-03 | 5.90% | 71.05% (373) | 2.67% (14) | 4.57% (24) | 82.61% (114) | 21.52% | 71.05% (373) | 2.67% (14) | 20.19% (106) | 23.19% (32) |
| 16 | LLaVA-1.6-13B | Link | 2024-03 | 5.24% | 69.14% (363) | 3.24% (17) | 3.62% (19) | 86.90% (126) | 22.00% | 69.14% (363) | 3.24% (17) | 20.38% (107) | 26.21% (38) |
| 17 | LLaVA-1.6-7B | Link | 2024-03 | 3.33% | 78.29% (411) | 2.48% (13) | 2.10% (11) | 89.11% (90) | 13.81% | 78.29% (411) | 2.48% (13) | 12.57% (66) | 34.65% (35) |
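For reading the table: the exact relation between the percentages and the counts in parentheses is not stated here, but the reported numbers are consistent with IK, IG, and CM being fractions of all composite problems and RM being the fraction of correctly answered composite problems that fall into rote memorization. The sketch below reproduces the strict GPT-4o row under that assumption; it is an inference from the table, not the official scoring script, and the weighting behind the Avg columns is not reconstructed here.

```python
# Assumed relation between category counts and the reported percentages
# (inferred from the table values, not from the official evaluation code).
def category_rates(n_ik: int, n_ig: int, n_cm: int, n_rm: int) -> dict[str, float]:
    total = n_ik + n_ig + n_cm + n_rm          # all composite problems
    return {
        "IK": n_ik / total,
        "IG": n_ig / total,
        "CM": n_cm / total,
        "RM": n_rm / (n_cm + n_rm),            # share of composite-correct answers
    }

# Strict GPT-4o counts from the table: 164, 80, 185, 96
# -> IK 31.24%, IG 15.24%, CM 35.24%, RM 34.16%
print(category_rates(164, 80, 185, 96))
```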

🚨 To submit your results to the leaderboard, please send your result JSON files to this email.

Results on Existing Foundation Models

Knowledge Card

BibTeX

@misc{qiao2024wemathdoeslargemultimodal,
      title={We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?}, 
      author={Runqi Qiao and Qiuna Tan and Guanting Dong and Minhui Wu and Chong Sun and Xiaoshuai Song and Zhuoma GongQue and Shanglin Lei and Zhe Wei and Miaoxuan Zhang and Runfeng Qiao and Yifan Zhang and Xiao Zong and Yida Xu and Muxi Diao and Zhimin Bao and Chen Li and Honggang Zhang},
      year={2024},
      eprint={2407.01284},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.01284}, 
}