Accuracy scores on the testmini subset (1,740 examples) of We-Math. Column abbreviations: IK = Insufficient Knowledge, IG = Inadequate Generalization, CM = Complete Mastery, RM = Rote Memorization; raw example counts are given in parentheses.
# | Model | Source | Date | Avg(Strict) | IK(Strict) | IG(Strict) | CM(Strict) | RM(Strict) | Avg(Loose) | IK(Loose) | IG(Loose) | CM(Loose) | RM(Loose) |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
1 | GPT-4o 🥇 | Link | 2024-05 | 42.86% | 31.24% (164) | 15.24% (80) | 35.24% (185) | 34.16% (96) | 60.57% | 31.24% (164) | 15.24% (80) | 52.95% (278) | 1.07% (3) |
2 | GPT-4V | Link | 2024-04 | 31.05% | 39.81% (209) | 14.48% (76) | 23.81% (125) | 47.92% (115) | 51.43% | 39.81% (209) | 14.48% (76) | 44.19% (232) | 3.33% (8) |
3 | Gemini 1.5 Pro | Link | 2024-05 | 26.38% | 42.86% (225) | 11.24% (59) | 20.76% (109) | 54.77% (132) | 46.00% | 42.86% (225) | 11.24% (59) | 40.38% (212) | 12.03% (29) |
4 | Qwen-VL-Max | Link | 2024-01 | 10.48% | 65.14% (342) | 7.62% (40) | 6.67% (35) | 75.52% (108) | 25.52% | 65.14% (342) | 7.62% (40) | 21.71% (114) | 20.28% (29) |
5 | InternVL2-Llama3-76B 🥈 | Link | 2024-07 | 36.86% | 33.90% (178) | 15.81% (83) | 28.95% (152) | 42.42% (112) | 56.29% | 33.90% (178) | 15.81% (83) | 48.38% (254) | 3.79% (10) |
6 | Qwen2-VL-72B 🥉 | Link | 2024-05 | 36.57% | 33.52% (176) | 14.10% (74) | 29.52% (115) | 43.64% (120) | 56.76% | 33.52% (176) | 14.10% (74) | 49.71% (261) | 5.09% (14) |
7 | LLaVA-OneVision-72B | Link | 2024-05 | 28.67% | 41.14% (216) | 16.19% (85) | 20.57% (108) | 51.79% (116) | 49.05% | 41.14% (216) | 16.19% (85) | 40.95% (215) | 4.02% (9) |
8 | InternVL-Chat-V1.5 | Link | 2024-04 | 14.95% | 56.19% (295) | 13.90% (73) | 8.00% (42) | 73.25% (115) | 32.67% | 56.19% (295) | 13.90% (73) | 25.71% (135) | 14.01% (22) |
9 | LLaVA-1.6-13B | Link | 2024-03 | 5.24% | 69.14% (363) | 3.24% (17) | 3.62% (19) | 86.90% (126) | 22.00% | 69.14% (363) | 3.24% (17) | 20.38% (107) | 26.21% (38) |
10 | G-LLaVA-13B | Link | 2024-03 | 6.48% | 64.19% (337) | 4.57% (24) | 4.19% (22) | 86.59% (142) | 22.29% | 64.19% (337) | 4.57% (24) | 20.00% (105) | 35.98% (59) |
11 | GLM-4V-9B | Link | 2024-06 | 14.86% | 52.95% (278) | 9.52% (50) | 10.10% (53) | 73.10% (144) | 35.05% | 52.95% (278) | 9.52% (50) | 30.29% (159) | 19.29% (38) |
12 | InternVL2-8B | Link | 2024-07 | 26.57% | 45.52% (239) | 13.52% (71) | 19.81% (104) | 51.63% (111) | 44.86% | 45.52% (239) | 13.52% (71) | 38.10% (200) | 6.98% (15) |
13 | Qwen2-VL-7B | Link | 2024-09 | 25.62% | 47.05% (247) | 14.67% (77) | 18.29% (96) | 52.24% (105) | 42.95% | 47.05% (247) | 14.67% (77) | 35.62% (187) | 6.97% (14) |
14 | LLaVA-OneVision-7B | Link | 2024-08 | 23.14% | 44.95% (236) | 13.14% (69) | 16.57% (87) | 60.45% (133) | 44.86% | 44.95% (236) | 13.14% (69) | 38.29% (201) | 8.64% (19) |
15 | MiniCPM-LLaMA3-V 2.5 | Link | 2024-05 | 9.52% | 60.19% (316) | 9.14% (48) | 4.95% (26) | 83.85% (135) | 28.00% | 60.19% (316) | 9.14% (48) | 23.43% (123) | 23.60% (38) |
16 | LongVA-7B | Link | 2024-06 | 11.52% | 61.14% (321) | 8.95% (47) | 7.05% (37) | 76.43% (120) | 27.71% | 61.14% (321) | 8.95% (47) | 23.24% (122) | 22.29% (35) |
17 | InternLM-XComposer2-VL-7B | Link | 2024-04 | 12.67% | 56.38% (296) | 10.48% (55) | 7.43% (39) | 77.59% (135) | 30.95% | 56.38% (296) | 10.48% (55) | 25.71% (135) | 22.41% (39) |
18 | LLaVA-1.6-7B | Link | 2024-03 | 3.33% | 78.29% (411) | 2.48% (13) | 2.10% (11) | 89.11% (90) | 13.81% | 78.29% (411) | 2.48% (13) | 12.57% (66) | 34.65% (35) |
19 | Phi3-Vision-4.2B | Link | 2024-05 | 10.57% | 58.86% (309) | 8.95% (47) | 6.10% (32) | 81.07% (137) | 29.81% | 58.86% (309) | 8.95% (47) | 25.33% (133) | 21.30% (36) |
20 | DeepSeek-VL-1.3B | Link | 2024-03 | 5.90% | 71.05% (373) | 2.67% (14) | 4.57% (24) | 82.61% (114) | 21.52% | 71.05% (373) | 2.67% (14) | 20.19% (106) | 23.19% (32) |
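As a rough illustration of the four-way diagnosis behind the IK/IG/CM/RM columns: each multi-step problem is decomposed into sub-problems, and a model's answers to the composite question and its sub-problems determine the category. The sketch below is a minimal, hypothetical implementation; the strict/loose distinction is modeled as requiring all vs. at least one sub-problem correct for CM, which is an assumption rather than the benchmark's exact rule.

```python
def classify(composite_ok: bool, subs_ok: list[bool], strict: bool = True) -> str:
    """Classify one decomposed problem into IK / IG / CM / RM.

    composite_ok -- whether the composite (multi-step) question was answered correctly
    subs_ok      -- correctness of each decomposed sub-problem
    strict       -- strict setting: CM requires every sub-problem correct;
                    the loose relaxation used here (any sub-problem correct)
                    is an assumption for illustration only
    """
    if composite_ok:
        if all(subs_ok) if strict else any(subs_ok):
            return "CM"  # Complete Mastery
        return "RM"      # Rote Memorization: right final answer, wrong steps
    if all(subs_ok):
        return "IG"      # Inadequate Generalization: steps right, composite wrong
    return "IK"          # Insufficient Knowledge: a sub-problem already fails

# Under the loose setting, most strict-RM cases are reclassified as CM,
# which matches the pattern in the table (e.g. GPT-4o: CM 185 + RM 96
# under strict vs. CM 278 + RM 3 under loose, both summing to 281).
print(classify(True, [False, True]))                # RM under strict
print(classify(True, [False, True], strict=False))  # CM under loose
```

Note that IK and IG counts are identical between the strict and loose columns, consistent with the two settings differing only in how composite-correct cases are split between CM and RM.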
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.