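Accuracy scores on the testmini subset (1,740 examples) of We-Math. IK: Insufficient Knowledge; IG: Inadequate Generalization; CM: Complete Mastery; RM: Rote Memorization. Numbers in parentheses are problem counts.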
# | Model | Source | Date | Avg(Strict) | IK(Strict) | IG(Strict) | CM(Strict) | RM(Strict) | Avg(Loose) | IK(Loose) | IG(Loose) | CM(Loose) | RM(Loose) |
1 | GPT-4o 🥇 | Link | 2024-05 | 42.86% | 31.24% (164) | 15.24% (80) | 35.24% (185) | 34.16% (96) | 60.57% | 31.24% (164) | 15.24% (80) | 52.95% (278) | 1.07% (3) |
2 | GPT-4V 🥈 | Link | 2024-04 | 31.05% | 39.81% (209) | 14.48% (76) | 23.81% (125) | 47.92% (115) | 51.43% | 39.81% (209) | 14.48% (76) | 44.19% (232) | 3.33% (8) |
3 | Gemini-1.5-pro 🥉 | Link | 2024-05 | 26.38% | 42.86% (225) | 11.24% (59) | 20.76% (109) | 54.77% (132) | 46.00% | 42.86% (225) | 11.24% (59) | 40.38% (212) | 12.03% (29) |
4 | LLaVA-NeXT-110B | Link | 2024-05 | 19.24% | 50.29% (264) | 14.48% (76) | 12.00% (63) | 65.95% (122) | 37.90% | 50.29% (264) | 14.48% (76) | 30.67% (161) | 12.97% (24) |
5 | InternVL-Chat-V1.5 | Link | 2024-04 | 14.95% | 56.19% (295) | 13.90% (73) | 8.00% (42) | 73.25% (115) | 32.67% | 56.19% (295) | 13.90% (73) | 25.71% (135) | 14.01% (22) |
6 | GLM-4V-9B | Link | 2024-06 | 14.86% | 52.95% (278) | 9.52% (50) | 10.10% (53) | 73.10% (144) | 35.05% | 52.95% (278) | 9.52% (50) | 30.29% (159) | 19.29% (38) |
7 | LLaVA-NeXT-72B | Link | 2024-05 | 13.43% | 58.86% (309) | 7.05% (37) | 9.90% (52) | 70.95% (127) | 31.52% | 58.86% (309) | 7.05% (37) | 28.00% (147) | 17.88% (32) |
8 | InternLM-XComposer2-VL-7B | Link | 2024-04 | 12.67% | 56.38% (296) | 10.48% (55) | 7.43% (39) | 77.59% (135) | 30.95% | 56.38% (296) | 10.48% (55) | 25.71% (135) | 22.41% (39) |
9 | LongVA | Link | 2024-06 | 11.52% | 61.14% (321) | 8.95% (47) | 7.05% (37) | 76.43% (120) | 27.71% | 61.14% (321) | 8.95% (47) | 23.24% (122) | 22.29% (35) |
10 | Phi3-Vision-4.2B | Link | 2024-05 | 10.57% | 58.86% (309) | 8.95% (47) | 6.10% (32) | 81.07% (137) | 29.81% | 58.86% (309) | 8.95% (47) | 25.33% (133) | 21.30% (36) |
11 | Qwen-VL-Max | Link | 2024-01 | 10.48% | 65.14% (342) | 7.62% (40) | 6.67% (35) | 75.52% (108) | 25.52% | 65.14% (342) | 7.62% (40) | 21.71% (114) | 20.28% (29) |
12 | MiniCPM-LLaMA3-V 2.5 | Link | 2024-05 | 9.52% | 60.19% (316) | 9.14% (48) | 4.95% (26) | 83.85% (135) | 28.00% | 60.19% (316) | 9.14% (48) | 23.43% (123) | 23.60% (38) |
13 | G-LLaVA-13B | Link | 2024-03 | 6.48% | 64.19% (337) | 4.57% (24) | 4.19% (22) | 86.59% (142) | 22.29% | 64.19% (337) | 4.57% (24) | 20.00% (105) | 35.98% (59) |
14 | DeepSeek-VL-7B | Link | 2024-03 | 6.29% | 69.14% (363) | 4.57% (24) | 4.00% (21) | 84.78% (117) | 20.95% | 69.14% (363) | 4.57% (24) | 18.67% (98) | 28.99% (40) |
15 | DeepSeek-VL-1.3B | Link | 2024-03 | 5.90% | 71.05% (373) | 2.67% (14) | 4.57% (24) | 82.61% (114) | 21.52% | 71.05% (373) | 2.67% (14) | 20.19% (106) | 23.19% (32) |
16 | LLaVA-1.6-13B | Link | 2024-03 | 5.24% | 69.14% (363) | 3.24% (17) | 3.62% (19) | 86.90% (126) | 22.00% | 69.14% (363) | 3.24% (17) | 20.38% (107) | 26.21% (38) |
17 | LLaVA-1.6-7B | Link | 2024-03 | 3.33% | 78.29% (411) | 2.48% (13) | 2.10% (11) | 89.11% (90) | 13.81% | 78.29% (411) | 2.48% (13) | 12.57% (66) | 34.65% (35) |
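The percentages in the table can be reproduced from the counts in parentheses. The sketch below is a reading inferred from the numbers, not an official scoring script. It assumes a denominator of 525 problems (for example, 164 / 525 = 31.24%), an Avg score that counts CM fully and IG at half weight, and an RM rate computed as RM / (RM + CM). The names `TOTAL`, `pct`, `avg_score`, and `rm_rate` are hypothetical helpers introduced for illustration.

```python
# Assumption: 525 problems in the denominator, inferred from the table
# (e.g. IK for GPT-4o: 164 / 525 = 31.24%). Not stated in this section.
TOTAL = 525


def pct(count, total=TOTAL):
    """Share of problems in a category, as a percentage."""
    return round(100 * count / total, 2)


def avg_score(ig, cm, total=TOTAL):
    """Avg as the table appears to compute it: CM counts fully, IG counts half."""
    return round(100 * (cm + 0.5 * ig) / total, 2)


def rm_rate(rm, cm):
    """RM as the table appears to compute it: RM / (RM + CM)."""
    return round(100 * rm / (rm + cm), 2)


# GPT-4o row: IK=164, IG=80, strict CM=185 / RM=96, loose CM=278 / RM=3.
print(pct(164))            # IK
print(avg_score(80, 185))  # Avg (Strict)
print(avg_score(80, 278))  # Avg (Loose)
print(rm_rate(96, 185))    # RM (Strict)
print(rm_rate(3, 278))     # RM (Loose)
```

Under these assumptions the GPT-4o row comes out as 31.24, 42.86, 60.57, 34.16, and 1.07, matching the table. RM + CM is 281 under both settings in that row, which suggests the loose setting reclassifies some RM problems as CM while leaving IK and IG unchanged.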
🚨 To submit your results to the leaderboard, please send your result JSON files to this email.