Edison Labs
Benchmarks / LabBench2

ProtocolQA2


16 runs · 8 models · evaluated by HybridEvaluator.

#   Model                  Variant           Mode  Score  Avg. dur  Tokens  Date
1   gemini-3-pro-preview   tools,high        file  0.616  1.6m      451.4k  2026-01-27
2   claude-opus-4-6        tools,high        file  0.584  34.5s     5.9M    2026-03-22
3   claude-opus-4-6        -                 file  0.536  16.6s     5.0M    2026-03-20
4   gemini-3-pro-preview   -                 file  0.536  1.0m      436.9k  2026-01-27
5   claude-opus-4-5        tools,high        file  0.512  30.2s     4.2M    2026-03-22
6   gpt-5-2-pro            tools,high        file  0.504  2.0m      1.4M    2026-01-26
7   gpt-5-2-pro            tools,high_retry  file  0.500  29.5s     24.9k   2026-01-27
8   claude-opus-4-5_retry  -                 file  0.484  13.7s     541.7k  2026-01-26
9   gpt-5-2-pro            -                 file  0.472  1.3m      4.4M    2026-01-27
10  gpt-5-2                tools,high        file  0.416  1.3m      4.9M    2026-01-27
11  gpt-5-2                -                 file  0.360  8.4s      2.4M    2026-01-27
12  claude-opus-4-5        -                 file  0.328  13.0s     3.6M    2026-03-20
13  claude-opus-4-5        tools,high_retry  file  0.314  22.1s     2.4M    2026-01-26
14  gpt-5-2-pro_retry      -                 file  -      0.0s      0       2026-01-26
15  gpt-5-2                tools,high_retry  file  -      0.0s      0       2026-01-26
16  gpt-5-2_retry          -                 file  -      0.0s      0       2026-01-26
