Edison Labs

LabBench2

Real-world capabilities of AI systems on scientific research tasks.

LabBench2 is a multi-task benchmark for measuring whether AI systems can perform the kinds of work scientists actually do — reading literature, interpreting figures and tables, reasoning about experimental protocols, and operating on biological sequences. It is composed of multiple sub-benchmarks, each isolating a specific skill. Results below are aggregated across the suite; click into a sub-benchmark for its individual leaderboard.

Paper: arXiv · Code: GitHub · Dataset: Hugging Face

Top performers

[Bar chart: aggregate scores by model, grouped by config variant.]

Aggregate leaderboard

Mean score across all sub-benchmarks. Only models with coverage of at least 14 of the 15 sub-benchmarks are listed, since comparisons against partial-coverage runs would be misleading; 9 rows are hidden.
#   Model                 Variant       Mean score  Coverage
1   gpt-5-2               Tools + high  0.609       14/15
2   gpt-5-2-pro           Tools + high  0.583       14/15
3   gemini-3-pro-preview  Tools + high  0.564       14/15
4   claude-opus-4-6       Tools + high  0.519       14/15
5   claude-opus-4-5       Tools + high  0.471       14/15
6   gemini-3-pro-preview  Base          0.404       15/15
7   gpt-5-2-pro           Base          0.382       15/15
8   claude-opus-4-6       Base          0.349       15/15
9   gpt-5-2               Base          0.345       15/15
10  claude-opus-4-5       Base          0.325       14/15
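The aggregation and coverage rule above can be sketched in a few lines of Python. This is an illustrative sketch only (the leaderboard's actual code is not shown here): a run's mean is taken over the sub-benchmarks it actually completed, and runs covering fewer than 14 of the 15 sub-benchmarks are excluded.

```python
MIN_COVERAGE = 14      # minimum sub-benchmarks required for listing
NUM_SUBBENCHMARKS = 15 # total sub-benchmarks in the suite

def aggregate(scores):
    """scores: list of 15 entries, one per sub-benchmark;
    a float score, or None if the run skipped that sub-benchmark.
    Returns (mean, coverage string), or None if coverage is too low."""
    done = [s for s in scores if s is not None]
    if len(done) < MIN_COVERAGE:
        return None  # partial-coverage run: excluded from the table
    return sum(done) / len(done), f"{len(done)}/{NUM_SUBBENCHMARKS}"

# Illustrative made-up run that skipped one sub-benchmark:
run = [0.5] * 14 + [None]
print(aggregate(run))  # (0.5, '14/15')
```

Note that means are computed over available scores only, so a 14/15 run is not penalized for its missing sub-benchmark; this is why mixing it with lower-coverage runs would distort comparisons.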

Sub-benchmarks

Click any row for the per-sub-benchmark leaderboard.