Edison Labs

LabBench2

Real-world capabilities of AI systems on scientific research tasks.

LabBench2 is a multi-task benchmark for measuring whether AI systems can perform the kinds of work scientists actually do — reading literature, interpreting figures and tables, reasoning about experimental protocols, and operating on biological sequences. It is composed of multiple sub-benchmarks, each isolating a specific skill. Results below are aggregated across the suite; click into a sub-benchmark for its individual leaderboard.

Paper: arXiv · Code: GitHub · Dataset: Hugging Face

Top performers

[Bar chart: aggregate scores by model, grouped by config variant.]

Aggregate leaderboard

Mean score across all sub-benchmarks. Only models with coverage of at least 14 of the 15 sub-benchmarks are listed, since comparisons against partial-coverage runs would be misleading; 9 rows are hidden.
#   Model                 Variant       Mean score  Coverage
1   gpt-5-2               Tools + high  0.609       14/15
2   gpt-5-2-pro           Tools + high  0.583       14/15
3   gemini-3-pro-preview  Tools + high  0.564       14/15
4   claude-opus-4-6       Tools + high  0.519       14/15
5   claude-opus-4-5       Tools + high  0.471       14/15
6   gemini-3-pro-preview  Base          0.404       15/15
7   gpt-5-2-pro           Base          0.382       15/15
8   claude-opus-4-6       Base          0.349       15/15
9   gpt-5-2               Base          0.345       15/15
10  claude-opus-4-5       Base          0.325       14/15
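The aggregation and coverage rule above can be sketched in a few lines of Python. This is an illustrative sketch only (the leaderboard's actual code is not shown here): a run's mean is taken over the sub-benchmarks it actually completed, and runs covering fewer than 14 of the 15 sub-benchmarks are excluded.

```python
MIN_COVERAGE = 14      # minimum sub-benchmarks required for listing
NUM_SUBBENCHMARKS = 15 # total sub-benchmarks in the suite

def aggregate(scores):
    """scores: list of 15 entries, one per sub-benchmark;
    a float score, or None if the run skipped that sub-benchmark.
    Returns (mean, coverage string), or None if coverage is too low."""
    done = [s for s in scores if s is not None]
    if len(done) < MIN_COVERAGE:
        return None  # partial-coverage run: excluded from the table
    return sum(done) / len(done), f"{len(done)}/{NUM_SUBBENCHMARKS}"

# Illustrative made-up run that skipped one sub-benchmark:
run = [0.5] * 14 + [None]
print(aggregate(run))  # (0.5, '14/15')
```

Note that means are computed over available scores only, so a 14/15 run is not penalized for its missing sub-benchmark; this is why mixing it with lower-coverage runs would distort comparisons.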

Sub-benchmarks

Click any row for the per-sub-benchmark leaderboard.