LabBench2
Real-world capabilities of AI systems on scientific research tasks.
LabBench2 is a multi-task benchmark for measuring whether AI systems can perform the kinds of work scientists actually do — reading literature, interpreting figures and tables, reasoning about experimental protocols, and operating on biological sequences. It is composed of multiple sub-benchmarks, each isolating a specific skill. Results below are aggregated across the suite; click into a sub-benchmark for its individual leaderboard.
Aggregate leaderboard
Mean score across all sub-benchmarks. Only models with coverage ≥ 14/15 are listed; comparisons against partial-coverage runs would be misleading. 9 rows hidden.

| # | Model | Variant | Mean score | Coverage |
|---|---|---|---|---|
| 1 | gpt-5-2 | Tools + high | 0.609 | 14/15 |
| 2 | gpt-5-2-pro | Tools + high | 0.583 | 14/15 |
| 3 | gemini-3-pro-preview | Tools + high | 0.564 | 14/15 |
| 4 | claude-opus-4-6 | Tools + high | 0.519 | 14/15 |
| 5 | claude-opus-4-5 | Tools + high | 0.471 | 14/15 |
| 6 | gemini-3-pro-preview | Base | 0.404 | 15/15 |
| 7 | gpt-5-2-pro | Base | 0.382 | 15/15 |
| 8 | claude-opus-4-6 | Base | 0.349 | 15/15 |
| 9 | gpt-5-2 | Base | 0.345 | 15/15 |
| 10 | claude-opus-4-5 | Base | 0.325 | 14/15 |
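
The aggregation rule described above (a mean over sub-benchmark scores, with a coverage cutoff) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `aggregate`, the input layout, and the toy scores are hypothetical, and it assumes the mean is taken over the sub-benchmarks a run actually covers, which the page does not state explicitly.

```python
from statistics import mean

# Values taken from the leaderboard description: 15 sub-benchmarks,
# and a minimum coverage of 14 for a row to be listed.
TOTAL_SUB_BENCHMARKS = 15
MIN_COVERAGE = 14


def aggregate(scores_by_run, total=TOTAL_SUB_BENCHMARKS, min_coverage=MIN_COVERAGE):
    """Build leaderboard rows from {(model, variant): {sub_benchmark: score}}.

    Assumption: the mean is computed over covered sub-benchmarks only.
    Returns (rows sorted by mean score descending, number of hidden rows).
    """
    rows = []
    hidden = 0
    for (model, variant), scores in scores_by_run.items():
        coverage = len(scores)
        if coverage < min_coverage:
            hidden += 1  # partial-coverage runs are not listed
            continue
        rows.append({
            "model": model,
            "variant": variant,
            "mean_score": round(mean(scores.values()), 3),
            "coverage": f"{coverage}/{total}",
        })
    rows.sort(key=lambda r: r["mean_score"], reverse=True)
    return rows, hidden


# Toy input with made-up models and scores (not real LabBench2 results),
# scaled down to 3 sub-benchmarks and a cutoff of 2 for readability.
toy = {
    ("model-a", "Base"): {"litqa3": 0.5, "seqqa2": 0.25, "cloning": 0.75},
    ("model-b", "Base"): {"litqa3": 1.0, "seqqa2": 0.5},
    ("model-c", "Base"): {"litqa3": 1.0},
}
rows, hidden = aggregate(toy, total=3, min_coverage=2)
for row in rows:
    print(row)
print("rows hidden:", hidden)
```

In the toy run, `model-c` covers only one sub-benchmark and is hidden, which mirrors how partial-coverage rows are left out of the table above.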
Sub-benchmarks
Click any row for the per-sub-benchmark leaderboard.

| Sub-benchmark | Slug | Modes | Models | Runs | Current leader | Leader score |
|---|---|---|---|---|---|---|
| Cloning | cloning | 3 | 5 | 32 | gemini-3-pro-preview | 0.429 |
| DBQA2 | dbqa2 | 1 | 5 | 10 | gemini-3-pro-preview | 0.453 |
| FigQA2 | figqa2 | 1 | 5 | 10 | gpt-5-2 | 0.426 |
| FigQA2 (image) | figqa2-img | 1 | 5 | 10 | gpt-5-2 | 0.663 |
| FigQA2 (pdf) | figqa2-pdf | 1 | 5 | 10 | gpt-5-2 | 0.644 |
| LitQA3 | litqa3 | 1 | 5 | 10 | gpt-5-2-pro | 0.851 |
| PatentQA | patentqa | 1 | 5 | 10 | gpt-5-2-pro | 0.909 |
| ProtocolQA2 | protocolqa2 | 1 | 8 | 16 | gemini-3-pro-preview | 0.616 |
| SeqQA2 | seqqa2 | 3 | 7 | 33 | gemini-3-pro-preview | 0.525 |
| SourceQuality | sourcequality | 1 | 5 | 10 | gemini-3-pro-preview | 0.900 |
| SuppQA2 | suppqa2 | 1 | 5 | 10 | gpt-5-2-pro | 0.368 |
| TableQA2 | tableqa2 | 1 | 5 | 10 | gpt-5-2-pro | 0.700 |
| TableQA2 (image) | tableqa2-img | 1 | 5 | 10 | claude-opus-4-5 | 0.950 |
| TableQA2 (pdf) | tableqa2-pdf | 1 | 5 | 10 | claude-opus-4-6 | 0.880 |
| TrialQA | trialqa | 1 | 5 | 10 | gpt-5-2-pro | 0.933 |