Benchmarks
Public benchmark suites and their leaderboards.
LabBench2
Real-world capabilities of AI systems on scientific research tasks.
15 sub-benchmarks
9 models
201 runs
BixBench
Coming soon
Bioinformatics workflows under realistic constraints.
HLE Gold
Coming soon
Humanity's Last Exam: a curated gold-set subset.