GPQA Diamond
0-100% accuracy
what it measures
Graduate-level multiple-choice science questions in biology, physics, and chemistry, hard enough that PhDs outside the subfield mostly get them wrong even with Google. Diamond is the hardest, cleanest subset.
why it matters
You cannot bluff it, and it still spreads the frontier out instead of pinning everyone near the ceiling.
the take
The reasoning number I trust most at the top. Strong here usually means strong where it counts.
Source: https://arxiv.org/abs/2311.12022