MMLU-Pro
0-100% accuracy
what it measures
A harder, cleaner rebuild of MMLU: broad multiple-choice knowledge across dozens of subjects, with ten options instead of four and the noisiest questions thrown out.
why it matters
It is a broad-knowledge floor. Useful for catching a model that is narrow, less useful for separating the leaders.
the takeTable stakes now. Below the high 80s you are not in the conversation; above it the number stops telling you much. I read it as a floor, not a ranking.
Source: https://arxiv.org/abs/2406.01574