
As artificial intelligence continues to smash through benchmarks and ace tests, experts are starting to question whether these traditional evaluation methods are still relevant. AI models like GPT-4.5 and Claude have made such significant strides that the usual accuracy tests are no longer sufficient to measure their true capabilities. So, what’s the next step? It seems the future of AI evaluation lies in human judgment.
Artificial intelligence has historically been tested using standardized benchmarks designed to approximate human knowledge and reasoning. For instance, benchmarks like the General Language Understanding Evaluation (GLUE), Massive Multitask Language Understanding (MMLU), and even the infamous “Humanity’s Last Exam” were all created to assess how well AI systems understand language and reason through difficult questions across many subjects. However, these tests are increasingly falling short as AI models get smarter and more capable.
According to AI experts, including Michael Gerstenhaber, head of API technologies at Anthropic, we’ve “saturated the benchmarks.” The problem isn’t that AI is running into a wall; it’s that these models are simply outperforming what traditional tests can capture. “We’ve reached a point where human judgment has to step in,” said Gerstenhaber.
This perspective is gaining traction across the industry. For instance, a recent paper published in The New England Journal of Medicine argues that in fields like medical AI, traditional benchmarks are no longer sufficient. The research, led by Adam Rodman at Boston’s Beth Israel Deaconess Medical Center, suggests that AI models are acing exams such as MedQA, a medical question-answering benchmark created at MIT, yet still fall short of the real-world demands of clinical practice. “Humans are the only way,” the authors conclude, advocating for methods like role-playing and human-computer interaction studies to improve AI’s application in medicine.
Similarly, the team behind OpenAI’s GPT models has been leaning increasingly on reinforcement learning from human feedback (RLHF) to improve performance. The method has humans review and grade the output of AI systems, steering the models toward more accurate, human-preferred responses. OpenAI’s evaluation of GPT-4.5, for example, emphasized “human preference measures,” reporting how human evaluators rated the model’s output rather than relying on automated test scores alone.
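For readers curious about the mechanics, the grading step is usually framed as a pairwise preference: a human picks the better of two responses, and a reward model is trained so that the preferred response scores higher. The sketch below illustrates that idea in Python; the scores and comparisons are invented for the example and are not drawn from any OpenAI system.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss used when fitting a reward model to human
    pairwise preferences: it shrinks as the gap between the score of the
    human-preferred response and the rejected one grows."""
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# Hypothetical scores a reward model might assign to (preferred, rejected)
# response pairs collected from human graders.
comparisons = [(2.1, 0.4), (1.3, 1.0), (0.2, 1.8)]
average = sum(preference_loss(c, r) for c, r in comparisons) / len(comparisons)
print(f"average preference loss: {average:.3f}")
```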
Google is also shifting focus. In its unveiling of the open-weight Gemma 3 model this month, the company downplayed traditional benchmarks and instead highlighted ratings from human evaluators. That feedback is aggregated much like the Elo rating system used in chess and other head-to-head competition: models are compared in pairs, and wins and losses against rated opponents determine overall standing.
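The Elo arithmetic itself is simple, which is part of its appeal for leaderboards built on human votes. The snippet below shows a single rating update after one blind, head-to-head comparison; the starting ratings and K-factor are illustrative, not taken from any particular leaderboard.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a head-to-head comparison.
    score_a is 1.0 if model A's answer was preferred, 0.0 if model B's was,
    and 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    change = k * (score_a - expected_a)
    return rating_a + change, rating_b - change

# Example: a 1500-rated model beats a 1600-rated one in a blind comparison,
# so it gains more points than it would against an equally rated opponent.
print(elo_update(1500.0, 1600.0, score_a=1.0))  # ~(1520.5, 1579.5)
```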
Even benchmark designers are adapting. François Chollet, creator of the ARC-AGI abstract reasoning test, has introduced a new version, ARC-AGI 2, which involves even more human participation. This new test uses live studies with real people to calibrate its difficulty, ensuring that benchmarks are aligned with human capabilities. The idea is to create a more realistic measure of AI’s performance, one that integrates human experience into the testing process.
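The details of that calibration loop aren't spelled out here, but the general recipe the article describes is straightforward: run candidate puzzles past live human participants, measure how often people solve them, and keep only the tasks that land in a target difficulty band. The snippet below is a hypothetical illustration of that filtering step, not ARC-AGI 2's actual pipeline.

```python
from collections import defaultdict

def solve_rates(attempts):
    """attempts: (task_id, solved) records gathered from a live human study."""
    solved, total = defaultdict(int), defaultdict(int)
    for task_id, ok in attempts:
        total[task_id] += 1
        solved[task_id] += int(ok)
    return {task: solved[task] / total[task] for task in total}

def calibrate(attempts, low=0.3, high=0.9):
    """Keep tasks in a target band: solvable by people, but not trivially easy."""
    return [task for task, rate in solve_rates(attempts).items() if low <= rate <= high]

# Toy study: task t1 is too easy, t3 is never solved, t2 lands in the band.
study = [("t1", True), ("t1", True), ("t2", True), ("t2", False), ("t3", False)]
print(calibrate(study))  # -> ['t2']
```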
This shift in focus towards human involvement in AI evaluation is a clear indication that we’re entering a new phase of AI development. As AI continues to evolve, integrating human feedback at every stage of its development will be crucial to ensure these systems work not just in theory, but in practice.
So, are we finally moving towards the era of human-powered AI? While we may not be there yet, it’s clear that the future of AI development depends on more than just automated tests—it depends on how well AI can align with human needs and expectations.