Humanity's Last Exam: The Ultimate AI Humiliation (Metaphorically, Of Course)

Move over, Turing Test: there's a new sheriff in town, and it's called Humanity's Last Exam. Sounds dramatic, right? Well, it's fitting, because this latest benchmark is not just here to challenge AI; it's here to make every self-proclaimed "advanced" model cry binary tears. If you've been living under a rock (or just don't subscribe to overly smug AI newsletters), here's the scoop: even the most powerful AI models on the planet are getting absolutely obliterated, answering fewer than 10% of the questions correctly. That's not a typo. Under. Ten. Percent.


What Is Humanity’s Last Exam?

This delightful little torture device for AI brains was cooked up by Scale AI and the Center for AI Safety (CAIS). They gathered 70,000 questions from nearly 1,000 experts across 50 countries. After a rigorous review process, 3,000 questions made the cut—spanning topics like mathematics, natural sciences, humanities, and multimodal challenges (aka "look at this chart and tell me why you’re clueless").

Here’s an actual example to melt your brain:

Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone?

Answer: Yeah, me neither.

If you’re curious (and slightly masochistic), sample questions are available at lastexam.ai. Go ahead, prove you’re smarter than Skynet’s rejects.
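
If you'd rather interrogate the data than your own ego, here's a minimal sketch for poking at the public question set programmatically. The Hugging Face dataset id, split name, and field names below are assumptions on my part; check lastexam.ai for the actual release details before trusting any of them.

```python
# A minimal sketch for browsing the public question set.
# The dataset id ("cais/hle"), split name, and field names are assumptions --
# verify against the actual release announced at lastexam.ai.
from collections import Counter

from datasets import load_dataset  # pip install datasets

hle = load_dataset("cais/hle", split="test")  # id and split are assumptions

# Tally how many questions fall into each subject area.
by_category = Counter(row["category"] for row in hle)
for category, count in by_category.most_common():
    print(f"{category}: {count}")

# Print one question so you can confirm, firsthand, that you don't know either.
print(hle[0]["question"])
```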


Who Failed and How Hard?

Let’s name names. OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s mysterious "o1" all had their shot at glory. Spoiler alert: they flopped. With sub-10% scores, these models collectively facepalmed their way through the exam. But hey, the researchers are optimists: they predict these AI wunderkinder could clear 50% by year's end. Bold claim, but okay.

Here’s the kicker: the test’s authors admit the tasks are purely academic. No creativity or abstract thinking required. Translation: even if these models somehow passed, we’re still not handing them the Nobel Prize. They’re not thinking; they’re regurgitating. And badly, at that.
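
For the curious, that embarrassing headline number is just graded answers divided by total questions. Here's a toy sketch of the arithmetic; the ask_model stub and placeholder questions are purely hypothetical, and the real benchmark's grading is more involved than simple string comparison.

```python
# A toy sketch of how a sub-10% accuracy figure comes about. ask_model and the
# placeholder questions are hypothetical stand-ins; real grading is stricter
# and more nuanced than lowercase string matching.

def ask_model(question: str) -> str:
    """Stand-in for whatever API call you'd make to the model under test."""
    return "42"  # confidently wrong, in the spirit of the exam

questions = [
    {"question": "Placeholder expert question #1", "answer": "the right answer"},
    {"question": "Placeholder expert question #2", "answer": "another right answer"},
]

correct = sum(
    ask_model(item["question"]).strip().lower() == item["answer"].strip().lower()
    for item in questions
)
print(f"Accuracy: {correct / len(questions):.1%}")  # spoiler: it prints 0.0%
```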


Why Should You Care?

Aside from the comedic value of watching billion-dollar AI systems fail miserably, Humanity’s Last Exam serves a deeper purpose. It’s designed as a reference point for scientists and policymakers to assess AI capabilities without all the marketing fluff. Plus, it’s a sobering reminder that no matter how many petaflops of compute you throw at a problem, understanding remains the Holy Grail of AI.


Who’s Behind This Sadistic Test?

Scale AI and CAIS, both based in San Francisco, are the masterminds. Scale AI supplies datasets for training AIs (think: data, but make it fancy). CAIS is a non-profit laser-focused on AI safety and ethics, which basically makes them the designated babysitters of Skynet. Dan Hendrycks, co-founder of CAIS, has a history of releasing soul-crushing math benchmarks. And in related benchmark drama: OpenAI quietly funded a similar gauntlet, Epoch AI's FrontierMath, where its own o3 model left the competition in the dust yet still topped out at a pathetic 25.2% accuracy.

And let’s not forget the irony of "AI safety" organizations essentially serving as gatekeepers to a gauntlet designed to keep AI humble. Humble—or just utterly humiliated. Either works.


Final Thoughts: A Test for Humanity Too?

While AI continues its quest for dominance (or just trying not to embarrass itself), Humanity’s Last Exam feels oddly poetic. It’s not just a test for machines; it’s a benchmark for us humans to pause and reflect. Are we really building intelligence, or just fancier calculators? One thing’s for sure: if this is the "last exam," humanity might still have the upper hand. For now. Until, of course, some rogue AI decides it’s done playing by our rules and writes its own damn test. At that point, we might all be taking Skynet’s First and Final Exam.


#ai #humanityslastexam #failureisfunny #aisafety