The big picture: Benchmarking AI remains a thorny issue, with companies often accused of cherry-picking flattering results while burying less favorable ones. Instead of fixating on math and logic trials, perhaps it’s time for a more unconventional test – one that challenges AI in a way humans instinctively understand: Super Mario Bros. After all, if an AI assistant can’t strategically navigate past Goombas and Koopa Troopas, can we really trust it to operate in our complex world?
Researchers at the Hao AI Lab at UC San Diego put several leading language models to the test in Super Mario Bros., offering a fresh perspective on AI capabilities.
The experiment used an emulated version of the classic Nintendo game, integrated with a custom framework called GamingAgent, developed by the Hao AI Lab. This system let AI models control Mario by generating Python code. To guide their actions, the models received basic instructions, such as “Jump over that enemy,” along with screenshots of the current game state.
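To make the setup concrete, here is a minimal sketch of what such a perceive-decide-act loop could look like. This is not the GamingAgent API: the function names (`query_model`, `run_step`), the `actions` list convention, and the stubbed model response are all illustrative assumptions.

```python
import textwrap

def query_model(screenshot, instruction):
    """Stand-in for an LLM call that returns a snippet of Python code.

    In the real framework, the screenshot and instruction would be sent
    to a model like Claude 3.7; here we return a canned response.
    """
    return textwrap.dedent("""
        actions.append(("jump", 0.3))   # hold jump for 0.3 s to clear an enemy
        actions.append(("right", 0.5))  # keep moving right
    """)

def run_step(screenshot, instruction):
    """One cycle: ask the model for code, then exec it to fill `actions`."""
    actions = []
    code = query_model(screenshot, instruction)
    # A production agent would sandbox this; exec'ing raw model output is unsafe.
    exec(code, {"actions": actions})
    return actions

moves = run_step(screenshot=b"<png bytes>", instruction="Jump over that enemy")
print(moves)  # the decoded action sequence the emulator would execute
```

The key design point is that the model's output is itself executable code rather than a fixed action label, which is what makes latency matter so much: every extra second spent generating code is a second Mario keeps running.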
While Super Mario Bros. may seem like a simple 2D sidescroller, researchers discovered that it challenges AI to plan complex move sequences and adapt its strategy in real time.
Claude-3.7 was tested on Pokémon Red, but what about more real-time games like Super Mario 🍄🌟?
We threw AI gaming agents into LIVE Super Mario games and found Claude-3.7 outperformed other models with simple heuristics. 🤯
Claude-3.5 is also strong, but less capable of… pic.twitter.com/bqZVblwqX3
– Hao AI Lab (@haoailab) February 28, 2025
When it came to mastering Super Mario Bros., the top performer was Anthropic’s Claude 3.7, which showcased impressive reflexes, chaining together precise jumps and skillfully avoiding enemies. Even its predecessor, Claude 3.5, performed well.
Surprisingly, models known for strong reasoning, such as OpenAI’s GPT-4o and Google’s Gemini 1.5 Pro, lagged behind, struggling to keep up with the game’s demands.
As it turns out, logical reasoning isn’t the key to excelling at Super Mario Bros. – timing is. Even a slight delay can send Mario tumbling into a pit. The Hao researchers suggest that more deliberative models likely took too long to calculate their next moves, leading to frequent, untimely deaths.
Of course, using retro video games to benchmark AI is mostly a playful experiment rather than a serious evaluation. Whether an AI can beat Super Mario Bros. has little bearing on its real-world usefulness, but watching sophisticated models struggle with what seems like child’s play is undeniably entertaining.
For those curious to experiment, the Hao AI Lab has open-sourced its GamingAgent framework on GitHub.