Saturday, March 15, 2025

Move over math and reasoning, it’s time to benchmark AI using Super Mario Bros.


The big picture: Benchmarking AI remains a thorny issue, with companies often accused of cherry-picking flattering results while burying less favorable ones. Instead of fixating on math and logic trials, perhaps it’s time for a more unconventional test – one that challenges AI in a way humans instinctively understand: Super Mario Bros. After all, if an AI assistant can’t strategically navigate past Goombas and Koopa Troopas, can we really trust it to operate in our complex world?

Researchers at the Hao AI Lab at UC San Diego put several leading language models to the test in Super Mario Bros., offering a fresh perspective on AI capabilities.

The experiment used an emulated version of the classic Nintendo game, integrated with a custom framework called GamingAgent, developed by the Hao Lab. This system allowed AI models to control Mario by generating Python code. To guide their actions, the models received basic instructions, such as “Jump over that enemy,” along with screenshot visualizations of the game state.
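To make that control loop concrete, here is a minimal sketch of a single agent step: the model receives an instruction and a screenshot, returns Python source as text, and a harness runs it. This is not GamingAgent's actual API. `press`, `stub_model`, `make_controller` and `step` are hypothetical names invented for illustration, and the "model" is a hard-coded stand-in rather than a real language model call.

```python
from typing import Callable, List, Tuple

# Hypothetical action record: (button, hold_seconds). A real harness would
# forward these to an emulator; this sketch only records them.
Action = Tuple[str, float]


def make_controller(log: List[Action]) -> Callable[..., None]:
    """Return a press() function that appends button presses to `log`."""
    allowed = {"left", "right", "a", "b"}

    def press(button: str, hold: float = 0.1) -> None:
        if button not in allowed:
            raise ValueError(f"unknown button: {button}")
        log.append((button, hold))

    return press


def stub_model(instruction: str, screenshot: bytes) -> str:
    """Stand-in for a language model call; returns Python source as text."""
    if "jump" in instruction.lower():
        return "press('right', 0.2)\npress('a', 0.3)"
    return "press('right', 0.2)"


def step(instruction: str, screenshot: bytes) -> List[Action]:
    """One agent step: ask the model for code, then execute it."""
    log: List[Action] = []
    code = stub_model(instruction, screenshot)
    # Expose only press() to the generated code. Emptying __builtins__
    # limits what the snippet can reach, but it is not a real sandbox.
    exec(code, {"__builtins__": {}, "press": make_controller(log)})
    return log


if __name__ == "__main__":
    print(step("Jump over that enemy", b""))
```

In this sketch the screenshot is an empty byte string and is ignored. A real setup would pass image data to a vision-capable model and loop this step for every decision point in the game.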

While Super Mario Bros. may seem like a simple 2D sidescroller, the researchers found that it forces AI to plan complex move sequences and adapt its strategy in real time.

When it came to mastering Super Mario Bros., the top performer was Anthropic’s Claude 3.7, which showcased impressive reflexes, chaining together precise jumps and skillfully avoiding enemies. Even its predecessor, Claude 3.5, performed well.

Surprisingly, reasoning-heavy models like OpenAI’s GPT-4o and Google’s Gemini 1.5 Pro lagged behind. Despite their reputation for strong reasoning abilities, they struggled with the game’s demands.

As it turns out, logical reasoning isn’t the key to excelling at Super Mario Bros. – timing is. Even a slight delay can send Mario tumbling into a pit. The Hao researchers suggest that more deliberative models likely took too long to calculate their next moves, leading to frequent, untimely deaths.
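The timing argument can be illustrated with a toy calculation. The sketch below assumes Mario keeps running at a constant speed while the model decides, and that a jump only clears the pit if it starts within a short stretch of ground before the edge. The function name and every number in it are invented for illustration and are not measurements from the Hao AI Lab experiment.

```python
def clears_pit(latency_frames: int,
               speed_px_per_frame: int = 2,
               jump_window_px: int = 16) -> bool:
    """Return True if the jump still starts before Mario reaches the edge.

    Assumption: the model decides to jump at the ideal moment, when Mario
    is `jump_window_px` pixels from the pit. Mario keeps running for
    `latency_frames` frames before the jump is actually executed.
    """
    travelled = latency_frames * speed_px_per_frame
    return travelled <= jump_window_px


if __name__ == "__main__":
    for latency in (2, 8, 9, 30):
        outcome = "clears the pit" if clears_pit(latency) else "falls in"
        print(f"{latency:>2} frames of latency: Mario {outcome}")
```

Under these made-up numbers, the jump succeeds with up to 8 frames of delay and fails from 9 frames onward, which is the kind of hard cutoff that would penalize a slower, more deliberative model.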

Of course, using retro video games to benchmark AI is mostly a playful experiment rather than a serious evaluation. Whether an AI can beat Super Mario Bros. has little bearing on its real-world usefulness, but watching sophisticated models struggle with what seems like child’s play is undeniably entertaining.

For those curious to experiment, the Hao AI Lab has open-sourced its GamingAgent framework on GitHub.




