Facepalm: Generative AI tools are able to perform the sorts of tasks that once seemed the stuff of sci-fi, but most of them still struggle with many basic skills, including reading analog clocks and calendars. A new study has found that overall, AI systems read clock faces correctly less than a quarter of the time.
A team of researchers at the University of Edinburgh tested several leading multimodal large language models to see how well they could answer questions based on images of clocks and calendars.
The systems tested were Google DeepMind’s Gemini 2.0, Anthropic’s Claude 3.5 Sonnet, Meta’s Llama 3.2-11B-Vision-Instruct, Alibaba’s Qwen2-VL7B-Instruct, ModelBest’s MiniCPM-V-2.6, and OpenAI’s GPT-4o and o1.
The images showed various types of clocks: some with Roman numerals, some with and some without second hands, and dials in different colors.
The systems read the clocks correctly less than 25% of the time. They struggled more with clocks that used Roman numerals and stylized hands.
The models’ performance didn’t improve when the second hand was removed, leading the researchers to suggest that the problem lies in detecting the clock’s hands and interpreting the angles on the clock face.
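For context, the geometry the models are being asked to invert is straightforward. A minimal Python sketch (an illustration of the underlying arithmetic, not code from the study) maps a time to hand angles and back:

```python
def hand_angles(hour, minute):
    """Angles in degrees, measured clockwise from 12 o'clock."""
    minute_angle = minute * 6.0                      # 360 deg / 60 min
    hour_angle = (hour % 12) * 30.0 + minute * 0.5   # 360 deg / 12 h, plus drift
    return hour_angle, minute_angle

def read_clock(hour_angle, minute_angle):
    """The inverse problem a vision model faces: angles back to a time."""
    minute = round(minute_angle / 6.0) % 60
    hour = int(hour_angle // 30) % 12 or 12          # map 0 to 12
    return hour, minute
```

Computing `read_clock` is trivial once the angles are known; the study suggests the models stumble at the earlier step of extracting those angles from the image.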
Using 10 years of calendar images, the researchers asked questions such as “What day of the week is New Year’s Day?” and “What is the 153rd day of the year?”
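Questions like these are a few lines of date arithmetic in ordinary code. A minimal Python sketch, using 2024 as an example year (the study’s specific years aren’t given here):

```python
from datetime import date, timedelta

year = 2024  # example year; the study drew on 10 years of calendars

# "What day of the week is New Year's Day?"
new_years_weekday = date(year, 1, 1).strftime("%A")

# "What is the 153rd day of the year?" (day 1 is January 1)
day_153 = date(year, 1, 1) + timedelta(days=152)
```

For 2024 this gives a Monday for New Year’s Day and June 1 for the 153rd day; the models had to reach the same answers by reading a calendar image.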
Even the most successful AI models got the calendar questions wrong 20 percent of the time.
Success rates varied by system: Gemini 2.0 scored highest on the clock test, while o1 answered the calendar questions correctly 80% of the time.
“Most people can tell the time and use calendars from an early age,” said study lead Rohit Saxena, from the University of Edinburgh’s School of Informatics. “Our findings highlight a significant gap in the ability of AI to carry out what are quite basic skills for people. These shortfalls must be addressed if AI systems are to be successfully integrated into time-sensitive, real-world applications, such as scheduling, automation and assistive technologies.”
Aryo Gema, another researcher from Edinburgh’s School of Informatics, said, “AI research today often emphasises complex reasoning tasks, but ironically, many systems still struggle when it comes to simpler, everyday tasks.”
The findings are reported in a peer-reviewed paper that will be presented at the Reasoning and Planning for Large Language Models workshop at the Thirteenth International Conference on Learning Representations (ICLR) in Singapore on April 28. The paper is currently available on the preprint server arXiv.
This isn’t the first study this month showing AI systems still make plenty of mistakes. The Tow Center for Digital Journalism studied eight AI search engines and found that they are inaccurate 60 percent of the time. The worst culprit was Grok-3, which was 94 percent inaccurate.