Visual Reasoning Benchmark

ClockBench

ClockBench evaluates whether models can read analog clocks, a task that is trivial for humans but that current frontier models struggle with.

Clock Faces: 36
Clocks: 180
Questions: 720
Human Accuracy: 90.7%
Top Model Accuracy: 39.4%

Leaderboard

Rank  Model                      Accuracy  Lab
-     Human Baseline             90.7%     -
1     Qwen 3-VL 235B Instruct    39.4%     Alibaba
2     GPT-5 Chat                 32.8%     OpenAI
3     Gemini 3.1 Pro             32.2%     Google
4     Gemini 3 Pro               28.9%     Google
5     Gemini 2.5 Pro             18.9%     Google
6     GPT-5.2 High               15.0%     OpenAI
7     Gemini Robotics ER 1.5     15.0%     Google
8     o3 Pro                     14.4%     OpenAI
9     Qwen 3-VL 235B Thinking    14.4%     Alibaba
10    o3 High                    12.2%     OpenAI
11    Gemini 2.5 Flash           11.1%     Google
12    GPT-5 High                 11.1%     OpenAI
13    GPT-5 Pro                  11.1%     OpenAI
14    Mistral Medium 3.1         10.0%     Mistral
15    Claude Opus 4.6            8.9%      Anthropic
16    GPT-5 Mini                 8.9%      OpenAI
17    Claude Opus 4.1            8.3%      Anthropic
18    Claude Sonnet 4.5          7.2%      Anthropic
19    Qwen 2.5-VL 72B            6.1%      Alibaba
20    Claude Sonnet 4            6.1%      Anthropic
21    GPT-4o                     5.0%      OpenAI
22    GPT-5 Nano                 3.9%      OpenAI
23    Grok 4 Fast                3.9%      xAI

ClockBench AI Benchmark (chart: accuracy by model)

Results Summary

Although frontier models show strong reasoning, mathematical ability, and visual understanding on many benchmarks, they currently struggle to read analog clocks: the top model reaches 39.4% accuracy against a human baseline of 90.7%.

One hypothesis is that this task sets a high bar for reasoning in visual space, as opposed to text space.

More research is likely needed to understand whether these capabilities can be obtained by scaling existing paradigms or whether a novel approach is required.

Dataset

Sample Clocks

A few examples of the clocks used in the benchmark.

Sample clocks from ClockBench

Questions

  1. Reading Time
    Models are asked to determine whether a given clock shows a valid time. If valid, they should report the hours, minutes, seconds, date, month, and day of the week (based on what is present), in a structured JSON format.
  2. Adding or Subtracting Time
    Models are asked to add or subtract varying amounts of time.
  3. Rotating Hands
    Models are asked to rotate one of the hands (hour, minute, or second) by a specified angle, clockwise or counterclockwise.
  4. Shifting Time Zone
    Models are asked to assume they are in New York during summer and report the corresponding time in various locations worldwide.
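
The arithmetic behind question types 2 to 4 can be sketched in Python as follows. This is an illustration only, not the benchmark's evaluation code. The field names (`valid`, `hours`, `minutes`, `seconds`), the function names, and the fixed UTC offsets are assumptions made for the example, as is the idea that rotating one hand advances the whole clock the way a geared mechanical movement would. New York in summer is UTC-4; London in summer (UTC+1) is used purely as an example target location.

```python
import json
from datetime import timedelta

# Seconds of clock time represented by one degree of rotation of each hand.
# A full turn (360 degrees) is 12 h for the hour hand, 60 min for the minute
# hand, and 60 s for the second hand.
SECONDS_PER_DEGREE = {
    "hour": 12 * 3600 / 360,   # 120 s per degree
    "minute": 3600 / 360,      # 10 s per degree
    "second": 60 / 360,        # 1/6 s per degree
}

HALF_DAY = 12 * 3600  # a 12-hour dial wraps every 43,200 seconds


def to_seconds(hours, minutes, seconds=0):
    """Convert a dial reading to seconds past 12:00:00."""
    return (hours % 12) * 3600 + minutes * 60 + seconds


def to_reading(total_seconds):
    """Convert seconds past 12:00:00 back to a dial reading (hypothetical schema)."""
    total = round(total_seconds) % HALF_DAY
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return {"valid": True, "hours": hours or 12, "minutes": minutes, "seconds": seconds}


def add_time(reading, delta):
    """Question type 2: add (or, with a negative delta, subtract) time."""
    start = to_seconds(reading["hours"], reading["minutes"], reading["seconds"])
    return to_reading(start + delta.total_seconds())


def rotate_hand(reading, hand, degrees, clockwise=True):
    """Question type 3: rotate one hand, assuming the other hands follow."""
    sign = 1 if clockwise else -1
    shift = sign * degrees * SECONDS_PER_DEGREE[hand]
    return add_time(reading, timedelta(seconds=shift))


def shift_zone(reading, source_utc_offset, target_utc_offset):
    """Question type 4: move between zones given fixed UTC offsets in hours."""
    return add_time(reading, timedelta(hours=target_utc_offset - source_utc_offset))


start = {"valid": True, "hours": 10, "minutes": 10, "seconds": 0}
print(json.dumps(rotate_hand(start, "minute", 90)))
# {"valid": true, "hours": 10, "minutes": 25, "seconds": 0}
print(json.dumps(shift_zone(start, -4, 1)))
# {"valid": true, "hours": 3, "minutes": 10, "seconds": 0}
```

Fixed offsets are used instead of a time zone database because a clock face carries no year, so daylight-saving rules cannot be looked up from the reading alone. The sketch also ignores AM/PM and the date, month, and weekday fields that some clocks display.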

Try Yourself

Interested in trying out ClockBench?
A small public dataset and sample evaluation code are available to everyone.

Public Dataset
Alek Safar
LinkedIn, X.com

Please reach out to [email protected] with ideas, suggestions, questions or any other inquiries.