Visual Reasoning Benchmark
ClockBench
ClockBench evaluates whether models can read analog clocks, a task that is trivial for humans but that current frontier models struggle with.
Leaderboard
Rank | Model | Accuracy | Lab |
---|---|---|---|
- | Human Baseline | 89.1% | - |
1 | Gemini 2.5 Pro | 13.3% | Google |
2 | Gemini 2.5 Flash | 10.5% | Google |
3 | GPT-5 High | 8.4% | OpenAI |
4 | GPT-5 Mini | 5.6% | OpenAI |
5 | Claude Opus 4.1 | 5.6% | Anthropic |
6 | Qwen 2.5-VL 72B | 4.9% | Alibaba |
7 | Claude Sonnet 4 | 4.2% | Anthropic |
8 | Mistral Medium 3.1 | 2.8% | Mistral |
9 | GPT-4o | 2.1% | OpenAI |
10 | GPT-5 Nano | 2.1% | OpenAI |
11 | Grok 4 | 0.7% | xAI |

Results Summary
Although frontier models show strong reasoning skills, mathematical ability, and visual understanding on many benchmarks, they currently struggle to read analog clocks: the best model scores 13.3% against a human baseline of 89.1%.
One hypothesis is that this task sets a high bar for reasoning within the visual space (as opposed to text space).
More research is likely needed to understand whether these capabilities can be obtained by scaling existing paradigms or whether a novel approach is required.
Dataset
Sample Clocks
A few examples of the clocks used in the benchmark.

Questions
- Reading Time: Models are asked to determine whether a given clock shows a valid time. If valid, they should report the hours, minutes, seconds, date, month, and day of the week (based on what is present), in a structured JSON format.
- Adding or Subtracting Time: Models are asked to add or subtract varying amounts of time.
- Rotating Hands: Models are asked to rotate one of the hands (hour, minute, or second) by a specified angle, clockwise or counterclockwise.
- Shifting Time Zone: Models are asked to assume they are in New York during summer and report the corresponding time in various locations worldwide.
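
To make the arithmetic behind these question types concrete, here is a minimal Python sketch. It is not the benchmark's evaluation code. The function names, the JSON field names, and the assumption that rotating the minute hand also advances the hour hand (as on a geared clock) are all illustrative choices, since the exact question wording and answer schema are not given here. The time zone helper uses fixed UTC offsets, with New York in summer taken as UTC-4.

```python
import json
from datetime import timedelta

HALF_DAY = 12 * 3600  # seconds on a 12-hour dial


def hand_angles(hours, minutes, seconds):
    """Angles of the (hour, minute, second) hands, in degrees clockwise from 12."""
    second_angle = seconds * 6.0                    # 360 degrees / 60 seconds
    minute_angle = minutes * 6.0 + seconds / 10.0   # 6 degrees per minute
    hour_angle = (hours % 12) * 30.0 + minutes / 2.0 + seconds / 120.0
    return hour_angle, minute_angle, second_angle


def add_time(hours, minutes, seconds, delta):
    """Add a (possibly negative) timedelta to a 12-hour reading, wrapping around the dial."""
    total = (hours % 12) * 3600 + minutes * 60 + seconds
    total = (total + round(delta.total_seconds())) % HALF_DAY
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return (h or 12, m, s)  # show 12 rather than 0 on the dial


def rotate_minute_hand(hours, minutes, degrees):
    """Rotate the minute hand by `degrees` (positive = clockwise).

    Assumption: the hour hand follows, so 6 degrees equals one minute of elapsed time.
    """
    return add_time(hours, minutes, 0, timedelta(minutes=degrees / 6.0))


def shift_zone(hours, minutes, seconds, from_utc_offset, to_utc_offset):
    """Shift a reading between two fixed UTC offsets given in hours."""
    delta = timedelta(hours=to_utc_offset - from_utc_offset)
    return add_time(hours, minutes, seconds, delta)


def format_reading(hours, minutes, seconds):
    """Serialize a reading as JSON; the field names here are hypothetical."""
    return json.dumps(
        {"valid": True, "hours": hours, "minutes": minutes, "seconds": seconds}
    )
```

For example, `add_time(11, 50, 0, timedelta(minutes=25))` wraps past 12 and returns `(12, 15, 0)`, and `shift_zone(10, 10, 0, -4, 1)` moves a 10:10 reading from New York summer time (UTC-4) to London summer time (UTC+1), giving `(3, 10, 0)`. For arbitrary locations and dates, the standard library's `zoneinfo` module would handle daylight-saving rules instead of fixed offsets.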
Try Yourself
Interested in trying out ClockBench?
A small public dataset and sample evaluation code are available to everyone.
Please reach out to [email protected] with ideas, suggestions, questions or any other inquiries.