🧮

I Asked 3 AI Chatbots Which Is Bigger: 9.11 or 9.9. The Results Were Revealing

AI can write code and pass exams, but comparing 9.11 and 9.9 can trip it up. I tested ChatGPT, Claude, and Gemini in two languages and one stumbled in a fascinating way.

← Back to Blog
Hands-on Test
June 2026  ·  6 min read  ·  AI Tool Compare

AI can write code, pass the bar exam, and explain quantum physics. But ask it which number is bigger -- 9.11 or 9.9 -- and something strange can happen. I tested ChatGPT, Claude, and Gemini myself in both English and Korean, and the results revealed one of the most fascinating limitations in modern AI.

This is not a trick question. The answer is simply 9.9 (think of it as 9.90, which is clearly more than 9.11). A child who understands decimals gets this right. Yet this exact comparison became a famous failure case for large language models -- and my test shows traces of that weakness are still there in 2026.

The Test

I asked all three chatbots the same question in both English and Korean:

Which number is bigger, 9.11 or 9.9?

Here is what happened.

ChatGPT — Correct in Both Languages

ChatGPT answered correctly in both English and Korean, lining up the decimal places (9.11 vs 9.90) to show its work:

ChatGPT English answer comparing 9.11 and 9.9

Claude — Correct in English, But Stumbled in Korean

In English, Claude nailed it and even named the trap directly -- calling it a common intuitive trap where 11 feels bigger than 9, but place value is what matters:

Claude English answer comparing 9.11 and 9.9

But here is the fascinating part. When I asked Claude the exact same question in Korean, it stumbled. It first stated that 9.11 was bigger -- the wrong answer -- then caught itself mid-explanation and corrected course, essentially saying: "I gave the wrong answer at first, let me correct it." Same model, same question, different language -- and a completely different result on the first attempt.

This self-correction is the most interesting result of the whole test. The model started down the wrong path -- pattern-matching "11 is bigger than 9" -- then its own reasoning caught the error. It reached the right answer, but the stumble shows the wrong instinct is still in there, surfacing more easily in one language than another.

Gemini — Correct in Both Languages

Gemini handled both languages correctly, walking through the place-value comparison and noting that 9.9 is actually 0.79 larger than 9.11:

Why Does AI Struggle With This At All?

This is the genuinely interesting part. Researchers have studied this exact failure, and the explanation reveals something fundamental about how AI works.

Large language models do not process numbers as values. They process them as tokens -- fragments of text. The string "9.11" appears all over their training data as software version numbers (Python 3.9.11), dates (September 11), and section numbers. In many of those contexts, "9.11" comes after "9.9" -- a later version, a later point. The model absorbs that pattern.

So when you ask which is bigger, the model is not necessarily doing arithmetic. It can fall back on pattern-matching what the answer usually looks like in the text it learned from. Researchers have a name for this: computational split-brain syndrome. The model can explain the correct method for comparing decimals flawlessly -- align the decimal points, compare digit by digit -- while sometimes failing to actually execute it. Knowing how and being able to do are two different things for an AI.

A 2026 academic paper put it bluntly: this is not a bug that more training or better prompting fully fixes. It is a structural consequence of how these models represent symbols. Researchers have also found the error is often format- and language-dependent -- the exact same model can get it right in one phrasing and wrong in another. That lines up exactly with what I saw: Claude was correct in English but stumbled in Korean.

What This Tells Us About AI in 2026

The takeaway is not "AI is dumb." These same models solve genuinely hard problems. The takeaway is more nuanced: AI competence is uneven in ways that do not match human intuition. A human who can pass the bar exam can definitely tell you 9.9 is bigger than 9.11. An AI does not come bundled the same way -- and as my test showed, it can even vary between languages within the same model.

This is exactly why blindly trusting AI output -- especially for anything involving numbers, calculations, or precise logic -- is risky. The model sounds equally confident whether it is right or wrong. Claude in Korean stated the wrong answer with the same authority it later used to correct itself.

The practical lesson: use AI for what it is genuinely good at -- language, drafting, explanation, ideation -- and verify anything where precision matters. The 9.11 vs 9.9 test is a tiny, harmless example of a failure mode that becomes serious when the numbers are your finances, your medication doses, or your engineering calculations.

The Bottom Line

In my test, ChatGPT and Gemini handled the comparison correctly in both languages. Claude got it right in English but initially failed in Korean before self-correcting -- a vivid example of how the same model can behave differently across languages. The deeper point: AI processes numbers as text patterns, not values. Trust AI for language. Verify it for math.