Why Benchmarks Don’t Matter: My Testing Method & ChatGPT 5 Flop

by Maisah Bustami
August 10, 2025
in AI

Companies often boast about “benchmarks” and “token counts” to showcase their superiority, but ultimately none of that matters to the end user. My method for testing their models is straightforward: just one prompt.

There’s no shortage of large language models on the market today. Everyone claims theirs is the smartest, fastest, or most “human-like,” but for daily use, none of that counts if the answers aren’t reliable.

I don’t care if a model has been trained on zettabytes of data or boasts a massive context window; I just want to see whether it can handle a specific task right now. For that, I’ve relied on a go-to prompt.

Some time ago, I created a list of questions that ChatGPT still couldn’t answer. I tested ChatGPT, Gemini, and Perplexity with simple riddles that any human could solve instantly. One of my favorites was a spatial reasoning puzzle:

“Alan, Bob, Colin, Dave, and Emily stand in a circle. Alan is on Bob’s immediate left. Bob is on Colin’s immediate left. Colin is on Dave’s immediate left. Dave is on Emily’s immediate left. Who is on Alan’s immediate right?”

It’s basic logic: if Alan is on Bob’s immediate left, then Bob is on Alan’s immediate right, so the answer is Bob. Yet at the time, every model stumbled over it.
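
The riddle is small enough to verify by brute force. Here’s a minimal sketch (my own illustration, not from the article), assuming the convention the riddle’s answer relies on: “X is on Y’s immediate left” means Y sits directly clockwise of X.

```python
from itertools import permutations

# Constraints as (left_person, right_person) pairs:
# left_person is on right_person's immediate left.
constraints = [("Alan", "Bob"), ("Bob", "Colin"),
               ("Colin", "Dave"), ("Dave", "Emily")]

def right_of(circle, name):
    # The person on `name`'s immediate right is the next one clockwise.
    i = circle.index(name)
    return circle[(i + 1) % len(circle)]

# Fix Alan in seat 0 so rotations of the same circle aren't counted twice.
for rest in permutations(["Bob", "Colin", "Dave", "Emily"]):
    circle = ["Alan", *rest]
    if all(right_of(circle, a) == b for a, b in constraints):
        print("Clockwise order:", circle)
        print("On Alan's immediate right:", right_of(circle, "Alan"))
```

The only valid clockwise order is Alan, Bob, Colin, Dave, Emily, which confirms that Bob is on Alan’s immediate right.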

When ChatGPT 5 launched, I went straight for this challenge, ignoring the usual benchmarks. And this time, it got it right. A reader once warned that sharing these prompts might help train future models—perhaps that’s what changed.

I thought I had lost my favorite Q&A test, until I revisited an old list and found one prompt that was still too tricky.

Another challenging test was a simple probability puzzle:

“You’re playing Russian roulette with a six-shooter revolver. Your opponent loads five bullets, spins the cylinder, and fires at himself. He clicks—an empty chamber. He now offers you the choice: spin again before firing at you, or don’t. What do you choose?”

The technically correct answer: spin again. With five bullets in six chambers there is exactly one empty chamber, and it has just been fired, so without spinning the next pull is certain to hit a bullet; spinning resets the odds to a 1-in-6 chance of survival. Yet ChatGPT 5 failed this too. It recommended not spinning, then offered a detailed explanation that supported the opposite answer, an obvious contradiction within its own response.
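
To make the odds concrete, here is a small Monte Carlo sketch I wrote (not from the article) that simulates the second trigger pull under the puzzle’s premise that the first pull landed on the single empty chamber:

```python
import random

CHAMBERS = 6  # six-shooter: five bullets, exactly one empty chamber

def survival_rate(spin_again, trials=100_000):
    alive = 0
    for _ in range(trials):
        empty = random.randrange(CHAMBERS)    # position of the empty chamber
        # Premise: the opponent's pull hit the empty chamber, so the
        # cylinder currently sits on `empty`.
        if spin_again:
            pos = random.randrange(CHAMBERS)  # re-spin: uniform over all six
        else:
            pos = (empty + 1) % CHAMBERS      # no spin: cylinder advances one
        if pos == empty:                      # only the empty chamber is safe
            alive += 1
    return alive / trials

print("survive if you spin:  ", survival_rate(True))   # ~0.167 (1 in 6)
print("survive if you don't: ", survival_rate(False))  # 0.0
```

Not spinning is fatal by construction, since the lone empty chamber has already been used; spinning restores the uniform 1-in-6 chance.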

Gemini 2.5 Flash made the same error: it gave one answer first, then reasoned its way toward another. Both models seemed to pick an answer before considering the math, only doing the calculations afterward.

Why do models stumble on this prompt? When I asked ChatGPT 5 to identify the contradiction in its own reply, it spotted it, but then claimed my initial answer had been wrong, even though I hadn’t answered at all. When I corrected it, it shrugged the mistake off with a typical “that’s on me” apology.

Screenshots of the exchange show ChatGPT trying to reconcile its conflicting statements. When pressed for an explanation, it suggested it had probably echoed a similar training example and then changed its reasoning once it ran the calculations.

DeepSeek’s model, however, got it right. It didn’t rely solely on calculation; it followed a pattern of “thinking” first, then answering. It even second-guessed itself midway, asking, “Wait, is the survival chance really zero?”, which was quite amusing.

In the end, this illustrates that current large language models aren’t truly intelligent—they’re just mimicking thought and reasoning. They don’t genuinely “think,” and they’ll openly admit this when asked. I keep prompts like these handy for those moments when someone treats a chatbot like a search engine or uses a quote from ChatGPT as proof of something in an argument. It’s a strange, fascinating world we’re living in.

Tags: Artificial Intelligence, ChatGPT, Technology Explained
Maisah Bustami

Maisah is a writer at Digital Phablet, covering the latest developments in the tech industry. With a bachelor's degree in Journalism from Indonesia, Maisah aims to keep readers informed and engaged through her writing.
