Digital Phablet

Why Benchmarks Don’t Matter: My Testing Method & ChatGPT 5 Flop

by Maisah Bustami
August 10, 2025
in AI

Companies often boast about “benchmarks” and “token counts” to showcase their superiority, but ultimately, none of that matters to the end user. My own method for testing a model is straightforward: just one prompt.


There’s no shortage of large language models on the market today. Everyone claims theirs is the smartest, fastest, or most “human-like,” but for daily use, none of that counts if the answers aren’t reliable.

I don’t care if a model has been trained on zettabytes of data or boasts a massive context window. I just want to see if it can handle a specific task right now. For this, I’ve relied on a go-to prompt.

Some time ago, I created a list of questions that ChatGPT still couldn’t answer. I tested ChatGPT, Gemini, and Perplexity with simple riddles that any human could solve instantly. One of my favorites was a spatial reasoning puzzle:


“Alan, Bob, Colin, Dave, and Emily stand in a circle. Alan is on Bob’s immediate left. Bob is on Colin’s immediate left. Colin is on Dave’s immediate left. Dave is on Emily’s immediate left. Who is on Alan’s immediate right?”

It’s basic logic: if Alan is on Bob’s immediate left, then Bob is on Alan’s right. Yet, at that time, every model stumbled over it.
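The logic can be checked by brute force. The following is a minimal sketch, not part of the original test: it tries every seating order and keeps the ones that satisfy the four statements in the riddle. The seating convention (the person at the next index sits on your immediate left) and the names `LEFT_OF` and `on_alans_right` are assumptions made for illustration; mirroring the convention gives the same answer.

```python
from itertools import permutations

PEOPLE = ["Alan", "Bob", "Colin", "Dave", "Emily"]

# (x, y) means "x is on y's immediate left", exactly as the riddle states.
LEFT_OF = [("Alan", "Bob"), ("Bob", "Colin"), ("Colin", "Dave"), ("Dave", "Emily")]


def on_alans_right():
    """Return every possible answer to 'Who is on Alan's immediate right?'."""
    n = len(PEOPLE)
    answers = set()
    for seats in permutations(PEOPLE):
        pos = {name: i for i, name in enumerate(seats)}
        # Assumed convention: the person at index i + 1 (wrapping around)
        # is on the immediate left of the person at index i.
        if all(pos[x] == (pos[y] + 1) % n for x, y in LEFT_OF):
            # The immediate right is then the person at index i - 1.
            answers.add(seats[(pos["Alan"] - 1) % n])
    return answers


print(on_alans_right())
```

Every valid arrangement is a rotation of the same circle, so the set contains a single name: Bob.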

When ChatGPT 5 launched, I went straight for this challenge, ignoring the usual benchmarks. And this time, it got it right. A reader once warned that sharing these prompts might help train future models—perhaps that’s what changed.

So I thought I had lost my favorite Q&A test, until I revisited an old list and found one prompt that was still too tricky.

Another challenging test was a simple probability puzzle:

“You’re playing Russian roulette with a six-shooter revolver. Your opponent loads five bullets, spins the cylinder, and fires at himself. He clicks—an empty chamber. He now offers you the choice: spin again before firing at you, or don’t. What do you choose?”


The technically correct answer: spin again. With five bullets in six chambers, the only empty chamber was just used, so without spinning the next chamber is certain to contain a bullet. Spinning resets your odds of survival to 1 in 6. However, ChatGPT 5 failed this too. It recommended not spinning, then offered a detailed explanation that strangely supported the opposite answer, an obvious contradiction within its own response.
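The arithmetic behind that answer is short enough to enumerate exactly. This is a minimal sketch under two assumptions that the puzzle leaves implicit: a spin lands on each chamber with equal probability, and the cylinder advances by exactly one chamber after each trigger pull. The function names are made up for illustration.

```python
from fractions import Fraction

# True = loaded chamber. Five bullets in a six-shooter leave one empty chamber.
cylinder = [True] * 5 + [False]


def survival_without_spin(cylinder):
    """Chance the next chamber is empty, given the opponent just hit an empty one."""
    n = len(cylinder)
    # Starting positions consistent with the opponent's click.
    clicks = [i for i in range(n) if not cylinder[i]]
    # Assumption: the cylinder advances by one chamber after each pull.
    survived = sum(1 for i in clicks if not cylinder[(i + 1) % n])
    return Fraction(survived, len(clicks))


def survival_with_spin(cylinder):
    """Chance a fresh spin lands on an empty chamber."""
    empty = sum(1 for loaded in cylinder if not loaded)
    return Fraction(empty, len(cylinder))


print("no spin:", survival_without_spin(cylinder))
print("spin:   ", survival_with_spin(cylinder))
```

Under these assumptions the survival chance is 0 without a spin and 1/6 with one, which is why spinning is the better choice even though 1 in 6 is still grim.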

Gemini 2.5 Flash made the same error, first giving one answer and then reasoning differently. Both seemed to decide on an answer before considering the math, only doing the calculations afterward.

Why do models stumble on this prompt? When I asked ChatGPT 5 to identify the contradiction in its own reply, it spotted it but then claimed I had answered incorrectly at first, even though I hadn’t answered at all. When I corrected it, it shrugged it off with a typical “that’s on me” apology.

Visual evidence shows ChatGPT trying to reconcile its conflicting statements. When pressed for an explanation, it suggested it probably echoed a similar training example and then changed its reasoning during calculations.

DeepSeek’s model, however, got it right. It didn’t rely solely on mathematical calculation but on a pattern of “thinking” first, then answering. It even second-guessed itself midway, asking, “Wait, is the survival chance really zero?” which was quite amusing.

In the end, this illustrates that current large language models aren’t truly intelligent—they’re just mimicking thought and reasoning. They don’t genuinely “think,” and they’ll openly admit this when asked. I keep prompts like these handy for those moments when someone treats a chatbot like a search engine or uses a quote from ChatGPT as proof of something in an argument. It’s a strange, fascinating world we’re living in.

Tags: Artificial Intelligence, ChatGPT, Technology Explained
Maisah Bustami

Maisah is a writer at Digital Phablet, covering the latest developments in the tech industry. With a bachelor's degree in Journalism from Indonesia, Maisah aims to keep readers informed and engaged through her writing.


© 2025 Digital Phablet
