Meta Faces Criticism for Llama 4 Benchmark Manipulation Again

by Maisah Bustami
April 8, 2025
in AI
Meta recently introduced its Llama 4 series of AI models, making waves by outperforming GPT-4o and Gemini 2.0 Pro in the Chatbot Arena (previously known as LMSYS). The company boasts that its Llama 4 Maverick model, a mixture-of-experts (MoE) design that activates just 17 billion of its 400 billion total parameters across 128 experts, scored an impressive Elo of 1,417 on the Chatbot Arena benchmark.
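
For readers unfamiliar with the mixture-of-experts pattern, the sketch below shows the core idea in miniature: a learned router scores every expert for each token, and only the top-k experts actually run, which is why the active parameter count (17 billion) stays far below the total (400 billion). This is a minimal NumPy illustration with made-up sizes, not Meta's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only -- far smaller than Llama 4 Maverick's
# 128 experts / 400B total / 17B active parameters.
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" here is just a tiny weight matrix.
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02  # routing weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ router                    # score every expert for this token
    chosen = np.argsort(logits)[-top_k:]   # indices of the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()               # softmax over the chosen experts only
    # Only top_k of n_experts execute, so compute (and "active" parameters)
    # scale with top_k rather than with the total expert count.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

print(moe_forward(rng.standard_normal(d_model)).shape)  # (16,)
```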

This achievement caught the attention of the AI community, especially since the smaller MoE model surpassed significantly larger language models like GPT-4.5 and Grok 3. The unexpected results prompted many AI enthusiasts to independently evaluate the model. However, it turned out that Llama 4 Maverick’s real-world performance did not live up to Meta’s benchmark claims, particularly in coding tasks.

On 1Point3Acres, a widely used forum for North American Chinese users, a post from someone claiming to be a former Meta employee stirred controversy. The post, which has since been translated into English on Reddit, alleged that Meta’s leadership may have mixed benchmark test sets into the post-training data to artificially boost scores and meet internal targets.

The employee said they disapproved of the practice, resigned over it, and asked that their name be removed from the Llama 4 technical report. They further asserted that the recent resignation of Meta’s AI research head, Joelle Pineau, was closely tied to the alleged manipulation of the Llama 4 benchmarks.

In light of these allegations, Ahmad Al-Dahle, who leads Meta’s Generative AI division, responded with a post on X, strongly denying claims that Llama 4 had been post-trained on test sets. Al-Dahle stated:

We’ve also heard claims that we trained on test sets — that’s simply not true, and we would never do that. Our best understanding is that the variable quality people are seeing is due to the need to stabilize implementations.

He acknowledged the varying performance of Llama 4 across different platforms, urging the AI community to allow a few days for the implementation to stabilize.

LMSYS Addresses Allegations of Llama 4 Benchmark Manipulation

In response to the rising concerns from the AI community, LMSYS — the organization behind the Chatbot Arena leaderboard — released a statement aimed at enhancing transparency. LMSYS clarified that the model submitted to Chatbot Arena was “Llama-4-Maverick-03-26-Experimental,” a custom variant tailored for human preferences.

LMSYS admitted that “style and model response tone were significant factors” which may have inadvertently benefited the custom Llama 4 Maverick model. They acknowledged that this crucial information was not communicated clearly by Meta. Additionally, LMSYS noted that “Meta’s interpretation of our policy did not align with our expectations from model providers.”

To be fair, Meta did mention in its official Llama 4 blog that an “experimental chat version” achieved an Elo of 1,417 on Chatbot Arena, but it did not provide additional details.

To promote transparency further, LMSYS included the Hugging Face version of Llama 4 Maverick in Chatbot Arena and has also released more than 2,000 comparative battle results for public scrutiny. These results cover prompts, model responses, and user preferences.
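
As a rough illustration of how pairwise battles like these turn into a leaderboard number, here is a simplified sequential Elo update in Python. Chatbot Arena's real methodology fits a statistical model (Bradley-Terry style) over all battles at once rather than updating one battle at a time, and the battle records below are invented for the example.

```python
# Hypothetical battle records: (model_a, model_b, winner).
battles = [
    ("llama-4-maverick-experimental", "gpt-4o", "a"),
    ("gpt-4o", "llama-4-maverick-experimental", "a"),
    ("llama-4-maverick-experimental", "gemini-2.0-pro", "a"),
]

K = 32  # step size: how far one result moves a rating

def expected(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

ratings: dict[str, float] = {}
for a, b, winner in battles:
    ra, rb = ratings.get(a, 1000.0), ratings.get(b, 1000.0)
    score_a = 1.0 if winner == "a" else 0.0
    ea = expected(ra, rb)
    ratings[a] = ra + K * (score_a - ea)               # winner gains rating
    ratings[b] = rb + K * ((1 - score_a) - (1 - ea))   # loser gives it up

print(ratings)
```

Note that ratings computed this way are driven entirely by which answer users pick, so style effects like the tone issues LMSYS described feed directly into the final score.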

A review of the battle results turned up a surprise: users frequently preferred Llama 4’s answers even when they were verbose or inaccurate. This raises significant questions about the reliability of community-driven benchmarks like Chatbot Arena.

This is not the first time Meta has faced accusations of manipulating benchmarks through data contamination, the practice of mixing benchmark datasets into the training corpus. Earlier this year, Susan Zhang, a former Meta AI researcher who now works at Google DeepMind, cited a study in response to a post by Yann LeCun, Meta AI’s chief scientist.

The study indicated that over 50% of test samples from major benchmarks were included in Meta’s Llama 1 pretraining data. The paper reported significant contamination in key benchmarks like Big Bench Hard, HumanEval, HellaSwag, MMLU, PiQA, and TriviaQA.
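
Contamination audits of this kind are commonly approximated by checking n-gram overlap between each benchmark item and the training corpus. The sketch below is one naive version of such a check; the 8-gram unit and the 50% threshold are illustrative assumptions, not the methodology of the study Zhang cited.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercased word n-grams, a common unit for overlap checks."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(test_sample: str, corpus_docs: list[str],
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a test sample when enough of its n-grams appear in the corpus."""
    sample_grams = ngrams(test_sample, n)
    if not sample_grams:  # sample shorter than n words: nothing to compare
        return False
    corpus_grams = set().union(*(ngrams(doc, n) for doc in corpus_docs))
    overlap = len(sample_grams & corpus_grams) / len(sample_grams)
    return overlap >= threshold

# A benchmark question appearing verbatim inside a training document is flagged.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
print(is_contaminated("the quick brown fox jumps over the lazy dog", corpus))  # True
```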

Now, with the new allegations surrounding Llama 4’s benchmarks, Zhang remarked sharply that Meta should at least credit its “previous work” from Llama 1 for this “unique approach.” Her comment implies that, in her view, such benchmark manipulation is not incidental but a deliberate strategy to inflate performance metrics.

Tags: AI, Featured, Meta, Meta AI, Trending
Maisah Bustami

Maisah is a writer at Digital Phablet, covering the latest developments in the tech industry. With a bachelor's degree in Journalism from Indonesia, Maisah aims to keep readers informed and engaged through her writing.
