The 2025 Beijing High School Entrance Examination has concluded successfully, with over 110,000 students completing the test. This year marked the first implementation of a reform that shortened the testing period from three days to two.
The most significant changes in this year’s exam are a reduction in the total score from 670 to 510 and the introduction of an open-book format for the Morality and Law section. This score adjustment implies that each point holds greater value, intensifying competition among high-scoring students. Each subject’s questions will now focus more on assessing students’ core competencies and essential skills.
For example, in mathematics, the proportion of simpler questions has been reduced, and the difficulty and innovation of question types have increased. The Chinese language section emphasizes students’ foundational language skills and comprehension, encouraging them to think critically about how to use language effectively in problem-solving scenarios.
Feedback from students assessing the difficulty of the exam can be summed up in three words: “It was tough.”
Take this year’s Chinese essay prompt, for instance. Students had to choose between two topics: one focused on health and science, “Living a Healthier Life,” and the other on scientific literacy and practical life, “A Science Class.” While the topics may seem straightforward, crafting a standout essay proved challenging, with some students commenting, “I understand the topic, but writing about it was too difficult!”
This raises an intriguing question: if the current mainstream AI models were subject to the same entrance exam, what would their performance look like? Would they measure up to the so-called top students?
To explore this, seven prominent AI models were tested on selected subjects from the 2025 Beijing Entrance Exam, providing insight into their capabilities. The subjects included second essay prompts in both Chinese and English, along with the full mathematics exam.
The competitors in this test were DeepSeek, ByteDance’s Doubao, iFlyTek’s Spark, Tongyi Qianwen, Tencent’s Hunyuan, Baidu’s Wenxin Yiyan, and GPT. These models were chosen for their widespread use and relevance.
To ensure fairness, all models were disconnected from the internet and configured for deep reasoning. The methodology for scoring the essays involved inviting expert educators and examiners to evaluate the outputs. Separate panels graded the Chinese and English essays, and the average scores were used for final assessments.
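The panel-averaging step described above can be sketched in a few lines. This is a minimal illustration, not the examiners’ actual procedure; the panel size and marks below are hypothetical.

```python
# Minimal sketch of the essay-scoring step: each essay is graded
# independently by every member of a panel, and the mean of the panel's
# marks becomes the model's final score for that essay.
# All marks below are illustrative, not real exam data.

def panel_average(marks):
    """Average a panel's marks for one essay, rounded to one decimal."""
    return round(sum(marks) / len(marks), 1)

# Hypothetical marks from a three-person panel for one model's essay:
chinese_marks = [42, 44, 40]
print(panel_average(chinese_marks))  # → 42.0
```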
The mathematics component accepted two input formats: scanned images of the paper and LaTeX-formatted text. Scores were determined against a uniform standard that separated objective questions from subjectively graded ones: multiple-choice and fill-in-the-blank questions were scored on the final answer alone, while more complex problems earned credit for each step of the written solution.
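The two-track grading rule above can be sketched as follows. This is an illustrative approximation of the described standard, not the official rubric; the answers, keys, and point values are all hypothetical.

```python
# Sketch of the two-track math grading described above:
# objective items (multiple-choice, fill-in-the-blank) score
# all-or-nothing on the final answer, while subjective problems earn
# partial credit per completed solution step.
# All values here are made up for illustration.

def grade_objective(answer, key, points):
    """Full marks only if the final answer matches the key."""
    return points if answer == key else 0

def grade_subjective(steps_done, step_points):
    """Partial credit: sum the points for each completed solution step."""
    return sum(step_points[:steps_done])

total = (
    grade_objective("B", "B", 4)        # correct multiple-choice item
    + grade_objective("x=3", "x=2", 4)  # wrong final answer: no credit
    + grade_subjective(2, [3, 3, 4])    # 2 of 3 solution steps completed
)
print(total)  # → 10
```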
Examining the results reveals notable trends in performance across all tested models:
Mathematics:
The mathematics scores showed iFlyTek’s Spark, Doubao, and GPT ranking highest, each scoring over 85 points. Tongyi Qianwen, Wenxin Yiyan, and DeepSeek ranked lower, with scores of 73, 68, and 63, respectively. DeepSeek was hampered by image recognition issues, which directly hurt its performance.
Chinese Writing:
In essay writing, all AI models scored between 81% and 94% of the available marks, with an average of around 86%. While every model showed substantial writing ability, differences emerged in detail and emotional expression. iFlyTek’s Spark stood out for presenting profound themes smoothly and coherently.
English Writing:
The English essays showed a wider spread of scores than the Chinese ones, ranging from 7 to a perfect 10. iFlyTek’s Spark achieved the highest score, demonstrating strong thematic coverage and detail. GPT, by contrast, fell short of expectations, scoring only 7.5 despite covering all the main points.
Overall, these tests showcase that AI models have advanced significantly beyond being mere text generators; they are now capable of producing thoughtful, reasoned responses. The results challenge students to transition from rote memorization and mechanical practices to more integrated, thoughtful approaches to learning.
This examination serves as an invitation to rethink educational engagement amid rapid technological changes. The potential collaboration between humans and AI promises to create a new chapter in learning, pushing the boundaries of creativity and understanding.