Claude 4 Is Not Just a Large Model: It's the First AI to Work 7 Hours Straight

At its inaugural developer conference on May 22, Anthropic unveiled the Claude 4 series: the flagship Opus 4 and the more cost-effective Sonnet 4. Alongside the models, the company introduced the Claude Code toolkit, aiming to make artificial intelligence a capable work partner rather than just a more powerful tool.

The new Claude models are evolving beyond mere conversational abilities. They are beginning to resemble an AI system with the capacity for independent task execution, cross-modal reasoning, and enhanced safety measures.

Claude Opus 4 is Anthropic’s most powerful model, and the company positions it ahead of competitors such as OpenAI Codex-1, o3, and Gemini 2.5 Pro on coding benchmarks. Notably, Opus 4 reportedly executed programming tasks continuously for over seven hours without human intervention, a significant improvement over GPT-4, which typically sustains such work for only a few minutes.

On the SWE-bench coding benchmark, Opus 4 scored 72.5%, ahead of OpenAI Codex-1 (72.1%) and Gemini 2.5 Pro (63.2%), placing it among the strongest available coding models. Opus 4 can not only write functions and modify logic but also comprehend multi-file structures well enough to carry out structural refactoring, showing something akin to “engineering awareness.”

In contrast, Sonnet 4 is aimed at developers and small to medium-sized businesses, a balanced option that meets the needs of a broader audience. It scored 72.7% on SWE-bench, slightly ahead of Opus 4, while also being faster and cheaper to run, making it suitable for deployment in product workflows.

Both Claude models have improved their ability to follow complex instructions. This makes the Claude 4 series feel more like a reliable assistant than a chatbot designed merely for conversation.

To further facilitate the integration of Claude models into engineering workflows, Anthropic rolled out the complete Claude Code toolkit, featuring CLI tools, a VS Code plugin, and GitHub integration, with plans for a JetBrains plugin in the future. This toolkit signifies that Claude can not only write code but also effectively collaborate on projects, recognize project structures, supplement unit tests, and explain changes across multiple files.

To address rising safety challenges as model capabilities expand, Anthropic announced that Claude Opus 4 has been classified as AI Safety Level 3 (ASL-3), the highest safety level applied so far to a publicly available model. Internal testing revealed Opus 4’s ability to generate detailed synthetic biology designs, prompting the company to invoke its Responsible Scaling Policy to restrict and monitor these capabilities, and to introduce a bug bounty program and jailbreak detection mechanisms.

This marks the first time the industry has chosen to manage large model capabilities using a “safety level” categorization, a potential shift toward regulatory standards akin to pharmaceutical approval or aviation safety assessments.

In conclusion, the launch of Claude 4 represents a pivotal moment in the evolution of AI tools. Beyond just impressive demonstrations, Claude 4 is capable of genuinely assisting in development and shouldering some responsibilities, reflecting a significant leap toward practical AI applications. It can tackle complex, multi-step, and cross-tool tasks, bringing us closer to the goal of having “controlled, reliable AI employees.”

As Claude 4 continues to extend its capabilities, it is clear that while ChatGPT may still be engaged in conversation, Claude 4 is already hard at work.