October 21 — ShengShu Technology, a Chinese startup specializing in multimodal generative AI, has unveiled an updated version of its reference-to-video tool. The improvements enhance consistency and creative controls, positioning the platform to compete with leading global AI video generators, including OpenAI’s Sora 2 and Google’s Veo 3.1.
The latest iteration, Vidu Q2, allows users to upload and blend up to seven reference images—covering faces, scenes, or props—into a single, cohesive video. These images are combined with a text prompt using a multi-entity consistency feature, ensuring each element remains distinct and faithful to its original form, the company announced.
Another standout feature is the ability to generate transition animations from just an uploaded first and last frame, a powerful tool for story-driven videos that Veo 3.1 also offers. Additionally, ShengShu has launched an application programming interface (API) for Vidu Q2, enabling businesses to incorporate these features seamlessly into their workflows.
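For illustration, the snippet below sketches how a business workflow might call such a reference-to-video API over HTTP: it sends a text prompt together with several base64-encoded reference images and reads back a task result. The endpoint URL, model name, and every parameter shown are assumptions made for this sketch, not ShengShu’s published specification; the official Vidu API documentation would define the real interface.

# A minimal sketch, assuming a hypothetical HTTP endpoint and parameter
# names; consult ShengShu's official Vidu API docs for the real interface.
import base64
import requests

API_URL = "https://api.example.com/v1/reference-to-video"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

def encode_image(path: str) -> str:
    """Base64-encode a local reference image so it can travel in JSON."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# Up to seven reference images (faces, scenes, props) plus a text prompt.
payload = {
    "model": "vidu-q2",  # assumed model identifier
    "prompt": "A battery module glides along a factory conveyor line.",
    "reference_images": [
        encode_image(p) for p in ("face.png", "scene.png", "prop.png")
    ],
    "duration": 8,         # seconds; assumed option
    "resolution": "1080p", # assumed option
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=60,
)
response.raise_for_status()
print(response.json())  # typically a task ID to poll until the video is ready

In a real integration, the returned identifier would usually be polled until rendering completes, since video generation runs asynchronously.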
“Vidu Q2 signifies a new chapter in AI-driven video creation,” stated Luo Yihang, CEO of the Beijing-based company. “We’re entering an era where AI can replicate human appearances and express emotions with cinematic quality. This isn’t just about basic video generation; it’s about teaching AI to act and tell stories alongside creators.”
“With each update, we’re blending technology with creativity more deeply,” Luo continued. “Our aim isn’t to replace human creativity but to amplify it, making imagination visible and emotions limitless.”
Industry experts have noted that ShengShu’s Vidu Q2 delivers content more quickly and affordably than high-end options such as Sora 2 and Veo 3.1.
To test its capabilities, the new system was prompted to produce a video featuring a blade battery module moving on a conveyor in a Chinese electric vehicle factory, while being scanned by a yellow Siasun industrial robot. The background displayed a screen with a real-time yield rate of 99.92% in simplified Chinese.
Vidu Q2 effectively combined the battery, robot arm, Siasun logo, and Chinese text into a dynamic scene, maintaining high stability and accuracy, especially with the Chinese characters, lending support to ShengShu’s claim of multi-entity consistency.
When tested with similar prompts, Veo 3.1, which supports up to three reference images, failed to reproduce the Chinese text correctly. Meanwhile, Sora 2 rendered the text properly but replaced the logo with Nissan’s.
In another test, the platform was prompted to generate a meeting-room scene in which a chairman in Shanghai angrily asks in Chinese, “The battery caught fire, are you messing with me?” and an American CEO replies in English, “Not me, it’s them.”
The tool used the reference images to craft the angry facial expression and maintained accurate lip-sync in both languages, though the emotional tone of the audio was more subdued than Veo 3.1’s and showed some lag. Together, these tests suggest that Vidu Q2’s multilingual dialogue and emotional expression capabilities are competitive on the global stage.
Founded in March 2023 by a team from Tsinghua University’s AI Industry Research Institute, the company launched its first version, Vidu 1.0, last April. It now has over 30 million users across more than 200 countries and has generated over 400 million videos. The platform can produce five- to eight-second videos at 1080p resolution from text prompts in Chinese or English, as well as from images.