
When Virtual Humans No Longer Wear "Fake Smiles": LongCat-Video-Avatar Endows AI with Authentic Vitality
Next to the latest open-source virtual human video generation model, crude AI digital humans suddenly look dated. LongCat-Video-Avatar, just launched by Meituan's LongCat Team, allows virtual humans to appear as natural as real people even when they are not speaking.
This company, renowned for local lifestyle services, has officially entered the AI video generation field with the launch of an all-round virtual human video generation model.
It is not only Meituan's first open-source virtual human video generation model but also the first all-round model of its kind to support text-, image-, and video-conditioned generation within a single model.
The Birth of LongCat-Video-Avatar Model
On December 18, 2025, Meituan's LongCat Team released and open-sourced the virtual human video generation model, LongCat-Video-Avatar.
Built on the previously released LongCat-Video foundation model, this model inherits the core design of "one model supporting multiple tasks."
As early as September, the team launched the LongCat-Flash Large Language Model (LLM); in early October, it made a major open-source release of LongCat-Video, which supports 5-minute long video generation; in November, the multimodal model LongCat-Flash-Omni achieved a breakthrough in real-time interaction; and in December, three image models (LongCat-Image, LongCat-Image-Dev, and LongCat-Image-Edit) completed the closed loop of visual capabilities at the open-source State-of-the-Art (SOTA) level.
With an "offensive" posture, Meituan has intensively rolled out full-stack AI releases, outlining a complete layout from language to multimodality, and from the open-source ecosystem to industrial empowerment, through its continuously iterated technology matrix.
Three Core Functions
LongCat-Video-Avatar natively supports multiple core functions.
Firstly, the Audio-Text-to-Video (AT2V) function generates virtual human videos from audio input and a text description. Secondly, the Audio-Text-Image-to-Video (ATI2V) function additionally takes a reference image, so the generated virtual human matches a specific appearance.
Notably, the model also features a video continuation function, which extends existing footage into coherent follow-on clips: shoot a short video opening, and the AI completes the rest of the video.
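To make the three modes concrete, here is a minimal Python sketch of how they differ only in which optional inputs are supplied. The request fields and helper function are illustrative assumptions, not the released model's API.

```python
# Hypothetical sketch: the three generation modes differ only in their inputs.
# All names below are illustrative, not the model's real interface.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AvatarRequest:
    audio_path: str                        # driving speech audio (used in all modes)
    prompt: str                            # text description of the scene / performance
    reference_image: Optional[str] = None  # ATI2V: still image fixing the avatar's appearance
    prefix_video: Optional[str] = None     # continuation: an existing clip to extend

def select_mode(req: AvatarRequest) -> str:
    """Pick the generation mode from which optional inputs are present."""
    if req.prefix_video is not None:
        return "video continuation"   # extend an existing clip coherently
    if req.reference_image is not None:
        return "ATI2V"                # audio + text + reference image
    return "AT2V"                     # audio + text only

print(select_mode(AvatarRequest("speech.wav", "a presenter at a desk")))                          # AT2V
print(select_mode(AvatarRequest("speech.wav", "same presenter", reference_image="face.png")))     # ATI2V
```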
Technological Innovations
LongCat-Video-Avatar has achieved breakthroughs in three key dimensions: motion realism, long-video stability, and identity consistency.
One of the model's core highlights is its synchronous control capability, which can simultaneously direct the virtual human's lip movements, eye gaze, facial expressions, and body movements.
Even in a mute state, the virtual human can exhibit subtle behaviors such as blinking and adjusting sitting posture, no longer remaining a rigid digital sculpture.
This ability to maintain natural micro-movements while stationary marks a milestone where virtual humans truly possess a "sense of life."
Related Technical Architecture
Meituan's open-source video generation model series adopts the Diffusion Transformer (DiT) architecture. This architecture distinguishes different tasks by the number of conditional frames: text-to-video requires no conditional frames; image-to-video takes 1 reference frame; and video continuation relies on multiple preceding frames.
Through this unified architecture, the model can cover three types of tasks without additional modifications.
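A minimal sketch of this idea, assuming a latent-space DiT and using illustrative names rather than the released code, shows how the number of conditioning frames alone can select the task:

```python
# Minimal sketch (not the official implementation): one DiT backbone covers all three
# tasks purely by how many clean conditioning frames are prepended to the noisy latents.
from typing import Optional
import torch

def build_model_input(noisy_latents: torch.Tensor,
                      cond_frames: Optional[torch.Tensor]) -> torch.Tensor:
    """
    noisy_latents: (T, C, H, W) latent frames to be denoised.
    cond_frames:   None          -> text-to-video (no conditioning frames)
                   (1, C, H, W)  -> image-to-video (one reference frame)
                   (k, C, H, W)  -> video continuation (k preceding frames)
    """
    if cond_frames is None:
        return noisy_latents                           # pure text-to-video
    # Clean conditioning frames are concatenated along the time axis; the
    # transformer attends to them but only denoises the noisy portion.
    return torch.cat([cond_frames, noisy_latents], dim=0)
```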
To enhance long-sequence generation capability, the model incorporates native video continuation tasks during the pre-training phase.
This enables the model to stably generate videos several minutes long, with targeted optimizations for cross-frame temporal consistency and physically plausible motion that reduce issues such as color drift, image quality degradation, and motion discontinuity.
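A hedged sketch of chunk-wise continuation, with a placeholder generate_chunk callable standing in for a single diffusion pass, illustrates how the trailing frames of the generated video are reused as conditioning so that long videos stay coherent:

```python
# Illustrative sketch of long-video generation by repeated continuation.
# generate_chunk is a placeholder callable, not the model's actual API.
def generate_long_video(generate_chunk, total_frames: int, overlap: int = 8):
    """generate_chunk(cond_frames) -> list of new frames for one chunk."""
    video = list(generate_chunk(cond_frames=None))          # first chunk: nothing to condition on
    while len(video) < total_frames:
        cond = video[-overlap:]                              # trailing frames become the condition
        video.extend(generate_chunk(cond_frames=cond))       # continuation keeps motion and color consistent
    return video[:total_frames]
```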
Efficient Generation Solution of LongCat-Video-Avatar
The model employs several strategies to improve video generation efficiency. It combines a Block-Sparse Attention (BSA) mechanism with conditional token caching to reduce inference redundancy over long sequences.
When processing sequences of 93 frames or more, the model can maintain a stable balance between efficiency and generation quality.
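The report does not spell out the exact sparsity pattern, but a common way to build a block-sparse attention mask, where each token attends to its own local block plus the globally shared conditioning tokens, looks like this illustrative sketch:

```python
# Illustrative block-sparse attention mask; the model's actual pattern may differ.
import torch

def block_sparse_mask(n_tokens: int, block: int, n_cond: int) -> torch.Tensor:
    """Boolean mask (True = attention allowed): each token attends to its own
    local block and to the first n_cond conditioning tokens; the rest is skipped."""
    idx = torch.arange(n_tokens)
    same_block = (idx[:, None] // block) == (idx[None, :] // block)  # local block attention
    attend_cond = idx[None, :] < n_cond                              # global conditioning tokens
    return same_block | attend_cond

print(block_sparse_mask(n_tokens=8, block=4, n_cond=2))
```

Because the conditioning tokens do not change across diffusion steps, their projections can also be computed once and cached, which is the intuition behind the conditional token caching mentioned above.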
For high-resolution and high-frame-rate scenarios, the model adopts a combined strategy of "two-stage coarse-to-fine + BSA + distillation." Reports indicate that this solution increases inference speed by approximately 10.1 times compared to the baseline.
These optimization solutions allow the model to generate high-quality 720p, 30fps videos within minutes.
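As a rough illustration of the control flow only (the stage functions below are placeholders, not the actual API), a two-stage coarse-to-fine pipeline first produces a cheap low-resolution draft and then refines it to the target 720p, 30fps output:

```python
# Hedged sketch of a coarse-to-fine two-stage pipeline; both callables are placeholders.
def coarse_to_fine(generate_draft, refine):
    draft = generate_draft(resolution=(640, 360), fps=15)   # stage 1: fast, coarse pass
    return refine(draft, resolution=(1280, 720), fps=30)    # stage 2: add detail, reach 720p/30fps
```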
Application Scenarios and Significance of Open Source
According to China Youth Network, LongCat-Video-Avatar has achieved state-of-the-art (SOTA) results in evaluations on multiple authoritative datasets.
The open-sourcing of this model is of great significance to the AI community. Through GitHub, Hugging Face, and the official project page, global developers can freely access and apply this advanced technology.
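For example, once the weights are published on Hugging Face, they can typically be fetched with the huggingface_hub library. The repository id below is an assumption for illustration and should be checked against the official README:

```python
from huggingface_hub import snapshot_download

# The repo id below is an assumed placeholder; verify it on the official GitHub
# README or project page before running this.
local_dir = snapshot_download(repo_id="meituan-longcat/LongCat-Video-Avatar")
print("Model files downloaded to:", local_dir)
```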
For ordinary users, it means more applications will be able to generate high-quality, realistic virtual human content. From online education and customer service to the entertainment industry, virtual humans will no longer be limited to simple chatbots but will be able to interact with people in a more humanized manner.
Meituan's AI Strategic Layout
This virtual human video generation model is a continuation of Meituan's AI strategy. As early as the earnings call with analysts in March this year, Meituan CEO Wang Xing systematically elaborated on the AI strategy for the first time: "AI will disrupt all industries, and our strategy is to take the initiative to attack rather than defend passively."
As disclosed by Wang Xing at that time, Meituan's AI deployment is divided into three levels: AI at work, AI in products, and Building Large Language Models (LLMs).
The LongCat-Video-Avatar virtual human video generation model has taught virtual humans to "breathe."
Just like real humans, they blink when thinking and adjust their sitting posture in silence. When the gaze of a virtual human is no longer vacant, we may need to reflect: what is real is no longer important; what matters is whether we can tell the difference.
Today, whether generating a makeup tutorial video, producing a virtual human podcast program, or creating online education courses, LongCat-Video-Avatar can provide stable and realistic dynamic performance.
Film-level virtual humans, once considered unattainable, have now become accessible technical assets for any developer. When AI creativity meets the power of open source, the boundary between virtual and reality is quietly blurring.