ByteDance Disrupts Video Generation Race with Breakthrough in Multi-Subject Interaction


On September 24, ByteDance’s technology arm, Volcano Engine, introduced two state-of-the-art video generation models, PixelDance and Seaweed, which significantly enhance video content creation capabilities through sophisticated multi-shot actions and complex interactions among multiple subjects. These models break new ground by adhering to complex directives and maintaining high consistency in character appearance and cinematography across various camera movements, closely resembling live-action footage.

Both models are engineered on the DiT architecture, which integrates efficient DiT fusion computing units. This technology facilitates free transitioning between cinematographic techniques such as zooming, panning, tilting, scaling, and target tracking, addressing the industry’s challenge of maintaining consistency in subject, style, and atmosphere during camera transitions.

The development of a new diffusion model training method has successfully resolved the issue of consistency across multiple camera switches, ensuring a uniform presentation of the main subjects and the overall visual style throughout the video. Additionally, an enhanced Transformer structure boosts the generalization ability of the models, enabling them to support various animation styles and adapt to different screen ratios. This makes them highly versatile for applications in e-commerce marketing, animated education, cultural tourism, and more, providing substantial creative aid to professional artists and creators.

The models have been refined through continuous iterations in real-world applications like CapCut and Dreamina, achieving professional-grade lighting and color blending that significantly enhances visual appeal and realism.

Targeted at the enterprise market, PixelDance and Seaweed exhibit robust semantic understanding capabilities and are adept at managing complex interactions and consistent content delivery across multiple camera views.

Volcano Engine also revealed that since its initial launch in May, the daily usage of DouBao language models has surged tenfold to over 1.3 trillion tokens, with multimodal data processing reaching 50 million images and 850,000 hours of voice data per day.

The pricing strategy for the DouBao models, set 99% below the industry average, has initiated a trend of price reductions in China's large model sector, removing cost as a barrier to innovation. With enterprise applications expanding, supporting higher traffic volumes has become a key growth factor in the industry.

Moreover, while current industry offerings typically cap TPM (tokens per minute) at 100K to 300K, which is insufficient for some enterprise applications, DouBao models start with an initial capacity of 800K TPM, far exceeding these limits, with options for scalable expansion based on client needs. This capacity allows the models to support high-demand scenarios such as scientific research, automotive smart systems, and AI education, where peak TPM requirements significantly surpass the industry average.
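To put these quota figures in context, here is a minimal back-of-the-envelope sketch of what a TPM quota means for concurrency. The 800K TPM figure comes from the announcement above; the per-request token counts and request rates are hypothetical illustrative values, not numbers from Volcano Engine.

```python
def max_concurrent_sessions(tpm_quota: int,
                            tokens_per_request: int,
                            requests_per_minute: int) -> int:
    """Estimate how many sessions a tokens-per-minute quota can sustain.

    Each session is assumed to consume
    tokens_per_request * requests_per_minute tokens every minute.
    """
    tokens_per_session = tokens_per_request * requests_per_minute
    return tpm_quota // tokens_per_session

# Hypothetical workload: each session exchanges four 2,000-token
# requests per minute, i.e. 8,000 tokens per session per minute.
print(max_concurrent_sessions(800_000, 2_000, 4))  # 800K TPM quota → 100
print(max_concurrent_sessions(100_000, 2_000, 4))  # a 100K TPM cap → 12
```

Under these assumed workload numbers, an 800K TPM quota supports roughly eight times the concurrent sessions of a 100K cap, which is why higher baseline quotas matter for traffic-heavy enterprise deployments.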

Editor: Chain Zhang

The post ByteDance Disrupts Video Generation Race with Breakthrough in Multi-Subject Interaction first appeared on Synced.
