Kling 2.6 Native Audio — Lip Sync, Voices, SFX & More!
TLDR
This video reviews Kling 2.6, a new AI video model that introduces native audio generation, including narration, lip sync, sound effects, and music from a single prompt. The creator tests the model across multiple platforms and scenarios—from monologues and ambient scenes to multi-speaker dialogue—highlighting its strengths in lip sync, tone, and camera direction, as well as its inconsistencies with ambient audio, timing, and multi-character conversations. While impressive for single-speaker clips and stylistic scenes, the model struggles with complex audio layering and consistent voice control. Overall, Kling 2.6 delivers exciting but uneven results, offering strong creative potential with some limitations.
Takeaways
- 😀 The Kling 2.6 API now supports native audio, including narration, lip sync, sound effects, and music from a single text prompt.
- 🎤 You can generate videos with or without sound, but enabling audio costs more credits, especially for longer videos.
- 🎧 The native audio feature currently supports English and Chinese voice output, with other languages automatically translated into English.
- 📚 The best results come from using a clear and structured prompt, which includes scene description, character details, movement, audio, and camera instructions.
- 🎥 When using the ElevenLabs platform, it's important to select the Kling 2.6 model and enable audio, which increases the cost.
- 🎶 Ambient sounds and sound effects may not always be generated as expected, and the model may ignore certain prompts if they don't match the scene visually.
- 👄 The model generally performs well with lip-syncing, though it may struggle with complex character interactions, especially in multi-person dialogues.
- 🗣 Dialogue delivery can sometimes be rushed or altered to fit time constraints, leading to unnatural pacing or incomplete sentences.
- 🐕 In multi-character scenes, the model sometimes assigns dialogue to the wrong character or misinterprets actions, affecting the realism of interactions.
- 🧊 The model occasionally adds unexpected sounds or elements, like wind effects in a video where they weren't prompted, enhancing the atmosphere.
- 📈 While the Kling 2.6 model shows promise, it's not perfect for projects requiring consistent voice delivery across multiple generations or complex audio mixing.
Q & A
What major upgrade does the Kling 2.6 model introduce?
-Kling 2.6 introduces native audio generation, allowing videos to be created with narration, lip-sync, sound effects, and music directly from a text prompt.
Which languages does Kling 2.6 currently support for voice output?
-The model currently supports English and Chinese voice output, and any other language is automatically translated to English.
What prompt structure does Kling recommend for best results?
-Kling recommends describing the scene, the characters, their movement, the audio, and any additional style or camera notes.
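As an illustration, the five-part structure described above can be assembled programmatically before being sent to whichever platform you use. This is a minimal sketch: the helper function and section ordering are hypothetical and not part of any official Kling SDK or prompt syntax.

```python
def build_kling_prompt(scene, characters, movement, audio, style_camera):
    """Join the five recommended prompt sections into one text prompt.

    Hypothetical helper for illustration only — not an official Kling API.
    """
    parts = [
        f"Scene: {scene}",
        f"Characters: {characters}",
        f"Movement: {movement}",
        f"Audio: {audio}",
        f"Style/Camera: {style_camera}",
    ]
    return " ".join(parts)

# Example based on the desert monologue test mentioned in the review.
prompt = build_kling_prompt(
    scene="A windswept desert at golden hour",
    characters="A lone woman in a tattered cloak",
    movement="She walks slowly toward the camera",
    audio="She says: 'The storm is coming.' Wind howls in the background",
    style_camera="Cinematic, slow dolly-in",
)
print(prompt)
```

Keeping each section explicit makes it easier to see which instruction the model ignored when a generation goes wrong (e.g., missing ambient audio).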
Why did the desert monologue video feel rushed?
-The model accelerated the speech because the 10-second limit wasn’t long enough to deliver the full dialogue naturally.
What issue appeared frequently when generating ambient audio?
-The model sometimes ignored ambient audio instructions entirely, generating no background sound even when specifically requested.
Why might the café chatter not have been generated in the Napoleon clip?
-The model may have decided the chatter didn’t visually match the scene since no other people appeared in the image.
How well does Kling 2.6 handle multi-speaker dialogue?
-It performs inconsistently; two-person dialogue can work but often has timing issues, while three-person conversation failed entirely in testing.
What labeling strategy helps the model assign lines to the correct speaker?
-Using unique, visually anchored labels like “man in the driver’s seat” or “woman in the passenger seat,” and repeating them consistently throughout the prompt.
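The labeling strategy above can be sketched as a small script that attaches a unique, visually anchored label to every line of dialogue. The label format is an assumption for illustration, not an official Kling syntax:

```python
# Each speaker gets one visually anchored label, repeated verbatim for
# every line they speak (hypothetical prompt format, for illustration).
dialogue = [
    ("man in the driver's seat", "I told you we should have taken the highway."),
    ("woman in the passenger seat", "And I told you the map was upside down."),
]

lines = [f'The {label} says: "{text}"' for label, text in dialogue]
dialogue_prompt = " ".join(lines)
print(dialogue_prompt)
```

Reusing the exact same label string for every line, rather than switching between "the man", "he", and "the driver", is what gives the model its best chance of assigning each line to the right speaker.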
What were the strengths observed in the Ice Queen video?
-The model delivered accurate lip sync and produced convincing tone and atmosphere, and it added fitting wind and dragon sounds even without being prompted.
What overall limitations did the reviewer identify in Kling 2.6?
-Inconsistent background audio, difficulty handling multiple speakers, occasional prompt-ignoring behaviors, and lack of control over voice consistency across generations.
What tip did the reviewer give for improving image-to-video quality?
-They recommended upscaling images before generation to improve visual fidelity in the final video.
Outlines
🎬 Kling's New Models and Features
In this paragraph, the video discusses the launch of the Kling 2.6 model and its new features, including native audio support. The new model allows users to generate videos with narration, accurate lip sync, sound effects, and even music, all from a single text prompt. Several examples are provided, such as a rap, a gym trainer's comment, and a simulated dialogue with Dr. Sarah Miller from Stanford AI Lab. The section also briefly touches on the video generation process and the cost of enabling audio in generated videos, as well as the language limitations and additional guidelines for optimal usage.
🔊 Testing the 2.6 Model for Various Video Scenarios
This paragraph details the author's testing of the 2.6 model with various video prompts and examines how well it generates different types of content. It covers videos with sound effects, camera directives, and dialogue, providing insights on successes and failures. Specific examples include a desert woman delivering a monologue, a Napoleon dialogue, and a fictional encounter involving a dragon and an ice queen. The paragraph also compares the model’s effectiveness with prompts that involve specific sound effects and lip sync.
🗣️ Dialogue Challenges and Character Labels
In this section, the video focuses on the challenge of generating natural dialogue with the Kling 2.6 model when multiple characters are involved. The author emphasizes the importance of labeling characters clearly and consistently to help the model separate the dialogue. A series of examples demonstrates both successes and failures with two- and three-character dialogues. Issues like mislabeling and missing sound effects (e.g., car sputtering or cafe chatter) are highlighted. Despite adjustments to the prompts, the model still struggles to create natural interactions between multiple characters.
🐾 A Dog Talking and Other Unexpected Outcomes
This paragraph explores the Kling 2.6 model's ability to generate videos with animal characters, specifically a dog. Despite detailed prompts, background effects like suspenseful music and sound effects (e.g., clattering or city street ambience) are missed, though the dog's speech and lip sync are good. A comparison is made to an earlier test where the model didn't include wind effects despite their explicit inclusion in the prompt. The section also mentions a humorous werewolf example that highlights some of the model's inconsistencies with sound effects and natural speech.
🤡 Clown's Emotional Range and Final Test
This section concludes with an experiment using a clown character who goes through a range of emotions, from eerie to cheerful to sad and finally furious. The test highlights the strengths and weaknesses of the model, especially in terms of lip sync and camera movements. Despite a detailed prompt, the sadness in the clown's performance was not fully conveyed, and once again, the model failed to add requested sound effects. The overall result was a mixture of successes and flaws, showing that while the model works well in some areas, it still has significant room for improvement.
🔍 Conclusion and Personal Experience with the 2.6 Model
In the final paragraph, the author summarizes their experience with the Kling 2.6 model, acknowledging both its strengths and limitations. While the lip sync and camera directions were generally effective, the handling of complex audio combinations, dialogue assignments, and consistency in character voices were problematic. The model struggles with multi-character interactions and doesn't always deliver expected results in terms of sound effects or voice consistency. The author concludes with advice for users and asks for viewer feedback on their own experiences with the model.
Mindmap
Keywords
💡Kling 2.6 Native Audio
💡Lip Sync
💡Voice Output
💡Sound Effects
💡ElevenLabs
💡Camera Movements
💡Text Prompt
💡Character Labels
💡Multiple Speakers
💡Voice Consistency
Highlights
Kling 2.6 model introduces native audio support for generating videos with narration, lip sync, sound effects, and music from a single text prompt.
The model supports Chinese and English voice output, with other languages being automatically translated to English.
Best prompt structure involves describing the scene, characters, movements, audio, and additional style or camera notes to ensure more consistent results.
The 2.6 model is available on several platforms, including Kling's site and ElevenLabs, with the option to generate videos with or without sound.
Generating videos with audio costs more credits, with a 10-second video costing over 8,000 credits.
Native audio model is currently 30% off, but only supports specific languages (Chinese and English).
The sound effects in the videos are generally good, but sometimes the camera directions are ignored, or audio is missing or misplaced.
Lip sync works well in most cases, but when multiple characters are involved, the model sometimes assigns lines to the wrong speaker.
Ambient sound effects, like wind or city street noises, are not always generated as expected, despite being requested.
The model's ability to handle multiple types of audio, such as dialogue and sound effects, can be inconsistent, especially in more complex scenes.
When testing multi-character dialogues, the model sometimes gets confused and assigns the wrong dialogue to the wrong character.
Kling's user guide offers helpful tips on how to structure prompts for the best results.
The model excels at generating natural-sounding dialogue for a single speaker but struggles with maintaining consistency in a multi-character setup.
Generating videos with the Kling 2.6 model may require adjustments to the prompt structure for optimal results, especially when aiming for specific details like accents.
The reviewer's experiment with generating a dog character's speech was generally successful, though there were issues with lip-movement timing and missing background music.