Kling 2.6 Native Audio — Lip Sync, Voices, SFX & More!

AI Concoction
9 Dec 2025 · 11:51

TLDR: This video reviews Kling 2.6, a new AI video model that introduces native audio generation, including narration, lip sync, sound effects, and music from a single prompt. The creator tests the model across multiple platforms and scenarios—from monologues and ambient scenes to multi-speaker dialogue—highlighting its strengths in lip sync, tone, and camera direction, as well as its inconsistencies with ambient audio, timing, and multi-character conversations. While impressive for single-speaker clips and stylistic scenes, the model struggles with complex audio layering and consistent voice control. Overall, Kling 2.6 delivers exciting but uneven results, offering strong creative potential with some limitations.

Takeaways

  • 😀 The Kling 2.6 API now supports native audio, including narration, lip sync, sound effects, and music from a single text prompt.
  • 🎤 You can generate videos with or without sound, but enabling audio costs more credits, especially for longer videos (a hypothetical API sketch follows this list).
  • 🎧 The native audio feature currently supports English and Chinese voice output, with other languages automatically translated into English.
  • 📚 The best results come from using a clear and structured prompt, which includes scene description, character details, movement, audio, and camera instructions.
  • 🎥 When using the ElevenLabs platform, it's important to select the Kling 2.6 model and enable audio, which increases the cost.
  • 🎶 Ambient sounds and sound effects may not always be generated as expected, and the model may ignore certain prompts if they don't match the scene visually.
  • 👄 The model generally performs well with lip-syncing, though it may struggle with complex character interactions, especially in multi-person dialogues.
  • 🗣 Dialogue delivery can sometimes be rushed or altered to fit time constraints, leading to unnatural pacing or incomplete sentences.
  • 🐕 In multi-character scenes, the model sometimes assigns dialogue to the wrong character or misinterprets actions, affecting the realism of interactions.
  • 🧊 The model occasionally adds unexpected sounds or elements, like wind effects in a video where they weren't prompted, enhancing the atmosphere.
  • 📈 While the Kling 2.6 model shows promise, it's not perfect for projects requiring consistent voice delivery across multiple generations or complex audio mixing.
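
For context on the API route, here is a minimal, hypothetical sketch of what a text-to-video request with native audio enabled might look like. The endpoint URL, model identifier, and every parameter name below are assumptions for illustration only, not Kling's documented interface; consult the official API docs for the real one.

```python
# Hypothetical sketch: the endpoint, auth scheme, and all parameter names
# are assumptions for illustration, not Kling's documented API.
import requests

payload = {
    "model_name": "kling-v2-6",  # assumed model identifier
    "prompt": "A gym trainer encourages a client, speaking in English.",
    "duration": 10,              # seconds; longer clips consume more credits
    "sound": True,               # assumed flag: native audio on (costs extra credits)
}

resp = requests.post(
    "https://api.klingai.com/v1/videos/text2video",  # assumed endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
    timeout=60,
)
print(resp.status_code, resp.json())
```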

Q & A

  • What major upgrade does the Kling 2.6 model introduce?

    -Kling 2.6 introduces native audio generation, allowing videos to be created with narration, lip-sync, sound effects, and music directly from a text prompt.

  • Which languages does Kling 2.6 currently support for voice output?

    -The model currently supports English and Chinese voice output, and any other language is automatically translated to English.

  • What prompt structure does Kling recommend for best results?

    -Kling recommends describing the scene, the characters, their movement, the audio, and any additional style or camera notes.
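
For illustration, a prompt assembled in that recommended order might look like the sketch below; the scene specifics are invented for the example.

```python
# A prompt built in Kling's recommended order:
# scene -> characters -> movement -> audio -> style/camera notes.
# All scene details here are invented for illustration.
prompt = (
    "Scene: a rain-soaked neon alley at night. "                # scene
    "Characters: a detective in a long trench coat. "           # characters
    "Movement: he walks slowly toward the camera. "             # movement
    'Audio: he says, "This city never sleeps." '                # dialogue
    "Soft rain patter and distant thunder in the background. "  # sound effects
    "Style: film noir. Camera: slow push-in on his face."       # style/camera
)
print(prompt)
```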

  • Why did the desert monologue video feel rushed?

    -The model accelerated the speech because the 10-second limit wasn’t long enough to deliver the full dialogue naturally.

  • What issue appeared frequently when generating ambient audio?

    -The model sometimes ignored ambient audio instructions entirely, generating no background sound even when specifically requested.

  • Why might the café chatter not have been generated in the Napoleon clip?

    -The model may have decided the chatter didn’t visually match the scene since no other people appeared in the image.

  • How well does Kling 2.6 handle multi-speaker dialogue?

    -It performs inconsistently; two-person dialogue can work but often has timing issues, while three-person conversation failed entirely in testing.

  • What labeling strategy helps the model assign lines to the correct speaker?

    -Using unique, visually anchored labels like “man in the driver’s seat” or “woman in the passenger seat,” and repeating them consistently throughout the prompt.
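
A hypothetical two-speaker prompt using that strategy might look like this; the dialogue lines are invented, while the labels follow the pattern described above.

```python
# Unique, visually anchored speaker labels, repeated verbatim for every line,
# help the model assign each utterance to the correct character.
prompt = (
    "Scene: two people sit in a parked car at dusk. "
    "The man in the driver's seat says, \"We should have left an hour ago.\" "
    "The woman in the passenger seat replies, \"Then you should have packed faster.\" "
    "The man in the driver's seat sighs and starts the engine. "
    "Camera: static shot through the windshield."
)
print(prompt)
```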

  • What were the strengths observed in the Ice Queen video?

    -The model delivered accurate lip sync and produced convincing tone and atmosphere, adding fitting wind and dragon sounds even without prompting.

  • What overall limitations did the reviewer identify in Kling 2.6?

    -Inconsistent background audio, difficulty handling multiple speakers, occasional prompt-ignoring behaviors, and lack of control over voice consistency across generations.

  • What tip did the reviewer give for improving image-to-video quality?

    -They recommended upscaling images before generation to improve visual fidelity in the final video.
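
A minimal sketch of that pre-processing step, using Pillow's LANCZOS resampling as a simple stand-in (dedicated AI upscalers would typically give better results; the file names are placeholders):

```python
# Upscale a source image before feeding it to image-to-video generation.
from PIL import Image

img = Image.open("input.png")   # placeholder source image
scale = 2                       # 2x upscale as an example
upscaled = img.resize(
    (img.width * scale, img.height * scale),
    resample=Image.LANCZOS,     # basic resampling; AI upscalers usually do better
)
upscaled.save("input_upscaled.png")
```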

Outlines

00:00

🎬 Kling's New Models and Features

In this paragraph, the video discusses the launch of the Kling 2.6 model and its new features, including native audio support. The new model allows users to generate videos with narration, perfect lip sync, sound effects, and even music, all from a single text prompt. Several examples are provided, such as a rap, a gym trainer's comment, and a simulated dialogue with Dr. Sarah Miller from Stanford AI Lab. The section also briefly touches on the video generation process and the cost of enabling audio in generated videos, as well as the language limitations and additional guidelines for optimal usage.

05:03

🔊 Testing the 2.6 Model for Various Video Scenarios

This paragraph details the author's testing of the 2.6 model with various video prompts and examines how well it generates different types of content. It covers videos with sound effects, camera directives, and dialogue, providing insights on successes and failures. Specific examples include a desert woman delivering a monologue, a Napoleon dialogue, and a fictional encounter involving a dragon and an ice queen. The paragraph also compares the model’s effectiveness with prompts that involve specific sound effects and lip sync.

10:05

🗣️ Dialogue Challenges and Character Labels

In this section, the video focuses on the challenge of generating natural dialogue with the Kling 2.6 model when multiple characters are involved. The author emphasizes the importance of labeling characters clearly and consistently to help the model separate the dialogue. A series of examples demonstrates both successes and failures with two- and three-character dialogues. Issues like mislabeling and missing sound effects (e.g., car sputtering or cafe chatter) are highlighted. Despite adjustments in prompts, the model still struggles with creating natural interactions between multiple characters.

🐾 A Dog Talking and Other Unexpected Outcomes

This paragraph explores the Kling 2.6 model's ability to generate videos with animal characters, specifically a dog. The dog's speech and lip sync are good, but despite detailed prompts, background elements like suspenseful music and sound effects (e.g., clattering or city street ambience) are missed. A comparison is made to an earlier test where the model didn't include wind effects despite their explicit inclusion in the prompt. The section also mentions a humorous werewolf example that highlights some of the model's inconsistencies with sound effects and natural speech.

🤡 Clown's Emotional Range and Final Test

This section concludes with an experiment using a clown character who goes through a range of emotions, from eerie to cheerful to sad and finally furious. The test highlights the strengths and weaknesses of the model, especially in terms of lip sync and camera movements. Despite a detailed prompt, the sadness in the clown's performance was not fully conveyed, and once again, the model failed to add requested sound effects. The overall result was a mixture of successes and flaws, showing that while the model works well in some areas, it still has significant room for improvement.

🔍 Conclusion and Personal Experience with the 2.6 Model

In the final paragraph, the author summarizes their experience with the Kling 2.6 model, acknowledging both its strengths and limitations. While the lip sync and camera directions were generally effective, the handling of complex audio combinations, dialogue assignment, and character-voice consistency was problematic. The model struggles with multi-character interactions and doesn't always deliver expected results in terms of sound effects or voice consistency. The author concludes with advice for users and asks for viewer feedback on their own experiences with the model.

Keywords

💡Kling 2.6 Native Audio

Kling 2.6 Native Audio is a new feature in the Kling platform, introduced as part of the Omni launch week. This model enables users to generate videos that include narration, perfect lip sync, sound effects, and even music from a single text prompt. The feature simplifies video creation, offering a more integrated multimedia experience where audio is seamlessly synced with visual elements, enhancing the overall storytelling experience.

💡Lip Sync

Lip sync refers to the synchronization of a character's lip movements with the spoken audio in a video. In the context of Kling 2.6, this feature is emphasized as one of the main strengths of the model. It ensures that characters' mouth movements match their dialogue perfectly, which is crucial for realism and immersion in generated videos, as seen in the example where the lip sync is described as 'perfect' for certain videos.

💡Voice Output

Voice output refers to the spoken dialogue that is generated by the Kling 2.6 model. This includes not only the primary dialogue but also any background voices or sound effects, such as ambient noise or environmental sounds. The model currently supports voice output in Chinese and English, translating non-supported languages into English for consistency.

💡Sound Effects

Sound effects are additional auditory elements that are added to videos to enhance the immersive experience. These can range from environmental sounds, like wind or street chatter, to action-oriented sounds, like explosions or car engines. In the script, sound effects were often used as examples to demonstrate the model's ability to generate audio that matches the visual cues, although it is noted that sometimes these effects were not generated as expected.

💡ElevenLabs

ElevenLabs is another platform that supports the Kling 2.6 model, allowing users to generate videos with or without audio. The platform provides a straightforward interface for video creation, where users can upload images, adjust aspect ratios, and set video durations. However, ElevenLabs' output sometimes had missing audio effects or improper voice delivery, as highlighted in the video's critique of the generated examples.

💡Camera Movements

Camera movements refer to how the visual perspective of a video changes during the animation, such as zooming in or panning across the scene. The Kling 2.6 model allows users to include camera movement instructions in the text prompt, which it attempts to follow in the generated video. The video discusses instances where the model followed camera directions well but occasionally missed specific directives, such as zooming or panning.

💡Text Prompt

A text prompt is the primary input given to the Kling 2.6 model to generate a video. It provides the model with instructions on the scene description, characters, movements, audio requirements, and any extra stylistic notes. A well-structured prompt, as suggested in the script, helps ensure the best output, with detailed descriptions leading to more accurate results in the generated video.

💡Character Labels

Character labels are tags used in the text prompt to specify which character is speaking or performing an action at any given moment in the video. This helps the model accurately attribute voice output and actions to the correct character, as shown in the video where the speaker labels such as 'the man in the driver's seat' or 'the woman in the passenger seat' are used to organize dialogue.

💡Multiple Speakers

Multiple speakers refer to instances where two or more characters are involved in dialogue within the video. The script highlights the challenges of generating videos with multiple speakers, noting that the Kling 2.6 model sometimes struggles to accurately assign lines to the right character or produce natural interactions. This is especially problematic when there are more than two characters, as demonstrated in the failure of a three-person dialogue test.

💡Voice Consistency

Voice consistency refers to the need for a uniform character voice throughout a video or project. The script points out that the Kling 2.6 model struggles with maintaining consistent voice quality across different generations of video. This can be problematic for projects where a character's voice needs to remain the same, making the model less ideal for long-term or series-based content.

Highlights

The Kling 2.6 model introduces native audio support for generating videos with narration, lip sync, sound effects, and music from a single text prompt.

The model supports Chinese and English voice output, with other languages being automatically translated to English.

Best prompt structure involves describing the scene, characters, movements, audio, and additional style or camera notes to ensure more consistent results.

The 2.6 model is available on several platforms, including Kling's site and ElevenLabs, with the option to generate videos with or without sound.

Generating videos with audio costs more credits, with a 10-second video costing over 8,000 credits.

The native audio feature is currently 30% off, but it only supports Chinese and English voice output.

The sound effects in the videos are generally good, but sometimes the camera directions are ignored, or audio is missing or misplaced.

Lip sync works well in most cases, but when multiple characters are involved, the model sometimes assigns lines to the wrong speaker.

Ambient sound effects, like wind or city street noises, are not always generated as expected, despite being requested.

The model's ability to handle multiple types of audio, such as dialogue and sound effects, can be inconsistent, especially in more complex scenes.

When testing multi-character dialogues, the model sometimes gets confused and assigns lines to the wrong character.

Kling's user guide offers helpful tips on how to structure prompts for the best results.

The model excels at generating natural-sounding dialogue for a single speaker but struggles with maintaining consistency in a multi-character setup.

Generating videos with the Kling 2.6 model may require adjustments in the prompt structure for optimal results, especially when aiming for specific details like accents.

The reviewer's experiment with generating a dog character's speech was generally successful, though there were issues with lip-movement timing and missing background music.