Lipsync SDKs

Annosoft currently provides two Lipsync SDK solutions.

Textless Lipsync SDK

This SDK produces accurately timed mouth positions and phonemes from a wave file. It analyzes the entire file using a statistical process that is robust across different speakers, and it does not require a text transcription of the audio.

When turnaround time is critical, or when working with languages not supported by the Text Based Lipsync SDK, the Textless Lipsync SDK is an excellent option.

We are interested in helping you solve your unique audio problems. Please send us an e-mail.

Platforms: Win32, MacOS

demo page

Text Based Lipsync SDK

The Text Based Lipsync SDK is the best lipsync technology in the world. Given the audio and a text transcription, this technology produces perfect or near-perfect lipsync on short or very long files.

This technology has been production quality for five years and across thousands of hours of audio. Our customer list is a testament to the quality of this software.

The lipsync data output from the SDK is in a flexible format. If you have an existing character animation implementation, the lipsync data will be usable in a straightforward way.
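As a rough illustration of consuming timed lipsync data, the sketch below parses a hypothetical line-oriented dump of phoneme cues and looks up the phoneme active at a given playback time. The record layout, field names, and text format here are assumptions for the example, not the SDK's actual output schema.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical timed-phoneme record; the fields are illustrative only.
struct PhonemeCue {
    double start;        // seconds from the start of the audio
    double end;          // seconds from the start of the audio
    std::string phoneme; // e.g. "AA", "B"
};

// Parse a simple "start end phoneme" dump, one cue per line,
// e.g. "0.00 0.25 AA". A real integration would read the SDK's own format.
std::vector<PhonemeCue> parseCues(const std::string& text) {
    std::vector<PhonemeCue> cues;
    std::istringstream in(text);
    PhonemeCue c;
    while (in >> c.start >> c.end >> c.phoneme)
        cues.push_back(c);
    return cues;
}

// Return the phoneme active at time t, or "sil" (silence) if none covers it.
std::string phonemeAt(const std::vector<PhonemeCue>& cues, double t) {
    for (const auto& c : cues)
        if (t >= c.start && t < c.end)
            return c.phoneme;
    return "sil";
}
```

An animation loop would call `phonemeAt` (or an equivalent lookup) with the current audio time and map the result to a mouth shape in the character rig.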

Uses include:

  • ultra high-quality lipsync
  • automatic subtitling/closed captioning
  • non-linear, script-driven animation

Why is text-based lipsync/annotation valuable?

Although the textless system is easier for the end user, the text-based system offers a few significant advantages. In addition to perfectly accurate phoneme timings, it accurately times words and user data. Having accurately time-stamped words allows an application to, for example, turn pages automatically in exact sync with the source audio. The text-based version also recognizes and timestamps arbitrary XML embedded in the text transcription. Take this example transcription:

After you have run the demonstration, I need to ask you a question. <animate name="point-2-user"/> After you have answered the question, click here <animate name="point-2-button"/>. Thank you.

The power is that, with an appropriate animation architecture, scenes can be built by adding application-specific markers to the source audio transcription. Arbitrary markers let applications build a production process that doesn't require hand-timing anything to the audio files. The text scripts define the presentation and rely on canned animation sequences to run the scene. Even the actual audio recording can be changed with very little additional production work.
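The marker-driven flow above can be sketched as a small dispatcher: once the aligner has timestamped each embedded tag, the application fires the matching canned animation as playback passes each timestamp. The `Marker` struct and `MarkerDispatcher` class below are illustrative assumptions, not the SDK's API.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical timestamped marker recovered from an <animate .../> tag
// in the transcription after alignment against the audio.
struct Marker {
    double time;      // seconds at which the tag was aligned to the audio
    std::string name; // e.g. "point-2-user"
};

// Fires each marker's callback exactly once as playback time advances
// past it. Markers are assumed sorted by time, as an aligner would emit them.
class MarkerDispatcher {
public:
    MarkerDispatcher(std::vector<Marker> markers,
                     std::function<void(const std::string&)> onMarker)
        : markers_(std::move(markers)), onMarker_(std::move(onMarker)) {}

    // Call on every playback tick with the current audio time.
    void advanceTo(double t) {
        while (next_ < markers_.size() && markers_[next_].time <= t) {
            onMarker_(markers_[next_].name);
            ++next_;
        }
    }

private:
    std::vector<Marker> markers_;
    std::function<void(const std::string&)> onMarker_;
    std::size_t next_ = 0;
};
```

The callback would typically trigger a pre-built animation clip keyed by the marker name, so re-recording the audio only moves the timestamps; no animation needs to be re-timed by hand.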

Converting a text transcription into a set of phonemes for speech alignment is non-trivial. Unlike with the Textless Lipsync SDK, each language requires special processing. Currently, we support:

  • Chinese
  • Czech
  • Danish
  • Dutch
  • English
  • French
  • German
  • Italian
  • Japanese
  • Korean
  • Norwegian
  • Portuguese
  • Russian
  • Spanish
  • Swedish

Because of the power of the Text Based Lipsync SDK, supporting new languages is a priority. We are actively working to broaden the product's multilingual support.

Platforms: Win32, MacOS

demo page

About the SDKs

Annosoft licenses multimedia speech SDKs. Written in C++ and assembly language, the SDKs are painless to integrate into any C++ application or platform. Additionally, a scriptable ActiveX Control is available for use in Visual Basic or other Microsoft technologies.

Annosoft SDKs are extremely flexible because speech models are not hard-coded into the SDKs. This allows our clients to choose from various "stock" speech models that best fit their application. Our stock models give clients the ability to tune their application, at any time, for (1) recognition speed, (2) recognition accuracy, and (3) application footprint. Custom speech models can also be trained on the target audio characteristics and speaker, producing a model that is optimal in both speed and accuracy.