SAPI 5.1 Lipsync Console Program and Source Code

Hi. Welcome to the documentation for the free SAPI lipsync implementation provided by Annosoft.

The latest source code and binary release can be downloaded from http://www.annosoft.com/sapi_lipsync/latest_source.zip. The plan is to move this to sourceforge.net.

This software and source code allow developers to integrate an automatic lipsync engine into their application, or to use it directly. Annosoft offers higher-end versions, but this one is FREE.

This software is a WIN32 Console Application that uses the Microsoft Speech API (SAPI) 5.1 Engine to generate time-aligned phonetic information from Microsoft RIFF Wave input.

It is distributed as an executable and in source form in the hope that this may be useful. Annosoft developed this software in response to customer queries (see Why did Annosoft release this software?).

We think that it might be useful to independent game developers, and might also be used to produce new innovative tools to aid the game industry. Check out Copyright and Usage Information for more details.

The inputs to the system are a wave file and an optional text transcription. The mode which uses only the wave file is called "textless lipsync", and the mode which uses both the wave file and the text transcript is called "text based lipsync". You will see these terms all over the code and in the documentation.

The output from the system is a newline-delimited list of phoneme timings and word timings produced by SAPI. The program prints the results to the console. Other programs can either read this output directly, or it can be redirected, using the ">" command line operator, to an auxiliary file.

The output is a simple format described in (.anno file format). The good news is that the output of the system can be evaluated directly in The Lipsync Tool. No need to buy anything, just download the software provided at http://www.annosoft.com/demos.htm

The Lipsync Tool isn't required but does make it easy to evaluate this software. Otherwise, you'll need to develop a rendering system of some kind since all this software outputs is phoneme and word timing.

Runtime Requirements

The system expects that SAPI 5.1 is installed on the machine and that the Speech Recognition language is set to "Microsoft Speech Recognizer 5.1". This can be validated by opening the Control Panel -> Speech Configuration. The "language" drop-list should be set to "Microsoft Speech Recognizer 5.1". Other recognizers won't work as well, because we don't have phoneme mappings for them, and they may fail altogether.

Build Requirements

In the project settings, the C++ include paths include a path to the Speech API headers:

  • "C:\Program Files\Microsoft Speech SDK 5.1\Include"

If you install the SAPI 5.1 SDK to a different location than this, you will need to change the build path.

In the project settings, the C++ link paths (Additional Library Path) include a path to:

  • "C:\Program Files\Microsoft Speech SDK 5.1\Lib\i386"

Again, if you install the SAPI 5.1 SDK to a different location, you'll need to modify this in order to link.

With those two settings, the projects should build fine. You can turn _UNICODE on or off, but the internal structures are primarily std::wstring because SAPI 5.1 expects and returns Unicode strings.
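Because the internals are wide strings, narrow input such as a command line argument in a non-_UNICODE build has to be widened somewhere. Here is a minimal sketch of one way to do that. It is not the project's own conversion: widen_ascii is a hypothetical name, and the sketch assumes the input is plain 7-bit ASCII. For anything else, a proper code page conversion (for example the Win32 MultiByteToWideChar function) would be the better choice.

```cpp
#include <string>

// Hypothetical helper (not part of sapi_lipsync): widen a narrow string
// to std::wstring one character at a time.
// Assumption: the input is 7-bit ASCII, so every char maps to the wide
// character with the same value. Non-ASCII text needs a real conversion.
std::wstring widen_ascii(const std::string& narrow)
{
    std::wstring wide;
    wide.reserve(narrow.size());
    for (unsigned char ch : narrow) {
        wide.push_back(static_cast<wchar_t>(ch));
    }
    return wide;
}
```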


It's a console application in its current form. See "sapi_lipsync.exe command line arguments" for the current options, which will likely be expanded.

The system spits phoneme timings out to the console. These can be redirected to a file, which is designed to be easy to parse. By default, we map to the Annosoft Phoneme Labels, which are surprisingly similar to those used by the SAPI 5.1 English system. I haven't tried it on other languages; too much of a headache.

DOS\>sapi_lipsync.exe 01.wav 01.txt > 01.anno
;; run text based lipsync and pipe the results into the anno file
;; this file can be opened by "The Lipsync Tool" for evaluation

DOS\>sapi_lipsync.exe 01.wav > 01.anno
;; run textless lipsync and pipe the results into the anno file
;; this file can be opened by "The Lipsync Tool" for evaluation
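The two invocations above differ only in whether a transcript file follows the wave file. A minimal sketch of that decision, assuming no other options are present, might look like the following. The names lipsync_mode and select_mode are hypothetical and are not taken from the sapi_lipsync source, whose actual argument handling may differ.

```cpp
#include <string>
#include <vector>

// Hypothetical names for illustration only.
enum class lipsync_mode { textless, text_based, usage_error };

// args holds the command line arguments without the program name,
// e.g. {"01.wav"} or {"01.wav", "01.txt"}.
// Assumption: no option flags are present, only the file names.
lipsync_mode select_mode(const std::vector<std::string>& args)
{
    if (args.size() == 1) {
        return lipsync_mode::textless;    // wave file only
    }
    if (args.size() == 2) {
        return lipsync_mode::text_based;  // wave file plus transcript
    }
    return lipsync_mode::usage_error;     // anything else is not handled here
}
```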

That's about it on program usage for now.

The Code

There are only a few classes and functions that make the whole thing work. The actual processing code is in sapi_lipsync.h and sapi_lipsync.cpp.

We created a base class, sapi_lipsync, which contains the common code used by both textless and text based lipsync. We then created two classes (sapi_textless_lipsync and sapi_textbased_lipsync) to do the specialty stuff.

The lipsync process in both cases runs asynchronously and the application has to poll for results. We need to add an "escape" key to stop the process and accept the current results. We don't have that yet.
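The general shape of such a polling loop can be sketched as below. This is an illustration under stated assumptions, not the code in sapi_lipsync.cpp: poll_until_done and poll_result are hypothetical names, the is_done callback stands in for whatever completion check the lipsync classes expose, and stop_requested stands in for the "escape" key check that does not exist yet.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical names for illustration only.
enum class poll_result { finished, stopped_early };

// Poll is_done at a fixed interval until it reports completion.
// stop_requested is checked between polls; if it returns true the loop
// gives up early so the caller can accept whatever results exist so far.
poll_result poll_until_done(const std::function<bool()>& is_done,
                            const std::function<bool()>& stop_requested,
                            std::chrono::milliseconds interval)
{
    while (!is_done()) {
        if (stop_requested()) {
            return poll_result::stopped_early;
        }
        // Sleep rather than spin so the recognizer gets the CPU.
        std::this_thread::sleep_for(interval);
    }
    return poll_result::finished;
}
```

Keeping the stop check separate from the completion check means the caller can tell a finished run apart from an interrupted one, and decide whether partial results are good enough.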

run_sapi_textbased_lipsync demonstrates concisely how to perform text based lipsync and phoneme alignment.

run_sapi_textless_lipsync demonstrates concisely how to perform textless lipsync and phoneme alignment.

Copyright (C) 2002-2005 Annosoft LLC. All Rights Reserved.
Visit us at www.annosoft.com