Information on converting phonemes to visemes (visual display)

I should preface this page by saying that I am not an artist. I offer a "plausible" mapping from phonemes to visemes, based on information produced by others and on some years of experience in the area of lipsync.

Smooth 3D lipsync requires more than a simple phoneme-to-viseme mapping, but a basic mapping is critical to begin the work. These "basic" mouth positions can then be twisted and contoured to account for co-articulation, allophones, and intensity information in the sound.

With the basic mapping in hand, a series of transformations can be performed that takes the recognition output and produces visemes.

The phoneme-to-viseme transformation used in the realtime demo is simplistic as these things go. It starts with the raw 'all on' visemes shown below. It then picks the longest phoneme within a 50 millisecond window and morphs between the viseme for that target phoneme and the silence viseme. The morph value is a function of the energy of the audio signal, along with other ad-hoc rules.
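The windowed selection described above can be sketched roughly as follows. This is only an illustration, not the demo's actual code: the function names, the (label, start_ms, end_ms) segment format, and the linear energy-to-weight mapping are all my assumptions, "longest phoneme" is read here as the phoneme that overlaps the window the most, and the ad-hoc rules are not modeled.

```python
from collections import defaultdict

SILENCE = "x"  # silence label from the phoneset


def dominant_phoneme(segments, win_start, win_end):
    """Return the label covering the most time in [win_start, win_end).

    segments: iterable of (label, start_ms, end_ms) tuples (assumed format).
    """
    coverage = defaultdict(int)
    for label, start, end in segments:
        overlap = min(end, win_end) - max(start, win_start)
        if overlap > 0:
            coverage[label] += overlap
    if not coverage:
        return SILENCE
    return max(coverage, key=coverage.get)


def morph_weight(energy, floor=0.0, ceiling=1.0):
    """Map signal energy to a 0..1 blend (0 = silence viseme, 1 = full viseme).

    A plain linear ramp, assumed for illustration only.
    """
    t = (energy - floor) / (ceiling - floor)
    return min(1.0, max(0.0, t))


def viseme_for_window(segments, energy, win_start, win_ms=50):
    """Pick the target phoneme for one window and how far to morph toward it."""
    label = dominant_phoneme(segments, win_start, win_start + win_ms)
    weight = 0.0 if label == SILENCE else morph_weight(energy)
    return label, weight
```

For example, with `segments = [("h", 0, 20), ("IY", 20, 120)]`, the window starting at 0 ms is dominated by IY (30 ms against 20 ms for h), so `viseme_for_window(segments, 0.25, 0)` morphs a quarter of the way from the silence viseme toward the IY viseme.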

The default phoneme-to-viseme transformation used in the Lipsync Tool is more complex. It uses an approach based on articulation theory, which is implemented in the SDK and available to SDK customers.

For bitmapped graphics, where interpolation is not possible, it is a good idea either to create mouth visemes that are less exaggerated than the mouths specified below, or to assign two graphics to each voiced viseme, a hi and a lo. With {hi, lo} tuples and the intensity information returned from the SDK, better mouth contouring can be achieved. Also, for bitmapped graphics, it is advisable to choose longer display frames (83 milliseconds or so) and to pick the phoneme that covers the most time within a given frame, or window of time.
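A rough sketch of the bitmapped approach, under several assumptions of mine: segments arrive as (label, start_ms, end_ms, intensity) tuples with intensity scaled to 0..1, a fixed threshold of 0.5 separates hi from lo, and graphics are named by phoneme label for brevity (a real pipeline would first map the label to its viseme, and would apply the hi/lo split only to voiced visemes). None of these names come from the SDK.

```python
def bitmap_frames(segments, total_ms, frame_ms=83, threshold=0.5):
    """Choose one bitmap name per display frame.

    segments: list of (label, start_ms, end_ms, intensity) tuples, where
    intensity is assumed to be normalized to 0..1.
    Returns names such as "IY_hi" or "m_lo", or "x" for silence.
    """
    frames = []
    for frame_start in range(0, total_ms, frame_ms):
        frame_end = frame_start + frame_ms
        best_label, best_intensity, best_overlap = "x", 0.0, 0
        for label, start, end, intensity in segments:
            # Time this segment spends inside the current frame.
            overlap = min(end, frame_end) - max(start, frame_start)
            if overlap > best_overlap:
                best_label, best_intensity, best_overlap = label, intensity, overlap
        if best_label == "x":
            frames.append("x")  # silence uses a single graphic
        else:
            level = "hi" if best_intensity >= threshold else "lo"
            frames.append(f"{best_label}_{level}")
    return frames
```

With `segments = [("m", 0, 60, 0.3), ("IY", 60, 200, 0.9)]` and `total_ms=200`, the first 83 ms frame is mostly m at low intensity, and the remaining two frames are IY at high intensity.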

On with the basic mappings:

To start, we need the list of 40 phonemes (silence included) currently used by the Annosoft, LLC lipsync recognizer:

Annosoft LLC, Basic Phoneset

label  word example  transcription
x      silence
IY     eat           IY t
IH     it            IH t
EH     Ed            EH d
AE     at            AE t
AH     hut           h AH t
UW     two           t UW
UH     hood          h UH d
AA     odd           AA d
AO     ought         AO t
EY     ate           EY t
AY     hide          h AY d
OY     toy           t OY
AW     cow           k AW
OW     oat           OW t
l      lee           l IY
r      read          r IY d
y      yield         y IY l d
w      we            w IY
ER     hurt          h ER t
m      me            m IY
n      knee          n IY
NG     ping          p IH NG
CH     cheese        CH IY z
j      gee           j IY
DH     thee          DH IY
b      be            b IY
d      dee           d IY
g      green         g r IY n
p      pee           p IY
t      tea           t IY
k      key           k IY
z      zee           z IY
ZH     seizure       s IY ZH ER
v      vee           v IY
f      fee           f IY
TH     theta         TH EY t AH
s      sea           s IY
SH     she           SH IY
h      he            h IY
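When working with transcriptions like the ones above, it can help to check them against the phoneset before doing anything else. The following is a small sketch of such a check; the names PHONESET and unknown_labels are mine, and treating labels as case-sensitive and space-separated is an assumption based on how the table is written.

```python
# The 40 labels from the basic phoneset, silence ("x") included.
PHONESET = frozenset(
    "x IY IH EH AE AH UW UH AA AO EY AY OY AW OW "
    "l r y w ER m n NG CH j DH b d g p t k z ZH v f TH s SH h".split()
)


def unknown_labels(transcription):
    """Return the tokens of a space-separated transcription not in PHONESET."""
    return [tok for tok in transcription.split() if tok not in PHONESET]
```

For instance, `unknown_labels("h AH t")` returns an empty list, while `unknown_labels("k AX t")` returns `["AX"]`, since AX is not one of the labels listed.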

Set 1 (10 mouths)

Annosoft Viseme Set