Phoneme Mapping with 17 visemes

This table displays the 17 visemes used by the lipsync tool for its high-polygon 3D character.

The visemes are shown at "full open". The viseme targets for your character should be somewhat exaggerated relative to real mouth positions, because the blending functionality of Annosoft Lipsync tends to weaken everything somewhat.

The vowels can be condensed to fewer visemes and still have realism. There is room for debate on the vowel mapping, but this is what the lipsync tool uses. They are sorted by mouth similarity.

In this artwork, the "L" viseme is crummy. The tongue needs to be very high and the mouth needs to be more open, like a big AH sound with the tongue at the roof of the mouth. N also has a high tongue, but in our experiments, it detracts from realism.

Note for textless lipsync users: f, v, m, b, p can be somewhat problematic in textless lipsync. These are difficult speech regions to identify exactly: 'm' gets confused with 'n', and f, v, b, p and x are also tough to distinguish. Realism is better achieved with non-exaggerated mouth positions for these phonemes. This is a workaround for speech recognition errors, but it seems to be pretty effective.

 

AA - odd, adah

AH, h - adapt, marsha, he

AO - score, ought

AW OW - cow, oats

OY UH UW - toy, tough, two

EH, AE - Ted, cat

IH, AY - hit, hide

EY - ate, gate

y, IY - yes, yum, eat

r, ER - ranger

l - loud, unload

w - would, unwind

m,p,b

n,NG,DH,d,g,t,z,ZH,TH,k, s

CH, j, SH

f,v

x
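
The grouping above can be written down as a lookup table. The following is a minimal Python sketch, not code from the lipsync tool itself. The names VISEME_GROUPS, PHONEME_TO_VISEME and viseme_for are hypothetical, and the phoneme labels (including their capitalization) are copied from the list as it stands.

```python
# Hypothetical lookup table built from the 17 viseme groups listed above.
# Phoneme labels keep the capitalization used in the list; ZH is placed
# where the list places it.
VISEME_GROUPS = [
    ("AA",),
    ("AH", "h"),
    ("AO",),
    ("AW", "OW"),
    ("OY", "UH", "UW"),
    ("EH", "AE"),
    ("IH", "AY"),
    ("EY",),
    ("y", "IY"),
    ("r", "ER"),
    ("l",),
    ("w",),
    ("m", "p", "b"),
    ("n", "NG", "DH", "d", "g", "t", "z", "ZH", "TH", "k", "s"),
    ("CH", "j", "SH"),
    ("f", "v"),
    ("x",),
]

# Map each phoneme label to the index of its viseme group.
PHONEME_TO_VISEME = {
    phoneme: index
    for index, group in enumerate(VISEME_GROUPS)
    for phoneme in group
}


def viseme_for(phoneme: str) -> int:
    """Return the viseme index for a phoneme label (KeyError if unknown)."""
    return PHONEME_TO_VISEME[phoneme]
```

With this table, m, p and b all resolve to the same viseme index, while f and v share a different one.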

 

 

 

Classification of the visemes

The mouth positions can be approximately grouped into vowels and consonants. Only approximately, because some consonants (y, r, l, w) have a vowel realization.

Vowels

Vowels are the voiced sounds. At its peak position, a vowel has a comparatively open mouth position relative to its neighboring consonants. Annosoft Lipsync automatically determines how open the mouth should be, and when. The viseme should be as accurate a representation as possible. To get there, we need to subdivide the vowels so that we can make an accurate mouth reference for each.

Realism is affected by three factors in the viseme representation:


  • The width/backness of the mouth. The width of the mouth depends on the amount of "E" contribution. For example, in the viseme for "bee" the mouth is wide and back, and "boo" is almost the opposite. Vowel phonemes may have some contribution of E, bee, rest.
  • The openness of the mouth. The speech functionality generates articulation information that controls the openness and emphasis of a phoneme, either per frame (phn_vis) or on a curve (phn_env). Mouths should be created as if they were being emphasized. The speech system will generate information to control "how much" of the phoneme is turned on, and it won't exaggerate the phoneme unless the speech is emphasized.
    That said, certain phonemes are more open than others by default. Compare the word "cow" with the word "hit": the mouth is not as open in "hit" given the same volume level.
  • The rounded-forwardness of the mouth. The "w" sound is an almost closed mouth, very curved; the word "touch" has some forward openness; the word "slam" has little forward motion. For good lipsync, it is necessary to try to capture the various curved phonemes.
    Each phoneme will be a combination of those attributes. Most are a combination of two: openness and width-backness, or openness and rounded-forwardness. Ugly terms, but perhaps they make sense.
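
To make the idea of "how much" of a phoneme is turned on concrete, here is a minimal Python sketch. It assumes the weight simply scales a full-strength target linearly toward the rest position; this document does not describe how Annosoft Lipsync actually blends, so treat the linear scaling as an assumption. MouthShape and apply_weight are hypothetical names, and the example values are purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MouthShape:
    """A viseme target described by the three factors (hypothetical type)."""
    openness: float       # 0.0 = closed, 1.0 = fully open
    forward_round: float  # 0.0 = neutral, 1.0 = fully rounded and forward
    width_back: float     # 0.0 = neutral, 1.0 = fully wide and back


def apply_weight(target: MouthShape, weight: float) -> MouthShape:
    """Scale a full-strength target by a weight clamped to [0, 1].

    Assumption: plain linear scaling from the rest position (all zeros).
    """
    w = min(max(weight, 0.0), 1.0)
    return MouthShape(
        target.openness * w,
        target.forward_round * w,
        target.width_back * w,
    )


# Illustrative values only: a wide, back mouth shown at half strength.
wide_back = MouthShape(openness=0.5, forward_round=0.0, width_back=1.0)
half = apply_weight(wide_back, 0.5)
```

Because the weight only ever reduces the target in this sketch, the target itself has to be authored at its emphasized extreme, which is the point made above about creating mouths as if they were being emphasized.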

    Referring to the viseme images, we can classify each phoneme by the contribution of a width-backness factor, the openness factor, and the rounded-forward factor. This is subjective, and there may be disagreement. Each viseme is listed followed by its contributions. For forward-roundedness and width-backness, a contribution of zero means the "neutral position". For openness, zero means closed.

    Viseme    Openness  Forward-roundedness  Width-backness  Notes
    AA        100%      0%                   0%              Neutral mouth, wide open
    AH, h     70%       0%                   0%              Neutral mouth, opened but not exaggerated
    AO        60%       60%                  0%              Open and rounded; probably needs to be more open
    AW OW     90%       80%                  0%              Rounded mouth, open
    OY UH UW  70%       70%                  0%              Rounded mouth, fairly open
    EH, AE    80%       0%                   60%             Back mouth, without exaggerated openness or backness
    IH, AY    60%       0%                   60%             Less open than EH, AE, but same width/back
    EY        70%       0%                   80%             Back mouth with some openness
    y, IY     50%       0%                   100%            Full back mouth
    r, ER     35%       80%                  0%              Small rounded mouth
    w         20%       100%                 0%              Full rounded mouth, almost closed
    l         70%       0%                   0%              Neutral mouth, open, TONGUE UP
    Table 1: Vowel Visemes

    The table offers a different way to look at the visemes, in terms of the contributions of the three elements. As we can see, visemes in the same class (neutral, back, forward-rounded) vary, in general, by the amount of openness at the maximum position.
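
The class grouping can be checked mechanically. The sketch below copies the Table 1 values into a Python dictionary, assigns each viseme to the neutral, back, or forward-rounded class, and orders each class by openness. The names VOWEL_VISEMES, mouth_class and by_class are hypothetical, and the rule used to pick a class (any non-zero forward-roundedness means forward-rounded, otherwise any non-zero width-backness means back) is an assumption that happens to fit these rows.

```python
# (openness, forward-roundedness, width-backness) in percent, from Table 1.
VOWEL_VISEMES = {
    "AA": (100, 0, 0),
    "AH, h": (70, 0, 0),
    "AO": (60, 60, 0),
    "AW OW": (90, 80, 0),
    "OY UH UW": (70, 70, 0),
    "EH, AE": (80, 0, 60),
    "IH, AY": (60, 0, 60),
    "EY": (70, 0, 80),
    "y, IY": (50, 0, 100),
    "r, ER": (35, 80, 0),
    "w": (20, 100, 0),
    "l": (70, 0, 0),
}


def mouth_class(forward_round: int, width_back: int) -> str:
    """Classify a viseme by which non-openness factor it uses (assumed rule)."""
    if forward_round > 0:
        return "forward-rounded"
    if width_back > 0:
        return "back"
    return "neutral"


def by_class(table):
    """Group viseme names by class, most open first within each class."""
    classes = {}
    for name, (openness, forward_round, width_back) in table.items():
        key = mouth_class(forward_round, width_back)
        classes.setdefault(key, []).append((openness, name))
    return {
        key: [name for _, name in sorted(rows, key=lambda r: r[0], reverse=True)]
        for key, rows in classes.items()
    }


GROUPED = by_class(VOWEL_VISEMES)
```

Within each resulting class the entries differ mainly in openness, which is the pattern the table is meant to show.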

    This ends the section on vowels. Please provide feedback if this documentation isn't up to par!

    Consonants

    Consonants can be divided into categories: fricatives, stops, etc.

    Some consonants play a minor role in the viseme realization; others have a strong effect. Consonants that close the mouth are important for realism, and the speech recognizer must get these correct. The timing of the opening and closing of the mouth is a critical feature for realism. So are the back-and-forth movements between "bee" and "cow".

    m, p, b, f, v need to be represented as a closed mouth; lip bending is nice too, and we have it here. Right now, the speech recognizer's intensity values for these are usually pretty low, and this should be changed. The result is that the m, p, b, f, v lip exaggerations are not utilized very well, but it's good to have them. For textless lipsync, I recommend not exaggerating them: silence with a little background noise may be misinterpreted by the textless lipsync as "f" or "v".

    The "neutral" consonants "n, NG, DH, d, g, t, k, z, TH, s" act more like tweens between two vowels, or between a vowel and a closed-mouth consonant. One could successfully argue that the phoneme TH should be tongue forward. This should work decently with text-based lipsync.

    A few consonants introduce a forward-rounded element that affects realism: the phonemes "CH, j, SH, ZH" are mapped to a very forward mouth position with teeth displayed.

    Viseme                  Openness  Forward-roundedness  Width-backness  Notes
    n NG DH d g t k z TH s  50%       0%                   0%              Neutral mouth, neutral open position
    m, b, p                 0%        0%                   0%              Closed mouth, lip curves in
    f, v                    0%        0%                   0%              Closed mouth, bite top lip
    CH j SH ZH              40%       60%                  15%             Lips forward; say "cheese"
    TH                      50%       0%                   0%              Hypothetical: like nNG but tongue out. Not implemented here
    Table 2: Consonant Visemes
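
Since the mouth-closing consonants matter most for realism, it can help to locate them in a phoneme sequence. The following Python sketch copies the implemented rows of Table 2 and derives the closing phonemes from the zero-openness entries. CONSONANT_VISEMES, CLOSING_PHONEMES and closure_positions are hypothetical names, and the phoneme sequence in the usage line is an illustrative input, not output from the recognizer.

```python
# (openness, forward-roundedness, width-backness) in percent, from Table 2.
# The hypothetical TH row is left out because it is not implemented.
CONSONANT_VISEMES = {
    "n NG DH d g t k z TH s": (50, 0, 0),
    "m b p": (0, 0, 0),
    "f v": (0, 0, 0),
    "CH j SH ZH": (40, 60, 15),
}

# Phonemes whose viseme has zero openness, i.e. the mouth must close.
CLOSING_PHONEMES = {
    phoneme
    for label, (openness, _, _) in CONSONANT_VISEMES.items()
    if openness == 0
    for phoneme in label.split()
}


def closure_positions(phonemes):
    """Return the indices in a phoneme sequence where the mouth should close."""
    return [i for i, p in enumerate(phonemes) if p in CLOSING_PHONEMES]


# Illustrative input: closures fall at the first, third and fourth phonemes.
positions = closure_positions(["b", "AH", "m", "p"])
```

For the sequence b, AH, m, p this yields the positions 0, 2 and 3, which are the frames where a recognition or timing error would be most visible.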

    The consonants should be created in a fairly exaggerated way. They should look a little unnatural. The lipsync system weights the nNG phonemes and will almost never show a 100% NG.


    Copyright (c) 2008 Annosoft LLC. All Rights Reserved.