Phoneme Mapping with 17 visemes
This table displays the 17 visemes used by the lipsync tool for its high-polygon 3D character.
The visemes are shown at "full open". The viseme targets for your character should be somewhat exaggerated from real mouth positions. The blending functionality of Annosoft Lipsync tends to weaken everything somewhat.
The vowels can be condensed to fewer visemes and still have realism. There is room for debate on the vowel mapping, but this is what the lipsync tool uses. They are sorted by mouth similarity.
In this artwork, the "L" viseme is crummy. The tongue needs to be very high and the mouth needs to be more open, like a big AH sound with the tongue at the roof of the mouth. N also has a high tongue, but in our experiments, it detracts from realism.
Note for textless lipsync users: f, v, m, b, p can be somewhat problematic in textless lipsync. These are difficult speech regions to identify exactly. 'm' gets confused with 'n'; f, v, b, p and x are also tough to distinguish. Realism is better achieved with non-exaggerated mouth positions for these phonemes. This is a workaround for speech recognition errors, but it seems to be pretty effective.
AA - odd, adah
AH, h - adapt, marsha, he
AO - score, ought
AW, OW - cow, oats
OY, UH, UW - toy, tough, two
EH, AE - Ted, cat
IH, AY - hit, hide
EY - ate, gate
y, IY - yes, yum, eat
r, ER - ranger
l - loud, unload
w - would, unwind
m, p, b
n, NG, DH, d, g, t, z, ZH, TH, k, s
CH, j, SH
f, v
x
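The grouping above amounts to a many-to-one lookup from phoneme to viseme. A minimal sketch of that lookup follows. The group names (`MBP`, `N`, and so on) and the function `viseme_for` are invented for illustration and are not part of the Annosoft tool. ZH is placed with the neutral consonants, as in the list above; the consonant table further down groups it with CH, j, SH instead, so adjust as needed.

```python
from typing import Optional

# Hypothetical viseme group names (keys) mapped to the phoneme labels
# exactly as they are written in the list above.
VISEME_GROUPS = {
    "AA": ["AA"],
    "AH": ["AH", "h"],
    "AO": ["AO"],
    "AW": ["AW", "OW"],
    "OY": ["OY", "UH", "UW"],
    "EH": ["EH", "AE"],
    "IH": ["IH", "AY"],
    "EY": ["EY"],
    "IY": ["y", "IY"],
    "ER": ["r", "ER"],
    "L": ["l"],
    "W": ["w"],
    "MBP": ["m", "p", "b"],
    "N": ["n", "NG", "DH", "d", "g", "t", "z", "ZH", "TH", "k", "s"],
    "CH": ["CH", "j", "SH"],
    "FV": ["f", "v"],
    "X": ["x"],
}

# Invert the groups into a flat phoneme -> viseme table.
PHONEME_TO_VISEME = {
    phoneme: viseme
    for viseme, phonemes in VISEME_GROUPS.items()
    for phoneme in phonemes
}


def viseme_for(phoneme: str) -> Optional[str]:
    """Return the viseme group for a phoneme label, or None if unknown.

    Labels are case-sensitive because the list mixes upper and lower case.
    """
    return PHONEME_TO_VISEME.get(phoneme)


print(len(VISEME_GROUPS), "visemes,", len(PHONEME_TO_VISEME), "phonemes")
print([viseme_for(p) for p in ["k", "AE", "t"]])  # the word "cat"
```

Keeping the groups as the source of truth and deriving the flat table from them makes it easy to condense the vowels into fewer visemes later: merge two groups and the lookup follows.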
The mouth positions can be roughly grouped into vowels and consonants. Roughly, because some consonants (y, r, l, w) have a vowel realization.
Vowels are voiced sounds. At its peak, a vowel has a comparatively open mouth position relative to its neighboring consonants. Annosoft Lipsync automatically determines how open the mouth should be, and when. The viseme should be as accurate a representation as possible. To get there, we need to subdivide the vowels so that each one has an accurate mouth reference.
Realism is affected by three factors in the viseme representation: openness, forward-roundedness, and width-backness.
Referring to the viseme images, we can classify each phoneme by the contribution of the openness factor, the forward-rounded factor, and the width-backness factor. This is subjective, and there may be disagreement. Each viseme is listed followed by its contributions. A contribution of zero means the "neutral position" for forward-roundedness and width-backness. For openness, zero means closed.
Viseme | openness | forward-roundedness | width-backness | notes |
---|---|---|---|---|
AA | 100% | 0 | 0 | Neutral Mouth - wide open |
AH, h | 70% | 0 | 0 | Neutral Mouth - opened but not exaggerated |
AO | 60% | 60% | 0 | open and rounded, probably needs to be more open |
AW OW | 90% | 80% | 0 | Rounded Mouth - open |
OY UH UW | 70% | 70% | 0 | Rounded Mouth, fairly open |
EH, AE | 80% | 0 | 60% | Back mouth without exaggerated open or exaggerated back |
IH,AY | 60% | 0 | 60% | less open than EH, AH, but same width/back |
EY | 70% | 0 | 80% | back mouth with some openness |
y, IY | 50% | 0 | 100% | full back mouth |
r, ER | 35% | 80% | 0 | small rounded mouth. |
w | 20% | 100% | 0 | full - almost closed rounded mouth |
l | 70% | 0 | 0 | neutral mouth - open - TONGUE UP |
The table offers a different way to look at the visemes, in terms of the contributions of the three elements. As we can see, visemes in the same class (neutral, back, forward-rounded) differ mainly in the amount of openness at the maximum position.
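To make the class observation concrete, the vowel table can be written as data and grouped by class. This is only a sketch: the `Contribution` type, the `mouth_class` rule (any roundedness means forward-rounded, otherwise any width means back, otherwise neutral), and the function names are assumptions made for illustration, not something the lipsync tool exposes. The percentages are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contribution:
    openness: int  # percent, 0 = closed
    rounded: int   # percent forward-roundedness, 0 = neutral
    width: int     # percent width-backness, 0 = neutral


# Values transcribed from the vowel table above.
VOWEL_VISEMES = {
    "AA": Contribution(100, 0, 0),
    "AH, h": Contribution(70, 0, 0),
    "AO": Contribution(60, 60, 0),
    "AW OW": Contribution(90, 80, 0),
    "OY UH UW": Contribution(70, 70, 0),
    "EH, AE": Contribution(80, 0, 60),
    "IH,AY": Contribution(60, 0, 60),
    "EY": Contribution(70, 0, 80),
    "y, IY": Contribution(50, 0, 100),
    "r, ER": Contribution(35, 80, 0),
    "w": Contribution(20, 100, 0),
    "l": Contribution(70, 0, 0),
}


def mouth_class(c: Contribution) -> str:
    """Assumed classification rule; the source gives classes but no rule."""
    if c.rounded > 0:
        return "forward-rounded"
    if c.width > 0:
        return "back"
    return "neutral"


def by_class(table):
    """Group visemes by class, most open first within each class."""
    groups = {}
    for name, c in table.items():
        groups.setdefault(mouth_class(c), []).append((name, c.openness))
    for members in groups.values():
        members.sort(key=lambda m: m[1], reverse=True)
    return groups


for cls, members in by_class(VOWEL_VISEMES).items():
    print(cls, members)
```

Within each class the members line up mostly by openness, which is the pattern the paragraph above describes.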
This ends the section on vowels. Please provide feedback if this documentation isn't up to par!
Consonants can be divided into categories: fricatives, stops, etc.
Some consonants play a minor role in the viseme realization; others have a strong effect. Consonants that close the mouth are important for realism, and the speech recognizer must get these correct. The timing of the mouth opening and closing is a critical feature for realism. So are the back-and-forth movements between "bee" and "cow".
m, p, b, f, v need to be represented as a closed mouth. Lip bending is nice too, and we have it here. Right now, the speech recognizer's intensity values for these are usually pretty low, so the m, p, b, f, v lip exaggerations are not utilized very well. This should be changed, because it's good to have them. For textless lipsync, I recommend not exaggerating them: silence with a little background noise may be misinterpreted by the textless lipsync as "f" or "v".
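The recommendation for textless lipsync is really about the artwork, but the same effect can be approximated at playback time by toning down the lip exaggeration for these phonemes. The sketch below assumes a recognizer intensity in the range 0 to 1 and a hypothetical `damping` factor of 0.5. Neither the function nor the factor comes from Annosoft Lipsync; they only illustrate the idea.

```python
# Phonemes the text singles out as unreliable in textless lipsync.
CLOSED_MOUTH = {"m", "p", "b", "f", "v"}


def lip_exaggeration_weight(phoneme: str, intensity: float,
                            textless: bool, damping: float = 0.5) -> float:
    """Return the weight to apply to the lip-curl / lip-bite exaggeration.

    intensity: recognizer intensity, assumed to be in [0, 1].
    damping:   hypothetical factor applied only in textless mode.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must be between 0 and 1")
    if textless and phoneme in CLOSED_MOUTH:
        # A misrecognized "f" or "v" during near-silence now produces
        # a smaller visible error.
        return intensity * damping
    return intensity


print(lip_exaggeration_weight("f", 0.8, textless=True))
print(lip_exaggeration_weight("f", 0.8, textless=False))
```

With text-based lipsync the phoneme identity is known, so the weight passes through unchanged.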
The "neutral" consonants "n,NG,DH,d,g,t,k,z,TH,s" act more like tweens between vowels and vowels, or between vowels and closed-mouth consonants. One could successfully argue that the phoneme TH should be tongue forward. This should work decently with text-based lipsync.
A few consonants introduce a forward-rounded element that affects realism. The phonemes "CH,j,SH,ZH" are mapped to a very forward mouth position with teeth displayed.
Viseme | openness | forward-roundedness | width-backness | notes |
---|---|---|---|---|
n NG DH d g t k z TH s | 50% | 0% | 0% | Neutral Mouth - neutral open position. |
m,b,p | 0% | 0% | 0% | Closed Mouth - lip curves in |
f,v | 0% | 0% | 0% | Closed Mouth - bite top lip |
CH j SH ZH | 40% | 60% | 15% | Lips forward. say "cheese" |
TH | 50% | 0% | 0% | hypothetical- like nNG but tongue out. Not implemented here |
The consonants should be created in a fairly exaggerated way. They should look a little unnatural. The lipsync system weights the nNG phonemes and will almost never show a 100% NG.
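The reason for exaggerating the targets can be shown with a little arithmetic. The sketch below assumes the blend simply scales a viseme target by a weight between 0 and 1. That linear model is an assumption made for illustration; the actual blending inside Annosoft Lipsync is not described here. The function names and the example weight of 0.8 are likewise hypothetical.

```python
from typing import Tuple

# (openness, forward-roundedness, width-backness), all in percent
Pose = Tuple[float, float, float]


def displayed_pose(target: Pose, weight: float) -> Pose:
    """What the viewer sees if the target is scaled linearly by weight."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return tuple(weight * t for t in target)


def exaggerated_target(desired: Pose, typical_weight: float) -> Pose:
    """Target to model so that desired is reached at typical_weight.

    Each component is capped at 100 percent.
    """
    if not 0.0 < typical_weight <= 1.0:
        raise ValueError("typical_weight must be in (0, 1]")
    return tuple(min(100.0, d / typical_weight) for d in desired)


# Row "CH j SH ZH" from the consonant table above.
CH_J_SH_ZH: Pose = (40.0, 60.0, 15.0)

print(displayed_pose(CH_J_SH_ZH, 0.8))      # weaker than the modeled target
print(exaggerated_target(CH_J_SH_ZH, 0.8))  # what to model to compensate
```

Under this model, a viseme that is rarely shown at full weight has to be built past the natural position for the natural position to ever appear on screen.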
Copyright (c) 2008 Annosoft LLC. All Rights Reserved.