Abstract
This paper presents the main approaches used to synthesize talking faces, and provides greater detail on a handful of them. An attempt is made to distinguish between facial synthesis itself (i.e. the manner in which facial movements are rendered on a computer screen) and the way these movements may be controlled and predicted from phonetic input. The two main synthesis techniques (model-based vs. image-based) are contrasted and illustrated by brief descriptions of the most representative existing systems. The challenging issues of evaluation, data acquisition, and modeling that may drive future models are also discussed and illustrated by our current work at ICP.
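The control problem sketched above (predicting facial movements from phonetic input) is commonly handled with a coarticulation model such as the dominance functions of Cohen and Massaro (1993): each phonetic segment carries a target value for a facial parameter, and the trajectory at any instant is the dominance-weighted mean of neighboring targets. The sketch below is a minimal illustration under assumed parameter values (the `theta` decay rate and the lip-opening targets are hypothetical), not the implementation of any particular system.

```python
import math

def dominance(tau, alpha=1.0, theta=0.08, c=1.0):
    # Negative-exponential dominance function (Cohen & Massaro style):
    # a segment's influence decays with the distance |tau| (in ms)
    # from its temporal center.
    return alpha * math.exp(-theta * abs(tau) ** c)

def parameter_trajectory(segments, t):
    # segments: list of (center_ms, target_value) pairs for one facial
    # parameter (e.g. lip opening). The value at time t is the
    # dominance-weighted mean of the targets, so adjacent segments
    # blend smoothly into one another (coarticulation).
    weights = [dominance(t - center) for center, _ in segments]
    total = sum(weights)
    return sum(w * target for w, (_, target) in zip(weights, segments)) / total

# Hypothetical lip-opening targets (arbitrary units) for an /a b a/ sequence:
segs = [(50.0, 0.9), (150.0, 0.05), (250.0, 0.9)]
print(parameter_trajectory(segs, 150.0))  # near the /b/ closure target
print(parameter_trajectory(segs, 50.0))   # near the open /a/ target
```

Because the weighting never goes exactly to zero, the vowels' open targets slightly pull the trajectory away from full closure at the consonant, which is the behavior such dominance models are designed to capture.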
Bailly, G., Bérar, M., Elisei, F. et al. Audiovisual Speech Synthesis. International Journal of Speech Technology 6, 331–346 (2003). https://doi.org/10.1023/A:1025700715107