Department of Psychology

The Speech Perception and Production Lab Banner

Queen's University Department of Psychology
Humphrey Hall, 62 Arch St., Kingston, ON K7L 3N6
T: 613-533-6000 • ext 77595 F: 613-533-2499
E: kevin.munhall@queensu.ca

 

Research

Facial Animation

Facial animation is a research tool in the Speech Perception and Production Laboratory. Animation gives us experimental control over the dynamics of facial motion and the gesture cues that are critical to the visual perception of speech and emotion. In recent years we have worked on three types of animation:

1. Direct Motion Capture Animation Approach: facial animation that is controlled directly by kinematic data from motion capture of the head, face and upper body.

2. Principal Components Approach: the principal components of facial deformation were driven by a small set of motion capture data.

3. Physical Model Approach: a physical model of the face that includes the biomechanics of the skin and the physiological characteristics of the facial musculature was driven by recorded facial muscle EMG signals.

1. Direct Motion Capture Animation Approach:

In this approach to facial animation, we try to record the natural visual speech information in as much detail as we can. We use a multi-camera Vicon system with many small passive markers placed on the face. One typical marker map is shown below.

Drawing of face with passive markers

The 3D motion from these markers drives a talking head implemented in Maya. The underlying mesh is deformed by kinematic data from the motion capture and synchronized with recorded voice signals.
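As a rough illustration of how marker motion can deform a mesh, the Python sketch below moves each mesh vertex by an inverse-distance-weighted average of the marker displacements. This is not the Maya setup described above, whose deformation method is not specified here. The function name `deform_mesh`, the weighting scheme, and the parameters `power` and `eps` are all assumptions made for the example.

```python
import math


def deform_mesh(vertices, markers_rest, markers_now, power=2.0, eps=1e-9):
    """Displace mesh vertices using inverse-distance-weighted marker motion.

    vertices, markers_rest, markers_now: lists of (x, y, z) tuples.
    The weighting scheme is an illustrative assumption, not the lab's method.
    """
    # Displacement of each marker from its rest position.
    displacements = [
        tuple(n - r for n, r in zip(now, rest))
        for now, rest in zip(markers_now, markers_rest)
    ]
    deformed = []
    for v in vertices:
        # Markers close to the vertex (at rest) get large weights.
        weights = [
            1.0 / (math.dist(v, rest) ** power + eps) for rest in markers_rest
        ]
        total = sum(weights)
        moved = tuple(
            v[axis]
            + sum(w * d[axis] for w, d in zip(weights, displacements)) / total
            for axis in range(3)
        )
        deformed.append(moved)
    return deformed
```

A vertex that coincides with a marker follows that marker almost exactly, while a vertex midway between two markers receives the average of their displacements.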

Digital face rendering with grid overlay

The animation is then rendered with a surface texture either in black and white or in colour.

4 panel figure of digital faces making various speech sounds

An animation of part of a motion capture session is shown below.

 

2. Principal Components Approach:

In this approach, created by our colleague Takaaki Kuratate, a facial mesh is constructed from a small set of principal components of deformation. The components are based on a set of 3D scans of the face producing the five Japanese vowels and three non-speech postures. The figure below shows examples of these scans.

Multi-face image showing a face making a sound

Three-dimensional kinematic data from a small set of markers (fewer than 20) drive this animation through a linear estimator that relates the marker locations to each of the adapted face meshes.
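The idea of a linear estimator can be sketched as follows, assuming the mean mesh and the principal component vectors have already been computed from the scans. Given the coordinates observed at a few marker positions, the code finds the component weights that best reproduce them in the least-squares sense and then rebuilds the full mesh. The function names, the flattened coordinate layout, and the use of normal equations are assumptions for this example and may differ from the estimator used in the actual system.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def estimate_weights(marker_values, mean_mesh, components, marker_idx):
    """Least-squares component weights from observed marker coordinates.

    mean_mesh and each component are flat lists of mesh coordinates;
    marker_idx lists which of those coordinates the markers observe.
    """
    residual = [marker_values[i] - mean_mesh[j] for i, j in enumerate(marker_idx)]
    # Restrict every component to the observed coordinates.
    basis = [[comp[j] for j in marker_idx] for comp in components]
    k = len(components)
    gram = [
        [sum(p * q for p, q in zip(basis[a], basis[b])) for b in range(k)]
        for a in range(k)
    ]
    rhs = [sum(p * r for p, r in zip(basis[a], residual)) for a in range(k)]
    return solve_linear(gram, rhs)


def reconstruct_mesh(weights, mean_mesh, components):
    """Rebuild the full mesh as the mean plus weighted components."""
    return [
        m + sum(w * comp[i] for w, comp in zip(weights, components))
        for i, m in enumerate(mean_mesh)
    ]
```

The sketch requires at least as many observed coordinates as components and components that remain linearly independent at the marker positions; otherwise the system is singular.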

 

 

3. Physical Model Approach:

Our goal in this type of facial animation is to build a facial model that can be used in both speech perception and speech production research. By driving a realistic facial model, we learn about the neural control of speech production and how neural signals interact with the biomechanical and physiological characteristics of the articulators and the vocal tract. The model also allows us to manipulate physical parameters systematically and study their effect on speech perception.

Our facial model extends previous work on muscle-based models of facial animation (Lee, Terzopoulos, and Waters, 1993, 1995; Parke and Waters, 1996; Terzopoulos and Waters, 1993; Waters and Terzopoulos, 1991, 1992). The modeled face consists of a deformable multi-layered mesh with the following generic geometry: the nodes in the mesh are point masses connected by spring and damping elements (i.e., each segment connecting two nodes consists of a spring and a damper in parallel). The nodes are arranged in three layers representing the structure of facial tissues: the top layer represents the epidermis, the middle layer the fascia, and the bottom layer the skull surface. The elements between the top and middle layers represent the dermal-fatty tissues, and the elements between the middle and bottom layers represent the muscle. The skull nodes are fixed in three-dimensional space. The fascia nodes are connected to the skull layer except in the region around the upper and lower lips and the cheeks. The mesh is driven by modeling the activation and motion of several facial muscles in various facial expressions.
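The mesh mechanics described above (point masses joined by a spring and a damper in parallel, with some nodes held fixed) can be illustrated with a minimal Python sketch. The function name `step`, the edge tuple layout, and the semi-implicit Euler integrator are assumptions chosen for brevity; the integration scheme and parameter values of the actual model are not given here.

```python
import math


def step(positions, velocities, masses, edges, fixed, dt, external=None):
    """Advance a spring-damper mesh by one semi-implicit Euler step.

    positions, velocities: lists of [x, y, z] per node.
    edges: tuples (i, j, rest_length, stiffness, damping).
    fixed: set of node indices that never move (e.g. skull nodes).
    external: optional dict {node_index: [fx, fy, fz]} of applied forces.
    """
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    if external:
        for i, f in external.items():
            for a in range(3):
                forces[i][a] += f[a]
    for i, j, rest, stiffness, damping in edges:
        delta = [positions[j][a] - positions[i][a] for a in range(3)]
        length = math.sqrt(sum(d * d for d in delta))
        if length == 0.0:
            continue  # direction undefined for coincident nodes
        unit = [d / length for d in delta]
        # Rate at which the two nodes separate along the element.
        rel_speed = sum(
            (velocities[j][a] - velocities[i][a]) * unit[a] for a in range(3)
        )
        # Spring and damper act in parallel along the element.
        magnitude = stiffness * (length - rest) + damping * rel_speed
        for a in range(3):
            forces[i][a] += magnitude * unit[a]
            forces[j][a] -= magnitude * unit[a]
    new_pos, new_vel = [], []
    for i in range(n):
        if i in fixed:
            new_pos.append(list(positions[i]))
            new_vel.append([0.0, 0.0, 0.0])
            continue
        v = [velocities[i][a] + dt * forces[i][a] / masses[i] for a in range(3)]
        p = [positions[i][a] + dt * v[a] for a in range(3)]
        new_pos.append(p)
        new_vel.append(v)
    return new_pos, new_vel
```

Calling `step` repeatedly with a small `dt` lets a stretched element relax toward its rest length while the fixed nodes stay in place; muscle forces would enter through the `external` argument.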

Face layer diagram

The figure below shows the full face mesh. In this figure we have individualized the shape of the mesh by adapting it to a subject's morphology using data from a Cyberware scanner. This 3D laser rangefinder provides a range map, which is used to reproduce the subject's morphology, and a texture map (also shown below), which is used to simulate the subject's skin quality.

Full face mesh computerized face drawings

The red lines on the face mesh represent the lines of action of the modeled facial muscles. The lines of action, origins, insertions, and physiological cross-sectional areas are based on the anatomy literature and our measures of muscle geometry in cadavers. Our muscle model is a variant of the standard Hill model and includes dependence of force on muscle length and velocity.
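To make the dependence of force on muscle length and velocity concrete, here is a small Python sketch of a generic Hill-type force computation. It is not the lab's variant: the Gaussian force-length curve, the lengthening branch of the force-velocity curve, and the default values of `width`, `curvature` and `eccentric_max` are illustrative assumptions. Only the hyperbolic shortening branch follows Hill's classic relation.

```python
import math


def force_length(norm_length, width=0.45):
    """Active force-length factor: 1 at optimal length, falling off on both sides.

    The Gaussian shape and width are illustrative assumptions.
    """
    return math.exp(-((norm_length - 1.0) / width) ** 2)


def force_velocity(norm_shortening_velocity, curvature=0.25, eccentric_max=1.5):
    """Force-velocity factor; velocity is normalized by maximum shortening speed.

    Positive values mean shortening, negative values mean lengthening.
    """
    v = norm_shortening_velocity
    if v >= 1.0:
        return 0.0  # no force at or beyond maximum shortening velocity
    if v >= 0.0:
        # Hill's hyperbola for shortening muscle.
        return (1.0 - v) / (1.0 + v / curvature)
    # Lengthening: force rises above isometric and saturates (assumed shape).
    s = -v
    return 1.0 + (eccentric_max - 1.0) * s / (s + curvature)


def muscle_force(activation, max_force, length, optimal_length,
                 shortening_velocity, max_velocity):
    """Active force along the muscle's line of action."""
    a = min(max(activation, 0.0), 1.0)  # clamp activation to [0, 1]
    return (
        a
        * max_force
        * force_length(length / optimal_length)
        * force_velocity(shortening_velocity / max_velocity)
    )
```

In a model like the one described, `max_force` would typically scale with the physiological cross-sectional area of each muscle, and the resulting force would be applied to the mesh nodes along the muscle's line of action.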

At present, we can drive the model in two ways:

  1. by simulating the activation of several facial muscles during various facial gestures or
  2. by using processed electromyographic (EMG) recordings from a subject's actual facial muscles.

In the animation below you can watch the face model when it is driven by EMG recordings from the muscles around the mouth. The speaker is repeating the nonsense utterance /upae/. This animation of the lower face movements was produced using only the EMG recordings and thus several seconds of realistic animation were produced from previously recorded muscle activity.
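The processing applied to the EMG recordings is not detailed here. One common way to turn a raw EMG trace into a smooth activation signal is to remove the mean, full-wave rectify, smooth, and normalize, as in the Python sketch below. The function names, the trailing moving-average filter, and the `reference_max` normalization are assumptions for illustration only.

```python
def emg_envelope(samples, window):
    """Remove the mean, full-wave rectify, and smooth with a trailing moving average.

    samples: raw EMG values; window: number of samples to average over.
    """
    if window < 1:
        raise ValueError("window must be at least 1 sample")
    mean = sum(samples) / len(samples)
    rectified = [abs(s - mean) for s in samples]
    envelope = []
    running = 0.0
    for i, value in enumerate(rectified):
        running += value
        if i >= window:
            running -= rectified[i - window]  # drop the sample leaving the window
        envelope.append(running / min(i + 1, window))
    return envelope


def normalize(envelope, reference_max):
    """Scale the envelope to a 0-1 activation level, clipping at 1."""
    return [min(e / reference_max, 1.0) for e in envelope]
```

The normalized envelope could then serve as the activation input to a muscle model, one channel per recorded muscle.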

 

Audiovisual Speech Perception

Our work on audiovisual speech perception focuses on three aspects of face-to-face communication: the mechanisms underlying cross-modal integration, the eye movements of perceivers during audiovisual speech perception, and the visual information for speech.

Three human face images

  1. Studies of the visual information for speech. In these studies we focus on the analysis of the facial dynamics and what role they play in speech perception. This work involves detailed kinematic analysis of facial motion and psychophysics of face perception.
    • Lucero, J., Maciel, S., Johns, D., & Munhall, K.G. (2005). Empirical modeling of human face kinematics during speech using motion clustering. Journal of the Acoustical Society of America, 118, 405-409.
    • Munhall, K.G., Jones, J.A., Callan, D., Kuratate, T., & Vatikiotis-Bateson, E. (2004). Visual prosody and speech intelligibility: Head movement improves auditory speech perception. Psychological Science, 15, 133-137.
    • Munhall, K.G., Kroos, C., Jozan, G. & Vatikiotis-Bateson, E. (2004). Spatial frequency requirements for audiovisual speech perception. Perception and Psychophysics, 66, 574-583.
    • Campbell, R., Zihl, J., Massaro, D., Munhall, K., & Cohen, M. (1997). Speechreading in a patient with severe impairment in visual motion perception (Akinetopsia). Brain, 120, 1793-1803.
    • Munhall, K.G., Gribble, P., Sacco, L., & Ward, M. (1996). Temporal constraints on the McGurk Effect. Perception and Psychophysics, 58, 351-362.
  2. Eye movement of perceivers during audiovisual speech perception. In these studies we have examined the patterns of eye movements when subjects watch and listen to another person speak.
    • Buchan, J.N., Paré, M., & Munhall, K.G. (in press). Spatial statistics of gaze fixations during dynamic face processing. Social Neuroscience.
    • Paré, M., Richler, R., ten Hove, M., & Munhall, K.G. (2003). Gaze Behavior in Audiovisual Speech Perception: The Influence of Ocular Fixations on the McGurk Effect. Perception and Psychophysics, 65, 553-567.
    • Vatikiotis-Bateson, E., Eigsti, I.M., Yano, S., & Munhall, K. (1998). Eye movement of perceivers during audiovisual speech perception. Perception and Psychophysics, 60(6), 926-940.
  3. The mechanisms underlying cross-modal integration. To study how the perceptual system uses information from different sensory modalities, we make use of an audiovisual illusion called the McGurk Effect. The McGurk Effect (McGurk and MacDonald, 1976) occurs when conflicting consonant information is presented simultaneously to the visual and auditory modalities; the perceiver then reports a third, distinct consonant. In our studies, an audio /aba/ was dubbed onto a visual /aga/, with the resultant percept of /ada/. Our lab has manipulated timing and spatial variables within the McGurk paradigm.
    • Munhall, K.G. & Vatikiotis-Bateson, E. (2004). Spatial and temporal constraints on audiovisual speech perception. In G. Calvert, J. Spence, B. Stein (eds.) Handbook of Multisensory Processing. Cambridge, MA: MIT Press.
    • Callan, D., Jones, J.A., Munhall, K.G., Kroos, C., Callan, A. & Vatikiotis-Bateson, E. (2004). Multisensory-integration sites identified by perception of spatial wavelet filtered visual speech gesture information. Journal of Cognitive Neuroscience, 16, 805-816.
    • Munhall, K.G., Gribble, P., Sacco, L., & Ward, M. (1996). Temporal constraints on the McGurk Effect. Perception and Psychophysics, 58, 351-362.
    • Jones, J. A. & Munhall, K. G. (1997). The effects of separating auditory and visual sources on audiovisual integration of speech. Canadian Acoustics, 25(4), 13-19.
    • Munhall, K.G. & Tohkura, Y. (1998). Audiovisual gating and the time course of speech perception. Journal of the Acoustical Society of America, 104, 530-539.

Speech Motor Control

The goal of our speech motor control work is to identify organizing principles underlying speech coordination. To this end we study the kinematics of lip, tongue, jaw and vocal fold movements and the muscle activity involved in producing these movements.

Recently we have focused on how auditory feedback influences speech motor control. When you speak, the sound of your own voice influences your articulation. Our studies use custom signal processing techniques to manipulate this feedback in real time.

  • Purcell, D. & Munhall, K.G. (2006). Adaptive control of vowel formant frequency: Evidence from real-time formant manipulation. Journal of the Acoustical Society of America, 120, 966-977.
  • Purcell, D. & Munhall, K.G. (2006). Compensation following real-time manipulation of formants in isolated vowels. Journal of the Acoustical Society of America, 119, 2288-2297.
  • Jones, J.A., & Munhall, K.G. (2005). Remapping auditory-motor representations in voice production. Current Biology, 15, 1768-1772.
  • Jones, J.A. & Munhall, K.G. (2003). Learning to produce speech with an altered vocal tract: the role of auditory feedback. Journal of the Acoustical Society of America, 113, 532-543.
  • Jones, J. A. & Munhall, K. G. (2002). Adaptation of fundamental frequency production under conditions of altered auditory feedback. Journal of Phonetics, 30, 303-320.

fMRI image of profile of human head with coloured area depicting mouth and throat movement during speech