UCR Research on
Audiovisual Speech Perception

Research Program and Selected Projects

It is often assumed that speech perception is primarily an auditory process. However, it is now known that seeing the face of the speaker is important for the hearing impaired, for speech development, and when dealing with a complicated or degraded auditory signal. Audiovisual speech also provides a rich forum in which to study more general issues in Cognitive Science. Our research has explored the nature of the audiovisual speech integration process, as well as the form of visual speech information itself. Our approach places emphasis on understanding stimulus information and construing speech perception as continuous with perception of other natural events.

Many of our projects have made use of a phenomenon known as the McGurk effect. In the McGurk effect, visual speech syllable information is shown to combine with and even override discrepant auditory information, causing a perceiver to report hearing what he/she actually sees. The effect is striking in that it works even if the perceiver is told of the discrepancy, and it attests to the automaticity of audiovisual speech integration. One of our projects (Rosenblum and Saldaña, 1992) addressed the question of whether an integrated percept is as convincing as a percept which does not require integration. A discrimination methodology was implemented to test whether a 'visually-influenced' syllable sounds as phonetically compelling as an audiovisual compatible syllable. We found that although identification of an integrated syllable is as consistent as for a compatible syllable, it is not as robust in a discrimination context.

Another important and long-argued issue in the speech literature concerns whether the perceptual primitives of speech are auditory or gestural in nature. We examined this question by using the discrepant speech effect and found that the perceived loudness of speech can be influenced by visual information for physiological effort (Rosenblum and Fowler, 1991). These findings add support to the view that the primitives for speech perception are gestural. A follow-up experiment found analogous results for nonspeech perception, leading to the project discussed next.

Another long-debated question in the cognitive literature is whether speech is processed in a way different from nonspeech sounds. The issue of anatomical and behavioral specialization (or modularity) has been central to cognitive science since the early 1980s. While there is evidence that separate brain mechanisms are used for speech processing, most research demonstrating behavioral differences tests nonspeech stimuli that are synthetic and unrecognizable. For our experiments (Saldaña and Rosenblum, 1993), we tested speech and natural nonspeech stimuli in the discrepant speech paradigm. We found that as long as the auditory and visual information had a lawful relation to the specified event (rather than one of convention: e.g., text), then a visual influence ensued. These results (along with the loudness perception effects mentioned above) support the emerging conclusion that speech is not processed in a qualitatively different way from nonspeech. The issue of speech vs. nonspeech perception, in the context of Motor Theory and Ecological approaches to speech, is discussed in a book chapter (Fowler and Rosenblum, 1991).

Related to the issue of modularity is the phenomenon of cognitive impenetrability. Many perceptual effects occur without conscious awareness of the perceptual primitives or processes involved. It has been unclear whether speech phenomena fall into this category. To examine this question, we implemented the discrepant speech methodology to test whether speech adaptation effects have an auditory or phonetic basis (Saldaña and Rosenblum, 1994). We observed that adaptation occurred based on the auditory stimulus component rather than the percept itself, suggesting cognitive impenetrability of the audiovisual speech process.

We have also used the discrepant speech methodology to test whether audiovisual integration can occur with visual images not recognizable as a face (Rosenblum and Saldaña, 1996). We found that visual influences occurred with these images even when observers were never aware that they were seeing a face. These results again attest to the impenetrability of the integration process and have implications for our understanding of visual speech information, as will be discussed below. A discussion of how our research generally bears on the questions of modularity and cognitive impenetrability can be found in a published commentary (Rosenblum, 1994).

Our research has also explored the nature of visual and audiovisual speech information itself. The goals of this work are to determine the informational metric for audiovisual speech perception and to help uncover the information used for lipreading. To these ends, a point-light visual display technique has been developed which allows for efficient analyses of articulatory movements. Since point-light displays do not involve standard facial features, these stimuli also help test the relative salience of kinematic (temporally-extended) vs. 'pictorial' visual speech information. Thus far, we have shown that these stimuli can specify a number of consonantal and vowel segments (Johnson, Rosenblum, and Mahmood, in preparation); enhance auditory speech embedded in noise (Rosenblum, Johnson, and Saldaña, 1996); and integrate well with discrepant auditory speech (Rosenblum and Saldaña, 1996). This last project also revealed that static images of an articulating face do not integrate with auditory speech, supporting the salience of kinematic speech information.

Point-light speech lends itself well to determining the salient visual information for lipreading. This is important for designing lipreading training programs as well as speech recognition computer systems which are beginning to make use of visual speech input. Additionally, point-light images can be easily incorporated into telecommunication systems for the deaf. With regard to theoretical implications, our point-light speech effects are relevant to the issue of amodal speech information. As suggested by the ecological approach to perception, the appropriate informational description is such that it can be instantiated in any modality (e.g., visual, auditory). If amodal information exists, then the responsibility of the internal integration system is lessened, and the neurophysiological mechanism ultimately responsible for speech recovery need only be sensitive to event-related primitives. (A similar explanation has been proposed for processes of object localization.) Recent research in auditory speech perception has shown that time-bound, dynamic dimensions (vs. more discrete 'cues') are most informative. Our point-light research suggests that the analogous case exists for visual speech information. If dynamic information turns out to be most salient for both modalities, then the possibility of amodal speech information is tenable. These issues are discussed in a book chapter (Rosenblum and Saldaña, in press).

We have also begun a series of projects testing the extent to which pre-linguistic infants use and integrate audiovisual speech information. One of these projects (Rosenblum, Schmuckler, & Johnson, 1997) used a gaze-habituation paradigm to show that infants display the McGurk effect. This suggests that infants might integrate speech in a manner similar to adults. We have also been working on a series of projects to determine the degree to which face recognition processes are involved in extracting visual speech information. One of these projects (Rosenblum, Yakel, & Green, under review) found that a specific face image manipulation known to disrupt face perception (the 'Margaret Thatcher effect') also disrupts visual and audiovisual speech perception. Another project (Yakel & Rosenblum, 1996) found that isolated time-varying (point-light) speech information can be used for recognizing faces, which could suggest that similar primitives can be used for both speech and speaker identification purposes.

Finally, we have initiated a project to examine the neurophysiological basis of visual speech perception and whether previous findings concerning speechreading and laterality are based in speech specialization or general kinematic processing (Johnson & Rosenblum, 1996). We have found initial evidence that the previously reported left-hemisphere advantage for visual speech perception might be more related to the dynamic nature of the stimuli than to the language-related task.

Relevant References

Fowler, C.A. and Rosenblum, L.D. (1991). Perception of the phonetic gesture. In I.G. Mattingly and M. Studdert-Kennedy (Eds.), Modularity and the Motor Theory. Hillsdale, NJ: Lawrence Erlbaum.

Rosenblum, L.D. and Fowler, C.A. (1991). Audio-visual investigation of the loudness-effort effect for speech and nonspeech stimuli. Journal of Experimental Psychology: Human Perception and Performance. 17 (4), 976-985.

Rosenblum, L.D., and Saldaña, H.M. (1992). Discrimination tests of visually-influenced syllables. Perception and Psychophysics. 52 (4), 461-473.

Saldaña, H.M. and Rosenblum, L.D. (1993). Visual influences on auditory pluck and bow judgments. Perception and Psychophysics. 54 (3), 406-416.

Rosenblum, L.D. (1994). How special is audiovisual speech integration? Current Psychology of Cognition. 13(1), 110-116.

Saldaña, H.M. and Rosenblum, L.D. (1994). Selective adaptation in speech perception using a compelling audiovisual adaptor. Journal of the Acoustical Society of America. 95(6), 3658-3661.

Yakel, D.A., Rosenblum, L.D., Green, K.P., Bosley, C.L. & Vasquez, R.A. (1995). The effect of face and lip inversion on audiovisual speech integration. Journal of the Acoustical Society of America, 97(5), 3286.

Rosenblum, L.D. and Saldaña, H.M. (1996). An audiovisual test of kinematic primitives for visual speech perception. Journal of Experimental Psychology: Human Perception and Performance. 22(2), 318-331.

Yakel, D.A. & Rosenblum, L.D. (1996). Face identification using visual speech information. Poster presented at the 132nd meeting of the Acoustical Society of America, Honolulu, HI, December, 2-6.

Johnson, J.A. & Rosenblum, L.D. (1996). Hemispheric differences in perceiving and integrating dynamic visual speech information. Poster presented at the 132nd meeting of the Acoustical Society of America, Honolulu, HI, December, 2-6.

Rosenblum, L.D., Johnson, J. A., and Saldaña, H.M. (1996). Visual kinematic information for embellishing speech in noise. Journal of Speech and Hearing Research 39(6), 1159-1170.

Rosenblum, L.D., Schmuckler, M.A., & Johnson, J.A. (1997). The McGurk effect in infants. Perception & Psychophysics, 59 (3), 347-357.

Rosenblum, L.D. & Saldaña, H.M. (in press). Time-varying information for visual speech perception. To appear in R. Campbell, B. Dodd, D. Burnham (Eds.), Hearing by Eye: Part 2, The Psychology of Speechreading and Audiovisual Speech. Erlbaum: Hillsdale, NJ.

Johnson, J. A., Rosenblum, L.D. & Mahmood, C. (in preparation). Kinematic features for visual speech perception. To be submitted to Journal of Phonetics.

Abstracts

Fowler, C.A. and Rosenblum, L.D. (1991). Perception of the phonetic gesture. In I.G. Mattingly and M. Studdert-Kennedy (Eds.), Modularity and the Motor Theory. Hillsdale, NJ: Lawrence Erlbaum.
Evidence in the literature on speech perception shows very clearly that listeners recover a talker's phonetic gestures from the acoustic speech signal. Throughout most of its history, the Motor Theory has been the only theory to confront this evidence and to provide an explanation for it. Specifically, the Motor Theory proposes that the dimensions of a listener's percept of a speech utterance conform more closely to those of a talker's phonetic gestures than to those of the acoustic speech signal, because, according to the theory, listeners access their speech motor systems in perception. Accordingly, their experience hearing a speech utterance conforms to processes that their own motor systems would engage in to produce a signal like the one they are perceiving. If the Motor Theory is correct, speech perception is quite unlike general auditory perception where access to a motor system could not be involved and where motor theorists claim perception is "homomorphic", conforming to dimensions of the acoustic signal directly. Recently, motor theorists have suggested that speech perception is "modular". A major source of evidence in favor of a distinct speech module is the phenomenon of duplex perception. We offer some challenges to the Motor Theory. First, we suggest an alternative explanation for listeners' recovery of phonetic gestures. In particular, we suggest that phonetic gestures are "distal events", and that perception involves recovery of distal events from proximal stimulation. This holds for perception of other acoustic stimuli and for visual stimuli where access to the motor system cannot be invoked to explain distal-event perception. Accordingly, we suggest that speech perception is not special in its recovery of phonetic gestures; all perception is "heteromorphic", not homomorphic. We also provide evidence that the phenomenon of duplex perception does not reveal modularity of speech perception (although speech perception may yet be modular). Nonspeech sounds, not remotely likely to be perceived by a specialized module, may be perceived duplexly.

Rosenblum, L.D. and Fowler, C.A. (1991). Audio-visual investigation of the loudness-effort effect for speech and nonspeech stimuli. Journal of Experimental Psychology: Human Perception and Performance. 17 (4), 976-985.
There is some evidence that loudness judgments of speech are more closely related to the degree of vocal effort induced in speech production than to the speech signal's surface acoustic properties such as intensity (Lehiste and Peterson, 1959). Other researchers have claimed that speech loudness can be rationalized by simply considering the acoustic complexity of the signal (Glave and Reitveld, 1975). Since vocal effort can be specified optically as well as acoustically, a study to test the effort-loudness hypothesis was conducted which used conflicting audio-visual presentations of a speaker producing consonant-vowel syllables with different efforts. The prediction was made that if loudness judgments are constrained by effort perception rather than by simple acoustic parameters, then judgments should be affected by visual as well as auditory information. It is shown that loudness judgments are affected significantly by visual information even when subjects are instructed to base their judgments on only what they hear. Moreover, a similar (although less pronounced) patterning of results is shown for a nonspeech 'clapping' event, attesting to the generality of the loudness-effort effect previously thought to be special to speech. Results are discussed in terms of auditory, fuzzy logical, motor, and ecological theories of speech perception.

Rosenblum, L.D., and Saldaña, H.M. (1992). Discrimination tests of visually-influenced syllables. Perception and Psychophysics. 52 (4), 461-473.
In the McGurk effect, perception of audio-visual discrepant syllables can depend on auditory, visual, or a combination of audio-visual information. Under some conditions, visual information can override auditory information to the extent that identification judgments of a visually-influenced syllable can be as consistent as for an analogous audio-visual compatible syllable. This might indicate that visually-influenced and analogous audio-visual compatible syllables are phonetically equivalent. Experiments were designed to test this issue using a compelling visually-influenced syllable in an AXB matching paradigm. Subjects were asked to match an audio syllable /va/ to either an audio-visual "consistent" syllable (audio /va/ - video /fa/) or audio-visual discrepant syllable (audio /ba/ - video /fa/). It was hypothesized that if the two audio-visual syllables were phonetically equivalent, then subjects should choose them equally often in the matching task. Results show, however, that subjects are more likely to match the audio /va/ to the audio-visual consistent /va/, suggesting differences in phonetic convincingness. Additional experiments further suggest that this preference is not based on a "phonetically-extraneous" dimension or on relative noticeable audio-visual discrepancies.

Saldaña, H.M. and Rosenblum, L.D. (1993). Visual influences on auditory pluck and bow judgments. Perception and Psychophysics. 54 (3), 406-416.
In the McGurk effect, visual information specifying a speaker's articulatory movements can influence auditory judgments of speech. In the present study, an analogue of the McGurk effect was attempted with non-speech stimuli using discrepant audiovisual tokens of plucks and bows on a cello. Results of an initial experiment revealed that subjects' auditory judgments were influenced significantly by the visual pluck and bow stimuli. However, a second experiment using speech syllables demonstrated that the visual influence on consonants was significantly greater than the visual influence observed for pluck-bow stimuli. This result could be interpreted to suggest that the nonspeech visual influence was not a true McGurk effect. In a third experiment, visual stimuli consisting of the words "Pluck" and "Bow" were found to have no influence over auditory pluck and bow judgments. This result could suggest that the nonspeech effects found in Experiment 1 were based on the audio and visual information having an (ostensive) lawful relation to the specified event. These results are discussed in terms of motor theory, ecological, and FLMP approaches to speech perception.

Saldaña, H.M. and Rosenblum, L.D. (1994). Selective adaptation in speech perception using a compelling audiovisual adaptor. Journal of the Acoustical Society of America. 95(6), 3658-3661.
A replication of the audiovisual test of speech selective adaptation performed by Roberts and Summerfield [Perception & Psychophysics, 30, 309-314 (1981)] was conducted. The audiovisual methodology allows for the dissociation of acoustic and phonetic components of an adapting stimulus. Roberts & Summerfield's (1981) results have been interpreted to support an auditory basis for selective adaptation. However, their subjects did not consistently report hearing the adaptor as a visually influenced syllable, making this interpretation questionable. In the present experiment, a more compelling audiovisual adaptor was implemented, resulting in a visually influenced percept 99% of the time. Still, systematic adaptation occurred only for the auditory component.

Rosenblum, L.D. and Saldaña, H.M. (1996). An audiovisual test of kinematic primitives for visual speech perception. Journal of Experimental Psychology: Human Perception and Performance. 22(2), 318-331.
Isolated kinematic properties of visible speech can provide information for lipreading. Kinematic facial information is isolated by darkening an actor's face and attaching dots to various articulators so that only moving dots can be seen with no facial features present. To test the salience of these images, experiments were conducted to determine whether they could visually influence the perception of discrepant auditory syllables. Results showed that these images can influence auditory speech and that this influence is not dependent on subjects' knowledge of the stimuli. In other experiments, single frozen frames of visible syllables were presented with discrepant auditory syllables to test the salience of static facial features. Results suggest that while the influence of the kinematic stimuli was perceptual, any influence of the static featural stimuli was likely based on subject misunderstanding or post-perceptual response bias.

Rosenblum, L.D., Johnson, J. A., and Saldaña, H.M. (1996). Visual kinematic information for embellishing speech in noise. Journal of Speech and Hearing Research 39(6), 1159-1170.
Seeing a talker's face can improve the perception of speech in noise. There is little known about which characteristics of the face are useful for enhancing the degraded signal. In this study, a point-light technique was employed to help isolate the salient kinematic aspects of a visible articulating face. In this technique, fluorescent dots were arranged on the lips, teeth, tongue, cheeks, and jaw of an actor. The actor was videotaped speaking in the dark, so that when shown to observers, only the moving dots were seen. To test whether these reduced images could contribute to the perception of degraded speech, noise-embedded sentences were dubbed with the point-light images at various signal-to-noise ratios. It was found that these images could significantly improve comprehension for adults with normal hearing, and that the images became more effective as participants gained experience with the stimuli. These results have implications for uncovering salient visual speech information as well as the development of telecommunication systems for listeners who are hearing-impaired.

Rosenblum, L.D., Schmuckler, M.A., & Johnson, J.A. (1997). The McGurk effect in infants. Perception & Psychophysics, 59 (3), 347-357.
In the McGurk effect, perceptual identification of auditory speech syllables is influenced by simultaneous presentation of discrepant visible speech syllables. While this effect has been shown in subjects of different ages and various native language backgrounds, no McGurk tests have been conducted with pre-linguistic infants. A series of experiments tested for the McGurk effect in 5-month-old English-exposed infants. Infants were first gaze-habituated to an audiovisual /va/. They were then presented two different dishabituation stimuli: audio /ba/-visual /va/ (perceived by adults as /va/); and audio /da/-visual /va/ (perceived by adults as /da/). The infants showed generalization from the audiovisual /va/ to the audio /ba/-visual /va/ stimulus but not to the audio /da/-visual /va/ stimulus. Follow-up experiments revealed that these generalization differences were not due to either a general preference for the audio /da/-visual /va/ stimulus or to the auditory similarity of /ba/ to /va/ relative to /da/. These results suggest that the infants were visually influenced in the same way as English-speaking adults.

Rosenblum, L.D. & Saldaña, H.M. (1998). Time-varying information for visual speech perception. In R. Campbell, B. Dodd, D. Burnham (Eds.), Hearing by Eye: Part 2, The Psychology of Speechreading and Audiovisual Speech. Erlbaum: Hillsdale, NJ.
In recent years, a number of theories of speechreading and audiovisual speech integration have been developed. While many of these accounts give an adequate description of the processes used for visual speech recognition and integration, they often take as their starting point abstract descriptions of the information available to the system. We propose that the form of the information utilized by the system largely constrains the processes and should therefore be an important part of any theory of visual speech perception. We acknowledge the difficulty of identifying the metric of the source information and therefore set out to differentiate the type of information available in very broad strokes. At an elementary level, the information for visual speech perception can be described in two distinct ways. Information can be described as time-independent (static; pictorial) or time-varying (kinematic; dynamic). In this chapter we will outline the evidence for both descriptions.

Rosenblum, L.D., Yakel, D.A., & Green, K.P. (under review). Face and mouth inversion effects on visual and audiovisual speech perception. Under review at Journal of Experimental Psychology: Human Perception & Performance.
Three experiments examined whether image manipulations known to disrupt face perception also disrupt visual speech perception. It is substantially more difficult to recognize a face when it is inverted. Research has also shown that an upright face with an inverted mouth looks strikingly grotesque, while an inverted face and an inverted face containing an upright mouth are perceived as looking relatively normal. The current study examined whether a similar sensitivity to an upright facial context plays a role in the visual perception of speech. Visual and audiovisual syllable identification tasks were tested under four presentation conditions: upright face - upright mouth; inverted face - inverted mouth; inverted face - upright mouth; upright face - inverted mouth. Results revealed that face inversion disrupted visual and audiovisual identifications for some visual syllables. For others, only the upright face - inverted mouth image disrupted identification. These results suggest that for at least some visual segments, an upright facial context can play a role in visual speech perception. A follow-up experiment testing upright and inverted isolated mouths supported this conclusion.

Yakel, D.A., Rosenblum, L.D., Green, K.P., Bosley, C.L. & Vasquez, R.A. (1995). The effect of face and lip inversion on audiovisual speech integration. Journal of the Acoustical Society of America, 97(5), 3286.
Seeing a speaking face can influence observers' auditory perception of syllables [McGurk and MacDonald, Nature, 264, 746-748 (1976)]. This effect decreases when the speaker's face is inverted [e.g., Green, J. Acoust. Soc. Am., 95, 3014, 1994]. Face recognition is also inhibited with inverted faces [e.g., Rock, Sci. Amer., 230, 78-85 (1974)], suggesting a similar underlying process. To further explore the link between face and audiovisual speech perception, a speech experiment was designed to replicate another face perception effect. In this effect, an inverted face and an inverted face containing upright lips are perceived as looking normal, but an upright face with inverted lips looks grotesque [Thompson, Perception, 9, 438-484, (1980)]. An audiovisual speech experiment tested four presentation conditions: upright face - upright mouth, upright face - inverted mouth, inverted face - inverted mouth, inverted face - upright mouth. Various discrepant audio-visual syllables were tested in each condition. For some of the syllable combinations, visual influences occurred in all but the upright face - inverted mouth condition, thereby mimicking the face perception effect. However, other syllable combinations revealed visual influences in all four conditions. Results are interpreted in terms of articulatory dynamics and the vertical symmetry of the visual stimuli.

Yakel, D.A. & Rosenblum, L.D. (1996). Face identification using visual speech information. Poster presented at the 132nd meeting of the Acoustical Society of America, Honolulu, HI, December, 2-6.
Traditionally, the recovery of linguistic message and speaker identity are thought to involve distinct operations and information. However, recent observations with auditory speech show a contingency of speech perception on speaker identification/familiarity [e.g., L. C. Nygaard, M. S. Sommers, & D. B. Pisoni, Psych. Sci., 5, 42-46. (1994)]. Remez and his colleagues [R. E. Remez, J. M. Fellowes, & P. E. Rubin, J. Exp. Psy. (in press)] have provided evidence that these contingencies could be based on the use of common phonetic information for both operations. In order to examine whether common information might also be useful for face and visual speech recovery, point-light visual speech stimuli were implemented which provide phonetic information without containing facial features [L.D. Rosenblum & H.M. Saldaña, J. Exp. Psy.: Hum. Perc. & Perf. 22, 318-331(1996)]. A 2AFC procedure was used to determine if observers could match speaking point-light faces to the same fully-illuminated speaking face. Results revealed that dynamic point-light displays afforded high face matching accuracy which was significantly greater than accuracy with frozen point-light displays. These results suggest that dynamic speech information can be used for both visual speech and face recognition.

Johnson, J.A. & Rosenblum, L.D. (1996). Hemispheric differences in perceiving and integrating dynamic visual speech information. Poster presented at the 132nd meeting of the Acoustical Society of America, Honolulu, HI, December, 2-6.
There is evidence for a left-visual-field/right-hemisphere (LVF/RH) advantage for speechreading static faces [R. Campbell, Brain & Cog. 5, 1-21 (1986)] and a right-visual-field/left-hemisphere (RVF/LH) advantage for speechreading dynamic faces [P. M. Smeele, NATO ASI Workshop (1995)]. However, there is also evidence for a LVF/RH advantage when integrating dynamic visual speech with auditory speech [e.g., E. Diesch, Quart. J. Exp. Psy.: Human Exp. Psy. 48, 320-333 (1995)]. To test relative hemispheric differences and the role of dynamic information, static, dynamic, and point-light visual speech stimuli were implemented for both speechreading and audiovisual integration tasks. Point-light stimuli are thought to retain only dynamic visual speech information [L.D. Rosenblum & H.M. Saldaña, J. Exp. Psy.: Hum. Perc. & Perf. 22, 318-331 (1996)]. For both the speechreading and audiovisual integration tasks, a LVF/RH advantage was observed for the static stimuli, and a RVF/LH advantage was found for the dynamic and point-light stimuli. In addition, the relative RVF/LH advantage was greater with the point-light stimuli, implicating greater relative LH involvement for dynamic speech information.
