
Modeling Auditory Scene Analysis by multidimensional statistical filtering: Azimuthal localization of concurrent speakers using a binaural auditory model


Volker Hohmann

Medical Physics, University of Oldenburg, Germany


'Auditory Scene Analysis' (ASA) denotes the ability of the human auditory system to decode information on sound sources from a superposition of sounds in an extremely robust way. ASA is closely related to the 'Cocktail-Party-Effect' (CPE), i.e., the ability of a listener to perceive speech in adverse conditions at low signal-to-noise ratios. This contribution discusses theoretical and empirical evidence suggesting that robustness of sound source decoding is partly achieved by exploiting redundancies that are present in the source signals.

Redundancies reflect the restricted spectro-temporal and spatial dynamics of real source signals, e.g., of a speech signal produced by a human moving in a room, and limit the number of possible states and state transitions of the sound source. To exploit these redundancies, prior knowledge about the characteristics of a sound source needs to be represented in the decoder/classifier ('expectation-driven processing').

In a proof-of-concept approach, novel multidimensional statistical filtering algorithms have been shown to successfully incorporate prior knowledge about the characteristics of speech and to estimate the dynamics of a speech source from a superposition of speech sounds [1]. Extending this approach, a computational model is presented that localizes concurrent speakers from a binaural signal using a binaural auditory model [2].
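The idea of constraining a decoder with prior knowledge of source dynamics can be illustrated with a generic bootstrap particle filter. The following is a minimal sketch only, not the algorithm of [1] or the model of [2]: it tracks a single azimuth rather than concurrent speakers, and the function name `track_azimuth`, the Gaussian motion and observation models, the frontal range of -90 to 90 degrees, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def track_azimuth(observations, n_particles=500, step_sd=2.0, obs_sd=10.0, seed=0):
    """Bootstrap particle filter for one slowly moving source azimuth (degrees).

    Hypothetical illustration: all models and parameters are assumptions.
    """
    rng = random.Random(seed)
    # Initial belief: the source may lie anywhere in the assumed frontal range.
    particles = [rng.uniform(-90.0, 90.0) for _ in range(n_particles)]
    estimates = []
    for z in observations:
        # Prior knowledge: a real source moves slowly, so each hypothesis
        # may change only by a small random step between frames.
        particles = [
            min(90.0, max(-90.0, p + rng.gauss(0.0, step_sd))) for p in particles
        ]
        # Weight each hypothesis by how well it explains the noisy observation.
        # The tiny constant keeps the total weight strictly positive.
        weights = [
            math.exp(-0.5 * ((z - p) / obs_sd) ** 2) + 1e-300 for p in particles
        ]
        # Resample in proportion to the weights, then average for the estimate.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        estimates.append(sum(particles) / n_particles)
    return estimates


# Synthetic example: a source drifting from -30 degrees at 0.5 degrees per frame,
# observed through assumed Gaussian noise with a 10 degree standard deviation.
noise_rng = random.Random(1)
true_path = [-30.0 + 0.5 * t for t in range(100)]
noisy = [a + noise_rng.gauss(0.0, 10.0) for a in true_path]
est = track_azimuth(noisy)
```

The small `step_sd` plays the role of the redundancy described above: it rules out implausible jumps in source position, so individual noisy observations are smoothed rather than followed directly.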


[1] Nix, J. and Hohmann, V. (2007). "Combined estimation of spectral envelopes and sound source direction of concurrent voices by multidimensional statistical filtering." IEEE Trans. Audio, Speech and Lang. Proc. 15(3): 995-1008.


[2] Dietz, M., Ewert, S., Hohmann, V., Kollmeier, B. (2008). "Coding of temporally fluctuating interaural timing disparities in a binaural processing model based on phase differences." Brain Research 1220: 233-244.