
Graphical Models for Single-channel Multi-talker Speech Recognition

John Hershey



One of the hallmarks of human perception is our ability to solve the auditory cocktail party problem: we can direct our attention to a given speaker in the presence of interfering speech and understand what was said remarkably well, even when restricted to a single acoustic channel. The same cannot be said for conventional automatic speech recognition systems, for which interfering speech is extremely detrimental to performance. Model-based speech separation approaches, however, have recently succeeded in accurately transcribing overlapping speakers, in some cases better than human listeners. There is a caveat: the best-performing algorithms have computational complexity that scales exponentially with the number of speakers. This talk reviews the modeling problem and the approximate inference methods that rein in the combinatorial explosion. We introduce variational approximations that induce a set of probabilistic masking functions over the spectrum. These masking functions enable us to restrict the set of interactions between the sources that have to be considered, dramatically reducing the complexity of inference with minimal impact on accuracy. In a restricted-domain test, the resulting system can separate as many as four or five sources, solving a problem that was previously intractable and is difficult, if not impossible, for human listeners. Demos will be played: see how your auditory system measures up on this task.
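To give a flavor of what a probabilistic masking function looks like, the following is a minimal illustrative sketch, not the method presented in the talk. It assumes two speakers modeled by per-frequency Gaussians over log-spectra, and the common "max-interaction" approximation (the mixture log-power in each bin is roughly the maximum of the source log-powers). Under those assumptions, the posterior probability that speaker 1 dominates a given bin acts as a soft spectral mask; all names and parameters here are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def soft_mask(mix_logspec, mu1, var1, mu2, var2):
    """Posterior probability that source 1 dominates each spectral bin.

    Assumes Gaussian log-spectral models for the two sources and the
    max-interaction approximation: observed bin y ~ max(x1, x2).
    """
    s1, s2 = np.sqrt(var1), np.sqrt(var2)
    # Source 1 dominates: it sits at the observed level (pdf) while
    # source 2 lies somewhere below it (cdf), and vice versa.
    p1 = norm.pdf(mix_logspec, mu1, s1) * norm.cdf(mix_logspec, mu2, s2)
    p2 = norm.pdf(mix_logspec, mu2, s2) * norm.cdf(mix_logspec, mu1, s1)
    return p1 / (p1 + p2 + 1e-300)  # tiny constant guards division by zero

# Toy example: one bin where speaker 1's model expects high energy
# (mean 1) and speaker 2's expects low energy (mean -1).
mask = soft_mask(np.array([1.0]), mu1=1.0, var1=1.0, mu2=-1.0, var2=1.0)
print(mask)  # close to 1: the bin is assigned mostly to speaker 1
```

Bins where the mask is near 0 or 1 can then be treated as belonging to a single speaker, which is the kind of restriction on source interactions that keeps inference from growing exponentially in the number of speakers.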