Machine Perception Research

My research programme encompasses both machine vision and machine hearing. Taking a broad view of these two disciplines is extremely valuable, since approaches and techniques used in one field can often be applied in the other. For example, machine vision is a larger and better-developed field than machine hearing, so new techniques often appear there first. Examples include sparse coding models and deep belief networks, both of which have subsequently been applied to audio data. Conversely, much of the work on audio signals has concentrated on sophisticated time-series models. Since time-series models for video are relatively under-developed, there are many opportunities to extend these audio models to video.

One decomposition of signals has proved particularly useful in a diverse range of applications, including speech recognition, cochlear implants for the deaf, visual object recognition, and the analysis of brain recordings. This decomposition is shown below:


Here the incoming signal (which may be audio, video, or brain recording data such as EEG) is passed through a set of filters with different properties. In the case of audio data, the filters typically have different centre frequencies and bandwidths. Next, each filter output is split into the product of a slowly varying modulator (also called an envelope) and a quickly varying carrier. Intuitively, the modulator captures the local energy in the filter output, whilst the carrier captures the rapid variation over time. The carriers and modulators are then used to perform various tasks. Cochlear implants and speech recognition use just the modulators. EEG analysis usually focusses on the carrier, or its phase. Other applications may use both sorts of information.
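As a rough illustration, the classical form of this decomposition can be sketched with a band-pass filter bank followed by Hilbert-transform demodulation. This is a minimal sketch under stated assumptions, not the method developed in this research: the function name `demodulate`, the Butterworth filters, the band edges, and the synthetic test tone are all hypothetical choices made for the example.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert


def demodulate(signal, fs, bands):
    """Split each band-passed channel into a modulator and a carrier.

    `bands` is a list of (low_hz, high_hz) pass-bands (an illustrative choice).
    Returns two arrays of shape (len(bands), len(signal)).
    """
    modulators, carriers = [], []
    for low, high in bands:
        # Zero-phase Butterworth band-pass filter for this channel.
        sos = butter(4, [low, high], btype="bandpass", fs=fs, output="sos")
        channel = sosfiltfilt(sos, signal)
        # The analytic signal gives amplitude (envelope) and phase.
        analytic = hilbert(channel)
        modulators.append(np.abs(analytic))            # slowly varying envelope
        carriers.append(np.cos(np.angle(analytic)))    # unit-amplitude carrier
    return np.array(modulators), np.array(carriers)


# Synthetic example: a 1 kHz tone whose amplitude varies at 4 Hz.
fs = 8000
t = np.arange(0, 1.0, 1 / fs)
true_envelope = 1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)
signal = true_envelope * np.cos(2 * np.pi * 1000 * t)

bands = [(300, 600), (800, 1200), (2000, 3000)]
modulators, carriers = demodulate(signal, fs, bands)
```

In this sketch the product of each modulator and carrier reproduces the corresponding filter output, and the modulator of the middle band tracks the 4 Hz amplitude variation. Nothing here models noise or adapts the filters to the signal.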

Despite the utility of this representation, it is limited. In particular, classical methods for performing the decomposition are not robust to noise, nor are they able to adapt to the signal. In our work, we have developed a probabilistic version of this decomposition, which views the estimation of the modulators and carriers as an ill-posed inference problem. The new method is robust because it explicitly models noise and represents uncertainty. It is also adaptive, using machine-learning techniques to optimise the filters and envelopes according to the variations in energy in the incoming signal.
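One way to picture such a probabilistic formulation is as a generative model in which the observed signal is a noisy sum of modulator-carrier products. The form below is only an illustrative assumption, not necessarily the model used in this work; the symbols $y_t$, $a_{d,t}$, $c_{d,t}$, $\sigma_y$ and the Gaussian noise are introduced here purely for the sketch.

```latex
% Illustrative assumption: D channels, each the product of a positive,
% slowly varying modulator a_{d,t} and a quickly varying carrier c_{d,t},
% observed through additive Gaussian noise.
y_t = \sum_{d=1}^{D} a_{d,t}\, c_{d,t} + \sigma_y \epsilon_t,
\qquad \epsilon_t \sim \mathcal{N}(0, 1),
\qquad a_{d,t} \ge 0 .
```

Under a model of this kind, the modulators and carriers are inferred from $y_t$ together with their uncertainty. Because many different pairs explain the same signal, prior assumptions (slow modulators, band-limited carriers) are what make the problem solvable.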

Interestingly, special cases of this model connect to existing signal processing methods. For example, when the modulators are fixed, estimation of the carriers is precisely equivalent to standard time-frequency representations from signal processing.


Moreover, in the extension to video, when the modulators are fixed and the filters are estimated, we recover the classical slow feature analysis algorithm from computer vision.
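For readers unfamiliar with slow feature analysis, the classical linear algorithm can be sketched as follows. This shows only the standard textbook procedure (whiten the data, then find the directions whose temporal differences have the least variance), not the probabilistic derivation mentioned above. The function name `linear_sfa` and the toy two-channel signal are hypothetical, chosen for illustration.

```python
import numpy as np


def linear_sfa(x, n_features):
    """Classical linear slow feature analysis.

    x: array of shape (T, D), one multivariate observation per time step.
    Returns (weights, features) with shapes (D, n_features), (T, n_features).
    """
    x = x - x.mean(axis=0)
    # Whiten the data so every projection has unit variance.
    eigval, eigvec = np.linalg.eigh(np.cov(x, rowvar=False))
    whitener = eigvec / np.sqrt(eigval)
    z = x @ whitener
    # Directions with the smallest variance of temporal differences are slowest.
    dcov = np.cov(np.diff(z, axis=0), rowvar=False)
    _, dvec = np.linalg.eigh(dcov)  # eigenvalues in ascending order
    weights = whitener @ dvec[:, :n_features]
    return weights, x @ weights


# Toy example: a slow and a fast sinusoid, linearly mixed into two channels.
t = np.linspace(0, 2 * np.pi, 2000)
slow = np.sin(t)
fast = np.sin(40 * t)
x = np.column_stack([slow + fast, slow - 0.5 * fast])

weights, features = linear_sfa(x, n_features=1)
```

In this toy case the extracted feature matches the slow sinusoid up to sign and scale, since the sign of an eigenvector is arbitrary.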


These deep theoretical connections between probabilistic models from machine learning and classical algorithms for computer hearing and computer vision build bridges between the three disciplines. These bridges allow techniques to be transferred, combined, and generalised, with great practical advantages. One of my major research themes is the unification of machine learning, computer vision and computer hearing.




