Sound Archive

This page contains a number of audio demonstrations for my thesis (Statistical Models for Natural Sounds) and for various papers and talks that I have given.

To give you a feel for the methods developed in this thesis, please listen to this introductory example.

This example is made from a collection of sounds recorded during a camping trip. A person starts by a campfire and then walks past a stream to their tent. The wind starts to howl, and the person reaches their tent just before it begins to rain at the end of the clip.

Remarkably, all the sounds in this clip are synthetic (except for the sound of the closing zip). They were produced from a single 'generative model', which is trained on natural sounds and learns to produce natural-sounding versions of them. The model is particularly good at synthesising auditory textures like fire, running water, wind and rain.

The model works by learning the important statistics of these sounds. It can then produce new synthetic versions, of arbitrary duration, by ensuring that the new sounds match the statistics of the original. This illustrates that auditory textures are often defined statistically, a fact first demonstrated by Josh McDermott and Eero Simoncelli.
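To make the statistic-matching idea concrete, here is a minimal Python sketch. It is not the model from the thesis, which captures far richer statistics; it matches a single, very simple statistic (the training sound's average magnitude spectrum) by shaping white noise of any desired length to it. The function name and parameter values are illustrative.

    import numpy as np

    def match_spectrum(train, n_out, n_fft=1024):
        """Generate n_out samples of noise whose average magnitude
        spectrum matches that of the training sound `train`."""
        hop = n_fft // 2
        win = np.hanning(n_fft)
        # The statistic to match: the average magnitude spectrum.
        frames = np.lib.stride_tricks.sliding_window_view(train, n_fft)[::hop]
        target = np.sqrt(np.mean(np.abs(np.fft.rfft(frames * win, axis=1)) ** 2, axis=0))

        out = np.zeros(n_out)
        for start in range(0, n_out - n_fft, hop):
            spec = np.fft.rfft(np.random.randn(n_fft))
            spec = spec / (np.abs(spec) + 1e-12) * target   # impose the statistic
            out[start:start + n_fft] += np.fft.irfft(spec) * win
        return out / np.max(np.abs(out))

Because only second-order statistics are matched, this sketch produces Gaussian textures; it reproduces hiss-like sounds reasonably but misses all of the modulation structure discussed below.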

Characterising the statistics of natural sounds is also important for practical applications. For example, when your car's automatic speech-recognition system tries to work out what you are saying over background traffic noise, it will often fail. Performance can be improved by removing the traffic noise, and this can be done by exploiting the difference between the statistics of the traffic noise and those of speech.
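For a flavour of how such statistical differences can be exploited, here is a textbook spectral-subtraction sketch (a standard technique, not a method from the thesis): it estimates the noise's average magnitude spectrum from a noise-only recording and subtracts it from the noisy signal, frame by frame. The 5% spectral floor is a common heuristic to limit artefacts; all parameter values are illustrative.

    import numpy as np

    def spectral_subtract(noisy, noise_example, n_fft=512):
        """Suppress noise whose average magnitude spectrum is known,
        keeping whatever does not fit that statistic (e.g. speech)."""
        hop = n_fft // 2
        win = np.hanning(n_fft)
        # The noise statistic: its average magnitude spectrum.
        frames = np.lib.stride_tricks.sliding_window_view(noise_example, n_fft)[::hop]
        noise_mag = np.mean(np.abs(np.fft.rfft(frames * win, axis=1)), axis=0)

        out = np.zeros(len(noisy))
        for start in range(0, len(noisy) - n_fft, hop):
            spec = np.fft.rfft(noisy[start:start + n_fft] * win)
            mag = np.maximum(np.abs(spec) - noise_mag, 0.05 * np.abs(spec))
            out[start:start + n_fft] += np.fft.irfft(mag * np.exp(1j * np.angle(spec)))
        return out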

Importantly, by generating synthetic sounds, this work also reveals the statistics to which auditory processing is sensitive. This makes synthesis a valuable practical tool for understanding how hearing operates.

Chapter 3: Probabilistic Amplitude Demodulation

Probabilistic amplitude demodulation (PAD) is a new method we developed for estimating the envelope of a signal. Below, we illustrate the method by taking a training sound, extracting its envelope, and then generating a new sound from this envelope and a white-noise carrier. It is clear from this example that the long-time "rhythm" of the sound remains, but the short-time frequency content is lost.
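PAD itself is a probabilistic inference scheme, but the demonstration above can be imitated with a much simpler stand-in, sketched below: extract a slowly varying envelope (here via the Hilbert transform and a low-pass filter, used as assumptions in place of PAD) and impose it on a white-noise carrier. The cutoff frequency is illustrative.

    import numpy as np
    from scipy.signal import hilbert, butter, filtfilt

    def envelope_times_noise(x, fs, cutoff_hz=20.0):
        """Impose a slowly varying envelope of x on a white-noise carrier."""
        env = np.abs(hilbert(x))                 # instantaneous amplitude
        b, a = butter(2, cutoff_hz / (fs / 2))   # low-pass: keep slow modulation only
        env = np.maximum(filtfilt(b, a, env), 0.0)
        y = env * np.random.randn(len(x))        # white-noise carrier
        return y / np.max(np.abs(y))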

These examples were also referenced in a paper accepted for oral presentation at the ICA 2007 conference in London, where it won the "Best Student Paper" award.

Chapter 5: Probabilistic Time-Frequency Representations

Probabilistic time-frequency representations, developed in Chapter 5 of my thesis, are complementary to traditional time-frequency representations. They are slower to estimate than traditional methods, but once estimated it is very simple to resynthesise modified sounds from them. For example, below we illustrate how to modify the duration of a sound, and also how to modify its pitch.
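The probabilistic representation itself does not reduce to a few lines of code, but the "modify the representation, then resynthesise" recipe can be sketched with an ordinary short-time Fourier transform and a basic phase vocoder. Everything below (the function name, frame sizes, and the phase-vocoder machinery) is a conventional stand-in rather than the method from the thesis.

    import numpy as np

    def stretch(x, rate, n_fft=1024):
        """Time-stretch x by `rate` (rate > 1 lengthens it) by resampling
        frames of an STFT and keeping the phases consistent. Combining a
        stretch with ordinary resampling shifts the pitch instead."""
        hop = n_fft // 4
        win = np.hanning(n_fft)
        frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
        spec = np.fft.rfft(frames * win, axis=1)

        steps = np.arange(0, len(spec) - 1, 1.0 / rate)   # fractional frame positions
        phase = np.angle(spec[0])
        expected = 2 * np.pi * hop * np.arange(n_fft // 2 + 1) / n_fft
        out = np.zeros(len(steps) * hop + n_fft)
        for i, t in enumerate(steps):
            s0, s1 = spec[int(t)], spec[int(t) + 1]
            frac = t - int(t)
            mag = (1 - frac) * np.abs(s0) + frac * np.abs(s1)
            out[i * hop:i * hop + n_fft] += np.fft.irfft(mag * np.exp(1j * phase)) * win
            # Advance the phase by the measured inter-frame phase increment.
            dphi = np.angle(s1) - np.angle(s0) - expected
            dphi -= 2 * np.pi * np.round(dphi / (2 * np.pi))
            phase += expected + dphi
        return out / (np.max(np.abs(out)) + 1e-12)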

Chapter 5: MPAD (and ICASSP) synthetic auditory textures

In this section the goal is to produce synthetic versions of natural sounds by learning their statistics and then generating new sounds that match those statistics. In other words, for each of the models below, the model parameters were learned from a training sound (named on the left-hand side), and entirely new sounds were then generated using those learned parameters.

Stream 1: Original | Reversed | AR2 matched spectra | Independent Modulation (MPAD) | Co-modulation (MPAD)
Stream 2: Original | Reversed | AR2 matched spectra | Independent Modulation (MPAD) | Co-modulation (MPAD)
Stream 3: Original | Reversed | AR2 matched spectra | Independent Modulation (MPAD) | Co-modulation (MPAD)
Wind 1: Original | Reversed | AR2 matched spectra | Independent Modulation (MPAD) | Co-modulation (MPAD)
Wind 2: Original | Reversed | AR2 matched spectra | Independent Modulation (MPAD) | Co-modulation (MPAD)
Fire 1: Original | Reversed | AR2 matched spectra | Independent Modulation (MPAD) | Co-modulation (MPAD)
Fire 2: Original | Reversed | AR2 matched spectra | Independent Modulation (MPAD) | Co-modulation (MPAD)
Rain: Original | Reversed | AR2 matched spectra | Independent Modulation (MPAD) | Co-modulation (MPAD)
Applause: Original | Reversed | AR2 matched spectra | Independent Modulation (MPAD) | Co-modulation (MPAD)
Snapping Twigs: Original | Reversed | AR2 matched spectra | Independent Modulation (MPAD) | Co-modulation (MPAD)
Speech: Original | Reversed | AR2 matched spectra | Independent Modulation (MPAD) | Co-modulation (MPAD)

The "original" sounds are the training sounds. In some cases these are very short. The "AR2 matched spectra" sounds comprise a sum of AR(2) processes with parameters chosen to match the training spectra. These sounds are therefore Gaussian noise with spectra parameterised by the AR(2) processes.The "independent modulation" sounds are formed from a sum of independently modulated AR(2) processes. The "Co-modulation" sounds are formed from a sum of comodulated AR(2) processes. In this way, the models get more complicated from left to right across the table. Similarly, the complexity of the sounds increases down the table. So, whilst water is captured relatively well by independently modulated AR(2) processes, fire requires co-modulated processes to capture the crackles. Rain also requires comodulation to capture the sound of the droplets hitting leaves, but because this sound is asymmetric through time, the models - whose statistics are invariant under a reversal of time - cannot perfectly capture it. Similarly, speech cannot be captured because of this, and other higher-order statistics, which the models do not capture.

Here is some more detail about some of the sounds. The first wind sound is dominated by just three patterns of co-modulation, as the following sounds demonstrate:



Full Generated Sound | First three components | Remaining 12 components

Similarly, the first fire sound is dominated by one component, which captures the crackling; the remaining components capture the slower aspects of the fire sound:



Full Generated Sound | First component | Remaining 27 components

Likewise, the rain sound is dominated by two components, which capture the transient sound of the water droplets hitting the leaves; the remaining components capture the slower aspects of the rain sound:



Full Generated Sound | First two components | Remaining 27 components

The generated speech sound contains a number of different simple phoneme-like components.



Full Generated Sound | First component | Second component | Third component | Fourth component | Fifth component

Chapter 5: Chimera

These auditory chimeras combine the carriers inferred from one sound with the modulators inferred from another, possibly synthetic, sound. They indicate which aspects of the sounds are captured by each of the two types of process. (A sketch of the construction follows the examples below.)

Modulators from a speech sound, sinusoidal carriers at the filter centre frequencies
Modulators inferred from a speech sound, random carriers drawn from AR(2) processes
Constant amplitude, carriers inferred from a speech sound (posterior mean)
Constant amplitude, carriers inferred from a speech sound (sample from the posterior)
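For reference, here is a sketch of the conventional sub-band chimera construction. It extracts envelopes via the Hilbert transform rather than via MPAD's inferred modulators and carriers, and the band edges are illustrative, so treat it as a stand-in for the method actually used above.

    import numpy as np
    from scipy.signal import hilbert, butter, filtfilt

    def chimera(mod_src, car_src, fs, edges=(100, 300, 800, 2000, 5000)):
        """Envelopes from one sound imposed on the fine structure of another,
        band by band."""
        n = min(len(mod_src), len(car_src))
        out = np.zeros(n)
        for lo, hi in zip(edges[:-1], edges[1:]):
            b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
            m = np.abs(hilbert(filtfilt(b, a, mod_src[:n])))    # modulator of sound A
            c = filtfilt(b, a, car_src[:n])
            c = c / (np.abs(hilbert(c)) + 1e-9)                 # flatten sound B's envelope
            out += m * c
        return out / np.max(np.abs(out))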

When using MPAD (see Chapter 5 of my thesis) to carry out sub-band demodulation, the posterior mean over the carriers often appears to contain modulation. That is, it appears as if MPAD is not fully demodulating the sub-bands. However, it turns out that this feature arises because the posterior mean is not typical of the posterior distribution. Consider inferring a carrier when the associated amplitude is very small (compared to the observation noise). The posterior mean of the carrier reverts to the prior mean, which is zero. This means that the posterior mean of the carriers tends to contain more energy when the amplitude is large than when it is small. However, the posterior variance of the carriers is higher in regions of low amplitude. Therefore, a sample from the posterior distribution over the carriers contains rather less modulation than the posterior mean. For this reason, chimeras should be produced using samples from the posterior distribution over the carriers, rather than the posterior mean itself.
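The effect is easy to reproduce in a toy version of the model. Assume, per sample, an observation y = a * c + noise with a Gaussian prior on the carrier c; the amplitudes and noise level below are illustrative.

    import numpy as np

    # Toy per-sample model: y = a * c + noise, with carrier prior c ~ N(0, 1).
    n = 8
    a = np.array([2.0] * 4 + [0.05] * 4)          # high then low amplitude
    sigma2 = 0.1                                  # observation noise variance
    y = a * np.random.randn(n) + np.sqrt(sigma2) * np.random.randn(n)

    # Conjugate Gaussian posterior over the carrier at each sample.
    post_var = sigma2 / (a ** 2 + sigma2)
    post_mean = a * y / (a ** 2 + sigma2)
    post_sample = post_mean + np.sqrt(post_var) * np.random.randn(n)

    # The mean shrinks to zero where a is small, so it looks modulated;
    # a posterior sample keeps roughly unit variance everywhere.
    print("posterior mean:  ", np.round(post_mean, 2))
    print("posterior sample:", np.round(post_sample, 2))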

Chapter 5: Filling in missing data experiments

Various different generative models were used to fill in missing sections of speech of increasing duration; a sketch of the general infilling idea follows the table below.

The original speech sound.
1.25 ms missing: Bayesian Spectrum Estimation | Trained AR(2) filter bank | Independent Modulation (MPAD) | Co-modulation (MPAD)
6.25 ms missing: Bayesian Spectrum Estimation | Trained AR(2) filter bank | Independent Modulation (MPAD) | Co-modulation (MPAD)
9.375 ms missing: Bayesian Spectrum Estimation | Trained AR(2) filter bank | Independent Modulation (MPAD) | Co-modulation (MPAD)
12.5 ms missing: Bayesian Spectrum Estimation | Trained AR(2) filter bank | Independent Modulation (MPAD) | Co-modulation (MPAD)
15.625 ms missing: Bayesian Spectrum Estimation | Trained AR(2) filter bank | Independent Modulation (MPAD) | Co-modulation (MPAD)
18.75 ms missing: Bayesian Spectrum Estimation | Trained AR(2) filter bank | Independent Modulation (MPAD) | Co-modulation (MPAD)
25 ms missing: Bayesian Spectrum Estimation | Trained AR(2) filter bank | Independent Modulation (MPAD) | Co-modulation (MPAD)
31.25 ms missing: Bayesian Spectrum Estimation | Trained AR(2) filter bank | Independent Modulation (MPAD) | Co-modulation (MPAD)
37.5 ms missing: Bayesian Spectrum Estimation | Trained AR(2) filter bank | Independent Modulation (MPAD) | Co-modulation (MPAD)
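None of the models in the table reduces to a few lines of code, but the underlying recipe (condition a Gaussian generative model on the observed samples and read off the posterior over the gap) can be sketched as follows. The squared-exponential covariance and the lengthscale are stand-in assumptions; the dense solve is O(n^3) and only practical for short signals, whereas the models above exploit Markov structure.

    import numpy as np

    def fill_gap(x, gap, ell=8.0):
        """Fill the missing indices `gap` with the posterior mean of a
        stationary Gaussian process conditioned on the observed samples."""
        obs = np.setdiff1d(np.arange(len(x)), gap)
        k = lambda s, u: np.exp(-0.5 * (s[:, None] - u[None, :]) ** 2 / ell ** 2)
        Koo = k(obs, obs) + 1e-6 * np.eye(len(obs))   # jitter keeps the solve stable
        Kgo = k(np.asarray(gap), obs)
        y = np.asarray(x, dtype=float).copy()
        y[gap] = Kgo @ np.linalg.solve(Koo, y[obs])   # GP conditional mean
        return y

    # Example: fill 100 missing samples in a noisy sinusoid.
    sig = np.sin(0.05 * np.arange(1000)) + 0.1 * np.random.randn(1000)
    filled = fill_gap(sig, np.arange(450, 550))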

Auditory Scene Analysis

From the research talk (and Cosyne 2008 poster)

