**Laurence Aitchison**

NeurIPS (2019)

Multi-sample, importance-weighted variational autoencoders (IWAE) give tighter bounds and more accurate uncertainty estimates than variational autoencoders (VAE) trained with a standard single-sample objective. However, IWAEs scale poorly: as the latent dimensionality grows, they require exponentially many samples to retain the benefits of importance weighting. While sequential Monte-Carlo (SMC) can address this problem, it is prohibitively slow because the resampling step imposes sequential structure which cannot be parallelised, and moreover, resampling is non-differentiable which is problematic when learning approximate posteriors. To address these issues, we developed tensor Monte-Carlo (TMC) which gives exponentially many importance samples by separately drawing K samples for each of the n latent variables, then averaging over all K^n possible combinations. While the sum over exponentially many terms might seem to be intractable, in many cases it can be computed efficiently as a series of tensor inner-products. We show that TMC is superior to IWAE on a generative model with multiple stochastic layers trained on the MNIST handwritten digit database, and we show that TMC can be combined with standard variance reduction techniques.
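The key computational trick — averaging the importance weight over all K^n combinations via a chain of tensor inner products — can be sketched on a toy two-latent chain. The model, proposals, and variable names below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4  # importance samples per latent variable

# toy chain z1 -> z2 -> x with unit-variance Gaussians (an assumption of this sketch)
def log_norm(x, mu):
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2

x = 1.3                   # observed datum
z1 = rng.normal(size=K)   # K samples from proposal q(z1) = N(0, 1)
z2 = rng.normal(size=K)   # K samples from proposal q(z2) = N(0, 1)

# factor tensors of the importance weight w(z1_i, z2_j):
f1 = np.exp(log_norm(z1, 0.0) - log_norm(z1, 0.0))                  # p(z1)/q(z1), = 1 here
f2 = np.exp(log_norm(z2[None, :], z1[:, None])
            - log_norm(z2, 0.0)[None, :])                           # p(z2|z1)/q(z2), shape (K, K)
f3 = np.exp(log_norm(x, z2))                                        # p(x|z2)

# naive estimate: average the weight over all K**2 combinations explicitly
naive = np.mean(f1[:, None] * f2 * f3[None, :])

# the same sum as a chain of tensor inner products, avoiding explicit
# enumeration of the K^n combinations for longer chains
chained = (f1 / K) @ f2 @ (f3 / K)

assert np.allclose(naive, chained)
```

For a length-n chain the contraction costs O(n K^2) rather than O(K^n), which is what makes the exponentially-many-samples bound tractable.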

Rodrigo Echeveste, **Laurence Aitchison**, Guillaume Hennequin and Máté Lengyel

bioRxiv (2019)

Sensory cortices display a suite of ubiquitous dynamical features, such as ongoing noise variability, transient overshoots, and oscillations, that have so far escaped a common, principled theoretical account. We developed a unifying model for these phenomena by training a recurrent excitatory–inhibitory neural circuit model of a visual cortical hypercolumn to perform sampling-based probabilistic inference. The optimized network displayed several key biological properties, including divisive normalization, as well as stimulus-modulated noise variability, inhibition-dominated transients at stimulus onset, and strong gamma oscillations. These dynamical features had distinct functional roles in speeding up inferences and made predictions that we confirmed in novel analyses of awake monkey recordings. Our results suggest that the basic motifs of cortical dynamics emerge as a consequence of the efficient implementation of the same computational function—fast sampling-based inference—and predict further properties of these motifs that can be tested in future experiments.

Adria Garriga-Alonso, Carl E. Rasmussen and **Laurence Aitchison**

ICLR (2019)

We show that the output of a (residual) CNN with an appropriate prior over the weights and biases is a GP in the limit of infinitely many convolutional filters, extending similar results for dense networks. For a CNN, the equivalent kernel can be computed exactly and, unlike "deep kernels", has very few parameters: only the hyperparameters of the original CNN. Further, we show that this kernel has two properties that allow it to be computed efficiently; the cost of evaluating the kernel for a pair of images is similar to a single forward pass through the original CNN with only one filter per layer. The kernel equivalent to a 32-layer ResNet obtains 0.84% classification error on MNIST, a new record for GPs with a comparable number of parameters.
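For intuition, the dense-network analogue of this kernel (the result the paper extends to CNNs) can be computed with the standard infinite-width ReLU recursion. This is only a sketch: the weight/bias variances and the depth are assumptions, and the CNN version additionally tracks spatial locations:

```python
import numpy as np

def relu_layer(Kxx, Kxy, Kyy, sw2=1.0, sb2=0.1):
    # one infinite-width ReLU layer of the NNGP kernel recursion
    # (Gaussian expectation of relu(u)*relu(v), i.e. the arc-cosine form)
    c = np.clip(Kxy / np.sqrt(Kxx * Kyy), -1.0, 1.0)
    theta = np.arccos(c)
    Exy = np.sqrt(Kxx * Kyy) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
    return sw2 * Kxx / 2 + sb2, sw2 * Exy + sb2, sw2 * Kyy / 2 + sb2

x = np.array([1.0, -0.5])
y = np.array([0.3, 0.8])
d = len(x)
# input-layer kernel induced by the prior over first-layer weights and biases
Kxx, Kxy, Kyy = x @ x / d + 0.1, x @ y / d + 0.1, y @ y / d + 0.1
for _ in range(32):  # a 32-layer stack, loosely analogous to the ResNet depth in the paper
    Kxx, Kxy, Kyy = relu_layer(Kxx, Kxy, Kyy)

# the result is a valid covariance: Cauchy-Schwarz must hold
assert abs(Kxy) <= np.sqrt(Kxx * Kyy) + 1e-9
```

Note that evaluating the kernel for one pair of inputs costs one pass through this recursion — the dense-network counterpart of "a single forward pass with one filter per layer".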

**Laurence Aitchison**, Guillaume Hennequin and Máté Lengyel

arXiv (2018)

Neural responses to sensory stimuli vary significantly from trial to trial. We postulate that suppressing this variability incurs an energetic cost, and consider this cost in the context of unsupervised learning tasks, where it induces a tradeoff between energetically cheap but inaccurate neural representations and energetically costly but accurate ones. Remarkably, networks that we trained subject to this tradeoff developed probabilistic representations in which neural variability represented different plausible explanations for the incoming sensory data. We were able to understand this result by formally linking it to a state-of-the-art probabilistic inference strategy: deep variational autoencoders. Our work shows how complex probabilistic computations emerge naturally out of classical models of unsupervised learning, combined with basic biophysical considerations.

**Laurence Aitchison**

arXiv (2018)

Neural network optimization methods fall into two broad classes: adaptive methods such as Adam and non-adaptive methods such as vanilla stochastic gradient descent (SGD). Here, we formulate the problem of neural network optimization as Bayesian filtering. We find that state-of-the-art adaptive (AdamW) and non-adaptive (SGD) methods can be recovered by taking limits as the amount of information about the parameter gets large or small, respectively. As such, we develop a new neural network optimization algorithm, AdaBayes, that adaptively transitions between SGD-like and Adam(W)-like behaviour. This algorithm converges more rapidly than Adam in the early part of learning, and has generalisation performance competitive with SGD.
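A minimal scalar sketch of the Bayesian-filtering view — not the published AdaBayes updates; the toy loss, noise level, and drift term are all assumptions. Each weight carries a Gaussian belief, and the effective learning rate is the Kalman gain, which starts large (adaptive, Adam-like) and decays to a small steady state (SGD-like):

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = 2.0           # minimiser of a toy quadratic loss 0.5*(w - w_star)**2
mu, sigma2 = 0.0, 1.0  # Gaussian belief over the weight
obs_var = 0.5          # assumed variance of each minibatch gradient
drift = 1e-3           # assumed per-step diffusion, keeps the filter adaptive

lrs = []
for _ in range(2000):
    g = (mu - w_star) + rng.normal(scale=np.sqrt(obs_var))  # noisy gradient
    sigma2 += drift                    # prediction step: uncertainty grows
    lr = sigma2 / (sigma2 + obs_var)   # Kalman gain = effective learning rate
    mu -= lr * g                       # update step (g carries mu - w_star)
    sigma2 *= 1.0 - lr                 # uncertainty shrinks after each observation
    lrs.append(lr)

assert lrs[0] > 10 * lrs[-1]  # learning rate decays as information accumulates
```

The transition the abstract describes corresponds to the two limits of the gain: `sigma2 >> obs_var` (little information about the parameter, large adaptive steps) versus `sigma2 << obs_var` (much information, small near-constant steps).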

**Laurence Aitchison**, Vincent Adam and Srini C Turaga

arXiv (2018)

Each training step for a variational autoencoder (VAE) requires us to sample from the approximate posterior, so we usually choose simple (e.g. factorised) approximate posteriors in which sampling is an efficient computation that fully exploits GPU parallelism. However, such simple approximate posteriors are often insufficient, as they eliminate statistical dependencies in the posterior. While it is possible to use normalizing flow approximate posteriors for continuous latents, some problems have discrete latents and strong statistical dependencies. The most natural approach to model these dependencies is an autoregressive distribution, but sampling from such distributions is inherently sequential and thus slow. We develop a fast, parallel sampling procedure for autoregressive distributions based on fixed-point iterations which enables efficient and accurate variational inference in discrete state-space latent variable dynamical systems. To optimize the variational bound, we considered two ways to evaluate probabilities: inserting the relaxed samples directly into the pmf for the discrete distribution, or converting to continuous logistic latent variables and interpreting the K-step fixed-point iterations as a normalizing flow. We found that converting to continuous latent variables gave considerable additional scope for mismatch between the true and approximate posteriors, which resulted in biased inferences; we thus used the former approach. Using our fast sampling procedure, we were able to realize the benefits of correlated posteriors, including accurate uncertainty estimates for one cell, and accurate connectivity estimates for multiple cells, in an order of magnitude less time.
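The parallel-sampling idea can be illustrated on a linear AR(1) chain (a simplified stand-in for the paper's discrete models): updating every coordinate simultaneously from the previous iterate has the exact sequential sample as its fixed point, reached in at most n iterations, and each iteration is trivially parallelisable across coordinates:

```python
import numpy as np

rng = np.random.default_rng(1)
n, a = 8, 0.7
eps = rng.normal(size=n)  # one fixed noise draw per coordinate

# sequential (inherently serial) sampling of the chain z_i = a*z_{i-1} + eps_i
z_seq = np.zeros(n)
z_seq[0] = eps[0]
for i in range(1, n):
    z_seq[i] = a * z_seq[i - 1] + eps[i]

# parallel fixed-point iteration: apply the autoregressive update to all
# coordinates at once; information propagates one step down the chain per sweep
z = np.zeros(n)
for _ in range(n):  # converges exactly within n sweeps for a length-n chain
    z_shift = np.concatenate(([0.0], z[:-1]))
    z = a * z_shift + eps

assert np.allclose(z, z_seq)
```

In practice, when dependencies decay along the chain (here |a| < 1), far fewer than n sweeps give an accurate sample, which is where the speedup comes from.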

**Laurence Aitchison**, Lloyd Russell, Adam M Packer, Jinyao Yan, Philippe Castonguay, Michael Häusser and Srini C Turaga

NeurIPS Oral (2017)

Population activity measurement by calcium imaging can be combined with cellular resolution optogenetic activity perturbations to enable the mapping of neural connectivity in vivo. This requires accurate inference of perturbed and unperturbed neural activity from calcium imaging measurements, which are noisy and indirect, and can also be contaminated by photostimulation artifacts. We have developed a new fully Bayesian approach to jointly inferring spiking activity and neural connectivity from in vivo all-optical perturbation experiments. In contrast to standard approaches that perform spike inference and analysis in two separate maximum-likelihood phases, our joint model is able to propagate uncertainty in spike inference to the inference of connectivity and vice versa. We use the framework of variational autoencoders to model spiking activity using discrete latent variables, low-dimensional latent common input, and sparse spike-and-slab generalized linear coupling between neurons. Additionally, we model two properties of the optogenetic perturbation: off-target photostimulation and photostimulation transients. Using this model, we were able to fit models on 30 minutes of data in just 10 minutes. We performed an all-optical circuit mapping experiment in primary visual cortex of the awake mouse, and used our approach to predict neural connectivity between excitatory neurons in layer 2/3. Predicted connectivity was sparse and consistent with known correlations with stimulus tuning, spontaneous correlation and distance.

**Laurence Aitchison** and Máté Lengyel

Current Opinion in Neurobiology (2017)

Two theoretical ideas have emerged recently with the ambition to provide a unifying functional explanation of neural population coding and dynamics: predictive coding and Bayesian inference. Here, we describe the two theories and their combination into a single framework: Bayesian predictive coding. We clarify how the two theories can be distinguished, despite sharing core computational concepts and addressing an overlapping set of empirical phenomena. We argue that predictive coding is an algorithmic/representational motif that can serve several different computational goals of which Bayesian inference is but one. Conversely, while Bayesian inference can utilize predictive coding, it can also be realized by a variety of other representations. We critically evaluate the experimental evidence supporting Bayesian predictive coding and discuss how to test it more directly.

Christoph Schmidt-Hieber, Gabija Toleikyte, **Laurence Aitchison**, Arnd Roth, Beverley A Clark, Tiago Branco and Michael Häusser

Nature Neuroscience (2017)

Understanding how active dendrites are exploited for behaviorally relevant computations is a fundamental challenge in neuroscience. Grid cells in medial entorhinal cortex are an attractive model system for addressing this question, as the computation they perform is clear: they convert synaptic inputs into spatially modulated, periodic firing. Whether active dendrites contribute to the generation of the dual temporal and rate codes characteristic of grid cell output is unknown. We show that dendrites of medial entorhinal cortex neurons are highly excitable and exhibit a supralinear input–output function in vitro, while in vivo recordings reveal membrane potential signatures consistent with recruitment of active dendritic conductances. By incorporating these nonlinear dynamics into grid cell models, we show that they can sharpen the precision of the temporal code and enhance the robustness of the rate code, thereby supporting a stable, accurate representation of space under varying environmental conditions. Our results suggest that active dendrites may therefore constitute a key cellular mechanism for ensuring reliable spatial navigation.

Dan Bang, **Laurence Aitchison**, Rani Moran, Santiago Herce Castanon, Banafsheh Rafiee, Ali Mahmoodi, Jennifer Y. F. Lau, Peter E. Latham, Bahador Bahrami and Christopher Summerfield

Nature Human Behaviour (2017)

Most important decisions in our society are made by groups, from cabinets and commissions to boards and juries. When disagreement arises, opinions expressed with higher confidence tend to carry more weight. Although an individual’s degree of confidence often reflects the probability that their opinion is correct, it can also vary with task-irrelevant psychological, social, cultural and demographic factors. Therefore, to combine their opinions optimally, group members must adapt to each other’s individual biases and express their confidence according to a common metric. However, solving this communication problem is computationally difficult. Here we show that pairs of individuals making group decisions meet this challenge by using a heuristic strategy that we call ‘confidence matching’: they match their communicated confidence so that certainty and uncertainty are stated in approximately equal measure by each party. Combining the behavioural data with computational modelling, we show that this strategy is effective when group members have similar levels of expertise, and that it is robust when group members have no insight into their relative levels of expertise. Confidence matching is, however, sub-optimal and can cause miscommunication about who is more likely to be correct. This herding behaviour is one reason why groups can fail to make good decisions.

**Laurence Aitchison** and Máté Lengyel

PLoS Computational Biology (2016)

Our brain operates in the face of substantial uncertainty due to ambiguity in the inputs, and inherent unpredictability in the environment. Behavioural and neural evidence indicates that the brain often uses a close approximation of the optimal strategy, probabilistic inference, to interpret sensory inputs and make decisions under uncertainty. However, the circuit dynamics underlying such probabilistic computations are unknown. In particular, two fundamental properties of cortical responses, the presence of oscillations and transients, are difficult to reconcile with probabilistic inference. We show that excitatory-inhibitory neural networks are naturally suited to implement a particular inference algorithm, Hamiltonian Monte Carlo. Our network showed oscillations and transients like those found in the cortex and took advantage of these dynamical motifs to speed up inference by an order of magnitude. These results suggest a new functional role for the separation of cortical populations into excitatory and inhibitory neurons, and for the neural oscillations that emerge in such excitatory-inhibitory networks: enhancing the efficiency of cortical computations.

**Laurence Aitchison**, Nicola Corradi and Peter E Latham

PLoS Computational Biology (2016)

Datasets ranging from word frequencies to neural activity all have a seemingly unusual property, known as Zipf’s law: when observations (e.g., words) are ranked from most to least frequent, the frequency of an observation is inversely proportional to its rank. Here we demonstrate that a single, general principle underlies Zipf’s law in a wide variety of domains, by showing that models in which there is a latent, or hidden, variable controlling the observations can, and sometimes must, give rise to Zipf’s law. We illustrate this mechanism in three domains: word frequency, data with variable sequence length, and neural data.

**Laurence Aitchison**, Dan Bang, Bahador Bahrami and Peter E Latham

PLoS Computational Biology (2015)

Confidence plays a key role in group interactions: when people express an opinion, they almost always communicate—either implicitly or explicitly—their confidence, and the degree of confidence has a strong effect on listeners. Understanding both how confidence is generated and how it is interpreted are therefore critical for understanding group interactions. Here we ask: how do people generate their confidence? A priori, they could use a heuristic strategy (e.g. their confidence could scale more or less with the magnitude of the sensory data) or what we take to be an optimal strategy (i.e. their confidence is a function of the probability that their opinion is correct). We found, using Bayesian model selection, that confidence reports reflect probability correct, at least in more standard experimental designs. If this result extends to other domains, it would provide a relatively simple interpretation of confidence, and thus greatly extend our understanding of group interactions.

**Laurence Aitchison**, Guillaume Hennequin and Máté Lengyel

NeurIPS (2014)

Multiple lines of evidence support the notion that the brain performs probabilistic inference in multiple cognitive domains, including perception and decision making. There is also evidence that probabilistic inference may be implemented in the brain through the (quasi-)stochastic activity of neural circuits, producing samples from the appropriate posterior distributions, effectively implementing a Markov chain Monte Carlo algorithm. However, time becomes a fundamental bottleneck in such sampling-based probabilistic representations: the quality of inferences depends on how fast the neural circuit generates new, uncorrelated samples from its stationary distribution (the posterior). We explore this bottleneck in a simple, linear-Gaussian latent variable model, in which posterior sampling can be achieved by stochastic neural networks with linear dynamics. The well-known Langevin sampling (LS) recipe, so far the only sampling algorithm for continuous variables of which a neural implementation has been suggested, naturally fits into this dynamical framework. However, we first show analytically and through simulations that the symmetry of the synaptic weight matrix implied by LS yields critically slow mixing when the posterior is high-dimensional. Next, using methods from control theory, we construct and inspect networks that are optimally fast, and hence orders of magnitude faster than LS, while being far more biologically plausible. In these networks, strong -- but transient -- selective amplification of external noise generates the spatially correlated activity fluctuations prescribed by the posterior. Intriguingly, although a detailed balance of excitation and inhibition is dynamically maintained, detailed balance of Markov chain steps in the resulting sampler is violated, consistent with recent findings on how statistical irreversibility can overcome the speed limitation of random walks in other domains.
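The core linear-algebra fact behind the fast samplers can be checked directly: for a Gaussian posterior with covariance Sigma, adding any skew-symmetric term to the Langevin drift leaves the stationary distribution unchanged while breaking reversibility. The paper optimises over such terms with control theory; the random skew-symmetric S below is only illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
M = rng.normal(size=(d, d))
Sigma = M @ M.T + d * np.eye(d)   # an arbitrary SPD posterior covariance
Prec = np.linalg.inv(Sigma)

# Langevin sampling: dz = A z dt + sqrt(2) dW with symmetric drift
A_langevin = -Prec

# non-reversible alternative: add any skew-symmetric S to the drift
S = rng.normal(size=(d, d))
S = S - S.T
A_fast = -(np.eye(d) + S) @ Prec

# both drifts satisfy the stationarity condition A Sigma + Sigma A^T = -2 I,
# so both leave the posterior N(0, Sigma) invariant
for A in (A_langevin, A_fast):
    assert np.allclose(A @ Sigma + Sigma @ A.T, -2 * np.eye(d))

# mixing speed is set by the slowest mode: the largest real part of eig(A)
# (more negative = faster; the paper makes it as negative as possible)
slowest = {name: np.linalg.eigvals(A).real.max()
           for name, A in [("langevin", A_langevin), ("fast", A_fast)]}
```

For the symmetric Langevin drift the slowest mode is pinned at -1/λ_max(Sigma), which is why mixing becomes critically slow in high dimensions; the skew-symmetric degrees of freedom are exactly what the optimally fast networks exploit.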

**Laurence Aitchison**, Alex Pouget and Peter E Latham

arXiv (2014)

Learning, especially rapid learning, is critical for survival. However, learning is hard: a large number of synaptic weights must be set based on noisy, often ambiguous, sensory information. In such a high-noise regime, keeping track of probability distributions over weights - not just point estimates - is the optimal strategy. Here we hypothesize that synapses take that optimal strategy: they do not store just the mean weight; they also store their degree of uncertainty - in essence, they put error bars on the weights. They then use that uncertainty to adjust their learning rates, with higher uncertainty resulting in higher learning rates. We also make a second, independent, hypothesis: synapses communicate their uncertainty by linking it to variability, with more uncertainty leading to more variability. More concretely, the value of a synaptic weight at a given time is a sample from its probability distribution. These two hypotheses cast synaptic plasticity as a problem of Bayesian inference, and thus provide a normative view of learning. They are consistent with known learning rules, offer an explanation for the large variability in the size of post-synaptic potentials, and make several falsifiable experimental predictions.
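The two hypotheses can be sketched as a per-synapse Kalman filter; the drift and noise parameters here are assumptions, not values from the paper. Uncertainty sets the learning rate, and the expressed weight is a sample from the belief:

```python
import numpy as np

rng = np.random.default_rng(3)

w_true = 1.0         # hidden "true" weight the synapse must track
drift_var = 1e-3     # assumed random-walk drift of the true weight
obs_var = 0.25       # assumed noise on each error signal

mu, var = 0.0, 1.0   # synaptic belief: mean and "error bars on the weight"
lrs = []
for _ in range(500):
    w_true += rng.normal(scale=np.sqrt(drift_var))
    obs = w_true + rng.normal(scale=np.sqrt(obs_var))  # noisy error signal
    var += drift_var                   # uncertainty grows between observations
    lr = var / (var + obs_var)         # higher uncertainty -> higher learning rate
    mu += lr * (obs - mu)
    var *= 1.0 - lr
    lrs.append(lr)

# second hypothesis: the expressed weight is a *sample* from the belief, so
# posterior uncertainty appears as trial-to-trial synaptic variability
w_sample = rng.normal(loc=mu, scale=np.sqrt(var))

assert lrs[0] > lrs[-1]  # learning rate falls as uncertainty shrinks
```

This makes the falsifiable structure of the hypotheses concrete: learning rates should covary with uncertainty, and uncertainty should covary with the trial-to-trial variability of postsynaptic potentials.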