On the Difficulty of Training Deep Architectures
Whereas theoretical work suggests that deep architectures might be more efficient at representing highly varying functions, training deep architectures was unsuccessful until the recent advent of algorithms based on unsupervised pre-training. Even though these new algorithms have enabled the training of deep models, many questions remain as to the nature of this difficult learning problem. We attempt to shed some light on these questions by comparing different successful approaches to training deep architectures and by investigating explanatory hypotheses through extensive simulations. The experiments confirm and clarify the advantage (and sometimes disadvantage) of unsupervised pre-training. They demonstrate the robustness of the training procedure with respect to random initialization, the positive effect of pre-training on optimization, and its role as a regularizer (in both cases in unusual ways).
We explore explanatory hypotheses based on the notion that the early growth of the model parameters is decisive, and in particular that early use of unsupervised learning places the dynamics of supervised learning in attractors associated with local minima that have good generalization properties. We discuss how several training approaches for deep architectures may exploit the principle of continuation methods in order to find good local minima. In particular, we suggest that this is the case for shaping, or the use of a curriculum, and show that it affects both the speed of convergence of the training process to a minimum and, in the case of non-convex criteria, the quality of the local minima obtained.
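To make the continuation idea concrete, the following is a minimal sketch on a one-dimensional toy problem, not a reproduction of any experiment reported here. The objective `target`, the use of Gaussian smoothing to build the family of easier criteria, the smoothing schedule, the learning rate, and the starting point are all illustrative assumptions. The sketch first minimizes a heavily smoothed (convex) version of the criterion, then reuses each solution as the starting point for a less smoothed version, until the original non-convex criterion is reached.

```python
# Toy continuation method (illustrative assumption, not the paper's setup).
# Non-convex target with a poor local minimum near x = 1.35 and a better
# minimum near x = -1.47.


def target(x):
    return x**4 - 4.0 * x**2 + x


def smoothed_grad(x, s):
    # Gradient of the Gaussian-smoothed criterion E_z[target(x + s*z)],
    # z ~ N(0, 1). For this polynomial the expectation has the closed form
    # x^4 + (6 s^2 - 4) x^2 + x + const, so s = 0 recovers the target itself.
    return 4.0 * x**3 + 2.0 * (6.0 * s**2 - 4.0) * x + 1.0


def descend(x, s, lr=0.01, steps=500):
    # Plain gradient descent on the criterion smoothed at level s.
    for _ in range(steps):
        x -= lr * smoothed_grad(x, s)
    return x


def plain_descent(x0):
    # Baseline: optimize the non-convex target directly.
    return descend(x0, 0.0)


def continuation_descent(x0, schedule=(1.0, 0.8, 0.6, 0.4, 0.2, 0.0)):
    # Track the minimizer while the smoothing level is gradually reduced.
    x = x0
    for s in schedule:
        x = descend(x, s)
    return x


x_plain = plain_descent(2.0)
x_cont = continuation_descent(2.0)
print("plain:       ", x_plain, target(x_plain))
print("continuation:", x_cont, target(x_cont))
```

From the same starting point, direct descent settles in the nearer, poorer minimum, while the continuation schedule follows the minimizer of the smoothed criteria into the better basin. This toy was chosen so that the effect is visible; continuation does not in general guarantee reaching a global minimum.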