Training Deep Neural Networks

Published: 09 Oct 2015 Category: deep_learning


Popular Training Approaches of DNNs — A Quick Overview

Optimisation and training techniques for deep learning

Activation functions


Rectified Linear Units Improve Restricted Boltzmann Machines

Expressiveness of Rectifier Networks

How can a deep neural network with ReLU activations in its hidden layers approximate any function?

Understanding Deep Neural Networks with Rectified Linear Units

Learning ReLUs via Gradient Descent


Rectifier Nonlinearities Improve Neural Network Acoustic Models

Deep Sparse Rectifier Neural Networks


Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Empirical Evaluation of Rectified Activations in Convolutional Network


Deep Learning with S-shaped Rectified Linear Activation Units

Parametric Activation Pools greatly increase performance and consistency in ConvNets

From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification

Revise Saturated Activation Functions

Noisy Activation Functions


Multi-Bias Non-linear Activation in Deep Neural Networks

Learning activation functions from data using cubic spline interpolation

What is the role of the activation function in a neural network?

Concatenated ReLU (CReLU)

Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units

Implement CReLU (Concatenated ReLU)


Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units

Formulating The ReLU

Activation Ensembles for Deep Neural Networks


Self-Normalizing Neural Networks

SELUs (scaled exponential linear units) - Visualized and Histogrammed Comparisons among ReLU and Leaky ReLU

Difference Between Softmax Function and Sigmoid Function

Flexible Rectified Linear Units for Improving Convolutional Neural Networks

Be Careful What You Backpropagate: A Case For Linear Output Activations & Gradient Boosting

Series on Initialization of Weights for DNN

Initialization Of Feedforward Networks

Initialization Of Deep Feedforward Networks

Initialization Of Deep Networks Case of Rectifiers

Weights Initialization

An Explanation of Xavier Initialization

Random Walk Initialization for Training Very Deep Feedforward Networks

Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy?

All you need is a good init

All You Need is Beyond a Good Init: Exploring Better Solution for Training Extremely Deep Convolutional Neural Networks with Orthonormality and Modulation

Data-dependent Initializations of Convolutional Neural Networks

What are good initial weights in a neural network?

RandomOut: Using a convolutional gradient norm to win The Filter Lottery

Categorical Reparameterization with Gumbel-Softmax

On weight initialization in deep neural networks

Batch Normalization

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks

Implementing Batch Normalization in Tensorflow

Deriving the Gradient for the Backward Pass of Batch Normalization

Exploring Normalization in Deep Residual Networks with Concatenated Rectified Linear Units

Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models

Backward pass of BN

Understanding the backward pass through Batch Normalization Layer

What does the gradient flowing through batch normalization look like?

Layer Normalization

Layer Normalization

Keras GRU with Layer Normalization

Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks

Loss Function

The Loss Surfaces of Multilayer Networks

Direct Loss Minimization for Training Deep Neural Nets

Nonconvex Loss Functions for Classifiers and Deep Networks

Learning Deep Embeddings with Histogram Loss

Large-Margin Softmax Loss for Convolutional Neural Networks

An empirical analysis of the optimization of deep network loss surfaces

Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes

Hierarchical Softmax

Noisy Softmax: Improving the Generalization Ability of DCNN via Postponing the Early Softmax Saturation

Learning Rate

No More Pesky Learning Rates

Coupling Adaptive Batch Sizes with Learning Rates

Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates

Convolution Filters

Non-linear Convolution Filters for CNN-based Learning

Pooling


Stochastic Pooling for Regularization of Deep Convolutional Neural Networks

Multi-scale Orderless Pooling of Deep Convolutional Activation Features

Fractional Max-Pooling

TI-POOLING: transformation-invariant pooling for feature learning in Convolutional Neural Networks

S3Pool: Pooling with Stochastic Spatial Sampling

Inductive Bias of Deep Convolutional Networks through Pooling Geometry

Improved Bilinear Pooling with CNNs

Learning Bag-of-Features Pooling for Deep Convolutional Neural Networks


Online Batch Selection for Faster Training of Neural Networks

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Scaling SGD Batch Size to 32K for ImageNet Training

ImageNet Training in 24 Minutes

Optimization Methods

On Optimization Methods for Deep Learning

Invariant backpropagation: how to train a transformation-invariant neural network

A practical theory for designing very deep convolutional neural networks

Stochastic Optimization Techniques

Alec Radford’s animations for optimization algorithms

Faster Asynchronous SGD (FASGD)

An overview of gradient descent optimization algorithms (★★★★★)

Exploiting the Structure: Stochastic Gradient Methods Using Raw Clusters

Writing fast asynchronous SGD/AdaGrad with RcppParallel

Quick Explanations Of Optimization Methods

Learning to learn by gradient descent by gradient descent

SGDR: Stochastic Gradient Descent with Restarts

The zen of gradient descent

Big Batch SGD: Automated Inference using Adaptive Batch Sizes

Improving Stochastic Gradient Descent with Feedback

Learning Gradient Descent: Better Generalization and Longer Horizons

Tensor Methods

Tensorizing Neural Networks

Tensor methods for training neural networks

Regularization


DisturbLabel: Regularizing CNN on the Loss Layer

Robust Convolutional Neural Networks under Adversarial Noise

Adding Gradient Noise Improves Learning for Very Deep Networks

Stochastic Function Norm Regularization of Deep Networks

SoftTarget Regularization: An Effective Technique to Reduce Over-Fitting in Neural Networks

Regularizing neural networks by penalizing confident predictions

Automatic Node Selection for Deep Neural Networks using Group Lasso Regularization

Regularization in deep learning

Dropout


Improving neural networks by preventing co-adaptation of feature detectors

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Fast dropout training

Dropout as data augmentation

A Theoretically Grounded Application of Dropout in Recurrent Neural Networks

Improved Dropout for Shallow and Deep Learning

Dropout Regularization in Deep Learning Models With Keras

Dropout with Expectation-linear Regularization

Dropout with Theano

Information Dropout: learning optimal representations through noise

Recent Developments in Dropout

Generalized Dropout

Analysis of Dropout

Variational Dropout Sparsifies Deep Neural Networks

Learning Deep Networks from Noisy Labels with Dropout Regularization

Concrete Dropout

Analysis of dropout learning regarded as ensemble learning


Regularization of Neural Networks using DropConnect

Regularizing neural networks with dropout and with DropConnect


DropNeuron: Simplifying the Structure of Deep Neural Networks


Maxout Networks

Improving Deep Neural Networks with Probabilistic Maxout Units


Swapout: Learning an ensemble of deep architectures


Whiteout: Gaussian Adaptive Regularization Noise in Deep Neural Networks

Gradient Descent

RMSProp: Divide the gradient by a running average of its recent magnitude

Adam: A Method for Stochastic Optimization

Fitting a model via closed-form equations vs. Gradient Descent vs. Stochastic Gradient Descent vs. Mini-Batch Learning. What is the difference? (Normal Equations vs. GD vs. SGD vs. MB-GD)

An Introduction to Gradient Descent in Python

Train faster, generalize better: Stability of stochastic gradient descent

A Variational Analysis of Stochastic Gradient Algorithms

The vanishing gradient problem: Oh no — an obstacle to deep learning!

Gradient Descent For Machine Learning

Revisiting Distributed Synchronous SGD

Convergence rate of gradient descent

A Robust Adaptive Stochastic Gradient Method for Deep Learning

  • intro: IJCNN 2017 accepted paper; an extension of the paper “ADASECANT: Robust Adaptive Secant Method for Stochastic Gradient”
  • intro: Universite de Montreal & University of Oxford

Gentle Introduction to the Adam Optimization Algorithm for Deep Learning


Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

ADADELTA: An Adaptive Learning Rate Method


On the importance of initialization and momentum in deep learning

YellowFin and the Art of Momentum Tuning


Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks

Top-down Neural Attention by Excitation Backprop

Towards a Biologically Plausible Backprop

Sampled Backpropagation: Training Deep and Wide Neural Networks on Large Scale, User Generated Content Using Label Sampling

The Reversible Residual Network: Backpropagation Without Storing Activations

Accelerate Training

Neural Networks with Few Multiplications

Acceleration of Deep Neural Network Training with Resistive Cross-Point Devices

Deep Q-Networks for Accelerating the Training of Deep Neural Networks

Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs


One weird trick for parallelizing convolutional neural networks

8-Bit Approximations for Parallelism in Deep Learning (ICLR 2016)

Handling Datasets

Data Augmentation

DataAugmentation ver1.0: Image data augmentation tool for training of image recognition algorithm

Caffe-Data-Augmentation: a branch of Caffe that adds data augmentation using a configurable stochastic combination of 7 data augmentation techniques

Image Augmentation for Deep Learning With Keras

What you need to know about data augmentation for machine learning

HZPROC: torch data augmentation toolbox (supports affine transform)

AGA: Attribute Guided Augmentation

Accelerating Deep Learning with Multiprocess Image Augmentation in Keras

Comprehensive Data Augmentation and Sampling for Pytorch

Image augmentation for machine learning experiments.

Google/inception’s data augmentation: scale and aspect ratio augmentation

Caffe Augmentation Extension

Improving Deep Learning using Generic Data Augmentation

Learning to Compose Domain-Specific Transformations for Data Augmentation

Imbalanced Datasets

Investigation on handling Structured & Imbalanced Datasets with Deep Learning

Low Numerical Precision

Training deep neural networks with low precision multiplications

Deep Learning with Limited Numerical Precision

BinaryConnect: Training Deep Neural Networks with binary weights during propagations

Binarized Neural Networks

BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

Adversarial Training

Learning from Simulated and Unsupervised Images through Adversarial Training

Incremental Training

ClickBAIT: Click-based Accelerated Incremental Training of Convolutional Neural Networks


Understanding the difficulty of training deep feedforward neural networks

Domain-Adversarial Training of Neural Networks

Scalable and Sustainable Deep Learning via Randomized Hashing

Training Deep Nets with Sublinear Memory Cost

Improving the Robustness of Deep Neural Networks via Stability Training

Faster Training of Very Deep Networks Via p-Norm Gates

Fast Training of Convolutional Neural Networks via Kernel Rescaling

FreezeOut: Accelerate Training by Progressively Freezing Layers

Normalized Gradient with Adaptive Stepsize Method for Deep Neural Network Training

Image Quality Assessment Guided Deep Neural Networks Training

An Effective Training Method For Deep Convolutional Neural Network

On the Importance of Consistency in Training Deep Neural Networks


pastalog: Simple, realtime visualization of neural network training performance

torch-pastalog: A Torch interface for pastalog - simple, realtime visualization of neural network training performance


Important nuances to train deep learning models

Train your deep model faster and sharper — two novel techniques