Training Deep Neural Networks

Published: 09 Oct 2015 Category: deep_learning


Popular Training Approaches of DNNs — A Quick Overview

Optimisation and training techniques for deep learning


SNIPER: Efficient Multi-Scale Training

RePr: Improved Training of Convolutional Filters

Activation functions


Rectified linear units improve restricted boltzmann machines

Expressiveness of Rectifier Networks

How can a deep neural network with ReLU activations in its hidden layers approximate any function?

Understanding Deep Neural Networks with Rectified Linear Units

Learning ReLUs via Gradient Descent

Training Better CNNs Requires to Rethink ReLU

Deep Learning using Rectified Linear Units (ReLU)

Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks


Rectifier Nonlinearities Improve Neural Network Acoustic Models

Deep Sparse Rectifier Neural Networks


Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Empirical Evaluation of Rectified Activations in Convolutional Network


Deep Learning with S-shaped Rectified Linear Activation Units

Parametric Activation Pools greatly increase performance and consistency in ConvNets

From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification

Revise Saturated Activation Functions

Noisy Activation Functions


Multi-Bias Non-linear Activation in Deep Neural Networks

Learning activation functions from data using cubic spline interpolation

What is the role of the activation function in a neural network?

Concatenated ReLU (CReLU)

Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units

Implement CReLU (Concatenated ReLU)


Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units

Formulating The ReLU

Activation Ensembles for Deep Neural Networks


Self-Normalizing Neural Networks

SELUs (scaled exponential linear units) - Visualized and Histogramed Comparisons among ReLU and Leaky ReLU

Difference Between Softmax Function and Sigmoid Function

Flexible Rectified Linear Units for Improving Convolutional Neural Networks

Be Careful What You Backpropagate: A Case For Linear Output Activations & Gradient Boosting


EraseReLU: A Simple Way to Ease the Training of Deep Convolution Neural Networks


Swish: a Self-Gated Activation Function

Searching for Activation Functions

Deep Learning with Data Dependent Implicit Activation Function

Series on Initialization of Weights for DNN

Initialization Of Feedforward Networks

Initialization Of Deep Feedforward Networks

Initialization Of Deep Networks Case of Rectifiers

Weights Initialization

An Explanation of Xavier Initialization

Random Walk Initialization for Training Very Deep Feedforward Networks

Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy?

All you need is a good init

All You Need is Beyond a Good Init: Exploring Better Solution for Training Extremely Deep Convolutional Neural Networks with Orthonormality and Modulation

Data-dependent Initializations of Convolutional Neural Networks

What are good initial weights in a neural network?

RandomOut: Using a convolutional gradient norm to win The Filter Lottery

Categorical Reparameterization with Gumbel-Softmax

On weight initialization in deep neural networks

Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

Batch Normalization

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks

Revisiting Batch Normalization For Practical Domain Adaptation

  • intro: Peking University & TuSimple & SenseTime
  • intro: Pattern Recognition
  • keywords: Adaptive Batch Normalization (AdaBN)

Implementing Batch Normalization in Tensorflow

Deriving the Gradient for the Backward Pass of Batch Normalization

Exploring Normalization in Deep Residual Networks with Concatenated Rectified Linear Units

Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models

Comparison of Batch Normalization and Weight Normalization Algorithms for the Large-scale Image Classification

In-Place Activated BatchNorm for Memory-Optimized Training of DNNs

Batch Kalman Normalization: Towards Training Deep Neural Networks with Micro-Batches

Decorrelated Batch Normalization

Understanding Batch Normalization

Implementing Synchronized Multi-GPU Batch Normalization

Restructuring Batch Normalization to Accelerate CNN Training

Intro to optimization in deep learning: Busting the myth about batch normalization

Understanding Regularization in Batch Normalization

How Does Batch Normalization Help Optimization?

Cross-Iteration Batch Normalization

Extended Batch Normalization

Towards Stabilizing Batch Statistics in Backward Propagation of Batch Normalization

Rethinking “Batch” in BatchNorm

Backward pass of BN

Understanding the backward pass through Batch Normalization Layer

What does the gradient flowing through batch normalization look like?

Layer Normalization

Layer Normalization

Keras GRU with Layer Normalization

Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks

Differentiable Learning-to-Normalize via Switchable Normalization

Group Normalization

Group Normalization

Batch-Instance Normalization

Batch-Instance Normalization for Adaptively Style-Invariant Neural Networks

Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net

Dynamic Normalization

Dynamic Normalization

Loss Function

The Loss Surfaces of Multilayer Networks

Direct Loss Minimization for Training Deep Neural Nets

Nonconvex Loss Functions for Classifiers and Deep Networks

Learning Deep Embeddings with Histogram Loss

Large-Margin Softmax Loss for Convolutional Neural Networks

An empirical analysis of the optimization of deep network loss surfaces

Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes

Hierarchical Softmax

Noisy Softmax: Improving the Generalization Ability of DCNN via Postponing the Early Softmax Saturation

DropMax: Adaptive Stochastic Softmax

Rethinking Feature Distribution for Loss Functions in Image Classification

Ensemble Soft-Margin Softmax Loss for Image Classification

Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels

Learning Rates

No More Pesky Learning Rates

Coupling Adaptive Batch Sizes with Learning Rates

Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates

Improving the way we work with learning rate.

WNGrad: Learn the Learning Rate in Gradient Descent

Learning with Random Learning Rates

Learning Rate Dropout

Convolution Filters

Non-linear Convolution Filters for CNN-based Learning


Stochastic Pooling for Regularization of Deep Convolutional Neural Networks

Multi-scale Orderless Pooling of Deep Convolutional Activation Features

Fractional Max-Pooling

TI-POOLING: transformation-invariant pooling for feature learning in Convolutional Neural Networks

S3Pool: Pooling with Stochastic Spatial Sampling

Inductive Bias of Deep Convolutional Networks through Pooling Geometry

Improved Bilinear Pooling with CNNs

Learning Bag-of-Features Pooling for Deep Convolutional Neural Networks

A new kind of pooling layer for faster and sharper convergence

Statistically Motivated Second Order Pooling

Detail-Preserving Pooling in Deep Networks


Online Batch Selection for Faster Training of Neural Networks

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Scaling SGD Batch Size to 32K for ImageNet Training

Large Batch Training of Convolutional Networks

ImageNet Training in 24 Minutes

Don’t Decay the Learning Rate, Increase the Batch Size

Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks

Hessian-based Analysis of Large Batch Training and Robustness to Adversaries

Large Batch Training of Convolutional Networks with Layer-wise Adaptive Rate Scaling

Revisiting Small Batch Training for Deep Neural Networks

Second-order Optimization Method for Large Mini-batch: Training ResNet-50 on ImageNet in 35 Epochs

Optimization Methods

On Optimization Methods for Deep Learning

Invariant backpropagation: how to train a transformation-invariant neural network

A practical theory for designing very deep convolutional neural network

Stochastic Optimization Techniques

Alec Radford’s animations for optimization algorithms

Faster Asynchronous SGD (FASGD)

An overview of gradient descent optimization algorithms (★★★★★)

Exploiting the Structure: Stochastic Gradient Methods Using Raw Clusters

Writing fast asynchronous SGD/AdaGrad with RcppParallel

Quick Explanations Of Optimization Methods

Learning to learn by gradient descent by gradient descent

SGDR: Stochastic Gradient Descent with Restarts

The zen of gradient descent

Big Batch SGD: Automated Inference using Adaptive Batch Sizes

Improving Stochastic Gradient Descent with Feedback

Learning Gradient Descent: Better Generalization and Longer Horizons

Optimization Algorithms

Gradient Normalization & Depth Based Decay For Deep Learning

Neumann Optimizer: A Practical Optimization Algorithm for Deep Neural Networks

Optimization for Deep Learning Highlights in 2017

Gradients explode - Deep Networks are shallow - ResNet explained

A Sufficient Condition for Convergences of Adam and RMSProp


Adam: A Method for Stochastic Optimization

Fixing Weight Decay Regularization in Adam

On the Convergence of Adam and Beyond

Tensor Methods

Tensorizing Neural Networks

Tensor methods for training neural networks


DisturbLabel: Regularizing CNN on the Loss Layer

Robust Convolutional Neural Networks under Adversarial Noise

Adding Gradient Noise Improves Learning for Very Deep Networks

Stochastic Function Norm Regularization of Deep Networks

SoftTarget Regularization: An Effective Technique to Reduce Over-Fitting in Neural Networks

Regularizing neural networks by penalizing confident predictions

Automatic Node Selection for Deep Neural Networks using Group Lasso Regularization

Regularization in deep learning

LDMNet: Low Dimensional Manifold Regularized Neural Networks

Learning Sparse Neural Networks through L0 Regularization

Regularization and Optimization strategies in Deep Convolutional Neural Network

Regularizing Deep Networks by Modeling and Predicting Label Structure

Adversarial Noise Layer: Regularize Neural Network By Adding Noise

Deep Bilevel Learning

Can We Gain More from Orthogonality Regularizations in Training Deep CNNs?

Gradient-Coherent Strong Regularization for Deep Neural Networks


Improving neural networks by preventing co-adaptation of feature detectors

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Fast dropout training

Dropout as data augmentation

A Theoretically Grounded Application of Dropout in Recurrent Neural Networks

Improved Dropout for Shallow and Deep Learning

Dropout Regularization in Deep Learning Models With Keras

Dropout with Expectation-linear Regularization

Dropout with Theano

Information Dropout: learning optimal representations through noise

Recent Developments in Dropout

Generalized Dropout

Analysis of Dropout

Variational Dropout Sparsifies Deep Neural Networks

Learning Deep Networks from Noisy Labels with Dropout Regularization

Concrete Dropout

Analysis of dropout learning regarded as ensemble learning

An Analysis of Dropout for Matrix Factorization

Analysis of Dropout in Online Learning

Regularization of Deep Neural Networks with Spectral Dropout

Data Dropout in Arbitrary Basis for Deep Network Regularization

A New Angle on L2 Regularization

Dropout is a special case of the stochastic delta rule: faster and more accurate deep learning

Data Dropout: Optimizing Training Data for Convolutional Neural Networks

DropFilter: Dropout for Convolutions

DropFilter: A Novel Regularization Method for Learning Convolutional Neural Networks

Targeted Dropout


Regularization of Neural Networks using DropConnect

Regularizing neural networks with dropout and with DropConnect


DropNeuron: Simplifying the Structure of Deep Neural Networks


DropBlock: A regularization method for convolutional networks


Maxout Networks

Improving Deep Neural Networks with Probabilistic Maxout Units


Swapout: Learning an ensemble of deep architectures


Whiteout: Gaussian Adaptive Regularization Noise in Deep Neural Networks

ShakeDrop regularization

Shakeout: A New Approach to Regularized Deep Neural Network Training

Gradient Descent

RMSProp: Divide the gradient by a running average of its recent magnitude

Fitting a model via closed-form equations vs. Gradient Descent vs. Stochastic Gradient Descent vs. Mini-Batch Learning. What is the difference? (Normal Equations vs. GD vs. SGD vs. MB-GD)

An Introduction to Gradient Descent in Python

Train faster, generalize better: Stability of stochastic gradient descent

A Variational Analysis of Stochastic Gradient Algorithms

The vanishing gradient problem: Oh no — an obstacle to deep learning!

Gradient Descent For Machine Learning

Revisiting Distributed Synchronous SGD

Convergence rate of gradient descent

A Robust Adaptive Stochastic Gradient Method for Deep Learning

  • intro: IJCNN 2017 accepted paper; an extension of the paper “ADASECANT: Robust Adaptive Secant Method for Stochastic Gradient”
  • intro: Universite de Montreal & University of Oxford

Accelerating Stochastic Gradient Descent

Gentle Introduction to the Adam Optimization Algorithm for Deep Learning

Understanding Generalization and Stochastic Gradient Descent

A Bayesian Perspective on Generalization and Stochastic Gradient Descent

Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent

Improving Generalization Performance by Switching from Adam to SGD

Laplacian Smoothing Gradient Descent


Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

ADADELTA: An Adaptive Learning Rate Method


On the importance of initialization and momentum in deep learning

YellowFin and the Art of Momentum Tuning


Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks

Top-down Neural Attention by Excitation Backprop

Towards a Biologically Plausible Backprop

Sampled Backpropagation: Training Deep and Wide Neural Networks on Large Scale, User Generated Content Using Label Sampling

The Reversible Residual Network: Backpropagation Without Storing Activations

meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting

Accelerate Training

Neural Networks with Few Multiplications

Acceleration of Deep Neural Network Training with Resistive Cross-Point Devices

Deep Q-Networks for Accelerating the Training of Deep Neural Networks

Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs


One weird trick for parallelizing convolutional neural networks

8-Bit Approximations for Parallelism in Deep Learning (ICLR 2016)

Handling Datasets

Data Augmentation

DataAugmentation ver1.0: Image data augmentation tool for training of image recognition algorithm

Caffe-Data-Augmentation: a branch of Caffe with a Data Augmentation feature that uses a configurable stochastic combination of 7 data augmentation techniques

Image Augmentation for Deep Learning With Keras

What you need to know about data augmentation for machine learning

HZPROC: torch data augmentation toolbox (supports affine transform)

AGA: Attribute Guided Augmentation

Accelerating Deep Learning with Multiprocess Image Augmentation in Keras

Comprehensive Data Augmentation and Sampling for Pytorch

Image augmentation for machine learning experiments.

Google/inception’s data augmentation: scale and aspect ratio augmentation

Caffe Augmentation Extension

Improving Deep Learning using Generic Data Augmentation

Augmentor: An Image Augmentation Library for Machine Learning

Automatic Dataset Augmentation

Learning to Compose Domain-Specific Transformations for Data Augmentation

Data Augmentation in Classification using GAN

Data Augmentation Generative Adversarial Networks

Random Erasing Data Augmentation

Context Augmentation for Convolutional Neural Networks

The Effectiveness of Data Augmentation in Image Classification using Deep Learning

MentorNet: Regularizing Very Deep Neural Networks on Corrupted Labels

mixup: Beyond Empirical Risk Minimization

mixup: Data-Dependent Data Augmentation

Data Augmentation by Pairing Samples for Images Classification

Feature Space Transfer for Data Augmentation

Visual Data Augmentation through Learning

BAGAN: Data Augmentation with Balancing GAN

Parallel Grid Pooling for Data Augmentation

AutoAugment: Learning Augmentation Policies from Data

Improved Mixed-Example Data Augmentation

Data augmentation instead of explicit regularization

Data Augmentation using Random Image Cropping and Patching for Deep CNNs

GANsfer Learning: Combining labelled and unlabelled data for GAN based data augmentation

Adversarial Learning of General Transformations for Data Augmentation

Implicit Semantic Data Augmentation for Deep Networks

Data Augmentation Revisited: Rethinking the Distribution Gap between Clean and Augmented Data

AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty

GridMask Data Augmentation

On Feature Normalization and Data Augmentation

DADA: Differentiable Automatic Data Augmentation

Negative Data Augmentation

Imbalanced Datasets

Investigation on handling Structured & Imbalanced Datasets with Deep Learning

A systematic study of the class imbalance problem in convolutional neural networks

Class Rectification Hard Mining for Imbalanced Deep Learning

Bridging the Gap: Simultaneous Fine Tuning for Data Re-Balancing

Imbalanced Deep Learning by Minority Class Incremental Rectification

Pseudo-Feature Generation for Imbalanced Data Analysis in Deep Learning

Max-margin Class Imbalanced Learning with Gaussian Affinity

Dynamic Curriculum Learning for Imbalanced Data Classification

Noisy / Unlabelled Data

Data Distillation: Towards Omni-Supervised Learning

Learning From Noisy Singly-labeled Data

Low Numerical Precision

Training deep neural networks with low precision multiplications

Deep Learning with Limited Numerical Precision

BinaryConnect: Training Deep Neural Networks with binary weights during propagations

Binarized Neural Networks

BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

Distributed Training

Large Scale Distributed Systems for Training Neural Networks

Large Scale Distributed Deep Networks

Implementation of a Practical Distributed Calculation System with Browsers and JavaScript, and Application to Distributed Deep Learning

SparkNet: Training Deep Networks in Spark

A Scalable Implementation of Deep Learning on Spark

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

Distributed Supervised Learning using Neural Networks

Distributed Training of Deep Neuronal Networks: Theoretical and Practical Limits of Parallel Scalability

How to scale distributed deep learning?

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

  • intro: ICLR 2018
  • intro: Tsinghua University & Stanford University
  • comments: we find 99.9% of the gradient exchange in distributed SGD is redundant; we reduce the communication bandwidth by two orders of magnitude without losing accuracy
  • keywords: momentum correction, local gradient clipping, momentum factor masking, and warm-up training

Distributed learning of CNNs on heterogeneous CPU/GPU architectures

Integrated Model and Data Parallelism in Training Neural Networks

RedSync : Reducing Synchronization Traffic for Distributed Deep Learning


Theano-MPI: a Theano-based Distributed Training Framework

CaffeOnSpark: Open Sourced for Distributed Deep Learning on Big Data Clusters

Tunnel: Data Driven Framework for Distributed Computing in Torch 7

Distributed deep learning with Keras and Apache Spark

BigDL: Distributed Deep learning Library for Apache Spark


Distributed TensorFlow on Spark: Scaling Google’s Deep Learning Library (Spark Summit)

Deep Recurrent Neural Networks for Sequence Learning in Spark (Spark Summit)

Distributed deep learning on Spark


Distributed Deep Learning Reads

Hadoop, Spark, Deep Learning Mesh on Single GPU Cluster

The Unreasonable Effectiveness of Deep Learning on Spark

Distributed Deep Learning with Caffe Using a MapR Cluster

Deep Learning with Apache Spark and TensorFlow

Deeplearning4j on Spark

Distributed Deep Learning, Part 1: An Introduction to Distributed Training of Neural Networks

GPU Acceleration in Databricks: Speeding Up Deep Learning on Apache Spark

Distributed Deep Learning with Apache Spark and Keras

Adversarial Training

Learning from Simulated and Unsupervised Images through Adversarial Training

The Robust Manifold Defense: Adversarial Training using Generative Models

DeepDefense: Training Deep Neural Networks with Improved Robustness

Gradient Adversarial Training of Neural Networks

Gray-box Adversarial Training

Universal Adversarial Training

MEAL: Multi-Model Ensemble via Adversarial Learning

Regularized Ensembles and Transferability in Adversarial Learning

Feature denoising for improving adversarial robustness

Second Rethinking of Network Pruning in the Adversarial Setting

Interpreting Adversarially Trained Convolutional Neural Networks

On Stabilizing Generative Adversarial Training with Noise

Adversarial Learning with Margin-based Triplet Embedding Regularization

Bag of Tricks for Adversarial Training

Low-Precision Training

Mixed Precision Training

High-Accuracy Low-Precision Training

Incremental Training

ClickBAIT: Click-based Accelerated Incremental Training of Convolutional Neural Networks

ClickBAIT-v2: Training an Object Detector in Real-Time

Class-incremental Learning via Deep Model Consolidation


Understanding the difficulty of training deep feed forward neural networks

Domain-Adversarial Training of Neural Networks

Scalable and Sustainable Deep Learning via Randomized Hashing

Training Deep Nets with Sublinear Memory Cost

Improving the Robustness of Deep Neural Networks via Stability Training

Faster Training of Very Deep Networks Via p-Norm Gates

Fast Training of Convolutional Neural Networks via Kernel Rescaling

FreezeOut: Accelerate Training by Progressively Freezing Layers

Normalized Gradient with Adaptive Stepsize Method for Deep Neural Network Training

Image Quality Assessment Guided Deep Neural Networks Training

An Effective Training Method For Deep Convolutional Neural Network

On the Importance of Consistency in Training Deep Neural Networks

Solving internal covariate shift in deep learning with linked neurons


pastalog: Simple, realtime visualization of neural network training performance

torch-pastalog: A Torch interface for pastalog - simple, realtime visualization of neural network training performance


Important nuances to train deep learning models

Train your deep model faster and sharper — two novel techniques