Training Deep Neural Networks

Published: 09 Oct 2015 Category: deep_learning

Tutorials

Popular Training Approaches of DNNs — A Quick Overview

https://medium.com/@asjad/popular-training-approaches-of-dnns-a-quick-overview-26ee37ad7e96#.pqyo039bb

Optimisation and training techniques for deep learning

https://blog.acolyer.org/2017/03/01/optimisation-and-training-techniques-for-deep-learning/

Papers

SNIPER: Efficient Multi-Scale Training

https://arxiv.org/abs/1805.09300

Activation functions

ReLU

Rectified linear units improve restricted boltzmann machines

Expressiveness of Rectifier Networks

How can a deep neural network with ReLU activations in its hidden layers approximate any function?

Understanding Deep Neural Networks with Rectified Linear Units

Learning ReLUs via Gradient Descent

https://arxiv.org/abs/1705.04591

Training Better CNNs Requires to Rethink ReLU

https://arxiv.org/abs/1709.06247

Deep Learning using Rectified Linear Units (ReLU)

LReLU

Rectifier Nonlinearities Improve Neural Network Acoustic Models

Deep Sparse Rectifier Neural Networks

PReLU

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Empirical Evaluation of Rectified Activations in Convolutional Network

SReLU

Deep Learning with S-shaped Rectified Linear Activation Units

Parametric Activation Pools greatly increase performance and consistency in ConvNets

From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification

Revise Saturated Activation Functions

Noisy Activation Functions

MBA

Multi-Bias Non-linear Activation in Deep Neural Networks

Learning activation functions from data using cubic spline interpolation

What is the role of the activation function in a neural network?

Concatenated ReLU (CReLU)

Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units

Implement CReLU (Concatenated ReLU)

GELU

Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units

Formulating The ReLU

Activation Ensembles for Deep Neural Networks

https://arxiv.org/abs/1702.07790

SELU

Self-Normalizing Neural Networks

SELUs (scaled exponential linear units) - Visualized and Histogramed Comparisons among ReLU and Leaky ReLU

https://github.com/shaohua0116/Activation-Visualization-Histogram

Difference Between Softmax Function and Sigmoid Function

http://dataaspirant.com/2017/03/07/difference-between-softmax-function-and-sigmoid-function/

Flexible Rectified Linear Units for Improving Convolutional Neural Networks

Be Careful What You Backpropagate: A Case For Linear Output Activations & Gradient Boosting

EraseReLU

EraseReLU: A Simple Way to Ease the Training of Deep Convolution Neural Networks

https://arxiv.org/abs/1709.07634

Swish

Swish: a Self-Gated Activation Function

Searching for Activation Functions

Deep Learning with Data Dependent Implicit Activation Function

https://arxiv.org/abs/1802.00168

Series on Initialization of Weights for DNN

Initialization Of Feedfoward Networks

Initialization Of Deep Feedfoward Networks

Initialization Of Deep Networks Case of Rectifiers

Weights Initialization

An Explanation of Xavier Initialization

Random Walk Initialization for Training Very Deep Feedforward Networks

Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy?

All you need is a good init

All You Need is Beyond a Good Init: Exploring Better Solution for Training Extremely Deep Convolutional Neural Networks with Orthonormality and Modulation

Data-dependent Initializations of Convolutional Neural Networks

What are good initial weights in a neural network?

RandomOut: Using a convolutional gradient norm to win The Filter Lottery

Categorical Reparameterization with Gumbel-Softmax

On weight initialization in deep neural networks

Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

Batch Normalization

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

Normalization Propagation: A Parametric Technique for Removing Internal Covariate Shift in Deep Networks

Revisiting Batch Normalization For Practical Domain Adaptation

Implementing Batch Normalization in Tensorflow

Exploring Normalization in Deep Residual Networks with Concatenated Rectified Linear Units

Batch Renormalization: Towards Reducing Minibatch Dependence in Batch-Normalized Models

Comparison of Batch Normalization and Weight Normalization Algorithms for the Large-scale Image Classification

https://arxiv.org/abs/1709.08145

In-Place Activated BatchNorm for Memory-Optimized Training of DNNs

Batch Kalman Normalization: Towards Training Deep Neural Networks with Micro-Batches

https://arxiv.org/abs/1802.03133

Decorrelated Batch Normalization

Understanding Batch Normalization

https://arxiv.org/abs/1806.02375

Implementing Synchronized Multi-GPU Batch Normalization

http://hangzh.com/PyTorch-Encoding/notes/syncbn.html

Backward pass of BN

Understanding the backward pass through Batch Normalization Layer

https://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html

Deriving the Gradient for the Backward Pass of Batch Normalization

https://kevinzakka.github.io/2016/09/14/batch_normalization/

What does the gradient flowing through batch normalization looks like ?

http://cthorey.github.io./backpropagation/

Layer Normalization

Layer Normalization

Keras GRU with Layer Normalization

Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks

Differentiable Learning-to-Normalize via Switchable Normalization

Group Normalization

Group Normalization

Batch-Instance Normalization

Batch-Instance Normalization for Adaptively Style-Invariant Neural Networks

https://arxiv.org/abs/1805.07925

Loss Function

The Loss Surfaces of Multilayer Networks

Direct Loss Minimization for Training Deep Neural Nets

Nonconvex Loss Functions for Classifiers and Deep Networks

Learning Deep Embeddings with Histogram Loss

Large-Margin Softmax Loss for Convolutional Neural Networks

An empirical analysis of the optimization of deep network loss surfaces

https://arxiv.org/abs/1612.04010

Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes

Hierarchical Softmax

http://building-babylon.net/2017/08/01/hierarchical-softmax/

Noisy Softmax: Improving the Generalization Ability of DCNN via Postponing the Early Softmax Saturation

DropMax: Adaptive Stochastic Softmax

Rethinking Feature Distribution for Loss Functions in Image Classification

Ensemble Soft-Margin Softmax Loss for Image Classification

Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels

Learning Rate

No More Pesky Learning Rates

Coupling Adaptive Batch Sizes with Learning Rates

Super-Convergence: Very Fast Training of Residual Networks Using Large Learning Rates

https://arxiv.org/abs/1708.07120

Improving the way we work with learning rate.

https://medium.com/@bushaev/improving-the-way-we-work-with-learning-rate-5e99554f163b

WNGrad: Learn the Learning Rate in Gradient Descent

Convolution Filters

Non-linear Convolution Filters for CNN-based Learning

Pooling

Stochastic Pooling for Regularization of Deep Convolutional Neural Networks

Multi-scale Orderless Pooling of Deep Convolutional Activation Features

Fractional Max-Pooling

TI-POOLING: transformation-invariant pooling for feature learning in Convolutional Neural Networks

S3Pool: Pooling with Stochastic Spatial Sampling

Inductive Bias of Deep Convolutional Networks through Pooling Geometry

Improved Bilinear Pooling with CNNs

https://arxiv.org/abs/1707.06772

Learning Bag-of-Features Pooling for Deep Convolutional Neural Networks

A new kind of pooling layer for faster and sharper convergence

Statistically Motivated Second Order Pooling

https://arxiv.org/abs/1801.07492

Detail-Preserving Pooling in Deep Networks

Mini-Batch

Online Batch Selection for Faster Training of Neural Networks

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Scaling SGD Batch Size to 32K for ImageNet Training

https://arxiv.org/abs/1708.03888

ImageNet Training in 24 Minutes

https://arxiv.org/abs/1709.05011

Don’t Decay the Learning Rate, Increase the Batch Size

Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks

Hessian-based Analysis of Large Batch Training and Robustness to Adversaries

Revisiting Small Batch Training for Deep Neural Networks

https://arxiv.org/abs/1804.07612

Optimization Methods

On Optimization Methods for Deep Learning

Invariant backpropagation: how to train a transformation-invariant neural network

A practical theory for designing very deep convolutional neural networks

Stochastic Optimization Techniques

Alec Radford’s animations for optimization algorithms

http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html

Faster Asynchronous SGD (FASGD)

An overview of gradient descent optimization algorithms (★★★★★)

Exploiting the Structure: Stochastic Gradient Methods Using Raw Clusters

Writing fast asynchronous SGD/AdaGrad with RcppParallel

Quick Explanations Of Optimization Methods

Learning to learn by gradient descent by gradient descent

SGDR: Stochastic Gradient Descent with Restarts

The zen of gradient descent

Big Batch SGD: Automated Inference using Adaptive Batch Sizes

Improving Stochastic Gradient Descent with Feedback

Learning Gradient Descent: Better Generalization and Longer Horizons

Optimization Algorithms

Gradient Normalization & Depth Based Decay For Deep Learning

Neumann Optimizer: A Practical Optimization Algorithm for Deep Neural Networks

Optimization for Deep Learning Highlights in 2017

http://ruder.io/deep-learning-optimization-2017/index.html

Gradients explode - Deep Networks are shallow - ResNet explained

Adam

Adam: A Method for Stochastic Optimization

Fixing Weight Decay Regularization in Adam

On the Convergence of Adam and Beyond

Tensor Methods

Tensorizing Neural Networks

Tensor methods for training neural networks

Regularization

DisturbLabel: Regularizing CNN on the Loss Layer

Robust Convolutional Neural Networks under Adversarial Noise

Adding Gradient Noise Improves Learning for Very Deep Networks

Stochastic Function Norm Regularization of Deep Networks

SoftTarget Regularization: An Effective Technique to Reduce Over-Fitting in Neural Networks

Regularizing neural networks by penalizing confident predictions

Automatic Node Selection for Deep Neural Networks using Group Lasso Regularization

Regularization in deep learning

LDMNet: Low Dimensional Manifold Regularized Neural Networks

https://arxiv.org/abs/1711.06246

Learning Sparse Neural Networks through L0 Regularization

Regularization and Optimization strategies in Deep Convolutional Neural Network

https://arxiv.org/abs/1712.04711

Regularizing Deep Networks by Modeling and Predicting Label Structure

Adversarial Noise Layer: Regularize Neural Network By Adding Noise

Dropout

Improving neural networks by preventing co-adaptation of feature detectors

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Fast dropout training

Dropout as data augmentation

A Theoretically Grounded Application of Dropout in Recurrent Neural Networks

Improved Dropout for Shallow and Deep Learning

Dropout Regularization in Deep Learning Models With Keras

Dropout with Expectation-linear Regularization

Dropout with Theano

Information Dropout: learning optimal representations through noise

Recent Developments in Dropout

Generalized Dropout

Analysis of Dropout

Variational Dropout Sparsifies Deep Neural Networks

Learning Deep Networks from Noisy Labels with Dropout Regularization

Concrete Dropout

Analysis of dropout learning regarded as ensemble learning

An Analysis of Dropout for Matrix Factorization

https://arxiv.org/abs/1710.03487

Analysis of Dropout in Online Learning

https://arxiv.org/abs/1711.03343

Regularization of Deep Neural Networks with Spectral Dropout

https://arxiv.org/abs/1711.08591

Data Dropout in Arbitrary Basis for Deep Network Regularization

https://arxiv.org/abs/1712.00891

A New Angle on L2 Regularization

DropConnect

Regularization of Neural Networks using DropConnect

Regularizing neural networks with dropout and with DropConnect

DropNeuron

DropNeuron: Simplifying the Structure of Deep Neural Networks

Maxout

Maxout Networks

Improving Deep Neural Networks with Probabilistic Maxout Units

Swapout

Swapout: Learning an ensemble of deep architectures

Whiteout

Whiteout: Gaussian Adaptive Regularization Noise in Deep Neural Networks

ShakeDrop regularization

https://arxiv.org/abs/1802.02375

Gradient Descent

RMSProp: Divide the gradient by a running average of its recent magnitude

Fitting a model via closed-form equations vs. Gradient Descent vs. Stochastic Gradient Descent vs. Mini-Batch Learning. What is the difference? (Normal Equations vs. GD vs. SGD vs. MB-GD)

http://sebastianraschka.com/faq/docs/closed-form-vs-gd.html

An Introduction to Gradient Descent in Python

Train faster, generalize better: Stability of stochastic gradient descent

A Variational Analysis of Stochastic Gradient Algorithms

The vanishing gradient problem: Oh no — an obstacle to deep learning!

Gradient Descent For Machine Learning

Revisiting Distributed Synchronous SGD

Convergence rate of gradient descent

A Robust Adaptive Stochastic Gradient Method for Deep Learning

  • intro: IJCNN 2017 Accepted Paper, An extension of paper, “ADASECANT: Robust Adaptive Secant Method for Stochastic Gradient”
  • intro: Universite de Montreal & University of Oxford
  • arxiv: https://arxiv.org/abs/1703.00788

Accelerating Stochastic Gradient Descent

https://arxiv.org/abs/1704.08227

Gentle Introduction to the Adam Optimization Algorithm for Deep Learning

Understanding Generalization and Stochastic Gradient Descent

A Bayesian Perspective on Generalization and Stochastic Gradient Descent

Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent

Improving Generalization Performance by Switching from Adam to SGD

https://arxiv.org/abs/1712.07628

Laplacian Smoothing Gradient Descent

AdaGrad

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

ADADELTA: An Adaptive Learning Rate Method

Momentum

On the importance of initialization and momentum in deep learning

YellowFin and the Art of Momentum Tuning

Backpropagation

Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks

Top-down Neural Attention by Excitation Backprop

Towards a Biologically Plausible Backprop

Sampled Backpropagation: Training Deep and Wide Neural Networks on Large Scale, User Generated Content Using Label Sampling

The Reversible Residual Network: Backpropagation Without Storing Activations

meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting

Accelerate Training

Neural Networks with Few Multiplications

Acceleration of Deep Neural Network Training with Resistive Cross-Point Devices

Deep Q-Networks for Accelerating the Training of Deep Neural Networks

Omnivore: An Optimizer for Multi-device Deep Learning on CPUs and GPUs

Parallelism

One weird trick for parallelizing convolutional neural networks

8-Bit Approximations for Parallelism in Deep Learning (ICLR 2016)

Handling Datasets

Data Augmentation

DataAugmentation ver1.0: Image data augmentation tool for training of image recognition algorithm

Caffe-Data-Augmentation: a branch of Caffe with a Data Augmentation feature using a configurable stochastic combination of 7 data augmentation techniques

Image Augmentation for Deep Learning With Keras

What you need to know about data augmentation for machine learning

HZPROC: torch data augmentation toolbox (supports affine transform)

AGA: Attribute Guided Augmentation

Accelerating Deep Learning with Multiprocess Image Augmentation in Keras

Comprehensive Data Augmentation and Sampling for Pytorch

Image augmentation for machine learning experiments.

https://github.com/aleju/imgaug

Google/inception’s data augmentation: scale and aspect ratio augmentation

https://github.com/facebook/fb.resnet.torch/blob/master/datasets/transforms.lua#L130

Caffe Augmentation Extension

Improving Deep Learning using Generic Data Augmentation

Augmentor: An Image Augmentation Library for Machine Learning

Learning to Compose Domain-Specific Transformations for Data Augmentation

https://arxiv.org/abs/1709.01643

Data Augmentation in Classification using GAN

https://arxiv.org/abs/1711.00648

Data Augmentation Generative Adversarial Networks

https://arxiv.org/abs/1711.04340

Random Erasing Data Augmentation

Context Augmentation for Convolutional Neural Networks

https://arxiv.org/abs/1712.01653

The Effectiveness of Data Augmentation in Image Classification using Deep Learning

https://arxiv.org/abs/1712.04621

MentorNet: Regularizing Very Deep Neural Networks on Corrupted Labels

mixup: Beyond Empirical Risk Minimization

mixup: Data-Dependent Data Augmentation

http://www.inference.vc/mixup-data-dependent-data-augmentation/

Data Augmentation by Pairing Samples for Images Classification

Feature Space Transfer for Data Augmentation

Visual Data Augmentation through Learning

https://arxiv.org/abs/1801.06665

BAGAN: Data Augmentation with Balancing GAN

https://arxiv.org/abs/1803.09655

Parallel Grid Pooling for Data Augmentation

AutoAugment: Learning Augmentation Policies from Data

Improved Mixed-Example Data Augmentation

https://arxiv.org/abs/1805.11272

Data augmentation instead of explicit regularization

https://arxiv.org/abs/1806.03852

Imbalanced Datasets

Investigation on handling Structured & Imbalanced Datasets with Deep Learning

A systematic study of the class imbalance problem in convolutional neural networks

Class Rectification Hard Mining for Imbalanced Deep Learning

https://arxiv.org/abs/1712.03162

Bridging the Gap: Simultaneous Fine Tuning for Data Re-Balancing

Noisy / Unlabelled Data

Data Distillation: Towards Omni-Supervised Learning

Learning From Noisy Singly-labeled Data

Low Numerical Precision

Training deep neural networks with low precision multiplications

Deep Learning with Limited Numerical Precision

BinaryConnect: Training Deep Neural Networks with binary weights during propagations

Binarized Neural Networks

BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1

Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations

Distributed Training

Large Scale Distributed Systems for Training Neural Networks

Large Scale Distributed Deep Networks

Implementation of a Practical Distributed Calculation System with Browsers and JavaScript, and Application to Distributed Deep Learning

SparkNet: Training Deep Networks in Spark

A Scalable Implementation of Deep Learning on Spark

TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems

Distributed Supervised Learning using Neural Networks

Distributed Training of Deep Neuronal Networks: Theoretical and Practical Limits of Parallel Scalability

How to scale distributed deep learning?

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

  • intro: ICLR 2018
  • intro: Tsinghua University & Stanford University
  • comments: we find 99.9% of the gradient exchange in distributed SGD is redundant; we reduce the communication bandwidth by two orders of magnitude without losing accuracy
  • keywords: momentum correction, local gradient clipping, momentum factor masking, and warm-up training
  • arxiv: https://arxiv.org/abs/1712.01887

Distributed learning of CNNs on heterogeneous CPU/GPU architectures

https://arxiv.org/abs/1712.02546

Integrated Model and Data Parallelism in Training Neural Networks

Projects

Theano-MPI: a Theano-based Distributed Training Framework

CaffeOnSpark: Open Sourced for Distributed Deep Learning on Big Data Clusters

Tunnel: Data Driven Framework for Distributed Computing in Torch 7

Distributed deep learning with Keras and Apache Spark

BigDL: Distributed Deep learning Library for Apache Spark

Videos

A Scalable Implementation of Deep Learning on Spark

Distributed TensorFlow on Spark: Scaling Google’s Deep Learning Library (Spark Summit)

Deep Recurrent Neural Networks for Sequence Learning in Spark (Spark Summit)

Distributed deep learning on Spark

Blogs

Distributed Deep Learning Reads

https://github.com/tmulc18/DistributedDeepLearningReads

Hadoop, Spark, Deep Learning Mesh on Single GPU Cluster

http://www.nextplatform.com/2016/02/24/hadoop-spark-deep-learning-mesh-on-single-gpu-cluster/

The Unreasonable Effectiveness of Deep Learning on Spark

https://databricks.com/blog/2016/04/01/unreasonable-effectiveness-of-deep-learning-on-spark.html

Distributed Deep Learning with Caffe Using a MapR Cluster

https://www.mapr.com/blog/distributed-deep-learning-caffe-using-mapr-cluster

Deep Learning with Apache Spark and TensorFlow

https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html

Deeplearning4j on Spark

http://deeplearning4j.org/spark

Distributed Deep Learning, Part 1: An Introduction to Distributed Training of Neural Networks

GPU Acceleration in Databricks: Speeding Up Deep Learning on Apache Spark

https://databricks.com/blog/2016/10/27/gpu-acceleration-in-databricks.html

Distributed Deep Learning with Apache Spark and Keras

https://db-blog.web.cern.ch/blog/joeri-hermans/2017-01-distributed-deep-learning-apache-spark-and-keras

Adversarial Training

Learning from Simulated and Unsupervised Images through Adversarial Training

The Robust Manifold Defense: Adversarial Training using Generative Models

https://arxiv.org/abs/1712.09196

DeepDefense: Training Deep Neural Networks with Improved Robustness

https://arxiv.org/abs/1803.00404

Gradient Adversarial Training of Neural Networks

Low-Precision Training

High-Accuracy Low-Precision Training

Incremental Training

ClickBAIT: Click-based Accelerated Incremental Training of Convolutional Neural Networks

ClickBAIT-v2: Training an Object Detector in Real-Time

https://arxiv.org/abs/1803.10358

Papers

Understanding the difficulty of training deep feedforward neural networks

Domain-Adversarial Training of Neural Networks

Scalable and Sustainable Deep Learning via Randomized Hashing

Training Deep Nets with Sublinear Memory Cost

Improving the Robustness of Deep Neural Networks via Stability Training

Faster Training of Very Deep Networks Via p-Norm Gates

Fast Training of Convolutional Neural Networks via Kernel Rescaling

FreezeOut: Accelerate Training by Progressively Freezing Layers

Normalized Gradient with Adaptive Stepsize Method for Deep Neural Network Training

Image Quality Assessment Guided Deep Neural Networks Training

https://arxiv.org/abs/1708.03880

An Effective Training Method For Deep Convolutional Neural Network

On the Importance of Consistency in Training Deep Neural Networks

Solving internal covariate shift in deep learning with linked neurons

Tools

pastalog: Simple, realtime visualization of neural network training performance

torch-pastalog: A Torch interface for pastalog - simple, realtime visualization of neural network training performance

Blogs

Important nuances to train deep learning models

http://www.erogol.com/important-nuances-train-deep-learning-models/

Train your deep model faster and sharper — two novel techniques

https://hackernoon.com/training-your-deep-model-faster-and-sharper-e85076c3b047