Video Applications

Published: 09 Oct 2015 Category: deep_learning

Papers

You Lead, We Exceed: Labor-Free Video Concept Learning by Jointly Exploiting Web Videos and Images

Video Fill in the Blank with Merging LSTMs

  • intro: for Large Scale Movie Description and Understanding Challenge (LSMDC) 2016, “Movie fill-in-the-blank” Challenge, UCF_CRCV
  • intro: Video-Fill-in-the-Blank (ViFitB)
  • arxiv: https://arxiv.org/abs/1610.04062

Video Pixel Networks

Robust Video Synchronization using Unsupervised Deep Learning

Video Propagation Networks

Video Frame Synthesis using Deep Voxel Flow

Optimizing Deep CNN-Based Queries over Video Streams at Scale

NoScope: 1000x Faster Deep Learning Queries over Video

http://dawn.cs.stanford.edu/2017/06/22/noscope/

Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos

ProcNets: Learning to Segment Procedures in Untrimmed and Unconstrained Videos

https://arxiv.org/abs/1703.09788

Unsupervised Learning Layers for Video Analysis

  • intro: Baidu Research
  • intro: “The experiments demonstrated the potential applications of UL layers and online learning algorithm to head orientation estimation and moving object localization”
  • arxiv: https://arxiv.org/abs/1705.08918

Look, Listen and Learn

Video Imagination from a Single Image with Transformation Generation

Learning to Learn from Noisy Web Videos

Convolutional Long Short-Term Memory Networks for Recognizing First Person Interactions

  • intro: Accepted at the Second International Workshop on Egocentric Perception, Interaction and Computing (EPIC) at the International Conference on Computer Vision (ICCV-17)
  • arxiv: https://arxiv.org/abs/1709.06495

Learning Binary Residual Representations for Domain-specific Video Streaming

Video Representation Learning Using Discriminative Pooling

Rethinking the Faster R-CNN Architecture for Temporal Action Localization

Deep Keyframe Detection in Human Action Videos

FFNet: Video Fast-Forwarding via Reinforcement Learning

Fast forwarding Egocentric Videos by Listening and Watching

https://arxiv.org/abs/1806.04620

Scanner: Efficient Video Analysis at Scale

Massively Parallel Video Networks

Object Level Visual Reasoning in Videos

Video Time: Properties, Encoders and Evaluation

Inserting Videos into Videos

Video Classification

Large-scale Video Classification with Convolutional Neural Networks

Exploiting Image-trained CNN Architectures for Unconstrained Video Classification

Beyond Short Snippets: Deep Networks for Video Classification

Modeling Spatial-Temporal Clues in a Hybrid Deep Learning Framework for Video Classification

Video Content Recognition with Deep Learning

Efficient Large Scale Video Classification

Fusing Multi-Stream Deep Networks for Video Classification

Learning End-to-end Video Classification with Rank-Pooling

Deep Learning for Video Classification and Captioning

Fast Video Classification via Adaptive Cascading of Deep Models

Deep Feature Flow for Video Recognition

Large-Scale YouTube-8M Video Understanding with Deep Neural Networks

https://arxiv.org/abs/1706.04488

Deep Learning Methods for Efficient Large Scale Video Labeling

Learnable pooling with Context Gating for video classification

Aggregating Frame-level Features for Large-Scale Video Classification

Tensor-Train Recurrent Neural Networks for Video Classification

https://arxiv.org/abs/1707.01786

Hierarchical Deep Recurrent Architecture for Video Understanding

Large-scale Video Classification guided by Batch Normalized LSTM Translator

UTS submission to Google YouTube-8M Challenge 2017

A spatiotemporal model with visual attention for video classification

https://arxiv.org/abs/1707.02069

Cultivating DNN Diversity for Large Scale Video Labelling

Attention Transfer from Web Images for Video Recognition

Non-local Neural Networks

Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification

https://arxiv.org/abs/1711.08200

Appearance-and-Relation Networks for Video Classification

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

Long Activity Video Understanding using Functional Object-Oriented Network

https://arxiv.org/abs/1807.00983

Deep Architectures and Ensembles for Semantic Video Classification

https://arxiv.org/abs/1807.01026

Deep Discriminative Model for Video Classification

Deep Video Color Propagation

Non-local NetVLAD Encoding for Video Classification

Learnable Pooling Methods for Video Classification

NeXtVLAD: An Efficient Neural Network to Aggregate Frame-level Features for Large-scale Video Classification

High Order Neural Networks for Video Classification

SlowFast Networks for Video Recognition

Efficient Video Classification Using Fewer Frames

Video Classification with Channel-Separated Convolutional Networks

Two-Stream Video Classification with Cross-Modality Attention

https://arxiv.org/abs/1908.00497

Action Detection / Activity Recognition

3D convolutional neural networks for human action recognition

Sequential Deep Learning for Human Action Recognition

Two-stream convolutional networks for action recognition in videos

Finding action tubes

  • intro: “built action models from shape and motion cues. They start from the image proposals and select the motion salient subset of them and extract spatio-temporal features to represent the video using the CNNs.”
  • arxiv: http://arxiv.org/abs/1411.6031

Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition

Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors

Action Recognition by Hierarchical Mid-level Action Elements

Contextual Action Recognition with R*CNN

Towards Good Practices for Very Deep Two-Stream ConvNets

Action Recognition using Visual Attention

End-to-end Learning of Action Detection from Frame Glimpses in Videos

Multi-velocity neural networks for gesture recognition in videos

Active Learning for Online Recognition of Human Activities from Streaming Videos

Convolutional Two-Stream Network Fusion for Video Action Recognition

Deep, Convolutional, and Recurrent Models for Human Activity Recognition using Wearables

Unsupervised Semantic Action Discovery from Video Collections

Anticipating Visual Representations from Unlabeled Video

VideoLSTM Convolves, Attends and Flows for Action Recognition

Hierarchical Attention Network for Action Recognition in Videos (HAN)

Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition

Connectionist Temporal Modeling for Weakly Supervised Action Labeling

CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016

Actionness Estimation Using Hybrid FCNs

Real-time Action Recognition with Enhanced Motion Vector CNNs

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Temporal Segment Networks for Action Recognition in Videos

Hierarchical Attention Network for Action Recognition in Videos

DeepCAMP: Deep Convolutional Action & Attribute Mid-Level Patterns

Depth2Action: Exploring Embedded Depth for Large-Scale Action Recognition

Dynamic Image Networks for Action Recognition

Human Action Recognition without Human

Temporal Convolutional Networks: A Unified Approach to Action Segmentation

Temporal Activity Detection in Untrimmed Videos with Recurrent Neural Networks

Sequential Deep Trajectory Descriptor for Action Recognition with Three-stream CNN

Semi-Coupled Two-Stream Fusion ConvNets for Action Recognition at Extremely Low Resolutions

Spatiotemporal Residual Networks for Video Action Recognition

Action Recognition Based on Joint Trajectory Maps Using Convolutional Neural Networks

Deep Recurrent Neural Network for Mobile Human Activity Recognition with High Throughput

Joint Network based Attention for Action Recognition

Temporal Convolutional Networks for Action Segmentation and Detection

AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos

ActionFlowNet: Learning Motion Representation for Action Recognition

Higher-order Pooling of CNN Features via Kernel Linearization for Action Recognition

Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos

https://arxiv.org/abs/1703.10664

Temporal Action Detection with Structured Segment Networks

Recurrent Residual Learning for Action Recognition

https://arxiv.org/abs/1706.08807

Hierarchical Multi-scale Attention Networks for Action Recognition

https://arxiv.org/abs/1708.07590

Two-stream Flow-guided Convolutional Attention Networks for Action Recognition

Action Classification and Highlighting in Videos

https://arxiv.org/abs/1708.09522

Real-Time Action Detection in Video Surveillance using Sub-Action Descriptor with Multi-CNN

https://arxiv.org/abs/1710.03383

End-to-end Video-level Representation Learning for Action Recognition

Fully-Coupled Two-Stream Spatiotemporal Networks for Extremely Low Resolution Action Recognition

DiscrimNet: Semi-Supervised Action Recognition from Videos using Generative Adversarial Networks

https://arxiv.org/abs/1801.07230

A Fusion of Appearance based CNNs and Temporal evolution of Skeleton with LSTM for Daily Living Action Recognition

https://arxiv.org/abs/1802.00421

Real-Time End-to-End Action Detection with Two-Stream Networks

https://arxiv.org/abs/1802.08362

A Closer Look at Spatiotemporal Convolutions for Action Recognition

VideoCapsuleNet: A Simplified Network for Action Detection

https://arxiv.org/abs/1805.08162

Where and When to Look? Spatio-temporal Attention for Action Recognition in Videos

https://arxiv.org/abs/1810.04511

Relational Long Short-Term Memory for Video Action Recognition

https://arxiv.org/abs/1811.07059

Temporal Recurrent Networks for Online Action Detection

https://arxiv.org/abs/1811.07391

Video Action Transformer Network

D3D: Distilled 3D Networks for Video Action Recognition

TACNet: Transition-Aware Context Network for Spatio-Temporal Action Detection

Deformable Tube Network for Action Detection in Videos

https://arxiv.org/abs/1907.01847

You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization

https://arxiv.org/abs/1911.06644

Projects

A Torch Library for Action Recognition and Detection Using CNNs and LSTMs

2016 ActivityNet action recognition challenge. CNN + LSTM approach. Multi-threaded loading.

LSTM for Human Activity Recognition

Scanner: Efficient Video Analysis at Scale

Charades Starter Code for Activity Classification and Localization

NonLocalNetwork and Squeeze-Excitation Network

Event Recognition

TagBook: A Semantic Video Representation without Supervision for Event Detection

AENet: Learning Deep Audio Features for Video Analysis

Event Detection

DevNet: A Deep Event Network for Multimedia Event Detection and Evidence Recounting

Detecting events and key actors in multi-person videos

Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection

Efficient Action Detection in Untrimmed Videos via Multi-Task Learning

Joint Event Detection and Description in Continuous Video Streams

Abnormality / Anomaly Detection

Fully Convolutional Neural Network for Fast Anomaly Detection in Crowded Scenes

Anomaly Detection in Video Using Predictive Convolutional Long Short-Term Memory Networks

Abnormal Event Detection in Videos using Spatiotemporal Autoencoder

Abnormal Event Detection in Videos using Generative Adversarial Nets

Joint Detection and Recounting of Abnormal Events by Learning Deep Generic Knowledge

An overview of deep learning based methods for unsupervised and semi-supervised anomaly detection in videos

STAN: Spatio-Temporal Adversarial Networks for Abnormal Event Detection

Video Anomaly Detection and Localization via Gaussian Mixture Fully Convolutional Variational Autoencoder

https://arxiv.org/abs/1805.11223

Attentioned Convolutional LSTM InpaintingNetwork for Anomaly Detection in Videos

https://arxiv.org/abs/1811.10228

Video Prediction

Deep multi-scale video prediction beyond mean square error

Unsupervised Learning for Physical Interaction through Video Prediction

Generating Videos with Scene Dynamics

PredNet

Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning

Diversity encouraged learning of unsupervised LSTM ensemble for neural activity video prediction

Video Ladder Networks

Unsupervised Learning of Long-Term Motion Dynamics for Videos

One-Step Time-Dependent Future Video Frame Prediction with a Convolutional Encoder-Decoder Neural Network

Fully Context-Aware Video Prediction

Novel Video Prediction for Large-scale Scene using Optical Flow

Video Tagging

Automatic Image and Video Tagging

Tagging YouTube music videos with deep learning - Alexandre Passant

Shot Boundary Detection

Large-scale, Fast and Accurate Shot Boundary Detection through Spatio-temporal Convolutional Neural Networks

https://arxiv.org/abs/1705.03281

Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks

Video Action Segmentation

TricorNet: A Hybrid Temporal Convolutional and Recurrent Network for Video Action Segmentation

Video2GIF

Video2GIF: Automatic Generation of Animated GIFs from Video (Robust Deep RankNet)

Creating Animated GIFs Automatically from Video

https://yahooresearch.tumblr.com/post/148009705216/creating-animated-gifs-automatically-from-video

GIF2Video: Color Dequantization and Temporal Interpolation of GIF images

Video2Speech

Vid2speech: Speech Reconstruction from Silent Video

Video Captioning

http://handong1587.github.io/deep_learning/2015/10/09/image-video-captioning.html#video-captioning

Video Summarization

Video summarization produces a short summary of a full-length video that ideally encapsulates its most informative parts, easing video browsing, editing, and indexing.
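
As a rough illustration of the idea, the sketch below treats summarization as picking the most informative frames under a length budget. It assumes that a per-frame importance score is already available (for example, from one of the models listed below); the function name `summarize`, the `scores` input, and the `budget` parameter are hypothetical and not taken from any of the listed papers.

```python
from typing import List, Sequence


def summarize(scores: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` highest-scoring frames, in temporal order.

    `scores` holds one (assumed, precomputed) importance value per frame.
    """
    if budget <= 0:
        return []
    # Rank frame indices by descending score; ties keep the earlier frame first.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Keep the top `budget` frames and restore temporal order for playback.
    return sorted(ranked[:budget])


if __name__ == "__main__":
    frame_scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]  # made-up example scores
    print(summarize(frame_scores, 3))
```

Plain top-k selection ignores redundancy between neighbouring frames; published methods typically add diversity or representativeness terms on top of a score like this.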

Video Summarization with Long Short-term Memory

DeepVideo: Video Summarization using Temporal Sequence Modelling

Semantic Video Trailers

Video Summarization using Deep Semantic Features

CNN-Based Prediction of Frame-Level Shot Importance for Video Summarization

Video Summarization with Attention-Based Encoder-Decoder Networks

https://arxiv.org/abs/1708.09545

Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward

Viewpoint-aware Video Summarization

DTR-GAN: Dilated Temporal Relational Adversarial Network for Video Summarization

https://arxiv.org/abs/1804.11228

Learning Video Summarization Using Unpaired Data

https://arxiv.org/abs/1805.12174

Video Summarization Using Fully Convolutional Sequence Networks

https://arxiv.org/abs/1805.10538

Video Summarisation by Classification with Deep Reinforcement Learning

Query-Conditioned Three-Player Adversarial Network for Video Summarization

Discriminative Feature Learning for Unsupervised Video Summarization

Rethinking the Evaluation of Video Summaries

Video Highlight Detection

Unsupervised Extraction of Video Highlights Via Robust Recurrent Auto-encoders

  • intro: ICCV 2015
  • intro: relies on the assumption that highlights of an event category are more frequently captured in short videos than non-highlights
  • arxiv: http://arxiv.org/abs/1510.01442

Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization

Using Deep Learning to Find Basketball Highlights

Real-Time Video Highlights for Yahoo Esports

A Deep Ranking Model for Spatio-Temporal Highlight Detection from a 360 Video

PHD-GIFs: Personalized Highlight Detection for Automatic GIF Creation

Video Understanding

Scale Up Video Understanding with Deep Learning

Slicing Convolutional Neural Network for Crowd Video Understanding

Rethinking Spatiotemporal Feature Learning For Video Understanding

https://arxiv.org/abs/1712.04851

Hierarchical Video Understanding

https://arxiv.org/abs/1809.03316

Challenges

THUMOS Challenge 2014

THUMOS Challenge 2015

ActivityNet Challenge 2016