Image / Video Captioning

Published: 09 Oct 2015 Category: deep_learning

Papers

Im2Text: Describing Images Using 1 Million Captioned Photographs

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

Show and Tell

Show and Tell: A Neural Image Caption Generator

Image caption generation by CNN and LSTM

Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge
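
The Show and Tell papers above pair a CNN image encoder with an LSTM language model: the image embedding is fed to the LSTM as its first input, and the model is trained to maximize the likelihood of the reference caption. A minimal PyTorch sketch of that decoder, assuming precomputed CNN features (all sizes and names here are illustrative):

```python
import torch
import torch.nn as nn

class ShowAndTellDecoder(nn.Module):
    """CNN feature -> LSTM caption decoder, in the spirit of Show and Tell."""
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # image embedding
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # next-word logits

    def forward(self, feats, captions):
        # The image embedding acts as the first "word" of the sequence.
        img = self.img_proj(feats).unsqueeze(1)           # (B, 1, E)
        words = self.embed(captions[:, :-1])              # teacher forcing
        inputs = torch.cat([img, words], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                           # (B, T, V)

# Cross-entropy training step on dummy data:
model = ShowAndTellDecoder()
feats = torch.randn(4, 2048)              # stand-in for CNN features
caps = torch.randint(0, 10000, (4, 12))   # token ids, <bos>...<eos>
logits = model(feats, caps)
loss = nn.functional.cross_entropy(logits.reshape(-1, 10000), caps.reshape(-1))
```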

Learning a Recurrent Visual Representation for Image Caption Generation

Mind’s Eye: A Recurrent Visual Representation for Image Caption Generation

Deep Visual-Semantic Alignments for Generating Image Descriptions

Deep Captioning with Multimodal Recurrent Neural Networks

Show, Attend and Tell

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention (ICML 2015)
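
Show, Attend and Tell replaces the single global image vector with soft attention over a grid of spatial CNN features: at each decoding step the hidden state scores every location, and the decoder consumes the weighted average. A hedged sketch of the additive attention step, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Additive ('soft') attention over spatial CNN features."""
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, L, D) over L spatial locations; hidden: (B, H)
        e = self.score(torch.tanh(self.feat_proj(feats) +
                                  self.hid_proj(hidden).unsqueeze(1)))  # (B, L, 1)
        alpha = torch.softmax(e, dim=1)          # attention weights over locations
        context = (alpha * feats).sum(dim=1)     # (B, D) expected feature
        return context, alpha.squeeze(-1)

attn = SoftAttention()
feats = torch.randn(2, 196, 512)    # e.g. a 14x14 conv grid, flattened
hidden = torch.randn(2, 512)
context, alpha = attn(feats, hidden)  # context feeds the next LSTM step
```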

Automatically describing historic photographs

Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images

What value do explicit high level concepts have in vision to language problems?

Aligning where to see and what to tell: image caption with region-based attention and scene factorization

Learning FRAME Models Using CNN Filters for Knowledge Visualization (CVPR 2015)

Generating Images from Captions with Attention

Order-Embeddings of Images and Language

DenseCap: Fully Convolutional Localization Networks for Dense Captioning

Expressing an Image Stream with a Sequence of Natural Sentences

Multimodal Pivots for Image Caption Translation

Image Captioning with Deep Bidirectional LSTMs

Encode, Review, and Decode: Reviewer Module for Caption Generation

Review Network for Caption Generation

Attention Correctness in Neural Image Captioning

Image Caption Generation with Text-Conditional Semantic Attention

DeepDiary: Automatic Caption Generation for Lifelogging Image Streams

phi-LSTM: A Phrase-based Hierarchical LSTM Model for Image Captioning

Captioning Images with Diverse Objects

Learning to generalize to new compositions in image understanding

Generating captions without looking beyond objects

SPICE: Semantic Propositional Image Caption Evaluation
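
Unlike BLEU or CIDEr, SPICE parses candidate and reference captions into scene graphs and scores the candidate by the F1 overlap of the extracted semantic tuples (objects, attributes, relations). A toy illustration of that final scoring step over hand-written tuples; the real metric's dependency parsing and WordNet synonym matching are omitted:

```python
def spice_f1(candidate_tuples, reference_tuples):
    """F1 over semantic proposition tuples, as in the SPICE metric."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hand-written tuples standing in for parsed scene graphs:
cand = [("dog",), ("dog", "brown"), ("dog", "on", "grass")]
ref = [("dog",), ("dog", "on", "grass"), ("grass", "green")]
print(spice_f1(cand, ref))  # 2 matches -> P=2/3, R=2/3, F1=2/3
```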

Boosting Image Captioning with Attributes

Bootstrap, Review, Decode: Using Out-of-Domain Textual Data to Improve Image Captioning

A Hierarchical Approach for Generating Descriptive Image Paragraphs

Dense Captioning with Joint Inference and Visual Context

Optimization of image description metrics using policy gradient methods

Areas of Attention for Image Captioning

Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning
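
The visual sentinel in this paper lets the decoder fall back on its language state instead of the image: a learned gate distills a sentinel vector from the LSTM memory, and an extended attention mixes it with the visual context. The core equations, as I understand them (notation may differ slightly from the paper):

```latex
\begin{aligned}
g_t &= \sigma(W_x x_t + W_h h_{t-1}) \\
s_t &= g_t \odot \tanh(m_t) \\
\hat{c}_t &= \beta_t \, s_t + (1 - \beta_t) \, c_t
\end{aligned}
```

Here $x_t$ is the LSTM input, $m_t$ the memory cell, $c_t$ the attended visual context, and $\beta_t \in [0,1]$ the sentinel gate; when $\beta_t$ is high the model generates from language statistics alone, i.e. it knows when not to look.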

Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering

Recurrent Highway Networks with Language CNN for Image Captioning

Top-down Visual Saliency Guided by Captions

MAT: A Multimodal Attentive Translator for Image Captioning

https://arxiv.org/abs/1702.05658

Deep Reinforcement Learning-based Image Captioning with Embedding Reward

Attend to You: Personalized Image Captioning with Context Sequence Memory Networks

Punny Captions: Witty Wordplay in Image Descriptions

https://arxiv.org/abs/1704.08224

Show, Adapt and Tell: Adversarial Training of Cross-domain Image Captioner

https://arxiv.org/abs/1705.00930

Actor-Critic Sequence Training for Image Captioning

  • intro: Queen Mary University of London & Yang’s Accounting Consultancy Ltd
  • keywords: actor-critic reinforcement learning
  • arxiv: https://arxiv.org/abs/1706.09601
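
Actor-critic sequence training treats the decoder as a policy: sampled captions are scored by a sentence-level metric such as CIDEr, and a learned critic provides a value baseline that reduces the variance of the policy gradient. A schematic sketch of the two losses, assuming sampled-token log-probs, critic values, and a scalar reward are already in hand (names are illustrative):

```python
import torch

def actor_critic_losses(log_probs, values, reward):
    """
    log_probs: (T,) log pi(w_t | ...) for the sampled caption
    values:    (T,) critic's value estimate at each step
    reward:    scalar sentence-level score, e.g. CIDEr of the sample
    """
    returns = torch.full_like(values, reward)      # reward arrives at the end
    advantage = returns - values.detach()          # critic as baseline
    actor_loss = -(advantage * log_probs).sum()    # policy gradient
    critic_loss = ((values - returns) ** 2).sum()  # regression to the return
    return actor_loss, critic_loss

log_probs = torch.randn(12, requires_grad=True)
values = torch.randn(12, requires_grad=True)
actor_loss, critic_loss = actor_critic_losses(log_probs, values, reward=0.83)
(actor_loss + critic_loss).backward()
```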

What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator?

Stack-Captioning: Coarse-to-Fine Learning for Image Captioning

https://arxiv.org/abs/1709.03376

Self-Guiding Multimodal LSTM - when we do not have a perfect training dataset for image captioning

https://arxiv.org/abs/1709.05038

Contrastive Learning for Image Captioning

Phrase-based Image Captioning with Hierarchical LSTM Model

Convolutional Image Captioning

https://arxiv.org/abs/1711.09151
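
Convolutional Image Captioning replaces the recurrent decoder with a stack of masked (causal) convolutions over the partial caption, so each position sees only earlier words and the model stays autoregressive while training in parallel. A minimal causal 1-D convolution sketch; the paper's gating and attention layers are omitted:

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at current and past positions."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1                # left-pad only
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                         # x: (B, C, T)
        x = nn.functional.pad(x, (self.pad, 0))   # no access to the future
        return self.conv(x)

embed = nn.Embedding(10000, 256)
layer = CausalConv1d(256)
tokens = torch.randint(0, 10000, (2, 12))
h = layer(embed(tokens).transpose(1, 2))          # (B, 256, 12)
# h[:, :, t] depends only on tokens[:, :t+1], keeping the decoder autoregressive.
```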

Show-and-Fool: Crafting Adversarial Examples for Neural Image Captioning

https://arxiv.org/abs/1712.02051

Improved Image Captioning with Adversarial Semantic Alignment

Object Counts! Bringing Explicit Detections Back into Image Captioning

Defoiling Foiled Image Captions

SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text

Improving Image Captioning with Conditional Generative Adversarial Nets

https://arxiv.org/abs/1805.07112

CNN+CNN: Convolutional Decoders for Image Captioning

https://arxiv.org/abs/1805.09019

Diverse and Controllable Image Captioning with Part-of-Speech Guidance

https://arxiv.org/abs/1805.12589

Learning to Evaluate Image Captioning

Topic-Guided Attention for Image Captioning

Context-Aware Visual Policy Network for Sequence-Level Image Captioning

Exploring Visual Relationship for Image Captioning

Boosted Attention: Leveraging Human Attention for Image Captioning

Image Captioning as Neural Machine Translation Task in SOCKEYE

https://arxiv.org/abs/1810.04101

Unsupervised Image Captioning

https://arxiv.org/abs/1811.10787

Attend More Times for Image Captioning

https://arxiv.org/abs/1812.03283

Object Descriptions

Generation and Comprehension of Unambiguous Object Descriptions
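
This referring-expression work ties generation and comprehension together: comprehension can be implemented by scoring a description against every candidate region with the caption model and picking the argmax. A sketch of that selection rule, where `log_prob_caption` is a hypothetical stand-in for any trained region captioner:

```python
def comprehend(description, regions, log_prob_caption):
    """Pick the region whose caption model best explains the description.

    log_prob_caption(description, region) -> log p(description | region);
    a stand-in for any trained region captioner.
    """
    return max(regions, key=lambda r: log_prob_caption(description, r))

# Toy example: pretend the model strongly prefers region "b".
scores = {"a": -5.2, "b": -1.3, "c": -4.0}
best = comprehend("the man in the red hat", list(scores),
                  lambda d, r: scores[r])
print(best)  # "b"
```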

Video Captioning / Description

Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

Describing Videos by Exploiting Temporal Structure

SA-tensorflow: Soft attention mechanism for video caption generation

Sequence to Sequence – Video to Text
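
S2VT frames video captioning as sequence-to-sequence transduction: an LSTM first reads the clip's frame features, then the caption is emitted word by word conditioned on that state. The paper interleaves two LSTM layers with padding; the hedged sketch below simplifies this to a separate encoder and decoder:

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Encode frame features with one LSTM, decode words with another."""
    def __init__(self, feat_dim=4096, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, captions):
        # frames: (B, T_v, feat_dim) per-frame CNN features
        _, state = self.encoder(frames)        # summarize the clip
        words = self.embed(captions[:, :-1])   # teacher forcing
        hidden, _ = self.decoder(words, state) # condition on the video
        return self.out(hidden)

model = VideoCaptioner()
frames = torch.randn(2, 30, 4096)              # 30 frames of fc7-style features
caps = torch.randint(0, 10000, (2, 10))
logits = model(frames, caps)                   # (2, 9, 10000)
```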

Jointly Modeling Embedding and Translation to Bridge Video and Language

Video Description using Bidirectional Recurrent Neural Networks

Bidirectional Long-Short Term Memory for Video Description

3 Ways to Subtitle and Caption Your Videos Automatically Using Artificial Intelligence

Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation

Grounding and Generation of Natural Language Descriptions for Images and Videos

Video Captioning and Retrieval Models with Semantic Attention

  • intro: Winner of three (fill-in-the-blank, multiple-choice test, and movie retrieval) out of four tasks of the LSMDC 2016 Challenge (Workshop in ECCV 2016)
  • arxiv: https://arxiv.org/abs/1610.02947

Spatio-Temporal Attention Models for Grounded Video Captioning

Video and Language: Bridging Video and Language with Deep Learning

Recurrent Memory Addressing for describing videos

Video Captioning with Transferred Semantic Attributes

Adaptive Feature Abstraction for Translating Video to Language

Semantic Compositional Networks for Visual Captioning

Hierarchical Boundary-Aware Neural Encoder for Video Captioning

Attention-Based Multimodal Fusion for Video Description

Weakly Supervised Dense Video Captioning

Generating Descriptions with Grounded and Co-Referenced People

Multi-Task Video Captioning with Video and Entailment Generation

Dense-Captioning Events in Videos

Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning

https://arxiv.org/abs/1706.01231

Reinforced Video Captioning with Entailment Rewards

End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering

From Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning

https://arxiv.org/abs/1708.02478

Grounded Objects and Interactions for Video Captioning

https://arxiv.org/abs/1711.06354

Integrating both Visual and Audio Cues for Enhanced Video Caption

https://arxiv.org/abs/1711.08097

Video Captioning via Hierarchical Reinforcement Learning

https://arxiv.org/abs/1711.11135

Consensus-based Sequence Training for Video Captioning

https://arxiv.org/abs/1712.09532

Less Is More: Picking Informative Frames for Video Captioning

https://arxiv.org/abs/1803.01457

End-to-End Video Captioning with Multitask Reinforcement Learning

https://arxiv.org/abs/1803.07950

End-to-End Dense Video Captioning with Masked Transformer

Reconstruction Network for Video Captioning

Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning

Jointly Localizing and Describing Events for Dense Video Captioning

Contextualize, Show and Tell: A Neural Visual Storyteller

https://arxiv.org/abs/1806.00738

RUC+CMU: System Report for Dense Captioning Events in Videos

Streamlined Dense Video Captioning

Projects

Learning CNN-LSTM Architectures for Image Caption Generation: An implementation of CNN-LSTM image caption generator architecture that achieves close to state-of-the-art results on the MSCOCO dataset.
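
Generators like this one are usually decoded with beam search at test time: keep the k best partial captions, extend each with every candidate next word, and re-prune by total log-probability. A self-contained sketch over a toy next-word function (`toy_step` is a stand-in for a real decoder):

```python
import math

def beam_search(step, bos, eos, beam_size=3, max_len=20):
    """step(prefix) -> list of (token, prob) for the next word."""
    beams = [([bos], 0.0)]                       # (tokens, log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in step(tokens):
                cand = (tokens + [tok], score + math.log(p))
                (finished if tok == eos else candidates).append(cand)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return max(finished + beams, key=lambda c: c[1])

# Toy "decoder": always prefers word 7, sometimes ends the caption.
def toy_step(prefix):
    return [(7, 0.6), (8, 0.3), (1, 0.1)]        # 1 = <eos>

tokens, score = beam_search(toy_step, bos=0, eos=1)
print(tokens, score)
```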

screengrab-caption: an openframeworks app that live-captions your desktop screen with a neural net

Tools

CaptionBot (Microsoft)

Blogs

Captioning Novel Objects in Images

http://bair.berkeley.edu/jacky/2017/08/08/novel-object-captioning/