
Advanced Image Captioning and Language Modeling Techniques
Slides on image captioning algorithms, framed as a classification problem over a space of sentences. They cover statistical language models and N-gram models, generating sentences by greedy and beam search, and ranking sentences with a language model to improve caption quality.
Presentation Transcript
Image Captioning Tackgeun You
Image Captioning Algorithms. Retrieval-based captioning, template-based captioning, and machine translation-based captioning. How to generate language?
Image Classification. Problem setting: in the label space $\mathcal{Y}$, find the appropriate labels for a given image $I$. Naïve testing phase: estimate the scores of all possible labels, then threshold them to get the result labels.
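The naïve testing phase above can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the function name `predict_labels`, the threshold of 0.5, and the score values are all assumptions made for the example.

```python
def predict_labels(scores, threshold=0.5):
    """Return every label whose score reaches the threshold, best first."""
    kept = [(label, s) for label, s in scores.items() if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in kept]


# Hypothetical classifier scores for one image (illustrative values only).
scores = {"woman": 0.91, "camera": 0.74, "cat": 0.08, "crowd": 0.55}
print(predict_labels(scores))
```

Raising the threshold trades recall for precision: with `threshold=0.8` only the highest-scoring label survives.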
Image Captioning (1). Problem setting: in the sentence space $S$, find the appropriate sentences for a given image $I$. Naïve testing phase: estimate the scores of all possible sentences, then threshold them to get the result sentences.
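A quick count suggests why scoring all possible sentences is impractical. The sketch below assumes a vocabulary of 1,000 words and sentences of at most 8 words; both numbers and the helper name `sentence_space_size` are illustrative choices, not figures from the slides.

```python
def sentence_space_size(vocab_size, max_len):
    """Count all word sequences of length 1..max_len over the vocabulary."""
    return sum(vocab_size ** n for n in range(1, max_len + 1))


# Assumed sizes: 1,000 words, sentences of up to 8 words.
print(sentence_space_size(1000, 8))  # more than 10**24 candidate sentences
```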
Image Captioning (2). Problem setting: in the sentence space $S$, find the appropriate sentences for a given image $I$. Naïve testing phase: retrieve the relevant sentences, estimate the scores of those relevant sentences, then threshold them to get the result sentences.
Image Captioning (3). [Diagram labels: Input Image, feature, Language Model, Generating Sentences, Sentences, Ranking Sentences, Result Captions]
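Statistical Language Model. A (statistical) language model is a probability distribution over sequences of words. It assigns a probability $P(w_m, w_{m-1}, \dots, w_1)$ to every sentence:
$P(w_m, w_{m-1}, \dots, w_1) = \prod_{i=1}^{m} P(w_i \mid w_{i-1}, \dots, w_1) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2, w_1) \cdots P(w_m \mid w_{m-1}, \dots, w_1)$
Ex. "A woman holding a camera in a crowd.": $P(\text{A})\,P(\text{woman} \mid \text{A})\,P(\text{holding} \mid \text{woman}, \text{A}) \cdots P(\text{crowd} \mid \text{a}, \dots, \text{A})$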
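The chain-rule factorization can be sketched as a sum of log-probabilities. The conditional model here, `uniform_model`, is a hypothetical stand-in that gives every word probability 0.1; a real language model would replace it.

```python
import math


def sentence_log_prob(words, cond_prob):
    """Sum log P(w_i | w_1..w_{i-1}) over the sentence (chain rule)."""
    total = 0.0
    for i, word in enumerate(words):
        total += math.log(cond_prob(word, tuple(words[:i])))
    return total


def uniform_model(word, history):
    """Hypothetical stand-in model: every word has probability 1/10."""
    return 0.1


words = "a woman holding a camera in a crowd".split()
log_p = sentence_log_prob(words, uniform_model)
print(math.exp(log_p))  # close to 0.1 ** 8 for this 8-word sentence
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.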
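N-gram Model. An $n$-gram model makes an $(n-1)$-th order Markov assumption:
$P(w_m, w_{m-1}, \dots, w_1) = \prod_{i=1}^{m} P(w_i \mid w_{i-1}, \dots, w_1) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-1}, \dots, w_{i-n+1})$
Ex. "A woman holding a camera in a crowd." with $n = 4$: $P(\text{crowd} \mid \text{a}, \text{in}, \text{camera}, \text{a}, \text{holding}, \text{woman}, \text{A}) \approx P(\text{crowd} \mid \text{a}, \text{in}, \text{camera})$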
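As a minimal sketch of the Markov assumption, the simplest case is a bigram model (n = 2), estimated here by relative frequency. The three-sentence corpus, the `<s>` start marker, and the function name `train_bigram` are assumptions for illustration; no smoothing is applied.

```python
from collections import Counter


def train_bigram(sentences):
    """Return P(word | prev) estimated by relative frequency of bigrams."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1

    def prob(word, prev):
        if context_counts[prev] == 0:
            return 0.0  # unseen context; a real model would smooth this
        return bigram_counts[(prev, word)] / context_counts[prev]

    return prob


# Toy corpus, invented for this example.
corpus = ["a woman holding a camera", "a woman in a crowd", "a cat in a box"]
p = train_bigram(corpus)
print(p("woman", "a"))  # "a" is followed by "woman" in 2 of its 6 occurrences
```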
Sentence Generation by LM (1). In a language model, a relevant sentence is a high-probability sentence, so retrieving relevant sentences means searching for sentences with high probability. Search schemes: exhaustive search, greedy search, beam search.
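Sentence Generation by LM (2). Greedy search: with $P(w_m, w_{m-1}, \dots, w_1) = \prod_{i=1}^{m} P(w_i \mid w_{i-1}, \dots, w_1) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2, w_1) \cdots P(w_m \mid w_{m-1}, \dots, w_1)$, pick the word with the highest probability at each step. Beam search: greedy search while retaining the $k$ best paths.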
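The difference between the two schemes can be sketched on a toy model. Everything in `TOY_LM` is invented for illustration, and it conditions only on the previous word to stay small. Greedy search is simply beam search with a beam width of 1.

```python
import math

# Hypothetical toy model: next-word distributions keyed by the previous word.
TOY_LM = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"woman": 0.55, "cat": 0.45},
    "the": {"crowd": 0.9, "cat": 0.1},
    "woman": {"</s>": 1.0},
    "cat": {"</s>": 1.0},
    "crowd": {"</s>": 1.0},
}


def beam_search(lm, beam_width, max_len=10):
    """Keep the beam_width best partial sentences at every step."""
    beams = [(0.0, ["<s>"])]  # (log-probability, words so far)
    for _ in range(max_len):
        candidates = []
        for score, words in beams:
            if words[-1] == "</s>":
                candidates.append((score, words))  # finished; carry over
                continue
            for word, prob in lm[words[-1]].items():
                candidates.append((score + math.log(prob), words + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
        if all(words[-1] == "</s>" for _, words in beams):
            break
    return beams[0]


greedy = beam_search(TOY_LM, beam_width=1)
beam = beam_search(TOY_LM, beam_width=2)
print(" ".join(greedy[1]), math.exp(greedy[0]))
print(" ".join(beam[1]), math.exp(beam[0]))
```

In this toy model greedy search commits to "a" because it is the likeliest first word, while a beam of width 2 keeps "the" alive and ends with a more probable sentence overall.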
Image Captioning Pipeline. [Diagram labels: Input Image, feature, Language Model, Generating Sentences, Sentences, Ranking Sentences, Result Captions]
From Captions to Visual Concepts and Back (CVPR 2015). Language model with detected words (labels). [Diagram labels: Input Image, feature, Detected Words, Language Model, Generating Sentences, Sentences, Ranking Sentences, Result Captions]
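1. Word Detection. Learning a weakly supervised detector: the 1000 most frequent words in the training set cover over 92% of the word occurrences. FCN + Noisy-OR version of MIL (multiple instance learning): it takes as input sets of positive and negative bags of bounding boxes for each word. The probability of bag $b_i$ containing word $w$ is $1 - \prod_{j \in b_i} (1 - p_{ij}^{w})$, where $p_{ij}^{w}$ is the probability that box $j$ of bag $b_i$ corresponds to word $w$.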
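The Noisy-OR combination is small enough to sketch directly: a bag contains the word unless every box in it misses the word. The per-box probabilities below are made-up values, and `noisy_or` is a name chosen for this example.

```python
def noisy_or(box_probs):
    """Probability that at least one box in the bag shows the word."""
    miss_all = 1.0
    for p in box_probs:
        miss_all *= 1.0 - p
    return 1.0 - miss_all


# Three boxes with illustrative per-box probabilities for one word.
print(noisy_or([0.1, 0.2, 0.5]))  # 1 - 0.9 * 0.8 * 0.5
```

A single confident box is enough to make the bag probability high, which is what lets the detector train from image-level captions without box-level labels.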
Fully Convolutional Network with MIL. [Figure: with fully connected layers, a 224x224 query image yields a 1x1x1000 output vector; with convolutions, a 565x565 query image yields a 12x12x1000 output map and then a 1x1x1000 output vector.]
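2. Sentence Generation. Beam search with a blackboard of detected words that have not been used yet: $P(w_1 \mid V_0)\,P(w_2 \mid w_1, V_1) \cdots P(w_i \mid w_{i-1}, \dots, w_1, V_{i-1})$. Example, with the words left on the blackboard after each partial sentence:
1. "A ____": woman, crowd, cat, camera, holding, purple
2. "A woman ____": crowd, cat, camera, holding, purple
3. "A woman holding ____": crowd, cat, camera, purple
N. "A woman holding a camera in a crowd": cat, purple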
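The blackboard bookkeeping in this example can be sketched as follows. Only the word removal is shown; the language-model score is a placeholder (`log_p=0.0`), and the tuple layout and the name `extend` are assumptions made for the sketch rather than the paper's implementation.

```python
def extend(hypothesis, word, log_p):
    """Append a word and erase it from the hypothesis's blackboard."""
    score, words, remaining = hypothesis
    return (score + log_p, words + [word], remaining - {word})


# Detected words from the slide's example.
detected = frozenset({"woman", "crowd", "cat", "camera", "holding", "purple"})
hyp = (0.0, [], detected)
for word in "a woman holding a camera in a crowd".split():
    hyp = extend(hyp, word, log_p=0.0)  # log_p=0.0 is a placeholder score
print(sorted(hyp[2]))  # ['cat', 'purple']
```

Because each hypothesis carries its own copy of the remaining words, different beams can consume the detected words in different orders.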
3. Re-Ranking Sentences. An off-the-shelf algorithm is used for ranking: MERT (Minimum Error Rate Training), optimized for BLEU on the validation set. Similarity measure: DMSM (Deep Multimodal Similarity Model). Image model: fc7 of a fine-tuned VGG16. Text model: semantic vector.
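A rough sketch of re-ranking, under simplifying assumptions: each candidate has a language-model log-probability and a text vector, the image-text similarity is taken as cosine similarity, and the two features are combined with fixed weights. In the slide's setup the weights would be tuned by MERT and the vectors would come from the DMSM; here the weights, vectors, and the names `cosine` and `rerank` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(candidates, image_vec, weights=(1.0, 1.0)):
    """Sort (caption, log_prob, text_vec) triples by a weighted feature sum."""
    def score(candidate):
        _, log_prob, text_vec = candidate
        return weights[0] * log_prob + weights[1] * cosine(image_vec, text_vec)

    return [c[0] for c in sorted(candidates, key=score, reverse=True)]


# Illustrative two-dimensional vectors; real ones would be learned embeddings.
image_vec = [1.0, 0.0]
candidates = [
    ("a cat in a crowd", -1.0, [0.0, 1.0]),
    ("a woman holding a camera", -1.2, [1.0, 0.0]),
]
print(rerank(candidates, image_vec))
```

With the similarity weight set to zero the language-model score alone decides the order, which shows what the similarity feature contributes.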
Results on Microsoft COCO (Table C5, COCO Challenge)

          BLEU-1  BLEU-2  BLEU-3  BLEU-4  M1     M2
Human     0.663   0.469   0.321   0.217   0.638  0.675
MSR       0.695   0.526   0.391   0.291   0.268  0.322
Google    0.713   0.542   0.407   0.309   0.273  0.317

c5: five reference captions for every train/val/test image.
M1: percentage of captions that are evaluated as better than or equal to a human caption.
M2: percentage of captions that pass the Turing test.
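The BLEU columns build on clipped (modified) n-gram precision against several reference captions. The sketch below computes only that precision component for a single n; full BLEU additionally combines the precisions for several n and applies a brevity penalty, both omitted here. The captions and function names are invented for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """List the n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against several references."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


references = ["a woman holding a camera in a crowd", "a lady with a camera"]
print(modified_precision("a woman holding a cat", references, 1))
print(modified_precision("a woman holding a cat", references, 2))
```

Clipping stops a candidate from scoring well by repeating a common word more often than any reference uses it.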
#2. Show and Tell: A Neural Image Caption Generator (CVPR 2015). Neural language model with CNN features. No need to re-rank! [Diagram labels: Input Image, feature, Language Model, Generating Sentences, Result Captions]
Novelty in Captions (validation c4)

          BLEU-4  Unique Captions (%)  Seen in Training (%)
Human     -       99.4                 4.8
MSR       25.7    47.0                 30.0
Google    27.2    -                    ~80 (overfitting)
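The two novelty measures named on this slide can be sketched as simple set operations. The definitions used here are assumptions: "unique" is taken as distinct generated captions over all generated captions, and "seen in training" as the share of generated captions that appear verbatim in the training captions. The caption lists are toy data.

```python
def novelty_stats(generated, training):
    """Return (% unique captions, % captions seen verbatim in training)."""
    training_set = set(training)
    unique_pct = 100.0 * len(set(generated)) / len(generated)
    seen_pct = 100.0 * sum(c in training_set for c in generated) / len(generated)
    return unique_pct, seen_pct


# Toy data: four generated captions, one training caption.
generated = ["a cat", "a cat", "a dog", "a woman"]
training = ["a cat"]
print(novelty_stats(generated, training))  # (75.0, 50.0)
```

A high "seen in training" share alongside a low "unique" share is the pattern the slide labels as overfitting.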
Discussion. Generating sentences with a (sequential) language model. How to modify the LM to avoid overfitting?
Discussion Image Captioning
#3. Phrase-based Image Captioning (ICML 2015). Phrase-based language model. [Diagram labels: Input Image, feature, Phrase Model, Sentence Model, Generating Sentences, Sentences, Ranking Sentences, Result Captions]
Image Captioning. [Diagram labels: Input Image, feature, Language Model, Generating Sentences, Sentences, Ranking Sentences, Result Captions]
1. Word Detection. Ex. "A person with helmet is riding a bicycle." Words = a/person/with/helmet/is/riding/bicycle
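The word list in this example is the caption's distinct words in order of first appearance. A minimal sketch, assuming lowercase matching and whitespace tokenization with the final period stripped (the function name `caption_words` is illustrative):

```python
def caption_words(caption):
    """Return the caption's distinct lowercase words in first-seen order."""
    seen = []
    for token in caption.lower().rstrip(".").split():
        if token not in seen:
            seen.append(token)
    return seen


print("/".join(caption_words("A person with helmet is riding a bicycle.")))
# a/person/with/helmet/is/riding/bicycle
```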
http://www.cs.berkeley.edu/~sgupta/captions/cvprPdf.pdf