Multimodal Recurrent Attention CNN for Image Aesthetic Prediction

Using a multimodal recurrent attention convolutional neural network (MRACNN), this study proposes a unified approach to image aesthetic prediction that jointly learns visual and textual features. Inspired by the human attention mechanism, the network draws on comment datasets collected from AVA and photo.net to advance multimodal modeling in image aesthetics. The architecture comprises a vision-stream feature extractor, a language-stream Text-CNN, and multimodal factorized bilinear pooling.

  • Neural Network
  • Multimodal Modeling
  • Image Aesthetics
  • Attention Mechanism

Uploaded on Oct 11, 2024



Presentation Transcript


  1. Beyond Vision: A Multimodal Recurrent Attention Convolutional Neural Network for Unified Image Aesthetic Prediction Tasks. Xiaodan Zhang, Xinbo Gao, Wen Lu, Lihuo He, and Jie Li. TMM 2020

  2. Contributions. Inspired by the human attention mechanism, a recurrent attention neural network is used to extract visual features. A multimodal network called MRACNN is proposed to jointly learn visual and textual features for image aesthetic prediction. The AVA comment dataset and the photo.net comment dataset are collected; these datasets can advance research on multimodal modelling in image aesthetics.

  3. AVA dataset with comments

  4. MRACNN Architecture: EMD Loss
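The architecture is trained with an Earth Mover's Distance (EMD) loss, which compares two distributions over ordered aesthetic score buckets via their cumulative distributions. The slide's equation image is not in the transcript; the sketch below assumes the common normalized form with r = 2 (the function name and shapes are illustrative, not the authors' code):

```python
import numpy as np

def emd_loss(p, q, r=2):
    """EMD between two score distributions over ordered buckets
    (e.g. ratings 1-10). p and q are 1-D arrays that each sum to 1.
    Hypothetical sketch of the standard normalized EMD loss."""
    cdf_p = np.cumsum(p)                     # cumulative distribution of p
    cdf_q = np.cumsum(q)                     # cumulative distribution of q
    return np.mean(np.abs(cdf_p - cdf_q) ** r) ** (1.0 / r)

p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
print(emd_loss(p, p))                        # identical distributions -> 0.0
```

Because the loss works on cumulative distributions, it penalizes predictions that put mass far from the true score more than those that are off by one bucket.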

  5. Vision Stream: Feature Extractor. Base network: VGG-16 or another network architecture. Input: image resized to 224x224. Output: a tensor of dimension (W, H, D), reshaped into L = W x H feature vectors of dimension D.
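The reshaping step above can be sketched as follows; the concrete sizes (a 14x14x512 VGG-16-style feature map) are assumptions for illustration, not the paper's stated configuration:

```python
import numpy as np

# Hypothetical conv feature map from the vision-stream backbone
W, H, D = 14, 14, 512
feat = np.random.rand(W, H, D)

# Flatten the (W, H, D) map into L = W * H location vectors of dimension D,
# so the attention module can weight each spatial location separately
L = W * H
locations = feat.reshape(L, D)
print(locations.shape)  # (196, 512)
```

Each of the L rows corresponds to one spatial location of the image, which is what the recurrent attention module attends over.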

  6. Vision Stream: LSTM and Attention
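The LSTM and attention equations on this slide are images that did not survive the transcript. As a sketch, assuming the standard additive soft-attention formulation (score each location against the previous hidden state, softmax-normalize, take the weighted sum), one attention step might look like this; W_a, W_h, and v are hypothetical learned parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(locations, h_prev, W_a, W_h, v):
    """One soft-attention step over L location vectors.
    locations: (L, D); h_prev: (Dh,) previous LSTM hidden state.
    W_a (D, k), W_h (Dh, k), v (k,) are hypothetical projections."""
    e = np.tanh(locations @ W_a + h_prev @ W_h) @ v   # (L,) energies
    alpha = softmax(e)                                # attention weights
    z = alpha @ locations                             # (D,) context vector
    return z, alpha

rng = np.random.default_rng(0)
L, D, Dh, k = 196, 512, 256, 64
locations = rng.standard_normal((L, D))
h_prev = rng.standard_normal(Dh)
W_a = rng.standard_normal((D, k)) * 0.01
W_h = rng.standard_normal((Dh, k)) * 0.01
v = rng.standard_normal(k)
z, alpha = attend(locations, h_prev, W_a, W_h, v)
print(z.shape, np.isclose(alpha.sum(), 1.0))  # (512,) True
```

The context vector z is then fed to the LSTM at the next step, so the network can shift its gaze across the image over time, mimicking human viewing behavior.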

  7. Language Stream: Text-CNN
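A Text-CNN slides convolution filters of several widths over the word-embedding matrix of a comment and max-pools each filter's responses over time. A minimal sketch, assuming a Kim-style Text-CNN (embedding size, filter widths, and filter count here are illustrative, not the paper's configuration):

```python
import numpy as np

def text_cnn_features(embeds, filters):
    """Max-over-time pooled Text-CNN features.
    embeds: (T, d) word embeddings for one comment;
    filters: list of (width, d) convolution kernels."""
    feats = []
    for kernel in filters:
        width = kernel.shape[0]
        T = embeds.shape[0]
        # 1-D convolution: one activation per window of `width` words
        acts = [np.sum(embeds[t:t + width] * kernel)
                for t in range(T - width + 1)]
        feats.append(max(acts))          # max-over-time pooling
    return np.array(feats)

rng = np.random.default_rng(1)
embeds = rng.standard_normal((20, 50))               # 20 words, 50-dim embeddings
filters = [rng.standard_normal((w, 50)) for w in (3, 4, 5)]
print(text_cnn_features(embeds, filters).shape)      # (3,)
```

Max-over-time pooling makes the textual feature length-independent, so comments of any length map to a fixed-size vector for the fusion stage.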

  8. Multimodal Factorized Bilinear Pooling. Given the visual feature x and the textual feature y, the multimodal bilinear model can be defined as z_i = x^T W_i y. Factorizing each W_i into low-rank matrices U_i and V_i, it can be rewritten as z_i = 1^T (U_i^T x o V_i^T y), where o denotes the elementwise product.
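The bilinear-pooling equations on this slide are images lost in the transcript. Assuming the standard MFB formulation (project both modalities into a shared o*k-dimensional space, multiply elementwise, then sum-pool over each group of k factors), the fusion can be sketched as below; all dimensions are illustrative:

```python
import numpy as np

def mfb(x, y, U, V, k):
    """Multimodal factorized bilinear pooling sketch.
    x: (dx,) visual feature; y: (dy,) textual feature;
    U: (dx, o*k), V: (dy, o*k) hypothetical projection matrices."""
    joint = (U.T @ x) * (V.T @ y)            # (o*k,) elementwise product
    z = joint.reshape(-1, k).sum(axis=1)     # (o,) sum-pool over k factors
    return z

rng = np.random.default_rng(2)
dx, dy, o, k = 512, 300, 8, 5
x = rng.standard_normal(dx)
y = rng.standard_normal(dy)
U = rng.standard_normal((dx, o * k))
V = rng.standard_normal((dy, o * k))
print(mfb(x, y, U, V, k).shape)              # (8,)
```

The low-rank factorization keeps the expressiveness of a bilinear interaction between the two modalities while avoiding the dx * dy * o parameters a full bilinear model would need.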

  9. Experiments: Feature Extractor

  10. Experiments: Ablation Study

  11. Experiments: Attention Map

  12. Experiments: Performance Comparison

  13. Experiments: Performance on Photo.net

  14. Comments. Pros: recurrent attention CNN; multimodal framework. Cons: text data may not be available in real-world scenarios; spatial information is not considered in the attention module.
