ANSEL Photobot: Robot Event Photographer with Semantic Intelligence


ANSEL Photobot is a robot event photographer with semantic intelligence. It uses the GPT-3 and CLIP models to capture contextually meaningful photos at social events.

  • ANSEL Photobot
  • Robot Photographer
  • Semantic Intelligence
  • Event Photography
  • Innovative Technology


Presentation Transcript


  1. ANSEL PHOTOBOT: A Robot Event Photographer with Semantic Intelligence. Rivkin et al. Paper discussion by Novak Petrovic.

  2. BREAKING DOWN ANSEL PHOTOBOT: High-level overview; Background; Models used; Methodology; Evaluation/Results; Strengths and weaknesses.

  3. WHAT IS ANSEL? ANSEL (Appropriate sNap SELection) is a robotic photographer that produces contextually meaningful photos for social events. It leverages OpenAI's GPT-3 LM and CLIP VLM. It captures a video stream from the robot embodiment and extracts contextually appropriate frames to create a portfolio.

  4. BACKGROUND AND RELATED WORK: Robot photographers have seen a variety of use cases (medical procedures, military operations, etc.). The focus is typically on shot composition and the presence of certain objects, and video summarization tools sample frames according to these same ideas. ANSEL differs, focussing on capturing meaningful social moments. Its approach avoids the need for complex queries to guide frame selection from videos.

  5. MODELS USED: GPT-3: Transformer-based LM; enables common-sense reasoning; provides semantically relevant outputs according to real-world social conventions. CLIP: Transformer-based VLM; aligns images and text in a shared embedding space; grounds language with the robot's sensory inputs. Multi-Task Cascaded CNN (MTCNN): used in post-processing of images; performs facial detection.
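
The shared embedding space mentioned for CLIP can be illustrated with a small sketch. It assumes that a text description and each video frame have already been mapped to vectors of the same length, and it ranks frames by cosine similarity to the text. The vectors are made-up toy values rather than real CLIP outputs, and `cosine_similarity` and `rank_frames` are hypothetical helper names, not part of the paper or of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_frames(text_embedding, frame_embeddings):
    """Return frame indices ordered from most to least similar to the text."""
    scored = [
        (cosine_similarity(text_embedding, emb), idx)
        for idx, emb in enumerate(frame_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]


# Toy stand-ins for embeddings (assumption: real embeddings are much longer).
text_vec = [1.0, 0.0, 0.0]
frames = [
    [0.0, 1.0, 0.0],  # unrelated to the text
    [0.9, 0.1, 0.0],  # close to the text
    [0.5, 0.5, 0.0],  # partly related
]
best_first = rank_frames(text_vec, frames)
```

With real embeddings, the top-ranked indices would be the candidate frames for a given description.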

  6. METHODOLOGY

  7. METHODOLOGY: Prompt engineering: Backus-Naur form (BNF) is used to prompt GPT-3, giving more reliable and uniform outputs. Outputs from GPT-3 are rejected if composition terms are present, e.g. close-up, wide shot.
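
The rejection step on this slide can be sketched as a simple text filter. This is a minimal illustration, not the authors' implementation: the slide names only "close-up" and "wide shot" as examples, so the term list below is an assumption (the "closeup" spelling variant is added for illustration), and `mentions_composition` and `filter_descriptions` are hypothetical helper names.

```python
import re

# Assumption: only "close-up" and "wide shot" come from the slide;
# "closeup" is an illustrative spelling variant.
COMPOSITION_TERMS = ("close-up", "closeup", "wide shot")


def mentions_composition(description, terms=COMPOSITION_TERMS):
    """True if the description contains any composition term as a whole phrase."""
    text = description.lower()
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text) is not None
        for term in terms
    )


def filter_descriptions(descriptions, terms=COMPOSITION_TERMS):
    """Keep only descriptions that do not mention composition terms."""
    return [d for d in descriptions if not mentions_composition(d, terms)]


candidates = [
    "Guests singing happy birthday",
    "A close-up of the birthday cake",
    "Wide shot of the whole room",
    "Someone blowing out the candles",
]
kept = filter_descriptions(candidates)
```

Matching on whole phrases keeps the filter from rejecting descriptions that merely contain a term as part of a longer word.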

  8. METHODOLOGY: Post-processing: MTCNN is used to detect faces. If no faces are found, the image is ignored; otherwise, the photo is cropped to the bounding box of the ensemble of faces detected.
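
The cropping rule on this slide can be sketched as follows. The sketch assumes a face detector has already returned one `(x, y, width, height)` box per face; that box format, the nested-list image representation, and the helper names `union_box` and `crop_to_faces` are illustrative assumptions, and no actual MTCNN call is made.

```python
def union_box(boxes):
    """Smallest (left, top, right, bottom) rectangle containing every box.

    Each box is (x, y, width, height). Returns None when there are no boxes.
    """
    if not boxes:
        return None
    left = min(x for x, y, w, h in boxes)
    top = min(y for x, y, w, h in boxes)
    right = max(x + w for x, y, w, h in boxes)
    bottom = max(y + h for x, y, w, h in boxes)
    return (left, top, right, bottom)


def crop_to_faces(image, boxes):
    """Crop a row-major image (list of pixel rows) to the union of face boxes.

    Returns None when no faces were detected, mirroring the rule that
    such images are ignored.
    """
    region = union_box(boxes)
    if region is None:
        return None
    height = len(image)
    width = len(image[0]) if height else 0
    # Clamp the region so boxes that spill past the border stay in range.
    left = max(0, region[0])
    top = max(0, region[1])
    right = min(width, region[2])
    bottom = min(height, region[3])
    return [row[left:right] for row in image[top:bottom]]


# Toy 6x8 "image" whose pixels record their own (row, column) position.
image = [[(r, c) for c in range(8)] for r in range(6)]
faces = [(1, 1, 2, 2), (4, 2, 3, 3)]
cropped = crop_to_faces(image, faces)
```

Here the two face boxes merge into a single region spanning columns 1 to 6 and rows 1 to 4 of the toy image.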

  9. EVALUATION: The robot collected videos of 3 simulated events: a birthday party, a wine tasting, and a painting class. Portfolios of 9 photos were generated per event and tested against an adapted version of CA-SUM. Both image and GPT-3 label appropriateness were evaluated.

  10. RESULTS: ANSEL outperformed CA-SUM in 2/3 events, showing ANSEL's ability to capture socially appropriate images. GPT-3 labels achieved an average of 7.0/10 for label quality, while participant labels achieved 7.4/10. This shows that GPT-3 provided semantically relevant labels for the events.

  11. STRENGTHS AND WEAKNESSES OF ANSEL PHOTOBOT: Strengths: demonstrated impressive semantic awareness in images; promising results with no fine-tuning of models; generalizability across event types. Weaknesses: some issues with visual understanding; limited embodiment; no considerations made for shot composition or image quality.

  12. THANK YOU
