
Text-based VQA Method by Team USTB-PRIR: a Contextualized Attention-based Deep Network
Team USTB-PRIR introduces a novel text-centric method for Text-based Visual Question Answering that uses OCR context, semantic understanding, and machine reading comprehension. The approach explores OCR semantics and object-text relationships and predicts which OCR token is the answer, achieving promising results in the ST-VQA Challenge.
Presentation Transcript
ST-VQA Challenge
Team USTB-PRIR: Zan-Xia Jin, Heran Wu, Lu Zhang, Bei Yin, Jingyan Qin, Xu-Cheng Yin
Contact: xuchengyin@ustb.edu.cn; zanxiajin@xs.ustb.edu.cn
Reporter: Miaotong Jiang
SDNet Model
SDNet: Contextualized Attention-based Deep Network for Conversational Question Answering
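SDNet scores the relevance of word pairs with an attention function of the form score(x, y) = ReLU(Ux)^T D ReLU(Uy), with D diagonal. Below is a minimal sketch of one such layer; the class name, dimensions, and the choice of PyTorch are ours for illustration, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLayer(nn.Module):
    """One SDNet-style attention layer:
    score(x, y) = ReLU(Ux)^T D ReLU(Uy), with D diagonal."""

    def __init__(self, input_dim: int, attn_dim: int):
        super().__init__()
        self.U = nn.Linear(input_dim, attn_dim, bias=False)
        self.d = nn.Parameter(torch.ones(attn_dim))  # diagonal entries of D

    def forward(self, x, y, y_values):
        # x: (n, input_dim) query-side vectors, e.g. OCR-token embeddings
        # y: (m, input_dim) key-side vectors, e.g. question-word embeddings
        # y_values: (m, v) vectors to aggregate, often y itself
        px = F.relu(self.U(x))             # (n, attn_dim)
        py = F.relu(self.U(y))             # (m, attn_dim)
        scores = (px * self.d) @ py.t()    # (n, m) pairwise relevance
        alpha = F.softmax(scores, dim=-1)  # attention over y for each x
        return alpha @ y_values            # (n, v) attended summaries
```

Stacking such layers across the question, OCR-token, and object streams is one way to realize the "contextualized attention" in the model's name.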
Results (all scores on the ST-VQA Task 3 test set)

Answer Prediction:
- Attended Object Embedding: 0.2615
- Question Embedding: 0.2549
- OCR Embedding: 0.2364

OCR Candidates N-gram:
- OCR 1 gram: 0.2615
- OCR 1+2 gram: 0.2757

Object Detection Model:
- Bottom-Up Attention Model: 0.2757
- Yolo3 Model: 0.2549

OCR-Object Relation Finding:
- Positional Attention: 0.2587
- Semantic Attention: 0.2615
- Positional + Semantic Attention: 0.2826

ST-VQA Task 3:
- First place result: 0.2820
- Our submitted result: 0.1702
- Our after-challenge result: 0.2826
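For context on the numbers above: the ST-VQA Challenge scores answers with ANLS (Average Normalized Levenshtein Similarity), which gives partial credit to near-miss OCR answers and zeroes out matches below a 0.5 threshold. A minimal per-question sketch, with helper names of our own choosing:

```python
def levenshtein(a: str, b: str) -> int:
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(prediction: str, gt_answers: list, tau: float = 0.5) -> float:
    """Per-question ANLS: best normalized similarity over the ground-truth
    answers, zeroed below the threshold tau (0.5 in the challenge)."""
    best = 0.0
    for gt in gt_answers:
        p, g = prediction.strip().lower(), gt.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, 1.0 - nl)
    return best if best >= tau else 0.0
```

The dataset-level score is the mean of this quantity over all questions, which is why small gains in OCR candidate quality move the test-set numbers by fractions of a point.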
Conclusions
We propose a novel text-centric method for Text-based VQA:
1. We construct an OCR context to fully explore the semantics of OCR tokens, using an NLP model and various attention mechanisms.
2. We find the relationship between objects and text in a scene from the semantic and positional information of the plain text (a rough sketch follows this list).
3. We predict whether an OCR token is the answer with a machine reading comprehension model, which matches OCR tokens with questions semantically.
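To make contribution 2 concrete: one simple way to combine semantic and positional cues is to score each detected object against an OCR token by mixing embedding cosine similarity with box overlap. The IoU-plus-cosine mixture and the weight w_sem below are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def iou(box_a, box_b):
    # boxes as (x1, y1, x2, y2); intersection-over-union of two boxes
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def relate_ocr_to_objects(ocr_emb, ocr_box, obj_embs, obj_boxes, w_sem=0.5):
    """Score each object against one OCR token by combining semantic
    similarity (cosine of embeddings) with positional overlap (IoU).
    w_sem is a made-up mixing knob for illustration."""
    sem = obj_embs @ ocr_emb / (
        np.linalg.norm(obj_embs, axis=1) * np.linalg.norm(ocr_emb) + 1e-8)
    pos = np.array([iou(ocr_box, b) for b in obj_boxes])
    return w_sem * sem + (1 - w_sem) * pos  # higher = more related
```

Scores like these can be normalized into attention weights that pool object features into an attended object embedding of the kind ablated in the Results slide.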
Team USTB-PRIR: Heran Wu, Lu Zhang, Jingyan Qin, Zan-Xia Jin, Xu-Cheng Yin, Bei Yin
Thank You!