CN113360621A - Scene text visual question-answering method based on modal inference graph neural network - Google Patents

Scene text visual question-answering method based on modal inference graph neural network

Info

Publication number
CN113360621A
CN113360621A CN202110691232.6A
Authority
CN
China
Prior art keywords
visual
text
question
features
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110691232.6A
Other languages
Chinese (zh)
Inventor
郭欣雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning Technical University
Original Assignee
Liaoning Technical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning Technical University filed Critical Liaoning Technical University
Priority to CN202110691232.6A priority Critical patent/CN113360621A/en
Publication of CN113360621A publication Critical patent/CN113360621A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • G06F40/35Discourse or dialogue representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a scene text visual question-answering method based on a modal reasoning graph neural network, which comprises the following steps: constructing a multi-modal graph; multiplying the preprocessed question word embeddings by two independent sets of weights to obtain visual question features and text question features respectively; multiplying the weights with the corresponding feature nodes to obtain filtered vectors; answering questions about text in scene images more accurately; and multiplying the updated nodes by the corresponding question features, then concatenating them and outputting the predicted answer. The method uses several forms of information in the image to help understand the scene text content: the scene text picture is preprocessed into a visual object graph and a text graph respectively, and redundant information is filtered under a question self-attention module; an attention aggregator refines the node features between the subgraphs so that information from different modalities is fused, and the updated nodes provide better features to the answering module by exploiting context information from different modalities.

Description

Scene text visual question-answering method based on modal inference graph neural network
Technical Field
The invention belongs to the technical field of visual question answering, and particularly relates to a scene text visual question answering method based on a modal inference graph neural network.
Background
Deep learning has brought great advances to Computer Vision (CV) and Natural Language Processing (NLP), and cross-disciplinary fields between vision and natural language have attracted strong attention, such as Visual Question Answering (VQA), whose main goal is to make a computer output an answer that conforms to natural-language rules and has reasonable content given an input picture and question. For a specific picture, to effectively infer the answer to a question by associating the visual features in the image with the semantic features in the question, a visual question-answering model needs a certain understanding of the picture content, the meaning and intention of the question, and the related common sense. The visual question-answering task involves technologies such as fine-grained recognition, object recognition, and understanding of the text contained in the question. Typically, a visual question-answering model consists of three modules. The feature extraction module extracts picture features with a convolutional neural network and question features with a recurrent neural network; the feature fusion module fuses the question features and the picture features into an abstract feature that represents the current task; and the answer classification module takes the fused feature as input and classifies it, where the number of classes is determined by the number of candidate answers. A minimal sketch of this generic pipeline is given below. The recently proposed datasets TextVQA and ST-VQA, by contrast, consist of scene pictures with textual information: each image in these datasets carries actual textual semantics, and answering the question requires understanding the scene text in the image. General visual question-answering models perform poorly on these datasets. For multi-modal datasets featuring visual, textual, and question information, a current trend is to use Transformers or Graph Neural Networks (GNNs) to fuse the different modality information; among them, the multi-modal graph neural network MM-GNN represents the image as three graphs and introduces three aggregators to guide message passing from one graph to another in order to predict the answer.
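For reference only, the following is a minimal PyTorch sketch of the generic three-module VQA pipeline described above (feature extraction, fusion, answer classification); the layer sizes, module names and the simple element-wise fusion are illustrative assumptions, not part of the invention.

```python
import torch
import torch.nn as nn

class GenericVQA(nn.Module):
    """Illustrative three-module VQA baseline: extract, fuse, classify."""
    def __init__(self, img_feat_dim=2048, vocab_size=10000, emb_dim=300,
                 hidden_dim=1024, num_answers=3000):
        super().__init__()
        # Feature extraction: image features are assumed to come from a CNN
        # (e.g. pooled region features); the question is encoded with an RNN.
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.question_rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # Answer classification over a fixed candidate-answer vocabulary.
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feats, question_tokens):
        # img_feats: (B, img_feat_dim) pooled CNN features
        # question_tokens: (B, T) token ids
        _, (h, _) = self.question_rnn(self.word_emb(question_tokens))
        q = h[-1]                      # (B, hidden_dim) question encoding
        v = self.img_proj(img_feats)   # (B, hidden_dim) image encoding
        fused = q * v                  # simple element-wise feature fusion
        return self.classifier(fused)  # answer scores
```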
The prior-art MM-GNN is used to answer questions that require reading. Given an image containing visual objects and scene text, together with a question, the goal is to generate an answer. The MM-GNN answers the question in three steps: (1) it constructs a three-layer graph structure to represent three modalities in the image, namely the visual modality of visual entities (including text and objects), the semantic modality of the scene text, and the numerical modality of number-related text; the initial representations of the nodes in the three graphs are obtained from priors, such as word embeddings learned from a corpus and Faster R-CNN features. (2) The MM-GNN uses three attention-based aggregators that compute relevance scores from the visual appearance and layout information of two nodes in the image and from the question, and pass information between and within the graphs; they can direct the flow of information from one sub-graph to another sub-graph or to itself, dynamically updating the node representations in turn, so that the updated representations contain richer and more accurate information and the answer module can more easily attend to the correct answer. (3) The answer prediction module outputs the answer using these features.
However, the MM-GNN uses an implicit fully connected graph and performs no feature filtering before aggregation, i.e. useless or redundant features are not removed, so the effect of feature extraction is poor. Moreover, it uses three aggregators during aggregation; the visual-semantic aggregator contributes little compared with the semantic-semantic and semantic-numerical aggregators, yet it consumes an excessive amount of computation.
Disclosure of Invention
In view of the defects of the prior art, the technical problem to be solved by the invention is to provide a scene text visual question-answering method based on a modal reasoning graph neural network, which uses several forms of information in the image to help understand the scene text content and provides better features to the question-answering module by using the context information of different modalities.
In order to solve the technical problem, the invention provides a scene text visual question-answering method based on a modal inference graph neural network, which comprises the following steps:
Step 1: constructing a multi-modal graph;
Step 2: generating self-attention weights from the question word sequence, and multiplying the preprocessed question word embeddings by two independent sets of weights to obtain visual question features and text question features respectively;
Step 3: calculating attention weights under the guidance of the visual question features and the text question features, feeding the attention weights into a softmax, and multiplying the resulting weights with the corresponding feature nodes to obtain filtered vectors;
Step 4: refining the visual nodes with the semantic context so that the information between the subgraphs is fused, in order to answer questions about the text in scene images more accurately;
Step 5: multiplying the updated nodes by the corresponding question features, then concatenating them and outputting the predicted answer.
Optionally, in step 2, given a question of T words, the words are embedded into a feature sequence using pre-trained BERT, and two sets of attention weights are then generated using two independent two-layer MLPs, producing the visual question features and the text question features.
Therefore, the scene text visual question-answering method based on the modal inference graph neural network has the following beneficial effects:
the method helps to understand the scene text content by using information in various forms in the image, respectively preprocesses the scene text picture into the forms of a visual object picture and a text picture, and filters redundant information under a problem self-attention module; and perfecting the mutual node characteristics among the subgraphs by using the attention-adding aggregator so as to fuse information among different modes, wherein the updated nodes provide better functions for the answering module by using context information of different modes. The effectiveness is verified on ST-VQA and TextVQA data sets, and experimental results show that compared with some other models of the task, the multi-Modal Reconstruction Graph Neural Network (MRGNN) provided by the invention is obviously improved on the task.
The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical means of the present invention more clearly understood, the present invention may be implemented in accordance with the content of the description, and in order to make the above and other objects, features, and advantages of the present invention more clearly understood, the following detailed description is given in conjunction with the preferred embodiments, together with the accompanying drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings of the embodiments will be briefly described below.
FIG. 1 is a diagram of a visual question-answer model based on multi-modal graph neural network reasoning.
Detailed Description
Other aspects, features and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which form a part of this specification, and which illustrate, by way of example, the principles of the invention. In the referenced drawings, the same or similar components in different drawings are denoted by the same reference numerals.
When a Visual Question Answering (VQA) model handles a scene-text task, it needs to obtain the answer by reading the visual content and the text content in the image and reasoning about the question; poor text-reading ability and insufficient visual reasoning ability are the main reasons why existing visual question-answering models perform poorly. As shown in FIG. 1, to address these problems the invention provides a scene text visual question-answering method based on a modal reasoning graph neural network, which comprises the following steps:
Constructing a multi-modal graph
The features of each modality are extracted and projected into a common d-dimensional semantic space by domain-specific embedding methods, as follows. A visual fully connected graph G_V = {V_V, E_V} is constructed. Each node x_i^v is projected into the d-dimensional space through two learned linear transformations and summed into a final embedding, and each edge e_ij^V represents the relative spatial relationship between two objects. A set of K objects is obtained with a Faster R-CNN detector; each object is represented by a 2048-dimensional visual feature vector x_i^fr and a four-dimensional target coordinate vector b_i, where b_i = [x_i, y_i, w_i, h_i], and (x_i, y_i), w_i and h_i are the coordinates, width and height of the bounding box respectively, as shown in equation (1):

x_i^v = LN(W_1 x_i^fr) + LN(W_2 b_i)    (1)
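A minimal PyTorch sketch of the visual node embedding in equation (1) is shown below; the dimensions (2048-dimensional Faster R-CNN features, 4-dimensional box coordinates, a common d-dimensional space) follow the description above, while the module and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisualNodeEmbedding(nn.Module):
    """Project Faster R-CNN features and box coordinates into a common d-dim space, eq. (1)."""
    def __init__(self, d=768, fr_dim=2048, box_dim=4):
        super().__init__()
        self.w1 = nn.Linear(fr_dim, d)   # W1: projects appearance features
        self.w2 = nn.Linear(box_dim, d)  # W2: projects [x, y, w, h] box coordinates
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)

    def forward(self, x_fr, b):
        # x_fr: (K, 2048) visual feature vectors, b: (K, 4) box coordinate vectors
        return self.ln1(self.w1(x_fr)) + self.ln2(self.w2(b))
```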
where W_1 and W_2 are the learned projection matrices and LN denotes layer normalization. The text graph G_T is also a fully connected graph; four different types of features are used to represent each OCR token: (1) a 300-dimensional FastText vector x_i^ft; (2) the RoI-Pooling feature x_i^fr of the OCR token's bounding box, extracted with Faster R-CNN; (3) a 604-dimensional PHOC vector x_i^phoc; and (4) the position b_i of the bounding box. Each feature is then linearly mapped into the d-dimensional space and the results are added to form the embedding of the final OCR token, as shown in equation (2):

x_i^t = LN(W_3 x_i^ft + W_4 x_i^fr + W_5 x_i^phoc) + LN(W_6 b_i)    (2)
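Below is a minimal PyTorch sketch of the OCR token embedding in equation (2), combining the four feature types listed above (FastText, Faster R-CNN RoI feature, PHOC, bounding box); the exact grouping of the linear projections and the class/variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextNodeEmbedding(nn.Module):
    """Combine FastText, RoI, PHOC and box features of an OCR token, eq. (2)."""
    def __init__(self, d=768, ft_dim=300, fr_dim=2048, phoc_dim=604, box_dim=4):
        super().__init__()
        self.w3 = nn.Linear(ft_dim, d)    # FastText vector
        self.w4 = nn.Linear(fr_dim, d)    # Faster R-CNN RoI feature of the OCR box
        self.w5 = nn.Linear(phoc_dim, d)  # PHOC vector
        self.w6 = nn.Linear(box_dim, d)   # bounding-box position
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)

    def forward(self, x_ft, x_fr, x_phoc, b):
        # Each input has shape (N, feature_dim) for N OCR tokens.
        content = self.w3(x_ft) + self.w4(x_fr) + self.w5(x_phoc)
        return self.ln1(content) + self.ln2(self.w6(b))
```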
Question self-attention module
The extraction of the question features plays a key role in the final answer; to capture the interaction between the modalities and obtain the best effect, the question word features are extracted from the first 3 layers of BERT-Base, self-attention weights are generated from the question word sequence, and the preprocessed question word embeddings are multiplied by two independent sets of weights to obtain the visual question feature Q^V and the text question feature Q^S respectively. Specifically, given a question of T words {w_1, ..., w_T}, the words are embedded into a feature sequence Q = [q_1, ..., q_T] using pre-trained BERT, and two sets of attention weights a^v and a^s are then generated using two independent two-layer MLPs, producing the visual question feature Q^V and the text question feature Q^S. Taking the visual question feature as an example, the formulas are as follows:

a^v = softmax(MLP_v([q_1, ..., q_T]))    (3)

Q^V = Σ_t a_t^v q_t    (4)
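The following PyTorch sketch illustrates equations (3)-(4): pre-trained BERT word features are pooled into a visual question feature and a text question feature with two independent two-layer MLPs. The hidden sizes and names are illustrative assumptions, and a real implementation would load the BERT encoder separately.

```python
import torch
import torch.nn as nn

class QuestionSelfAttention(nn.Module):
    """Pool BERT word features into Q^V and Q^S with two independent 2-layer MLPs, eqs. (3)-(4)."""
    def __init__(self, d=768, hidden=256):
        super().__init__()
        self.mlp_v = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.mlp_s = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, q_words):
        # q_words: (T, d) word embeddings from the first 3 layers of BERT
        a_v = torch.softmax(self.mlp_v(q_words), dim=0)   # (T, 1) visual attention weights
        a_s = torch.softmax(self.mlp_s(q_words), dim=0)   # (T, 1) textual attention weights
        q_visual = (a_v * q_words).sum(dim=0)             # Q^V, eq. (4)
        q_text = (a_s * q_words).sum(dim=0)               # Q^S
        return q_visual, q_text
```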
feature filtering module
The invention filters out irrelevant or redundant features before fusion. The visual question feature Q^V and the text question feature Q^S produced by the question self-attention module serve as feature queries: attention weights are computed under the guidance of Q^V and Q^S, fed into a softmax, and the resulting weights are multiplied with the corresponding feature nodes to obtain the filtered vectors. Taking the visual features as an example, the formulas are as follows:

a_i = MLP([Q^V ; x_i^v])    (5)

s_i = softmax(a_i)    (6)

x̃_i^v = s_i · x_i^v    (7)

where the softmax is taken over the nodes and [· ; ·] denotes vector concatenation.
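A minimal sketch of the question-guided feature filtering in equations (5)-(7) is given below; it assumes the softmax is taken over the nodes of one graph, and the MLP size and names are illustrative.

```python
import torch
import torch.nn as nn

class FeatureFilter(nn.Module):
    """Question-guided filtering of graph nodes, eqs. (5)-(7)."""
    def __init__(self, d=768, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, q, nodes):
        # q: (d,) question feature (Q^V or Q^S); nodes: (N, d) visual or text nodes
        query = q.unsqueeze(0).expand(nodes.size(0), -1)  # broadcast question to all nodes
        a = self.mlp(torch.cat([query, nodes], dim=-1))   # (N, 1) relevance scores, eq. (5)
        s = torch.softmax(a, dim=0)                       # eq. (6), softmax over nodes
        return s * nodes                                  # filtered node vectors, eq. (7)
```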
Multi-modal aggregation module
The visual and text nodes are aggregated with an aggregator; the aim is to refine the nodes of the text graph with the visual content of the image and to refine the visual nodes with the semantic context, so that the information between the sub-graphs is fused and questions about the text in the scene image can be answered more accurately. For a filtered text node x̃_i^t, the aggregator first attends to the related nodes in the visual graph, then aggregates the information of the attended nodes and updates the representation of x̃_i^t. Specifically, the relevance score s_ij between the text node x̃_i^t and the visual node x̃_j^v is computed from the nodes, their bounding-box features b_i^t and b_j^v (i.e. the coordinates of the bounding boxes), and the question features produced by the self-attention module:

s_ij = f_s([x̃_i^t ; Q^S])^T f_v([x̃_j^v ; Q^V]) · f_b([b_i^t ; b_j^v])    (8)

where f_s, f_v and f_b are MLPs for processing the semantic node, the visual node and the bounding-box features respectively, and [· ; ·] denotes the concatenation of two vectors.

α_ij = softmax_j(s_ij)    (9)

x̂_i^t = [x̃_i^t ; Σ_j α_ij f_v~(x̃_j^v)]    (10)

The aggregated features are concatenated with the node x̃_i^t to obtain the updated semantic representation, where x̂_i^t is the updated node representation and f_v~ is an MLP used to encode the relevant node features.

Similar to the text node update, the updated visual nodes x̂_j^v are obtained as follows:

β_ji = softmax_i(s_ji)    (11)

x̂_j^v = [x̃_j^v ; Σ_i β_ji f_t~(x̃_i^t)]    (12)

where f_t~ is an analogous MLP used to encode the relevant text-node features.
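The sketch below illustrates the attention aggregator of equations (8)-(12) in PyTorch, refining text nodes with visual context and vice versa; the exact form of the relevance score and all module names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

class CrossModalAggregator(nn.Module):
    """Question-guided attention aggregation between text and visual nodes, eqs. (8)-(12)."""
    def __init__(self, d=768, box_dim=4):
        super().__init__()
        self.f_s = mlp(2 * d, d)          # processes [text node ; Q^S]
        self.f_v = mlp(2 * d, d)          # processes [visual node ; Q^V]
        self.f_b = mlp(2 * box_dim, 1)    # processes paired bounding boxes
        self.f_v_ctx = mlp(d, d)          # encodes visual context for text nodes
        self.f_t_ctx = mlp(d, d)          # encodes text context for visual nodes

    def forward(self, x_t, x_v, b_t, b_v, q_s, q_v):
        # x_t: (N, d) text nodes, x_v: (M, d) visual nodes,
        # b_t/b_v: box coordinates, q_s/q_v: question features (d,)
        N, M = x_t.size(0), x_v.size(0)
        h_t = self.f_s(torch.cat([x_t, q_s.expand(N, -1)], dim=-1))        # (N, d)
        h_v = self.f_v(torch.cat([x_v, q_v.expand(M, -1)], dim=-1))        # (M, d)
        box_pair = torch.cat([b_t.unsqueeze(1).expand(N, M, -1),
                              b_v.unsqueeze(0).expand(N, M, -1)], dim=-1)  # (N, M, 8)
        s = (h_t @ h_v.t()) * self.f_b(box_pair).squeeze(-1)               # relevance scores, eq. (8)
        x_t_new = torch.cat([x_t, torch.softmax(s, dim=1) @ self.f_v_ctx(x_v)], dim=-1)      # eqs. (9)-(10)
        x_v_new = torch.cat([x_v, torch.softmax(s.t(), dim=1) @ self.f_t_ctx(x_t)], dim=-1)  # eqs. (11)-(12)
        return x_t_new, x_v_new
```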
Answer prediction module
The updated nodes are multiplied by the corresponding question features and then concatenated to output the predicted answer, as shown in equation (13), where f_p is an MLP and ⊙ denotes element-wise multiplication. The predicted result ŷ ∈ R^K is obtained (K is the number of candidate answers), and the predicted answer is then fed into a binary cross-entropy loss for optimization training.

ŷ = f_p([Q^V ⊙ x̂^v ; Q^S ⊙ x̂^t])    (13)
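Below is a short PyTorch sketch of the answer prediction head and the binary cross-entropy training objective described above; pooling the updated node sets by their mean and assuming the updated nodes have been projected back to the common d-dimensional space are illustrative assumptions, as are the names.

```python
import torch
import torch.nn as nn

class AnswerPredictor(nn.Module):
    """Fuse question features with updated nodes and score K candidate answers, eq. (13)."""
    def __init__(self, d=768, num_answers=5000):
        super().__init__()
        self.f_p = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, num_answers))
        self.criterion = nn.BCEWithLogitsLoss()   # binary cross-entropy over candidate answers

    def forward(self, q_v, q_s, x_v, x_t, targets=None):
        # q_v, q_s: (d,) question features; x_v: (M, d), x_t: (N, d) updated nodes
        v = q_v * x_v.mean(dim=0)                 # element-wise product with pooled visual nodes
        t = q_s * x_t.mean(dim=0)                 # element-wise product with pooled text nodes
        logits = self.f_p(torch.cat([v, t], dim=-1))
        loss = self.criterion(logits, targets) if targets is not None else None
        return logits, loss
```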
Question features: the maximum question length is set to 20; the question is encoded with a three-layer BERT model pre-trained on a Wikipedia dataset and fine-tuned during training.
Visual features: the maximum number of object regions is 36; objects are detected with a Faster R-CNN model, features are extracted from fc6 using the pooling layer of the Faster R-CNN, fc7 is fine-tuned to output 2048-dimensional feature vectors, and the bounding-box coordinates are embedded into the features.
Text features: a maximum of 50 OCR tokens are recognized per image; text tokens are extracted from each image with the Rosetta OCR system (only Rosetta-en is used), the same extractor as for the visual features (a Faster R-CNN trained for general object detection) is applied, and FastText and PHOC features (extracted from the recognized OCR character sequences) and coordinate embeddings are added.
The method is implemented in PyTorch and optimized with AdaMax; experiments are run on an NVIDIA GeForce 1080 Ti GPU with a batch size of 128, the learning rate is set to 0.0001 and multiplied by 0.1 at 14000 and 19000 iterations, and the maximum number of training iterations is 24000.
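The following is a sketch of the training configuration described above (AdaMax, learning rate 0.0001, decay by 0.1 at 14000 and 19000 iterations, batch size 128, at most 24000 iterations); the model and data loader shown here are stand-ins so the schedule can be demonstrated end to end.

```python
import torch
import torch.nn as nn

# Stand-ins for the MRGNN model and a TextVQA/ST-VQA loader (batch size 128);
# both are placeholders used only to show the optimization schedule.
model = nn.Linear(10, 5)
train_loader = [(torch.randn(128, 10), torch.randint(0, 2, (128, 5)).float())] * 100

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adamax(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[14000, 19000], gamma=0.1)

max_iters, it = 24000, 0
while it < max_iters:
    for inputs, targets in train_loader:
        loss = criterion(model(inputs), targets)   # binary cross-entropy loss as above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()        # learning rate drops by 0.1 at 14000 and 19000 iterations
        it += 1
        if it >= max_iters:
            break
```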
When performing multi-module reasoning on scene-text pictures, the model can recognize, copy and mark the text information in the image and predict the answer by combining it with the model's fixed vocabulary; in most cases the model shows excellent feature recognition and spatial reasoning. For a question asking about a player, the attention module directs the OCR system to focus on more information about the player's name and the area where the player is located, since the model knows that the words under "Casilla" are likely numbers. The OCR token "RX 405" combines the location and semantic information of "boat" in the question and can be selected in the answer module.
While the foregoing is directed to the preferred embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims (2)

1. A scene text visual question-answering method based on a modal inference graph neural network is characterized by comprising the following steps:
Step 1: constructing a multi-modal graph;
Step 2: generating self-attention weights from the question word sequence, and multiplying the preprocessed question word embeddings by two independent sets of weights to obtain visual question features and text question features respectively;
Step 3: calculating attention weights under the guidance of the visual question features and the text question features, feeding the attention weights into a softmax, and multiplying the resulting weights with the corresponding feature nodes to obtain filtered vectors;
Step 4: refining the visual nodes with the semantic context so that the information between the subgraphs is fused, in order to answer questions about the text in scene images more accurately;
Step 5: multiplying the updated nodes by the corresponding question features, then concatenating them and outputting the predicted answer.
2. The method of claim 1, wherein in step 2, given a question of T words, the words are embedded into a feature sequence using pre-trained BERT, and two sets of attention weights are then generated using two independent two-layer MLPs, producing the visual question features and the text question features.
CN202110691232.6A 2021-06-22 2021-06-22 Scene text visual question-answering method based on modal inference graph neural network Pending CN113360621A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110691232.6A CN113360621A (en) 2021-06-22 2021-06-22 Scene text visual question-answering method based on modal inference graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110691232.6A CN113360621A (en) 2021-06-22 2021-06-22 Scene text visual question-answering method based on modal inference graph neural network

Publications (1)

Publication Number Publication Date
CN113360621A true CN113360621A (en) 2021-09-07

Family

ID=77535678

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110691232.6A Pending CN113360621A (en) 2021-06-22 2021-06-22 Scene text visual question-answering method based on modal inference graph neural network

Country Status (1)

Country Link
CN (1) CN113360621A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114020954A (en) * 2021-09-10 2022-02-08 广西师范大学 Personalized image description method for embodying user intention and style
CN114168104A (en) * 2021-12-08 2022-03-11 杭州电子科技大学 Scene character interactive understanding system for visually impaired people
CN114398471A (en) * 2021-12-24 2022-04-26 哈尔滨工程大学 Visual question-answering method based on deep reasoning attention mechanism
WO2023207059A1 (en) * 2022-04-29 2023-11-02 山东海量信息技术研究院 Visual question answering task processing method and system, electronic device, and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
CN110609891A (en) * 2019-09-18 2019-12-24 合肥工业大学 Visual dialog generation method based on context awareness graph neural network
CN111984772A (en) * 2020-07-23 2020-11-24 中山大学 Medical image question-answering method and system based on deep learning
CN112507727A (en) * 2020-11-18 2021-03-16 北京科技大学 Text visual question-answering system and method based on text
CN112818889A (en) * 2021-02-09 2021-05-18 北京工业大学 Dynamic attention-based method for integrating accuracy of visual question-answer answers by hyper-network
CN112926655A (en) * 2021-02-25 2021-06-08 电子科技大学 Image content understanding and visual question and answer VQA method, storage medium and terminal

Similar Documents

Publication Publication Date Title
CN110163299B (en) Visual question-answering method based on bottom-up attention mechanism and memory network
CN109783666B (en) Image scene graph generation method based on iterative refinement
CN113360621A (en) Scene text visual question-answering method based on modal inference graph neural network
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
Zhi et al. Action unit analysis enhanced facial expression recognition by deep neural network evolution
Sharma et al. A survey of methods, datasets and evaluation metrics for visual question answering
CN111611367B (en) Visual question-answering method introducing external knowledge
CN107247755A (en) A kind of personalized image method for generating captions based on context serial memorization network
Liu et al. Fact-based visual question answering via dual-process system
CN115331075A (en) Countermeasures type multi-modal pre-training method for enhancing knowledge of multi-modal scene graph
CN115017911A (en) Cross-modal processing for vision and language
Pezzelle et al. Is the red square big? MALeViC: Modeling adjectives leveraging visual contexts
CN116089645A (en) Hierarchical style-based conditional text-e-commerce picture retrieval method and system
CN115223020A (en) Image processing method, image processing device, electronic equipment and readable storage medium
CN114595306A (en) Text similarity calculation system and method based on distance perception self-attention mechanism and multi-angle modeling
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
Wang et al. Mirtt: Learning multimodal interaction representations from trilinear transformers for visual question answering
CN111598252A (en) University computer basic knowledge problem solving method based on deep learning
Miao et al. Research on visual question answering based on GAT relational reasoning
Tüselmann et al. Recognition-free question answering on handwritten document collections
Zhu et al. Configurable graph reasoning for visual relationship detection
CN115588193A (en) Visual question-answering method and device based on graph attention neural network and visual relation
CN115359486A (en) Method and system for determining custom information in document image
Wang et al. TASTA: Text‐Assisted Spatial and Temporal Attention Network for Video Question Answering
Li et al. Fine-Grained Bidirectional Attention-Based Generative Networks for Image-Text Matching

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210907