CN113360621A

CN113360621A - Scene text visual question-answering method based on modal inference graph neural network

Info

Publication number: CN113360621A
Application number: CN202110691232.6A
Authority: CN
Inventors: 郭欣雨
Original assignee: Liaoning Technical University
Current assignee: Liaoning Technical University
Priority date: 2021-06-22
Filing date: 2021-06-22
Publication date: 2021-09-07

Abstract

The invention discloses a scene text visual question-answering method based on a modal inference graph neural network, which comprises the following steps: constructing a multi-modal graph; multiplying the preprocessed problem word embedding by two independent weights to respectively obtain a visual problem feature and a text problem feature; correspondingly multiplying the weight by the characteristic node to obtain a filtered vector; more accurately answering questions about text in images of a scene; and multiplying the updated nodes by the corresponding question features, and then connecting and outputting the predicted answers. The method helps to understand the scene text content by using information in various forms in the image, respectively preprocesses the scene text picture into the forms of a visual object picture and a text picture, and filters redundant information under a problem self-attention module; and perfecting the mutual node characteristics among the subgraphs by using the attention-adding aggregator so as to fuse information among different modes, wherein the updated nodes provide better functions for the answering module by using context information of different modes.

Description

Scene text visual question-answering method based on modal inference graph neural network

Technical Field

The invention belongs to the technical field of visual question answering, and particularly relates to a scene text visual question answering method based on a modal inference graph neural network.

Background

Deep learning has made Computer Vision (CV) and Natural Language Processing (NLP) a great advance, and the cross-domain discipline between vision and natural language has attracted strong attention, such as Visual Question and Answer (VQA), the main goal of which is to make a computer output an answer that conforms to the rules of natural language and has reasonable contents according to the input pictures and questions. For a specific picture, if the answer to the question is to be deduced effectively by associating the visual features in the image with the semantic features in the question, the visual question-answering model needs to have a certain understanding of the content of the picture, the meaning and intention of the question and the related common sense. The visual question-answering task relates to technologies of fine-grained identification, object identification, understanding of texts contained in questions and the like. Typically, the visual question-answering model consists of three modules. The feature extraction module extracts features of the picture by using a convolutional neural network and extracts features of the problem by using a cyclic neural network; the feature fusion module fuses the problem features and the picture features into an abstract feature capable of representing the current task; and the answer classification module takes the fused features as input to classify the features, and the number of classification categories is determined by the number of candidate answers. Whereas the recently proposed datasets TextVQA and ST-VQA are scene pictures with textual information, each image in the dataset carries the actual textual semantics and requires an understanding of the image scene text to answer the question. The effect of the common representation on these data sets is not good using a general visual question-and-answer model to process. For multi-modal datasets featuring visual, textual, and question information, it is more current to use transformers or Graph Neural Networks (GNNs) to fuse together different modality information, where the multi-modal Graph Neural networks MM-GNNs use Graph Neural networks to represent images as three graphs and introduce three aggregators to guide the transfer of messages from one Graph to another to predict the generation of answers.

The prior art MM-GNNs are used to answer many questions that need to be read. Given an image containing visual objects and scene text and a question, the goal is to generate an answer. The MM-GNN answers the question in three steps, (1) constructs a three-layer graph structure for representing three modalities in the image, namely, the visual modality of visual entities (including text and objects), the semantic modality of scene text, and the numerical modality of number-related text, and the initial representations of nodes in the three graphs are obtained a priori, such as word embedding learned from a corpus and faster RCNN features. (2) The MM-GNN uses three attention-based aggregators that compute their relevance scores based on the visual appearance and layout information of two nodes in the image and questions, and pass information between and within the graph, which can direct the flow of information between one sub-graph to another sub-graph or itself, dynamically updating the representations of the nodes in turn, with the updated representations containing richer, more accurate information, making it easier for the answer model to notice the correct answer. (3) And the answer prediction module outputs answers by utilizing the characteristics.

The MM-GNN does not perform feature filtering before aggregation by using an implicit full-connectivity graph, namely, useless or redundant features are not removed, and the effect of feature extraction is poor. Three aggregators are used for de-aggregating features during aggregation, the visual-semantic aggregation significance effect is not large compared with semantic and semantic set and semantic-numerical aggregation, and the consumed calculation amount is excessive.

Disclosure of Invention

Based on the defects of the prior art, the technical problem to be solved by the invention is to provide a scene text visual question-answering method based on a modal inference graph neural network, which helps to understand the scene text content by using various forms of information in an image, and provides a better function for a question-answering module by using context information of different modalities.

In order to solve the technical problem, the invention provides a scene text visual question-answering method based on a modal inference graph neural network, which comprises the following steps:

step 1: constructing a multi-modal graph;

step 2: generating self-attention weight by using a problem word sequence, and multiplying the preprocessed problem word embedding and two independent weights to respectively obtain visual problem features and text problem features;

and step 3: calculating attention weight under the guidance of the visual problem feature and the text problem feature, inputting the attention weight into softmax, and multiplying the weight by the feature node correspondingly to obtain a filtered vector;

and 4, step 4: the semantic context is utilized to perfect the visual nodes, so that the information between the sub-images is fused with each other, and the questions about the texts in the scene images are answered more accurately;

and 5: and multiplying the updated nodes by the corresponding question features, and then connecting and outputting the predicted answers.

Optionally, in step 2, given a problem of T words, the words are embedded into the feature sequence using pre-trained BERT, and then two sets of attention weights are generated using two independent two-layer MLPs, generating visual problem features and textual problem features.

Therefore, the scene text visual question-answering method based on the modal inference graph neural network has the following beneficial effects:

the method helps to understand the scene text content by using information in various forms in the image, respectively preprocesses the scene text picture into the forms of a visual object picture and a text picture, and filters redundant information under a problem self-attention module; and perfecting the mutual node characteristics among the subgraphs by using the attention-adding aggregator so as to fuse information among different modes, wherein the updated nodes provide better functions for the answering module by using context information of different modes. The effectiveness is verified on ST-VQA and TextVQA data sets, and experimental results show that compared with some other models of the task, the multi-Modal Reconstruction Graph Neural Network (MRGNN) provided by the invention is obviously improved on the task.

The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical means of the present invention more clearly understood, the present invention may be implemented in accordance with the content of the description, and in order to make the above and other objects, features, and advantages of the present invention more clearly understood, the following detailed description is given in conjunction with the preferred embodiments, together with the accompanying drawings.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings of the embodiments will be briefly described below.

FIG. 1 is a diagram of a visual question-answer model based on multi-modal graph neural network reasoning.

Detailed Description

Other aspects, features and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, which form a part of this specification, and which illustrate, by way of example, the principles of the invention. In the referenced drawings, the same or similar components in different drawings are denoted by the same reference numerals.

When a scene text task is processed by a Visual Question Answering (VQA) model, answers need to be obtained by reading visual contents and text contents in images and reasoning questions, and poor text reading capability and insufficient visual reasoning capability are main reasons of poor effects of the existing visual question answering model, as shown in fig. 1, aiming at the problems, the invention provides a scene text visual question answering method based on a modal reasoning graph neural network, which comprises the following steps:

constructing a multimodal map

Features of each modality are extracted and projected to a common d-dimensional semantic space by domain-specific embedding methods as follows. Constructing a visual fully connected graph G_V＝{v^V,ε^V}, node

Projection into d-dimensional space through two learned linear transformations and summarization into final embedding, each edge

Representing the relative spatial relationship between two objects, a set of K objects is determined using a fast-RCNN detector, the objects being represented by 2048-dimensional visual feature vectors

And a four-dimensional target coordinate vector

In which b is_i＝[x_i,y_i,w_i,h_i]，(x_i,y_i)、w_iAnd h_iCoordinates, width and height of the border, respectively. As shown in equation (1):

wherein W₁And W₂Is the learned projection matrix. LN is a layer normalized text graph G_tAlso a fully connected graph, uses four different types of features to represent OCR forms: (1) a 300-dimensional FastText vector

(2) Features of Rol-Pooling bounding box of OCR field extracted using fast-RCNN

(3) A 604-dimensional PHOC vector

(4) Position of bounding box

Each feature is then linearly mapped into d-dimensional space and added as an embedding of the final OCR field, as shown in equation (2):

question self-attention module

The extraction of the question features plays a key role in the final answer, and in order to grasp the interaction between the modalities, the best effect is obtained. Extracting problem word features from the first 3 layers of BERTBAE, generating self-attention weight by using a problem word sequence, and multiplying the preprocessed problem word embedding and two independent weights to respectively obtain visual problem features Q_VAnd text question feature Q_S. Specifically, the problem of a given T words

Embedding words into feature sequences using pre-trained BERT

Two sets of attention weights are then generated using two independent dual-layer MLPs,

generating visual problem features Q_VAnd text question feature Q_STaking the visual problem feature as an example, the formula is as follows:

feature filtering module

The invention filters out irrelevant or redundant features before fusion, the problem self-attention module and the visual problem feature Q_VAnd text question feature Q_SAs a characteristic query, in the visual problem characteristic Q_VAnd text question feature Q_SCalculating attention weight under the guidance of (1), inputting the attention weight into softmax, and multiplying the weight by the characteristic node correspondingly to obtain a filtered vector. Taking the visual characteristics as an example, the formula is as follows:

s_i＝soft max(a_i) (6)

multimodal aggregation module

And aggregating the visual and text nodes by using an aggregator, wherein the aim is to refine the nodes of the text graph by using the visual content of the image, and the semantic context is utilized to perfect the visual nodes, so that the information between the sub-graphs is fused with each other, and the problem about the text in the scene image can be answered more accurately. For the filtered text node

The aggregator firstly accesses the related nodes in the visual graph, then aggregates the information of the participating nodes and updates

Is shown. Computing relevance scores between nodes with specific needs

Vision node

And its bounding box feature b_tiAnd b_vi(i.e., the coordinates of the bounding box), the problem feature is extracted from the attention module, and the calculation formula is:

f_s,f_v,f_bMLP, | for processing semantic node, visual node, and bounding box feature, respectively; and | represents connecting two vectors.

Combining the aggregated features with nodes

Concatenating to obtain an updated semantic representation, wherein

Is an updated node representation, f_v～Is an MLP used to encode relevant node characteristics.

Similar to text node updates, we obtain

The update node of (2). The formula is as follows:

answer prediction module

Multiplying the updated nodes by the corresponding question features and then connecting and outputting the predicted answers as shown in the formula (5), wherein f is_pMLP，

Is element-wise. Obtaining a predicted result

(K is the number of candidate answers), and then the predicted answers are input into the binary cross entropy loss for optimization training.

Problem characteristics: the problem maximum length is set to 20, the problem is encoded using a three-layer BERT model pre-trained with a wikipedia dataset, and fine-tuned during training.

Visual characteristics: the maximum number of object regions is 36, objects are detected using a fast-RCNN model, features are extracted through fc6 and using the pooling layer of the fast-RCNN, fc7 is fine-tuned to output 2048-dimensional feature vectors, and are embedded into the features using bounding box coordinates.

Text characteristics: a maximum of 50 OCR tokens are recognized in the figure, text labels are extracted on each image using a Rosetta OCR system (we use Rosetta-en only), the same extractor as the visual features is used (Faster R-CNN is trained for general object detection), Fasttext and PHOC (extracted from the recognized OCR character sequence) and coordinate embedding is added.

The method is realized by PyTorch, AdaMax optimization is carried out, experiments are carried out on NVIDIA GeForce 1080Ti gpu, the batch processing size is 128, the learning rate is set to be 0.0001, the learning rate is multiplied by 0.1 when 14000 and 19000 iterations are carried out, and the maximum iteration number of training is 24000.

When multi-module reasoning is carried out on scene text pictures, the model can identify, copy and mark text information in images and predict answers by combining fixed vocabularies of the models, and the model has excellent characteristic identification and spatial reasoning effects in most cases. With a question query of a player, the attention module directs the OCR system to focus on more information about the player's name and the area where the player is located, since the model knows that the words under "Casilla" are likely numbers. The OCR token "RX 405" combines the location and semantic information of "boat" in the question and can be selected in the answer module.

While the foregoing is directed to the preferred embodiment of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.

Claims

1. A scene text visual question-answering method based on a modal inference graph neural network is characterized by comprising the following steps:

step 1: constructing a multi-modal graph;

2. The method of claim 1, wherein in step 2, given a question of T words, the words are embedded into a sequence of features using pre-trained BERT, and then two sets of attention weights are generated using two independent two-layer MLPs, generating visual question features and textual question features.