CN115375984A - Chart question-answering method based on graph neural network - Google Patents

Chart question-answering method based on graph neural network

Info

Publication number
CN115375984A
Authority
CN
China
Prior art keywords: modal, cross, representation, feature, order cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211142426.1A
Other languages
Chinese (zh)
Inventor
沈玮祺 (Shen Weiqi)
贺樑 (He Liang)
肖路巍 (Xiao Luwei)
吴兴蛟 (Wu Xingjiao)
马天龙 (Ma Tianlong)
何峻 (He Jun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2022-09-20
Publication date
2022-11-22
Application filed by East China Normal University
Priority to CN202211142426.1A
Publication of CN115375984A
Legal status: Pending


Classifications

    All under section G (Physics), class G06 (Computing; Calculating or Counting):
    • G06V 10/811: Fusion at the classification level, the classifiers operating on different input data, e.g. multi-modal recognition
    • G06V 10/806: Fusion at the feature-extraction level of extracted features
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06F 40/10: Handling natural language data; text processing
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs


Abstract

The invention discloses a chart question-answering method based on a graph neural network, comprising the following steps: a) model the two modalities, chart and text, with a visual graph neural network and a bidirectional long short-term memory network respectively, then concatenate the two feature representations and perform cross-modal feature fusion; b) apply first-stage feature fusion to the concatenated cross-modal representation with a 1×1 convolution kernel and a bidirectional GRU (gated recurrent unit) to obtain a low-order cross-modal feature representation; c) apply second-stage feature fusion to the low-order representation with a multi-head self-attention mechanism and a densely connected graph convolutional network, extracting the cross-modal semantic dependencies to obtain a high-order cross-modal feature representation; d) feed the high-order representation into a fully connected network and finally into a classifier to predict the answer to the question. Compared with existing methods, the method offers stronger feature extraction and a higher degree of cross-modal information fusion, and can improve chart question-answering accuracy to a certain degree.

Description

Chart question-answering method based on graph neural network
Technical Field
The invention relates to the field of visual question answering within multi-modal data fusion research, and in particular to a chart question-answering method based on a graph neural network.
Background
The visual question-answering task is a challenging and practical task in the field of artificial intelligence. It is a multi-modal task at the intersection of computer vision and natural language processing. Chart question answering (CQA) is one task type within visual question answering: a multi-modal reasoning task over chart images and textual questions. Given a chart and a question about it, the goal of chart question answering is to combine the visual information of the chart with the textual content of the question, and to obtain the answer by fusing and reasoning over the two modalities.
Early studies of chart question answering were mainly based on simple combinations of neural networks and a single cross-modal interaction mechanism. In one approach, a question representation produced by a unidirectional long short-term memory network is concatenated with image features extracted by a convolutional neural network and fed into a classifier. In another, based on a simple relation network, each pixel of the feature map produced by a convolutional neural network is treated as an object; the feature vectors of every pair of pixels are combined into relation-pair features covering all pixel pairs, a fully connected network reasons over each pairwise relation, and the relation pairs are finally summed to predict the answer. Such models built on simple feature combinations can only capture low-order image and text information, carry much redundant information, and perform poorly in practice.
In recent years, researchers have designed new algorithms to improve performance on the chart question-answering task. Some adopt a conventional attention mechanism, iterated several times, to gather features from the key regions of the chart. Others use a pretrained open-source optical character recognition (OCR) engine to mine the textual information in the chart and align it with the question embedding, extract a chart feature map with a pretrained ResNet-152, and finally inject the feature map as hidden-layer information into a long short-term memory network through a spatial attention mechanism to obtain the text representation. These methods, however, struggle to model the structural features within a chart, especially the comparative relations among chart elements such as size, proportion and line trend; they also struggle to extract the high-order semantic relations between the chart and the question keywords, so the cross-modal features cannot interact sufficiently.
Disclosure of Invention
In view of the above, the present invention provides a chart question-answering method based on a graph neural network. To model the structural features within a chart, in particular the comparative relations among chart elements such as size, proportion and line trend, the method extracts chart features with a visual graph neural network. To extract the high-order semantic relations between the chart and the question keywords and let the cross-modal features interact and fuse fully, the method performs cross-modal fusion in two distinct stages.
To achieve this, the invention adopts the following technical scheme:
A chart question-answering method based on a graph neural network comprises the following steps:
(1) Extract the two modal feature representations, image and text, with a visual graph neural network and a bidirectional long short-term memory network respectively, then align and concatenate them, comprising the following substeps:
(1.1) Take the scientific charts collected from the FigureQA dataset (vertical bar charts, horizontal bar charts, line charts, pie charts and scatter plots), their corresponding questions, and the manually labeled answers as the training and validation sets, split in an 8:2 ratio; the manual labels are the answers given to the questions;
(1.2) Model the visual information of the chart with a visual graph neural network. First partition the whole chart into a grid of sub-images and flatten the RGB three-channel matrix of each sub-image along the channel direction, so that each sub-image is represented by a vector. Treat each sub-image as a node, find the K nodes closest to the current node with a K-nearest-neighbor algorithm, and create dependency edges between the current node and those K nodes. Iterating over all nodes builds a dependency matrix between every sub-image and the others; all sub-image vectors and the dependency matrix are then fed into the graph neural network to model the image feature representation;
(1.3) Preprocess the input text: obtain a word vector for each word in the question with BERT, then feed the whole question into a bidirectional long short-term memory network for sequence modeling to obtain the text feature representation;
(1.4) Align and concatenate the channel dimension of the image features with the word dimension of the text features to obtain the cross-modal feature representation.
(2) Perform first-stage feature fusion on the cross-modal feature representation to obtain a low-order cross-modal feature representation, comprising the following substeps:
(2.1) Extract features from the cross-modal representation with a convolutional neural network whose convolution kernel is 1×1;
(2.2) Feed the features extracted by the 1×1 convolution into a bidirectional GRU network for cross-modal fusion, obtaining the low-order cross-modal feature representation.
(3) Perform second-stage feature fusion on the low-order cross-modal feature representation to obtain a high-order cross-modal feature representation, comprising the following substeps:
(3.1) Apply a multi-head self-attention mechanism to the low-order representation obtained from the first fusion stage, generating one self-attention matrix per head; average-pool these self-attention matrices to obtain an interaction attention weight matrix;
(3.2) Feed the low-order representation and the interaction attention weight matrix into a densely connected deep graph convolutional network to model the semantic dependencies between text and image, obtaining a mid-order cross-modal feature representation;
(3.3) Apply average pooling to the mid-order representation to obtain the high-order cross-modal feature representation.
(4) Feed the high-order cross-modal feature representation into a classifier to obtain the prediction, comprising the following substep:
(4.1) Feed the high-order representation into a fully connected network and then into a classifier to obtain the prediction.
Compared with the prior art, the invention adopting the above technical scheme has the following beneficial effects:
The visual graph neural network models the dependencies among the different sub-image blocks of a chart image, and can better extract the structural features within the chart, in particular the comparative relations among chart elements such as size, proportion and line trend.
The proposed two-stage cross-modal feature fusion mechanism lets the cross-modal features interact fully and better mines the high-order semantic relations between the chart and the question keywords.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings.
As shown in FIG. 1, the chart question-answering method based on a graph neural network of the present invention comprises the following steps:
(1) Extract the two modal feature representations, image and text, with a visual graph neural network and a bidirectional long short-term memory network respectively, then align and concatenate them, comprising the following substeps:
(1.1) Take the scientific charts collected from the FigureQA dataset (vertical bar charts, horizontal bar charts, line charts, pie charts and scatter plots), their corresponding questions, and the manually labeled answers as the training and validation sets, split in an 8:2 ratio; the manual labels are the answers given to the questions.
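For illustration only (this sketch is not part of the original disclosure), the 8:2 split of substep (1.1) could look as follows; the record format, the helper name split_8_2 and the fixed seed are assumptions.

```python
# Hypothetical sketch of the 8:2 train/validation split in substep (1.1).
import random

def split_8_2(samples, seed=0):
    """Shuffle (chart, question, answer) triples and split them 8:2."""
    rng = random.Random(seed)
    samples = list(samples)              # copy so the caller's list is untouched
    rng.shuffle(samples)
    cut = int(len(samples) * 0.8)
    return samples[:cut], samples[cut:]  # -> (training set, validation set)
```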
(1.2) Model the visual information of the chart with a visual graph neural network. First partition the whole chart into a grid of sub-images and flatten the RGB three-channel matrix of each sub-image along the channel direction, so that each sub-image is represented by a vector. Treat each sub-image as a node, find the K nodes closest to the current node with a K-nearest-neighbor algorithm, and create dependency edges between the current node and those K nodes. Iterating over all nodes builds a dependency matrix between every sub-image and the others; all sub-image vectors and the dependency matrix are then fed into the graph neural network to model the image feature representation.
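A minimal PyTorch sketch (not part of the original disclosure) of substep (1.2): the chart is cut into a grid of sub-images, each flattened into a node vector, and a K-nearest-neighbor dependency matrix is built over the nodes. The grid size and K are illustrative assumptions.

```python
# Hypothetical sketch of substep (1.2): grid the chart into sub-image nodes
# and build a KNN dependency matrix. `grid` and `k` are assumed values.
import torch

def build_patch_graph(image: torch.Tensor, grid: int = 16, k: int = 8):
    c, h, w = image.shape                                  # (3, H, W)
    ph, pw = h // grid, w // grid
    # (grid*grid, 3*ph*pw): one flattened RGB vector per sub-image (node)
    patches = (image.unfold(1, ph, ph).unfold(2, pw, pw)
                    .permute(1, 2, 0, 3, 4)
                    .reshape(grid * grid, -1))
    dist = torch.cdist(patches, patches)                   # pairwise distances
    knn = dist.topk(k + 1, largest=False).indices[:, 1:]   # drop self-match
    adj = torch.zeros(grid * grid, grid * grid)
    adj.scatter_(1, knn, 1.0)                              # dependency matrix
    return patches, adj  # both are then fed to the graph neural network
```

Here "closest" is taken as Euclidean distance between the flattened sub-image vectors, which is one plausible reading of the disclosure.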
(1.3) Preprocess the input text: obtain a word vector for each word in the question with BERT, then feed the whole question into a bidirectional long short-term memory network for sequence modeling to obtain the text feature representation.
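For illustration only, a sketch of substep (1.3) using the Hugging Face transformers API; the checkpoint name and hidden size are assumptions, as the disclosure does not name a specific BERT variant.

```python
# Hypothetical sketch of substep (1.3): BERT word vectors -> bidirectional LSTM.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed variant
bert = BertModel.from_pretrained("bert-base-uncased")
bilstm = torch.nn.LSTM(input_size=768, hidden_size=256,
                       batch_first=True, bidirectional=True)

def encode_question(question: str) -> torch.Tensor:
    tokens = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        word_vecs = bert(**tokens).last_hidden_state  # (1, T, 768)
    text_feat, _ = bilstm(word_vecs)                  # (1, T, 512)
    return text_feat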
(1.4) Align and concatenate the channel dimension of the image features with the word dimension of the text features to obtain the cross-modal feature representation.
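A sketch (assumed, not from the disclosure) of substep (1.4): both modalities are projected to a common hidden size and concatenated along the sequence axis; the common size d = 512 is chosen only to match the sketches above.

```python
# Hypothetical sketch of substep (1.4): align dimensions, then concatenate.
import torch

d = 512
img_proj = torch.nn.LazyLinear(d)  # sub-image vector -> d; in_features inferred

def concat_modalities(img_nodes: torch.Tensor, text_feat: torch.Tensor):
    """img_nodes: (1, N, patch_dim); text_feat: (1, T, d) -> (1, N + T, d)."""
    return torch.cat([img_proj(img_nodes), text_feat], dim=1)
```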
(2) Perform first-stage feature fusion on the cross-modal feature representation to obtain a low-order cross-modal feature representation, comprising the following substeps:
(2.1) Extract features from the cross-modal representation with a convolutional neural network whose convolution kernel is 1×1.
(2.2) Feed the features extracted by the 1×1 convolution into a bidirectional GRU network for cross-modal fusion, obtaining the low-order cross-modal feature representation.
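A hedged PyTorch sketch of step (2), continuing the assumed sizes from the sketches above: a 1×1 convolution over the channel dimension followed by a bidirectional GRU. Channel counts are assumptions.

```python
# Hypothetical sketch of step (2): 1x1 convolution + bidirectional GRU.
import torch

conv1x1 = torch.nn.Conv1d(in_channels=512, out_channels=256, kernel_size=1)
bigru = torch.nn.GRU(input_size=256, hidden_size=128,
                     batch_first=True, bidirectional=True)

def first_stage_fusion(cross_modal: torch.Tensor) -> torch.Tensor:
    x = conv1x1(cross_modal.transpose(1, 2))  # Conv1d expects (B, C, L)
    x = x.transpose(1, 2)                     # back to (1, N + T, 256)
    low_order, _ = bigru(x)                   # (1, N + T, 256)
    return low_order
```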
(3) Perform second-stage feature fusion on the low-order cross-modal feature representation to obtain a high-order cross-modal feature representation, comprising the following substeps:
(3.1) Apply a multi-head self-attention mechanism to the low-order representation obtained from the first fusion stage, generating one self-attention matrix per head; average-pool these self-attention matrices to obtain an interaction attention weight matrix.
(3.2) Feed the low-order representation and the interaction attention weight matrix into a densely connected deep graph convolutional network to model the semantic dependencies between text and image, obtaining a mid-order cross-modal feature representation.
(3.3) Apply average pooling to the mid-order representation to obtain the high-order cross-modal feature representation. An illustrative sketch of substeps (3.1) and (3.2) follows.
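A hedged sketch of substeps (3.1) and (3.2), under the same assumed sizes: head-averaged self-attention weights act as a soft adjacency matrix for a densely connected graph convolution; the head count and layer count are illustrative assumptions.

```python
# Hypothetical sketch of step (3): multi-head self-attention averaged over
# heads into an interaction weight matrix, then a densely connected GCN.
import torch
import torch.nn.functional as F

heads, d = 8, 256
qk = torch.nn.Linear(d, 2 * d)   # query and key projections

def interaction_matrix(low_order: torch.Tensor) -> torch.Tensor:
    q, k = qk(low_order).chunk(2, dim=-1)
    b, l, _ = q.shape
    q = q.view(b, l, heads, d // heads).transpose(1, 2)
    k = k.view(b, l, heads, d // heads).transpose(1, 2)
    att = torch.softmax(q @ k.transpose(-2, -1) / (d // heads) ** 0.5, dim=-1)
    return att.mean(dim=1)       # average-pool the per-head matrices: (1, L, L)

class DenseGCN(torch.nn.Module):
    """Densely connected graph convolution: layer i sees all earlier outputs."""
    def __init__(self, d: int, layers: int = 3):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            torch.nn.Linear(d * (i + 1), d) for i in range(layers))

    def forward(self, x, adj):   # x: (1, L, d), adj: (1, L, L)
        feats = [x]
        for lin in self.layers:
            feats.append(F.relu(lin(adj @ torch.cat(feats, dim=-1))))
        return feats[-1]         # mid-order cross-modal representation
```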
(4) Feed the high-order cross-modal feature representation into a classifier to obtain the prediction, comprising the following substep:
(4.1) Feed the high-order representation into a fully connected network and then into a classifier to obtain the prediction.
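Finally, a hedged sketch of substep (3.3) pooling and the step (4) classifier; num_answers = 2 is an assumption based on FigureQA's yes/no answers.

```python
# Hypothetical sketch of substep (3.3) pooling plus the step (4) classifier.
import torch

d, num_answers = 256, 2
classifier = torch.nn.Sequential(
    torch.nn.Linear(d, d), torch.nn.ReLU(),   # fully connected network
    torch.nn.Linear(d, num_answers))          # classification layer

def predict(mid_order: torch.Tensor) -> torch.Tensor:
    high_order = mid_order.mean(dim=1)             # average pooling (substep 3.3)
    return classifier(high_order).softmax(dim=-1)  # answer probabilities
```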
Examples
Referring to FIG. 2, in this embodiment the scientific charts collected from the FigureQA dataset and their corresponding questions are first preprocessed and then fed into the visual graph neural network and the bidirectional long short-term memory network respectively, yielding the modal features of the chart and of the text; the two feature representations are concatenated and cross-modal feature fusion is performed. First-stage feature fusion is applied to the concatenated cross-modal representation with a 1×1 convolution kernel and a bidirectional GRU (gated recurrent unit) to obtain a low-order cross-modal feature representation. Second-stage feature fusion is applied to the low-order representation with a multi-head self-attention mechanism and a densely connected graph convolutional network; the cross-modal semantic dependencies are extracted and average pooling is applied to obtain the high-order cross-modal feature representation. The high-order feature vector is fed into a fully connected network, whose output is the question-answer prediction.
The above description is only a preferred embodiment of the present invention. Modifications may be made within the scope of the present invention as defined by the appended claims, and all such modifications fall within the scope of the present invention.

Claims (1)

1. A chart question-answering method based on a graph neural network, characterized by comprising the following specific steps:
Step 1: extract the modal feature representations of the image and the text with a visual graph neural network and a bidirectional long short-term memory network respectively, and align and concatenate the two representations to obtain a cross-modal feature representation;
Step 2: perform first-stage feature fusion on the cross-modal feature representation to obtain a low-order cross-modal feature representation;
Step 3: perform second-stage feature fusion on the low-order cross-modal feature representation to obtain a high-order cross-modal feature representation;
Step 4: feed the high-order cross-modal feature representation into a classifier to obtain the question-answer prediction; wherein:
Step 1 specifically comprises:
1.1: taking the scientific charts collected from the FigureQA dataset, namely vertical bar charts, horizontal bar charts, line charts, pie charts and scatter plots, their corresponding questions, and the manually labeled answers as the training set and the validation set, split in an 8:2 ratio, the manual labels being the answers given to the questions;
1.2: modeling the visual information of the chart with a visual graph neural network: first partitioning the whole chart into a grid of sub-images and flattening the RGB three-channel matrix of each sub-image along the channel direction, so that each sub-image is represented by a vector; treating each sub-image as a node, finding the K nodes closest to the current node with a K-nearest-neighbor algorithm, and creating dependency edges between the current node and those K nodes; iterating over all nodes to build a dependency matrix between every sub-image and the others, then feeding all sub-image vectors and the dependency matrix into the graph neural network to model the image feature representation;
1.3: preprocessing the input text: obtaining a word vector for each word in the question with BERT, then feeding the whole question into a bidirectional long short-term memory network for sequence modeling to obtain the text feature representation;
1.4: aligning and concatenating the channel dimension of the image features with the word dimension of the text features to obtain the cross-modal feature representation;
Step 2 specifically comprises:
2.1: extracting features from the cross-modal representation with a convolutional neural network whose convolution kernel is 1×1;
2.2: feeding the features extracted by the 1×1 convolution into a bidirectional GRU network for cross-modal fusion, obtaining the low-order cross-modal feature representation;
Step 3 specifically comprises:
3.1: applying a multi-head self-attention mechanism to the low-order representation obtained from the first fusion stage, generating one self-attention matrix per head, and average-pooling these self-attention matrices to obtain an interaction attention weight matrix;
3.2: feeding the low-order representation and the interaction attention weight matrix into a densely connected deep graph convolutional network to model the semantic dependencies between text and image, obtaining a mid-order cross-modal feature representation;
3.3: applying average pooling to the mid-order representation to obtain the high-order cross-modal feature representation.
CN202211142426.1A | Priority date: 2022-09-20 | Filing date: 2022-09-20 | Title: Chart question-answering method based on graph neural network | Status: Pending | Published as CN115375984A (en)

Priority Applications (1)

CN202211142426.1A | Priority date: 2022-09-20 | Filing date: 2022-09-20 | Title: Chart question-answering method based on graph neural network

Applications Claiming Priority (1)

CN202211142426.1A | Priority date: 2022-09-20 | Filing date: 2022-09-20 | Title: Chart question-answering method based on graph neural network

Publications (1)

CN115375984A | Publication date: 2022-11-22

Family

Family ID: 84072506

Family Applications (1)

CN202211142426.1A | Status: Pending | Publication: CN115375984A (en) | Priority date: 2022-09-20 | Filing date: 2022-09-20 | Title: Chart question-answering method based on graph neural network

Country Status (1)

CN: CN115375984A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
CN117390165A * | Priority date: 2023-10-27 | Publication date: 2024-01-12 | Assignee: 北京中科闻歌科技股份有限公司 (Beijing Zhongke Wenge Technology Co., Ltd.) | Title: Chart question-answering method, system, medium and equipment based on a multi-modal large model


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
CB03: Change of inventor or designer information
Inventor after: Shen Qiwei; He Liang; Xiao Luwei; Wu Xingjiao; Ma Tianlong; He Jun
Inventor before: Shen Weiqi; He Liang; Xiao Luwei; Wu Xingjiao; Ma Tianlong; He Jun