CN115375984A - Chart question-answering method based on graph neural network - Google Patents

Chart question-answering method based on graph neural network

Info

Publication number
CN115375984A
Authority
CN
China
Prior art keywords: modal, cross, representation, feature, order cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211142426.1A
Other languages
Chinese (zh)
Inventor
沈玮祺 (Shen Weiqi)
贺樑 (He Liang)
肖路巍 (Xiao Luwei)
吴兴蛟 (Wu Xingjiao)
马天龙 (Ma Tianlong)
何峻 (He Jun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
2022-09-20
Publication date
2022-11-22
Application filed by East China Normal University
Priority to CN202211142426.1A
Publication of CN115375984A
Legal status: Pending


Classifications

    All under section G (Physics), class G06 (Computing; Calculating or Counting):
    • G06V 10/811: Fusion at the classification level, the classifiers operating on different input data, e.g. multi-modal recognition
    • G06V 10/806: Fusion at the feature-extraction level of extracted features
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06F 40/10: Handling natural language data; text processing
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs


Abstract

The invention discloses a chart question-answering method based on a graph neural network, comprising the following steps: a) model the two modalities, chart and text, with a visual graph neural network and a bidirectional long short-term memory network respectively, then concatenate the two feature representations and perform cross-modal feature fusion; b) apply first-stage feature fusion to the concatenated cross-modal representation with a 1×1 convolution kernel and a bidirectional GRU (gated recurrent unit) to obtain a low-order cross-modal feature representation; c) apply second-stage feature fusion to the low-order representation with a multi-head self-attention mechanism and a densely connected graph convolutional network, extracting the cross-modal semantic dependencies to obtain a high-order cross-modal feature representation; d) feed the high-order representation into a fully connected network and finally into a classifier to predict the answer to the question. Compared with existing methods, the method offers stronger feature extraction and a higher degree of cross-modal information fusion, and can improve chart question-answering accuracy to a certain degree.

Description

Chart question-answering method based on graph neural network
Technical Field
The invention relates to the field of visual question answering within multi-modal data fusion research, and in particular to a chart question-answering method based on a graph neural network.
Background
The visual question-answering task is a challenging and practical task in the field of artificial intelligence. It is a multi-modal task at the intersection of computer vision and natural language processing. Chart question answering (CQA) is one task type within visual question answering: a multi-modal reasoning task over chart images and textual questions. Given a chart and a question about it, the goal of chart question answering is to combine the visual information of the chart with the textual content of the question, and to obtain the answer by fusing and reasoning over the two modalities.
Early studies of chart question answering were mainly based on simple combinations of neural networks and a single cross-modal interaction mechanism. In one approach, a question representation produced by a unidirectional long short-term memory network is concatenated with image features extracted by a convolutional neural network and fed into a classifier. In another, based on a simple relation network, each pixel of the feature map produced by a convolutional neural network is treated as an object; the feature vectors of every pair of pixels are combined into relation-pair features covering all pixel pairs, a fully connected network reasons over each pairwise relation, and the relation pairs are finally summed to predict the answer. Such models built on simple feature combinations can only capture low-order image and text information, carry much redundant information, and perform poorly in practice.
In recent years, researchers have designed new algorithms to improve performance on the chart question-answering task. Some adopt a conventional attention mechanism, iterated several times, to gather features from the key regions of the chart. Others use a pretrained open-source optical character recognition (OCR) engine to mine the textual information in the chart and align it with the question embedding, extract a chart feature map with a pretrained ResNet-152, and finally inject the feature map as hidden-layer information into a long short-term memory network through a spatial attention mechanism to obtain the text representation. These methods, however, struggle to model the structural features within a chart, especially the comparative relations among chart elements such as size, proportion and line trend; they also struggle to extract the high-order semantic relations between the chart and the question keywords, so the cross-modal features cannot interact sufficiently.
Disclosure of Invention
In view of the above, the present invention provides a chart question-answering method based on a graph neural network. To model the structural features within a chart, in particular the comparative relations among chart elements such as size, proportion and line trend, the method extracts chart features with a visual graph neural network. To extract the high-order semantic relations between the chart and the question keywords and let the cross-modal features interact and fuse fully, the method performs cross-modal fusion in two distinct stages.
To achieve this, the invention adopts the following technical scheme:
A chart question-answering method based on a graph neural network comprises the following steps:
(1) Extract the two modal feature representations, image and text, with a visual graph neural network and a bidirectional long short-term memory network respectively, then align and concatenate them, comprising the following substeps:
(1.1) Take the scientific charts collected from the FigureQA dataset (vertical bar charts, horizontal bar charts, line charts, pie charts and scatter plots), their corresponding questions, and the manually labeled answers as the training and validation sets, split in an 8:2 ratio; the manual labels are the answers given to the questions;
(1.2) Model the visual information of the chart with a visual graph neural network. First partition the whole chart into a grid of sub-images and flatten the RGB three-channel matrix of each sub-image along the channel direction, so that each sub-image is represented by a vector. Treat each sub-image as a node, find the K nodes closest to the current node with a K-nearest-neighbor algorithm, and create dependency edges between the current node and those K nodes. Iterating over all nodes builds a dependency matrix between every sub-image and the others; all sub-image vectors and the dependency matrix are then fed into the graph neural network to model the image feature representation;
(1.3) Preprocess the input text: obtain a word vector for each word in the question with BERT, then feed the whole question into a bidirectional long short-term memory network for sequence modeling to obtain the text feature representation;
(1.4) Align and concatenate the channel dimension of the image features with the word dimension of the text features to obtain the cross-modal feature representation.
(2) Perform first-stage feature fusion on the cross-modal feature representation to obtain a low-order cross-modal feature representation, comprising the following substeps:
(2.1) Extract features from the cross-modal representation with a convolutional neural network whose convolution kernel is 1×1;
(2.2) Feed the features extracted by the 1×1 convolution into a bidirectional GRU network for cross-modal fusion, obtaining the low-order cross-modal feature representation.
(3) Perform second-stage feature fusion on the low-order cross-modal feature representation to obtain a high-order cross-modal feature representation, comprising the following substeps:
(3.1) Apply a multi-head self-attention mechanism to the low-order representation obtained from the first fusion stage, generating one self-attention matrix per head; average-pool these self-attention matrices to obtain an interaction attention weight matrix;
(3.2) Feed the low-order representation and the interaction attention weight matrix into a densely connected deep graph convolutional network to model the semantic dependencies between text and image, obtaining a mid-order cross-modal feature representation;
(3.3) Apply average pooling to the mid-order representation to obtain the high-order cross-modal feature representation.
(4) Feed the high-order cross-modal feature representation into a classifier to obtain the prediction, comprising the following substep:
(4.1) Feed the high-order representation into a fully connected network and then into a classifier to obtain the prediction.
Compared with the prior art, the invention adopting the above technical scheme has the following beneficial effects:
The visual graph neural network models the dependencies among the different sub-image blocks of a chart image, and can better extract the structural features within the chart, in particular the comparative relations among chart elements such as size, proportion and line trend.
The proposed two-stage cross-modal feature fusion mechanism lets the cross-modal features interact fully and better mines the high-order semantic relations between the chart and the question keywords.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of an embodiment of the present invention.
Detailed Description
The invention is further illustrated by the following examples in conjunction with the accompanying drawings.
As shown in FIG. 1, the chart question-answering method based on a graph neural network of the present invention comprises the following steps:
(1) Extract the two modal feature representations, image and text, with a visual graph neural network and a bidirectional long short-term memory network respectively, then align and concatenate them, comprising the following substeps:
(1.1) Take the scientific charts collected from the FigureQA dataset (vertical bar charts, horizontal bar charts, line charts, pie charts and scatter plots), their corresponding questions, and the manually labeled answers as the training and validation sets, split in an 8:2 ratio; the manual labels are the answers given to the questions.
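For illustration only (this sketch is not part of the original disclosure), the 8:2 split of substep (1.1) could look as follows; the record format, the helper name split_8_2 and the fixed seed are assumptions.

```python
# Hypothetical sketch of the 8:2 train/validation split in substep (1.1).
import random

def split_8_2(samples, seed=0):
    """Shuffle (chart, question, answer) triples and split them 8:2."""
    rng = random.Random(seed)
    samples = list(samples)              # copy so the caller's list is untouched
    rng.shuffle(samples)
    cut = int(len(samples) * 0.8)
    return samples[:cut], samples[cut:]  # -> (training set, validation set)
```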
(1.2) Model the visual information of the chart with a visual graph neural network. First partition the whole chart into a grid of sub-images and flatten the RGB three-channel matrix of each sub-image along the channel direction, so that each sub-image is represented by a vector. Treat each sub-image as a node, find the K nodes closest to the current node with a K-nearest-neighbor algorithm, and create dependency edges between the current node and those K nodes. Iterating over all nodes builds a dependency matrix between every sub-image and the others; all sub-image vectors and the dependency matrix are then fed into the graph neural network to model the image feature representation.
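A minimal PyTorch sketch (not part of the original disclosure) of substep (1.2): the chart is cut into a grid of sub-images, each flattened into a node vector, and a K-nearest-neighbor dependency matrix is built over the nodes. The grid size and K are illustrative assumptions.

```python
# Hypothetical sketch of substep (1.2): grid the chart into sub-image nodes
# and build a KNN dependency matrix. `grid` and `k` are assumed values.
import torch

def build_patch_graph(image: torch.Tensor, grid: int = 16, k: int = 8):
    c, h, w = image.shape                                  # (3, H, W)
    ph, pw = h // grid, w // grid
    # (grid*grid, 3*ph*pw): one flattened RGB vector per sub-image (node)
    patches = (image.unfold(1, ph, ph).unfold(2, pw, pw)
                    .permute(1, 2, 0, 3, 4)
                    .reshape(grid * grid, -1))
    dist = torch.cdist(patches, patches)                   # pairwise distances
    knn = dist.topk(k + 1, largest=False).indices[:, 1:]   # drop self-match
    adj = torch.zeros(grid * grid, grid * grid)
    adj.scatter_(1, knn, 1.0)                              # dependency matrix
    return patches, adj  # both are then fed to the graph neural network
```

Here "closest" is taken as Euclidean distance between the flattened sub-image vectors, which is one plausible reading of the disclosure.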
(1.3) Preprocess the input text: obtain a word vector for each word in the question with BERT, then feed the whole question into a bidirectional long short-term memory network for sequence modeling to obtain the text feature representation.
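For illustration only, a sketch of substep (1.3) using the Hugging Face transformers API; the checkpoint name and hidden size are assumptions, as the disclosure does not name a specific BERT variant.

```python
# Hypothetical sketch of substep (1.3): BERT word vectors -> bidirectional LSTM.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed variant
bert = BertModel.from_pretrained("bert-base-uncased")
bilstm = torch.nn.LSTM(input_size=768, hidden_size=256,
                       batch_first=True, bidirectional=True)

def encode_question(question: str) -> torch.Tensor:
    tokens = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        word_vecs = bert(**tokens).last_hidden_state  # (1, T, 768)
    text_feat, _ = bilstm(word_vecs)                  # (1, T, 512)
    return text_feat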
(1.4) Align and concatenate the channel dimension of the image features with the word dimension of the text features to obtain the cross-modal feature representation.
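A sketch (assumed, not from the disclosure) of substep (1.4): both modalities are projected to a common hidden size and concatenated along the sequence axis; the common size d = 512 is chosen only to match the sketches above.

```python
# Hypothetical sketch of substep (1.4): align dimensions, then concatenate.
import torch

d = 512
img_proj = torch.nn.LazyLinear(d)  # sub-image vector -> d; in_features inferred

def concat_modalities(img_nodes: torch.Tensor, text_feat: torch.Tensor):
    """img_nodes: (1, N, patch_dim); text_feat: (1, T, d) -> (1, N + T, d)."""
    return torch.cat([img_proj(img_nodes), text_feat], dim=1)
```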
(2) Perform first-stage feature fusion on the cross-modal feature representation to obtain a low-order cross-modal feature representation, comprising the following substeps:
(2.1) Extract features from the cross-modal representation with a convolutional neural network whose convolution kernel is 1×1.
(2.2) Feed the features extracted by the 1×1 convolution into a bidirectional GRU network for cross-modal fusion, obtaining the low-order cross-modal feature representation.
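A hedged PyTorch sketch of step (2), continuing the assumed sizes from the sketches above: a 1×1 convolution over the channel dimension followed by a bidirectional GRU. Channel counts are assumptions.

```python
# Hypothetical sketch of step (2): 1x1 convolution + bidirectional GRU.
import torch

conv1x1 = torch.nn.Conv1d(in_channels=512, out_channels=256, kernel_size=1)
bigru = torch.nn.GRU(input_size=256, hidden_size=128,
                     batch_first=True, bidirectional=True)

def first_stage_fusion(cross_modal: torch.Tensor) -> torch.Tensor:
    x = conv1x1(cross_modal.transpose(1, 2))  # Conv1d expects (B, C, L)
    x = x.transpose(1, 2)                     # back to (1, N + T, 256)
    low_order, _ = bigru(x)                   # (1, N + T, 256)
    return low_order
```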
(3) Perform second-stage feature fusion on the low-order cross-modal feature representation to obtain a high-order cross-modal feature representation, comprising the following substeps:
(3.1) Apply a multi-head self-attention mechanism to the low-order representation obtained from the first fusion stage, generating one self-attention matrix per head; average-pool these self-attention matrices to obtain an interaction attention weight matrix.
(3.2) Feed the low-order representation and the interaction attention weight matrix into a densely connected deep graph convolutional network to model the semantic dependencies between text and image, obtaining a mid-order cross-modal feature representation.
(3.3) Apply average pooling to the mid-order representation to obtain the high-order cross-modal feature representation. An illustrative sketch of substeps (3.1) and (3.2) follows.
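A hedged sketch of substeps (3.1) and (3.2), under the same assumed sizes: head-averaged self-attention weights act as a soft adjacency matrix for a densely connected graph convolution; the head count and layer count are illustrative assumptions.

```python
# Hypothetical sketch of step (3): multi-head self-attention averaged over
# heads into an interaction weight matrix, then a densely connected GCN.
import torch
import torch.nn.functional as F

heads, d = 8, 256
qk = torch.nn.Linear(d, 2 * d)   # query and key projections

def interaction_matrix(low_order: torch.Tensor) -> torch.Tensor:
    q, k = qk(low_order).chunk(2, dim=-1)
    b, l, _ = q.shape
    q = q.view(b, l, heads, d // heads).transpose(1, 2)
    k = k.view(b, l, heads, d // heads).transpose(1, 2)
    att = torch.softmax(q @ k.transpose(-2, -1) / (d // heads) ** 0.5, dim=-1)
    return att.mean(dim=1)       # average-pool the per-head matrices: (1, L, L)

class DenseGCN(torch.nn.Module):
    """Densely connected graph convolution: layer i sees all earlier outputs."""
    def __init__(self, d: int, layers: int = 3):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            torch.nn.Linear(d * (i + 1), d) for i in range(layers))

    def forward(self, x, adj):   # x: (1, L, d), adj: (1, L, L)
        feats = [x]
        for lin in self.layers:
            feats.append(F.relu(lin(adj @ torch.cat(feats, dim=-1))))
        return feats[-1]         # mid-order cross-modal representation
```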
(4) Feed the high-order cross-modal feature representation into a classifier to obtain the prediction, comprising the following substep:
(4.1) Feed the high-order representation into a fully connected network and then into a classifier to obtain the prediction.
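Finally, a hedged sketch of substep (3.3) pooling and the step (4) classifier; num_answers = 2 is an assumption based on FigureQA's yes/no answers.

```python
# Hypothetical sketch of substep (3.3) pooling plus the step (4) classifier.
import torch

d, num_answers = 256, 2
classifier = torch.nn.Sequential(
    torch.nn.Linear(d, d), torch.nn.ReLU(),   # fully connected network
    torch.nn.Linear(d, num_answers))          # classification layer

def predict(mid_order: torch.Tensor) -> torch.Tensor:
    high_order = mid_order.mean(dim=1)             # average pooling (substep 3.3)
    return classifier(high_order).softmax(dim=-1)  # answer probabilities
```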
Examples
Referring to FIG. 2, in this embodiment the scientific charts collected from the FigureQA dataset and their corresponding questions are first preprocessed and then fed into the visual graph neural network and the bidirectional long short-term memory network respectively, yielding the modal features of the chart and of the text; the two feature representations are concatenated and cross-modal feature fusion is performed. First-stage feature fusion is applied to the concatenated cross-modal representation with a 1×1 convolution kernel and a bidirectional GRU (gated recurrent unit) to obtain a low-order cross-modal feature representation. Second-stage feature fusion is applied to the low-order representation with a multi-head self-attention mechanism and a densely connected graph convolutional network; the cross-modal semantic dependencies are extracted and average pooling is applied to obtain the high-order cross-modal feature representation. The high-order feature vector is fed into a fully connected network, whose output is the question-answer prediction.
The above description is only a preferred embodiment of the present invention. Modifications may be made within the scope of the present invention as defined by the appended claims, and all such modifications fall within the scope of the present invention.

Claims (1)

1. A chart question-answering method based on a graph neural network, characterized by comprising the following specific steps:
Step 1: extract the modal feature representations of the image and the text with a visual graph neural network and a bidirectional long short-term memory network respectively, and align and concatenate the two representations to obtain a cross-modal feature representation;
Step 2: perform first-stage feature fusion on the cross-modal feature representation to obtain a low-order cross-modal feature representation;
Step 3: perform second-stage feature fusion on the low-order cross-modal feature representation to obtain a high-order cross-modal feature representation;
Step 4: feed the high-order cross-modal feature representation into a classifier to obtain the question-answer prediction; wherein:
Step 1 specifically comprises:
1.1: taking the scientific charts collected from the FigureQA dataset, namely vertical bar charts, horizontal bar charts, line charts, pie charts and scatter plots, their corresponding questions, and the manually labeled answers as the training set and the validation set, split in an 8:2 ratio, the manual labels being the answers given to the questions;
1.2: modeling the visual information of the chart with a visual graph neural network: first partitioning the whole chart into a grid of sub-images and flattening the RGB three-channel matrix of each sub-image along the channel direction, so that each sub-image is represented by a vector; treating each sub-image as a node, finding the K nodes closest to the current node with a K-nearest-neighbor algorithm, and creating dependency edges between the current node and those K nodes; iterating over all nodes to build a dependency matrix between every sub-image and the others, then feeding all sub-image vectors and the dependency matrix into the graph neural network to model the image feature representation;
1.3: preprocessing the input text: obtaining a word vector for each word in the question with BERT, then feeding the whole question into a bidirectional long short-term memory network for sequence modeling to obtain the text feature representation;
1.4: aligning and concatenating the channel dimension of the image features with the word dimension of the text features to obtain the cross-modal feature representation;
Step 2 specifically comprises:
2.1: extracting features from the cross-modal representation with a convolutional neural network whose convolution kernel is 1×1;
2.2: feeding the features extracted by the 1×1 convolution into a bidirectional GRU network for cross-modal fusion, obtaining the low-order cross-modal feature representation;
Step 3 specifically comprises:
3.1: applying a multi-head self-attention mechanism to the low-order representation obtained from the first fusion stage, generating one self-attention matrix per head, and average-pooling these self-attention matrices to obtain an interaction attention weight matrix;
3.2: feeding the low-order representation and the interaction attention weight matrix into a densely connected deep graph convolutional network to model the semantic dependencies between text and image, obtaining a mid-order cross-modal feature representation;
3.3: applying average pooling to the mid-order representation to obtain the high-order cross-modal feature representation.
CN202211142426.1A | Priority date: 2022-09-20 | Filing date: 2022-09-20 | Title: Chart question-answering method based on graph neural network | Status: Pending | Published as CN115375984A (en)

Priority Applications (1)

CN202211142426.1A | Priority date: 2022-09-20 | Filing date: 2022-09-20 | Title: Chart question-answering method based on graph neural network

Applications Claiming Priority (1)

CN202211142426.1A | Priority date: 2022-09-20 | Filing date: 2022-09-20 | Title: Chart question-answering method based on graph neural network

Publications (1)

CN115375984A | Publication date: 2022-11-22

Family

Family ID: 84072506

Family Applications (1)

CN202211142426.1A | Status: Pending | Publication: CN115375984A (en) | Priority date: 2022-09-20 | Filing date: 2022-09-20 | Title: Chart question-answering method based on graph neural network

Country Status (1)

CN: CN115375984A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
CN117390165A * | Priority date: 2023-10-27 | Publication date: 2024-01-12 | Assignee: 北京中科闻歌科技股份有限公司 (Beijing Zhongke Wenge Technology Co., Ltd.) | Title: Chart question-answering method, system, medium and equipment based on a multi-modal large model


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
CB03: Change of inventor or designer information
Inventor after: Shen Qiwei; He Liang; Xiao Luwei; Wu Xingjiao; Ma Tianlong; He Jun
Inventor before: Shen Weiqi; He Liang; Xiao Luwei; Wu Xingjiao; Ma Tianlong; He Jun