CN115186072A - Knowledge graph visual question-answering method based on double-process cognitive theory - Google Patents

Knowledge graph visual question-answering method based on double-process cognitive theory

Info

Publication number
CN115186072A
CN115186072A (application CN202110374169.3A)
Authority
CN
China
Prior art keywords
graph
picture
fact
representation
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110374169.3A
Other languages
Chinese (zh)
Inventor
何小海
刘露平
王美玲
卿粼波
陈洪刚
吴小强
滕奇志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202110374169.3A priority Critical patent/CN115186072A/en
Publication of CN115186072A publication Critical patent/CN115186072A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a knowledge graph visual question-answering method based on the double-process cognitive theory, which comprises the following steps: (1) Question-picture joint characterization: for the input question and picture, features are extracted with the pre-trained models BERT and Faster-RCNN respectively, and then fed into a two-stream Transformer model to learn the question-picture joint representation; (2) Fact graph and semantic graph construction: for each question-picture pair, the fact graph is constructed by retrieving facts from a knowledge base via semantic matching, and the semantic graph is constructed via image description; (3) Evidence aggregation: a graph reasoning network selects evidence from each of the two graphs, and a cross-modal graph reasoning network then aggregates the evidence from the semantic graph into the fact graph; (4) Answer reasoning: binary classification over the nodes of the fact graph yields the answer. The method has broad application prospects in fields such as education and entertainment.

Description

Knowledge graph visual question-answering method based on double-process cognitive theory
Technical Field
The invention provides a knowledge graph visual question-answering method based on the double-process cognitive theory, and belongs to the intersection of the fields of natural language processing and computer vision.
Background
Enabling an intelligent agent to understand the world by analyzing visual and linguistic information has been a hot research topic in recent years at the intersection of computer vision and natural language processing. Related research has driven the development of many applications, such as Visual Question Answering (VQA), image retrieval, and image description. Among them, VQA is a challenging task that requires a model to answer arbitrary questions about a given image. To promote the development of VQA technology, scholars have carried out a great deal of preliminary research in recent years and made substantial progress. However, existing VQA methods focus on answering questions from the picture content alone and cannot answer questions that must be answered with common-sense knowledge.
To advance the field, Wang et al. proposed the task of Fact-Based Visual Question Answering (FVQA) (P. Wang, Q. Wu, C. Shen, A. Dick and A. van den Hengel, "FVQA: Fact-Based Visual Question Answering," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 10, pp. 2413-2427, 1 Oct. 2018, doi: 10.1109/TPAMI.2017.2754246). They also published a new dataset that provides an additional supporting fact for each question-answer pair and requires the model to answer the question through joint analysis of the image and external knowledge. Wang et al. first parse the question, map it into the knowledge graph, and then use keyword matching to find the correct answer. This method has significant drawbacks: it becomes ineffective when the question mentions no obvious visual concept, or when it uses synonyms and near-synonyms. Subsequent researchers therefore proposed a semantic-learning-based retrieval method, which projects image-question-visual concepts and candidate facts into a learned embedding space and finds supporting facts by computing the corresponding distances. However, this method is inefficient because it evaluates one fact node at a time; furthermore, it cannot exploit the structural information of the external knowledge base. To address this problem, Narasimhan et al. proposed a graph-inference-based approach that selects answers by reasoning over the entire graph (Narasimhan M, Schwing A G. Straight to the facts: Learning knowledge base retrieval for factual visual question answering [C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018).
That method constructs an entity graph in which each node is represented by the concatenation of entity, image, and question representations; a graph convolutional network then aggregates messages to obtain updated node features, and the answer is finally predicted from the updated features. Because a question attends to only part of the visual content, this approach inevitably introduces noise information.
For humans, inferring the answer to a question about a given picture can be divided into two steps: (1) the brain quickly grasps the content presented in the image and the visual information the question attends to by analyzing the image and the question; (2) it then performs deeper analysis on the output of the first step together with knowledge stored in memory to find the correct answer. This process is known in cognitive science as dual-process theory. Dual-process theory holds that the human brain first perceives external input through a system called System 1, an implicit, unconscious process. This information is then fed into a system called System 2, which performs an explicit, conscious, controlled inference process. That reasoning makes sequential inferences in working memory; it is slow, but it is a distinctly human capability of higher cognition. From this perspective, the FVQA problem can be solved with the dual-process cognitive theory: System 1 quickly extracts information from the image and question, and System 2 performs deep reasoning with external knowledge to find the correct answer.
Inspired by the dual-process cognitive theory, the invention proposes a new framework that solves the FVQA problem with a double-process cognitive system. Specifically, System 1 in the framework of the invention is implemented by a multimodal Transformer network that uses a cross-attention mechanism to capture the complex relationships between questions and images; System 1 outputs a joint representation of the image and the question. For System 2, the invention uses a Graph Neural Network (GNN) to reason over two external knowledge graphs (a fact graph and a semantic graph) to find the correct answer. System 2 first performs intra-modality evidence selection and aggregation, gathering question-related evidence from each of the two graphs; it then performs cross-modal selection to aggregate evidence from the semantic graph into the fact graph to better infer the answer. During reasoning, the invention proposes a dual attention mechanism of node-level attention and path-level attention that captures valuable information from key nodes and paths, making the reasoning process more reasonable and further improving the performance of knowledge-graph-based visual question answering.
Disclosure of Invention
Inspired by the dual-process cognitive theory, the invention proposes a new framework that solves the FVQA problem with a double-process cognitive system. Specifically, System 1 in the framework is implemented by a multimodal Transformer network that uses a cross-attention mechanism to capture the complex relationships between questions and images and outputs a joint question-image representation. For System 2, a graph neural network is used to reason over two external knowledge graphs (a fact graph and a semantic graph) to find the correct answer.
The invention realizes the purpose through the following technical scheme:
1. The invention relates to a knowledge-graph visual question-answering framework based on the double-process cognitive theory, shown in Fig. 1, which comprises a collaborative perception module (System 1) and an explicit reasoning module (System 2). The specific reasoning process of the framework comprises the following steps:
(1) Use the text pre-trained model BERT and the image pre-trained model Faster-RCNN to extract features of the input text and image respectively: for each question, the [CLS] and [SEP] tokens are added at its start and end positions, and the sequence is fed into the BERT model for feature extraction; for each picture, 36 target regions are extracted, each region containing the visual features of a target and the spatial position features of that target.
(2) Feed the picture and text features extracted in step (1) into a two-stream Transformer network to learn the joint picture-text representation: one single-stream Transformer network learns picture-guided question representation, the other learns question-guided picture representation; finally, the outputs of the two streams are average-pooled and then multiplied to obtain the joint representation.
(3) Construct a fact graph and a semantic graph for each question-picture pair: the fact graph is selected from an external knowledge base by sentence-level semantic matching, and the semantic graph is obtained by first generating a semantic description of the picture and then semantically parsing the generated sentences.
(4) Graph-reasoning-based evidence aggregation: first, for the fact graph and the semantic graph, two attention-based graph reasoning networks aggregate evidence information within each graph respectively; then a cross-modal reasoning network aggregates question-related evidence from the semantic graph into the fact graph.
(5) Answer prediction: compute the dot product of the question-picture joint representation with the feature vector of each node in the fact graph to obtain each node's semantic relevance score to the question, and finally feed the scores into a Sigmoid layer to predict the corresponding answer.
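The scoring in step (5) can be sketched as follows. This is an illustrative NumPy sketch, not the invention's actual code; array shapes and the function name are hypothetical:

```python
import numpy as np

def predict_answers(joint_rep, node_feats):
    """Score each fact-graph node by its dot product with the
    question-picture joint representation, then apply a sigmoid
    to obtain a per-node answer probability (binary classification)."""
    scores = node_feats @ joint_rep               # (num_nodes,) dot products
    return 1.0 / (1.0 + np.exp(-scores))          # sigmoid layer

# Toy usage: 4 candidate fact-graph nodes with 8-dim features
rng = np.random.default_rng(0)
probs = predict_answers(rng.standard_normal(8), rng.standard_normal((4, 8)))
best = int(np.argmax(probs))                      # node predicted as the answer
```

The node with the highest probability is taken as the answer entity.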
In step (1), the words of the input question are first vector-initialized with the BERT pre-trained model (the bert-base-uncased version is used), yielding the word feature vectors $C = [c_0, c_1, \ldots, c_n]$, where each word vector is 768-dimensional. Then, 36 target regions of the input picture are extracted with the Faster-RCNN pre-trained model, where each region comprises the appearance feature $f_i$ of one target, the spatial position feature $b_i$ of the target, and a corresponding label. To capture visual and spatial features simultaneously, the invention first projects the appearance feature and the spatial position feature into the same dimension (768 in the invention), and then averages the two projected features to obtain the feature representation of each target object.
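The projection-and-average step above can be sketched in NumPy. The feature dimensions (2048-d appearance vectors and 4-d box geometry, standard for Faster-RCNN) and the projection matrices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 768  # shared embedding dimension, as in the invention

# Hypothetical Faster-RCNN outputs: 36 regions per picture
appearance = rng.standard_normal((36, 2048))   # appearance features f_i
boxes      = rng.standard_normal((36, 4))      # spatial position features b_i

# Learnable projections into the shared 768-d space (random here)
W_app = rng.standard_normal((2048, D)) * 0.02
W_box = rng.standard_normal((4, D)) * 0.02

# Project both feature types to the same dimension, then average them
region_feats = (appearance @ W_app + boxes @ W_box) / 2.0
```

`region_feats` then plays the role of the picture feature vector V fed to the two-stream Transformer.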
In step (2), a cross-modal Transformer is adopted to align the relationship between the two modalities and to learn image-attended question representations and question-attended image representations, as shown in Fig. 2.
The question feature vector C and the picture feature vector V obtained in step (1) are fed into a two-stream Transformer network to learn the complex interaction between question and picture: one single-stream Transformer network learns picture-guided question representation, and the other learns question-guided picture representation. In picture-guided question representation learning, the question feature vector serves as the query vector and the picture feature vector serves as the key and value vectors; the dependency between picture and question is computed as:
$\hat{C} = \mathrm{softmax}\big((C W'_q)(V W'_k)^{\top} / \sqrt{d}\big)\,(V W'_v)$ (1)
$W'_{*}$ in formula (1) denotes parameter matrices to be learned by the model. In question-guided picture representation learning, the picture feature vector serves as the query vector and the question feature vector serves as the key and value vectors; the dependency between the two is computed as:
$\hat{V} = \mathrm{softmax}\big((V W_q)(C W_k)^{\top} / \sqrt{d}\big)\,(C W_v)$ (2)
w in formula (1) * Representing a parameter matrix to be learned by the model; after a multi-layered (in the present invention, specifically 9 layers) joint characterization of the image and the text, [ CLS ] of the text sequence]The bit features act as joint representation features for the entire problem-picture.
Constructing the fact graph and the semantic graph in step (3): in the first step, for each question-picture pair, the input question and the labels of the objects detected in the picture are concatenated in order to obtain a question-picture instance set; and each triple in the external knowledge base is converted into a natural language sentence (by concatenating head entity, relation, and tail entity in order) to obtain the corresponding fact instance set. In the second step, the question-picture instance set and the fact instance set are encoded with the pre-trained Universal Sentence Encoder to obtain the corresponding instance representations; the cosine similarity between the feature representation of each instance object in the question-picture instance set and each fact instance representation is computed in turn to obtain the corresponding association scores, and finally all fact instances are ranked by cosine similarity score, with the top 10 highest-scoring instances kept as candidate supporting facts. In the third step, the fact graph is constructed from the retrieved candidate supporting facts: nodes in the graph are entities in the knowledge base and edges are relations between two entities. After the fact graph is constructed, nodes and edges are given corresponding initial representations with BERT, each being the average of the word embeddings of all words in the node or edge.
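The retrieval step above (triple linearization, then cosine-similarity ranking) can be sketched as follows. Random vectors stand in for the sentence-encoder embeddings; the function names are illustrative:

```python
import numpy as np

def triple_to_sentence(head, relation, tail):
    """Linearize a knowledge-base triple by concatenating
    head entity, relation, and tail entity in order."""
    return f"{head} {relation} {tail}"

def top_k_facts(query_vecs, fact_vecs, k=10):
    """Rank fact instances by cosine similarity to the question-picture
    instances and keep the k highest-scoring as candidate supporting facts."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    f = fact_vecs / np.linalg.norm(fact_vecs, axis=1, keepdims=True)
    scores = (q @ f.T).max(axis=0)       # best score per fact over all instances
    order = np.argsort(-scores)          # descending by cosine similarity
    return order[:k], scores[order[:k]]

# Toy usage: 3 question-picture instances, 50 linearized facts, 16-dim embeddings
rng = np.random.default_rng(0)
idx, sc = top_k_facts(rng.standard_normal((3, 16)),
                      rng.standard_normal((50, 16)), k=10)
```

In the invention the embeddings would come from the Universal Sentence Encoder rather than random vectors.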
The graph-reasoning-based evidence aggregation in step (4) is divided into intra-modality evidence aggregation and inter-modality evidence aggregation:
For intra-modality evidence aggregation, the dual-level attention mechanism proposed by the invention, comprising node-level attention and path-level attention, is first used for feature selection and aggregation. In the node-level attention computation, the attention score $\alpha_i$ between each node $v_i$ in the graph and the question-picture joint representation $h$ is first computed as:

$\alpha_i = \mathrm{softmax}\big(W[v_i; h] + b\big)$ (3)
The attention score is then multiplied by the initial feature vector of each node in the graph to obtain picture-question-guided node features. The path-level attention computation is concerned with which path matters more to the inference; each path is formed by the nodes and edges directly connected to a target node and is defined as:
$\phi_{ij} = (v_i, r_{ij}, v_j)$ (4)
where $v_i$, $r_{ij}$, $v_j$ respectively denote the head-node representation, the relation representation, and the tail-node feature representation in the fact graph. With the path representation obtained, the path-level attention is computed as:

$\beta_{ij} = \mathrm{softmax}\big(W[\phi_{ij}; h] + b\big)$ (5)
Features are then aggregated from the neighbor nodes by the message-propagation network; the neighbor feature aggregation is computed as:

$m_i = \sum_{j \in \mathcal{N}(i)} \beta_{ij}\, \alpha_j v_j$ (6)
Finally, the neighbor features are fused with the target node's own features to further update the target node. To prevent the neighbor features from excessively overwriting the node's initial features, a gating mechanism is designed to control the ratio of the neighbor features to the target node's original features; the target node's feature update is computed as:

$g_i = \sigma\big(W[m_i; v_i] + b\big)$ (7)

$\tilde{v}_i = g_i \odot m_i + (1 - g_i) \odot v_i$ (8)
the process of evidence selection in the semantic graph is the same as the steps in the fact graph, and the description is not repeated here;
For inter-modality evidence aggregation, attention weight coefficients between each node in the fact graph and each node in the semantic graph are computed under the guidance of the question; each semantic-graph node is then weighted and summed by these coefficients to obtain the question-related features of the semantic graph:

$\gamma_{ij} = \mathrm{softmax}_j\big((W[v_i^f; h])^{\top} (W'[v_j^s; h])\big)$ (9)

$m_i^s = \sum_j \gamma_{ij}\, v_j^s$ (10)
Finally, the feature vector aggregated from the semantic graph is fused with the original node feature in the fact graph to obtain the cross-modally fused updated feature. To prevent the features from the semantic graph from excessively overwriting the fact-node features, a corresponding gating mechanism is designed to control the proportion of the two modalities:

$g_i = \sigma\big(W[m_i^s; v_i^f] + b\big)$ (11)

$\tilde{v}_i^f = g_i \odot m_i^s + (1 - g_i) \odot v_i^f$ (12)
and finally, using the updated features for the inference of the answer.
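The cross-modal aggregation and gating described above can be sketched as follows. This is an illustrative NumPy sketch with toy shapes; a plain dot-product attention stands in for the question-guided attention, which is a simplifying assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_modal_aggregate(fact_nodes, sem_nodes, Wg, bg):
    """For each fact-graph node, attend over the semantic-graph nodes,
    then gate the aggregated message against the original fact-node
    feature so the semantic evidence cannot fully overwrite it."""
    att = softmax(fact_nodes @ sem_nodes.T, axis=1)    # (n_fact, n_sem) weights
    msg = att @ sem_nodes                              # evidence from semantic graph
    g = sigmoid(np.concatenate([msg, fact_nodes], axis=1) @ Wg + bg)
    return g * msg + (1.0 - g) * fact_nodes

# Toy usage: 4 fact nodes, 6 semantic nodes, 8-dim features
rng = np.random.default_rng(0)
d = 8
fact = rng.standard_normal((4, d))
sem  = rng.standard_normal((6, d))
Wg = rng.standard_normal((2 * d, d)) * 0.1
fused = cross_modal_aggregate(fact, sem, Wg, np.zeros(d))
```

The fused fact-node features are what the answer-prediction step scores against the joint representation.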
Drawings
Fig. 1 is the main framework of the network model proposed by the present invention.
FIG. 2 is the structure of the cross-modal Transformer network.
Detailed Description
The invention will be further described with reference to the accompanying drawings in which:
Fig. 1 shows the structure of the whole network, which is composed of two parts: a collaborative perception module (System 1) and an explicit reasoning module (System 2). System 1 in the framework is implemented by a multimodal Transformer network that uses a cross-attention mechanism to capture the complex relationships between the question and the image, and outputs a joint representation of the two. For System 2, the invention uses a Graph Neural Network (GNN) to reason over two external knowledge graphs (a fact graph and a semantic graph) to find the correct answer.
Fig. 2 shows the cross-modal Transformer framework, which feeds the extracted picture and text features into a two-stream Transformer network to learn their joint representation: one single-stream Transformer network learns picture-guided question representation and the other learns question-guided picture representation; finally, the outputs of the two streams are average-pooled and then multiplied to obtain the joint representation.
Tables 1 and 2 show the experimental results of the invention on the public datasets FVQA and OK-VQA. The experiments show that, compared with the best existing models, the proposed model obtains the best results on the comprehensive evaluation metric F1.
TABLE 1 Experimental comparison of the inventive network model on the FVQA dataset with other existing models
TABLE 2 Experimental comparison of the network model of the present invention on OK-VQA data set with other existing models
The above embodiments are only preferred embodiments of the present invention and are not intended to limit its technical solutions; any technical solution that can be realized on the basis of the above embodiments without creative effort shall fall within the protection scope of this patent.

Claims (4)

1. A knowledge graph visual question-answering method based on the double-process cognitive theory, characterized by comprising the following steps:
(1) Using the text pre-trained model BERT and the target detection model Faster-RCNN to extract features of the input text and image respectively: for the text, adding the [CLS] and [SEP] marks at the beginning and end of each sentence and feeding it into the BERT model for feature extraction; for each picture, extracting 36 target regions, each comprising the appearance visual features of an object and the spatial position features of that object in the picture;
(2) Feeding the extracted picture and text features into a two-stream Transformer network to learn the joint picture-text representation, wherein one single-stream Transformer network learns picture-guided question representation and the other learns question-guided picture representation, and finally multiplying the average-pooled outputs of the two streams to obtain the question-image joint representation;
(3) Constructing a fact graph and a semantic graph for each question-picture pair, wherein the fact graph is constructed by retrieving candidate supporting facts from an external knowledge base by sentence-level semantic matching, and the semantic graph is constructed by first generating a semantic description of the picture and then semantically parsing the generated sentences;
(4) Graph-reasoning-based evidence aggregation, wherein two attention-based graph reasoning networks first aggregate evidence information within the fact graph and the semantic graph respectively, and a cross-modal reasoning network then aggregates question-related evidence from the semantic graph into the fact graph;
(5) Answer prediction, wherein the dot product of the question-picture joint representation with the feature vector of each node in the fact graph yields each node's semantic matching score with the question, and the scores are finally fed into a Sigmoid layer to predict the corresponding answer.
2. The method according to claim 1, wherein the question-picture joint characterization learning in step (2) comprises the following steps:
given the question feature vector C and the picture feature vector V, feeding them into a two-stream Transformer network to learn the complex interaction between question and picture, wherein one single-stream Transformer network learns picture-guided question representation and the other learns question-guided picture representation; in picture-guided question representation learning, the question feature vector serves as the query vector and the picture feature vector serves as the key and value vectors, and the dependency between picture and question is computed as:
$\hat{C} = \mathrm{softmax}\big((C W'_q)(V W'_k)^{\top} / \sqrt{d}\big)\,(V W'_v)$ (1)
$W'_{*}$ in formula (1) denotes parameter matrices to be learned by the model; in question-guided picture representation learning, the picture feature vector serves as the query vector and the question feature vector serves as the key and value vectors, and the dependency between the two is computed as:
$\hat{V} = \mathrm{softmax}\big((V W_q)(C W_k)^{\top} / \sqrt{d}\big)\,(C W_v)$ (2)
$W_{*}$ in formula (2) likewise denotes parameter matrices to be learned by the model; after multiple layers (specifically 9 in the invention) of joint characterization of image and text, the [CLS] feature of the text sequence serves as the joint representation of the entire question-picture pair.
3. The method according to claim 1, wherein the fact graph constructing method in step (3) comprises the following steps:
(1) For each question-picture pair, first concatenating the input question and the labels of the objects detected in the picture in order to obtain a question-picture instance set, then converting each triple in the external knowledge base into a natural language sentence (by concatenating head entity, relation, and tail entity in order) to obtain the corresponding fact instance set;
(2) Encoding the question-picture instance set and the fact instance set with the pre-trained Universal Sentence Encoder to obtain the corresponding instance representations;
(3) Computing the cosine similarity between the feature representation of each instance object in the question-picture instance set and each fact instance representation in turn to obtain the corresponding association scores, and finally ranking all fact instances by cosine similarity score, keeping the top 10 highest-scoring instances as candidate supporting facts;
(4) Constructing the fact graph from the retrieved candidate supporting facts, wherein nodes in the graph are entities in the knowledge base and edges are relations between two entities; after the fact graph is constructed, initializing nodes and edges with the BERT model, each initial representation being the average of the word embeddings of all words in the node or edge.
4. The method according to claim 1, wherein the evidence aggregation in step (4) is divided into intra-modality evidence aggregation and inter-modality evidence aggregation:
(1) For intra-modality evidence aggregation, the dual-level attention proposed by the invention, comprising node-level attention and path-level attention networks, is first used for feature selection and aggregation; in the node-level attention computation, the attention score $\alpha_i$ between each node $v_i$ in the graph and the question-picture joint representation $h$ is first computed as:

$\alpha_i = \mathrm{softmax}\big(W[v_i; h] + b\big)$ (3)
W and b in the above formula denote the parameter matrix and bias to be learned by the model (they carry the same meaning in the formulas below and are not repeated); after the attention weight coefficients are obtained, the attention score is multiplied by the initial feature vector of each node in the graph to obtain picture-question-guided node feature representations; the path-level attention computation is concerned with which path matters more to the inference, wherein each path is formed by the nodes and edges directly connected to a target node and is defined as:
$\phi_{ij} = (v_i, r_{ij}, v_j)$ (4)
wherein $v_i$, $r_{ij}$, $v_j$ respectively denote the head-node feature representation, the relation feature representation, and the tail-node feature representation in the fact graph; with the path representation obtained, the path-level attention is computed as:

$\beta_{ij} = \mathrm{softmax}\big(W[\phi_{ij}; h] + b\big)$ (5)
features are then aggregated from the neighbor nodes by a message-propagation mechanism, the neighbor feature aggregation being:

$m_i = \sum_{j \in \mathcal{N}(i)} \beta_{ij}\, \alpha_j v_j$ (6)
Finally, the neighbor-node features are fused with the target-node features to further update the target node. To prevent the neighbor features from updating the node features excessively, a gating mechanism is designed to control the ratio between the neighbor features and the original features of the target node; the complete feature-update process of the target node is given by:
[gated feature-update formulas, rendered only as images in the original publication]
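The aggregation and gated-update equations are likewise image-only in the published text; the sketch below assumes a common convex-gate form, where a sigmoid gate interpolates between the aggregated neighbor message and the target node's original feature. The gate parameterization is an assumption, not the patent's exact formula.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregate_neighbors(beta, neighbor_feats):
    """Message propagation: path-attention-weighted sum of neighbor features."""
    return (beta[:, None] * neighbor_feats).sum(axis=0)

def gated_update(v_target, message, W_gate, b_gate):
    """Gate g in (0, 1) controls how much the neighbor message
    overwrites the target node's original feature (assumed form)."""
    g = sigmoid(W_gate @ np.concatenate([v_target, message]) + b_gate)
    return g * message + (1.0 - g) * v_target

D = 4
rng = np.random.default_rng(2)
beta = np.array([0.7, 0.3])            # path-level attention weights
neighbors = rng.standard_normal((2, D))
v_target = rng.standard_normal(D)
message = aggregate_neighbors(beta, neighbors)
# With a strongly negative gate bias, the gate closes and the node keeps its feature.
v_kept = gated_update(v_target, message, np.zeros(2 * D), -50.0)
```

Because the update is a convex combination, the gate cannot push the node representation outside the span of its own feature and the incoming message, which is precisely the over-update protection the claim describes.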
The evidence aggregation process in the semantic graph is identical to the steps in the fact graph and is not repeated here;
(2) For inter-modal evidence aggregation, the attention weight coefficient between each node in the fact graph and every node in the semantic graph is computed under the guidance of the question; the features of the semantic-graph nodes are then weighted and summed according to these attention coefficients to obtain the relevant features from the semantic graph. The calculation process is as follows:
[cross-modal attention formulas, rendered only as images in the original publication]
Finally, the feature vector aggregated from the semantic graph is fused with the feature vector of the node in the fact graph to obtain a new, cross-modally fused feature. To prevent the semantic-graph features from updating the fact-graph node features excessively, a corresponding gating mechanism is designed to control the proportion of the two modalities' features; the specific calculation is as follows:
[gated cross-modal fusion formulas, rendered only as images in the original publication]
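As with the earlier steps, the cross-modal formulas appear only as images in the published text. The sketch below assumes a question-guided linear scoring over concatenated features for the cross-graph attention and a sigmoid gate for the fusion; all parameter names and functional forms are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_modal_aggregate(v_fact, sem_feats, question, W_att):
    """Attend from one fact-graph node over all semantic-graph nodes,
    guided by the question; return the attention-weighted semantic feature."""
    scores = np.array([W_att @ np.concatenate([v_fact, s, question])
                       for s in sem_feats])
    gamma = softmax(scores)
    return (gamma[:, None] * sem_feats).sum(axis=0)

def gated_fuse(v_fact, v_sem, W_gate, b_gate):
    """Gate controls the proportion of semantic-graph vs. fact-graph features."""
    g = sigmoid(W_gate @ np.concatenate([v_fact, v_sem]) + b_gate)
    return g * v_sem + (1.0 - g) * v_fact

D = 4
rng = np.random.default_rng(3)
sem_feats = rng.standard_normal((3, D))   # semantic-graph node features
question = rng.standard_normal(D)
v_fact = rng.standard_normal(D)
W_att = rng.standard_normal(3 * D)        # stand-in for the learned W
v_sem = cross_modal_aggregate(v_fact, sem_feats, question, W_att)
# With a strongly positive gate bias, the fused feature follows the semantic evidence.
fused = gated_fuse(v_fact, v_sem, np.zeros(2 * D), 50.0)
```

In practice the fused features would be computed for every fact-graph node and then fed to the answer-inference module described in the next step.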
Finally, the updated features are used for answer inference.
CN202110374169.3A 2021-04-07 2021-04-07 Knowledge graph visual question-answering method based on double-process cognitive theory Pending CN115186072A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110374169.3A CN115186072A (en) 2021-04-07 2021-04-07 Knowledge graph visual question-answering method based on double-process cognitive theory

Publications (1)

Publication Number Publication Date
CN115186072A true CN115186072A (en) 2022-10-14

Family

ID=83512224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110374169.3A Pending CN115186072A (en) 2021-04-07 2021-04-07 Knowledge graph visual question-answering method based on double-process cognitive theory

Country Status (1)

Country Link
CN (1) CN115186072A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116401390A (en) * 2023-05-19 2023-07-07 中国科学技术大学 Visual question-answering processing method, system, storage medium and electronic equipment
CN116401390B (en) * 2023-05-19 2023-10-20 中国科学技术大学 Visual question-answering processing method, system, storage medium and electronic equipment
CN117892140A (en) * 2024-03-15 2024-04-16 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium
CN117892140B (en) * 2024-03-15 2024-05-31 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108509519B (en) General knowledge graph enhanced question-answer interaction system and method based on deep learning
CN112200317B (en) Multi-mode knowledge graph construction method
CN110121706B (en) Providing responses in a conversation
CN112015868B (en) Question-answering method based on knowledge graph completion
CN110298043B (en) Vehicle named entity identification method and system
Sharma et al. A survey of methods, datasets and evaluation metrics for visual question answering
CN109783666A (en) An image scene graph generation method based on iterative refinement
CN111897944B (en) Knowledge graph question-answering system based on semantic space sharing
CN114064918A (en) Multi-modal event knowledge graph construction method
CN112115253B (en) Depth text ordering method based on multi-view attention mechanism
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN111460132A (en) Generation type conference abstract method based on graph convolution neural network
CN113240046B (en) Knowledge-based multi-mode information fusion method under visual question-answering task
CN115186072A (en) Knowledge graph visual question-answering method based on double-process cognitive theory
CN114077673A (en) Knowledge graph construction method based on BTBC model
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
CN114818703A (en) Multi-intention recognition method and system based on BERT language model and TextCNN model
CN116821291A (en) Question-answering method and system based on knowledge graph embedding and language model alternate learning
Phan et al. Building a Vietnamese question answering system based on knowledge graph and distributed CNN
CN117648429B (en) Question-answering method and system based on multi-mode self-adaptive search type enhanced large model
Wu et al. Visual Question Answering
CN117094395B (en) Method, device and computer storage medium for complementing knowledge graph
CN112417170B (en) Relationship linking method for incomplete knowledge graph
CN117131933A (en) Multi-mode knowledge graph establishing method and application
CN116737911A (en) Deep learning-based hypertension question-answering method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination