CN115186072A - Knowledge graph visual question-answering method based on double-process cognitive theory - Google Patents

Knowledge graph visual question-answering method based on double-process cognitive theory

Info

Publication number
CN115186072A
CN115186072A (application CN202110374169.3A)
Authority
CN
China
Prior art keywords
graph
picture
fact
representation
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110374169.3A
Other languages
Chinese (zh)
Inventor
何小海
刘露平
王美玲
卿粼波
陈洪刚
吴小强
滕奇志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202110374169.3A priority Critical patent/CN115186072A/en
Publication of CN115186072A publication Critical patent/CN115186072A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a knowledge graph visual question-answering method based on the double-process cognitive theory, which comprises the following steps: (1) Question-picture joint characterization: for the input question and picture, features are extracted with the pre-trained models BERT and Faster-RCNN respectively, and then fed into a two-stream Transformer model to learn the question-picture joint representation; (2) Fact graph and semantic graph construction: for each question-picture pair, the fact graph is constructed by retrieving facts from a knowledge base via semantic matching, and the semantic graph is constructed via image description; (3) Evidence aggregation: a graph reasoning network selects evidence from each of the two graphs, and a cross-modal graph reasoning network then aggregates the evidence from the semantic graph into the fact graph; (4) Answer reasoning: binary classification over the nodes of the fact graph yields the answer. The method has broad application prospects in fields such as education and entertainment.

Description

Knowledge graph visual question-answering method based on double-process cognitive theory
Technical Field
The invention provides a knowledge graph visual question-answering method based on the double-process cognitive theory, and belongs to the intersection of the fields of natural language processing and computer vision.
Background
Enabling an intelligent agent to understand the world by analyzing visual and linguistic information has been a hot research topic in recent years at the intersection of computer vision and natural language processing. Related research has driven the development of many applications, such as Visual Question Answering (VQA), image retrieval, and image description. Among them, VQA is a challenging task that requires a model to answer arbitrary questions about a given image. To promote the development of VQA technology, scholars have carried out a great deal of preliminary research in recent years and made substantial progress. However, existing VQA methods focus on answering questions from the picture content alone and cannot answer questions that must be answered with common-sense knowledge.
To advance the field, Wang et al. proposed the task of Fact-Based Visual Question Answering (FVQA) (P. Wang, Q. Wu, C. Shen, A. Dick and A. van den Hengel, "FVQA: Fact-Based Visual Question Answering," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 10, pp. 2413-2427, 1 Oct. 2018, doi: 10.1109/TPAMI.2017.2754246). They also published a new dataset that provides an additional supporting fact for each question-answer pair and requires the model to answer the question through joint analysis of the image and external knowledge. Wang et al. first parse the question, map it into the knowledge graph, and then use keyword matching to find the correct answer. This method has significant drawbacks: it becomes ineffective when the question mentions no obvious visual concept, or when it uses synonyms and near-synonyms. Subsequent researchers therefore proposed a semantic-learning-based retrieval method, which projects image-question-visual concepts and candidate facts into a learned embedding space and finds supporting facts by computing the corresponding distances. However, this method is inefficient because it evaluates one fact node at a time; furthermore, it cannot exploit the structural information of the external knowledge base. To address this problem, Narasimhan et al. proposed a graph-inference-based approach that selects answers by reasoning over the entire graph (Narasimhan M, Schwing A G. Straight to the facts: Learning knowledge base retrieval for factual visual question answering [C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018).
That method constructs an entity graph in which each node is represented by the concatenation of entity, image, and question representations; a graph convolutional network then aggregates messages to obtain updated node features, and the answer is finally predicted from the updated features. Because a question attends to only part of the visual content, this approach inevitably introduces noise information.
For humans, inferring the answer to a question about a given picture can be divided into two steps: (1) the brain quickly grasps the content presented in the image and the visual information the question attends to by analyzing the image and the question; (2) it then performs deeper analysis on the output of the first step together with knowledge stored in memory to find the correct answer. This process is known in cognitive science as dual-process theory. Dual-process theory holds that the human brain first perceives external input through a system called System 1, an implicit, unconscious process. This information is then fed into a system called System 2, which performs an explicit, conscious, controlled inference process. That reasoning makes sequential inferences in working memory; it is slow, but it is a distinctly human capability of higher cognition. From this perspective, the FVQA problem can be solved with the dual-process cognitive theory: System 1 quickly extracts information from the image and question, and System 2 performs deep reasoning with external knowledge to find the correct answer.
Inspired by the dual-process cognitive theory, the invention proposes a new framework that solves the FVQA problem with a double-process cognitive system. Specifically, System 1 in the framework of the invention is implemented by a multimodal Transformer network that uses a cross-attention mechanism to capture the complex relationships between questions and images; System 1 outputs a joint representation of the image and the question. For System 2, the invention uses a Graph Neural Network (GNN) to reason over two external knowledge graphs (a fact graph and a semantic graph) to find the correct answer. System 2 first performs intra-modality evidence selection and aggregation, gathering question-related evidence from each of the two graphs; it then performs cross-modal selection to aggregate evidence from the semantic graph into the fact graph to better infer the answer. During reasoning, the invention proposes a dual attention mechanism of node-level attention and path-level attention that captures valuable information from key nodes and paths, making the reasoning process more reasonable and further improving the performance of knowledge-graph-based visual question answering.
Disclosure of Invention
Inspired by the dual-process cognitive theory, the invention proposes a new framework that solves the FVQA problem with a double-process cognitive system. Specifically, System 1 in the framework is implemented by a multimodal Transformer network that uses a cross-attention mechanism to capture the complex relationships between questions and images and outputs a joint question-image representation. For System 2, a graph neural network is used to reason over two external knowledge graphs (a fact graph and a semantic graph) to find the correct answer.
The invention realizes the purpose through the following technical scheme:
1. The invention relates to a knowledge-graph visual question-answering framework based on the double-process cognitive theory, shown in Fig. 1, which comprises a collaborative perception module (System 1) and an explicit reasoning module (System 2). The specific reasoning process of the framework comprises the following steps:
(1) Use the text pre-trained model BERT and the image pre-trained model Faster-RCNN to extract features of the input text and image respectively: for each question, the [CLS] and [SEP] tokens are added at its start and end positions, and the sequence is fed into the BERT model for feature extraction; for each picture, 36 target regions are extracted, each region containing the visual features of a target and the spatial position features of that target.
(2) Feed the picture and text features extracted in step (1) into a two-stream Transformer network to learn the joint picture-text representation: one single-stream Transformer network learns picture-guided question representation, the other learns question-guided picture representation; finally, the outputs of the two streams are average-pooled and then multiplied to obtain the joint representation.
(3) Construct a fact graph and a semantic graph for each question-picture pair: the fact graph is selected from an external knowledge base by sentence-level semantic matching, and the semantic graph is obtained by first generating a semantic description of the picture and then semantically parsing the generated sentences.
(4) Graph-reasoning-based evidence aggregation: first, for the fact graph and the semantic graph, two attention-based graph reasoning networks aggregate evidence information within each graph respectively; then a cross-modal reasoning network aggregates question-related evidence from the semantic graph into the fact graph.
(5) Answer prediction: compute the dot product of the question-picture joint representation with the feature vector of each node in the fact graph to obtain each node's semantic relevance score to the question, and finally feed the scores into a Sigmoid layer to predict the corresponding answer.
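The scoring in step (5) can be sketched as follows. This is an illustrative NumPy sketch, not the invention's actual code; array shapes and the function name are hypothetical:

```python
import numpy as np

def predict_answers(joint_rep, node_feats):
    """Score each fact-graph node by its dot product with the
    question-picture joint representation, then apply a sigmoid
    to obtain a per-node answer probability (binary classification)."""
    scores = node_feats @ joint_rep               # (num_nodes,) dot products
    return 1.0 / (1.0 + np.exp(-scores))          # sigmoid layer

# Toy usage: 4 candidate fact-graph nodes with 8-dim features
rng = np.random.default_rng(0)
probs = predict_answers(rng.standard_normal(8), rng.standard_normal((4, 8)))
best = int(np.argmax(probs))                      # node predicted as the answer
```

The node with the highest probability is taken as the answer entity.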
In step (1), the words of the input question are first vector-initialized with the BERT pre-trained model (the bert-base-uncased version is used), yielding the word feature vectors $C = [c_0, c_1, \ldots, c_n]$, where each word vector is 768-dimensional. Then, 36 target regions of the input picture are extracted with the Faster-RCNN pre-trained model, where each region comprises the appearance feature $f_i$ of one target, the spatial position feature $b_i$ of the target, and a corresponding label. To capture visual and spatial features simultaneously, the invention first projects the appearance feature and the spatial position feature into the same dimension (768 in the invention), and then averages the two projected features to obtain the feature representation of each target object.
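The projection-and-average step above can be sketched in NumPy. The feature dimensions (2048-d appearance vectors and 4-d box geometry, standard for Faster-RCNN) and the projection matrices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 768  # shared embedding dimension, as in the invention

# Hypothetical Faster-RCNN outputs: 36 regions per picture
appearance = rng.standard_normal((36, 2048))   # appearance features f_i
boxes      = rng.standard_normal((36, 4))      # spatial position features b_i

# Learnable projections into the shared 768-d space (random here)
W_app = rng.standard_normal((2048, D)) * 0.02
W_box = rng.standard_normal((4, D)) * 0.02

# Project both feature types to the same dimension, then average them
region_feats = (appearance @ W_app + boxes @ W_box) / 2.0
```

`region_feats` then plays the role of the picture feature vector V fed to the two-stream Transformer.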
In step (2), a cross-modal Transformer is adopted to align the relationship between the two modalities and to learn image-attended question representations and question-attended image representations, as shown in Fig. 2.
The question feature vector C and the picture feature vector V obtained in step (1) are fed into a two-stream Transformer network to learn the complex interaction between question and picture: one single-stream Transformer network learns picture-guided question representation, and the other learns question-guided picture representation. In picture-guided question representation learning, the question feature vector serves as the query vector and the picture feature vector serves as the key and value vectors; the dependency between picture and question is computed as:
$\hat{C} = \mathrm{softmax}\big((C W'_q)(V W'_k)^{\top} / \sqrt{d}\big)\,(V W'_v)$ (1)
$W'_{*}$ in formula (1) denotes parameter matrices to be learned by the model. In question-guided picture representation learning, the picture feature vector serves as the query vector and the question feature vector serves as the key and value vectors; the dependency between the two is computed as:
$\hat{V} = \mathrm{softmax}\big((V W_q)(C W_k)^{\top} / \sqrt{d}\big)\,(C W_v)$ (2)
w in formula (1) * Representing a parameter matrix to be learned by the model; after a multi-layered (in the present invention, specifically 9 layers) joint characterization of the image and the text, [ CLS ] of the text sequence]The bit features act as joint representation features for the entire problem-picture.
Constructing the fact graph and the semantic graph in step (3): in the first step, for each question-picture pair, the input question and the labels of the objects detected in the picture are concatenated in order to obtain a question-picture instance set; and each triple in the external knowledge base is converted into a natural language sentence (by concatenating head entity, relation, and tail entity in order) to obtain the corresponding fact instance set. In the second step, the question-picture instance set and the fact instance set are encoded with the pre-trained Universal Sentence Encoder to obtain the corresponding instance representations; the cosine similarity between the feature representation of each instance object in the question-picture instance set and each fact instance representation is computed in turn to obtain the corresponding association scores, and finally all fact instances are ranked by cosine similarity score, with the top 10 highest-scoring instances kept as candidate supporting facts. In the third step, the fact graph is constructed from the retrieved candidate supporting facts: nodes in the graph are entities in the knowledge base and edges are relations between two entities. After the fact graph is constructed, nodes and edges are given corresponding initial representations with BERT, each being the average of the word embeddings of all words in the node or edge.
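The retrieval step above (triple linearization, then cosine-similarity ranking) can be sketched as follows. Random vectors stand in for the sentence-encoder embeddings; the function names are illustrative:

```python
import numpy as np

def triple_to_sentence(head, relation, tail):
    """Linearize a knowledge-base triple by concatenating
    head entity, relation, and tail entity in order."""
    return f"{head} {relation} {tail}"

def top_k_facts(query_vecs, fact_vecs, k=10):
    """Rank fact instances by cosine similarity to the question-picture
    instances and keep the k highest-scoring as candidate supporting facts."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    f = fact_vecs / np.linalg.norm(fact_vecs, axis=1, keepdims=True)
    scores = (q @ f.T).max(axis=0)       # best score per fact over all instances
    order = np.argsort(-scores)          # descending by cosine similarity
    return order[:k], scores[order[:k]]

# Toy usage: 3 question-picture instances, 50 linearized facts, 16-dim embeddings
rng = np.random.default_rng(0)
idx, sc = top_k_facts(rng.standard_normal((3, 16)),
                      rng.standard_normal((50, 16)), k=10)
```

In the invention the embeddings would come from the Universal Sentence Encoder rather than random vectors.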
The graph-reasoning-based evidence aggregation in step (4) is divided into intra-modality evidence aggregation and inter-modality evidence aggregation:
For intra-modality evidence aggregation, the dual-level attention mechanism proposed by the invention, comprising node-level attention and path-level attention, is first used for feature selection and aggregation. In the node-level attention computation, the attention score $\alpha_i$ between each node $v_i$ in the graph and the question-picture joint representation $h$ is first computed as:

$\alpha_i = \mathrm{softmax}\big(W[v_i; h] + b\big)$ (3)
The attention score is then multiplied by the initial feature vector of each node in the graph to obtain picture-question-guided node features. The path-level attention computation is concerned with which path matters more to the inference; each path is formed by the nodes and edges directly connected to a target node and is defined as:
$\phi_{ij} = (v_i, r_{ij}, v_j)$ (4)
where $v_i$, $r_{ij}$, $v_j$ respectively denote the head-node representation, the relation representation, and the tail-node feature representation in the fact graph. With the path representation obtained, the path-level attention is computed as:

$\beta_{ij} = \mathrm{softmax}\big(W[\phi_{ij}; h] + b\big)$ (5)
Features are then aggregated from the neighbor nodes by the message-propagation network; the neighbor feature aggregation is computed as:

$m_i = \sum_{j \in \mathcal{N}(i)} \beta_{ij}\, \alpha_j v_j$ (6)
Finally, the neighbor features are fused with the target node's own features to further update the target node. To prevent the neighbor features from excessively overwriting the node's initial features, a gating mechanism is designed to control the ratio of the neighbor features to the target node's original features; the target node's feature update is computed as:

$g_i = \sigma\big(W[m_i; v_i] + b\big)$ (7)

$\tilde{v}_i = g_i \odot m_i + (1 - g_i) \odot v_i$ (8)
the process of evidence selection in the semantic graph is the same as the steps in the fact graph, and the description is not repeated here;
For inter-modality evidence aggregation, attention weight coefficients between each node in the fact graph and each node in the semantic graph are computed under the guidance of the question; each semantic-graph node is then weighted and summed by these coefficients to obtain the question-related features of the semantic graph:

$\gamma_{ij} = \mathrm{softmax}_j\big((W[v_i^f; h])^{\top} (W'[v_j^s; h])\big)$ (9)

$m_i^s = \sum_j \gamma_{ij}\, v_j^s$ (10)
Finally, the feature vector aggregated from the semantic graph is fused with the original node feature in the fact graph to obtain the cross-modally fused updated feature. To prevent the features from the semantic graph from excessively overwriting the fact-node features, a corresponding gating mechanism is designed to control the proportion of the two modalities:

$g_i = \sigma\big(W[m_i^s; v_i^f] + b\big)$ (11)

$\tilde{v}_i^f = g_i \odot m_i^s + (1 - g_i) \odot v_i^f$ (12)
and finally, using the updated features for the inference of the answer.
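The cross-modal aggregation and gating described above can be sketched as follows. This is an illustrative NumPy sketch with toy shapes; a plain dot-product attention stands in for the question-guided attention, which is a simplifying assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_modal_aggregate(fact_nodes, sem_nodes, Wg, bg):
    """For each fact-graph node, attend over the semantic-graph nodes,
    then gate the aggregated message against the original fact-node
    feature so the semantic evidence cannot fully overwrite it."""
    att = softmax(fact_nodes @ sem_nodes.T, axis=1)    # (n_fact, n_sem) weights
    msg = att @ sem_nodes                              # evidence from semantic graph
    g = sigmoid(np.concatenate([msg, fact_nodes], axis=1) @ Wg + bg)
    return g * msg + (1.0 - g) * fact_nodes

# Toy usage: 4 fact nodes, 6 semantic nodes, 8-dim features
rng = np.random.default_rng(0)
d = 8
fact = rng.standard_normal((4, d))
sem  = rng.standard_normal((6, d))
Wg = rng.standard_normal((2 * d, d)) * 0.1
fused = cross_modal_aggregate(fact, sem, Wg, np.zeros(d))
```

The fused fact-node features are what the answer-prediction step scores against the joint representation.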
Drawings
Fig. 1 is the main framework of the network model proposed by the present invention.
FIG. 2 is the structure of the cross-modal Transformer network.
Detailed Description
The invention will be further described with reference to the accompanying drawings in which:
Fig. 1 shows the structure of the whole network, which is composed of two parts: a collaborative perception module (System 1) and an explicit reasoning module (System 2). System 1 in the framework is implemented by a multimodal Transformer network that uses a cross-attention mechanism to capture the complex relationships between the question and the image, and outputs a joint representation of the two. For System 2, the invention uses a Graph Neural Network (GNN) to reason over two external knowledge graphs (a fact graph and a semantic graph) to find the correct answer.
Fig. 2 shows the cross-modal Transformer framework, which feeds the extracted picture and text features into a two-stream Transformer network to learn their joint representation: one single-stream Transformer network learns picture-guided question representation and the other learns question-guided picture representation; finally, the outputs of the two streams are average-pooled and then multiplied to obtain the joint representation.
Tables 1 and 2 show the experimental results of the invention on the public datasets FVQA and OK-VQA. The experiments show that, compared with the best existing models, the proposed model obtains the best results on the comprehensive evaluation metric F1.
TABLE 1 Experimental comparison of the inventive network model on the FVQA dataset with other existing models
TABLE 2 Experimental comparison of the network model of the present invention on OK-VQA data set with other existing models
The above embodiments are only preferred embodiments of the present invention and are not intended to limit its technical solutions; any technical solution that can be realized on the basis of the above embodiments without creative effort shall fall within the protection scope of this patent.

Claims (4)

1. A knowledge graph visual question-answering method based on the double-process cognitive theory, characterized by comprising the following steps:
(1) Using the text pre-trained model BERT and the target detection model Faster-RCNN to extract features of the input text and image respectively: for the text, adding the [CLS] and [SEP] marks at the beginning and end of each sentence and feeding it into the BERT model for feature extraction; for each picture, extracting 36 target regions, each comprising the appearance visual features of an object and the spatial position features of that object in the picture;
(2) Feeding the extracted picture and text features into a two-stream Transformer network to learn the joint picture-text representation, wherein one single-stream Transformer network learns picture-guided question representation and the other learns question-guided picture representation, and finally multiplying the average-pooled outputs of the two streams to obtain the question-image joint representation;
(3) Constructing a fact graph and a semantic graph for each question-picture pair, wherein the fact graph is constructed by retrieving candidate supporting facts from an external knowledge base by sentence-level semantic matching, and the semantic graph is constructed by first generating a semantic description of the picture and then semantically parsing the generated sentences;
(4) Graph-reasoning-based evidence aggregation, wherein two attention-based graph reasoning networks first aggregate evidence information within the fact graph and the semantic graph respectively, and a cross-modal reasoning network then aggregates question-related evidence from the semantic graph into the fact graph;
(5) Answer prediction, wherein the dot product of the question-picture joint representation with the feature vector of each node in the fact graph yields each node's semantic matching score with the question, and the scores are finally fed into a Sigmoid layer to predict the corresponding answer.
2. The method according to claim 1, wherein the question-picture joint characterization learning in step (2) comprises the following steps:
given the question feature vector C and the picture feature vector V, feeding them into a two-stream Transformer network to learn the complex interaction between question and picture, wherein one single-stream Transformer network learns picture-guided question representation and the other learns question-guided picture representation; in picture-guided question representation learning, the question feature vector serves as the query vector and the picture feature vector serves as the key and value vectors, and the dependency between picture and question is computed as:
$\hat{C} = \mathrm{softmax}\big((C W'_q)(V W'_k)^{\top} / \sqrt{d}\big)\,(V W'_v)$ (1)
$W'_{*}$ in formula (1) denotes parameter matrices to be learned by the model; in question-guided picture representation learning, the picture feature vector serves as the query vector and the question feature vector serves as the key and value vectors, and the dependency between the two is computed as:
$\hat{V} = \mathrm{softmax}\big((V W_q)(C W_k)^{\top} / \sqrt{d}\big)\,(C W_v)$ (2)
$W_{*}$ in formula (2) likewise denotes parameter matrices to be learned by the model; after multiple layers (specifically 9 in the invention) of joint characterization of image and text, the [CLS] feature of the text sequence serves as the joint representation of the entire question-picture pair.
3. The method according to claim 1, wherein the fact graph constructing method in step (3) comprises the following steps:
(1) For each question-picture pair, first concatenating the input question and the labels of the objects detected in the picture in order to obtain a question-picture instance set, then converting each triple in the external knowledge base into a natural language sentence (by concatenating head entity, relation, and tail entity in order) to obtain the corresponding fact instance set;
(2) Encoding the question-picture instance set and the fact instance set with the pre-trained Universal Sentence Encoder to obtain the corresponding instance representations;
(3) Computing the cosine similarity between the feature representation of each instance object in the question-picture instance set and each fact instance representation in turn to obtain the corresponding association scores, and finally ranking all fact instances by cosine similarity score, keeping the top 10 highest-scoring instances as candidate supporting facts;
(4) Constructing the fact graph from the retrieved candidate supporting facts, wherein nodes in the graph are entities in the knowledge base and edges are relations between two entities; after the fact graph is constructed, initializing nodes and edges with the BERT model, each initial representation being the average of the word embeddings of all words in the node or edge.
4. The method according to claim 1, wherein the evidence aggregation in step (4) is divided into intra-modality evidence aggregation and inter-modality evidence aggregation:
(1) For intra-modality evidence aggregation, the dual-level attention proposed by the invention, comprising node-level attention and path-level attention networks, is first used for feature selection and aggregation; in the node-level attention computation, the attention score $\alpha_i$ between each node $v_i$ in the graph and the question-picture joint representation $h$ is first computed as:

$\alpha_i = \mathrm{softmax}\big(W[v_i; h] + b\big)$ (3)
W and b in the above formula denote the parameter matrix and bias to be learned by the model (they carry the same meaning in the formulas below and are not repeated); after the attention weight coefficients are obtained, the attention score is multiplied by the initial feature vector of each node in the graph to obtain picture-question-guided node feature representations; the path-level attention computation is concerned with which path matters more to the inference, wherein each path is formed by the nodes and edges directly connected to a target node and is defined as:
$\phi_{ij} = (v_i, r_{ij}, v_j)$ (4)
wherein $v_i$, $r_{ij}$, $v_j$ respectively denote the head-node feature representation, the relation feature representation, and the tail-node feature representation in the fact graph; with the path representation obtained, the path-level attention is computed as:

$\beta_{ij} = \mathrm{softmax}\big(W[\phi_{ij}; h] + b\big)$ (5)
features are then aggregated from the neighbor nodes by a message-propagation mechanism, the neighbor feature aggregation being:

$m_i = \sum_{j \in \mathcal{N}(i)} \beta_{ij}\, \alpha_j v_j$ (6)
Finally, the neighbor-node features are fused with the target-node features to further update the target node. To prevent the neighbor features from updating the node features excessively, a gating mechanism is designed to control the ratio between the neighbor features and the original features of the target node; the complete feature-update process of the target node is given by:
[gated feature-update formulas, rendered only as images in the original publication]
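The aggregation and gated-update equations are likewise image-only in the published text; the sketch below assumes a common convex-gate form, where a sigmoid gate interpolates between the aggregated neighbor message and the target node's original feature. The gate parameterization is an assumption, not the patent's exact formula.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregate_neighbors(beta, neighbor_feats):
    """Message propagation: path-attention-weighted sum of neighbor features."""
    return (beta[:, None] * neighbor_feats).sum(axis=0)

def gated_update(v_target, message, W_gate, b_gate):
    """Gate g in (0, 1) controls how much the neighbor message
    overwrites the target node's original feature (assumed form)."""
    g = sigmoid(W_gate @ np.concatenate([v_target, message]) + b_gate)
    return g * message + (1.0 - g) * v_target

D = 4
rng = np.random.default_rng(2)
beta = np.array([0.7, 0.3])            # path-level attention weights
neighbors = rng.standard_normal((2, D))
v_target = rng.standard_normal(D)
message = aggregate_neighbors(beta, neighbors)
# With a strongly negative gate bias, the gate closes and the node keeps its feature.
v_kept = gated_update(v_target, message, np.zeros(2 * D), -50.0)
```

Because the update is a convex combination, the gate cannot push the node representation outside the span of its own feature and the incoming message, which is precisely the over-update protection the claim describes.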
The evidence aggregation process in the semantic graph is identical to the steps in the fact graph and is not repeated here;
(2) For inter-modal evidence aggregation, the attention weight coefficient between each node in the fact graph and every node in the semantic graph is computed under the guidance of the question; the features of the semantic-graph nodes are then weighted and summed according to these attention coefficients to obtain the relevant features from the semantic graph. The calculation process is as follows:
[cross-modal attention formulas, rendered only as images in the original publication]
Finally, the feature vector aggregated from the semantic graph is fused with the feature vector of the node in the fact graph to obtain a new, cross-modally fused feature. To prevent the semantic-graph features from updating the fact-graph node features excessively, a corresponding gating mechanism is designed to control the proportion of the two modalities' features; the specific calculation is as follows:
[gated cross-modal fusion formulas, rendered only as images in the original publication]
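As with the earlier steps, the cross-modal formulas appear only as images in the published text. The sketch below assumes a question-guided linear scoring over concatenated features for the cross-graph attention and a sigmoid gate for the fusion; all parameter names and functional forms are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_modal_aggregate(v_fact, sem_feats, question, W_att):
    """Attend from one fact-graph node over all semantic-graph nodes,
    guided by the question; return the attention-weighted semantic feature."""
    scores = np.array([W_att @ np.concatenate([v_fact, s, question])
                       for s in sem_feats])
    gamma = softmax(scores)
    return (gamma[:, None] * sem_feats).sum(axis=0)

def gated_fuse(v_fact, v_sem, W_gate, b_gate):
    """Gate controls the proportion of semantic-graph vs. fact-graph features."""
    g = sigmoid(W_gate @ np.concatenate([v_fact, v_sem]) + b_gate)
    return g * v_sem + (1.0 - g) * v_fact

D = 4
rng = np.random.default_rng(3)
sem_feats = rng.standard_normal((3, D))   # semantic-graph node features
question = rng.standard_normal(D)
v_fact = rng.standard_normal(D)
W_att = rng.standard_normal(3 * D)        # stand-in for the learned W
v_sem = cross_modal_aggregate(v_fact, sem_feats, question, W_att)
# With a strongly positive gate bias, the fused feature follows the semantic evidence.
fused = gated_fuse(v_fact, v_sem, np.zeros(2 * D), 50.0)
```

In practice the fused features would be computed for every fact-graph node and then fed to the answer-inference module described in the next step.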
Finally, the updated features are used for answer inference.
CN202110374169.3A 2021-04-07 2021-04-07 Knowledge graph visual question-answering method based on double-process cognitive theory Pending CN115186072A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110374169.3A CN115186072A (en) 2021-04-07 2021-04-07 Knowledge graph visual question-answering method based on double-process cognitive theory

Publications (1)

Publication Number Publication Date
CN115186072A true CN115186072A (en) 2022-10-14

Family

ID=83512224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110374169.3A Pending CN115186072A (en) 2021-04-07 2021-04-07 Knowledge graph visual question-answering method based on double-process cognitive theory

Country Status (1)

Country Link
CN (1) CN115186072A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116401390A (en) * 2023-05-19 2023-07-07 中国科学技术大学 Visual question-answering processing method, system, storage medium and electronic equipment
CN116401390B (en) * 2023-05-19 2023-10-20 中国科学技术大学 Visual question-answering processing method, system, storage medium and electronic equipment
CN117892140A (en) * 2024-03-15 2024-04-16 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium
CN117892140B (en) * 2024-03-15 2024-05-31 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108509519B (en) General knowledge graph enhanced question-answer interaction system and method based on deep learning
CN112200317B (en) Multi-mode knowledge graph construction method
CN110121706B (en) Providing responses in a conversation
CN112015868B (en) Question-answering method based on knowledge graph completion
CN110298043B (en) Vehicle named entity identification method and system
Sharma et al. A survey of methods, datasets and evaluation metrics for visual question answering
CN109783666A (en) An image scene graph generation method based on iterative refinement
CN111897944B (en) Knowledge graph question-answering system based on semantic space sharing
CN114064918A (en) Multi-modal event knowledge graph construction method
CN112115253B (en) Depth text ordering method based on multi-view attention mechanism
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN111460132A (en) Generation type conference abstract method based on graph convolution neural network
CN113240046B (en) Knowledge-based multi-mode information fusion method under visual question-answering task
CN115186072A (en) Knowledge graph visual question-answering method based on double-process cognitive theory
CN114077673A (en) Knowledge graph construction method based on BTBC model
CN116187349A (en) Visual question-answering method based on scene graph relation information enhancement
CN114818703A (en) Multi-intention recognition method and system based on BERT language model and TextCNN model
CN116821291A (en) Question-answering method and system based on knowledge graph embedding and language model alternate learning
Phan et al. Building a Vietnamese question answering system based on knowledge graph and distributed CNN
CN117648429B (en) Question-answering method and system based on multi-mode self-adaptive search type enhanced large model
Wu et al. Visual Question Answering
CN117094395B (en) Method, device and computer storage medium for complementing knowledge graph
CN112417170B (en) Relationship linking method for incomplete knowledge graph
CN117131933A (en) Multi-mode knowledge graph establishing method and application
CN116737911A (en) Deep learning-based hypertension question-answering method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination