CN116701590A - Visual question-answering method for constructing answer semantic space based on knowledge graph - Google Patents

Visual question-answering method for constructing answer semantic space based on knowledge graph

Info

Publication number
CN116701590A
Authority
CN
China
Prior art keywords
answer
entity
embedding
knowledge
knowledge graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310703224.8A
Other languages
Chinese (zh)
Inventor
成科扬
蒋洲
彭长生
万浩
司宇
沈维杰
严浏阳
周昊
陈楠
余悦
陈涛
邹文轩
黄昊
唐静峰
位刘涛
叶中得
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu King Intelligent System Co ltd
Jiangsu University
Original Assignee
Jiangsu King Intelligent System Co ltd
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu King Intelligent System Co ltd, Jiangsu University filed Critical Jiangsu King Intelligent System Co ltd
Priority to CN202310703224.8A priority Critical patent/CN116701590A/en
Publication of CN116701590A publication Critical patent/CN116701590A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 - Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a visual question-answering method in which a knowledge graph participates in constructing an answer semantic space. When constructing the answer semantic space, the method not only embeds external knowledge into the answer features, but also reduces the language bias of the model through a designed semantic loss, significantly improving the generalization ability of the model. The technique can be applied to a variety of visual question-answering methods and has great commercial value.

Description

Visual question-answering method for constructing answer semantic space based on knowledge graph
Technical Field
The invention relates to the technical fields of computer vision, natural language processing, and knowledge graphs. Its main idea is to enhance the feature expression of entity-related answers in the answer space through related entity nodes in a knowledge graph. The technique can be applied to fields such as online education, assistance for the blind, and intelligent question answering, and has great commercial value.
Background
Visual question answering is a task that requires producing a text answer given a question and an image as input. A visual question-answering model requires a high-level understanding of both the image content and the question, and the task is therefore generally regarded as a proxy task for assessing the visual reasoning capability of a system.
Although the task requires predicting text output, i.e., a text answer, and text answers form a structurally rich output space, most known benchmarks and evaluation protocols treat it as a classification problem, e.g., VQAv1, VQAv2, and GQA. In previous approaches, the answer dictionary is highly dependent on the training set, which is detrimental to generalizing to unseen data, while the answer classes of the dictionary are treated as independent, without considering their semantic relationships. As a result, the model becomes highly dependent on question bias.
At the same time, a large number of questions in the visual question-answering task require reasoning with general knowledge about the image content. For example, for a question such as "What can the red object on the ground be used for?", one not only needs to visually identify that the red object is a fire hydrant, but also needs to know that a fire hydrant can be used to extinguish fires. Wang et al. constructed a fact-based VQA dataset, FVQA, whose questions must be answered with external knowledge. Wang et al. also proposed a new VQA method that can automatically find facts supporting the visual question from large structured knowledge bases such as ConceptNet and DBpedia. Instead of learning the mapping from questions to answers directly, the method learns the mapping from questions to knowledge-base queries. Wen and Peng et al. proposed a commonsense-knowledge-based reasoning model that supports the visual commonsense reasoning task by incorporating external knowledge. In addition, there are many other visual question-answering methods that use external knowledge, but some current external-knowledge-based visual question-answering methods do not consider the external knowledge contained in the answer sets of the datasets.
In summary, in the existing visual question-answering task, how to fully utilize the external knowledge contained in the answers and how to exploit the direct semantic relevance of similar answers remain urgent problems to be solved.
Disclosure of Invention
1. A visual question-answering method in which a knowledge graph participates in constructing an answer semantic space, characterized in that the method comprises the following steps:
Step 1.1: Extract entity nodes related to the question and the image context from an external large-scale knowledge graph, and embed them into the image features and the text features of the question respectively, obtaining knowledge-embedded image features V_k and knowledge-embedded question features Q_k.
Step 1.2: Input V_k and Q_k into a multimodal bilinear pooling network, remove the final softmax classification layer, and take the state of the last layer, F_θ(V_k, Q_k), as the joint feature of the image and the question, where θ denotes the parameters of the bilinear pooling model.
Step 1.3: Screen out entity nodes related to the answers from the external large-scale knowledge graph, and retain the entity nodes with the highest degree of correlation in the knowledge graph for embedding into the answer feature expression, thereby generating the knowledge-embedded answer feature expression G_φ(a).
Step 1.4: Project the joint image-question feature F_θ(V_k, Q_k) obtained in step 1.2 and the knowledge-embedded answer feature expression G_φ(a) obtained in step 1.3 into the same answer semantic space for learning.
Step 1.5: given an input image and a question, the model predicts the answer by the following formula:
wherein a is * Representing the answer predicted.
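By way of illustration only, the sketch below scores each candidate answer embedding against the joint feature and takes the argmax, as in step 1.5; the tensor names and dimensions are assumptions made for the example, not part of the disclosure.

```python
import torch

def predict_answer(joint_feat, answer_embeds, answer_labels):
    # joint_feat: (d,) joint feature F_theta(V_k, Q_k)
    # answer_embeds: (num_answers, d) matrix whose rows are G_phi(a)
    scores = answer_embeds @ joint_feat      # similarity score per candidate answer
    best = torch.argmax(scores).item()       # a* = argmax over the answer set A
    return answer_labels[best]

# toy usage with random features (d = 8, three candidate answers)
torch.manual_seed(0)
joint = torch.randn(8)
answers = torch.randn(3, 8)
print(predict_answer(joint, answers, ["fire hydrant", "dog", "red"]))
```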
2. The visual question-answering method based on knowledge-graph answer semantic embedding according to claim 1, characterized in that the method of embedding relevant external knowledge into the text and the image in step 1.1 is as follows:
Step 2.1: Process the input image with an object detection network pre-trained on the Visual Genome dataset to obtain visual features V = {v_1, ..., v_{n_v}}, where n_v = 18 and v_i denotes the features of the object in the i-th detection box. Each detection box also yields a 5-dimensional box feature d_v, comprising the top, bottom, left, and right coordinates of the box and its confidence, as well as a detected entity name. The set of entities involved in the image is E_v = {e_1, ..., e_{v_n}}, where v_n is the number of detected entities.
Step 2.2: Represent each word in the input question by a 1024-dimensional GloVe word embedding, then input the sequence into a long short-term memory network to obtain the overall text feature representation Q of the question.
Step 2.3: analyzing the problem through Steady named entity recognition to obtain an entity set involved in the problemWherein q is n Is the number of detected entities.
Step 2.4: screening entities appearing in images from large-scale knowledge graph ConceptNet by computing cosine similarityEntities related to questions->Obtaining the most relevant entity node e k E, where E represents a collection of entity nodes. With entity node e k Starting from e k And expanding to the second-order field, reserving field entities and edges, and finally forming a knowledge graph related to the image and the problem context, wherein all entity names are characterized by embedding GloVe words. Obtaining updated entity node characteristic representation related to the image and the problem context after the knowledge graph obtained through RGCN processing is subjected to multiple iterations of the graph neural network>
Step 2.5: embedding the final entity node characteristics obtained in the step 2.4 into the image characteristics and the text characteristics of the problem respectively to obtain the image characteristics V containing the external knowledge information k And image feature Q containing external knowledge information k The method is used for downstream multi-modal feature fusion tasks.
3. The method of embedding external knowledge into the text and the image respectively according to step 2.5 of claim 2, characterized in that the embedding method comprises the following specific steps:
Step 3.1: For each entity detected by the object detection network, concatenate its visual features v_i with the corresponding entity node features in the knowledge graph to obtain knowledge-embedded visual features; after embedding the external knowledge, the overall input image features are denoted V_k.
Step 3.2: Concatenate the features of the entity nodes detected in the question, as auxiliary features, after the question features to obtain the knowledge-embedded question feature expression q_k; then input q_k into a multi-layer perceptron to obtain the knowledge-embedded question features Q_k.
4. The visual question-answering method for constructing an answer semantic space based on a knowledge graph according to claim 1, characterized in that the method of embedding the external entity nodes containing knowledge information into the answer features in step 1.3 is as follows:
Step 4.1: Model the 3,500 answer labels with the highest probability of occurrence in visual question answering as entities e_a, screen entities related to these answer entities from the large-scale knowledge graph by computing cosine similarity, represent all entity names by GloVe word embeddings as features, expand to the second-order neighborhood, retain the neighborhood entities and edges, and finally form a knowledge graph related to the answer context.
Step 4.2: After multiple RGCN iterations, the entity node related to each answer aggregates the knowledge information of its surrounding nodes, yielding an updated entity node feature representation.
Step 4.3: If the label of an answer is identical to an entity name, the entity in the knowledge graph is directly modeled as an element of the answer set. If the answer label contains words beyond the entity names, the answer-related entity features are concatenated to the text features of the answer label to obtain new answer features, which serve as elements of the answer set.
5. The visual question-answering method based on knowledge-graph answer semantic embedding according to claim 1, characterized in that, in step 1.4, the method of projecting F_θ(V_k, Q_k) and G_φ(a) into the answer semantic space for learning is as follows:
Step 5.1: Input each element of the answer set obtained in step 4.3 into an LSTM network, and take the last layer of the LSTM hidden state as the representation feature G_φ(a) of the answer, where φ denotes the parameters of the LSTM network.
Step 5.2: will feature F θ (V k ,Q k ) Andprojecting into the same answer semantic space for learning, and optimizing model +.>F is made to θ (V k ,Q k ) Andare close to each other in a common feature space, where a is the set of all answers. The loss function in the training process is as follows:
where I (a, b) is a binary indicator function, taking 1 when a=b is true, otherwise 0.
Step 5.3: prediction to penalize semantic proximityAnd goal->Classification errors between, defining a semantic loss function as follows:
where s represents the similarity calculation function,finally, the overall loss function of the model during training is:
where λ is a predefined hyper-parameter.
The beneficial effects of the invention are as follows: by combining knowledge graphs with graph convolution techniques, the invention provides a solution for constructing an answer semantic space, greatly improving the positive effect that the answer information in the dataset has on the visual question-answering task.
Drawings
FIG. 1 is an overall diagram of the visual question-answering method according to the present invention.
FIG. 2 is a flow chart of the visual question-answering method according to the present invention.
FIG. 3 is a schematic diagram of the construction of the answer semantic space by the visual question-answering method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, the visual question-answering method for constructing an answer semantic space based on a knowledge graph according to the invention comprises the following specific implementation steps:
Step 1: Extract entity nodes related to the question and the image context from an external large-scale knowledge graph, and embed them into the image features and the text features of the question respectively, obtaining knowledge-embedded image features V_k and knowledge-embedded question features Q_k.
Step 1.1: processing the input image using a pre-trained object detection network on the Visual Genome dataset to obtain Visual featuresWherein n is v =18,v i Representing the characteristics of the objects in the detection frames, and simultaneously, each detection frame also obtains the characteristics d of the detection frame in 5 dimensions v . D of 5 dimension v The method comprises the steps of detecting the coordinates of the upper, lower, left and right positions of the frame and the confidence of the frame. Each detection box also gets the detected entity name +.>The detection frame features comprise the coordinates of the upper, lower, left and right corners of the detection frame and the confidence of the detection frame. Entity set involved in an image ∈>Wherein v is n Is the number of detected entities.
Step 1.2: each word in the input question is represented by GloVe word embedding, each word is represented by 1024-dimensional word embedding, and then the word is input into a long-short-term memory network to obtain the integral text characteristic representation of the question
Step 1.3: analyzing the problem through Steady named entity recognition to obtain an entity set involved in the problemWherein q is n Is the number of detected entities.
Step 1.4: screening entities appearing in images from large-scale knowledge graph ConceptNet by computing cosine similarityEntities related to questions->Obtaining the most relevant entity node e k E, where E represents a collection of entity nodes. With entity node e k Starting from e k And expanding to the second-order field, reserving field entities and edges, and finally forming a knowledge graph related to the image and the problem context, wherein all entity names are characterized by embedding GloVe words. The knowledge graph obtained through RGCN processing obtains updated entity nodes related to the image and the problem context after multiple iterations of the graph neural networkCharacteristic representation->
Step 1.5: for each entity detected by the object detection network, its visual characteristics v are determined i Splicing the external knowledge with the corresponding entity node characteristics in the knowledge graph to obtain visual characteristics embedded by the external knowledge, namely each visual characteristicThe overall input image features are denoted as V after embedding the external knowledge k . Splicing the detected entity node characteristics in the problem as auxiliary characteristics after the problem characteristics to obtain the problem characteristic expression embedded by external knowledgeThen q k Inputting a 3-layer multi-layer perceptron to obtain problem characteristics Q embedded by external knowledge k
Step 2: a joint feature representation of the image and the problem is generated. Will V k ,Q k Inputting into a multimode bilinear pooling network, removing the last softmax classification layer, and taking the state F of the last layer θ (V k ,Q k ) As a joint feature of the image and the problem, where θ is a parameter of the bilinear pooling model.
Step 3: embedding the entity node containing knowledge information into answer characteristics to obtain characteristic representation of corresponding answer
Step 3.1: modeling the first 3500 answer labels with highest occurrence probability in VQA data set and OK-VQA data set as entity e a And (3) selecting entities related to the answer entities from the large-scale knowledge graph through calculating cosine similarity, embedding all entity names by using Glove words as characteristics, expanding the entities to the second-order field, reserving field entities and edges, and finally forming the knowledge graph related to the answer context.
Step 3.2: after RGCN is iterated for many times, the entity node related to each answer gathers the surrounding node knowledge information to obtain updated entity node characteristic representation
Step 3.3: if the label of the answer is consistent with the entity name, directly modeling the entity in the knowledge graph as an element in the answer set. If the answer label contains words beyond the entity names, the entity features related to the answers are spliced to the text features of the answer label to obtain new answer features serving as elements in the answer set. Inputting each element in the obtained answer set into a 4-layer LSTM neural network, and taking the last layer of the LSTM neural network hidden layer as the representing characteristic of the answerWherein->Is a parameter of the LSTM network.
Step 4: f to be constructed θ (V k ,Q k ) And G φ (a) And putting an answer semantic space, and carrying out feature matching through a designed probability model to obtain an answer, wherein the answer semantic space is shown in figure 3.
Step 4.1: will feature F θ (V k ,Q k ) Andprojecting into the same answer semantic space for learning, and optimizing model +.>F is made to θ (V k ,Q k ) Andin a common feature spaceIn close proximity to each other, where A is the set of all answers, τ is the artificially set loss temperature, and when τ is larger, the distance of different eigenvectors in the eigenvector space is larger. The loss function in the training process is as follows:
where I (a, b) is a binary indicator function, taking 1 when a=b is true, otherwise 0.
Step 5.3: prediction to penalize semantic proximityAnd goal->Classification errors between, defining a semantic loss function as follows:
where s represents the similarity calculation function,finally, the overall loss function of the model during training is:
where λ is a predefined hyper-parameter.
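Under the loss forms reconstructed above, the training objective can be sketched as follows; the cosine form of s, the value of λ, and the batch shapes are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def total_loss(joint_feat, answer_embeds, target_idx, tau=0.02, lam=0.5):
    # L_1: temperature-scaled cross-entropy over joint-feature/answer similarities
    logits = joint_feat @ answer_embeds.t() / tau          # (batch, num_answers)
    log_p = F.log_softmax(logits, dim=-1)
    l1 = F.nll_loss(log_p, target_idx)                     # I(a, a_hat) indicator term
    # L_sem: similarity-weighted term; cosine similarity stands in for s (assumed)
    s = F.cosine_similarity(answer_embeds.unsqueeze(0),
                            answer_embeds[target_idx].unsqueeze(1), dim=-1)
    l_sem = -(s * log_p).sum(dim=-1).mean()
    return l1 + lam * l_sem                                # L = L_1 + lambda * L_sem

joint = torch.randn(4, 512, requires_grad=True)
answers = torch.randn(100, 512, requires_grad=True)
total_loss(joint, answers, torch.tensor([1, 7, 42, 99])).backward()
```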
In general, the model framework is shown in FIG. 2. The object detection network that processes the input image was pre-trained on the Visual Genome dataset, with input images and labels of size 448×448, and was trained for 20 epochs. The number of detected objects in each image is fixed at 18: detection boxes with confidence greater than 0.5 are retained, and if fewer than 18 boxes are detected, the spare entity features are padded with zeros. The text of the knowledge graph and the text of the input question initially use the same 300-dimensional word embedding representation. When related entities detected in the text are used for knowledge embedding, only the entity with the highest confidence is retained. We used the Adam optimizer during model training with a mini-batch size of 128, and stabilized training with dropout and batch normalization. During training, the loss temperature τ was set to 0.02.
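For convenience, these reported hyper-parameters can be collected into a configuration sketch as below; the dictionary keys and the placeholder model are illustrative only.

```python
import torch

# Assumed training-configuration sketch matching the hyper-parameters reported
# in the text; `model` stands in for the full network described above.
config = {
    "image_size": (448, 448),
    "num_boxes": 18,             # pad entity features with zeros below this count
    "box_conf_threshold": 0.5,
    "word_dim": 300,             # shared by question text and graph entities
    "batch_size": 128,
    "tau": 0.02,                 # loss temperature
    "detector_epochs": 20,
}

model = torch.nn.Linear(8, 8)    # placeholder for the actual VQA model
optimizer = torch.optim.Adam(model.parameters())
```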
The detailed description above sets out only specific practical embodiments of the present invention; it is not intended to limit the scope of the present invention, and all equivalent embodiments or modifications that do not depart from the spirit of the present invention shall fall within the scope of the present invention.

Claims (5)

1. A visual question-answering method in which a knowledge graph participates in constructing an answer semantic space, characterized by comprising the following steps:
Step 1.1: Extract entity nodes related to the question and the image context from an external large-scale knowledge graph, and embed them into the image features and the text features of the question respectively, obtaining knowledge-embedded image features V_k and knowledge-embedded question features Q_k.
Step 1.2: Input V_k and Q_k into a multimodal bilinear pooling network, remove the final softmax classification layer, and take the state of the last layer, F_θ(V_k, Q_k), as the joint feature of the image and the question, where θ denotes the parameters of the bilinear pooling model.
Step 1.3: Screen out entity nodes related to the answers from the external large-scale knowledge graph, and retain the entity nodes with the highest degree of correlation in the knowledge graph for embedding into the answer feature expression, thereby generating the knowledge-embedded answer feature expression G_φ(a).
Step 1.4: Project the joint image-question feature F_θ(V_k, Q_k) obtained in step 1.2 and the knowledge-embedded answer feature expression G_φ(a) obtained in step 1.3 into the same answer semantic space for learning.
Step 1.5: given an input image and a question, the model predicts the answer by the following formula:
wherein a is * Representing the answer predicted.
2. The visual question-answering method based on knowledge-graph answer semantic embedding according to claim 1, characterized in that the method of embedding relevant external knowledge into the text and the image in step 1.1 is as follows:
Step 2.1: Process the input image using an object detection network pre-trained on the Visual Genome dataset to obtain visual features V = {v_1, ..., v_{n_v}}, where n_v = 18 and v_i denotes the features of the object in the i-th detection box. Each detection box also yields a 5-dimensional box feature d_v, comprising the top, bottom, left, and right coordinates of the box and its confidence, as well as a detected entity name. The set of entities involved in the image is E_v = {e_1, ..., e_{v_n}}, where v_n is the number of detected entities.
Step 2.2: Represent each word in the input question by a 1024-dimensional GloVe word embedding, then input the sequence into a long short-term memory network to obtain the overall text feature representation Q of the question.
Step 2.3: analyzing the problem through Steady named entity recognition to obtain an entity set involved in the problemWherein q is n Is the number of detected entities.
Step 2.4: screening entities appearing in images from large-scale knowledge graph ConceptNet by computing cosine similarityEntities related to questions->Obtaining the most relevant entity node e k E, where E represents a collection of entity nodes. With entity node e k Starting from e k And expanding to the second-order field, reserving field entities and edges, and finally forming a knowledge graph related to the image and the problem context, wherein all entity names are characterized by embedding GloVe words. Obtaining updated entity node characteristic representation related to the image and the problem context after the knowledge graph obtained through RGCN processing is subjected to multiple iterations of the graph neural network>
Step 2.5: embedding the final entity node characteristics obtained in the step 2.4 into the image characteristics and the text characteristics of the problem respectively to obtain the image characteristics V containing the external knowledge information k And image feature Q containing external knowledge information k The method is used for downstream multi-modal feature fusion tasks.
3. The method of embedding external knowledge into the text and the image respectively according to step 2.5 of claim 2, characterized in that the embedding method comprises the following specific steps:
Step 3.1: For each entity detected by the object detection network, concatenate its visual features v_i with the corresponding entity node features in the knowledge graph to obtain knowledge-embedded visual features; after embedding the external knowledge, the overall input image features are denoted V_k.
Step 3.2: Concatenate the features of the entity nodes detected in the question, as auxiliary features, after the question features to obtain the knowledge-embedded question feature expression q_k; then input q_k into a multi-layer perceptron to obtain the knowledge-embedded question features Q_k.
4. The visual question-answering method for constructing an answer semantic space based on a knowledge graph according to claim 1, characterized in that the method of embedding the external entity nodes containing knowledge information into the answer features in step 1.3 is as follows:
Step 4.1: Model the 3,500 answer labels with the highest probability of occurrence in visual question answering as entities e_a, screen entities related to these answer entities from the large-scale knowledge graph by computing cosine similarity, represent all entity names by GloVe word embeddings as features, expand to the second-order neighborhood, retain the neighborhood entities and edges, and finally form a knowledge graph related to the answer context.
Step 4.2: After multiple RGCN iterations, the entity node related to each answer aggregates the knowledge information of its surrounding nodes, yielding an updated entity node feature representation.
Step 4.3: If the label of an answer is identical to an entity name, the entity in the knowledge graph is directly modeled as an element of the answer set. If the answer label contains words beyond the entity names, the answer-related entity features are concatenated to the text features of the answer label to obtain new answer features, which serve as elements of the answer set.
5. The visual question-answering method based on knowledge-graph answer semantic embedding according to claim 1, characterized in that, in step 1.4, the method of projecting F_θ(V_k, Q_k) and G_φ(a) into the answer semantic space for learning is as follows:
Step 5.1: Input each element of the answer set obtained in step 4.3 into an LSTM network, and take the last layer of the LSTM hidden state as the representation feature G_φ(a) of the answer, where φ denotes the parameters of the LSTM network.
Step 5.2: will feature F θ (V k ,Q k ) And G φ (a) Projected into the same answer semantic space for learning, and in the neural network training process, the model is optimizedF is made to θ (V k ,Q k ) And->Are close to each other in a common feature space, where a is the set of all answers. The loss function in the training process is as follows:
where I (a, b) is a binary indicator function, taking 1 when a=b is true, otherwise 0.
Step 5.3: prediction to penalize semantic proximityAnd goal->Classification errors between, defining a semantic loss function as follows:
where s represents the similarity calculation function,finally, the overall loss function of the model during training is:
where λ is a predefined hyper-parameter.
CN202310703224.8A 2023-06-14 2023-06-14 Visual question-answering method for constructing answer semantic space based on knowledge graph Pending CN116701590A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310703224.8A CN116701590A (en) 2023-06-14 2023-06-14 Visual question-answering method for constructing answer semantic space based on knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310703224.8A CN116701590A (en) 2023-06-14 2023-06-14 Visual question-answering method for constructing answer semantic space based on knowledge graph

Publications (1)

Publication Number Publication Date
CN116701590A true CN116701590A (en) 2023-09-05

Family

ID=87837050

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310703224.8A Pending CN116701590A (en) 2023-06-14 2023-06-14 Visual question-answering method for constructing answer semantic space based on knowledge graph

Country Status (1)

Country Link
CN (1) CN116701590A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117892140A (en) * 2024-03-15 2024-04-16 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium
CN117892140B (en) * 2024-03-15 2024-05-31 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11113598B2 (en) Dynamic memory network
JP7468929B2 (en) How to acquire geographical knowledge
US20160350653A1 (en) Dynamic Memory Network
CN111984766B (en) Missing semantic completion method and device
CN110990590A (en) Dynamic financial knowledge map construction method based on reinforcement learning and transfer learning
CN105631479A (en) Imbalance-learning-based depth convolution network image marking method and apparatus
CN109214006B (en) Natural language reasoning method for image enhanced hierarchical semantic representation
CN109783666A (en) A kind of image scene map generation method based on iteration fining
CN113779211A (en) Intelligent question-answer reasoning method and system based on natural language entity relationship
CN111522965A (en) Question-answering method and system for entity relationship extraction based on transfer learning
CN115618045B (en) Visual question answering method, device and storage medium
CN112308115B (en) Multi-label image deep learning classification method and equipment
CN109376222A (en) Question and answer matching degree calculation method, question and answer automatic matching method and device
CN111898636B (en) Data processing method and device
CN111611367B (en) Visual question-answering method introducing external knowledge
CN113673244B (en) Medical text processing method, medical text processing device, computer equipment and storage medium
CN115599899B (en) Intelligent question-answering method, system, equipment and medium based on aircraft knowledge graph
CN114186076A (en) Knowledge graph construction method, device, equipment and computer readable storage medium
CN116151263B (en) Multi-mode named entity recognition method, device, equipment and storage medium
CN116385937B (en) Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework
CN116701590A (en) Visual question-answering method for constructing answer semantic space based on knowledge graph
CN116089645A (en) Hierarchical style-based conditional text-e-commerce picture retrieval method and system
CN115440384A (en) Medical knowledge map processing method and system based on multitask learning
CN113705402B (en) Video behavior prediction method, system, electronic device and storage medium
CN111125318A (en) Method for improving knowledge graph relation prediction performance based on sememe-semantic item information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination