CN116701590A - Visual question-answering method for constructing answer semantic space based on knowledge graph - Google Patents

Visual question-answering method for constructing answer semantic space based on knowledge graph

Info

Publication number
CN116701590A
Authority
CN
China
Prior art keywords
answer
entity
embedding
knowledge
knowledge graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310703224.8A
Other languages
Chinese (zh)
Inventor
成科扬
蒋洲
彭长生
万浩
司宇
沈维杰
严浏阳
周昊
陈楠
余悦
陈涛
邹文轩
黄昊
唐静峰
位刘涛
叶中得
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu King Intelligent System Co ltd
Jiangsu University
Original Assignee
Jiangsu King Intelligent System Co ltd
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu King Intelligent System Co ltd, Jiangsu University filed Critical Jiangsu King Intelligent System Co ltd
Priority to CN202310703224.8A priority Critical patent/CN116701590A/en
Publication of CN116701590A publication Critical patent/CN116701590A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 - Ontology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 - Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a visual question-answering method in which a knowledge graph participates in constructing an answer semantic space. When constructing the answer semantic space, the method not only embeds external knowledge into the answer features, but also reduces the language bias of the model through a designed semantic loss, significantly improving the generalization ability of the model. The technique can be applied to a variety of visual question-answering methods and has great commercial value.

Description

Visual question-answering method for constructing answer semantic space based on knowledge graph
Technical Field
The invention relates to the technical fields of computer vision, natural language processing, and knowledge graphs. Its main idea is to enhance the feature expression of entity-related answers in the answer space through related entity nodes in a knowledge graph. The technique can be applied to fields such as online education, assistance for the blind, and intelligent question answering, and has great commercial value.
Background
Visual question answering is a task that requires producing a text answer given a question and an image as input. A visual question-answering model requires a high-level understanding of both the image content and the question, and the task is therefore generally regarded as a proxy task for assessing the visual reasoning capability of a system.
Although the task requires predicting text output, i.e., a text answer, and text answers form a structurally rich output space, most known benchmarks and evaluation protocols treat it as a classification problem, e.g., VQAv1, VQAv2, and GQA. In previous approaches, the answer dictionary is highly dependent on the training set, which is detrimental to generalizing to unseen data, while the answer classes of the dictionary are treated as independent, without considering their semantic relationships. As a result, the model becomes highly dependent on question bias.
At the same time, a large number of questions in the visual question-answering task require reasoning with general knowledge about the image content. For example, for a question such as "What can the red object on the ground be used for?", one not only needs to visually identify that the red object is a fire hydrant, but also needs to know that a fire hydrant can be used to extinguish fires. Wang et al. constructed a fact-based VQA dataset, FVQA, whose questions must be answered with external knowledge. Wang et al. also proposed a new VQA method that can automatically find facts supporting the visual question from large structured knowledge bases such as ConceptNet and DBpedia. Instead of learning the mapping from questions to answers directly, the method learns the mapping from questions to knowledge-base queries. Wen and Peng et al. proposed a commonsense-knowledge-based reasoning model that supports the visual commonsense reasoning task by incorporating external knowledge. In addition, there are many other visual question-answering methods that use external knowledge, but some current external-knowledge-based visual question-answering methods do not consider the external knowledge contained in the answer sets of the datasets.
In summary, in the existing visual question-answering task, how to fully utilize the external knowledge contained in the answers and how to exploit the direct semantic relevance of similar answers remain urgent problems to be solved.
Disclosure of Invention
1. A visual question-answering method in which a knowledge graph participates in constructing an answer semantic space, characterized in that the method comprises the following steps:
Step 1.1: Extract entity nodes related to the question and the image context from an external large-scale knowledge graph, and embed them into the image features and the text features of the question respectively, obtaining knowledge-embedded image features V_k and knowledge-embedded question features Q_k.
Step 1.2: Input V_k and Q_k into a multimodal bilinear pooling network, remove the final softmax classification layer, and take the state of the last layer, F_θ(V_k, Q_k), as the joint feature of the image and the question, where θ denotes the parameters of the bilinear pooling model.
Step 1.3: Screen out entity nodes related to the answers from the external large-scale knowledge graph, and retain the entity nodes with the highest degree of correlation in the knowledge graph for embedding into the answer feature expression, thereby generating the knowledge-embedded answer feature expression G_φ(a).
Step 1.4: Project the joint image-question feature F_θ(V_k, Q_k) obtained in step 1.2 and the knowledge-embedded answer feature expression G_φ(a) obtained in step 1.3 into the same answer semantic space for learning.
Step 1.5: given an input image and a question, the model predicts the answer by the following formula:
wherein a is * Representing the answer predicted.
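By way of illustration only, the sketch below scores each candidate answer embedding against the joint feature and takes the argmax, as in step 1.5; the tensor names and dimensions are assumptions made for the example, not part of the disclosure.

```python
import torch

def predict_answer(joint_feat, answer_embeds, answer_labels):
    # joint_feat: (d,) joint feature F_theta(V_k, Q_k)
    # answer_embeds: (num_answers, d) matrix whose rows are G_phi(a)
    scores = answer_embeds @ joint_feat      # similarity score per candidate answer
    best = torch.argmax(scores).item()       # a* = argmax over the answer set A
    return answer_labels[best]

# toy usage with random features (d = 8, three candidate answers)
torch.manual_seed(0)
joint = torch.randn(8)
answers = torch.randn(3, 8)
print(predict_answer(joint, answers, ["fire hydrant", "dog", "red"]))
```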
2. The visual question-answering method based on knowledge-graph answer semantic embedding according to claim 1, characterized in that the method of embedding relevant external knowledge into the text and the image in step 1.1 is as follows:
Step 2.1: Process the input image with an object detection network pre-trained on the Visual Genome dataset to obtain visual features V = {v_1, ..., v_{n_v}}, where n_v = 18 and v_i denotes the features of the object in the i-th detection box. Each detection box also yields a 5-dimensional box feature d_v, comprising the top, bottom, left, and right coordinates of the box and its confidence, as well as a detected entity name. The set of entities involved in the image is E_v = {e_1, ..., e_{v_n}}, where v_n is the number of detected entities.
Step 2.2: Represent each word in the input question by a 1024-dimensional GloVe word embedding, then input the sequence into a long short-term memory network to obtain the overall text feature representation Q of the question.
Step 2.3: analyzing the problem through Steady named entity recognition to obtain an entity set involved in the problemWherein q is n Is the number of detected entities.
Step 2.4: screening entities appearing in images from large-scale knowledge graph ConceptNet by computing cosine similarityEntities related to questions->Obtaining the most relevant entity node e k E, where E represents a collection of entity nodes. With entity node e k Starting from e k And expanding to the second-order field, reserving field entities and edges, and finally forming a knowledge graph related to the image and the problem context, wherein all entity names are characterized by embedding GloVe words. Obtaining updated entity node characteristic representation related to the image and the problem context after the knowledge graph obtained through RGCN processing is subjected to multiple iterations of the graph neural network>
Step 2.5: embedding the final entity node characteristics obtained in the step 2.4 into the image characteristics and the text characteristics of the problem respectively to obtain the image characteristics V containing the external knowledge information k And image feature Q containing external knowledge information k The method is used for downstream multi-modal feature fusion tasks.
3. The method of embedding external knowledge into the text and the image respectively according to step 2.5 of claim 2, characterized in that the embedding method comprises the following specific steps:
Step 3.1: For each entity detected by the object detection network, concatenate its visual features v_i with the corresponding entity node features in the knowledge graph to obtain knowledge-embedded visual features; after embedding the external knowledge, the overall input image features are denoted V_k.
Step 3.2: Concatenate the features of the entity nodes detected in the question, as auxiliary features, after the question features to obtain the knowledge-embedded question feature expression q_k; then input q_k into a multi-layer perceptron to obtain the knowledge-embedded question features Q_k.
4. The visual question-answering method for constructing an answer semantic space based on a knowledge graph according to claim 1, characterized in that the method of embedding the external entity nodes containing knowledge information into the answer features in step 1.3 is as follows:
Step 4.1: Model the 3,500 answer labels with the highest probability of occurrence in visual question answering as entities e_a, screen entities related to these answer entities from the large-scale knowledge graph by computing cosine similarity, represent all entity names by GloVe word embeddings as features, expand to the second-order neighborhood, retain the neighborhood entities and edges, and finally form a knowledge graph related to the answer context.
Step 4.2: After multiple RGCN iterations, the entity node related to each answer aggregates the knowledge information of its surrounding nodes, yielding an updated entity node feature representation.
Step 4.3: If the label of an answer is identical to an entity name, the entity in the knowledge graph is directly modeled as an element of the answer set. If the answer label contains words beyond the entity names, the answer-related entity features are concatenated to the text features of the answer label to obtain new answer features, which serve as elements of the answer set.
5. The visual question-answering method based on knowledge-graph answer semantic embedding according to claim 1, characterized in that, in step 1.4, the method of projecting F_θ(V_k, Q_k) and G_φ(a) into the answer semantic space for learning is as follows:
Step 5.1: Input each element of the answer set obtained in step 4.3 into an LSTM network, and take the last layer of the LSTM hidden state as the representation feature G_φ(a) of the answer, where φ denotes the parameters of the LSTM network.
Step 5.2: will feature F θ (V k ,Q k ) Andprojecting into the same answer semantic space for learning, and optimizing model +.>F is made to θ (V k ,Q k ) Andare close to each other in a common feature space, where a is the set of all answers. The loss function in the training process is as follows:
where I (a, b) is a binary indicator function, taking 1 when a=b is true, otherwise 0.
Step 5.3: prediction to penalize semantic proximityAnd goal->Classification errors between, defining a semantic loss function as follows:
where s represents the similarity calculation function,finally, the overall loss function of the model during training is:
where λ is a predefined hyper-parameter.
The beneficial effects of the invention are as follows: by combining knowledge graphs with graph convolution techniques, the invention provides a solution for constructing an answer semantic space, greatly improving the positive effect that the answer information in the dataset has on the visual question-answering task.
Drawings
FIG. 1 is an overall diagram of the visual question-answering method according to the present invention.
FIG. 2 is a flow chart of the visual question-answering method according to the present invention.
FIG. 3 is a schematic diagram of the construction of the answer semantic space by the visual question-answering method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, the visual question-answering method for constructing an answer semantic space based on a knowledge graph according to the invention comprises the following specific implementation steps:
Step 1: Extract entity nodes related to the question and the image context from an external large-scale knowledge graph, and embed them into the image features and the text features of the question respectively, obtaining knowledge-embedded image features V_k and knowledge-embedded question features Q_k.
Step 1.1: processing the input image using a pre-trained object detection network on the Visual Genome dataset to obtain Visual featuresWherein n is v =18,v i Representing the characteristics of the objects in the detection frames, and simultaneously, each detection frame also obtains the characteristics d of the detection frame in 5 dimensions v . D of 5 dimension v The method comprises the steps of detecting the coordinates of the upper, lower, left and right positions of the frame and the confidence of the frame. Each detection box also gets the detected entity name +.>The detection frame features comprise the coordinates of the upper, lower, left and right corners of the detection frame and the confidence of the detection frame. Entity set involved in an image ∈>Wherein v is n Is the number of detected entities.
Step 1.2: each word in the input question is represented by GloVe word embedding, each word is represented by 1024-dimensional word embedding, and then the word is input into a long-short-term memory network to obtain the integral text characteristic representation of the question
Step 1.3: analyzing the problem through Steady named entity recognition to obtain an entity set involved in the problemWherein q is n Is the number of detected entities.
Step 1.4: screening entities appearing in images from large-scale knowledge graph ConceptNet by computing cosine similarityEntities related to questions->Obtaining the most relevant entity node e k E, where E represents a collection of entity nodes. With entity node e k Starting from e k And expanding to the second-order field, reserving field entities and edges, and finally forming a knowledge graph related to the image and the problem context, wherein all entity names are characterized by embedding GloVe words. The knowledge graph obtained through RGCN processing obtains updated entity nodes related to the image and the problem context after multiple iterations of the graph neural networkCharacteristic representation->
Step 1.5: for each entity detected by the object detection network, its visual characteristics v are determined i Splicing the external knowledge with the corresponding entity node characteristics in the knowledge graph to obtain visual characteristics embedded by the external knowledge, namely each visual characteristicThe overall input image features are denoted as V after embedding the external knowledge k . Splicing the detected entity node characteristics in the problem as auxiliary characteristics after the problem characteristics to obtain the problem characteristic expression embedded by external knowledgeThen q k Inputting a 3-layer multi-layer perceptron to obtain problem characteristics Q embedded by external knowledge k
Step 2: a joint feature representation of the image and the problem is generated. Will V k ,Q k Inputting into a multimode bilinear pooling network, removing the last softmax classification layer, and taking the state F of the last layer θ (V k ,Q k ) As a joint feature of the image and the problem, where θ is a parameter of the bilinear pooling model.
Step 3: embedding the entity node containing knowledge information into answer characteristics to obtain characteristic representation of corresponding answer
Step 3.1: modeling the first 3500 answer labels with highest occurrence probability in VQA data set and OK-VQA data set as entity e a And (3) selecting entities related to the answer entities from the large-scale knowledge graph through calculating cosine similarity, embedding all entity names by using Glove words as characteristics, expanding the entities to the second-order field, reserving field entities and edges, and finally forming the knowledge graph related to the answer context.
Step 3.2: after RGCN is iterated for many times, the entity node related to each answer gathers the surrounding node knowledge information to obtain updated entity node characteristic representation
Step 3.3: if the label of the answer is consistent with the entity name, directly modeling the entity in the knowledge graph as an element in the answer set. If the answer label contains words beyond the entity names, the entity features related to the answers are spliced to the text features of the answer label to obtain new answer features serving as elements in the answer set. Inputting each element in the obtained answer set into a 4-layer LSTM neural network, and taking the last layer of the LSTM neural network hidden layer as the representing characteristic of the answerWherein->Is a parameter of the LSTM network.
Step 4: f to be constructed θ (V k ,Q k ) And G φ (a) And putting an answer semantic space, and carrying out feature matching through a designed probability model to obtain an answer, wherein the answer semantic space is shown in figure 3.
Step 4.1: will feature F θ (V k ,Q k ) Andprojecting into the same answer semantic space for learning, and optimizing model +.>F is made to θ (V k ,Q k ) Andin a common feature spaceIn close proximity to each other, where A is the set of all answers, τ is the artificially set loss temperature, and when τ is larger, the distance of different eigenvectors in the eigenvector space is larger. The loss function in the training process is as follows:
where I (a, b) is a binary indicator function, taking 1 when a=b is true, otherwise 0.
Step 5.3: prediction to penalize semantic proximityAnd goal->Classification errors between, defining a semantic loss function as follows:
where s represents the similarity calculation function,finally, the overall loss function of the model during training is:
where λ is a predefined hyper-parameter.
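Under the loss forms reconstructed above, the training objective can be sketched as follows; the cosine form of s, the value of λ, and the batch shapes are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def total_loss(joint_feat, answer_embeds, target_idx, tau=0.02, lam=0.5):
    # L_1: temperature-scaled cross-entropy over joint-feature/answer similarities
    logits = joint_feat @ answer_embeds.t() / tau          # (batch, num_answers)
    log_p = F.log_softmax(logits, dim=-1)
    l1 = F.nll_loss(log_p, target_idx)                     # I(a, a_hat) indicator term
    # L_sem: similarity-weighted term; cosine similarity stands in for s (assumed)
    s = F.cosine_similarity(answer_embeds.unsqueeze(0),
                            answer_embeds[target_idx].unsqueeze(1), dim=-1)
    l_sem = -(s * log_p).sum(dim=-1).mean()
    return l1 + lam * l_sem                                # L = L_1 + lambda * L_sem

joint = torch.randn(4, 512, requires_grad=True)
answers = torch.randn(100, 512, requires_grad=True)
total_loss(joint, answers, torch.tensor([1, 7, 42, 99])).backward()
```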
In general, the model framework is shown in FIG. 2. The object detection network that processes the input image was pre-trained on the Visual Genome dataset, with input images and labels of size 448×448, and was trained for 20 epochs. The number of detected objects in each image is fixed at 18: detection boxes with confidence greater than 0.5 are retained, and if fewer than 18 boxes are detected, the spare entity features are padded with zeros. The text of the knowledge graph and the text of the input question initially use the same 300-dimensional word embedding representation. When related entities detected in the text are used for knowledge embedding, only the entity with the highest confidence is retained. We used the Adam optimizer during model training with a mini-batch size of 128, and stabilized training with dropout and batch normalization. During training, the loss temperature τ was set to 0.02.
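For convenience, these reported hyper-parameters can be collected into a configuration sketch as below; the dictionary keys and the placeholder model are illustrative only.

```python
import torch

# Assumed training-configuration sketch matching the hyper-parameters reported
# in the text; `model` stands in for the full network described above.
config = {
    "image_size": (448, 448),
    "num_boxes": 18,             # pad entity features with zeros below this count
    "box_conf_threshold": 0.5,
    "word_dim": 300,             # shared by question text and graph entities
    "batch_size": 128,
    "tau": 0.02,                 # loss temperature
    "detector_epochs": 20,
}

model = torch.nn.Linear(8, 8)    # placeholder for the actual VQA model
optimizer = torch.optim.Adam(model.parameters())
```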
The detailed description above sets out only specific practical embodiments of the present invention; it is not intended to limit the scope of the present invention, and all equivalent embodiments or modifications that do not depart from the spirit of the present invention shall fall within the scope of the present invention.

Claims (5)

1. A visual question-answering method in which a knowledge graph participates in constructing an answer semantic space, characterized by comprising the following steps:
Step 1.1: Extract entity nodes related to the question and the image context from an external large-scale knowledge graph, and embed them into the image features and the text features of the question respectively, obtaining knowledge-embedded image features V_k and knowledge-embedded question features Q_k.
Step 1.2: Input V_k and Q_k into a multimodal bilinear pooling network, remove the final softmax classification layer, and take the state of the last layer, F_θ(V_k, Q_k), as the joint feature of the image and the question, where θ denotes the parameters of the bilinear pooling model.
Step 1.3: Screen out entity nodes related to the answers from the external large-scale knowledge graph, and retain the entity nodes with the highest degree of correlation in the knowledge graph for embedding into the answer feature expression, thereby generating the knowledge-embedded answer feature expression G_φ(a).
Step 1.4: Project the joint image-question feature F_θ(V_k, Q_k) obtained in step 1.2 and the knowledge-embedded answer feature expression G_φ(a) obtained in step 1.3 into the same answer semantic space for learning.
Step 1.5: given an input image and a question, the model predicts the answer by the following formula:
wherein a is * Representing the answer predicted.
2. The visual question-answering method based on knowledge-graph answer semantic embedding according to claim 1, characterized in that the method of embedding relevant external knowledge into the text and the image in step 1.1 is as follows:
Step 2.1: Process the input image using an object detection network pre-trained on the Visual Genome dataset to obtain visual features V = {v_1, ..., v_{n_v}}, where n_v = 18 and v_i denotes the features of the object in the i-th detection box. Each detection box also yields a 5-dimensional box feature d_v, comprising the top, bottom, left, and right coordinates of the box and its confidence, as well as a detected entity name. The set of entities involved in the image is E_v = {e_1, ..., e_{v_n}}, where v_n is the number of detected entities.
Step 2.2: Represent each word in the input question by a 1024-dimensional GloVe word embedding, then input the sequence into a long short-term memory network to obtain the overall text feature representation Q of the question.
Step 2.3: analyzing the problem through Steady named entity recognition to obtain an entity set involved in the problemWherein q is n Is the number of detected entities.
Step 2.4: screening entities appearing in images from large-scale knowledge graph ConceptNet by computing cosine similarityEntities related to questions->Obtaining the most relevant entity node e k E, where E represents a collection of entity nodes. With entity node e k Starting from e k And expanding to the second-order field, reserving field entities and edges, and finally forming a knowledge graph related to the image and the problem context, wherein all entity names are characterized by embedding GloVe words. Obtaining updated entity node characteristic representation related to the image and the problem context after the knowledge graph obtained through RGCN processing is subjected to multiple iterations of the graph neural network>
Step 2.5: embedding the final entity node characteristics obtained in the step 2.4 into the image characteristics and the text characteristics of the problem respectively to obtain the image characteristics V containing the external knowledge information k And image feature Q containing external knowledge information k The method is used for downstream multi-modal feature fusion tasks.
3. The method of embedding external knowledge into the text and the image respectively according to step 2.5 of claim 2, characterized in that the embedding method comprises the following specific steps:
Step 3.1: For each entity detected by the object detection network, concatenate its visual features v_i with the corresponding entity node features in the knowledge graph to obtain knowledge-embedded visual features; after embedding the external knowledge, the overall input image features are denoted V_k.
Step 3.2: Concatenate the features of the entity nodes detected in the question, as auxiliary features, after the question features to obtain the knowledge-embedded question feature expression q_k; then input q_k into a multi-layer perceptron to obtain the knowledge-embedded question features Q_k.
4. The visual question-answering method for constructing an answer semantic space based on a knowledge graph according to claim 1, characterized in that the method of embedding the external entity nodes containing knowledge information into the answer features in step 1.3 is as follows:
Step 4.1: Model the 3,500 answer labels with the highest probability of occurrence in visual question answering as entities e_a, screen entities related to these answer entities from the large-scale knowledge graph by computing cosine similarity, represent all entity names by GloVe word embeddings as features, expand to the second-order neighborhood, retain the neighborhood entities and edges, and finally form a knowledge graph related to the answer context.
Step 4.2: After multiple RGCN iterations, the entity node related to each answer aggregates the knowledge information of its surrounding nodes, yielding an updated entity node feature representation.
Step 4.3: If the label of an answer is identical to an entity name, the entity in the knowledge graph is directly modeled as an element of the answer set. If the answer label contains words beyond the entity names, the answer-related entity features are concatenated to the text features of the answer label to obtain new answer features, which serve as elements of the answer set.
5. The visual question-answering method based on knowledge-graph answer semantic embedding according to claim 1, characterized in that, in step 1.4, the method of projecting F_θ(V_k, Q_k) and G_φ(a) into the answer semantic space for learning is as follows:
Step 5.1: Input each element of the answer set obtained in step 4.3 into an LSTM network, and take the last layer of the LSTM hidden state as the representation feature G_φ(a) of the answer, where φ denotes the parameters of the LSTM network.
Step 5.2: will feature F θ (V k ,Q k ) And G φ (a) Projected into the same answer semantic space for learning, and in the neural network training process, the model is optimizedF is made to θ (V k ,Q k ) And->Are close to each other in a common feature space, where a is the set of all answers. The loss function in the training process is as follows:
where I (a, b) is a binary indicator function, taking 1 when a=b is true, otherwise 0.
Step 5.3: prediction to penalize semantic proximityAnd goal->Classification errors between, defining a semantic loss function as follows:
where s represents the similarity calculation function,finally, the overall loss function of the model during training is:
where λ is a predefined hyper-parameter.
CN202310703224.8A 2023-06-14 2023-06-14 Visual question-answering method for constructing answer semantic space based on knowledge graph Pending CN116701590A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310703224.8A CN116701590A (en) 2023-06-14 2023-06-14 Visual question-answering method for constructing answer semantic space based on knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310703224.8A CN116701590A (en) 2023-06-14 2023-06-14 Visual question-answering method for constructing answer semantic space based on knowledge graph

Publications (1)

Publication Number Publication Date
CN116701590A true CN116701590A (en) 2023-09-05

Family

ID=87837050

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310703224.8A Pending CN116701590A (en) 2023-06-14 2023-06-14 Visual question-answering method for constructing answer semantic space based on knowledge graph

Country Status (1)

Country Link
CN (1) CN116701590A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117892140A (en) * 2024-03-15 2024-04-16 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium
CN117892140B (en) * 2024-03-15 2024-05-31 浪潮电子信息产业股份有限公司 Visual question and answer and model training method and device thereof, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US11113598B2 (en) Dynamic memory network
JP7468929B2 (en) How to acquire geographical knowledge
US20160350653A1 (en) Dynamic Memory Network
CN111984766B (en) Missing semantic completion method and device
CN110990590A (en) Dynamic financial knowledge map construction method based on reinforcement learning and transfer learning
CN105631479A (en) Imbalance-learning-based depth convolution network image marking method and apparatus
CN109214006B (en) Natural language reasoning method for image enhanced hierarchical semantic representation
CN109783666A (en) A kind of image scene map generation method based on iteration fining
CN113779211A (en) Intelligent question-answer reasoning method and system based on natural language entity relationship
CN111522965A (en) Question-answering method and system for entity relationship extraction based on transfer learning
CN115618045B (en) Visual question answering method, device and storage medium
CN112308115B (en) Multi-label image deep learning classification method and equipment
CN109376222A (en) Question and answer matching degree calculation method, question and answer automatic matching method and device
CN111898636B (en) Data processing method and device
CN111611367B (en) Visual question-answering method introducing external knowledge
CN113673244B (en) Medical text processing method, medical text processing device, computer equipment and storage medium
CN115599899B (en) Intelligent question-answering method, system, equipment and medium based on aircraft knowledge graph
CN114186076A (en) Knowledge graph construction method, device, equipment and computer readable storage medium
CN116151263B (en) Multi-mode named entity recognition method, device, equipment and storage medium
CN116385937B (en) Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework
CN116701590A (en) Visual question-answering method for constructing answer semantic space based on knowledge graph
CN116089645A (en) Hierarchical style-based conditional text-e-commerce picture retrieval method and system
CN115440384A (en) Medical knowledge map processing method and system based on multitask learning
CN113705402B (en) Video behavior prediction method, system, electronic device and storage medium
CN111125318A (en) Method for improving knowledge graph relation prediction performance based on sememe-semantic item information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination