CN114969459A - Cognitive-channel-based cognitive-inference visual question-answering method - Google Patents

Info

Publication number
CN114969459A
Authority
CN
China
Prior art keywords
question
cognitive
content
image
answering
Prior art date
Legal status
Pending
Application number
CN202210343042.XA
Other languages
Chinese (zh)
Inventor
张文强
张开磊
王昊奋
刘威辰
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202210343042.XA
Publication of CN114969459A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90332 Natural language query formulation or dialogue systems
    • G06F 16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F 16/3338 Query expansion
    • G06F 16/3344 Query execution using natural language analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/02 Knowledge representation; Symbolic representation
    • G06N 5/046 Forward inferencing; Production systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the fields of computer multi-modal information processing and cognitive science, in particular to a visual question-answering method based on cognitive inference over the dual channels of cognition. The method mainly comprises the following steps:
Step 1, construct the cognitive content: extract question keywords and the label content of the image target areas as indices for searching a large-scale knowledge base, and build a task knowledge base from the retrieved content.
Step 2, perform prior cognitive computation: train the visual-text content representation through a multi-modal pre-training model.
Step 3, establish the reasoning spatio-temporal features of the task: perform syntactic and part-of-speech analysis on the question to build a topological graph among the question keywords, compute the relevance of the image's internal regions using the visual representation content from step 2, and build the image-content spatial graph structure.
Step 4, lock the image content related to the question-answer content: from the jointly represented visual and text vectors of step 2, compute the image content attended to by each text vector and construct the question-answer-related image content.
Step 5, perform hierarchical cognitive reasoning: combine the knowledge content built in step 1 with the prior computation of step 2 to re-encode the question-answer content, construct a question-answer instruction set from the re-encoded representation content and the spatio-temporal features analyzed in step 3, and perform question-answer reasoning through the instructions to obtain the visual question-answering result.
The invention improves the accuracy of open-domain visual question-answering models and continuously corrects the cognitive understanding of the question-answer content with external knowledge during reasoning, making the visual question-answering process robust and interpretable.

Description

Cognitive-channel-based cognitive-inference visual question-answering method
Technical Field
The invention relates to the fields of computer multi-modal information processing and cognitive science, in particular to a visual question-answering method based on cognitive inference over the dual channels of cognition.
Background
In common visual question-answering tasks, good results have been obtained through the study of multi-modal information representation, but the reasoning aspect of the question-answering process is ignored, so the process lacks interpretability. Moreover, when a complex visual question-answering task is encountered, that is, when the question-answer relation exceeds the given conditions and the question can only be answered with prior knowledge or common-sense facts, visual question answering deviates greatly and produces irrelevant answers. In the dual-channel theory of cognitive science, the cognitive system of the human brain contains two systems: System 1 and System 2. System 1 is an intuitive system that finds answers quickly and simply by matching a person's relevant information. System 2 is an analytical system that finds answers through reasoning and logic. System 1 can quickly understand a task through its representation, and complex visual question-answering tasks are analyzed and reasoned about by System 2 on the basis of System 1's recognition, so that the computer's understanding of the task is retained while its reasoning process over the task is realized.
Representation-learning-based methods enable a computer to understand task content well, and information irrelevant to the task is filtered out through the joint representation of task content; however, the represented content is limited to the given conditions and cannot capture associations beyond that content. Retrieval-based methods can search large amounts of corpus information as evidence for question answering and overcome this limitation of representation, but retrieving corpora increases the computation cost, and introducing irrelevant corpora shifts the originally computed feature representation away from the correct answer, affecting the final result.
Disclosure of Invention
The invention aims to solve the above problems by providing a visual question-answering algorithm based on cognitive inference over the dual channels of cognition, which addresses open-domain visual question answering, dynamically learns question-answering patterns, and improves the accuracy and interpretability of question answering. The following technical scheme is adopted:
the invention provides a visual question-answering method based on cognitive inference of cognitive channels, which is characterized by comprising the following steps of: step S1, extracting question keywords and image target area labels, and constructing an index set based on the keywords and the target area labels; step S2, based on the index set, retrieving and constructing a knowledge base related to the visual question-answering task, and acquiring cognitive content; step S3, calculating and training a cross-modal representation model of a character mode and an image mode through a multi-modal pre-training cross-modal model; step S4, fine-tuning the cross-modal representation model in the visual question-answering task to obtain a primary cognitive model, and calculating various types of representation vectors by using the cognitive model; step S5, constructing a topological structure among the question keywords by performing syntactic analysis and part-of-speech analysis on the question, and updating the topological structure by using the parts-of-speech of the question keywords so as to obtain question reasoning time characteristics; step S6, calculating the image target area space characteristic based on the visual question-answering task; step S7, acquiring a semantic matching relevance matrix of the image target region based on the image target region space characteristics, and constructing a cross-modal task relevance graph structure; step S8, recoding the visual question-answering content based on the cognitive content and the various expression vectors, and acquiring recoded expression content; step S9, constructing a question-answer instruction set based on the recoded representation content, the question reasoning time characteristic and the image target area space characteristic, and carrying out question-answer reasoning through the question-answer instruction set so as to obtain the visual question-answer result.
The visual question-answering method based on cognitive inference over the dual channels of cognition provided by the invention can also have the technical feature that, in step S1, the index set is constructed by performing word segmentation on the input question to obtain question words and phrases, inputting the corresponding image for the question, dividing the image into target areas, and classifying the target areas.
The visual question-answering method based on cognitive inference over the dual channels of cognition provided by the invention can also have the technical feature that, in step S2, the knowledge base related to the visual question-answering task is constructed by building source and target nodes from the cross-modal symbolic representation of the relations between parts of speech, computing all shortest-path sets from the source nodes to the target nodes with a shortest-path algorithm, and merging the shortest-path sets based on the commonality of the question-answer content.
The visual question-answering method based on cognitive inference over the dual channels of cognition provided by the invention can also have the technical feature that the various representation vectors at least comprise text-modality vectors, image-modality vectors, and linear mapping vectors, and the computation of the spatial features in step S6 comprises the following sub-steps:
Step S6-1: compute the linear mapping vector by jointly representing the text-modality vector and the image-modality vector:

JoinR_k = f([R_k | S]), k ∈ Ω

where f is a linear mapping function, R_k is the feature representation vector of the k-th target area on the input image, and S is the sentence feature vector computed by the primary cognitive model.
Step S6-2: compute the correlation matrix over the target areas by matrix multiplication:

Relation = JoinR × JoinR^T

Step S6-3: from the rectangular coordinates of the target area, compute the corresponding polar coordinates (r, θ) as the physical spatial relation:

r = √(x² + y²), θ = arctan(y / x)
the visual question-answering method based on cognitive inference of cognitive binary channels provided by the invention can also have the technical characteristics that the step S7 further comprises the following substeps: step S7-1, calculating the overall similarity Sam of the cross-modal task SF The expression is as follows:
Figure BDA0003580059130000042
Figure BDA0003580059130000043
wherein S is a problem semantic vector, and F is an image i representation vector; step S7-2, calculating the keyword feature expression vector EP by using a mean pooling method k The formula is as follows:
Figure BDA0003580059130000044
and step S7-3, calculating the relevancy matrix based on the attention inquiry mechanism.
The visual question-answering method based on cognitive inference over the dual channels of cognition provided by the invention can also have the technical feature that the relevance matrix comprises a weight matrix between the question-answering task and the image target areas and an attention-score matrix between the keywords and the image target-area content, and step S7-3 further comprises:
Step A1: concatenate the question representation vector and the image representation features as the task representation:

T = σ[S, F]

where σ is the activation function.
Step A2: compute the weight matrix and smooth it:

W_tr = g(S, R)

Step A3: compute the keyword linear mapping vectors K, Q, V with the learnable parameters W_q, W_v:

K = R_r × W_tr, Q = Key_r × W_q, V = Key_r × W_v

where K is the image's query vector for the keywords, Q is the keywords' query vector for the image target areas, and V is the keywords' mapped value vector.
Step A4: compute each attention score matrix Att:

Att = softmax(Q K^T / √n) V

Step A5: compute the dispersion between each attention score matrix and the question-answering task:

Score = (1/n) Σ_i (Att_i − T_i)²

where n is the dimension of the vector.
Step A6: based on the overall similarity Sam_SF, select TopN(Score).
The visual question-answering method based on cognitive inference over the dual channels of cognition provided by the invention can also have the technical feature that the re-encoding of the visual question-answering content in step S8 comprises computing the similarity between the question-answering task representation vector and the relation or tail entity of each tuple in the question-answering task knowledge base:

Sim = (1/n) Σ_{i=1}^{n} T_i · E_i

where n is the vector dimension and E is the vector of the tuple's relation or tail entity.
Action and Effect of the invention
According to the visual question-answering method based on cognitive inference over the dual channels of cognition, a novel mechanism, a hierarchical cognitive question-answering algorithm, is provided. Pre-training gives the model an understanding of the task and constructs System 1, the intuitive part of the cognitive system, and fine-tuning the task model lets it adapt to new tasks. The task is then split to obtain its symbolic language, and a task-related knowledge base is constructed by retrieval to supplement the task content and strengthen the intuition of the cognitive system. Finally, the cognitive System 2 is constructed; it comprises two stages that re-understand the task. The intuitive understanding of System 1 already solves part of the question-answering content, but for non-intuitive questions the task representation must be redefined on the basis of System 1: under System 1's conditions, the degree of association between the task representation and the corpus evidence is computed, the task-related content is computed, and the content representation is constrained. The reasoning content of the two stages is obtained from the content of the cognitive knowledge base, a knowledge graph representing the task is constructed, and the relations between layers are inferred through the constructed spatio-temporal sequence of each knowledge graph to obtain the final answer. The invention greatly improves the accuracy of question-answering results and continuously corrects the cognitive understanding of the question-answer content with external knowledge during reasoning, making the visual question-answering process robust and interpretable.
Drawings
FIG. 1 is a schematic flow chart of a visual question-answering method based on cognitive inference of cognitive dual channels in an embodiment of the present invention;
Detailed Description
In order to make the technical means, creative features, objectives, and effects of the present invention easy to understand, the following describes the visual question-answering method based on cognitive inference over the dual channels of cognition in detail with reference to the embodiments and the accompanying drawings.
< example >
In the embodiment, two systems are constructed: an intuition system and a cognition system. The intuition system quickly identifies, understands, and analyzes complex visual question-answering tasks; the cognition system, built on the intuition system, retains the computer's understanding of the question-answering task through analysis and reasoning, thereby realizing the computer's reasoning process over the task.
Fig. 1 is a schematic flow chart of the visual question-answering method based on cognitive inference over the dual channels of cognition in the embodiment of the present invention.
As shown in Fig. 1, the visual question-answering method based on cognitive inference over the dual channels of cognition mainly comprises the following steps:
step S1, extracting question keywords and image target area labels, and constructing an index set based on the keywords and the target area labels;
in this embodiment, in step S1, the index set is constructed by performing word segmentation on the input question to obtain a question word and a question phrase, inputting a corresponding image for the input question, performing the target area division on the image, and classifying the target area. Specifically, the index set construction step is as follows: performing word segmentation processing on the question and question Q to obtain a question word W k The phrase P k In which P is k Is greater than W k (ii) a For the image I corresponding to the problem, the image I is divided into a plurality of target frames O through target detection k Calculating the corresponding classification Tag of each target k (ii) a Respectively labeling, classifying and deleting stop words in the sets W, P and Tag by using a part-of-speech labeling tool to construct a keyword index set K;
step S2, retrieving and constructing a knowledge base related to the visual question-answering task based on the index set, and acquiring cognitive content;
in this embodiment, the step S2 of constructing the knowledge base related to the visual question and answer task is to construct a source node and a target node based on the symbolic representation of cross-modal and the relationship between parts of speech, then use an algorithm to find all shortest path sets from the source node to the target node, and then combine the shortest path sets based on the commonality of the question and answer content, thereby constructing the knowledge base related to the visual question and answer task. The method comprises the following specific steps: constructing a source node and a target node based on symbolic representations of relations and modes existing between parts of speech: commonality of question-answer content, merging
Figure BDA0003580059130000071
Constructing a task knowledge base TKG, and dividing the knowledge base TKG into a word level knowledge base KG according to the structure of the composition index w Knowledge base KG for text structure s Cross-modal knowledge base KG c
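The shortest-path enumeration between source and target nodes can be sketched with a breadth-first search that records every equally short parent; the toy graph below is an illustrative assumption, not content from the patent:

```python
from collections import deque

# Hypothetical sketch of step S2: enumerate all shortest paths from a source
# node to a target node in a small symbolic knowledge graph.
def all_shortest_paths(graph, src, dst):
    """BFS that records every parent lying on a shortest path, then backtracks."""
    dist, parents = {src: 0}, {src: []}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in dist:                 # first visit: shortest distance found
                dist[v] = dist[u] + 1
                parents[v] = [u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:      # another equally short parent
                parents[v].append(u)

    def backtrack(node):
        if node == src:
            return [[src]]
        return [p + [node] for par in parents.get(node, []) for p in backtrack(par)]

    return backtrack(dst) if dst in dist else []

graph = {"dog": ["animal", "pet"], "animal": ["mammal"], "pet": ["mammal"]}
paths = all_shortest_paths(graph, "dog", "mammal")
```

Merging the resulting path sets across question keywords would then yield the task knowledge base TKG.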
Step S3, calculating and training a cross-modal representation model of a character mode and an image mode through a multi-modal pre-training cross-modal model;
step S4, fine-tuning the cross-modal representation model in the visual question-answering task to obtain a primary cognitive model, and calculating various representation vectors by utilizing the cognitive model;
in this embodiment, the specific steps are as follows: firstly, a pre-trained pre-training model is used for calculating a cross-modal expression model of a character mode and an image mode; and then, according to the obtained model fine tuning task, obtaining a primary cognitive model, and finally, calculating a cross-modal representation vector of the content of the primary cognitive stage.
Step S5, constructing a topological structure among the question keywords by performing syntactic analysis and part-of-speech analysis on the question, and updating the topological structure by using the parts-of-speech of the question keywords so as to obtain question reasoning time characteristics;
step S6, calculating the image target area space characteristic based on the visual question-answering task;
in this embodiment, step S6 includes the following sub-steps: step S6-1, calculating the linear mapping vector by jointly representing the text modality vector and the image modality vector; step S6-2, calculating a correlation matrix in the target area through matrix multiplication; and step S6-3, calculating corresponding polar coordinates (r, theta) as the physical space relation of the target area according to the rectangular coordinate system of the target area. Specifically, a topological structure of question-answer reasoning is constructed through syntax information of question sentences, the part of speech is used as node label content, the topological structure is updated to be used as the time characteristic of question reasoning, and finally the space characteristic in the image area is calculated under the condition of a question-answer task. The specific steps for calculating the spatial characteristics in the image area are as follows: using a sentence feature vector representation S calculated by a primary cognitive model and a feature representation set R of an image target area; the vectors representing the text mode and the image mode are combined, and the linear mapping vector is obtained:
JoinR k =f([R k |S]),k∈Ω
wherein the function f is a linear mapping function,R k is a feature vector representation of the kth target region on the input image; calculating a correlation matrix in the target area by multiplication of the matrix:
Relation=JoinR×JoinR T
according to the rectangular coordinate system of the target frame, calculating corresponding polar coordinates (r, theta) as the physical space relation:
Figure BDA0003580059130000091
Figure BDA0003580059130000092
wherein: x, y are respectively the object O k The abscissa and ordinate of a rectangular coordinate system in the image;
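These three computations can be sketched in NumPy; the dimensions and the random stand-in for the linear map f are illustrative assumptions:

```python
import numpy as np

# Hypothetical sketch of step S6: joint text-image mapping JoinR_k = f([R_k|S]),
# the region correlation matrix Relation = JoinR x JoinR^T, and polar coordinates
# of region centers. All sizes and weights are toy stand-ins.
rng = np.random.default_rng(0)
K, d_img, d_txt, d_join = 4, 8, 8, 6             # 4 regions, toy dimensions
R = rng.normal(size=(K, d_img))                  # region feature vectors R_k
S = rng.normal(size=(d_txt,))                    # sentence feature vector
W_f = rng.normal(size=(d_img + d_txt, d_join))   # stand-in for the linear map f

join_r = np.concatenate([R, np.tile(S, (K, 1))], axis=1) @ W_f
relation = join_r @ join_r.T                     # symmetric K x K correlation matrix

centers = np.array([[3.0, 4.0], [1.0, 0.0]])     # example region centers (x, y)
r = np.hypot(centers[:, 0], centers[:, 1])       # r = sqrt(x^2 + y^2)
theta = np.arctan2(centers[:, 1], centers[:, 0]) # theta = atan2(y, x)
```

Using `arctan2` rather than `arctan(y/x)` avoids the quadrant ambiguity and division by zero when x = 0.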
step S7, acquiring a semantic matching relevance matrix of the image target region based on the image target region space characteristics, and constructing a cross-modal task relevance graph structure;
in this embodiment, step S7 includes the following sub-steps: step S7-1, calculating the overall similarity expression of the cross-modal task; step S7-2, calculating the keyword feature expression vector by using a mean pooling method; step S7-3, based on the inquiry mechanism of attention, the correlation matrix is calculated; wherein the step S7-3 further includes the following sub-steps: step A1, splicing the question expression vector and the image expression characteristic for task expression; step A2, calculating a weight matrix of the relevancy matrix including a question answering task and the image target area and processing the weight matrix by a smoothing method; step A3, calculating the linear mapping vector K, Q, V of the keyword, and setting a learning parameter W q 、W v (ii) a Step A4, calculating an attention degree score matrix Att of each keyword and the content of the image target area; step A5, calculating the discrete degree of each attention degree score matrix and the question-answering task; step A6, based on the overall similarity Sam SF Selecting TopN (Scorre). Specifically, based on the correlation matrix calculated in the previous step, TopN (JoinR) is selected k ) Selecting 8 target areas with the highest association degree of each target area, wherein N is 8; the specific steps for constructing the target content lock between the cross-modalities are as follows: firstly, in the process of calculating the initial cognitive coding, the overall similarity of the cross-modal tasks needs to be calculated:
Figure BDA0003580059130000101
wherein S is a question Q semantic vector, and F is an image I representation vector;
Secondly, from the jointly encoded word-segmentation vectors EW, the vector representation of the related phrase is computed by mean pooling:

EP_k = (1 / |P_k|) Σ_{i ∈ P_k} EW_i

where EP_k is the feature representation vector of the keyword;
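A one-line sketch of this mean pooling, with illustrative token vectors and phrase indices:

```python
import numpy as np

# Hypothetical sketch of step S7-2: mean-pool the word vectors EW_i that belong
# to a phrase P_k into a single keyword vector EP_k. Indices are illustrative.
EW = np.arange(12.0).reshape(4, 3)   # 4 token vectors of dimension 3
phrase_tokens = [1, 2]               # indices of the tokens forming phrase P_k
EP_k = EW[phrase_tokens].mean(axis=0)
```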
Then the semantic-matching relevance matrix is computed with an attention query mechanism, as follows. Task joint representation: the sentence representation vector is concatenated with the image representation features as the task representation:

T = σ[S, F]

where S is the question semantic vector, F is the representation vector of the image, and σ is an activation function. The weight matrix between the task and the image areas is computed and smoothed:

W_tr = g(S, R)

where S is the question semantic vector, R is the vector representation of the target areas in the image, and g is a similarity function. The keyword linear mapping vectors K, Q, V are computed with the learnable parameters W_q, W_v:

K = R_r × W_tr
Q = Key_r × W_q
V = Key_r × W_v

where K is the image's query vector for the keywords, Q is the keywords' query vector for the image target areas, and V is the keywords' mapped value vector. The attention score matrix Att between each keyword and the image target-area content is computed:

Att = softmax(Q K^T / √n) V

The dispersion between each attention score Att and the task is computed:

Score = (1/n) Σ_i (Att_i − T_i)²

where n is the dimension of the vector. Finally, based on the overall similarity Sam_SF of the joint representation, TopN(Score) is selected, with N = 2 being the number of image target areas associated with the keywords;
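The attention query and Top-N selection can be sketched as scaled dot-product attention followed by a dispersion score; all matrices and the dispersion definition below are illustrative stand-ins, not the patent's trained parameters:

```python
import numpy as np

# Hypothetical sketch of step S7-3: scaled dot-product attention between keyword
# and region vectors, a dispersion score against the task vector, and TopN.
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n = 6                                  # shared vector dimension
keys = rng.normal(size=(5, n))         # K: region-side vectors
queries = rng.normal(size=(3, n))      # Q: keyword-side vectors
values = rng.normal(size=(5, n))       # V: mapped keyword value vectors

att = softmax(queries @ keys.T / np.sqrt(n)) @ values  # Att = softmax(QK^T/sqrt(n))V

task = rng.normal(size=(n,))                 # task representation T
score = ((att - task) ** 2).mean(axis=1)     # dispersion of each Att row vs. T
top2 = np.argsort(score)[:2]                 # TopN(Score) with N = 2
```

The √n scaling keeps the pre-softmax logits in a numerically stable range as the dimension grows.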
step S8, recoding the visual question-answer content based on the cognitive content and the various types of expression vectors, and acquiring recoded expression content;
step S9, constructing a question-answer instruction set based on the recoded representation content, the question reasoning time characteristic and the image target area space characteristic, and carrying out question-answer reasoning through the question-answer instruction set so as to obtain the visual question-answer result.
In this embodiment, the re-encoded representation content is obtained as follows. First, the constructed task knowledge base KG_w is encoded as the vector representation EKG_w; then the similarity between the task representation vector T and the relation or tail entity of each tuple in EKG_w is computed, where n is the vector dimension:

Sim = (1/n) Σ_{i=1}^{n} T_i · E_i

Task-irrelevant tuples are then eliminated to obtain the task subgraph SubKG_w of KG_w. Finally, using the Key index and the associated graph content nodes, EW is updated to EW' through a graph attention network, and the mean-pooled EW' serves as the text representation S_g of the task-associated graph knowledge;
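The tuple-filtering part of step S8 can be sketched as scoring each knowledge-base tuple against the task vector and keeping only the relevant ones; the similarity form and the zero threshold are illustrative assumptions:

```python
import numpy as np

# Hypothetical sketch of step S8: score knowledge-base tuples against the task
# representation and keep the task-relevant subgraph SubKG_w.
def tuple_similarity(task_vec, tuple_vec):
    n = task_vec.shape[0]
    return float(task_vec @ tuple_vec) / n   # Sim = (1/n) * sum_i T_i * E_i

task = np.array([1.0, 0.0, 1.0, 0.0])
tuples = {
    "(dog, is_a, mammal)": np.array([1.0, 0.0, 1.0, 0.0]),
    "(sofa, made_of, wood)": np.array([-1.0, 0.0, -1.0, 0.0]),
}
subgraph = {k: v for k, v in tuples.items() if tuple_similarity(task, v) > 0}
```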
The reasoning instruction D is constructed from the reasoning spatio-temporal features built from the question and answer, as follows: construct a temporal reasoning route TimeRout from the topological and inverse-topological structure of the graph, and for each time node construct the spatial graph of that time point.
The reasoning instruction D is then computed from the time points of the reasoning route as follows: construct the joint vector of the time-connected node and the task representation,

join_k = [T_k | T]

where T_k is the vector representation of time node k; compute the Gram matrix of the joint vectors,

Gram = join × join^T

and determine the inference sets D+ and D− according to the magnitude and sign of the corresponding Gram values.
Reasoning is performed with the instructions and the spatio-temporal graph structure constructed in the previous steps, and the question-answer result is obtained as follows. First, starting from the text representation S_g at time point T_0, forward reasoning follows D+_0 and the state vector Sta of T_0 is retained. Second, T_0 is computed according to the D+_0 instruction and the representation vector h_0 at time T_0 is updated through the graph attention network; at time T_k, the representation vector h_k is updated from the previous state quantity Sta of T_{k−1} and the D+_k instruction, and h_k is further updated to h_k' according to the D−_k instruction. Finally, after one round of reasoning, the similarity score between the task representation state quantity h and each spatial-graph node updated at each time point is computed, and the maximum is taken as the answer of the visual question answering.
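The skeleton of this instruction construction, topological ordering of time nodes, joint vectors with the task representation, a Gram matrix, and a sign-based split into D+ and D−, can be sketched as follows; every quantity is an illustrative stand-in:

```python
import numpy as np
from graphlib import TopologicalSorter

# Hypothetical sketch of step S9: order the time nodes topologically, form joint
# vectors [T_k | T], compute their Gram matrix, and split instructions into
# D+ / D- by the sign of the off-diagonal Gram entries.
edges = {"t1": {"t0"}, "t2": {"t1"}}            # t0 -> t1 -> t2 (predecessor map)
time_route = list(TopologicalSorter(edges).static_order())

rng = np.random.default_rng(2)
task = rng.normal(size=4)                       # task representation T
node_vecs = {t: rng.normal(size=4) for t in time_route}
joins = np.stack([np.concatenate([node_vecs[t], task]) for t in time_route])

gram = joins @ joins.T                          # Gram matrix of the joint vectors
off_diag = gram[np.triu_indices(len(time_route), k=1)]
d_plus = [v for v in off_diag if v > 0]         # positive-entry instructions D+
d_minus = [v for v in off_diag if v <= 0]       # non-positive instructions D-
```

Reversing `time_route` gives the inverse-topological pass; the graph-attention update of h_k at each time point would consume the D+ / D− instructions in this order.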
Effects and benefits of the embodiments
According to the visual question-answering method based on dual-channel cognitive inference of the embodiments, a hierarchical cognitive question-answering algorithm is provided. The model first acquires an understanding of the task through pre-training, constructing the intuitive System 1 of the cognitive system, and adapts to new tasks by fine-tuning the task model. Then, the task is decomposed to obtain its symbolic language, and a task-related knowledge base is constructed by retrieval to supplement the task content and improve the intuition of the cognitive system. Finally, the deliberative System 2 is constructed, comprising two stages for re-understanding the task: for questions that System 1 can answer intuitively, part of the question-answering content is already solved, but for non-intuitive questions, the task representation is redefined on the basis of System 1. Conditioned on System 1, the degree of association between the task representation and the corpus evidence is calculated, the task-related content is computed, and the content representation is constrained; the two stages of reasoning content are obtained from the cognitive knowledge base, a knowledge graph representing the task is constructed, and inter-layer relations are inferred through the constructed spatio-temporal sequence of each knowledge graph, thereby obtaining the final answer. The invention greatly improves the accuracy of the question-answering results and continuously corrects the cognitive understanding of the question-answering content according to external knowledge during reasoning, making the visual question-answering process robust and interpretable.
The above-described embodiments are merely illustrative of specific embodiments of the present invention, and the present invention is not limited to the description of the above-described embodiments.

Claims (7)

1. A visual question-answering method based on dual-channel cognitive inference, used for solving open-domain visual question answering, dynamically learning the question-answering mode, and improving the accuracy and interpretability of the question answering, characterized by comprising the following steps:
step S1, extracting question keywords and image target area labels, and constructing an index set based on the keywords and the target area labels;
step S2, based on the index set, retrieving and constructing a knowledge base related to the visual question-answering task, and acquiring cognitive content;
step S3, calculating and training a cross-modal representation model of a character mode and an image mode through a multi-modal pre-training cross-modal model;
step S4, fine-tuning the cross-modal representation model in the visual question-answering task to obtain a primary cognitive model, and calculating various types of representation vectors by using the cognitive model;
step S5, constructing a topological structure among the question keywords by performing syntactic analysis and part-of-speech analysis on the question, and updating the topological structure by using the parts-of-speech of the question keywords so as to obtain question reasoning time characteristics;
step S6, calculating the image target area space characteristic based on the visual question-answering task;
step S7, acquiring a semantic matching relevance matrix of the image target region based on the image target region space characteristics, and constructing a cross-modal task relevance graph structure;
step S8, recoding the visual question-answer content based on the cognitive content and the various types of expression vectors, and acquiring recoded expression content;
step S9, constructing a question-answer instruction set based on the recoded representation content, the question reasoning time characteristic and the image target area space characteristic, and carrying out question-answer reasoning through the question-answer instruction set so as to obtain the visual question-answer result.
2. The visual question-answering method based on dual-channel cognitive inference as claimed in claim 1, wherein:
in step S1, the index set is constructed by performing word segmentation on the input question to obtain question words and question phrases, inputting the image corresponding to the question, dividing the image into target areas, and classifying each target area.
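A minimal sketch of this index-set construction (a toy regex word splitter stands in for a real word-segmentation model, and the region labels are assumed to come from an upstream detector):

```python
import re

def build_index_set(question, region_labels):
    """Step S1 sketch: segment the question into words and union them
    with the classified target-region labels to form the index set."""
    words = re.findall(r"[a-z]+", question.lower())
    return set(words) | set(region_labels)

index_set = build_index_set("What color is the dog?", ["dog", "grass"])
```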
3. The visual question-answering method based on dual-channel cognitive inference as claimed in claim 1, wherein:
in step S2, constructing the knowledge base related to the visual question-answering task comprises: constructing source nodes and target nodes based on the cross-modal symbolic representation with the part-of-speech relationship; calculating all shortest-path sets from the source nodes to the target nodes by the Dijkstra algorithm; and merging the shortest-path sets based on the commonality of the question-answering content, thereby constructing the knowledge base related to the visual question-answering task.
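The shortest-path computation named in this claim can be sketched with a standard Dijkstra implementation (the knowledge graph is modeled as an adjacency dict of (neighbor, weight) pairs; merging paths by commonality is omitted):

```python
import heapq

def dijkstra_path(graph, source, target):
    """Find one shortest path from source to target with Dijkstra's
    algorithm. `graph` maps node -> list of (neighbor, weight) pairs.
    Returns the node list, or None when target is unreachable."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    seen = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        if u == target:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    # Walk the predecessor chain back from the target
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]
```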
4. The visual question-answering method based on dual-channel cognitive inference as claimed in claim 1, wherein:
the various types of expression vectors at least comprise text modal vectors, image modal vectors and linear mapping vectors;
the calculation of the spatial characteristics in step S6 includes the sub-steps of:
step S6-1, calculating the linear mapping vector JoinR_k by jointly representing the text modal vector and the image modal vector, with the formula:
JoinR_k = f([R_k | S]), k ∈ Ω,
wherein f is a linear mapping function, R_k is the feature representation vector of the k-th target area on the input image, and S is the sentence feature vector calculated by the primary cognitive model;
step S6-2, calculating the relation matrix among the target areas by matrix multiplication, with the formula:
Relation = JoinR × JoinR^T;
step S6-3, converting the rectangular coordinates (x, y) of each target region into the corresponding polar coordinates (r, θ) as the physical spatial relationship of the target region:
r = √(x² + y²), θ = arctan(y / x).
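Steps S6-2 and S6-3 can be sketched as follows (the linear mapping f of step S6-1 is left abstract; `join_vectors` stands for the already-mapped JoinR rows, and the arctan form of θ is an assumption recovered from the lost equation image):

```python
import math

def relation_matrix(join_vectors):
    """Step S6-2 sketch: Relation = JoinR x JoinR^T via plain dot products."""
    return [[sum(a * b for a, b in zip(u, v)) for v in join_vectors]
            for u in join_vectors]

def polar(x, y):
    """Step S6-3 sketch: rectangular offset of a target region -> polar
    coordinates (r, theta) used as its physical spatial relation."""
    return math.hypot(x, y), math.atan2(y, x)
```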
5. The visual question-answering method based on dual-channel cognitive inference as claimed in claim 1, wherein:
wherein, step S7 further includes the following substeps:
step S7-1, calculating the overall similarity Sam_SF of the cross-modal task, with the expression:
Sam_SF = (S · F) / (‖S‖ ‖F‖),
wherein S is the question semantic vector and F is the representation vector of the image;
step S7-2, calculating the keyword feature representation vector EP_k from the jointly encoded word-segmentation vector EW by mean pooling, with the formula:
EP_k = (1/m) Σ_i ew_i,
wherein ew_i is the i-th sub-vector of the word-segmentation vector EW and m is the number of sub-vectors;
and step S7-3, calculating the relevance matrix based on the attention query mechanism.
6. The visual question-answering method based on dual-channel cognitive inference as claimed in claim 5, wherein:
wherein the relevance matrix includes a weight matrix of the question-answering task and the image target areas, and an attention score matrix of the keywords and the content of the image target areas, and step S7-3 further includes:
step A1, splicing the question representation vector and the image representation feature to form the task representation:
T=σ[S,F]
where σ is the activation function;
step A2, calculating the weight matrix W_tr and processing it by a smoothing method, with the formula:
W_tr = g(S, R);
step A3, calculating the linear mapping vectors K, Q and V of the keywords, with learnable parameters W_q and W_v:
K = R_r × W_tr,
Q = Key_r × W_q,
V = Key_r × W_v,
wherein K is the query vector of the image with respect to the keywords, Q is the query vector of the keywords with respect to the image target regions, and V is the mapped value vector of the keywords;
step A4, calculating each attention score matrix Att:
Att = softmax(Q × K^T / √n) × V;
step A5, calculating the degree of dispersion between each attention score matrix and the question-answering task, with the formula:
Score = (1/n) Σ_i (Att_i − T_i)²,
wherein n is the dimension of the vector;
step A6, based on the overall similarity Sam_SF, selecting the TopN items by Score.
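Steps A4 through A6 can be sketched as below (one query vector per keyword rather than full matrices; the √d scaling and the squared-difference dispersion are assumptions, since the original equation images are not recoverable):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a plain list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(q, keys):
    """Step A4 sketch: scaled dot-product attention of one keyword
    query against the image-region keys."""
    d = len(q)
    raw = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    return softmax(raw)

def dispersion(att, task):
    """Step A5 sketch: mean squared difference between the attention
    scores and the task representation (an assumed reading of the
    claimed 'degree of dispersion')."""
    n = len(att)
    return sum((a - t) ** 2 for a, t in zip(att, task)) / n

def top_n(items, scores, n):
    """Step A6 sketch: keep the TopN items ranked by score."""
    order = sorted(range(len(items)), key=lambda i: -scores[i])
    return [items[i] for i in order[:n]]
```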
7. The visual question-answering method based on dual-channel cognitive inference as claimed in claim 1, wherein:
wherein the recoding of the visual question-answering content in step S8 comprises:
calculating the similarity between the question-answering task representation vector and the relations or tail entities of the tuples in the question-answering task knowledge base, with the expression:
Sim = (1/n) Σ_i T_i E_i,
wherein T is the task representation vector and E is the relation or tail-entity vector,
where n is the vector dimension.
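A sketch of this similarity, assuming a dimension-normalised dot product (which matches the "n is the vector dimension" remark; the original equation image is not recoverable):

```python
def tuple_similarity(task_vec, entity_vec):
    """Claim 7 sketch: score a relation or tail-entity vector from the
    knowledge base against the task representation vector, normalised
    by the vector dimension n."""
    n = len(task_vec)
    return sum(a * b for a, b in zip(task_vec, entity_vec)) / n
```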
CN202210343042.XA 2022-04-02 2022-04-02 Cognitive-channel-based cognitive-inference visual question-answering method Pending CN114969459A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210343042.XA CN114969459A (en) 2022-04-02 2022-04-02 Cognitive-channel-based cognitive-inference visual question-answering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210343042.XA CN114969459A (en) 2022-04-02 2022-04-02 Cognitive-channel-based cognitive-inference visual question-answering method

Publications (1)

Publication Number Publication Date
CN114969459A true CN114969459A (en) 2022-08-30

Family

ID=82977657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210343042.XA Pending CN114969459A (en) 2022-04-02 2022-04-02 Cognitive-channel-based cognitive-inference visual question-answering method

Country Status (1)

Country Link
CN (1) CN114969459A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117892140A * 2024-03-15 2024-04-16 Inspur Electronic Information Industry Co., Ltd. Visual question and answer and model training method and device thereof, electronic equipment and storage medium
WO2024101929A1 * 2022-11-09 2024-05-16 Samsung Electronics Co., Ltd. Confidence-based interactable neural-symbolic visual question answering
CN117892140B * 2024-03-15 2024-05-31 Inspur Electronic Information Industry Co., Ltd. Visual question and answer and model training method and device thereof, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108509519B (en) General knowledge graph enhanced question-answer interaction system and method based on deep learning
CN111708873B (en) Intelligent question-answering method, intelligent question-answering device, computer equipment and storage medium
CN108563653B (en) Method and system for constructing knowledge acquisition model in knowledge graph
CN111475623B (en) Case Information Semantic Retrieval Method and Device Based on Knowledge Graph
CN112989005B (en) Knowledge graph common sense question-answering method and system based on staged query
CN108363743A (en) A kind of intelligence questions generation method, device and computer readable storage medium
CN114969459A (en) Cognitive-channel-based cognitive-inference visual question-answering method
CN113569023A (en) Chinese medicine question-answering system and method based on knowledge graph
CN112364132A (en) Similarity calculation model and system based on dependency syntax and method for building system
CN112328800A (en) System and method for automatically generating programming specification question answers
CN112417170B (en) Relationship linking method for incomplete knowledge graph
CN116541472B (en) Knowledge graph construction method in medical field
CN115525751A (en) Intelligent question-answering system and method based on knowledge graph
CN116244448A (en) Knowledge graph construction method, device and system based on multi-source data information
CN113240046A (en) Knowledge-based multi-mode information fusion method under visual question-answering task
CN114691891A (en) Knowledge graph-oriented question-answer reasoning method
CN117112765A (en) Domain robot question-answering system based on big data model and knowledge graph fusion
CN114491174A (en) Image-text matching method and system based on hierarchical feature aggregation
CN116737911A (en) Deep learning-based hypertension question-answering method and system
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN110674276A (en) Robot self-learning method, robot terminal, device and readable storage medium
CN114579705A (en) Learning auxiliary method and system for education of sustainable development
CN114372454B (en) Text information extraction method, model training method, device and storage medium
CN112084312B (en) Intelligent customer service system constructed based on knowledge graph
CN115186072A (en) Knowledge graph visual question-answering method based on double-process cognitive theory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination