CN114969459A - Cognitive-channel-based cognitive-inference visual question-answering method - Google Patents

Info

Publication number
CN114969459A
Authority
CN
China
Prior art keywords
question
cognitive
content
image
answering
Prior art date
Legal status
Pending
Application number
CN202210343042.XA
Other languages
Chinese (zh)
Inventor
张文强
张开磊
王昊奋
刘威辰
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202210343042.XA
Publication of CN114969459A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90332 Natural language query formulation or dialogue systems
    • G06F 16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F 16/3338 Query expansion
    • G06F 16/3344 Query execution using natural language analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/02 Knowledge representation; Symbolic representation
    • G06N 5/046 Forward inferencing; Production systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the fields of computer multi-modal information processing and cognitive science, in particular to a visual question-answering method based on cognitive inference over the dual channels of cognition. The method mainly comprises the following steps:
Step 1, construct the cognitive content: extract question keywords and the label content of the image target areas as indices for searching a large-scale knowledge base, and build a task knowledge base from the retrieved content.
Step 2, perform prior cognitive computation: train the visual-text content representation through a multi-modal pre-training model.
Step 3, establish the reasoning spatio-temporal features of the task: perform syntactic and part-of-speech analysis on the question to build a topological graph among the question keywords, compute the relevance of the image's internal regions using the visual representation content from step 2, and build the image-content spatial graph structure.
Step 4, lock the image content related to the question-answer content: from the jointly represented visual and text vectors of step 2, compute the image content attended to by each text vector and construct the question-answer-related image content.
Step 5, perform hierarchical cognitive reasoning: combine the knowledge content built in step 1 with the prior computation of step 2 to re-encode the question-answer content, construct a question-answer instruction set from the re-encoded representation content and the spatio-temporal features analyzed in step 3, and perform question-answer reasoning through the instructions to obtain the visual question-answering result.
The invention improves the accuracy of open-domain visual question-answering models and continuously corrects the cognitive understanding of the question-answer content with external knowledge during reasoning, making the visual question-answering process robust and interpretable.

Description

Cognitive-channel-based cognitive-inference visual question-answering method
Technical Field
The invention relates to the fields of computer multi-modal information processing and cognitive science, in particular to a visual question-answering method based on cognitive inference over the dual channels of cognition.
Background
In common visual question-answering tasks, good results have been obtained through the study of multi-modal information representation, but the reasoning aspect of the question-answering process is ignored, so the process lacks interpretability. Moreover, when a complex visual question-answering task is encountered, that is, when the question-answer relation exceeds the given conditions and the question can only be answered with prior knowledge or common-sense facts, visual question answering deviates greatly and produces irrelevant answers. In the dual-channel theory of cognitive science, the cognitive system of the human brain contains two systems: System 1 and System 2. System 1 is an intuitive system that finds answers quickly and simply by matching a person's relevant information. System 2 is an analytical system that finds answers through reasoning and logic. System 1 can quickly understand a task through its representation, and complex visual question-answering tasks are analyzed and reasoned about by System 2 on the basis of System 1's recognition, so that the computer's understanding of the task is retained while its reasoning process over the task is realized.
Representation-learning-based methods enable a computer to understand task content well, and information irrelevant to the task is filtered out through the joint representation of task content; however, the represented content is limited to the given conditions and cannot capture associations beyond that content. Retrieval-based methods can search large amounts of corpus information as evidence for question answering and overcome this limitation of representation, but retrieving corpora increases the computation cost, and introducing irrelevant corpora shifts the originally computed feature representation away from the correct answer, affecting the final result.
Disclosure of Invention
The invention aims to solve the above problems by providing a visual question-answering algorithm based on cognitive inference over the dual channels of cognition, which addresses open-domain visual question answering, dynamically learns question-answering patterns, and improves the accuracy and interpretability of question answering. The following technical scheme is adopted:
the invention provides a visual question-answering method based on cognitive inference of cognitive channels, which is characterized by comprising the following steps of: step S1, extracting question keywords and image target area labels, and constructing an index set based on the keywords and the target area labels; step S2, based on the index set, retrieving and constructing a knowledge base related to the visual question-answering task, and acquiring cognitive content; step S3, calculating and training a cross-modal representation model of a character mode and an image mode through a multi-modal pre-training cross-modal model; step S4, fine-tuning the cross-modal representation model in the visual question-answering task to obtain a primary cognitive model, and calculating various types of representation vectors by using the cognitive model; step S5, constructing a topological structure among the question keywords by performing syntactic analysis and part-of-speech analysis on the question, and updating the topological structure by using the parts-of-speech of the question keywords so as to obtain question reasoning time characteristics; step S6, calculating the image target area space characteristic based on the visual question-answering task; step S7, acquiring a semantic matching relevance matrix of the image target region based on the image target region space characteristics, and constructing a cross-modal task relevance graph structure; step S8, recoding the visual question-answering content based on the cognitive content and the various expression vectors, and acquiring recoded expression content; step S9, constructing a question-answer instruction set based on the recoded representation content, the question reasoning time characteristic and the image target area space characteristic, and carrying out question-answer reasoning through the question-answer instruction set so as to obtain the visual question-answer result.
The visual question-answering method based on cognitive inference over the dual channels of cognition provided by the invention can also have the technical feature that, in step S1, the index set is constructed by performing word segmentation on the input question to obtain question words and phrases, inputting the corresponding image for the question, dividing the image into target areas, and classifying the target areas.
The visual question-answering method based on cognitive inference over the dual channels of cognition provided by the invention can also have the technical feature that, in step S2, the knowledge base related to the visual question-answering task is constructed by building source and target nodes from the cross-modal symbolic representation of the relations between parts of speech, computing all shortest-path sets from the source nodes to the target nodes with a shortest-path algorithm, and merging the shortest-path sets based on the commonality of the question-answer content.
The visual question-answering method based on cognitive inference over the dual channels of cognition provided by the invention can also have the technical feature that the various representation vectors at least comprise text-modality vectors, image-modality vectors, and linear mapping vectors, and the computation of the spatial features in step S6 comprises the following sub-steps:
Step S6-1: compute the linear mapping vector by jointly representing the text-modality vector and the image-modality vector:

JoinR_k = f([R_k | S]), k ∈ Ω

where f is a linear mapping function, R_k is the feature representation vector of the k-th target area on the input image, and S is the sentence feature vector computed by the primary cognitive model.
Step S6-2: compute the correlation matrix over the target areas by matrix multiplication:

Relation = JoinR × JoinR^T

Step S6-3: from the rectangular coordinates of the target area, compute the corresponding polar coordinates (r, θ) as the physical spatial relation:

r = √(x² + y²), θ = arctan(y / x)
the visual question-answering method based on cognitive inference of cognitive binary channels provided by the invention can also have the technical characteristics that the step S7 further comprises the following substeps: step S7-1, calculating the overall similarity Sam of the cross-modal task SF The expression is as follows:
Figure BDA0003580059130000042
Figure BDA0003580059130000043
wherein S is a problem semantic vector, and F is an image i representation vector; step S7-2, calculating the keyword feature expression vector EP by using a mean pooling method k The formula is as follows:
Figure BDA0003580059130000044
and step S7-3, calculating the relevancy matrix based on the attention inquiry mechanism.
The visual question-answering method based on cognitive inference over the dual channels of cognition provided by the invention can also have the technical feature that the relevance matrix comprises a weight matrix between the question-answering task and the image target areas and an attention-score matrix between the keywords and the image target-area content, and step S7-3 further comprises:
Step A1: concatenate the question representation vector and the image representation features as the task representation:

T = σ[S, F]

where σ is the activation function.
Step A2: compute the weight matrix and smooth it:

W_tr = g(S, R)

Step A3: compute the keyword linear mapping vectors K, Q, V with the learnable parameters W_q, W_v:

K = R_r × W_tr, Q = Key_r × W_q, V = Key_r × W_v

where K is the image's query vector for the keywords, Q is the keywords' query vector for the image target areas, and V is the keywords' mapped value vector.
Step A4: compute each attention score matrix Att:

Att = softmax(Q K^T / √n) V

Step A5: compute the dispersion between each attention score matrix and the question-answering task:

Score = (1/n) Σ_i (Att_i − T_i)²

where n is the dimension of the vector.
Step A6: based on the overall similarity Sam_SF, select TopN(Score).
The visual question-answering method based on cognitive inference over the dual channels of cognition provided by the invention can also have the technical feature that the re-encoding of the visual question-answering content in step S8 comprises computing the similarity between the question-answering task representation vector and the relation or tail entity of each tuple in the question-answering task knowledge base:

Sim = (1/n) Σ_{i=1}^{n} T_i · E_i

where n is the vector dimension and E is the vector of the tuple's relation or tail entity.
Action and Effect of the invention
According to the visual question-answering method based on cognitive inference over the dual channels of cognition, a novel mechanism, a hierarchical cognitive question-answering algorithm, is provided. Pre-training gives the model an understanding of the task and constructs System 1, the intuitive part of the cognitive system, and fine-tuning the task model lets it adapt to new tasks. The task is then split to obtain its symbolic language, and a task-related knowledge base is constructed by retrieval to supplement the task content and strengthen the intuition of the cognitive system. Finally, the cognitive System 2 is constructed; it comprises two stages that re-understand the task. The intuitive understanding of System 1 already solves part of the question-answering content, but for non-intuitive questions the task representation must be redefined on the basis of System 1: under System 1's conditions, the degree of association between the task representation and the corpus evidence is computed, the task-related content is computed, and the content representation is constrained. The reasoning content of the two stages is obtained from the content of the cognitive knowledge base, a knowledge graph representing the task is constructed, and the relations between layers are inferred through the constructed spatio-temporal sequence of each knowledge graph to obtain the final answer. The invention greatly improves the accuracy of question-answering results and continuously corrects the cognitive understanding of the question-answer content with external knowledge during reasoning, making the visual question-answering process robust and interpretable.
Drawings
FIG. 1 is a schematic flow chart of a visual question-answering method based on cognitive inference of cognitive dual channels in an embodiment of the present invention;
Detailed Description
In order to make the technical means, creative features, objectives, and effects of the present invention easy to understand, the following describes the visual question-answering method based on cognitive inference over the dual channels of cognition in detail with reference to the embodiments and the accompanying drawings.
< example >
In the embodiment, two systems are constructed: an intuition system and a cognition system. The intuition system quickly identifies, understands, and analyzes complex visual question-answering tasks; the cognition system, built on the intuition system, retains the computer's understanding of the question-answering task through analysis and reasoning, thereby realizing the computer's reasoning process over the task.
Fig. 1 is a schematic flow chart of the visual question-answering method based on cognitive inference over the dual channels of cognition in the embodiment of the present invention.
As shown in Fig. 1, the visual question-answering method based on cognitive inference over the dual channels of cognition mainly comprises the following steps:
step S1, extracting question keywords and image target area labels, and constructing an index set based on the keywords and the target area labels;
in this embodiment, in step S1, the index set is constructed by performing word segmentation on the input question to obtain a question word and a question phrase, inputting a corresponding image for the input question, performing the target area division on the image, and classifying the target area. Specifically, the index set construction step is as follows: performing word segmentation processing on the question and question Q to obtain a question word W k The phrase P k In which P is k Is greater than W k (ii) a For the image I corresponding to the problem, the image I is divided into a plurality of target frames O through target detection k Calculating the corresponding classification Tag of each target k (ii) a Respectively labeling, classifying and deleting stop words in the sets W, P and Tag by using a part-of-speech labeling tool to construct a keyword index set K;
step S2, retrieving and constructing a knowledge base related to the visual question-answering task based on the index set, and acquiring cognitive content;
in this embodiment, the step S2 of constructing the knowledge base related to the visual question and answer task is to construct a source node and a target node based on the symbolic representation of cross-modal and the relationship between parts of speech, then use an algorithm to find all shortest path sets from the source node to the target node, and then combine the shortest path sets based on the commonality of the question and answer content, thereby constructing the knowledge base related to the visual question and answer task. The method comprises the following specific steps: constructing a source node and a target node based on symbolic representations of relations and modes existing between parts of speech: commonality of question-answer content, merging
Figure BDA0003580059130000071
Constructing a task knowledge base TKG, and dividing the knowledge base TKG into a word level knowledge base KG according to the structure of the composition index w Knowledge base KG for text structure s Cross-modal knowledge base KG c
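The shortest-path enumeration between source and target nodes can be sketched with a breadth-first search that records every equally short parent; the toy graph below is an illustrative assumption, not content from the patent:

```python
from collections import deque

# Hypothetical sketch of step S2: enumerate all shortest paths from a source
# node to a target node in a small symbolic knowledge graph.
def all_shortest_paths(graph, src, dst):
    """BFS that records every parent lying on a shortest path, then backtracks."""
    dist, parents = {src: 0}, {src: []}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in dist:                 # first visit: shortest distance found
                dist[v] = dist[u] + 1
                parents[v] = [u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:      # another equally short parent
                parents[v].append(u)

    def backtrack(node):
        if node == src:
            return [[src]]
        return [p + [node] for par in parents.get(node, []) for p in backtrack(par)]

    return backtrack(dst) if dst in dist else []

graph = {"dog": ["animal", "pet"], "animal": ["mammal"], "pet": ["mammal"]}
paths = all_shortest_paths(graph, "dog", "mammal")
```

Merging the resulting path sets across question keywords would then yield the task knowledge base TKG.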
Step S3, calculating and training a cross-modal representation model of a character mode and an image mode through a multi-modal pre-training cross-modal model;
step S4, fine-tuning the cross-modal representation model in the visual question-answering task to obtain a primary cognitive model, and calculating various representation vectors by utilizing the cognitive model;
in this embodiment, the specific steps are as follows: firstly, a pre-trained pre-training model is used for calculating a cross-modal expression model of a character mode and an image mode; and then, according to the obtained model fine tuning task, obtaining a primary cognitive model, and finally, calculating a cross-modal representation vector of the content of the primary cognitive stage.
Step S5, constructing a topological structure among the question keywords by performing syntactic analysis and part-of-speech analysis on the question, and updating the topological structure by using the parts-of-speech of the question keywords so as to obtain question reasoning time characteristics;
step S6, calculating the image target area space characteristic based on the visual question-answering task;
in this embodiment, step S6 includes the following sub-steps: step S6-1, calculating the linear mapping vector by jointly representing the text modality vector and the image modality vector; step S6-2, calculating a correlation matrix in the target area through matrix multiplication; and step S6-3, calculating corresponding polar coordinates (r, theta) as the physical space relation of the target area according to the rectangular coordinate system of the target area. Specifically, a topological structure of question-answer reasoning is constructed through syntax information of question sentences, the part of speech is used as node label content, the topological structure is updated to be used as the time characteristic of question reasoning, and finally the space characteristic in the image area is calculated under the condition of a question-answer task. The specific steps for calculating the spatial characteristics in the image area are as follows: using a sentence feature vector representation S calculated by a primary cognitive model and a feature representation set R of an image target area; the vectors representing the text mode and the image mode are combined, and the linear mapping vector is obtained:
JoinR k =f([R k |S]),k∈Ω
wherein the function f is a linear mapping function,R k is a feature vector representation of the kth target region on the input image; calculating a correlation matrix in the target area by multiplication of the matrix:
Relation=JoinR×JoinR T
according to the rectangular coordinate system of the target frame, calculating corresponding polar coordinates (r, theta) as the physical space relation:
Figure BDA0003580059130000091
Figure BDA0003580059130000092
wherein: x, y are respectively the object O k The abscissa and ordinate of a rectangular coordinate system in the image;
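These three computations can be sketched in NumPy; the dimensions and the random stand-in for the linear map f are illustrative assumptions:

```python
import numpy as np

# Hypothetical sketch of step S6: joint text-image mapping JoinR_k = f([R_k|S]),
# the region correlation matrix Relation = JoinR x JoinR^T, and polar coordinates
# of region centers. All sizes and weights are toy stand-ins.
rng = np.random.default_rng(0)
K, d_img, d_txt, d_join = 4, 8, 8, 6             # 4 regions, toy dimensions
R = rng.normal(size=(K, d_img))                  # region feature vectors R_k
S = rng.normal(size=(d_txt,))                    # sentence feature vector
W_f = rng.normal(size=(d_img + d_txt, d_join))   # stand-in for the linear map f

join_r = np.concatenate([R, np.tile(S, (K, 1))], axis=1) @ W_f
relation = join_r @ join_r.T                     # symmetric K x K correlation matrix

centers = np.array([[3.0, 4.0], [1.0, 0.0]])     # example region centers (x, y)
r = np.hypot(centers[:, 0], centers[:, 1])       # r = sqrt(x^2 + y^2)
theta = np.arctan2(centers[:, 1], centers[:, 0]) # theta = atan2(y, x)
```

Using `arctan2` rather than `arctan(y/x)` avoids the quadrant ambiguity and division by zero when x = 0.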
step S7, acquiring a semantic matching relevance matrix of the image target region based on the image target region space characteristics, and constructing a cross-modal task relevance graph structure;
in this embodiment, step S7 includes the following sub-steps: step S7-1, calculating the overall similarity expression of the cross-modal task; step S7-2, calculating the keyword feature expression vector by using a mean pooling method; step S7-3, based on the inquiry mechanism of attention, the correlation matrix is calculated; wherein the step S7-3 further includes the following sub-steps: step A1, splicing the question expression vector and the image expression characteristic for task expression; step A2, calculating a weight matrix of the relevancy matrix including a question answering task and the image target area and processing the weight matrix by a smoothing method; step A3, calculating the linear mapping vector K, Q, V of the keyword, and setting a learning parameter W q 、W v (ii) a Step A4, calculating an attention degree score matrix Att of each keyword and the content of the image target area; step A5, calculating the discrete degree of each attention degree score matrix and the question-answering task; step A6, based on the overall similarity Sam SF Selecting TopN (Scorre). Specifically, based on the correlation matrix calculated in the previous step, TopN (JoinR) is selected k ) Selecting 8 target areas with the highest association degree of each target area, wherein N is 8; the specific steps for constructing the target content lock between the cross-modalities are as follows: firstly, in the process of calculating the initial cognitive coding, the overall similarity of the cross-modal tasks needs to be calculated:
Figure BDA0003580059130000101
wherein S is a question Q semantic vector, and F is an image I representation vector;
Secondly, from the jointly encoded word-segmentation vectors EW, the vector representation of the related phrase is computed by mean pooling:

EP_k = (1 / |P_k|) Σ_{i ∈ P_k} EW_i

where EP_k is the feature representation vector of the keyword;
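A one-line sketch of this mean pooling, with illustrative token vectors and phrase indices:

```python
import numpy as np

# Hypothetical sketch of step S7-2: mean-pool the word vectors EW_i that belong
# to a phrase P_k into a single keyword vector EP_k. Indices are illustrative.
EW = np.arange(12.0).reshape(4, 3)   # 4 token vectors of dimension 3
phrase_tokens = [1, 2]               # indices of the tokens forming phrase P_k
EP_k = EW[phrase_tokens].mean(axis=0)
```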
Then the semantic-matching relevance matrix is computed with an attention query mechanism, as follows. Task joint representation: the sentence representation vector is concatenated with the image representation features as the task representation:

T = σ[S, F]

where S is the question semantic vector, F is the representation vector of the image, and σ is an activation function. The weight matrix between the task and the image areas is computed and smoothed:

W_tr = g(S, R)

where S is the question semantic vector, R is the vector representation of the target areas in the image, and g is a similarity function. The keyword linear mapping vectors K, Q, V are computed with the learnable parameters W_q, W_v:

K = R_r × W_tr
Q = Key_r × W_q
V = Key_r × W_v

where K is the image's query vector for the keywords, Q is the keywords' query vector for the image target areas, and V is the keywords' mapped value vector. The attention score matrix Att between each keyword and the image target-area content is computed:

Att = softmax(Q K^T / √n) V

The dispersion between each attention score Att and the task is computed:

Score = (1/n) Σ_i (Att_i − T_i)²

where n is the dimension of the vector. Finally, based on the overall similarity Sam_SF of the joint representation, TopN(Score) is selected, with N = 2 being the number of image target areas associated with the keywords;
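The attention query and Top-N selection can be sketched as scaled dot-product attention followed by a dispersion score; all matrices and the dispersion definition below are illustrative stand-ins, not the patent's trained parameters:

```python
import numpy as np

# Hypothetical sketch of step S7-3: scaled dot-product attention between keyword
# and region vectors, a dispersion score against the task vector, and TopN.
def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
n = 6                                  # shared vector dimension
keys = rng.normal(size=(5, n))         # K: region-side vectors
queries = rng.normal(size=(3, n))      # Q: keyword-side vectors
values = rng.normal(size=(5, n))       # V: mapped keyword value vectors

att = softmax(queries @ keys.T / np.sqrt(n)) @ values  # Att = softmax(QK^T/sqrt(n))V

task = rng.normal(size=(n,))                 # task representation T
score = ((att - task) ** 2).mean(axis=1)     # dispersion of each Att row vs. T
top2 = np.argsort(score)[:2]                 # TopN(Score) with N = 2
```

The √n scaling keeps the pre-softmax logits in a numerically stable range as the dimension grows.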
step S8, recoding the visual question-answer content based on the cognitive content and the various types of expression vectors, and acquiring recoded expression content;
step S9, constructing a question-answer instruction set based on the recoded representation content, the question reasoning time characteristic and the image target area space characteristic, and carrying out question-answer reasoning through the question-answer instruction set so as to obtain the visual question-answer result.
In this embodiment, the re-encoded representation content is obtained as follows. First, the constructed task knowledge base KG_w is encoded as the vector representation EKG_w; then the similarity between the task representation vector T and the relation or tail entity of each tuple in EKG_w is computed, where n is the vector dimension:

Sim = (1/n) Σ_{i=1}^{n} T_i · E_i

Task-irrelevant tuples are then eliminated to obtain the task subgraph SubKG_w of KG_w. Finally, using the Key index and the associated graph content nodes, EW is updated to EW' through a graph attention network, and the mean-pooled EW' serves as the text representation S_g of the task-associated graph knowledge;
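The tuple-filtering part of step S8 can be sketched as scoring each knowledge-base tuple against the task vector and keeping only the relevant ones; the similarity form and the zero threshold are illustrative assumptions:

```python
import numpy as np

# Hypothetical sketch of step S8: score knowledge-base tuples against the task
# representation and keep the task-relevant subgraph SubKG_w.
def tuple_similarity(task_vec, tuple_vec):
    n = task_vec.shape[0]
    return float(task_vec @ tuple_vec) / n   # Sim = (1/n) * sum_i T_i * E_i

task = np.array([1.0, 0.0, 1.0, 0.0])
tuples = {
    "(dog, is_a, mammal)": np.array([1.0, 0.0, 1.0, 0.0]),
    "(sofa, made_of, wood)": np.array([-1.0, 0.0, -1.0, 0.0]),
}
subgraph = {k: v for k, v in tuples.items() if tuple_similarity(task, v) > 0}
```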
The reasoning instruction D is constructed from the reasoning spatio-temporal features built from the question and answer, as follows: construct a temporal reasoning route TimeRout from the topological and inverse-topological structure of the graph, and for each time node construct the spatial graph of that time point.
The reasoning instruction D is then computed from the time points of the reasoning route as follows: construct the joint vector of the time-connected node and the task representation,

join_k = [T_k | T]

where T_k is the vector representation of time node k; compute the Gram matrix of the joint vectors,

Gram = join × join^T

and determine the inference sets D+ and D− according to the magnitude and sign of the corresponding Gram values.
Reasoning is performed with the instructions and the spatio-temporal graph structure constructed in the previous steps, and the question-answer result is obtained as follows. First, starting from the text representation S_g at time point T_0, forward reasoning follows D+_0 and the state vector Sta of T_0 is retained. Second, T_0 is computed according to the D+_0 instruction and the representation vector h_0 at time T_0 is updated through the graph attention network; at time T_k, the representation vector h_k is updated from the previous state quantity Sta of T_{k−1} and the D+_k instruction, and h_k is further updated to h_k' according to the D−_k instruction. Finally, after one round of reasoning, the similarity score between the task representation state quantity h and each spatial-graph node updated at each time point is computed, and the maximum is taken as the answer of the visual question answering.
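The skeleton of this instruction construction, topological ordering of time nodes, joint vectors with the task representation, a Gram matrix, and a sign-based split into D+ and D−, can be sketched as follows; every quantity is an illustrative stand-in:

```python
import numpy as np
from graphlib import TopologicalSorter

# Hypothetical sketch of step S9: order the time nodes topologically, form joint
# vectors [T_k | T], compute their Gram matrix, and split instructions into
# D+ / D- by the sign of the off-diagonal Gram entries.
edges = {"t1": {"t0"}, "t2": {"t1"}}            # t0 -> t1 -> t2 (predecessor map)
time_route = list(TopologicalSorter(edges).static_order())

rng = np.random.default_rng(2)
task = rng.normal(size=4)                       # task representation T
node_vecs = {t: rng.normal(size=4) for t in time_route}
joins = np.stack([np.concatenate([node_vecs[t], task]) for t in time_route])

gram = joins @ joins.T                          # Gram matrix of the joint vectors
off_diag = gram[np.triu_indices(len(time_route), k=1)]
d_plus = [v for v in off_diag if v > 0]         # positive-entry instructions D+
d_minus = [v for v in off_diag if v <= 0]       # non-positive instructions D-
```

Reversing `time_route` gives the inverse-topological pass; the graph-attention update of h_k at each time point would consume the D+ / D− instructions in this order.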
Effects and benefits of the embodiments
According to the visual question-answering method based on dual-channel cognitive inference of the embodiments, a hierarchical cognitive question-answering algorithm is provided. The model first acquires an understanding of the task through pre-training, constructing the intuitive System 1 of the cognitive system, and adapts to new tasks by fine-tuning the task model. Then, the task is decomposed to obtain its symbolic language, and a task-related knowledge base is constructed by retrieval to supplement the task content and improve the intuition of the cognitive system. Finally, the deliberative System 2 is constructed, comprising two stages for re-understanding the task: for questions that System 1 can answer intuitively, part of the question-answering content is already solved, but for non-intuitive questions, the task representation is redefined on the basis of System 1. Conditioned on System 1, the degree of association between the task representation and the corpus evidence is calculated, the task-related content is computed, and the content representation is constrained; the two stages of reasoning content are obtained from the cognitive knowledge base, a knowledge graph representing the task is constructed, and inter-layer relations are inferred through the constructed spatio-temporal sequence of each knowledge graph, thereby obtaining the final answer. The invention greatly improves the accuracy of the question-answering results and continuously corrects the cognitive understanding of the question-answering content according to external knowledge during reasoning, making the visual question-answering process robust and interpretable.
The above-described embodiments are merely illustrative of specific embodiments of the present invention, and the present invention is not limited to the description of the above-described embodiments.

Claims (7)

1. A visual question-answering method based on dual-channel cognitive inference, used for solving open-domain visual question answering, dynamically learning the question-answering mode, and improving the accuracy and interpretability of the question answering, characterized by comprising the following steps:
step S1, extracting question keywords and image target area labels, and constructing an index set based on the keywords and the target area labels;
step S2, based on the index set, retrieving and constructing a knowledge base related to the visual question-answering task, and acquiring cognitive content;
step S3, calculating and training a cross-modal representation model of a character mode and an image mode through a multi-modal pre-training cross-modal model;
step S4, fine-tuning the cross-modal representation model in the visual question-answering task to obtain a primary cognitive model, and calculating various types of representation vectors by using the cognitive model;
step S5, constructing a topological structure among the question keywords by performing syntactic analysis and part-of-speech analysis on the question, and updating the topological structure by using the parts-of-speech of the question keywords so as to obtain question reasoning time characteristics;
step S6, calculating the image target area space characteristic based on the visual question-answering task;
step S7, acquiring a semantic matching relevance matrix of the image target region based on the image target region space characteristics, and constructing a cross-modal task relevance graph structure;
step S8, recoding the visual question-answer content based on the cognitive content and the various types of expression vectors, and acquiring recoded expression content;
step S9, constructing a question-answer instruction set based on the recoded representation content, the question reasoning time characteristic and the image target area space characteristic, and carrying out question-answer reasoning through the question-answer instruction set so as to obtain the visual question-answer result.
2. The visual question-answering method based on dual-channel cognitive inference as claimed in claim 1, wherein:
in step S1, the index set is constructed by performing word segmentation on the input question to obtain question words and question phrases, inputting the image corresponding to the question, dividing the image into target areas, and classifying each target area.
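A minimal sketch of this index-set construction (a toy regex word splitter stands in for a real word-segmentation model, and the region labels are assumed to come from an upstream detector):

```python
import re

def build_index_set(question, region_labels):
    """Step S1 sketch: segment the question into words and union them
    with the classified target-region labels to form the index set."""
    words = re.findall(r"[a-z]+", question.lower())
    return set(words) | set(region_labels)

index_set = build_index_set("What color is the dog?", ["dog", "grass"])
```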
3. The visual question-answering method based on dual-channel cognitive inference as claimed in claim 1, wherein:
in step S2, constructing the knowledge base related to the visual question-answering task comprises: constructing source nodes and target nodes based on the cross-modal symbolic representation with the part-of-speech relationship; calculating all shortest-path sets from the source nodes to the target nodes by the Dijkstra algorithm; and merging the shortest-path sets based on the commonality of the question-answering content, thereby constructing the knowledge base related to the visual question-answering task.
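The shortest-path computation named in this claim can be sketched with a standard Dijkstra implementation (the knowledge graph is modeled as an adjacency dict of (neighbor, weight) pairs; merging paths by commonality is omitted):

```python
import heapq

def dijkstra_path(graph, source, target):
    """Find one shortest path from source to target with Dijkstra's
    algorithm. `graph` maps node -> list of (neighbor, weight) pairs.
    Returns the node list, or None when target is unreachable."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    seen = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in seen:
            continue
        seen.add(u)
        if u == target:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    # Walk the predecessor chain back from the target
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]
```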
4. The visual question-answering method based on dual-channel cognitive inference as claimed in claim 1, wherein:
the various types of expression vectors at least comprise text modal vectors, image modal vectors and linear mapping vectors;
the calculation of the spatial characteristics in step S6 includes the sub-steps of:
step S6-1, calculating the linear mapping vector JoinR_k by jointly representing the text modal vector and the image modal vector, with the formula:
JoinR_k = f([R_k | S]), k ∈ Ω,
wherein f is a linear mapping function, R_k is the feature representation vector of the k-th target area on the input image, and S is the sentence feature vector calculated by the primary cognitive model;
step S6-2, calculating the relation matrix among the target areas by matrix multiplication, with the formula:
Relation = JoinR × JoinR^T;
step S6-3, converting the rectangular coordinates (x, y) of each target region into the corresponding polar coordinates (r, θ) as the physical spatial relationship of the target region:
r = √(x² + y²), θ = arctan(y / x).
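Steps S6-2 and S6-3 can be sketched as follows (the linear mapping f of step S6-1 is left abstract; `join_vectors` stands for the already-mapped JoinR rows, and the arctan form of θ is an assumption recovered from the lost equation image):

```python
import math

def relation_matrix(join_vectors):
    """Step S6-2 sketch: Relation = JoinR x JoinR^T via plain dot products."""
    return [[sum(a * b for a, b in zip(u, v)) for v in join_vectors]
            for u in join_vectors]

def polar(x, y):
    """Step S6-3 sketch: rectangular offset of a target region -> polar
    coordinates (r, theta) used as its physical spatial relation."""
    return math.hypot(x, y), math.atan2(y, x)
```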
5. The visual question-answering method based on dual-channel cognitive inference as claimed in claim 1, wherein:
wherein, step S7 further includes the following substeps:
step S7-1, calculating the overall similarity Sam_SF of the cross-modal task, with the expression:
Sam_SF = (S · F) / (‖S‖ ‖F‖),
wherein S is the question semantic vector and F is the representation vector of the image;
step S7-2, calculating the keyword feature representation vector EP_k from the jointly encoded word-segmentation vector EW by mean pooling, with the formula:
EP_k = (1/m) Σ_i ew_i,
wherein ew_i is the i-th sub-vector of the word-segmentation vector EW and m is the number of sub-vectors;
and step S7-3, calculating the relevance matrix based on the attention query mechanism.
6. The visual question-answering method based on dual-channel cognitive inference as claimed in claim 5, wherein:
wherein the relevance matrix includes a weight matrix of the question-answering task and the image target areas, and an attention score matrix of the keywords and the content of the image target areas, and step S7-3 further includes:
step A1, splicing the question representation vector and the image representation feature to form the task representation:
T=σ[S,F]
where σ is the activation function;
step A2, calculating the weight matrix W_tr and processing it by a smoothing method, with the formula:
W_tr = g(S, R);
step A3, calculating the linear mapping vectors K, Q and V of the keywords, with learnable parameters W_q and W_v:
K = R_r × W_tr,
Q = Key_r × W_q,
V = Key_r × W_v,
wherein K is the query vector of the image with respect to the keywords, Q is the query vector of the keywords with respect to the image target regions, and V is the mapped value vector of the keywords;
step A4, calculating each attention score matrix Att:
Att = softmax(Q × K^T / √n) × V;
step A5, calculating the degree of dispersion between each attention score matrix and the question-answering task, with the formula:
Score = (1/n) Σ_i (Att_i − T_i)²,
wherein n is the dimension of the vector;
step A6, based on the overall similarity Sam_SF, selecting the TopN items by Score.
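Steps A4 through A6 can be sketched as below (one query vector per keyword rather than full matrices; the √d scaling and the squared-difference dispersion are assumptions, since the original equation images are not recoverable):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a plain list."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_scores(q, keys):
    """Step A4 sketch: scaled dot-product attention of one keyword
    query against the image-region keys."""
    d = len(q)
    raw = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    return softmax(raw)

def dispersion(att, task):
    """Step A5 sketch: mean squared difference between the attention
    scores and the task representation (an assumed reading of the
    claimed 'degree of dispersion')."""
    n = len(att)
    return sum((a - t) ** 2 for a, t in zip(att, task)) / n

def top_n(items, scores, n):
    """Step A6 sketch: keep the TopN items ranked by score."""
    order = sorted(range(len(items)), key=lambda i: -scores[i])
    return [items[i] for i in order[:n]]
```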
7. The visual question-answering method based on dual-channel cognitive inference as claimed in claim 1, wherein:
wherein the recoding of the visual question-answering content in step S8 comprises:
calculating the similarity between the question-answering task representation vector and the relations or tail entities of the tuples in the question-answering task knowledge base, with the expression:
Sim = (1/n) Σ_i T_i E_i,
wherein T is the task representation vector and E is the relation or tail-entity vector,
where n is the vector dimension.
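A sketch of this similarity, assuming a dimension-normalised dot product (which matches the "n is the vector dimension" remark; the original equation image is not recoverable):

```python
def tuple_similarity(task_vec, entity_vec):
    """Claim 7 sketch: score a relation or tail-entity vector from the
    knowledge base against the task representation vector, normalised
    by the vector dimension n."""
    n = len(task_vec)
    return sum(a * b for a, b in zip(task_vec, entity_vec)) / n
```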
CN202210343042.XA 2022-04-02 2022-04-02 Cognitive-channel-based cognitive-inference visual question-answering method Pending CN114969459A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210343042.XA CN114969459A (en) 2022-04-02 2022-04-02 Cognitive-channel-based cognitive-inference visual question-answering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210343042.XA CN114969459A (en) 2022-04-02 2022-04-02 Cognitive-channel-based cognitive-inference visual question-answering method

Publications (1)

Publication Number Publication Date
CN114969459A true CN114969459A (en) 2022-08-30

Family

ID=82977657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210343042.XA Pending CN114969459A (en) 2022-04-02 2022-04-02 Cognitive-channel-based cognitive-inference visual question-answering method

Country Status (1)

Country Link
CN (1) CN114969459A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117892140A * 2024-03-15 2024-04-16 Inspur Electronic Information Industry Co., Ltd. Visual question and answer and model training method and device thereof, electronic equipment and storage medium
WO2024101929A1 * 2022-11-09 2024-05-16 Samsung Electronics Co., Ltd. Confidence-based interactable neural-symbolic visual question answering
CN117892140B * 2024-03-15 2024-05-31 Inspur Electronic Information Industry Co., Ltd. Visual question and answer and model training method and device thereof, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN108509519B (en) General knowledge graph enhanced question-answer interaction system and method based on deep learning
CN111708873B (en) Intelligent question-answering method, intelligent question-answering device, computer equipment and storage medium
CN108563653B (en) Method and system for constructing knowledge acquisition model in knowledge graph
CN111475623B (en) Case Information Semantic Retrieval Method and Device Based on Knowledge Graph
CN112989005B (en) Knowledge graph common sense question-answering method and system based on staged query
CN108363743A (en) A kind of intelligence questions generation method, device and computer readable storage medium
CN114969459A (en) Cognitive-channel-based cognitive-inference visual question-answering method
CN113569023A (en) Chinese medicine question-answering system and method based on knowledge graph
CN112364132A (en) Similarity calculation model and system based on dependency syntax and method for building system
CN112328800A (en) System and method for automatically generating programming specification question answers
CN112417170B (en) Relationship linking method for incomplete knowledge graph
CN116541472B (en) Knowledge graph construction method in medical field
CN115525751A (en) Intelligent question-answering system and method based on knowledge graph
CN116244448A (en) Knowledge graph construction method, device and system based on multi-source data information
CN113240046A (en) Knowledge-based multi-mode information fusion method under visual question-answering task
CN114691891A (en) Knowledge graph-oriented question-answer reasoning method
CN117112765A (en) Domain robot question-answering system based on big data model and knowledge graph fusion
CN114491174A (en) Image-text matching method and system based on hierarchical feature aggregation
CN116737911A (en) Deep learning-based hypertension question-answering method and system
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN110674276A (en) Robot self-learning method, robot terminal, device and readable storage medium
CN114579705A (en) Learning auxiliary method and system for education of sustainable development
CN114372454B (en) Text information extraction method, model training method, device and storage medium
CN112084312B (en) Intelligent customer service system constructed based on knowledge graph
CN115186072A (en) Knowledge graph visual question-answering method based on double-process cognitive theory

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination