CN114661874B - Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels - Google Patents
- Publication number: CN114661874B
- Authority: CN (China)
- Legal status: Active (an assumption by Google Patents, not a legal conclusion)
Classifications
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F18/2415—Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
- G06F40/35—Discourse or dialogue representation
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06Q10/06393—Score-carding, benchmarking or key performance indicator [KPI] analysis
Abstract
The invention belongs to the technical field of cross-modal tasks, combining computer vision and natural language processing. The technical scheme is a visual question-answering method based on multi-angle semantic understanding and adaptive dual channels, comprising the following steps. Step 1: preprocess the input image, and extract the visual features and geometric features of the salient regions in the input image with an object detection module. Step 2: for question-text embedding, split sentences into words using spaces and punctuation marks (a numeric or number-based token is also treated as one word); represent each word as a vector with a pre-trained word-vector model; finally, pass the word-vector sequence through a long short-term memory (LSTM) network and take the state of the last time step as the question feature. The method makes the trained model robust, generalizes better to more complex visual scenes, improves the semantic quality of the answers, and improves the accuracy of the visual question-answering model.
Description
Technical Field
The invention belongs to the technical field of cross-modal tasks combining computer vision and natural language processing, and particularly relates to a visual question-answering method based on multi-angle semantic understanding and adaptive dual channels.
Background
Visual question answering is a task that requires simultaneous understanding of visual content, semantic information, and cross-modal relationships. In the past, a great deal of work developed single-modality models within machine vision or natural language processing alone, with profound influence. By combining these two fields, visual question answering, as one branch of the cross-modal field, has great potential impact on wide applications such as visual navigation and remote monitoring.
Currently, various image algorithms have been applied to visual question answering with excellent performance. Mainstream methods fall roughly into two categories: algorithms based on multi-modal fusion and algorithms based on attention mechanisms. Multi-modal fusion algorithms build on a CNN-RNN structure and fuse visual and text features into a unified representation for predicting answers. Attention-mechanism algorithms distinguish the question-relevant information in the image and address the interaction between vision and language. However, neither multi-modal fusion nor attention mechanisms effectively combine text information with image information. Existing visual question-answering models also fail to attend to the object-relation information of the picture and lack the ability to acquire high-level semantic information, so the task faces the challenges of answering different types of questions and of extracting effective semantic information from the picture. A model should attend more to the object-relation information of the picture, so that it can match questions forward to the corresponding answers from the captions, and it should attend more to the picture's high-level semantic information, so that it is more robust when matching answers from the captions.
Disclosure of Invention
The invention aims to overcome the defects of the background art and provide a visual question-answering method based on multi-angle semantic understanding and adaptive dual channels, which makes the trained model more robust, generalizes better to more complex visual scenes, and improves both the semantic quality of the answers and the accuracy of the visual question-answering model.
The technical scheme adopted by the invention is a visual question-answering method based on multi-angle semantic understanding and adaptive dual channels, comprising the following steps:
Step 1: preprocess the input image, and extract the visual features and geometric features of the salient regions in the input image with an object detection module.
Step 2: for question-text embedding, split sentences into words using spaces and punctuation marks (a numeric or number-based token is also treated as one word); represent each word as a vector with a pre-trained word-vector model; finally, pass the word-vector sequence through a long short-term memory (LSTM) network and take the state of the last time step as the question feature.
Step 3: for the embedding of image-caption and close-caption text, likewise split the sentences into words using spaces and punctuation; then concatenate the obtained caption features and convert them into text-paragraph form; finally, encode the text paragraphs with an LSTM network, the output of the last layer being the encoded word-vector sequence.
Step 4: apply an attention mechanism to the visual features and question features obtained in steps 1 and 2 to obtain question-related attention features; feed the visual, geometric, and question features obtained in steps 1 and 2 to a relation reasoning module, which outputs relation features; finally, fuse the attention features and the relation features to generate the visual feature representation.
Step 5: input the word-vector sequences and question features obtained in steps 2 and 3 into a multi-angle semantic module to generate multi-angle semantic features.
Step 6: send the visual features and multi-angle semantic features generated in steps 4 and 5 to a visual-semantic selection gate, which controls, by feature fusion, the contributions of the visual channel and the semantic channel to the predicted answer; in answer prediction, a multi-class classifier selects the answer with the highest probability as the final answer.
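The visual-semantic selection gate of step 6 can be sketched as a convex combination of the two channels. The sigmoid gate form, the scalar gate score, and the function names below are illustrative assumptions, not the patent's exact formulation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(visual_feat, semantic_feat, gate_score):
    """Visual-semantic selection gate (sketch): a gate g in (0, 1) weights
    the visual channel and 1 - g weights the semantic channel."""
    g = sigmoid(gate_score)
    return [g * v + (1.0 - g) * s for v, s in zip(visual_feat, semantic_feat)]

def predict_answer(class_scores):
    """Answer prediction: select the index with the highest probability."""
    return max(range(len(class_scores)), key=lambda k: class_scores[k])
```

With a neutral gate score of 0 (g = 0.5), `gated_fuse([2.0, 4.0], [0.0, 0.0], 0.0)` averages the channels to `[1.0, 2.0]`.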
The invention is also characterized in that:
In step 1, using the object detection module specifically means: obtain object detection boxes with a Faster R-CNN model, and select the K most relevant detection boxes (typically K = 36) as important visual regions. For each selected region i, v_i is a d-dimensional visual object vector, and the input image is finally represented as V = {v_1, v_2, …, v_K}^T. In addition, the geometric features of the input image are also recorded as B = {b_1, b_2, …, b_K}^T, where b_i = (x_i/w, y_i/h, w_i/w, h_i/h); (x_i, y_i), w_i, and h_i denote the center coordinates, width, and height of the selected region i, respectively, and w and h denote the width and height of the input image.
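The normalized geometric feature b_i can be sketched as follows; the exact normalization is an assumption reconstructed from the definitions of (x_i, y_i), w_i, h_i and w, h, since the original formula image is not reproduced here:

```python
def geometric_feature(cx, cy, rw, rh, img_w, img_h):
    """4-d geometry for one detected region: center and size normalized
    by the input image's width and height (assumed form of b_i)."""
    return [cx / img_w, cy / img_h, rw / img_w, rh / img_h]

def geometry_matrix(boxes, img_w, img_h):
    """Stack b_i for all K regions into B = {b_1, ..., b_K}."""
    return [geometric_feature(cx, cy, rw, rh, img_w, img_h)
            for (cx, cy, rw, rh) in boxes]
```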
Step 2 is specifically implemented as follows:
First, truncate each input question Q to at most 14 words, simply discarding words beyond the 14th, and pad questions shorter than 14 words with zero vectors. Then convert the 14-word question into GloVe word vectors, giving a word-embedding sequence of size 14 × 300, which is passed in order through a long short-term memory (LSTM) network whose hidden layer has dimension d_q. Finally, the final hidden state of the LSTM is used as the question embedding of the input question Q.
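The truncate-and-pad preprocessing above can be sketched in plain Python; the tokenizer regex and the `<pad>` marker (which would map to the zero vector at embedding time) are illustrative assumptions:

```python
import re

MAX_LEN = 14   # questions are truncated/padded to 14 words
EMB_DIM = 300  # GloVe dimension used in the text

def tokenize(question):
    """Split on spaces and punctuation; a number counts as one word."""
    return re.findall(r"[A-Za-z0-9']+", question.lower())

def pad_or_truncate(tokens, max_len=MAX_LEN):
    """Discard words beyond max_len; shorter questions are padded (the
    pad token stands in for the zero vector used in the text)."""
    tokens = tokens[:max_len]
    return tokens + ["<pad>"] * (max_len - len(tokens))
```

For example, a six-word question is padded out to a fixed-length sequence of 14 tokens before embedding.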
The text-embedding procedure in step 3 is the same as that of step 2, except for the additional step of concatenating the image captions and close captions.
The attention mechanism in step 4 specifically refers to: introducing a top-down attention mechanism and using soft attention as the attention module, so that question-related visual objects are introduced into the network structure and attention features are output. All visual regions and the corresponding attention feature V_at are expressed as a weighted sum:
V_at = A^T · V
where A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
The relation reasoning module in step 4 specifically refers to: encoding the relations between image regions with a dual convolution stream, generating two different types of relation features, namely binary relation features and multivariate relation features. The relation reasoning module consists of three parts: feature fusion, binary relation reasoning, and multivariate relation reasoning. The feature-fusion part fuses the visual, geometric, and question features by raising and reducing dimensions to generate pairwise combinations of visual-region features. The binary relation reasoning part mines pairwise visual relations between visual regions through three consecutive 1 × 1 convolution layers to generate binary relation features. The multivariate relation reasoning part mines intra-group visual relations between visual regions through three consecutive 3 × 3 dilated convolution layers to generate multivariate relation features. Finally, the binary and multivariate relation features are combined to obtain the relation features.
The feature fusion comprises the following steps. First, the object features and geometric features of the K visual regions of the image are concatenated to generate the visual-region features V_co = concat[V, B]. Second, the visual-region features V_co and the question feature are mapped into a low-dimensional subspace, where W_v and W_q are learnable parameters, b_v and b_q are biases, and d_s is the dimension of the subspace. To combine visual regions in pairs, the mapped visual-region features are expanded and added to their transpose, yielding the pairwise combination of visual-region features V_fu.
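The pairwise combination V_fu can be sketched in plain Python. Interpreting "expand and add the transpose" as an element-wise sum over every pair of region features is an assumption, since the original formula images are not reproduced here:

```python
def pairwise_combine(regions):
    """V_fu[i][j]: element-wise sum of the mapped features of regions i
    and j, giving a K x K grid of pairwise combined region features."""
    K = len(regions)
    return [[[a + b for a, b in zip(regions[i], regions[j])]
             for j in range(K)]
            for i in range(K)]
```

Note the grid is symmetric by construction: the (i, j) and (j, i) entries are identical, which matches the later symmetrization in the relation reasoning branches.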
The binary relation reasoning comprises the following steps: three consecutive 1 × 1 convolution layers are used, each followed by a ReLU activation layer, the first having d_s channels. The visual-region combination feature V_fu is input into the binary relation reasoning module; denoting the output of the last layer by M_p, M_p is added to its transpose to obtain a symmetric matrix, and the binary relation R_p is finally generated through softmax: R_p = softmax(M_p + M_p^T).
The multivariate relation reasoning comprises the following steps: three consecutive 3 × 3 dilated convolution layers are used, each followed by a ReLU activation layer, with dilation rates of 1, 2, and 4, respectively. All convolutions have stride 1, and zero padding is used so that the output of each convolution has the same size as its input. The visual-region pairwise combination V_fu is input into the multivariate relation reasoning module; as in the binary relation reasoning, the output M_g of the last convolution and ReLU activation layer is added to its transpose to obtain a symmetric matrix, and the multivariate relation R_g is finally generated through softmax: R_g = softmax(M_g + M_g^T).
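The final step shared by both reasoning branches (symmetrize by adding the transpose, then normalize with softmax) can be sketched as:

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def relation_from_scores(M):
    """R = softmax(M + M^T): add the K x K score matrix to its transpose
    to make it symmetric, then normalize each row with softmax."""
    K = len(M)
    sym = [[M[i][j] + M[j][i] for j in range(K)] for i in range(K)]
    return [softmax(r) for r in sym]
```

Each row of R then sums to 1, so R[i] can be read as a distribution over the regions related to region i.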
Step 4 is specifically implemented as follows. First, multi-modal fusion is applied, where 1 ∈ R^d is a vector whose elements are all 1 and ∘ denotes element-wise multiplication. Second, the same mapping matrices are used for all image regions, where P ∈ R^d is a learnable parameter; the attention mapping matrix is obtained from the attention weight ω_i of each image region i. Thus all visual regions and the corresponding attention feature V_at are expressed as a weighted sum:
V_at = A^T · V
where A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
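The weighted sum V_at = A^T · V can be sketched as follows (the weights are assumed to be already softmax-normalized, as produced by the attention module):

```python
def attention_pool(weights, regions):
    """V_at = A^T V: sum of the K region feature vectors, each weighted
    by its attention weight omega_i."""
    d = len(regions[0])
    return [sum(w * r[k] for w, r in zip(weights, regions))
            for k in range(d)]
```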
The multi-angle semantic module in step 5 associates the question features with the caption features. The specific method is as follows. First, the relevance between each caption t_i and the question q_j is computed by traversal using cosine similarity, and the text features most relevant to the question q_j are selected. Second, the weight coefficient R_i is combined with the caption feature t_i so that semantic information more relevant to the question receives more attention, yielding the weighted caption feature. Each word of the caption is then encoded with a bidirectional LSTM (BiLSTM), and each word of the question is likewise encoded with a BiLSTM. Finally, four fusion methods (complete fusion, average-pooling fusion, attention fusion, and maximum-attention fusion) are adopted to improve the model's generalization in understanding semantic information.
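The caption-question relevance step can be sketched as follows. Taking the maximum cosine similarity over question features as the weight coefficient R_i, and combining it with t_i by scalar multiplication, are assumptions, since the original formulas are not reproduced here:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def caption_weights(caption_vecs, question_vecs):
    """R_i: relevance of caption feature t_i to the question, obtained by
    traversing the question features with cosine similarity (max assumed)."""
    return [max(cosine(t, q) for q in question_vecs) for t in caption_vecs]

def weight_captions(caption_vecs, weights):
    """Weighted caption features: captions more relevant to the question
    receive more attention."""
    return [[r * x for x in t] for t, r in zip(caption_vecs, weights)]
```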
Step 5 is specifically implemented as follows.
Step 5.1: associate the question features with the caption features. First, compute the relevance between each caption t_i and the question q_j by traversal using cosine similarity, and select the text features most relevant to the question q_j. Second, combine the weight coefficient R_i with the caption feature t_i so that semantic information more relevant to the question receives more attention, yielding the weighted caption feature. Then encode each word of the caption with a bidirectional LSTM (BiLSTM), and likewise encode each word of the question with a BiLSTM. Finally, adopt the four methods of complete fusion, average-pooling fusion, attention fusion, and maximum-attention fusion to improve the model's generalization in understanding semantic information.
Step 5.2: encode each word of the caption with a bidirectional LSTM (BiLSTM), and likewise encode each word of the question with a BiLSTM; the resulting hidden states of the forward and reverse LSTMs of the caption at the i-th time step, and of the question at the j-th time step, serve as the context embeddings.
Step 5.3: adopt the four fusion strategies of complete fusion, average-pooling fusion, attention fusion, and maximum-attention fusion, respectively, to capture high-level semantic information.
Complete fusion passes each forward and reverse word vector of the caption paragraph, together with the forward and reverse final states of the whole question, into the function F for fusion; the results are l-dimensional vectors representing the forward and reverse complete-fusion features of the i-th caption word vector.
Average-pooling fusion passes the forward (or reverse) caption word-vector features and the forward (or reverse) question features at each time step into the function F for fusion, and then applies average pooling; the results are l-dimensional vectors representing the forward and reverse average-pooling fusion features of the i-th caption word vector.
Attention fusion first computes a similarity coefficient between the caption's context embedding and the question's context embedding through a cosine similarity function. The similarity coefficient is then treated as a weight, multiplied with each forward (or reverse) word-vector embedding of the question, and averaged. The forward and reverse similarity coefficients yield the forward and reverse attention vectors corresponding to the i-th caption word vector, representing the relevance of the whole question to that word. Finally, the attention vector and the caption context embedding are passed into the function F for fusion, giving the forward and reverse attention-fusion features of the i-th caption word vector.
Maximum-attention fusion directly takes the question embedding with the maximum similarity coefficient as the attention vector, and finally passes the attention vector and the caption embedding into the function F for fusion.
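The attention-fusion and maximum-attention-fusion strategies can be sketched together. The cosine weighting and averaging follow the description above, but the exact fusion function F is not reproduced in this text, so these helpers are illustrative assumptions:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def attention_fusion_vector(cap_vec, question_vecs):
    """Attention fusion (sketch): cosine similarities act as weights over
    the question word embeddings, which are then averaged."""
    sims = [cosine(cap_vec, q) for q in question_vecs]
    n, d = len(question_vecs), len(cap_vec)
    return [sum(s * q[k] for s, q in zip(sims, question_vecs)) / n
            for k in range(d)]

def max_attention_vector(cap_vec, question_vecs):
    """Maximum-attention fusion (sketch): take the question embedding with
    the largest similarity coefficient directly as the attention vector."""
    sims = [cosine(cap_vec, q) for q in question_vecs]
    return question_vecs[sims.index(max(sims))]
```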
With the four fusion strategies of step 5, the eight generated feature vectors (four strategies in two directions) are concatenated into the comprehensive fusion feature of the i-th caption word. The comprehensive fusion feature is input into a bidirectional LSTM (BiLSTM), and the final hidden states in both directions are obtained. Next, the final hidden states at the head and tail are concatenated to generate the multi-angle semantic feature. Finally, to facilitate multi-modal feature fusion, the multi-angle semantic feature is mapped to the same dimension as the visual representation, where W_s is a learnable weight matrix and b_s is a bias.
The invention has the following beneficial effects:
1. Based on a multi-angle semantic understanding and adaptive dual-channel model, the invention can capture visual cues and semantic cues of images simultaneously, and adds gating in the late fusion stage to adaptively select visual and semantic information for answering questions, so that the trained model is robust.
2. The invention adopts a visual relation reasoning module in the visual channel, comprising binary relation reasoning and multivariate relation reasoning, which enhances the model's understanding of visual content and gives it stronger generalization in more complex visual scenes.
3. The invention adopts a multi-angle semantic module in the semantic channel to generate semantic features, comprising complete fusion, average-pooling fusion, attention fusion, and maximum-attention fusion, which improves the semantic quality of the answers and the accuracy of the visual question-answering model.
Drawings
FIG. 1 is a diagram of a network model structure of the method of the present invention.
Fig. 2 is a schematic diagram of a relationship inference module in the method of the present invention.
FIG. 3 is a schematic diagram of a multi-angle semantic module in the method of the present invention.
Detailed Description
The invention will be further described with reference to the embodiment shown in the drawings.
The invention discloses a visual question-answering method based on multi-angle semantic understanding and adaptive dual channels, comprising the following steps:
Step 1: the input image is preprocessed, and visual features and geometric features of a salient region in the input image are extracted by using an object detection module. Mesh features are extracted by using pre-training ResNet-101, object regions are explored in cooperation with a fast-RCNN model, 2048-dimensional target region features are extracted, and the K most relevant detection frames (generally K=36) are selected as important visual regions. For each selected region i, V i is a d-dimensional visual object vector, the input image is ultimately represented as v= { V 1,v2,…,vK}T, In addition, the geometric features of the input image are also recorded, denoted as b= { B 1,b2,…,bK}T, where/> (X i,yi),wi,hi represents the center coordinates, width and height of the selected region i, respectively. W, h represents the width and height of the input image, respectively.
Step 2: for the embedding of question text, the method of using spaces and punctuation marks divides sentences into words (numeric or numeric-based words are also regarded as one word); performing vectorization representation on the word by adopting a pre-trained word vector model; finally, word vector representation is used for obtaining the state of the last time step through a long-short memory network, so as to obtain the problem characteristics;
The implementation method is as follows: each input question Q is pruned to a maximum of 14 words, simply discarding additional words beyond 14 words, while questions below 14 words are filled with 0 vectors. The problem of containing 14 words is then transformed into Glove vectors, resulting in word embedding sequences of size 14 x 300, which in turn are passed through a long-short-term memory network (LSTM) with hidden layer of d q dimensions. Finally use The final hidden state of (a) is a problem embedded representation of the input problem Q.
Step 3: for embedding of image captions and close captions text, the sentences are also divided into words by using spaces and punctuation marks, and the sentence length is also set to be 14; then, the invention adopts the first 6 close captions (according to the average value of the caption distribution) as text input, and the obtained caption features are cascaded and converted into the text paragraph form; finally, the text paragraphs are encoded by using a long-short-time memory network, and the output of the last layer is the encoded word vector sequence.
Step 4: apply an attention mechanism to the visual and question features obtained in steps 1 and 2 to obtain question-related attention features; feed the visual, geometric, and question features obtained in steps 1 and 2 to the relation reasoning module, which outputs relation features; finally, fuse the attention features and the relation features to generate the visual feature representation.
The attention mechanism specifically refers to: introducing a top-down attention mechanism and using soft attention as the attention module, so that question-related visual objects are introduced into the network structure and attention features are output. All visual regions and the corresponding attention feature V_at are expressed as a weighted sum:
V_at = A^T · V
where A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
The relation reasoning module consists of three parts: feature fusion, binary relation reasoning, and multivariate relation reasoning (the use of a relation reasoning module is an innovation of the invention).
The feature fusion comprises the following steps. First, the object features and geometric features of the K visual regions of the image are concatenated to generate the visual-region features V_co = concat[V, B]. Second, the visual-region features V_co and the question feature are mapped into a low-dimensional subspace, where W_v and W_q are learnable parameters, b_v and b_q are biases, and d_s is the dimension of the subspace. To combine visual regions in pairs, the mapped visual-region features are expanded and added to their transpose, yielding the pairwise combination of visual-region features V_fu.
The binary relation reasoning comprises the following steps: three consecutive 1 × 1 convolution layers are used, each followed by a ReLU activation layer, the first having d_s channels. The visual-region combination feature V_fu is input into the binary relation reasoning module; denoting the output of the last layer by M_p, M_p is added to its transpose to obtain a symmetric matrix, and the binary relation R_p is finally generated through softmax: R_p = softmax(M_p + M_p^T).
The steps of the multivariate relation reasoning are as follows: three consecutive 3×3 dilated (hole) convolution layers are employed, each followed by a ReLU activation layer. The dilation rates of the three layers are 1, 2 and 4, respectively. The stride of all convolutions is 1, and zero padding is used so that the output of each convolution has the same size as its input. The pairwise-combined visual area features V_fu are input into the multivariate relation reasoning module; as in the binary relation reasoning, the output of the last convolution layer and ReLU activation layer is added to its own transpose to obtain a symmetric matrix, and softmax finally generates the multivariate relation R_g (i.e. R_g = softmax(S + S^T)) according to the following formula:
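A dilated convolution samples the 3×3 kernel positions with a gap equal to the dilation rate, so padding by the dilation rate preserves the input size. A hedged single-channel numpy sketch of this branch, with the dilation rates 1, 2, 4 from the text and everything else (kernel values, single channel, map size) assumed:

```python
import numpy as np

# Sketch of the multivariate-relation branch: three 3x3 dilated ("hole")
# convolutions with dilation rates 1, 2, 4, stride 1, zero padding for
# same-size output, each followed by ReLU; the final map is symmetrised
# and row-softmaxed into R_g.
def dilated_conv2d(x, kernel, dilation):
    """Same-size 3x3 dilated convolution, stride 1, zero padding."""
    pad = dilation                  # padding by the dilation rate keeps size
    xp = np.pad(x, pad)
    H, W = x.shape
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * xp[di * dilation:di * dilation + H,
                                       dj * dilation:dj * dilation + W]
    return out

rng = np.random.default_rng(2)
h = rng.normal(size=(6, 6))         # stand-in single-channel relation map
for dilation in (1, 2, 4):          # the three hole-convolution layers
    kernel = rng.normal(size=(3, 3)) * 0.1
    h = np.maximum(dilated_conv2d(h, kernel, dilation), 0.0)  # ReLU
S = h + h.T                         # symmetric relation scores
e = np.exp(S - S.max(axis=1, keepdims=True))
R_g = e / e.sum(axis=1, keepdims=True)   # softmax over regions
```

Because the output size matches the input at every layer, the three dilation rates enlarge the receptive field (to in-group, multi-region relations) without losing resolution, which is the point of using hole convolutions here.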
The specific implementation steps of step 4 are as follows:
First, starting from the simplest bilinear multi-modal fusion, the weight matrix W_i is replaced by two smaller matrices H_i and G_i such that W_i = H_i G_i^T:
where 1 ∈ R^d is an all-ones vector and ∘ denotes element-wise multiplication;
Second, the same mapping matrices are shared across all image areas:
where P ∈ R^d is a learnable parameter; to obtain the attention mapping matrix, the attention weight ω_i of image region i is computed as follows:
Thus the attention feature V_at over all visual areas is expressed as a weighted sum:
V_at = A^T · V
where A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
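The low-rank factorisation W_i = H_i G_i^T turns the bilinear form v^T W_i q into an element-wise product of two low-dimensional projections. A hedged sketch of the resulting attention computation — the tanh nonlinearity, shared matrices H and G, and all shapes are assumptions; the text only fixes the factorised bilinear structure and the final weighted sum:

```python
import numpy as np

# Sketch of low-rank bilinear attention: project the region features V
# and the question feature q with the shared matrices H and G, combine
# them by element-wise product, and map to one score per region with P.
def bilinear_attention_weights(V, q, H, G, P):
    """Return softmax attention weights over K image regions."""
    joint = np.tanh(V @ H) * np.tanh(q @ G)   # (K, d) element-wise product
    scores = joint @ P                        # (K,) one score per region
    e = np.exp(scores - scores.max())
    return e / e.sum()

K, dv, dq, d = 5, 12, 10, 8
rng = np.random.default_rng(3)
V = rng.normal(size=(K, dv))                  # visual region features
q = rng.normal(size=dq)                       # question feature
H = rng.normal(size=(dv, d)) * 0.1            # low-rank factors (assumed)
G = rng.normal(size=(dq, d)) * 0.1
P = rng.normal(size=d)                        # learnable score vector
omega = bilinear_attention_weights(V, q, H, G, P)
v_at = omega @ V                              # V_at = A^T · V
```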
The specific implementation steps of step 5 are as follows:
Step 5.1: the problem features are associated with the subtitle features. First, the relevance between each subtitle t_i and the problem q_j is computed by traversal using a cosine similarity method, and the text features most relevant to the problem q_j are selected. Second, the weight coefficient R_i is combined with the subtitle feature t_i so that semantic information more relevant to the problem receives more attention, producing the weighted subtitle features. Each word of the subtitle is then encoded using a bidirectional LSTM (BiLSTM), and each word of the problem is likewise encoded with a BiLSTM. Finally, four methods — complete fusion, average pooling fusion, attention fusion and maximum attention fusion — are adopted to improve the generalization capability of the model in understanding semantic information.
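The cosine re-weighting of step 5.1 can be sketched as follows. Variable names (`T`, `q`, `R`) and feature shapes are assumptions; the operation itself — cosine similarity as weight coefficient R_i, then t_i scaled by R_i — is taken from the text.

```python
import numpy as np

# Sketch of step 5.1: each caption feature t_i is scored against the
# question feature q by cosine similarity, and the coefficient R_i
# re-weights the caption so question-relevant semantics get more attention.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def weight_captions(T, q):
    """Weight each caption feature by its cosine similarity to the question."""
    R = np.array([cosine(t, q) for t in T])   # one coefficient per caption
    return R[:, None] * T, R                  # weighted features, weights

rng = np.random.default_rng(4)
T = rng.normal(size=(3, 6))     # three caption features (shapes assumed)
q = rng.normal(size=6)          # question feature
T_weighted, R = weight_captions(T, q)
```

The caption with the largest R_i is the "most relevant text feature" the text refers to.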
Step 5.2: each word of the subtitle is encoded using a bidirectional LSTM (BiLSTM), and each word of the problem is likewise encoded with a BiLSTM:
wherein the four hidden states denote, respectively, the forward and reverse LSTM states of the subtitle at the i-th time step, and the forward and reverse LSTM states of the problem at the j-th time step.
Step 5.3: the invention adopts four fusion strategies — complete fusion, average pooling fusion, attention fusion and maximum attention fusion — to capture high-level semantic information (this is another innovation point of the invention).
The complete fusion strategy passes each forward and reverse word vector of the subtitle paragraph, together with the forward and reverse final states of the whole problem, into the fusion function F, with the following specific formula:
wherein the resulting l-dimensional vectors represent the forward and reverse complete fusion features of the i-th subtitle word vector, respectively.
The average pooling fusion strategy passes the forward (or reverse) word vector features of the subtitle paragraph and the forward (or reverse) problem features at each time step into the fusion function F, and then performs an average pooling operation, with the following specific formula:
wherein the resulting l-dimensional vectors represent the forward and reverse average pooling fusion features of the i-th subtitle word vector, respectively.
The attention fusion strategy first computes similarity coefficients between the context embeddings of the subtitle and those of the problem through a cosine similarity function; the coefficients are then treated as weights, multiplied with each forward (or reverse) word vector embedding of the problem, and averaged, with the following specific formula:
wherein the coefficients denote the forward and reverse similarity coefficients, and the resulting vectors are the forward and reverse attention vectors of the i-th subtitle word vector, representing the relevance of the whole problem to that word.
Finally, the attention vectors and the subtitle context embeddings are passed into the fusion function F to obtain the forward and reverse attention fusion features of the i-th subtitle word vector, as follows:
The maximum attention fusion strategy directly takes the problem embedding with the largest similarity coefficient as the attention vector, and finally passes the attention vector and the subtitle embedding into the fusion function F, with the following specific formula:
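The attention fusion strategy can be sketched for one caption word. The fusion function F is not specified in the text, so a concatenation stand-in is assumed here, and all names and shapes are illustrative; only the cosine-weighted average of question word embeddings follows the description.

```python
import numpy as np

# Sketch of attention fusion for one caption word: cosine similarities
# between the caption word's context embedding and every question word
# embedding become weights; the weighted mean of the question embeddings
# is the attention vector, which F (assumed: concat) fuses with the word.
def attention_fusion(h_cap, H_q):
    sims = np.array([
        h_cap @ h / (np.linalg.norm(h_cap) * np.linalg.norm(h) + 1e-9)
        for h in H_q
    ])                                              # similarity coefficients
    attn_vec = (sims[:, None] * H_q).mean(axis=0)   # weighted average
    return np.concatenate([h_cap, attn_vec])        # F := concat (assumed)

rng = np.random.default_rng(5)
h_cap = rng.normal(size=4)      # forward context of one caption word
H_q = rng.normal(size=(7, 4))   # forward question word embeddings
fused = attention_fusion(h_cap, H_q)
```

Maximum attention fusion differs only in replacing the weighted mean with `H_q[np.argmax(sims)]`.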
The eight feature vectors generated by the four fusion methods are concatenated to obtain the comprehensive fusion feature of the i-th subtitle. The comprehensive fusion feature is then input into a bidirectional LSTM (BiLSTM), and the final hidden states in both directions are obtained as follows:
Next, the final hidden states at the head and the tail are concatenated to generate the multi-angle semantic features. Finally, to facilitate multi-modal feature fusion, the multi-angle semantic features are mapped to the same dimension as the visual representation, as follows:
where W_s is a learnable weight matrix and b_s is a bias.
Step 6: the visual features and multi-angle semantic features generated in steps 4 and 5 are sent into a visual-semantic selection gate, which controls the contributions of the visual channel and the semantic channel to the predicted answer through feature fusion. In answer prediction, a multi-class classifier selects the answer with the highest probability as the final answer.
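One common way to realise such a gate is a sigmoid unit that forms a convex combination of the two channels. This is a hedged sketch only — the text states merely that feature fusion controls each channel's contribution, so the sigmoid form, the names `Wg`/`bg`, and all shapes are assumptions.

```python
import numpy as np

# Sketch of a visual-semantic selection gate: a sigmoid gate g computed
# from the concatenated channel features scales the visual representation
# by g and the semantic representation by (1 - g), letting the network
# adaptively decide which channel drives the answer.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def selection_gate(v_feat, s_feat, Wg, bg):
    g = sigmoid(np.concatenate([v_feat, s_feat]) @ Wg + bg)  # gate in (0,1)
    return g * v_feat + (1.0 - g) * s_feat                   # fused feature

rng = np.random.default_rng(6)
d = 8
v_feat = rng.normal(size=d)          # visual channel representation
s_feat = rng.normal(size=d)          # semantic channel representation
Wg = rng.normal(size=(2 * d, d)) * 0.1
bg = np.zeros(d)
fused = selection_gate(v_feat, s_feat, Wg, bg)
```

Because g lies strictly between 0 and 1, each fused coordinate stays between the corresponding visual and semantic values, so neither channel is ever switched off completely.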
In summary, on the VQA 1.0 and VQA 2.0 datasets, the invention uses an R-CNN-LSTM framework combining an attention mechanism and a relation reasoning method in the visual channel: Faster R-CNN encodes the visual feature vectors and geometric feature vectors of the image, which are input into the visual channel to generate the visual modality representation. The semantic channel uses an LSTM network to encode the concatenation of the global caption and the local captions, and outputs the semantic modality representation through the multi-angle semantic module. Finally, the obtained visual and semantic modality representations are input into the adaptive selection gate, which decides which modality cues to use for predicting the answer.
The innovation points are as follows: first, a relation reasoning module comprising binary relation reasoning and multivariate relation reasoning is adopted in the visual channel; it enhances the model's understanding of visual content and gives the model stronger generalization ability in more complex visual scenes. Second, a multi-angle semantic module comprising the complete fusion, average pooling fusion, attention fusion and maximum attention fusion methods is adopted in the semantic channel to generate semantic features; it improves the semantic quality of answers while improving the accuracy of the visual question-answering model.
Simulation experiments and analysis of experimental results:
1. Data set
The model was evaluated on two public visual question-answering datasets, VQA 1.0 and VQA 2.0. VQA 1.0 is built on the MSCOCO image dataset [38]; its training set contains 248,349 questions and 82,783 pictures, its validation set contains 121,512 questions and 40,504 pictures, and its test set contains 244,302 questions and 81,434 pictures. VQA 2.0 is an iterative version of VQA 1.0 that adds more question samples to make the language bias more balanced. The training set of the VQA 2.0 dataset contains 443,757 questions and 82,783 pictures, the validation set contains 214,354 questions and 40,504 pictures, and the test set contains 447,793 questions and 81,434 pictures. There are three question types: yes/no, number, and other, with the other type accounting for roughly half of all samples. The proposed model is trained on the training and validation sets, and, to ensure fair comparison with other work, test results are reported on the test-development set (test-dev) and the test-standard set (test-standard).
2. Experimental environment
The proposed model is implemented with the PyTorch library, and the test experiments were completed on a GPU server configured with 256 GB of RAM and four Nvidia 1080Ti GPUs (64 GB of memory in total). The model is trained with the Adam optimizer for at most 40 epochs with a batch size of 256. The learning rate is set to 1e-3 in the first training epoch, 2e-3 in the second, and 3e-3 in the third, where it is held until the tenth epoch; thereafter the learning rate decays once every two epochs with a decay rate of 0.5. To prevent gradient explosion, gradient clipping is applied, scaling the gradient of each epoch to one quarter of its original value. To prevent overfitting, a dropout layer with a dropout rate of 0.5 follows each fully connected layer.
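The learning-rate schedule just described can be written as a small function. The warm-up values and the hold-until-epoch-10 plateau are read directly from the text; the exact epoch at which the first halving takes effect is ambiguous in the description, so the boundary chosen below (first decay applied at epoch 12) is an assumption.

```python
# Sketch of the training schedule: 1e-3, 2e-3, 3e-3 over the first three
# epochs, held at 3e-3 until epoch 10, then halved every two epochs
# (decay rate 0.5).  Epochs are 1-indexed; the decay boundary is assumed.
def learning_rate(epoch):
    if epoch <= 3:
        return epoch * 1e-3          # warm-up: 1e-3, 2e-3, 3e-3
    if epoch <= 10:
        return 3e-3                  # plateau until the tenth epoch
    return 3e-3 * 0.5 ** ((epoch - 10) // 2)   # halve every two epochs

schedule = [learning_rate(e) for e in range(1, 15)]
```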
3. Experimental results and analysis
TABLE 1 VQA 1.0 Performance of models in test-development set and test-Standard set
As shown in Table 1, which compares various advanced models with the proposed model (results obtained after training on the training and validation sets), the proposed model clearly outperforms the other models on most metrics, reaching overall accuracies of 69.44% and 69.37% on the test-development and test-standard sets, respectively. On the test-development set, overall accuracy improves by 5.64% over the MAN model, which uses a memory-augmented neural network, and by 0.73% over the best-performing VSDC model. The VSDC model also adopts the idea of semantics-guided prediction and uses a semantic attention mechanism to acquire question-related semantic information. On top of the semantic attention mechanism, the invention adds three further fusion methods to improve the model's multi-angle semantic understanding, and the experimental results show that the multi-angle semantic module in the semantic channel is important for improving prediction accuracy. The proposed model shows the same behavior on the test-standard set.
TABLE 2 VQA 2.0 Performance of models in test-development set and test-Standard set
As shown in Table 2, the performance of the model was further verified on the VQA 2.0 dataset, including the test-development and test-standard sets. Compared with advanced methods, the proposed model performs well on overall accuracy and related metrics. Compared with the MuRel [49] model, overall accuracy improves by 1.22% on the test-development set and 0.89% on the test-standard set. MuRel is a prominent model among current multi-modal relation modeling methods, using a network structure with residual feature learning for end-to-end reasoning; the proposed model outperforms it because the semantic channel guides answer prediction and can exploit a large amount of semantic information. In addition, compared with the VCTREE model, which combines reinforcement learning and supervised learning and is among the best-performing visual question-answering methods at present, the invention shows clear advantages on overall accuracy and related metrics. In summary, comparison with advanced methods shows that the proposed model better mines semantic information on the basis of understanding image content and improves the accuracy of answer prediction.
Claims (7)
1. The visual question-answering method based on multi-angle semantic understanding and self-adaption double channels is characterized by comprising the following steps of:
step 1; preprocessing an input image, and extracting visual features and geometric features of a salient region in the input image by using an object detection module;
Step 2; for the embedding of the problem text, sentences are divided into words using spaces and punctuation marks; the words are vectorized with a pre-trained word vector model; finally, the word vector representations are passed through a long short-term memory network, and the state of the last time step is taken as the problem features;
Step 3; for the embedding of the image caption and close caption text, sentences are likewise divided into words using spaces and punctuation marks; the obtained caption features are then concatenated and converted into the form of a text paragraph; finally, a long short-term memory network encodes the text paragraph, and the output of the last layer is the encoded word vector sequence;
Step 4; applying an attention mechanism to the visual features and the problem features obtained in steps 1 and 2 to obtain attention features related to the problem; feeding the visual features, geometric features and problem features obtained in steps 1 and 2 into a relation reasoning module, which outputs relation features; finally, fusing the attention features and the relation features to generate the visual feature representation;
Step 5; inputting the word vector sequences and the problem features obtained in the step 2 and the step 3 into a multi-angle semantic module to generate multi-angle semantic features;
Step 6; the visual features and the multi-angle semantic features generated in the step 4 and the step 5 are sent to a visual semantic selection gate, and the contribution of the visual channel and the semantic channel to the predicted answer is controlled in a feature fusion mode; the answer with highest probability is selected through a multi-classifier to be used as a final answer in the prediction of the answer;
The attention mechanism in step 4) specifically refers to: introducing a top-down attention mechanism and using a soft attention method as the attention module, so that visual objects related to the problem are introduced into the network structure and attention features are output; wherein the attention feature V_at over all visual areas is expressed as a weighted sum:
V_at = A^T · V
wherein A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix;
The relation reasoning module in step 4) specifically refers to: encoding the relations between image areas through two convolution streams, generating two different types of relation features, namely binary relation features and multivariate relation features; the relation reasoning module consists of three parts: feature fusion, binary relation reasoning and multivariate relation reasoning; the feature fusion module fuses the visual features, geometric features and problem features through dimension raising and reduction to generate pairwise combinations of visual area features; the binary relation reasoning module mines pairwise visual relations between visual areas and generates binary relation features through three consecutive 1×1 convolution layers; the multivariate relation reasoning module mines intra-group visual relations between visual areas and generates multivariate relation features through three consecutive 3×3 dilated convolution layers; finally, the binary relation features and the multivariate relation features are combined to obtain the relation features;
The feature fusion comprises the following steps: first, the object features and geometric features of the K visual areas of the image are concatenated to generate the visual area features V_co = concat[V, B]; second, the visual area features V_co and the problem features are mapped into a low-dimensional subspace:
wherein W_v and W_q are learnable parameters, b_v and b_q are biases, and d_s is the dimension of the subspace;
The binary relation reasoning comprises the steps of employing three consecutive 1×1 convolution layers, each followed by a ReLU activation layer, with the first layer having d_s channels and the last layer reducing to a single channel; the pairwise-combined visual area features V_fu are input into the binary relation reasoning module; the output of the last layer is added to its own transpose to obtain a symmetric matrix, and softmax finally generates the binary relation R_p (i.e. R_p = softmax(S + S^T), with S the last-layer output); the specific formula is as follows:
The steps of the multivariate relation reasoning are as follows: three consecutive 3×3 dilated (hole) convolution layers are employed, each followed by a ReLU activation layer; the dilation rates of the three layers are 1, 2 and 4, respectively; the stride of all convolutions is 1, and zero padding is used so that the output of each convolution has the same size as its input; the pairwise-combined visual area features V_fu are input into the multivariate relation reasoning module; as in the binary relation reasoning, the output of the last convolution layer and ReLU activation layer is added to its own transpose to obtain a symmetric matrix, and softmax finally generates the multivariate relation R_g (i.e. R_g = softmax(S + S^T)) according to the following formula:
The multi-angle semantic module in step 5) is used to associate the problem features with the subtitle features; the specific method is as follows: first, the relevance between each subtitle t_i and the problem q_j is computed by traversal using a cosine similarity method, and the text features most relevant to the problem q_j are selected; second, the weight coefficient R_i is combined with the subtitle feature t_i so that semantic information more relevant to the problem receives more attention, producing the weighted subtitle features; then a bidirectional LSTM encodes each word of the subtitle, and a BiLSTM likewise encodes each word of the problem; finally, the four methods of complete fusion, average pooling fusion, attention fusion and maximum attention fusion are adopted to improve the generalization capability of the model in understanding semantic information.
2. The visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels according to claim 1, characterized in that: in step 1), using the object detection module specifically means: obtaining object detection boxes with a Faster R-CNN model and selecting the K most relevant boxes as the important visual areas; for each selected region i, v_i is a d-dimensional visual object vector, and the input image is finally represented as V = {v_1, v_2, …, v_K}^T; in addition, the geometric features of the input image are recorded as B = {b_1, b_2, …, b_K}^T, where (x_i, y_i), w_i, h_i denote the center coordinates, width and height of the selected region i, respectively, and w, h denote the width and height of the input image.
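The geometric feature b_i in the claim above normalises each region's centre, width and height by the image size. A small sketch under that reading — the exact 4-d layout of b_i is an assumption, since the claim's formula image did not survive extraction:

```python
# Sketch of the geometric feature of claim 2: region centre (x, y), width
# wi and height hi are normalised by the image width w and height h,
# giving a scale-invariant 4-d vector b_i (layout assumed).
def geometric_feature(x, y, wi, hi, w, h):
    return [x / w, y / h, wi / w, hi / h]

b = geometric_feature(x=320, y=240, wi=64, hi=48, w=640, h=480)
```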
3. The visual question-answering method based on multi-angle semantic understanding and self-adaption double channels according to claim 2, wherein the method is characterized by comprising the following steps: the step 2) is specifically implemented according to the following steps:
First, each input problem Q is trimmed to at most 14 words: extra words beyond 14 are simply discarded, and problems with fewer than 14 words are padded with 0 vectors; then, the 14 words of the problem are converted into GloVe word vectors, giving a word embedding sequence of size 14 × 300, which is passed through a long short-term memory network with a d_q-dimensional hidden layer; finally, the final hidden state is used as the question embedding representation of the input problem Q;
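The trim-and-pad step can be sketched directly. Here the string token `"<pad>"` stands in for the 0 vector mentioned in the claim; that placeholder and the function name are illustrative, while the length limit of 14 comes from the text.

```python
# Sketch of the question pre-processing in step 2: trim to at most 14
# words, pad shorter questions; "<pad>" stands in for the 0 vector.
def trim_and_pad(words, max_len=14, pad="<pad>"):
    words = words[:max_len]                        # discard extra words
    return words + [pad] * (max_len - len(words))  # pad short questions

q = trim_and_pad("what color is the cat".split())
```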
The text embedding in step 3) is implemented in the same way as the text embedding in step 2), apart from the step of cascading the image captions with the close captions.
4. The visual question-answering method based on multi-angle semantic understanding and self-adapting dual channels according to claim 3, wherein: the specific implementation steps of the step 4 are as follows:
first, according to the multi-modal fusion:
where 1 ∈ R^d is an all-ones vector and ∘ denotes element-wise multiplication;
second, the same mapping matrices are shared across all image areas:
wherein P ∈ R^d is a learnable parameter; to obtain the attention mapping matrix, the attention weight ω_i of image region i is computed as follows:
Thus the attention feature V_at over all visual areas is expressed as a weighted sum:
V_at = A^T · V
where A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
5. The visual question-answering method based on multi-angle semantic understanding and self-adaption double channels according to claim 4, wherein the method is characterized by comprising the following steps of: the specific implementation steps of the step 5 are as follows:
Step 5.1: the problem features are associated with the subtitle features; first, the relevance between each subtitle t_i and the problem q_j is computed by traversal using a cosine similarity method, and the text features most relevant to the problem q_j are selected; second, the weight coefficient R_i is combined with the subtitle feature t_i so that semantic information more relevant to the problem receives more attention, producing the weighted subtitle features;
Step 5.2: each word of the subtitle is encoded using a bidirectional LSTM (BiLSTM), and each word of the problem is likewise encoded with a BiLSTM:
wherein the four hidden states denote, respectively, the forward and reverse LSTM states of the subtitle at the i-th time step, and the forward and reverse LSTM states of the problem at the j-th time step;
step 5.3: four fusion strategies of complete fusion, average pooling fusion, attention fusion and maximum attention fusion are adopted respectively to capture the advanced semantic information.
6. The visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels according to claim 5, characterized in that: the complete fusion passes each forward and reverse word vector of the subtitle paragraph, together with the forward and reverse final states of the whole problem, into the fusion function F, with the following specific formula:
wherein the resulting l-dimensional vectors represent the forward and reverse complete fusion features of the i-th subtitle word vector, respectively;
The average pooling fusion passes the forward (or reverse) word vector features of the subtitle paragraph and the forward (or reverse) problem features at each time step into the fusion function F, and then performs the average pooling operation, with the following specific formula:
wherein the resulting l-dimensional vectors represent the forward and reverse average pooling fusion features of the i-th subtitle word vector, respectively;
The attention fusion first computes similarity coefficients between the context embeddings of the subtitle and those of the problem through a cosine similarity function; the coefficients are then treated as weights, multiplied with each forward (or reverse) word vector embedding of the problem, and averaged, with the following specific formula:
wherein the coefficients denote the forward and reverse similarity coefficients, and the resulting vectors are the forward and reverse attention vectors of the i-th subtitle word vector, representing the relevance of the whole problem to that word;
finally, the attention vectors and the subtitle context embeddings are passed into the fusion function F to obtain the forward and reverse attention fusion features of the i-th subtitle word vector, as follows:
The maximum attention fusion directly takes the problem embedding with the largest similarity coefficient as the attention vector, and finally passes the attention vector and the subtitle embedding into the fusion function F, with the following specific formula:
7. The visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels according to claim 6, characterized in that: the eight feature vectors generated by the four fusion methods in step 5) are concatenated to obtain the comprehensive fusion feature of the i-th subtitle; the comprehensive fusion feature is input into a bidirectional LSTM, and the final hidden states in both directions are obtained according to the following formula:
next, the final hidden states at the head and the tail are concatenated to generate the multi-angle semantic features; finally, to facilitate multi-modal feature fusion, the multi-angle semantic features are mapped to the same dimension as the visual representation, as follows:
where W_s is a learnable weight matrix and b_s is a bias.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210223976.XA CN114661874B (en) | 2022-03-07 | 2022-03-07 | Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114661874A CN114661874A (en) | 2022-06-24 |
CN114661874B true CN114661874B (en) | 2024-04-30 |
Family
ID=82028726
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210223976.XA Active CN114661874B (en) | 2022-03-07 | 2022-03-07 | Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114661874B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115730059A (en) * | 2022-12-08 | 2023-03-03 | 安徽建筑大学 | Visual question answering method, device, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110163299A (en) * | 2019-05-31 | 2019-08-23 | 合肥工业大学 | A kind of vision answering method based on bottom-up attention mechanism and memory network |
CN110222770A (en) * | 2019-06-10 | 2019-09-10 | 成都澳海川科技有限公司 | A kind of vision answering method based on syntagmatic attention network |
CN110647612A (en) * | 2019-09-18 | 2020-01-03 | 合肥工业大学 | Visual conversation generation method based on double-visual attention network |
CN110717431A (en) * | 2019-09-27 | 2020-01-21 | 华侨大学 | Fine-grained visual question and answer method combined with multi-view attention mechanism |
KR20210056071A (en) * | 2019-11-08 | 2021-05-18 | 경기대학교 산학협력단 | System for visual dialog using deep visual understanding |
CN113886626A (en) * | 2021-09-14 | 2022-01-04 | 西安理工大学 | Visual question-answering method of dynamic memory network model based on multiple attention mechanism |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9965705B2 (en) * | 2015-11-03 | 2018-05-08 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering |
CN109902166A (en) * | 2019-03-12 | 2019-06-18 | 北京百度网讯科技有限公司 | Vision Question-Answering Model, electronic equipment and storage medium |
- 2022-03-07: CN application CN202210223976.XA filed; patent CN114661874B active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110163299A (en) * | 2019-05-31 | 2019-08-23 | 合肥工业大学 | A kind of vision answering method based on bottom-up attention mechanism and memory network |
CN110222770A (en) * | 2019-06-10 | 2019-09-10 | 成都澳海川科技有限公司 | A kind of vision answering method based on syntagmatic attention network |
CN110647612A (en) * | 2019-09-18 | 2020-01-03 | 合肥工业大学 | Visual conversation generation method based on double-visual attention network |
CN110717431A (en) * | 2019-09-27 | 2020-01-21 | 华侨大学 | Fine-grained visual question and answer method combined with multi-view attention mechanism |
KR20210056071A (en) * | 2019-11-08 | 2021-05-18 | 경기대학교 산학협력단 | System for visual dialog using deep visual understanding |
CN113886626A (en) * | 2021-09-14 | 2022-01-04 | 西安理工大学 | Visual question-answering method of dynamic memory network model based on multiple attention mechanism |
Non-Patent Citations (3)
Title |
---|
Visual question answering algorithm based on Spatial-DCTHash dynamic parameter network; Meng Xiangshen, Jiang Aiwen, Liu Changhong, Ye Jihua, Wang Mingwen; Scientia Sinica: Informationis; 2017-08-20 (08); full text *
Research on visual question answering algorithms based on visual-semantic dual channels; Wang Xin; China Master's Theses Full-text Database; 2023-02-15; full text *
A visual question answering model combining a bottom-up attention mechanism and memory networks; Yan Ruyu, Liu Xueliang; Journal of Image and Graphics; 2020-05-16 (05); full text *
Also Published As
Publication number | Publication date |
---|---|
CN114661874A (en) | 2022-06-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhou et al. | A comprehensive survey on pretrained foundation models: A history from bert to chatgpt | |
CN110163299B (en) | Visual question-answering method based on bottom-up attention mechanism and memory network | |
CN112487807B (en) | Text relation extraction method based on expansion gate convolutional neural network | |
CN111652357B (en) | Method and system for solving video question-answer problem by using specific target network based on graph | |
CN109840322B (en) | Complete shape filling type reading understanding analysis model and method based on reinforcement learning | |
CN110390397A (en) | A kind of text contains recognition methods and device | |
CN113297370B (en) | End-to-end multi-modal question-answering method and system based on multi-interaction attention | |
CN110991290A (en) | Video description method based on semantic guidance and memory mechanism | |
CN113569001A (en) | Text processing method and device, computer equipment and computer readable storage medium | |
CN115438674B (en) | Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment | |
Wei et al. | Enhance understanding and reasoning ability for image captioning | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
CN116610778A (en) | Bidirectional image-text matching method based on cross-modal global and local attention mechanism | |
CN115964459B (en) | Multi-hop reasoning question-answering method and system based on food safety cognition spectrum | |
CN117033602A (en) | Method for constructing multi-mode user mental perception question-answering model | |
CN116975350A (en) | Image-text retrieval method, device, equipment and storage medium | |
CN114661874B (en) | Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels | |
CN113609326B (en) | Image description generation method based on relationship between external knowledge and target | |
CN115238691A (en) | Knowledge fusion based embedded multi-intention recognition and slot filling model | |
CN114003770A (en) | Cross-modal video retrieval method inspired by reading strategy | |
CN116661852B (en) | Code searching method based on program dependency graph | |
CN114511813B (en) | Video semantic description method and device | |
CN113779244B (en) | Document emotion classification method and device, storage medium and electronic equipment | |
CN114266905A (en) | Image description generation model method and device based on Transformer structure and computer equipment | |
CN117648429B (en) | Question-answering method and system based on multi-mode self-adaptive search type enhanced large model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||