CN114661874A - Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels - Google Patents
- Publication number: CN114661874A (application CN202210223976.XA)
- Authority: CN (China)
- Prior art keywords: visual, fusion, features, attention, caption
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/3329 Natural language query formulation or dialogue systems
- G06F18/2415 Classification techniques based on parametric or probabilistic models
- G06F40/35 Discourse or dialogue representation
- G06N3/044 Recurrent networks, e.g. Hopfield networks
- G06N3/045 Combinations of networks
- G06N3/08 Learning methods
- G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
Abstract
The invention belongs to the technical field of cross-modal tasks that combine computer vision and natural language processing. The technical scheme is a visual question-answering method based on multi-angle semantic understanding and adaptive dual channels, comprising the following steps. Step 1: preprocess the input image, and extract the visual features and geometric features of its salient regions with an object detection module. Step 2: for the embedding of the question text, split each sentence into words at spaces and punctuation (a number, or a word built on a number, is also treated as a word); then vectorize the words with a pre-trained word-vector model; finally, pass the word vectors through a long short-term memory (LSTM) network and take the state at the last time step as the question features. The method makes the trained model more robust, generalizes well to more complex visual scenes, improves the semantics of the answers, and improves the accuracy of the visual question-answering model.
Description
Technical Field
The invention belongs to the technical field of cross-modal tasks that combine computer vision and natural language processing, and particularly relates to a visual question-answering method based on multi-angle semantic understanding and adaptive dual channels.
Background
Visual question answering is a task that requires understanding visual content, semantic information, and cross-modal relationships at the same time. A great deal of past work has developed backbone models within the single domains of machine vision or natural language processing, with far-reaching influence. By combining the two domains, visual question answering, as one branch of the cross-modal field, has great potential impact on wide-ranging applications such as visual navigation and remote monitoring.

At present, various image algorithms applied to visual question answering show excellent performance. The mainstream methods fall roughly into two types: algorithms based on multimodal fusion and algorithms based on attention. Multimodal fusion algorithms build on a CNN-RNN structure, fusing visual and textual features into a unified representation for predicting answers. Attention-mechanism algorithms emerged to address visual-linguistic interaction by singling out the information in the image that is relevant to the question. However, neither multimodal fusion nor the attention mechanism alone combines text information and image information effectively. Existing visual question-answering models do not attend to the object-relation information in the picture and lack the ability to acquire high-level semantic information, while the task itself faces the challenges of answering questions of different types and of extracting effective semantic information from the picture. A model should pay more attention to the object-relation information of the picture, should be able to match the corresponding answer from the captions according to the question, should attend to the picture's high-level semantics, and should be more robust when matching answers from captions.
Disclosure of Invention
The invention aims to overcome the defects of the background art and provides a visual question-answering method based on multi-angle semantic understanding and adaptive dual channels. The method makes the trained model more robust, gives it stronger generalization in more complex visual scenes, improves the semantics of the answers, and improves the accuracy of the visual question-answering model.
The technical scheme adopted by the invention is a visual question-answering method based on multi-angle semantic understanding and adaptive dual channels, comprising the following steps:

Step 1: preprocess the input image, and extract the visual features and geometric features of its salient regions with an object detection module.

Step 2: for the embedding of the question text, split each sentence into words at spaces and punctuation; vectorize the words with a pre-trained word-vector model; finally, pass the word vectors through a long short-term memory network and take the state at the last time step as the question features.

Step 3: for the embedding of image captions and dense caption texts, split the sentences into words at spaces and punctuation; then concatenate the obtained features of the multiple captions into the form of a text paragraph; finally, encode the text paragraph with a long short-term memory network, whose last-layer output is the encoded word-vector sequence.

Step 4: apply an attention mechanism to the visual features and question features obtained in steps 1 and 2 to obtain question-related attention features; pass the visual features, geometric features, and question features from steps 1 and 2 through a relation reasoning module to output relation features; finally, fuse the attention features and relation features to generate the visual feature representation.

Step 5: input the word-vector sequence and question features obtained in steps 2 and 3 into a multi-angle semantic module to generate multi-angle semantic features.

Step 6: feed the visual features and multi-angle semantic features generated in steps 4 and 5 into a visual-semantic selection gate, which controls, by feature fusion, the contribution of the visual channel and the semantic channel to the predicted answer; a multi-way classifier then selects the answer with the highest probability as the final answer.
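The selection gate of step 6 can be sketched as follows. The patent describes the gate's role but not its equation, so the sigmoid parameterisation below (a learned gate over the concatenated channel features, blending them convexly) is an assumption, and all weights are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def visual_semantic_gate(v_feat, s_feat, W, b):
    # Gate g in (0, 1) weighs the visual channel against the semantic
    # channel per dimension; this sigmoid form is an assumed sketch,
    # not the patent's exact gate.
    g = sigmoid(np.concatenate([v_feat, s_feat]) @ W + b)
    return g * v_feat + (1.0 - g) * s_feat

rng = np.random.default_rng(0)
d = 8
v = rng.normal(size=d)            # visual channel feature
s = rng.normal(size=d)            # multi-angle semantic channel feature
W = 0.1 * rng.normal(size=(2 * d, d))
b = np.zeros(d)
h = visual_semantic_gate(v, s, W, b)
```

Because g lies in (0, 1) elementwise, each component of the fused feature stays between the corresponding visual and semantic components, which is what lets the gate adaptively favor one channel per question.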
The invention is also characterized in that:
In step 1, using the object detection module specifically comprises: obtaining object detection boxes with a Faster R-CNN model and selecting the K most relevant boxes (typically K = 36) as the important visual regions. For each selected region i, v_i is a d-dimensional visual object vector, so the input image is finally expressed as V = {v_1, v_2, …, v_K}^T. In addition, the geometric features of the input image are recorded as B = {b_1, b_2, …, b_K}^T, where b_i is built from (x_i, y_i), w_i, h_i (the center coordinates, width, and height of the selected region i) and from w, h, the width and height of the input image.
Step 2 is implemented as follows: first, each input question Q is trimmed to at most 14 words; extra words beyond 14 are simply discarded, while questions shorter than 14 words are padded with zero vectors. The 14-word question is then converted into GloVe word vectors, giving a word-embedding sequence of size 14 × 300, which is passed through a long short-term memory network (LSTM) with a d_q-dimensional hidden layer. Finally, the LSTM's final hidden state is used as the question embedding of the input question Q.
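The truncate-pad-embed-encode pipeline of step 2 can be sketched as below. The GloVe table is a toy dictionary, and a plain tanh recurrence stands in for the LSTM (an assumption made to keep the sketch self-contained); as in the patent, only the state at the last time step is kept:

```python
import numpy as np

MAX_LEN, EMB_DIM, DQ = 14, 300, 16   # DQ stands in for the patent's d_q

def embed_question(tokens, glove):
    # Trim to 14 words; pad shorter questions with zero vectors; look up
    # a (pre-trained) 300-d GloVe vector per word, zeros if unknown.
    tokens = tokens[:MAX_LEN]
    mat = np.zeros((MAX_LEN, EMB_DIM))
    for i, tok in enumerate(tokens):
        mat[i] = glove.get(tok, np.zeros(EMB_DIM))
    return mat

def final_state(seq, Wx, Wh):
    # tanh recurrence as an LSTM stand-in; return the last hidden state.
    h = np.zeros(DQ)
    for x in seq:
        h = np.tanh(x @ Wx + h @ Wh)
    return h

rng = np.random.default_rng(1)
glove = {w: rng.normal(size=EMB_DIM) for w in ["what", "color"]}  # toy lookup
E = embed_question("what color is the cat".split(), glove)
q = final_state(E, 0.05 * rng.normal(size=(EMB_DIM, DQ)),
                   0.05 * rng.normal(size=(DQ, DQ)))
```

The fixed 14 × 300 sequence shape is exactly what the patent specifies; the hidden size DQ = 16 is only for illustration.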
The text embedding in step 3 is implemented in the same way as in step 2, except that the image captions and dense captions are additionally concatenated.
The attention mechanism in step 4 specifically refers to introducing a top-down attention mechanism, with soft attention as the attention module, into the network structure to highlight the visual objects related to the question and output the attention features. The attention feature V_at is the weighted sum of all visual regions with their corresponding attention weights:

V_at = A^T · V

where A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
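The weighted sum V_at = A^T · V can be sketched directly; the scores that feed the softmax would come from the question-conditioned fusion, so the score vector here is illustrative:

```python
import numpy as np

def soft_attention(V, scores):
    # A = softmax(scores) gives the attention weights omega_1..omega_K;
    # the attended feature is the weighted sum V_at = A^T . V.
    z = scores - scores.max()           # stabilised softmax
    A = np.exp(z) / np.exp(z).sum()
    return A @ V, A

K, d = 4, 6
rng = np.random.default_rng(2)
V = rng.normal(size=(K, d))             # K region features of dimension d
V_at, A = soft_attention(V, np.array([9.0, 0.1, 0.2, 0.3]))
```

With one score dominating, the attended feature collapses onto that region's vector, which is the intended "highlight the relevant visual object" behavior.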
The relation reasoning module in step 4 encodes the relationships between image regions with two convolutional streams, generating two different types of relation features: binary relation features and multivariate relation features. The module consists of three parts: feature fusion, binary relation reasoning, and multivariate relation reasoning. The feature fusion part fuses the visual, geometric, and question features, raising and then reducing dimensionality, to generate pairwise combinations of the visual region features. The binary relation reasoning part mines the pairwise visual relations between regions with three consecutive 1 × 1 convolutional layers to generate the binary relation features. The multivariate relation reasoning part mines the within-group visual relations between regions with three consecutive 3 × 3 dilated convolutional layers to generate the multivariate relation features. Finally, the binary and multivariate relation features are combined to obtain the relation features.
The feature fusion step is as follows: first, the object features and geometric features of the K visual regions of the image are concatenated to generate the visual region features V_co = concat[V, B]; second, V_co and the question features are mapped into a low-dimensional subspace of dimension d_s:

V_m = W_v · V_co + b_v,  q_m = W_q · q + b_q

where W_v and W_q are learned parameters and b_v and b_q are biases. The visual regions are then combined in pairs: the mapped visual region features are expanded along a new axis and added to their transpose, yielding the pairwise combination V_fu of the visual region features.
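The expand-and-add-transpose construction of V_fu can be sketched with broadcasting; the addition as the pairing operation follows the "add to its transpose" description, and names are illustrative:

```python
import numpy as np

def pairwise_combine(Vm):
    # Vm: (K, ds) mapped region features. Expanding along a new axis and
    # adding the transposed copy yields V_fu, where cell (i, j) fuses
    # regions i and j.
    Vexp = Vm[:, None, :]                           # (K, 1, ds)
    return Vexp + np.transpose(Vexp, (1, 0, 2))     # (K, K, ds)

rng = np.random.default_rng(3)
Vm = rng.normal(size=(5, 4))
Vfu = pairwise_combine(Vm)
```

Every cell V_fu[i, j] equals V_m[i] + V_m[j], so the grid is symmetric by construction, which is what lets the later relation matrices be symmetrised cleanly.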
The binary relation reasoning step is as follows: three consecutive 1 × 1 convolutional layers are used, each followed by a ReLU activation layer, with the first layer having d_s channels. When the pairwise combination V_fu is fed into the binary relation reasoning module, the last layer outputs O_p; O_p is added to its transpose to obtain a symmetric matrix, and softmax finally generates the binary relation R_p:

R_p = softmax(O_p + O_p^T)
The multivariate relation reasoning step is as follows: three consecutive 3 × 3 dilated convolutional layers are used, each followed by a ReLU activation layer. The dilation rates of the three layers are 1, 2, and 4; all convolution strides are 1, and zero padding keeps each convolution's output the same size as its input. When the pairwise combination V_fu is fed into the multivariate relation reasoning module, the last convolution and ReLU layer output O_g; as in the binary relation reasoning, O_g is added to its transpose to obtain a symmetric matrix, and softmax finally generates the multivariate relation R_g:

R_g = softmax(O_g + O_g^T)
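Both relation branches end the same way: add the raw pairwise score map to its transpose, then normalise with softmax. A sketch of that tail step, with a row-wise softmax axis assumed (the patent does not state the normalisation axis):

```python
import numpy as np

def relation_from_scores(M):
    # M: (K, K) raw pairwise scores produced by the convolution stack.
    # Adding M to its transpose gives a symmetric matrix; a row-wise
    # softmax (assumed axis) turns each row into a relation distribution.
    S = M + M.T
    S = S - S.max(axis=1, keepdims=True)   # numerical stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(4)
R = relation_from_scores(rng.normal(size=(6, 6)))
```

The symmetrisation guarantees the score of (i, j) equals that of (j, i) before normalisation, matching the symmetric-matrix requirement in the text.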
the specific implementation steps of the step 4 are as follows:
first, according to multimodal fusion:
wherein 1 ∈ RdIs a vector whose elements are all 1, andrepresenting element-by-element multiplication.
wherein P ∈ RdIs a learning parameter; to obtain the attention mapping matrix, an attention weight ω for the image region iiThe following formula:
Vat=AT·V
wherein a ═ ω1,ω2,…,ωK]TIs the mapping matrix of attention.
The multi-angle semantic module in step 5 associates the question features with the caption features. Specifically: first, the cosine similarity between each caption feature t_i and the question q_j is computed by traversal, and the text features most relevant to question q_j are selected; second, the weight coefficient R_i is combined with the caption feature t_i, so that semantic information more relevant to the question receives more attention, i.e. the weighted caption feature is R_i · t_i. Then each word of the caption is encoded with a bidirectional LSTM (BiLSTM), and each word of the question is encoded with a BiLSTM as well. Finally, four methods (full fusion, average-pooling fusion, attention fusion, and max-attention fusion) improve the model's generalization in understanding semantic information.
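The cosine-similarity re-weighting of captions can be sketched as follows; the feature dimensions and vectors are illustrative:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def weight_captions(T, q):
    # R_i = cos(t_i, q) scores each caption feature against the question;
    # re-weighting t_i by R_i lets question-relevant captions dominate
    # the downstream semantic encoding.
    R = np.array([cosine(t, q) for t in T])
    return R[:, None] * T, R

rng = np.random.default_rng(5)
q = rng.normal(size=10)
T = np.stack([q, rng.normal(size=10), -q])   # first caption matches q exactly
T_weighted, R = weight_captions(T, q)
```

A caption aligned with the question keeps its full magnitude (R close to 1), while an anti-aligned one is flipped and suppressed in the weighted sum that follows.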
Step 5 is implemented as follows.

Step 5.1: associate the question features with the caption features using cosine similarity. First, the cosine similarity between each caption feature t_i and the question q_j is computed by traversal, and the text features most relevant to question q_j are selected; second, the weight coefficient R_i is combined with the caption feature t_i, so that semantic information more relevant to the question receives more attention, i.e. the weighted caption feature is R_i · t_i. Then each word of the caption is encoded with a bidirectional LSTM (BiLSTM), and each word of the question likewise. Finally, four methods (full fusion, average-pooling fusion, attention fusion, and max-attention fusion) improve the model's generalization in understanding semantic information.
Step 5.2: encode each word of the caption with a bidirectional LSTM (BiLSTM), and likewise encode each word of the question with a BiLSTM, retaining the forward and backward hidden states of the caption at the i-th time step and the forward and backward hidden states of the question at the j-th time step.
Step 5.3: adopt the four fusion strategies, namely full fusion, average-pooling fusion, attention fusion, and max-attention fusion, respectively, to capture high-level semantic information.
Full fusion passes each forward and backward word vector of the caption paragraph, together with the forward and backward final states of the whole question, into a fusion function F, yielding l-dimensional vectors that represent the forward and backward full-fusion features of the i-th caption word vector.
the average pooling fusion is to transmit the forward (or reverse) word vector characteristics of the caption paragraphs and the forward (or reverse) problem characteristics at each time step into an F function for fusion, and then to execute the average pooling operation, wherein the specific formula is as follows:
whereinIs a vector of l dimensions, and respectively represents the ith subtitleForward and reverse average pooling fusion characteristics of word vectors;
the attention fusion comprises the steps of firstly calculating a similarity coefficient between the subtitle context embedding and the problem context embedding through a cosine similarity function, then taking the similarity coefficient as a weight, embedding and multiplying each forward (or reverse) word vector of a problem, and calculating an average value, wherein a specific formula is as follows:
whereinRespectively representing the degree of similarity coefficients in the forward direction and the reverse direction,the attention vectors corresponding to the ith subtitle word vector in the forward direction and the reverse direction respectively represent the relevance of the whole problem and the word.
Finally, the attention vector and the caption context are embedded and transmitted into an F function for fusion to obtain the forward and reverse attention fusion characteristics of the ith caption word vector, and the process is as follows:
the maximum attention fusion directly embeds the problem with the maximum similarity coefficient as an attention vector, and finally embeds the attention vector and the caption into an F function for fusion; the specific formula is as follows:
the four fusion strategies in the step 5) are to record the comprehensive fusion characteristics of the ith caption obtained by cascading the generated 8 characteristic vectors as
The comprehensive fusion features are input into a bidirectional LSTM (BiLSTM), and the final hidden states in both directions are obtained; the two final hidden states are then concatenated to generate the multi-angle semantic features. Finally, to facilitate multimodal feature fusion, the multi-angle semantic features are mapped to the same dimension as the visual representation.
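The final concatenate-and-project step can be sketched as below, with the BiLSTM final states given as inputs and a linear map (an assumed form of the dimension-matching projection) taking the semantic feature to the visual dimension:

```python
import numpy as np

def multi_angle_feature(h_fwd, h_bwd, Wm, bm):
    # Concatenate the final forward and backward hidden states, then map
    # the result to the visual feature dimension so the semantic channel
    # can be fused with the visual channel downstream.
    m = np.concatenate([h_fwd, h_bwd])   # (2l,)
    return m @ Wm + bm                   # (d,)

l, d = 6, 10
rng = np.random.default_rng(7)
m_feat = multi_angle_feature(rng.normal(size=l), rng.normal(size=l),
                             rng.normal(size=(2 * l, d)), np.zeros(d))
```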
The beneficial effects of the invention are as follows:

1. Based on a multi-angle semantic understanding and adaptive dual-channel model, the invention captures the visual cues and semantic cues of an image simultaneously, and adds a gate in the late fusion stage to adaptively select visual and semantic information for answering questions, making the trained model more robust.

2. A visual relation reasoning module, comprising a binary relation reasoning module and a multivariate relation reasoning module, is adopted in the visual channel; it strengthens the model's comprehension of visual content and generalizes well to more complex visual scenes.

3. A multi-angle semantic module, comprising full fusion, average-pooling fusion, attention fusion, and max-attention fusion, is adopted in the semantic channel to generate semantic features, improving the semantics of the answers and the accuracy of the visual question-answering model.
Drawings
Fig. 1 is a diagram of a network model architecture for the method of the present invention.
FIG. 2 is a schematic diagram of a relationship inference module in the method of the present invention.
FIG. 3 is a diagram illustrating a multi-angle semantic module in the method of the present invention.
Detailed Description
The invention will be further described with reference to the embodiment shown in the drawings.
The invention relates to a visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels, which comprises the following steps of:
Step 1: preprocess the input image, and extract the visual features and geometric features of its salient regions with an object detection module. Grid features are extracted with a pre-trained ResNet-101, object regions are located with a Faster R-CNN model, 2048-dimensional target region features are extracted, and the K most relevant detection boxes (typically K = 36) are selected as the important visual regions. For each selected region i, v_i is a d-dimensional visual object vector, so the input image is finally expressed as V = {v_1, v_2, …, v_K}^T. In addition, the geometric features of the input image are recorded as B = {b_1, b_2, …, b_K}^T, where b_i is built from (x_i, y_i), w_i, h_i (the center coordinates, width, and height of the selected region i) and from w, h, the width and height of the input image.
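The geometric features b_i can be sketched as below. Normalising each box by the image width and height is an assumed (but common) convention; the patent records (x_i, y_i), w_i, h_i together with w and h without spelling out the combination:

```python
import numpy as np

def geometric_features(boxes, img_w, img_h):
    # boxes: (K, 4) rows of (x_center, y_center, box_w, box_h) in pixels.
    # Dividing by the image width/height (assumed convention) makes each
    # b_i scale-invariant and bounded in [0, 1].
    return boxes / np.array([img_w, img_h, img_w, img_h], dtype=float)

boxes = np.array([[320.0, 240.0, 100.0, 50.0],
                  [ 50.0,  60.0,  20.0,  40.0]])
B = geometric_features(boxes, img_w=640, img_h=480)
```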
Step 2: for the embedding of question text, the method of space and punctuation is used to divide the sentence into words (the number or the word based on the number is also regarded as a word); performing vectorization representation on the words by adopting a pre-trained word vector model; finally, expressing the word vectors through a long-time memory network, and acquiring the state at the last time step to obtain problem characteristics;
the implementation method comprises the following steps: each input question Q is pruned to a maximum of 14 words, simply discarding additional words beyond 14 words, while questions less than 14 words are filled with a 0 vector. The question containing 14 words is then converted into a Glove vector, the resulting words are embedded in a sequence size of 14 x 300 and passed sequentially through the hidden layer as dqLong and short time memory networks of dimensions (LSTM). Finally useThe final hidden state of (a) is a problem embedding representation of the input problem Q.
And step 3: for embedding of image subtitles and dense subtitle texts, a sentence is divided into words by using spaces and punctuations, and the length of the sentence is also set to be 14; then, the invention adopts the first 6 dense subtitles (according to the average value of the subtitle distribution) as text input, and cascades the obtained characteristics of a plurality of subtitles to convert the characteristics into a text paragraph form; and finally, using a long-time and short-time memory network to encode the text paragraphs, wherein the output of the last layer is the encoded word vector sequence.
Step 4: apply an attention mechanism to the visual features and question features obtained in steps 1 and 2 to obtain question-related attention features; pass the visual features, geometric features, and question features from steps 1 and 2 through a relation reasoning module to output relation features; finally, fuse the attention features and relation features to generate the visual feature representation.

The attention mechanism specifically comprises introducing a top-down attention mechanism, with soft attention as the attention module, into the network structure to highlight the visual objects related to the question and output the attention features. The attention feature V_at is the weighted sum of all visual regions with their corresponding attention weights:

V_at = A^T · V

where A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
The relation reasoning module consists of three parts: feature fusion, binary relation reasoning, and multivariate relation reasoning. (The relation reasoning module is an innovative point of the present invention.)
The feature fusion step is as follows: first, the object features and geometric features of the K visual regions of the image are concatenated to generate the visual region features V_co = concat[V, B]; second, V_co and the question features are mapped into a low-dimensional subspace of dimension d_s:

V_m = W_v · V_co + b_v,  q_m = W_q · q + b_q

where W_v and W_q are learned parameters and b_v and b_q are biases. To combine the visual regions in pairs, the mapped visual region features are expanded along a new axis and added to their transpose, yielding the pairwise combination V_fu of the visual region features.
The binary relation reasoning step is as follows: three consecutive 1 × 1 convolutional layers are used, each followed by a ReLU activation layer, with the first layer having d_s channels. When the pairwise combination V_fu is fed into the binary relation reasoning module, the last layer outputs O_p; O_p is added to its transpose to obtain a symmetric matrix, and softmax finally generates the binary relation R_p:

R_p = softmax(O_p + O_p^T)

The multivariate relation reasoning step is as follows: three consecutive 3 × 3 dilated convolutional layers are used, each followed by a ReLU activation layer. The dilation rates of the three layers are 1, 2, and 4; all convolution strides are 1, and zero padding keeps each convolution's output the same size as its input. When the pairwise combination V_fu is fed into the multivariate relation reasoning module, the last convolution and ReLU layer output O_g; as in the binary relation reasoning, O_g is added to its transpose to obtain a symmetric matrix, and softmax finally generates the multivariate relation R_g:

R_g = softmax(O_g + O_g^T)
The specific implementation steps of step 4 are as follows:
First, starting from the simplest bilinear multimodal fusion, the weight matrix W_i is replaced by the product of two smaller matrices H_i G_i^T (a low-rank factorization), wherein:
wherein 1 ∈ R^d is a vector whose elements are all 1, and ∘ denotes element-wise multiplication;
wherein P ∈ R^d is a learnable parameter. The attention weight ω_i for image region i, used to obtain the attention mapping matrix, is given by the following formula:
V_at = A^T · V
wherein A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
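The low-rank bilinear attention of step 4 can be sketched as follows; the shapes and the factor matrices H and G and projection vector P are illustrative assumptions, and the weighted sum at the end corresponds to V_at = A^T · V:

```python
import math

def attention_pool(V, q, H, G, P):
    """Low-rank bilinear attention (a sketch): for each region v_i,
    fuse with the question q via (H v_i) * (G q), project with P to a
    scalar score, softmax over regions, then return the weighted sum
    A^T V. All shapes here are illustrative assumptions."""
    def matvec(M, x):  # M: list of rows, x: vector
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]
    Gq = matvec(G, q)
    scores = []
    for v in V:
        Hv = matvec(H, v)
        fused = [a * b for a, b in zip(Hv, Gq)]   # element-wise product
        scores.append(sum(p * f for p, f in zip(P, fused)))
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    Z = sum(exps)
    w = [e / Z for e in exps]                      # attention weights omega_i
    d = len(V[0])
    return [sum(w[i] * V[i][k] for i in range(len(V))) for k in range(d)]

V = [[1.0, 0.0], [0.0, 1.0]]                       # two toy regions
pooled = attention_pool(V, q=[1.0],
                        H=[[1.0, 0.0], [0.0, 1.0]],
                        G=[[1.0], [1.0]],
                        P=[1.0, 1.0])
```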
The specific implementation steps of step 5 are as follows:
step 5.1: the question features are associated with the caption features. First, the cosine similarity method is used to traverse and compute the similarity between caption t_i and question q_j, and the text features most relevant to question q_j are selected. Second, the weight coefficient R_i is combined with the caption feature t_i so that semantic information more relevant to the question receives more attention, yielding the weighted caption features. Then, a bidirectional LSTM (BiLSTM) encodes each word of the caption, and a BiLSTM likewise encodes each word of the question. Finally, the four methods of complete fusion, average pooling fusion, attention fusion and maximum attention fusion are adopted to improve the model's generalization in understanding semantic information.
Step 5.2: each word of the caption is encoded using a bidirectional LSTM (BiLSTM), and each word of the question is likewise encoded with a BiLSTM:
wherein the first pair of symbols respectively denotes the hidden states of the caption's forward and backward LSTM at the i-th time step, and the second pair respectively denotes the hidden states of the question's forward and backward LSTM at the j-th time step.
Step 5.3: the present invention adopts four fusion strategies, namely complete fusion, average pooling fusion, attention fusion and maximum attention fusion, to capture high-level semantic information (this is another innovative point of the present invention).
The complete fusion strategy feeds each forward and backward word vector of the caption paragraph, together with the forward and backward final states of the whole question, into the fusion function F. The specific formula is as follows:
wherein the outputs are l-dimensional vectors respectively representing the forward and backward complete fusion features of the i-th caption word vector.
The average pooling fusion strategy feeds the forward (or backward) word vector features of the caption paragraph and the forward (or backward) question features at each time step into the function F for fusion, and then performs an average pooling operation. The specific formula is as follows:
wherein the outputs are l-dimensional vectors respectively representing the forward and backward average pooling fusion features of the i-th caption word vector.
The attention fusion strategy first computes a similarity coefficient between the caption context embedding and the question context embedding via a cosine similarity function; the similarity coefficient is then used as a weight, multiplied with each forward (or backward) word vector embedding of the question, and averaged. The specific formula is as follows:
wherein the coefficients respectively denote the forward and backward similarity coefficients, and the attention vectors corresponding to the i-th caption word vector in the forward and backward directions represent the relevance of the whole question to that word.
Finally, the attention vectors and the caption context embeddings are fed into the function F for fusion, yielding the forward and backward attention fusion features of the i-th caption word vector. The process is as follows:
The maximum attention fusion strategy directly takes the question embedding with the largest similarity coefficient as the attention vector, and finally feeds the attention vector and the caption embedding into the function F for fusion. The specific formula is as follows:
The four fusion methods yield 8 feature vectors, which are concatenated to obtain the comprehensive fusion feature of the i-th caption. The comprehensive fusion features are input into a bidirectional LSTM (BiLSTM) to obtain the final hidden states in both directions; the formula is as follows:
Second, the final hidden states at the head and tail are concatenated to generate the multi-angle semantic features. Finally, to facilitate multi-modal feature fusion, the multi-angle semantic features are mapped to the same dimension as the visual representation, with the following formula:
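A compact sketch of the four fusion strategies for a single caption word state (one direction only; the full model runs forward and backward) might look as follows. The fusion function F is not spelled out in this text, so the element-wise-product-plus-sum form used here is an assumption:

```python
import math

def cos(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def F(a, b):
    """Illustrative fusion function (an assumption): concatenation of
    the element-wise product and the element-wise sum of a and b."""
    return [x * y for x, y in zip(a, b)] + [x + y for x, y in zip(a, b)]

def fuse_caption_word(h_cap_i, H_q, q_final):
    """Sketch of the four strategies for one caption word state h_cap_i.
    H_q: per-word question states; q_final: final question state."""
    complete = F(h_cap_i, q_final)                       # complete fusion
    pooled = [sum(F(h_cap_i, h)[k] for h in H_q) / len(H_q)
              for k in range(2 * len(h_cap_i))]          # average pooling fusion
    sims = [cos(h_cap_i, h) for h in H_q]                # similarity weights
    att = [sum(s * h[k] for s, h in zip(sims, H_q)) / len(H_q)
           for k in range(len(h_cap_i))]                 # attention vector
    attention = F(h_cap_i, att)                          # attention fusion
    best = H_q[max(range(len(H_q)), key=lambda j: sims[j])]
    max_attention = F(h_cap_i, best)                     # maximum attention fusion
    return complete + pooled + attention + max_attention

out = fuse_caption_word([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
```

Concatenating the forward and backward results of all four strategies gives the 8 feature vectors that form the comprehensive fusion feature described above.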
Step 6: the visual features and the multi-angle semantic features generated in steps 4 and 5 are fed into a visual-semantic selection gate, which controls the contributions of the visual channel and the semantic channel to the predicted answer through feature fusion. The answer with the highest probability under the multi-class classifier is selected as the final answer.
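The visual-semantic selection gate can be sketched as a learned sigmoid gate that mixes the two channels element-wise; its exact form is not given in this text, so the formulation below is an assumption:

```python
import math

def selection_gate(v_feat, s_feat, w_g, b_g):
    """Sketch of an adaptive visual-semantic selection gate (an assumed
    form): a sigmoid gate g computed from the concatenated features
    decides, element-wise, how much the visual channel (g) and the
    semantic channel (1 - g) contribute to the fused representation."""
    x = v_feat + s_feat                      # concatenation [v; s]
    g = [1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w_row, x)) + b)))
         for w_row, b in zip(w_g, b_g)]
    return [gi * vi + (1.0 - gi) * si for gi, vi, si in zip(g, v_feat, s_feat)]

# With zero weights the gate is 0.5 everywhere: an even mix of channels.
fused = selection_gate([1.0, 1.0], [0.0, 0.0],
                       w_g=[[0.0] * 4, [0.0] * 4], b_g=[0.0, 0.0])
```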
In summary, on the VQA 1.0 and VQA 2.0 datasets, the present invention uses an R-CNN-LSTM framework in the visual channel combined with an attention mechanism and a relation reasoning method: the visual feature vectors and geometric feature vectors of the image are encoded by Faster R-CNN and input into the visual channel to generate the visual modality representation. In the semantic channel, the concatenated global caption and local captions are encoded by an LSTM network and passed through the multi-angle semantic module to output the semantic modality representation. Finally, the obtained visual and semantic modality representations are input into an adaptive selection gate, which decides which modality's cues to use for predicting the answer.
The innovative points are as follows. First, a relation reasoning module comprising binary relation reasoning and multivariate relation reasoning is adopted in the visual channel; it enhances the model's understanding of visual content and gives the model strong generalization ability in more complex visual scenes. Second, a multi-angle semantic module comprising the complete fusion, average pooling fusion, attention fusion and maximum attention fusion methods is adopted in the semantic channel to generate semantic features; it improves the accuracy of the visual question-answering model while making the answers more semantically meaningful.
Simulation experiments and experimental results:
1. data set
The model was evaluated on two public visual question-answering datasets, VQA 1.0 and VQA 2.0. VQA 1.0 was created based on the MSCOCO image dataset [38]; its training set contains 248349 questions and 82783 pictures, the validation set contains 121512 questions and 40504 pictures, and the test set contains 244302 questions and 81434 pictures. VQA 2.0 is an iterative version of VQA 1.0 that adds more question samples so that the language bias is more balanced than in VQA 1.0. The training set of VQA 2.0 contains 443757 questions and 82783 pictures, the validation set contains 214354 questions and 40504 pictures, and the test set contains 447793 questions and 81434 pictures. There are three question types: yes/no, number, and other; the other type accounts for about half of all samples. The proposed model is trained on the training and validation sets, and, to ensure a fair comparison with other work, test results are reported on the test-development set (test-dev) and the test-standard set (test-standard).
2. Experimental Environment
The proposed model is implemented with the PyTorch library, and the test experiments are completed on a GPU server. The server is configured with 256 GB of RAM and 4 Nvidia 1080Ti GPUs with 64 GB of video memory in total. The model is trained with the Adam optimizer, with a maximum of 40 epochs and a batch size of 256. The learning rate is set to 1e-3 in the first training epoch, 2e-3 in the second, and 3e-3 from the third epoch, held until the tenth epoch; thereafter the learning rate decays once every two epochs with a decay rate of 0.5. To prevent gradient explosion, a gradient clipping method is also adopted, updating the gradient value of each period to one quarter of its original value. To prevent overfitting, a dropout layer with a dropout ratio of 0.5 is applied after each fully connected layer.
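The learning-rate schedule described above can be written as a small helper; note that "decays once every two epochs" is ambiguous about the first decay epoch, so the choice below (first halving at epoch 11) is an assumption:

```python
def learning_rate(epoch):
    """Sketch of the schedule described in the text (epochs 1-indexed):
    warm up 1e-3 -> 2e-3 -> 3e-3 over the first three epochs, hold
    3e-3 until epoch 10, then halve once every two epochs."""
    if epoch == 1:
        return 1e-3
    if epoch == 2:
        return 2e-3
    if epoch <= 10:
        return 3e-3
    # Decay by 0.5 once every two epochs after epoch 10 (assumed phase).
    decays = (epoch - 10 + 1) // 2
    return 3e-3 * (0.5 ** decays)
```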
3. Results and analysis of the experiments
TABLE 1 Performance of models on the VQA 1.0 test-development and test-standard sets
Table 1 compares the performance of various advanced models with the model proposed herein; the results shown are obtained after the models are trained on the training and validation sets. The performance of the proposed model is clearly superior to the other models on most metrics, with overall accuracies of 69.44% and 69.37% on the test-development and test-standard sets, respectively. On the test-development set, the overall accuracy improves by 5.64% over the MAN model, which uses a memory-augmented neural network, and by 0.73% over the best-performing VSDC model. The VSDC model also adopts the idea of semantics-guided prediction and uses an attention mechanism on the semantic side to acquire question-relevant semantic information. In addition to the semantic attention mechanism, the present invention adds three further fusion methods to improve the model's multi-angle semantic understanding, and the experimental results show that the multi-angle semantic module in the semantic channel is significant for improving prediction accuracy. The proposed model also performs well on the test-standard set.
TABLE 2 Performance of models on the VQA 2.0 test-development and test-standard sets
As shown in Table 2, the performance of the model is further verified on the VQA 2.0 dataset, covering the test-development and test-standard sets. Compared with advanced methods, the proposed model performs well on metrics such as overall accuracy. Compared with the MuRel [49] model, the overall accuracy improves by 1.22% and 0.89% on the test-development and test-standard sets, respectively. MuRel is a prominent model among existing multi-modal relation modeling methods, a network structure that uses residual features for end-to-end reasoning. The proposed method outperforms it because the semantic channel guides answer prediction, allowing the model to exploit a large amount of semantic information to improve prediction accuracy. In addition, compared with the VCTREE model, which combines reinforcement learning and supervised learning in parallel and is currently among the better-performing visual question-answering methods, the proposed method has clear advantages in metrics such as overall accuracy. In summary, comparison with advanced methods shows that the proposed model can better mine semantic information on the basis of understanding image content, improving the accuracy of answer prediction.
Claims (10)
1. The visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels is characterized by comprising the following steps of:
step 1: preprocessing an input image, and extracting visual features and geometric features of the salient regions in the input image by using an object detection module;
step 2: for embedding of the question text, segmenting the sentence into words using spaces and punctuation; vectorizing the words with a pre-trained word vector model; finally, passing the word vectors through a long short-term memory network and taking the state at the last time step as the question features;
step 3: for embedding of the image captions and dense caption texts, segmenting the sentences into words using spaces and punctuation; then concatenating the obtained multiple caption features and converting them into the form of a text paragraph; finally, encoding the text paragraph with a long short-term memory network, the output of the last layer being the encoded word vector sequence;
step 4: applying an attention mechanism to the visual features and question features obtained in steps 1 and 2 to obtain question-related attention features; passing the visual features, geometric features and question features obtained in steps 1 and 2 through a relation reasoning module to output relation features; finally, fusing the attention features and relation features to generate the visual feature representation;
step 5: inputting the word vector sequence and question features obtained in steps 2 and 3 into a multi-angle semantic module to generate multi-angle semantic features;
step 6: feeding the visual features and multi-angle semantic features generated in steps 4 and 5 into a visual-semantic selection gate, which controls the contributions of the visual channel and semantic channel to the predicted answer through feature fusion; the answer with the highest probability under the multi-class classifier is selected as the final answer.
2. The visual question answering method based on multi-angle semantic understanding and self-adaptive two-channel according to claim 1, characterized in that: in step 1), using the object detection module specifically comprises: obtaining object detection boxes with a Faster R-CNN model, and selecting the K most relevant detection boxes as the important visual regions; for each selected region i, v_i is a d-dimensional visual object vector, and the input image is finally expressed as V = {v_1, v_2, …, v_K}^T; in addition, the geometric features of the input image are also recorded as B = {b_1, b_2, …, b_K}^T, wherein (x_i, y_i), w_i, h_i respectively represent the center coordinates, width and height of the selected region i, and w, h respectively represent the width and height of the input image.
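As an illustration of the geometric features above, one common construction normalizes the box geometry by the image size; this is an assumption, since the exact formula for b_i appears only in an elided equation of the original:

```python
def geometric_feature(box, img_w, img_h):
    """Build a geometric feature b_i for a detected region.
    box = (x, y, w_i, h_i): center coordinates, width and height.
    Normalizing by the image size is an assumption; the patent's exact
    formula for b_i is given only in an elided equation."""
    x, y, wi, hi = box
    return [x / img_w, y / img_h, wi / img_w, hi / img_h]
```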
3. The visual question answering method based on multi-angle semantic understanding and self-adaptive two-channel according to claim 2, is characterized in that: the step 2) is implemented according to the following steps:
firstly, each input question Q is trimmed to at most 14 words; extra words beyond 14 are simply discarded, while questions with fewer than 14 words are padded with 0 vectors; the question containing 14 words is then converted into GloVe word vectors, and the resulting word embedding sequence of size 14 × 300 is passed sequentially through a long short-term memory network whose hidden layer has dimension d_q; finally, the final hidden state of the LSTM is used as the question embedding representation of the input question Q;
the text embedding in step 3) is implemented in the same way as in step 2), apart from the concatenation of the image captions and the dense captions.
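The question-embedding preprocessing of claim 3 (trim or pad to 14 words, then look up 300-dimensional GloVe vectors) can be sketched as follows; `glove` here is a hypothetical word-to-vector dictionary:

```python
def embed_question(tokens, glove, max_len=14, dim=300):
    """Truncate/pad the question to 14 words and look up GloVe vectors,
    yielding a 14 x 300 embedding sequence for the LSTM. `glove` is a
    hypothetical word -> vector dict; unknown words map to zeros."""
    zero = [0.0] * dim
    tokens = tokens[:max_len]                 # discard words beyond 14
    seq = [glove.get(t, zero) for t in tokens]
    seq += [zero] * (max_len - len(seq))      # pad short questions with 0 vectors
    return seq

glove = {"what": [1.0] * 300}                 # toy embedding table
seq = embed_question(["what", "is", "this"], glove)
```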
4. The visual question answering method based on multi-angle semantic understanding and self-adaptive two-channel according to claim 3, wherein: the attention mechanism in step 4) specifically refers to: introducing a top-down attention mechanism, with a soft attention method as the attention module, into the network structure to highlight the visual objects related to the question and output the attention features; the attention feature is expressed as the weighted sum over all visual regions:
V_at = A^T · V
wherein A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix;
the relation reasoning module in step 4) specifically comprises: encoding the relations between image regions by means of two convolutional streams, generating two different types of relation features, namely binary relation features and multivariate relation features; the relation reasoning module consists of three parts: feature fusion, binary relation reasoning and multivariate relation reasoning; the feature fusion module fuses the visual features, geometric features and question features by raising and reducing dimensions to generate pairwise combinations of the visual region features; the binary relation reasoning module is responsible for mining the pairwise visual relations between visual regions and generates the binary relation features via three consecutive 1 × 1 convolutional layers; the multivariate relation reasoning module is responsible for mining the within-group visual relations among visual regions and generates the multivariate relation features via three consecutive 3 × 3 dilated convolutional layers; finally, the binary relation features and multivariate relation features are combined to obtain the relation features.
5. The visual question answering method based on multi-angle semantic understanding and self-adaptive two-channel according to claim 4, characterized in that: the feature fusion step is as follows: first, the object features and the geometric features of the K visual regions of the image are concatenated to generate the visual region features V_co = concat[V, B]; second, the visual region features V_co and the question features are mapped into a low-dimensional subspace:
wherein W_v and W_q are learnable parameters, b_q and b_v are biases, and d_s is the dimension of the subspace;
the binary relation reasoning uses three consecutive 1 × 1 convolutional layers, each followed by a ReLU activation layer; the numbers of channels of the three 1 × 1 convolutional layers are d_s, … and …; the pairwise combination V_fu of the visual regions is input into the binary relation reasoning module, the output of the last layer is added to its transpose to obtain a symmetric matrix, and softmax is finally applied to generate the binary relation R_p; the specific formula is as follows:
the steps of the multivariate relation reasoning are as follows: three consecutive 3 × 3 dilated convolutional layers are used, each followed by a ReLU activation layer; the dilation rates of the three dilated convolutional layers are 1, 2 and 4, respectively; all convolution strides are 1, and zero padding is used so that the output of each convolution has the same size as its input; the pairwise combination V_fu of the visual regions is input into the multivariate relation reasoning module, and, by the same reasoning as for the binary relation, the output of the last convolutional layer and ReLU activation layer is added to its transpose to obtain a symmetric matrix; softmax is finally applied to generate the multivariate relation R_g; the formula is as follows:
6. The visual question answering method based on multi-angle semantic understanding and self-adaptive two-channel according to claim 5, wherein: the specific implementation steps of step 4 are as follows:
first, according to the simplest bilinear multimodal fusion, the weight matrix W_i is replaced by two smaller matrices H_i G_i^T:
wherein 1 ∈ R^d is a vector whose elements are all 1, and ∘ denotes element-wise multiplication;
wherein P ∈ R^d is a learnable parameter; to obtain the attention mapping matrix, the attention weight ω_i for image region i is given by the following formula:
V_at = A^T · V
wherein A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
7. The visual question answering method based on multi-angle semantic understanding and self-adaptive two-channel according to claim 6, characterized in that: the multi-angle semantic module in step 5) associates the question features with the caption features; the specific method is as follows: first, the cosine similarity method is used to traverse and compute the similarity between caption t_i and question q_j, and the text features most relevant to question q_j are selected; second, the weight coefficient R_i is combined with the caption feature t_i so that semantic information more relevant to the question receives more attention, yielding the weighted caption features; then, a bidirectional LSTM (BiLSTM) encodes each word of the caption, and a BiLSTM likewise encodes each word of the question; finally, the four methods of complete fusion, average pooling fusion, attention fusion and maximum attention fusion are adopted to improve the model's generalization in understanding semantic information.
8. The visual question answering method based on multi-angle semantic understanding and self-adaptive two-channel according to claim 7, characterized in that: the specific implementation steps of step 5 are as follows:
step 5.1: the question features are associated with the caption features: first, the cosine similarity method is used to traverse and compute the similarity between caption t_i and question q_j, and the text features most relevant to question q_j are selected; second, the weight coefficient R_i is combined with the caption feature t_i so that semantic information more relevant to the question receives more attention, yielding the weighted caption features; then, a bidirectional LSTM (BiLSTM) encodes each word of the caption, and a BiLSTM likewise encodes each word of the question; finally, the four methods of complete fusion, average pooling fusion, attention fusion and maximum attention fusion are adopted to improve the model's generalization in understanding semantic information;
step 5.2: each word of the caption is encoded using a bidirectional LSTM (BiLSTM), and each word of the question is likewise encoded with a BiLSTM:
wherein the first pair of symbols respectively denotes the hidden states of the caption's forward and backward LSTM at the i-th time step, and the second pair respectively denotes the hidden states of the question's forward and backward LSTM at the j-th time step;
step 5.3: the four fusion strategies of complete fusion, average pooling fusion, attention fusion and maximum attention fusion are respectively adopted to capture high-level semantic information.
9. The visual question answering method based on multi-angle semantic understanding and self-adaptive two-channel according to claim 8, characterized in that: the complete fusion feeds each forward and backward word vector of the caption paragraph, together with the forward and backward final states of the whole question, into the function F for fusion; the specific formula is as follows:
wherein the outputs are l-dimensional vectors respectively representing the forward and backward complete fusion features of the i-th caption word vector;
the average pooling fusion feeds the forward (or backward) word vector features of the caption paragraph and the forward (or backward) question features at each time step into the function F for fusion, and then performs an average pooling operation; the specific formula is as follows:
wherein the outputs are l-dimensional vectors respectively representing the forward and backward average pooling fusion features of the i-th caption word vector;
the attention fusion first computes a similarity coefficient between the caption context embedding and the question context embedding via a cosine similarity function; the similarity coefficient is then used as a weight, multiplied with each forward (or backward) word vector embedding of the question, and averaged; the specific formula is as follows:
wherein the coefficients respectively denote the forward and backward similarity coefficients, and the attention vectors corresponding to the i-th caption word vector in the forward and backward directions represent the relevance of the whole question to that word;
finally, the attention vectors and the caption context embeddings are fed into the function F for fusion, yielding the forward and backward attention fusion features of the i-th caption word vector; the process is as follows:
the maximum attention fusion directly takes the question embedding with the largest similarity coefficient as the attention vector, and finally feeds the attention vector and the caption embedding into the function F for fusion; the specific formula is as follows:
10. The visual question answering method based on multi-angle semantic understanding and self-adaptive two-channel according to claim 9, characterized in that: in the four fusion methods of step 5), the 8 generated feature vectors are concatenated to obtain the comprehensive fusion feature of the i-th caption;
the comprehensive fusion features are input into a bidirectional LSTM (BiLSTM) to obtain the final hidden states in both directions; the formula is as follows:
second, the final hidden states at the head and tail are concatenated to generate the multi-angle semantic features; finally, to facilitate multi-modal feature fusion, the multi-angle semantic features are mapped to the same dimension as the visual representation, with the following formula:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210223976.XA CN114661874B (en) | 2022-03-07 | 2022-03-07 | Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210223976.XA CN114661874B (en) | 2022-03-07 | 2022-03-07 | Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114661874A true CN114661874A (en) | 2022-06-24 |
CN114661874B CN114661874B (en) | 2024-04-30 |
Family
ID=82028726
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210223976.XA Active CN114661874B (en) | 2022-03-07 | 2022-03-07 | Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114661874B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115730059A (en) * | 2022-12-08 | 2023-03-03 | 安徽建筑大学 | Visual question answering method, device, equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170124432A1 (en) * | 2015-11-03 | 2017-05-04 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering |
CN110163299A (en) * | 2019-05-31 | 2019-08-23 | 合肥工业大学 | A kind of vision answering method based on bottom-up attention mechanism and memory network |
CN110222770A (en) * | 2019-06-10 | 2019-09-10 | 成都澳海川科技有限公司 | A kind of vision answering method based on syntagmatic attention network |
CN110647612A (en) * | 2019-09-18 | 2020-01-03 | 合肥工业大学 | Visual conversation generation method based on double-visual attention network |
CN110717431A (en) * | 2019-09-27 | 2020-01-21 | 华侨大学 | Fine-grained visual question and answer method combined with multi-view attention mechanism |
US20200293921A1 (en) * | 2019-03-12 | 2020-09-17 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Visual question answering model, electronic device and storage medium |
KR20210056071A (en) * | 2019-11-08 | 2021-05-18 | 경기대학교 산학협력단 | System for visual dialog using deep visual understanding |
CN113886626A (en) * | 2021-09-14 | 2022-01-04 | 西安理工大学 | Visual question-answering method of dynamic memory network model based on multiple attention mechanism |
-
2022
- 2022-03-07 CN CN202210223976.XA patent/CN114661874B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170124432A1 (en) * | 2015-11-03 | 2017-05-04 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering |
US20200293921A1 (en) * | 2019-03-12 | 2020-09-17 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Visual question answering model, electronic device and storage medium |
CN110163299A (en) * | 2019-05-31 | 2019-08-23 | 合肥工业大学 | A kind of vision answering method based on bottom-up attention mechanism and memory network |
CN110222770A (en) * | 2019-06-10 | 2019-09-10 | 成都澳海川科技有限公司 | A kind of vision answering method based on syntagmatic attention network |
CN110647612A (en) * | 2019-09-18 | 2020-01-03 | 合肥工业大学 | Visual conversation generation method based on double-visual attention network |
CN110717431A (en) * | 2019-09-27 | 2020-01-21 | 华侨大学 | Fine-grained visual question and answer method combined with multi-view attention mechanism |
KR20210056071A (en) * | 2019-11-08 | 2021-05-18 | 경기대학교 산학협력단 | System for visual dialog using deep visual understanding |
CN113886626A (en) * | 2021-09-14 | 2022-01-04 | 西安理工大学 | Visual question-answering method of dynamic memory network model based on multiple attention mechanism |
Non-Patent Citations (3)
Title |
---|
孟祥申;江爱文;刘长红;叶继华;王明文;: "基于Spatial-DCTHash动态参数网络的视觉问答算法", 中国科学:信息科学, no. 08, 20 August 2017 (2017-08-20) * |
王鑫: "基于视觉语义双通道的视觉问答算法研究", 《中国优秀硕士学位论文全文数据库》, 15 February 2023 (2023-02-15) * |
闫茹玉;刘学亮;: "结合自底向上注意力机制和记忆网络的视觉问答模型", 中国图象图形学报, no. 05, 16 May 2020 (2020-05-16) * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115730059A (en) * | 2022-12-08 | 2023-03-03 | 安徽建筑大学 | Visual question answering method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN114661874B (en) | 2024-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110490946B (en) | Text image generation method based on cross-modal similarity and antagonism network generation | |
CN108959396B (en) | Machine reading model training method and device and question and answer method and device | |
CN108804530B (en) | Subtitling areas of an image | |
CN110991290B (en) | Video description method based on semantic guidance and memory mechanism | |
CN113297370B (en) | End-to-end multi-modal question-answering method and system based on multi-interaction attention | |
CN109740158B (en) | Text semantic parsing method and device | |
CN113569001A (en) | Text processing method and device, computer equipment and computer readable storage medium | |
CN114443899A (en) | Video classification method, device, equipment and medium | |
CN112214996A (en) | Text abstract generation method and system for scientific and technological information text | |
CN113626589A (en) | Multi-label text classification method based on mixed attention mechanism | |
CN116975350A (en) | Image-text retrieval method, device, equipment and storage medium | |
CN113392265A (en) | Multimedia processing method, device and equipment | |
CN114943921A (en) | Video text description method fusing multi-granularity video semantic information | |
CN114661874A (en) | Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels | |
CN113609326A (en) | Image description generation method based on external knowledge and target relation | |
CN117058276A (en) | Image generation method, device, equipment and storage medium | |
CN115223086B (en) | Cross-modal action positioning method and system based on interactive attention guidance and correction | |
CN114511813B (en) | Video semantic description method and device | |
CN116028888A (en) | Automatic problem solving method for plane geometry mathematics problem | |
CN115169472A (en) | Music matching method and device for multimedia data and computer equipment | |
CN115203388A (en) | Machine reading understanding method and device, computer equipment and storage medium | |
CN117668213B (en) | Chaotic engineering abstract generation method based on cascade extraction and graph comparison model | |
CN117648429B (en) | Question-answering method and system based on multi-mode self-adaptive search type enhanced large model | |
CN115858791B (en) | Short text classification method, device, electronic equipment and storage medium | |
US20240177507A1 (en) | Apparatus and method for generating text from image and method of training model for generating text from image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |