CN114661874A - Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels - Google Patents

Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels

Info

Publication number: CN114661874A
Application number: CN202210223976.XA
Authority: CN (China)
Prior art keywords: visual, fusion, features, attention, caption
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN114661874B (en)
Inventors: 王鑫 (Wang Xin), 陈巧红 (Chen Qiaohong)
Current and original assignee: Zhejiang Sci-Tech University (ZSTU)
Application filed by Zhejiang Sci-Tech University (ZSTU); priority to CN202210223976.XA; publication of CN114661874A; application granted; publication of CN114661874B

Classifications

    • G06F16/3329: Information retrieval; querying of unstructured textual data; natural language query formulation or dialogue systems
    • G06F18/2415: Pattern recognition; classification techniques based on parametric or probabilistic models
    • G06F40/35: Handling natural language data; semantic analysis; discourse or dialogue representation
    • G06N3/044: Neural networks; architecture; recurrent networks, e.g. Hopfield networks
    • G06N3/045: Neural networks; architecture; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06Q10/06393: Administration; performance analysis; score-carding, benchmarking or key performance indicator [KPI] analysis

Abstract

The invention belongs to the technical field of cross-modal tasks combining computer vision and natural language processing. The technical scheme is a visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels, comprising the following steps. Step 1: preprocess the input image and extract visual features and geometric features of the salient regions with an object detection module. Step 2: for the embedding of the question text, split the sentence into words using spaces and punctuation (a number, or a word built on a number, is also treated as a single word); represent the words as vectors with a pre-trained word-vector model; finally, pass the word vectors through a long short-term memory (LSTM) network and take the state at the last time step as the question feature. The method makes the trained model more robust, generalizes well to more complex visual scenes, improves the semantic quality of the answers, and improves the accuracy of the visual question-answering model.

Description

Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels
Technical Field
The invention belongs to the technical field of cross-modal tasks combining computer vision and natural language processing, and particularly relates to a visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels.
Background
Visual question answering is a task that requires understanding visual content, semantic information and cross-modal relationships at the same time. A great deal of prior work has developed backbone models within the single domains of machine vision or natural language processing, with far-reaching impact. By combining the two fields, visual question answering, one branch of the cross-modal field, has great potential in wide-ranging applications such as visual navigation and remote monitoring.
At present a variety of image algorithms have been applied to visual question answering with excellent performance. The mainstream methods fall roughly into two categories: multimodal-fusion algorithms and attention-based algorithms. Multimodal-fusion algorithms are built on a CNN-RNN structure and fuse visual and textual features into a unified representation for predicting answers. Attention-mechanism algorithms address visual-language interaction by singling out the information in the image that is relevant to the question. However, neither multimodal fusion nor the attention mechanism alone combines text information and image information effectively: existing visual question-answering models neither attend to the object-relation information of the picture nor acquire high-level semantic information, while the task itself requires answering different types of questions and extracting effective semantic information from the picture. A model should therefore attend more closely to the object-relation information of the picture, be able to match the corresponding answer from the caption according to the question, pay more attention to the high-level semantics of the picture, and remain robust when matching answers from captions.
Disclosure of Invention
The invention aims to overcome the defects of the background art and provides a visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels. The method makes the trained model more robust, generalizes well to more complex visual scenes, improves the semantic quality of the answers, and improves the accuracy of the visual question-answering model.
The technical scheme adopted by the invention is a visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels, comprising the following steps:
Step 1: preprocess the input image and extract visual features and geometric features of the salient regions in the input image with an object detection module;
Step 2: for the embedding of the question text, split the sentence into words using spaces and punctuation (a number, or a word built on a number, is also treated as a single word); represent the words as vectors with a pre-trained word-vector model; finally, pass the word vectors through a long short-term memory (LSTM) network and take the state at the last time step as the question feature;
Step 3: for the embedding of the image caption and the dense caption text, split the sentences into words using spaces and punctuation; then concatenate the obtained caption features and convert them into a text paragraph; finally, encode the text paragraph with an LSTM, the output of the last layer being the encoded word-vector sequence;
Step 4: apply an attention mechanism to the visual features and question features obtained in steps 1 and 2 to obtain attention features related to the question; feed the visual features, geometric features and question features obtained in steps 1 and 2 through a relation reasoning module to output relation features; finally, fuse the attention features and the relation features to generate the visual feature representation;
Step 5: input the word-vector sequence and question features obtained in steps 2 and 3 into a multi-angle semantic module to generate multi-angle semantic features;
Step 6: feed the visual features and multi-angle semantic features generated in steps 4 and 5 into a visual-semantic selection gate, which controls the contributions of the visual channel and the semantic channel to the predicted answer through feature fusion; the answer with the highest probability from the multi-class classifier is selected as the final answer.
The invention is also characterized in that:
In step 1, using the object detection module specifically includes: a Faster R-CNN model is used to obtain object detection boxes, and the K most relevant boxes (typically K = 36) are selected as important visual regions. For each selected region $i$, $v_i \in \mathbb{R}^d$ is a $d$-dimensional visual object vector, so the input image is finally represented as
$$V = \{v_1, v_2, \dots, v_K\}^T.$$
In addition, the geometric features of the input image are recorded as $B = \{b_1, b_2, \dots, b_K\}^T$, where $b_i$ is computed from $(x_i, y_i)$, $w_i$, $h_i$, the center coordinates, width and height of the selected region $i$, normalized by the width $w$ and height $h$ of the input image (formula image).
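As a concrete illustration of step 1, the following PyTorch sketch assembles the visual features V and the geometric features B, assuming the Faster R-CNN region features and boxes are already available. The dimensions (K = 36, d = 2048) follow the description, while the exact normalization of $b_i$ is an assumption, since the original formula is only given as an image.

```python
import torch

def build_region_features(region_feats, boxes, image_w, image_h):
    """Assemble V and B as described in step 1.

    region_feats: (K, d) visual object vectors from Faster R-CNN.
    boxes: (K, 4) boxes as (x_center, y_center, box_w, box_h) in pixels.
    The normalization of the geometric feature b_i is an assumption:
    center and size are divided by the image width/height.
    """
    V = region_feats                                    # V = {v_1, ..., v_K}
    x, y, w, h = boxes.unbind(dim=1)
    B = torch.stack([x / image_w, y / image_h,
                     w / image_w, h / image_h], dim=1)  # assumed 4-d geometric feature
    return V, B

# Toy usage with the dimensions mentioned in the text (K = 36, d = 2048).
V, B = build_region_features(torch.randn(36, 2048),
                             torch.rand(36, 4) * 300, image_w=640, image_h=480)
print(V.shape, B.shape)  # torch.Size([36, 2048]) torch.Size([36, 4])
```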
Step 2 is implemented as follows: first, each input question Q is truncated to at most 14 words; extra words beyond 14 are simply discarded, while questions with fewer than 14 words are padded with zero vectors. The 14-word question is then converted into GloVe word vectors, giving a word-embedding sequence of size 14 × 300, which is passed sequentially through a long short-term memory (LSTM) network with hidden size $d_q$. Finally, the final hidden state $q \in \mathbb{R}^{d_q}$ of the LSTM is used as the question embedding of the input question Q.
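The question-embedding pipeline of step 2 can be sketched as follows; the GloVe table is replaced by a randomly initialized nn.Embedding as a stand-in, and the vocabulary size and hidden size $d_q$ are placeholder values.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Sketch of step 2: 14-word question -> GloVe-like 300-d embeddings -> LSTM,
    keeping the final hidden state as the question feature q.
    Vocabulary size and d_q are placeholder values."""
    def __init__(self, vocab_size=20000, d_q=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300, padding_idx=0)  # stands in for GloVe
        self.lstm = nn.LSTM(300, d_q, batch_first=True)

    def forward(self, token_ids):              # token_ids: (batch, 14), 0 = padding
        emb = self.embed(token_ids)            # (batch, 14, 300)
        _, (h_n, _) = self.lstm(emb)           # h_n: (1, batch, d_q)
        return h_n.squeeze(0)                  # question feature q, (batch, d_q)

def pad_or_truncate(tokens, max_len=14):
    """Truncate to 14 tokens; shorter questions are padded with id 0 (zero vector)."""
    tokens = tokens[:max_len]
    return tokens + [0] * (max_len - len(tokens))

q = QuestionEncoder()(torch.tensor([pad_or_truncate([5, 8, 2, 9])]))
print(q.shape)  # torch.Size([1, 1024])
```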
The text embedding in step 3 is implemented in the same way as the text embedding in step 2, apart from the concatenation of the image caption and the dense captions.
The attention mechanism in step 4 specifically refers to: a top-down attention mechanism is introduced, and a soft-attention method is inserted into the network structure as the attention module to highlight the visual objects related to the question and output the attention feature. The attention feature $V^{at} \in \mathbb{R}^d$ is the weighted sum of all visual regions:
$$V^{at} = A^T \cdot V,$$
where $A = [\omega_1, \omega_2, \dots, \omega_K]^T$ is the attention mapping matrix.
The relation reasoning module in step 4 specifically encodes the relations between image regions with two convolution streams, generating two different types of relation features: binary relation features and multivariate relation features. The relation reasoning module consists of three parts: feature fusion, binary relation reasoning and multivariate relation reasoning. The feature fusion module fuses the visual features, geometric features and question features by raising and reducing dimensions to generate pairwise combinations of the visual region features; the binary relation reasoning module mines the pairwise visual relations between visual regions with three consecutive 1 × 1 convolutional layers to generate binary relation features; the multivariate relation reasoning module mines the within-group visual relations between visual regions with three consecutive 3 × 3 dilated convolutional layers to generate multivariate relation features. Finally, the binary relation features and the multivariate relation features are combined to obtain the relation features.
The feature fusion step is as follows: first, the object features and geometric features of the K visual regions of the image are concatenated to generate the visual region features $V_{co} = \mathrm{concat}[V, B]$. Second, the visual region features $V_{co}$ and the question features are mapped into a low-dimensional subspace of dimension $d_s$ (formula image), where $W_v$ and $W_q$ are learnable parameters and $b_v$ and $b_q$ are biases. To combine the visual regions in pairs, the mapped visual region features are expanded and added to their transpose, yielding the pairwise combination $V_{fu}$ of the visual region features.
The binary relation reasoning steps are as follows: three consecutive 1 × 1 convolutional layers are used, each followed by a ReLU activation layer; their channel numbers are $d_s$ and two further values given by formula images. When the pairwise combination of visual region features $V_{fu}$ is fed into the binary relation reasoning module, the output of the last layer, denoted $M_p$, is added to its transpose to obtain a symmetric matrix, and softmax finally generates the binary relation $R_p$:
$$R_p = \mathrm{softmax}\!\left(M_p + M_p^T\right).$$
The multivariate relation reasoning steps are as follows: three consecutive 3 × 3 dilated convolutional layers are used, each followed by a ReLU activation layer; the dilation rates of the three layers are 1, 2 and 4 respectively. All convolution strides are 1, and zero padding is used so that the output of each convolution has the same size as its input. When the pairwise combination $V_{fu}$ is fed into the multivariate relation reasoning module, the output of the last convolutional layer and ReLU activation layer, denoted $M_g$, is added to its transpose, as in the binary relation reasoning, to obtain a symmetric matrix, and softmax finally generates the multivariate relation $R_g$:
$$R_g = \mathrm{softmax}\!\left(M_g + M_g^T\right).$$
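A minimal sketch of the two relation-reasoning streams is given below. The intermediate channel widths and the single-channel output of each stream are assumptions (the original channel numbers are only given as formula images); the dilation rates 1, 2, 4, the stride-1 zero-padded convolutions and the symmetrize-then-softmax step follow the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationReasoning(nn.Module):
    """Sketch of the relation reasoning module. Channel widths and the final
    1-channel output of each stream are assumptions; the symmetrization plus
    softmax follows the description (R = softmax(M + M^T))."""
    def __init__(self, d_s=512):
        super().__init__()
        # binary stream: three 1x1 convolutions, each followed by ReLU
        self.binary = nn.Sequential(
            nn.Conv2d(d_s, d_s, 1), nn.ReLU(),
            nn.Conv2d(d_s, d_s // 2, 1), nn.ReLU(),
            nn.Conv2d(d_s // 2, 1, 1), nn.ReLU())
        # multivariate stream: three 3x3 dilated convolutions (dilation 1, 2, 4),
        # stride 1, zero padding so the spatial size K x K is preserved
        self.multi = nn.Sequential(
            nn.Conv2d(d_s, d_s, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv2d(d_s, d_s // 2, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(d_s // 2, 1, 3, padding=4, dilation=4), nn.ReLU())

    @staticmethod
    def _symmetric_softmax(m):
        # m: (batch, 1, K, K) -> symmetrize and normalize each row with softmax
        m = m.squeeze(1)
        m = m + m.transpose(1, 2)
        return F.softmax(m, dim=-1)

    def forward(self, v_fu):                                # v_fu: (batch, d_s, K, K)
        r_p = self._symmetric_softmax(self.binary(v_fu))    # binary relation R_p
        r_g = self._symmetric_softmax(self.multi(v_fu))     # multivariate relation R_g
        return r_p, r_g

r_p, r_g = RelationReasoning()(torch.randn(2, 512, 36, 36))
print(r_p.shape, r_g.shape)  # torch.Size([2, 36, 36]) torch.Size([2, 36, 36])
```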
Step 4 is implemented as follows:
First, multimodal fusion is performed according to
$$f_i = v_i^T W_i q = \mathbf{1}^T\!\left(H_i^T v_i \circ G_i^T q\right),$$
where $\mathbf{1} \in \mathbb{R}^d$ is a vector whose elements are all 1 and $\circ$ denotes element-wise multiplication.
Second, the same mapping matrices $H$ and $G$ are used for all image regions:
$$f_i = P^T\!\left(H^T v_i \circ G^T q\right),$$
where $P \in \mathbb{R}^d$ is a learning parameter. To obtain the attention mapping matrix, the attention weight $\omega_i$ of image region $i$ is computed as
$$\omega_i = \frac{\exp(f_i)}{\sum_{k=1}^{K} \exp(f_k)}.$$
Thus the attention feature $V^{at}$, the weighted sum of all visual regions, is expressed as
$$V^{at} = A^T \cdot V,$$
where $A = [\omega_1, \omega_2, \dots, \omega_K]^T$ is the attention mapping matrix.
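The attention computation of step 4 can be sketched as the following low-rank bilinear module; the rank r and the feature dimensions are placeholder values, and the softmax over regions follows the reconstruction above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankBilinearAttention(nn.Module):
    """Sketch of the question-guided soft attention of step 4, using the
    low-rank bilinear form P^T(H^T v_i o G^T q) with H, G shared over all
    regions; the rank r is a placeholder."""
    def __init__(self, d_v=2048, d_q=1024, r=512):
        super().__init__()
        self.H = nn.Linear(d_v, r, bias=False)   # H^T v_i
        self.G = nn.Linear(d_q, r, bias=False)   # G^T q
        self.P = nn.Linear(r, 1, bias=False)     # P^T (.)

    def forward(self, V, q):
        # V: (batch, K, d_v) region features, q: (batch, d_q) question feature
        fused = self.H(V) * self.G(q).unsqueeze(1)      # element-wise product, (batch, K, r)
        f = self.P(fused).squeeze(-1)                   # scores f_i, (batch, K)
        A = F.softmax(f, dim=-1)                        # attention weights omega_i
        V_at = torch.bmm(A.unsqueeze(1), V).squeeze(1)  # V^at = A^T V, (batch, d_v)
        return V_at, A

V_at, A = LowRankBilinearAttention()(torch.randn(2, 36, 2048), torch.randn(2, 1024))
print(V_at.shape, A.shape)  # torch.Size([2, 2048]) torch.Size([2, 36])
```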
The multi-angle semantic module in step 5 associates the question features with the caption features. The specific method is as follows: first, the cosine similarity between each caption word $t_i$ and each question word $q_j$ is computed by traversal, and the text features most relevant to question $q_j$ are selected. Second, the weight coefficient $R_i$ is combined with the caption feature $t_i$, so that semantic information more relevant to the question receives more attention, i.e.
$$\tilde{t}_i = R_i \cdot t_i,$$
where $\tilde{t}_i$ denotes the weighted caption feature. Then each word of the caption is encoded with a bidirectional LSTM (BiLSTM), and each word of the question is likewise encoded with a BiLSTM. Finally, four methods, namely complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, are adopted to improve the model's generalization ability in understanding semantic information.
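A small sketch of the question-caption association is shown below; taking the maximum cosine similarity over the question words as the weight $R_i$ is an assumption, since the excerpt only states that the features most relevant to the question are selected.

```python
import torch
import torch.nn.functional as F

def weight_captions_by_question(caption_emb, question_emb):
    """Sketch of the question-caption association: for each caption word t_i,
    take the maximum cosine similarity to any question word q_j as the weight
    R_i and scale t_i by it (the use of the maximum over j is an assumption).

    caption_emb:  (T, e) caption word embeddings
    question_emb: (J, e) question word embeddings
    """
    sim = F.cosine_similarity(caption_emb.unsqueeze(1),        # (T, J)
                              question_emb.unsqueeze(0), dim=-1)
    R = sim.max(dim=1).values                                  # R_i, (T,)
    return R.unsqueeze(1) * caption_emb                        # weighted caption features

weighted = weight_captions_by_question(torch.randn(84, 300), torch.randn(14, 300))
print(weighted.shape)  # torch.Size([84, 300])
```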
Step 5 is implemented as follows:
Step 5.1: associate the question features with the caption features. First, the cosine similarity between caption word $t_i$ and question word $q_j$ is computed by traversal, and the text features most relevant to question $q_j$ are selected. Second, the weight coefficient $R_i$ is combined with the caption feature $t_i$, so that semantic information more relevant to the question receives more attention, i.e. $\tilde{t}_i = R_i \cdot t_i$, where $\tilde{t}_i$ denotes the weighted caption feature. Then each word of the caption is encoded with a bidirectional LSTM (BiLSTM), and each word of the question is likewise encoded with a BiLSTM. Finally, four methods, namely complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, are adopted to improve the model's generalization ability in understanding semantic information;
Step 5.2: each word of the caption is encoded with a BiLSTM, and each word of the question is encoded with a BiLSTM at the same time:
$$\overrightarrow{h}^{t}_{i} = \overrightarrow{\mathrm{LSTM}}\!\left(\tilde{t}_i,\ \overrightarrow{h}^{t}_{i-1}\right), \qquad \overleftarrow{h}^{t}_{i} = \overleftarrow{\mathrm{LSTM}}\!\left(\tilde{t}_i,\ \overleftarrow{h}^{t}_{i+1}\right),$$
$$\overrightarrow{h}^{q}_{j} = \overrightarrow{\mathrm{LSTM}}\!\left(q_j,\ \overrightarrow{h}^{q}_{j-1}\right), \qquad \overleftarrow{h}^{q}_{j} = \overleftarrow{\mathrm{LSTM}}\!\left(q_j,\ \overleftarrow{h}^{q}_{j+1}\right),$$
where $\overrightarrow{h}^{t}_{i}$ and $\overleftarrow{h}^{t}_{i}$ denote the hidden states of the forward and backward LSTM of the caption at the $i$-th time step, and $\overrightarrow{h}^{q}_{j}$ and $\overleftarrow{h}^{q}_{j}$ denote the hidden states of the forward and backward LSTM of the question at the $j$-th time step;
Step 5.3: four fusion strategies, namely complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, are adopted respectively to capture high-level semantic information.
Complete fusion passes each forward and backward word vector of the caption paragraph, together with the forward and backward final states of the whole question (J being the number of question time steps), into the fusion function F:
$$\overrightarrow{m}^{full}_{i} = F\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{q}_{J}\right), \qquad \overleftarrow{m}^{full}_{i} = F\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{q}_{1}\right),$$
where $\overrightarrow{m}^{full}_{i}$ and $\overleftarrow{m}^{full}_{i}$ are $l$-dimensional vectors denoting the forward and backward complete-fusion features of the $i$-th caption word vector.
Average-pooling fusion passes the forward (or backward) word-vector features of the caption paragraph and the forward (or backward) question features at every time step into the function F and then performs average pooling:
$$\overrightarrow{m}^{avg}_{i} = \frac{1}{J}\sum_{j=1}^{J} F\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{q}_{j}\right), \qquad \overleftarrow{m}^{avg}_{i} = \frac{1}{J}\sum_{j=1}^{J} F\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{q}_{j}\right),$$
where $\overrightarrow{m}^{avg}_{i}$ and $\overleftarrow{m}^{avg}_{i}$ are $l$-dimensional vectors denoting the forward and backward average-pooling fusion features of the $i$-th caption word vector.
Attention fusion first computes a similarity coefficient between the caption context embedding and the question context embedding with the cosine similarity function, then uses the coefficient as a weight, multiplies it with each forward (or backward) question word embedding, and takes the average:
$$\overrightarrow{\alpha}_{i,j} = \cos\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{q}_{j}\right), \qquad \overleftarrow{\alpha}_{i,j} = \cos\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{q}_{j}\right),$$
$$\overrightarrow{h}^{mean}_{i} = \frac{1}{J}\sum_{j=1}^{J} \overrightarrow{\alpha}_{i,j}\,\overrightarrow{h}^{q}_{j}, \qquad \overleftarrow{h}^{mean}_{i} = \frac{1}{J}\sum_{j=1}^{J} \overleftarrow{\alpha}_{i,j}\,\overleftarrow{h}^{q}_{j},$$
where $\overrightarrow{\alpha}_{i,j}$ and $\overleftarrow{\alpha}_{i,j}$ are the forward and backward similarity coefficients, and $\overrightarrow{h}^{mean}_{i}$ and $\overleftarrow{h}^{mean}_{i}$ are the attention vectors corresponding to the $i$-th caption word vector in the forward and backward directions, representing the relevance of the whole question to that word.
Finally, the attention vector and the caption context embedding are passed into the function F to obtain the forward and backward attention-fusion features of the $i$-th caption word vector:
$$\overrightarrow{m}^{att}_{i} = F\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{mean}_{i}\right), \qquad \overleftarrow{m}^{att}_{i} = F\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{mean}_{i}\right).$$
Maximum-attention fusion directly takes the question embedding with the largest similarity coefficient as the attention vector and finally passes the attention vector and the caption embedding into the function F:
$$\overrightarrow{h}^{max}_{i} = \overrightarrow{h}^{q}_{j^{*}},\ \ j^{*} = \arg\max_{j} \overrightarrow{\alpha}_{i,j}, \qquad \overleftarrow{h}^{max}_{i} = \overleftarrow{h}^{q}_{j^{*}},\ \ j^{*} = \arg\max_{j} \overleftarrow{\alpha}_{i,j},$$
$$\overrightarrow{m}^{max}_{i} = F\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{max}_{i}\right), \qquad \overleftarrow{m}^{max}_{i} = F\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{max}_{i}\right).$$
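The four matching strategies can be sketched (forward direction only) as follows. The form of the fusion function F is not specified in this excerpt, so a simple projected element-wise product is used as a stand-in, and the attentive average is a plain mean of the similarity-weighted question states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FourWayFusion(nn.Module):
    """Sketch of the four fusion strategies (forward direction only).
    The fusion function F is a stand-in: it projects the element-wise
    product of its two inputs to an l-dimensional vector."""
    def __init__(self, hidden=256, l=64):
        super().__init__()
        self.F = nn.Linear(hidden, l)                 # placeholder for the F function

    def fuse(self, a, b):
        return torch.tanh(self.F(a * b))              # F(a, b), shape (T, l)

    def forward(self, h_t, h_q):
        # h_t: (T, hidden) caption states, h_q: (J, hidden) question states
        full = self.fuse(h_t, h_q[-1].expand_as(h_t))                     # complete fusion
        avg = torch.stack([self.fuse(h_t, h_q[j].expand_as(h_t))
                           for j in range(h_q.size(0))]).mean(0)          # average-pooling fusion
        alpha = F.cosine_similarity(h_t.unsqueeze(1), h_q.unsqueeze(0), dim=-1)  # (T, J)
        att_vec = (alpha.unsqueeze(-1) * h_q.unsqueeze(0)).mean(dim=1)    # mean of weighted q states
        att = self.fuse(h_t, att_vec)                                     # attention fusion
        max_vec = h_q[alpha.argmax(dim=1)]                                # most similar question state
        mx = self.fuse(h_t, max_vec)                                      # maximum-attention fusion
        return torch.cat([full, avg, att, mx], dim=-1)                    # (T, 4*l)

out = FourWayFusion()(torch.randn(84, 256), torch.randn(14, 256))
print(out.shape)  # torch.Size([84, 256])
```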
With the four fusion strategies in step 5, the comprehensive fusion feature of the $i$-th caption word is obtained by concatenating the eight generated feature vectors and is denoted
$$m_i = \left[\overrightarrow{m}^{full}_{i};\ \overleftarrow{m}^{full}_{i};\ \overrightarrow{m}^{avg}_{i};\ \overleftarrow{m}^{avg}_{i};\ \overrightarrow{m}^{att}_{i};\ \overleftarrow{m}^{att}_{i};\ \overrightarrow{m}^{max}_{i};\ \overleftarrow{m}^{max}_{i}\right].$$
The comprehensive fusion features are input into a bidirectional LSTM (BiLSTM) and the final hidden states of the two directions are obtained:
$$\overrightarrow{h}^{m} = \overrightarrow{\mathrm{LSTM}}\!\left(m_{1:I}\right), \qquad \overleftarrow{h}^{m} = \overleftarrow{\mathrm{LSTM}}\!\left(m_{I:1}\right).$$
Second, the final hidden states of the two ends are concatenated to generate the multi-angle semantic feature
$$S = \left[\overrightarrow{h}^{m};\ \overleftarrow{h}^{m}\right].$$
Finally, to facilitate multi-modal feature fusion, the multi-angle semantic feature is mapped to the same dimension as the visual representation:
$$S' = W_s S + b_s,$$
where $W_s$ is a learnable weight matrix and $b_s$ is a bias.
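A short sketch of this final aggregation follows; all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class MultiAngleAggregator(nn.Module):
    """Sketch of the final aggregation: the concatenated fusion features m_i are
    fed into a BiLSTM, the two final hidden states are concatenated into the
    multi-angle semantic feature S, and S is projected to the visual dimension."""
    def __init__(self, m_dim=512, hidden=512, visual_dim=2048):
        super().__init__()
        self.bilstm = nn.LSTM(m_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, visual_dim)   # S' = W_s S + b_s

    def forward(self, m):                 # m: (batch, T, m_dim) comprehensive fusion features
        _, (h_n, _) = self.bilstm(m)      # h_n: (2, batch, hidden), final states of both directions
        S = torch.cat([h_n[0], h_n[1]], dim=-1)          # multi-angle semantic feature
        return self.proj(S)               # mapped to the same dimension as the visual feature

S_prime = MultiAngleAggregator()(torch.randn(2, 84, 512))
print(S_prime.shape)  # torch.Size([2, 2048])
```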
The invention has the beneficial effects that:
1. The invention builds on a multi-angle semantic understanding and self-adaptive dual-channel model that captures visual cues and semantic cues of the image simultaneously, and adds a gate at the late-fusion stage that adaptively selects visual information and semantic information to answer the question, making the trained model more robust.
2. A visual relation reasoning module, comprising a binary relation reasoning module and a multivariate relation reasoning module, is adopted in the visual channel; it strengthens the model's understanding of visual content and generalizes well to more complex visual scenes.
3. A multi-angle semantic module, comprising complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, is adopted in the semantic channel to generate semantic features, improving the semantic quality of the answers and the accuracy of the visual question-answering model.
Drawings
Fig. 1 is a diagram of a network model architecture for the method of the present invention.
FIG. 2 is a schematic diagram of a relationship inference module in the method of the present invention.
FIG. 3 is a diagram illustrating a multi-angle semantic module in the method of the present invention.
Detailed Description
The invention will be further described with reference to the embodiment shown in the drawings.
The invention relates to a visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels, which comprises the following steps of:
step 1: the input image is preprocessed, and visual features and geometric features of a salient region in the input image are extracted by using an object detection module. Extracting grid features by using a pre-trained ResNet-101, searching an object region by matching with a Faster-RCNN model, extracting 2048-dimensional target region features, and selecting the most relevant K detection frames (generally K is 36) as important visual regions. For each selected region i, viIs a d-dimensional visual object vector, the input image is finally expressed as V ═ V1,v2,…,vK}T
Figure BDA0003534898230000101
In addition, the geometric features of the input image are also recorded, and are recorded as B ═ B1,b2,…,bK}TWherein
Figure BDA0003534898230000102
Figure BDA0003534898230000103
(xi,yi),wi,hiRespectively representing the center coordinate, width and height of the selected area i. w, h represent the width and height of the input image, respectively.
Step 2: for the embedding of the question text, the sentence is split into words using spaces and punctuation (a number, or a word built on a number, is also treated as a single word); the words are then represented as vectors with a pre-trained word-vector model; finally, the word vectors are passed through a long short-term memory (LSTM) network and the state at the last time step is taken as the question feature.
The implementation is as follows: each input question Q is truncated to at most 14 words, extra words beyond 14 are simply discarded, and questions with fewer than 14 words are padded with zero vectors. The 14-word question is then converted into GloVe word vectors, giving a word-embedding sequence of size 14 × 300, which is passed sequentially through an LSTM with hidden size $d_q$. Finally, the final hidden state $q \in \mathbb{R}^{d_q}$ is used as the question embedding of the input question Q.
Step 3: for the embedding of the image caption and the dense caption text, the sentences are split into words using spaces and punctuation, and the sentence length is likewise set to 14. The first 6 dense captions (according to the average of the caption distribution) are taken as text input, and the obtained caption features are concatenated and converted into a text paragraph. Finally, the text paragraph is encoded with an LSTM, and the output of the last layer is the encoded word-vector sequence.
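The caption-paragraph encoding of step 3 can be sketched as follows, assuming the image caption and the six dense captions have already been tokenized to 14 ids each; the vocabulary size and hidden size are placeholders.

```python
import torch
import torch.nn as nn

class CaptionParagraphEncoder(nn.Module):
    """Sketch of step 3: the image caption and the first 6 dense captions
    (each truncated/padded to 14 tokens) are concatenated into one text
    paragraph and encoded with an LSTM; the whole output sequence is kept as
    the encoded word-vector sequence."""
    def __init__(self, vocab_size=20000, hidden=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300, padding_idx=0)
        self.lstm = nn.LSTM(300, hidden, batch_first=True)

    def forward(self, caption_ids, dense_ids):
        # caption_ids: (batch, 14), dense_ids: (batch, 6, 14)
        paragraph = torch.cat([caption_ids, dense_ids.flatten(1)], dim=1)  # (batch, 7*14)
        out, _ = self.lstm(self.embed(paragraph))
        return out                                       # encoded word-vector sequence

seq = CaptionParagraphEncoder()(torch.randint(1, 20000, (2, 14)),
                                torch.randint(1, 20000, (2, 6, 14)))
print(seq.shape)  # torch.Size([2, 98, 1024])
```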
Step 4: an attention mechanism is applied to the visual features and question features obtained in steps 1 and 2 to obtain attention features related to the question; the visual features, geometric features and question features obtained in steps 1 and 2 are fed through a relation reasoning module to output relation features; finally, the attention features and the relation features are fused to generate the visual feature representation.
The attention mechanism is specifically as follows: a top-down attention mechanism is introduced, and a soft-attention method is inserted into the network structure as the attention module to highlight the visual objects related to the question and output the attention feature. The attention feature $V^{at} \in \mathbb{R}^d$ is the weighted sum of all visual regions:
$$V^{at} = A^T \cdot V,$$
where $A = [\omega_1, \omega_2, \dots, \omega_K]^T$ is the attention mapping matrix.
The relation reasoning module consists of three parts: feature fusion, binary relation reasoning and multivariate relation reasoning (the relation reasoning module is an innovative point of the present invention).
The feature fusion steps are as follows: first, the object features and geometric features of the K visual regions of the image are concatenated to generate the visual region features $V_{co} = \mathrm{concat}[V, B]$. Second, the visual region features $V_{co}$ and the question features are mapped into a low-dimensional subspace of dimension $d_s$ (formula image), where $W_v$ and $W_q$ are learnable parameters and $b_v$ and $b_q$ are biases. To combine the visual regions in pairs, the mapped visual region features are expanded and added to their transpose, yielding the pairwise combination $V_{fu}$ of the visual region features.
The binary relation reasoning steps are as follows: three consecutive 1 × 1 convolutional layers are used, each followed by a ReLU activation layer; their channel numbers are $d_s$ and two further values given by formula images. When the pairwise combination of visual region features $V_{fu}$ is fed into the binary relation reasoning module, the output of the last layer, denoted $M_p$, is added to its transpose to obtain a symmetric matrix, and softmax finally generates the binary relation $R_p$:
$$R_p = \mathrm{softmax}\!\left(M_p + M_p^T\right).$$
The multivariate relation reasoning steps are as follows: three consecutive 3 × 3 dilated convolutional layers are used, each followed by a ReLU activation layer; the dilation rates of the three layers are 1, 2 and 4 respectively. All convolution strides are 1, and zero padding is used so that the output of each convolution has the same size as its input. When the pairwise combination $V_{fu}$ is fed into the multivariate relation reasoning module, the output of the last convolutional layer and ReLU activation layer, denoted $M_g$, is added to its transpose, as in the binary relation reasoning, to obtain a symmetric matrix, and softmax finally generates the multivariate relation $R_g$:
$$R_g = \mathrm{softmax}\!\left(M_g + M_g^T\right).$$
Step 4 is implemented as follows:
First, starting from the simplest bilinear multimodal fusion
$$f_i = v_i^T W_i q,$$
$W_i$ is replaced by two smaller matrices $H_i G_i^T$, so that
$$f_i = v_i^T H_i G_i^T q = \mathbf{1}^T\!\left(H_i^T v_i \circ G_i^T q\right),$$
where $\mathbf{1} \in \mathbb{R}^d$ is a vector whose elements are all 1 and $\circ$ denotes element-wise multiplication.
Second, the same mapping matrices $H$ and $G$ are used for all image regions:
$$f_i = P^T\!\left(H^T v_i \circ G^T q\right),$$
where $P \in \mathbb{R}^d$ is a learning parameter. To obtain the attention mapping matrix, the attention weight $\omega_i$ of image region $i$ is computed as
$$\omega_i = \frac{\exp(f_i)}{\sum_{k=1}^{K} \exp(f_k)}.$$
Thus the attention feature $V^{at}$, the weighted sum of all visual regions, is expressed as
$$V^{at} = A^T \cdot V,$$
where $A = [\omega_1, \omega_2, \dots, \omega_K]^T$ is the attention mapping matrix.
Step 5 is implemented as follows:
Step 5.1: the question features are associated with the caption features. First, the cosine similarity between caption word $t_i$ and question word $q_j$ is computed by traversal, and the text features most relevant to question $q_j$ are selected. Second, the weight coefficient $R_i$ is combined with the caption feature $t_i$, so that semantic information more relevant to the question receives more attention, i.e. $\tilde{t}_i = R_i \cdot t_i$, where $\tilde{t}_i$ denotes the weighted caption feature. Then each word of the caption is encoded with a bidirectional LSTM (BiLSTM), and each word of the question is likewise encoded with a BiLSTM. Finally, four methods, namely complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, are adopted to improve the model's generalization ability in understanding semantic information.
Step 5.2: each word of the caption is encoded with a bidirectional LSTM (BiLSTM), and each word of the question is encoded with a BiLSTM at the same time:
$$\overrightarrow{h}^{t}_{i} = \overrightarrow{\mathrm{LSTM}}\!\left(\tilde{t}_i,\ \overrightarrow{h}^{t}_{i-1}\right), \qquad \overleftarrow{h}^{t}_{i} = \overleftarrow{\mathrm{LSTM}}\!\left(\tilde{t}_i,\ \overleftarrow{h}^{t}_{i+1}\right),$$
$$\overrightarrow{h}^{q}_{j} = \overrightarrow{\mathrm{LSTM}}\!\left(q_j,\ \overrightarrow{h}^{q}_{j-1}\right), \qquad \overleftarrow{h}^{q}_{j} = \overleftarrow{\mathrm{LSTM}}\!\left(q_j,\ \overleftarrow{h}^{q}_{j+1}\right),$$
where $\overrightarrow{h}^{t}_{i}$ and $\overleftarrow{h}^{t}_{i}$ denote the hidden states of the forward and backward LSTM of the caption at the $i$-th time step, and $\overrightarrow{h}^{q}_{j}$ and $\overleftarrow{h}^{q}_{j}$ denote the hidden states of the forward and backward LSTM of the question at the $j$-th time step.
Step 5.3: four fusion strategies, namely complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, are adopted respectively to capture high-level semantic information (this is another innovative point of the present invention).
The complete fusion strategy passes each forward and backward word vector of the caption paragraph, together with the forward and backward final states of the whole question (J being the number of question time steps), into the fusion function F:
$$\overrightarrow{m}^{full}_{i} = F\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{q}_{J}\right), \qquad \overleftarrow{m}^{full}_{i} = F\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{q}_{1}\right),$$
where $\overrightarrow{m}^{full}_{i}$ and $\overleftarrow{m}^{full}_{i}$ are $l$-dimensional vectors denoting the forward and backward complete-fusion features of the $i$-th caption word vector.
The average-pooling fusion strategy passes the forward (or backward) word-vector features of the caption paragraph and the forward (or backward) question features at every time step into the function F and then performs average pooling:
$$\overrightarrow{m}^{avg}_{i} = \frac{1}{J}\sum_{j=1}^{J} F\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{q}_{j}\right), \qquad \overleftarrow{m}^{avg}_{i} = \frac{1}{J}\sum_{j=1}^{J} F\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{q}_{j}\right),$$
where $\overrightarrow{m}^{avg}_{i}$ and $\overleftarrow{m}^{avg}_{i}$ are $l$-dimensional vectors denoting the forward and backward average-pooling fusion features of the $i$-th caption word vector.
The attention fusion strategy first computes a similarity coefficient between the caption context embedding and the question context embedding with the cosine similarity function, then uses the coefficient as a weight, multiplies it with each forward (or backward) question word embedding, and takes the average:
$$\overrightarrow{\alpha}_{i,j} = \cos\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{q}_{j}\right), \qquad \overleftarrow{\alpha}_{i,j} = \cos\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{q}_{j}\right),$$
$$\overrightarrow{h}^{mean}_{i} = \frac{1}{J}\sum_{j=1}^{J} \overrightarrow{\alpha}_{i,j}\,\overrightarrow{h}^{q}_{j}, \qquad \overleftarrow{h}^{mean}_{i} = \frac{1}{J}\sum_{j=1}^{J} \overleftarrow{\alpha}_{i,j}\,\overleftarrow{h}^{q}_{j},$$
where $\overrightarrow{\alpha}_{i,j}$ and $\overleftarrow{\alpha}_{i,j}$ are the forward and backward similarity coefficients, and $\overrightarrow{h}^{mean}_{i}$ and $\overleftarrow{h}^{mean}_{i}$ are the attention vectors corresponding to the $i$-th caption word vector in the forward and backward directions, representing the relevance of the whole question to that word.
Finally, the attention vector and the caption context embedding are passed into the function F to obtain the forward and backward attention-fusion features of the $i$-th caption word vector:
$$\overrightarrow{m}^{att}_{i} = F\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{mean}_{i}\right), \qquad \overleftarrow{m}^{att}_{i} = F\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{mean}_{i}\right).$$
The maximum-attention fusion strategy directly takes the question embedding with the largest similarity coefficient as the attention vector and finally passes the attention vector and the caption embedding into the function F:
$$\overrightarrow{h}^{max}_{i} = \overrightarrow{h}^{q}_{j^{*}},\ \ j^{*} = \arg\max_{j} \overrightarrow{\alpha}_{i,j}, \qquad \overleftarrow{h}^{max}_{i} = \overleftarrow{h}^{q}_{j^{*}},\ \ j^{*} = \arg\max_{j} \overleftarrow{\alpha}_{i,j},$$
$$\overrightarrow{m}^{max}_{i} = F\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{max}_{i}\right), \qquad \overleftarrow{m}^{max}_{i} = F\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{max}_{i}\right).$$
These four fusion methods concatenate the eight generated feature vectors to obtain the comprehensive fusion feature of the $i$-th caption word, denoted
$$m_i = \left[\overrightarrow{m}^{full}_{i};\ \overleftarrow{m}^{full}_{i};\ \overrightarrow{m}^{avg}_{i};\ \overleftarrow{m}^{avg}_{i};\ \overrightarrow{m}^{att}_{i};\ \overleftarrow{m}^{att}_{i};\ \overrightarrow{m}^{max}_{i};\ \overleftarrow{m}^{max}_{i}\right].$$
The comprehensive fusion features are input into a bidirectional LSTM (BiLSTM) and the final hidden states of the two directions are obtained:
$$\overrightarrow{h}^{m} = \overrightarrow{\mathrm{LSTM}}\!\left(m_{1:I}\right), \qquad \overleftarrow{h}^{m} = \overleftarrow{\mathrm{LSTM}}\!\left(m_{I:1}\right).$$
Second, the final hidden states of the two ends are concatenated to generate the multi-angle semantic feature
$$S = \left[\overrightarrow{h}^{m};\ \overleftarrow{h}^{m}\right].$$
Finally, to facilitate multi-modal feature fusion, the multi-angle semantic feature is mapped to the same dimension as the visual representation:
$$S' = W_s S + b_s,$$
where $W_s$ is a learnable weight matrix and $b_s$ is a bias.
Step 6: the visual features and multi-angle semantic features generated in steps 4 and 5 are fed into a visual-semantic selection gate, which controls the contributions of the visual channel and the semantic channel to the predicted answer through feature fusion. The answer with the highest probability from the multi-class classifier is selected as the final answer.
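The excerpt does not spell out the internal form of the visual-semantic selection gate, so the sketch below assumes a common choice: a sigmoid gate computed from both channels that weights their contributions before a multi-class answer classifier (the answer-vocabulary size is a placeholder).

```python
import torch
import torch.nn as nn

class VisualSemanticGate(nn.Module):
    """Sketch of the adaptive selection gate of step 6. The concrete gating
    form is an assumption: a sigmoid gate over the concatenated channels that
    weights the visual and semantic contributions before classification."""
    def __init__(self, dim=2048, n_answers=3129):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, n_answers)

    def forward(self, visual_feat, semantic_feat):
        g = self.gate(torch.cat([visual_feat, semantic_feat], dim=-1))
        fused = g * visual_feat + (1.0 - g) * semantic_feat   # adaptive dual-channel fusion
        return self.classifier(fused)                         # answer scores; argmax gives the answer

logits = VisualSemanticGate()(torch.randn(2, 2048), torch.randn(2, 2048))
print(logits.argmax(dim=-1).shape)  # torch.Size([2])
```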
In summary, on the VQA 1.0 and VQA 2.0 data sets, the invention combines an R-CNN-LSTM framework with an attention mechanism and a relation reasoning method in the visual channel: the visual feature vectors and geometric feature vectors of the image are encoded with Faster R-CNN and fed into the visual channel to generate the visual modality representation. In the semantic channel, the concatenated global caption and local captions are encoded with an LSTM network and passed through the multi-angle semantic module to output the semantic modality representation. Finally, the obtained visual and semantic modality representations are fed into an adaptive selection gate, which decides which modality cue is used to predict the answer.
The innovations are as follows. First, a relation reasoning module, comprising binary relation reasoning and multivariate relation reasoning, is adopted in the visual channel; it strengthens the model's understanding of visual content and generalizes well to more complex visual scenes. Second, a multi-angle semantic module, comprising complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, is adopted in the semantic channel to generate semantic features, improving the semantic quality of the answers while improving the accuracy of the visual question-answering model.
Simulation experiments and characterization of the experimental results:
1. Data sets
The model was evaluated on two public visual question-answering data sets, VQA 1.0 and VQA 2.0. VQA 1.0 was created on the basis of the MSCOCO image data set [38]; its training set contains 248349 questions and 82783 pictures, its validation set contains 121512 questions and 40504 pictures, and its test set contains 244302 questions and 81434 pictures. VQA 2.0 is an iterated version of VQA 1.0 that adds more question samples so that the language bias is more balanced than in VQA 1.0. The training set of VQA 2.0 contains 443757 questions and 82783 pictures, the validation set contains 214354 questions and 40504 pictures, and the test set contains 447793 questions and 81434 pictures. There are three question types: yes/no, number and other, with the other type accounting for roughly half of all samples. The proposed model is trained on the training and validation sets, and, for a fair comparison with other work, results are reported on the test-development (test-dev) and test-standard sets.
2. Experimental environment
The proposed model is implemented with the PyTorch library, and the experiments are run on a GPU server configured with 256 GB of RAM and 4 Nvidia 1080Ti GPUs (64 GB of video memory in total). The model is trained with the Adam optimizer for at most 40 epochs with a batch size of 256. The learning rate is set to 1e-3 in the first epoch and 2e-3 in the second epoch, then raised to 3e-3 in the third epoch and held until the tenth epoch, after which it is decayed every two epochs with a decay rate of 0.5. To prevent gradient explosion, a gradient clipping scheme is also adopted that scales the gradient values in each epoch to one quarter of their original value. To prevent overfitting, a dropout layer with a dropout rate of 0.5 is applied after each fully connected layer.
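One plausible reading of this training configuration is sketched below; the epoch boundaries of the warm-up schedule and the interpretation of "gradient pruning" as scaling the gradients to one quarter are assumptions.

```python
import torch

def learning_rate(epoch):
    """Warm-up / decay schedule as described: 1e-3, 2e-3, then 3e-3 up to
    epoch 10, after which the rate is halved every two epochs."""
    if epoch <= 1:
        return 1e-3
    if epoch == 2:
        return 2e-3
    if epoch <= 10:
        return 3e-3
    return 3e-3 * (0.5 ** ((epoch - 10 + 1) // 2))

model = torch.nn.Linear(10, 10)                       # stand-in for the full VQA model
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate(1))

for epoch in range(1, 41):                            # 40 epochs, batch size 256 in the text
    for group in optimizer.param_groups:
        group["lr"] = learning_rate(epoch)
    # ... forward / backward pass over the training batches goes here ...
    for p in model.parameters():                      # "gradient pruning": scale gradients to 1/4
        if p.grad is not None:
            p.grad.mul_(0.25)
    optimizer.step()
    optimizer.zero_grad()
```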
3. Results and analysis of the experiments
Table 1 (image): performance of the models on the VQA 1.0 test-development and test-standard sets
Table 1 compares various advanced models with the proposed model; the results are obtained after the models are trained on the training and validation sets. The proposed model is clearly superior to the other models on most metrics, reaching an overall accuracy of 69.44% on the test-development set and 69.37% on the test-standard set. On the test-development set, the overall accuracy is 5.64% higher than that of the MAN model, which uses a memory-augmented neural network, and 0.73% higher than that of the best-performing VSDC model. The VSDC model also follows the idea of semantically guided prediction and applies an attention mechanism on the semantic side to acquire question-related semantic information. Beyond the semantic attention mechanism, the present invention adds three further fusion methods to improve the model's multi-angle semantic understanding, and the experimental results show that the multi-angle semantic module in the semantic channel is significant for improving prediction accuracy. The proposed model also performs well on the test-standard set.
Table 2 (image): performance of the models on the VQA 2.0 test-development and test-standard sets
As shown in Table 2, the performance of the model is further verified on the VQA 2.0 data set, including the test-development and test-standard sets. Compared with the advanced methods, the proposed model performs well on metrics such as overall accuracy. Compared with the MuRel [49] model, the overall accuracy of the invention is improved by 1.22% on the test-development set and 0.89% on the test-standard set. The MuRel model, a prominent representative of existing multimodal relation modeling methods, is an end-to-end reasoning network built on residual features. The proposed method outperforms it because the semantic channel guides answer prediction, allowing the model to exploit a large amount of semantic information to improve prediction accuracy. In addition, compared with the VCTREE model, which combines reinforcement learning and supervised learning and is currently one of the better-performing visual question-answering methods, the proposed model has a clear advantage on metrics such as overall accuracy.
The comparison with these advanced methods shows that the proposed model mines semantic information better on the basis of understanding the image content and improves the accuracy of answer prediction.

Claims (10)

1. A visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels, characterized by comprising the following steps:
Step 1: preprocessing an input image, and extracting visual features and geometric features of the salient regions in the input image with an object detection module;
Step 2: for the embedding of the question text, splitting the sentence into words using spaces and punctuation; representing the words as vectors with a pre-trained word-vector model; finally, passing the word vectors through a long short-term memory (LSTM) network and taking the state at the last time step as the question feature;
Step 3: for the embedding of the image caption and the dense caption text, splitting the sentences into words using spaces and punctuation; then concatenating the obtained caption features and converting them into a text paragraph; finally, encoding the text paragraph with an LSTM, the output of the last layer being the encoded word-vector sequence;
Step 4: applying an attention mechanism to the visual features and question features obtained in steps 1 and 2 to obtain attention features related to the question; feeding the visual features, geometric features and question features obtained in steps 1 and 2 through a relation reasoning module to output relation features; finally, fusing the attention features and the relation features to generate the visual feature representation;
Step 5: inputting the word-vector sequence and question features obtained in steps 2 and 3 into a multi-angle semantic module to generate multi-angle semantic features;
Step 6: feeding the visual features and multi-angle semantic features generated in steps 4 and 5 into a visual-semantic selection gate, which controls the contributions of the visual channel and the semantic channel to the predicted answer through feature fusion; the answer with the highest probability from the multi-class classifier is selected as the final answer.
2. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 1, characterized in that in step 1, using the object detection module specifically includes: a Faster R-CNN model is used to obtain object detection boxes, and the K most relevant boxes are selected as important visual regions; for each selected region $i$, $v_i \in \mathbb{R}^d$ is a $d$-dimensional visual object vector, so the input image is finally represented as
$$V = \{v_1, v_2, \dots, v_K\}^T;$$
in addition, the geometric features of the input image are recorded as $B = \{b_1, b_2, \dots, b_K\}^T$, where $b_i$ is computed from $(x_i, y_i)$, $w_i$, $h_i$, the center coordinates, width and height of the selected region $i$, normalized by the width $w$ and height $h$ of the input image (formula image).
3. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 2, characterized in that step 2 is implemented as follows:
first, each input question Q is truncated to at most 14 words; extra words beyond 14 are simply discarded, while questions with fewer than 14 words are padded with zero vectors; the 14-word question is then converted into GloVe word vectors, giving a word-embedding sequence of size 14 × 300, which is passed sequentially through a long short-term memory network with hidden size $d_q$; finally, the final hidden state $q \in \mathbb{R}^{d_q}$ is used as the question embedding of the input question Q;
the text embedding in step 3 is implemented in the same way as the text embedding in step 2, apart from the concatenation of the image caption and the dense captions.
4. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 3, characterized in that the attention mechanism in step 4 specifically refers to: a top-down attention mechanism is introduced, and a soft-attention method is inserted into the network structure as the attention module to highlight the visual objects related to the question and output the attention feature; the attention feature $V^{at} \in \mathbb{R}^d$ is the weighted sum of all visual regions:
$$V^{at} = A^T \cdot V,$$
where $A = [\omega_1, \omega_2, \dots, \omega_K]^T$ is the attention mapping matrix;
the relation reasoning module in step 4 specifically encodes the relations between image regions with two convolution streams, generating two different types of relation features: binary relation features and multivariate relation features; the relation reasoning module consists of three parts: feature fusion, binary relation reasoning and multivariate relation reasoning; the feature fusion module fuses the visual features, geometric features and question features by raising and reducing dimensions to generate pairwise combinations of the visual region features; the binary relation reasoning module mines the pairwise visual relations between visual regions with three consecutive 1 × 1 convolutional layers to generate binary relation features; the multivariate relation reasoning module mines the within-group visual relations between visual regions with three consecutive 3 × 3 dilated convolutional layers to generate multivariate relation features; finally, the binary relation features and the multivariate relation features are combined to obtain the relation features.
5. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 4, characterized in that the feature fusion steps are as follows: first, the object features and geometric features of the K visual regions of the image are concatenated to generate the visual region features $V_{co} = \mathrm{concat}[V, B]$; second, the visual region features $V_{co}$ and the question features are mapped into a low-dimensional subspace of dimension $d_s$ (formula image), where $W_v$ and $W_q$ are learnable parameters and $b_v$ and $b_q$ are biases; to combine the visual regions in pairs, the mapped visual region features are expanded and added to their transpose, yielding the pairwise combination $V_{fu}$ of the visual region features;
the binary relation reasoning uses three consecutive 1 × 1 convolutional layers, each followed by a ReLU activation layer; their channel numbers are $d_s$ and two further values given by formula images; when the pairwise combination of visual region features $V_{fu}$ is fed into the binary relation reasoning module, the output of the last layer, denoted $M_p$, is added to its transpose to obtain a symmetric matrix, and softmax finally generates the binary relation $R_p$:
$$R_p = \mathrm{softmax}\!\left(M_p + M_p^T\right);$$
the multivariate relation reasoning steps are as follows: three consecutive 3 × 3 dilated convolutional layers are used, each followed by a ReLU activation layer; the dilation rates of the three layers are 1, 2 and 4 respectively; all convolution strides are 1, and zero padding is used so that the output of each convolution has the same size as its input; when the pairwise combination $V_{fu}$ is fed into the multivariate relation reasoning module, the output of the last convolutional layer and ReLU activation layer, denoted $M_g$, is added to its transpose, as in the binary relation reasoning, to obtain a symmetric matrix, and softmax finally generates the multivariate relation $R_g$:
$$R_g = \mathrm{softmax}\!\left(M_g + M_g^T\right).$$
6. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 5, characterized in that step 4 is implemented as follows:
first, multimodal fusion is performed according to
$$f_i = v_i^T W_i q = \mathbf{1}^T\!\left(H_i^T v_i \circ G_i^T q\right),$$
where $\mathbf{1} \in \mathbb{R}^d$ is a vector whose elements are all 1 and $\circ$ denotes element-wise multiplication;
second, the same mapping matrices $H$ and $G$ are used for all image regions:
$$f_i = P^T\!\left(H^T v_i \circ G^T q\right),$$
where $P \in \mathbb{R}^d$ is a learning parameter; to obtain the attention mapping matrix, the attention weight $\omega_i$ of image region $i$ is computed as
$$\omega_i = \frac{\exp(f_i)}{\sum_{k=1}^{K} \exp(f_k)};$$
thus the attention feature $V^{at}$, the weighted sum of all visual regions, is expressed as
$$V^{at} = A^T \cdot V,$$
where $A = [\omega_1, \omega_2, \dots, \omega_K]^T$ is the attention mapping matrix.
7. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 6, characterized in that the multi-angle semantic module in step 5) associates the question features with the caption features; the specific method is as follows: first, the similarity between caption t_i and question q_j is computed by traversal using the cosine similarity method, and the text features most relevant to question q_j are selected; second, the weight coefficient R_i is combined with the caption feature t_i so that semantic information more relevant to the question receives more attention, i.e.
[claim formula image omitted in source]
where the result denotes the weighted caption features; then each word of the caption is encoded with a bidirectional LSTM (BiLSTM), and each word of the question is likewise encoded with a BiLSTM; finally, four methods, namely complete fusion, average pooling fusion, attention fusion and maximum attention fusion, are adopted to improve the model's generalization ability in understanding semantic information.
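A sketch of the caption-question association, assuming the weight coefficient R_i is taken as the maximum cosine similarity of caption feature t_i over the question features; the claim only states that the most relevant text features are selected, so the max is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def weight_captions_by_question(caption_feats, question_feats):
    """For every caption feature t_i, traverse the question features q_j,
    take the cosine similarity as a relevance weight R_i, and scale t_i by it."""
    # caption_feats: (n_cap, d), question_feats: (n_q, d)
    sim = F.cosine_similarity(
        caption_feats.unsqueeze(1),      # (n_cap, 1, d)
        question_feats.unsqueeze(0),     # (1, n_q, d)
        dim=-1,
    )                                    # (n_cap, n_q) pairwise cosine similarities
    R = sim.max(dim=1).values            # weight coefficient R_i per caption
    weighted = R.unsqueeze(-1) * caption_feats   # weighted caption features
    return weighted, R
```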
8. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 7, characterized in that the specific implementation steps of step 5 are as follows:
Step 5.1: the question features are associated with the caption features: the similarity between caption t_i and question q_j is computed by traversal using the cosine similarity method, and the text features most relevant to question q_j are selected; then the weight coefficient R_i is combined with the caption feature t_i so that semantic information more relevant to the question receives more attention, i.e.
[claim formula image omitted in source]
where the result denotes the weighted caption features; each word of the caption is then encoded with a bidirectional LSTM (BiLSTM), and each word of the question is likewise encoded with a BiLSTM; finally, four methods, namely complete fusion, average pooling fusion, attention fusion and maximum attention fusion, are adopted to improve the model's generalization ability in understanding semantic information;
Step 5.2: each word of the caption is encoded with a bidirectional LSTM (BiLSTM), and each word of the question is likewise encoded with a BiLSTM:
[claim formula images omitted in source]
where the omitted symbols denote the hidden states of the forward and backward LSTMs of the caption at the i-th time step, and the hidden states of the forward and backward LSTMs of the question at the j-th time step, respectively;
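A sketch of the step 5.2 encoders, assuming standard bidirectional LSTMs over pre-embedded caption and question words; the embedding and hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

class CaptionQuestionEncoder(nn.Module):
    """Encode every caption word and every question word with bidirectional
    LSTMs, returning the per-time-step hidden states of the forward and
    backward directions."""

    def __init__(self, d_emb, d_hidden):
        super().__init__()
        self.cap_lstm = nn.LSTM(d_emb, d_hidden, bidirectional=True, batch_first=True)
        self.q_lstm = nn.LSTM(d_emb, d_hidden, bidirectional=True, batch_first=True)

    def forward(self, caption_emb, question_emb):
        # caption_emb: (batch, n_cap, d_emb), question_emb: (batch, n_q, d_emb)
        cap_h, _ = self.cap_lstm(caption_emb)    # (batch, n_cap, 2 * d_hidden)
        q_h, _ = self.q_lstm(question_emb)       # (batch, n_q, 2 * d_hidden)
        # Split into forward and backward hidden states at each time step.
        cap_fwd, cap_bwd = cap_h.chunk(2, dim=-1)
        q_fwd, q_bwd = q_h.chunk(2, dim=-1)
        return (cap_fwd, cap_bwd), (q_fwd, q_bwd)
```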
Step 5.3: four fusion strategies, namely complete fusion, average pooling fusion, attention fusion and maximum attention fusion, are adopted to capture high-level semantic information.
9. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 8, characterized in that the complete fusion passes each forward and backward word vector of the caption passage, together with the forward and backward final states of the whole question, into the function F for fusion. The specific formulas are:
[claim formula images omitted in source]
where the results are l-dimensional vectors denoting the forward and backward complete fusion features of the i-th caption word vector;
The average pooling fusion passes the forward (or backward) word vector features of the caption passage and the forward (or backward) question features at each time step into the function F for fusion, and then performs an average pooling operation. The specific formulas are:
[claim formula images omitted in source]
where the results are l-dimensional vectors denoting the forward and backward average pooling fusion features of the i-th caption word vector;
The attention fusion first computes similarity coefficients between the caption context embeddings and the question context embeddings through a cosine similarity function; the similarity coefficients are then used as weights to multiply each forward (or backward) word vector embedding of the question, and the average is taken. The specific formulas are:
[claim formula images omitted in source]
where the omitted symbols denote the forward and backward similarity coefficients and the forward and backward attention vectors of the i-th caption word vector, the latter representing the correlation between the whole question and that word; finally, the attention vector and the caption context embedding are passed into the function F for fusion to obtain the forward and backward attention fusion features of the i-th caption word vector, as follows:
[claim formula images omitted in source]
The maximum attention fusion takes the question embedding with the maximum similarity coefficient as the attention vector, and finally passes the attention vector and the caption embedding into the function F for fusion. The specific formulas are:
[claim formula images omitted in source]
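A combined sketch of the four fusion strategies for one direction, assuming the fusion function F (defined earlier in the claims but not reproduced here) can be approximated by a tanh-activated linear layer over concatenated inputs; the hidden size d_h and fusion dimension l are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiPerspectiveFusion(nn.Module):
    """Complete, average pooling, attention and maximum attention fusion of
    caption hidden states with question hidden states (one LSTM direction)."""

    def __init__(self, d_h, l):
        super().__init__()
        self.fuse = nn.Linear(2 * d_h, l)   # stand-in for the claim's F function

    def F_fn(self, a, b):
        return torch.tanh(self.fuse(torch.cat([a, b], dim=-1)))

    def forward(self, cap_h, q_h):
        # cap_h: (n_cap, d_h) caption hidden states, q_h: (n_q, d_h) question hidden states
        n_cap = cap_h.size(0)

        # 1) Complete fusion: every caption state with the final question state.
        complete = self.F_fn(cap_h, q_h[-1].expand(n_cap, -1))

        # 2) Average pooling fusion: fuse with every question state, then average.
        pairwise = self.F_fn(
            cap_h.unsqueeze(1).expand(-1, q_h.size(0), -1),
            q_h.unsqueeze(0).expand(n_cap, -1, -1),
        )                                           # (n_cap, n_q, l)
        avg_pool = pairwise.mean(dim=1)

        # 3) Attention fusion: cosine similarities as weights over question states.
        sim = F.cosine_similarity(cap_h.unsqueeze(1), q_h.unsqueeze(0), dim=-1)  # (n_cap, n_q)
        attn_vec = (sim.unsqueeze(-1) * q_h.unsqueeze(0)).mean(dim=1)            # (n_cap, d_h)
        attention = self.F_fn(cap_h, attn_vec)

        # 4) Maximum attention fusion: the question state with the largest similarity.
        max_idx = sim.argmax(dim=1)                 # (n_cap,)
        max_attention = self.F_fn(cap_h, q_h[max_idx])

        return complete, avg_pool, attention, max_attention
```

Running the same module on the backward hidden states gives the other direction, yielding the 8 feature vectors that claim 10 concatenates.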
10. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 9, characterized in that the four fusion methods of step 5) concatenate the 8 generated feature vectors to obtain the comprehensive fusion feature of the i-th caption, denoted
[claim formula image omitted in source]
The comprehensive fusion feature is input into a bidirectional LSTM (BiLSTM) to obtain the final hidden states in both directions, with the following formulas:
[claim formula images omitted in source]
Second, the two final hidden states are concatenated head to tail to generate the multi-angle semantic features:
[claim formula image omitted in source]
Finally, to facilitate multi-modal feature fusion, the multi-angle semantic features are mapped to the same dimension as the visual representation, with the following formula:
[claim formula image omitted in source]
where the omitted symbol is a learnable weight matrix and b_s is a bias.
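A sketch of this aggregation step, assuming the 8 fusion vectors per caption word are concatenated into the BiLSTM input and the two final hidden states are concatenated before the linear mapping with W_s and b_s; the dimension names l, d_lstm and d_visual are illustrative.

```python
import torch
import torch.nn as nn

class MultiAngleSemanticHead(nn.Module):
    """Run a BiLSTM over the concatenated fusion vectors, concatenate the two
    final hidden states, and map the result to the visual feature dimension."""

    def __init__(self, l, d_lstm, d_visual):
        super().__init__()
        self.bilstm = nn.LSTM(8 * l, d_lstm, bidirectional=True, batch_first=True)
        self.W_s = nn.Linear(2 * d_lstm, d_visual)   # weight matrix W_s and bias b_s

    def forward(self, fusion_feats):
        # fusion_feats: (batch, n_cap, 8 * l), the concatenated fusion vectors per caption word
        _, (h_n, _) = self.bilstm(fusion_feats)          # h_n: (2, batch, d_lstm)
        semantic = torch.cat([h_n[0], h_n[1]], dim=-1)   # concatenate forward/backward final states
        return self.W_s(semantic)                        # multi-angle semantic features, (batch, d_visual)
```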
CN202210223976.XA 2022-03-07 2022-03-07 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels Active CN114661874B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210223976.XA CN114661874B (en) 2022-03-07 2022-03-07 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels

Publications (2)

Publication Number Publication Date
CN114661874A true CN114661874A (en) 2022-06-24
CN114661874B CN114661874B (en) 2024-04-30

Family

ID=82028726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210223976.XA Active CN114661874B (en) 2022-03-07 2022-03-07 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels

Country Status (1)

Country Link
CN (1) CN114661874B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
US20200293921A1 (en) * 2019-03-12 2020-09-17 Beijing Baidu Netcom Science And Technology Co., Ltd. Visual question answering model, electronic device and storage medium
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network
CN110222770A (en) * 2019-06-10 2019-09-10 成都澳海川科技有限公司 A kind of vision answering method based on syntagmatic attention network
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
CN110717431A (en) * 2019-09-27 2020-01-21 华侨大学 Fine-grained visual question and answer method combined with multi-view attention mechanism
KR20210056071A (en) * 2019-11-08 2021-05-18 경기대학교 산학협력단 System for visual dialog using deep visual understanding
CN113886626A (en) * 2021-09-14 2022-01-04 西安理工大学 Visual question-answering method of dynamic memory network model based on multiple attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
孟祥申; 江爱文; 刘长红; 叶继华; 王明文: "Visual question answering algorithm based on Spatial-DCTHash dynamic parameter network", Scientia Sinica Informationis, no. 08, 20 August 2017 (2017-08-20) *
王鑫: "Research on visual question answering algorithms based on a visual-semantic dual channel", China Master's Theses Full-text Database, 15 February 2023 (2023-02-15) *
闫茹玉; 刘学亮: "Visual question answering model combining a bottom-up attention mechanism and memory networks", Journal of Image and Graphics, no. 05, 16 May 2020 (2020-05-16) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115730059A (en) * 2022-12-08 2023-03-03 安徽建筑大学 Visual question answering method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN114661874B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN108959396B (en) Machine reading model training method and device and question and answer method and device
CN108804530B (en) Subtitling areas of an image
CN110991290B (en) Video description method based on semantic guidance and memory mechanism
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN109740158B (en) Text semantic parsing method and device
CN113569001A (en) Text processing method and device, computer equipment and computer readable storage medium
CN114443899A (en) Video classification method, device, equipment and medium
CN112214996A (en) Text abstract generation method and system for scientific and technological information text
CN113626589A (en) Multi-label text classification method based on mixed attention mechanism
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
CN113392265A (en) Multimedia processing method, device and equipment
CN114943921A (en) Video text description method fusing multi-granularity video semantic information
CN114661874A (en) Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels
CN113609326A (en) Image description generation method based on external knowledge and target relation
CN117058276A (en) Image generation method, device, equipment and storage medium
CN115223086B (en) Cross-modal action positioning method and system based on interactive attention guidance and correction
CN114511813B (en) Video semantic description method and device
CN116028888A (en) Automatic problem solving method for plane geometry mathematics problem
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
CN115203388A (en) Machine reading understanding method and device, computer equipment and storage medium
CN117668213B (en) Chaotic engineering abstract generation method based on cascade extraction and graph comparison model
CN117648429B (en) Question-answering method and system based on multi-mode self-adaptive search type enhanced large model
CN115858791B (en) Short text classification method, device, electronic equipment and storage medium
US20240177507A1 (en) Apparatus and method for generating text from image and method of training model for generating text from image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant