CN114661874B - Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels - Google Patents

Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels

Info

Publication number
CN114661874B
CN114661874B CN202210223976.XA
Authority
CN
China
Prior art keywords
features
visual
fusion
attention
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210223976.XA
Other languages
Chinese (zh)
Other versions
CN114661874A (en)
Inventor
王鑫
陈巧红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Sci Tech University ZSTU
Original Assignee
Zhejiang Sci Tech University ZSTU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Sci Tech University ZSTU filed Critical Zhejiang Sci Tech University ZSTU
Priority to CN202210223976.XA priority Critical patent/CN114661874B/en
Publication of CN114661874A publication Critical patent/CN114661874A/en
Application granted granted Critical
Publication of CN114661874B publication Critical patent/CN114661874B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • G06F40/35Discourse or dialogue representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Human Resources & Organizations (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Quality & Reliability (AREA)
  • Human Computer Interaction (AREA)
  • Operations Research (AREA)
  • General Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of cross-modal tasks combining computer vision and natural language processing. The technical scheme is as follows: a visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels comprises the following steps. Step 1: preprocess the input image and extract visual features and geometric features of the salient regions in the input image with an object detection module. Step 2: for the embedding of the question text, split sentences into words using spaces and punctuation marks (numbers and number-based tokens are each treated as a single word); represent each word as a vector with a pre-trained word-vector model; finally, pass the word-vector sequence through a long short-term memory (LSTM) network and take the state of the last time step as the question feature. The method makes the trained model robust, gives it stronger generalization ability in more complex visual scenes, improves the semantic quality of the answers, and improves the accuracy of the visual question-answering model.

Description

Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels
Technical Field
The invention belongs to the technical field of cross-modal tasks combining computer vision and natural language processing, and in particular relates to a visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels.
Background
Visual question answering is a task that requires understanding visual content, semantic information, and cross-modal relationships at the same time. A great deal of prior work has developed models within a single field, either machine vision or natural language processing, with far-reaching impact. By combining these two fields, visual question answering, as one branch of the cross-modal field, has great potential influence on a wide range of applications such as visual navigation and remote monitoring.
Currently, various image algorithms have been applied to visual question answering and show excellent performance. Mainstream methods fall roughly into two categories: algorithms based on multi-modal fusion and algorithms based on attention mechanisms. Multi-modal fusion algorithms are built on a CNN-RNN structure and fuse visual features and text features into a unified representation for answer prediction. Attention-mechanism algorithms distinguish the information in the image that is relevant to the question and address the interaction between vision and language. However, neither multi-modal fusion nor attention mechanisms combine text information with image information effectively; existing visual question-answering models neither attend to the object-relation information of the picture nor capture high-level semantic information, while the visual question-answering task faces the challenges of answering different types of questions and extracting effective semantic information from the picture. A model should pay more attention to the object-relation information of the picture, so that it can match the question forward to the corresponding answer from the captions, and it should pay more attention to the high-level semantic information of the picture, so that it is more robust when matching answers from the captions.
Disclosure of Invention
The invention aims to overcome the defects of the background art and provide a visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels, which makes the trained model more robust, gives it stronger generalization ability in more complex visual scenes, and improves both the semantic quality of the answers and the accuracy of the visual question-answering model.
The technical scheme adopted by the invention is a visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels, comprising the following steps:
Step 1: preprocess the input image and extract the visual features and geometric features of the salient regions in the input image with an object detection module;
Step 2: for the embedding of the question text, split sentences into words using spaces and punctuation marks (numbers and number-based tokens are each treated as a single word); represent each word as a vector with a pre-trained word-vector model; finally, pass the word-vector sequence through a long short-term memory (LSTM) network and take the state of the last time step as the question feature;
Step 3: for the embedding of the image caption and dense caption text, likewise split the sentences into words using spaces and punctuation marks; then concatenate the obtained caption features and convert them into the form of a text paragraph; finally, encode the text paragraph with an LSTM network, whose last-layer output is the encoded word-vector sequence;
Step 4: apply an attention mechanism to the visual features and question features obtained in steps 1 and 2 to obtain the attention features related to the question; feed the visual features, geometric features and question features obtained in steps 1 and 2 into a relation reasoning module, which outputs the relation features; finally, fuse the attention features and relation features to generate the visual feature representation;
Step 5: input the word-vector sequence and question features obtained in steps 2 and 3 into a multi-angle semantic module to generate multi-angle semantic features;
Step 6: send the visual features and multi-angle semantic features generated in steps 4 and 5 into a visual-semantic selection gate, which controls, through feature fusion, the contributions of the visual channel and the semantic channel to the predicted answer; during answer prediction, the answer with the highest probability is selected by a multi-class classifier as the final answer.
The invention is also characterized in that:
In step 1, using the object detection module specifically means: object detection boxes are obtained with a Faster R-CNN model, and the K most relevant detection boxes (generally K = 36) are selected as important visual regions. For each selected region i, v_i is a d-dimensional visual object vector, and the input image is finally represented as V = {v_1, v_2, …, v_K}^T. In addition, the geometric features of the input image are also recorded as B = {b_1, b_2, …, b_K}^T, where (x_i, y_i), w_i, h_i denote the center coordinates, width and height of the selected region i, respectively, and w, h denote the width and height of the input image.
Step 2 is specifically implemented as follows:
First, each input question Q is trimmed to at most 14 words; extra words beyond 14 are simply discarded, and questions shorter than 14 words are padded with zero vectors. Then the 14-word question is converted into GloVe word vectors, giving a word-embedding sequence of size 14 × 300, which is passed through a long short-term memory (LSTM) network whose hidden layer has dimension d_q. Finally, the final hidden state of the LSTM is used as the question embedding of the input question Q.
The text-embedding procedure in step 3 is the same as that in step 2, apart from the additional step of concatenating the image caption with the dense captions.
The attention mechanism in step 4 specifically refers to: a top-down attention mechanism is introduced, and a soft-attention method is used as the attention module so that visual objects related to the question are introduced into the network structure and attention features are output; the attended feature over all visual regions, V_at, is expressed as a weighted sum:
V_at = A^T · V
where A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
The relation reasoning module in step 4 specifically refers to: the relations between image regions are encoded with a dual convolution stream, generating two different types of relation features, namely binary relation features and multivariate relation features. The relation reasoning module consists of three parts: feature fusion, binary relation reasoning and multivariate relation reasoning. The feature fusion part fuses the visual features, geometric features and question features by raising and reducing dimensions to generate the pairwise combination of visual region features. The binary relation reasoning part mines pairwise visual relations between visual regions and generates binary relation features with three consecutive 1×1 convolution layers. The multivariate relation reasoning part mines intra-group visual relations among visual regions and generates multivariate relation features with three consecutive 3×3 dilated convolution layers. Finally, the binary relation features and the multivariate relation features are combined to obtain the relation features.
The feature fusion comprises the following steps: first, the object features and geometric features of the K visual regions of the image are concatenated to generate the visual region features V_co = concat[V, B]; second, the visual region features V_co and the question feature are mapped into a low-dimensional subspace:
where W_v and W_q are learnable parameters, b_q and b_v are biases, and d_s is the dimension of the subspace.
To combine visual regions in pairs, the visual region features are expanded and added to their transpose, which yields the pairwise-combined visual region features V_fu.
The binary relation reasoning comprises the following steps: three consecutive 1×1 convolution layers are used, each followed by a ReLU activation layer; the numbers of channels of the three 1×1 convolution layers are d_s, …, respectively. The visual region combination feature V_fu is input into the binary relation reasoning module; the output of the last layer is added to its transpose to obtain a symmetric matrix, and the binary relation R_p is finally generated through softmax, according to the following formula:
The multivariate relation reasoning proceeds as follows: three consecutive 3×3 dilated convolution layers are used, each followed by a ReLU activation layer; the dilation rates of the three layers are 1, 2 and 4, respectively; the stride of all convolutions is 1, and zero padding is used so that the output of each convolution has the same size as its input. The pairwise combination V_fu of visual regions is input into the multivariate relation reasoning module; as in the binary relation reasoning, the output of the last convolution layer and ReLU activation layer is added to its transpose to obtain a symmetric matrix, and the multivariate relation R_g is finally generated through softmax, according to the following formula:
The specific implementation of step 4 is as follows:
First, according to multi-modal fusion:
where 1 ∈ R^d is a vector whose elements are all 1, and ⊙ denotes element-wise multiplication.
Second, the same mapping matrix is used for all image regions:
where P ∈ R^d is a learnable parameter; to obtain the attention mapping matrix, the attention weight ω_i of image region i is computed as follows:
Thus the attended feature over all visual regions, V_at, is expressed as a weighted sum:
V_at = A^T · V
where A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
The multi-angle semantic module in step 5 associates the question features with the caption features. The specific method is: first, the relevance between each caption t_i and the question q_j is computed with a cosine-similarity method, and the text features most relevant to the question q_j are selected; second, the weight coefficient R_i is combined with the caption feature t_i so that the semantic information more relevant to the question receives more attention, giving the weighted caption features; then every word of the captions is encoded with a bidirectional LSTM (BiLSTM), and every word of the question is also encoded with a BiLSTM; finally, four methods, namely complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, are adopted to improve the generalization ability of the model in understanding semantic information.
Step 5 is specifically implemented as follows:
Step 5.1: associate the question features with the caption features. First, the relevance between each caption t_i and the question q_j is computed with a cosine-similarity method, and the text features most relevant to the question q_j are selected; second, the weight coefficient R_i is combined with the caption feature t_i so that the semantic information more relevant to the question receives more attention, giving the weighted caption features. Then every word of the captions is encoded with a bidirectional LSTM (BiLSTM), and every word of the question is also encoded with a BiLSTM; finally, four methods, namely complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, are adopted to improve the generalization ability of the model in understanding semantic information.
Step 5.2: encode every word of the captions with a bidirectional LSTM (BiLSTM), and likewise encode every word of the question with a BiLSTM:
where the respective symbols denote the hidden states of the forward and backward LSTM of the captions at the i-th time step, and the hidden states of the forward and backward LSTM of the question at the j-th time step.
Step 5.3: four fusion strategies, namely complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, are adopted to capture the high-level semantic information.
Complete fusion passes each forward and backward word vector of the caption paragraph, together with the forward and backward final states of the whole question, into the function F for fusion, according to the following formula:
where the resulting l-dimensional vectors denote the forward and backward complete-fusion features of the i-th caption word vector, respectively.
Average-pooling fusion passes the forward (or backward) word-vector features of the caption paragraph and the forward (or backward) question features at every time step into the function F for fusion, and then performs an average-pooling operation, according to the following formula:
where the resulting l-dimensional vectors denote the forward and backward average-pooling fusion features of the i-th caption word vector, respectively.
Attention fusion first computes similarity coefficients between the contextual embeddings of the captions and of the question with a cosine-similarity function, then treats the coefficients as weights, multiplies them by each forward (or backward) word-vector embedding of the question, and takes the average, according to the following formula:
where the forward and backward similarity coefficients and the forward and backward attention vectors corresponding to the i-th caption word vector represent the relevance of the whole question to that word.
Finally, the attention vectors and the caption context embeddings are passed into the function F for fusion, giving the forward and backward attention-fusion features of the i-th caption word vector, as follows:
Maximum-attention fusion directly takes the question embedding with the largest similarity coefficient as the attention vector, and finally passes the attention vector and the caption embedding into the function F for fusion; the specific formula is as follows:
With the four fusion strategies of step 5, the eight generated feature vectors are concatenated to obtain the comprehensive fusion feature of the i-th caption word, denoted as
The comprehensive fusion feature is input into a bidirectional LSTM (BiLSTM), and the final hidden states of the two directions are obtained as follows:
Then the final hidden states of the head and tail are concatenated to generate the multi-angle semantic features. Finally, to facilitate multi-modal feature fusion, the multi-angle semantic features are mapped to the same dimension as the visual representation, as follows:
where W_s is a learnable weight matrix and b_s is a bias.
The invention has the beneficial effects that:
1. Based on a multi-angle semantic understanding and self-adaptive dual-channel model, the invention can capture visual cues and semantic cues of the image at the same time, and a gate added in the late fusion stage adaptively selects visual information and semantic information to answer the question, which makes the trained model robust.
2. The invention adopts a visual relation reasoning module in the visual channel, comprising binary relation reasoning and multivariate relation reasoning, which enhances the model's ability to understand visual content and gives it stronger generalization ability when facing more complex visual scenes.
3. The invention adopts a multi-angle semantic module in the semantic channel to generate semantic features, comprising complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, which improves the semantic quality of the answers and the accuracy of the visual question-answering model.
Drawings
FIG. 1 is a diagram of the network model structure of the method of the present invention.
FIG. 2 is a schematic diagram of the relation reasoning module in the method of the present invention.
FIG. 3 is a schematic diagram of the multi-angle semantic module in the method of the present invention.
Detailed Description
The invention will be further described with reference to the embodiment shown in the drawings.
The visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels disclosed by the invention comprises the following steps:
Step 1: the input image is preprocessed, and visual features and geometric features of a salient region in the input image are extracted by using an object detection module. Mesh features are extracted by using pre-training ResNet-101, object regions are explored in cooperation with a fast-RCNN model, 2048-dimensional target region features are extracted, and the K most relevant detection frames (generally K=36) are selected as important visual regions. For each selected region i, V i is a d-dimensional visual object vector, the input image is ultimately represented as v= { V 1,v2,…,vK}T, In addition, the geometric features of the input image are also recorded, denoted as b= { B 1,b2,…,bK}T, where/> (X i,yi),wi,hi represents the center coordinates, width and height of the selected region i, respectively. W, h represents the width and height of the input image, respectively.
Step 2: for the embedding of question text, the method of using spaces and punctuation marks divides sentences into words (numeric or numeric-based words are also regarded as one word); performing vectorization representation on the word by adopting a pre-trained word vector model; finally, word vector representation is used for obtaining the state of the last time step through a long-short memory network, so as to obtain the problem characteristics;
The implementation is as follows: each input question Q is trimmed to at most 14 words, extra words beyond 14 are simply discarded, and questions shorter than 14 words are padded with zero vectors. The 14-word question is then converted into GloVe vectors, giving a word-embedding sequence of size 14 × 300, which is passed through an LSTM whose hidden layer has dimension d_q. Finally, the final hidden state of the LSTM is used as the question embedding of the input question Q.
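A minimal sketch of this question encoder follows; the vocabulary size, the hidden dimension d_q = 512 and the use of a randomly initialised embedding table (in place of loaded GloVe weights) are illustrative assumptions.

```python
import torch
import torch.nn as nn

MAX_LEN, EMB_DIM, D_Q = 14, 300, 512

class QuestionEncoder(nn.Module):
    def __init__(self, vocab_size: int):
        super().__init__()
        # GloVe weights would be loaded into this table in practice
        self.embed = nn.Embedding(vocab_size, EMB_DIM, padding_idx=0)
        self.lstm = nn.LSTM(EMB_DIM, D_Q, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, 14); shorter questions are zero-padded
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return h_n[-1]                            # (batch, d_q): question feature

def trim_or_pad(ids, max_len=MAX_LEN):
    ids = ids[:max_len]                           # discard words beyond 14
    return ids + [0] * (max_len - len(ids))       # pad with the padding index 0

q = QuestionEncoder(vocab_size=20000)(torch.tensor([trim_or_pad([5, 17, 42, 9])]))
print(q.shape)                                    # torch.Size([1, 512])
```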
Step 3: for embedding of image captions and close captions text, the sentences are also divided into words by using spaces and punctuation marks, and the sentence length is also set to be 14; then, the invention adopts the first 6 close captions (according to the average value of the caption distribution) as text input, and the obtained caption features are cascaded and converted into the text paragraph form; finally, the text paragraphs are encoded by using a long-short-time memory network, and the output of the last layer is the encoded word vector sequence.
Step 4; using an attention mechanism for the visual features and the problem features obtained in the step 1 and the step 2 to obtain attention features related to the problem; outputting the relation features by a relation reasoning module through the visual features, the geometric features and the problem features obtained in the step 1 and the step 2; finally, the attention features and the relation features are fused to generate visual feature representations;
The attention mechanism specifically refers to: a top-down attention mechanism is introduced, and a soft-attention method is used as the attention module so that visual objects related to the question are introduced into the network structure and attention features are output; the attended feature over all visual regions, V_at, is expressed as a weighted sum:
V_at = A^T · V
where A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
The relation reasoning module consists of three parts: feature fusion, binary relation reasoning and multivariate relation reasoning (the use of a relation reasoning module is an innovation of the invention).
The feature fusion comprises the following steps: first, the object features and geometric features of the K visual regions of the image are concatenated to generate the visual region features V_co = concat[V, B]; second, the visual region features V_co and the question feature are mapped into a low-dimensional subspace:
where W_v and W_q are learnable parameters, b_q and b_v are biases, and d_s is the dimension of the subspace.
To combine visual regions in pairs, the visual region features are expanded and added to their transpose, which yields the pairwise-combined visual region features V_fu.
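A minimal sketch of this feature-fusion part is shown below. The exact fusion formula is not reproduced in this text, so conditioning the projected regions on the projected question feature by simple addition is an assumption, as are the dimensions.

```python
import torch
import torch.nn as nn

K, D_V, D_Q, D_S = 36, 2048, 512, 256

W_v = nn.Linear(D_V + 4, D_S)        # projects concat[V, B] into the d_s subspace
W_q = nn.Linear(D_Q, D_S)            # projects the question feature

V = torch.randn(K, D_V)              # visual object features
B = torch.rand(K, 4)                 # geometric features
q = torch.randn(D_Q)                 # question feature

V_co = torch.cat([V, B], dim=1)                  # (K, d+4) concatenated region features
V_s = torch.relu(W_v(V_co) + W_q(q))             # (K, d_s), question-conditioned (assumed)
V_fu = V_s.unsqueeze(0) + V_s.unsqueeze(1)       # (K, K, d_s) pairwise combination
```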
The binary relation reasoning comprises the following steps: three consecutive 1×1 convolution layers are used, each followed by a ReLU activation layer; the numbers of channels of the three 1×1 convolution layers are d_s, …, respectively. The visual region combination feature V_fu is input into the binary relation reasoning module; the output of the last layer is added to its transpose to obtain a symmetric matrix, and the binary relation R_p is finally generated through softmax, according to the following formula:
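A minimal sketch of binary relation reasoning follows: three stacked 1×1 convolutions with ReLU map V_fu to a K × K score map, which is symmetrised and normalised with softmax. The hidden channel widths after the first d_s are not fully recoverable from this text and are assumptions.

```python
import torch
import torch.nn as nn

K, D_S = 36, 256
binary_net = nn.Sequential(
    nn.Conv2d(D_S, D_S, kernel_size=1), nn.ReLU(),
    nn.Conv2d(D_S, D_S // 2, kernel_size=1), nn.ReLU(),   # hidden width assumed
    nn.Conv2d(D_S // 2, 1, kernel_size=1), nn.ReLU(),
)

V_fu = torch.randn(K, K, D_S)                                        # pairwise combination
scores = binary_net(V_fu.permute(2, 0, 1).unsqueeze(0)).squeeze()    # (K, K) score map
sym = scores + scores.t()                                            # symmetric matrix
R_p = torch.softmax(sym, dim=-1)                                     # binary relation R_p
```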
The multivariate relation reasoning proceeds as follows: three consecutive 3×3 dilated convolution layers are used, each followed by a ReLU activation layer; the dilation rates of the three layers are 1, 2 and 4, respectively; the stride of all convolutions is 1, and zero padding is used so that the output of each convolution has the same size as its input. The pairwise combination V_fu of visual regions is input into the multivariate relation reasoning module; as in the binary relation reasoning, the output of the last convolution layer and ReLU activation layer is added to its transpose to obtain a symmetric matrix, and the multivariate relation R_g is finally generated through softmax, according to the following formula:
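The multivariate branch can be sketched the same way, using 3×3 convolutions with dilation rates 1, 2 and 4 and padding chosen to keep the K × K size; hidden channel widths are again assumptions.

```python
import torch
import torch.nn as nn

K, D_S = 36, 256
multi_net = nn.Sequential(
    nn.Conv2d(D_S, D_S, 3, stride=1, padding=1, dilation=1), nn.ReLU(),
    nn.Conv2d(D_S, D_S // 2, 3, stride=1, padding=2, dilation=2), nn.ReLU(),
    nn.Conv2d(D_S // 2, 1, 3, stride=1, padding=4, dilation=4), nn.ReLU(),
)

V_fu = torch.randn(K, K, D_S)
scores = multi_net(V_fu.permute(2, 0, 1).unsqueeze(0)).squeeze()   # (K, K), size preserved
R_g = torch.softmax(scores + scores.t(), dim=-1)                   # multivariate relation R_g
```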
The specific implementation of step 4 is as follows:
First, starting from the simplest bilinear multi-modal fusion, the weight matrix W_i is replaced by two smaller matrices H_i G_i^T:
where 1 ∈ R^d is a vector whose elements are all 1, and ⊙ denotes element-wise multiplication.
Second, the same mapping matrix is used for all image regions:
where P ∈ R^d is a learnable parameter; to obtain the attention mapping matrix, the attention weight ω_i of image region i is computed as follows:
Thus the attended feature over all visual regions, V_at, is expressed as a weighted sum:
V_at = A^T · V
where A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
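A minimal sketch of this top-down soft attention built on the low-rank bilinear factorisation is given below; the dimensions and the exact placement of the non-linearities are assumptions, since the original formulas are not reproduced in this text.

```python
import torch
import torch.nn as nn

K, D_V, D_Q, D = 36, 2048, 512, 1024

H = nn.Linear(D_V, D, bias=False)    # low-rank factor acting on each visual object v_i
G = nn.Linear(D_Q, D, bias=False)    # low-rank factor acting on the question feature q
P = nn.Linear(D, 1, bias=False)      # shared scoring vector P

V = torch.randn(K, D_V)
q = torch.randn(D_Q)

fused = torch.relu(H(V)) * torch.relu(G(q))        # (K, D) element-wise multi-modal fusion
A = torch.softmax(P(fused).squeeze(-1), dim=0)     # (K,) attention weights omega_i
V_at = A @ V                                       # attended visual feature V_at = A^T V
```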
The specific implementation of step 5 is as follows:
Step 5.1: the method comprises the steps of associating the problem features with the subtitle features, firstly traversing and calculating the relevance between the subtitle t i and the problem q j by using a cosine similarity method, and selecting the text features most relevant to the problem q j. Secondly, combining the weight coefficient R i with the caption feature t i to make the semantic information more relevant to the problem get more attention, namely Wherein/>Representing the weighted caption feature. Each word of the subtitle is then encoded using bi-directional LSTM (BiLSTM), while each word of the question is also encoded using BiLSTM; and finally, four methods of complete fusion, average pooling fusion, attention fusion and maximum attention fusion are adopted to improve the generalization capability of the model for understanding semantic information.
Step 5.2: each word of the subtitle is encoded using bi-directional LSTM (BiLSTM), while each word of the BiLSTM encoding problem is also employed:
Wherein the method comprises the steps of Representing the hidden states of the forward and reverse LSTM of the subtitle at the ith time step, respectively.Representing the hidden state of the forward and reverse LSTM of the problem at the j-th time step, respectively.
Step 5.3: the invention adopts four fusion strategies of complete fusion, average pooling fusion and maximum attention fusion to capture high-level semantic information; (this is yet another innovative point of the present invention).
The complete fusion strategy passes each forward and backward word vector of the caption paragraph, together with the forward and backward final states of the whole question, into the function F for fusion, according to the following formula:
where the resulting l-dimensional vectors denote the forward and backward complete-fusion features of the i-th caption word vector, respectively.
The average-pooling fusion strategy passes the forward (or backward) word-vector features of the caption paragraph and the forward (or backward) question features at every time step into the function F for fusion, and then performs an average-pooling operation, according to the following formula:
where the resulting l-dimensional vectors denote the forward and backward average-pooling fusion features of the i-th caption word vector, respectively.
The attention fusion strategy first computes similarity coefficients between the contextual embeddings of the captions and of the question with a cosine-similarity function, then treats the coefficients as weights, multiplies them by each forward (or backward) word-vector embedding of the question, and takes the average, according to the following formula:
where the forward and backward similarity coefficients and the forward and backward attention vectors corresponding to the i-th caption word vector represent the relevance of the whole question to that word.
Finally, the attention vectors and the caption context embeddings are passed into the function F for fusion, giving the forward and backward attention-fusion features of the i-th caption word vector, as follows:
The maximum-attention fusion strategy directly takes the question embedding with the largest similarity coefficient as the attention vector, and finally passes the attention vector and the caption embedding into the function F for fusion, according to the following formula:
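A minimal sketch of the four fusion strategies (forward direction only; the backward direction is symmetric) is given below. The exact form of the fusion function F is not reproduced in this text, so a simple learned element-wise fusion is assumed as a stand-in, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F_t

L_CAP, L_Q, H, L_OUT = 98, 14, 512, 128

class FuseF(nn.Module):                        # assumed stand-in for the fusion function F
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(H, L_OUT)
    def forward(self, a, b):                   # a: (..., H); b broadcastable to a
        return torch.tanh(self.proj(a * b))    # (..., l)

F = FuseF()
cap = torch.randn(L_CAP, H)                    # forward caption hidden states
qst = torch.randn(L_Q, H)                      # forward question hidden states

# 1) complete fusion: each caption step vs. the final question state
m_full = F(cap, qst[-1])                                                 # (L_CAP, l)
# 2) average-pooling fusion: fuse with every question step, then mean-pool
m_avg = F(cap.unsqueeze(1), qst.unsqueeze(0)).mean(dim=1)                # (L_CAP, l)
# 3) attention fusion: cosine weights over question steps -> averaged attention vector
sim = F_t.cosine_similarity(cap.unsqueeze(1), qst.unsqueeze(0), dim=-1)  # (L_CAP, L_Q)
attn_vec = (sim.unsqueeze(-1) * qst.unsqueeze(0)).mean(dim=1)            # (L_CAP, H)
m_att = F(cap, attn_vec)
# 4) maximum-attention fusion: take the single most similar question step
m_max = F(cap, qst[sim.argmax(dim=1)])
```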
With the four fusion methods, the eight generated feature vectors are concatenated to obtain the comprehensive fusion feature of the i-th caption word. The comprehensive fusion feature is input into a bidirectional LSTM (BiLSTM), and the final hidden states of the two directions are obtained as follows:
Then the final hidden states of the head and tail are concatenated to generate the multi-angle semantic features. Finally, to facilitate multi-modal feature fusion, the multi-angle semantic features are mapped to the same dimension as the visual representation, as follows:
where W_s is a learnable weight matrix and b_s is a bias.
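A minimal sketch of this aggregation step follows; the fused-vector width, the BiLSTM hidden size and the visual dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

L_CAP, L_FUSE, H_S, D_VIS = 98, 128, 512, 1024

m_all = torch.randn(1, L_CAP, 8 * L_FUSE)        # 8 fused vectors per caption word, concatenated
bilstm = nn.LSTM(8 * L_FUSE, H_S, batch_first=True, bidirectional=True)
W_s = nn.Linear(2 * H_S, D_VIS)                  # maps to the visual-feature dimension

_, (h_n, _) = bilstm(m_all)                      # h_n: (2, 1, H_S) forward / backward final states
M = torch.cat([h_n[0], h_n[1]], dim=-1)          # (1, 2*H_S) multi-angle semantic feature
S = W_s(M)                                       # (1, D_VIS), ready for multi-modal fusion
```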
Step 6: and (3) sending the visual features and the multi-angle semantic features generated in the step (4) and the step (5) into a visual semantic selection gate, and controlling the contribution of the visual channel and the semantic channel to the predicted answer in a feature fusion mode. The answer with highest probability is selected as the final answer through the multi-classifier in the prediction of the answer.
In summary, on the VQA 1.0 and VQA 2.0 datasets, the invention uses an R-CNN-LSTM framework that combines an attention mechanism and a relation reasoning method in the visual channel: the visual feature vectors and geometric feature vectors of the image encoded by Faster R-CNN are input into the visual channel to generate the visual modality representation. The semantic channel uses an LSTM network to encode the concatenation of the global caption and the local captions, and outputs the semantic modality representation through the multi-angle semantic module. Finally, the obtained visual modality representation and semantic modality representation are input into the adaptive selection gate, which decides which modality cues are used to predict the answer.
The innovations are as follows: first, a relation reasoning module is adopted in the visual channel, comprising binary relation reasoning and multivariate relation reasoning, which enhances the model's ability to understand visual content and gives it stronger generalization ability when facing more complex visual scenes. Second, a multi-angle semantic module is adopted in the semantic channel to generate semantic features, comprising complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, which improves the semantic quality of the answers and at the same time the accuracy of the visual question-answering model.
Simulation experiments and characterization of the experimental results:
1. Datasets
The model was run on the two public visual question-answering datasets VQA 1.0 and VQA 2.0, respectively. VQA 1.0 is built on the MSCOCO image dataset [38]; its training set contains 248,349 questions and 82,783 pictures, its validation set contains 121,512 questions and 40,504 pictures, and its test set contains 244,302 questions and 81,434 pictures. VQA 2.0 is an iterated version of VQA 1.0 that adds more question samples to make the language bias more balanced. The training set of the VQA 2.0 dataset contains 443,757 questions and 82,783 pictures, the validation set contains 214,354 questions and 40,504 pictures, and the test set contains 447,793 questions and 81,434 pictures. There are three question types: yes/no, number, and other, with the other type accounting for roughly half of all samples. The proposed model is trained on the training and validation sets, and, to ensure a fair comparison with other work, the test results are reported on the test-development set (test-dev) and the test-standard set (test-standard).
2. Experimental environment
The proposed model is implemented with the PyTorch library, and the test experiments are completed on a GPU server. The server is configured with 256 GB of RAM and four Nvidia 1080Ti GPUs with a total of 64 GB of memory. The model is trained with an Adam optimizer; the maximum number of iteration epochs is 40, and the batch size is set to 256. The learning rate is set to 1e-3 in the first training epoch, 2e-3 in the second and 3e-3 in the third, kept constant until the tenth epoch, and then decayed every two epochs with a decay rate of 0.5. To prevent gradient explosion, gradient clipping is also adopted, scaling the gradient values of each epoch to one quarter of their original values. To prevent overfitting, a dropout layer with a dropout rate of 0.5 is used after each fully connected layer.
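A minimal sketch of this training schedule follows; `model` is a placeholder, and the clipping threshold used to realise the "one quarter" gradient scaling is an assumption.

```python
import torch

model = torch.nn.Linear(10, 10)                    # placeholder for the full VQA model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def learning_rate(epoch: int) -> float:            # epochs counted from 1
    if epoch <= 3:
        return 1e-3 * epoch                        # 1e-3, 2e-3, 3e-3 warm-up
    if epoch <= 10:
        return 3e-3                                # held constant until epoch 10
    return 3e-3 * (0.5 ** ((epoch - 10 + 1) // 2)) # decay by 0.5 every two epochs

for epoch in range(1, 41):                         # at most 40 epochs
    for group in optimizer.param_groups:
        group["lr"] = learning_rate(epoch)
    # ... forward / backward passes over batches of size 256 go here ...
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.25)  # gradient clipping (assumed threshold)
    optimizer.step()
```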
3. Experimental results and analysis
TABLE 1 Performance of the models on the VQA 1.0 test-development and test-standard sets
As shown in table 1, performance comparisons of various advanced models and the models herein are mainly shown, and the results shown in the table are obtained after the models are trained in the training set and the verification set. It can be seen that the performance of the model of the invention is obviously superior to other models in most indexes, and the overall accuracy in the test-development set and the test-standard set respectively reaches 69.44% and 69.37%. In the test-development set, there was a 5.64% improvement in overall accuracy over the MAN model using the memory-enhanced neural network, and a 0.73% improvement over the best performing VSDC model. The VSDC model also adopts the concept of semantic guidance prediction, and adopts an attention mechanism in the aspect of semantics to acquire semantic information related to the problem. In addition to the semantic attention mechanism, the invention adds three fusion methods to improve the multi-angle semantic understanding capability of the model, and experimental results show that the multi-angle semantic modules in the semantic channel have important significance for improving the prediction precision. The model proposed by the present invention also has the same performance in the test-standard set.
TABLE 2 Performance of the models on the VQA 2.0 test-development and test-standard sets
As shown in Table 2, the performance of the model is further verified on the VQA 2.0 dataset, including the test-development set and the test-standard set. Compared with advanced methods, the proposed model performs well on metrics such as overall accuracy. Compared with the MuRel [49] model, the overall accuracy of the invention is improved by 1.22% and 0.89% on the test-development set and the test-standard set, respectively. The MuRel model is one of the more prominent current multi-modal relation-modeling methods, a network structure that performs end-to-end reasoning with residual feature learning. The performance of the invention is superior to that model because the semantic channel guides answer prediction, allowing the model to exploit a large amount of semantic information to improve prediction accuracy. In addition, compared with the VCTREE model, which combines reinforcement learning and supervised learning in parallel and is currently one of the better-performing visual question-answering methods, the invention has clear advantages on metrics such as overall accuracy. In summary, the comparison with advanced methods shows that the proposed model can better mine semantic information on the basis of understanding the image content and improves the accuracy of answer prediction.

Claims (7)

1. A visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels, characterized by comprising the following steps:
step 1: preprocessing an input image, and extracting visual features and geometric features of salient regions in the input image with an object detection module;
step 2: for the embedding of the question text, splitting sentences into words using spaces and punctuation marks; representing each word as a vector with a pre-trained word-vector model; and finally passing the word-vector sequence through a long short-term memory (LSTM) network and taking the state of the last time step as the question features;
step 3: for the embedding of the image caption and dense caption text, likewise splitting the sentences into words using spaces and punctuation marks; then concatenating the obtained caption features and converting them into the form of a text paragraph; and finally encoding the text paragraph with an LSTM network, the last-layer output being the encoded word-vector sequence;
step 4: applying an attention mechanism to the visual features and question features obtained in step 1 and step 2 to obtain the attention features related to the question; feeding the visual features, geometric features and question features obtained in step 1 and step 2 into a relation reasoning module, which outputs the relation features; and finally fusing the attention features and the relation features to generate the visual feature representation;
step 5: inputting the word-vector sequence and question features obtained in step 2 and step 3 into a multi-angle semantic module to generate multi-angle semantic features;
step 6: sending the visual features and multi-angle semantic features generated in step 4 and step 5 into a visual-semantic selection gate, which controls, through feature fusion, the contributions of the visual channel and the semantic channel to the predicted answer; and, during answer prediction, selecting the answer with the highest probability through a multi-class classifier as the final answer;
the attention mechanism in step 4 specifically means: a top-down attention mechanism is introduced, a soft-attention method is used as the attention module so that visual objects related to the question are introduced into the network structure, and attention features are output; wherein the attended feature over all visual regions, V_at, is expressed as a weighted sum:
V_at = A^T · V
wherein A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix;
the relation reasoning module in step 4 specifically means: the relations between image regions are encoded with a dual convolution stream, generating two different types of relation features, namely binary relation features and multivariate relation features; the relation reasoning module consists of three parts: feature fusion, binary relation reasoning and multivariate relation reasoning; the feature fusion module fuses the visual features, geometric features and question features by raising and reducing dimensions to generate the pairwise combination of visual region features; the binary relation reasoning module mines pairwise visual relations between visual regions and generates binary relation features with three consecutive 1×1 convolution layers; the multivariate relation reasoning module mines intra-group visual relations among visual regions and generates multivariate relation features with three consecutive 3×3 dilated convolution layers; finally, the binary relation features and the multivariate relation features are combined to obtain the relation features;
the feature fusion comprises the following steps: first, the object features and geometric features of the K visual regions of the image are concatenated to generate the visual region features V_co = concat[V, B]; second, the visual region features V_co and the question feature are mapped into a low-dimensional subspace:
wherein W_v and W_q are learnable parameters, b_q and b_v are biases, and d_s is the dimension of the subspace;
the binary relation reasoning comprises adopting three consecutive 1×1 convolution layers, each followed by a ReLU activation layer; the numbers of channels of the three 1×1 convolution layers are d_s, …, respectively; the visual region combination feature V_fu is input into the binary relation reasoning module, the output of the last layer is added to its transpose to obtain a symmetric matrix, and the binary relation R_p is finally generated through softmax, according to the following formula:
the multivariate relation reasoning comprises the following steps: three consecutive 3×3 dilated convolution layers are adopted, each followed by a ReLU activation layer; the dilation rates of the three layers are 1, 2 and 4, respectively; the stride of all convolutions is 1, and zero padding is used so that the output of each convolution has the same size as its input; the pairwise combination V_fu of visual regions is input into the multivariate relation reasoning module; as in the binary relation reasoning, the output of the last convolution layer and ReLU activation layer is added to its transpose to obtain a symmetric matrix, and the multivariate relation R_g is finally generated through softmax, according to the following formula:
the multi-angle semantic module in step 5 associates the question features with the caption features; the specific method is: first, the relevance between each caption t_i and the question q_j is computed with a cosine-similarity method, and the text features most relevant to the question q_j are selected; second, the weight coefficient R_i is combined with the caption feature t_i so that the semantic information more relevant to the question receives more attention, giving the weighted caption features; then every word of the captions is encoded with a bidirectional LSTM, and every word of the question is also encoded with a BiLSTM; and finally, four methods, namely complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, are adopted to improve the generalization ability of the model in understanding semantic information.
2. The visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels according to claim 1, characterized in that: in step 1, using the object detection module specifically means: object detection boxes are obtained with a Faster R-CNN model, and the K most relevant detection boxes are selected as important visual regions; for each selected region i, v_i is a d-dimensional visual object vector, and the input image is finally represented as V = {v_1, v_2, …, v_K}^T; in addition, the geometric features of the input image are also recorded as B = {b_1, b_2, …, b_K}^T, wherein (x_i, y_i), w_i, h_i denote the center coordinates, width and height of the selected region i, respectively, and w, h denote the width and height of the input image.
3. The visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels according to claim 2, characterized in that step 2 is specifically implemented as follows:
first, each input question Q is trimmed to at most 14 words, extra words beyond 14 are simply discarded, and questions shorter than 14 words are padded with zero vectors; then the 14-word question is converted into GloVe word vectors, the word-embedding sequence generated from the question has size 14 × 300, and it is passed through a long short-term memory network whose hidden layer has dimension d_q; finally, the final hidden state of the LSTM is used as the question embedding of the input question Q;
the text-embedding procedure in step 3 is the same as that in step 2, apart from the additional step of concatenating the image caption with the dense captions.
4. The visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels according to claim 3, characterized in that step 4 is specifically implemented as follows:
first, according to multi-modal fusion:
wherein 1 ∈ R^d is a vector whose elements are all 1, and ⊙ denotes element-wise multiplication;
second, the same mapping matrix is used for all image regions:
wherein P ∈ R^d is a learnable parameter; to obtain the attention mapping matrix, the attention weight ω_i of image region i is computed as follows:
thus the attended feature over all visual regions, V_at, is expressed as a weighted sum:
V_at = A^T · V
wherein A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
5. The visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels according to claim 4, characterized in that step 5 is specifically implemented as follows:
step 5.1: associating the question features with the caption features: first, the relevance between each caption t_i and the question q_j is computed with a cosine-similarity method, and the text features most relevant to the question q_j are selected; second, the weight coefficient R_i is combined with the caption feature t_i so that the semantic information more relevant to the question receives more attention, giving the weighted caption features;
step 5.2: encoding every word of the captions with a bidirectional LSTM (BiLSTM), and likewise encoding every word of the question with a BiLSTM:
wherein the respective symbols denote the hidden states of the forward and backward LSTM of the captions at the i-th time step, and the hidden states of the forward and backward LSTM of the question at the j-th time step;
step 5.3: adopting four fusion strategies, namely complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, to capture the high-level semantic information.
6. The visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels according to claim 5, characterized in that: the complete fusion passes each forward and backward word vector of the caption paragraph, together with the forward and backward final states of the whole question, into the function F for fusion, according to the following formula:
wherein the resulting l-dimensional vectors denote the forward and backward complete-fusion features of the i-th caption word vector, respectively;
the average-pooling fusion passes the forward or backward word-vector features of the caption paragraph and the forward or backward question features at every time step into the function F for fusion, and then performs an average-pooling operation, according to the following formula:
wherein the resulting l-dimensional vectors denote the forward and backward average-pooling fusion features of the i-th caption word vector, respectively;
the attention fusion first computes similarity coefficients between the contextual embeddings of the captions and of the question with a cosine-similarity function, then treats the coefficients as weights, multiplies them by each forward word-vector embedding of the question, and takes the average, according to the following formula:
wherein the forward and backward similarity coefficients and the forward and backward attention vectors corresponding to the i-th caption word vector represent the relevance of the whole question to that word;
finally, the attention vectors and the caption context embeddings are passed into the function F for fusion, giving the forward and backward attention-fusion features of the i-th caption word vector, as follows:
the maximum-attention fusion directly takes the question embedding with the largest similarity coefficient as the attention vector, and finally passes the attention vector and the caption embedding into the function F for fusion, according to the following formula:
7. the visual question-answering method based on multi-angle semantic understanding and self-adaption double channels according to claim 6, wherein the method is characterized by comprising the following steps: the four fusion methods in the step 5) are used for marking the comprehensive fusion characteristics of the ith subtitle obtained by cascading the generated 8 characteristic vectors as
The comprehensive fusion features are fed into a bidirectional LSTM to obtain the final hidden states in the two directions: $\overrightarrow{h}_i^s = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h}_{i-1}^s, m_i)$, $\overleftarrow{h}_i^s = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h}_{i+1}^s, m_i)$;
Next, the final hidden states at the head and the tail are concatenated to generate the multi-angle semantic feature. Finally, to facilitate multi-modal feature fusion, the multi-angle semantic feature is mapped to the same dimension as the visual representation:

$S = W_s\,[\overrightarrow{h}_N^s; \overleftarrow{h}_1^s] + b_s$, with $N$ the number of caption words,

where $W_s$ is a learnable weight matrix and $b_s$ is a bias.
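A hedged sketch of this aggregation step in claim 7, assuming PyTorch and that the eight l-dimensional fusion vectors per caption word have already been concatenated; the class name MultiAngleAggregator and all dimensions are illustrative.

```python
# Sketch of the aggregation: concatenated fusion features -> BiLSTM -> head/tail
# hidden states concatenated -> linear projection (W_s, b_s) to the visual dimension.
import torch
import torch.nn as nn

class MultiAngleAggregator(nn.Module):
    def __init__(self, fusion_dim: int, hidden_dim: int, visual_dim: int):
        super().__init__()
        # each caption word carries 8 fusion vectors of size fusion_dim (4 strategies x 2 directions)
        self.agg_lstm = nn.LSTM(8 * fusion_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, visual_dim)  # plays the role of W_s and b_s

    def forward(self, fused_per_word: torch.Tensor) -> torch.Tensor:
        # fused_per_word: (B, T_c, 8 * fusion_dim)
        ctx, _ = self.agg_lstm(fused_per_word)              # (B, T_c, 2 * hidden_dim)
        half = ctx.size(-1) // 2
        fwd_final = ctx[:, -1, :half]                       # forward LSTM final state (tail)
        bwd_final = ctx[:, 0, half:]                        # backward LSTM final state (head)
        multi_angle = torch.cat([fwd_final, bwd_final], dim=-1)
        return self.proj(multi_angle)                       # mapped to the visual feature dimension
```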
CN202210223976.XA 2022-03-07 2022-03-07 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels Active CN114661874B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210223976.XA CN114661874B (en) 2022-03-07 2022-03-07 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210223976.XA CN114661874B (en) 2022-03-07 2022-03-07 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels

Publications (2)

Publication Number Publication Date
CN114661874A CN114661874A (en) 2022-06-24
CN114661874B true CN114661874B (en) 2024-04-30

Family

ID=82028726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210223976.XA Active CN114661874B (en) 2022-03-07 2022-03-07 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels

Country Status (1)

Country Link
CN (1) CN114661874B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115730059A (en) * 2022-12-08 2023-03-03 安徽建筑大学 Visual question answering method, device, equipment and storage medium


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965705B2 (en) * 2015-11-03 2018-05-08 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network
CN110222770A (en) * 2019-06-10 2019-09-10 成都澳海川科技有限公司 A kind of vision answering method based on syntagmatic attention network
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
CN110717431A (en) * 2019-09-27 2020-01-21 华侨大学 Fine-grained visual question and answer method combined with multi-view attention mechanism
KR20210056071A (en) * 2019-11-08 2021-05-18 경기대학교 산학협력단 System for visual dialog using deep visual understanding
CN113886626A (en) * 2021-09-14 2022-01-04 西安理工大学 Visual question-answering method of dynamic memory network model based on multiple attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Visual question answering algorithm based on Spatial-DCTHash dynamic parameter network; Meng Xiangshen; Jiang Aiwen; Liu Changhong; Ye Jihua; Wang Mingwen; Scientia Sinica Informationis; 2017-08-20 (08); full text *
Research on a visual question answering algorithm based on visual-semantic dual channels; Wang Xin; China Master's Theses Full-text Database; 2023-02-15; full text *
Visual question answering model combining a bottom-up attention mechanism and memory network; Yan Ruyu; Liu Xueliang; Journal of Image and Graphics; 2020-05-16 (05); full text *

Also Published As

Publication number Publication date
CN114661874A (en) 2022-06-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant