CN114973041A - Language prior method for overcoming visual question and answer based on self-contrast learning - Google Patents
- Publication number
- CN114973041A (application number CN202111557673.3A)
- Authority
- CN
- China
- Prior art keywords
- image
- attention
- self
- contrast
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to a language prior method for overcoming visual question answering based on self-contrast learning. First, image features are extracted with a pre-trained model, and the question words are embedded and fed into a GRU (gated recurrent unit) to generate a question feature. Then an attention mechanism is applied to the question and the image: the question feature and the image feature are fused into a joint representation, while the weights learned by the attention mechanism are fed to an anti-attention layer, where the question feature and the anti-attended image feature are fused into a second joint representation, and the two representations are contrasted. The correlation between the question and the image is increased by optimizing the base VQA classification loss together with a self-contrast learning loss Lscl; the two losses are combined into a joint loss for training. The method is built on the LMH model and, without using auxiliary tasks, achieves state-of-the-art performance of 59.00% on the most common benchmark VQA-CP v2, an absolute improvement of 6.51%.
Description
Technical Field
The invention relates to a language prior method for overcoming visual question answering based on self-contrast learning, and belongs to the technical fields of natural language processing and computer vision.
Background
Visual Question Answering (VQA) aims to automatically answer natural language questions based on visual content, and is one of the benchmark tasks for multimodal learning (e.g., language and images). It requires visual analysis, language understanding, multimodal information fusion, and reasoning. In recent years VQA has attracted considerable interest, and various benchmark datasets have been published. With the introduction of a large body of work, the VQA task has made significant progress. Many works attempt to understand both images and questions, but recent studies find that they are largely driven by superficial language correlations (i.e., language priors) in the training QA pairs and ignore image content. For example, such models tend to give the most frequent training answer to "How many ...?" questions and to answer "tennis" to "What sport ...?" questions, instead of reasoning over the image content. To help address these biases, Agrawal et al. created the diagnostic benchmark VQA-CP (VQA under Changing Priors) in 2018 by reorganizing the training and validation splits of the respective VQA datasets. Most prior work designs various attention mechanisms to learn relationships between the modalities, which work well on standard VQA benchmarks (e.g., VQA v2). However, the performance of these works drops significantly on VQA-CP because of language priors. To alleviate language priors, existing work focuses on reducing the statistical priors of the question and increasing image dependency and interpretability, and can be roughly divided into learning with auxiliary tasks and learning without auxiliary tasks. Learning without auxiliary tasks uses an auxiliary QA branch, or a specific complex learning strategy, to regularize the training of the target VQA model; these methods add the auxiliary branch to capture the language prior and diminish its effect.
Learning with auxiliary tasks introduces additional manual supervision and auxiliary tasks (visual grounding, image captioning, etc.) to increase image dependency and interpretability. These methods can better understand the image content under the guidance of the auxiliary tasks and thereby achieve better performance. However, the inherent data bias remains severe and leads to superficial language correlations. It is therefore important to reduce the native language prior, without introducing additional annotations, while relying on the relevant visual regions for decision making.
Disclosure of Invention
The invention provides a language prior method for overcoming visual question answering based on self-contrast learning. It overcomes the VQA language prior problem with a novel self-contrast learning scheme that concentrates on the relevant regions to predict the correct answer to a given question about the input image, improving the reasoning ability and robustness of the VQA model.
The technical scheme of the invention is as follows: a language prior method for overcoming visual question answering based on self-contrast learning comprises the following specific steps:
step1, firstly, taking the questions, the images and answer options as experimental data, secondly, preprocessing the images to extract a feature map, and preprocessing the questions to generate question feature vectors;
step2, using the attention layer to learn to identify image regions relevant to the question; after the preprocessing of Step1, the attention mechanism uses the question to compute attention weights over image regions, thereby locating the image regions associated with the question, and the resulting question feature q and the weighted image feature are fused into a joint representation r;
step3, using the anti-attention layer to identify image regions that are currently irrelevant or weakly relevant; with the attention weights obtained in Step2, the question feature q and the weighted anti-attended image feature are fused into a joint representation r0, focusing the question on irrelevant regions and ignoring the relevant regions of the image so as to form a contrast;
step4, post-processing: with the joint representation r0 from Step3 and the joint representation r from Step2, the proposed network is trained to optimize the joint loss of the self-contrast loss Lscl and the base VQA classification loss Lvqa, so that the model can focus on the relevant regions to predict the correct answer to a given question about the input image.
Further, the Step1 includes the following steps:
step1.1, firstly extracting a series of visual object features from the image using the pre-trained model Faster R-CNN;
step1.2, performing word embedding on the question and passing it to a single-layer GRU to generate the question feature;
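The question encoding of Step1.2 can be illustrated with a minimal single-layer GRU sketch in NumPy. This is not the patent's implementation; the embedding dimension (300), hidden size (512), question length (14), and random parameters are illustrative assumptions. It only shows how embedded question words are folded step by step into a single question feature q:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU step: x is a word embedding, h the previous hidden state.
    W, U, b stack the update/reset/candidate parameters along axis 0."""
    Wz, Wr, Wh = W
    Uz, Ur, Uh = U
    bz, br, bh = b
    z = sigmoid(x @ Wz + h @ Uz + bz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur + br)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh + bh)   # candidate state
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
d_emb, d_hid, n_words = 300, 512, 14                # assumed sizes
W = rng.normal(scale=0.1, size=(3, d_emb, d_hid))
U = rng.normal(scale=0.1, size=(3, d_hid, d_hid))
b = np.zeros((3, d_hid))

h = np.zeros(d_hid)
for x in rng.normal(size=(n_words, d_emb)):         # stand-ins for embedded words
    h = gru_step(x, h, W, U, b)
q = h                                               # question feature q
```

In practice the embeddings would come from a learned lookup table and the GRU parameters from training; here random values stand in for both.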
further, the specific steps of Step2 are as follows:
step2.1, after the image and question features are extracted, they are passed to the attention layer, which transforms the image features and the question features into a space of the same dimension;
step2.2, computing the attention weights and generating a normalized attention weight for each feature map; the final image feature is the weighted sum of all input features;
step2.3, fusing the weighted image feature with the question feature obtained in Step1.2 into a joint feature representation r, and further computing the probability distribution over each answer a in the candidate answer set A.
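Steps 2.1-2.3 can be sketched in NumPy as follows. The dimensions (36 Faster R-CNN regions of size 2048, a 512-dimensional common space), the tanh projections, and the elementwise-product fusion are assumptions, since the exact projection and fusion functions are not spelled out in this text:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(1)
K, d_v, d_q, d = 36, 2048, 512, 512                 # assumed dimensions
V = rng.normal(size=(K, d_v))                       # image region features
q = rng.normal(size=d_q)                            # question feature from the GRU
Wv = rng.normal(scale=d_v ** -0.5, size=(d_v, d))
Wq = rng.normal(scale=d_q ** -0.5, size=(d_q, d))

# Step2.1: project both modalities into a common d-dimensional space
Vp = np.tanh(V @ Wv)                                # (K, d)
qp = np.tanh(q @ Wq)                                # (d,)

# Step2.2: one normalized attention weight per region; weighted sum of features
alpha = softmax(Vp @ qp)                            # (K,) attention weights
v_hat = alpha @ V                                   # attended image feature

# Step2.3: fuse into the joint representation r (elementwise product assumed)
r = qp * np.tanh(v_hat @ Wv)                        # (d,)
```

A classifier over r (e.g. a softmax over the candidate answer set A) would then yield the answer distribution of Step2.3.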
Further, the specific steps of Step3 are as follows:
step3.1, the attention mechanism uses the question to compute attention weights over image regions to locate the regions associated with the question; the anti-attention mechanism does the opposite: it helps the VQA model overcome language priors by focusing the question on irrelevant regions and ignoring the relevant regions of the image, thereby forming a contrast.
step3.2, using the attention weight α obtained from the attention layer in Step2, a normalized anti-attention weight α' is computed by applying a negation operation, opponent(α) = -α or opponent(α) = e^(-α), which makes larger weights smaller and smaller weights larger, so that the attention weights output by the softmax function focus on irrelevant regions;
step3.3, after the anti-attention weights are learned, the weighted anti-attended image feature is generated; then, similarly to the attention layer, the weighted anti-attended image feature is fused with the question feature q obtained in Step1.2 into a joint feature representation r0, and the probability distribution over each answer a in the candidate answer set A is further computed.
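The negation operation of Step3.2 can be checked numerically. The example below uses hypothetical attention weights over 5 regions (region 2 being the most question-relevant) and shows that both candidate operators, opponent(α) = -α and opponent(α) = e^(-α), reverse the ranking, so the most relevant region receives the least anti-attention weight:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# Hypothetical normalized attention weights from the attention layer
alpha = np.array([0.04, 0.10, 0.60, 0.20, 0.06])

# The two negation operators given in the description
alpha_neg = softmax(-alpha)             # opponent(a) = -a
alpha_exp = softmax(np.exp(-alpha))     # opponent(a) = e^(-a)

# Both are valid distributions, and both invert the ranking of the regions.
```

Because softmax is monotone and both operators are strictly decreasing in α, the region ordering is exactly reversed, which is what lets the anti-attention branch focus on the regions the attention branch ignored.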
Further, the specific steps of Step4 are as follows:
step4.1, the loss layer contains two branches; the first aims to use the probability distribution of the base VQA model, which is optimized by minimizing a binary cross-entropy loss, the loss function being defined as Lvqa;
step4.2, the other branch is the self-contrast layer, which aims to increase the correlation and dependency between the question and the image using the answer distributions it predicts. An objective function similar to QICE [36] is first considered; building on it, a definite relation is assumed between the answers predicted for a question from the relevant and the irrelevant regions of the same image, namely that the predicted answers should be mutually exclusive. The answers ruled out by the self-contrast layer are therefore excluded, and the self-contrast learning loss Lscl is then proposed to increase the correlation between the question and the image;
step4.3, training the proposed network to optimize the joint loss of the self-contrast loss Lscl of step4.2 and the base VQA classification loss Lvqa of step4.1; in this way the model can focus on the relevant regions to predict the correct answer to a given question about the input image.
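Since the exact form of Lscl is not given in this text, the sketch below uses one plausible mutual-exclusion penalty: the summed pointwise minimum of the two answer distributions, which is 1 when they coincide and 0 when they are fully mutually exclusive. It is combined with a binary cross-entropy Lvqa into the joint objective of Step4.3. The balancing weight lam, the 4-answer vocabulary, and the toy distributions are all assumptions:

```python
import numpy as np

EPS = 1e-12

def bce(p, y):
    """Binary cross-entropy over the answer vocabulary (base VQA loss Lvqa)."""
    return float(-(y * np.log(p + EPS) + (1 - y) * np.log(1 - p + EPS)).mean())

def scl(p, p0):
    """Overlap between the answer distribution from relevant regions (p) and
    from irrelevant regions (p0); minimizing it pushes the two predictions
    toward mutual exclusion, as Step4.2 requires."""
    return float(np.minimum(p, p0).sum())

# Toy 4-answer example
p  = np.array([0.9, 0.05, 0.03, 0.02])   # prediction from attended regions
p0 = np.array([0.1, 0.2, 0.4, 0.3])      # prediction from anti-attended regions
y  = np.array([1.0, 0.0, 0.0, 0.0])      # ground-truth answer

lam = 1.0                                 # assumed balancing weight
loss = bce(p, y) + lam * scl(p, p0)       # joint training objective of Step4.3
```

Gradient descent on this joint loss rewards the attended branch for matching the label while forcing the anti-attended branch to predict something different, which is the self-contrast signal.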
Further, the model of the anti-attention layer is similar to the attention layer. First, using the attention weight α obtained by the attention layer, the normalized anti-attention weight α' is calculated as α' = softmax(opponent(α)). After the anti-attention weights are learned, the anti-attended image feature v̂0 is generated as the weighted sum v̂0 = Σi α'i · vi. The anti-attended image feature v̂0 is then fused with the question feature q obtained in Step1.2 into the joint feature representation r0, and the probability distribution over each answer a in the candidate answer set A is calculated as: q'0 = f_q(q), v'0 = f_v(v̂0), r0 = q'0 ∘ v'0, ŝ0 = softmax(w0 · f_0(r0)), where f_v, f_q, f_0 are transformation functions and w0 is a weight matrix to be learned.
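The anti-attention branch above can be sketched end to end in NumPy, under the same assumed dimensions as before; the tanh projections, the elementwise fusion, and the 1000-answer vocabulary are illustrative assumptions rather than the patent's exact parameterization:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(2)
K, d_v, d = 36, 2048, 512
V = rng.normal(size=(K, d_v))                     # region features v_i
q = rng.normal(size=d)                            # question feature (already projected)
Wv = rng.normal(scale=d_v ** -0.5, size=(d_v, d))
w0 = rng.normal(scale=d ** -0.5, size=(d, 1000))  # assumed 1000 candidate answers

alpha = softmax(np.tanh(V @ Wv) @ np.tanh(q))     # weights from the attention layer
alpha_p = softmax(-alpha)                         # alpha' = softmax(opponent(alpha))
v0 = alpha_p @ V                                  # anti-attended image feature v^0
r0 = np.tanh(q) * np.tanh(v0 @ Wv)                # joint representation r0 (fusion)
s0 = softmax(r0 @ w0)                             # answer distribution s^0
```

Contrasting s0 with the distribution from the attended branch is what feeds the self-contrast loss of Step4.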
The invention has the beneficial effects that:
1. The present invention solves the VQA language prior problem with a novel self-contrast learning method, which overcomes language priors by comparing the answers generated for question-relevant and question-irrelevant regions of an image.
2. After self-contrast learning training, the model is forced to learn more information from the relevant image regions, which effectively increases the semantic dependency on the image and the interpretability. In this way, image features and question context no longer exist in isolation during modeling.
3. Extensive experiments were performed on the popular benchmarks VQA-CP v1 and VQA-CP v2. The results show that our method significantly improves performance on the benchmark datasets without using additional annotations. In particular, built on top of the LMH model, we achieve state-of-the-art performance of 59.00% on VQA-CP v2, an absolute improvement of 6.51%.
Drawings
FIG. 1 is a block diagram of a language prior method for overcoming visual question-answering based on self-contrast learning;
FIG. 2 is a comparison of the present invention with several variations of the overcome language prior VQA model;
FIG. 3 is an example of the self-contrast learning of the present invention;
FIG. 4 is a flow chart of the present invention.
Detailed Description
Example 1: as shown in fig. 1-4, a language prior method for overcoming visual question answering based on self-contrast learning is characterized in that: the method comprises the following specific steps:
step1, firstly, taking the questions, the images and answer options as experimental data, secondly, preprocessing the images to extract a feature map, and preprocessing the questions to generate question feature vectors;
step2, using the attention layer to learn to identify image regions relevant to the question; after the preprocessing of Step1, the attention mechanism uses the question to compute attention weights over image regions, thereby locating the image regions associated with the question, and the resulting question feature q and the weighted image feature are fused into a joint representation r;
step3, using the anti-attention layer to identify image regions that are currently irrelevant or weakly relevant; with the attention weights obtained in Step2, the question feature q and the weighted anti-attended image feature are fused into a joint representation r0, focusing the question on irrelevant regions and ignoring the relevant regions of the image so as to form a contrast;
step4, post-processing: with the joint representation r0 from Step3 and the joint representation r from Step2, the proposed network is trained to optimize the joint loss of the self-contrast loss Lscl and the base VQA classification loss Lvqa, so that the model can focus on the relevant regions to predict the correct answer to a given question about the input image.
Further, the Step1 includes the following steps:
step1.1, firstly extracting a series of visual object features from the image using the pre-trained model Faster R-CNN;
step1.2, performing word embedding on the question and passing it to a single-layer GRU to generate the question feature;
further, the specific steps of Step2 are as follows:
step2.1, after the image and question features are extracted, they are passed to the attention layer, which transforms the image features and the question features into a space of the same dimension;
step2.2, computing the attention weights and generating a normalized attention weight for each feature map; the final image feature is the weighted sum of all input features;
step2.3, fusing the weighted image feature with the question feature obtained in Step1.2 into a joint feature representation r, and further computing the probability distribution over each answer a in the candidate answer set A.
Further, the specific steps of Step3 are as follows:
step3.1, the attention mechanism uses the question to compute attention weights over image regions to locate the regions associated with the question; the anti-attention mechanism does the opposite: it helps the VQA model overcome language priors by focusing the question on irrelevant regions and ignoring the relevant regions of the image, thereby forming a contrast.
step3.2, using the attention weight α obtained from the attention layer in Step2, a normalized anti-attention weight α' is computed by applying a negation operation, opponent(α) = -α or opponent(α) = e^(-α), which makes larger weights smaller and smaller weights larger, so that the attention weights output by the softmax function focus on irrelevant regions;
step3.3, after the anti-attention weights are learned, the weighted anti-attended image feature is generated; then, similarly to the attention layer, the weighted anti-attended image feature is fused with the question feature q obtained in Step1.2 into a joint feature representation r0, and the probability distribution over each answer a in the candidate answer set A is further computed.
Further, the specific steps of Step4 are as follows:
step4.1, the loss layer contains two branches; the first aims to use the probability distribution of the base VQA model, which is optimized by minimizing a binary cross-entropy loss, the loss function being defined as Lvqa;
step4.2, the other branch is the self-contrast layer, which aims to increase the correlation and dependency between the question and the image using the answer distributions it predicts. An objective function similar to QICE [36] is first considered; building on it, a definite relation is assumed between the answers predicted for a question from the relevant and the irrelevant regions of the same image, namely that the predicted answers should be mutually exclusive. The answers ruled out by the self-contrast layer are therefore excluded, and the self-contrast learning loss Lscl is then proposed to increase the correlation between the question and the image;
step4.3, training the proposed network to optimize the joint loss of the self-contrast loss Lscl of step4.2 and the base VQA classification loss Lvqa of step4.1; in this way the model can focus on the relevant regions to predict the correct answer to a given question about the input image.
Further, the model of the anti-attention layer is similar to the attention layer. First, using the attention weight α obtained by the attention layer, the normalized anti-attention weight α' is calculated as α' = softmax(opponent(α)). After the anti-attention weights are learned, the anti-attended image feature v̂0 is generated as the weighted sum v̂0 = Σi α'i · vi. The anti-attended image feature v̂0 is then fused with the question feature q obtained in Step1.2 into the joint feature representation r0, and the probability distribution over each answer a in the candidate answer set A is calculated as: q'0 = f_q(q), v'0 = f_v(v̂0), r0 = q'0 ∘ v'0, ŝ0 = softmax(w0 · f_0(r0)), where f_v, f_q, f_0 are transformation functions and w0 is a weight matrix to be learned.
Extensive experiments were performed on the popular benchmarks VQA-CP v1 and VQA-CP v2. The results show that the method of the present invention significantly improves performance on the benchmark datasets without using additional annotations. In particular, built on top of the LMH model, state-of-the-art performance of 59.00% is achieved on VQA-CP v2, an absolute improvement of 6.51%; the results are shown in Tables 1 and 2.
Table 1 shows the experimental results of the present invention on VQA-CP v2.
Table 2 shows the experimental results of the present invention on VQA-CP v1.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (6)
1. A language prior method for overcoming visual question answering based on self-contrast learning is characterized in that: the method comprises the following specific steps:
step1, firstly, taking the questions, the images and answer options as experimental data, secondly, preprocessing the images to extract a feature map, and preprocessing the questions to generate question feature vectors;
step2, using the attention layer to learn to identify image regions relevant to the question; after the preprocessing of Step1, the attention mechanism uses the question to compute attention weights over image regions, thereby locating the image regions associated with the question, and the resulting question feature q and the weighted image feature are fused into a joint representation r;
step3, using the anti-attention layer to identify image regions that are currently irrelevant or weakly relevant; with the attention weights obtained in Step2, the question feature q and the weighted anti-attended image feature are fused into a joint representation r0, focusing the question on irrelevant regions and ignoring the relevant regions of the image so as to form a contrast;
step4, post-processing: with the joint representation r0 from Step3 and the joint representation r from Step2, training the network to optimize the joint loss of the self-contrast loss Lscl and the base VQA classification loss Lvqa, so as to focus on the relevant regions to predict the correct answer to a given question about the input image.
2. The method for overcoming language priors for visual question answering based on self-contrast learning according to claim 1, wherein: the Step1 comprises the following steps:
step1.1, firstly extracting a series of visual object features from the image using the pre-trained model Faster R-CNN;
step1.2, performing word embedding on the question and passing it to a single-layer GRU to generate the question feature.
3. The method for overcoming language priors for visual question answering based on self-contrast learning according to claim 1, wherein: the specific steps of Step2 are as follows:
step2.1, after the image and question features are extracted, they are passed to the attention layer, which transforms the image features and the question features into a space of the same dimension;
step2.2, computing the attention weights and generating a normalized attention weight for each feature map, wherein the final image feature is the weighted sum of all input features;
4. The method for overcoming language priors for visual question answering based on self-contrast learning according to claim 1, wherein: the specific steps of Step3 are as follows:
step3.1, the attention mechanism uses the question to compute attention weights over image regions to locate the regions associated with the question; the anti-attention mechanism does the opposite, helping the VQA model overcome language priors by focusing the question on irrelevant regions and ignoring the relevant regions of the image to form a contrast;
step3.2, using the attention weight α obtained from the attention layer in Step2, a normalized anti-attention weight α' is computed by applying a negation operation, opponent(α) = -α or opponent(α) = e^(-α), which makes large weights small and small weights large, so that the attention weights output by the softmax function focus on irrelevant regions;
step3.3, after the anti-attention weights are learned, the weighted anti-attended image feature is generated; then, similarly to the attention layer, the weighted anti-attended image feature is fused with the question feature q obtained in Step1 into a joint feature representation r0, and the probability distribution over each answer a in the candidate answer set A is further computed.
5. The method for overcoming language priors for visual question answering based on self-contrast learning according to claim 1, wherein: the specific steps of Step4 are as follows:
step4.1, the loss layer contains two branches; the first aims to use the probability distribution of the base VQA model, which is optimized by minimizing a binary cross-entropy loss, the loss function being defined as Lvqa;
step4.2, the other branch is the self-contrast layer, which aims to increase the correlation and dependency between the question and the image using the answer distributions it predicts; an objective function similar to QICE [36] is first considered, and building on it, a definite relation is assumed between the answers predicted for a question from the relevant and the irrelevant regions of the same image, namely that the predicted answers should be mutually exclusive; the answers ruled out by the self-contrast layer are therefore excluded, and the self-contrast learning loss Lscl is then proposed to increase the correlation between the question and the image;
step4.3, training the proposed network to optimize the joint loss of the self-contrast loss Lscl of step4.2 and the base VQA classification loss Lvqa of step4.1, so as to focus on the relevant regions to predict the correct answer to a given question about the input image.
6. The method for overcoming language priors for visual question answering based on self-contrast learning according to claim 1, wherein: the model of the anti-attention layer is similar to the attention layer; first, using the attention weight α obtained by the attention layer, the normalized anti-attention weight α' is calculated as α' = softmax(opponent(α)); after the anti-attention weights are learned, the anti-attended image feature v̂0 is generated as the weighted sum v̂0 = Σi α'i · vi; the anti-attended image feature v̂0 is fused with the obtained question feature q into the joint feature representation r0, and the probability distribution over each answer a in the candidate answer set A is calculated as: q'0 = f_q(q), v'0 = f_v(v̂0), r0 = q'0 ∘ v'0, ŝ0 = softmax(w0 · f_0(r0)), where f_v, f_q, f_0 are transformation functions and w0 is a weight matrix to be learned.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111557673.3A CN114973041A (en) | 2021-12-20 | 2021-12-20 | Language prior method for overcoming visual question and answer based on self-contrast learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111557673.3A CN114973041A (en) | 2021-12-20 | 2021-12-20 | Language prior method for overcoming visual question and answer based on self-contrast learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114973041A true CN114973041A (en) | 2022-08-30 |
Family
ID=82974506
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111557673.3A Pending CN114973041A (en) | 2021-12-20 | 2021-12-20 | Language prior method for overcoming visual question and answer based on self-contrast learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114973041A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117079142A (en) * | 2023-10-13 | 2023-11-17 | 昆明理工大学 | Anti-attention generation countermeasure road center line extraction method for automatic inspection of unmanned aerial vehicle |
CN117079142B (en) * | 2023-10-13 | 2024-01-26 | 昆明理工大学 | Anti-attention generation countermeasure road center line extraction method for automatic inspection of unmanned aerial vehicle |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||