CN114913403B - Visual question-answering method based on metric learning

Visual question-answering method based on metric learning

Info

Publication number
CN114913403B
Authority
CN
China
Prior art keywords
visual
features
question
natural language
feature
Prior art date
Legal status
Active
Application number
CN202210839762.5A
Other languages
Chinese (zh)
Other versions
CN114913403A (en)
Inventor
舒昕垚
陆振宇
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202210839762.5A
Publication of CN114913403A
Application granted
Publication of CN114913403B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/803 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of input or preprocessed data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Abstract

The invention discloses a visual question-answering method based on metric learning. A self-attention encoder and a cross-attention module are used to align and map natural language question features and visual image features in a high-dimensional feature space; a self-supervised multi-modal metric learning method measures the similarity between the natural language question features and the visual image features and divides the visual image features into positive visual features and negative visual features. The positive visual features and the original visual features, each fused with the natural language question features, are required to yield the correct answer, while the negative visual features fused with the natural language question features must not yield the correct answer. The invention thus measures the similarity of multi-modal features in a high-dimensional feature space and trains the model contrastively on the separated positive and negative visual features, which alleviates the semantic gap and semantic bias problems in visual question answering and improves the performance and robustness of the visual question-answering model.

Description

Visual question-answering method based on metric learning
Technical Field
The invention relates to a visual question-answering method, in particular to a visual question-answering method based on metric learning.
Background
Vision and language are the most important forms of human communication, and visual question answering is an important multi-modal task combining the two. A visual question-answering system answers questions by exploring the content of visual images, which requires a deep understanding of both the visual image and the natural language question. The gap between how a computer understands the image and how it understands the question creates a semantic gap between the two modalities: they cannot be well correlated, which degrades the performance of the visual question-answering model.
Visual question answering is a task that requires answering questions by observing images. However, most current visual question-answering models answer questions by capturing superficial correlations between questions and answers, so many questions can be answered correctly without looking at the image at all; this is the language bias (language prior problem) of visual question-answering models. For example, if all fire hydrants in the training set are red, the model learns to answer "red" whenever it is asked "what color is the fire hydrant"; because the training set only contains red fire hydrants, the model simply ignores the visual information and answers "red" directly. When it is then given a picture containing a green fire hydrant and asked "what color is the fire hydrant", the model will still answer "red", which is clearly unreasonable. Such language bias also means that a visual question-answering model trained on the training set cannot generalize to a test set with a different distribution.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a visual question-answering method based on metric learning that reduces the language bias (language prior problem) of the visual question-answering model and improves its accuracy and robustness.
The technical scheme is as follows: the visual question answering method comprises the following steps:
S1, collecting a data set, and selecting an image and a natural language question related to the image as the input of the visual question-answering model;
S2, preprocessing the visual image and the natural language question: extracting regional features of the visual image through an object detection algorithm, including object features and bounding-box features, and extracting features of the natural language question through a language representation algorithm;
S3, forming multi-modal feature pairs from the visual image features and the natural language question features obtained in step S2, and performing feature fusion and alignment with an encoder module; the encoder module comprises a self-attention module and a cross-attention module, wherein the self-attention module uses single-modality encoders and the cross-attention module uses multi-modal cross encoders;
S4, calculating a correlation index between the fused visual image features and the natural language question features with an attention mechanism, and dividing the fused visual image features into positive visual features and negative visual features according to the correlation index;
S5, forming triplets from the positive visual features, the negative visual features and the natural language question features, computing the relationship between the natural language question features and the visual image features with a multi-modal triplet loss function, and screening out the visual image features related to the natural language question;
S6, fusing the original visual features, the positive visual features and the negative visual features with the natural language question features, respectively; the feature fusion uses a cross-attention encoder module and finally yields an original fused feature, a positive fused feature and a negative fused feature;
S7, inputting the original fused feature, the positive fused feature and the negative fused feature into an answer prediction module to predict answers, and using a multi-label cross-entropy loss function to compute the loss between the answers obtained from the original and positive fused features and the standard labels, and the loss between the answer obtained from the negative fused feature and a false label;
S8, training the visual question-answering model with the multi-modal triplet loss function and the multi-label cross-entropy loss function, and obtaining the final model parameters once the training conditions are met.
Further, in step S3, two single-modality encoders are established in the self-attention module: a visual object encoder and a natural language question encoder; both consist of a self-attention layer and a feed-forward neural network layer, with a residual connection added to each of the two layers;
in the cross-attention module, two multi-modal cross encoders are established: a visual object cross encoder and a natural language question cross encoder; the visual object cross encoder consists of a cross-attention layer, a self-attention layer and a feed-forward neural network layer, with a residual connection added to each of the three layers.
Further, in step S4, the attention mechanism calculates a correlation index between the visual image feature and the natural language question feature by using a dot product similarity:
$$S = \cos(Q, V)$$

where $Q$ denotes the 16 aligned and mapped natural language question features, $V$ denotes the 36 aligned and mapped visual image features, and $\cos(\cdot)$ is the cosine function.
Further, in step S5, the multi-modal triplet loss function is:
$$L = \max\big(d(q,\,pos) - d(q,\,neg) + margin,\ 0\big)$$

where $d(\cdot,\cdot)$ is the Euclidean distance, $q$ is the natural language question feature, $pos$ and $neg$ are the positive visual features and negative visual features of the visual image features respectively, $margin$ is a hyper-parameter representing the distance between features, and $\max(\cdot)$ selects the maximum value.
Further, in step S7, the original fused features and the positive fused features can both obtain correct answers through the answer prediction module, and the negative fused features cannot obtain correct answers through the answer prediction module.
Further, in step S8, the visual question-answering model is trained according to the loss values, and training is stopped with an early-stopping method when the accuracy on the validation set drops significantly, yielding the parameters of the final model.
Compared with the prior art, the invention has the following remarkable effects:
1. The visual question-answering model is improved with a metric learning method, which reduces the language bias of the model and improves its accuracy and robustness;
2. The invention designs a self-supervised training scheme in which the visual image features relevant to the natural language question are screened out; by training the visual question-answering model through metric learning, the model can distinguish which objects in the image are related to the question and which are not when producing a correct answer;
3. The invention designs a contrastive learning scheme: the positive visual features separated by metric learning, as well as the original visual features, are fused with the natural language question features and must yield the correct answer, while the negative visual features fused with the natural language question features must not yield the correct answer. This strengthens the attention of the visual question-answering model, forces it to infer the correct answer from the correct visual image regions, increases its reliance on the visual image when making decisions, and reduces the language bias in the model.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of a self-attention model of the present invention;
FIG. 3(a) a schematic view of a visual object encoder;
FIG. 3(b) a schematic diagram of a natural language question encoder;
FIG. 4 is a schematic diagram of a multi-modal cross coder of the present invention;
FIG. 5 is a schematic diagram of a feature classification method according to the present invention;
FIG. 6 is a schematic diagram of a multi-modal metric learning method;
FIG. 7 is a schematic diagram of a visual question-answering model training method of the present invention;
FIG. 8 is a schematic diagram of the visual question-answering model of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
As shown in FIG. 1, the visual question-answering method of the invention first collects visual image information and natural language question information; the visual image is analyzed by an object detection algorithm to extract regional features of the image, and the natural language question is encoded with word-embedding-based algorithms. Next, the regional features of the visual image and the natural language question features are fused and mapped. The fused visual image features and natural language question features are then extracted, an attention mechanism computes a correlation index between them, and the fused visual image features are divided into positive visual features and negative visual features according to this index. A self-supervision scheme then forms triplets from these two groups of visual image features and the natural language question features, and the loss is computed with an improved multi-modal triplet loss function. The original, positive and negative visual features are each fused with the natural language features and passed through an answer prediction module to predict answers. Finally, the visual question-answering model is trained and its parameters are optimized.
The specific implementation steps are as follows:
step one, collecting a data set
The VQA-CP v2 (VQA, Visual Question Answering) dataset is used. VQA-CP v2 changes the prior distribution of answers in the VQA v2.0 training and test splits so that, for each question category (65 categories according to the question prefix), the answers have different distributions in the training set and the test set; this prevents the visual question-answering model from giving the most popular answer for certain question types without understanding the image content. VQA-CP v2 contains 265,016 pictures, including COCO (Common Objects in Context image recognition dataset) pictures and abstract scenes; each picture has at least 3 questions, each question has 10 ground-truth answers, and each question has 3 plausible answers.
Step two, extracting the characteristics of the visual image and the natural language problem
The Faster R-CNN algorithm is used to extract the regional features of the visual image, with 36 regional features fixedly selected for each picture, each feature containing the coordinates of its bounding box. Each regional feature corresponds to one object; regional features are more precise than traditional grid features, which makes it easier to observe how similar each object is to the question when performing metric learning. The BERT algorithm is used to extract the features of the natural language question.
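A minimal sketch of this feature-extraction step is shown below. It assumes Python with PyTorch, the HuggingFace transformers BERT implementation, and region features that were pre-extracted offline by an object detector; the exact extraction pipeline, file format and the helper `load_region_features` are assumptions for illustration, not the patent's implementation.

```python
import torch
from transformers import BertTokenizer, BertModel

def encode_question(question: str, device: str = "cpu") -> torch.Tensor:
    """Encode a natural language question into token-level BERT features."""
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased").to(device).eval()
    # Pad/truncate to 16 tokens, matching the 16 question features used later.
    inputs = tokenizer(question, padding="max_length", truncation=True,
                       max_length=16, return_tensors="pt").to(device)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state  # shape: (1, 16, 768)

def load_region_features(path: str):
    """Load 36 pre-extracted Faster R-CNN region features and bounding boxes.

    `path` is a hypothetical .pt file holding {"features": (36, 2048),
    "boxes": (36, 4)} produced offline by the object detector.
    """
    data = torch.load(path)
    return data["features"], data["boxes"]
```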
Step three, data fusion and alignment
The visual image features and the natural language question features obtained above form multi-modal feature pairs, and an encoder is used to fuse and align the features. The encoder module comprises a self-attention module and a cross-attention module: the self-attention module is a single-modality encoder (as shown in fig. 3(a) and fig. 3(b)) composed of a self-attention layer and a fully connected layer, and the cross-attention module is a multi-modal cross encoder (as shown in fig. 4) composed of a cross-attention layer, a self-attention layer and a fully connected layer.
Self-attention is the core mechanism for data fusion and alignment. Self-attention uses three feature vectors, Query, Key and Value, which are obtained from the same input feature through three different fully connected layers. As shown in fig. 2, self-attention can be roughly divided into three stages: the first stage computes the similarity between Query and Key with a dot-product similarity, the second stage normalizes the similarity coefficients with softmax, and the third stage computes a weighted sum over Value using the normalized weights from the second stage. The formula for self-attention is:
$$\text{Attention}(Q, K, V) = \text{softmax}\big(QK^{T}\big)\,V, \qquad \text{softmax}(a_i) = \frac{e^{a_i}}{\sum_{n} e^{a_n}}$$

where $Q$, $K$ and $V$ denote Query, Key and Value respectively, $T$ denotes the matrix transpose, $e$ is the natural constant, $a_i$ is a similarity coefficient, and $n$ indexes the input similarity coefficients.
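A minimal self-attention sketch following the three stages above is given below, assuming 768-dimensional features; the $1/\sqrt{d_k}$ scaling commonly used in Transformer attention is included as an implementation detail and is an assumption, not stated in the text.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        # Query, Key and Value come from the same input through three
        # different fully connected layers.
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5  # assumed scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Stage 1: dot-product similarity between Query and Key.
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale
        # Stage 2: normalize the similarity coefficients with softmax.
        weights = torch.softmax(scores, dim=-1)
        # Stage 3: weighted sum over Value.
        return torch.matmul(weights, v)
```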
In the self-attention module, two single-modality encoders are established, namely a visual object encoder (as shown in fig. 3(a)) and a natural language question encoder (as shown in fig. 3(b)). Both encoders consist of a self-attention layer and a feed-forward neural network layer, with a residual connection added to each of the two layers.
As shown in fig. 4, two multi-modal cross encoders are established in the cross-attention module: a visual object cross encoder and a natural language question cross encoder. Each cross encoder mainly consists of a cross-attention layer, a self-attention layer and a feed-forward neural network layer, with a residual connection added to each of the three layers. Multi-modal data fusion and alignment are mainly completed by the cross encoders; the cross-attention layer uses a co-attention scheme so that the visual image features and the natural language question features attend to each other for feature fusion.
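A sketch of one such cross-encoder block (cross-attention, then self-attention, then feed-forward, each with a residual connection) is given below, assuming 768-dimensional features and PyTorch's nn.MultiheadAttention; the number of heads and the placement of layer normalization are assumptions.

```python
import torch
import torch.nn as nn

class CrossEncoderBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # Cross-attention: this modality queries the other modality (co-attention).
        attn_out, _ = self.cross_attn(x, other, other)
        x = self.norm1(x + attn_out)          # residual connection
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm2(x + attn_out)          # residual connection
        return self.norm3(x + self.ffn(x))    # residual connection
```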
Step four, feature classification based on metric learning
An attention mechanism computes a correlation index between the fused visual image features and the natural language question features, and the fused visual image features are divided into positive visual features and negative visual features according to this index. The attention mechanism is a dot-product similarity computation, and the resulting correlation index between visual image features and natural language question features is used to distinguish positive visual features from negative ones. The dot-product similarity of the visual image features and the natural language question features is:
$$S = \cos(Q, V)$$

where $Q$ denotes the 16 aligned and mapped natural language question features, $V$ denotes the 36 aligned and mapped visual image features, and $\cos(\cdot)$ is the cosine function.
As shown in fig. 5, after the correlation indices are obtained they are sorted by magnitude, and the 36 visual image features are divided into 20 features related to the natural language question features and 16 features unrelated to them. The visual image features related to the natural language question features are called positive visual features, and the unrelated ones are called negative visual features.
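A sketch of this split is given below, assuming fused question features q of shape (16, 768) and fused visual features v of shape (36, 768); scoring each visual region by its mean cosine similarity to the 16 question features is an assumption about how the per-token scores are aggregated.

```python
import torch
import torch.nn.functional as F

def split_visual_features(q: torch.Tensor, v: torch.Tensor, num_pos: int = 20):
    # Cosine similarity between every (question token, visual region) pair.
    sim = F.cosine_similarity(q.unsqueeze(1), v.unsqueeze(0), dim=-1)  # (16, 36)
    score = sim.mean(dim=0)                       # correlation index per region
    order = score.argsort(descending=True)
    pos_idx, neg_idx = order[:num_pos], order[num_pos:]
    return v[pos_idx], v[neg_idx]                 # 20 positive, 16 negative
```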
To screen out the visual image features related to the natural language question more accurately, a self-supervision mechanism is added when training the visual question-answering model, and metric learning is used to learn the relationship between the natural language question features and the positive and negative visual features. As shown in fig. 6, the mapped natural language question features form triplets with the positive visual features and the negative visual features of the visual image, and the relationship among them is computed with a multi-modal triplet loss function for training. The multi-modal triplet loss function is:
$$L_{triplet} = \max\big(d(q,\,pos) - d(q,\,neg) + margin,\ 0\big)$$

where $d(\cdot,\cdot)$ is the Euclidean distance, $q$ is the natural language question feature, $pos$ and $neg$ are the positive and negative visual features of the visual image features respectively, $margin$ is a hyper-parameter representing the distance between features (typically set to 0.8), and $\max(\cdot)$ selects the maximum value.
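A minimal sketch of this multi-modal triplet loss is shown below. It assumes the question features and the positive/negative visual feature sets are each mean-pooled to a single vector before computing Euclidean distances; the pooling choice is an assumption, not stated in the text.

```python
import torch

def multimodal_triplet_loss(q: torch.Tensor, pos: torch.Tensor,
                            neg: torch.Tensor, margin: float = 0.8) -> torch.Tensor:
    anchor = q.mean(dim=0, keepdim=True)        # (1, 768)
    positive = pos.mean(dim=0, keepdim=True)
    negative = neg.mean(dim=0, keepdim=True)
    d_pos = torch.cdist(anchor, positive, p=2)  # Euclidean distance d(q, pos)
    d_neg = torch.cdist(anchor, negative, p=2)  # Euclidean distance d(q, neg)
    # max(d(q, pos) - d(q, neg) + margin, 0)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```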
Step five, an answer prediction module
The answer prediction module consists of a two-layer fully connected neural network with ReLU activation. Weight normalization is also added to the answer prediction module to speed up training of the neural network. In the answer prediction module, the natural language question features fused with the visual image features (or with the extracted positive or negative visual features) are first extracted and then pooled to obtain the final fused feature. The pooling layer is a single neural network layer with an output dimensionality of 768 and a Tanh activation function. The final fused feature obtained through pooling is fed into the answer prediction module, which outputs a 2274-dimensional answer vector.
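A sketch of this module is given below: a 768-dimensional pooling layer with Tanh activation, followed by a two-layer fully connected network with ReLU and weight normalization, outputting 2274 answer scores. The hidden width of the classifier and the CLS-style pooling of the first token are assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class AnswerPredictor(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 1024, num_answers: int = 2274):
        super().__init__()
        # 768-dimensional pooling layer with Tanh activation.
        self.pooler = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
        # Two-layer fully connected classifier with ReLU and weight normalization.
        self.classifier = nn.Sequential(
            weight_norm(nn.Linear(dim, hidden)), nn.ReLU(),
            weight_norm(nn.Linear(hidden, num_answers)))

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        # fused: (batch, seq_len, dim) question features fused with visual features.
        pooled = self.pooler(fused[:, 0])       # pool the first token (assumption)
        return self.classifier(pooled)          # (batch, 2274) answer scores
```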
Step six, calculating a loss function
Two losses are computed when training the visual question-answering model: the first is the multi-modal triplet loss, which models the relationship between the natural language question features and the visual image features; the second is the multi-label cross-entropy loss between the predicted answers and the labels.
Metric learning is a method of spatial mapping, and can learn a feature space in which the distance between features of similar samples is smaller and the distance between features of dissimilar samples is larger, so as to distinguish them. The multi-modal triplet loss function is thus:
$$L_{triplet} = \max\big(d(q,\,pos) - d(q,\,neg) + margin,\ 0\big)$$

where $q$ is the natural language question feature, $pos$ is the positive visual feature, $neg$ is the negative visual feature, $d(\cdot,\cdot)$ is the Euclidean distance, and $margin$ is the hyper-parameter (typically set to 0.8).
The multi-label cross-entropy loss function is:

$$L_{ce} = -\sum_{i=1}^{k}\Big[y_i \log \sigma(x_i) + (1 - y_i)\log\big(1 - \sigma(x_i)\big)\Big]$$

where $x_i$ is the predicted score for the $i$-th answer, $y_i$ is the standard label for the $i$-th answer, $k$ is the total number of answers, $i = 1, 2, \dots, k$, and $\sigma(\cdot)$ is the sigmoid function.
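A minimal sketch of this multi-label cross-entropy loss is shown below, using PyTorch's binary cross-entropy with logits, which applies the sigmoid internally; normalizing by the batch size is an assumption.

```python
import torch
import torch.nn.functional as F

def multilabel_ce_loss(pred_logits: torch.Tensor, soft_labels: torch.Tensor) -> torch.Tensor:
    # pred_logits: (batch, 2274) raw answer scores x
    # soft_labels: (batch, 2274) standard soft labels y in [0, 1]
    return F.binary_cross_entropy_with_logits(
        pred_logits, soft_labels, reduction="sum") / pred_logits.size(0)
```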
step seven, a comparison learning module
As shown in fig. 7, the original visual features, the positive visual features and the negative visual features are each fused with the natural language question features to obtain the original fused feature, the positive fused feature and the negative fused feature. The original fused feature and the positive fused feature should both yield the correct answer through the answer prediction module, while the negative fused feature should not. The losses between the answers predicted from the original and positive fused features and the standard labels are computed, and the loss between the answer predicted from the negative fused feature and a false label is computed; training then proceeds in a contrastive manner, using the positive and negative visual features together with the original features.
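A sketch of how the three fused branches could be trained contrastively is given below; using an all-zeros label vector as the "false label" and simply summing the loss terms are both assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def contrastive_vqa_loss(pred_orig, pred_pos, pred_neg, labels, triplet_loss):
    bce = F.binary_cross_entropy_with_logits
    false_labels = torch.zeros_like(labels)     # hypothetical "false label"
    return (bce(pred_orig, labels)              # original fusion -> correct answer
            + bce(pred_pos, labels)             # positive fusion -> correct answer
            + bce(pred_neg, false_labels)       # negative fusion -> false label
            + triplet_loss)                     # metric-learning term
```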
Step eight, training a visual question-answer model
Finally, the losses are computed and the visual question-answering model (as shown in fig. 8) is trained according to the loss values, optimizing its parameters. The maximum number of training epochs is set to 30; training is stopped with an early-stopping method when the accuracy on the validation set drops significantly, and the model parameters with the highest validation accuracy are taken as the parameters of the final model.
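A sketch of such a training loop with early stopping is given below; the helpers train_one_epoch and evaluate are hypothetical, and the patience used to decide that validation accuracy has dropped significantly is an assumption.

```python
import copy

def train(model, train_loader, val_loader, optimizer,
          max_epochs: int = 30, patience: int = 3):
    best_acc, best_state, bad_epochs = 0.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader, optimizer)   # hypothetical helper
        acc = evaluate(model, val_loader)                 # hypothetical helper
        if acc > best_acc:
            best_acc = acc
            best_state = copy.deepcopy(model.state_dict())
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                    # early stopping
                break
    model.load_state_dict(best_state)   # keep the best-validation parameters
    return model
```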
In order to show the beneficial effects of the invention, the invention adopts the standard visual question-answer evaluation index to evaluate the visual question-answer model:
$$\text{Accuracy}(a) = \min\!\left(\frac{\#\{\text{annotators who answered } a\}}{3},\ 1\right)$$

where the numerator counts the human annotators whose answer matches the candidate answer $a$; if at least 3 annotators provided the candidate answer, the prediction is scored as 100%.
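A minimal sketch of this standard VQA accuracy metric:

```python
def vqa_accuracy(predicted: str, human_answers: list) -> float:
    """An answer is fully credited once at least 3 human annotators gave it."""
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)

# Example: 10 annotators, 4 of whom answered "red" -> accuracy 1.0
# vqa_accuracy("red", ["red"] * 4 + ["orange"] * 6)
```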
The experimental results, evaluated on the VQA-CP v2 validation set, are shown in Table 1.
Table 1. Results of the metric learning-based visual question-answering method
As shown in Table 1, after adding the metric learning method the overall accuracy of the visual question-answering model, as well as the accuracy on Yes/No questions, improves substantially; the semantic gap and semantic bias problems in visual question answering are effectively alleviated, and the performance and robustness of the visual question-answering model are improved.
The above-mentioned embodiments, objects, technical solutions and advantages of the present invention are further described in detail, it should be understood that the above-mentioned embodiments are only illustrative of the present invention, and are not intended to limit the scope of the present invention, and any person skilled in the art should understand that equivalent changes and modifications made without departing from the concept and principle of the present invention should fall within the protection scope of the present invention.

Claims (6)

1. A visual question-answering method based on metric learning is characterized by comprising the following steps:
S1, collecting a data set, and selecting an image and a natural language question related to the image as the input of the visual question-answering model;
S2, preprocessing the visual image and the natural language question: extracting regional features of the visual image through an object detection algorithm, including object features and bounding-box features, and extracting features of the natural language question through a language representation algorithm;
S3, forming multi-modal feature pairs from the visual image features and the natural language question features obtained in step S2, and performing feature fusion and alignment with an encoder module; the encoder module comprises a self-attention module and a cross-attention module, wherein the self-attention module uses single-modality encoders and the cross-attention module uses multi-modal cross encoders;
S4, calculating a correlation index between the fused visual image features and the natural language question features with an attention mechanism, and dividing the fused visual image features into positive visual features and negative visual features according to the correlation index;
S5, forming triplets from the positive visual features, the negative visual features and the natural language question features, computing the relationship between the natural language question features and the visual image features with a multi-modal triplet loss function, and screening out the visual image features related to the natural language question;
S6, fusing the original visual features, the positive visual features and the negative visual features with the natural language question features, respectively; the feature fusion uses a cross-attention encoder module and finally yields an original fused feature, a positive fused feature and a negative fused feature;
S7, inputting the original fused feature, the positive fused feature and the negative fused feature into an answer prediction module to predict answers, and using a multi-label cross-entropy loss function to compute the loss between the answers obtained from the original and positive fused features and the standard labels, and the loss between the answer obtained from the negative fused feature and a false label;
S8, training the visual question-answering model with the multi-modal triplet loss function and the multi-label cross-entropy loss function, and obtaining the final model parameters once the training conditions are met.
2. The visual question-answering method based on metric learning of claim 1, wherein in step S3, two single-modality encoders are established in the self-attention module: a visual object encoder and a natural language question encoder; both consist of a self-attention layer and a feed-forward neural network layer, with a residual connection added to each of the two layers;
in the cross-attention module, two multi-modal cross encoders are established: a visual object cross encoder and a natural language question cross encoder; the visual object cross encoder consists of a cross-attention layer, a self-attention layer and a feed-forward neural network layer, with a residual connection added to each of the three layers.
3. The visual question-answering method based on metric learning of claim 1, wherein in the step S4, the attention mechanism calculates a correlation index between visual image features and natural language question features by using a dot product similarity:
$$S = \cos(Q, V)$$

wherein $Q$ denotes the 16 aligned and mapped natural language question features, $V$ denotes the 36 aligned and mapped visual image features, and $\cos(\cdot)$ is the cosine function.
4. The visual question-answering method based on metric learning of claim 1, wherein in the step S5, the multi-modal triplet loss function is:
$$L = \max\big(d(q,\,pos) - d(q,\,neg) + margin,\ 0\big)$$

wherein $d(\cdot,\cdot)$ is the Euclidean distance, $q$ is the natural language question feature, $pos$ and $neg$ are the positive visual features and negative visual features of the visual image features respectively, $margin$ is a hyper-parameter representing the distance between features, and $\max(\cdot)$ selects the maximum value.
5. The visual question-answering method based on metric learning of claim 1, wherein in step S7, the original fused feature and the positive fused feature can both obtain correct answers through the answer prediction module, and the negative fused feature cannot obtain correct answers through the answer prediction module.
6. The visual question-answering method based on metric learning of claim 1, wherein in step S8, the visual question-answering model is trained according to the loss values, and training is stopped with an early-stopping method when the accuracy on the validation set drops significantly, to obtain the parameters of the final model.
CN202210839762.5A 2022-07-18 2022-07-18 Visual question-answering method based on metric learning Active CN114913403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210839762.5A CN114913403B (en) 2022-07-18 2022-07-18 Visual question-answering method based on metric learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210839762.5A CN114913403B (en) 2022-07-18 2022-07-18 Visual question-answering method based on metric learning

Publications (2)

Publication Number Publication Date
CN114913403A (en) 2022-08-16
CN114913403B (en) 2022-09-20

Family

ID=82771773

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210839762.5A Active CN114913403B (en) 2022-07-18 2022-07-18 Visual question-answering method based on metric learning

Country Status (1)

Country Link
CN (1) CN114913403B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797655B (en) * 2022-12-13 2023-11-07 南京恩博科技有限公司 Character interaction detection model, method, system and device
CN115936073B (en) * 2023-02-16 2023-05-16 江西省科学院能源研究所 Language-oriented convolutional neural network and visual question-answering method
CN117235670A (en) * 2023-11-10 2023-12-15 南京信息工程大学 Medical image problem vision solving method based on fine granularity cross attention

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11417235B2 (en) * 2017-05-25 2022-08-16 Baidu Usa Llc Listen, interact, and talk: learning to speak via interaction
US11074829B2 (en) * 2018-04-12 2021-07-27 Baidu Usa Llc Systems and methods for interactive language acquisition with one-shot visual concept learning through a conversational game
CN110147457B (en) * 2019-02-28 2023-07-25 腾讯科技(深圳)有限公司 Image-text matching method, device, storage medium and equipment
CN110516530A (en) * 2019-07-09 2019-11-29 杭州电子科技大学 A kind of Image Description Methods based on the enhancing of non-alignment multiple view feature

Also Published As

Publication number Publication date
CN114913403A (en) 2022-08-16

Similar Documents

Publication Publication Date Title
CN114913403B (en) Visual question-answering method based on metric learning
CN111709409B (en) Face living body detection method, device, equipment and medium
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
WO2023065545A1 (en) Risk prediction method and apparatus, and device and storage medium
US11106951B2 (en) Method of bidirectional image-text retrieval based on multi-view joint embedding space
CN108171209B (en) Face age estimation method for metric learning based on convolutional neural network
EP2063393B1 (en) Color classifying method, color recognizing method, color classifying device, color recognizing device, color recognizing system, computer program, and recording medium
CN110135459B (en) Zero sample classification method based on double-triple depth measurement learning network
CN111325115A (en) Countermeasures cross-modal pedestrian re-identification method and system with triple constraint loss
CN109753571B (en) Scene map low-dimensional space embedding method based on secondary theme space projection
CN111611367B (en) Visual question-answering method introducing external knowledge
CN112085072B (en) Cross-modal retrieval method of sketch retrieval three-dimensional model based on space-time characteristic information
CN111325237B (en) Image recognition method based on attention interaction mechanism
CN115860152B (en) Cross-modal joint learning method for character military knowledge discovery
CN115050064A (en) Face living body detection method, device, equipment and medium
CN111666852A (en) Micro-expression double-flow network identification method based on convolutional neural network
CN112182275A (en) Trademark approximate retrieval system and method based on multi-dimensional feature fusion
CN116012922A (en) Face image gender identification method suitable for mask wearing state
CN115471885A (en) Action unit correlation learning method and device, electronic device and storage medium
TW202125323A (en) Processing method of learning face recognition by artificial intelligence module
CN112269892B (en) Based on multi-mode is unified at many levels Interactive phrase positioning and identifying method
CN109886206B (en) Three-dimensional object identification method and equipment
WO2023160666A1 (en) Target detection method and apparatus, and target detection model training method and apparatus
CN116468043A (en) Nested entity identification method, device, equipment and storage medium
CN115050044B (en) Cross-modal pedestrian re-identification method based on MLP-Mixer

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant