CN114841243A - Cross-modal retrieval model training method, cross-modal retrieval method, device and medium - Google Patents

Cross-modal retrieval model training method, cross-modal retrieval method, device and medium Download PDF

Info

Publication number
CN114841243A
CN114841243A CN202210351114.5A CN202210351114A CN114841243A CN 114841243 A CN114841243 A CN 114841243A CN 202210351114 A CN202210351114 A CN 202210351114A CN 114841243 A CN114841243 A CN 114841243A
Authority
CN
China
Prior art keywords
text
image
similarity
coding information
positive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210351114.5A
Other languages
Chinese (zh)
Other versions
CN114841243B (en
Inventor
黄�俊
潘浩
魏鑫燏
朱智聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Advanced Research Institute of CAS
Original Assignee
Shanghai Advanced Research Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Advanced Research Institute of CAS filed Critical Shanghai Advanced Research Institute of CAS
Priority to CN202210351114.5A priority Critical patent/CN114841243B/en
Publication of CN114841243A publication Critical patent/CN114841243A/en
Application granted granted Critical
Publication of CN114841243B publication Critical patent/CN114841243B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a cross-modal retrieval model training method, a cross-modal retrieval method, equipment and a medium, wherein the cross-modal retrieval model training method comprises the following steps: training a cross-modal retrieval model between an image and a text based on training data, wherein the model comprises a feature coding module, a sampling optimization module and a feature matching module; the model training method comprises the following steps: acquiring Q coding information and K coding information of each image, and Q coding information and K coding information of each text based on a characteristic coding module; updating an image memory pool and a text memory pool in the sampling optimization module based on the image K coding information and the text K coding information; obtaining matching similarity of the image sample and the text sample based on the feature matching module, further obtaining a loss function of the model, and reversely updating model parameters based on the loss function; the method enhances the discrimination capability of the model for distinguishing the positive sample from the negative sample, and improves the attention of the model to the high-discrimination sample pair.

Description

Cross-modal retrieval model training method, cross-modal retrieval method, device and medium
Technical Field
The invention belongs to the technical field of cross-modal retrieval, and relates to a cross-modal retrieval model training method, a cross-modal retrieval method, cross-modal retrieval equipment and a computer storage medium.
Background
With the continuous development of science and technology, the living environment of people is full of data of various modalities, such as images, texts, voice, videos and the like. For the same thing, the expression forms of data in different modes are different, but the expressed semantic information is the same, so that the information in different modes is correlated and supplemented, and people can better sense the external environment. As the presentation of multi-modal data on the internet grows exponentially, it is important to accurately retrieve data desired by a user from a large database. However, most search engines support retrieval of single-mode data, cross-mode retrieval requires that data of one mode is used as a request to retrieve data of the most relevant other mode, and compared with single-mode retrieval, cross-mode retrieval is more in line with the requirements of users, and has important research and application values.
Although the current cross-modal search method has achieved good results, the traditional contrast Loss function such as triple Loss (Triplet Loss) only obtains negative samples in a Batch (Batch) by means of random sampling, the number of samples in a Batch is limited, so that the degree of difficulty (Hardness) of the sampled negative samples is not enough, and the current contrast Loss function treats the positive samples and the negative samples equally, and is not limited by weights according to different degrees of similarity, so that the discrimination capability of the model for distinguishing the positive samples from the hardest negative samples is not enough.
Disclosure of Invention
In view of the above disadvantages in the prior art, an object of the present invention is to provide a cross-modal search model training method, a cross-modal search apparatus, and a computer storage medium, which are used to solve the problems that in the existing cross-modal search model training, due to the fact that the range degree of a negative sample is not enough, samples with different similarities are not treated differently, so that the coding feature discrimination of the sample is not enough, and further, the discrimination capability of the model for positive and negative samples is not enough, and the like.
In order to achieve the above and other related objects, the present invention provides, in a first aspect, a cross-modal search model training method, which trains a cross-modal search model between an image and a text based on training data of each batch; the single batch of training data comprises sample pairs consisting of positive sample pairs and negative sample pairs, and each sample pair comprises an image sample and a text sample; the cross-modal retrieval model comprises a feature coding module, a sampling optimization module and a feature matching module; the sampling optimization module comprises an image memory pool and a text memory pool; the feature matching module comprises a first feature matching submodule and a second feature matching submodule; the cross-modal search model training method comprises the following steps: acquiring the training data of the current batch, and inputting the training data into the feature coding module to acquire image Q coding information and image K coding information of each sample pair in the training data, and text Q coding information and text K coding information of each sample pair; inputting the image K coding information and the text K coding information of each sample pair into the sampling optimization module so as to update an image memory pool and a text memory pool in the sampling optimization module; inputting the image Q coding information and the text Q coding information of each sample pair into the first feature matching submodule to obtain the positive similarity of the Q codes of the positive sample pairs and the negative similarity of the Q codes of the negative sample pairs; acquiring a first loss function value of the training data based on the Q code positive similarity and the Q code negative similarity; inputting the image Q coding information and the K coding information of each sample pair, and the text Q coding information and the K coding information into the second feature matching submodule to obtain the image-text QK code positive similarity and the text-text QK code positive similarity of the positive sample pair; acquiring negative similarity of each image-text QK code based on Q coding information of each image sample and each text K coding information in the text memory pool, and acquiring negative similarity of each image-text QK code based on Q coding information of each text sample and each text K coding information in the image memory pool; acquiring a second loss function value of the training data based on the positive similarity of the image-text QK code and the negative similarity of the image-text QK code, and acquiring a third loss function value of the training data based on the positive similarity of the image-text QK code and the negative similarity of the image-text QK code; obtaining a total loss function value based on the first to third loss function values; reversely updating each model parameter in the feature coding module based on the total loss function value; updating the training data to perform model training based on the training data of the next batch until exiting.
In an embodiment of the present invention, the feature encoding module includes an image feature encoding sub-module and a text feature encoding sub-module; the image coding submodule comprises an image Query coder and an image Key coder; the text coding submodule comprises a text Query coder and a text Key coder; the acquiring image Q coding information and image K coding information of each sample pair, and text Q coding information and text K coding information of each sample pair, includes: inputting each sample pair into the image feature coding submodule so as to correspondingly extract image Q coding information and image K coding information of each sample pair based on the image Query encoder and the image Key encoder; and inputting each sample pair into the text feature coding submodule, so as to correspondingly extract text Q coding information and text K coding information of each sample pair based on the text Query coder and the text Key coder.
In an embodiment of the present invention, the updating the image memory pool and the text memory pool in the sampling optimization module includes: storing the input image K coding information to the topmost layer of the image memory pool, and removing the image K coding information of the bottommost layer of the image memory pool; and storing the input text K coding information to the topmost layer of the text memory pool, and removing the text K coding information at the bottommost layer of the text memory pool.
In an embodiment of the present invention, the obtaining of the positive similarity of the Q code of the positive sample pair and the negative similarity of the Q code of the negative sample pair includes: calculating cosine similarity between the image Q coding information and the text Q coding information of the positive sample pair to obtain Q code positive similarity of the positive sample pair; calculating cosine similarity between the image Q coding information and the text Q coding information of the negative sample pair to obtain image-text Q code negative similarity of the negative sample pair; and calculating cosine similarity between the text Q coding information and the image Q coding information of the negative sample pair to obtain the negative similarity of the text and image Q codes of the negative sample pair.
In an embodiment of the present invention, the obtaining a first loss function value of the training data based on the positive Q-code similarity and the negative Q-code similarity includes: based on the positive similarity of the Q code, the negative similarity of the image-text Q code and the negative similarity of the image-text Q code, calculating the first loss function value by using a ternary loss function, wherein the first loss function value is as follows:
Figure BDA0003580344960000031
wherein the content of the first and second substances,
Figure BDA0003580344960000032
Is a ternary loss function; [ x ] of] + =max(x,0);S(Q V ,Q T ) For positive similarity of the Q code, S (Q) V ,Q T - ) The negative similarity of the image-text Q code is obtained; s (Q) V - ,Q T ) Negative similarity of the Q code of the text and the image; m represents a preset similarity threshold.
In an embodiment of the present invention, the obtaining of the positive similarity between the image-text QK code and the positive similarity between the image-text QK code of the positive sample pair includes: calculating the cosine similarity of the image Q coding information in the positive sample pair and the text K coding information in the positive sample pair to obtain the image-text positive similarity of the positive sample pair; and calculating the cosine similarity of the text Q coding information in the positive sample pair and the image K coding information in the positive sample pair to obtain the positive text-image similarity of the positive sample pair.
In an embodiment of the present invention, the obtaining the second loss function value of the training data further includes: weighting the positive similarity of the image-text QK code and the negative similarity of the image-text QK code according to different weights to obtain a second loss function value; and weighting the positive similarity of the text image QK code and the negative similarity of the text image QK code according to different weights to obtain a new third loss function value.
In an embodiment of the invention, the obtaining the second loss function value of the training data includes: calculating the second loss function value using the graphic NCE loss function as:
Figure BDA0003580344960000033
wherein the content of the first and second substances,
Figure BDA0003580344960000034
is the teletext NCE loss function;
Figure BDA0003580344960000035
the positive similarity of the image-text QK codes is obtained;
Figure BDA0003580344960000036
the negative similarity of the image-text QK code is obtained; τ is a hyperparameter; w is a p Is a positive similarity weight, w n Is a negative similarity weight; and the obtaining a third loss function value of the training data comprises: calculating the second loss function value using the graphic NCE loss function as:
Figure BDA0003580344960000037
wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0003580344960000038
is the graph NCE loss function;
Figure BDA0003580344960000039
positive similarity of the text image QK codes;
Figure BDA00035803449600000310
negative similarity of each text and graph QK code; τ is a hyperparameter; w is a p Is a positive similarity weight, w n Is a negative similarity weight.
In an embodiment of the present invention, the positive similarity weight and the negative similarity weight are respectively:
Figure BDA0003580344960000041
Figure BDA0003580344960000042
wherein, alpha, beta and gamma are adjustable hyper-parameters, and satisfy beta > gamma > alpha.
In a second aspect, the present invention provides a cross-modal search method, including: constructing training data of each batch based on cross-modal retrieval sample data, wherein the training data of each batch comprises a positive sample pair and a negative sample pair; training a preset cross-modal retrieval model by adopting the cross-modal retrieval model training method as claimed in any one of the preceding claims based on each batch of the training data to obtain a trained cross-modal retrieval model; and based on the first modal data, searching the second modal data by using the trained cross-modal search model to obtain the second modal data corresponding to the first modal data.
The present invention provides, in a third aspect, an electronic device, comprising: a processor and a memory; the memory is configured to store a computer program, and the processor is configured to execute the computer program stored by the memory to cause the electronic device to perform the cross-modal search model training method as described in any of the above or the cross-modal search method as described above.
The present invention provides in a fourth aspect a computer storage medium storing a computer program for execution by a processor of a cross-modal search model training method as described in any of the above or a cross-modal search method as described above.
As described above, the cross-modal search model training method, the cross-modal search apparatus, and the computer storage medium provided by the present invention set the image memory pool and the text memory pool in the sampling optimization module, correspondingly updating the image memory pool and the text memory pool in model training based on the image K coding information and the text K coding information, and by calculating the similarity between the image Q coding information and each text K coding information in the text memory pool, and calculating the similarity between the text Q coding information and each image K coding information in the image memory pool, compared with the existing random sampling method, the method increases the number of difficultly-divided negative samples in each batch of training process, therefore, the discrimination capability of the model for distinguishing the positive sample from the most difficult negative sample is enhanced, the training effect of the cross-modal retrieval model is improved, and the retrieval precision and accuracy of the trained cross-modal retrieval model are improved.
Drawings
FIG. 1 is a flow chart illustrating a cross-modal search model training method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a cross-modal search model according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart illustrating the training of a pre-constructed cross-modal search model in one embodiment of the present invention;
FIG. 4 is a flow chart illustrating a cross-modal search method according to an embodiment of the present invention;
reference numerals
810 a feature encoding module for performing a feature encoding process,
811 an image feature encoding sub-module for encoding image features,
812 a text feature encoding sub-module that,
820 a sampling optimization module that is adapted to perform,
830 a feature matching module for use in a computer system,
831 a first feature sub-matching module which,
832 second feature sub-matching module.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the drawings only show the components related to the present invention rather than being drawn according to the number, shape and size of the components in actual implementation, and the type, amount and proportion of each component in actual implementation can be changed freely, and the layout of the components can be more complicated.
In order to solve the problems in the prior art, the invention provides, in a first aspect, a cross-modal search model training method, which is suitable for training a cross-modal search model; the cross-modal retrieval model is used for realizing cross-modal retrieval between the image and the text, namely, the image is used as retrieval information, and the text is used as the cross-modal retrieval executed by the retrieved information; or the text is used as the retrieval information, and the image is used as the cross-modal retrieval of the retrieved information.
Referring to fig. 1, a flowchart illustrating an implementation of the cross-modal search model training method according to the present invention is shown.
As shown in fig. 1, the cross-modal search model training method includes the following steps:
s100, acquiring training data of a current batch;
wherein the training data comprises positive sample pairs and negative sample pairs.
Specifically, the training data includes sample pairs for performing model training on the cross-modal search model; wherein a single said sample pair consists of a single image sample and a single text sample. Each sample pair has a corresponding similarity label for characterizing the degree of information similarity between the image sample and the text sample in the sample pair, including at least similarity and non-similarity.
Dividing each sample pair in the training data into a positive sample pair and a negative sample pair based on the similarity label information of the sample pair; wherein the positive sample pair is a sample pair with positive similarity, i.e. the similarity label is a similar sample pair; the negative sample pair is a sample pair with negative similarity, i.e. the similarity label is a non-similar sample pair.
S200, based on the training data, performing model training on the pre-constructed cross-modal retrieval model;
in this embodiment, as shown in fig. 2, the pre-constructed cross-modal search model includes a feature encoding module 810, a sampling optimization module 820, and a feature matching module 830.
Wherein the feature encoding module 810 comprises an image feature encoding submodule 811 and a text feature encoding submodule 812;
The image feature coding submodule 811 includes an image Query encoder and an image Key encoder, and is configured to obtain Query coding information and Key coding information of image samples in each sample pair, which are referred to as image Q coding information and image K coding information in the following text;
the text feature encoding sub-module 812 includes a text Query encoder and a text Key encoder, and is configured to respectively obtain Query encoding information and Key encoding information of text samples in each sample pair, which are referred to as text Q encoding information and text K encoding information in the following text.
The sampling optimization module 820 comprises a pre-constructed image memory pool and a text memory pool; the image memory pool is constructed in advance according to the characteristic dimension of the image K coding information and a preset capacity, and the text memory pool is constructed in advance according to the characteristic dimension of the text K coding information and the preset capacity; the capacity size of the image memory pool is the same as that of the text memory pool.
In this embodiment, the sampling optimization module is configured to update the image memory pool according to the obtained each piece of image K coding information, and update the text memory pool based on each piece of text K coding information obtained by the text feature coding submodule.
The feature matching module 830 includes a first feature sub-matching module 831 and a second feature matching sub-module 832.
The first feature sub-matching module 831 includes a first similarity calculator (not shown) for calculating a positive Q-code similarity of the positive sample pair and a negative Q-code similarity of the negative sample pair, respectively; the positive similarity of the Q code of the positive sample pair is the cross-modal similarity between the Q coding information of the image and the Q coding information of the text in the positive sample pair, and the negative similarity of the Q code of the negative sample pair is the cross-modal similarity between the Q coding information of the image and the Q coding information of the text in the negative sample pair, and comprises the negative similarity of the image and text Q code of the image sample matched to the text sample and the negative similarity of the text sample matched to the image and text Q code of the image sample.
The second feature matching sub-module 832 includes a second similarity calculator (not shown) for calculating a positive QK-code similarity of the positive sample pair, a negative QK-code similarity between each image sample and the text sample; wherein the QK code positive similarity of the positive sample pair comprises: positive similarity of the image-text QK code and the image-text QK code; the image-text QK code positive similarity is the cross-modal similarity between the image Q coding information and the text K coding information in the positive sample pair, and the image-text QK code positive similarity is the cross-modal similarity between the text Q coding information and the image K coding information in the positive sample pair.
The negative similarity of QK codes between the image samples and text samples comprises: the negative similarity of the image-text QK code and the negative similarity of the image-text QK code; the negative similarity of the image-text QK codes is the cross-modal similarity between the image Q coding information of the image sample and each text K code information in the memory pool; the negative similarity of the text and image QK codes is the cross-modal similarity between the text Q coding information of the text sample and the K code information of each image in the memory pool.
The first feature sub-matching module 831 further includes a first loss function, so as to obtain a first loss function value based on the Q code positive similarity and the Q code negative similarity output by the first similarity calculator; in one embodiment, the first loss function comprises a ternary loss function.
A second loss function and a third loss function are also included in the second feature sub-matching module 832; the second loss function is used for acquiring a second loss function value according to the positive similarity of the image-text QK code and the negative similarity of the image-text QK code output by the second similarity calculator; the third loss function is used for obtaining a third loss function value according to the positive similarity of the text image QK code and the negative similarity of the text image QK code output by the second similarity calculator and output by the second similarity calculator; in one embodiment, the second loss function includes a teletext NCE loss function, and the third loss function includes a teletext NCE loss function.
Further, the image-text NCE loss function is a loss function constructed by weighting the image-text QK code positive similarity and the image-text QK code negative similarity; and the text and image NCE loss function is constructed by weighting the positive similarity of the text and image QK codes and the negative similarity of the text and image QK codes.
Specifically, for a single batch of the training data, the performing model training on the pre-constructed cross-modal search model, as shown in fig. 3, includes:
s201, inputting the training data into the text feature coding submodule, enabling the image Query encoder to perform feature extraction on the image samples in each sample pair to obtain image Q coding information of each sample pair, and enabling the image Key encoder to perform feature extraction on the image samples in each sample pair to obtain image K coding information of each sample pair;
in this embodiment, for an image sample in a single sample pair, the performing, by the image Query encoder, feature extraction on the image sample in each sample pair includes:
based on the pre-trained target detector(s), Extracting the local coding information of the input image sample, namely extracting the Q coding information of each local area in the image sample as
Figure BDA0003580344960000071
Wherein f is n Encoding information for the Q of the nth local region; d f The characteristic dimension of the Q coding information is a preset coder parameter;
based on a full connection layer, the characteristic dimension of Q coding information of each local area is represented by d f Mapping to d v To obtain the total Q coding information of each local area, the calculation formula is:
Figure BDA0003580344960000081
wherein, W F And b F Is a learnable parameter matrix, W F Has a dimension of d f *d v ,b F Has a dimension of 1 x d v
Figure BDA0003580344960000082
Information is encoded for an initial picture of image sample data.
Further, to improve the stability of model training, the Q coding information of each local region in the image sample is normalized to obtain new Q coding information of each local region, and the new total Q coding information is obtained as follows:
Figure BDA0003580344960000083
based on the new total Q coding information of the local areas, obtaining the local attention weight corresponding to each local area through an attention mechanism method, and performing weighted summation on the new Q coding information corresponding to each local area in the image sample and the corresponding local attention weight to obtain the image Q coding information of the image sample, wherein the calculation formula is as follows:
Figure BDA0003580344960000084
α V =softmax(s V ) (4)
Figure BDA0003580344960000085
Wherein s is V Representing an image attention model; w 1 ,b 1 ,W 2 ,b 2 A parameter matrix representing the image attention model, σ represents a Sigmoid activation function of the image attention model, α V Representing a local attention weight corresponding to each local region in the image data; dropout () represents a random deactivation operation; softmax () represents a normalized exponential function.
In this embodiment, the image Query encoder in the feature encoding module can be simplified as follows:
Figure BDA0003580344960000086
wherein f is V Representing a model of an image Query encoder,
Figure BDA0003580344960000087
representing parameters in the model, including W 1 ,b 1 ,W 2 ,b 2 ,W F And b F
And performing feature extraction on the image samples in the sample pairs based on the image Query encoder to obtain image Q encoding information of each image sample.
In this embodiment, the model structure of the image Key encoder is the same as that of the image Query encoder, and only the parameter updating manner is different, and the model structure is a dynamically updated parameter; in one embodiment, the image Key encoder is:
Figure BDA0003580344960000088
wherein the content of the first and second substances,
Figure BDA0003580344960000089
are parameters of an image Key encoder, respectively, including W 1 ,b 1 ,W 2 ,b 2 ,W F And b F
In one embodiment, the image Key encoder parameters
Figure BDA0003580344960000091
Based on the image Key encoder parameters obtained during the previous batch of model training and the image Query encoder parameters obtained during the current batch of model training, the new image Key encoder parameters obtained are as follows:
Figure BDA0003580344960000092
Wherein μ represents a hyper-parameter; optionally, its value is set to 0.999;
Figure BDA0003580344960000093
and
Figure BDA0003580344960000094
the parameters are the image Key encoder parameters when the current batch of model training is executed and the image Key encoder parameters when the previous batch of model training is executed.
In this embodiment, for an image sample in a single sample pair, the performing, by the image Key encoder, feature extraction on the image sample in each sample pair includes:
acquiring image Key encoder parameters of the current batch of model training based on image Key encoder parameters obtained by the last batch of model training and image Query encoder parameters obtained by the current batch of model training, so as to acquire an image Key encoder of the current batch of model training based on the image Key encoder parameters; and extracting the features of the image samples in the sample pairs based on the image Key encoder to obtain the image K encoding information of the image samples.
S202, inputting the training data into the text feature coding submodule, enabling the text Query coder to perform feature extraction on text samples in each sample pair to obtain text Q coding information of each sample pair, and enabling the text Key coder to perform feature extraction on text samples in each sample pair to obtain text K coding information of each sample pair;
In this embodiment, for a single text sample, the performing, by the text Query encoder, feature extraction on the text sample in each sample pair includes:
specifically, based on the pre-trained word encoding matrix, the text samples are mapped into word vectors of
Figure BDA0003580344960000095
Figure BDA0003580344960000096
Wherein e is k Is the k word vector; d e Is the feature space of the word vector.
Based on a Gated Round Unit (GRU), semantic association vectors between the word vectors and the contexts are obtained and serve as Q coding information of the word vectors, and the calculation formula is as follows:
Figure BDA0003580344960000097
wherein, W E Is a parameter matrix of the gated cyclic unit;
Figure BDA0003580344960000098
information is encoded for the Q of each word vector,
Figure BDA0003580344960000099
d t and coding the feature space of the information for the Q of each word vector.
In this embodiment, the feature space dimension of the Q-coding information of the text sample is the same as the feature dimension of the Q-coding information of the image sample.
Further, to improve the stability of model training, the Q coding information of each word vector in the text sample is normalized to obtain new Q coding information of each word vector, and the method includes:
Figure BDA0003580344960000101
based on the new Q coding information of each word vector, obtaining the local attention weight corresponding to each word vector through an attention mechanism method, and carrying out weighted summation on the Q coding information corresponding to each word vector in a text sample and the corresponding local attention weight to obtain the text Q coding information of the text sample, wherein the calculation formula is as follows:
Figure BDA0003580344960000102
α T =softmax(s T ) (12)
Figure BDA0003580344960000103
Wherein s is T Is a text attention model; w 3 ,b 3 ,W 4 ,b 4 Is a parameter matrix of the text attention model, sigma represents a Sigmoid activation function of the text attention model, alpha T To representEach one ofLocal attention weights corresponding to the word vectors; dropout () represents a random deactivation operation; softmax () represents a normalized exponential function.
In this embodiment, the text Query encoder in the feature encoding module can be simplified as follows:
Q T =f T (E;θ t ) (14)
wherein f is T Representing a text Query encoder model, theta t Representing parameters in the text attention model, including W 3 ,b 3 ,W 4 ,b 4 And W and E
and performing feature extraction on the text samples in the sample pairs based on the text Query encoder to obtain text Q encoding information of the text samples.
In this embodiment, the model structure of the text Key encoder is the same as that of the text Query encoder, and only the parameter updating manner is different, and the model structure is a dynamically updated parameter; in one embodiment, the text Key encoder is:
Figure BDA0003580344960000104
wherein the content of the first and second substances,
Figure BDA0003580344960000105
are parameters of a text Key encoder, respectively, including W 3 ,b 3 ,W 4 ,b 4 And W and E
in one embodiment, the text Key encoder parameters
Figure BDA0003580344960000106
Based on the text Key encoder parameters obtained during the previous batch of model training and the text Query encoder parameters obtained during the current batch of model training, new text Key encoder parameters are obtained, which are:
Figure BDA0003580344960000107
Wherein μ represents a hyper-parameter; optionally, its value is set to 0.999;
Figure BDA0003580344960000108
and
Figure BDA0003580344960000109
are respectively provided withThe parameters of an image Key encoder when the current batch of model training is executed and the parameters of an image Key encoder when the previous batch of model training is executed are obtained;
Figure BDA00035803449600001010
the parameters of the text Query encoder when the current batch of training is executed.
In this embodiment, for a text sample in a single sample pair, the performing, by the text Key encoder, feature extraction on the text sample in each sample pair includes:
acquiring the parameters of a text Key encoder trained by the current batch of models based on the parameters of the text Key encoder acquired by the last batch of model training and the parameters of a text Query encoder acquired by the current batch of model training, so as to acquire the parameters of the text Key encoder trained by the current batch of models based on the parameters of the text Key encoder; and based on the text Key encoder, extracting the characteristics of the text samples in the sample pairs to obtain text K encoding information of the text samples.
S203, inputting the image K coding information and the text K coding information of each sample pair into the sampling optimization module to update an image memory pool and a text memory pool in the sampling optimization module;
Specifically, in the sampling optimization module, for image samples in the current batch, storing each input image K coding information to the topmost layer of the image memory pool, and removing the same number of image K coding information in the bottommost layer of the image memory pool, that is, the number of stored image K coding information is the same as the number of removed image K coding information; similarly, for the text samples in the current batch, storing each input text K coding information to the topmost layer of the text memory pool, and removing the text K coding information with the same data in the bottommost layer of the text memory pool, that is, the number of the stored text K coding information is the same as that of the removed text K coding information, so that the data volume of each memory pool is kept unchanged while dynamically updating each memory pool.
S204, inputting the image Q coding information and the text Q coding information of each sample pair into the first feature matching submodule so as to obtain the positive similarity of the Q codes of the positive sample pairs based on the image Q coding information and the text Q coding information of the positive sample pairs and obtain the negative similarity of the Q codes of the negative sample pairs based on the image Q coding information and the text Q coding information of the negative sample pairs; obtaining a first loss function value of the training data based on the Q code positive similarity and the Q code negative similarity;
Specifically, a first similarity calculator is adopted to calculate cosine similarity between image Q coding information and text Q coding information of the positive sample pair so as to obtain Q code positive similarity of the positive sample pair; calculating cosine similarity between the image Q coding information and the text Q coding information of the negative sample pair to obtain image-text Q code negative similarity of the negative sample pair; and calculating cosine similarity between the text Q coding information and the image Q coding information of the negative sample pair to obtain the negative similarity of the text and image Q codes of the negative sample pair.
Further, in order to make the similarity between the image sample and the text sample of each positive sample pair higher and the similarity between the image sample and the text sample of each negative sample pair lower in the training data, a ternary loss function is used for constraint, which is:
Figure BDA0003580344960000111
wherein the content of the first and second substances,
Figure BDA0003580344960000112
is a ternary loss function; [ x ] of] + Max (x, 0); s (-) is used for calculating the cross-modal similarity between the image sample and the text sample, namely the first similarity calculator; s (Q) V ,Q T ) Is the Q-code positive similarity of the positive sample pair, S (Q) V ,Q T - ) The negative similarity of the image-text Q code of the negative sample pair is obtained; s (Q) V - ,Q T ) Negative similarity of the text and graph Q codes of the negative sample pairs; m represents a preset similarity threshold.
Based on equation 17, a first loss function value for obtaining the training data is calculated.
S205, inputting the image Q coding information and the K coding information of each sample pair, and the text Q coding information and the K coding information into the second feature matching submodule, so as to obtain the image-text QK code positive similarity of the positive sample pair and the text image QK code positive similarity of the positive sample pair based on the image Q coding information and the text K coding information of the positive sample pair; acquiring negative similarity of each image-text QK code based on the image Q coding information of each image sample and each text K coding information in the text memory pool, and acquiring negative similarity of each image-text QK code based on the text Q coding information of each text sample and each text K coding information in the image memory pool; acquiring a second loss function value of the training data based on the positive similarity of the image-text QK codes of the positive sample pair and the negative similarity of each image-text QK code, and acquiring a third loss function value of the training data based on the positive similarity of the image-text QK codes of the positive sample pair and the negative similarity of each image-text QK code;
specifically, a second similarity calculator is adopted to calculate cosine similarity of image Q coding information in the positive sample pair and text K coding information in the positive sample pair, so as to obtain positive image-text similarity of the positive sample pair; and calculating the cosine similarity of the text Q coding information in the positive sample pair and the image K coding information in the positive sample pair to obtain the positive text-image similarity of the positive sample pair.
Calculating cosine similarity of the image Q coding information in each sample pair and each text K coding information stored in the text memory pool by adopting a second similarity calculator to obtain image-text negative similarity of the image sample; and calculating cosine similarity of the text Q coding information in each sample pair and the image K coding information stored in the image memory pool to obtain the text-image negative similarity of the text sample.
In this embodiment, the second loss function value calculated by using the graphic NCE loss function is:
Figure BDA0003580344960000121
wherein the content of the first and second substances,
Figure BDA0003580344960000122
is the teletext NCE loss function;
Figure BDA0003580344960000123
positive similarity of the image-text QK codes of the positive sample pair;
Figure BDA0003580344960000124
the negative similarity of the image-text QK code of the image sample is obtained; τ is a hyperparameter, optionally with a value set to 0.07.
In this embodiment, similarly, the graph NCE loss function is used to calculate the third loss function value as follows:
Figure BDA0003580344960000125
wherein the content of the first and second substances,
Figure BDA0003580344960000126
is the graph NCE loss function;
Figure BDA0003580344960000127
positive similarity for the literal QK code of the positive sample pair;
Figure BDA0003580344960000128
negative similarity of a text and image QK code of the text sample; τ is a hyperparameter, optionally with a value set to 0.07.
Further, in order to make the weights of the positive sample pairs decrease with the increase of the similarity score and the weights of the negative sample pairs increase with the increase of the similarity score, when the second loss function value is calculated, weighting processing is performed on the positive similarity of the image-text QK codes of the positive sample pairs and the negative similarity of each image-text QK code according to different weights to obtain a new second loss function value; namely, the second loss function value is calculated by the image-text NCE loss function after weighting processing, and is:
Figure BDA0003580344960000131
Figure BDA0003580344960000132
Figure BDA0003580344960000133
Similarly, when the third loss function value is calculated, weighting the positive similarity of the text image QK codes of the positive sample pair and the negative similarity of each text image QK code according to different weights to obtain a new third loss function value; namely, the weighted text figure NCE loss function calculates the third loss function value as follows:
Figure BDA0003580344960000134
Figure BDA0003580344960000135
Figure BDA0003580344960000136
wherein, w p Is a positive similarity weight, w n Is a negative similarity weight; alpha, beta, gamma are all adjustable hyper-parameters and satisfy beta>γ>α; alternatively, a is 0.4, β is 3, and γ is 0.9.
S206, obtaining a total loss function value based on the first loss function value, the second loss function value and the third loss function value; and reversely updating each model parameter in the feature coding module based on the total loss function value.
Specifically, the first, second and third loss function values are summed to obtain a total loss function value as:
Figure BDA0003580344960000137
wherein the content of the first and second substances,
Figure BDA0003580344960000138
the total loss function value is given.
And reversely updating each model parameter in the image feature coding submodule and the text feature coding submodule based on the total loss function value so as to obtain the updated cross-modal retrieval model.
It should be noted that, in the cross-modal search model training method provided by the present invention, the sequence of the execution sequence of the step S201 and the step S202 is not limited, and the sequence of the execution sequence of the step S204 and the step S205 is not limited; for example, in other embodiments, the step S202 may be performed first, and then the step S201 may be performed, or the step S205 may be performed first, and then the step S204 may be performed;
and S300, updating the training data to execute model training based on the training data of the next batch until quitting so as to obtain the trained cross-modal retrieval model.
Specifically, training data of the next batch is obtained and used as new training data of the current batch; based on the new training data, re-executing the model training process, i.e. executing step S201 to step S206; the process is repeated (step S100 to step S300) until the total number of times of execution reaches a preset training iteration number threshold, thereby obtaining the trained cross-modal retrieval model.
In order to solve the problems in the prior art, a second aspect of the present invention provides a cross-modal search method for searching image data based on image information or searching image data based on text information.
Referring to fig. 4, a flow chart of the cross-modal search method according to an embodiment of the invention is shown.
As shown in fig. 4, the cross-modality retrieval method includes the following steps:
s10, constructing training data of each batch based on the cross-modal retrieval sample data;
and each batch of training data comprises a positive sample pair and a negative sample pair.
Specifically, the positive sample pair is positive similarity between the image sample and the text sample in the current sample pair, that is, the two are in a similar relationship; the negative sample pair is negative similarity between the image sample and the text sample in the current sample pair, namely, the two are in a non-similar relationship.
S20, training a preset cross-modal retrieval model based on the training data of each batch to obtain a trained cross-modal retrieval model;
specifically, the cross-modal search model training method shown in fig. 1 is adopted to train the preset cross-modal search model to obtain the trained cross-modal search model.
In this embodiment, the pre-constructed cross-modal search model includes a feature encoding module, a sampling optimization module, and a feature matching module.
The feature coding module comprises an image feature coding submodule and a text feature coding submodule;
The image characteristic coding submodule comprises an image Query coder and an image key coder and is used for respectively obtaining Query coding information and key coding information of image samples in each sample pair, and the Query coding information and the key coding information are referred to as image Q coding information and image K coding information in the following text;
the text feature coding submodule comprises a text Query coder and a text key coder and is used for respectively obtaining Query coding information and key coding information of text samples in each sample pair, and the Query coding information and the key coding information are referred to as text Q coding information and text K coding information in the following text.
The sampling optimization module comprises a pre-constructed image memory pool and a pre-constructed text memory pool; the image memory pool is pre-constructed according to the characteristic dimension of the image K coding information and a preset capacity, and the text memory pool is pre-constructed according to the characteristic dimension of the text K coding information and the preset capacity; the capacity size of the image memory pool is the same as that of the text memory pool. In this embodiment, the sampling optimization module is configured to update the image memory pool according to the obtained each image K coding information, and update the text memory pool based on each text K coding information obtained by the text feature coding submodule, so as to construct the negative sample pair by using the image memory pool and the text memory pool, thereby increasing the number of the negative samples that are difficult to be separated in the training process of each batch compared with the existing random sampling method.
The feature matching module comprises a first feature sub-matching module and a second feature matching sub-module.
The first feature sub-matching module comprises a first similarity calculator for respectively calculating the positive similarity of the Q codes of the positive sample pairs and the negative similarity of the Q codes of the negative sample pairs; the positive similarity of the Q code of the positive sample pair is the cross-modal similarity between the Q coding information of the image and the Q coding information of the text in the positive sample pair, and the negative similarity of the Q code of the negative sample pair is the cross-modal similarity between the Q coding information of the image and the Q coding information of the text in the negative sample pair, and comprises the negative similarity of the image-text Q code matched with the text sample and the negative similarity of the text sample matched with the image-text Q code of the image sample.
The second feature matching submodule comprises a second similarity calculator and a second feature matching submodule, wherein the second similarity calculator is used for calculating the positive similarity of the QK codes of the positive sample pairs and the negative similarity of the QK codes between the image samples and the text samples; wherein the QK code positive similarity of the positive sample pair comprises: positive similarity of the image-text QK code and the image-text QK code; the image-text QK code positive similarity is the cross-modal similarity between the image Q coding information and the text K coding information in the positive sample pair, and the image-text QK code positive similarity is the cross-modal similarity between the text Q coding information and the image K coding information in the positive sample pair.
The negative similarity of QK codes between the image samples and text samples comprises: the negative similarity of the image-text QK code and the negative similarity of the image-text QK code; the negative similarity of the image-text QK codes is the cross-modal similarity between the image Q coding information of the image sample and each text K code information in the memory pool; the negative similarity of the text and image QK codes is the cross-modal similarity between the text Q coding information of the text sample and the K code information of each image in the memory pool.
The first characteristic sub-matching module further comprises a first loss function, so that a first loss function value is obtained based on the Q code positive similarity and the Q code negative similarity output by the first similarity calculator; in one embodiment, the first loss function comprises a ternary loss function.
The second characteristic sub-matching module further comprises a second loss function and a third loss function; the second loss function is used for acquiring a second loss function value according to the positive similarity of the image-text QK code and the negative similarity of the image-text QK code output by the second similarity calculator; the third loss function is used for obtaining a third loss function value according to the positive similarity of the text image QK code and the negative similarity of the text image QK code output by the second similarity calculator and output by the second similarity calculator; in one embodiment, the second loss function includes a teletext NCE loss function, and the third loss function includes a teletext NCE loss function.
Further, the graphic NCE loss function is: a loss function constructed by weighting the positive similarity based on the image-text QK code and the negative similarity based on the image-text QK code; and the text and image NCE loss function is constructed by weighting the positive similarity of the text and image QK codes and the negative similarity of the text and image QK codes.
And S30, based on the first modal data, searching in the second modal data by using the trained cross-modal search model to obtain second modal data corresponding to the first modal data.
Specifically, when the first modality data is image data, the second modality data is text data;
and when the first modality data is text data, the second modality data is image data.
In order to solve the problems in the prior art, the present invention also provides, in a third aspect, an electronic device, including: a processor, a memory, a transceiver, a communication interface, and a system bus; the memory is used for storing the computer program, the communication interface is used for communicating with other devices, and the processor and the transceiver are used for operating the computer program to enable the processing device to execute the cross-mode retrieval model training method or the cross-mode retrieval method.
The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the Integrated Circuit may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, or discrete hardware components.
The present invention also provides, in a fourth aspect, a computer-readable storage medium, on which a computer program is stored, which, when being invoked by a processor, implements each step in the cross-modal search model training method as described above or the cross-modal search method as described above.
Among other things, the computer-readable storage medium can be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device.
The computer-readable programs described herein may be downloaded from a computer-readable storage medium to a variety of computing/processing devices, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
In summary, according to the cross-modal search model training method, the cross-modal search method, the device and the computer storage medium provided by the present invention, the image memory pool and the text memory pool are set in the sampling optimization module, so as to be updated in the model training process based on the image K coding information and the text K coding information, and the number of the difficultly-divided negative samples in each batch of training process is increased compared with the existing random sampling method by calculating the similarity between the image Q coding information and each text K coding information in the text memory pool and calculating the similarity between the text Q coding information and each image K coding information in the image memory pool, thereby enhancing the discrimination ability of the model for distinguishing the positive samples from the most difficultly-divided negative samples, and further improving the training effect of the cross-modal search model; in addition, by adaptively applying proper weights to the positive similarity and the negative similarity, a sample pair which is difficult to distinguish can be endowed with a larger weight, and a sample pair which is easy to distinguish can be endowed with a smaller weight, so that the cross-modal retrieval model can pay more attention to the sample pair with discriminability, namely the model can pay more attention to the sample pair with discriminability from the redundant sample pair, thereby enhancing the discriminability of the coding characteristics, enhancing the semantic similarity of the multimodal data in the feature space, and further improving the precision of the cross-modal retrieval.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (12)

1. A training method of a cross-modal retrieval model is characterized in that the cross-modal retrieval model between an image and a text is trained based on training data of each batch; the single batch of training data comprises sample pairs consisting of positive sample pairs and negative sample pairs, and each sample pair comprises an image sample and a text sample; the cross-modal retrieval model comprises a feature coding module, a sampling optimization module and a feature matching module; the sampling optimization module comprises an image memory pool and a text memory pool; the feature matching module comprises a first feature matching submodule and a second feature matching submodule;
the cross-modal search model training method comprises the following steps:
Acquiring the training data of the current batch, and inputting the training data into the feature coding module to acquire image Q coding information and image K coding information of each sample pair in the training data, and text Q coding information and text K coding information of each sample pair;
inputting the image K coding information and the text K coding information of each sample pair into the sampling optimization module so as to update an image memory pool and a text memory pool in the sampling optimization module;
inputting the image Q coding information and the text Q coding information of each sample pair into the first feature matching submodule to obtain the positive similarity of the Q codes of the positive sample pairs and the negative similarity of the Q codes of the negative sample pairs; acquiring a first loss function value of the training data based on the Q code positive similarity and the Q code negative similarity;
inputting the image Q coding information and the image K coding information of each sample pair, and the text Q coding information and the text K coding information into the second feature matching submodule to obtain the image-text QK code positive similarity and the text-text QK code positive similarity of the positive sample pair; acquiring negative similarity of each image-text QK code based on Q coding information of each image sample and each text K coding information in the text memory pool, and acquiring negative similarity of each image-text QK code based on Q coding information of each text sample and each text K coding information in the image memory pool;
Acquiring a second loss function value of the training data based on the positive similarity of the image-text QK code and the negative similarity of the image-text QK code, and acquiring a third loss function value of the training data based on the positive similarity of the image-text QK code and the negative similarity of the image-text QK code;
obtaining a total loss function value based on the first to third loss function values; based on the total loss function value, reversely updating each model parameter in the feature coding module;
updating the training data to perform the model training based on the training data of the next batch until exiting.
2. The training method of the cross-modal search model according to claim 1, wherein the feature encoding module comprises an image feature encoding submodule and a text feature encoding submodule; the image coding submodule comprises an image Query coder and an image Key coder; the text coding submodule comprises a text Query coder and a text Key coder;
the acquiring image Q coding information and image K coding information of each sample pair, and text Q coding information and text K coding information of each sample pair, includes:
Inputting each sample pair into the image feature coding submodule so as to correspondingly extract image Q coding information and image K coding information of each sample pair based on the image Query encoder and the image Key encoder;
and inputting each sample pair into the text feature coding submodule, so as to correspondingly extract text Q coding information and text K coding information of each sample pair based on the text Query coder and the text Key coder.
3. The method for training the cross-modal search model according to claim 1, wherein the updating the image memory pool and the text memory pool in the sampling optimization module comprises:
storing the input image K coding information to the topmost layer of the image memory pool, and removing the image K coding information of the bottommost layer of the image memory pool;
storing the input text K coding information to the topmost layer of the text memory pool, and removing the text K coding information at the bottommost layer of the text memory pool.
4. The training method of the cross-modal search model according to claim 1, wherein the negative similarity of Q codes of the negative sample pair comprises negative similarity of teletext Q codes and negative similarity of teletext Q codes of the negative sample pair, and the obtaining the positive similarity of Q codes of the positive sample pair and the negative similarity of Q codes of the negative sample pair comprises:
Calculating cosine similarity between the image Q coding information and the text Q coding information of the positive sample pair to obtain Q code positive similarity of the positive sample pair;
calculating cosine similarity between the image Q coding information and the text Q coding information of the negative sample pair to obtain image-text Q code negative similarity of the negative sample pair;
and calculating the cosine similarity between the text Q coding information and the image Q coding information of the negative sample pair to obtain the text and image Q code negative similarity of the negative sample pair.
5. The method for training a cross-modal search model according to claim 4, wherein the obtaining a first loss function value of the training data based on the positive similarity of the Q code and the negative similarity of the Q code comprises:
based on the positive similarity of the Q code, the negative similarity of the image-text Q code and the negative similarity of the image-text Q code, calculating the first loss function value by using a ternary loss function, wherein the first loss function value is as follows:
Figure FDA0003580344950000021
wherein the content of the first and second substances,
Figure FDA0003580344950000022
is a ternary loss function; [ x ] of] + =max(x,0);S(Q V ,Q T ) For positive similarity of the Q-code, S (Q) V ,Q T - ) The negative similarity of the image-text Q code is obtained; s (Q) V - ,Q T ) Negative similarity of the Q code of the text and the image; m represents a preset similarity threshold.
6. The method of claim 1, wherein the obtaining of the positive similarity of the text-to-text QK code and the positive similarity of the text-to-text QK code of the positive sample pair comprises:
Calculating the cosine similarity of the image Q coding information in the positive sample pair and the text K coding information in the positive sample pair to obtain the image-text positive similarity of the positive sample pair;
and calculating the cosine similarity of the text Q coding information in the positive sample pair and the image K coding information in the positive sample pair to obtain the positive similarity of the text and the image of the positive sample pair.
7. The method of claim 1, wherein obtaining the second loss function value of the training data further comprises:
weighting the positive similarity of the image-text QK code and the negative similarity of the image-text QK code according to different weights to obtain a new second loss function value;
the obtaining a third loss function value of the training data further includes:
and weighting the positive similarity of the text image QK code and the negative similarity of the text image QK code according to different weights to obtain a new third loss function value.
8. The method of claim 7, wherein the obtaining a new second loss function value of the training data comprises:
calculating the second loss function value using the graphic NCE loss function as:
Figure FDA0003580344950000031
Wherein the content of the first and second substances,
Figure FDA0003580344950000032
is the graphic NCE loss function;
Figure FDA0003580344950000033
the positive similarity of the image-text QK codes is obtained;
Figure FDA0003580344950000034
the negative similarity of the image-text QK code is obtained; τ is a hyperparameter; w is a p Is a positive similarity weight, w n Is a negative similarity weight;
the obtaining a third loss function value of the training data includes:
calculating a new third loss function value using the graph NCE loss function as:
Figure FDA0003580344950000035
wherein the content of the first and second substances,
Figure FDA0003580344950000036
is the graph NCE loss function;
Figure FDA0003580344950000037
positive similarity of the text image QK codes;
Figure FDA0003580344950000038
negative similarity of the QK codes of the text and the graph; τ is a hyperparameter; w is a p Is a positive similarity weight, w n Is a negative similarity weight.
9. The training method of the cross-modal search model according to claim 8, wherein the positive similarity weight and the negative similarity weight are respectively:
Figure FDA0003580344950000039
Figure FDA00035803449500000310
wherein, alpha, beta and gamma are adjustable hyper-parameters, and satisfy beta > gamma > alpha.
10. A cross-modal retrieval method, comprising:
constructing training data of each batch based on cross-modal retrieval sample data, wherein the training data of each batch comprises a positive sample pair and a negative sample pair;
training a preset cross-modal search model by using the cross-modal search model training method according to any one of claims 1 to 9 based on the training data of each batch to obtain a trained cross-modal search model;
And based on the first modal data, searching the second modal data by using the trained cross-modal search model to obtain the second modal data corresponding to the first modal data.
11. An electronic device, comprising: a processor and a memory; the memory is configured to store a computer program, and the processor is configured to execute the computer program stored by the memory to cause the electronic device to perform the cross-modal search model training method of any one of claims 1 to 9 or the cross-modal search method of claim 10.
12. A computer storage medium storing a computer program, the computer program being executable by a processor to perform the cross-modal search model training method of any one of claims 1 to 9 or the cross-modal search method of claim 10.
CN202210351114.5A 2022-04-02 2022-04-02 Cross-modal retrieval model training method, cross-modal retrieval method, device and medium Active CN114841243B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210351114.5A CN114841243B (en) 2022-04-02 2022-04-02 Cross-modal retrieval model training method, cross-modal retrieval method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210351114.5A CN114841243B (en) 2022-04-02 2022-04-02 Cross-modal retrieval model training method, cross-modal retrieval method, device and medium

Publications (2)

Publication Number Publication Date
CN114841243A true CN114841243A (en) 2022-08-02
CN114841243B CN114841243B (en) 2023-04-07

Family

ID=82564646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210351114.5A Active CN114841243B (en) 2022-04-02 2022-04-02 Cross-modal retrieval model training method, cross-modal retrieval method, device and medium

Country Status (1)

Country Link
CN (1) CN114841243B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115391578A (en) * 2022-08-03 2022-11-25 北京乾图科技有限公司 Cross-modal image-text retrieval model training method and system
CN115861995A (en) * 2023-02-08 2023-03-28 山东海量信息技术研究院 Visual question-answering method and device, electronic equipment and storage medium
CN115860587A (en) * 2023-03-02 2023-03-28 广州市玄武无线科技股份有限公司 Visit assessment method, device, equipment and storage medium based on image-text matching
CN116431788A (en) * 2023-04-14 2023-07-14 中电科大数据研究院有限公司 Cross-modal data-oriented semantic retrieval method
WO2024065645A1 (en) * 2022-09-30 2024-04-04 北京京东方技术开发有限公司 Image and text matching model training method and apparatus, and device and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112148916A (en) * 2020-09-28 2020-12-29 华中科技大学 Cross-modal retrieval method, device, equipment and medium based on supervision
CN112613451A (en) * 2020-12-29 2021-04-06 民生科技有限责任公司 Modeling method of cross-modal text picture retrieval model
CN112784092A (en) * 2021-01-28 2021-05-11 电子科技大学 Cross-modal image text retrieval method of hybrid fusion model
CN112990297A (en) * 2021-03-10 2021-06-18 北京智源人工智能研究院 Training method, application method and device of multi-mode pre-training model
CN113095415A (en) * 2021-04-15 2021-07-09 齐鲁工业大学 Cross-modal hashing method and system based on multi-modal attention mechanism
US20210240761A1 (en) * 2019-01-31 2021-08-05 Shenzhen Sensetime Technology Co., Ltd. Method and device for cross-modal information retrieval, and storage medium
US20210240758A1 (en) * 2020-01-30 2021-08-05 Electronics And Telecommunications Research Institute Method of image searching based on artificial intelligence and apparatus for performing the same
CN113239214A (en) * 2021-05-19 2021-08-10 中国科学院自动化研究所 Cross-modal retrieval method, system and equipment based on supervised contrast
CN114201621A (en) * 2021-11-24 2022-03-18 人民网股份有限公司 Cross-modal retrieval model construction and retrieval method based on image-text cooperative attention

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210240761A1 (en) * 2019-01-31 2021-08-05 Shenzhen Sensetime Technology Co., Ltd. Method and device for cross-modal information retrieval, and storage medium
US20210240758A1 (en) * 2020-01-30 2021-08-05 Electronics And Telecommunications Research Institute Method of image searching based on artificial intelligence and apparatus for performing the same
CN112148916A (en) * 2020-09-28 2020-12-29 华中科技大学 Cross-modal retrieval method, device, equipment and medium based on supervision
CN112613451A (en) * 2020-12-29 2021-04-06 民生科技有限责任公司 Modeling method of cross-modal text picture retrieval model
CN112784092A (en) * 2021-01-28 2021-05-11 电子科技大学 Cross-modal image text retrieval method of hybrid fusion model
CN112990297A (en) * 2021-03-10 2021-06-18 北京智源人工智能研究院 Training method, application method and device of multi-mode pre-training model
CN113095415A (en) * 2021-04-15 2021-07-09 齐鲁工业大学 Cross-modal hashing method and system based on multi-modal attention mechanism
CN113239214A (en) * 2021-05-19 2021-08-10 中国科学院自动化研究所 Cross-modal retrieval method, system and equipment based on supervised contrast
CN114201621A (en) * 2021-11-24 2022-03-18 人民网股份有限公司 Cross-modal retrieval model construction and retrieval method based on image-text cooperative attention

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RUI ZHAO ET AL.: "Memory Enhanced Embedding Learning for Cross-Modal Video-Text Retrieval", 《ARXIV:2103.15686V1 [CS.CV]》 *
赵瑞: "基于深度学习的视频-文本跨模态搜索", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115391578A (en) * 2022-08-03 2022-11-25 北京乾图科技有限公司 Cross-modal image-text retrieval model training method and system
WO2024065645A1 (en) * 2022-09-30 2024-04-04 北京京东方技术开发有限公司 Image and text matching model training method and apparatus, and device and storage medium
CN115861995A (en) * 2023-02-08 2023-03-28 山东海量信息技术研究院 Visual question-answering method and device, electronic equipment and storage medium
CN115860587A (en) * 2023-03-02 2023-03-28 广州市玄武无线科技股份有限公司 Visit assessment method, device, equipment and storage medium based on image-text matching
CN116431788A (en) * 2023-04-14 2023-07-14 中电科大数据研究院有限公司 Cross-modal data-oriented semantic retrieval method
CN116431788B (en) * 2023-04-14 2024-03-29 中电科大数据研究院有限公司 Cross-modal data-oriented semantic retrieval method

Also Published As

Publication number Publication date
CN114841243B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN114841243B (en) Cross-modal retrieval model training method, cross-modal retrieval method, device and medium
WO2022007823A1 (en) Text data processing method and device
US11093560B2 (en) Stacked cross-modal matching
WO2020062770A1 (en) Method and apparatus for constructing domain dictionary, and device and storage medium
CN111914067B (en) Chinese text matching method and system
CN108846077B (en) Semantic matching method, device, medium and electronic equipment for question and answer text
CN110348535B (en) Visual question-answering model training method and device
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
CN110427486B (en) Body condition text classification method, device and equipment
CN111881256B (en) Text entity relation extraction method and device and computer readable storage medium equipment
CN110263218B (en) Video description text generation method, device, equipment and medium
CN115080764A (en) Medical similar entity classification method and system based on knowledge graph and clustering algorithm
CN110968697B (en) Text classification method, apparatus, device and readable storage medium
CN112581327B (en) Knowledge graph-based law recommendation method and device and electronic equipment
CN110727769B (en) Corpus generation method and device and man-machine interaction processing method and device
WO2019201024A1 (en) Method, apparatus and device for updating model parameter, and storage medium
CN117313861A (en) Model pre-training data acquisition method, model pre-training method, device and equipment
CN111460117A (en) Dialog robot intention corpus generation method, device, medium and electronic equipment
CN113326383B (en) Short text entity linking method, device, computing equipment and storage medium
CN113435531A (en) Zero sample image classification method and system, electronic equipment and storage medium
WO2023116572A1 (en) Word or sentence generation method and related device
CN110413745B (en) Method for selecting representative text, method and device for determining standard problem
JP6586026B2 (en) Word vector learning device, natural language processing device, method, and program
CN117056501A (en) Method and device for extracting argument, electronic equipment and storage medium
CN115964474A (en) Policy keyword extraction method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant