CN114841243A

CN114841243A - Cross-modal retrieval model training method, cross-modal retrieval method, device and medium

Info

Publication number: CN114841243A
Application number: CN202210351114.5A
Authority: CN
Inventors: 黄�俊; 潘浩; 魏鑫燏; 朱智聪
Original assignee: Shanghai Advanced Research Institute of CAS
Current assignee: Shanghai Advanced Research Institute of CAS
Priority date: 2022-04-02
Filing date: 2022-04-02
Publication date: 2022-08-02
Anticipated expiration: 2042-04-02
Also published as: CN114841243B

Abstract

The invention provides a cross-modal retrieval model training method, a cross-modal retrieval method, equipment and a medium, wherein the cross-modal retrieval model training method comprises the following steps: training a cross-modal retrieval model between an image and a text based on training data, wherein the model comprises a feature coding module, a sampling optimization module and a feature matching module; the model training method comprises the following steps: acquiring Q coding information and K coding information of each image, and Q coding information and K coding information of each text based on a characteristic coding module; updating an image memory pool and a text memory pool in the sampling optimization module based on the image K coding information and the text K coding information; obtaining matching similarity of the image sample and the text sample based on the feature matching module, further obtaining a loss function of the model, and reversely updating model parameters based on the loss function; the method enhances the discrimination capability of the model for distinguishing the positive sample from the negative sample, and improves the attention of the model to the high-discrimination sample pair.

Description

Cross-modal retrieval model training method, cross-modal retrieval method, device and medium

Technical Field

The invention belongs to the technical field of cross-modal retrieval, and relates to a cross-modal retrieval model training method, a cross-modal retrieval method, cross-modal retrieval equipment and a computer storage medium.

Background

With the continuous development of science and technology, the living environment of people is full of data of various modalities, such as images, texts, voice, videos and the like. For the same thing, the expression forms of data in different modes are different, but the expressed semantic information is the same, so that the information in different modes is correlated and supplemented, and people can better sense the external environment. As the presentation of multi-modal data on the internet grows exponentially, it is important to accurately retrieve data desired by a user from a large database. However, most search engines support retrieval of single-mode data, cross-mode retrieval requires that data of one mode is used as a request to retrieve data of the most relevant other mode, and compared with single-mode retrieval, cross-mode retrieval is more in line with the requirements of users, and has important research and application values.

Although the current cross-modal search method has achieved good results, the traditional contrast Loss function such as triple Loss (Triplet Loss) only obtains negative samples in a Batch (Batch) by means of random sampling, the number of samples in a Batch is limited, so that the degree of difficulty (Hardness) of the sampled negative samples is not enough, and the current contrast Loss function treats the positive samples and the negative samples equally, and is not limited by weights according to different degrees of similarity, so that the discrimination capability of the model for distinguishing the positive samples from the hardest negative samples is not enough.

Disclosure of Invention

In view of the above disadvantages in the prior art, an object of the present invention is to provide a cross-modal search model training method, a cross-modal search apparatus, and a computer storage medium, which are used to solve the problems that in the existing cross-modal search model training, due to the fact that the range degree of a negative sample is not enough, samples with different similarities are not treated differently, so that the coding feature discrimination of the sample is not enough, and further, the discrimination capability of the model for positive and negative samples is not enough, and the like.

In order to achieve the above and other related objects, the present invention provides, in a first aspect, a cross-modal search model training method, which trains a cross-modal search model between an image and a text based on training data of each batch; the single batch of training data comprises sample pairs consisting of positive sample pairs and negative sample pairs, and each sample pair comprises an image sample and a text sample; the cross-modal retrieval model comprises a feature coding module, a sampling optimization module and a feature matching module; the sampling optimization module comprises an image memory pool and a text memory pool; the feature matching module comprises a first feature matching submodule and a second feature matching submodule; the cross-modal search model training method comprises the following steps: acquiring the training data of the current batch, and inputting the training data into the feature coding module to acquire image Q coding information and image K coding information of each sample pair in the training data, and text Q coding information and text K coding information of each sample pair; inputting the image K coding information and the text K coding information of each sample pair into the sampling optimization module so as to update an image memory pool and a text memory pool in the sampling optimization module; inputting the image Q coding information and the text Q coding information of each sample pair into the first feature matching submodule to obtain the positive similarity of the Q codes of the positive sample pairs and the negative similarity of the Q codes of the negative sample pairs; acquiring a first loss function value of the training data based on the Q code positive similarity and the Q code negative similarity; inputting the image Q coding information and the K coding information of each sample pair, and the text Q coding information and the K coding information into the second feature matching submodule to obtain the image-text QK code positive similarity and the text-text QK code positive similarity of the positive sample pair; acquiring negative similarity of each image-text QK code based on Q coding information of each image sample and each text K coding information in the text memory pool, and acquiring negative similarity of each image-text QK code based on Q coding information of each text sample and each text K coding information in the image memory pool; acquiring a second loss function value of the training data based on the positive similarity of the image-text QK code and the negative similarity of the image-text QK code, and acquiring a third loss function value of the training data based on the positive similarity of the image-text QK code and the negative similarity of the image-text QK code; obtaining a total loss function value based on the first to third loss function values; reversely updating each model parameter in the feature coding module based on the total loss function value; updating the training data to perform model training based on the training data of the next batch until exiting.

In an embodiment of the present invention, the feature encoding module includes an image feature encoding sub-module and a text feature encoding sub-module; the image coding submodule comprises an image Query coder and an image Key coder; the text coding submodule comprises a text Query coder and a text Key coder; the acquiring image Q coding information and image K coding information of each sample pair, and text Q coding information and text K coding information of each sample pair, includes: inputting each sample pair into the image feature coding submodule so as to correspondingly extract image Q coding information and image K coding information of each sample pair based on the image Query encoder and the image Key encoder; and inputting each sample pair into the text feature coding submodule, so as to correspondingly extract text Q coding information and text K coding information of each sample pair based on the text Query coder and the text Key coder.

In an embodiment of the present invention, the updating the image memory pool and the text memory pool in the sampling optimization module includes: storing the input image K coding information to the topmost layer of the image memory pool, and removing the image K coding information of the bottommost layer of the image memory pool; and storing the input text K coding information to the topmost layer of the text memory pool, and removing the text K coding information at the bottommost layer of the text memory pool.

In an embodiment of the present invention, the obtaining of the positive similarity of the Q code of the positive sample pair and the negative similarity of the Q code of the negative sample pair includes: calculating cosine similarity between the image Q coding information and the text Q coding information of the positive sample pair to obtain Q code positive similarity of the positive sample pair; calculating cosine similarity between the image Q coding information and the text Q coding information of the negative sample pair to obtain image-text Q code negative similarity of the negative sample pair; and calculating cosine similarity between the text Q coding information and the image Q coding information of the negative sample pair to obtain the negative similarity of the text and image Q codes of the negative sample pair.

In an embodiment of the present invention, the obtaining a first loss function value of the training data based on the positive Q-code similarity and the negative Q-code similarity includes: based on the positive similarity of the Q code, the negative similarity of the image-text Q code and the negative similarity of the image-text Q code, calculating the first loss function value by using a ternary loss function, wherein the first loss function value is as follows:

wherein the content of the first and second substances,

Is a ternary loss function; [ x ] of] ₊ ＝max(x,0)；S(Q _V ,Q _T ) For positive similarity of the Q code, S (Q) _V ,Q _T ^- ) The negative similarity of the image-text Q code is obtained; s (Q) _V ^- ,Q _T ) Negative similarity of the Q code of the text and the image; m represents a preset similarity threshold.

In an embodiment of the present invention, the obtaining of the positive similarity between the image-text QK code and the positive similarity between the image-text QK code of the positive sample pair includes: calculating the cosine similarity of the image Q coding information in the positive sample pair and the text K coding information in the positive sample pair to obtain the image-text positive similarity of the positive sample pair; and calculating the cosine similarity of the text Q coding information in the positive sample pair and the image K coding information in the positive sample pair to obtain the positive text-image similarity of the positive sample pair.

In an embodiment of the present invention, the obtaining the second loss function value of the training data further includes: weighting the positive similarity of the image-text QK code and the negative similarity of the image-text QK code according to different weights to obtain a second loss function value; and weighting the positive similarity of the text image QK code and the negative similarity of the text image QK code according to different weights to obtain a new third loss function value.

In an embodiment of the invention, the obtaining the second loss function value of the training data includes: calculating the second loss function value using the graphic NCE loss function as:

wherein the content of the first and second substances,

is the teletext NCE loss function;

the positive similarity of the image-text QK codes is obtained;

the negative similarity of the image-text QK code is obtained; τ is a hyperparameter; w is a _p Is a positive similarity weight, w _n Is a negative similarity weight; and the obtaining a third loss function value of the training data comprises: calculating the second loss function value using the graphic NCE loss function as:

wherein, the first and the second end of the pipe are connected with each other,

is the graph NCE loss function;

positive similarity of the text image QK codes;

negative similarity of each text and graph QK code; τ is a hyperparameter; w is a _p Is a positive similarity weight, w _n Is a negative similarity weight.

In an embodiment of the present invention, the positive similarity weight and the negative similarity weight are respectively:

wherein, alpha, beta and gamma are adjustable hyper-parameters, and satisfy beta > gamma > alpha.

In a second aspect, the present invention provides a cross-modal search method, including: constructing training data of each batch based on cross-modal retrieval sample data, wherein the training data of each batch comprises a positive sample pair and a negative sample pair; training a preset cross-modal retrieval model by adopting the cross-modal retrieval model training method as claimed in any one of the preceding claims based on each batch of the training data to obtain a trained cross-modal retrieval model; and based on the first modal data, searching the second modal data by using the trained cross-modal search model to obtain the second modal data corresponding to the first modal data.

The present invention provides, in a third aspect, an electronic device, comprising: a processor and a memory; the memory is configured to store a computer program, and the processor is configured to execute the computer program stored by the memory to cause the electronic device to perform the cross-modal search model training method as described in any of the above or the cross-modal search method as described above.

The present invention provides in a fourth aspect a computer storage medium storing a computer program for execution by a processor of a cross-modal search model training method as described in any of the above or a cross-modal search method as described above.

As described above, the cross-modal search model training method, the cross-modal search apparatus, and the computer storage medium provided by the present invention set the image memory pool and the text memory pool in the sampling optimization module, correspondingly updating the image memory pool and the text memory pool in model training based on the image K coding information and the text K coding information, and by calculating the similarity between the image Q coding information and each text K coding information in the text memory pool, and calculating the similarity between the text Q coding information and each image K coding information in the image memory pool, compared with the existing random sampling method, the method increases the number of difficultly-divided negative samples in each batch of training process, therefore, the discrimination capability of the model for distinguishing the positive sample from the most difficult negative sample is enhanced, the training effect of the cross-modal retrieval model is improved, and the retrieval precision and accuracy of the trained cross-modal retrieval model are improved.

Drawings

FIG. 1 is a flow chart illustrating a cross-modal search model training method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating a cross-modal search model according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart illustrating the training of a pre-constructed cross-modal search model in one embodiment of the present invention;

FIG. 4 is a flow chart illustrating a cross-modal search method according to an embodiment of the present invention;

reference numerals

810 a feature encoding module for performing a feature encoding process,

811 an image feature encoding sub-module for encoding image features,

812 a text feature encoding sub-module that,

820 a sampling optimization module that is adapted to perform,

830 a feature matching module for use in a computer system,

831 a first feature sub-matching module which,

832 second feature sub-matching module.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the drawings only show the components related to the present invention rather than being drawn according to the number, shape and size of the components in actual implementation, and the type, amount and proportion of each component in actual implementation can be changed freely, and the layout of the components can be more complicated.

In order to solve the problems in the prior art, the invention provides, in a first aspect, a cross-modal search model training method, which is suitable for training a cross-modal search model; the cross-modal retrieval model is used for realizing cross-modal retrieval between the image and the text, namely, the image is used as retrieval information, and the text is used as the cross-modal retrieval executed by the retrieved information; or the text is used as the retrieval information, and the image is used as the cross-modal retrieval of the retrieved information.

Referring to fig. 1, a flowchart illustrating an implementation of the cross-modal search model training method according to the present invention is shown.

As shown in fig. 1, the cross-modal search model training method includes the following steps:

s100, acquiring training data of a current batch;

wherein the training data comprises positive sample pairs and negative sample pairs.

Specifically, the training data includes sample pairs for performing model training on the cross-modal search model; wherein a single said sample pair consists of a single image sample and a single text sample. Each sample pair has a corresponding similarity label for characterizing the degree of information similarity between the image sample and the text sample in the sample pair, including at least similarity and non-similarity.

Dividing each sample pair in the training data into a positive sample pair and a negative sample pair based on the similarity label information of the sample pair; wherein the positive sample pair is a sample pair with positive similarity, i.e. the similarity label is a similar sample pair; the negative sample pair is a sample pair with negative similarity, i.e. the similarity label is a non-similar sample pair.

S200, based on the training data, performing model training on the pre-constructed cross-modal retrieval model;

in this embodiment, as shown in fig. 2, the pre-constructed cross-modal search model includes a feature encoding module 810, a sampling optimization module 820, and a feature matching module 830.

Wherein the feature encoding module 810 comprises an image feature encoding submodule 811 and a text feature encoding submodule 812;

The image feature coding submodule 811 includes an image Query encoder and an image Key encoder, and is configured to obtain Query coding information and Key coding information of image samples in each sample pair, which are referred to as image Q coding information and image K coding information in the following text;

the text feature encoding sub-module 812 includes a text Query encoder and a text Key encoder, and is configured to respectively obtain Query encoding information and Key encoding information of text samples in each sample pair, which are referred to as text Q encoding information and text K encoding information in the following text.

The sampling optimization module 820 comprises a pre-constructed image memory pool and a text memory pool; the image memory pool is constructed in advance according to the characteristic dimension of the image K coding information and a preset capacity, and the text memory pool is constructed in advance according to the characteristic dimension of the text K coding information and the preset capacity; the capacity size of the image memory pool is the same as that of the text memory pool.

In this embodiment, the sampling optimization module is configured to update the image memory pool according to the obtained each piece of image K coding information, and update the text memory pool based on each piece of text K coding information obtained by the text feature coding submodule.

The feature matching module 830 includes a first feature sub-matching module 831 and a second feature matching sub-module 832.

The first feature sub-matching module 831 includes a first similarity calculator (not shown) for calculating a positive Q-code similarity of the positive sample pair and a negative Q-code similarity of the negative sample pair, respectively; the positive similarity of the Q code of the positive sample pair is the cross-modal similarity between the Q coding information of the image and the Q coding information of the text in the positive sample pair, and the negative similarity of the Q code of the negative sample pair is the cross-modal similarity between the Q coding information of the image and the Q coding information of the text in the negative sample pair, and comprises the negative similarity of the image and text Q code of the image sample matched to the text sample and the negative similarity of the text sample matched to the image and text Q code of the image sample.

The second feature matching sub-module 832 includes a second similarity calculator (not shown) for calculating a positive QK-code similarity of the positive sample pair, a negative QK-code similarity between each image sample and the text sample; wherein the QK code positive similarity of the positive sample pair comprises: positive similarity of the image-text QK code and the image-text QK code; the image-text QK code positive similarity is the cross-modal similarity between the image Q coding information and the text K coding information in the positive sample pair, and the image-text QK code positive similarity is the cross-modal similarity between the text Q coding information and the image K coding information in the positive sample pair.

The negative similarity of QK codes between the image samples and text samples comprises: the negative similarity of the image-text QK code and the negative similarity of the image-text QK code; the negative similarity of the image-text QK codes is the cross-modal similarity between the image Q coding information of the image sample and each text K code information in the memory pool; the negative similarity of the text and image QK codes is the cross-modal similarity between the text Q coding information of the text sample and the K code information of each image in the memory pool.

The first feature sub-matching module 831 further includes a first loss function, so as to obtain a first loss function value based on the Q code positive similarity and the Q code negative similarity output by the first similarity calculator; in one embodiment, the first loss function comprises a ternary loss function.

A second loss function and a third loss function are also included in the second feature sub-matching module 832; the second loss function is used for acquiring a second loss function value according to the positive similarity of the image-text QK code and the negative similarity of the image-text QK code output by the second similarity calculator; the third loss function is used for obtaining a third loss function value according to the positive similarity of the text image QK code and the negative similarity of the text image QK code output by the second similarity calculator and output by the second similarity calculator; in one embodiment, the second loss function includes a teletext NCE loss function, and the third loss function includes a teletext NCE loss function.

Further, the image-text NCE loss function is a loss function constructed by weighting the image-text QK code positive similarity and the image-text QK code negative similarity; and the text and image NCE loss function is constructed by weighting the positive similarity of the text and image QK codes and the negative similarity of the text and image QK codes.

Specifically, for a single batch of the training data, the performing model training on the pre-constructed cross-modal search model, as shown in fig. 3, includes:

s201, inputting the training data into the text feature coding submodule, enabling the image Query encoder to perform feature extraction on the image samples in each sample pair to obtain image Q coding information of each sample pair, and enabling the image Key encoder to perform feature extraction on the image samples in each sample pair to obtain image K coding information of each sample pair;

in this embodiment, for an image sample in a single sample pair, the performing, by the image Query encoder, feature extraction on the image sample in each sample pair includes:

based on the pre-trained target detector(s), Extracting the local coding information of the input image sample, namely extracting the Q coding information of each local area in the image sample as

Wherein f is _n Encoding information for the Q of the nth local region; d _f The characteristic dimension of the Q coding information is a preset coder parameter;

based on a full connection layer, the characteristic dimension of Q coding information of each local area is represented by d _f Mapping to d _v To obtain the total Q coding information of each local area, the calculation formula is:

wherein, W _F And b _F Is a learnable parameter matrix, W _F Has a dimension of d _f *d _v ，b _F Has a dimension of 1 x d _v ；

Information is encoded for an initial picture of image sample data.

Further, to improve the stability of model training, the Q coding information of each local region in the image sample is normalized to obtain new Q coding information of each local region, and the new total Q coding information is obtained as follows:

based on the new total Q coding information of the local areas, obtaining the local attention weight corresponding to each local area through an attention mechanism method, and performing weighted summation on the new Q coding information corresponding to each local area in the image sample and the corresponding local attention weight to obtain the image Q coding information of the image sample, wherein the calculation formula is as follows:

α ^V ＝softmax(s ^V ) (4)

Wherein s is ^V Representing an image attention model; w ₁ ，b ₁ ，W ₂ ，b ₂ A parameter matrix representing the image attention model, σ represents a Sigmoid activation function of the image attention model, α ^V Representing a local attention weight corresponding to each local region in the image data; dropout () represents a random deactivation operation; softmax () represents a normalized exponential function.

In this embodiment, the image Query encoder in the feature encoding module can be simplified as follows:

wherein f is _V Representing a model of an image Query encoder,

representing parameters in the model, including W ₁ ，b ₁ ，W ₂ ，b ₂ ，W _F And b _F 。

And performing feature extraction on the image samples in the sample pairs based on the image Query encoder to obtain image Q encoding information of each image sample.

In this embodiment, the model structure of the image Key encoder is the same as that of the image Query encoder, and only the parameter updating manner is different, and the model structure is a dynamically updated parameter; in one embodiment, the image Key encoder is:

wherein the content of the first and second substances,

are parameters of an image Key encoder, respectively, including W ₁ ，b ₁ ，W ₂ ，b ₂ ，W _F And b _F 。

In one embodiment, the image Key encoder parameters

Based on the image Key encoder parameters obtained during the previous batch of model training and the image Query encoder parameters obtained during the current batch of model training, the new image Key encoder parameters obtained are as follows:

Wherein μ represents a hyper-parameter; optionally, its value is set to 0.999;

and

the parameters are the image Key encoder parameters when the current batch of model training is executed and the image Key encoder parameters when the previous batch of model training is executed.

In this embodiment, for an image sample in a single sample pair, the performing, by the image Key encoder, feature extraction on the image sample in each sample pair includes:

acquiring image Key encoder parameters of the current batch of model training based on image Key encoder parameters obtained by the last batch of model training and image Query encoder parameters obtained by the current batch of model training, so as to acquire an image Key encoder of the current batch of model training based on the image Key encoder parameters; and extracting the features of the image samples in the sample pairs based on the image Key encoder to obtain the image K encoding information of the image samples.

S202, inputting the training data into the text feature coding submodule, enabling the text Query coder to perform feature extraction on text samples in each sample pair to obtain text Q coding information of each sample pair, and enabling the text Key coder to perform feature extraction on text samples in each sample pair to obtain text K coding information of each sample pair;

In this embodiment, for a single text sample, the performing, by the text Query encoder, feature extraction on the text sample in each sample pair includes:

specifically, based on the pre-trained word encoding matrix, the text samples are mapped into word vectors of

Wherein e is _k Is the k word vector; d _e Is the feature space of the word vector.

Based on a Gated Round Unit (GRU), semantic association vectors between the word vectors and the contexts are obtained and serve as Q coding information of the word vectors, and the calculation formula is as follows:

wherein, W _E Is a parameter matrix of the gated cyclic unit;

information is encoded for the Q of each word vector,

d _t and coding the feature space of the information for the Q of each word vector.

In this embodiment, the feature space dimension of the Q-coding information of the text sample is the same as the feature dimension of the Q-coding information of the image sample.

Further, to improve the stability of model training, the Q coding information of each word vector in the text sample is normalized to obtain new Q coding information of each word vector, and the method includes:

based on the new Q coding information of each word vector, obtaining the local attention weight corresponding to each word vector through an attention mechanism method, and carrying out weighted summation on the Q coding information corresponding to each word vector in a text sample and the corresponding local attention weight to obtain the text Q coding information of the text sample, wherein the calculation formula is as follows:

α ^T ＝softmax(s ^T ) (12)

Wherein s is ^T Is a text attention model; w ₃ ，b ₃ ，W ₄ ，b ₄ Is a parameter matrix of the text attention model, sigma represents a Sigmoid activation function of the text attention model, alpha ^T To representEach one ofLocal attention weights corresponding to the word vectors; dropout () represents a random deactivation operation; softmax () represents a normalized exponential function.

In this embodiment, the text Query encoder in the feature encoding module can be simplified as follows:

Q _T ＝f _T (E；θ ^t ) (14)

wherein f is _T Representing a text Query encoder model, theta ^t Representing parameters in the text attention model, including W ₃ ，b ₃ ，W ₄ ，b ₄ And W and _E 。

and performing feature extraction on the text samples in the sample pairs based on the text Query encoder to obtain text Q encoding information of the text samples.

In this embodiment, the model structure of the text Key encoder is the same as that of the text Query encoder, and only the parameter updating manner is different, and the model structure is a dynamically updated parameter; in one embodiment, the text Key encoder is:

wherein the content of the first and second substances,

are parameters of a text Key encoder, respectively, including W ₃ ，b ₃ ，W ₄ ，b ₄ And W and _E 。

in one embodiment, the text Key encoder parameters

Based on the text Key encoder parameters obtained during the previous batch of model training and the text Query encoder parameters obtained during the current batch of model training, new text Key encoder parameters are obtained, which are:

Wherein μ represents a hyper-parameter; optionally, its value is set to 0.999;

and

are respectively provided withThe parameters of an image Key encoder when the current batch of model training is executed and the parameters of an image Key encoder when the previous batch of model training is executed are obtained;

the parameters of the text Query encoder when the current batch of training is executed.

In this embodiment, for a text sample in a single sample pair, the performing, by the text Key encoder, feature extraction on the text sample in each sample pair includes:

acquiring the parameters of a text Key encoder trained by the current batch of models based on the parameters of the text Key encoder acquired by the last batch of model training and the parameters of a text Query encoder acquired by the current batch of model training, so as to acquire the parameters of the text Key encoder trained by the current batch of models based on the parameters of the text Key encoder; and based on the text Key encoder, extracting the characteristics of the text samples in the sample pairs to obtain text K encoding information of the text samples.

S203, inputting the image K coding information and the text K coding information of each sample pair into the sampling optimization module to update an image memory pool and a text memory pool in the sampling optimization module;

Specifically, in the sampling optimization module, for image samples in the current batch, storing each input image K coding information to the topmost layer of the image memory pool, and removing the same number of image K coding information in the bottommost layer of the image memory pool, that is, the number of stored image K coding information is the same as the number of removed image K coding information; similarly, for the text samples in the current batch, storing each input text K coding information to the topmost layer of the text memory pool, and removing the text K coding information with the same data in the bottommost layer of the text memory pool, that is, the number of the stored text K coding information is the same as that of the removed text K coding information, so that the data volume of each memory pool is kept unchanged while dynamically updating each memory pool.

S204, inputting the image Q coding information and the text Q coding information of each sample pair into the first feature matching submodule so as to obtain the positive similarity of the Q codes of the positive sample pairs based on the image Q coding information and the text Q coding information of the positive sample pairs and obtain the negative similarity of the Q codes of the negative sample pairs based on the image Q coding information and the text Q coding information of the negative sample pairs; obtaining a first loss function value of the training data based on the Q code positive similarity and the Q code negative similarity;

Specifically, a first similarity calculator is adopted to calculate cosine similarity between image Q coding information and text Q coding information of the positive sample pair so as to obtain Q code positive similarity of the positive sample pair; calculating cosine similarity between the image Q coding information and the text Q coding information of the negative sample pair to obtain image-text Q code negative similarity of the negative sample pair; and calculating cosine similarity between the text Q coding information and the image Q coding information of the negative sample pair to obtain the negative similarity of the text and image Q codes of the negative sample pair.

Further, in order to make the similarity between the image sample and the text sample of each positive sample pair higher and the similarity between the image sample and the text sample of each negative sample pair lower in the training data, a ternary loss function is used for constraint, which is:

wherein the content of the first and second substances,

is a ternary loss function; [ x ] of] ₊ Max (x, 0); s (-) is used for calculating the cross-modal similarity between the image sample and the text sample, namely the first similarity calculator; s (Q) _V ,Q _T ) Is the Q-code positive similarity of the positive sample pair, S (Q) _V ,Q _T ^- ) The negative similarity of the image-text Q code of the negative sample pair is obtained; s (Q) _V ^- ,Q _T ) Negative similarity of the text and graph Q codes of the negative sample pairs; m represents a preset similarity threshold.

Based on equation 17, a first loss function value for obtaining the training data is calculated.

S205, inputting the image Q coding information and the K coding information of each sample pair, and the text Q coding information and the K coding information into the second feature matching submodule, so as to obtain the image-text QK code positive similarity of the positive sample pair and the text image QK code positive similarity of the positive sample pair based on the image Q coding information and the text K coding information of the positive sample pair; acquiring negative similarity of each image-text QK code based on the image Q coding information of each image sample and each text K coding information in the text memory pool, and acquiring negative similarity of each image-text QK code based on the text Q coding information of each text sample and each text K coding information in the image memory pool; acquiring a second loss function value of the training data based on the positive similarity of the image-text QK codes of the positive sample pair and the negative similarity of each image-text QK code, and acquiring a third loss function value of the training data based on the positive similarity of the image-text QK codes of the positive sample pair and the negative similarity of each image-text QK code;

specifically, a second similarity calculator is adopted to calculate cosine similarity of image Q coding information in the positive sample pair and text K coding information in the positive sample pair, so as to obtain positive image-text similarity of the positive sample pair; and calculating the cosine similarity of the text Q coding information in the positive sample pair and the image K coding information in the positive sample pair to obtain the positive text-image similarity of the positive sample pair.

Calculating cosine similarity of the image Q coding information in each sample pair and each text K coding information stored in the text memory pool by adopting a second similarity calculator to obtain image-text negative similarity of the image sample; and calculating cosine similarity of the text Q coding information in each sample pair and the image K coding information stored in the image memory pool to obtain the text-image negative similarity of the text sample.

In this embodiment, the second loss function value calculated by using the graphic NCE loss function is:

wherein the content of the first and second substances,

is the teletext NCE loss function;

positive similarity of the image-text QK codes of the positive sample pair;

the negative similarity of the image-text QK code of the image sample is obtained; τ is a hyperparameter, optionally with a value set to 0.07.

In this embodiment, similarly, the graph NCE loss function is used to calculate the third loss function value as follows:

wherein the content of the first and second substances,

is the graph NCE loss function;

positive similarity for the literal QK code of the positive sample pair;

negative similarity of a text and image QK code of the text sample; τ is a hyperparameter, optionally with a value set to 0.07.

Further, in order to make the weights of the positive sample pairs decrease with the increase of the similarity score and the weights of the negative sample pairs increase with the increase of the similarity score, when the second loss function value is calculated, weighting processing is performed on the positive similarity of the image-text QK codes of the positive sample pairs and the negative similarity of each image-text QK code according to different weights to obtain a new second loss function value; namely, the second loss function value is calculated by the image-text NCE loss function after weighting processing, and is:

Similarly, when the third loss function value is calculated, weighting the positive similarity of the text image QK codes of the positive sample pair and the negative similarity of each text image QK code according to different weights to obtain a new third loss function value; namely, the weighted text figure NCE loss function calculates the third loss function value as follows:

wherein, w _p Is a positive similarity weight, w _n Is a negative similarity weight; alpha, beta, gamma are all adjustable hyper-parameters and satisfy beta>γ>α; alternatively, a is 0.4, β is 3, and γ is 0.9.

S206, obtaining a total loss function value based on the first loss function value, the second loss function value and the third loss function value; and reversely updating each model parameter in the feature coding module based on the total loss function value.

Specifically, the first, second and third loss function values are summed to obtain a total loss function value as:

wherein the content of the first and second substances,

the total loss function value is given.

And reversely updating each model parameter in the image feature coding submodule and the text feature coding submodule based on the total loss function value so as to obtain the updated cross-modal retrieval model.

It should be noted that, in the cross-modal search model training method provided by the present invention, the sequence of the execution sequence of the step S201 and the step S202 is not limited, and the sequence of the execution sequence of the step S204 and the step S205 is not limited; for example, in other embodiments, the step S202 may be performed first, and then the step S201 may be performed, or the step S205 may be performed first, and then the step S204 may be performed;

and S300, updating the training data to execute model training based on the training data of the next batch until quitting so as to obtain the trained cross-modal retrieval model.

Specifically, training data of the next batch is obtained and used as new training data of the current batch; based on the new training data, re-executing the model training process, i.e. executing step S201 to step S206; the process is repeated (step S100 to step S300) until the total number of times of execution reaches a preset training iteration number threshold, thereby obtaining the trained cross-modal retrieval model.

In order to solve the problems in the prior art, a second aspect of the present invention provides a cross-modal search method for searching image data based on image information or searching image data based on text information.

Referring to fig. 4, a flow chart of the cross-modal search method according to an embodiment of the invention is shown.

As shown in fig. 4, the cross-modality retrieval method includes the following steps:

s10, constructing training data of each batch based on the cross-modal retrieval sample data;

and each batch of training data comprises a positive sample pair and a negative sample pair.

Specifically, the positive sample pair is positive similarity between the image sample and the text sample in the current sample pair, that is, the two are in a similar relationship; the negative sample pair is negative similarity between the image sample and the text sample in the current sample pair, namely, the two are in a non-similar relationship.

S20, training a preset cross-modal retrieval model based on the training data of each batch to obtain a trained cross-modal retrieval model;

specifically, the cross-modal search model training method shown in fig. 1 is adopted to train the preset cross-modal search model to obtain the trained cross-modal search model.

In this embodiment, the pre-constructed cross-modal search model includes a feature encoding module, a sampling optimization module, and a feature matching module.

The feature coding module comprises an image feature coding submodule and a text feature coding submodule;

The image characteristic coding submodule comprises an image Query coder and an image key coder and is used for respectively obtaining Query coding information and key coding information of image samples in each sample pair, and the Query coding information and the key coding information are referred to as image Q coding information and image K coding information in the following text;

the text feature coding submodule comprises a text Query coder and a text key coder and is used for respectively obtaining Query coding information and key coding information of text samples in each sample pair, and the Query coding information and the key coding information are referred to as text Q coding information and text K coding information in the following text.

The sampling optimization module comprises a pre-constructed image memory pool and a pre-constructed text memory pool; the image memory pool is pre-constructed according to the characteristic dimension of the image K coding information and a preset capacity, and the text memory pool is pre-constructed according to the characteristic dimension of the text K coding information and the preset capacity; the capacity size of the image memory pool is the same as that of the text memory pool. In this embodiment, the sampling optimization module is configured to update the image memory pool according to the obtained each image K coding information, and update the text memory pool based on each text K coding information obtained by the text feature coding submodule, so as to construct the negative sample pair by using the image memory pool and the text memory pool, thereby increasing the number of the negative samples that are difficult to be separated in the training process of each batch compared with the existing random sampling method.

The feature matching module comprises a first feature sub-matching module and a second feature matching sub-module.

The first feature sub-matching module comprises a first similarity calculator for respectively calculating the positive similarity of the Q codes of the positive sample pairs and the negative similarity of the Q codes of the negative sample pairs; the positive similarity of the Q code of the positive sample pair is the cross-modal similarity between the Q coding information of the image and the Q coding information of the text in the positive sample pair, and the negative similarity of the Q code of the negative sample pair is the cross-modal similarity between the Q coding information of the image and the Q coding information of the text in the negative sample pair, and comprises the negative similarity of the image-text Q code matched with the text sample and the negative similarity of the text sample matched with the image-text Q code of the image sample.

The second feature matching submodule comprises a second similarity calculator and a second feature matching submodule, wherein the second similarity calculator is used for calculating the positive similarity of the QK codes of the positive sample pairs and the negative similarity of the QK codes between the image samples and the text samples; wherein the QK code positive similarity of the positive sample pair comprises: positive similarity of the image-text QK code and the image-text QK code; the image-text QK code positive similarity is the cross-modal similarity between the image Q coding information and the text K coding information in the positive sample pair, and the image-text QK code positive similarity is the cross-modal similarity between the text Q coding information and the image K coding information in the positive sample pair.

The first characteristic sub-matching module further comprises a first loss function, so that a first loss function value is obtained based on the Q code positive similarity and the Q code negative similarity output by the first similarity calculator; in one embodiment, the first loss function comprises a ternary loss function.

The second characteristic sub-matching module further comprises a second loss function and a third loss function; the second loss function is used for acquiring a second loss function value according to the positive similarity of the image-text QK code and the negative similarity of the image-text QK code output by the second similarity calculator; the third loss function is used for obtaining a third loss function value according to the positive similarity of the text image QK code and the negative similarity of the text image QK code output by the second similarity calculator and output by the second similarity calculator; in one embodiment, the second loss function includes a teletext NCE loss function, and the third loss function includes a teletext NCE loss function.

Further, the graphic NCE loss function is: a loss function constructed by weighting the positive similarity based on the image-text QK code and the negative similarity based on the image-text QK code; and the text and image NCE loss function is constructed by weighting the positive similarity of the text and image QK codes and the negative similarity of the text and image QK codes.

And S30, based on the first modal data, searching in the second modal data by using the trained cross-modal search model to obtain second modal data corresponding to the first modal data.

Specifically, when the first modality data is image data, the second modality data is text data;

and when the first modality data is text data, the second modality data is image data.

In order to solve the problems in the prior art, the present invention also provides, in a third aspect, an electronic device, including: a processor, a memory, a transceiver, a communication interface, and a system bus; the memory is used for storing the computer program, the communication interface is used for communicating with other devices, and the processor and the transceiver are used for operating the computer program to enable the processing device to execute the cross-mode retrieval model training method or the cross-mode retrieval method.

The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the Integrated Circuit may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, or discrete hardware components.

The present invention also provides, in a fourth aspect, a computer-readable storage medium, on which a computer program is stored, which, when being invoked by a processor, implements each step in the cross-modal search model training method as described above or the cross-modal search method as described above.

Among other things, the computer-readable storage medium can be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device.

The computer-readable programs described herein may be downloaded from a computer-readable storage medium to a variety of computing/processing devices, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

In summary, according to the cross-modal search model training method, the cross-modal search method, the device and the computer storage medium provided by the present invention, the image memory pool and the text memory pool are set in the sampling optimization module, so as to be updated in the model training process based on the image K coding information and the text K coding information, and the number of the difficultly-divided negative samples in each batch of training process is increased compared with the existing random sampling method by calculating the similarity between the image Q coding information and each text K coding information in the text memory pool and calculating the similarity between the text Q coding information and each image K coding information in the image memory pool, thereby enhancing the discrimination ability of the model for distinguishing the positive samples from the most difficultly-divided negative samples, and further improving the training effect of the cross-modal search model; in addition, by adaptively applying proper weights to the positive similarity and the negative similarity, a sample pair which is difficult to distinguish can be endowed with a larger weight, and a sample pair which is easy to distinguish can be endowed with a smaller weight, so that the cross-modal retrieval model can pay more attention to the sample pair with discriminability, namely the model can pay more attention to the sample pair with discriminability from the redundant sample pair, thereby enhancing the discriminability of the coding characteristics, enhancing the semantic similarity of the multimodal data in the feature space, and further improving the precision of the cross-modal retrieval.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims

1. A training method of a cross-modal retrieval model is characterized in that the cross-modal retrieval model between an image and a text is trained based on training data of each batch; the single batch of training data comprises sample pairs consisting of positive sample pairs and negative sample pairs, and each sample pair comprises an image sample and a text sample; the cross-modal retrieval model comprises a feature coding module, a sampling optimization module and a feature matching module; the sampling optimization module comprises an image memory pool and a text memory pool; the feature matching module comprises a first feature matching submodule and a second feature matching submodule;

the cross-modal search model training method comprises the following steps:

Acquiring the training data of the current batch, and inputting the training data into the feature coding module to acquire image Q coding information and image K coding information of each sample pair in the training data, and text Q coding information and text K coding information of each sample pair;

inputting the image K coding information and the text K coding information of each sample pair into the sampling optimization module so as to update an image memory pool and a text memory pool in the sampling optimization module;

inputting the image Q coding information and the text Q coding information of each sample pair into the first feature matching submodule to obtain the positive similarity of the Q codes of the positive sample pairs and the negative similarity of the Q codes of the negative sample pairs; acquiring a first loss function value of the training data based on the Q code positive similarity and the Q code negative similarity;

inputting the image Q coding information and the image K coding information of each sample pair, and the text Q coding information and the text K coding information into the second feature matching submodule to obtain the image-text QK code positive similarity and the text-text QK code positive similarity of the positive sample pair; acquiring negative similarity of each image-text QK code based on Q coding information of each image sample and each text K coding information in the text memory pool, and acquiring negative similarity of each image-text QK code based on Q coding information of each text sample and each text K coding information in the image memory pool;

Acquiring a second loss function value of the training data based on the positive similarity of the image-text QK code and the negative similarity of the image-text QK code, and acquiring a third loss function value of the training data based on the positive similarity of the image-text QK code and the negative similarity of the image-text QK code;

obtaining a total loss function value based on the first to third loss function values; based on the total loss function value, reversely updating each model parameter in the feature coding module;

updating the training data to perform the model training based on the training data of the next batch until exiting.

2. The training method of the cross-modal search model according to claim 1, wherein the feature encoding module comprises an image feature encoding submodule and a text feature encoding submodule; the image coding submodule comprises an image Query coder and an image Key coder; the text coding submodule comprises a text Query coder and a text Key coder;

the acquiring image Q coding information and image K coding information of each sample pair, and text Q coding information and text K coding information of each sample pair, includes:

Inputting each sample pair into the image feature coding submodule so as to correspondingly extract image Q coding information and image K coding information of each sample pair based on the image Query encoder and the image Key encoder;

and inputting each sample pair into the text feature coding submodule, so as to correspondingly extract text Q coding information and text K coding information of each sample pair based on the text Query coder and the text Key coder.

3. The method for training the cross-modal search model according to claim 1, wherein the updating the image memory pool and the text memory pool in the sampling optimization module comprises:

storing the input image K coding information to the topmost layer of the image memory pool, and removing the image K coding information of the bottommost layer of the image memory pool;

storing the input text K coding information to the topmost layer of the text memory pool, and removing the text K coding information at the bottommost layer of the text memory pool.

4. The training method of the cross-modal search model according to claim 1, wherein the negative similarity of Q codes of the negative sample pair comprises negative similarity of teletext Q codes and negative similarity of teletext Q codes of the negative sample pair, and the obtaining the positive similarity of Q codes of the positive sample pair and the negative similarity of Q codes of the negative sample pair comprises:

Calculating cosine similarity between the image Q coding information and the text Q coding information of the positive sample pair to obtain Q code positive similarity of the positive sample pair;

calculating cosine similarity between the image Q coding information and the text Q coding information of the negative sample pair to obtain image-text Q code negative similarity of the negative sample pair;

and calculating the cosine similarity between the text Q coding information and the image Q coding information of the negative sample pair to obtain the text and image Q code negative similarity of the negative sample pair.

5. The method for training a cross-modal search model according to claim 4, wherein the obtaining a first loss function value of the training data based on the positive similarity of the Q code and the negative similarity of the Q code comprises:

based on the positive similarity of the Q code, the negative similarity of the image-text Q code and the negative similarity of the image-text Q code, calculating the first loss function value by using a ternary loss function, wherein the first loss function value is as follows:

wherein the content of the first and second substances,

is a ternary loss function; [ x ] of] ₊ ＝max(x,0)；S(Q _V ,Q _T ) For positive similarity of the Q-code, S (Q) _V ,Q _T ^- ) The negative similarity of the image-text Q code is obtained; s (Q) _V ^- ,Q _T ) Negative similarity of the Q code of the text and the image; m represents a preset similarity threshold.

6. The method of claim 1, wherein the obtaining of the positive similarity of the text-to-text QK code and the positive similarity of the text-to-text QK code of the positive sample pair comprises:

Calculating the cosine similarity of the image Q coding information in the positive sample pair and the text K coding information in the positive sample pair to obtain the image-text positive similarity of the positive sample pair;

and calculating the cosine similarity of the text Q coding information in the positive sample pair and the image K coding information in the positive sample pair to obtain the positive similarity of the text and the image of the positive sample pair.

7. The method of claim 1, wherein obtaining the second loss function value of the training data further comprises:

weighting the positive similarity of the image-text QK code and the negative similarity of the image-text QK code according to different weights to obtain a new second loss function value;

the obtaining a third loss function value of the training data further includes:

and weighting the positive similarity of the text image QK code and the negative similarity of the text image QK code according to different weights to obtain a new third loss function value.

8. The method of claim 7, wherein the obtaining a new second loss function value of the training data comprises:

calculating the second loss function value using the graphic NCE loss function as:

Wherein the content of the first and second substances,

is the graphic NCE loss function;

the positive similarity of the image-text QK codes is obtained;

the negative similarity of the image-text QK code is obtained; τ is a hyperparameter; w is a _p Is a positive similarity weight, w _n Is a negative similarity weight;

the obtaining a third loss function value of the training data includes:

calculating a new third loss function value using the graph NCE loss function as:

wherein the content of the first and second substances,

is the graph NCE loss function;

positive similarity of the text image QK codes;

negative similarity of the QK codes of the text and the graph; τ is a hyperparameter; w is a _p Is a positive similarity weight, w _n Is a negative similarity weight.

9. The training method of the cross-modal search model according to claim 8, wherein the positive similarity weight and the negative similarity weight are respectively:

10. A cross-modal retrieval method, comprising:

constructing training data of each batch based on cross-modal retrieval sample data, wherein the training data of each batch comprises a positive sample pair and a negative sample pair;

training a preset cross-modal search model by using the cross-modal search model training method according to any one of claims 1 to 9 based on the training data of each batch to obtain a trained cross-modal search model;

And based on the first modal data, searching the second modal data by using the trained cross-modal search model to obtain the second modal data corresponding to the first modal data.

11. An electronic device, comprising: a processor and a memory; the memory is configured to store a computer program, and the processor is configured to execute the computer program stored by the memory to cause the electronic device to perform the cross-modal search model training method of any one of claims 1 to 9 or the cross-modal search method of claim 10.

12. A computer storage medium storing a computer program, the computer program being executable by a processor to perform the cross-modal search model training method of any one of claims 1 to 9 or the cross-modal search method of claim 10.