CN115080769A - Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning - Google Patents

Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning

Info

Publication number
CN115080769A
Authority
CN
China
Prior art keywords
text
image
branch
model
feature
Prior art date
Legal status
Granted
Application number
CN202211002415.3A
Other languages
Chinese (zh)
Other versions
CN115080769B (en)
Inventor
许扬汶
刘天鹏
韩冬
孙腾中
刘灵娟
Current Assignee
Nanjing Big Data Group Co ltd
Original Assignee
Nanjing Big Data Group Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Big Data Group Co ltd filed Critical Nanjing Big Data Group Co ltd
Priority to CN202211002415.3A priority Critical patent/CN115080769B/en
Publication of CN115080769A publication Critical patent/CN115080769A/en
Application granted granted Critical
Publication of CN115080769B publication Critical patent/CN115080769B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image-text retrieval method and system based on dual-branch balanced mutual learning. The method uses a feature generation model to generate feature vectors of images and texts; the feature generation model comprises a first-branch feature generation model and a second-branch feature generation model that guide each other's learning. A modality discrimination model distinguishes the input modalities, and a positive-negative example combined loss function together with a similarity-regularization minimization loss function guide the alternating update of the parameters of the dual-branch feature generation model and the modality discrimination model. The features generated by the first-branch feature generation model are used for similarity computation, and the candidate with the highest similarity is the retrieval result. The method maps images and texts into a common space through the dual-branch feature generation model, reduces the heterogeneous difference between the image and text modalities through balanced mutual learning, improves the accuracy of the similarity computation by optimizing the loss functions, and enlarges the distance between positive and negative examples, thereby obtaining more accurate retrieval results.

Description

Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning
Technical Field
The invention belongs to the field of image-text retrieval, and particularly relates to an image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning.
Background
Cross-modal image-text retrieval helps a user find the image they are looking for: given a piece of descriptive text provided by the user, a retrieval system returns images that meet the requirement. Image-text retrieval is a fairly fundamental research direction within cross-modal retrieval, but because of the heterogeneity gap the similarity between an image and a text cannot be measured directly. Traditional image-text similarity retrieval methods map images and texts into a common space with only a simple linear relation and measure similarity there, so the similarity computation does not match real, complex conditions; at the same time, complex image-text similarity computation brings a huge computational load.
Balanced mutual learning is an important method of mutual learning between models: a feature generation model generates features, a discrimination model checks whether those features meet specific requirements, and this process of mutual checking and learning pushes the feature generation model to generate features containing more information, which makes it very suitable for cross-modal tasks. The two-tower model refers to the technique of mapping image and text data into a common space separately and computing similarity in that space; mapping multi-modal data into a common space in a non-linear way is expected to resolve the heterogeneity-gap problem. However, the traditional two-tower model struggles to learn a good enough mapping and common space, and its speed does not meet the requirements of large-scale data processing. At the same time, the different classes of features within the image and text modalities are not learned sufficiently, so the model cannot distinguish images (or texts) with different content. Moreover, the features generated by the model are high-dimensional floating-point data, which still occupy considerable resources and time during similarity computation.
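For concreteness, a minimal two-tower sketch in PyTorch follows. It only illustrates the general two-tower idea described above (separate non-linear encoders into a shared space, similarity computed there); all layer sizes and input dimensions are assumptions, not the patent's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """One modality's non-linear mapping into the common space."""
    def __init__(self, in_dim: int, common_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, common_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

image_tower = Tower(in_dim=2048)  # e.g. pooled CNN features (assumed)
text_tower = Tower(in_dim=300)    # e.g. a text feature vector (assumed)

img = image_tower(torch.randn(4, 2048))
txt = text_tower(torch.randn(3, 300))
similarity = img @ txt.T          # 4x3 cosine-similarity matrix
```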
Disclosure of Invention
Purpose of the invention: the invention aims to provide a method that accurately realizes cross-modal similarity learning and, on that basis, accurate image-text retrieval.
Technical scheme: the image-text retrieval method based on dual-branch balanced mutual learning according to the invention comprises the following steps: a user inputs a specific image or text into the image-text retrieval model and retrieves the text or image with the highest similarity, wherein the image-text retrieval model is trained as follows:
(1) preprocessing the image and text data sets;
(2) generating image features and text features from the preprocessed data set through a feature generation model, and generating the initial parameters of the feature generation model; the feature generation model comprises a first-branch feature generation model and a second-branch feature generation model that guide each other's learning; the image features comprise a first-branch image feature v and a second-branch image feature v_s, and the text features comprise a first-branch text feature t and a second-branch text feature t_s;
(3) inputting the image features and text features into a modality discrimination model to generate the initial parameters of the modality discrimination model;
(4) alternately updating the parameters of the feature generation model and the modality discrimination model; a positive-negative example combined loss function L_trip pulls the distance between features and positive examples closer and pushes the distance between features and negative examples farther; L_trip combines an image term L_trip,v and a text term L_trip,t (the component formulas are published as images in the original and are not reproduced here), where t_i is the i-th first-branch text feature; t_j+ and t_k- are the second-branch text features of the j-th positive and k-th negative text example of an image; v_i is the i-th first-branch image feature; v_j+ and v_k- are the second-branch image features of the j-th positive and k-th negative image example of a text; two coefficients weight the positive-example losses of the image and of the text, and two further coefficients regulate the magnitude of the overall loss. The similarity is computed from a formula (likewise published as an image) built on the squared Euclidean distance function;
(5) computing the similarity from the text and image features generated by the first-branch feature generation model; the candidate with the highest similarity is the image-text retrieval result.
Further, in step (4), a similarity-regularization minimization loss function L_min guides the generation of the first-branch image feature and the first-branch text feature; L_min (published as an image) is the sum of two components, the similarity-regularization minimization losses of images and of texts, respectively.
Further, in step (2), the second-branch feature generation model comprises a second-branch image model and a second-branch text model; the parameters of the second-branch image model are updated as a weighted combination with the parameters of the first-branch image model, and the parameters of the second-branch text model as a weighted combination with those of the first-branch text model (the update formulas are published as images), where k controls the ratio of the combination.
Further, the loss function of the modality discrimination model (published as an image) measures the gap between D(f_i), the actual output of the modality discrimination model when the input feature is f_i, and y_i, the expected output of the modality discrimination model, where n denotes the number of features.
Further, in step (1), the preprocessing of the image data set comprises image resizing, flipping, scaling, cropping, adjustment of brightness, color temperature and saturation, and conversion of pixel values into the range [0,1].
Further, in step (1), the text data set is preprocessed by vectorization: the words appearing in the texts are collected into a sequence, and if a core word of the sequence appears in a given text, the corresponding element of that text's vector is 1, and otherwise 0.
Further, in step (4), the first-branch image features and first-branch text features are converted into class probabilities p through a Softmax function and guided by the true labels l, so as to distinguish the different features within the image and text modalities; the probability normalization loss function is published as an image and is not reproduced here.
the invention discloses a picture and text retrieval system based on double-branch balance mutual learning, which comprises:
the preprocessing module is used for preprocessing the image and text data sets;
the model training module, for alternately updating the parameters of a feature generation model and a modality discrimination model, the feature generation model comprising a first-branch feature generation model and a second-branch feature generation model that guide each other's learning, and the modality discrimination model distinguishing whether an input feature belongs to an image or a text; the losses used in model training include a positive-negative example combined loss function L_trip, which combines an image term L_trip,v and a text term L_trip,t (the formulas are published as images), where t_i is the i-th first-branch text feature; t_j+ and t_k- are the second-branch text features of the j-th positive and k-th negative text example of an image; v_i is the i-th first-branch image feature; v_j+ and v_k- are the second-branch image features of the j-th positive and k-th negative image example of a text; two coefficients weight the positive-example losses of the image and of the text, and two further coefficients regulate the magnitude of the overall loss; the similarity is computed from a formula likewise published as an image.
further, guiding generation of the first branch image features and the text features by using a similarity regularization minimization loss function which is used for regularizing the similarity to minimize the loss functionL min Comprises the following steps:
Figure 161099DEST_PATH_IMAGE036
wherein
Figure 153326DEST_PATH_IMAGE037
And
Figure 786433DEST_PATH_IMAGE038
the similarity regularization minimization loss function representing images and text respectively,
Figure 496900DEST_PATH_IMAGE039
and
Figure 755712DEST_PATH_IMAGE040
second branch text features respectively representing a jth text positive example and a kth text negative example of the image,v i representing the ith first branch image feature;
Figure 602445DEST_PATH_IMAGE041
and
Figure 937611DEST_PATH_IMAGE042
second branch image features respectively representing a j-th image positive example and a k-th image negative example of an image,t i representing the ith first branch text feature.
Beneficial effects: compared with the prior art, the invention has the following advantages: (1) balanced mutual learning reduces the heterogeneous difference between the image and text modalities, making similarities easier to compare; (2) during balanced mutual learning, each modality has a dual-branch feature generation model whose two branches guide each other to generate features richer in information, allowing more accurate similarity computation and a classification effect; (3) the optimized positive-negative combined loss function directly computes the difference between the similarities of positive and negative examples while regularizing with the distance terms in the numerator, directly enlarging the distance between positive and negative examples and improving retrieval accuracy; (4) a similarity-regularization minimization loss function directly guides feature generation and better reduces the distance between image and text features with the same semantics, so that they contain richer semantic information.
Drawings
Fig. 1 is a flowchart of the image-text retrieval method of the present invention.
Fig. 2 is a diagram of the image-text retrieval model architecture of the present invention.
Fig. 3 is the input image used for the image-text retrieval in the embodiment.
Fig. 4 shows the experimental results of the embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further explained below with reference to the drawings.
As shown in fig. 1 and fig. 2, the image-text retrieval method based on dual-branch balanced mutual learning according to the present invention comprises the following steps:
(1) preprocessing image and text data sets
The data set comprises images and texts. Each image passes through five data-augmentation operations and tensor conversion, and each text is vectorized. This embodiment illustrates the preprocessing on one image-text pair; fig. 3 shows the input image of playing basketball. The input image has size 1280 × 960, and its size is adjusted first: the longest side of the picture is 1280, so it is adjusted to some multiple of 32 between 640 and 1920, say 960, at which point the picture has been adjusted to 960 × 960. The image is then flipped; if left-right flipping is chosen between left-right and top-bottom flipping, the image is mirrored about its vertical axis. The width of the image is then scaled randomly, say by a random factor of 0.9, so the width becomes 960 × 0.9 = 864 and the image size is 960 × 864. Finally the image is cropped to 640 × 640 according to its content, and its color temperature, brightness and saturation are adjusted to random values. Counting the channel dimension, the picture now has size 640 × 640 × 3.
Starting from the augmented image, the dimension order is converted first: the channel dimension is moved to the front, giving a size of 3 × 640 × 640. All pixel values are then divided by 255, mapping them into the range 0 to 1, which completes the tensor conversion of the image.
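A hedged sketch of the five augmentations plus tensor conversion described above, written with torchvision; the exact parameter ranges, the center crop standing in for the content-aware crop, and the brightness/saturation jitter standing in for the color-temperature adjustment are all assumptions.

```python
import random
from PIL import Image
import torchvision.transforms.functional as TF

def preprocess_image(img: Image.Image):
    # 1) resize the longest side to a multiple of 32 in [640, 1920]
    target = random.randrange(640, 1921, 32)
    w, h = img.size
    scale = target / max(w, h)
    img = img.resize((int(w * scale), int(h * scale)))
    # 2) flip left-right or top-bottom
    img = TF.hflip(img) if random.random() < 0.5 else TF.vflip(img)
    # 3) randomly rescale the width (e.g. by 0.9)
    w, h = img.size
    img = img.resize((int(w * random.uniform(0.8, 1.0)), h))
    # 4) crop to 640x640 (center crop as a stand-in for the content-based crop)
    img = TF.center_crop(img, [640, 640])
    # 5) jitter brightness / saturation to random values
    img = TF.adjust_brightness(img, random.uniform(0.8, 1.2))
    img = TF.adjust_saturation(img, random.uniform(0.8, 1.2))
    # tensor conversion: channels first, pixel values scaled into [0, 1]
    return TF.to_tensor(img)  # shape (3, 640, 640)
```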
Further, the descriptive text corresponding to the input image is "a group of players playing basketball on a basketball court", whose key words are "players", "basketball court" and "playing basketball". Since a data set contains many key words, not every vector position can be listed here; assume there are only 6 words in total, namely "player", "basketball court", "basketball", "dancer", "dance" and "dance room". Every text then corresponds to 6 bits, and the vector of this text has the value 1 at the "player", "basketball court" and "basketball" positions and 0 elsewhere. This completes the vectorization of the text.
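A minimal sketch of this multi-hot vectorization; the six-word vocabulary comes from the embodiment above, while the helper function itself is an illustration.

```python
import torch

VOCAB = ["player", "basketball court", "basketball",
         "dancer", "dance", "dance room"]

def vectorize(keywords: list[str]) -> torch.Tensor:
    vec = torch.zeros(len(VOCAB))
    for i, word in enumerate(VOCAB):
        if word in keywords:
            vec[i] = 1.0  # element is 1 when the core word occurs in the text
    return vec

t = vectorize(["player", "basketball court", "basketball"])
# tensor([1., 1., 1., 0., 0., 0.])
```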
(2) The preprocessed data set is passed through the dual-branch feature generation model to generate image and text feature vectors F and the initial parameters of the feature generation model. The image and text features are input into the modality discrimination model to generate its initial parameters. The parameters of the feature generation model and of the modality discrimination model are then updated alternately.
(2.1) The feature generation model comprises a first-branch feature generation model and a second-branch feature generation model. The preprocessed training data set is input into the neural network model, and the feature generation model generates the first-branch image feature v, the first-branch text feature t, the second-branch image feature v_s and the second-branch text feature t_s; the second-branch and first-branch features subsequently guide each other's learning.
(2.2) The first-branch image feature v and the first-branch text feature t are input into the modality discrimination model D. The goal of D is to discriminate whether an input feature belongs to an image or a text; its output is a two-bit vector y = [y_0, y_1]. Ideally, for a first-branch image feature v the modality classification model outputs [1, 0]; conversely, for a first-branch text feature t it outputs [0, 1]. The modality discrimination loss function L_adv (published as an image) is built on the squared Euclidean distance between D(f_i), the actual output of the modality discrimination model when the input feature is f_i, and y_i, the expected output, with n denoting the number of features. The commonly used cross-entropy function focuses on guiding feature learning from the perspective of distributions, but the discrimination model does not need to learn the distribution of the true values strictly; the L_adv of the invention uses the squared Euclidean distance to guide the gap between the discrimination model's result and the truth more directly, thereby adjusting the model, while the label y placed in front further selects which distance values guide the loss, making the loss computation more accurate.
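A hedged sketch of the modality discrimination model D and a squared-Euclidean-distance loss matching the description above; the network shape is an assumption, and the label-based selection mentioned in the text is omitted.

```python
import torch
import torch.nn as nn

class ModalityDiscriminator(nn.Module):
    """Maps a common-space feature to a two-bit modality vector [y0, y1]."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.net(f)

def adv_loss(D: nn.Module, feats: torch.Tensor,
             expected: torch.Tensor) -> torch.Tensor:
    # expected is [1, 0] per image feature and [0, 1] per text feature;
    # the loss is the mean squared Euclidean distance between D(f_i) and y_i.
    return ((D(feats) - expected) ** 2).sum(dim=1).mean()
```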
(2.3) The similarity between the first-branch image feature v and the first-branch text feature t is computed from a formula published as an image. A positive-negative example combined loss function (likewise published as an image) then pulls the distance between features and positive examples closer and pushes the distance between features and negative examples farther.
To push positive and negative examples apart as far as possible and make the model generate features that adequately represent their distances, the prior art uses a triplet loss function, enlarging the positive-negative distance through a max function. The invention instead directly computes the difference between the similarities of positive and negative examples and regularizes it with the sum of the positive and negative feature distances in the numerator, which directly enlarges the distance between positive and negative examples; at the same time, the positive-negative distance term in the numerator can weight different positive-negative combinations, since the distances of different positive and negative examples differ. For the basketball image of this embodiment, the positive text example is "a group of players playing basketball on a basketball court" and the negative text example is "a dancer dancing in a dance room".
(2.4) The features F generated by the feature generation model (covering both images and texts) are guided by a similarity-regularization minimization loss function, which better reduces the distance between image and text features with the same semantics; the loss L_min is published as an image.
(2.5) For class distinguishability within a modality, the first-branch image feature v and the first-branch text feature t are converted into class probabilities through a Softmax function p, and a probability normalization loss function L_label is defined (published as an image). The commonly used cross-entropy function is too one-sided; the L_label of the invention considers both the truth value l and the generated features, and using the mean of the two better describes the gap between the two distributions while preventing the generated features from overfitting the truth values, thereby giving a more accurate distribution gap.
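The L_label formula likewise survives only as an image; the sketch below is a hedged guess at the idea the text describes, a cross-entropy-style loss whose target is the mean of the true label and the generated class probabilities. The 0.5 mixing, the detach, and the epsilon are all assumptions.

```python
import torch
import torch.nn.functional as F

def label_loss(logits: torch.Tensor, l: torch.Tensor) -> torch.Tensor:
    # logits: class scores of first-branch features; l: one-hot true labels
    p = F.softmax(logits, dim=-1)
    target = 0.5 * (l + p.detach())  # mean of truth value and generated probs
    return -(target * torch.log(p + 1e-8)).sum(dim=-1).mean()
```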
(2.6) The overall loss function divides into the feature generation loss L_gen and the modality discrimination loss L_adv. The feature generation loss is defined as the sum of the positive-negative combined loss, the similarity-regularization minimization loss and the probability normalization loss:

L_gen = L_trip + L_min + L_label

The overall model loss L to be optimized is the difference between L_gen and L_adv:

L = L_gen - L_adv
(2.7) The parameters of the first-branch feature generation model are updated by ordinary gradient backpropagation; the parameters of the second-branch feature generation model are updated as a k-weighted combination with the first-branch parameters (the update formula is published as an image), where k controls the ratio of the combination, 0.8 in this example.
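A sketch of the second-branch update as the text describes it. The exact formula is published as an image, so the k-weighted, exponential-moving-average-style pairing below is an assumption consistent with "k controls the ratio of addition".

```python
import torch

@torch.no_grad()
def update_second_branch(first_branch: torch.nn.Module,
                         second_branch: torch.nn.Module,
                         k: float = 0.8) -> None:
    # p2 <- k * p2 + (1 - k) * p1 for every parameter pair
    for p1, p2 in zip(first_branch.parameters(), second_branch.parameters()):
        p2.mul_(k).add_((1.0 - k) * p1)
```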
(2.8) During training, the feature generation network and the modality discrimination network are updated in alternating cycles: the loss function L first optimizes the network parameters of the feature generation model; then the loss function L, recomputed from the features output by the updated feature generation model, optimizes the network parameters of the modality discrimination model; and this interleaved iteration is repeated for multiple rounds.
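A minimal sketch of this alternating schedule. The optimizers, loader, and the gen_loss_fn / adv_loss_fn helpers are assumptions standing in for the patent's components.

```python
import torch

def train(gen, disc, loader, gen_loss_fn, adv_loss_fn, epochs: int = 10):
    opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
    for _ in range(epochs):
        for images, texts, labels in loader:
            # 1) optimize the feature generation model with L = L_gen - L_adv
            v, t = gen(images, texts)
            loss_g = gen_loss_fn(v, t, labels) - adv_loss_fn(disc, v, t)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
            # 2) re-generate features, then optimize the discrimination model
            v, t = gen(images, texts)
            loss_d = adv_loss_fn(disc, v.detach(), t.detach())
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
```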
(3) The similarity is computed from the first-branch image feature v and the first-branch text feature t, and the highest similarity gives the image-text retrieval result. The preprocessed test data set is used as input to the trained model; only the first-branch image feature v and the first-branch text feature t take part in the similarity computation, using the same similarity function as above. The image-text pair with the highest similarity score is the matching content. In this embodiment, the similarity between the image and the text "a group of players playing basketball on a basketball court" is 0.91, while the similarity between the image and the text "a dancer dancing in a dance room" is 0.26, so the image matches "a group of players playing basketball on a basketball court" better.
(4) The user inputs a specific image or text, and the text or image with the highest similarity is retrieved following the prediction procedure above. If the user inputs a picture of playing basketball, then according to the computed similarities the retrieval result is the text "a group of players playing basketball on a basketball court"; likewise, if the user inputs that text, an image of playing basketball is obtained.
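A sketch of the retrieval step: only the first-branch features are used, and the candidate with the highest similarity is returned. The negative squared Euclidean distance stands in for the patent's similarity formula, which is published only as an image.

```python
import torch

def retrieve(query_feat: torch.Tensor, candidate_feats: torch.Tensor) -> int:
    # higher similarity = smaller squared Euclidean distance
    sims = -((candidate_feats - query_feat) ** 2).sum(dim=1)
    return int(sims.argmax())  # index of the best-matching candidate
```

For the embodiment above, a basketball image feature queried against the two candidate text features would return the index of "a group of players playing basketball on a basketball court".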
The proposed method was verified experimentally. The test data set used is the Pascal Sentence dataset, one of the commonly used cross-modal retrieval data sets. The evaluation index is the mean average precision (mAP), i.e. the mean of the average precision (AP) over all test samples. For the first K retrieval results of a test sample, the accuracy AP@K is given by a formula published as an image, in which N denotes the total number of correct entries among the first K retrieval results and n_v is 1 if the v-th result is correct and 0 otherwise.
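The AP@K formula itself is published as an image; the sketch below follows the standard AP@K definition, which matches the variables described above (N correct results among the first K, n_v = 1 when the v-th result is correct), so the exact correspondence is an assumption.

```python
def ap_at_k(relevance: list[int], k: int) -> float:
    # relevance: n_v flags (1 = correct) for the ranked retrieval results
    rel = relevance[:k]
    n_correct = sum(rel)  # N in the description above
    if n_correct == 0:
        return 0.0
    score, hits = 0.0, 0
    for v, n_v in enumerate(rel, start=1):
        if n_v:
            hits += 1
            score += hits / v  # precision at rank v, counted at correct ranks
    return score / n_correct

def mean_ap(all_relevance: list[list[int]], k: int) -> float:
    # mAP = mean of AP over all test samples
    return sum(ap_at_k(r, k) for r in all_relevance) / len(all_relevance)
```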
The comparison methods are CCA, LCFS, Corr-AE, DCCA, Deep-SM, MHTN and ACMR, all common cross-modal retrieval methods. The results of all comparison methods and of the proposed method under the mAP@50 evaluation index are shown in fig. 4; the proposed method is clearly superior, with evaluation results more than 15% higher than those of every comparison method, which fully demonstrates its high retrieval accuracy.

Claims (10)

1. An image-text retrieval method based on dual-branch balanced mutual learning, characterized in that a user inputs a specific image or text into an image-text retrieval model and retrieves the text or image with the highest similarity, wherein the image-text retrieval model is trained as follows:
(1) preprocessing the image and text data sets;
(2) generating image features and text features from the preprocessed data set through a feature generation model, and generating the initial parameters of the feature generation model; the feature generation model comprises a first-branch feature generation model and a second-branch feature generation model that guide each other's learning; the image features comprise a first-branch image feature v and a second-branch image feature v_s, and the text features comprise a first-branch text feature t and a second-branch text feature t_s;
(3) inputting the image features and text features into a modality discrimination model to generate the initial parameters of the modality discrimination model;
(4) alternately updating the parameters of the feature generation model and the modality discrimination model; a positive-negative example combined loss function L_trip pulls the distance between features and positive examples closer and pushes the distance between features and negative examples farther; L_trip combines an image term L_trip,v and a text term L_trip,t (the formulas are published as images and are not reproduced here), where t_i is the i-th first-branch text feature; t_j+ and t_k- are the second-branch text features of the j-th positive and k-th negative text example of an image; v_i is the i-th first-branch image feature; v_j+ and v_k- are the second-branch image features of the j-th positive and k-th negative image example of a text; two coefficients weight the positive-example losses of the image and of the text, and two further coefficients regulate the magnitude of the overall loss; the similarity is computed from a formula (likewise published as an image) built on the squared Euclidean distance function;
(5) computing the similarity from the text and image features generated by the first-branch feature generation model; the highest similarity gives the image-text retrieval result.
2. The image-text retrieval method based on dual-branch balanced mutual learning of claim 1, wherein in step (4) a similarity-regularization minimization loss function L_min guides the generation of the first-branch image feature and the first-branch text feature; L_min (published as an image) is the sum of two components, the similarity-regularization minimization losses of images and of texts, respectively.
3. The image-text retrieval method based on dual-branch balanced mutual learning of claim 1, wherein in step (2) the second-branch feature generation model comprises a second-branch image model and a second-branch text model; the parameters of the second-branch image model are updated as a weighted combination with the parameters of the first-branch image model, and the parameters of the second-branch text model as a weighted combination with those of the first-branch text model (the update formulas are published as images), where k controls the ratio of the combination.
4. The image-text retrieval method based on dual-branch balanced mutual learning of claim 1, wherein the loss function of the modality discrimination model (published as an image) is built on the gap between D(f_i), the actual output of the modality discrimination model when the input feature is f_i, and y_i, the expected output of the modality discrimination model, with n denoting the number of features.
5. The image-text retrieval method based on dual-branch balanced mutual learning of claim 1, wherein in step (1) the preprocessing of the image data set comprises image resizing, flipping, scaling, cropping, adjustment of brightness, color temperature and saturation, and conversion of pixel values into the range [0,1].
6. The image-text retrieval method based on dual-branch balanced mutual learning of claim 1, wherein in step (1) the text data set is preprocessed by vectorization: the words appearing in the texts are collected into a sequence, and if a core word of the sequence appears in a given text, the corresponding element of that text's vector is 1, and otherwise 0.
7. The image-text retrieval method based on dual-branch balanced mutual learning of claim 1, wherein in step (4) the first-branch image features and first-branch text features are converted into class probabilities p through a Softmax function and guided by the true labels l, so as to distinguish the different features within the image and text modalities; the probability normalization loss function is published as an image and is not reproduced here.
8. the utility model provides a picture and text retrieval system based on mutual study of two branch weighing-appliances which characterized in that includes:
the preprocessing module is used for preprocessing the image and text data sets;
the model training module, for alternately updating the parameters of a feature generation model and a modality discrimination model, the feature generation model comprising a first-branch feature generation model and a second-branch feature generation model that guide each other's learning, and the modality discrimination model distinguishing whether an input feature belongs to an image or a text; the losses used in model training include a positive-negative example combined loss function L_trip, which combines an image term L_trip,v and a text term L_trip,t (the formulas are published as images), where t_i is the i-th first-branch text feature; t_j+ and t_k- are the second-branch text features of the j-th positive and k-th negative text example of an image; v_i is the i-th first-branch image feature; v_j+ and v_k- are the second-branch image features of the j-th positive and k-th negative image example of a text; two coefficients weight the positive-example losses of the image and of the text, and two further coefficients regulate the magnitude of the overall loss; the similarity is computed from a formula (likewise published as an image) built on the squared Euclidean distance function.
9. The image-text retrieval system based on dual-branch balanced mutual learning of claim 8, wherein a similarity-regularization minimization loss function L_min guides the generation of the first-branch image features and the first-branch text features; L_min (published as an image) is the sum of two components, the similarity-regularization minimization losses of images and of texts, where t_j+ and t_k- denote the second-branch text features of the j-th positive and k-th negative text example of an image, v_i the i-th first-branch image feature, v_j+ and v_k- the second-branch image features of the j-th positive and k-th negative image example of a text, and t_i the i-th first-branch text feature.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the image-text retrieval method based on dual-branch balanced mutual learning of any one of claims 1 to 7.
CN202211002415.3A 2022-08-22 2022-08-22 Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning Active CN115080769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211002415.3A CN115080769B (en) 2022-08-22 2022-08-22 Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211002415.3A CN115080769B (en) 2022-08-22 2022-08-22 Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning

Publications (2)

Publication Number Publication Date
CN115080769A (en) 2022-09-20
CN115080769B CN115080769B (en) 2022-12-02

Family

ID=83244044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211002415.3A Active CN115080769B (en) 2022-08-22 2022-08-22 Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning

Country Status (1)

Country Link
CN (1) CN115080769B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115712740A (en) * 2023-01-10 2023-02-24 苏州大学 Method and system for multi-modal implication enhanced image text retrieval

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299341A (en) * 2018-10-29 2019-02-01 山东师范大学 Adversarial cross-modal retrieval method and system based on dictionary learning
CN113010700A (en) * 2021-03-01 2021-06-22 电子科技大学 Image text cross-modal retrieval method based on category information alignment
CN114298158A (en) * 2021-12-06 2022-04-08 湖南工业大学 Multi-mode pre-training method based on image-text linear combination

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299341A (en) * 2018-10-29 2019-02-01 山东师范大学 Adversarial cross-modal retrieval method and system based on dictionary learning
CN113010700A (en) * 2021-03-01 2021-06-22 电子科技大学 Image text cross-modal retrieval method based on category information alignment
CN114298158A (en) * 2021-12-06 2022-04-08 湖南工业大学 Multi-mode pre-training method based on image-text linear combination

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115712740A (en) * 2023-01-10 2023-02-24 苏州大学 Method and system for multi-modal implication enhanced image text retrieval

Also Published As

Publication number Publication date
CN115080769B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
Wang et al. M3: Multimodal memory modelling for video captioning
CN110147457B (en) Image-text matching method, device, storage medium and equipment
Huang et al. Bi-directional spatial-semantic attention networks for image-text matching
CN112328767B (en) Question-answer matching method based on BERT model and comparative aggregation framework
CN110147548B (en) Emotion identification method based on bidirectional gating circulation unit network and novel network initialization
CN111444968A (en) Image description generation method based on attention fusion
Zhou et al. Ladder loss for coherent visual-semantic embedding
CN113204952B (en) Multi-intention and semantic slot joint identification method based on cluster pre-analysis
CN114936623B (en) Aspect-level emotion analysis method integrating multi-mode data
CN110347857B (en) Semantic annotation method of remote sensing image based on reinforcement learning
CN113128369A (en) Lightweight network facial expression recognition method fusing balance loss
CN116662582B (en) Specific domain business knowledge retrieval method and retrieval device based on natural language
CN113297369B (en) Intelligent question-answering system based on knowledge graph subgraph retrieval
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN112487822A (en) Cross-modal retrieval method based on deep learning
CN115080769B (en) Image-text retrieval method, system and storage medium based on double-branch system balance mutual learning
CN111651661B (en) Image-text cross-media retrieval method
CN106021402A (en) Multi-modal multi-class Boosting frame construction method and device for cross-modal retrieval
CN116610831A (en) Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system
CN110704665A (en) Image feature expression method and system based on visual attention mechanism
CN116311323A (en) Pre-training document model alignment optimization method based on contrast learning
CN113837229B (en) Knowledge-driven text-to-image generation method
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
Huang et al. An effective multimodal representation and fusion method for multimodal intent recognition
Cheng et al. Deep attentional fine-grained similarity network with adversarial learning for cross-modal retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant