CN115080769B - Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning

Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning

Info

Publication number
CN115080769B
Authority
CN
China
Prior art keywords
image
text
branch
model
feature
Prior art date
Legal status
Active
Application number
CN202211002415.3A
Other languages
Chinese (zh)
Other versions
CN115080769A (en)
Inventor
许扬汶
刘天鹏
韩冬
孙腾中
刘灵娟
Current Assignee
Nanjing Big Data Group Co ltd
Original Assignee
Nanjing Big Data Group Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Big Data Group Co ltd
Priority to CN202211002415.3A
Publication of CN115080769A
Application granted
Publication of CN115080769B
Legal status: Active

Classifications

    • G06F 16/383: Information retrieval of unstructured textual data; retrieval characterised by using metadata automatically derived from the content
    • G06F 16/583: Information retrieval of still image data; retrieval characterised by using metadata automatically derived from the content
    • G06N 20/00: Machine learning
    • G06V 10/40: Extraction of image or video features
    • G06V 10/761: Image or video pattern matching; proximity, similarity or dissimilarity measures
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses an image-text retrieval method and system based on dual-branch balanced mutual learning. The method uses a feature generation model to produce feature vectors for images and texts; the feature generation model comprises a first-branch feature generation model and a second-branch feature generation model that guide each other's learning, while a modality discrimination model distinguishes the input modality. A positive-negative combined loss function and a similarity regularization minimization loss function guide the alternating update of the parameters of the dual-branch feature generation model and the modality discrimination model. The features produced by the first-branch feature generation model are used for similarity calculation, and the item with the highest similarity is the retrieval result. The method maps images and texts into a common space through the dual-branch feature generation model, reduces the heterogeneous gap between the image and text modalities through balanced mutual learning, improves the accuracy of the similarity computation by optimizing the loss functions, and enlarges the distance between positive and negative examples, thereby producing more accurate retrieval results.

Description

Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning
Technical Field
The invention belongs to the field of image-text retrieval, and particularly relates to an image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning.
Background
Cross-modal image-text retrieval helps a user find the image they are looking for: given a passage of descriptive text supplied by the user, a retrieval system can locate images that meet the requirement. Image-text retrieval is a fundamental research direction within cross-modal retrieval, but because of the heterogeneity gap the similarity between an image and a text cannot be measured directly. Traditional image-text similarity retrieval methods map images and texts into a common space using only a simple linear relation, so the computed similarity does not reflect realistically complex conditions, while more elaborate image-text similarity computations incur a huge computational cost.
Balanced (checks-and-balances) mutual learning is an important model mutual-learning method: a feature generation model produces features, a discrimination model judges whether those features meet specific requirements, and this mutual checking-and-learning process pushes the feature generation model to produce features carrying more information, which makes it well suited to cross-modal tasks. The two-tower model refers to a technique that maps image and text data into a common space separately and computes similarity in that space; mapping multi-modal data into a common space in a non-linear way is expected to bridge the heterogeneity gap. However, the traditional two-tower model struggles to learn a sufficiently good mapping and common space, and its speed does not meet the requirements of large-scale data processing. At the same time, the different classes of features within the image and text modalities are not learned sufficiently, so the model cannot distinguish images (or texts) with different content. Moreover, the features the model generates are high-dimensional floating-point data, which still consume considerable resources and time during similarity computation.
Disclosure of Invention
Object of the invention: to provide a method that accurately realizes cross-modal similarity learning and thereby performs image-text retrieval accurately.
Technical solution: the image-text retrieval method based on dual-branch balanced mutual learning according to the invention comprises the following steps: a user inputs a specific image or text into the image-text retrieval model, and the text or image with the highest similarity is retrieved, wherein the training method of the image-text retrieval model comprises the following steps:
(1) Preprocessing the image and text data sets;
(2) Passing the preprocessed data set through a feature generation model to generate image features and text features, and generating the initial parameters of the feature generation model; the feature generation model comprises a first-branch feature generation model and a second-branch feature generation model that guide each other's learning; the image features comprise a first-branch image feature v and a second-branch image feature v_s, and the text features comprise a first-branch text feature t and a second-branch text feature t_s;
(3) Inputting the image features and the text features into a modality discrimination model to generate the initial parameters of the modality discrimination model;
(4) Alternately updating the parameters of the feature generation model and the modality discrimination model. A positive-negative combined loss function L_trip pulls the distance between a feature and its positive examples closer and pushes the distance between the feature and its negative examples farther; its image part L_trip,v and text part L_trip,t are given as formula images in the original document and combine as:

L_trip = L_trip,v + L_trip,t

where L_trip,v is the positive-negative combined loss for images, L_trip,t is the positive-negative combined loss for text, t_i is the i-th first-branch text feature, the second-branch text features of the j-th text positive example and the k-th text negative example of an image enter L_trip,v, v_i is the i-th first-branch image feature, the second-branch image features of the j-th image positive example and the k-th image negative example of a text enter L_trip,t, α_1 and α_2 regulate the proportion of the positive-example loss of images and text, μ_1 and μ_2 regulate the magnitude of the overall loss, and ||·||_sim is the similarity, computed by a formula (given as an image in the original document) based on ||·||_2, the squared Euclidean distance;
(5) Computing the similarity from the text and image features generated by the first-branch feature generation model, the highest similarity being the result of the image-text retrieval.
Further, in step (4), a similarity regularization minimization loss function L_min is used to guide the generation of the first-branch image feature and the first-branch text feature; its image part L_min,v and text part L_min,t are given as formula images in the original document and combine as:

L_min = L_min,v + L_min,t

where L_min,v and L_min,t are the similarity regularization minimization loss functions of images and text, respectively.
Further, in step (2), the second-branch feature generation model comprises a second-branch image model and a second-branch text model, and the parameters of the second-branch feature generation model are updated by mixing the parameters of the two branches (the update formulas are given as images in the original document), where θ_v^s are the parameters of the second-branch image model, θ_v are the parameters of the first-branch image model, θ_t^s are the parameters of the second-branch text model, θ_t are the parameters of the first-branch text model, and k controls the ratio of the addition.
Further, the loss function of the modality discrimination model (its full expression is given as a formula image in the original document) is built on the squared Euclidean distance, where D(f_i; θ_D) is the actual output of the modality discrimination model when the input feature is f_i, y_i is the expected output of the modality discrimination model, and n is the number of features.
Further, in step (1), the preprocessing of the image data set comprises image resizing, image flipping, image scaling, image cropping, and adjustment of image brightness, color temperature and saturation, and converts the pixel values into the range [0,1].
Further, in step (1), the preprocessing of the text data set is vectorization: the words appearing in the texts are collected into a sequence; if a core word of a given text appears in the sequence, the corresponding element of that text's vector is 1, and otherwise 0.
Further, in step (4), the first-branch image feature and the first-branch text feature are converted into class probabilities by a Softmax function p and guided by the true label l so as to distinguish different features within the image modality and the text modality; the probability normalization loss function is given as a formula image in the original document.
the invention discloses a picture and text retrieval system based on double-branch balance mutual learning, which comprises:
the preprocessing module is used for preprocessing the image and text data sets;
a model training module for alternately updating the parameters of a feature generation model and a modality discrimination model, the feature generation model comprising a first-branch feature generation model and a second-branch feature generation model that guide each other's learning, and the modality discrimination model being used to judge whether an input feature belongs to an image or a text; the loss function of the model training comprises a positive-negative combined loss function L_trip, whose image part L_trip,v and text part L_trip,t are given as formula images in the original document and combine as:

L_trip = L_trip,v + L_trip,t

with the symbols defined as in the method above (t_i and v_i are the i-th first-branch text and image features, the second-branch features of the j-th positive examples and k-th negative examples enter the respective parts, α_1 and α_2 regulate the positive-example loss proportions, μ_1 and μ_2 regulate the overall loss, and ||·||_sim is the similarity computed from the squared Euclidean distance ||·||_2);
Further, a similarity regularization minimization loss function L_min is used to guide the generation of the first-branch image feature and the first-branch text feature; its image part L_min,v and text part L_min,t are given as formula images in the original document and combine as:

L_min = L_min,v + L_min,t

where L_min,v and L_min,t are the similarity regularization minimization loss functions of images and text respectively, and the positive-example and negative-example features are defined as above.
Beneficial effects: compared with the prior art, the invention has the following advantages. (1) Balanced mutual learning reduces the heterogeneous differences between the image and text modalities, making similarities easier to compare. (2) During balanced mutual learning, each modality has a dual-branch feature generation model whose two branches guide and learn from each other, generating features richer in information, so similarity can be computed more accurately and a classification effect is obtained. (3) The optimized positive-negative combined loss function directly computes the difference between positive-example and negative-example similarities while regularizing with the feature distances in the numerator, which directly enlarges the distance between positive and negative examples and improves retrieval accuracy. (4) The similarity regularization minimization loss function further guides feature generation directly, better reducing the distance between image and text features with the same semantics so that they carry richer semantic information.
Drawings
Fig. 1 is a flowchart of the image-text retrieval method of the present invention.
Fig. 2 is an architecture diagram of the image-text retrieval model of the present invention.
Fig. 3 is the input image used for the image-text retrieval in this embodiment.
Fig. 4 shows the experimental results of the embodiment of the present invention.
Detailed Description
The technical solution of the invention is further explained below with reference to the drawings.
As shown in Fig. 1 and Fig. 2, the image-text retrieval method based on dual-branch balanced mutual learning according to the present invention comprises the following steps:
(1) Preprocessing image and text data sets
The data set comprises images and texts. Each image undergoes five data-enhancement operations followed by tensor conversion, and each text undergoes vectorization. This embodiment illustrates the preprocessing of one image-text pair; Fig. 3 shows the basketball-shot image input in this embodiment. The input image is 1280 × 960, and its size is adjusted first: the longest side is 1280, so the image is resized to a multiple of 32 between 640 and 1920; assuming it is scaled to 960, the image becomes 960 × 960. The image is then flipped; if horizontal flipping is chosen between horizontal and vertical flipping, the image is mirrored left-right about its central axis. Next the width of the image is scaled randomly; assuming a random scale of 0.9, the width becomes 960 × 0.9 = 864 and the image is 960 × 864. Finally the image is cropped to 640 × 640 according to its content, and its color temperature, brightness and saturation are adjusted to random values. Counting the channels, the image is now 640 × 640 × 3.
For the enhanced image, the dimension order is converted first: the channel dimension is moved to the front, giving a converted size of 3 × 640 × 640. All pixel values in the image are then divided by 255 to map them into the range 0 to 1, which completes the tensor conversion of the image.
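The sketch below illustrates this five-step enhancement and tensor-conversion pipeline. It is an illustrative reconstruction, not the patent's code: the library choices (PIL, NumPy), the random ranges, the center crop standing in for the content-aware crop, and the jitter helpers are all assumptions.

```python
import random
import numpy as np
from PIL import Image, ImageEnhance

def preprocess_image(path):
    img = Image.open(path).convert("RGB")           # e.g. a 1280 x 960 input
    # 1. Resize: side length set to a random multiple of 32 (e.g. 960)
    side = 32 * random.randint(25, 60)
    img = img.resize((side, side))                  # embodiment resizes to a square
    # 2. Flip: choose horizontal or vertical mirroring
    flip = random.choice([Image.FLIP_LEFT_RIGHT, Image.FLIP_TOP_BOTTOM])
    img = img.transpose(flip)
    # 3. Randomly scale the width, e.g. by 0.9 (960 x 960 -> 864 x 960)
    w, h = img.size
    img = img.resize((int(w * random.uniform(0.8, 1.0)), h))
    # 4. Crop to 640 x 640 (center crop as a stand-in for the content-aware crop)
    w, h = img.size
    left, top = (w - 640) // 2, (h - 640) // 2
    img = img.crop((left, top, left + 640, top + 640))
    # 5. Random brightness / saturation jitter (color-temperature jitter omitted)
    for enhancer in (ImageEnhance.Brightness, ImageEnhance.Color):
        img = enhancer(img).enhance(random.uniform(0.8, 1.2))
    # Tensor conversion: HWC -> CHW, pixel values mapped into [0, 1]
    arr = np.asarray(img, dtype=np.float32) / 255.0   # 640 x 640 x 3
    return arr.transpose(2, 0, 1)                     # 3 x 640 x 640
```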
Further, the descriptive text corresponding to the input image is "a group of players playing basketball on a basketball court", whose key words are "players", "basketball court" and "playing basketball". A data set contains many key words, so not all vector positions can be listed here. Assuming there are only six words in total, namely "player", "basketball court", "basketball", "dancer", "dance" and "dance room", every text corresponds to six bits, and this text's vector has the value 1 at the "player", "basketball court" and "basketball" positions and 0 elsewhere. This completes the vectorization of the text.
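A minimal sketch of this keyword multi-hot vectorization, using the six-word vocabulary of the example; matching core words by substring is a simplifying assumption.

```python
VOCAB = ["player", "basketball court", "basketball", "dancer", "dance", "dance room"]

def vectorize(text, vocab=VOCAB):
    # Element is 1 if the core word occurs in the text, else 0.
    return [1 if word in text else 0 for word in vocab]

v = vectorize("a group of players playing basketball on a basketball court")
# -> [1, 1, 1, 0, 0, 0]  ("player" matches inside "players")
```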
(2) Passing the preprocessed data set through the feature generation model containing two branches to generate the image and text feature vectors F and produce the initial parameters of the feature generation model; inputting the image features and text features into the modality discrimination model to produce its initial parameters; and alternately updating the parameters of the feature generation model and the modality discrimination model.
(2.1) The feature generation model comprises a first-branch feature generation model and a second-branch feature generation model. The preprocessed training data set is input into the neural network model, and the feature generation model produces a first-branch image feature v, a first-branch text feature t, a second-branch image feature v_s and a second-branch text feature t_s; the second-branch and first-branch features subsequently guide each other's learning.
(2.2) The first-branch image feature v and the first-branch text feature t are input into the modality discrimination model D. The goal of D is to judge whether an input feature belongs to an image or a text; its output is a two-bit vector y = [y_0, y_1]. Ideally, for a first-branch image feature v the output of the modality discrimination model is [1,0]; conversely, for a first-branch text feature t the model outputs [0,1]. The modality discrimination loss function L_adv (given as a formula image in the original document) is built on ||·||_2, the squared Euclidean distance, where D(f_i; θ_D) is the actual output of the modality discrimination model when the input feature is f_i, y_i is its expected output, and n is the number of features. The commonly used cross-entropy function focuses on guiding feature learning from the perspective of distributions, but the discrimination model does not need to learn the distribution of the true values strictly; by using the squared Euclidean distance, L_adv guides the gap between the discrimination model's learned result and the true value more directly, thereby adjusting the model. Meanwhile, the label y multiplied in front further selects which squared-distance terms contribute, making the loss computation more precise.
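The exact formula for L_adv is available only as an image; the sketch below implements one plausible reading of the description (a label-gated squared Euclidean distance between the discriminator's output and its expected one-hot output, averaged over n features) and should be taken as an assumption.

```python
import numpy as np

def modality_discrimination_loss(outputs, expected):
    """outputs: n x 2 actual outputs D(f_i; theta_D);
    expected: n x 2 one-hot targets, [1,0] for image features, [0,1] for text."""
    outputs, expected = np.asarray(outputs), np.asarray(expected)
    # Squared Euclidean gap per feature, gated by the label vector y_i
    # (the y_i in front selects which distance terms contribute).
    per_feature = np.sum(expected * (outputs - expected) ** 2, axis=1)
    return per_feature.mean()

loss = modality_discrimination_loss([[0.7, 0.3], [0.4, 0.6]],
                                    [[1, 0], [0, 1]])  # one image, one text feature
```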
(2.3) The similarity between the first-branch image feature v and the first-branch text feature t is computed with the similarity formula ||·||_sim (given as a formula image in the original document).
and (4) utilizing a positive and negative example combination loss function to pull the distance between the features and the positive examples closer and push the distance between the features and the negative examples farther. The loss function is as follows:
Figure GDA0003893168840000063
Figure GDA0003893168840000064
L trip =L trip,v +L trip,t
in order to make the distance between the positive example and the negative example as far as possible and enable the model to generate features which fully represent the distance between the positive example and the negative example, the prior art uses a triple loss function and utilizes a max function to regulate and control the distance between the positive example and the negative example to be increased. The method directly calculates the difference value of the similarity of the positive case and the negative case, and regularizes by using the sum of the positive characteristic distance and the negative characteristic distance of the molecule, so that the distance between the positive case and the negative case can be directly increased; meanwhile, the positive and negative characteristic distance part of the numerator can control the weight of different positive and negative example combinations, and the distances of different positive and negative examples are different. With respect to the image of playing basketball in this embodiment,
Figure GDA0003893168840000065
namely that a group of players play basketball on a basketball court,
Figure GDA0003893168840000066
it is "one dancer dancing in a dance hall in a jazz".
(2.4) A similarity regularization minimization loss function guides the features F (images and texts) generated by the feature generation model, better reducing the distance between image and text features with the same semantics. Its image part L_min,v and text part L_min,t are given as formula images in the original document and combine as:

L_min = L_min,v + L_min,t
(2.5) To learn the class distinctions within each modality, the first-branch image feature v and the first-branch text feature t are converted into class probabilities by a Softmax function p, and a probability normalization loss function L_label is defined (given as a formula image in the original document). The commonly adopted cross-entropy function is too one-sided; L_label considers the true value l and the generated features together, and using their mean better describes the gap between their distributions and prevents the generated features from overfitting to the true values, thereby yielding a more accurate distribution gap.
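L_label is likewise given only as an image; one plausible reading of "using the mean of the true value l and the generated features" is a cross-entropy against the average of the one-hot label and the softmax probabilities. The sketch below is that assumed form, not the patent's formula.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def label_loss(logits, onehot):
    p = softmax(logits)              # the Softmax function p applied to the feature
    target = 0.5 * (onehot + p)      # mean of the true value l and the prediction
    # cross-entropy of p against the smoothed target (assumed form)
    return -(target * np.log(p + 1e-12)).sum(axis=-1).mean()
```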
(2.6) The overall loss divides into a feature generation loss L_gen and a modality discrimination loss L_adv. The feature generation loss is the sum of the positive-negative combined loss, the similarity regularization minimization loss and the probability normalization loss:

L_gen = L_trip + L_min + L_label

The loss L that the overall model needs to optimize is the difference between L_gen and L_adv:

L = L_gen - L_adv
(2.7) The parameters of the first-branch feature generation model are updated by ordinary gradient back-propagation. The parameters of the second-branch feature generation model are updated by mixing in the first-branch parameters (the update formulas are given as images in the original document), where k controls the ratio of the addition; in this embodiment k is 0.8.
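The update formulas appear only as images; given the description ("k controls the ratio of the addition", with k = 0.8 here), a natural form is an exponential-moving-average mix of the two branches' parameters, sketched below as an assumption.

```python
def update_second_branch(theta_s, theta, k=0.8):
    """Mix first-branch parameters theta into second-branch parameters theta_s.
    Both are dicts of numpy arrays; k is the retention ratio (0.8 in the embodiment)."""
    return {name: k * theta_s[name] + (1.0 - k) * theta[name]
            for name in theta_s}
```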
(2.8) During training, the feature generation network and the modality discrimination network are updated in alternating cycles: first the network parameters of the feature generation model are optimized with the loss function L, then the network parameters of the modality discrimination model are optimized with the loss -L computed from the features output by the newly updated feature generation model, and this interleaved iteration is repeated for multiple rounds.
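A schematic of this alternating update; gen_step, disc_step and compute_L are hypothetical callables (an optimizer step for each network and an evaluation of the overall loss L = L_gen - L_adv) supplied by the surrounding training code.

```python
def alternating_train(gen_step, disc_step, compute_L, batches, rounds=10):
    """Alternately optimize the feature generation network (minimizing L)
    and the modality discrimination network (minimizing -L)."""
    for _ in range(rounds):
        for batch in batches:
            gen_step(compute_L(batch))    # step 1: update the generator with L
            disc_step(-compute_L(batch))  # step 2: update the discriminator with -L,
                                          # using features from the updated generator
```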
(3) Computing the similarity from the first-branch image feature v and the first-branch text feature t, the highest similarity being the result of the image-text retrieval.
The preprocessed test data set is fed into the trained model, and only the first-branch image feature v and the first-branch text feature t are used for the similarity operation, computed with the similarity function ||·||_sim. The image-text pair with the highest similarity score is the matching content. In this embodiment, the similarity between the image and the text "a group of players playing basketball on a basketball court" is 0.91, while the similarity between the image and the text "a dancer dancing in a dance room" is 0.26, so the image matches "a group of players playing basketball on a basketball court" better.
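Retrieval then reduces to ranking candidate texts by similarity against the query image's first-branch feature; a sketch using the placeholder similarity assumed above:

```python
import numpy as np

def retrieve(query_feature, candidate_features, candidate_items):
    # Score every candidate with the (placeholder) similarity and return the best.
    scores = [1.0 / (1.0 + np.sum((query_feature - c) ** 2))
              for c in candidate_features]
    best = int(np.argmax(scores))
    return candidate_items[best], scores[best]

# e.g. scores of 0.91 vs 0.26 select "a group of players playing basketball
# on a basketball court" over "a dancer dancing in a dance room".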
(4) The user inputs a specific image or text, and the text or image with the highest similarity is retrieved following the prediction process above. If the user inputs a photo of playing basketball, then according to the computed similarities the retrieval result is the text "a group of players playing basketball on the basketball court"; likewise, if the user inputs that text, a photo of playing basketball is obtained.
The method of the invention is verified experimentally. The test data set is the Pascal Sentence dataset, one of the commonly used cross-modal retrieval data sets. The evaluation index is the mean average precision (mAP), i.e. the mean of the average precision (AP) over all test samples. For the first K search results of a test sample, the average precision AP@K is

AP@K = (1/N) * Σ_{V=1..K} P(V) * n_V,  with P(V) = (1/V) * Σ_{j=1..V} n_j

where N is the total number of correct items among the first K search results, n_V is 1 if the V-th search result is correct and 0 otherwise, and P(V) is the precision over the first V results.
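A straightforward implementation of AP@K as written above:

```python
def ap_at_k(relevance, k):
    """relevance: list of 0/1 flags n_V for the ranked results, V = 1..K."""
    hits, total = 0, 0.0
    for v, n_v in enumerate(relevance[:k], start=1):
        hits += n_v
        if n_v:                      # add precision@V only at correct results
            total += hits / v
    return total / hits if hits else 0.0

# mAP@50 is then the mean of ap_at_k(rel, 50) over all test samples.
```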
The comparison methods are CCA, LCFS, Corr-AE, DCCA, Deep-SM, MHTN and ACMR, all common cross-modal retrieval methods. The results of all methods under the mAP@50 evaluation index are shown in Fig. 4; the present method clearly outperforms the comparison methods, with evaluation results more than 15% higher than every comparison method, fully demonstrating the high accuracy of the retrieval.

Claims (9)

1. An image-text retrieval method based on dual-branch balanced mutual learning, characterized in that a user inputs a specific image or text into an image-text retrieval model and the text or image with the highest similarity is retrieved, wherein the training method of the image-text retrieval model comprises the following steps:
(1) Preprocessing the image and text data sets;
(2) Passing the preprocessed data set through a feature generation model to generate image features and text features, and generating the initial parameters of the feature generation model; the feature generation model comprises a first-branch feature generation model and a second-branch feature generation model that guide each other's learning; the image features comprise a first-branch image feature v and a second-branch image feature v_s, and the text features comprise a first-branch text feature t and a second-branch text feature t_s;
(3) Inputting the image features and the text features into a modality discrimination model to generate the initial parameters of the modality discrimination model; the loss function of the modality discrimination model (its full expression is given as a formula image in the original document) is built on the squared Euclidean distance, where D(f_i; θ_D) is the actual output of the modality discrimination model when the input feature is f_i, y_i is the expected output of the modality discrimination model, and n is the number of features;
(4) Alternately updating the parameters of the feature generation model and the modality discrimination model by the following method: updating the network parameters of the feature generation model with the loss function of the image-text retrieval model's training; obtaining that loss function from the features output by the optimized feature generation model and then updating the network parameters of the modality discrimination model; and iterating the updates in this manner. The loss function comprises a positive-negative combined loss function, which pulls the distance between a feature and its positive examples closer and pushes the distance between the feature and its negative examples farther; its image part L_trip,v and text part L_trip,t are given as formula images in the original document and combine as:

L_trip = L_trip,v + L_trip,t

where L_trip,v is the positive-negative combined loss for images, L_trip,t is the positive-negative combined loss for text, t_i is the i-th first-branch text feature, the second-branch text features of the j-th text positive example and the k-th text negative example of an image enter L_trip,v, v_i is the i-th first-branch image feature, the second-branch image features of the j-th image positive example and the k-th image negative example of a text enter L_trip,t, α_1 and α_2 regulate the proportion of the positive-example loss of images and text, μ_1 and μ_2 regulate the magnitude of the overall loss, and ||·||_sim is the similarity, computed by a formula (given as an image in the original document) based on ||·||_2, the squared Euclidean distance;
(5) Computing the similarity from the text and image features generated by the first-branch feature generation model, the highest similarity being the result of the image-text retrieval.
2. The image-text retrieval method based on dual-branch balanced mutual learning according to claim 1, wherein in step (4) a similarity regularization minimization loss function L_min is used to guide the generation of the first-branch image feature and the first-branch text feature; its image part L_min,v and text part L_min,t are given as formula images in the original document and combine as:

L_min = L_min,v + L_min,t

where L_min,v and L_min,t are the similarity regularization minimization loss functions of images and text, respectively.
3. The image-text retrieval method based on dual-branch balanced mutual learning according to claim 1, wherein in step (2) the second-branch feature generation model comprises a second-branch image model and a second-branch text model, and the parameters of the second-branch feature generation model are updated by mixing the parameters of the two branches (the update formulas are given as images in the original document), where θ_v^s are the parameters of the second-branch image model, θ_v are the parameters of the first-branch image model, θ_t^s are the parameters of the second-branch text model, θ_t are the parameters of the first-branch text model, and k controls the ratio of the addition.
4. The image-text retrieval method based on dual-branch balanced mutual learning according to claim 1, wherein in step (1) the preprocessing of the image data set comprises image resizing, image flipping, image scaling, image cropping, and adjustment of image brightness, color temperature and saturation, and converts the pixel values into the range [0,1].
5. The image-text retrieval method based on dual-branch balanced mutual learning according to claim 1, wherein in step (1) the preprocessing of the text data set is vectorization: the words appearing in the texts are collected into a sequence; if a core word of a given text appears in the sequence, the corresponding element of that text's vector is 1, and otherwise 0.
6. The image-text retrieval method based on dual-branch balanced mutual learning according to claim 1, wherein in step (4) the first-branch image feature and the first-branch text feature are converted into class probabilities by a Softmax function p and guided by the true label l so as to distinguish different features within the image and the text, the probability normalization loss function being given as a formula image in the original document.
7. the utility model provides a picture and text retrieval system based on mutual study of two branch weighing-appliances which characterized in that includes:
the preprocessing module is used for preprocessing the image and text data sets;
a model training module comprising an image-text retrieval model and a modality discrimination model, and used for alternately updating the parameters of the feature generation model and the modality discrimination model: first the network parameters of the feature generation model are updated with the loss function of the image-text retrieval model's training, then that loss function is obtained from the features output by the optimized feature generation model and the network parameters of the modality discrimination model are updated, and the updates are iterated; the feature generation model comprises a first-branch feature generation model and a second-branch feature generation model that guide each other's learning; the modality discrimination model is used to judge whether an input feature belongs to an image or a text, and its loss function (given as a formula image in the original document) is built on the squared Euclidean distance, where D(f_i; θ_D) is the actual output of the modality discrimination model when the input feature is f_i, y_i is the expected output of the modality discrimination model, and n is the number of features;
the loss function of the model training comprises a positive and negative example combined loss function, L trip The formula of (1) is:
Figure FDA0003893168830000033
Figure FDA0003893168830000034
L trip =L trip,v +L trip,t
wherein L is trip,v Combining the loss functions for positive and negative examples of an image, L trip,t Combining the penalty functions for positive and negative examples of text, t i For the ith first branch text feature,
Figure FDA0003893168830000035
and
Figure FDA0003893168830000036
a second branch text feature, v, representing the jth text positive example and the kth text negative example of the image, respectively i Is the ith first branch image feature;
Figure FDA0003893168830000037
and
Figure FDA0003893168830000038
j-th image positive example and j-th image positive example respectively representing textSecond branch image features of k image counterexamples; alpha is alpha 1 And alpha 2 The proportion of the positive example loss of the image and the text, mu 1 And mu 2 Regulating and controlling the value of overall loss; i | · | purple wind sim Calculate the formula for the similarity:
Figure FDA0003893168830000039
wherein | · | charging 2 Is an Euler power distance function;
an image-text retrieval module for computing the similarity from the text and image features generated by the first-branch feature generation model, the highest similarity being the image-text retrieval result.
8. The system according to claim 7, wherein a similarity regularization minimization loss function L_min is used to guide the generation of the first-branch image feature and the first-branch text feature; its image part L_min,v and text part L_min,t are given as formula images in the original document and combine as:

L_min = L_min,v + L_min,t

where L_min,v and L_min,t are the similarity regularization minimization loss functions of images and text respectively, and the positive-example and negative-example features are defined as in claim 7.
9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the image-text retrieval method based on dual-branch balanced mutual learning according to any one of claims 1 to 6.
CN202211002415.3A 2022-08-22 2022-08-22 Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning Active CN115080769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211002415.3A CN115080769B (en) 2022-08-22 2022-08-22 Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning


Publications (2)

Publication Number Publication Date
CN115080769A (en) 2022-09-20
CN115080769B (en) 2022-12-02

Family

ID=83244044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211002415.3A Active CN115080769B (en) Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning

Country Status (1)

Country Link
CN (1) CN115080769B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115712740B * 2023-01-10 2023-06-06 Soochow University Multi-modal entailment enhanced image-text retrieval method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299341A (en) * 2018-10-29 2019-02-01 Shandong Normal University Adversarial cross-modal retrieval method and system based on dictionary learning
CN113010700A (en) * 2021-03-01 2021-06-22 University of Electronic Science and Technology of China Image-text cross-modal retrieval method based on category information alignment
CN114298158A (en) * 2021-12-06 2022-04-08 Hunan University of Technology Multi-modal pre-training method based on image-text linear combination


Also Published As

Publication number Publication date
CN115080769A (en) 2022-09-20


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant