CN115080769B - Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning

Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning

Info

Publication number
CN115080769B
Authority
CN
China
Prior art keywords
image
text
branch
model
feature
Prior art date
Legal status
Active
Application number
CN202211002415.3A
Other languages
Chinese (zh)
Other versions
CN115080769A (en)
Inventor
许扬汶
刘天鹏
韩冬
孙腾中
刘灵娟
Current Assignee
Nanjing Big Data Group Co ltd
Original Assignee
Nanjing Big Data Group Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Big Data Group Co ltd
Priority to CN202211002415.3A
Publication of CN115080769A
Application granted
Publication of CN115080769B
Legal status: Active

Classifications

    • G06F 16/383: Information retrieval of unstructured textual data; retrieval characterised by using metadata automatically derived from the content
    • G06F 16/583: Information retrieval of still image data; retrieval characterised by using metadata automatically derived from the content
    • G06N 20/00: Machine learning
    • G06V 10/40: Extraction of image or video features
    • G06V 10/761: Image or video pattern matching; proximity, similarity or dissimilarity measures
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses an image-text retrieval method and system based on dual-branch balanced mutual learning. The method uses a feature generation model to produce feature vectors for images and texts; the feature generation model comprises a first-branch feature generation model and a second-branch feature generation model that guide each other's learning, while a modality discrimination model distinguishes the input modality. A positive-negative combined loss function and a similarity regularization minimization loss function guide the alternating update of the parameters of the dual-branch feature generation model and the modality discrimination model. The features produced by the first-branch feature generation model are used for similarity calculation, and the item with the highest similarity is the retrieval result. The method maps images and texts into a common space through the dual-branch feature generation model, reduces the heterogeneous gap between the image and text modalities through balanced mutual learning, improves the accuracy of the similarity computation by optimizing the loss functions, and enlarges the distance between positive and negative examples, thereby producing more accurate retrieval results.

Description

Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning
Technical Field
The invention belongs to the field of image-text retrieval, and particularly relates to an image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning.
Background
Cross-modal image-text retrieval helps a user find the image they are looking for: given a passage of descriptive text supplied by the user, a retrieval system can locate images that meet the requirement. Image-text retrieval is a fundamental research direction within cross-modal retrieval, but because of the heterogeneity gap the similarity between an image and a text cannot be measured directly. Traditional image-text similarity retrieval methods map images and texts into a common space using only a simple linear relation, so the computed similarity does not reflect realistically complex conditions, while more elaborate image-text similarity computations incur a huge computational cost.
Balanced (checks-and-balances) mutual learning is an important model mutual-learning method: a feature generation model produces features, a discrimination model judges whether those features meet specific requirements, and this mutual checking-and-learning process pushes the feature generation model to produce features carrying more information, which makes it well suited to cross-modal tasks. The two-tower model refers to a technique that maps image and text data into a common space separately and computes similarity in that space; mapping multi-modal data into a common space in a non-linear way is expected to bridge the heterogeneity gap. However, the traditional two-tower model struggles to learn a sufficiently good mapping and common space, and its speed does not meet the requirements of large-scale data processing. At the same time, the different classes of features within the image and text modalities are not learned sufficiently, so the model cannot distinguish images (or texts) with different content. Moreover, the features the model generates are high-dimensional floating-point data, which still consume considerable resources and time during similarity computation.
Disclosure of Invention
Object of the invention: to provide a method that accurately realizes cross-modal similarity learning and thereby performs image-text retrieval accurately.
Technical solution: the image-text retrieval method based on dual-branch balanced mutual learning according to the invention comprises the following steps: a user inputs a specific image or text into the image-text retrieval model, and the text or image with the highest similarity is retrieved, wherein the training method of the image-text retrieval model comprises the following steps:
(1) Preprocessing the image and text data sets;
(2) Passing the preprocessed data set through a feature generation model to generate image features and text features, and generating the initial parameters of the feature generation model; the feature generation model comprises a first-branch feature generation model and a second-branch feature generation model that guide each other's learning; the image features comprise a first-branch image feature v and a second-branch image feature v_s, and the text features comprise a first-branch text feature t and a second-branch text feature t_s;
(3) Inputting the image features and the text features into a modality discrimination model to generate the initial parameters of the modality discrimination model;
(4) Alternately updating the parameters of the feature generation model and the modality discrimination model. A positive-negative combined loss function L_trip pulls the distance between a feature and its positive examples closer and pushes the distance between the feature and its negative examples farther; its image part L_trip,v and text part L_trip,t are given as formula images in the original document and combine as:

L_trip = L_trip,v + L_trip,t

where L_trip,v is the positive-negative combined loss for images, L_trip,t is the positive-negative combined loss for text, t_i is the i-th first-branch text feature, the second-branch text features of the j-th text positive example and the k-th text negative example of an image enter L_trip,v, v_i is the i-th first-branch image feature, the second-branch image features of the j-th image positive example and the k-th image negative example of a text enter L_trip,t, α_1 and α_2 regulate the proportion of the positive-example loss of images and text, μ_1 and μ_2 regulate the magnitude of the overall loss, and ||·||_sim is the similarity, computed by a formula (given as an image in the original document) based on ||·||_2, the squared Euclidean distance;
(5) Computing the similarity from the text and image features generated by the first-branch feature generation model, the highest similarity being the result of the image-text retrieval.
Further, in step (4), a similarity regularization minimization loss function L_min is used to guide the generation of the first-branch image feature and the first-branch text feature; its image part L_min,v and text part L_min,t are given as formula images in the original document and combine as:

L_min = L_min,v + L_min,t

where L_min,v and L_min,t are the similarity regularization minimization loss functions of images and text, respectively.
Further, in step (2), the second-branch feature generation model comprises a second-branch image model and a second-branch text model, and the parameters of the second-branch feature generation model are updated by mixing the parameters of the two branches (the update formulas are given as images in the original document), where θ_v^s are the parameters of the second-branch image model, θ_v are the parameters of the first-branch image model, θ_t^s are the parameters of the second-branch text model, θ_t are the parameters of the first-branch text model, and k controls the ratio of the addition.
Further, the loss function of the modality discrimination model (its full expression is given as a formula image in the original document) is built on the squared Euclidean distance, where D(f_i; θ_D) is the actual output of the modality discrimination model when the input feature is f_i, y_i is the expected output of the modality discrimination model, and n is the number of features.
Further, in step (1), the preprocessing of the image data set comprises image resizing, image flipping, image scaling, image cropping, and adjustment of image brightness, color temperature and saturation, and converts the pixel values into the range [0,1].
Further, in step (1), the preprocessing of the text data set is vectorization: the words appearing in the texts are collected into a sequence; if a core word of a given text appears in the sequence, the corresponding element of that text's vector is 1, and otherwise 0.
Further, in step (4), the first-branch image feature and the first-branch text feature are converted into class probabilities by a Softmax function p and guided by the true label l so as to distinguish different features within the image modality and the text modality; the probability normalization loss function is given as a formula image in the original document.
the invention discloses a picture and text retrieval system based on double-branch balance mutual learning, which comprises:
the preprocessing module is used for preprocessing the image and text data sets;
a model training module for alternately updating the parameters of a feature generation model and a modality discrimination model, the feature generation model comprising a first-branch feature generation model and a second-branch feature generation model that guide each other's learning, and the modality discrimination model being used to judge whether an input feature belongs to an image or a text; the loss function of the model training comprises a positive-negative combined loss function L_trip, whose image part L_trip,v and text part L_trip,t are given as formula images in the original document and combine as:

L_trip = L_trip,v + L_trip,t

with the symbols defined as in the method above (t_i and v_i are the i-th first-branch text and image features, the second-branch features of the j-th positive examples and k-th negative examples enter the respective parts, α_1 and α_2 regulate the positive-example loss proportions, μ_1 and μ_2 regulate the overall loss, and ||·||_sim is the similarity computed from the squared Euclidean distance ||·||_2);
Further, a similarity regularization minimization loss function L_min is used to guide the generation of the first-branch image feature and the first-branch text feature; its image part L_min,v and text part L_min,t are given as formula images in the original document and combine as:

L_min = L_min,v + L_min,t

where L_min,v and L_min,t are the similarity regularization minimization loss functions of images and text respectively, and the positive-example and negative-example features are defined as above.
Beneficial effects: compared with the prior art, the invention has the following advantages. (1) Balanced mutual learning reduces the heterogeneous differences between the image and text modalities, making similarities easier to compare. (2) During balanced mutual learning, each modality has a dual-branch feature generation model whose two branches guide and learn from each other, generating features richer in information, so similarity can be computed more accurately and a classification effect is obtained. (3) The optimized positive-negative combined loss function directly computes the difference between positive-example and negative-example similarities while regularizing with the feature distances in the numerator, which directly enlarges the distance between positive and negative examples and improves retrieval accuracy. (4) The similarity regularization minimization loss function further guides feature generation directly, better reducing the distance between image and text features with the same semantics so that they carry richer semantic information.
Drawings
Fig. 1 is a flowchart of the image-text retrieval method of the present invention.
Fig. 2 is an architecture diagram of the image-text retrieval model of the present invention.
Fig. 3 is the input image used for the image-text retrieval in this embodiment.
Fig. 4 shows the experimental results of the embodiment of the present invention.
Detailed Description
The technical solution of the invention is further explained below with reference to the drawings.
As shown in Fig. 1 and Fig. 2, the image-text retrieval method based on dual-branch balanced mutual learning according to the present invention comprises the following steps:
(1) Preprocessing image and text data sets
The data set comprises images and texts. Each image undergoes five data-enhancement operations followed by tensor conversion, and each text undergoes vectorization. This embodiment illustrates the preprocessing of one image-text pair; Fig. 3 shows the basketball-shot image input in this embodiment. The input image is 1280 × 960, and its size is adjusted first: the longest side is 1280, so the image is resized to a multiple of 32 between 640 and 1920; assuming it is scaled to 960, the image becomes 960 × 960. The image is then flipped; if horizontal flipping is chosen between horizontal and vertical flipping, the image is mirrored left-right about its central axis. Next the width of the image is scaled randomly; assuming a random scale of 0.9, the width becomes 960 × 0.9 = 864 and the image is 960 × 864. Finally the image is cropped to 640 × 640 according to its content, and its color temperature, brightness and saturation are adjusted to random values. Counting the channels, the image is now 640 × 640 × 3.
For the enhanced image, the dimension order is converted first: the channel dimension is moved to the front, giving a converted size of 3 × 640 × 640. All pixel values in the image are then divided by 255 to map them into the range 0 to 1, which completes the tensor conversion of the image.
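The sketch below illustrates this five-step enhancement and tensor-conversion pipeline. It is an illustrative reconstruction, not the patent's code: the library choices (PIL, NumPy), the random ranges, the center crop standing in for the content-aware crop, and the jitter helpers are all assumptions.

```python
import random
import numpy as np
from PIL import Image, ImageEnhance

def preprocess_image(path):
    img = Image.open(path).convert("RGB")           # e.g. a 1280 x 960 input
    # 1. Resize: side length set to a random multiple of 32 (e.g. 960)
    side = 32 * random.randint(25, 60)
    img = img.resize((side, side))                  # embodiment resizes to a square
    # 2. Flip: choose horizontal or vertical mirroring
    flip = random.choice([Image.FLIP_LEFT_RIGHT, Image.FLIP_TOP_BOTTOM])
    img = img.transpose(flip)
    # 3. Randomly scale the width, e.g. by 0.9 (960 x 960 -> 864 x 960)
    w, h = img.size
    img = img.resize((int(w * random.uniform(0.8, 1.0)), h))
    # 4. Crop to 640 x 640 (center crop as a stand-in for the content-aware crop)
    w, h = img.size
    left, top = (w - 640) // 2, (h - 640) // 2
    img = img.crop((left, top, left + 640, top + 640))
    # 5. Random brightness / saturation jitter (color-temperature jitter omitted)
    for enhancer in (ImageEnhance.Brightness, ImageEnhance.Color):
        img = enhancer(img).enhance(random.uniform(0.8, 1.2))
    # Tensor conversion: HWC -> CHW, pixel values mapped into [0, 1]
    arr = np.asarray(img, dtype=np.float32) / 255.0   # 640 x 640 x 3
    return arr.transpose(2, 0, 1)                     # 3 x 640 x 640
```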
Further, the descriptive text corresponding to the input image is "a group of players playing basketball on a basketball court", whose key words are "players", "basketball court" and "playing basketball". A data set contains many key words, so not all vector positions can be listed here. Assuming there are only six words in total, namely "player", "basketball court", "basketball", "dancer", "dance" and "dance room", every text corresponds to six bits, and this text's vector has the value 1 at the "player", "basketball court" and "basketball" positions and 0 elsewhere. This completes the vectorization of the text.
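A minimal sketch of this keyword multi-hot vectorization, using the six-word vocabulary of the example; matching core words by substring is a simplifying assumption.

```python
VOCAB = ["player", "basketball court", "basketball", "dancer", "dance", "dance room"]

def vectorize(text, vocab=VOCAB):
    # Element is 1 if the core word occurs in the text, else 0.
    return [1 if word in text else 0 for word in vocab]

v = vectorize("a group of players playing basketball on a basketball court")
# -> [1, 1, 1, 0, 0, 0]  ("player" matches inside "players")
```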
(2) Passing the preprocessed data set through the feature generation model containing two branches to generate the image and text feature vectors F and produce the initial parameters of the feature generation model; inputting the image features and text features into the modality discrimination model to produce its initial parameters; and alternately updating the parameters of the feature generation model and the modality discrimination model.
(2.1) The feature generation model comprises a first-branch feature generation model and a second-branch feature generation model. The preprocessed training data set is input into the neural network model, and the feature generation model produces a first-branch image feature v, a first-branch text feature t, a second-branch image feature v_s and a second-branch text feature t_s; the second-branch and first-branch features subsequently guide each other's learning.
(2.2) The first-branch image feature v and the first-branch text feature t are input into the modality discrimination model D. The goal of D is to judge whether an input feature belongs to an image or a text; its output is a two-bit vector y = [y_0, y_1]. Ideally, for a first-branch image feature v the output of the modality discrimination model is [1,0]; conversely, for a first-branch text feature t the model outputs [0,1]. The modality discrimination loss function L_adv (given as a formula image in the original document) is built on ||·||_2, the squared Euclidean distance, where D(f_i; θ_D) is the actual output of the modality discrimination model when the input feature is f_i, y_i is its expected output, and n is the number of features. The commonly used cross-entropy function focuses on guiding feature learning from the perspective of distributions, but the discrimination model does not need to learn the distribution of the true values strictly; by using the squared Euclidean distance, L_adv guides the gap between the discrimination model's learned result and the true value more directly, thereby adjusting the model. Meanwhile, the label y multiplied in front further selects which squared-distance terms contribute, making the loss computation more precise.
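The exact formula for L_adv is available only as an image; the sketch below implements one plausible reading of the description (a label-gated squared Euclidean distance between the discriminator's output and its expected one-hot output, averaged over n features) and should be taken as an assumption.

```python
import numpy as np

def modality_discrimination_loss(outputs, expected):
    """outputs: n x 2 actual outputs D(f_i; theta_D);
    expected: n x 2 one-hot targets, [1,0] for image features, [0,1] for text."""
    outputs, expected = np.asarray(outputs), np.asarray(expected)
    # Squared Euclidean gap per feature, gated by the label vector y_i
    # (the y_i in front selects which distance terms contribute).
    per_feature = np.sum(expected * (outputs - expected) ** 2, axis=1)
    return per_feature.mean()

loss = modality_discrimination_loss([[0.7, 0.3], [0.4, 0.6]],
                                    [[1, 0], [0, 1]])  # one image, one text feature
```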
(2.3) The similarity between the first-branch image feature v and the first-branch text feature t is computed with the similarity formula ||·||_sim (given as a formula image in the original document).
and (4) utilizing a positive and negative example combination loss function to pull the distance between the features and the positive examples closer and push the distance between the features and the negative examples farther. The loss function is as follows:
Figure GDA0003893168840000063
Figure GDA0003893168840000064
L trip =L trip,v +L trip,t
in order to make the distance between the positive example and the negative example as far as possible and enable the model to generate features which fully represent the distance between the positive example and the negative example, the prior art uses a triple loss function and utilizes a max function to regulate and control the distance between the positive example and the negative example to be increased. The method directly calculates the difference value of the similarity of the positive case and the negative case, and regularizes by using the sum of the positive characteristic distance and the negative characteristic distance of the molecule, so that the distance between the positive case and the negative case can be directly increased; meanwhile, the positive and negative characteristic distance part of the numerator can control the weight of different positive and negative example combinations, and the distances of different positive and negative examples are different. With respect to the image of playing basketball in this embodiment,
Figure GDA0003893168840000065
namely that a group of players play basketball on a basketball court,
Figure GDA0003893168840000066
it is "one dancer dancing in a dance hall in a jazz".
(2.4) A similarity regularization minimization loss function guides the features F (images and texts) generated by the feature generation model, better reducing the distance between image and text features with the same semantics. Its image part L_min,v and text part L_min,t are given as formula images in the original document and combine as:

L_min = L_min,v + L_min,t
(2.5) To learn the class distinctions within each modality, the first-branch image feature v and the first-branch text feature t are converted into class probabilities by a Softmax function p, and a probability normalization loss function L_label is defined (given as a formula image in the original document). The commonly adopted cross-entropy function is too one-sided; L_label considers the true value l and the generated features together, and using their mean better describes the gap between their distributions and prevents the generated features from overfitting to the true values, thereby yielding a more accurate distribution gap.
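L_label is likewise given only as an image; one plausible reading of "using the mean of the true value l and the generated features" is a cross-entropy against the average of the one-hot label and the softmax probabilities. The sketch below is that assumed form, not the patent's formula.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def label_loss(logits, onehot):
    p = softmax(logits)              # the Softmax function p applied to the feature
    target = 0.5 * (onehot + p)      # mean of the true value l and the prediction
    # cross-entropy of p against the smoothed target (assumed form)
    return -(target * np.log(p + 1e-12)).sum(axis=-1).mean()
```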
(2.6) The overall loss divides into a feature generation loss L_gen and a modality discrimination loss L_adv. The feature generation loss is the sum of the positive-negative combined loss, the similarity regularization minimization loss and the probability normalization loss:

L_gen = L_trip + L_min + L_label

The loss L that the overall model needs to optimize is the difference between L_gen and L_adv:

L = L_gen - L_adv
(2.7) The parameters of the first-branch feature generation model are updated by ordinary gradient back-propagation. The parameters of the second-branch feature generation model are updated by mixing in the first-branch parameters (the update formulas are given as images in the original document), where k controls the ratio of the addition; in this embodiment k is 0.8.
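The update formulas appear only as images; given the description ("k controls the ratio of the addition", with k = 0.8 here), a natural form is an exponential-moving-average mix of the two branches' parameters, sketched below as an assumption.

```python
def update_second_branch(theta_s, theta, k=0.8):
    """Mix first-branch parameters theta into second-branch parameters theta_s.
    Both are dicts of numpy arrays; k is the retention ratio (0.8 in the embodiment)."""
    return {name: k * theta_s[name] + (1.0 - k) * theta[name]
            for name in theta_s}
```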
(2.8) During training, the feature generation network and the modality discrimination network are updated in alternating cycles: first the network parameters of the feature generation model are optimized with the loss function L, then the network parameters of the modality discrimination model are optimized with the loss -L computed from the features output by the newly updated feature generation model, and this interleaved iteration is repeated for multiple rounds.
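A schematic of this alternating update; gen_step, disc_step and compute_L are hypothetical callables (an optimizer step for each network and an evaluation of the overall loss L = L_gen - L_adv) supplied by the surrounding training code.

```python
def alternating_train(gen_step, disc_step, compute_L, batches, rounds=10):
    """Alternately optimize the feature generation network (minimizing L)
    and the modality discrimination network (minimizing -L)."""
    for _ in range(rounds):
        for batch in batches:
            gen_step(compute_L(batch))    # step 1: update the generator with L
            disc_step(-compute_L(batch))  # step 2: update the discriminator with -L,
                                          # using features from the updated generator
```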
(3) Computing the similarity from the first-branch image feature v and the first-branch text feature t, the highest similarity being the result of the image-text retrieval.
The preprocessed test data set is fed into the trained model, and only the first-branch image feature v and the first-branch text feature t are used for the similarity operation, computed with the similarity function ||·||_sim. The image-text pair with the highest similarity score is the matching content. In this embodiment, the similarity between the image and the text "a group of players playing basketball on a basketball court" is 0.91, while the similarity between the image and the text "a dancer dancing in a dance room" is 0.26, so the image matches "a group of players playing basketball on a basketball court" better.
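Retrieval then reduces to ranking candidate texts by similarity against the query image's first-branch feature; a sketch using the placeholder similarity assumed above:

```python
import numpy as np

def retrieve(query_feature, candidate_features, candidate_items):
    # Score every candidate with the (placeholder) similarity and return the best.
    scores = [1.0 / (1.0 + np.sum((query_feature - c) ** 2))
              for c in candidate_features]
    best = int(np.argmax(scores))
    return candidate_items[best], scores[best]

# e.g. scores of 0.91 vs 0.26 select "a group of players playing basketball
# on a basketball court" over "a dancer dancing in a dance room".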
(4) The user inputs a specific image or text, and the text or image with the highest similarity is retrieved following the prediction process above. If the user inputs a photo of playing basketball, then according to the computed similarities the retrieval result is the text "a group of players playing basketball on the basketball court"; likewise, if the user inputs that text, a photo of playing basketball is obtained.
The method of the invention is verified experimentally. The test data set is the Pascal Sentence dataset, one of the commonly used cross-modal retrieval data sets. The evaluation index is the mean average precision (mAP), i.e. the mean of the average precision (AP) over all test samples. For the first K search results of a test sample, the average precision AP@K is

AP@K = (1/N) * Σ_{V=1..K} P(V) * n_V,  with P(V) = (1/V) * Σ_{j=1..V} n_j

where N is the total number of correct items among the first K search results, n_V is 1 if the V-th search result is correct and 0 otherwise, and P(V) is the precision over the first V results.
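A straightforward implementation of AP@K as written above:

```python
def ap_at_k(relevance, k):
    """relevance: list of 0/1 flags n_V for the ranked results, V = 1..K."""
    hits, total = 0, 0.0
    for v, n_v in enumerate(relevance[:k], start=1):
        hits += n_v
        if n_v:                      # add precision@V only at correct results
            total += hits / v
    return total / hits if hits else 0.0

# mAP@50 is then the mean of ap_at_k(rel, 50) over all test samples.
```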
The comparison methods are CCA, LCFS, Corr-AE, DCCA, Deep-SM, MHTN and ACMR, all common cross-modal retrieval methods. The results of all methods under the mAP@50 evaluation index are shown in Fig. 4; the present method clearly outperforms the comparison methods, with evaluation results more than 15% higher than every comparison method, fully demonstrating the high accuracy of the retrieval.

Claims (9)

1. An image-text retrieval method based on dual-branch balanced mutual learning, characterized in that a user inputs a specific image or text into an image-text retrieval model and the text or image with the highest similarity is retrieved, wherein the training method of the image-text retrieval model comprises the following steps:
(1) Preprocessing the image and text data sets;
(2) Passing the preprocessed data set through a feature generation model to generate image features and text features, and generating the initial parameters of the feature generation model; the feature generation model comprises a first-branch feature generation model and a second-branch feature generation model that guide each other's learning; the image features comprise a first-branch image feature v and a second-branch image feature v_s, and the text features comprise a first-branch text feature t and a second-branch text feature t_s;
(3) Inputting the image features and the text features into a modality discrimination model to generate the initial parameters of the modality discrimination model; the loss function of the modality discrimination model (its full expression is given as a formula image in the original document) is built on the squared Euclidean distance, where D(f_i; θ_D) is the actual output of the modality discrimination model when the input feature is f_i, y_i is the expected output of the modality discrimination model, and n is the number of features;
(4) Alternately updating the parameters of the feature generation model and the modality discrimination model by the following method: updating the network parameters of the feature generation model with the loss function of the image-text retrieval model's training; obtaining that loss function from the features output by the optimized feature generation model and then updating the network parameters of the modality discrimination model; and iterating the updates in this manner. The loss function comprises a positive-negative combined loss function, which pulls the distance between a feature and its positive examples closer and pushes the distance between the feature and its negative examples farther; its image part L_trip,v and text part L_trip,t are given as formula images in the original document and combine as:

L_trip = L_trip,v + L_trip,t

where L_trip,v is the positive-negative combined loss for images, L_trip,t is the positive-negative combined loss for text, t_i is the i-th first-branch text feature, the second-branch text features of the j-th text positive example and the k-th text negative example of an image enter L_trip,v, v_i is the i-th first-branch image feature, the second-branch image features of the j-th image positive example and the k-th image negative example of a text enter L_trip,t, α_1 and α_2 regulate the proportion of the positive-example loss of images and text, μ_1 and μ_2 regulate the magnitude of the overall loss, and ||·||_sim is the similarity, computed by a formula (given as an image in the original document) based on ||·||_2, the squared Euclidean distance;
(5) Computing the similarity from the text and image features generated by the first-branch feature generation model, the highest similarity being the result of the image-text retrieval.
2. The image-text retrieval method based on dual-branch balanced mutual learning according to claim 1, wherein in step (4) a similarity regularization minimization loss function L_min is used to guide the generation of the first-branch image feature and the first-branch text feature; its image part L_min,v and text part L_min,t are given as formula images in the original document and combine as:

L_min = L_min,v + L_min,t

where L_min,v and L_min,t are the similarity regularization minimization loss functions of images and text, respectively.
3. The image-text retrieval method based on dual-branch balanced mutual learning according to claim 1, wherein in step (2) the second-branch feature generation model comprises a second-branch image model and a second-branch text model, and the parameters of the second-branch feature generation model are updated by mixing the parameters of the two branches (the update formulas are given as images in the original document), where θ_v^s are the parameters of the second-branch image model, θ_v are the parameters of the first-branch image model, θ_t^s are the parameters of the second-branch text model, θ_t are the parameters of the first-branch text model, and k controls the ratio of the addition.
4. The image-text retrieval method based on dual-branch balanced mutual learning according to claim 1, wherein in step (1) the preprocessing of the image data set comprises image resizing, image flipping, image scaling, image cropping, and adjustment of image brightness, color temperature and saturation, and converts the pixel values into the range [0,1].
5. The image-text retrieval method based on dual-branch balanced mutual learning according to claim 1, wherein in step (1) the preprocessing of the text data set is vectorization: the words appearing in the texts are collected into a sequence; if a core word of a given text appears in the sequence, the corresponding element of that text's vector is 1, and otherwise 0.
6. The image-text retrieval method based on dual-branch balanced mutual learning according to claim 1, wherein in step (4) the first-branch image feature and the first-branch text feature are converted into class probabilities by a Softmax function p and guided by the true label l so as to distinguish different features within the image and the text, the probability normalization loss function being given as a formula image in the original document.
7. the utility model provides a picture and text retrieval system based on mutual study of two branch weighing-appliances which characterized in that includes:
the preprocessing module is used for preprocessing the image and text data sets;
a model training module comprising an image-text retrieval model and a modality discrimination model, and used for alternately updating the parameters of the feature generation model and the modality discrimination model: first the network parameters of the feature generation model are updated with the loss function of the image-text retrieval model's training, then that loss function is obtained from the features output by the optimized feature generation model and the network parameters of the modality discrimination model are updated, and the updates are iterated; the feature generation model comprises a first-branch feature generation model and a second-branch feature generation model that guide each other's learning; the modality discrimination model is used to judge whether an input feature belongs to an image or a text, and its loss function (given as a formula image in the original document) is built on the squared Euclidean distance, where D(f_i; θ_D) is the actual output of the modality discrimination model when the input feature is f_i, y_i is the expected output of the modality discrimination model, and n is the number of features;
the loss function of the model training comprises a positive and negative example combined loss function, L trip The formula of (1) is:
Figure FDA0003893168830000033
Figure FDA0003893168830000034
L trip =L trip,v +L trip,t
wherein L is trip,v Combining the loss functions for positive and negative examples of an image, L trip,t Combining the penalty functions for positive and negative examples of text, t i For the ith first branch text feature,
Figure FDA0003893168830000035
and
Figure FDA0003893168830000036
a second branch text feature, v, representing the jth text positive example and the kth text negative example of the image, respectively i Is the ith first branch image feature;
Figure FDA0003893168830000037
and
Figure FDA0003893168830000038
j-th image positive example and j-th image positive example respectively representing textSecond branch image features of k image counterexamples; alpha is alpha 1 And alpha 2 The proportion of the positive example loss of the image and the text, mu 1 And mu 2 Regulating and controlling the value of overall loss; i | · | purple wind sim Calculate the formula for the similarity:
Figure FDA0003893168830000039
wherein | · | charging 2 Is an Euler power distance function;
an image-text retrieval module for computing the similarity from the text and image features generated by the first-branch feature generation model, the highest similarity being the image-text retrieval result.
8. The system according to claim 7, wherein a similarity regularization minimization loss function L_min is used to guide the generation of the first-branch image feature and the first-branch text feature; its image part L_min,v and text part L_min,t are given as formula images in the original document and combine as:

L_min = L_min,v + L_min,t

where L_min,v and L_min,t are the similarity regularization minimization loss functions of images and text respectively, and the positive-example and negative-example features are defined as in claim 7.
9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the image-text retrieval method based on dual-branch balanced mutual learning according to any one of claims 1 to 6.
CN202211002415.3A 2022-08-22 2022-08-22 Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning Active CN115080769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211002415.3A CN115080769B (en) 2022-08-22 2022-08-22 Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning


Publications (2)

Publication Number Publication Date
CN115080769A (en) 2022-09-20
CN115080769B (en) 2022-12-02

Family

ID=83244044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211002415.3A Active CN115080769B (en) Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning

Country Status (1)

Country Link
CN (1) CN115080769B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115712740B * 2023-01-10 2023-06-06 Soochow University Multi-modal entailment enhanced image-text retrieval method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299341A (en) * 2018-10-29 2019-02-01 Shandong Normal University Adversarial cross-modal retrieval method and system based on dictionary learning
CN113010700A (en) * 2021-03-01 2021-06-22 University of Electronic Science and Technology of China Image-text cross-modal retrieval method based on category information alignment
CN114298158A (en) * 2021-12-06 2022-04-08 Hunan University of Technology Multi-modal pre-training method based on image-text linear combination


Also Published As

Publication number Publication date
CN115080769A (en) 2022-09-20


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant