CN115080769A - Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning - Google Patents

Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning

Info

Publication number
CN115080769A
Authority
CN
China
Prior art keywords
text
image
branch
model
feature
Prior art date
Legal status
Granted
Application number
CN202211002415.3A
Other languages
Chinese (zh)
Other versions
CN115080769B (en)
Inventor
许扬汶
刘天鹏
韩冬
孙腾中
刘灵娟
Current Assignee
Nanjing Big Data Group Co ltd
Original Assignee
Nanjing Big Data Group Co ltd
Priority date
Filing date
Publication date
Application filed by Nanjing Big Data Group Co ltd filed Critical Nanjing Big Data Group Co ltd
Priority to CN202211002415.3A priority Critical patent/CN115080769B/en
Publication of CN115080769A publication Critical patent/CN115080769A/en
Application granted granted Critical
Publication of CN115080769B publication Critical patent/CN115080769B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image-text retrieval method and system based on dual-branch balanced mutual learning. The method uses a feature generation model to generate feature vectors of images and texts; the feature generation model comprises a first-branch feature generation model and a second-branch feature generation model that guide each other's learning. A modality discrimination model distinguishes the input modalities, and a positive-negative example combined loss function together with a similarity-regularization minimization loss function guide the alternating update of the parameters of the dual-branch feature generation model and the modality discrimination model. The features generated by the first-branch feature generation model are used for similarity computation, and the candidate with the highest similarity is the retrieval result. The method maps images and texts into a common space through the dual-branch feature generation model, reduces the heterogeneous difference between the image and text modalities through balanced mutual learning, improves the accuracy of the similarity computation by optimizing the loss functions, and enlarges the distance between positive and negative examples, thereby obtaining more accurate retrieval results.

Description

Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning
Technical Field
The invention belongs to the field of image-text retrieval, and particularly relates to an image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning.
Background
Cross-modal image-text retrieval helps a user find the image they are looking for: given a piece of descriptive text provided by the user, a retrieval system returns images that meet the requirement. Image-text retrieval is a fairly fundamental research direction within cross-modal retrieval, but because of the heterogeneity gap the similarity between an image and a text cannot be measured directly. Traditional image-text similarity retrieval methods map images and texts into a common space with only a simple linear relation and measure similarity there, so the similarity computation does not match real, complex conditions; at the same time, complex image-text similarity computation brings a huge computational load.
Balanced mutual learning is an important method of mutual learning between models: a feature generation model generates features, a discrimination model checks whether those features meet specific requirements, and this process of mutual checking and learning pushes the feature generation model to generate features containing more information, which makes it very suitable for cross-modal tasks. The two-tower model refers to the technique of mapping image and text data into a common space separately and computing similarity in that space; mapping multi-modal data into a common space in a non-linear way is expected to resolve the heterogeneity-gap problem. However, the traditional two-tower model struggles to learn a good enough mapping and common space, and its speed does not meet the requirements of large-scale data processing. At the same time, the different classes of features within the image and text modalities are not learned sufficiently, so the model cannot distinguish images (or texts) with different content. Moreover, the features generated by the model are high-dimensional floating-point data, which still occupy considerable resources and time during similarity computation.
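For concreteness, a minimal two-tower sketch in PyTorch follows. It only illustrates the general two-tower idea described above (separate non-linear encoders into a shared space, similarity computed there); all layer sizes and input dimensions are assumptions, not the patent's model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """One modality's non-linear mapping into the common space."""
    def __init__(self, in_dim: int, common_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, common_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)  # unit-norm embeddings

image_tower = Tower(in_dim=2048)  # e.g. pooled CNN features (assumed)
text_tower = Tower(in_dim=300)    # e.g. a text feature vector (assumed)

img = image_tower(torch.randn(4, 2048))
txt = text_tower(torch.randn(3, 300))
similarity = img @ txt.T          # 4x3 cosine-similarity matrix
```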
Disclosure of Invention
Purpose of the invention: the invention aims to provide a method that accurately realizes cross-modal similarity learning and, on that basis, accurate image-text retrieval.
Technical scheme: the image-text retrieval method based on dual-branch balanced mutual learning according to the invention comprises the following steps: a user inputs a specific image or text into the image-text retrieval model and retrieves the text or image with the highest similarity, wherein the image-text retrieval model is trained as follows:
(1) preprocessing the image and text data sets;
(2) generating image features and text features from the preprocessed data set through a feature generation model, and generating the initial parameters of the feature generation model; the feature generation model comprises a first-branch feature generation model and a second-branch feature generation model that guide each other's learning; the image features comprise a first-branch image feature v and a second-branch image feature v_s, and the text features comprise a first-branch text feature t and a second-branch text feature t_s;
(3) inputting the image features and text features into a modality discrimination model to generate the initial parameters of the modality discrimination model;
(4) alternately updating the parameters of the feature generation model and the modality discrimination model; a positive-negative example combined loss function L_trip pulls the distance between features and positive examples closer and pushes the distance between features and negative examples farther; L_trip combines an image term L_trip,v and a text term L_trip,t (the component formulas are published as images in the original and are not reproduced here), where t_i is the i-th first-branch text feature; t_j+ and t_k- are the second-branch text features of the j-th positive and k-th negative text example of an image; v_i is the i-th first-branch image feature; v_j+ and v_k- are the second-branch image features of the j-th positive and k-th negative image example of a text; two coefficients weight the positive-example losses of the image and of the text, and two further coefficients regulate the magnitude of the overall loss. The similarity is computed from a formula (likewise published as an image) built on the squared Euclidean distance function;
(5) computing the similarity from the text and image features generated by the first-branch feature generation model; the candidate with the highest similarity is the image-text retrieval result.
Further, in step (4), a similarity-regularization minimization loss function L_min guides the generation of the first-branch image feature and the first-branch text feature; L_min (published as an image) is the sum of two components, the similarity-regularization minimization losses of images and of texts, respectively.
Further, in step (2), the second-branch feature generation model comprises a second-branch image model and a second-branch text model; the parameters of the second-branch image model are updated as a weighted combination with the parameters of the first-branch image model, and the parameters of the second-branch text model as a weighted combination with those of the first-branch text model (the update formulas are published as images), where k controls the ratio of the combination.
Further, the loss function of the modality discrimination model (published as an image) measures the gap between D(f_i), the actual output of the modality discrimination model when the input feature is f_i, and y_i, the expected output of the modality discrimination model, where n denotes the number of features.
Further, in step (1), the preprocessing of the image data set comprises image resizing, flipping, scaling, cropping, adjustment of brightness, color temperature and saturation, and conversion of pixel values into the range [0,1].
Further, in step (1), the text data set is preprocessed by vectorization: the words appearing in the texts are collected into a sequence, and if a core word of the sequence appears in a given text, the corresponding element of that text's vector is 1, and otherwise 0.
Further, in step (4), the first-branch image features and first-branch text features are converted into class probabilities p through a Softmax function and guided by the true labels l, so as to distinguish the different features within the image and text modalities; the probability normalization loss function is published as an image and is not reproduced here.
the invention discloses a picture and text retrieval system based on double-branch balance mutual learning, which comprises:
the preprocessing module is used for preprocessing the image and text data sets;
the model training module, for alternately updating the parameters of a feature generation model and a modality discrimination model, the feature generation model comprising a first-branch feature generation model and a second-branch feature generation model that guide each other's learning, and the modality discrimination model distinguishing whether an input feature belongs to an image or a text; the losses used in model training include a positive-negative example combined loss function L_trip, which combines an image term L_trip,v and a text term L_trip,t (the formulas are published as images), where t_i is the i-th first-branch text feature; t_j+ and t_k- are the second-branch text features of the j-th positive and k-th negative text example of an image; v_i is the i-th first-branch image feature; v_j+ and v_k- are the second-branch image features of the j-th positive and k-th negative image example of a text; two coefficients weight the positive-example losses of the image and of the text, and two further coefficients regulate the magnitude of the overall loss; the similarity is computed from a formula likewise published as an image.
further, guiding generation of the first branch image features and the text features by using a similarity regularization minimization loss function which is used for regularizing the similarity to minimize the loss functionL min Comprises the following steps:
Figure 161099DEST_PATH_IMAGE036
wherein
Figure 153326DEST_PATH_IMAGE037
And
Figure 786433DEST_PATH_IMAGE038
the similarity regularization minimization loss function representing images and text respectively,
Figure 496900DEST_PATH_IMAGE039
and
Figure 755712DEST_PATH_IMAGE040
second branch text features respectively representing a jth text positive example and a kth text negative example of the image,v i representing the ith first branch image feature;
Figure 602445DEST_PATH_IMAGE041
and
Figure 937611DEST_PATH_IMAGE042
second branch image features respectively representing a j-th image positive example and a k-th image negative example of an image,t i representing the ith first branch text feature.
Beneficial effects: compared with the prior art, the invention has the following advantages: (1) balanced mutual learning reduces the heterogeneous difference between the image and text modalities, making similarities easier to compare; (2) during balanced mutual learning, each modality has a dual-branch feature generation model whose two branches guide each other to generate features richer in information, allowing more accurate similarity computation and a classification effect; (3) the optimized positive-negative combined loss function directly computes the difference between the similarities of positive and negative examples while regularizing with the distance terms in the numerator, directly enlarging the distance between positive and negative examples and improving retrieval accuracy; (4) a similarity-regularization minimization loss function directly guides feature generation and better reduces the distance between image and text features with the same semantics, so that they contain richer semantic information.
Drawings
Fig. 1 is a flowchart of the image-text retrieval method of the present invention.
Fig. 2 is a diagram of the image-text retrieval model architecture of the present invention.
Fig. 3 is the input image used for the image-text retrieval in the embodiment.
Fig. 4 shows the experimental results of the embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further explained below with reference to the drawings.
As shown in fig. 1 and fig. 2, the image-text retrieval method based on dual-branch balanced mutual learning according to the present invention comprises the following steps:
(1) preprocessing image and text data sets
The data set comprises images and texts. Each image passes through five data-augmentation operations and tensor conversion, and each text is vectorized. This embodiment illustrates the preprocessing on one image-text pair; fig. 3 shows the input image of playing basketball. The input image has size 1280 × 960, and its size is adjusted first: the longest side of the picture is 1280, so it is adjusted to some multiple of 32 between 640 and 1920, say 960, at which point the picture has been adjusted to 960 × 960. The image is then flipped; if left-right flipping is chosen between left-right and top-bottom flipping, the image is mirrored about its vertical axis. The width of the image is then scaled randomly, say by a random factor of 0.9, so the width becomes 960 × 0.9 = 864 and the image size is 960 × 864. Finally the image is cropped to 640 × 640 according to its content, and its color temperature, brightness and saturation are adjusted to random values. Counting the channel dimension, the picture now has size 640 × 640 × 3.
Starting from the augmented image, the dimension order is converted first: the channel dimension is moved to the front, giving a size of 3 × 640 × 640. All pixel values are then divided by 255, mapping them into the range 0 to 1, which completes the tensor conversion of the image.
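A hedged sketch of the five augmentations plus tensor conversion described above, written with torchvision; the exact parameter ranges, the center crop standing in for the content-aware crop, and the brightness/saturation jitter standing in for the color-temperature adjustment are all assumptions.

```python
import random
from PIL import Image
import torchvision.transforms.functional as TF

def preprocess_image(img: Image.Image):
    # 1) resize the longest side to a multiple of 32 in [640, 1920]
    target = random.randrange(640, 1921, 32)
    w, h = img.size
    scale = target / max(w, h)
    img = img.resize((int(w * scale), int(h * scale)))
    # 2) flip left-right or top-bottom
    img = TF.hflip(img) if random.random() < 0.5 else TF.vflip(img)
    # 3) randomly rescale the width (e.g. by 0.9)
    w, h = img.size
    img = img.resize((int(w * random.uniform(0.8, 1.0)), h))
    # 4) crop to 640x640 (center crop as a stand-in for the content-based crop)
    img = TF.center_crop(img, [640, 640])
    # 5) jitter brightness / saturation to random values
    img = TF.adjust_brightness(img, random.uniform(0.8, 1.2))
    img = TF.adjust_saturation(img, random.uniform(0.8, 1.2))
    # tensor conversion: channels first, pixel values scaled into [0, 1]
    return TF.to_tensor(img)  # shape (3, 640, 640)
```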
Further, the descriptive text corresponding to the input image is "a group of players playing basketball on a basketball court", whose key words are "players", "basketball court" and "playing basketball". Since a data set contains many key words, not every vector position can be listed here; assume there are only 6 words in total, namely "player", "basketball court", "basketball", "dancer", "dance" and "dance room". Every text then corresponds to 6 bits, and the vector of this text has the value 1 at the "player", "basketball court" and "basketball" positions and 0 elsewhere. This completes the vectorization of the text.
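A minimal sketch of this multi-hot vectorization; the six-word vocabulary comes from the embodiment above, while the helper function itself is an illustration.

```python
import torch

VOCAB = ["player", "basketball court", "basketball",
         "dancer", "dance", "dance room"]

def vectorize(keywords: list[str]) -> torch.Tensor:
    vec = torch.zeros(len(VOCAB))
    for i, word in enumerate(VOCAB):
        if word in keywords:
            vec[i] = 1.0  # element is 1 when the core word occurs in the text
    return vec

t = vectorize(["player", "basketball court", "basketball"])
# tensor([1., 1., 1., 0., 0., 0.])
```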
(2) The preprocessed data set is passed through the dual-branch feature generation model to generate image and text feature vectors F and the initial parameters of the feature generation model. The image and text features are input into the modality discrimination model to generate its initial parameters. The parameters of the feature generation model and of the modality discrimination model are then updated alternately.
(2.1) The feature generation model comprises a first-branch feature generation model and a second-branch feature generation model. The preprocessed training data set is input into the neural network model, and the feature generation model generates the first-branch image feature v, the first-branch text feature t, the second-branch image feature v_s and the second-branch text feature t_s; the second-branch and first-branch features subsequently guide each other's learning.
(2.2) The first-branch image feature v and the first-branch text feature t are input into the modality discrimination model D. The goal of D is to discriminate whether an input feature belongs to an image or a text; its output is a two-bit vector y = [y_0, y_1]. Ideally, for a first-branch image feature v the modality classification model outputs [1, 0]; conversely, for a first-branch text feature t it outputs [0, 1]. The modality discrimination loss function L_adv (published as an image) is built on the squared Euclidean distance between D(f_i), the actual output of the modality discrimination model when the input feature is f_i, and y_i, the expected output, with n denoting the number of features. The commonly used cross-entropy function focuses on guiding feature learning from the perspective of distributions, but the discrimination model does not need to learn the distribution of the true values strictly; the L_adv of the invention uses the squared Euclidean distance to guide the gap between the discrimination model's result and the truth more directly, thereby adjusting the model, while the label y placed in front further selects which distance values guide the loss, making the loss computation more accurate.
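A hedged sketch of the modality discrimination model D and a squared-Euclidean-distance loss matching the description above; the network shape is an assumption, and the label-based selection mentioned in the text is omitted.

```python
import torch
import torch.nn as nn

class ModalityDiscriminator(nn.Module):
    """Maps a common-space feature to a two-bit modality vector [y0, y1]."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.net(f)

def adv_loss(D: nn.Module, feats: torch.Tensor,
             expected: torch.Tensor) -> torch.Tensor:
    # expected is [1, 0] per image feature and [0, 1] per text feature;
    # the loss is the mean squared Euclidean distance between D(f_i) and y_i.
    return ((D(feats) - expected) ** 2).sum(dim=1).mean()
```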
(2.3) The similarity between the first-branch image feature v and the first-branch text feature t is computed from a formula published as an image. A positive-negative example combined loss function (likewise published as an image) then pulls the distance between features and positive examples closer and pushes the distance between features and negative examples farther.
To push positive and negative examples apart as far as possible and make the model generate features that adequately represent their distances, the prior art uses a triplet loss function, enlarging the positive-negative distance through a max function. The invention instead directly computes the difference between the similarities of positive and negative examples and regularizes it with the sum of the positive and negative feature distances in the numerator, which directly enlarges the distance between positive and negative examples; at the same time, the positive-negative distance term in the numerator can weight different positive-negative combinations, since the distances of different positive and negative examples differ. For the basketball image of this embodiment, the positive text example is "a group of players playing basketball on a basketball court" and the negative text example is "a dancer dancing in a dance room".
(2.4) The features F generated by the feature generation model (covering both images and texts) are guided by a similarity-regularization minimization loss function, which better reduces the distance between image and text features with the same semantics; the loss L_min is published as an image.
(2.5) For class distinguishability within a modality, the first-branch image feature v and the first-branch text feature t are converted into class probabilities through a Softmax function p, and a probability normalization loss function L_label is defined (published as an image). The commonly used cross-entropy function is too one-sided; the L_label of the invention considers both the truth value l and the generated features, and using the mean of the two better describes the gap between the two distributions while preventing the generated features from overfitting the truth values, thereby giving a more accurate distribution gap.
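The L_label formula likewise survives only as an image; the sketch below is a hedged guess at the idea the text describes, a cross-entropy-style loss whose target is the mean of the true label and the generated class probabilities. The 0.5 mixing, the detach, and the epsilon are all assumptions.

```python
import torch
import torch.nn.functional as F

def label_loss(logits: torch.Tensor, l: torch.Tensor) -> torch.Tensor:
    # logits: class scores of first-branch features; l: one-hot true labels
    p = F.softmax(logits, dim=-1)
    target = 0.5 * (l + p.detach())  # mean of truth value and generated probs
    return -(target * torch.log(p + 1e-8)).sum(dim=-1).mean()
```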
(2.6) The overall loss function divides into the feature generation loss L_gen and the modality discrimination loss L_adv. The feature generation loss is defined as the sum of the positive-negative combined loss, the similarity-regularization minimization loss and the probability normalization loss:

L_gen = L_trip + L_min + L_label

The overall model loss L to be optimized is the difference between L_gen and L_adv:

L = L_gen - L_adv
(2.7) The parameters of the first-branch feature generation model are updated by ordinary gradient backpropagation; the parameters of the second-branch feature generation model are updated as a k-weighted combination with the first-branch parameters (the update formula is published as an image), where k controls the ratio of the combination, 0.8 in this example.
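A sketch of the second-branch update as the text describes it. The exact formula is published as an image, so the k-weighted, exponential-moving-average-style pairing below is an assumption consistent with "k controls the ratio of addition".

```python
import torch

@torch.no_grad()
def update_second_branch(first_branch: torch.nn.Module,
                         second_branch: torch.nn.Module,
                         k: float = 0.8) -> None:
    # p2 <- k * p2 + (1 - k) * p1 for every parameter pair
    for p1, p2 in zip(first_branch.parameters(), second_branch.parameters()):
        p2.mul_(k).add_((1.0 - k) * p1)
```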
(2.8) During training, the feature generation network and the modality discrimination network are updated in alternating cycles: the loss function L first optimizes the network parameters of the feature generation model; then the loss function L, recomputed from the features output by the updated feature generation model, optimizes the network parameters of the modality discrimination model; and this interleaved iteration is repeated for multiple rounds.
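A minimal sketch of this alternating schedule. The optimizers, loader, and the gen_loss_fn / adv_loss_fn helpers are assumptions standing in for the patent's components.

```python
import torch

def train(gen, disc, loader, gen_loss_fn, adv_loss_fn, epochs: int = 10):
    opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
    for _ in range(epochs):
        for images, texts, labels in loader:
            # 1) optimize the feature generation model with L = L_gen - L_adv
            v, t = gen(images, texts)
            loss_g = gen_loss_fn(v, t, labels) - adv_loss_fn(disc, v, t)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
            # 2) re-generate features, then optimize the discrimination model
            v, t = gen(images, texts)
            loss_d = adv_loss_fn(disc, v.detach(), t.detach())
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
```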
(3) The similarity is computed from the first-branch image feature v and the first-branch text feature t, and the highest similarity gives the image-text retrieval result. The preprocessed test data set is used as input to the trained model; only the first-branch image feature v and the first-branch text feature t take part in the similarity computation, using the same similarity function as above. The image-text pair with the highest similarity score is the matching content. In this embodiment, the similarity between the image and the text "a group of players playing basketball on a basketball court" is 0.91, while the similarity between the image and the text "a dancer dancing in a dance room" is 0.26, so the image matches "a group of players playing basketball on a basketball court" better.
(4) The user inputs a specific image or text, and the text or image with the highest similarity is retrieved following the prediction procedure above. If the user inputs a picture of playing basketball, then according to the computed similarities the retrieval result is the text "a group of players playing basketball on a basketball court"; likewise, if the user inputs that text, an image of playing basketball is obtained.
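A sketch of the retrieval step: only the first-branch features are used, and the candidate with the highest similarity is returned. The negative squared Euclidean distance stands in for the patent's similarity formula, which is published only as an image.

```python
import torch

def retrieve(query_feat: torch.Tensor, candidate_feats: torch.Tensor) -> int:
    # higher similarity = smaller squared Euclidean distance
    sims = -((candidate_feats - query_feat) ** 2).sum(dim=1)
    return int(sims.argmax())  # index of the best-matching candidate
```

For the embodiment above, a basketball image feature queried against the two candidate text features would return the index of "a group of players playing basketball on a basketball court".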
The proposed method was verified experimentally. The test data set used is the Pascal Sentence dataset, one of the commonly used cross-modal retrieval data sets. The evaluation index is the mean average precision (mAP), i.e. the mean of the average precision (AP) over all test samples. For the first K retrieval results of a test sample, the accuracy AP@K is given by a formula published as an image, in which N denotes the total number of correct entries among the first K retrieval results and n_v is 1 if the v-th result is correct and 0 otherwise.
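The AP@K formula itself is published as an image; the sketch below follows the standard AP@K definition, which matches the variables described above (N correct results among the first K, n_v = 1 when the v-th result is correct), so the exact correspondence is an assumption.

```python
def ap_at_k(relevance: list[int], k: int) -> float:
    # relevance: n_v flags (1 = correct) for the ranked retrieval results
    rel = relevance[:k]
    n_correct = sum(rel)  # N in the description above
    if n_correct == 0:
        return 0.0
    score, hits = 0.0, 0
    for v, n_v in enumerate(rel, start=1):
        if n_v:
            hits += 1
            score += hits / v  # precision at rank v, counted at correct ranks
    return score / n_correct

def mean_ap(all_relevance: list[list[int]], k: int) -> float:
    # mAP = mean of AP over all test samples
    return sum(ap_at_k(r, k) for r in all_relevance) / len(all_relevance)
```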
The comparison methods are CCA, LCFS, Corr-AE, DCCA, Deep-SM, MHTN and ACMR, all common cross-modal retrieval methods. The results of all comparison methods and of the proposed method under the mAP@50 evaluation index are shown in fig. 4; the proposed method is clearly superior, with evaluation results more than 15% higher than those of every comparison method, which fully demonstrates its high retrieval accuracy.

Claims (10)

1. An image-text retrieval method based on dual-branch balanced mutual learning, characterized in that a user inputs a specific image or text into an image-text retrieval model and retrieves the text or image with the highest similarity, wherein the image-text retrieval model is trained as follows:
(1) preprocessing the image and text data sets;
(2) generating image features and text features from the preprocessed data set through a feature generation model, and generating the initial parameters of the feature generation model; the feature generation model comprises a first-branch feature generation model and a second-branch feature generation model that guide each other's learning; the image features comprise a first-branch image feature v and a second-branch image feature v_s, and the text features comprise a first-branch text feature t and a second-branch text feature t_s;
(3) inputting the image features and text features into a modality discrimination model to generate the initial parameters of the modality discrimination model;
(4) alternately updating the parameters of the feature generation model and the modality discrimination model; a positive-negative example combined loss function L_trip pulls the distance between features and positive examples closer and pushes the distance between features and negative examples farther; L_trip combines an image term L_trip,v and a text term L_trip,t (the formulas are published as images and are not reproduced here), where t_i is the i-th first-branch text feature; t_j+ and t_k- are the second-branch text features of the j-th positive and k-th negative text example of an image; v_i is the i-th first-branch image feature; v_j+ and v_k- are the second-branch image features of the j-th positive and k-th negative image example of a text; two coefficients weight the positive-example losses of the image and of the text, and two further coefficients regulate the magnitude of the overall loss; the similarity is computed from a formula (likewise published as an image) built on the squared Euclidean distance function;
(5) computing the similarity from the text and image features generated by the first-branch feature generation model; the highest similarity gives the image-text retrieval result.
2. The image-text retrieval method based on dual-branch balanced mutual learning of claim 1, wherein in step (4) a similarity-regularization minimization loss function L_min guides the generation of the first-branch image feature and the first-branch text feature; L_min (published as an image) is the sum of two components, the similarity-regularization minimization losses of images and of texts, respectively.
3. The image-text retrieval method based on dual-branch balanced mutual learning of claim 1, wherein in step (2) the second-branch feature generation model comprises a second-branch image model and a second-branch text model; the parameters of the second-branch image model are updated as a weighted combination with the parameters of the first-branch image model, and the parameters of the second-branch text model as a weighted combination with those of the first-branch text model (the update formulas are published as images), where k controls the ratio of the combination.
4. The image-text retrieval method based on dual-branch balanced mutual learning of claim 1, wherein the loss function of the modality discrimination model (published as an image) is built on the gap between D(f_i), the actual output of the modality discrimination model when the input feature is f_i, and y_i, the expected output of the modality discrimination model, with n denoting the number of features.
5. The image-text retrieval method based on dual-branch balanced mutual learning of claim 1, wherein in step (1) the preprocessing of the image data set comprises image resizing, flipping, scaling, cropping, adjustment of brightness, color temperature and saturation, and conversion of pixel values into the range [0,1].
6. The image-text retrieval method based on dual-branch balanced mutual learning of claim 1, wherein in step (1) the text data set is preprocessed by vectorization: the words appearing in the texts are collected into a sequence, and if a core word of the sequence appears in a given text, the corresponding element of that text's vector is 1, and otherwise 0.
7. The image-text retrieval method based on dual-branch balanced mutual learning of claim 1, wherein in step (4) the first-branch image features and first-branch text features are converted into class probabilities p through a Softmax function and guided by the true labels l, so as to distinguish the different features within the image and text modalities; the probability normalization loss function is published as an image and is not reproduced here.
8. the utility model provides a picture and text retrieval system based on mutual study of two branch weighing-appliances which characterized in that includes:
the preprocessing module is used for preprocessing the image and text data sets;
the model training module, for alternately updating the parameters of a feature generation model and a modality discrimination model, the feature generation model comprising a first-branch feature generation model and a second-branch feature generation model that guide each other's learning, and the modality discrimination model distinguishing whether an input feature belongs to an image or a text; the losses used in model training include a positive-negative example combined loss function L_trip, which combines an image term L_trip,v and a text term L_trip,t (the formulas are published as images), where t_i is the i-th first-branch text feature; t_j+ and t_k- are the second-branch text features of the j-th positive and k-th negative text example of an image; v_i is the i-th first-branch image feature; v_j+ and v_k- are the second-branch image features of the j-th positive and k-th negative image example of a text; two coefficients weight the positive-example losses of the image and of the text, and two further coefficients regulate the magnitude of the overall loss; the similarity is computed from a formula (likewise published as an image) built on the squared Euclidean distance function.
9. The image-text retrieval system based on dual-branch balanced mutual learning of claim 8, wherein a similarity-regularization minimization loss function L_min guides the generation of the first-branch image features and the first-branch text features; L_min (published as an image) is the sum of two components, the similarity-regularization minimization losses of images and of texts, where t_j+ and t_k- denote the second-branch text features of the j-th positive and k-th negative text example of an image, v_i the i-th first-branch image feature, v_j+ and v_k- the second-branch image features of the j-th positive and k-th negative image example of a text, and t_i the i-th first-branch text feature.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the image-text retrieval method based on dual-branch balanced mutual learning of any one of claims 1 to 7.
CN202211002415.3A 2022-08-22 2022-08-22 Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning Active CN115080769B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211002415.3A CN115080769B (en) 2022-08-22 2022-08-22 Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211002415.3A CN115080769B (en) 2022-08-22 2022-08-22 Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning

Publications (2)

Publication Number Publication Date
CN115080769A (en) 2022-09-20
CN115080769B CN115080769B (en) 2022-12-02

Family

ID=83244044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211002415.3A Active CN115080769B (en) 2022-08-22 2022-08-22 Image-text retrieval method, system and storage medium based on dual-branch balanced mutual learning

Country Status (1)

Country Link
CN (1) CN115080769B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115712740A (en) * 2023-01-10 2023-02-24 苏州大学 Method and system for multi-modal implication enhanced image text retrieval

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299341A (en) * 2018-10-29 2019-02-01 山东师范大学 Adversarial cross-modal retrieval method and system based on dictionary learning
CN113010700A (en) * 2021-03-01 2021-06-22 电子科技大学 Image text cross-modal retrieval method based on category information alignment
CN114298158A (en) * 2021-12-06 2022-04-08 湖南工业大学 Multi-mode pre-training method based on image-text linear combination

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299341A (en) * 2018-10-29 2019-02-01 山东师范大学 Adversarial cross-modal retrieval method and system based on dictionary learning
CN113010700A (en) * 2021-03-01 2021-06-22 电子科技大学 Image text cross-modal retrieval method based on category information alignment
CN114298158A (en) * 2021-12-06 2022-04-08 湖南工业大学 Multi-mode pre-training method based on image-text linear combination

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115712740A (en) * 2023-01-10 2023-02-24 苏州大学 Method and system for multi-modal implication enhanced image text retrieval

Also Published As

Publication number Publication date
CN115080769B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
Wang et al. M3: Multimodal memory modelling for video captioning
CN110147457B (en) Image-text matching method, device, storage medium and equipment
Huang et al. Bi-directional spatial-semantic attention networks for image-text matching
CN112328767B (en) Question-answer matching method based on BERT model and comparative aggregation framework
CN110147548B (en) Emotion identification method based on bidirectional gating circulation unit network and novel network initialization
CN111444968A (en) Image description generation method based on attention fusion
Zhou et al. Ladder loss for coherent visual-semantic embedding
CN113204952B (en) Multi-intention and semantic slot joint identification method based on cluster pre-analysis
CN114936623B (en) Aspect-level emotion analysis method integrating multi-mode data
CN110347857B (en) Semantic annotation method of remote sensing image based on reinforcement learning
CN113128369A (en) Lightweight network facial expression recognition method fusing balance loss
CN116662582B (en) Specific domain business knowledge retrieval method and retrieval device based on natural language
CN113297369B (en) Intelligent question-answering system based on knowledge graph subgraph retrieval
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN112487822A (en) Cross-modal retrieval method based on deep learning
CN115080769B (en) Image-text retrieval method, system and storage medium based on double-branch system balance mutual learning
CN111651661B (en) Image-text cross-media retrieval method
CN106021402A (en) Multi-modal multi-class Boosting frame construction method and device for cross-modal retrieval
CN116610831A (en) Semanteme subdivision and modal alignment reasoning learning cross-modal retrieval method and retrieval system
CN110704665A (en) Image feature expression method and system based on visual attention mechanism
CN116311323A (en) Pre-training document model alignment optimization method based on contrast learning
CN113837229B (en) Knowledge-driven text-to-image generation method
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
Huang et al. An effective multimodal representation and fusion method for multimodal intent recognition
Cheng et al. Deep attentional fine-grained similarity network with adversarial learning for cross-modal retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant