CN115934992A - Text and image retrieval method, device and computer readable storage medium - Google Patents

Text and image retrieval method, device and computer readable storage medium

Info

Publication number
CN115934992A
Authority
CN
China
Prior art keywords
image
text
module
affine transformation
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211550479.7A
Other languages
Chinese (zh)
Inventor
叶桔
禹世杰
吴伟华
范艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN HARZONE TECHNOLOGY CO LTD
Original Assignee
SHENZHEN HARZONE TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN HARZONE TECHNOLOGY CO LTD filed Critical SHENZHEN HARZONE TECHNOLOGY CO LTD
Priority to CN202211550479.7A
Publication of CN115934992A

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a text and image retrieval method, a text and image retrieval device and a computer readable storage medium, applied to electronic equipment in which a machine learning model is configured, the machine learning model comprising an image encoder, a text encoder, at least one preset cyclic affine transformation module and a post-processing module. The method comprises: acquiring a target vehicle image; inputting the target vehicle image into the image encoder to obtain a first image feature; inputting the text description of the target vehicle image into the text encoder to obtain a first text feature; inputting the first image feature and the first text feature into the at least one preset cyclic affine transformation module to obtain a second image feature and a second text feature; and inputting the second image feature into the post-processing module for post-processing to obtain a processing result, and determining a target retrieval result according to the processing result and the second text feature. By adopting the method and the device, retrieval consistency between text and images can be accurately achieved in vehicle retrieval.

Description

Text and image retrieval method, device and computer readable storage medium
Technical Field
The present application relates to the technical field of video surveillance or computer technology, and in particular to a text and image retrieval method and apparatus, and a computer readable storage medium.
Background
In the prior art, text-based vehicle retrieval aims to identify images of a target vehicle from a large vehicle image database given a natural language description. Since text descriptions are more accessible than other types of queries in most real application scenarios, text-based vehicle retrieval is of great significance in the field of video surveillance and has received increasing attention. However, most existing methods focus on image-based vehicle retrieval, and since research on text-based vehicle retrieval is still in its infancy, the problem of how to accurately achieve consistency between text and image retrieval in vehicle retrieval urgently needs to be solved.
Disclosure of Invention
The embodiment of the application provides a text and image retrieval method and device and a computer readable storage medium, which can accurately realize the retrieval consistency of texts and images in vehicle retrieval.
In a first aspect, an embodiment of the present application provides a text and image retrieval method applied to an electronic device, where a machine learning model is configured in the electronic device, and the machine learning model includes an image encoder, a text encoder, at least one preset cyclic affine transformation module and a post-processing module. The method includes the following steps:
acquiring a target vehicle image and a text description of the target vehicle image;
inputting the target vehicle image into the image encoder to obtain a first image characteristic;
inputting the text description into the text encoder to obtain a first text feature;
inputting the first image feature and the first text feature into the at least one preset cyclic affine transformation module to obtain a second image feature and a second text feature;
inputting the second image characteristics into the post-processing module for post-processing to obtain a processing result, and determining a target retrieval result according to the processing result and the second text characteristics.
In a second aspect, an embodiment of the present application provides a text and image retrieval apparatus applied to an electronic device, where a machine learning model is configured in the electronic device, and the machine learning model includes an image encoder, a text encoder, at least one preset cyclic affine transformation module and a post-processing module. The apparatus includes an acquisition unit, an extraction unit, a transformation unit and a processing unit, wherein,
the acquisition unit is used for acquiring a target vehicle image and a text description of the target vehicle image;
the extraction unit is used for inputting the target vehicle image into the image encoder to obtain a first image characteristic; inputting the text description into the text encoder to obtain a first text characteristic;
the transformation unit is used for inputting the first image characteristic and the first text characteristic into the at least one preset cyclic affine transformation module to obtain a second image characteristic and a second text characteristic;
and the processing unit is used for inputting the second image characteristics into the post-processing module for post-processing to obtain a processing result, and determining a target retrieval result according to the processing result and the second text characteristics.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, a communication interface, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the program includes instructions for executing the steps in the first aspect of the embodiment of the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program for electronic data exchange, where the computer program enables a computer to perform some or all of the steps described in the first aspect of the embodiment of the present application.
In a fifth aspect, embodiments of the present application provide a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to cause a computer to perform some or all of the steps as described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
The embodiment of the application has the following beneficial effects:
It can be seen that the text and image retrieval method, apparatus, and computer-readable storage medium described in the embodiments of the present application are applied to an electronic device in which a machine learning model is configured, the machine learning model including an image encoder, a text encoder, at least one preset cyclic affine transformation module and a post-processing module. A target vehicle image and a text description of the target vehicle image are acquired; the target vehicle image is input into the image encoder to obtain a first image feature, and the text description is input into the text encoder to obtain a first text feature; the first image feature and the first text feature are input into the at least one preset cyclic affine transformation module to obtain a second image feature and a second text feature; and the second image feature is input into the post-processing module for post-processing to obtain a processing result, with a target retrieval result determined according to the processing result and the second text feature. On the one hand, the vehicle image and its text description are respectively input into the image encoder and the text encoder to obtain the corresponding image features and text features; on the other hand, multi-angle deep feature fusion is performed on the text features and image features in the at least one preset cyclic affine transformation module to obtain the fused text and image features, and the post-processed image features are then matched against the text features. Thereby, the association between image features and text features is established in depth, and retrieval consistency between text and images can be accurately achieved in vehicle retrieval.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1A is a schematic flowchart of a text and image retrieval method according to an embodiment of the present disclosure;
FIG. 1B is a schematic structural diagram of a machine learning model provided by an embodiment of the present application;
FIG. 1C is a schematic structural diagram of another machine learning model provided in the embodiments of the present application;
FIG. 1D is a diagram illustrating a DSE module in a machine learning model according to an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart diagram illustrating another text and image retrieval method according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 4 is a block diagram illustrating functional units of a text and image retrieval apparatus according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein may be combined with other embodiments.
The electronic device described in the embodiments of the present application may include a smartphone (e.g., an Android phone, an iOS phone, a Windows phone, etc.), a tablet computer, a palmtop computer, a vehicle data recorder, a server, a notebook computer, a Mobile Internet Device (MID), or a wearable device (e.g., a smart watch or a Bluetooth headset). These are merely examples, not an exhaustive list; electronic devices include but are not limited to the above.
The following describes embodiments of the present application in detail.
Referring to fig. 1A, fig. 1A is a schematic flowchart of a text and image retrieval method according to an embodiment of the present disclosure. As shown in the figure, the method is applied to an electronic device in which a machine learning model is configured, and the machine learning model includes an image encoder, a text encoder, at least one preset cyclic affine transformation module and a post-processing module. The text and image retrieval method includes the following steps:
101. a target vehicle image and a textual description of the target vehicle image are obtained.
The target vehicle image can be acquired through a camera, and the target vehicle image can be recognized to obtain the corresponding text description; alternatively, the target vehicle image can be input into a preset neural network model to obtain the text description of the target vehicle image. The preset neural network model may include at least one of: a convolutional neural network model, a fully-connected neural network model, a recurrent neural network model, and the like, which are not limited herein. The text description describes the target vehicle image using character strings, characters, arrays, and the like.
In the embodiment of the present application, as shown in fig. 1B, the machine learning model may include an image encoder, a text encoder, at least one preset cyclic affine transformation module and a post-processing module, any one of which may be preset or defaulted by the system. Each preset cyclic affine transformation module may specify a cycle number or a cycle condition, and proceeds to the next stage once the cycle number or cycle condition is met.
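To make this composition concrete, the following minimal PyTorch sketch shows one way such a model could be wired together. It is illustrative only: the class names, feature dimensions, stand-in encoders, fusion arithmetic inside the block, and the fixed cycle count are assumptions of the sketch, not the reference implementation of the embodiment.

```python
import torch
import torch.nn as nn

class CyclicAffineBlock(nn.Module):
    """One preset cyclic affine transformation module (internals simplified)."""
    def __init__(self, dim: int, num_cycles: int = 2):
        super().__init__()
        self.num_cycles = num_cycles  # a cycle condition could replace the fixed count
        self.img_proj = nn.Linear(dim, dim)
        self.txt_proj = nn.Linear(dim, dim)

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        for _ in range(self.num_cycles):  # proceed to the next stage once the cycle count is met
            img = torch.tanh(self.img_proj(img) + txt)  # text guides the image feature
            txt = torch.tanh(self.txt_proj(txt) + img)  # image state re-edits the text feature
        return img, txt

class TextImageRetrievalModel(nn.Module):
    def __init__(self, dim: int = 256, num_blocks: int = 3):
        super().__init__()
        self.image_encoder = nn.LazyLinear(dim)  # stand-in for a real image encoder
        self.text_encoder = nn.LazyLinear(dim)   # stand-in for a real (e.g. BERT-based) text encoder
        self.blocks = nn.ModuleList(CyclicAffineBlock(dim) for _ in range(num_blocks))
        self.post = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # stand-in post-processing module

    def forward(self, image_vec: torch.Tensor, text_vec: torch.Tensor):
        img = self.image_encoder(image_vec)  # first image feature
        txt = self.text_encoder(text_vec)    # first text feature
        for blk in self.blocks:              # at least one preset cyclic affine transformation module
            img, txt = blk(img, txt)         # yields the second image / second text feature
        return self.post(img), txt           # processing result and second text feature
```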
In the embodiment of the application, the target vehicle can be shot by the camera when the target vehicle is detected, so that the image of the target vehicle is obtained.
102. And inputting the target vehicle image into the image encoder to obtain a first image characteristic.
In this embodiment of the present application, the target vehicle image may be input into the image encoder to obtain the first image feature; alternatively, the target vehicle image may be preprocessed and the preprocessed image then input into the image encoder to obtain the first image feature. The preprocessing may include at least one of: image noise reduction, image enhancement, object extraction, and the like, which are not limited herein. The first image feature may include at least one of: feature points, feature lines, feature vectors, feature values, and the like, which are not limited herein.
103. And inputting the text description into the text encoder to obtain a first text characteristic.
In this embodiment of the application, the text description of the target vehicle image may be input into the text encoder to obtain the first text feature; alternatively, the target vehicle image may be preprocessed, the text description of the preprocessed image extracted, and that text description input into the text encoder to obtain the first text feature. The preprocessing may include at least one of: image noise reduction, image enhancement, object extraction, character recognition, and the like, which are not limited herein. The first text feature may include at least one of: character strings, sentences, feature vectors, feature values, and the like, which are not limited herein.
104. And inputting the first image characteristic and the first text characteristic into the at least one preset cyclic affine transformation module to obtain a second image characteristic and a second text characteristic.
In this embodiment of the application, the first image feature and the first text feature may be input into the at least one preset cyclic affine transformation module to obtain the second image feature and the second text feature.
105. inputting the second image characteristics into the post-processing module for post-processing to obtain a processing result, and determining a target retrieval result according to the processing result and the second text characteristics.
In the embodiment of the application, the second image feature can be input into the post-processing module for post-processing to obtain the processing result, namely the processed image feature, and retrieval consistency is realized by matching the processed image feature against the second text feature.
Optionally, the post-processing module includes: a second convolution module, an activation function module, and a Gaussian interpolation module. In step 105, inputting the second image feature into the post-processing module for post-processing to obtain a processing result, and determining a target retrieval result according to the processing result and the second text feature, may include the following steps:
51. inputting the second image characteristics to the second convolution module, the activation function module and the Gaussian interpolation module in sequence for processing to obtain third image characteristics;
52. and determining the target retrieval result according to the second text characteristic and the third image characteristic.
In this embodiment, the post-processing module may include a second convolution module, an activation function module and a Gaussian interpolation module. The second image feature can be sequentially input into the second convolution module, the activation function module and the Gaussian interpolation module for processing to obtain a third image feature; the third image feature is the processing result, and the target retrieval result is determined according to the second text feature and the third image feature.
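A sketch of this post-processing chain and the final matching step follows. Since the embodiment does not spell out the exact form of the Gaussian interpolation, it is approximated here by a fixed Gaussian smoothing kernel followed by bilinear upsampling; the kernel size, ReLU activation, and cosine-similarity ranking are all assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PostProcessor(nn.Module):
    """Second convolution module -> activation function module -> Gaussian interpolation module."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)  # second convolution
        self.act = nn.ReLU()                                                 # activation function
        gauss = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0
        # depthwise Gaussian kernel: our stand-in for the Gaussian interpolation module
        self.register_buffer("gauss", gauss.expand(channels, 1, 3, 3).contiguous())
        self.channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.conv(x))
        x = F.conv2d(x, self.gauss, padding=1, groups=self.channels)  # Gaussian smoothing
        return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

def retrieve(text_feat: torch.Tensor, gallery_feats: torch.Tensor, top_k: int = 5):
    """Rank gallery image features against one text feature; top scores form the retrieval result."""
    sims = F.cosine_similarity(text_feat.unsqueeze(0), gallery_feats, dim=-1)
    return sims.topk(min(top_k, gallery_feats.shape[0]))
```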
In this embodiment of the application, the preset cyclic affine transformation module includes: the system comprises a first affine transformation module, a first convolution module, a second affine transformation module, a down-sampling module and a dynamic semantic combination module;
the first affine transformation module is connected with the first convolution module, the first convolution module is connected with the second affine transformation module, the second affine transformation module is connected with the down-sampling module, and the first affine transformation module is used for performing affine transformation on the first image feature according to the first text feature so as to add the spatial attention of the text feature into the image feature; the down-sampling module is used for outputting image characteristics;
the second affine transformation module is used for carrying out affine transformation on the first image features according to the text features output by the dynamic semantic combination module so as to add the spatial attention of the text features into the image features;
the first convolution module is connected with the dynamic semantic combination module, the dynamic semantic combination module edits the first text characteristic based on the image characteristic of the first convolution module, and the dynamic semantic combination module is further used for outputting the text characteristic.
In the embodiment of the application, in order to realize the fusion of image perception and text semantic information, a cross-modal affine transformation method is provided. Fusion blocks at all stages are connected with the image and text encoders through cascaded cyclic affine transformation modules, and a spatial attention module is added into the image encoder, improving semantic consistency between the text and the original image and supervising the image encoder to extract more image content that conforms to the text description.
In this embodiment, as shown in fig. 1C, the preset cyclic affine transformation module may include a first affine transformation module, a first convolution module, a second affine transformation module, a downsampling module and a dynamic semantic combination module. Here TF represents a text feature, which may be specifically expressed as W1, W2, …, Wn, and DSE represents a dynamic semantic combination module. The DSE first divides the word features into subspaces of multiple granularities to construct a complete semantic space, and then configures a dynamic subspace router to generate a stage-aware path, bringing more accurate and diversified semantic recombination results. Generally, given a list of subspace numbers and their corresponding word-segmentation features, an attention mechanism is used to compute the recombination of the semantics represented by the different subspaces.
Here W1, W2, …, Wn represent the word vector subspaces of the previous stage, and I1, I2, I3, …, In represent the weight values obtained from the image encoder; these weight values are applied to the word vector subspaces to obtain a new word vector space combination better adapted to the image.
In the embodiment of the application, the dynamic semantic combination module enables the text features at each stage to be adaptively recombined according to the states of the historical stages (namely, the texts and images of those stages), dynamically selecting the words to be recombined at each stage so as to provide diverse and accurate semantic guidance. Since the image encoder is a local-to-global process, the text should evolve synchronously during this process, providing semantic guidance from coarse granularity to fine granularity (e.g., from "vehicle" to "car" to "SUV") to better guide the extraction of image features at each stage. By dynamically combining text features at different stages, semantic information already used can be suppressed during extraction and new, consistent semantic information activated, preventing the same semantics from being generated repeatedly and alleviating the repeated rendering problem.
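The following sketch illustrates one plausible reading of the DSE mechanism: word features are split into subspaces W1…Wn, image-derived weights I1…In are computed by attention, and the weighted subspaces are recombined. The subspace count, the linear query projection, and the use of tanh (consistent with the activation discussion later in this description) are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class DynamicSemanticCombination(nn.Module):
    """DSE sketch: recombine word-feature subspaces using image-derived attention weights."""
    def __init__(self, dim: int, num_subspaces: int = 4):
        super().__init__()
        assert dim % num_subspaces == 0, "feature dim must split evenly into subspaces"
        self.k = num_subspaces
        self.query = nn.Linear(dim, dim)  # maps the image state to per-subspace queries

    def forward(self, words: torch.Tensor, img_state: torch.Tensor) -> torch.Tensor:
        # words: (n_words, dim) word features; img_state: (dim,) current image-encoder state
        n, d = words.shape
        sub = words.view(n, self.k, d // self.k)               # word subspaces W1..Wn
        q = self.query(img_state).view(self.k, d // self.k)    # image weights I1..In
        attn = torch.tanh(torch.einsum("nkd,kd->nk", sub, q))  # relevance of each word subspace
        return (sub * attn.unsqueeze(-1)).reshape(n, d)        # recombined word vector space
```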
In the embodiment of the present application, as shown in fig. 1C, the first affine transformation module is connected to the first convolution module, the first convolution module is connected to the second affine transformation module, the second affine transformation module is connected to the down-sampling module, and the first affine transformation module is configured to perform affine transformation on the first image feature according to the first text feature so as to add spatial attention of the text feature to the image feature; the down-sampling module is used for outputting image characteristics. And the second affine transformation module is used for carrying out affine transformation on the first image features according to the text features output by the dynamic semantic combination module so as to add the spatial attention of the text features into the image features. The first convolution module is connected with the dynamic semantic combination module, the dynamic semantic combination module edits the first text characteristic based on the image characteristic of the first convolution module, and the dynamic semantic combination module is also used for outputting the text characteristic and allowing the text information to be globally distributed in the image coding process, so that the fusion of image perception and text semantic information is realized.
In the embodiment of the application, the training data may be a dataset composed of text-image pairs, each pair containing a vehicle image captured by a designated monitoring camera; these are input into the text encoder and the image encoder for training to obtain a text-image model. The network structure is shown in fig. 1C, where the text encoder may be implemented with BERT, ATT denotes a spatial attention module, and (I1, I2, I3, …, In) in the DSE module denote the weight values of words corresponding to the image regions given by ATT. The image-region emphases extracted by different feature layers differ, and each dynamic semantic combination (DSE) module may recombine word features according to historical conditions.
In a concrete implementation, one can start from a text description of the desired image and an initial image (a random embedding, a scene description in splines or pixels, or any distinguishable created image), then run the cyclic affine transformation and add multiple cascaded DSE modules to obtain more accurate word information and improve stability. The image-region embedding obtained by each DSE module is retained, and the final image is formed by weighted hierarchical image integration, so that each cyclic affine transformation module only builds on the feature information of the previous stage and adds the corresponding details, rather than regenerating the complete image.
Optionally, the first affine transformation module is configured to perform affine transformation on the first image feature according to the first text feature so as to add the spatial attention of the text feature to the image feature, which includes the following steps:
s1, determining an aggregation weight according to the first text characteristic;
s2, embedding the first image features into space attention to obtain image feature embedding subjected to space attention;
s3, determining text information added with image weight according to the image feature embedding subjected to the spatial attention and the aggregation weight;
s4, performing cross-modal processing on the text information added with the image weight by adopting a specified activation function to obtain cross-modal characteristics;
and S5, carrying out affine transformation on the cross-modal characteristics.
In the embodiment of the present application, the specified activation function may be preset or default to the system, where the specified activation function may include a tanh activation function.
In the embodiment of the application, an aggregation weight may be determined according to a first text feature, then spatial attention embedding is performed on the first image feature, image feature embedding which is subjected to spatial attention is obtained, text information which is added into the image weight is determined according to the image feature embedding which is subjected to spatial attention and the aggregation weight, cross-modal processing is performed on the text information which is added into the image weight by using a specified activation function, cross-modal features are obtained, and affine transformation is performed on the cross-modal features.
In a specific implementation, learnable projections of the image features can be used to apply spatial attention to the image, obtaining a small number N of image vectors from which the aggregation weight is derived. The aggregation combines the spatially-attended image embedding with the current-stage text features, and can be written schematically as:

W′_{i-1} = Agg(Î, W_{i-1})

where Î denotes the image feature embedding after spatial attention, W_{i-1} denotes the text feature information of the current stage, and W′_{i-1} denotes the text information to which the image weight has been added.
The recombined W′_{i-1} then undergoes cross-modal processing to obtain W″_{i-1}, which is applied to the image features of the next stage. The cross-modal processing is:

W″_{i-1} = tanh(W′_{i-1})
in the embodiment of the application, the tanh activation function is used instead of the softmax function because softmax maximizes the probability, suppresses other probabilities to approach 0, and the extremely small probability hinders the backward propagation of the gradient, thereby aggravating the instability of the training of the image encoder. In contrast, the tanh function prevents the attention probability from approaching 0, increasing the efficiency of back propagation, forcing the generator to synthesize more relevant information.
Optionally, before step 101, the following steps may be further included:
training the machine learning model by adopting a preset loss function to obtain the machine learning model meeting preset requirements;
the preset loss function consists of a local loss function, a global loss function and a cyclic affine transformation loss function;
the local loss function is obtained based on a local similarity principle, specifically based on the average matching score between a sentence in the text and the most relevant objects in the image;
the global loss function is obtained based on global similarity, specifically based on the matching degree between the text vector and the image vector;
the cyclic affine transformation loss function is obtained based on calculating the matching degree between the image and the text description.
In this embodiment of the present application, the preset loss function may be preset or default to a system, and the preset loss function may be a loss function of at least one of an image encoder, a text encoder, at least one preset cyclic affine transformation module, and a post-processing module.
In specific implementation, a preset LOSS function (LOSS) may be adopted to train the machine learning model, so as to obtain the machine learning model meeting preset requirements, where the preset requirements may be preset or default to a system, and the preset requirements may include at least one of the following: the preset training times are reached, the preset convergence condition is met, and the like, which are not limited herein, wherein the preset training times and the preset convergence condition may be preset or default by the system.
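As an illustrative sketch only, training until a preset training count or a preset convergence condition is met could look like the following; the stopping threshold, loop structure, and loss interface are assumptions, not part of the embodiment.

```python
def train(model, loader, loss_fn, optimizer, max_epochs: int = 50, tol: float = 1e-4):
    """Stop when the preset training count is reached or the loss change falls below tol."""
    prev = float("inf")
    for epoch in range(max_epochs):          # preset training times
        total = 0.0
        for batch in loader:
            optimizer.zero_grad()
            loss = loss_fn(model, batch)     # e.g. the composite loss described below
            loss.backward()
            optimizer.step()
            total += loss.item()
        if abs(prev - total) < tol:          # preset convergence condition
            break
        prev = total
    return model
```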
The preset loss function can be composed of a local loss function, a global loss function and a cyclic affine transformation loss function. The local loss function may be obtained based on a local similarity principle, specifically based on the average of the matching scores between a sentence in the text and the most relevant objects in the image; the global loss function may be obtained based on global similarity, specifically based on the matching degree between the text vector and the image vector; and the cyclic affine transformation loss function may be obtained based on calculating the degree of matching between the image and the text description.
In the embodiment of the application, during encoding, image features can be extracted through the multiple cascaded cyclic affine transformations according to the text information; by calculating a similarity loss function between the text description and the vehicle image and back-propagating gradient updates to the image encoder, the image encoder is forced to extract image features that conform to the text description.
The preset loss function in the embodiment of the present application may be composed of a local loss function, a global loss function and a cyclic affine transformation loss function; external unstructured parameters may be introduced to generate the image-text, and alignment between the image and the text is produced by contrastive learning, specifically using contrastive losses from image to text, from image region to word, and from text to image. First, the local similarity and the global similarity are obtained. The local similarity is the average of the matching scores between the words of a sentence and the most relevant objects in the image:

S_local(w^T, I) = (1 / N_w) · Σ_{i=1..N_w} max_j s(w_i, o_j)

where N_w denotes the number of words in the text, w^T denotes the text, I denotes the image, and s(w_i, o_j) denotes the matching score between word w_i and image object o_j.
Further, the global similarity measures the matching degree between the text vector w_v and the image vector I_v using the conventional cosine distance:

S_global(w^T, I) = (w_v · I_v) / (‖w_v‖ ‖I_v‖)
Then, a cyclic affine transformation loss function can be used to calculate the matching degree between the image and the text description: namely, the conditional augmentation loss between the standard Gaussian distribution and the Gaussian branch of the training text, defined as the Kullback-Leibler divergence:

L_CA = D_KL( N(μ(s), Σ(s)) ‖ N(0, I) )

where s denotes a matching sentence, ŝ denotes a non-matching sentence, x denotes the real image corresponding to s, and x̂ denotes a generated pseudo-image for s.
The overall loss function (LOSS) is composed of these three terms:

L = L_local + L_global + L_CA
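Under the definitions above, the three loss terms could be sketched as follows. The generic cosine matching score inside the local loss, the sign conventions, the diagonal-Gaussian parameterization, and the unit weighting of the three terms are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def local_loss(word_feats: torch.Tensor, obj_feats: torch.Tensor) -> torch.Tensor:
    """Average over the N_w words of each word's score with its most relevant image object."""
    sims = F.cosine_similarity(word_feats.unsqueeze(1), obj_feats.unsqueeze(0), dim=-1)  # (N_w, N_obj)
    return -sims.max(dim=1).values.mean()  # negated: higher similarity -> lower loss

def global_loss(txt_vec: torch.Tensor, img_vec: torch.Tensor) -> torch.Tensor:
    """Cosine matching degree between the text vector w_v and the image vector I_v."""
    return 1.0 - F.cosine_similarity(txt_vec, img_vec, dim=0)

def ca_loss(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """L_CA: KL divergence between N(mu, diag(exp(logvar))) and the standard Gaussian N(0, I)."""
    return 0.5 * torch.sum(torch.exp(logvar) + mu.pow(2) - 1.0 - logvar)

def total_loss(word_feats, obj_feats, txt_vec, img_vec, mu, logvar) -> torch.Tensor:
    # unit weights assumed for the composition of the three terms
    return (local_loss(word_feats, obj_feats)
            + global_loss(txt_vec, img_vec)
            + ca_loss(mu, logvar))
```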
in the embodiment of the application, in order to realize better word-image region alignment, external unstructured knowledge reconstruction is respectively introduced into a text, enhancement of similar meaning words is carried out on part of words, part of object examples in the image are subjected to explicit modeling to smoothly connect the text and the image, mutual information between corresponding words is maximized through contrast learning, alignment between the generated image and the text is enhanced through contrast loss from the image to sentences and from the image region to the words in the learning process, and then robust alignment between the image and the text can be enhanced.
It can be seen that the text and image retrieval method described in the embodiments of the present application is applied to an electronic device in which a machine learning model is configured, the machine learning model including an image encoder, a text encoder, at least one preset cyclic affine transformation module and a post-processing module. A target vehicle image and a text description of the target vehicle image are acquired; the target vehicle image is input into the image encoder to obtain a first image feature, and the text description is input into the text encoder to obtain a first text feature; the first image feature and the first text feature are input into the at least one preset cyclic affine transformation module to obtain a second image feature and a second text feature; and the second image feature is input into the post-processing module for post-processing to obtain a processing result, with a target retrieval result determined according to the processing result and the second text feature. On the one hand, the vehicle image and its text description are respectively input into the image encoder and the text encoder to obtain the corresponding image and text features; on the other hand, multi-angle deep feature fusion is performed on the text features and image features in the at least one preset cyclic affine transformation module to obtain the fused text and image features, and the post-processed image features are then matched against the text features. Thereby, the association between image features and text features is established in depth, and retrieval consistency between text and images can be accurately achieved in vehicle retrieval.
Referring to fig. 2, in accordance with the embodiment shown in fig. 1A, fig. 2 is a schematic flowchart of another text and image retrieval method provided in the embodiment of the present application, applied to an electronic device in which a machine learning model is configured, the machine learning model including an image encoder, a text encoder, at least one preset cyclic affine transformation module and a post-processing module. As shown in the figure, the text and image retrieval method includes the following steps:
201. training the machine learning model by adopting a preset loss function to obtain the machine learning model meeting preset requirements; the preset loss function is composed of a local loss function, a global loss function and a cyclic affine transformation loss function.
202. A target vehicle image and a textual description of the target vehicle image are obtained.
203. And inputting the target vehicle image into the image encoder to obtain a first image characteristic.
204. And inputting the text description into the text encoder to obtain a first text characteristic.
205. And inputting the first image characteristic and the first text characteristic into the at least one preset cyclic affine transformation module to obtain a second image characteristic and a second text characteristic.
206. Inputting the second image characteristics into the post-processing module for post-processing to obtain a processing result, and determining a target retrieval result according to the processing result and the second text characteristics.
The specific description of the steps 201 to 206 may refer to the corresponding steps of the text and image retrieval method described in fig. 1A, and will not be described herein again.
It can be seen that, in the text and image retrieval method described in the embodiment of the present application, firstly, a vehicle image and its text description may be respectively input into the image encoder and the text encoder to obtain the corresponding image features and text features; secondly, the text features and the image features may undergo multi-angle deep feature fusion in the at least one preset cyclic affine transformation module to obtain the fused text and image features, after which the image features are post-processed and matched against the text features; thirdly, the loss function of the method may introduce external unstructured parameters for image-text generation and produce alignment between the image and the text by contrastive learning, specifically using contrastive losses from image to text, from image region to word, and from text to image. Thus, the association between image features and text features is established in depth, and retrieval consistency between text and images can be accurately achieved in vehicle retrieval.
Referring to fig. 3, fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure, where as shown in the figure, the electronic device includes a processor, a memory, a communication interface, and one or more programs, and a machine learning model is configured in the electronic device, where the machine learning model includes: an image encoder, a text encoder, at least one predetermined cyclic affine transformation module, and a post-processing module, said one or more programs being stored in said memory and configured to be executed by said processor, in an embodiment of the present application, said programs comprising instructions for:
acquiring a target vehicle image and a text description of the target vehicle image;
inputting the target vehicle image into the image encoder to obtain a first image characteristic;
inputting the text description into the text encoder to obtain a first text characteristic;
inputting the first image characteristic and the first text characteristic into the at least one preset cyclic affine transformation module to obtain a second image characteristic and a second text characteristic;
inputting the second image characteristics into the post-processing module for post-processing to obtain a processing result, and determining a target retrieval result according to the processing result and the second text characteristics.
Optionally, the preset cyclic affine transformation module includes: the system comprises a first affine transformation module, a first convolution module, a second affine transformation module, a down-sampling module and a dynamic semantic combination module;
the first affine transformation module is connected with the first convolution module, the first convolution module is connected with the second affine transformation module, the second affine transformation module is connected with the down-sampling module, and the first affine transformation module is used for performing affine transformation on the first image feature according to the first text feature so as to add the spatial attention of the text feature into the image feature; the down-sampling module is used for outputting image characteristics;
the second affine transformation module is used for performing affine transformation on the first image features according to the text features output by the dynamic semantic combination module so as to add the spatial attention of the text features into the image features;
the first convolution module is connected with the dynamic semantic combination module, the dynamic semantic combination module edits the first text characteristic based on the image characteristic of the first convolution module, and the dynamic semantic combination module is further used for outputting the text characteristic.
Optionally, the first affine transformation module is configured to perform affine transformation on the first image feature according to the first text feature, so as to add a spatial attention of a text feature to the image feature, and the affine transformation module includes:
determining an aggregation weight value according to the first text characteristic;
embedding the first image features with spatial attention to obtain image feature embedding subjected to spatial attention;
determining text information added with image weight according to the image feature embedding subjected to the spatial attention and the aggregation weight;
performing cross-modal processing on the text information added with the image weight by adopting a specified activation function to obtain cross-modal characteristics;
and carrying out affine transformation on the cross-modal characteristics.
Optionally, the post-processing module includes: a second convolution module, an activation function module, and a Gaussian interpolation module, where in the aspect of inputting the second image feature into the post-processing module for post-processing to obtain a processing result, and determining a target retrieval result according to the processing result and the second text feature, the program includes instructions for executing the following steps:
inputting the second image characteristics to the second convolution module, the activation function module and the Gaussian interpolation module in sequence for processing to obtain third image characteristics;
and determining the target retrieval result according to the second text characteristic and the third image characteristic.
Optionally, the program further includes instructions for performing the following steps:
training the machine learning model by adopting a preset loss function to obtain the machine learning model meeting preset requirements;
the preset loss function consists of a local loss function, a global loss function and a cyclic affine transformation loss function;
the local loss function is obtained based on a local similarity principle, specifically based on the average matching score between a sentence in the text and the most relevant objects in the image;
the global loss function is obtained based on global similarity, specifically based on the matching degree between the text vector and the image vector;
the cyclic affine transformation loss function is obtained based on calculating the matching degree between the image and the text description.
It can be seen that the electronic device described in the embodiment of the present application is configured with a machine learning model including an image encoder, a text encoder, at least one preset cyclic affine transformation module and a post-processing module. A target vehicle image and a text description of the target vehicle image are acquired; the target vehicle image is input into the image encoder to obtain a first image feature, and the text description is input into the text encoder to obtain a first text feature; the first image feature and the first text feature are input into the at least one preset cyclic affine transformation module to obtain a second image feature and a second text feature; and the second image feature is input into the post-processing module for post-processing to obtain a processing result, with a target retrieval result determined according to the processing result and the second text feature. On the one hand, the vehicle image and its text description are respectively input into the image encoder and the text encoder to obtain the corresponding image and text features; on the other hand, multi-angle deep feature fusion is performed on the text features and image features in the at least one preset cyclic affine transformation module to obtain the fused text and image features, and the post-processed image features are then matched against the text features. Thereby, the association between image features and text features is established in depth, and retrieval consistency between text and images can be accurately achieved in vehicle retrieval.
Fig. 4 is a block diagram showing functional units of a text and image retrieval apparatus 400 according to an embodiment of the present application. The text and image retrieval apparatus 400 is applied to an electronic device, in which a machine learning model is configured, and the machine learning model includes: image encoder, text encoder, at least one preset cyclic affine transformation module and a post-processing module, the apparatus comprising: an acquisition unit 401, an extraction unit 402, a transformation unit 403, and a processing unit 404, wherein,
the acquiring unit 401 is configured to acquire a target vehicle image and a text description of the target vehicle image;
the extracting unit 402 is configured to input the target vehicle image into the image encoder, so as to obtain a first image feature; inputting the text description into the text encoder to obtain a first text characteristic;
the transformation unit 403 is configured to input the first image feature and the first text feature into the at least one preset cyclic affine transformation module to obtain a second image feature and a second text feature;
the processing unit 404 is configured to input the second image feature into the post-processing module for post-processing to obtain a processing result, and determine a target retrieval result according to the processing result and the second text feature.
Optionally, the preset cyclic affine transformation module includes: the system comprises a first affine transformation module, a first convolution module, a second affine transformation module, a down-sampling module and a dynamic semantic combination module;
the first affine transformation module is connected with the first convolution module, the first convolution module is connected with the second affine transformation module, the second affine transformation module is connected with the down-sampling module, and the first affine transformation module is used for performing affine transformation on the first image feature according to the first text feature so as to add the spatial attention of the text feature into the image feature; the down-sampling module is used for outputting image characteristics;
the second affine transformation module is used for carrying out affine transformation on the first image features according to the text features output by the dynamic semantic combination module so as to add the spatial attention of the text features into the image features;
the first convolution module is connected with the dynamic semantic combination module, the dynamic semantic combination module edits the first text characteristic based on the image characteristic of the first convolution module, and the dynamic semantic combination module is further used for outputting the text characteristic.
Optionally, the first affine transformation module is configured to perform affine transformation on the first image feature according to the first text feature, so as to add a spatial attention of a text feature to the image feature, and the affine transformation module includes:
determining an aggregation weight according to the first text characteristic;
embedding the first image characteristics with space attention to obtain image characteristic embedding with space attention;
determining text information added with image weight according to the image feature embedding subjected to the spatial attention and the aggregation weight;
performing cross-modal processing on the text information added with the image weight by adopting a specified activation function to obtain cross-modal characteristics;
and carrying out affine transformation on the cross-modal characteristics.
Optionally, the post-processing module includes: a second convolution module, an activation function module, and a Gaussian interpolation module, where in the aspect of inputting the second image feature into the post-processing module for post-processing to obtain a processing result, and determining a target retrieval result according to the processing result and the second text feature, the processing unit 404 is specifically configured to:
inputting the second image characteristics to the second convolution module, the activation function module and the Gaussian interpolation module in sequence for processing to obtain third image characteristics;
and determining the target retrieval result according to the second text characteristic and the third image characteristic.
Optionally, the apparatus 400 is further specifically configured to:
training the machine learning model by adopting a preset loss function to obtain the machine learning model meeting preset requirements;
the preset loss function consists of a local loss function, a global loss function and a cyclic affine transformation loss function;
the local loss function is obtained based on a local similarity principle, specifically based on the average of the matching scores between a sentence in the text and the most relevant objects in the image;
the global loss function is obtained based on global similarity, specifically based on the matching degree between the text vector and the image vector;
the cyclic affine transformation loss function is obtained based on calculating the matching degree between the image and the text description.
It can be seen that the text and image retrieval apparatus described in the embodiments of the present application is applied to an electronic device in which a machine learning model is configured, the machine learning model including an image encoder, a text encoder, at least one preset cyclic affine transformation module and a post-processing module. A target vehicle image and a text description of the target vehicle image are acquired; the target vehicle image is input into the image encoder to obtain a first image feature, and the text description is input into the text encoder to obtain a first text feature; the first image feature and the first text feature are input into the at least one preset cyclic affine transformation module to obtain a second image feature and a second text feature; and the second image feature is input into the post-processing module for post-processing to obtain a processing result, with a target retrieval result determined according to the processing result and the second text feature. On the one hand, the vehicle image and its text description are respectively input into the image encoder and the text encoder to obtain the corresponding image and text features; on the other hand, multi-angle deep feature fusion is performed on the text features and image features in the at least one preset cyclic affine transformation module to obtain the fused text and image features, and the post-processed image features are then matched against the text features. Thereby, the association between image features and text features is established in depth, and retrieval consistency between text and images can be accurately achieved in vehicle retrieval.
It can be understood that the functions of each program module of the text and image retrieval apparatus in this embodiment may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the related description of the foregoing method embodiment, which is not described herein again.
Embodiments of the present application also provide a computer storage medium, where the computer storage medium stores a computer program for electronic data exchange, the computer program enabling a computer to execute part or all of the steps of any one of the methods described in the above method embodiments, and the computer includes an electronic device.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any of the methods as described in the above method embodiments. The computer program product may be a software installation package, the computer comprising an electronic device.
It should be noted that for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art will recognize that the embodiments described in this specification are preferred embodiments and that acts or modules referred to are not necessarily required for this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the above-described units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing associated hardware; the program may be stored in a computer-readable memory, which may include a flash memory disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The foregoing detailed description of the embodiments of the present application has been presented to illustrate the principles and implementations of the present application; the above description of the embodiments is provided only to help understand the method and core concept of the present application. Meanwhile, a person skilled in the art may, according to the idea of the present application, make variations to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A text and image retrieval method, applied to an electronic device, wherein a machine learning model is configured in the electronic device, the machine learning model comprising an image encoder, a text encoder, at least one preset cyclic affine transformation module, and a post-processing module, and the method comprises:
acquiring a target vehicle image and a text description of the target vehicle image;
inputting the target vehicle image into the image encoder to obtain a first image feature;
inputting the text description into the text encoder to obtain a first text feature;
inputting the first image feature and the first text feature into the at least one preset cyclic affine transformation module to obtain a second image feature and a second text feature;
inputting the second image feature into the post-processing module for post-processing to obtain a processing result, and determining a target retrieval result according to the processing result and the second text feature.
2. The method according to claim 1, wherein the preset cyclic affine transformation module comprises: a first affine transformation module, a first convolution module, a second affine transformation module, a down-sampling module, and a dynamic semantic combination module;
the first affine transformation module is connected with the first convolution module, the first convolution module is connected with the second affine transformation module, and the second affine transformation module is connected with the down-sampling module; the first affine transformation module is used for performing affine transformation on the first image feature according to the first text feature, so as to add the spatial attention of the text feature into the image feature; the down-sampling module is used for outputting the image feature;
the second affine transformation module is used for performing affine transformation on the first image feature according to the text feature output by the dynamic semantic combination module, so as to add the spatial attention of the text feature into the image feature;
the first convolution module is connected with the dynamic semantic combination module; the dynamic semantic combination module edits the first text feature based on the image feature output by the first convolution module, and is further used for outputting the text feature.
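A hedged sketch of how the module wiring recited in claim 2 could look, assuming PyTorch. The submodule internals here (the FiLM-style affine stub, the placeholder dynamic semantic combination, and the average-pooling down-sampling) are editorial assumptions; only the connection order follows the claim.

import torch
import torch.nn as nn

class AffineTransform(nn.Module):
    # FiLM-style stub of the text-conditioned affine transformation; a fuller
    # step-by-step sketch follows claim 3 below. (Editorial assumption.)
    def __init__(self, dim):
        super().__init__()
        self.gamma = nn.Linear(dim, dim)
        self.beta = nn.Linear(dim, dim)

    def forward(self, img_feat, txt_feat):
        g = self.gamma(txt_feat)[:, :, None, None]  # per-channel scale from text
        b = self.beta(txt_feat)[:, :, None, None]   # per-channel shift from text
        return g * img_feat + b

class DynamicSemanticCombination(nn.Module):
    # Placeholder: edits the text feature using a pooled summary of the image
    # feature produced by the first convolution module. (Editorial assumption.)
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(2 * dim, dim)

    def forward(self, txt_feat, img_feat):
        pooled = img_feat.flatten(2).mean(-1)       # (B, C) global image summary
        return self.fc(torch.cat([txt_feat, pooled], dim=-1))

class CyclicAffineTransform(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.affine1 = AffineTransform(dim)             # first affine transformation module
        self.conv1 = nn.Conv2d(dim, dim, 3, padding=1)  # first convolution module
        self.dsc = DynamicSemanticCombination(dim)      # dynamic semantic combination module
        self.affine2 = AffineTransform(dim)             # second affine transformation module
        self.down = nn.AvgPool2d(2)                     # down-sampling module

    def forward(self, img_feat, txt_feat):
        x = self.affine1(img_feat, txt_feat)  # add text spatial attention to the image feature
        x = self.conv1(x)
        new_txt = self.dsc(txt_feat, x)       # text feature edited with image evidence
        x = self.affine2(x, new_txt)          # second affine, conditioned on the edited text
        return self.down(x), new_txt          # the module outputs an image feature and a text feature

Note that the claim's wording ("the first image features") is ambiguous about the exact input of the second affine transformation module; the sketch follows the stated connection chain (first convolution module into second affine transformation module).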
3. The method of claim 2, wherein the first affine transformation module being used for performing affine transformation on the first image feature according to the first text feature, so as to add the spatial attention of the text feature into the image feature, comprises:
determining an aggregation weight according to the first text feature;
performing spatial-attention embedding on the first image feature to obtain a spatial-attention image feature embedding;
determining, according to the spatial-attention image feature embedding and the aggregation weight, text information to which an image weight has been added;
performing cross-modal processing on the text information to which the image weight has been added by adopting a specified activation function, to obtain a cross-modal feature;
and performing affine transformation on the cross-modal feature.
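The five steps of claim 3 could be realised as follows. This is an editorial sketch under explicit assumptions: softmax aggregation weights, a sigmoid spatial-attention map, tanh as the "specified activation function", and FiLM-style modulation as the final affine transformation, none of which are fixed by the claim.

import torch
import torch.nn as nn

class FirstAffineTransform(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.agg = nn.Linear(dim, dim)       # produces aggregation weights from text
        self.spatial = nn.Conv2d(dim, 1, 1)  # produces a spatial-attention map
        self.gamma = nn.Linear(dim, dim)     # affine scale from the cross-modal feature
        self.beta = nn.Linear(dim, dim)      # affine shift from the cross-modal feature

    def forward(self, img_feat, txt_feat):
        # step 1: aggregation weight determined from the first text feature
        w = torch.softmax(self.agg(txt_feat), dim=-1)           # (B, C)
        # step 2: spatial-attention embedding of the first image feature
        attn = torch.sigmoid(self.spatial(img_feat))            # (B, 1, H, W)
        img_emb = (img_feat * attn).flatten(2).mean(-1)         # (B, C)
        # step 3: text information with the image weight added
        weighted_txt = txt_feat + w * img_emb
        # step 4: cross-modal processing with the specified activation (tanh assumed)
        cross = torch.tanh(weighted_txt)
        # step 5: affine transformation driven by the cross-modal feature,
        # modulating the image feature FiLM-style (one reading of the claim)
        g = self.gamma(cross)[:, :, None, None]
        b = self.beta(cross)[:, :, None, None]
        return g * img_feat + b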
4. The method of any of claims 1-3, wherein the post-processing module comprises: a second convolution module, an activation function module, and a Gaussian interpolation module; and the inputting the second image feature into the post-processing module for post-processing to obtain a processing result, and determining a target retrieval result according to the processing result and the second text feature, comprises:
inputting the second image feature to the second convolution module, the activation function module, and the Gaussian interpolation module in sequence for processing, to obtain a third image feature;
and determining the target retrieval result according to the second text feature and the third image feature.
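A minimal sketch of the post-processing chain in claim 4, assuming PyTorch. Since PyTorch's F.interpolate has no Gaussian mode, bicubic resampling stands in for the Gaussian interpolation module here; the kernel size and the ReLU choice are likewise assumptions.

import torch.nn as nn
import torch.nn.functional as F

class PostProcessor(nn.Module):
    def __init__(self, dim, out_size):
        super().__init__()
        self.conv2 = nn.Conv2d(dim, dim, 3, padding=1)  # second convolution module
        self.act = nn.ReLU()                            # activation function module
        self.out_size = out_size                        # target spatial size for interpolation

    def forward(self, img_feat):
        x = self.act(self.conv2(img_feat))
        # The Gaussian interpolation module is approximated here by bicubic
        # resampling, since PyTorch has no built-in Gaussian interpolation mode.
        return F.interpolate(x, size=self.out_size, mode="bicubic", align_corners=False)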
5. The method according to any one of claims 1-3, further comprising:
training the machine learning model by adopting a preset loss function to obtain the machine learning model meeting preset requirements;
the preset loss function consists of a local loss function, a global loss function and a cyclic affine transformation loss function;
the local loss function is obtained based on a local similarity principle, specifically based on the average matching score between a sentence in the text and the most relevant object in the image;
the global loss function is obtained based on global similarity, specifically based on the degree of match between the text vector and the image vector;
and the cyclic affine transformation loss function is obtained by calculating the degree of match between the image and the text description.
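One way the three-term preset loss of claim 5 could be assembled, as an editorial sketch: the max-over-objects local term, the cosine-similarity global term, the binary cross-entropy cyclic affine transformation term, and the weighting coefficients are all assumptions rather than the patent's actual formulation.

import torch.nn.functional as F

def preset_loss(txt_vec, img_vec, sent_obj_scores, caf_logits, caf_target,
                w_local=1.0, w_global=1.0, w_caf=1.0):
    # local term: average matching score between each sentence and its most
    # relevant image object; negated so that better matches lower the loss
    local = -sent_obj_scores.max(dim=-1).values.mean()
    # global term: degree of match between the text vector and the image vector
    global_term = 1.0 - F.cosine_similarity(txt_vec, img_vec, dim=-1).mean()
    # cyclic affine transformation term: match degree between the image and its
    # text description, supervised here with a binary matched/unmatched target
    caf_term = F.binary_cross_entropy_with_logits(caf_logits, caf_target)
    return w_local * local + w_global * global_term + w_caf * caf_term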
6. A text and image retrieval apparatus, applied to an electronic device, wherein a machine learning model is configured in the electronic device, the machine learning model comprising an image encoder, a text encoder, at least one preset cyclic affine transformation module, and a post-processing module, and the apparatus comprises an acquisition unit, an extraction unit, a transformation unit, and a processing unit, wherein:
the acquisition unit is used for acquiring a target vehicle image and a text description of the target vehicle image;
the extraction unit is used for inputting the target vehicle image into the image encoder to obtain a first image feature, and inputting the text description into the text encoder to obtain a first text feature;
the transformation unit is used for inputting the first image feature and the first text feature into the at least one preset cyclic affine transformation module to obtain a second image feature and a second text feature;
and the processing unit is used for inputting the second image feature into the post-processing module for post-processing to obtain a processing result, and determining a target retrieval result according to the processing result and the second text feature.
7. The apparatus of claim 6, wherein the preset cyclic affine transformation module comprises: a first affine transformation module, a first convolution module, a second affine transformation module, a down-sampling module, and a dynamic semantic combination module;
the first affine transformation module is connected with the first convolution module, the first convolution module is connected with the second affine transformation module, and the second affine transformation module is connected with the down-sampling module; the first affine transformation module is used for performing affine transformation on the first image feature according to the first text feature, so as to add the spatial attention of the text feature into the image feature; the down-sampling module is used for outputting the image feature;
the second affine transformation module is used for performing affine transformation on the first image feature according to the text feature output by the dynamic semantic combination module, so as to add the spatial attention of the text feature into the image feature;
the first convolution module is connected with the dynamic semantic combination module; the dynamic semantic combination module edits the first text feature based on the image feature output by the first convolution module, and is further used for outputting the text feature.
8. The apparatus of claim 7, wherein the first affine transformation module being used for performing affine transformation on the first image feature according to the first text feature, so as to add the spatial attention of the text feature into the image feature, comprises:
determining an aggregation weight according to the first text feature;
performing spatial-attention embedding on the first image feature to obtain a spatial-attention image feature embedding;
determining, according to the spatial-attention image feature embedding and the aggregation weight, text information to which an image weight has been added;
performing cross-modal processing on the text information to which the image weight has been added by adopting a specified activation function, to obtain a cross-modal feature;
and performing affine transformation on the cross-modal feature.
9. An electronic device, comprising a processor and a memory, wherein the memory is used for storing one or more programs configured to be executed by the processor, and the programs comprise instructions for performing the steps in the method of any one of claims 1-5.
10. A computer-readable storage medium, storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method according to any one of claims 1-5.
CN202211550479.7A 2022-12-05 2022-12-05 Text and image retrieval method, device and computer readable storage medium Pending CN115934992A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211550479.7A CN115934992A (en) 2022-12-05 2022-12-05 Text and image retrieval method, device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN115934992A true CN115934992A (en) 2023-04-07

Family

ID=86652093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211550479.7A Pending CN115934992A (en) 2022-12-05 2022-12-05 Text and image retrieval method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115934992A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination