CN118013069A

CN118013069A - Image retrieval method and device, storage medium and electronic equipment

Info

Publication number: CN118013069A
Application number: CN202410419405.2A
Authority: CN
Inventors: 叶梦宇; 高天泽; 陈畅怀; 车军
Original assignee: Hangzhou Hikvision Digital Technology Co Ltd
Current assignee: Hangzhou Hikvision Digital Technology Co Ltd
Priority date: 2024-04-09
Filing date: 2024-04-09
Publication date: 2024-05-10
Anticipated expiration: 2044-04-09
Also published as: CN118013069B

Abstract

The application discloses an image retrieval method, an image retrieval device, a storage medium and electronic equipment, comprising the following steps: acquiring text description input by a user; generating a predicted image corresponding to the text description by using a pre-trained text-generated graph model; and carrying out image retrieval in an image library based on the predicted image to obtain an image retrieval result. By the method and the device, image retrieval can be performed based on text description, and a classical image retrieval method can be compatible.

Description

Image retrieval method and device, storage medium and electronic equipment

Technical Field

The present application relates to image processing technologies, and in particular, to an image retrieval method, an image retrieval device, a storage medium, and an electronic apparatus.

Background

With the rapid development of computer technology, image information retrieval has been applied to various scenes in a large scale.

In the current image retrieval technology, classical image retrieval requires that the most similar image is retrieved from a database after the image to be queried by the system is input. If no picture to be queried exists, only the text description of the appearance of the human body is provided for the user, and the classical graphic retrieval mode cannot be used.

Based on this, a series of search systems and schemes for searching images by words have been recently proposed, and such schemes require a module for extracting features of words and images and coupling the two to each other, but such schemes cannot be compatible with existing systems for searching images by words.

Disclosure of Invention

The application provides an image retrieval method, an image retrieval device, a storage medium and electronic equipment, which can perform image retrieval based on text description and can be compatible with classical image retrieval methods.

In order to achieve the above purpose, the application adopts the following technical scheme:

an image retrieval method comprising:

a. Acquiring text description input by a user, and generating a predicted image corresponding to the text description by utilizing a pre-trained text-generated graph model;

b. performing quality evaluation on the predicted image to obtain a quality evaluation result;

C, when the quality evaluation result meets the set requirement, continuing to execute the step c;

when the quality evaluation result does not meet the set requirement, prompting a user to modify the text description based on the quality evaluation result, and returning to execute the steps a and b until the quality evaluation result meets the set requirement, and continuing to execute the step c;

c. based on the predicted image, performing image retrieval in an image library to obtain an image retrieval result;

Wherein the performing quality evaluation on the predicted image includes:

acquiring global image features of the predicted image and local image features on each local attribute, and acquiring global text features of the text description and local text features on each local attribute;

And comparing the similarity between the global image feature and the global text feature, comparing the similarity between the local image feature and the local text feature on the same local attribute, and determining the overall quality score of the predicted image based on the result of the comparison between the two similarities.

Preferably, a pre-trained image feature extractor is utilized to extract global image features of the predicted image, so as to obtain the global image features;

Extracting global text features from the text description by using a pre-trained text feature extractor to obtain the global text features;

processing the global image features by using a pre-trained first global feature mapper, and predicting to obtain the local image features;

and processing the global text features by using a pre-trained second global feature mapper, and predicting to obtain the local text features.

Preferably, the quality evaluation of the predicted image further includes:

and determining the quality classification of the predicted image based on the feature extraction result determined by feature extraction of the predicted image.

Preferably, the determining the quality classification of the predicted image based on the feature extraction result determined by feature extraction of the predicted image includes:

classifying the local image characteristic prediction results on all the set local attributes by using a trained multi-attribute image quality gear evaluator to obtain quality classification results of the prediction image on each set local attribute; wherein the set local attribute includes a part or all of the local attribute.

Preferably, the image feature extractor, the text feature extractor, the first global feature mapper, the second global feature mapper, and the multi-attribute image quality gear evaluator are co-trained.

Preferably, when the image feature extractor, the text feature extractor, the global feature mapper and the multi-attribute image quality gear evaluator perform joint training, the parameter updating method includes:

Comparing the similarity between the processing result of the training image based on the image feature extractor and the processing result of the training text based on the text feature extractor, and determining the value of a first loss function;

determining a value of a second loss function based on a processing result of the global feature mapper;

Determining a value of a third loss function based on a processing result of the multi-attribute image quality gear evaluator;

Fusing the value of the first loss function, the value of the second loss function and the value of the third loss function to obtain the value of the comprehensive loss function;

and updating parameters of the image feature extractor, the text feature extractor, the first global feature mapper, the second global feature mapper and the multi-attribute image quality gear evaluator based on the value of the comprehensive loss function.

Preferably, the determining the value of the first loss function includes:

inputting a training image and a local image corresponding to each local attribute in the training image into the image feature extractor for processing to obtain global image features of the training image and local image features of the training image on each local attribute;

Inputting a training text and a local text corresponding to local attributes in the training text into the text feature extractor for processing to obtain global text features of the training text and local text features of the training text on each local attribute;

performing similarity comparison on the global image features and the global text features to obtain a global comparison result;

Performing similarity comparison on the local image features and the local text features corresponding to the same local attribute to obtain a comparison result of the corresponding local attribute;

and determining the value of the first loss function based on the global comparison result and the comparison result of each local attribute.

Preferably, the determining the value of the second loss function includes:

Inputting global image features of the training image into the global feature mapper for processing, and predicting to obtain a local image feature prediction result of the training image on each local attribute;

Respectively carrying out similarity comparison on the image feature prediction result of the training image on each local attribute and the local image feature of the training image on the corresponding local attribute output by the image feature extractor to obtain a training image feature comparison result corresponding to each local attribute;

Inputting global text features of the training text into the global feature mapper for processing, and predicting to obtain a local text feature prediction result of the training text on each local attribute;

Respectively carrying out similarity comparison on the image feature prediction result of the training text on each local attribute and the local text feature of the training text on the corresponding local attribute output by the text feature extractor to obtain a training text feature comparison result corresponding to each local attribute;

And determining the value of the second loss function based on the training image characteristic comparison result and the training text characteristic comparison result of all the local attributes.

Preferably, the determining the value of the third loss function includes:

Inputting the local image feature prediction results of the training image on all the set local attributes output by the first global feature mapper into the multi-attribute image quality gear evaluator for classification processing to obtain quality classification prediction results of the training image on each set local attribute;

determining the classification loss corresponding to each set local attribute based on the classification prediction result of the training image on each set local attribute and the quality classification actual result of the training image on the corresponding set local attribute;

And determining the value of the third loss function based on the classification losses corresponding to all the set local attributes.

Preferably, the determining the overall quality score of the predicted image includes:

the result of similarity comparison between the global image features and the global text features is used as a global quality score;

The result of similarity comparison between the local image features and the local text features of the same local attribute is used as a quality score of the corresponding local attribute;

the overall quality score is determined based on the global quality score and the quality scores of the respective local attributes.

Preferably, the quality evaluation result does not meet a set requirement, including: the global quality score is smaller than a set global threshold, and/or the quality score of each local attribute is smaller than a local attribute threshold set for the corresponding local attribute, and/or the overall quality score is smaller than a set overall threshold;

The prompting a user to modify the text description based on the quality assessment result includes: and prompting a user to modify the text description corresponding to any local attribute when the quality score of any local attribute is smaller than the local attribute threshold set for the any local attribute.

Preferably, the quality evaluation result does not meet a set requirement, including: the quality classification prediction result of the predicted image on the first set local attribute is a preset first quality classification result;

the prompting a user to modify the text description based on the quality assessment result includes:

And prompting a user to modify the text description corresponding to the first set local attribute when the quality classification prediction result of the predicted image on the first set local attribute is a preset first quality classification result.

Preferably, the predicted image includes a front predicted image, a side predicted image, and a rear predicted image;

The image retrieval in the image library comprises the following steps: performing the image retrieval based on the front predicted image, the side predicted image and the back predicted image respectively to obtain a front image retrieval result, a side image retrieval result and a back image retrieval result; or alternatively

And respectively extracting features of the front predicted image, the side predicted image and the back predicted image, fusing the extracted features to obtain fused features, and carrying out image retrieval based on the fused features to obtain retrieval results.

An image retrieval apparatus based on text description, comprising: the device comprises a text description acquisition unit, a text generation picture unit and an image retrieval unit;

The text description acquisition unit is used for acquiring text description input by a user;

the text-generated graph unit is used for generating a predicted image corresponding to the text description by utilizing a pre-trained text-generated graph model;

The image retrieval unit is used for performing image retrieval in an image library based on the predicted image to obtain an image retrieval result.

Preferably, a quality evaluation unit is further included between the text-to-image unit and the image retrieval unit, and is used for performing quality evaluation on the predicted image to obtain a quality evaluation result; when the quality evaluation result meets the set requirement, notifying the image retrieval unit to perform image retrieval; and prompting a user to modify the text description based on the quality evaluation result when the quality evaluation result does not meet the set requirement, and notifying the text description acquisition unit to acquire the text description input by the user again.

Preferably, in the quality evaluation unit, the performing quality evaluation on the predicted image includes:

and respectively carrying out feature extraction on the predicted image and the text description, carrying out similarity comparison on the feature extraction result, and determining the quality score of the predicted image based on the comparison result.

Preferably, the quality evaluation unit comprises an image feature extractor, a text feature extractor, a first global feature mapper, a second global feature mapper and a quality score unit;

the image feature extractor is used for extracting global image features of the predicted image to obtain global image features of the predicted image;

The text feature extractor is used for extracting global text features of the text description to obtain global text features of the text description;

the first global feature mapper is used for processing the global image features and predicting to obtain local image features of the predicted image on each local attribute;

the second global feature mapper is configured to process the global text feature, and predict to obtain a local text feature of the text description on each local attribute;

And the quality scoring device is used for comparing the similarity between the global image feature and the global text feature, comparing the similarity between the local image feature and the local text feature on the same local attribute, and determining the overall quality score of the predicted image based on the comparison result of the two similarities.

Preferably, the quality evaluation unit is further configured to determine a quality classification of the predicted image based on a feature extraction result determined by feature extraction of the predicted image.

Preferably, the quality evaluation unit further comprises a multi-attribute image quality gear evaluator, configured to classify the local image feature prediction results on all the set local attributes to obtain quality classification results of the predicted image on each set local attribute; wherein the set local attribute includes a part or all of the local attribute.

Preferably, in the quality evaluation unit, when the image feature extractor, the text feature extractor, the global feature mapper and the multi-attribute image quality gear evaluator perform joint training, the parameter updating manner includes:

Preferably, the determining the value of the first loss function includes:

Preferably, the determining the value of the second loss function includes:

Preferably, the determining the value of the third loss function includes:

Preferably, in the quality scoring device, the determining the overall quality score of the predicted image includes:

In the image retrieval unit, the performing image retrieval in an image library includes: performing the image retrieval based on the front predicted image, the side predicted image and the back predicted image respectively to obtain a front image retrieval result, a side image retrieval result and a back image retrieval result; or alternatively

A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the image retrieval method of any of the above.

An electronic device comprising at least a computer readable storage medium, further comprising a processor;

The processor is configured to read executable instructions from the computer-readable storage medium and execute the instructions to implement the image retrieval method of any one of the above.

According to the technical scheme, firstly, text description input by a user is acquired, the text description gives out key requirements of image retrieval, and a pre-trained text graph model is utilized to generate a predicted image corresponding to the text description; performing quality evaluation on the predicted image, prompting a user to modify text description when the quality evaluation result does not meet the set requirement, and re-generating the predicted image until the quality evaluation result of the predicted image meets the set requirement; when the image quality is evaluated, according to the result of similarity comparison between the global image features of the predicted image and the global image features of the text descriptions and the result of similarity comparison between the local image features of the predicted image on the local attributes and the local text features of the text descriptions on the same local attributes, the inconsistency between the predicted image and the text descriptions input by a user can be timely found, and the user is prompted to modify more proper text descriptions so as to generate the predicted image consistent with the text descriptions and used for searching; finally, based on the predicted image, image retrieval is carried out in an image library, and an image retrieval result is obtained. Through the method, the text description of the search is converted into the predicted image by using the text-generated graph model, and consistency of the generated predicted image and the text description is ensured through quality evaluation, so that the search requirement is correctly represented by the predicted image, namely, the image to be queried is provided for image search, and the existing image-to-image search processing mode can be used; in this way, on one hand, query images are not required to be directly provided, and only text descriptions are provided for image retrieval; on the other hand, the existing image retrieval system can be compatible and used.

Drawings

FIG. 1 is a basic flow diagram of an image retrieval method provided in the present application;

FIG. 2 is a schematic diagram of a quality assessment module according to a first embodiment of the present application;

FIG. 3 is a flowchart illustrating a training quality evaluation module according to an embodiment;

FIGS. 4-8 are schematic views of a partial image in a first embodiment of the present application;

FIG. 9 is a flowchart of an image retrieval method in the second embodiment;

FIG. 10 is a schematic diagram of a quality assessment process flow using a quality assessment module in a second embodiment;

FIG. 11 is a schematic diagram of the basic structure of an image retrieval device according to the present application;

fig. 12 is a schematic diagram of a basic structure of an electronic device according to the present application.

Detailed Description

The present application will be described in further detail with reference to the accompanying drawings, in order to make the objects, technical means and advantages of the present application more apparent.

Fig. 1 is a basic flow diagram of an image retrieval method provided in the present application. As shown in fig. 1, the method includes:

Step 101, obtaining a text description input by a user.

And acquiring text description input by a user and used for searching, wherein the text description is used for representing searching requirements and guiding subsequent image searching.

And 102, generating a predicted image corresponding to the text description by using a pre-trained text-generated graph model.

The text-to-image model is a neural network model for generating a corresponding image based on input text, and can be generated by training using an existing neural network structure. The text description obtained in step 101 is input into the draft graph model for processing, and a predicted image corresponding to the input text description can be generated.

In order to obtain a more comprehensive image retrieval result, when the predicted image is generated, the predicted image with different orientations such as the front face, the side face, the back face and the like can be generated according to the requirements of a user.

And 103, performing image retrieval in an image library based on the predicted image to obtain an image retrieval result.

And (3) searching the images meeting the conditions in the image library based on the predicted images generated in the step (102) to obtain image searching results. In the specific implementation, the existing image retrieval image processing can be adopted, so that the image retrieval based on the predicted image can be realized. For example, the image feature extraction may be performed on the predicted image by using a pre-trained image retrieval model, the feature extraction may be performed on the image in the image library, the similarity between the feature of the predicted image and the image feature of each image in the image library may be calculated, and the retrieval result in the image library may be determined based on the similarity. The image retrieval model may be generated by training using an existing neural network structure, and may be, for example, a deep learning model or the like.

If step 102 generates a plurality of predicted images with different orientations, step 103 also correspondingly, respectively performing image retrieval corresponding to the predicted images with different orientations to obtain image retrieval results with corresponding orientations; or the image retrieval model for image retrieval can respectively extract the characteristics of a plurality of predictive images with different directions, fuse the image characteristics extracted by the predictive images with different directions, and retrieve from an image library according to the fused characteristics to obtain a retrieval result. The image retrieval model can use a deep-learning CNN network structure or a transformation network structure as a backbone for extracting the characteristics of the image.

Further, in order to further improve the accuracy of image retrieval and avoid the problem that the image retrieval effect is affected due to unsuitable text description, on the basis of the flow, before image retrieval is performed based on the predicted image, quality evaluation can be further performed on the predicted image, if the quality evaluation result does not meet the requirement, a user can be prompted to modify the text description, the predicted image is regenerated based on the modified text description until the quality evaluation result of the predicted image meets the requirement, and image retrieval is performed based on the predicted image. Specifically, the steps 102 to 103 shown in fig. 1 may further include the following processes:

102a, performing quality evaluation on the predicted image to obtain a quality evaluation result;

specifically, the predicted image may be processed using a pre-trained image quality assessment model to achieve image quality assessment.

In more detail, the process of performing quality evaluation on the predicted image may include:

Acquiring global image features of a predicted image and local image features on each local attribute, and acquiring global text features of text description and local text features on each local attribute;

And comparing the similarity between the global image features and the global text features, comparing the similarity between the local image features and the local text features on the same local attribute, and determining the overall quality score of the predicted image based on the comparison result of the two similarities, and taking the overall quality score as a quality evaluation result.

Step 102b, judging whether the quality evaluation result meets the set requirement, if so, continuing to execute step 103, otherwise, executing step 102c;

step 102c, when the quality evaluation result does not meet the set requirement, prompting the user to modify the text description based on the quality evaluation result, and returning to step 101.

Through the processing, the quality evaluation can be carried out on the predicted image, when the predicted image generated by the current text description possibly has quality problems, a user is fed back, the user is allowed to provide more detailed description to supplement and perfect information until the generated image meets the quality requirements, and then the image is searched based on the generated predicted image, so that the image searching performance can be effectively improved, and an image searching result which better meets the user expectations and has higher quality is provided.

As described in the above flow, the user inputs the text description for searching, and can realize image searching without inputting an image, meanwhile, because the text-generated image model is used for generating the predicted image, and the text information is converted into the predicted image with consistent expression for image searching through quality evaluation and adjustment of the text description, the method can be compatible with the existing image searching image processing mode, and the image searching is performed based on the predicted image, so as to obtain the image searching result.

The foregoing quality evaluation of the predicted image is performed from the standpoint of consistency of the generated predicted image with the text description, and further, may be performed from the standpoint of image quality of the predicted image itself. The specific quality evaluation process can also be designed according to actual requirements.

The quality evaluation of the predicted image can be performed by a quality evaluation module, wherein the quality evaluation module is used for performing the quality evaluation of the predicted image after the related parameters in the quality evaluation module are generated through training data in advance.

For clarity of description, before describing the image retrieval method of the present application in further detail, training of the quality assessment module is first described by way of one specific embodiment. In this embodiment, a human body image search scene will be described as an example. Fig. 2 is a schematic diagram of a quality assessment module according to a first embodiment of the present application, wherein quality assessment is performed based on image information with multiple dimensions in order to achieve better quality assessment effect. As shown in FIG. 2, the quality evaluation module mainly comprises an image feature extractor, a text feature extractor, a first global feature mapper, a second global feature mapper and a multi-attribute image quality gear evaluator, wherein the modules all need to be trained to perfect parameter values, and the training of the quality evaluation module is that of a joint training process of the modules. FIG. 3 is a flowchart illustrating a training quality evaluation module according to a first embodiment. As shown in fig. 2 and 3, the specific training method of the quality assessment module includes:

In step 301, training data is generated.

The training data of the quality assessment module comprises two categories of images and text. In this embodiment, in order to achieve a better quality evaluation effect, quality evaluation is performed based on image information of multiple dimensions. Based on this, the training data also includes information of multiple dimensions, specifically including global information and local information, and the local information may also include information of multiple dimensions, where in order to distinguish dimensions corresponding to different local information, the dimensions corresponding to the local information are called local attributes, such as background information, accessory information, clothing style information, and the like in the image. Thus, the images as training data include a complete training image (which may also be referred to as a global training image) and a local image corresponding to each local attribute in the training image, and the text as training data includes a complete training text input by the user and a local text corresponding to each local attribute in the training text (which may also be referred to as a global training text). For each dimension of information, a different quality classification label is provided for the dimension for marking the quality classification of the image from an image quality perspective, e.g. for data with local properties being appurtenant, the quality classification label may comprise completely visible, partially visible, completely invisible, etc.

The information of each dimension is processed according to different information dimensions, and the processing is similar to executing different subtasks, so that the processing corresponding to the data of different dimensions is also called different subtasks, the processing of global information can be called a fixed subtask, and the processing of local information can be called a custom subtask. An example of the composition of the training data is given in table 1 below.

TABLE 1

In the column of quality class labels in table 1, the class labels marked by bold represent quality class labels for comparison with the output at a time of training.

The training data composed of a plurality of dimensional data generated in this embodiment is as above.

In step 302, the image data for training is input to an image feature extractor for processing, and global image features and local image features corresponding to the local attributes are generated.

The image feature extractor is used for extracting image features, and can be realized through a neural network, and particularly can adopt various existing feature extraction network structures. The image data input into the image feature extractor comprises a global training image and local images corresponding to all local attributes, the global training image is subjected to image feature extraction and output to obtain global image features, the local images corresponding to all local attributes are subjected to image feature extraction and output to obtain local image features corresponding to all local attributes.

In step 303, the text data for training is input into a text feature extractor for processing, and global text features and local text features corresponding to the local attributes are generated.

The text feature extractor is used for extracting text features, and can be realized through a neural network, and particularly can adopt various existing feature extraction network structures. The text data input into the text feature extractor comprises global training texts and local texts corresponding to all local attributes, the global training texts are subjected to text feature extraction and output to obtain global text features, the local texts corresponding to all the local attributes are subjected to text feature extraction and output to obtain local text features corresponding to all the local attributes.

Step 304, performing similarity comparison on the global image features and the global text features to obtain a global comparison result; and comparing the similarity of the local image features and the local text features corresponding to the same local attribute to obtain a comparison result of the corresponding local attribute.

The step compares the similarity between the image features and the text features. Specifically, for the global feature, similarity comparison can be performed on the global image feature and the global text feature, and a global comparison result is determined; for the local features, similarity comparison can be performed on the local image features and the local text features corresponding to the same local attribute to obtain a comparison result of the corresponding local attribute, for example, similarity comparison is performed on the local image features and the local text features corresponding to the background to obtain a comparison result corresponding to the background, and similarity comparison is performed on the local image features and the local text features corresponding to the accessory to obtain a comparison result corresponding to the accessory. The comparison result may be represented in a number of ways, such as a Cosine loss or an MSE loss.

Step 305, determining the value of the first loss function based on the global comparison result and the comparison result of each local attribute.

And fusing the global comparison result and the comparison results of all the local attributes, and determining the value of the first loss function. The specific fusion mode may be set according to actual needs, for example, a weighted summation mode is adopted in this embodiment.

As described above, the data of different dimensions may correspond to different subtasks, the global data corresponds to a fixed subtask, the data of the local attribute corresponds to a custom subtask, and for simplicity of expression, the process of fusing the global comparison result and the comparison result of the local attribute is expressed by the corresponding subtask, in this embodiment, a Cosine loss is taken as an example, and the value of the first loss function may be expressed asWherein f _I and f _T respectively represent the image features and text features corresponding to each subtask, norm (f) represents the L2 norm of f,/>Representing the similarity comparison result of each subtask,/>And the weight of each subtask when the comparison results are fused is represented, and n is the total number of the subtasks.

The first loss function is used for being matched with other loss functions later and used as an updating basis of model parameters after the training of the round is finished. As can be seen from the calculation of the first loss function described above, for both the image feature extractor and the text feature extractor, the purpose of training is to enable both feature extractors to extract image features and text features that have a high degree of consistency.

Step 306, inputting the global image features into the first global feature mapper to obtain local image feature prediction results on each local attribute, and comparing the prediction results with the local image features on the corresponding local attribute output by the image feature extractor to obtain a training image feature comparison result corresponding to each local attribute.

Specifically, the global image features output by the image feature extractor are input into a first global feature mapper for processing, and local image feature prediction results of the global training images on each local attribute are output. As previously described, local image feature extraction is also performed at the image feature extractor for each local attribute. Here, the local image feature output by the image feature extractor may be used as an actual feature, and similarity comparison may be performed with the local image feature prediction result output by the first global feature mapper, to obtain a training image feature comparison result. When the similarity comparison is performed in this step, the similarity comparison is performed for each local attribute, that is, the local image feature prediction result and the local image feature corresponding to the same local attribute are compared to obtain the training image feature comparison result corresponding to the corresponding local attribute, and the similarity comparison may be represented in various existing manners, for example, in this embodiment, the MSE loss is used for representing the similarity comparison.

Step 307, the global text feature is input into a second global feature mapper to obtain a local text feature prediction result on each local attribute, and the prediction result is compared with the local text feature on the corresponding local attribute output by the text feature extractor to obtain a training text feature comparison result corresponding to each local attribute.

Specifically, the global text feature output by the text feature extractor is input into a second global feature mapper for processing, and a local text feature prediction result of the global training text on each local attribute is output. Similar to the processing of the first global feature mapper, the local text feature output by the text feature extractor may be used as an actual feature, and compared with the local text feature prediction result output by the second global feature mapper to obtain a training text feature comparison result. When the similarity comparison is performed in this step, the similarity comparison is performed for each local attribute, that is, the local text feature prediction result and the local text feature corresponding to the same local attribute are compared to obtain the training text feature comparison result corresponding to the corresponding local attribute, and the similarity comparison may be represented in various existing manners, for example, in this embodiment, the MSE loss is used for representing the similarity comparison.

As can be seen from the above processing of steps 306 and 307, the first global feature mapper and the second global feature mapper are each configured to map global features to local features on respective local attributes.

The first global feature mapper and the second global feature mapper may be implemented by a neural network, for example, may be a decoder of a CNN or a transducer structure, and are trained by the processing of the present flow. The first global feature mapper may be the same mapper as the second global feature mapper, or may be a different mapper. In this embodiment, the same mapper is taken as an example, and the mapper is called a global feature mapper, and the step 308 is continued.

Step 308, determining the value of the second loss function based on the training image feature comparison result and the training text feature comparison result of all the local attributes.

Through the processing in the foregoing steps 306 and 307, a training image feature comparison result and a training text feature comparison result are obtained respectively, and since the first global feature mapper and the second global feature mapper are the same mapper in this embodiment, when determining a loss function, the training image feature comparison result and the training text feature comparison result of all local attributes are combined to determine the loss function corresponding to the global feature mapper, that is, the second loss function. Taking MSE loss as an example of the similarity comparison result, in this embodiment, the second loss function is: Where N represents the number of local attributes, N1 and N2 represent the dimensions of the image feature and the text feature, respectively,/> And/>Respectively representing local text features and predicted results thereof,/>And/>Respectively representing local image features and prediction results thereof.

The second loss function is used for being matched with other loss functions later and used as an updating basis of model parameters after the training of the round is finished. As can be seen from the calculation of the second loss function described above, for the global feature mapper the purpose of the training is to enable it to map the global features correctly to local features on the respective local attributes.

In the above processing, the first global feature mapper and the second global feature mapper are described as an example. When the first global feature mapper and the second global feature mapper are different mappers, one loss function can be determined based on training image feature comparison results of all local attributes, and the other loss function can be determined based on training text feature comparison results of all local attributes, and the two loss functions are matched with other loss functions in the follow-up process to serve as reference bases for model parameter updating.

Step 309, inputting the local image feature prediction result on the set local attribute output by the first global feature mapper into the multi-attribute image quality gear evaluator for classification processing to obtain a quality classification prediction result on each set local attribute of the training image, and determining the value of the third loss function based on the prediction result and the actual quality classification result on the corresponding set local attribute.

As previously mentioned, the quality assessment of the present application includes two angles, one from the perspective of the consistency of the predicted image with the text description, the processing of the first loss function described above characterizes this metric, and the other from the perspective of the image quality itself of the predicted image, the third loss function obtained by the processing of this step characterizes this metric.

Specifically, the local image features on the respective set local attributes are classified by using a multi-attribute image quality gear evaluator to classify the image quality of the training image from the respective set local attributes. The set local attribute may be all or part of the local attribute, and may include, for example, a local attribute such as a background, an accessory, or a distinct pattern.

For each set local attribute, determining a classification loss on the set local attribute based on the image quality classification result on the set local attribute and the actual quality classification result (i.e., the quality classification label) on the same set local attribute; and fusing the classification losses on all the set local attributes, and determining the value of the third loss function. The specific fusion mode of each classification loss can be set according to actual needs, and in this embodiment, summation is taken as an example to describe.

In more detail, the third loss function in the present embodiment may beWherein，/>The classification loss of each set local attribute is represented, n represents the number of set local attributes, y represents the real quality classification label,/>Representing the predictive confidence of the current set local attribute i.

The third loss function is used for being matched with other loss functions later and used as an updating basis of model parameters after the training of the round is finished. As can be seen from the calculation of the third loss function, for a multi-attribute image quality gear estimator, the purpose of training is to have the ability to predict image quality from multiple set local attributes.

And 310, performing fusion processing on the values of the first loss function, the second loss function and the third loss function, determining the value of the comprehensive loss function, and updating parameters of the image feature extractor, the text feature extractor, the first global feature mapper, the second global feature mapper and the multi-attribute image quality gear evaluator based on the value of the comprehensive loss function.

The first loss function, the second loss function and the third loss function can be used for designing a fusion mode according to actual requirements, for example, weighted summation, weighted average and the like can be adopted, and the application is not limited to a specific fusion mode.

And carrying out fusion processing on the values of the first loss function, the second loss function and the third loss function to obtain the value of the comprehensive loss function. And updating parameters of each neural network model included in the quality evaluation module based on the value of the comprehensive loss function, wherein the parameters comprise an image feature extractor, a text feature extractor, a first global feature mapper, a second global feature mapper and a multi-attribute image quality gear evaluator. After the parameters are updated, the next training is carried out until the whole quality evaluation module converges or a preset training round is reached.

To this end, the training method of the quality assessment module shown in fig. 3 ends. Through the training process, the image feature extractor, the text feature extractor, the first global feature mapper, the second global feature mapper and the multi-attribute image quality gear evaluator can be trained and generated, and the quality evaluation of the input image is realized from the angles of consistency of the image and the text and the angle of the image quality per se.

The image quality evaluation module generated by training in the first embodiment may perform image quality evaluation based on image information of multiple dimensions, and may implement more accurate image quality evaluation because more image information of multiple dimensions is used.

Of course, more simply, the image quality evaluation method may also perform quality evaluation according to only image information of a single dimension, for example, the global training image and the global training text in the first embodiment may be used as input, the first loss function is obtained only by comparing the global image feature and the global text feature, the image quality evaluation module no longer includes the first global feature mapper, the second global feature mapper, and the second loss function is no longer calculated, and the multi-attribute image quality gear evaluator performs quality classification prediction only for the global image feature and calculates the third loss function based on the prediction result.

Next, a specific implementation of the image retrieval method in the present application will be described by another specific embodiment. The quality evaluation module generated based on the training realizes quality evaluation processing. Fig. 9 is a specific flowchart of an image retrieval method in the second embodiment. Fig. 10 is a schematic diagram of a quality evaluation process flow using a quality evaluation module in the second embodiment. As shown in fig. 9, the image retrieval method in the second embodiment specifically includes:

Step 901, obtaining text description input by a user, and generating a predicted image corresponding to the text description by utilizing a pre-trained text-generated graph model.

The processing in this step is the same as that in steps 101 to 102, and will not be described here again.

In step 902, the predicted image is input into an image feature extractor for processing, so as to obtain global image features of the predicted image.

When the quality evaluation module performs actual reasoning, the input image data only has a predicted image, and no local image corresponding to the local attribute is included, so that the quality evaluation input during the reasoning can be effectively simplified. Based on the predicted image, global image features are obtained by processing of an image feature extractor.

And 903, inputting the global image features obtained in the step 902 into a first global feature mapper for processing, and predicting to obtain local image features of the predicted image on each local attribute.

Through the processing of the first embodiment, the training and generating the first global feature mapper may implement mapping from the global image feature to the local image feature, based on which, the global image feature is processed by the first global feature mapper, so as to predict and obtain the local image feature of the predicted image on each local attribute.

And step 904, inputting the text description into a text feature extractor for processing to obtain global text features of the text description.

When the quality evaluation module performs actual reasoning, the input text data only has text description input by a user and does not comprise local text corresponding to local attributes, so that quality evaluation input during reasoning can be effectively simplified. Based on the text description, text image features are obtained through processing by a text feature extractor.

In step 905, the global text feature obtained in step 904 is input into a second global feature mapper for processing, and a local text feature of the text description on each local attribute is predicted.

Through the processing of the first embodiment, the training and generating the second global feature mapper may implement mapping from the global text feature to the local text feature, based on which, the global text feature is processed by the second global feature mapper, so as to predict and obtain the local text feature of the text description on each local attribute, and due to the processing of the second global feature mapper, the input of the quality evaluation may not include the local text of the corresponding local attribute, but still may obtain the local text feature for performing consistency comparison of the local image feature and the local text feature.

Step 906, comparing the similarity between the global image feature and the global text feature, comparing the similarity between the local image feature and the local text feature on the same local attribute, and determining the overall quality score of the predicted image based on the result of the comparison between the two similarities.

Wherein, the specific process of determining the overall quality score based on the two similarity comparison results may include:

comparing the similarity between the global image features and the global text features, and taking the result of the similarity comparison as a global quality score;

for any one local attribute, carrying out similarity comparison on the local image characteristic of the local attribute and the local text characteristic of the local attribute, and taking the result of the similarity comparison as the quality score of the any one local attribute;

an overall quality score is determined based on the global quality score and the quality scores of the respective local attributes.

The similarity comparison method may be performed by the similarity comparison method in model training according to the embodiment, for example, cosine similarity or MSE similarity. Taking Cosine similarity as an example, the overall quality score that is ultimately determined can be expressed asWhere, described in terms of subtasks, n represents the number of all subtasks (including global and all local attributes), v >Representing the weight of each subtask, f _I and f _T representing the extracted image features and text features, respectively, for each subtask, norm (f) representing the L2 norm of f,Representing the quality score (which may be a global quality score or a local quality score) of the individual subtasks.

The quality score is used to measure the consistency between the predicted image generated by the meridional graph model and the text description input by the user, and from this point of view, the quality evaluation is performed. When the overall quality score or the quality score of a specific local attribute is smaller than a corresponding threshold set by the system, the system can feed back specific local attributes of the user, which have problems, and prompt the user to modify and supplement the description of the specific problem, wherein the consistency between partial details of the generated predictive image and the text description input by the user is poor. Such as shown in table 2.

TABLE 2

In other words, the quality evaluation result does not satisfy the setting requirement may specifically be: the global quality score is less than a set global threshold and/or the quality score of each local attribute is less than a local attribute threshold set for the corresponding local attribute and/or the overall quality score is less than a set overall threshold. When this occurs, the user may be prompted which level of quality score is less than the corresponding threshold, and the user may make modifications to the corresponding textual description. In general, when the overall quality score and the global quality score are smaller than the threshold, the quality score of a certain local attribute is also smaller than the threshold, because the consistency of the overall image and the text is poor and is usually caused by a relatively obvious problem in a certain aspect, so that the user can be prompted to the local attribute which does not meet the requirement in a targeted manner.

Step 907, the local image features with set local attributes output by the first global feature mapper are input into a multi-attribute image quality gear evaluator, and quality classification results of the predicted image on each set local attribute are obtained.

The training process of the first embodiment generates a multi-attribute image quality gear evaluator, and image quality classification can be performed according to different set local attributes. Wherein the set local attribute may be a part or all of the local attribute. The set local attribute in the present embodiment includes how many local attributes are. Based on the above, in this step, all the outputs of the first global feature mapper are input into the multi-attribute image quality gear estimator, and the quality classification result of the predicted image on each local attribute is obtained through the processing of the estimator. When the quality classification result on some local attributes is a specific classification (typically a poor quality classification), the user is prompted for related risks, which can be modified and supplemented for the description of the specific problem prompted by the system. For example as shown in table 3.

TABLE 3 Table 3

In other words, the quality evaluation result does not satisfy the setting requirement may specifically be: predicting the quality classification prediction result of the image on the first set local attribute as a preset first quality classification result; the user may be prompted to modify the text description corresponding to the first set of local properties. The first quality classification result is one of a plurality of quality classification results on the first set local attribute. For example, the first set local property may be an appendage, and the quality classification result of the appendage may include "appendage is fully visible", "appendage is partially visible", "appendage is fully invisible", where "appendage is fully invisible" is the first quality classification result, and when the quality classification prediction result of the first set local property determined in a certain quality assessment process is the first quality classification result, that is, the quality classification prediction result of the appendage is "appendage is fully invisible", the user may be prompted to modify the appendage description.

Step 908, it is determined whether the quality evaluation result meets the set requirement, if so, step 909 is executed, otherwise step 910 is executed.

The foregoing descriptions of steps 906 and 907 have been described, and will not be repeated here, as to whether the quality evaluation result does not meet the set requirement and sends a prompt to the user.

Step 909, when it is determined that the quality evaluation result does not meet the set requirement according to the output quality score or quality classification result of step 906 or 907, a prompt is sent to the user, and step 901 is returned.

In this step, a prompt may be sent to the user according to the specific case that the quality evaluation result does not meet the set requirement, and the specific processing is already described in the foregoing descriptions of steps 906 and 907, which will not be repeated here.

After sending the prompt to the user, the method may return to step 901, reacquire the text description modified by the user, and continue to execute the subsequent steps until the quality evaluation result meets the set requirement, and then execute step 910.

Step 910, based on the predicted image, performing image retrieval in the image library to obtain an image retrieval result.

The processing of this step is the same as that of step 103 in fig. 1, and will not be described again here.

Up to this point, the flow of the image retrieval method in the second embodiment shown in fig. 9 ends.

The image retrieval method provided by the application has the advantages that the text information is converted into the image information by using the text-to-image module and then used for image retrieval, the quality evaluation processing is utilized to allow the user to interact with the system, the user is fed back when the quality problem exists in the image generated by the current text description, the user is allowed to provide more detailed description to supplement and perfect information, and the image retrieval is carried out according to the predicted image until the system and the user requirements are met, so that the image retrieval effect can be effectively improved, the classical image query image retrieval system is compatible, the image retrieval system can be independently and quickly iterated, and different new strategies are combined; meanwhile, the image retrieval method can also generate the predictive images of a plurality of orientations of the target to be queried for retrieval, and can effectively improve the retrieval effect.

The above is a specific implementation of the image retrieval method in the application. The application also provides an image retrieval device which can be used for implementing the image retrieval method. Specifically, fig. 11 is a schematic diagram of the basic structure of the image retrieval device in the present application. As shown in fig. 11, the apparatus includes: the device comprises a text description acquisition unit, a text generation graph unit and an image retrieval unit.

and the image retrieval unit is used for performing image retrieval in the image library based on the predicted image to obtain an image retrieval result.

In order to achieve better retrieval effect, optionally, a quality evaluation unit is further included between the text graph unit and the image retrieval unit, and is used for performing quality evaluation on the predicted image to obtain a quality evaluation result; when the quality evaluation result meets the set requirement, notifying an image retrieval unit to perform image retrieval; and prompting the user to modify the text description based on the quality evaluation result when the quality evaluation result does not meet the set requirement, and notifying the text description acquisition unit to acquire the text description input by the user again.

Optionally, in the quality evaluation unit, the processing for performing quality evaluation on the predicted image may specifically include:

Optionally, the quality assessment unit may comprise an image feature extractor, a text feature extractor, a first global feature mapper, a second global feature mapper, and a quality scorer;

The second global feature mapper is used for processing the global text features and predicting to obtain local text features of the text description on each local attribute;

And the quality scoring device is used for comparing the similarity between the global image characteristic and the global text characteristic, comparing the similarity between the local image characteristic and the local text characteristic on the same local attribute, and determining the overall quality score of the predicted image based on the result of the comparison between the two similarities.

Alternatively, the quality evaluation unit may be further configured to determine the quality classification of the predicted image based on a feature extraction result determined by feature extraction of the predicted image.

Optionally, the quality evaluation unit may further include a multi-attribute image quality gear evaluator, configured to classify the local image feature prediction results on all the set local attributes to obtain a quality classification result of the predicted image on each set local attribute; wherein setting the local attribute may include some or all of the local attribute.

Optionally, the image feature extractor, the text feature extractor, the first global feature mapper, the second global feature mapper, and the multi-attribute image quality gear evaluator may be co-trained.

Optionally, in the quality evaluation unit, when the image feature extractor, the text feature extractor, the global feature mapper and the multi-attribute image quality gear evaluator perform joint training, the parameter updating manner may specifically include:

Comparing the similarity between the processing result of the training image based on the image feature extractor and the processing result of the training text based on the text feature extractor, and determining the value of the first loss function;

based on the value of the comprehensive loss function, updating parameters of the image feature extractor, the text feature extractor, the first global feature mapper, the second global feature mapper and the multi-attribute image quality gear evaluator.

Optionally, the process of determining the value of the first loss function may specifically include:

Inputting the training image and the local image corresponding to each local attribute in the training image into an image feature extractor for processing to obtain global image features of the training image and local image features of the training image on each local attribute;

inputting the training text and the local text corresponding to the local attribute in the training text into a text feature extractor for processing to obtain global text features of the training text and local text features of the training text on each local attribute;

comparing the similarity of the global image features and the global text features to obtain a global comparison result;

Optionally, the process of determining the value of the second loss function may specifically include:

inputting global image features of the training image into a global feature mapper for processing, and predicting to obtain a local image feature prediction result of the training image on each local attribute;

inputting global text features of the training text into a global feature mapper for processing, and predicting to obtain a local text feature prediction result of the training text on each local attribute;

Optionally, the process of determining the value of the third loss function may specifically include:

Inputting the local image feature prediction results of the training image output by the first global feature mapper on all the set local attributes into a multi-attribute image quality gear evaluator for classification processing to obtain quality classification prediction results of the training image on each set local attribute;

Optionally, in the quality scoring device, the process of determining the overall quality score of the predicted image may specifically include:

The result of similarity comparison between the local image features and the local text features of the same local attribute is used as the quality score of the corresponding local attribute;

Optionally, the quality evaluation result does not meet the set requirement, which specifically may include: the global quality score is less than a set global threshold, and/or the quality score of each local attribute is less than a local attribute threshold set for the corresponding local attribute, and/or the overall quality score is less than a set overall threshold;

Prompting the user to modify the text description based on the quality assessment results may specifically include: and prompting a user to modify the text description corresponding to any local attribute when the quality score of any local attribute is smaller than the local attribute threshold set for any local attribute.

Optionally, the quality evaluation result does not meet the set requirement, which specifically may include: predicting the quality classification prediction result of the image on the first set local attribute as a preset first quality classification result;

prompting the user to modify the text description based on the quality assessment results, including:

And when the quality classification prediction result of the predicted image on the first set local attribute is a preset first quality classification result, prompting a user to modify the text description corresponding to the first set local attribute.

Alternatively, the predicted image may include a front predicted image, a side predicted image, and a rear predicted image;

In the image retrieval unit, the processing of performing image retrieval in the image library may specifically include: performing image retrieval based on the front predicted image, the side predicted image and the back predicted image respectively to obtain a front image retrieval result, a side image retrieval result and a back image retrieval result; or alternatively

The present application also provides a computer readable storage medium storing instructions which, when executed by a processor, perform steps in implementing an image retrieval method as described above. In practice, the computer readable medium may be comprised by or separate from the apparatus/device/system of the above embodiments, and may not be incorporated into the apparatus/device/system. Wherein instructions are stored in a computer readable storage medium which stored instructions, when executed by a processor, can perform the steps in the image retrieval method as described above.

According to an embodiment of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: portable computer diskette, hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), portable compact disc read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the foregoing, but are not intended to limit the scope of the application. In the disclosed embodiments, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

Fig. 12 is a schematic diagram of an electronic device according to the present application. As shown in fig. 12, a schematic structural diagram of an electronic device according to an embodiment of the present application is shown, specifically:

The electronic device can include a processor 1201 of one or more processing cores, a memory 1202 of one or more computer-readable storage media, and a computer program stored on the memory and executable on the processor. The image retrieval method can be implemented when the program of the memory 1202 is executed.

Specifically, in practical applications, the electronic device may further include a power source 1203, an input/output unit 1204, and other components. It will be appreciated by those skilled in the art that the structure of the electronic device shown in fig. 12 is not limiting of the electronic device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components. Wherein:

The processor 1201 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of a server and processes data by running or executing software programs and/or modules stored in the memory 1202 and calling data stored in the memory 1202, thereby performing overall monitoring of the electronic device.

Memory 1202 may be used to store software programs and modules, i.e., the computer-readable storage media described above. The processor 1201 performs various functional applications and data processing by executing software programs and modules stored in the memory 1202. The memory 1202 may mainly include a storage program area that may store an operating system, application programs required for at least one function, and the like, and a storage data area; the storage data area may store data created according to the use of the server, etc. In addition, memory 1202 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 1202 may also include a memory controller to provide the processor 1201 with access to the memory 1202.

The electronic device further comprises a power supply 1203 for supplying power to the respective components, which may be logically connected to the processor 1201 by a power management system, so that functions of managing charging, discharging, power consumption management, etc. are implemented by the power management system. The power supply 1203 may also include one or more of any components, such as a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.

The electronic device may also include an input output unit 1204, which input unit output 1204 may be used to receive input numeric or character information, as well as to generate keyboard, mouse, joystick, optical signal inputs related to user settings and function control. The input unit output 1204 may also be used to display information entered by a user or provided to a user as well as various graphical user interfaces that may be composed of graphics, text, icons, video, and any combination thereof.

The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather to enable any modification, equivalent replacement, improvement or the like to be made within the spirit and principles of the invention.

Claims

1. An image retrieval method, comprising:

Wherein the performing quality evaluation on the predicted image includes:

2. The method according to claim 1, wherein global image feature extraction is performed on the predicted image by a pre-trained image feature extractor to obtain the global image feature;

3. The method of claim 2, wherein said evaluating the quality of the predicted image further comprises:

4. A method according to claim 3, wherein said determining a quality class of the predicted image based on a feature extraction result determined by feature extraction of the predicted image comprises:

5. The method of claim 4, wherein the image feature extractor, the text feature extractor, the first global feature mapper, the second global feature mapper, and the multi-attribute image quality gear evaluator are co-trained.

6. The method of claim 5, wherein the means for updating parameters when the image feature extractor, the text feature extractor, the global feature mapper, and the multi-attribute image quality gear evaluator are jointly trained comprises:

7. The method of claim 6, wherein determining the value of the first loss function comprises:

8. The method of claim 6, wherein determining the value of the second loss function comprises:

9. The method of claim 6, wherein determining the value of the third loss function comprises:

10. The method of claim 1, wherein said determining the overall quality score of the predicted image comprises:

11. The method of claim 10, wherein the quality assessment result does not meet a set requirement, comprising: the global quality score is smaller than a set global threshold, and/or the quality score of each local attribute is smaller than a local attribute threshold set for the corresponding local attribute, and/or the overall quality score is smaller than a set overall threshold;

12. The method of claim 4, wherein the quality assessment result does not meet a set requirement, comprising: the quality classification prediction result of the predicted image on the first set local attribute is a preset first quality classification result;

13. The method of claim 1, wherein the predicted image comprises a front predicted image, a side predicted image, and a back predicted image;

14. An image retrieval apparatus based on a text description, comprising: the device comprises a text description acquisition unit, a text generation picture unit and an image retrieval unit;

15. The apparatus according to claim 14, further comprising a quality evaluation unit between the document map unit and the image retrieval unit for performing quality evaluation on the predicted image to obtain a quality evaluation result; when the quality evaluation result meets the set requirement, notifying the image retrieval unit to perform image retrieval; and prompting a user to modify the text description based on the quality evaluation result when the quality evaluation result does not meet the set requirement, and notifying the text description acquisition unit to acquire the text description input by the user again.

16. The apparatus according to claim 15, wherein in the quality evaluation unit, the performing quality evaluation on the predicted image includes:

17. The apparatus of claim 16, wherein the quality assessment unit comprises an image feature extractor, a text feature extractor, a first global feature mapper, a second global feature mapper, and a quality score extractor;

18. The apparatus according to claim 17, wherein the quality evaluation unit is further configured to determine a quality classification of the predicted image based on a feature extraction result determined by feature extraction of the predicted image.

19. The apparatus according to claim 18, wherein said quality evaluation unit further comprises a multi-attribute image quality gear estimator for classifying the prediction results of the local image features on all the set local attributes to obtain a quality classification result of said predicted image on each of said set local attributes; wherein the set local attribute includes a part or all of the local attribute.

20. The apparatus of claim 19, wherein the image feature extractor, the text feature extractor, the first global feature mapper, the second global feature mapper, and the multi-attribute image quality gear evaluator are co-trained.

21. The apparatus of claim 20, wherein in the quality assessment unit, when the image feature extractor, the text feature extractor, the global feature mapper, and the multi-attribute image quality gear assessor perform joint training, the parameter updating means comprises:

22. The apparatus of claim 21, wherein the determining the value of the first loss function comprises:

23. The apparatus of claim 21, wherein the determining the value of the second loss function comprises:

24. The apparatus of claim 21, wherein the determining the value of the third loss function comprises:

25. The apparatus of claim 17, wherein in the quality scorer, the determining the overall quality score for the predicted image comprises:

26. The apparatus of claim 25, wherein the quality assessment result does not meet a set requirement, comprising: the global quality score is smaller than a set global threshold, and/or the quality score of each local attribute is smaller than a local attribute threshold set for the corresponding local attribute, and/or the overall quality score is smaller than a set overall threshold;

27. The apparatus of claim 19, wherein the quality assessment result does not meet a set requirement, comprising: the quality classification prediction result of the predicted image on the first set local attribute is a preset first quality classification result;

28. The apparatus according to claim 14 or 15, wherein the predicted image includes a front predicted image, a side predicted image, and a back predicted image;

29. A computer readable storage medium having stored thereon computer instructions, which when executed by a processor, implement the image retrieval method of any of claims 1 to 13.

30. An electronic device comprising at least a computer-readable storage medium and a processor;

the processor is configured to read executable instructions from the computer readable storage medium and execute the instructions to implement the image retrieval method of any one of claims 1-13.