CN113806582A - Image retrieval method, image retrieval device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113806582A
CN113806582A (application CN202111359989.1A)
Authority
CN
China
Prior art keywords
image
semantic
images
target
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111359989.1A
Other languages
Chinese (zh)
Other versions
CN113806582B (en)
Inventor
郭卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111359989.1A priority Critical patent/CN113806582B/en
Publication of CN113806582A publication Critical patent/CN113806582A/en
Application granted granted Critical
Publication of CN113806582B publication Critical patent/CN113806582B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Abstract

The application provides an image retrieval method, an image retrieval device, an electronic device and a storage medium, relates to the field of computer technology, and can be applied to scenes such as cloud technology, artificial intelligence, intelligent traffic and assisted driving. The method comprises the following steps: recalling a first image set whose basic features are similar to the target basic features of a query image; selecting, from the first image set, a plurality of candidate images whose semantic features are similar to the target semantic features of the query image; recalling, based on the respective semantic features of the candidate images, second image sets whose semantic features are similar to those of the corresponding candidate images; and determining a retrieved target image set based on the recalled second image sets and the first image set. By adopting a two-stage recall mode, the method improves the retrieval recall rate and thus the accuracy of the retrieval result.

Description

Image retrieval method, image retrieval device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to an image retrieval method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of deep learning, image retrieval techniques based on deep learning are widely applied. At present, for an image library to be retrieved, an image feature extraction model trained by metric learning is used to extract the basic feature embedding of each image in the library, and each image is represented by its embedding.
When similar images of a query image need to be retrieved from the image library, the same image feature extraction model is used to extract the basic feature embedding of the query image; the similarity between this embedding and the embedding of each image in the library is then calculated, the images whose similarity meets the requirement are taken as recalled images, and the retrieval result is obtained from the recalled images.
However, with this image retrieval method only images with similar basic features can be recalled; images that resemble the query image in features other than the basic features cannot be recalled through the basic features, so too few images are recalled and the retrieval result is inaccurate.
Disclosure of Invention
The embodiment of the application provides an image retrieval method, an image retrieval device, electronic equipment and a storage medium, which are used for recalling enough images in the image retrieval process, so that the accuracy of a retrieval result is improved.
In one aspect, an embodiment of the present application provides an image retrieval method, including:
acquiring a target basic feature and a target semantic feature of a query image; wherein the target base features are used for characterizing semantic-free features of the query image;
recalling the first image set; the basic similarity between the basic features of the images in the first image set and the target basic features meets a first preset condition;
selecting a plurality of candidate images from the first set of images; semantic similarity between each semantic feature of the candidate images and the target semantic feature meets a second preset condition;
recalling the corresponding second image set based on respective semantic features of the plurality of candidate images; semantic similarity between each image contained in each second image set and semantic features of corresponding candidate images meets a third preset condition;
determining a retrieved target image set based on the recalled plurality of second image sets and the first image set.
In one aspect, an embodiment of the present application provides an image retrieval apparatus, including:
the characteristic acquisition module is used for acquiring a target basic characteristic and a target semantic characteristic of the query image; wherein the target base features are used for characterizing semantic-free features of the query image;
a first recall module to recall a first set of images; the basic similarity between the basic features of the images in the first image set and the target basic features meets a first preset condition;
a first selection module to select a plurality of candidate images from the first set of images; semantic similarity between each semantic feature of the candidate images and the target semantic feature meets a second preset condition;
the second recalling module is used for recalling the corresponding second image set based on the respective semantic features of the candidate images; semantic similarity between each image contained in each second image set and semantic features of corresponding candidate images meets a third preset condition;
and the determining module is used for determining the retrieved target image set based on the recalled second image sets and the first image set.
In a possible embodiment, the first recall module is specifically configured to:
selecting a plurality of base reference indexes from a plurality of base feature indexes in an image library; each basic feature index is a clustering center of a plurality of associated basic features, and the basic similarity of each basic reference index and the target basic feature is not less than a first similarity threshold value;
obtaining a plurality of basic characteristics respectively associated with the plurality of basic reference indexes;
and generating the first image set according to the images corresponding to the acquired basic features and the images corresponding to the basic reference indexes.
In a possible embodiment, the second recall module is specifically configured to:
for the plurality of candidate images, respectively performing the following operations:
selecting a plurality of semantic reference indexes from a plurality of semantic feature indexes in an image library; each semantic feature index is a clustering center of a plurality of associated semantic features, and the semantic similarity between each selected semantic reference index and the semantic feature of one candidate image is not less than a second similarity threshold;
obtaining a plurality of semantic features associated with the semantic reference indexes respectively;
and generating a corresponding second image set according to the acquired images corresponding to the semantic features and the images corresponding to the semantic reference indexes.
In a possible embodiment, the feature obtaining module is specifically configured to:
inputting the query image into a trained image feature extraction model to obtain the target basic feature; the image feature extraction model is obtained based on triple sample data set training, and each triple sample in the triple sample data set comprises a reference image, an image similar to the reference image and an image dissimilar to the reference image;
inputting the query image into a trained image semantic extraction model to obtain the target semantic features; the image semantic extraction model is obtained by training based on a triple sample data set marked with object categories.
In a possible embodiment, the determining module is specifically configured to:
for each image included in the plurality of second image sets and the first image set, sorting the images according to the basic similarity between the respective basic feature of each image and the target basic feature and the semantic similarity between the respective semantic feature of each image and the target semantic feature;
and selecting a plurality of target images from the sorted images to obtain a searched target image set.
In a possible embodiment, the apparatus further comprises:
the category acquisition module is used for acquiring target category information of the query image, wherein the target category information is used for representing the object category contained in the query image;
a third recall module to recall a third set of images; the respective category information of the images included in the third image set is the same as the target category information;
a fourth recall module, configured to recall a corresponding fourth image set based on respective semantic features of multiple reference images included in the third image set; semantic similarity between each image contained in each fourth image set and semantic features of corresponding reference images meets a fifth preset condition;
the determining module is specifically further configured to:
determining a retrieved target image set based on the first image set, the plurality of second image sets, the third image set, and the plurality of fourth image sets.
In a possible embodiment, the third recall module is specifically configured to:
selecting at least one object class index matched with the target class information from a plurality of object class indexes in an image library; each object class index is associated with a plurality of images containing corresponding object classes;
and acquiring a plurality of images respectively associated with the selected at least one object class index, and generating the third image set according to the acquired images.
In a possible embodiment, the fourth recall module is specifically configured to:
selecting a plurality of reference images from the third image set, and taking the query image as one reference image; the semantic similarity between the semantic features of the multiple reference images and the target semantic features is not less than a third similarity threshold;
recalling the corresponding fourth image set based on the respective semantic features of the obtained multiple reference images; and the semantic similarity between each image contained in each fourth image set and the semantic features of the corresponding reference image is not less than a fourth similarity threshold.
In a possible embodiment, when recalling the corresponding fourth image set based on the respective semantic features of the obtained multiple reference images, the fourth recall module is further specifically configured to:
for the respective semantic features of the plurality of reference images, the following operations are respectively performed:
selecting a plurality of semantic reference indexes from a plurality of semantic feature indexes in an image library; each semantic feature index is a clustering center of a plurality of associated semantic features, and the semantic similarity between each selected semantic reference index and the semantic feature of one reference image is not less than the fourth similarity threshold;
obtaining a plurality of semantic features associated with the semantic reference indexes respectively;
and generating a corresponding fourth image set according to the images corresponding to the acquired semantic features and the images corresponding to the semantic reference indexes.
In a possible embodiment, the feature obtaining module is further specifically configured to:
inputting the query image into a trained image semantic extraction model to obtain the target semantic features;
the image semantic extraction model is obtained by training based on a triple sample data set marked with an object class, wherein each triple sample in the triple sample data set comprises a reference image, an image similar to the reference image and an image dissimilar to the reference image;
the category acquisition module is specifically configured to:
and inputting the query image into the trained image semantic extraction model to obtain the predicted target category information.
In a possible embodiment, the determining module is specifically further configured to:
for each image included in the first image set, the second image sets, the third image set and the fourth image sets, sorting the images according to a basic similarity between a basic feature of each image and the target basic feature, a semantic similarity between a semantic feature of each image and the target semantic feature, and an evaluation value of category information of each image;
and selecting a plurality of target images from the sorted images to obtain a searched target image set.
In one possible embodiment, the evaluation value of the respective category information of the respective images is obtained by:
determining at least one object type contained in the target type information, and acquiring a first prediction probability of each object type;
for each image, respectively executing the following operations:
and determining an evaluation value of the category information of the image according to the second prediction probability of each of the at least one object category and the first prediction probability of each of the at least one object category in the category information of the image.
In one aspect, an embodiment of the present application provides an electronic device, which includes a processor and a memory, where the memory stores program code, and when the program code is executed by the processor, the processor is caused to execute the steps of any one of the image retrieval methods described above.
In one aspect, embodiments of the present application provide a computer storage medium storing computer instructions, which when executed on a computer, cause the computer to perform the steps of any one of the image retrieval methods described above.
In one aspect, an embodiment of the present application provides a computer program product, which includes computer instructions stored in a computer-readable storage medium; when the processor of the electronic device reads the computer instructions from the computer-readable storage medium, the processor executes the computer instructions, so that the electronic device executes the steps of any one of the image retrieval methods described above.
Due to the adoption of the technical scheme, the embodiment of the application has at least the following technical effects:
in the scheme of the embodiment of the application, when a similar image of a query image needs to be retrieved, firstly, based on basic features (representing semantic-free features) of the query image, a first image set with the basic features similar to the basic features of the query image is recalled; then, selecting a plurality of candidate images with similar semantic features to those of the query image from the first image set; then, based on the semantic features of each candidate image, recalling a second image set with similar semantic features to the semantic features of the candidate images; determining a retrieved target image set based on the recalled plurality of second image sets and the first image set.
In the scheme, the two-stage recall mode is adopted, so that the image with similar basic characteristics to the query image can be recalled, and the image with similar semantic characteristics to the query image can be further recalled, so that enough images can be recalled, the recall rate of retrieval is improved, and the accuracy of the retrieval result is improved.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
Fig. 1 is a schematic view of an application scenario of an image retrieval method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an image semantic extraction model provided in an embodiment of the present application;
fig. 3 is a schematic diagram of an image feature extraction model provided in an embodiment of the present application;
fig. 4 is a schematic diagram illustrating establishment of an index system of an image library according to an embodiment of the present disclosure;
fig. 5 is a flowchart of an image retrieval method according to an embodiment of the present application;
FIG. 6 is a flowchart of another image retrieval method provided in the embodiments of the present application;
FIG. 7 is a logic diagram of a retrieval based on the underlying feature indexing system according to an embodiment of the present application;
FIG. 8 is a flowchart of another image retrieval method provided in the embodiments of the present application;
FIG. 9 is a logic diagram of a semantic feature indexing system-based search according to an embodiment of the present disclosure;
FIG. 10 is a flowchart of another image retrieval method provided in the embodiments of the present application;
FIG. 11 is a flowchart of another image retrieval method provided in the embodiments of the present application;
FIG. 12 is a logic diagram illustrating a search based on a category indexing system according to an embodiment of the present application;
FIG. 13 is a flowchart of another image retrieval method provided in the embodiments of the present application;
fig. 14 is a logic diagram of an image retrieval method according to an embodiment of the present application;
fig. 15 is a block diagram of an image retrieval apparatus according to an embodiment of the present application;
fig. 16 is a block diagram illustrating an image retrieval apparatus according to an embodiment of the present application;
fig. 17 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 18 is a schematic structural diagram of another electronic device in this embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to facilitate those skilled in the art to better understand the technical solutions of the present application, some concepts related to the present application will be described below.
Embedding: a feature extracted from raw data, i.e., a low-dimensional vector obtained by mapping the data through a neural network. Graph embedding maps graph data (typically a high-dimensional dense matrix) into low-dimensional dense vectors, and needs to capture the topology of the graph, vertex-to-vertex relationships, and other information (e.g., subgraphs and edges).
Image retrieval: in conventional image retrieval, an image feature embedding is extracted from each image in the image library, and the images closest in distance to the embedding of the query image are taken as recalled images.
Image retrieval recall supplementing: conventional embedding-based recall suffers from insufficient recall due to adversarial image changes, insufficient embedding representation, and similar problems; recall supplementing is a technique that compensates for insufficient embedding recall. Different application scenarios require different recall strategies; in clothing retrieval, for example, clothing labels, user interest labels and the like are considered for recall. In the embodiments of the present application, the supplementary recall is performed based specifically on image semantics.
ImageNet: a large-scale open-source dataset for generic object recognition.
ImageNet pre-trained model: a deep learning network model trained on ImageNet to obtain its parameter weights. The image semantic extraction model in the following embodiments of the present application can be understood as an ImageNet pre-trained model.
Deep metric learning: aims to learn a mapping from original features to a low-dimensional dense vector space (the embedding space) such that, under a common distance function (Euclidean distance, cosine distance, etc.), similar objects are close in the embedding space while dissimilar objects are far apart. Deep metric learning is widely applied in computer vision, e.g., face recognition, face verification, image retrieval, signature verification and pedestrian re-identification; the embodiments of the present application apply it to image retrieval.
Semantic features of an image: the semantic embedding extracted by the ImageNet pre-trained model (the image semantic extraction model).
Basic features of an image: the image embedding extracted by the image feature extraction model trained with deep metric learning; it can be understood as the semantic-free feature of the image.
Category information of an image: the object categories the computer recognizes in the image. An image may contain multiple object categories; the multi-class prediction task predicts which object categories an image contains, thereby obtaining its category information.
The word "exemplary" is used hereinafter to mean "serving as an example, embodiment, or illustration. Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The terms "first" and "second" are used herein for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature, and in the description of embodiments of the application, unless stated otherwise, "plurality" means two or more.
The following briefly introduces the design concept of the embodiments of the present application:
in the related technology, an image feature extraction model based on metric learning training in deep learning is adopted for an image library to be retrieved, basic features embedding of each image in the image library are respectively extracted, and corresponding images are represented through the embedding.
When the similar images of the query images need to be retrieved from the image library, the image feature extraction model is adopted to extract the basic features embedding of the query images, then the basic features embedding of the query images and the basic features embedding of each image in the image library are subjected to similarity calculation, the images with similarity meeting the requirements are used as recall images, and then retrieval results are obtained according to the recall images.
The image feature extraction model is generally obtained as follows: and training the initial image feature extraction model based on a depth measurement learning method to obtain a trained image feature extraction model by adopting an image sample set labeled by a triple, wherein the triple comprises a target image, an image similar to the target image and an image dissimilar to the target image.
Because the embedding extracted by the image feature extraction model trained with deep metric learning does not capture image semantic information, retrieval results obtained from embedding similarity may miss semantically similar images; in addition, if an image similar to the query image has undergone an image transformation (i.e., its pixels have changed), the transformed image cannot be retrieved through the aforementioned embedding of the query image.
Therefore, by adopting the image retrieval method, only images with similar basic features can be recalled, and some images with similar features to other features of the query image cannot be recalled through the basic features, so that the recalled images are insufficient, and the retrieval result is not accurate enough.
In view of this, embodiments of the present application provide an image retrieval method, an apparatus, an electronic device, and a storage medium, which employ a two-stage recall method, so that not only an image having basic features similar to a query image can be recalled, but also an image having semantic features similar to the query image can be recalled further, thereby recalling enough images, improving the recall rate of retrieval, and further improving the accuracy of retrieval results.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it should be understood that the preferred embodiments described herein are merely for illustrating and explaining the present application, and are not intended to limit the present application, and that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Fig. 1 is a schematic view of an application scenario in the embodiment of the present application. The application scenario diagram includes a plurality of terminal devices 110 and a server 120, and the terminal devices 110 and the server 120 can communicate with each other through a communication network.
In an alternative embodiment, the communication network may be a wired network or a wireless network.
In the embodiment of the present application, the terminal device 110 includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a desktop computer, an e-book reader, an intelligent voice interaction device, an intelligent household appliance, a vehicle-mounted terminal, and other devices; the terminal device may be installed with a client related to image recommendation, where the client may be software (e.g., a browser, video software, etc.), or a web page, an applet, etc., and the server 120 is a background server corresponding to the software, or the web page, the applet, etc., or a server specially used for image retrieval, which is not limited in this application. The server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform.
It should be noted that the image retrieval method in the embodiments of the present application may be executed by an electronic device, which may be the server 120 or the terminal device 110; that is, the method may be executed by the server 120 or the terminal device 110 alone, or by the server 120 and the terminal device 110 together. When executed by the terminal device 110 alone, for example, the terminal device 110 may acquire a query image input by a user, acquire the target basic features and target semantic features of the query image, and perform subsequent processing based on them. When executed by the server 120 alone, for example, the terminal device 110 may acquire a query image input by a user and send it to the server 120, and the server 120 acquires the target basic features and target semantic features of the query image and performs subsequent processing based on them. When executed by the server 120 and the terminal device 110 together, for example, the terminal device 110 may acquire a query image input by a user, acquire its target basic features and target semantic features, and send them to the server, which performs subsequent processing based on them. Hereinafter the server is mainly used as an example for illustration, without specific limitation.
In a specific implementation, a user may input a query image in the terminal device 110, the terminal device sends the acquired query image to the server 120, and the server 120 may retrieve a similar image of the query image by using the image retrieval method according to the embodiment of the present application. Specifically, a target basic feature and a target semantic feature of the query image are obtained, and subsequent processing is performed based on the target basic feature and the target semantic feature so as to recall enough similar images from the image library, and further, a plurality of retrieved target images are determined based on the recalled similar image set.
It should be noted that fig. 1 is only an example, and the number of the terminal devices and the servers is not limited in practice, and is not specifically limited in the embodiment of the present application.
In the embodiment of the application, when the number of the servers is multiple, the multiple servers can be combined into a block chain, and the servers are nodes on the block chain; the image retrieval method as disclosed in the embodiment of the present application, wherein the image data involved may be saved on the block chain, for example, the image data includes each image in the image library, and the basic feature and semantic feature of each image, and the like.
The image retrieval method provided by the exemplary embodiment of the present application is described below with reference to the drawings in conjunction with the application scenarios described above, it should be noted that the application scenarios described above are only shown for the convenience of understanding the spirit and principles of the present application, and the embodiments of the present application are not limited in this respect. Moreover, the embodiment of the application can be applied to various scenes, including not only image processing scenes, but also scenes such as cloud technology, artificial intelligence, intelligent traffic, driving assistance and the like.
Before the image retrieval method according to the embodiment of the present application is described, an image semantic extraction model and an image feature extraction model related to the method are described first.
Image semantic extraction model
In the embodiment of the present application, the image semantic extraction model may include the following modules: the device comprises a convolutional neural network module, a depth measurement learning module and a category prediction module.
Optionally, the convolutional neural network module may adopt a deep residual network (ResNet), specifically ResNet101, whose parameters are shown in Table 1 below; the depth metric learning module may include a max pooling layer and a fully connected layer, whose parameters are shown in Table 2 below; the class prediction module is a classification module and may include a fully connected layer, whose parameters are shown in Table 3 below.
It should be noted that the specific structures of the above modules are only exemplary, and other model structures may be selected as needed; for example, the convolutional neural network module may also use ResNet50, InceptionV4, ResNet18, and the like, which is not limited herein.
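As a rough, non-authoritative sketch of this module layout (the class name, the torchvision weight identifier, and the 2048-dimensional backbone output are assumptions for illustration, not the patented implementation):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class SemanticExtractor(nn.Module):
    """Sketch of the three modules: CNN backbone (Conv1-Conv5),
    metric-learning embedding head, and multi-label category head."""
    def __init__(self, emb_dim=128, n_class=1000):
        super().__init__()
        backbone = models.resnet101(weights="IMAGENET1K_V1")  # ImageNet init
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # drop pool/fc
        self.pool = nn.AdaptiveMaxPool2d(1)        # max pooling layer (Table 2)
        self.embedding = nn.Linear(2048, emb_dim)  # embedding layer (Table 2)
        self.fc = nn.Linear(emb_dim, n_class)      # class prediction (Table 3)

    def forward(self, x):
        h = self.pool(self.cnn(x)).flatten(1)
        emb = self.embedding(h)   # semantic embedding
        logits = self.fc(emb)     # one logit per object class (multi-label)
        return emb, logits
```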
The following description will be made of a preparation process and a training process of the image semantic extraction model, taking the model structures in tables 1 to 3 as examples.
(I) image semantic extraction model preparation Process
(1) Parameter initialization: Conv1-Conv5 in Table 1 are initialized with the parameters of ResNet101 pre-trained on the ImageNet dataset, and the embedding layer in Table 2 and the Fc layer in Table 3 with a Gaussian distribution with variance 0.01 and mean 0. ResNet101 may also be pre-trained on other open-source datasets, which is not limited herein.
(2) Setting learning parameters: the learning parameters in table 1, table 2, and table 3 are set.
(3) Learning rate: to improve the learning effect, asynchronous learning rates are adopted: the Conv1-Conv5 layers and the embedding layer use a learning rate of lr1 = 0.0005, and the Fc layer uses lr = lr1 × 10 = 0.005.
The Fc layer overfits its target more easily (category prediction is a classification task whose aim is that two samples of the same category output the same predicted category, i.e., fit the same learning target; under overfitting, the category prediction gives the same result for many dissimilar images), and such semantic overfitting in turn makes the embedding overfit, i.e., the embeddings of two images of the same category become identical (the embedding overfits the classification target), leaving the embeddings without discrimination between images. The asynchronous learning rate makes the embedding layer update more slowly than the Fc layer (at 0.1 times the Fc layer's update rate), which effectively prevents the embedding layer from overfitting the classification target during parameter updates.
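A minimal sketch of how such asynchronous learning rates might be configured with per-parameter-group settings in PyTorch (the stand-in layers and the momentum value are assumptions):

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the three module groups described above.
model = nn.ModuleDict({
    "cnn": nn.Conv2d(3, 64, 3),          # placeholder for Conv1-Conv5
    "embedding": nn.Linear(2048, 128),   # embedding layer
    "fc": nn.Linear(128, 1000),          # Fc classification layer
})
lr1 = 0.0005
optimizer = torch.optim.SGD([
    {"params": model["cnn"].parameters(),       "lr": lr1},
    {"params": model["embedding"].parameters(), "lr": lr1},
    {"params": model["fc"].parameters(),        "lr": lr1 * 10},  # 0.005
], momentum=0.9)
```

With these groups, the embedding layer updates at 0.1 times the rate of the Fc layer, which is the overfitting guard described above.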
TABLE 1
[Table 1: parameters of the convolutional module Conv1-Conv5; rendered as an image in the original publication]
TABLE 2
[Table 2: parameters of the depth metric learning module (max pooling and embedding layers); rendered as an image in the original publication]
Wherein 128 is the embedding dimension.
TABLE 3
[Table 3: parameters of the Fc class prediction layer; rendered as an image in the original publication]
Where Nclass is the number of object classes; each object class may be represented by a label, and there are 1000 labels when training with the ImageNet open-source image dataset.
(II) image semantic extraction model training process
1. Acquiring a training sample data set
A large number of sample images are acquired and annotated as triplets, yielding a large number of triplet samples, where each triplet sample comprises a reference image (anchor), an image similar or identical to the anchor (positive), and an image dissimilar to the anchor (negative).
Generally, each triplet sample is labeled by randomly picking 3 images (anchor, positive and negative) meeting the rules from all sample images. The generated mass of triplet samples contains many easy-to-learn samples (easy samples). Easy samples help the model learn initially, but once the model distinguishes them well, the accumulated loss of the huge number of easy samples far exceeds that of the hard-to-learn samples (hard samples), so the hard samples are buried among the easy ones; a large number of hard samples is therefore needed in the later stage of training.
Based on this, in the embodiment of the present application, the labeling process of a large number of triple samples is adjusted to the following process:
the positive sample pairs (i.e., similar pairs of sample images) are first labeled, and all the positive sample pairs are divided into batches, one batch being one batch, containing bs positive sample pairs. In each of the bs sample pairs of the batch, the triple samples are mined as follows: for a certain sample image x, randomly selecting one sample image from the remaining (bs-1) sample pairs respectively, calculating the similarity between each selected sample image and x, sorting the sample images from small to large according to the similarity, taking the first n sample images as negative samples, and forming triples with the positive sample pair where x is located respectively, so that each sample can generate n triples, and the whole batch can obtain n × bs triples. Wherein, the value of n can be set according to requirements, such as 10; the value of bs may be set as desired, for example, a relatively larger value such as 256 may be set.
Further, each sample image is annotated with multiple labels. Assuming 1000 labels need to be learned in the embodiment of the present application, when a sample image contains a certain object category, the label corresponding to that category is 1, and the labels of categories it does not contain are 0. An image may contain none of the labels, i.e., all of its labels are 0.
2. Training image semantic extraction model based on triple sample set
The triplet sample set is trained for several epochs, where one epoch is one full training pass over all triplet samples.
The specific operations in each iteration are as follows: the triplet sample set is divided into multiple batches, each batch containing batch-size samples, and each batch is processed as follows:
a. Model forward: the training parameters of the model are set to the learnable state; during training, the neural network ResNet101 performs forward computation on the input sample images to obtain the prediction results: the embedding layer outputs the semantic embedding and the Fc layer outputs the category information (comprising multiple labels), as shown in Fig. 2.
b. Loss calculation: the triplet loss is calculated for the semantic embedding output by the embedding layer, and the multi-class loss, specifically a binary cross entropy (bce) loss, is calculated for the output of the Fc layer; the two losses are summed to obtain the total loss, as shown in formula (1).
Ltotal = w1 * Ltri + w2 * Lcls        (1)

wherein w1 and w2 are weight coefficients that can be set as desired, Ltri denotes the triplet loss, and Lcls denotes the category (multi-label) loss. The calculation of Ltri and Lcls is described below.

First, after the triplet samples (anchor, positive, negative) are mined from the sample images of each batch, the triplet loss Ltri is calculated over their semantic embeddings as shown in formula (2), where alpha is the margin (for example set to 4) and d(·,·) denotes the L2 distance between two semantic embeddings; the purpose of formula (2) is to make the anchor-negative distance exceed the anchor-positive distance by at least the margin:

Ltri = max(0, d(e_anchor, e_positive) - d(e_anchor, e_negative) + alpha)        (2)

Secondly, for the prediction probabilities of the category information (multiple labels) output by the Fc layer, the loss against the annotated labels is calculated, i.e., the multi-class loss Lcls. Specifically, for a sample image i, its ground-truth label vector t[i] is a 0/1 vector of size 1 × 1000 (assuming 1000 labels in total), and its prediction o[i] contains the predicted probabilities for the 1000 labels; the multi-class loss is the binary cross entropy of formula (3). When the true value of a label bit is 1, the term to the left of the plus sign in formula (3) takes effect; when it is 0, the term to the right takes effect, so that the supervision information of a sample image under all labels can be learned:

Lcls = - Σj ( t[i][j] * log(o[i][j]) + (1 - t[i][j]) * log(1 - o[i][j]) )        (3)
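A compact PyTorch sketch of formulas (1)-(3) may look as follows (the default weights and margin, the tensor shapes, and the assumption that the Fc layer outputs raw logits are all illustrative):

```python
import torch
import torch.nn.functional as F

def total_loss(emb_a, emb_p, emb_n, logits, targets,
               w1=1.0, w2=1.0, margin=4.0):
    """Sketch of formulas (1)-(3). emb_a/emb_p/emb_n: semantic embeddings
    of anchor/positive/negative, shape (batch, 128); logits: Fc outputs,
    shape (batch, 1000); targets: 0/1 multi-label ground truth."""
    # formula (2): push the anchor-negative L2 distance beyond the
    # anchor-positive L2 distance by at least the margin
    d_ap = F.pairwise_distance(emb_a, emb_p, p=2)
    d_an = F.pairwise_distance(emb_a, emb_n, p=2)
    l_tri = torch.clamp(d_ap - d_an + margin, min=0).mean()

    # formula (3): binary cross entropy over all 1000 label bits
    # (assumes raw logits; sigmoid is applied internally)
    l_cls = F.binary_cross_entropy_with_logits(logits, targets.float())

    # formula (1): weighted sum of the two losses
    return w1 * l_tri + w2 * l_cls
```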
c. Model parameter update: a stochastic gradient descent method performs the backward gradient computation on the loss from the previous step to obtain the updated values of all model parameters, and the model parameters are updated.
With the image semantic extraction model trained by the above process, the category information of an image (comprising multiple labels) can be predicted and, at the same time, the semantic embedding of the image, i.e., its semantic features, can be extracted.
Second, image feature extraction model
The image feature extraction model may include a convolutional neural network module and a depth metric learning module, whose structures may be the same as in the image semantic extraction model; for example, the structures of Table 1 and Table 2 may be adopted, without the Fc classification layer of Table 3, as shown in Fig. 3.
The model learns using the triplet sample set described above. In the loss calculation only the triplet loss needs to be computed; the learning method is otherwise consistent with that of the image semantic extraction model and is not repeated here.
Based on the image feature extraction model after the triple sample set training, the similarity embedding of the image can be extracted, and the similarity embedding can be understood as the basic feature of the image, wherein the basic feature is a semantic-free feature.
After the model is introduced, an indexing system for an image library constructed based on the model is described below.
Referring to fig. 4, the construction process of the index system of the image library includes:
1) for all images in the image library, the following processing is performed:
the trained image feature extraction model is adopted to respectively extract the similarity embeddings of all the images, and after the similarity embeddings of all the images are obtained, the clustering centers of the similarity embeddings are trained, for example, a kmean algorithm is adopted to cluster all the similarity embeddings, for example, 1000 ten thousand similarity embeddings are adopted to train 1 ten thousand clustering centers.
Similarly, the trained image semantic extraction model is used to extract the semantics embedding of all the images, and after obtaining the semantics embedding of all the images, the clustering centers of the semantics embedding are trained, for example, a kmean algorithm is used to cluster all the similarity embedding, for example, 1000 ten thousand semantics embedding are used to train 1 ten thousand clustering centers.
2) Basic index system: each obtained similarity-embedding cluster center serves as a basic feature index for retrieval; each image in the image library is associated with the basic feature index (cluster center) closest to it, establishing the association between images and basic feature indexes.
3) Semantic index system: each obtained semantic-embedding cluster center serves as a semantic feature index for retrieval; each image in the image library is associated with the semantic feature index (cluster center) closest to it, establishing the association between images and semantic feature indexes.
4) Category index system: for each image in the image library, the object categories it contains are predicted by the image semantic extraction model, each category being representable by a label. For example, with 1000 object categories in total, each object category serves as an object category index, and each object category index is associated with all images containing that category.
It should be noted that, since 1000 labels cannot describe all images, some images may carry none of the 1000 labels, and for some images the model may fail to predict any label due to limited representation capability; this can be solved by the secondary recall scheme described below.
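A simplified sketch of building one of the feature index systems with k-means (scikit-learn is assumed purely for illustration; over 10 million embeddings a production system would more likely use an approximate-nearest-neighbor library such as faiss):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_feature_index(features, n_centers=10000):
    """Cluster all image embeddings and attach each image to its nearest
    cluster center, as in the basic/semantic index systems above.
    features: array of shape (n_images, dim)."""
    kmeans = KMeans(n_clusters=n_centers, n_init=1).fit(features)
    centers = kmeans.cluster_centers_            # the feature indexes
    members = {c: [] for c in range(n_centers)}  # index id -> image ids
    for img_id, center_id in enumerate(kmeans.labels_):
        members[center_id].append(img_id)
    return centers, members
```

The same routine can be run once on the similarity embeddings (basic index system) and once on the semantic embeddings (semantic index system).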
The following describes a specific implementation of the image retrieval method provided in the embodiment of the present application.
Referring to fig. 5, an implementation flow chart of an image retrieval method provided in the embodiment of the present application is shown, which is described by taking a server as an execution subject, and a specific implementation flow of the method is as follows:
s501, acquiring target basic features and target semantic features of a query image; the target basic features are used for representing semantic-free features of the query image.
The server can receive a query image sent by the terminal equipment of the user, and then extract the target basic features and the target semantic features of the query image. Optionally, the target basic feature and the target semantic feature of the query image may be extracted based on the image feature extraction model and the image semantic extraction model, respectively, and the specific implementation steps are as follows:
and A1, inputting the query image into the trained image feature extraction model to obtain the target basic features.
According to the embodiment of the present application, the image feature extraction model is obtained by training based on a triple sample data set, each triple sample in the triple sample data set includes a reference image, an image similar to the reference image, and an image dissimilar to the reference image, and the specific structure and training process of the model are referred to the embodiment and are not described herein again.
The target basic features extracted based on the image feature extraction model can be used as a semantic-free representation of the query image, and used for calculating the similarity between the query image and other images, which can be understood as the similarity embedding in the above embodiments.
A2, inputting the query image into the trained image semantic extraction model to obtain the target semantic features.
According to the embodiments of the present application, the image semantic extraction model is obtained by training based on a triple sample data set labeled with multiple labels, where the multiple labels can be understood as multiple object categories, and the specific structure and training process of the model are referred to the above embodiments and are not described herein again.
Optionally, the image semantic extraction model includes a convolutional neural network module and a depth metric learning module; the query image is input into the convolutional neural network module, whose output is then input into the depth metric learning module, which outputs the target semantic features; these can be understood as the semantic embedding in the above embodiments.
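Putting A1 and A2 together, feature extraction for a query image might look like the following sketch (the preprocessing pipeline is an assumption; `semantic_model` is assumed to return the embedding together with the class logits, and `feature_model` the embedding alone):

```python
import torch
from torchvision import transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_query_features(path, feature_model, semantic_model):
    """Steps A1/A2: run both trained extractors on the query image."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        target_base = feature_model(x)       # target basic feature (A1)
        target_sem, _ = semantic_model(x)    # target semantic feature (A2)
    return target_base.squeeze(0), target_sem.squeeze(0)
```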
S502, recalling the first image set; the basic similarity between the basic features of the images in the first image set and the target basic features meets a first preset condition.
In this step, the basic similarity between the basic features of each image in the image library and the target basic features of the query image may be calculated, and then a plurality of images whose basic similarity satisfies the first preset condition are recalled to obtain the first image set. Specifically, a basic feature may be expressed as a feature vector, and the basic similarity between basic features may be determined by calculating the distance between them (i.e., the distance between vectors).
The basic features of each image in the image library may be obtained in advance using the image feature extraction model. The first preset condition may be that the basic similarity is not less than a preset similarity threshold; the threshold may be set as needed, e.g., 0.9 or 0.95, and is not limited herein.
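A brute-force sketch of this recall step, assuming cosine similarity as the basic similarity and 0.9 as the first preset condition:

```python
import numpy as np

def recall_first_set(target_base, gallery_base, threshold=0.9):
    """S502 sketch: ids of library images whose basic similarity to the
    query meets the first preset condition. gallery_base: (n, dim)."""
    q = target_base / np.linalg.norm(target_base)
    g = gallery_base / np.linalg.norm(gallery_base, axis=1, keepdims=True)
    sims = g @ q                  # cosine similarity to every library image
    return np.nonzero(sims >= threshold)[0], sims
```

The same pattern applies to the semantic recalls of S503 and S504, with semantic features and the second/third preset conditions substituted.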
S503, selecting a plurality of candidate images from the first image set; and the semantic similarity between the semantic features of the candidate images and the target semantic features meets a second preset condition.
In this step, the semantic similarity between the semantic features of each image in the first image set and the target semantic features of the query image may be calculated, and then a plurality of candidate images whose semantic similarity satisfies the second preset condition are selected. The semantic features of each image in the first image set may be obtained in advance using the image semantic extraction model.
For example, the second preset condition may be that the semantic similarity is not less than a preset similarity threshold, which may be the same as or different from the aforementioned threshold and is set as needed, without limitation herein.
S504, recalling the corresponding second image set based on respective semantic features of the candidate images; semantic similarity between each image contained in each second image set and the semantic features of the corresponding candidate images meets a third preset condition.
In this step, a second image set may be recalled based on the semantic features of each candidate image. Specifically, the semantic similarity between the semantic features of each image in the image library and the semantic features of the candidate image may be calculated, and then a plurality of images whose semantic similarity satisfies the third preset condition are recalled to obtain the second image set. The semantic features of each image in the image library may also be obtained in advance using the image semantic extraction model.
For example, the third preset condition may be that the semantic similarity is not less than a preset similarity threshold, which may be the same as or different from the aforementioned thresholds and is set as needed, without limitation herein.
And S505, determining a retrieved target image set based on the recalled second image sets and the first image set.
Specifically, for each recalled image, a total score may be determined according to its basic similarity and semantic similarity to the query image; the images are then sorted by total score in descending (or ascending) order, and the top m (or last m) images are selected to generate the target image set. The total score of each image may be the weighted sum of its basic similarity and semantic similarity.
In some possible embodiments, S505 may specifically include the following implementation procedures:
b1, aiming at the plurality of second image sets and each image contained in the first image set, sorting the images according to the basic similarity between the basic feature and the target basic feature of each image and the semantic similarity between the semantic feature and the target semantic feature of each image;
specifically, the basic similarity and the semantic similarity of each image may be weighted and summed to obtain a total score of each image, and then the total scores are arranged from large to small or from small to large. For example: the basic similarity of the image 1 is D1, and the semantic similarity is D2, then the total score of the image 1 is D = w1 × D1+ w2 × D2, where w1 and w2 are weights respectively, and may be set as required, and so on, and the total score of each image may be calculated.
B2, selecting a plurality of target images from the sorted images, and obtaining the searched target image set.
When the images are arranged in descending order of total score, the m top-ranked images can be selected from the sorted images as the target image set; similarly, when arranged in ascending order, the m last-ranked images are selected. The value of m may be set as required and is not limited herein.
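A sketch of steps B1 and B2 (the weights w1 and w2 and the cutoff m are illustrative assumptions):

```python
import numpy as np

def rank_and_select(image_ids, base_sims, sem_sims, w1=0.5, w2=0.5, m=50):
    """B1/B2 sketch: total score D = w1*D1 + w2*D2, sorted in descending
    order, keeping the m top-ranked images as the target image set."""
    scores = w1 * np.asarray(base_sims) + w2 * np.asarray(sem_sims)
    top = np.argsort(scores)[::-1][:m]   # indices of the m highest scores
    return [image_ids[k] for k in top]
```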
In the above embodiment, the images are sorted according to the basic similarity and the semantic similarity, so that the basic similarity and the semantic similarity jointly act on the sorting result, and the sorting result is more accurate.
According to the technical scheme of the embodiment of the application, the secondary recall is carried out based on the target basic characteristics and the target semantic characteristics of the query image, so that not only can the image with the similar basic characteristics to the query image be recalled, but also the image with the similar semantic characteristics to the query image can be further recalled, and therefore enough images can be recalled, the retrieval recall rate is improved, and the accuracy of the retrieval result is improved.
In consideration of the huge number of images in the image library, the image retrieval can be performed by adopting an index technology, and the retrieval efficiency can be improved. Specifically, the first image set may be recalled from the image library according to the target basic feature of the query image based on the basic feature indexing system established in the foregoing embodiment of the present application.
In some embodiments, as shown in fig. 6, the recalling the first image set according to the target basic feature of the query image in the above S502 may specifically include the following steps:
S5021, selecting a plurality of basic reference indexes from a plurality of basic feature indexes in the image library; each basic feature index is the clustering center of its associated plurality of basic features, and the basic similarity between each of the plurality of basic reference indexes and the target basic feature is not less than a first similarity threshold.
In the above embodiments of the present application, the established basic feature indexing system of the image library includes a plurality of basic feature indexes, and each basic feature index is a cluster center of the associated plurality of basic features.
In this step, the basic similarity between the plurality of basic feature indexes and the target basic feature of the query image can be calculated, then the basic feature index with the basic similarity not less than a first similarity threshold value is selected from the plurality of basic feature indexes, and each selected basic feature index is used as a basic reference index; the first similarity threshold may be set as needed, and is not limited herein, for example, 0.9, 0.95, and the like.
S5022, a plurality of basic features associated with a plurality of basic reference indexes are obtained.
Since each basic feature index is associated with a plurality of basic features, the plurality of basic features associated with each basic reference index can be acquired.
And S5023, generating a first image set according to the images corresponding to the acquired basic features and the images corresponding to the multiple basic reference indexes.
For example, as shown in fig. 7, a base feature index 1, a base feature index 2, and a base feature index 3 that are most similar to a target base feature are selected from a plurality of base feature indexes as 3 base reference indexes, the 3 base reference indexes are respectively associated with a plurality of base features, and images corresponding to the 3 base reference indexes and images corresponding to the respective associated plurality of base features are used to generate a first image set together.
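As an illustration of S5021-S5023, the following minimal sketch assumes each basic feature index stores its cluster-center vector together with the images of its associated basic features, and uses cosine similarity as the basic similarity measure; all names are illustrative assumptions of this sketch:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity stands in for the basic similarity measure here.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def recall_first_image_set(query_feature, index_entries, threshold=0.9):
    """index_entries: list of dicts, each describing one basic feature index:
    'center' is the cluster-center vector, 'center_image' the image it comes
    from, and 'member_images' the images of its associated basic features."""
    first_image_set = set()
    for entry in index_entries:
        if cosine_similarity(query_feature, entry["center"]) >= threshold:
            # The reference index's own image and all images of its
            # associated basic features are recalled together, as in S5023.
            first_image_set.add(entry["center_image"])
            first_image_set.update(entry["member_images"])
    return first_image_set
```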
In the embodiment of the application, when the first image set is recalled based on the target basic feature, the most similar basic feature indexes are recalled from the basic feature indexes, and then the basic features respectively associated with the recalled basic feature indexes are acquired, so that the corresponding first image set is recalled, and the recall efficiency of retrieval can be improved.
Further, after the first image set is recalled, when the recall is performed based on the first image set, that is, when the second image set is recalled according to the semantic features of the first image set and the target semantic features of the query image, the recall may be performed based on the semantic feature index system established in the above embodiment of the present application.
In some alternative embodiments, as shown in fig. 8, recalling the corresponding second image set based on the respective semantic features of the multiple candidate images in S504 may include the following steps:
for a plurality of candidate images, the following operations are performed, respectively:
S5041, selecting a plurality of semantic reference indexes from a plurality of semantic feature indexes in the image library; each semantic feature index is the clustering center of its associated plurality of semantic features, and the semantic similarity between each selected semantic reference index and the semantic feature of one candidate image is not less than a second similarity threshold.
In the above embodiment of the present application, the semantic feature indexing system of the established image library includes a plurality of semantic feature indexes, and each semantic feature index is a clustering center of the associated plurality of semantic features.
In this step, the semantic similarity between each of the plurality of semantic feature indexes and the semantic feature of one candidate image can be calculated; the semantic feature indexes whose semantic similarity is not less than a second similarity threshold are then selected, and each selected semantic feature index is used as a semantic reference index. The second similarity threshold may be set as needed, for example, 0.9 or 0.95, and is not limited herein.
S5042, a plurality of semantic features associated with each of the plurality of semantic reference indices are obtained.
Since each semantic feature index is associated with a plurality of semantic features, the plurality of semantic features associated with each semantic reference index can be acquired.
S5043, generating a corresponding second image set according to the acquired images corresponding to the semantic features and the images corresponding to the semantic reference indexes.
Exemplarily, as shown in fig. 9, a semantic feature index 1, a semantic feature index 2, and a semantic feature index 3 that are most similar to a semantic feature of a candidate image are selected from a plurality of semantic feature indexes as 3 semantic reference indexes, the 3 semantic reference indexes are respectively associated with a plurality of semantic features, and an image corresponding to the 3 semantic reference indexes and an image corresponding to the respective associated plurality of semantic features are used to generate a second image set together.
Through the embodiment, when a second image set is recalled based on the semantic features of a candidate image, the most similar semantic feature indexes are recalled from the semantic feature indexes, the semantic features respectively associated with the semantic feature indexes are acquired, and then the corresponding second image set is recalled, so that the recall efficiency of retrieval can be improved.
In the embodiment of the application, the first image set and the second image sets are recalled in a two-stage recall mode based on the target basic features and the target semantic features of the query image. On this basis, more images can be recalled in another two-stage recall mode based on the target category information and the target semantic features of the query image.
In some embodiments, as shown in fig. 10, on the basis of S501-S504 of the above embodiments, the image retrieval method may further include the following implementation flows:
S506, acquiring target category information of the query image, wherein the target category information is used for representing the object categories contained in the query image;
the target category information of the query image may specifically include a plurality of tags, each tag representing an object category.
As can be seen from the above embodiments of the present application, the trained image semantic extraction model can not only extract the semantic features of an image but also predict the object categories included in the image, specifically by outputting a plurality of labels, that is, the category information.
Therefore, in some alternative embodiments, the query image may be input into the trained image semantic extraction model to obtain the target category information.
S507, recalling the third image set; the category information of each image included in the third image set is the same as the target category information.
In this step, the plurality of object categories included in the target category information of the query image may be determined, for example, object category 1, object category 2 and object category 3; then, each image whose category information is the same as the target category information is found among all the images in the image library and recalled, that is, the category information of each recalled image also includes object category 1, object category 2 and object category 3. The category information of each image in the image library may be obtained in advance by using the image semantic extraction model.
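One possible way to realize this exact-match recall is an inverted index keyed on the complete label set of each library image; the following is a minimal sketch under that assumption, with illustrative names:

```python
from collections import defaultdict

def build_category_index(library_labels):
    """library_labels: dict mapping image_id -> set of object-category labels
    predicted in advance by the image semantic extraction model."""
    index = defaultdict(set)
    for image_id, labels in library_labels.items():
        index[frozenset(labels)].add(image_id)  # key on the complete label set
    return index

def recall_third_image_set(target_labels, category_index):
    # Only images whose category information exactly matches the target
    # category information of the query image are recalled.
    return set(category_index.get(frozenset(target_labels), set()))
```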
S508, recalling the corresponding fourth image set based on the respective semantic features of the multiple reference images contained in the third image set; and the semantic similarity between each image contained in each fourth image set and the semantic features of the corresponding reference image meets a fifth preset condition.
The plurality of reference images in the third image set may be images whose semantic features are similar to those of the query image. Specifically, the semantic similarity between each image in the third image set and the query image may be calculated, and then, based on these semantic similarities, a plurality of reference images most semantically similar to the query image may be selected.
Further, a fourth image set may be recalled based on the semantic features of each reference image. Specifically, the semantic similarity between the semantic features of each image in the image library and the semantic features of the reference image may be calculated, and a plurality of images whose semantic similarity satisfies a fifth preset condition are then recalled to obtain the fourth image set. For example, the fifth preset condition may be: the semantic similarity is not less than a preset similarity threshold, which may be set according to needs and is not limited herein.
In the embodiment of the application, on the basis of carrying out secondary recall based on the target basic feature and the target semantic feature of the query image, another secondary recall process is realized based on the target category information and the semantic feature of the query image, so that not only can the image with the same category information as the query image be recalled, but also the image with similar semantic feature as the query image can be further recalled, thereby further recalling enough images, improving the retrieval recall rate and further improving the accuracy of the retrieval result.
After the above S506-S508 are executed, the above S505 of determining the retrieved target image set based on the recalled plurality of second image sets and the first image set may specifically include the following step:
S509, a retrieved target image set is determined based on the first image set, the plurality of second image sets, the third image set and the plurality of fourth image sets.
Specifically, for each recalled image, a total score may be determined from the basic similarity, the semantic similarity and the evaluation value of the category information between that image and the query image; the images may then be sorted by total score in descending (or ascending) order, and the m top-ranked (or bottom-ranked) images selected to generate the target image set. The total score of each image may be a weighted sum of the basic similarity, the semantic similarity and the evaluation value of the category information.
In some optional embodiments, S509 may specifically include the following implementation flows:
C1, for each image contained in the first image set, the plurality of second image sets, the third image set and the plurality of fourth image sets, sorting the images according to the basic similarity between the basic feature of each image and the target basic feature, the semantic similarity between the semantic feature of each image and the target semantic feature, and the evaluation value of the category information of each image;
Specifically, the basic similarity, the semantic similarity and the evaluation value of the category information of each image may be weighted and summed to obtain a total score for the image, and the total scores may then be arranged in descending or ascending order. For example, if the basic similarity of image 1 is D1, its semantic similarity is D2 and the evaluation value of its category information is D3, the total score of image 1 is D = w1 × D1 + w2 × D2 + w3 × D3, where w1, w2 and w3 are weights that may be set as required; the total score of every other image may be calculated in the same way.
In some alternative embodiments, the evaluation value of the respective category information of each image may be obtained as follows:
At least one object category contained in the target category information is determined, and the first prediction probability of each of the at least one object category is obtained.
Further, for each image, the following operations are performed, respectively:
and determining an evaluation value of the category information of the image according to the second prediction probability of each of the at least one object category and the first prediction probability of each of the at least one object category in the category information of the image.
Assume that the target category information of the query image includes object category 1, object category 2, object category 3 and object category 4, corresponding, for example, to: human, dog, cat and bird. The prediction probabilities of the image semantic extraction model for these four object categories are all greater than a probability threshold, which may be set as required, for example, 0.5.
Since the category information of each recalled image is the same as the target category information, the category information of each image also contains the above four object categories. When predicting the category information of each image, the image semantic extraction model may further output a prediction probability of each object category included in the category information, that is, may obtain a second prediction probability of each recalled image corresponding to the four object categories.
For example, in the target category information of the query image, the first prediction probabilities of the four object categories are P1, P2, P3 and P4, respectively, all greater than the probability threshold; in the category information of the recalled image, the second prediction probabilities of the four object categories are P1', P2', P3' and P4', respectively. The evaluation value P of the category information of the recalled image may then be calculated according to formula (4):
P = P1 × P1' + P2 × P2' + P3 × P3' + P4 × P4'    (4)
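Generalizing formula (4) to any number of activated object categories, a minimal sketch (names illustrative) is:

```python
def category_evaluation_value(first_probs, second_probs):
    """first_probs: dict mapping each object category activated in the target
    category information to its first prediction probability (P1..P4 above);
    second_probs: the recalled image's second prediction probabilities
    (P1'..P4'). Implements formula (4): P = sum_i Pi * Pi'."""
    return sum(p * second_probs.get(category, 0.0)
               for category, p in first_probs.items())
```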
C2, selecting a plurality of target images from the sorted images, and obtaining the searched target image set.
When the images are sorted in descending order of total score, the m top-ranked images may be selected from the sorted images as the target image set; similarly, when the images are sorted in ascending order of total score, the m bottom-ranked images may be selected as the target image set. The value of m may be set as required and is not limited herein.
In the above embodiment of the present application, after all the recalled images are obtained, the images are sorted according to a comprehensive score of the basic similarity, the semantic similarity and the evaluation value of the category information, and a plurality of top-ranked target images are selected. In this way, the basic similarity, the semantic similarity and the category information jointly act on the sorting result, making the sorting result more accurate.
The following describes in detail the specific implementation process of recalling the third image set in S507.
In this embodiment of the present application, the third image set may be recalled from the image library according to the target category information of the query image based on a category index system in a pre-established image library, where a process of establishing the category index system in the image library refers to the above embodiment of the present application and is not described herein again.
In some embodiments, as shown in fig. 11, the recalling the third image set in S507 may specifically include the following steps:
S5071, selecting at least one object category index matching the target category information from the plurality of object category indexes in the image library; each object category index is associated with a plurality of images containing the corresponding object category.
In the above embodiment of the present application, the established category indexing system of the image library includes a plurality of object category indexes, and each object category index is associated with a plurality of images containing the corresponding object category. After the target category information of the query image is obtained, the plurality of object categories included in the query image can be determined, and the object category indexes corresponding to the determined object categories are then selected from the category indexing system.
S5072, a plurality of images associated with the selected at least one object category index are acquired, and a third image set is generated from the acquired images.
For example, as shown in fig. 12, it is assumed that the query image includes an object class 1, an object class 2, and an object class 3, and the object class index 1, the object class index 2, and the object class index 3 corresponding to the object class 1, the object class 2, and the object class 3, respectively, are selected from the plurality of object class indexes, and then a plurality of images associated with the 3 object class indexes are acquired, and a third image set is generated from the acquired images.
Further, after the third image set is recalled, when a recall is performed based on the third image set and the query image, that is, when the fourth image set is recalled according to the semantic features of the third image set and the target semantic features of the query image, the recall may be performed based on the semantic feature index system established in the above embodiment of the present application.
In some embodiments, as shown in fig. 13, recalling the corresponding fourth image set based on the respective semantic features of the multiple reference images included in the third image set in S508 may include the following implementation procedures:
S5081, selecting a plurality of reference images from the third image set, and using the query image as one reference image; the semantic similarity between the semantic features of each of the plurality of reference images and the target semantic features is not less than a third similarity threshold.
In this step, the semantic similarity between the semantic features of each image in the third image set and the target semantic features of the query image may be calculated, and a plurality of reference images whose semantic similarity is not less than a third similarity threshold may then be selected. The third similarity threshold may be set according to needs and is not limited herein.
S5082, recalling a corresponding fourth image set based on respective semantic features of the obtained multiple reference images; and the semantic similarity between each image contained in each fourth image set and the semantic features of the corresponding reference image is not less than a fourth similarity threshold.
In this step, a fourth image set may be recalled based on the semantic features of each reference image. Specifically, the semantic similarity between the semantic features of each image in the image library and the semantic features of the reference image may be calculated, and a plurality of images whose semantic similarity is not less than a fourth similarity threshold are then recalled to obtain the fourth image set. The fourth similarity threshold may be set according to needs and is not limited herein.
In some optional embodiments, in S5082, recalling the corresponding fourth image set based on the obtained semantic features of each of the multiple reference images, respectively, may include the following implementation procedures:
aiming at the semantic features of each of the plurality of reference images, the following operations are respectively executed:
D1, selecting a plurality of semantic reference indexes from a plurality of semantic feature indexes in the image library; each semantic feature index is the clustering center of its associated plurality of semantic features, and the semantic similarity between each selected semantic reference index and the semantic feature of one reference image is not less than a fourth similarity threshold;
in this step, semantic similarity between each of the plurality of semantic feature indexes and a semantic feature of a reference image may be calculated, and then a semantic feature index having a semantic similarity not less than the fourth similarity threshold is selected from the plurality of semantic feature indexes, and each selected semantic feature index is used as a semantic reference index.
D2, acquiring a plurality of semantic features respectively associated with the semantic reference indexes;
since each semantic feature index is associated with a plurality of semantic features, a plurality of semantic features associated with each semantic reference index can be acquired.
D3, generating a corresponding fourth image set according to the images corresponding to the acquired semantic features and the images corresponding to the semantic reference indexes.
Illustratively, from the semantic feature indexes, a semantic feature index 3, a semantic feature index 4, and a semantic feature index 5 that are most similar to the semantic feature of a reference image are selected as 3 semantic reference indexes, the 3 semantic reference indexes are respectively associated with a plurality of semantic features, and the images corresponding to the 3 semantic reference indexes and the images corresponding to the respective associated semantic features are used to jointly generate a fourth image set.
It should be noted that, the above embodiments of the present application relate to a plurality of similarity thresholds, for example: the first similarity threshold, the second similarity threshold, the third similarity threshold, and the fourth similarity threshold may be the same or different, and are not limited herein.
The following describes an image retrieval method according to an embodiment of the present application with reference to a specific retrieval example of fig. 14.
The embodiment of the application can combine a two-stage recall method based on basic features and semantic features with a two-stage recall method based on category information and semantic features, and the two-stage recall processes are introduced below respectively.
Referring to fig. 14, the image retrieval method according to the embodiment of the present application may include the following specific implementation procedures:
First, basic feature primary recall: during retrieval, according to the similarity embedding (i.e., the above target basic feature) of the query image (query graph), the topK (for example, K = 5) basic feature indexes closest to the similarity embedding of the query graph are recalled from the basic feature indexing system, and the images under these basic feature indexes are recalled and denoted as Rsi1 (recall similar image in level 1). For example, in fig. 14, two airplane images and one automobile image are recalled at the first level of the basic feature recall.
Second, basic feature secondary recall: for the images Rsi1 recalled at the first level of the basic feature recall, their semantic embeddings (i.e., the above semantic features) are acquired, and the distance between each semantic embedding and the semantic embedding of the query graph is calculated. The images in Rsi1 whose distance is smaller than a distance threshold (for example, 0.05) are taken as secondary query images. Based on the semantic embedding of each secondary query image, the topK (for example, K = 5) semantic feature indexes closest to that semantic embedding are recalled from the semantic feature indexing system, and the images associated with these semantic feature indexes are recalled and denoted as Rsi2. For example, in fig. 14, three airplane images and one automobile image are recalled at the second level of the basic feature recall.
The smaller the distance between the semantic embeddings of two images, the higher their similarity. In the basic feature recall of fig. 14, three images are recalled at the first level; although the automobile image is recalled at the first level by its basic feature, its semantic embedding is not similar to that of the query image, so it is not used as a secondary query image, and only the two airplane images are used for the secondary recall.
Third, category information primary recall: the labels predicted for the query graph (i.e., the category information in the above embodiment, each label being an object category) are determined, and all images under these labels are recalled and denoted as Rli1 (recall label image in level 1). For example, in fig. 14, two airplane images are recalled at the first level of the category information recall.
Fourth, category information secondary recall: for the images Rli1 recalled at the first level of the category information recall, the semantic similarity between the semantic embedding of each image in Rli1 and the semantic embedding of the query graph is calculated, and the images whose semantic similarity is greater than a similarity threshold (for example, 0.90) are selected. The selected images, together with the original query graph, are taken as secondary query graphs. Based on the semantic embedding of each secondary query graph, the topK (for example, K = 5) semantic feature indexes closest to that semantic embedding are recalled from the semantic feature indexing system, and the images associated with these semantic feature indexes are recalled and denoted as Rli2. For example, in fig. 14, two airplane images are recalled at the second level of the category information recall.
Fifth, the recalled sets Rsi1, Rsi2, Rli1 and Rli2 are merged to obtain the total recall result.
Sixth, all the recalled images are sorted.
For all the recalled images, the prediction probabilities on the label bits activated by the initial query graph are obtained. For example, if the prediction probabilities of the 10th label and the 12th label of the query graph are both greater than a probability threshold (for example, 0.5), the prediction probabilities of the 10th label and the 12th label in the category prediction of each recalled image are obtained.
Based on these three pieces of information for each recalled image, namely the prediction probabilities of the activated labels, the semantic embedding and the similarity embedding, the weighted total distance is calculated according to the following formula (5) to formula (8), and the images are sorted from small to large by total distance to obtain the sorting result.
D = w1 × Dsim1 + w2 × Dsim2 + w3 × Dlabel    (5)

Dsim1 = dist(Equery1, Edb1)    (6)

Dsim2 = dist(Equery2, Edb2)    (7)

Dlabel = -∑i p(query)i × p(db)i    (8)
wherein w1, w2 and w3 are weights that may be set as required, for example w1 = w2 = w3 = 0.33;
Equery1 represents the basic feature (i.e., the above similarity embedding) of the query graph, Edb1 represents the basic feature of the recalled image, and Dsim1 represents the distance between the basic features of the query image and the recalled image; Equery2 represents the semantic embedding of the query graph, Edb2 represents the semantic embedding of the recalled image, and Dsim2 represents the distance between the semantic embeddings of the query image and the recalled image; Dlabel represents the negative similarity between the category information (i.e., the multi-label prediction) of the query graph and that of the recalled image (denoted db); i denotes an activated label bit of the query graph, over which the summation in formula (8) runs, Nclass denotes the number of label categories (i.e., the maximum value of the label bit), p(query)i represents the prediction probability of the ith label of the query graph, and p(db)i represents the prediction probability of the ith label of the recalled image. An activated label bit of the query graph is a label whose prediction probability in the multi-label prediction of the query graph is greater than a probability threshold (for example, 0.5).
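To make the sorting step concrete, the following is a minimal sketch of the total distance of formulas (5) to (8). The use of the Euclidean norm for the two embedding distances is an assumption of this sketch (the embodiment only specifies a distance), and all names are illustrative:

```python
import numpy as np

def total_distance(query, db, activated_bits, w1=0.33, w2=0.33, w3=0.33):
    """query, db: dicts with 'sim_emb' (similarity embedding), 'sem_emb'
    (semantic embedding) and 'label_probs' (per-label prediction
    probabilities as a numpy array). activated_bits: label bits of the
    query graph whose prediction probability exceeds the threshold."""
    d_sim1 = np.linalg.norm(query["sim_emb"] - db["sim_emb"])    # formula (6)
    d_sim2 = np.linalg.norm(query["sem_emb"] - db["sem_emb"])    # formula (7)
    d_label = -float(sum(query["label_probs"][i] * db["label_probs"][i]
                         for i in activated_bits))               # formula (8)
    return w1 * d_sim1 + w2 * d_sim2 + w3 * d_label              # formula (5)

# All recalled images may then be sorted from small to large:
# recalled_images.sort(key=lambda db: total_distance(query, db, activated_bits))
```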
The above embodiments of the present application have at least the following beneficial effects:
(1) On the retrieval framework: on the basis of the basic feature indexing system, a semantic feature indexing system and a category indexing system are added, realizing two-stage recall combining semantic recall and basic feature (semantic-free) recall. In the retrieval process, images with similar semantics can be recalled to supplement the semantic-free retrieval, improving the recall effect of large-scale retrieval.
(2) On the recall effect: by adopting the two-stage recall scheme, enough similar images can be recalled, and the chained recall in the two-stage recall alleviates the problem that the recall falls short of expectation because the model has not learned certain sample images well.
(3) On post-recall processing: based on the semantic embedding and the similarity embedding, these two features containing different information jointly act on the sorting result, so that images with similar semantic features and similar basic features are ranked first and images with dissimilar semantic features and dissimilar basic features are ranked last. The retrieval results thus accord better with human perception, improving retrieval accuracy.
(4) On feature learning: the semantic embedding is updated and learned through multi-task training with different learning rates over a shared basic feature layer (conv1-conv5), realizing joint learning of the semantic embedding and the category information (multi-label). The image semantic extraction model can thus provide the semantic embedding while providing the category information, so that one model outputs two kinds of meaningful information.
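As an illustration of this joint learning setup, the following is a minimal PyTorch sketch. A ResNet backbone stands in for the shared conv1-conv5 layers, and the layer sizes, head structure and learning rates are illustrative assumptions rather than the embodiment's actual configuration (torchvision 0.13+ is assumed for the weights argument):

```python
import torch
import torch.nn as nn
import torchvision

class ImageSemanticModel(nn.Module):
    """Shared backbone standing in for the conv1-conv5 layers, with a
    semantic embedding head and a multi-label head. Sizes are illustrative."""
    def __init__(self, embed_dim=128, num_labels=1000):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # shared layers
        self.embed_head = nn.Linear(2048, embed_dim)        # semantic embedding
        self.label_head = nn.Linear(embed_dim, num_labels)  # multi-label logits

    def forward(self, x):
        features = self.backbone(x).flatten(1)
        embedding = self.embed_head(features)
        label_probs = torch.sigmoid(self.label_head(embedding))
        return embedding, label_probs  # one model, two outputs

model = ImageSemanticModel()
optimizer = torch.optim.SGD([
    {"params": model.backbone.parameters(), "lr": 1e-4},   # smaller lr on shared layers
    {"params": model.embed_head.parameters(), "lr": 1e-3},
    {"params": model.label_head.parameters(), "lr": 1e-3},
], momentum=0.9)
```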
Based on the same inventive concept as the method embodiment of the present application, an embodiment of the present application further provides an image retrieval apparatus. Since the principle by which the apparatus solves the problem is similar to that of the method embodiment, the implementation of the apparatus can refer to the implementation of the method, and repeated details are not described again.
Referring to fig. 15, an embodiment of the present application provides an image retrieving apparatus, which includes a feature obtaining module 151, a first recall module 152, a first selecting module 153, a second recall module 154, and a determining module 155.
A feature obtaining module 151, configured to obtain a target basic feature and a target semantic feature of the query image; the target basic features are used for representing semantic-free features of the query image;
a first recall module 152 for recalling the first set of images; the basic similarity between the basic features of the images in the first image set and the target basic features meets a first preset condition;
a first selection module 153 for selecting a plurality of candidate images from the first set of images; semantic similarity between each semantic feature of the candidate images and the target semantic feature meets a second preset condition;
a second recall module 154, configured to recall the corresponding second image set based on respective semantic features of the multiple candidate images; semantic similarity between each image contained in each second image set and semantic features of corresponding candidate images meets a third preset condition;
a determining module 155, configured to determine the retrieved target image set based on the recalled plurality of second image sets and the first image set.
In a possible embodiment, the first recall module 152 is specifically configured to:
selecting a plurality of base reference indexes from a plurality of base feature indexes in an image library; each basic feature index is a clustering center of a plurality of associated basic features, and the basic similarity between each basic reference index and the target basic feature is not less than a first similarity threshold value;
obtaining a plurality of basic characteristics respectively associated with a plurality of basic reference indexes;
and generating a first image set according to the images corresponding to the acquired basic features and the images corresponding to the multiple basic reference indexes.
In one possible embodiment, the second recall module 154 is specifically configured to:
for a plurality of candidate images, the following operations are performed, respectively:
selecting a plurality of semantic reference indexes from a plurality of semantic feature indexes in an image library; each semantic feature index is a clustering center of a plurality of associated semantic features, and the semantic similarity between each selected semantic reference index and the semantic feature of one candidate image is not less than a second similarity threshold;
acquiring a plurality of semantic features associated with a plurality of semantic reference indexes;
and generating a corresponding second image set according to the images corresponding to the acquired semantic features and the images corresponding to the semantic reference indexes.
In a possible embodiment, the feature obtaining module 151 is specifically configured to:
inputting the query image into a trained image feature extraction model to obtain target basic features; the image feature extraction model is obtained based on triple sample data set training, and each triple sample in the triple sample data set comprises a reference image, an image similar to the reference image and an image dissimilar to the reference image;
inputting the query image into the trained image semantic extraction model to obtain target semantic features; the image semantic extraction model is obtained by training based on a triple sample data set marked with object categories.
In a possible embodiment, the determining module 155 is specifically configured to:
for each image contained in the plurality of second image sets and the first image set, sorting the images according to the basic similarity between the basic feature and the target basic feature of each image and the semantic similarity between the semantic feature and the target semantic feature of each image;
and selecting a plurality of target images from the sorted images to obtain a searched target image set.
In a possible embodiment, as shown in fig. 16, the apparatus further comprises:
a category obtaining module 156, configured to obtain target category information of the query image, where the target category information is used to represent an object category included in the query image;
a third recall module 157 to recall a third set of images; the respective category information of the images contained in the third image set is the same as the target category information;
a fourth recall module 158, configured to recall a corresponding fourth image set based on respective semantic features of a plurality of reference images included in the third image set; semantic similarity between each image contained in each fourth image set and semantic features of corresponding reference images meets a fifth preset condition;
the determining module 155 is specifically further configured to:
determining the retrieved target image set based on the first image set, the plurality of second image sets, the third image set and the plurality of fourth image sets.
In a possible embodiment, the third recall module 157 is specifically configured to:
selecting at least one object category index matched with the target category information from a plurality of object category indexes in the image library; each object class index is associated with a plurality of images containing corresponding object classes;
and acquiring a plurality of images respectively associated with the selected at least one object class index, and generating a third image set according to the acquired images.
In a possible embodiment, the fourth recall module 158 is specifically configured to:
selecting a plurality of reference images from the third image set, and taking the query image as one reference image; the semantic similarity between each semantic feature of the multiple reference images and the target semantic feature is not less than a third similarity threshold;
recalling the corresponding fourth image set based on the respective semantic features of the obtained multiple reference images; and the semantic similarity between each image contained in each fourth image set and the semantic features of the corresponding reference image is not less than a fourth similarity threshold.
In a possible embodiment, when recalling the corresponding fourth image set based on the respective semantic features of the obtained multiple reference images, the fourth recall module 158 is further configured to:
aiming at the semantic features of each of the plurality of reference images, the following operations are respectively executed:
selecting a plurality of semantic reference indexes from a plurality of semantic feature indexes in an image library; each semantic feature index is a clustering center of a plurality of associated semantic features, and the semantic similarity between each selected semantic reference index and the semantic feature of one reference image is not less than a fourth similarity threshold;
acquiring a plurality of semantic features associated with a plurality of semantic reference indexes;
and generating a corresponding fourth image set according to the images corresponding to the acquired semantic features and the images corresponding to the semantic reference indexes.
In a possible embodiment, the feature obtaining module 151 is further specifically configured to:
inputting the query image into the trained image semantic extraction model to obtain target semantic features;
the image semantic extraction model is obtained by training based on a triple sample data set marked with object types, and each triple sample in the triple sample data set comprises a reference image, an image similar to the reference image and an image dissimilar to the reference image;
the category obtaining module 156 is specifically configured to:
and inputting the query image into the trained image semantic extraction model to obtain predicted target category information.
In a possible embodiment, the determining module 155 is further specifically configured to:
for each image contained in the first image set, the second image sets, the third image set and the fourth image sets, sorting the images according to the basic similarity between the basic feature of each image and the target basic feature, the semantic similarity between the semantic feature of each image and the target semantic feature, and the evaluation value of the category information of each image;
and selecting a plurality of target images from the sorted images to obtain a searched target image set.
In one possible embodiment, the evaluation value of the respective category information of the respective images is obtained by:
determining at least one object type contained in the target type information, and acquiring a first prediction probability of each of the at least one object type;
for each image, the following operations are performed:
and determining an evaluation value of the category information of the image according to the second prediction probability of each of the at least one object category and the first prediction probability of each of the at least one object category in the category information of the image.
In the embodiment of the application, the secondary recall is performed based on the target basic feature and the target semantic feature of the query image, so that not only the image with the similar basic feature as the query image can be recalled, but also the image with the similar semantic feature as the query image can be recalled, and thus enough images can be recalled. In addition, another two-stage recall process can be realized based on the target category information and semantic features of the query image, and the recall process can recall the image with the same category information as the query image and can recall the image with similar semantic features to the query image, so that enough images can be recalled, the retrieval recall rate is improved, and the accuracy of the retrieval result is improved.
For convenience of description, the above parts are separately described as modules (or units) according to functional division. Of course, the functionality of the various modules (or units) may be implemented in the same one or more pieces of software or hardware when implementing the present application.
Having described the image retrieval method and apparatus according to an exemplary embodiment of the present application, next, an image retrieval apparatus according to another exemplary embodiment of the present application is described.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module" or "system."
In some possible embodiments, an image retrieval apparatus according to the present application may include at least a processor and a memory. Wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the image retrieval method according to various exemplary embodiments of the present application described in the specification. For example, the processor may perform the steps as shown in fig. 2.
Having described the image retrieval method and apparatus according to an exemplary embodiment of the present application, an electronic device according to another exemplary embodiment of the present application is described next.
Based on the same inventive concept as the method embodiment of the present application, an embodiment of the present application further provides an electronic device, and a principle of the electronic device to solve the problem is similar to the method of the embodiment, so that the implementation of the electronic device may refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 17, the electronic device 170 may include at least a processor 171 and a memory 172. The memory 172 stores therein program codes, which, when executed by the processor 171, cause the processor 171 to perform the steps of any of the image retrieval methods described above.
In some possible implementations, an electronic device according to the present application may include at least one processor, and at least one memory. Wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the image retrieval method according to various exemplary embodiments of the present application described above in the present specification. For example, a processor may perform the steps as shown in fig. 5.
In an exemplary embodiment, the present application also provides a storage medium, such as the memory 172, including program code that is executable by the processor 171 of the electronic device 170 to perform the image retrieval method described above. Alternatively, the storage medium may be a non-transitory computer readable storage medium, for example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
An electronic device 180 according to one embodiment of the present application is described below with reference to fig. 18. The electronic device 180 of fig. 18 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 18, the electronic device 180 is represented in the form of a general electronic device. The components of the electronic device 180 may include, but are not limited to: the at least one processing unit 181, the at least one memory unit 182, and a bus 183 that couples various system components including the memory unit 182 and the processing unit 181.
Bus 183 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus architectures.
The storage unit 182 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 1821 and/or cache memory unit 1822, and may further include Read Only Memory (ROM) 1823.
The storage unit 182 may also include a program/utility 1825 having a set (at least one) of program modules 1824, such program modules 1824 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The electronic device 180 may also communicate with one or more external devices 184 (e.g., keyboard, pointing device, etc.), with one or more devices that enable a user to interact with the electronic device 180, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 180 to communicate with one or more other electronic devices. Such communication may occur via input/output (I/O) interfaces 185. Also, the electronic device 180 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 186. As shown, the network adapter 186 communicates with other modules for the electronic device 180 via the bus 183. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with electronic device 180, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
In some possible embodiments, the aspects of the image retrieval method provided by the present application may also be implemented in the form of a program product, which includes program code for causing an electronic device to perform the steps in the image retrieval method according to various exemplary embodiments of the present application described above in this specification when the program product is run on the electronic device, for example, the electronic device may perform the steps as shown in fig. 5.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a computing device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functions of two or more units described above may be embodied in one unit, according to embodiments of the application. Conversely, the features and functions of one unit described above may be further divided into embodiments by a plurality of units.
Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (16)

1. An image retrieval method, comprising:
acquiring a target basic feature and a target semantic feature of a query image; wherein the target base features are used for characterizing semantic-free features of the query image;
recalling the first image set; the basic similarity between the basic features of the images in the first image set and the target basic features meets a first preset condition;
selecting a plurality of candidate images from the first set of images; semantic similarity between each semantic feature of the candidate images and the target semantic feature meets a second preset condition;
recalling the corresponding second image set based on respective semantic features of the plurality of candidate images; semantic similarity between each image contained in each second image set and semantic features of corresponding candidate images meets a third preset condition;
determining a retrieved target image set based on the recalled plurality of second image sets and the first image set.
2. The method of claim 1, wherein the recalling the first set of images comprises:
selecting a plurality of base reference indexes from a plurality of base feature indexes in an image library; each basic feature index is a clustering center of a plurality of associated basic features, and the basic similarity of each basic reference index and the target basic feature is not less than a first similarity threshold value;
obtaining a plurality of basic characteristics respectively associated with the plurality of basic reference indexes;
and generating the first image set according to the images corresponding to the acquired basic features and the images corresponding to the basic reference indexes.
3. The method according to claim 1, wherein recalling the corresponding second set of images based on respective semantic features of the plurality of candidate images comprises:
for the plurality of candidate images, respectively performing the following operations:
selecting a plurality of semantic reference indexes from a plurality of semantic feature indexes in an image library; each semantic feature index is a clustering center of a plurality of associated semantic features, and the semantic similarity between each selected semantic reference index and the semantic feature of one candidate image is not less than a second similarity threshold;
obtaining a plurality of semantic features associated with the semantic reference indexes respectively;
and generating a corresponding second image set according to the acquired images corresponding to the semantic features and the images corresponding to the semantic reference indexes.
4. The method according to any one of claims 1 to 3, wherein the obtaining of the target base feature and the target semantic feature of the query image comprises:
inputting the query image into a trained image feature extraction model to obtain the target basic feature; the image feature extraction model is obtained based on triple sample data set training, and each triple sample in the triple sample data set comprises a reference image, an image similar to the reference image and an image dissimilar to the reference image;
inputting the query image into a trained image semantic extraction model to obtain the target semantic features; the image semantic extraction model is obtained by training based on a triple sample data set marked with object categories.
5. The method of claim 4, wherein determining the retrieved set of target images based on the recalled plurality of second sets of images and the first set of images comprises:
for each image included in the plurality of second image sets and the first image set, sorting the images according to the basic similarity between the respective basic feature of each image and the target basic feature and the semantic similarity between the respective semantic feature of each image and the target semantic feature;
and selecting a plurality of target images from the sorted images to obtain a searched target image set.
6. The method according to any one of claims 1 to 3, further comprising:
acquiring target category information of the query image, wherein the target category information is used for representing the object category contained in the query image;
recalling the third image set; the respective category information of the images included in the third image set is the same as the target category information;
recalling a corresponding fourth image set based on respective semantic features of a plurality of reference images contained in the third image set; semantic similarity between each image contained in each fourth image set and semantic features of corresponding reference images meets a fifth preset condition;
wherein the determining a retrieved target image set based on the recalled plurality of second image sets and the first image set comprises:
determining a retrieved target image set based on the first image set, the plurality of second image sets, the third image set, and the plurality of fourth image sets.
7. The method according to claim 6, wherein the recalling the third image set comprises:
selecting at least one object category index matched with the target category information from a plurality of object category indexes in an image library; each object category index is associated with a plurality of images containing the corresponding object category;
and acquiring a plurality of images respectively associated with the selected at least one object category index, and generating the third image set according to the acquired images.
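Claim 7's object-category index amounts to an inverted index from category to images. A minimal sketch with hypothetical names:

```python
from collections import defaultdict

# Hypothetical inverted index: object category -> ids of images containing it.
category_index = defaultdict(list)

def index_image(image_id, object_categories):
    for cat in object_categories:
        category_index[cat].append(image_id)

def recall_third_image_set(target_categories):
    """Union of the images associated with every matching category index."""
    recalled = set()
    for cat in target_categories:
        recalled.update(category_index.get(cat, ()))
    return recalled
```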
8. The method according to claim 6, wherein the recalling the corresponding fourth image set based on the semantic features of each of the plurality of reference images included in the third image set comprises:
selecting a plurality of reference images from the third image set, and taking the query image as one reference image; the semantic similarity between the semantic feature of each selected reference image and the target semantic feature is not less than a third similarity threshold;
recalling the corresponding fourth image set based on the respective semantic features of the obtained multiple reference images; and the semantic similarity between each image contained in each fourth image set and the semantic features of the corresponding reference image is not less than a fourth similarity threshold.
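A small sketch of the reference-image selection in claim 8, assuming cosine similarity over numpy vectors and a placeholder id for the query image (both assumptions):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_reference_images(third_set_ids, third_set_feats,
                            target_semantic_feat, third_threshold):
    """Reference images: third-set images clearing the third threshold,
    plus the query image itself."""
    refs = ["<query>"]  # the query always serves as one reference image
    for img_id, feat in zip(third_set_ids, third_set_feats):
        if cosine(feat, target_semantic_feat) >= third_threshold:
            refs.append(img_id)
    return refs
```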
9. The method according to claim 8, wherein recalling the corresponding fourth image set based on the semantic features of each of the obtained plurality of reference images comprises:
for the respective semantic features of the plurality of reference images, the following operations are respectively performed:
selecting a plurality of semantic reference indexes from a plurality of semantic feature indexes in an image library; each semantic feature index is a clustering center of a plurality of associated semantic features, and the semantic similarity between each selected semantic reference index and the semantic feature of the reference image is not less than the fourth similarity threshold;
obtaining a plurality of semantic features associated with the semantic reference indexes respectively;
and generating a corresponding fourth image set according to the images corresponding to the acquired semantic features and the images corresponding to the semantic reference indexes.
10. The method of claim 6, wherein obtaining the target semantic features of the query image comprises:
inputting the query image into a trained image semantic extraction model to obtain the target semantic features;
the image semantic extraction model is obtained by training based on a triplet sample data set marked with object categories, wherein each triplet sample in the triplet sample data set comprises a reference image, an image similar to the reference image, and an image dissimilar to the reference image;
the acquiring of the target category information of the query image includes:
and inputting the query image into the trained image semantic extraction model to obtain the predicted target category information.
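Claim 10 implies one semantic model serving two purposes: its embedding head yields the target semantic feature while its classification head yields the predicted category information. A minimal two-headed PyTorch sketch; the backbone and head sizes are assumptions:

```python
import torch.nn as nn
import torchvision.models as models

class ImageSemanticModel(nn.Module):
    """Shared backbone with two heads: a semantic embedding and per-category
    probabilities, so one forward pass yields both outputs. Sizes assumed."""
    def __init__(self, embed_dim=256, num_classes=1000):
        super().__init__()
        resnet = models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.embed = nn.Linear(2048, embed_dim)       # target semantic feature
        self.classify = nn.Linear(2048, num_classes)  # target category information

    def forward(self, x):
        h = self.backbone(x).flatten(1)  # (B, 2048) pooled features
        return self.embed(h), self.classify(h).softmax(dim=1)
```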
11. The method according to claim 10, wherein the determining the retrieved target image set based on the first image set, the plurality of second image sets, the third image set, and the plurality of fourth image sets comprises:
for each image included in the first image set, the plurality of second image sets, the third image set and the plurality of fourth image sets, sorting the images according to the basic similarity between the basic feature of each image and the target basic feature, the semantic similarity between the semantic feature of each image and the target semantic feature, and the evaluation value of the category information of each image;
and selecting a plurality of target images from the sorted images to obtain the retrieved target image set.
12. The method according to claim 11, wherein the evaluation value of the category information of each image is obtained by:
determining at least one object category contained in the target category information, and acquiring a first prediction probability of each object category;
for each image, respectively executing the following operations:
and determining the evaluation value of the category information of the image according to the first prediction probability of each of the at least one object category and the second prediction probability, in the category information of the image, of each of the at least one object category.
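One plausible reading of claim 12 is to sum, over the object categories in the target category information, the product of the query-side first prediction probability and the image-side second prediction probability. A sketch with hypothetical probability dictionaries; this is one interpretation, not the patent's mandated formula:

```python
def category_evaluation_value(first_probs, second_probs):
    """Score an image's category information against the query's.

    first_probs:  {object_category: first prediction probability} (query side).
    second_probs: {object_category: second prediction probability} (image side).
    """
    return sum(p1 * second_probs.get(cat, 0.0)
               for cat, p1 in first_probs.items())
```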
13. An image retrieval apparatus, comprising:
the feature acquisition module is used for acquiring a target basic feature and a target semantic feature of the query image; wherein the target basic feature is used for characterizing the semantic-free features of the query image;
the first recall module is used for recalling a first image set; the basic similarity between the respective basic features of the images in the first image set and the target basic feature meets a first preset condition;
the first selection module is used for selecting a plurality of candidate images from the first image set; the semantic similarity between the respective semantic features of the candidate images and the target semantic feature meets a second preset condition;
the second recall module is used for recalling the corresponding second image set based on the respective semantic features of the candidate images; the semantic similarity between each image contained in each second image set and the semantic feature of the corresponding candidate image meets a third preset condition;
and the determining module is used for determining the retrieved target image set based on the recalled second image sets and the first image set.
14. An electronic device, comprising a processor and a memory, wherein the memory stores program code which, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1 to 12.
15. A computer-readable storage medium, characterized in that it comprises program code which, when run on an electronic device, causes the electronic device to perform the steps of the method of any one of claims 1 to 12.
16. A computer program product, comprising computer instructions stored in a computer-readable storage medium; when a processor of an electronic device reads the computer instructions from the computer-readable storage medium, the processor executes the computer instructions, causing the electronic device to perform the steps of the method of any one of claims 1 to 12.
CN202111359989.1A 2021-11-17 2021-11-17 Image retrieval method, image retrieval device, electronic equipment and storage medium Active CN113806582B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111359989.1A CN113806582B (en) 2021-11-17 2021-11-17 Image retrieval method, image retrieval device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113806582A true CN113806582A (en) 2021-12-17
CN113806582B CN113806582B (en) 2022-02-25

Family

ID=78898678

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111359989.1A Active CN113806582B (en) 2021-11-17 2021-11-17 Image retrieval method, image retrieval device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113806582B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210004400A1 (en) * 2015-09-01 2021-01-07 Dream It Get It Limited Media unit retrieval and related processes
CN107391505A (en) * 2016-05-16 2017-11-24 腾讯科技(深圳)有限公司 A kind of image processing method and system
CN107220277A (en) * 2017-04-14 2017-09-29 西北大学 Image retrieval algorithm based on cartographical sketching
CN109597906A (en) * 2018-12-06 2019-04-09 北京飞搜科技有限公司 Image search method and device
CN109710792A (en) * 2018-12-24 2019-05-03 西安烽火软件科技有限公司 A kind of fast face searching system application based on index
CN110598037A (en) * 2019-09-23 2019-12-20 腾讯科技(深圳)有限公司 Image searching method, device and storage medium
CN110825901A (en) * 2019-11-11 2020-02-21 腾讯科技(北京)有限公司 Image-text matching method, device and equipment based on artificial intelligence and storage medium
CN111522986A (en) * 2020-04-23 2020-08-11 北京百度网讯科技有限公司 Image retrieval method, apparatus, device and medium
CN112214630A (en) * 2020-09-10 2021-01-12 武汉纺织大学 Clothing image retrieval system and method based on expansion convolution residual error network
CN112612913A (en) * 2020-12-28 2021-04-06 厦门市美亚柏科信息股份有限公司 Image searching method and system
CN113190699A (en) * 2021-05-14 2021-07-30 华中科技大学 Remote sensing image retrieval method and device based on category-level semantic hash

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BAO-HUA QIANG, et al.: "Large-scale Multi-label Image Retrieval Using Residual Network with Hash Layer", 2019 Eleventh International Conference on Advanced Computational Intelligence (ICACI) *
廖列法 (LIAO Liefa), et al.: "Iterative quantization hashing image retrieval method based on deep residual network" (in Chinese), 《计算机应用》 (Journal of Computer Applications) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023207028A1 (en) * 2022-04-27 2023-11-02 北京百度网讯科技有限公司 Image retrieval method and apparatus, and computer program product
CN114676279A (en) * 2022-05-25 2022-06-28 腾讯科技(深圳)有限公司 Image retrieval method, device, equipment and computer readable storage medium
CN114676279B (en) * 2022-05-25 2022-09-02 腾讯科技(深圳)有限公司 Image retrieval method, device, equipment and computer readable storage medium
CN116127111A (en) * 2023-01-03 2023-05-16 百度在线网络技术(北京)有限公司 Picture searching method, picture searching device, electronic equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN113806582B (en) 2022-02-25

Similar Documents

Publication Publication Date Title
WO2022068196A1 (en) Cross-modal data processing method and device, storage medium, and electronic device
CN113806582B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
JP7360497B2 (en) Cross-modal feature extraction method, extraction device, and program
CN111797893A (en) Neural network training method, image classification system and related equipment
CN111523621A (en) Image recognition method and device, computer equipment and storage medium
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN116261731A (en) Relation learning method and system based on multi-hop attention-seeking neural network
WO2020244065A1 (en) Character vector definition method, apparatus and device based on artificial intelligence, and storage medium
CN111382868A (en) Neural network structure search method and neural network structure search device
CN110795527B (en) Candidate entity ordering method, training method and related device
CN110647904A (en) Cross-modal retrieval method and system based on unmarked data migration
CN111080551B (en) Multi-label image complement method based on depth convolution feature and semantic neighbor
CN114358188A (en) Feature extraction model processing method, feature extraction model processing device, sample retrieval method, sample retrieval device and computer equipment
CN116049412B (en) Text classification method, model training method, device and electronic equipment
CN113593661A (en) Clinical term standardization method, device, electronic equipment and storage medium
CN114298122A (en) Data classification method, device, equipment, storage medium and computer program product
CN114329029A (en) Object retrieval method, device, equipment and computer storage medium
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN109710842B (en) Business information pushing method and device and readable storage medium
CN114897085A (en) Clustering method based on closed subgraph link prediction and computer equipment
CN113076758A (en) Task-oriented dialog-oriented multi-domain request type intention identification method
CN112364198A (en) Cross-modal Hash retrieval method, terminal device and storage medium
CN111611395B (en) Entity relationship identification method and device
CN116796038A (en) Remote sensing data retrieval method, remote sensing data retrieval device, edge processing equipment and storage medium
CN115512176A (en) Model training method, image understanding method, device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant