CN113032612A - Construction method of multi-target image retrieval model, retrieval method and device


Info

Publication number
CN113032612A
Authority
CN
China
Prior art keywords
image
target
hash
module
target candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110270411.2A
Other languages
Chinese (zh)
Other versions
CN113032612B (en)
Inventor
范建平
舒永康
赵万青
彭先霖
胡琦瑶
杨文静
王琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern University
Original Assignee
Northwestern University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern University
Priority to CN202110270411.2A
Publication of CN113032612A
Application granted
Publication of CN113032612B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval of still image data
    • G06F16/55 Clustering; Classification
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata automatically derived from the content
    • G06F16/5866 Retrieval characterised by using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses a construction method of a multi-target image retrieval model, together with a retrieval method and a retrieval device. A user inputs one or more images to be retrieved, obtains the multi-target query hash code set of the images through the model, compares the query hash code set with the target hash code sets in a retrieval database by computing joint Hamming distances, sorts the distances in ascending order, and returns the retrieval results. The invention acquires the hash code corresponding to each image target region for retrieval in a weakly supervised learning manner, and each target region of each image generates an independent hash code, thereby avoiding interference from complex backgrounds and mutual interference among targets, representing the content of a complex image more completely, improving the retrieval effect, and expanding the ways in which images can be retrieved.

Description

Construction method of multi-target image retrieval model, retrieval method and device
Technical Field
The invention belongs to the technical field of image retrieval, and particularly relates to a construction method of a multi-target image retrieval model, a retrieval method and a retrieval device.
Background
In recent years, as deep learning has been continuously developed and applied in the image field, image retrieval methods that incorporate deep learning can accurately capture the semantic information hidden in images and have greatly improved retrieval performance. However, existing methods overlook several problems. Images often carry rich semantic information, and one image usually contains multiple objects or content regions, yet in current mainstream image retrieval methods an image is abstracted into a single feature vector of fixed dimension, usually a real-valued vector of 256 to 2048 dimensions. A great deal of useful information is therefore lost in the process of producing the feature vector representation, and only the most central semantic information of the image is retained. Moreover, as the volume of image data grows year by year, the feature vectors are further converted into hash representations so that massive image databases can be searched quickly. Feature representations processed by hashing methods generally use binary hash vectors of 16 to 128 dimensions, which lose even more information and cannot accurately depict each target in the image, so the practical retrieval performance is greatly limited.
The core step of image retrieval is to generate an accurate feature description of an image; for complex multi-label, multi-target scene images, how to describe the image content accurately and completely is the central problem. On the one hand, particularly in a multi-label image, one label usually corresponds only to a certain target or local area of the image rather than to the whole image, and in the process of obtaining a whole-image feature representation, complex background information irrelevant to the actual retrieval target may be mixed in, so that the extracted hash code loses its strong power to characterize the retrieval target. Furthermore, multiple targets of different sizes mixed together also interfere with each other, so that the subject content of each retrieval target cannot be distinguished; this brings interference to retrieval and degrades the retrieval effect. On the other hand, conventional image retrieval inputs one image and retrieves images related to it, which has clear limitations: if one wishes to input several images and simultaneously retrieve images containing the targets of all of them, the conventional single-feature representation cannot achieve this.
To address this, an effective approach is to obtain a feature representation for the target area corresponding to each label, and some existing methods obtain each target in the image by target detection and then obtain a feature vector representation of each target, thereby tackling multi-target image retrieval. In these methods, however, the target detection process and the learning of target region features are independent; multiple steps and multiple models cooperate to complete the whole task without an end-to-end system. The whole pipeline is complex, the optimization objectives of the different modules are inconsistent, and the objective functions of some modules may deviate from the overall goal of the system, so independently trained modules can hardly bring the whole system to optimal performance. Errors also accumulate: a deviation produced in an earlier step may affect every subsequent step, making it difficult for the whole system to reach an optimal state.
Disclosure of Invention
Aiming at the defects and shortcomings of the prior art, the invention provides a construction method of a multi-target image retrieval model, a retrieval method, and a retrieval device that use only image-level annotation information to learn the correspondence between image labels and target positions while simultaneously learning the hash feature representation of each target and region, thereby solving the prior art's inability to avoid interference from noise unrelated to the retrieval target, as well as the problems of fine-grained and diversified retrieval.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-target image retrieval model construction method is provided, the multi-target image retrieval model can be used for obtaining the Hash-like representation of multi-target images to realize multi-target image retrieval; the method comprises the following steps:
step 1, acquiring a plurality of images and corresponding labels thereof as a training set;
step 2, constructing a pre-training neural network model, wherein the neural network model is a multi-task learning model and comprises the following steps:
a first module: the RPN module is used for generating a target candidate frame for an input image;
a second module: an arbitrary deep convolutional neural network for generating a feature map of an input image;
a third module: the interest area pooling module is used for quickly acquiring the feature vector representation of the target area corresponding to each target candidate frame;
a fourth module: the target area detection branch is used for determining the class probability of the target candidate frame and then summing the class probabilities to obtain the class probability of the image;
a fifth module: a hash code learning branch to determine a hash-like representation of the target candidate box;
and 3, inputting the image into the pre-training neural network model for training, wherein the training comprises the following steps:
step 3.1, inputting the images into the first-module RPN and generating initial target candidate frames for each image; assuming that P target candidate frames are finally obtained, the output is a P × 4 matrix in which each row holds the coordinate information of one target candidate frame, the starting coordinates and the width and height (x, y, w, h), thereby obtaining the coordinate information of the P target candidate frames;
meanwhile, inputting the image into the second-module deep convolutional neural network and, with the batch size set to B, outputting a feature map;
step 3.2, inputting the feature map and the P target candidate frames from step 3.1 into the third-module interest region pooling module to obtain the feature vector representation of the target region corresponding to each target candidate frame, and outputting a B × P × d matrix, where d is the dimension of the feature vector representation;
step 3.3, the output of the step 3.2 is respectively input into a fourth module target area detection branch and a fifth module hash code learning branch, wherein the output of the target area detection branch is the image class probability, and the output of the hash code learning branch is the class hash representation of each target candidate box;
step 3.4, optimizing the model: comparing the image class probability with the label vector of the image to compute the binary cross-entropy classification loss; comparing the hash-like representation of each target candidate frame with the label of the image to compute the hash loss; weighting and summing the two computed losses to obtain the final joint loss; and optimizing the model by backward iterations of stochastic gradient descent, finally obtaining the multi-target image retrieval model.
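The five modules and the training flow of steps 3.1 to 3.4 can be summarized in a compact forward-pass sketch. The following PyTorch-style code is a minimal illustration under stated assumptions: the backbone, RPN, and RoI pooling modules are taken as given, the layer sizes (d, C, L) are placeholders, and each data stream is collapsed to a single linear layer for brevity; it is not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MultiTargetHashNet(nn.Module):
    """Sketch of the five-module model: RPN, CNN backbone, RoI pooling,
    target-area detection branch, and hash code learning branch."""
    def __init__(self, backbone, rpn, roi_pool, d=4096, C=20, L=64):
        super().__init__()
        self.backbone = backbone   # module 2: any deep CNN -> feature map
        self.rpn = rpn             # module 1: generates P candidate frames
        self.roi_pool = roi_pool   # module 3: region features, B x P x d
        self.shared = nn.Sequential(nn.Linear(d, 512), nn.ReLU(),
                                    nn.Linear(512, 512), nn.ReLU())
        self.det = nn.Linear(512, C)    # detection data stream (module 4)
        self.cls = nn.Linear(512, C)    # classification data stream (module 4)
        self.hash_layer = nn.Sequential(nn.Linear(512, L), nn.Sigmoid())  # module 5

    def forward(self, images):
        boxes = self.rpn(images)              # B x P x 4 candidate frames
        fmap = self.backbone(images)          # B x 512 x 18 x 18 feature map
        feats = self.roi_pool(fmap, boxes)    # B x P x d region features
        h = self.shared(feats)
        det = torch.softmax(self.det(h), dim=1)   # Eq. 1: over the P frames
        cls = torch.softmax(self.cls(h), dim=2)   # Eq. 2: over the C classes
        box_probs = det * cls                     # class probability per frame
        img_probs = box_probs.sum(dim=1)          # image class probability, B x C
        codes = self.hash_layer(h)                # hash-like codes, B x P x L
        return box_probs, img_probs, codes
```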
The invention also comprises the following technical characteristics:
specifically, the step 3.3 includes:
step 3.3.1, in the fourth-module target area detection branch, the feature vector of each target candidate frame passes through two fully connected layers and is then divided into a detection data stream branch and a classification data stream branch, which pass through their respective fully connected layers to obtain a detection output matrix and a classification output matrix; the detection output matrix and the classification output matrix are then multiplied element-wise to obtain the class probability of each target candidate frame, output as a merged data matrix; finally, the class probabilities of all target candidate frames of each image are summed to obtain the class probability of the image, which serves as the output of the target region detection branch;
step 3.3.2, in the fifth-module hash code learning branch, the feature vector of each target candidate frame passes through two fully connected layers and is then input into a hash layer containing L nodes, giving a B × P × L hash-like output in which each row vector of each image is the hash-like representation of one target candidate frame.
Specifically, the step 3.3.1 includes:
step (a1), the detection data stream branch passes through two fully connected layers to obtain the detection output matrix $X^d$, computed according to Equation 1:

$$[\delta_{\mathrm{detect}}(x^d)]_{ij} = \frac{e^{x^d_{ij}}}{\sum_{k=1}^{P} e^{x^d_{kj}}} \tag{1}$$

In Equation 1, $i$ denotes the $i$-th row of the matrix, $j$ denotes the $j$-th column, $[\delta_{\mathrm{detect}}(x^d)]_{ij}$ is the output at row $i$, column $j$ after the detection data stream branch, $x^d_{ij}$ is the value at row $i$, column $j$ of the detection output matrix $X^d$, $P$ is the total number of target candidate frames, and $e$ is the base of the exponentiation. With $C$ categories in total, the detection output matrix has dimension B × P × C and gives the score of each category under each target candidate frame; Equation 1 is equivalent to performing one pass of target detection;
step (a2), the classification data stream branch passes through two fully connected layers to obtain the classification output matrix $X^c$, computed according to Equation 2:

$$[\delta_{\mathrm{class}}(x^c)]_{ij} = \frac{e^{x^c_{ij}}}{\sum_{k=1}^{C} e^{x^c_{ik}}} \tag{2}$$

In Equation 2, $i$ denotes the $i$-th row of the matrix, $j$ denotes the $j$-th column, $[\delta_{\mathrm{class}}(x^c)]_{ij}$ is the output at row $i$, column $j$ after the classification data stream branch, $x^c_{ij}$ is the value at row $i$, column $j$ of the classification output matrix $X^c$, $C$ is the total number of categories, and $e$ is the base of the exponentiation. Equation 2 computes the probability of each target candidate frame for each category, which is equivalent to classifying each target candidate frame once; the classification output matrix has dimension B × P × C;
step (a3), the classification output matrix and the detection output matrix are multiplied element-wise to obtain the combined classification-and-detection result, a B × P × C data matrix in which each row of each image holds the scores of one target candidate frame over all categories, i.e. the class probabilities of that target candidate frame;
and (a4), finally, the class probabilities of the target candidate frames of each image are summed to obtain the class probability of the image; the summed output has dimension B × 1 × C and serves as the output of the target region detection branch.
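For concreteness, steps (a1) to (a4) amount to two softmaxes taken along different axes of the same B × P × C tensor, followed by an element-wise product and a sum over proposals. A minimal sketch, assuming x_d and x_c are the raw outputs of the two fully connected streams (names are illustrative):

```python
import torch

def detection_branch_merge(x_d, x_c):
    # x_d, x_c: B x P x C raw outputs of the detection and classification streams
    det = torch.softmax(x_d, dim=1)   # Eq. 1: normalize over the P candidate frames
    cls = torch.softmax(x_c, dim=2)   # Eq. 2: normalize over the C categories
    box_probs = det * cls             # step (a3): element-wise merge, B x P x C
    img_probs = box_probs.sum(dim=1)  # step (a4): image class probability, B x C
    return box_probs, img_probs
```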
Specifically, the step 3.4 includes:
step (b1), the class probability of the image obtained in step (a4) is compared with the image label vector to compute the binary cross-entropy classification loss according to Equation 3:

$$L_c(y, p(y)) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p(y_i) + (1 - y_i)\log\bigl(1 - p(y_i)\bigr)\right] \tag{3}$$

In Equation 3, $L_c(y, p(y))$ is the binary cross-entropy classification loss of one image, $N$ is the total number of dataset labels, $y_i$ is 0 or 1 and indicates whether the image has the $i$-th label (1 if it does), and $p(y_i)$ is the probability predicted by the model that the image has the $i$-th label;
step (b2), according to the output of the target area detection branch, the target candidate frames whose class probability exceeds the set class probability threshold are selected; their index numbers are used to obtain the corresponding hash-like representations from the output of the hash code learning branch, and the hash loss is computed from these representations and the corresponding labels according to Equation 4:

$$L_h(h_1, h_2, y) = \frac{1}{2}\, y\, D_h(h_1, h_2)^2 + \frac{1}{2}\,(1 - y)\max\bigl(m - D_h(h_1, h_2),\, 0\bigr)^2 \tag{4}$$

In Equation 4, $L_h(h_1, h_2, y)$ is the hash loss between target areas 1 and 2 corresponding to a pair of target candidate frames, $h_1$ is the hash-like representation of image target area 1, $h_2$ is the hash-like representation of image target area 2, $y$ is 0 or 1 and indicates whether target areas 1 and 2 have the same predicted label (1 if they do), $m$ is the margin beyond which dissimilar pairs contribute no loss, and $D_h(h_1, h_2)$ is the Euclidean distance between the hash-like vector representations of target areas 1 and 2;
and (b3), the binary cross-entropy classification loss and the hash loss are weighted and summed at a ratio of 100:1 to obtain the final joint loss, and the model is optimized by backward iterations of stochastic gradient descent, with the learning rate set to 0.001.
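A hedged sketch of the joint objective of steps (b1) to (b3): binary cross-entropy on the image class probabilities plus a pairwise hash loss, combined at the 100:1 weighting with SGD at learning rate 0.001. The contrastive form and the margin value are assumptions consistent with the variables defined for Equation 4.

```python
import torch
import torch.nn.functional as F

def joint_loss(img_probs, labels, h1, h2, same_label, margin=2.0):
    # img_probs: B x C image class probabilities; labels: B x C float 0/1 vectors.
    # h1, h2: hash-like codes of selected candidate-frame pairs; same_label: 0/1.
    cls_loss = F.binary_cross_entropy(img_probs.clamp(1e-7, 1 - 1e-7), labels)
    d = (h1 - h2).norm(dim=1)                      # Euclidean distance D_h (Eq. 4)
    hash_loss = (0.5 * same_label * d.pow(2)
                 + 0.5 * (1 - same_label) * (margin - d).clamp(min=0).pow(2)).mean()
    return 100.0 * cls_loss + hash_loss            # 100:1 weighting of step (b3)

# optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # as in step (b3)
```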
A multi-target image retrieval method includes:
step one, acquiring the target hash code sets of an image set to construct a retrieval database;
step two, inputting one or more images to be retrieved into the multi-target image retrieval model to obtain the outputs of the target area detection branch and the hash code learning branch, and screening the target candidate frames in the images and the hash-like representations corresponding to them according to a set target-candidate-frame class probability threshold;
and step three, converting the hash-like representations into hash codes to obtain the query hash code set of the multiple target candidate frames of the images to be retrieved, computing the joint Hamming distances between the query hash code set and the target hash code sets, sorting the joint Hamming distances in ascending order, and taking the images corresponding to the first n distances as the finally returned retrieval results.
Specifically, in step three, the hash-like representation is converted into a hash code according to Equation 5:

$$b_i = \begin{cases} 1, & h_i \geq 0.5 \\ 0, & h_i < 0.5 \end{cases} \tag{5}$$

In Equation 5, $h_i$ is the value at the $i$-th position of the hash-like representation vector and $b_i$ is the resulting hash bit. The query hash codes of the multiple target candidate frames of the q images to be retrieved are thereby obtained and combined into the query hash code set $H_q$ of the images to be retrieved.
Specifically, in step three, the joint Hamming distance between the query hash code set and a target hash code set is computed according to Equation 6:

$$\mathrm{UHD}(H_q, H_i) = \sum_{r=1}^{R} \min_{1 \le j \le K} D_H\bigl(b^q_r, b^i_j\bigr) \tag{6}$$

In Equation 6, $\mathrm{UHD}(H_q, H_i)$ is the joint Hamming distance between the image to be retrieved and the hash code set of the $i$-th image in the image set, $H_q$ is the hash code set of the image to be retrieved, $H_i$ is that of the $i$-th image in the image set, $R$ is the number of target hash codes of the image to be retrieved, $K$ is the number of target hash codes of the $i$-th image in the image set, and $D_H(b^q_r, b^i_j)$ is the Hamming distance between the $r$-th hash code of the image to be retrieved and the $j$-th hash code of the $i$-th image in the image set;

in Equation 6, each query hash code is matched with the most similar hash code in the target hash code set to obtain a minimum distance, and these minimum distances are summed to give the final joint Hamming distance.
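A minimal NumPy sketch of Equations 5 and 6: binarize the hash-like vectors at 0.5 (their values approach 0 or 1 by construction), then match each query code to its nearest target code and sum the minima. Variable names are illustrative.

```python
import numpy as np

def binarize(h):
    # Eq. 5: threshold a hash-like vector (values approach 0 or 1) into hash bits.
    return (np.asarray(h) >= 0.5).astype(np.uint8)

def joint_hamming_distance(H_q, H_i):
    # Eq. 6: for each of the R query codes, take the Hamming distance to the
    # most similar of the K target codes, then sum. H_q: R x L, H_i: K x L.
    return sum(min(int(np.sum(q != t)) for t in H_i) for q in H_q)

# Retrieval then ranks database images by joint_hamming_distance in ascending
# order and returns the first n images.
```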
A multi-target image retrieval model construction device includes:
the acquisition module is used for acquiring the images and the labels corresponding to the images as a training set;
the pre-training neural network model building module is used for building a pre-training neural network model, and the neural network model is a multi-task learning model and comprises the following steps: a first module: the RPN module is used for generating a target candidate frame for an input image; a second module: an arbitrary deep convolutional neural network for generating a feature map of an input image; a third module: the interest area pooling module is used for quickly acquiring the feature vector representation of the target area corresponding to each target candidate frame; a fourth module: the target area detection branch is used for determining the class probability of the target candidate frame and then summing the class probabilities to obtain the class probability of the image; a fifth module: a hash code learning branch to determine a hash-like representation of the target candidate box;
and the training module is used for inputting the image into the pre-training neural network model for training.
Specifically, the training module includes:
the target candidate frame generation module is used for inputting an image into the first-module RPN and generating initial target candidate frames for each image; assuming that P target candidate frames are finally obtained, the output is a P × 4 matrix in which each row holds the coordinate information of one target candidate frame, the starting coordinates and the width and height (x, y, w, h), thereby obtaining the coordinate information of the P target candidate frames;
the feature map output module is used for inputting the image into the second-module deep convolutional neural network and, with the batch size set to B, outputting a feature map;
the target candidate frame feature vector representation module is used for inputting the feature map and the P target candidate frames into the third-module interest region pooling module to obtain the feature vector representation of the target region corresponding to each target candidate frame, output as a B × P × d matrix, where d is the dimension of the feature vector representation;
the image class probability and hash-like representation output module is used for inputting the feature vector representation of the target region corresponding to each target candidate frame into the fourth-module target area detection branch and the fifth-module hash code learning branch respectively, where the output of the target area detection branch is the image class probability and the output of the hash code learning branch is the hash-like representation of each target candidate frame;
the model optimization module is used for optimizing the model, which includes comparing the image class probability with the label vector of the image to compute the binary cross-entropy classification loss; comparing the hash-like representation of each target candidate frame with the label of the image to compute the hash loss; weighting and summing the two computed losses to obtain the final joint loss; and optimizing the model by backward iterations of stochastic gradient descent, finally obtaining the multi-target image retrieval model.
A multi-target image retrieval apparatus comprising:
the retrieval database construction module is used for acquiring the target hash code sets of an image set to construct a retrieval database;
the multi-target image input module is used for inputting one or more images to be retrieved into the multi-target image retrieval model;
the hash-like representation acquisition module is used for obtaining the outputs of the target area detection branch and the hash code learning branch, by the method in the multi-target image retrieval model construction device, and for screening the target candidate frames in the images and the hash-like representations corresponding to them according to the set target-candidate-frame class probability threshold;
the conversion and query hash code set acquisition module is used for converting the hash-like representations into hash codes to obtain the query hash code set of the multiple target candidate frames of the images to be retrieved;
and the retrieval result output module is used for computing the joint Hamming distances between the query hash code set and the target hash code sets, sorting the joint Hamming distances in ascending order, and taking the images corresponding to the first n distances as the finally returned retrieval results.
Compared with the prior art, the invention has the beneficial technical effects that:
the invention provides an end-to-end multi-target image Hash model based on weak supervision detection, which uses image-level labels to learn the corresponding relation between the image labels and the target positions thereof and also learn the Hash characteristic representation of each target and each region, thereby generating respective Hash codes for each target region of the image, avoiding the interference of complex background noise, comprehensively and completely representing the semantic information of the image, thereby effectively improving the retrieval effect.
Drawings
FIG. 1 is a diagram of a model of the process of the present invention;
FIG. 2 is a flow chart of image retrieval;
FIG. 3 is a diagram of the precision@500 variation of each method at different hash code lengths.
FIG. 4 is a diagram of the precision-recall variation when the hash code length is 64.
FIG. 5 is a diagram of the ACG variation for different numbers of returned images when the hash code length is 64.
FIG. 6 is a diagram of the NDCG variation for different numbers of returned images when the hash code length is 64.
FIG. 7 shows the multi-target searching effect of a single image to be searched.
Fig. 8 shows the multi-target retrieval effect of a plurality of images to be retrieved.
Detailed Description
The following describes in detail specific embodiments of the present invention. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
The present application provides an end-to-end multi-target image hash network model method based on weakly supervised detection. It uses only images and their corresponding image-level labels, reducing dependence on target detection annotation boxes, while simultaneously learning the correspondence between labels and the targets in the image as well as the hash codes of those targets, so that each target in the image is described completely and independently. The whole model is trained and optimized jointly in a multi-task fashion, which overcomes the inherent defects of previous multi-module methods, reduces the complexity of the image feature representation process, and describes image content accurately at a finer granularity. Extensive experimental comparison verifies that the method greatly improves the image retrieval effect. The multi-target image feature representation also adapts naturally to multi-image retrieval requirements: retrieval only requires obtaining the target hash set of each query image and merging them into one large target hash set.
The following definitions and concepts relating to the present invention are provided for illustration:
End-to-end: multiple steps for solving a problem are combined so that the final required result is obtained directly from the original input, without separate intermediate processing steps. In the present invention, an image is input into the model, and the detection results and the target hash representations of the targets are obtained directly.
Weakly supervised detection: object detection is learned using weak supervision information that contains only image-level label annotations, without the object frame positions corresponding to those labels.
RPN module: a candidate box generation network (Region Proposal Network) that, given an input image, generates a number of target candidate frames for it.
Prior knowledge: experience obtained from previous training and used as new guiding information.
Hash-like representation: an output vector resembling a hash code vector, in which the value of each position is a real number approaching 0 or 1.
Example 1:
the embodiment discloses a method for constructing a multi-target image retrieval model, which can be used for acquiring the Hash-like representation of a multi-target image to realize multi-target image retrieval; the method comprises the following steps:
step 1, acquiring an image and a label corresponding to the image as a training set;
specifically, in the present embodiment, the acquired images and labels are organized into pairs, and all images are resized to a fixed size S × S to accelerate model computation, where I is an image and L is its label vector. The label vector is the one-hot style encoding of the image's labels, whose length equals the number of dataset categories: a value of 1 at a position means the image has that label, and 0 means it does not. Image and label vector pairs {<I1, L1>, <I2, L2>, ..., <In, Ln>} are thus constructed.
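For illustration, a label vector under this convention might be built as follows; the category list and label names here are hypothetical:

```python
def label_vector(image_labels, categories):
    # One-hot style multi-label encoding: 1 if the image has the label, else 0.
    return [1 if c in image_labels else 0 for c in categories]

# e.g. with hypothetical categories ["person", "dog", "car"]:
# label_vector({"dog", "person"}, ["person", "dog", "car"]) -> [1, 1, 0]
```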
Step 2, constructing a pre-trained neural network model, which is a multi-task learning model and includes, as shown in fig. 1:
a first module: the RPN module is used for generating a target candidate frame for an input image;
a second module: an arbitrary deep convolutional neural network for generating a feature map of an input image;
a third module: the interest region pooling (RoIAlign pooling) module is used for quickly acquiring the feature vector representation of the target region corresponding to each target candidate box;
a fourth module: the target area detection branch is used for determining the class probability of the target candidate frame and then summing the class probabilities to obtain the class probability of the image;
a fifth module: a hash code learning branch to determine a hash-like representation of the target candidate box;
and 3, inputting the image into the pre-training neural network model for training, wherein the training comprises the following steps:
step 3.1, inputting the images into the first-module RPN (pre-trained) and generating initial target candidate frames for each image; the initial target candidate frames generated by the RPN are filtered according to prior knowledge, removing target candidate frames that are too small, too large, or have an unbalanced width-to-height ratio; assuming that P target candidate frames are finally obtained, the output is a P × 4 matrix in which each row holds the coordinate information of one target candidate frame, the starting coordinates and the width and height (x, y, w, h), thereby obtaining the coordinate information of the P target candidate frames.
In other embodiments, other region proposal generation methods, such as sliding windows or selective search, may also be used to obtain candidate target frames of the image in advance and supply them as an input; here, to realize an end-to-end system, the RPN network module is used as an integral part of the system.
More specifically, experiments showed that classification errors of target frames increase when the frames are too small, so the minimum size of a target candidate frame is limited to be larger than 50 × 50, and the ratio of the short side to the long side of a target candidate frame is limited to the range 1:4. To keep subsequent input dimensions consistent, the number of candidate regions is limited to P: if more than P candidate frames remain after filtering, the first P are taken, and if fewer than P remain, zeros are used as padding. The output of this module is a P × 4 matrix in which each row holds the coordinate information of one target frame, the starting coordinates and the width and height (x, y, w, h), thereby obtaining the coordinate information of the P candidate target frames.
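A sketch of this filtering and padding rule, under the assumptions that the 50 × 50 constraint applies to the shorter side and that boxes are given as (x, y, w, h):

```python
def filter_proposals(boxes, P, min_side=50, max_aspect=4.0):
    # Drop boxes smaller than 50 x 50 or with short:long side ratio beyond 1:4,
    # then truncate or zero-pad to exactly P boxes (P x 4 output).
    kept = []
    for (x, y, w, h) in boxes:
        short, long_ = min(w, h), max(w, h)
        if short >= min_side and long_ / short <= max_aspect:
            kept.append((x, y, w, h))
    kept = kept[:P]
    kept += [(0, 0, 0, 0)] * (P - len(kept))  # pad to keep dimensions consistent
    return kept
```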
Meanwhile, the image is input into the second-module deep convolutional neural network and, with the batch size B, a feature map is output with dimensions B × 512 × 18 × 18. In this embodiment, the network structure uses a VGG16 model pre-trained on ImageNet, with the final classification layer and the pool5 layer after the last convolutional layer removed; the feature map representation of the image is obtained by convolution and pooling computation, with output dimension B × 512 × 18 × 18.
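A construction sketch of this truncated backbone using torchvision's pre-trained VGG16; the input size S = 288 is an assumption that yields the stated 18 × 18 spatial output (288 / 16 = 18 once the final pooling layer is removed):

```python
import torch
import torchvision

# Drop the classifier and the pool5 layer after the last convolution,
# keeping only the convolutional feature extractor.
vgg = torchvision.models.vgg16(pretrained=True)
backbone = torch.nn.Sequential(*list(vgg.features.children())[:-1])

x = torch.randn(2, 3, 288, 288)   # S = 288 is an assumed input size
print(backbone(x).shape)          # torch.Size([2, 512, 18, 18])
```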
Step 3.2, inputting the feature map and the P target candidate boxes in the step 3.1 into a third module interest area pooling module, receiving target areas with any size and generating feature output with fixed size to obtain feature vector representation of the target area corresponding to each target candidate box, wherein the output is a B multiplied by P multiplied by d matrix, and d is a dimension represented by the feature vector; each target candidate box is represented by a d-dimensional feature vector;
step 3.3, the output of the step 3.2 is respectively input into a fourth module target area detection branch and a fifth module hash code learning branch, wherein the output of the target area detection branch is the image class probability, and the output of the hash code learning branch is the class hash representation of each target candidate box;
specifically, step 3.3 includes:
step 3.3.1, in the fourth-module target area detection branch, the feature vector of each target candidate frame passes through two fully connected layers and is then divided into a detection data stream branch and a classification data stream branch, which pass through their respective fully connected layers to obtain a detection output matrix and a classification output matrix; the detection output matrix and the classification output matrix are then multiplied element-wise to obtain the class probability of each target candidate frame, output as a merged data matrix; finally, the class probabilities of all target candidate frames of each image are summed to obtain the class probability of the image (i.e. the image's score for each category), which serves as the output of the target region detection branch;
step 3.3.2, in the fifth-module hash code learning branch, the feature vector of each target candidate frame passes through two fully connected layers (two 512-dimensional fully connected layers in this embodiment) and is then input into a hash layer containing L nodes, giving a B × P × L hash-like output in which each row vector of each image is the hash-like representation (hash-like output) of one target candidate frame.
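A minimal sketch of this branch; the sigmoid on the hash layer is an assumption consistent with the hash-like definition given earlier (each output position is a real number approaching 0 or 1):

```python
import torch.nn as nn

def make_hash_branch(d=4096, L=64):
    # Two 512-dimensional fully connected layers, then a hash layer of L nodes.
    return nn.Sequential(
        nn.Linear(d, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, L), nn.Sigmoid(),  # B x P x L hash-like output
    )
```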
More specifically, step 3.3.1 comprises:
step (a1), the detection data stream branch passes through two fully connected layers to obtain the detection output matrix $X^d$, computed according to Equation 1:

$$[\delta_{\mathrm{detect}}(x^d)]_{ij} = \frac{e^{x^d_{ij}}}{\sum_{k=1}^{P} e^{x^d_{kj}}} \tag{1}$$

In Equation 1, $i$ denotes the $i$-th row of the matrix, $j$ denotes the $j$-th column, $[\delta_{\mathrm{detect}}(x^d)]_{ij}$ is the output at row $i$, column $j$ after the detection data stream branch, $x^d_{ij}$ is the value at row $i$, column $j$ of the detection output matrix $X^d$, $P$ is the total number of target candidate frames, and $e$ is the base of the exponentiation. With $C$ categories in total, the detection output matrix has dimension B × P × C and gives the score of each category under each target candidate frame; Equation 1 is equivalent to performing one pass of target detection;
step (a2), the classification data stream branch passes through two fully connected layers to obtain the classification output matrix $X^c$, computed according to Equation 2:

$$[\delta_{\mathrm{class}}(x^c)]_{ij} = \frac{e^{x^c_{ij}}}{\sum_{k=1}^{C} e^{x^c_{ik}}} \tag{2}$$

In Equation 2, $i$ denotes the $i$-th row of the matrix, $j$ denotes the $j$-th column, $[\delta_{\mathrm{class}}(x^c)]_{ij}$ is the output at row $i$, column $j$ after the classification data stream branch, $x^c_{ij}$ is the value at row $i$, column $j$ of the classification output matrix $X^c$, $C$ is the total number of categories, and $e$ is the base of the exponentiation. Equation 2 computes the probability of each target candidate frame for each category, i.e. each target candidate frame is classified once; the classification output matrix has dimension B × P × C;
step (a3), the classification output matrix and the detection output matrix are multiplied element-wise to obtain the combined classification-and-detection result, a B × P × C data matrix in which each row of each image holds the scores of one target candidate frame over all categories, i.e. the class probabilities of that target candidate frame;
and (a4), finally, the class probabilities of the target candidate frames of each image are summed to obtain the class probability of the image (i.e. the score of the image for each category); the summed output has dimension B × 1 × C and serves as the output of the target region detection branch.
Step 3.4, optimizing the model: comparing the output of the target region detection branch, namely the image category probability with the label vector of the image to calculate binary cross entropy classification loss; comparing the output of the hash code learning branch, namely the class hash representation of each target candidate box with the label of the image to calculate the hash loss; and weighting and summing the two loss functions obtained by calculation to obtain the final combined loss, and performing a reverse iterative optimization model through a random gradient descent (SGD) method to finally obtain a multi-target image retrieval model.
Step 3.4 comprises:
step (b1), the class probability of the image obtained in step (a4) is compared with the image label vector to compute the binary cross-entropy classification loss according to Equation 3:

$$L_c(y, p(y)) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p(y_i) + (1 - y_i)\log\bigl(1 - p(y_i)\bigr)\right] \tag{3}$$

In Equation 3, $L_c(y, p(y))$ is the binary cross-entropy classification loss of one image, $N$ is the total number of dataset labels, $y_i$ is 0 or 1 and indicates whether the image has the $i$-th label (1 if it does), and $p(y_i)$ is the probability predicted by the model that the image has the $i$-th label;
step (b2), according to the output of the target area detection branch, the target candidate frames whose class probability exceeds the set class probability threshold are selected; their index numbers are used to obtain the corresponding hash-like representations from the output of the hash code learning branch, and the hash loss is computed from these representations and the corresponding labels. Specifically, the hash code learning branch ultimately computes a hash-like representation for every target candidate frame, so the detection results of the target area detection branch are needed to evaluate the hashing effect. In the output data of the target region detection branch, each row vector holds the scores of one target frame region over all categories. A target frame category score threshold θ is set to 0.2 and the target candidate frame regions are filtered and selected against it: a target candidate frame is retained if its maximum category score is greater than the threshold θ. Filtering is then performed per category, keeping only the highest-scoring target candidate frame under each category as that category's target frame. According to the index numbers of the finally selected target candidate frames, the corresponding hash-like representations are obtained from the hash branch, and the hash loss is computed from them and the corresponding labels; a sketch of this selection rule is given after step (b3) below.
The calculation is performed according to Equation 4:

$$L_h(h_1, h_2, y) = \frac{1}{2}\, y\, D_h(h_1, h_2)^2 + \frac{1}{2}\,(1 - y)\max\bigl(m - D_h(h_1, h_2),\, 0\bigr)^2 \tag{4}$$

In Equation 4, $L_h(h_1, h_2, y)$ is the hash loss between target areas 1 and 2 corresponding to a pair of target candidate frames, $h_1$ is the hash-like representation of image target area 1, $h_2$ is the hash-like representation of image target area 2, $y$ is 0 or 1 and indicates whether target areas 1 and 2 have the same predicted label (1 if they do), $m$ is the margin beyond which dissimilar pairs contribute no loss, and $D_h(h_1, h_2)$ is the Euclidean distance between the hash-like vector representations of target areas 1 and 2;
and (b3), the binary cross-entropy classification loss and the hash loss are weighted and summed at a ratio of 100:1 to obtain the final joint loss, and the model is optimized by backward iterations of stochastic gradient descent (SGD), with the learning rate set to 0.001.
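The candidate-frame selection rule of step (b2), referenced above, can be sketched as follows: keep frames whose best class score exceeds θ = 0.2, then keep only the top-scoring frame per category. The per-image tensor layout is an assumption.

```python
import torch

def select_boxes(box_probs, theta=0.2):
    # box_probs: P x C class-probability matrix for one image.
    keep = set()
    max_scores, _ = box_probs.max(dim=1)     # best class score per candidate frame
    valid = max_scores > theta               # frames passing the threshold filter
    for c in range(box_probs.size(1)):
        scores = box_probs[:, c].clone()
        scores[~valid] = -1.0
        best = int(scores.argmax())
        if scores[best] > theta:
            keep.add(best)                   # top-1 frame for category c
    return sorted(keep)                      # index numbers into the hash outputs
```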
Example 2:
the embodiment discloses a multi-target image retrieval method, as shown in fig. 2, including:
step one, acquiring the target hash code sets of an image set to construct a retrieval database: the images in the dataset are input in turn into the model trained in Embodiment 1, the target candidate frames in each image and their corresponding hash codes are screened according to the set threshold, the hash codes are combined into a target hash code set, and an association is established between each image and its corresponding hash code set;
step two, inputting a single image to be retrieved (as shown in FIG. 7) or multiple images to be retrieved (as shown in FIG. 8) into the multi-target image retrieval model constructed in Embodiment 1; in this embodiment, q images {I1, I2, ..., Iq} are input, and the outputs of the target area detection branch and the hash code learning branch are obtained; the target candidate frames in the images and the hash-like representations corresponding to them are screened according to the set target-candidate-frame class probability threshold;
and step three, converting the hash-like representations into hash codes to obtain the query hash code set of the multiple target candidate frames of the images to be retrieved, computing the joint Hamming distances between the query hash code set and the target hash code sets, sorting the joint Hamming distances in ascending order, and taking the images corresponding to the first n distances as the finally returned retrieval results; FIG. 7 shows the retrieval results for a single image to be retrieved, and FIG. 8 shows those for multiple images to be retrieved.
More specifically, in step three, the hash-like representation is converted into a hash code according to Equation 5:

$$b_i = \begin{cases} 1, & h_i \geq 0.5 \\ 0, & h_i < 0.5 \end{cases} \tag{5}$$

In Equation 5, $h_i$ is the value at the $i$-th position of the hash-like representation vector and $b_i$ is the resulting hash bit. The query hash codes of the multiple target candidate frames of the q images to be retrieved are thereby obtained and combined into the query hash code set $H_q$ of the images to be retrieved.
In step three, the joint Hamming distance between the query hash code set and a target hash code set is computed according to Equation 6:

$$\mathrm{UHD}(H_q, H_i) = \sum_{r=1}^{R} \min_{1 \le j \le K} D_H\bigl(b^q_r, b^i_j\bigr) \tag{6}$$

In Equation 6, $\mathrm{UHD}(H_q, H_i)$ is the joint Hamming distance between the image to be retrieved and the hash code set of the $i$-th image in the image set, $H_q$ is the hash code set of the image to be retrieved, $H_i$ is that of the $i$-th image in the image set, $R$ is the number of target hash codes of the image to be retrieved, $K$ is the number of target hash codes of the $i$-th image in the image set, and $D_H(b^q_r, b^i_j)$ is the Hamming distance between the $r$-th hash code of the image to be retrieved and the $j$-th hash code of the $i$-th image in the image set;

in Equation 6, each query hash code is matched with the most similar hash code in the target hash code set to obtain a minimum distance, and these minimum distances are summed to give the final joint Hamming distance.
Experimental verification:
To verify the effectiveness and superiority of the image retrieval method disclosed by the invention, several different evaluation indices were used on the VOC2012 multi-label image dataset, and comparison experiments were carried out against other mainstream image retrieval methods. The compared methods include unsupervised retrieval methods (SH, LSH, SpH, ITQ) and mainstream supervised image retrieval methods (KSH, DSH, DHN). The evaluation indices include precision@k, the precision-recall curve, ACG, and NDCG. FIG. 3 shows the precision@500 of the mainstream image retrieval methods and the method of the present invention at different hash code lengths, where two images are regarded as a correct match only if they have the same label, without considering the degree of similarity between images. FIG. 4 shows the precision-recall curves of the mainstream image retrieval methods and the method of the present invention, reflecting the feature-discriminating capability of the models. FIG. 5 shows the ACG of the compared methods for different numbers of returned images when the hash code length is 64; ACG takes the similarity between images into account, its value reflecting the average number of labels shared by a retrieved image and the query image, without considering ranking. FIG. 6 shows the NDCG of the compared methods for different numbers of returned images when the hash code length is 64; NDCG additionally considers the ranking of the retrieval results, with higher-ranked results carrying greater weight.
As can be seen from these experimental comparisons, the image retrieval model and method of the invention are clearly superior to current mainstream image retrieval methods under every evaluation index, because the method fully considers the multi-level semantic information of the image, avoids interference from noise information irrelevant to the retrieval target, and describes the semantic information of the image more fully and completely at a fine-grained level, and therefore naturally possesses a certain superiority.
Example 3:
the embodiment discloses a multi-target image retrieval model construction device, which comprises:
the acquisition module is used for acquiring the images and the labels corresponding to the images as a training set;
the pre-training neural network model building module is used for building a pre-training neural network model, and the neural network model is a multi-task learning model and comprises the following steps: a first module: the RPN module is used for generating a target candidate frame for an input image; a second module: an arbitrary deep convolutional neural network for generating a feature map of an input image; a third module: the interest area pooling module is used for quickly acquiring the feature vector representation of the target area corresponding to each target candidate frame; a fourth module: the target area detection branch is used for determining the class probability of the target candidate frame and then summing the class probabilities to obtain the class probability of the image; a fifth module: a hash code learning branch to determine a hash-like representation of the target candidate box;
and the training module is used for inputting the image into the pre-training neural network model for training.
Specifically, the training module includes:
the target candidate frame generation module is used for inputting an image into the first-module RPN and generating initial target candidate frames for each image; assuming that P target candidate frames are finally obtained, the output is a P × 4 matrix in which each row holds the coordinate information of one target candidate frame, the starting coordinates and the width and height (x, y, w, h), thereby obtaining the coordinate information of the P target candidate frames;
the feature map output module is used for inputting the image into the second-module deep convolutional neural network and, with the batch size set to B, outputting a feature map;
the target candidate frame feature vector representation module is used for inputting the feature map and the P target candidate frames into the third-module interest region pooling module to obtain the feature vector representation of the target region corresponding to each target candidate frame, output as a B × P × d matrix, where d is the dimension of the feature vector representation;
the image class probability and hash-like representation output module is used for inputting the feature vector representation of the target region corresponding to each target candidate frame into the fourth-module target area detection branch and the fifth-module hash code learning branch respectively, where the output of the target area detection branch is the image class probability and the output of the hash code learning branch is the hash-like representation of each target candidate frame;
the model optimization module is used for optimizing the model, which includes comparing the output of the target region detection branch, namely the image class probability, with the label of the image to compute the binary cross-entropy classification loss; comparing the output of the hash code learning branch, namely the hash-like representation of each target candidate frame, with the label of the image to compute the hash loss; weighting and summing the two computed losses to obtain the final joint loss; and optimizing the model by backward iterations of stochastic gradient descent (SGD), finally obtaining the multi-target image retrieval model.
Example 4:
the embodiment discloses a multi-target image retrieval apparatus, including:
the retrieval database construction module is used for acquiring a target hash code set of the image set to construct a retrieval database;
the multi-target image input module is used for inputting a single image or a plurality of images to be retrieved into the multi-target image retrieval model;
the hash-like representation acquisition module is used for obtaining the outputs of the target area detection branch and the hash code learning branch, and for screening the target candidate frames in the images and the hash-like representations corresponding to them according to the set target-candidate-frame class probability threshold;
the conversion and query hash code set acquisition module is used for converting the hash-like representations into hash codes to obtain the query hash code set of the multiple target candidate frames of the images to be retrieved;
and the retrieval result output module is used for computing the joint Hamming distances between the query hash code set and the target hash code sets, sorting the joint Hamming distances in ascending order, and taking the images corresponding to the first n distances as the finally returned retrieval results.

Claims (10)

1. A construction method of a multi-target image retrieval model is characterized in that the multi-target image retrieval model can be used for obtaining the Hash-like representation of a multi-target image to realize multi-target image retrieval; the method comprises the following steps:
step 1, acquiring a plurality of images and corresponding labels thereof as a training set;
step 2, constructing a pre-training neural network model, wherein the neural network model is a multi-task learning model and comprises the following steps:
a first module: the RPN module is used for generating a target candidate frame for an input image;
a second module: an arbitrary deep convolutional neural network for generating a feature map of an input image;
a third module: the interest area pooling module is used for quickly acquiring the feature vector representation of the target area corresponding to each target candidate frame;
a fourth module: the target area detection branch is used for determining the class probability of the target candidate frame and then summing the class probabilities to obtain the class probability of the image;
a fifth module: a hash code learning branch to determine a hash-like representation of the target candidate box;
and 3, inputting the image into the pre-training neural network model for training, wherein the training comprises the following steps:
step 3.1, inputting the images into the first-module RPN and generating initial target candidate frames for each image; assuming that P target candidate frames are finally obtained, the output is a P × 4 matrix in which each row holds the coordinate information of one target candidate frame, the starting coordinates and the width and height (x, y, w, h), thereby obtaining the coordinate information of the P target candidate frames;
meanwhile, inputting the image into the second-module deep convolutional neural network and, with the batch size set to B, outputting a feature map;
step 3.2, inputting the feature map and the P target candidate frames from step 3.1 into the third-module interest region pooling module to obtain the feature vector representation of the target region corresponding to each target candidate frame, and outputting a B × P × d matrix, where d is the dimension of the feature vector representation;
step 3.3, the output of the step 3.2 is respectively input into a fourth module target area detection branch and a fifth module hash code learning branch, wherein the output of the target area detection branch is the image class probability, and the output of the hash code learning branch is the class hash representation of each target candidate box;
step 3.4, optimizing the model: comparing the image category probability with the label vector of the image to calculate binary cross entropy classification loss; comparing the class hash representation of each target candidate box with the label of the image to calculate hash loss; and weighting and summing the two loss functions obtained by calculation to obtain the final combined loss, and performing a reverse iteration optimization model by a random gradient descent method to finally obtain a multi-target image retrieval model.
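Purely as an illustrative sketch (not the patented implementation), the five modules of claim 1 can be arranged as below in PyTorch. The RPN of the first module is not reimplemented — the P candidate boxes are assumed to be supplied externally — the backbone is assumed to output feat_dim channels, the batch is assumed to hold one image (B = 1) so the detection softmax can run over the candidate boxes, and all layer sizes are invented.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class MultiTargetHashNet(nn.Module):
    """Sketch of the five-module multi-task model (RPN omitted; the P
    candidate boxes are assumed to come from an external proposal stage).
    Assumes a batch of one image so the detection softmax runs over boxes."""

    def __init__(self, backbone: nn.Module, feat_dim: int = 512,
                 num_classes: int = 20, hash_bits: int = 48, pool: int = 7):
        super().__init__()
        self.backbone = backbone                     # second module: any deep CNN
        self.pool = pool
        self.shared = nn.Sequential(                 # two shared FC layers
            nn.Linear(feat_dim * pool * pool, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU())
        self.fc_det = nn.Linear(4096, num_classes)   # detection data stream
        self.fc_cls = nn.Linear(4096, num_classes)   # classification data stream
        self.hash_head = nn.Sequential(              # fifth module: hash branch
            nn.Linear(4096, 1024), nn.ReLU(),
            nn.Linear(1024, hash_bits), nn.Sigmoid())

    def forward(self, images, boxes):
        # boxes: list with one (P, 4) tensor of (x1, y1, x2, y2) proposals
        fmap = self.backbone(images)                              # feature map
        feats = roi_pool(fmap, boxes, (self.pool, self.pool))     # third module
        feats = self.shared(feats.flatten(1))                     # (P, 4096)
        det = torch.softmax(self.fc_det(feats), dim=0)   # formula 1: over boxes
        cls = torch.softmax(self.fc_cls(feats), dim=1)   # formula 2: over classes
        box_probs = det * cls                            # element-wise merge
        image_probs = box_probs.sum(dim=0, keepdim=True) # image class probability
        hash_like = self.hash_head(feats)                # class-hash representations
        return image_probs, box_probs, hash_like
```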
2. The method for constructing a multi-target image retrieval model according to claim 1, wherein step 3.3 comprises:
step 3.3.1, in the target region detection branch of the fourth module, the feature vector of each target candidate box first passes through two shared fully connected layers and is then split into a detection data stream and a classification data stream; the two streams pass through their respective fully connected layers to obtain a detection output matrix and a classification output matrix; the two matrices are then multiplied element-wise to obtain the class probability of each target candidate box, output as a merged data matrix; finally, the class probabilities of all target candidate boxes of each image are summed to obtain the class probability of that image, which serves as the output of the target region detection branch;
and step 3.3.2, in the hash code learning branch of the fifth module, the feature vector of each target candidate box passes through two fully connected layers and is then fed into a hash layer containing L nodes, giving a B × P × L class-hash output in which each row vector of an image is the class-hash representation of one target candidate box.
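To make the B × P × L shape of claim 2 concrete, a toy forward pass through the MultiTargetHashNet sketch above, with a stand-in one-layer backbone and invented sizes (B = 1, P = 5, C = 20, L = 48), might look as follows:

```python
import torch
import torch.nn as nn

# Toy shape check reusing the MultiTargetHashNet sketch above.
backbone = nn.Conv2d(3, 512, kernel_size=3, padding=1)    # stand-in CNN
model = MultiTargetHashNet(backbone, feat_dim=512, num_classes=20, hash_bits=48)
images = torch.randn(1, 3, 224, 224)
boxes = [torch.tensor([[0., 0., 64., 64.]] * 5)]          # P = 5 candidate boxes
image_probs, box_probs, hash_like = model(images, boxes)
print(image_probs.shape, box_probs.shape, hash_like.shape)
# torch.Size([1, 20]) torch.Size([5, 20]) torch.Size([5, 48])
```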
3. The method for constructing a multi-target image retrieval model according to claim 2, wherein step 3.3.1 comprises:
step (a1), the detection data stream passes through two fully connected layers to obtain a detection output matrix $X^d$, computed according to formula 1:

$$[\delta_{detect}(x^d)]_{ij} = \frac{e^{x^d_{ij}}}{\sum_{k=1}^{P} e^{x^d_{kj}}} \tag{1}$$

in formula 1, i denotes the ith row of the matrix and j the jth column; $[\delta_{detect}(x^d)]_{ij}$ is the output at row i, column j after the detection data stream computation; $x^d_{ij}$ is the value at row i, column j of the detection output matrix $X^d$; P is the total number of target candidate boxes; and e is the base of the exponential. Assuming C categories in total, the detection output matrix has dimension B × P × C and gives the score of each category under each target candidate box; formula 1 thus amounts to one pass of target detection, normalising each category over the P candidate boxes;
step (a2), the classification data stream passes through two fully connected layers to obtain a classification output matrix $X^c$, computed according to formula 2:

$$[\delta_{class}(x^c)]_{ij} = \frac{e^{x^c_{ij}}}{\sum_{k=1}^{C} e^{x^c_{ik}}} \tag{2}$$

in formula 2, i denotes the ith row of the matrix and j the jth column; $[\delta_{class}(x^c)]_{ij}$ is the output at row i, column j after the classification data stream computation; $x^c_{ij}$ is the value at row i, column j of the classification output matrix $X^c$; C is the total number of categories; and e is the base of the exponential. Formula 2 computes the probability of each category under each target candidate box, i.e. each target candidate box is classified once, and the classification output matrix has dimension B × P × C;
step (a3), multiplying the classification output matrix and the detection output matrix element-wise to obtain the combined classification-and-detection result: a B × P × C data matrix in which each row of each image gives the scores of one target candidate box over all categories, i.e. the class probability of that target candidate box;
and step (a4), finally, summing the class probabilities of the target candidate boxes of each image to obtain the class probability of the image; the output dimension after summation is B × 1 × C, and this serves as the output of the target region detection branch.
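A small numpy example may help fix the two softmax directions of formulas 1 and 2: formula 1 normalises each column (competition among the P candidate boxes for a category), while formula 2 normalises each row (competition among the C categories within a box). The scores below are invented, and for brevity one matrix feeds both streams, whereas in the model each data stream has its own fully connected output.

```python
import numpy as np

# Invented scores for P = 2 candidate boxes and C = 3 categories.
x = np.array([[2.0, 0.5, 1.0],
              [0.5, 1.5, 1.0]])

det = np.exp(x) / np.exp(x).sum(axis=0, keepdims=True)  # formula 1: softmax over boxes
cls = np.exp(x) / np.exp(x).sum(axis=1, keepdims=True)  # formula 2: softmax over classes
box_probs = det * cls               # step (a3): element-wise merge, P x C
image_prob = box_probs.sum(axis=0)  # step (a4): image-level class scores, length C
```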
4. The method for constructing a multi-target image retrieval model according to claim 3, wherein step 3.4 comprises:
step (b1), comparing the class probability of the image obtained in step (a4) with the image label vector to compute the binary cross-entropy classification loss according to formula 3:

$$L_c(y, p(y)) = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \log p(y_i) + (1 - y_i)\log\big(1 - p(y_i)\big)\Big] \tag{3}$$

in formula 3, $L_c(y, p(y))$ is the binary cross-entropy classification loss of one image; N is the total number of labels in the data set; $y_i$ is 0 or 1 and indicates whether the image carries the ith label (1 if it does); and $p(y_i)$ is the probability predicted by the model that the image carries the ith label;
step (b2), according to the output of the target region detection branch, selecting the target candidate boxes whose class probability exceeds the preset class probability threshold, screening out their index numbers, taking the corresponding class-hash representations from the output of the hash code learning branch, and computing the hash loss from these class-hash representations and the corresponding labels according to formula 4:

$$L_h(h_1, h_2, y) = \frac{1}{2}\, y\, D_h(h_1, h_2)^2 + \frac{1}{2}\,(1 - y)\,\max\big(m - D_h(h_1, h_2),\ 0\big)^2 \tag{4}$$

in formula 4, $L_h(h_1, h_2, y)$ is the hash loss between target regions 1 and 2 corresponding to a pair of target candidate boxes; $h_1$ is the class-hash representation of image target region 1 and $h_2$ that of image target region 2; y is 0 or 1 and indicates whether target regions 1 and 2 share the same predicted label (1 if they do); m is a margin; and $D_h(h_1, h_2)$ is the Euclidean distance between the class-hash vector representations of target regions 1 and 2;
and step (b3), taking a weighted sum of the binary cross-entropy classification loss and the hash loss in the ratio 100:1 to obtain the final joint loss, and iteratively optimizing the model by back-propagation with stochastic gradient descent, the learning rate being set to 0.001.
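In PyTorch, steps (b1)–(b3) might be sketched as follows. The 100:1 weighting and the 0.001 learning rate come from the claim; the margin value and the exact pairwise contrastive form of the hash loss are assumptions, chosen as the usual loss for the symbols formula 4 defines.

```python
import torch
import torch.nn.functional as F

def joint_loss(image_probs, labels, h1, h2, same_label, margin=2.0, w=100.0):
    # image_probs, labels: (1, C) predicted and ground-truth multi-label vectors.
    # h1, h2: class-hash vectors of a screened pair of candidate boxes;
    # same_label is 1.0 if the pair shares a predicted label, else 0.0.
    l_c = F.binary_cross_entropy(image_probs.clamp(1e-7, 1 - 1e-7), labels)
    d = torch.norm(h1 - h2, p=2)                          # D_h in formula 4
    l_h = 0.5 * same_label * d.pow(2) \
        + 0.5 * (1 - same_label) * torch.clamp(margin - d, min=0).pow(2)
    return w * l_c + l_h                                  # 100:1 weighting, step (b3)

# per step (b3): optimiser = torch.optim.SGD(model.parameters(), lr=0.001)
```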
5. A multi-target image retrieval method, characterized by comprising the following steps:
step one, acquiring the target hash code set of an image set to construct a retrieval database;
step two, inputting one or more images to be retrieved into the multi-target image retrieval model according to any one of claims 1 to 4 to obtain the outputs of the target region detection branch and the hash code learning branch, and screening out the target candidate boxes in each image, together with their corresponding class-hash representations, according to the preset class probability threshold for target candidate boxes;
and step three, converting the class-hash representations into hash codes to obtain a query hash code set over the target candidate boxes of the image to be retrieved; computing the joint Hamming distance between the query hash code set and each target hash code set, sorting the joint Hamming distances in ascending order, and taking the images corresponding to the first n distances as the final returned retrieval result.
6. The multi-target image retrieval method according to claim 5, wherein in step three the class-hash representation is converted into a hash code according to formula 5:

$$b_i = \begin{cases} 1, & h_i \geq 0.5 \\ 0, & h_i < 0.5 \end{cases} \tag{5}$$

in formula 5, $h_i$ is the value at the ith position of the class-hash representation vector and $b_i$ is the corresponding bit of the hash code;
thereby obtaining the query hash codes $\{b_q^1, b_q^2, \ldots, b_q^R\}$ of the R target candidate boxes of the image q to be retrieved, which together form the query hash code set $H_q$ of the image to be retrieved.
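A one-line numpy rendering of formula 5, under the assumption that the class-hash values come from a sigmoid-activated hash layer and are thresholded at 0.5:

```python
import numpy as np

def binarize(h: np.ndarray) -> np.ndarray:
    # Formula 5: map each class-hash value to a bit (0.5 threshold assumed).
    return (h >= 0.5).astype(np.uint8)

# e.g. binarize(np.array([0.91, 0.12, 0.55]))  ->  array([1, 0, 1], dtype=uint8)
```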
7. The multi-target image retrieval method according to claim 5, wherein in step three the joint Hamming distance between the query hash code set and a target hash code set is computed according to formula 6:

$$UHD(H_q, H_i) = \sum_{r=1}^{R} \min_{1 \le j \le K} d_H\big(h_q^r, h_i^j\big) \tag{6}$$

in formula 6, $UHD(H_q, H_i)$ is the joint Hamming distance between the image to be retrieved and the hash code set of the ith image in the image set; $H_q$ is the hash code set representation of the image to be retrieved and $H_i$ that of the ith image in the image set; R is the number of target hash codes of the image to be retrieved; K is the number of target hash codes of the ith image in the image set; and $d_H(h_q^r, h_i^j)$ is the Hamming distance between the rth hash code of the image to be retrieved and the jth hash code of the ith image in the image set;
that is, formula 6 matches each query hash code to the most similar hash code in the target hash code set to obtain a minimum distance, and sums these minimum distances to give the final joint Hamming distance.
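Formula 6 vectorises naturally with numpy broadcasting; unlike the loop-based apparatus sketch earlier, the fragment below computes the full R × K Hamming distance matrix in one step:

```python
import numpy as np

def joint_hamming_vec(Q: np.ndarray, T: np.ndarray) -> int:
    # Q: (R, L) query codes; T: (K, L) target codes; entries in {0, 1}.
    d = (Q[:, None, :] != T[None, :, :]).sum(axis=2)  # (R, K) pairwise distances
    return int(d.min(axis=1).sum())  # formula 6: min over j, summed over r
```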
8. A multi-target image retrieval model construction apparatus, characterized by comprising:
an acquisition module, configured to acquire images and their corresponding labels as a training set;
a pre-training neural network model building module, configured to build a pre-training neural network model, the neural network model being a multi-task learning model comprising: a first module: an RPN module, used to generate target candidate boxes for an input image; a second module: an arbitrary deep convolutional neural network, used to generate a feature map of the input image; a third module: a region-of-interest pooling module, used to efficiently acquire the feature vector representation of the target region corresponding to each target candidate box; a fourth module: a target region detection branch, used to determine the class probabilities of the target candidate boxes and then sum these class probabilities to obtain the class probability of the image; and a fifth module: a hash code learning branch, used to determine the class-hash representation of each target candidate box;
and a training module, configured to input the images into the pre-training neural network model for training.
9. The multi-target image retrieval model construction apparatus according to claim 8, wherein the training module comprises:
a target candidate box generation module, configured to input images into the first module (RPN) and generate initial target candidate boxes for each image; assuming P target candidate boxes are finally obtained, the output is a P × 4 matrix in which each row holds the coordinate information of one target candidate box, namely its origin coordinates and its width and height (x, y, w, h), thereby obtaining the coordinate information of the P target candidate boxes;
a feature map output module, configured to input the images into the deep convolutional neural network of the second module, with the batch size set to B, and output a feature map;
a feature vector representation module for the target candidate boxes, configured to input the feature map and the P target candidate boxes into the region-of-interest pooling module of the third module to obtain the feature vector representation of the target region corresponding to each target candidate box, and output a B × P × d matrix, where d is the dimension of the feature vector representation;
an image class probability and class-hash representation output module, configured to input the feature vector representation of the target region corresponding to each target candidate box into the target region detection branch of the fourth module and the hash code learning branch of the fifth module respectively, the output of the target region detection branch being the image class probability and the output of the hash code learning branch being the class-hash representation of each target candidate box;
and a model optimization module, configured to optimize the model by comparing the image class probability with the label vector of the image to compute a binary cross-entropy classification loss, comparing the class-hash representation of each target candidate box with the label of the image to compute a hash loss, taking a weighted sum of the two losses as the final joint loss, and iteratively optimizing the model by back-propagation with stochastic gradient descent, finally obtaining the multi-target image retrieval model.
10. A multi-target image retrieval apparatus, characterized by comprising:
a retrieval database construction module, configured to acquire the target hash code set of an image set and construct a retrieval database;
a multi-target image input module, configured to input one or more images to be retrieved into the multi-target image retrieval model;
a class-hash representation acquisition module, configured to acquire the outputs of the target region detection branch and the hash code learning branch by means of the multi-target image retrieval model construction apparatus according to claim 8 or 9, and to screen out the target candidate boxes in each image, together with their corresponding class-hash representations, according to a preset class probability threshold for target candidate boxes;
a conversion and query hash code set acquisition module, configured to convert the class-hash representations into hash codes, yielding a query hash code set over the target candidate boxes of the image to be retrieved;
and a retrieval result output module, configured to compute the joint Hamming distance between the query hash code set and each target hash code set, sort the joint Hamming distances in ascending order, and return the images corresponding to the first n distances as the final retrieval result.
CN202110270411.2A 2021-03-12 2021-03-12 Construction method of multi-target image retrieval model, retrieval method and device Active CN113032612B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110270411.2A CN113032612B (en) 2021-03-12 2021-03-12 Construction method of multi-target image retrieval model, retrieval method and device

Publications (2)

Publication Number Publication Date
CN113032612A true CN113032612A (en) 2021-06-25
CN113032612B CN113032612B (en) 2023-04-11

Family

ID=76470346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110270411.2A Active CN113032612B (en) 2021-03-12 2021-03-12 Construction method of multi-target image retrieval model, retrieval method and device

Country Status (1)

Country Link
CN (1) CN113032612B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105844669A (en) * 2016-03-28 2016-08-10 华中科技大学 Video target real-time tracking method based on partial Hash features
CN106407352A (en) * 2016-09-06 2017-02-15 广东顺德中山大学卡内基梅隆大学国际联合研究院 Traffic image retrieval method based on depth learning
CN106503106A (en) * 2016-10-17 2017-03-15 北京工业大学 A kind of image hash index construction method based on deep learning
WO2018121018A1 (en) * 2016-12-30 2018-07-05 腾讯科技(深圳)有限公司 Picture identification method and device, server and storage medium
WO2018137357A1 (en) * 2017-01-24 2018-08-02 北京大学 Target detection performance optimization method
CN107679250A (en) * 2017-11-01 2018-02-09 浙江工业大学 A kind of multitask layered image search method based on depth own coding convolutional neural networks
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN110297931A (en) * 2019-04-23 2019-10-01 西北大学 A kind of image search method
CN111460200A (en) * 2020-03-04 2020-07-28 西北大学 Image retrieval method and model based on multitask deep learning and construction method thereof

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
CHANGZHEN XIONG et al.: "Subject Features and Hash Codes for Multi-label Image Retrieval", IEEE *
KAIMING HE et al.: "Mask R-CNN", IEEE *
FENG Xingjie et al.: "Image retrieval based on deep convolutional neural networks and hashing", Computer Engineering and Design *
PENG Tianqiang et al.: "Image retrieval method based on deep convolutional neural network and binary hash learning", Journal of Electronics & Information Technology *
PENG Yanfei et al.: "Image retrieval based on hash algorithm and generative adversarial network", Laser & Optoelectronics Progress *
HU Qiyao et al.: "Research on image retrieval technology based on weakly supervised deep learning", Journal of Northwest University (Natural Science Edition) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114627437A (en) * 2022-05-16 2022-06-14 科大天工智能装备技术(天津)有限公司 Traffic target identification method and system
CN114627437B (en) * 2022-05-16 2022-08-05 科大天工智能装备技术(天津)有限公司 Traffic target identification method and system

Also Published As

Publication number Publication date
CN113032612B (en) 2023-04-11

Similar Documents

Publication Publication Date Title
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
Lin et al. Bsn: Boundary sensitive network for temporal action proposal generation
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN110084151B (en) Video abnormal behavior discrimination method based on non-local network deep learning
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN113326731B (en) Cross-domain pedestrian re-identification method based on momentum network guidance
CN111198964B (en) Image retrieval method and system
CN110532379B (en) Electronic information recommendation method based on LSTM (least Square TM) user comment sentiment analysis
CN114492574A (en) Pseudo label loss unsupervised countermeasure domain adaptive picture classification method based on Gaussian uniform mixing model
CN111967343A (en) Detection method based on simple neural network and extreme gradient lifting model fusion
CN107169117B (en) Hand-drawn human motion retrieval method based on automatic encoder and DTW
CN111931505A (en) Cross-language entity alignment method based on subgraph embedding
CN112085072B (en) Cross-modal retrieval method of sketch retrieval three-dimensional model based on space-time characteristic information
CN110929848A (en) Training and tracking method based on multi-challenge perception learning model
Wang et al. Aspect-ratio-preserving multi-patch image aesthetics score prediction
CN112784768A (en) Pedestrian re-identification method for guiding multiple confrontation attention based on visual angle
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN112084895B (en) Pedestrian re-identification method based on deep learning
CN115171165A (en) Pedestrian re-identification method and device with global features and step-type local features fused
CN112784929A (en) Small sample image classification method and device based on double-element group expansion
CN111008224A (en) Time sequence classification and retrieval method based on deep multitask representation learning
CN112766378A (en) Cross-domain small sample image classification model method focusing on fine-grained identification
CN112364747A (en) Target detection method under limited sample
CN113032612B (en) Construction method of multi-target image retrieval model, retrieval method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant