CN113032612A - Construction method of multi-target image retrieval model, retrieval method and device


Info

Publication number
CN113032612A
Authority
CN
China
Prior art keywords
image
target
hash
module
target candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110270411.2A
Other languages
Chinese (zh)
Other versions
CN113032612B (en)
Inventor
范建平
舒永康
赵万青
彭先霖
胡琦瑶
杨文静
王琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern University
Original Assignee
Northwestern University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern University
Priority to CN202110270411.2A
Publication of CN113032612A
Application granted
Publication of CN113032612B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval of still image data
    • G06F16/55 Clustering; Classification
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata automatically derived from the content
    • G06F16/5866 Retrieval characterised by using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses a construction method of a multi-target image retrieval model, together with a retrieval method and a retrieval device. A user inputs one or more images to be retrieved, obtains the multi-target query hash code set of the images through the model, compares the query hash code set with the target hash code sets in a retrieval database by computing joint Hamming distances, sorts the distances in ascending order, and returns the retrieval results. The invention acquires the hash code corresponding to each image target region for retrieval in a weakly supervised learning manner, and each target region of each image generates an independent hash code, thereby avoiding interference from complex backgrounds and mutual interference among targets, representing the content of a complex image more completely, improving the retrieval effect, and expanding the ways in which images can be retrieved.

Description

Construction method of multi-target image retrieval model, retrieval method and device
Technical Field
The invention belongs to the technical field of image retrieval, and particularly relates to a construction method of a multi-target image retrieval model, a retrieval method and a retrieval device.
Background
In recent years, as deep learning has been continuously developed and applied in the image field, image retrieval methods that incorporate deep learning can accurately capture the semantic information hidden in images and have greatly improved retrieval performance. However, existing methods overlook several problems. Images often carry rich semantic information, and one image usually contains multiple objects or content regions, yet in current mainstream image retrieval methods an image is abstracted into a single feature vector of fixed dimension, usually a real-valued vector of 256 to 2048 dimensions. A great deal of useful information is therefore lost in the process of producing the feature vector representation, and only the most central semantic information of the image is retained. Moreover, as the volume of image data grows year by year, the feature vectors are further converted into hash representations so that massive image databases can be searched quickly. Feature representations processed by hashing methods generally use binary hash vectors of 16 to 128 dimensions, which lose even more information and cannot accurately depict each target in the image, so the practical retrieval performance is greatly limited.
The core step of image retrieval is to generate an accurate feature description of an image; for complex multi-label, multi-target scene images, how to describe the image content accurately and completely is the central problem. On the one hand, particularly in a multi-label image, one label usually corresponds only to a certain target or local area of the image rather than to the whole image, and in the process of obtaining a whole-image feature representation, complex background information irrelevant to the actual retrieval target may be mixed in, so that the extracted hash code loses its strong power to characterize the retrieval target. Furthermore, multiple targets of different sizes mixed together also interfere with each other, so that the subject content of each retrieval target cannot be distinguished; this brings interference to retrieval and degrades the retrieval effect. On the other hand, conventional image retrieval inputs one image and retrieves images related to it, which has clear limitations: if one wishes to input several images and simultaneously retrieve images containing the targets of all of them, the conventional single-feature representation cannot achieve this.
To address this, an effective approach is to obtain a feature representation for the target area corresponding to each label, and some existing methods obtain each target in the image by target detection and then obtain a feature vector representation of each target, thereby tackling multi-target image retrieval. In these methods, however, the target detection process and the learning of target region features are independent; multiple steps and multiple models cooperate to complete the whole task without an end-to-end system. The whole pipeline is complex, the optimization objectives of the different modules are inconsistent, and the objective functions of some modules may deviate from the overall goal of the system, so independently trained modules can hardly bring the whole system to optimal performance. Errors also accumulate: a deviation produced in an earlier step may affect every subsequent step, making it difficult for the whole system to reach an optimal state.
Disclosure of Invention
Aiming at the defects and shortcomings of the prior art, the invention provides a construction method of a multi-target image retrieval model, a retrieval method, and a retrieval device that use only image-level annotation information to learn the correspondence between image labels and target positions while simultaneously learning the hash feature representation of each target and region, thereby solving the prior art's inability to avoid interference from noise unrelated to the retrieval target, as well as the problems of fine-grained and diversified retrieval.
In order to achieve the purpose, the invention adopts the following technical scheme:
a multi-target image retrieval model construction method is provided, the multi-target image retrieval model can be used for obtaining the Hash-like representation of multi-target images to realize multi-target image retrieval; the method comprises the following steps:
step 1, acquiring a plurality of images and corresponding labels thereof as a training set;
step 2, constructing a pre-training neural network model, wherein the neural network model is a multi-task learning model and comprises the following steps:
a first module: the RPN module is used for generating a target candidate frame for an input image;
a second module: an arbitrary deep convolutional neural network for generating a feature map of an input image;
a third module: the interest area pooling module is used for quickly acquiring the feature vector representation of the target area corresponding to each target candidate frame;
a fourth module: the target area detection branch is used for determining the class probability of the target candidate frame and then summing the class probabilities to obtain the class probability of the image;
a fifth module: a hash code learning branch to determine a hash-like representation of the target candidate box;
and 3, inputting the image into the pre-training neural network model for training, wherein the training comprises the following steps:
step 3.1, inputting the images into the first-module RPN and generating initial target candidate frames for each image; assuming that P target candidate frames are finally obtained, the output is a P × 4 matrix in which each row holds the coordinate information of one target candidate frame, the starting coordinates and the width and height (x, y, w, h), thereby obtaining the coordinate information of the P target candidate frames;
meanwhile, inputting the image into the second-module deep convolutional neural network and, with the batch size set to B, outputting a feature map;
step 3.2, inputting the feature map and the P target candidate frames from step 3.1 into the third-module interest region pooling module to obtain the feature vector representation of the target region corresponding to each target candidate frame, and outputting a B × P × d matrix, where d is the dimension of the feature vector representation;
step 3.3, the output of the step 3.2 is respectively input into a fourth module target area detection branch and a fifth module hash code learning branch, wherein the output of the target area detection branch is the image class probability, and the output of the hash code learning branch is the class hash representation of each target candidate box;
step 3.4, optimizing the model: comparing the image class probability with the label vector of the image to compute the binary cross-entropy classification loss; comparing the hash-like representation of each target candidate frame with the label of the image to compute the hash loss; weighting and summing the two computed losses to obtain the final joint loss; and optimizing the model by backward iterations of stochastic gradient descent, finally obtaining the multi-target image retrieval model.
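The five modules and the training flow of steps 3.1 to 3.4 can be summarized in a compact forward-pass sketch. The following PyTorch-style code is a minimal illustration under stated assumptions: the backbone, RPN, and RoI pooling modules are taken as given, the layer sizes (d, C, L) are placeholders, and each data stream is collapsed to a single linear layer for brevity; it is not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class MultiTargetHashNet(nn.Module):
    """Sketch of the five-module model: RPN, CNN backbone, RoI pooling,
    target-area detection branch, and hash code learning branch."""
    def __init__(self, backbone, rpn, roi_pool, d=4096, C=20, L=64):
        super().__init__()
        self.backbone = backbone   # module 2: any deep CNN -> feature map
        self.rpn = rpn             # module 1: generates P candidate frames
        self.roi_pool = roi_pool   # module 3: region features, B x P x d
        self.shared = nn.Sequential(nn.Linear(d, 512), nn.ReLU(),
                                    nn.Linear(512, 512), nn.ReLU())
        self.det = nn.Linear(512, C)    # detection data stream (module 4)
        self.cls = nn.Linear(512, C)    # classification data stream (module 4)
        self.hash_layer = nn.Sequential(nn.Linear(512, L), nn.Sigmoid())  # module 5

    def forward(self, images):
        boxes = self.rpn(images)              # B x P x 4 candidate frames
        fmap = self.backbone(images)          # B x 512 x 18 x 18 feature map
        feats = self.roi_pool(fmap, boxes)    # B x P x d region features
        h = self.shared(feats)
        det = torch.softmax(self.det(h), dim=1)   # Eq. 1: over the P frames
        cls = torch.softmax(self.cls(h), dim=2)   # Eq. 2: over the C classes
        box_probs = det * cls                     # class probability per frame
        img_probs = box_probs.sum(dim=1)          # image class probability, B x C
        codes = self.hash_layer(h)                # hash-like codes, B x P x L
        return box_probs, img_probs, codes
```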
The invention also comprises the following technical characteristics:
specifically, the step 3.3 includes:
step 3.3.1, in the fourth-module target area detection branch, the feature vector of each target candidate frame passes through two fully connected layers and is then divided into a detection data stream branch and a classification data stream branch, which pass through their respective fully connected layers to obtain a detection output matrix and a classification output matrix; the detection output matrix and the classification output matrix are then multiplied element-wise to obtain the class probability of each target candidate frame, output as a merged data matrix; finally, the class probabilities of all target candidate frames of each image are summed to obtain the class probability of the image, which serves as the output of the target region detection branch;
step 3.3.2, in the fifth-module hash code learning branch, the feature vector of each target candidate frame passes through two fully connected layers and is then input into a hash layer containing L nodes, giving a B × P × L hash-like output in which each row vector of each image is the hash-like representation of one target candidate frame.
Specifically, the step 3.3.1 includes:
step (a1), the detection data stream branch passes through two fully connected layers to obtain the detection output matrix $X^d$, computed according to Equation 1:

$$[\delta_{\mathrm{detect}}(x^d)]_{ij} = \frac{e^{x^d_{ij}}}{\sum_{k=1}^{P} e^{x^d_{kj}}} \tag{1}$$

In Equation 1, $i$ denotes the $i$-th row of the matrix, $j$ denotes the $j$-th column, $[\delta_{\mathrm{detect}}(x^d)]_{ij}$ is the output at row $i$, column $j$ after the detection data stream branch, $x^d_{ij}$ is the value at row $i$, column $j$ of the detection output matrix $X^d$, $P$ is the total number of target candidate frames, and $e$ is the base of the exponentiation. With $C$ categories in total, the detection output matrix has dimension B × P × C and gives the score of each category under each target candidate frame; Equation 1 is equivalent to performing one pass of target detection;
step (a2), the classification data stream branch passes through two fully connected layers to obtain the classification output matrix $X^c$, computed according to Equation 2:

$$[\delta_{\mathrm{class}}(x^c)]_{ij} = \frac{e^{x^c_{ij}}}{\sum_{k=1}^{C} e^{x^c_{ik}}} \tag{2}$$

In Equation 2, $i$ denotes the $i$-th row of the matrix, $j$ denotes the $j$-th column, $[\delta_{\mathrm{class}}(x^c)]_{ij}$ is the output at row $i$, column $j$ after the classification data stream branch, $x^c_{ij}$ is the value at row $i$, column $j$ of the classification output matrix $X^c$, $C$ is the total number of categories, and $e$ is the base of the exponentiation. Equation 2 computes the probability of each target candidate frame for each category, which is equivalent to classifying each target candidate frame once; the classification output matrix has dimension B × P × C;
step (a3), the classification output matrix and the detection output matrix are multiplied element-wise to obtain the combined classification-and-detection result, a B × P × C data matrix in which each row of each image holds the scores of one target candidate frame over all categories, i.e. the class probabilities of that target candidate frame;
and (a4), finally, the class probabilities of the target candidate frames of each image are summed to obtain the class probability of the image; the summed output has dimension B × 1 × C and serves as the output of the target region detection branch.
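For concreteness, steps (a1) to (a4) amount to two softmaxes taken along different axes of the same B × P × C tensor, followed by an element-wise product and a sum over proposals. A minimal sketch, assuming x_d and x_c are the raw outputs of the two fully connected streams (names are illustrative):

```python
import torch

def detection_branch_merge(x_d, x_c):
    # x_d, x_c: B x P x C raw outputs of the detection and classification streams
    det = torch.softmax(x_d, dim=1)   # Eq. 1: normalize over the P candidate frames
    cls = torch.softmax(x_c, dim=2)   # Eq. 2: normalize over the C categories
    box_probs = det * cls             # step (a3): element-wise merge, B x P x C
    img_probs = box_probs.sum(dim=1)  # step (a4): image class probability, B x C
    return box_probs, img_probs
```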
Specifically, the step 3.4 includes:
step (b1), the class probability of the image obtained in step (a4) is compared with the image label vector to compute the binary cross-entropy classification loss according to Equation 3:

$$L_c(y, p(y)) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p(y_i) + (1 - y_i)\log\bigl(1 - p(y_i)\bigr)\right] \tag{3}$$

In Equation 3, $L_c(y, p(y))$ is the binary cross-entropy classification loss of one image, $N$ is the total number of dataset labels, $y_i$ is 0 or 1 and indicates whether the image has the $i$-th label (1 if it does), and $p(y_i)$ is the probability predicted by the model that the image has the $i$-th label;
step (b2), according to the output of the target area detection branch, the target candidate frames whose class probability exceeds the set class probability threshold are selected; their index numbers are used to obtain the corresponding hash-like representations from the output of the hash code learning branch, and the hash loss is computed from these representations and the corresponding labels according to Equation 4:

$$L_h(h_1, h_2, y) = \frac{1}{2}\, y\, D_h(h_1, h_2)^2 + \frac{1}{2}\,(1 - y)\max\bigl(m - D_h(h_1, h_2),\, 0\bigr)^2 \tag{4}$$

In Equation 4, $L_h(h_1, h_2, y)$ is the hash loss between target areas 1 and 2 corresponding to a pair of target candidate frames, $h_1$ is the hash-like representation of image target area 1, $h_2$ is the hash-like representation of image target area 2, $y$ is 0 or 1 and indicates whether target areas 1 and 2 have the same predicted label (1 if they do), $m$ is the margin beyond which dissimilar pairs contribute no loss, and $D_h(h_1, h_2)$ is the Euclidean distance between the hash-like vector representations of target areas 1 and 2;
and (b3), the binary cross-entropy classification loss and the hash loss are weighted and summed at a ratio of 100:1 to obtain the final joint loss, and the model is optimized by backward iterations of stochastic gradient descent, with the learning rate set to 0.001.
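A hedged sketch of the joint objective of steps (b1) to (b3): binary cross-entropy on the image class probabilities plus a pairwise hash loss, combined at the 100:1 weighting with SGD at learning rate 0.001. The contrastive form and the margin value are assumptions consistent with the variables defined for Equation 4.

```python
import torch
import torch.nn.functional as F

def joint_loss(img_probs, labels, h1, h2, same_label, margin=2.0):
    # img_probs: B x C image class probabilities; labels: B x C float 0/1 vectors.
    # h1, h2: hash-like codes of selected candidate-frame pairs; same_label: 0/1.
    cls_loss = F.binary_cross_entropy(img_probs.clamp(1e-7, 1 - 1e-7), labels)
    d = (h1 - h2).norm(dim=1)                      # Euclidean distance D_h (Eq. 4)
    hash_loss = (0.5 * same_label * d.pow(2)
                 + 0.5 * (1 - same_label) * (margin - d).clamp(min=0).pow(2)).mean()
    return 100.0 * cls_loss + hash_loss            # 100:1 weighting of step (b3)

# optimizer = torch.optim.SGD(model.parameters(), lr=0.001)  # as in step (b3)
```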
A multi-target image retrieval method includes:
step one, acquiring the target hash code sets of an image set to construct a retrieval database;
step two, inputting one or more images to be retrieved into the multi-target image retrieval model to obtain the outputs of the target area detection branch and the hash code learning branch, and screening the target candidate frames in the images and the hash-like representations corresponding to them according to a set target-candidate-frame class probability threshold;
and step three, converting the hash-like representations into hash codes to obtain the query hash code set of the multiple target candidate frames of the images to be retrieved, computing the joint Hamming distances between the query hash code set and the target hash code sets, sorting the joint Hamming distances in ascending order, and taking the images corresponding to the first n distances as the finally returned retrieval results.
Specifically, in step three, the hash-like representation is converted into a hash code according to Equation 5:

$$b_i = \begin{cases} 1, & h_i \geq 0.5 \\ 0, & h_i < 0.5 \end{cases} \tag{5}$$

In Equation 5, $h_i$ is the value at the $i$-th position of the hash-like representation vector and $b_i$ is the resulting hash bit. The query hash codes of the multiple target candidate frames of the q images to be retrieved are thereby obtained and combined into the query hash code set $H_q$ of the images to be retrieved.
Specifically, in step three, the joint Hamming distance between the query hash code set and a target hash code set is computed according to Equation 6:

$$\mathrm{UHD}(H_q, H_i) = \sum_{r=1}^{R} \min_{1 \le j \le K} D_H\bigl(b^q_r, b^i_j\bigr) \tag{6}$$

In Equation 6, $\mathrm{UHD}(H_q, H_i)$ is the joint Hamming distance between the image to be retrieved and the hash code set of the $i$-th image in the image set, $H_q$ is the hash code set of the image to be retrieved, $H_i$ is that of the $i$-th image in the image set, $R$ is the number of target hash codes of the image to be retrieved, $K$ is the number of target hash codes of the $i$-th image in the image set, and $D_H(b^q_r, b^i_j)$ is the Hamming distance between the $r$-th hash code of the image to be retrieved and the $j$-th hash code of the $i$-th image in the image set;

in Equation 6, each query hash code is matched with the most similar hash code in the target hash code set to obtain a minimum distance, and these minimum distances are summed to give the final joint Hamming distance.
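A minimal NumPy sketch of Equations 5 and 6: binarize the hash-like vectors at 0.5 (their values approach 0 or 1 by construction), then match each query code to its nearest target code and sum the minima. Variable names are illustrative.

```python
import numpy as np

def binarize(h):
    # Eq. 5: threshold a hash-like vector (values approach 0 or 1) into hash bits.
    return (np.asarray(h) >= 0.5).astype(np.uint8)

def joint_hamming_distance(H_q, H_i):
    # Eq. 6: for each of the R query codes, take the Hamming distance to the
    # most similar of the K target codes, then sum. H_q: R x L, H_i: K x L.
    return sum(min(int(np.sum(q != t)) for t in H_i) for q in H_q)

# Retrieval then ranks database images by joint_hamming_distance in ascending
# order and returns the first n images.
```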
A multi-target image retrieval model construction device includes:
the acquisition module is used for acquiring the images and the labels corresponding to the images as a training set;
the pre-training neural network model building module is used for building a pre-training neural network model, and the neural network model is a multi-task learning model and comprises the following steps: a first module: the RPN module is used for generating a target candidate frame for an input image; a second module: an arbitrary deep convolutional neural network for generating a feature map of an input image; a third module: the interest area pooling module is used for quickly acquiring the feature vector representation of the target area corresponding to each target candidate frame; a fourth module: the target area detection branch is used for determining the class probability of the target candidate frame and then summing the class probabilities to obtain the class probability of the image; a fifth module: a hash code learning branch to determine a hash-like representation of the target candidate box;
and the training module is used for inputting the image into the pre-training neural network model for training.
Specifically, the training module includes:
the target candidate frame generation module is used for inputting an image into the first-module RPN and generating initial target candidate frames for each image; assuming that P target candidate frames are finally obtained, the output is a P × 4 matrix in which each row holds the coordinate information of one target candidate frame, the starting coordinates and the width and height (x, y, w, h), thereby obtaining the coordinate information of the P target candidate frames;
the feature map output module is used for inputting the image into the second-module deep convolutional neural network and, with the batch size set to B, outputting a feature map;
the target candidate frame feature vector representation module is used for inputting the feature map and the P target candidate frames into the third-module interest region pooling module to obtain the feature vector representation of the target region corresponding to each target candidate frame, output as a B × P × d matrix, where d is the dimension of the feature vector representation;
the image class probability and hash-like representation output module is used for inputting the feature vector representation of the target region corresponding to each target candidate frame into the fourth-module target area detection branch and the fifth-module hash code learning branch respectively, where the output of the target area detection branch is the image class probability and the output of the hash code learning branch is the hash-like representation of each target candidate frame;
the model optimization module is used for optimizing the model, which includes comparing the image class probability with the label vector of the image to compute the binary cross-entropy classification loss; comparing the hash-like representation of each target candidate frame with the label of the image to compute the hash loss; weighting and summing the two computed losses to obtain the final joint loss; and optimizing the model by backward iterations of stochastic gradient descent, finally obtaining the multi-target image retrieval model.
A multi-target image retrieval apparatus comprising:
the retrieval database construction module is used for acquiring the target hash code sets of an image set to construct a retrieval database;
the multi-target image input module is used for inputting one or more images to be retrieved into the multi-target image retrieval model;
the hash-like representation acquisition module is used for obtaining the outputs of the target area detection branch and the hash code learning branch, by the method in the multi-target image retrieval model construction device, and for screening the target candidate frames in the images and the hash-like representations corresponding to them according to the set target-candidate-frame class probability threshold;
the conversion and query hash code set acquisition module is used for converting the hash-like representations into hash codes to obtain the query hash code set of the multiple target candidate frames of the images to be retrieved;
and the retrieval result output module is used for computing the joint Hamming distances between the query hash code set and the target hash code sets, sorting the joint Hamming distances in ascending order, and taking the images corresponding to the first n distances as the finally returned retrieval results.
Compared with the prior art, the invention has the beneficial technical effects that:
the invention provides an end-to-end multi-target image Hash model based on weak supervision detection, which uses image-level labels to learn the corresponding relation between the image labels and the target positions thereof and also learn the Hash characteristic representation of each target and each region, thereby generating respective Hash codes for each target region of the image, avoiding the interference of complex background noise, comprehensively and completely representing the semantic information of the image, thereby effectively improving the retrieval effect.
Drawings
FIG. 1 is a diagram of a model of the process of the present invention;
FIG. 2 is a flow chart of image retrieval;
FIG. 3 is a diagram of the precision@500 variation of each method at different hash code lengths.
FIG. 4 is a diagram of the precision-recall variation when the hash code length is 64.
FIG. 5 is a diagram of the ACG variation for different numbers of returned images when the hash code length is 64.
FIG. 6 is a diagram of the NDCG variation for different numbers of returned images when the hash code length is 64.
FIG. 7 shows the multi-target searching effect of a single image to be searched.
Fig. 8 shows the multi-target retrieval effect of a plurality of images to be retrieved.
Detailed Description
The following describes in detail specific embodiments of the present invention. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
The present application provides an end-to-end multi-target image hash network model method based on weakly supervised detection. It uses only images and their corresponding image-level labels, reducing dependence on target detection annotation boxes, while simultaneously learning the correspondence between labels and the targets in the image as well as the hash codes of those targets, so that each target in the image is described completely and independently. The whole model is trained and optimized jointly in a multi-task fashion, which overcomes the inherent defects of previous multi-module methods, reduces the complexity of the image feature representation process, and describes image content accurately at a finer granularity. Extensive experimental comparison verifies that the method greatly improves the image retrieval effect. The multi-target image feature representation also adapts naturally to multi-image retrieval requirements: retrieval only requires obtaining the target hash set of each query image and merging them into one large target hash set.
The following definitions and concepts relating to the present invention are provided for illustration:
End-to-end: multiple steps for solving a problem are combined so that the final required result is obtained directly from the original input, without separate intermediate processing steps. In the present invention, an image is input into the model, and the detection results and the target hash representations of the targets are obtained directly.
Weakly supervised detection: object detection is learned using weak supervision information that contains only image-level label annotations, without the object frame positions corresponding to those labels.
RPN module: a candidate box generation network (Region Proposal Network) that, given an input image, generates a number of target candidate frames for it.
Prior knowledge: experience obtained from previous training and used as new guiding information.
Hash-like representation: an output vector resembling a hash code vector, in which the value of each position is a real number approaching 0 or 1.
Example 1:
the embodiment discloses a method for constructing a multi-target image retrieval model, which can be used for acquiring the Hash-like representation of a multi-target image to realize multi-target image retrieval; the method comprises the following steps:
step 1, acquiring an image and a label corresponding to the image as a training set;
specifically, in the present embodiment, the acquired images and labels are organized into pairs, and all images are resized to a fixed size S × S to accelerate model computation, where I is an image and L is its label vector. The label vector is the one-hot style encoding of the image's labels, whose length equals the number of dataset categories: a value of 1 at a position means the image has that label, and 0 means it does not. Image and label vector pairs {<I1, L1>, <I2, L2>, ..., <In, Ln>} are thus constructed.
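For illustration, a label vector under this convention might be built as follows; the category list and label names here are hypothetical:

```python
def label_vector(image_labels, categories):
    # One-hot style multi-label encoding: 1 if the image has the label, else 0.
    return [1 if c in image_labels else 0 for c in categories]

# e.g. with hypothetical categories ["person", "dog", "car"]:
# label_vector({"dog", "person"}, ["person", "dog", "car"]) -> [1, 1, 0]
```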
Step 2, constructing a pre-trained neural network model, which is a multi-task learning model and includes, as shown in fig. 1:
a first module: the RPN module is used for generating a target candidate frame for an input image;
a second module: an arbitrary deep convolutional neural network for generating a feature map of an input image;
a third module: the interest region pooling (RoIAlign pooling) module is used for quickly acquiring the feature vector representation of the target region corresponding to each target candidate box;
a fourth module: the target area detection branch is used for determining the class probability of the target candidate frame and then summing the class probabilities to obtain the class probability of the image;
a fifth module: a hash code learning branch to determine a hash-like representation of the target candidate box;
and 3, inputting the image into the pre-training neural network model for training, wherein the training comprises the following steps:
step 3.1, inputting the images into the first-module RPN (pre-trained) and generating initial target candidate frames for each image; the initial target candidate frames generated by the RPN are filtered according to prior knowledge, removing target candidate frames that are too small, too large, or have an unbalanced width-to-height ratio; assuming that P target candidate frames are finally obtained, the output is a P × 4 matrix in which each row holds the coordinate information of one target candidate frame, the starting coordinates and the width and height (x, y, w, h), thereby obtaining the coordinate information of the P target candidate frames.
In other embodiments, other region proposal generation methods, such as sliding windows or selective search, may also be used to obtain candidate target frames of the image in advance and supply them as an input; here, to realize an end-to-end system, the RPN network module is used as an integral part of the system.
More specifically, experiments showed that classification errors of target frames increase when the frames are too small, so the minimum size of a target candidate frame is limited to be larger than 50 × 50, and the ratio of the short side to the long side of a target candidate frame is limited to the range 1:4. To keep subsequent input dimensions consistent, the number of candidate regions is limited to P: if more than P candidate frames remain after filtering, the first P are taken, and if fewer than P remain, zeros are used as padding. The output of this module is a P × 4 matrix in which each row holds the coordinate information of one target frame, the starting coordinates and the width and height (x, y, w, h), thereby obtaining the coordinate information of the P candidate target frames.
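A sketch of this filtering and padding rule, under the assumptions that the 50 × 50 constraint applies to the shorter side and that boxes are given as (x, y, w, h):

```python
def filter_proposals(boxes, P, min_side=50, max_aspect=4.0):
    # Drop boxes smaller than 50 x 50 or with short:long side ratio beyond 1:4,
    # then truncate or zero-pad to exactly P boxes (P x 4 output).
    kept = []
    for (x, y, w, h) in boxes:
        short, long_ = min(w, h), max(w, h)
        if short >= min_side and long_ / short <= max_aspect:
            kept.append((x, y, w, h))
    kept = kept[:P]
    kept += [(0, 0, 0, 0)] * (P - len(kept))  # pad to keep dimensions consistent
    return kept
```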
Meanwhile, the image is input into the second-module deep convolutional neural network and, with the batch size B, a feature map is output with dimensions B × 512 × 18 × 18. In this embodiment, the network structure uses a VGG16 model pre-trained on ImageNet, with the final classification layer and the pool5 layer after the last convolutional layer removed; the feature map representation of the image is obtained by convolution and pooling computation, with output dimension B × 512 × 18 × 18.
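A construction sketch of this truncated backbone using torchvision's pre-trained VGG16; the input size S = 288 is an assumption that yields the stated 18 × 18 spatial output (288 / 16 = 18 once the final pooling layer is removed):

```python
import torch
import torchvision

# Drop the classifier and the pool5 layer after the last convolution,
# keeping only the convolutional feature extractor.
vgg = torchvision.models.vgg16(pretrained=True)
backbone = torch.nn.Sequential(*list(vgg.features.children())[:-1])

x = torch.randn(2, 3, 288, 288)   # S = 288 is an assumed input size
print(backbone(x).shape)          # torch.Size([2, 512, 18, 18])
```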
Step 3.2, inputting the feature map and the P target candidate boxes in the step 3.1 into a third module interest area pooling module, receiving target areas with any size and generating feature output with fixed size to obtain feature vector representation of the target area corresponding to each target candidate box, wherein the output is a B multiplied by P multiplied by d matrix, and d is a dimension represented by the feature vector; each target candidate box is represented by a d-dimensional feature vector;
step 3.3, the output of the step 3.2 is respectively input into a fourth module target area detection branch and a fifth module hash code learning branch, wherein the output of the target area detection branch is the image class probability, and the output of the hash code learning branch is the class hash representation of each target candidate box;
specifically, step 3.3 includes:
step 3.3.1, in the fourth-module target area detection branch, the feature vector of each target candidate frame passes through two fully connected layers and is then divided into a detection data stream branch and a classification data stream branch, which pass through their respective fully connected layers to obtain a detection output matrix and a classification output matrix; the detection output matrix and the classification output matrix are then multiplied element-wise to obtain the class probability of each target candidate frame, output as a merged data matrix; finally, the class probabilities of all target candidate frames of each image are summed to obtain the class probability of the image (i.e. the image's score for each category), which serves as the output of the target region detection branch;
step 3.3.2, in the fifth-module hash code learning branch, the feature vector of each target candidate frame passes through two fully connected layers (two 512-dimensional fully connected layers in this embodiment) and is then input into a hash layer containing L nodes, giving a B × P × L hash-like output in which each row vector of each image is the hash-like representation (hash-like output) of one target candidate frame.
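A minimal sketch of this branch; the sigmoid on the hash layer is an assumption consistent with the hash-like definition given earlier (each output position is a real number approaching 0 or 1):

```python
import torch.nn as nn

def make_hash_branch(d=4096, L=64):
    # Two 512-dimensional fully connected layers, then a hash layer of L nodes.
    return nn.Sequential(
        nn.Linear(d, 512), nn.ReLU(),
        nn.Linear(512, 512), nn.ReLU(),
        nn.Linear(512, L), nn.Sigmoid(),  # B x P x L hash-like output
    )
```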
More specifically, step 3.3.1 comprises:
step (a1), the detection data stream branch passes through two fully connected layers to obtain the detection output matrix $X^d$, computed according to Equation 1:

$$[\delta_{\mathrm{detect}}(x^d)]_{ij} = \frac{e^{x^d_{ij}}}{\sum_{k=1}^{P} e^{x^d_{kj}}} \tag{1}$$

In Equation 1, $i$ denotes the $i$-th row of the matrix, $j$ denotes the $j$-th column, $[\delta_{\mathrm{detect}}(x^d)]_{ij}$ is the output at row $i$, column $j$ after the detection data stream branch, $x^d_{ij}$ is the value at row $i$, column $j$ of the detection output matrix $X^d$, $P$ is the total number of target candidate frames, and $e$ is the base of the exponentiation. With $C$ categories in total, the detection output matrix has dimension B × P × C and gives the score of each category under each target candidate frame; Equation 1 is equivalent to performing one pass of target detection;
step (a2), the classification data stream branch passes through two fully connected layers to obtain the classification output matrix $X^c$, computed according to Equation 2:

$$[\delta_{\mathrm{class}}(x^c)]_{ij} = \frac{e^{x^c_{ij}}}{\sum_{k=1}^{C} e^{x^c_{ik}}} \tag{2}$$

In Equation 2, $i$ denotes the $i$-th row of the matrix, $j$ denotes the $j$-th column, $[\delta_{\mathrm{class}}(x^c)]_{ij}$ is the output at row $i$, column $j$ after the classification data stream branch, $x^c_{ij}$ is the value at row $i$, column $j$ of the classification output matrix $X^c$, $C$ is the total number of categories, and $e$ is the base of the exponentiation. Equation 2 computes the probability of each target candidate frame for each category, i.e. each target candidate frame is classified once; the classification output matrix has dimension B × P × C;
step (a3), the classification output matrix and the detection output matrix are multiplied element-wise to obtain the combined classification-and-detection result, a B × P × C data matrix in which each row of each image holds the scores of one target candidate frame over all categories, i.e. the class probabilities of that target candidate frame;
and (a4), finally, the class probabilities of the target candidate frames of each image are summed to obtain the class probability of the image (i.e. the score of the image for each category); the summed output has dimension B × 1 × C and serves as the output of the target region detection branch.
Step 3.4, optimizing the model: comparing the output of the target region detection branch, namely the image category probability with the label vector of the image to calculate binary cross entropy classification loss; comparing the output of the hash code learning branch, namely the class hash representation of each target candidate box with the label of the image to calculate the hash loss; and weighting and summing the two loss functions obtained by calculation to obtain the final combined loss, and performing a reverse iterative optimization model through a random gradient descent (SGD) method to finally obtain a multi-target image retrieval model.
Step 3.4 comprises:
step (b1), the class probability of the image obtained in step (a4) is compared with the image label vector to compute the binary cross-entropy classification loss according to Equation 3:

$$L_c(y, p(y)) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p(y_i) + (1 - y_i)\log\bigl(1 - p(y_i)\bigr)\right] \tag{3}$$

In Equation 3, $L_c(y, p(y))$ is the binary cross-entropy classification loss of one image, $N$ is the total number of dataset labels, $y_i$ is 0 or 1 and indicates whether the image has the $i$-th label (1 if it does), and $p(y_i)$ is the probability predicted by the model that the image has the $i$-th label;
step (b2), according to the output of the target area detection branch, the target candidate frames whose class probability exceeds the set class probability threshold are selected; their index numbers are used to obtain the corresponding hash-like representations from the output of the hash code learning branch, and the hash loss is computed from these representations and the corresponding labels. Specifically, the hash code learning branch ultimately computes a hash-like representation for every target candidate frame, so the detection results of the target area detection branch are needed to evaluate the hashing effect. In the output data of the target region detection branch, each row vector holds the scores of one target frame region over all categories. A target frame category score threshold θ is set to 0.2 and the target candidate frame regions are filtered and selected against it: a target candidate frame is retained if its maximum category score is greater than the threshold θ. Filtering is then performed per category, keeping only the highest-scoring target candidate frame under each category as that category's target frame. According to the index numbers of the finally selected target candidate frames, the corresponding hash-like representations are obtained from the hash branch, and the hash loss is computed from them and the corresponding labels; a sketch of this selection rule is given after step (b3) below.
The calculation is performed according to Equation 4:

$$L_h(h_1, h_2, y) = \frac{1}{2}\, y\, D_h(h_1, h_2)^2 + \frac{1}{2}\,(1 - y)\max\bigl(m - D_h(h_1, h_2),\, 0\bigr)^2 \tag{4}$$

In Equation 4, $L_h(h_1, h_2, y)$ is the hash loss between target areas 1 and 2 corresponding to a pair of target candidate frames, $h_1$ is the hash-like representation of image target area 1, $h_2$ is the hash-like representation of image target area 2, $y$ is 0 or 1 and indicates whether target areas 1 and 2 have the same predicted label (1 if they do), $m$ is the margin beyond which dissimilar pairs contribute no loss, and $D_h(h_1, h_2)$ is the Euclidean distance between the hash-like vector representations of target areas 1 and 2;
and (b3), the binary cross-entropy classification loss and the hash loss are weighted and summed at a ratio of 100:1 to obtain the final joint loss, and the model is optimized by backward iterations of stochastic gradient descent (SGD), with the learning rate set to 0.001.
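The candidate-frame selection rule of step (b2), referenced above, can be sketched as follows: keep frames whose best class score exceeds θ = 0.2, then keep only the top-scoring frame per category. The per-image tensor layout is an assumption.

```python
import torch

def select_boxes(box_probs, theta=0.2):
    # box_probs: P x C class-probability matrix for one image.
    keep = set()
    max_scores, _ = box_probs.max(dim=1)     # best class score per candidate frame
    valid = max_scores > theta               # frames passing the threshold filter
    for c in range(box_probs.size(1)):
        scores = box_probs[:, c].clone()
        scores[~valid] = -1.0
        best = int(scores.argmax())
        if scores[best] > theta:
            keep.add(best)                   # top-1 frame for category c
    return sorted(keep)                      # index numbers into the hash outputs
```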
Example 2:
the embodiment discloses a multi-target image retrieval method, as shown in fig. 2, including:
step one, acquiring the target hash code sets of an image set to construct a retrieval database: the images in the dataset are input in turn into the model trained in Embodiment 1, the target candidate frames in each image and their corresponding hash codes are screened according to the set threshold, the hash codes are combined into a target hash code set, and an association is established between each image and its corresponding hash code set;
step two, inputting a single image to be retrieved (as shown in FIG. 7) or multiple images to be retrieved (as shown in FIG. 8) into the multi-target image retrieval model constructed in Embodiment 1; in this embodiment, q images {I1, I2, ..., Iq} are input, and the outputs of the target area detection branch and the hash code learning branch are obtained; the target candidate frames in the images and the hash-like representations corresponding to them are screened according to the set target-candidate-frame class probability threshold;
and step three, converting the hash-like representations into hash codes to obtain the query hash code set of the multiple target candidate frames of the images to be retrieved, computing the joint Hamming distances between the query hash code set and the target hash code sets, sorting the joint Hamming distances in ascending order, and taking the images corresponding to the first n distances as the finally returned retrieval results; FIG. 7 shows the retrieval results for a single image to be retrieved, and FIG. 8 shows those for multiple images to be retrieved.
More specifically, in step three, the hash-like representation is converted into a hash code according to Equation 5:

$$b_i = \begin{cases} 1, & h_i \geq 0.5 \\ 0, & h_i < 0.5 \end{cases} \tag{5}$$

In Equation 5, $h_i$ is the value at the $i$-th position of the hash-like representation vector and $b_i$ is the resulting hash bit. The query hash codes of the multiple target candidate frames of the q images to be retrieved are thereby obtained and combined into the query hash code set $H_q$ of the images to be retrieved.
In step three, the joint Hamming distance between the query hash code set and a target hash code set is computed according to Equation 6:

$$\mathrm{UHD}(H_q, H_i) = \sum_{r=1}^{R} \min_{1 \le j \le K} D_H\bigl(b^q_r, b^i_j\bigr) \tag{6}$$

In Equation 6, $\mathrm{UHD}(H_q, H_i)$ is the joint Hamming distance between the image to be retrieved and the hash code set of the $i$-th image in the image set, $H_q$ is the hash code set of the image to be retrieved, $H_i$ is that of the $i$-th image in the image set, $R$ is the number of target hash codes of the image to be retrieved, $K$ is the number of target hash codes of the $i$-th image in the image set, and $D_H(b^q_r, b^i_j)$ is the Hamming distance between the $r$-th hash code of the image to be retrieved and the $j$-th hash code of the $i$-th image in the image set;

in Equation 6, each query hash code is matched with the most similar hash code in the target hash code set to obtain a minimum distance, and these minimum distances are summed to give the final joint Hamming distance.
Experimental verification:
To verify the effectiveness and superiority of the image retrieval method disclosed by the invention, several different evaluation indices were used on the VOC2012 multi-label image dataset, and comparison experiments were carried out against other mainstream image retrieval methods. The compared methods include unsupervised retrieval methods (SH, LSH, SpH, ITQ) and mainstream supervised image retrieval methods (KSH, DSH, DHN). The evaluation indices include precision@k, the precision-recall curve, ACG, and NDCG. FIG. 3 shows the precision@500 of the mainstream image retrieval methods and the method of the present invention at different hash code lengths, where two images are regarded as a correct match only if they have the same label, without considering the degree of similarity between images. FIG. 4 shows the precision-recall curves of the mainstream image retrieval methods and the method of the present invention, reflecting the feature-discriminating capability of the models. FIG. 5 shows the ACG of the compared methods for different numbers of returned images when the hash code length is 64; ACG takes the similarity between images into account, its value reflecting the average number of labels shared by a retrieved image and the query image, without considering ranking. FIG. 6 shows the NDCG of the compared methods for different numbers of returned images when the hash code length is 64; NDCG additionally considers the ranking of the retrieval results, with higher-ranked results carrying greater weight.
As can be seen from these experimental comparisons, the image retrieval model and method of the invention are clearly superior to current mainstream image retrieval methods under every evaluation index, because the method fully considers the multi-level semantic information of the image, avoids interference from noise information irrelevant to the retrieval target, and describes the semantic information of the image more fully and completely at a fine-grained level, and therefore naturally possesses a certain superiority.
Example 3:
the embodiment discloses a multi-target image retrieval model construction device, which comprises:
the acquisition module is used for acquiring the images and the labels corresponding to the images as a training set;
the pre-training neural network model building module is used for building a pre-training neural network model, and the neural network model is a multi-task learning model and comprises the following steps: a first module: the RPN module is used for generating a target candidate frame for an input image; a second module: an arbitrary deep convolutional neural network for generating a feature map of an input image; a third module: the interest area pooling module is used for quickly acquiring the feature vector representation of the target area corresponding to each target candidate frame; a fourth module: the target area detection branch is used for determining the class probability of the target candidate frame and then summing the class probabilities to obtain the class probability of the image; a fifth module: a hash code learning branch to determine a hash-like representation of the target candidate box;
and the training module is used for inputting the image into the pre-training neural network model for training.
Specifically, the training module includes:
the target candidate frame generation module is used for inputting an image into the first-module RPN and generating initial target candidate frames for each image; assuming that P target candidate frames are finally obtained, the output is a P × 4 matrix in which each row holds the coordinate information of one target candidate frame, the starting coordinates and the width and height (x, y, w, h), thereby obtaining the coordinate information of the P target candidate frames;
the feature map output module is used for inputting the image into the second-module deep convolutional neural network and, with the batch size set to B, outputting a feature map;
the target candidate frame feature vector representation module is used for inputting the feature map and the P target candidate frames into the third-module interest region pooling module to obtain the feature vector representation of the target region corresponding to each target candidate frame, output as a B × P × d matrix, where d is the dimension of the feature vector representation;
the image class probability and hash-like representation output module is used for inputting the feature vector representation of the target region corresponding to each target candidate frame into the fourth-module target area detection branch and the fifth-module hash code learning branch respectively, where the output of the target area detection branch is the image class probability and the output of the hash code learning branch is the hash-like representation of each target candidate frame;
the model optimization module is used for optimizing the model, which includes comparing the output of the target region detection branch, namely the image class probability, with the label of the image to compute the binary cross-entropy classification loss; comparing the output of the hash code learning branch, namely the hash-like representation of each target candidate frame, with the label of the image to compute the hash loss; weighting and summing the two computed losses to obtain the final joint loss; and optimizing the model by backward iterations of stochastic gradient descent (SGD), finally obtaining the multi-target image retrieval model.
Example 4:
the embodiment discloses a multi-target image retrieval apparatus, including:
the retrieval database construction module is used for acquiring a target hash code set of the image set to construct a retrieval database;
the multi-target image input module is used for inputting a single image or a plurality of images to be retrieved into the multi-target image retrieval model;
the hash-like representation acquisition module is used for obtaining the outputs of the target area detection branch and the hash code learning branch, and for screening the target candidate frames in the images and the hash-like representations corresponding to them according to the set target-candidate-frame class probability threshold;
the conversion and query hash code set acquisition module is used for converting the hash-like representations into hash codes to obtain the query hash code set of the multiple target candidate frames of the images to be retrieved;
and the retrieval result output module is used for computing the joint Hamming distances between the query hash code set and the target hash code sets, sorting the joint Hamming distances in ascending order, and taking the images corresponding to the first n distances as the finally returned retrieval results.

Claims (10)

1. A construction method of a multi-target image retrieval model is characterized in that the multi-target image retrieval model can be used for obtaining the Hash-like representation of a multi-target image to realize multi-target image retrieval; the method comprises the following steps:
step 1, acquiring a plurality of images and corresponding labels thereof as a training set;
step 2, constructing a pre-training neural network model, wherein the neural network model is a multi-task learning model and comprises the following steps:
a first module: the RPN module is used for generating a target candidate frame for an input image;
a second module: an arbitrary deep convolutional neural network for generating a feature map of an input image;
a third module: the interest area pooling module is used for quickly acquiring the feature vector representation of the target area corresponding to each target candidate frame;
a fourth module: the target area detection branch is used for determining the class probability of the target candidate frame and then summing the class probabilities to obtain the class probability of the image;
a fifth module: a hash code learning branch to determine a hash-like representation of the target candidate box;
and 3, inputting the image into the pre-training neural network model for training, wherein the training comprises the following steps:
step 3.1, inputting the images into the first-module RPN and generating initial target candidate frames for each image; assuming that P target candidate frames are finally obtained, the output is a P × 4 matrix in which each row holds the coordinate information of one target candidate frame, the starting coordinates and the width and height (x, y, w, h), thereby obtaining the coordinate information of the P target candidate frames;
meanwhile, inputting the image into the second-module deep convolutional neural network and, with the batch size set to B, outputting a feature map;
step 3.2, inputting the feature map and the P target candidate frames from step 3.1 into the third-module interest region pooling module to obtain the feature vector representation of the target region corresponding to each target candidate frame, and outputting a B × P × d matrix, where d is the dimension of the feature vector representation;
step 3.3, the output of the step 3.2 is respectively input into a fourth module target area detection branch and a fifth module hash code learning branch, wherein the output of the target area detection branch is the image class probability, and the output of the hash code learning branch is the class hash representation of each target candidate box;
step 3.4, optimizing the model: comparing the image category probability with the label vector of the image to calculate binary cross entropy classification loss; comparing the class hash representation of each target candidate box with the label of the image to calculate hash loss; and weighting and summing the two loss functions obtained by calculation to obtain the final combined loss, and performing a reverse iteration optimization model by a random gradient descent method to finally obtain a multi-target image retrieval model.
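Purely as an illustrative sketch (not the patented implementation), the five modules of claim 1 can be arranged as below in PyTorch. The RPN of the first module is not reimplemented — the P candidate boxes are assumed to be supplied externally — the backbone is assumed to output feat_dim channels, the batch is assumed to hold one image (B = 1) so the detection softmax can run over the candidate boxes, and all layer sizes are invented.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class MultiTargetHashNet(nn.Module):
    """Sketch of the five-module multi-task model (RPN omitted; the P
    candidate boxes are assumed to come from an external proposal stage).
    Assumes a batch of one image so the detection softmax runs over boxes."""

    def __init__(self, backbone: nn.Module, feat_dim: int = 512,
                 num_classes: int = 20, hash_bits: int = 48, pool: int = 7):
        super().__init__()
        self.backbone = backbone                     # second module: any deep CNN
        self.pool = pool
        self.shared = nn.Sequential(                 # two shared FC layers
            nn.Linear(feat_dim * pool * pool, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU())
        self.fc_det = nn.Linear(4096, num_classes)   # detection data stream
        self.fc_cls = nn.Linear(4096, num_classes)   # classification data stream
        self.hash_head = nn.Sequential(              # fifth module: hash branch
            nn.Linear(4096, 1024), nn.ReLU(),
            nn.Linear(1024, hash_bits), nn.Sigmoid())

    def forward(self, images, boxes):
        # boxes: list with one (P, 4) tensor of (x1, y1, x2, y2) proposals
        fmap = self.backbone(images)                              # feature map
        feats = roi_pool(fmap, boxes, (self.pool, self.pool))     # third module
        feats = self.shared(feats.flatten(1))                     # (P, 4096)
        det = torch.softmax(self.fc_det(feats), dim=0)   # formula 1: over boxes
        cls = torch.softmax(self.fc_cls(feats), dim=1)   # formula 2: over classes
        box_probs = det * cls                            # element-wise merge
        image_probs = box_probs.sum(dim=0, keepdim=True) # image class probability
        hash_like = self.hash_head(feats)                # class-hash representations
        return image_probs, box_probs, hash_like
```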
2. The method for constructing a multi-target image retrieval model according to claim 1, wherein step 3.3 comprises:
step 3.3.1, in the target region detection branch of the fourth module, the feature vector of each target candidate box first passes through two shared fully connected layers and is then split into a detection data stream and a classification data stream; the two streams pass through their respective fully connected layers to obtain a detection output matrix and a classification output matrix; the two matrices are then multiplied element-wise to obtain the class probability of each target candidate box, output as a merged data matrix; finally, the class probabilities of all target candidate boxes of each image are summed to obtain the class probability of that image, which serves as the output of the target region detection branch;
and step 3.3.2, in the hash code learning branch of the fifth module, the feature vector of each target candidate box passes through two fully connected layers and is then fed into a hash layer containing L nodes, giving a B × P × L class-hash output in which each row vector of an image is the class-hash representation of one target candidate box.
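To make the B × P × L shape of claim 2 concrete, a toy forward pass through the MultiTargetHashNet sketch above, with a stand-in one-layer backbone and invented sizes (B = 1, P = 5, C = 20, L = 48), might look as follows:

```python
import torch
import torch.nn as nn

# Toy shape check reusing the MultiTargetHashNet sketch above.
backbone = nn.Conv2d(3, 512, kernel_size=3, padding=1)    # stand-in CNN
model = MultiTargetHashNet(backbone, feat_dim=512, num_classes=20, hash_bits=48)
images = torch.randn(1, 3, 224, 224)
boxes = [torch.tensor([[0., 0., 64., 64.]] * 5)]          # P = 5 candidate boxes
image_probs, box_probs, hash_like = model(images, boxes)
print(image_probs.shape, box_probs.shape, hash_like.shape)
# torch.Size([1, 20]) torch.Size([5, 20]) torch.Size([5, 48])
```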
3. The method for constructing a multi-target image retrieval model according to claim 2, wherein step 3.3.1 comprises:
step (a1), the detection data stream passes through two fully connected layers to obtain a detection output matrix $X^d$, computed according to formula 1:

$$[\delta_{detect}(x^d)]_{ij} = \frac{e^{x^d_{ij}}}{\sum_{k=1}^{P} e^{x^d_{kj}}} \tag{1}$$

in formula 1, i denotes the ith row of the matrix and j the jth column; $[\delta_{detect}(x^d)]_{ij}$ is the output at row i, column j after the detection data stream computation; $x^d_{ij}$ is the value at row i, column j of the detection output matrix $X^d$; P is the total number of target candidate boxes; and e is the base of the exponential. Assuming C categories in total, the detection output matrix has dimension B × P × C and gives the score of each category under each target candidate box; formula 1 thus amounts to one pass of target detection, normalising each category over the P candidate boxes;
step (a2), the classification data stream passes through two fully connected layers to obtain a classification output matrix $X^c$, computed according to formula 2:

$$[\delta_{class}(x^c)]_{ij} = \frac{e^{x^c_{ij}}}{\sum_{k=1}^{C} e^{x^c_{ik}}} \tag{2}$$

in formula 2, i denotes the ith row of the matrix and j the jth column; $[\delta_{class}(x^c)]_{ij}$ is the output at row i, column j after the classification data stream computation; $x^c_{ij}$ is the value at row i, column j of the classification output matrix $X^c$; C is the total number of categories; and e is the base of the exponential. Formula 2 computes the probability of each category under each target candidate box, i.e. each target candidate box is classified once, and the classification output matrix has dimension B × P × C;
step (a3), multiplying the classification output matrix and the detection output matrix element-wise to obtain the combined classification-and-detection result: a B × P × C data matrix in which each row of each image gives the scores of one target candidate box over all categories, i.e. the class probability of that target candidate box;
and step (a4), finally, summing the class probabilities of the target candidate boxes of each image to obtain the class probability of the image; the output dimension after summation is B × 1 × C, and this serves as the output of the target region detection branch.
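A small numpy example may help fix the two softmax directions of formulas 1 and 2: formula 1 normalises each column (competition among the P candidate boxes for a category), while formula 2 normalises each row (competition among the C categories within a box). The scores below are invented, and for brevity one matrix feeds both streams, whereas in the model each data stream has its own fully connected output.

```python
import numpy as np

# Invented scores for P = 2 candidate boxes and C = 3 categories.
x = np.array([[2.0, 0.5, 1.0],
              [0.5, 1.5, 1.0]])

det = np.exp(x) / np.exp(x).sum(axis=0, keepdims=True)  # formula 1: softmax over boxes
cls = np.exp(x) / np.exp(x).sum(axis=1, keepdims=True)  # formula 2: softmax over classes
box_probs = det * cls               # step (a3): element-wise merge, P x C
image_prob = box_probs.sum(axis=0)  # step (a4): image-level class scores, length C
```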
4. The method for constructing a multi-target image retrieval model according to claim 3, wherein step 3.4 comprises:
step (b1), comparing the class probability of the image obtained in step (a4) with the image label vector to compute the binary cross-entropy classification loss according to formula 3:

$$L_c(y, p(y)) = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \log p(y_i) + (1 - y_i)\log\big(1 - p(y_i)\big)\Big] \tag{3}$$

in formula 3, $L_c(y, p(y))$ is the binary cross-entropy classification loss of one image; N is the total number of labels in the data set; $y_i$ is 0 or 1 and indicates whether the image carries the ith label (1 if it does); and $p(y_i)$ is the probability predicted by the model that the image carries the ith label;
step (b2), according to the output of the target region detection branch, selecting the target candidate boxes whose class probability exceeds the preset class probability threshold, screening out their index numbers, taking the corresponding class-hash representations from the output of the hash code learning branch, and computing the hash loss from these class-hash representations and the corresponding labels according to formula 4:

$$L_h(h_1, h_2, y) = \frac{1}{2}\, y\, D_h(h_1, h_2)^2 + \frac{1}{2}\,(1 - y)\,\max\big(m - D_h(h_1, h_2),\ 0\big)^2 \tag{4}$$

in formula 4, $L_h(h_1, h_2, y)$ is the hash loss between target regions 1 and 2 corresponding to a pair of target candidate boxes; $h_1$ is the class-hash representation of image target region 1 and $h_2$ that of image target region 2; y is 0 or 1 and indicates whether target regions 1 and 2 share the same predicted label (1 if they do); m is a margin; and $D_h(h_1, h_2)$ is the Euclidean distance between the class-hash vector representations of target regions 1 and 2;
and step (b3), taking a weighted sum of the binary cross-entropy classification loss and the hash loss in the ratio 100:1 to obtain the final joint loss, and iteratively optimizing the model by back-propagation with stochastic gradient descent, the learning rate being set to 0.001.
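In PyTorch, steps (b1)–(b3) might be sketched as follows. The 100:1 weighting and the 0.001 learning rate come from the claim; the margin value and the exact pairwise contrastive form of the hash loss are assumptions, chosen as the usual loss for the symbols formula 4 defines.

```python
import torch
import torch.nn.functional as F

def joint_loss(image_probs, labels, h1, h2, same_label, margin=2.0, w=100.0):
    # image_probs, labels: (1, C) predicted and ground-truth multi-label vectors.
    # h1, h2: class-hash vectors of a screened pair of candidate boxes;
    # same_label is 1.0 if the pair shares a predicted label, else 0.0.
    l_c = F.binary_cross_entropy(image_probs.clamp(1e-7, 1 - 1e-7), labels)
    d = torch.norm(h1 - h2, p=2)                          # D_h in formula 4
    l_h = 0.5 * same_label * d.pow(2) \
        + 0.5 * (1 - same_label) * torch.clamp(margin - d, min=0).pow(2)
    return w * l_c + l_h                                  # 100:1 weighting, step (b3)

# per step (b3): optimiser = torch.optim.SGD(model.parameters(), lr=0.001)
```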
5. A multi-target image retrieval method, characterized by comprising the following steps:
step one, acquiring the target hash code set of an image set to construct a retrieval database;
step two, inputting one or more images to be retrieved into the multi-target image retrieval model according to any one of claims 1 to 4 to obtain the outputs of the target region detection branch and the hash code learning branch, and screening out the target candidate boxes in each image, together with their corresponding class-hash representations, according to the preset class probability threshold for target candidate boxes;
and step three, converting the class-hash representations into hash codes to obtain a query hash code set over the target candidate boxes of the image to be retrieved; computing the joint Hamming distance between the query hash code set and each target hash code set, sorting the joint Hamming distances in ascending order, and taking the images corresponding to the first n distances as the final returned retrieval result.
6. The multi-target image retrieval method according to claim 5, wherein in step three the class-hash representation is converted into a hash code according to formula 5:

$$b_i = \begin{cases} 1, & h_i \geq 0.5 \\ 0, & h_i < 0.5 \end{cases} \tag{5}$$

in formula 5, $h_i$ is the value at the ith position of the class-hash representation vector and $b_i$ is the corresponding bit of the hash code;
thereby obtaining the query hash codes $\{b_q^1, b_q^2, \ldots, b_q^R\}$ of the R target candidate boxes of the image q to be retrieved, which together form the query hash code set $H_q$ of the image to be retrieved.
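A one-line numpy rendering of formula 5, under the assumption that the class-hash values come from a sigmoid-activated hash layer and are thresholded at 0.5:

```python
import numpy as np

def binarize(h: np.ndarray) -> np.ndarray:
    # Formula 5: map each class-hash value to a bit (0.5 threshold assumed).
    return (h >= 0.5).astype(np.uint8)

# e.g. binarize(np.array([0.91, 0.12, 0.55]))  ->  array([1, 0, 1], dtype=uint8)
```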
7. The multi-target image retrieval method according to claim 5, wherein in step three the joint Hamming distance between the query hash code set and a target hash code set is computed according to formula 6:

$$UHD(H_q, H_i) = \sum_{r=1}^{R} \min_{1 \le j \le K} d_H\big(h_q^r, h_i^j\big) \tag{6}$$

in formula 6, $UHD(H_q, H_i)$ is the joint Hamming distance between the image to be retrieved and the hash code set of the ith image in the image set; $H_q$ is the hash code set representation of the image to be retrieved and $H_i$ that of the ith image in the image set; R is the number of target hash codes of the image to be retrieved; K is the number of target hash codes of the ith image in the image set; and $d_H(h_q^r, h_i^j)$ is the Hamming distance between the rth hash code of the image to be retrieved and the jth hash code of the ith image in the image set;
that is, formula 6 matches each query hash code to the most similar hash code in the target hash code set to obtain a minimum distance, and sums these minimum distances to give the final joint Hamming distance.
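Formula 6 vectorises naturally with numpy broadcasting; unlike the loop-based apparatus sketch earlier, the fragment below computes the full R × K Hamming distance matrix in one step:

```python
import numpy as np

def joint_hamming_vec(Q: np.ndarray, T: np.ndarray) -> int:
    # Q: (R, L) query codes; T: (K, L) target codes; entries in {0, 1}.
    d = (Q[:, None, :] != T[None, :, :]).sum(axis=2)  # (R, K) pairwise distances
    return int(d.min(axis=1).sum())  # formula 6: min over j, summed over r
```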
8. A multi-target image retrieval model construction apparatus, characterized by comprising:
an acquisition module, configured to acquire images and their corresponding labels as a training set;
a pre-training neural network model building module, configured to build a pre-training neural network model, the neural network model being a multi-task learning model comprising: a first module: an RPN module, used to generate target candidate boxes for an input image; a second module: an arbitrary deep convolutional neural network, used to generate a feature map of the input image; a third module: a region-of-interest pooling module, used to efficiently acquire the feature vector representation of the target region corresponding to each target candidate box; a fourth module: a target region detection branch, used to determine the class probabilities of the target candidate boxes and then sum these class probabilities to obtain the class probability of the image; and a fifth module: a hash code learning branch, used to determine the class-hash representation of each target candidate box;
and a training module, configured to input the images into the pre-training neural network model for training.
9. The multi-target image retrieval model construction apparatus according to claim 8, wherein the training module comprises:
a target candidate box generation module, configured to input images into the first module (RPN) and generate initial target candidate boxes for each image; assuming P target candidate boxes are finally obtained, the output is a P × 4 matrix in which each row holds the coordinate information of one target candidate box, namely its origin coordinates and its width and height (x, y, w, h), thereby obtaining the coordinate information of the P target candidate boxes;
a feature map output module, configured to input the images into the deep convolutional neural network of the second module, with the batch size set to B, and output a feature map;
a feature vector representation module for the target candidate boxes, configured to input the feature map and the P target candidate boxes into the region-of-interest pooling module of the third module to obtain the feature vector representation of the target region corresponding to each target candidate box, and output a B × P × d matrix, where d is the dimension of the feature vector representation;
an image class probability and class-hash representation output module, configured to input the feature vector representation of the target region corresponding to each target candidate box into the target region detection branch of the fourth module and the hash code learning branch of the fifth module respectively, the output of the target region detection branch being the image class probability and the output of the hash code learning branch being the class-hash representation of each target candidate box;
and a model optimization module, configured to optimize the model by comparing the image class probability with the label vector of the image to compute a binary cross-entropy classification loss, comparing the class-hash representation of each target candidate box with the label of the image to compute a hash loss, taking a weighted sum of the two losses as the final joint loss, and iteratively optimizing the model by back-propagation with stochastic gradient descent, finally obtaining the multi-target image retrieval model.
10. A multi-target image retrieval apparatus, characterized by comprising:
a retrieval database construction module, configured to acquire the target hash code set of an image set and construct a retrieval database;
a multi-target image input module, configured to input one or more images to be retrieved into the multi-target image retrieval model;
a class-hash representation acquisition module, configured to acquire the outputs of the target region detection branch and the hash code learning branch by means of the multi-target image retrieval model construction apparatus according to claim 8 or 9, and to screen out the target candidate boxes in each image, together with their corresponding class-hash representations, according to a preset class probability threshold for target candidate boxes;
a conversion and query hash code set acquisition module, configured to convert the class-hash representations into hash codes, yielding a query hash code set over the target candidate boxes of the image to be retrieved;
and a retrieval result output module, configured to compute the joint Hamming distance between the query hash code set and each target hash code set, sort the joint Hamming distances in ascending order, and return the images corresponding to the first n distances as the final retrieval result.
CN202110270411.2A 2021-03-12 2021-03-12 Construction method of multi-target image retrieval model, retrieval method and device Active CN113032612B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110270411.2A CN113032612B (en) 2021-03-12 2021-03-12 Construction method of multi-target image retrieval model, retrieval method and device

Publications (2)

Publication Number Publication Date
CN113032612A true CN113032612A (en) 2021-06-25
CN113032612B CN113032612B (en) 2023-04-11

Family

ID=76470346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110270411.2A Active CN113032612B (en) 2021-03-12 2021-03-12 Construction method of multi-target image retrieval model, retrieval method and device

Country Status (1)

Country Link
CN (1) CN113032612B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105844669A (en) * 2016-03-28 2016-08-10 华中科技大学 Video target real-time tracking method based on partial Hash features
CN106407352A (en) * 2016-09-06 2017-02-15 广东顺德中山大学卡内基梅隆大学国际联合研究院 Traffic image retrieval method based on depth learning
CN106503106A (en) * 2016-10-17 2017-03-15 北京工业大学 A kind of image hash index construction method based on deep learning
WO2018121018A1 (en) * 2016-12-30 2018-07-05 腾讯科技(深圳)有限公司 Picture identification method and device, server and storage medium
WO2018137357A1 (en) * 2017-01-24 2018-08-02 北京大学 Target detection performance optimization method
CN107679250A (en) * 2017-11-01 2018-02-09 浙江工业大学 A kind of multitask layered image search method based on depth own coding convolutional neural networks
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN110297931A (en) * 2019-04-23 2019-10-01 西北大学 A kind of image search method
CN111460200A (en) * 2020-03-04 2020-07-28 西北大学 Image retrieval method and model based on multitask deep learning and construction method thereof

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
CHANGZHEN XIONG et al.: "Subject Features and Hash Codes for Multi-label Image Retrieval", IEEE *
KAIMING HE et al.: "Mask R-CNN", IEEE *
FENG Xingjie et al.: "Image retrieval based on deep convolutional neural networks and hashing", Computer Engineering and Design *
PENG Tianqiang et al.: "Image retrieval method based on deep convolutional neural network and binary hash learning", Journal of Electronics & Information Technology *
PENG Yanfei et al.: "Image retrieval based on hash algorithm and generative adversarial network", Laser & Optoelectronics Progress *
HU Qiyao et al.: "Research on image retrieval technology based on weakly supervised deep learning", Journal of Northwest University (Natural Science Edition) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114627437A (en) * 2022-05-16 2022-06-14 科大天工智能装备技术(天津)有限公司 Traffic target identification method and system
CN114627437B (en) * 2022-05-16 2022-08-05 科大天工智能装备技术(天津)有限公司 Traffic target identification method and system

Also Published As

Publication number Publication date
CN113032612B (en) 2023-04-11

Similar Documents

Publication Publication Date Title
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
CN110866140B (en) Image feature extraction model training method, image searching method and computer equipment
Lin et al. Bsn: Boundary sensitive network for temporal action proposal generation
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN110084151B (en) Video abnormal behavior discrimination method based on non-local network deep learning
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN113326731B (en) Cross-domain pedestrian re-identification method based on momentum network guidance
CN111198964B (en) Image retrieval method and system
CN110532379B (en) Electronic information recommendation method based on LSTM (least Square TM) user comment sentiment analysis
CN114492574A (en) Pseudo label loss unsupervised countermeasure domain adaptive picture classification method based on Gaussian uniform mixing model
CN111967343A (en) Detection method based on simple neural network and extreme gradient lifting model fusion
CN107169117B (en) Hand-drawn human motion retrieval method based on automatic encoder and DTW
CN111931505A (en) Cross-language entity alignment method based on subgraph embedding
CN112085072B (en) Cross-modal retrieval method of sketch retrieval three-dimensional model based on space-time characteristic information
CN110929848A (en) Training and tracking method based on multi-challenge perception learning model
Wang et al. Aspect-ratio-preserving multi-patch image aesthetics score prediction
CN112784768A (en) Pedestrian re-identification method for guiding multiple confrontation attention based on visual angle
CN113255366B (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN112084895B (en) Pedestrian re-identification method based on deep learning
CN115171165A (en) Pedestrian re-identification method and device with global features and step-type local features fused
CN112784929A (en) Small sample image classification method and device based on double-element group expansion
CN111008224A (en) Time sequence classification and retrieval method based on deep multitask representation learning
CN112766378A (en) Cross-domain small sample image classification model method focusing on fine-grained identification
CN112364747A (en) Target detection method under limited sample
CN113032612B (en) Construction method of multi-target image retrieval model, retrieval method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant