CN107679250B: Multitask layered image retrieval method based on deep self-coding convolutional neural network
 Publication number
 CN107679250B (application number CN201711057490.9A)
 Authority
 CN
 China
 Prior art keywords
 image
 matrix
 retrieval
 region
 target
 Prior art date
 Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
 Active
Classifications

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06F—ELECTRIC DIGITAL DATA PROCESSING
 G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
 G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
 G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
 G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06K—RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
 G06K9/00—Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
 G06K9/20—Image acquisition
 G06K9/32—Aligning or centering of the image pickup or imagefield
 G06K9/3233—Determination of region of interest

 G—PHYSICS
 G06—COMPUTING; CALCULATING; COUNTING
 G06N—COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
 G06N3/00—Computer systems based on biological models
 G06N3/02—Computer systems based on biological models using neural network models
 G06N3/04—Architectures, e.g. interconnection topology
 G06N3/0454—Architectures, e.g. interconnection topology using a combination of multiple neural nets
Abstract
The invention discloses a multitask layered image retrieval method based on a deep self-coding convolutional neural network. The method mainly comprises: a multitask end-to-end convolutional neural network for deep learning and training identification; a rapid visual segmentation, detection and positioning method using an RPN-based secondary screening module for the region of interest; rough retrieval using a full-image sparse hash code; accurate comparison retrieval based on the maximum-response region perception semantic features and the matrix h; and an algorithm for selectively comparing regions of interest. The invention supports end-to-end training, automatically selects higher-quality regions of interest, effectively improves the automation and intelligence level of search-by-image, and meets the image retrieval requirements of the big data era with less storage space and higher retrieval speed.
Description
Technical Field
The invention relates to the application of computer vision, pattern recognition, information retrieval, multitask learning, similarity measurement, deep self-coding convolutional neural networks and deep learning technology in the field of image retrieval, and in particular to a multitask layered image retrieval method based on a deep self-coding convolutional neural network.
Background
Image retrieval is a search technology that provides users with graphical image information by analyzing the content of an input query image and retrieving similar images. It draws on multiple disciplines, including image processing, computer vision, multitask learning, pattern recognition and cognitive psychology. The related technology mainly comprises obtaining an image representation and measuring similarity. In the big data era it is widely applied in fields such as image retrieval, video investigation, the Internet and shopping search engines.
Traditional content-based image retrieval generally relies on the color information, shape features, texture features and similar properties of images. This technology belongs to the pre-deep-learning era and can generally be divided into three steps: 1) Extract the feature representation of the target image; the most commonly used algorithms include SIFT descriptors, color or geometric invariant moments, hash functions and Fisher vector descriptions. 2) Re-encode the image feature representation into a lookup table for massive image collections. A target image with higher resolution can be downsampled before encoding, which reduces the computational burden of the search and speeds up image comparison. 3) Measure similarity: compute the similarity between the query image and the target dataset using the image feature representation obtained in step 2; set an image screening threshold according to the robustness required of the final query result and retain the k images with the highest similarity values; finally, screen out the most similar pictures with a feature matching algorithm.
The image feature representation links the pixel information of an image to human perception of things, and image features are the precondition for retrieval. Traditional content-based retrieval requires features to be extracted manually, which is time-consuming and labor-intensive and causes serious problems in retrieval accuracy and efficiency. These methods typically use low-level features such as color, texture and contour as the basic features of an image, and the retrieval result is obtained by matching against the images in the target database through similarity computed on the manually extracted features. In the current big data era the target database is extremely large, so timeliness in the retrieval process is critically important; timeliness is one criterion for measuring the quality of an image retrieval system. Existing content-based image retrieval technology has a low level of intelligence, lacks the ability to self-encode image features, has difficulty obtaining retrieval results accurately and quickly, and cannot meet the technical requirements of image retrieval in the big data era.
QBIC (Query By Image Content) proposed by IBM, TinEye developed by Idée (Canada), Photobook developed by the MIT Media Laboratory, Virage developed by Virage, NETRA developed by ADL at the University of California, and VisualSEEk and WebSEEk developed by Columbia University are all image retrieval technologies of the pre-deep-learning era.
Deep learning, developed in recent years, builds deep self-coding convolutional neural networks that simulate the way the human brain analyzes and learns, and uses this simulated learning mechanism to interpret and represent picture data. The LeNet model is the most representative network model among early deep neural networks; the AlexNet, VGG and ResNet networks have deeper structures and stronger image representation capability. Deep learning reveals the distributed characteristics of the target database by representing low-level features as more abstract target image classes or feature vectors.
Combining hash methods with deep learning is a new trend for improving the precision and efficiency of retrieval. A hash algorithm can also be used to measure the similarity between images, and hash methods are mainly divided into unsupervised and supervised methods. Unsupervised methods are often used for unlabeled data and obtain a set of hash codes by learning the geometric structure of the data; among them, locality-sensitive hashing is widely used, and its learning objective is to maximize the probability that similar data receive similar hash codes. Compared with unsupervised learning, supervised hash learning obtains a more representative set of hash codes. A hash method based on a deep self-coding convolutional neural network can automatically explore the deep features of an image and is an ideal image retrieval technology.
Chinese patent application No. 201611127943.6 discloses an ultra-low-complexity image retrieval method based on order-preserving hashing. A part of the images in an image database is randomly selected as a training set and the corresponding features are extracted; the dimensionality of the original image features is reduced by a nonlinear principal component analysis method; a series of supporting points is then obtained with the K-means clustering algorithm as the basis for subsequent hash function learning; the whole image database is then hash-coded using the hash function learned by iterative optimization; finally, the similarity between images is measured by the Hamming distance. This technology still belongs to traditional content-based retrieval.
Chinese patent application No. 201610877794.9 discloses an image retrieval method based on an interest target, comprising the following steps: 1) analyzing the user's interest target with the HS saliency detection algorithm and segmenting the interest target with the SaliencyCut algorithm; 2) extracting HSV color features, SIFT local features and CNN semantic features from the user's interest target; 3) matching the extracted features of the interest target against the images in the database by similarity and sorting by similarity to obtain a retrieval result based on the interest target. That invention represents images with features of multiple dimensions and meets the image retrieval precision requirement of the big data era, but the algorithm places high demands on system memory, the comparison of features in multiple dimensions slows down the retrieval system, and it cannot be used for large-scale image database retrieval.
Chinese patent application No. 201510475003.51 discloses a method, a device and a system for image retrieval, image information acquisition and image identification. The system first extracts the local features of the image to be retrieved and computes the feature values of the local features with a pre-trained deep self-coding convolutional neural network, then matches these feature values against the feature values of the registered images in an image retrieval database, and finally selects the registered images that meet preset conditions as the retrieval results according to the matching results. Although the system selects local features through feature points, the extracted local features are not guaranteed to segment the retrieved object accurately, and the features of individual objects cannot be extracted independently from a multi-target, multi-label image.
Chinese patent application No. 201710035674.9 discloses a method and a system for matching feature points of super-large-scale images. First, an image neighbor search is carried out to obtain image matching pairs; an undirected graph is formed with the images as nodes and with edges between adjacent images, and a breadth-first ordering of the graph gives the ordered images and image pairs; the feature information of the images is rearranged according to the ordering and stored in blocks in binary files; the binary files are then read in sequence, feature matching is performed on the ordered image pairs in turn, and feature information that is no longer needed is released promptly; feature information is read and matched iteratively until all image pairs have been matched. The algorithm can meet the retrieval requirement of large-scale images, but it is based on neighbor search and suffers from problems such as low retrieval precision.
In summary, image search techniques that use a deep self-coding convolutional neural network and a hash method face the following problems: 1) how to extract a sparse code of the whole image from a multi-target, multi-label image in a multitask learning manner while accurately segmenting the regions of interest and extracting their region perception semantic features; 2) how to use the extracted features to build a hierarchical deep search that yields more accurate retrieval results; 3) how to combine the recognition precision, detection accuracy and retrieval efficiency of a retrieval system based on a deep self-coding convolutional neural network; 4) how to design a framework that uses a CNN to realize a truly end-to-end image retrieval method through hierarchical deep search; 5) how to reduce the large storage consumption and low retrieval speed of image retrieval systems in the big data era.
Disclosure of Invention
Existing image search technology has a low level of automation and intelligence, lacks deep learning, has difficulty obtaining accurate retrieval results, consumes a large amount of storage space, retrieves slowly, and has difficulty meeting the image retrieval requirements of the big data era. To address these problems, the invention provides an end-to-end image retrieval method that performs hierarchical deep search based on a deep self-coding convolutional neural network.
To realize the above invention, several core problems must be solved: 1) to address the difficulty of image feature extraction, the strong feature representation capability of a deep self-coding convolutional neural network is used to realize adaptive feature extraction; 2) to address the low retrieval speed on large-scale image collections, a multitask layered method is designed so that a query image can be compared rapidly with the images in a database; 3) for semantic retrieval in multi-target image scenes, a secondary screening algorithm for the region of interest is designed to detect and segment multi-target images; 4) exploiting the advantages of end-to-end deep networks, an end-to-end deep self-coding convolutional neural network is designed that fuses detection, identification and feature extraction into one network.
To realize a large-scale image retrieval method with an end-to-end multitask deep self-coding convolutional neural network, the invention comprises a multitask end-to-end convolutional neural network for deep learning and training identification, a rapid visual segmentation, detection and positioning method using an RPN-based secondary screening model for the region of interest, rough retrieval using a full-image sparse hash code, accurate comparison retrieval based on the maximum-response region perception semantic features and the matrix h, and an algorithm for selectively comparing regions of interest.
The invention provides a multitask layered image retrieval method based on a deep self-coding convolutional neural network, whose technical scheme comprises the following steps:
1) Construct a multitask end-to-end convolutional neural network for deep learning and training identification;
the convolutional neural network is divided into three modules: a shared convolution module, a region-of-interest secondary screening module, and a region-of-interest coordinate regression and identification module. Together the modules form a deep convolutional neural network in which convolution layers, activation layers and downsampling layers alternate; the input image undergoes logistic regression and layer-by-layer mapping in the network, which yields a different representation of the image at each layer and realizes a deep representation of the region of interest;
shared convolution module: the shared network consists of 5 convolution modules, in which the deepest layers of conv2_x to conv5_x take {4^{2}, 8^{2}, 16^{2}, 16^{2}} respectively as the output sizes of their feature maps, and conv1 contains only a single convolutional layer as the input layer;
region-of-interest secondary screening module: the RPN takes an image of any scale as input and outputs a set of rectangular target suggestion boxes, each comprising 4 position coordinate variables and a score. To generate the region suggestion boxes, the input image is first passed through the shared convolution layers to produce a feature map, and a multi-scale convolution operation is then applied to the feature map as follows: at the position of each sliding window, 3 scales and 3 aspect ratios are used, each combination of one scale and one aspect ratio being centered on the current sliding window, so that 9 candidate regions of different scales are obtained by mapping back onto the original image; for a shared convolution feature map of size w × h there are w × h × 9 candidate regions in total. Finally, the classification layer outputs w × h × 9 × 2 scores, namely the estimated probability that each candidate region is a target or a non-target, and the regression layer outputs w × h × 9 × 4 parameters, namely the coordinate parameters of the candidate regions;
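The candidate-region count described above can be illustrated with a minimal Python sketch. The stride of 16 and the concrete scales and aspect ratios are assumptions borrowed from common Faster R-CNN defaults; the patent text only states that 3 scales and 3 aspect ratios are used, and `make_anchors` is a hypothetical helper name.

```python
from itertools import product


def make_anchors(feat_w, feat_h, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) candidate regions in image coordinates.

    stride, scales and ratios are illustrative assumptions, not values
    taken from the patent.
    """
    anchors = []
    for y, x in product(range(feat_h), range(feat_w)):
        # Center of the sliding window, mapped back onto the input image.
        cx = (x + 0.5) * stride
        cy = (y + 0.5) * stride
        for scale, ratio in product(scales, ratios):
            # ratio = height / width; the box area stays scale ** 2.
            w = scale / ratio ** 0.5
            h = scale * ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(feat_w=4, feat_h=3)
print(len(anchors))  # 4 * 3 * 9 = 108
```

For a w × h feature map the list has w × h × 9 entries, which matches the w × h × 9 × 2 scores and w × h × 9 × 4 coordinate parameters produced by the classification and regression layers.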
When the RPN is trained, each candidate region is assigned a binary label that marks whether the region is a target. The rules are as follows: a positive label is assigned to 1) the candidate region that has the highest IoU (Intersection-over-Union) overlap with a real target region (GT), and 2) candidate regions whose IoU overlap with any GT bounding box exceeds 0.7. A negative label is assigned to candidate regions whose IoU with all GT bounding boxes is below 0.3; 3) candidate regions that fall between the two are discarded.
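The labeling rules above can be sketched as follows. The function names and the use of -1 for the in-between regions are illustrative choices; only the 0.7 and 0.3 thresholds come from the text.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def label_candidates(candidates, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return 1 (target), 0 (non-target) or -1 (in between) per candidate."""
    overlaps = [[iou(c, g) for g in gt_boxes] for c in candidates]
    labels = []
    for row in overlaps:
        best = max(row) if row else 0.0
        if best > pos_thr:
            labels.append(1)
        elif best < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)
    # Rule 1: the candidate with the highest IoU for each GT box is positive.
    for j in range(len(gt_boxes)):
        best_i = max(range(len(candidates)), key=lambda i: overlaps[i][j])
        if overlaps[best_i][j] > 0:
            labels[best_i] = 1
    return labels


candidates = [(0, 0, 10, 10), (0, 0, 10, 8), (20, 20, 30, 30), (0, 0, 10, 5)]
print(label_candidates(candidates, [(0, 0, 10, 10)]))  # [1, 1, 0, -1]
```

The four candidates have IoU 1.0, 0.8, 0.0 and 0.5 with the single GT box, so they are labeled positive, positive, negative and in between respectively.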
With these definitions, the objective function is minimized following the multitask loss of Faster R-CNN. The loss function for an image is defined as:
$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\,L_{reg}(t_i, t_i^*)$$
where $i$ is the index of the i-th candidate region and $p_i$ is the predicted probability that the i-th candidate region is a target. If the label of the candidate region is positive, $p_i^*$ is 1; if the label is negative, $p_i^*$ is 0. $t_i$ is a vector representing the 4 parameterized coordinates of the predicted bounding box, and $t_i^*$ is the coordinate vector of the corresponding GT bounding box. $N_{cls}$ and $N_{reg}$ are the normalization coefficients of the classification loss and the position regression loss respectively, and $\lambda$ is the weight parameter between the two. The classification loss $L_{cls}$ is the log loss over two classes (target vs. non-target):
$$L_{cls}(p_i, p_i^*) = -\log\left[p_i^* p_i + (1 - p_i^*)(1 - p_i)\right]$$
The position regression loss $L_{reg}$ is defined by the following function:
$$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$$
where $R$ is a robust loss function (smooth L1).
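As a rough illustration of the multitask loss described above, the sketch below assumes the standard Faster R-CNN forms: a two-class log loss, a smooth L1 regression loss applied only to positive regions, and normalization by the number of candidates when no coefficients are supplied. The function names, the default λ = 1 and the clamping constant are assumptions for illustration.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def rpn_loss(p, p_star, t, t_star, lam=1.0, n_cls=None, n_reg=None):
    """Multitask loss over one image.

    p      : predicted target probabilities, one per candidate region
    p_star : labels, 1 for positive and 0 for negative regions
    t      : predicted 4-vectors of parameterized coordinates
    t_star : GT 4-vectors of parameterized coordinates
    """
    n_cls = n_cls or len(p)   # assumed default normalization
    n_reg = n_reg or len(p)
    eps = 1e-12               # avoids log(0); an implementation detail
    cls_sum, reg_sum = 0.0, 0.0
    for pi, ps, ti, ts in zip(p, p_star, t, t_star):
        pi = min(max(pi, eps), 1.0 - eps)
        cls_sum += -math.log(ps * pi + (1 - ps) * (1 - pi))
        # The regression term only counts for positive regions (ps == 1).
        reg_sum += ps * sum(smooth_l1(a - b) for a, b in zip(ti, ts))
    return cls_sum / n_cls + lam * reg_sum / n_reg


loss = rpn_loss(p=[0.9, 0.2], p_star=[1, 0],
                t=[(0.1, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
                t_star=[(0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 1.0, 1.0)])
print(round(loss, 4))
```

Because the second region is negative, its large coordinate error does not contribute to the regression term.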
The image features of the I region-of-interest suggestion boxes output by the RPN are first sent to a primary screening layer that removes 2/3 of the background boxes, which increases the proportion of positive samples and effectively reduces the generation of background regions; then, convolution and ReLU processing are applied to the image features of the primarily screened suggestion boxes to obtain I 4096-dimensional feature maps, which are sent to a classification layer and a window regression layer respectively; finally, to obtain the maximum-response region perception semantic features, the I 4096-dimensional feature maps are fed into a secondary screening network, which selects the region perception semantic features of the most accurate suggestion box;
region-of-interest coordinate regression and identification module: training the convolutional neural network is a back-propagation process similar to the BP algorithm; the error function is propagated backwards, and the convolution parameters and biases are optimized by stochastic gradient descent until the network converges or the maximum number of iterations is reached;
back propagation requires comparison with labeled training samples. A squared-error cost function is adopted; for multi-class identification with c classes and N training samples, the final output error of the network is calculated by formula (5),
$$E^{N} = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{c}\left(t_k^{n} - y_k^{n}\right)^{2} \qquad (5)$$
where $E^{N}$ is the squared-error cost function, $t_k^{n}$ is the k-th dimension of the label of the n-th sample, and $y_k^{n}$ is the k-th output of the network prediction for the n-th sample;
when the error function is propagated backwards, a calculation similar to the traditional BP algorithm is adopted, as shown in formula (6),
$$\delta^{l} = \left(W^{l+1}\right)^{T}\delta^{l+1} \circ f'(u^{l}), \qquad u^{l} = W^{l}x^{l-1} + b^{l} \qquad (6)$$
where $\delta^{l}$ represents the error function of the current layer, $\delta^{l+1}$ represents the error function of layer l+1, from which the error is propagated back, $W^{l+1}$ is the mapping matrix of layer l+1, $f'$ denotes the derivative of the activation function, i.e. upsampling, $u^{l}$ denotes the output of the current layer before the activation function is applied, $x^{l-1}$ denotes the output of layer l-1, which is the input of the current layer, $W^{l}$ is the mapping weight matrix of the current layer, and $b^{l}$ is its bias;
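The squared-error cost and the backward error step described above can be sketched for a small fully connected layer. This assumes the standard forms E = ½ ΣΣ (t − y)² and δ^l = (W^{l+1})^T δ^{l+1} ∘ f'(u^l) with a sigmoid activation; the choice of sigmoid and the function names are illustrative assumptions, since the text does not name the activation function.

```python
import math


def squared_error(targets, outputs):
    """E = 1/2 * sum over samples n and classes k of (t_k^n - y_k^n)^2."""
    return 0.5 * sum((t - y) ** 2
                     for tn, yn in zip(targets, outputs)
                     for t, y in zip(tn, yn))


def sigmoid_prime(u):
    """Derivative of the sigmoid activation (an assumed choice of f)."""
    s = 1.0 / (1.0 + math.exp(-u))
    return s * (1.0 - s)


def backprop_delta(w_next, delta_next, u):
    """delta^l = (W^{l+1})^T delta^{l+1}, multiplied elementwise by f'(u^l).

    w_next has one row per unit of layer l+1 and one column per unit of
    layer l; u holds the pre-activation outputs of layer l.
    """
    return [
        sum(w_next[k][j] * delta_next[k] for k in range(len(delta_next)))
        * sigmoid_prime(u[j])
        for j in range(len(u))
    ]


print(squared_error([[1, 0]], [[0.5, 0.5]]))                # 0.25
print(backprop_delta([[1.0, 2.0]], [0.5], [0.0, 0.0]))      # [0.125, 0.25]
```

In a full training loop the deltas would feed the gradient of each weight matrix, followed by a stochastic gradient descent update until convergence or the iteration limit.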
2) Rapid visual segmentation, detection and positioning with the RPN-based secondary screening model for the region of interest, as follows:
an image obtained from a video or a camera contains multiple target regions. A probability layer applies non-maximum suppression and threshold screening to the coordinate output and score output of each suggestion box of the RPN to obtain the final coordinate boxes and scores; the region of the most accurate suggestion box is then selected by the rescreening network. Screening the suggestion boxes twice ensures accurate detection and identification of the target object;
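A minimal sketch of the threshold screening and non-maximum suppression step follows. The greedy NMS shown here is the commonly used variant; the threshold values are illustrative assumptions, because the text does not give them.

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def screen_boxes(boxes, scores, iou_thr=0.7, score_thr=0.5):
    """Return indices of kept boxes, highest score first.

    Boxes scoring below score_thr are dropped, then any box overlapping
    an already kept box by more than iou_thr is suppressed.
    """
    order = sorted((i for i, s in enumerate(scores) if s >= score_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.2]
print(screen_boxes(boxes, scores, iou_thr=0.5))  # [0, 2]
```

Box 1 overlaps box 0 with IoU of about 0.68 and is suppressed, while box 3 is removed by the score threshold before suppression starts.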
3) Rough retrieval using the full-image sparse hash code, as follows:
a hash method aims to represent samples as fixed-length binary codes, which involves two aspects: 1) for two images that share the same semantic concept, the Hamming distance between their binary codes should be as small as possible, whereas if the images come from different classes the Hamming distance should be larger, which means that the binary codes are discriminative in Hamming space; 2) a hash method can be viewed as a black box whose input is typically a low-level or mid-level feature, such as the GIST feature, and whose output is a binary code that preserves semantic properties in Hamming space. From the point of view of feature representation, such binary codes can be regarded as a high-level representation built on low-level or mid-level features, and they return semantics that are more relevant to the query image;
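The Hamming-distance property described above can be shown with a short sketch. The three binary codes are made-up values chosen only to illustrate that codes of the same class should lie closer together than codes of different classes.

```python
def hamming(a, b):
    """Number of positions at which two equal-length binary codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical codes: two images of one class and one of another class.
cat_1 = [1, 0, 1, 1, 0, 0, 1, 0]
cat_2 = [1, 0, 1, 0, 0, 0, 1, 0]
truck = [0, 1, 0, 1, 1, 1, 0, 0]

print(hamming(cat_1, cat_2))  # 1
print(hamming(cat_1, truck))  # 6
```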
combining the deep convolutional network with the hash method, the rough retrieval process is as follows:
firstly, assume that the target dataset can be divided into c category labels. Passing the target image through the primary screening network of the region of interest yields, for each region of interest, a probability vector over the target categories p = (x_{1}, x_{2}, ..., x_{c}), x_{c} ∈ (0,1). To drive the output of the sparse coding module toward binary values, the probability vector p is binarized with a piecewise function. If the target image I contains m regions of interest, m vectors p are generated; fusing them gives P = (p_{1}, p_{2}, ..., p_{m}), a global probability matrix of dimension m × c. P is passed through the binarization function to obtain the matrix h; the binarization process is given by formula (7);
where i, j ∈ (m, c). Secondly, to speed up image retrieval, vector fusion is applied once more: the matrix h is compressed into a matrix H of dimension 1 × c that represents the global features of the target image. The whole process is given by formulas (8) and (9): the transpose of the matrix h is first multiplied by the matrix h to obtain a c × c transition matrix h′, and the diagonal elements of h′ are then taken as the global-feature binary hash code of the final target image I, namely the matrix H;
$$h' = h^{T}h \qquad (8)$$
$$H = \mathrm{diag}(h') \qquad (9)$$
the sparse hash matrix H is a 1 × c vector with H_{i} ∈ {0,1}, i ∈ c, and h′_{ij} denotes the element in row i, column j of the matrix (h^{T}h). The target dataset is roughly searched using H; the low-dimensional vector effectively shortens the search time, so that the retrieval precision is improved.
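The binarization and fusion steps above can be sketched as follows. Two details are assumptions rather than statements from the text: the piecewise function is taken to be a threshold at 0.5, and the diagonal of h^T h is clipped to {0,1} so that the resulting code satisfies H_i ∈ {0,1} even when several regions share a class.

```python
def binarize(P, threshold=0.5):
    """Map an m x c probability matrix to a 0/1 matrix h.

    The 0.5 threshold is an assumed form of the piecewise function.
    """
    return [[1 if x >= threshold else 0 for x in row] for row in P]


def global_hash(h):
    """Compress an m x c binary matrix h into a 1 x c code H."""
    m, c = len(h), len(h[0])
    # Transition matrix h' = h^T h, of size c x c.
    h_prime = [[sum(h[r][i] * h[r][j] for r in range(m)) for j in range(c)]
               for i in range(c)]
    diag = [h_prime[i][i] for i in range(c)]
    # Clipping to {0, 1} is an assumption; a diagonal entry counts how
    # many regions were assigned to that class.
    return [1 if d > 0 else 0 for d in diag]


P = [[0.9, 0.1, 0.2],    # region 1: mostly class 0
     [0.8, 0.05, 0.6]]   # region 2: classes 0 and 2
h = binarize(P)
print(h)               # [[1, 0, 0], [1, 0, 1]]
print(global_hash(h))  # [1, 0, 1]
```

The code H records which of the c classes appear anywhere in the image, which is what makes it usable as a cheap first-stage filter.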
The deep self-coding convolutional neural network learns a sparse hash matrix of the image and the deep perception semantic information of the image target regions. Based on this, a coarse-to-fine retrieval strategy is adopted to achieve quick search and accurate return in the image retrieval system: first, a group of images with similar attributes is retrieved using the sparse hash matrix of the image; then, from this group of roughly retrieved images, the top k pictures most similar to the target image are selected using the deep perception semantic information of the image target regions.
The retrieval process is as follows: for a given image I, the output Out^{j}(H) of the image, i.e. the sparse hash matrix H, is first extracted.
Assume that the target dataset consists of n images, expressed as {I_{1}, I_{2}, …, I_{n}}, and that the sparse hash codes of the target dataset are {H_{1}, H_{2}, …, H_{n}}, with the entries of each H_{i} in {0,1}. Further assume a given query image I_{q} with sparse hash code H_{q}. The cosine distance is used to measure the similarity between H_{q} and each H_{i} in the set; images whose cosine value is greater than the threshold T_{H} are placed in the candidate pool U, and the rough search result serves as the set of candidate images for the subsequent fine search.
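The rough search described above can be sketched in a few lines. The threshold value and the helper names are illustrative assumptions; the text only states that images whose cosine value exceeds T_H enter the candidate pool U.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rough_search(h_q, dataset_codes, t_h=0.75):
    """Return indices of dataset images admitted to the candidate pool U."""
    return [i for i, h_i in enumerate(dataset_codes)
            if cosine(h_q, h_i) > t_h]


codes = [[1, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 1]]
print(rough_search([1, 0, 1], codes))  # [0, 3]
```

Only the 1 × c codes are touched at this stage, so the cost per database image is independent of the number of suggestion boxes it contains.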
4) Accurate comparison retrieval based on the maximum-response region perception semantic features and the matrix h;
fine search: given a query image I_{q} and a candidate pool U, the top-k ranking of the images in U is determined using the region perception semantic features selected by the rescreening network and the fully connected layer. The number of suggestion boxes per image is variable, and an image may contain one or more. If the query image I_{q} contains m suggestion boxes and an image I_{n} ∈ U randomly selected from the candidate pool contains m′ suggestion boxes, comparing all suggestion boxes by brute force requires m × m′ comparisons, and the larger m × m′ is, the slower the whole retrieval system runs. To reduce the number of suggestion-box comparisons and improve running efficiency, the matrix h is used as the basis for deciding which comparisons to make. The matrix h of the query image I_{q} is denoted h_{q}, an m × c matrix, and the matrix h of a random image I_{n} in the candidate pool is denoted h_{n}; the corresponding number of comparisons num is given by formula (10).
The result satisfies num ≤ m × m′. Selectively comparing suggestion boxes according to formula (10) greatly reduces the number of comparisons and the running time of the retrieval system, and the effect is more pronounced the more suggestion boxes an image contains. The suggestion boxes that need to be compared are obtained from formula (11),
where dis(·) denotes a modified cosine distance, given by formula (12).
the regional perceptual semantic features of the query image and any one of the target datasets separately use f_{q},f_{n}It is shown that the mannequin formula (11) yields a comparison suggestion box matrix s, m × m' dimension. There are often two main categories of suggestion boxes for image generation: the sameclass suggestion boxes and the nonsameclass suggestion boxes, so that two classes of comparison results exist in the suggestion boxes selected for comparison, thereby causing different retrieval differences in the number of the suggestion boxes within a class and between classes, so that the differences can be eliminated by using the formula (13): taking an image I by taking a suggestion frame of the query image as a reference_{q}And I_{n}Maximum value of interclass frame, average value of interclass frame, and finally in image I_{q}The intraclass mean value is obtained again, the differences are reduced to the maximum extent through operation to ensure the accuracy of the result, and an image I is obtained_{q}And I_{n}Similarity sim of (d);
first, the matrix s is updated by taking the maximum over the same-class boxes of images I_{q} and I_{n}, according to the following formula:
in formula (13), i, j ∈ (m, m′), and the maximum value in the i-th row of the matrix s′ is selected. Finally, formula (14) takes the mean over the inter-class boxes of I_{q} and I_{n} and the intra-class mean within I_{q} to obtain the similarity of the whole image; sim is computed as follows:
in formula (14), i, j ∈ (m, m′), the indexed matrix term denotes the element in row i, column j of the matrix of the query image, and s′^{j} denotes the j-th column of the matrix s′; the larger the value of sim, the more similar the images; the candidate images in the pool U are sorted in descending order of sim, which determines the top k images.
Further, the method comprises step 5), evaluation of retrieval accuracy, which uses a ranking-based criterion; for a given query image I_{q} and a similarity measure, every dataset image receives a rank; the retrieval precision of I_{q} is evaluated on the top k ranked images and is expressed by formula (15):
Precision@k = (Σ_{i=1}^{k} Rel(i)) / k (15)
where Rel(i) denotes the true relevance between the query image I_{q} and the i-th ranked image, k is the number of ranked images considered, and Precision@k is the retrieval precision; only images with classification labels are considered when computing the true relevance, with Rel(i) ∈ {0,1}: Rel(i) = 1 if the query image and the i-th ranked image have the same label, and Rel(i) = 0 otherwise; the retrieval precision is obtained by traversing the top k ranked images in the candidate pool U.
The overall retrieval flow of the multitask hierarchical image retrieval method based on the deep self-coding convolutional neural network is summarized as follows: 1) the image is fed into the deep self-coding convolutional neural network, logistic regression is performed on the feature map, and the position and category of the regions of interest in the query image are segmented and predicted; 2) the sparse hash matrix of the image and the region-aware semantic features of the regions of interest are extracted with the deep self-coding convolutional neural network; 3) the images in the database are coarsely retrieved using the sparse hash matrix to obtain candidate images with similar attributes, which are placed in the candidate pool U; 4) on the basis of the coarse retrieval, the suggestion boxes of the images in the candidate pool U are selectively compared and ranked using the modified cosine distance to obtain the top k images.
The invention has the following beneficial effects:
1) a multitask layered image retrieval method based on a deep self-coding convolutional neural network is provided;
2) the strong feature representation capability of deep convolutional neural networks is used to extract features adaptively;
3) the layered deep retrieval approach can meet the retrieval requirements of large-scale image data;
4) the design balances generality and specificity: in terms of generality, it meets the requirements of various users regarding retrieval speed, precision, and practicability; in terms of specificity, a user can build a dedicated dataset for a specific need and fine-tune the network parameters to obtain an application-specific search-by-image system.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a schematic diagram of a shared network.
Fig. 3 is an expanded diagram of the RPN network.
Fig. 4 is a flow chart of the fast visual segmentation, detection, and localization performed by the RPN-based secondary region-of-interest screening model.
Fig. 5 is a schematic diagram of the coarse retrieval process.
Fig. 6 is a schematic overall flow chart of the multitask hierarchical image retrieval method based on a deep self-coding convolutional neural network.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Referring to figs. 1 to 6, a multitask hierarchical image retrieval method based on a deep self-coding convolutional neural network is disclosed. As shown in fig. 6, an input query image first passes through the shared module of the convolutional neural network and is then sent to the region-of-interest module, which screens out the rough positions of the regions of interest in the image. It is then sent to the fast visual segmentation, detection, and localization module, which is based on the secondary region-of-interest screening model, to obtain the accurate position of the target in the image. Through deep learning, the full-image sparse hash code of the image is obtained for coarse retrieval, and the maximum-response region-aware semantic features and the matrix h of the image are obtained for accurate comparison retrieval.
Image retrieval is carried out by deep learning. Rough target candidate regions are obtained by performing logistic regression on the image in the shared network module and the primary region-of-interest screening module. The quality of this step directly affects the speed of the system, and obtaining rough target regions through primary screening reduces the computational complexity of the second-stage screening, which helps ensure the reliability and adaptability of the recognition system. The invention comprises the following steps:
1) constructing a multitask end-to-end convolutional neural network for deep learning, training, and recognition;
the convolutional neural network is divided into three modules: a shared convolution module, a secondary region-of-interest screening module, and a region-of-interest coordinate regression and identification module. Each module is a deep convolutional neural network formed by alternating convolution layers, activation layers, and downsampling layers. The input image undergoes logistic regression and layer-by-layer mapping in the network, which yields a different representation of the image at each layer and realizes a deep representation of the regions of interest;
a shared convolution module: the shared network consists of 5 convolution modules, in which the deepest layers of conv2_x to conv5_x use {4^{2}, 8^{2}, 16^{2}, 16^{2}} as the output sizes of their feature maps, and conv1 contains only a single convolution layer as the input layer, as shown in fig. 2. This deep structure effectively reduces computation time and builds invariance in the spatial structure. The input image is mapped layer by layer in the network, yielding a different representation of the image at each layer and thus a deep representation of the image; the mapping is determined directly by the convolution kernels and the downsampling scheme. A convolutional neural network is essentially a deep mapping structure: the input signal is mapped layer by layer, repeatedly decomposed and re-represented, and finally forms a multi-layer representation of the object target. Its main characteristic is that object features no longer need to be selected and constructed by hand; a deep representation of the object target is learned automatically by the machine;
the primary region of interest screening module: the RPN takes an image of any scale as input and outputs a set of rectangular target suggestion boxes, each comprising 4 position coordinate variables and a score. To generate the region suggestion boxes, the input image is first passed through the shared convolution layers to produce a feature map, and a multi-scale convolution operation is then applied to the feature map as follows: at each sliding-window position, 3 scales and 3 aspect ratios are used, each combination of one scale and one aspect ratio being centered on the current sliding window, so that 9 candidate regions of different scales are obtained by mapping onto the original image. For a shared convolution feature map of size w × h, for example, there are w × h × 9 candidate regions in total. Finally, the classification layer outputs w × h × 9 × 2 scores, i.e. the estimated probability of each candidate region being target or non-target, and the regression layer outputs w × h × 9 × 4 parameters, i.e. the coordinate parameters of the candidate regions, as shown in fig. 3.
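The candidate-region layout described above can be sketched in NumPy. This is a minimal illustration, not the patented implementation: the function name `generate_anchors`, the stride of 16, and the concrete scales and aspect ratios are assumptions borrowed from common RPN configurations, since the text only fixes "3 scales and 3 aspect ratios".

```python
import numpy as np

def generate_anchors(feat_w, feat_h, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * 9, 4) boxes as (x1, y1, x2, y2) in image pixels.

    stride, scales and ratios are illustrative assumptions.
    """
    base = []
    for s in scales:
        for r in ratios:
            # Keep the area at s * s while setting height / width = r.
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                                  # (9, 4)

    # Sliding-window centres mapped back onto the original image.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)                           # each (feat_h, feat_w)
    centres = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)

    # Broadcast: every centre gets the same 9 base boxes.
    return (centres + base[None, :, :]).reshape(-1, 4)

anchors = generate_anchors(feat_w=4, feat_h=3)
print(anchors.shape)   # (108, 4), i.e. w * h * 9 candidate regions
```

For a w × h feature map the classification layer would then emit 2 scores and the regression layer 4 coordinates per row of `anchors`.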
When the RPN is trained, each candidate region is assigned a binary label marking whether or not the region is a target. The rules are as follows: a positive label is assigned to 1) the candidate region with the highest IoU (Intersection-over-Union) overlap with a ground-truth (GT) target region, and 2) any candidate region whose IoU overlap with any GT bounding box exceeds 0.7; a negative label is assigned to candidate regions whose IoU with all GT bounding boxes is below 0.3; 3) candidate regions that fall between the two are discarded.
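The labelling rules can be made concrete with a short NumPy sketch. The helper names `iou_matrix` and `label_anchors` and the use of -1 for discarded regions are illustrative assumptions; only the 0.7 and 0.3 thresholds come from the text.

```python
import numpy as np

def iou_matrix(boxes, gt):
    """Pairwise IoU between boxes (N, 4) and gt (G, 4), both (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes[:, None, 0], gt[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], gt[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], gt[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], gt[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_b[:, None] + area_g[None, :] - inter)

def label_anchors(boxes, gt, pos_thr=0.7, neg_thr=0.3):
    """Return one label per box: 1 = target, 0 = non-target, -1 = discarded."""
    iou = iou_matrix(boxes, gt)
    best = iou.max(axis=1)                 # best overlap of each box with any GT
    labels = np.full(len(boxes), -1)
    labels[best < neg_thr] = 0             # below 0.3 for all GT boxes
    labels[best > pos_thr] = 1             # rule 2: above 0.7 for some GT box
    labels[iou.argmax(axis=0)] = 1         # rule 1: highest IoU for each GT box
    return labels

boxes = np.array([[0, 0, 10, 10], [0, 0, 10, 8],
                  [20, 20, 30, 30], [0, 0, 10, 5]], dtype=float)
gt = np.array([[0, 0, 10, 10]], dtype=float)
print(label_anchors(boxes, gt))            # [ 1  1  0 -1]
```

Rule 1 is applied last so that every GT box keeps at least one positive candidate even when no overlap exceeds 0.7.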
With these definitions, the objective function is minimized following the multi-task loss in Faster R-CNN. The loss function for an image is defined as:
L({p_{i}}, {t_{i}}) = (1/N_{cls}) Σ_{i} L_{cls}(p_{i}, p_{i}^{*}) + λ (1/N_{reg}) Σ_{i} p_{i}^{*} L_{reg}(t_{i}, t_{i}^{*})
where i is the index of a candidate region and p_{i} is the predicted probability that the i-th candidate region is a target. If the label of the candidate region is positive, p_{i}^{*} is 1; if the label is negative, p_{i}^{*} is 0. t_{i} is a vector of the 4 parameterized coordinates of the predicted bounding box, and t_{i}^{*} is the coordinate vector of the corresponding GT bounding box. N_{cls} and N_{reg} are the normalization coefficients of the classification loss and the position regression loss, respectively, and λ is the weight parameter between the two. The classification loss L_{cls} is the log loss over two classes (target vs. non-target):
L_{cls}(p_{i}, p_{i}^{*}) = −log[p_{i}^{*} p_{i} + (1 − p_{i}^{*})(1 − p_{i})]
The position regression loss L_{reg} is defined by the following function:
L_{reg}(t_{i}, t_{i}^{*}) = R(t_{i} − t_{i}^{*})
where R is the robust loss function smooth_{L1}, with smooth_{L1}(x) = 0.5x^{2} if |x| < 1 and |x| − 0.5 otherwise.
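A small numerical sketch of this multi-task loss may help. It assumes the standard Faster R-CNN forms of the log loss and smooth L1; the function names and the choice of setting both normalization coefficients to the number of sampled regions are assumptions made for illustration, as the text does not fix N_{cls} and N_{reg}.

```python
import numpy as np

def smooth_l1(x):
    """Robust loss R: quadratic near zero, linear in the tails."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=1.0):
    """p: (N,) predicted target probabilities; p_star: (N,) labels in {0, 1};
    t, t_star: (N, 4) predicted and GT parameterized coordinates."""
    eps = 1e-12                                   # guards against log(0)
    l_cls = -np.log(p_star * p + (1 - p_star) * (1 - p) + eps)
    l_reg = smooth_l1(t - t_star).sum(axis=1)     # summed over the 4 coordinates
    n_cls = len(p)                                # assumption: number of samples
    n_reg = len(p)                                # assumption: same normalizer
    # Regression only contributes for positive regions (p_star = 1).
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg

p = np.array([0.9, 0.2])
p_star = np.array([1.0, 0.0])
t = np.array([[0.1, 0.0, 0.0, 0.0], [3.0, 3.0, 3.0, 3.0]])
t_star = np.zeros((2, 4))
print(rpn_loss(p, p_star, t, t_star))
```

The second region is a negative sample, so its large coordinate error does not enter the regression term.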
Based on the image features of the I suggestion boxes of interest output by the RPN, the boxes are first sent to the primary screening layer, which removes 2/3 of the background boxes to raise the proportion of positive samples and effectively reduce the generation of background regions. The image features of the boxes that pass primary screening are then processed by convolution and ReLU to obtain I 4096-dimensional feature maps, which are sent to the classification layer and the window regression layer, respectively. Finally, to obtain the maximum-response region-aware semantic features, the I 4096-dimensional feature maps are fed into the secondary screening network, which selects the region-aware semantic features of the most accurate suggestion box.
A region of interest coordinate regression and identification module: training the convolutional neural network is a back-propagation process similar to the BP algorithm; the error function is propagated backward, and the convolution parameters and biases are optimized by stochastic gradient descent until the network converges or the maximum number of iterations is reached;
back-propagation requires comparing the network output for the training samples with their labels. A squared-error cost function is used; for a multi-class problem with c classes and N training samples, the final output error of the network is computed by formula (5):
E^{N} = (1/2) Σ_{n=1}^{N} Σ_{k=1}^{c} (t_{k}^{n} − y_{k}^{n})^{2} (5)
where E^{N} is the squared-error cost, t_{k}^{n} is the k-th dimension of the label of the n-th sample, and y_{k}^{n} is the k-th output predicted by the network for the n-th sample;
when the error function is propagated backward, a calculation similar to the traditional BP algorithm is used, as shown in formula (6):
δ^{l} = (W^{l+1})^{T} δ^{l+1} ∘ f′(u^{l}), u^{l} = W^{l} x^{l−1} + b^{l} (6)
where δ^{l} is the error term of the current layer, δ^{l+1} is the error term of layer l+1, from which the error is propagated back, W^{l+1} is the mapping matrix of layer l+1, f′ is the derivative of the activation function, ∘ denotes element-wise multiplication, u^{l} is the output of the current layer before the activation function is applied, x^{l−1} is the output of layer l−1, i.e. the input of the current layer, W^{l} is the mapping weight matrix of the current layer, and b^{l} is its bias;
2) the rough targets obtained by the primary region-of-interest screening module are sent to the secondary region-of-interest screening model for fast visual segmentation, detection, and localization. As shown in fig. 4, a probability layer outputs the coordinates and score of each suggestion box of the RPN; non-maximum suppression and threshold screening then give the final coordinate boxes and scores; and the re-screening network finally selects the region of the most accurate suggestion box. Screening the suggestion boxes twice ensures accurate detection and identification of the target object, and yields both the global sparse hash code of the image and the semantic information of the suggestion boxes in the image. Through end-to-end fine-tuning, the sparse hash code summarizes the global information of the image, and this low-dimensional feature is used for fast coarse retrieval of images in the database, which effectively reduces computation. The output of the fully connected layer and the output of the primary screening network are then connected to the secondary screening network to extract the suggestion box with the largest response within the regions of interest, and these high-level semantic features are compared selectively within the coarse retrieval results to further reduce the running time of the retrieval system;
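The non-maximum suppression and threshold screening step mentioned above can be sketched as a greedy procedure in NumPy. The function name `nms` and both threshold values are assumptions for illustration; the text does not state the values it uses.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.7, score_thr=0.05):
    """Greedy NMS. boxes: (N, 4) as (x1, y1, x2, y2); returns kept indices,
    highest score first. Both thresholds are illustrative assumptions."""
    candidates = np.flatnonzero(scores > score_thr)        # threshold screening
    order = candidates[np.argsort(-scores[candidates])]    # best score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # Overlap of the current best box with every remaining box.
        w = np.minimum(boxes[i, 2], boxes[rest, 2]) - np.maximum(boxes[i, 0], boxes[rest, 0])
        h = np.minimum(boxes[i, 3], boxes[rest, 3]) - np.maximum(boxes[i, 1], boxes[rest, 1])
        inter = np.clip(w, 0, None) * np.clip(h, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thr]                       # drop near-duplicates
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11],
                  [20, 20, 30, 30], [0, 0, 5, 5]], dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.01])
print(nms(boxes, scores, iou_thr=0.5))   # [0, 2]
```

Box 1 is suppressed because it overlaps box 0 with IoU of about 0.68, and box 3 is removed by the score threshold before suppression starts.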
3) based on the above, the invention adopts a coarse-to-fine retrieval strategy to achieve fast search and accurate results; first, a group of images with similar attributes is retrieved using the sparse hash matrix of the image; then, within this coarsely retrieved group, the k images most similar to the target image are selectively retrieved using the deep perceptual semantic information of the image's target regions.
First, assume that the target dataset can be divided into c category labels. Passing a target image through the primary region-of-interest screening network gives, for each region of interest, a probability vector over the target categories, p = (x_{1}, x_{2}, ..., x_{c}), x_{c} ∈ (0,1). To make the sparse coding module produce binary output, the probability vector p is binarized with a piecewise function; the overall process is shown in fig. 2. If the target image I contains m regions of interest, m vectors p are generated, and fusing them gives P = (p_{1}, p_{2}, ..., p_{m}), a global probability matrix of dimension m × c. P is passed through the binarization function to obtain the matrix h; the binarization is given by formula (7);
where i, j ∈ (m, c). Second, to speed up image retrieval, vector fusion is applied again to compress the matrix h into a 1 × c matrix H that represents the global features of the target image. The process is given by formulas (8) and (9): the matrix h is first transposed and multiplied by itself to obtain the c × c transition matrix h′, and the diagonal elements of h′ are then taken as the global binary hash code of the target image I, i.e. the matrix H.
h′ = h^{T}h (8)
H＝diag(h′) (9)
The sparse hash matrix H is a 1 × c vector with H_{i} ∈ {0,1}, i ∈ c, and h′_{ij} denotes the element in row i, column j of the matrix h^{T}h. The target dataset is coarsely retrieved using H; using this low-dimensional vector effectively shortens retrieval time and improves retrieval precision.
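The binarization and fusion steps can be sketched in NumPy as follows. Two points are assumptions rather than statements from the text: the binarization threshold of 0.5 (the piecewise function of formula (7) is not reproduced here), and the final clipping step, which is added because the diagonal of h^{T}h counts how many regions activate each class and would otherwise exceed 1 when several regions share a class.

```python
import numpy as np

def sparse_hash(P, threshold=0.5):
    """P: (m, c) class-probability matrix, one row per region of interest.

    Returns the binary matrix h (m, c) and the global code H (c,).
    """
    P = np.asarray(P, dtype=float)
    h = (P >= threshold).astype(int)   # binarization; threshold is an assumption
    h_prime = h.T @ h                  # (c, c) transition matrix, formula (8)
    counts = np.diag(h_prime)          # diagonal elements, formula (9)
    H = (counts > 0).astype(int)       # assumption: clip so that H_i is in {0, 1}
    return h, H

P = [[0.9, 0.1, 0.2],
     [0.8, 0.05, 0.7]]
h, H = sparse_hash(P)
print(h)   # [[1 0 0]
           #  [1 0 1]]
print(H)   # [1 0 1]
```

Here both regions activate the first class and one activates the third, so the image-level code marks classes 1 and 3 as present.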
The coarse retrieval flow is shown in fig. 5 and proceeds as follows: for a given image I, the output Out^{j}(H) of the image, i.e. its sparse hash matrix H, is first extracted.
Assume that the target dataset consists of n images, expressed as the set {I_{1}, I_{2}, …, I_{n}}, and that the sparse hash codes of the target dataset form the set {H_{1}, H_{2}, …, H_{n}}, with H_{i} ∈ {0,1}. Further assume a given query image I_{q} with sparse hash code H_{q}. The cosine distance is used to measure the similarity between H_{q} and each H_{i} in the set; images whose cosine value is greater than the threshold T_{H} are placed in the candidate pool U, and this coarse retrieval result serves as the set of candidate images for the subsequent fine retrieval.
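The coarse retrieval step reduces to a thresholded cosine similarity over the hash codes, which can be sketched as below. The function name `coarse_retrieve` and the example threshold are hypothetical; the text leaves the value of T_{H} open. Codes with zero norm are given similarity 0 here, which is also an assumption.

```python
import numpy as np

def coarse_retrieve(H_q, H_db, t_h):
    """Return indices of database images whose hash code has cosine
    similarity with H_q above t_h, plus all similarities.

    H_q: (c,) query code; H_db: (n, c) database codes.
    """
    H_q = np.asarray(H_q, dtype=float)
    H_db = np.asarray(H_db, dtype=float)
    num = H_db @ H_q
    den = np.linalg.norm(H_db, axis=1) * np.linalg.norm(H_q)
    # All-zero codes get similarity 0 instead of a division by zero.
    cos = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return np.flatnonzero(cos > t_h), cos

H_q = [1, 0, 1]
H_db = [[1, 0, 1],
        [0, 1, 0],
        [1, 1, 1],
        [0, 0, 0]]
pool, cos = coarse_retrieve(H_q, H_db, t_h=0.5)
print(pool)   # [0 2]
```

The returned indices play the role of the candidate pool U that the fine retrieval stage then re-ranks.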
4) Accurate comparison retrieval based on the maximum-response region-aware semantic features and the matrix h.
The fine retrieval is performed as follows. Given a query image I_{q} and a candidate pool U, the top-k ranking of the images in U is determined using the region-aware semantic features selected by the re-screening network and the fully connected layer. The number of suggestion boxes in each image varies, and an image may contain one or more. If the query image I_{q} contains m suggestion boxes and an image I_{n} ∈ U selected at random from the candidate pool contains m′ suggestion boxes, comparing all suggestion boxes by brute force requires m × m′ comparisons, and the larger m × m′ is, the slower the whole retrieval system runs. To reduce the number of suggestion-box comparisons and improve running efficiency, the matrix h is used as the basis for determining which comparisons are made. The matrix h of the query image I_{q} is denoted h_{q}, of dimension m × c, and the matrix h of a candidate image I_{n} is denoted h_{n}; the corresponding number of comparisons is given by formula (10):
The result satisfies num ≤ m × m′. Selectively comparing the suggestion boxes according to formula (10) greatly reduces the number of comparisons and the running time of the retrieval system, and the effect is more pronounced the more suggestion boxes an image contains. The suggestion boxes to be compared are obtained as shown in formula (11):
where dis(·) denotes the modified cosine distance, given by formula (12):
The region-aware semantic features of the query image and of any image in the target dataset are denoted f_{q} and f_{n}, respectively, and formula (11) yields the suggestion-box comparison matrix s, of dimension m × m′. The suggestion boxes generated for an image usually fall into two categories, same-class boxes and different-class boxes, so the boxes selected for comparison give two kinds of comparison results, and differing numbers of intra-class and inter-class boxes introduce retrieval discrepancies. Formula (13) is used to eliminate these discrepancies: taking the suggestion boxes of the query image as reference, the maximum over the same-class boxes of I_{q} and I_{n} is taken first, then the mean over the inter-class boxes, and finally the mean within the classes of I_{q}. These operations reduce the discrepancies as far as possible to ensure an accurate result, and give the similarity sim of images I_{q} and I_{n}.
First, the matrix s is updated by taking the maximum over the same-class boxes of images I_{q} and I_{n}, according to the following formula:
In formula (13), i, j ∈ (m, m′), and the maximum value in the i-th row of the matrix s′ is selected. Finally, formula (14) takes the mean over the inter-class boxes of I_{q} and I_{n} and the intra-class mean within I_{q} to obtain the similarity of the whole image; sim is computed as follows:
In formula (14), i, j ∈ (m, m′), the indexed matrix term denotes the element in row i, column j of the matrix of the query image, and s′^{j} denotes the j-th column of the matrix s′. The larger the value of sim, the more similar the images. The candidate images in the pool U are sorted in descending order of sim, which determines the top k images.
Further, the method comprises step 5), evaluation of retrieval accuracy, which uses a ranking-based criterion. For a given query image I_{q} and a similarity measure, every dataset image receives a rank. The retrieval precision of I_{q} is evaluated on the top k ranked images and is expressed by formula (15):
Precision@k = (Σ_{i=1}^{k} Rel(i)) / k (15)
where Rel(i) denotes the true relevance between the query image I_{q} and the i-th ranked image, k is the number of ranked images considered, and Precision@k is the retrieval precision. Only images with classification labels are considered when computing the true relevance, with Rel(i) ∈ {0,1}: Rel(i) = 1 if the query image and the i-th ranked image have the same label, and Rel(i) = 0 otherwise. The retrieval precision is obtained by traversing the top k ranked images in the candidate pool U.
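As a worked example of this ranking-based criterion, the following sketch computes the precision over the top k ranked images from their class labels. The function name `precision_at_k` and the string labels are illustrative; relevance is defined exactly as above, 1 for a matching label and 0 otherwise.

```python
def precision_at_k(query_label, ranked_labels, k):
    """Fraction of the top-k ranked images whose label matches the query."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_labels[:k]
    rel = [1 if label == query_label else 0 for label in top_k]
    # Divide by k, so missing results (fewer than k) count as misses.
    return sum(rel) / k

ranked = ["cat", "dog", "cat", "cat", "bird"]
print(precision_at_k("cat", ranked, k=4))   # 0.75
```

Three of the first four returned images share the query's label, giving a precision of 3/4.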
The overall retrieval flow of the multitask hierarchical image retrieval method based on the deep self-coding convolutional neural network is summarized as follows: 1) the image is fed into the deep self-coding convolutional neural network, logistic regression is performed on the feature map, and the position and category of the regions of interest in the query image are segmented and predicted; 2) the sparse hash matrix of the image and the region-aware semantic features of the regions of interest are extracted with the deep self-coding convolutional neural network; 3) the images in the database are coarsely retrieved using the sparse hash matrix to obtain candidate images with similar attributes, which are placed in the candidate pool U; 4) on the basis of the coarse retrieval, the suggestion boxes of the images in the candidate pool U are selectively compared and ranked using the modified cosine distance to obtain the top k images.
The above description is only exemplary of the preferred embodiments of the present invention, and is not intended to limit the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (5)
1. A multitask layered image retrieval method based on a deep self-coding convolutional neural network, characterized by comprising the following steps:
1) constructing a multitask end-to-end convolutional neural network for deep learning, training, and recognition; the convolutional neural network is divided into three modules: a shared convolution module, a secondary region-of-interest screening module, and a region-of-interest coordinate regression and identification module, each of which is formed by alternating convolution layers, activation layers, and downsampling layers; the input image undergoes logistic regression and layer-by-layer mapping in the network, which yields a different representation of the image at each layer and realizes a deep representation of the regions of interest;
2) performing fast visual segmentation, detection, and localization with the RPN-based secondary region-of-interest screening model: primary and secondary screening networks are added on top of the RPN, the initial suggestion boxes generated by the RPN are scored and filtered in multiple passes, and the final regions of interest are determined from the scores and the filtering of the maximum-response regions;
3) performing coarse retrieval with the full-image sparse hash code: the attribute probability vectors of the suggestion boxes generated by the RPN (Region Proposal Network) of the primary screening network are first binarized, and the resulting two-dimensional matrix is then flattened into a one-dimensional vector by vector fusion to obtain the full-image sparse hash code; finally, images are compared quickly by the cosine distance between the compact binary code vectors;
the hash method aims to represent each sample as a fixed-length binary code, and involves two aspects: 1) for two images sharing the same semantic concept, the Hamming distance between their binary codes should be as small as possible, whereas for images from different classes the Hamming distance should be larger, which means the binary codes are discriminative in Hamming space; 2) the hash method is treated as a black box whose input is low-level or mid-level features, such as GIST features, and whose output is a binary code that preserves semantic attributes in Hamming space; from the point of view of feature representation, such binary codes can be regarded as a high-level representation built on low-level or mid-level features, and they return semantics more relevant to the query image;
the deep hash algorithm proceeds as follows: first, assume that the target dataset can be divided into c category labels; passing a target image through the primary region-of-interest screening network gives, for each region of interest, a probability vector over the target categories, p = (x_{1}, x_{2}, ..., x_{c}), x_{c} ∈ (0,1); to make the sparse coding module produce binary output, the probability vector p is binarized with a piecewise function; if the target image I contains m regions of interest, m vectors p are generated, and fusing them gives P = (p_{1}, p_{2}, ..., p_{m}), a global probability matrix of dimension m × c; P is passed through the binarization function to obtain the matrix h, the binarization being given by formula (7):
second, to speed up image retrieval, vector fusion is applied to compress the matrix h into a 1 × c matrix H that represents the global features of the target image; the process is given by formulas (8) and (9): the matrix h is first transposed and multiplied by itself to obtain the c × c transition matrix h′, and the diagonal elements of h′ are then taken as the global binary hash code of the target image I, i.e. the matrix H;
h′ = h^{T}h (8)
H＝diag(h′) (9)
the sparse hash matrix H is a 1 × c vector with H_{i} ∈ {0,1}, i ∈ c, and h′_{ij} denotes the element in row i, column j of the matrix h^{T}h; the target dataset is coarsely retrieved using H, and using this low-dimensional vector effectively shortens retrieval time and improves retrieval precision;
the sparse hash matrix of the image and the deep perceptual semantic information of the image's target regions are learned by the deep self-coding convolutional neural network, and on this basis a coarse-to-fine retrieval strategy is adopted to achieve fast search and accurate results; first, a group of images with similar attributes is retrieved using the sparse hash matrix of the image; then, within this coarsely retrieved group, the k images most similar to the target image are selectively retrieved using the deep perceptual semantic information of the image's target regions;
the retrieval process is as follows: for a given image I, the output Out^{j}(H) of the image, i.e. its sparse hash matrix H, is first extracted;
assume that the target dataset consists of n images, expressed as the set {I_{1}, I_{2}, …, I_{n}}, and that the sparse hash codes of the target dataset form the set {H_{1}, H_{2}, …, H_{n}}, with H_{i} ∈ {0,1}; further assume a given query image I_{q} with sparse hash code H_{q}; the cosine distance is used to measure the similarity between H_{q} and each H_{i} in the set, images whose cosine value is greater than the threshold T_{H} are placed in the candidate pool U, and this coarse retrieval result serves as the set of candidate images for the subsequent fine retrieval;
4) accurate comparison retrieval based on the maximum-response region-aware semantic features and the matrix h: the re-screening network extracts the high-level semantic information of the suggestion boxes from the primary screening network result and the maximum response of the fully connected layer; combined with the images returned by the fast comparison above, this information is selectively compared and ranked using the modified cosine distance, and the top k images are the final returned results.
2. The multitask layered image retrieval method based on a deep self-coding convolutional neural network according to claim 1, characterized in that, in step 1), the shared convolution module is as follows: the shared network consists of 5 convolution modules, in which the deepest layers of conv2_x to conv5_x use {4^{2}, 8^{2}, 16^{2}, 16^{2}} as the output sizes of their feature maps, and conv1 contains only a single convolution layer as the input layer;
a region of interest coordinate regression and identification module: the RPN takes an image of any scale as input and outputs a set of rectangular target suggestion boxes, each comprising 4 position coordinate variables and a score; to generate the region suggestion boxes, the input image is first passed through the shared convolution layers to produce a feature map, and a multi-scale convolution operation is then applied to the feature map as follows: at each sliding-window position, 3 scales and 3 aspect ratios are used, each combination of one scale and one aspect ratio being centered on the current sliding window, so that 9 candidate regions of different scales are obtained by mapping onto the original image, and for a shared convolution feature map of size w × h there are w × h × 9 candidate regions in total; finally, the classification layer outputs w × h × 9 × 2 scores, i.e. the estimated probability of each candidate region being target or non-target, and the regression layer outputs w × h × 9 × 4 parameters, i.e. the coordinate parameters of the candidate regions;
when the RPN is trained, each candidate region is assigned a binary label marking whether or not it is a target, as follows: 1) a positive label is assigned to the candidate region having the highest IoU (Intersection-over-Union) overlap with a ground-truth (GT) target region, and to any candidate region whose IoU overlap with any GT bounding box is greater than 0.7; 2) a negative label is assigned to candidate regions whose IoU with all GT bounding boxes is below 0.3; 3) candidate regions falling between the two are discarded;
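The labelling rule can be illustrated with a short sketch. The helper names `iou_matrix` and `assign_labels` and the use of -1 for discarded regions are assumptions made for this example; only the 0.7 and 0.3 thresholds come from the claim.

```python
import numpy as np

def iou_matrix(boxes, gts):
    """Pairwise IoU between (A, 4) boxes and (G, 4) GT boxes, as (x1, y1, x2, y2)."""
    x1 = np.maximum(boxes[:, None, 0], gts[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], gts[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], gts[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], gts[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    return inter / (area_b[:, None] + area_g[None, :] - inter)

def assign_labels(boxes, gts, pos_thr=0.7, neg_thr=0.3):
    """Return one label per box: 1 = target, 0 = non-target, -1 = discarded."""
    iou = iou_matrix(boxes, gts)
    best = iou.max(axis=1)                  # best overlap of each box with any GT
    labels = np.full(len(boxes), -1, dtype=int)
    labels[best < neg_thr] = 0              # below 0.3 with all GT boxes
    labels[best > pos_thr] = 1              # above 0.7 with some GT box
    # The box with the highest IoU for each GT is positive as well
    # (assumes every GT overlaps at least one box).
    labels[iou.argmax(axis=0)] = 1
    return labels

gts = np.array([[0, 0, 10, 10]], dtype=float)
boxes = np.array([[0, 0, 10, 10],      # IoU 1.0
                  [0, 0, 10, 8],       # IoU 0.8
                  [0, 0, 10, 5],       # IoU 0.5
                  [20, 20, 30, 30]],   # IoU 0.0
                 dtype=float)
print(assign_labels(boxes, gts))       # [ 1  1 -1  0]
```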
with these definitions, the objective function is minimized following the multi-task loss in Faster R-CNN, and the loss function for an image is defined as:
$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\, L_{reg}(t_i, t_i^*)$$
where $i$ is the index of a candidate region and $p_i$ is the predicted probability that candidate region $i$ is a target; $p_i^*$ is 1 if the label of the candidate region is positive and 0 if the label is negative; $t_i$ is a vector representing the 4 parameterized coordinates of the predicted bounding box, and $t_i^*$ is the coordinate vector of the corresponding GT bounding box; $N_{cls}$ and $N_{reg}$ are the normalization coefficients of the classification loss and the position regression loss respectively, and $\lambda$ is the weight parameter balancing the two; the classification loss $L_{cls}$ is the log loss over the two classes, target and non-target:
$$L_{cls}(p_i, p_i^*) = -\log\left[p_i^*\, p_i + (1 - p_i^*)(1 - p_i)\right]$$
the position regression loss $L_{reg}$ is defined by the following function:
$$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$$
where $R$ is the robust loss function $\mathrm{smooth}_{L1}$:
$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$
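As a numerical illustration of this multi-task loss, the sketch below combines a two-class log loss with a smooth-L1 regression term that is counted only for positive regions. The function names, the default `lam=1.0`, and the fallback of normalizing by the number of regions are assumptions for the example and are not values stated in the claim.

```python
import numpy as np

def smooth_l1(x):
    """Robust loss: quadratic near zero, linear for |x| >= 1."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=1.0, n_cls=None, n_reg=None):
    """Classification log loss plus lam-weighted regression loss.

    p      : (A,)   predicted target probabilities
    p_star : (A,)   labels, 1 for target and 0 for non-target
    t      : (A, 4) predicted box parameters
    t_star : (A, 4) ground-truth box parameters
    lam, n_cls, n_reg defaults are illustrative assumptions.
    """
    p = np.asarray(p, dtype=float)
    p_star = np.asarray(p_star, dtype=float)
    t = np.asarray(t, dtype=float)
    t_star = np.asarray(t_star, dtype=float)
    n_cls = len(p) if n_cls is None else n_cls
    n_reg = len(p) if n_reg is None else n_reg
    eps = 1e-12                                   # guards against log(0)
    l_cls = -np.log(p_star * p + (1.0 - p_star) * (1.0 - p) + eps)
    l_reg = smooth_l1(t - t_star).sum(axis=1)     # summed over the 4 coordinates
    # Regression only contributes for regions whose label is positive.
    return l_cls.sum() / n_cls + lam * (p_star * l_reg).sum() / n_reg

loss = rpn_loss(p=[0.5], p_star=[1], t=[[0, 0, 0, 0]], t_star=[[2, 0, 0, 0]])
print(round(loss, 4))   # 2.1931 = log(2) + 1.5
```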
a secondary region-of-interest screening module: the image features of the I region-of-interest suggestion boxes output by the RPN are first sent to a primary screening layer, which removes 2/3 of the background boxes so as to increase the proportion of positive samples and effectively reduce the generation of background regions; the image features of the suggestion boxes that pass primary screening are then processed by convolution and ReLU to obtain I 4096-dimensional feature maps, which are sent to a classification layer and a window regression layer respectively; finally, to obtain the maximum-response region-aware semantic features, the I 4096-dimensional feature maps are fed into a secondary screening network, which selects the region-aware semantic features of the most accurate suggestion boxes;
training the convolutional neural network is a back-propagation process: the error function is propagated backwards, and the convolution parameters and biases are optimized by stochastic gradient descent until the network converges or the maximum number of iterations is reached;
back propagation requires comparing the network output with the labels of the training samples; for a multi-class problem with c classes and N training samples, a squared-error cost function is adopted, and the final output error of the network is computed by formula (5),
$$E^N = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{c}\left(t_k^n - y_k^n\right)^2 \qquad (5)$$
where $E^N$ is the squared-error cost function, $t_k^n$ is the k-th dimension of the label of the n-th sample, and $y_k^n$ is the k-th output of the network prediction for the n-th sample;
when the error function is propagated backwards, a calculation similar to the traditional BP algorithm is used, as shown in formula (6),
$$\delta^l = (W^{l+1})^T\,\delta^{l+1} \times f'(u^l), \qquad u^l = W^l x^{l-1} + b^l \qquad (6)$$
where $\delta^l$ is the error term of the current layer, $\delta^{l+1}$ is the error term of the layer above, $W^{l+1}$ is the mapping matrix of the layer above, $f'$ is the derivative of the activation function (the claim describes it as the inverse of the activation function, i.e. upsampling), $u^l$ is the output of layer $l$ before the activation function is applied, $x^{l-1}$ is the input to layer $l$, and $W^l$ and $b^l$ are the mapping weight matrix and bias of layer $l$.
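The recursion in formula (6) can be sketched for a small fully connected network. This is an illustration under stated assumptions: a sigmoid activation, element-wise multiplication for the × in formula (6), and the hypothetical helper names `squared_error` and `backprop_deltas`; none of these choices is specified by the claim.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def sigmoid_prime(u):
    s = sigmoid(u)
    return s * (1.0 - s)

def squared_error(t, y):
    """Squared-error cost for a single sample, with the 1/2 factor."""
    return 0.5 * np.sum((t - y) ** 2)

def backprop_deltas(weights, biases, x, t):
    """Return the per-layer error terms and the layer outputs for one sample."""
    # Forward pass: keep the pre-activations u^l and the outputs x^l.
    us, xs = [], [x]
    for W, b in zip(weights, biases):
        u = W @ xs[-1] + b
        us.append(u)
        xs.append(sigmoid(u))
    # Output layer: derivative of the squared error times f'(u).
    delta = (xs[-1] - t) * sigmoid_prime(us[-1])
    deltas = [delta]
    # Hidden layers: delta^l = (W^{l+1})^T delta^{l+1} * f'(u^l), element-wise.
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(us[l])
        deltas.insert(0, delta)
    return deltas, xs

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(2, 3))]
bs = [rng.normal(size=3), rng.normal(size=2)]
deltas, outputs = backprop_deltas(Ws, bs, x=rng.normal(size=2), t=np.array([0.0, 1.0]))
print([d.shape for d in deltas])   # [(3,), (2,)]
```

In this setting each error term equals the gradient of the cost with respect to that layer's bias, which is what a stochastic gradient descent step would use.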
3. The multi-task hierarchical image retrieval method based on a deep self-coding convolutional neural network as claimed in claim 1 or 2, characterized in that: in step 2), an image obtained from a video or a camera contains several target regions; the probability layer outputs the coordinates and score of each RPN suggestion box, the final coordinate boxes and scores are obtained through non-maximum suppression and threshold screening, and the region of the most accurate suggestion box is then selected again by the re-screening network; screening the suggestion boxes twice ensures accurate detection and identification of the target object.
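The threshold screening and non-maximum suppression step can be sketched as a standard greedy procedure. The function name `nms` and both threshold values are assumptions for illustration; the claim does not state them, and the re-screening network is not modelled here.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5, score_thr=0.05):
    """Greedy non-maximum suppression; returns indices of the kept boxes.

    boxes are (x1, y1, x2, y2); iou_thr and score_thr are assumed values.
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    idx = np.where(scores >= score_thr)[0]        # threshold screening
    idx = idx[np.argsort(-scores[idx])]           # highest score first
    keep = []
    while idx.size > 0:
        i, rest = idx[0], idx[1:]
        keep.append(int(i))
        # IoU of the current best box with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        idx = rest[iou <= iou_thr]                # drop boxes that overlap too much
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30], [0, 0, 5, 5]]
scores = [0.9, 0.8, 0.7, 0.01]
print(nms(boxes, scores))   # [0, 2]
```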
4. The multi-task hierarchical image retrieval method based on a deep self-coding convolutional neural network as claimed in claim 1, characterized in that: given a query image $I_q$ and a candidate pool U, the top k ranked images in U are determined using the region-aware semantic features selected by the re-screening network and the fully connected layer; the number of suggestion boxes contained in each image is variable and may be one or more; if the query image $I_q$ contains m suggestion boxes and an image $I_n \in U$ selected at random from the candidate pool contains m′ suggestion boxes, comparing all suggestion boxes by brute-force retrieval requires m × m′ comparisons, and the larger m × m′ is, the slower the whole retrieval system runs; to reduce the number of suggestion-box comparisons and improve the running efficiency of the program, the matrix h is used as the basis for determining the number of comparisons; the matrix h of the query image $I_q$ is denoted $h_q$, of dimension m × c, and the matrix h of a random image $I_n$ in the candidate pool is denoted $h_n$; the corresponding number of comparisons is given by formula (10):
the result satisfies num ≤ m × m′; by selectively comparing suggestion boxes according to formula (10), the number of comparisons and the running time of the retrieval system are greatly reduced, and the effect becomes more pronounced as the number of suggestion boxes in an image grows; the suggestion boxes to be compared are obtained as shown in formula (11):
where dis(·) denotes the modified cosine distance, which is expressed as formula (12):
the region-aware semantic features of the query image and of any image in the target dataset are denoted $f_q$ and $f_n$ respectively; substituting them into formula (11) gives the m × m′ matrix s of suggestion-box comparisons; the suggestion boxes generated for an image usually fall into two main categories, same-class suggestion boxes and non-same-class suggestion boxes, so the suggestion boxes selected for comparison yield two kinds of comparison results, and the numbers of suggestion boxes within a class and between classes cause differences in retrieval; these differences are eliminated using formula (13): taking the suggestion boxes of the query image as the reference, the maximum over the intra-class boxes of images $I_q$ and $I_n$ is taken, then the mean over the inter-class boxes, and finally the intra-class mean in image $I_q$; these operations reduce the differences as far as possible to ensure the accuracy of the result and give the similarity sim of images $I_q$ and $I_n$;
first, the matrix s is updated by taking the maximum over the intra-class boxes of images $I_q$ and $I_n$, according to the following formula:
in formula (13), i, j ∈ (m, m′), and the maximum value of the i-th row of matrix s is selected; finally, using formula (14), the mean over the inter-class boxes of $I_q$ and $I_n$ and the intra-class mean in $I_q$ are taken to obtain the similarity of the whole picture, where sim is obtained by the following formula:
in formula (14), i, j ∈ (m, m′); one term denotes the entry in row i, column j of the matrix of the query picture, and $s'^{j}$ denotes the j-th column of the matrix s′; the larger the value of sim, the higher the image similarity; each candidate picture in the candidate pool U is ranked in descending order of sim, which determines the top k pictures.
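The idea of comparing only selected pairs of suggestion boxes and then ranking the candidate pool can be sketched as below. Formulas (10) to (14) are not reproduced here, so every concrete choice is an assumption made for illustration: `h` is treated as a one-hot class matrix (one row per box), a pair is compared only when both boxes share a class, plain cosine similarity stands in for the modified cosine distance, and the image similarity is the mean over query boxes of each row maximum. The names `selective_similarity` and `rank_top_k` are hypothetical.

```python
import numpy as np

def selective_similarity(f_q, h_q, f_n, h_n):
    """Return (sim, num): image similarity and number of box pairs compared.

    f_q: (m, d) and f_n: (m', d) box features; h_q: (m, c) and h_n: (m', c)
    one-hot class matrices. All modelling choices here are assumptions.
    """
    mask = (h_q @ h_n.T) > 0                     # (m, m') pairs sharing a class
    num = int(mask.sum())                        # never exceeds m * m'
    fq = f_q / np.linalg.norm(f_q, axis=1, keepdims=True)
    fn = f_n / np.linalg.norm(f_n, axis=1, keepdims=True)
    s = np.zeros(mask.shape)
    rows, cols = np.nonzero(mask)
    # Cosine similarity is computed for the selected pairs only.
    s[rows, cols] = np.einsum("ij,ij->i", fq[rows], fn[cols])
    sim = float(s.max(axis=1).mean())            # row maximum, then mean over query boxes
    return sim, num

def rank_top_k(f_q, h_q, pool, k):
    """Indices of the k pool images with the largest similarity, best first."""
    sims = [selective_similarity(f_q, h_q, f, h)[0] for f, h in pool]
    order = np.argsort(sims)[::-1][:k]
    return [int(i) for i in order]

f_q = np.array([[1.0, 0.0], [0.0, 1.0]])
h_q = np.array([[1, 0], [0, 1]])
pool = [
    (np.array([[1.0, 0.0]]), np.array([[1, 0]])),   # one matching box
    (np.array([[0.0, 1.0]]), np.array([[1, 0]])),   # same class, orthogonal feature
    (f_q.copy(), h_q.copy()),                       # identical to the query
]
print(rank_top_k(f_q, h_q, pool, k=2))   # [2, 0]
```

With two query boxes and two candidate boxes of different classes, only 2 of the 4 possible pairs are compared in this sketch, which mirrors the claim's point that num ≤ m × m′.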
5. The method as claimed in claim 4, characterized in that the method further comprises: step 5) evaluation of the image retrieval accuracy, performed using a ranking-based criterion; for a given search image $I_q$ and a similarity measure, each dataset image has a rank; here the top k ranked images are evaluated to express the retrieval precision of the search image $I_q$, as given by formula (15);
$$\mathrm{Precision}@k = \frac{\sum_{i=1}^{k} Rel(i)}{k} \qquad (15)$$
where Rel(i) denotes the ground-truth relevance between the search image $I_q$ and the i-th ranked image, k denotes the number of ranked images considered, and Precision@k is the retrieval precision; when the ground-truth relevance is computed, only the classification label is considered, i.e. Rel(i) ∈ {0, 1}: Rel(i) = 1 if the search image and the i-th ranked image have the same label, and Rel(i) = 0 otherwise; the retrieval precision is obtained by traversing the top k ranked images in the candidate pool P.
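The precision measure described here reduces to a few lines of code. The function name `precision_at_k` and the use of string labels are assumptions for the example.

```python
def precision_at_k(query_label, ranked_labels, k):
    """Fraction of the top-k ranked images whose label matches the query label."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Rel(i) is 1 when the i-th ranked image shares the query's label, else 0.
    rel = [1 if label == query_label else 0 for label in ranked_labels[:k]]
    return sum(rel) / k

ranked = ["cat", "dog", "cat", "cat", "bird"]
print(precision_at_k("cat", ranked, k=4))   # 0.75
```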
Priority Applications (1)
Application Number  Priority Date  Filing Date  Title
CN201711057490.9A CN107679250B (en)  2017-11-01  2017-11-01  Multitask layered image retrieval method based on deep self-coding convolutional neural network
Publications (2)
Publication Number  Publication Date
CN107679250A (en)  2018-02-09
CN107679250B (en)  2020-12-01
Family
ID=61144118
Family Applications (1)
Application Number  Title  Priority Date  Filing Date
CN201711057490.9A (Active) CN107679250B (en)  Multitask layered image retrieval method based on deep self-coding convolutional neural network  2017-11-01  2017-11-01
Country Status (1)
Country  Link
CN (1)  CN107679250B (en)
Legal Events
Date  Code  Title  Description
PB01  Publication
SE01  Entry into force of request for substantive examination
GR01  Patent grant