CN107679250B - Multi-task layered image retrieval method based on deep self-coding convolutional neural network - Google Patents

Multi-task layered image retrieval method based on deep self-coding convolutional neural network

Info

Publication number
CN107679250B
Authority
CN
China
Prior art keywords
image
matrix
retrieval
region
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711057490.9A
Other languages
Chinese (zh)
Other versions
CN107679250A (en)
Inventor
何霞
汤一平
王丽冉
陈朋
袁公萍
金宇杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201711057490.9A
Publication of CN107679250A
Application granted
Publication of CN107679250B


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING; COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 — Information retrieval of still image data
    • G06F16/58 — Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 — Retrieval characterised by using metadata automatically derived from the content
    • G06K — RECOGNITION OF DATA; PRESENTATION OF DATA; RECORD CARRIERS; HANDLING RECORD CARRIERS
    • G06K9/00 — Methods or arrangements for reading or recognising printed or written characters or for recognising patterns, e.g. fingerprints
    • G06K9/20 — Image acquisition
    • G06K9/32 — Aligning or centering of the image pick-up or image-field
    • G06K9/3233 — Determination of region of interest
    • G06N — COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 — Computer systems based on biological models
    • G06N3/02 — Computer systems based on biological models using neural network models
    • G06N3/04 — Architectures, e.g. interconnection topology
    • G06N3/0454 — Architectures using a combination of multiple neural nets

Abstract

The invention discloses a multi-task layered image retrieval method based on a deep self-coding convolutional neural network. The method mainly comprises: a multi-task end-to-end convolutional neural network for deep learning, training and recognition; a fast visual segmentation, detection and localization method using an RPN-based secondary region-of-interest screening module; coarse retrieval with a full-image sparse hash code; accurate comparison retrieval based on maximum-response region-aware semantic features and the matrix h; and an algorithm for selective comparison of regions of interest. The invention can realize end-to-end training, automatically select regions of interest of higher quality, effectively improve the automation and intelligence of searching images by images, and meet the image retrieval demands of the big-data era with less storage space and higher retrieval speed.

Description

Multi-task layered image retrieval method based on deep self-coding convolutional neural network
Technical Field
The invention relates to the application of computer vision, pattern recognition, information retrieval, multi-task learning, similarity measurement, deep self-coding convolutional neural networks and deep learning to the field of image retrieval, and in particular to a multi-task layered image retrieval method based on a deep self-coding convolutional neural network.
Background
Image retrieval aims to provide users with a search technology for graphical image information: by analyzing the content of an input query image, similar images are retrieved. It draws on multiple disciplines, including image processing, computer vision, multi-task learning, pattern recognition and cognitive psychology; the core technologies are the acquisition of image representations and similarity measurement. Against the background of the big-data era, it is widely applied in fields such as image retrieval, video investigation, and internet shopping search engines.
For content-based image retrieval, traditional methods are generally based on the color information, shape features and texture features of images. These belong to the image retrieval technology of the pre-deep-learning era and can generally be divided into three steps: 1) extract a feature representation of the target image; the most common choices are SIFT descriptors, color or geometric invariant moments, hash functions and Fisher vector descriptions. 2) Re-encode the image feature representation into a look-up table over the massive image set; a high-resolution target image can be down-sampled before encoding to obtain its feature representation, which reduces the computational burden of the search process and speeds up image comparison. 3) Similarity measurement: compute the similarity between the query image and the target dataset using the feature representations obtained in step 2; set an image screening threshold according to the robustness required of the final query result, then retain the top k images with the highest similarity values; finally, screen out the most similar pictures with a feature matching algorithm.
An image feature representation connects the pixel information of an image with human perception of objects, and image features are the precondition of retrieval. Traditional content-based retrieval requires manually extracted features, which is time-consuming and labor-intensive and suffers serious problems in retrieval accuracy and efficiency: these methods typically use low-level features such as color, texture and contour as the basic features of an image, and the retrieval result is obtained by matching images in the target database via similarity computed on the manually extracted features. In the current big-data era the target database is extremely large, so timeliness in the retrieval process is critically important and is a standard for measuring the quality of an image retrieval system. The existing content-based image retrieval technology has a low level of intelligence, lacks image-feature self-encoding capability, struggles to obtain retrieval results accurately and quickly, and cannot meet the technical requirements of image retrieval in the big-data era.
QBIC (Query By Image Content) proposed by IBM, TinEye developed by Idée of Canada, Photobook developed by the MIT Media Laboratory, Virage developed by the Virage company, NETRA developed by the ADL at the University of California, and VisualSeek and WebSeek developed by Columbia University are image retrieval technologies belonging to the pre-deep-learning era.
Deep learning, developed in recent years, builds deep self-coding convolutional neural networks that simulate the human brain's learning mechanism to analyze, interpret and represent picture data. The LeNet model is the most representative network model among early deep neural networks, while AlexNet, VGG and ResNet have deeper structures and stronger image representation capability. Deep learning reveals the distributed nature of the target database by mapping low-level features to more abstract target image classes or feature vectors.
Combining hash methods with deep learning is a new trend for improving retrieval precision and efficiency. Hash algorithms can also be used to measure the similarity between images and mainly comprise unsupervised and supervised methods. Unsupervised methods are typically used on unlabeled data and learn a set of hash codes from the geometric structure of the data; locality-sensitive hashing is widely used here, taking maximization of the collision probability of similar data as its learning objective to obtain similar hash codes. Compared with unsupervised learning, supervised hashing can obtain a more discriminative set of hash codes. A hash method based on a deep self-coding convolutional neural network can automatically explore the deep features of an image and is an ideal image retrieval technology.
Chinese patent application No. 201611127943.6 discloses an ultra-low-complexity image retrieval method based on order-preserving hashing. A subset of images is randomly selected from the image database as a training set and the corresponding features are extracted; the dimensionality of the original image features is reduced by nonlinear principal component analysis; a series of support points is then obtained with the K-means clustering algorithm as the basis for subsequent hash-function learning; next, the whole image database is hash-encoded through iterative optimization of the corresponding hash functions; finally, the Hamming distance measures the similarity between images. This technology still belongs to traditional content-based image retrieval.
Chinese patent application No. 201610877794.9 discloses an image retrieval method based on interest targets, comprising the following steps: 1) analyze the user's interest target with the HS saliency detection algorithm and segment it with the SaliencyCut algorithm; 2) extract HSV color features, SIFT local features and CNN semantic features from the user's interest target; 3) match the extracted features of the interest targets against the images in the database by similarity and sort by similarity to obtain a retrieval result based on the interest targets. This invention uses multi-dimensional image features to meet the image retrieval precision requirements of the big-data era, but the algorithm demands considerable system memory, the comparison of multi-dimensional features slows the retrieval system down, and it cannot be used for large-scale image database retrieval.
Chinese patent application No. 201510475003.51 discloses a method, device and system for image retrieval, image information acquisition and image identification. The system first extracts local features of the retrieval image and computes their feature values with a pre-trained deep self-coding convolutional neural network, then matches these feature values against those of the registered images in an image retrieval database, and finally selects registered images satisfying preset conditions as the retrieval result according to the matching result. Although the system selects local features through feature points, the extracted local features cannot be guaranteed to segment the search objects accurately, and the features of individual objects cannot be extracted independently from a multi-target, multi-label image.
Chinese patent application No. 201710035674.9 discloses a method and system for feature-point matching of very large-scale images. First, image nearest-neighbor search is performed to obtain image matching pairs; taking images as nodes and forming edges between neighboring images, an undirected graph is built and sorted breadth-first to obtain ordered images and image pairs; the feature information of the images is rearranged according to the sorting result and stored block-wise in binary files; the binary files are then read in order, feature matching is performed sequentially on the sorted image pairs, and feature information that is no longer needed is released promptly; feature information is read and matched iteratively until all image pairs are matched. The algorithm can meet the retrieval requirements of large-scale images, but it is based on nearest-neighbor search and suffers from low retrieval precision.
In summary, image search techniques combining deep self-coding convolutional neural networks with hash methods face the following problems: 1) how to extract a sparse code of the whole image from a multi-target, multi-label image via multi-task learning while accurately segmenting regions of interest and extracting their region-aware semantic features; 2) how to build a hierarchical deep search on the extracted features to obtain more accurate retrieval results; 3) how to combine the recognition precision and detection accuracy of the deep self-coding convolutional neural network with the retrieval efficiency of the retrieval system; 4) how to design a framework that realizes a genuinely end-to-end image retrieval method with hierarchical deep search using a CNN; 5) how to reduce the large storage consumption and low retrieval speed of image retrieval systems against the background of the big-data era.
Disclosure of Invention
Addressing the problems that existing image search technology has a low level of automation and intelligence, lacks deep learning, struggles to obtain accurate retrieval results, consumes large storage space, retrieves slowly, and thus cannot meet the image retrieval requirements of the big-data era, the invention provides an end-to-end hierarchical deep-search image retrieval method based on a deep self-coding convolutional neural network.
To implement the above, several core problems must be solved: 1) for the difficulty of image feature extraction, the strong feature-characterization capability of the deep self-coding convolutional neural network is used to realize adaptive feature extraction; 2) for the slow retrieval of large-scale images, a multi-task layered method is designed so that the query image can be compared rapidly against the images in the database; 3) for semantic retrieval of multi-target image scenes, a secondary region-of-interest screening algorithm is designed to detect and segment multi-target images; 4) exploiting the advantages of end-to-end deep networks, an end-to-end deep self-coding convolutional neural network is designed that fuses detection, recognition and feature extraction into one network.
To realize a large-scale image retrieval method with an end-to-end multi-task deep self-coding convolutional neural network, the invention comprises: a multi-task end-to-end convolutional neural network for deep learning, training and recognition; a fast visual segmentation, detection and localization method using an RPN-based secondary region-of-interest screening model; coarse retrieval with a full-image sparse hash code; accurate comparison retrieval based on maximum-response region-aware semantic features and the matrix h; and an algorithm for selective comparison of regions of interest.
The invention provides a multi-task layered image retrieval method based on a deep self-coding convolutional neural network, which adopts the technical scheme that the method comprises the following steps:
1) construct a multi-task end-to-end convolutional neural network for deep learning, training and recognition;
the convolutional neural network is divided into three modules: a shared convolution module, a secondary region-of-interest screening module, and a region-of-interest coordinate regression and identification module, each being a deep convolutional neural network formed by alternating convolution layers, activation layers and down-sampling layers; the input image undergoes logistic regression and layer-by-layer mapping in the network to obtain each layer's different representation of the image, realizing a depth representation of the region of interest;
the shared convolution module: the shared network consists of 5 convolution modules, where the deepest layers of conv2_x through conv5_x take {4², 8², 16², 16²} as the output sizes of their feature maps, and conv1 contains only a single convolutional layer serving as the input layer;
the secondary region-of-interest screening module and the region-of-interest coordinate regression and identification module: the RPN takes an image of any scale as input and outputs a set of rectangular target suggestion boxes, each comprising 4 position-coordinate variables and a score. To generate the region suggestion boxes, the input image first passes through the shared convolution layers to produce a feature map, on which a multi-scale convolution operation is then performed, realized as follows: at each sliding-window position, 3 scales and 3 aspect ratios are used, centered on the current sliding window, so that 9 candidate regions of different scales are obtained by mapping back onto the original image; for a shared convolution feature map of size w × h there are w × h × 9 candidate regions in total. Finally, the classification layer outputs scores for w × h × 9 × 2 candidate regions, i.e. the estimated probability of each region being target/non-target, and the regression layer outputs w × h × 9 × 4 parameters, i.e. the coordinate parameters of the candidate regions;
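As an illustration of the sliding-window anchor scheme just described, the following sketch enumerates the w × h × 9 candidate regions. The patent fixes only the counts (3 scales × 3 aspect ratios, 9 anchors per position); the concrete scale values, ratios and feature stride below are assumptions for demonstration only.

```python
import numpy as np

def generate_anchors(feat_w, feat_h, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Map every sliding-window position of the w x h feature map back to the
    original image and emit 9 candidate boxes (cx, cy, w, h) per position."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # window centre
            for s in scales:
                for r in ratios:
                    anchors.append((cx, cy, s * np.sqrt(r), s / np.sqrt(r)))
    return np.array(anchors)  # shape: (feat_w * feat_h * 9, 4)

boxes = generate_anchors(60, 40)        # w x h x 9 = 21600 candidate regions
assert boxes.shape == (60 * 40 * 9, 4)
```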
When the RPN network is trained, each candidate region is assigned a binary label marking whether it is a target. The assignment is as follows: 1) a positive label is given to the candidate region whose IoU (Intersection-over-Union ratio) overlap with a real target region (GT) is the highest; 2) a positive label is also given to any candidate region whose IoU overlap with some GT bounding box exceeds 0.7, while a negative label is given to candidate regions whose IoU ratio is below 0.3 for all GT bounding boxes; 3) candidate regions falling between the two contribute nothing to training.
With these definitions, the objective function is minimized, following the multitasking penalty in the Faster RCNN. The loss function for an image is defined as:
where i is an index of the ith candidate region,is waiting forThe selected region is the probability of the ith class. If the label of the candidate region is positive,is 1, if the candidate area label is 0,is 0; t is tiIs a vector, representing the 4 parameterized coordinates of the predicted bounding box,is the coordinate vector of the corresponding GT bounding box. N is a radical ofclsAnd NregThe normalized coefficients are respectively a classification loss function and a position regression loss function, and lambda is a weight parameter between the two. Classification loss function LclsIs the log loss of two classes (target vs non-target):
The position-regression loss $L_{reg}$ is defined by:

$$L_{reg}(t_i,t_i^*) = R(t_i - t_i^*) \qquad (3)$$

where R is the robust smooth-L1 loss function:

$$R(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \qquad (4)$$
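For concreteness, a minimal numpy sketch of the multi-task loss of formulas (1)–(4) follows; the value of λ and the helper names are illustrative assumptions, not part of the patent.

```python
import numpy as np

def smooth_l1(x):
    """Robust loss R of formula (4)."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def rpn_loss(p, p_star, t, t_star, lambda_=10.0):
    """p: (N,) predicted target probabilities; p_star: (N,) binary labels;
    t, t_star: (N, 4) predicted / GT parameterized box coordinates."""
    eps = 1e-7                                                   # numerical safety only
    l_cls = -np.log(p_star * p + (1 - p_star) * (1 - p) + eps)   # formula (2)
    l_reg = smooth_l1(t - t_star).sum(axis=1)                    # formulas (3)-(4)
    n_cls, n_reg = len(p), max(int(p_star.sum()), 1)
    # formula (1): normalized classification term + weighted regression term
    return l_cls.sum() / n_cls + lambda_ * (p_star * l_reg).sum() / n_reg
```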
According to the image features of the I region-of-interest suggestion boxes output by the RPN, the I suggestion boxes are first sent to a primary screening layer that removes 2/3 of the background boxes, raising the proportion of positive samples and effectively reducing the generation of background regions; the image features of the primarily screened suggestion boxes are then processed by convolution and ReLU to obtain I 4096-dimensional feature maps, which are sent to the classification layer and the window regression layer respectively; finally, to obtain the maximum-response region-aware semantic features, the I 4096-dimensional feature maps are fed into the secondary screening network, which re-selects the region-aware semantic features of the most accurate suggestion boxes;
the region-of-interest coordinate regression and identification module: convolutional neural network training is a back-propagation process similar to the BP algorithm; the error function is back-propagated and the convolution parameters and biases are optimized with stochastic gradient descent until the network converges or the maximum number of iterations is reached;
Back-propagation requires comparison against labelled training samples. For multi-class recognition over c classes and N training samples, a squared-error cost function is adopted, and the error of the network's final output is computed by formula (5):

$$E_N = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{c}\left(t_k^n - y_k^n\right)^2 \qquad (5)$$

where $E_N$ is the squared-error cost function, $t_k^n$ is the k-th dimension of the label of the n-th sample, and $y_k^n$ is the corresponding k-th output of the network's prediction for the n-th sample;
When the error function is back-propagated, a computation similar to the traditional BP algorithm is adopted, as in formula (6):

$$\delta^{l} = \left(W^{l+1}\right)^{T}\delta^{l+1} \circ f'\left(u^{l}\right), \qquad u^{l} = W^{l}x^{l-1} + b^{l} \qquad (6)$$

where $\delta^{l}$ is the error function of the current layer, $\delta^{l+1}$ is the error function of the layer above, $W^{l+1}$ is that layer's mapping matrix, f′ is the derivative of the activation function (realized as upsampling for a pooling layer), $u^{l}$ is the layer's output before the activation function, $x^{l-1}$ is the layer's input, and $W^{l}$ is the layer's weight-mapping matrix;
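A minimal sketch of one application of formula (6) follows, assuming for illustration a fully connected layer with a sigmoid activation (the patent does not fix the activation function):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def backprop_delta(W_next, delta_next, u_l):
    """delta^l = (W^{l+1})^T delta^{l+1} o f'(u^l) -- elementwise product."""
    s = sigmoid(u_l)
    return (W_next.T @ delta_next) * (s * (1.0 - s))  # sigmoid derivative
```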
2) fast visual segmentation detection and positioning with the RPN-based secondary region-of-interest screening model, performed as follows:
an image obtained from a video or camera contains multiple target regions; a probability layer applies non-maximum suppression and threshold screening to the coordinate and score outputs of each RPN suggestion box to obtain the final coordinate boxes and scores; finally, the region of the most accurate suggestion box is selected again by the re-screening network, and the two rounds of suggestion-box screening guarantee accurate detection and identification of the target object;
3) the rough retrieval of the full-image sparse hash code is carried out, and the process is as follows:
A hash method aims to represent samples as strings of fixed-length binary codes and involves two aspects: 1) for two images sharing the same semantic concept, the Hamming distance between their binary codes should be as small as possible, whereas if the images come from different classes the Hamming distance should be large, meaning the binary codes are discriminative in Hamming space; 2) the hash method can be viewed as a black box whose input is typically a low- or mid-level feature, such as a GIST feature, and whose output is a binary code that maintains semantic properties in Hamming space. From the viewpoint of feature representation, the inherent property of such binary codes is a high-level representation built on low- or mid-level features, returning semantics more relevant to the query image;
Combining the deep convolutional network with the hash method, the coarse-retrieval procedure is as follows:
First, assume the target dataset can be divided into c category labels. Passing the target image through the region-of-interest primary screening network yields, for each region of interest, a target-category probability vector $p = (x_1, x_2, \ldots, x_c)$, $x_c \in (0,1)$; to promote binary output from the sparse-coding module, the probability vector p is binarized with a piecewise function. If the target image I contains m regions of interest, m p-vectors are generated and fused into $P = (p_1, p_2, \ldots, p_m)$, the global probability matrix P having dimension m×c; P is fed into the binarization function to obtain the matrix h, the binarization process being shown in formula (7);
where i, j index the m rows and c columns. Second, to speed up image retrieval, vector fusion is applied once more and the matrix h is compressed into a 1×c matrix H representing the global features of the target image. The whole process is shown in formulas (8) and (9): first the matrix h is transposed and multiplied with itself to obtain a c×c transition matrix h′, and then the diagonal elements of h′ are selected as the final global-feature binary hash code of target image I, i.e. the matrix H:

$$h' = h^{T}h \qquad (8)$$
$$H = \operatorname{diag}(h') \qquad (9)$$
The sparse hash matrix H is a 1×c-dimensional vector with $H_i \in \{0,1\}$, $i \in \{1,\ldots,c\}$, where $(h^{T}h)_{ij}$ denotes row i, column j of the matrix $h^{T}h$. Coarse retrieval over the target dataset uses H; using this low-dimensional vector effectively shortens retrieval time while improving retrieval precision.
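A minimal sketch of formulas (7)–(9) follows; the 0.5 binarization threshold of formula (7) and the final clamp to {0,1} are assumptions, since the patent states the piecewise function without its parameters.

```python
import numpy as np

def sparse_hash(P, threshold=0.5):
    """P: (m, c) global probability matrix, one row per region of interest."""
    h = (P >= threshold).astype(np.uint8)  # formula (7): binarize p-vectors
    h_prime = h.T @ h                      # formula (8): c x c transition matrix
    H = np.diag(h_prime)                   # formula (9): take the diagonal
    return (H > 0).astype(np.uint8)        # clamp so that H_i stays in {0, 1}

P = np.array([[0.9, 0.1, 0.7],             # m = 2 regions, c = 3 classes
              [0.2, 0.8, 0.1]])
print(sparse_hash(P))                      # -> [1 1 1]
```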
A sparse hash matrix of the image and deep-perception semantic information of the image's target regions are learned through the deep self-coding convolutional neural network; based on this viewpoint, a coarse-to-fine retrieval strategy finally realizes fast search and accurate return for the image retrieval system. First a group of images with similar attributes is retrieved through the sparse hash matrix, and then the top k pictures most similar to the target image are selectively retrieved from that coarsely retrieved group through the deep-perception semantic information of the images' target regions.
The retrieval process is as follows: for a given image I, the output of the image Out is first extractedj(H) Sparse hash matrix H
Assuming that the target dataset consists of n images, it can be expressed as ═ I1,I2,…,InGet the sparse hash code of the target data setH={H1,H2,…,Hn},HiE {0,1} represents; further assume that a given search image IqAnd retrieving sparse hash codes H of imagesqUsing a cosine distance measure HqAnd HiHSimilarity between them, the cosine value is greater than threshold value THThe image is placed in the candidate pool U,the rough search result is used as a candidate image for the subsequent fine search process
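The coarse filter can be sketched as follows; the threshold value $T_H$ and the helper names are assumptions for illustration.

```python
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

def coarse_retrieve(H_q, dataset_hashes, T_H=0.7):
    """Return indices of images whose hash is similar enough to H_q."""
    return [i for i, H_i in enumerate(dataset_hashes)
            if cosine(H_q, H_i) > T_H]

hashes = [np.array([1, 0, 1]), np.array([0, 1, 0]), np.array([1, 1, 1])]
U = coarse_retrieve(np.array([1, 0, 1]), hashes)  # candidate pool U
print(U)                                          # -> [0, 2]
```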
4) accurate comparison retrieval based on the maximum-response region-aware semantic features and the matrix h;
Fine retrieval: given a query image $I_q$ and the candidate pool U, the top-k ranking of the images in U is determined using the region-aware semantic features selected by the re-screening network and the fully connected layer; the number of suggestion boxes contained in each image is variable and may be one or more. If the query image $I_q$ contains m suggestion boxes and an image $I_n \in U$ selected at random from the candidate pool contains m′ suggestion boxes, then comparing all suggestion boxes by brute-force retrieval requires m × m′ comparisons, and the larger m × m′ is, the more the running speed of the whole retrieval system is degraded. To address this problem, and to reduce the number of suggestion-box comparisons and improve the running efficiency of the program, the matrix h is used as the basis for deciding the comparisons: the h matrix of the query image $I_q$ is denoted $h_q$, an m×c-dimensional matrix, and that of a random image $I_n$ in the candidate pool is $h_n$; the corresponding number of comparisons is then given by formula (10):
The result num ≤ m × m′; selectively comparing suggestion boxes by formula (10) greatly reduces the number of comparisons and the running time of the retrieval system, and the effect is more pronounced the more suggestion boxes the images contain. The suggestion boxes that need comparison are obtained as shown in formula (11):
where dis (·) represents a modified cosine distance formula, which is expressed as formula (12):
The region-aware semantic features of the query image and of any image in the target dataset are denoted $f_q$ and $f_n$ respectively; substituting them into formula (11) yields the m × m′-dimensional comparison suggestion-box matrix s. The suggestion boxes generated for an image usually fall into two broad categories, same-class and different-class, so the suggestion boxes selected for comparison produce two kinds of comparison results, causing retrieval differences between the within-class and between-class numbers of suggestion boxes. These differences are eliminated with formula (13): taking the suggestion boxes of the query image as the reference, the maximum over the same-class boxes of images $I_q$ and $I_n$ and the mean over the between-class boxes are taken, and finally the within-class mean is taken again in image $I_q$; these operations reduce the differences as far as possible to guarantee the accuracy of the result, yielding the similarity sim of images $I_q$ and $I_n$;
First, the matrix s must be updated: taking the maximum value of the same-class boxes of images $I_q$ and $I_n$, s is updated according to formula (13):

In formula (13), i, j range over (m, m′), and the maximum value in the i-th row of matrix s′ is selected. Finally, using formula (14), the mean of the between-class boxes of images $I_q$ and $I_n$ is taken and averaged again within the classes of $I_q$, finally giving the similarity of the whole picture, with sim obtained as:

In formula (14), i, j range over (m, m′), $s'_{ij}$ denotes row i, column j of the query picture's matrix s′, and $s'_j$ denotes the j-th column of s′. The larger the value of the similarity formula sim, the higher the image similarity; the candidate pictures in the candidate pool U are ranked in descending order of sim, which determines the ranking of the top k pictures.
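The selective comparison and the max-then-mean aggregation described around formulas (10)–(14) can be sketched as follows; here dis() is a plain cosine similarity standing in for the patent's modified cosine distance of formula (12), and all helper names are assumptions.

```python
import numpy as np

def dis(a, b):
    """Stand-in similarity; the patent uses a modified cosine distance (12)."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def fine_similarity(f_q, f_n, h_q, h_n):
    """f_q: (m, d) query box features; f_n: (m', d) candidate box features;
    h_q: (m, c) and h_n: (m', c): binary class matrices of the boxes."""
    share = (h_q @ h_n.T) > 0           # compare only boxes sharing a class
    s = np.full(share.shape, -np.inf)   # num <= m * m' comparisons, cf. (10)
    for i, j in zip(*np.nonzero(share)):
        s[i, j] = dis(f_q[i], f_n[j])
    best = s.max(axis=1)                # max over same-class boxes, cf. (13)
    best = best[np.isfinite(best)]
    return float(best.mean()) if best.size else 0.0  # within-class mean -> sim
```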
Further, the method comprises: step 5) evaluation of image retrieval precision, performed with a ranking-based criterion. For a given search image $I_q$ and a similarity measure, each dataset image receives one ranking; the retrieval precision for search image $I_q$ is expressed by evaluating the top k ranked images, as in formula (15):

$$\text{Precision@}k = \frac{\sum_{i=1}^{k}\operatorname{Rel}(i)}{k} \qquad (15)$$

where Rel(i) denotes the true correlation between the search image $I_q$ and the i-th ranked image, and k denotes the number of ranked images considered by the retrieval precision Precision@k. When computing the true correlation only the classification label is considered, with Rel(i) ∈ {0,1}: Rel(i) = 1 if the search image and the i-th ranked image share the same label, and Rel(i) = 0 otherwise. The retrieval precision is obtained by traversing the top k ranked images in the candidate pool U.
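A direct sketch of formula (15), under the stated same-label definition of Rel(i):

```python
def precision_at_k(query_label, ranked_labels, k):
    """Formula (15): fraction of the top-k ranked images sharing the label."""
    rel = [1 if lbl == query_label else 0 for lbl in ranked_labels[:k]]
    return sum(rel) / k

print(precision_at_k("cat", ["cat", "dog", "cat", "cat"], k=4))  # 0.75
```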
The overall flow of the multi-task layered image retrieval method based on the deep self-coding convolutional neural network is summarized briefly as follows: 1) feed the image into the deep self-coding convolutional neural network, apply logistic regression to the feature map, and segment and predict the position and category of the regions of interest in the query image; 2) extract the image's sparse hash matrix and the perceptual semantic features of the regions of interest with the deep self-coding convolutional neural network; 3) coarsely retrieve the images in the database with the sparse hash matrix to obtain candidate images with similar attributes and place them into the candidate pool U; 4) on the basis of the coarse retrieval, i.e. within the candidate pool U, further selectively compare and sort the suggestion boxes using the modified cosine distance to obtain the top-k images.
The invention has the following beneficial effects:
1) the method for searching the multitask layered image based on the depth self-coding convolutional neural network is provided;
2) the strong characteristic representation capability of the deep convolutional neural network is utilized to realize the self-adaptive extraction of the characteristics;
3) the image retrieval method adopting the layered depth search can meet the search requirement of large-scale image data;
4) the design balances universality and specificity: in terms of universality it meets various users' requirements for retrieval speed, precision and practicality; in terms of specificity, a user can build a dedicated dataset for a particular need and fine-tune the network parameters to obtain an image-by-image search system oriented to the specific application.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a schematic diagram of a shared network.
Fig. 3 is an RPN network development diagram.
Fig. 4 is a flow chart of fast visual segmentation detection and positioning of a secondary region of interest screening model based on an RPN network.
Fig. 5 is a schematic diagram of a rough search process.
Fig. 6 is a schematic overall flow chart of a multitask hierarchical image retrieval method based on a deep self-coding convolutional neural network.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
Referring to figs. 1 to 6, a multi-task layered image retrieval method based on a deep self-coding convolutional neural network is disclosed. As shown in fig. 6, the input retrieval image first passes through the shared module of the convolutional neural network and is then sent to the region-of-interest module to screen out the rough positions of regions of interest in the image; it is then sent to the fast visual segmentation, detection and localization module based on the secondary region-of-interest screening model to obtain the accurate positions of targets in the image. Through deep learning, the full-image sparse hash code of the image is obtained for coarse retrieval, and the maximum-response region-aware semantic features together with the matrix h are obtained for accurate comparison retrieval.
Image retrieval is performed by deep learning. A rough target candidate region is obtained by applying logistic regression to the image in the shared network module and the region-of-interest primary screening module; the quality of this step directly affects the speed of the system, and obtaining a rough target region by primary screening reduces the computational complexity of the second screening stage, guaranteeing the reliability and adaptability of the recognition system. The invention comprises the following steps:
1) constructing a multi-task end-to-end convolution neural network for deep learning and training identification;
the convolutional neural network is divided into three modules: a shared convolution module, a secondary region-of-interest screening module, and a region-of-interest coordinate regression and identification module, each being a deep convolutional neural network formed by alternating convolution layers, activation layers and down-sampling layers; the input image undergoes logistic regression and layer-by-layer mapping in the network to obtain each layer's different representation of the image, realizing a depth representation of the region of interest;
the shared convolution module: the shared network consists of 5 convolution modules, where the deepest layers of conv2_x through conv5_x take {4², 8², 16², 16²} as the output sizes of their feature maps, and conv1 contains only a single convolutional layer serving as the input layer, as shown in fig. 2; this depth structure effectively reduces computation time and creates invariance in the spatial structure. The input image is mapped layer by layer in the network, finally yielding each layer's different representation of the image and realizing a depth representation of the image, where the mapping is directly determined by the convolution kernels and the down-sampling scheme. A convolutional neural network is essentially a deep-mapping network structure: the input signal is mapped layer by layer in the network, continuously decomposed and represented, finally forming a multi-layer representation of the object target. Its main characteristic is that object features no longer need to be manually selected and constructed; instead, a deep representation of the object target is obtained through automatic machine learning;
the primary region of interest screening module: the RPN network takes an image of any scale as an input and outputs a set of rectangular target suggestion boxes, wherein each box comprises 4 position coordinate variables and a score. In order to generate the region suggestion box, firstly, an input image is subjected to convolution sharing layer to generate a feature map, and then, multi-scale convolution operation is carried out on the feature map, which is specifically realized as follows: using 3 scales and 3 length-width ratios at the position of each sliding window, centering on the center of the current sliding window, and corresponding to one scale and length-width ratio, then obtaining 9 candidate regions with different scales by mapping on the original image, for example, for a shared convolution feature map with a size of w × h, there are w × h × 9 candidate regions in total. Finally, the classification layer outputs scores of w × h × 9 × 2 candidate regions, i.e., estimated probabilities of each region being a target/non-target, and the regression layer outputs w × h × 9 × 4 parameters, i.e., coordinate parameters of the candidate regions, in a specific form as shown in fig. 3.
When the RPN network is trained, each candidate region is assigned a binary label marking whether it is a target. The specific assignment is as follows: 1) a positive label is given to the candidate region whose IoU (Intersection-over-Union ratio) overlap with a real target region (GT) is the highest; 2) a positive label is also given to any candidate region whose IoU overlap with some GT bounding box exceeds 0.7, while a negative label is given to candidate regions whose IoU ratio is below 0.3 for all GT bounding boxes; 3) candidate regions falling between the two contribute nothing to training.
With these definitions, the objective function is minimized following the multi-task loss of Faster R-CNN. The loss function for an image is defined as:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i,p_i^*) + \lambda\frac{1}{N_{reg}}\sum_i p_i^* L_{reg}(t_i,t_i^*) \qquad (1)$$

where i is the index of a candidate region and $p_i$ is the probability that the candidate region is of the i-th class. If the label of the candidate region is positive, $p_i^*$ is 1; if the label is negative, $p_i^*$ is 0. $t_i$ is a vector representing the 4 parameterized coordinates of the predicted bounding box, and $t_i^*$ is the coordinate vector of the corresponding GT bounding box. $N_{cls}$ and $N_{reg}$ are the normalization coefficients of the classification loss and the position-regression loss respectively, and $\lambda$ is a weight parameter between the two. The classification loss $L_{cls}$ is the log loss over the two classes (target vs. non-target):

$$L_{cls}(p_i,p_i^*) = -\log\left[p_i^* p_i + (1-p_i^*)(1-p_i)\right] \qquad (2)$$
The position-regression loss $L_{reg}$ is defined by:

$$L_{reg}(t_i,t_i^*) = R(t_i - t_i^*) \qquad (3)$$

where R is the robust smooth-L1 loss function:

$$R(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \qquad (4)$$
According to the image features of the I region-of-interest suggestion boxes output by the RPN, the I suggestion boxes are first sent to a primary screening layer that removes 2/3 of the background boxes, raising the proportion of positive samples and effectively reducing the generation of background regions; the image features of the primarily screened suggestion boxes are then processed by convolution and ReLU to obtain I 4096-dimensional feature maps, which are sent to the classification layer and the window regression layer respectively; finally, to obtain the maximum-response region-aware semantic features, the I 4096-dimensional feature maps are fed into the secondary screening network, which re-selects the region-aware semantic features of the most accurate suggestion boxes.
The region-of-interest coordinate regression and identification module: convolutional neural network training is a back-propagation process similar to the BP algorithm; the error function is back-propagated and the convolution parameters and biases are optimized with stochastic gradient descent until the network converges or the maximum number of iterations is reached;
Back-propagation requires comparison against labelled training samples. For multi-class recognition over c classes and N training samples, a squared-error cost function is adopted, and the error of the network's final output is computed by formula (5):

$$E_N = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{c}\left(t_k^n - y_k^n\right)^2 \qquad (5)$$

where $E_N$ is the squared-error cost function, $t_k^n$ is the k-th dimension of the label of the n-th sample, and $y_k^n$ is the corresponding k-th output of the network's prediction for the n-th sample;
When the error function is back-propagated, a computation similar to the traditional BP algorithm is adopted, as in formula (6):

$$\delta^{l} = \left(W^{l+1}\right)^{T}\delta^{l+1} \circ f'\left(u^{l}\right), \qquad u^{l} = W^{l}x^{l-1} + b^{l} \qquad (6)$$

where $\delta^{l}$ is the error function of the current layer, $\delta^{l+1}$ is the error function of the layer above, $W^{l+1}$ is that layer's mapping matrix, f′ is the derivative of the activation function (realized as upsampling for a pooling layer), $u^{l}$ is the layer's output before the activation function, $x^{l-1}$ is the layer's input, and $W^{l}$ is the layer's weight-mapping matrix;
2) The rough targets obtained by the region-of-interest primary screening module are sent into the secondary region-of-interest screening model for fast visual segmentation, detection and localization. As shown in fig. 4, a probability layer applies non-maximum suppression and threshold screening to the coordinate and score outputs of each RPN suggestion box to obtain the final coordinate boxes and scores; finally, the region of the most accurate suggestion box is selected again by the re-screening network, and the two rounds of suggestion-box screening guarantee accurate detection and identification of the target object, so that the global sparse hash code of the image and the semantic information of the suggestion boxes in the image can be obtained. Through end-to-end fine-tuning, the global information of the image is highly summarized in the sparse hash code, and using this low-dimensional feature to coarsely retrieve the images in the database quickly is an effective way to reduce the computational load. The outputs of the fully connected layer and the primary screening network are then connected to the secondary screening network, the suggestion box with the largest response among the regions of interest in the image is extracted, and the high-level semantic features are compared selectively within the coarse retrieval result to further reduce the running time of the retrieval system;
3) Based on the above viewpoints, the invention adopts a coarse-to-fine retrieval strategy to finally realize fast search and accurate return for the image retrieval system: first a group of images with similar attributes is retrieved through the sparse hash matrix of the images, and then the top k pictures most similar to the target image are selectively retrieved from that coarsely retrieved group through the deep-perception semantic information of the images' target regions.
First, assume the target dataset can be divided into c category labels. Passing the target image through the region-of-interest primary screening network yields, for each region of interest, a target-category probability vector $p = (x_1, x_2, \ldots, x_c)$, $x_c \in (0,1)$; to promote binary output from the sparse-coding module, the probability vector p is binarized with a piecewise function, the overall process being shown in fig. 2. If the target image I contains m regions of interest, m p-vectors are generated and fused into $P = (p_1, p_2, \ldots, p_m)$, the global probability matrix P having dimension m×c; P is fed into the binarization function to obtain the matrix h, the binarization process being shown in formula (7);
where i, j index the m rows and c columns. Second, to speed up image retrieval, vector fusion is applied once more and the matrix h is compressed into a 1×c matrix H representing the global features of the target image. The whole process is shown in formulas (8) and (9): first the matrix h is transposed and multiplied with itself to obtain a c×c transition matrix h′, and then the diagonal elements of h′ are selected as the final global-feature binary hash code of target image I, i.e. the matrix H:

$$h' = h^{T}h \qquad (8)$$
$$H = \operatorname{diag}(h') \qquad (9)$$
The sparse hash matrix H is a 1×c-dimensional vector with $H_i \in \{0,1\}$, $i \in \{1,\ldots,c\}$, where $(h^{T}h)_{ij}$ denotes row i, column j of the matrix $h^{T}h$. Coarse retrieval over the target dataset uses H; using this low-dimensional vector effectively shortens retrieval time while improving retrieval precision.
The coarse retrieval flow is shown in fig. 5, and the retrieval process is as follows: for a given image I, first extract the image's output $Out_j(H)$, the sparse hash matrix H.
Assume the target dataset consists of n images, expressed as $\{I_1, I_2, \ldots, I_n\}$, with the sparse hash codes of the target dataset $\{H_1, H_2, \ldots, H_n\}$, $H_i \in \{0,1\}$. Further assume a given search image $I_q$ whose sparse hash code is $H_q$. The cosine distance measures the similarity between $H_q$ and each $H_i$; an image whose cosine value is greater than the threshold $T_H$ is placed in the candidate pool U, and this coarse retrieval result serves as the candidate images for the subsequent fine retrieval process.
4) Accurate comparison retrieval based on the maximum-response region-aware semantic features and the matrix h.
Fine retrieval: given a query image $I_q$ and the candidate pool U, the top-k ranking of the images in U is determined using the region-aware semantic features selected by the re-screening network and the fully connected layer; the number of suggestion boxes contained in each image is variable and may be one or more. If the query image $I_q$ contains m suggestion boxes and an image $I_n \in U$ selected at random from the candidate pool contains m′ suggestion boxes, then comparing all suggestion boxes by brute-force retrieval requires m × m′ comparisons, and the larger m × m′ is, the more the running speed of the whole retrieval system is degraded. To address this problem, and to reduce the number of suggestion-box comparisons and improve the running efficiency of the program, the matrix h is used as the basis for deciding the comparisons: the h matrix of the query image $I_q$ is denoted $h_q$, an m×c-dimensional matrix, and that of a random image $I_n$ in the candidate pool is $h_n$; the corresponding number of comparisons is then given by formula (10):
The result num ≤ m × m′; selectively comparing suggestion boxes by formula (10) greatly reduces the number of comparisons and the running time of the retrieval system, and the effect is more pronounced the more suggestion boxes the images contain. The suggestion boxes that need comparison are obtained as shown in formula (11):
wherein dis (·) represents a modified cosine distance formula, and the concrete expression form is shown in formula (12):
The region-aware semantic features of the query image and of any image in the target dataset are denoted $f_q$ and $f_n$ respectively; substituting them into formula (11) yields the m × m′-dimensional comparison suggestion-box matrix s. The suggestion boxes generated for an image usually fall into two broad categories, same-class and different-class, so the suggestion boxes selected for comparison produce two kinds of comparison results, causing retrieval differences between the within-class and between-class numbers of suggestion boxes. These differences are eliminated with formula (13): taking the suggestion boxes of the query image as the reference, the maximum over the same-class boxes of images $I_q$ and $I_n$ and the mean over the between-class boxes are taken, and finally the within-class mean is taken again in image $I_q$; these operations reduce the differences as far as possible to guarantee the accuracy of the result, yielding the similarity sim of images $I_q$ and $I_n$.
First, the matrix s must be updated: taking the maximum value of the same-class boxes of images $I_q$ and $I_n$, s is updated according to formula (13):

In formula (13), i, j range over (m, m′), and the maximum value in the i-th row of matrix s′ is selected. Finally, using formula (14), the mean of the between-class boxes of images $I_q$ and $I_n$ is taken and averaged again within the classes of $I_q$, finally giving the similarity of the whole picture, with sim obtained as:

In formula (14), i, j range over (m, m′), $s'_{ij}$ denotes row i, column j of the query picture's matrix s′, and $s'_j$ denotes the j-th column of s′. The larger the value of the similarity formula sim, the higher the image similarity; the candidate pictures in the candidate pool U are ranked in descending order of sim, which determines the ranking of the top k pictures.
Further, the method comprises: step 5) evaluation of image retrieval precision, performed with a ranking-based criterion. For a given search image $I_q$ and a similarity measure, each dataset image receives one ranking; the retrieval precision for search image $I_q$ is expressed by evaluating the top k ranked images, as in formula (15):

$$\text{Precision@}k = \frac{\sum_{i=1}^{k}\operatorname{Rel}(i)}{k} \qquad (15)$$

where Rel(i) denotes the true correlation between the search image $I_q$ and the i-th ranked image, and k denotes the number of ranked images considered by the retrieval precision Precision@k. When computing the true correlation only the classification label is considered, with Rel(i) ∈ {0,1}: Rel(i) = 1 if the search image and the i-th ranked image share the same label, and Rel(i) = 0 otherwise. The retrieval precision is obtained by traversing the top k ranked images in the candidate pool U.
The overall flow of the multi-task layered image retrieval method based on the deep self-coding convolutional neural network is summarized briefly as follows: 1) feed the image into the deep self-coding convolutional neural network, apply logistic regression to the feature map, and segment and predict the position and category of the regions of interest in the query image; 2) extract the image's sparse hash matrix and the perceptual semantic features of the regions of interest with the deep self-coding convolutional neural network; 3) coarsely retrieve the images in the database with the sparse hash matrix to obtain candidate images with similar attributes and place them into the candidate pool U; 4) on the basis of the coarse retrieval, i.e. within the candidate pool U, further selectively compare and sort the suggestion boxes using the modified cosine distance to obtain the top-k images.
The above description is only exemplary of the preferred embodiments of the present invention, and is not intended to limit the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (5)

1. A multi-task layered image retrieval method based on a deep self-coding convolutional neural network, characterized by comprising the following steps:
1) constructing a multi-task end-to-end convolutional neural network for deep learning, training and recognition; the convolutional neural network is divided into three modules: a shared convolution module, a secondary region-of-interest screening module, and a region-of-interest coordinate regression and identification module, each composed of alternating convolution layers, activation layers and down-sampling layers; the input image undergoes logistic regression and layer-by-layer mapping in the network to obtain each layer's different representation of the image, realizing a depth representation of the region of interest;
2) performing fast visual segmentation detection and positioning with the RPN-based secondary region-of-interest screening model: primary and secondary screening networks are added on top of the RPN, the initial suggestion boxes generated by the RPN are scored and filtered in multiple stages, and the final region of interest is determined according to the score and the filtering of the maximum-response region;
3) performing coarse retrieval with the full-image sparse hash code: first binarizing the attribute probability vectors of the suggestion boxes generated by the RPN of the initial screening network, then flattening the two-dimensional vectors into a one-dimensional vector by vector fusion to obtain the full-image sparse hash code; finally, performing fast image comparison on the compact binary coding vectors via the cosine distance;
the hash method aims to represent a sample as a string of fixed-length binary codes and involves two aspects: 1) for two images sharing the same semantic concept, the Hamming distance between their binary codes should be as small as possible, whereas if the images come from different classes the Hamming distance should be large, meaning that the binary codes are discriminative in Hamming space; 2) the hash method is treated as a black box whose input is low-level or mid-level features, here GIST features, and whose output is a binary code that maintains the semantic attributes in Hamming space; from the viewpoint of feature representation, the intrinsic property of such binary codes is that of a high-level representation built on low- or mid-level features, returning semantics more relevant to the query image;
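As a toy illustration of this discriminating property in Hamming space (the codes below are made up), the Hamming distance simply counts differing bits:

    def hamming(a, b):
        # Number of positions at which two equal-length binary codes differ.
        return sum(x != y for x, y in zip(a, b))

    print(hamming([1, 0, 1, 1], [1, 0, 1, 0]))  # same class: small distance -> 1
    print(hamming([1, 0, 1, 1], [0, 1, 0, 0]))  # different class: large distance -> 4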
the deep hash algorithm proceeds as follows: first, assume the target dataset can be divided into c class labels; for the target image, the primary screening network of the regions of interest yields a target-class probability vector p = (x_1, x_2, …, x_c), x_i ∈ (0,1), for each region of interest, and the probability vector p is binarized with a piecewise function to promote the binary output of the sparse coding module; if the target image I contains m regions of interest, m p vectors are generated and fused into the global probability matrix P = (p_1, p_2, …, p_m) of dimension m×c; P is fed into the binarization function to obtain the matrix h, the binarization process being shown in formula (7):
secondly, to speed up image retrieval, the matrix h is compressed into a 1×c-dimensional matrix H representing the global features of the target image by means of vector fusion; the whole process is shown in formulas (8) and (9): first the matrix h is transposed and multiplied by h to obtain the c×c-dimensional transition matrix h′, and then the diagonal elements of the transition matrix h′ are taken as the final global binary hash code of the target image I, namely the matrix H:

h′ = hᵀ·h    (8)

H = diag(h′)    (9)
the sparse hash matrix H is a 1×c-dimensional vector with H_i ∈ {0,1}, i = 1, …, c, where h′_ij denotes row i, column j of the matrix hᵀh; the target dataset is coarsely retrieved with H, and using this low-dimensional vector effectively shortens the retrieval time and improves the retrieval efficiency;
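A minimal sketch of the hash construction of formulas (7)-(9); since the piecewise binarization threshold of formula (7) is not reproduced above, 0.5 is assumed here, and the diagonal is re-binarized so that H_i ∈ {0,1} still holds when a class occurs in several regions:

    import numpy as np

    def sparse_hash(P, threshold=0.5):
        # P: m x c global probability matrix (one row per region of interest).
        h = (P >= threshold).astype(np.uint8)   # formula (7): binarize p vectors (assumed threshold)
        h_prime = h.T @ h                       # formula (8): c x c transition matrix
        H = np.diag(h_prime)                    # formula (9): diagonal elements
        return (H > 0).astype(np.uint8)         # keep the code binary, H_i in {0,1}

    P = np.array([[0.9, 0.1, 0.3],
                  [0.2, 0.8, 0.1]])             # m = 2 regions, c = 3 classes
    print(sparse_hash(P))                       # -> [1 1 0]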
the sparse hash matrix of an image and the deep perceptual semantic information of the image's target regions are learned by the deep self-coding convolutional neural network, and on this basis a coarse-to-fine retrieval strategy is adopted to achieve fast search and accurate return by the image retrieval system: first a group of images with similar attributes is retrieved via the sparse hash matrix, and then the top k pictures most similar to the target image are selectively retrieved from this coarsely retrieved group via the deep perceptual semantic information of the image's target regions;
the retrieval process is as follows: for a given image I, its output Out_j(H), namely the sparse hash matrix H, is first extracted;
assume the target dataset consists of n images, denoted I = {I_1, I_2, …, I_n}, whose sparse hash codes are H = {H_1, H_2, …, H_n} with H_i ∈ {0,1}; further assume a given query image I_q with sparse hash code H_q; the cosine distance is used to measure the similarity between H_q and each H_i ∈ H, the images whose cosine value is greater than the threshold T_H are placed in the candidate pool U, and this coarse retrieval result serves as the candidate images for the subsequent fine retrieval process;
4) performing precise comparison retrieval based on the maximum-response region-aware semantic features and the matrix h: the high-level semantic information of the proposal boxes is extracted from the primary screening network result and the maximum response of the fully connected layer through the re-screening network and, combined with the returned images obtained by the fast comparison method, is selectively compared and ranked with the modified cosine distance; the top k images are the final returned result.
2. The multi-task layered image retrieval method based on the deep self-coding convolutional neural network as claimed in claim 1, wherein: in step 1), the shared convolution module: the shared network consists of 5 convolution modules, among which the deepest layers of conv2_x to conv5_x take {4², 8², 16², 16²} as the output sizes of their feature maps, while conv1 contains only a single convolutional layer serving as the input layer;
the region-of-interest coordinate regression and recognition module: the RPN takes an image of arbitrary scale as input and outputs a set of rectangular target proposal boxes, each comprising 4 position-coordinate variables and a score; to generate the region proposal boxes, the input image first passes through the shared convolution layers to produce a feature map, on which a multi-scale convolution operation is then performed, realized as follows: at each sliding-window position 3 scales and 3 aspect ratios are used, each anchored at the center of the current sliding window and corresponding to one scale and one aspect ratio, so that 9 candidate regions of different scales are obtained by mapping back onto the original image; for a shared convolution feature map of size w×h there are w×h×9 candidate regions in total; finally, the classification layer outputs w×h×9×2 candidate-region scores, i.e. the estimated probability of each region being target/non-target, and the regression layer outputs w×h×9×4 parameters, i.e. the coordinate parameters of the candidate regions;
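A sketch of the anchor generation at one sliding-window position; the text fixes 3 scales and 3 aspect ratios (9 anchors per position) but does not list their values, so the scales and ratios below are illustrative assumptions:

    import numpy as np

    def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
        # Return 9 (x1, y1, x2, y2) candidate boxes centered on (cx, cy).
        boxes = []
        for s in scales:
            for r in ratios:
                w = s * np.sqrt(r)              # width/height chosen so the
                h = s / np.sqrt(r)              # box area stays close to s^2
                boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
        return boxes

    # For a w x h feature map the RPN evaluates w*h*9 anchors in total, with
    # w*h*9*2 classification scores and w*h*9*4 regression parameters.
    print(len(anchors_at(100, 100)))            # -> 9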
when the RPN network is trained, each candidate region is assigned a binary label to mark whether it is a target, as follows: a positive label is assigned to 1) the candidate region with the highest IoU (Intersection-over-Union) overlap with a ground-truth target region (GT) and 2) any candidate region whose IoU overlap with any GT bounding box is greater than 0.7; a negative label is assigned to candidate regions whose IoU ratio with all GT bounding boxes is below 0.3; 3) candidate regions between the two are discarded;
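A sketch of these labeling rules; the IoU thresholds 0.7 and 0.3 come from the text, while the box representation and function names are assumptions:

    def iou(a, b):
        # Intersection-over-Union of two (x1, y1, x2, y2) boxes.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / (union + 1e-12)

    def assign_labels(anchors, gt_boxes, hi=0.7, lo=0.3):
        # 1 = positive, 0 = negative, -1 = discarded ("between the two").
        labels = []
        for a in anchors:
            best = max((iou(a, g) for g in gt_boxes), default=0.0)
            labels.append(1 if best > hi else 0 if best < lo else -1)
        # Rule 1): the anchor with the highest IoU for each GT box is positive
        # even if it does not clear the 0.7 threshold.
        for g in gt_boxes:
            j = max(range(len(anchors)), key=lambda i: iou(anchors[i], g))
            labels[j] = 1
        return labels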
with these definitions, following the multi-task loss of Faster R-CNN, the objective function is minimized; the loss function for one image is defined as:

L({p_i},{t_i}) = (1/N_cls)·Σ_i L_cls(p_i, p_i*) + λ·(1/N_reg)·Σ_i p_i*·L_reg(t_i, t_i*)

where i is the index of the ith candidate region and p_i is the predicted probability that candidate region i is a target; the ground-truth label p_i* is 1 if the candidate region is labeled positive and 0 if it is labeled negative; t_i is a vector representing the 4 parameterized coordinates of the predicted bounding box and t_i* is the coordinate vector of the corresponding GT bounding box; N_cls and N_reg are the normalization coefficients of the classification loss and the position-regression loss respectively, and λ is the weight parameter between the two; the classification loss L_cls is the log loss over the two classes, target and non-target:

L_cls(p_i, p_i*) = −[p_i*·log p_i + (1 − p_i*)·log(1 − p_i)]

the position-regression loss L_reg is defined by the following function:

L_reg(t_i, t_i*) = R(t_i − t_i*)

where R is the robust loss function smooth_L1:

smooth_L1(x) = 0.5x² if |x| < 1; |x| − 0.5 otherwise
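A sketch of this per-image multi-task loss; following the text, L_cls is a binary log loss and R is smooth-L1, while taking N_cls as the number of sampled regions and N_reg as the number of positives is a common simplification rather than the exact normalization above:

    import numpy as np

    def smooth_l1(x):
        # R(x) = 0.5 x^2 for |x| < 1, |x| - 0.5 otherwise.
        x = np.abs(x)
        return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

    def multitask_loss(p, p_star, t, t_star, lam=1.0):
        # p: predicted target probabilities; p_star: 0/1 labels;
        # t, t_star: N x 4 predicted / ground-truth parameterized coordinates.
        n_cls = len(p)
        n_reg = max(1, int(p_star.sum()))
        l_cls = -(p_star * np.log(p + 1e-12)
                  + (1 - p_star) * np.log(1 - p + 1e-12)).sum() / n_cls
        l_reg = (p_star[:, None] * smooth_l1(t - t_star)).sum() / n_reg
        return l_cls + lam * l_reg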
the secondary region-of-interest screening module: the image features of the I region-of-interest proposal boxes output by the RPN are first sent to a primary screening layer, which removes 2/3 of the background boxes to increase the proportion of positive samples and effectively reduce the generation of background regions; the image features of the primarily screened proposal boxes are then processed by convolution and ReLU to obtain I 4096-dimensional feature maps, which are sent to a classification layer and a window-regression layer respectively; finally, to obtain the maximum-response region-aware semantic features, the I 4096-dimensional feature maps are fed into the secondary screening network, which reselects the region-aware semantic features of the most accurate proposal boxes;
the convolutional neural network is trained by back-propagation: the convolution parameters and biases are optimized and adjusted with stochastic gradient descent by back-propagating the error function until the network converges or the maximum number of iterations is reached;
back-propagation requires comparing the training samples with their labels; a squared-error cost function is adopted for the multi-class recognition problem with c classes and N training samples, and the final output error of the network is computed by formula (5):

E_N = (1/2)·Σ_{n=1..N} Σ_{k=1..c} (t_k^n − y_k^n)²    (5)

where E_N is the squared-error cost function, t_k^n is the kth dimension of the label of the nth sample, and y_k^n is the kth output of the network's prediction for the nth sample;
when the error function is back-propagated, a computation similar to the classical BP algorithm is used, as in formula (6):

δ^l = (W^{l+1})ᵀ·δ^{l+1} ∘ f′(u^l), with u^l = W^l·x^{l−1} + b^l    (6)

where δ^l denotes the error term of the current layer, δ^{l+1} the error term of the following (deeper) layer, W^{l+1} the mapping weight matrix of that layer, f′ the derivative of the activation function (upsampled across down-sampling layers), u^l the pre-activation output of the current layer, x^{l−1} the input coming from the preceding layer, and W^l the mapping weight matrix of the current layer.
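A sketch of one step of formula (6), assuming a sigmoid activation so that f′(u) = f(u)(1 − f(u)); the element-wise product implements the Hadamard term of the formula:

    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def backprop_layer(delta_next, W_next, u_l):
        # delta^l = (W^{l+1})^T delta^{l+1} (elementwise *) f'(u^l)
        f_u = sigmoid(u_l)
        return (W_next.T @ delta_next) * (f_u * (1.0 - f_u))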
3. The multi-task layered image retrieval method based on the deep self-coding convolutional neural network as claimed in claim 1 or 2, wherein: in step 2), the image obtained from a video or camera contains multiple target regions; the probability layer is used to output the coordinates and score of each proposal box of the RPN, the final coordinate boxes and scores are obtained through non-maximum suppression and threshold screening, and finally the region of the most accurate proposal box is selected once more by the re-screening network; the two rounds of proposal-box screening guarantee accurate detection and recognition of the target objects.
4. The multi-task layered image retrieval method based on the deep self-coding convolutional neural network as claimed in claim 1, wherein: given a query image I_q and a candidate pool U, the top k ranks of the images in the candidate pool U are determined using the region-aware semantic features selected from the re-screening network and the fully connected layer; the number of proposal boxes contained in each image is variable, one or more; if the query image I_q contains m proposal boxes and an image I_n ∈ U randomly selected from the candidate pool contains m′ proposal boxes, comparing all proposal boxes by brute-force retrieval requires m×m′ comparisons, and the larger m×m′ is, the more the running speed of the whole retrieval system drops; to solve this problem, i.e. to reduce the number of proposal-box comparisons and improve the running efficiency of the program, the matrix h is used as the basis for deciding which comparisons to make; the matrix h of the query image I_q is denoted h_q, an m×c-dimensional vector, and that of a random image I_n in the candidate pool is h_n; the corresponding number of comparisons num is then given by formula (10):
the result satisfies num ≤ m×m′; selectively comparing the proposal boxes according to formula (10) greatly reduces the number of comparisons and the running time of the retrieval system, the effect being more pronounced the more proposal boxes the images contain; the proposal boxes that need to be compared are obtained as shown in formula (11):
where dis(·) denotes the modified cosine distance, expressed as formula (12):
the region-aware semantic features of the query image and of any image in the target dataset are denoted f_q and f_n respectively; substituting them into formula (11) yields the m×m′ matrix s of compared proposal boxes; the proposal boxes generated for an image usually fall into two main categories, same-class and non-same-class boxes, so the proposal boxes selected for comparison produce two classes of comparison results, causing retrieval differences between the numbers of intra-class and inter-class proposal boxes; these differences are eliminated using formula (13): taking the proposal boxes of the query image as reference, the maximum over the same-class boxes of images I_q and I_n and the mean over the inter-class boxes are taken, and finally the intra-class mean is taken again within image I_q; these operations reduce the differences as far as possible to guarantee the accuracy of the result and yield the similarity sim of images I_q and I_n;
first the matrix s is updated: for images I_q and I_n, the maximum value within the same-class boxes is taken, as in formula (13):

s′_i = max_{j = 1,…,m′} s_ij    (13)

where i, j ∈ (m, m′) and s′_i selects the maximum value of the ith row of the matrix s; finally, the mean over the inter-class boxes of images I_q and I_n is taken and the intra-class mean is obtained again within I_q; the similarity sim of the whole picture is then obtained by formula (14):
in formula (14), i, j ∈ (m, m′); s′_ij denotes row i, column j of the query picture's matrix and s′_j denotes the jth column of the matrix s′; the larger the value of the similarity measure sim, the higher the image similarity; the candidate pictures in the candidate pool U are ranked in descending order of sim, thereby determining the top k pictures.
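A sketch of the row-max / mean aggregation behind formulas (13) and (14); the exact intra-class/inter-class weighting of formula (14) is not reproduced above, so the plain average used here is a simplification:

    import numpy as np

    def image_similarity(s):
        # s: m x m' matrix of modified cosine similarities between the
        # proposal features of the query image and of a candidate image.
        s_prime = s.max(axis=1)       # formula (13): best match per query proposal
        return float(s_prime.mean())  # simplified stand-in for formula (14)

    s = np.array([[0.9, 0.2],
                  [0.1, 0.7]])
    print(image_similarity(s))        # -> ~0.8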
5. The method of claim 4, wherein the method further comprises: step 5), evaluating the accuracy of image retrieval, the evaluation being performed with a ranking-based criterion; for a given query image I_q and a similarity measure, a ranking is obtained for each dataset image; the retrieval precision of the query image I_q is expressed by evaluating the top k ranked images, as in formula (15):

Precision@k = (Σ_{i=1..k} Rel(i)) / k    (15)
where Rel(i) denotes the true correlation between the query image I_q and the ith-ranked image, k denotes the number of ranked images considered, and Precision@k is the retrieval precision; when computing the true correlation, only the classification label is considered, i.e. Rel(i) ∈ {0,1}: Rel(i) = 1 if the query image and the ith-ranked image share the same label, otherwise Rel(i) = 0; the retrieval precision is obtained by traversing the top k ranked images in the candidate pool P.
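Formula (15) reduces to a short computation; the labels and values below are made up for illustration:

    def precision_at_k(query_label, ranked_labels, k):
        # Rel(i) = 1 when the i-th ranked image shares the query's label.
        rel = [1 if lab == query_label else 0 for lab in ranked_labels[:k]]
        return sum(rel) / k

    print(precision_at_k("cat", ["cat", "dog", "cat", "cat"], 4))  # -> 0.75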
CN201711057490.9A 2017-11-01 2017-11-01 Multi-task layered image retrieval method based on deep self-coding convolutional neural network Active CN107679250B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711057490.9A CN107679250B (en) 2017-11-01 2017-11-01 Multi-task layered image retrieval method based on deep self-coding convolutional neural network

Publications (2)

Publication Number Publication Date
CN107679250A CN107679250A (en) 2018-02-09
CN107679250B true CN107679250B (en) 2020-12-01

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106227851A (en) * 2016-07-29 2016-12-14 姹ゅ钩 Based on the image search method searched for by depth of seam division that degree of depth convolutional neural networks is end-to-end
CN106250812A (en) * 2016-07-15 2016-12-21 姹ゅ钩 A kind of model recognizing method based on quick R CNN deep neural network
CN106339591A (en) * 2016-08-25 2017-01-18 姹ゅ钩 Breast cancer prevention self-service health cloud service system based on deep convolutional neural network
CN106372571A (en) * 2016-08-18 2017-02-01 宁波傲视智绘光电科技有限公司 Road traffic sign detection and identification method
CN106951911A (en) * 2017-02-13 2017-07-14 北京飞搜科技有限公司 A kind of quick multi-tag picture retrieval system and implementation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10809895B2 (en) * 2016-03-11 2020-10-20 Fuji Xerox Co., Ltd. Capturing documents from screens for archival, search, annotation, and sharing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant