CN114780767A - Large-scale image retrieval method and system based on deep convolutional neural network - Google Patents

Large-scale image retrieval method and system based on deep convolutional neural network

Info

Publication number
CN114780767A
CN114780767A
Authority
CN
China
Prior art keywords
layer
hash
image
neural network
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210393416.9A
Other languages
Chinese (zh)
Inventor
王中元
裴盈娇
陈何玲
何政
邵振峰
邹华
肖进胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210393416.9A priority Critical patent/CN114780767A/en
Publication of CN114780767A publication Critical patent/CN114780767A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a large-scale image retrieval method and system based on a deep convolutional neural network. Large-scale image retrieval is carried out by constructing a deep convolutional neural network (DHN) comprising four parts: ResNet50-based feature extraction with Channel and Spatial Attention (CSA) feature refinement, a classification layer, a hash layer, and a weight layer. The DHN realizes bottom-up pixel saliency attention through the CSA and top-down semantic constraint through classification label supervision. The DHN adopts an adaptive weighted learning algorithm to generate a weight for each bit of the hash code, and then generates short hash codes directly from the long hash code according to the bit importance represented by the weights. The method has higher hash code generation precision and speed, and is therefore suitable for large-scale image retrieval tasks.

Description

Large-scale image retrieval method and system based on deep convolutional neural network
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to a large-scale image retrieval method and a large-scale image retrieval system, and particularly relates to a large-scale image retrieval method and a large-scale image retrieval system based on mixed attention and adaptive weight.
Background
The large-scale image retrieval task is, given a query image, to quickly find a certain number of images with similar content from a million-scale image database by means of image features. Conventional tree-based retrieval methods typically use similarity measures of high time complexity, such as the Euclidean distance, to compute the distance between features, and exhibit good performance when processing low-dimensional data. However, when the data volume reaches millions or hundreds of millions and the feature dimension grows substantially, the curse of dimensionality arises and time performance degrades markedly. To significantly improve storage and retrieval efficiency, hash retrieval has been proposed, which converts high-dimensional vectors in the original space into binary codes in Hamming space.
Existing hash methods are mainly divided into traditional hash methods and hash methods based on deep neural networks. The most classical traditional hash method is LSH, which maps original data points that are close in the high-dimensional space into the same hash bucket through K concatenated position-sensitive hash functions, so that at query time a neighbor search only needs to be performed in the hash bucket corresponding to the query picture. However, when the data size is large, LSH must trade longer code lengths for improved retrieval performance and therefore requires a large amount of storage space, making it unsuitable for large-scale image retrieval.
The semi-supervised hash algorithm SSH proposed by Wang et al. is a classical semi-supervised hash method; it minimizes the empirical error on labeled data and regularizes all data to maximize computable attributes (such as variance) and the independence between hash bits. After relaxing and applying the orthogonality constraint, uncorrelated hash codes can be derived.
In order to make better use of label information, the kernel-based hash method KSH proposed by Liu et al. establishes a similarity relationship between data according to the label information and obtains better results. Earlier hash methods directly optimize the Hamming distance, but the Hamming distance is non-convex and non-smooth and thus difficult to optimize. KSH instead uses the equivalence between the inner product of binary codes and the Hamming distance to obtain an efficient and easily optimized objective function, and constructs compact hash codes.
Traditional hash-based image retrieval methods mainly use manually designed image descriptors such as SIFT, LBP, HOG and SURF features. However, these features describe local characteristics of the image, cannot fully express the information implicit in the image, and lack the high-level semantic information that humans can understand. Convolutional neural networks simulate the human visual mechanism, can comprehensively express image information, and are more suitable for practical applications. Thus, deep hash learning has become increasingly popular in recent years.
In 2014, Xia et al. proposed the CNNH algorithm, which combines CNN with hash codes through a two-stage method: the first stage learns the hash codes, and the second stage trains the CNN to output continuous hash codes. CNNH automatically learns image features and a nonlinear hash function through the CNN and fits the binary codes, significantly improving retrieval performance. However, image representation and hash code learning are separated, so end-to-end learning is not possible. In 2015, Lai et al. improved the CNNH network and proposed NINH, which uses triplets of images as the network input, trains two subnetworks simultaneously, and uses a triplet loss function so that similar images in a triplet obtain similar hash codes while dissimilar images obtain markedly different hash codes, thereby optimizing feature extraction and hash coding simultaneously. In 2017, Cao et al. proposed the HashNet framework, which learns hash codes directly by a convergent continuation method so as to learn accurate binary codes from continuous similarity data. In addition, in order to maintain similarity between images, HashNet designs a weighted pairwise cross-entropy loss based on the cross-entropy loss function. In 2018, Cao et al. proposed the DCH architecture, which generates compact and concentrated binary hash codes by jointly optimizing a Cauchy cross-entropy loss and a Cauchy quantization loss, realizing efficient Hamming-space retrieval. In 2020, Wang et al. proposed a new global similarity measure that encourages the hash codes of similar data pairs to converge to a common center and the hash codes of dissimilar data pairs to converge to different centers, greatly improving learning efficiency and retrieval accuracy.
Existing deep hash methods have achieved high retrieval accuracy, but most of them extract only low-level features of the pictures, and the extracted features are easily interfered with by irrelevant objects in the pictures, so that similar data points may generate dissimilar hash codes; retrieval accuracy and robustness therefore still need to be improved.
Disclosure of Invention
In order to overcome the defects of the conventional deep hash algorithm, the invention provides a large-scale image retrieval method and system based on a deep convolutional neural network.
The technical scheme adopted by the method is as follows: a large-scale image retrieval method based on a deep convolutional neural network comprises the following steps:
step 1: inputting an image to be queried into a deep convolutional neural network to generate a Hash code queryHash and a weight queryWeight;
the deep convolutional neural network is composed of a ResNet50-based feature extraction layer, a classification layer, a hash layer and a weight layer;
the ResNet50-based feature extraction layer consists of a ResNet50 backbone with its average pooling layer and fully connected layer removed, followed by a CSA feature refinement layer and a global average pooling layer connected in sequence;
the classification layer and the hash layer are two fully connected layers arranged in parallel behind the global average pooling layer, which respectively predict the image label and the hash code under the supervision of corresponding loss functions;
the weight layer is arranged behind the hash layer and generates corresponding weight for each hash code;
step 2: calculating the similarity of the hash codes of the query image and the image in the existing image database, and taking the image with the highest similarity as a retrieval result;
when a retrieval image is entered into the existing image database, it is immediately input into the deep convolutional neural network to generate a hash code databaseHash, which is stored in the image database as a feature index for retrieval;
if the deep convolutional neural network generates 64-bit hash codes but a shorter hash code is needed, the corresponding hash bits are selected from the currently available long hash code in descending order of the query weight queryWeight to directly obtain the low-bit hash code.
The technical scheme adopted by the system of the invention is as follows: a large-scale image retrieval system based on a deep convolutional neural network comprises the following modules:
the module 1 is used for inputting the query image into a deep convolutional neural network to generate a hash code queryHash and a weight queryWeight;
the deep convolutional neural network consists of a ResNet50-based feature extraction layer, a classification layer, a hash layer and a weight layer;
the ResNet50-based feature extraction layer consists of a ResNet50 backbone with its average pooling layer and fully connected layer removed, followed by a CSA feature refinement layer and a global average pooling layer connected in sequence;
the classification layer and the hash layer are two fully connected layers arranged in parallel behind the global average pooling layer, which respectively predict the image label and the hash code under the supervision of corresponding loss functions;
the weight layer is arranged behind the hash layer and generates corresponding weight for each hash code;
the module 2 is used for calculating the similarity between the hash code of the query image and the hash codes of the images in the existing image database, and taking the image with the highest similarity as the retrieval result;
when a retrieval image is entered into the existing image database, it is immediately input into the deep convolutional neural network to generate a hash code databaseHash, which is stored in the image database as a feature index for retrieval;
if the deep convolutional neural network generates 64-bit hash codes but a shorter hash code is needed, the corresponding hash bits are selected from the currently available long hash code in descending order of the query weight queryWeight to obtain the low-bit hash code.
Compared with the prior art, the invention has the following advantages and positive effects:
1) The invention provides an end-to-end deep hash framework for quickly learning the high-precision hash codes required by image retrieval. It mainly comprises a ResNet-based feature extraction module, a CSA module for feature refinement, a classification layer for semantic supervision, a hash layer for quantizing hash codes, and a weight layer for generating bit weights. The end-to-end framework facilitates overall optimization and reduces the complexity of engineering implementation.
2) The invention provides a mixed attention mechanism consisting of bottom-up CSA and top-down classification label supervision. The mechanism encourages the network to learn the main semantic information of interest, so that consistent hash codes are generated for similar images while excluding the interference of secondary or unrelated objects, improving the robustness and precision of hash retrieval.
3) The invention provides an adaptive weighted learning strategy, which learns the weight corresponding to each hash bit and generates shorter hash codes from the available long hash codes according to the importance defined by the weights, thereby avoiding retraining of the model and significantly saving the space-time cost of model training.
Drawings
FIG. 1: a flow chart of an embodiment of the invention;
FIG. 2: a structural diagram of the deep convolutional neural network of an embodiment of the invention.
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and the implementation examples, it is to be understood that the implementation examples described herein are only for the purpose of illustration and explanation and are not to be construed as limiting the present invention.
Referring to fig. 1, the large-scale image retrieval method based on the deep convolutional neural network provided by the invention comprises the following steps:
step 1: inputting the query image into a deep convolutional neural network to generate a Hash code queryHash and a weight queryWeight;
Referring to fig. 2, the deep convolutional neural network of this embodiment is composed of a ResNet50-based feature extraction layer, a classification layer, a hash layer, and a weight layer. The network is an end-to-end framework based on a hybrid attention mechanism and adaptive weights, learning hash codes and weight vectors simultaneously. Under the supervision of the classification labels, Channel and Spatial Attention (CSA) is combined to avoid interference from the image background or extraneous objects. The mixed attention mechanism, consisting of bottom-up CSA and top-down classification semantic supervision, can emphasize salient regions of interest and dominant information at the same time, maximizing the discriminability of the hash space. The weight layer generates a weight for each bit of the query image's hash code, and short hash codes are then generated directly from the long hash codes of the query image and the retrieval images according to the bit importance represented by the weights;
The first part, the ResNet50-based feature extraction layer, is the feature extraction portion of ResNet, composed of an independent convolutional layer and 4 convolutional residual structures; each residual block contains several convolutional layers, and after each convolution operation the feature distribution is adjusted by BatchNorm regularization and a ReLU activation function. The feature extraction layer outputs feature maps with 2048 channels, which are then reduced to one quarter of the original dimension, i.e. 512 channels, by a convolutional layer with kernel size 3 and stride 1. Next, a CSA module strengthens the network's attention to important features and enhances the semantic information of the feature map. The enhanced features are then passed through another convolutional layer for dimension reduction, followed by spatial compression via a global average pooling operation. Finally, the hash code and the classification information are predicted by two fully connected layers, the hash layer having an output dimension equal to the hash code length. A fully connected layer with the same output dimension is attached after the hash layer to generate the adaptive weights corresponding to the hash code. In order to make the hash code output by the network closer to 1 or -1, the vector output by the hash layer is passed through a Tanh-like activation function to make the hash code converge.
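For concreteness, the data flow described above can be sketched in PyTorch roughly as follows. The module names, the 512-channel reduction, the 64-bit code length and the exact layer arrangement are illustrative assumptions rather than the exact patented implementation; the CSA and WeightLayer modules referenced here are sketched after the corresponding equations further below.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class DHN(nn.Module):
    """Illustrative sketch of the network described above: ResNet50 backbone with
    avgpool/fc removed, a dimension-reducing convolution, CSA refinement, global
    average pooling, then parallel classification and hash heads plus a weight layer."""

    def __init__(self, num_classes, hash_bits=64):
        super().__init__()
        backbone = resnet50(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.reduce = nn.Conv2d(2048, 512, kernel_size=3, stride=1, padding=1)  # 2048 -> 512 channels
        self.csa = CSA(channels=512)            # channel + spatial attention (sketched later)
        self.post_conv = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1)
        self.gap = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.classifier = nn.Linear(512, num_classes)   # classification layer
        self.hash_layer = nn.Linear(512, hash_bits)     # hash layer
        self.weight_layer = WeightLayer()       # adaptive per-bit weights (sketched later)
        self.lam = 1.0                          # slope of the Tanh-like activation

    def forward(self, x):
        f = self.features(x)                        # B x 2048 x h x w
        f = self.csa(self.reduce(f))                # refined B x 512 x h x w
        f = self.gap(self.post_conv(f)).flatten(1)  # B x 512
        logits = self.classifier(f)                 # label prediction
        h = torch.tanh(self.lam * self.hash_layer(f))  # relaxed hash code in (-1, 1)
        w = self.weight_layer(h)                    # per-bit importance weights
        return logits, h, w
```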
The ResNet50-based feature extraction layer of this embodiment consists of a ResNet50 backbone with its average pooling layer and fully connected layer removed, followed by a CSA feature refinement layer and a global average pooling layer connected in sequence.
In this embodiment, a ResNet network is used as the backbone; the 256 × 256 matrix obtained in the preprocessing stage is input to a ResNet50 without its average pooling layer and fully connected layer to obtain a global feature representation of the image.
To encourage the network to attend to salient features and enhance the semantic information of the feature map, a CSA module is embedded after ResNet50 to infer the attention map sequentially along the channel and spatial dimensions. CSA is a lightweight module with negligible parameters and computational overhead. The attention map is then multiplied with the input feature map for adaptive optimization. In this mechanism, the feature maps generated by ResNet50 are weighted channel by channel and pixel by pixel according to the attention weights, so that important regions in the feature maps are enhanced and network performance is improved. Finally, the feature map output by the CSA is spatially compressed by global average pooling.
The classification layer and the hash layer of this embodiment are two fully connected layers arranged in parallel behind the global average pooling layer, which respectively predict the image label and the hash code under the supervision of the corresponding loss functions;
the weight layer of this embodiment is disposed after the hash layer, and generates a corresponding weight for each hash code.
The weight layer of this embodiment follows the hash layer, and its purpose is to generate a corresponding weight for each hash bit. Its input is therefore the hash code output by the hash layer, $H \in \mathbb{R}^{C \times 1 \times 1}$, i.e. a feature vector with $C$ channels and height and width both equal to 1, and its output is the attention map of the hash code, i.e. the weight vector $W \in \mathbb{R}^{C \times 1 \times 1}$. By the definition of the hash code, $H$ can be regarded as a feature map composed of $C$ channels, each containing only one element. In the weight layer, $H$ and its transpose are first matrix-multiplied and a softmax layer is applied, yielding the channel attention map $X \in \mathbb{R}^{C \times C}$, where each element $x_{ji}$ represents the effect of the $i$-th channel on the $j$-th channel. Finally, the attention map is summed along the longitudinal dimension to obtain the total impact of each channel on the other channels, $X' \in \mathbb{R}^{1 \times C}$, which represents the importance of the corresponding hash bits; transposing $X'$ gives the weight vector $W$:

$$X = \mathrm{softmax}\left(H H^{\top}\right), \qquad X'_{i} = \sum_{j=1}^{C} x_{ji}, \qquad W = X'^{\top}.$$
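A minimal PyTorch sketch of this parameter-free weight layer, under the reconstruction above (in particular, the softmax axis is an assumption), might look as follows:

```python
import torch
import torch.nn as nn


class WeightLayer(nn.Module):
    """Sketch of the weight layer: derives a per-bit importance vector from the
    hash code itself via a channel self-attention map (no learnable parameters)."""

    def forward(self, h):
        # h: B x C batch of relaxed hash codes; each bit is treated as a 1x1 channel.
        attn = torch.softmax(torch.bmm(h.unsqueeze(2), h.unsqueeze(1)), dim=-1)  # B x C x C, entries x_ji
        w = attn.sum(dim=1)   # sum over j: total influence of each bit on the others, B x C
        return w
```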
The CSA feature refinement layer of the deep convolutional neural network of this embodiment sequentially infers attention maps along two independent dimensions (channel and space), and then multiplies them with the input feature map for adaptive feature optimization. The input of the CSA feature refinement layer is the feature map generated by the feature extraction module. A channel-weighted result is first obtained through the channel attention module, and the final refined feature map is then generated through the spatial attention module. The attention-based adaptive feature learning process can be expressed as:

$$F' = M_c(F) \otimes F,$$
$$F'' = M_s(F') \otimes F',$$

where $\otimes$ denotes element-wise multiplication, $F \in \mathbb{R}^{C \times H \times W}$ is the input feature map, and $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and $M_s \in \mathbb{R}^{1 \times H \times W}$ denote the channel attention map and the spatial attention map, respectively. $F'$ is the result of weighting the input feature map along the channel dimension, and $F''$ is the final weighted result. The network is designed to chain the outputs of the two attention modules, which helps to obtain more accurate results.
The first step of the channel attention module is to compress the input feature map along the spatial dimensions to obtain one-dimensional vectors. When compressing along the spatial dimensions, the spatial information of the input feature map is aggregated using both average pooling and max pooling. The descriptors generated by the two pooling operations, $F^{c}_{avg}$ and $F^{c}_{max}$, are then each sent to a shared network. The shared network consists of a multi-layer perceptron (MLP) whose hidden layer size is reduced to $1/r$ of the input feature map, $r$ being the reduction ratio. The output feature vectors are then summed element-wise to obtain the channel attention map $M_c$. For a feature map, channel attention focuses on what is important in the image. Average pooling provides feedback for every pixel on the feature map, while max pooling provides gradient feedback only for the maximum-valued location on the feature map during gradient back-propagation. In summary, the channel attention is computed as follows:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^{c}_{avg})) + W_1(W_0(F^{c}_{max}))\big),$$

where $\sigma$ is the Sigmoid function. The MLP weights consist of $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$, which are shared between the average-pooled and max-pooled branches.
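The channel attention described by this equation can be sketched in PyTorch as follows; the reduction ratio r = 16 is an assumption for illustration:

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Sketch of M_c: a shared MLP applied to average- and max-pooled channel
    descriptors, summed and passed through a Sigmoid."""

    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),   # W0: C -> C/r
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),   # W1: C/r -> C
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):                          # f: B x C x H x W
        avg = self.mlp(f.mean(dim=(2, 3)))         # MLP(AvgPool(F))
        mx = self.mlp(f.amax(dim=(2, 3)))          # MLP(MaxPool(F))
        return self.sigmoid(avg + mx)[:, :, None, None]  # M_c: B x C x 1 x 1
```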
The spatial attention module takes the output feature map of the channel attention module as its input. The spatial attention mechanism compresses the input feature map along the channel dimension, generating two two-dimensional feature maps via average pooling and max pooling respectively: $F^{s}_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^{s}_{max} \in \mathbb{R}^{1 \times H \times W}$. The max pooling operation extracts the maximum along the channel dimension, and the average pooling operation extracts the average along the channel dimension. The two feature maps are then concatenated into a two-channel feature map and reduced to one channel by a standard convolutional layer, generating the spatial attention map. In summary, the spatial attention map is computed as follows:

$$M_s(F) = \sigma\big(f^{k \times k}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{k \times k}([F^{s}_{avg}; F^{s}_{max}])\big),$$

where $\sigma$ is the Sigmoid function and $[\,;\,]$ denotes the concatenation operation. $f^{k \times k}$ represents a convolution operation with kernel size $k$, where $k$ takes the value 3 or 7.
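The spatial attention and the full CSA wrapper (chaining channel then spatial attention, as in the equations for F' and F'' above) can be sketched similarly; choosing k = 7 here is an assumption:

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Sketch of M_s: channel-wise average and max maps, concatenated and convolved."""

    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=k, padding=k // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):                               # f: B x C x H x W
        avg = f.mean(dim=1, keepdim=True)               # B x 1 x H x W
        mx = f.amax(dim=1, keepdim=True)                # B x 1 x H x W
        return self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s: B x 1 x H x W


class CSA(nn.Module):
    """CSA refinement: F' = M_c(F) * F, then F'' = M_s(F') * F'."""

    def __init__(self, channels, r=16, k=7):
        super().__init__()
        self.ca = ChannelAttention(channels, r)
        self.sa = SpatialAttention(k)

    def forward(self, f):
        f = self.ca(f) * f     # channel-weighted F'
        f = self.sa(f) * f     # final refined F''
        return f
```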
The deep convolutional neural network of the embodiment is a trained deep convolutional neural network, and the training process comprises the following substeps:
step 1.1: a number of pictures are selected from an existing image data set as the retrieval set, which is then divided into a training set and a test set at a ratio of 5:1, i.e. for each class the number of training samples is 5 times the number of test samples. Each sample of the training set and the test set comprises an image and its corresponding label;
step 1.2: the training set is input into the deep convolutional neural network, back-propagation is performed using the SGD (stochastic gradient descent) algorithm under the supervision of the loss function to adjust the network parameters, and the iteration is repeated to obtain the optimized deep network model;
The loss function proposed in this embodiment is a supervised hash loss function consisting of three parts: the classification loss $L_C$, the weighted pairwise similarity loss $L_P$, and the quantization loss $L_Q$. Most previous methods do not take full advantage of the label information: image labels not only provide the similarity of image pairs, but also provide useful information for learning the hash function through image classification supervision. The first term $L_C$ maps semantically similar images to similar hash codes by minimizing the classification loss. The second term $L_P$ preserves the similarity of paired images by minimizing a weighted likelihood function. The third term $L_Q$ constrains the generated hash codes to converge to 1 or -1 by minimizing the squared-error loss between the network output and the target. The following deep hash optimization is therefore proposed:

$$\min_{\Theta} L = \lambda_1 L_C + \lambda_2 L_P + \lambda_3 L_Q,$$

where $\Theta$ is the set of all parameters of the deep hash function, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the balance parameters of the respective terms.
The classification is lost. Classification-label supervision is a component of the mixed attention mechanism and is implemented in the form of image classification loss. A single label for an image indicates that each instance can only be of the a or B category, while multiple labels indicate that each instance can be assigned to multiple categories. In order to fully utilize the label information, the invention elaborately constructs the classification loss. When the image label is a single label, a cross entropy loss function L is usedC-S(ii) a Using a multi-class cross entropy loss function L when the image label is multi-labelC-M
Figure BDA0003596437700000082
Figure BDA0003596437700000083
Wherein, y represents a prediction tag,
Figure BDA0003596437700000084
represents a genuine tag;
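In PyTorch terms, this choice of classification loss amounts to something like the following sketch; the single-label branch assumes integer class indices and the multi-label branch a 0/1 label matrix:

```python
import torch.nn.functional as F


def classification_loss(logits, labels, multi_label=False):
    """Sketch of L_C: softmax cross-entropy for single-label images,
    sigmoid (binary) cross-entropy for multi-label images."""
    if multi_label:
        return F.binary_cross_entropy_with_logits(logits, labels.float())  # L_C-M
    return F.cross_entropy(logits, labels)                                  # L_C-S
```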
Weighted maximum a posteriori estimation of hash codes. Pairwise feature learning and hash learning can effectively exploit the relationship between similar images. In the hash learning process, the labels are used as supervision information to constrain the hash codes of two corresponding images so that the distance between similar images becomes smaller. Therefore, a similarity set $S = \{s_{ij}\}$ is constructed from the semantic label set $L$: when $x_i$ and $x_j$ have the same label, $s_{ij} = 1$; when the labels of $x_i$ and $x_j$ are not identical, $s_{ij} = 0$. For a pair of binary hash codes $h_i$ and $h_j$ of length $K$, the relationship between their Hamming distance $\mathrm{dist}_H(\cdot,\cdot)$ and inner product $\langle \cdot,\cdot \rangle$ is as follows:

$$\mathrm{dist}_H(h_i, h_j) = \frac{1}{2}\big(K - \langle h_i, h_j \rangle\big),$$

where $\langle h_i, h_j \rangle$ denotes the inner product.
The Hamming distance is non-convex and non-smooth and hence difficult to optimize; because of the above equivalence, the inner product is used instead to quantify similarity.
In the existing work, most partiesThe method uses a Bayesian framework to combine the similarity correlation and the quantization error. Given pairwise similarity label sets
Figure BDA0003596437700000091
n sample point hash codes H ═ H1,...,hn]The weighted maximum a posteriori estimate (WMAP) of (a) is:
Figure BDA0003596437700000092
wherein
Figure BDA0003596437700000093
For weighting the likelihood functions by LPMeaning that the inner product of two similar points is made as small as possible, while the inner product of two different points is made as large as possible. w is aijRepresents each sample pair (x)i,xj,sij) Importance to total losses. Typically, the number of different image pairs is much greater than the number of similar image pairs in the training data. Using wijThe influence of different image pairs is weakened, and the influence of similar image pairs is enhanced, so that the aim of balancing data is fulfilled. p (H) is a prior distribution, using LQAnd (4) showing. Due to the fact that
Figure BDA0003596437700000099
Wherein each similar label can only be sij1 (analogous) or sijEqual to 0 (dissimilar), therefore
Figure BDA0003596437700000094
Wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0003596437700000095
is a collection of similar pairs of the same or similar pairs,
Figure BDA0003596437700000096
is a collection of dissimilar pairs.
For each pair, the conditional probability $P(s_{ij} \mid h_i, h_j)$ of the similarity label $s_{ij}$ given a pair of hash codes $h_i$ and $h_j$ can be defined as a pairwise logistic function:

$$P(s_{ij} \mid h_i, h_j) = \begin{cases} \sigma\big(\langle h_i, h_j \rangle\big), & s_{ij} = 1, \\ 1 - \sigma\big(\langle h_i, h_j \rangle\big), & s_{ij} = 0, \end{cases}$$

where $\sigma(x) = 1/(1 + e^{-\alpha x})$ is an adaptive Sigmoid function with a hyperparameter $\alpha$ that controls its bandwidth.

Substituting the definition of $P(s_{ij} \mid h_i, h_j)$ into the WMAP estimate yields the following optimization problem:

$$L_P = \sum_{s_{ij} \in S} w_{ij} \Big( \log\big(1 + e^{\alpha \langle h_i, h_j \rangle}\big) - \alpha\, s_{ij} \langle h_i, h_j \rangle \Big).$$
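A PyTorch sketch of this weighted pairwise loss on a mini-batch might read as follows; the particular balancing scheme for w_ij and the value of alpha are assumptions for illustration:

```python
import torch
import torch.nn.functional as F


def pairwise_similarity_loss(h, s, alpha=1.0):
    """Sketch of L_P: h is a B x K batch of relaxed hash codes, s a B x B matrix of
    0/1 similarity labels; w_ij re-balances similar vs. dissimilar pairs."""
    ip = alpha * (h @ h.t())                       # alpha * <h_i, h_j>
    n_sim = s.sum().clamp(min=1.0)                 # number of similar pairs
    n_dis = (1.0 - s).sum().clamp(min=1.0)         # number of dissimilar pairs
    w = s * (s.numel() / n_sim) + (1.0 - s) * (s.numel() / n_dis)   # w_ij
    # w_ij * ( log(1 + exp(alpha <h_i,h_j>)) - alpha s_ij <h_i,h_j> ), softplus for stability
    return (w * (F.softplus(ip) - s * ip)).mean()
```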
In order to facilitate optimizing the loss function with gradient descent, the discrete constraint on the hash code is removed during network training and a Tanh activation function is added after the hash layer so that the network output falls between -1 and 1. Meanwhile, considering that the hash code is a binary code, a Tanh-like activation function together with a quantization loss is used instead of the sign function to discretize the code:

$$o(x) = \tanh(\lambda x).$$

When $\lambda = 1$, $o(x)$ is the standard hyperbolic tangent function; when $\lambda$ is very large, $o(x)$ can be regarded as a standard sign function, yet unlike the sign function it is differentiable, which facilitates back-propagation through the network.
In order to ensure that the generated hash code is completely converged into a binary code, quantization loss LQ is introduced to the generated hash code hiAnd (5) thinning. Similar to DHN, the present invention uses bimodal laplacian priors for quantization, with the formula:
Figure BDA0003596437700000102
where e is an adjustment parameter. P (h)i) The definition of (2) is substituted into the WMAP estimate, resulting in the following quantization loss:
Figure BDA0003596437700000103
wherein
Figure BDA0003596437700000104
Is a full 1 vector. Due to L1The norm is non-smooth, resulting in difficulty in calculating L in back propagationQThe present invention uses Mean Squared Error (MSE) function to calculate the quantization loss of the hash code:
Figure BDA0003596437700000105
where n is the number of samples.
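The quantization loss, the combined objective and one SGD training epoch can then be sketched as follows, reusing the loss sketches above; the λ values, the optimizer hyper-parameters and the single-label similarity construction are assumptions:

```python
import torch


def quantization_loss(h):
    """Sketch of L_Q: mean squared distance between |h| and the all-ones vector,
    pushing every relaxed hash bit toward -1 or +1."""
    return torch.mean((h.abs() - 1.0) ** 2)


def total_loss(logits, labels, h, s, lambdas=(1.0, 1.0, 1.0), multi_label=False):
    """Combined objective L = lambda1*L_C + lambda2*L_P + lambda3*L_Q."""
    l1, l2, l3 = lambdas
    return (l1 * classification_loss(logits, labels, multi_label)
            + l2 * pairwise_similarity_loss(h, s)
            + l3 * quantization_loss(h))


def train_one_epoch(model, loader, optimizer):
    """One epoch of SGD back-propagation under the supervised hash loss (sketch)."""
    for images, labels in loader:
        logits, h, _ = model(images)
        s = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # s_ij = 1 iff same label
        loss = total_loss(logits, labels, h, s)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


# e.g. optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```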
In this embodiment, the training result is verified on the large data set ImageNet, and the training set and the test set pictures are respectively input into the trained network model to generate corresponding hash codes DatabaseHash and TestHash, and corresponding weights DatabaseWeight and TestWeight; and calculating the Hamming distance between each image in the test set and the image hash code in the data set, sequencing the images from small to large, and sequentially outputting the query result. The result shows that the retrieval average precision mAP reaches 82.8 percent and is far higher than the prior most advanced method.
step 2: the similarity between the hash code of the query image and those of the images in the existing image database is calculated, and the images with the highest similarity are taken as the retrieval results;
when a retrieval image is entered into the existing image database, it is immediately input into the deep convolutional neural network to generate a hash code databaseHash, which is stored in the image database as a feature index for retrieval;
if the deep convolutional neural network generates 64-bit hash codes but hash codes of shorter length (e.g. 48, 32 or 24 bits) are needed to meet retrieval-efficiency requirements, the corresponding hash bits can be selected from the currently available long hash code in descending order of the query weight queryWeight to obtain the low-bit hash code.
In this embodiment, if there is a strict requirement on the retrieval efficiency, the short hash code may be generated from the available long hash codes according to the generated adaptive weight, so as to save the cost of regenerating hash codes of different lengths. The scheme learns the weight vector corresponding to the importance of each bit and then converts the long hash code into the required short hash code. For the generated n-bit hash code, a weight vector W is used to describe the importance of each bit in the n-bit hash code in terms of similarity. With the weight layer of step 2.4, the corresponding weight vector is also generated at the same time as the hash code is generated. In the pairwise similarity loss function, the pairwise similarity loss function is multiplied by the hash code correspondingly, and then the original unweighted hash code is input to the pairwise similarity loss function to simultaneously learn the hash code and the weight vector. When a shorter hash code is needed, only the corresponding bit with the higher weight value needs to be taken from the long hash code.
The invention, in turn, performs Channel and Spatial Attention (CSA) operations after the feature layer of ResNet50, allowing the hash network to further learn what to focus on along the channel dimension and where to focus along the spatial dimension. However, if there are many objects in an image, the network still cannot distinguish which one is the target of interest. To solve this problem, the invention further proposes a top-down attention mechanism supervised by classification labels. In principle, the invention uses classification labels to constrain which class an image should be classified into, thereby encouraging the network to focus on learning the features of that class. The classification labels typically used for supervision refer to objects of interest, which allows the network to focus more on the truly relevant objects when generating hash codes while ignoring irrelevant ones. CSA, which learns salient objects in the pixel sense, is in fact a bottom-up attention mechanism. In this way, the invention establishes a hybrid attention mechanism combining bottom-up pixel saliency with top-down semantic supervision, in which the CSA is driven toward regions representing the label semantics rather than merely visually salient objects. The combination of CSA and the classification loss enables the network to better identify the target in the image, thereby generating more discriminative hash codes and achieving better retrieval performance.
In addition, in the most advanced deep hash method, only one length of hash code can be obtained in one training process. In other words, in order to obtain hash codes of different lengths (e.g., 12 bits, 24 bits, 32 bits, etc.), it takes time to retrain, and hash codes of different lengths must all be retained, which results in a large amount of time and memory consumption. In order to solve the problem, the invention provides a self-adaptive weight learning algorithm, which generates a weight for each bit of hash code generated by a deep network. Each weight represents the importance of the corresponding hash code bit to the image representation. For short hash codes with different lengths, the long hash codes are generated only once through training, and then corresponding bits with higher weight values are taken from the long hash codes.
The invention provides a deep hash network with a mixed attention mechanism and self-adaptive weighting. CSA was introduced after the feature extraction layer of ResNet50 to emphasize the semantically significant features of class label supervision, resulting in more discriminative hash codes. In addition, the invention provides a self-adaptive weighting method, and the method can generate the long hash code and the weight of the corresponding bit only by training the model once. Thus, the short hash code can be obtained by sub-sampling from the long hash code according to the weights.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A large-scale image retrieval method based on a deep convolutional neural network is characterized by comprising the following steps:
step 1: inputting an image to be queried into a deep convolutional neural network to generate a hash code queryHash and a weight queryWeight;
the deep convolutional neural network consists of a ResNet50-based feature extraction layer, a classification layer, a hash layer and a weight layer;
the ResNet50-based feature extraction layer consists of a ResNet50 backbone with its average pooling layer and fully connected layer removed, followed by a CSA feature refinement layer and a global average pooling layer connected in sequence;
the classification layer and the hash layer are two fully connected layers arranged in parallel behind the global average pooling layer, which respectively predict the image label and the hash code under the supervision of corresponding loss functions;
the weight layer is arranged behind the hash layer and generates corresponding weight for each hash code;
step 2: calculating the similarity of the hash codes of the query image and the image in the existing image database, and taking the image with the highest similarity as a retrieval result;
when a retrieval image is entered into the existing image database, it is immediately input into the deep convolutional neural network to generate a hash code databaseHash, which is stored in the image database as a feature index for retrieval;
if the deep convolutional neural network generates 64-bit hash codes but a shorter hash code is needed, the corresponding hash bits are selected from the currently available long hash code in descending order of the query weight queryWeight to obtain the low-bit hash code.
2. The large-scale image retrieval method based on the deep convolutional neural network as claimed in claim 1, wherein: in the deep convolutional neural network, the first part of the ResNet50-based feature extraction layer is the feature extraction portion of ResNet, comprising an independent convolutional layer and 4 convolutional residual structures, each residual block containing several convolutional layers, the feature distribution being adjusted by BatchNorm regularization and a ReLU activation function after each convolution operation; the feature extraction layer outputs feature maps with 2048 channels, which are reduced to one quarter of the original dimension by a convolutional layer with kernel size 3 and stride 1; next, the CSA feature refinement layer strengthens the network's attention to important features and enhances the semantic information of the feature map; the enhanced features are then passed through a convolutional layer for dimension reduction, followed by spatial compression via global average pooling; finally, the hash code and the classification information are respectively predicted by two fully connected layers; a fully connected layer with the same output dimension is attached after the hash layer to generate the adaptive weights corresponding to the hash code; the vector output by the hash layer is passed through a Tanh-like activation function to make the hash code converge.
3. The large-scale image retrieval method based on the deep convolutional neural network as claimed in claim 1, wherein: the weight layer of the deep convolutional neural network takes as input the hash code $H \in \mathbb{R}^{C \times 1 \times 1}$ output by the hash layer, and outputs the attention map of the hash code, i.e. the weight vector $W \in \mathbb{R}^{C \times 1 \times 1}$; the hash code $H$ is regarded as a feature map composed of $C$ channels, each channel containing only one element, i.e. a feature vector with $C$ channels and height and width both equal to 1; in the weight layer, $H$ and its transpose are first matrix-multiplied and a softmax layer is applied, yielding the channel attention map $X \in \mathbb{R}^{C \times C}$, in which each element $x_{ji}$ represents the influence of the $i$-th channel on the $j$-th channel; finally, the attention map is summed along the longitudinal dimension to obtain the total impact of each channel on the other channels, $X' \in \mathbb{R}^{1 \times C}$, which represents the importance of the corresponding hash bits, and $X'$ is transposed to obtain the weight vector $W$:

$$X = \mathrm{softmax}\left(H H^{\top}\right), \qquad X'_{i} = \sum_{j=1}^{C} x_{ji}, \qquad W = X'^{\top}.$$
4. The large-scale image retrieval method based on the deep convolutional neural network as claimed in claim 1, wherein: the CSA feature refinement layer of the deep convolutional neural network sequentially infers attention maps along two independent dimensions, channel and space, and multiplies them with the input feature map for adaptive feature optimization; the input of the CSA feature refinement layer is the feature map generated by feature extraction, a channel-weighted result is obtained through the channel attention module, and the final refined feature map is then generated through the spatial attention module; the adaptive feature optimization process is:

$$F' = M_c(F) \otimes F,$$
$$F'' = M_s(F') \otimes F',$$

where $\otimes$ denotes element-wise multiplication, $F \in \mathbb{R}^{C \times H \times W}$ is the input feature map, and $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and $M_s \in \mathbb{R}^{1 \times H \times W}$ denote the channel attention map and the spatial attention map, respectively; $F'$ is the result of weighting the input feature map along the channel dimension, and $F''$ is the final weighted result;

the first step of the channel attention module is to compress the input feature map along the spatial dimensions to obtain one-dimensional vectors; while compressing along the spatial dimensions, the spatial information of the input feature map is aggregated using both average pooling and max pooling; the descriptors generated by the two pooling operations, $F^{c}_{avg}$ and $F^{c}_{max}$, are each sent to a shared network; the shared network consists of a multi-layer perceptron (MLP) whose hidden layer size is reduced to $1/r$ of the input feature map, $r$ being the reduction ratio; the output feature vectors are then summed element-wise to obtain the channel attention map $M_c$; the channel attention is computed as follows:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^{c}_{avg})) + W_1(W_0(F^{c}_{max}))\big),$$

where $\sigma$ is the Sigmoid function, and the MLP weights $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are shared between the average-pooled and max-pooled branches;

the spatial attention module takes the output feature map of the channel attention module as its input, compresses the input feature map along the channel dimension, and generates two two-dimensional feature maps via average pooling and max pooling respectively: $F^{s}_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^{s}_{max} \in \mathbb{R}^{1 \times H \times W}$; the max pooling operation extracts the maximum along the channel dimension, and the average pooling operation extracts the average along the channel dimension; the two feature maps are then concatenated along the channel dimension into a two-channel feature map and reduced to one channel by a standard convolutional layer to generate the spatial attention map; the spatial attention map is computed as follows:

$$M_s(F) = \sigma\big(f^{k \times k}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{k \times k}([F^{s}_{avg}; F^{s}_{max}])\big),$$

where $\sigma$ is the Sigmoid function, $[\,;\,]$ denotes the concatenation operation, and $f^{k \times k}$ represents a convolution operation with kernel size $k$, $k$ taking the value 3 or 7.
5. The large-scale image retrieval method based on the deep convolutional neural network as claimed in any one of claims 1 to 4, wherein: the deep convolutional neural network is a trained deep convolutional neural network, and the training process comprises the following substeps:
step 1.1: a plurality of pictures are selected from the existing image data set to be used as a retrieval set, and then the retrieval set is divided into a training set and a testing set. Each sample of the training set and the test set comprises an image and a corresponding label;
step 1.2: the training set is input into the deep convolutional neural network, back-propagation is performed using the SGD (stochastic gradient descent) algorithm under the supervision of the loss function to adjust the network parameters, and the iteration is repeated to obtain the optimized deep convolutional neural network;

wherein the loss function consists of three parts: the classification loss $L_C$, the weighted pairwise similarity loss $L_P$, and the quantization loss $L_Q$; the first term $L_C$ maps semantically similar images to similar hash codes by minimizing the classification loss, the second term $L_P$ maintains the similarity of image pairs by minimizing a weighted likelihood function, and the third term $L_Q$ constrains the generated hash codes to converge to 1 or -1 by minimizing the squared-error loss between the network output and the target; the deep hash optimization function $L$ is:

$$\min_{\Theta} L = \lambda_1 L_C + \lambda_2 L_P + \lambda_3 L_Q,$$

wherein $\Theta$ is the set of all parameters of the deep hash optimization function, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the balance parameters of the respective terms;

when the image label is a single label, the cross-entropy loss function $L_{C\text{-}S}$ is used as the classification loss $L_C$; when the image label is a multi-label, the multi-class cross-entropy loss function $L_{C\text{-}M}$ is used as the classification loss $L_C$:

$$L_{C\text{-}S} = -\sum_{c} \hat{y}_c \log\!\left(\frac{e^{y_c}}{\sum_{c'} e^{y_{c'}}}\right),$$

$$L_{C\text{-}M} = -\sum_{c} \Big[\hat{y}_c \log \sigma(y_c) + (1-\hat{y}_c)\log\big(1-\sigma(y_c)\big)\Big],$$

wherein $y$ denotes the predicted label and $\hat{y}$ denotes the ground-truth label;

$$L_P = \sum_{s_{ij} \in S} w_{ij} \Big( \log\big(1 + e^{\alpha \langle h_i, h_j \rangle}\big) - \alpha\, s_{ij} \langle h_i, h_j \rangle \Big),$$

wherein, given the pairwise similarity label set $S = \{s_{ij}\}$, $s_{ij} = 1$ when $x_i$ and $x_j$ have the same label and $s_{ij} = 0$ when the labels of $x_i$ and $x_j$ are not identical; $w_{ij}$ represents the importance of each sample pair $(x_i, x_j, s_{ij})$ to the total loss; for a pair of binary hash codes $h_i$ and $h_j$, $\langle \cdot, \cdot \rangle$ denotes the inner product and $\alpha$ denotes the hyperparameter;

$$L_Q = \frac{1}{n} \sum_{i=1}^{n} \big\lVert\, |h_i| - \mathbf{1} \,\big\rVert_2^2,$$

wherein $n$ is the number of samples.
6. A large-scale image retrieval system based on a deep convolutional neural network is characterized by comprising the following modules:
the module 1 is used for inputting a query image into a deep convolutional neural network to generate a Hash code queryHash and a weight queryWeight;
the deep convolutional neural network consists of a ResNet50-based feature extraction layer, a classification layer, a hash layer and a weight layer;
the ResNet50-based feature extraction layer consists of a ResNet50 backbone with its average pooling layer and fully connected layer removed, followed by a CSA feature refinement layer and a global average pooling layer connected in sequence;
the classification layer and the hash layer are two fully connected layers arranged in parallel behind the global average pooling layer, which respectively predict the image label and the hash code under the supervision of corresponding loss functions;
the weight layer is arranged behind the hash layer and generates corresponding weight for each hash code;
the module 2 is used for calculating the similarity between the hash code of the query image and the hash codes of the images in the existing image database, and taking the image with the highest similarity as the retrieval result;
when a retrieval image is entered into the existing image database, it is immediately input into the deep convolutional neural network to generate a hash code databaseHash, which is stored in the image database as a feature index for retrieval;
if the deep convolutional neural network generates 64-bit hash codes but a shorter hash code is needed, the corresponding hash bits are selected from the currently available long hash code in descending order of the query weight queryWeight to obtain the low-bit hash code.
CN202210393416.9A 2022-04-14 2022-04-14 Large-scale image retrieval method and system based on deep convolutional neural network Pending CN114780767A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210393416.9A CN114780767A (en) 2022-04-14 2022-04-14 Large-scale image retrieval method and system based on deep convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210393416.9A CN114780767A (en) 2022-04-14 2022-04-14 Large-scale image retrieval method and system based on deep convolutional neural network

Publications (1)

Publication Number Publication Date
CN114780767A true CN114780767A (en) 2022-07-22

Family

ID=82429102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210393416.9A Pending CN114780767A (en) 2022-04-14 2022-04-14 Large-scale image retrieval method and system based on deep convolutional neural network

Country Status (1)

Country Link
CN (1) CN114780767A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115964527A (en) * 2023-01-05 2023-04-14 北京东方通网信科技有限公司 Label representation construction method for single label image retrieval
CN115964527B (en) * 2023-01-05 2023-09-26 北京东方通网信科技有限公司 Label characterization construction method for single-label image retrieval

Similar Documents

Publication Publication Date Title
Mascarenhas et al. A comparison between VGG16, VGG19 and ResNet50 architecture frameworks for Image Classification
WO2021042828A1 (en) Neural network model compression method and apparatus, and storage medium and chip
Zhang et al. Context encoding for semantic segmentation
Yang et al. A survey of DNN methods for blind image quality assessment
Donahue et al. Decaf: A deep convolutional activation feature for generic visual recognition
CN109063724B (en) Enhanced generation type countermeasure network and target sample identification method
CN111783831B (en) Complex image accurate classification method based on multi-source multi-label shared subspace learning
CN109063719B (en) Image classification method combining structure similarity and class information
CN109743642B (en) Video abstract generation method based on hierarchical recurrent neural network
CN113806746B (en) Malicious code detection method based on improved CNN (CNN) network
Tang et al. Deep fishernet for object classification
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN112434628B (en) Small sample image classification method based on active learning and collaborative representation
Chen et al. An Improved Deep Fusion CNN for Image Recognition.
CN115100709B (en) Feature separation image face recognition and age estimation method
CN110188827A (en) A kind of scene recognition method based on convolutional neural networks and recurrence autocoder model
CN112507800A (en) Pedestrian multi-attribute cooperative identification method based on channel attention mechanism and light convolutional neural network
CN115222998B (en) Image classification method
CN115457332A (en) Image multi-label classification method based on graph convolution neural network and class activation mapping
Luan et al. Sunflower seed sorting based on convolutional neural network
CN114329031A (en) Fine-grained bird image retrieval method based on graph neural network and deep hash
CN112990340B (en) Self-learning migration method based on feature sharing
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
Bai et al. Softly combining an ensemble of classifiers learned from a single convolutional neural network for scene categorization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination