CN114780767A - Large-scale image retrieval method and system based on deep convolutional neural network - Google Patents

Large-scale image retrieval method and system based on deep convolutional neural network

Info

Publication number
CN114780767A
CN114780767A
Authority
CN
China
Prior art keywords
layer
hash
image
neural network
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210393416.9A
Other languages
Chinese (zh)
Inventor
王中元
裴盈娇
陈何玲
何政
邵振峰
邹华
肖进胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210393416.9A priority Critical patent/CN114780767A/en
Publication of CN114780767A publication Critical patent/CN114780767A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a large-scale image retrieval method and system based on a deep convolutional neural network. Large-scale image retrieval is carried out by constructing a deep convolutional neural network (DHN) comprising four parts: ResNet50-based feature extraction with Channel and Spatial Attention (CSA) feature refinement, a classification layer, a hash layer, and a weight layer. The DHN realizes bottom-up pixel saliency attention through the CSA and top-down semantic constraint through classification label supervision. The DHN adopts an adaptive weighted learning algorithm to generate a weight for each bit of the hash code, and then generates short hash codes directly from the long hash code according to the bit importance represented by the weights. The method has higher hash code generation precision and speed, and is therefore suitable for large-scale image retrieval tasks.

Description

Large-scale image retrieval method and system based on deep convolutional neural network
Technical Field
The invention belongs to the technical field of artificial intelligence, relates to a large-scale image retrieval method and a large-scale image retrieval system, and particularly relates to a large-scale image retrieval method and a large-scale image retrieval system based on mixed attention and adaptive weight.
Background
The large-scale image retrieval task is, given a query image, to quickly find a certain number of images with similar content from a million-scale image database by means of image features. Conventional tree-based retrieval methods typically use similarity measures of high time complexity, such as the Euclidean distance, to compute the distance between features, and exhibit good performance when processing low-dimensional data. However, when the data volume reaches millions or hundreds of millions and the feature dimension grows substantially, the curse of dimensionality arises and time performance degrades markedly. To significantly improve storage and retrieval efficiency, hash retrieval has been proposed, which converts high-dimensional vectors in the original space into binary codes in Hamming space.
Existing hash methods are mainly divided into traditional hash methods and hash methods based on deep neural networks. The most classical traditional hash method is LSH, which maps original data points that are close in the high-dimensional space into the same hash bucket through K concatenated position-sensitive hash functions, so that at query time a neighbor search only needs to be performed in the hash bucket corresponding to the query picture. However, when the data size is large, LSH must trade longer code lengths for improved retrieval performance and therefore requires a large amount of storage space, making it unsuitable for large-scale image retrieval.
The semi-supervised hash algorithm SSH proposed by Wang et al. is a classical semi-supervised hash method; it minimizes the empirical error on labeled data and regularizes all data to maximize computable attributes (such as variance) and the independence between hash bits. After relaxing and applying the orthogonality constraint, uncorrelated hash codes can be derived.
In order to make better use of label information, the kernel-based hash method KSH proposed by Liu et al. establishes a similarity relationship between data according to the label information and obtains better results. Earlier hash methods directly optimize the Hamming distance, but the Hamming distance is non-convex and non-smooth and thus difficult to optimize. KSH instead uses the equivalence between the inner product of binary codes and the Hamming distance to obtain an efficient and easily optimized objective function, and constructs compact hash codes.
Traditional hash-based image retrieval methods mainly use manually designed image descriptors such as SIFT, LBP, HOG and SURF features. However, these features describe local characteristics of the image, cannot fully express the information implicit in the image, and lack the high-level semantic information that humans can understand. Convolutional neural networks simulate the human visual mechanism, can comprehensively express image information, and are more suitable for practical applications. Thus, deep hash learning has become increasingly popular in recent years.
In 2014, Xia et al. proposed the CNNH algorithm, which combines CNN with hash codes through a two-stage method: the first stage learns the hash codes, and the second stage trains the CNN to output continuous hash codes. CNNH automatically learns image features and a nonlinear hash function through the CNN and fits the binary codes, significantly improving retrieval performance. However, image representation and hash code learning are separated, so end-to-end learning is not possible. In 2015, Lai et al. improved the CNNH network and proposed NINH, which uses triplets of images as the network input, trains two subnetworks simultaneously, and uses a triplet loss function so that similar images in a triplet obtain similar hash codes while dissimilar images obtain markedly different hash codes, thereby optimizing feature extraction and hash coding simultaneously. In 2017, Cao et al. proposed the HashNet framework, which learns hash codes directly by a convergent continuation method so as to learn accurate binary codes from continuous similarity data. In addition, in order to maintain similarity between images, HashNet designs a weighted pairwise cross-entropy loss based on the cross-entropy loss function. In 2018, Cao et al. proposed the DCH architecture, which generates compact and concentrated binary hash codes by jointly optimizing a Cauchy cross-entropy loss and a Cauchy quantization loss, realizing efficient Hamming-space retrieval. In 2020, Wang et al. proposed a new global similarity measure that encourages the hash codes of similar data pairs to converge to a common center and the hash codes of dissimilar data pairs to converge to different centers, greatly improving learning efficiency and retrieval accuracy.
Existing deep hash methods have achieved high retrieval accuracy, but most of them extract only low-level features of the pictures, and the extracted features are easily interfered with by irrelevant objects in the pictures, so that similar data points may generate dissimilar hash codes; retrieval accuracy and robustness therefore still need to be improved.
Disclosure of Invention
In order to overcome the defects of the conventional deep hash algorithm, the invention provides a large-scale image retrieval method and system based on a deep convolutional neural network.
The technical scheme adopted by the method is as follows: a large-scale image retrieval method based on a deep convolutional neural network comprises the following steps:
step 1: inputting an image to be queried into a deep convolutional neural network to generate a Hash code queryHash and a weight queryWeight;
the deep convolutional neural network is composed of a ResNet50-based feature extraction layer, a classification layer, a hash layer and a weight layer;
the ResNet50-based feature extraction layer consists of a ResNet50 backbone with its average pooling layer and fully connected layer removed, followed by a CSA feature refinement layer and a global average pooling layer connected in sequence;
the classification layer and the hash layer are two fully connected layers arranged in parallel behind the global average pooling layer, which respectively predict the image label and the hash code under the supervision of corresponding loss functions;
the weight layer is arranged behind the hash layer and generates corresponding weight for each hash code;
step 2: calculating the similarity of the hash codes of the query image and the image in the existing image database, and taking the image with the highest similarity as a retrieval result;
when a retrieval image is entered into the existing image database, it is immediately input into the deep convolutional neural network to generate a hash code databaseHash, which is stored in the image database as a feature index for retrieval;
if the deep convolutional neural network generates 64-bit hash codes but a shorter hash code is needed, the corresponding hash bits are selected from the currently available long hash code in descending order of the query weight queryWeight to directly obtain the low-bit hash code.
The technical scheme adopted by the system of the invention is as follows: a large-scale image retrieval system based on a deep convolutional neural network comprises the following modules:
the module 1 is used for inputting the query image into a deep convolutional neural network to generate a hash code queryHash and a weight queryWeight;
the deep convolutional neural network consists of a ResNet50-based feature extraction layer, a classification layer, a hash layer and a weight layer;
the ResNet50-based feature extraction layer consists of a ResNet50 backbone with its average pooling layer and fully connected layer removed, followed by a CSA feature refinement layer and a global average pooling layer connected in sequence;
the classification layer and the hash layer are two fully connected layers arranged in parallel behind the global average pooling layer, which respectively predict the image label and the hash code under the supervision of corresponding loss functions;
the weight layer is arranged behind the hash layer and generates corresponding weight for each hash code;
the module 2 is used for calculating the similarity between the hash code of the query image and the hash codes of the images in the existing image database, and taking the image with the highest similarity as the retrieval result;
when a retrieval image is entered into the existing image database, it is immediately input into the deep convolutional neural network to generate a hash code databaseHash, which is stored in the image database as a feature index for retrieval;
if the deep convolutional neural network generates 64-bit hash codes but a shorter hash code is needed, the corresponding hash bits are selected from the currently available long hash code in descending order of the query weight queryWeight to obtain the low-bit hash code.
Compared with the prior art, the invention has the following advantages and positive effects:
1) The invention provides an end-to-end deep hash framework for quickly learning the high-precision hash codes required by image retrieval. It mainly comprises a ResNet-based feature extraction module, a CSA module for feature refinement, a classification layer for semantic supervision, a hash layer for quantizing hash codes, and a weight layer for generating bit weights. The end-to-end framework facilitates overall optimization and reduces the complexity of engineering implementation.
2) The invention provides a mixed attention mechanism consisting of bottom-up CSA and top-down classification label supervision. The mechanism encourages the network to learn the main semantic information of interest, so that consistent hash codes are generated for similar images while excluding the interference of secondary or unrelated objects, improving the robustness and precision of hash retrieval.
3) The invention provides an adaptive weighted learning strategy, which learns the weight corresponding to each hash bit and generates shorter hash codes from the available long hash codes according to the importance defined by the weights, thereby avoiding retraining of the model and significantly saving the space-time cost of model training.
Drawings
FIG. 1: a flow chart of an embodiment of the invention;
FIG. 2: a structural diagram of the deep convolutional neural network of an embodiment of the invention.
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and the implementation examples, it is to be understood that the implementation examples described herein are only for the purpose of illustration and explanation and are not to be construed as limiting the present invention.
Referring to fig. 1, the large-scale image retrieval method based on the deep convolutional neural network provided by the invention comprises the following steps:
step 1: inputting the query image into a deep convolutional neural network to generate a Hash code queryHash and a weight queryWeight;
Referring to fig. 2, the deep convolutional neural network of this embodiment is composed of a ResNet50-based feature extraction layer, a classification layer, a hash layer, and a weight layer. The network is an end-to-end framework based on a hybrid attention mechanism and adaptive weights, learning hash codes and weight vectors simultaneously. Under the supervision of the classification labels, Channel and Spatial Attention (CSA) is combined to avoid interference from the image background or extraneous objects. The mixed attention mechanism, consisting of bottom-up CSA and top-down classification semantic supervision, can emphasize salient regions of interest and dominant information at the same time, maximizing the discriminability of the hash space. The weight layer generates a weight for each bit of the query image's hash code, and short hash codes are then generated directly from the long hash codes of the query image and the retrieval images according to the bit importance represented by the weights;
The first part, the ResNet50-based feature extraction layer, is the feature extraction portion of ResNet, composed of an independent convolutional layer and 4 convolutional residual structures; each residual block contains several convolutional layers, and after each convolution operation the feature distribution is adjusted by BatchNorm regularization and a ReLU activation function. The feature extraction layer outputs feature maps with 2048 channels, which are then reduced to one quarter of the original dimension, i.e. 512 channels, by a convolutional layer with kernel size 3 and stride 1. Next, a CSA module strengthens the network's attention to important features and enhances the semantic information of the feature map. The enhanced features are then passed through another convolutional layer for dimension reduction, followed by spatial compression via a global average pooling operation. Finally, the hash code and the classification information are predicted by two fully connected layers, the hash layer having an output dimension equal to the hash code length. A fully connected layer with the same output dimension is attached after the hash layer to generate the adaptive weights corresponding to the hash code. In order to make the hash code output by the network closer to 1 or -1, the vector output by the hash layer is passed through a Tanh-like activation function to make the hash code converge.
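For concreteness, the data flow described above can be sketched in PyTorch roughly as follows. The module names, the 512-channel reduction, the 64-bit code length and the exact layer arrangement are illustrative assumptions rather than the exact patented implementation; the CSA and WeightLayer modules referenced here are sketched after the corresponding equations further below.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class DHN(nn.Module):
    """Illustrative sketch of the network described above: ResNet50 backbone with
    avgpool/fc removed, a dimension-reducing convolution, CSA refinement, global
    average pooling, then parallel classification and hash heads plus a weight layer."""

    def __init__(self, num_classes, hash_bits=64):
        super().__init__()
        backbone = resnet50(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.reduce = nn.Conv2d(2048, 512, kernel_size=3, stride=1, padding=1)  # 2048 -> 512 channels
        self.csa = CSA(channels=512)            # channel + spatial attention (sketched later)
        self.post_conv = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=1)
        self.gap = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.classifier = nn.Linear(512, num_classes)   # classification layer
        self.hash_layer = nn.Linear(512, hash_bits)     # hash layer
        self.weight_layer = WeightLayer()       # adaptive per-bit weights (sketched later)
        self.lam = 1.0                          # slope of the Tanh-like activation

    def forward(self, x):
        f = self.features(x)                        # B x 2048 x h x w
        f = self.csa(self.reduce(f))                # refined B x 512 x h x w
        f = self.gap(self.post_conv(f)).flatten(1)  # B x 512
        logits = self.classifier(f)                 # label prediction
        h = torch.tanh(self.lam * self.hash_layer(f))  # relaxed hash code in (-1, 1)
        w = self.weight_layer(h)                    # per-bit importance weights
        return logits, h, w
```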
The ResNet50-based feature extraction layer of this embodiment consists of a ResNet50 backbone with its average pooling layer and fully connected layer removed, followed by a CSA feature refinement layer and a global average pooling layer connected in sequence.
In this embodiment, a ResNet network is used as the backbone; the 256 × 256 matrix obtained in the preprocessing stage is input to a ResNet50 without its average pooling layer and fully connected layer to obtain a global feature representation of the image.
To encourage the network to attend to salient features and enhance the semantic information of the feature map, a CSA module is embedded after ResNet50 to infer the attention map sequentially along the channel and spatial dimensions. CSA is a lightweight module with negligible parameters and computational overhead. The attention map is then multiplied with the input feature map for adaptive optimization. In this mechanism, the feature maps generated by ResNet50 are weighted channel by channel and pixel by pixel according to the attention weights, so that important regions in the feature maps are enhanced and network performance is improved. Finally, the feature map output by the CSA is spatially compressed by global average pooling.
The classification layer and the hash layer of this embodiment are two fully connected layers arranged in parallel behind the global average pooling layer, which respectively predict the image label and the hash code under the supervision of the corresponding loss functions;
the weight layer of this embodiment is disposed after the hash layer, and generates a corresponding weight for each hash code.
The weight layer of this embodiment follows the hash layer, and its purpose is to generate a corresponding weight for each hash bit. Its input is therefore the hash code output by the hash layer, $H \in \mathbb{R}^{C \times 1 \times 1}$, i.e. a feature vector with $C$ channels and height and width both equal to 1, and its output is the attention map of the hash code, i.e. the weight vector $W \in \mathbb{R}^{C \times 1 \times 1}$. By the definition of the hash code, $H$ can be regarded as a feature map composed of $C$ channels, each containing only one element. In the weight layer, $H$ and its transpose are first matrix-multiplied and a softmax layer is applied, yielding the channel attention map $X \in \mathbb{R}^{C \times C}$, where each element $x_{ji}$ represents the effect of the $i$-th channel on the $j$-th channel. Finally, the attention map is summed along the longitudinal dimension to obtain the total impact of each channel on the other channels, $X' \in \mathbb{R}^{1 \times C}$, which represents the importance of the corresponding hash bits; transposing $X'$ gives the weight vector $W$:

$$X = \mathrm{softmax}\left(H H^{\top}\right), \qquad X'_{i} = \sum_{j=1}^{C} x_{ji}, \qquad W = X'^{\top}.$$
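A minimal PyTorch sketch of this parameter-free weight layer, under the reconstruction above (in particular, the softmax axis is an assumption), might look as follows:

```python
import torch
import torch.nn as nn


class WeightLayer(nn.Module):
    """Sketch of the weight layer: derives a per-bit importance vector from the
    hash code itself via a channel self-attention map (no learnable parameters)."""

    def forward(self, h):
        # h: B x C batch of relaxed hash codes; each bit is treated as a 1x1 channel.
        attn = torch.softmax(torch.bmm(h.unsqueeze(2), h.unsqueeze(1)), dim=-1)  # B x C x C, entries x_ji
        w = attn.sum(dim=1)   # sum over j: total influence of each bit on the others, B x C
        return w
```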
The CSA feature refinement layer of the deep convolutional neural network of this embodiment sequentially infers attention maps along two independent dimensions (channel and space), and then multiplies them with the input feature map for adaptive feature optimization. The input of the CSA feature refinement layer is the feature map generated by the feature extraction module. A channel-weighted result is first obtained through the channel attention module, and the final refined feature map is then generated through the spatial attention module. The attention-based adaptive feature learning process can be expressed as:

$$F' = M_c(F) \otimes F,$$
$$F'' = M_s(F') \otimes F',$$

where $\otimes$ denotes element-wise multiplication, $F \in \mathbb{R}^{C \times H \times W}$ is the input feature map, and $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and $M_s \in \mathbb{R}^{1 \times H \times W}$ denote the channel attention map and the spatial attention map, respectively. $F'$ is the result of weighting the input feature map along the channel dimension, and $F''$ is the final weighted result. The network is designed to chain the outputs of the two attention modules, which helps to obtain more accurate results.
The first step of the channel attention module is to compress the input feature map along the spatial dimensions to obtain one-dimensional vectors. When compressing along the spatial dimensions, the spatial information of the input feature map is aggregated using both average pooling and max pooling. The descriptors generated by the two pooling operations, $F^{c}_{avg}$ and $F^{c}_{max}$, are then each sent to a shared network. The shared network consists of a multi-layer perceptron (MLP) whose hidden layer size is reduced to $1/r$ of the input feature map, $r$ being the reduction ratio. The output feature vectors are then summed element-wise to obtain the channel attention map $M_c$. For a feature map, channel attention focuses on what is important in the image. Average pooling provides feedback for every pixel on the feature map, while max pooling provides gradient feedback only for the maximum-valued location on the feature map during gradient back-propagation. In summary, the channel attention is computed as follows:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^{c}_{avg})) + W_1(W_0(F^{c}_{max}))\big),$$

where $\sigma$ is the Sigmoid function. The MLP weights consist of $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$, which are shared between the average-pooled and max-pooled branches.
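The channel attention described by this equation can be sketched in PyTorch as follows; the reduction ratio r = 16 is an assumption for illustration:

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Sketch of M_c: a shared MLP applied to average- and max-pooled channel
    descriptors, summed and passed through a Sigmoid."""

    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),   # W0: C -> C/r
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),   # W1: C/r -> C
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):                          # f: B x C x H x W
        avg = self.mlp(f.mean(dim=(2, 3)))         # MLP(AvgPool(F))
        mx = self.mlp(f.amax(dim=(2, 3)))          # MLP(MaxPool(F))
        return self.sigmoid(avg + mx)[:, :, None, None]  # M_c: B x C x 1 x 1
```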
The spatial attention module takes the output feature map of the channel attention module as its input. The spatial attention mechanism compresses the input feature map along the channel dimension, generating two two-dimensional feature maps via average pooling and max pooling respectively: $F^{s}_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^{s}_{max} \in \mathbb{R}^{1 \times H \times W}$. The max pooling operation extracts the maximum along the channel dimension, and the average pooling operation extracts the average along the channel dimension. The two feature maps are then concatenated into a two-channel feature map and reduced to one channel by a standard convolutional layer, generating the spatial attention map. In summary, the spatial attention map is computed as follows:

$$M_s(F) = \sigma\big(f^{k \times k}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{k \times k}([F^{s}_{avg}; F^{s}_{max}])\big),$$

where $\sigma$ is the Sigmoid function and $[\,;\,]$ denotes the concatenation operation. $f^{k \times k}$ represents a convolution operation with kernel size $k$, where $k$ takes the value 3 or 7.
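The spatial attention and the full CSA wrapper (chaining channel then spatial attention, as in the equations for F' and F'' above) can be sketched similarly; choosing k = 7 here is an assumption:

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Sketch of M_s: channel-wise average and max maps, concatenated and convolved."""

    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=k, padding=k // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):                               # f: B x C x H x W
        avg = f.mean(dim=1, keepdim=True)               # B x 1 x H x W
        mx = f.amax(dim=1, keepdim=True)                # B x 1 x H x W
        return self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s: B x 1 x H x W


class CSA(nn.Module):
    """CSA refinement: F' = M_c(F) * F, then F'' = M_s(F') * F'."""

    def __init__(self, channels, r=16, k=7):
        super().__init__()
        self.ca = ChannelAttention(channels, r)
        self.sa = SpatialAttention(k)

    def forward(self, f):
        f = self.ca(f) * f     # channel-weighted F'
        f = self.sa(f) * f     # final refined F''
        return f
```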
The deep convolutional neural network of the embodiment is a trained deep convolutional neural network, and the training process comprises the following substeps:
step 1.1: a number of pictures are selected from an existing image data set as the retrieval set, which is then divided into a training set and a test set at a ratio of 5:1, i.e. for each class the number of training samples is 5 times the number of test samples. Each sample of the training set and the test set comprises an image and its corresponding label;
step 1.2: the training set is input into the deep convolutional neural network, back-propagation is performed using the SGD (stochastic gradient descent) algorithm under the supervision of the loss function to adjust the network parameters, and the iteration is repeated to obtain the optimized deep network model;
The loss function proposed in this embodiment is a supervised hash loss function consisting of three parts: the classification loss $L_C$, the weighted pairwise similarity loss $L_P$, and the quantization loss $L_Q$. Most previous methods do not take full advantage of the label information: image labels not only provide the similarity of image pairs, but also provide useful information for learning the hash function through image classification supervision. The first term $L_C$ maps semantically similar images to similar hash codes by minimizing the classification loss. The second term $L_P$ preserves the similarity of paired images by minimizing a weighted likelihood function. The third term $L_Q$ constrains the generated hash codes to converge to 1 or -1 by minimizing the squared-error loss between the network output and the target. The following deep hash optimization is therefore proposed:

$$\min_{\Theta} L = \lambda_1 L_C + \lambda_2 L_P + \lambda_3 L_Q,$$

where $\Theta$ is the set of all parameters of the deep hash function, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the balance parameters of the respective terms.
The classification is lost. Classification-label supervision is a component of the mixed attention mechanism and is implemented in the form of image classification loss. A single label for an image indicates that each instance can only be of the a or B category, while multiple labels indicate that each instance can be assigned to multiple categories. In order to fully utilize the label information, the invention elaborately constructs the classification loss. When the image label is a single label, a cross entropy loss function L is usedC-S(ii) a Using a multi-class cross entropy loss function L when the image label is multi-labelC-M
Figure BDA0003596437700000082
Figure BDA0003596437700000083
Wherein, y represents a prediction tag,
Figure BDA0003596437700000084
represents a genuine tag;
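In PyTorch terms, this choice of classification loss amounts to something like the following sketch; the single-label branch assumes integer class indices and the multi-label branch a 0/1 label matrix:

```python
import torch.nn.functional as F


def classification_loss(logits, labels, multi_label=False):
    """Sketch of L_C: softmax cross-entropy for single-label images,
    sigmoid (binary) cross-entropy for multi-label images."""
    if multi_label:
        return F.binary_cross_entropy_with_logits(logits, labels.float())  # L_C-M
    return F.cross_entropy(logits, labels)                                  # L_C-S
```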
Weighted maximum a posteriori estimation of hash codes. Pairwise feature learning and hash learning can effectively exploit the relationship between similar images. In the hash learning process, the labels are used as supervision information to constrain the hash codes of two corresponding images so that the distance between similar images becomes smaller. Therefore, a similarity set $S = \{s_{ij}\}$ is constructed from the semantic label set $L$: when $x_i$ and $x_j$ have the same label, $s_{ij} = 1$; when the labels of $x_i$ and $x_j$ are not identical, $s_{ij} = 0$. For a pair of binary hash codes $h_i$ and $h_j$ of length $K$, the relationship between their Hamming distance $\mathrm{dist}_H(\cdot,\cdot)$ and inner product $\langle \cdot,\cdot \rangle$ is as follows:

$$\mathrm{dist}_H(h_i, h_j) = \frac{1}{2}\big(K - \langle h_i, h_j \rangle\big),$$

where $\langle h_i, h_j \rangle$ denotes the inner product.
The Hamming distance is non-convex and non-smooth and hence difficult to optimize; because of the above equivalence, the inner product is used instead to quantify similarity.
In the existing work, most partiesThe method uses a Bayesian framework to combine the similarity correlation and the quantization error. Given pairwise similarity label sets
Figure BDA0003596437700000091
n sample point hash codes H ═ H1,...,hn]The weighted maximum a posteriori estimate (WMAP) of (a) is:
Figure BDA0003596437700000092
wherein
Figure BDA0003596437700000093
For weighting the likelihood functions by LPMeaning that the inner product of two similar points is made as small as possible, while the inner product of two different points is made as large as possible. w is aijRepresents each sample pair (x)i,xj,sij) Importance to total losses. Typically, the number of different image pairs is much greater than the number of similar image pairs in the training data. Using wijThe influence of different image pairs is weakened, and the influence of similar image pairs is enhanced, so that the aim of balancing data is fulfilled. p (H) is a prior distribution, using LQAnd (4) showing. Due to the fact that
Figure BDA0003596437700000099
Wherein each similar label can only be sij1 (analogous) or sijEqual to 0 (dissimilar), therefore
Figure BDA0003596437700000094
Wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0003596437700000095
is a collection of similar pairs of the same or similar pairs,
Figure BDA0003596437700000096
is a collection of dissimilar pairs.
For each pair, the conditional probability $P(s_{ij} \mid h_i, h_j)$ of the similarity label $s_{ij}$ given a pair of hash codes $h_i$ and $h_j$ can be defined as a pairwise logistic function:

$$P(s_{ij} \mid h_i, h_j) = \begin{cases} \sigma\big(\langle h_i, h_j \rangle\big), & s_{ij} = 1, \\ 1 - \sigma\big(\langle h_i, h_j \rangle\big), & s_{ij} = 0, \end{cases}$$

where $\sigma(x) = 1/(1 + e^{-\alpha x})$ is an adaptive Sigmoid function with a hyperparameter $\alpha$ that controls its bandwidth.

Substituting the definition of $P(s_{ij} \mid h_i, h_j)$ into the WMAP estimate yields the following optimization problem:

$$L_P = \sum_{s_{ij} \in S} w_{ij} \Big( \log\big(1 + e^{\alpha \langle h_i, h_j \rangle}\big) - \alpha\, s_{ij} \langle h_i, h_j \rangle \Big).$$
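A PyTorch sketch of this weighted pairwise loss on a mini-batch might read as follows; the particular balancing scheme for w_ij and the value of alpha are assumptions for illustration:

```python
import torch
import torch.nn.functional as F


def pairwise_similarity_loss(h, s, alpha=1.0):
    """Sketch of L_P: h is a B x K batch of relaxed hash codes, s a B x B matrix of
    0/1 similarity labels; w_ij re-balances similar vs. dissimilar pairs."""
    ip = alpha * (h @ h.t())                       # alpha * <h_i, h_j>
    n_sim = s.sum().clamp(min=1.0)                 # number of similar pairs
    n_dis = (1.0 - s).sum().clamp(min=1.0)         # number of dissimilar pairs
    w = s * (s.numel() / n_sim) + (1.0 - s) * (s.numel() / n_dis)   # w_ij
    # w_ij * ( log(1 + exp(alpha <h_i,h_j>)) - alpha s_ij <h_i,h_j> ), softplus for stability
    return (w * (F.softplus(ip) - s * ip)).mean()
```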
In order to facilitate optimizing the loss function with gradient descent, the discrete constraint on the hash code is removed during network training and a Tanh activation function is added after the hash layer so that the network output falls between -1 and 1. Meanwhile, considering that the hash code is a binary code, a Tanh-like activation function together with a quantization loss is used instead of the sign function to discretize the code:

$$o(x) = \tanh(\lambda x).$$

When $\lambda = 1$, $o(x)$ is the standard hyperbolic tangent function; when $\lambda$ is very large, $o(x)$ can be regarded as a standard sign function, yet unlike the sign function it is differentiable, which facilitates back-propagation through the network.
In order to ensure that the generated hash code is completely converged into a binary code, quantization loss LQ is introduced to the generated hash code hiAnd (5) thinning. Similar to DHN, the present invention uses bimodal laplacian priors for quantization, with the formula:
Figure BDA0003596437700000102
where e is an adjustment parameter. P (h)i) The definition of (2) is substituted into the WMAP estimate, resulting in the following quantization loss:
Figure BDA0003596437700000103
wherein
Figure BDA0003596437700000104
Is a full 1 vector. Due to L1The norm is non-smooth, resulting in difficulty in calculating L in back propagationQThe present invention uses Mean Squared Error (MSE) function to calculate the quantization loss of the hash code:
Figure BDA0003596437700000105
where n is the number of samples.
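The quantization loss, the combined objective and one SGD training epoch can then be sketched as follows, reusing the loss sketches above; the λ values, the optimizer hyper-parameters and the single-label similarity construction are assumptions:

```python
import torch


def quantization_loss(h):
    """Sketch of L_Q: mean squared distance between |h| and the all-ones vector,
    pushing every relaxed hash bit toward -1 or +1."""
    return torch.mean((h.abs() - 1.0) ** 2)


def total_loss(logits, labels, h, s, lambdas=(1.0, 1.0, 1.0), multi_label=False):
    """Combined objective L = lambda1*L_C + lambda2*L_P + lambda3*L_Q."""
    l1, l2, l3 = lambdas
    return (l1 * classification_loss(logits, labels, multi_label)
            + l2 * pairwise_similarity_loss(h, s)
            + l3 * quantization_loss(h))


def train_one_epoch(model, loader, optimizer):
    """One epoch of SGD back-propagation under the supervised hash loss (sketch)."""
    for images, labels in loader:
        logits, h, _ = model(images)
        s = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # s_ij = 1 iff same label
        loss = total_loss(logits, labels, h, s)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


# e.g. optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```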
In this embodiment, the training result is verified on the large data set ImageNet, and the training set and the test set pictures are respectively input into the trained network model to generate corresponding hash codes DatabaseHash and TestHash, and corresponding weights DatabaseWeight and TestWeight; and calculating the Hamming distance between each image in the test set and the image hash code in the data set, sequencing the images from small to large, and sequentially outputting the query result. The result shows that the retrieval average precision mAP reaches 82.8 percent and is far higher than the prior most advanced method.
step 2: the similarity between the hash code of the query image and those of the images in the existing image database is calculated, and the images with the highest similarity are taken as the retrieval results;
when a retrieval image is entered into the existing image database, it is immediately input into the deep convolutional neural network to generate a hash code databaseHash, which is stored in the image database as a feature index for retrieval;
if the deep convolutional neural network generates 64-bit hash codes but hash codes of shorter length (e.g. 48, 32 or 24 bits) are needed to meet retrieval-efficiency requirements, the corresponding hash bits can be selected from the currently available long hash code in descending order of the query weight queryWeight to obtain the low-bit hash code.
In this embodiment, if there is a strict requirement on the retrieval efficiency, the short hash code may be generated from the available long hash codes according to the generated adaptive weight, so as to save the cost of regenerating hash codes of different lengths. The scheme learns the weight vector corresponding to the importance of each bit and then converts the long hash code into the required short hash code. For the generated n-bit hash code, a weight vector W is used to describe the importance of each bit in the n-bit hash code in terms of similarity. With the weight layer of step 2.4, the corresponding weight vector is also generated at the same time as the hash code is generated. In the pairwise similarity loss function, the pairwise similarity loss function is multiplied by the hash code correspondingly, and then the original unweighted hash code is input to the pairwise similarity loss function to simultaneously learn the hash code and the weight vector. When a shorter hash code is needed, only the corresponding bit with the higher weight value needs to be taken from the long hash code.
The invention, in turn, performs Channel and Spatial Attention (CSA) operations after the feature layer of ResNet50, allowing the hash network to further learn what to focus on along the channel dimension and where to focus along the spatial dimension. However, if there are many objects in an image, the network still cannot distinguish which one is the target of interest. To solve this problem, the invention further proposes a top-down attention mechanism supervised by classification labels. In principle, the invention uses classification labels to constrain which class an image should be classified into, thereby encouraging the network to focus on learning the features of that class. The classification labels typically used for supervision refer to objects of interest, which allows the network to focus more on the truly relevant objects when generating hash codes while ignoring irrelevant ones. CSA, which learns salient objects in the pixel sense, is in fact a bottom-up attention mechanism. In this way, the invention establishes a hybrid attention mechanism combining bottom-up pixel saliency with top-down semantic supervision, in which the CSA is driven toward regions representing the label semantics rather than merely visually salient objects. The combination of CSA and the classification loss enables the network to better identify the target in the image, thereby generating more discriminative hash codes and achieving better retrieval performance.
In addition, in the most advanced deep hash method, only one length of hash code can be obtained in one training process. In other words, in order to obtain hash codes of different lengths (e.g., 12 bits, 24 bits, 32 bits, etc.), it takes time to retrain, and hash codes of different lengths must all be retained, which results in a large amount of time and memory consumption. In order to solve the problem, the invention provides a self-adaptive weight learning algorithm, which generates a weight for each bit of hash code generated by a deep network. Each weight represents the importance of the corresponding hash code bit to the image representation. For short hash codes with different lengths, the long hash codes are generated only once through training, and then corresponding bits with higher weight values are taken from the long hash codes.
The invention provides a deep hash network with a mixed attention mechanism and self-adaptive weighting. CSA was introduced after the feature extraction layer of ResNet50 to emphasize the semantically significant features of class label supervision, resulting in more discriminative hash codes. In addition, the invention provides a self-adaptive weighting method, and the method can generate the long hash code and the weight of the corresponding bit only by training the model once. Thus, the short hash code can be obtained by sub-sampling from the long hash code according to the weights.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A large-scale image retrieval method based on a deep convolutional neural network is characterized by comprising the following steps:
step 1: inputting an image to be queried into a deep convolutional neural network to generate a hash code queryHash and a weight queryWeight;
the deep convolutional neural network consists of a ResNet50-based feature extraction layer, a classification layer, a hash layer and a weight layer;
the ResNet50-based feature extraction layer consists of a ResNet50 backbone with its average pooling layer and fully connected layer removed, followed by a CSA feature refinement layer and a global average pooling layer connected in sequence;
the classification layer and the hash layer are two fully connected layers arranged in parallel behind the global average pooling layer, which respectively predict the image label and the hash code under the supervision of corresponding loss functions;
the weight layer is arranged behind the hash layer and generates corresponding weight for each hash code;
step 2: calculating the similarity of the hash codes of the query image and the image in the existing image database, and taking the image with the highest similarity as a retrieval result;
when a retrieval image is entered into the existing image database, it is immediately input into the deep convolutional neural network to generate a hash code databaseHash, which is stored in the image database as a feature index for retrieval;
if the deep convolutional neural network generates 64-bit hash codes but a shorter hash code is needed, the corresponding hash bits are selected from the currently available long hash code in descending order of the query weight queryWeight to obtain the low-bit hash code.
2. The large-scale image retrieval method based on the deep convolutional neural network as claimed in claim 1, wherein: in the deep convolutional neural network, the first part of the ResNet50-based feature extraction layer is the feature extraction portion of ResNet, comprising an independent convolutional layer and 4 convolutional residual structures, each residual block containing several convolutional layers, the feature distribution being adjusted by BatchNorm regularization and a ReLU activation function after each convolution operation; the feature extraction layer outputs feature maps with 2048 channels, which are reduced to one quarter of the original dimension by a convolutional layer with kernel size 3 and stride 1; next, the CSA feature refinement layer strengthens the network's attention to important features and enhances the semantic information of the feature map; the enhanced features are then passed through a convolutional layer for dimension reduction, followed by spatial compression via global average pooling; finally, the hash code and the classification information are respectively predicted by two fully connected layers; a fully connected layer with the same output dimension is attached after the hash layer to generate the adaptive weights corresponding to the hash code; the vector output by the hash layer is passed through a Tanh-like activation function to make the hash code converge.
3. The large-scale image retrieval method based on the deep convolutional neural network as claimed in claim 1, wherein: the weight layer of the deep convolutional neural network takes as input the hash code $H \in \mathbb{R}^{C \times 1 \times 1}$ output by the hash layer, and outputs the attention map of the hash code, i.e. the weight vector $W \in \mathbb{R}^{C \times 1 \times 1}$; the hash code $H$ is regarded as a feature map composed of $C$ channels, each channel containing only one element, i.e. a feature vector with $C$ channels and height and width both equal to 1; in the weight layer, $H$ and its transpose are first matrix-multiplied and a softmax layer is applied, yielding the channel attention map $X \in \mathbb{R}^{C \times C}$, in which each element $x_{ji}$ represents the influence of the $i$-th channel on the $j$-th channel; finally, the attention map is summed along the longitudinal dimension to obtain the total impact of each channel on the other channels, $X' \in \mathbb{R}^{1 \times C}$, which represents the importance of the corresponding hash bits, and $X'$ is transposed to obtain the weight vector $W$:

$$X = \mathrm{softmax}\left(H H^{\top}\right), \qquad X'_{i} = \sum_{j=1}^{C} x_{ji}, \qquad W = X'^{\top}.$$
4. The large-scale image retrieval method based on the deep convolutional neural network as claimed in claim 1, wherein: the CSA feature refinement layer of the deep convolutional neural network sequentially infers attention maps along two independent dimensions, channel and space, and multiplies them with the input feature map for adaptive feature optimization; the input of the CSA feature refinement layer is the feature map generated by feature extraction, a channel-weighted result is obtained through the channel attention module, and the final refined feature map is then generated through the spatial attention module; the adaptive feature optimization process is:

$$F' = M_c(F) \otimes F,$$
$$F'' = M_s(F') \otimes F',$$

where $\otimes$ denotes element-wise multiplication, $F \in \mathbb{R}^{C \times H \times W}$ is the input feature map, and $M_c \in \mathbb{R}^{C \times 1 \times 1}$ and $M_s \in \mathbb{R}^{1 \times H \times W}$ denote the channel attention map and the spatial attention map, respectively; $F'$ is the result of weighting the input feature map along the channel dimension, and $F''$ is the final weighted result;

the first step of the channel attention module is to compress the input feature map along the spatial dimensions to obtain one-dimensional vectors; while compressing along the spatial dimensions, the spatial information of the input feature map is aggregated using both average pooling and max pooling; the descriptors generated by the two pooling operations, $F^{c}_{avg}$ and $F^{c}_{max}$, are each sent to a shared network; the shared network consists of a multi-layer perceptron (MLP) whose hidden layer size is reduced to $1/r$ of the input feature map, $r$ being the reduction ratio; the output feature vectors are then summed element-wise to obtain the channel attention map $M_c$; the channel attention is computed as follows:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^{c}_{avg})) + W_1(W_0(F^{c}_{max}))\big),$$

where $\sigma$ is the Sigmoid function, and the MLP weights $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are shared between the average-pooled and max-pooled branches;

the spatial attention module takes the output feature map of the channel attention module as its input, compresses the input feature map along the channel dimension, and generates two two-dimensional feature maps via average pooling and max pooling respectively: $F^{s}_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^{s}_{max} \in \mathbb{R}^{1 \times H \times W}$; the max pooling operation extracts the maximum along the channel dimension, and the average pooling operation extracts the average along the channel dimension; the two feature maps are then concatenated along the channel dimension into a two-channel feature map and reduced to one channel by a standard convolutional layer to generate the spatial attention map; the spatial attention map is computed as follows:

$$M_s(F) = \sigma\big(f^{k \times k}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{k \times k}([F^{s}_{avg}; F^{s}_{max}])\big),$$

where $\sigma$ is the Sigmoid function, $[\,;\,]$ denotes the concatenation operation, and $f^{k \times k}$ represents a convolution operation with kernel size $k$, $k$ taking the value 3 or 7.
5. The large-scale image retrieval method based on the deep convolutional neural network as claimed in any one of claims 1 to 4, wherein: the deep convolutional neural network is a trained deep convolutional neural network, and the training process comprises the following substeps:
step 1.1: a plurality of pictures are selected from the existing image data set to be used as a retrieval set, and then the retrieval set is divided into a training set and a testing set. Each sample of the training set and the test set comprises an image and a corresponding label;
step 1.2: the training set is input into the deep convolutional neural network, back-propagation is performed using the SGD (stochastic gradient descent) algorithm under the supervision of the loss function to adjust the network parameters, and the iteration is repeated to obtain the optimized deep convolutional neural network;

wherein the loss function consists of three parts: the classification loss $L_C$, the weighted pairwise similarity loss $L_P$, and the quantization loss $L_Q$; the first term $L_C$ maps semantically similar images to similar hash codes by minimizing the classification loss, the second term $L_P$ maintains the similarity of image pairs by minimizing a weighted likelihood function, and the third term $L_Q$ constrains the generated hash codes to converge to 1 or -1 by minimizing the squared-error loss between the network output and the target; the deep hash optimization function $L$ is:

$$\min_{\Theta} L = \lambda_1 L_C + \lambda_2 L_P + \lambda_3 L_Q,$$

wherein $\Theta$ is the set of all parameters of the deep hash optimization function, and $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the balance parameters of the respective terms;

when the image label is a single label, the cross-entropy loss function $L_{C\text{-}S}$ is used as the classification loss $L_C$; when the image label is a multi-label, the multi-class cross-entropy loss function $L_{C\text{-}M}$ is used as the classification loss $L_C$:

$$L_{C\text{-}S} = -\sum_{c} \hat{y}_c \log\!\left(\frac{e^{y_c}}{\sum_{c'} e^{y_{c'}}}\right),$$

$$L_{C\text{-}M} = -\sum_{c} \Big[\hat{y}_c \log \sigma(y_c) + (1-\hat{y}_c)\log\big(1-\sigma(y_c)\big)\Big],$$

wherein $y$ denotes the predicted label and $\hat{y}$ denotes the ground-truth label;

$$L_P = \sum_{s_{ij} \in S} w_{ij} \Big( \log\big(1 + e^{\alpha \langle h_i, h_j \rangle}\big) - \alpha\, s_{ij} \langle h_i, h_j \rangle \Big),$$

wherein, given the pairwise similarity label set $S = \{s_{ij}\}$, $s_{ij} = 1$ when $x_i$ and $x_j$ have the same label and $s_{ij} = 0$ when the labels of $x_i$ and $x_j$ are not identical; $w_{ij}$ represents the importance of each sample pair $(x_i, x_j, s_{ij})$ to the total loss; for a pair of binary hash codes $h_i$ and $h_j$, $\langle \cdot, \cdot \rangle$ denotes the inner product and $\alpha$ denotes the hyperparameter;

$$L_Q = \frac{1}{n} \sum_{i=1}^{n} \big\lVert\, |h_i| - \mathbf{1} \,\big\rVert_2^2,$$

wherein $n$ is the number of samples.
6. A large-scale image retrieval system based on a deep convolutional neural network is characterized by comprising the following modules:
the module 1 is used for inputting a query image into a deep convolutional neural network to generate a Hash code queryHash and a weight queryWeight;
the deep convolutional neural network consists of a ResNet50-based feature extraction layer, a classification layer, a hash layer and a weight layer;
the ResNet50-based feature extraction layer consists of a ResNet50 backbone with its average pooling layer and fully connected layer removed, followed by a CSA feature refinement layer and a global average pooling layer connected in sequence;
the classification layer and the hash layer are two fully connected layers arranged in parallel behind the global average pooling layer, which respectively predict the image label and the hash code under the supervision of corresponding loss functions;
the weight layer is arranged behind the hash layer and generates corresponding weight for each hash code;
the module 2 is used for calculating the similarity between the hash code of the query image and the hash codes of the images in the existing image database, and taking the image with the highest similarity as the retrieval result;
when a retrieval image is entered into the existing image database, it is immediately input into the deep convolutional neural network to generate a hash code databaseHash, which is stored in the image database as a feature index for retrieval;
if the deep convolutional neural network generates 64-bit hash codes but a shorter hash code is needed, the corresponding hash bits are selected from the currently available long hash code in descending order of the query weight queryWeight to obtain the low-bit hash code.
CN202210393416.9A 2022-04-14 2022-04-14 Large-scale image retrieval method and system based on deep convolutional neural network Pending CN114780767A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210393416.9A CN114780767A (en) 2022-04-14 2022-04-14 Large-scale image retrieval method and system based on deep convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210393416.9A CN114780767A (en) 2022-04-14 2022-04-14 Large-scale image retrieval method and system based on deep convolutional neural network

Publications (1)

Publication Number Publication Date
CN114780767A true CN114780767A (en) 2022-07-22

Family

ID=82429102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210393416.9A Pending CN114780767A (en) 2022-04-14 2022-04-14 Large-scale image retrieval method and system based on deep convolutional neural network

Country Status (1)

Country Link
CN (1) CN114780767A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115964527A (en) * 2023-01-05 2023-04-14 北京东方通网信科技有限公司 Label representation construction method for single label image retrieval
CN115964527B (en) * 2023-01-05 2023-09-26 北京东方通网信科技有限公司 Label characterization construction method for single-label image retrieval

Similar Documents

Publication Publication Date Title
Mascarenhas et al. A comparison between VGG16, VGG19 and ResNet50 architecture frameworks for Image Classification
WO2021042828A1 (en) Neural network model compression method and apparatus, and storage medium and chip
Zhang et al. Context encoding for semantic segmentation
Yang et al. A survey of DNN methods for blind image quality assessment
Donahue et al. Decaf: A deep convolutional activation feature for generic visual recognition
CN109063724B (en) Enhanced generation type countermeasure network and target sample identification method
CN111783831B (en) Complex image accurate classification method based on multi-source multi-label shared subspace learning
CN109063719B (en) Image classification method combining structure similarity and class information
CN109743642B (en) Video abstract generation method based on hierarchical recurrent neural network
CN113806746B (en) Malicious code detection method based on improved CNN (CNN) network
Tang et al. Deep fishernet for object classification
CN110175248B (en) Face image retrieval method and device based on deep learning and Hash coding
CN112733866A (en) Network construction method for improving text description correctness of controllable image
CN112434628B (en) Small sample image classification method based on active learning and collaborative representation
Chen et al. An Improved Deep Fusion CNN for Image Recognition.
CN115100709B (en) Feature separation image face recognition and age estimation method
CN110188827A (en) A kind of scene recognition method based on convolutional neural networks and recurrence autocoder model
CN112507800A (en) Pedestrian multi-attribute cooperative identification method based on channel attention mechanism and light convolutional neural network
CN115222998B (en) Image classification method
CN115457332A (en) Image multi-label classification method based on graph convolution neural network and class activation mapping
Luan et al. Sunflower seed sorting based on convolutional neural network
CN114329031A (en) Fine-grained bird image retrieval method based on graph neural network and deep hash
CN112990340B (en) Self-learning migration method based on feature sharing
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
Bai et al. Softly combining an ensemble of classifiers learned from a single convolutional neural network for scene categorization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination