CN112084362A - Image hash retrieval method based on hierarchical feature complementation - Google Patents

Image hash retrieval method based on hierarchical feature complementation

Info

Publication number
CN112084362A
CN112084362A (application CN202010789986.0A; granted publication CN112084362B)
Authority
CN
China
Prior art keywords
feature map
low
feature
level
image
Prior art date
Legal status
Granted
Application number
CN202010789986.0A
Other languages
Chinese (zh)
Other versions
CN112084362B (en)
Inventor
刘庆杰
马田瑶
许杰浩
王蕴红
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
2020-08-07 Application filed by Beihang University
2020-08-07 Priority to CN202010789986.0A
2020-12-15 Publication of CN112084362A
2022-06-14 Application granted
2022-06-14 Publication of CN112084362B
Status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 - Retrieval using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions

Abstract

The invention discloses an image hash retrieval method based on hierarchical feature complementation. The method can be applied to large-scale content-based image retrieval: it simultaneously and effectively extracts the low-level detail information and the high-level semantic information of an image and makes full use of both its global and local features. The method extracts a low-level feature map and a high-level feature map from the convolutional neural network at the same time, capturing both low-level and high-level information of the image. The attention module it introduces reduces noise interference in the low-level feature map and ensures its effectiveness, and the multi-scale feature fusion added to the high-level feature map aggregates context information from different regions, improving the network's ability to capture local detail. By enhancing and fusing information from different levels, the model can fully extract the rich and complex content of the image, and the hash codes better preserve the similarity between images.

Description

Image hash retrieval method based on hierarchical feature complementation
Technical Field
The invention relates to the technical field of computer vision, in particular to an image hash retrieval method based on hierarchical feature complementation.
Background
With the rapid development of the internet, multimedia technologies, mass storage technologies, and smart devices, multimedia data on the network is being generated, distributed, and stored at an explosive rate. Because vast numbers of images are produced at every moment, content-based image retrieval increasingly has to cope with very large data volumes. Conventional image retrieval methods, with their weak image feature extraction and slow encoding, are therefore gradually being replaced by efficient hashing methods.
In recent years, convolutional neural network (CNN) technology has been applied with great success to fields such as image processing and computer vision. Compared with hand-crafted feature extraction and matching algorithms, a convolutional neural network can be trained on a dataset so that the semantic information of the image is well preserved. With this in mind, researchers in the field began to explore combining convolutional neural networks with hash algorithms for large-scale image retrieval. Deep-hash-based image retrieval enables fast retrieval over large image collections, which is of great practical significance as internet-related industries increasingly rely on big data as a foundation and growth point.
At present, most deep hash algorithms first use a convolutional neural network to extract image features and then use a fully connected hash layer to quantize and encode those features into binary hash codes. In the feature extraction part, most hashing methods use convolutional neural networks with many layers, such as AlexNet or ResNet50. After the image passes through many convolution and pooling operations, each element of the extracted feature map has a large receptive field, and the map constitutes a high-level global feature rich in semantic information. Because the encoding uses features that carry the image's global semantics, such hash methods outperform traditional non-deep methods that encode local features. In practical application scenarios, however, image content is quite complex; if only the high-level global features are used for encoding, the key information of the image may be drowned out by secondary, irrelevant information such as the background, and the hash model then fails to encode the truly effective information. Conversely, if the model uses a network with few convolution layers, each element of the feature map has a small receptive field and the map represents low-level local features of the image; quantization encoding then misses the global semantic information, which can degrade retrieval performance. Related studies have shown that low-level features give better results in image instance retrieval, because their relatively higher resolution carries more position and local detail information. However, since they result from fewer convolutions of the original image, low-level features contain less semantic information and more noise. By comparison, the high-level global features obtained after the image passes through a deep convolutional neural network carry richer semantics, but their resolution is low and they perceive image detail poorly.
In summary, current deep hash algorithms have a clear weakness in feature extraction: low-level local features are underused, the detail information of the image is ignored, and retrieval precision suffers.
Disclosure of Invention
In view of this, the present invention provides an image hash retrieval method based on hierarchical feature complementation, so as to effectively extract low-level detail information and high-level semantic information of an image and fully utilize global features and local features of the image.
The invention provides an image hash retrieval method based on hierarchical feature complementation, which comprises the following steps:
S1: inputting an image to be retrieved into a convolutional neural network to extract features;
S2: intercepting a feature map generated by an intermediate layer of the convolutional neural network as the low-level feature map L, and inputting L into a spatial attention module, which aggregates context information into L to obtain a feature map $L_1$;
S3: inputting the low-level feature map L into a channel attention module, which models the semantic dependence among the channels of L to obtain a feature map $L_2$;
S4: adding the obtained feature maps $L_1$ and $L_2$ to obtain a feature map $L_3$, and encoding $L_3$ with a fully connected hash layer to generate a low-level hash code of length $l_1$;
S5: taking the feature map generated by the last layer of the convolutional neural network as the high-level feature map K, and convolving K with several convolution kernels of different sizes to generate several feature maps of different scales;
S6: applying point-wise convolution to each of the generated feature maps in a multi-scale feature fusion module, reducing the number of channels of each feature map to 1/4 of the number of channels of K;
S7: upsampling each point-wise-convolved feature map by bilinear interpolation back to the same scale as K, and concatenating the restored feature maps with K along the channel direction; the fused feature map contains information of different scales from different subregions, realizing the fusion of local and global information;
S8: encoding the fused feature map with a fully connected hash layer to generate a high-level hash code of length $l_2$;
S9: concatenating the low-level hash code and the high-level hash code to obtain a hash code of length $l_1+l_2$ for image retrieval.
In a possible implementation of the image hash retrieval method based on hierarchical feature complementation provided by the present invention, in step S2, intercepting the feature map generated by the intermediate layer of the convolutional neural network as the low-level feature map L, inputting L into the spatial attention module, and aggregating context information into L to obtain the feature map $L_1$ specifically comprises the following steps:
Given a low-level feature map $L \in \mathbb{R}^{C \times H \times W}$, two different convolution layers are applied to L to generate feature maps Y and Z, with $\{Y, Z\} \in \mathbb{R}^{C \times H \times W}$, where C is the number of channels, H the height, and W the width of the feature map. Y and Z are reshaped to dimension $C \times N$, giving $\{Y', Z'\} \in \mathbb{R}^{C \times N}$, where $N = H \times W$ is the total number of pixels on one channel of the feature map. The transpose of $Z'$ is multiplied by $Y'$ and a softmax activation is applied, yielding the spatial feature relation map $S \in \mathbb{R}^{N \times N}$:
$$S_{ij}=\frac{\exp\!\big((Z'^{\top})_i \cdot Y'_j\big)}{\sum_{j=1}^{N}\exp\!\big((Z'^{\top})_i \cdot Y'_j\big)}\tag{1}$$
where $S_{ij}$, the value of S in row i and column j, expresses the relation between the corresponding local features of Y and Z; the larger $S_{ij}$, the greater the similarity and correlation of the two local features ($i = 1, 2, \ldots, N$; $j = 1, 2, \ldots, N$). $(Z'^{\top})_i$ denotes the i-th row of the transpose of $Z'$, and $Y'_j$ the j-th column of $Y'$. After the spatial feature relation map S is obtained, a mean pooling layer and a convolution layer mine the relative weight of L at each spatial position; the weights are then applied back to L to complete the re-calibration in the spatial dimension:
$$L_1=\mathrm{conv}\big(\mathrm{avg}(S)\big)\cdot L\tag{2}$$
where avg denotes a mean pooling layer and conv a convolution layer with sigmoid as the activation function. Equation (2) weights the spatial positions of L, enhancing its key information in the spatial dimension to obtain the feature map $L_1$.
In a possible implementation of the image hash retrieval method based on hierarchical feature complementation provided by the present invention, in step S3, inputting the low-level feature map L into the channel attention module, which models the semantic dependence among the channels of L to obtain the feature map $L_2$, specifically comprises the following steps:
The low-level feature map L is reshaped to dimension $C \times N$ to obtain $L' \in \mathbb{R}^{C \times N}$; $L'$ is multiplied by its own transpose and softmax is applied as the activation function, yielding the channel feature relation map $G \in \mathbb{R}^{C \times C}$:
$$G_{mn}=\frac{\exp\!\big(L'_m \cdot (L'^{\top})_n\big)}{\sum_{n=1}^{C}\exp\!\big(L'_m \cdot (L'^{\top})_n\big)}\tag{3}$$
where $G_{mn}$ is the value of G in row m and column n ($m = 1, 2, \ldots, C$; $n = 1, 2, \ldots, C$); $L'_m$ denotes the m-th row of $L'$, and $(L'^{\top})_n$ the n-th column of its transpose. After the channel feature relation map G is obtained, a mean pooling layer and a fully connected layer (the multilayer perceptron of Eq. (4)) mine the relative weight of L on each channel; the weights are then applied back to L to complete the re-calibration in the channel dimension:
$$L_2=\mathrm{mlp}\big(\mathrm{avg}(G)\big)\cdot L\tag{4}$$
where mlp denotes a multilayer perceptron with sigmoid as the activation function. Equation (4) weights the channels of L, enhancing its key information in the channel dimension to obtain the feature map $L_2$.
The image hash retrieval method based on hierarchical feature complementation provided by the invention can be applied to large-scale content-based image retrieval. It simultaneously and effectively extracts the low-level detail information and the high-level semantic information of an image, and makes full use of both its global and local features. The method extracts a low-level feature map and a high-level feature map from the convolutional neural network at the same time, capturing both low-level and high-level information. The attention module it introduces reduces noise interference in the low-level feature map and ensures its effectiveness; the multi-scale feature fusion added to the high-level feature map aggregates context information from different regions and improves the network's ability to capture local detail. By enhancing and fusing information from different levels, the convolutional neural network can fully extract the rich and complex content of the image, and the hash codes better preserve the similarity between images.
Drawings
Fig. 1 is a flowchart of an image hash retrieval method based on hierarchical feature complementation according to the present invention;
fig. 2 is a schematic structural diagram of a multi-scale feature fusion module in an image hash retrieval method based on hierarchical feature complementation according to the present invention;
FIG. 3 is a schematic structural diagram of a spatial attention module in an image hash retrieval method based on hierarchical feature complementation according to the present invention;
FIG. 4 is a schematic structural diagram of a channel attention module in the hierarchical feature complementation-based image hash retrieval method according to the present invention;
fig. 5 is a comparison diagram after feature map weights of an original image are visualized by respectively using the existing ResNet50 and an image hash retrieval method based on hierarchical feature complementation provided by the present invention;
FIG. 6 is a graph of the results of t-SNE visualization experiments for the DHA method;
FIG. 7 is a graph showing the results of t-SNE visualization experiment of DHA + method.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only illustrative and are not intended to limit the present invention.
The invention provides an image hash retrieval method based on hierarchical feature complementation which, as shown in Fig. 1, comprises the following steps:
S1: inputting an image to be retrieved into a convolutional neural network to extract features;
S2: intercepting a feature map generated by an intermediate layer of the convolutional neural network as the low-level feature map L, and inputting L into a spatial attention module, which aggregates context information into L to obtain a feature map $L_1$;
S3: inputting the low-level feature map L into a channel attention module, which models the semantic dependence among the channels of L to obtain a feature map $L_2$;
S4: adding the obtained feature maps $L_1$ and $L_2$ to obtain a feature map $L_3$, and encoding $L_3$ with a fully connected hash layer to generate a low-level hash code of length $l_1$;
S5: taking the feature map generated by the last layer of the convolutional neural network as the high-level feature map K, and convolving K with several convolution kernels of different sizes to generate several feature maps of different scales;
S6: applying point-wise convolution to each of the generated feature maps in a multi-scale feature fusion module, reducing the number of channels of each feature map to 1/4 of the number of channels of K;
S7: upsampling each point-wise-convolved feature map by bilinear interpolation back to the same scale as K, and concatenating the restored feature maps with K along the channel direction; the fused feature map contains information of different scales from different subregions, realizing the fusion of local and global information;
S8: encoding the fused feature map with a fully connected hash layer to generate a high-level hash code of length $l_2$;
S9: concatenating the low-level hash code and the high-level hash code to obtain a hash code of length $l_1+l_2$ for image retrieval.
The following describes a specific implementation of the image hash retrieval method based on hierarchical feature complementation according to a specific embodiment.
Example 1:
The invention uses ResNet50 as the backbone of the convolutional neural network and improves upon it: features from different levels are used to generate hash codes that represent information at the corresponding levels, and a more effective hash code combining the different levels is obtained by direct concatenation. In this description, the hash code generated from low-level information is called the low-level hash code, and the hash code generated from high-level information the high-level hash code.
To generate the high-level hash code, the feature map produced by the last layer of the convolutional neural network is taken as the high-level feature map. It is convolved with kernels of several different sizes to produce feature maps at several scales; these are point-wise convolved and then upsampled, restoring each to the same scale as the high-level feature map, and the restored maps are concatenated with the high-level feature map along the channel direction. The feature map with the fused multi-scale features is encoded by a fully connected hash layer, generating a high-level hash code of length $l_2$. To generate the low-level hash code, a feature map produced by an intermediate layer of the network is first intercepted as the low-level feature map; an attention mechanism (a combination of spatial attention and channel attention) then enhances it, reducing its noise interference and semantic divergence; finally it is encoded by the fully connected hash layer, yielding a low-level hash code of length $l_1$ that represents the low-level detail features of the image. Concatenating the high-level hash code and the low-level hash code yields a hash code of length $l_1+l_2$ for image retrieval.
In the process of generating the high-level hash code, as shown in Fig. 2, the input feature is the high-level feature map. First, the high-level feature map is convolved with several kernels of different sizes to generate several feature maps of different scales; for example, feature map a in Fig. 2 has a size of 1 × 1, and the other maps have various other sizes. Next, point-wise convolution is applied to the feature maps of different scales, reducing the number of channels of each to 1/4. Finally, bilinear interpolation upsamples each feature map back to the original scale, and each restored map is fused with the high-level feature map; the fused feature map contains information of different scales from different subregions, realizing the fusion of local and global information.
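The following PyTorch sketch illustrates one possible reading of this module. The text fixes the point-wise reduction to C/4 channels, the bilinear upsampling back to the scale of K, and the channel-wise concatenation with K, but not the exact branch configuration; the pooling-based downsampling and the scale set (1, 2, 3, 6) used here are assumptions, not taken from the patent.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleFusion(nn.Module):
    """Sketch of the multi-scale feature fusion module of Fig. 2.
    Branch scales and pooling-based downsampling are assumptions."""
    def __init__(self, channels: int, scales=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(s),                # assumed way to reach an s x s scale
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.Conv2d(channels, channels // 4, 1),  # point-wise conv: C -> C/4
            )
            for s in scales
        ])

    def forward(self, k: torch.Tensor) -> torch.Tensor:
        h, w = k.shape[-2:]
        outs = [k]
        for branch in self.branches:
            y = branch(k)
            # Bilinear upsampling restores each branch to the scale of K.
            y = F.interpolate(y, size=(h, w), mode="bilinear", align_corners=False)
            outs.append(y)
        # Splice along the channel direction: C + len(scales) * C/4 channels.
        return torch.cat(outs, dim=1)
```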
In the process of generating the low-level hash code, the attention mechanism uses a combination of a spatial attention module and a channel attention module, which are described separately below.
The spatial attention module, shown in Fig. 3, takes the low-level feature map $L \in \mathbb{R}^{C \times H \times W}$ and applies two different convolution layers to it, generating feature maps Y and Z with $\{Y, Z\} \in \mathbb{R}^{C \times H \times W}$, where C is the number of channels, H the height, and W the width of the feature map. Y and Z are reshaped to dimension $C \times N$, giving $\{Y', Z'\} \in \mathbb{R}^{C \times N}$, where $N = H \times W$ is the total number of pixels on one channel of the feature map. The transpose of $Z'$ is multiplied by $Y'$ and a softmax activation is applied, yielding the spatial feature relation map $S \in \mathbb{R}^{N \times N}$:
$$S_{ij}=\frac{\exp\!\big((Z'^{\top})_i \cdot Y'_j\big)}{\sum_{j=1}^{N}\exp\!\big((Z'^{\top})_i \cdot Y'_j\big)}\tag{1}$$
where $S_{ij}$, the value of S in row i and column j, expresses the relation between the corresponding local features of Y and Z; the larger $S_{ij}$, the greater the similarity and correlation of the two local features ($i = 1, 2, \ldots, N$; $j = 1, 2, \ldots, N$). $(Z'^{\top})_i$ denotes the i-th row of the transpose of $Z'$, and $Y'_j$ the j-th column of $Y'$. After the spatial feature relation map S is obtained, a mean pooling layer and a convolution layer mine the relative weight of L at each spatial position; the weights are then applied back to L to complete the re-calibration in the spatial dimension:
$$L_1=\mathrm{conv}\big(\mathrm{avg}(S)\big)\cdot L\tag{2}$$
where avg denotes a mean pooling layer and conv a convolution layer with sigmoid as the activation function. Equation (2) weights the spatial positions of L, enhancing its key information in the spatial dimension to obtain the feature map $L_1$.
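As a concrete reading of Eqs. (1) and (2), the sketch below implements the spatial attention module under stated assumptions: the two convolutions producing Y and Z are taken as 1×1 convolutions, and avg(S) is taken as a mean over the rows of S; the text fixes neither detail.
```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the spatial attention module (Fig. 3, Eqs. (1)-(2)).
    Not memory-optimized: S is N x N with N = H * W."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv_y = nn.Conv2d(channels, channels, 1)  # assumed 1x1 convs for Y and Z
        self.conv_z = nn.Conv2d(channels, channels, 1)
        self.conv_w = nn.Conv2d(1, 1, 3, padding=1)     # the conv(.) of Eq. (2)

    def forward(self, l: torch.Tensor) -> torch.Tensor:
        b, c, h, w = l.shape
        n = h * w
        y = self.conv_y(l).view(b, c, n)  # Y' in R^{C x N}
        z = self.conv_z(l).view(b, c, n)  # Z' in R^{C x N}
        # Eq. (1): S = softmax(Z'^T Y'), one softmax per spatial position.
        s = torch.softmax(torch.bmm(z.transpose(1, 2), y), dim=-1)  # (B, N, N)
        # avg(S) -> one value per position, then conv + sigmoid (Eq. (2)).
        weight = torch.sigmoid(self.conv_w(s.mean(dim=1).view(b, 1, h, w)))
        return weight * l  # L1: spatially re-weighted low-level map
```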
The channel attention module, shown in Fig. 4, performs no additional convolution, unlike the spatial attention module: it computes the channel feature relation map $G \in \mathbb{R}^{C \times C}$ directly from the input low-level feature map $L \in \mathbb{R}^{C \times H \times W}$. First, L is reshaped to dimension $C \times N$ to obtain $L' \in \mathbb{R}^{C \times N}$; $L'$ is then multiplied by its own transpose and softmax is applied as the activation function, yielding the channel feature relation map:
$$G_{mn}=\frac{\exp\!\big(L'_m \cdot (L'^{\top})_n\big)}{\sum_{n=1}^{C}\exp\!\big(L'_m \cdot (L'^{\top})_n\big)}\tag{3}$$
where $G_{mn}$, the value of G in row m and column n, expresses the degree of association between channels m and n of the low-level feature map ($m = 1, 2, \ldots, C$; $n = 1, 2, \ldots, C$); $L'_m$ denotes the m-th row of $L'$, and $(L'^{\top})_n$ the n-th column of its transpose. After the channel feature relation map G is obtained, a mean pooling layer and a fully connected layer mine the relative weight of L on each channel; the weights are then applied back to L to complete the re-calibration in the channel dimension:
$$L_2=\mathrm{mlp}\big(\mathrm{avg}(G)\big)\cdot L\tag{4}$$
where mlp denotes a multilayer perceptron with sigmoid as the activation function. Equation (4) weights the channels of L, enhancing its key information in the channel dimension to obtain the feature map $L_2$. The channel attention module models the semantic dependency between feature channels so that similar semantic features reinforce one another, improving the feature map's ability to express image semantics.
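A corresponding sketch of the channel attention module of Eqs. (3) and (4) follows; the hidden width of the perceptron is not given in the text, so the reduction factor of 4 is an assumption.
```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention module (Fig. 4, Eqs. (3)-(4))."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(                      # the mlp(.) of Eq. (4)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, l: torch.Tensor) -> torch.Tensor:
        b, c, h, w = l.shape
        lp = l.view(b, c, h * w)  # L' in R^{C x N}
        # Eq. (3): G = softmax(L' L'^T), one softmax per channel row.
        g = torch.softmax(torch.bmm(lp, lp.transpose(1, 2)), dim=-1)  # (B, C, C)
        # avg(G) -> one value per channel, then the perceptron (Eq. (4)).
        weight = self.mlp(g.mean(dim=-1)).view(b, c, 1, 1)
        return weight * l  # L2: channel re-weighted low-level map
```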
In summary, the present invention uses a high-level feature map and a low-level feature map to capture global and local information simultaneously. The high-level feature map has a large receptive field and contains the global, deep semantic information of the image, but its resolution is low and much detail is lost; to address this, the invention introduces a multi-scale feature fusion module that effectively combines feature information of different scales, fusing global and local information. The low-level feature map contains more of the image's structural information, such as details of texture, color, and shape that strongly influence the classification result, but suffers from severe background clutter and semantic divergence; the invention therefore applies an attention mechanism to the low-level feature map to reduce the influence of noise.
Feature fusion can improve detection and segmentation performance in object detection and image segmentation tasks. By the order of fusion relative to prediction, feature fusion divides into early fusion and late fusion. The method adopts early fusion: the extracted multi-level image features are fused first, and prediction is then performed on the fused feature vector. Two fusion modes are available, superposition fusion and direct addition. Superposition fusion concatenates two feature vectors; if their dimensions are x and y, the fused dimension is x + y. Additive fusion combines two feature vectors into a single complex vector, z = x + iy, where i is the imaginary unit. For feature maps of different scales, the method uses bilinear interpolation to upsample each feature map back to the original scale before fusing, as illustrated below.
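A minimal illustration of the two fusion modes just described; the vector sizes are illustrative only and not taken from the patent.
```python
import torch

x = torch.randn(8, 128)  # a batch of feature vectors (sizes are illustrative)
y = torch.randn(8, 128)

concat = torch.cat([x, y], dim=1)  # superposition fusion: 128 + 128 = 256 dims
z = torch.complex(x, y)            # additive fusion: z = x + iy, stays 128-dim
```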
The attention mechanism combines channel attention with spatial attention. Spatial attention focuses on where the important information lies in the map, so the invention uses the spatial relations of the features to generate an attention map, aggregating richer context information into the local features and strengthening their expressive power. As for channel attention, each channel of the feature map carries the semantic information of some instance in the image, and the semantics of different channels are correlated; since the information in low-level local features diverges semantically and a convolutional neural network struggles to aggregate similar semantics, mining the interdependence among channels improves the feature map's representation of specific semantics. The invention therefore uses the cross-channel relations of the features to generate an attention map and learn the correlation among channels, which can be regarded as a complement to spatial attention. Given an input image, the two attention modules attend respectively to the category information of objects in the feature map and to their position information in the image, improving the quality of the low-level feature map; additive fusion then yields an information-enhanced fused feature. Compared with the original features, the fused features suffer less interference from background noise and express semantics better.
In summary, unlike existing deep hash methods that directly use the high-level features for classification prediction, the method combines the high-level and low-level feature maps, mitigating their respective shortcomings with a feature fusion module and an attention mechanism, as the end-to-end sketch below illustrates.
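The sketch below wires the three module sketches above around a ResNet50 backbone, assuming they are in scope. Taking the low-level map from layer2, global average pooling before the fully connected hash layers, the tanh relaxation of the binary codes, and the code lengths l1 = l2 = 32 are all illustrative assumptions; the patent does not fix them.
```python
import torch
import torch.nn as nn
from torchvision import models

class HierarchicalHash(nn.Module):
    """End-to-end sketch of hierarchical feature complementation;
    reuses MultiScaleFusion, SpatialAttention and ChannelAttention above."""
    def __init__(self, l1: int = 32, l2: int = 32):
        super().__init__()
        r = models.resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool,
                                  r.layer1, r.layer2)   # low-level map: 512 channels
        self.high = nn.Sequential(r.layer3, r.layer4)   # high-level map: 2048 channels
        self.spatial = SpatialAttention(512)
        self.channel = ChannelAttention(512)
        self.fusion = MultiScaleFusion(2048)            # 4 branches: 2048 + 4*512 = 4096
        self.hash_low = nn.Linear(512, l1)
        self.hash_high = nn.Linear(4096, l2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        low = self.stem(x)                              # low-level feature map L
        l3 = self.spatial(low) + self.channel(low)      # L3 = L1 + L2
        code_low = torch.tanh(self.hash_low(l3.mean(dim=(2, 3))))
        fused = self.fusion(self.high(low))             # multi-scale high-level map
        code_high = torch.tanh(self.hash_high(fused.mean(dim=(2, 3))))
        return torch.cat([code_low, code_high], dim=1)  # length l1 + l2
```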
The accuracy and feature visualization of the image hash retrieval method based on hierarchical feature complementation provided by the invention are analyzed through experiments below.
The evaluation index mAP@5000 was used on the multi-label datasets NUS-WIDE and MS COCO to compare different methods. In the experiments, the lengths of the final hash codes were set to 16, 32, 48 and 64 bits; the results are shown in Table 1.
TABLE 1 results of different hashing methods on multi-label datasets NUS-WIDE and MS COCO
(Table 1 is provided as an image in the original publication and is not reproduced here.)
In Table 1, DHA+ denotes the original DHA model augmented with the image hash retrieval method based on hierarchical feature complementation provided by the present invention, and HashNet+ likewise denotes the augmented HashNet. The retrieval performance of both DHA+ and HashNet+ improves considerably over hash models that use ResNet50 as the backbone. The improvement on the MS COCO dataset is more pronounced; analysis of MS COCO shows that its images contain more objects at different scales, suggesting that the method extracts features at different scales better and thereby improves retrieval performance.
Likewise, the evaluation index mAP@54000 was used on the single-label dataset CIFAR-10 to compare different methods. In the experiments, the lengths of the final hash codes were set to 16, 32, 48 and 64 bits; the results are shown in Table 2.
TABLE 2 results of different hashing methods on the single label dataset CIFAR-10
(Table 2 is provided as an image in the original publication and is not reproduced here.)
In experiments with four hash code lengths (16, 32, 48 and 64 bits), DHA+, obtained with the image hash retrieval method based on hierarchical feature complementation provided by the invention, improves retrieval performance over the original DHA on the multi-label datasets NUS-WIDE and MS COCO and on the single-label dataset CIFAR-10. On CIFAR-10, however, each image has relatively low resolution and contains only one instance object, so the improvement brought by the method is comparatively small.
The image hash retrieval method based on hierarchical feature complementation provided by the invention brings improvements of varying degrees on different datasets, demonstrating its generality. Notably, its retrieval performance improves most on image datasets with complex content, indicating good robustness.
Besides verifying the accuracy of the image hash retrieval method based on hierarchical feature complementation, Grad-CAM is used to visualize the feature-map weights in the convolutional neural network, in order to observe and analyze the differences between plain ResNet50 and the proposed method.
Experiments were performed on images from part of the MS COCO dataset: Grad-CAM maps the weights of the feature maps generated by the convolutional neural network back onto the original images to produce heat maps, and three representative images are selected for display in Fig. 5. As Fig. 5 shows, the method focuses on the people in the images, is not distracted by background or other non-critical information, and is highly robust.
The feature-visualization results on images of different complexity show that the image hash retrieval method based on hierarchical feature complementation detects the key information in the images completely. Through feature complementation, the retrieval model alleviates, to a certain extent, the semantic divergence and noise interference that a convolutional neural network suffers during feature extraction.
Furthermore, t-SNE visualization experiments were performed. t-SNE is a commonly used nonlinear dimensionality reduction method that maps high-dimensional data into a low-dimensional space. The CIFAR-10 dataset consists of color images in 10 classes, each containing 6000 images. First, 1000 images were randomly selected from each class of CIFAR-10, and 64-bit hash codes were generated with DHA and DHA+ respectively; the 64-dimensional vectors were then reduced with t-SNE and displayed in a two-dimensional plane. The results are shown in Fig. 6 (DHA) and Fig. 7 (DHA+). As Figs. 6 and 7 show, both DHA and DHA+ map most images of the same class into neighboring regions, but DHA has more misplaced samples than DHA+ and its clusters are more dispersed.
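A sketch of this visualization procedure, with random stand-ins for the real hash codes and labels:
```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Stand-ins for the real data: 1000 images per CIFAR-10 class, 64-bit codes.
codes = np.random.randn(10000, 64)       # would be the model's relaxed hash codes
labels = np.repeat(np.arange(10), 1000)  # would be the class labels

emb = TSNE(n_components=2, init="pca").fit_transform(codes)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=2, cmap="tab10")
plt.show()
```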
In summary, it can be seen through accuracy and feature visualization experiments that the image hash retrieval method based on hierarchical feature complementation provided by the invention has good performance, especially on an image data set with complex content.
The image hash retrieval method based on hierarchical feature complementation provided by the invention can be applied to large-scale content-based image retrieval. It simultaneously and effectively extracts the low-level detail information and the high-level semantic information of an image, and makes full use of both its global and local features. The method extracts a low-level feature map and a high-level feature map from the convolutional neural network at the same time, capturing both low-level and high-level information. The attention module it introduces reduces noise interference in the low-level feature map and ensures its effectiveness; the multi-scale feature fusion added to the high-level feature map aggregates context information from different regions and improves the network's ability to capture local detail. By enhancing and fusing information from different levels, the model can fully extract the rich and complex content of the image, and the hash codes better preserve the similarity between images.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (3)

1. An image hash retrieval method based on hierarchical feature complementation, characterized by comprising the following steps:
S1: inputting an image to be retrieved into a convolutional neural network to extract features;
S2: intercepting a feature map generated by an intermediate layer of the convolutional neural network as the low-level feature map L, and inputting L into a spatial attention module, which aggregates context information into L to obtain a feature map $L_1$;
S3: inputting the low-level feature map L into a channel attention module, which models the semantic dependence among the channels of L to obtain a feature map $L_2$;
S4: adding the obtained feature maps $L_1$ and $L_2$ to obtain a feature map $L_3$, and encoding $L_3$ with a fully connected hash layer to generate a low-level hash code of length $l_1$;
S5: taking the feature map generated by the last layer of the convolutional neural network as the high-level feature map K, and convolving K with several convolution kernels of different sizes to generate several feature maps of different scales;
S6: applying point-wise convolution to each of the generated feature maps in a multi-scale feature fusion module, reducing the number of channels of each feature map to 1/4 of the number of channels of K;
S7: upsampling each point-wise-convolved feature map by bilinear interpolation back to the same scale as K, and concatenating the restored feature maps with K along the channel direction; the fused feature map contains information of different scales from different subregions, realizing the fusion of local and global information;
S8: encoding the fused feature map with a fully connected hash layer to generate a high-level hash code of length $l_2$;
S9: concatenating the low-level hash code and the high-level hash code to obtain a hash code of length $l_1+l_2$ for image retrieval.
2. The image hash retrieval method based on hierarchical feature complementation according to claim 1, characterized in that in step S2, intercepting the feature map generated by the intermediate layer of the convolutional neural network as the low-level feature map L, inputting L into the spatial attention module, and aggregating context information into L to obtain the feature map $L_1$ specifically comprises:
given a low-level feature map $L \in \mathbb{R}^{C \times H \times W}$, applying two different convolution layers to L to generate feature maps Y and Z, with $\{Y, Z\} \in \mathbb{R}^{C \times H \times W}$, where C is the number of channels, H the height, and W the width of the feature map; reshaping Y and Z to dimension $C \times N$, giving $\{Y', Z'\} \in \mathbb{R}^{C \times N}$, where $N = H \times W$ is the total number of pixels on one channel of the feature map; multiplying the transpose of $Z'$ by $Y'$ and applying a softmax activation to obtain the spatial feature relation map $S \in \mathbb{R}^{N \times N}$:
$$S_{ij}=\frac{\exp\!\big((Z'^{\top})_i \cdot Y'_j\big)}{\sum_{j=1}^{N}\exp\!\big((Z'^{\top})_i \cdot Y'_j\big)}\tag{1}$$
wherein $S_{ij}$, the value of S in row i and column j, expresses the relation between the corresponding local features of Y and Z, the larger $S_{ij}$, the greater the similarity and correlation of the two local features ($i = 1, 2, \ldots, N$; $j = 1, 2, \ldots, N$); $(Z'^{\top})_i$ denotes the i-th row of the transpose of $Z'$, and $Y'_j$ the j-th column of $Y'$; after the spatial feature relation map S is obtained, mining the relative weight of L at each spatial position with a mean pooling layer and a convolution layer, and applying the weights back to L to complete the re-calibration in the spatial dimension:
$$L_1=\mathrm{conv}\big(\mathrm{avg}(S)\big)\cdot L\tag{2}$$
wherein avg denotes a mean pooling layer and conv a convolution layer with sigmoid as the activation function; equation (2) weights the spatial positions of L, enhancing its key information in the spatial dimension to obtain the feature map $L_1$.
3. The image hash retrieval method based on hierarchical feature complementation according to claim 2, characterized in that in step S3, inputting the low-level feature map L into the channel attention module, which models the semantic dependence among the channels of L to obtain the feature map $L_2$, specifically comprises:
reshaping the low-level feature map L to dimension $C \times N$ to obtain $L' \in \mathbb{R}^{C \times N}$; multiplying $L'$ by its own transpose and applying softmax as the activation function to obtain the channel feature relation map $G \in \mathbb{R}^{C \times C}$:
$$G_{mn}=\frac{\exp\!\big(L'_m \cdot (L'^{\top})_n\big)}{\sum_{n=1}^{C}\exp\!\big(L'_m \cdot (L'^{\top})_n\big)}\tag{3}$$
wherein $G_{mn}$ is the value of G in row m and column n ($m = 1, 2, \ldots, C$; $n = 1, 2, \ldots, C$); $L'_m$ denotes the m-th row of $L'$, and $(L'^{\top})_n$ the n-th column of its transpose; after the channel feature relation map G is obtained, mining the relative weight of L on each channel with a mean pooling layer and a fully connected layer, and applying the weights back to L to complete the re-calibration in the channel dimension:
$$L_2=\mathrm{mlp}\big(\mathrm{avg}(G)\big)\cdot L\tag{4}$$
wherein mlp denotes a multilayer perceptron with sigmoid as the activation function; equation (4) weights the channels of L, enhancing its key information in the channel dimension to obtain the feature map $L_2$.
CN202010789986.0A 2020-08-07 2020-08-07 Image hash retrieval method based on hierarchical feature complementation Active CN112084362B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010789986.0A CN112084362B (en) 2020-08-07 2020-08-07 Image hash retrieval method based on hierarchical feature complementation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010789986.0A CN112084362B (en) 2020-08-07 2020-08-07 Image hash retrieval method based on hierarchical feature complementation

Publications (2)

Publication Number Publication Date
CN112084362A true CN112084362A (en) 2020-12-15
CN112084362B CN112084362B (en) 2022-06-14

Family

ID=73735430

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010789986.0A Active CN112084362B (en) 2020-08-07 2020-08-07 Image hash retrieval method based on hierarchical feature complementation

Country Status (1)

Country Link
CN (1) CN112084362B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112612913A (en) * 2020-12-28 2021-04-06 厦门市美亚柏科信息股份有限公司 Image searching method and system
CN112906780A (en) * 2021-02-08 2021-06-04 中国科学院计算技术研究所 Fruit and vegetable image classification system and method
CN113011425A (en) * 2021-03-05 2021-06-22 上海商汤智能科技有限公司 Image segmentation method and device, electronic equipment and computer readable storage medium
CN113220926A (en) * 2021-05-06 2021-08-06 安徽大学 Footprint image retrieval method based on multi-scale local attention enhancement network
CN113239217A (en) * 2021-06-04 2021-08-10 图灵深视(南京)科技有限公司 Image index library construction method and system and image retrieval method and system
CN114064952A (en) * 2021-07-09 2022-02-18 武汉邦拓信息科技有限公司 Graph retrieval method based on spatial perception enhancement
CN114295368A (en) * 2021-12-24 2022-04-08 江苏国科智能电气有限公司 Multi-channel fused wind power planetary gear box fault diagnosis method
CN114581456A (en) * 2022-05-09 2022-06-03 深圳市华汉伟业科技有限公司 Multi-image segmentation model construction method, image detection method and device
CN115375980A (en) * 2022-06-30 2022-11-22 杭州电子科技大学 Block chain-based digital image evidence storing system and method
CN116955675A (en) * 2023-09-21 2023-10-27 中国海洋大学 Hash image retrieval method and network based on fine-grained similarity relation contrast learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190114774A1 (en) * 2017-10-16 2019-04-18 Adobe Systems Incorporated Generating Image Segmentation Data Using a Multi-Branch Neural Network
CN109840290A (en) * 2019-01-23 2019-06-04 北京航空航天大学 A kind of skin lens image search method based on end-to-end depth Hash

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190114774A1 (en) * 2017-10-16 2019-04-18 Adobe Systems Incorporated Generating Image Segmentation Data Using a Multi-Branch Neural Network
CN109840290A (en) * 2019-01-23 2019-06-04 北京航空航天大学 A kind of skin lens image search method based on end-to-end depth Hash

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIEHAO XU ET AL.: "DHA: Supervised Deep Learning to Hash with an Adaptive Loss Function", 《2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOP (ICCVW)》 *
曾梦琪 (Zeng Mengqi) et al.: "Efficient image retrieval scheme based on hybrid similarity" (基于混合相似度的高效图像检索方案), 《计算机工程》 (Computer Engineering) *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112612913A (en) * 2020-12-28 2021-04-06 厦门市美亚柏科信息股份有限公司 Image searching method and system
CN112906780A (en) * 2021-02-08 2021-06-04 中国科学院计算技术研究所 Fruit and vegetable image classification system and method
CN113011425A (en) * 2021-03-05 2021-06-22 上海商汤智能科技有限公司 Image segmentation method and device, electronic equipment and computer readable storage medium
WO2022183730A1 (en) * 2021-03-05 2022-09-09 上海商汤智能科技有限公司 Image segmentation method and apparatus, electronic device, and computer readable storage medium
CN113220926B (en) * 2021-05-06 2022-09-09 安徽大学 Footprint image retrieval method based on multi-scale local attention enhancement network
CN113220926A (en) * 2021-05-06 2021-08-06 安徽大学 Footprint image retrieval method based on multi-scale local attention enhancement network
CN113239217A (en) * 2021-06-04 2021-08-10 图灵深视(南京)科技有限公司 Image index library construction method and system and image retrieval method and system
CN113239217B (en) * 2021-06-04 2024-02-06 图灵深视(南京)科技有限公司 Image index library construction method and system, and image retrieval method and system
CN114064952A (en) * 2021-07-09 2022-02-18 武汉邦拓信息科技有限公司 Graph retrieval method based on spatial perception enhancement
CN114295368A (en) * 2021-12-24 2022-04-08 江苏国科智能电气有限公司 Multi-channel fused wind power planetary gear box fault diagnosis method
CN114581456A (en) * 2022-05-09 2022-06-03 深圳市华汉伟业科技有限公司 Multi-image segmentation model construction method, image detection method and device
CN114581456B (en) * 2022-05-09 2022-10-14 深圳市华汉伟业科技有限公司 Multi-image segmentation model construction method, image detection method and device
CN115375980A (en) * 2022-06-30 2022-11-22 杭州电子科技大学 Block chain-based digital image evidence storing system and method
CN115375980B (en) * 2022-06-30 2023-05-09 杭州电子科技大学 Digital image certification system and certification method based on blockchain
CN116955675A (en) * 2023-09-21 2023-10-27 中国海洋大学 Hash image retrieval method and network based on fine-grained similarity relation contrast learning
CN116955675B (en) * 2023-09-21 2023-12-12 中国海洋大学 Hash image retrieval method and network based on fine-grained similarity relation contrast learning

Also Published As

Publication number Publication date
CN112084362B (en) 2022-06-14

Similar Documents

Publication Publication Date Title
CN112084362B (en) Image hash retrieval method based on hierarchical feature complementation
Mohamed et al. Content-based image retrieval using convolutional neural networks
CN113657450B (en) Attention mechanism-based land battlefield image-text cross-modal retrieval method and system
US20220382553A1 (en) Fine-grained image recognition method and apparatus using graph structure represented high-order relation discovery
CN107590505B (en) Learning method combining low-rank representation and sparse regression
CN110929080A (en) Optical remote sensing image retrieval method based on attention and generation countermeasure network
CN114549913B (en) Semantic segmentation method and device, computer equipment and storage medium
CN111860124B (en) Remote sensing image classification method based on space spectrum capsule generation countermeasure network
CN113743484A (en) Image classification method and system based on space and channel attention mechanism
CN115222998B (en) Image classification method
Wang et al. An image similarity descriptor for classification tasks
CN112446431A (en) Feature point extraction and matching method, network, device and computer storage medium
Guo et al. Multi-view feature learning for VHR remote sensing image classification
Zhou et al. Discriminative attention-augmented feature learning for facial expression recognition in the wild
Wei et al. An automated detection model of threat objects for X-ray baggage inspection based on depthwise separable convolution
US20220019846A1 (en) Image analysis system and operating method of the same
CN116994155B (en) Geological lithology interpretation method, device and storage medium
Sima et al. Composite kernel of mutual learning on mid-level features for hyperspectral image classification
Arulmozhi et al. DSHPoolF: deep supervised hashing based on selective pool feature map for image retrieval
Bibi et al. Deep features optimization based on a transfer learning, genetic algorithm, and extreme learning machine for robust content-based image retrieval
Chen et al. Edge data based trailer inception probabilistic matrix factorization for context-aware movie recommendation
Vijayalakshmi K et al. Copy-paste forgery detection using deep learning with error level analysis
Moujahid et al. Multi-scale multi-block covariance descriptor with feature selection
Li et al. Aggregating hierarchical binary activations for image retrieval
CN113343953B (en) FGR-AM method and system for remote sensing scene recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant