CN106909924B - Remote sensing image rapid retrieval method based on depth significance - Google Patents

Remote sensing image rapid retrieval method based on depth significance

Info

Publication number
CN106909924B
CN106909924B · CN201710087670A
Authority
CN
China
Prior art keywords
image
training
network
layer
task
Prior art date
Legal status
Active
Application number
CN201710087670.5A
Other languages
Chinese (zh)
Other versions
CN106909924A (en)
Inventor
张菁
梁西
陈璐
卓力
耿文浩
李嘉锋
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201710087670.5A priority Critical patent/CN106909924B/en
Publication of CN106909924A publication Critical patent/CN106909924A/en
Application granted granted Critical
Publication of CN106909924B publication Critical patent/CN106909924B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06V  IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00  Arrangements for image or video recognition or understanding
    • G06V10/20  Image preprocessing
    • G06V10/26  Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267  Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06V  IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00  Scenes; Scene-specific elements
    • G06V20/05  Underwater scenes
    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06F  ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00  Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50  Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58  Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583  Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06F  ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00  Pattern recognition
    • G06F18/20  Analysing
    • G06F18/22  Matching criteria, e.g. proximity measures
    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00  Computing arrangements based on biological models
    • G06N3/02  Neural networks
    • G06N3/04  Architecture, e.g. interconnection topology
    • G06N3/045  Combinations of networks
    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06N  COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00  Computing arrangements based on biological models
    • G06N3/02  Neural networks
    • G06N3/08  Learning methods
    • G06N3/084  Backpropagation, e.g. using gradient descent
    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06V  IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00  Arrangements for image or video recognition or understanding
    • G06V10/40  Extraction of image or video features
    • G06V10/46  Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462  Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

A remote sensing image fast retrieval method based on depth saliency belongs to the field of computer vision and in particular involves deep learning, salient object detection and image retrieval. The invention takes the remote sensing image as its research object and studies a fast retrieval method for remote sensing images using deep learning. First, a multi-task salient object detection model is constructed with a fully convolutional neural network; the model performs a saliency detection task and a semantic segmentation task simultaneously, and the deep saliency features of the remote sensing image are learned during network pre-training. The deep network structure is then improved: a hash layer is added, the network is fine-tuned, and a binary hash code of the remote sensing image is learned. Finally, the saliency features and the hash codes are used together to measure similarity. The method is practical and feasible for accurate and efficient retrieval of remote sensing images and has important application value.

Description

Remote sensing image rapid retrieval method based on depth significance
Technical Field
The invention takes the remote sensing image as its research object and studies a fast retrieval method for remote sensing images using deep learning, a recent achievement in the field of artificial intelligence. First, a multi-task salient object detection model is constructed with a fully convolutional neural network and the deep saliency features of the remote sensing image are computed; then the deep network structure is improved by adding a hash layer that learns a binary hash code; finally, the saliency features and the hash code are used together to retrieve remote sensing images accurately and quickly. The invention belongs to the field of computer vision and in particular involves deep learning, salient object detection and image retrieval.
Background
Remote sensing image data are the basic data of the three spatial information technologies, Geographic Information Systems (GIS), the Global Positioning System (GPS) and Remote Sensing (RS), and are widely used in environmental monitoring, resource surveying, land use, urban planning, natural disaster analysis, military applications and many other fields. In recent years, with the development of high-resolution remote sensing satellites, imaging radar and Unmanned Aerial Vehicle (UAV) technology, remote sensing image data have further become massive, complex and high in resolution; achieving efficient and accurate remote sensing image retrieval therefore has important research significance and application value for promoting the accurate extraction and sharing of remote sensing image information.
Image retrieval technology has evolved from early Text-Based Image Retrieval (TBIR) to Content-Based Image Retrieval (CBIR), which extracts image features. Image retrieval based on salient objects can quickly select a few salient regions from a complex scene for priority processing, effectively reducing data processing complexity and improving retrieval efficiency. Compared with ordinary images, remote sensing images contain complex and variable information, small targets and little contrast with the background; with traditional saliency detection methods it is difficult to describe and analyse the saliency characteristics of remote sensing images accurately. In recent years, deep learning techniques, the latest achievement in artificial intelligence, have emerged. Deep neural networks represented by the Fully Convolutional Neural Network (FCNN), with convolution kernels resembling the local perception of the human eye and a hierarchical cascade structure resembling biological neurons, show excellent robustness in learning deep saliency features of images. Their weight sharing also greatly reduces the number of network parameters, lowers the risk of overfitting the training data, makes them easier to train than other kinds of deep networks, and improves the accuracy with which salient features are represented.
Considering the ever-growing number of remote sensing images and their limited semantic description capability, the invention proposes a fast remote sensing image retrieval method based on depth saliency, taking the large-scale Aerial Image Dataset (AID), the Wuhan University remote sensing image dataset (WHU-RS) and Google Earth remote sensing images as data sources. First, a multi-task salient object detection model based on a Fully Convolutional Neural Network (FCNN) is constructed; semantic information of the remote sensing image at different levels is learned on the pre-training dataset, used as the deep saliency feature and converted into a one-dimensional column vector. The neural network model is then fine-tuned: a hash layer is introduced, training samples are added, the high-dimensional saliency features learned by the model are mapped to a low-dimensional space as Binary Hash Codes, and the saliency feature vectors and hash codes are stored separately to build a feature database. The saliency feature vector and hash code of a remote sensing image to be queried are extracted with the trained model and compared against the feature database; similarity is measured by the Hamming Distance of the hash codes and the Euclidean Distance of the feature vectors, achieving fast retrieval of remote sensing images.
Disclosure of Invention
Unlike traditional remote sensing image retrieval methods, the invention provides a fast remote sensing image retrieval method based on depth saliency using deep learning. First, a multi-task deep salient object detection model is constructed with a Fully Convolutional Neural Network (FCNN), extending the image-level classification of an ordinary Convolutional Neural Network (CNN) to pixel-level classification. The network is pre-trained on the large-scale Aerial Image Dataset (AID); the saliency detection task and the semantic segmentation task share the convolutional layers, three levels of semantic information of the remote sensing image are learned, feature redundancy is effectively removed, and deep saliency features are extracted accurately. Second, a hash layer is added to the model and the network is fine-tuned with an expanded Wuhan University remote sensing image dataset (WHU-RS). Exploiting the incremental-learning advantage of the deep neural network through the Stochastic Gradient Descent (SGD) algorithm, binary hash codes are learned point by point, reducing the dimensionality of the high-dimensional saliency features, saving storage space and improving retrieval efficiency. Compared with traditional hashing methods, which require training samples to be input in pairs, the adopted method also scales more easily to large datasets. The saliency features learned during pre-training and fine-tuning are converted into one-dimensional column vectors and, together with the binary hash codes, form a feature database. Finally, in the image retrieval stage a coarse-to-fine strategy is adopted: the binary hash codes and the saliency features are used to measure the Hamming distance and the Euclidean distance respectively, so that remote sensing images can be retrieved quickly and accurately. The main process of the method is shown in figure 1 and can be divided into three steps: construction of the depth-saliency-based object detection model, neural network pre-training with hash-layer fine-tuning, and multi-level deep retrieval.
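To make this three-stage flow concrete, the following minimal Python sketch (an illustration only; the function name retrieve and the dictionary keys "code" and "feature" are assumptions, not terminology from the patent) shows how a coarse Hamming filter over binary hash codes can be followed by fine Euclidean ranking of saliency feature vectors:

```python
import numpy as np

def retrieve(query_code, query_feat, database, hamming_threshold=5, k=10):
    """Coarse-to-fine retrieval sketch: Hamming filtering on hash codes, then Euclidean ranking.

    database: list of dicts, each with "code" (s-bit 0/1 numpy array) and "feature" (1-D numpy array).
    """
    # Coarse stage: keep images whose binary codes lie within the Hamming threshold.
    candidates = [d for d in database
                  if np.count_nonzero(d["code"] != query_code) < hamming_threshold]
    # Fine stage: rank the candidate pool by Euclidean distance of saliency feature vectors.
    candidates.sort(key=lambda d: float(np.linalg.norm(d["feature"] - query_feat)))
    return candidates[:k]
```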
(1) Target detection model construction based on depth significance
In order to extract the salient regions of an image effectively, the invention constructs a multi-task salient object detection model based on a fully convolutional neural network. The model performs two tasks simultaneously: saliency detection and semantic segmentation. Saliency detection learns the deep features of the remote sensing image and computes the depth saliency; semantic segmentation extracts the semantic information of objects inside the image, removes background clutter from the saliency map and fills in missing parts of salient objects.
(2) Neural network pre-training and adding hash layer fine tuning
The method selects the large-scale Aerial Image Dataset (AID) as the standard dataset for pre-training the network. So that the saliency features learned by the salient object detection model are more robust for retrieving Chinese remote sensing images, 6050 Chinese remote sensing images with different illumination, shooting angles, resolutions and sizes were downloaded from Google Earth on top of the Wuhan University remote sensing image dataset (WHU-RS), expanding the WHU-RS dataset to 7000 images for fine-tuning the neural network.
(3) Multi-level depth retrieval
The invention provides a coarse-to-fine retrieval scheme. The coarse search measures similarity by the Hamming distance of the binary hash codes learned by the hash layer. The fine search maps the two-dimensional remote sensing image feature maps generated by the 13th and 15th convolutional layers into one-dimensional vectors as saliency feature vectors and measures similarity by the Euclidean distance. Using a ranking-based evaluation criterion, the Precision of the retrieval results is computed.
1. A remote sensing image fast retrieval method based on depth saliency is characterized by comprising the following steps:
step 1: target detection model construction based on depth significance
Inputting an RGB image and carrying out a series of convolution operations through 15 convolutional layers, which are shared by the saliency detection task and the superpixel object semantic segmentation task; initializing the first 13 convolutional layers from the convolutional neural network VGGNet, where the convolution kernel size is 3 × 3 and each convolutional layer is followed by a rectified linear unit ReLU as activation function; performing max pooling after the 2nd, 4th, 5th and 13th convolutional layers; the convolution kernel sizes of the 14th and 15th convolutional layers are 7 × 7 and 1 × 1 respectively, and each of these two layers is followed by a Dropout layer;
constructing a deconvolution layer by upsampling, initializing the parameters of the deconvolution layer by bilinear interpolation, and updating the learned upsampling function iteratively during training; in the salient object detection task, normalizing the output image to [0,1] by a sigmoid threshold function and learning the saliency features; in the semantic segmentation task, upsampling the feature map of the last convolutional layer by the deconvolution layer and cropping the upsampling result so that the output image has the same size as the input image;
step 2: neural network pre-training and adding hash layer fine tuning
Step 2.1: multi-task significance target detection model pre-training
The FCNN pre-training is carried out jointly by the saliency detection task and the segmentation task; χ denotes a set of N_1 training images of width W and height Q, where X_i is the i-th image and Y_ijk is the corresponding pixel-level ground-truth segmentation label of the i-th image at position (j, k), with i = 1...N_1, j = 1...W, k = 1...Q; Z denotes a set of N_2 training images, where Z_n is the n-th image, n = 1...N_2, with corresponding ground-truth binary salient-object map M_n; θ_s are the shared convolutional layer parameters, θ_h the segmentation-task parameters and θ_f the saliency-task parameters; formula (1) and formula (2) are respectively the cross-entropy cost function J_1(χ; θ_s, θ_h) of the segmentation task and the squared Euclidean distance cost function J_2(Z; θ_s, θ_f) of the saliency detection task, and the FCNN is trained by minimizing both cost functions:

J_1(χ; θ_s, θ_h) = -Σ_{i=1}^{N_1} Σ_{j=1}^{W} Σ_{k=1}^{Q} Σ_{c=1}^{C} 1{Y_ijk = c} log h_cjk(X_i; θ_s, θ_h)   (1)

J_2(Z; θ_s, θ_f) = Σ_{n=1}^{N_2} ||f(Z_n; θ_s, θ_f) - M_n||_F²   (2)

in formula (1), 1{·} is the indicator function, h_cjk is the element (j, k) of the confidence segmentation map of class c, c = 1...C, and h(X_i; θ_s, θ_h) is the semantic segmentation function, returning confidence segmentation maps for all C object classes, where C is the number of image classes contained in the pre-training data set; in formula (2), f(Z_n; θ_s, θ_f) is the saliency map output function and ||·||_F denotes the Frobenius norm;
next, the cost functions are minimized with the stochastic gradient descent SGD method on the basis of regularizing all training samples; because the data set used for pre-training does not carry segmentation and saliency annotations at the same time, the segmentation task and the saliency detection task are performed alternately; the training process normalizes the sizes of all original images; the learning rate is 0.001 ± 0.01; the reference value of the momentum parameter is [0.9, 1.0] and the reference value of the weight decay factor is 0.0005 ± 0.0002; the stochastic gradient descent learning process runs for more than 80000 iterations; the detailed pre-training procedure is as follows:
1) the shared full-convolution parameters θ_s are initialized from VGGNet;
2) the segmentation-task parameters θ_h and the saliency-task parameters θ_f are randomly initialized from a normal distribution;
3) with the current θ_s and θ_h, the segmentation network is trained with SGD and both parameters are updated;
4) with the current θ_s and θ_f, the saliency network is trained with SGD and the relevant parameters are updated;
5) with the updated θ_s and θ_h, the segmentation network is trained again with SGD;
6) with the updated θ_s and θ_f, the saliency network is trained again with SGD;
7) steps 3)-6) are repeated three times to obtain the final pre-training parameters θ_s, θ_h and θ_f;
Step 2.2: adding a hash layer to fine-tune the network for the target domain
Inserting a fully connected layer containing s neurons, namely a hash layer H, between the penultimate layer of the pre-trained network and the final task layer, mapping the high-dimensional features to a low-dimensional space and generating binary hash codes for storage; the weights of the hash layer H are initialized as hash values constructed by random projection, the neuron activation function is a sigmoid so that the output values lie between 0 and 1, and the number of neurons is the code length of the target binary code;
the fine-tuning process adjusts the network weights by the back-propagation algorithm; network fine-tuning adjusts the network weights after the tenth convolutional layer; compared with the data set of the pre-training network, the amount of data in the fine-tuning data set can be reduced by 10%-50%; compared with the pre-training parameters, the number of iterations and the learning rate in the fine-tuning process are reduced to 1%-10% of their pre-training values, while the momentum parameter and the weight decay factor remain unchanged;
the detailed trimming process is as follows:
1) sharing full convolution parameters
Figure GDA0002557080360000051
Segmenting task parameters
Figure GDA0002557080360000052
And salient task parameters
Figure GDA0002557080360000053
Obtained through a pre-training process;
2) according to
Figure GDA0002557080360000054
And
Figure GDA0002557080360000055
the SGD is used for training the segmentation network, and the two parameters are updated to
Figure GDA0002557080360000056
And
Figure GDA0002557080360000057
3) according to
Figure GDA0002557080360000058
And
Figure GDA0002557080360000059
training significance network by SGD and updating relevant parameters to
Figure GDA00025570803600000510
And
Figure GDA00025570803600000511
4) according to
Figure GDA00025570803600000512
And
Figure GDA00025570803600000513
training the segmentation network by using SGD to obtain
Figure GDA00025570803600000514
And
Figure GDA00025570803600000515
5) according to
Figure GDA00025570803600000516
And
Figure GDA00025570803600000517
training significance network by SGD and updating relevant parameters to
Figure GDA00025570803600000518
And
Figure GDA00025570803600000519
6) repeating the above steps 3) -6) three times to obtain the final parameter thetas,θh,θf
And step 3: multi-level depth retrieval
Step 3.1: coarse search
Step 3.1.1: generating binary hash codes
An image I_q to be queried is input into the fine-tuned neural network and the output of the hash layer is extracted as the image signature, denoted out(H); for each bit r = 1...s, the binary code is obtained by binarizing the activation value against a threshold:

H^r = 1 if out^r(H) ≥ 0.5, and H^r = 0 otherwise, r = 1...s   (3)

where s is the number of neurons in the hash layer, whose initial value is set in the range [40, 100]; the data set used for retrieval, containing t images, is denoted {I_1, I_2, ..., I_t}; the corresponding binary codes are denoted {H_1, H_2, ..., H_t}, where m = 1...t and H_m ∈ {0,1}^s, i.e. the s-bit binary code produced by the s neurons, each bit being 0 or 1;
step 3.1.2: hamming distance metric similarity
The Hamming distance between two character strings of equal length is the number of positions at which the corresponding characters differ; for an image I_q to be queried with binary code H_q, if the Hamming distance between H_q and a database code H_i is smaller than the set threshold, a candidate pool P = {I_c1, I_c2, ..., I_cm} containing m candidate images is formed; two images are considered similar when their Hamming distance is smaller than 5;
step 3.2: fine search
Step 3.2.1: salient feature extraction
For the image I_q to be queried, the two-dimensional remote sensing image feature maps generated by the 13th and 15th convolutional layers of the neural network are each mapped into a one-dimensional vector and stored; in the subsequent retrieval process, the retrieval results obtained with the different feature vectors are compared to decide which convolutional layer's feature map is finally used to extract the salient features of the remote sensing image;
step 3.2.2: euclidean distance metric similarity
For a query image I_q and the candidate pool P, the top-k ranked images are selected from P using the extracted saliency feature vectors; V_q and V_ci denote the feature vectors of the query image q and of the candidate image I_ci respectively; the Euclidean distance s_i between the feature vectors of I_q and of the i-th image in the candidate pool P is defined as their similarity level, as shown in formula (4):

s_i = ||V_q - V_ci||_2   (4)
the smaller the Euclidean distance, the greater the similarity between the two images; the candidate images I_ci are sorted in ascending order of their distance to the query image, and the top-k ranked images are the retrieval result;
step 3.3: evaluation of search results
The retrieval results are evaluated with a ranking-based evaluation criterion; for a query image q and the k top-ranked retrieval result images, Precision is calculated according to the following formula:
Precision@k = ( Σ_{i=1}^{k} Rel(i) ) / k
where Precision@k denotes, for a given threshold k, the proportion of relevant results among the first k returned images, i.e. the average precision over the top k results; Rel(i) denotes the relevance between the query image q and the i-th returned image, Rel(i) ∈ {0, 1}, where 1 means the query image q and the i-th image belong to the same class, i.e. are relevant, and 0 means they are not relevant.
Compared with the prior art, the invention has the following obvious advantages and beneficial effects:
First, compared with traditional hand-crafted remote sensing image features, the invention uses a fully convolutional neural network to build a deep salient object detection model, trains the network on domestic and foreign remote sensing image databases, analyses three levels of semantic information of the image and automatically learns the saliency features of the remote sensing image. At the same time, a segmentation task is innovatively added to the fully convolutional network to learn the depth saliency of the remote sensing image, which effectively improves the learned saliency features. Experiments show that the model can extract salient objects with clear edges on multi-target detection datasets with complex scenes, such as the Microsoft COCO dataset, so the learning ability of the deep neural network can be transferred to learning the salient features of remote sensing images. Second, a hash layer is introduced into the fully convolutional network architecture, and binary hash codes are generated while the deep saliency features of the remote sensing image are learned, saving storage space and improving subsequent retrieval efficiency. Finally, image retrieval adopts a coarse-to-fine strategy in which the binary hash codes and the saliency features are used together for similarity measurement. Experiments show that, by adding a hash layer to an AlexNet neural network and using a coarse-to-fine multi-level retrieval strategy over 2.5 million ordinary images of different categories, the precision of the top K returned images (topK precision) reaches 88% on average when K = 1000, with a retrieval time of about 1 s. Transferring this approach to remote sensing image retrieval therefore has important application value for accurate and efficient retrieval of remote sensing images.
Description of the drawings:
FIG. 1 is a flow chart of a remote sensing image fast retrieval method based on depth saliency;
FIG. 2 is a diagram of a target detection model architecture based on depth saliency;
FIG. 3 is a diagram of a neural network architecture incorporating a hash layer;
fig. 4 is a diagram of a multi-level search process.
Detailed Description
Based on the above description, a specific implementation flow is given below; the protection scope of this patent is, however, not limited to this implementation flow.
Step 1: target detection model construction based on depth significance
Subjectively, a salient region is a region on which human vision focuses attention, closely related to the Human Visual System (HVS); objectively, it is a sub-region in which some feature of the image is most prominent. The key to the saliency detection problem is therefore feature learning and extraction. Given the power of deep learning in this respect, the invention applies the fully convolutional neural network to the saliency detection problem and proposes a multi-task salient object detection model based on a fully convolutional network. The model performs two tasks simultaneously: a saliency detection task and a semantic segmentation task. The saliency detection task learns the deep features of the remote sensing image and computes the depth saliency; the semantic segmentation task extracts the semantic information of objects inside the image, removes background clutter from the saliency map and fills in missing parts of salient objects.
The fully convolutional network architecture of the invention is implemented on the mainstream open-source deep learning framework Caffe; the specific model structure is shown in figure 2. An RGB image is input and passes through a series of convolution operations in 15 convolutional layers (Conv), which are shared by the saliency detection task and the superpixel object semantic segmentation task. The first 13 convolutional layers are initialized from the convolutional neural network VGGNet, the convolution kernel size is 3 × 3, and each convolutional layer is followed by a rectified linear unit (ReLU) as activation function to speed up convergence. Max Pooling is performed after the 2nd, 4th, 5th and 13th convolutional layers to reduce the feature dimensionality, reduce computation and preserve feature invariance. The convolution kernels of the 14th and 15th convolutional layers are 7 × 7 and 1 × 1 respectively, and each of these two layers is followed by a Dropout layer to counter the potential overfitting of a complex network structure, i.e. the model learning the noise and details of the training data too closely, which leads to a high error rate and poor generalization in actual tests. A deconvolution layer is constructed by upsampling, its parameters are initialized by bilinear interpolation, and the upsampling function is learned and updated iteratively during training. In the salient object detection task, the output image is normalized to [0,1] by a sigmoid threshold function and the saliency features are learned. In the semantic segmentation task, the feature map of the last convolutional layer is upsampled by the deconvolution layer and the result is cropped (Crop) so that the output image has the same size as the input image; a prediction is thus generated for every pixel and the spatial information of the original input image is preserved.
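As an illustration of this shared-backbone, two-head layout, a reduced PyTorch sketch might look as follows (the patent implements the model in Caffe; the module name MultiTaskFCN, the number of sketched layers and the channel sizes are assumptions, with only the 3 × 3 / 7 × 7 / 1 × 1 kernels, ReLU, Dropout, sigmoid saliency output and bilinear upsampling mirroring the description):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskFCN(nn.Module):
    """Reduced sketch of the multi-task saliency/segmentation FCN (hypothetical layer sizes)."""
    def __init__(self, num_classes=30):
        super().__init__()
        # Shared convolutional backbone: VGG-style 3x3 convolutions with ReLU and max pooling,
        # then 7x7 and 1x1 convolutions each followed by Dropout (the real model has 15 layers).
        self.shared = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(128, 512, 7, padding=3), nn.ReLU(inplace=True),
            nn.Dropout2d(0.5),
            nn.Conv2d(512, 512, 1), nn.ReLU(inplace=True),
            nn.Dropout2d(0.5),
        )
        # Task-specific 1x1 heads: one channel for saliency, C channels for segmentation.
        self.saliency_head = nn.Conv2d(512, 1, 1)
        self.seg_head = nn.Conv2d(512, num_classes, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feat = self.shared(x)
        # Upsample back to the input size (stands in for the learned deconvolution layers
        # initialised with bilinear interpolation, followed by cropping).
        sal = torch.sigmoid(F.interpolate(self.saliency_head(feat), size=(h, w),
                                          mode="bilinear", align_corners=False))
        seg = F.interpolate(self.seg_head(feat), size=(h, w),
                            mode="bilinear", align_corners=False)
        return sal, seg

model = MultiTaskFCN()
sal_map, seg_logits = model(torch.randn(1, 3, 500, 500))   # 500 x 500 as in the pre-training setup
```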
Step 2: neural network pre-training and adding hash layer fine tuning
The invention uses the public large-scale Aerial Image Dataset (AID) for pre-training the neural network, so as to better learn the semantic features of remote sensing images at different levels. A hash layer is then introduced and the network is further fine-tuned with the expanded Wuhan University remote sensing image dataset (WHU-RS), so that the high-dimensional features learned by the neural network can be mapped to low-dimensional features, the retrieval time is reduced and the learned features become more robust.
Step 2.1: multi-task significance target detection model pre-training
Step 2.1.1: constructing a pre-training data set
The pre-training stage selects the public large-scale Aerial Image Dataset (AID) as the standard pre-training dataset. AID contains 30 categories and 10000 aerial images, all selected from Google Earth and annotated by professionals in the remote sensing field. The images of each category are taken from different countries and regions, at different times and with different remote sensing sensors; the image size is 600 × 600 pixels and the resolution ranges from 0.5 m/pixel to 8 m/pixel. Compared with other datasets, AID has small intra-class differences and large inter-class differences and is currently the largest aerial image dataset.
Step 2.1.2: salient object detection model pre-training
The FCNN pre-training is carried out jointly by the saliency detection task and the segmentation task. χ denotes a set of N_1 training images of width W and height Q, where X_i is the i-th image and Y_ijk is the corresponding pixel-level ground-truth segmentation label of the i-th image at position (j, k), with i = 1...N_1, j = 1...W, k = 1...Q. Z denotes a set of N_2 training images, where Z_n is the n-th image, n = 1...N_2, with corresponding ground-truth binary salient-object map M_n. θ_s are the shared convolutional layer parameters, θ_h the segmentation-task parameters and θ_f the saliency-task parameters. Formula (1) and formula (2) are respectively the cross-entropy cost function J_1(χ; θ_s, θ_h) of the segmentation task and the squared Euclidean distance cost function J_2(Z; θ_s, θ_f) of the saliency detection task, and the FCNN is trained by minimizing both cost functions:

J_1(χ; θ_s, θ_h) = -Σ_{i=1}^{N_1} Σ_{j=1}^{W} Σ_{k=1}^{Q} Σ_{c=1}^{C} 1{Y_ijk = c} log h_cjk(X_i; θ_s, θ_h)   (1)

J_2(Z; θ_s, θ_f) = Σ_{n=1}^{N_2} ||f(Z_n; θ_s, θ_f) - M_n||_F²   (2)

In formula (1), 1{·} is the indicator function, h_cjk is the element (j, k) of the confidence segmentation map of class c, c = 1...C, and h(X_i; θ_s, θ_h) is the semantic segmentation function, returning confidence segmentation maps for all C object classes, where C is the number of image classes contained in the pre-training dataset (30 in the invention). In formula (2), f(Z_n; θ_s, θ_f) is the saliency map output function and ||·||_F denotes the Frobenius norm.
Since the training process requires all original images to be normalized in size, the invention resets the original images to 500 × 500 pixels for pre-training. The learning rate is an essential parameter of the SGD learning method and determines the weight update rate: if set too large, the cost function oscillates and overshoots the optimum; if too small, convergence is too slow. A smaller learning rate, such as 0.001 ± 0.01, is generally preferred to keep the system stable. The momentum parameter and the weight decay factor improve training adaptivity; the momentum parameter is typically in [0.9, 1.0] and the weight decay factor is typically 0.0005 ± 0.0002. Based on experimental observation, the invention sets the learning rate to 10⁻¹⁰, the momentum parameter to 0.99 and the weight decay factor to 0.0005 in the Caffe framework. The stochastic gradient descent (SGD) learning process is accelerated by an NVIDIA GTX 1080 GPU and runs for a total of 80000 iterations. The detailed pre-training procedure is as follows:
1) The shared full-convolution parameters θ_s are initialized from VGGNet;
2) the segmentation-task parameters θ_h and the saliency-task parameters θ_f are randomly initialized from a normal distribution;
3) with the current θ_s and θ_h, the segmentation network is trained with SGD and both parameters are updated;
4) with the current θ_s and θ_f, the saliency network is trained with SGD and the relevant parameters are updated;
5) with the updated θ_s and θ_h, the segmentation network is trained again with SGD;
6) with the updated θ_s and θ_f, the saliency network is trained again with SGD;
7) steps 3)-6) are repeated three times to obtain the final pre-training parameters θ_s, θ_h and θ_f.
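A minimal sketch of this alternating optimisation, assuming the MultiTaskFCN module from the earlier sketch and hypothetical data loaders seg_loader (images with pixel-level class labels) and sal_loader (images with binary saliency masks of shape N × 1 × H × W), could look as follows; the cross-entropy term stands in for J_1, the squared-error term for J_2, and the optimizer uses the hyperparameters quoted above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pretrain_alternating(model, seg_loader, sal_loader, rounds=3):
    """Alternate the segmentation task (cost J1) and the saliency task (cost J2) with SGD."""
    opt = torch.optim.SGD(model.parameters(), lr=1e-10, momentum=0.99, weight_decay=0.0005)
    ce = nn.CrossEntropyLoss()                       # cross-entropy cost J1 (segmentation)
    for _ in range(rounds):                          # the alternation is repeated three times
        for images, labels in seg_loader:            # segmentation pass: updates theta_s, theta_h
            opt.zero_grad()
            _, seg_logits = model(images)            # logits of shape (N, C, H, W)
            ce(seg_logits, labels).backward()        # labels of shape (N, H, W), long
            opt.step()
        for images, masks in sal_loader:             # saliency pass: updates theta_s, theta_f
            opt.zero_grad()
            sal_map, _ = model(images)               # saliency map of shape (N, 1, H, W)
            F.mse_loss(sal_map, masks, reduction="sum").backward()   # squared Euclidean cost J2
            opt.step()
    return model
```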
Step 2.2: adding a hash layer to fine-tune the network for the target domain
Step 2.2.1: construction of Chinese remote sensing image data set for fine tuning network
The expanded Wuhan University remote sensing image dataset (WHU-RS) is selected for fine-tuning the neural network. The original WHU-RS dataset contains 19 scene categories and 950 remote sensing images of different resolutions; the image size is 600 × 600 pixels and all images are taken from Google Earth. Taking the landforms of China into account, the original dataset is reconstructed and expanded to 7000 remote sensing images as the sample library, with more than 200 images per category. The newly added sample images differ in illumination, shooting angle, resolution and size, which helps the neural network learn more robust saliency features.
Step 2.2.2: joining hash layer trim networks
The feature vectors generated by a deep neural network have high dimensionality and are very time-consuming to use in large-scale image retrieval. Because similar images have similar binary hash codes, a fully connected layer containing s neurons, namely a hash layer H, is inserted between the penultimate layer of the pre-trained network and the final task layer, mapping the high-dimensional features to a low-dimensional space and generating binary hash codes for storage; the network structure is shown in figure 3. The weights of the hash layer H are initialized as hash values constructed by random projection, the neuron activation function is a sigmoid so that the output values lie between 0 and 1, the threshold is set empirically to 0.5, and the number of neurons is the code length of the target binary code. The hash layer not only abstracts the features of the previous layer but also bridges the mid-level and high-level semantic features of the image.
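Under the same PyTorch assumption, inserting a hash layer of s sigmoid neurons between the flattened penultimate features and the task layer can be sketched as follows (the flattened feature size, the Gaussian scale of the random-projection initialisation and the 30-way task layer are illustrative; s = 48 follows the text):

```python
import torch
import torch.nn as nn

class HashLayer(nn.Module):
    """Fully connected hash layer H with s sigmoid neurons producing activations in (0, 1)."""
    def __init__(self, in_features, s=48):
        super().__init__()
        self.fc = nn.Linear(in_features, s)
        # Initialise the weights as a random projection (Gaussian), as described.
        nn.init.normal_(self.fc.weight, mean=0.0, std=0.01)
        nn.init.zeros_(self.fc.bias)

    def forward(self, x):
        return torch.sigmoid(self.fc(x))

# Interposed between the (flattened) penultimate feature map and the task layer:
penultimate_dim = 512 * 4 * 4                    # hypothetical flattened feature size
hash_layer = HashLayer(penultimate_dim, s=48)
task_layer = nn.Linear(48, 30)                   # final task layer over 30 categories
features = torch.randn(2, penultimate_dim)
hash_activations = hash_layer(features)          # values in (0, 1), later binarised at 0.5
logits = task_layer(hash_activations)
```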
The fine-tuning process adjusts the network weights by the Back Propagation algorithm. Fine-tuning can be applied to the whole network or to part of it. Because the features learned by the lower layers are more general and less prone to overfitting, the invention uses the expanded WHU-RS dataset to adjust mainly the weights of the higher layers, i.e. the network after the tenth convolutional layer. In general, the amount of data in the fine-tuning dataset is 10%-50% smaller than that of the pre-training dataset; here the fine-tuning dataset contains 7000 images, clearly fewer than the 10000 images used for pre-training. Compared with the pre-training parameters, the network parameters in the fine-tuning process are reduced appropriately: the number of iterations and the learning rate can be reduced to 1%-10%. In the invention, the number of iterations is reduced to 8000 during fine-tuning and the learning rate is reduced to 1% of its pre-training value, i.e. 10⁻¹², while the momentum parameter and the weight decay factor remain unchanged at 0.99 and 0.0005 respectively.
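As a sketch of fine-tuning only the weights after the tenth convolutional layer, the earlier layers can be frozen and an SGD optimizer built over the remaining parameters with the reduced learning rate (the split index and the model attributes refer to the reduced MultiTaskFCN sketch above, so they are illustrative rather than the actual Caffe configuration):

```python
import torch

def freeze_early_layers(model, last_frozen_conv=10):
    """Freeze the shared convolutions up to the given index; later layers and the
    task heads / hash layer remain trainable."""
    conv_seen = 0
    for module in model.shared:
        if isinstance(module, torch.nn.Conv2d):
            conv_seen += 1
        if conv_seen <= last_frozen_conv:           # parameters of early modules are frozen
            for p in module.parameters():
                p.requires_grad = False

def finetune_optimizer(model):
    # Reduced learning rate for fine-tuning (1e-12 in the text); momentum and
    # weight decay keep their pre-training values (0.99 and 0.0005).
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=1e-12, momentum=0.99, weight_decay=0.0005)
```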
The detailed fine-tuning procedure is as follows:
1) The shared full-convolution parameters θ_s, the segmentation-task parameters θ_h and the saliency-task parameters θ_f are obtained from the pre-training process;
2) with the current θ_s and θ_h, the segmentation network is trained with SGD and both parameters are updated;
3) with the current θ_s and θ_f, the saliency network is trained with SGD and the relevant parameters are updated;
4) with the updated θ_s and θ_h, the segmentation network is trained again with SGD;
5) with the updated θ_s and θ_f, the saliency network is trained again with SGD;
6) steps 2)-5) are repeated three times to obtain the final parameters θ_s, θ_h and θ_f.
And step 3: multi-level depth retrieval
The shallow layers of a deep convolutional neural network learn low-level visual features, while its deep layers capture image semantic information. The invention therefore adopts a coarse-to-fine retrieval strategy to achieve fast and accurate image retrieval. The feature extraction and retrieval process is shown in figure 4.
Step 3.1: coarse search
First, a set of candidate images with similar high-level semantic features is retrieved, i.e. images whose binary activation values in the hash layer are similar; a ranking of similar images is then generated according to the similarity measure.
Step 3.1.1: generating binary hash codes
The image I_q to be queried is input into the network and the output of the hash layer is extracted as the image signature, denoted out(H). For each bit r = 1...s, the binary code is obtained by binarizing the activation value against the threshold:

H^r = 1 if out^r(H) ≥ 0.5, and H^r = 0 otherwise, r = 1...s   (3)

where s is the number of neurons in the hash layer; overfitting can occur when s is too large, so the initial value is suggested to lie in the range [40, 100], with the specific value adjusted according to the actual training data, and s is set to 48 in the invention. The data set used for retrieval, containing n images, is denoted {I_1, I_2, ..., I_n}; the corresponding binary codes are denoted {H_1, H_2, ..., H_n}, where i = 1...n and H_i ∈ {0,1}^s, i.e. the s-bit binary code produced by the s neurons, each bit being 0 or 1.
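Formula (3) can be sketched directly (NumPy assumed; the activation values shown are illustrative): the s hash-layer activations out(H), each in (0, 1), are thresholded at 0.5 to give the s-bit binary signature of the image.

```python
import numpy as np

def binarize_hash(activations, threshold=0.5):
    """Turn s hash-layer activations in (0, 1) into an s-bit binary code (formula (3))."""
    return (np.asarray(activations) >= threshold).astype(np.uint8)

out_h = np.array([0.91, 0.12, 0.47, 0.88, 0.55, 0.03])   # illustrative activations, s = 6
print(binarize_hash(out_h))                               # -> [1 0 0 1 1 0]
```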
Step 3.1.2: hamming distance metric similarity
The Hamming distance between two character strings of equal length is the number of positions at which the corresponding characters differ. For an image I_q to be queried with binary code H_q, if the Hamming distance between H_q and a database code H_i is smaller than the set threshold, a candidate pool P = {I_c1, I_c2, ..., I_cm} containing m candidate images is formed; in general, two images can be considered similar when their Hamming distance is smaller than 5.
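The coarse search can then be sketched as follows (function and variable names are illustrative): the Hamming distance between the query code and each database code is the number of differing bits, and images within the threshold of 5 form the candidate pool P.

```python
import numpy as np

def hamming_distance(a, b):
    """Number of positions at which two equal-length binary codes differ."""
    return int(np.count_nonzero(a != b))

def coarse_search(query_code, db_codes, threshold=5):
    """Return the indices of database images whose codes lie within the Hamming threshold."""
    return [i for i, code in enumerate(db_codes)
            if hamming_distance(query_code, code) < threshold]

# Illustrative 8-bit codes for a query and three database images.
q = np.array([1, 0, 1, 1, 0, 0, 1, 0])
db = [np.array([1, 0, 1, 0, 0, 0, 1, 0]),    # distance 1 -> candidate
      np.array([0, 1, 0, 0, 1, 1, 0, 1]),    # distance 8 -> rejected
      np.array([1, 1, 1, 1, 0, 0, 1, 0])]    # distance 1 -> candidate
print(coarse_search(q, db))                   # -> [0, 2]
```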
Step 3.2: fine search
Step 3.2.1: salient feature extraction
Because different convolutional layers of a deep convolutional network learn semantic features of the image at different levels, the features learned by the middle and higher convolutional layers are better suited to the image retrieval task. Therefore, for the image I_q to be queried, the two-dimensional remote sensing image feature maps generated by the 13th and 15th convolutional layers of the neural network are each mapped into a one-dimensional vector and stored. In the subsequent retrieval process, the retrieval results obtained with the different feature vectors are compared to decide which convolutional layer's feature map is finally used to extract the salient features of the remote sensing image.
Step 3.2.2: euclidean distance metric similarity
For a query image I_q and the candidate pool P, the top-k ranked images are selected from P using the extracted saliency feature vectors. V_q and V_ci denote the feature vectors of the query image q and of the candidate image I_ci respectively. The Euclidean distance s_i between the feature vectors of I_q and of the i-th image in the candidate pool P is defined as their similarity level, as shown in formula (4):

s_i = ||V_q - V_ci||_2   (4)
The smaller the Euclidean distance, the greater the similarity between the two images. The candidate images I_ci are sorted in ascending order of their distance to the query image, and the top-k ranked images are the retrieval result.
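The fine search over the candidate pool can be sketched in the same spirit (names and data are illustrative): Euclidean distances between the query's saliency feature vector and each candidate's vector, as in formula (4), are sorted in ascending order and the top-k candidates are returned.

```python
import numpy as np

def fine_search(query_vec, candidate_vecs, k=3):
    """Rank candidate images by Euclidean distance to the query feature vector (formula (4))."""
    dists = [float(np.linalg.norm(query_vec - v)) for v in candidate_vecs]
    order = np.argsort(dists)                 # ascending: smaller distance = more similar
    return [(int(i), dists[i]) for i in order[:k]]

v_q = np.array([0.2, 0.8, 0.1])
pool = [np.array([0.1, 0.9, 0.2]),            # close to the query
        np.array([0.9, 0.1, 0.8]),            # far from the query
        np.array([0.2, 0.7, 0.1])]            # closest
print(fine_search(v_q, pool, k=2))            # -> [(2, ...), (0, ...)]
```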
Step 3.3: evaluation of search results
The invention evaluates the retrieval results with a ranking-based evaluation criterion. For a query image q and the k top-ranked retrieval result images, Precision is calculated according to the following formula:
Precision@k = ( Σ_{i=1}^{k} Rel(i) ) / k
where Precision@k denotes, for a threshold k set according to actual requirements, the proportion of relevant results among the first k returned images, i.e. the average precision over the top k results; Rel(i) denotes the relevance between the query image q and the i-th returned image, Rel(i) ∈ {0, 1}, where 1 means the query image q and the i-th image belong to the same class, i.e. are relevant, and 0 means they are not relevant.
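The ranking-based evaluation can be sketched as follows (illustrative data): Rel(i) is 1 when the i-th returned image shares the query's class and 0 otherwise, and Precision@k is their mean over the first k results.

```python
def precision_at_k(rel, k):
    """Precision@k = sum_{i=1..k} Rel(i) / k, with Rel(i) in {0, 1}."""
    return sum(rel[:k]) / float(k)

# 10 returned images: 1 = same class as the query, 0 = different class.
rel = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
print(precision_at_k(rel, 5))   # -> 0.6
```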

Claims (1)

1. A remote sensing image fast retrieval method based on depth saliency is characterized by comprising the following steps:
step 1: target detection model construction based on depth significance
Inputting an RGB image and carrying out a series of convolution operations through 15 convolutional layers, which are shared by the saliency detection task and the superpixel object semantic segmentation task; initializing the first 13 convolutional layers from the convolutional neural network VGGNet, where the convolution kernel size is 3 × 3 and each convolutional layer is followed by a rectified linear unit ReLU as activation function; performing max pooling after the 2nd, 4th, 5th and 13th convolutional layers; the convolution kernel sizes of the 14th and 15th convolutional layers are 7 × 7 and 1 × 1 respectively, and each of these two layers is followed by a Dropout layer;
constructing a deconvolution layer by upsampling, initializing the parameters of the deconvolution layer by bilinear interpolation, and updating the learned upsampling function iteratively during training; in the salient object detection task, normalizing the output image to [0,1] by a sigmoid threshold function and learning the saliency features; in the semantic segmentation task, upsampling the feature map of the last convolutional layer by the deconvolution layer and cropping the upsampling result so that the output image has the same size as the input image;
step 2: neural network pre-training and adding hash layer fine tuning
Step 2.1: multi-task significance target detection model pre-training
The FCNN pre-training is carried out jointly by the saliency detection task and the segmentation task; χ denotes a set of N_1 training images of width W and height Q, where X_i is the i-th image and Y_ijk is the corresponding pixel-level ground-truth segmentation label of the i-th image at position (j, k), with i = 1...N_1, j = 1...W, k = 1...Q; Z denotes a set of N_2 training images, where Z_n is the n-th image, n = 1...N_2, with corresponding ground-truth binary salient-object map M_n; θ_s are the shared convolutional layer parameters, θ_h the segmentation-task parameters and θ_f the saliency-task parameters; formula (1) and formula (2) are respectively the cross-entropy cost function J_1(χ; θ_s, θ_h) of the segmentation task and the squared Euclidean distance cost function J_2(Z; θ_s, θ_f) of the saliency detection task, and the FCNN is trained by minimizing both cost functions:

J_1(χ; θ_s, θ_h) = -Σ_{i=1}^{N_1} Σ_{j=1}^{W} Σ_{k=1}^{Q} Σ_{c=1}^{C} 1{Y_ijk = c} log h_cjk(X_i; θ_s, θ_h)   (1)

J_2(Z; θ_s, θ_f) = Σ_{n=1}^{N_2} ||f(Z_n; θ_s, θ_f) - M_n||_F²   (2)

in formula (1), 1{·} is the indicator function, h_cjk is the element (j, k) of the confidence segmentation map of class c, c = 1...C, and h(X_i; θ_s, θ_h) is the semantic segmentation function, returning confidence segmentation maps for all C object classes, where C is the number of image classes contained in the pre-training data set; in formula (2), f(Z_n; θ_s, θ_f) is the saliency map output function and ||·||_F denotes the Frobenius norm;
next, the cost functions are minimized with the stochastic gradient descent SGD method on the basis of regularizing all training samples; because the data set used for pre-training does not carry segmentation and saliency annotations at the same time, the segmentation task and the saliency detection task are performed alternately; the training process normalizes the sizes of all original images; the learning rate is 0.001 ± 0.01; the reference value of the momentum parameter is [0.9, 1.0] and the reference value of the weight decay factor is 0.0005 ± 0.0002; the stochastic gradient descent learning process runs for more than 80000 iterations; the detailed pre-training procedure is as follows:
1) the shared full-convolution parameters θ_s are initialized from VGGNet;
2) the segmentation-task parameters θ_h and the saliency-task parameters θ_f are randomly initialized from a normal distribution;
3) with the current θ_s and θ_h, the segmentation network is trained with SGD and both parameters are updated;
4) with the current θ_s and θ_f, the saliency network is trained with SGD and the relevant parameters are updated;
5) with the updated θ_s and θ_h, the segmentation network is trained again with SGD;
6) with the updated θ_s and θ_f, the saliency network is trained again with SGD;
7) steps 3)-6) are repeated three times to obtain the final pre-training parameters θ_s, θ_h and θ_f;
Step 2.2: adding a hash layer to fine-tune the network for the target domain
Inserting a fully connected layer containing s neurons, namely a hash layer H, between the penultimate layer of the pre-trained network and the final task layer, mapping the high-dimensional features to a low-dimensional space and generating binary hash codes for storage; the weights of the hash layer H are initialized as hash values constructed by random projection, the neuron activation function is a sigmoid so that the output values lie between 0 and 1, and the number of neurons is the code length of the target binary code;
the fine-tuning process adjusts the network weights by the back-propagation algorithm; network fine-tuning adjusts the network weights after the tenth convolutional layer; compared with the data set of the pre-training network, the amount of data in the fine-tuning data set can be reduced by 10%-50%; compared with the pre-training parameters, the number of iterations and the learning rate in the fine-tuning process are reduced to 1%-10% of their pre-training values, while the momentum parameter and the weight decay factor remain unchanged;
the detailed fine-tuning procedure is as follows:
1) the shared full-convolution parameters θ_s, the segmentation-task parameters θ_h and the saliency-task parameters θ_f are obtained from the pre-training process;
2) with the current θ_s and θ_h, the segmentation network is trained with SGD and both parameters are updated;
3) with the current θ_s and θ_f, the saliency network is trained with SGD and the relevant parameters are updated;
4) with the updated θ_s and θ_h, the segmentation network is trained again with SGD;
5) with the updated θ_s and θ_f, the saliency network is trained again with SGD;
6) steps 2)-5) are repeated three times to obtain the final parameters θ_s, θ_h and θ_f;
And step 3: multi-level depth retrieval
Step 3.1: coarse search
Step 3.1.1: generating binary hash codes
An image I_q to be queried is input into the fine-tuned neural network and the output of the hash layer is extracted as the image signature, denoted out(H); for each bit r = 1...s, the binary code is obtained by binarizing the activation value against a threshold:

H^r = 1 if out^r(H) ≥ 0.5, and H^r = 0 otherwise, r = 1...s   (3)

where s is the number of neurons in the hash layer, whose initial value is set in the range [40, 100]; the data set used for retrieval, containing t images, is denoted {I_1, I_2, ..., I_t}; the corresponding binary codes are denoted {H_1, H_2, ..., H_t}, where m = 1...t and H_m ∈ {0,1}^s, i.e. the s-bit binary code produced by the s neurons, each bit being 0 or 1;
step 3.1.2: hamming distance metric similarity
The Hamming distance between two character strings of equal length is the number of positions at which the corresponding characters differ; for an image I_q to be queried with binary code H_q, if the Hamming distance between H_q and a database code H_i is smaller than the set threshold, a candidate pool P = {I_c1, I_c2, ..., I_cm} containing m candidate images is formed; two images are considered similar when their Hamming distance is smaller than 5;
step 3.2: fine search
Step 3.2.1: salient feature extraction
For the image I_q to be queried, the two-dimensional remote sensing image feature maps generated by the 13th and 15th convolutional layers of the neural network are each mapped into a one-dimensional vector and stored; in the subsequent retrieval process, the retrieval results obtained with the different feature vectors are compared to decide which convolutional layer's feature map is finally used to extract the salient features of the remote sensing image;
step 3.2.2: euclidean distance metric similarity
For a query image Iq and the candidate pool P, the extracted salient feature vectors are used to select the top-k images from the candidate pool P; Vq and Vci denote the feature vectors of the query image q and of the candidate image Ici, respectively; the Euclidean distance si between the feature vector of Iq and that of the i-th image in the candidate pool P is defined as the similarity level between them, as shown in formula (4):
si = ‖Vq − Vci‖ = √( Σj (Vq,j − Vci,j)² )    (4)
The smaller the Euclidean distance, the greater the similarity between the two images; the candidate images Ici are sorted in ascending order of si, and the top-k images in this ranking are returned as the retrieval result;
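A minimal sketch of the fine search in step 3.2.2: the candidates from the pool are ranked by the Euclidean distance of formula (4), and the k closest images are returned; the feature dimension and pool size are illustrative.

```python
# Rank candidate images by Euclidean distance between salient feature vectors.
import numpy as np

def fine_search(query_vec, candidate_vecs, k=10):
    """Return the indices and distances of the k candidates closest to the query."""
    dists = np.linalg.norm(candidate_vecs - query_vec, axis=1)  # s_i for each candidate
    order = np.argsort(dists)                                   # ascending: most similar first
    return order[:k], dists[order[:k]]

v_q = np.random.rand(512)              # hypothetical salient feature vector of the query image
pool_feats = np.random.rand(40, 512)   # feature vectors of the m candidate images in P
top_k_idx, top_k_dist = fine_search(v_q, pool_feats, k=10)
```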
step 3.3: evaluation of search results
The retrieval result is evaluated with a ranking-based criterion; for a query image q and the top-k retrieved images, the precision is calculated according to the following formula:
Precision@k = ( Σ_{i=1}^{k} Rel(i) ) / k
where Precision@k denotes the average accuracy over the returned results from the first image up to the cut-off threshold k; Rel(i) denotes the relevance between the query image q and the i-th returned image, with Rel(i) ∈ {0,1}, where 1 indicates that the query image q and the i-th image belong to the same class, i.e. are relevant, and 0 indicates they are irrelevant.
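A minimal sketch of the ranking-based evaluation in step 3.3: Precision@k is the mean of the 0/1 relevance labels Rel(i) over the top-k returned images; the labels below are hypothetical.

```python
# Compute Precision@k from 0/1 relevance labels of the ranked retrieval results.
def precision_at_k(relevances, k):
    """relevances: 0/1 labels Rel(i) for the ranked results (1 = same class as the query)."""
    return sum(relevances[:k]) / float(k)

rel = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]   # hypothetical Rel(i) for the top-10 results
print(precision_at_k(rel, 10))         # -> 0.6
```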
CN201710087670.5A 2017-02-18 2017-02-18 Remote sensing image rapid retrieval method based on depth significance Active CN106909924B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710087670.5A CN106909924B (en) 2017-02-18 2017-02-18 Remote sensing image rapid retrieval method based on depth significance


Publications (2)

Publication Number Publication Date
CN106909924A CN106909924A (en) 2017-06-30
CN106909924B (en) 2020-08-28

Family

ID=59207582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710087670.5A Active CN106909924B (en) 2017-02-18 2017-02-18 Remote sensing image rapid retrieval method based on depth significance

Country Status (1)

Country Link
CN (1) CN106909924B (en)

Families Citing this family (66)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291945B (en) * 2017-07-12 2020-03-31 上海媒智科技有限公司 High-precision clothing image retrieval method and system based on visual attention model
CN107463932B (en) * 2017-07-13 2020-07-10 央视国际网络无锡有限公司 Method for extracting picture features by using binary bottleneck neural network
US11270194B2 (en) * 2017-07-26 2022-03-08 International Business Machines Corporation System and method for constructing synaptic weights for artificial neural networks from signed analog conductance-pairs of varying significance
CN107392925B (en) * 2017-08-01 2020-07-07 西安电子科技大学 Remote sensing image ground object classification method based on super-pixel coding and convolutional neural network
CN107480261B (en) * 2017-08-16 2020-06-16 上海荷福人工智能科技(集团)有限公司 Fine-grained face image fast retrieval method based on deep learning
CN109410211A (en) * 2017-08-18 2019-03-01 北京猎户星空科技有限公司 The dividing method and device of target object in a kind of image
CN109657522A (en) * 2017-10-10 2019-04-19 北京京东尚科信息技术有限公司 Detect the method and apparatus that can travel region
CN107729992B (en) * 2017-10-27 2020-12-29 深圳市未来媒体技术研究院 Deep learning method based on back propagation
US11232344B2 (en) * 2017-10-31 2022-01-25 General Electric Company Multi-task feature selection neural networks
CN108090117B (en) * 2017-11-06 2019-03-19 北京三快在线科技有限公司 A kind of image search method and device, electronic equipment
WO2019136591A1 (en) * 2018-01-09 2019-07-18 深圳大学 Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CN108446312B (en) * 2018-02-06 2020-04-21 西安电子科技大学 Optical remote sensing image retrieval method based on deep convolution semantic net
CN108257139B (en) * 2018-02-26 2020-09-08 中国科学院大学 RGB-D three-dimensional object detection method based on deep learning
CN108427738B (en) * 2018-03-01 2022-03-25 中山大学 Rapid image retrieval method based on deep learning
CN108287926A (en) * 2018-03-02 2018-07-17 宿州学院 A kind of multi-source heterogeneous big data acquisition of Agro-ecology, processing and analysis framework
US11618438B2 (en) * 2018-03-26 2023-04-04 International Business Machines Corporation Three-dimensional object localization for obstacle avoidance using one-shot convolutional neural network
CN110414301B (en) * 2018-04-28 2023-06-23 中山大学 Train carriage crowd density estimation method based on double cameras
CN108647655B (en) * 2018-05-16 2022-07-12 北京工业大学 Low-altitude aerial image power line foreign matter detection method based on light convolutional neural network
CN109033505A (en) * 2018-06-06 2018-12-18 东北大学 A kind of ultrafast cold temprature control method based on deep learning
CN108829826B (en) * 2018-06-14 2020-08-07 清华大学深圳研究生院 Image retrieval method based on deep learning and semantic segmentation
CN109063569B (en) * 2018-07-04 2021-08-24 北京航空航天大学 Semantic level change detection method based on remote sensing image
CN109191426A (en) * 2018-07-24 2019-01-11 江南大学 A kind of flat image conspicuousness detection method
CN109101907B (en) * 2018-07-28 2020-10-30 华中科技大学 Vehicle-mounted image semantic segmentation system based on bilateral segmentation network
CN109389128B (en) 2018-08-24 2021-08-27 中国石油天然气股份有限公司 Automatic extraction method and device for electric imaging logging image characteristics
CN109035315A (en) * 2018-08-28 2018-12-18 武汉大学 Merge the remote sensing image registration method and system of SIFT feature and CNN feature
CN110866425A (en) * 2018-08-28 2020-03-06 天津理工大学 Pedestrian identification method based on light field camera and depth migration learning
CN109389051A (en) * 2018-09-20 2019-02-26 华南农业大学 A kind of building remote sensing images recognition methods based on convolutional neural networks
CN109284741A (en) * 2018-10-30 2019-01-29 武汉大学 A kind of extensive Remote Sensing Image Retrieval method and system based on depth Hash network
CN109522821A (en) * 2018-10-30 2019-03-26 武汉大学 A kind of extensive across source Remote Sensing Image Retrieval method based on cross-module state depth Hash network
CN109522435B (en) * 2018-11-15 2022-05-20 中国银联股份有限公司 Image retrieval method and device
CN109639964A (en) * 2018-11-26 2019-04-16 北京达佳互联信息技术有限公司 Image processing method, processing unit and computer readable storage medium
US11593655B2 (en) * 2018-11-30 2023-02-28 Baidu Usa Llc Predicting deep learning scaling
CN109753576A (en) * 2018-12-25 2019-05-14 上海七印信息科技有限公司 A kind of method for retrieving similar images
CN111368109B (en) * 2018-12-26 2023-04-28 北京眼神智能科技有限公司 Remote sensing image retrieval method, remote sensing image retrieval device, computer readable storage medium and computer readable storage device
CN109766938A (en) * 2018-12-28 2019-05-17 武汉大学 Remote sensing image multi-class targets detection method based on scene tag constraint depth network
CN109766467B (en) * 2018-12-28 2019-12-13 珠海大横琴科技发展有限公司 Remote sensing image retrieval method and system based on image segmentation and improved VLAD
CN109670057B (en) * 2019-01-03 2021-06-29 电子科技大学 Progressive end-to-end depth feature quantization system and method
CN109902192B (en) * 2019-01-15 2020-10-23 华南师范大学 Remote sensing image retrieval method, system, equipment and medium based on unsupervised depth regression
CN109886221B (en) * 2019-02-26 2021-02-02 浙江水利水电学院 Sand production ship identification method based on image significance detection
CN109919059B (en) * 2019-02-26 2021-01-26 四川大学 Salient object detection method based on deep network layering and multi-task training
CN109919108B (en) * 2019-03-11 2022-12-06 西安电子科技大学 Remote sensing image rapid target detection method based on deep hash auxiliary network
CN110020658B (en) * 2019-03-28 2022-09-30 大连理工大学 Salient object detection method based on multitask deep learning
CN110263799A (en) * 2019-06-26 2019-09-20 山东浪潮人工智能研究院有限公司 A kind of image classification method and device based on the study of depth conspicuousness similar diagram
CN110334765B (en) * 2019-07-05 2023-03-24 西安电子科技大学 Remote sensing image classification method based on attention mechanism multi-scale deep learning
CN110399847B (en) * 2019-07-30 2021-11-09 北京字节跳动网络技术有限公司 Key frame extraction method and device and electronic equipment
CN110414513A (en) * 2019-07-31 2019-11-05 电子科技大学 Vision significance detection method based on semantically enhancement convolutional neural networks
CN110633633B (en) * 2019-08-08 2022-04-05 北京工业大学 Remote sensing image road extraction method based on self-adaptive threshold
CN110580503A (en) * 2019-08-22 2019-12-17 江苏和正特种装备有限公司 AI-based double-spectrum target automatic identification method
CN110765886B (en) * 2019-09-29 2022-05-03 深圳大学 Road target detection method and device based on convolutional neural network
CN110852295B (en) * 2019-10-15 2023-08-25 深圳龙岗智能视听研究院 Video behavior recognition method based on multitasking supervised learning
CN112712090A (en) * 2019-10-24 2021-04-27 北京易真学思教育科技有限公司 Image processing method, device, equipment and storage medium
CN110853053A (en) * 2019-10-25 2020-02-28 天津大学 Salient object detection method taking multiple candidate objects as semantic knowledge
CN111160127B (en) * 2019-12-11 2023-07-21 中国四维测绘技术有限公司 Remote sensing image processing and detecting method based on deep convolutional neural network model
CN111695572A (en) * 2019-12-27 2020-09-22 珠海大横琴科技发展有限公司 Ship retrieval method and device based on convolutional layer feature extraction
CN111640087B (en) * 2020-04-14 2023-07-14 中国测绘科学研究院 SAR depth full convolution neural network-based image change detection method
CN112052736A (en) * 2020-08-06 2020-12-08 浙江理工大学 Cloud computing platform-based field tea tender shoot detection method
CN112102245A (en) * 2020-08-17 2020-12-18 清华大学 Grape fetus slice image processing method and device based on deep learning
CN112541912B (en) * 2020-12-23 2024-03-12 中国矿业大学 Rapid detection method and device for salient targets in mine sudden disaster scene
CN112579816B (en) * 2020-12-29 2022-01-07 二十一世纪空间技术应用股份有限公司 Remote sensing image retrieval method and device, electronic equipment and storage medium
CN112667832B (en) * 2020-12-31 2022-05-13 哈尔滨工业大学 Vision-based mutual positioning method in unknown indoor environment
CN112801192B (en) * 2021-01-26 2024-03-19 北京工业大学 Extended LargeVis image feature dimension reduction method based on deep neural network
CN112926667B (en) * 2021-03-05 2022-08-30 中南民族大学 Method and device for detecting saliency target of depth fusion edge and high-level feature
CN113205481A (en) * 2021-03-19 2021-08-03 浙江科技学院 Salient object detection method based on stepped progressive neural network
CN113326926B (en) * 2021-06-30 2023-05-09 上海理工大学 Fully-connected hash neural network for remote sensing image retrieval
CN115292530A (en) * 2022-09-30 2022-11-04 北京数慧时空信息技术有限公司 Remote sensing image overall management system
CN116894100B (en) * 2023-07-24 2024-04-09 北京和德宇航技术有限公司 Remote sensing image display control method, device and storage medium


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9218563B2 (en) * 2012-10-25 2015-12-22 Brain Corporation Spiking neuron sensory processing apparatus and methods for saliency detection
US9373058B2 (en) * 2014-05-29 2016-06-21 International Business Machines Corporation Scene understanding using a neurosynaptic system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354735A (en) * 2015-07-22 2017-01-25 杭州海康威视数字技术股份有限公司 Image target searching method and device
CN105243154A (en) * 2015-10-27 2016-01-13 武汉大学 Remote sensing image retrieval method and system based on salient point features and sparse autoencoding
CN105550709A (en) * 2015-12-14 2016-05-04 武汉大学 Remote sensing image power transmission line corridor forest region extraction method
CN106250812A (en) * 2016-07-15 2016-12-21 汤平 Vehicle type recognition method based on fast R-CNN deep neural network
CN106227851A (en) * 2016-07-29 2016-12-14 汤平 End-to-end image retrieval method based on deep convolutional neural network with hierarchical deep search
CN106295139A (en) * 2016-07-29 2017-01-04 汤平 Tongue self-diagnosis health cloud service system based on deep convolutional neural network
CN106296692A (en) * 2016-08-11 2017-01-04 深圳市未来媒体技术研究院 Image saliency detection method based on adversarial network
CN106408001A (en) * 2016-08-26 2017-02-15 西安电子科技大学 Rapid area-of-interest detection method based on depth kernelized hashing

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FP-CNNH: a fast image hashing algorithm based on deep convolutional neural networks; Liu Ye et al.; Computer Science; 2016-09-30; vol. 43, no. 9; pp. 39-46, 51 *
Image retrieval method based on convolutional neural network and hash coding; Gong Zhenting et al.; CAAI Transactions on Intelligent Systems; 2016-06-30; vol. 11, no. 3; pp. 391-400 *
Image retrieval method based on convolutional neural network and supervised kernel hashing; Ke Shengcai et al.; Acta Electronica Sinica; 2017-01-31; vol. 45, no. 1; pp. 157-163 *

Also Published As

Publication number Publication date
CN106909924A (en) 2017-06-30

Similar Documents

Publication Publication Date Title
CN106909924B (en) Remote sensing image rapid retrieval method based on depth significance
Li et al. Automated terrain feature identification from remote sensing imagery: a deep learning approach
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN108960330B (en) Remote sensing image semantic generation method based on fast regional convolutional neural network
Xie et al. Multilevel cloud detection in remote sensing images based on deep learning
Chen et al. Target classification using the deep convolutional networks for SAR images
CN108038445B (en) SAR automatic target identification method based on multi-view deep learning framework
Sameen et al. Classification of very high resolution aerial photos using spectral‐spatial convolutional neural networks
Zhang et al. Scene classification via a gradient boosting random convolutional network framework
CN110516095B (en) Semantic migration-based weak supervision deep hash social image retrieval method and system
Zhang et al. Ensemble multiple kernel active learning for classification of multisource remote sensing data
EP3029606A2 (en) Method and apparatus for image classification with joint feature adaptation and classifier learning
CN110929080B (en) Optical remote sensing image retrieval method based on attention and generation countermeasure network
CN111680176A (en) Remote sensing image retrieval method and system based on attention and bidirectional feature fusion
CN112949647B (en) Three-dimensional scene description method and device, electronic equipment and storage medium
Sumbul et al. Informative and representative triplet selection for multilabel remote sensing image retrieval
Guo et al. Using multi-scale and hierarchical deep convolutional features for 3D semantic classification of TLS point clouds
Kang et al. Noise-tolerant deep neighborhood embedding for remotely sensed images with label noise
Polewski et al. Combining active and semisupervised learning of remote sensing data within a renyi entropy regularization framework
CN113988147A (en) Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device
CN113205103A (en) Lightweight tattoo detection method
Sjahputera et al. Clustering of detected changes in high-resolution satellite imagery using a stabilized competitive agglomeration algorithm
Chen et al. Supervised and adaptive feature weighting for object-based classification on satellite images
CN109583371A (en) Landmark information based on deep learning extracts and matching process
CN114579794A (en) Multi-scale fusion landmark image retrieval method and system based on feature consistency suggestion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant