CN116089646A - Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism - Google Patents


Info

Publication number
CN116089646A
Authority
CN
China
Prior art keywords
features
feature
phase
local
hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310007898.4A
Other languages
Chinese (zh)
Inventor
陈亚雄
杨锴
黄景灏
黄吉瑞
熊盛武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT
Priority to CN202310007898.4A
Publication of CN116089646A
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/10 - Terrestrial scenes
    • G06V20/17 - Terrestrial scenes taken from planes or by drones
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Library & Information Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Remote Sensing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

According to the unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism, the semantic information of unmanned aerial vehicle image data is learned; effective hash codes are learned by means of the saliency capture mechanism, a distribution smoothing term, global information and local fine-grained information; and finally a given number of unmanned aerial vehicle image items are retrieved by similarity calculation. The method provided by the invention not only pays more attention to global information and captures salient features, but also improves retrieval precision.

Description

Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism
Technical Field
The invention relates to an unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism, and is particularly suitable for improving retrieval precision.
Background
With the rapid development of unmanned aerial vehicle technology, retrieval of images shot by unmanned aerial vehicles has attracted wide attention in the field of image processing. Compared with satellites, unmanned aerial vehicles generally offer real-time streaming, which enables rapid decision making; they also greatly reduce the dependence on weather conditions and provide higher flexibility in handling various problems. As the number of unmanned aerial vehicles increases, the number of images they capture also grows significantly, so mining effective unmanned aerial vehicle image information becomes increasingly important. To mine useful information, many researchers are focusing on unmanned aerial vehicle image data retrieval; because such retrieval can quickly return useful information, it has been applied in agriculture, the military and other fields. Unmanned aerial vehicle image retrieval is a branch of general image retrieval in which the retrieved content is image data shot by unmanned aerial vehicles.
With the explosive growth of unmanned aerial vehicle imagery, efficient ground-image analysis techniques are urgently needed to process unmanned aerial vehicle data. The unmanned aerial vehicle image retrieval task is to retrieve relevant unmanned aerial vehicle images from unmanned aerial vehicle image data. Because the data volume is large and the information differs greatly between data of different scales, it is difficult for users to obtain useful information quickly. How to handle the multi-scale nature of unmanned aerial vehicle image data is therefore an important challenge of the retrieval task.
In recent years, many researchers have addressed unmanned aerial vehicle image data retrieval with deep learning methods. The common practice is to encode all unmanned aerial vehicle image data into their respective features and then compute the similarity of different images in a common representation space. Although existing unmanned aerial vehicle image retrieval methods have made some progress, they still have several defects: 1) a large amount of storage space is required and the space-time complexity of the search is high; 2) existing hash methods pay too much attention to global information and ignore fine-grained salient key information.
Disclosure of Invention
The invention aims to overcome the above defects. By learning the semantic information of unmanned aerial vehicle image data, effective hash codes are learned by means of a saliency capture mechanism, a distribution smoothing term, global information and local fine-grained information, and finally a given number of unmanned aerial vehicle image items are retrieved by similarity calculation. The invention provides an unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism, which makes full use of the fine-grained key information of unmanned aerial vehicle images to further improve retrieval performance.
In order to achieve the above object, the technical solution of the present invention is:
an unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism, the method comprises the following steps:
step 1, dividing photos of an unmanned aerial vehicle image library into a training data set and a testing data set;
step 2, information extraction: the pre-trained ResNet50 network is improved, and the ResNet50 network is trained for information extraction with the pictures of the training data set;
the pictures of the training data set are trained and their features extracted with the ResNet50 network, which applies four stages of feature mapping to each picture. First, the feature map output of the first stage of ResNet50 is upsampled and then connected with the feature map output of the second stage to form the local low-level feature F_low. Then the feature map output of the third stage of ResNet50 is upsampled and connected with the feature map output of the fourth stage as the local high-level feature F_high. Finally, the local low-level feature is processed with a 3×3 convolution and the local high-level feature with a 1×1 convolution so that the two features have the same size, and the two features are connected to form a connection feature. To avoid losing high-level feature semantics, a residual structure connects the mean of the local high-level feature with the connection feature to obtain the local joint feature F_j. In addition, a fine-grained transformation of the local joint feature reduces redundant information;
step 3, saliency capture: after the local fine-grained feature F_l is generated, a saliency capture process is used to enhance the effectiveness of the feature; information interaction attention is captured first, followed by visual enhancement attention;
the information interaction attention capture lets the global feature and the local fine-grained feature learn from and interact with each other to obtain the feature embedding vector F_ia captured by information interaction attention; the visual enhancement attention capture enhances the visual representation of the extracted effective features and yields the saliency feature F_va output by the saliency module;
step 4, hash learning training: the saliency feature F_va output by the saliency module in step 3 is fed into the hash learning module for training, namely a fully connected hash layer of k nodes with the tanh function as activation; a k-bit hash-like code is generated in the training stage and learned with an objective function composed of a similarity maintenance term, a distribution smoothing term and a quantization error; in the test stage, the k-bit hash-like code is quantized into a k-bit hash code with the sign function;
step 5, training the saliency capture model: the network model is trained by cycling through steps 2 to 4 on the training data set; the algorithm stops after 100 training iterations or when the final objective function loss no longer decreases, and the trained whole network model is then used to compute the hash codes of the samples in the test data set;
step 6, the trained whole network model computes the hash codes of the samples in the test data set; the Hamming distances between the query sample and the hash code of each sample in the training data set are ranked in ascending order (most similar first), the top-n accuracies of the ranking list are calculated to obtain the mean average precision index MAP and the top-n retrieval results, the retrieval results are output, and the retrieval is completed.
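As an illustration of the ranking and evaluation in step 6, a minimal sketch follows (not the patented implementation; the use of NumPy, ±1-valued codes and the helper names are assumptions):

```python
import numpy as np

def hamming_distances(query_code, db_codes):
    """Hamming distances between one ±1 query code and a database of ±1 codes."""
    k = db_codes.shape[1]
    return 0.5 * (k - db_codes @ query_code)           # shape: (num_db,)

def average_precision(query_code, query_label, db_codes, db_labels, top_n=None):
    """AP of one query: rank the database by ascending Hamming distance."""
    dist = hamming_distances(query_code, db_codes)
    order = np.argsort(dist)                            # most similar first
    if top_n is not None:
        order = order[:top_n]
    relevant = (db_labels[order] == query_label).astype(np.float32)
    if relevant.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
    return float((precision_at_i * relevant).sum() / relevant.sum())

def mean_average_precision(q_codes, q_labels, db_codes, db_labels, top_n=None):
    """Mean average precision (MAP) over all query samples."""
    aps = [average_precision(q, l, db_codes, db_labels, top_n)
           for q, l in zip(q_codes, q_labels)]
    return float(np.mean(aps))
```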
In step 2, the pictures of the training data set are trained and their features extracted with the ResNet50 network, which applies four stages of feature mapping to each picture. The picture of the training data set is processed by the first stage of the ResNet50 network to obtain the first-stage projection F^(1) with first-stage network parameters W^(1); the first-stage projection is processed by the second stage to obtain the second-stage projection F^(2) with second-stage network parameters W^(2); the second-stage projection is processed by the third stage to obtain the third-stage projection F^(3) with third-stage network parameters W^(3); and the third-stage projection is processed by the fourth stage to obtain the fourth-stage projection F^(4) with fourth-stage network parameters W^(4). The feature output by the ResNet50 network after the four stages in sequence is the global feature projection, denoted F_g.
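As an illustration of the four-stage processing just described, the stage outputs of a pre-trained ResNet50 might be exposed as follows (a sketch assuming the torchvision backbone, with the four stages taken as layer1 to layer4):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class ResNet50Stages(nn.Module):
    """Expose the four stage feature maps F1..F4 of a pre-trained ResNet50."""
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stage1, self.stage2 = backbone.layer1, backbone.layer2
        self.stage3, self.stage4 = backbone.layer3, backbone.layer4

    def forward(self, x):
        x = self.stem(x)
        f1 = self.stage1(x)   # first-stage projection, 256 channels
        f2 = self.stage2(f1)  # second-stage projection, 512 channels
        f3 = self.stage3(f2)  # third-stage projection, 1024 channels
        f4 = self.stage4(f3)  # fourth-stage projection, 2048 channels (global feature)
        return f1, f2, f3, f4

# example: stage feature maps for a batch of 256x256 UAV images
feats = ResNet50Stages()(torch.randn(2, 3, 256, 256))
```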
An unmanned aerial vehicle image is input, and global feature extraction and the feature extraction of the different convolution stages are considered simultaneously. The feature map output of the first stage of ResNet50 is upsampled and then connected with the feature map output of the second stage as the local low-level feature F_low, with the specific formula:

F_low = Concat( Up(F^(1)), F^(2) )

where F_low is the local low-level feature, Concat(·,·) denotes the splicing operation, Up(·) denotes upsampling, F^(1) is the first-stage projection obtained with the first-stage network parameters W^(1), and F^(2) is the second-stage projection obtained with the second-stage network parameters W^(2);
thereafter, the feature map output of the third stage of ResNet50 is upsampled and then connected with the feature map output of the fourth stage as the local high-level feature F_high, with the specific formula:

F_high = Concat( Up(F^(3)), F^(4) )

where F_high is the local high-level feature, Concat(·,·) denotes the splicing operation, F^(3) is the third-stage projection obtained with the third-stage network parameters W^(3), and F^(4) is the fourth-stage projection obtained with the fourth-stage network parameters W^(4);
then the local low-level feature and the local high-level feature are processed with a 3×3 convolution and a 1×1 convolution respectively and spliced so that they have the same size; a residual structure connects the mean of the local high-level feature with the spliced feature to obtain the local joint feature F_j, with the specific formula:

F_j = ρ(F_high) ⊕ ψ( Concat( Conv_3×3(F_low), Conv_1×1(F_high) ) )

where F_j is the local joint feature, ρ(·) is the mean calculation, ⊕ is the element-wise sum, ψ is the parametric rectified linear unit function, Conv_3×3 is a 3×3 convolution and Conv_1×1 is a 1×1 convolution;

to reduce the redundant information of the local joint feature, F_j undergoes a fine-grained transformation to obtain the local fine-grained feature F_l, with the specific formula:

F_l = δ(F_j) ⊙ F_j

where F_l is the local fine-grained feature, ⊙ denotes element-wise multiplication and δ is the sigmoid function;
at this time, the information extraction is completed.
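A minimal sketch of how the fusion into F_low, F_high, F_j and F_l described above might be implemented (the channel sizes, the resampling direction, the channel-wise mean for ρ and the sigmoid gating used for the fine-grained transformation are all assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InformationExtraction(nn.Module):
    """Fuse the four ResNet50 stage maps into the local fine-grained feature F_l."""
    def __init__(self, c_low=256 + 512, c_high=1024 + 2048, c_out=512):
        super().__init__()
        self.conv_low = nn.Conv2d(c_low, c_out, kernel_size=3, padding=1)   # 3x3 conv on F_low
        self.conv_high = nn.Conv2d(c_high, c_out, kernel_size=1)            # 1x1 conv on F_high
        self.prelu = nn.PReLU()                                             # psi

    def forward(self, f1, f2, f3, f4):
        # local low-level feature: stage-1 map resized to stage-2 size, then concatenated
        # (the resampling direction is an assumption)
        f_low = torch.cat([F.interpolate(f1, size=f2.shape[-2:], mode="bilinear",
                                         align_corners=False), f2], dim=1)
        # local high-level feature: stage-3 map resized to stage-4 size, then concatenated
        f_high = torch.cat([F.interpolate(f3, size=f4.shape[-2:], mode="bilinear",
                                          align_corners=False), f4], dim=1)
        low = self.conv_low(f_low)
        high = self.conv_high(f_high)
        low = F.interpolate(low, size=high.shape[-2:], mode="bilinear",
                            align_corners=False)                            # same spatial size
        joint = self.prelu(torch.cat([low, high], dim=1))                   # spliced feature
        mean_high = f_high.mean(dim=1, keepdim=True)                        # rho(F_high), channel mean (assumed)
        f_j = joint + mean_high                                             # residual connection -> F_j
        f_l = torch.sigmoid(f_j) * f_j                                      # fine-grained gating -> F_l
        return f_l
```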
In step 3,

step 3.1, capturing of information interaction attention: the global feature is projected through different fully connected layers onto the Query of the attention mechanism to obtain Q_ia, and the local fine-grained feature F_l is projected onto the Key and Value to obtain K_ia and V_ia respectively. The correlation S_ia between the global feature and the local fine-grained feature is:

S_ia = φ( Q_ia K_ia^T / √d )

where φ denotes the softmax function, √d is the set scaling parameter, Q_ia is the Query in the attention mechanism and K_ia^T is the transposed Key in the attention mechanism;

to perform the information interaction, the similarity is computed with multi-head attention, and the similarities of the different heads are spliced and fused as follows:

T_l = Dropout( S_l V_l^ia ),  l = 1, …, L
T_ia = Concat( T_1, …, T_L ) W_ia

where L is the number of attention heads, T_l is the output of the l-th head, W_ia is a learnable parameter matrix, Dropout(·) is the dropout operation, Concat(·) is the splicing operation, S_l is the similarity of the l-th head and V_l^ia is the Value projected from the local fine-grained feature for the l-th head;

to enhance the visual characterization and further obtain an effective feature embedding, the global feature F_g is combined with T_ia, with the specific formula:

F_ia = LN( (F_g + T_ia) + MLP( LN(F_g + T_ia) ) )

where F_ia is the feature embedding vector of the information interaction attention module, LN(·) is the layer normalization operation and MLP(·) is a multi-layer perceptron; at this point the feature embedding vector F_ia captured by information interaction attention is obtained.
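The information interaction capture can be sketched as multi-head cross-attention in which the global feature supplies the Query and the local fine-grained feature supplies the Key and Value (the token layout, embedding size, head count and the residual/MLP arrangement are assumptions):

```python
import torch
import torch.nn as nn

class InformationInteractionAttention(nn.Module):
    """Cross-attention: global feature gives the Query, local fine-grained feature gives Key/Value."""
    def __init__(self, dim=512, heads=8, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, f_global, f_local):
        # f_global: (B, Nq, dim) tokens from the global projection
        # f_local:  (B, Nk, dim) tokens from the local fine-grained feature
        t_ia, _ = self.attn(query=f_global, key=f_local, value=f_local)  # multi-head interaction T_ia
        z = self.norm1(f_global + t_ia)                                  # combine with the global feature
        return self.norm2(z + self.mlp(z))                               # feature embedding F_ia
```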
Step 3.2, capturing of visual enhancement attention: to enhance the visual representation, the feature embedding vector F_ia of the information interaction attention module is projected onto the Query, Key and Value of the attention mechanism to obtain Q_va, K_va and V_va respectively; the similarity S_va of the different tokens is computed as:

S_va = φ( Q_va K_va^T / √d )

where S_va is the embedding matrix of the different features, φ is the softmax function and √d is the set scaling parameter;

the similarity is then computed with a multi-head attention mechanism as follows:

T_m = S_m V_m^va,  m = 1, …, M
T_va = Concat( T_1, …, T_M ) W_va

where M is the number of heads of the visual enhancement attention module, T_m is the output of the m-th head, W_va is the learnable parameter of the visual enhancement attention module, Concat(·) is the splicing operation, S_m is the similarity of the m-th head and V_m^va is the Value projected from the feature embedding vector F_ia for the m-th head;

finally, the saliency feature F_va is generated through layer normalization, with the specific formula:

F_va = LN( F_ia + T_va )

where F_va is the saliency feature and LN(·) is the layer normalization process.
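The visual enhancement capture can likewise be sketched as multi-head self-attention over F_ia followed by a residual connection and layer normalization (embedding size and head count are assumptions):

```python
import torch
import torch.nn as nn

class VisualEnhancementAttention(nn.Module):
    """Self-attention over F_ia; the normalized residual output is the saliency feature F_va."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_ia):
        # f_ia: (B, N, dim) feature embedding from the information interaction module
        t_va, _ = self.attn(query=f_ia, key=f_ia, value=f_ia)
        return self.norm(f_ia + t_va)    # saliency feature F_va
```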
The hash function in step 4 is:

b = sign(h) = sign( τ(F_va, W_h) ),  with  h = τ(F_va, W_h) = tanh( W_h F_va )

where F_va is the output of the saliency capture module, W_h is the weight of the approximation function, τ is the approximation function, h is the hash-like code and b is the generated hash code;

the objective function consists of a similarity maintenance term, a distribution smoothing term and a quantization error;

the similarity maintenance term is calculated as:

L_sim = Σ_{i,j} [ s_ij H(b_i, b_j) + (1 − s_ij) max( ε − H(b_i, b_j), 0 ) ]

where ε is the edge parameter, max is the maximum function, H(·,·) computes the Hamming distance, and s_ij is the pairwise label of the samples (1 for similar, 0 for dissimilar);

introducing a distribution smoothing term L_ds smooths the distribution centre at its theoretical value; it is computed from the generated hash codes b_n and the true sample labels y_n through the label smoothing function θ applied to the n1-th input labels y_{n1}, with smoothing hyper-parameter γ;

however, the objective function is difficult to optimize during training, so the Euclidean distance D is used instead of the Hamming distance, i.e.:

L_sim = Σ_{i,j} [ s_ij D(b_i, b_j) + (1 − s_ij) max( ε − D(b_i, b_j), 0 ) ]

in addition, the hash code introduces a quantization error, so a quantization error term is added, and the final objective function is:

L = L_sim + L_ds + λ ‖ b − h ‖²₂

where ‖ b − h ‖²₂ is the L2 norm between the generated hash code b and the real-valued hash-like code h, and λ is a hyper-parameter.
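A hedged sketch of the k-node hash layer and of the objective in its relaxed form follows (the similarity term with the Euclidean relaxation plus the quantization error term; the distribution smoothing term is omitted because its exact formula is only given as an image in the published text):

```python
import torch
import torch.nn as nn

class HashLayer(nn.Module):
    """Fully connected hash layer with tanh activation; k is the hash code length."""
    def __init__(self, dim=512, k=32):
        super().__init__()
        self.fc = nn.Linear(dim, k)

    def forward(self, f_va):
        h = torch.tanh(self.fc(f_va))          # hash-like code h in (-1, 1)
        b = torch.sign(h).detach()             # binary hash code b (used at test time)
        return h, b

def retrieval_loss(h, labels, epsilon, lam=0.1):
    """Pairwise similarity term (Euclidean relaxation) plus quantization error term."""
    d = torch.cdist(h, h, p=2)                                   # D(h_i, h_j)
    s = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()     # pairwise labels s_ij
    sim_term = s * d + (1 - s) * torch.clamp(epsilon - d, min=0)
    quant_term = (torch.sign(h).detach() - h).pow(2).sum(dim=1)  # ||b - h||_2^2
    return sim_term.mean() + lam * quant_term.mean()
```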
In step 5, when the whole network model is trained, the Adam algorithm is used for optimization, the learning rate is set to 10⁻⁴ and the input pictures are resized to 256×256; the batch size is set to 64, the hash code length k is set to 16, 24, 32, 48 and 64, and the edge parameter ε is set to 2k. The initial weights of the convolutional neural network ResNet50 are initialized with the pre-trained weight parameter matrix W and bias parameter matrix B. Steps 2 to 4 are repeated to train the network model iteratively, optimizing the weight parameter matrix W and the bias parameter matrix B so as to reduce the loss of the objective function L. The algorithm stops after 100 training iterations or when the final objective function loss no longer decreases, and the trained whole network model is then used to compute the hash codes of the samples in the test data set.
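The training settings of step 5 (Adam, learning rate 10⁻⁴, 256×256 inputs, batch size 64, up to 100 iterations with early stopping) could be wired together roughly as below, reusing the retrieval_loss sketch above; the model composition and the epoch-level stopping check are assumptions:

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, epsilon, epochs=100, device="cuda"):
    loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    best_loss = float("inf")
    model.to(device)
    for epoch in range(epochs):
        total = 0.0
        for images, labels in loader:                    # images already resized to 256x256
            images, labels = images.to(device), labels.to(device)
            h, _ = model(images)                         # model ends in the hash layer, returns (h, b)
            loss = retrieval_loss(h, labels, epsilon)    # epsilon = 2k per the text
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total >= best_loss:                           # stop when the loss no longer decreases
            break
        best_loss = total
    return model
```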
In step 6, the query sample is a picture from the test data set or an unmanned aerial vehicle picture supplied in a prediction scenario.
Compared with the prior art, the invention has the beneficial effects that:
1. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism designs a new unmanned aerial vehicle image retrieval framework and uses an information extraction module and a saliency capture module to extract effective information from unmanned aerial vehicle images during hash code learning. Secondly, a new objective function composed of a similarity maintenance term, a distribution smoothing term and a quantization error is designed, so that the similarity of the hash codes is maintained, the distribution of the unmanned aerial vehicle image data set is smoothed, and the quantization error between the hash codes and the hash-like codes is reduced.
2. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism mainly comprises three steps: extraction, learning and selection. Given an unmanned aerial vehicle image to be queried, its representation features are first extracted; hash code learning is then carried out using the fixed similarity relations among similar unmanned aerial vehicle images; finally, the K most similar images are obtained by similarity calculation, which effectively improves retrieval precision. As the comparison of the mean average precision on the two data sets shows, the retrieval effect of the proposed unmanned aerial vehicle image retrieval method is superior to that of the existing methods.
3. By learning the semantic information of unmanned aerial vehicle image data and learning effective hash codes with the saliency capture mechanism, the distribution smoothing term, global information and local fine-grained information, the method improves retrieval precision, while the deep hash approach reduces the space-time complexity of retrieval and the storage space required by the retrieval method.
Drawings
Fig. 1 is a schematic diagram of a network architecture of the present invention.
Fig. 2 is a search result diagram of the present invention.
FIG. 3 is a diagram of the visual effect of the saliency capture module of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and detailed description.
Referring to fig. 1, the unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism comprises steps 1 to 6, and the information extraction, saliency capture, hash learning training and retrieval of these steps are implemented exactly as described above in the disclosure.
Example 1:
an unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism, the method comprises the following steps:
the environment adopted in this embodiment is GeForce RTX 3090GPU, interXeon (R) Silver 4210RCPU@2.40GHz ×40, 62.6G RAM, linux operating system, and developed by Python and open source library Pytorch.
Step 1, dividing the photos of the unmanned aerial vehicle image library into a training data set and a test data set; using the ERA and Drone-Action data sets, 80% of each data set is selected as the training data set I_train and the remaining 20% as the test data set I_test.
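The 80/20 split of this example might be prepared as follows (the directory path and the ImageFolder class-sub-folder layout are assumptions):

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

# the Drone-Action (or ERA) images arranged in class sub-folders (assumed layout)
library = datasets.ImageFolder("data/drone_action", transform=transform)
n_train = int(0.8 * len(library))
train_set, test_set = torch.utils.data.random_split(
    library, [n_train, len(library) - n_train],
    generator=torch.Generator().manual_seed(0))
```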
Steps 2 to 6 are then carried out exactly as described above in the disclosure.
Example 2:
example 2 is substantially the same as example 1 except that:
in the step 2, the picture of the training data set is trained and extracted by utilizing a ResNet50 network, the ResNet50 network performs four-stage feature mapping processing on the picture of the training data set, and the picture of the training data set is subjected to a first-stage processing in the ResNet50 network to obtain a first-stage projection
Figure BDA0004036317480000131
And network parameters of the first phase +.>
Figure BDA0004036317480000132
Projection of the first phase +.>
Figure BDA0004036317480000133
And network parameters of the first phase +.>
Figure BDA0004036317480000134
Performing second stage processing in ResNet50 network to obtain projection of second stage +.>
Figure BDA0004036317480000135
And network parameters of the second phase->
Figure BDA0004036317480000136
Projection of the second phase +.>
Figure BDA0004036317480000137
And network parameters of the second phase
Figure BDA0004036317480000141
Performing a third phase process in the ResNet50 network to obtain a third phase projection +.>
Figure BDA0004036317480000142
And thirdPhase network parameters
Figure BDA0004036317480000143
Projection of the third phase +.>
Figure BDA0004036317480000144
And network parameters of the third phase +.>
Figure BDA0004036317480000145
Performing a fourth phase process in the ResNet50 network to obtain a projection of the fourth phase +.>
Figure BDA0004036317480000146
And network parameters of the fourth phase +.>
Figure BDA0004036317480000147
The feature of the ResNet50 network output by four stages in sequence is global feature projection;
inputting an unmanned aerial vehicle image, and simultaneously taking global feature extraction and feature extraction of different convolution layers into consideration; upsampling the feature map output of the first stage of ResNet50 and then connecting the feature map output of the second stage of ResNet50 to a local low-level feature F low The specific formula is as follows:
Figure BDA0004036317480000148
/>
wherein ,Flow As a feature of the local low-level layer,
Figure BDA0004036317480000149
representing a splicing operation->
Figure BDA00040363174800001410
Projection representing the first phase, +.>
Figure BDA00040363174800001411
Network parameters representing the first phase, +.>
Figure BDA00040363174800001412
Representing the projection of the second phase +.>
Figure BDA00040363174800001413
Network parameters representing the second phase;
thereafter, the feature map output of the third stage of ResNet50 is up-sampled and then the feature map output of the fourth stage of ResNet50 is connected as local high-level features F high The specific formula is as follows:
Figure BDA00040363174800001414
wherein ,Fhigh As a feature of a local high-level layer,
Figure BDA00040363174800001415
representing a splicing operation->
Figure BDA00040363174800001416
Projection representing the third phase, +.>
Figure BDA00040363174800001417
Network parameters representing the third phase, +.>
Figure BDA00040363174800001418
Projection representing the fourth phase, +. >
Figure BDA00040363174800001419
Network parameters representing the fourth phase;
then, the local low-layer features and the local high-layer features are processed and spliced by using a convolution of 3 multiplied by 3 and convolution of 1 multiplied by 1 respectively, so that the local low-layer features and the local high-layer features have the same size; connecting the average value of the local advanced features and the spliced features by using a residual structure to obtain a local joint feature F j The specific formula is as follows:
Figure BDA00040363174800001420
wherein ,Fj For local joint features, ρ is the mean calculation,
Figure BDA00040363174800001421
for the sum operation, ψ is the parametric rectified linear unit function, +.>
Figure BDA00040363174800001422
Is a convolution of 3 x 3>
Figure BDA00040363174800001423
Is a convolution of 1 x 1;
to reduce redundant information of local joint features, the local joint features F j Fine-grained transformation is carried out to obtain local fine-grained feature F l The specific formula is as follows:
Figure BDA0004036317480000151
wherein ,Fl For the local fine granularity feature, by term multiplication, delta is a sigmoid function;
at this time, the information extraction is completed.
In the step (3) of the above-mentioned process,
step 3.1, capturing information interaction attention, and projecting global features onto a Query of an attention mechanism through different full connection layers to obtain Q ia Local fine grain feature F l Projected onto Key and Value to obtain respectively
Figure BDA0004036317480000152
and Via Correlation S of global features and local fine-grained features ia The following are provided:
Figure BDA0004036317480000153
wherein, phi tableThe softmax function is shown as a function of,
Figure BDA0004036317480000154
Represents the set scaling parameters, Q ia Is Query in the attention mechanism, +.>
Figure BDA0004036317480000155
Is the transposed Key in the attention mechanism;
in order to perform information interaction, calculating the similarity by utilizing the multi-head attention, and splicing and fusing the similarity of different heads, the specific process is as follows:
Figure BDA0004036317480000156
Figure BDA0004036317480000157
wherein L is the number of attention heads,
Figure BDA0004036317480000158
represents the output of the first head, W ia Is a parameter matrix which can be learned, < >>
Figure BDA0004036317480000159
For Dropout operation, +.>
Figure BDA00040363174800001510
Representing the splicing operation S l For the similarity of the first head, +.>
Figure BDA00040363174800001511
Value projected for local fine granularity feature of the first header; />
To enhance visual characterization, to further achieve efficient feature embedding, global features and T ia In combination, the specific formula is as follows:
Figure BDA00040363174800001512
F ia i.e. the feature embedded vector of the information interaction attention module,
Figure BDA00040363174800001513
representation layer normalization operation, ++>
Figure BDA00040363174800001514
Is a multi-layer perceptron; at this time, a feature embedding vector F for capturing the attention of information interaction is obtained ia
Step 3.2, capturing of visual enhancement attention: to enhance visual performance, features of the information interaction attention module are first embedded into vector F ia The Query, key and Value projected to the attention mechanism respectively obtain Q va
Figure BDA00040363174800001515
and Vva The method comprises the steps of carrying out a first treatment on the surface of the Similarity S of different token va The calculation is as follows:
Figure BDA0004036317480000161
wherein ,Sva For an embedding matrix of different features, phi is a softmax function,
Figure BDA0004036317480000162
In order to set the ratio parameters of the components,
then calculating the similarity by utilizing a multi-head attention mechanism, wherein the specific process is as follows:
Figure BDA0004036317480000163
Figure BDA0004036317480000164
wherein m is an incrementThe head number of the high visual attention module,
Figure BDA0004036317480000165
for output of the mth head, W va To enhance the learnable parameters of the visual attention module, < +.>
Figure BDA0004036317480000166
Representing the splicing operation S m For the similarity of the mth head, +.>
Figure BDA0004036317480000167
Embedding a vector F for features of the mth head ia The Value of the projection;
finally, generating the saliency feature F through layer normalization va The specific formula is as follows:
Figure BDA0004036317480000168
wherein ,Fva Namely the characteristic of the significance,
Figure BDA0004036317480000169
is a layer normalization process.
In the step 4, the specific formula of the hash function is:
b=sign(h)=sign(τ(F va ,W h ))
Figure BDA00040363174800001610
wherein ,Fva To output of saliency capture module, W h Is the weight of the approximate function, τ is the approximate function, h is the hash-like code, and b is the generated hash code;
the objective function consists of a similarity maintaining term, a distribution smoothing term and a quantization error;
the similarity maintenance term calculation formula is as follows:
Figure BDA00040363174800001611
where epsilon edge parameter, max is the maximum function, H () calculates the hamming distance,
Figure BDA00040363174800001612
paired tags for samples (similarity 1, dissimilarity 0);
introducing a distribution smoothing term can smooth a distribution center at a theoretical value, and a calculation formula is as follows:
Figure BDA00040363174800001613
wherein ,
Figure BDA00040363174800001614
For smooth term, γ is the superparameter, θ is the label smoothing function, ++>
Figure BDA00040363174800001615
Represents the nth 1 Input labels, b n To generate the hash code, y n A tag that is true for the sample;
however, the objective function is difficult to optimize during training, so the Euclidean distance D is used instead of the Hamming distance, namely:
Figure BDA0004036317480000171
however, the hash code generates quantization errors and thus adds quantization error terms, and the final objective function is:
Figure BDA0004036317480000172
wherein ,
Figure BDA0004036317480000173
represented as an L2-canonical result of generating the hash code and the real hash code, λ is the hyper-parameter.
In the step 5, when the whole network model is trained, the Adam algorithm is used for optimization, and the learning rate is set to be 10 -4 The input picture size is adjusted to 256×256; the batch size is set to 64, the length k of the hash codes is set to 16, 24, 32, 48 and 64, the edge parameter epsilon is set to 2k, the initial weight of the convolutional neural network ResNet50 is initialized by using a weight parameter matrix W and a bias parameter matrix B which are trained in advance, the steps 2 to 4 are repeated to carry out iterative training on the network model, so that the weight parameter matrix W and the bias parameter matrix B are optimized to reduce the loss of an objective function L, and the algorithm operation is ended when 100 iterations of training or the final objective function loss is not reduced any more, so that the hash codes of samples in the test data set are calculated by the trained whole network model.
In the step 6, the query sample is a picture from the test data set or an unmanned aerial vehicle picture input in the prediction scenario.
To evaluate the effectiveness of the method, its retrieval performance is compared with several state-of-the-art methods, including DHN, DCH, DFH, DPH, DSHSD, GreedyHash, DSDH, DTSH, LCDSH and QSMIH. The experiments use 16, 24, 32, 48 and 64 bit hash codes on the Drone-Action and ERA data sets. DHN performs deep hash learning in a supervised manner within a Bayesian framework; the DCH, DFH, DPH, DSHSD, GreedyHash, DSDH, DTSH, LCDSH and QSMIH methods are performed in their plain (original) form.
TABLE 1
(The contents of Table 1 are provided as an image in the original publication.)
Table 1 shows the results of the comparison experiment between the proposed method and other methods on the unmanned aerial vehicle image retrieval task on the ERA data set, where mAP is the mean average precision metric.
TABLE 2
(The contents of Table 2 are provided as an image in the original publication.)
Table 2 shows the results of the comparison experiment between the proposed method and other methods on the unmanned aerial vehicle image retrieval task on the Drone-Action data set, where mAP is the mean average precision metric.
As can be seen from the mAP comparison on the two data sets, the retrieval performance of the proposed unmanned aerial vehicle image retrieval method is better than that of the existing methods.
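For completeness, the retrieval step used in such experiments can be sketched as follows: database codes are ranked by Hamming distance to the query code and average precision is accumulated over the top returned items. The function names, the ±1 code convention, and the use of identical class labels as the notion of relevance are assumptions; the patent only states that Hamming distances are ranked and mAP/top-n results are reported.

```python
import numpy as np

def hamming_rank(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Rank database items by Hamming distance to the query (codes in {-1, +1})."""
    k = db_codes.shape[1]
    dist = 0.5 * (k - db_codes @ query_code)   # Hamming distance for +/-1 codes
    return np.argsort(dist)                    # closest items first

def average_precision(query_label, db_labels, order, top_n: int = 100) -> float:
    """AP over the top_n returned items; relevance = identical class label."""
    relevant = (db_labels[order[:top_n]] == query_label).astype(np.float32)
    if relevant.sum() == 0:
        return 0.0
    cumulative = np.cumsum(relevant)
    precision_at_i = cumulative / (np.arange(top_n) + 1)
    return float((precision_at_i * relevant).sum() / relevant.sum())
```

mAP is then the mean of these per-query average precisions over all query samples.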

Claims (6)

1. An unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism, characterized by comprising the following steps:
step 1, dividing photos of an unmanned aerial vehicle image library into a training data set and a testing data set;
step 2, information extraction: the pre-trained ResNet50 network is improved, and information extraction training is performed on the ResNet50 network using the pictures of the training data set;
the features of the pictures of the training data set are extracted with the ResNet50 network, which applies four stages of feature mapping to each training picture: first, the feature map output of the first stage of ResNet50 is up-sampled and then connected with the feature map output of the second stage of ResNet50 to form the local low-level feature F_low; afterwards, the feature map output of the third stage of ResNet50 is up-sampled and then connected with the feature map output of the fourth stage of ResNet50 to form the local high-level feature F_high; finally, the local low-level feature is processed with a 3×3 convolution and the local high-level feature with a 1×1 convolution so that the two features have the same size, and the two are connected to form a connected feature; to avoid the loss of high-level feature semantics, a residual structure connects the average value of the local high-level feature with the connected feature to obtain the local joint feature F_j; in addition, a fine-grained transformation of the local joint feature reduces redundant information;
step 3, saliency capture: after the local fine-grained feature F_l is generated, a saliency capture process is applied to enhance the effectiveness of the feature; information interaction attention is captured first, followed by visual enhancement attention;

the information interaction attention mechanism lets the global feature and the local fine-grained feature learn from and interact with each other, yielding the feature embedding vector F_ia of the information interaction attention capture; the visual enhancement attention mechanism enhances the visual representation of the extracted effective features, yielding the saliency feature F_va output by the saliency module;
Step 4, carrying out hash learning training, and outputting the saliency characteristic F obtained in the step 3 by the saliency module va Then, inputting the hash to a hash learning module for training, namely a fully connected hash layer of k nodes, wherein the hash uses a tanh function as an activation function; generating a k-bit hash code in a training stage, and learning by an objective function consisting of a similarity maintenance term, a distributed smoothing term and a quantization error; in the test stage, quantizing the k-bit hash codes into k-bit hash codes by using a symbol function;
step 5, training the saliency capture model: the network model is trained with the training data set by cycling through steps 2 to 4; the algorithm terminates after 100 training iterations or once the final objective function loss no longer decreases, and the trained overall network model is then used to compute the hash codes of the samples in the test data set;
and step 6, the trained overall network model computes the hash codes of the samples in the test data set; the samples of the training data set are ranked by the Hamming distance between their hash codes and that of the query sample, the top-n accuracies of the ranking list are calculated, and the mean average precision (mAP) and the top-n retrieval results are obtained; the retrieval results are then output and the retrieval is completed.
2. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism according to claim 1, characterized in that:
in the step 2, the features of the pictures of the training data set are extracted with the ResNet50 network, which performs four-stage feature mapping processing on the pictures of the training data set; a picture of the training data set is processed by the first stage of the ResNet50 network to obtain the first-stage projection P_1 and the first-stage network parameters θ_1; the first-stage projection P_1 and network parameters θ_1 undergo the second-stage processing in the ResNet50 network to obtain the second-stage projection P_2 and the second-stage network parameters θ_2; the second-stage projection P_2 and network parameters θ_2 undergo the third-stage processing in the ResNet50 network to obtain the third-stage projection P_3 and the third-stage network parameters θ_3; the third-stage projection P_3 and network parameters θ_3 undergo the fourth-stage processing in the ResNet50 network to obtain the fourth-stage projection P_4 and the fourth-stage network parameters θ_4; the feature output by the ResNet50 network after the four stages in sequence is the global feature projection, denoted F_g;
an unmanned aerial vehicle image is input, and global feature extraction and the feature extraction of different convolution layers are considered simultaneously; the feature map output of the first stage of ResNet50 is up-sampled and then connected with the feature map output of the second stage of ResNet50 to form the local low-level feature F_low, with the specific formula:

F_low = Concat( Up(P_1(θ_1)), P_2(θ_2) )

where F_low is the local low-level feature, Concat(·) denotes the splicing operation, Up(·) denotes the up-sampling operation, P_1 denotes the projection of the first stage, θ_1 the network parameters of the first stage, P_2 the projection of the second stage, and θ_2 the network parameters of the second stage;
thereafter, the feature map output of the third stage of ResNet50 is up-sampled and then connected with the feature map output of the fourth stage of ResNet50 to form the local high-level feature F_high, with the specific formula:

F_high = Concat( Up(P_3(θ_3)), P_4(θ_4) )

where F_high is the local high-level feature, Concat(·) denotes the splicing operation, Up(·) denotes the up-sampling operation, P_3 denotes the projection of the third stage, θ_3 the network parameters of the third stage, P_4 the projection of the fourth stage, and θ_4 the network parameters of the fourth stage;
then, the local low-level feature and the local high-level feature are processed with a 3×3 convolution and a 1×1 convolution respectively and spliced, so that they have the same size; a residual structure connects the mean of the local high-level feature with the spliced feature to obtain the local joint feature F_j, with the specific formula:

F_j = ψ( ρ(F_high) ⊕ Concat( Conv_{3×3}(F_low), Conv_{1×1}(F_high) ) )

where F_j is the local joint feature, ρ is the mean calculation, ⊕ is the sum operation, ψ is the parametric rectified linear unit function, Conv_{3×3} is the 3×3 convolution and Conv_{1×1} is the 1×1 convolution;
to reduce the redundant information of the local joint feature, the local joint feature F_j undergoes a fine-grained transformation to obtain the local fine-grained feature F_l, with the specific formula:

F_l = F_j ⊗ δ(F_j)

where F_l is the local fine-grained feature, ⊗ is the element-wise multiplication, and δ is the sigmoid function;
at this time, the information extraction is completed.
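For illustration, a minimal PyTorch sketch of the multi-scale extraction and fusion described above is given below. The channel width of 512, the bilinear resizing used to align spatial sizes, the broadcast channel-mean residual, the gated form of the fine-grained transformation, and the torchvision weight string are assumptions filled in where the text leaves details open; only the overall flow (stage-1/2 splicing into F_low, stage-3/4 splicing into F_high, 3×3 and 1×1 convolutions, residual joint feature F_j, fine-grained feature F_l) follows the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class FeatureFusion(nn.Module):
    """Hedged sketch of the local low/high-level feature fusion."""
    def __init__(self, dim: int = 512):
        super().__init__()
        net = resnet50(weights="IMAGENET1K_V1")      # pre-trained backbone (torchvision >= 0.13)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stage1, self.stage2 = net.layer1, net.layer2
        self.stage3, self.stage4 = net.layer3, net.layer4
        # 3x3 conv for the low-level branch, 1x1 conv for the high-level branch.
        self.conv_low = nn.Conv2d(256 + 512, dim, kernel_size=3, padding=1)
        self.conv_high = nn.Conv2d(1024 + 2048, dim, kernel_size=1)
        self.act = nn.PReLU()                        # parametric rectified linear unit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p1 = self.stage1(self.stem(x))               # stage-1 projection
        p2 = self.stage2(p1)                         # stage-2 projection
        p3 = self.stage3(p2)                         # stage-3 projection
        p4 = self.stage4(p3)                         # stage-4 projection

        # Resample the shallower map to its partner's spatial size before
        # splicing (the exact sampling direction is an assumption).
        f_low = torch.cat([F.interpolate(p1, size=p2.shape[-2:],
                                         mode="bilinear", align_corners=False), p2], dim=1)
        f_high = torch.cat([F.interpolate(p3, size=p4.shape[-2:],
                                          mode="bilinear", align_corners=False), p4], dim=1)

        low = self.conv_low(f_low)                   # 3x3 conv on F_low
        high = self.conv_high(f_high)                # 1x1 conv on F_high
        low = F.interpolate(low, size=high.shape[-2:],
                            mode="bilinear", align_corners=False)

        joint = torch.cat([low, high], dim=1)        # spliced feature
        # Residual connection using the channel mean of the high-level branch.
        f_j = self.act(joint + high.mean(dim=1, keepdim=True))
        return f_j * torch.sigmoid(f_j)              # fine-grained feature F_l
```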
3. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism according to claim 2, characterized in that:
in the step 3,

step 3.1, capture of information interaction attention: the global feature is projected onto the Query of the attention mechanism through different fully connected layers to obtain Q_ia, and the local fine-grained feature F_l is projected onto the Key and the Value to obtain K_ia and V_ia respectively; the correlation S_ia between the global feature and the local fine-grained feature is computed as:

S_ia = φ( Q_ia · K_ia^T / √d )

where φ denotes the softmax function, √d denotes the set scaling parameter, Q_ia is the Query in the attention mechanism, and K_ia^T is the transposed Key in the attention mechanism;
to perform information interaction, the similarity is computed with multi-head attention and the similarities of the different heads are spliced and fused, as follows:

T_ia^l = S_l · V_ia^l

T_ia = Dropout( Concat(T_ia^1, …, T_ia^L) · W_ia )

where L is the number of attention heads, T_ia^l denotes the output of the l-th head, W_ia is a learnable parameter matrix, Dropout(·) is the Dropout operation, Concat(·) denotes the splicing operation, S_l is the similarity of the l-th head, and V_ia^l is the Value projected from the local fine-grained feature for the l-th head;
to enhance the visual characterization and further obtain an effective feature embedding, the global feature is combined with T_ia, with the specific formula:

F_ia = LN( MLP(T_ia) + F_g )

where F_ia is the feature embedding vector of the information interaction attention module, LN(·) denotes the layer normalization operation, MLP(·) is a multi-layer perceptron, and F_g denotes the global feature; at this point the feature embedding vector F_ia captured by the information interaction attention is obtained;
step 3.2, capture of visual enhancement attention: to enhance the visual representation, the feature embedding vector F_ia of the information interaction attention module is first projected onto the Query, Key and Value of the attention mechanism to obtain Q_va, K_va and V_va respectively; the similarity S_va of the different tokens is computed as:

S_va = φ( Q_va · K_va^T / √d )

where S_va is the embedding matrix of the different features, φ is the softmax function, and √d is the set scaling parameter;
the similarity is then processed with a multi-head attention mechanism, the specific process being:

T_va^m = S_m · V_va^m

T_va = Concat(T_va^1, …, T_va^M) · W_va

where M is the number of heads of the visual enhancement attention module, T_va^m is the output of the m-th head, W_va is the learnable parameter matrix of the visual enhancement attention module, Concat(·) denotes the splicing operation, S_m is the similarity of the m-th head, and V_va^m is the Value projected from the feature embedding vector F_ia for the m-th head;

finally, the saliency feature F_va is generated through layer normalization, with the specific formula:

F_va = LN(T_va)

where F_va is the saliency feature and LN(·) is the layer normalization operation.
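As an illustrative companion to step 3.1 above, a compact sketch of the information interaction attention is shown below: the global feature supplies the Query while the local fine-grained feature supplies the Key and Value, and the attended result is fused back with the global feature through an MLP and layer normalization. The token shapes, the MLP hidden width and activation, and the dropout rate are assumptions; the multi-head projection, the learnable fusion with dropout, and the LN(MLP(·) + global) combination follow the text above.

```python
import torch
import torch.nn as nn

class InformationInteractionAttention(nn.Module):
    """Hedged sketch of step 3.1: cross-attention from global to local features."""
    def __init__(self, dim: int = 512, num_heads: int = 8, p_drop: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads,
                                          dropout=p_drop, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_global: torch.Tensor, f_local: torch.Tensor) -> torch.Tensor:
        # f_global: (batch, tokens_g, dim) global feature projection (Query)
        # f_local : (batch, tokens_l, dim) local fine-grained feature (Key, Value)
        t_ia, _ = self.attn(query=f_global, key=f_local, value=f_local)
        # F_ia = LN(MLP(T_ia) + global feature)
        return self.norm(self.mlp(t_ia) + f_global)
```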
4. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism according to claim 3, characterized in that:
in the step 4, the specific formula of the hash function is:

b = sign(h) = sign(τ(F_va, W_h))

h = τ(F_va, W_h) = tanh(F_va · W_h)

where F_va is the output of the saliency capture module, W_h is the weight of the approximate function, τ is the approximate function, h is the hash-like code, and b is the generated hash code;

the objective function consists of a similarity preserving term, a distribution smoothing term and a quantization error term;

the similarity preserving term is calculated as:

L_1 = Σ_{i,j} [ s_ij · H(b_i, b_j) + (1 − s_ij) · max(ε − H(b_i, b_j), 0) ]

where ε is the edge parameter, max is the maximum function, H(·) computes the Hamming distance, and s_ij is the pairwise label of samples i and j (1 for similar, 0 for dissimilar);
a distribution smoothing term is introduced to smooth the distribution centre towards its theoretical value, and is calculated as:

L_2 = Σ_{n_1} θ(b_{n_1}, y_{n_1}; γ)

where L_2 is the smoothing term, γ is the hyper-parameter, θ is the label smoothing function, n_1 indexes the input labels, b_n is the generated hash code, and y_n is the true label of the sample;
however, the objective function is difficult to optimize during training, so the Euclidean distance D is used in place of the Hamming distance, i.e.:

L_1 = Σ_{i,j} [ s_ij · D(b_i, b_j) + (1 − s_ij) · max(ε − D(b_i, b_j), 0) ]

in addition, binarizing the hash-like code introduces a quantization error, so a quantization error term is added, and the final objective function is:

L = L_1 + L_2 + λ · ||b − h||_2

where ||b − h||_2 denotes the L2 norm between the generated hash code b and the real-valued hash-like code h, and λ is the hyper-parameter.
5. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism according to claim 4, characterized in that:
in the step 5, the overall network model is trained with the Adam optimizer with the learning rate set to 10⁻⁴; the input pictures are resized to 256×256; the batch size is set to 64; the hash code length k is set to 16, 24, 32, 48 and 64; and the edge parameter ε is set to 2k; the initial weights of the convolutional neural network ResNet50 are initialized with a pre-trained weight parameter matrix W and bias parameter matrix B; steps 2 to 4 are repeated to iteratively train the network model, optimizing the weight parameter matrix W and the bias parameter matrix B so as to reduce the loss of the objective function L; the algorithm terminates after 100 training iterations or once the final objective function loss no longer decreases, after which the trained overall network model is used to compute the hash codes of the samples in the test data set.
6. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism according to claim 5, characterized in that:
in the step 6, the query sample is a picture from the test data set or an unmanned aerial vehicle picture input in the prediction scenario.
CN202310007898.4A 2023-01-04 2023-01-04 Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism Pending CN116089646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310007898.4A CN116089646A (en) 2023-01-04 2023-01-04 Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310007898.4A CN116089646A (en) 2023-01-04 2023-01-04 Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism

Publications (1)

Publication Number Publication Date
CN116089646A true CN116089646A (en) 2023-05-09

Family

ID=86205785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310007898.4A Pending CN116089646A (en) 2023-01-04 2023-01-04 Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism

Country Status (1)

Country Link
CN (1) CN116089646A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524282A (en) * 2023-06-26 2023-08-01 贵州大学 Discrete similarity matching classification method based on feature vectors
CN116524282B (en) * 2023-06-26 2023-09-05 贵州大学 Discrete similarity matching classification method based on feature vectors

Similar Documents

Publication Publication Date Title
Li et al. Few sample knowledge distillation for efficient network compression
Yang et al. Cross-image relational knowledge distillation for semantic segmentation
CN110188765B (en) Image semantic segmentation model generation method, device, equipment and storage medium
CN107480261B (en) Fine-grained face image fast retrieval method based on deep learning
US11238093B2 (en) Video retrieval based on encoding temporal relationships among video frames
CN112528780B (en) Video motion segmentation by hybrid temporal adaptation
EP3295381B1 (en) Augmenting neural networks with sparsely-accessed external memory
CN111400494B (en) Emotion analysis method based on GCN-Attention
CN114911958B (en) Semantic preference-based rapid image retrieval method
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
CN114863407B (en) Multi-task cold start target detection method based on visual language deep fusion
CN115222998B (en) Image classification method
CN109239670B (en) Radar HRRP (high resolution ratio) identification method based on structure embedding and deep neural network
CN113297370A (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN113626589A (en) Multi-label text classification method based on mixed attention mechanism
CN116089646A (en) Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism
CN113032601A (en) Zero sample sketch retrieval method based on discriminant improvement
CN114626454A (en) Visual emotion recognition method integrating self-supervision learning and attention mechanism
Zhu et al. Two-branch encoding and iterative attention decoding network for semantic segmentation
CN115640418B (en) Cross-domain multi-view target website retrieval method and device based on residual semantic consistency
Fadavi Amiri et al. Improving image segmentation using artificial neural networks and evolutionary algorithms
CN117011515A (en) Interactive image segmentation model based on attention mechanism and segmentation method thereof
CN112487231B (en) Automatic image labeling method based on double-image regularization constraint and dictionary learning
Zhang et al. Style classification of media painting images by integrating ResNet and attention mechanism
CN114049634B (en) Image recognition method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination