CN116089646A - Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism - Google Patents


Info

Publication number
CN116089646A
Authority
CN
China
Prior art keywords
features
feature
phase
local
hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310007898.4A
Other languages
Chinese (zh)
Inventor
陈亚雄
杨锴
黄景灏
黄吉瑞
熊盛武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT
Priority to CN202310007898.4A
Publication of CN116089646A
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/10 - Terrestrial scenes
    • G06V20/17 - Terrestrial scenes taken from planes or by drones
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Library & Information Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Remote Sensing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

According to the unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism, the semantic information of unmanned aerial vehicle image data is learned; effective hash codes are learned by means of the saliency capture mechanism, a distribution smoothing term, global information and local fine-grained information; and finally a given number of unmanned aerial vehicle image items are retrieved by similarity calculation. The method provided by the invention not only pays more attention to global information and captures salient features, but also improves retrieval precision.

Description

Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism
Technical Field
The invention relates to an unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism, and is particularly suitable for improving retrieval precision.
Background
With the rapid development of unmanned aerial vehicle technology, retrieval of images shot by unmanned aerial vehicles has attracted wide attention in the field of image processing. Compared with satellites, unmanned aerial vehicles generally offer real-time streaming, which enables rapid decision making; they also greatly reduce the dependence on weather conditions and provide higher flexibility in handling various problems. As the number of unmanned aerial vehicles increases, the number of images they capture also grows significantly, so mining effective unmanned aerial vehicle image information becomes increasingly important. To mine useful information, many researchers are focusing on unmanned aerial vehicle image data retrieval; because such retrieval can quickly return useful information, it has been applied in agriculture, the military and other fields. Unmanned aerial vehicle image retrieval is a branch of general image retrieval in which the retrieved content is image data shot by unmanned aerial vehicles.
With the explosive growth of unmanned aerial vehicle imagery, efficient ground-image analysis techniques are urgently needed to process unmanned aerial vehicle data. The unmanned aerial vehicle image retrieval task is to retrieve relevant unmanned aerial vehicle images from unmanned aerial vehicle image data. Because the data volume is large and the information differs greatly between data of different scales, it is difficult for users to obtain useful information quickly. How to handle the multi-scale nature of unmanned aerial vehicle image data is therefore an important challenge of the retrieval task.
In recent years, many researchers have addressed unmanned aerial vehicle image data retrieval with deep learning methods. The common practice is to encode all unmanned aerial vehicle image data into their respective features and then compute the similarity of different images in a common representation space. Although existing unmanned aerial vehicle image retrieval methods have made some progress, they still have several defects: 1) a large amount of storage space is required and the space-time complexity of the search is high; 2) existing hash methods pay too much attention to global information and ignore fine-grained salient key information.
Disclosure of Invention
The invention aims to overcome the above defects. By learning the semantic information of unmanned aerial vehicle image data, effective hash codes are learned by means of a saliency capture mechanism, a distribution smoothing term, global information and local fine-grained information, and finally a given number of unmanned aerial vehicle image items are retrieved by similarity calculation. The invention provides an unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism, which makes full use of the fine-grained key information of unmanned aerial vehicle images to further improve retrieval performance.
In order to achieve the above object, the technical solution of the present invention is:
an unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism, the method comprises the following steps:
step 1, dividing photos of an unmanned aerial vehicle image library into a training data set and a testing data set;
step 2, information extraction: the pre-trained ResNet50 network is improved, and the ResNet50 network is trained for information extraction with the pictures of the training data set;
the pictures of the training data set are trained and their features extracted with the ResNet50 network, which applies four stages of feature mapping to each picture. First, the feature map output of the first stage of ResNet50 is upsampled and then connected with the feature map output of the second stage to form the local low-level feature F_low. Then the feature map output of the third stage of ResNet50 is upsampled and connected with the feature map output of the fourth stage as the local high-level feature F_high. Finally, the local low-level feature is processed with a 3×3 convolution and the local high-level feature with a 1×1 convolution so that the two features have the same size, and the two features are connected to form a connection feature. To avoid losing high-level feature semantics, a residual structure connects the mean of the local high-level feature with the connection feature to obtain the local joint feature F_j. In addition, a fine-grained transformation of the local joint feature reduces redundant information;
step 3, saliency capture: after the local fine-grained feature F_l is generated, a saliency capture process is used to enhance the effectiveness of the feature; information interaction attention is captured first, followed by visual enhancement attention;
the information interaction attention capture lets the global feature and the local fine-grained feature learn from and interact with each other to obtain the feature embedding vector F_ia captured by information interaction attention; the visual enhancement attention capture enhances the visual representation of the extracted effective features and yields the saliency feature F_va output by the saliency module;
step 4, hash learning training: the saliency feature F_va output by the saliency module in step 3 is fed into the hash learning module for training, namely a fully connected hash layer of k nodes with the tanh function as activation; a k-bit hash-like code is generated in the training stage and learned with an objective function composed of a similarity maintenance term, a distribution smoothing term and a quantization error; in the test stage, the k-bit hash-like code is quantized into a k-bit hash code with the sign function;
step 5, training the saliency capture model: the network model is trained by cycling through steps 2 to 4 on the training data set; the algorithm stops after 100 training iterations or when the final objective function loss no longer decreases, and the trained whole network model is then used to compute the hash codes of the samples in the test data set;
step 6, the trained whole network model computes the hash codes of the samples in the test data set; the Hamming distances between the query sample and the hash code of each sample in the training data set are ranked in ascending order (most similar first), the top-n accuracies of the ranking list are calculated to obtain the mean average precision index MAP and the top-n retrieval results, the retrieval results are output, and the retrieval is completed.
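As an illustration of the ranking and evaluation in step 6, a minimal sketch follows (not the patented implementation; the use of NumPy, ±1-valued codes and the helper names are assumptions):

```python
import numpy as np

def hamming_distances(query_code, db_codes):
    """Hamming distances between one ±1 query code and a database of ±1 codes."""
    k = db_codes.shape[1]
    return 0.5 * (k - db_codes @ query_code)           # shape: (num_db,)

def average_precision(query_code, query_label, db_codes, db_labels, top_n=None):
    """AP of one query: rank the database by ascending Hamming distance."""
    dist = hamming_distances(query_code, db_codes)
    order = np.argsort(dist)                            # most similar first
    if top_n is not None:
        order = order[:top_n]
    relevant = (db_labels[order] == query_label).astype(np.float32)
    if relevant.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
    return float((precision_at_i * relevant).sum() / relevant.sum())

def mean_average_precision(q_codes, q_labels, db_codes, db_labels, top_n=None):
    """Mean average precision (MAP) over all query samples."""
    aps = [average_precision(q, l, db_codes, db_labels, top_n)
           for q, l in zip(q_codes, q_labels)]
    return float(np.mean(aps))
```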
In step 2, the pictures of the training data set are trained and their features extracted with the ResNet50 network, which applies four stages of feature mapping to each picture. The picture of the training data set is processed by the first stage of the ResNet50 network to obtain the first-stage projection F^(1) with first-stage network parameters W^(1); the first-stage projection is processed by the second stage to obtain the second-stage projection F^(2) with second-stage network parameters W^(2); the second-stage projection is processed by the third stage to obtain the third-stage projection F^(3) with third-stage network parameters W^(3); and the third-stage projection is processed by the fourth stage to obtain the fourth-stage projection F^(4) with fourth-stage network parameters W^(4). The feature output by the ResNet50 network after the four stages in sequence is the global feature projection, denoted F_g.
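As an illustration of the four-stage processing just described, the stage outputs of a pre-trained ResNet50 might be exposed as follows (a sketch assuming the torchvision backbone, with the four stages taken as layer1 to layer4):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class ResNet50Stages(nn.Module):
    """Expose the four stage feature maps F1..F4 of a pre-trained ResNet50."""
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stage1, self.stage2 = backbone.layer1, backbone.layer2
        self.stage3, self.stage4 = backbone.layer3, backbone.layer4

    def forward(self, x):
        x = self.stem(x)
        f1 = self.stage1(x)   # first-stage projection, 256 channels
        f2 = self.stage2(f1)  # second-stage projection, 512 channels
        f3 = self.stage3(f2)  # third-stage projection, 1024 channels
        f4 = self.stage4(f3)  # fourth-stage projection, 2048 channels (global feature)
        return f1, f2, f3, f4

# example: stage feature maps for a batch of 256x256 UAV images
feats = ResNet50Stages()(torch.randn(2, 3, 256, 256))
```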
An unmanned aerial vehicle image is input, and global feature extraction and the feature extraction of the different convolution stages are considered simultaneously. The feature map output of the first stage of ResNet50 is upsampled and then connected with the feature map output of the second stage as the local low-level feature F_low, with the specific formula:

F_low = Concat( Up(F^(1)), F^(2) )

where F_low is the local low-level feature, Concat(·,·) denotes the splicing operation, Up(·) denotes upsampling, F^(1) is the first-stage projection obtained with the first-stage network parameters W^(1), and F^(2) is the second-stage projection obtained with the second-stage network parameters W^(2);
thereafter, the feature map output of the third stage of ResNet50 is upsampled and then connected with the feature map output of the fourth stage as the local high-level feature F_high, with the specific formula:

F_high = Concat( Up(F^(3)), F^(4) )

where F_high is the local high-level feature, Concat(·,·) denotes the splicing operation, F^(3) is the third-stage projection obtained with the third-stage network parameters W^(3), and F^(4) is the fourth-stage projection obtained with the fourth-stage network parameters W^(4);
then the local low-level feature and the local high-level feature are processed with a 3×3 convolution and a 1×1 convolution respectively and spliced so that they have the same size; a residual structure connects the mean of the local high-level feature with the spliced feature to obtain the local joint feature F_j, with the specific formula:

F_j = ρ(F_high) ⊕ ψ( Concat( Conv_3×3(F_low), Conv_1×1(F_high) ) )

where F_j is the local joint feature, ρ(·) is the mean calculation, ⊕ is the element-wise sum, ψ is the parametric rectified linear unit function, Conv_3×3 is a 3×3 convolution and Conv_1×1 is a 1×1 convolution;

to reduce the redundant information of the local joint feature, F_j undergoes a fine-grained transformation to obtain the local fine-grained feature F_l, with the specific formula:

F_l = δ(F_j) ⊙ F_j

where F_l is the local fine-grained feature, ⊙ denotes element-wise multiplication and δ is the sigmoid function;
at this time, the information extraction is completed.
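A minimal sketch of how the fusion into F_low, F_high, F_j and F_l described above might be implemented (the channel sizes, the resampling direction, the channel-wise mean for ρ and the sigmoid gating used for the fine-grained transformation are all assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InformationExtraction(nn.Module):
    """Fuse the four ResNet50 stage maps into the local fine-grained feature F_l."""
    def __init__(self, c_low=256 + 512, c_high=1024 + 2048, c_out=512):
        super().__init__()
        self.conv_low = nn.Conv2d(c_low, c_out, kernel_size=3, padding=1)   # 3x3 conv on F_low
        self.conv_high = nn.Conv2d(c_high, c_out, kernel_size=1)            # 1x1 conv on F_high
        self.prelu = nn.PReLU()                                             # psi

    def forward(self, f1, f2, f3, f4):
        # local low-level feature: stage-1 map resized to stage-2 size, then concatenated
        # (the resampling direction is an assumption)
        f_low = torch.cat([F.interpolate(f1, size=f2.shape[-2:], mode="bilinear",
                                         align_corners=False), f2], dim=1)
        # local high-level feature: stage-3 map resized to stage-4 size, then concatenated
        f_high = torch.cat([F.interpolate(f3, size=f4.shape[-2:], mode="bilinear",
                                          align_corners=False), f4], dim=1)
        low = self.conv_low(f_low)
        high = self.conv_high(f_high)
        low = F.interpolate(low, size=high.shape[-2:], mode="bilinear",
                            align_corners=False)                            # same spatial size
        joint = self.prelu(torch.cat([low, high], dim=1))                   # spliced feature
        mean_high = f_high.mean(dim=1, keepdim=True)                        # rho(F_high), channel mean (assumed)
        f_j = joint + mean_high                                             # residual connection -> F_j
        f_l = torch.sigmoid(f_j) * f_j                                      # fine-grained gating -> F_l
        return f_l
```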
In step 3,

step 3.1, capturing of information interaction attention: the global feature is projected through different fully connected layers onto the Query of the attention mechanism to obtain Q_ia, and the local fine-grained feature F_l is projected onto the Key and Value to obtain K_ia and V_ia respectively. The correlation S_ia between the global feature and the local fine-grained feature is:

S_ia = φ( Q_ia K_ia^T / √d )

where φ denotes the softmax function, √d is the set scaling parameter, Q_ia is the Query in the attention mechanism and K_ia^T is the transposed Key in the attention mechanism;

to perform the information interaction, the similarity is computed with multi-head attention, and the similarities of the different heads are spliced and fused as follows:

T_l = Dropout( S_l V_l^ia ),  l = 1, …, L
T_ia = Concat( T_1, …, T_L ) W_ia

where L is the number of attention heads, T_l is the output of the l-th head, W_ia is a learnable parameter matrix, Dropout(·) is the dropout operation, Concat(·) is the splicing operation, S_l is the similarity of the l-th head and V_l^ia is the Value projected from the local fine-grained feature for the l-th head;

to enhance the visual characterization and further obtain an effective feature embedding, the global feature F_g is combined with T_ia, with the specific formula:

F_ia = LN( (F_g + T_ia) + MLP( LN(F_g + T_ia) ) )

where F_ia is the feature embedding vector of the information interaction attention module, LN(·) is the layer normalization operation and MLP(·) is a multi-layer perceptron; at this point the feature embedding vector F_ia captured by information interaction attention is obtained.
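The information interaction capture can be sketched as multi-head cross-attention in which the global feature supplies the Query and the local fine-grained feature supplies the Key and Value (the token layout, embedding size, head count and the residual/MLP arrangement are assumptions):

```python
import torch
import torch.nn as nn

class InformationInteractionAttention(nn.Module):
    """Cross-attention: global feature gives the Query, local fine-grained feature gives Key/Value."""
    def __init__(self, dim=512, heads=8, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, f_global, f_local):
        # f_global: (B, Nq, dim) tokens from the global projection
        # f_local:  (B, Nk, dim) tokens from the local fine-grained feature
        t_ia, _ = self.attn(query=f_global, key=f_local, value=f_local)  # multi-head interaction T_ia
        z = self.norm1(f_global + t_ia)                                  # combine with the global feature
        return self.norm2(z + self.mlp(z))                               # feature embedding F_ia
```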
Step 3.2, capturing of visual enhancement attention: to enhance the visual representation, the feature embedding vector F_ia of the information interaction attention module is projected onto the Query, Key and Value of the attention mechanism to obtain Q_va, K_va and V_va respectively; the similarity S_va of the different tokens is computed as:

S_va = φ( Q_va K_va^T / √d )

where S_va is the embedding matrix of the different features, φ is the softmax function and √d is the set scaling parameter;

the similarity is then computed with a multi-head attention mechanism as follows:

T_m = S_m V_m^va,  m = 1, …, M
T_va = Concat( T_1, …, T_M ) W_va

where M is the number of heads of the visual enhancement attention module, T_m is the output of the m-th head, W_va is the learnable parameter of the visual enhancement attention module, Concat(·) is the splicing operation, S_m is the similarity of the m-th head and V_m^va is the Value projected from the feature embedding vector F_ia for the m-th head;

finally, the saliency feature F_va is generated through layer normalization, with the specific formula:

F_va = LN( F_ia + T_va )

where F_va is the saliency feature and LN(·) is the layer normalization process.
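The visual enhancement capture can likewise be sketched as multi-head self-attention over F_ia followed by a residual connection and layer normalization (embedding size and head count are assumptions):

```python
import torch
import torch.nn as nn

class VisualEnhancementAttention(nn.Module):
    """Self-attention over F_ia; the normalized residual output is the saliency feature F_va."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_ia):
        # f_ia: (B, N, dim) feature embedding from the information interaction module
        t_va, _ = self.attn(query=f_ia, key=f_ia, value=f_ia)
        return self.norm(f_ia + t_va)    # saliency feature F_va
```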
The hash function in step 4 is:

b = sign(h) = sign( τ(F_va, W_h) ),  with  h = τ(F_va, W_h) = tanh( W_h F_va )

where F_va is the output of the saliency capture module, W_h is the weight of the approximation function, τ is the approximation function, h is the hash-like code and b is the generated hash code;

the objective function consists of a similarity maintenance term, a distribution smoothing term and a quantization error;

the similarity maintenance term is calculated as:

L_sim = Σ_{i,j} [ s_ij H(b_i, b_j) + (1 − s_ij) max( ε − H(b_i, b_j), 0 ) ]

where ε is the edge parameter, max is the maximum function, H(·,·) computes the Hamming distance, and s_ij is the pairwise label of the samples (1 for similar, 0 for dissimilar);

introducing a distribution smoothing term L_ds smooths the distribution centre at its theoretical value; it is computed from the generated hash codes b_n and the true sample labels y_n through the label smoothing function θ applied to the n1-th input labels y_{n1}, with smoothing hyper-parameter γ;

however, the objective function is difficult to optimize during training, so the Euclidean distance D is used instead of the Hamming distance, i.e.:

L_sim = Σ_{i,j} [ s_ij D(b_i, b_j) + (1 − s_ij) max( ε − D(b_i, b_j), 0 ) ]

in addition, the hash code introduces a quantization error, so a quantization error term is added, and the final objective function is:

L = L_sim + L_ds + λ ‖ b − h ‖²₂

where ‖ b − h ‖²₂ is the L2 norm between the generated hash code b and the real-valued hash-like code h, and λ is a hyper-parameter.
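A hedged sketch of the k-node hash layer and of the objective in its relaxed form follows (the similarity term with the Euclidean relaxation plus the quantization error term; the distribution smoothing term is omitted because its exact formula is only given as an image in the published text):

```python
import torch
import torch.nn as nn

class HashLayer(nn.Module):
    """Fully connected hash layer with tanh activation; k is the hash code length."""
    def __init__(self, dim=512, k=32):
        super().__init__()
        self.fc = nn.Linear(dim, k)

    def forward(self, f_va):
        h = torch.tanh(self.fc(f_va))          # hash-like code h in (-1, 1)
        b = torch.sign(h).detach()             # binary hash code b (used at test time)
        return h, b

def retrieval_loss(h, labels, epsilon, lam=0.1):
    """Pairwise similarity term (Euclidean relaxation) plus quantization error term."""
    d = torch.cdist(h, h, p=2)                                   # D(h_i, h_j)
    s = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()     # pairwise labels s_ij
    sim_term = s * d + (1 - s) * torch.clamp(epsilon - d, min=0)
    quant_term = (torch.sign(h).detach() - h).pow(2).sum(dim=1)  # ||b - h||_2^2
    return sim_term.mean() + lam * quant_term.mean()
```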
In step 5, when the whole network model is trained, the Adam algorithm is used for optimization, the learning rate is set to 10⁻⁴ and the input pictures are resized to 256×256; the batch size is set to 64, the hash code length k is set to 16, 24, 32, 48 and 64, and the edge parameter ε is set to 2k. The initial weights of the convolutional neural network ResNet50 are initialized with the pre-trained weight parameter matrix W and bias parameter matrix B. Steps 2 to 4 are repeated to train the network model iteratively, optimizing the weight parameter matrix W and the bias parameter matrix B so as to reduce the loss of the objective function L. The algorithm stops after 100 training iterations or when the final objective function loss no longer decreases, and the trained whole network model is then used to compute the hash codes of the samples in the test data set.
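The training settings of step 5 (Adam, learning rate 10⁻⁴, 256×256 inputs, batch size 64, up to 100 iterations with early stopping) could be wired together roughly as below, reusing the retrieval_loss sketch above; the model composition and the epoch-level stopping check are assumptions:

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, epsilon, epochs=100, device="cuda"):
    loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    best_loss = float("inf")
    model.to(device)
    for epoch in range(epochs):
        total = 0.0
        for images, labels in loader:                    # images already resized to 256x256
            images, labels = images.to(device), labels.to(device)
            h, _ = model(images)                         # model ends in the hash layer, returns (h, b)
            loss = retrieval_loss(h, labels, epsilon)    # epsilon = 2k per the text
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total >= best_loss:                           # stop when the loss no longer decreases
            break
        best_loss = total
    return model
```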
In step 6, the query sample is a picture from the test data set or an unmanned aerial vehicle picture supplied in a prediction scenario.
Compared with the prior art, the invention has the beneficial effects that:
1. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism designs a new unmanned aerial vehicle image retrieval framework and uses an information extraction module and a saliency capture module to extract effective information from unmanned aerial vehicle images during hash code learning. Secondly, a new objective function composed of a similarity maintenance term, a distribution smoothing term and a quantization error is designed, so that the similarity of the hash codes is maintained, the distribution of the unmanned aerial vehicle image data set is smoothed, and the quantization error between the hash codes and the hash-like codes is reduced.
2. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism mainly comprises three steps: extraction, learning and selection. Given an unmanned aerial vehicle image to be queried, its representation features are first extracted; hash code learning is then carried out using the fixed similarity relations among similar unmanned aerial vehicle images; finally, the K most similar images are obtained by similarity calculation, which effectively improves retrieval precision. As the comparison of the mean average precision on the two data sets shows, the retrieval effect of the proposed unmanned aerial vehicle image retrieval method is superior to that of the existing methods.
3. By learning the semantic information of unmanned aerial vehicle image data and learning effective hash codes with the saliency capture mechanism, the distribution smoothing term, global information and local fine-grained information, the method improves retrieval precision, while the deep hash approach reduces the space-time complexity of retrieval and the storage space required by the retrieval method.
Drawings
Fig. 1 is a schematic diagram of a network architecture of the present invention.
Fig. 2 is a search result diagram of the present invention.
FIG. 3 is a diagram of the visual effect of the saliency capture module of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and detailed description.
Referring to fig. 1, the unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism comprises steps 1 to 6, and the information extraction, saliency capture, hash learning training and retrieval of these steps are implemented exactly as described above in the disclosure.
Example 1:
an unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism, the method comprises the following steps:
the environment adopted in this embodiment is GeForce RTX 3090GPU, interXeon (R) Silver 4210RCPU@2.40GHz ×40, 62.6G RAM, linux operating system, and developed by Python and open source library Pytorch.
Step 1, dividing the photos of the unmanned aerial vehicle image library into a training data set and a test data set; using the ERA and Drone-Action data sets, 80% of each data set is selected as the training data set I_train and the remaining 20% as the test data set I_test.
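The 80/20 split of this example might be prepared as follows (the directory path and the ImageFolder class-sub-folder layout are assumptions):

```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])

# the Drone-Action (or ERA) images arranged in class sub-folders (assumed layout)
library = datasets.ImageFolder("data/drone_action", transform=transform)
n_train = int(0.8 * len(library))
train_set, test_set = torch.utils.data.random_split(
    library, [n_train, len(library) - n_train],
    generator=torch.Generator().manual_seed(0))
```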
Steps 2 to 6 are then carried out exactly as described above in the disclosure.
Example 2:
example 2 is substantially the same as example 1 except that:
in the step 2, the picture of the training data set is trained and extracted by utilizing a ResNet50 network, the ResNet50 network performs four-stage feature mapping processing on the picture of the training data set, and the picture of the training data set is subjected to a first-stage processing in the ResNet50 network to obtain a first-stage projection
Figure BDA0004036317480000131
And network parameters of the first phase +.>
Figure BDA0004036317480000132
Projection of the first phase +.>
Figure BDA0004036317480000133
And network parameters of the first phase +.>
Figure BDA0004036317480000134
Performing second stage processing in ResNet50 network to obtain projection of second stage +.>
Figure BDA0004036317480000135
And network parameters of the second phase->
Figure BDA0004036317480000136
Projection of the second phase +.>
Figure BDA0004036317480000137
And network parameters of the second phase
Figure BDA0004036317480000141
Performing a third phase process in the ResNet50 network to obtain a third phase projection +.>
Figure BDA0004036317480000142
And thirdPhase network parameters
Figure BDA0004036317480000143
Projection of the third phase +.>
Figure BDA0004036317480000144
And network parameters of the third phase +.>
Figure BDA0004036317480000145
Performing a fourth phase process in the ResNet50 network to obtain a projection of the fourth phase +.>
Figure BDA0004036317480000146
And network parameters of the fourth phase +.>
Figure BDA0004036317480000147
The feature of the ResNet50 network output by four stages in sequence is global feature projection;
inputting an unmanned aerial vehicle image, and simultaneously taking global feature extraction and feature extraction of different convolution layers into consideration; upsampling the feature map output of the first stage of ResNet50 and then connecting the feature map output of the second stage of ResNet50 to a local low-level feature F low The specific formula is as follows:
Figure BDA0004036317480000148
/>
wherein ,Flow As a feature of the local low-level layer,
Figure BDA0004036317480000149
representing a splicing operation->
Figure BDA00040363174800001410
Projection representing the first phase, +.>
Figure BDA00040363174800001411
Network parameters representing the first phase, +.>
Figure BDA00040363174800001412
Representing the projection of the second phase +.>
Figure BDA00040363174800001413
Network parameters representing the second phase;
thereafter, the feature map output of the third stage of ResNet50 is up-sampled and then the feature map output of the fourth stage of ResNet50 is connected as local high-level features F high The specific formula is as follows:
Figure BDA00040363174800001414
wherein ,Fhigh As a feature of a local high-level layer,
Figure BDA00040363174800001415
representing a splicing operation->
Figure BDA00040363174800001416
Projection representing the third phase, +.>
Figure BDA00040363174800001417
Network parameters representing the third phase, +.>
Figure BDA00040363174800001418
Projection representing the fourth phase, +. >
Figure BDA00040363174800001419
Network parameters representing the fourth phase;
then, the local low-layer features and the local high-layer features are processed and spliced by using a convolution of 3 multiplied by 3 and convolution of 1 multiplied by 1 respectively, so that the local low-layer features and the local high-layer features have the same size; connecting the average value of the local advanced features and the spliced features by using a residual structure to obtain a local joint feature F j The specific formula is as follows:
Figure BDA00040363174800001420
wherein ,Fj For local joint features, ρ is the mean calculation,
Figure BDA00040363174800001421
for the sum operation, ψ is the parametric rectified linear unit function, +.>
Figure BDA00040363174800001422
Is a convolution of 3 x 3>
Figure BDA00040363174800001423
Is a convolution of 1 x 1;
to reduce redundant information of local joint features, the local joint features F j Fine-grained transformation is carried out to obtain local fine-grained feature F l The specific formula is as follows:
Figure BDA0004036317480000151
wherein ,Fl For the local fine granularity feature, by term multiplication, delta is a sigmoid function;
at this time, the information extraction is completed.
In the step (3) of the above-mentioned process,
step 3.1, capturing information interaction attention, and projecting global features onto a Query of an attention mechanism through different full connection layers to obtain Q ia Local fine grain feature F l Projected onto Key and Value to obtain respectively
Figure BDA0004036317480000152
and Via Correlation S of global features and local fine-grained features ia The following are provided:
Figure BDA0004036317480000153
wherein, phi tableThe softmax function is shown as a function of,
Figure BDA0004036317480000154
Represents the set scaling parameters, Q ia Is Query in the attention mechanism, +.>
Figure BDA0004036317480000155
Is the transposed Key in the attention mechanism;
in order to perform information interaction, calculating the similarity by utilizing the multi-head attention, and splicing and fusing the similarity of different heads, the specific process is as follows:
Figure BDA0004036317480000156
Figure BDA0004036317480000157
wherein L is the number of attention heads,
Figure BDA0004036317480000158
represents the output of the first head, W ia Is a parameter matrix which can be learned, < >>
Figure BDA0004036317480000159
For Dropout operation, +.>
Figure BDA00040363174800001510
Representing the splicing operation S l For the similarity of the first head, +.>
Figure BDA00040363174800001511
Value projected for local fine granularity feature of the first header; />
To enhance visual characterization, to further achieve efficient feature embedding, global features and T ia In combination, the specific formula is as follows:
Figure BDA00040363174800001512
F ia i.e. the feature embedded vector of the information interaction attention module,
Figure BDA00040363174800001513
representation layer normalization operation, ++>
Figure BDA00040363174800001514
Is a multi-layer perceptron; at this time, a feature embedding vector F for capturing the attention of information interaction is obtained ia
Step 3.2, capturing of visual enhancement attention: to enhance visual performance, features of the information interaction attention module are first embedded into vector F ia The Query, key and Value projected to the attention mechanism respectively obtain Q va
Figure BDA00040363174800001515
and Vva The method comprises the steps of carrying out a first treatment on the surface of the Similarity S of different token va The calculation is as follows:
Figure BDA0004036317480000161
wherein ,Sva For an embedding matrix of different features, phi is a softmax function,
Figure BDA0004036317480000162
In order to set the ratio parameters of the components,
then calculating the similarity by utilizing a multi-head attention mechanism, wherein the specific process is as follows:
Figure BDA0004036317480000163
Figure BDA0004036317480000164
wherein m is an incrementThe head number of the high visual attention module,
Figure BDA0004036317480000165
for output of the mth head, W va To enhance the learnable parameters of the visual attention module, < +.>
Figure BDA0004036317480000166
Representing the splicing operation S m For the similarity of the mth head, +.>
Figure BDA0004036317480000167
Embedding a vector F for features of the mth head ia The Value of the projection;
finally, generating the saliency feature F through layer normalization va The specific formula is as follows:
Figure BDA0004036317480000168
wherein ,Fva Namely the characteristic of the significance,
Figure BDA0004036317480000169
is a layer normalization process.
In the step 4, the specific formula of the hash function is:
b=sign(h)=sign(τ(F va ,W h ))
Figure BDA00040363174800001610
wherein ,Fva To output of saliency capture module, W h Is the weight of the approximate function, τ is the approximate function, h is the hash-like code, and b is the generated hash code;
the objective function consists of a similarity maintaining term, a distribution smoothing term and a quantization error;
the similarity maintenance term calculation formula is as follows:
Figure BDA00040363174800001611
where epsilon edge parameter, max is the maximum function, H () calculates the hamming distance,
Figure BDA00040363174800001612
paired tags for samples (similarity 1, dissimilarity 0);
introducing a distribution smoothing term can smooth a distribution center at a theoretical value, and a calculation formula is as follows:
Figure BDA00040363174800001613
wherein ,
Figure BDA00040363174800001614
For smooth term, γ is the superparameter, θ is the label smoothing function, ++>
Figure BDA00040363174800001615
Represents the nth 1 Input labels, b n To generate the hash code, y n A tag that is true for the sample;
however, the objective function is difficult to optimize during training, so the Euclidean distance D is used instead of the Hamming distance, namely:
Figure BDA0004036317480000171
however, the hash code generates quantization errors and thus adds quantization error terms, and the final objective function is:
Figure BDA0004036317480000172
wherein ,
Figure BDA0004036317480000173
represented as an L2-canonical result of generating the hash code and the real hash code, λ is the hyper-parameter.
In the step 5, when the whole network model is trained, the Adam algorithm is used for optimization, and the learning rate is set to be 10 -4 The input picture size is adjusted to 256×256; the batch size is set to 64, the length k of the hash codes is set to 16, 24, 32, 48 and 64, the edge parameter epsilon is set to 2k, the initial weight of the convolutional neural network ResNet50 is initialized by using a weight parameter matrix W and a bias parameter matrix B which are trained in advance, the steps 2 to 4 are repeated to carry out iterative training on the network model, so that the weight parameter matrix W and the bias parameter matrix B are optimized to reduce the loss of an objective function L, and the algorithm operation is ended when 100 iterations of training or the final objective function loss is not reduced any more, so that the hash codes of samples in the test data set are calculated by the trained whole network model.
In the step 6, the query sample is a picture from the test data set or an unmanned aerial vehicle picture input in the prediction scenario.
To evaluate the effectiveness of the method, its retrieval performance is compared with several state-of-the-art methods, including DHN, DCH, DFH, DPH, DSHSD, GreedyHash, DSDH, DTSH, LCDSH and QSMIH. The experiments use 16, 24, 32, 48 and 64 bit hash codes on the Drone-Action and ERA data sets. DHN performs deep hash learning in a supervised manner within a Bayesian framework; the DCH, DFH, DPH, DSHSD, GreedyHash, DSDH, DTSH, LCDSH and QSMIH methods are performed in their plain (original) form.
TABLE 1
(The contents of Table 1 are provided as an image in the original publication.)
Table 1 shows the results of the comparison experiment between the proposed method and other methods on the unmanned aerial vehicle image retrieval task on the ERA data set, where mAP is the mean average precision metric.
TABLE 2
(The contents of Table 2 are provided as an image in the original publication.)
Table 2 shows the results of the comparison experiment between the proposed method and other methods on the unmanned aerial vehicle image retrieval task on the Drone-Action data set, where mAP is the mean average precision metric.
As can be seen from the mAP comparison on the two data sets, the retrieval performance of the proposed unmanned aerial vehicle image retrieval method is better than that of the existing methods.
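For completeness, the retrieval step used in such experiments can be sketched as follows: database codes are ranked by Hamming distance to the query code and average precision is accumulated over the top returned items. The function names, the ±1 code convention, and the use of identical class labels as the notion of relevance are assumptions; the patent only states that Hamming distances are ranked and mAP/top-n results are reported.

```python
import numpy as np

def hamming_rank(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    """Rank database items by Hamming distance to the query (codes in {-1, +1})."""
    k = db_codes.shape[1]
    dist = 0.5 * (k - db_codes @ query_code)   # Hamming distance for +/-1 codes
    return np.argsort(dist)                    # closest items first

def average_precision(query_label, db_labels, order, top_n: int = 100) -> float:
    """AP over the top_n returned items; relevance = identical class label."""
    relevant = (db_labels[order[:top_n]] == query_label).astype(np.float32)
    if relevant.sum() == 0:
        return 0.0
    cumulative = np.cumsum(relevant)
    precision_at_i = cumulative / (np.arange(top_n) + 1)
    return float((precision_at_i * relevant).sum() / relevant.sum())
```

mAP is then the mean of these per-query average precisions over all query samples.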

Claims (6)

1. An unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism, characterized by comprising the following steps:
step 1, dividing photos of an unmanned aerial vehicle image library into a training data set and a testing data set;
step 2, information extraction: the pre-trained ResNet50 network is improved, and information extraction training is performed on the ResNet50 network using the pictures of the training data set;
the features of the pictures of the training data set are extracted with the ResNet50 network, which applies four stages of feature mapping to each training picture: first, the feature map output of the first stage of ResNet50 is up-sampled and then connected with the feature map output of the second stage of ResNet50 to form the local low-level feature F_low; afterwards, the feature map output of the third stage of ResNet50 is up-sampled and then connected with the feature map output of the fourth stage of ResNet50 to form the local high-level feature F_high; finally, the local low-level feature is processed with a 3×3 convolution and the local high-level feature with a 1×1 convolution so that the two features have the same size, and the two are connected to form a connected feature; to avoid the loss of high-level feature semantics, a residual structure connects the average value of the local high-level feature with the connected feature to obtain the local joint feature F_j; in addition, a fine-grained transformation of the local joint feature reduces redundant information;
step 3, saliency capture: after the local fine-grained feature F_l is generated, a saliency capture process is applied to enhance the effectiveness of the feature; information interaction attention is captured first, followed by visual enhancement attention;

the information interaction attention mechanism lets the global feature and the local fine-grained feature learn from and interact with each other, yielding the feature embedding vector F_ia of the information interaction attention capture; the visual enhancement attention mechanism enhances the visual representation of the extracted effective features, yielding the saliency feature F_va output by the saliency module;
Step 4, carrying out hash learning training, and outputting the saliency characteristic F obtained in the step 3 by the saliency module va Then, inputting the hash to a hash learning module for training, namely a fully connected hash layer of k nodes, wherein the hash uses a tanh function as an activation function; generating a k-bit hash code in a training stage, and learning by an objective function consisting of a similarity maintenance term, a distributed smoothing term and a quantization error; in the test stage, quantizing the k-bit hash codes into k-bit hash codes by using a symbol function;
step 5, training the saliency capture model: the network model is trained with the training data set by cycling through steps 2 to 4; the algorithm terminates after 100 training iterations or once the final objective function loss no longer decreases, and the trained overall network model is then used to compute the hash codes of the samples in the test data set;
and step 6, the trained overall network model computes the hash codes of the samples in the test data set; the samples of the training data set are ranked by the Hamming distance between their hash codes and that of the query sample, the top-n accuracies of the ranking list are calculated, and the mean average precision (mAP) and the top-n retrieval results are obtained; the retrieval results are then output and the retrieval is completed.
2. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism according to claim 1, characterized in that:
in the step 2, the features of the pictures of the training data set are extracted with the ResNet50 network, which performs four-stage feature mapping processing on the pictures of the training data set; a picture of the training data set is processed by the first stage of the ResNet50 network to obtain the first-stage projection P_1 and the first-stage network parameters θ_1; the first-stage projection P_1 and network parameters θ_1 undergo the second-stage processing in the ResNet50 network to obtain the second-stage projection P_2 and the second-stage network parameters θ_2; the second-stage projection P_2 and network parameters θ_2 undergo the third-stage processing in the ResNet50 network to obtain the third-stage projection P_3 and the third-stage network parameters θ_3; the third-stage projection P_3 and network parameters θ_3 undergo the fourth-stage processing in the ResNet50 network to obtain the fourth-stage projection P_4 and the fourth-stage network parameters θ_4; the feature output by the ResNet50 network after the four stages in sequence is the global feature projection, denoted F_g;
an unmanned aerial vehicle image is input, and global feature extraction and the feature extraction of different convolution layers are considered simultaneously; the feature map output of the first stage of ResNet50 is up-sampled and then connected with the feature map output of the second stage of ResNet50 to form the local low-level feature F_low, with the specific formula:

F_low = Concat( Up(P_1(θ_1)), P_2(θ_2) )

where F_low is the local low-level feature, Concat(·) denotes the splicing operation, Up(·) denotes the up-sampling operation, P_1 denotes the projection of the first stage, θ_1 the network parameters of the first stage, P_2 the projection of the second stage, and θ_2 the network parameters of the second stage;
thereafter, the feature map output of the third stage of ResNet50 is up-sampled and then connected with the feature map output of the fourth stage of ResNet50 to form the local high-level feature F_high, with the specific formula:

F_high = Concat( Up(P_3(θ_3)), P_4(θ_4) )

where F_high is the local high-level feature, Concat(·) denotes the splicing operation, Up(·) denotes the up-sampling operation, P_3 denotes the projection of the third stage, θ_3 the network parameters of the third stage, P_4 the projection of the fourth stage, and θ_4 the network parameters of the fourth stage;
then, the local low-level feature and the local high-level feature are processed with a 3×3 convolution and a 1×1 convolution respectively and spliced, so that they have the same size; a residual structure connects the mean of the local high-level feature with the spliced feature to obtain the local joint feature F_j, with the specific formula:

F_j = ψ( ρ(F_high) ⊕ Concat( Conv_{3×3}(F_low), Conv_{1×1}(F_high) ) )

where F_j is the local joint feature, ρ is the mean calculation, ⊕ is the sum operation, ψ is the parametric rectified linear unit function, Conv_{3×3} is the 3×3 convolution and Conv_{1×1} is the 1×1 convolution;
to reduce the redundant information of the local joint feature, the local joint feature F_j undergoes a fine-grained transformation to obtain the local fine-grained feature F_l, with the specific formula:

F_l = F_j ⊗ δ(F_j)

where F_l is the local fine-grained feature, ⊗ is the element-wise multiplication, and δ is the sigmoid function;
at this time, the information extraction is completed.
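For illustration, a minimal PyTorch sketch of the multi-scale extraction and fusion described above is given below. The channel width of 512, the bilinear resizing used to align spatial sizes, the broadcast channel-mean residual, the gated form of the fine-grained transformation, and the torchvision weight string are assumptions filled in where the text leaves details open; only the overall flow (stage-1/2 splicing into F_low, stage-3/4 splicing into F_high, 3×3 and 1×1 convolutions, residual joint feature F_j, fine-grained feature F_l) follows the text above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class FeatureFusion(nn.Module):
    """Hedged sketch of the local low/high-level feature fusion."""
    def __init__(self, dim: int = 512):
        super().__init__()
        net = resnet50(weights="IMAGENET1K_V1")      # pre-trained backbone (torchvision >= 0.13)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stage1, self.stage2 = net.layer1, net.layer2
        self.stage3, self.stage4 = net.layer3, net.layer4
        # 3x3 conv for the low-level branch, 1x1 conv for the high-level branch.
        self.conv_low = nn.Conv2d(256 + 512, dim, kernel_size=3, padding=1)
        self.conv_high = nn.Conv2d(1024 + 2048, dim, kernel_size=1)
        self.act = nn.PReLU()                        # parametric rectified linear unit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p1 = self.stage1(self.stem(x))               # stage-1 projection
        p2 = self.stage2(p1)                         # stage-2 projection
        p3 = self.stage3(p2)                         # stage-3 projection
        p4 = self.stage4(p3)                         # stage-4 projection

        # Resample the shallower map to its partner's spatial size before
        # splicing (the exact sampling direction is an assumption).
        f_low = torch.cat([F.interpolate(p1, size=p2.shape[-2:],
                                         mode="bilinear", align_corners=False), p2], dim=1)
        f_high = torch.cat([F.interpolate(p3, size=p4.shape[-2:],
                                          mode="bilinear", align_corners=False), p4], dim=1)

        low = self.conv_low(f_low)                   # 3x3 conv on F_low
        high = self.conv_high(f_high)                # 1x1 conv on F_high
        low = F.interpolate(low, size=high.shape[-2:],
                            mode="bilinear", align_corners=False)

        joint = torch.cat([low, high], dim=1)        # spliced feature
        # Residual connection using the channel mean of the high-level branch.
        f_j = self.act(joint + high.mean(dim=1, keepdim=True))
        return f_j * torch.sigmoid(f_j)              # fine-grained feature F_l
```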
3. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism according to claim 2, characterized in that:
in the step 3,

step 3.1, capture of information interaction attention: the global feature is projected onto the Query of the attention mechanism through different fully connected layers to obtain Q_ia, and the local fine-grained feature F_l is projected onto the Key and the Value to obtain K_ia and V_ia respectively; the correlation S_ia between the global feature and the local fine-grained feature is computed as:

S_ia = φ( Q_ia · K_ia^T / √d )

where φ denotes the softmax function, √d denotes the set scaling parameter, Q_ia is the Query in the attention mechanism, and K_ia^T is the transposed Key in the attention mechanism;
to perform information interaction, the similarity is computed with multi-head attention and the similarities of the different heads are spliced and fused, as follows:

T_ia^l = S_l · V_ia^l

T_ia = Dropout( Concat(T_ia^1, …, T_ia^L) · W_ia )

where L is the number of attention heads, T_ia^l denotes the output of the l-th head, W_ia is a learnable parameter matrix, Dropout(·) is the Dropout operation, Concat(·) denotes the splicing operation, S_l is the similarity of the l-th head, and V_ia^l is the Value projected from the local fine-grained feature for the l-th head;
to enhance the visual characterization and further obtain an effective feature embedding, the global feature is combined with T_ia, with the specific formula:

F_ia = LN( MLP(T_ia) + F_g )

where F_ia is the feature embedding vector of the information interaction attention module, LN(·) denotes the layer normalization operation, MLP(·) is a multi-layer perceptron, and F_g denotes the global feature; at this point the feature embedding vector F_ia captured by the information interaction attention is obtained;
step 3.2, capture of visual enhancement attention: to enhance the visual representation, the feature embedding vector F_ia of the information interaction attention module is first projected onto the Query, Key and Value of the attention mechanism to obtain Q_va, K_va and V_va respectively; the similarity S_va of the different tokens is computed as:

S_va = φ( Q_va · K_va^T / √d )

where S_va is the embedding matrix of the different features, φ is the softmax function, and √d is the set scaling parameter;
the similarity is then processed with a multi-head attention mechanism, the specific process being:

T_va^m = S_m · V_va^m

T_va = Concat(T_va^1, …, T_va^M) · W_va

where M is the number of heads of the visual enhancement attention module, T_va^m is the output of the m-th head, W_va is the learnable parameter matrix of the visual enhancement attention module, Concat(·) denotes the splicing operation, S_m is the similarity of the m-th head, and V_va^m is the Value projected from the feature embedding vector F_ia for the m-th head;

finally, the saliency feature F_va is generated through layer normalization, with the specific formula:

F_va = LN(T_va)

where F_va is the saliency feature and LN(·) is the layer normalization operation.
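As an illustrative companion to step 3.1 above, a compact sketch of the information interaction attention is shown below: the global feature supplies the Query while the local fine-grained feature supplies the Key and Value, and the attended result is fused back with the global feature through an MLP and layer normalization. The token shapes, the MLP hidden width and activation, and the dropout rate are assumptions; the multi-head projection, the learnable fusion with dropout, and the LN(MLP(·) + global) combination follow the text above.

```python
import torch
import torch.nn as nn

class InformationInteractionAttention(nn.Module):
    """Hedged sketch of step 3.1: cross-attention from global to local features."""
    def __init__(self, dim: int = 512, num_heads: int = 8, p_drop: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads,
                                          dropout=p_drop, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_global: torch.Tensor, f_local: torch.Tensor) -> torch.Tensor:
        # f_global: (batch, tokens_g, dim) global feature projection (Query)
        # f_local : (batch, tokens_l, dim) local fine-grained feature (Key, Value)
        t_ia, _ = self.attn(query=f_global, key=f_local, value=f_local)
        # F_ia = LN(MLP(T_ia) + global feature)
        return self.norm(self.mlp(t_ia) + f_global)
```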
4. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism according to claim 3, characterized in that:
in the step 4, the specific formula of the hash function is:

b = sign(h) = sign(τ(F_va, W_h))

h = τ(F_va, W_h) = tanh(F_va · W_h)

where F_va is the output of the saliency capture module, W_h is the weight of the approximate function, τ is the approximate function, h is the hash-like code, and b is the generated hash code;

the objective function consists of a similarity preserving term, a distribution smoothing term and a quantization error term;

the similarity preserving term is calculated as:

L_1 = Σ_{i,j} [ s_ij · H(b_i, b_j) + (1 − s_ij) · max(ε − H(b_i, b_j), 0) ]

where ε is the edge parameter, max is the maximum function, H(·) computes the Hamming distance, and s_ij is the pairwise label of samples i and j (1 for similar, 0 for dissimilar);
a distribution smoothing term is introduced to smooth the distribution centre towards its theoretical value, and is calculated as:

L_2 = Σ_{n_1} θ(b_{n_1}, y_{n_1}; γ)

where L_2 is the smoothing term, γ is the hyper-parameter, θ is the label smoothing function, n_1 indexes the input labels, b_n is the generated hash code, and y_n is the true label of the sample;
however, the objective function is difficult to optimize during training, so the Euclidean distance D is used in place of the Hamming distance, i.e.:

L_1 = Σ_{i,j} [ s_ij · D(b_i, b_j) + (1 − s_ij) · max(ε − D(b_i, b_j), 0) ]

in addition, binarizing the hash-like code introduces a quantization error, so a quantization error term is added, and the final objective function is:

L = L_1 + L_2 + λ · ||b − h||_2

where ||b − h||_2 denotes the L2 norm between the generated hash code b and the real-valued hash-like code h, and λ is the hyper-parameter.
5. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism according to claim 4, characterized in that:
in the step 5, the overall network model is trained with the Adam optimizer with the learning rate set to 10⁻⁴; the input pictures are resized to 256×256; the batch size is set to 64; the hash code length k is set to 16, 24, 32, 48 and 64; and the edge parameter ε is set to 2k; the initial weights of the convolutional neural network ResNet50 are initialized with a pre-trained weight parameter matrix W and bias parameter matrix B; steps 2 to 4 are repeated to iteratively train the network model, optimizing the weight parameter matrix W and the bias parameter matrix B so as to reduce the loss of the objective function L; the algorithm terminates after 100 training iterations or once the final objective function loss no longer decreases, after which the trained overall network model is used to compute the hash codes of the samples in the test data set.
6. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism according to claim 5, characterized in that:
in the step 6, the query sample is a picture from the test data set or an unmanned aerial vehicle picture input in the prediction scenario.
CN202310007898.4A 2023-01-04 2023-01-04 Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism Pending CN116089646A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310007898.4A CN116089646A (en) 2023-01-04 2023-01-04 Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310007898.4A CN116089646A (en) 2023-01-04 2023-01-04 Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism

Publications (1)

Publication Number Publication Date
CN116089646A true CN116089646A (en) 2023-05-09

Family

ID=86205785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310007898.4A Pending CN116089646A (en) 2023-01-04 2023-01-04 Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism

Country Status (1)

Country Link
CN (1) CN116089646A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524282A (en) * 2023-06-26 2023-08-01 贵州大学 Discrete similarity matching classification method based on feature vectors
CN116524282B (en) * 2023-06-26 2023-09-05 贵州大学 Discrete similarity matching classification method based on feature vectors

Similar Documents

Publication Publication Date Title
Li et al. Few sample knowledge distillation for efficient network compression
Yang et al. Cross-image relational knowledge distillation for semantic segmentation
CN110188765B (en) Image semantic segmentation model generation method, device, equipment and storage medium
CN107480261B (en) Fine-grained face image fast retrieval method based on deep learning
US11238093B2 (en) Video retrieval based on encoding temporal relationships among video frames
CN112528780B (en) Video motion segmentation by hybrid temporal adaptation
EP3295381B1 (en) Augmenting neural networks with sparsely-accessed external memory
CN111400494B (en) Emotion analysis method based on GCN-Attention
CN114911958B (en) Semantic preference-based rapid image retrieval method
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
CN114863407B (en) Multi-task cold start target detection method based on visual language deep fusion
CN115222998B (en) Image classification method
CN109239670B (en) Radar HRRP (high resolution ratio) identification method based on structure embedding and deep neural network
CN113297370A (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN113626589A (en) Multi-label text classification method based on mixed attention mechanism
CN116089646A (en) Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism
CN113032601A (en) Zero sample sketch retrieval method based on discriminant improvement
CN114626454A (en) Visual emotion recognition method integrating self-supervision learning and attention mechanism
Zhu et al. Two-branch encoding and iterative attention decoding network for semantic segmentation
CN115640418B (en) Cross-domain multi-view target website retrieval method and device based on residual semantic consistency
Fadavi Amiri et al. Improving image segmentation using artificial neural networks and evolutionary algorithms
CN117011515A (en) Interactive image segmentation model based on attention mechanism and segmentation method thereof
CN112487231B (en) Automatic image labeling method based on double-image regularization constraint and dictionary learning
Zhang et al. Style classification of media painting images by integrating ResNet and attention mechanism
CN114049634B (en) Image recognition method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination