CN116089646A - Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism - Google Patents
Unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism
- Publication number
- CN116089646A (application number CN202310007898.4A)
- Authority
- CN
- China
- Prior art keywords
- features
- feature
- phase
- local
- hash
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/583—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/10—Terrestrial scenes
- G06V20/17—Terrestrial scenes taken from planes or by drones
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
According to the unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism, semantic information of unmanned aerial vehicle image data is learned; effective hash codes are learned by means of the saliency capture mechanism, a distribution smoothing term, global information and local fine-grained information; and finally a given number of unmanned aerial vehicle image items are retrieved by similarity calculation. The method provided by the invention not only pays more attention to global information and captures salient features, but also improves retrieval precision.
Description
Technical Field
The invention relates to an unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism, and is particularly suitable for improving retrieval precision.
Background
With the rapid development of unmanned aerial vehicle technology, retrieval of images shot by unmanned aerial vehicles has attracted wide attention in the field of image processing. Compared with satellites, unmanned aerial vehicles generally provide real-time streaming media, so rapid decision making can be realized. In addition, an unmanned aerial vehicle can significantly reduce dependence on the weather environment and provides higher flexibility in handling various problems. As the number of unmanned aerial vehicles increases, the number of images they shoot also increases significantly; therefore, how to mine effective unmanned aerial vehicle image information becomes increasingly important. To mine useful information, many researchers pay great attention to unmanned aerial vehicle image data retrieval. Because unmanned aerial vehicle data retrieval can quickly return useful information, it has been applied to agriculture, the military and other fields. Unmanned aerial vehicle image retrieval is a branch of general image retrieval that focuses on image data shot by unmanned aerial vehicles.
With the explosive growth of unmanned aerial vehicle shooting data, efficient image data analysis technology is urgently needed to process unmanned aerial vehicle data. The unmanned aerial vehicle image retrieval task is to retrieve relevant unmanned aerial vehicle images from unmanned aerial vehicle image data. Because of the large data volume and the large information difference between data of different scales, it is difficult for users to quickly obtain useful information. How to solve the multi-scale problem of unmanned aerial vehicle image data is an important challenge of the unmanned aerial vehicle image retrieval task.
In recent years, many researchers have solved the problem of unmanned aerial vehicle image data retrieval with deep learning methods. The common practice is to encode all the unmanned aerial vehicle image data into features and then calculate the similarity of different images in a common representation space. Although existing unmanned aerial vehicle image retrieval methods have achieved a certain development, they still have several defects: 1) a large amount of storage space is required and the space-time complexity of retrieval is high; 2) existing hash methods pay too much attention to global information and ignore fine-grained salient key information.
Disclosure of Invention
The invention aims to overcome the above defects. By learning semantic information of unmanned aerial vehicle image data, effective hash codes are learned with a saliency capture mechanism, a distribution smoothing term, global information and local fine-grained information, and finally a given number of unmanned aerial vehicle image items are retrieved by similarity calculation. The invention provides an unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism, which further improves retrieval performance by fully utilizing the fine-grained key information of unmanned aerial vehicle images.
In order to achieve the above object, the technical solution of the present invention is:
An unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism comprises the following steps:
Step 1, dividing the photos of an unmanned aerial vehicle image library into a training data set and a test data set;
Step 2, information extraction: improving the pre-trained ResNet50 network, and performing information extraction training on the ResNet50 network with the pictures of the training data set;
the pictures of the training data set are trained and their features extracted with the ResNet50 network, which performs four stages of feature mapping on each picture. First, the feature map output of the first stage of ResNet50 is upsampled and then connected with the feature map output of the second stage of ResNet50 to form the local low-level feature F_low. After that, the feature map output of the third stage of ResNet50 is upsampled and then connected with the feature map output of the fourth stage of ResNet50 to form the local high-level feature F_high. Finally, the local low-level feature is processed with a 3×3 convolution and the local high-level feature with a 1×1 convolution so that the two features have the same size, and the two features are connected to form a connection feature. To avoid the loss of high-level feature semantics, a residual structure connects the average value of the local high-level feature with the connection feature to obtain the local joint feature F_j. In addition, a fine-grained transformation of the local joint feature reduces redundant information;
Step 3, saliency capture: after the local fine-grained feature F_l is generated, a saliency capture process is used to enhance the effectiveness of the feature; information interaction attention is captured first, followed by visual enhancement attention;
the capture mechanism of information interaction attention lets the global feature and the local fine-grained feature learn from and interact with each other, yielding the feature embedding vector F_ia of information interaction attention capture; the capture mechanism of visual enhancement attention enhances the visual representation of the extracted effective features, yielding the saliency feature F_va output by the saliency module;
Step 4, hash learning training: the saliency feature F_va output by the saliency module in step 3 is input into the hash learning module for training, namely a fully connected hash layer of k nodes that uses the tanh function as its activation function; in the training stage, a k-bit hash-like code is generated and learned with an objective function consisting of a similarity maintenance term, a distribution smoothing term and a quantization error; in the test stage, the k-bit hash-like code is quantized into a k-bit hash code with the sign function;
Step 5, training the saliency capture model: the network model is trained on the training data set by cycling through steps 2 to 4; the algorithm ends after 100 training iterations or when the final objective-function loss no longer decreases, yielding a trained overall network model for calculating the hash codes of the samples in the test data set;
Step 6, calculating the hash codes of the samples in the test data set with the trained overall network model, sorting the Hamming distances between the query sample and the hash code of each sample in the training data set from small to large, and calculating the top-n accuracy of the ranking list to obtain the mean average precision index MAP and the top-n retrieval results; the retrieval results are then output and the retrieval is complete, as sketched after this list.
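As a concrete illustration of step 6, the following is a minimal sketch, not the patented implementation, of Hamming-distance ranking and the MAP metric. It assumes hash codes are ±1 vectors of length k and that each image carries a single class label; all function names are chosen for illustration.

```python
import numpy as np

def hamming_distance(query_code: np.ndarray, db_codes: np.ndarray) -> np.ndarray:
    # For codes in {-1, +1}^k, Hamming distance = (k - <b_q, b_i>) / 2.
    k = query_code.shape[0]
    return 0.5 * (k - db_codes @ query_code)

def retrieve_top_n(query_code: np.ndarray, db_codes: np.ndarray, n: int = 10) -> np.ndarray:
    # Sort Hamming distances from small to large: nearest codes rank first.
    return np.argsort(hamming_distance(query_code, db_codes))[:n]

def mean_average_precision(query_codes, query_labels, db_codes, db_labels, top_n=None):
    # MAP over all queries; a database item is relevant if its label matches the query's.
    aps = []
    for q, ql in zip(query_codes, query_labels):
        order = np.argsort(hamming_distance(q, db_codes))
        if top_n is not None:
            order = order[:top_n]
        rel = (db_labels[order] == ql).astype(np.float32)
        if rel.sum() == 0:
            aps.append(0.0)
            continue
        precision_at_i = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append(float((precision_at_i * rel).sum() / rel.sum()))
    return float(np.mean(aps))
```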
In step 2, the pictures of the training data set are trained and their features extracted with the ResNet50 network, which performs four stages of feature mapping on each picture. A picture of the training data set is processed by the first stage of the ResNet50 network to obtain the first-stage projection P_1 and the first-stage network parameters W_1; the first-stage projection P_1 and network parameters W_1 are processed by the second stage of the ResNet50 network to obtain the second-stage projection P_2 and network parameters W_2; the second-stage projection P_2 and network parameters W_2 are processed by the third stage to obtain the third-stage projection P_3 and network parameters W_3; and the third-stage projection P_3 and network parameters W_3 are processed by the fourth stage to obtain the fourth-stage projection P_4 and network parameters W_4. The feature output by the ResNet50 network after the four stages in sequence is the global feature projection.

An unmanned aerial vehicle image is input, and global feature extraction and the feature extraction of different convolution layers are considered simultaneously. The feature map output of the first stage of ResNet50 is upsampled and then connected with the feature map output of the second stage of ResNet50 to form the local low-level feature F_low; the specific formula is:

$$F_{low} = \mathcal{C}\big(\mathrm{Up}(P_1),\, P_2\big)$$

where F_low is the local low-level feature, $\mathcal{C}(\cdot)$ denotes the splicing operation, $\mathrm{Up}(\cdot)$ denotes upsampling, P_1 and W_1 are the projection and network parameters of the first stage, and P_2 and W_2 are the projection and network parameters of the second stage.

After that, the feature map output of the third stage of ResNet50 is upsampled and then connected with the feature map output of the fourth stage of ResNet50 to form the local high-level feature F_high; the specific formula is:

$$F_{high} = \mathcal{C}\big(\mathrm{Up}(P_3),\, P_4\big)$$

where F_high is the local high-level feature, P_3 and W_3 are the projection and network parameters of the third stage, and P_4 and W_4 are the projection and network parameters of the fourth stage.

Then, the local low-level feature and the local high-level feature are processed with a 3×3 convolution and a 1×1 convolution respectively and spliced, so that they have the same size; a residual structure connects the average value of the local high-level feature with the spliced feature to obtain the local joint feature F_j; the specific formula is:

$$F_j = \psi\Big(\rho(F_{high}) \boxplus \mathcal{C}\big(\mathrm{Conv}_{3\times 3}(F_{low}),\, \mathrm{Conv}_{1\times 1}(F_{high})\big)\Big)$$

where F_j is the local joint feature, ρ is the mean calculation, $\boxplus$ is the sum operation, ψ is the parametric rectified linear unit function, $\mathrm{Conv}_{3\times 3}$ is the 3×3 convolution and $\mathrm{Conv}_{1\times 1}$ is the 1×1 convolution.

To reduce the redundant information of the local joint feature, a fine-grained transformation is applied to the local joint feature F_j to obtain the local fine-grained feature F_l; the specific formula is:

$$F_l = F_j \otimes \delta(F_j)$$

where F_l is the local fine-grained feature, $\otimes$ is term-by-term multiplication, and δ is the sigmoid function.

At this point, the information extraction is complete.
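To make the flow of step 2 concrete, here is a minimal PyTorch sketch of the information extraction module. The stage taps follow torchvision's ResNet50 layers; the fused channel widths, the resampling directions used to align resolutions, and the single-channel residual mean are assumptions not fixed by the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class InformationExtraction(nn.Module):
    """Sketch of the step-2 module: multi-scale fusion of ResNet50 stage outputs."""
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")  # pre-trained initial weights
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stage1, self.stage2 = backbone.layer1, backbone.layer2
        self.stage3, self.stage4 = backbone.layer3, backbone.layer4
        # ResNet50 stage channel counts: 256, 512, 1024, 2048.
        self.conv_low = nn.Conv2d(256 + 512, 512, kernel_size=3, padding=1)  # 3x3 on F_low
        self.conv_high = nn.Conv2d(1024 + 2048, 512, kernel_size=1)          # 1x1 on F_high
        self.act = nn.PReLU()  # "parametric rectified linear unit" from the text

    def forward(self, x):
        p1 = self.stage1(self.stem(x))   # stage projections P_1..P_4
        p2 = self.stage2(p1)
        p3 = self.stage3(p2)
        p4 = self.stage4(p3)
        # F_low: align stage-1 map to stage-2 resolution, then splice.
        f_low = torch.cat([F.interpolate(p1, size=p2.shape[-2:]), p2], dim=1)
        # F_high: align stage-4 map to stage-3 resolution, then splice.
        f_high = torch.cat([p3, F.interpolate(p4, size=p3.shape[-2:])], dim=1)
        low = self.conv_low(F.interpolate(f_low, size=f_high.shape[-2:]))
        high = self.conv_high(f_high)
        joint = torch.cat([low, high], dim=1)          # connection feature
        residual = high.mean(dim=1, keepdim=True)      # mean of high-level feature, broadcast
        f_j = self.act(joint + residual)               # residual structure -> F_j
        f_l = f_j * torch.sigmoid(f_j)                 # fine-grained gating -> F_l
        return p4, f_l                                 # global projection + local fine-grained feature
```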
In step 3:

Step 3.1, capture of information interaction attention: the global feature is projected onto the Query of the attention mechanism through a fully connected layer to obtain Q_ia, and the local fine-grained feature F_l is projected onto the Key and the Value to obtain K_ia and V_ia respectively. The correlation S_ia of the global feature and the local fine-grained feature is:

$$S_{ia} = \phi\left(\frac{Q_{ia} K_{ia}^{T}}{\sqrt{d}}\right)$$

where φ denotes the softmax function, $\sqrt{d}$ is the set scaling parameter, Q_ia is the Query in the attention mechanism, and $K_{ia}^{T}$ is the transposed Key in the attention mechanism.

To carry out information interaction, the similarity is calculated with multi-head attention, and the similarities of the different heads are spliced and fused; the specific process is:

$$T_{ia} = \mathcal{D}\big(\mathcal{C}(\mathrm{head}_1, \ldots, \mathrm{head}_L)\, W_{ia}\big), \qquad \mathrm{head}_l = S_l V_l^{ia}$$

where L is the number of attention heads, head_l is the output of the l-th head, W_ia is a learnable parameter matrix, $\mathcal{D}$ is the Dropout operation, $\mathcal{C}$ is the splicing operation, S_l is the similarity of the l-th head, and $V_l^{ia}$ is the Value projected from the local fine-grained feature for the l-th head.

To enhance the visual representation and further achieve effective feature embedding, the global feature F_g is combined with T_ia; the specific formula is:

$$F_{ia} = \mathrm{MLP}\big(\mathrm{LN}(F_g \boxplus T_{ia})\big)$$

where F_ia is the feature embedding vector of the information interaction attention module, LN is the layer normalization operation, and MLP is a multi-layer perceptron. At this point, the feature embedding vector F_ia of information interaction attention capture is obtained.
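A minimal sketch of the information interaction attention of step 3.1 follows, assuming both features have been flattened into token sequences of a common embedding width; the dimension, head count and dropout rate are illustrative values, not fixed by the patent.

```python
import torch
import torch.nn as nn

class InformationInteractionAttention(nn.Module):
    """Cross-attention: global features query the local fine-grained features."""
    def __init__(self, dim: int = 512, heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.to_q = nn.Linear(dim, dim)   # Q_ia from the global feature
        self.to_k = nn.Linear(dim, dim)   # K_ia from the local fine-grained feature
        self.to_v = nn.Linear(dim, dim)   # V_ia from the local fine-grained feature
        self.proj = nn.Linear(dim, dim)   # learnable matrix W_ia
        self.drop = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, f_global: torch.Tensor, f_local: torch.Tensor) -> torch.Tensor:
        B, N, D = f_global.shape          # (batch, tokens, dim)
        h = self.heads
        q = self.to_q(f_global).view(B, N, h, D // h).transpose(1, 2)
        k = self.to_k(f_local).view(B, -1, h, D // h).transpose(1, 2)
        v = self.to_v(f_local).view(B, -1, h, D // h).transpose(1, 2)
        s = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)   # per-head S_ia
        t = (s @ v).transpose(1, 2).reshape(B, N, D)                  # heads spliced
        t_ia = self.drop(self.proj(t))
        return self.mlp(self.norm(f_global + t_ia))                   # F_ia
```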
Step 3.2, capturing of visual enhancement attention: to enhance visual performance, features of the information interaction attention module are first embedded into vector F ia The Query, key and Value projected to the attention mechanism respectively obtain Q va 、 and Vva The method comprises the steps of carrying out a first treatment on the surface of the Similarity S of different token va The calculation is as follows:
wherein ,Sva For an embedding matrix of different features, phi is a softmax function,in order to set the ratio parameters of the components,
then calculating the similarity by utilizing a multi-head attention mechanism, wherein the specific process is as follows:
wherein m is the head number of the visual attention enhancing module,for output of the mth head, W va To enhance the learnable parameters of the visual attention module, < +.>Representing the splicing operation S m For the similarity of the mth head, +.>Embedding a vector F for features of the mth head ia The Value of the projection;
finally, generating the saliency feature F through layer normalization va The specific formula is as follows:
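Step 3.2 can be sketched with PyTorch's built-in multi-head attention. Treating visual enhancement attention as self-attention over F_ia with a residual, layer-normalized output is this sketch's reading of the formulas above; sizes are assumptions.

```python
import torch
import torch.nn as nn

class VisualEnhancementAttention(nn.Module):
    """Self-attention over F_ia followed by layer normalization -> F_va."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f_ia: torch.Tensor) -> torch.Tensor:   # (B, N, dim)
        t_va, _ = self.attn(f_ia, f_ia, f_ia)   # Q_va, K_va, V_va all projected from F_ia
        return self.norm(t_va + f_ia)           # saliency feature F_va
```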
The specific formula of the hash function in step 4 is:

$$b = \mathrm{sign}(h) = \mathrm{sign}\big(\tau(F_{va}, W_h)\big)$$

where F_va is the output of the saliency capture module, W_h is the weight of the approximation function, τ is the approximation function, h is the hash-like code, and b is the generated hash code;
the objective function consists of a similarity maintenance term, a distribution smoothing term and a quantization error;
the similarity maintenance term is calculated as follows:

$$\mathcal{L}_{s} = \sum_{i,j}\Big[\, s_{ij}\, H(b_i, b_j) + (1 - s_{ij})\max\big(\varepsilon - H(b_i, b_j),\, 0\big) \Big]$$

where ε is the edge parameter, max is the maximum function, H(·) calculates the Hamming distance, and s_ij is the paired label of the samples (1 for similar, 0 for dissimilar);
introducing a distribution smoothing term smooths the distribution center at its theoretical value; the calculation formula is:

$$\mathcal{L}_{ds} = \gamma \sum_{n} H\big(b_n,\, \theta(y_{n_1})\big)$$

where $\mathcal{L}_{ds}$ is the smoothing term, γ is a hyper-parameter, θ is the label smoothing function, $y_{n_1}$ denotes the $n_1$-th input label, b_n is the generated hash code, and y_n is the true label of the sample;
however, this objective function is difficult to optimize during training, so the Euclidean distance D is used in place of the Hamming distance, namely:

$$\mathcal{L}_{s} = \sum_{i,j}\Big[\, s_{ij}\, D(h_i, h_j) + (1 - s_{ij})\max\big(\varepsilon - D(h_i, h_j),\, 0\big) \Big]$$
however, the hash code introduces a quantization error, so a quantization error term is added, and the final objective function is:

$$\mathcal{L} = \mathcal{L}_{s} + \mathcal{L}_{ds} + \lambda\,\lVert h - b \rVert_2^2$$

where $\lVert h - b \rVert_2^2$ is the L2-norm result between the generated hash-like code and the real hash code, and λ is a hyper-parameter.
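The following sketch assembles the three loss terms as read from the formulas above. The pairwise labels s_ij are derived from class labels; the exact form of the distribution smoothing term is an assumption (here a squared-Euclidean pull toward label-smoothed targets), and the weights gamma and lam are illustrative.

```python
import torch

def objective(h, b, labels, smoothed_targets, epsilon, gamma=0.1, lam=1e-2):
    """Similarity maintenance (Euclidean surrogate for Hamming distance),
    distribution smoothing, and quantization error, summed into one loss."""
    # Pairwise similarity labels s_ij: 1 if same class, 0 otherwise.
    s = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    d = torch.cdist(h, h)                                        # Euclidean distances D
    sim_term = (s * d.pow(2) + (1 - s) * (epsilon - d).clamp(min=0).pow(2)).mean()
    # Distribution smoothing: pull hash-like codes toward label-smoothed targets.
    smooth_term = gamma * (h - smoothed_targets).pow(2).sum(dim=1).mean()
    # Quantization error between hash-like code h and binary code b = sign(h).
    quant_term = lam * (h - b).pow(2).sum(dim=1).mean()
    return sim_term + smooth_term + quant_term
```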
In step 5, when the overall network model is trained, the Adam algorithm is used for optimization with the learning rate set to 10^-4, and the input pictures are resized to 256×256. The batch size is set to 64, the hash code length k is set to 16, 24, 32, 48 or 64, and the edge parameter ε is set to 2k. The initial weights of the convolutional neural network ResNet50 are initialized with the pre-trained weight parameter matrix W and bias parameter matrix B. Steps 2 to 4 are repeated to iteratively train the network model, optimizing the weight parameter matrix W and the bias parameter matrix B so as to reduce the loss of the objective function L. The algorithm ends after 100 training iterations or when the final objective-function loss no longer decreases, yielding the trained overall network model for calculating the hash codes of the samples in the test data set.
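A hypothetical training loop wiring the pieces together under the settings of step 5 (Adam, learning rate 10^-4, batch size 64, ε = 2k, at most 100 iterations, stopping when the loss no longer decreases). It reuses the `objective` sketch above; `model` and the data-set interface are assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, k=32, epochs=100, device="cuda"):
    # `model` is assumed to bundle information extraction, saliency capture
    # and the hash layer; `train_set` yields (image, label, smoothed_target).
    model = model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loader = DataLoader(train_set, batch_size=64, shuffle=True)
    epsilon = 2 * k                           # edge parameter from the text
    best = float("inf")
    for epoch in range(epochs):
        total = 0.0
        for images, labels, smoothed in loader:
            images, labels, smoothed = images.to(device), labels.to(device), smoothed.to(device)
            h = model(images)                 # continuous hash-like codes
            b = torch.sign(h).detach()        # binary codes for the quantization term
            loss = objective(h, b, labels, smoothed, epsilon)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total >= best:                     # stop when the loss no longer decreases
            break
        best = total
    return model
```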
In step 6, the query sample is a picture from the test data set or an unmanned aerial vehicle picture input in a prediction scenario.
Compared with the prior art, the invention has the beneficial effects that:
1. The unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism designs a novel unmanned aerial vehicle image retrieval framework and uses an information extraction module and a saliency capture module to solve the problem of extracting effective information from unmanned aerial vehicle images during hash code learning. Secondly, a new objective function composed of a similarity maintenance term, a distribution smoothing term and a quantization error is designed, which maintains the similarity of the hash codes, smooths the distribution of the unmanned aerial vehicle image data set, and reduces the quantization error between the hash codes and the hash-like codes.
2. The unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism mainly comprises three implementation steps: extraction, learning and selection. Given an unmanned aerial vehicle image to be queried, its representation features are extracted first; then hash code learning is carried out using the fixed similarity relationships of similar unmanned aerial vehicle images; finally, the K most similar images are obtained by similarity calculation, effectively improving retrieval precision. As the comparison of the mean average precision indexes on the two data sets shows, the retrieval effect of this unmanned aerial vehicle image retrieval method is superior to that of existing methods.
3. According to the unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism, semantic information of unmanned aerial vehicle image data is learned, and effective hash codes are learned with the saliency capture mechanism, the distribution smoothing term, global information and local fine-grained information, improving retrieval precision; at the same time, the deep hashing method reduces the space-time complexity of retrieval and therefore the storage space required by the retrieval method.
Drawings
Fig. 1 is a schematic diagram of a network architecture of the present invention.
Fig. 2 is a search result diagram of the present invention.
FIG. 3 is a diagram of the visual effect of the saliency capture module of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings and detailed description.
Referring to fig. 1, an unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism comprises the following steps:
Step 1, dividing the photos of an unmanned aerial vehicle image library into a training data set and a test data set;
Step 2, information extraction: improving the pre-trained ResNet50 network, and performing information extraction training on the ResNet50 network with the pictures of the training data set;
the pictures of the training data set are trained and their features extracted with the ResNet50 network, which performs four stages of feature mapping on each picture. First, the feature map output of the first stage of ResNet50 is upsampled and then connected with the feature map output of the second stage of ResNet50 to form the local low-level feature F_low. After that, the feature map output of the third stage of ResNet50 is upsampled and then connected with the feature map output of the fourth stage of ResNet50 to form the local high-level feature F_high. Finally, the local low-level feature is processed with a 3×3 convolution and the local high-level feature with a 1×1 convolution so that the two features have the same size, and the two features are connected to form a connection feature. To avoid the loss of high-level feature semantics, a residual structure connects the average value of the local high-level feature with the connection feature to obtain the local joint feature F_j. In addition, a fine-grained transformation of the local joint feature reduces redundant information;
Step 3, saliency capture: after the local fine-grained feature F_l is generated, a saliency capture process is used to enhance the effectiveness of the feature; information interaction attention is captured first, followed by visual enhancement attention;
the capture mechanism of information interaction attention lets the global feature and the local fine-grained feature learn from and interact with each other, yielding the feature embedding vector F_ia of information interaction attention capture; the capture mechanism of visual enhancement attention enhances the visual representation of the extracted effective features, yielding the saliency feature F_va output by the saliency module;
Step 4, hash learning training: the saliency feature F_va output by the saliency module in step 3 is input into the hash learning module for training, namely a fully connected hash layer of k nodes that uses the tanh function as its activation function; in the training stage, a k-bit hash-like code is generated and learned with an objective function consisting of a similarity maintenance term, a distribution smoothing term and a quantization error; in the test stage, the k-bit hash-like code is quantized into a k-bit hash code with the sign function;
Step 5, training the saliency capture model: the network model is trained on the training data set by cycling through steps 2 to 4; the algorithm ends after 100 training iterations or when the final objective-function loss no longer decreases, yielding a trained overall network model for calculating the hash codes of the samples in the test data set;
Step 6, calculating the hash codes of the samples in the test data set with the trained overall network model, sorting the Hamming distances between the query sample and the hash code of each sample in the training data set from small to large, and calculating the top-n accuracy of the ranking list to obtain the mean average precision index MAP and the top-n retrieval results; the retrieval results are then output and the retrieval is complete.
In step 2, the pictures of the training data set are trained and their features extracted with the ResNet50 network, which performs four stages of feature mapping on each picture. A picture of the training data set is processed by the first stage of the ResNet50 network to obtain the first-stage projection P_1 and the first-stage network parameters W_1; the first-stage projection P_1 and network parameters W_1 are processed by the second stage of the ResNet50 network to obtain the second-stage projection P_2 and network parameters W_2; the second-stage projection P_2 and network parameters W_2 are processed by the third stage to obtain the third-stage projection P_3 and network parameters W_3; and the third-stage projection P_3 and network parameters W_3 are processed by the fourth stage to obtain the fourth-stage projection P_4 and network parameters W_4. The feature output by the ResNet50 network after the four stages in sequence is the global feature projection.

An unmanned aerial vehicle image is input, and global feature extraction and the feature extraction of different convolution layers are considered simultaneously. The feature map output of the first stage of ResNet50 is upsampled and then connected with the feature map output of the second stage of ResNet50 to form the local low-level feature F_low; the specific formula is:

$$F_{low} = \mathcal{C}\big(\mathrm{Up}(P_1),\, P_2\big)$$

where F_low is the local low-level feature, $\mathcal{C}(\cdot)$ denotes the splicing operation, $\mathrm{Up}(\cdot)$ denotes upsampling, P_1 and W_1 are the projection and network parameters of the first stage, and P_2 and W_2 are the projection and network parameters of the second stage.

After that, the feature map output of the third stage of ResNet50 is upsampled and then connected with the feature map output of the fourth stage of ResNet50 to form the local high-level feature F_high; the specific formula is:

$$F_{high} = \mathcal{C}\big(\mathrm{Up}(P_3),\, P_4\big)$$

where F_high is the local high-level feature, P_3 and W_3 are the projection and network parameters of the third stage, and P_4 and W_4 are the projection and network parameters of the fourth stage.

Then, the local low-level feature and the local high-level feature are processed with a 3×3 convolution and a 1×1 convolution respectively and spliced, so that they have the same size; a residual structure connects the average value of the local high-level feature with the spliced feature to obtain the local joint feature F_j; the specific formula is:

$$F_j = \psi\Big(\rho(F_{high}) \boxplus \mathcal{C}\big(\mathrm{Conv}_{3\times 3}(F_{low}),\, \mathrm{Conv}_{1\times 1}(F_{high})\big)\Big)$$

where F_j is the local joint feature, ρ is the mean calculation, $\boxplus$ is the sum operation, ψ is the parametric rectified linear unit function, $\mathrm{Conv}_{3\times 3}$ is the 3×3 convolution and $\mathrm{Conv}_{1\times 1}$ is the 1×1 convolution.

To reduce the redundant information of the local joint feature, a fine-grained transformation is applied to the local joint feature F_j to obtain the local fine-grained feature F_l; the specific formula is:

$$F_l = F_j \otimes \delta(F_j)$$

where F_l is the local fine-grained feature, $\otimes$ is term-by-term multiplication, and δ is the sigmoid function.

At this point, the information extraction is complete.
In step 3:

Step 3.1, capture of information interaction attention: the global feature is projected onto the Query of the attention mechanism through a fully connected layer to obtain Q_ia, and the local fine-grained feature F_l is projected onto the Key and the Value to obtain K_ia and V_ia respectively. The correlation S_ia of the global feature and the local fine-grained feature is:

$$S_{ia} = \phi\left(\frac{Q_{ia} K_{ia}^{T}}{\sqrt{d}}\right)$$

where φ denotes the softmax function, $\sqrt{d}$ is the set scaling parameter, Q_ia is the Query in the attention mechanism, and $K_{ia}^{T}$ is the transposed Key in the attention mechanism.

To carry out information interaction, the similarity is calculated with multi-head attention, and the similarities of the different heads are spliced and fused; the specific process is:

$$T_{ia} = \mathcal{D}\big(\mathcal{C}(\mathrm{head}_1, \ldots, \mathrm{head}_L)\, W_{ia}\big), \qquad \mathrm{head}_l = S_l V_l^{ia}$$

where L is the number of attention heads, head_l is the output of the l-th head, W_ia is a learnable parameter matrix, $\mathcal{D}$ is the Dropout operation, $\mathcal{C}$ is the splicing operation, S_l is the similarity of the l-th head, and $V_l^{ia}$ is the Value projected from the local fine-grained feature for the l-th head.

To enhance the visual representation and further achieve effective feature embedding, the global feature F_g is combined with T_ia; the specific formula is:

$$F_{ia} = \mathrm{MLP}\big(\mathrm{LN}(F_g \boxplus T_{ia})\big)$$

where F_ia is the feature embedding vector of the information interaction attention module, LN is the layer normalization operation, and MLP is a multi-layer perceptron. At this point, the feature embedding vector F_ia of information interaction attention capture is obtained.
Step 3.2, capture of visual enhancement attention: to enhance the visual representation, the feature embedding vector F_ia of the information interaction attention module is first projected onto the Query, Key and Value of the attention mechanism to obtain Q_va, K_va and V_va respectively. The similarity S_va of the different tokens is calculated as:

$$S_{va} = \phi\left(\frac{Q_{va} K_{va}^{T}}{\sqrt{d}}\right)$$

where S_va is the embedding matrix of the different features, φ is the softmax function, and $\sqrt{d}$ is the set scaling parameter.

The similarity is then calculated with a multi-head attention mechanism; the specific process is:

$$T_{va} = \mathcal{C}(\mathrm{head}_1, \ldots, \mathrm{head}_M)\, W_{va}, \qquad \mathrm{head}_m = S_m V_m^{va}$$

where M is the number of heads of the visual enhancement attention module, head_m is the output of the m-th head, W_va is the learnable parameter of the visual enhancement attention module, $\mathcal{C}$ is the splicing operation, S_m is the similarity of the m-th head, and $V_m^{va}$ is the Value projected from the feature embedding vector F_ia for the m-th head.

Finally, the saliency feature F_va is generated through layer normalization; the specific formula is:

$$F_{va} = \mathrm{LN}(T_{va} \boxplus F_{ia})$$
The specific formula of the hash function in step 4 is:

$$b = \mathrm{sign}(h) = \mathrm{sign}\big(\tau(F_{va}, W_h)\big)$$

where F_va is the output of the saliency capture module, W_h is the weight of the approximation function, τ is the approximation function, h is the hash-like code, and b is the generated hash code;
the objective function consists of a similarity maintenance term, a distribution smoothing term and a quantization error;
the similarity maintenance term is calculated as follows:

$$\mathcal{L}_{s} = \sum_{i,j}\Big[\, s_{ij}\, H(b_i, b_j) + (1 - s_{ij})\max\big(\varepsilon - H(b_i, b_j),\, 0\big) \Big]$$

where ε is the edge parameter, max is the maximum function, H(·) calculates the Hamming distance, and s_ij is the paired label of the samples (1 for similar, 0 for dissimilar);
introducing a distribution smoothing term smooths the distribution center at its theoretical value; the calculation formula is:

$$\mathcal{L}_{ds} = \gamma \sum_{n} H\big(b_n,\, \theta(y_{n_1})\big)$$

where $\mathcal{L}_{ds}$ is the smoothing term, γ is a hyper-parameter, θ is the label smoothing function, $y_{n_1}$ denotes the $n_1$-th input label, b_n is the generated hash code, and y_n is the true label of the sample;
however, this objective function is difficult to optimize during training, so the Euclidean distance D is used in place of the Hamming distance, namely:

$$\mathcal{L}_{s} = \sum_{i,j}\Big[\, s_{ij}\, D(h_i, h_j) + (1 - s_{ij})\max\big(\varepsilon - D(h_i, h_j),\, 0\big) \Big]$$
however, the hash code introduces a quantization error, so a quantization error term is added, and the final objective function is:

$$\mathcal{L} = \mathcal{L}_{s} + \mathcal{L}_{ds} + \lambda\,\lVert h - b \rVert_2^2$$

where $\lVert h - b \rVert_2^2$ is the L2-norm result between the generated hash-like code and the real hash code, and λ is a hyper-parameter.
In step 5, when the overall network model is trained, the Adam algorithm is used for optimization with the learning rate set to 10^-4, and the input pictures are resized to 256×256. The batch size is set to 64, the hash code length k is set to 16, 24, 32, 48 or 64, and the edge parameter ε is set to 2k. The initial weights of the convolutional neural network ResNet50 are initialized with the pre-trained weight parameter matrix W and bias parameter matrix B. Steps 2 to 4 are repeated to iteratively train the network model, optimizing the weight parameter matrix W and the bias parameter matrix B so as to reduce the loss of the objective function L. The algorithm ends after 100 training iterations or when the final objective-function loss no longer decreases, yielding the trained overall network model for calculating the hash codes of the samples in the test data set.
In step 6, the query sample is a picture from the test data set or an unmanned aerial vehicle picture input in a prediction scenario.
Example 1:
An unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism comprises the following steps:
The environment adopted in this embodiment is a GeForce RTX 3090 GPU, an Intel Xeon(R) Silver 4210R CPU @ 2.40GHz ×40, 62.6G RAM and a Linux operating system, with development in Python and the open-source library PyTorch.
Step 1, dividing the photos of an unmanned aerial vehicle image library into a training data set and a test data set: using the ERA and Drone-Action data sets, 80% of each data set is selected as the training data set I_train and the remaining 20% as the test data set I_test;
Step 2, information extraction, namely improving the pre-trained ResNet50 network, and performing information extraction training on the ResNet50 network by using pictures of a training data set;
training and extracting features of a picture of a training data set by utilizing a ResNet50 network, wherein the ResNet50 network performs four-stage feature mapping processing on the picture of the training data set, firstly upsamples a feature map output generated in a first stage of the ResNet50 and then is connected with a feature mapping output in a second stage of the ResNet50 to form local low-layer features F low The method comprises the steps of carrying out a first treatment on the surface of the Thereafter, the feature map output of the third stage of ResNet50 is up-sampled and then connected to the feature map output of the fourth stage of ResNet50 as a local high-level feature F high The method comprises the steps of carrying out a first treatment on the surface of the Finally, processing local low-level features with a 3×3 convolution and processing local high-level features with a 1×1 convolution is both featuresThe two features are connected to form a connecting feature by the same size; to avoid high-level feature semantic loss, a residual structure is used to connect the average value of the local high-level features and the connected features to obtain a local joint feature F j The method comprises the steps of carrying out a first treatment on the surface of the In addition, the fine granularity transformation of the local joint features can reduce redundant information;
Step 3, saliency capture, in generating local fine granularity characteristic F l Later, to enhance the effectiveness of the feature, a saliency capture process is used; firstly, capturing information interaction attention, and then capturing vision enhancement attention;
the capturing mechanism of the information interaction attention is to enable global features and local fine granularity features to mutually learn and interact to obtain feature embedded vectors F of the information interaction attention capture ia The method comprises the steps of carrying out a first treatment on the surface of the The capturing mechanism of visual enhancement attention is to enhance the visual representation of the extracted effective features, and the obtained saliency features F output by the saliency module va ;
Step 4, carrying out hash learning training, and outputting the saliency characteristic F obtained in the step 3 by the saliency module va Then, inputting the hash to a hash learning module for training, namely a fully connected hash layer of k nodes, wherein the hash uses a tanh function as an activation function; generating a k-bit hash code in a training stage, and learning by an objective function consisting of a similarity maintenance term, a distributed smoothing term and a quantization error; in the test stage, quantizing the k-bit hash codes into k-bit hash codes by using a symbol function;
training the significant capture model, namely training the network model by using a training data set to circulate the steps 2 to 4, ending the operation of the algorithm when 100 iterations of training or the final objective function loss is no longer reduced, and further obtaining the hash code of the sample in the training-completed integral network model calculation test data set;
And 6, calculating hash codes of samples in the test data set by using the trained integral network model, sequencing Hamming distances between the query samples and the hash codes of each sample in the training data set from large to small, calculating the top n accuracies of the ranking list, and obtaining an average accuracy index MAP and top n search results, wherein the search results are output at the moment, and the search is completed.
Example 2:
Example 2 is substantially the same as Example 1, except that:
In step 2, the pictures of the training data set are trained and their features extracted with the ResNet50 network, which performs four stages of feature mapping on each picture. A picture of the training data set is processed by the first stage of the ResNet50 network to obtain the first-stage projection P_1 and the first-stage network parameters W_1; the first-stage projection P_1 and network parameters W_1 are processed by the second stage of the ResNet50 network to obtain the second-stage projection P_2 and network parameters W_2; the second-stage projection P_2 and network parameters W_2 are processed by the third stage to obtain the third-stage projection P_3 and network parameters W_3; and the third-stage projection P_3 and network parameters W_3 are processed by the fourth stage to obtain the fourth-stage projection P_4 and network parameters W_4. The feature output by the ResNet50 network after the four stages in sequence is the global feature projection.

An unmanned aerial vehicle image is input, and global feature extraction and the feature extraction of different convolution layers are considered simultaneously. The feature map output of the first stage of ResNet50 is upsampled and then connected with the feature map output of the second stage of ResNet50 to form the local low-level feature F_low; the specific formula is:

$$F_{low} = \mathcal{C}\big(\mathrm{Up}(P_1),\, P_2\big)$$

where F_low is the local low-level feature, $\mathcal{C}(\cdot)$ denotes the splicing operation, $\mathrm{Up}(\cdot)$ denotes upsampling, P_1 and W_1 are the projection and network parameters of the first stage, and P_2 and W_2 are the projection and network parameters of the second stage.

After that, the feature map output of the third stage of ResNet50 is upsampled and then connected with the feature map output of the fourth stage of ResNet50 to form the local high-level feature F_high; the specific formula is:

$$F_{high} = \mathcal{C}\big(\mathrm{Up}(P_3),\, P_4\big)$$

where F_high is the local high-level feature, P_3 and W_3 are the projection and network parameters of the third stage, and P_4 and W_4 are the projection and network parameters of the fourth stage.

Then, the local low-level feature and the local high-level feature are processed with a 3×3 convolution and a 1×1 convolution respectively and spliced, so that they have the same size; a residual structure connects the average value of the local high-level feature with the spliced feature to obtain the local joint feature F_j; the specific formula is:

$$F_j = \psi\Big(\rho(F_{high}) \boxplus \mathcal{C}\big(\mathrm{Conv}_{3\times 3}(F_{low}),\, \mathrm{Conv}_{1\times 1}(F_{high})\big)\Big)$$

where F_j is the local joint feature, ρ is the mean calculation, $\boxplus$ is the sum operation, ψ is the parametric rectified linear unit function, $\mathrm{Conv}_{3\times 3}$ is the 3×3 convolution and $\mathrm{Conv}_{1\times 1}$ is the 1×1 convolution.

To reduce the redundant information of the local joint feature, a fine-grained transformation is applied to the local joint feature F_j to obtain the local fine-grained feature F_l; the specific formula is:

$$F_l = F_j \otimes \delta(F_j)$$

where F_l is the local fine-grained feature, $\otimes$ is term-by-term multiplication, and δ is the sigmoid function.

At this point, the information extraction is complete.
In step 3:

Step 3.1, capture of information interaction attention: the global feature is projected onto the Query of the attention mechanism through a fully connected layer to obtain Q_ia, and the local fine-grained feature F_l is projected onto the Key and the Value to obtain K_ia and V_ia respectively. The correlation S_ia of the global feature and the local fine-grained feature is:

$$S_{ia} = \phi\left(\frac{Q_{ia} K_{ia}^{T}}{\sqrt{d}}\right)$$

where φ denotes the softmax function, $\sqrt{d}$ is the set scaling parameter, Q_ia is the Query in the attention mechanism, and $K_{ia}^{T}$ is the transposed Key in the attention mechanism.

To carry out information interaction, the similarity is calculated with multi-head attention, and the similarities of the different heads are spliced and fused; the specific process is:

$$T_{ia} = \mathcal{D}\big(\mathcal{C}(\mathrm{head}_1, \ldots, \mathrm{head}_L)\, W_{ia}\big), \qquad \mathrm{head}_l = S_l V_l^{ia}$$

where L is the number of attention heads, head_l is the output of the l-th head, W_ia is a learnable parameter matrix, $\mathcal{D}$ is the Dropout operation, $\mathcal{C}$ is the splicing operation, S_l is the similarity of the l-th head, and $V_l^{ia}$ is the Value projected from the local fine-grained feature for the l-th head.

To enhance the visual representation and further achieve effective feature embedding, the global feature F_g is combined with T_ia; the specific formula is:

$$F_{ia} = \mathrm{MLP}\big(\mathrm{LN}(F_g \boxplus T_{ia})\big)$$

where F_ia is the feature embedding vector of the information interaction attention module, LN is the layer normalization operation, and MLP is a multi-layer perceptron. At this point, the feature embedding vector F_ia of information interaction attention capture is obtained.
Step 3.2, capture of visual enhancement attention: to enhance the visual representation, the feature embedding vector F_ia of the information interaction attention module is first projected onto the Query, Key and Value of the attention mechanism to obtain Q_va, K_va and V_va respectively. The similarity S_va of the different tokens is calculated as:

$$S_{va} = \phi\left(\frac{Q_{va} K_{va}^{T}}{\sqrt{d}}\right)$$

where S_va is the embedding matrix of the different features, φ is the softmax function, and $\sqrt{d}$ is the set scaling parameter.

The similarity is then calculated with a multi-head attention mechanism; the specific process is:

$$T_{va} = \mathcal{C}(\mathrm{head}_1, \ldots, \mathrm{head}_M)\, W_{va}, \qquad \mathrm{head}_m = S_m V_m^{va}$$

where M is the number of heads of the visual enhancement attention module, head_m is the output of the m-th head, W_va is the learnable parameter of the visual enhancement attention module, $\mathcal{C}$ is the splicing operation, S_m is the similarity of the m-th head, and $V_m^{va}$ is the Value projected from the feature embedding vector F_ia for the m-th head.

Finally, the saliency feature F_va is generated through layer normalization; the specific formula is:

$$F_{va} = \mathrm{LN}(T_{va} \boxplus F_{ia})$$
In step 4, the specific formula of the hash function is:

$$b = \mathrm{sign}(h) = \mathrm{sign}\big(\tau(F_{va}, W_h)\big)$$

where F_va is the output of the saliency capture module, W_h is the weight of the approximation function, τ is the approximation function, h is the hash-like code, and b is the generated hash code;
the objective function consists of a similarity maintenance term, a distribution smoothing term and a quantization error;
the similarity maintenance term is calculated as follows:

$$\mathcal{L}_{s} = \sum_{i,j}\Big[\, s_{ij}\, H(b_i, b_j) + (1 - s_{ij})\max\big(\varepsilon - H(b_i, b_j),\, 0\big) \Big]$$

where ε is the edge parameter, max is the maximum function, H(·) calculates the Hamming distance, and s_ij is the paired label of the samples (1 for similar, 0 for dissimilar);
introducing a distribution smoothing term smooths the distribution center at its theoretical value; the calculation formula is:

$$\mathcal{L}_{ds} = \gamma \sum_{n} H\big(b_n,\, \theta(y_{n_1})\big)$$

where $\mathcal{L}_{ds}$ is the smoothing term, γ is a hyper-parameter, θ is the label smoothing function, $y_{n_1}$ denotes the $n_1$-th input label, b_n is the generated hash code, and y_n is the true label of the sample;
however, this objective function is difficult to optimize during training, so the Euclidean distance D is used in place of the Hamming distance, namely:

$$\mathcal{L}_{s} = \sum_{i,j}\Big[\, s_{ij}\, D(h_i, h_j) + (1 - s_{ij})\max\big(\varepsilon - D(h_i, h_j),\, 0\big) \Big]$$
however, the hash code introduces a quantization error, so a quantization error term is added, and the final objective function is:

$$\mathcal{L} = \mathcal{L}_{s} + \mathcal{L}_{ds} + \lambda\,\lVert h - b \rVert_2^2$$

where $\lVert h - b \rVert_2^2$ is the L2-norm result between the generated hash-like code and the real hash code, and λ is a hyper-parameter.
In step 5, when the overall network model is trained, the Adam algorithm is used for optimization with the learning rate set to 10^-4, and the input pictures are resized to 256×256. The batch size is set to 64, the hash code length k is set to 16, 24, 32, 48 or 64, and the edge parameter ε is set to 2k. The initial weights of the convolutional neural network ResNet50 are initialized with the pre-trained weight parameter matrix W and bias parameter matrix B. Steps 2 to 4 are repeated to iteratively train the network model, optimizing the weight parameter matrix W and the bias parameter matrix B so as to reduce the loss of the objective function L. The algorithm ends after 100 training iterations or when the final objective-function loss no longer decreases, yielding the trained overall network model for calculating the hash codes of the samples in the test data set.
In step 6, the query sample is a picture from the test data set or an unmanned aerial vehicle picture input in a prediction scenario.
In order to evaluate the effectiveness of the method, its retrieval performance is compared with several state-of-the-art methods, including DHN, DCH, DFH, DPH, DSHSD, GreedyHash, DSDH, DTSH, LCDSH and QSMIH. The experiments adopt 16, 24, 32, 48 and 64-bit hash codes on the Drone-Action data set and the ERA data set. DHN uses a Bayesian framework to carry out deep hash learning in a supervised manner; the DCH, DFH, DPH, DSHSD, GreedyHash, DSDH, DTSH, LCDSH and QSMIH methods are performed in plain text.
TABLE 1
Table 1 shows the comparison results of the unmanned aerial vehicle image retrieval task on the ERA data set between this method and other methods, where mAP is the mean average precision index.
TABLE 2
Table 2 shows the comparison results of the unmanned aerial vehicle image retrieval task on the Drone-Action data set between this method and other methods, where mAP is the mean average precision index.
As can be seen from the comparison of the mean average precision indexes on the two data sets, the retrieval effect of this unmanned aerial vehicle image retrieval method is superior to that of existing methods.
Claims (6)
1. An unmanned aerial vehicle image hash retrieval method based on a saliency capture mechanism is characterized by comprising the following steps of:
The method comprises the following steps:
step 1, dividing photos of an unmanned aerial vehicle image library into a training data set and a testing data set;
step 2, information extraction, namely improving the pre-trained ResNet50 network, and performing information extraction training on the ResNet50 network by using pictures of a training data set;
training and extracting features of a picture of a training data set by utilizing a ResNet50 network, wherein the ResNet50 network performs four-stage feature mapping processing on the picture of the training data set, firstly upsamples a feature map output generated in a first stage of the ResNet50 and then is connected with a feature mapping output in a second stage of the ResNet50 to form local low-layer features F low The method comprises the steps of carrying out a first treatment on the surface of the Thereafter, the feature map output of the third stage of ResNet50 is up-sampled and then connected to the feature map output of the fourth stage of ResNet50 as a local high-level feature F high The method comprises the steps of carrying out a first treatment on the surface of the Finally, the local low-level features are processed by using the convolution of 3 multiplied by 3, and the local high-level features are processed by using the convolution of 1 multiplied by 1, so that the two features have the same size, and the two features are connected to form a connection feature; to avoid high-level feature semantic loss, a residual structure is used to connect the average value of the local high-level features and the connected features to obtain a local joint feature F j The method comprises the steps of carrying out a first treatment on the surface of the In addition, the fine granularity transformation of the local joint features can reduce redundant information;
step 3, saliency capture, in generating local fine granularity characteristic F l Later, to enhance the effectiveness of the feature, a saliency capture process is used; firstly, capturing information interaction attention, and then capturing vision enhancement attention;
the capturing mechanism of the information interaction attention is to enable global features and local fine granularity features to mutually learn and interact to obtain feature embedded vectors F of the information interaction attention capture ia The method comprises the steps of carrying out a first treatment on the surface of the The capturing mechanism of visual enhancement attention is to enhance the visual representation of the extracted effective features, and the obtained saliency features F output by the saliency module va ;
step 4, hash learning training: the saliency feature $F_{va}$ obtained in step 3 is input to a hash learning module for training, namely a fully connected hash layer with k nodes that uses the tanh function as its activation function; in the training stage a k-dimensional hash-like code is generated and learned with an objective function consisting of a similarity maintenance term, a distribution smoothing term and a quantization error term; in the test stage the hash-like code is quantized into a k-bit hash code with the sign function;
step 5, training the saliency capture model: the network model is trained by cycling through steps 2 to 4 over the training data set; the algorithm ends after 100 training iterations or when the final objective function loss no longer decreases, after which the trained overall network model is used to compute the hash codes of the samples in the test data set;
step 6, computing the hash codes of the samples in the test data set with the trained overall network model; the Hamming distances between the query sample and the hash code of each sample in the training data set are sorted from small to large, the top-n precision of the ranking list is calculated, and the mean average precision (mAP) and the top-n retrieval results are obtained; the retrieval results are then output and the retrieval is completed (a ranking sketch follows this claim).
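The ranking in step 6 of claim 1 can be sketched as follows; the ±1 codes produced by the sign function and the function name are illustrative assumptions.

```python
import numpy as np

def retrieve_top_n(query_code, train_codes, n=10):
    """Sketch: rank training samples by Hamming distance to the query code."""
    k = train_codes.shape[1]
    hamming = (k - train_codes @ query_code) / 2   # small distance = similar
    order = np.argsort(hamming)                    # ascending order
    return order[:n], hamming[order[:n]]
```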
2. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism according to claim 1, characterized in that:
in step 2, features are extracted from the pictures of the training data set with the ResNet50 network, which applies four stages of feature mapping; a training picture passes through the first stage of the ResNet50 network to obtain the first-stage projection $P_1$ and the first-stage network parameters $W_1$; $P_1$ and $W_1$ are processed by the second stage of the ResNet50 network to obtain the second-stage projection $P_2$ and network parameters $W_2$; $P_2$ and $W_2$ are processed by the third stage of the ResNet50 network to obtain the third-stage projection $P_3$ and network parameters $W_3$; $P_3$ and $W_3$ are processed by the fourth stage of the ResNet50 network to obtain the fourth-stage projection $P_4$ and network parameters $W_4$; the feature output by the ResNet50 network after the four stages in sequence is the global feature projection $F_g$;
an unmanned aerial vehicle image is input, and global feature extraction and the feature extraction of different convolution layers are considered simultaneously; the feature map output of the first stage of ResNet50 is upsampled and then connected with the feature map output of the second stage of ResNet50 to form the local low-level feature $F_{low}$, with the specific formula:

$$F_{low} = \mathcal{C}\big(\mathrm{Up}(P_1(W_1)),\ P_2(W_2)\big)$$

wherein $F_{low}$ is the local low-level feature, $\mathcal{C}(\cdot)$ represents the splicing operation, $\mathrm{Up}(\cdot)$ represents upsampling, $P_1$ and $W_1$ represent the projection and network parameters of the first stage, and $P_2$ and $W_2$ represent the projection and network parameters of the second stage;
thereafter, the feature map output of the third stage of ResNet50 is upsampled and then connected with the feature map output of the fourth stage of ResNet50 to form the local high-level feature $F_{high}$, with the specific formula:

$$F_{high} = \mathcal{C}\big(\mathrm{Up}(P_3(W_3)),\ P_4(W_4)\big)$$

wherein $F_{high}$ is the local high-level feature, $\mathcal{C}(\cdot)$ represents the splicing operation, $P_3$ and $W_3$ represent the projection and network parameters of the third stage, and $P_4$ and $W_4$ represent the projection and network parameters of the fourth stage;
then, the local low-level feature and the local high-level feature are processed with a 3×3 convolution and a 1×1 convolution respectively and spliced so that they have the same size; a residual structure connects the mean of the local high-level feature with the spliced feature to obtain the local joint feature $F_j$, with the specific formula:

$$F_j = \rho(F_{high}) \oplus \psi\big(\mathcal{C}(\mathrm{Conv}_{3\times3}(F_{low}),\ \mathrm{Conv}_{1\times1}(F_{high}))\big)$$

wherein $F_j$ is the local joint feature, $\rho$ is the mean calculation, $\oplus$ is the sum operation, $\psi$ is the parametric rectified linear unit function, $\mathrm{Conv}_{3\times3}$ is the 3×3 convolution and $\mathrm{Conv}_{1\times1}$ is the 1×1 convolution;
to reduce the redundant information of the local joint feature, a fine-grained transformation is applied to $F_j$ to obtain the local fine-grained feature $F_l$, with the specific formula:

$$F_l = F_j \odot \delta(F_j)$$

wherein $F_l$ is the local fine-grained feature, $\odot$ is element-wise multiplication and $\delta$ is the sigmoid function;
at this time, the information extraction is completed (a fusion sketch follows this claim).
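A minimal PyTorch sketch of the fusion described in claim 2, assuming the four ResNet50 stage maps are available (for torchvision's ResNet50, the outputs of layer1 to layer4); the output width, interpolation mode and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalFusion(nn.Module):
    """Sketch: fuse ResNet50 stage maps into a local fine-grained feature."""
    def __init__(self, c_out=256):
        super().__init__()
        # ResNet50 stage channels: 256, 512, 1024, 2048
        self.conv3 = nn.Conv2d(256 + 512, c_out, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(1024 + 2048, c_out, kernel_size=1)
        self.prelu = nn.PReLU()

    def forward(self, s1, s2, s3, s4):
        # F_low: resample the stage-1 map to the stage-2 size, then splice
        f_low = torch.cat([F.interpolate(s1, size=s2.shape[-2:]), s2], dim=1)
        # F_high: resample the stage-3 map to the stage-4 size, then splice
        f_high = torch.cat([F.interpolate(s3, size=s4.shape[-2:]), s4], dim=1)
        # 3x3 convolution on the low branch, 1x1 on the high branch, same size
        low = self.conv3(F.interpolate(f_low, size=s4.shape[-2:]))
        high = self.conv1(f_high)
        # residual: add the mean of the high-level branch to the spliced feature
        f_j = self.prelu(torch.cat([low, high], dim=1)) + high.mean(dim=1, keepdim=True)
        # fine-grained transformation: sigmoid gating suppresses redundancy
        return f_j * torch.sigmoid(f_j)
```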
3. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism according to claim 2, characterized in that:
in step 3,
step 3.1, capture of information interaction attention: the global feature is projected onto the Query of the attention mechanism through a full connection layer to obtain $Q_{ia}$, and the local fine-grained feature $F_l$ is projected onto the Key and Value through different full connection layers to obtain $K_{ia}$ and $V_{ia}$ respectively; the correlation $S_{ia}$ of the global feature and the local fine-grained feature is:

$$S_{ia} = \phi\!\left(\frac{Q_{ia} K_{ia}^{\top}}{\sqrt{d_k}}\right)$$

wherein $\phi$ represents the softmax function, $\sqrt{d_k}$ represents the set scaling parameter, $Q_{ia}$ is the Query in the attention mechanism, and $K_{ia}^{\top}$ is the transposed Key in the attention mechanism;
in order to perform information interaction, the similarity is calculated with multi-head attention, and the similarities of the different heads are spliced and fused; the specific process is:

$$T_{ia} = \mathcal{D}\big(\mathcal{C}(O_1, \dots, O_L)\, W_{ia}\big), \qquad O_l = S_l V_{ia}^{l}$$

wherein $L$ is the number of attention heads, $O_l$ represents the output of the $l$-th head, $W_{ia}$ is a learnable parameter matrix, $\mathcal{D}$ is the Dropout operation, $\mathcal{C}(\cdot)$ represents the splicing operation, $S_l$ is the similarity of the $l$-th head, and $V_{ia}^{l}$ is the Value projected from the local fine-grained feature for the $l$-th head;
to enhance the visual characterization and further achieve effective feature embedding, the global feature is combined with $T_{ia}$, with the specific formula:

$$F_{ia} = \mathcal{M}\big(\mathcal{L}(F_g \oplus T_{ia})\big) \oplus F_g \oplus T_{ia}$$

wherein $F_{ia}$ is the feature embedding vector of the information interaction attention module, $\mathcal{L}$ represents the layer normalization operation and $\mathcal{M}$ is the multi-layer perceptron; at this time, the feature embedding vector $F_{ia}$ of information interaction attention capture is obtained;
step 3.2, capture of visual enhancement attention: to enhance the visual representation, the feature embedding vector $F_{ia}$ of the information interaction attention module is projected onto the Query, Key and Value of the attention mechanism to obtain $Q_{va}$, $K_{va}$ and $V_{va}$ respectively; the similarity $S_{va}$ of the different tokens is calculated as:

$$S_{va} = \phi\!\left(\frac{Q_{va} K_{va}^{\top}}{\sqrt{d_k}}\right)$$

wherein $S_{va}$ is the embedding matrix of the different features, $\phi$ is the softmax function, and $\sqrt{d_k}$ is the set scaling parameter;
the similarity is then calculated with a multi-head attention mechanism; the specific process is:

$$T_{va} = \mathcal{C}(O_1, \dots, O_M)\, W_{va}, \qquad O_m = S_m V_{va}^{m}$$

wherein $M$ is the number of heads of the visual enhancement attention module, $O_m$ is the output of the $m$-th head, $W_{va}$ is the learnable parameter matrix of the visual enhancement attention module, $\mathcal{C}(\cdot)$ represents the splicing operation, $S_m$ is the similarity of the $m$-th head, and $V_{va}^{m}$ is the Value projected from the feature embedding vector $F_{ia}$ for the $m$-th head;
finally, the saliency feature $F_{va}$ is generated through layer normalization, with the specific formula:

$$F_{va} = \mathcal{L}(F_{ia} \oplus T_{va})$$

(an attention sketch follows this claim).
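A compact PyTorch sketch of the two attention captures in claim 3, assuming features flattened to (batch, sequence, channel) shape; the head count, dimensions and residual placement are assumptions consistent with the formulas above, not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class SaliencyCapture(nn.Module):
    """Sketch: information interaction attention then visual enhancement attention."""
    def __init__(self, dim=256, heads=8, dropout=0.1):
        super().__init__()
        # cross attention: the global feature queries the local fine-grained feature
        self.interact = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # self attention over the interaction embedding
        self.enhance = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, f_g, f_l):
        # information interaction: Q from the global feature, K/V from F_l
        t_ia, _ = self.interact(query=f_g, key=f_l, value=f_l)
        x = f_g + t_ia                       # combine the global feature with T_ia
        f_ia = x + self.mlp(self.norm1(x))   # MLP with layer normalization
        # visual enhancement: Q, K and V all projected from F_ia
        t_va, _ = self.enhance(f_ia, f_ia, f_ia)
        return self.norm2(f_ia + t_va)       # saliency feature F_va
```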
4. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism according to claim 3, characterized in that:
in step 4, the specific formula of the hash function is:

$$b = \mathrm{sign}(h) = \mathrm{sign}\big(\tau(F_{va}, W_h)\big)$$

wherein $F_{va}$ is the output of the saliency capture module, $W_h$ is the weight of the approximation function, $\tau$ is the approximation function, $h$ is the hash-like code, and $b$ is the generated hash code;
the objective function consists of a similarity maintenance term, a distribution smoothing term and a quantization error term;
the similarity maintenance term is calculated as:

$$L_s = \sum_{i,j} s_{ij}\, H(b_i, b_j) + (1 - s_{ij}) \max\big(\varepsilon - H(b_i, b_j),\ 0\big)$$

wherein $\varepsilon$ is the edge parameter, $\max$ is the maximum function, $H(\cdot,\cdot)$ calculates the Hamming distance, and $s_{ij}$ is the pairwise label of the samples (1 for similar, 0 for dissimilar);
introducing a distribution smoothing term smooths the distribution center toward its theoretical value; it is calculated as:

$$L_d = \sum_{n=1}^{N} \big\| b_n - \mathcal{S}_\gamma(y_n) \big\|_2^2$$

wherein $L_d$ is the smoothing term, $\gamma$ is a hyperparameter, $\mathcal{S}_\gamma$ is the label smoothing function, $n$ indexes the input labels, $b_n$ is the generated hash code, and $y_n$ is the true label of the sample;
however, this objective function is difficult to optimize during training, so the Euclidean distance $D$ is used in place of the Hamming distance, namely:

$$L_s = \sum_{i,j} s_{ij}\, D(h_i, h_j) + (1 - s_{ij}) \max\big(\varepsilon - D(h_i, h_j),\ 0\big)$$
moreover, since the hash codes introduce quantization errors, a quantization error term is added, giving the final objective function:

$$L = L_s + L_d + \lambda \sum_{n=1}^{N} \big\| h_n - b_n \big\|_2^2$$

wherein $\lambda$ weights the quantization error term (an objective-function sketch follows this claim).
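A hedged PyTorch sketch of the objective in claim 4; the pairwise margin form follows the similarity term above, while the smoothing variant and the weight `lam` are labeled assumptions rather than the patent's exact formulas.

```python
import torch

def hashing_loss(h, labels, eps, gamma=0.1, lam=0.01):
    """Sketch: similarity (Euclidean relaxation) + smoothing + quantization terms.

    h: (N, k) tanh outputs in (-1, 1); labels: (N,) integer class ids.
    """
    s = (labels[:, None] == labels[None, :]).float()          # pairwise labels s_ij
    d = torch.cdist(h, h)                                     # Euclidean distances
    sim_term = (s * d + (1 - s) * torch.clamp(eps - d, min=0)).mean()
    b = torch.sign(h.detach())                                # hash codes b_n
    # assumed smoothing variant: pull codes toward a softened (1 - gamma) target
    smooth_term = (h - (1 - gamma) * b).pow(2).sum(1).mean()
    quant_term = (h - b).pow(2).sum(1).mean()                 # quantization error
    return sim_term + smooth_term + lam * quant_term
```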
5. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism according to claim 4, characterized in that:
in step 5, the overall network model is optimized with the Adam algorithm; the learning rate is set to $10^{-4}$ and the input pictures are resized to 256×256; the batch size is set to 64, the hash code length k is set to 16, 24, 32, 48 and 64, and the edge parameter $\varepsilon$ is set to 2k; the initial weights of the convolutional neural network ResNet50 are initialized with a pre-trained weight parameter matrix W and bias parameter matrix B; steps 2 to 4 are repeated to iteratively train the network model, optimizing the weight parameter matrix W and the bias parameter matrix B to reduce the loss of the objective function L; the algorithm ends after 100 training iterations or when the final objective function loss no longer decreases, after which the trained overall network model computes the hash codes of the samples in the test data set (a configuration sketch follows this claim).
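A configuration sketch matching the hyperparameters in claim 5; for brevity, a plain tanh hash head stands in for the full saliency-capture network of claims 2 and 3.

```python
import torch.nn as nn
from torch.optim import Adam
from torchvision import models, transforms

k = 32                                  # hash length: 16, 24, 32, 48 or 64
eps = 2 * k                             # edge parameter epsilon = 2k
transform = transforms.Compose([        # inputs resized to 256x256
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])
# pre-trained weight matrix W and bias matrix B via ImageNet initialization
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Sequential(nn.Linear(2048, k), nn.Tanh())   # k-node tanh hash layer
optimizer = Adam(backbone.parameters(), lr=1e-4)
# training cycles steps 2 to 4 with batch size 64 for up to 100 iterations,
# stopping early once the objective loss no longer decreases.
```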
6. The unmanned aerial vehicle image hash retrieval method based on the saliency capture mechanism according to claim 5, characterized in that:
in step 6, the query sample is a picture from the test data set or an unmanned aerial vehicle picture input in a prediction scenario.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310007898.4A CN116089646A (en) | 2023-01-04 | 2023-01-04 | Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116089646A true CN116089646A (en) | 2023-05-09 |
Family
ID=86205785
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310007898.4A Pending CN116089646A (en) | 2023-01-04 | 2023-01-04 | Unmanned aerial vehicle image hash retrieval method based on saliency capture mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116089646A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116524282A (en) * | 2023-06-26 | 2023-08-01 | 贵州大学 | Discrete similarity matching classification method based on feature vectors |
CN116524282B (en) * | 2023-06-26 | 2023-09-05 | 贵州大学 | Discrete similarity matching classification method based on feature vectors |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||