CN116204675A - Global relation attention-guided cross-view geolocation method - Google Patents

Global relation attention-guided cross-view geolocation method

Info

Publication number
CN116204675A
Authority
CN
China
Prior art keywords
global
feature
image
attention
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310046541.7A
Other languages
Chinese (zh)
Inventor
孙静
闫睿
张冰
王法胜
孙福明
朱兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Minzu University
Original Assignee
Dalian Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Minzu University filed Critical Dalian Minzu University
Priority to CN202310046541.7A priority Critical patent/CN116204675A/en
Publication of CN116204675A publication Critical patent/CN116204675A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/587 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using geographical or spatial information, e.g. location
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 - Querying
    • G06F16/532 - Query formulation, e.g. graphical querying
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 - Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778 - Active pattern-learning, e.g. online learning of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)

Abstract

A global relation attention-guided cross-view geolocation method belongs to the field of computer technology. A deep residual network is used as the backbone, and a global relation attention module captures more robust global structural information of the image for matching. A dual-branch network comprising a global branch and a local branch is designed to capture, respectively, deep features rich in semantic information and local features carrying multi-scale context information. In the local branch, dilated convolution enlarges the receptive field of the feature map, and a square-ring partition strategy splits the feature map at four scales. The feature map of each branch is converted into column-vector descriptors, and a classifier is used to obtain the predicted category of each column vector. Finally, a cross-entropy loss function measures the difference between the predicted category and the true category of the image, improving geolocation accuracy.

Description

Global relation attention-guided cross-view geolocation method
Technical Field
The invention belongs to the technical field of computers and specifically relates to a global relation attention-guided cross-view geolocation method.
Background
Cross-view geolocation can be viewed as a content-based image retrieval task [1][2]: a query image from one platform is matched against images in databases from other platforms to find images of the same geographic location. Previous studies focused mainly on matching between ground-view and satellite or aerial-view images. In recent years, with the wide application of unmanned aerial vehicles (UAVs) [3], drone-view images have been added to cross-view geolocation, and geolocation based on drone views and satellite images has become a current research hotspot.
With convolutional neural networks (CNNs) widely applied in vision fields such as image classification [4][5], object detection [6][7], semantic segmentation [8][9], and action recognition [10][11], some researchers have applied CNNs to the cross-view geolocation task [12], achieving significant progress. However, most cross-view geolocation methods mainly consider the high-level semantic information of the target image and neglect that spatial structure information can effectively improve geolocation accuracy. Zheng et al. [13] treated geolocation as a classification task and measured similarity on image semantic features. However, this method ignores context information from the area surrounding the target, so the extracted features are not sufficiently comprehensive. Wang et al. [14] partitioned high-level image features with a square-ring partition strategy and measured similarity on each part, using context information to improve geolocation accuracy. However, that method directly divides the feature map at four scales and ignores the global structure information of the image, so similar images can be falsely retrieved as correct results. Clearly, fully mining the structural information of the geographic target image helps to improve cross-view geolocation performance.
Disclosure of Invention
To address the fact that most algorithms fail to fully consider the influence of image structure information on matching accuracy in cross-view geolocation, a global relation attention-guided cross-view geolocation method is provided.
First, a deep residual network [15] is adopted as the backbone, and a global relation attention module [16] captures more robust global structural information of the image for matching. Second, a dual-branch network comprising a global branch and a local branch is designed to capture, respectively, deep features rich in semantic information and local features with multi-scale context information. In the local branch, dilated convolution [17] enlarges the receptive field of the feature map, and a square-ring partition strategy [14] splits the feature map at four scales. The feature map of each branch is converted into column-vector descriptors, and a classifier is then used to obtain the predicted category of each column vector. Finally, a cross-entropy loss function [18][19] measures the difference between the predicted category and the true category of the image, improving the accuracy of network training.
Advantages:
1. The global structure information of the image is fully mined by using the global relation attention module to learn the relationships among image feature nodes, so that the network focuses on salient regions and extracts more robust features for image feature matching.
2. A dual-branch structure comprising a global branch and a local branch is designed for cross-view geolocation. The global branch extracts deep features with a deep residual network to obtain a feature map rich in semantic information; the local branch uses dilated convolution to capture richer multi-scale context information.
3. Experimental results on the University-1652, CVUSA, and CVACT datasets show that the method of this technical scheme outperforms other advanced models in geolocation, demonstrating its effectiveness.
Drawings
Fig. 1 is a diagram of the overall network framework of the method in this technical scheme.
Fig. 2 is an enlarged view of the middle-left portion of Fig. 1.
Fig. 3 is an enlarged view of the middle-right portion of Fig. 1.
Fig. 4 is a structure diagram of the global relation attention.
Fig. 5 is a structure diagram of the global spatial relation attention.
Fig. 6 is a structure diagram of the global channel relation attention.
Fig. 7 is a schematic diagram of standard convolution and dilated convolution.
Fig. 8 shows retrieval results for the drone-view target localization task.
Fig. 9 shows retrieval results for the drone navigation task.
Fig. 10 shows retrieval results on the CVUSA dataset.
Detailed Description
The related work on cross-view geolocation is presented first; the method and network structure adopted in this technical scheme are then described in detail; finally, the experimental results and ablation experiments are analyzed and summarized.
1. Related work
Early cross-view geolocation studies were based primarily on ground-view and aerial-view images. Workman et al. [20] extracted image features with two publicly available pre-trained models and showed that deep features can distinguish images of different geographic locations. However, that method only extracts image features at a single scale and cannot effectively exploit multi-scale information, so the matching features extracted by the network are not rich enough. On this basis, Workman et al. [21] constructed the CVUSA (Cross-View USA) dataset, fused aerial-image features at multiple scales, and improved cross-view localization results. Lin et al. [22] built 78,000 pairs of street-view and 45° aerial images from public data and then employed a deep Siamese network to extract features for cross-view localization. Vo et al. [23] evaluated different deep learning methods and trained the network with a distance-based logistic (DBL) loss layer and rotational invariance, improving localization accuracy. Considering that image semantic information is less affected by viewpoint changes, Tian et al. [24] performed cross-view matching by extracting buildings from images to obtain the final geolocation result. Altwaijry et al. [25] focused on the matching task for aerial image pairs, learning discriminative representations from image pairs with a data-driven approach and solving the matching problem for ultra-wide-baseline images. Furthermore, Zhai et al. [26] first extracted aerial-image features, mapped them to the ground view with an adaptive transformation, and finally used end-to-end learning to minimize the difference between the predicted ground-view semantic features and the semantic features extracted directly from the ground image, completing cross-view geolocation. Hu et al. [27] combined a Siamese network with NetVLAD [28], encoded local features into a global image descriptor, and introduced a weighted soft-margin ranking loss to accelerate network convergence and improve network performance. Shi et al. [29] argued that existing methods ignore the differences in appearance and geometry between ground-view and aerial-view images, so they used a polar transform to approximately align aerial images with ground images. To further solve the cross-view orientation alignment problem, Shi et al. [30] designed a dynamic similarity matching network (DSM) to align the orientations of cross-view images, making the image matching results more accurate. Liu et al. [31] argued that geometric cues can be used for localization and designed a Siamese network that encodes the orientation information of each image pixel into the network model, so that the network learns appearance and geometric information simultaneously, improving recall and precision.
To address scene changes over time, Rodrigues et al. [32] proposed a semantics-driven data augmentation technique that simulates scene change in cross-view image matching and then used a multi-scale attention module for image matching, improving network performance. Regmi et al. [33] were the first to apply generative adversarial networks (GANs) [34] to cross-view localization: they synthesized aerial images from ground views for image matching, but the approach is not end-to-end. Toker et al. [35] synthesized ground views from satellite views with a polar transform and then performed image retrieval, integrating both steps into an end-to-end architecture and achieving advanced geolocation performance. The above methods mainly target matching between ground-view and aerial-view images and consider only two views for geolocation, ignoring the important role of drone-view images, so they lack feature learning for the multi-view matching task.
Recent cross-view geolocation studies suggest that adding viewpoints can improve geolocation accuracy, making the drone a third platform for solving the geolocation problem. Zheng et al. [13] constructed the University-1652 dataset containing satellite-view, ground-view, and drone-view images, treated all view images of the same location as one category, completed the geolocation task by classification, and applied the instance loss [36] to optimize the model. However, that method only focuses on image semantic information and does not consider the influence of image detail information on cross-view geolocation. To address this, Wang et al. [14] proposed the Local Pattern Network (LPN), which uses the context information of the image as an auxiliary cue and applies a square-ring partition strategy so that the network attends to the environment around the target building, effectively remedying the neglect of detail information in [13] and obtaining good matching results. Ding et al. [37] proposed a location-classification-based cross-matching method (LCM) that alleviates the sample imbalance between satellite and drone images and improves image-matching accuracy. Attention mechanisms are widely used in computer vision [16][38][39][40]; they make the network focus more on discriminative features, filter out irrelevant information, and improve model training. Zhang et al. [16] integrated relation-aware global attention into a person re-identification network and improved feature representation and re-identification performance by capturing the global structure information of the image. To avoid the impact of target offset and view scaling on image matching, Zhuang et al. [38] proposed a multi-scale block attention (MSBA) structure to enhance the salient features of different regions of the feature map. Lin et al. [39] designed a unit subtraction attention module (USAM) that makes the model focus on salient regions in the image by detecting key points in the feature map, improving performance with fewer parameters. Dai et al. [40] argued that some CNN-based operations cause loss of fine-grained image information, so they introduced a Transformer structure [41] into cross-view localization and proposed a feature segmentation and region alignment (FSRA) method, which partitions the feature map into different regions according to the heat distribution and supervises each region with classification, effectively realizing cross-view localization.
These methods provide new research ideas for solving inaccurate cross-view geolocation. Inspired by them, this scheme fully combines the attention mechanism with the feature extraction network and mines structural information from a global perspective; meanwhile, a dual-branch structure jointly trains the global and local image features, and dilated convolution is fused into the local branch to enlarge the receptive field of the feature map and capture richer multi-scale context information, thereby improving cross-view localization accuracy.
Cross-view geolocation is an important research direction in image retrieval: it matches images of the same geographic location taken from different platforms. Most existing methods fail to fully consider the effect of image structure information on cross-view geolocation, so the extracted features cannot fully characterize the image, which harms localization accuracy. Accordingly, a global relation attention-guided cross-view geolocation method is proposed: global relation attention is fully fused with the feature extraction network so that the network can capture rich global structure information, improving the representation capability of the features. Meanwhile, considering the important roles of semantic and context information in geolocation, a parallel joint-training structure with a global branch and a local branch is designed to fully mine multi-scale context features for image matching, further improving cross-view geolocation accuracy. Quantitative and qualitative experimental results on the University-1652, CVUSA, and CVACT datasets show that, compared with other advanced methods, the algorithm of this technical scheme has notable advantages in recall (Recall@K) and image retrieval precision (AP).
2. Method
2.1 Network framework
The network framework proposed in this technical scheme is shown in Fig. 1. The whole network is divided into a global branch and a local branch, which share network weights. First, ResNet50 is adopted as the backbone network; its average-pooling layer and classification layer are removed, and the features of the input image are extracted. Meanwhile, a relation-aware global attention (RGA) module, comprising global spatial relation attention (RGA-S) and global channel relation attention (RGA-C), is added after the shallow features to capture the global structure information of the image. Second, the output features of the previous stage are processed separately by a dual-branch structure, so that global and local information are both attended to effectively. The global branch extracts high-level semantic information of the whole image; the local branch focuses on the deep features of the network, retaining more image detail. At the same time, to incorporate information from the area surrounding the target, a square-ring partition strategy divides the feature map into four different regions in the local branch. Finally, global average pooling converts the high-level image features into column-vector descriptors. In the training phase, a classifier module produces the predicted class probability of each column-vector descriptor, and a cross-entropy loss minimizes the difference between the predicted class and the true class. During testing, the similarity between the query image and database images is computed with the Euclidean distance, and the retrieval results are finally ranked by similarity.
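The data flow described above can be summarized in a short PyTorch sketch. It assumes a torchvision ResNet-50 backbone; rga_module, local_stages, partition_fn, and class_blocks are hypothetical stand-ins for the attention module, the dilated local stages, the square-ring pooling, and the classifier blocks sketched in later sections, so this is an illustration of the branch layout rather than the patent's reference implementation.

```python
# Sketch of the dual-branch data flow (global branch + local branch on a shared
# ResNet-50 stem). rga_module, local_stages, partition_fn and class_blocks are
# injected stand-ins with hypothetical names, not components defined by the patent.
import torch.nn as nn
from torchvision import models

class DualBranchNet(nn.Module):
    def __init__(self, rga_module, local_stages, partition_fn, class_blocks):
        super().__init__()
        backbone = models.resnet50(weights=None)
        # shallow stages (conv1 .. layer2); the relation-aware attention follows them
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1, backbone.layer2)
        self.rga = rga_module                 # global relation attention (RGA-S + RGA-C)
        self.global_stages = nn.Sequential(backbone.layer3, backbone.layer4)  # 8x8 map for 256x256 input
        self.local_stages = local_stages      # dilated layer3/layer4 copy, 32x32 map (section 2.3)
        self.partition_fn = partition_fn      # square-ring partition + pooling, formula (5)
        self.gap = nn.AdaptiveAvgPool2d(1)    # replaces the removed ResNet pooling/classification head
        self.class_blocks = class_blocks      # nn.ModuleList, one classifier per descriptor (section 2.5)

    def forward(self, x):
        x = self.rga(self.stem(x))                        # shallow features + global relation attention
        g = self.gap(self.global_stages(x)).flatten(1)    # global 2048-d descriptor, formula (6)
        parts = self.partition_fn(self.local_stages(x))   # four square-ring 2048-d descriptors
        return [blk(f) for blk, f in zip(self.class_blocks, [g] + parts)]
```

During training, the returned list of class-score vectors feeds the cross-entropy loss of section 2.5; at test time the pooled descriptors themselves are compared with the Euclidean distance for retrieval.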
2.2 Global relation attention module
In the cross-view geolocation task, the global relation attention module lets the network notice discriminative features in the images, helping it distinguish buildings with similar appearance. The global relation attention module, comprising global spatial relation attention (RGA-S) and global channel relation attention (RGA-C), is combined with the feature extraction network to construct a global relation attention-guided feature extraction network, and attention weights are computed by learning the relationships between feature nodes. The global relation attention is shown in Fig. 4. Each feature vector in the feature map is represented as a feature node $x_i$, where $i = 1, 2, \dots, N$ and $N$ is the number of feature nodes. For a given feature node $x_i$, the correlations $r_{i,j}$ and $r_{j,i}$ between the current node and every other node are computed, where $j = 1, 2, \dots, N$, yielding the relation vector of $x_i$: $r_i = [r_{i,1}, r_{i,2}, \dots, r_{i,N}, r_{1,i}, r_{2,i}, \dots, r_{N,i}]$. The feature node is then concatenated with its relation vector $r_i$ to obtain the relation-aware feature, from which the attention weight of the current feature node is inferred.
2.2.1 Global spatial relation attention
Global spatial relation attention (RGA-S) learns the correlations between all feature nodes in the spatial dimension of the feature map, enabling the network to capture the features of salient target regions. The global spatial relation attention is shown in Fig. 5:
Specifically, for the feature map $\mathcal{F} \in \mathbb{R}^{C \times H \times W}$ obtained from the neural network, the C-dimensional feature vector at each spatial position is taken as a feature node, forming a graph $G_S$ with $N = W \times H$ nodes in total; the feature nodes are denoted $x_i$, where $i = 1, 2, \dots, N$. The correlation $r_{i,j}$ between feature nodes $x_i$ and $x_j$ can be obtained by a dot-product operation, specifically defined as formula (1):

$$r_{i,j} = f_s(x_i, x_j) = \big(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(x_i)))\big)^{T}\big(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(x_j)))\big) \qquad (1)$$

where $f_s(\cdot)$ denotes the dot-product operation, $\mathrm{ReLU}(\cdot)$ the rectified linear unit activation function, $\mathrm{BN}(\cdot)$ a batch normalization layer, and $\mathrm{Conv}(\cdot)$ a 1×1 convolution whose dimension-reduction ratio is controlled by a predefined positive integer. The correlation $r_{j,i}$ between feature nodes $x_j$ and $x_i$ is obtained in the same way, and the pair $(r_{i,j}, r_{j,i})$ represents the pairwise relationship between $x_i$ and $x_j$. Finally, the relation matrix $R_S \in \mathbb{R}^{N \times N}$ represents the correlations between all nodes, where $r_{i,j} = R_S(i, j)$.

Stacking the correlations between the $i$-th feature node and all nodes in a fixed order gives the spatial relation vector $r_i = [R_S(i, :), R_S(:, i)] \in \mathbb{R}^{2N}$, where $R_S(i, :)$ denotes the correlations between the $i$-th feature node and all nodes, and $R_S(:, i)$ denotes the correlations between all nodes and the $i$-th node. To let the network fully exploit the global structure information of the feature nodes, the spatial relation vector $r_i$ is concatenated with the feature node $x_i$ itself to obtain the spatial relation-aware feature $E_S$, which can be defined as formula (2):

$$E_S = C(x_i, r_i) = \Big(\mathrm{pool}_c\big(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(x_i)))\big),\ \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(r_i)))\Big) \qquad (2)$$

where $C(\cdot)$ denotes the concatenation operation and $\mathrm{pool}_c(\cdot)$ denotes global average pooling over the channel dimension, reducing the channel dimension to 1. The spatial attention weight $a_i$ of the $i$-th feature can then be computed from the spatial relation-aware feature, defined as formula (3):

$$a_i = \mathrm{sigmoid}\big(\mathrm{BN}(\mathrm{Conv}_2(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_1(E_S)))))\big) \qquad (3)$$

where $\mathrm{sigmoid}(\cdot)$ denotes the sigmoid activation function, $\mathrm{Conv}_2(\cdot)$ reduces the number of channels to 1, and $\mathrm{Conv}_1(\cdot)$ reduces the dimension at a fixed ratio.
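As a minimal sketch of formulas (1) to (3), the spatial attention can be written as follows; the reduction ratio of 8 and the placement of the module are assumed values, and the code follows the general relation-aware global attention design [16] rather than the patent's exact layer configuration.

```python
import torch
import torch.nn as nn

class RGASpatial(nn.Module):
    """Sketch of global spatial relation attention, formulas (1)-(3); ratio=8 is assumed."""
    def __init__(self, in_channels, height, width, ratio=8):
        super().__init__()
        n = height * width                                    # number of spatial feature nodes
        self.embed = nn.Sequential(                           # Conv-BN-ReLU embedding used in formula (1)
            nn.Conv2d(in_channels, in_channels // ratio, 1),
            nn.BatchNorm2d(in_channels // ratio), nn.ReLU(inplace=True))
        self.embed_rel = nn.Sequential(                       # embeds the 2N-dim relation vector, formula (2)
            nn.Conv2d(2 * n, 2 * n // ratio, 1),
            nn.BatchNorm2d(2 * n // ratio), nn.ReLU(inplace=True))
        mid = 1 + 2 * n // ratio                              # pooled node (1 channel) + relation channels
        self.weight = nn.Sequential(                          # formula (3): Conv1 -> Conv2 -> sigmoid
            nn.Conv2d(mid, mid // 2, 1), nn.BatchNorm2d(mid // 2), nn.ReLU(inplace=True),
            nn.Conv2d(mid // 2, 1, 1), nn.BatchNorm2d(1), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        e = self.embed(x).flatten(2)                          # B x C' x N embedded nodes
        rel = torch.bmm(e.transpose(1, 2), e)                 # formula (1): pairwise dot products, B x N x N
        r = torch.cat([rel, rel.transpose(1, 2)], dim=1)      # stacks R_S(:, j) and R_S(j, :) for every node j
        r = r.view(b, 2 * n, h, w)                            # 2N-dim relation vector at each position
        node = self.embed(x).mean(dim=1, keepdim=True)        # pool_c over channels, formula (2)
        e_s = torch.cat([node, self.embed_rel(r)], dim=1)     # relation-aware feature E_S
        return x * self.weight(e_s)                           # spatial attention weights a_i applied to x
```

Because the relation vector has a fixed length 2N, the module is bound to one spatial size; for 256×256 inputs it could, for instance, be instantiated as RGASpatial(512, 32, 32) after the second residual stage (an assumed placement).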
2.2.2 Global channel relation attention
Global channel relation attention (RGA-C) learns the relationships between all feature nodes in the channel dimension of the feature map, giving each channel a different weight. The global channel relation attention is shown in Fig. 6:
Specifically, for the feature map $\mathcal{F} \in \mathbb{R}^{C \times H \times W}$, the feature map on each channel is taken as a feature node, forming a graph $G_C$ with $C$ nodes in total; each feature node is denoted $x_i$, where $i = 1, 2, \dots, C$.

For the input feature map $S$, the spatial dimension is first compressed so that each channel yields one feature node. The correlation $r_{i,j}$ between feature nodes $x_i$ and $x_j$ can then be obtained, defined as formula (4):

$$r_{i,j} = f_c(x_i, x_j) = \big(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(x_i)))\big)^{T}\big(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(x_j)))\big) \qquad (4)$$

where $f_c(\cdot)$ denotes the dot-product operation. The correlation $r_{j,i}$ between feature nodes $x_j$ and $x_i$ is obtained by the same processing, and the matrix $R_C \in \mathbb{R}^{C \times C}$ represents the correlations between all nodes. Stacking the correlations between the $i$-th feature node and all nodes gives the channel relation vector $r_i = [R_C(i, :), R_C(:, i)] \in \mathbb{R}^{2C}$. Similarly to formulas (2) and (3), the channel relation-aware feature $E_C$ and the channel attention weight $a_i$ can then be obtained.
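A channel counterpart can be sketched in the same spirit; for brevity the Conv-BN-ReLU embedding of formula (4) is simplified here to a raw dot product between flattened channels, and the reduction ratio of 8 is assumed, so this is only an illustration of the channel relation idea rather than the patented module.

```python
import torch
import torch.nn as nn

class RGAChannel(nn.Module):
    """Simplified sketch of global channel relation attention (formula (4)); ratio=8 is assumed
    and the embedding of formula (4) is replaced by a raw dot product between channels."""
    def __init__(self, in_channels, ratio=8):
        super().__init__()
        c = in_channels
        self.embed_rel = nn.Sequential(                 # embeds the 2C-dim channel relation vector
            nn.Conv1d(2 * c, 2 * c // ratio, 1),
            nn.BatchNorm1d(2 * c // ratio), nn.ReLU(inplace=True))
        mid = 1 + 2 * c // ratio
        self.weight = nn.Sequential(
            nn.Conv1d(mid, mid // 2, 1), nn.BatchNorm1d(mid // 2), nn.ReLU(inplace=True),
            nn.Conv1d(mid // 2, 1, 1), nn.BatchNorm1d(1), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        nodes = x.flatten(2)                            # B x C x (H*W): one node per channel
        rel = torch.bmm(nodes, nodes.transpose(1, 2))   # pairwise channel correlations, B x C x C
        r = torch.cat([rel, rel.transpose(1, 2)], 1)    # B x 2C x C relation vector per channel
        node = nodes.mean(dim=2, keepdim=True)          # spatially compressed node, B x C x 1
        e_c = torch.cat([node.transpose(1, 2), self.embed_rel(r)], dim=1)  # relation-aware feature E_C
        a = self.weight(e_c).view(b, c, 1, 1)           # per-channel attention weights a_i
        return x * a
```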
2.3 Local branch
To let the network capture rich multi-scale context information and retain finer spatial structure information, so that the most similar images can be retrieved from the database and cross-view geolocation accuracy is improved, the local branch uses dilated convolutions with multiple dilation rates [17] to enlarge the receptive field of the feature map without losing image detail, improving the network's ability to capture multi-scale information. Meanwhile, a square-ring partition strategy divides the feature map at four scales to obtain rich spatial context information.
Dilated convolution enlarges the receptive field by inserting r-1 zeros between kernel weights, where r is the dilation factor; in a standard convolution r = 1. The standard and dilated convolution structures are shown in Fig. 7, where part (a) shows a standard convolution and part (b) a dilated convolution with dilation factor 2. Under the same conditions, a 3×3 kernel gives a 3×3 receptive field for the standard convolution and a 5×5 receptive field for the dilated convolution. Compared with standard convolution, dilated convolution can capture richer multi-scale image information for matching.
Specifically, this module uses dilated convolution operations with dilation factors of 2 and 4, respectively, to enlarge the receptive field of the feature map. At the same time, the strides of the convolution layer and the downsampling layer in the last residual block of ResNet50 are set to 1; when the input image resolution is 256×256, the feature map output by the backbone network has resolution 8×8, while the feature map output by the dilated residual network has resolution 32×32.
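A sketch of how such a dilated local backbone can be built with torchvision is given below; mapping the dilation factors 2 and 4 onto the last two ResNet-50 stages via replace_stride_with_dilation is an assumption consistent with the stated 32×32 output, not a statement of the patented implementation.

```python
import torch
from torchvision import models

# Local-branch stages sketch: a ResNet-50 whose last two stages replace their strides
# with dilations gets dilation factors 2 and 4 and keeps a 32x32 feature map for a
# 256x256 input (the shared shallow stages output 512 x 32 x 32). Mapping the two
# factors onto layer3/layer4 this way is an assumption consistent with the sizes above.
dilated = models.resnet50(weights=None, replace_stride_with_dilation=[False, True, True])
local_stages = torch.nn.Sequential(dilated.layer3, dilated.layer4)

stem_out = torch.randn(1, 512, 32, 32)            # shape of the shared shallow features
print(local_stages(stem_out).shape)               # torch.Size([1, 2048, 32, 32])

# For comparison, a plain 3x3 convolution with dilation=2 covers a 5x5 receptive field,
# as illustrated in Fig. 7(b).
conv_dilated = torch.nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)
```

This local_stages module is what the dual-branch sketch in section 2.1 expects for its local branch, while the global branch keeps the standard strides and an 8×8 output.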
To help the network better distinguish images of different geographic locations, the environment around the target building is used as auxiliary information: a square-ring partition strategy is adopted in the local branch to divide the feature map into four parts according to their distance from the image centre, producing feature maps of different regions. The image features are then converted into 2048-dimensional feature vectors by an average pooling operation, expressed as formula (5):

$$l_j^i = \mathrm{Avgpool}(f_j^i), \quad i = 1, 2, 3, 4 \qquad (5)$$

where $\mathrm{Avgpool}(\cdot)$ denotes the average pooling operation, $f_j^i$ denotes the $i$-th partitioned local-branch feature map of view platform $j$, and $l_j^i$ denotes the corresponding pooled 2048-dimensional local-branch feature vector.
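The square-ring partition and the pooling of formula (5) can be sketched as follows; splitting the map into four nested rings of equal width around the centre is an assumed geometry, not the exact partition specified by the patent.

```python
import torch

def square_ring_partition(fmap, num_rings=4):
    """Split a B x C x H x W map into nested square rings around the centre and
    average-pool each ring into a C-dimensional descriptor, as in formula (5)."""
    b, c, h, w = fmap.shape
    step_h, step_w = h // (2 * num_rings), w // (2 * num_rings)
    descriptors, inner_sum, inner_cnt = [], None, None
    for k in range(1, num_rings + 1):
        mh, mw = step_h * k, step_w * k
        block = fmap[:, :, h // 2 - mh:h // 2 + mh, w // 2 - mw:w // 2 + mw]
        blk_sum = block.sum(dim=(2, 3))            # B x C sum over the enclosing square
        blk_cnt = block.shape[2] * block.shape[3]
        if inner_sum is None:                      # innermost square region
            ring_sum, ring_cnt = blk_sum, blk_cnt
        else:                                      # subtract the inner square to keep only the ring
            ring_sum, ring_cnt = blk_sum - inner_sum, blk_cnt - inner_cnt
        descriptors.append(ring_sum / ring_cnt)    # average pooling over the ring region
        inner_sum, inner_cnt = blk_sum, blk_cnt
    return descriptors                             # list of num_rings tensors, each B x C

# usage sketch: a 32x32 local feature map yields four 2048-d ring descriptors
parts = square_ring_partition(torch.randn(2, 2048, 32, 32))
print([tuple(p.shape) for p in parts])             # [(2, 2048), (2, 2048), (2, 2048), (2, 2048)]
```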
2.4 Global branch
Considering that the semantic information attended to by deep networks is also an important part of the cross-view geolocation task, a global branch is designed in parallel with the local branch. In the global branch, the deep residual network extracts and refines large-scale features, yielding a feature map $f_j$ rich in semantic information, so that the network can recognize the categories of different image features. The global feature map is then processed by average pooling to obtain a 2048-dimensional feature vector, expressed as formula (6):

$$g_j = \mathrm{Avgpool}(f_j) \qquad (6)$$

where $g_j$ denotes the pooled global-branch feature vector.
2.5 Classification learning and loss function
The technical proposal fuses the classifierAfter merging into the feature extraction stage, it is used to predict the class of each feature vector. The classifier consists of a fully connected Layer (Fully Connected Layer, FC), a batch normalization Layer (Batch Normalization Layer, BN), a Dropout Layer (Dropout Layer), and a classification Layer (Classification Layer, cls). Local feature vector of image
Figure BDA0004055726940000074
And global feature vector g j As input, predicting the category to which each feature vector belongs, and finally obtaining the local prediction probability distribution vector of the image>
Figure BDA0004055726940000075
And a global predictive probability distribution vector q j
The loss function used in the technical scheme is cross entropy loss, and the distribution difference between the prediction probability and the true probability of the image is measured by using the function, so that the image characteristics are better learned, and the network training precision is improved. The cross entropy loss can be represented by equation (7):
Figure BDA0004055726940000081
wherein the method comprises the steps of
Figure BDA0004055726940000082
Representing corresponding original image after square ring division strategy processing, x j (j∈[1,2]) Representing an input image, j=1 represents a drone platform, and j=2 represents a satellite platform. y represents the true class of the input image, +.>
Figure BDA0004055726940000083
Respectively indicate->
Figure BDA0004055726940000084
And x j Normalized probability scores belonging to the true class, defined by equation (8) and equation (9),
Figure BDA0004055726940000085
Figure BDA0004055726940000086
where C represents the number of all geotag categories in the database.
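A sketch of the classifier block and of the loss in formulas (7) to (9) is given below; the 512-dimensional bottleneck and the dropout rate of 0.5 are assumed hyper-parameters not specified in the text.

```python
import torch.nn as nn

class ClassBlock(nn.Module):
    """Sketch of one classifier: FC -> BN -> Dropout -> classification layer.
    The 512-d bottleneck and p=0.5 dropout are assumed hyper-parameters."""
    def __init__(self, num_classes, in_dim=2048, bottleneck=512, p_drop=0.5):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(in_dim, bottleneck),
            nn.BatchNorm1d(bottleneck),
            nn.Dropout(p=p_drop),
            nn.Linear(bottleneck, num_classes))

    def forward(self, v):            # v: B x 2048 column-vector descriptor
        return self.block(v)         # unnormalised class scores

def total_loss(score_vectors, labels):
    """Formulas (7)-(9): cross-entropy over softmax-normalised scores, summed over the
    global descriptor and the four square-ring descriptors of each image."""
    ce = nn.CrossEntropyLoss()       # applies the softmax of formulas (8)-(9) internally
    return sum(ce(z, labels) for z in score_vectors)
```

With five descriptors per image (one global, four local), five such blocks would be instantiated and their cross-entropy terms summed, as in formula (7).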
3 Experiments
3.1 Datasets
This technical scheme uses three datasets, University-1652 [13], CVUSA [21], and CVACT [31], to train and test the proposed method.
(1) University-1652 is a multi-view, multi-source dataset containing drone-view, satellite-view, and ground-view images of 1,652 buildings from 72 universities, with no duplicate images between the training and test sets. We use the drone-view and satellite-view images in this dataset to study two tasks: drone-view target localization and drone navigation. In the drone-view target localization task, the drone-view query set contains 701 image categories, each corresponding to one truly matching satellite image. In the drone navigation task, the satellite-view query set contains 701 image categories, each corresponding to 54 truly matching drone images.
(2) The CVUSA dataset includes satellite images and panoramic ground images, with 35532 image pairs for training and 8884 image pairs for testing.
(3) CVACT is a larger benchmark dataset that also contains 35,532 training image pairs; 8,884 image pairs are used as the validation set, and an additional 92,802 image pairs are used as the test set.
3.2 Experimental details
To ensure fairness of the experiments, all algorithms were implemented on a Linux server running Ubuntu 20.04, and all performance comparisons are based on results under this configuration. The server is equipped with a GTX 3090 GPU with 24 GB of video memory. The model of this technical scheme is implemented with the PyTorch framework. Before training, all input images are resized to 256×256, and data augmentation is performed with horizontal flipping and random rotation. The model is updated with an SGD optimizer with momentum 0.9 and weight decay 0.0005, and the initial learning rate is set to 0.001. For better convergence of the network, the model is trained for 140 epochs on the University-1652 dataset and 100 epochs on the CVUSA and CVACT datasets. In the test phase, the similarity of images is evaluated by computing the Euclidean distance between the query image and database images.
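The stated training and test configuration maps onto standard PyTorch components as sketched below; the rotation range and the normalization statistics are assumptions, and only the resize, flip, rotation, and SGD hyper-parameters come from the text.

```python
import torch
from torchvision import transforms

# Data pipeline and optimiser corresponding to the stated settings; the 15-degree
# rotation range and the ImageNet normalisation statistics are assumed values.
train_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def make_optimizer(model):
    # SGD with momentum 0.9, weight decay 0.0005 and initial learning rate 0.001
    return torch.optim.SGD(model.parameters(), lr=0.001,
                           momentum=0.9, weight_decay=0.0005)

def euclidean_ranking(query_feats, gallery_feats):
    """Test-time retrieval: rank gallery images by Euclidean distance to each query."""
    dists = torch.cdist(query_feats, gallery_feats)   # Q x G distance matrix
    return dists.argsort(dim=1)                       # ascending: nearest image first
```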
3.3 Performance comparison
3.3.1 Quantitative comparison
This technical scheme adopts recall (Recall@K) and average precision (AP) as image retrieval performance metrics. Recall@K is the ratio of correctly matched images among the first K retrieved results to all correct images in the database and measures recall; this embodiment mainly considers the case K = 1. AP is the area under the precision-recall (PR) curve and reflects the ratio of retrieved true matches to the total number of retrieval results, measuring precision. Larger Recall@K and AP values indicate higher image retrieval accuracy.
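Recall@K and AP can be computed from the ranked retrieval lists as sketched below; this follows the common definitions of the two metrics and is not the evaluation script used to produce the reported numbers.

```python
import torch

def recall_at_k(ranked_labels, query_labels, k=1):
    """Fraction of queries with at least one correct match among the top-k results."""
    hits = (ranked_labels[:, :k] == query_labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

def mean_average_precision(ranked_labels, query_labels):
    """Mean over queries of the area under the precision-recall curve of the ranking."""
    matches = (ranked_labels == query_labels.unsqueeze(1)).float()      # Q x G relevance
    cum_hits = matches.cumsum(dim=1)
    ranks = torch.arange(1, matches.shape[1] + 1, dtype=torch.float32)
    precision = cum_hits / ranks                                        # precision at every rank
    ap = (precision * matches).sum(dim=1) / matches.sum(dim=1).clamp(min=1)
    return ap.mean().item()
```

Here ranked_labels holds the gallery labels reordered by the distance ranking from the retrieval sketch above.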
To illustrate the effectiveness of the proposed method, it is compared with other CNN-based algorithms on the three datasets University-1652, CVUSA, and CVACT. The comparison results on the University-1652 dataset are shown in Table 1; the compared methods include Instance Loss [13], LCM [37], LPN [14], Instance Loss + USAM [39], and LPN + USAM [39].
Table 1. Quantitative results on the University-1652 dataset.
Contrastive Loss [22], Triplet Loss [42], and Soft Margin Triplet Loss [27] denote results obtained by replacing the loss function of the Instance Loss method [13]. The optimal and suboptimal results of each evaluation metric are indicated by brackets.
As can be seen from Table 1, the proposed method achieves the best results in both tasks on the University-1652 dataset. In the drone-view target localization task, the algorithm reaches 81.06% and 83.74% on the Recall@K and AP metrics respectively, improvements of 3.99% and 3.65% over the suboptimal method LPN + USAM [39]. In the drone navigation task, it reaches 89.58% and 79.63% on Recall@K and AP, improvements of 3.13% and 4.84% over the suboptimal method LPN [14], which demonstrates the clear advantage of the proposed method in image retrieval performance.
The comparison results on the CVUSA and CVACT datasets are shown in Table 2. Since the ground images in these two datasets are panoramic, a sequential partition strategy [14] is adopted to divide the images. The compared methods include CVM-Net [27], Instance Loss [13], Regmi et al. [33], Siam-FCANet [43], CVFT [12], LPN [14], Instance Loss + USAM [39], and LPN + USAM [39]. The LPN [14] and LPN + USAM [39] results were generated by training with their publicly released code, while the other methods use the results provided by their authors.
Table 2. Quantitative results on the CVUSA and CVACT datasets.
The optimal and suboptimal results of each evaluation metric are indicated by brackets. Methods marked in the table use additional orientation information as input.
As can be seen from Table 2, on the CVUSA dataset the proposed method reaches 88.00% and 99.47% on the evaluation metrics R@1 and R@Top1%, respectively. Compared with the other nine advanced models, both metrics improve markedly; in particular, R@1 improves by 2.03%. On the CVACT_val dataset, the method reaches 80.98% R@1 and 96.53% R@Top1%, both optimal results, demonstrating the effectiveness of the proposed method.
3.3.2 Qualitative results
Fig. 8 and Fig. 9 show retrieval results of the method on the University-1652 dataset, visualizing the drone-view target localization task and the drone navigation task respectively, and Fig. 10 shows retrieval results on the CVUSA dataset. In the qualitative results, each row shows the retrieval results for one location: the image to the left of the dashed line is the query image, and the images to the right are the top-ranked images in the matching results. Yellow boxes indicate correct retrievals and blue boxes indicate incorrect retrievals.
For drone-view target localization, only one truly matching image appears among the first five results in Fig. 8, because each drone-view image has only one matching satellite image; this shows that the method can correctly retrieve the matching image despite interference from similar images. For the drone navigation task, the first five results in Fig. 9 are all correctly matched images, since each satellite image has 54 matching drone images. Each ground image in the CVUSA dataset corresponds to one correct satellite image, and the first image in the retrieval results of each query in Fig. 10 is a correct match. The qualitative analysis shows that the method retrieves correct results on both datasets.
3.4 Ablation experiments
To verify the effectiveness of each module, the proposed method is analyzed in this section through ablation experiments on the University-1652 dataset.
3.4.1 Effectiveness of the global relation attention module
To verify the effectiveness of the global relation attention module, two ablation experiments were performed. In the first, the global relation attention module is removed and only the dual-branch network is used for image feature extraction. In the second, the SE attention module from SENet [44] is added to the network to obtain image attention in the channel dimension.
From the results in Table 3, it can be observed that the global relation attention module lets the network focus on the discriminative features of the image and improves its retrieval ability, yielding better results on the metrics than either using no attention mechanism or using the SE attention module.
Table 3 global relationship attention module ablation experimental results comparison.
(a) Indicating that the attention mechanism is not used; (b) represents adding a SE attention module; (c) represents the use of an RGA attention module.
The evaluation index optimum is indicated by brackets.
3.4.2 Effectiveness of dilated convolution
To verify the effectiveness of dilated convolution, three ablation experiments were performed in which the dilated convolutions in the local branch were adjusted and image features were extracted with different dilation rates. From the results in Table 4, it can be seen that enlarging the receptive field of the feature map with dilated convolution effectively captures the detail information of the image and thus improves cross-view geolocation accuracy; the model performs best when the dilation factors are 2 and 4.
Table 4 comparison of the results of the dilation convolution ablation experiments.
(a) The expansion factors of the local branch residual blocks are respectively 1 and 1;
(b) The expansion factors of the local branch residual blocks are respectively 1 and 2;
(c) The expansion factors of the local branch residual blocks are respectively 2 and 2;
(d) The local branch residual block expansion factors are respectively 2 and 4.
The evaluation index optimum is indicated by brackets.
3.4.3 Influence of input image size on the results
Training with high-resolution images achieves higher accuracy but requires more computational resources and time. Because resources are limited, low-resolution input images are needed in practice, which reduces the accuracy of image matching. A set of ablation experiments was designed to observe the effect of input images of different resolutions on model performance. As shown in Table 5, when the input image size is increased from 224 to 320, both the Recall@1 and AP (image retrieval precision) values of the network improve; when the image size is increased to 384, performance degrades slightly.
Table 5 effect of different resolution input images on the results.
(a) Representing an input image size of 224×224;
(b) Representing the input image size as 256×256;
(c) Representing an input image size of 320×320;
(d) Representing an input image size of 384 x 384.
The evaluation index optimum is indicated by brackets.
4 Conclusion
This technical scheme provides a global relation attention-guided cross-view geolocation method. The method uses global relation attention to capture the global structure information of the image and extracts more robust image features for geolocation. Meanwhile, a dual-branch strategy is used for joint training: in the local branch, dilated convolution enlarges the receptive field of the feature map, which is partitioned at four scales. Feature representations containing context information and semantic information are obtained through the dual-branch structure to compute image class probabilities, further improving geolocation accuracy. On the University-1652, CVUSA, and CVACT datasets, the method yields clear improvements in Recall@K and AP. In addition, it can effectively eliminate the interference of similar buildings and retrieve correct images.
References
[1] Ahmed K T, Ummesafi S, Iqbal A. Content based image retrieval using image features information fusion. Information Fusion, 2019, 51: 76-99.
[2] Saritha R R, Paul V, Kumar P G. Content based image retrieval using deep learning process. Cluster Computing, 2019, 22(2): 4187-4200.
[3] Outay F, Mengash H A, Adnan M. Applications of unmanned aerial vehicle (UAV) in road safety, traffic and highway infrastructure management: recent advances and challenges. Transportation Research Part A: Policy and Practice, 2020, 141: 116-129.
[4] Zhu H, Ma M, Ma W, et al. A spatial-channel progressive fusion ResNet for remote sensing classification. Information Fusion, 2021, 70: 72-87.
[5] Wang P, Fan E, Wang P. Comparative analysis of image classification algorithms based on traditional machine learning and deep learning. Pattern Recognition Letters, 2021, 141: 61-67.
[6] Zhang D, Ye M, Liu Y, et al. Multi-source unsupervised domain adaptation for object detection. Information Fusion, 2022, 78: 138-148.
[7] Tan M, Pang R, Le Q V. EfficientDet: scalable and efficient object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 10781-10790.
[8] Yuan Y, Chen X, Wang J. Object-contextual representations for semantic segmentation. European Conference on Computer Vision, Springer, Cham, 2020: 173-190.
[9] Hao S, Zhou Y, Guo Y. A brief survey on semantic segmentation with deep learning. Neurocomputing, 2020, 406: 302-321.
[10] Jaouedi N, Boujnah N, Bouhlel M S. A new hybrid deep learning model for human action recognition. Journal of King Saud University - Computer and Information Sciences, 2020, 32(4): 447-453.
[11] Yang C, Xu Y, Shi J, et al. Temporal pyramid network for action recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 591-600.
[12] Shi Y, Yu X, Liu L, et al. Optimal feature transport for cross-view image geo-localization. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(07): 11990-11997.
[13] Zheng Z, Wei Y, Yang Y. University-1652: a multi-view multi-source benchmark for drone-based geo-localization. Proceedings of the 28th ACM International Conference on Multimedia, 2020: 1395-1403.
[14] Wang T, Zheng Z, Yan C, et al. Each part matters: local patterns facilitate cross-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 32(2): 867-879.
[15] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[16] Zhang Z, Lan C, Zeng W, et al. Relation-aware global attention for person re-identification. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 3186-3195.
[17] Yu F, Koltun V, Funkhouser T. Dilated residual networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 472-480.
[18] Zheng Z, Zheng L, Yang Y. A discriminatively learned CNN embedding for person re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2017, 14(1): 1-20.
[19] Li X, Yu L, Chang D, et al. Dual cross-entropy loss for small-sample fine-grained vehicle classification. IEEE Transactions on Vehicular Technology, 2019, 68(5): 4204-4212.
[20] Workman S, Jacobs N. On the location dependence of convolutional neural network features. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015: 70-78.
[21] Workman S, Souvenir R, Jacobs N. Wide-area image geolocalization with aerial reference imagery. Proceedings of the IEEE International Conference on Computer Vision, 2015: 3961-3969.
[22] Lin T Y, Cui Y, Belongie S, et al. Learning deep representations for ground-to-aerial geolocalization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 5007-5015.
[23] Vo N N, Hays J. Localizing and orienting street views using overhead imagery. European Conference on Computer Vision, Springer, Cham, 2016: 494-509.
[24] Tian Y, Chen C, Shah M. Cross-view image matching for geo-localization in urban environments. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 3608-3616.
[25] Altwaijry H, Trulls E, Hays J, et al. Learning to match aerial images with deep attentive architectures. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 3539-3547.
[26] Zhai M, Bessinger Z, Workman S, et al. Predicting ground-level scene layout from aerial imagery. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 867-875.
[27] Hu S, Feng M, Nguyen R M H, et al. CVM-Net: cross-view matching network for image based ground-to-aerial geo-localization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7258-7267.
[28] Arandjelovic R, Gronat P, Torii A, et al. NetVLAD: CNN architecture for weakly supervised place recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 5297-5307.
[29] Shi Y, Liu L, Yu X, et al. Spatial-aware feature aggregation for image based cross-view geo-localization. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019: 10090-10100.
[30] Shi Y, Yu X, Campbell D, et al. Where am I looking at? Joint location and orientation estimation by cross-view matching. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 4064-4072.
[31] Liu L, Li H. Lending orientation to neural networks for cross-view geo-localization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 5624-5633.
[32] Rodrigues R, Tani M. Are these from the same place? Seeing the unseen in cross-view image geo-localization. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021: 3753-3761.
[33] Regmi K, Shah M. Bridging the domain gap for ground-to-aerial image matching. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 470-479.
[34] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial networks. Communications of the ACM, 2020, 63(11): 139-144.
[35] Toker A, Zhou Q, Maximov M, et al. Coming down to earth: satellite-to-street view synthesis for geo-localization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 6488-6497.
[36] Zheng Z, Zheng L, Garrett M, et al. Dual-path convolutional image-text embeddings with instance loss. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2020, 16(2): 1-23.
[37] Ding L, Zhou J, Meng L, et al. A practical cross-view image matching method between UAV and satellite for UAV-based geo-localization. Remote Sensing, 2020, 13(1): 47.
[38] Zhuang J, Dai M, Chen X, et al. A faster and more effective cross-view matching method of UAV and satellite images for UAV geolocalization. Remote Sensing, 2021, 13(19): 3979.
[39] Lin J, Zheng Z, Zhong Z, et al. Joint representation learning and keypoint detection for cross-view geo-localization. IEEE Transactions on Image Processing, 2022.
[40] Dai M, Hu J, Zhuang J, et al. A transformer-based feature segmentation and region alignment method for UAV-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 2021.
[41] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017, 30.
[42] Chechik G, Sharma V, Shalit U, et al. Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, 2010, 11(3).
[43] Cai S, Guo Y, Khan S, et al. Ground-to-aerial image geo-localization with a hard exemplar reweighting triplet loss. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 8391-8400.
[44] Hu J, Shen L, Sun G. Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7132-7141.

Claims (7)

1. A global relation attention-guided cross-view geolocation method, characterized by comprising the following steps:
a deep residual network is adopted as the backbone network, and a global relation attention module is used to capture more robust global structure information of the image for matching; a dual-branch network comprising a global branch and a local branch is designed to capture, respectively, deep features rich in semantic information and local features with multi-scale context information; in the local branch, dilated convolution is used to enlarge the receptive field of the feature map, and a square-ring partition strategy is used to split the feature map at four scales; the feature map of each branch is converted into column-vector descriptors, and a classifier is then used to obtain the predicted category of each column vector; a cross-entropy loss function is used to measure the difference between the predicted category and the true category of the image, thereby improving network training accuracy.
2. The global relation attention guided cross-view geographic positioning method of claim 1, comprising the following steps:
the whole network structure is divided into a global branch and a local branch, which share network weights; first, ResNet50 is adopted as the backbone network, its average pooling layer and classification layer are removed, and features of the input image are extracted; meanwhile, a global relation attention module, comprising global spatial relation attention and global channel relation attention, is added after the shallow features are extracted and is used to capture the global structural information of the image; second, the output features of the previous stage are processed separately by the dual-branch structure, so that global and local information is attended to effectively; the global branch extracts high-level semantic information of the whole image; the local branch focuses on the deep features of the network, so that more image detail information is retained; meanwhile, in order to incorporate information from the regions surrounding the target, the square-ring partition strategy is used in the local branch to divide the feature map into four different regions; finally, global average pooling is used to convert the high-level image features into column-vector descriptors; in the training stage, a classifier module is used to obtain the predicted category probability of each column-vector descriptor, and a cross-entropy loss function is used to minimize the difference between the predicted category and the true category; in the test stage, the Euclidean distance is used to compute the similarity between the query image and the database images, and the retrieval results are finally ranked according to this similarity.
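The test-time retrieval step at the end of claim 2 (Euclidean distance between the query descriptor and every database descriptor, followed by sorting) might look like this sketch; the function name and descriptor shapes are assumptions, and the descriptors would come from a model such as the one sketched after claim 1.

```python
# Sketch of the retrieval step in claim 2 (assumed PyTorch):
# rank database images by Euclidean distance to the query descriptor.
import torch

def rank_by_euclidean(query_desc: torch.Tensor, db_descs: torch.Tensor) -> torch.Tensor:
    """query_desc: (D,) column-vector descriptor of the query image.
    db_descs: (N, D) descriptors of the database images.
    Returns database indices sorted from most to least similar."""
    dists = torch.cdist(query_desc.unsqueeze(0), db_descs).squeeze(0)  # (N,) Euclidean distances
    return torch.argsort(dists)  # smaller distance = higher similarity

# Example with random 2048-dimensional descriptors.
query = torch.randn(2048)
database = torch.randn(1000, 2048)
ranking = rank_by_euclidean(query, database)
print(ranking[:10])  # indices of the 10 best-matching database images
```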
3. The global relation attention guided cross-view geographic positioning method of claim 1, comprising the following steps:
the global relation attention module, comprising global spatial relation attention and global channel relation attention, is combined with the feature extraction network to construct a feature extraction network guided by global relation attention, and the attention weights are computed by learning the relations among feature nodes; the feature vectors in the feature map are represented as feature nodes x_i, where i = 1, 2, …, N and N is the number of feature nodes; for a given feature node x_i, the correlations r_{i,j} and r_{j,i} between the current node and every other node are computed, where j = 1, 2, …, N, yielding the relation vector of feature node x_i as r_i = [r_{i,1}, r_{i,2}, …, r_{i,N}, r_{1,i}, r_{2,i}, …, r_{N,i}]; the feature node is then concatenated with its relation vector r_i to obtain the relation-aware feature, from which the attention weight of the current feature node is inferred.
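The pairwise-relation bookkeeping of claim 3, an N×N correlation matrix whose i-th row and i-th column are stacked into the relation vector r_i, can be written compactly as follows. The embedding layers of formula (1) are omitted here and plain dot products are used, which is a simplification for illustration only.

```python
# Sketch of the relation vectors in claim 3: for N feature nodes, build the
# N x N correlation matrix R and stack row i and column i into r_i (length 2N).
import torch

def relation_vectors(nodes: torch.Tensor) -> torch.Tensor:
    """nodes: (N, C) feature nodes x_i. Returns (N, 2N) relation vectors r_i."""
    R = nodes @ nodes.t()                 # R[i, j] = r_{i,j} (plain dot product here)
    return torch.cat([R, R.t()], dim=1)   # r_i = [R(i, :), R(:, i)]

nodes = torch.randn(64, 256)              # e.g. N = 8*8 spatial positions, C = 256
r = relation_vectors(nodes)
print(r.shape)                            # torch.Size([64, 128])
```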
4. The global relation attention guided cross-view geographic positioning method of claim 3, comprising the following steps:
the global spatial relation attention learns the correlations among all feature nodes in the spatial dimension of the feature map, so that the network captures the features of salient target regions;
specifically, for a feature map S ∈ R^{C×H×W} obtained from the neural network, the C-dimensional feature vector at each spatial position is taken as a feature node, forming a graph G_S with N = W×H nodes in total, where each feature node is denoted x_i, i = 1, 2, …, N; the correlation r_{i,j} between feature nodes x_i and x_j is obtained by a dot-product operation, defined as formula (1):
r_{i,j} = f_s(x_i, x_j) = (ReLU(BN(Conv(x_i))))^T (ReLU(BN(Conv(x_j))))    (1)
where f_s(·) denotes the dot-product operation, ReLU(·) denotes the rectified linear unit activation function, BN(·) denotes a batch normalization layer, and Conv(·) denotes a 1×1 convolution whose dimension-reduction ratio is controlled by a predefined positive integer; the correlation r_{j,i} between feature nodes x_j and x_i is obtained in the same way, and (r_{i,j}, r_{j,i}) represents the pairwise relation between feature nodes x_i and x_j; finally, a relation matrix R_S ∈ R^{N×N} represents the correlations between all nodes, where r_{i,j} = R_S(i, j);
the correlations between the i-th feature node and all nodes are stacked in a fixed order to obtain the spatial relation vector r_i = [R_S(i, :), R_S(:, i)], where R_S(i, :) denotes the correlations between the i-th feature node and all nodes and R_S(:, i) denotes the correlations between all nodes and the i-th node; to fully exploit the global structural information of the feature nodes, the spatial relation vector r_i and the feature node x_i itself are concatenated to obtain the spatial relation-aware feature E_S, defined as formula (2):
E_S = C(x_i, r_i) = (pool_c(ReLU(BN(Conv(x_i)))), ReLU(BN(Conv(r_i))))    (2)
where C(·) denotes the concatenation operation, and pool_c(·) denotes global average pooling over the channel dimension, which reduces the channel dimension to 1; the spatial attention weight a_i of the i-th feature is then computed from the spatial relation-aware feature, defined as formula (3):
a_i = sigmoid(BN(Conv_2(ReLU(BN(Conv_1(E_S))))))    (3)
where sigmoid(·) denotes the sigmoid activation function, Conv_2(·) converts the number of channels to 1, and Conv_1(·) reduces the dimension by a fixed ratio;
the global channel relation attention learns the relations among all feature nodes in the channel dimension of the feature map and assigns a different weight to each channel;
specifically, for the feature map S ∈ R^{C×H×W}, the feature map on each channel is taken as a feature node, forming a graph G_C with C nodes in total, where each feature node is denoted x_i, i = 1, 2, …, C;
the input feature map S is first compressed spatially, and the correlation r_{i,j} between feature nodes x_i and x_j is then obtained, defined as formula (4):
r_{i,j} = f_c(x_i, x_j) = (ReLU(BN(Conv(x_i))))^T (ReLU(BN(Conv(x_j))))    (4)
where f_c(·) denotes the dot-product operation; the correlation r_{j,i} between feature nodes x_j and x_i is obtained in the same way, and a relation matrix R_C ∈ R^{C×C} represents the correlations between all nodes; the correlations between the i-th feature node and all nodes are stacked to obtain the channel relation vector r_i = [R_C(i, :), R_C(:, i)], from which the channel relation-aware feature E_C and the channel attention weight a_i are obtained.
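A compact sketch of the global spatial relation attention of claim 4, written against formulas (1) to (3). The class name, reduction ratio, channel counts and exact layer ordering are assumptions about one reasonable realization; the channel-relation branch of formula (4) follows the same pattern with the channels taken as the nodes.

```python
# Sketch (assumed PyTorch) of the global spatial relation attention in claim 4.
import torch
import torch.nn as nn


class GlobalSpatialRelationAttention(nn.Module):
    def __init__(self, in_channels: int, spatial_nodes: int, ratio: int = 8):
        super().__init__()
        mid = in_channels // ratio
        def conv_bn_relu(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.embed_q = conv_bn_relu(in_channels, mid)         # embeddings for r_{i,j}, formula (1)
        self.embed_k = conv_bn_relu(in_channels, mid)
        self.embed_x = conv_bn_relu(in_channels, mid)          # node term of E_S, formula (2)
        self.embed_r = conv_bn_relu(2 * spatial_nodes, mid)    # relation term of E_S
        self.weight = nn.Sequential(                           # formula (3)
            nn.Conv2d(mid + 1, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, 1), nn.BatchNorm2d(1), nn.Sigmoid())

    def forward(self, x):
        b, _, h, w = x.shape
        n = h * w
        q = self.embed_q(x).flatten(2)                         # (B, mid, N)
        k = self.embed_k(x).flatten(2)
        rel = torch.bmm(q.transpose(1, 2), k)                  # R_S, (B, N, N), formula (1)
        r_i = torch.cat([rel, rel.transpose(1, 2)], dim=2)     # [R_S(i,:), R_S(:,i)], (B, N, 2N)
        r_i = r_i.transpose(1, 2).reshape(b, 2 * n, h, w)      # relation vector as a 2N-channel map
        node = self.embed_x(x).mean(dim=1, keepdim=True)       # pool_c(.) -> 1 channel
        e_s = torch.cat([node, self.embed_r(r_i)], dim=1)      # E_S, formula (2)
        a = self.weight(e_s)                                   # spatial weights a_i, formula (3)
        return x * a                                           # re-weighted feature map


attn = GlobalSpatialRelationAttention(in_channels=256, spatial_nodes=16 * 16)
out = attn(torch.randn(2, 256, 16, 16))
print(out.shape)  # torch.Size([2, 256, 16, 16])
```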
5. The global relation attention guided cross-view geographic positioning method of claim 1, comprising the following steps:
a dilated convolution structure is introduced into the deep residual network; the dilated convolution enlarges the receptive field by inserting r−1 zero-weighted values between adjacent weights of the convolution kernel, where r is the dilation factor, and in a standard convolution operation the dilation factor is 1; with a 3×3 convolution kernel under otherwise identical conditions, the receptive fields of the standard convolution and of the dilated convolution are 3×3 and 5×5, respectively; compared with standard convolution, dilated convolution therefore captures richer multi-scale image information for image matching;
for the residual blocks of the network in the local branch, dilated convolution operations with dilation factors of 2 and 4 are used to enlarge the receptive field of the feature map; meanwhile, the stride of the convolution layer and of the downsampling layer in the last residual block of ResNet50 is set to 1, so that when the resolution of the input image is 256×256, the feature map output by the original backbone has resolution 8×8 while the feature map output by the dilated residual network has resolution 32×32;
in order to better distinguish images of different geographic locations, the surroundings of the target building are used as auxiliary information: in the local branch, a square-ring partition strategy is adopted to divide the feature map into four parts according to the distance from the image center, yielding feature maps of different regions; the image features are then converted into 2048-dimensional feature vectors by an average pooling operation, expressed by formula (5):
l_i^j = Avgpool(f_i^j)    (5)
where Avgpool(·) denotes the average pooling operation, f_i^j denotes the partitioned local-branch feature maps under the different view platforms, and l_i^j (i = 1, …, 4) denotes the pooled 2048-dimensional feature vectors of the 4 local parts.
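For claim 5, the dilation and stride changes on ResNet50 can be obtained with torchvision's replace_stride_with_dilation option (mapping the claim onto the last two stages is an assumption), and a square-ring partition can be written as distance-to-center masks; the 2048-dimensional vectors of formula (5) are then ring-wise averages. The ring boundaries chosen below are illustrative.

```python
# Sketch for claim 5 (assumed PyTorch/torchvision): dilated residual backbone and
# a square-ring partition of the feature map into 4 center-distance regions.
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Dilation factors 2 and 4 with stride 1 in the last two stages (assumed mapping
# of the claim onto torchvision's option): 256x256 input -> 32x32 feature map.
backbone = resnet50(weights=None, replace_stride_with_dilation=[False, True, True])
features = nn.Sequential(*list(backbone.children())[:-2])

def square_ring_partition(feat: torch.Tensor, num_rings: int = 4) -> list:
    """Average-pool the feature map over 4 square rings ordered from center outwards.
    feat: (B, C, H, W); returns a list of 4 tensors of shape (B, C) -- formula (5)."""
    b, c, h, w = feat.shape
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1).expand(h, w)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1).expand(h, w)
    # Chebyshev distance to the feature-map center defines the square rings.
    dist = torch.maximum((ys - (h - 1) / 2).abs(), (xs - (w - 1) / 2).abs())
    edges = torch.linspace(0.0, float(dist.max()) + 1e-6, num_rings + 1)
    pooled = []
    for k in range(num_rings):
        mask = ((dist >= edges[k]) & (dist < edges[k + 1])).float()
        mask = mask / mask.sum().clamp(min=1.0)
        pooled.append((feat * mask).sum(dim=(2, 3)))   # (B, C) ring average
    return pooled

x = torch.randn(2, 3, 256, 256)
feat = features(x)                     # (B, 2048, 32, 32)
parts = square_ring_partition(feat)    # four 2048-dimensional descriptors per image
print(feat.shape, [p.shape for p in parts])
```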
6. The global relation attention guided cross-view geographic positioning method of claim 1, comprising the following steps:
in the global branch, the deep residual network is used to extract and refine large-scale features, yielding a feature map f_j that contains rich semantic information, so that the categories to which different image features belong can be identified; the global feature map is then processed by average pooling to obtain a 2048-dimensional feature vector, expressed by formula (6):
g_j = Avgpool(f_j)    (6)
where g_j denotes the pooled global-branch feature vector.
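Formula (6) is ordinary global average pooling; in tensor terms it is a mean over the spatial dimensions, as in this short sketch (the shapes shown are assumed).

```python
# Formula (6) as a spatial mean: f_j (B, 2048, H, W) -> g_j (B, 2048).
import torch

f_j = torch.randn(2, 2048, 8, 8)   # global-branch feature map (shape assumed)
g_j = f_j.mean(dim=(2, 3))         # Avgpool(f_j)
print(g_j.shape)                   # torch.Size([2, 2048])
```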
7. The global relation attention guided cross-view geographic positioning method of claim 1, comprising the following steps:
a classifier is fused into the feature extraction stage to predict the category of each feature vector; the classifier consists of a fully connected layer, a batch normalization layer, a Dropout layer and a classification layer; it takes the local feature vectors l_i^j and the global feature vector g_j of image j as input and predicts the category to which each feature vector belongs, finally yielding the local prediction probability distribution vectors z_i^j and the global prediction probability distribution vector q_j of the image;
meanwhile, the cross-entropy loss function is used to measure the difference between the predicted probability distribution and the true distribution of the image; the cross-entropy loss is expressed by formula (7):
Loss = Σ_{j=1}^{2} ( −log p(y | x_j) + Σ_{i=1}^{4} −log p̂(y | x̂_i^j) )    (7)
where x̂_i^j denotes the corresponding parts of the original image after processing by the square-ring partition strategy, x_j (j ∈ [1, 2]) denotes the input image, with j = 1 denoting the unmanned aerial vehicle platform and j = 2 denoting the satellite platform; y denotes the true category of the input image; p̂(y | x̂_i^j) and p(y | x_j) respectively denote the normalized probability scores of x̂_i^j and x_j belonging to the true category, defined by formula (8) and formula (9):
p̂(y | x̂_i^j) = exp(z_i^j(y)) / Σ_{c=1}^{C} exp(z_i^j(c))    (8)
p(y | x_j) = exp(q_j(y)) / Σ_{c=1}^{C} exp(q_j(c))    (9)
where C denotes the number of all geo-tag categories in the database.
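A sketch of the claim-7 classifier block and the loss of formulas (7) to (9). The hidden width, dropout rate, class count, the use of a single shared classifier, and the use of F.cross_entropy (which combines the softmax of formulas (8) and (9) with the negative log-likelihood of formula (7)) are assumptions about one reasonable realization.

```python
# Sketch (assumed PyTorch) of the claim-7 classifier (FC -> BN -> Dropout -> classification)
# and the cross-entropy loss of formulas (7)-(9) over global and local descriptors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClassifierBlock(nn.Module):
    def __init__(self, in_dim=2048, hidden=512, num_classes=701, p_drop=0.5):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden)
        self.bn = nn.BatchNorm1d(hidden)
        self.drop = nn.Dropout(p_drop)
        self.cls = nn.Linear(hidden, num_classes)   # classification layer

    def forward(self, v):
        return self.cls(self.drop(self.bn(self.fc(v))))   # unnormalized scores z or q


classifier = ClassifierBlock()
y = torch.tensor([3, 17])                   # true geo-tag categories of a batch of 2
loss = torch.zeros(())
for _ in range(2):                          # j = 1 (drone view), j = 2 (satellite view)
    g = torch.randn(2, 2048)                # stand-in for the global descriptor g_j
    local_parts = [torch.randn(2, 2048) for _ in range(4)]  # stand-ins for l_i^j
    q = classifier(g)
    loss = loss + F.cross_entropy(q, y)     # -log p(y | x_j), formulas (7) and (9)
    for l in local_parts:
        z = classifier(l)
        loss = loss + F.cross_entropy(z, y) # -log p_hat(y | x_hat_i^j), formulas (7) and (8)
print(loss)
```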
CN202310046541.7A 2023-01-31 2023-01-31 Cross view geographic positioning method for global relation attention guidance Pending CN116204675A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310046541.7A CN116204675A (en) 2023-01-31 2023-01-31 Cross view geographic positioning method for global relation attention guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310046541.7A CN116204675A (en) 2023-01-31 2023-01-31 Cross view geographic positioning method for global relation attention guidance

Publications (1)

Publication Number Publication Date
CN116204675A true CN116204675A (en) 2023-06-02

Family

ID=86512223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310046541.7A Pending CN116204675A (en) 2023-01-31 2023-01-31 Cross view geographic positioning method for global relation attention guidance

Country Status (1)

Country Link
CN (1) CN116204675A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116660982A (en) * 2023-08-02 2023-08-29 东北石油大学三亚海洋油气研究院 Full waveform inversion method based on attention convolution neural network
CN116660982B (en) * 2023-08-02 2023-09-29 东北石油大学三亚海洋油气研究院 Full waveform inversion method based on attention convolution neural network

Similar Documents

Publication Publication Date Title
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
Shi et al. Accurate 3-DoF camera geo-localization via ground-to-satellite image matching
CN111242064A (en) Pedestrian re-identification method and system based on camera style migration and single marking
CN111652293A (en) Vehicle weight recognition method for multi-task joint discrimination learning
CN114119993B (en) Remarkable target detection method based on self-attention mechanism
Sun et al. Unmanned surface vessel visual object detection under all-weather conditions with optimized feature fusion network in YOLOv4
CN114241053A (en) FairMOT multi-class tracking method based on improved attention mechanism
CN115019039B (en) Instance segmentation method and system combining self-supervision and global information enhancement
Shen et al. MCCG: A ConvNeXt-based multiple-classifier method for cross-view geo-localization
Lu et al. Improving 3d vulnerable road user detection with point augmentation
CN116204675A (en) Cross view geographic positioning method for global relation attention guidance
Gu et al. Real-time streaming perception system for autonomous driving
Sun et al. Squeeze-and-excitation network-based radar object detection with weighted location fusion
He et al. Multi-level progressive learning for unsupervised vehicle re-identification
Li et al. Material-Guided Multiview Fusion Network for Hyperspectral Object Tracking
Li et al. Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network
Li et al. Efficient thermal infrared tracking with cross-modal compress distillation
Mokalla et al. On designing MWIR and visible band based deepface detection models
Zhu et al. Find gold in sand: Fine-grained similarity mining for domain-adaptive crowd counting
Zhang et al. Remote sensing cross-modal retrieval by deep image-voice hashing
CN113298037B (en) Vehicle weight recognition method based on capsule network
Kim et al. LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection
Sun et al. A cross-view geo-localization method guided by relation-aware global attention
Yuan et al. Cross-Attention Between Satellite and Ground Views for Enhanced Fine-Grained Robot Geo-Localization
Xiong et al. Domain adaptation of object detector using scissor-like networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination