CN116204675A - Global relation attention-guided cross-view geolocation method - Google Patents

Global relation attention-guided cross-view geolocation method

Info

Publication number
CN116204675A
Authority
CN
China
Prior art keywords
global
feature
image
attention
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310046541.7A
Other languages
Chinese (zh)
Inventor
孙静
闫睿
张冰
王法胜
孙福明
朱兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Minzu University
Original Assignee
Dalian Minzu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Minzu University filed Critical Dalian Minzu University
Priority to CN202310046541.7A priority Critical patent/CN116204675A/en
Publication of CN116204675A publication Critical patent/CN116204675A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/587 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using geographical or spatial information, e.g. location
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 - Querying
    • G06F16/532 - Query formulation, e.g. graphical querying
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 - Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 - Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778 - Active pattern-learning, e.g. online learning of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)

Abstract

A global relation attention-guided cross-view geolocation method belongs to the field of computer technology. A deep residual network is used as the backbone, and a global relation attention module captures more robust global structural information of the image for matching. A dual-branch network comprising a global branch and a local branch is designed to capture, respectively, deep features rich in semantic information and local features carrying multi-scale context information. In the local branch, dilated convolution enlarges the receptive field of the feature map, and a square-ring partition strategy splits the feature map at four scales. The feature map of each branch is converted into column-vector descriptors, and a classifier is used to obtain the predicted category of each column vector. Finally, a cross-entropy loss function measures the difference between the predicted category and the true category of the image, improving geolocation accuracy.

Description

Global relation attention-guided cross-view geolocation method
Technical Field
The invention belongs to the technical field of computers and specifically relates to a global relation attention-guided cross-view geolocation method.
Background
Cross-view geolocation can be viewed as a content-based image retrieval task [1][2]: a query image from one platform is matched against images in databases from other platforms to find images of the same geographic location. Previous studies focused mainly on matching between ground-view and satellite or aerial-view images. In recent years, with the wide application of unmanned aerial vehicles (UAVs) [3], drone-view images have been added to cross-view geolocation, and geolocation based on drone views and satellite images has become a current research hotspot.
With convolutional neural networks (CNNs) widely applied in vision fields such as image classification [4][5], object detection [6][7], semantic segmentation [8][9], and action recognition [10][11], some researchers have applied CNNs to the cross-view geolocation task [12], achieving significant progress. However, most cross-view geolocation methods mainly consider the high-level semantic information of the target image and neglect that spatial structure information can effectively improve geolocation accuracy. Zheng et al. [13] treated geolocation as a classification task and measured similarity on image semantic features. However, this method ignores context information from the area surrounding the target, so the extracted features are not sufficiently comprehensive. Wang et al. [14] partitioned high-level image features with a square-ring partition strategy and measured similarity on each part, using context information to improve geolocation accuracy. However, that method directly divides the feature map at four scales and ignores the global structure information of the image, so similar images can be falsely retrieved as correct results. Clearly, fully mining the structural information of the geographic target image helps to improve cross-view geolocation performance.
Disclosure of Invention
To address the fact that most algorithms fail to fully consider the influence of image structure information on matching accuracy in cross-view geolocation, a global relation attention-guided cross-view geolocation method is provided.
First, a deep residual network [15] is adopted as the backbone, and a global relation attention module [16] captures more robust global structural information of the image for matching. Second, a dual-branch network comprising a global branch and a local branch is designed to capture, respectively, deep features rich in semantic information and local features with multi-scale context information. In the local branch, dilated convolution [17] enlarges the receptive field of the feature map, and a square-ring partition strategy [14] splits the feature map at four scales. The feature map of each branch is converted into column-vector descriptors, and a classifier is then used to obtain the predicted category of each column vector. Finally, a cross-entropy loss function [18][19] measures the difference between the predicted category and the true category of the image, improving the accuracy of network training.
Advantages:
1. The global structure information of the image is fully mined by using the global relation attention module to learn the relationships among image feature nodes, so that the network focuses on salient regions and extracts more robust features for image feature matching.
2. A dual-branch structure comprising a global branch and a local branch is designed for cross-view geolocation. The global branch extracts deep features with a deep residual network to obtain a feature map rich in semantic information; the local branch uses dilated convolution to capture richer multi-scale context information.
3. Experimental results on the University-1652, CVUSA, and CVACT datasets show that the method of this technical scheme outperforms other advanced models in geolocation, demonstrating its effectiveness.
Drawings
Fig. 1 is a diagram of the overall network framework of the method in this technical scheme.
Fig. 2 is an enlarged view of the middle-left portion of Fig. 1.
Fig. 3 is an enlarged view of the middle-right portion of Fig. 1.
Fig. 4 is a structure diagram of the global relation attention.
Fig. 5 is a structure diagram of the global spatial relation attention.
Fig. 6 is a structure diagram of the global channel relation attention.
Fig. 7 is a schematic diagram of standard convolution and dilated convolution.
Fig. 8 shows retrieval results for the drone-view target localization task.
Fig. 9 shows retrieval results for the drone navigation task.
Fig. 10 shows retrieval results on the CVUSA dataset.
Detailed Description
The related work on cross-view geolocation is presented first; the method and network structure adopted in this technical scheme are then described in detail; finally, the experimental results and ablation experiments are analyzed and summarized.
1. Related work
Early cross-view geolocation studies were based primarily on ground-view and aerial-view images. Workman et al. [20] extracted image features with two publicly available pre-trained models and showed that deep features can distinguish images of different geographic locations. However, that method only extracts image features at a single scale and cannot effectively exploit multi-scale information, so the matching features extracted by the network are not rich enough. On this basis, Workman et al. [21] constructed the CVUSA (Cross-View USA) dataset, fused aerial-image features at multiple scales, and improved cross-view localization results. Lin et al. [22] built 78,000 pairs of street-view and 45° aerial images from public data and then employed a deep Siamese network to extract features for cross-view localization. Vo et al. [23] evaluated different deep learning methods and trained the network with a distance-based logistic (DBL) loss layer and rotational invariance, improving localization accuracy. Considering that image semantic information is less affected by viewpoint changes, Tian et al. [24] performed cross-view matching by extracting buildings from images to obtain the final geolocation result. Altwaijry et al. [25] focused on the matching task for aerial image pairs, learning discriminative representations from image pairs with a data-driven approach and solving the matching problem for ultra-wide-baseline images. Furthermore, Zhai et al. [26] first extracted aerial-image features, mapped them to the ground view with an adaptive transformation, and finally used end-to-end learning to minimize the difference between the predicted ground-view semantic features and the semantic features extracted directly from the ground image, completing cross-view geolocation. Hu et al. [27] combined a Siamese network with NetVLAD [28], encoded local features into a global image descriptor, and introduced a weighted soft-margin ranking loss to accelerate network convergence and improve network performance. Shi et al. [29] argued that existing methods ignore the differences in appearance and geometry between ground-view and aerial-view images, so they used a polar transform to approximately align aerial images with ground images. To further solve the cross-view orientation alignment problem, Shi et al. [30] designed a dynamic similarity matching network (DSM) to align the orientations of cross-view images, making the image matching results more accurate. Liu et al. [31] argued that geometric cues can be used for localization and designed a Siamese network that encodes the orientation information of each image pixel into the network model, so that the network learns appearance and geometric information simultaneously, improving recall and precision.
To address scene changes over time, Rodrigues et al. [32] proposed a semantics-driven data augmentation technique that simulates scene change in cross-view image matching and then used a multi-scale attention module for image matching, improving network performance. Regmi et al. [33] were the first to apply generative adversarial networks (GANs) [34] to cross-view localization: they synthesized aerial images from ground views for image matching, but the approach is not end-to-end. Toker et al. [35] synthesized ground views from satellite views with a polar transform and then performed image retrieval, integrating both steps into an end-to-end architecture and achieving advanced geolocation performance. The above methods mainly target matching between ground-view and aerial-view images and consider only two views for geolocation, ignoring the important role of drone-view images, so they lack feature learning for the multi-view matching task.
Recent cross-view geolocation studies suggest that adding viewpoints can improve geolocation accuracy, making the drone a third platform for solving the geolocation problem. Zheng et al. [13] constructed the University-1652 dataset containing satellite-view, ground-view, and drone-view images, treated all view images of the same location as one category, completed the geolocation task by classification, and applied the instance loss [36] to optimize the model. However, that method only focuses on image semantic information and does not consider the influence of image detail information on cross-view geolocation. To address this, Wang et al. [14] proposed the Local Pattern Network (LPN), which uses the context information of the image as an auxiliary cue and applies a square-ring partition strategy so that the network attends to the environment around the target building, effectively remedying the neglect of detail information in [13] and obtaining good matching results. Ding et al. [37] proposed a location-classification-based cross-matching method (LCM) that alleviates the sample imbalance between satellite and drone images and improves image-matching accuracy. Attention mechanisms are widely used in computer vision [16][38][39][40]; they make the network focus more on discriminative features, filter out irrelevant information, and improve model training. Zhang et al. [16] integrated relation-aware global attention into a person re-identification network and improved feature representation and re-identification performance by capturing the global structure information of the image. To avoid the impact of target offset and view scaling on image matching, Zhuang et al. [38] proposed a multi-scale block attention (MSBA) structure to enhance the salient features of different regions of the feature map. Lin et al. [39] designed a unit subtraction attention module (USAM) that makes the model focus on salient regions in the image by detecting key points in the feature map, improving performance with fewer parameters. Dai et al. [40] argued that some CNN-based operations cause loss of fine-grained image information, so they introduced a Transformer structure [41] into cross-view localization and proposed a feature segmentation and region alignment (FSRA) method, which partitions the feature map into different regions according to the heat distribution and supervises each region with classification, effectively realizing cross-view localization.
These methods provide new research ideas for solving inaccurate cross-view geolocation. Inspired by them, this scheme fully combines the attention mechanism with the feature extraction network and mines structural information from a global perspective; meanwhile, a dual-branch structure jointly trains the global and local image features, and dilated convolution is fused into the local branch to enlarge the receptive field of the feature map and capture richer multi-scale context information, thereby improving cross-view localization accuracy.
Cross-view geolocation is an important research direction in image retrieval: it matches images of the same geographic location taken from different platforms. Most existing methods fail to fully consider the effect of image structure information on cross-view geolocation, so the extracted features cannot fully characterize the image, which harms localization accuracy. Accordingly, a global relation attention-guided cross-view geolocation method is proposed: global relation attention is fully fused with the feature extraction network so that the network can capture rich global structure information, improving the representation capability of the features. Meanwhile, considering the important roles of semantic and context information in geolocation, a parallel joint-training structure with a global branch and a local branch is designed to fully mine multi-scale context features for image matching, further improving cross-view geolocation accuracy. Quantitative and qualitative experimental results on the University-1652, CVUSA, and CVACT datasets show that, compared with other advanced methods, the algorithm of this technical scheme has notable advantages in recall (Recall@K) and image retrieval precision (AP).
2. Method
2.1 Network framework
The network framework proposed in this technical scheme is shown in Fig. 1. The whole network is divided into a global branch and a local branch, which share network weights. First, ResNet50 is adopted as the backbone network; its average-pooling layer and classification layer are removed, and the features of the input image are extracted. Meanwhile, a relation-aware global attention (RGA) module, comprising global spatial relation attention (RGA-S) and global channel relation attention (RGA-C), is added after the shallow features to capture the global structure information of the image. Second, the output features of the previous stage are processed separately by a dual-branch structure, so that global and local information are both attended to effectively. The global branch extracts high-level semantic information of the whole image; the local branch focuses on the deep features of the network, retaining more image detail. At the same time, to incorporate information from the area surrounding the target, a square-ring partition strategy divides the feature map into four different regions in the local branch. Finally, global average pooling converts the high-level image features into column-vector descriptors. In the training phase, a classifier module produces the predicted class probability of each column-vector descriptor, and a cross-entropy loss minimizes the difference between the predicted class and the true class. During testing, the similarity between the query image and database images is computed with the Euclidean distance, and the retrieval results are finally ranked by similarity.
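The data flow described above can be summarized in a short PyTorch sketch. It assumes a torchvision ResNet-50 backbone; rga_module, local_stages, partition_fn, and class_blocks are hypothetical stand-ins for the attention module, the dilated local stages, the square-ring pooling, and the classifier blocks sketched in later sections, so this is an illustration of the branch layout rather than the patent's reference implementation.

```python
# Sketch of the dual-branch data flow (global branch + local branch on a shared
# ResNet-50 stem). rga_module, local_stages, partition_fn and class_blocks are
# injected stand-ins with hypothetical names, not components defined by the patent.
import torch.nn as nn
from torchvision import models

class DualBranchNet(nn.Module):
    def __init__(self, rga_module, local_stages, partition_fn, class_blocks):
        super().__init__()
        backbone = models.resnet50(weights=None)
        # shallow stages (conv1 .. layer2); the relation-aware attention follows them
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1, backbone.layer2)
        self.rga = rga_module                 # global relation attention (RGA-S + RGA-C)
        self.global_stages = nn.Sequential(backbone.layer3, backbone.layer4)  # 8x8 map for 256x256 input
        self.local_stages = local_stages      # dilated layer3/layer4 copy, 32x32 map (section 2.3)
        self.partition_fn = partition_fn      # square-ring partition + pooling, formula (5)
        self.gap = nn.AdaptiveAvgPool2d(1)    # replaces the removed ResNet pooling/classification head
        self.class_blocks = class_blocks      # nn.ModuleList, one classifier per descriptor (section 2.5)

    def forward(self, x):
        x = self.rga(self.stem(x))                        # shallow features + global relation attention
        g = self.gap(self.global_stages(x)).flatten(1)    # global 2048-d descriptor, formula (6)
        parts = self.partition_fn(self.local_stages(x))   # four square-ring 2048-d descriptors
        return [blk(f) for blk, f in zip(self.class_blocks, [g] + parts)]
```

During training, the returned list of class-score vectors feeds the cross-entropy loss of section 2.5; at test time the pooled descriptors themselves are compared with the Euclidean distance for retrieval.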
2.2 Global relation attention module
In the cross-view geolocation task, the global relation attention module lets the network notice discriminative features in the images, helping it distinguish buildings with similar appearance. The global relation attention module, comprising global spatial relation attention (RGA-S) and global channel relation attention (RGA-C), is combined with the feature extraction network to construct a global relation attention-guided feature extraction network, and attention weights are computed by learning the relationships between feature nodes. The global relation attention is shown in Fig. 4. Each feature vector in the feature map is represented as a feature node $x_i$, where $i = 1, 2, \dots, N$ and $N$ is the number of feature nodes. For a given feature node $x_i$, the correlations $r_{i,j}$ and $r_{j,i}$ between the current node and every other node are computed, where $j = 1, 2, \dots, N$, yielding the relation vector of $x_i$: $r_i = [r_{i,1}, r_{i,2}, \dots, r_{i,N}, r_{1,i}, r_{2,i}, \dots, r_{N,i}]$. The feature node is then concatenated with its relation vector $r_i$ to obtain the relation-aware feature, from which the attention weight of the current feature node is inferred.
2.2.1 Global spatial relation attention
Global spatial relation attention (RGA-S) learns the correlations between all feature nodes in the spatial dimension of the feature map, enabling the network to capture the features of salient target regions. The global spatial relation attention is shown in Fig. 5:
Specifically, for the feature map $\mathcal{F} \in \mathbb{R}^{C \times H \times W}$ obtained from the neural network, the C-dimensional feature vector at each spatial position is taken as a feature node, forming a graph $G_S$ with $N = W \times H$ nodes in total; the feature nodes are denoted $x_i$, where $i = 1, 2, \dots, N$. The correlation $r_{i,j}$ between feature nodes $x_i$ and $x_j$ can be obtained by a dot-product operation, specifically defined as formula (1):

$$r_{i,j} = f_s(x_i, x_j) = \big(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(x_i)))\big)^{T}\big(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(x_j)))\big) \qquad (1)$$

where $f_s(\cdot)$ denotes the dot-product operation, $\mathrm{ReLU}(\cdot)$ the rectified linear unit activation function, $\mathrm{BN}(\cdot)$ a batch normalization layer, and $\mathrm{Conv}(\cdot)$ a 1×1 convolution whose dimension-reduction ratio is controlled by a predefined positive integer. The correlation $r_{j,i}$ between feature nodes $x_j$ and $x_i$ is obtained in the same way, and the pair $(r_{i,j}, r_{j,i})$ represents the pairwise relationship between $x_i$ and $x_j$. Finally, the relation matrix $R_S \in \mathbb{R}^{N \times N}$ represents the correlations between all nodes, where $r_{i,j} = R_S(i, j)$.

Stacking the correlations between the $i$-th feature node and all nodes in a fixed order gives the spatial relation vector $r_i = [R_S(i, :), R_S(:, i)] \in \mathbb{R}^{2N}$, where $R_S(i, :)$ denotes the correlations between the $i$-th feature node and all nodes, and $R_S(:, i)$ denotes the correlations between all nodes and the $i$-th node. To let the network fully exploit the global structure information of the feature nodes, the spatial relation vector $r_i$ is concatenated with the feature node $x_i$ itself to obtain the spatial relation-aware feature $E_S$, which can be defined as formula (2):

$$E_S = C(x_i, r_i) = \Big(\mathrm{pool}_c\big(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(x_i)))\big),\ \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(r_i)))\Big) \qquad (2)$$

where $C(\cdot)$ denotes the concatenation operation and $\mathrm{pool}_c(\cdot)$ denotes global average pooling over the channel dimension, reducing the channel dimension to 1. The spatial attention weight $a_i$ of the $i$-th feature can then be computed from the spatial relation-aware feature, defined as formula (3):

$$a_i = \mathrm{sigmoid}\big(\mathrm{BN}(\mathrm{Conv}_2(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_1(E_S)))))\big) \qquad (3)$$

where $\mathrm{sigmoid}(\cdot)$ denotes the sigmoid activation function, $\mathrm{Conv}_2(\cdot)$ reduces the number of channels to 1, and $\mathrm{Conv}_1(\cdot)$ reduces the dimension at a fixed ratio.
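As a minimal sketch of formulas (1) to (3), the spatial attention can be written as follows; the reduction ratio of 8 and the placement of the module are assumed values, and the code follows the general relation-aware global attention design [16] rather than the patent's exact layer configuration.

```python
import torch
import torch.nn as nn

class RGASpatial(nn.Module):
    """Sketch of global spatial relation attention, formulas (1)-(3); ratio=8 is assumed."""
    def __init__(self, in_channels, height, width, ratio=8):
        super().__init__()
        n = height * width                                    # number of spatial feature nodes
        self.embed = nn.Sequential(                           # Conv-BN-ReLU embedding used in formula (1)
            nn.Conv2d(in_channels, in_channels // ratio, 1),
            nn.BatchNorm2d(in_channels // ratio), nn.ReLU(inplace=True))
        self.embed_rel = nn.Sequential(                       # embeds the 2N-dim relation vector, formula (2)
            nn.Conv2d(2 * n, 2 * n // ratio, 1),
            nn.BatchNorm2d(2 * n // ratio), nn.ReLU(inplace=True))
        mid = 1 + 2 * n // ratio                              # pooled node (1 channel) + relation channels
        self.weight = nn.Sequential(                          # formula (3): Conv1 -> Conv2 -> sigmoid
            nn.Conv2d(mid, mid // 2, 1), nn.BatchNorm2d(mid // 2), nn.ReLU(inplace=True),
            nn.Conv2d(mid // 2, 1, 1), nn.BatchNorm2d(1), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        e = self.embed(x).flatten(2)                          # B x C' x N embedded nodes
        rel = torch.bmm(e.transpose(1, 2), e)                 # formula (1): pairwise dot products, B x N x N
        r = torch.cat([rel, rel.transpose(1, 2)], dim=1)      # stacks R_S(:, j) and R_S(j, :) for every node j
        r = r.view(b, 2 * n, h, w)                            # 2N-dim relation vector at each position
        node = self.embed(x).mean(dim=1, keepdim=True)        # pool_c over channels, formula (2)
        e_s = torch.cat([node, self.embed_rel(r)], dim=1)     # relation-aware feature E_S
        return x * self.weight(e_s)                           # spatial attention weights a_i applied to x
```

Because the relation vector has a fixed length 2N, the module is bound to one spatial size; for 256×256 inputs it could, for instance, be instantiated as RGASpatial(512, 32, 32) after the second residual stage (an assumed placement).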
2.2.2 Global channel relation attention
Global channel relation attention (RGA-C) learns the relationships between all feature nodes in the channel dimension of the feature map, giving each channel a different weight. The global channel relation attention is shown in Fig. 6:
Specifically, for the feature map $\mathcal{F} \in \mathbb{R}^{C \times H \times W}$, the feature map on each channel is taken as a feature node, forming a graph $G_C$ with $C$ nodes in total; each feature node is denoted $x_i$, where $i = 1, 2, \dots, C$.

For the input feature map $S$, the spatial dimension is first compressed so that each channel yields one feature node. The correlation $r_{i,j}$ between feature nodes $x_i$ and $x_j$ can then be obtained, defined as formula (4):

$$r_{i,j} = f_c(x_i, x_j) = \big(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(x_i)))\big)^{T}\big(\mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(x_j)))\big) \qquad (4)$$

where $f_c(\cdot)$ denotes the dot-product operation. The correlation $r_{j,i}$ between feature nodes $x_j$ and $x_i$ is obtained by the same processing, and the matrix $R_C \in \mathbb{R}^{C \times C}$ represents the correlations between all nodes. Stacking the correlations between the $i$-th feature node and all nodes gives the channel relation vector $r_i = [R_C(i, :), R_C(:, i)] \in \mathbb{R}^{2C}$. Similarly to formulas (2) and (3), the channel relation-aware feature $E_C$ and the channel attention weight $a_i$ can then be obtained.
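A channel counterpart can be sketched in the same spirit; for brevity the Conv-BN-ReLU embedding of formula (4) is simplified here to a raw dot product between flattened channels, and the reduction ratio of 8 is assumed, so this is only an illustration of the channel relation idea rather than the patented module.

```python
import torch
import torch.nn as nn

class RGAChannel(nn.Module):
    """Simplified sketch of global channel relation attention (formula (4)); ratio=8 is assumed
    and the embedding of formula (4) is replaced by a raw dot product between channels."""
    def __init__(self, in_channels, ratio=8):
        super().__init__()
        c = in_channels
        self.embed_rel = nn.Sequential(                 # embeds the 2C-dim channel relation vector
            nn.Conv1d(2 * c, 2 * c // ratio, 1),
            nn.BatchNorm1d(2 * c // ratio), nn.ReLU(inplace=True))
        mid = 1 + 2 * c // ratio
        self.weight = nn.Sequential(
            nn.Conv1d(mid, mid // 2, 1), nn.BatchNorm1d(mid // 2), nn.ReLU(inplace=True),
            nn.Conv1d(mid // 2, 1, 1), nn.BatchNorm1d(1), nn.Sigmoid())

    def forward(self, x):
        b, c, h, w = x.shape
        nodes = x.flatten(2)                            # B x C x (H*W): one node per channel
        rel = torch.bmm(nodes, nodes.transpose(1, 2))   # pairwise channel correlations, B x C x C
        r = torch.cat([rel, rel.transpose(1, 2)], 1)    # B x 2C x C relation vector per channel
        node = nodes.mean(dim=2, keepdim=True)          # spatially compressed node, B x C x 1
        e_c = torch.cat([node.transpose(1, 2), self.embed_rel(r)], dim=1)  # relation-aware feature E_C
        a = self.weight(e_c).view(b, c, 1, 1)           # per-channel attention weights a_i
        return x * a
```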
2.3 Local branch
To let the network capture rich multi-scale context information and retain finer spatial structure information, so that the most similar images can be retrieved from the database and cross-view geolocation accuracy is improved, the local branch uses dilated convolutions with multiple dilation rates [17] to enlarge the receptive field of the feature map without losing image detail, improving the network's ability to capture multi-scale information. Meanwhile, a square-ring partition strategy divides the feature map at four scales to obtain rich spatial context information.
Dilated convolution enlarges the receptive field by inserting r-1 zeros between kernel weights, where r is the dilation factor; in a standard convolution r = 1. The standard and dilated convolution structures are shown in Fig. 7, where part (a) shows a standard convolution and part (b) a dilated convolution with dilation factor 2. Under the same conditions, a 3×3 kernel gives a 3×3 receptive field for the standard convolution and a 5×5 receptive field for the dilated convolution. Compared with standard convolution, dilated convolution can capture richer multi-scale image information for matching.
Specifically, this module uses dilated convolution operations with dilation factors of 2 and 4, respectively, to enlarge the receptive field of the feature map. At the same time, the strides of the convolution layer and the downsampling layer in the last residual block of ResNet50 are set to 1; when the input image resolution is 256×256, the feature map output by the backbone network has resolution 8×8, while the feature map output by the dilated residual network has resolution 32×32.
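A sketch of how such a dilated local backbone can be built with torchvision is given below; mapping the dilation factors 2 and 4 onto the last two ResNet-50 stages via replace_stride_with_dilation is an assumption consistent with the stated 32×32 output, not a statement of the patented implementation.

```python
import torch
from torchvision import models

# Local-branch stages sketch: a ResNet-50 whose last two stages replace their strides
# with dilations gets dilation factors 2 and 4 and keeps a 32x32 feature map for a
# 256x256 input (the shared shallow stages output 512 x 32 x 32). Mapping the two
# factors onto layer3/layer4 this way is an assumption consistent with the sizes above.
dilated = models.resnet50(weights=None, replace_stride_with_dilation=[False, True, True])
local_stages = torch.nn.Sequential(dilated.layer3, dilated.layer4)

stem_out = torch.randn(1, 512, 32, 32)            # shape of the shared shallow features
print(local_stages(stem_out).shape)               # torch.Size([1, 2048, 32, 32])

# For comparison, a plain 3x3 convolution with dilation=2 covers a 5x5 receptive field,
# as illustrated in Fig. 7(b).
conv_dilated = torch.nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)
```

This local_stages module is what the dual-branch sketch in section 2.1 expects for its local branch, while the global branch keeps the standard strides and an 8×8 output.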
To help the network better distinguish images of different geographic locations, the environment around the target building is used as auxiliary information: a square-ring partition strategy is adopted in the local branch to divide the feature map into four parts according to their distance from the image centre, producing feature maps of different regions. The image features are then converted into 2048-dimensional feature vectors by an average pooling operation, expressed as formula (5):

$$l_j^i = \mathrm{Avgpool}(f_j^i), \quad i = 1, 2, 3, 4 \qquad (5)$$

where $\mathrm{Avgpool}(\cdot)$ denotes the average pooling operation, $f_j^i$ denotes the $i$-th partitioned local-branch feature map of view platform $j$, and $l_j^i$ denotes the corresponding pooled 2048-dimensional local-branch feature vector.
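The square-ring partition and the pooling of formula (5) can be sketched as follows; splitting the map into four nested rings of equal width around the centre is an assumed geometry, not the exact partition specified by the patent.

```python
import torch

def square_ring_partition(fmap, num_rings=4):
    """Split a B x C x H x W map into nested square rings around the centre and
    average-pool each ring into a C-dimensional descriptor, as in formula (5)."""
    b, c, h, w = fmap.shape
    step_h, step_w = h // (2 * num_rings), w // (2 * num_rings)
    descriptors, inner_sum, inner_cnt = [], None, None
    for k in range(1, num_rings + 1):
        mh, mw = step_h * k, step_w * k
        block = fmap[:, :, h // 2 - mh:h // 2 + mh, w // 2 - mw:w // 2 + mw]
        blk_sum = block.sum(dim=(2, 3))            # B x C sum over the enclosing square
        blk_cnt = block.shape[2] * block.shape[3]
        if inner_sum is None:                      # innermost square region
            ring_sum, ring_cnt = blk_sum, blk_cnt
        else:                                      # subtract the inner square to keep only the ring
            ring_sum, ring_cnt = blk_sum - inner_sum, blk_cnt - inner_cnt
        descriptors.append(ring_sum / ring_cnt)    # average pooling over the ring region
        inner_sum, inner_cnt = blk_sum, blk_cnt
    return descriptors                             # list of num_rings tensors, each B x C

# usage sketch: a 32x32 local feature map yields four 2048-d ring descriptors
parts = square_ring_partition(torch.randn(2, 2048, 32, 32))
print([tuple(p.shape) for p in parts])             # [(2, 2048), (2, 2048), (2, 2048), (2, 2048)]
```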
2.4 Global branch
Considering that the semantic information attended to by deep networks is also an important part of the cross-view geolocation task, a global branch is designed in parallel with the local branch. In the global branch, the deep residual network extracts and refines large-scale features, yielding a feature map $f_j$ rich in semantic information, so that the network can recognize the categories of different image features. The global feature map is then processed by average pooling to obtain a 2048-dimensional feature vector, expressed as formula (6):

$$g_j = \mathrm{Avgpool}(f_j) \qquad (6)$$

where $g_j$ denotes the pooled global-branch feature vector.
2.5 Classification learning and loss function
The technical proposal fuses the classifierAfter merging into the feature extraction stage, it is used to predict the class of each feature vector. The classifier consists of a fully connected Layer (Fully Connected Layer, FC), a batch normalization Layer (Batch Normalization Layer, BN), a Dropout Layer (Dropout Layer), and a classification Layer (Classification Layer, cls). Local feature vector of image
Figure BDA0004055726940000074
And global feature vector g j As input, predicting the category to which each feature vector belongs, and finally obtaining the local prediction probability distribution vector of the image>
Figure BDA0004055726940000075
And a global predictive probability distribution vector q j
The loss function used in the technical scheme is cross entropy loss, and the distribution difference between the prediction probability and the true probability of the image is measured by using the function, so that the image characteristics are better learned, and the network training precision is improved. The cross entropy loss can be represented by equation (7):
Figure BDA0004055726940000081
wherein the method comprises the steps of
Figure BDA0004055726940000082
Representing corresponding original image after square ring division strategy processing, x j (j∈[1,2]) Representing an input image, j=1 represents a drone platform, and j=2 represents a satellite platform. y represents the true class of the input image, +.>
Figure BDA0004055726940000083
Respectively indicate->
Figure BDA0004055726940000084
And x j Normalized probability scores belonging to the true class, defined by equation (8) and equation (9),
Figure BDA0004055726940000085
Figure BDA0004055726940000086
where C represents the number of all geotag categories in the database.
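A sketch of the classifier block and of the loss in formulas (7) to (9) is given below; the 512-dimensional bottleneck and the dropout rate of 0.5 are assumed hyper-parameters not specified in the text.

```python
import torch.nn as nn

class ClassBlock(nn.Module):
    """Sketch of one classifier: FC -> BN -> Dropout -> classification layer.
    The 512-d bottleneck and p=0.5 dropout are assumed hyper-parameters."""
    def __init__(self, num_classes, in_dim=2048, bottleneck=512, p_drop=0.5):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(in_dim, bottleneck),
            nn.BatchNorm1d(bottleneck),
            nn.Dropout(p=p_drop),
            nn.Linear(bottleneck, num_classes))

    def forward(self, v):            # v: B x 2048 column-vector descriptor
        return self.block(v)         # unnormalised class scores

def total_loss(score_vectors, labels):
    """Formulas (7)-(9): cross-entropy over softmax-normalised scores, summed over the
    global descriptor and the four square-ring descriptors of each image."""
    ce = nn.CrossEntropyLoss()       # applies the softmax of formulas (8)-(9) internally
    return sum(ce(z, labels) for z in score_vectors)
```

With five descriptors per image (one global, four local), five such blocks would be instantiated and their cross-entropy terms summed, as in formula (7).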
3 Experiments
3.1 Datasets
This technical scheme uses three datasets, University-1652 [13], CVUSA [21], and CVACT [31], to train and test the proposed method.
(1) University-1652 is a multi-view, multi-source dataset containing drone-view, satellite-view, and ground-view images of 1,652 buildings from 72 universities, with no duplicate images between the training and test sets. We use the drone-view and satellite-view images in this dataset to study two tasks: drone-view target localization and drone navigation. In the drone-view target localization task, the drone-view query set contains 701 image categories, each corresponding to one truly matching satellite image. In the drone navigation task, the satellite-view query set contains 701 image categories, each corresponding to 54 truly matching drone images.
(2) The CVUSA dataset includes satellite images and panoramic ground images, with 35532 image pairs for training and 8884 image pairs for testing.
(3) CVACT is a larger benchmark dataset that also contains 35,532 training image pairs; 8,884 image pairs are used as the validation set, and an additional 92,802 image pairs are used as the test set.
3.2 Experimental details
To ensure fairness of the experiments, all algorithms were implemented on a Linux server running Ubuntu 20.04, and all performance comparisons are based on results under this configuration. The server is equipped with a GTX 3090 GPU with 24 GB of video memory. The model of this technical scheme is implemented with the PyTorch framework. Before training, all input images are resized to 256×256, and data augmentation is performed with horizontal flipping and random rotation. The model is updated with an SGD optimizer with momentum 0.9 and weight decay 0.0005, and the initial learning rate is set to 0.001. For better convergence of the network, the model is trained for 140 epochs on the University-1652 dataset and 100 epochs on the CVUSA and CVACT datasets. In the test phase, the similarity of images is evaluated by computing the Euclidean distance between the query image and database images.
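The stated training and test configuration maps onto standard PyTorch components as sketched below; the rotation range and the normalization statistics are assumptions, and only the resize, flip, rotation, and SGD hyper-parameters come from the text.

```python
import torch
from torchvision import transforms

# Data pipeline and optimiser corresponding to the stated settings; the 15-degree
# rotation range and the ImageNet normalisation statistics are assumed values.
train_tf = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def make_optimizer(model):
    # SGD with momentum 0.9, weight decay 0.0005 and initial learning rate 0.001
    return torch.optim.SGD(model.parameters(), lr=0.001,
                           momentum=0.9, weight_decay=0.0005)

def euclidean_ranking(query_feats, gallery_feats):
    """Test-time retrieval: rank gallery images by Euclidean distance to each query."""
    dists = torch.cdist(query_feats, gallery_feats)   # Q x G distance matrix
    return dists.argsort(dim=1)                       # ascending: nearest image first
```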
3.3 Performance comparison
3.3.1 Quantitative comparison
This technical scheme adopts recall (Recall@K) and average precision (AP) as image retrieval performance metrics. Recall@K is the ratio of correctly matched images among the first K retrieved results to all correct images in the database and measures recall; this embodiment mainly considers the case K = 1. AP is the area under the precision-recall (PR) curve and reflects the ratio of retrieved true matches to the total number of retrieval results, measuring precision. Larger Recall@K and AP values indicate higher image retrieval accuracy.
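Recall@K and AP can be computed from the ranked retrieval lists as sketched below; this follows the common definitions of the two metrics and is not the evaluation script used to produce the reported numbers.

```python
import torch

def recall_at_k(ranked_labels, query_labels, k=1):
    """Fraction of queries with at least one correct match among the top-k results."""
    hits = (ranked_labels[:, :k] == query_labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

def mean_average_precision(ranked_labels, query_labels):
    """Mean over queries of the area under the precision-recall curve of the ranking."""
    matches = (ranked_labels == query_labels.unsqueeze(1)).float()      # Q x G relevance
    cum_hits = matches.cumsum(dim=1)
    ranks = torch.arange(1, matches.shape[1] + 1, dtype=torch.float32)
    precision = cum_hits / ranks                                        # precision at every rank
    ap = (precision * matches).sum(dim=1) / matches.sum(dim=1).clamp(min=1)
    return ap.mean().item()
```

Here ranked_labels holds the gallery labels reordered by the distance ranking from the retrieval sketch above.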
To illustrate the effectiveness of the proposed method, it is compared with other CNN-based algorithms on the three datasets University-1652, CVUSA, and CVACT. The comparison results on the University-1652 dataset are shown in Table 1; the compared methods include Instance Loss [13], LCM [37], LPN [14], Instance Loss + USAM [39], and LPN + USAM [39].
Table 1. Quantitative results on the University-1652 dataset.
Contrastive Loss [22], Triplet Loss [42], and Soft Margin Triplet Loss [27] denote results obtained by replacing the loss function of the Instance Loss method [13]. The optimal and suboptimal results of each evaluation metric are indicated by brackets.
As can be seen from Table 1, the proposed method achieves the best results in both tasks on the University-1652 dataset. In the drone-view target localization task, the algorithm reaches 81.06% and 83.74% on the Recall@K and AP metrics respectively, improvements of 3.99% and 3.65% over the suboptimal method LPN + USAM [39]. In the drone navigation task, it reaches 89.58% and 79.63% on Recall@K and AP, improvements of 3.13% and 4.84% over the suboptimal method LPN [14], which demonstrates the clear advantage of the proposed method in image retrieval performance.
The comparison results on the CVUSA and CVACT datasets are shown in Table 2. Since the ground images in these two datasets are panoramic, a sequential partition strategy [14] is adopted to divide the images. The compared methods include CVM-Net [27], Instance Loss [13], Regmi et al. [33], Siam-FCANet [43], CVFT [12], LPN [14], Instance Loss + USAM [39], and LPN + USAM [39]. The LPN [14] and LPN + USAM [39] results were generated by training with their publicly released code, while the other methods use the results provided by their authors.
Table 2. Quantitative results on the CVUSA and CVACT datasets.
The optimal and suboptimal results of each evaluation metric are indicated by brackets. Methods marked in the table use additional orientation information as input.
As can be seen from Table 2, on the CVUSA dataset the proposed method reaches 88.00% and 99.47% on the evaluation metrics R@1 and R@Top1%, respectively. Compared with the other nine advanced models, both metrics improve markedly; in particular, R@1 improves by 2.03%. On the CVACT_val dataset, the method reaches 80.98% R@1 and 96.53% R@Top1%, both optimal results, demonstrating the effectiveness of the proposed method.
3.3.2 Qualitative results
Fig. 8 and Fig. 9 show retrieval results of the method on the University-1652 dataset, visualizing the drone-view target localization task and the drone navigation task respectively, and Fig. 10 shows retrieval results on the CVUSA dataset. In the qualitative results, each row shows the retrieval results for one location: the image to the left of the dashed line is the query image, and the images to the right are the top-ranked images in the matching results. Yellow boxes indicate correct retrievals and blue boxes indicate incorrect retrievals.
For drone-view target localization, only one truly matching image appears among the first five results in Fig. 8, because each drone-view image has only one matching satellite image; this shows that the method can correctly retrieve the matching image despite interference from similar images. For the drone navigation task, the first five results in Fig. 9 are all correctly matched images, since each satellite image has 54 matching drone images. Each ground image in the CVUSA dataset corresponds to one correct satellite image, and the first image in the retrieval results of each query in Fig. 10 is a correct match. The qualitative analysis shows that the method retrieves correct results on both datasets.
3.4 Ablation experiments
To verify the effectiveness of each module, the proposed method is analyzed in this section through ablation experiments on the University-1652 dataset.
3.4.1 Effectiveness of the global relation attention module
To verify the effectiveness of the global relation attention module, two ablation experiments were performed. In the first, the global relation attention module is removed and only the dual-branch network is used for image feature extraction. In the second, the SE attention module from SENet [44] is added to the network to obtain image attention in the channel dimension.
From the results in Table 3, it can be observed that the global relation attention module lets the network focus on the discriminative features of the image and improves its retrieval ability, yielding better results on the metrics than either using no attention mechanism or using the SE attention module.
Table 3 global relationship attention module ablation experimental results comparison.
(a) Indicating that the attention mechanism is not used; (b) represents adding a SE attention module; (c) represents the use of an RGA attention module.
The evaluation index optimum is indicated by brackets.
3.4.2 Effectiveness of dilated convolution
To verify the effectiveness of dilated convolution, three ablation experiments were performed in which the dilated convolutions in the local branch were adjusted and image features were extracted with different dilation rates. From the results in Table 4, it can be seen that enlarging the receptive field of the feature map with dilated convolution effectively captures the detail information of the image and thus improves cross-view geolocation accuracy; the model performs best when the dilation factors are 2 and 4.
Table 4 comparison of the results of the dilation convolution ablation experiments.
(a) The expansion factors of the local branch residual blocks are respectively 1 and 1;
(b) The expansion factors of the local branch residual blocks are respectively 1 and 2;
(c) The expansion factors of the local branch residual blocks are respectively 2 and 2;
(d) The local branch residual block expansion factors are respectively 2 and 4.
The evaluation index optimum is indicated by brackets.
3.4.3 Influence of input image size on the results
Training with high-resolution images achieves higher accuracy but requires more computational resources and time. Because resources are limited, low-resolution input images are needed in practice, which reduces the accuracy of image matching. A set of ablation experiments was designed to observe the effect of input images of different resolutions on model performance. As shown in Table 5, when the input image size is increased from 224 to 320, both the Recall@1 and AP (image retrieval precision) values of the network improve; when the image size is increased to 384, performance degrades slightly.
Table 5 effect of different resolution input images on the results.
(a) Representing an input image size of 224×224;
(b) Representing the input image size as 256×256;
(c) Representing an input image size of 320×320;
(d) Representing an input image size of 384 x 384.
The evaluation index optimum is indicated by brackets.
4 Conclusion
This technical scheme provides a global relation attention-guided cross-view geolocation method. The method uses global relation attention to capture the global structure information of the image and extracts more robust image features for geolocation. Meanwhile, a dual-branch strategy is used for joint training: in the local branch, dilated convolution enlarges the receptive field of the feature map, which is partitioned at four scales. Feature representations containing context information and semantic information are obtained through the dual-branch structure to compute image class probabilities, further improving geolocation accuracy. On the University-1652, CVUSA, and CVACT datasets, the method yields clear improvements in Recall@K and AP. In addition, it can effectively eliminate the interference of similar buildings and retrieve correct images.
References
[1] Ahmed K T, Ummesafi S, Iqbal A. Content based image retrieval using image features information fusion. Information Fusion, 2019, 51: 76-99.
[2] Saritha R R, Paul V, Kumar P G. Content based image retrieval using deep learning process. Cluster Computing, 2019, 22(2): 4187-4200.
[3] Outay F, Mengash H A, Adnan M. Applications of unmanned aerial vehicle (UAV) in road safety, traffic and highway infrastructure management: recent advances and challenges. Transportation Research Part A: Policy and Practice, 2020, 141: 116-129.
[4] Zhu H, Ma M, Ma W, et al. A spatial-channel progressive fusion ResNet for remote sensing classification. Information Fusion, 2021, 70: 72-87.
[5] Wang P, Fan E, Wang P. Comparative analysis of image classification algorithms based on traditional machine learning and deep learning. Pattern Recognition Letters, 2021, 141: 61-67.
[6] Zhang D, Ye M, Liu Y, et al. Multi-source unsupervised domain adaptation for object detection. Information Fusion, 2022, 78: 138-148.
[7] Tan M, Pang R, Le Q V. EfficientDet: scalable and efficient object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 10781-10790.
[8] Yuan Y, Chen X, Wang J. Object-contextual representations for semantic segmentation. European Conference on Computer Vision, Springer, Cham, 2020: 173-190.
[9] Hao S, Zhou Y, Guo Y. A brief survey on semantic segmentation with deep learning. Neurocomputing, 2020, 406: 302-321.
[10] Jaouedi N, Boujnah N, Bouhlel M S. A new hybrid deep learning model for human action recognition. Journal of King Saud University - Computer and Information Sciences, 2020, 32(4): 447-453.
[11] Yang C, Xu Y, Shi J, et al. Temporal pyramid network for action recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 591-600.
[12] Shi Y, Yu X, Liu L, et al. Optimal feature transport for cross-view image geo-localization. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(07): 11990-11997.
[13] Zheng Z, Wei Y, Yang Y. University-1652: a multi-view multi-source benchmark for drone-based geo-localization. Proceedings of the 28th ACM International Conference on Multimedia, 2020: 1395-1403.
[14] Wang T, Zheng Z, Yan C, et al. Each part matters: local patterns facilitate cross-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 32(2): 867-879.
[15] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[16] Zhang Z, Lan C, Zeng W, et al. Relation-aware global attention for person re-identification. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 3186-3195.
[17] Yu F, Koltun V, Funkhouser T. Dilated residual networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 472-480.
[18] Zheng Z, Zheng L, Yang Y. A discriminatively learned CNN embedding for person re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2017, 14(1): 1-20.
[19] Li X, Yu L, Chang D, et al. Dual cross-entropy loss for small-sample fine-grained vehicle classification. IEEE Transactions on Vehicular Technology, 2019, 68(5): 4204-4212.
[20] Workman S, Jacobs N. On the location dependence of convolutional neural network features. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015: 70-78.
[21] Workman S, Souvenir R, Jacobs N. Wide-area image geolocalization with aerial reference imagery. Proceedings of the IEEE International Conference on Computer Vision, 2015: 3961-3969.
[22] Lin T Y, Cui Y, Belongie S, et al. Learning deep representations for ground-to-aerial geolocalization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 5007-5015.
[23] Vo N N, Hays J. Localizing and orienting street views using overhead imagery. European Conference on Computer Vision, Springer, Cham, 2016: 494-509.
[24] Tian Y, Chen C, Shah M. Cross-view image matching for geo-localization in urban environments. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 3608-3616.
[25] Altwaijry H, Trulls E, Hays J, et al. Learning to match aerial images with deep attentive architectures. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 3539-3547.
[26] Zhai M, Bessinger Z, Workman S, et al. Predicting ground-level scene layout from aerial imagery. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 867-875.
[27] Hu S, Feng M, Nguyen R M H, et al. CVM-Net: cross-view matching network for image based ground-to-aerial geo-localization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7258-7267.
[28] Arandjelovic R, Gronat P, Torii A, et al. NetVLAD: CNN architecture for weakly supervised place recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 5297-5307.
[29] Shi Y, Liu L, Yu X, et al. Spatial-aware feature aggregation for image based cross-view geo-localization. Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019: 10090-10100.
[30] Shi Y, Yu X, Campbell D, et al. Where am I looking at? Joint location and orientation estimation by cross-view matching. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 4064-4072.
[31] Liu L, Li H. Lending orientation to neural networks for cross-view geo-localization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 5624-5633.
[32] Rodrigues R, Tani M. Are these from the same place? Seeing the unseen in cross-view image geo-localization. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021: 3753-3761.
[33] Regmi K, Shah M. Bridging the domain gap for ground-to-aerial image matching. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 470-479.
[34] Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial networks. Communications of the ACM, 2020, 63(11): 139-144.
[35] Toker A, Zhou Q, Maximov M, et al. Coming down to earth: satellite-to-street view synthesis for geo-localization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 6488-6497.
[36] Zheng Z, Zheng L, Garrett M, et al. Dual-path convolutional image-text embeddings with instance loss. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2020, 16(2): 1-23.
[37] Ding L, Zhou J, Meng L, et al. A practical cross-view image matching method between UAV and satellite for UAV-based geo-localization. Remote Sensing, 2020, 13(1): 47.
[38] Zhuang J, Dai M, Chen X, et al. A faster and more effective cross-view matching method of UAV and satellite images for UAV geolocalization. Remote Sensing, 2021, 13(19): 3979.
[39] Lin J, Zheng Z, Zhong Z, et al. Joint representation learning and keypoint detection for cross-view geo-localization. IEEE Transactions on Image Processing, 2022.
[40] Dai M, Hu J, Zhuang J, et al. A transformer-based feature segmentation and region alignment method for UAV-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 2021.
[41] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017, 30.
[42] Chechik G, Sharma V, Shalit U, et al. Large scale online learning of image similarity through ranking. Journal of Machine Learning Research, 2010, 11(3).
[43] Cai S, Guo Y, Khan S, et al. Ground-to-aerial image geo-localization with a hard exemplar reweighting triplet loss. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 8391-8400.
[44] Hu J, Shen L, Sun G. Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 7132-7141.

Claims (7)

1. A global relation attention-guided cross-view geolocation method, characterized by comprising the following steps:
a deep residual network is adopted as the backbone network, and a global relation attention module is used to capture more robust global structure information of the image for matching; a dual-branch network comprising a global branch and a local branch is designed to capture, respectively, deep features rich in semantic information and local features with multi-scale context information; in the local branch, dilated convolution is used to enlarge the receptive field of the feature map, and a square-ring partition strategy is used to split the feature map at four scales; the feature map of each branch is converted into column-vector descriptors, and a classifier is then used to obtain the predicted category of each column vector; a cross-entropy loss function is used to measure the difference between the predicted category and the true category of the image, thereby improving network training accuracy.
2. The global relation attention guided cross-view geographic positioning method of claim 1, comprising the following steps:
the whole network structure is divided into a global branch and a local branch, which share network weights; first, ResNet50 is adopted as the backbone network, its average pooling layer and classification layer are removed, and features of the input image are extracted; meanwhile, a global relation attention module, comprising global spatial relation attention and global channel relation attention, is added after the shallow features are extracted and is used to capture the global structural information of the image; second, the output features of the previous stage are processed separately by the dual-branch structure, so that global and local information is attended to effectively; the global branch extracts high-level semantic information of the whole image; the local branch focuses on the deep features of the network, so that more image detail information is retained; meanwhile, in order to incorporate information from the regions surrounding the target, the square-ring partition strategy is used in the local branch to divide the feature map into four different regions; finally, global average pooling is used to convert the high-level image features into column-vector descriptors; in the training stage, a classifier module is used to obtain the predicted category probability of each column-vector descriptor, and a cross-entropy loss function is used to minimize the difference between the predicted category and the true category; in the test stage, the Euclidean distance is used to compute the similarity between the query image and the database images, and the retrieval results are finally ranked according to this similarity.
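The test-time retrieval step at the end of claim 2 (Euclidean distance between the query descriptor and every database descriptor, followed by sorting) might look like this sketch; the function name and descriptor shapes are assumptions, and the descriptors would come from a model such as the one sketched after claim 1.

```python
# Sketch of the retrieval step in claim 2 (assumed PyTorch):
# rank database images by Euclidean distance to the query descriptor.
import torch

def rank_by_euclidean(query_desc: torch.Tensor, db_descs: torch.Tensor) -> torch.Tensor:
    """query_desc: (D,) column-vector descriptor of the query image.
    db_descs: (N, D) descriptors of the database images.
    Returns database indices sorted from most to least similar."""
    dists = torch.cdist(query_desc.unsqueeze(0), db_descs).squeeze(0)  # (N,) Euclidean distances
    return torch.argsort(dists)  # smaller distance = higher similarity

# Example with random 2048-dimensional descriptors.
query = torch.randn(2048)
database = torch.randn(1000, 2048)
ranking = rank_by_euclidean(query, database)
print(ranking[:10])  # indices of the 10 best-matching database images
```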
3. The global relation attention guided cross-view geographic positioning method of claim 1, comprising the following steps:
the global relation attention module, comprising global spatial relation attention and global channel relation attention, is combined with the feature extraction network to construct a feature extraction network guided by global relation attention, and the attention weights are computed by learning the relations among feature nodes; the feature vectors in the feature map are represented as feature nodes x_i, where i = 1, 2, …, N and N is the number of feature nodes; for a given feature node x_i, the correlations r_{i,j} and r_{j,i} between the current node and every other node are computed, where j = 1, 2, …, N, yielding the relation vector of feature node x_i as r_i = [r_{i,1}, r_{i,2}, …, r_{i,N}, r_{1,i}, r_{2,i}, …, r_{N,i}]; the feature node is then concatenated with its relation vector r_i to obtain the relation-aware feature, from which the attention weight of the current feature node is inferred.
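The pairwise-relation bookkeeping of claim 3, an N×N correlation matrix whose i-th row and i-th column are stacked into the relation vector r_i, can be written compactly as follows. The embedding layers of formula (1) are omitted here and plain dot products are used, which is a simplification for illustration only.

```python
# Sketch of the relation vectors in claim 3: for N feature nodes, build the
# N x N correlation matrix R and stack row i and column i into r_i (length 2N).
import torch

def relation_vectors(nodes: torch.Tensor) -> torch.Tensor:
    """nodes: (N, C) feature nodes x_i. Returns (N, 2N) relation vectors r_i."""
    R = nodes @ nodes.t()                 # R[i, j] = r_{i,j} (plain dot product here)
    return torch.cat([R, R.t()], dim=1)   # r_i = [R(i, :), R(:, i)]

nodes = torch.randn(64, 256)              # e.g. N = 8*8 spatial positions, C = 256
r = relation_vectors(nodes)
print(r.shape)                            # torch.Size([64, 128])
```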
4. The global relation attention guided cross-view geographic positioning method of claim 3, comprising the following steps:
the global spatial relation attention learns the correlations among all feature nodes in the spatial dimension of the feature map, so that the network captures the features of salient target regions;
specifically, for a feature map S ∈ R^{C×H×W} obtained from the neural network, the C-dimensional feature vector at each spatial position is taken as a feature node, forming a graph G_S with N = W×H nodes in total, where each feature node is denoted x_i, i = 1, 2, …, N; the correlation r_{i,j} between feature nodes x_i and x_j is obtained by a dot-product operation, defined as formula (1):
r_{i,j} = f_s(x_i, x_j) = (ReLU(BN(Conv(x_i))))^T (ReLU(BN(Conv(x_j))))    (1)
where f_s(·) denotes the dot-product operation, ReLU(·) denotes the rectified linear unit activation function, BN(·) denotes a batch normalization layer, and Conv(·) denotes a 1×1 convolution whose dimension-reduction ratio is controlled by a predefined positive integer; the correlation r_{j,i} between feature nodes x_j and x_i is obtained in the same way, and (r_{i,j}, r_{j,i}) represents the pairwise relation between feature nodes x_i and x_j; finally, a relation matrix R_S ∈ R^{N×N} represents the correlations between all nodes, where r_{i,j} = R_S(i, j);
the correlations between the i-th feature node and all nodes are stacked in a fixed order to obtain the spatial relation vector r_i = [R_S(i, :), R_S(:, i)], where R_S(i, :) denotes the correlations between the i-th feature node and all nodes and R_S(:, i) denotes the correlations between all nodes and the i-th node; to fully exploit the global structural information of the feature nodes, the spatial relation vector r_i and the feature node x_i itself are concatenated to obtain the spatial relation-aware feature E_S, defined as formula (2):
E_S = C(x_i, r_i) = (pool_c(ReLU(BN(Conv(x_i)))), ReLU(BN(Conv(r_i))))    (2)
where C(·) denotes the concatenation operation, and pool_c(·) denotes global average pooling over the channel dimension, which reduces the channel dimension to 1; the spatial attention weight a_i of the i-th feature is then computed from the spatial relation-aware feature, defined as formula (3):
a_i = sigmoid(BN(Conv_2(ReLU(BN(Conv_1(E_S))))))    (3)
where sigmoid(·) denotes the sigmoid activation function, Conv_2(·) converts the number of channels to 1, and Conv_1(·) reduces the dimension by a fixed ratio;
the global channel relation attention learns the relations among all feature nodes in the channel dimension of the feature map and assigns a different weight to each channel;
specifically, for the feature map S ∈ R^{C×H×W}, the feature map on each channel is taken as a feature node, forming a graph G_C with C nodes in total, where each feature node is denoted x_i, i = 1, 2, …, C;
the input feature map S is first compressed spatially, and the correlation r_{i,j} between feature nodes x_i and x_j is then obtained, defined as formula (4):
r_{i,j} = f_c(x_i, x_j) = (ReLU(BN(Conv(x_i))))^T (ReLU(BN(Conv(x_j))))    (4)
where f_c(·) denotes the dot-product operation; the correlation r_{j,i} between feature nodes x_j and x_i is obtained in the same way, and a relation matrix R_C ∈ R^{C×C} represents the correlations between all nodes; the correlations between the i-th feature node and all nodes are stacked to obtain the channel relation vector r_i = [R_C(i, :), R_C(:, i)], from which the channel relation-aware feature E_C and the channel attention weight a_i are obtained.
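A compact sketch of the global spatial relation attention of claim 4, written against formulas (1) to (3). The class name, reduction ratio, channel counts and exact layer ordering are assumptions about one reasonable realization; the channel-relation branch of formula (4) follows the same pattern with the channels taken as the nodes.

```python
# Sketch (assumed PyTorch) of the global spatial relation attention in claim 4.
import torch
import torch.nn as nn


class GlobalSpatialRelationAttention(nn.Module):
    def __init__(self, in_channels: int, spatial_nodes: int, ratio: int = 8):
        super().__init__()
        mid = in_channels // ratio
        def conv_bn_relu(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 1), nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.embed_q = conv_bn_relu(in_channels, mid)         # embeddings for r_{i,j}, formula (1)
        self.embed_k = conv_bn_relu(in_channels, mid)
        self.embed_x = conv_bn_relu(in_channels, mid)          # node term of E_S, formula (2)
        self.embed_r = conv_bn_relu(2 * spatial_nodes, mid)    # relation term of E_S
        self.weight = nn.Sequential(                           # formula (3)
            nn.Conv2d(mid + 1, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, 1, 1), nn.BatchNorm2d(1), nn.Sigmoid())

    def forward(self, x):
        b, _, h, w = x.shape
        n = h * w
        q = self.embed_q(x).flatten(2)                         # (B, mid, N)
        k = self.embed_k(x).flatten(2)
        rel = torch.bmm(q.transpose(1, 2), k)                  # R_S, (B, N, N), formula (1)
        r_i = torch.cat([rel, rel.transpose(1, 2)], dim=2)     # [R_S(i,:), R_S(:,i)], (B, N, 2N)
        r_i = r_i.transpose(1, 2).reshape(b, 2 * n, h, w)      # relation vector as a 2N-channel map
        node = self.embed_x(x).mean(dim=1, keepdim=True)       # pool_c(.) -> 1 channel
        e_s = torch.cat([node, self.embed_r(r_i)], dim=1)      # E_S, formula (2)
        a = self.weight(e_s)                                   # spatial weights a_i, formula (3)
        return x * a                                           # re-weighted feature map


attn = GlobalSpatialRelationAttention(in_channels=256, spatial_nodes=16 * 16)
out = attn(torch.randn(2, 256, 16, 16))
print(out.shape)  # torch.Size([2, 256, 16, 16])
```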
5. The global relation attention guided cross-view geographic positioning method of claim 1, comprising the following steps:
a dilated convolution structure is introduced into the deep residual network; the dilated convolution enlarges the receptive field by inserting r−1 zero-weighted values between adjacent weights of the convolution kernel, where r is the dilation factor, and in a standard convolution operation the dilation factor is 1; with a 3×3 convolution kernel under otherwise identical conditions, the receptive fields of the standard convolution and of the dilated convolution are 3×3 and 5×5, respectively; compared with standard convolution, dilated convolution therefore captures richer multi-scale image information for image matching;
for the residual blocks of the network in the local branch, dilated convolution operations with dilation factors of 2 and 4 are used to enlarge the receptive field of the feature map; meanwhile, the stride of the convolution layer and of the downsampling layer in the last residual block of ResNet50 is set to 1, so that when the resolution of the input image is 256×256, the feature map output by the original backbone has resolution 8×8 while the feature map output by the dilated residual network has resolution 32×32;
in order to better distinguish images of different geographic locations, the surroundings of the target building are used as auxiliary information: in the local branch, a square-ring partition strategy is adopted to divide the feature map into four parts according to the distance from the image center, yielding feature maps of different regions; the image features are then converted into 2048-dimensional feature vectors by an average pooling operation, expressed by formula (5):
l_i^j = Avgpool(f_i^j)    (5)
where Avgpool(·) denotes the average pooling operation, f_i^j denotes the partitioned local-branch feature maps under the different view platforms, and l_i^j (i = 1, …, 4) denotes the pooled 2048-dimensional feature vectors of the 4 local parts.
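For claim 5, the dilation and stride changes on ResNet50 can be obtained with torchvision's replace_stride_with_dilation option (mapping the claim onto the last two stages is an assumption), and a square-ring partition can be written as distance-to-center masks; the 2048-dimensional vectors of formula (5) are then ring-wise averages. The ring boundaries chosen below are illustrative.

```python
# Sketch for claim 5 (assumed PyTorch/torchvision): dilated residual backbone and
# a square-ring partition of the feature map into 4 center-distance regions.
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Dilation factors 2 and 4 with stride 1 in the last two stages (assumed mapping
# of the claim onto torchvision's option): 256x256 input -> 32x32 feature map.
backbone = resnet50(weights=None, replace_stride_with_dilation=[False, True, True])
features = nn.Sequential(*list(backbone.children())[:-2])

def square_ring_partition(feat: torch.Tensor, num_rings: int = 4) -> list:
    """Average-pool the feature map over 4 square rings ordered from center outwards.
    feat: (B, C, H, W); returns a list of 4 tensors of shape (B, C) -- formula (5)."""
    b, c, h, w = feat.shape
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1).expand(h, w)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1).expand(h, w)
    # Chebyshev distance to the feature-map center defines the square rings.
    dist = torch.maximum((ys - (h - 1) / 2).abs(), (xs - (w - 1) / 2).abs())
    edges = torch.linspace(0.0, float(dist.max()) + 1e-6, num_rings + 1)
    pooled = []
    for k in range(num_rings):
        mask = ((dist >= edges[k]) & (dist < edges[k + 1])).float()
        mask = mask / mask.sum().clamp(min=1.0)
        pooled.append((feat * mask).sum(dim=(2, 3)))   # (B, C) ring average
    return pooled

x = torch.randn(2, 3, 256, 256)
feat = features(x)                     # (B, 2048, 32, 32)
parts = square_ring_partition(feat)    # four 2048-dimensional descriptors per image
print(feat.shape, [p.shape for p in parts])
```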
6. The global relation attention guided cross-view geographic positioning method of claim 1, comprising the following steps:
in the global branch, the deep residual network is used to extract and refine large-scale features, yielding a feature map f_j that contains rich semantic information, so that the categories to which different image features belong can be identified; the global feature map is then processed by average pooling to obtain a 2048-dimensional feature vector, expressed by formula (6):
g_j = Avgpool(f_j)    (6)
where g_j denotes the pooled global-branch feature vector.
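Formula (6) is ordinary global average pooling; in tensor terms it is a mean over the spatial dimensions, as in this short sketch (the shapes shown are assumed).

```python
# Formula (6) as a spatial mean: f_j (B, 2048, H, W) -> g_j (B, 2048).
import torch

f_j = torch.randn(2, 2048, 8, 8)   # global-branch feature map (shape assumed)
g_j = f_j.mean(dim=(2, 3))         # Avgpool(f_j)
print(g_j.shape)                   # torch.Size([2, 2048])
```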
7. The global relation attention guided cross-view geographic positioning method of claim 1, comprising the following steps:
a classifier is fused into the feature extraction stage to predict the category of each feature vector; the classifier consists of a fully connected layer, a batch normalization layer, a Dropout layer and a classification layer; it takes the local feature vectors l_i^j and the global feature vector g_j of image j as input and predicts the category to which each feature vector belongs, finally yielding the local prediction probability distribution vectors z_i^j and the global prediction probability distribution vector q_j of the image;
meanwhile, the cross-entropy loss function is used to measure the difference between the predicted probability distribution and the true distribution of the image; the cross-entropy loss is expressed by formula (7):
Loss = Σ_{j=1}^{2} ( −log p(y | x_j) + Σ_{i=1}^{4} −log p̂(y | x̂_i^j) )    (7)
where x̂_i^j denotes the corresponding parts of the original image after processing by the square-ring partition strategy, x_j (j ∈ [1, 2]) denotes the input image, with j = 1 denoting the unmanned aerial vehicle platform and j = 2 denoting the satellite platform; y denotes the true category of the input image; p̂(y | x̂_i^j) and p(y | x_j) respectively denote the normalized probability scores of x̂_i^j and x_j belonging to the true category, defined by formula (8) and formula (9):
p̂(y | x̂_i^j) = exp(z_i^j(y)) / Σ_{c=1}^{C} exp(z_i^j(c))    (8)
p(y | x_j) = exp(q_j(y)) / Σ_{c=1}^{C} exp(q_j(c))    (9)
where C denotes the number of all geo-tag categories in the database.
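A sketch of the claim-7 classifier block and the loss of formulas (7) to (9). The hidden width, dropout rate, class count, the use of a single shared classifier, and the use of F.cross_entropy (which combines the softmax of formulas (8) and (9) with the negative log-likelihood of formula (7)) are assumptions about one reasonable realization.

```python
# Sketch (assumed PyTorch) of the claim-7 classifier (FC -> BN -> Dropout -> classification)
# and the cross-entropy loss of formulas (7)-(9) over global and local descriptors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ClassifierBlock(nn.Module):
    def __init__(self, in_dim=2048, hidden=512, num_classes=701, p_drop=0.5):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden)
        self.bn = nn.BatchNorm1d(hidden)
        self.drop = nn.Dropout(p_drop)
        self.cls = nn.Linear(hidden, num_classes)   # classification layer

    def forward(self, v):
        return self.cls(self.drop(self.bn(self.fc(v))))   # unnormalized scores z or q


classifier = ClassifierBlock()
y = torch.tensor([3, 17])                   # true geo-tag categories of a batch of 2
loss = torch.zeros(())
for _ in range(2):                          # j = 1 (drone view), j = 2 (satellite view)
    g = torch.randn(2, 2048)                # stand-in for the global descriptor g_j
    local_parts = [torch.randn(2, 2048) for _ in range(4)]  # stand-ins for l_i^j
    q = classifier(g)
    loss = loss + F.cross_entropy(q, y)     # -log p(y | x_j), formulas (7) and (9)
    for l in local_parts:
        z = classifier(l)
        loss = loss + F.cross_entropy(z, y) # -log p_hat(y | x_hat_i^j), formulas (7) and (8)
print(loss)
```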
CN202310046541.7A 2023-01-31 2023-01-31 Cross view geographic positioning method for global relation attention guidance Pending CN116204675A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310046541.7A CN116204675A (en) 2023-01-31 2023-01-31 Cross view geographic positioning method for global relation attention guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310046541.7A CN116204675A (en) 2023-01-31 2023-01-31 Cross view geographic positioning method for global relation attention guidance

Publications (1)

Publication Number Publication Date
CN116204675A true CN116204675A (en) 2023-06-02

Family

ID=86512223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310046541.7A Pending CN116204675A (en) 2023-01-31 2023-01-31 Cross view geographic positioning method for global relation attention guidance

Country Status (1)

Country Link
CN (1) CN116204675A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116660982A (en) * 2023-08-02 2023-08-29 东北石油大学三亚海洋油气研究院 Full waveform inversion method based on attention convolution neural network
CN116660982B (en) * 2023-08-02 2023-09-29 东北石油大学三亚海洋油气研究院 Full waveform inversion method based on attention convolution neural network

Similar Documents

Publication Publication Date Title
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
Shi et al. Accurate 3-DoF camera geo-localization via ground-to-satellite image matching
CN111242064A (en) Pedestrian re-identification method and system based on camera style migration and single marking
CN111652293A (en) Vehicle weight recognition method for multi-task joint discrimination learning
CN114119993B (en) Remarkable target detection method based on self-attention mechanism
Sun et al. Unmanned surface vessel visual object detection under all-weather conditions with optimized feature fusion network in YOLOv4
CN114241053A (en) FairMOT multi-class tracking method based on improved attention mechanism
CN115019039B (en) Instance segmentation method and system combining self-supervision and global information enhancement
Shen et al. MCCG: A ConvNeXt-based multiple-classifier method for cross-view geo-localization
Lu et al. Improving 3d vulnerable road user detection with point augmentation
CN116204675A (en) Cross view geographic positioning method for global relation attention guidance
Gu et al. Real-time streaming perception system for autonomous driving
Sun et al. Squeeze-and-excitation network-based radar object detection with weighted location fusion
He et al. Multi-level progressive learning for unsupervised vehicle re-identification
Li et al. Material-Guided Multiview Fusion Network for Hyperspectral Object Tracking
Li et al. Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network
Li et al. Efficient thermal infrared tracking with cross-modal compress distillation
Mokalla et al. On designing MWIR and visible band based deepface detection models
Zhu et al. Find gold in sand: Fine-grained similarity mining for domain-adaptive crowd counting
Zhang et al. Remote sensing cross-modal retrieval by deep image-voice hashing
CN113298037B (en) Vehicle weight recognition method based on capsule network
Kim et al. LabelDistill: Label-guided Cross-modal Knowledge Distillation for Camera-based 3D Object Detection
Sun et al. A cross-view geo-localization method guided by relation-aware global attention
Yuan et al. Cross-Attention Between Satellite and Ground Views for Enhanced Fine-Grained Robot Geo-Localization
Xiong et al. Domain adaptation of object detector using scissor-like networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination