CN114757999A - Ground-space based cross-view geographic positioning method - Google Patents

Ground-space based cross-view geographic positioning method

Info

Publication number
CN114757999A
Authority
CN
China
Prior art keywords
image
ground
representing
view
satellite image
Prior art date
Legal status
Pending
Application number
CN202210406567.3A
Other languages
Chinese (zh)
Inventor
田晓阳
王珂
Current Assignee
Sichuan Yikong Intelligent Control Technology Co ltd
Original Assignee
Sichuan Yikong Intelligent Control Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Sichuan Yikong Intelligent Control Technology Co., Ltd.
Priority to CN202210406567.3A
Publication of CN114757999A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/75 Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10032 Satellite or aerial image; Remote sensing
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30181 Earth observation

Abstract

The invention discloses a ground-space based cross-view geographic positioning method, which comprises the following steps: S1: establishing a ground-aerial image training set; S2: performing region segmentation on the ground image; S3: converting the top view of the satellite image into a front view; S4: training a dual conditional generative adversarial network to obtain satellite images with a ground-view style; S5: training a Transformer network; S6: acquiring the ground image and satellite images to be geo-located, inputting them into the trained Transformer network, and performing image matching. The invention proposes, for the first time, automatic mixed perspective-polar coordinate mapping together with a dual conditional generative adversarial network to reduce the visual domain gap between ground and aerial images, and combines image alignment with a Transformer for the first time, thereby realizing accurate cross-view geographic positioning.

Description

Ground-space based cross-view geographic positioning method
Technical Field
The invention belongs to the technical field of geographic positioning, and particularly relates to a ground-space based cross-view geographic positioning method.
Background
Cross-view image matching is the retrieval of the most relevant images captured from different platforms. Cross-view geo-localization (CVGL) is mainly based on ground images (front view) and satellite images (top view). The goal of ground-space based cross-view geo-localization is to determine the location of a given ground image by matching it against satellite images. This remains a very challenging task, because the viewpoints of the ground and aerial images differ greatly.
With the rapid development of convolutional neural networks (CNNs) in computer vision, existing methods are mainly CNN-based. Early research focused directly on feature representation, and metric learning methods that operate directly on features were gradually applied to CVGL. These methods extract view-invariant features directly, perform no explicit view transformation on the input images, and learn feature representations based only on image content, without considering the spatial correspondence of the image pairs. Hu et al., in "Sixing Hu, Mengdan Feng, Rang M. H. Nguyen, and Gim Hee Lee, CVM-Net: Cross-view matching network for image-based ground-to-aerial geo-localization, in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2018, pp. 7258-7267," use a global VLAD descriptor, proposing CVM-Net based on a Siamese architecture with a weighted soft-margin ranking loss. Liu and Li, in "Liu Liu and Hongdong Li, Lending orientation to neural networks for cross-view geo-localization, in 2019 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2019, pp. 5624-5633," were the first to use orientation information as an important cue for the localization task. Shi et al., in "Yujiao Shi, Xin Yu, Liu Liu, Tong Zhang, and Hongdong Li, Optimal feature transport for cross-view image geo-localization, in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI, 2020, pp. 11990-11997," propose the feature transport network CVFT, which takes domain differences and spatial layout information into account.
Some later studies considered the spatial correspondence of ground-aerial image pairs and addressed the visual domain gap by transforming the input images, for example through explicit viewpoint mapping or conditional generative adversarial networks (CGANs). Shi et al., in "Yujiao Shi, Liu Liu, Xin Yu, and Hongdong Li, Spatial-aware feature aggregation for image based cross-view geo-localization, in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS, 2019, pp. 10090-10100," use explicit viewpoint mapping, proposing SAFA, which applies polar coordinate mapping to reduce the geometric domain gap and introduces a spatial-aware feature aggregation module. Toker et al., in "Aysim Toker, Qunjie Zhou, Maxim Maximov, and Laura Leal-Taixé, Coming down to earth: Satellite-to-street view synthesis for geo-localization, in 2021 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2021, pp. 6488-6497," use a conditional generative adversarial network to create realistic ground images from the satellite images while localizing the corresponding ground-view query in an end-to-end manner. Viewpoint mapping is a key technique for reducing the visual domain gap between different views, but existing mapping methods apply the same mapping to the entire image and neglect the distinct spatial correspondence of co-visible and non-co-visible regions, and existing CGAN methods do not exploit additional useful information.
The Transformer has achieved excellent performance in natural language processing (NLP), with parallel computation, a global receptive field, and flexible stacking capability that CNNs cannot match. Recent studies have attempted to apply it to computer vision, but its use in CVGL is still very limited. Yang et al., in "Hongji Yang, Xiufan Lu, and Yingying Zhu, Cross-view geo-localization with evolving transformer, CoRR, vol. abs/2107.00842, 2021," propose a simple and effective self-cross attention mechanism, Polar-EgoTR, to improve the quality of the learned representations. Existing methods do not preprocess the images before the Transformer and ignore the direct spatial correspondence of the image pair.
Furthermore, none of the existing methods considers that the corresponding scenes in an image pair are not captured at the same time and may even change over time, for example vegetation and buildings. Only the roads are, with high probability, left unchanged, and they play an important role in matching. Therefore, the invention segments different regions with a semantic segmentation technique and augments the data so as to make full use of the road information.
Disclosure of Invention
Aiming at solving the above problems, the invention provides a ground-space based cross-view geographic positioning method.
The technical scheme of the invention is as follows: a ground-space based cross-view geographic positioning method comprises the following steps:
S1: acquiring ground images and satellite images for geographic positioning, and establishing a ground-aerial image training set;
S2: performing region segmentation on the ground images using a semantic segmentation method;
S3: converting the top view of the satellite image into a front view by mixed perspective and polar coordinate mapping;
S4: taking the region-segmented ground images and the converted satellite images as the input of a dual conditional generative adversarial network, training the network, and obtaining satellite images with a ground-view style;
S5: inputting the region-segmented ground images and the ground-view-style satellite images into a Transformer network for matching, thereby completing the training of the Transformer network;
S6: acquiring the ground image and satellite images to be geo-located, inputting them into the trained Transformer network, and performing image matching to complete ground-space cross-view geographic positioning, as sketched below.
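The matching in step S6 reduces to a nearest-neighbour search in descriptor space. The following is a minimal Python sketch of that search, assuming two trained branch encoders (here called ground_encoder and sat_encoder, names chosen only for illustration) that each map an image tensor to a D-dimensional descriptor, as produced by the training of step S5; sat_gallery is a batch of candidate satellite images.

import torch
import torch.nn.functional as F

@torch.no_grad()
def geolocalize(ground_img, sat_gallery, ground_encoder, sat_encoder):
    """Rank the satellite gallery by similarity to one query ground image."""
    g = F.normalize(ground_encoder(ground_img.unsqueeze(0)), dim=-1)  # (1, D) query descriptor
    s = F.normalize(sat_encoder(sat_gallery), dim=-1)                 # (N, D) gallery descriptors
    scores = (g @ s.t()).squeeze(0)                                   # cosine similarities
    return torch.argsort(scores, descending=True)                     # best match first

The geographic position of the query is then taken to be the geo-tag of the top-ranked satellite image.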
Different regions are segmented with a semantic segmentation technique so as to make full use of the road information and to augment the data. Since the segmented image often carries useful information, the purpose of these low-level visual cues is to produce more effective image descriptors and thus better image feature retrieval performance. Most importantly, the segmented image is used as an input of the subsequent DCGAN (Dual Conditional Generative Adversarial Network), so that the additional useful information is fully exploited and satellite images with a stronger ground-view style are generated.
Further, in step S3, the calculation formula of the perspective mapping is:
(The perspective-mapping equations for x_a and y_a are reproduced as equation images in the original publication; they express the original satellite-image coordinates (x_a, y_a) in terms of the ground-view coordinates (x_g, y_g), the camera height C_h, and the image dimensions H_a and W_a.)
wherein x_a represents the abscissa of the original satellite image in the perspective mapping, y_a represents the ordinate of the original satellite image in the perspective mapping, x_g represents the abscissa of the perspective-mapped satellite image, y_g represents the ordinate of the perspective-mapped satellite image, C_h represents the height of the camera above the ground plane, H_a represents the height of the perspective-mapped satellite image, and W_a represents the width of the perspective-mapped satellite image;
the calculation formula for polar coordinate mapping is as follows:
x'_a = A_a/2 + (A_a/2)·((H_g - y'_g)/H_g)·sin(2π·x'_g/W_g)
y'_a = A_a/2 - (A_a/2)·((H_g - y'_g)/H_g)·cos(2π·x'_g/W_g)
wherein x'_a represents the abscissa of the original satellite image in the polar mapping, y'_a represents the ordinate of the original satellite image in the polar mapping, x'_g represents the abscissa of the polar-mapped satellite image, y'_g represents the ordinate of the polar-mapped satellite image, A_a represents the size of the original satellite image, H_g represents the height of the polar-mapped satellite image, and W_g represents the width of the polar-mapped satellite image.
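The polar mapping above can be implemented as a simple inverse warp. The following Python sketch assumes a square satellite image of size A_a x A_a stored as a NumPy array and uses nearest-neighbour sampling for brevity (bilinear sampling would normally be preferred); it illustrates the formula rather than reproducing the patented implementation.

import numpy as np

def polar_map(sat, H_g, W_g):
    """Warp a square satellite image into an H_g x W_g polar (ground-view-like) image."""
    A_a = sat.shape[0]
    y_g, x_g = np.mgrid[0:H_g, 0:W_g].astype(np.float64)
    r = (A_a / 2.0) * (H_g - y_g) / H_g          # radial distance from the image centre
    theta = 2.0 * np.pi * x_g / W_g              # azimuth angle of each target column
    x_a = A_a / 2.0 + r * np.sin(theta)          # source column in the satellite image
    y_a = A_a / 2.0 - r * np.cos(theta)          # source row in the satellite image
    x_a = np.clip(np.rint(x_a), 0, A_a - 1).astype(int)
    y_a = np.clip(np.rint(y_a), 0, A_a - 1).astype(int)
    return sat[y_a, x_a]                          # (H_g, W_g[, C]) polar-mapped image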
The beneficial effects of this further scheme are as follows: although a deep neural network can in theory learn an arbitrary function transformation, such learning imposes a large burden. The invention explicitly aligns the two domains according to their geometric correspondence, promoting network convergence and reducing the learning burden. Instead of forcing a neural network to learn an implicit mapping, the satellite image is transformed explicitly: its top view is converted into a front view by automatic mixed polar-perspective mapping so that it is approximately aligned with the ground image. A better spatial correspondence is thereby established, approximately bridging the geometric-spatial gap between the two domains.
A cross-view image pair contains a co-visible region and a non-co-visible region. In the non-co-visible region only one side of each vertical structure can be seen in either the top-down view or the ground view, whereas in the co-visible region the planar structures are visible in both views at the same time, so the geometric-spatial correspondence of the two regions should not be handled with the same method. The proposed mixed polar-perspective mapping therefore applies perspective mapping and polar coordinate mapping respectively, according to this essential difference between the co-visible and non-co-visible regions. The mapped satellite image is very close to the actual ground image and is aligned with the ground panoramic image, which makes it more suitable for matching and greatly reduces the visual domain gap of the ground-aerial image pair.
Further, in step S4, the expression of the loss function of the second generator G2 in the dual conditional generative adversarial network is:
S'_g = G1(I_a), I'_g = G2(S'_g)
L_CGAN(G2, D2) = E_{S'_g, I_g}[log D2(S'_g, I_g)] + E_{S'_g}[log(1 - D2(S'_g, G2(S'_g)))]
L_L1(G2) = E_{S'_g, I_g}[||I_g - G2(S'_g)||_1]
wherein S'_g represents the image generated by the first generator, I'_g represents the image generated by the second generator, I_a represents the input mixed perspective-polar mapped satellite image, S_g represents the semantic segmentation map of the input ground image, I_g represents the input original ground image, L_CGAN(·) represents the CGAN loss function, L_L1(·) represents the L1 loss function, G1 and G2 denote the first and second generators, D2 denotes the second discriminator, E[·] denotes the expectation over the corresponding variables, log is the logarithmic function, D2(·,·) denotes the discrimination output of the second discriminator on its two inputs, and ||·||_1 is the 1-norm.
Further, in step S4, the expression of the overall objective function L_Dual-CGAN of the dual conditional generative adversarial network is:
L_Dual-CGAN = L_CGAN(G1, D1) + λ·L_L1(G1) + L_CGAN(G2, D2) + λ·L_L1(G2)
wherein L_CGAN(·) denotes the CGAN loss function, L_L1(·) denotes the L1 loss function, G1 denotes the first generator, G2 denotes the second generator, D1 denotes the first discriminator, D2 denotes the second discriminator, and λ is the weight parameter of the loss function L_L1.
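As an illustration of this objective, the following Python (PyTorch) sketch computes the generator-side part of L_Dual-CGAN. The generators G1, G2 and the conditional discriminators D1, D2 are assumed to be supplied externally (for example pix2pix-style U-Nets and PatchGAN discriminators that take a (condition, image) pair and return logits); the weight lam = 100.0 is an illustrative default, not a value fixed by the patent. The discriminators would be updated separately with the symmetric real/fake terms.

import torch
import torch.nn.functional as F

def dual_cgan_generator_loss(G1, G2, D1, D2, I_a, S_g, I_g, lam=100.0):
    S_fake = G1(I_a)      # mapped satellite image -> cross-view segmentation map S'_g
    I_fake = G2(S_fake)   # segmentation map -> ground-view-style image I'_g

    def want_real(logits):  # non-saturating adversarial term for a generator update
        return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    adv_G1 = want_real(D1(I_a, S_fake))      # D1 conditioned on the input satellite image
    adv_G2 = want_real(D2(S_fake, I_fake))   # D2 conditioned on the generated segmentation

    l1_G1 = (S_g - S_fake).abs().mean()      # L1 term against the ground-truth segmentation
    l1_G2 = (I_g - I_fake).abs().mean()      # L1 term against the original ground image

    return adv_G1 + lam * l1_G1 + adv_G2 + lam * l1_G2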
The beneficial effects of this further scheme are as follows: the polar-perspective mapping does not take scene content into account, the true correspondence between the two domains is far more complicated than a simple mapping, the appearance distortion of the transformed image is still obvious, and mapping alone cannot completely eliminate the domain gap between the two views. To solve this problem, the invention synthesizes ground images with realistic appearance and preserved content from the corresponding satellite views, addressing the huge viewing-angle difference between the two domains in geographic positioning. The invention adopts the DCGAN method: the converted satellite image and the semantically segmented ground image are taken as input, the dual conditional generative adversarial network is trained, and satellite images with a ground-view style are synthesized.
Further, step S5 includes the following sub-steps:
S51: dividing the region-segmented ground image and the ground-view-style satellite image into a number of patches, and using the linear patch embeddings as the input of the Transformer network;
S52: adding the position embedding x_pos and the learnable class embedding x_class to the input of the Transformer network to obtain the input sequence embedding x_0;
S53: inputting the sequence embedding x_0 into the L-layer Transformer encoder of the Transformer network for image matching.
Further, in step S52, the calculation formula of the input sequence embedding x_0 is:
x_0 = [x_class; x] + x_pos
wherein x_pos denotes the position embedding, x_class denotes the learnable class embedding, and x denotes the sequence of linear patch embeddings.
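A minimal PyTorch sketch of constructing x_0 as defined above is given below; the zero initialization and the module name are illustrative choices rather than details fixed by the invention.

import torch
import torch.nn as nn

class PatchInput(nn.Module):
    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))                # learnable class embedding x_class
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learnable position embedding x_pos

    def forward(self, x):                    # x: (B, N, D) linear patch embeddings
        cls = self.cls.expand(x.size(0), -1, -1)
        x0 = torch.cat([cls, x], dim=1)      # prepend the class token: [x_class; x]
        return x0 + self.pos                 # x_0 = [x_class; x] + x_pos

The resulting x_0 is then fed to the L-layer Transformer encoder of step S53.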
The beneficial effects of this further scheme are as follows: the parallel computation, global receptive field and flexible stacking capability of the Transformer cannot be matched by CNNs. Observing CNN-based approaches in CVGL reveals two potential problems. On the one hand, CVGL needs to mine contextual information: images from different domains undergo positional transformations such as rotation, scaling and offset, so the semantic information of the global context must be fully understood. On the other hand, fine-grained information is very important for the retrieval task, and the down-sampling operations of CNN-based methods, namely pooling and strided convolution, reduce image resolution while imperceptibly destroying discriminative fine-grained information. For these reasons the Transformer serves in CVGL as a powerful context-aware information extractor. However, its application in CVGL is still very limited, and the existing methods neglect the direct spatial correspondence of the images and do not preprocess them. Therefore, a Transformer-based CVGL model, SMDT (the English initials of steps S2 to S5 respectively), is proposed; it is the first model to combine image mapping with the Transformer.
The beneficial effects of the invention are as follows: the invention proposes automatic mixed perspective-polar coordinate mapping for the first time, reducing the visual domain gap of ground-aerial images; proposes a dual conditional generative adversarial network for the first time, which takes the converted image and the semantically segmented image as input so that additional useful information is considered; proposes a Transformer-based CVGL model, SMDT, the first model to combine image mapping with the Transformer; and uses a semantic segmentation technique to facilitate the separation of co-visible and non-co-visible regions.
Drawings
FIG. 1 is a flow chart of a cross-view geolocation method;
FIG. 2 is a schematic diagram of semantic segmentation for obtaining category-specific masks corresponding to ground images (gray);
FIG. 3 is a diagram illustrating the mapping effect of satellite images under different methods;
FIG. 4 is a network architecture diagram of DCGAN;
FIG. 5 is a diagram of the overall method framework of the Transformer;
FIG. 6 is a diagram of a Transformer internal method framework.
Detailed Description
The embodiments of the present invention will be further described with reference to the accompanying drawings.
As shown in FIG. 1, the present invention provides a ground-space based cross-view geographic positioning method, comprising the following steps:
S1: acquiring ground images and satellite images for geographic positioning, and establishing a ground-aerial image training set;
S2: performing region segmentation on the ground images using a semantic segmentation method;
S3: converting the top view of the satellite image into a front view by mixed perspective and polar coordinate mapping;
S4: taking the region-segmented ground images and the converted satellite images as the input of the dual conditional generative adversarial network, training the network, and obtaining satellite images with a ground-view style;
S5: inputting the region-segmented ground images and the ground-view-style satellite images into the Transformer network for matching, thereby completing the training of the Transformer network;
S6: acquiring the ground image and satellite images to be geo-located, inputting them into the trained Transformer network, and performing image matching to complete ground-space cross-view geographic positioning.
In the embodiment of the present invention, in step S2, segmentation of the co-visible and non-co-visible regions is realized by performing semantic segmentation on the ground image, which also augments the image sample data. The additional information of the augmented samples is used in the subsequent mixed polar-perspective mapping of step S3 to better distinguish the co-visible region from the non-co-visible region, so that different mapping methods can be selected for different regions. The keep mode also makes better use of the additional road information: because the images are captured at different times, buildings, trees and the like cannot serve as features for image matching; only the roads are likely to remain unchanged and can serve as landmark features, which plays an important role in image matching. Most importantly, the semantic segmentation map is input as a condition of the DCGAN in the subsequent step S4, so that the additional useful information is fully exploited and more effective image descriptors yield better image feature retrieval performance.
As shown in FIG. 2, a semantic segmentation module is used to obtain segmented images corresponding to the ground view, considering sky, roads, sidewalks, trees, buildings and cars. As shown in FIG. 2B, masks are generated to distinguish these classes and are subsequently used to create augmented samples in the keep and remove modes, which are two different image operations. In keep mode the original pixels of a specific class are retained, preserving part of the original image as shown in FIG. 2C, whereas in remove mode they are masked with black pixels, removing part of the original image as shown in FIG. 2D. Remove mode considers all object classes, but in keep mode the sky and cars are not considered, since these classes cannot be used for matching in the bird's-eye view. When the ground view is modified, the satellite view is left unchanged. A minimal sketch of the two modes is given below.
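The following Python sketch illustrates the keep and remove operations on a ground image img (H x W x C) with a per-pixel class map seg (H x W). The class-id sets are placeholders for whatever label map the segmentation module actually produces and are not specified by the patent.

import numpy as np

KEEP_CLASSES = {1, 2}                    # e.g. road and sidewalk (illustrative ids)
ALL_OBJECT_CLASSES = {0, 1, 2, 3, 4, 5}  # remove mode considers every object class

def keep_mode(img, seg, classes=KEEP_CLASSES):
    """Retain the original pixels of the selected classes; everything else becomes black."""
    mask = np.isin(seg, list(classes))
    return img * mask[..., None]

def remove_mode(img, seg, classes):
    """Mask the selected classes with black pixels; everything else is kept."""
    mask = np.isin(seg, list(classes))
    return img * (~mask)[..., None]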
In the embodiment of the present invention, in step S3, formally let A_a × A_a denote the size of the satellite image and H_g × W_g the target size of the polar transformation. The polar transformation between an original satellite image point (x'_a, y'_a) and a target transformed image point (x'_g, y'_g) is then defined as:
x'_a = A_a/2 + (A_a/2)·((H_g - y'_g)/H_g)·sin(2π·x'_g/W_g)
y'_a = A_a/2 - (A_a/2)·((H_g - y'_g)/H_g)·cos(2π·x'_g/W_g)
wherein x'_a represents the abscissa of the original satellite image in the polar mapping, y'_a represents the ordinate of the original satellite image in the polar mapping, x'_g represents the abscissa of the polar-mapped satellite image, y'_g represents the ordinate of the polar-mapped satellite image, A_a represents the size of the original satellite image, H_g represents the height of the polar-mapped satellite image, and W_g represents the width of the polar-mapped satellite image.
The ground panoramic image is aligned with the satellite image so as to establish a spatial correspondence for the cross-view image pair, using the ground-aerial image training set of ground images (front view) and satellite images (top view) established in step S1. Specifically, the satellite image (top view) is produced by orthographic projection, while the ground panoramic image (front view) is produced by equirectangular projection of the viewing sphere. The cross-view image pair therefore has a co-visible region and a non-co-visible region. The co-visible region consists mainly of planar structures that can be seen simultaneously in the ground and satellite views, such as roads and sidewalks, which means that pixels belonging to the co-visible region in the cross-view image pair are related by a perspective mapping. In contrast, the non-co-visible region consists of vertical structures, such as building roofs and tree crowns, of which only one side is visible in either the top-down view or the ground view; in this region only semantic relationships exist between the cross-view images. The invention proposes a mixed polar-perspective mapping method: perspective mapping and polar coordinate mapping are applied respectively, according to the essential difference between the co-visible and non-co-visible regions. As can be seen from FIG. 3(c), the satellite image produced by the mixed perspective-polar mapping is very close to the actual ground image, is more suitable for matching, and greatly reduces the visual domain gap of the ground-aerial image pair.
For the co-visible region, a perspective transformation is used to obtain a satellite image that is very similar to the ground image. For a given pixel (x_a, y_a) in the satellite image coordinate system and the corresponding pixel (x_g, y_g) in the ground image coordinate system, the mapping relationship is described by the following equations:
(The perspective-mapping equations for x_a and y_a are reproduced as equation images in the original publication; they express the original satellite-image coordinates (x_a, y_a) in terms of the ground-view coordinates (x_g, y_g), the camera height C_h, and the image dimensions H_a and W_a.)
wherein x_a represents the abscissa of the original satellite image in the perspective mapping, y_a represents the ordinate of the original satellite image in the perspective mapping, x_g represents the abscissa of the perspective-mapped satellite image, y_g represents the ordinate of the perspective-mapped satellite image, C_h represents the height of the camera above the ground plane, H_a represents the height of the perspective-mapped satellite image, and W_a represents the width of the perspective-mapped satellite image.
However, the basic assumption of the perspective mapping is that all pixels in the co-visible region lie on the same object plane, which does not hold for the vertical structures in the non-co-visible region. In fact, such vertical objects are severely distorted and cannot even be converted into the ground image, so polar mapping is used for them. By applying the polar mapping, the projected geometric gap between the ground image and the satellite image is roughly closed. A sketch of combining the two mappings region by region is given below.
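One straightforward way to realize the mixed mapping, under the assumption that a perspective-warped image persp, a polar-warped image polar and a co-visibility mask coview_mask (all already in the target ground-view frame) have been precomputed, is to select the mapping per pixel according to the mask. This is an illustrative Python sketch of that selection, not the patented procedure itself.

import numpy as np

def hybrid_mapping(persp, polar, coview_mask):
    """Take perspective-mapped pixels where the ground plane is co-visible, polar elsewhere."""
    m = coview_mask.astype(bool)[..., None]   # (H, W, 1), True on roads, sidewalks, etc.
    return np.where(m, persp, polar)          # mixed perspective-polar mapped satellite image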
In the embodiment of the present invention, mapping the satellite image to the ground image with a CGAN in step S4 can be regarded as a visual domain adaptation process. Krishna Regmi proposed the X-Fork and X-Seq structures to assist the cross-view transformation of images with the help of CGANs; the model of the invention is mainly an improvement of X-Fork and X-Seq. As shown in FIG. 4, the invention proposes a DCGAN in which the first network generates a cross-view segmentation image and the second network takes the segmentation image produced by the first generator as input to generate the ground image. The entire system is trained end-to-end, so that both CGANs are trained simultaneously. Compared with the X-Seq framework, a semantic segmentation map of the ground image is added to G1 to participate in the training. Because the semantic segmentation map is used to generate more detail from the contour information of the image, the purpose of these low-level visual cues is to produce more effective image descriptors and thus better image feature retrieval performance; although more detail tends to introduce some confusion into the image, extended experiments show that the trade-off is worthwhile.
With reference to the loss function of the conventional CGAN, an equivalent expression of the cross-view CGAN loss in this architecture is obtained. G2 takes the segmentation image generated by G1 as its conditional input, and the loss functions of the dual conditional generative adversarial network are expressed as:
S'_g = G1(I_a), I'_g = G2(S'_g)
L_CGAN(G2, D2) = E_{S'_g, I_g}[log D2(S'_g, I_g)] + E_{S'_g}[log(1 - D2(S'_g, G2(S'_g)))]
L_L1(G2) = E_{S'_g, I_g}[||I_g - G2(S'_g)||_1]
wherein S'_g represents the image generated by the first generator, I'_g represents the image generated by the second generator (i.e., the satellite image with the ground-view style), I_a represents the input mixed perspective-polar mapped satellite image, S_g represents the semantic segmentation map of the input ground image, I_g represents the input original ground image, L_CGAN(·) represents the CGAN loss function, L_L1(·) represents the L1 loss function, G1 and G2 denote the first and second generators, D2 denotes the second discriminator, E[·] denotes the expectation over the corresponding variables, log is the logarithmic function, D2(·,·) denotes the discrimination output of the second discriminator on its two inputs, and ||·||_1 is the 1-norm (i.e., the pixel-wise difference between the ground-view-style satellite image and the ground image).
In the embodiment of the present invention, in step S4, the expression of the overall objective function L_Dual-CGAN of the dual conditional generative adversarial network is:
L_Dual-CGAN = L_CGAN(G1, D1) + λ·L_L1(G1) + L_CGAN(G2, D2) + λ·L_L1(G2)
wherein L_CGAN(·) denotes the CGAN loss function, L_L1(·) denotes the L1 loss function, G1 denotes the first generator, G2 denotes the second generator, D1 denotes the first discriminator, D2 denotes the second discriminator, and λ is the weight parameter of the loss function L_L1.
In the embodiment of the present invention, step S5 includes the following sub-steps:
S51: dividing the region-segmented ground image and the ground-view-style satellite image into a number of patches, and using the linear patch embeddings as the input of the Transformer network;
S52: adding the position embedding x_pos and the learnable class embedding x_class to the input of the Transformer network to obtain the input sequence embedding x_0;
S53: inputting the sequence embedding x_0 into the L-layer Transformer encoder of the Transformer network for image matching.
In the embodiment of the present invention, in step S52, the calculation formula of the input sequence embedding x_0 is:
x_0 = [x_class; x] + x_pos
wherein x_pos denotes the position embedding and x_class denotes the learnable class embedding.
The global context-aware nature of SMDT effectively reduces the visual domain gap, while position encoding gives it a notion of geometry, reducing the ambiguity caused by geometric misalignment. The invention follows the approach of Yang et al. in applying the Transformer to CVGL, but directly adopts the Transformer layer structure of ViT without using their self-cross attention mechanism. The specific structure is shown in FIG. 5 and FIG. 6.
Vision Transformer: the method is first described in the context of the Vision Transformer (ViT) architecture. As shown in FIG. 6, given an input image, ViT first divides the image into patches. ViT then receives as input the linear sequence of projected patch embeddings
x ∈ R^(N×D),
where N is the number of patches and D is the patch embedding size. A learnable class embedding
x_class ∈ R^D
is prepended to represent the whole image, the position embedding x_pos is added to obtain x_0 = [x_class; x] + x_pos, and x_0 is input to an L-layer Transformer encoder. Each layer consists of a multi-head self-attention (MSA) module and an MLP block, each preceded by layer normalization (LN); the MSA itself consists of multiple self-attention heads and a linear projection.
Domain-specific Transformer: the large domain gap between the ground and satellite images means that it is difficult to match them in a single data space. To suit the cross-view geo-localization task, a domain-specific Siamese-like structure is adopted, with two independent ViT branches of identical structure learning the ground and satellite image representations respectively. As shown in FIG. 5, each branch is a hybrid structure consisting of a ResNet backbone, which extracts a CNN feature map from the image input, and a ViT, which models global context on top of that feature map; the linear patch-embedding projection of ViT is applied to the CNN feature map by treating each 1×1 feature as a patch. A sketch of one such branch is given below.
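The following Python (PyTorch) sketch shows one plausible form of such a hybrid branch; the choice of ResNet-34, the embedding size, the depth, the mean pooling of the encoder output and the torchvision/PyTorch APIs used are assumptions for illustration, not details prescribed by the patent. Two instances with independent weights, one for the ground images and one for the satellite images, would form the Siamese-like structure.

import torch
import torch.nn as nn
import torchvision

class HybridBranch(nn.Module):
    def __init__(self, num_tokens, dim=384, depth=6, heads=6):
        super().__init__()
        resnet = torchvision.models.resnet34(weights=None)            # recent torchvision API
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # CNN feature map (B, 512, h, w)
        self.proj = nn.Conv2d(512, dim, kernel_size=1)                # 1x1 features -> patch embeddings
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))      # learnable 1-D position embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        f = self.proj(self.backbone(x))           # (B, dim, h, w)
        tokens = f.flatten(2).transpose(1, 2)     # (B, h*w, dim): each 1x1 feature is a patch
        out = self.encoder(tokens + self.pos)     # global context with positional cues
        return out.mean(dim=1)                    # pooled image descriptor

num_tokens must equal h*w of the backbone output for the chosen input size (for example 49 for a 224 x 224 input to ResNet-34); the descriptors of the two branches are then compared as in the retrieval sketch of step S6.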
Learnable position embedding: geometric cues can greatly simplify the cross-view geo-localization task. The invention adopts an efficient and flexible approach that gives the network a notion of geometry rather than imposing predefined orientation knowledge on it. Specifically, a learnable one-dimensional position embedding
x_pos ∈ R^((N+1)×D)
is used in ViT. By adding the position embedding to the linear patch embeddings, the transformed features become position dependent. Furthermore, the SMDT of the present invention has broader practical applicability, since no assumption about position knowledge is imposed; it is instead learned from the training objective. Experiments show that the learnable position embedding helps capture relative position information, which is more suitable than absolute position information for images with unknown orientation. In addition, SMDT takes the scene content of the corresponding cross-view geometry into account, which is complementary to the polar-perspective transformation and leads to better localization performance.
The following description is given with reference to specific examples. The invention provides a cross-view image matching method, SMDT, which combines cross-view synthesis and geographic positioning. SMDT fully addresses the issues ignored by existing methods, namely the variability of different scene regions in the ground-aerial image pair, the distinct spatial correspondence of co-visible and non-co-visible regions, the usefulness of additional information, and the limitations of applying the Transformer in CVGL. It achieves new state-of-the-art performance and high matching accuracy, as shown in Tables 1 and 2, where Table 1 is a comparison with other methods on the CVUSA dataset and Table 2 a comparison with other methods on the CVACT dataset.
TABLE 1
Method R@1 R@5 R@10 R@Top1
CVM-Net 22.47 49.98 63.18 93.62
Liu and Li 40.79 66.82 76.36 96.12
CVFT 61.43 84.69 90.49 99.02
SAFA 81.15 94.23 96.85 99.49
Toker et al. 92.56 97.55 98.33 99.67
Polar-EgoTR 94.05 98.27 98.99 99.67
SMDT (invention) 95.02 98.97 99.25 99.87
TABLE 2
Method R@1 R@5 R@10 R@Top1
CVM-Net 20.15 45.00 56.87 87.57
Liu and Li 46.96 68.28 75.48 92.01
CVFT 61.05 81.33 86.52 95.93
SAFA 78.28 91.60 93.79 98.15
Toker et al. 83.28 93.57 95.42 98.22
Polar-EgoTR 84.89 94.59 95.96 98.37
SMDT (invention) 85.52 94.97 96.28 98.96
The working principle and process of the invention are as follows: the invention provides a cross-view geographic positioning method called SMDT. First, different regions of the ground image are segmented with a semantic segmentation technique. Then the top view of the satellite image is converted into a front view by automatic mixed perspective-polar coordinate mapping, greatly reducing the visual domain gap of the ground-aerial image pair. Next, the DCGAN algorithm is proposed: the converted image and the semantically segmented image are taken as input, the dual conditional generative adversarial network is trained, and satellite images with a ground-view style are synthesized. Finally, the Transformer is used to model global dependencies explicitly through its self-attention property. These steps respectively address the issues ignored by existing methods, namely the variability of different scene regions in the ground-aerial image pair, the distinct spatial correspondence of co-visible and non-co-visible regions, the usefulness of additional information, and the limitations of applying the Transformer in CVGL.
The beneficial effects of the invention are as follows: the invention proposes automatic mixed perspective-polar coordinate mapping for the first time, reducing the visual domain gap of ground-aerial images; proposes a dual conditional generative adversarial network for the first time, which takes the converted image and the semantically segmented image as input so that additional useful information is considered; proposes a Transformer-based CVGL model, SMDT, the first model to combine image mapping with the Transformer; and uses a semantic segmentation technique to facilitate the separation of co-visible and non-co-visible regions.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the invention is not limited to the specifically described embodiments and examples. Those skilled in the art may make various modifications and changes in light of this disclosure without departing from the scope of the invention.

Claims (6)

1. A ground-space based cross-view geographic positioning method, characterized by comprising the following steps:
S1: acquiring ground images and satellite images for geographic positioning, and establishing a ground-aerial image training set;
S2: performing region segmentation on the ground images using a semantic segmentation method;
S3: converting the top view of the satellite image into a front view by mixed perspective and polar coordinate mapping;
S4: taking the region-segmented ground images and the converted satellite images as the input of a dual conditional generative adversarial network, training the network, and obtaining satellite images with a ground-view style;
S5: inputting the region-segmented ground images and the ground-view-style satellite images into a Transformer network for matching, thereby completing the training of the Transformer network;
S6: acquiring the ground image and satellite images to be geo-located, inputting them into the trained Transformer network, and performing image matching to complete ground-space cross-view geographic positioning.
2. The ground-space based cross-view geographic positioning method according to claim 1, wherein in step S3, the calculation formula of the perspective mapping is:
(The perspective-mapping equations for x_a and y_a are reproduced as equation images in the original publication; they express the original satellite-image coordinates (x_a, y_a) in terms of the ground-view coordinates (x_g, y_g), the camera height C_h, and the image dimensions H_a and W_a.)
wherein x_a represents the abscissa of the original satellite image in the perspective mapping, y_a represents the ordinate of the original satellite image in the perspective mapping, x_g represents the abscissa of the perspective-mapped satellite image, y_g represents the ordinate of the perspective-mapped satellite image, C_h represents the height of the camera above the ground plane, H_a represents the height of the perspective-mapped satellite image, and W_a represents the width of the perspective-mapped satellite image;
the calculation formula for polar coordinate mapping is as follows:
x'_a = A_a/2 + (A_a/2)·((H_g - y'_g)/H_g)·sin(2π·x'_g/W_g)
y'_a = A_a/2 - (A_a/2)·((H_g - y'_g)/H_g)·cos(2π·x'_g/W_g)
wherein x'_a represents the abscissa of the original satellite image in the polar mapping, y'_a represents the ordinate of the original satellite image in the polar mapping, x'_g represents the abscissa of the polar-mapped satellite image, y'_g represents the ordinate of the polar-mapped satellite image, A_a represents the size of the original satellite image, H_g represents the height of the polar-mapped satellite image, and W_g represents the width of the polar-mapped satellite image.
3. The ground-space based cross-view geographic positioning method according to claim 1, wherein in step S4, the expression of the loss function of the second generator G2 in the dual conditional generative adversarial network is:
S'_g = G1(I_a), I'_g = G2(S'_g)
L_CGAN(G2, D2) = E_{S'_g, I_g}[log D2(S'_g, I_g)] + E_{S'_g}[log(1 - D2(S'_g, G2(S'_g)))]
L_L1(G2) = E_{S'_g, I_g}[||I_g - G2(S'_g)||_1]
wherein S'_g represents the image generated by the first generator, I'_g represents the image generated by the second generator, I_a represents the input mixed perspective-polar mapped satellite image, S_g represents the semantic segmentation map of the input ground image, I_g represents the input original ground image, L_CGAN(·) represents the CGAN loss function, L_L1(·) represents the L1 loss function, G1 denotes the first generator, G2 denotes the second generator, D2 denotes the second discriminator, E[·] denotes the expectation over the corresponding variables, log is the logarithmic function, D2(·,·) denotes the discrimination output of the second discriminator on its two inputs, and ||·||_1 is the 1-norm.
4. The ground-space based cross-view geographic positioning method according to claim 1, wherein in step S4, the expression of the overall objective function L_Dual-CGAN of the dual conditional generative adversarial network is:
L_Dual-CGAN = L_CGAN(G1, D1) + λ·L_L1(G1) + L_CGAN(G2, D2) + λ·L_L1(G2)
wherein L is CGAN(. smallcircle.) denotes the CGAN loss function, LL1(. represents an L1 loss function, G1Denotes a first generator, G2Denotes a second generator, D1Denotes a first discriminator, D2Denotes a second discriminator and lambda is expressed as a loss function LL1The weight parameter of (2).
5. The ground-space based cross-view geographic positioning method according to claim 1, wherein step S5 comprises the following sub-steps:
S51: dividing the region-segmented ground image and the ground-view-style satellite image into a number of patches, and using the linear patch embeddings as the input of the Transformer network;
S52: adding the position embedding x_pos and the learnable class embedding x_class to the input of the Transformer network to obtain the input sequence embedding x_0;
S53: inputting the sequence embedding x_0 into the L-layer Transformer encoder of the Transformer network for image matching.
6. The ground-space based cross-view geographic positioning method according to claim 5, wherein in step S52, the calculation formula of the input sequence embedding x_0 is:
x_0 = [x_class; x] + x_pos
wherein x_pos denotes the position embedding and x_class denotes the learnable class embedding.
CN202210406567.3A 2022-04-18 2022-04-18 Ground-space based cross-view geographic positioning method Pending CN114757999A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210406567.3A CN114757999A (en) 2022-04-18 2022-04-18 Ground-space based cross-view geographic positioning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210406567.3A CN114757999A (en) 2022-04-18 2022-04-18 Ground-space based cross-view geographic positioning method

Publications (1)

Publication Number Publication Date
CN114757999A 2022-07-15

Family

ID=82332132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210406567.3A Pending CN114757999A (en) 2022-04-18 2022-04-18 Ground-space based cross-view geographic positioning method

Country Status (1)

Country Link
CN (1) CN114757999A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102022127693A1 (en) 2022-10-20 2024-04-25 Ford Global Technologies, Llc Method and system for evaluating a device for on-board camera-based detection of road markings


Similar Documents

Publication Publication Date Title
Tian et al. UAV-satellite view synthesis for cross-view geo-localization
Yang et al. Cross-view geo-localization with layer-to-layer transformer
Lu et al. Geometry-aware satellite-to-ground image synthesis for urban areas
CN103839277B (en) A kind of mobile augmented reality register method of outdoor largescale natural scene
US10043097B2 (en) Image abstraction system
CN111126304A (en) Augmented reality navigation method based on indoor natural scene image deep learning
Huang et al. Learning identity-invariant motion representations for cross-id face reenactment
CN113361508B (en) Cross-view-angle geographic positioning method based on unmanned aerial vehicle-satellite
Shi et al. Accurate 3-dof camera geo-localization via ground-to-satellite image matching
CN110930310B (en) Panoramic image splicing method
Song et al. A joint siamese attention-aware network for vehicle object tracking in satellite videos
CN114757999A (en) Ground-space based cross-view geographic positioning method
Pham et al. Fast and efficient method for large-scale aerial image stitching
Zhou et al. PADENet: An efficient and robust panoramic monocular depth estimation network for outdoor scenes
Zhu et al. Simple, effective and general: A new backbone for cross-view image geo-localization
Li et al. Super-resolution-based part collaboration network for vehicle re-identification
CN114299101A (en) Method, apparatus, device, medium, and program product for acquiring target region of image
CN114283152A (en) Image processing method, image processing model training method, image processing device, image processing equipment and image processing medium
Tian et al. Smdt: Cross-view geo-localization with image alignment and transformer
CN116740488B (en) Training method and device for feature extraction model for visual positioning
CN117456136A (en) Digital twin scene intelligent generation method based on multi-mode visual recognition
Li et al. Stereo neural vernier caliper
CN113781372A (en) Deep learning-based opera facial makeup generation method and system
Zhao et al. TransFG: A Cross-view geo-localization of Satellite and UAVs Imagery Pipeline using Transformer-Based Feature Aggregation and Gradient Guidance
Zhu et al. Cross-View Image Synthesis From a Single Image With Progressive Parallel GAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination