CN114757999A - Ground-space based cross-view geographic positioning method - Google Patents

Ground-space based cross-view geographic positioning method

Info

Publication number
CN114757999A
Authority
CN
China
Prior art keywords
image
ground
representing
view
satellite image
Prior art date
Legal status
Pending
Application number
CN202210406567.3A
Other languages
Chinese (zh)
Inventor
田晓阳
王珂
Current Assignee
Sichuan Yikong Intelligent Control Technology Co ltd
Original Assignee
Sichuan Yikong Intelligent Control Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Sichuan Yikong Intelligent Control Technology Co., Ltd.
Priority to CN202210406567.3A
Publication of CN114757999A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T 7/75 Determining position or orientation of objects or cameras using feature-based methods involving models
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10032 Satellite or aerial image; Remote sensing
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30181 Earth observation

Abstract

The invention discloses a ground-space based cross-view geographic positioning method, which comprises the following steps: S1: establishing a ground-aerial image training set; S2: performing region segmentation on the ground image; S3: converting the top view of the satellite image into a front view; S4: training a dual conditional generative adversarial network to obtain satellite images with a ground-view style; S5: training a Transformer network; S6: acquiring the ground image and satellite images to be geo-located, inputting them into the trained Transformer network, and performing image matching. The invention proposes, for the first time, automatic mixed perspective-polar coordinate mapping together with a dual conditional generative adversarial network to reduce the visual domain gap between ground and aerial images, and combines image alignment with a Transformer for the first time, thereby realizing accurate cross-view geographic positioning.

Description

Ground-space based cross-view geographic positioning method
Technical Field
The invention belongs to the technical field of geographic positioning, and particularly relates to a ground-space based cross-view geographic positioning method.
Background
Cross-view image matching is the retrieval of the most relevant images captured from different platforms. Cross-view geo-localization (CVGL) is mainly based on ground images (front view) and satellite images (top view). The goal of ground-space based cross-view geo-localization is to determine the location of a given ground image by matching it against satellite images. This remains a very challenging task, because the viewpoints of the ground and aerial images differ greatly.
With the rapid development of convolutional neural networks (CNNs) in computer vision, existing methods are mainly CNN-based. Early research focused directly on feature representation, and metric learning methods that operate directly on features were gradually applied to CVGL. These methods extract view-invariant features directly, perform no explicit view transformation on the input images, and learn feature representations based only on image content, without considering the spatial correspondence of the image pairs. Hu et al., in "Sixing Hu, Mengdan Feng, Rang M. H. Nguyen, and Gim Hee Lee, CVM-Net: Cross-view matching network for image-based ground-to-aerial geo-localization, in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2018, pp. 7258-7267," use a global VLAD descriptor, proposing CVM-Net based on a Siamese architecture with a weighted soft-margin ranking loss. Liu and Li, in "Liu Liu and Hongdong Li, Lending orientation to neural networks for cross-view geo-localization, in 2019 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2019, pp. 5624-5633," were the first to use orientation information as an important cue for the localization task. Shi et al., in "Yujiao Shi, Xin Yu, Liu Liu, Tong Zhang, and Hongdong Li, Optimal feature transport for cross-view image geo-localization, in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI, 2020, pp. 11990-11997," propose the feature transport network CVFT, which takes domain differences and spatial layout information into account.
Some later studies considered the spatial correspondence of ground-aerial image pairs and addressed the visual domain gap by transforming the input images, for example through explicit viewpoint mapping or conditional generative adversarial networks (CGANs). Shi et al., in "Yujiao Shi, Liu Liu, Xin Yu, and Hongdong Li, Spatial-aware feature aggregation for image based cross-view geo-localization, in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS, 2019, pp. 10090-10100," use explicit viewpoint mapping, proposing SAFA, which applies polar coordinate mapping to reduce the geometric domain gap and introduces a spatial-aware feature aggregation module. Toker et al., in "Aysim Toker, Qunjie Zhou, Maxim Maximov, and Laura Leal-Taixé, Coming down to earth: Satellite-to-street view synthesis for geo-localization, in 2021 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2021, pp. 6488-6497," use a conditional generative adversarial network to create realistic ground images from the satellite images while localizing the corresponding ground-view query in an end-to-end manner. Viewpoint mapping is a key technique for reducing the visual domain gap between different views, but existing mapping methods apply the same mapping to the entire image and neglect the distinct spatial correspondence of co-visible and non-co-visible regions, and existing CGAN methods do not exploit additional useful information.
The Transformer has achieved excellent performance in natural language processing (NLP), with parallel computation, a global receptive field, and flexible stacking capability that CNNs cannot match. Recent studies have attempted to apply it to computer vision, but its use in CVGL is still very limited. Yang et al., in "Hongji Yang, Xiufan Lu, and Yingying Zhu, Cross-view geo-localization with evolving transformer, CoRR, vol. abs/2107.00842, 2021," propose a simple and effective self-cross attention mechanism, Polar-EgoTR, to improve the quality of the learned representations. Existing methods do not preprocess the images before the Transformer and ignore the direct spatial correspondence of the image pair.
Furthermore, none of the existing methods considers that the corresponding scenes in an image pair are not captured at the same time and may even change over time, for example vegetation and buildings. Only the roads are, with high probability, left unchanged, and they play an important role in matching. Therefore, the invention segments different regions with a semantic segmentation technique and augments the data so as to make full use of the road information.
Disclosure of Invention
Aiming at solving the above problems, the invention provides a ground-space based cross-view geographic positioning method.
The technical scheme of the invention is as follows: a ground-space based cross-view geographic positioning method comprises the following steps:
S1: acquiring ground images and satellite images for geographic positioning, and establishing a ground-aerial image training set;
S2: performing region segmentation on the ground images using a semantic segmentation method;
S3: converting the top view of the satellite image into a front view by mixed perspective and polar coordinate mapping;
S4: taking the region-segmented ground images and the converted satellite images as the input of a dual conditional generative adversarial network, training the network, and obtaining satellite images with a ground-view style;
S5: inputting the region-segmented ground images and the ground-view-style satellite images into a Transformer network for matching, thereby completing the training of the Transformer network;
S6: acquiring the ground image and satellite images to be geo-located, inputting them into the trained Transformer network, and performing image matching to complete ground-space cross-view geographic positioning, as sketched below.
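The matching in step S6 reduces to a nearest-neighbour search in descriptor space. The following is a minimal Python sketch of that search, assuming two trained branch encoders (here called ground_encoder and sat_encoder, names chosen only for illustration) that each map an image tensor to a D-dimensional descriptor, as produced by the training of step S5; sat_gallery is a batch of candidate satellite images.

import torch
import torch.nn.functional as F

@torch.no_grad()
def geolocalize(ground_img, sat_gallery, ground_encoder, sat_encoder):
    """Rank the satellite gallery by similarity to one query ground image."""
    g = F.normalize(ground_encoder(ground_img.unsqueeze(0)), dim=-1)  # (1, D) query descriptor
    s = F.normalize(sat_encoder(sat_gallery), dim=-1)                 # (N, D) gallery descriptors
    scores = (g @ s.t()).squeeze(0)                                   # cosine similarities
    return torch.argsort(scores, descending=True)                     # best match first

The geographic position of the query is then taken to be the geo-tag of the top-ranked satellite image.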
Different regions are segmented with a semantic segmentation technique so as to make full use of the road information and to augment the data. Since the segmented image often carries useful information, the purpose of these low-level visual cues is to produce more effective image descriptors and thus better image feature retrieval performance. Most importantly, the segmented image is used as an input of the subsequent DCGAN (Dual Conditional Generative Adversarial Network), so that the additional useful information is fully exploited and satellite images with a stronger ground-view style are generated.
Further, in step S3, the calculation formula of the perspective mapping is:
(The perspective-mapping equations for x_a and y_a are reproduced as equation images in the original publication; they express the original satellite-image coordinates (x_a, y_a) in terms of the ground-view coordinates (x_g, y_g), the camera height C_h, and the image dimensions H_a and W_a.)
wherein x_a represents the abscissa of the original satellite image in the perspective mapping, y_a represents the ordinate of the original satellite image in the perspective mapping, x_g represents the abscissa of the perspective-mapped satellite image, y_g represents the ordinate of the perspective-mapped satellite image, C_h represents the height of the camera above the ground plane, H_a represents the height of the perspective-mapped satellite image, and W_a represents the width of the perspective-mapped satellite image;
the calculation formula for polar coordinate mapping is as follows:
x'_a = A_a/2 + (A_a/2)·((H_g - y'_g)/H_g)·sin(2π·x'_g/W_g)
y'_a = A_a/2 - (A_a/2)·((H_g - y'_g)/H_g)·cos(2π·x'_g/W_g)
wherein x'_a represents the abscissa of the original satellite image in the polar mapping, y'_a represents the ordinate of the original satellite image in the polar mapping, x'_g represents the abscissa of the polar-mapped satellite image, y'_g represents the ordinate of the polar-mapped satellite image, A_a represents the size of the original satellite image, H_g represents the height of the polar-mapped satellite image, and W_g represents the width of the polar-mapped satellite image.
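The polar mapping above can be implemented as a simple inverse warp. The following Python sketch assumes a square satellite image of size A_a x A_a stored as a NumPy array and uses nearest-neighbour sampling for brevity (bilinear sampling would normally be preferred); it illustrates the formula rather than reproducing the patented implementation.

import numpy as np

def polar_map(sat, H_g, W_g):
    """Warp a square satellite image into an H_g x W_g polar (ground-view-like) image."""
    A_a = sat.shape[0]
    y_g, x_g = np.mgrid[0:H_g, 0:W_g].astype(np.float64)
    r = (A_a / 2.0) * (H_g - y_g) / H_g          # radial distance from the image centre
    theta = 2.0 * np.pi * x_g / W_g              # azimuth angle of each target column
    x_a = A_a / 2.0 + r * np.sin(theta)          # source column in the satellite image
    y_a = A_a / 2.0 - r * np.cos(theta)          # source row in the satellite image
    x_a = np.clip(np.rint(x_a), 0, A_a - 1).astype(int)
    y_a = np.clip(np.rint(y_a), 0, A_a - 1).astype(int)
    return sat[y_a, x_a]                          # (H_g, W_g[, C]) polar-mapped image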
The beneficial effects of this further scheme are as follows: although a deep neural network can in theory learn an arbitrary function transformation, such learning imposes a large burden. The invention explicitly aligns the two domains according to their geometric correspondence, promoting network convergence and reducing the learning burden. Instead of forcing a neural network to learn an implicit mapping, the satellite image is transformed explicitly: its top view is converted into a front view by automatic mixed polar-perspective mapping so that it is approximately aligned with the ground image. A better spatial correspondence is thereby established, approximately bridging the geometric-spatial gap between the two domains.
A cross-view image pair contains a co-visible region and a non-co-visible region. In the non-co-visible region only one side of each vertical structure can be seen in either the top-down view or the ground view, whereas in the co-visible region the planar structures are visible in both views at the same time, so the geometric-spatial correspondence of the two regions should not be handled with the same method. The proposed mixed polar-perspective mapping therefore applies perspective mapping and polar coordinate mapping respectively, according to this essential difference between the co-visible and non-co-visible regions. The mapped satellite image is very close to the actual ground image and is aligned with the ground panoramic image, which makes it more suitable for matching and greatly reduces the visual domain gap of the ground-aerial image pair.
Further, in step S4, the expression of the loss function of the second generator G2 in the dual conditional generative adversarial network is:
S'_g = G1(I_a), I'_g = G2(S'_g)
L_CGAN(G2, D2) = E_{S'_g, I_g}[log D2(S'_g, I_g)] + E_{S'_g}[log(1 - D2(S'_g, G2(S'_g)))]
L_L1(G2) = E_{S'_g, I_g}[||I_g - G2(S'_g)||_1]
wherein S'_g represents the image generated by the first generator, I'_g represents the image generated by the second generator, I_a represents the input mixed perspective-polar mapped satellite image, S_g represents the semantic segmentation map of the input ground image, I_g represents the input original ground image, L_CGAN(·) represents the CGAN loss function, L_L1(·) represents the L1 loss function, G1 and G2 denote the first and second generators, D2 denotes the second discriminator, E[·] denotes the expectation over the corresponding variables, log is the logarithmic function, D2(·,·) denotes the discrimination output of the second discriminator on its two inputs, and ||·||_1 is the 1-norm.
Further, in step S4, the expression of the overall objective function L_Dual-CGAN of the dual conditional generative adversarial network is:
L_Dual-CGAN = L_CGAN(G1, D1) + λ·L_L1(G1) + L_CGAN(G2, D2) + λ·L_L1(G2)
wherein L_CGAN(·) denotes the CGAN loss function, L_L1(·) denotes the L1 loss function, G1 denotes the first generator, G2 denotes the second generator, D1 denotes the first discriminator, D2 denotes the second discriminator, and λ is the weight parameter of the loss function L_L1.
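As an illustration of this objective, the following Python (PyTorch) sketch computes the generator-side part of L_Dual-CGAN. The generators G1, G2 and the conditional discriminators D1, D2 are assumed to be supplied externally (for example pix2pix-style U-Nets and PatchGAN discriminators that take a (condition, image) pair and return logits); the weight lam = 100.0 is an illustrative default, not a value fixed by the patent. The discriminators would be updated separately with the symmetric real/fake terms.

import torch
import torch.nn.functional as F

def dual_cgan_generator_loss(G1, G2, D1, D2, I_a, S_g, I_g, lam=100.0):
    S_fake = G1(I_a)      # mapped satellite image -> cross-view segmentation map S'_g
    I_fake = G2(S_fake)   # segmentation map -> ground-view-style image I'_g

    def want_real(logits):  # non-saturating adversarial term for a generator update
        return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    adv_G1 = want_real(D1(I_a, S_fake))      # D1 conditioned on the input satellite image
    adv_G2 = want_real(D2(S_fake, I_fake))   # D2 conditioned on the generated segmentation

    l1_G1 = (S_g - S_fake).abs().mean()      # L1 term against the ground-truth segmentation
    l1_G2 = (I_g - I_fake).abs().mean()      # L1 term against the original ground image

    return adv_G1 + lam * l1_G1 + adv_G2 + lam * l1_G2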
The beneficial effects of this further scheme are as follows: the polar-perspective mapping does not take scene content into account, the true correspondence between the two domains is far more complicated than a simple mapping, the appearance distortion of the transformed image is still obvious, and mapping alone cannot completely eliminate the domain gap between the two views. To solve this problem, the invention synthesizes ground images with realistic appearance and preserved content from the corresponding satellite views, addressing the huge viewing-angle difference between the two domains in geographic positioning. The invention adopts the DCGAN method: the converted satellite image and the semantically segmented ground image are taken as input, the dual conditional generative adversarial network is trained, and satellite images with a ground-view style are synthesized.
Further, step S5 includes the following sub-steps:
S51: dividing the region-segmented ground image and the ground-view-style satellite image into a number of patches, and using the linear patch embeddings as the input of the Transformer network;
S52: adding the position embedding x_pos and the learnable class embedding x_class to the input of the Transformer network to obtain the input sequence embedding x_0;
S53: inputting the sequence embedding x_0 into the L-layer Transformer encoder of the Transformer network for image matching.
Further, in step S52, the calculation formula of the input sequence embedding x_0 is:
x_0 = [x_class; x] + x_pos
wherein x_pos denotes the position embedding, x_class denotes the learnable class embedding, and x denotes the sequence of linear patch embeddings.
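A minimal PyTorch sketch of constructing x_0 as defined above is given below; the zero initialization and the module name are illustrative choices rather than details fixed by the invention.

import torch
import torch.nn as nn

class PatchInput(nn.Module):
    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))                # learnable class embedding x_class
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))  # learnable position embedding x_pos

    def forward(self, x):                    # x: (B, N, D) linear patch embeddings
        cls = self.cls.expand(x.size(0), -1, -1)
        x0 = torch.cat([cls, x], dim=1)      # prepend the class token: [x_class; x]
        return x0 + self.pos                 # x_0 = [x_class; x] + x_pos

The resulting x_0 is then fed to the L-layer Transformer encoder of step S53.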
The beneficial effects of this further scheme are as follows: the parallel computation, global receptive field and flexible stacking capability of the Transformer cannot be matched by CNNs. Observing CNN-based approaches in CVGL reveals two potential problems. On the one hand, CVGL needs to mine contextual information: images from different domains undergo positional transformations such as rotation, scaling and offset, so the semantic information of the global context must be fully understood. On the other hand, fine-grained information is very important for the retrieval task, and the down-sampling operations of CNN-based methods, namely pooling and strided convolution, reduce image resolution while imperceptibly destroying discriminative fine-grained information. For these reasons the Transformer serves in CVGL as a powerful context-aware information extractor. However, its application in CVGL is still very limited, and the existing methods neglect the direct spatial correspondence of the images and do not preprocess them. Therefore, a Transformer-based CVGL model, SMDT (the English initials of steps S2 to S5 respectively), is proposed; it is the first model to combine image mapping with the Transformer.
The beneficial effects of the invention are as follows: the invention proposes automatic mixed perspective-polar coordinate mapping for the first time, reducing the visual domain gap of ground-aerial images; proposes a dual conditional generative adversarial network for the first time, which takes the converted image and the semantically segmented image as input so that additional useful information is considered; proposes a Transformer-based CVGL model, SMDT, the first model to combine image mapping with the Transformer; and uses a semantic segmentation technique to facilitate the separation of co-visible and non-co-visible regions.
Drawings
FIG. 1 is a flow chart of a cross-view geolocation method;
FIG. 2 is a schematic diagram of semantic segmentation for obtaining category-specific masks corresponding to ground images (gray);
FIG. 3 is a diagram illustrating the mapping effect of satellite images under different methods;
FIG. 4 is a network architecture diagram of DCGAN;
FIG. 5 is a diagram of the overall method framework of the Transformer;
FIG. 6 is a diagram of a Transformer internal method framework.
Detailed Description
The embodiments of the present invention will be further described with reference to the accompanying drawings.
As shown in FIG. 1, the present invention provides a ground-space based cross-view geographic positioning method, comprising the following steps:
S1: acquiring ground images and satellite images for geographic positioning, and establishing a ground-aerial image training set;
S2: performing region segmentation on the ground images using a semantic segmentation method;
S3: converting the top view of the satellite image into a front view by mixed perspective and polar coordinate mapping;
S4: taking the region-segmented ground images and the converted satellite images as the input of the dual conditional generative adversarial network, training the network, and obtaining satellite images with a ground-view style;
S5: inputting the region-segmented ground images and the ground-view-style satellite images into the Transformer network for matching, thereby completing the training of the Transformer network;
S6: acquiring the ground image and satellite images to be geo-located, inputting them into the trained Transformer network, and performing image matching to complete ground-space cross-view geographic positioning.
In the embodiment of the present invention, in step S2, segmentation of the co-visible and non-co-visible regions is realized by performing semantic segmentation on the ground image, which also augments the image sample data. The additional information of the augmented samples is used in the subsequent mixed polar-perspective mapping of step S3 to better distinguish the co-visible region from the non-co-visible region, so that different mapping methods can be selected for different regions. The keep mode also makes better use of the additional road information: because the images are captured at different times, buildings, trees and the like cannot serve as features for image matching; only the roads are likely to remain unchanged and can serve as landmark features, which plays an important role in image matching. Most importantly, the semantic segmentation map is input as a condition of the DCGAN in the subsequent step S4, so that the additional useful information is fully exploited and more effective image descriptors yield better image feature retrieval performance.
As shown in FIG. 2, a semantic segmentation module is used to obtain segmented images corresponding to the ground view, considering sky, roads, sidewalks, trees, buildings and cars. As shown in FIG. 2B, masks are generated to distinguish these classes and are subsequently used to create augmented samples in the keep and remove modes, which are two different image operations. In keep mode the original pixels of a specific class are retained, preserving part of the original image as shown in FIG. 2C, whereas in remove mode they are masked with black pixels, removing part of the original image as shown in FIG. 2D. Remove mode considers all object classes, but in keep mode the sky and cars are not considered, since these classes cannot be used for matching in the bird's-eye view. When the ground view is modified, the satellite view is left unchanged. A minimal sketch of the two modes is given below.
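The following Python sketch illustrates the keep and remove operations on a ground image img (H x W x C) with a per-pixel class map seg (H x W). The class-id sets are placeholders for whatever label map the segmentation module actually produces and are not specified by the patent.

import numpy as np

KEEP_CLASSES = {1, 2}                    # e.g. road and sidewalk (illustrative ids)
ALL_OBJECT_CLASSES = {0, 1, 2, 3, 4, 5}  # remove mode considers every object class

def keep_mode(img, seg, classes=KEEP_CLASSES):
    """Retain the original pixels of the selected classes; everything else becomes black."""
    mask = np.isin(seg, list(classes))
    return img * mask[..., None]

def remove_mode(img, seg, classes):
    """Mask the selected classes with black pixels; everything else is kept."""
    mask = np.isin(seg, list(classes))
    return img * (~mask)[..., None]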
In the embodiment of the present invention, in step S3, formally let A_a × A_a denote the size of the satellite image and H_g × W_g the target size of the polar transformation. The polar transformation between an original satellite image point (x'_a, y'_a) and a target transformed image point (x'_g, y'_g) is then defined as:
x'_a = A_a/2 + (A_a/2)·((H_g - y'_g)/H_g)·sin(2π·x'_g/W_g)
y'_a = A_a/2 - (A_a/2)·((H_g - y'_g)/H_g)·cos(2π·x'_g/W_g)
wherein x'_a represents the abscissa of the original satellite image in the polar mapping, y'_a represents the ordinate of the original satellite image in the polar mapping, x'_g represents the abscissa of the polar-mapped satellite image, y'_g represents the ordinate of the polar-mapped satellite image, A_a represents the size of the original satellite image, H_g represents the height of the polar-mapped satellite image, and W_g represents the width of the polar-mapped satellite image.
The ground panoramic image is aligned with the satellite image so as to establish a spatial correspondence for the cross-view image pair, using the ground-aerial image training set of ground images (front view) and satellite images (top view) established in step S1. Specifically, the satellite image (top view) is produced by orthographic projection, while the ground panoramic image (front view) is produced by equirectangular projection of the viewing sphere. The cross-view image pair therefore has a co-visible region and a non-co-visible region. The co-visible region consists mainly of planar structures that can be seen simultaneously in the ground and satellite views, such as roads and sidewalks, which means that pixels belonging to the co-visible region in the cross-view image pair are related by a perspective mapping. In contrast, the non-co-visible region consists of vertical structures, such as building roofs and tree crowns, of which only one side is visible in either the top-down view or the ground view; in this region only semantic relationships exist between the cross-view images. The invention proposes a mixed polar-perspective mapping method: perspective mapping and polar coordinate mapping are applied respectively, according to the essential difference between the co-visible and non-co-visible regions. As can be seen from FIG. 3(c), the satellite image produced by the mixed perspective-polar mapping is very close to the actual ground image, is more suitable for matching, and greatly reduces the visual domain gap of the ground-aerial image pair.
For the co-visible region, a perspective transformation is used to obtain a satellite image that is very similar to the ground image. For a given pixel (x_a, y_a) in the satellite image coordinate system and the corresponding pixel (x_g, y_g) in the ground image coordinate system, the mapping relationship is described by the following equations:
(The perspective-mapping equations for x_a and y_a are reproduced as equation images in the original publication; they express the original satellite-image coordinates (x_a, y_a) in terms of the ground-view coordinates (x_g, y_g), the camera height C_h, and the image dimensions H_a and W_a.)
wherein x_a represents the abscissa of the original satellite image in the perspective mapping, y_a represents the ordinate of the original satellite image in the perspective mapping, x_g represents the abscissa of the perspective-mapped satellite image, y_g represents the ordinate of the perspective-mapped satellite image, C_h represents the height of the camera above the ground plane, H_a represents the height of the perspective-mapped satellite image, and W_a represents the width of the perspective-mapped satellite image.
However, the basic assumption of the perspective mapping is that all pixels in the co-visible region lie on the same object plane, which does not hold for the vertical structures in the non-co-visible region. In fact, such vertical objects are severely distorted and cannot even be converted into the ground image, so polar mapping is used for them. By applying the polar mapping, the projected geometric gap between the ground image and the satellite image is roughly closed. A sketch of combining the two mappings region by region is given below.
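One straightforward way to realize the mixed mapping, under the assumption that a perspective-warped image persp, a polar-warped image polar and a co-visibility mask coview_mask (all already in the target ground-view frame) have been precomputed, is to select the mapping per pixel according to the mask. This is an illustrative Python sketch of that selection, not the patented procedure itself.

import numpy as np

def hybrid_mapping(persp, polar, coview_mask):
    """Take perspective-mapped pixels where the ground plane is co-visible, polar elsewhere."""
    m = coview_mask.astype(bool)[..., None]   # (H, W, 1), True on roads, sidewalks, etc.
    return np.where(m, persp, polar)          # mixed perspective-polar mapped satellite image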
In the embodiment of the present invention, mapping the satellite image to the ground image with a CGAN in step S4 can be regarded as a visual domain adaptation process. Krishna Regmi proposed the X-Fork and X-Seq structures to assist the cross-view transformation of images with the help of CGANs; the model of the invention is mainly an improvement of X-Fork and X-Seq. As shown in FIG. 4, the invention proposes a DCGAN in which the first network generates a cross-view segmentation image and the second network takes the segmentation image produced by the first generator as input to generate the ground image. The entire system is trained end-to-end, so that both CGANs are trained simultaneously. Compared with the X-Seq framework, a semantic segmentation map of the ground image is added to G1 to participate in the training. Because the semantic segmentation map is used to generate more detail from the contour information of the image, the purpose of these low-level visual cues is to produce more effective image descriptors and thus better image feature retrieval performance; although more detail tends to introduce some confusion into the image, extended experiments show that the trade-off is worthwhile.
With reference to the loss function of the conventional CGAN, an equivalent expression of the cross-view CGAN loss in this architecture is obtained. G2 takes the segmentation image generated by G1 as its conditional input, and the loss functions of the dual conditional generative adversarial network are expressed as:
S'_g = G1(I_a), I'_g = G2(S'_g)
L_CGAN(G2, D2) = E_{S'_g, I_g}[log D2(S'_g, I_g)] + E_{S'_g}[log(1 - D2(S'_g, G2(S'_g)))]
L_L1(G2) = E_{S'_g, I_g}[||I_g - G2(S'_g)||_1]
wherein S'_g represents the image generated by the first generator, I'_g represents the image generated by the second generator (i.e., the satellite image with the ground-view style), I_a represents the input mixed perspective-polar mapped satellite image, S_g represents the semantic segmentation map of the input ground image, I_g represents the input original ground image, L_CGAN(·) represents the CGAN loss function, L_L1(·) represents the L1 loss function, G1 and G2 denote the first and second generators, D2 denotes the second discriminator, E[·] denotes the expectation over the corresponding variables, log is the logarithmic function, D2(·,·) denotes the discrimination output of the second discriminator on its two inputs, and ||·||_1 is the 1-norm (i.e., the pixel-wise difference between the ground-view-style satellite image and the ground image).
In the embodiment of the present invention, in step S4, the expression of the overall objective function L_Dual-CGAN of the dual conditional generative adversarial network is:
L_Dual-CGAN = L_CGAN(G1, D1) + λ·L_L1(G1) + L_CGAN(G2, D2) + λ·L_L1(G2)
wherein L_CGAN(·) denotes the CGAN loss function, L_L1(·) denotes the L1 loss function, G1 denotes the first generator, G2 denotes the second generator, D1 denotes the first discriminator, D2 denotes the second discriminator, and λ is the weight parameter of the loss function L_L1.
In the embodiment of the present invention, step S5 includes the following sub-steps:
S51: dividing the region-segmented ground image and the ground-view-style satellite image into a number of patches, and using the linear patch embeddings as the input of the Transformer network;
S52: adding the position embedding x_pos and the learnable class embedding x_class to the input of the Transformer network to obtain the input sequence embedding x_0;
S53: inputting the sequence embedding x_0 into the L-layer Transformer encoder of the Transformer network for image matching.
In the embodiment of the present invention, in step S52, the calculation formula of the input sequence embedding x_0 is:
x_0 = [x_class; x] + x_pos
wherein x_pos denotes the position embedding and x_class denotes the learnable class embedding.
The global context-aware nature of SMDT effectively reduces the visual domain gap, while position encoding gives it a notion of geometry, reducing the ambiguity caused by geometric misalignment. The invention follows the approach of Yang et al. in applying the Transformer to CVGL, but directly adopts the Transformer layer structure of ViT without using their self-cross attention mechanism. The specific structure is shown in FIG. 5 and FIG. 6.
Vision Transformer: the method is first described in the context of the Vision Transformer (ViT) architecture. As shown in FIG. 6, given an input image, ViT first divides the image into patches. ViT then receives as input the linear sequence of projected patch embeddings
x ∈ R^(N×D),
where N is the number of patches and D is the patch embedding size. A learnable class embedding
x_class ∈ R^D
is prepended to represent the whole image, the position embedding x_pos is added to obtain x_0 = [x_class; x] + x_pos, and x_0 is input to an L-layer Transformer encoder. Each layer consists of a multi-head self-attention (MSA) module and an MLP block, each preceded by layer normalization (LN); the MSA itself consists of multiple self-attention heads and a linear projection.
Domain-specific Transformer: the large domain gap between the ground and satellite images means that it is difficult to match them in a single data space. To suit the cross-view geo-localization task, a domain-specific Siamese-like structure is adopted, with two independent ViT branches of identical structure learning the ground and satellite image representations respectively. As shown in FIG. 5, each branch is a hybrid structure consisting of a ResNet backbone, which extracts a CNN feature map from the image input, and a ViT, which models global context on top of that feature map; the linear patch-embedding projection of ViT is applied to the CNN feature map by treating each 1×1 feature as a patch. A sketch of one such branch is given below.
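The following Python (PyTorch) sketch shows one plausible form of such a hybrid branch; the choice of ResNet-34, the embedding size, the depth, the mean pooling of the encoder output and the torchvision/PyTorch APIs used are assumptions for illustration, not details prescribed by the patent. Two instances with independent weights, one for the ground images and one for the satellite images, would form the Siamese-like structure.

import torch
import torch.nn as nn
import torchvision

class HybridBranch(nn.Module):
    def __init__(self, num_tokens, dim=384, depth=6, heads=6):
        super().__init__()
        resnet = torchvision.models.resnet34(weights=None)            # recent torchvision API
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # CNN feature map (B, 512, h, w)
        self.proj = nn.Conv2d(512, dim, kernel_size=1)                # 1x1 features -> patch embeddings
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))      # learnable 1-D position embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        f = self.proj(self.backbone(x))           # (B, dim, h, w)
        tokens = f.flatten(2).transpose(1, 2)     # (B, h*w, dim): each 1x1 feature is a patch
        out = self.encoder(tokens + self.pos)     # global context with positional cues
        return out.mean(dim=1)                    # pooled image descriptor

num_tokens must equal h*w of the backbone output for the chosen input size (for example 49 for a 224 x 224 input to ResNet-34); the descriptors of the two branches are then compared as in the retrieval sketch of step S6.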
Learnable position embedding: geometric cues can greatly simplify the cross-view geo-localization task. The invention adopts an efficient and flexible approach that gives the network a notion of geometry rather than imposing predefined orientation knowledge on it. Specifically, a learnable one-dimensional position embedding
x_pos ∈ R^((N+1)×D)
is used in ViT. By adding the position embedding to the linear patch embeddings, the transformed features become position dependent. Furthermore, the SMDT of the present invention has broader practical applicability, since no assumption about position knowledge is imposed; it is instead learned from the training objective. Experiments show that the learnable position embedding helps capture relative position information, which is more suitable than absolute position information for images with unknown orientation. In addition, SMDT takes the scene content of the corresponding cross-view geometry into account, which is complementary to the polar-perspective transformation and leads to better localization performance.
The following description is given with reference to specific examples. The invention provides a cross-view image matching method, SMDT, which combines cross-view synthesis and geographic positioning. SMDT fully addresses the issues ignored by existing methods, namely the variability of different scene regions in the ground-aerial image pair, the distinct spatial correspondence of co-visible and non-co-visible regions, the usefulness of additional information, and the limitations of applying the Transformer in CVGL. It achieves new state-of-the-art performance and high matching accuracy, as shown in Tables 1 and 2, where Table 1 is a comparison with other methods on the CVUSA dataset and Table 2 a comparison with other methods on the CVACT dataset.
TABLE 1
Method R@1 R@5 R@10 R@Top1
CVM-Net 22.47 49.98 63.18 93.62
Liu and Li 40.79 66.82 76.36 96.12
CVFT 61.43 84.69 90.49 99.02
SAFA 81.15 94.23 96.85 99.49
Toker et al. 92.56 97.55 98.33 99.67
Polar-EgoTR 94.05 98.27 98.99 99.67
SMDT (invention) 95.02 98.97 99.25 99.87
TABLE 2
Method R@1 R@5 R@10 R@Top1
CVM-Net 20.15 45.00 56.87 87.57
Liu and Li 46.96 68.28 75.48 92.01
CVFT 61.05 81.33 86.52 95.93
SAFA 78.28 91.60 93.79 98.15
Toker et al. 83.28 93.57 95.42 98.22
Polar-EgoTR 84.89 94.59 95.96 98.37
SMDT (invention) 85.52 94.97 96.28 98.96
The working principle and process of the invention are as follows: the invention provides a cross-view geographic positioning method called SMDT. First, different regions of the ground image are segmented with a semantic segmentation technique. Then the top view of the satellite image is converted into a front view by automatic mixed perspective-polar coordinate mapping, greatly reducing the visual domain gap of the ground-aerial image pair. Next, the DCGAN algorithm is proposed: the converted image and the semantically segmented image are taken as input, the dual conditional generative adversarial network is trained, and satellite images with a ground-view style are synthesized. Finally, the Transformer is used to model global dependencies explicitly through its self-attention property. These steps respectively address the issues ignored by existing methods, namely the variability of different scene regions in the ground-aerial image pair, the distinct spatial correspondence of co-visible and non-co-visible regions, the usefulness of additional information, and the limitations of applying the Transformer in CVGL.
The beneficial effects of the invention are as follows: the invention proposes automatic mixed perspective-polar coordinate mapping for the first time, reducing the visual domain gap of ground-aerial images; proposes a dual conditional generative adversarial network for the first time, which takes the converted image and the semantically segmented image as input so that additional useful information is considered; proposes a Transformer-based CVGL model, SMDT, the first model to combine image mapping with the Transformer; and uses a semantic segmentation technique to facilitate the separation of co-visible and non-co-visible regions.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the invention is not limited to the specifically described embodiments and examples. Those skilled in the art may make various modifications and changes in light of this disclosure without departing from the scope of the invention.

Claims (6)

1. A ground-space based cross-view geographic positioning method, characterized by comprising the following steps:
S1: acquiring ground images and satellite images for geographic positioning, and establishing a ground-aerial image training set;
S2: performing region segmentation on the ground images using a semantic segmentation method;
S3: converting the top view of the satellite image into a front view by mixed perspective and polar coordinate mapping;
S4: taking the region-segmented ground images and the converted satellite images as the input of a dual conditional generative adversarial network, training the network, and obtaining satellite images with a ground-view style;
S5: inputting the region-segmented ground images and the ground-view-style satellite images into a Transformer network for matching, thereby completing the training of the Transformer network;
S6: acquiring the ground image and satellite images to be geo-located, inputting them into the trained Transformer network, and performing image matching to complete ground-space cross-view geographic positioning.
2. The ground-space based cross-view geographic positioning method according to claim 1, wherein in step S3, the calculation formula of the perspective mapping is:
(The perspective-mapping equations for x_a and y_a are reproduced as equation images in the original publication; they express the original satellite-image coordinates (x_a, y_a) in terms of the ground-view coordinates (x_g, y_g), the camera height C_h, and the image dimensions H_a and W_a.)
wherein x_a represents the abscissa of the original satellite image in the perspective mapping, y_a represents the ordinate of the original satellite image in the perspective mapping, x_g represents the abscissa of the perspective-mapped satellite image, y_g represents the ordinate of the perspective-mapped satellite image, C_h represents the height of the camera above the ground plane, H_a represents the height of the perspective-mapped satellite image, and W_a represents the width of the perspective-mapped satellite image;
the calculation formula for polar coordinate mapping is as follows:
x'_a = A_a/2 + (A_a/2)·((H_g - y'_g)/H_g)·sin(2π·x'_g/W_g)
y'_a = A_a/2 - (A_a/2)·((H_g - y'_g)/H_g)·cos(2π·x'_g/W_g)
wherein x'_a represents the abscissa of the original satellite image in the polar mapping, y'_a represents the ordinate of the original satellite image in the polar mapping, x'_g represents the abscissa of the polar-mapped satellite image, y'_g represents the ordinate of the polar-mapped satellite image, A_a represents the size of the original satellite image, H_g represents the height of the polar-mapped satellite image, and W_g represents the width of the polar-mapped satellite image.
3. The ground-space based cross-view geographic positioning method according to claim 1, wherein in step S4, the expression of the loss function of the second generator G2 in the dual conditional generative adversarial network is:
S'_g = G1(I_a), I'_g = G2(S'_g)
L_CGAN(G2, D2) = E_{S'_g, I_g}[log D2(S'_g, I_g)] + E_{S'_g}[log(1 - D2(S'_g, G2(S'_g)))]
L_L1(G2) = E_{S'_g, I_g}[||I_g - G2(S'_g)||_1]
wherein S'_g represents the image generated by the first generator, I'_g represents the image generated by the second generator, I_a represents the input mixed perspective-polar mapped satellite image, S_g represents the semantic segmentation map of the input ground image, I_g represents the input original ground image, L_CGAN(·) represents the CGAN loss function, L_L1(·) represents the L1 loss function, G1 denotes the first generator, G2 denotes the second generator, D2 denotes the second discriminator, E[·] denotes the expectation over the corresponding variables, log is the logarithmic function, D2(·,·) denotes the discrimination output of the second discriminator on its two inputs, and ||·||_1 is the 1-norm.
4. The ground-space based cross-view geographic positioning method according to claim 1, wherein in step S4, the expression of the overall objective function L_Dual-CGAN of the dual conditional generative adversarial network is:
L_Dual-CGAN = L_CGAN(G1, D1) + λ·L_L1(G1) + L_CGAN(G2, D2) + λ·L_L1(G2)
wherein L is CGAN(. smallcircle.) denotes the CGAN loss function, LL1(. represents an L1 loss function, G1Denotes a first generator, G2Denotes a second generator, D1Denotes a first discriminator, D2Denotes a second discriminator and lambda is expressed as a loss function LL1The weight parameter of (2).
5. The ground-space based cross-view geographic positioning method according to claim 1, wherein step S5 comprises the following sub-steps:
S51: dividing the region-segmented ground image and the ground-view-style satellite image into a number of patches, and using the linear patch embeddings as the input of the Transformer network;
S52: adding the position embedding x_pos and the learnable class embedding x_class to the input of the Transformer network to obtain the input sequence embedding x_0;
S53: inputting the sequence embedding x_0 into the L-layer Transformer encoder of the Transformer network for image matching.
6. The ground-space based cross-view geographic positioning method according to claim 5, wherein in step S52, the calculation formula of the input sequence embedding x_0 is:
x_0 = [x_class; x] + x_pos
wherein x_pos denotes the position embedding and x_class denotes the learnable class embedding.
CN202210406567.3A 2022-04-18 2022-04-18 Ground-space based cross-view geographic positioning method Pending CN114757999A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210406567.3A CN114757999A (en) 2022-04-18 2022-04-18 Ground-space based cross-view geographic positioning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210406567.3A CN114757999A (en) 2022-04-18 2022-04-18 Ground-space based cross-view geographic positioning method

Publications (1)

Publication Number Publication Date
CN114757999A 2022-07-15

Family

ID=82332132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210406567.3A Pending CN114757999A (en) 2022-04-18 2022-04-18 Ground-space based cross-view geographic positioning method

Country Status (1)

Country Link
CN (1) CN114757999A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102022127693A1 (en) 2022-10-20 2024-04-25 Ford Global Technologies, Llc Method and system for evaluating a device for on-board camera-based detection of road markings


Similar Documents

Publication Publication Date Title
Tian et al. UAV-satellite view synthesis for cross-view geo-localization
Yang et al. Cross-view geo-localization with layer-to-layer transformer
Lu et al. Geometry-aware satellite-to-ground image synthesis for urban areas
CN103839277B (en) A kind of mobile augmented reality register method of outdoor largescale natural scene
US10043097B2 (en) Image abstraction system
CN111126304A (en) Augmented reality navigation method based on indoor natural scene image deep learning
Huang et al. Learning identity-invariant motion representations for cross-id face reenactment
CN113361508B (en) Cross-view-angle geographic positioning method based on unmanned aerial vehicle-satellite
Shi et al. Accurate 3-dof camera geo-localization via ground-to-satellite image matching
CN110930310B (en) Panoramic image splicing method
Song et al. A joint siamese attention-aware network for vehicle object tracking in satellite videos
CN114757999A (en) Ground-space based cross-view geographic positioning method
Pham et al. Fast and efficient method for large-scale aerial image stitching
Zhou et al. PADENet: An efficient and robust panoramic monocular depth estimation network for outdoor scenes
Zhu et al. Simple, effective and general: A new backbone for cross-view image geo-localization
Li et al. Super-resolution-based part collaboration network for vehicle re-identification
CN114299101A (en) Method, apparatus, device, medium, and program product for acquiring target region of image
CN114283152A (en) Image processing method, image processing model training method, image processing device, image processing equipment and image processing medium
Tian et al. Smdt: Cross-view geo-localization with image alignment and transformer
CN116740488B (en) Training method and device for feature extraction model for visual positioning
CN117456136A (en) Digital twin scene intelligent generation method based on multi-mode visual recognition
Li et al. Stereo neural vernier caliper
CN113781372A (en) Deep learning-based opera facial makeup generation method and system
Zhao et al. TransFG: A Cross-view geo-localization of Satellite and UAVs Imagery Pipeline using Transformer-Based Feature Aggregation and Gradient Guidance
Zhu et al. Cross-View Image Synthesis From a Single Image With Progressive Parallel GAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination