CN116168221B - Transformer-based cross-mode image matching and positioning method and device - Google Patents

Transformer-based cross-mode image matching and positioning method and device

Info

Publication number
CN116168221B
CN116168221B (application CN202310450328.2A)
Authority
CN
China
Prior art keywords
image
generator
loss function
input
output
Prior art date
Legal status
Active
Application number
CN202310450328.2A
Other languages
Chinese (zh)
Other versions
CN116168221A (en)
Inventor
杨小冈
李清格
申通
卢瑞涛
朱正杰
张涛
谢学立
Current Assignee
Rocket Force University of Engineering of PLA
Original Assignee
Rocket Force University of Engineering of PLA
Priority date
Filing date
Publication date
Application filed by Rocket Force University of Engineering of PLA
Priority to CN202310450328.2A
Publication of CN116168221A
Application granted
Publication of CN116168221B
Legal status: Active
Anticipated expiration

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757Matching configurations of points or features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038Image mosaicing, e.g. composing plane images from plane sub-images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/60Analysis of geometric attributes
    • G06T7/66Analysis of geometric attributes of image moments or centre of gravity
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/32Indexing scheme for image data processing or generation, in general involving image mosaicing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10048Infrared image
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Optimization (AREA)
  • Software Systems (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Image Processing (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Algebra (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)

Abstract

The invention relates to a Transformer-based cross-mode image matching and positioning method and device, which solve the problems of low cross-mode image matching precision and large positioning error. Cross-mode image style migration is adopted to convert cross-mode images into the same feature domain, and a Transformer-based intelligent matching algorithm effectively improves the matching precision, thereby realizing reliable and effective cross-mode image matching and geographic positioning.

Description

Transformer-based cross-mode image matching and positioning method and device
Technical Field
The application relates to the technical field of autonomous visual positioning of aircraft, in particular to a Transformer-based cross-mode image matching and positioning method and device.
Background
Vision-based autonomous positioning technology for aircraft is developing rapidly; it is a development requirement for aircraft navigation guidance, situation awareness and autonomous decision-making, and plays an irreplaceable role in typical space-based platform tasks such as target detection, recognition and tracking. A single-mode sensor acquires limited information, whereas an infrared image reflects the thermal radiation of objects, is not easily affected by external illumination, and can image effectively at night or in a smoke environment. Therefore, matching the aerial real-time infrared image with a visible light image of known geographic information yields richer information, satisfies the night geolocation needs of an aircraft and the all-day operating requirements of its navigation system, and has important and broad application prospects.
Developing a cross-modal image matching and positioning algorithm for an aircraft is a fundamental yet highly challenging task. The task first obtains a real-time image through an onboard camera; it then matches the real-time image with a reference image of known geographic position using an intelligent matching algorithm and determines the positions of feature points of the real-time image in the reference image, where the real-time image and the reference image are cross-modal images; finally, from the matching correspondences of the feature points, it calculates the actual geographic positioning information of the aircraft with a multi-view geometry algorithm. The biggest difficulty in cross-modal image matching and positioning is to match cross-modal images accurately, robustly and effectively. When the modality difference between images is large, the viewing-angle change is large or the features are not obvious, the performance of a matching algorithm is greatly affected. Traditional matching methods include region-based methods and feature-based methods: the former are simple in principle but are not real-time and easily fall into a local optimum; the latter have a small computational load, but the extracted features are shallow and hand-crafted feature points cannot capture semantic information, which easily causes mismatching. Deep-learning-based matching methods extract image features with deep neural networks and achieve higher matching precision, but they are difficult to apply directly to cross-mode image matching with obvious modality differences.
Disclosure of Invention
In order to overcome at least one defect in the prior art, the application provides a Transformer-based cross-mode image matching and positioning method and device.
In a first aspect, a Transformer-based cross-mode image matching and positioning method is provided, including:
acquiring a real-time infrared image and a visible light image under the view angle of the unmanned aerial vehicle;
adopting a cross-mode image style migration network structure to perform style migration on the visible light image to obtain a pseudo infrared image;
performing intelligent matching on the real-time infrared image and the pseudo-infrared image by adopting a Transformer intelligent matching method to obtain a characteristic point matching relationship;
determining a homography transformation matrix according to the characteristic point matching relation;
according to the homography transformation matrix, perspective transformation is carried out on the center point of the real-time infrared image, and the pixel point corresponding to the center point in the pseudo infrared image is determined;
mapping pixel points corresponding to the center points in the pseudo infrared image onto the visible light image, and determining mapping points in the visible light image;
and obtaining a geographic positioning result of the unmanned aerial vehicle according to the geographic position information corresponding to the mapping points in the visible light image.
In one embodiment, the cross-modality image style migration network structure is a CycleGAN network structure, and the total loss function is:
$$\mathcal{L}(G,F,D_X,D_Y)=\mathcal{L}_{GAN}(G,D_Y,X,Y)+\mathcal{L}_{GAN}(F,D_X,Y,X)+\lambda_{cyc}\,\mathcal{L}_{cyc}(G,F)+\lambda_{idt}\,\mathcal{L}_{idt}(G,F)$$
wherein G is the first generator, F is the second generator, X is the source domain, Y is the target domain, $D_Y$ is the first discriminator and $D_X$ is the second discriminator; $\mathcal{L}(G,F,D_X,D_Y)$ is the total loss function, $\mathcal{L}_{GAN}(G,D_Y,X,Y)$ is the adversarial loss function between the first generator G and the first discriminator $D_Y$, $\mathcal{L}_{GAN}(F,D_X,Y,X)$ is the adversarial loss function between the second generator F and the second discriminator $D_X$, $\mathcal{L}_{cyc}(G,F)$ is the cycle consistency loss function, $\mathcal{L}_{idt}(G,F)$ is the ontology consistency loss function, $\lambda_{cyc}$ is the weight coefficient of the cycle consistency loss function, and $\lambda_{idt}$ is the weight coefficient of the ontology consistency loss function.
In one embodiment, the adversarial loss function $\mathcal{L}_{GAN}(G,D_Y,X,Y)$ between the first generator G and the first discriminator $D_Y$ adopts the following formula:
$$\mathcal{L}_{GAN}(G,D_Y,X,Y)=\mathbb{E}_{y\sim p_{data}(y)}\big[\log D_Y(y)\big]+\mathbb{E}_{x\sim p_{data}(x)}\big[\log\big(1-D_Y(G(x))\big)\big]$$
wherein y is a real image of the target domain, x is a real image of the source domain, G(x) is the output of the first generator G when the input is x, $\mathbb{E}$ is the expected value, $D_Y(y)$ is the output of the first discriminator $D_Y$ when the input is y, and $D_Y(G(x))$ is the output of the first discriminator $D_Y$ when the input is G(x);
the adversarial loss function $\mathcal{L}_{GAN}(F,D_X,Y,X)$ between the second generator F and the second discriminator $D_X$ adopts the following formula:
$$\mathcal{L}_{GAN}(F,D_X,Y,X)=\mathbb{E}_{x\sim p_{data}(x)}\big[\log D_X(x)\big]+\mathbb{E}_{y\sim p_{data}(y)}\big[\log\big(1-D_X(F(y))\big)\big]$$
wherein F(y) is the output of the second generator F when the input is y, $D_X(x)$ is the output of the second discriminator $D_X$ when the input is x, and $D_X(F(y))$ is the output of the second discriminator $D_X$ when the input is F(y);
the cycle consistency loss function $\mathcal{L}_{cyc}(G,F)$ adopts the following formula:
$$\mathcal{L}_{cyc}(G,F)=\mathbb{E}_{x\sim p_{data}(x)}\big[\lVert F(G(x))-x\rVert_1\big]+\mathbb{E}_{y\sim p_{data}(y)}\big[\lVert G(F(y))-y\rVert_1\big]$$
wherein F(G(x)) is the output of the second generator F when the input is G(x), and G(F(y)) is the output of the first generator G when the input is F(y);
the ontology consistency loss function $\mathcal{L}_{idt}(G,F)$ adopts the following formula:
$$\mathcal{L}_{idt}(G,F)=\mathbb{E}_{y\sim p_{data}(y)}\big[\lVert G(y)-y\rVert_1\big]+\mathbb{E}_{x\sim p_{data}(x)}\big[\lVert F(x)-x\rVert_1\big]$$
wherein G(y) is the output of the first generator G when the input is y, and F(x) is the output of the second generator F when the input is x.
In one embodiment, performing intelligent matching on the real-time infrared image and the pseudo infrared image by a Transformer intelligent matching method to obtain a characteristic point matching relationship includes:
respectively performing feature extraction on the real-time infrared image and the pseudo-infrared image with a twin (Siamese) network built on a ResNet50 backbone to obtain two feature maps, and stitching the two feature maps to obtain a stitched feature map;
adding a position encoding to the stitched feature map to obtain a context feature map;
inputting the query points together with the context feature map into a Transformer encoder-decoder structure to obtain a high-dimensional vector;
and inputting the high-dimensional vector into a multi-layer perceptron to obtain the characteristic point matching relationship.
In a second aspect, a Transformer-based cross-modality image matching and positioning device is provided, including:
the image acquisition module is used for acquiring real-time infrared images and visible light images under the view angle of the unmanned aerial vehicle;
the image style migration module is used for performing style migration on the visible light image by adopting a cross-mode image style migration network structure to obtain a pseudo infrared image;
the intelligent matching module is used for intelligently matching the real-time infrared image and the pseudo-infrared image by adopting a Transformer intelligent matching method to obtain a characteristic point matching relationship;
the homography transformation matrix determining module is used for determining a homography transformation matrix according to the characteristic point matching relation;
the perspective transformation module is used for carrying out perspective transformation on the center point of the real-time infrared image according to the homography transformation matrix, and determining the pixel point corresponding to the center point in the pseudo infrared image;
the mapping module is used for mapping the pixel points corresponding to the center points in the pseudo infrared image to the visible light image and determining the mapping points in the visible light image;
and the positioning result determining module is used for obtaining the geographic positioning result of the unmanned aerial vehicle according to the geographic position information corresponding to the mapping points in the visible light image.
In one embodiment, the cross-modality image style migration network structure is a CycleGAN network structure, and the total loss function is:
$$\mathcal{L}(G,F,D_X,D_Y)=\mathcal{L}_{GAN}(G,D_Y,X,Y)+\mathcal{L}_{GAN}(F,D_X,Y,X)+\lambda_{cyc}\,\mathcal{L}_{cyc}(G,F)+\lambda_{idt}\,\mathcal{L}_{idt}(G,F)$$
wherein G is the first generator, F is the second generator, X is the source domain, Y is the target domain, $D_Y$ is the first discriminator and $D_X$ is the second discriminator; $\mathcal{L}(G,F,D_X,D_Y)$ is the total loss function, $\mathcal{L}_{GAN}(G,D_Y,X,Y)$ is the adversarial loss function between the first generator G and the first discriminator $D_Y$, $\mathcal{L}_{GAN}(F,D_X,Y,X)$ is the adversarial loss function between the second generator F and the second discriminator $D_X$, $\mathcal{L}_{cyc}(G,F)$ is the cycle consistency loss function, $\mathcal{L}_{idt}(G,F)$ is the ontology consistency loss function, $\lambda_{cyc}$ is the weight coefficient of the cycle consistency loss function, and $\lambda_{idt}$ is the weight coefficient of the ontology consistency loss function.
In one embodiment, the adversarial loss function $\mathcal{L}_{GAN}(G,D_Y,X,Y)$ between the first generator G and the first discriminator $D_Y$ adopts the following formula:
$$\mathcal{L}_{GAN}(G,D_Y,X,Y)=\mathbb{E}_{y\sim p_{data}(y)}\big[\log D_Y(y)\big]+\mathbb{E}_{x\sim p_{data}(x)}\big[\log\big(1-D_Y(G(x))\big)\big]$$
wherein y is a real image of the target domain, x is a real image of the source domain, G(x) is the output of the first generator G when the input is x, $\mathbb{E}$ is the expected value, $D_Y(y)$ is the output of the first discriminator $D_Y$ when the input is y, and $D_Y(G(x))$ is the output of the first discriminator $D_Y$ when the input is G(x);
the adversarial loss function $\mathcal{L}_{GAN}(F,D_X,Y,X)$ between the second generator F and the second discriminator $D_X$ adopts the following formula:
$$\mathcal{L}_{GAN}(F,D_X,Y,X)=\mathbb{E}_{x\sim p_{data}(x)}\big[\log D_X(x)\big]+\mathbb{E}_{y\sim p_{data}(y)}\big[\log\big(1-D_X(F(y))\big)\big]$$
wherein F(y) is the output of the second generator F when the input is y, $D_X(x)$ is the output of the second discriminator $D_X$ when the input is x, and $D_X(F(y))$ is the output of the second discriminator $D_X$ when the input is F(y);
the cycle consistency loss function $\mathcal{L}_{cyc}(G,F)$ adopts the following formula:
$$\mathcal{L}_{cyc}(G,F)=\mathbb{E}_{x\sim p_{data}(x)}\big[\lVert F(G(x))-x\rVert_1\big]+\mathbb{E}_{y\sim p_{data}(y)}\big[\lVert G(F(y))-y\rVert_1\big]$$
wherein F(G(x)) is the output of the second generator F when the input is G(x), and G(F(y)) is the output of the first generator G when the input is F(y);
the ontology consistency loss function $\mathcal{L}_{idt}(G,F)$ adopts the following formula:
$$\mathcal{L}_{idt}(G,F)=\mathbb{E}_{y\sim p_{data}(y)}\big[\lVert G(y)-y\rVert_1\big]+\mathbb{E}_{x\sim p_{data}(x)}\big[\lVert F(x)-x\rVert_1\big]$$
wherein G(y) is the output of the first generator G when the input is y, and F(x) is the output of the second generator F when the input is x.
In one embodiment, the intelligent matching module is further configured for:
respectively performing feature extraction on the real-time infrared image and the pseudo-infrared image with a twin (Siamese) network built on a ResNet50 backbone to obtain two feature maps, and stitching the two feature maps to obtain a stitched feature map;
adding a position encoding to the stitched feature map to obtain a context feature map;
inputting the query points together with the context feature map into a Transformer encoder-decoder structure to obtain a high-dimensional vector;
and inputting the high-dimensional vector into a multi-layer perceptron to obtain the characteristic point matching relationship.
Compared with the prior art, the application has the following beneficial effects:
1. The invention provides a general framework for a cross-mode image matching and positioning method, which uses a generative adversarial network to convert cross-mode images with large feature differences into the same feature domain, solving the problem of mismatching caused by the imaging differences of cross-mode images.
2. A network model is constructed to perform style migration on the visible light image; besides the adversarial loss and the cycle consistency loss in the original CycleGAN, the total loss function constructs an ontology consistency loss to accelerate the convergence of the network.
3. The style-migrated image is matched with the real-time infrared image by a Transformer-based intelligent matching algorithm, which remarkably improves the matching precision, and the geographic positioning result is obtained by perspective transformation. Experimental results show that the positioning method remarkably improves the matching performance of cross-mode images and achieves reliable and effective geographic positioning, verifying its effectiveness and superiority.
Drawings
The present application may be better understood by reference to the following description taken in conjunction with the accompanying drawings, which are incorporated in and form a part of this specification, together with the following detailed description. In the drawings:
FIG. 1 shows a flow diagram of a Transformer-based cross-modality image matching localization method according to an embodiment of the present application;
FIG. 2 illustrates an image modality conversion schematic;
FIG. 3 shows a schematic diagram of a CycleGAN network architecture;
FIG. 4 shows a block diagram of a Transformer-based cross-modality image matching and positioning device according to an embodiment of the present application;
FIG. 5 shows a comparison of the positioning results of the positioning method of the present application and an existing method.
Detailed Description
Exemplary embodiments of the present application will be described hereinafter with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual embodiment are described in the specification. It will of course be appreciated that in the development of any such actual embodiment, numerous implementation-specific decisions may be made to achieve the developers' specific goals, and that these decisions may vary from one implementation to another.
It should be noted that, in order to avoid obscuring the present application with unnecessary details, only the device structures closely related to the solution according to the present application are shown in the drawings, and other details not greatly related to the present application are omitted.
It should be understood that the present application is not limited to the embodiments described below with reference to the drawings. Herein, where possible, embodiments may be combined with each other, features may be replaced or borrowed between different embodiments, and one or more features may be omitted in an embodiment.
The cross-modal image style migration adopted herein can convert cross-modal images into the same feature domain, and the Transformer-based intelligent matching algorithm can effectively improve the matching precision, thereby realizing reliable and effective cross-modal image matching and geographic positioning.
An embodiment of the present application provides a Transformer-based cross-mode image matching and positioning method. FIG. 1 shows a flow chart of the Transformer-based cross-mode image matching and positioning method according to an embodiment of the present application; referring to FIG. 1, the method includes:
step S1, acquiring real-time infrared images and visible light images under the view angle of the unmanned aerial vehicle.
Here, an EVO II unmanned aerial vehicle cruises along a planned route in the specified visual navigation area at a flying height of 350 m with a forward, downward-looking view, and daytime visible light images containing houses, roads, plants and the like are captured.
Based on the satellite image, the longitude and latitude corresponding to each pixel point of the visible light image captured by the unmanned aerial vehicle are determined, yielding a reference image with known geographic position information. The unmanned aerial vehicle carrying an infrared camera is then flown in the visual navigation area to capture forward-looking real-time infrared images.
Step S2, adopting a cross-mode image style migration network structure to perform style migration on the visible light image to obtain a pseudo infrared image;
Step S3, performing intelligent matching on the real-time infrared image and the pseudo infrared image by a Transformer intelligent matching method to obtain a characteristic point matching relation;
s4, determining a homography transformation matrix according to the characteristic point matching relation;
s5, performing perspective transformation on a central point of the real-time infrared image according to the homography transformation matrix, and determining a pixel point corresponding to the central point in the pseudo-infrared image; here, the center point of the real-time infrared image is the current position of the drone.
Step S6, mapping the pixel points corresponding to the center points in the pseudo infrared image to the visible light image, and determining mapping points in the visible light image;
and S7, obtaining a geographic positioning result of the unmanned aerial vehicle according to the geographic position information corresponding to the mapping points in the visible light image. Here, the geographic position information corresponding to the mapping points is the geographic positioning result of the unmanned aerial vehicle; the geographical location information corresponding to each pixel point in the visible light image is known, so that a final geographical location result of the unmanned aerial vehicle can be obtained.
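As an illustrative, non-limiting sketch of steps S4 to S7, the following Python code uses OpenCV and NumPy to estimate the homography from the matched feature points, perspective-transform the infrared image center, and look up the geographic position in the reference image. The function name, the RANSAC reprojection threshold and the linear latitude/longitude interpolation from the reference-image corner coordinates are assumptions made for illustration only.

```python
# Sketch of steps S4-S7: homography estimation, perspective transform of the
# image centre, and geographic lookup. Matched points are assumed to be given
# as N x 2 pixel-coordinate arrays; the corner-based linear geo-interpolation
# is an illustrative assumption.
import cv2
import numpy as np

def locate_uav(pts_infrared, pts_pseudo_ir, ir_shape, vis_shape, geo_corners):
    """Estimate the UAV geographic position from matched feature points.

    pts_infrared : (N, 2) matched points in the real-time infrared image
    pts_pseudo_ir: (N, 2) corresponding points in the pseudo-infrared image
    ir_shape     : (h, w) of the real-time infrared image
    vis_shape    : (h, w) of the visible reference image
    geo_corners  : ((lat_tl, lon_tl), (lat_br, lon_br)) of the reference image
    """
    src = np.asarray(pts_infrared, dtype=np.float32)
    dst = np.asarray(pts_pseudo_ir, dtype=np.float32)

    # Step S4: homography from the matching relationship (RANSAC rejects outliers).
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    # Step S5: perspective-transform the infrared image centre (current UAV position).
    h, w = ir_shape
    centre = np.array([[[w / 2.0, h / 2.0]]], dtype=np.float32)
    u, v = cv2.perspectiveTransform(centre, H)[0, 0]

    # Step S6: the pseudo-infrared image is generated pixel-aligned with the
    # visible reference image, so the mapped pixel carries over directly.
    # Step S7: look up the geographic position; simple linear interpolation
    # between known corner coordinates is assumed here.
    (lat_tl, lon_tl), (lat_br, lon_br) = geo_corners
    vh, vw = vis_shape
    lat = lat_tl + (v / vh) * (lat_br - lat_tl)
    lon = lon_tl + (u / vw) * (lon_br - lon_tl)
    return lat, lon
```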
In one embodiment, the cross-modality image style migration network structure is a CycleGAN network structure, and the total loss function is:
$$\mathcal{L}(G,F,D_X,D_Y)=\mathcal{L}_{GAN}(G,D_Y,X,Y)+\mathcal{L}_{GAN}(F,D_X,Y,X)+\lambda_{cyc}\,\mathcal{L}_{cyc}(G,F)+\lambda_{idt}\,\mathcal{L}_{idt}(G,F)$$
wherein G is the first generator, F is the second generator, X is the source domain, Y is the target domain, $D_Y$ is the first discriminator and $D_X$ is the second discriminator; $\mathcal{L}(G,F,D_X,D_Y)$ is the total loss function, $\mathcal{L}_{GAN}(G,D_Y,X,Y)$ is the adversarial loss function between the first generator G and the first discriminator $D_Y$, $\mathcal{L}_{GAN}(F,D_X,Y,X)$ is the adversarial loss function between the second generator F and the second discriminator $D_X$, $\mathcal{L}_{cyc}(G,F)$ is the cycle consistency loss function, $\mathcal{L}_{idt}(G,F)$ is the ontology consistency loss function, $\lambda_{cyc}$ is the weight coefficient of the cycle consistency loss function, and $\lambda_{idt}$ is the weight coefficient of the ontology consistency loss function. Better results can be obtained by adjusting $\lambda_{cyc}$ and $\lambda_{idt}$.
In this embodiment, a cross-modal image style migration network structure is used to implement style migration from visible light images to infrared images, and FIG. 2 shows an image modality conversion schematic diagram. The CycleGAN network structure is a ring structure composed of two opposing GAN networks and can realize mutual conversion between images of the source domain X (the visible light image domain) and the target domain Y (the infrared image domain). FIG. 3 shows a schematic diagram of the CycleGAN network structure; referring to FIG. 3, it mainly comprises a first generator G, a second generator F, a first discriminator $D_Y$ and a second discriminator $D_X$. In the figure, y is a real image of the target domain Y and x is a real image of the source domain X; the first generator G maps images from the source domain X to the target domain Y, and conversely, the second generator F converts images of the target domain Y to the source domain X.
The traditional CycleGAN network structure calculates the generation loss using the mean square error (MSE). Because the squaring operation in MSE amplifies larger errors, outliers can significantly affect the prediction results and ultimately reduce the overall performance of the model. In addition, when the initial output value is large, the gradient update amplitude of the MSE loss function is small, so convergence is slow and model training is unstable. In view of this, in this embodiment the total loss function comprises three parts: besides the adversarial loss and the cycle consistency loss in the original CycleGAN, an ontology consistency loss is constructed to maintain the hue of the image and prevent the overall color of the image from changing.
Specifically, the adversarial loss describes the game between a generator and a discriminator, and two adversarial losses are designed. The adversarial loss function between the first generator G and the first discriminator $D_Y$ adopts the following formula:
$$\mathcal{L}_{GAN}(G,D_Y,X,Y)=\mathbb{E}_{y\sim p_{data}(y)}\big[\log D_Y(y)\big]+\mathbb{E}_{x\sim p_{data}(x)}\big[\log\big(1-D_Y(G(x))\big)\big]$$
wherein y is a real image of the target domain, x is a real image of the source domain, G(x) is the output of the first generator G when the input is x, $\mathbb{E}$ is the expected value, $D_Y(y)$ is the output of the first discriminator $D_Y$ when the input is y, and $D_Y(G(x))$ is the output of the first discriminator $D_Y$ when the input is G(x).
The adversarial loss function between the second generator F and the second discriminator $D_X$ adopts the following formula:
$$\mathcal{L}_{GAN}(F,D_X,Y,X)=\mathbb{E}_{x\sim p_{data}(x)}\big[\log D_X(x)\big]+\mathbb{E}_{y\sim p_{data}(y)}\big[\log\big(1-D_X(F(y))\big)\big]$$
wherein F(y) is the output of the second generator F when the input is y, $D_X(x)$ is the output of the second discriminator $D_X$ when the input is x, and $D_X(F(y))$ is the output of the second discriminator $D_X$ when the input is F(y).
The cycle consistency loss ensures that, after a full conversion cycle, an input image remains as close as possible to the original image, i.e. the forward cycle $x\rightarrow G(x)\rightarrow F(G(x))\approx x$ and the reverse cycle $y\rightarrow F(y)\rightarrow G(F(y))\approx y$. The total cycle consistency loss measures the $L_1$ distance between the reconstructed images and the real images, and the cycle consistency loss function adopts the following formula:
$$\mathcal{L}_{cyc}(G,F)=\mathbb{E}_{x\sim p_{data}(x)}\big[\lVert F(G(x))-x\rVert_1\big]+\mathbb{E}_{y\sim p_{data}(y)}\big[\lVert G(F(y))-y\rVert_1\big]$$
wherein F(G(x)) and G(F(y)) are the reconstructed images of the forward cycle and the reverse cycle, respectively, i.e. the output of the second generator F when the input is G(x) and the output of the first generator G when the input is F(y).
The ontology consistency loss constrains the generators to preserve image color, prevents the generated image from shifting in hue, and ensures that the generated image keeps the color configuration of the original image: feeding a target-domain image y into the first generator G, or a source-domain image x into the second generator F, should return the image itself. The ontology consistency loss function adopts the following formula:
$$\mathcal{L}_{idt}(G,F)=\mathbb{E}_{y\sim p_{data}(y)}\big[\lVert G(y)-y\rVert_1\big]+\mathbb{E}_{x\sim p_{data}(x)}\big[\lVert F(x)-x\rVert_1\big]$$
wherein G(y) is the output of the first generator G when the input is y, and F(x) is the output of the second generator F when the input is x.
The optimized objective function expression is:
$$G^{*},F^{*}=\arg\min_{G,F}\max_{D_X,D_Y}\mathcal{L}(G,F,D_X,D_Y)$$
wherein $G^{*}$ is the optimal solution of G and $F^{*}$ is the optimal solution of F.
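The four loss terms above can be combined as in the following minimal PyTorch sketch. The generator and discriminator modules G, F, D_Y, D_X are assumed to exist elsewhere, the discriminators are assumed to output logits, and the concrete weight values are illustrative; only the structure of the total loss follows the formulas above.

```python
# Minimal PyTorch sketch of the total generator loss: two adversarial terms,
# cycle consistency and ontology (identity) consistency. Modules G, F, D_Y, D_X
# and the exact adversarial criterion are assumptions made for illustration.
import torch
import torch.nn as nn

adv_criterion = nn.BCEWithLogitsLoss()  # log-form adversarial loss on discriminator logits (assumed)
l1 = nn.L1Loss()

def total_generator_loss(G, F, D_Y, D_X, real_x, real_y,
                         lambda_cyc=10.0, lambda_idt=5.0):     # weights are illustrative
    fake_y = G(real_x)                                         # X -> Y (visible -> pseudo-infrared)
    fake_x = F(real_y)                                         # Y -> X

    # Adversarial terms: each generator tries to make its discriminator output "real".
    pred_y, pred_x = D_Y(fake_y), D_X(fake_x)
    loss_gan = (adv_criterion(pred_y, torch.ones_like(pred_y)) +
                adv_criterion(pred_x, torch.ones_like(pred_x)))

    # Cycle consistency: forward loop x -> G(x) -> F(G(x)) and reverse loop y -> F(y) -> G(F(y)).
    loss_cyc = l1(F(fake_y), real_x) + l1(G(fake_x), real_y)

    # Ontology (identity) consistency: G applied to a target-domain image
    # (and F to a source-domain image) should return the image itself.
    loss_idt = l1(G(real_y), real_y) + l1(F(real_x), real_x)

    return loss_gan + lambda_cyc * loss_cyc + lambda_idt * loss_idt
```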
Before the cross-modal image style migration network structure is applied to perform style migration on the visible light image, training is required to be performed on the cross-modal image style migration network structure, and the specific training process is as follows:
the training process uses an open source RGB-NIR scene dataset containing 9 pairs of visible and near infrared images, totaling 954 images. Before the image is input into the network, it is first normalized to 256×256 size.
In the training process, the number of iterations (epochs) is set to 200 and the batch size is set to 1. For the first 100 epochs the learning rate is fixed at 0.0002, and over the last 100 epochs the learning rate is adjusted adaptively with the Adam optimizer. To enhance the conversion effect from the visible light image to the infrared image, the forward reconstruction loss weight is set to 30 and the reverse reconstruction loss weight is set to 10, strengthening the importance of the first generator G. In addition, the weight coefficients of the cycle consistency loss $\lambda_{cyc}$ and the ontology consistency loss $\lambda_{idt}$ are set to a fixed ratio.
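One possible way to reproduce the stated schedule is sketched below. The linear decay to zero after epoch 100 and the Adam momentum parameters are assumptions; the description only states that the learning rate is fixed at 0.0002 for the first 100 epochs and adjusted adaptively afterwards.

```python
# Sketch of the training schedule: Adam at lr = 2e-4 for the first 100 epochs,
# then a learning-rate adjustment over the last 100 epochs (a linear decay to
# zero is assumed here).
import itertools
import torch

def build_optimizer_and_scheduler(G, F, n_epochs=200, decay_start=100, base_lr=2e-4):
    optimizer = torch.optim.Adam(
        itertools.chain(G.parameters(), F.parameters()),
        lr=base_lr, betas=(0.5, 0.999))        # betas are an illustrative choice

    def lr_lambda(epoch):
        # Keep the base rate for the first `decay_start` epochs, then decay linearly.
        if epoch < decay_start:
            return 1.0
        return 1.0 - (epoch - decay_start) / float(n_epochs - decay_start)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
    return optimizer, scheduler

# Usage inside the training loop (batch size 1, 200 epochs):
# optimizer, scheduler = build_optimizer_and_scheduler(G, F)
# for epoch in range(200):
#     for real_x, real_y in dataloader:
#         loss = ...                        # e.g. the total generator loss above
#         optimizer.zero_grad(); loss.backward(); optimizer.step()
#     scheduler.step()
```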
In this embodiment, the generative adversarial network converts cross-mode images with large feature differences into the same feature domain, solving the problem of mismatching caused by the imaging differences of cross-mode images; meanwhile, besides the adversarial loss and the cycle consistency loss in the original CycleGAN network, the total loss function constructs an ontology consistency loss, which accelerates the convergence of the network.
In one embodiment, in step S3, a Transformer intelligent matching method is used to intelligently match the real-time infrared image and the pseudo-infrared image to obtain a feature point matching relationship, which includes:
Step S31, adopting a twin (Siamese) network with a ResNet50 backbone to perform feature extraction on the real-time infrared image and the pseudo-infrared image respectively, obtaining two feature maps, denoted I and I′, and stitching the two feature maps to obtain a stitched feature map. Here, the input images are first resized to 256×256, and the ResNet50 backbone serves as the feature extraction network of the Siamese network, yielding two feature maps with 1024 channels each; these are then projected to 16×16×256 to reduce the computation of the Transformer, and stitched to form a stitched feature map of size 16×32×256.
Step S32, adding a position encoding to the stitched feature map to obtain the context feature map. Here, the position encoding is added to the 16×32×256 feature map, producing a 16×32×256 context feature map.
Step S33, inputting the query point into the Transformer encoder-decoder structure together with the context feature map to obtain a high-dimensional vector. Here, the encoder and the decoder each have 6 layers, and each layer contains an 8-head self-attention module; the query point is a normalized coordinate position in the feature map I, and the Transformer encoder-decoder structure outputs a 256-dimensional vector.
Step S34, inputting the high-dimensional vector into the multi-layer perceptron to obtain the feature point matching relationship. Here, the multi-layer perceptron outputs the feature point in the feature map I′ that matches the query point, and these correspondences form the feature point matching relationship.
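Steps S31 to S34 can be summarized in the following PyTorch sketch. The ResNet50 truncation point, the learned positional encoding, the query-point embedding and the MLP head are assumptions made for illustration; only the stated sizes (two 16×16×256 feature maps stitched into 16×32×256, a 6-layer encoder and decoder with 8 attention heads, and a 256-dimensional output vector) are taken from the description above.

```python
# Illustrative sketch of the matching network (steps S31-S34); architecture
# details not stated in the description are assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class CrossModalMatcher(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        backbone = resnet50(weights=None)
        # Shared (Siamese) ResNet50 backbone truncated after the 1024-channel stage
        # (layer3), so a 3-channel 256x256 input gives a 16x16 feature map.
        self.backbone = nn.Sequential(*list(backbone.children())[:-3])
        self.proj = nn.Conv2d(1024, d_model, kernel_size=1)              # -> 16x16x256 per image
        self.pos_embed = nn.Parameter(torch.zeros(16 * 32, 1, d_model))  # learned positional encoding (assumed)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=6, num_decoder_layers=6)
        self.query_embed = nn.Linear(2, d_model)                         # embed the normalized 2-D query point
        self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, 2))                  # matched point coordinates

    def forward(self, infrared, pseudo_infrared, query_xy):
        # Step S31: Siamese feature extraction and stitching into a 16x32 grid.
        f_a = self.proj(self.backbone(infrared))                         # (B, 256, 16, 16)
        f_b = self.proj(self.backbone(pseudo_infrared))                  # (B, 256, 16, 16)
        stitched = torch.cat([f_a, f_b], dim=3)                          # (B, 256, 16, 32)
        tokens = stitched.flatten(2).permute(2, 0, 1)                    # (512, B, 256)
        # Step S32: add the positional encoding to obtain the context feature map.
        tokens = tokens + self.pos_embed
        # Step S33: query point and context features through the encoder-decoder.
        query = self.query_embed(query_xy).unsqueeze(0)                  # (1, B, 256)
        decoded = self.transformer(tokens, query)                        # (1, B, 256) high-dimensional vector
        # Step S34: the MLP maps the 256-d vector to the matched point coordinates.
        return self.mlp(decoded.squeeze(0))
```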
Based on the same inventive concept as the Transformer-based cross-modal image matching and positioning method, this embodiment further provides a corresponding Transformer-based cross-modal image matching and positioning device. FIG. 4 shows a block diagram of a Transformer-based cross-modality image matching and positioning device according to an embodiment of the present application, which includes:
an image acquisition module 41, configured to acquire a real-time infrared image and a visible light image under a viewing angle of the unmanned aerial vehicle;
the image style migration module 42 is configured to perform style migration on the visible light image by using a cross-mode image style migration network structure, so as to obtain a pseudo infrared image;
the intelligent matching module 43 is configured to perform intelligent matching on the real-time infrared image and the pseudo-infrared image by a Transformer intelligent matching method, so as to obtain a feature point matching relationship;
a homography transformation matrix determining module 44, configured to determine a homography transformation matrix according to the feature point matching relationship;
the perspective transformation module 45 is used for performing perspective transformation on the center point of the real-time infrared image according to the homography transformation matrix, and determining a pixel point corresponding to the center point in the pseudo infrared image;
the mapping module 46 is configured to map a pixel point corresponding to the center point in the pseudo-infrared image onto the visible light image, and determine a mapping point in the visible light image;
the positioning result determining module 47 is configured to obtain a geographic positioning result of the unmanned aerial vehicle according to the geographic location information corresponding to the mapping point in the visible light image.
In the above embodiment, the specific implementation function of each module is consistent with the specific implementation manner of the foregoing method embodiment, and will not be described in detail.
In order to further verify the effectiveness of the positioning method of the present application, it is compared with an existing method; FIG. 5 shows a comparison of the positioning results of the positioning method of the present application and the existing method. Experiments prove that the method of the present application effectively realizes the cross-mode conversion from visible light images to infrared images and remarkably improves the successful matching rate; it performs well despite the large modality difference, high matching difficulty and poor robustness of cross-mode image matching, and therefore has practical significance and value in engineering applications.
The foregoing is merely various embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (4)

1. A Transformer-based cross-mode image matching and positioning method, characterized by comprising the following steps of:
acquiring a real-time infrared image and a visible light image under the view angle of the unmanned aerial vehicle;
adopting a cross-mode image style migration network structure to perform style migration on the visible light image to obtain a pseudo infrared image;
performing intelligent matching on the real-time infrared image and the pseudo infrared image by adopting a Transformer intelligent matching method to obtain a characteristic point matching relationship;
determining a homography transformation matrix according to the characteristic point matching relation;
performing perspective transformation on a central point of the real-time infrared image according to the homography transformation matrix, and determining a pixel point corresponding to the central point in the pseudo infrared image;
mapping pixel points corresponding to the center point in the pseudo infrared image to the visible light image, and determining mapping points in the visible light image;
obtaining a geographic positioning result of the unmanned aerial vehicle according to the geographic position information corresponding to the mapping points in the visible light image;
the cross-mode image style migration network structure is a CycleGAN network structure, and the total loss function is as follows:
$$\mathcal{L}(G,F,D_X,D_Y)=\mathcal{L}_{GAN}(G,D_Y,X,Y)+\mathcal{L}_{GAN}(F,D_X,Y,X)+\lambda_{cyc}\,\mathcal{L}_{cyc}(G,F)+\lambda_{idt}\,\mathcal{L}_{idt}(G,F)$$
wherein G is the first generator, F is the second generator, X is the source domain, Y is the target domain, $D_Y$ is the first discriminator and $D_X$ is the second discriminator; $\mathcal{L}(G,F,D_X,D_Y)$ is the total loss function, $\mathcal{L}_{GAN}(G,D_Y,X,Y)$ is the adversarial loss function between the first generator G and the first discriminator $D_Y$, $\mathcal{L}_{GAN}(F,D_X,Y,X)$ is the adversarial loss function between the second generator F and the second discriminator $D_X$, $\mathcal{L}_{cyc}(G,F)$ is the cycle consistency loss function, $\mathcal{L}_{idt}(G,F)$ is the ontology consistency loss function, $\lambda_{cyc}$ is the weight coefficient of the cycle consistency loss function, and $\lambda_{idt}$ is the weight coefficient of the ontology consistency loss function;
the step of performing intelligent matching on the real-time infrared image and the pseudo infrared image by adopting a Transformer intelligent matching method to obtain a characteristic point matching relationship comprises:
respectively performing feature extraction on the real-time infrared image and the pseudo infrared image with a twin (Siamese) network built on a ResNet50 backbone to obtain two feature maps, and stitching the two feature maps to obtain a stitched feature map;
adding a position encoding to the stitched feature map to obtain a context feature map;
inputting the query points together with the context feature map into a Transformer encoder-decoder structure to obtain a high-dimensional vector;
and inputting the high-dimensional vector into a multi-layer perceptron to obtain the characteristic point matching relation.
2. The method of claim 1, wherein the adversarial loss function $\mathcal{L}_{GAN}(G,D_Y,X,Y)$ between the first generator G and the first discriminator $D_Y$ adopts the following formula:
$$\mathcal{L}_{GAN}(G,D_Y,X,Y)=\mathbb{E}_{y\sim p_{data}(y)}\big[\log D_Y(y)\big]+\mathbb{E}_{x\sim p_{data}(x)}\big[\log\big(1-D_Y(G(x))\big)\big]$$
wherein y is a real image of the target domain, x is a real image of the source domain, G(x) is the output of the first generator G when the input is x, $\mathbb{E}$ is the expected value, $D_Y(y)$ is the output of the first discriminator $D_Y$ when the input is y, and $D_Y(G(x))$ is the output of the first discriminator $D_Y$ when the input is G(x);
the adversarial loss function $\mathcal{L}_{GAN}(F,D_X,Y,X)$ between the second generator F and the second discriminator $D_X$ adopts the following formula:
$$\mathcal{L}_{GAN}(F,D_X,Y,X)=\mathbb{E}_{x\sim p_{data}(x)}\big[\log D_X(x)\big]+\mathbb{E}_{y\sim p_{data}(y)}\big[\log\big(1-D_X(F(y))\big)\big]$$
wherein F(y) is the output of the second generator F when the input is y, $D_X(x)$ is the output of the second discriminator $D_X$ when the input is x, and $D_X(F(y))$ is the output of the second discriminator $D_X$ when the input is F(y);
the cycle consistency loss function $\mathcal{L}_{cyc}(G,F)$ adopts the following formula:
$$\mathcal{L}_{cyc}(G,F)=\mathbb{E}_{x\sim p_{data}(x)}\big[\lVert F(G(x))-x\rVert_1\big]+\mathbb{E}_{y\sim p_{data}(y)}\big[\lVert G(F(y))-y\rVert_1\big]$$
wherein F(G(x)) is the output of the second generator F when the input is G(x), and G(F(y)) is the output of the first generator G when the input is F(y);
the ontology consistency loss function $\mathcal{L}_{idt}(G,F)$ adopts the following formula:
$$\mathcal{L}_{idt}(G,F)=\mathbb{E}_{y\sim p_{data}(y)}\big[\lVert G(y)-y\rVert_1\big]+\mathbb{E}_{x\sim p_{data}(x)}\big[\lVert F(x)-x\rVert_1\big]$$
wherein G(y) is the output of the first generator G when the input is y, and F(x) is the output of the second generator F when the input is x.
3. A Transformer-based cross-mode image matching and positioning device, characterized by comprising:
the image acquisition module is used for acquiring real-time infrared images and visible light images under the view angle of the unmanned aerial vehicle;
the image style migration module is used for performing style migration on the visible light image by adopting a cross-mode image style migration network structure to obtain a pseudo infrared image;
the intelligent matching module is used for intelligently matching the real-time infrared image and the pseudo infrared image by adopting a Transformer intelligent matching method to obtain a characteristic point matching relationship;
the homography transformation matrix determining module is used for determining a homography transformation matrix according to the characteristic point matching relation;
the perspective transformation module is used for carrying out perspective transformation on the central point of the real-time infrared image according to the homography transformation matrix, and determining the pixel point corresponding to the central point in the pseudo infrared image;
the mapping module is used for mapping the pixel points corresponding to the center point in the pseudo infrared image to the visible light image and determining mapping points in the visible light image;
the positioning result determining module is used for obtaining the geographic positioning result of the unmanned aerial vehicle according to the geographic position information corresponding to the mapping points in the visible light image;
the cross-mode image style migration network structure is a CycleGAN network structure, and the total loss function is as follows:
$$\mathcal{L}(G,F,D_X,D_Y)=\mathcal{L}_{GAN}(G,D_Y,X,Y)+\mathcal{L}_{GAN}(F,D_X,Y,X)+\lambda_{cyc}\,\mathcal{L}_{cyc}(G,F)+\lambda_{idt}\,\mathcal{L}_{idt}(G,F)$$
wherein G is the first generator, F is the second generator, X is the source domain, Y is the target domain, $D_Y$ is the first discriminator and $D_X$ is the second discriminator; $\mathcal{L}(G,F,D_X,D_Y)$ is the total loss function, $\mathcal{L}_{GAN}(G,D_Y,X,Y)$ is the adversarial loss function between the first generator G and the first discriminator $D_Y$, $\mathcal{L}_{GAN}(F,D_X,Y,X)$ is the adversarial loss function between the second generator F and the second discriminator $D_X$, $\mathcal{L}_{cyc}(G,F)$ is the cycle consistency loss function, $\mathcal{L}_{idt}(G,F)$ is the ontology consistency loss function, $\lambda_{cyc}$ is the weight coefficient of the cycle consistency loss function, and $\lambda_{idt}$ is the weight coefficient of the ontology consistency loss function;
the intelligent matching module is further configured for:
respectively performing feature extraction on the real-time infrared image and the pseudo infrared image with a twin (Siamese) network built on a ResNet50 backbone to obtain two feature maps, and stitching the two feature maps to obtain a stitched feature map;
adding a position encoding to the stitched feature map to obtain a context feature map;
inputting the query points together with the context feature map into a Transformer encoder-decoder structure to obtain a high-dimensional vector;
and inputting the high-dimensional vector into a multi-layer perceptron to obtain the characteristic point matching relation.
4. The apparatus of claim 3, wherein the adversarial loss function $\mathcal{L}_{GAN}(G,D_Y,X,Y)$ between the first generator G and the first discriminator $D_Y$ adopts the following formula:
$$\mathcal{L}_{GAN}(G,D_Y,X,Y)=\mathbb{E}_{y\sim p_{data}(y)}\big[\log D_Y(y)\big]+\mathbb{E}_{x\sim p_{data}(x)}\big[\log\big(1-D_Y(G(x))\big)\big]$$
wherein y is a real image of the target domain, x is a real image of the source domain, G(x) is the output of the first generator G when the input is x, $\mathbb{E}$ is the expected value, $D_Y(y)$ is the output of the first discriminator $D_Y$ when the input is y, and $D_Y(G(x))$ is the output of the first discriminator $D_Y$ when the input is G(x);
the adversarial loss function $\mathcal{L}_{GAN}(F,D_X,Y,X)$ between the second generator F and the second discriminator $D_X$ adopts the following formula:
$$\mathcal{L}_{GAN}(F,D_X,Y,X)=\mathbb{E}_{x\sim p_{data}(x)}\big[\log D_X(x)\big]+\mathbb{E}_{y\sim p_{data}(y)}\big[\log\big(1-D_X(F(y))\big)\big]$$
wherein F(y) is the output of the second generator F when the input is y, $D_X(x)$ is the output of the second discriminator $D_X$ when the input is x, and $D_X(F(y))$ is the output of the second discriminator $D_X$ when the input is F(y);
the cycle consistency loss function $\mathcal{L}_{cyc}(G,F)$ adopts the following formula:
$$\mathcal{L}_{cyc}(G,F)=\mathbb{E}_{x\sim p_{data}(x)}\big[\lVert F(G(x))-x\rVert_1\big]+\mathbb{E}_{y\sim p_{data}(y)}\big[\lVert G(F(y))-y\rVert_1\big]$$
wherein F(G(x)) is the output of the second generator F when the input is G(x), and G(F(y)) is the output of the first generator G when the input is F(y);
the ontology consistency loss function $\mathcal{L}_{idt}(G,F)$ adopts the following formula:
$$\mathcal{L}_{idt}(G,F)=\mathbb{E}_{y\sim p_{data}(y)}\big[\lVert G(y)-y\rVert_1\big]+\mathbb{E}_{x\sim p_{data}(x)}\big[\lVert F(x)-x\rVert_1\big]$$
wherein G(y) is the output of the first generator G when the input is y, and F(x) is the output of the second generator F when the input is x.
CN202310450328.2A 2023-04-25 2023-04-25 Transformer-based cross-mode image matching and positioning method and device Active CN116168221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310450328.2A CN116168221B (en) 2023-04-25 2023-04-25 Transformer-based cross-mode image matching and positioning method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310450328.2A CN116168221B (en) 2023-04-25 2023-04-25 Transformer-based cross-mode image matching and positioning method and device

Publications (2)

Publication Number Publication Date
CN116168221A CN116168221A (en) 2023-05-26
CN116168221B true CN116168221B (en) 2023-07-25

Family

ID=86411762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310450328.2A Active CN116168221B (en) 2023-04-25 2023-04-25 Transformer-based cross-mode image matching and positioning method and device

Country Status (1)

Country Link
CN (1) CN116168221B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117994596B (en) * 2024-04-07 2024-06-04 四川大学华西医院 Intestinal ostomy image recognition and classification system based on twin network
CN118038499A (en) * 2024-04-12 2024-05-14 北京航空航天大学 Cross-mode pedestrian re-identification method based on mode conversion

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331029A (en) * 2022-08-19 2022-11-11 西安电子科技大学 Heterogeneous image matching method based on cross-mode conversion network and optimal transmission theory

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3103048B1 (en) * 2019-11-07 2021-10-22 Thales Sa PROCESS AND DEVICE FOR GENERATING SYNTHETIC LEARNING DATA FOR ARTIFICIAL INTELLIGENCE MACHINE FOR AIRCRAFT LANDING AID
CN111062905B (en) * 2019-12-17 2022-01-04 大连理工大学 Infrared and visible light fusion method based on saliency map enhancement
CN111723780B (en) * 2020-07-22 2023-04-18 浙江大学 Directional migration method and system of cross-domain data based on high-resolution remote sensing image
CN112149635A (en) * 2020-10-23 2020-12-29 北京百度网讯科技有限公司 Cross-modal face recognition model training method, device, equipment and storage medium
US11295477B1 (en) * 2021-05-19 2022-04-05 Motional Ad Llc Deep learning-based camera calibration
CN114529593A (en) * 2022-01-12 2022-05-24 西安电子科技大学 Infrared and visible light image registration method, system, equipment and image processing terminal
CN114417048A (en) * 2022-01-17 2022-04-29 中国计量大学 Unmanned aerial vehicle positioning method without positioning equipment based on image semantic guidance

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331029A (en) * 2022-08-19 2022-11-11 西安电子科技大学 Heterogeneous image matching method based on cross-mode conversion network and optimal transmission theory

Also Published As

Publication number Publication date
CN116168221A (en) 2023-05-26

Similar Documents

Publication Publication Date Title
CN116168221B (en) Transformer-based cross-mode image matching and positioning method and device
CN108303099B (en) Autonomous navigation method in unmanned plane room based on 3D vision SLAM
CN109509230B (en) SLAM method applied to multi-lens combined panoramic camera
Yuan et al. RGGNet: Tolerance aware LiDAR-camera online calibration with geometric deep learning and generative model
CN108534782B (en) Binocular vision system-based landmark map vehicle instant positioning method
CN110827415A (en) All-weather unknown environment unmanned autonomous working platform
CN104732518B (en) A kind of PTAM improved methods based on intelligent robot terrain surface specifications
CN104376552B (en) A kind of virtual combat method of 3D models and two dimensional image
CN108010085A (en) Target identification method based on binocular Visible Light Camera Yu thermal infrared camera
CN104835158B (en) Based on the three-dimensional point cloud acquisition methods of Gray code structured light and epipolar-line constraint
CN111536970B (en) Infrared inertial integrated navigation method for low-visibility large-scale scene
US11948309B2 (en) Systems and methods for jointly training a machine-learning-based monocular optical flow, depth, and scene flow estimator
CN114719848B (en) Unmanned aerial vehicle height estimation method based on vision and inertial navigation information fusion neural network
CN113298947A (en) Multi-source data fusion-based three-dimensional modeling method medium and system for transformer substation
CN117253029B (en) Image matching positioning method based on deep learning and computer equipment
Hu et al. Aerial monocular 3d object detection
CN114943757A (en) Unmanned aerial vehicle forest exploration system based on monocular depth of field prediction and depth reinforcement learning
CN105389819B (en) A kind of lower visible image method for correcting polar line of half calibration and system of robust
Zhang et al. RI-LIO: reflectivity image assisted tightly-coupled LiDAR-inertial odometry
CN116740488B (en) Training method and device for feature extraction model for visual positioning
Mithun et al. Cross-view visual geo-localization for outdoor augmented reality
CN116824433A (en) Visual-inertial navigation-radar fusion self-positioning method based on self-supervision neural network
CN108460829B (en) A kind of 3-D image register method for AR system
Guo et al. Unsupervised Multi-Spectrum Stereo Depth Estimation for All-Day Vision
CN108955687A (en) The synthesized positioning method of mobile robot

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant