CN114120013A - Infrared and RGB cross-modal feature point matching method - Google Patents

Infrared and RGB cross-modal feature point matching method

Info

Publication number
CN114120013A
CN114120013A (application CN202111392935.5A)
Authority
CN
China
Prior art keywords
image
rgb
original
images
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111392935.5A
Other languages
Chinese (zh)
Inventor
田炜
陈展
邓振文
黄禹尧
谭大艺
韩帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202111392935.5A priority Critical patent/CN114120013A/en
Publication of CN114120013A publication Critical patent/CN114120013A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/251 - Fusion techniques of input or preprocessed data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning

Abstract

The invention relates to an infrared and RGB cross-modal feature point matching method, which comprises the following steps: performing off-line training of a deep learning model based on collected original RGB and IR images to obtain a trained matching model; and inputting the data to be tested into the matching model to extract feature descriptors and output the corresponding matching result. Compared with the prior art, the method focuses on the fusion of multispectral images, namely visible-light images (RGB) and thermal-imaging images (IR). Through model training it can accurately extract feature points in multiple modalities and better perform the cross-modal feature matching task, thereby improving the accuracy of camera pose estimation in dark scenes and scenes with severe illumination change. It can provide a reliable perception front end for many applications, lays the front-end groundwork for subsequent research on fusing multispectral sensors under a traditional SLAM framework, and facilitates mapping, localization and matching, or depth estimation and three-dimensional mapping, of the same scene across day (based on RGB images) and night (based on IR images).

Description

Infrared and RGB cross-modal feature point matching method
Technical Field
The invention relates to the technical field of intelligent driving, in particular to an infrared and RGB cross-modal feature point matching method.
Background
Feature extraction is the key to the perception task in autonomous driving. However, under severe changes of ambient light, complete darkness or bad weather, traditional machine vision often suffers from large feature extraction errors or fails entirely. For example, in tasks such as SLAM (simultaneous localization and mapping), SfM (structure from motion), camera calibration and image registration, the extracted features are mainly interest points; when the environment degrades their quality or reduces their number, the subsequent feature point matching inevitably fails and the downstream computation cannot proceed.
Most conventional feature point extraction methods are based on relatively stable local image features, including SIFT (Scale-Invariant Feature Transform), SURF (Speeded-Up Robust Features) and ORB (Oriented FAST and Rotated BRIEF). Although these hand-crafted methods locate points with high precision, they lack robustness in many scenes; for example, in dark, low-texture scenes the pixel gradient information is extremely scarce and the noise is large, so feature points cannot be extracted accurately and quickly for matching.
At present, with the development of deep learning, feature point extraction methods based on deep learning have begun to appear, but these methods target a single modality and do not consider the feature differences across modalities. Because the visible light camera has a perception deficiency in low-light scenes, the accuracy of camera pose estimation cannot be guaranteed in dark scenes or under severe illumination change, which poses great challenges to mapping, localization and matching, or depth estimation and three-dimensional mapping, of the same scene across day and night.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an infrared and RGB cross-modal feature point matching method, so that feature points can be accurately extracted in multiple modalities and the cross-modal feature matching task can be performed better, thereby improving the accuracy of camera pose estimation in dark scenes and scenes with severe illumination change, and facilitating mapping, localization and matching, or depth estimation and three-dimensional mapping, of the same scene across day and night.
The purpose of the invention can be realized by the following technical scheme: an infrared and RGB cross-modal feature point matching method comprises the following steps:
s1, collecting an original RGB image and an original IR image, and performing off-line training on the deep learning model based on the original RGB image and the original IR image to obtain a trained matching model;
and S2, inputting the data to be tested into the trained matching model to extract the feature descriptors and outputting the corresponding matching result.
Further, in step S1 the original RGB image is collected by a visible light camera and the original IR image is collected by a thermal imaging camera.
Further, the deep learning model in step S1 is specifically an UnsuperPoint neural network model.
Further, the specific process of the offline training in step S1 is as follows:
s11, preprocessing the collected original RGB image and the original IR image to obtain a pair of images;
s12, establishing an UnstuperPoint neural network model, inputting paired images into the UnstuperPoint neural network model for off-line training, and obtaining a trained matching model.
Further, the step S11 is specifically to perform a pixel alignment process on the original RGB image and the original IR image to ensure that the original RGB image and the original IR image are completely aligned at a pixel level.
Further, the pair of images consists of the original RGB image and an IR image with an added viewing-angle transformation.
Further, the pair of images consists of the original IR image and an RGB image with an added viewing-angle transformation.
Further, the UnsuperPoint neural network model constructed in step S12 includes a backbone network, which performs the joint tasks of point confidence estimation, point coordinate regression and descriptor extraction. The backbone network is divided into two branches: one branch processes the original image, and the other processes the image after the homography transformation; the extracted points are projected into the same image coordinate system through the ground-truth homography matrix, the distance of each point pair is calculated, point pairs with a distance smaller than 4 pixels are taken as valid point pairs, and the point correspondences are constructed for self-supervised learning;
the UnsuperPoint neural network model adopts a convolutional layer with a kernel size of 3 and a stride of 2 to handle the edge blurring caused by thermal radiation in IR images.
Further, the learning loss function of the UnsuperPoint neural network model is specifically as follows:
L = α_score·L_score + α_rep·L_rep + α_pos·L_pos + α_uni·L_uni + α_des·L_des + α_des_coor·L_des_coor

L_rep = Σ_k ((s_k^A + s_k^B)/2)·(d_k - d̄)

L_des = -Σ_i log( exp(sim(z_i^A, z_i^B)/τ) / Σ_k 1_[k≠i]·exp(sim(z_i^A, z_k)/τ) )

sim(z_i, z_j) = z_i^T·z_j / (||z_i||·||z_j||)

wherein A is the identification of the RGB image, B is the identification of the IR image, and L is the total loss function; L_score is the point confidence loss, represented by the square of the difference between the scores of the same point in A and B, and α_score is the weight of L_score;
L_rep is the repeatability loss based on point-pair distance, where s is the confidence of an extracted point, d is the distance of a point pair, d̄ is the mean distance over all point pairs, and α_rep is the weight of L_rep;
L_pos is the Euclidean distance loss of the point pairs, and α_pos is the weight of L_pos;
L_uni is the coordinate uniformity loss, and α_uni is the weight of L_uni;
L_des is the descriptor loss, represented by the square of the difference between the descriptors of the same point in A and B, and α_des is the weight of L_des;
L_des_coor is used to increase the compactness of the descriptors in space, the loss being represented by the sum of the cross-correlation coefficients of the descriptors of points at different positions, and α_des_coor is the weight of L_des_coor;
z_i and z_j are two descriptor vectors, sim(z_i, z_j) is their similarity, and τ is a temperature hyper-parameter that controls the strength with which negative examples are learned.
Further, the data to be measured comprises paired RGB images to be measured and IR images to be measured.
Compared with the prior art, the method focuses on the fusion of multispectral images, namely the fusion of a visible light camera and a thermal imaging camera, and uses the fact that the thermal imaging camera is unaffected by illumination changes to compensate for the perception deficiency of the visible light camera in low-illumination scenes. The invention learns, with a neural network, a keypoint extraction and descriptor model that adapts to multiple modalities, and lays the front-end groundwork for subsequent research on fusing multispectral sensors under the traditional SLAM framework, so as to improve the accuracy of camera pose estimation in dark scenes and scenes with severe illumination change, which in turn facilitates mapping, localization and matching, or depth estimation and three-dimensional mapping, of the same scene across day and night.
The invention trains one model on a mixed RGB and IR data set, so that the trained matching model can perform feature matching within the RGB modality and within the IR modality; this training scheme significantly improves the matching precision of a model with dual-modality feature matching capability in both modalities. At the same time, a large amount of data is used in the unsupervised learning task, and a model trained with large amounts of data is more robust in extracting features and identifying repeated points.
The invention builds on the basic architecture of the UnsuperPoint network model and performs self-supervision through image pairs, making training fully unsupervised and removing the dependence on synthetic data required for SuperPoint pre-training. Meanwhile, point positions are regressed as a differentiable term, so that the positions can be optimized and shifted. To address feature points falling on grid boundaries, a heuristic distribution approximation is adopted that makes the distribution of feature points more uniform. In addition, the cross-correlation between descriptor dimensions is reduced, which lowers the redundancy between dimensions and improves the expressive power of the descriptors. Innovative adjustments, such as replacing max pooling with convolution and changing the learning loss function, are made to better fit the cross-modal feature matching task.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of an embodiment of a deep learning network;
FIG. 3 is a schematic diagram of the self-supervised learning framework;
fig. 4 is a diagram illustrating exemplary descriptor matching in different modalities.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
Examples
As shown in fig. 1, a cross-modal feature point matching method for infrared and RGB includes the following steps:
s1, collecting an original RGB image and an original IR image, and performing off-line training on the deep learning model based on the original RGB image and the original IR image to obtain a trained matching model;
since the UnsuperPoint model has already demonstrated excellent performance in the RGB modality, this embodiment uses it as the main framework (as shown in fig. 2) and adapts it to the cross-modal feature matching task;
in this embodiment, an UnsuperPoint neural network model is used for off-line training, and the specific process of the off-line training is as follows:
s11, pre-processing the captured original RGB image (which can be captured by a visible light camera) and the original IR image (which can be captured by a thermal imaging camera) to obtain a pair of images, specifically, the pair of images is the original RGB image and the IR image added with the view angle transformation, or the original IR image and the RGB image added with the view angle transformation, and the original RGB image and the original IR image are completely aligned at a pixel level;
it should be noted that the model training process in the invention is the same as the model training thought of UnsuperPoint in the single mode, and according to the thought in the single mode, the transformation relation between the original image and the transformation image is randomly generated and known, so that the data input in the cross-mode must ensure that the original RGB and the original IR image are completely aligned in pixel level, so as to realize the cross-mode matching;
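The pairing step can be illustrated with the following minimal sketch. It assumes the RGB and IR frames are already pixel-aligned; the helper names (random_homography, make_training_pair) and the corner-jitter range are illustrative choices, not values taken from the patent.

import numpy as np
import cv2

def random_homography(h, w, max_shift=0.15):
    """Sample a random viewing-angle (homography) transform by jittering the image corners."""
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    jitter = (np.random.rand(4, 2) - 0.5) * 2 * max_shift * np.float32([w, h])
    dst = (src + jitter).astype(np.float32)
    return cv2.getPerspectiveTransform(src, dst)  # 3x3 ground-truth homography

def make_training_pair(rgb, ir, warp_ir=True):
    """Return (image_A, image_B, H): one original modality plus the other modality warped by H."""
    h, w = rgb.shape[:2]
    H = random_homography(h, w)
    if warp_ir:
        return rgb, cv2.warpPerspective(ir, H, (w, h)), H   # original RGB + transformed IR
    return ir, cv2.warpPerspective(rgb, H, (w, h)), H        # original IR + transformed RGB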
s12, constructing an UnserPoint neural network model, inputting paired images into the UnserPoint neural network model for off-line training to obtain a trained matching model, wherein the UnserPoint neural network model comprises a backbone network, UnserPoint shows excellent performance in an RGB mode, points extracted by the network can avoid artificially defining angular points or image gradients, and obtain good performance in repeatability and positioning errors, but infrared information is derived from thermal radiation, and edge characteristic noise of infrared imaging is high, and can be significantly different from edge characteristics of RGB images. Thus, by selecting keypoints with repeatability, rather than by visual corner or edge information, errors due to noise can be minimized.
UnsuperPoint extracts low-level features with a lightweight network used for the joint task of point confidence estimation, point coordinate regression and descriptor extraction. For each 8 x 8 grid cell at the original image scale, the network outputs one point and its descriptor. The self-supervised learning process is illustrated in fig. 3. The network is divided into two branches: one branch processes the original image and the other processes the homography-transformed image. The extracted points are projected into the same image coordinate system through the ground-truth homography matrix, the distance of each point pair is calculated, point pairs with a distance smaller than 4 pixels are taken as valid point pairs, and the point correspondences are constructed for self-supervised learning, as sketched below.
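A possible sketch of the correspondence construction is given below. The nearest-neighbour pairing and the array shapes are assumptions made for illustration; only the stated rule is taken from the description, namely that point pairs closer than 4 pixels under the ground-truth homography count as valid.

import numpy as np

def project_points(points, H):
    """Project (N, 2) pixel coordinates with a 3x3 homography."""
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # to homogeneous coordinates
    warped = (H @ pts_h.T).T
    return warped[:, :2] / warped[:, 2:3]

def build_correspondences(points_a, points_b, H, max_dist=4.0):
    """Pair each projected A-point with its nearest B-point if they are closer than max_dist pixels."""
    proj_a = project_points(points_a, H)
    dists = np.linalg.norm(proj_a[:, None, :] - points_b[None, :, :], axis=-1)  # (Na, Nb) distance matrix
    nearest = dists.argmin(axis=1)
    valid = dists[np.arange(len(proj_a)), nearest] < max_dist
    return np.stack([np.nonzero(valid)[0], nearest[valid]], axis=1)  # (K, 2) index pairs (a_idx, b_idx)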
Meanwhile, considering that the original UnsuperPoint network performs downsampling with parameter-free max pooling, and that images in the infrared modality exhibit edge blurring caused by thermal radiation, the edge blurring is handled with a convolutional layer with a kernel size of 3 and a stride of 2, so that the network learns the low-level features better without excessively increasing the computational cost, as in the sketch below.
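A minimal PyTorch sketch of such a downsampling block is shown below; the channel counts and the batch-norm/ReLU additions are placeholders, not details specified by the patent.

import torch.nn as nn

def conv_downsample(in_ch, out_ch):
    """Strided convolution (kernel 3, stride 2) used in place of parameter-free max pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),  # halves the spatial size
        nn.BatchNorm2d(out_ch),   # typical additions, not specified by the patent
        nn.ReLU(inplace=True),
    )

# e.g. replacing an nn.MaxPool2d(2) stage between backbone blocks:
# downsample = conv_downsample(64, 64)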
In addition, regarding the learning loss function, the descriptor loss part adopts the negative-example-based contrastive learning loss SimCLR. SimCLR is a self-supervised learning framework in which positive and negative examples are constructed independently for contrastive learning. The loss function adopts a twin (Siamese) structure and constructs positive examples through transformation augmentation, forcing the network to learn properties of the image features that are invariant to the transformation. In the descriptor learning task, the positive examples are the descriptor pairs of points whose distance is within the threshold; the contrastive loss encourages higher similarity between positive examples while pushing them away from all negative examples, and the negative examples supervise the network to retain independent features and prevent model collapse, i.e. the negative examples are distributed as uniformly as possible. For two descriptor vectors z_i and z_j, the similarity is computed as the normalized vector dot product:

sim(z_i, z_j) = z_i^T·z_j / (||z_i||·||z_j||)

The descriptor loss function is then:

L_des = -Σ_i log( exp(sim(z_i^A, z_i^B)/τ) / Σ_k 1_[k≠i]·exp(sim(z_i^A, z_k)/τ) )

where 1_[k≠i] is an indicator function multiplied with the exponential term in the denominator; it equals 1 if k is not equal to i and 0 otherwise. τ is a temperature hyper-parameter that controls the strength with which negative examples are learned. Negative examples are not all alike: some are rather similar to the anchor and some are completely unrelated. Negative examples that are close in descriptor space are harder to learn from, and the harder they are, the more difficult it is to pull their distance apart. The temperature hyper-parameter adjusts how strongly such hard negatives are penalized; the smaller it is, the more the hard negatives are emphasized, which makes the samples more uniform in their spatial distribution. However, a smaller temperature is not always better: when learning descriptors the network decides whether a descriptor pair is a positive or a negative example from the distance prior of the point pair, and if the extracted point positions are inaccurate in the early stage of training, a temperature that is too low may push apart descriptors that should in fact be similar, making it harder to pull their spatial distance back later.
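A compact sketch of this SimCLR-style descriptor loss, written for matched descriptor pairs, is given below. The tensor layout, the default temperature value and the use of every other descriptor in the batch as a negative are assumptions for illustration.

import torch
import torch.nn.functional as F

def descriptor_contrastive_loss(desc_a, desc_b, tau=0.1):
    """desc_a, desc_b: (N, D) descriptors of matched point pairs; tau: temperature hyper-parameter."""
    z = F.normalize(torch.cat([desc_a, desc_b], dim=0), dim=1)       # 2N unit-norm descriptors
    sim = z @ z.t() / tau                                            # cosine similarities scaled by 1/tau
    n = desc_a.shape[0]
    pos = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])    # index of the positive for each anchor
    sim.fill_diagonal_(float("-inf"))                                # the 1_[k != i] indicator: drop self-similarity
    return F.cross_entropy(sim, pos.to(sim.device))                  # -log softmax probability of the positive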
The learning loss function of the UnsuperPoint neural network model in the invention is specifically:

L = α_score·L_score + α_rep·L_rep + α_pos·L_pos + α_uni·L_uni + α_des·L_des + α_des_coor·L_des_coor

In the formula, L_score is the point confidence loss, which mainly ensures that the scores of the same point in image A (the RGB image) and image B (the infrared image) are similar, i.e. the confidence of a point is consistent under different viewing angles; the loss is represented by the square of the difference between the scores of the same point in A and B.
L_rep is the repeatability loss based on point-pair distance, with the loss function:

L_rep = Σ_k ((s_k^A + s_k^B)/2)·(d_k - d̄)

Here the twin image of A after the homography transformation is considered, and s is the confidence of an extracted point. Only point pairs with a pixel distance below 4 are counted; d is the distance of a point pair and d̄ is the mean distance over all point pairs. This loss is expected to decrease the confidence of points when the pair distance is large and increase it when the distance is small.
L_pos is the Euclidean distance loss of the point pairs; its main objective is to ensure that the keypoint positions detected in image A (after the homography) coincide with those detected in image B, i.e. that the point positions are stable under different viewing angles. L_uni is the coordinate uniformity loss; only the points within each 8 × 8 grid cell are considered, and when the loss is computed the points are sorted by coordinate and the variance of the intervals between point coordinates is taken as the loss. L_des is the descriptor loss, represented by the square of the difference between the descriptors of the same point in A and B, which ensures that the descriptors of a point pair are similar. L_des_coor is used to increase the compactness of the descriptors in space; the loss is represented by the sum of the cross-correlation coefficients of the descriptors of points at different positions.
α_score, α_rep, α_pos, α_uni, α_des and α_des_coor are the weights of L_score, L_rep, L_pos, L_uni, L_des and L_des_coor, respectively.
Therefore, the invention improves the UnsuperPoint network structure: convolutional layers are used to handle edge blurring, and the contrastive loss improves the quality of matching and descriptor extraction; a combined sketch of the repeatability term and the weighted total loss is given below.
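The following sketch assembles the repeatability term and the weighted total loss as reconstructed above; the tensor shapes and the idea of passing the six weights as a dictionary are illustrative assumptions.

import torch

def repeatability_loss(score_a, score_b, dist):
    """score_a, score_b, dist: (K,) tensors over the valid point pairs.
    Pushes confidence down when the pair distance is above the mean and up when it is below."""
    return (0.5 * (score_a + score_b) * (dist - dist.mean())).sum()

def total_loss(l_score, l_rep, l_pos, l_uni, l_des, l_des_coor, w):
    """w: dict holding the six weights alpha_score ... alpha_des_coor."""
    return (w["score"] * l_score + w["rep"] * l_rep + w["pos"] * l_pos
            + w["uni"] * l_uni + w["des"] * l_des + w["des_coor"] * l_des_coor)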
And S2, inputting the data to be tested into the trained matching model to extract the feature descriptors and output the corresponding matching result, wherein the data to be tested comprises a paired RGB image and IR image to be tested.
In summary, in this technical scheme the visible light camera and the thermal imaging camera acquire the initial images, and the network model is trained offline on the acquired image data; the data set to be tested is then fed into the trained model, and the feature descriptors are extracted and matched. Fig. 4 shows a descriptor matching example between the RGB modality (left) and the IR modality (right) in this embodiment.
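An illustrative inference flow for step S2 might look as follows. The matching_model call signature (returning scores, points and descriptors per image) is a hypothetical API rather than the patent's actual interface, and the matching shown is plain mutual-nearest-neighbour on L2-normalized descriptors.

import numpy as np

def match_descriptors(desc_rgb, desc_ir):
    """Mutual nearest-neighbour matching between two (N, D) L2-normalized descriptor sets."""
    sim = desc_rgb @ desc_ir.T
    nn_ab, nn_ba = sim.argmax(axis=1), sim.argmax(axis=0)
    mutual = nn_ba[nn_ab] == np.arange(len(desc_rgb))
    return np.stack([np.nonzero(mutual)[0], nn_ab[mutual]], axis=1)

def infer_and_match(matching_model, rgb_image, ir_image):
    """Hypothetical model interface: each call returns (scores, points, descriptors) for one image."""
    _, pts_rgb, desc_rgb = matching_model(rgb_image)
    _, pts_ir, desc_ir = matching_model(ir_image)
    pairs = match_descriptors(desc_rgb, desc_ir)
    return pts_rgb[pairs[:, 0]], pts_ir[pairs[:, 1]]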
The model training process in the invention follows the same training idea as UnsuperPoint in the single modality; the difference is that the input image pair consists of the original RGB image and an IR image with an added viewing-angle transformation (or the original IR image and an RGB image with an added viewing-angle transformation), and the original RGB and IR images must be completely aligned at the pixel level.
The quantitative indicators used in training are as follows: RS is the repetition rate, LE is the localization error, HE is the homography estimation accuracy, where ε is the threshold on the average error of the four corner points after the homography transformation, and MS is the matching score.
The robustness of the extracted points is evaluated with the repetition rate. Let the original image be denoted O and the transformed image C, with the transformation matrix known. The points extracted from O are projected into the viewpoint of C through the homography matrix and denoted P_true_warped, and the points extracted from C are denoted P_warped. The repetition rate can then be written as:

RS = (M_O→C + M_C→O) / (|P_true_warped| + |P_warped|)

where M_O→C is the number of points in P_true_warped that have a point of P_warped within the distance threshold, and M_C→O is defined symmetrically. The distance threshold is 3 pixels, and points closer than the threshold are considered matching point pairs.
Another evaluation indicator is the localization error, which evaluates the accuracy of the extracted point positions. As in the repetition-rate calculation, the points whose distance is below the threshold are recorded as the paired sets G_true_warped and G_warped, and the localization error is computed as:

LE = (1/N)·Σ_i ||g_true_warped,i - g_warped,i||

where N is the number of paired points and g_true_warped,i and g_warped,i are the i-th pair of corresponding points.
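A sketch of how RS and LE could be computed from two point sets in the same (transformed) view is shown below; the array shapes and the symmetric nearest-neighbour counting are illustrative assumptions.

import numpy as np

def rs_and_le(pts_true_warped, pts_warped, thresh=3.0):
    """pts_*: (N, 2) keypoint coordinates expressed in the same (transformed) view."""
    d = np.linalg.norm(pts_true_warped[:, None, :] - pts_warped[None, :, :], axis=-1)
    m_ab = d.min(axis=1)        # nearest-detection distance for each warped ground-truth point
    m_ba = d.min(axis=0)        # and the symmetric direction
    matched = np.concatenate([m_ab[m_ab < thresh], m_ba[m_ba < thresh]])
    rs = ((m_ab < thresh).sum() + (m_ba < thresh).sum()) / (len(pts_true_warped) + len(pts_warped))
    le = matched.mean() if len(matched) else float("nan")   # mean nearest-neighbour distance over matched points
    return rs, le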
the evaluation of the descriptors cannot be carried out independently, and because the evaluation mode is also a method of calculating the descriptors after detecting points, only the performance of the extracted points and the descriptors can be comprehensively evaluated. The comprehensive evaluation is carried out under a homography transformation estimation method, firstly violent matching is adopted for matching of the descriptors, the similarity of the descriptors is measured by an L2 distance, and then a RANSAC (Random Sample Consensus) algorithm is combined according to a matching result to estimate a homography transformation matrix between two images.
The matching score mainly reflects the performance of the descriptor, RANSAC screens interior points (corresponding to matching point pairs within the error range of the homography) and exterior points in the sample, wherein the exterior points are matching point pairs causing too large error in calculating the homography, and the matching score calculation can be written as:
Figure BDA0003369387710000081
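This evaluation step maps naturally onto standard OpenCV calls; the sketch below uses brute-force L2 matching and RANSAC homography estimation, with the RANSAC reprojection threshold as an assumed value.

import cv2
import numpy as np

def matching_score(kpts_a, desc_a, kpts_b, desc_b, ransac_thresh=3.0):
    """kpts_*: (N, 2) point coordinates; desc_*: (N, D) descriptors. Returns (MS, estimated H)."""
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = matcher.match(desc_a.astype(np.float32), desc_b.astype(np.float32))
    if len(matches) < 4:                       # a homography needs at least 4 correspondences
        return 0.0, None
    src = np.float32([kpts_a[m.queryIdx] for m in matches])
    dst = np.float32([kpts_b[m.trainIdx] for m in matches])
    H_est, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, ransac_thresh)
    if H_est is None:
        return 0.0, None
    return float(inlier_mask.sum()) / len(matches), H_est   # MS = inliers / matched pairs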
and the homography transformation accuracy comprehensively evaluates the position accuracy and the matching performance of the extraction points. After coordinates of four edge points of the image size are given, average error distances of the four points under estimated homography transformation and transformation matrix truth values are calculated, whether estimation is correct or not is judged according to different threshold values, finally, accuracy is evaluated according to the estimated correct proportion, and the index is marked as HE. In the present embodiment, 1, 3, 5, and 10 pixels are used as evaluation thresholds.
Regarding the choice of backbone network: unlike the single-modality case, the pair of features output by the backbone in the cross-modal case comes from the RGB modality and the IR modality respectively. The backbone of UnsuperPoint is very lightweight; unlike more complex architectures such as VGG or ResNet, its transfer performance may be poor, so it cannot by itself serve as a shared backbone across modalities. Therefore, the RGB and IR data sets are mixed to train a single model, i.e. the model can perform feature matching within the RGB modality and within the IR modality; this training scheme significantly improves the matching precision of the dual-modality model in both modalities. At the same time, a large amount of data is used in the unsupervised learning task, and a model trained with large amounts of data is more robust in extracting features and identifying repeated points.
The method can provide a reliable perception front end for automatic driving and lays the front-end groundwork for subsequent research on fusing multispectral sensors under the traditional SLAM framework, so as to improve the accuracy of camera pose estimation in dark scenes with severe illumination change. This work is also beneficial for mapping, localization and matching, or depth estimation and three-dimensional mapping, of the same scene across day and night. For example, it can alleviate the failure of RGB-camera feature point tracking under severe illumination change: cross-modal feature matching enables switching between multi-modal sensors and reduces the influence of illumination on the stability of the SLAM system.

Claims (10)

1. An infrared and RGB cross-modal feature point matching method is characterized by comprising the following steps:
s1, collecting an original RGB image and an original IR image, and performing off-line training on the deep learning model based on the original RGB image and the original IR image to obtain a trained matching model;
and S2, inputting the data to be tested into the trained matching model to extract the feature descriptors and outputting the corresponding matching result.
2. The method as claimed in claim 1, wherein the step S1 is to collect original RGB images by a visible light camera and collect original IR images by a thermal imaging camera.
3. The infrared and RGB cross-modal feature point matching method according to claim 1, wherein the deep learning model in step S1 is specifically an UnsuperPoint neural network model.
4. The method as claimed in claim 3, wherein the off-line training in step S1 comprises:
s11, preprocessing the collected original RGB image and the original IR image to obtain a pair of images;
s12, establishing an UnstuperPoint neural network model, inputting paired images into the UnstuperPoint neural network model for off-line training, and obtaining a trained matching model.
5. The method as claimed in claim 4, wherein the step S11 is specifically to perform a pixel alignment process on the original RGB image and the original IR image to ensure that the original RGB image and the original IR image are completely aligned at a pixel level.
6. The method as claimed in claim 5, wherein the pair of images are original RGB images and IR images with added perspective transformation.
7. The method as claimed in claim 5, wherein the pair of images are an original IR image and an RGB image with a view transformation added.
8. The infrared and RGB cross-modal feature point matching method according to claim 6 or 7, wherein the UnsuperPoint neural network model constructed in step S12 includes a backbone network, the backbone network is used for performing the joint tasks of point confidence estimation, point coordinate regression and descriptor extraction, and the backbone network is divided into two branches: one branch processes the original image, and the other processes the image after the homography transformation; the extracted points are projected into the same image coordinate system through the ground-truth homography matrix, the distance of each point pair is calculated, point pairs with a distance smaller than 4 pixels are taken as valid point pairs, and the point correspondences are constructed for self-supervised learning;
the UnsuperPoint neural network model adopts a convolutional layer with a kernel size of 3 and a stride of 2 to handle the edge blurring caused by thermal radiation in IR images.
9. The infrared and RGB cross-modal feature point matching method according to claim 8, wherein the learning loss function of the UnsuperPoint neural network model specifically is:
L = α_score·L_score + α_rep·L_rep + α_pos·L_pos + α_uni·L_uni + α_des·L_des + α_des_coor·L_des_coor

L_rep = Σ_k ((s_k^A + s_k^B)/2)·(d_k - d̄)

L_des = -Σ_i log( exp(sim(z_i^A, z_i^B)/τ) / Σ_k 1_[k≠i]·exp(sim(z_i^A, z_k)/τ) )

sim(z_i, z_j) = z_i^T·z_j / (||z_i||·||z_j||)

wherein A is the identification of the RGB image, B is the identification of the IR image, and L is the total loss function; L_score is the point confidence loss, represented by the square of the difference between the scores of the same point in A and B, and α_score is the weight of L_score;
L_rep is the repeatability loss based on point-pair distance, where s is the confidence of an extracted point, d is the distance of a point pair, d̄ is the mean distance over all point pairs, and α_rep is the weight of L_rep;
L_pos is the Euclidean distance loss of the point pairs, and α_pos is the weight of L_pos;
L_uni is the coordinate uniformity loss, and α_uni is the weight of L_uni;
L_des is the descriptor loss, represented by the square of the difference between the descriptors of the same point in A and B, and α_des is the weight of L_des;
L_des_coor is used to increase the compactness of the descriptors in space, the loss being represented by the sum of the cross-correlation coefficients of the descriptors of points at different positions, and α_des_coor is the weight of L_des_coor;
z_i and z_j are two descriptor vectors, sim(z_i, z_j) is their similarity, and τ is a temperature hyper-parameter that controls the strength with which negative examples are learned.
10. The method as claimed in claim 4, wherein the data to be measured includes paired RGB image to be measured and IR image to be measured.
CN202111392935.5A 2021-11-23 2021-11-23 Infrared and RGB cross-modal feature point matching method Pending CN114120013A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111392935.5A CN114120013A (en) 2021-11-23 2021-11-23 Infrared and RGB cross-modal feature point matching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111392935.5A CN114120013A (en) 2021-11-23 2021-11-23 Infrared and RGB cross-modal feature point matching method

Publications (1)

Publication Number Publication Date
CN114120013A true CN114120013A (en) 2022-03-01

Family

ID=80439813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111392935.5A Pending CN114120013A (en) 2021-11-23 2021-11-23 Infrared and RGB cross-modal feature point matching method

Country Status (1)

Country Link
CN (1) CN114120013A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117351049A (en) * 2023-12-04 2024-01-05 四川金信石信息技术有限公司 Thermal imaging and visible light fusion measuring point registration guiding method, device and medium
CN117351049B (en) * 2023-12-04 2024-02-13 四川金信石信息技术有限公司 Thermal imaging and visible light fusion measuring point registration guiding method, device and medium
CN117824624A (en) * 2024-03-05 2024-04-05 深圳市瀚晖威视科技有限公司 Indoor tracking and positioning method, system and storage medium based on face recognition

Similar Documents

Publication Publication Date Title
Schneider et al. RegNet: Multimodal sensor registration using deep neural networks
WO2022002150A1 (en) Method and device for constructing visual point cloud map
CN113012212B (en) Depth information fusion-based indoor scene three-dimensional point cloud reconstruction method and system
CN111311666B (en) Monocular vision odometer method integrating edge features and deep learning
CN109341703B (en) Visual SLAM algorithm adopting CNNs characteristic detection in full period
Yang et al. Registration of challenging image pairs: Initialization, estimation, and decision
CN111401384A (en) Transformer equipment defect image matching method
CN113269237A (en) Assembly change detection method, device and medium based on attention mechanism
CN107424161B (en) Coarse-to-fine indoor scene image layout estimation method
CN113361542B (en) Local feature extraction method based on deep learning
CN111709980A (en) Multi-scale image registration method and device based on deep learning
CN114120013A (en) Infrared and RGB cross-modal feature point matching method
CN111126412A (en) Image key point detection method based on characteristic pyramid network
Zhou et al. Cross-weather image alignment via latent generative model with intensity consistency
CN111368733B (en) Three-dimensional hand posture estimation method based on label distribution learning, storage medium and terminal
Potje et al. Extracting deformation-aware local features by learning to deform
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN113128518B (en) Sift mismatch detection method based on twin convolution network and feature mixing
Zhang et al. MLIFeat: Multi-level information fusion based deep local features
CN112070181A (en) Image stream-based cooperative detection method and device and storage medium
CN117351078A (en) Target size and 6D gesture estimation method based on shape priori
Zhang et al. Data association between event streams and intensity frames under diverse baselines
CN113052311B (en) Feature extraction network with layer jump structure and method for generating features and descriptors
CN115410014A (en) Self-supervision characteristic point matching method of fisheye image and storage medium thereof
Qin et al. Structured-patch optimization for dense correspondence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination