CN114820733A

CN114820733A - Interpretable thermal infrared visible light image registration method and system

Info

Publication number: CN114820733A
Application number: CN202210420876.6A
Authority: CN
Inventors: 白相志; 汪虹宇
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2022-04-21
Filing date: 2022-04-21
Publication date: 2022-07-29
Anticipated expiration: 2042-04-21
Also published as: CN114820733B

Abstract

The invention discloses an interpretable thermal infrared visible light image registration method and an interpretable thermal infrared visible light image registration system, which basically comprise the following steps: 1) the process of extracting descriptors, matching, estimating transformation parameters and transforming images by using a neural network to simulate the traditional registration algorithm; 2) adopting a registration strategy of firstly carrying out global transformation and then carrying out local transformation from coarse to fine; 3) constructing a loss function pair network training; 4) and (3) processing the registration problem of the thermal infrared image-visible light image under different internal and external parameters by using a trained network. The interpretable Registration depth Neural network (ERDNN) provided by the invention can realize pixel-level Registration of a thermal infrared camera and a visible light camera which are not overlapped by optical centers. The trained descriptor subnetwork can be used as a general descriptor extractor to extract a thermal infrared-visible light cross-modal descriptor. The method has wide use value and application prospect in the fields of computer vision, automatic driving, monitoring security and the like.

Description

Interpretable thermal infrared visible light image registration method and system

Technical Field

The invention relates to an interpretable thermal infrared visible light image registration method and an interpretable thermal infrared visible light image registration system, and belongs to the field of computer vision. The method has wide application prospect in the fields of computer vision, multi-modal data fusion, monitoring security and protection, automatic driving and the like.

Background

Infrared thermal imaging techniques have many advantages over visible light imaging. The visible light camera needs auxiliary light to clearly and stably image; compared with the prior art, the thermal infrared camera mainly performs imaging according to infrared rays generated by object thermal radiation, and can be used under extremely severe weather conditions such as night, poor visibility and haze. In addition, compared with the radar detection technology, the infrared imaging is passive radiation imaging, electromagnetic waves are not emitted outwards actively, and the mode is not easy to be detected by enemies. Moreover, the thermal infrared camera also has certain insights to the aircraft adopting special coating for optical stealth and radar stealth. These advantages have led researchers to focus more and more on infrared imaging. In addition, with the reduction of the cost of the thermal infrared camera and the characteristic that the thermal infrared camera is not influenced by the illumination condition basically, the application and research of the thermal infrared camera in the civil field such as automatic driving are more and more. However, the thermal infrared image has low contrast, low resolution and fuzzy edges, and the disadvantages of the thermal infrared image make the thermal infrared image difficult to be directly applied to the perception of a complex system in a single mode form. In summary, the registration and fusion of the thermal infrared and visible light images are the current trend. The higher the safety requirements of the system, the more redundant design is needed, the more fusion of heterogeneous sensor information and coordination under complex working conditions are needed. Based on the method, the infrared and visible light multi-view registration data set and the high-robustness deep learning method have important significance.

At present, a thermal infrared and visible light registration method is basically based on two methods, namely a traditional registration method and an unsupervised learning method based on deep learning, and the evaluation indexes are subjective evaluation indexes.

Thermal infrared and visible image registration based on conventional registration methods has received much attention over a period of time. Jiang uses the Canny operator to extract the edges of the Infrared Image and the Visible light Image, and uses the geometric characteristics (local minimum curvature, direction, etc.) of the Contour line as a descriptor for matching, so that the matching problem of the thermal Infrared and the Visible light under a larger visual Angle can be processed (see the literature: Jiang, Qian, et al. "A content Angle organization for Power Equipment information and visual Image registration." IEEE Transactions on Power Delivery, vol.36, No.4,2021, pp.2559-2569.). Jiangzhou et al think that the infrared image and the visible light image have the same saliency map, so use HC saliency detection to get the preliminary saliency map of the infrared and visible light images, then extract ORB feature points on the saliency map to match (see document: Jiangzhou, Liu Xiao Yan, Wangchen. infrared and visible light image registration algorithm [ J ] laser and infrared based on saliency and ORB, 2019,49(2): 6.). The Wangchun break et al improves the three problems of small registration point number, uneven distribution and high mismatching rate between registration points. The method is provided for obtaining feature points with uniform spatial distribution and sufficient number by using a self-adaptive Harris angular point extraction method, and then the method is provided for fusing the gradient direction and the mutual information as similarity measurement, so that the matching error rate is greatly reduced. (see literature: Wang break, Weiming. visible-infrared image registration for adaptive feature point detection [ J ]. Chinese graphic report of images, 2017,22(2): 9.).

Unsupervised learning methods based on deep learning have also received much attention in recent years. Marouf considers that the thermal Infrared image and the visible light image have correlation at the image semantic level, so that the thermal Infrared and visible light semantic segmentation networks are trained to obtain semantic labels of the images, and then Registration is performed by using a Spatial Transform Network (STN) regression pixel displacement field to minimize MSE loss between the labels (see the documents: Marouf I.E, Barras L, Karaimer H.C., Susstrunk S. (2021) 'Joule unknown-RGB Video Registration and Fusion ", London Imaging measuring (LIM' 21), September 2021). Zhou Mei Qi et al believe that thermal infrared images and visible light images are of different "image styles", so first train a network that migrates from visible to infrared with cycleGAN. And then, the infrared image with the style migration and the real infrared image are used for regressing the displacement field by using a Space Transformation Network (STN). And finally, performing two-classification judgment on whether the matching is performed by using a discriminator for generating the countermeasure network, and alternately training the main network and the discriminator to optimize (Zhoumeqi, high intensity, pine, and the like, an infrared and visible light image registration method based on mode conversion [ J ]. computer engineering and design 2020,41(10): 5.).

However, the above methods are all established on a data set without a label, and most of the evaluation indexes are subjective evaluation. In addition, the feature point descriptors extracted by the traditional feature-based registration method are not checked on a large-scale data set, and the correlation relationship between the thermal infrared image and the visible light image is difficult to measure under different scenes, so that the algorithm is not high in universality; the unsupervised deep learning method does not consider the thermal infrared and visible light image registration process, belongs to end-to-end training, and causes low algorithm stability. If a traditional registration algorithm process can be introduced into a deep learning network as a network module and trained on a large data set, the possibility can be provided for solving the registration problem of the thermal infrared image and the visible light image. Based on the method, the traditional Registration method flow is used as guidance, an interpretable Registration Deep Neural Network (ERDNN for short) is constructed, and the problem of thermal infrared and visible light image Registration is effectively solved.

Disclosure of Invention

1. The purpose is as follows: in view of the above problems, the present invention aims to provide an interpretable thermal infrared visible image registration algorithm and system. An interpretable registration depth neural network ERDNN is constructed based on a deep learning technology, and the registration of the thermal infrared image and the visible light image is realized by combining a traditional registration flow driving network.

2. The technical scheme is as follows: to achieve the purpose, the overall idea of the technical solution of the present invention is to first split the registration system into four parts, which are a description sub-network, a motion field estimation network, a global transformation module and a local transformation module. Extracting a thermal infrared image descriptor and a visible light image descriptor by using a description subnetwork of shared parameters; after the description sub-features are added, the motion field estimation network and the global transformation module carry out global transformation on the infrared image; and then, extracting the descriptor by using the descriptor network again for the thermal infrared image after the global transformation, and adding the descriptor with the visible light image descriptor. And performing local transformation through the motion field estimation network and the local transformation module to output a final registration thermal infrared image. The technical idea of the invention is mainly embodied in the following three aspects:

1) and designing a description sub-network based on metric learning, namely training the description sub-network by using the metric learning.

2) The motion field estimation network is used for estimating a motion field (also called a displacement field) of infrared relative to visible light, and then a global transformation module is used for carrying out similarity transformation and a local transformation module is used for carrying out displacement field transformation to realize a coarse-to-fine and interpretable registration process of a thermal infrared image and a visible light image.

3) The description sub-network and the motion field estimation network construction loss functions are trained simultaneously so that the thermal infrared image can be effectively registered with visible light.

The invention relates to an interpretable thermal infrared visible light image registration method, which comprises the following specific steps:

the method comprises the following steps: the visible light image is changed into a single-channel gray image through a 1 x 1 convolutional neural network, and then a description sub-network based on shared parameters of a multi-scale Swin converter (Swin-Transformer) is utilized to extract a thermal infrared image descriptor and a visible light image descriptor.

Step two: and adding the thermal infrared image and the visible light image descriptor, inputting the thermal infrared image and the visible light image descriptor into a motion field estimation network, and outputting a pixel motion field of infrared relative to visible light. And then, resolving the motion field into a global similarity transformation by using a global transformation module, and then, carrying out global transformation on the thermal infrared image through the global transformation module based on a Space Transformation Network (STN).

Step three: inputting the thermal infrared image after global transformation into a description sub-network again to output a thermal infrared image descriptor after global transformation, adding the thermal infrared image descriptor and the original visible light image descriptor, inputting a motion field estimation network to estimate a pixel motion field, and finally performing local displacement field transformation through a local transformation module.

Step four: and constructing a loss function to train the network, wherein the loss function comprises three items of descriptor metric loss, global motion field estimation loss and local motion field estimation loss.

And (3) outputting: the registered thermal infrared image, the global transformation parameter and the local transformation parameter.

The first step is as follows:

1.1: the three channel visible image was converted to a single channel image using a 4-layer 1 x 1 convolution.

1.2: a descriptor of the image is extracted using a descriptor extraction module. The traditional convolutional neural network is formed by stacking and combining 2D convolutional layers, is limited by the size of a receptive field and is not enough for long-distance information. In order to fully mine the high-level semantics across the infrared-visible descriptor, which requires the use of a large-field descriptor subnetwork, the present invention uses a multiscale Swin Transformer (Swin-Transformer) instead of the traditional convolutional neural network. The range of receptive fields extends from the kernel size of the convolution kernel to the image patch block of the multi-scale Swin transformer.

Wherein, the second step is as follows:

2.1: and (4) merging the multi-scale descriptors extracted in the first step by the motion field estimation network, and outputting the motion field.

2.2: and (3) solving the displacement field obtained by the 2.1 into image global similarity transformation parameters S ═ sR, t by using a global transformation module, wherein R represents a 2-dimensional image rotation matrix, t represents an image translation vector, and S represents an image scale factor. And applying the thermal infrared image to obtain a globally transformed thermal infrared image. The global transformation operation is implemented by a Spatial Transformation Network (STN).

Wherein the third step is as follows:

3.1: and inputting the thermal infrared image subjected to global transformation into the descriptor network again to extract the descriptor, and adding the descriptor to the visible light image descriptor.

3.2: the descriptor of 3.1 is input into a motion field estimation network, which estimates the motion field between the globally transformed thermal infrared image and the visible light image.

3.3: and applying the motion field obtained by 3.2 to the thermal infrared image by using a local transformation module to obtain a locally transformed thermal infrared image. The partial transformation operation is implemented by a Spatial Transformation Network (STN).

Wherein the fourth step is as follows:

4.1: the interpretable loss function of the registration deep neural network ERDNN contains three terms: descriptor metric loss, global motion field estimation loss, local motion field estimation loss.

Descriptor (I)Network metric learning loss first classifies the registration fineness of thermal infrared images and visible light images into three grades: complete registration, global registration and non-registration. If the thermal infrared image and the visible light image are completely registered, the measurement between the thermal infrared visible light descriptors is as close as possible; if not, then if the measure between descriptors is less than the threshold C ₂ Are far away from each other; if global registration is achieved, then if the metric between descriptors is greater than a threshold C ₂ Are close to each other and less than the threshold C ₁ (C ₂ ＞C ₁ > 0) away from each other.

Global motion field estimation penalties are defined as the Mean Square Error (MSE) of the estimated global motion field and the true global motion field.

Wherein L is _global Representing the loss of the global motion field estimate,

the representation estimates the global motion field and,

representing a real global motion field.

The local motion field estimation penalty is defined as the Mean Square Error (MSE) of the estimated local motion field and the true local motion field.

Wherein L is _local Representing the estimated loss of the local motion field,

the representation estimates the local motion field and,

representing a real local motion field.

4.2: the ERDNN training is performed on the training data. Parameter optimization adjustments were made using an Adam optimizer.

An interpretable thermal infrared visible image registration system, the basic structural framework and the workflow of which are shown in figure 1, and the interpretable thermal infrared visible image registration system is characterized by comprising:

and the description subnetwork is used for extracting high-level semantic descriptors of corresponding pixels of the thermal infrared image and the visible light image.

A motion field estimation network for performing pixel motion field estimation.

And the global transformation module is used for carrying out similarity transformation calculation on the pixel motion field and carrying out global transformation on the initial infrared image.

And the local transformation module is used for carrying out local displacement field transformation on the thermal infrared image after the global transformation.

Wherein the network is a system component with optimized parameters and the module is a system component without optimized parameters. The description sub-network inputs a thermal infrared image (visible light image) and outputs a high-level semantic descriptor of a corresponding pixel of the thermal infrared image (visible light image). The thermal infrared and visible image descriptors are summed as input to a motion field estimation network and a global motion field (also referred to as a displacement field) is output. And the global motion field is resolved by a global transformation module to obtain a similarity transformation matrix, and global similarity transformation is carried out on the thermal infrared image. And extracting the descriptors of the thermal infrared image subjected to global similarity transformation by using the descriptor network again, adding the descriptors to the visible light descriptors, and inputting the extracted descriptors into the motion field estimation network again to obtain a local motion field. And carrying out local displacement field transformation on the thermal infrared image through a local transformation module to obtain a finally registered thermal infrared image.

3. The advantages and the effects are as follows: the invention provides an interpretable thermal infrared visible light image registration network ERDNN, which consists of a description sub-network, a motion field estimation network, a global transformation module and a local transformation module. The description sub-network is used in the whole network three times in total, and is matched with other networks and modules for use to obtain a high-level semantic thermal infrared-visible light cross-modal descriptor. The motion field estimation network is used twice in total, and a global transformation strategy and a local transformation strategy from coarse to fine are used. The invention can accurately and robustly solve the problem of thermal infrared visible light image registration. The method has wide use value and application prospect in the fields of computer vision, automatic driving, monitoring security and the like.

Drawings

Fig. 1 is a basic structural framework of a thermal infrared visible light image registration algorithm proposed by the present invention.

Fig. 2 is a diagram illustrating the basic structure of a sub-network.

Fig. 3 is a basic structure of a motion field estimation network.

Fig. 4 is a basic structure of a global transformation module.

Fig. 5 is a basic structure of a partial conversion module.

Fig. 6 a-6 d illustrate the registration effect of the network, where fig. 6a and 6b illustrate the thermal infrared image and the visible light image to be registered, respectively, fig. 6c illustrates the thermal infrared image after global parameter transformation, and fig. 6d illustrates the thermal infrared image after local transformation.

Detailed Description

For better understanding of the technical solutions of the present invention, the following further describes embodiments of the present invention with reference to the accompanying drawings.

The invention relates to an interpretable thermal infrared visible light image registration algorithm, an algorithm framework and a network structure of which are shown in figure 1, and the specific implementation steps of each part are as follows:

the method comprises the following steps: respectively extracting descriptors of a thermal infrared image and a single-channel visible light image after 1 × 1 convolution by a descriptor network sharing parameters, wherein the basic structure of the descriptor network is shown in fig. 2;

step two: adding the multi-scale descriptors of the thermal infrared image and the visible light image, and estimating a global motion field of the visible light relative to the infrared by a motion field estimation network, wherein the basic structure of the motion field estimation network is shown in FIG. 3; performing similarity transformation calculation on the global motion field by using a global transformation module, and performing global similarity transformation on the thermal infrared image, as shown in fig. 4;

step three: the globally transformed thermal infrared image descriptor is extracted again using the descriptor network, added to the visible descriptor, and the local motion field is estimated again using the motion field estimation network as shown in fig. 3. The local transformation module is utilized to directly act the displacement field on the thermal infrared image after the global transformation, and the final transformation image is obtained and is shown in figure 5;

step four: constructing a loss function to train the whole ERDNN network;

The first step is as follows:

1.1: the 3-channel visible light image is first converted into a single-channel image using a 4-layer neural network, each layer including a 1 × 1 convolution operation, a batch normalization operation, and a ReLU activation function.

1.2: and extracting the multi-scale feature descriptors of the images by using the multi-scale Swin Transformer. And the input image is subjected to downsampling coding under the processing of the description submodule, the scale number is increased at the same time, and finally the multi-scale description is output for subsequent prediction parameters.

Wherein, the second step is as follows:

2.1: and adding the thermal infrared and visible light multi-scale descriptors into the motion field estimation network. The motion field estimation network firstly fuses the multi-scale descriptors by using a multilayer convolution and residual connecting network; then, the fusion feature map is used to obtain a global displacement field through a 5-layer convolutional neural network containing residual connection.

2.2: and the global transformation module is used for resolving the global displacement field into a similarity transformation by using a similarity transformation ICP algorithm, and then the similarity transformation is acted on the input infrared image to obtain a globally transformed thermal infrared image. The solving method adopts a Direct Linear Transformation (DLT) algorithm, converts the parameter solving problem of the transformation matrix into an Ax-b form linear equation system solving problem, and can solve the optimal solution by using singular value decomposition. Because the resolving process is conductive, a neural network can be embedded, and gradient back propagation is realized.

Wherein the third step is as follows:

3.1: and extracting the thermal infrared image descriptor after global transformation by using a description sub-network, adding the thermal infrared image descriptor and the visible light descriptor, inputting the mixture into a motion field estimation network, and obtaining a local displacement field by the motion field estimation network.

3.2: and the local displacement field carries out local displacement field transformation fine adjustment on the globally transformed thermal infrared image through a local transformation module to obtain a final registration result.

Wherein the fourth step is as follows:

4.1: description of the sub-network metric learning loss is as follows.

Wherein d is _p Is a Euclidean distance function of the pixels, and returns the Euclidean distance of the corresponding pixels of the two images. m is ₁ ，m ₂ Is a distance threshold hyperparameter, m in the local network ₁ ＝10，m ₂ ＝5。d ₂ Is a descriptor of a visible light image, d ₄ Is a precisely registered thermal infrared image descriptor, d ₁ Is a thermal infrared image descriptor of misregistration, d ₃ Is a globally matched thermal infrared image descriptor,

representing the square of the Frobenius norm of the matrix.

4.2: the global motion field estimation penalty is as follows.

Wherein L is _global Representing the loss of the global motion field estimate,

the representation estimates the global motion field and,

representing a real global motion field. The Mean Square Error (MSE) function is defined as

4.3 local motion field estimation loss is as follows.

Wherein L is _local Representing the estimated loss of the local motion field,

the representation estimates the local motion field and,

representing a real local motion field.

4.4: the overall loss function is of the form shown below.

L＝αL _metric +L _global +L _local

L _metric ，L _global ，L _local The three terms represent the learning loss of the sub-network metric, the global parameter transformation loss and the local parameter transformation loss respectively, and alpha is a weighting coefficient and is 0.3 in the invention.

The field of automatic driving often needs multi-mode data fusion to improve the perception capability of the system, so as to enhance the robustness of the system. In order to visually demonstrate the effect of the invention, the invention uses a constructed thermal infrared-visible light image registration dataset of an automatic driving scene for training. Firstly, a data set is divided into a training set and a testing set, and a global motion field (similarity transformation corresponding to a motion field) and a local motion field (local displacement field transformation corresponding to a motion field) label are obtained by decomposing a motion field label.

The training process inputs a thermal infrared image-visible light image pair as previously described. The first time of the motion field estimation network outputs an estimated global motion field, and the estimated global motion field and a real global motion field construct a global motion field estimation loss; the second time of the motion field estimation network outputs an estimated local motion field, and the estimated local motion field and the real local motion field construct a local motion field estimation loss; together with metric learning losses describing the sub-networks, a loss function (objective function) is formed for the entire neural network. Training optimization was performed using Adam optimizer for a total of 200 rounds of training.

The trained model is tested using the test set data. Fig. 6a and 6b show an input image pair of a thermal infrared image to be registered and a visible light image, fig. 6c shows an image effect after global transformation, and fig. 6d shows a final registration output result after local transformation is added on the basis of global transformation. Comparing fig. 6b and fig. 6d, it can be seen that the present invention can effectively solve the problem of registration of the thermal infrared image and the visible light image, and the thermal infrared image transformation result has a small numerical error with the real reference tag value and has a high consistency with the spatial distribution of the visible light image.

The interpretable registration depth neural network ERDNN provided by the invention can effectively complete the registration problem of the thermal infrared image and the visible light image, the multi-mode image registration is the basis of fusion, and after the registration and fusion, the rear end can perform sensing tasks such as semantic segmentation, target detection and the like. In conclusion, the invention has wide use value and application prospect in the fields of computer vision, self-driving, monitoring security and the like.

Claims

1. An interpretable thermal infrared visible light image registration method is characterized by comprising the following steps:

the method comprises the following steps: the method comprises the steps that a visible light image is changed into a single-channel gray image through a 1 x 1 convolution neural network, and then a description sub-network based on the sharing parameters of a multi-scale Swin converter Swin-Transformer is used for extracting a thermal infrared image descriptor and a visible light image descriptor;

step two: adding the thermal infrared image and the visible light image descriptor, inputting the thermal infrared image and the visible light image descriptor into a motion field estimation network, and outputting a pixel motion field of infrared relative to visible light; resolving the motion field into a global similarity transformation by using a global transformation module, and then carrying out global transformation on the thermal infrared image through the global transformation module based on the space transformation network STN;

step three: inputting the globally transformed thermal infrared image into a description sub-network again to output a globally transformed thermal infrared image descriptor, adding the globally transformed thermal infrared image descriptor to the original visible light image descriptor, inputting the globally transformed thermal infrared image descriptor into a motion field estimation network to estimate a pixel motion field, and finally performing local displacement field transformation through a local transformation module;

step four: constructing a loss function to train the network, wherein the loss function comprises three items of descriptor metric loss, global motion field estimation loss and local motion field estimation loss;

2. The method of claim 1, wherein the method comprises: in the first step, the following is specifically performed:

1.1: converting the three-channel visible light image into a single-channel image using a 4-layer 1 × 1 convolution; each layer includes a 1 × 1 convolution operation, a batch normalization operation, and a ReLU activation function;

1.2: extracting a descriptor of the image by using a descriptor extraction module; a multi-scale Swin converter Swin-Transformer is used for replacing a traditional convolutional neural network; the range of receptive fields extends from the kernel size of the convolution kernel to the image patch block of the multi-scale Swin transformer.

3. The method of claim 1, wherein the method comprises: in the second step, the concrete steps are as follows:

2.1: the motion field estimation network fuses the multi-scale descriptors extracted in the first step and outputs a motion field;

2.2: using a global transformation module to solve the displacement field obtained in the step 2.1 into an image global similarity transformation parameter S ═ sR, t ], wherein R represents a 2-dimensional image rotation matrix, t represents an image translation vector, and S represents an image scale factor; applying the thermal infrared image to obtain a globally transformed thermal infrared image; the global transformation operation is implemented by the spatial transformation network STN.

4. An interpretable thermal infrared visible image registration method according to claim 1 or 3, wherein: adding the thermal infrared and visible light multi-scale descriptors to input into a motion field estimation network; the motion field estimation network firstly fuses the multi-scale descriptors by using a multilayer convolution and residual connecting network; then, the fusion feature map is used to obtain a global displacement field through a 5-layer convolutional neural network containing residual connection.

5. An interpretable thermal infrared visible image registration method according to claim 1 or 3, wherein: the global transformation module is used for resolving the global displacement field into a similarity transformation by using a similarity transformation ICP algorithm, and then the similarity transformation is acted on the input infrared image to obtain a globally transformed thermal infrared image; the solving method adopts a direct linear transformation DLT algorithm, converts the parameter solving problem of a transformation matrix into an Ax-b form linear equation system solving problem, and can solve an optimal solution by using singular value decomposition; because the resolving process is conductive, a neural network can be embedded, and gradient back propagation is realized.

6. The method of claim 1, wherein the method comprises: in step three, the concrete steps are as follows:

3.1: inputting the thermal infrared image after global transformation into the descriptor network again to extract the descriptor, and adding the descriptor and the visible light image descriptor;

3.2: inputting the descriptor in the step 3.1 into a motion field estimation network, and estimating a motion field between the thermal infrared image and the visible light image after global transformation by the motion field estimation network;

3.3: applying the motion field obtained in the step 3.2 on the thermal infrared image by using a local transformation module to obtain a locally transformed thermal infrared image; the partial transformation operation is implemented by a spatial transformation network STN.

7. The method of claim 1, wherein the method comprises: in step four, the details are as follows:

4.1: the interpretable loss function of the registration deep neural network ERDNN contains three terms: descriptor metric loss, global motion field estimation loss, local motion field estimation loss;

description sub-network metric learning loss the registration fineness of the thermal infrared image and the visible light image is first classified into three levels: complete registration, realization of global registration and non-registration; if the thermal infrared image and the visible light image are completely registered, the measurement between the thermal infrared visible light descriptors is as close as possible; if not, then if the measure between descriptors is less than the threshold C ₂ Are far away from each other; if global registration is achieved, then if the metric between descriptors is greater than a threshold C ₂ Are close to each other and less than the threshold C ₁ Are far away from each other; c ₂ ＞C ₁ ＞0；

The global motion field estimation loss is defined as the mean square error MSE of the estimated global motion field and the real global motion field;

wherein L is _global Representing the loss of the global motion field estimate,

the representation estimates the global motion field and,

representing a real global motion field;

the local motion field estimation loss is defined as the mean square error MSE of the estimated local motion field and the real local motion field;

wherein L is _local Representing the estimated loss of the local motion field,

the representation estimates the local motion field and,

representing a real local motion field; the Mean Square Error (MSE) function is defined as

4.2: performing training on the ERDNN on the training data; parameter optimization adjustments were made using an Adam optimizer.

8. An interpretable thermal infrared visible image registration method according to claim 1 or 4, wherein: describing the sub-network metric learning loss is shown below;

wherein d is _p The Euclidean distance function of the pixel returns the Euclidean distance of the corresponding pixel of the two images; m is ₁ ，m ₂ Is a distance threshold hyperparameter, m in the local network ₁ ＝10，m ₂ ＝5；d ₂ Is a descriptor of a visible light image, d ₄ Is a precisely registered thermal infrared image descriptor, d ₁ Is a thermal infrared image descriptor of misregistration, d ₃ Is a globally matched thermal infrared image descriptor,

represents the square of the Frobenius norm of the matrix;

the loss function is of the form shown below;

L＝αL _metric +L _global +L _local

L _metric ，L _global ，L _local the three terms represent the learning loss of the sub-network metric, the global parameter transformation loss and the local parameter transformation loss respectively, and alpha is a weighting coefficient and is 0.3.

9. An interpretable thermal infrared visible image registration system according to any one of claims 1 to 8, comprising:

the description subnetwork is used for extracting high-level semantic descriptors of pixels corresponding to the thermal infrared image and the visible light image;

a motion field estimation network for performing pixel motion field estimation;

the global transformation module is used for carrying out similarity transformation resolving on the pixel motion field and carrying out global transformation on the initial infrared image;

10. An interpretable thermal infrared visible image registration system according to claim 9, wherein: the network is a system component with optimized parameters, and the module is a system component without optimized parameters; the description sub-network inputs a thermal infrared image and outputs a high-level semantic descriptor of a pixel corresponding to the thermal infrared image; adding the thermal infrared image descriptors and the visible light image descriptors to be used as input of a motion field estimation network, and outputting a global motion field, namely a displacement field; the global motion field is resolved by a global transformation module to obtain a similarity transformation matrix, and global similarity transformation is carried out on the thermal infrared image; extracting the descriptors of the thermal infrared image subjected to global similarity transformation by using the descriptor network again, adding the descriptors to the visible light descriptors, and inputting the extracted descriptors into the motion field estimation network again to obtain a local motion field; and carrying out local displacement field transformation on the thermal infrared image through a local transformation module to obtain a finally registered thermal infrared image.