CN114820733B - Interpretable thermal infrared visible light image registration method and system

Info

Publication number
CN114820733B
CN114820733B
Authority
CN
China
Prior art keywords
thermal infrared
global
transformation
image
motion field
Prior art date
Legal status
Active
Application number
CN202210420876.6A
Other languages
Chinese (zh)
Other versions
CN114820733A (en)
Inventor
白相志 (Bai Xiangzhi)
汪虹宇 (Wang Hongyu)
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN202210420876.6A
Publication of CN114820733A
Application granted
Publication of CN114820733B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10048 Infrared image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an interpretable thermal infrared and visible light image registration method and system, which basically comprise the following steps: 1) a neural network emulates the traditional registration pipeline of extracting descriptors, matching, estimating transformation parameters, and transforming images; 2) a coarse-to-fine registration strategy applies a global transformation first and a local transformation second; 3) a loss function is constructed to train the network; 4) the trained network solves the thermal infrared image to visible light image registration problem under different camera intrinsic and extrinsic parameters. The proposed interpretable registration deep neural network ERDNN (Explainable Registration Deep Neural Network) achieves pixel-level registration between a thermal infrared camera and a visible light camera whose optical centers do not coincide. The trained descriptor sub-network can also serve as a general-purpose extractor of thermal infrared-visible light cross-modal descriptors. The method has broad practical value and application prospects in computer vision, autonomous driving, surveillance and security, and related fields.

Description

Interpretable thermal infrared visible light image registration method and system
Technical Field
The invention relates to an interpretable thermal infrared and visible light image registration method and system, and belongs to the field of computer vision. It has broad application prospects in computer vision, multi-modal data fusion, surveillance and security, autonomous driving, and related fields.
Background
Infrared thermal imaging has many advantages over visible light imaging. A visible light camera requires auxiliary illumination to image clearly and stably in the dark; a thermal infrared camera, by contrast, images the infrared radiation emitted by the thermal radiation of objects themselves and therefore works at night and under extremely harsh conditions such as poor visibility and haze. Compared with radar detection, infrared imaging is passive: it emits no electromagnetic waves and is therefore hard for an adversary to detect. Thermal infrared cameras can even reveal aircraft that rely on special coatings for optical and radar stealth. These advantages have drawn growing attention from researchers. Moreover, as thermal infrared cameras become cheaper and remain largely insensitive to illumination conditions, their application and study in civil fields such as autonomous driving keep expanding. However, thermal infrared images suffer from low contrast, low resolution, and blurred edges, which makes it difficult to use them directly, in single-modality form, for the perception of complex systems. Registration and fusion of thermal infrared and visible light images are therefore the current trend: the higher the safety requirement, the more redundancy a design needs, and the more it must fuse heterogeneous sensor information and coordinate it under complex operating conditions. On this basis, an infrared and visible light multi-view registration dataset and a highly robust deep learning method are of great significance.
Up to now, thermal infrared and visible light registration methods have been based mainly on traditional registration techniques and on unsupervised deep learning, and the evaluation metrics have largely been subjective.
Thermal infrared and visible light image registration based on traditional registration methods has received considerable attention. Jiang Qian et al. extract the edges of the infrared and visible images with the Canny operator and match them using geometric features of the contour lines (local minimum curvature, direction, etc.) as descriptors, handling thermal infrared and visible light matching under larger viewing angles (see: Jiang, Qian, et al. "A Contour Angle Orientation for Power Equipment Infrared and Visible Image Registration." IEEE Transactions on Power Delivery, vol. 36, no. 4, 2021, pp. 2559-2569). Jiang Zetao et al. hold that the infrared and visible images share the same saliency map, so they first obtain a preliminary saliency map of each image with HC saliency detection and then extract ORB feature points on the saliency map for matching (see: Jiang Zetao, Liu Xiaoyan, Wang Qi. "An infrared and visible light image registration algorithm based on saliency and ORB." Laser and Infrared, 2019, 49(2): 6). Later work by Jiang Zetao, Wei Ming, et al. addresses three remaining problems, namely too few registration points, uneven point distribution, and a high mismatch rate: spatially uniform and sufficiently numerous feature points are obtained with an adaptive Harris corner extraction method, and gradient direction is then fused with mutual information as the similarity measure, greatly reducing the mismatch rate.
Unsupervised learning methods based on deep learning have received much attention in recent years. Marouf et al. hold that thermal infrared and visible images are correlated at the semantic level, so they first train thermal infrared and visible semantic segmentation networks to obtain semantic labels of the images and then use a Spatial Transformer Network (STN) to regress a pixel displacement field for registration, minimizing the MSE loss between the labels (see: Marouf I.E., Barras L., Karaimer H.C., Süsstrunk S. (2021) "Joint Unsupervised Infrared-RGB Video Registration and Fusion", London Imaging Meeting (LIM'21), September 2021). Zhou Meiqi et al. hold that thermal infrared and visible images belong to different "image styles": a CycleGAN is first trained to migrate visible images to the infrared style; the style-migrated infrared image and the real infrared image are then used to regress the displacement field with a Spatial Transformer Network (STN); finally, the discriminator of the generative adversarial network classifies whether the pair matches, and the parts are trained alternately for optimization (see: Zhou Meiqi, Gao Chenjiang, et al. "Infrared and visible light image registration method based on modal transformation." Computer Engineering and Design, 2020, 41(10): 5).
However, the above methods all rely on unlabeled datasets, and most of their evaluation is subjective. Moreover, the feature point descriptors extracted by traditional feature-based registration methods have not been validated on large-scale datasets, and the correlation between thermal infrared and visible images is difficult to measure across different scenes, so the generality of these algorithms is limited; the unsupervised deep learning methods ignore the registration pipeline between thermal infrared and visible images and train end to end, resulting in low algorithmic stability. If the traditional registration pipeline could be introduced into a deep learning network as network modules and trained on a large dataset, it would open a path to solving the thermal infrared and visible light image registration problem. On this basis, the invention takes the traditional registration method as a guide and constructs an interpretable registration deep neural network (Explainable Registration Deep Neural Network, ERDNN for short) that effectively solves the thermal infrared and visible light image registration problem.
Disclosure of Invention
1. Purpose: In view of the above problems, the invention aims to provide an interpretable thermal infrared and visible light image registration method and system. An interpretable registration deep neural network, ERDNN, is constructed with deep learning techniques, and the traditional registration pipeline drives the network to register thermal infrared and visible light images.
2. Technical scheme: To achieve this purpose, the overall idea of the technical scheme of the invention is to split the registration system into four parts: a descriptor sub-network, a motion field estimation network, a global transformation module, and a local transformation module. A parameter-sharing descriptor sub-network extracts the thermal infrared and visible light image descriptors; after the descriptors are summed, the motion field estimation network and the global transformation module apply a global transformation to the infrared image; the descriptor sub-network then extracts descriptors from the globally transformed thermal infrared image, which are summed with the visible light descriptors; finally, the motion field estimation network and the local transformation module apply a local transformation and output the final registered thermal infrared image. The technical idea of the invention mainly comprises the following three aspects:
1) A descriptor sub-network based on metric learning is designed, i.e., the descriptor sub-network is trained with a metric learning loss.
2) The motion field estimation network estimates the motion field (also called the displacement field) of infrared relative to visible light; the global transformation module then applies a similarity transformation and the local transformation module applies a displacement field transformation, realizing a coarse-to-fine, interpretable registration process for the thermal infrared and visible light images.
3) The descriptor sub-network and the motion field estimation network are trained simultaneously under a constructed loss function so that thermal infrared images can be effectively registered to visible light.
The interpretable thermal infrared and visible light image registration method of the invention comprises the following specific steps:
Step one: the visible light image is converted into a single-channel grayscale image by a 1x1 convolutional neural network, and a parameter-sharing descriptor sub-network based on a multi-scale Swin Transformer then extracts the thermal infrared and visible light image descriptors.
Step two: the thermal infrared and visible light image descriptors are summed and fed into the motion field estimation network, which outputs the pixel motion field of infrared relative to visible light. The global transformation module resolves the motion field into a global similarity transformation, which is then applied to the thermal infrared image through a global transformation module based on a Spatial Transformer Network (STN).
Step three: the globally transformed thermal infrared image is fed into the descriptor sub-network again to output its descriptors, which are summed with the original visible light image descriptors and fed into the motion field estimation network to estimate a pixel motion field; finally, the local transformation module applies a local displacement field transformation.
Step four: the network is trained with a constructed loss function comprising three terms: the descriptor metric loss, the global motion field estimation loss, and the local motion field estimation loss.
Output: the registered thermal infrared image, the global transformation parameters, and the local transformation parameters.
Wherein, step one is specifically as follows:
1.1: The three-channel visible light image is converted into a single-channel image using a 4-layer 1x1 convolution.
1.2: A descriptor extraction module extracts the descriptors of the image. Traditional convolutional neural networks are stacked combinations of 2D convolutional layers; limited by the size of their receptive field, they make poor use of long-range information. Fully exploiting the high-level semantics of cross-modal infrared-visible descriptors requires a descriptor network with a large receptive field, so the invention uses a multi-scale Swin Transformer instead of a traditional convolutional neural network; the receptive field thereby extends from the kernel size of a convolution to the image patches of the multi-scale Swin Transformer.
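A minimal PyTorch sketch of steps 1.1 and 1.2 follows. It is illustrative only: the layer widths and feature dimensions are assumptions, and a plain strided-convolution pyramid stands in for the multi-scale Swin Transformer, whose exact configuration the patent does not specify.

```python
import torch
import torch.nn as nn

class GrayProjection(nn.Module):
    """Step 1.1: 4-layer 1x1 convolution mapping a 3-channel image to 1 channel.
    Each layer is conv + batch norm + ReLU as described; widths are assumed."""
    def __init__(self, widths=(16, 16, 16, 1)):
        super().__init__()
        layers, in_ch = [], 3
        for w in widths:
            layers += [nn.Conv2d(in_ch, w, kernel_size=1),
                       nn.BatchNorm2d(w),
                       nn.ReLU(inplace=True)]
            in_ch = w
        self.net = nn.Sequential(*layers)

    def forward(self, x):          # x: (B, 3, H, W) -> (B, 1, H, W)
        return self.net(x)

class DescriptorNet(nn.Module):
    """Step 1.2: shared-parameter descriptor extractor. A strided conv pyramid
    stands in for the multi-scale Swin Transformer of the patent."""
    def __init__(self, dims=(32, 64, 128)):
        super().__init__()
        stages, in_ch = [], 1
        for d in dims:
            stages.append(nn.Sequential(
                nn.Conv2d(in_ch, d, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(d),
                nn.ReLU(inplace=True)))
            in_ch = d
        self.stages = nn.ModuleList(stages)

    def forward(self, x):          # returns one descriptor map per scale
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats
```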
Wherein, step two is specifically as follows:
2.1: The motion field estimation network fuses the multi-scale descriptors extracted in step one and outputs the motion field.
2.2: The global transformation module resolves the displacement field obtained in 2.1 into the image global similarity transformation parameters S = [sR, t], where R is a 2-dimensional image rotation matrix, t is the image translation vector, and s is the image scale factor. Applying S to the thermal infrared image yields the globally transformed thermal infrared image. The global transformation operation is implemented by a Spatial Transformer Network (STN).
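A sketch of how this STN-style global transform can be made differentiable in PyTorch is given below; the normalized-coordinate convention and tensor shapes are assumptions, not prescribed by the patent.

```python
import torch
import torch.nn.functional as F

def warp_similarity(image, s, R, t):
    """Apply a similarity transform S = [sR, t] to an image, STN-style.
    image: (B, C, H, W); s: (B,); R: (B, 2, 2); t: (B, 2) in normalized coords.
    Note: affine_grid expects the output-to-input mapping, so pass the inverse
    of S if S is defined as the input-to-output transform."""
    mat = torch.cat([s.view(-1, 1, 1) * R, t.unsqueeze(-1)], dim=-1)  # (B, 2, 3)
    grid = F.affine_grid(mat, list(image.shape), align_corners=False)
    return F.grid_sample(image, grid, align_corners=False)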
Wherein, step three is specifically as follows:
3.1: The globally transformed thermal infrared image is fed into the descriptor sub-network again to extract descriptors, which are summed with the visible light image descriptors.
3.2: The descriptors from 3.1 are fed into the motion field estimation network, which estimates the motion field between the globally transformed thermal infrared image and the visible light image.
3.3: The local transformation module applies the motion field obtained in 3.2 to the thermal infrared image to obtain the locally transformed thermal infrared image. The local transformation operation is implemented by a Spatial Transformer Network (STN).
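A corresponding sketch of the local displacement-field warp follows; the flow layout (pixel displacements in a 2-channel tensor) is an assumption.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    """Warp an image with a dense motion field, STN-style.
    image: (B, C, H, W); flow: (B, 2, H, W), (dx, dy) displacements in pixels."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=image.device, dtype=image.dtype),
        torch.arange(w, device=image.device, dtype=image.dtype),
        indexing="ij")
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)  # identity grid, (1, 2, H, W)
    coords = base + flow                              # sampling positions in pixels
    # Normalize to the [-1, 1] range expected by grid_sample.
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)              # (B, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)
```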
Wherein, step four is specifically as follows:
4.1: The loss function of the interpretable registration deep neural network ERDNN contains three terms: the descriptor metric loss, the global motion field estimation loss, and the local motion field estimation loss.
For the descriptor metric learning loss, the registration fineness of the thermal infrared and visible light images is first divided into three classes: fully registered, globally registered, and unregistered. If the thermal infrared and visible light images are fully registered, the metric between the thermal infrared and visible light descriptors is pulled as close as possible; if they are unregistered, the descriptors are pushed apart whenever the inter-descriptor metric is below a threshold C2; if only global registration is achieved, the descriptors are pulled together when the inter-descriptor metric exceeds the threshold C2 and pushed apart when it falls below a threshold C1 (C2 > C1 > 0).
The global motion field estimation loss is defined as the mean square error (MSE) between the estimated and the ground-truth global motion field: $L_{global} = \mathrm{MSE}(\hat{F}_{global}, F_{global})$, where $L_{global}$ denotes the global motion field estimation loss, $\hat{F}_{global}$ the estimated global motion field, and $F_{global}$ the ground-truth global motion field.
The local motion field estimation loss is defined as the mean square error (MSE) between the estimated and the ground-truth local motion field: $L_{local} = \mathrm{MSE}(\hat{F}_{local}, F_{local})$, where $L_{local}$ denotes the local motion field estimation loss, $\hat{F}_{local}$ the estimated local motion field, and $F_{local}$ the ground-truth local motion field.
4.2: ERDNN is trained on the training data, with parameter optimization performed by the Adam optimizer.
An interpretable thermal infrared and visible light image registration system, whose basic structural framework and workflow are shown in Fig. 1, comprises:
a descriptor sub-network, used to extract high-level semantic descriptors for corresponding pixels of the thermal infrared and visible light images;
a motion field estimation network, used to estimate pixel motion fields;
a global transformation module, used to resolve the pixel motion field into a similarity transformation and apply the global transformation to the initial infrared image;
a local transformation module, used to apply the local displacement field transformation to the globally transformed thermal infrared image.
Here a network is a system component with optimized parameters, and a module is a system component without optimized parameters. The descriptor sub-network takes the thermal infrared image (or the visible light image) as input and outputs high-level semantic descriptors of its corresponding pixels. The thermal infrared and visible light image descriptors are summed as the input of the motion field estimation network, which outputs a global motion field (also called a displacement field). The global transformation module solves the global motion field for a similarity transformation matrix and applies the global similarity transformation to the thermal infrared image. The descriptor sub-network again extracts descriptors from the globally similarity-transformed thermal infrared image; these are summed with the visible light descriptors and fed into the motion field estimation network again to obtain a local motion field. The local transformation module then applies the local displacement field transformation to the thermal infrared image to obtain the final registered thermal infrared image.
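Condensing this workflow into a forward-pass sketch, under the assumption that the components are implemented as in the other sketches in this description (all names are illustrative placeholders):

```python
def erdnn_forward(ir, vis, gray_proj, descriptor_net, motion_net,
                  solve_similarity, warp_similarity, warp_with_flow):
    """Coarse-to-fine pass of Fig. 1: describe, estimate, transform, repeat."""
    vis_gray = gray_proj(vis)                      # 1x1 convs: RGB -> 1 channel
    d_vis = descriptor_net(vis_gray)               # shared-parameter descriptors
    d_ir = descriptor_net(ir)
    flow_global = motion_net([a + b for a, b in zip(d_ir, d_vis)])
    s, R, t = solve_similarity(flow_global)        # resolve S = [sR, t]
    ir_global = warp_similarity(ir, s, R, t)       # STN global transform
    d_ir_global = descriptor_net(ir_global)        # re-describe the warped image
    flow_local = motion_net([a + b for a, b in zip(d_ir_global, d_vis)])
    ir_registered = warp_with_flow(ir_global, flow_local)
    return ir_registered, (s, R, t), flow_local
```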
3. Advantages and effects: The invention provides an interpretable thermal infrared and visible light image registration network, ERDNN, consisting of a descriptor sub-network, a motion field estimation network, a global transformation module, and a local transformation module. The descriptor sub-network is used three times in the whole network and cooperates with the remaining networks and modules to obtain high-level semantic thermal infrared-visible light cross-modal descriptors. The motion field estimation network is used twice in total, following a coarse-to-fine strategy of global transformation followed by local transformation. The invention solves the thermal infrared and visible light image registration problem accurately and robustly, and has broad practical value and application prospects in computer vision, autonomous driving, surveillance and security, and related fields.
Drawings
Fig. 1 shows the basic structural framework of the thermal infrared and visible light image registration algorithm proposed by the invention.
Fig. 2 shows the basic structure of the descriptor sub-network.
Fig. 3 shows the basic structure of the motion field estimation network.
Fig. 4 shows the basic structure of the global transformation module.
Fig. 5 shows the basic structure of the local transformation module.
Figs. 6a-6d show the registration effect of the network: Figs. 6a and 6b are the input thermal infrared and visible light images to be registered, Fig. 6c is the thermal infrared image after the global parameter transformation, and Fig. 6d is the thermal infrared image after the local transformation.
Detailed Description
For a better understanding of the technical solution of the invention, embodiments are further described below with reference to the accompanying drawings.
The invention is an interpretable thermal infrared and visible light image registration algorithm whose framework and network structure are shown in Fig. 1; its specific implementation steps are as follows:
Step one: a parameter-sharing descriptor sub-network extracts descriptors from the thermal infrared image and from the single-channel visible light image produced by the 1x1 convolution; the basic structure of the descriptor sub-network is shown in Fig. 2;
Step two: the multi-scale descriptors of the thermal infrared and visible light images are summed, and the motion field estimation network estimates the global motion field of infrared relative to visible light, its basic structure being shown in Fig. 3; the global transformation module resolves the global motion field into a similarity transformation and applies the global similarity transformation to the thermal infrared image, as shown in Fig. 4;
Step three: the descriptor sub-network is used again to extract the descriptors of the globally transformed thermal infrared image, which are summed with the visible light descriptors, and the motion field estimation network is used again to estimate the local motion field, as shown in Fig. 3; the local transformation module applies this displacement field directly to the globally transformed thermal infrared image to obtain the final transformed image, as shown in Fig. 5;
Step four: a loss function is constructed to train the whole ERDNN network;
Output: the registered thermal infrared image, the global transformation parameters, and the local transformation parameters.
Wherein, step one is specifically as follows:
1.1: The 3-channel visible light image is first converted into a single-channel image by a 4-layer neural network, each layer comprising a 1x1 convolution operation, a batch normalization operation, and a ReLU activation function.
1.2: A multi-scale Swin Transformer extracts the multi-scale feature descriptors of the image. Under the processing of the descriptor sub-network, the input image is downsampled and encoded at an increasing number of scales, and the multi-scale descriptors are finally output for the subsequent parameter prediction.
Wherein, step two is specifically as follows:
2.1: The thermal infrared and visible light multi-scale descriptors are summed and fed into the motion field estimation network. The network first fuses the multi-scale descriptors with a multi-layer convolutional network with residual connections; the fused feature map then passes through a 5-layer convolutional neural network with residual connections to obtain the global displacement field.
2.2: The global transformation module resolves the global displacement field into a similarity transformation with a similarity-transformation ICP algorithm, and the similarity transformation is then applied to the input infrared image to obtain the globally transformed thermal infrared image. The solving method adopts the direct linear transformation (DLT) algorithm, converting the parameter-solving problem of the transformation matrix into the solution of a linear system of the form Ax = b, whose optimal solution can be computed with singular value decomposition. Since the solving process is differentiable, it can be embedded in a neural network to allow gradient backpropagation.
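The patent describes the solve as a DLT linear system with an SVD solution; a closely related, differentiable alternative is the closed-form Umeyama/Procrustes fit sketched below, which treats every pixel and its flow target as a correspondence. This substitution, and the flow convention, are assumptions.

```python
import torch

def solve_similarity(flow):
    """Fit s, R, t such that dst ~ s * R @ src + t, from a dense motion field.
    flow: (B, 2, H, W) pixel displacements. Differentiable end to end."""
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=flow.dtype, device=flow.device),
        torch.arange(w, dtype=flow.dtype, device=flow.device),
        indexing="ij")
    src = torch.stack((xs, ys), 0).reshape(1, 2, -1).expand(b, -1, -1)  # (B,2,N)
    dst = src + flow.reshape(b, 2, -1)                 # correspondences
    mu_s = src.mean(-1, keepdim=True)
    mu_d = dst.mean(-1, keepdim=True)
    sc, dc = src - mu_s, dst - mu_d
    cov = dc @ sc.transpose(1, 2) / sc.shape[-1]       # (B,2,2) cross-covariance
    U, sing, Vh = torch.linalg.svd(cov)
    det = torch.linalg.det(U @ Vh)
    D = torch.diag_embed(torch.stack([torch.ones_like(det), det.sign()], -1))
    R = U @ D @ Vh                                     # nearest true rotation
    var_src = sc.pow(2).sum(1).mean(-1)                # mean squared source radius
    s = (sing * D.diagonal(dim1=-2, dim2=-1)).sum(-1) / var_src
    t = (mu_d - s.view(b, 1, 1) * (R @ mu_s)).squeeze(-1)
    return s, R, t
```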
Wherein, step three is specifically as follows:
3.1: The descriptor sub-network extracts the descriptors of the globally transformed thermal infrared image; these are summed with the visible light descriptors and fed into the motion field estimation network, which produces the local displacement field.
3.2: The local transformation module uses the local displacement field to fine-tune the globally transformed thermal infrared image, yielding the final registration result.
The fourth step is specifically as follows:
4.1: the description sub-network metric learning penalty is shown below.
Where d p is the pixel euclidean distance function, returning the euclidean distance of the corresponding pixels of the two images. m 1,m2 is the distance threshold hyper-parameter, m 1=10,m2=5.d2 is the descriptor of the visible image, d 4 is the accurately registered thermal infrared image descriptor, d 1 is the unregistered thermal infrared image descriptor, d 3 is the thermal infrared image descriptor with global matching,Representing the square of the matrix Frobenius norm.
4.2: The global motion field estimation loss is shown below:
$L_{global} = \mathrm{MSE}(\hat{F}_{global}, F_{global})$
where $L_{global}$ denotes the global motion field estimation loss, $\hat{F}_{global}$ the estimated global motion field, and $F_{global}$ the ground-truth global motion field; the mean square error function is defined as $\mathrm{MSE}(X, Y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - y_i)^2$.
4.3: The local motion field estimation loss is shown below:
$L_{local} = \mathrm{MSE}(\hat{F}_{local}, F_{local})$
where $L_{local}$ denotes the local motion field estimation loss, $\hat{F}_{local}$ the estimated local motion field, and $F_{local}$ the ground-truth local motion field.
4.4: The overall loss function takes the form
$L = \alpha L_{metric} + L_{global} + L_{local}$
where the three terms $L_{metric}$, $L_{global}$, $L_{local}$ denote the descriptor sub-network metric learning loss, the global motion field estimation loss, and the local motion field estimation loss, respectively, and $\alpha$ is a weighting coefficient, taken as 0.3 in the invention.
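A sketch of the full objective under the notation of this section follows; the mean reductions, and the metric term reconstructed from the margin behaviour described above, are assumptions.

```python
import torch
import torch.nn.functional as F

def pixel_dist(a, b):
    """d_p: per-pixel Euclidean distance over the channel axis -> (B, H, W)."""
    return torch.linalg.vector_norm(a - b, dim=1)

def erdnn_loss(d1, d2, d3, d4, flow_g_est, flow_g_gt, flow_l_est, flow_l_gt,
               m1=10.0, m2=5.0, alpha=0.3):
    """L = alpha * L_metric + L_global + L_local (mean reductions assumed)."""
    l_metric = (pixel_dist(d2, d4).pow(2).mean()                  # pull registered
                + F.relu(m1 - pixel_dist(d2, d1)).pow(2).mean()   # push unregistered
                + F.relu(pixel_dist(d2, d3) - m1).pow(2).mean()   # global band:
                + F.relu(m2 - pixel_dist(d2, d3)).pow(2).mean())  # keep in [m2, m1]
    l_global = F.mse_loss(flow_g_est, flow_g_gt)
    l_local = F.mse_loss(flow_l_est, flow_l_gt)
    return alpha * l_metric + l_global + l_local
```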
The autonomous driving field often requires multi-modal data fusion to enhance the perception capability, and thereby the robustness, of the system. To demonstrate the effect of the invention intuitively, the network is trained on a constructed thermal infrared-visible light image registration dataset of autonomous driving scenes. The dataset is first split into a training set and a test set, and the motion field labels are decomposed into global motion field labels (the motion field corresponding to the similarity transformation) and local motion field labels (the motion field corresponding to the local displacement field transformation).
During training, a thermal infrared-visible light image pair is input as described above. The first pass of the motion field estimation network outputs the estimated global motion field, which forms the global motion field estimation loss with the ground-truth global motion field; the second pass outputs the estimated local motion field, which forms the local motion field estimation loss with the ground-truth local motion field; adding the metric learning loss of the descriptor sub-network gives the loss function (objective function) of the entire neural network. Training is optimized with the Adam optimizer for a total of 200 rounds.
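A minimal training-loop sketch under the stated settings (Adam, 200 rounds) is given below; the dataset interface, learning rate, and the compute_loss helper are assumptions.

```python
import torch

def train(model, loader, epochs=200, lr=1e-4, device="cuda"):
    """Train ERDNN on (ir, vis, global-field label, local-field label) batches."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for epoch in range(epochs):
        for ir, vis, flow_g_gt, flow_l_gt in loader:
            ir, vis = ir.to(device), vis.to(device)
            flow_g_gt, flow_l_gt = flow_g_gt.to(device), flow_l_gt.to(device)
            loss = model.compute_loss(ir, vis, flow_g_gt, flow_l_gt)  # assumed helper
            opt.zero_grad()
            loss.backward()
            opt.step()
```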
The trained model is tested on the test set. Figs. 6a and 6b show the input thermal infrared and visible light image pair to be registered, Fig. 6c shows the image after the global transformation, and Fig. 6d shows the final registration output after the local transformation is added on top of the global transformation. Comparing Fig. 6b with Fig. 6d shows that the invention effectively solves the thermal infrared and visible light image registration problem: the transformed thermal infrared image has a small numerical error with respect to the ground-truth reference labels and is highly consistent with the spatial distribution of the visible light image.
The interpretable registration deep neural network ERDNN proposed by the invention effectively solves the registration problem of thermal infrared and visible light images. Multi-modal image registration is the basis of fusion, and after registration and fusion the back end can perform perception tasks such as semantic segmentation and object detection. In summary, the invention has broad practical value and application prospects in computer vision, autonomous driving, surveillance and security, and related fields.

Claims (6)

1. An interpretable thermal infrared and visible light image registration method, characterized by comprising the following steps:
Step one: converting the visible light image into a single-channel grayscale image through a 1x1 convolutional neural network, and then extracting the thermal infrared and visible light image descriptors with a parameter-sharing descriptor sub-network based on a multi-scale Swin Transformer;
Step two: summing the thermal infrared and visible light image descriptors, feeding them into a motion field estimation network, and outputting the pixel motion field of infrared relative to visible light; then resolving the motion field into a global similarity transformation with a global transformation module, and applying the global transformation to the thermal infrared image through a global transformation module based on a Spatial Transformer Network (STN);
Step three: feeding the globally transformed thermal infrared image into the descriptor sub-network again, outputting the descriptors of the globally transformed thermal infrared image, summing them with the original visible light image descriptors, feeding the result into the motion field estimation network to estimate a pixel motion field, and finally applying a local displacement field transformation through a local transformation module;
Step four: constructing a loss function to train the network, the loss function comprising three terms: the descriptor metric loss, the global motion field estimation loss, and the local motion field estimation loss;
Output: the registered thermal infrared image, the global transformation parameters, and the local transformation parameters;
wherein step one is specifically as follows:
S1.1: converting the three-channel visible light image into a single-channel image using a 4-layer 1x1 convolution, each layer comprising a 1x1 convolution operation, a batch normalization operation, and a ReLU activation function;
S1.2: extracting the descriptors of the image with a descriptor extraction module, using a multi-scale Swin Transformer instead of a traditional convolutional neural network, so that the receptive field extends from the kernel size of a convolution to the image patches of the multi-scale Swin Transformer;
wherein step three is specifically as follows:
S3.1: feeding the globally transformed thermal infrared image into the descriptor sub-network again to extract descriptors, and summing them with the visible light image descriptors;
S3.2: feeding the descriptors of step 3.1 into the motion field estimation network, which estimates the motion field between the globally transformed thermal infrared image and the visible light image;
S3.3: applying the motion field obtained in step 3.2 to the thermal infrared image with the local transformation module to obtain the locally transformed thermal infrared image, the local transformation operation being implemented by a Spatial Transformer Network (STN);
wherein step four is specifically as follows:
S4.1: the loss function of the interpretable registration deep neural network ERDNN contains three terms: the descriptor metric loss, the global motion field estimation loss, and the local motion field estimation loss;
for the descriptor metric learning loss, the registration fineness of the thermal infrared and visible light images is first divided into three classes: fully registered, globally registered, and unregistered; if the thermal infrared and visible light images are fully registered, the metric between the thermal infrared and visible light descriptors is pulled as close as possible; if they are unregistered, the descriptors are pushed apart whenever the inter-descriptor metric is below a threshold C2; if only global registration is achieved, the descriptors are pulled together when the inter-descriptor metric exceeds the threshold C2 and pushed apart when it falls below a threshold C1; C2 > C1 > 0;
the global motion field estimation loss is defined as the mean square error (MSE) between the estimated and the ground-truth global motion field: $L_{global} = \mathrm{MSE}(\hat{F}_{global}, F_{global})$, where $L_{global}$ denotes the global motion field estimation loss, $\hat{F}_{global}$ the estimated global motion field, and $F_{global}$ the ground-truth global motion field;
the local motion field estimation loss is defined as the mean square error (MSE) between the estimated and the ground-truth local motion field: $L_{local} = \mathrm{MSE}(\hat{F}_{local}, F_{local})$, where $L_{local}$ denotes the local motion field estimation loss, $\hat{F}_{local}$ the estimated local motion field, and $F_{local}$ the ground-truth local motion field; the mean square error function is defined as $\mathrm{MSE}(X, Y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - y_i)^2$;
S4.2: performing a training on ERDNN on the training data; performing parameter optimization adjustment by using an Adam optimizer;
the descriptor sub-network metric learning loss is as follows:
$L_{metric} = \|d_p(d_2, d_4)\|_F^2 + \|\max(0,\ m_1 - d_p(d_2, d_1))\|_F^2 + \|\max(0,\ d_p(d_2, d_3) - m_1)\|_F^2 + \|\max(0,\ m_2 - d_p(d_2, d_3))\|_F^2$
where $d_p$ is the pixel Euclidean distance function, returning the Euclidean distance between corresponding pixels of two images; $m_1, m_2$ are distance-threshold hyperparameters, with $m_1 = 10$ and $m_2 = 5$; $d_2$ is the descriptor of the visible light image, $d_4$ the descriptor of the accurately registered thermal infrared image, $d_1$ the descriptor of the unregistered thermal infrared image, and $d_3$ the descriptor of the globally registered thermal infrared image; $\|\cdot\|_F^2$ denotes the squared matrix Frobenius norm;
the overall loss function takes the form
$L = \alpha L_{metric} + L_{global} + L_{local}$
where the three terms $L_{metric}$, $L_{global}$, $L_{local}$ denote the descriptor sub-network metric learning loss, the global motion field estimation loss, and the local motion field estimation loss, respectively, and $\alpha$ is a weighting coefficient, taken as 0.3.
2. The interpretable thermal infrared and visible light image registration method according to claim 1, characterized in that step two is specifically as follows:
S2.1: the motion field estimation network fuses the multi-scale descriptors extracted in step one and outputs the motion field;
S2.2: the global transformation module resolves the displacement field obtained in step 2.1 into the image global similarity transformation parameters S = [sR, t], where R is a 2-dimensional image rotation matrix, t is the image translation vector, and s is the image scale factor; applying S to the thermal infrared image yields the globally transformed thermal infrared image; the global transformation operation is implemented by a Spatial Transformer Network (STN).
3. The interpretable thermal infrared and visible light image registration method according to claim 1 or 2, characterized in that: the thermal infrared and visible light multi-scale descriptors are summed and fed into the motion field estimation network; the motion field estimation network first fuses the multi-scale descriptors with a multi-layer convolutional network with residual connections; the fused feature map then passes through a 5-layer convolutional neural network with residual connections to obtain the global displacement field.
4. The interpretable thermal infrared and visible light image registration method according to claim 1 or 2, characterized in that: the global transformation module resolves the global displacement field into a similarity transformation with a similarity-transformation ICP algorithm, and the similarity transformation is then applied to the input infrared image to obtain the globally transformed thermal infrared image; the solving method adopts the direct linear transformation (DLT) algorithm, converting the parameter-solving problem of the transformation matrix into the solution of a linear system of the form Ax = b, whose optimal solution can be computed with singular value decomposition; since the solving process is differentiable, it can be embedded in a neural network to allow gradient backpropagation.
5. A system for implementing the interpretable thermal infrared and visible light image registration method of claim 1, comprising:
a descriptor sub-network, used to extract high-level semantic descriptors for corresponding pixels of the thermal infrared and visible light images;
a motion field estimation network, used to estimate pixel motion fields;
a global transformation module, used to resolve the pixel motion field into a similarity transformation and apply the global transformation to the initial infrared image;
a local transformation module, used to apply the local displacement field transformation to the globally transformed thermal infrared image.
6. The system according to claim 5, characterized in that: a network is a system component with optimized parameters, and a module is a system component without optimized parameters; the descriptor sub-network takes the thermal infrared image as input and outputs the high-level semantic descriptors of its corresponding pixels; the thermal infrared and visible light image descriptors are summed as the input of the motion field estimation network, which outputs a global motion field, i.e., a displacement field; the global transformation module solves the global motion field for a similarity transformation matrix and applies the global similarity transformation to the thermal infrared image; the descriptor sub-network again extracts descriptors from the globally similarity-transformed thermal infrared image, these are summed with the visible light descriptors and fed into the motion field estimation network again to obtain a local motion field; and the local transformation module applies the local displacement field transformation to the thermal infrared image to obtain the final registered thermal infrared image.
CN202210420876.6A 2022-04-21 2022-04-21 Interpretable thermal infrared visible light image registration method and system Active CN114820733B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210420876.6A CN114820733B (en) 2022-04-21 2022-04-21 Interpretable thermal infrared visible light image registration method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210420876.6A CN114820733B (en) 2022-04-21 2022-04-21 Interpretable thermal infrared visible light image registration method and system

Publications (2)

Publication Number Publication Date
CN114820733A (en) 2022-07-29
CN114820733B (en) 2024-05-31

Family

ID=82505631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210420876.6A Active CN114820733B (en) 2022-04-21 2022-04-21 Interpretable thermal infrared visible light image registration method and system

Country Status (1)

Country Link
CN (1) CN114820733B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115457020B (en) * 2022-09-29 2023-12-26 电子科技大学 2D medical image registration method fusing residual image information
CN116433730B (en) * 2023-06-15 2023-08-29 南昌航空大学 Image registration method combining deformable convolution and modal conversion

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109345575A (en) * 2018-09-17 2019-02-15 中国科学院深圳先进技术研究院 A kind of method for registering images and device based on deep learning
CN109448035A (en) * 2018-11-14 2019-03-08 重庆邮电大学 Infrared image and visible light image registration method based on deep learning
WO2021120406A1 (en) * 2019-12-17 2021-06-24 大连理工大学 Infrared and visible light fusion method based on saliency map enhancement
CN113628261A (en) * 2021-08-04 2021-11-09 国网福建省电力有限公司泉州供电公司 Infrared and visible light image registration method in power inspection scene
CN113763442A (en) * 2021-09-07 2021-12-07 南昌航空大学 Deformable medical image registration method and system
CN114255197A (en) * 2021-12-27 2022-03-29 西安交通大学 Infrared and visible light image self-adaptive fusion alignment method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161356B (en) * 2019-12-17 2022-02-15 大连理工大学 Infrared and visible light fusion method based on double-layer optimization

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109345575A (en) * 2018-09-17 2019-02-15 中国科学院深圳先进技术研究院 A kind of method for registering images and device based on deep learning
CN109448035A (en) * 2018-11-14 2019-03-08 重庆邮电大学 Infrared image and visible light image registration method based on deep learning
WO2021120406A1 (en) * 2019-12-17 2021-06-24 大连理工大学 Infrared and visible light fusion method based on saliency map enhancement
CN113628261A (en) * 2021-08-04 2021-11-09 国网福建省电力有限公司泉州供电公司 Infrared and visible light image registration method in power inspection scene
CN113763442A (en) * 2021-09-07 2021-12-07 南昌航空大学 Deformable medical image registration method and system
CN114255197A (en) * 2021-12-27 2022-03-29 西安交通大学 Infrared and visible light image self-adaptive fusion alignment method and system

Also Published As

Publication number Publication date
CN114820733A (en) 2022-07-29

Similar Documents

Publication Publication Date Title
Yang et al. Multifeature fusion-based object detection for intelligent transportation systems
Li et al. Cross-domain object detection for autonomous driving: A stepwise domain adaptative YOLO approach
CN111145131B (en) Infrared and visible light image fusion method based on multiscale generation type countermeasure network
CN114820733B (en) Interpretable thermal infrared visible light image registration method and system
CN112651262B (en) Cross-modal pedestrian re-identification method based on self-adaptive pedestrian alignment
Choi et al. Safenet: Self-supervised monocular depth estimation with semantic-aware feature extraction
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
CN114972748B (en) Infrared semantic segmentation method capable of explaining edge attention and gray scale quantization network
Tang et al. Sonar image mosaic based on a new feature matching method
Zhou et al. Graph attention guidance network with knowledge distillation for semantic segmentation of remote sensing images
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN116704273A (en) Self-adaptive infrared and visible light dual-mode fusion detection method
Li et al. RGBD-SLAM based on object detection with two-stream YOLOv4-MobileNetv3 in autonomous driving
Liu et al. Learning optical flow and scene flow with bidirectional camera-lidar fusion
CN116772820A (en) Local refinement mapping system and method based on SLAM and semantic segmentation
CN115880333A (en) Three-dimensional single-target tracking method based on multi-mode information fusion
Han et al. Self-supervised monocular Depth estimation with multi-scale structure similarity loss
Wu et al. Self-supervised monocular depth estimation scale recovery using ransac outlier removal
Kitt et al. Trinocular optical flow estimation for intelligent vehicle applications
Wei et al. Fast detection of moving objects based on sequential images processing
Guo et al. Stereo cross-attention network for unregistered hyperspectral and multispectral image fusion
Lőrincz et al. Single view distortion correction using semantic guidance
Zhao et al. Visible/Infrared Image Registration Based on Region-Adaptive Contextual Multi-Features
Zhou et al. Underwater occluded object recognition with two-stage image reconstruction strategy
Dong et al. Framework of degraded image restoration and simultaneous localization and mapping for multiple bad weather conditions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant