CN116363382B - Dual-band image feature point searching and matching method - Google Patents

Dual-band image feature point searching and matching method

Info

Publication number
CN116363382B
CN116363382B · CN202310106850.9A · CN202310106850A
Authority
CN
China
Prior art keywords
feature
matching
training
image
infrared
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310106850.9A
Other languages
Chinese (zh)
Other versions
CN116363382A (en)
Inventor
蒋一纯
刘云清
詹伟达
郭金鑫
韩登
于永吉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Science and Technology
Priority to CN202310106850.9A priority Critical patent/CN116363382B/en
Publication of CN116363382A publication Critical patent/CN116363382A/en
Application granted granted Critical
Publication of CN116363382B publication Critical patent/CN116363382B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of image processing, and in particular relates to a dual-band image feature point searching and matching method comprising the following steps. Step 1, constructing a feature extraction network model: a feature extraction network model is built according to the computing performance and storage capacity of the training computer, a base model is pre-trained, and a nonlinear multi-layer perceptron is added at the output of the base model. Step 2, preparing a data set for training the feature extraction network: the data set comprises infrared and visible light image pairs of the same scene; coarse registration is first performed on the data set, and the images are then cut into image block pairs of uniform size for subsequent network training. The invention constructs feature descriptors of infrared and visible light images based on a contrastive learning method; the descriptors do not depend on manual design and can provide more stable and reliable feature vectors when large deformation exists between the images.

Description

Dual-band image feature point searching and matching method
Technical Field
The invention relates to the technical field of image processing, in particular to a method for searching and matching characteristic points of a dual-band image.
Background
In engineering applications, owing to the high uncertainty of the environment, a single imaging modality is easily affected by the environment; to improve the robustness of a system, multiple imaging devices, such as thermal infrared and visible light cameras, are often required to work simultaneously to provide more information. However, most existing algorithms for processing infrared and visible light images are built on the premise that the infrared and visible light images are already aligned, so the acquired images must first be registered. The key to image registration is to extract feature points from the different images, match them in pairs, compute the mapping matrix between the images, and finally complete the registration. Mainstream feature point searching and matching algorithms rely on manually designed descriptors, which focus on a subset of salient features in the image, such as corner points, extreme points or gradient histograms, and then associate feature points according to certain similarity or distribution relations; however, because the imaging mechanism of infrared images differs from that of visible light images, there are large differences between image features, which seriously affects the precision of feature searching and extraction and cannot meet practical requirements.
the Chinese patent publication number is CN110428455B, the name is a method for registering targets of a visible light image and a far infrared image, and the method extracts the space gray histogram characteristics of the visible light image and the far infrared image respectively so as to roughly classify the targets in the infrared image and the visible light image; then, after extracting edge features of visible light and infrared images, constructing an edge direction histogram, and carrying out correlation measurement, realizing feature point matching; the robustness of the manually designed feature descriptors and the matching algorithm is poor, and the possible blurring, noise, brightness change and the like in the image are difficult to deal with, so that the performance is more sharply reduced when the infrared and visible light image scale and angle are different; therefore, by means of deep learning technology, by means of the strong feature extraction and expression capability of the neural network, the construction of a more robust feature description and matching method is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
(I) Technical problems to be solved
Aiming at the defects of the prior art, the invention provides a method for searching and matching the characteristic points of a dual-band image, which solves the problems of large characteristic difference, high searching difficulty and poor matching precision in the existing infrared and visible light image registration method.
(II) technical scheme
To achieve the above purpose, the invention adopts the following technical scheme:
a method for searching and matching characteristic points of a dual-band image comprises the following steps:
step 1, constructing a feature extraction network model: constructing a characteristic extraction network model according to the calculation performance and the storage capacity of a training computer, pre-training a basic model, and adding a nonlinear multi-layer perceptron at the output end of the basic model;
step 2, preparing a data set for training a feature extraction network: the data set comprises infrared and visible light image pairs under the same scene, firstly, coarse registration is carried out on the data set, and then the images are cut into image block pairs with uniform specification so as to facilitate subsequent network training;
step 3, training the characteristic extraction network parameter: respectively carrying out data augmentation on the infrared and visible light images in the data set prepared in the step 2, then inputting training data into the network model constructed in the step 1 for training, and minimizing a loss function;
step 4, constructing a fine-grained feature description network model: the fine-grained feature description network model consists of three parts: feature initialization, feature extraction and feature description;
step 5, preparing a data set for the fine-grained characterization network: this step is substantially identical to step 2, but the dataset requires accurate registration;
step 6, training fine granularity characteristic description network parameters: the step is basically consistent with the step 3, but image displacement is not included in the data set augmentation, so that strict alignment between infrared and visible light images is ensured;
step 7, searching and matching the infrared and visible light image feature points: firstly, the infrared and visible light images are respectively input into the feature extraction network to obtain the feature map of a designated layer, and the feature map is unfolded into feature vectors; similarity measurement is performed on the feature vectors of the infrared and visible light images, and coarse-grained matching of the features is carried out according to similarity and direction consistency; for each pair of matching points, an image area centered on the matching point is cropped, the fine-grained feature description network processes the area in a sliding manner to obtain the corresponding set of feature descriptors, similarity comparison is performed for fine-grained matching, and the feature searching and matching of the infrared and visible light images is finally completed.
Further, the feature extraction network model in step 1 can be flexibly selected according to the training computing equipment; ResNet and its variants among convolutional neural networks are recommended, with a network depth of at least 34 layers; when training data are very sufficient, the standard ViT or the Shifted Window Transformer (Swin Transformer) among Vision Transformers can also be selected to obtain better feature extraction capability.
Further, the multi-layer perceptron added at the output end of the basic network in the step 1 has a two-layer structure, and an activation function is added in the middle to provide nonlinear mapping capability.
Further, the pre-training weights of the feature extraction network model in step 1 need to be obtained by training on ImageNet or an image recognition dataset of equivalent scale.
Further, the data set in step 2 uses the FLIR ADAS data set; a small number of picture pairs in the data set are selected, feature points are manually annotated, selected and matched, and a correction coordinate map is obtained by calculation; the correction coordinate map is applied to the whole data set to obtain coarsely registered image pairs.
Further, the data augmentation in step 3 includes image rotation, translation, scaling, shear, contrast transformation, random-probability graying and random Gaussian blur, and the center of all transformations is the center of the image.
Further, the loss function in step 3 adopts a contrastive loss; specifically, infrared and visible light images of the same scene are regarded as positive samples and infrared and visible light images of different scenes as negative samples, the inner products between the vectors output by the feature extraction network for the different samples are calculated, and the network parameters are optimized by computing the contrastive loss over these inner products.
Further, the feature initialization of the feature extraction network model in the step 4 is composed of a first convolution layer, a first batch normalization layer, a second convolution layer and a second batch normalization layer; the characteristic extraction module can be composed of a plurality of residual convolution blocks, wherein the residual convolution blocks are composed of a first convolution layer, a first batch normalization layer, a second convolution layer, a second batch normalization layer and residual connection; the feature description consists of an average pooling layer and a multi-layer perceptron.
Further, the data augmentation in step 6 includes rotation, scaling and shear with the image center as the transformation center.
Further, the output value of the similarity measure function in step 7 should be positively correlated with the similarity of the input image pair to match the characteristics of the contrast loss function.
(III) beneficial effects
Compared with the prior art, the invention provides a method for searching and matching the characteristic points of the dual-band image, which has the following beneficial effects:
the invention constructs the characteristic descriptors of the infrared and visible light images based on the contrast learning method, and the characteristic descriptors do not depend on manual design, and can provide more stable and reliable characteristic vectors when larger deformation exists between the images.
In the training process of the proposed feature extraction network model, the invention does not require manual annotation of matching feature points or accurate registration, thereby realizing unsupervised learning; combined with transfer learning, this reduces the dependence of deep-learning-based methods on massive data and its manual annotation.
The coarse-grained and fine-grained two-step feature searching and matching provided by the invention effectively narrows the range of feature point searching and alleviates the high computational complexity of deep learning techniques.
In the proposed feature matching process, the invention sets a confidence threshold based on statistics over extensive data, calculates the spatial relation of the feature point pairs with the highest confidence, and sets a direction threshold; correctly matched feature points are screened according to these thresholds, which effectively reduces the probability of mismatching and improves the feature matching precision.
Drawings
FIG. 1 is a flow chart of a method for searching and matching feature points of a dual-band image;
FIG. 2 is a schematic diagram of a feature extraction network model training method;
FIG. 3 is a schematic diagram of a fine-grained characterization network model;
FIG. 4 is a schematic diagram of a fine-grained characterization network model training method;
FIG. 5 is a flow chart of the feature search and matching process operating principle;
FIG. 6 is a comparison result of the main performance indexes of the method for searching and matching feature points of a dual-band image according to the present invention and the prior art;
fig. 7 is a schematic diagram of an internal structure of an electronic device for implementing the method for searching and matching feature points of a dual-band image according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in fig. 1, a flow chart of a method for searching and matching feature points of a dual-band image, the method specifically includes the following steps:
step 1, constructing a feature extraction network model: the feature extraction network model can be built on any basic image classification network model, flexibly selected according to the computing performance and storage capacity of the training computer; ResNet and its variants among convolutional neural networks are recommended, with a network depth of at least 34 layers; when training data are very sufficient, the standard ViT or the Shifted Window Transformer (Swin Transformer) among Vision Transformers can also be selected to obtain better feature extraction capability; the base model requires pre-training weights, which need to be obtained by training on ImageNet or an image recognition dataset of the same scale; a nonlinear multi-layer perceptron is added at the output of the base model, which has two fully connected layers with an activation function in between to provide nonlinear mapping capability;
step 2, preparing a data set for training the feature extraction network: a training dataset is prepared using the FLIR ADAS dataset; a small number of picture pairs in the dataset are selected, feature points are manually annotated, selected and matched, and a correction coordinate map is obtained by calculation; the correction coordinate map is applied to the whole dataset to obtain coarsely registered image pairs; the images are cut into image block pairs of uniform size for subsequent network training;
step 3, training the feature extraction network parameters: data augmentation is applied separately to the infrared and visible light images in the dataset prepared in step 2, including image rotation, translation, scaling, shear, contrast transformation, random-probability graying and random Gaussian blur, with the center of all transformations being the center of the image, so as to improve the generalization ability of training; then a large number of infrared and visible light image pairs are input into the network model constructed in step 1 for training, infrared and visible light images of the same scene are taken as positive samples and those of different scenes as negative samples, the inner products between the vectors output by the feature extraction network for the different samples are calculated, and the network parameters are optimized by computing and minimizing the contrastive loss over these inner products;
step 4, constructing a fine-grained feature description network model: the fine-grained feature description network model consists of three parts: feature initialization, feature extraction and feature description; feature initialization consists of a first convolution layer, a first batch normalization layer, a second convolution layer and a second batch normalization layer; the feature extraction module consists of eight residual convolution blocks, each composed of a first convolution layer, a first batch normalization layer, a second convolution layer and a second batch normalization layer; feature description consists of an average pooling layer and a multi-layer perceptron;
step 5, preparing a data set for the fine-grained characterization network: the dataset adopts a RoadScene dataset or similar strictly registered infrared and visible light image datasets;
step 6, training the fine-grained feature description network parameters: data augmentation is applied separately to the infrared and visible light images in the dataset prepared in step 5, including rotation, scaling and shear with the image center as the transformation center, to improve the generalization ability of training; the training scheme is the same as in step 3;
step 7, searching and matching the infrared and visible light image feature points: firstly, the infrared and visible light images are respectively input into the feature extraction network to obtain the feature map of a designated layer, and the feature map is unfolded into feature vectors; similarity measurement is performed on the feature vectors of the infrared and visible light images, and coarse-grained matching of the features is carried out according to similarity and direction consistency; for each pair of matching points, an image area centered on the matching point is cropped, the fine-grained feature description network processes the area in a sliding manner to obtain the corresponding set of feature descriptors, similarity comparison is performed for fine-grained matching, and the feature searching and matching of the infrared and visible light images is finally completed.
Example 2:
as shown in fig. 1, a flow chart of a method for searching and matching feature points of a dual-band image, the method specifically includes the following steps:
step 1, constructing a feature extraction network model;
the feature extraction network is used for searching and matching coarse-grained feature descriptors and can extract latent common image structure features from infrared and visible light images; considering the computation and storage costs during training, this embodiment adopts the classical convolutional neural network ResNet50, which consists of an input stem, a first residual module, a second residual module, a third residual module, a fourth residual module, adaptive average pooling and a fully connected layer; the fully connected layer after the adaptive average pooling of ResNet50 is removed, and the remaining part is the base network; a multi-layer perceptron is then connected after the base network; the multi-layer perceptron has a two-layer fully connected structure, with an input dimension of 2048 and an output dimension of 1000 for the first layer, and an input dimension of 1000 and an output dimension of 1000 for the second layer; a rectified linear activation layer σ(x) = max(x, 0) is placed in between; the base network loads pre-trained weight parameters obtained by training ResNet50 on ImageNet or a large-scale image classification dataset of equal scale; the pre-training weights of common classical networks can be obtained directly from general model repositories such as torchvision and other model zoos, without independent training;
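For reference, the following is a minimal PyTorch sketch of such a base network with the projection head described above; it assumes a recent torchvision ResNet-50 implementation, and the class and variable names are illustrative rather than a verbatim reproduction of the patented network:

```python
import torch.nn as nn
from torchvision import models

class FeatureExtractor(nn.Module):
    """ResNet-50 with its final fully connected layer removed, followed by a
    two-layer perceptron (2048 -> 1000 -> 1000) with a ReLU in between."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # keep everything up to and including adaptive average pooling; drop the fc layer
        self.base = nn.Sequential(*list(backbone.children())[:-1])
        self.mlp = nn.Sequential(
            nn.Linear(2048, 1000),
            nn.ReLU(inplace=True),   # sigma(x) = max(x, 0)
            nn.Linear(1000, 1000),
        )

    def forward(self, x):
        feat = self.base(x).flatten(1)   # (B, 2048)
        return self.mlp(feat)            # (B, 1000)
```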
step 2, preparing a data set for training a feature extraction network;
the dataset for training the feature extraction network uses the FLIR ADAS dataset, comprising 8862 thermal infrared and visible light images at a resolution of 512×640 after removing the part of the dataset with different resolutions; 20 image pairs in the dataset are then selected, feature points are manually selected and matched in the infrared and visible light images using the Control Point Selection tool of the MATLAB Image Processing Toolbox, the feature point data of the 20 image pairs are combined, and a correction coordinate map is obtained with the infrared images as reference using the polynomial nonlinear transformation method; because only coarse registration is required, the correction coordinate map can replace the mapping matrices of all image pairs in the dataset; owing to the limitation of GPU memory during training, the images are cut into 256×256 image blocks, giving 37976 image blocks in total for subsequent network training;
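A minimal sketch of the non-overlapping 256×256 cropping is given below; the file paths and naming scheme are illustrative assumptions:

```python
import os
from PIL import Image

def crop_to_blocks(image_path, out_dir, block=256):
    """Cut one (roughly registered) image into non-overlapping block x block patches."""
    img = Image.open(image_path)
    w, h = img.size
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(image_path))[0]
    idx = 0
    for top in range(0, h - block + 1, block):
        for left in range(0, w - block + 1, block):
            img.crop((left, top, left + block, top + block)).save(
                os.path.join(out_dir, f"{stem}_{idx:03d}.png"))
            idx += 1
```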
step 3, training characteristics to extract network parameters;
the training scheme in step 3 is specifically as follows: the number of training epochs is set to 100, and the number of images input to the network each time (i.e. the batch size) is about 128-512, with the upper limit mainly determined by the memory capacity of the computer's graphics processor; the learning rate during training is set to 0.001, which ensures the training speed while avoiding gradient explosion; at the 50th and 75th epochs the learning rate is reduced to 0.1 of its current value, which better approaches the optimal parameter values; the adaptive moment estimation (Adam) algorithm is selected as the network parameter optimizer, whose advantage is that after bias correction each iteration's learning rate lies within a determined range, making the parameters stable; the loss function threshold is set to 0.01, and training of the network can be considered essentially complete when the loss value falls below this threshold;
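The schedule above corresponds, under illustrative assumptions about the data loader and loss function names (train_loader, contrastive_loss), to a PyTorch configuration along the following lines:

```python
import torch

model = FeatureExtractor().cuda()                            # sketch from step 1
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # adaptive moment estimation
# reduce the learning rate to 0.1x of its current value at epochs 50 and 75
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 75], gamma=0.1)

for epoch in range(100):
    for ir_batch, vi_batch in train_loader:                  # batch size roughly 128-512
        loss = contrastive_loss(model(ir_batch.cuda()), model(vi_batch.cuda()))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
    if loss.item() < 0.01:                                   # loss threshold: training essentially done
        break
```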
as shown in fig. 2, during training, data augmentation is first applied separately to the infrared and visible light images, including image rotation, translation, scaling, shear, contrast transformation, random-probability graying and random Gaussian blur; the center of all transformations is the center of the image, each transformation occurs with probability 0.5, and the random transformations are re-drawn for each training iteration, improving the generalization ability of training; the infrared and visible light image sets {I_i^ir} and {I_i^vi} ∈ ℝ^(B×H×W), i = 1, 2, …, N (where B, H and W are the batch size, height and width of the input image set respectively, and N is the total number of batches) are respectively input into the feature extraction network to obtain the output vectors z_i^ir, z_i^vi ∈ ℝ^(B×L) (L is the output vector dimension, 1000 in this embodiment); at this point, infrared and visible light images of the same scene (z_i^ir, z_i^vi) are regarded as positive samples, while infrared and visible light images of different scenes (z_i^ir, z_j^vi), i ≠ j, are regarded as negative samples; the loss value required to train the network parameters can then be obtained by applying a contrastive loss function, which can be expressed as:

L = −Σ_i log [ exp(f_sim(z_i^ir, z_i^vi)) / Σ_j S_{j≠i} exp(f_sim(z_i^ir, z_j^vi)) ]
where S is an indicator function and f_sim(x_1, x_2) is the similarity measure function; the invention uses the cosine similarity measure, which can be expressed as:

f_sim(x_1, x_2) = (x_1 · x_2) / (‖x_1‖ ‖x_2‖)
better weight parameters are obtained by minimizing the loss function of the network through backward gradient propagation;
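A minimal PyTorch sketch of a contrastive loss of this form (positive pairs on the diagonal, the indicator S excluding j = i from the denominator, cosine similarity as f_sim) is given below; it is one plausible reading of the formula above rather than a verbatim reproduction:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_ir, z_vi):
    """z_ir, z_vi: (B, L) output vectors of the feature extraction network.
    The i-th infrared/visible pair is the positive sample; all cross pairings
    with j != i within the batch serve as negative samples."""
    z_ir = F.normalize(z_ir, dim=1)
    z_vi = F.normalize(z_vi, dim=1)
    sim = z_ir @ z_vi.t()                                   # cosine similarities f_sim
    pos = sim.diag()                                        # f_sim(z_i^ir, z_i^vi)
    mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)  # indicator S: j != i
    neg = torch.exp(sim).masked_fill(~mask, 0.0).sum(dim=1)              # sum over negatives only
    return -(pos - torch.log(neg)).mean()
```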
step 4, constructing a fine-grained feature description network model;
as shown in fig. 3, the fine-grained feature description network model consists of three parts: feature initialization, feature extraction and feature description; feature initialization consists of a first convolution layer, a first batch normalization layer, a second convolution layer and a second batch normalization layer; the first convolution layer has a kernel size of 7×7, stride 2, padding 1 and no bias, and the second convolution layer has a kernel size of 3×3, stride 1, padding 1 and no bias parameter; the feature extraction module consists of 8 residual convolution blocks, each composed of a first convolution layer, a first batch normalization layer, a second convolution layer and a second batch normalization layer, where both convolution layers have a kernel size of 3×3, stride 1, padding 1 and no bias parameter; feature description consists of an average pooling layer and a fully connected layer, where the average pooling layer has a kernel size of 2 and a stride of 2;
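The following PyTorch sketch illustrates a network of this shape; the channel width, descriptor dimension, input channel count and activation placement are illustrative assumptions not specified above:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """conv-BN-conv-BN with a residual connection; 3x3 kernels, stride 1, padding 1, no bias."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        return self.relu(self.bn2(self.conv2(out)) + x)

class FineGrainedDescriptor(nn.Module):
    """Feature initialization -> 8 residual blocks -> average pooling and a fully connected layer."""
    def __init__(self, in_ch=1, ch=64, dim=128):
        super().__init__()
        self.init = nn.Sequential(
            nn.Conv2d(in_ch, ch, 7, stride=2, padding=1, bias=False), nn.BatchNorm2d(ch),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1, bias=False), nn.BatchNorm2d(ch),
        )
        self.blocks = nn.Sequential(*[ResidualBlock(ch) for _ in range(8)])
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.fc = nn.Linear(ch, dim)

    def forward(self, x):                        # x: (B, in_ch, 64, 64) image patches
        f = self.pool(self.blocks(self.init(x)))
        return self.fc(f.mean(dim=(2, 3)))       # spatial average, then the descriptor vector
```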
step 5, preparing a data set for the fine-grained characterization network;
the dataset for the fine-grained feature description network uses the RoadScene dataset, which comprises 200 thermal infrared and visible light images of different resolutions and is already strictly registered; owing to the limitation of GPU memory during training, the images are cut into 64×64 image blocks with a stride of 16, and image block pairs in which either the infrared or the visible light block has too small a variance are screened out and deleted, giving 15000 image blocks in total;
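A minimal sketch of this cropping and variance screening is shown below; the variance threshold value is an illustrative assumption:

```python
def extract_block_pairs(ir, vi, size=64, stride=16, var_threshold=20.0):
    """ir, vi: strictly registered single-channel images as arrays of equal shape.
    Returns 64x64 block pairs sampled with stride 16, keeping only blocks where both
    modalities show enough variance."""
    pairs = []
    h, w = ir.shape[:2]
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            ir_block = ir[top:top + size, left:left + size]
            vi_block = vi[top:top + size, left:left + size]
            if ir_block.var() > var_threshold and vi_block.var() > var_threshold:
                pairs.append((ir_block, vi_block))
    return pairs
```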
step 6, training fine granularity characteristic description network parameters;
the training scheme in step 6 is specifically as follows: the number of training epochs is set to 50, and the number of images input to the network each time (i.e. the batch size) is about 128-512; because the fine-grained feature description network also relies on a large number of positive and negative samples to learn the image structure, a larger batch size yields better performance; the learning rate is set to 0.01 and is reduced to 0.1 of its current value at the 25th and 40th epochs; the adaptive moment estimation (Adam) algorithm is selected as the network parameter optimizer, the loss function threshold is set to 0.01, and training of the network can be considered essentially complete when the loss value falls below this threshold;
as shown in fig. 4, during training, data augmentation is first applied separately to the infrared and visible light images, including image rotation, scaling, shear, contrast transformation and random-probability graying; the center of all transformations is the center of the image, each transformation occurs with probability 0.5, and the random transformations are re-drawn for each training iteration; the choice and calculation of the loss function are consistent with step 3;
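As an illustration, an augmentation pipeline of this kind can be assembled from torchvision transforms; the rotation, scale, shear and contrast magnitudes below are illustrative assumptions, not values specified by the invention:

```python
from torchvision import transforms

# each geometric/photometric transformation is applied with probability 0.5, about the image centre
augment = transforms.Compose([
    transforms.RandomApply([transforms.RandomAffine(degrees=15, scale=(0.8, 1.2), shear=10)], p=0.5),
    transforms.RandomApply([transforms.ColorJitter(contrast=0.3)], p=0.5),
    transforms.RandomGrayscale(p=0.5),
])
```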
step 7, searching and matching the characteristic points of the infrared and visible light images;
the flow is shown in fig. 5; after the feature extraction network and the fine-grained feature description network are trained, their parameters are frozen and the inference stage begins; given the infrared image input I_ir ∈ ℝ^(3×H×W) (the single-channel infrared image is duplicated to 3 channels) and the visible light image input I_vi ∈ ℝ^(3×H×W), both are passed through the feature extraction network, the output of residual module IV of the feature extraction network is extracted, and the downsampled infrared and visible light feature maps f_ir ∈ ℝ^(1024×(H/16)×(W/16)) and f_vi ∈ ℝ^(1024×(H/16)×(W/16)) are obtained; the feature maps are reshaped into feature vector form to obtain f'_ir ∈ ℝ^((HW/256)×1024) and f'_vi ∈ ℝ^((HW/256)×1024); after each feature vector in f'_ir and f'_vi is normalized, the inner products between the feature vectors are computed as the similarity measure sim ∈ ℝ^((HW/256)×(HW/256)); the process can be expressed as:

sim = f'_ir · (f'_vi)^T
sim is normalized by softmax along the infrared and visible light directions respectively, and the two results are multiplied to obtain the similarity confidence matrix conf ∈ ℝ^((HW/256)×(HW/256)); the process can be expressed as:

conf = softmax_dim=0(sim) · softmax_dim=1(sim)
in this embodiment, the infrared image is taken as the reference and similar features between the visible light image and the infrared image are searched; the confidence matrix is maximized along the infrared direction to obtain the confidence conf_ir ∈ ℝ^(HW/256) and its index values arg_ir; the process can be expressed as:

conf_ir = max_dim=0(conf)

arg_ir = argmax_dim=0(conf)
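The computation above can be sketched in PyTorch as follows, with the row/column convention (infrared along dim 0) chosen to mirror the formulas; the function and variable names are illustrative:

```python
import torch
import torch.nn.functional as F

def coarse_match(f_ir, f_vi):
    """f_ir, f_vi: (HW/256, 1024) feature vectors unfolded from the residual-module-IV feature maps.
    Returns the confidence of the best infrared match and its index, following the formulas above."""
    sim = F.normalize(f_ir, dim=1) @ F.normalize(f_vi, dim=1).t()     # similarity measure sim
    conf = torch.softmax(sim, dim=0) * torch.softmax(sim, dim=1)      # confidence matrix conf
    conf_ir, arg_ir = conf.max(dim=0)                                 # max / argmax along dim 0
    return conf_ir, arg_ir
```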
in this embodiment a confidence threshold of 0.3 is selected to screen candidate matching points; the five pairs of matching points with the highest confidence are then taken, their average offsets in the x and y directions are calculated, an offset threshold of 10 is set, and matching points whose offsets are close to those of the five most confident pairs are screened and regarded as valid matches;
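A sketch of this screening step is given below; the coordinate arrays and their orientation are illustrative assumptions:

```python
import torch

def screen_matches(conf, ir_pts, vi_pts, conf_thresh=0.3, offset_thresh=10, top_k=5):
    """conf: (N,) confidences of candidate pairs; ir_pts, vi_pts: (N, 2) matched (x, y) coordinates.
    Keeps pairs above the confidence threshold whose offset stays close to the mean offset of
    the top-k most confident pairs (direction-consistency check)."""
    keep = conf > conf_thresh
    offsets = (vi_pts - ir_pts).float()
    top = torch.topk(conf, k=min(top_k, conf.numel())).indices
    mean_offset = offsets[top].mean(dim=0)
    close = (offsets - mean_offset).abs().max(dim=1).values < offset_thresh
    return keep & close
```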
after coarse-grained matching is completed, each pair of matching points represents a 16×16 image area in the source images, and further refined matching is required; taking each pixel in the corresponding 16×16 source-image area as a center, a 64×64 region of the original image is cropped and feature vectors are generated in a sliding manner with a stride of 2, yielding 8×8 feature vectors for the area in total; similarly, by computing the similarity confidence matrix with the infrared image as reference, the features in the corresponding region of the visible light image are searched and matched, a confidence threshold of 0.5 is set to screen out the valid matching points, and the final accurate feature matching is completed;
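Finally, the fine-grained refinement can be sketched as follows; image-boundary handling is omitted and the descriptor network interface is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def fine_match(ir_img, vi_img, ir_center, vi_center, descriptor_net, conf_thresh=0.5):
    """For one coarse match, describe the 8x8 sliding positions (stride 2) inside the two
    matched 16x16 areas using 64x64 patches centred on each position, then match them."""
    def descriptors(img, cx, cy):
        patches = []
        for dy in range(-8, 8, 2):                     # 8 positions per axis in the 16x16 area
            for dx in range(-8, 8, 2):
                y, x = cy + dy, cx + dx
                patches.append(img[..., y - 32:y + 32, x - 32:x + 32])   # 64x64 region
        return F.normalize(descriptor_net(torch.stack(patches)), dim=1)  # (64, D)

    d_ir = descriptors(ir_img, *ir_center)
    d_vi = descriptors(vi_img, *vi_center)
    sim = d_ir @ d_vi.t()
    conf = torch.softmax(sim, dim=0) * torch.softmax(sim, dim=1)
    best_conf, best_idx = conf.max(dim=1)              # best visible position for each infrared one
    return best_conf > conf_thresh, best_idx
```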
the implementation of operations such as convolution, concatenation and up/down-sampling uses algorithms well known to those skilled in the art, and the specific flow and method can be found in corresponding textbooks or technical literature.
By constructing the dual-band image feature point searching and matching method described above, more robust and denser feature descriptors can be obtained, and similarity is calculated and matched directly from the feature descriptors; the feasibility and superiority of the method are further verified by computing the relevant indexes of the images obtained by the existing methods; a comparison of the relevant indexes between the prior art and the proposed method of the present invention is shown in fig. 6;
the processor may be a general-purpose processor, such as a Central Processing Unit (CPU), digital Signal Processor (DSP), graphics Processor (GPU), application Specific Integrated Circuit (ASIC), field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present application; the general purpose processor may be a microprocessor or any conventional processor or the like; the steps of the method disclosed in connection with the embodiments of the present application may be directly embodied as performed by a hardware processor, or may be performed by a combination of hardware and software modules in a processor;
the memory is used as a nonvolatile computer readable storage medium for storing nonvolatile software programs, nonvolatile computer executable programs and modules; the memory may include at least one type of storage medium, which may include, for example, random Access Memory (RAM), static Random Access Memory (SRAM), charged erasable programmable read-only memory (EEPROM), magnetic memory, optical disk, and the like; memory is any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such; the memory in the embodiments of the present application may also be a circuit or any other device capable of implementing a storage function, and is configured to store program instructions or data;
the communication interface may be used for data transmission between the computing device and other computing devices, terminals or imaging devices, and may employ a general-purpose protocol, such as Universal Serial Bus (USB), synchronous/asynchronous serial receiver/transmitter (USART), controller Area Network (CAN), etc.; the communication interface can be an interface for transferring data between different devices and a communication protocol thereof, but is not limited thereto; the communication interface in the embodiment of the present application may also be optical communication or any other manner or protocol capable of implementing information transmission;
the present invention also provides a computer readable storage medium for searching and matching feature points of a dual-band image, where the computer readable storage medium may be a computer readable storage medium included in the apparatus in the above embodiment; or may be a computer-readable storage medium, alone, that is not assembled into a device; the computer-readable storage medium stores one or more programs for use by one or more processors to perform the methods described herein;
it should be noted that while the electronic device shown in fig. 7 shows only a memory, a processor, and a communication interface, in a particular implementation, those skilled in the art will appreciate that the apparatus also includes other devices necessary to achieve proper operation; meanwhile, as will be appreciated by those skilled in the art, the apparatus may further include components for implementing other additional functions according to specific needs; furthermore, it will be appreciated by those skilled in the art that the apparatus may also include only the devices necessary to implement the embodiments of the present invention, and not necessarily all of the devices shown in FIG. 7.
Finally, it should be noted that the foregoing description is only a preferred embodiment of the present invention, and the present invention is not limited thereto; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their technical features. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A method for searching and matching feature points of a dual-band image, characterized by comprising the following steps:
step 1, constructing a feature extraction network model: constructing a characteristic extraction network model according to the calculation performance and the storage capacity of a training computer, pre-training a basic model, and adding a nonlinear multi-layer perceptron at the output end of the basic model;
step 2, preparing a data set for training a feature extraction network: the data set comprises infrared and visible light image pairs under the same scene, firstly, coarse registration is carried out on the data set, and then the images are cut into image block pairs with uniform specification so as to facilitate subsequent network training;
step 3, training the characteristic extraction network parameter: respectively carrying out data augmentation on the infrared and visible light images in the data set prepared in the step 2, then inputting training data into the network model constructed in the step 1 for training, and minimizing a loss function;
step 4, constructing a fine-grained feature description network model: the fine-grained feature description network model consists of three parts: feature initialization, feature extraction and feature description;
step 5, preparing a data set for the fine-grained characterization network: this step is substantially identical to step 2, but the dataset requires accurate registration;
step 6, training fine granularity characteristic description network parameters: the step is basically consistent with the step 3, but image displacement is not included in the data set augmentation, so that strict alignment between infrared and visible light images is ensured;
step 7, searching and matching the infrared and visible light image feature points: firstly, the infrared and visible light images are respectively input into the feature extraction network to obtain the feature map of a designated layer, and the feature map is unfolded into feature vectors; similarity measurement is performed on the feature vectors of the infrared and visible light images, and coarse-grained matching of the features is carried out according to similarity and direction consistency; for each pair of matching points, an image area centered on the matching point is cropped, the fine-grained feature description network processes the area in a sliding manner to obtain the corresponding set of feature descriptors, similarity comparison is performed for fine-grained matching, and the feature searching and matching of the infrared and visible light images is finally completed.
2. The method for searching and matching feature points of a dual-band image according to claim 1, characterized in that: the feature extraction network model in step 1 can be flexibly selected according to the training computing equipment; ResNet and its variants among convolutional neural networks are recommended, with a network depth of at least 34 layers; when training data are very sufficient, the standard ViT or the Shifted Window Transformer (Swin Transformer) among Vision Transformers can also be selected to obtain better feature extraction capability.
3. The method for searching and matching feature points of a dual-band image according to claim 1, characterized in that: the multi-layer perceptron added at the output end of the base network in step 1 has a two-layer structure, with an activation function added in the middle to provide nonlinear mapping capability.
4. The method for searching and matching feature points of a dual-band image according to claim 1, characterized in that: the pre-training weights of the feature extraction network model in step 1 need to be obtained by training on ImageNet or an image recognition dataset of the same scale.
5. The method for searching and matching feature points of a dual-band image according to claim 1, characterized in that: the data set in step 2 uses the FLIR ADAS data set; a small number of picture pairs in the data set are selected, feature points are manually annotated, selected and matched, and a correction coordinate map is obtained by calculation; the correction coordinate map is applied to the whole data set to obtain coarsely registered image pairs.
6. The method for searching and matching feature points of a dual-band image according to claim 1, characterized in that: the data augmentation in step 3 includes image rotation, translation, scaling, shear, contrast transformation, random-probability graying and random Gaussian blur, and the center of all transformations is the center of the image.
7. The method for searching and matching feature points of a dual-band image according to claim 1, characterized in that: the loss function in step 3 adopts a contrastive loss; specifically, infrared and visible light images of the same scene are regarded as positive samples and infrared and visible light images of different scenes as negative samples, the inner products between the vectors output by the feature extraction network for the different samples are calculated, and the network parameters are optimized by computing the contrastive loss over these inner products.
8. The method for searching and matching feature points of a dual-band image according to claim 1, characterized in that: the feature initialization of the fine-grained feature description network model in step 4 consists of a first convolution layer, a first batch normalization layer, a second convolution layer and a second batch normalization layer; the feature extraction module can be composed of a plurality of residual convolution blocks, each consisting of a first convolution layer, a first batch normalization layer, a second convolution layer, a second batch normalization layer and a residual connection; the feature description consists of an average pooling layer and a multi-layer perceptron.
9. The method for searching and matching feature points of a dual-band image according to claim 1, characterized in that: the data augmentation in step 6 includes rotation, scaling and shear with the image center as the transformation center.
10. The method for searching and matching feature points of a dual-band image according to claim 1, characterized in that: the output value of the similarity measure function in step 7 should be positively correlated with the similarity of the input image pair, to match the characteristics of the contrastive loss function.
CN202310106850.9A 2023-02-14 2023-02-14 Dual-band image feature point searching and matching method Active CN116363382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310106850.9A CN116363382B (en) 2023-02-14 2023-02-14 Dual-band image feature point searching and matching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310106850.9A CN116363382B (en) 2023-02-14 2023-02-14 Dual-band image feature point searching and matching method

Publications (2)

Publication Number Publication Date
CN116363382A CN116363382A (en) 2023-06-30
CN116363382B true CN116363382B (en) 2024-02-23

Family

ID=86907450

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310106850.9A Active CN116363382B (en) 2023-02-14 2023-02-14 Dual-band image feature point searching and matching method

Country Status (1)

Country Link
CN (1) CN116363382B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109064502A (en) * 2018-07-11 2018-12-21 西北工业大学 The multi-source image method for registering combined based on deep learning and artificial design features
CN114220124A (en) * 2021-12-16 2022-03-22 华南农业大学 Near-infrared-visible light cross-modal double-flow pedestrian re-identification method and system
CN114529593A (en) * 2022-01-12 2022-05-24 西安电子科技大学 Infrared and visible light image registration method, system, equipment and image processing terminal


Also Published As

Publication number Publication date
CN116363382A (en) 2023-06-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant