CN114612698A - Infrared and visible light image registration method and system based on hierarchical matching - Google Patents


Publication number
CN114612698A
Authority
CN
China
Prior art keywords
image
visible light
infrared
light image
matching
Prior art date
Legal status
Granted
Application number
CN202210191584.XA
Other languages
Chinese (zh)
Other versions
CN114612698B (en)
Inventor
林颖
刘萌
白德盟
郑文杰
李�杰
杨祎
李程启
师伟
刘晓东
宋扬
乔木
李龙龙
张皓
李壮壮
张峰达
Current Assignee
State Grid Corp of China SGCC
Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by State Grid Corp of China SGCC and Electric Power Research Institute of State Grid Shandong Electric Power Co Ltd
Priority to CN202210191584.XA
Publication of CN114612698A
Application granted
Publication of CN114612698B
Status: Active

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F18/00: Pattern recognition
            • G06F18/20: Analysing
              • G06F18/22: Matching criteria, e.g. proximity measures
              • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
              • G06F18/23: Clustering techniques
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00: Computing arrangements based on biological models
            • G06N3/02: Neural networks
              • G06N3/04: Architecture, e.g. interconnection topology
                • G06N3/045: Combinations of networks
              • G06N3/08: Learning methods
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
          • Y02T10/00: Road transport of goods or passengers
            • Y02T10/10: Internal combustion engine [ICE] based vehicles
              • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of image processing, and provides an infrared and visible light image registration method and system based on hierarchical matching. The method comprises: acquiring an infrared image and a visible light image; pre-screening pixels of the visible light image based on local aggregated features to obtain a pixel-pre-screened visible light image; extracting and matching feature points between the infrared image and the pixel-pre-screened visible light image; obtaining the transformation parameters from the infrared image to the visible light image with a progressive consistent sampling algorithm, according to the pixel coordinates of the matched feature point pairs in the two images; and transforming the coordinates of the infrared image into the visible light image coordinate system according to the transformation parameters, realizing hierarchical registration of the infrared and visible light images.

Description

Infrared and visible light image registration method and system based on hierarchical matching
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to an infrared and visible light image registration method and system based on hierarchical matching.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Because the infrared lens and the visible light lens differ in position, focal length, distortion parameters and so on, deformations such as offset and scaling inevitably exist between the infrared and visible light images collected by the same thermal infrared imager. Fusion analysis of an infrared image and a visible light image therefore first requires accurate registration of the two modalities, while also overcoming interference from certain objects in the visible light image; the registration process generally includes pixel screening, corner extraction, feature computation, and feature matching.
In traditional methods that perform corner extraction, feature computation and feature matching directly on the original image, the Scale-Invariant Feature Transform (SIFT) is often not robust enough to cross-modal changes such as those between infrared and visible light, and feature matching with the k-d tree nearest-neighbor query algorithm (Best Bin First, BBF) ignores the context information of neighboring feature points and easily produces matching errors. Compared with such methods, deep-learning-based methods for feature point extraction and description, such as SuperPoint, and methods based on deep graph convolutional networks, such as SuperGlue, tend to produce denser matching point pairs and can generate more stable matches to a certain extent. However, the inventors found that when feature point extraction and matching are performed directly on the infrared and visible light images, feature points or matching pairs are often extracted from clouds in the sky or other components irrelevant to registration, producing many invalid matches that interfere with subsequent registration and fusion.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides an infrared and visible light image registration method and system based on hierarchical matching, which achieve more accurate registration of infrared and visible light images than traditional methods, and which obtain accurate registration results for different images shot from different angles.
In order to achieve the purpose, the invention adopts the following technical scheme:
the first aspect of the invention provides an infrared and visible light image registration method based on hierarchical matching, which comprises the following steps:
acquiring an infrared image and a visible light image;
pre-screening pixels of the visible light image based on the local aggregation characteristics to obtain the visible light image after the pixel pre-screening;
extracting and matching characteristic points of the infrared image and the visible light image subjected to pixel pre-screening;
according to the pixel coordinates of the matched feature point pairs in the infrared image and the visible light image, a progressive consistent sampling algorithm is utilized to obtain conversion parameters from the infrared image to the visible light image;
and transforming the coordinates of the infrared image into a visible light image coordinate system according to the transformation parameters to realize hierarchical registration of the infrared image and the visible light image.
A second aspect of the present invention provides a hierarchy matching based infrared and visible image registration system, comprising:
the image acquisition module is used for acquiring an infrared image and a visible light image;
the pixel pre-screening module is used for pre-screening pixels of the visible light image based on the local aggregation characteristics to obtain the visible light image after the pixel pre-screening;
the characteristic point extracting and matching module is used for extracting and matching characteristic points of the infrared image and the visible light image subjected to pixel pre-screening;
the conversion parameter calculation module is used for obtaining conversion parameters from the infrared image to the visible light image by utilizing a progressive consistent sampling algorithm according to the pixel coordinates of the matched feature point pairs in the infrared image and the visible light image;
and the level registration module is used for transforming the coordinates of the infrared image into a visible light image coordinate system according to the transformation parameters to realize level registration of the infrared image and the visible light image.
A third aspect of the present invention provides a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the infrared and visible light image registration method based on hierarchical matching as described above.
A fourth aspect of the present invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the infrared and visible image registration method based on hierarchical matching as described above when executing the program.
Compared with the prior art, the invention has the beneficial effects that:
(1) On the basis of pixel pre-screening with local aggregated features, feature point extraction with a self-supervised learning network, feature point matching with a deep graph convolutional network, and transformation parameter estimation with a progressive consistent sampling algorithm, the invention realizes hierarchical-matching-based registration of infrared and visible light images, producing more accurate registration results and more effective matching between the infrared and visible light images.
(2) Compared with traditional image registration methods, the deep-learning-based local aggregated feature NetVLAD filters out unreliable feature-point selection regions, the deeply self-supervised SuperPoint extracts more reliable feature points and descriptors, and the SuperGlue feature matching method based on a deep graph convolutional network makes effective use of feature point positions and context information to achieve more effective matching; on this basis, the adopted progressive sample consensus method yields more accurate transformation parameter estimates, so that accurate registration results can be obtained for images shot from different angles.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention without limiting it.
FIG. 1 is a flowchart of a method for infrared and visible image registration based on hierarchical matching according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating NetVLAD pixel prescreening according to an embodiment of the invention;
FIG. 3 is a diagram showing the effect of the visible light image after pixel screening by NetVLAD according to the embodiment of the present invention;
fig. 4 is a schematic diagram of the SuperPoint feature extraction and description network according to an embodiment of the present invention;
FIG. 5 is an effect diagram of the infrared image extracted through the SuperPoint feature points in the embodiment of the invention;
fig. 6 is a diagram illustrating an effect of extracting a SuperPoint feature point of a visible light image after pixel pre-screening according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a SuperGlue feature matching network according to an embodiment of the present invention;
FIG. 8 is an effect diagram of matching SuperGlue feature points of an infrared image and a visible light image according to an embodiment of the present invention;
FIG. 9 is a final fused image effect diagram according to an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise; it should further be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
Example one
As shown in fig. 1, the present embodiment provides a method for registering infrared and visible light images based on hierarchical matching, which specifically includes the following steps:
s101: an infrared image and a visible light image are acquired.
The infrared image can be acquired with an infrared image acquisition device, and the visible light image with a visible light image acquisition device.
S102: and pre-screening pixels of the visible light image based on the local aggregation characteristics to obtain the visible light image after the pixel pre-screening.
In this embodiment, the deep-learning-based local aggregated feature method NetVLAD is used for pixel pre-screening. Specifically, the utility of each cluster of the local aggregated features is calculated based on a convolutional neural network and a local aggregated feature network, and the pixels are pre-screened accordingly; here the utility of a cluster is the L2 distance, on that cluster, between the corresponding local feature of a query image and that of its matched negative sample.
The visible light image pixels are hard-assigned to the clusters of the local aggregated features;
pixels aggregated to high-utility clusters are then kept as the pre-screened pixels from which feature points will be extracted.
In view of the fact that existing methods, which extract and match feature points directly on the original image, easily extract hard-to-match feature points in unstable regions such as clouds in the sky or objects with periodically varying texture such as sleeves, this embodiment pre-screens the pixels first.
As shown in fig. 2, the CNN feature extraction module in this example consists of a VGG16 network with the classification layer removed, and the NetVLAD part contains 16 cluster centers and their soft-assignment weight parameters. For an input image of size W×H, the CNN feature extraction module produces a feature map of size (W/16)×(H/16)×512; NetVLAD soft-assigns these 512-dimensional features onto the 16 clusters and aggregates them as residuals, yielding a 16×512-dimensional NetVLAD output feature:

$$V(k) = \sum_{i=1}^{N} \frac{e^{w_k^{\mathrm{T}} x_i + b_k}}{\sum_{k'} e^{w_{k'}^{\mathrm{T}} x_i + b_{k'}}}\,\left(x_i - c_k\right)$$

where N = (W/16)×(H/16) is the number of positions in the feature map, x_i is a 512-dimensional feature column vector of the CNN output feature map, w_k and b_k are the learnable parameters of the k-th cluster, and c_k is the center of the k-th cluster. Before training begins, the cluster centers c_k must be initialized: all images are passed through the CNN feature extraction module to obtain initial features, K-means clustering is run on all these initial features, and the 16 cluster centers are computed.
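As a concrete illustration of the aggregation above, the following is a minimal NumPy sketch of the NetVLAD residual aggregation; the function and variable names are illustrative assumptions, and the usual intra- and L2-normalization layers are omitted for brevity.

```python
import numpy as np

def netvlad_aggregate(x, w, b, c):
    """NetVLAD residual aggregation (minimal sketch).

    x: (N, 512) feature column vectors from the CNN feature map
    w: (K, 512), b: (K,)  soft-assignment parameters of the K clusters
    c: (K, 512)           cluster centers
    Returns a (K, 512) matrix V with V[k] = sum_i a_k(x_i) * (x_i - c[k]).
    """
    logits = x @ w.T + b                               # (N, K) assignment scores
    a = np.exp(logits - logits.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)                  # softmax over the K clusters
    # Residual aggregation in two terms: sum_i a_ik x_i  -  (sum_i a_ik) c_k
    return a.T @ x - a.sum(axis=0)[:, None] * c

# Shapes as in the embodiment: 16 clusters, 512-dimensional features.
rng = np.random.default_rng(0)
V = netvlad_aggregate(rng.normal(size=(40 * 30, 512)),
                      rng.normal(size=(16, 512)), rng.normal(size=16),
                      rng.normal(size=(16, 512)))
assert V.shape == (16, 512)
```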
The CNN module and the NetVLAD module of this example are trained on an image matching training set, so that the features of images taken at slightly different angles of the same location are pulled closer together, while the features of images of different locations are pushed apart. After training, the utility of each cluster is computed by summing, over the validation set, the L2 distances between a cluster's feature in a query image and the corresponding cluster's features in all of that query's non-matching images. Averaging over all query images gives the utility of every cluster:

$$U_k = \frac{1}{N}\sum_{a=1}^{N}\left\|(V_k)_a - (V_k)_n\right\|_2$$

where U_k is the utility of the k-th cluster, N is the total number of query images, (V_k)_a is the feature of a query image on the k-th cluster, and (V_k)_n is the feature, on the k-th cluster, of the negative sample matched to that query image.
The visible light image is then fed into the trained CNN network to extract its feature map, which is hard-assigned over the trained NetVLAD clusters: each point on the feature map is mapped to its single nearest cluster, and each such point corresponds to a pixel region of the original input image. High-utility clusters are selected, the corresponding pixel regions of the input visible light image are kept, and those pixels serve as the pre-screened pixels for feature point extraction and matching in the subsequent steps. In this example, the effect after pixel screening by the local aggregated feature network is shown in fig. 3.
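The following non-authoritative sketch shows how the cluster utility and the hard-assignment pre-screening mask could be computed; the keep_ratio threshold and the stride-16 upsampling back to pixel resolution are assumptions made for illustration.

```python
import numpy as np

def cluster_utility(Vq, Vn):
    """Vq, Vn: (N_images, K, D) per-cluster features of query images and of
    their matched negative samples. Returns (K,) mean L2 distances (utility)."""
    return np.linalg.norm(Vq - Vn, axis=2).mean(axis=0)

def prescreen_mask(feat_map, centers, utility, keep_ratio=0.5, stride=16):
    """feat_map: (h, w, D) CNN feature map; centers: (K, D) cluster centers.
    Keeps pixels whose feature-map cell hard-assigns to a high-utility cluster."""
    d2 = ((feat_map[:, :, None, :] - centers) ** 2).sum(axis=-1)   # (h, w, K)
    assign = d2.argmin(axis=-1)                      # nearest cluster per cell
    good = utility >= np.quantile(utility, 1.0 - keep_ratio)
    cell_mask = good[assign].astype(np.uint8)        # (h, w) binary mask
    # Upsample each feature-map cell back to its stride x stride pixel region.
    return np.kron(cell_mask, np.ones((stride, stride), np.uint8)).astype(bool)
```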
S103: and extracting and matching characteristic points of the infrared image and the visible light image subjected to pixel pre-screening.
In the specific implementation process, a self-supervised learning network (the SuperPoint method) is used to extract feature points from both the infrared image and the pixel-pre-screened visible light image.
The learning process of the self-supervision learning network is as follows:
constructing an image set containing definite corner information for training a primary corner extraction network;
carrying out homography transformation on the original image at random, adding noise, and obtaining the positions of feature points in the transformed image by using a trained primary network;
and supervising the self-supervised learning network with the transformed images and their corner information, so that the learned network is able to extract feature points from both infrared and visible light images.
In view of the lack of robustness to modal changes of conventional point feature extraction and description methods such as SIFT and SURF (Speeded-Up Robust Features), this embodiment adopts the deep-learning-based SuperPoint method to extract and describe feature points of the images of the two different modalities, infrared and visible light.
As shown in fig. 4, the SuperPoint feature extraction and description network in this embodiment consists of one encoder and two decoders. The encoder is a VGGNet-like convolutional network containing several convolutional and pooling layers, and encodes the W×H input image. The two decoders handle feature point extraction and feature point description respectively: the feature point extraction decoder consists of a convolution layer, a Softmax layer and a Reshape layer, and outputs a W×H×1 map in which the value of each pixel is the probability that the pixel is a feature point; the feature description decoder consists of a convolution layer, a bilinear interpolation layer and an L2 normalization layer, and outputs a W×H×D feature map in which each pixel corresponds to a D-dimensional feature vector.
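A schematic PyTorch sketch of the two decoder heads is given below, assuming the usual 8x-downsampled encoder grid of SuperPoint-style networks; the channel widths and the cell size are illustrative assumptions rather than the exact architecture of this embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuperPointHeads(nn.Module):
    def __init__(self, enc_ch=128, desc_dim=256, cell=8):
        super().__init__()
        self.cell = cell
        # Detector head: cell*cell positions per grid cell + 1 "no keypoint" bin.
        self.det = nn.Conv2d(enc_ch, cell * cell + 1, kernel_size=1)
        # Descriptor head: D-dimensional semi-dense descriptors.
        self.desc = nn.Conv2d(enc_ch, desc_dim, kernel_size=1)

    def forward(self, feat):                           # feat: (B, enc_ch, H/8, W/8)
        prob = self.det(feat).softmax(dim=1)[:, :-1]   # drop the "no keypoint" bin
        b, _, hc, wc = prob.shape
        # Pixel-shuffle the cell*cell bins back to a (B, H, W) probability map.
        heat = prob.reshape(b, self.cell, self.cell, hc, wc) \
                   .permute(0, 3, 1, 4, 2) \
                   .reshape(b, hc * self.cell, wc * self.cell)
        desc = F.normalize(self.desc(feat), p=2, dim=1)  # L2-normalized descriptors
        return heat, desc

heat, desc = SuperPointHeads()(torch.randn(1, 128, 30, 40))
assert heat.shape == (1, 240, 320) and desc.shape == (1, 256, 30, 40)
```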
Because ground-truth feature points are difficult to label manually, this embodiment designs a self-supervised learning strategy that requires no ground truth. Specifically: first, a synthetic image set containing simple shapes such as triangles, quadrangles, cubes and checkerboards, all with unambiguous corner information, is constructed, and an initial corner extraction network is trained on it; then, real images are subjected to various random homography transformations with added noise; the trained initial corner network extracts feature points from these transformed images, and these serve as supervision for learning the SuperPoint network. In this embodiment, the feature point extraction results are shown in figs. 5 and 6.
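The labelling step could look like the following sketch; `base_detector` stands in for the trained primary corner-extraction network, and the corner-jitter range and noise level are assumptions.

```python
import numpy as np
import cv2

def random_homography(h, w, jitter=0.15):
    """Perturb the four image corners by up to `jitter` of the image size."""
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = (src + np.random.uniform(-jitter, jitter, (4, 2)) * [w, h]).astype(np.float32)
    return cv2.getPerspectiveTransform(src, dst)

def pseudo_label(image, base_detector, noise_sigma=5.0):
    """Warp a real image with a random homography, add noise, and let the
    pretrained corner network label the warped image."""
    h, w = image.shape[:2]
    H = random_homography(h, w)
    warped = cv2.warpPerspective(image, H, (w, h))
    noisy = np.clip(warped + np.random.normal(0, noise_sigma, warped.shape), 0, 255)
    return base_detector(noisy.astype(np.uint8)), H   # corner labels + homography
```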
In the specific implementation process, in view of the fact that traditional feature matching methods mostly match each feature point independently and thus easily produce errors, this embodiment adopts the SuperGlue method based on a deep graph neural network, which combines the spatial geometric relationships between feature points with their mutual-correlation information to improve matching accuracy.
As shown in fig. 7, the deep graph convolutional network comprises an attentional graph convolutional network and an optimal matching layer. The attentional graph convolutional network takes as input the feature point positions and description vectors of a pair of images, and outputs feature descriptors after spatial information aggregation; the optimal matching layer takes these descriptors as input and outputs the matching result.
In the attentional graph convolutional network, the feature point positions are first encoded and added to the feature descriptor vectors to obtain the initial representation of each feature point.
For example, the feature point positions are lifted to a higher dimension by a feature point encoder consisting of a multilayer perceptron (MLP), and the resulting high-dimensional vectors are added to the feature descriptor vectors, giving the initial representation of each feature point.
Then a multivariate graph is constructed. Its vertices are all the feature points in the two images, and its edges comprise intra-image edges and cross-image edges: intra-image edges connect feature point pairs within a single image, and cross-image edges connect feature point pairs drawn from the two images. After the graph is constructed, a message passing mechanism alternately aggregates and updates information over the intra-image edges and the cross-image edges at every vertex, yielding an updated description vector for each feature point.
An optimal matching layer is then added, which computes an M×N similarity matrix S from the updated features, where each cell of the matrix represents the similarity between a feature in image A and a feature in image B. Because of occlusion or differing fields of view, a feature point in one image may have no matching feature point in the other image; for this reason, S is expanded to an (M+1)×(N+1) matrix S̄, whose added row and column describe the case of unmatched feature points. The feature point matching problem is thus converted into an optimal transport problem, which can be solved with the Sinkhorn algorithm; and since the Sinkhorn algorithm is differentiable, it can be implemented as a network layer.
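The dustbin expansion and the Sinkhorn normalisation could be sketched as follows; the dustbin score `alpha` and the fixed iteration count are simplifying assumptions (the full method works in the log domain with non-uniform marginals).

```python
import numpy as np

def sinkhorn_match(S, alpha=1.0, n_iters=50):
    """S: (M, N) similarity matrix. Returns an (M+1, N+1) soft assignment
    whose extra row/column absorb unmatched feature points."""
    M, N = S.shape
    S_bar = np.full((M + 1, N + 1), alpha)       # dustbin row and column
    S_bar[:M, :N] = S
    P = np.exp(S_bar)
    for _ in range(n_iters):                     # alternating normalisation
        P /= P.sum(axis=1, keepdims=True)        # rows sum to 1
        P /= P.sum(axis=0, keepdims=True)        # columns sum to 1
    return P

P = sinkhorn_match(np.random.rand(5, 7))
match = P[:5, :7].argmax(axis=1)                 # tentative match per row
```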
S104: and obtaining the conversion parameters from the infrared image to the visible light image by utilizing a progressive consistent sampling algorithm according to the pixel coordinates of the matched characteristic point pairs in the infrared image and the visible light image.
After the matched feature point pairs in the infrared image and the visible light image are obtained, the transformation parameters between the images are computed from the pixel coordinates of these pairs: all matching point pairs are scored for quality to obtain Q values and sorted in descending order of Q, and in each iteration random samples are drawn according to this descending order for model hypothesis and verification.
In this embodiment, the transformation parameters from the infrared image to the visible light image are estimated with the progressive sample consensus (PROSAC) algorithm from the pixel coordinates of the matched feature points in the two images.
Specifically, after the matched feature point pairs in the infrared image and the visible light image are obtained, this embodiment estimates the transformation parameters between the images from the pixel coordinates of those pairs. Because the offset between the infrared lens and the visible light lens is very small relative to the distance of the photographed object, the two can be approximated as sharing a common optical center, and the infrared image is transformed into the coordinate system of the visible light image with a homography matrix H. In homogeneous coordinates, the pixel coordinate transformation between the two images can be expressed as:

$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} \sim H \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}, \qquad H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}$$

where h_33 = 1 in the homography matrix H. The transformation matrix therefore has 8 degrees of freedom and can be estimated from four or more pairs of feature points.
Because there is matching noise and even outliers from erroneous matches, the least squares method, RANSAC and the like are commonly used for parameter estimation. However, RANSAC treats every feature point pair equally and samples at random from the entire set of pairs, which leads to randomness in the estimation results and slow convergence. For this reason, this embodiment uses the PROSAC algorithm to estimate the transformation parameters.
In this embodiment, the PROSAC algorithm uses a semi-random design: all matching point pairs are scored for quality to obtain Q values and sorted in descending order, and in each iteration the high-quality pairs are preferentially sampled at random for model hypothesis and verification. This reduces algorithmic complexity, improves efficiency, and avoids the lack of convergence guarantees of the fully random RANSAC algorithm.
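A hedged sketch of such prioritised sampling is given below; the pool-growth schedule, the inlier threshold, and the use of cv2.getPerspectiveTransform as the minimal four-point solver are all illustrative assumptions, not the patent's exact procedure. At least four matches are assumed.

```python
import numpy as np
import cv2

def prosac_homography(src, dst, quality, n_iters=2000, thresh=3.0):
    """src, dst: (N, 2) matched pixel coordinates; quality: (N,) Q values.
    Returns the best 3x3 homography and its boolean inlier mask."""
    order = np.argsort(-quality)                 # indices in descending Q order
    best_H, best_inl = None, np.zeros(len(src), dtype=bool)
    for it in range(n_iters):
        pool = min(len(src), 4 + it // 50)       # slowly growing sample pool
        idx = np.random.choice(order[:pool], 4, replace=False)
        try:
            H = cv2.getPerspectiveTransform(src[idx].astype(np.float32),
                                            dst[idx].astype(np.float32))
        except cv2.error:
            continue                             # degenerate (collinear) sample
        proj = cv2.perspectiveTransform(src[None].astype(np.float32), H)[0]
        inl = np.linalg.norm(proj - dst, axis=1) < thresh
        if inl.sum() > best_inl.sum():           # keep the hypothesis with most inliers
            best_H, best_inl = H, inl
    return best_H, best_inl
```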
S105: and transforming the coordinates of the infrared image into a visible light image coordinate system according to the transformation parameters to realize hierarchical registration of the infrared image and the visible light image. Wherein, fig. 8 is an effect diagram of the infrared image and the visible light image after matching the SuperGlue feature points.
The coordinates of the infrared image are transformed into the coordinate system of the visible light image according to the transformation parameters; specifically, after the transformation model is computed, the infrared image coordinates are mapped into the visible light coordinate system, and the pixels at corresponding positions are averaged to obtain the final fusion image, as shown in fig. 9.
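As a final illustration, a minimal sketch of the warp-and-average step, assuming both images have already been converted to the same channel count and data type; the function and file names are placeholders.

```python
import cv2

def register_and_fuse(infrared, visible, H):
    """Warp the infrared image into the visible-light coordinate system with
    the estimated homography H and average corresponding pixels."""
    h, w = visible.shape[:2]
    warped_ir = cv2.warpPerspective(infrared, H, (w, h))
    return cv2.addWeighted(warped_ir, 0.5, visible, 0.5, 0)  # pixel-wise mean

# Example usage (paths are illustrative):
# fused = register_and_fuse(cv2.imread("ir.png"), cv2.imread("vis.png"), H)
```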
In view of the differences between the infrared and visible light modalities, this embodiment combines deep-learning-based NetVLAD local aggregated features for pixel pre-screening of the visible light image, deeply self-supervised SuperPoint feature point extraction and descriptor computation, SuperGlue feature matching based on a deep graph convolutional network, and a progressive consistent sampling algorithm to match the infrared and visible light images. This achieves more accurate registration than traditional methods, and accurate registration results can be obtained for images of different power transformation equipment shot from different angles.
Example two
The embodiment provides an infrared and visible light image registration system based on hierarchical matching, which specifically comprises the following modules:
the image acquisition module is used for acquiring an infrared image and a visible light image;
the pixel pre-screening module is used for pre-screening pixels of the visible light image based on the local aggregation characteristics to obtain the visible light image after the pixel pre-screening;
the characteristic point extracting and matching module is used for extracting and matching characteristic points of the infrared image and the visible light image subjected to pixel pre-screening;
the conversion parameter calculation module is used for obtaining conversion parameters from the infrared image to the visible light image by utilizing a progressive consistent sampling algorithm according to the pixel coordinates of the matched feature point pairs in the infrared image and the visible light image;
and the level registration module is used for transforming the coordinates of the infrared image into a visible light image coordinate system according to the transformation parameters to realize level registration of the infrared image and the visible light image.
It should be noted that, each module in the present embodiment corresponds to each step in the first embodiment one to one, and the specific implementation process is the same, which is not described herein again.
EXAMPLE III
The present embodiment provides a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the infrared and visible light image registration method based on hierarchical matching as described above.
Example four
The present embodiment provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the infrared and visible image registration method based on hierarchical matching as described above when executing the program.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for registering infrared and visible light images based on hierarchical matching is characterized by comprising the following steps:
acquiring an infrared image and a visible light image;
pre-screening pixels of the visible light image based on the local aggregation characteristics to obtain the visible light image after the pixel pre-screening;
extracting and matching characteristic points of the infrared image and the visible light image subjected to pixel pre-screening;
according to pixel coordinates of the matched feature point pairs in the infrared image and the visible light image, obtaining a transformation parameter from the infrared image to the visible light image by utilizing a progressive consistent sampling algorithm;
and transforming the coordinates of the infrared image into a visible light image coordinate system according to the transformation parameters to realize hierarchical registration of the infrared image and the visible light image.
2. The infrared and visible light image registration method based on hierarchical matching as claimed in claim 1, wherein, in the process of pre-screening the pixels of the visible light image, the utility of the clusters of the local aggregated features is calculated based on a convolutional neural network and a local aggregated feature network, and the pixels are pre-screened accordingly; wherein the utility is the L2 distance, on each cluster, between the corresponding local feature and that of the matched negative sample.
3. The infrared and visible light image registration method based on hierarchical matching as claimed in claim 1, wherein feature point extraction is performed on both the infrared image and the pixel-pre-screened visible light image using a self-supervised learning network.
4. The infrared and visible image registration method based on hierarchical matching according to claim 3, wherein the learning process of the self-supervised learning network is as follows:
constructing an image set containing definite corner information for training a primary corner extraction network;
carrying out homography transformation on the original image at random, adding noise, and obtaining the positions of feature points in the transformed image by using a trained primary network;
and supervising the self-supervised learning network with the transformed images and their corner information, so that the learned self-supervised network is able to extract feature points from both infrared and visible light images.
5. The infrared and visible light image registration method based on hierarchical matching as claimed in claim 1, wherein feature point matching between the infrared image and the pixel-pre-screened visible light image is performed using a deep graph convolutional network.
6. The infrared and visible light image registration method based on hierarchical matching as claimed in claim 5, wherein the deep graph convolutional network comprises an attentional graph convolutional network and an optimal matching layer; the attentional graph convolutional network takes as input the feature point positions and description vectors of a pair of images, and outputs feature descriptors after spatial information aggregation; the optimal matching layer takes the feature descriptors output by the attentional graph convolutional network as input and outputs the matching result.
7. The infrared and visible light image registration method based on hierarchical matching as claimed in claim 6, wherein the optimal matching layer calculates a similarity matrix from the updated features, each cell of the matrix representing the similarity between features in the two images; the matrix is expanded, and the added row and column describe the case of unmatched feature points.
8. An infrared and visible image registration system based on hierarchical matching, comprising:
the image acquisition module is used for acquiring an infrared image and a visible light image;
the pixel pre-screening module is used for pre-screening pixels of the visible light image based on the local aggregation characteristics to obtain the visible light image after the pixel pre-screening;
the characteristic point extraction and matching module is used for extracting and matching characteristic points of the infrared image and the visible light image subjected to pixel pre-screening;
the conversion parameter calculation module is used for obtaining conversion parameters from the infrared image to the visible light image by utilizing a progressive consistent sampling algorithm according to the pixel coordinates of the matched feature point pairs in the infrared image and the visible light image;
and the level registration module is used for transforming the coordinates of the infrared image into a visible light image coordinate system according to the transformation parameters to realize level registration of the infrared image and the visible light image.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the infrared and visible light image registration method based on hierarchical matching according to any one of claims 1 to 7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program performs the steps in the infrared and visible image registration method based on hierarchical matching according to any of claims 1-7.
CN202210191584.XA 2022-02-28 2022-02-28 Infrared and visible light image registration method and system based on hierarchical matching Active CN114612698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210191584.XA CN114612698B (en) 2022-02-28 2022-02-28 Infrared and visible light image registration method and system based on hierarchical matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210191584.XA CN114612698B (en) 2022-02-28 2022-02-28 Infrared and visible light image registration method and system based on hierarchical matching

Publications (2)

Publication Number Publication Date
CN114612698A true CN114612698A (en) 2022-06-10
CN114612698B CN114612698B (en) 2024-09-06

Family

ID=81858735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210191584.XA Active CN114612698B (en) 2022-02-28 2022-02-28 Infrared and visible light image registration method and system based on hierarchical matching

Country Status (1)

Country Link
CN (1) CN114612698B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035168A (en) * 2022-08-15 2022-09-09 南京航空航天大学 Multi-constraint-based photovoltaic panel multi-source image registration method, device and system
CN115965843A (en) * 2023-01-04 2023-04-14 长沙观谱红外科技有限公司 Visible light and infrared image fusion method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170243084A1 (en) * 2015-11-06 2017-08-24 The Regents Of The University Of California Dsp-sift: domain-size pooling for image descriptors for image matching and other applications
CN109522434A (en) * 2018-10-24 2019-03-26 武汉大学 Social image geographic positioning and system based on deep learning image retrieval
CN113674400A (en) * 2021-08-18 2021-11-19 公安部物证鉴定中心 Spectrum three-dimensional reconstruction method and system based on repositioning technology and storage medium
CN114092531A (en) * 2021-10-28 2022-02-25 国网山东省电力公司电力科学研究院 Infrared-visible light image registration method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170243084A1 (en) * 2015-11-06 2017-08-24 The Regents Of The University Of California Dsp-sift: domain-size pooling for image descriptors for image matching and other applications
CN109522434A (en) * 2018-10-24 2019-03-26 武汉大学 Social image geographic positioning and system based on deep learning image retrieval
CN113674400A (en) * 2021-08-18 2021-11-19 公安部物证鉴定中心 Spectrum three-dimensional reconstruction method and system based on repositioning technology and storage medium
CN114092531A (en) * 2021-10-28 2022-02-25 国网山东省电力公司电力科学研究院 Infrared-visible light image registration method and system

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035168A (en) * 2022-08-15 2022-09-09 南京航空航天大学 Multi-constraint-based photovoltaic panel multi-source image registration method, device and system
CN115965843A (en) * 2023-01-04 2023-04-14 长沙观谱红外科技有限公司 Visible light and infrared image fusion method
CN115965843B (en) * 2023-01-04 2023-09-29 长沙观谱红外科技有限公司 Visible light and infrared image fusion method

Also Published As

Publication number Publication date
CN114612698B (en) 2024-09-06

Similar Documents

Publication Publication Date Title
CN114612698B (en) Infrared and visible light image registration method and system based on hierarchical matching
CN106991695A (en) A kind of method for registering images and device
CN103456022A (en) High-resolution remote sensing image feature matching method
CN114092531A (en) Infrared-visible light image registration method and system
CN111028292A (en) Sub-pixel level image matching navigation positioning method
CN117218343A (en) Semantic component attitude estimation method based on deep learning
CN114830131A (en) Equal-surface polyhedron spherical gauge convolution neural network
CN113592923A (en) Batch image registration method based on depth local feature matching
CN115018999A (en) Multi-robot-cooperation dense point cloud map construction method and device
Mei et al. COTReg: Coupled optimal transport based point cloud registration
CN109242854A (en) A kind of image significance detection method based on FLIC super-pixel segmentation
WO2022247126A1 (en) Visual localization method and apparatus, and device, medium and program
CN112734818B (en) Multi-source high-resolution remote sensing image automatic registration method based on residual network and SIFT
Luanyuan et al. MGNet: Learning Correspondences via Multiple Graphs
CN117689702A (en) Point cloud registration method and device based on geometric attention mechanism
GONG et al. Non-segmented Chinese license plate recognition algorithm based on deep neural networks
CN116612385B (en) Remote sensing image multiclass information extraction method and system based on depth high-resolution relation graph convolution
CN109934298B (en) Progressive graph matching method and device of deformation graph based on clustering
An et al. PointTr: Low-Overlap Point Cloud Registration with Transformer
CN114742869B (en) Brain neurosurgery registration method based on pattern recognition and electronic equipment
CN115410014A (en) Self-supervision characteristic point matching method of fisheye image and storage medium thereof
Cui et al. 3D reconstruction with spherical cameras
Shen et al. Transformer with Linear-Window Attention for Feature Matching
CN113326790A (en) Capsule robot drain pipe disease detection method based on abnormal detection thinking
Zeng et al. Comparison between the traditional and deep learning algorithms on image matching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant