CN112561973A - Method and device for training image registration model and electronic equipment


Info

Publication number
CN112561973A
Authority
CN
China
Prior art keywords
image
pair
image block
mask
registration
Prior art date
Legal status
Withdrawn
Application number
CN202011541901.3A
Other languages
Chinese (zh)
Inventor
龙勇志
Current Assignee
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd
Priority to CN202011541901.3A
Publication of CN112561973A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/30 Determination of transform parameters for the alignment of images, i.e. image registration
    • G06T 7/33 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
    • G06T 7/337 Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods involving reference images or patches
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10048 Infrared image
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Abstract

The application discloses a method and a device for training an image registration model and electronic equipment, and belongs to the field of image processing. The method comprises the following steps: acquiring a dataset comprising a registered pair of images; cutting each image pair in the data set to obtain a target image block pair; calculating first transformation matrixes corresponding to two image blocks in the target image block pair; sequentially inputting target image block pairs serving as training data into an initial image registration model to obtain registration image pairs of the target image block pairs and obtain second transformation matrixes corresponding to two image blocks in the registration image pairs, wherein the image registration model is a deep neural network model; and calculating a difference value between the first transformation matrix and the second transformation matrix according to a preset loss function, and updating network parameters of the image registration model according to the calculated difference value until the calculated difference value is smaller than a preset threshold value, so as to obtain the trained image registration model. The method and the device can improve the registration accuracy of the infrared image and the visible light image.

Description

Method and device for training image registration model and electronic equipment
Technical Field
The application belongs to the technical field of image processing, and particularly relates to a method and a device for training an image registration model and electronic equipment.
Background
With the rapid development of sensor imaging technology, imaging from a single sensor can hardly meet daily application requirements, and multi-sensor imaging has driven technical innovation. Image fusion comprehensively processes the image information detected by multiple sensors, so as to achieve a more comprehensive and reliable description of the detected scene.
Infrared and visible light images are the most widely used image types in the field of image processing. An infrared image can efficiently capture scene heat radiation and identify salient targets in a scene, while a visible light image has high resolution and presents detailed scene texture information, so the image information of the two modalities is highly complementary. Therefore, fusing an infrared image with a visible light image can produce a fused image rich in scene information that describes the scene background and targets clearly and accurately.
Image registration refers to the process of matching and aligning two or more images acquired at different times, by different sensors (imaging devices), or under different conditions (weather, illuminance, camera position and angle, etc.). Image registration is an indispensable preprocessing step of an image fusion task and a guarantee of its performance, and the accuracy of registration directly influences the fusion result. However, when the infrared image and the visible light image come from different sensors, the imaging principles of the sensors differ greatly, and the gray values and contrast of the two images also differ greatly. As a result, feature-based image registration algorithms usually find few effective feature points, which causes large error offsets in image registration, in turn producing ghosting or blur in the final fused image and degrading the fusion effect.
Disclosure of Invention
The embodiment of the application aims to provide a method for training an image registration model, which can solve the problem that image registration in the prior art produces large error offsets.
In order to solve the technical problem, the present application is implemented as follows:
in a first aspect, an embodiment of the present application provides a method for training an image registration model, where the method includes:
acquiring a dataset comprising registered image pairs, wherein each image pair comprises a visible light image and an infrared image in the same scene;
cutting each image pair in the data set to obtain a target image block pair, wherein the target image block pair comprises a first image block cut from a visible light image and a second image block cut from an infrared image, and the first image block and the second image block correspond to the same position in the image pair and have a preset random offset;
calculating first transformation matrixes corresponding to two image blocks in the target image block pair;
taking the target image block pair as training data, and sequentially inputting it into an initial image registration model to obtain a registration image pair of the target image block pair and second transformation matrixes corresponding to two image blocks in the registration image pair, wherein the image registration model is a deep neural network model;
and calculating a difference value between the first transformation matrix and the second transformation matrix according to a preset loss function, and updating the network parameters of the image registration model according to the calculated difference value until the calculated difference value is smaller than a preset threshold value, so as to obtain the trained image registration model.
In a second aspect, an embodiment of the present application provides an apparatus for training an image registration model, where the apparatus includes:
a dataset acquisition module for acquiring a dataset comprising registered image pairs, wherein each image pair comprises a visible light image and an infrared image in the same scene;
the image cropping module is used for cropping each image pair in the data set to obtain a target image block pair, wherein the target image block pair comprises a first image block cropped from a visible light image and a second image block cropped from an infrared image, and the first image block and the second image block correspond to the same position in the image pair and have a preset random offset;
the matrix calculation module is used for calculating first transformation matrixes corresponding to two image blocks in the target image block pair;
the data training module is used for taking the target image block pair as training data and sequentially inputting it into an initial image registration model to obtain a registration image pair of the target image block pair and second transformation matrixes corresponding to two image blocks in the registration image pair, wherein the image registration model is a deep neural network model;
and the parameter adjusting module is used for calculating a difference value between the first transformation matrix and the second transformation matrix according to a preset loss function, and updating the network parameters of the image registration model according to the calculated difference value until the calculated difference value is smaller than a preset threshold value, so as to obtain the trained image registration model.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, and when executed by the processor, the program or instructions implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In the embodiment of the application, the registered image pairs collected in advance are cut, and the target image block pairs obtained by cutting are used as training data for training the image registration model. Each image pair can be cut into a plurality of target image block pairs according to a preset size, and the target image block pairs are used as training data, so that each image pair can generate a plurality of training data, the data quantity of the training data can be increased, and the training accuracy is improved. After the training data and the label are obtained, the training data (target image block pairs) are sequentially input into the initial image registration model for training, and the trained image registration model is obtained. The image registration model is obtained through supervised training according to a large amount of accurate training data, and fine registration of the image pair of the infrared image and the visible light image can be realized through the trained image registration model. Compared with an image registration method based on artificial features, the image registration method based on the deep neural network model provided by the application forces the image registration model to learn the image features with high robustness and high consistency in the image pair through given training data with accurate alignment in a network fitting mode, is used for calculating the space transformation between the images, and improves the registration accuracy of the infrared image and the visible light image.
Drawings
FIG. 1 is a flow chart of the steps of one embodiment of a method of training an image registration model of the present application;
FIG. 2 is a schematic flow chart of a pair of cropped target image blocks of the present application;
FIG. 3 is a schematic diagram of a network structure of an image registration model of the present application;
FIG. 4 is a schematic diagram of a network structure of a deep feature extraction network FEB according to the present application;
FIG. 5 is a schematic diagram of the internal structure of an RDN network of the present application;
FIG. 6 is a schematic diagram of a network structure of a mask prediction network MPB according to the present application;
FIG. 7 is a schematic diagram of an equivalent transformation of a first transformation matrix according to the present application;
FIG. 8 is a schematic structural diagram of an embodiment of an apparatus for training an image registration model according to the present application;
FIG. 9 is a schematic structural diagram of an electronic device of the present application;
fig. 10 is a hardware structure diagram of an electronic device implementing an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first", "second" and the like in the description and claims of the present application are used to distinguish between similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that the data so used may be interchanged under appropriate circumstances, so that the embodiments of the application can be practiced in orders other than those illustrated or described herein. Objects distinguished by "first", "second" and the like are generally of one type, and their number is not limited; for example, the first object may be one or more than one. In addition, "and/or" in the specification and claims denotes at least one of the connected objects, and the character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
The method for training the image registration model provided by the embodiment of the present application is described in detail below with reference to the accompanying drawings by specific embodiments and application scenarios thereof.
Referring to fig. 1, a flow chart of steps of an embodiment of a method of training an image registration model of the present application is shown, comprising the steps of:
step 101, acquiring a data set containing registered image pairs, wherein each image pair contains a visible light image and an infrared image in the same scene;
102, cutting each image pair in the data set to obtain a target image block pair, wherein the target image block pair comprises a first image block cut from a visible light image and a second image block cut from an infrared image, and the first image block and the second image block correspond to the same position in the image pair and have a preset random offset;
103, calculating first transformation matrixes corresponding to two image blocks in the target image block pair;
step 104, using the target image block pair as training data, and sequentially inputting it into an initial image registration model to obtain a registration image pair of the target image block pair and second transformation matrixes corresponding to two image blocks in the registration image pair, wherein the image registration model is a deep neural network model;
and 105, calculating a difference value between the first transformation matrix and the second transformation matrix according to a preset loss function, and updating network parameters of the image registration model according to the calculated difference value until the calculated difference value is smaller than a preset threshold value, so as to obtain the trained image registration model.
The embodiment of the application provides a method for training an image registration model, wherein a target image block pair and a first transformation matrix are generated according to pre-collected registered image pairs, the target image block pair is used as training data, the first transformation matrix is used as the label corresponding to the training data, and a neural network model is obtained through supervised training. By simulating the structure of human visual neurons, the neural network model can decompose and extract richer and more appropriate types of image features, which improves the accuracy of feature extraction.
In an embodiment of the present application, a data set containing registered image pairs is pre-collected, each image pair containing an infrared image and a visible light image of the same scene. For example, the embodiment of the present application may select an appropriate number of image pairs from internationally published public databases of infrared and visible light images and videos, such as INO, TNO and OTCVBS, and use them as training data for the image registration model and for making annotation labels. These internationally published public databases contain finely registered infrared and visible light image pairs.
Further, in the embodiment of the present application, a preset number of image pairs are selected from the data set containing the registered image pairs, and the selected image pairs cover different scenes, for example daytime, night, indoor and outdoor scenes; image pairs containing scene targets such as pedestrians and vehicles are selected as far as possible, so as to improve the objectivity of subsequent registration.
Each image pair in the data set is cut to obtain a target image block pair, the target image block pair comprises a first image block cut from a visible light image and a second image block cut from an infrared image, and the first image block and the second image block correspond to the same position in the image pair and have a preset random offset.
Because relatively little training data is available for training the registration model, the embodiment of the application crops the pre-collected registered image pairs and uses the cropped target image block pairs as training data for training the image registration model. Each image pair can be cut into a plurality of target image block pairs according to a preset size, and the target image block pairs are used as training data, so that each image pair can generate a plurality of training data, which increases the quantity of training data and improves the training accuracy. In one example, the preset size is 32 × 32 pixels.
Because the first image block and the second image block in the target image block pair correspond to the same position in the image pair and have a preset random offset, the embodiment of the application calculates the first transformation matrix corresponding to the first image block and the second image block in the target image block pair. The first transformation matrix may be used to represent the offset between the first image block and the second image block. The embodiment of the application takes the first transformation matrix as the label corresponding to the training data, which is used to guide the training of the registration model.
After the training data and the label are obtained, the training data (target image block pair) are sequentially input into an initial image registration model to obtain a registration image pair of the target image block pair, and a second transformation matrix corresponding to two image blocks in the registration image pair is calculated, wherein the image registration model is a deep neural network model.
The image registration model may be obtained by performing supervised training on an existing neural network according to a large amount of training data using a machine learning method. It should be noted that the embodiment of the present application does not limit the model structure or the training method of the image registration model. The image registration model may be a deep neural network model that fuses multiple neural networks, including but not limited to at least one of, or a combination, superposition or nesting of at least two of, the following: CNN (Convolutional Neural Network), LSTM (Long Short-Term Memory) network, RNN (Recurrent Neural Network), attention neural network, and the like.
After calculating corresponding second transformation matrices for two image blocks in a registration image pair output by an image registration model, calculating a difference value between the first transformation matrix and the second transformation matrix according to a preset loss function, and updating network parameters of the image registration model according to the calculated difference value until the calculated difference value is smaller than a preset threshold value, thereby obtaining the trained image registration model.
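For ease of understanding, the following Python sketch outlines the supervised training loop described above. It assumes PyTorch, a hypothetical RegistrationModel, data loader and loss function, and an SGD optimizer; none of these choices are fixed by the present application.

import torch

def train(model, loader, loss_fn, lr=0.01, threshold=1e-4, max_iters=24000):
    # 'model' maps a target image block pair to a fitted second transformation matrix
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    iteration = 0
    for patch_v, patch_r, h_label in loader:      # training data and its label (first transformation matrix)
        h_pred = model(patch_v, patch_r)          # second transformation matrix fitted by the network
        loss = loss_fn(h_pred, h_label)           # difference value from the preset loss function
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                          # update the network parameters of the image registration model
        iteration += 1
        if loss.item() < threshold or iteration >= max_iters:   # stop once the difference is below the preset threshold
            break
    return model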
The embodiment of the application provides a method for training an image registration model, wherein the image registration model is a deep neural network model, the image registration model is obtained through supervised training according to a large amount of accurate training data, and fine registration of the image pair of an infrared image and a visible light image can be realized through the trained image registration model. Compared with an image registration method based on artificial features, the image registration method based on the deep neural network model provided by the application forces the image registration model to learn the image features with high robustness and high consistency in the image pair through given training data with accurate alignment in a network fitting mode, is used for calculating the space transformation between the images, and improves the registration accuracy of the infrared image and the visible light image.
In an optional embodiment of the present application, the cropping each image pair in the data set to obtain a target image block pair includes:
step S11, randomly determining a first selection frame in a first image in the image pair, and acquiring first coordinates corresponding to four corner points of the first selection frame;
step S12, determining a second selection frame in a second image in the image pair according to the first coordinate;
step S13, according to a preset offset, carrying out random offset on four corner points of the second selection frame to obtain second coordinates corresponding to the four corner points after offset;
step S14, calculating a transformation matrix between the first coordinate and the second coordinate;
step S15, performing perspective transformation on the second selection frame in the second image according to the inverse matrix of the transformation matrix to obtain a third selection frame;
and step S16, cutting the first selection frame in the first image and cutting the third selection frame in the second image to obtain a target image block pair.
In an optional embodiment of the present application, the first image is a visible light image of the pair of images, and the second image is an infrared image of the pair of images; or, the first image is an infrared image in the image pair, and the second image is a visible light image in the image pair. In the embodiment of the present application, the first image is a visible light image, and the second image is an infrared image. Referring to fig. 2, a schematic flow chart of cutting a target image block pair according to an embodiment of the present application is shown.
First, step S10 is executed to select an image pair from the data set. As shown in FIG. 2, the image pair includes Ir and Iv, where Ir is an infrared image and Iv is a visible light image. Step S11 is then executed: for the selected image pair, a first selection frame, denoted Pv, is randomly determined in the visible light image. The first selection frame is a rectangular frame, and the first coordinates corresponding to its four corner points are obtained. It is to be understood that the size of the first selection frame is not limited in the embodiments of the present application; in the embodiment of the present application, a first selection frame of 32 × 32 pixels is taken as an example.
Then, step S12 is executed to determine a second selection frame in the second image of the pair according to the first coordinates. Specifically, the first coordinates of the four corner points of the first selection frame in the visible light image may be mapped into the infrared image to obtain the second selection frame in the infrared image, denoted Pv′.
Next, step S13 is executed: according to a preset offset, the four corner points of the second selection frame are randomly offset to obtain the second coordinates corresponding to the four offset corner points. The offset direction and offset distance of the four corner points of the second selection frame Pv′ are randomly selected within a preset range; step S13 in FIG. 2 shows an example of the offset direction and offset distance of the four corner points of Pv′. The offset second coordinates of the four corner points of the second selection frame in the infrared image form the offset second selection frame, denoted Pr′.
Step S14 is executed to calculate a transformation matrix between the first coordinates and the second coordinates. Specifically, the corner coordinates of the first selection frame Pv in the visible light image and of the offset second selection frame Pr′ in the infrared image are calibrated, and the transformation matrix between Pv and Pr′, i.e. the transformation matrix between the first coordinates and the second coordinates, is calculated. In the embodiment of the present application, the transformation matrix may be a 3 × 3 homography matrix, denoted Hv,r.
Step S15 is executed: according to the inverse matrix of the transformation matrix Hv,r (denoted Hr,v), perspective transformation is performed on the second selection frame in the second image to obtain a third selection frame. In the embodiment of the present application, the second selection frame in the infrared image is perspective-transformed with the inverse matrix Hr,v of the transformation matrix Hv,r. The size of the transformed second selection frame may change, and the transformed frame is designated as the third selection frame Pr″. The position of the third selection frame Pr″ corresponds to the same position in the image pair as the first selection frame Pv in the visible light image, but with a certain offset. Here Hr,v satisfies the transformation relation shown in formula 1, where Hv,r represents the transformation matrix of the visible light image with respect to the infrared image, Hr,v represents the transformation matrix of the infrared image with respect to the visible light image, and I3 represents the 3rd-order identity matrix.
Hv,r·Hr,v=I3 (1)
Step S16 is executed: the first selection frame Pv in the visible light image is cut to obtain a first image block, denoted Pv, and the third selection frame Pr″ in the second image is cut to obtain a second image block, denoted Pr. A target image block pair, denoted (Pv, Pr), is thereby obtained.
Performing the above steps S11 to S16 for each image pair in the data set may obtain a large number of target image block pairs, and inputting the obtained target image block pairs as training data into an initial image registration model for data fitting to train the image registration model.
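As an illustration only, the following Python sketch shows one common way to realize steps S11 to S16 with OpenCV and NumPy: the full infrared image is warped with the inverse homography and then cropped at the location of the first selection frame, which yields the content of the third selection frame. The patch size, offset range, corner ordering and function names are assumptions, not values fixed by the application.

import cv2
import numpy as np

def make_patch_pair(i_v, i_r, patch=32, max_offset=8):
    h, w = i_v.shape[:2]
    x = np.random.randint(0, w - patch)                       # S11: random first selection frame Pv in the visible image
    y = np.random.randint(0, h - patch)
    corners_v = np.float32([[x, y], [x + patch, y],
                            [x + patch, y + patch], [x, y + patch]])   # first coordinates
    # S12/S13: map the frame into the infrared image and randomly offset its four corner points
    corners_r = corners_v + np.random.uniform(-max_offset, max_offset, (4, 2)).astype(np.float32)
    h_vr = cv2.getPerspectiveTransform(corners_v, corners_r)  # S14: 3x3 homography between first and second coordinates
    h_rv = np.linalg.inv(h_vr)                                # S15: inverse matrix used for the perspective transformation
    i_r_warped = cv2.warpPerspective(i_r, h_rv, (w, h))
    p_v = i_v[y:y + patch, x:x + patch]                       # S16: crop the first image block
    p_r = i_r_warped[y:y + patch, x:x + patch]                # S16: crop the second image block (third-frame content)
    return (p_v, p_r), h_vr                                   # target image block pair and its label (first transformation matrix)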
In an optional embodiment of the present application, the calculating a first transformation matrix corresponding to two image blocks in the target image block pair includes:
step S21, after calculating a transformation matrix between the first coordinates and the second coordinates, recording the transformation matrix between the first coordinates and the second coordinates;
step S22, using a transformation matrix between the first coordinate value and the second coordinate value as a first transformation matrix corresponding to two image blocks in the target image block pair.
In the embodiment of the present application, after step S14 is executed, a transformation matrix between the first coordinate and the second coordinate may be recorded, and the transformation matrix between the first coordinate value and the second coordinate value may be used as a first transformation matrix H corresponding to two image blocks in the target image block pairv,r. The first transformation matrix is a label for the training data.
In an optional embodiment of the present application, the image registration model is a deep neural network model including a depth feature extraction network, a mask prediction network, a channel cascade module, and a matrix estimation network, and the obtaining a registration image pair of the target image block pair and a second transformation matrix corresponding to two image blocks in the registration image pair by taking the target image block pair as training data and sequentially inputting it into an initial image registration model includes:
step S31, respectively inputting the first image block and the second image block in the target image block pair into the depth feature extraction network, so as to extract a first depth feature of the first image block and a second depth feature of the second image block;
step S32, inputting a first image block and a second image block in the target image block pair into the mask prediction network respectively to obtain a first mask corresponding to the first image block and a second mask corresponding to the second image block;
step S33, weighting the first depth feature by using the first mask to obtain a first feature map, and weighting the second depth feature by using the second mask to obtain a second feature map;
step S34, inputting the first feature map and the second feature map into the channel cascade module to obtain a registered image pair of the target image block pair;
and step S35, inputting the registration image pair into the matrix estimation network to obtain a second transformation matrix corresponding to two image blocks in the registration image pair.
In the embodiments of the application, an image registration framework based on a neural network is designed for the infrared and visible light image registration task, and network model training is conducted by regressing the transformation matrix, so that the trained image registration model network can extract robust features. Referring to fig. 3, a schematic network structure diagram of an image registration model according to an embodiment of the present application is shown. As shown in fig. 3, the image registration model includes a depth feature extraction network (FEB), a mask prediction network (MPB), a channel cascade module, and a matrix estimation network (HEB).
First, step S31 is executed: the first image block Pv and the second image block Pr in the target image block pair are respectively input into the depth feature extraction network FEB to extract the first depth feature fv of the first image block Pv and the second depth feature fr of the second image block Pr.
While step S31 is being performed, step S32 may be performed: the first image block Pv and the second image block Pr in the target image block pair are respectively input into the mask prediction network MPB, which performs hierarchical mask prediction on the target image block pair to obtain the first mask Mv corresponding to the first image block Pv and the second mask Mr corresponding to the second image block Pr.
Then, step S33 is performed: the first depth feature is weighted with the first mask to obtain a first feature map, and the second depth feature is weighted with the second mask to obtain a second feature map. In the embodiment of the application, the first mask Mv and the first depth feature fv are weighted and superposed to obtain the first feature map Gv, which contains the efficient common features of the first mask Mv and the first depth feature fv. Likewise, the second mask Mr and the second depth feature fr are weighted and superposed to obtain the second feature map Gr, which contains the efficient common features of the second mask Mr and the second depth feature fr.
Next, step S34 is executed: the first feature map Gv and the second feature map Gr are input into the channel cascade module to obtain the registered image pair Gr,v of the target image block pair. The channel cascade module performs a channel concatenation operation on the first feature map Gv and the second feature map Gr to obtain a container Gr,v holding the common features of the two feature maps, i.e. the registered image pair of the target image block pair. It should be noted that the physical meaning of the first feature map Gv and the second feature map Gr is that feature information lacking consistency between the two images is removed by means of the feature masks, while the more robust common features of the images are retained.
Step S35 is executed: the registered image pair of the target image block pair is input into the matrix estimation network HEB to obtain the second transformation matrix corresponding to the two image blocks in the registered image pair. That is, the common feature container Gr,v is input into the matrix estimation network HEB for the regression calculation of the registration network, and the second transformation matrix obtained by fitting, denoted Hv,r′, is output by the HEB network module; the second transformation matrix is a 3 × 3 homography matrix.
The difference value between the first transformation matrix Hv,r and the second transformation matrix Hv,r′ is then calculated according to the preset loss function, and the network parameters of the image registration model are updated according to the calculated difference value until the calculated difference value is smaller than the preset threshold value, thereby obtaining the trained image registration model.
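A structural sketch of this forward pass, written in Python with PyTorch, is given below for illustration. The FEB, MPB and HEB sub-networks are passed in as placeholders (their layer configurations follow Tables 1 to 4), and sharing the FEB and MPB weights between the two branches is an assumption made here for brevity.

import torch
import torch.nn as nn

class RegistrationModel(nn.Module):
    def __init__(self, feb: nn.Module, mpb: nn.Module, heb: nn.Module):
        super().__init__()
        self.feb = feb      # depth feature extraction network
        self.mpb = mpb      # mask prediction network
        self.heb = heb      # matrix estimation network

    def forward(self, p_v, p_r):
        f_v, f_r = self.feb(p_v), self.feb(p_r)      # S31: first and second depth features
        m_v, m_r = self.mpb(p_v), self.mpb(p_r)      # S32: first and second masks
        g_v, g_r = f_v * m_v, f_r * m_r              # S33: mask-weighted feature maps (equation 3); shapes assumed broadcastable
        g_rv = torch.cat([g_v, g_r], dim=1)          # S34: channel concatenation -> common feature container Gr,v
        h_pred = self.heb(g_rv)                      # S35: 8-dimensional corner-offset vector (second transformation matrix)
        return h_pred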
In the embodiment of the application, the depth feature extraction network FEB is used for extracting multi-level depth feature information in an image. It should be noted that, in the embodiment of the present application, a network structure of the deep feature extraction network FEB is not limited.
Referring to fig. 4, a network structure diagram of a deep feature extraction network FEB according to an embodiment of the present application is shown. As shown in fig. 4, the depth feature extraction network FEB may include a DFE (Deep Feature Extraction) module and an FCN (Fully Convolutional Network) module. In one example, the network structure parameters of the deep feature extraction network FEB shown in fig. 4 are given in Table 1.
TABLE 1 (network structure parameters of the deep feature extraction network FEB)
In an application example of the present application, the DFE module may be composed of three RDN (residual dense network, a structure widely used in image super-resolution) network structures. Referring to fig. 5, a schematic diagram of the internal structure of an RDN network of the present application is shown, and the network structure parameters of the RDN network shown in fig. 5 are given in Table 2.
TABLE 2 (network structure parameters of the RDN network)
In the embodiment of the application, dense connections are used in the RDN network shown in fig. 5, which improves the reusability of features across front and rear layers and reduces feature computation complexity and structure width. Further, each RDN network may include two branches to enhance the diversity of the extracted features and make full use of the depth features extracted by each convolutional layer (Conv) within the module. Assuming that HRDN,d(·) represents the operation of the d-th RDN module, the feature map of the d-th RDN module is calculated as shown in formula 2:
FRDN,d = HRDN,d(Finput) (2)
where Finput represents the depth features input to this RDN module from the preceding level, and FRDN,d represents the feature map output by the d-th RDN module.
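The following PyTorch sketch illustrates, under assumed channel counts and layer numbers, a densely connected two-branch block in the spirit of the RDN module of fig. 5; it is not the exact structure given in Table 2.

import torch
import torch.nn as nn

class DenseBranch(nn.Module):
    def __init__(self, channels, growth, layers=3):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels + i * growth, growth, 3, padding=1) for i in range(layers))
        self.fuse = nn.Conv2d(channels + layers * growth, channels, 1)   # 1x1 local feature fusion

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(torch.relu(conv(torch.cat(feats, dim=1))))      # dense connections reuse earlier-layer features
        return self.fuse(torch.cat(feats, dim=1))

class RDNBlock(nn.Module):
    def __init__(self, channels, growth=16):
        super().__init__()
        self.branch1 = DenseBranch(channels, growth)
        self.branch2 = DenseBranch(channels, growth)                      # two branches enhance feature diversity

    def forward(self, f_input):
        return f_input + self.branch1(f_input) + self.branch2(f_input)   # residual output of the d-th RDN module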
In an optional embodiment of the present application, the respectively inputting a first image block and a second image block in the target image block pair into the mask prediction network to obtain a first mask corresponding to the first image block and a second mask corresponding to the second image block includes:
respectively inputting a first image block and a second image block in the target image block pair into the mask prediction network, so as to generate a first mask which is equal to the first image block size through the mask prediction network learning, marking a first contribution estimation value corresponding to each pixel in the first image block in the first mask, generate a second mask which is equal to the second image block size through the mask prediction network learning, and marking a second contribution estimation value corresponding to each pixel in the second image block in the second mask;
the weighting the first depth feature by using the first mask to obtain a first feature map, and weighting the second depth feature by using the second mask to obtain a second feature map, including:
weighting the first depth feature to obtain a first feature map by using a first contribution estimation value corresponding to each pixel in the first image block labeled in the first mask, and weighting the second depth feature to obtain a second feature map by using a second contribution estimation value corresponding to each pixel in the second image block labeled in the second mask.
In the embodiment of the present application, the mask prediction network MPB generates a first mask equal to the first image block size through network learning by automatically learning common features required for image registration, and marks a first contribution estimation value corresponding to each pixel in the first image block in the first mask. The first contribution estimation value refers to an estimated contribution degree of each element in the first image block to the first transformation matrix, the greater the contribution degree, the greater the probability that the feature of the corresponding element in the first image block is retained, and the smaller the contribution degree, the greater the probability that the feature of the corresponding element in the first image block is filtered out.
Similarly, by automatically learning common features required for image registration, the mask prediction network MPB generates a second mask having the same size as the second image block, and labels in the second mask a second contribution estimation value corresponding to each pixel in the second image block. The second contribution estimation value refers to the estimated contribution degree of each element in the second image block to the first transformation matrix; the greater the contribution degree, the greater the probability that the feature of the corresponding element in the second image block is retained, and the smaller the contribution degree, the greater the probability that the feature of the corresponding element in the second image block is filtered out.
Referring to fig. 6, a schematic diagram of a network structure of a mask prediction network MPB according to an embodiment of the present application is shown, and as shown in fig. 6, the mask prediction network MPB may include a single RDN structure, which follows the structure shown in fig. 5. In one example, the network structure parameters of the mask prediction network MPB shown in fig. 6 are shown in table 3.
TABLE 3 (network structure parameters of the mask prediction network MPB)
After the mask prediction network MPB outputs the first mask Mv corresponding to the first image block Pv and the second mask Mr corresponding to the second image block Pr, the first mask Mv and the first depth feature fv are weighted and superposed to obtain the first feature map Gv, and the second mask Mr and the second depth feature fr are weighted and superposed to obtain the second feature map Gr. The weighted superposition process is shown in equation 3.
Gi=fi×Mi(i=r,v) (3)
where fi (i = r, v) denotes the first depth feature fv and the second depth feature fr extracted by the depth feature extraction network FEB, Mi (i = r, v) denotes the first mask Mv and the second mask Mr output by the mask prediction network MPB, and Gi (i = r, v) denotes the first feature map Gv and the second feature map Gr, respectively.
The channel cascade module performs a channel concatenation operation on the first feature map Gv and the second feature map Gr to obtain the container Gr,v holding their common features, i.e. the registered image pair of the target image block pair; the common feature container Gr,v is then input into the matrix estimation network HEB for the regression calculation of the registration network.
In the embodiment of the present application, the matrix estimation network HEB may generate four sets of two-dimensional offset vectors (total 8-dimensional vectors). The whole matrix estimation process is represented by heb (·), and the calculation process is shown in equation 4.
H=heb(Gr,v) (4)
In an alternative embodiment of the present application, the matrix estimation network HEB may use ResNet34 as the backbone network structure, which contains 34 convolution layers followed by an adaptive pooling layer, so that the HEB module requires only low-dimensional input features to generate a feature matrix of fixed size. In one example, the network structure parameters of the matrix estimation network HEB are shown in Table 4.
TABLE 4 (network structure parameters of the matrix estimation network HEB)
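As an illustrative assumption, the matrix estimation network HEB can be sketched in PyTorch by adapting torchvision's ResNet34: the first convolution is replaced so that it accepts the channels of the cascaded feature container Gr,v, and the final fully connected layer is replaced so that it regresses the 8-dimensional corner-offset vector. The application itself only specifies ResNet34 with 34 convolution layers followed by adaptive pooling as the backbone.

import torch.nn as nn
from torchvision.models import resnet34

def build_heb(in_channels: int) -> nn.Module:
    net = resnet34(weights=None)                              # untrained ResNet34 backbone (assumption)
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7,
                          stride=2, padding=3, bias=False)    # accept the cascaded feature channels
    net.fc = nn.Linear(net.fc.in_features, 8)                 # four two-dimensional corner offsets (8-dimensional vector)
    return net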
It should be noted that the network structure of the image registration model shown in fig. 3, the network structure of the depth feature extraction network FEB shown in fig. 4, the network structure of the RDN network shown in fig. 5, the network structure of the mask prediction network MPB shown in fig. 6, and the parameters corresponding to these network structures in Tables 1 to 4 are all given only as an application example of the present application. The embodiment of the present application does not limit the network structure of the image registration model, the depth feature extraction network FEB, the RDN network or the mask prediction network MPB, nor the specific settings of the corresponding network parameters.
In an optional embodiment of the present application, after calculating the first transformation matrices corresponding to two image blocks in the target image block pair, the method further includes: performing equivalent transformation on the first transformation matrix according to the angular point coordinate offset of the first image block and the second image block to obtain a first equivalent matrix;
after obtaining the second transformation matrices corresponding to the two image blocks in the registered image pair, the method further includes: performing equivalent transformation estimation on the second transformation matrix to obtain a second equivalent matrix;
the calculating a difference value between the first transformation matrix and the second transformation matrix according to a preset loss function includes: and calculating a difference value between the first equivalent matrix and the second equivalent matrix according to a preset loss function.
In the embodiment of the application, the degrees of freedom of the fitted second transformation matrix are taken into account in the network training stage of the image registration model, and the output of the matrix estimation network HEB is set as an 8-dimensional vector. In addition, the parameters in the second transformation matrix represent different transformation types: some parameters represent rotation, scaling and shear transformations, some represent translation, and some represent perspective transformation; that is, the parameters have different dimensions. If parameter regression on the second transformation matrix were performed directly during network training, these different dimensions could affect the final training effect.
In order to solve the problem, in the embodiment of the present application, equivalent transformation processing is performed on the first transformation matrix and the second transformation matrix, so as to solve the problem of different regression parameter dimensions, thereby reducing the complexity of network training and improving the network training effect.
Specifically, after the first transformation matrix corresponding to the two image blocks in the target image block pair is calculated, the first transformation matrix is equivalently transformed according to the corner coordinate offsets of the first image block and the second image block in the target image block pair to obtain a first equivalent matrix. Referring to fig. 7, a schematic diagram of performing an equivalent transformation on the first transformation matrix according to an embodiment of the present application is shown. Through the equivalent transformation shown in fig. 7, the parameters in the first transformation matrix Hv,r are replaced by the corner coordinate offsets of the first image block and the second image block to obtain the first equivalent matrix. Similarly, each parameter in the second transformation matrix Hv,r′ obtained by fitting with the matrix estimation network HEB is replaced by the corner coordinate offsets estimated by the matrix estimation network HEB to obtain a second equivalent matrix.
In the training process of the image registration model, the difference value between the first equivalent matrix and the second equivalent matrix is calculated according to the preset loss function, so as to guide the adjustment of the network parameters of the image registration model.
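For illustration, the correspondence between a 3 × 3 homography and its equivalent four-corner-offset representation can be sketched with OpenCV as follows; the patch size and corner ordering are assumptions.

import cv2
import numpy as np

def homography_to_corner_offsets(h_mat, patch=32):
    corners = np.float32([[0, 0], [patch, 0], [patch, patch], [0, patch]])
    warped = cv2.perspectiveTransform(corners.reshape(-1, 1, 2), h_mat.astype(np.float32)).reshape(4, 2)
    return warped - corners                    # (Δμi, Δvi) offsets of the four corner points, an 8-value equivalent of h_mat

def corner_offsets_to_homography(offsets, patch=32):
    corners = np.float32([[0, 0], [patch, 0], [patch, patch], [0, patch]])
    return cv2.getPerspectiveTransform(corners, corners + offsets.astype(np.float32))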
In an optional embodiment of the present application, the loss function is determined according to an offset of each corner point in the first equivalent matrix and an offset of each corner point in the second equivalent matrix.
Specifically, the loss function of the embodiment of the present application is shown in equation 5:
Loss = Σi=1..4 [(Δμi − Δμi′)^2 + (Δvi − Δvi′)^2] (5)
where the value of i in formula (5) ranges from 1 to 4, representing the four corner points of the first equivalent matrix or the second equivalent matrix. Δμi and Δvi respectively denote the offsets of the i-th corner point in the first equivalent matrix in the x and y directions, and Δμi′ and Δvi′ respectively denote the offsets of the i-th corner point in the second equivalent matrix in the x and y directions. It can be understood that the loss function shown in formula (5) is only an application example of the present application; the embodiment of the present application does not limit the functional form of the loss function.
Further, in the training phase of the image registration model, the pixel values of each target image block pair are first normalized, and the learning rate is set with piecewise constant decay: when the number of network training iterations reaches the preset milestones, such as [5,000, 10,000, 14,000, 18,000, 20,000, 24,000], the network training learning rate is set to the corresponding value in [0.01, 0.007, 0.005, 0.0025, 0.001, 0.0001, 0.00005].
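The piecewise constant learning-rate schedule described above can be expressed, for illustration, by a small lookup function (the milestone and learning-rate lists are those quoted above):

from bisect import bisect_right

MILESTONES = [5000, 10000, 14000, 18000, 20000, 24000]
LR_VALUES = [0.01, 0.007, 0.005, 0.0025, 0.001, 0.0001, 0.00005]

def learning_rate(iteration: int) -> float:
    # bisect_right finds which segment the current iteration count falls into
    return LR_VALUES[bisect_right(MILESTONES, iteration)]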
In an optional embodiment of the present application, after obtaining the trained image registration model, the method further includes:
inputting an image pair to be registered into the trained image registration model, wherein the image pair to be registered comprises an infrared image and a visible light image in the same scene;
and outputting a registration image pair through the trained image registration model.
After the training of the image registration model is completed, the trained image registration model can be used for image registration processing. In the embodiment of the application, the image registration model is an end-to-end model, the infrared image and the visible light image in the same scene to be registered are used as an image pair to be input into the trained image registration model, and then the registered image pair can be output.
Further, after the image pair containing the infrared image and the visible light image of the same scene is registered through the image registration model, the obtained registered image pair can be used for image fusion processing, thereby improving the fusion effect. The application provides a neural-network-based method for fine registration of infrared and visible light images; when infrared and visible light images of the same scene are acquired, fine alignment between the images can be achieved, providing a registration performance guarantee for the subsequent scene image fusion task.
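A usage sketch for the inference stage is given below; the saved-model path, the single-channel 32 × 32 input tensors and the variable names are illustrative assumptions, and the estimated transformation is what would subsequently be used to warp and align the pair before fusion.

import torch

model = torch.load("registration_model.pt", map_location="cpu")   # example path to a trained image registration model
model.eval()

visible_tensor = torch.zeros(1, 1, 32, 32)        # placeholder visible light image block (real preprocessing not shown)
infrared_tensor = torch.zeros(1, 1, 32, 32)       # placeholder infrared image block
with torch.no_grad():
    h_pred = model(visible_tensor, infrared_tensor)   # estimated transformation between the image pair to be registered
# h_pred can then be used to warp the infrared image onto the visible light image,
# yielding the registered image pair that is passed to the image fusion task.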
In summary, the embodiment of the present application cuts the pre-collected registered image pair, and uses the cut target image block pair as training data for training the image registration model. Each image pair can be cut into a plurality of target image block pairs according to a preset size, and the target image block pairs are used as training data, so that each image pair can generate a plurality of training data, the data quantity of the training data can be increased, and the training accuracy is improved. After the training data and the label are obtained, the training data (target image block pairs) are sequentially input into the initial image registration model for training, and the trained image registration model is obtained. The image registration model is obtained through supervised training according to a large amount of accurate training data, and fine registration of the image pair of the infrared image and the visible light image can be realized through the trained image registration model. Compared with an image registration method based on artificial features, the image registration method based on the deep neural network model provided by the application forces the image registration model to learn the image features with high robustness and high consistency in the image pair through given training data with accurate alignment in a network fitting mode, is used for calculating the space transformation between the images, and improves the registration accuracy of the infrared image and the visible light image.
It should be noted that, in the method for training an image registration model provided in the embodiment of the present application, the execution subject may be an apparatus for training an image registration model, or a control module in the apparatus for training an image registration model, which is used for executing the method for training an image registration model. In the embodiment of the present application, a method for executing a training image registration model by using a device for training an image registration model is taken as an example, and the device for training an image registration model provided in the embodiment of the present application is described.
Referring to fig. 8, a schematic structural diagram of an embodiment of an apparatus for training an image registration model according to the present application is shown, the apparatus including:
a dataset acquisition module 801 for acquiring a dataset comprising registered image pairs, wherein each image pair comprises a visible light image and an infrared image in the same scene;
an image cropping module 802, configured to crop each image pair in the data set to obtain a target image block pair, where the target image block pair includes a first image block cropped from a visible light image and a second image block cropped from an infrared image, and the first image block and the second image block correspond to the same position in the image pair and have a preset random offset;
a matrix calculation module 803, configured to calculate a first transformation matrix corresponding to two image blocks in the target image block pair;
a data training module 804, configured to take the target image block pair as training data, sequentially input it into an initial image registration model to obtain a registration image pair of the target image block pair, and obtain a second transformation matrix corresponding to two image blocks in the registration image pair, where the image registration model is a deep neural network model;
a parameter adjusting module 805, configured to calculate a difference value between the first transformation matrix and the second transformation matrix according to a preset loss function, and update a network parameter of the image registration model according to the calculated difference value until the calculated difference value is smaller than a preset threshold, so as to obtain a trained image registration model.
Optionally, the image cropping module includes:
the first selection submodule is used for randomly determining a first selection frame in a first image in the image pair and acquiring first coordinates corresponding to four corner points of the first selection frame;
a second selection submodule for determining a second selection frame in a second image of the pair of images according to the first coordinate;
the coordinate offset submodule is used for randomly offsetting the four corner points of the second selection frame according to a preset offset to obtain second coordinates corresponding to the four corner points after offset;
a transformation calculation submodule for calculating a transformation matrix between the first coordinate and the second coordinate;
the third selection submodule is used for carrying out perspective transformation on the second selection frame in the second image according to the inverse matrix of the transformation matrix to obtain a third selection frame;
and the image cropping submodule is used for cropping the first selection frame in the first image and cropping the third selection frame in the second image to obtain a target image block pair.
Optionally, the first image is a visible light image in the image pair, and the second image is an infrared image in the image pair; or, the first image is an infrared image in the image pair, and the second image is a visible light image in the image pair.
Optionally, the image registration model is a deep neural network model including a depth feature extraction network, a mask prediction network, a channel cascade module, and a matrix estimation network, and the data training module includes:
the feature extraction sub-module is used for respectively inputting a first image block and a second image block in the target image block pair into the depth feature extraction network so as to extract a first depth feature of the first image block and a second depth feature of the second image block;
the mask prediction sub-module is used for respectively inputting a first image block and a second image block in the target image block pair into the mask prediction network so as to obtain a first mask corresponding to the first image block and a second mask corresponding to the second image block;
the mask superposition submodule is used for weighting the first depth features by using the first mask to obtain a first feature map and weighting the second depth features by using the second mask to obtain a second feature map;
a cascade processing submodule, configured to input the first feature map and the second feature map into the channel cascade module, so as to obtain a registered image pair of the target image block pair;
and the matrix estimation submodule is used for inputting the registration image pair into the matrix estimation network so as to obtain a second transformation matrix corresponding to two image blocks in the registration image pair.
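The four sub-networks above can be pictured with the following PyTorch sketch of the forward pass; the channel counts, layer choices and single-channel (grayscale) inputs are assumptions for illustration, not the network design disclosed by this application.

```python
# Illustrative PyTorch sketch of the forward pass through the four sub-networks above.
# Channel counts, layer choices and single-channel (grayscale) inputs are assumptions.
import torch
import torch.nn as nn

class RegistrationNet(nn.Module):
    def __init__(self):
        super().__init__()
        # depth feature extraction network (shared by both image blocks)
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        # mask prediction network: one contribution estimation value per pixel, in [0, 1]
        self.mask = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())
        # matrix estimation network: regresses eight corner offsets from the cascaded maps
        self.estimator = nn.Sequential(
            nn.AdaptiveAvgPool2d(16), nn.Flatten(),
            nn.Linear(64 * 16 * 16, 256), nn.ReLU(), nn.Linear(256, 8))

    def forward(self, block_a, block_b):
        feat_a = self.features(block_a) * self.mask(block_a)   # weight features with the mask
        feat_b = self.features(block_b) * self.mask(block_b)
        cascaded = torch.cat([feat_a, feat_b], dim=1)          # channel cascade module
        return self.estimator(cascaded)                        # second transformation (corner offsets)
```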
Optionally, the mask prediction sub-module is specifically configured to input the first image block and the second image block in the target image block pair into the mask prediction network respectively, so that the mask prediction network learns to generate a first mask equal in size to the first image block, in which a first contribution estimation value corresponding to each pixel of the first image block is labeled, and learns to generate a second mask equal in size to the second image block, in which a second contribution estimation value corresponding to each pixel of the second image block is labeled;
the mask superposition sub-module is specifically configured to weight the first depth feature with the first contribution estimation values labeled in the first mask to obtain a first feature map, and to weight the second depth feature with the second contribution estimation values labeled in the second mask to obtain a second feature map.
Optionally, the apparatus further comprises:
the first transformation module is used for performing equivalent transformation on the first transformation matrix according to the corner point coordinate offsets of the first image block and the second image block to obtain a first equivalent matrix;
the second transformation module is used for performing equivalent transformation estimation on the second transformation matrix according to the corner point coordinate offsets in the second transformation matrix to obtain a second equivalent matrix;
the parameter adjusting module is specifically configured to calculate a difference value between the first equivalent matrix and the second equivalent matrix according to a preset loss function.
Optionally, the loss function is determined according to an offset of each corner point in the first equivalent matrix and an offset of each corner point in the second equivalent matrix.
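The application only states that the loss depends on the corner-point offsets of the two equivalent matrices; one plausible (assumed) form is an L1 distance between the predicted and ground-truth offset vectors, as sketched below.

```python
# Assumed L1 form of the corner-offset loss; the application only states that the
# loss depends on the corner-point offsets of the two equivalent matrices.
import torch

def corner_loss(pred_offsets, gt_offsets):
    """pred_offsets, gt_offsets: (batch, 8) tensors of x/y offsets of the four corners."""
    return torch.mean(torch.abs(pred_offsets - gt_offsets))
```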
Optionally, the apparatus further comprises:
the data registration module is used for inputting an image pair to be registered into the trained image registration model, wherein the image pair to be registered comprises an infrared image and a visible light image in the same scene;
and the result output module is used for outputting a registration image pair through the trained image registration model.
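A hypothetical usage sketch of the trained model at inference time follows; the function name, tensor shapes, and the step of converting predicted corner offsets into a warp are assumptions rather than the application's exact interface.

```python
# Hypothetical inference usage; the function name, tensor shapes and the step of turning
# predicted corner offsets into a warp are assumptions, not the application's exact API.
import cv2
import numpy as np
import torch

def register_pair(model, ir_image, vis_image, corners):
    """ir_image, vis_image: (H, W) float32 arrays; corners: (4, 2) float32 image corners."""
    to_tensor = lambda a: torch.from_numpy(a).float()[None, None]   # shape (1, 1, H, W)
    model.eval()
    with torch.no_grad():
        offsets = model(to_tensor(ir_image), to_tensor(vis_image)).numpy().reshape(4, 2)
    H = cv2.getPerspectiveTransform(corners, corners + offsets)     # predicted transformation
    h, w = ir_image.shape
    registered_ir = cv2.warpPerspective(ir_image, H, (w, h))        # warp infrared onto visible
    return registered_ir, vis_image                                 # registered image pair
```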
The device for training the image registration model provided by the embodiment of the application crops pre-collected registered image pairs and uses the resulting target image block pairs as training data for the image registration model. Because each image pair can be cropped into a plurality of target image block pairs of a preset size, every image pair yields multiple training samples, which increases the amount of training data and improves training accuracy. After the training data and labels are obtained, the training data (target image block pairs) are sequentially input into the initial image registration model for training, yielding the trained image registration model. The image registration model is obtained through supervised training on a large amount of accurate training data, and the trained model can achieve fine registration of infrared and visible light image pairs. Compared with image registration methods based on hand-crafted features, the deep-neural-network-based method provided by the application uses accurately aligned training data and network fitting to force the image registration model to learn highly robust and consistent image features in the image pair, which are then used to compute the spatial transformation between the images, improving the registration accuracy of the infrared and visible light images.
The device for training the image registration model in the embodiment of the present application may be a device, or may be a component, an integrated circuit, or a chip in a terminal. The device may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a Personal Digital Assistant (PDA), and the non-mobile electronic device may be a server, a Network Attached Storage (NAS), a Personal Computer (PC), a Television (TV), a teller machine, or a self-service machine; the embodiments of the present application are not specifically limited in this respect.
The apparatus for training the image registration model in the embodiment of the present application may be an apparatus having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system, and the embodiments of the present application are not specifically limited in this respect.
The device for training the image registration model provided in the embodiment of the present application can implement each process implemented in the method embodiment of fig. 1, and is not described here again to avoid repetition.
Optionally, as shown in fig. 9, an electronic device 900 is further provided in this embodiment of the present application, and includes a processor 901, a memory 902, and a program or an instruction stored in the memory 902 and executable on the processor 901, where the program or the instruction is executed by the processor 901 to implement each process of the above embodiment of the method for training an image registration model, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.
Fig. 10 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application. The electronic device 1000 includes, but is not limited to: a radio frequency unit 1001, a network module 1002, an audio output unit 1003, an input unit 1004, a sensor 1005, a display unit 1006, a user input unit 1007, an interface unit 1008, a memory 1009, and a processor 1010.
Those skilled in the art will appreciate that the electronic device 1000 may further comprise a power source (e.g., a battery) for supplying power to the various components, and the power source may be logically connected to the processor 1010 through a power management system, so that charging, discharging, and power-consumption management are implemented through the power management system. The electronic device structure shown in fig. 10 does not constitute a limitation of the electronic device; the electronic device may include more or fewer components than those shown, combine some components, or arrange components differently, which is not described in detail here.
Wherein the processor 1010 is configured to acquire a data set comprising registered image pairs, wherein each image pair comprises a visible light image and an infrared image of the same scene; cropping each image pair in the data set to obtain a target image block pair, wherein the target image block pair comprises a first image block cropped from a visible light image and a second image block cropped from an infrared image, and the first image block and the second image block correspond to the same position in the image pair and have a preset random offset; calculating a first transformation matrix corresponding to the two image blocks in the target image block pair; taking the target image block pair as training data, and sequentially inputting the target image block pair into an initial image registration model to obtain a registration image pair of the target image block pair and obtain a second transformation matrix corresponding to the two image blocks in the registration image pair, wherein the image registration model is a deep neural network model; and calculating a difference value between the first transformation matrix and the second transformation matrix according to a preset loss function, and updating the network parameters of the image registration model according to the calculated difference value until the calculated difference value is smaller than a preset threshold value, so as to obtain the trained image registration model.
According to the image registration method based on the deep neural network model, accurately aligned training data and network fitting force the image registration model to learn highly robust and consistent image features in the image pair; these features are then used to compute the spatial transformation between the images, improving the registration accuracy of the infrared and visible light images.
Optionally, the processor 1010 is further configured to randomly determine a first selection frame in a first image of the image pair, and obtain first coordinates corresponding to four corner points of the first selection frame; determining a second selection frame in a second image of the pair of images according to the first coordinate; randomly offsetting four corner points of the second selection frame according to a preset offset to obtain second coordinates corresponding to the four offset corner points; calculating a transformation matrix between the first coordinate and the second coordinate; performing perspective transformation on a second selection frame in the second image according to the inverse matrix of the transformation matrix to obtain a third selection frame; and cutting a first selection frame in the first image and cutting a third selection frame in the second image to obtain a target image block pair.
Because only a limited amount of training data is usually available for training the registration model, the embodiment of the application crops the pre-collected registered image pairs and uses the cropped target image block pairs as training data for the image registration model. Each image pair can be cropped into a plurality of target image block pairs of a preset size, so every image pair yields multiple training samples, which increases the amount of training data and improves training accuracy.
Optionally, the processor 1010 is further configured to input a first image block and a second image block in the target image block pair into the depth feature extraction network, respectively, so as to extract a first depth feature of the first image block and a second depth feature of the second image block; respectively inputting a first image block and a second image block in the target image block pair into the mask prediction network to obtain a first mask corresponding to the first image block and a second mask corresponding to the second image block; weighting the first depth features by using the first mask to obtain a first feature map, and weighting the second depth features by using the second mask to obtain a second feature map; inputting the first feature map and the second feature map into the channel cascade module to obtain a registered image pair of the target image block pair; and inputting the registration image pair into the matrix estimation network to obtain a second transformation matrix corresponding to two image blocks in the registration image pair.
The image registration model is a deep neural network model obtained through supervised training on a large amount of accurate training data, and the trained model can achieve fine registration of infrared and visible light image pairs. Compared with image registration methods based on hand-crafted features, the deep-neural-network-based method provided by the application uses accurately aligned training data and network fitting to force the image registration model to learn highly robust and consistent image features in the image pair, which are then used to compute the spatial transformation between the images, improving the registration accuracy of the infrared and visible light images.
Optionally, the processor 1010 is further configured to perform equivalent transformation on the first transformation matrix according to the offset of the corner coordinate of the first image block and the second image block, so as to obtain a first equivalent matrix; performing equivalent transformation estimation on the second transformation matrix according to the angular point coordinate offset in the second transformation matrix to obtain a second equivalent matrix; and calculating a difference value between the first equivalent matrix and the second equivalent matrix according to a preset loss function.
The embodiment of the application carries out equivalent transformation processing on the first transformation matrix and the second transformation matrix respectively to solve the problem of different regression parameter dimensions, thereby reducing the complexity of network training and improving the network training effect.
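The equivalent transformation can be understood as switching between a 3x3 transformation matrix and its four corner-point offsets (8 values), so that both the label and the prediction are regressed in the same dimensionality; the conversion sketch below, using OpenCV, is an assumed realization rather than the application's exact procedure.

```python
# Assumed realization of the "equivalent transformation": switching between a 3x3
# transformation matrix and its four corner-point offsets (8 values), so that both
# the label and the prediction are regressed in the same dimensionality.
import cv2
import numpy as np

def homography_to_offsets(H, corners):
    """corners: (4, 2) float32 array of an image block's corner coordinates."""
    warped = cv2.perspectiveTransform(corners.reshape(1, 4, 2), H).reshape(4, 2)
    return (warped - corners).flatten()              # equivalent matrix as 8 corner offsets

def offsets_to_homography(offsets, corners):
    warped = corners + offsets.reshape(4, 2).astype(np.float32)
    return cv2.getPerspectiveTransform(corners, warped)
```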
Optionally, the processor 1010 is further configured to input an image pair to be registered into the trained image registration model, where the image pair to be registered includes an infrared image and a visible light image in the same scene; and outputting a registration image pair through the trained image registration model.
The image registration model is an end-to-end model: an infrared image and a visible light image of the same scene to be registered are input as an image pair into the trained image registration model, which then outputs the registered image pair.
It should be understood that in the embodiment of the present application, the input Unit 1004 may include a Graphics Processing Unit (GPU) 1041 and a microphone 1042, and the Graphics Processing Unit 1041 processes image data of a still picture or a video obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 1006 may include a display panel 1061, and the display panel 1061 may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like. The user input unit 1007 includes a touch panel 1071 and other input devices 1072. The touch panel 1071 is also referred to as a touch screen. The touch panel 1071 may include two parts of a touch detection device and a touch controller. Other input devices 1072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein. The memory 1009 may be used to store software programs as well as various data, including but not limited to application programs and operating systems. Processor 1010 may integrate an application processor that handles primarily operating systems, user interfaces, applications, etc. and a modem processor that handles primarily wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 1010.
The embodiment of the present application further provides a readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the above method for training an image registration model, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The processor is the processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and so on.
The embodiment of the present application further provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a program or an instruction to implement each process of the above method for training an image registration model, and can achieve the same technical effect, and in order to avoid repetition, the details are not repeated here.
It should be understood that the chips mentioned in the embodiments of the present application may also be referred to as a system-on-chip, a chip system, or a system-on-a-chip, etc.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed, but may include performing the functions in a substantially simultaneous manner or in a reverse order based on the functions involved; for example, the methods described may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (14)

1. A method of training an image registration model, the method comprising:
acquiring a dataset comprising registered image pairs, wherein each image pair comprises a visible light image and an infrared image in the same scene;
cutting each image pair in the data set to obtain a target image block pair, wherein the target image block pair comprises a first image block cut from a visible light image and a second image block cut from an infrared image, and the first image block and the second image block correspond to the same position in the image pair and have a preset random offset;
calculating first transformation matrixes corresponding to two image blocks in the target image block pair;
taking the target image block pair as training data, and sequentially inputting the target image block pair into an initial image registration model to obtain a registration image pair of the target image block pair and obtain second transformation matrixes corresponding to two image blocks in the registration image pair, wherein the image registration model is a deep neural network model;
and calculating a difference value between the first transformation matrix and the second transformation matrix according to a preset loss function, and updating the network parameters of the image registration model according to the calculated difference value until the calculated difference value is smaller than a preset threshold value, so as to obtain the trained image registration model.
2. The method of claim 1, wherein said cropping each pair of images in the dataset to obtain a pair of target image blocks comprises:
randomly determining a first selection frame in a first image in the image pair, and acquiring first coordinates corresponding to four corner points of the first selection frame;
determining a second selection frame in a second image of the pair of images according to the first coordinate;
randomly offsetting four corner points of the second selection frame according to a preset offset to obtain second coordinates corresponding to the four offset corner points;
calculating a transformation matrix between the first coordinate and the second coordinate;
performing perspective transformation on a second selection frame in the second image according to the inverse matrix of the transformation matrix to obtain a third selection frame;
and cutting a first selection frame in the first image and cutting a third selection frame in the second image to obtain a target image block pair.
3. The method of claim 2, wherein the first image is a visible light image of the pair of images and the second image is an infrared image of the pair of images; or, the first image is an infrared image in the image pair, and the second image is a visible light image in the image pair.
4. The method according to claim 1, wherein the image registration model is a deep neural network model including a depth feature extraction network, a mask prediction network, a channel cascade module, and a matrix estimation network, and the obtaining of the registration image pair of the target image block pair and the second transformation matrix corresponding to two image blocks in the registration image pair by using the target image block pair as training data and sequentially inputting the target image block pair into the initial image registration model comprises:
inputting a first image block and a second image block in the target image block pair into the depth feature extraction network respectively so as to extract a first depth feature of the first image block and a second depth feature of the second image block;
respectively inputting a first image block and a second image block in the target image block pair into the mask prediction network to obtain a first mask corresponding to the first image block and a second mask corresponding to the second image block;
weighting the first depth features by using the first mask to obtain a first feature map, and weighting the second depth features by using the second mask to obtain a second feature map;
inputting the first feature map and the second feature map into the channel cascade module to obtain a registered image pair of the target image block pair;
and inputting the registration image pair into the matrix estimation network to obtain a second transformation matrix corresponding to two image blocks in the registration image pair.
5. The method according to claim 4, wherein the inputting a first image block and a second image block in the target image block pair into the mask prediction network respectively to obtain a first mask corresponding to the first image block and a second mask corresponding to the second image block comprises:
respectively inputting a first image block and a second image block in the target image block pair into the mask prediction network, so that the mask prediction network learns to generate a first mask equal in size to the first image block, in which a first contribution estimation value corresponding to each pixel of the first image block is labeled, and learns to generate a second mask equal in size to the second image block, in which a second contribution estimation value corresponding to each pixel of the second image block is labeled;
the weighting the first depth feature by using the first mask to obtain a first feature map, and weighting the second depth feature by using the second mask to obtain a second feature map, including:
weighting the first depth feature to obtain a first feature map by using a first contribution estimation value corresponding to each pixel in the first image block labeled in the first mask, and weighting the second depth feature to obtain a second feature map by using a second contribution estimation value corresponding to each pixel in the second image block labeled in the second mask.
6. The method according to claim 1, wherein after calculating the first transformation matrix corresponding to two image blocks in the target image block pair, further comprising:
performing equivalent transformation on the first transformation matrix according to the corner point coordinate offsets of the first image block and the second image block to obtain a first equivalent matrix;
after obtaining the second transformation matrices corresponding to the two image blocks in the registered image pair, the method further includes:
performing equivalent transformation estimation on the second transformation matrix according to the corner point coordinate offsets in the second transformation matrix to obtain a second equivalent matrix;
the calculating a difference value between the first transformation matrix and the second transformation matrix according to a preset loss function includes:
and calculating a difference value between the first equivalent matrix and the second equivalent matrix according to a preset loss function.
7. The method according to claim 6, characterized in that the loss function is determined from the offset of each corner point in the first equivalent matrix and the offset of each corner point in the second equivalent matrix.
8. The method of claim 1, wherein after obtaining the trained image registration model, further comprising:
inputting an image pair to be registered into the trained image registration model, wherein the image pair to be registered comprises an infrared image and a visible light image in the same scene;
and outputting a registration image pair through the trained image registration model.
9. An apparatus for training an image registration model, the apparatus comprising:
a dataset acquisition module for acquiring a dataset comprising registered image pairs, wherein each image pair comprises a visible light image and an infrared image in the same scene;
the image cropping module is used for cropping each image pair in the data set to obtain a target image block pair, wherein the target image block pair comprises a first image block cropped from a visible light image and a second image block cropped from an infrared image, and the first image block and the second image block correspond to the same position in the image pair and have a preset random offset;
the matrix calculation module is used for calculating first transformation matrixes corresponding to two image blocks in the target image block pair;
the data training module is used for taking the target image block pair as training data, and sequentially inputting the target image block pair into an initial image registration model to obtain a registration image pair of the target image block pair and obtain second transformation matrixes corresponding to two image blocks in the registration image pair, wherein the image registration model is a deep neural network model;
and the parameter adjusting module is used for calculating a difference value between the first transformation matrix and the second transformation matrix according to a preset loss function, and updating the network parameters of the image registration model according to the calculated difference value until the calculated difference value is smaller than a preset threshold value, so as to obtain the trained image registration model.
10. The apparatus of claim 9, wherein the image cropping module comprises:
the first selection submodule is used for randomly determining a first selection frame in a first image in the image pair and acquiring first coordinates corresponding to four corner points of the first selection frame;
a second selection submodule for determining a second selection frame in a second image of the pair of images according to the first coordinate;
the coordinate offset submodule is used for randomly offsetting the four corner points of the second selection frame according to a preset offset to obtain second coordinates corresponding to the four corner points after offset;
a transformation calculation submodule for calculating a transformation matrix between the first coordinate and the second coordinate;
the third selection submodule is used for carrying out perspective transformation on the second selection frame in the second image according to the inverse matrix of the transformation matrix to obtain a third selection frame;
and the image cropping submodule is used for cropping the first selection frame from the first image and cropping the third selection frame from the second image to obtain a target image block pair.
11. The apparatus of claim 10, wherein the first image is a visible light image of the pair of images and the second image is an infrared image of the pair of images; or, the first image is an infrared image in the image pair, and the second image is a visible light image in the image pair.
12. The apparatus of claim 9, wherein the image registration model is a deep neural network model comprising a deep feature extraction network, a mask prediction network, a channel cascade module, and a matrix estimation network, and the data training module comprises:
the feature extraction sub-module is used for respectively inputting a first image block and a second image block in the target image block pair into the depth feature extraction network so as to extract a first depth feature of the first image block and a second depth feature of the second image block;
the mask prediction sub-module is used for respectively inputting a first image block and a second image block in the target image block pair into the mask prediction network so as to obtain a first mask corresponding to the first image block and a second mask corresponding to the second image block;
the mask superposition submodule is used for weighting the first depth features by using the first mask to obtain a first feature map and weighting the second depth features by using the second mask to obtain a second feature map;
a cascade processing submodule, configured to input the first feature map and the second feature map into the channel cascade module, so as to obtain a registered image pair of the target image block pair;
and the matrix estimation submodule is used for inputting the registration image pair into the matrix estimation network so as to obtain a second transformation matrix corresponding to two image blocks in the registration image pair.
13. The apparatus according to claim 12, wherein the mask prediction sub-module is specifically configured to input the first image block and the second image block in the target image block pair into the mask prediction network respectively, so that the mask prediction network learns to generate a first mask equal in size to the first image block, in which a first contribution estimation value corresponding to each pixel of the first image block is labeled, and learns to generate a second mask equal in size to the second image block, in which a second contribution estimation value corresponding to each pixel of the second image block is labeled;
the mask superposition sub-module is specifically configured to weight the first depth feature with the first contribution estimation values labeled in the first mask to obtain a first feature map, and to weight the second depth feature with the second contribution estimation values labeled in the second mask to obtain a second feature map.
14. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions, when executed by the processor, implementing the steps of the method of training an image registration model according to any one of claims 1 to 8.
CN202011541901.3A 2020-12-23 2020-12-23 Method and device for training image registration model and electronic equipment Withdrawn CN112561973A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011541901.3A CN112561973A (en) 2020-12-23 2020-12-23 Method and device for training image registration model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011541901.3A CN112561973A (en) 2020-12-23 2020-12-23 Method and device for training image registration model and electronic equipment

Publications (1)

Publication Number Publication Date
CN112561973A true CN112561973A (en) 2021-03-26

Family

ID=75031758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011541901.3A Withdrawn CN112561973A (en) 2020-12-23 2020-12-23 Method and device for training image registration model and electronic equipment

Country Status (1)

Country Link
CN (1) CN112561973A (en)


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
龙勇志 (Long Yongzhi): "Research on Registration and Fusion Algorithms for Infrared and Visible Light Images", China Excellent Master's Theses Full-text Database, Information Science and Technology Series, no. 07 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113706450A (en) * 2021-05-18 2021-11-26 腾讯科技(深圳)有限公司 Image registration method, device, equipment and readable storage medium
CN113596341A (en) * 2021-06-11 2021-11-02 北京迈格威科技有限公司 Image shooting method, image processing device and electronic equipment
CN113596341B (en) * 2021-06-11 2024-04-05 北京迈格威科技有限公司 Image shooting method, image processing device and electronic equipment
CN113487656A (en) * 2021-07-26 2021-10-08 推想医疗科技股份有限公司 Image registration method and device, training method and device, control method and device
CN114419869A (en) * 2022-03-30 2022-04-29 北京启醒科技有限公司 Urban disaster early warning method and system based on time sequence multi-dimensional prediction
CN114419869B (en) * 2022-03-30 2022-07-26 北京启醒科技有限公司 Urban disaster early warning method and system based on time sequence multi-dimensional prediction

Similar Documents

Publication Publication Date Title
CN112561973A (en) Method and device for training image registration model and electronic equipment
JP2022534337A (en) Video target tracking method and apparatus, computer apparatus, program
CN108062525B (en) Deep learning hand detection method based on hand region prediction
CN110163188B (en) Video processing and method, device and equipment for embedding target object in video
Yang et al. A multi-task Faster R-CNN method for 3D vehicle detection based on a single image
CN112597941A (en) Face recognition method and device and electronic equipment
WO2022179581A1 (en) Image processing method and related device
CN112561846A (en) Method and device for training image fusion model and electronic equipment
CN107194948B (en) Video significance detection method based on integrated prediction and time-space domain propagation
WO2022052782A1 (en) Image processing method and related device
CN113850136A (en) Yolov5 and BCNN-based vehicle orientation identification method and system
CN113160231A (en) Sample generation method, sample generation device and electronic equipment
CN113112542A (en) Visual positioning method and device, electronic equipment and storage medium
CN115482523A (en) Small object target detection method and system of lightweight multi-scale attention mechanism
CN116977674A (en) Image matching method, related device, storage medium and program product
Sharjeel et al. Real time drone detection by moving camera using COROLA and CNN algorithm
WO2022247126A1 (en) Visual localization method and apparatus, and device, medium and program
CN111222459A (en) Visual angle-independent video three-dimensional human body posture identification method
CN116580151A (en) Human body three-dimensional model construction method, electronic equipment and storage medium
CN114565777A (en) Data processing method and device
CN113537359A (en) Training data generation method and device, computer readable medium and electronic equipment
Álvarez et al. A new marker design for a robust marker tracking system against occlusions
CN113112522A (en) Twin network target tracking method based on deformable convolution and template updating
Xue et al. An end-to-end multi-resolution feature fusion defogging network
Wang et al. Learning to remove reflections from windshield images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210326