CN116597177A - Multi-source image block matching method based on dual-branch parallel depth interaction cooperation - Google Patents

Multi-source image block matching method based on dual-branch parallel depth interaction cooperation

Info

Publication number
CN116597177A
CN116597177A (application CN202310216711.1A)
Authority
CN
China
Prior art keywords
feature
module
network
channel
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310216711.1A
Other languages
Chinese (zh)
Inventor
张艳宁
张秀伟
李艳平
孙怡
王文娜
邢颖慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202310216711.1A priority Critical patent/CN116597177A/en
Publication of CN116597177A publication Critical patent/CN116597177A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74: Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-source image block matching method based on dual-branch parallel depth interaction, and belongs to the technical field of image processing. A high-precision multi-source image block matching network model is designed: a four-branch network formed by twin and pseudo-twin sub-networks extracts the common and private features of cross-modal images. A multi-scale channel attention module encodes features at different scales, each scale extracting cross-modal consistency features through channel attention; meanwhile, a spatial correlation feature enhancement module is designed to learn the degree of correlation between the features of different modalities. Finally, a depth interaction fusion and prediction module fuses the spatial and channel features and predicts the matching result. The average FPR95 of the proposed SCCA-Net on the VL-CMIM dataset reaches 3.95, far outperforming other methods.

Description

Multi-source image block matching method based on dual-branch parallel depth interaction cooperation
Technical Field
The invention belongs to the technical field of image processing and relates to a multi-source image block matching method based on dual-branch parallel depth interaction.
Background
In recent years, the variety and number of sensors have increased rapidly. Multi-source images exhibit strong differences and complementarity, allowing a more complete description of the current target or scene to be obtained. Multi-source images are therefore widely used in target detection and tracking, military reconnaissance, medical detection, security monitoring, and other fields. Research on multi-source image processing thus has important theoretical and practical significance.
Because the imaging principles and operating environments of the various sensors differ, there are differences among multi-source images. How to match images from different sensors, providing technical support for subsequent processing, is therefore a problem that must be solved in multi-source image processing. As an important link in image processing, image matching is the basis of many complex image processing tasks, and the success or failure of matching determines the quality of the next processing step. This has made image matching one of the most rapidly developing image processing techniques in recent years. Most existing automatic image matching techniques target single-source images, mainly visible-light images. For multi-source images with inconsistent information characterization, however, matching with single-source techniques leads to a significant drop in matching accuracy and robustness. Research on automatic matching techniques for multi-source images is therefore of great importance for multi-source image processing.
The difficulty of multi-source image matching is that the characterization of information between images from different sources is inconsistent and unstable. Existing feature-based multi-source image matching algorithms focus largely on how to resolve the inconsistent information characterization. Because there are many sensor types and the characteristics of the images each sensor produces differ greatly, it remains difficult to find a feature description suitable for most applications; for example, a feature designed for a certain single band of hyperspectral imagery and visible-light images can hardly describe infrared and visible-light images accurately. In addition, because the information characterization is unstable, features tailored to a specific sensor cannot guarantee high accuracy either; for example, an infrared sensor is sensitive to temperature, so the information it captures differs across times of day and seasons, and a given feature cannot describe the same points consistently. Therefore, how to combine multiple methods and reasonably exploit the advantages of each has great theoretical and practical significance for improving the robustness of multi-source image matching.
Traditional methods based on hand-crafted descriptors rely heavily on human prior knowledge and struggle to capture the latent pattern information in the data effectively. In recent years, image matching methods based on deep learning have therefore become a research hot spot; these methods match automatically by learning image features, avoiding the limitations of traditional methods.
Image matching methods based on deep learning can be divided into two types: descriptor learning methods and metric learning methods. Descriptor learning methods use a convolutional neural network to extract high-level features from an image and match images according to feature distances (e.g. the L2 distance); a pair is judged a correct match if the matching distance is below a preselected threshold. Metric learning methods mainly take the original image pair or generated feature descriptors as input, learn a similarity measure between image blocks, convert the matching task into a classification task, and output the matching label of the image block pair. This approach requires no hand-designed metric criterion; the end-to-end network architecture directly learns the mapping from image blocks to matching labels. In 2016, Kumar et al., in "Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 5385", proposed using twin (Siamese) and triplet networks with a global loss that minimizes the average distance between matching pairs and maximizes the average distance between non-matching pairs, improving image matching performance. In 2018, Ono et al., in "Learning local features from images [J]. Advances in Neural Information Processing Systems, 2018, 31", first proposed the end-to-end image matching network LF-Net, in which the whole network learns key point positions, scales, orientations and feature descriptors through a multi-scale fully convolutional network and a key point description network. In 2019, Shen et al., in "RF-Net: an end-to-end image matching network based on receptive field [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 8132-8140", improved the feature detection network: when extracting the feature point response distribution, detection accuracy is improved through multi-scale receptive fields, and a new loss function with a neighbor mask is introduced to improve matching performance.
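As a minimal illustration of the descriptor-learning matching rule just described (thresholded L2 distance between learned descriptors), the following sketch can be used; the threshold value is illustrative, not taken from the cited works:

```python
import torch

def is_match(desc_a: torch.Tensor, desc_b: torch.Tensor,
             threshold: float = 0.8) -> bool:
    """Declare a correct matching pair if the L2 distance between the
    two learned descriptors falls below a preselected threshold."""
    return torch.dist(desc_a, desc_b, p=2).item() < threshold

# e.g. 128-dimensional descriptors extracted by a CNN
a, b = torch.randn(128), torch.randn(128)
print(is_match(a, b))
```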
However, deep-learning-based image matching methods still have accuracy problems: 1) because of the large appearance differences among multi-source images, features of the same image pair at different scales may have different similarities, and existing methods struggle to measure image similarity accurately; 2) during feature extraction, existing methods focus on the spatial information learned in the shallow layers of the network and do not fully exploit the feature details and semantic information retained in the middle and deep layers, so the deep network cannot perform effective feature interaction learning. It is therefore necessary to design a high-precision image matching network.
Disclosure of Invention
Technical problem to be solved
To address the low accuracy of existing image matching methods, the invention provides a multi-source image block matching method based on dual-branch parallel depth interaction.
Technical proposal
A multi-source image block matching model based on dual-branch parallel depth interaction, characterized by comprising a four-branch multi-modal feature extraction module, a multi-scale channel attention module, a spatial correlation feature enhancement module, and a depth interaction fusion and prediction module;
the four-branch multi-modal feature extraction module inputs the two image block pairs T1 and T2 of different modalities into a twin sub-network and a pseudo-twin sub-network respectively to extract features, where the twin sub-network denotes two branches with shared parameters and identical structure, and the pseudo-twin sub-network denotes branches with identical structure but unshared parameters; the two high-level semantic information vectors learned from the same-modality image in the different sub-networks are then stacked along the channel dimension to obtain the feature vectors F_m1 and F_m2;
the multi-scale channel attention module stacks the features F_m1 and F_m2 output by the four-branch multi-modal feature extraction module along the channel dimension and then applies convolutions with 3×3, 5×5, 7×7 and 9×9 kernels to obtain four sets of feature vectors F_0, F_1, F_2, F_3 at different scales; channel attention SE is applied at each scale and multiplied element-wise with that scale's features to obtain the feature vectors F_0', F_1', F_2', F_3'; F_0', F_1', F_2', F_3' are then stacked along the channel dimension, the stacked feature vector is input into a Transformer encoder module, and the resulting feature representation is added to itself in a residual manner to obtain the feature vector F_msa as output;
the spatial correlation feature enhancement module performs a correlation operation on the features F_m1 and F_m2 output by the four-branch multi-modal feature extraction module and learns their degree of correlation; the encoder structure of the Transformer module is then used to establish long-range dependencies and acquire global context information, and the resulting feature representation is added to itself in a residual manner to obtain the feature vector F_psa as output;
the depth interaction fusion and prediction module sends the feature vectors F_psa and F_msa into the spatial correlation feature enhancement module and the multi-scale channel attention module, and finally predicts the final network result using three fully connected layers.
A multi-source image block matching method based on dual-branch parallel depth interaction is characterized by comprising the following steps:
Step 1: the image block pairs T1 and T2 of different modalities are input into a twin sub-network and a pseudo-twin sub-network respectively to extract features, and the two high-level semantic information vectors learned from the same-modality image in the different sub-networks are stacked along the channel dimension to obtain the feature vectors F_m1 and F_m2;
Step 2: the features F_m1 and F_m2 output by the four-branch multi-modal feature extraction module are stacked along the channel dimension to obtain a feature map F_u whose resolution is denoted H_u × W_u, where H_u = H_input, W_u = W_input, and H_input and W_input are the resolution of the input image pair T; F_u is convolved with 3×3, 5×5, 7×7 and 9×9 kernels respectively to obtain the new feature vectors F_0, F_1, F_2, F_3, each set of feature vectors having resolution H_s × W_s and C_s channels, with H_s = H_u/4, W_s = W_u/4 and C_s = C_u/4, where H_u and W_u are the resolution of F_u and C_u is its number of channels; channel attention SE is then applied at each scale and multiplied element-wise with that scale's features to obtain the feature vectors F_0', F_1', F_2', F_3'; F_0', F_1', F_2', F_3' are stacked along the channel dimension, the stacked feature vector is input into a Transformer encoder module, the feature information of the different scales is fused by the two multi-head attention modules in the Transformer module, and the result is added to itself in a residual manner to obtain the feature vector F_msa as output;
Step 3: the features F_m1 and F_m2 extracted by the four-branch multi-modal feature extraction module are input into the spatial correlation feature enhancement module, a correlation operation is performed on the feature maps F_m1 and F_m2, and their degree of correlation is learned; the encoder structure of the Transformer module is then used to establish long-range dependencies and acquire global context information, and the resulting feature representation is added to itself in a residual manner to obtain F_psa as output;
Step 4: the feature F_psa output by the spatial correlation feature enhancement module and the feature F_msa output by the multi-scale channel attention module are sent into the spatial correlation feature enhancement module and the multi-scale channel attention module again, and the resulting feature vectors are stacked along the channel dimension to generate the final feature descriptor, promoting information flow between the modalities so that the images of the two modalities pass cross-domain consistency features to each other; finally, the final network result is predicted using three fully connected layers;
Step 5: during network training, the cross entropy loss is computed from the prediction result and the real matching label, then backpropagation is performed; the iterations are repeated until the iteration count reaches the set value, at which point training is judged complete.
A further technical solution of the invention: the three fully connected layers in step 4 have sizes 512, 128 and 2 respectively.
A computer system, comprising: one or more processors, a computer-readable storage medium storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods described above.
A computer readable storage medium, characterized by storing computer executable instructions that, when executed, implement the method described above.
Advantageous effects
The invention provides a multi-source image block matching method based on dual-branch parallel depth interaction and designs a high-precision multi-source image block matching network model: a four-branch network formed by twin and pseudo-twin sub-networks extracts the common and private features of cross-modal images. A multi-scale channel attention module encodes features at different scales, each scale extracting cross-modal consistency features through channel attention; meanwhile, a spatial correlation feature enhancement module is designed to learn the degree of correlation between the features of different modalities. Finally, a depth interaction fusion and prediction module fuses the spatial and channel features and predicts the matching result. The average FPR95 of the proposed SCCA-Net on the VL-CMIM dataset reaches 3.95, far outperforming other methods.
1) The method can extract the common and private features in multi-source images and eliminate the influence of interference factors such as modality, shadow and seasonal variation. This means the invention can capture the differences between images more accurately, thereby improving matching accuracy;
2) The matching result obtained by the invention is more accurate. Compared with the prior art, the method can effectively handle hard-to-match image blocks, thereby improving matching precision and accuracy.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
Fig. 1 is a network configuration diagram of a method according to an embodiment of the present invention.
FIG. 2 is a block diagram of a multi-scale channel attention module in a network model according to an embodiment of the present invention.
FIG. 3 is a block diagram of a spatial correlation module in a method model according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of a deep interaction fusion and prediction module in a method model according to an embodiment of the invention.
Fig. 5 is a diagram of the channel attention SE structure of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The invention provides a multi-source image block matching method based on dual-branch parallel depth interaction and designs a high-precision multi-source image block matching network model for it.
The four-branch multi-modal feature extraction module inputs the two image block pairs T1 and T2 of different modalities into a twin sub-network and a pseudo-twin sub-network respectively to extract features, where the twin sub-network denotes two branches with shared parameters and identical structure, and the pseudo-twin sub-network denotes branches with identical structure but unshared parameters. The two high-level semantic information vectors learned from the same-modality image in the different sub-networks are then stacked along the channel dimension to obtain the feature vectors F_m1 and F_m2.
The multi-scale channel attention module stacks the features F_m1 and F_m2 output by the four-branch multi-modal feature extraction module along the channel dimension and then applies convolutions with 3×3, 5×5, 7×7 and 9×9 kernels to obtain four sets of feature vectors F_0, F_1, F_2, F_3 at different scales; channel attention SE is applied at each scale and multiplied element-wise with that scale's features to obtain the feature vectors F_0', F_1', F_2', F_3'; F_0', F_1', F_2', F_3' are then stacked along the channel dimension, the stacked feature vector is input into a Transformer encoder module, and the resulting feature representation is added to itself in a residual manner to obtain the feature vector F_msa as output;
the spatial correlation feature enhancement module performs a correlation operation on the features F_m1 and F_m2 output by the four-branch multi-modal feature extraction module and learns their degree of correlation; the encoder structure of the Transformer module is then used to establish long-range dependencies and acquire global context information, and the resulting feature representation is added to itself in a residual manner to obtain the feature vector F_psa as output;
the depth interaction fusion and prediction module sends the feature vectors F_psa and F_msa into the spatial correlation feature enhancement module and the multi-scale channel attention module, and finally predicts the final network result using three fully connected layers (of sizes 512, 128 and 2 respectively).
The method specifically comprises the following steps:
S1, the two images T1 and T2 of different modalities are input into the twin sub-network and pseudo-twin sub-network modules for feature extraction. The input image of each branch passes in sequence through convolution module 1, pooling layer 1, convolution module 2, convolution module 3, pooling layer 2 and convolution module 4. Convolution module 1 consists of 2 convolution layers, each containing 32 3×3 convolution kernels with stride 1. Pooling layer 1 is a max pooling operation with stride 2 and kernel size 3×3. Convolution module 2 likewise consists of 2 convolution layers, each containing 64 3×3 convolution kernels with stride 1. Convolution module 3 consists of 1 convolution layer containing 128 3×3 convolution kernels with stride 1. Pooling layer 2 is a max pooling operation with stride 2 and kernel size 3×3. Convolution module 4 consists of 2 convolution layers, each containing 128 3×3 convolution kernels with stride 1. The first three convolution layers are each followed by an instance normalization layer and a batch normalization layer to reduce illumination variation and the feature differences caused by different imaging mechanisms; the last four convolution layers are each followed by a batch normalization layer.
The two high-level semantic information vectors learned from the same-modality image in the different sub-networks are then stacked along the channel dimension to obtain the feature vectors F_m1 and F_m2.
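The following PyTorch sketch illustrates one of the four identical branches under the layer sizes just listed; the single-channel 64×64 input, padding choices and ReLU placement are assumptions, since the text does not fix them:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, instance_norm):
    """3x3 conv with stride 1; the first three conv layers of a branch are
    followed by instance norm + batch norm, the last four by batch norm only."""
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)]
    if instance_norm:
        layers.append(nn.InstanceNorm2d(out_ch))
    layers += [nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
    return layers

class Branch(nn.Module):
    def __init__(self):
        super().__init__()
        layers = []
        layers += conv_block(1, 32, True)      # convolution module 1 (2 layers)
        layers += conv_block(32, 32, True)
        layers.append(nn.MaxPool2d(3, stride=2, padding=1))   # pooling layer 1
        layers += conv_block(32, 64, True)     # convolution module 2 (2 layers)
        layers += conv_block(64, 64, False)
        layers += conv_block(64, 128, False)   # convolution module 3 (1 layer)
        layers.append(nn.MaxPool2d(3, stride=2, padding=1))   # pooling layer 2
        layers += conv_block(128, 128, False)  # convolution module 4 (2 layers)
        layers += conv_block(128, 128, False)
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

feat = Branch()(torch.randn(1, 1, 64, 64))    # a 64x64 image block
print(feat.shape)                             # torch.Size([1, 128, 16, 16])
```

Under these assumptions each branch yields a 128-channel 16×16 map, so the channel-stacked F_m1 and F_m2 would each have 256 channels.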
S2, the features F_m1 and F_m2 output by the four-branch multi-modal feature extraction module are stacked along the channel dimension to obtain the feature map F_u, whose resolution is denoted H_u × W_u, where H_u = H_input, W_u = W_input, and H_input and W_input are the resolution of the input image pair T. F_u is convolved with 3×3, 5×5, 7×7 and 9×9 kernels respectively to obtain the new feature vectors F_0, F_1, F_2, F_3; each set of feature vectors has resolution H_s × W_s and C_s channels, with H_s = H_u/4, W_s = W_u/4 and C_s = C_u/4, where H_u and W_u are the resolution of F_u and C_u is its number of channels. Channel attention SE (Squeeze-and-Excitation) is then applied at each scale and multiplied element-wise with that scale's features to obtain the feature vectors F_0', F_1', F_2', F_3'; F_0', F_1', F_2', F_3' are stacked along the channel dimension, the stacked feature vector is input into a Transformer encoder module, the feature information of the different scales is fused by the two multi-head attention modules in the Transformer module, and the result is added to itself in a residual manner to obtain the feature vector F_msa, computed by the following formulas:
[F_0, F_1, F_2, F_3] = α(f_{3×3}(F_u), f_{5×5}(F_u), f_{7×7}(F_u), f_{9×9}(F_u))
F_m' = SE(F_m) ⊗ F_m,  m = 0, 1, 2, 3
F_msa = TF(Concat(F_0', F_1', F_2', F_3')) + Concat(F_0', F_1', F_2', F_3')
where f_{n×n} denotes an n×n convolution applied to the feature vector, α denotes the Sigmoid activation function, and F_m denotes the four features at different scales, m being 0, 1, 2, 3, i.e. F_0, F_1, F_2, F_3; SE denotes channel attention, ⊗ denotes element-wise multiplication, Concat denotes concatenation along the channel dimension, and TF denotes the Transformer encoder module.
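A minimal sketch of this module follows, assuming the four convolutions use stride 4 to realize H_s = H_u/4 and W_s = W_u/4, that α from the first formula is applied to each convolution output, and that the Transformer encoder treats the H_s × W_s positions as tokens; none of these details is fixed by the text:

```python
import torch
import torch.nn as nn

class SE(nn.Module):
    """Channel attention SE (Fig. 5): global average pool -> FC -> ReLU -> FC -> Sigmoid."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // r), nn.ReLU(inplace=True),
            nn.Linear(ch // r, ch), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))         # squeeze over spatial dims
        return w[:, :, None, None]              # per-channel weights SE(F_m)

class MultiScaleChannelAttention(nn.Module):
    def __init__(self, c_u=256, nhead=8):
        super().__init__()
        c_s = c_u // 4                          # C_s = C_u / 4
        self.convs = nn.ModuleList(
            nn.Conv2d(c_u, c_s, k, stride=4, padding=k // 2)
            for k in (3, 5, 7, 9))              # f_3x3 ... f_9x9
        self.ses = nn.ModuleList(SE(c_s) for _ in range(4))
        enc = nn.TransformerEncoderLayer(d_model=c_u, nhead=nhead, batch_first=True)
        self.tf = nn.TransformerEncoder(enc, num_layers=2)  # two attention blocks

    def forward(self, f_u):
        feats = []
        for conv, se in zip(self.convs, self.ses):
            f = torch.sigmoid(conv(f_u))        # F_m = alpha(f_nxn(F_u))
            feats.append(se(f) * f)             # F_m' = SE(F_m) (x) F_m
        x = torch.cat(feats, dim=1)             # Concat along channels -> C_u
        tokens = x.flatten(2).transpose(1, 2)   # (B, H_s*W_s, C_u) tokens
        return self.tf(tokens) + tokens         # F_msa = TF(x) + x

f_msa = MultiScaleChannelAttention()(torch.randn(1, 256, 16, 16))
print(f_msa.shape)                              # torch.Size([1, 16, 256])
```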
S3, the spatial correlation feature enhancement module first performs a correlation operation on the feature vectors F_m1 and F_m2 so as to learn their degree of correlation. In this process, the correlation layer fuses the information in the different feature vectors to obtain richer feature information. The computation of the correlation layer is similar to that of a convolution layer, except that a convolution layer learns feature weights, whereas the correlation layer correlates the hidden-layer outputs of the two branches with each other. The Transformer encoder module adopts a self-attention mechanism, so the model can compute in parallel to obtain global information; global context information is finally captured through multiple layers of self-attention to obtain a new feature vector. The obtained feature vector is added to itself in a residual manner to give the feature vector F_psa, computed by the following formula:
F_psa = TF(F_c) + F_c
where f_1 and f_2 are the features input to the correlation layer (in this module f_1 denotes F_m1 and f_2 denotes F_m2); x_1 and x_2 denote coordinate points in the two feature vectors; o denotes the region to be compared; k denotes the range of the region to be compared and is set to 3; F_c denotes the feature obtained from the correlation-layer computation; ⊗ denotes element-wise multiplication; and TF denotes the Transformer encoder module.
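The exact correlation formula is not reproduced above; a hedged sketch in the FlowNet style, consistent with the variable list just given (positions x_1 and x_2, offset o, neighborhood range k = 3), would be:

```python
import torch
import torch.nn.functional as F

def correlation(f1, f2, k=3):
    """Correlate each position x1 of f1 with the positions of f2 inside a
    (2k+1) x (2k+1) neighborhood, producing one correlation map per offset o.

    f1, f2: (B, C, H, W) feature maps from the two branches (F_m1, F_m2).
    Returns F_c with shape (B, (2k+1)**2, H, W).
    """
    b, c, h, w = f1.shape
    f2 = F.pad(f2, (k, k, k, k))                      # zero-pad so every offset exists
    maps = []
    for dy in range(2 * k + 1):
        for dx in range(2 * k + 1):
            shifted = f2[:, :, dy:dy + h, dx:dx + w]  # f2 displaced by offset o
            maps.append((f1 * shifted).sum(dim=1, keepdim=True) / c)  # channel inner product
    return torch.cat(maps, dim=1)

f_c = correlation(torch.randn(1, 128, 16, 16), torch.randn(1, 128, 16, 16))
print(f_c.shape)  # torch.Size([1, 49, 16, 16])
```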
S4, the depth interaction fusion and prediction module sends the feature vectors F_psa and F_msa into the spatial correlation feature enhancement module and the multi-scale channel attention module again, promoting deep interaction of information between the different modalities within the network, so that the model can extract more discriminative features and the accuracy of multi-source image matching is improved. Finally, three fully connected layers (of sizes 512, 128 and 2 respectively) are used to predict the final network result.
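A minimal sketch of the prediction head is given below, under the assumption that the two interaction outputs are stacked along the channel dimension and flattened before the three fully connected layers; the input dimension shown is an assumption tied to the feature sizes used in the sketches above:

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Three fully connected layers of sizes 512, 128 and 2 applied to the
    channel-stacked interaction features; the 2 outputs are the
    match / non-match logits."""
    def __init__(self, in_dim):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 2))

    def forward(self, f_psa, f_msa):
        fused = torch.cat([f_psa, f_msa], dim=1)  # stack along the channel dim
        return self.fc(fused.flatten(1))

# e.g. two (B, 256, 4, 4) interaction features -> in_dim = 2 * 256 * 4 * 4
head = PredictionHead(8192)
logits = head(torch.randn(2, 256, 4, 4), torch.randn(2, 256, 4, 4))
print(logits.shape)  # torch.Size([2, 2])
```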
S5, during network training the cross entropy loss is computed from the prediction result (match or non-match) and the real matching label, then backpropagation is performed; the iterations are repeated until the iteration count reaches the set value, at which point training is judged complete.
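A hedged sketch of this training step follows; the optimizer, learning rate, and the names `model` (the full SCCA-Net) and `loader` are assumptions, as the text specifies only the cross entropy loss, backpropagation, and the fixed iteration count:

```python
import torch
import torch.nn as nn

def train(model, loader, max_iters, lr=1e-3):
    """Cross entropy between the 2-way prediction and the real matching
    label, backpropagation, repeated until the set iteration count."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # assumed optimizer
    it = 0
    while it < max_iters:
        for t1, t2, label in loader:        # label: 1 = match, 0 = non-match
            logits = model(t1, t2)          # (B, 2) logits from the head
            loss = criterion(logits, label) # cross entropy loss
            optimizer.zero_grad()
            loss.backward()                 # backpropagation
            optimizer.step()
            it += 1
            if it >= max_iters:             # stop at the set iteration count
                return
```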
Specific examples:
As shown in fig. 1, the invention designs a model for a multi-source image block matching method based on dual-branch parallel depth interaction, addressing the insufficient accuracy of existing multi-source image matching results. The model comprises a four-branch multi-modal feature extraction module, a multi-scale channel attention module, a spatial correlation feature enhancement module and a depth interaction fusion and prediction module, shown in figs. 1, 2, 3 and 4 respectively. The specific method comprises the following steps:
s1, respectively inputting image block pairs T1 and T2 of different modes into a twin sub-network and a pseudo-twin sub-network to perform feature extraction, stacking two high-level semantic information vectors learned by the same mode image in different sub-networks along a channel dimension to obtain a feature vector F m1 、F m2
S2, the features F_m1 and F_m2 output by the four-branch multi-modal feature extraction module are stacked along the channel dimension to obtain the feature F_u; F_u is convolved with 3×3, 5×5, 7×7 and 9×9 kernels respectively to obtain the new feature vectors F_0, F_1, F_2, F_3, each set of feature vectors having resolution H_s × W_s and C_s channels, with H_s = H_u/4, W_s = W_u/4 and C_s = C_u/4, where H_u and W_u are the resolution of F_u and C_u is its number of channels. Channel attention SE is then applied at each scale and multiplied element-wise with that scale's features to obtain the feature vectors F_0', F_1', F_2', F_3'; F_0', F_1', F_2', F_3' are stacked along the channel dimension, the stacked feature vector is input into a Transformer encoder module, the feature information of the different scales is fused by the two multi-head attention modules in the Transformer module, and the result is added to itself in a residual manner to obtain the feature vector F_msa as output;
s3, extracting the characteristics F extracted by the twin sub-network and the pseudo twin sub-network m1 、F m2 Input to a spatial correlation feature enhancement module, using feature vector F m1 And F is equal to m2 And performing correlation operation, and learning the correlation degree between the two. Then, utilizing the encoder structure of the transducer module to establish a remote dependency relationship and acquire global context information, and adding the obtained feature vector with the encoder structure by a residual manner to obtain a feature vector F psa As an output;
s4, the feature vector F psa And F is equal to msa And sending the images to the spatial correlation characteristic enhancement module and the multi-scale channel attention module again to promote information flow between the modes, so that the images between the two modes mutually transmit cross-domain consistency characteristics. Finally, the network end result is predicted using three full connection layers of 512, 128,2 sizes, respectively.
In this embodiment, the network executing S1-S4 is referred to as SCCA-Net for short. The execution of S1-S4 is described in further detail below in connection with the structure of SCCA-Net.
In this embodiment, in S1 the multi-source image blocks T1 and T2 are input into the twin sub-network and the pseudo-twin sub-network respectively to learn the common and private features of the images, where the twin sub-network refers to two branches sharing parameters and the pseudo-twin sub-network refers to two branches not sharing parameters. The four branches have the same structure, and the input image of each branch passes in sequence through convolution module 1, pooling layer 1, convolution module 2, convolution module 3, pooling layer 2 and convolution module 4. Convolution module 1 consists of 2 convolution layers, each containing 32 3×3 convolution kernels with stride 1. Pooling layer 1 is a max pooling operation with stride 2 and kernel size 3×3. Convolution module 2 likewise consists of 2 convolution layers, each containing 64 3×3 convolution kernels with stride 1. Convolution module 3 consists of 1 convolution layer containing 128 3×3 convolution kernels with stride 1. Pooling layer 2 is a max pooling operation with stride 2 and kernel size 3×3. Convolution module 4 consists of 2 convolution layers, each containing 128 3×3 convolution kernels with stride 1. The first three convolution layers are each followed by an instance normalization layer and a batch normalization layer to reduce illumination variation and the feature differences caused by different imaging mechanisms; the last four convolution layers are each followed by a batch normalization layer.
Stacking the high-level semantic information vectors learned from the same-modality image in the different sub-networks along the channel dimension gives the feature vectors F_m1 and F_m2.
Referring to fig. 1, in S2 of the present embodiment the features F_m1 and F_m2 output by the four-branch multi-modal feature extraction module are stacked along the channel dimension to obtain the feature map F_u, whose resolution is denoted H_u × W_u, where H_u = H_input, W_u = W_input, and H_input and W_input are the resolution of the input image pair T. F_u is convolved with 3×3, 5×5, 7×7 and 9×9 kernels respectively to obtain the new feature vectors F_0, F_1, F_2, F_3; each set of feature vectors has resolution H_s × W_s and C_s channels, with H_s = H_u/4, W_s = W_u/4 and C_s = C_u/4, where H_u and W_u are the resolution of F_u and C_u is its number of channels. Channel attention SE is then applied at each scale and multiplied element-wise with that scale's features to obtain the feature vectors F_0', F_1', F_2', F_3'; F_0', F_1', F_2', F_3' are stacked along the channel dimension, the stacked feature vector is input into a Transformer encoder module, the feature information of the different scales is fused by the two multi-head attention modules in the Transformer module, and the result is added to itself in a residual manner to obtain the feature vector F_msa as output;
in this embodiment, referring to fig. 5, channel attention (SE) can encode and output each set of channel level correlations to obtain correlations between feature vectors. The module contains global average pooling, full connection layer, reLU activation function, full connection layer, and Sigmoid activation function. The dimension of the network input and output is unchanged.
In the present embodiment, referring to fig. 3, in S3 the features F_m1 and F_m2 extracted by the twin sub-network and pseudo-twin sub-network are input into the spatial correlation feature enhancement module, a correlation operation is performed on the feature vectors F_m1 and F_m2, and their degree of correlation is learned. The encoder structure of the Transformer module is then used to establish long-range dependencies and acquire global context information, and the resulting feature vector is added to itself in a residual manner to obtain the feature vector F_psa as output;
in this embodiment, referring to fig. 4, the feature vector F output by the spatial correlation feature enhancement module psa Feature vector F output by multi-scale channel attention module msa And sending the predicted network final result into a multi-scale channel attention module and a spatial correlation characteristic enhancement module again, and finally predicting the network final result by using three full-connection layers with the sizes of 512 and 128,2 respectively.
To verify the validity of the SCCA-Net network, this embodiment uses the VL-CMIM (visible light and thermal infrared) dataset for training and testing of the network framework and compares against other methods. VL-CMIM is the first large visible and thermal infrared image block matching dataset to date. It contains six categories with more than 2.9 million image block pairs: planetary, building, country, field, street and lake. This embodiment trains on the country category and tests on the other categories, with all image blocks of size 64×64 pixels.
The algorithm proposed in this embodiment is compared with 9 mainstream multi-source image matching methods; the specific results are shown in Table 1. The evaluation index is FPR95, i.e. the false positive rate when the true positive rate (recall of matching pairs) equals 95%, which quantitatively evaluates the matching performance of the network; the lower the index, the better the performance.
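The FPR95 index can be computed as in the following sketch, which implements the definition just given (false positive rate at the threshold where the recall of matching pairs first reaches 95%, reported in percent); the score and label conventions are assumptions:

```python
import numpy as np

def fpr95(scores, labels):
    """scores: higher = more confident match; labels: 1 = match, 0 = non-match."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # sort by confidence
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                 # true positives at each threshold
    fp = np.cumsum(1 - labels)             # false positives at each threshold
    tpr = tp / max(labels.sum(), 1)        # true positive rate (recall)
    idx = min(np.searchsorted(tpr, 0.95), len(fp) - 1)     # first cut with TPR >= 95%
    return 100.0 * fp[idx] / max((1 - labels).sum(), 1)    # FPR in percent

print(fpr95([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1]))
```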
Table 1 shows the image block matching results of the method of the invention and other existing methods.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made without departing from the spirit and scope of the invention.

Claims (5)

1. A multi-source image block matching model based on dual-branch parallel depth interaction, characterized by comprising a four-branch multi-modal feature extraction module, a multi-scale channel attention module, a spatial correlation feature enhancement module and a depth interaction fusion and prediction module;
the four-branch multi-modal feature extraction module inputs the two image block pairs T1 and T2 of different modalities into a twin sub-network and a pseudo-twin sub-network respectively to extract features, wherein the twin sub-network denotes two branches with shared parameters and identical structure, and the pseudo-twin sub-network denotes branches with identical structure but unshared parameters; the two high-level semantic information vectors learned from the same-modality image in the different sub-networks are then stacked along the channel dimension to obtain the feature vectors F_m1 and F_m2;
the multi-scale channel attention module stacks the features F_m1 and F_m2 output by the four-branch multi-modal feature extraction module along the channel dimension and then applies convolutions with 3×3, 5×5, 7×7 and 9×9 kernels to obtain four sets of feature vectors F_0, F_1, F_2, F_3 at different scales; channel attention SE is applied at each scale and multiplied element-wise with that scale's features to obtain the feature vectors F_0', F_1', F_2', F_3'; F_0', F_1', F_2', F_3' are then stacked along the channel dimension, the stacked feature vector is input into a Transformer encoder module, and the resulting feature representation is added to itself in a residual manner to obtain the feature vector F_msa as output;
the spatial correlation feature enhancement module performs a correlation operation on the features F_m1 and F_m2 output by the four-branch multi-modal feature extraction module and learns their degree of correlation; the encoder structure of the Transformer module is then used to establish long-range dependencies and acquire global context information, and the resulting feature representation is added to itself in a residual manner to obtain the feature vector F_psa as output;
the depth interaction fusion and prediction module sends the feature vectors F_psa and F_msa into the spatial correlation feature enhancement module and the multi-scale channel attention module, and finally predicts the final network result using three fully connected layers.
2. A multi-source image block matching method based on dual-branch parallel depth interaction implemented with the model of claim 1, characterized by comprising the following steps:
step 1: the image block pairs T1 and T2 of different modalities are input into a twin sub-network and a pseudo-twin sub-network respectively to extract features, and the two high-level semantic information vectors learned from the same-modality image in the different sub-networks are stacked along the channel dimension to obtain the feature vectors F_m1 and F_m2;
step 2: the features F_m1 and F_m2 output by the four-branch multi-modal feature extraction module are stacked along the channel dimension to obtain a feature map F_u whose resolution is denoted H_u × W_u, where H_u = H_input, W_u = W_input, and H_input and W_input are the resolution of the input image pair T; F_u is convolved with 3×3, 5×5, 7×7 and 9×9 kernels respectively to obtain the new feature vectors F_0, F_1, F_2, F_3, each set of feature vectors having resolution H_s × W_s and C_s channels, with H_s = H_u/4, W_s = W_u/4 and C_s = C_u/4, where H_u and W_u are the resolution of F_u and C_u is its number of channels; channel attention SE is then applied at each scale and multiplied element-wise with that scale's features to obtain the feature vectors F_0', F_1', F_2', F_3'; F_0', F_1', F_2', F_3' are stacked along the channel dimension, the stacked feature vector is input into a Transformer encoder module, the feature information of the different scales is fused by the two multi-head attention modules in the Transformer module, and the result is added to itself in a residual manner to obtain the feature vector F_msa as output;
step 3: the features F_m1 and F_m2 extracted by the four-branch multi-modal feature extraction module are input into the spatial correlation feature enhancement module, a correlation operation is performed on the feature maps F_m1 and F_m2, and their degree of correlation is learned; the encoder structure of the Transformer module is then used to establish long-range dependencies and acquire global context information, and the resulting feature representation is added to itself in a residual manner to obtain F_psa as output;
step 4: the feature F_psa output by the spatial correlation feature enhancement module and the feature F_msa output by the multi-scale channel attention module are sent into the spatial correlation feature enhancement module and the multi-scale channel attention module again, and the resulting feature vectors are stacked along the channel dimension to generate the final feature descriptor, promoting information flow between the modalities so that the images of the two modalities pass cross-domain consistency features to each other; finally, the final network result is predicted using three fully connected layers;
step 5: during network training, the cross entropy loss is computed from the prediction result and the real matching label, then backpropagation is performed; the iterations are repeated until the iteration count reaches the set value, at which point training is judged complete.
3. The multi-source image block matching method based on dual-branch parallel depth interaction according to claim 2, characterized in that the three fully connected layers in step 4 have sizes 512, 128 and 2 respectively.
4. A computer system, comprising: one or more processors, a computer-readable storage medium storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of claim 1.
5. A computer readable storage medium, characterized by storing computer executable instructions that, when executed, are adapted to implement the method of claim 1.
CN202310216711.1A 2023-03-08 2023-03-08 Multi-source image block matching method based on dual-branch parallel depth interaction cooperation Pending CN116597177A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310216711.1A CN116597177A (en) 2023-03-08 2023-03-08 Multi-source image block matching method based on dual-branch parallel depth interaction cooperation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310216711.1A CN116597177A (en) 2023-03-08 2023-03-08 Multi-source image block matching method based on dual-branch parallel depth interaction cooperation

Publications (1)

Publication Number Publication Date
CN116597177A true CN116597177A (en) 2023-08-15

Family

ID=87604996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310216711.1A Pending CN116597177A (en) 2023-03-08 2023-03-08 Multi-source image block matching method based on dual-branch parallel depth interaction cooperation

Country Status (1)

Country Link
CN (1) CN116597177A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117474963A (en) * 2023-10-18 2024-01-30 南京国础科学技术研究院有限公司 Multi-source satellite image registration method, system, storage medium and electronic equipment
CN117474963B (en) * 2023-10-18 2024-04-19 南京国础科学技术研究院有限公司 Multi-source satellite image registration method, system, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN108960140B (en) Pedestrian re-identification method based on multi-region feature extraction and fusion
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN111260594B (en) Unsupervised multi-mode image fusion method
CN108171209B (en) Face age estimation method for metric learning based on convolutional neural network
CN109712105B (en) Image salient object detection method combining color and depth information
CN109543602B (en) Pedestrian re-identification method based on multi-view image feature decomposition
Kang et al. Person re-identification between visible and thermal camera images based on deep residual CNN using single input
CN109743642B (en) Video abstract generation method based on hierarchical recurrent neural network
CN111625667A (en) Three-dimensional model cross-domain retrieval method and system based on complex background image
CN114419671B (en) Super-graph neural network-based pedestrian shielding re-identification method
CN113326735B (en) YOLOv 5-based multi-mode small target detection method
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
CN112801015A (en) Multi-mode face recognition method based on attention mechanism
WO2024037585A1 (en) Remote sensing image overall planning recommendation method based on content understanding
Li et al. ConvTransNet: A CNN–transformer network for change detection with multiscale global–local representations
CN116597177A (en) Multi-source image block matching method based on dual-branch parallel depth interaction cooperation
CN111523586A (en) Noise-aware-based full-network supervision target detection method
CN113011359B (en) Method for simultaneously detecting plane structure and generating plane description based on image and application
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
Qu et al. Pedestrian re-identification monitoring system based on deep convolutional neural network
Devi et al. Batch Normalized Siamese Network Deep Learning Based Image Similarity Estimation
CN116740480A (en) Multi-mode image fusion target tracking method
Prihasto et al. A survey of deep face recognition in the wild
CN115098646A (en) Multilevel relation analysis and mining method for image-text data
CN114140524A (en) Closed loop detection system and method for multi-scale feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination