CN117710207A - Image stitching method based on progressive alignment and interweaving fusion network - Google Patents

Image stitching method based on progressive alignment and interweaving fusion network

Info

Publication number
CN117710207A
CN202410159901.9A · CN117710207A
Authority
CN
China
Prior art keywords
image
aligned
fusion
target
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410159901.9A
Other languages
Chinese (zh)
Inventor
范晓婷
张重
周祥一
王荣禄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Normal University
Original Assignee
Tianjin Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Normal University
Priority to CN202410159901.9A
Publication of CN117710207A
Legal status: Pending

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses an image stitching method based on a progressive alignment and interweaving fusion network, which comprises the following steps: step S1, acquiring a training data set, wherein the training data set comprises a plurality of input image pairs, and each input image pair comprises a reference image I_R and a target image I_T; step S2, constructing an image stitching depth model; step S3, training the image stitching depth model based on the training data set and a preset overall loss function to obtain an image stitching target depth model; and step S4, stitching the image pair to be stitched by using the image stitching target depth model to obtain an image stitching result. The invention uses progressive alignment and interleaved fusion to perform deeply optimized warping and fusion of the images, reduces alignment distortion and seam artifacts, and achieves accurate image stitching.

Description

Image stitching method based on progressive alignment and interweaving fusion network
Technical Field
The invention relates to the field of image processing and deep learning, in particular to an image stitching method based on a progressive alignment and interweaving fusion network.
Background
In recent years, image stitching has been a very popular topic in the fields of computer graphics and multimedia display. Its purpose is to stitch reference images and target images that share overlapping regions into a high-quality, wide-field-of-view panoramic image. However, inconsistencies in the overlapping region between the reference image and the target image can lead to significant alignment distortion and seam artifacts. Therefore, obtaining a natural wide-field-of-view panoramic image remains a challenging task.
Currently, researchers have proposed a large number of image stitching methods. Conventional image stitching methods can be divided into global homography methods and local warping methods. Global homography methods estimate a global homography relation by matching complex geometric features, for example the automatic stitching method and the re-homography method. Local warping methods divide the image pair into uniform cells and construct locally parameterized warping constraints, including as-projective-as-possible warping and robust elastic warping.
Because deep convolutional neural networks have strong representation capability and flexible structure, some image stitching methods based on them achieve better performance. In general, an image stitching method based on a deep convolutional neural network obtains an aligned image pair by deep homography estimation and then fuses the aligned pair to generate a natural panoramic image. For example, Song et al. propose an end-to-end multi-homography estimation network that warps the reference image and the target image to achieve image stitching. Nie et al. achieve edge-preserving image stitching by learning multi-scale homographies. Dai et al. fuse the aligned image pair by learning edge information to obtain a panoramic view. These methods can effectively cope with low-texture scenes and unnatural cases. However, the performance of these image stitching methods still needs further improvement.
In carrying out the invention, the inventors have found that the prior art has at least the following drawbacks and deficiencies:
For the image alignment stage, prior-art methods generally use only the information of the reference image or of the target image and do not consider the cooperative relationship between the two, so alignment distortion is produced when scenes with different parallax are processed. For the image fusion stage, prior-art methods generally fuse the aligned image pair with a simple deep convolutional neural network fusion strategy and ignore the complementary information between the aligned images, so obvious seam artifacts remain in the panoramic image.
Disclosure of Invention
The invention designs an image stitching method based on a progressive alignment and interweaving fusion network. The invention uses a progressive homography alignment sub-model to warp the overlapping region of the input image pair and reduce alignment inconsistency, adopts an interleaved image fusion sub-model to fuse the aligned image pair and reduce seam distortion, and constructs an alignment loss and a fusion loss to reduce the geometric distortion and seam artifacts of the panoramic image. The image stitching method can obtain high-quality image stitching results while reducing alignment distortion and seam artifacts.
In order to achieve the above purpose, the image stitching method based on the progressive alignment and interweaving fusion network provided by the invention comprises the following steps:
Step S1, acquiring a training data set, wherein the training data set comprises a plurality of input image pairs, and each input image pair comprises a reference image I_R and a target image I_T;
Step S2, constructing an image stitching depth model;
Step S3, training the image stitching depth model based on the training data set and a preset overall loss function to obtain an image stitching target depth model;
Step S4, stitching the image pair to be stitched by using the image stitching target depth model to obtain an image stitching result.
Optionally, the image stitching depth model comprises a progressive alignment sub-model and an interleaved image fusion sub-model, where the progressive alignment sub-model is used to align the input image pair to obtain an aligned image pair, and the interleaved image fusion sub-model is used to fuse the aligned image pair to obtain the image stitching result.
Optionally, the progressive alignment sub-model sequentially comprises m convolution layers, m cross feature cooperation modules, m contextual correlation layers and m spatial transformation modules, where the m convolution layers and the m cross feature cooperation modules are alternately connected and m is a natural number.
Optionally, the cross feature cooperation module is composed of a reference image branch, a target image branch, and a cascade branch of the reference image and the target image, wherein:
the reference image branch and the target image branch have the same structure, and each sequentially comprises a residual dense block, a global average pooling layer, two convolution layers and a sigmoid activation function, wherein the output of the residual dense block is fed to the global average pooling layer and is also multiplied with the output of the sigmoid activation function;
the cascade branch of the reference image and the target image sequentially comprises a cascade layer, two convolution layers and a sigmoid activation function, wherein the output of the cascade layer is fed to the first convolution layer and is also multiplied with the output of the sigmoid activation function.
Optionally, when aligning the input image pair using the progressive alignment sub-model:
inputting the input image pair into the convolution layers and cross feature cooperation modules to obtain output cross feature maps $\hat{F}_R$ and $\hat{F}_T$ corresponding to the reference image I_R and the target image I_T in the input image pair;
acquiring a homography matrix by using the contextual correlation layer based on the output cross feature maps $\hat{F}_R$ and $\hat{F}_T$;
warping the reference image I_R and the target image I_T by using the spatial transformation module so as to align the pixel positions of the overlapping region of the reference image and the target image, wherein the aligned image pair $\hat{I}_R$ and $\hat{I}_T$ is expressed as:

$\hat{I}_R = S(I_R, H_E), \qquad \hat{I}_T = S(I_T, H_P)$

wherein $H_E$ denotes the identity matrix, $H_P$ denotes the progressive homography matrix, and the function $S$ denotes the output of the spatial transformation module, i.e. the output of the spatial warping process when the input is the reference image together with the identity matrix, or the target image together with the progressive homography matrix.
Optionally, the interleaved image fusion sub-model sequentially comprises a convolution layer, K consecutive interleaved Swin Transformer modules, 4K attention interleaving blocks, a cascade layer and a reconstruction unit, where the K consecutive interleaved Swin Transformer modules and the 4K attention interleaving blocks are used to extract the interleaved feature maps of the aligned image pair.
Optionally, the attention interleaving block sequentially comprises a maximum pooling layer, an average pooling layer, an upsampling layer, a convolution layer, a cascade layer and a sigmoid activation function.
Optionally, when fusing the aligned image pairs using the interleaved image fusion sub-model:
obtaining the basic feature maps of the aligned reference image and the aligned target image in the aligned image pair by using a convolution layer;
K consecutive interleaved Swin Transformer modules and 4K attention interleaving blocks in the interleaved image fusion sub-model are used to respectively obtain K interleaved feature maps of the aligned reference image and of the aligned target image;
adding the K interleaved feature maps of the aligned reference image to obtain the final feature map of the aligned reference image, and adding the K interleaved feature maps of the aligned target image to obtain the final feature map of the aligned target image;
cascading and convolving the final feature map of the aligned reference image and the final feature map of the aligned target image to obtain the fused features of the aligned reference image and the aligned target image;
the fused features are fed into the reconstruction unit to obtain the final stitched image I_F of the reference image and the target image, expressed as:

$I_F = \mathrm{Res}\big(\mathrm{conv}(\mathrm{concat}(F_R, F_T))\big)$

wherein:

$F_R = \sum_{k=1}^{K} F_R^k, \qquad F_T = \sum_{k=1}^{K} F_T^k$

where $F_R$ and $F_T$ denote the final feature maps of the reference image and the target image respectively, $F_R^k$ and $F_T^k$ denote the k-th interleaved feature maps of the aligned reference image and the aligned target image respectively, the function Res denotes the output of the reconstruction unit when its input is the convolved cascade feature, and K denotes the number of consecutive interleaved Swin Transformer modules.
Optionally, the preset overall loss function comprises an alignment loss function and a fusion loss function, where the preset overall loss function L_total is expressed as:

$L_{total} = \lambda_1 L_A + \lambda_2 L_C$

wherein L_A denotes the alignment loss function, L_C denotes the fusion loss function, and λ_1 and λ_2 are the weights of the alignment loss function and the fusion loss function, respectively.
Optionally, the fusion penalty function includes a structure penalty term for constraining structural similarity between the image stitching result and the input image pair and a texture penalty term for constraining the image stitching result and the input image pair to have similar texture details.
The technical solution provided by the invention has the following beneficial effects:
1. The invention can accurately align the images, reduce seam artifacts while keeping the image alignment consistent, and obtain high-quality image stitching results.
2. The invention solves the image stitching problem with deep learning, reduces image alignment distortion through the progressive alignment sub-model and the alignment loss function, and reduces seam artifacts through the interleaved fusion sub-model and the fusion loss function.
Drawings
FIG. 1 is a flow chart of an image stitching method based on a progressive alignment and interweaving fusion network in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of a cross-feature collaboration module in accordance with an embodiment of the invention;
FIG. 3 is a schematic diagram of the architecture of an interleaved Swin Transformer module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the structure of an attention interleaving block according to an embodiment of the present invention;
FIG. 5 is a schematic diagram showing peak signal-to-noise ratio comparison results of different image stitching methods according to an embodiment of the present invention.
Detailed Description
The objects, technical solutions and advantages of the present invention will become more apparent by the following detailed description of the present invention with reference to the accompanying drawings. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the present invention.
Fig. 1 is a flowchart of an image stitching method based on a progressive alignment and interweaving fusion network according to an embodiment of the present invention, and some specific implementation procedures of the present invention are described below taking Fig. 1 as an example. As shown in Fig. 1, the image stitching method based on the progressive alignment and interweaving fusion network provided by the invention comprises the following steps:
Step S1, acquiring a training data set, wherein the training data set comprises a plurality of input image pairs, and each input image pair comprises a reference image I_R and a target image I_T;
Step S2, constructing an image stitching depth model;
In one embodiment of the present invention, the image stitching depth model comprises a progressive alignment sub-model and an interleaved image fusion sub-model.
The progressive alignment sub-model is used for aligning the input image pair to obtain an aligned image pair.
Further, the progressive alignment sub-model sequentially comprises m convolution layers, m cross feature cooperation modules, m contextual correlation layers and m spatial transformation modules, where m is a natural number. The m convolution layers and the m cross feature cooperation modules are alternately connected to generate cross features and thereby obtain a more accurate homography matrix; that is, a convolution layer comes first, then a cross feature cooperation module, then another convolution layer, and so on, so that a cross feature cooperation module is placed between every two convolution layers. The cross feature cooperation module consists of three branches, namely a reference image branch, a target image branch and a cascade branch of the reference image and the target image. The reference image branch and the target image branch have the same structure, and each sequentially comprises a residual dense block, a global average pooling layer, two convolution layers and a sigmoid activation function, where the output of the residual dense block is fed to the global average pooling layer and is also multiplied with the output of the sigmoid activation function. The cascade branch of the reference image and the target image sequentially comprises a cascade layer, two convolution layers and a sigmoid activation function, where the output of the cascade layer is fed to the first convolution layer and is also multiplied with the output of the sigmoid activation function, as shown in Fig. 2.
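For illustration only, the following PyTorch sketch shows one way the cross feature cooperation module described above could be wired up. It is a minimal reading of the text and Fig. 2, not the patent's actual implementation: the channel counts, the residual dense block depth, the channel reduction applied to the cascade feature, and all class and variable names are assumptions.

```python
# Hypothetical sketch of the cross feature cooperation module described above.
# Channel counts, RDB depth and all names are assumptions, not the patent's code.
import torch
import torch.nn as nn

class ResidualDenseBlock(nn.Module):
    """Simplified residual dense block: densely connected 3x3 convs plus a skip."""
    def __init__(self, channels, growth=32, layers=3):
        super().__init__()
        self.convs = nn.ModuleList()
        c = channels
        for _ in range(layers):
            self.convs.append(nn.Sequential(nn.Conv2d(c, growth, 3, padding=1), nn.ReLU(inplace=True)))
            c += growth
        self.fuse = nn.Conv2d(c, channels, 1)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(conv(torch.cat(feats, dim=1)))
        return x + self.fuse(torch.cat(feats, dim=1))

class CrossFeatureCooperation(nn.Module):
    """Reference branch, target branch and a cascade branch, as described for Fig. 2."""
    def __init__(self, channels):
        super().__init__()
        self.rdb_r = ResidualDenseBlock(channels)
        self.rdb_t = ResidualDenseBlock(channels)
        # channel attention per branch: GAP -> conv -> conv -> sigmoid
        def channel_att():
            return nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid())
        self.att_r = channel_att()
        self.att_t = channel_att()
        # cascade branch: concat -> conv -> conv -> sigmoid (spatial attention map)
        self.att_cat = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
                                     nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())
        self.reduce_cat = nn.Conv2d(2 * channels, channels, 1)  # assumed, so the cooperative feature matches each branch

    def forward(self, f_r, f_t):
        # self features: RDB output modulated by its own channel attention
        s_r = self.rdb_r(f_r)
        s_t = self.rdb_t(f_t)
        s_r = s_r * self.att_r(s_r)
        s_t = s_t * self.att_t(s_t)
        # cooperative feature: cascade feature modulated by the cascade-branch spatial attention
        f_cat = torch.cat([f_r, f_t], dim=1)
        coop = self.reduce_cat(f_cat) * self.att_cat(f_cat)
        # output cross feature maps: self feature plus shared cooperative feature
        return s_r + coop, s_t + coop
```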
Assuming that the progressive alignment sub-model comprises i+1 convolution layers and i+1 cross feature cooperation modules, the reference image I_R and the target image I_T are first fed into the first-level convolution layer of the progressive alignment sub-model to generate the first-level feature maps of I_R and I_T. After the cross feature cooperation module between the first-level and second-level convolution layers, the first-level cross feature maps corresponding to I_R and I_T are obtained. After the second-level convolution layer, the second-level feature maps of I_R and I_T are obtained, and after the cross feature cooperation module between the second-level and third-level convolution layers, the second-level cross feature maps corresponding to I_R and I_T are obtained. Proceeding in the same way, after the (i+1)-th cross feature cooperation module connected to the last, (i+1)-th, convolution layer, the output cross feature maps $\hat{F}_R$ and $\hat{F}_T$ corresponding to I_R and I_T are obtained, which can be expressed as:

$\hat{F}_R = F_R^{i+1,s} \oplus F^{i+1,c}, \qquad \hat{F}_T = F_T^{i+1,s} \oplus F^{i+1,c}$

wherein

$F_R^{i+1,s} = \mathrm{RDB}(F_R^{i+1}) \otimes A_R^{i+1}, \qquad F_T^{i+1,s} = \mathrm{RDB}(F_T^{i+1}) \otimes A_T^{i+1}, \qquad F^{i+1,c} = F_{cat}^{i+1} \otimes A_{cat}^{i+1}$

where $F_R^{i+1,s}$ and $F_T^{i+1,s}$ are the (i+1)-th level self features of the reference image and the target image respectively, $F^{i+1,c}$ is the (i+1)-th level cooperative feature of the reference image and the target image, $F_R^{i+1}$ and $F_T^{i+1}$ are the (i+1)-th level input features of the reference image and the target image, i.e. the output of the (i+1)-th convolution layer of the progressive alignment sub-model that is fed into the (i+1)-th cross feature cooperation module, $F_{cat}^{i+1}$ is the (i+1)-th level cascade feature of the reference image and the target image, i.e. the output of the cascade layer in the cascade branch of the reference image and the target image, $A_R^{i+1}$ and $A_T^{i+1}$ are the (i+1)-th level channel-level attentions of the reference image and the target image respectively, i.e. the outputs of the sigmoid activation functions in the reference image branch and the target image branch, $A_{cat}^{i+1}$ is the (i+1)-th level spatial-level attention of the cascade feature $F_{cat}^{i+1}$, i.e. the output of the cascade branch of the reference image and the target image, the function RDB denotes the output of the residual dense block when its input is the (i+1)-th level input feature of the reference image or the target image, $\oplus$ denotes pixel-level addition, and $\otimes$ denotes pixel-level multiplication.
When aligning input image pairs using the progressive alignment sub-model:
firstly, the input image pair is fed through the convolution layers and cross feature cooperation modules to obtain the output cross feature maps $\hat{F}_R$ and $\hat{F}_T$ corresponding to the reference image I_R and the target image I_T in the input image pair;
then, based on the output cross feature maps $\hat{F}_R$ and $\hat{F}_T$, a homography matrix is acquired by using the contextual correlation layer. Acquiring a homography matrix with a contextual correlation layer is a technique known to those skilled in the art and is not described in further detail;
then, the reference image I_R and the target image I_T are warped by the spatial transformation module so as to align the pixel positions of their overlapping region, wherein the aligned image pair $\hat{I}_R$ and $\hat{I}_T$ is expressed as:

$\hat{I}_R = S(I_R, H_E), \qquad \hat{I}_T = S(I_T, H_P)$

wherein $H_E$ denotes the identity matrix, $H_P$ denotes the progressive homography matrix, and the function $S$ denotes the output of the spatial transformation module, i.e. the output of the spatial warping process when the input is the reference image together with the identity matrix, or the target image together with the progressive homography matrix. Warping an image with the spatial transformation module is a technique known to those skilled in the art and is not described in further detail.
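To make the warping step concrete, the generic OpenCV sketch below warps the target image with a 3x3 homography and keeps the reference image on an identity warp, mirroring the roles of H_P and H_E above. It is not the patent's spatial transformation module; the file names, canvas size and homography values are placeholders.

```python
# Generic homography-warping sketch (OpenCV); stands in for the spatial
# transformation S(I, H) described above. File names, canvas size and the
# homography values are placeholders.
import cv2
import numpy as np

reference = cv2.imread("reference.jpg")   # I_R, kept on the identity warp H_E
target = cv2.imread("target.jpg")         # I_T, warped by the progressive homography H_P

H_E = np.eye(3, dtype=np.float64)         # identity matrix
H_P = np.array([[1.02, 0.01, -15.0],      # example 3x3 homography (placeholder values)
                [0.00, 1.01,   5.0],
                [1e-5, 0.00,   1.0]])

h, w = reference.shape[:2]
canvas = (w * 2, h)                       # wide canvas so the warped target fits

aligned_ref = cv2.warpPerspective(reference, H_E, canvas)  # \hat{I}_R = S(I_R, H_E)
aligned_tgt = cv2.warpPerspective(target, H_P, canvas)     # \hat{I}_T = S(I_T, H_P)

# crude overlay just to visualise the aligned pair before learned fusion
overlap = cv2.addWeighted(aligned_ref, 0.5, aligned_tgt, 0.5, 0)
cv2.imwrite("aligned_overlay.jpg", overlap)
```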
The interleaved image fusion sub-model is used to fuse the aligned image pair to obtain the image stitching result.
The interleaved image fusion sub-model sequentially comprises a convolution layer, K consecutive interleaved Swin Transformer modules, 4K attention interleaving blocks, a cascade layer and a reconstruction unit.
Further, the interleaved image fusion sub-model extracts the interleaved feature maps of the aligned image pair using the K consecutive interleaved Swin Transformer modules and the 4K attention interleaving blocks, where the aligned image pair comprises the aligned reference image and the aligned target image and K is a natural number, as shown in Fig. 3.
The attention interleaving block sequentially comprises a maximum pooling layer, an average pooling layer, an upsampling layer, a convolution layer, a cascade layer and a sigmoid activation function, as shown in Fig. 4. Since the network structure is symmetric for the aligned reference image and the aligned target image, the attention interleaving block is described below with reference to the aligned target image, for which the final output of the attention interleaving block is $F_{AT}^{out}$, defined as follows:
$M_R = \mathrm{conv}(\mathrm{up}(\mathrm{max}(F_{AR}^{in}))), \qquad V_R = \mathrm{conv}(\mathrm{up}(\mathrm{avg}(F_{AR}^{in})))$

$W_R = \sigma\big(\mathrm{conv}(\mathrm{concat}(M_R, V_R))\big), \qquad H_T = \mathrm{conv}\big(\mathrm{conv}(\mathrm{conv}(F_{AT}^{in}))\big), \qquad F_{AT}^{out} = H_T \otimes W_R$

wherein $F_{AT}^{out}$ denotes the interleaved feature map output by the attention interleaving block for the aligned target image, $F_{AR}^{in}$ and $F_{AT}^{in}$ denote the input feature maps of the aligned reference image and the aligned target image respectively, $M_R$ and $V_R$ denote the maximum pooling feature and the average pooling feature of the aligned reference image, i.e. the outputs obtained after the input feature map of the aligned reference image is processed by the maximum pooling layer (respectively the average pooling layer), the upsampling layer and the convolution layer, conv(up(max(·))) denotes sequentially performing maximum pooling, upsampling and convolution, conv(up(avg(·))) denotes sequentially performing average pooling, upsampling and convolution, conv(concat(·)) denotes sequentially performing cascading and convolution, $W_R$ denotes the output of the sigmoid activation function, conv(conv(conv(·))) denotes three-layer convolution processing, conv(·) denotes one-layer convolution processing, $H_T$ denotes the feature map of the aligned target image, i.e. the output obtained after its input feature map is processed by the three convolution layers, $\sigma$ denotes the sigmoid function, and $\otimes$ denotes pixel-level multiplication.
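A minimal PyTorch sketch of one possible reading of this attention interleaving block follows. The pooling window, channel counts and, in particular, the final combination of the target features with the reference-derived attention weights by element-wise multiplication are assumptions made for illustration; the class and variable names are hypothetical.

```python
# Hypothetical sketch of the attention interleaving block (target-side branch).
# Pooling window, channel counts and the final H_T * W_R combination are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionInterleavingBlock(nn.Module):
    def __init__(self, channels, pool=2):
        super().__init__()
        self.pool = pool
        self.conv_max = nn.Conv2d(channels, channels, 3, padding=1)   # after max-pool + upsample
        self.conv_avg = nn.Conv2d(channels, channels, 3, padding=1)   # after avg-pool + upsample
        self.conv_cat = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.conv_t = nn.Sequential(                                  # three convolution layers on the target features
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, f_ref_in, f_tgt_in):
        size = f_ref_in.shape[-2:]
        # M_R = conv(up(max(.))), V_R = conv(up(avg(.))) on the aligned-reference features
        m = self.conv_max(F.interpolate(F.max_pool2d(f_ref_in, self.pool),
                                        size=size, mode="bilinear", align_corners=False))
        v = self.conv_avg(F.interpolate(F.avg_pool2d(f_ref_in, self.pool),
                                        size=size, mode="bilinear", align_corners=False))
        # W_R = sigmoid(conv(concat(M_R, V_R)))
        w = torch.sigmoid(self.conv_cat(torch.cat([m, v], dim=1)))
        # H_T = conv(conv(conv(.))) on the aligned-target features
        h_t = self.conv_t(f_tgt_in)
        # assumed combination: target features modulated by the reference-derived attention
        return h_t * w
```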
When fusing the aligned image pairs using the interleaved image fusion sub-model:
firstly, obtaining a basic feature map of an aligned reference image and an aligned target image in an aligned image pair by utilizing a convolution layer;
then, K interleaved feature maps of the aligned reference image and of the aligned target image are respectively obtained by using the K consecutive interleaved Swin Transformer modules and 4K attention interleaving blocks in the interleaved image fusion sub-model;
then, the K interleaved feature maps of the aligned reference image are added to obtain the final feature map of the aligned reference image, and the K interleaved feature maps of the aligned target image are added to obtain the final feature map of the aligned target image;
then, carrying out cascading and convolution processing on the final feature map of the aligned reference image and the final feature map of the aligned target image to obtain fusion features of the aligned reference image and the aligned target image;
finally, the fused features are fed into the reconstruction unit to obtain the final stitched image I_F of the reference image and the target image, which can be expressed as:

$I_F = \mathrm{Res}\big(\mathrm{conv}(\mathrm{concat}(F_R, F_T))\big)$

wherein

$F_R = \sum_{k=1}^{K} F_R^k, \qquad F_T = \sum_{k=1}^{K} F_T^k$

where $F_R$ and $F_T$ denote the final feature maps of the reference image and the target image respectively, $F_R^k$ and $F_T^k$ denote the k-th interleaved feature maps of the aligned reference image and the aligned target image respectively, the function Res denotes the output of the reconstruction unit when its input is the convolved cascade feature, and K denotes the number of consecutive interleaved Swin Transformer modules.
In one embodiment of the present invention, K may be set to 4.
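The sketch below strings the fusion flow described above together end to end: a shared convolution produces the basic feature maps, K stages produce interleaved feature maps that are summed into final feature maps, and the concatenated result is convolved and reconstructed. Because the exact interleaved Swin Transformer configuration is not recoverable from the text, each stage is stubbed out with the AttentionInterleavingBlock sketch given earlier; all names, channel counts and the default K=4 are assumptions rather than the patent's code.

```python
# Hypothetical end-to-end sketch of the interleaved image fusion sub-model flow.
# Each "stage" stands in for an interleaved Swin Transformer module plus attention
# interleaving blocks; the real Swin configuration is not specified here.
import torch
import torch.nn as nn

class InterleavedFusion(nn.Module):
    def __init__(self, channels=64, K=4):
        super().__init__()
        self.base = nn.Conv2d(3, channels, 3, padding=1)              # basic feature maps
        self.stages_r = nn.ModuleList(AttentionInterleavingBlock(channels) for _ in range(K))
        self.stages_t = nn.ModuleList(AttentionInterleavingBlock(channels) for _ in range(K))
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)   # concat -> conv
        self.reconstruct = nn.Sequential(                             # reconstruction unit Res(.)
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1))

    def forward(self, aligned_ref, aligned_tgt):
        f_r, f_t = self.base(aligned_ref), self.base(aligned_tgt)
        feats_r, feats_t = [], []
        for stage_r, stage_t in zip(self.stages_r, self.stages_t):
            f_r = stage_r(f_t, f_r)          # reference features modulated by target-derived attention
            f_t = stage_t(f_r, f_t)          # target features modulated by (updated) reference-derived attention
            feats_r.append(f_r)
            feats_t.append(f_t)
        final_r = torch.stack(feats_r).sum(dim=0)    # F_R = sum_k F_R^k
        final_t = torch.stack(feats_t).sum(dim=0)    # F_T = sum_k F_T^k
        fused = self.fuse(torch.cat([final_r, final_t], dim=1))
        return self.reconstruct(fused)               # I_F = Res(conv(concat(F_R, F_T)))

# usage: stitched = InterleavedFusion()(aligned_ref, aligned_tgt) on (B, 3, H, W) tensors
```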
Step S3, training the image stitching depth model based on the training data set and a preset overall loss function to obtain an image stitching target depth model;
In an embodiment of the present invention, the preset overall loss function comprises an alignment loss function and a fusion loss function, so as to maintain the alignment consistency of the overlapping region between the reference image and the target image and to reduce seam artifacts by using the alignment loss and the fusion loss.
The preset overall loss function L_total is expressed as:

$L_{total} = \lambda_1 L_A + \lambda_2 L_C$

wherein L_A denotes the alignment loss function, L_C denotes the fusion loss function, and λ_1 and λ_2 denote the weights of the alignment loss function and the fusion loss function, respectively.
The alignment loss function is used to constrain the pixel consistency of the overlapping region between the reference image and the target image.
Further, the alignment loss function L_A can be expressed as:

$L_A = \big\| S(I_X, H_p) \otimes I_R - S(I_T, H_p) \big\|_2 + \big\| S(I_Y, H_p^{-1}) \otimes I_T - S(I_R, H_p^{-1}) \big\|_2$

wherein I_R and I_T denote the input reference image and the target image respectively, I_X and I_Y denote identity matrices with the same resolution as I_R and I_T respectively, H_p denotes the progressive homography matrix, $H_p^{-1}$ denotes the inverse of the progressive homography matrix H_p, the function S denotes the output of the spatial warping process, $\otimes$ denotes pixel-level multiplication, and ‖·‖_2 denotes the L2 norm.
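For illustration, the sketch below implements a symmetric alignment loss of the kind described: masks obtained by warping constant maps with H_p and its inverse gate an L2 comparison between each warped image and the other view. It is one plausible formulation consistent with the symbols above, not the patent's exact loss; warp_with_homography is a hypothetical stand-in for the spatial warping function S.

```python
# One plausible alignment loss, assuming S(.) is a differentiable homography warp.
# warp_with_homography is a hypothetical stand-in for the spatial warping function S.
import torch

def alignment_loss(i_r, i_t, h_p, warp_with_homography):
    """i_r, i_t: (B, C, H, W) reference/target images; h_p: (B, 3, 3) progressive homography."""
    ones_x = torch.ones_like(i_r)                    # I_X, same resolution as I_R
    ones_y = torch.ones_like(i_t)                    # I_Y, same resolution as I_T
    h_inv = torch.inverse(h_p)                       # H_p^{-1}

    # target warped onto the reference plane, compared inside the warped overlap mask
    loss_fwd = torch.norm(warp_with_homography(ones_x, h_p) * i_r
                          - warp_with_homography(i_t, h_p), p=2)
    # reference warped onto the target plane with the inverse homography
    loss_bwd = torch.norm(warp_with_homography(ones_y, h_inv) * i_t
                          - warp_with_homography(i_r, h_inv), p=2)
    return loss_fwd + loss_bwd
```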
The fusion loss function comprises a structure loss term and a texture loss term, wherein the structure loss term is used to constrain the structural similarity between the image stitching result and the input image pair, and the texture loss term is used to constrain the image stitching result and the input image pair to have similar texture details.
further, the fusion loss function L C Can be expressed as:
L C =L stru +L text
wherein the method comprises the steps of
Wherein L is stru And L text Representing a structure loss term and a texture loss term, O AR And O AT Is an image block extracted from a certain pixel point O in the aligned reference image and target image, O F Is an image block extracted from the final spliced image at the same position O, M and N represent O F The dimensions in the horizontal direction and in the vertical direction, the function ssim () represents a structural similarity operation,  and  1 Representing an absolute value operator and an L1 norm, wherein the V represents a Sobel gradient operation, and the function max represents an element maximum selection operation.
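The sketch below shows one common realisation of such structure and texture terms: an SSIM-based term against the element-wise maximum of the two aligned inputs, and an L1 term between Sobel gradient magnitudes. It follows standard image-fusion practice and the symbols above rather than the patent's exact formulas, operates on whole images instead of extracted blocks, and assumes the kornia library for the SSIM loss and Sobel filter.

```python
# One common realisation of the structure + texture fusion loss (not the patent's exact form).
# Uses kornia's SSIM loss and Sobel filter; patch extraction is simplified away and the
# loss is computed over whole images.
import torch
import kornia

def fusion_loss(o_f, o_ar, o_at):
    """o_f: stitched output; o_ar, o_at: aligned reference/target, all (B, C, H, W) in [0, 1]."""
    # structure term: SSIM against the element-wise maximum of the two aligned inputs
    l_stru = kornia.losses.ssim_loss(o_f, torch.maximum(o_ar, o_at), window_size=11)
    # texture term: L1 between Sobel gradient magnitudes, keeping the stronger input gradient
    grad_f = kornia.filters.sobel(o_f)
    grad_max = torch.maximum(kornia.filters.sobel(o_ar), kornia.filters.sobel(o_at))
    l_text = torch.nn.functional.l1_loss(grad_f, grad_max)
    return l_stru + l_text

# overall loss: L_total = lambda_1 * L_A + lambda_2 * L_C, with the weights as hyperparameters
```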
Step S4, stitching the image pair to be stitched by using the image stitching target depth model to obtain an image stitching result.
FIG. 5 shows the peak signal-to-noise ratio (PSNR) comparison of image stitching results obtained by different methods. The compared algorithms include the method of Zhang and the method of Nie, where Zhang's method is a conventional image stitching method and Nie's method is a deep-learning-based image stitching method. The greater the peak signal-to-noise ratio, the less distortion in the image stitching result. As can be seen from Fig. 5, the peak signal-to-noise ratio of the present invention is greater than that of Zhang's method, mainly because Zhang's method does not consider the apparent difference of the images in terms of overall intensity. Furthermore, Nie's method also performs worse than the present invention in terms of peak signal-to-noise ratio, mainly because it ignores the importance of the complementary information between the reference image and the target image during image fusion. In contrast, by constructing the progressive alignment sub-model and the interleaved image fusion sub-model and combining the alignment loss and the fusion loss, the present invention reduces the alignment inconsistency and seam artifacts of the image stitching result and achieves good image stitching performance.
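For reference, the peak signal-to-noise ratio used in this comparison can be computed as follows; this is the standard definition, evaluated against a ground-truth panorama, and is independent of the particular stitching method.

```python
# Standard PSNR computation used to compare stitching results against a ground-truth panorama.
import numpy as np

def psnr(result, reference, max_val=255.0):
    """result, reference: uint8 or float arrays of identical shape."""
    mse = np.mean((result.astype(np.float64) - reference.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)
```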
The embodiments of the invention do not limit the models of the devices involved, as long as the devices can perform the described functions.
Those skilled in the art will appreciate that the drawings are schematic representations of only one preferred embodiment, and that the above-described embodiment numbers are merely for illustration purposes and do not represent advantages or disadvantages of the embodiments.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (10)

1. An image stitching method based on a progressive alignment and interweaving fusion network, which is characterized by comprising the following steps:
step S1, acquiring a training data set, wherein the training data set comprises a plurality of input image pairs, and each input image pair comprises a reference image I_R and a target image I_T;
S2, constructing an image stitching depth model;
step S3, training the image stitching depth model based on the training data set and a preset overall loss function to obtain an image stitching target depth model;
and step S4, stitching the image pair to be stitched by using the image stitching target depth model to obtain an image stitching result.
2. The method of claim 1, wherein the image stitching depth model comprises a progressive alignment sub-model for aligning the input image pair to obtain an aligned image pair and an interleaved image fusion sub-model for fusing the aligned image pair to obtain an image stitching result.
3. The method of claim 2, wherein the progressive alignment sub-model comprises, in order, m convolution layers, m cross feature cooperation modules, m contextual correlation layers and m spatial transformation modules, wherein the m convolution layers and the m cross feature cooperation modules are alternately connected and m is a natural number.
4. The method according to claim 3, wherein the cross feature cooperation module consists of a reference image branch, a target image branch, and a cascade branch of the reference image and the target image, wherein:
the reference image branch and the target image branch have the same structure, and each sequentially comprises a residual dense block, a global average pooling layer, two convolution layers and a sigmoid activation function, wherein the output of the residual dense block is fed to the global average pooling layer and is also multiplied with the output of the sigmoid activation function;
the cascade branch of the reference image and the target image sequentially comprises a cascade layer, two convolution layers and a sigmoid activation function, wherein the output of the cascade layer is fed to the first convolution layer and is also multiplied with the output of the sigmoid activation function.
5. The method of claim 4, wherein, when aligning input image pairs using the progressive alignment sub-model:
inputting the input image pair into the convolution layer and cross feature cooperation module to obtain a reference image I in the input image pair R And target image I T Corresponding output cross feature mapAnd->
Based on the output cross feature mapAnd->Acquiring a homography matrix by using a text correlation layer;
using spatial transform module to transform reference image I R And target image I T Performing deformation to align pixel positions of overlapping areas of the reference image and the target image, wherein the aligned image pairsAnd->Expressed as:
;
;
wherein H is E Representing the identity matrix, H P The function S represents the output of the spatial deformation module, namely the output of the spatial deformation processing when the input is the reference image and the unit matrix or the input is the target image and the progressive homography matrix.
6. The method of claim 2, wherein the interleaved image fusion sub-model comprises, in order, a convolution layer, K consecutive interleaved Swin Transformer modules, 4K attention interleaving blocks, a cascade layer, and a reconstruction unit, wherein the K consecutive interleaved Swin Transformer modules and the 4K attention interleaving blocks are used to extract interleaved feature maps of the aligned image pair.
7. The method of claim 6, wherein the attention interleaving block comprises, in order, a maximum pooling layer, an average pooling layer, an upsampling layer, a convolution layer, a cascade layer and a sigmoid activation function.
8. The method of claim 7, wherein, when fusing the aligned image pairs with the interleaved image fusion sub-model:
obtaining basic feature maps of an aligned reference image and an aligned target image in the aligned image pair by using a convolution layer;
K consecutive interleaved Swin Transformer modules and 4K attention interleaving blocks in the interleaved image fusion sub-model are used to respectively obtain K interleaved feature maps of the aligned reference image and of the aligned target image;
adding the K interleaved feature maps of the aligned reference image to obtain the final feature map of the aligned reference image, and adding the K interleaved feature maps of the aligned target image to obtain the final feature map of the aligned target image;
cascading and convoluting the final feature map of the aligned reference image and the final feature map of the aligned target image to obtain fusion features of the aligned reference image and the aligned target image;
the fused features are fed into the reconstruction unit to obtain the final stitched image I_F of the reference image and the target image, expressed as:
$I_F = \mathrm{Res}\big(\mathrm{conv}(\mathrm{concat}(F_R, F_T))\big)$;
wherein:
$F_R = \sum_{k=1}^{K} F_R^k, \qquad F_T = \sum_{k=1}^{K} F_T^k$;
where $F_R$ and $F_T$ denote the final feature maps of the reference image and the target image respectively, $F_R^k$ and $F_T^k$ denote the k-th interleaved feature maps of the aligned reference image and the aligned target image respectively, the function Res denotes the output of the reconstruction unit when its input is the convolved cascade feature, and K denotes the number of consecutive interleaved Swin Transformer modules.
9. The method of claim 1, wherein the preset overall loss function comprises an alignment loss function and a fusion loss function, wherein the preset overall loss function L_total is expressed as:
$L_{total} = \lambda_1 L_A + \lambda_2 L_C$;
wherein L_A denotes the alignment loss function, L_C denotes the fusion loss function, and λ_1 and λ_2 denote the weights of the alignment loss function and the fusion loss function, respectively.
10. The method of claim 9, wherein the fusion penalty function includes a structure penalty term for constraining structural similarity between the image stitching result and the input image pair and a texture penalty term for constraining the image stitching result and the input image pair to have similar texture details.
CN202410159901.9A 2024-02-05 2024-02-05 Image stitching method based on progressive alignment and interweaving fusion network Pending CN117710207A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410159901.9A CN117710207A (en) 2024-02-05 2024-02-05 Image stitching method based on progressive alignment and interweaving fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410159901.9A CN117710207A (en) 2024-02-05 2024-02-05 Image stitching method based on progressive alignment and interweaving fusion network

Publications (1)

Publication Number Publication Date
CN117710207A true CN117710207A (en) 2024-03-15

Family

ID=90146517

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410159901.9A Pending CN117710207A (en) 2024-02-05 2024-02-05 Image stitching method based on progressive alignment and interweaving fusion network

Country Status (1)

Country Link
CN (1) CN117710207A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090257680A1 (en) * 2006-06-30 2009-10-15 Nxp B.V. Method and Device for Video Stitching
CN111242238A (en) * 2020-01-21 2020-06-05 北京交通大学 Method for acquiring RGB-D image saliency target
CN116091314A (en) * 2022-12-30 2023-05-09 长春理工大学 Infrared image stitching method based on multi-scale depth homography
CN116596815A (en) * 2023-05-09 2023-08-15 天津师范大学 Image stitching method based on multi-stage alignment network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Shi Zaifeng et al., "Image stitching method based on multi-scale edge curvature estimation", Journal of Nankai University (Natural Science Edition), vol. 51, no. 5, 31 October 2018 (2018-10-31) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination