CN116596815A - Image stitching method based on multi-stage alignment network - Google Patents
- Publication number
- CN116596815A (application CN202310517330.7A)
- Authority
- CN
- China
- Prior art keywords
- image
- aligned
- representing
- target image
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20172—Image enhancement details
- G06T2207/20192—Edge enhancement; Edge preservation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention discloses an image stitching method based on a multi-stage alignment network, which comprises the following steps: step S1, acquiring a training data set, wherein the training data set comprises a plurality of input image pairs and the image stitching result corresponding to each input image pair, and each input image pair comprises a reference image I1 and a target image I2; step S2, constructing an image stitching depth model; step S3, training the image stitching depth model based on the training data set and an overall loss function to obtain a target image stitching depth model; and step S4, stitching the images to be stitched by using the target image stitching depth model to obtain an image stitching result. The invention uses multi-stage alignment to deeply optimize the deformation of the images, reduces distortion of the image content while keeping the seams smooth, and thereby achieves accurate stitching of the images.
Description
Technical Field
The invention relates to the fields of image processing and deep learning, and in particular to an image stitching method based on a multi-stage alignment network.
Background
As a key technique for obtaining high-resolution, wide-field panoramic images, image stitching aims to acquire multiple images with overlapping regions by rotating a camera and to stitch them together through feature matching and image fusion. However, when the rotation angle of the capture device is large or the photographed scenes are not coplanar, the stitched images may exhibit visual artifacts and misalignment. How to ensure accurate alignment and natural smoothness of wide-field panoramic images is therefore a challenging problem in image stitching.
In recent years, researchers have proposed a large number of image stitching methods. Conventional image stitching methods fall into global alignment methods and spatially varying warping methods. Global alignment methods match images using invariant local features and align them by establishing a mapping relation through a homography matrix, for example the dual-homography estimation method and the smoothly varying affine method. Spatially varying warping methods divide an image into uniform grids and obtain optimal grid coordinates by optimizing a content-based grid deformation function; they include the as-projective-as-possible method, the as-natural-as-possible method, and others. More recently, researchers have proposed deep-learning-based image stitching methods to improve stitching performance. For example, Nie et al. proposed an image stitching network based on global homography that eliminates image artifacts by constructing a structure stitching stage and a content revision stage. In view of the importance of edge preservation, Dai et al. proposed an edge-guided fusion approach for image stitching. Jong et al. devised a deep image rectangling solution for preserving the linear and nonlinear structures of the image. However, the performance of these image stitching methods still needs further improvement.
In carrying out the invention, the inventors found that the prior art has at least the following drawbacks and deficiencies:
Prior-art methods generally align images by estimating a single deep homography transformation; they cannot effectively handle large-parallax scenes and may distort the global structure of the panoramic image. Existing methods also ignore the importance of image content and stitching seams during the stitching process, which easily leads to inconsistent image content and discontinuous seams.
Disclosure of Invention
The invention provides an image stitching method based on a multi-stage alignment network. The method uses a content-preservation-based deep homography estimation module to pre-align the input image pair and reduce content artifacts, uses an edge-assisted mesh deformation module to further align the image pair and avoid seam distortion, and constructs a content consistency loss and a seam smoothness loss to maintain the geometric structure of the image pair and reduce seam discontinuity in the overlapping region, thereby predicting high-quality image stitching results. The method achieves high-quality stitching of images while avoiding content artifacts and reducing seam distortion.
The image stitching method based on a multi-stage alignment network provided by the invention comprises the following steps:
Step S1, acquiring a training data set, wherein the training data set comprises a plurality of input image pairs and the image stitching result corresponding to each input image pair, and each input image pair comprises a reference image I1 and a target image I2;
Step S2, constructing an image stitching depth model;
Step S3, training the image stitching depth model based on the training data set and the overall loss function to obtain a target image stitching depth model;
Step S4, stitching the images to be stitched by using the target image stitching depth model to obtain an image stitching result.
Optionally, the image stitching depth model comprises an image pre-alignment sub-model and an image alignment sub-model: the image pre-alignment sub-model pre-aligns the input image pair using a content-preservation-based deep homography estimation module, and the image alignment sub-model further aligns the pre-aligned input image pair using an edge-assisted network.
Optionally, the deep homography estimation module is formed by interleaving a plurality of symmetric convolution-layer units with a corresponding number of content-preserving attention modules, wherein each symmetric convolution-layer unit comprises two convolution layers and one max-pooling layer; each content-preserving attention module comprises a spatial attention module and a plurality of cross-operation modules, and the spatial attention module comprises two max-pooling layers, two average-pooling layers, a shared fully connected layer and an activation function layer.
Optionally, the edge-assisted network comprises a convolution layer, three multi-scale residual blocks, an upsampling layer and a bottleneck layer.
Optionally, when the input image pair is pre-aligned using the image pre-alignment sub-model:
the input image pair is fed into the deep homography estimation module to obtain the output feature maps F_i^R and F_i^T corresponding to the reference image I1 and the target image I2;
a homography matrix is obtained from the output feature maps F_i^R and F_i^T by the direct linear transformation method;
the reference image I1 and the target image I2 are then warped by a spatial transformer network to pre-align the pixel locations of their overlapping region, wherein the pre-aligned input image pair is expressed as:
Ĩ1 = W_STN(I1, E),  Ĩ2 = W_STN(I2, H)
where E denotes the identity matrix, H denotes the homography matrix, and W_STN(·,·) denotes the output of the spatial transformer network.
Optionally, when the pre-aligned input image pair is aligned using the image alignment sub-model:
basic feature maps of the pre-aligned reference and target images are obtained using the convolution layer of the edge-assisted network;
edge feature maps of the pre-aligned reference and target images are extracted using the edge-assisted network;
the obtained edge feature maps are concatenated with the corresponding basic feature maps to obtain fused feature maps of the pre-aligned reference and target images;
feature flows of the pre-aligned reference and target images are computed from the fused feature maps using a contextual correlation method;
the pre-aligned reference and target images together with their feature flows are fed into a deep mesh deformation network to obtain the aligned reference image Î1 and target image Î2, which are expressed as:
Î1 = W_mesh(Ĩ1, CCL(F_1c, F_2c)),  Î2 = W_mesh(Ĩ2, CCL(F_1c, F_2c))
where
F_1c = [F_1conv, F_1edge],  F_2c = [F_2conv, F_2edge]
and F_1conv and F_2conv denote the basic feature maps of the pre-aligned image pair, F_1edge and F_2edge denote the edge feature maps of the pre-aligned image pair, F_1c and F_2c denote the fused feature maps of the pre-aligned image pair, [·,·] denotes the concatenation operation, CCL(·,·) denotes the contextual correlation method, W_mesh(·,·) denotes the deep mesh deformation network, Ĩ1 denotes the pre-aligned reference image, and Ĩ2 denotes the pre-aligned target image.
Optionally, the overall loss function comprises a content consistency loss and a seam smoothness loss, and the overall loss function L_All is expressed as:
L_All = αL_cont + βL_seam
where L_cont denotes the content consistency loss, L_seam denotes the seam smoothness loss, and α and β are the weights of the content consistency loss and the seam smoothness loss, respectively.
Optionally, the content consistency loss consists of a photometric loss term and a structural loss term.
Optionally, the photometric loss term L_photo is expressed as:
L_photo = ||I_F - I_G||_1
where I_F and I_G denote the final image stitching result and the ground truth, respectively, and ||·||_1 denotes the L1 norm;
the structural loss term L_struc is expressed as:
L_struc = Σ_i ||φ_i(I_F) - φ_i(I_G)||_2
where φ_i(·) denotes the output of conv1_i in the VGG-16 network and ||·||_2 denotes the L2 norm.
Optionally, the seam smoothness loss L_seam is expressed as:
L_seam = ||E_1 - E_1G||_1 + ||E_2 - E_2G||_1
where
E_1 = E_net(Î1),  E_2 = E_net(Î2)
and E_1 and E_2 are the edge images of the aligned image pair, E_1G and E_2G are the ground-truth edge images of the aligned image pair obtained using the curvature formula, E_net(·) denotes the output of the edge-assisted network, Î1 denotes the aligned reference image, Î2 denotes the aligned target image, m and n denote the horizontal and vertical directions, and ∇(·) and div(·) denote the gradient and divergence operations, respectively.
The technical scheme provided by the invention has the following beneficial effects:
1. The invention can accurately align images, reduce distortion of the image content while keeping the seams smooth, and obtain high-quality image stitching results.
2. The invention solves the image stitching problem using deep learning, reduces image alignment artifacts through multi-stage alignment, and reduces seam discontinuity with the assistance of edge information and a seam smoothness loss.
Drawings
FIG. 1 is a flowchart of the image stitching method based on a multi-stage alignment network according to an embodiment of the present invention;
FIG. 2(a) is a schematic diagram of the content-preserving attention module structure according to an embodiment of the invention, wherein ⊗ denotes pixel-level multiplication;
FIG. 2(b) is a schematic diagram of the spatial attention module structure according to an embodiment of the invention, wherein ⊕ denotes pixel-level addition and σ denotes the sigmoid function;
FIG. 3 is a schematic diagram of the edge-assisted network architecture according to an embodiment of the invention;
FIG. 4 shows the structural-similarity comparison results of different image stitching methods according to an embodiment of the present invention.
Detailed Description
The objects, technical solutions and advantages of the present invention will become more apparent by the following detailed description of the present invention with reference to the accompanying drawings. It should be understood that the description is only illustrative and is not intended to limit the scope of the invention. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the present invention.
Fig. 1 is a flowchart of the image stitching method based on a multi-stage alignment network according to an embodiment of the present invention; some specific implementation steps of the invention are described below with reference to Fig. 1. The image stitching method based on a multi-stage alignment network provided by the invention comprises the following steps:
Step S1, acquiring a training data set, wherein the training data set comprises a plurality of input image pairs and the image stitching result corresponding to each input image pair, and each input image pair comprises a reference image I1 and a target image I2;
Step S2, constructing an image stitching depth model;
in an embodiment of the present invention, the image stitching depth model includes an image pre-alignment sub-model and an image alignment sub-model.
The image pre-alignment sub-model pre-aligns the input image pair using a content-preservation-based deep homography estimation module so as to reduce content artifacts and obtain a pre-aligned input image pair.
Further, the deep homography estimation module is formed by interleaving a plurality of symmetric convolution-layer units with a corresponding number of content-preserving attention modules; that is, a content-preserving attention module is placed between every two symmetric convolution-layer units so as to find correct matching features and suppress wrong ones, as shown in Fig. 2(a). Each symmetric convolution-layer unit comprises two convolution layers and one max-pooling layer; each content-preserving attention module comprises a spatial attention module and a plurality of cross-operation modules, as shown in Fig. 2(a); the spatial attention module comprises two max-pooling layers, two average-pooling layers, one shared fully connected layer and one sigmoid layer, as shown in Fig. 2(b).
Assuming the deep homography estimation module comprises i+1 symmetric convolution-layer units and i+1 content-preserving attention modules, the reference image I1 and the target image I2 are first fed into the first-stage symmetric convolution-layer unit of the module to generate their first-level feature maps. After the content-preserving attention module between the first-stage and second-stage symmetric convolution-layer units, the first-level weighted feature maps F_0^R and F_0^T corresponding to I1 and I2 are obtained. These pass through the second-stage symmetric convolution-layer unit to yield the second-level feature maps, and through the content-preserving attention module between the second-stage and third-stage units to yield the second-level weighted feature maps F_1^R and F_1^T, and so on. After the last-stage symmetric convolution-layer unit and the (i+1)-th content-preserving attention module connected to it, the output feature maps F_i^R and F_i^T corresponding to I1 and I2 are obtained, as shown in Fig. 2(a). Within each content-preserving attention module, the spatial-level feature maps of the reference and target images are obtained by multiplying the current-level feature maps pixel by pixel with the corresponding spatial attention masks M_s(·), i.e. the outputs of the spatial attention module, where ⊗ denotes pixel-level multiplication; the output feature maps are then formed from these spatial-level feature maps by the cross-operation modules.
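For illustration only (not part of the claimed invention), the pixel-level attention weighting described above can be sketched in a few lines of NumPy. The sketch assumes channel-wise average and max pooling and a two-element shared weight vector standing in for the shared fully connected layer; all function names are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention_mask(feat, w):
    """feat: (C, H, W) feature map. Pools along the channel axis, mixes the
    two pooled maps with a shared weight vector w of length 2, and squashes
    to (0, 1) with a sigmoid -- a CBAM-style stand-in for M_s."""
    avg_pool = feat.mean(axis=0)          # (H, W)
    max_pool = feat.max(axis=0)           # (H, W)
    mixed = w[0] * avg_pool + w[1] * max_pool
    return sigmoid(mixed)

def apply_attention(feat, w):
    """Pixel-level multiplication of the feature map by its spatial mask."""
    return feat * spatial_attention_mask(feat, w)[None, :, :]
```

In the real module the mask is learned; here the weights are fixed purely to show the data flow (pool, mix, sigmoid, multiply).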
When the input image pair is pre-aligned using the image pre-alignment sub-model:
firstly, the input image pair is fed into the deep homography estimation module to obtain the output feature maps F_i^R and F_i^T corresponding to the reference image I1 and the target image I2;
then, a homography matrix is obtained from the output feature maps F_i^R and F_i^T by the direct linear transformation method; obtaining a homography matrix by direct linear transformation is a technique that should be familiar to those skilled in the art and is not repeated here;
finally, the reference image I1 and the target image I2 are warped by a spatial transformer network to pre-align the pixel locations of their overlapping region, wherein the pre-aligned input image pair may be expressed as:
Ĩ1 = W_STN(I1, E),  Ĩ2 = W_STN(I2, H)
where E denotes the identity matrix, H denotes the homography matrix, and W_STN(·,·) denotes the output of the spatial transformer network.
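As a minimal illustrative sketch of the direct linear transformation step (assuming NumPy; the function names are hypothetical), the following estimates a homography from four or more point correspondences and remaps points with it, approximating what the spatial transformer network does to pixel coordinates:

```python
import numpy as np

def homography_dlt(src, dst):
    """Direct linear transform: estimate H (3x3) such that dst ~ H @ src
    in homogeneous coordinates, from >= 4 correspondences.
    src, dst: (N, 2) arrays of matched points."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of A, i.e. the right-singular
    # vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def warp_points(H, pts):
    """Apply a homography to (N, 2) points and de-homogenize."""
    homo = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homo @ H.T
    return mapped[:, :2] / mapped[:, 2:3]
```

In practice the module regresses four corner offsets and solves the same DLT system; this sketch only shows the algebra shared by both.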
The image alignment sub-model further aligns the pre-aligned input image pair using an edge-assisted network so as to reduce seam distortion.
Further, the edge-assisted network is an edge-based mesh deformation module comprising a convolution layer, three multi-scale residual blocks, an upsampling layer and a bottleneck layer, as shown in Fig. 3.
When the pre-aligned input image pair is aligned using the image alignment sub-model:
firstly, the basic feature maps of the pre-aligned reference and target images are obtained using the convolution layer of the edge-assisted network;
then, the edge feature maps of the pre-aligned reference and target images are extracted using the edge-assisted network;
next, the obtained edge feature maps are concatenated with the corresponding basic feature maps to obtain the fused feature maps of the pre-aligned reference and target images;
then, the feature flows of the pre-aligned reference and target images are computed from the fused feature maps using the contextual correlation method;
finally, the pre-aligned reference and target images together with their feature flows are fed into the deep mesh deformation network to obtain the aligned reference image Î1 and target image Î2, which may be expressed as:
Î1 = W_mesh(Ĩ1, CCL(F_1c, F_2c)),  Î2 = W_mesh(Ĩ2, CCL(F_1c, F_2c))
where
F_1c = [F_1conv, F_1edge],  F_2c = [F_2conv, F_2edge]
and F_1conv and F_2conv denote the basic feature maps of the pre-aligned image pair, F_1edge and F_2edge denote the edge feature maps, F_1c and F_2c denote the fused feature maps, [·,·] denotes the concatenation operation, CCL(·,·) denotes the contextual correlation method, W_mesh(·,·) denotes the output of the deep mesh deformation network, Ĩ1 denotes the pre-aligned reference image, and Ĩ2 denotes the pre-aligned target image.
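As a toy stand-in for the contextual correlation step (illustrative only, far simpler than a learned CCL), the following sketch correlates two feature maps over a small search window and reads off the best-matching displacement as a coarse feature flow:

```python
import numpy as np

def correlation_flow(f1, f2, radius=1):
    """For each position of f1 (C, H, W), dot-product-correlate with a
    (2*radius+1)^2 window of f2 and return the displacement (dy, dx) of
    the best match as a coarse flow field of shape (2, H, W)."""
    c, h, w = f1.shape
    flow = np.zeros((2, h, w))
    for y in range(h):
        for x in range(w):
            best, best_d = -np.inf, (0, 0)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        score = float(f1[:, y, x] @ f2[:, yy, xx])
                        if score > best:
                            best, best_d = score, (dy, dx)
            flow[:, y, x] = best_d
    return flow
```

A learned contextual correlation layer computes a dense correlation volume and regresses the flow; this brute-force argmax only illustrates the matching idea.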
Step S3: training the image stitching depth model based on the training data set and the overall loss function to obtain a target image stitching depth model;
In one embodiment of the invention, the overall loss function comprises a content consistency loss and a seam smoothness loss, which together preserve the geometric structure of the image pair and reduce seam discontinuities in the overlapping region.
The overall loss function L_All may be expressed as:
L_All = αL_cont + βL_seam
where L_cont denotes the content consistency loss, L_seam denotes the seam smoothness loss, and α and β are the weights of the content consistency loss and the seam smoothness loss, respectively.
In an embodiment of the present invention, the weights α and β may each be set to 0.5.
The content consistency loss consists of a photometric loss term and a structural loss term: the photometric loss term minimizes the pixel differences between the image stitching result and the ground truth, and the structural loss term constrains the image stitching result and the ground truth to have similar feature representations.
Further, the photometric loss term L_photo may be expressed as:
L_photo = ||I_F - I_G||_1
where I_F and I_G denote the final image stitching result and the ground truth, respectively, and ||·||_1 denotes the L1 norm;
the structural loss term L_struc may be expressed as:
L_struc = Σ_i ||φ_i(I_F) - φ_i(I_G)||_2
where φ_i(·) denotes the output of conv1_i in the VGG-16 network, which is obtainable by those skilled in the art; the receptive field of each pixel in conv1_1 and conv1_2 is a 5×5 neighborhood, and ||·||_2 denotes the L2 norm.
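The two content-consistency terms can be sketched as follows (assuming NumPy). The lists feats_f / feats_g stand in for the VGG-16 conv1_i activations, and the additive combination of the two terms is an assumption made for illustration:

```python
import numpy as np

def photometric_loss(i_f, i_g):
    """L_photo = ||I_F - I_G||_1: sum of absolute pixel differences
    between the stitching result and the ground truth."""
    return float(np.abs(i_f - i_g).sum())

def structural_loss(feats_f, feats_g):
    """Structural term: sum over layers i of ||phi_i(I_F) - phi_i(I_G)||_2,
    where the lists stand in for VGG-16 conv1_i feature maps."""
    return sum(float(np.linalg.norm(a - b)) for a, b in zip(feats_f, feats_g))

def content_consistency_loss(i_f, i_g, feats_f, feats_g):
    """Assumed combination L_cont = L_photo + L_struc (the patent states
    only that the loss consists of these two terms)."""
    return photometric_loss(i_f, i_g) + structural_loss(feats_f, feats_g)
```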
The seam smoothness loss constrains the edge images of the aligned image pair to be close to their ground-truth edge images.
Further, the seam smoothness loss L_seam may be expressed as:
L_seam = ||E_1 - E_1G||_1 + ||E_2 - E_2G||_1
where
E_1 = E_net(Î1),  E_2 = E_net(Î2)
and E_1 and E_2 are the edge images of the aligned image pair, E_1G and E_2G are the ground-truth edge images of the aligned image pair obtained using the curvature formula, E_net(·) denotes the output of the edge-assisted network, m and n denote the horizontal and vertical directions, and ∇(·) and div(·) denote the gradient and divergence operations, respectively.
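A minimal sketch of the seam smoothness loss follows, together with a hypothetical curvature-based edge ground truth. The common curvature formula div(∇I/|∇I|) is assumed, since the patent names but does not spell out the formula:

```python
import numpy as np

def seam_smoothness_loss(e1, e1_gt, e2, e2_gt):
    """L_seam = ||E_1 - E_1G||_1 + ||E_2 - E_2G||_1: L1 distance between the
    predicted edge maps of the aligned pair and their ground truths."""
    return float(np.abs(e1 - e1_gt).sum() + np.abs(e2 - e2_gt).sum())

def curvature_edge(img):
    """Hypothetical stand-in for the 'curvature formula' ground truth:
    divergence of the normalized gradient field, div(grad I / |grad I|)."""
    gy, gx = np.gradient(img.astype(float))          # vertical, horizontal
    mag = np.sqrt(gx ** 2 + gy ** 2) + 1e-8          # avoid division by zero
    nx, ny = gx / mag, gy / mag
    return np.gradient(nx, axis=1) + np.gradient(ny, axis=0)
```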
Step S4, stitching the images to be stitched using the target image stitching depth model to obtain a high-quality image stitching result.
Fig. 4 shows the structural-similarity comparison of the image stitching results obtained by different methods. The comparison algorithms include the method of Zaragoza et al. and the method of Zhao et al., where the former is a conventional image stitching method and the latter is a deep-learning-based image stitching method. The greater the structural similarity, the higher the quality of the stitching result. As can be seen from Fig. 4, the structural similarity of the present invention is greater than that of Zaragoza's method, illustrating the important role of the content-preservation-based deep homography model in image stitching. In addition, Zhao's method also performs worse than the present invention in terms of structural similarity, mainly because it uses only a single deep homography for image alignment, which can produce undesirable alignment distortion and thus lead to seam discontinuities. In contrast, the invention reduces the content artifacts and seam distortions of the stitching results by constructing a multi-stage alignment model and combining the content consistency loss with the seam smoothness loss.
The embodiment of the invention does not limit the types of other devices except the types of the devices, so long as the devices can complete the functions.
Those skilled in the art will appreciate that the drawings are schematic representations of only one preferred embodiment, and that the above-described embodiment numbers are merely for illustration purposes and do not represent advantages or disadvantages of the embodiments.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.
Claims (10)
1. An image stitching method based on a multi-stage alignment network, comprising the steps of:
Step S1, acquiring a training data set, wherein the training data set comprises a plurality of input image pairs and the image stitching result corresponding to each input image pair, and each input image pair comprises a reference image I1 and a target image I2;
Step S2, constructing an image stitching depth model;
Step S3, training the image stitching depth model based on the training data set and the overall loss function to obtain a target image stitching depth model;
Step S4, stitching the images to be stitched by using the target image stitching depth model to obtain an image stitching result.
2. The method of claim 1, wherein the image stitching depth model comprises an image pre-alignment sub-model and an image alignment sub-model, the image pre-alignment sub-model being used for pre-aligning the input image pair with a content-preservation-based deep homography estimation module and the image alignment sub-model being used for further aligning the pre-aligned input image pair with an edge-assisted network.
3. The method of claim 2, wherein the deep homography estimation module is formed by interleaving a plurality of symmetric convolution-layer units with a corresponding number of content-preserving attention modules, wherein each symmetric convolution-layer unit comprises two convolution layers and one max-pooling layer; each content-preserving attention module comprises a spatial attention module and a plurality of cross-operation modules, and the spatial attention module comprises two max-pooling layers, two average-pooling layers, a shared fully connected layer and an activation function layer.
4. The method of claim 2, wherein the edge-assisted network comprises a convolution layer, three multi-scale residual blocks, an upsampling layer and a bottleneck layer.
5. The method of claim 2, wherein pre-aligning the input image pair with the image pre-alignment sub-model comprises:
inputting the input image pair into the deep homography estimation module to obtain output feature maps F_i^R and F_i^T corresponding to the reference image I_1 and the target image I_2 of the input image pair;
obtaining a homography matrix from the output feature maps F_i^R and F_i^T by the direct linear transformation (DLT) method;
warping the reference image I_1 and the target image I_2 with a spatial transformer network, respectively, so as to pre-align the pixel positions of the overlapping region of I_1 and I_2, wherein the pre-aligned input image pair is represented as:

Î_1 = W_STN(I_1, E),  Î_2 = W_STN(I_2, H)

wherein E represents the identity matrix, H represents the homography matrix, and W_STN(·, ·) represents the output of the spatial transformer network.
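The two numerical steps of claim 5, DLT homography estimation and homography warping, can be sketched in numpy. This is a minimal stand-in, not the patent's network: the four correspondences below are hypothetical inputs (the patent derives them from the feature maps F_i^R and F_i^T), and the warp uses nearest-neighbour sampling where a spatial transformer would use differentiable bilinear sampling.

```python
import numpy as np

def dlt_homography(src, dst):
    """Direct linear transform: solve A h = 0 for the homography mapping
    src points onto dst points, taking the SVD null-space vector."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]                  # normalise so H[2, 2] = 1

def warp_homography(img, H):
    """Inverse-warp: each output pixel samples the input image at
    H^-1 (x, y, 1)^T; nearest neighbour, zero outside the frame."""
    n_rows, n_cols = img.shape
    ys, xs = np.mgrid[0:n_rows, 0:n_cols]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(xs.size)])
    src = np.linalg.inv(H) @ pts
    sx = np.rint(src[0] / src[2]).astype(int).reshape(n_rows, n_cols)
    sy = np.rint(src[1] / src[2]).astype(int).reshape(n_rows, n_cols)
    valid = (sx >= 0) & (sx < n_cols) & (sy >= 0) & (sy < n_rows)
    out = np.zeros_like(img)
    out[valid] = img[sy[valid], sx[valid]]
    return out

# Pre-alignment as in claim 5: the reference keeps the identity E,
# the target would be warped by the estimated homography H.
src = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
dst = [(10.0, 5.0), (110.0, 5.0), (110.0, 105.0), (10.0, 105.0)]
H = dlt_homography(src, dst)            # pure translation by (10, 5)
img = np.arange(16.0).reshape(4, 4)
ref_pre = warp_homography(img, np.eye(3))   # identity E: image unchanged
```

With four exact correspondences the 8x9 DLT system has a one-dimensional null space, so the SVD recovers the homography up to scale; the final division fixes the scale.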
6. The method of claim 2, wherein aligning the pre-aligned input image pair with the image alignment sub-model comprises:
obtaining basic feature maps of the pre-aligned reference image and target image with the convolution layer of the edge-assisted network;
extracting edge feature maps of the pre-aligned reference image and target image with the edge-assisted network;
concatenating the edge feature maps of the pre-aligned reference image and target image with the corresponding basic feature maps to obtain fused feature maps of the pre-aligned reference image and target image;
computing feature flows f_1 and f_2 of the pre-aligned reference image and target image from their fused feature maps by the context correlation method;
feeding the pre-aligned reference image and target image together with their feature flows into a deep mesh deformation network to obtain the aligned reference image Ĩ_1 and target image Ĩ_2, represented as:

Ĩ_1 = W_mesh(Î_1, f_1),  Ĩ_2 = W_mesh(Î_2, f_2)

wherein

F_1c = [F_1conv, F_1edge],  F_2c = [F_2conv, F_2edge],  (f_1, f_2) = CCL(F_1c, F_2c)

wherein F_1conv and F_2conv represent the basic feature maps of the pre-aligned image pair, F_1edge and F_2edge represent the edge feature maps of the pre-aligned image pair, F_1c and F_2c represent the fused feature maps of the pre-aligned image pair, [·, ·] represents the concatenation operation, CCL(·, ·) represents the context correlation method, W_mesh(·, ·) represents the deep mesh deformation network, Î_1 represents the pre-aligned reference image, and Î_2 represents the pre-aligned target image.
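The context correlation step can be illustrated with a crude, non-learned stand-in: an exhaustive local correlation whose argmax displacement plays the role of the feature flow. The patent's CCL is a differentiable network component; the function below is only a sketch under that analogy, with hypothetical names.

```python
import numpy as np

def local_correlation_flow(f1, f2, radius=2):
    """Stand-in for a context correlation layer: for each position of f1,
    correlate its feature vector against a (2*radius+1)^2 search window
    in f2 and return the displacement of the best match as a (dy, dx)
    feature-flow field."""
    h, w, _ = f1.shape
    pad = np.pad(f2, ((radius, radius), (radius, radius), (0, 0)))
    flow = np.zeros((h, w, 2))
    for y in range(h):
        for x in range(w):
            win = pad[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            corr = np.tensordot(win, f1[y, x], axes=([2], [0]))
            dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
            flow[y, x] = (dy - radius, dx - radius)
    return flow

# One-hot "features" make every match unambiguous: shifting the map one
# column to the right must yield a flow of (0, +1) wherever the true
# match still lies inside the search window.
f1 = np.eye(64).reshape(8, 8, 64)
f2 = np.roll(f1, 1, axis=1)
flow = local_correlation_flow(f1, f2)
```

A learned CCL would regress the flow from the whole correlation volume instead of taking a hard argmax, which keeps the operation differentiable.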
7. The method of claim 1, wherein the overall loss function comprises a content consistency loss and a seam smoothness loss, the overall loss function L_All being expressed as:

L_All = αL_cont + βL_seam

wherein L_cont represents the content consistency loss, L_seam represents the seam smoothness loss, and α and β are the weights of the content consistency loss and the seam smoothness loss, respectively.
8. The method of claim 7, wherein the content consistency loss consists of a photometric loss term and a structural loss term.
9. The method of claim 8, wherein the photometric loss term L_photo is expressed as:

L_photo = ‖I_F − I_G‖_1

wherein I_F and I_G represent the final image stitching result and its ground truth, respectively, and ‖·‖_1 represents the L1 norm;

and the structural loss term L_struc is expressed as:

L_struc = Σ_i ‖φ_i(I_F) − φ_i(I_G)‖_2

wherein φ_i(·) represents the output of the conv1_i layer of the VGG-16 network, and ‖·‖_2 represents the L2 norm.
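A minimal numpy sketch of the two content-consistency terms of claims 8 and 9. The VGG-16 features φ_i are replaced by arbitrary caller-supplied extractors; the gradient-based `feats` below are a hypothetical stand-in for illustration, not the patent's choice.

```python
import numpy as np

def photometric_loss(i_f, i_g):
    """L_photo = || I_F - I_G ||_1  (sum of absolute pixel differences)."""
    return np.abs(i_f - i_g).sum()

def structure_loss(i_f, i_g, extractors):
    """L_struc = sum_i || phi_i(I_F) - phi_i(I_G) ||_2. The patent takes
    phi_i from VGG-16 conv layers; here `extractors` is any list of
    feature-extractor callables standing in for them."""
    return sum(np.linalg.norm(f(i_f) - f(i_g)) for f in extractors)

# Hypothetical stand-in "features": horizontal and vertical gradients.
feats = [lambda im: np.gradient(im, axis=1),
         lambda im: np.gradient(im, axis=0)]
```

A content consistency loss would then combine the two terms, e.g. `photometric_loss(a, b) + structure_loss(a, b, feats)`.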
10. The method of claim 7, wherein the seam smoothness loss L_seam is expressed as:

L_seam = ‖E_1 − E_1G‖_1 + ‖E_2 − E_2G‖_1

wherein

E_1 = E_net(Ĩ_1),  E_2 = E_net(Ĩ_2),  E_kG = div(∇Ĩ_k / |∇Ĩ_k|), k = 1, 2

wherein E_1 and E_2 are the edge maps of the aligned image pair, E_1G and E_2G are the ground-truth edge maps of the aligned image pair obtained with the curvature formula, E_net(·) represents the output of the edge-assisted network, Ĩ_1 represents the aligned reference image, Ĩ_2 represents the aligned target image, m and n represent the horizontal and vertical directions of the gradient, and ∇ and div(·) represent the gradient and divergence operations, respectively.
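The curvature-based edge ground truth of claim 10 can be sketched under one plausible reading of the curvature formula, E_G = div(∇I/|∇I|), with finite differences along the horizontal (m) and vertical (n) directions; this is an assumed reading of the claim, not a confirmed reproduction.

```python
import numpy as np

def curvature_edge_map(img, eps=1e-8):
    """Edge ground truth via the curvature formula E_G = div(grad I / |grad I|),
    with the gradient taken along the horizontal (m) and vertical (n)
    directions by finite differences; eps avoids division by zero."""
    g_m = np.gradient(img, axis=1)          # horizontal (m) derivative
    g_n = np.gradient(img, axis=0)          # vertical (n) derivative
    mag = np.sqrt(g_m ** 2 + g_n ** 2) + eps
    return np.gradient(g_m / mag, axis=1) + np.gradient(g_n / mag, axis=0)
```

On a linear intensity ramp the normalised gradient is constant, so its divergence (and hence the curvature edge response) vanishes, while intensity steps produce a nonzero response along the edge.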
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310517330.7A CN116596815A (en) | 2023-05-09 | 2023-05-09 | Image stitching method based on multi-stage alignment network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116596815A true CN116596815A (en) | 2023-08-15 |
Family
ID=87594825
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310517330.7A Pending CN116596815A (en) | 2023-05-09 | 2023-05-09 | Image stitching method based on multi-stage alignment network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116596815A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117710207A (en) * | 2024-02-05 | 2024-03-15 | 天津师范大学 | Image stitching method based on progressive alignment and interweaving fusion network |
CN117710207B (en) * | 2024-02-05 | 2024-07-12 | 天津师范大学 | Image stitching method based on progressive alignment and interweaving fusion network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110570371B (en) | Image defogging method based on multi-scale residual error learning | |
CN105761233A (en) | FPGA-based real-time panoramic image mosaic method | |
CN107734268A | A structure-preserving wide-baseline video stitching method | |
CN108171735B (en) | Billion pixel video alignment method and system based on deep learning | |
CN111626927B (en) | Binocular image super-resolution method, system and device adopting parallax constraint | |
CN112288628B (en) | Aerial image splicing acceleration method and system based on optical flow tracking and frame extraction mapping | |
CN106910208A | A scene image stitching method in the presence of moving targets | |
CN105488777A (en) | System and method for generating panoramic picture in real time based on moving foreground | |
CN116596815A (en) | Image stitching method based on multi-stage alignment network | |
CN106846249A | A panoramic video stitching method | |
CN114820408A (en) | Infrared and visible light image fusion method based on self-attention and convolutional neural network | |
CN105069749A (en) | Splicing method for tire mold images | |
CN111654621B (en) | Dual-focus camera continuous digital zooming method based on convolutional neural network model | |
CN110838086A (en) | Outdoor image splicing method based on correlation template matching | |
CN116152068A (en) | Splicing method for solar panel images | |
CN103793891A (en) | Low-complexity panorama image joint method | |
CN107330856B (en) | Panoramic imaging method based on projective transformation and thin plate spline | |
CN113112404A (en) | Image splicing method and device based on sliding window | |
CN117173012A (en) | Unsupervised multi-view image generation method, device, equipment and storage medium | |
Lai et al. | Hyperspectral Image Super Resolution With Real Unaligned RGB Guidance | |
CN111047513A (en) | Robust image alignment method and device for cylindrical panoramic stitching | |
WO2022247394A1 (en) | Image splicing method and apparatus, and storage medium and electronic device | |
CN115578260A (en) | Attention method and system for direction decoupling for image super-resolution | |
CN113450394B (en) | Different-size image registration method based on Siamese network | |
CN115249206A (en) | Image super-resolution reconstruction method of lightweight attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||