CN115965529A - Image stitching method based on unsupervised learning and generative adversarial network - Google Patents


Info

Publication number
CN115965529A
Authority
CN
China
Prior art keywords: image, network, representing, target image, data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211676957.9A
Other languages
Chinese (zh)
Inventor
林怡格 (Lin Yige)
李晓鹏 (Li Xiaopeng)
许毅杰 (Xu Yijie)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Lianshitai Electronic Information Technology Co ltd
Original Assignee
Suzhou Lianshitai Electronic Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Lianshitai Electronic Information Technology Co ltd
Priority to CN202211676957.9A
Publication of CN115965529A
Legal status: Pending


Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an image stitching method based on unsupervised learning and a generative adversarial network, comprising the following steps: (1) feeding the two images to be stitched into an alignment model as a reference image and a target image, and computing the grid vertex offsets; (2) applying a projective transformation to the target image according to the grid vertex offsets to obtain an aligned target image; (3) inputting the aligned target image and the reference image into a stitching model for stitching, obtaining the stitched image. The method achieves accurate image stitching.

Description

Image stitching method based on unsupervised learning and generative adversarial network
Technical Field
The invention relates to the fields of computer vision and artificial intelligence, in particular to an image stitching method based on unsupervised learning and a generative adversarial network.
Background
Image stitching is the technique of stitching two images that have parallax but share an overlapping region into a seamless high-definition panorama; it is widely applied in autonomous driving, video security, and virtual reality.
In the traditional image stitching pipeline, corresponding feature points in the two images to be stitched are extracted with hand-crafted feature detectors, a 3×3 homography matrix capturing image translation, rotation, scaling, and perspective transformation is computed from them, one image is projectively transformed with the homography to align it with the other, and the two aligned images are then fused into the final panorama. However, traditional methods have limited ability to learn features and fuse images: the alignment is often poor, and the fused picture frequently shows misalignment and ghosting defects.
Thanks to the powerful automatic feature learning of deep learning, neural-network-based image stitching has become mainstream. Deep-learning stitching algorithms mainly comprise two stages. The first stage is image alignment: a convolutional neural network extracts corresponding feature points from the two images to be stitched, and the images are then aligned by projective transformation. The second stage is image fusion: the two aligned images are fed to a neural network, which outputs a panoramic stitched image whose overlapping region transitions smoothly.
For most current deep-learning stitching algorithms, the network used in the first stage has a simple structure but a large number of parameters, so training and inference are slow. Training is supervised, with training images generated artificially by homography transformation; these deviate from the multi-depth-of-field, multi-plane alignment found in real-world images, and the alignment algorithm usually applies only a single homography to the target image. All of these factors keep the final alignment from being ideal, leaving ample room for improvement. In the second stage of mainstream algorithms, the quality of the stitched image is constrained by several hand-designed loss functions on the output image; this approach can hardly guarantee that the stitched image attains the texture of a real image, and the fused image shows artifacts and obvious stitching seams.
Disclosure of Invention
In view of the above, the present invention provides an image stitching method based on unsupervised learning and a generative adversarial network, comprising the following steps:
(1) Feeding the two images to be stitched into an alignment model as a reference image and a target image, and computing the grid vertex offsets;
(2) Applying a projective transformation to the target image according to the grid vertex offsets to obtain an aligned target image;
(3) Inputting the aligned target image and the reference image into a stitching model for stitching, obtaining the stitched image.
Preferably, the process of constructing the alignment model includes:
(a) Constructing the alignment model;
(b) Selecting a public image data set and obtaining image pairs by cropping and transforming within each image to form data set A1; acquiring image pairs from the real world whose overlapping regions have different proportions and parallax to form data set A2; each image pair comprises a reference image and a target image;
(c) Taking data set A1 as the sample set and performing steps (d) and (e);
(d) Inputting the image pairs of the sample set into the alignment model to extract features, and computing and outputting (n+1)×(m+1)×2 grid vertex offsets from the features;
(e) Constructing n×m transformation matrices from the grid vertex offsets, uniformly dividing the target image into n×m image blocks, projectively transforming the corresponding blocks with the n×m matrices and stitching them to obtain the aligned target image, and adjusting the network parameters of the alignment model by comparing the similarity of the overlapping region of the aligned target image and the reference image;
(f) Taking data set A2 as the sample set and, on the basis of step (e), repeating steps (d) and (e) to fine-tune the network parameters of the alignment model, obtaining the trained alignment model.
Preferably, the alignment model includes two branches of identical structure, used respectively to extract feature maps from the reference image and the target image; each branch includes a convolutional layer and N CSP modules, each CSP module outputting a feature map that is fed to the next CSP module;
after the feature maps output by the same-level CSP modules of the two branches are concatenated along the channel dimension, several convolutional layers extract and refine features from the concatenation, and a regression network composed of an average pooling layer and a fully connected layer then regresses one set of grid vertex offsets from the refined result.
Preferably, each CSP module includes two sub-branches. The first sub-branch, formed by sequentially connecting a CBS module (a convolutional layer, a batch normalization layer, and a SiLU activation layer), a ResNet residual unit, and a convolutional layer, extracts a feature map; the second sub-branch has only a single convolutional layer for extracting a feature map. The feature maps of the two sub-branches are then concatenated and passed through a batch normalization layer, a Leaky ReLU activation layer, and a CBS module, which computes the module's output feature map.
Preferably, constructing the N transformation matrices from the grid vertex offsets includes:
when there are N grid vertex offsets, denoted $S_i$, $i = 1, 2, \dots, N$, the N constructed transformation matrices are $S_1$, $S_1+S_2$, $S_1+S_2+S_3$, …, $S_1+S_2+\dots+S_N$, formulated as

$$T_i = \sum_{j=1}^{i} S_j, \qquad i = 1, 2, \dots, N.$$
Preferably, adjusting the network parameters of the alignment model by comparing the similarity of the overlapping region of the aligned target image and the reference image comprises:
constructing the following loss function $L_{align}$ and adjusting the network parameters of the alignment model by minimizing it:

$$L_{align} = \sum_{i=1}^{N} \lambda_i \left\| \mathcal{W}_{T_i}(E) \odot I_B - \mathcal{W}_{T_i}(I_A) \right\|_1$$

where $I_A$ denotes the target image, $I_B$ the reference image, $\mathcal{W}_{T_i}(\cdot)$ projective warping with the i-th transformation matrix $T_i$, $E$ an all-ones matrix of the same size as the image, $\lambda_i$ the loss weight of each transformation, $\odot$ pixel-wise multiplication, and $\|\cdot\|_1$ the 1-norm.
Preferably, the construction process of the stitching model comprises:
(i) The stitching model adopts a generative adversarial network; a generator and a discriminator of the network are established;
(ii) Acquiring image pairs from the real world whose overlapping regions have different proportions and parallax to form data set A2; feeding data set A2 into the trained alignment model to obtain the aligned target images; setting the original image pairs of data set A2 as real labels and the stitched images generated by the generator as synthetic labels; directly stacking each aligned target image with its reference image and replacing the pixel values of the overlapping region with the average of the two images' pixels to obtain an overlay image, which is also set as a synthetic label; thereby obtaining a training data set;
(iii) Feeding the training data set into the generative adversarial network, training it with the adversarial loss function, updating the network parameters, and taking the parameter-optimized generator as the stitching model.
Preferably, the generator adopts an encoder–decoder structure; specifically, the generator is formed by sequentially connecting several convolutional layers and an equal number of deconvolution layers, each convolutional and deconvolution layer being followed by a batch normalization layer.
Preferably, the discriminator network consists of convolutional layers, an average pooling layer, and a fully connected layer.
Preferably, the adversarial loss function includes a generator loss and a discriminator loss;
wherein the discriminator loss is expressed as:

$$L_D = \frac{1}{2}\,\mathbb{E}_{x \sim p_x}\!\left[(D(x) - b)^2\right] + \frac{1}{2}\,\mathbb{E}_{z \sim p_z}\!\left[(D(G(z)) - a)^2\right]$$

and the generator loss as:

$$L_G = \frac{1}{2}\,\mathbb{E}_{z \sim p_z}\!\left[(D(G(z)) - c)^2\right]$$

where $a$ denotes the synthetic label, $b$ the real label, $p_x$ the distribution of real images $x$ (including the original image pairs), $p_z$ the distribution of the stacked aligned target and reference images $z$, $D(\cdot)$ the discriminator, $G(\cdot)$ the generator, and $c$ the value the generator wants the discriminator to output for generated data.
Compared with the prior art, the invention has at least the following beneficial effects:
The invention optimizes the structure of the alignment model in the alignment stage of the image stitching algorithm, reducing its parameter count, making the structure better suited to the image alignment task, lowering the consumption of computing resources, and improving the network's training, convergence, and inference speed. A grid-based image transformation mechanism is introduced, so the projective transformation better matches the real world and the alignment is more accurate.
Compared with constraining the fusion result with traditional hand-designed loss functions, the game between the generator and the discriminator lets the generator automatically learn the characteristics of real-world images, so the overlapping and transition regions of the final stitched image appear more natural, and stitching seams and artifacts are greatly reduced.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of the image stitching method based on unsupervised learning and a generative adversarial network according to the present invention;
FIG. 2 is a flow chart of the construction of the alignment model according to the present invention;
FIG. 3 is a schematic diagram of the alignment model according to the present invention;
FIG. 4 is a schematic structural diagram of a CSP module in the alignment model according to the present invention;
FIG. 5 is a flow chart of the construction of the stitching model according to the present invention;
FIG. 6 is a schematic diagram of the generator in the stitching model according to the present invention;
FIG. 7 is a schematic structural diagram of the discriminator in the stitching model according to the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the detailed description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the invention.
As shown in fig. 1, an embodiment provides an image stitching method based on unsupervised learning and a generative adversarial network, including the following steps:
Step 1: the two images to be stitched are fed into the alignment model as a reference image and a target image, and the grid vertex offsets are computed.
In this embodiment, the alignment model is trained in an unsupervised manner; it receives as input two images with parallax that need to be stitched and outputs the transformation result in the form of grid vertex offsets.
In an embodiment, the alignment model is built on a convolutional neural network; as shown in fig. 2, the construction process includes:
(a) Constructing the alignment model.
In an embodiment, as shown in fig. 3, the alignment model includes two branches of identical structure whose network parameters are kept consistent; the two branches extract feature maps from the reference image and the target image respectively, and each branch includes a convolutional layer and N CSP modules, each CSP module outputting a feature map that is fed to the next CSP module. After the feature maps output by the same-level CSP modules of the two branches are concatenated along the channel dimension, several convolutional layers extract and refine features from the concatenation, and a regression network composed of an average pooling layer and a fully connected layer then regresses one set of grid vertex offsets from the refined result. Fig. 3 illustrates the case of 3 CSP modules, which produce a total of 3 sets of grid vertex offsets.
As shown in fig. 4, each CSP module includes two sub-branches. The first sub-branch, formed by sequentially connecting a CBS module (a convolutional layer, a batch normalization layer, and a SiLU activation layer), a ResNet residual unit, and a convolutional layer, extracts a feature map; the second sub-branch has only a single convolutional layer for extracting a feature map. The feature maps of the two sub-branches are then concatenated and passed through a batch normalization layer, a Leaky ReLU activation layer, and a CBS module, which computes the module's output feature map.
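As a concrete illustration, the CSP module and the offset-regression head might be sketched in PyTorch as below. This is a minimal sketch, not the patented implementation: channel widths, strides, the residual-unit layout, and the refinement depth are our own assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """CBS block: convolution + batch normalization + SiLU activation."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Residual(nn.Module):
    """Plain ResNet-style residual unit."""
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(CBS(c, c), CBS(c, c))

    def forward(self, x):
        return x + self.body(x)

class CSPModule(nn.Module):
    """Sub-branch 1: CBS -> residual unit -> conv; sub-branch 2: a single conv.
    The two results are concatenated, then BN -> LeakyReLU -> CBS."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_out // 2
        self.branch1 = nn.Sequential(
            CBS(c_in, c_mid, s=2), Residual(c_mid),
            nn.Conv2d(c_mid, c_mid, 1, bias=False))
        self.branch2 = nn.Conv2d(c_in, c_mid, 3, 2, 1, bias=False)
        self.merge = nn.Sequential(
            nn.BatchNorm2d(2 * c_mid), nn.LeakyReLU(0.1),
            CBS(2 * c_mid, c_out))

    def forward(self, x):
        return self.merge(torch.cat([self.branch1(x), self.branch2(x)], dim=1))

class OffsetHead(nn.Module):
    """Concatenate same-level feature maps of the two branches, refine with a
    convolution, then regress (n+1)*(m+1)*2 grid vertex offsets via average
    pooling + a fully connected layer."""
    def __init__(self, c, n=4, m=4):
        super().__init__()
        self.refine = CBS(2 * c, 2 * c)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2 * c, (n + 1) * (m + 1) * 2)

    def forward(self, feat_ref, feat_tgt):
        f = self.refine(torch.cat([feat_ref, feat_tgt], dim=1))
        return self.fc(self.pool(f).flatten(1))
```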
(b) Constructing data sets A1 and A2 for training the alignment model.
In an embodiment, a public image data set such as MS COCO is selected, and image pairs are obtained by cropping and transforming within each image to form data set A1. Specifically, an image is selected from the MS COCO data set and cropped into 64×64 image pairs whose overlapping regions have different proportions; applying a random projective transformation to one image of each pair yields a group of image pairs to be stitched. A minimal generation sketch is given below.
In this embodiment, image pairs whose overlapping regions have different proportions and parallax are also acquired from the real world to form data set A2, where each image pair comprises a reference image and a target image.
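The following sketch shows one way to generate an A1 training pair as described above; the overlap-ratio logic, the corner-perturbation range, and the function name are illustrative assumptions rather than the patent's exact procedure.

```python
import cv2
import numpy as np

def make_pair(img, size=64, max_shift=8, overlap=0.7):
    """Crop two overlapping patches from one image and warp the second with a
    random homography, yielding a (reference, target) pair to be stitched."""
    h, w = img.shape[:2]
    dx = int(size * (1 - overlap))          # horizontal offset sets the overlap ratio
    y = np.random.randint(0, h - size)
    x = np.random.randint(0, w - size - dx)
    ref = img[y:y + size, x:x + size]
    tgt = img[y:y + size, x + dx:x + dx + size]
    # randomly perturb the four corners of the target patch
    src = np.float32([[0, 0], [size, 0], [size, size], [0, size]])
    dst = src + np.random.uniform(-max_shift, max_shift, (4, 2)).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    tgt = cv2.warpPerspective(tgt, H, (size, size))
    return ref, tgt
```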
(c) Taking data set A1 as the sample set.
(d) Inputting the image pairs of the sample set into the alignment model to extract features, and computing and outputting N sets of grid vertex offsets from the features.
In one embodiment, as shown in fig. 3, each time the input images pass through a CSP module, the feature maps of the two branches are concatenated and fed into a regression network consisting of an average pooling layer and a fully connected layer, which outputs a grid offset tensor of size (n+1)×(m+1)×2. The invention represents the image transformation as an n×m grid transformation, so the coordinate offsets of the grid vertices can be expressed as an (n+1)×(m+1)×2 tensor.
(e) Constructing n×m transformation matrices from the grid vertex offsets and updating the network parameters of the alignment model after transforming the target image according to the transformation matrices.
In this embodiment, n×m transformation matrices are constructed from the grid vertex offsets, the target image is uniformly divided into n×m image blocks, the corresponding blocks are projectively transformed with the n×m matrices and stitched together to obtain the aligned target image, and the network parameters of the alignment model are adjusted by comparing the similarity of the overlapping region of the aligned target image and the reference image.
The transformation matrices are constructed from the grid vertex offsets as follows: when there are N sets of grid offsets, denoted $S_i$, $i = 1, 2, \dots, N$, the N constructed transformation matrices are $S_1$, $S_1+S_2$, $S_1+S_2+S_3$, …, $S_1+S_2+\dots+S_N$, formulated as

$$T_i = \sum_{j=1}^{i} S_j, \qquad i = 1, 2, \dots, N.$$

As shown in fig. 3, with 3 grid vertex offsets, 3 transformation matrices are constructed, denoted $S_1$, $S_1+S_2$, and $S_1+S_2+S_3$.
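The block-wise warp can then be sketched as follows: each grid cell of the target image is warped with the homography defined by its four (cumulatively) displaced vertices, and the warped quads are composited into the aligned image. The helper below is an illustrative NumPy/OpenCV sketch under the assumption that `offsets` holds the summed vertex displacements; it is not the patented implementation.

```python
import cv2
import numpy as np

def warp_by_mesh(target, offsets, n, m):
    """Piecewise-projective warp. `offsets` has shape (n + 1, m + 1, 2): the
    displacement of each grid vertex (summed over the predicted levels)."""
    h, w = target.shape[:2]
    bh, bw = h // n, w // m
    out = np.zeros_like(target)
    for i in range(n):
        for j in range(m):
            # four corners of cell (i, j): TL, TR, BR, BL
            src = np.float32([[j * bw, i * bh], [(j + 1) * bw, i * bh],
                              [(j + 1) * bw, (i + 1) * bh], [j * bw, (i + 1) * bh]])
            dst = src + offsets[[i, i, i + 1, i + 1], [j, j + 1, j + 1, j], :]
            H = cv2.getPerspectiveTransform(src, dst.astype(np.float32))
            warped = cv2.warpPerspective(target, H, (w, h))
            # keep only the destination quad of this cell
            mask = np.zeros((h, w), np.uint8)
            cv2.fillConvexPoly(mask, dst.astype(np.int32), 1)
            out[mask.astype(bool)] = warped[mask.astype(bool)]
    return out
```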
The following loss function $L_{align}$ is constructed, and the network parameters of the alignment model are adjusted by minimizing it:

$$L_{align} = \sum_{i=1}^{N} \lambda_i \left\| \mathcal{W}_{T_i}(E) \odot I_B - \mathcal{W}_{T_i}(I_A) \right\|_1$$

where $I_A$ denotes the target image, $I_B$ the reference image, $\mathcal{W}_{T_i}(\cdot)$ projective warping with the i-th transformation matrix $T_i$, $E$ an all-ones matrix of the same size as the image, $\lambda_i$ the loss weight of each transformation, $\odot$ pixel-wise multiplication, and $\|\cdot\|_1$ the 1-norm.
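Under this reading of the formula, the loss can be sketched as below: warping the all-ones image $E$ gives a mask of the warped target's footprint, so the comparison with the reference is restricted to the overlap. This NumPy version only illustrates the computation; training requires a differentiable (spatial-transformer-style) warp.

```python
import numpy as np

def align_loss(I_A, I_B, warps, weights):
    """L_align = sum_i lambda_i * || W_i(E) * I_B - W_i(I_A) ||_1.
    `warps` is a list of callables implementing W_i, e.g. warp_by_mesh with
    the cumulative offsets T_i = S_1 + ... + S_i."""
    I_A = I_A.astype(np.float32)
    I_B = I_B.astype(np.float32)
    E = np.ones_like(I_A)
    loss = 0.0
    for w_i, lam in zip(warps, weights):
        mask = w_i(E)                 # ~1 inside the warped target footprint
        loss += lam * np.abs(mask * I_B - w_i(I_A)).sum()
    return loss
```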
(f) Taking data set A2 as the sample set and jumping to step (d).
In an embodiment, on the basis of step (e), data set A2 is taken as the sample set and steps (d) and (e) are repeated to fine-tune the network parameters of the alignment model, yielding the trained alignment model.
Step 2: a projective transformation is applied to the target image according to the grid vertex offsets to obtain the aligned target image.
In this embodiment, once the grid vertex offsets have been obtained, the target image is projectively transformed according to them to obtain the aligned target image.
Step 3: the aligned target image and the reference image are input into the stitching model for stitching, yielding the stitched image.
In this embodiment, the stitching model adopts a generative adversarial network structure, and the generator, trained with a least-squares loss, learns automatically to fuse the aligned images into a stitched image. As shown in fig. 5, the construction of the stitching model includes:
(i) The stitching model adopts a generative adversarial network; a generator and a discriminator of the network are established.
In an embodiment, as shown in fig. 6, the generator adopts an encoder–decoder structure; specifically, it is formed by sequentially connecting several convolutional layers and an equal number of deconvolution layers, each convolutional and deconvolution layer being followed by a batch normalization layer.
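Such a generator might be sketched as below; depth, channel widths, kernel sizes, activations, and the 6-channel stacked input are our assumptions, since the patent only fixes the conv/deconv/batch-norm pattern.

```python
import torch.nn as nn

def make_generator(depth=4, base=64, in_ch=6, out_ch=3):
    """Encoder of stride-2 convolutions, decoder of stride-2 deconvolutions,
    each layer followed by batch normalization. The input is the aligned
    target and reference stacked along the channel dimension."""
    enc, c = [], in_ch
    for i in range(depth):
        nc = base * 2 ** i
        enc += [nn.Conv2d(c, nc, 4, 2, 1), nn.BatchNorm2d(nc), nn.ReLU(inplace=True)]
        c = nc
    dec = []
    for i in reversed(range(depth)):
        nc = out_ch if i == 0 else base * 2 ** (i - 1)
        dec += [nn.ConvTranspose2d(c, nc, 4, 2, 1), nn.BatchNorm2d(nc)]
        dec += [nn.Tanh() if i == 0 else nn.ReLU(inplace=True)]
        c = nc
    return nn.Sequential(*enc, *dec)
```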
As shown in fig. 7, the discriminator network is composed of four convolutional layers, an average pooling layer, and a fully connected layer; the output probability is used to judge whether the input picture is a real image or a synthetic image.
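A matching sketch of the four-convolution discriminator, with assumed channel widths and activations:

```python
import torch.nn as nn

def make_discriminator(in_ch=3, base=64):
    """Four conv layers -> average pooling -> fully connected score."""
    return nn.Sequential(
        nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(base, base * 2, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(base * 2, base * 4, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(base * 4, base * 8, 4, 2, 1), nn.LeakyReLU(0.2, inplace=True),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(base * 8, 1))
```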
(ii) Constructing the training data set of the stitching model.
In this embodiment, image pairs whose overlapping regions have different proportions and parallax are acquired from the real world to form data set A2, and data set A2 is fed into the trained alignment model to obtain the aligned target images. The original image pairs of data set A2 are set as real labels and the stitched images generated by the generator as synthetic labels; in addition, each aligned target image is directly stacked with its reference image and the pixel values of the overlapping region are replaced by the average of the two images' pixels to obtain an overlay image, which is also set as a synthetic label. Each group of data thus contains two real images (the original image pair) and two synthetic images (the stitched image and the overlay image), forming the training data set.
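The overlay image can be formed as in the sketch below; deriving the validity masks from non-zero pixels is our simplifying assumption:

```python
import numpy as np

def make_overlay(ref, tgt_aligned):
    """Stack the two aligned images on a common canvas; in the overlap, where
    both are valid, replace pixels by the average of the two images."""
    ref = ref.astype(np.float32)
    tgt = tgt_aligned.astype(np.float32)
    m_ref = (ref.sum(axis=2, keepdims=True) > 0).astype(np.float32)
    m_tgt = (tgt.sum(axis=2, keepdims=True) > 0).astype(np.float32)
    both = m_ref * m_tgt
    # each image contributes alone outside the overlap, their mean inside it
    overlay = ref * m_ref + tgt * m_tgt - 0.5 * (ref + tgt) * both
    return overlay.astype(np.uint8)
```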
(iii) Feeding the training data set into the generative adversarial network, training it with the adversarial loss function, updating the network parameters, and taking the parameter-optimized generator as the stitching model.
In an embodiment, following the loss function of LSGAN, let $a$ denote the synthetic label, $b$ the real label, $p_x$ the distribution of real images $x$ (including the original image pairs), $p_z$ the distribution of the stacked aligned target and reference images $z$, $D(\cdot)$ the discriminator, $G(\cdot)$ the generator, and $c$ the value the generator wants the discriminator to output for generated data. The discriminator loss is then expressed as:

$$L_D = \frac{1}{2}\,\mathbb{E}_{x \sim p_x}\!\left[(D(x) - b)^2\right] + \frac{1}{2}\,\mathbb{E}_{z \sim p_z}\!\left[(D(G(z)) - a)^2\right]$$

and the generator loss as:

$$L_G = \frac{1}{2}\,\mathbb{E}_{z \sim p_z}\!\left[(D(G(z)) - c)^2\right]$$
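One training step with these least-squares losses might look like this sketch (optimizer setup and data loading omitted; the default label values follow the example below):

```python
import torch

def lsgan_step(G, D, opt_g, opt_d, real, z, a=-1.0, b=1.0, c=0.0):
    """One LSGAN update. `real` holds real images (the original pairs); `z` is
    the stacked aligned target + reference fed to the generator."""
    # discriminator: push D(real) toward b and D(G(z)) toward a
    fake = G(z).detach()
    d_loss = 0.5 * ((D(real) - b) ** 2).mean() + 0.5 * ((D(fake) - a) ** 2).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # generator: push D(G(z)) toward c
    g_loss = 0.5 * ((D(G(z)) - c) ** 2).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```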
alternatively, the tag parameter may be set to a = -1,b =1,c =0. And then inputting the four images of each group into a confrontation generation network, and updating parameters of a generator and a discriminator simultaneously through the two losses to finish training.
After training, the trained generator serves as the stitching model; in application, the aligned target image and the reference image are input into the stitching model, which computes and outputs the stitched image.
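End to end, inference might be wired up as in the sketch below, re-using the mesh-warp helper sketched earlier; the tensor layout, normalization, and the alignment model's call signature are assumptions:

```python
import torch

@torch.no_grad()
def stitch(align_model, generator, ref, tgt, n=4, m=4):
    """ref / tgt: float tensors of shape (1, 3, H, W) in [0, 1]."""
    offsets = align_model(ref, tgt)                   # (1, (n+1)*(m+1)*2)
    offsets = offsets.view(n + 1, m + 1, 2).cpu().numpy()
    tgt_np = tgt[0].permute(1, 2, 0).cpu().numpy()
    aligned = warp_by_mesh(tgt_np, offsets, n, m)     # mesh warp sketched above
    aligned = torch.from_numpy(aligned).permute(2, 0, 1)[None].to(ref)
    return generator(torch.cat([aligned, ref], dim=1))
```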
The method of this embodiment trains the alignment model by unsupervised learning and the stitching model with a generative adversarial structure; a high-precision model can thus be trained quickly without labeled data, improving the accuracy of image stitching.
The above embodiments are intended to illustrate the technical solutions and advantages of the present invention. It should be understood that they are only preferred embodiments and are not intended to limit the invention; any modifications, additions, or equivalents made within the scope of the principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. An image stitching method based on unsupervised learning and a generative adversarial network, characterized by comprising the following steps:
(1) Feeding the two images to be stitched into an alignment model as a reference image and a target image, and computing the grid vertex offsets;
(2) Applying a projective transformation to the target image according to the grid vertex offsets to obtain an aligned target image;
(3) Inputting the aligned target image and the reference image into a stitching model for stitching, obtaining the stitched image.
2. The image stitching method based on unsupervised learning and a generative adversarial network according to claim 1, wherein the construction process of the alignment model comprises:
(a) Constructing the alignment model;
(b) Selecting a public image data set and obtaining image pairs by cropping and transforming within each image to form data set A1; acquiring image pairs from the real world whose overlapping regions have different proportions and parallax to form data set A2; each image pair comprises a reference image and a target image;
(c) Taking data set A1 as the sample set and performing steps (d) and (e);
(d) Inputting the image pairs of the sample set into the alignment model to extract features, and computing and outputting (n+1)×(m+1)×2 grid vertex offsets from the features;
(e) Constructing n×m transformation matrices from the grid vertex offsets, uniformly dividing the target image into n×m image blocks, projectively transforming the corresponding blocks with the n×m matrices and stitching them to obtain the aligned target image, and adjusting the network parameters of the alignment model by comparing the similarity of the overlapping region of the aligned target image and the reference image;
(f) Taking data set A2 as the sample set and, on the basis of step (e), repeating steps (d) and (e) to fine-tune the network parameters of the alignment model, obtaining the trained alignment model.
3. The image stitching method based on unsupervised learning and a generative adversarial network according to claim 2, wherein the alignment model comprises two branches of identical structure for extracting feature maps from the reference image and the target image respectively, each branch comprising a convolutional layer and N CSP modules, each CSP module outputting a feature map that is fed to the next CSP module;
after the feature maps output by the same-level CSP modules of the two branches are concatenated along the channel dimension, several convolutional layers extract and refine features from the concatenation, and a regression network composed of an average pooling layer and a fully connected layer then regresses one set of grid vertex offsets from the refined result.
4. The image stitching method based on unsupervised learning and a generative adversarial network according to claim 3, wherein each CSP module comprises two sub-branches: the first sub-branch, formed by sequentially connecting a CBS module comprising a convolutional layer, a batch normalization layer, and a SiLU activation layer, a ResNet residual unit, and a convolutional layer, extracts a feature map; the second sub-branch has only a single convolutional layer for extracting a feature map; the feature maps of the two sub-branches are then concatenated and passed through a batch normalization layer, a Leaky ReLU activation layer, and a CBS module, which computes the output feature map.
5. The image stitching method based on unsupervised learning and a generative adversarial network according to claim 2, wherein constructing the n×m transformation matrices according to the grid vertex offsets comprises:
when there are N sets of grid offsets, denoted $S_i$, $i = 1, 2, \dots, N$, the N constructed transformation matrices are $S_1$, $S_1+S_2$, $S_1+S_2+S_3$, …, $S_1+S_2+\dots+S_N$, formulated as

$$T_i = \sum_{j=1}^{i} S_j, \qquad i = 1, 2, \dots, N.$$
6. The image stitching method based on unsupervised learning and a generative adversarial network according to claim 5, wherein adjusting the network parameters of the alignment model by comparing the similarity of the overlapping region of the aligned target image and the reference image comprises:
constructing the following loss function $L_{align}$ and adjusting the network parameters of the alignment model by minimizing it:

$$L_{align} = \sum_{i=1}^{N} \lambda_i \left\| \mathcal{W}_{T_i}(E) \odot I_B - \mathcal{W}_{T_i}(I_A) \right\|_1$$

where $I_A$ denotes the target image, $I_B$ the reference image, $\mathcal{W}_{T_i}(\cdot)$ projective warping with the i-th transformation matrix $T_i$, $E$ an all-ones matrix of the same size as the image, $\lambda_i$ the loss weight of each transformation, $\odot$ pixel-wise multiplication, and $\|\cdot\|_1$ the 1-norm.
7. The image stitching method based on unsupervised learning and a generative adversarial network according to claim 1, wherein the construction process of the stitching model comprises:
(i) The stitching model adopts a generative adversarial network; a generator and a discriminator of the network are established;
(ii) Acquiring image pairs from the real world whose overlapping regions have different proportions and parallax to form data set A2; feeding data set A2 into the trained alignment model to obtain the aligned target images; setting the original image pairs of data set A2 as real labels and the stitched images generated by the generator as synthetic labels; directly stacking each aligned target image with its reference image and replacing the pixel values of the overlapping region with the average of the two images' pixels to obtain an overlay image, which is also set as a synthetic label; thereby obtaining a training data set;
(iii) Feeding the training data set into the generative adversarial network, training it with the adversarial loss function, updating the network parameters, and taking the parameter-optimized generator as the stitching model.
8. The image stitching method based on unsupervised learning and a generative adversarial network according to claim 7, wherein the generator adopts an encoder–decoder structure; specifically, the generator is formed by sequentially connecting several convolutional layers and an equal number of deconvolution layers, each convolutional and deconvolution layer being followed by a batch normalization layer.
9. The image stitching method based on unsupervised learning and a generative adversarial network according to claim 7, wherein the discriminator network is composed of convolutional layers, an average pooling layer, and a fully connected layer.
10. The image stitching method based on unsupervised learning and a generative adversarial network according to claim 7, wherein the adversarial loss function comprises a generator loss and a discriminator loss;
wherein the discriminator loss is expressed as:

$$L_D = \frac{1}{2}\,\mathbb{E}_{x \sim p_x}\!\left[(D(x) - b)^2\right] + \frac{1}{2}\,\mathbb{E}_{z \sim p_z}\!\left[(D(G(z)) - a)^2\right]$$

and the generator loss as:

$$L_G = \frac{1}{2}\,\mathbb{E}_{z \sim p_z}\!\left[(D(G(z)) - c)^2\right]$$

where $a$ denotes the synthetic label, $b$ the real label, $p_x$ the distribution of real images $x$ (including the original image pairs), $p_z$ the distribution of the stacked aligned target and reference images $z$, $D(\cdot)$ the discriminator, $G(\cdot)$ the generator, and $c$ the value the generator wants the discriminator to output for generated data.
CN202211676957.9A 2022-12-26 2022-12-26 Image stitching method based on unsupervised learning and generative adversarial network Pending CN115965529A

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211676957.9A CN115965529A 2022-12-26 2022-12-26 Image stitching method based on unsupervised learning and generative adversarial network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211676957.9A CN115965529A 2022-12-26 2022-12-26 Image stitching method based on unsupervised learning and generative adversarial network

Publications (1)

Publication Number Publication Date
CN115965529A 2023-04-14

Family

ID=87352553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211676957.9A Pending CN115965529A (en) 2022-12-26 2022-12-26 Image stitching method based on unsupervised learning and confrontation generation network

Country Status (1)

Country Link
CN (1) CN115965529A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116721019A (en) * 2023-08-10 2023-09-08 成都信息工程大学 Multi-camera video image stitching method based on deep learning
CN116721019B (en) * 2023-08-10 2023-10-10 成都信息工程大学 Multi-camera video image stitching method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination