CN115345781A - Multi-view video stitching method based on deep learning - Google Patents
- Publication number: CN115345781A
- Application number: CN202210956950.6A
- Authority: CN (China)
- Prior art keywords: displacement field, loss, viewpoint, image, stitching
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06T3/4038 — Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
- G06T3/4046 — Scaling the whole image or part thereof using neural networks
- G06T5/70; G06T5/73
- G06T7/593 — Depth or shape recovery from multiple stereo images
- H04N13/122 — Improving the 3D impression of stereoscopic images by modifying image signal contents
- H04N13/161 — Encoding, multiplexing or demultiplexing different image signal components
- G06T2200/32 — Indexing scheme involving image mosaicing
- G06T2207/10016 — Video; image sequence
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/20212 — Image combination
- G06T2207/20221 — Image fusion; image merging
- H04N2013/0081 — Depth or disparity estimation from stereoscopic image signals
Abstract
The invention discloses a multi-view video stitching method based on deep learning, comprising the following steps. First, the Airsim simulator collects images and depth data at a set virtual common viewpoint, generating a data set for the video stitching task, and the images undergo preprocessing such as cylindrical projection. Next, a convolutional neural network is used to design an artifact elimination module and a smooth transition module: the former exploits the feature correlation of the overlap region and aligns it through viewpoint regression to eliminate fusion artifacts; the latter propagates the deformation learned in the overlap region to the non-overlap region according to the image features, guiding a smooth transition between regions and improving the visual result. Finally, the original viewpoint images are warped according to the predicted displacement fields and combined by weighted linear fusion to obtain the stitching result. The invention removes stitching artifacts while achieving real-time performance, meeting the online stitching requirements of practical applications.
Description
Technical Field
The invention relates to video stitching technology and belongs to the technical field of computer vision.
Background
Video stitching has important theoretical research significance and plays an important role in application fields such as virtual reality, security surveillance, intelligent driving, video conferencing, and drone aerial photography. Video stitching is commonly used to combine two or more videos captured by cameras with different poses; it reduces the requirements on video acquisition equipment and yields a larger field of view. Although image and video stitching have a long research history, existing video stitching methods are far from perfect: current methods suffer from long computation time, poor performance in wide-baseline, large-parallax scenes, and insufficient robustness. The common approach of aligning by a global homography is unaffected by parallax only when the camera optical centers nearly coincide or the scene depth varies little, in which case it produces good results; otherwise it generates obvious artifacts. In practical applications, however, exactly coincident optical centers are difficult to achieve, and some scenarios, such as vehicle surround-view systems, even require a distributed camera arrangement. To reduce artifacts, optimal-seam methods are commonly used, but such methods can cause uneven transitions, and minimizing their energy functions remains computationally expensive.
The development of deep learning offers a brand-new dimension for image and video stitching, and a suitable design can improve the quality of the stitched video. Convolutional neural networks (CNNs) have strong feature extraction capability; replacing traditional hand-crafted features with a CNN gives better robustness in scenes with low illumination, low texture, or repetitive texture. Accordingly, deep-learning-based homography estimation has also been applied to small-parallax image stitching. However, the lack of a suitable data set makes it difficult to apply deep learning to video and image stitching tasks, and some methods use synthesized, parallax-free data sets that often do not match real application scenarios.
Disclosure of Invention
The technical problem is as follows: in view of the prior art, the invention provides a multi-view video stitching method based on deep learning that can eliminate artifacts caused by parallax, improve robustness in challenging scenes such as low illumination, low texture, or repetitive texture, and achieve high computational efficiency, meeting the online real-time stitching requirements of practical applications.
The technical scheme is as follows: to achieve the above purpose, the invention adopts the following technical scheme.
A multi-view video stitching method based on deep learning comprises the following steps:
Step 1: collect image and depth data at a set virtual common viewpoint, generate a data set for the video stitching task, and apply cylindrical-projection preprocessing to the images in the data set according to the camera field of view.
Step 2: obtain 3D scene information from the depth data and convert it into a pixel-level displacement field.
Step 3: design an artifact elimination module with a convolutional neural network, aligning the overlap regions by exploiting their feature correlation and regressing the viewpoints to the common virtual optical center so as to eliminate artifacts generated after fusion.
Step 4: design a smooth transition module with a convolutional neural network, propagating the deformation learned in the overlap region to the non-overlap region according to the image features, guiding a smooth transition between regions and reducing visual discontinuity.
Step 5: warp the original viewpoint images according to the predicted displacement fields and apply weighted linear fusion to obtain the stitching result.
Further, the specific method of step 1 is as follows:
and (3) splicing videos at different viewpoints to be regarded as a viewpoint regression problem, and mapping the image acquired at the original viewpoint to any common virtual viewpoint so as to process the parallax caused by the misalignment of the optical centers of the cameras. In order to build an ideal optical center coincidence model at a virtual viewpoint and obtain reliable depth data, a camera model is built in a virtual 3D environment by utilizing an Airsim simulator, and a data set for training is generated.
Further, the specific method of step 2 is as follows:
the pixel displacement field is obtained by converting depth information in the scene. Depth corresponding to two cameras at the position of obtaining virtual viewpointAnd obtaining the 3D coordinates of the pixel points after information is obtained. Transforming the image at the virtual viewpoint to the original viewpoint, and calculating the displacement field flow in the viewpoint transformation process in a stereo geometric mode gt 。
Further, the specific method in step 3 is as follows:
In a video stitching task the overlap region is generally small. To filter out as much of the invalid area as possible and reduce computation, a maximal binary mask M_ov_max containing the overlap region is derived from the camera configuration; this part is cropped from the input image and fed to the current module.
For the possible overlap region, an encoder-decoder structure is designed. In the encoder, the two pictures are stacked along the channel dimension and features are extracted by a series of downsampling convolutional layers; the decoder consists of a series of upsampling and convolutional layers with skip connections. Each decoder layer takes the features of the corresponding encoder layer and the displacement field output by the previous decoder layer, refining it by progressive upsampling; once the displacement field reaches 1/4 resolution, it is upsampled directly by bilinear interpolation to obtain an overlap-region displacement field of the same size as the input.
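The final bilinear-upsampling step can be sketched in numpy as below. Note one implementation detail assumed here, not stated in the text: when a displacement field is upsampled to 4x resolution, its vectors must also be multiplied by 4 to stay valid in the new pixel grid.

```python
import numpy as np

def upsample_flow(flow, scale=4):
    """Bilinearly upsample a (h, w, 2) displacement field by `scale`,
    scaling the vectors themselves for the new grid (align_corners style)."""
    h, w, _ = flow.shape
    H, W = h * scale, w * scale
    ys = np.linspace(0, h - 1, H)
    xs = np.linspace(0, w - 1, W)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    up = ((1 - wy) * (1 - wx) * flow[y0][:, x0]
          + (1 - wy) * wx * flow[y0][:, x0 + 1]
          + wy * (1 - wx) * flow[y0 + 1][:, x0]
          + wy * wx * flow[y0 + 1][:, x0 + 1])
    return up * scale

# A constant unit flow at 1/4 resolution becomes a constant flow of 4 px.
up = upsample_flow(np.ones((2, 2, 2)))
```

In a real pipeline this is the step the decoder hands off to, so the network never has to regress at full resolution.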
To train the artifact elimination module, a displacement field loss, a content loss, and a perceptual loss are defined.
The binary masks of the original viewpoints are warped by the displacement fields of the two images to obtain the binary mask M_ov of the actual overlap region. An L1 loss is then constructed over the overlap region between the pixel displacement field flow_O predicted by the network and the ground truth:
content loss computation of image I at virtual viewpoint gt And image I output by network O L1 Loss at the overlap region:
the purpose of perceiving layer Loss is to keep the feature of the transformed image as consistent as possible, a conv5_3 layer in a pre-trained VGG-19 feature extraction network is used for extracting deep high-level semantic features, the process is defined as F (-), MSE Loss on the layer is calculated, and an overlapping region mask M is used ov Carrying out extraction:
the overall penalty function for this module is then:
further, the specific method of step 4 is as follows:
the smooth transition module is used for enabling the overlapped area and the non-overlapped area to be connected smoothly, and enabling the image to have better visual appearance. For the non-overlapped region, the design purpose is to form the propagation of the displacement field from the overlapped region to the non-overlapped region according to the image characteristics of the original viewpoint as guidance. To be able to form such a propagation relation, the original view image and the overlap region displacement field predicted at the previous stage are input, and the original view image is set to 1/4 resolution to fit the size of the overlap region displacement field. The submodule is composed of a series of convolution layers and residual blocks, expansion convolution is used in the residual blocks to expand the receptive field, 6 residual blocks are used in total, expansion parameters are set to be [1,2,4,8, 1], and pixel displacement fields of all regions of the two images are predicted through the regression structures of the two images respectively.
In order to train this module, displacement field loss, displacement field consistency loss, perceptual layer loss are defined.
In the non-overlap region, the part close to the overlap region should receive more attention and the part far from it less, so applying the same loss weight to every pixel is inappropriate. A Gaussian function is therefore used to construct weights W_k, yielding the displacement field loss:
a shift field consistency loss function to keep the output of the second module in the overlap region consistent with the output of the first module:
for the Loss of the perception layer, the MSE Loss on the conv5_3 layer in the VGG-19 network is also calculated, at this time, the input is a deformed image, and the binary mask containing the original viewpoint image content in the virtual viewpoint is M:
the overall loss function of the module is defined as:
further, the specific method of step 5 is as follows:
transforming the original view according to the final prediction result, and performing simple weighted linear fusion processing on the two images to obtain a splicing result I o :
I o =W·warp(I A ,flow A )+(1-W)·warp(I B ,flow B )
Wherein, flowg A 、flowg B For the output pixel displacement field, warp is the transform function and W is the set linear fusion weight.
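The fusion equation above can be sketched directly. The nearest-neighbour warp and constant blend weight below are simplifying assumptions; a real implementation would use bilinear sampling and typically a spatially varying W.

```python
import numpy as np

def warp(img, flow):
    """Backward-warp a grayscale image (h, w) by a displacement field
    (h, w, 2), rounding to nearest neighbour for brevity."""
    h, w = img.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    src_x = np.clip((u - flow[..., 0]).round().astype(int), 0, w - 1)
    src_y = np.clip((v - flow[..., 1]).round().astype(int), 0, h - 1)
    return img[src_y, src_x]

def blend(img_a, flow_a, img_b, flow_b, weight):
    """I_o = W * warp(I_A, flow_A) + (1 - W) * warp(I_B, flow_B)."""
    return weight * warp(img_a, flow_a) + (1 - weight) * warp(img_b, flow_b)

# Toy check: equal-weight blend of two flat images with zero flow.
a = np.full((4, 4), 10.0)
b = np.full((4, 4), 30.0)
zero = np.zeros((4, 4, 2))
out = blend(a, zero, b, zero, 0.5)
```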
Beneficial effects: the deep-learning-based multi-view video stitching method provided by the invention uses a deep convolutional neural network for video stitching, offering a new approach to the problem. The method can be applied to wide-baseline camera arrangements; artifacts caused by parallax are eliminated through the idea of viewpoint regression, improving the quality of the stitched video. Meanwhile, thanks to the strength of convolutional neural networks in extracting image features, the method is more robust than traditional methods in challenging scenes such as low illumination, low texture, or repetitive texture. The designed modules run fast enough to meet online real-time stitching requirements.
Drawings
Fig. 1 is an overall flowchart of a deep learning-based multi-view video stitching method provided by the invention.
FIG. 2 is a schematic diagram of an arrangement of cameras according to the present invention.
Fig. 3 is a view of a camera configuration in a virtual 3D environment in the present invention.
Fig. 4 is a design of the overall network architecture of the present invention.
FIG. 5 compares the stitching results of different methods, where a is the ground-truth reference, b is the multiband fusion method, c is the APAP method, and d is the proposed method; from top to bottom, each column shows stitching results under different test scenes.
Detailed Description
The present invention will be further illustrated with reference to the accompanying drawings and specific embodiments, which are to be understood as merely illustrative of the invention and not as limiting the scope of the invention.
As shown in the figures, a multi-view video stitching method based on deep learning comprises the following steps:
Step 1: collect image and depth data at a set virtual common viewpoint, generate a data set for the video stitching task, and apply cylindrical-projection preprocessing to the images in the data set according to the camera field of view.
Step 2: obtain 3D scene information from the depth data and convert it into a pixel-level displacement field.
Step 3: design an artifact elimination module with a convolutional neural network, aligning the overlap regions by exploiting their feature correlation and regressing the viewpoints to the common virtual optical center so as to eliminate artifacts generated after fusion.
Step 4: design a smooth transition module with a convolutional neural network, propagating the deformation learned in the overlap region to the non-overlap region according to the image features, guiding a smooth transition between regions and reducing visual discontinuity.
Step 5: warp the original viewpoint images according to the displacement fields and apply weighted linear fusion to obtain the stitching result.
In this embodiment, the specific method of step 1 is as follows:
and (3) splicing videos at different viewpoints to be regarded as a viewpoint regression problem, and mapping the image acquired at the original viewpoint to any common virtual viewpoint so as to process the parallax caused by the misalignment of the optical centers of the cameras. In order to build an ideal optical center coincidence model at a virtual viewpoint and obtain reliable depth data, a camera model is built in a virtual 3D environment by using an Airsim simulator, and a data set for training is generated.
In this embodiment, the specific method of step 2 is as follows:
the pixel displacement field is obtained by converting depth information in the scene. After the depth information corresponding to the two cameras at the virtual viewpoint is obtained, the 3D coordinates of the pixel points can be obtained. The image at the virtual viewpoint is transformed to the original viewpoint, and the displacement field flow in the viewpoint transformation process can be calculated and obtained in a stereo geometric mode gt 。
In this embodiment, the specific method of step 3 is as follows:
In a video stitching task the overlap region is generally small. To filter out as much of the invalid area as possible and reduce computation, a maximal binary mask M_ov_max containing the overlap region is derived from the camera configuration; this part is cropped from the input image and fed to the current module.
For the possible overlap region, an encoder-decoder structure is designed. In the encoder, the two pictures are stacked along the channel dimension and features are extracted by a series of downsampling convolutional layers; the decoder consists of a series of upsampling and convolutional layers with skip connections. Each decoder layer takes the features of the corresponding encoder layer and the displacement field output by the previous decoder layer, refining it by progressive upsampling; once the displacement field reaches 1/4 resolution, it is upsampled directly by bilinear interpolation to obtain an overlap-region displacement field of the same size as the input.
In order to train the artifact removal module, displacement field loss, content loss, perceptual layer loss are defined.
The binary masks of the original viewpoints are warped by the displacement fields of the two images to obtain the binary mask M_ov of the actual overlap region. An L1 loss is then constructed over the overlap region between the pixel displacement field flow_O predicted by the network and the ground truth:
The content loss computes the L1 loss over the overlap region between the image I_gt at the virtual viewpoint and the image I_O output by the network:
The perceptual loss aims to keep the features of the transformed image as consistent as possible. The conv5_3 layer of a pre-trained VGG-19 feature extraction network, denoted F(·), extracts deep high-level semantic features; the MSE loss on this layer is computed, restricted by the overlap-region mask M_ov:
The overall loss function of this module is then:
in this embodiment, the specific method of step 4 is as follows:
the smooth transition module is used for enabling the overlapped area and the non-overlapped area to be connected smoothly, and enabling the image to have better visual appearance. For the non-overlapped region, the design purpose is to form the propagation of the displacement field from the overlapped region to the non-overlapped region according to the image characteristics of the original viewpoint as guidance. To be able to form such a propagation relation, the original view image and the overlap region displacement field predicted at the previous stage are input, and the original view image is set to 1/4 resolution to fit the size of the overlap region displacement field. The submodule is composed of a series of convolution layers and residual blocks, expansion convolution is used in the residual blocks to expand the receptive field, 6 residual blocks are used in total, expansion parameters are set to be [1,2,4,8, 1], and pixel displacement fields of all regions of the two images are predicted through the regression structures of the two images respectively.
In order to train this module, displacement field loss, displacement field consistency loss, perceptual layer loss are defined.
In the non-overlap region, the part close to the overlap region should receive more attention and the part far from it less, so applying the same loss weight to every pixel is inappropriate. A Gaussian function is therefore used to construct weights W_k, yielding the displacement field loss:
a displacement field consistency loss function for keeping the output of the second module in the overlap region consistent with the output of the first module:
for the Loss of the perception layer, the MSE Loss on the conv5_3 layer in the VGG-19 network is calculated in the same way, at the moment, the input is a deformed image, and the binary mask of the virtual viewpoint containing the image content of the original viewpoint is M:
the overall loss function of the module is defined as:
in this embodiment, the specific method of step 5 is as follows:
transforming the original view according to the final prediction result, and performing simple weighted linear fusion processing on the two images to obtain a splicing result I o :
I o =W·warp(I A ,flow A )+(1-W)·warp(I B ,flow B )
Wherein, flow A 、flow B For the output pixel displacement field, warp is the transform function and W is the set linear fusion weight.
Examples
For the deep-learning-based multi-view video stitching method, the camera configuration shown in FIG. 3 is set up in the Airsim simulator to build the acquisition model. The camera FOV is set to 90 degrees, the resolution to 1280x720, and the angle between the two cameras to 60 degrees, at which point the overlap region of the images is less than 33%. Four cameras are arranged in total: the first two, with a baseline of 1 m, capture the original viewpoint images, while the other two are placed at the same virtual viewpoint and capture the ground-truth images and depth data. Thousands of image groups are synthesized under several scene maps and weather conditions to construct the data set for training the network. During training, only the artifact elimination module is trained first; its parameters are then fixed, and the smooth transition module is trained next.
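The stated geometry can be checked with a quick angular computation: two cameras with 90° FOV whose optical axes are 60° apart share at most a 30° wedge, i.e. one third of each view, consistent with the "less than 33%" overlap mentioned above. This purely angular estimate ignores perspective distortion:

```python
def angular_overlap_fraction(fov_deg, angle_between_deg):
    """Fraction of one camera's field of view shared with the other,
    measured purely in angle (perspective distortion is ignored)."""
    overlap = max(0.0, fov_deg - angle_between_deg)
    return overlap / fov_deg

# 90 deg FOV, optical axes 60 deg apart -> 30/90 = 1/3 overlap.
frac = angular_overlap_fraction(90, 60)
```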
Experiment: test video segments were collected in map scenes different from the training data, and selected frames of the video stitching results were compared with existing stitching methods. The method obtains good stitching results, verifying its effectiveness.
The above description covers only the preferred embodiments of the present invention. It should be noted that various modifications and adaptations can be made by those skilled in the art without departing from the principles of the invention, and such modifications and adaptations are intended to fall within the scope of the invention.
Claims (6)
1. A multi-view video stitching method based on deep learning, characterized by comprising the following steps:
step 1: collecting image and depth data at a set virtual common viewpoint, generating a data set for the video stitching task, and applying cylindrical-projection preprocessing to the images in the data set according to the camera field of view;
step 2: obtaining 3D scene information from the depth data and converting it into a pixel-level displacement field;
step 3: designing an artifact elimination module with a convolutional neural network, aligning the overlap regions by exploiting their feature correlation and regressing the viewpoints to the common virtual optical center so as to eliminate artifacts generated after fusion;
step 4: designing a smooth transition module with a convolutional neural network, propagating the deformation learned in the overlap region to the non-overlap region according to the image features so as to guide a smooth transition between regions and reduce visual discontinuity;
step 5: warping the original viewpoint images according to the displacement fields and applying weighted linear fusion to obtain the stitching result.
2. The deep-learning-based multi-view video stitching method of claim 1, wherein the specific method of step 1 is as follows:
treating video stitching across different viewpoints as a viewpoint regression problem, and mapping the images acquired at the original viewpoints to an arbitrary common virtual viewpoint to handle the parallax caused by the misaligned camera optical centers; to build an ideal optical-center-coincidence model at the virtual viewpoint and obtain reliable depth data, setting up a camera model in a virtual 3D environment with the Airsim simulator and generating a data set for training.
3. The deep-learning-based multi-view video stitching method of claim 1, wherein the specific method of step 2 is as follows:
obtaining the pixel displacement field by converting the depth information of the scene; recovering the 3D coordinates of each pixel after obtaining the depth information of the two cameras at the virtual viewpoint; transforming the image at the virtual viewpoint to the original viewpoint, and computing the displacement field flow_gt of the viewpoint transformation by stereo geometry.
4. The deep-learning-based multi-view video stitching method according to claim 1, wherein step 3 specifically comprises:
in a video stitching task the overlapping area is generally small; to filter out as much of the invalid area as possible and reduce computation, a maximal binary mask M_ov_max containing the overlapping region is derived from the camera configuration, and the corresponding part of the input images is extracted and fed into this module;
for the possible overlapping region, an encoder-decoder structure is designed: in the encoder, the two pictures are stacked along the channel dimension and downsampled by a series of convolutional layers to extract features; the decoder consists of a series of upsampling and convolutional layers with skip connections, where each decoder layer takes the features of the corresponding encoder layer and the displacement field output by the previous decoder layer and refines it by progressive upsampling; once the displacement field at 1/4 resolution is obtained, it is upsampled directly by bilinear interpolation to an overlap-region displacement field of the same size as the input;
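The final bilinear upsampling step can be sketched in NumPy; note that when a displacement field is enlarged 4x, the flow values themselves must also be scaled by 4, a detail the claim leaves implicit (the function name and the sampling convention are assumptions):

```python
import numpy as np

def upsample_flow_x4(flow):
    """Bilinearly upsample a (2, h, w) displacement field to 4x resolution.

    The 1/4-resolution field is enlarged by bilinear interpolation, and the
    flow values are multiplied by 4 because displacements are measured in
    pixels of the new, larger grid. Illustrative sketch only.
    """
    c, h, w = flow.shape
    H, W = 4 * h, 4 * w
    # Sample positions of the target grid in source coordinates.
    ys = (np.arange(H) + 0.5) / 4 - 0.5
    xs = (np.arange(W) + 0.5) / 4 - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0, 1)[None, :, None]
    wx = np.clip(xs - x0, 0, 1)[None, None, :]
    top = flow[:, y0][:, :, x0] * (1 - wx) + flow[:, y0][:, :, x1] * wx
    bot = flow[:, y1][:, :, x0] * (1 - wx) + flow[:, y1][:, :, x1] * wx
    return 4.0 * (top * (1 - wy) + bot * wy)
```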
to train the artifact-elimination module, a displacement-field loss, a content loss, and a perceptual loss are defined;
the displacement-field loss first transforms the binary masks of the original viewpoints by the displacement fields of the two images to obtain the binary mask M_ov of the actual overlapping region; an L1 loss between the pixel displacement field flow_O predicted by the network and the ground-truth field flow_gt is then constructed over the overlapping region:
the content loss computes the L1 loss, over the overlapping region, between the image I_gt at the virtual viewpoint and the image I_O output by the network:
the perceptual loss keeps the features of the warped image as consistent as possible: the conv5_3 layer of a pre-trained VGG-19 feature-extraction network, a process denoted F(·), is used to extract deep high-level semantic features, the MSE loss on this layer is computed, and the overlap-region mask M_ov is used to extract the relevant part:
the overall loss function of this module is then:
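A minimal NumPy sketch of the three losses described above, with a placeholder feat_fn standing in for the pre-trained VGG-19 conv5_3 extractor F(·); the loss weights and the exact masking of the perceptual term are illustrative assumptions, not values stated in the patent:

```python
import numpy as np

def l1(x):
    # Mean absolute value, used for both L1 terms below.
    return np.abs(x).mean()

def artifact_module_loss(flow_o, flow_gt, img_o, img_gt, m_ov, feat_fn,
                         w_flow=1.0, w_content=1.0, w_percep=1.0):
    """Sketch of the artifact-elimination training objective.

    flow_o / img_o are network outputs, flow_gt / img_gt the virtual-viewpoint
    ground truth, m_ov the actual-overlap binary mask, and feat_fn a stand-in
    for the VGG-19 conv5_3 extractor. Weights are illustrative assumptions.
    """
    loss_flow = l1(m_ov * (flow_o - flow_gt))          # displacement-field L1
    loss_content = l1(m_ov * (img_o - img_gt))         # content L1 on overlap
    f_o, f_gt = feat_fn(m_ov * img_o), feat_fn(m_ov * img_gt)
    loss_percep = ((f_o - f_gt) ** 2).mean()           # MSE on deep features
    return (w_flow * loss_flow + w_content * loss_content
            + w_percep * loss_percep)
```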
5. The deep-learning-based multi-view video stitching method according to claim 1, wherein step 4 specifically comprises:
the smooth-transition module joins the overlapping and non-overlapping regions smoothly so that the image has a better visual appearance; for the non-overlapping region, the design goal is to propagate the displacement field from the overlapping region into the non-overlapping region under the guidance of the image features of the original viewpoint; to form this propagation, the original-viewpoint image and the overlap-region displacement field predicted in the previous stage are taken as input, with the original-viewpoint image resized to 1/4 resolution to match the size of the overlap-region displacement field; the module consists of a series of convolutional layers and residual blocks, in which dilated convolutions enlarge the receptive field; 6 residual blocks are used in total, with dilation rates set to [1, 2, 4, 8, 1], and the pixel displacement fields of all regions of the two images are predicted by two separate regression branches;
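As a rough illustration of why this dilation schedule enlarges the context available for propagating the displacement field, the receptive field of a stack of dilated 3x3 convolutions (treating each residual block as a single convolution, a deliberate simplification) can be computed as:

```python
def receptive_field(dilations, k=3):
    """Receptive field of a stack of kxk dilated convolutions with stride 1.

    Each layer adds (k - 1) * dilation pixels to the field. This is a
    simplified model (one conv per residual block) used only to show how
    dilation grows the context without extra downsampling.
    """
    rf = 1
    for d in dilations:
        rf += (k - 1) * d
    return rf
```

Under this simplification, the schedule [1, 2, 4, 8, 1] already covers a 33-pixel window at 1/4 resolution, versus 11 pixels for five undilated layers.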
to train this module, a displacement-field loss, a displacement-field consistency loss, and a perceptual loss are defined;
within the non-overlapping region, the parts close to the overlapping region should receive more attention and the parts far from it less, so applying the same loss weight to every pixel is not appropriate; the weights W_k are therefore constructed with a Gaussian function, giving the displacement-field loss:
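A hedged sketch of such Gaussian weighting: pixels inside the overlap mask get full weight, and the weight decays with the distance to the nearest overlap pixel. The sigma value and the brute-force distance computation are illustrative choices, not values from the patent:

```python
import numpy as np

def gaussian_flow_weights(m_ov, sigma=8.0):
    """Gaussian attention weights for the displacement-field loss.

    Pixels inside the overlap mask m_ov get weight 1; outside, the weight
    decays with the squared distance to the nearest overlap pixel, so far
    parts of the non-overlapping region contribute less to the loss.
    """
    h, w = m_ov.shape
    ys, xs = np.nonzero(m_ov)                  # coordinates of overlap pixels
    gy, gx = np.mgrid[0:h, 0:w]
    # Distance from every pixel to its nearest overlap pixel (brute force;
    # fine for a sketch, use a distance transform in practice).
    d2 = ((gy[..., None] - ys) ** 2 + (gx[..., None] - xs) ** 2).min(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))
```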
a displacement-field consistency loss is used to keep the output of the second module consistent with that of the first module in the overlapping region:
for the perceptual loss, the MSE loss on the conv5_3 layer of the VGG-19 network is computed in the same way; here the inputs are the warped images, and M denotes the binary mask of the virtual viewpoint that contains the image content of the original viewpoints:
the overall loss function of this module is defined as:
6. The deep-learning-based multi-view video stitching method according to claim 1, wherein step 5 specifically comprises:
warping the original viewpoints according to the final predicted displacement fields, and fusing the two images by simple weighted linear fusion to obtain the stitching result I_o:
I_o = W · warp(I_A, flow_A) + (1 - W) · warp(I_B, flow_B)
where flow_A and flow_B are the output pixel displacement fields, warp is the warping function, and W is the chosen linear fusion weight.
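The fusion formula above can be sketched in NumPy with a minimal nearest-neighbour backward warp standing in for warp(·); a real implementation would use bilinear sampling:

```python
import numpy as np

def warp(img, flow):
    """Backward warp: sample img at (x + flow_x, y + flow_y), nearest
    neighbour. A minimal stand-in for the warp(.) function above."""
    h, w = img.shape[:2]
    gy, gx = np.mgrid[0:h, 0:w]
    sx = np.clip(np.round(gx + flow[0]).astype(int), 0, w - 1)
    sy = np.clip(np.round(gy + flow[1]).astype(int), 0, h - 1)
    return img[sy, sx]

def fuse(img_a, flow_a, img_b, flow_b, W=0.5):
    """I_o = W * warp(I_A, flow_A) + (1 - W) * warp(I_B, flow_B)."""
    return W * warp(img_a, flow_a) + (1.0 - W) * warp(img_b, flow_b)
```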
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210956950.6A CN115345781A (en) | 2022-08-10 | 2022-08-10 | Multi-view video stitching method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115345781A true CN115345781A (en) | 2022-11-15 |
Family
ID=83952205
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210956950.6A Pending CN115345781A (en) | 2022-08-10 | 2022-08-10 | Multi-view video stitching method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115345781A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117422617A (en) * | 2023-10-12 | 2024-01-19 | 华能澜沧江水电股份有限公司 | Method and system for realizing image stitching of video conference system |
CN117422617B (en) * | 2023-10-12 | 2024-04-09 | 华能澜沧江水电股份有限公司 | Method and system for realizing image stitching of video conference system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Song et al. | Channel attention based iterative residual learning for depth map super-resolution | |
TWI709107B (en) | Image feature extraction method and saliency prediction method including the same | |
CN111652966B (en) | Three-dimensional reconstruction method and device based on multiple visual angles of unmanned aerial vehicle | |
CN111105432B (en) | Unsupervised end-to-end driving environment perception method based on deep learning | |
CN110610526B (en) | Method for segmenting monocular image and rendering depth of field based on WNET | |
CN112288627B (en) | Recognition-oriented low-resolution face image super-resolution method | |
CN112019828B (en) | Method for converting 2D (two-dimensional) video into 3D video | |
CN111696035A (en) | Multi-frame image super-resolution reconstruction method based on optical flow motion estimation algorithm | |
CN113538243B (en) | Super-resolution image reconstruction method based on multi-parallax attention module combination | |
CN113034563A (en) | Self-supervision type monocular depth estimation method based on feature sharing | |
Yuan et al. | Multiview scene image inpainting based on conditional generative adversarial networks | |
CN114782596A (en) | Voice-driven human face animation generation method, device, equipment and storage medium | |
CN113808005A (en) | Video-driving-based face pose migration method and device | |
CN116563459A (en) | Text-driven immersive open scene neural rendering and mixing enhancement method | |
CN115345781A (en) | Multi-view video stitching method based on deep learning | |
CN116957931A (en) | Method for improving image quality of camera image based on nerve radiation field | |
CN116823908A (en) | Monocular image depth estimation method based on multi-scale feature correlation enhancement | |
CN115330935A (en) | Three-dimensional reconstruction method and system based on deep learning | |
CN115170921A (en) | Binocular stereo matching method based on bilateral grid learning and edge loss | |
CN113362240A (en) | Image restoration method based on lightweight feature pyramid model | |
CN112950481A (en) | Water bloom shielding image data collection method based on image mosaic network | |
Zhao et al. | 3dfill: Reference-guided image inpainting by self-supervised 3d image alignment | |
CN110766732A (en) | Robust single-camera depth map estimation method | |
LIU et al. | A Lightweight and Efficient Infrared Pedestrian Semantic Segmentation Method | |
Zhuang et al. | Dimensional transformation mixer for ultra-high-definition industrial camera dehazing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||