CN115345781A - Multi-view video stitching method based on deep learning - Google Patents

Multi-view video stitching method based on deep learning

Info

Publication number
CN115345781A
CN115345781A (application CN202210956950.6A)
Authority
CN
China
Prior art keywords
displacement field
loss
viewpoint
image
splicing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210956950.6A
Other languages
Chinese (zh)
Inventor
达飞鹏
衡玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202210956950.6A priority Critical patent/CN115345781A/en
Publication of CN115345781A publication Critical patent/CN115345781A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
        • G06T3/00 - Geometric image transformation in the plane of the image
            • G06T3/40 - Scaling the whole image or part thereof
                • G06T3/4038 - Scaling for image mosaicing, i.e. plane images composed of plane sub-images
                • G06T3/4046 - Scaling using neural networks
        • G06T5/70
        • G06T5/73
        • G06T7/00 - Image analysis
            • G06T7/50 - Depth or shape recovery
                • G06T7/55 - Depth or shape recovery from multiple images
                    • G06T7/593 - Depth or shape recovery from stereo images
        • G06T2200/00 - Indexing scheme for image data processing or generation, in general
            • G06T2200/32 - involving image mosaicing
        • G06T2207/00 - Indexing scheme for image analysis or image enhancement
            • G06T2207/10 - Image acquisition modality
                • G06T2207/10016 - Video; Image sequence
            • G06T2207/20 - Special algorithmic details
                • G06T2207/20081 - Training; Learning
                • G06T2207/20084 - Artificial neural networks [ANN]
                • G06T2207/20212 - Image combination
                    • G06T2207/20221 - Image fusion; Image merging
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
        • H04N13/00 - Stereoscopic video systems; Multi-view video systems; Details thereof
            • H04N13/10 - Processing, recording or transmission of stereoscopic or multi-view image signals
                • H04N13/106 - Processing image signals
                    • H04N13/122 - Improving the 3D impression of stereoscopic images by modifying image signal contents, e.g. by filtering or adding monoscopic depth cues
                    • H04N13/161 - Encoding, multiplexing or demultiplexing different image signal components
            • H04N2013/0074 - Stereoscopic image analysis
                • H04N2013/0081 - Depth or disparity estimation from stereoscopic image signals

Abstract

The invention discloses a multi-view video stitching method based on deep learning, which comprises the following steps. First, an AirSim simulator is used to collect images and depth data at a preset virtual common viewpoint, a data set for the video stitching task is generated, and the images are preprocessed by cylindrical projection. Then, an artifact-removal module and a smooth transition module are designed with convolutional neural networks: the former exploits the feature correlation of the overlap region and aligns it through viewpoint regression to eliminate fusion artifacts; the latter propagates the deformation learned in the overlap region to the non-overlap region, guided by the image features, so that the two regions transition smoothly and the visual quality improves. Finally, the original viewpoint images are warped according to the predicted displacement fields and blended by weighted linear fusion to obtain the stitched result. The invention removes stitching artifacts while achieving real-time performance, and can meet the online stitching requirements of practical applications.

Description

Multi-view video stitching method based on deep learning
Technical Field
The invention relates to video stitching technology and belongs to the technical field of computer vision.
Background
Video stitching technology has important theoretical research significance and plays an important role in application fields such as virtual reality, security monitoring, intelligent driving, video conferencing, and aerial photography by unmanned aerial vehicles. Video stitching is commonly used to compose two or more videos captured by cameras with different poses; it reduces the requirements on the acquisition equipment and provides a larger field of view. Although image and video stitching have a long research history, existing video stitching methods are far from perfect. The main challenges are long computation time, poor performance in wide-baseline, large-parallax scenes, and insufficient robustness. The common approach of aligning frames with a single global homography works well when the camera optical centers nearly coincide or the scene depth varies little, because parallax is then negligible; otherwise it produces obvious artifacts. In practice, however, it is difficult to make the camera optical centers coincide exactly, and some scenarios, such as vehicle-mounted surround-view systems, even require a distributed camera arrangement. To reduce artifacts, seam-based methods (optimal seam selection) are commonly used, but they can cause uneven transitions, and minimizing the underlying energy function remains computationally expensive.
The development of deep learning provides a new dimension for video and image stitching, and the quality of stitched video can be improved when it is applied appropriately. Convolutional neural networks (CNNs) have strong feature extraction capability; replacing traditional hand-crafted features with CNN features gives better robustness in challenging scenes such as low illumination, low texture, or repetitive texture. Accordingly, homography estimation methods based on deep learning have been applied to the stitching of small-parallax images. However, the lack of a suitable data set makes it difficult to apply deep learning to video and image stitching tasks, and some methods use synthetic data sets without parallax, which are often inconsistent with real application scenarios.
Disclosure of Invention
Technical problem: in view of the shortcomings of the prior art, the invention provides a multi-view video stitching method based on deep learning that eliminates the artifacts caused by parallax, improves robustness in challenging scenes such as low illumination, low texture, or repetitive texture, and at the same time achieves high computational efficiency, meeting the requirement of online real-time stitching in practical applications.
Technical solution: to achieve the above purpose, the invention adopts the following technical scheme.
A multi-view video stitching method based on deep learning comprises the following steps:
Step 1: collect image and depth data at a preset virtual common viewpoint, generate a data set for the video stitching task, and apply cylindrical projection preprocessing to the images in the data set according to the camera field of view.
Step 2: obtain 3D information of the scene from the depth data and convert it into a pixel-level displacement field.
Step 3: design an artifact-removal module with a convolutional neural network; it exploits the feature correlation of the overlap region to align it and regresses the viewpoints to the coincident virtual optical center, eliminating the artifacts produced after fusion.
Step 4: design a smooth transition module with a convolutional neural network; it propagates the deformation learned in the overlap region to the non-overlap region, guided by the image features, so that the regions transition smoothly and the visual discontinuity is reduced.
Step 5: warp the original viewpoint images according to the predicted displacement fields and apply weighted linear fusion to obtain the stitched result.
Further, the specific method of step 1 is as follows:
Video stitching across different viewpoints is treated as a viewpoint regression problem: the images acquired at the original viewpoints are mapped to an arbitrary common virtual viewpoint, so as to handle the parallax caused by the misalignment of the camera optical centers. To build an ideal optical-center-coincidence model at the virtual viewpoint and obtain reliable depth data, a camera model is set up in a virtual 3D environment with the AirSim simulator, and a training data set is generated.
Further, the specific method of step 2 is as follows:
The pixel displacement field is obtained by converting the depth information of the scene. Once the depth maps corresponding to the two cameras at the virtual viewpoint are available, the 3D coordinates of each pixel can be recovered. The image at the virtual viewpoint is then transformed to the original viewpoint, and the displacement field flow_gt of this viewpoint transformation is computed by stereo geometry.
Further, the specific method of step 3 is as follows:
In a video stitching task the overlap region is usually small. To filter out as much of the invalid area as possible and reduce computation, a maximal binary mask M_ov_max that contains the possible overlap region is derived from the camera configuration; the corresponding part is cropped from the input images and fed into this module.
For the possible overlap region, an encoder-decoder structure is designed. In the encoder, the two images are stacked along the channel dimension and features are extracted by a series of downsampling convolutional layers. The decoder consists of a series of upsampling layers and convolutional layers with skip connections: each level takes the features of the corresponding encoder layer and the displacement field output by the previous decoder layer and refines it by progressive upsampling. When the displacement field reaches 1/4 of the input resolution, it is upsampled directly by bilinear interpolation to obtain an overlap-region displacement field with the same size as the input.
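A minimal PyTorch sketch of this kind of encoder-decoder is given below. The channel widths, the number of levels, and the 2-channel flow head are illustrative assumptions, not the exact architecture of the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout, stride=1):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1), nn.ReLU(inplace=True))

class OverlapFlowNet(nn.Module):
    """Encoder-decoder predicting a 2-channel displacement field for the overlap region."""
    def __init__(self):
        super().__init__()
        # Encoder: the two overlap crops are stacked along the channel dimension (2 x 3 = 6).
        self.enc1 = conv_block(6, 32, stride=2)     # 1/2 resolution
        self.enc2 = conv_block(32, 64, stride=2)    # 1/4 resolution
        self.enc3 = conv_block(64, 128, stride=2)   # 1/8 resolution
        self.flow3 = nn.Conv2d(128, 2, 3, 1, 1)     # coarse flow head
        # Decoder level: skip features from enc2 plus the upsampled flow from the level below.
        self.dec2 = conv_block(64 + 2, 64)
        self.flow2 = nn.Conv2d(64, 2, 3, 1, 1)

    def forward(self, crop_a, crop_b):
        x = torch.cat([crop_a, crop_b], dim=1)
        f1 = self.enc1(x)
        f2 = self.enc2(f1)
        f3 = self.enc3(f2)
        flow = self.flow3(f3)                                              # flow at 1/8
        flow = 2.0 * F.interpolate(flow, scale_factor=2,
                                   mode='bilinear', align_corners=False)  # rescale to 1/4
        flow = flow + self.flow2(self.dec2(torch.cat([f2, flow], dim=1))) # refine with skip features
        # Direct x4 bilinear upsampling from 1/4 back to the input resolution, as described in step 3.
        return 4.0 * F.interpolate(flow, scale_factor=4,
                                   mode='bilinear', align_corners=False)
```

The displacement magnitudes are multiplied by the upsampling factor because the flow is expressed in pixels of the current resolution.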
To train the artifact-removal module, a displacement field loss, a content loss, and a perceptual loss are defined.
The binary masks of the original viewpoints are warped by the displacement fields of the two images to obtain the binary mask M_ov of the actual overlap region, and an L1 loss is constructed between the pixel displacement field flow_O predicted by the network and the ground truth flow_gt over the overlap region:
L_flow = || M_ov ⊙ (flow_O - flow_gt) ||_1
where ⊙ denotes element-wise multiplication.
The content loss computes the L1 loss between the image I_gt at the virtual viewpoint and the image I_O output by the network over the overlap region:
L_content = || M_ov ⊙ (I_O - I_gt) ||_1
The perceptual loss keeps the features of the transformed image as consistent as possible with the target. The conv5_3 layer of a pre-trained VGG-19 feature extraction network is used to extract deep high-level semantic features; this mapping is denoted F(·). The MSE loss on this layer is computed and restricted to the overlap region mask M_ov:
L_perc = || M_ov ⊙ (F(I_O) - F(I_gt)) ||_2^2
The overall loss function of this module is then the weighted combination:
L_total1 = λ1·L_flow + λ2·L_content + λ3·L_perc
where λ1, λ2, λ3 are weighting coefficients.
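Assuming the losses take the masked forms written above (masked L1 on the flow and the image, masked MSE on VGG-19 conv5_3 features), the training loss could be sketched in PyTorch as follows. The layer cut-off index 33 for conv5_3 and the lam_* weights are assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Frozen VGG-19 feature extractor up to conv5_3 (index 32 in torchvision's layer list;
# the exact cut-off is an assumption).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:33].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def module1_loss(flow_o, flow_gt, img_o, img_gt, m_ov,
                 lam_flow=1.0, lam_content=1.0, lam_perc=0.1):
    """Masked losses of the artifact-removal module; m_ov is an (N, 1, H, W) float mask.
    The lam_* weights are illustrative placeholders."""
    # Displacement field loss: L1 between predicted and ground-truth flow in the overlap region.
    l_flow = (m_ov * (flow_o - flow_gt).abs()).mean()
    # Content loss: L1 between the network output and the virtual-viewpoint image in the overlap region.
    l_content = (m_ov * (img_o - img_gt).abs()).mean()
    # Perceptual loss: MSE on conv5_3 features, with the mask resized to the feature resolution.
    f_o, f_gt = vgg(img_o), vgg(img_gt)
    m_feat = F.interpolate(m_ov, size=f_o.shape[-2:], mode='nearest')
    l_perc = (m_feat * (f_o - f_gt) ** 2).mean()
    return lam_flow * l_flow + lam_content * l_content + lam_perc * l_perc
```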
further, the specific method of step 4 is as follows:
the smooth transition module is used for enabling the overlapped area and the non-overlapped area to be connected smoothly, and enabling the image to have better visual appearance. For the non-overlapped region, the design purpose is to form the propagation of the displacement field from the overlapped region to the non-overlapped region according to the image characteristics of the original viewpoint as guidance. To be able to form such a propagation relation, the original view image and the overlap region displacement field predicted at the previous stage are input, and the original view image is set to 1/4 resolution to fit the size of the overlap region displacement field. The submodule is composed of a series of convolution layers and residual blocks, expansion convolution is used in the residual blocks to expand the receptive field, 6 residual blocks are used in total, expansion parameters are set to be [1,2,4,8, 1], and pixel displacement fields of all regions of the two images are predicted through the regression structures of the two images respectively.
To train this module, a displacement field loss, a displacement field consistency loss, and a perceptual loss are defined.
In the non-overlap region, pixels close to the overlap region should receive more attention and pixels far from it less, so applying the same loss weight to every pixel is inappropriate. A per-pixel weight W_k is therefore constructed with a Gaussian function of the distance d_k from pixel k to the overlap region, and the displacement field loss takes the form:
W_k = exp(-d_k^2 / (2σ^2))
L_flow_full = Σ_k W_k · || flow_full(k) - flow_gt(k) ||_1
A displacement field consistency loss keeps the output of the second module consistent with the output of the first module in the overlap region:
L_consist = || M_ov ⊙ (flow_full - flow_O) ||_1
For the perceptual loss, the MSE loss on the conv5_3 layer of VGG-19 is computed in the same way; here the input is the warped image, and M is the binary mask of the virtual-viewpoint area covered by the original viewpoint image:
L_perc2 = || M ⊙ (F(I_warp) - F(I_gt)) ||_2^2
The overall loss function of this module is defined as the weighted combination:
L_total2 = μ1·L_flow_full + μ2·L_consist + μ3·L_perc2
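Assuming the Gaussian weights are driven by each pixel's distance to the overlap region (a plausible but unstated choice), the second module's losses could be sketched as below; sigma, the distance approximation, and the mu_* weights are illustrative.

```python
import torch
import torch.nn.functional as F

def gaussian_weights(m_ov, sigma=32.0, max_steps=128):
    """Build W_k = exp(-d_k^2 / (2*sigma^2)), where d_k approximates (by iterative dilation)
    the distance from pixel k to the overlap mask m_ov of shape (N, 1, H, W)."""
    dist = torch.full_like(m_ov, float(max_steps))
    dist[m_ov > 0] = 0.0
    reached = (m_ov > 0).float()
    for step in range(1, max_steps + 1):
        grown = F.max_pool2d(reached, kernel_size=3, stride=1, padding=1)  # dilate by one pixel
        newly = (grown > 0) & (dist >= float(max_steps))                   # pixels reached at this step
        dist[newly] = float(step)
        reached = grown
    return torch.exp(-dist ** 2 / (2.0 * sigma ** 2))

def module2_loss(flow_full, flow_gt, flow_overlap, m_ov, weights,
                 feat_warp, feat_gt, m_view_feat,
                 mu_flow=1.0, mu_consist=1.0, mu_perc=0.1):
    """Losses of the smooth transition module (mu_* weights are illustrative).
    feat_warp / feat_gt are VGG-19 conv5_3 features of the warped and target images;
    m_view_feat is the valid-region mask M already resized to the feature resolution."""
    # Gaussian-weighted displacement field loss over the whole image.
    l_flow = (weights * (flow_full - flow_gt).abs()).mean()
    # Keep the overlap-region flow consistent with the first module's (frozen) prediction.
    l_consist = (m_ov * (flow_full - flow_overlap.detach()).abs()).mean()
    # Perceptual (MSE) loss on deep features, restricted to the valid region.
    l_perc = (m_view_feat * (feat_warp - feat_gt) ** 2).mean()
    return mu_flow * l_flow + mu_consist * l_consist + mu_perc * l_perc
```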
further, the specific method of step 5 is as follows:
transforming the original view according to the final prediction result, and performing simple weighted linear fusion processing on the two images to obtain a splicing result I o
I o =W·warp(I A ,flow A )+(1-W)·warp(I B ,flow B )
Wherein, flowg A 、flowg B For the output pixel displacement field, warp is the transform function and W is the set linear fusion weight.
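Step 5 reduces to a backward warp with the predicted flow followed by per-pixel linear blending; a grid_sample-based sketch is shown below. The constant fusion weight of 0.5 is an assumption, since the patent only says W is a preset weight.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (N, C, H, W) with a per-pixel displacement field (N, 2, H, W) in pixels."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing='ij')
    grid_x = xs[None].float() + flow[:, 0]
    grid_y = ys[None].float() + flow[:, 1]
    # Normalize sampling coordinates to [-1, 1] as required by grid_sample.
    grid = torch.stack([2.0 * grid_x / (w - 1) - 1.0,
                        2.0 * grid_y / (h - 1) - 1.0], dim=-1)
    return F.grid_sample(img, grid, mode='bilinear', padding_mode='zeros', align_corners=True)

def stitch(img_a, img_b, flow_a, flow_b, weight=0.5):
    """I_o = W * warp(I_A, flow_A) + (1 - W) * warp(I_B, flow_B)."""
    return weight * warp(img_a, flow_a) + (1.0 - weight) * warp(img_b, flow_b)
```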
Beneficial effects: the multi-view video stitching method based on deep learning provided by the invention uses a deep convolutional neural network to perform video stitching and offers a new way of attacking the problem. The method can be applied to wide-baseline camera arrangements; artifacts caused by parallax are eliminated through the idea of viewpoint regression, improving the quality of the stitched video. Thanks to the feature extraction capability of convolutional neural networks, the method is also more robust than traditional methods in challenging scenes such as low illumination, low texture, or repetitive texture. The designed modules have short running time and can achieve online real-time stitching.
Drawings
Fig. 1 is an overall flowchart of the deep-learning-based multi-view video stitching method provided by the invention.
Fig. 2 is a schematic diagram of the camera arrangement in the present invention.
Fig. 3 shows the camera configuration in the virtual 3D environment in the present invention.
Fig. 4 shows the design of the overall network architecture of the present invention.
Fig. 5 compares the stitching results of different methods, where a is the reference ground truth, b is the multi-band fusion method, c is the APAP method, and d is the method of this patent; from top to bottom, each column shows the stitching results in different test scenes.
Detailed Description
The present invention will be further illustrated below with reference to the accompanying drawings and specific embodiments, which should be understood as merely illustrating the invention and not limiting its scope.
As shown in the figures, a multi-view video stitching method based on deep learning includes the following steps:
Step 1: collect image and depth data at a preset virtual common viewpoint, generate a data set for the video stitching task, and apply cylindrical projection preprocessing to the images in the data set according to the camera field of view.
Step 2: obtain 3D information of the scene from the depth data and convert it into a pixel-level displacement field.
Step 3: design an artifact-removal module with a convolutional neural network; it exploits the feature correlation of the overlap region to align it and regresses the viewpoints to the coincident virtual optical center, eliminating the artifacts produced after fusion.
Step 4: design a smooth transition module with a convolutional neural network; it propagates the deformation learned in the overlap region to the non-overlap region, guided by the image features, so that the regions transition smoothly and the visual discontinuity is reduced.
Step 5: warp the original viewpoint images according to the predicted displacement fields and apply weighted linear fusion to obtain the stitched result.
In this embodiment, the specific method of step 1 is as follows:
Video stitching across different viewpoints is treated as a viewpoint regression problem: the images acquired at the original viewpoints are mapped to an arbitrary common virtual viewpoint, so as to handle the parallax caused by the misalignment of the camera optical centers. To build an ideal optical-center-coincidence model at the virtual viewpoint and obtain reliable depth data, a camera model is set up in a virtual 3D environment with the AirSim simulator, and a training data set is generated.
In this embodiment, the specific method of step 2 is as follows:
The pixel displacement field is obtained by converting the depth information of the scene. Once the depth maps corresponding to the two cameras at the virtual viewpoint are available, the 3D coordinates of each pixel can be recovered. The image at the virtual viewpoint is then transformed to the original viewpoint, and the displacement field flow_gt of this viewpoint transformation is computed by stereo geometry.
In this embodiment, the specific method of step 3 is as follows:
In a video stitching task the overlap region is usually small. To filter out as much of the invalid area as possible and reduce computation, a maximal binary mask M_ov_max that contains the possible overlap region is derived from the camera configuration; the corresponding part is cropped from the input images and fed into this module.
For the possible overlap region, an encoder-decoder structure is designed. In the encoder, the two images are stacked along the channel dimension and features are extracted by a series of downsampling convolutional layers. The decoder consists of a series of upsampling layers and convolutional layers with skip connections: each level takes the features of the corresponding encoder layer and the displacement field output by the previous decoder layer and refines it by progressive upsampling. When the displacement field reaches 1/4 of the input resolution, it is upsampled directly by bilinear interpolation to obtain an overlap-region displacement field with the same size as the input.
To train the artifact-removal module, a displacement field loss, a content loss, and a perceptual loss are defined.
The binary masks of the original viewpoints are warped by the displacement fields of the two images to obtain the binary mask M_ov of the actual overlap region, and an L1 loss is constructed between the pixel displacement field flow_O predicted by the network and the ground truth flow_gt over the overlap region:
L_flow = || M_ov ⊙ (flow_O - flow_gt) ||_1
where ⊙ denotes element-wise multiplication.
The content loss computes the L1 loss between the image I_gt at the virtual viewpoint and the image I_O output by the network over the overlap region:
L_content = || M_ov ⊙ (I_O - I_gt) ||_1
The perceptual loss keeps the features of the transformed image as consistent as possible with the target. The conv5_3 layer of a pre-trained VGG-19 feature extraction network is used to extract deep high-level semantic features; this mapping is denoted F(·). The MSE loss on this layer is computed and restricted to the overlap region mask M_ov:
L_perc = || M_ov ⊙ (F(I_O) - F(I_gt)) ||_2^2
The overall loss function of this module is then the weighted combination:
L_total1 = λ1·L_flow + λ2·L_content + λ3·L_perc
where λ1, λ2, λ3 are weighting coefficients.
In this embodiment, the specific method of step 4 is as follows:
The smooth transition module makes the overlap region and the non-overlap region join smoothly, giving the image a better visual appearance. For the non-overlap region, the goal is to propagate the displacement field from the overlap region into the non-overlap region under the guidance of the image features of the original viewpoint. To form this propagation relation, the module takes as input the original viewpoint images and the overlap-region displacement field predicted in the previous stage, with the original viewpoint images resized to 1/4 resolution to match the size of the displacement field. The sub-module consists of a series of convolutional layers and residual blocks; dilated convolutions are used inside the residual blocks to enlarge the receptive field, 6 residual blocks are used in total with dilation rates set to [1, 2, 4, 8, 1], and the pixel displacement fields of the full regions of the two images are predicted by separate regression branches.
To train this module, a displacement field loss, a displacement field consistency loss, and a perceptual loss are defined.
In the non-overlap region, pixels close to the overlap region should receive more attention and pixels far from it less, so applying the same loss weight to every pixel is inappropriate. A per-pixel weight W_k is therefore constructed with a Gaussian function of the distance d_k from pixel k to the overlap region, and the displacement field loss takes the form:
W_k = exp(-d_k^2 / (2σ^2))
L_flow_full = Σ_k W_k · || flow_full(k) - flow_gt(k) ||_1
A displacement field consistency loss keeps the output of the second module consistent with the output of the first module in the overlap region:
L_consist = || M_ov ⊙ (flow_full - flow_O) ||_1
For the perceptual loss, the MSE loss on the conv5_3 layer of VGG-19 is computed in the same way; here the input is the warped image, and M is the binary mask of the virtual-viewpoint area covered by the original viewpoint image:
L_perc2 = || M ⊙ (F(I_warp) - F(I_gt)) ||_2^2
The overall loss function of this module is defined as the weighted combination:
L_total2 = μ1·L_flow_full + μ2·L_consist + μ3·L_perc2
In this embodiment, the specific method of step 5 is as follows:
The original viewpoint images are warped according to the final predicted displacement fields, and the two warped images are blended by a simple weighted linear fusion to obtain the stitched result I_o:
I_o = W·warp(I_A, flow_A) + (1 - W)·warp(I_B, flow_B)
where flow_A and flow_B are the output pixel displacement fields, warp is the warping function, and W is the preset linear fusion weight.
Examples
In the multi-view video stitching method based on deep learning, the camera configuration shown in Fig. 3 is set up in the AirSim simulator and a camera acquisition model is built. The camera FOV is set to 90 degrees, the resolution to 1280x720, and the angle between the two cameras to 60 degrees, at which point the overlap region between the images is less than 33%. Four cameras are arranged in total: the first two, with a baseline of 1 m, capture the original viewpoint images, while the other two are placed at the same virtual viewpoint and capture the ground-truth images and the depth data. Thousands of image groups are synthesized under several scene maps and weather conditions to build the data set used to train the network. During training, the artifact-removal module is trained first; its parameters are then frozen, and the smooth transition module is trained on top of it.
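The two-stage schedule described above (train the artifact-removal module first, then freeze it and train the smooth transition module) could be organized as in the sketch below; the optimizer settings, epoch counts, data loader, and the training_loss helpers and batch keys are hypothetical placeholders.

```python
import torch

def train_two_stage(artifact_net, transition_net, loader, epochs1=30, epochs2=30, lr=1e-4):
    """Stage 1: train the artifact-removal module alone.
       Stage 2: freeze it and train the smooth transition module on top of its predictions."""
    opt1 = torch.optim.Adam(artifact_net.parameters(), lr=lr)
    for _ in range(epochs1):
        for batch in loader:
            loss = artifact_net.training_loss(batch)   # hypothetical helper returning the module-1 loss
            opt1.zero_grad()
            loss.backward()
            opt1.step()

    # Freeze the artifact-removal module before stage 2.
    for p in artifact_net.parameters():
        p.requires_grad_(False)
    artifact_net.eval()

    opt2 = torch.optim.Adam(transition_net.parameters(), lr=lr)
    for _ in range(epochs2):
        for batch in loader:
            with torch.no_grad():
                overlap_flow = artifact_net(batch["crop_a"], batch["crop_b"])  # assumed batch keys
            loss = transition_net.training_loss(batch, overlap_flow)  # hypothetical helper
            opt2.zero_grad()
            loss.backward()
            opt2.step()
```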
Experiment: test video clips are collected in map scenes different from the training data, and several frames of the stitched videos are compared with existing stitching methods. The proposed method obtains good stitching results, which verifies its effectiveness.
The above description covers only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and such modifications and adaptations are also intended to fall within the scope of the invention.

Claims (6)

1. A multi-view video stitching method based on deep learning, characterized by comprising the following steps:
step 1: collect image and depth data at a preset virtual common viewpoint, generate a data set for the video stitching task, and apply cylindrical projection preprocessing to the images in the data set according to the camera field of view;
step 2: obtain 3D information of the scene from the depth data and convert it into a pixel-level displacement field;
step 3: design an artifact-removal module with a convolutional neural network; it exploits the feature correlation of the overlap region to align it and regresses the viewpoints to the coincident virtual optical center, eliminating the artifacts produced after fusion;
step 4: design a smooth transition module with a convolutional neural network; it propagates the deformation learned in the overlap region to the non-overlap region, guided by the image features, so that the regions transition smoothly and the visual discontinuity is reduced;
step 5: warp the original viewpoint images according to the predicted displacement fields and apply weighted linear fusion to obtain the stitched result.
2. The multi-view video stitching method based on deep learning according to claim 1, characterized in that the specific method of step 1 is as follows:
video stitching across different viewpoints is treated as a viewpoint regression problem, and the images acquired at the original viewpoints are mapped to an arbitrary common virtual viewpoint so as to handle the parallax caused by the misalignment of the camera optical centers; to build an ideal optical-center-coincidence model at the virtual viewpoint and obtain reliable depth data, a camera model is set up in a virtual 3D environment with the AirSim simulator, and a training data set is generated.
3. The multi-view video stitching method based on deep learning according to claim 1, characterized in that the specific method of step 2 is as follows:
the pixel displacement field is obtained by converting the depth information of the scene; after the depth maps corresponding to the two cameras at the virtual viewpoint are obtained, the 3D coordinates of each pixel are recovered; the image at the virtual viewpoint is transformed to the original viewpoint, and the displacement field flow_gt of this viewpoint transformation is computed by stereo geometry.
4. The multi-view video stitching method based on deep learning according to claim 1, characterized in that the specific method of step 3 is as follows:
in a video stitching task the overlap region is usually small; to filter out as much of the invalid area as possible and reduce computation, a maximal binary mask M_ov_max that contains the possible overlap region is derived from the camera configuration, and the corresponding part is cropped from the input images and fed into this module;
for the possible overlap region, an encoder-decoder structure is designed: in the encoder the two images are stacked along the channel dimension and features are extracted by a series of downsampling convolutional layers; the decoder consists of a series of upsampling layers and convolutional layers with skip connections, where each level takes the features of the corresponding encoder layer and the displacement field output by the previous decoder layer and refines it by progressive upsampling; when the displacement field reaches 1/4 of the input resolution, it is upsampled directly by bilinear interpolation to obtain an overlap-region displacement field with the same size as the input;
to train the artifact-removal module, a displacement field loss, a content loss and a perceptual loss are defined;
the binary masks of the original viewpoints are warped by the displacement fields of the two images to obtain the binary mask M_ov of the actual overlap region, and an L1 loss is constructed between the pixel displacement field flow_O predicted by the network and the ground truth flow_gt over the overlap region:
L_flow = || M_ov ⊙ (flow_O - flow_gt) ||_1
where ⊙ denotes element-wise multiplication;
the content loss computes the L1 loss between the image I_gt at the virtual viewpoint and the image I_O output by the network over the overlap region:
L_content = || M_ov ⊙ (I_O - I_gt) ||_1
the perceptual loss keeps the features of the transformed image as consistent as possible with the target; the conv5_3 layer of a pre-trained VGG-19 feature extraction network is used to extract deep high-level semantic features, the mapping being denoted F(·); the MSE loss on this layer is computed and restricted to the overlap region mask M_ov:
L_perc = || M_ov ⊙ (F(I_O) - F(I_gt)) ||_2^2
the overall loss function of this module is then the weighted combination:
L_total1 = λ1·L_flow + λ2·L_content + λ3·L_perc
5. The multi-view video stitching method based on deep learning according to claim 1, characterized in that the specific method of step 4 is as follows:
the smooth transition module makes the overlap region and the non-overlap region join smoothly, giving the image a better visual appearance; for the non-overlap region, the goal is to propagate the displacement field from the overlap region into the non-overlap region under the guidance of the image features of the original viewpoint; to form this propagation relation, the module takes as input the original viewpoint images and the overlap-region displacement field predicted in the previous stage, with the original viewpoint images resized to 1/4 resolution to match the size of the displacement field; the sub-module consists of a series of convolutional layers and residual blocks, dilated convolutions are used inside the residual blocks to enlarge the receptive field, 6 residual blocks are used in total with dilation rates set to [1, 2, 4, 8, 1], and the pixel displacement fields of the full regions of the two images are predicted by separate regression branches;
to train this module, a displacement field loss, a displacement field consistency loss and a perceptual loss are defined;
in the non-overlap region, pixels close to the overlap region should receive more attention and pixels far from it less, so applying the same loss weight to every pixel is inappropriate; a per-pixel weight W_k is therefore constructed with a Gaussian function of the distance d_k from pixel k to the overlap region, and the displacement field loss takes the form:
W_k = exp(-d_k^2 / (2σ^2))
L_flow_full = Σ_k W_k · || flow_full(k) - flow_gt(k) ||_1
a displacement field consistency loss keeps the output of the second module consistent with the output of the first module in the overlap region:
L_consist = || M_ov ⊙ (flow_full - flow_O) ||_1
for the perceptual loss, the MSE loss on the conv5_3 layer of VGG-19 is computed in the same way; here the input is the warped image, and M is the binary mask of the virtual-viewpoint area covered by the original viewpoint image:
L_perc2 = || M ⊙ (F(I_warp) - F(I_gt)) ||_2^2
the overall loss function of this module is defined as the weighted combination:
L_total2 = μ1·L_flow_full + μ2·L_consist + μ3·L_perc2
6. The multi-view video stitching method based on deep learning according to claim 1, characterized in that the specific method of step 5 is as follows:
the original viewpoint images are warped according to the final predicted displacement fields, and the two warped images are blended by a simple weighted linear fusion to obtain the stitched result I_o:
I_o = W·warp(I_A, flow_A) + (1 - W)·warp(I_B, flow_B)
where flow_A and flow_B are the output pixel displacement fields, warp is the warping function, and W is the preset linear fusion weight.
CN202210956950.6A 2022-08-10 2022-08-10 Multi-view video stitching method based on deep learning Pending CN115345781A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210956950.6A CN115345781A (en) 2022-08-10 2022-08-10 Multi-view video stitching method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210956950.6A CN115345781A (en) 2022-08-10 2022-08-10 Multi-view video stitching method based on deep learning

Publications (1)

Publication Number Publication Date
CN115345781A (en) 2022-11-15

Family

ID=83952205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210956950.6A Pending CN115345781A (en) 2022-08-10 2022-08-10 Multi-view video stitching method based on deep learning

Country Status (1)

Country Link
CN (1) CN115345781A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117422617A (en) * 2023-10-12 2024-01-19 华能澜沧江水电股份有限公司 Method and system for realizing image stitching of video conference system
CN117422617B (en) * 2023-10-12 2024-04-09 华能澜沧江水电股份有限公司 Method and system for realizing image stitching of video conference system


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination