CN115345781A - Multi-view video stitching method based on deep learning - Google Patents
- Publication number: CN115345781A
- Application number: CN202210956950.6A
- Authority: CN (China)
- Prior art keywords: displacement field, loss, viewpoint, image, stitching
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06T3/4038 — Scaling the whole image or part thereof for image mosaicing, i.e. plane images composed of plane sub-images
- G06T3/4046 — Scaling the whole image or part thereof using neural networks
- G06T5/70; G06T5/73
- G06T7/593 — Depth or shape recovery from multiple stereo images
- H04N13/122 — Improving the 3D impression of stereoscopic images by modifying image signal contents
- H04N13/161 — Encoding, multiplexing or demultiplexing different image signal components
- G06T2200/32 — Indexing scheme involving image mosaicing
- G06T2207/10016 — Video; image sequence
- G06T2207/20081 — Training; learning
- G06T2207/20084 — Artificial neural networks [ANN]
- G06T2207/20212 — Image combination
- G06T2207/20221 — Image fusion; image merging
- H04N2013/0081 — Depth or disparity estimation from stereoscopic image signals
Abstract
The invention discloses a multi-view video stitching method based on deep learning, comprising the following steps. First, the Airsim simulator collects images and depth data at a set virtual common viewpoint, generating a data set for the video stitching task, and the images undergo preprocessing such as cylindrical projection. Next, a convolutional neural network is used to design an artifact elimination module and a smooth transition module: the former exploits the feature correlation of the overlap region and aligns it through viewpoint regression to eliminate fusion artifacts; the latter propagates the deformation learned in the overlap region to the non-overlap region according to the image features, guiding a smooth transition between regions and improving the visual result. Finally, the original viewpoint images are warped according to the predicted displacement fields and combined by weighted linear fusion to obtain the stitching result. The invention removes stitching artifacts while achieving real-time performance, meeting the online stitching requirements of practical applications.
Description
Technical Field
The invention relates to video stitching technology and belongs to the technical field of computer vision.
Background
Video stitching has important theoretical research significance and plays an important role in application fields such as virtual reality, security surveillance, intelligent driving, video conferencing, and drone aerial photography. Video stitching is commonly used to combine two or more videos captured by cameras with different poses; it reduces the requirements on video acquisition equipment and yields a larger field of view. Although image and video stitching have a long research history, existing video stitching methods are far from perfect: current methods suffer from long computation time, poor performance in wide-baseline, large-parallax scenes, and insufficient robustness. The common approach of aligning by a global homography is unaffected by parallax only when the camera optical centers nearly coincide or the scene depth varies little, in which case it produces good results; otherwise it generates obvious artifacts. In practical applications, however, exactly coincident optical centers are difficult to achieve, and some scenarios, such as vehicle surround-view systems, even require a distributed camera arrangement. To reduce artifacts, optimal-seam methods are commonly used, but such methods can cause uneven transitions, and minimizing their energy functions remains computationally expensive.
The development of deep learning offers a brand-new dimension for image and video stitching, and a suitable design can improve the quality of the stitched video. Convolutional neural networks (CNNs) have strong feature extraction capability; replacing traditional hand-crafted features with a CNN gives better robustness in scenes with low illumination, low texture, or repetitive texture. Accordingly, deep-learning-based homography estimation has also been applied to small-parallax image stitching. However, the lack of a suitable data set makes it difficult to apply deep learning to video and image stitching tasks, and some methods use synthesized, parallax-free data sets that often do not match real application scenarios.
Disclosure of Invention
The technical problem is as follows: in view of the prior art, the invention provides a multi-view video stitching method based on deep learning that can eliminate artifacts caused by parallax, improve robustness in challenging scenes such as low illumination, low texture, or repetitive texture, and achieve high computational efficiency, meeting the online real-time stitching requirements of practical applications.
The technical scheme is as follows: to achieve the above purpose, the invention adopts the following technical scheme.
A multi-view video stitching method based on deep learning comprises the following steps:
Step 1: collect image and depth data at a set virtual common viewpoint, generate a data set for the video stitching task, and apply cylindrical-projection preprocessing to the images in the data set according to the camera field of view.
Step 2: obtain 3D scene information from the depth data and convert it into a pixel-level displacement field.
Step 3: design an artifact elimination module with a convolutional neural network, aligning the overlap regions by exploiting their feature correlation and regressing the viewpoints to the common virtual optical center so as to eliminate artifacts generated after fusion.
Step 4: design a smooth transition module with a convolutional neural network, propagating the deformation learned in the overlap region to the non-overlap region according to the image features, guiding a smooth transition between regions and reducing visual discontinuity.
Step 5: warp the original viewpoint images according to the predicted displacement fields and apply weighted linear fusion to obtain the stitching result.
Further, the specific method of step 1 is as follows:
and (3) splicing videos at different viewpoints to be regarded as a viewpoint regression problem, and mapping the image acquired at the original viewpoint to any common virtual viewpoint so as to process the parallax caused by the misalignment of the optical centers of the cameras. In order to build an ideal optical center coincidence model at a virtual viewpoint and obtain reliable depth data, a camera model is built in a virtual 3D environment by utilizing an Airsim simulator, and a data set for training is generated.
Further, the specific method of step 2 is as follows:
the pixel displacement field is obtained by converting depth information in the scene. Depth corresponding to two cameras at the position of obtaining virtual viewpointAnd obtaining the 3D coordinates of the pixel points after information is obtained. Transforming the image at the virtual viewpoint to the original viewpoint, and calculating the displacement field flow in the viewpoint transformation process in a stereo geometric mode gt 。
Further, the specific method in step 3 is as follows:
In a video stitching task the overlap region is generally small. To filter out as much of the invalid area as possible and reduce computation, a maximal binary mask M_ov_max containing the overlap region is derived from the camera configuration; this part is cropped from the input image and fed to the current module.
For the possible overlap region, an encoder-decoder structure is designed. In the encoder, the two pictures are stacked along the channel dimension and features are extracted by a series of downsampling convolutional layers; the decoder consists of a series of upsampling and convolutional layers with skip connections. Each decoder layer takes the features of the corresponding encoder layer and the displacement field output by the previous decoder layer, refining it by progressive upsampling; once the displacement field reaches 1/4 resolution, it is upsampled directly by bilinear interpolation to obtain an overlap-region displacement field of the same size as the input.
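The final bilinear-upsampling step can be sketched in numpy as below. Note one implementation detail assumed here, not stated in the text: when a displacement field is upsampled to 4x resolution, its vectors must also be multiplied by 4 to stay valid in the new pixel grid.

```python
import numpy as np

def upsample_flow(flow, scale=4):
    """Bilinearly upsample a (h, w, 2) displacement field by `scale`,
    scaling the vectors themselves for the new grid (align_corners style)."""
    h, w, _ = flow.shape
    H, W = h * scale, w * scale
    ys = np.linspace(0, h - 1, H)
    xs = np.linspace(0, w - 1, W)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    up = ((1 - wy) * (1 - wx) * flow[y0][:, x0]
          + (1 - wy) * wx * flow[y0][:, x0 + 1]
          + wy * (1 - wx) * flow[y0 + 1][:, x0]
          + wy * wx * flow[y0 + 1][:, x0 + 1])
    return up * scale

# A constant unit flow at 1/4 resolution becomes a constant flow of 4 px.
up = upsample_flow(np.ones((2, 2, 2)))
```

In a real pipeline this is the step the decoder hands off to, so the network never has to regress at full resolution.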
To train the artifact elimination module, a displacement field loss, a content loss, and a perceptual loss are defined.
The binary masks of the original viewpoints are warped by the displacement fields of the two images to obtain the binary mask M_ov of the actual overlap region. An L1 loss is then constructed over the overlap region between the pixel displacement field flow_O predicted by the network and the ground truth:
content loss computation of image I at virtual viewpoint gt And image I output by network O L1 Loss at the overlap region:
the purpose of perceiving layer Loss is to keep the feature of the transformed image as consistent as possible, a conv5_3 layer in a pre-trained VGG-19 feature extraction network is used for extracting deep high-level semantic features, the process is defined as F (-), MSE Loss on the layer is calculated, and an overlapping region mask M is used ov Carrying out extraction:
the overall penalty function for this module is then:
further, the specific method of step 4 is as follows:
the smooth transition module is used for enabling the overlapped area and the non-overlapped area to be connected smoothly, and enabling the image to have better visual appearance. For the non-overlapped region, the design purpose is to form the propagation of the displacement field from the overlapped region to the non-overlapped region according to the image characteristics of the original viewpoint as guidance. To be able to form such a propagation relation, the original view image and the overlap region displacement field predicted at the previous stage are input, and the original view image is set to 1/4 resolution to fit the size of the overlap region displacement field. The submodule is composed of a series of convolution layers and residual blocks, expansion convolution is used in the residual blocks to expand the receptive field, 6 residual blocks are used in total, expansion parameters are set to be [1,2,4,8, 1], and pixel displacement fields of all regions of the two images are predicted through the regression structures of the two images respectively.
In order to train this module, displacement field loss, displacement field consistency loss, perceptual layer loss are defined.
In the non-overlap region, the part close to the overlap region should receive more attention and the part far from it less, so applying the same loss weight to every pixel is inappropriate. A Gaussian function is therefore used to construct weights W_k, yielding the displacement field loss:
a shift field consistency loss function to keep the output of the second module in the overlap region consistent with the output of the first module:
for the Loss of the perception layer, the MSE Loss on the conv5_3 layer in the VGG-19 network is also calculated, at this time, the input is a deformed image, and the binary mask containing the original viewpoint image content in the virtual viewpoint is M:
the overall loss function of the module is defined as:
further, the specific method of step 5 is as follows:
transforming the original view according to the final prediction result, and performing simple weighted linear fusion processing on the two images to obtain a splicing result I o :
I o =W·warp(I A ,flow A )+(1-W)·warp(I B ,flow B )
Wherein, flowg A 、flowg B For the output pixel displacement field, warp is the transform function and W is the set linear fusion weight.
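The fusion equation above can be sketched directly. The nearest-neighbour warp and constant blend weight below are simplifying assumptions; a real implementation would use bilinear sampling and typically a spatially varying W.

```python
import numpy as np

def warp(img, flow):
    """Backward-warp a grayscale image (h, w) by a displacement field
    (h, w, 2), rounding to nearest neighbour for brevity."""
    h, w = img.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    src_x = np.clip((u - flow[..., 0]).round().astype(int), 0, w - 1)
    src_y = np.clip((v - flow[..., 1]).round().astype(int), 0, h - 1)
    return img[src_y, src_x]

def blend(img_a, flow_a, img_b, flow_b, weight):
    """I_o = W * warp(I_A, flow_A) + (1 - W) * warp(I_B, flow_B)."""
    return weight * warp(img_a, flow_a) + (1 - weight) * warp(img_b, flow_b)

# Toy check: equal-weight blend of two flat images with zero flow.
a = np.full((4, 4), 10.0)
b = np.full((4, 4), 30.0)
zero = np.zeros((4, 4, 2))
out = blend(a, zero, b, zero, 0.5)
```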
Beneficial effects: the deep-learning-based multi-view video stitching method provided by the invention uses a deep convolutional neural network for video stitching, offering a new approach to the problem. The method can be applied to wide-baseline camera arrangements; artifacts caused by parallax are eliminated through the idea of viewpoint regression, improving the quality of the stitched video. Meanwhile, thanks to the strength of convolutional neural networks in extracting image features, the method is more robust than traditional methods in challenging scenes such as low illumination, low texture, or repetitive texture. The designed modules run fast enough to meet online real-time stitching requirements.
Drawings
Fig. 1 is an overall flowchart of a deep learning-based multi-view video stitching method provided by the invention.
FIG. 2 is a schematic diagram of an arrangement of cameras according to the present invention.
Fig. 3 is a view of a camera configuration in a virtual 3D environment in the present invention.
Fig. 4 is a design of the overall network architecture of the present invention.
FIG. 5 compares the stitching results of different methods, where a is the ground-truth reference, b is the multiband fusion method, c is the APAP method, and d is the proposed method; from top to bottom, each column shows stitching results under different test scenes.
Detailed Description
The present invention will be further illustrated with reference to the accompanying drawings and specific embodiments, which are to be understood as merely illustrative of the invention and not as limiting the scope of the invention.
As shown in the figures, a multi-view video stitching method based on deep learning comprises the following steps:
Step 1: collect image and depth data at a set virtual common viewpoint, generate a data set for the video stitching task, and apply cylindrical-projection preprocessing to the images in the data set according to the camera field of view.
Step 2: obtain 3D scene information from the depth data and convert it into a pixel-level displacement field.
Step 3: design an artifact elimination module with a convolutional neural network, aligning the overlap regions by exploiting their feature correlation and regressing the viewpoints to the common virtual optical center so as to eliminate artifacts generated after fusion.
Step 4: design a smooth transition module with a convolutional neural network, propagating the deformation learned in the overlap region to the non-overlap region according to the image features, guiding a smooth transition between regions and reducing visual discontinuity.
Step 5: warp the original viewpoint images according to the displacement fields and apply weighted linear fusion to obtain the stitching result.
In this embodiment, the specific method of step 1 is as follows:
and (3) splicing videos at different viewpoints to be regarded as a viewpoint regression problem, and mapping the image acquired at the original viewpoint to any common virtual viewpoint so as to process the parallax caused by the misalignment of the optical centers of the cameras. In order to build an ideal optical center coincidence model at a virtual viewpoint and obtain reliable depth data, a camera model is built in a virtual 3D environment by using an Airsim simulator, and a data set for training is generated.
In this embodiment, the specific method of step 2 is as follows:
the pixel displacement field is obtained by converting depth information in the scene. After the depth information corresponding to the two cameras at the virtual viewpoint is obtained, the 3D coordinates of the pixel points can be obtained. The image at the virtual viewpoint is transformed to the original viewpoint, and the displacement field flow in the viewpoint transformation process can be calculated and obtained in a stereo geometric mode gt 。
In this embodiment, the specific method of step 3 is as follows:
In a video stitching task the overlap region is generally small. To filter out as much of the invalid area as possible and reduce computation, a maximal binary mask M_ov_max containing the overlap region is derived from the camera configuration; this part is cropped from the input image and fed to the current module.
For the possible overlap region, an encoder-decoder structure is designed. In the encoder, the two pictures are stacked along the channel dimension and features are extracted by a series of downsampling convolutional layers; the decoder consists of a series of upsampling and convolutional layers with skip connections. Each decoder layer takes the features of the corresponding encoder layer and the displacement field output by the previous decoder layer, refining it by progressive upsampling; once the displacement field reaches 1/4 resolution, it is upsampled directly by bilinear interpolation to obtain an overlap-region displacement field of the same size as the input.
In order to train the artifact removal module, displacement field loss, content loss, perceptual layer loss are defined.
The binary masks of the original viewpoints are warped by the displacement fields of the two images to obtain the binary mask M_ov of the actual overlap region. An L1 loss is then constructed over the overlap region between the pixel displacement field flow_O predicted by the network and the ground truth:
The content loss computes the L1 loss over the overlap region between the image I_gt at the virtual viewpoint and the image I_O output by the network:
The perceptual loss aims to keep the features of the transformed image as consistent as possible. The conv5_3 layer of a pre-trained VGG-19 feature extraction network, denoted F(·), extracts deep high-level semantic features; the MSE loss on this layer is computed, restricted by the overlap-region mask M_ov:
The overall loss function of this module is then:
in this embodiment, the specific method of step 4 is as follows:
the smooth transition module is used for enabling the overlapped area and the non-overlapped area to be connected smoothly, and enabling the image to have better visual appearance. For the non-overlapped region, the design purpose is to form the propagation of the displacement field from the overlapped region to the non-overlapped region according to the image characteristics of the original viewpoint as guidance. To be able to form such a propagation relation, the original view image and the overlap region displacement field predicted at the previous stage are input, and the original view image is set to 1/4 resolution to fit the size of the overlap region displacement field. The submodule is composed of a series of convolution layers and residual blocks, expansion convolution is used in the residual blocks to expand the receptive field, 6 residual blocks are used in total, expansion parameters are set to be [1,2,4,8, 1], and pixel displacement fields of all regions of the two images are predicted through the regression structures of the two images respectively.
In order to train this module, displacement field loss, displacement field consistency loss, perceptual layer loss are defined.
In the non-overlap region, the part close to the overlap region should receive more attention and the part far from it less, so applying the same loss weight to every pixel is inappropriate. A Gaussian function is therefore used to construct weights W_k, yielding the displacement field loss:
a displacement field consistency loss function for keeping the output of the second module in the overlap region consistent with the output of the first module:
for the Loss of the perception layer, the MSE Loss on the conv5_3 layer in the VGG-19 network is calculated in the same way, at the moment, the input is a deformed image, and the binary mask of the virtual viewpoint containing the image content of the original viewpoint is M:
the overall loss function of the module is defined as:
in this embodiment, the specific method of step 5 is as follows:
transforming the original view according to the final prediction result, and performing simple weighted linear fusion processing on the two images to obtain a splicing result I o :
I o =W·warp(I A ,flow A )+(1-W)·warp(I B ,flow B )
Wherein, flow A 、flow B For the output pixel displacement field, warp is the transform function and W is the set linear fusion weight.
Examples
For the deep-learning-based multi-view video stitching method, the camera configuration shown in FIG. 3 is set up in the Airsim simulator to build the acquisition model. The camera FOV is set to 90 degrees, the resolution to 1280x720, and the angle between the two cameras to 60 degrees, at which point the overlap region of the images is less than 33%. Four cameras are arranged in total: the first two, with a baseline of 1 m, capture the original viewpoint images, while the other two are placed at the same virtual viewpoint and capture the ground-truth images and depth data. Thousands of image groups are synthesized under several scene maps and weather conditions to construct the data set for training the network. During training, only the artifact elimination module is trained first; its parameters are then fixed, and the smooth transition module is trained next.
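The stated geometry can be checked with a quick angular computation: two cameras with 90° FOV whose optical axes are 60° apart share at most a 30° wedge, i.e. one third of each view, consistent with the "less than 33%" overlap mentioned above. This purely angular estimate ignores perspective distortion:

```python
def angular_overlap_fraction(fov_deg, angle_between_deg):
    """Fraction of one camera's field of view shared with the other,
    measured purely in angle (perspective distortion is ignored)."""
    overlap = max(0.0, fov_deg - angle_between_deg)
    return overlap / fov_deg

# 90 deg FOV, optical axes 60 deg apart -> 30/90 = 1/3 overlap.
frac = angular_overlap_fraction(90, 60)
```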
Experiment: test video segments were collected in map scenes different from the training data, and selected frames of the video stitching results were compared with existing stitching methods. The method obtains good stitching results, verifying its effectiveness.
The above description covers only the preferred embodiments of the present invention. It should be noted that various modifications and adaptations can be made by those skilled in the art without departing from the principles of the invention, and such modifications and adaptations are intended to fall within the scope of the invention.
Claims (6)
1. A multi-view video stitching method based on deep learning, characterized by comprising the following steps:
step 1: collecting image and depth data at a set virtual common viewpoint, generating a data set for the video stitching task, and applying cylindrical-projection preprocessing to the images in the data set according to the camera field of view;
step 2: obtaining 3D scene information from the depth data and converting it into a pixel-level displacement field;
step 3: designing an artifact elimination module with a convolutional neural network, aligning the overlap regions by exploiting their feature correlation and regressing the viewpoints to the common virtual optical center so as to eliminate artifacts generated after fusion;
step 4: designing a smooth transition module with a convolutional neural network, propagating the deformation learned in the overlap region to the non-overlap region according to the image features so as to guide a smooth transition between regions and reduce visual discontinuity;
step 5: warping the original viewpoint images according to the displacement fields and applying weighted linear fusion to obtain the stitching result.
2. The deep-learning-based multi-view video stitching method of claim 1, wherein the specific method of step 1 is as follows:
treating video stitching across different viewpoints as a viewpoint regression problem, and mapping the images acquired at the original viewpoints to an arbitrary common virtual viewpoint to handle the parallax caused by the misaligned camera optical centers; to build an ideal optical-center-coincidence model at the virtual viewpoint and obtain reliable depth data, setting up a camera model in a virtual 3D environment with the Airsim simulator and generating a data set for training.
3. The deep-learning-based multi-view video stitching method of claim 1, wherein the specific method of step 2 is as follows:
obtaining the pixel displacement field by converting the depth information of the scene; recovering the 3D coordinates of each pixel after obtaining the depth information of the two cameras at the virtual viewpoint; transforming the image at the virtual viewpoint to the original viewpoint, and computing the displacement field flow_gt of the viewpoint transformation by stereo geometry.
4. The deep-learning-based multi-view video stitching method according to claim 1, wherein step 3 specifically comprises:
in a video stitching task the overlapping area is generally small; to filter out as much of the invalid area as possible and reduce computation, a maximal binary mask M_ov_max containing the overlapping region is derived from the camera configuration, and the corresponding part of the input images is extracted and fed into this module;
for the possible overlapping region, an encoder-decoder structure is designed: in the encoder, the two pictures are stacked along the channel dimension and downsampled by a series of convolutional layers to extract features; the decoder consists of a series of upsampling and convolutional layers with skip connections, where each decoder layer takes the features of the corresponding encoder layer and the displacement field output by the previous decoder layer and refines it by progressive upsampling; once the displacement field at 1/4 resolution is obtained, it is upsampled directly by bilinear interpolation to an overlap-region displacement field of the same size as the input;
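The final bilinear upsampling step can be sketched in NumPy; note that when a displacement field is enlarged 4x, the flow values themselves must also be scaled by 4, a detail the claim leaves implicit (the function name and the sampling convention are assumptions):

```python
import numpy as np

def upsample_flow_x4(flow):
    """Bilinearly upsample a (2, h, w) displacement field to 4x resolution.

    The 1/4-resolution field is enlarged by bilinear interpolation, and the
    flow values are multiplied by 4 because displacements are measured in
    pixels of the new, larger grid. Illustrative sketch only.
    """
    c, h, w = flow.shape
    H, W = 4 * h, 4 * w
    # Sample positions of the target grid in source coordinates.
    ys = (np.arange(H) + 0.5) / 4 - 0.5
    xs = (np.arange(W) + 0.5) / 4 - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0, 1)[None, :, None]
    wx = np.clip(xs - x0, 0, 1)[None, None, :]
    top = flow[:, y0][:, :, x0] * (1 - wx) + flow[:, y0][:, :, x1] * wx
    bot = flow[:, y1][:, :, x0] * (1 - wx) + flow[:, y1][:, :, x1] * wx
    return 4.0 * (top * (1 - wy) + bot * wy)
```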
to train the artifact-elimination module, a displacement-field loss, a content loss, and a perceptual loss are defined;
the displacement-field loss first transforms the binary masks of the original viewpoints by the displacement fields of the two images to obtain the binary mask M_ov of the actual overlapping region; an L1 loss between the pixel displacement field flow_O predicted by the network and the ground-truth field flow_gt is then constructed over the overlapping region:
the content loss computes the L1 loss, over the overlapping region, between the image I_gt at the virtual viewpoint and the image I_O output by the network:
the perceptual loss keeps the features of the warped image as consistent as possible: the conv5_3 layer of a pre-trained VGG-19 feature-extraction network, a process denoted F(·), is used to extract deep high-level semantic features, the MSE loss on this layer is computed, and the overlap-region mask M_ov is used to extract the relevant part:
the overall loss function of this module is then:
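A minimal NumPy sketch of the three losses described above, with a placeholder feat_fn standing in for the pre-trained VGG-19 conv5_3 extractor F(·); the loss weights and the exact masking of the perceptual term are illustrative assumptions, not values stated in the patent:

```python
import numpy as np

def l1(x):
    # Mean absolute value, used for both L1 terms below.
    return np.abs(x).mean()

def artifact_module_loss(flow_o, flow_gt, img_o, img_gt, m_ov, feat_fn,
                         w_flow=1.0, w_content=1.0, w_percep=1.0):
    """Sketch of the artifact-elimination training objective.

    flow_o / img_o are network outputs, flow_gt / img_gt the virtual-viewpoint
    ground truth, m_ov the actual-overlap binary mask, and feat_fn a stand-in
    for the VGG-19 conv5_3 extractor. Weights are illustrative assumptions.
    """
    loss_flow = l1(m_ov * (flow_o - flow_gt))          # displacement-field L1
    loss_content = l1(m_ov * (img_o - img_gt))         # content L1 on overlap
    f_o, f_gt = feat_fn(m_ov * img_o), feat_fn(m_ov * img_gt)
    loss_percep = ((f_o - f_gt) ** 2).mean()           # MSE on deep features
    return (w_flow * loss_flow + w_content * loss_content
            + w_percep * loss_percep)
```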
5. The deep-learning-based multi-view video stitching method according to claim 1, wherein step 4 specifically comprises:
the smooth-transition module joins the overlapping and non-overlapping regions smoothly so that the image has a better visual appearance; for the non-overlapping region, the design goal is to propagate the displacement field from the overlapping region into the non-overlapping region under the guidance of the image features of the original viewpoint; to form this propagation, the original-viewpoint image and the overlap-region displacement field predicted in the previous stage are taken as input, with the original-viewpoint image resized to 1/4 resolution to match the size of the overlap-region displacement field; the module consists of a series of convolutional layers and residual blocks, in which dilated convolutions enlarge the receptive field; 6 residual blocks are used in total, with dilation rates set to [1, 2, 4, 8, 1], and the pixel displacement fields of all regions of the two images are predicted by two separate regression branches;
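As a rough illustration of why this dilation schedule enlarges the context available for propagating the displacement field, the receptive field of a stack of dilated 3x3 convolutions (treating each residual block as a single convolution, a deliberate simplification) can be computed as:

```python
def receptive_field(dilations, k=3):
    """Receptive field of a stack of kxk dilated convolutions with stride 1.

    Each layer adds (k - 1) * dilation pixels to the field. This is a
    simplified model (one conv per residual block) used only to show how
    dilation grows the context without extra downsampling.
    """
    rf = 1
    for d in dilations:
        rf += (k - 1) * d
    return rf
```

Under this simplification, the schedule [1, 2, 4, 8, 1] already covers a 33-pixel window at 1/4 resolution, versus 11 pixels for five undilated layers.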
to train this module, a displacement-field loss, a displacement-field consistency loss, and a perceptual loss are defined;
within the non-overlapping region, the parts close to the overlapping region should receive more attention and the parts far from it less, so applying the same loss weight to every pixel is not appropriate; the weights W_k are therefore constructed with a Gaussian function, giving the displacement-field loss:
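A hedged sketch of such Gaussian weighting: pixels inside the overlap mask get full weight, and the weight decays with the distance to the nearest overlap pixel. The sigma value and the brute-force distance computation are illustrative choices, not values from the patent:

```python
import numpy as np

def gaussian_flow_weights(m_ov, sigma=8.0):
    """Gaussian attention weights for the displacement-field loss.

    Pixels inside the overlap mask m_ov get weight 1; outside, the weight
    decays with the squared distance to the nearest overlap pixel, so far
    parts of the non-overlapping region contribute less to the loss.
    """
    h, w = m_ov.shape
    ys, xs = np.nonzero(m_ov)                  # coordinates of overlap pixels
    gy, gx = np.mgrid[0:h, 0:w]
    # Distance from every pixel to its nearest overlap pixel (brute force;
    # fine for a sketch, use a distance transform in practice).
    d2 = ((gy[..., None] - ys) ** 2 + (gx[..., None] - xs) ** 2).min(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))
```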
a displacement-field consistency loss is used to keep the output of the second module consistent with that of the first module in the overlapping region:
for the perceptual loss, the MSE loss on the conv5_3 layer of the VGG-19 network is computed in the same way; here the inputs are the warped images, and M denotes the binary mask of the virtual viewpoint that contains the image content of the original viewpoints:
the overall loss function of this module is defined as:
6. The deep-learning-based multi-view video stitching method according to claim 1, wherein step 5 specifically comprises:
warping the original viewpoints according to the final predicted displacement fields, and fusing the two images by simple weighted linear fusion to obtain the stitching result I_o:
I_o = W · warp(I_A, flow_A) + (1 - W) · warp(I_B, flow_B)
where flow_A and flow_B are the output pixel displacement fields, warp is the warping function, and W is the chosen linear fusion weight.
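The fusion formula above can be sketched in NumPy with a minimal nearest-neighbour backward warp standing in for warp(·); a real implementation would use bilinear sampling:

```python
import numpy as np

def warp(img, flow):
    """Backward warp: sample img at (x + flow_x, y + flow_y), nearest
    neighbour. A minimal stand-in for the warp(.) function above."""
    h, w = img.shape[:2]
    gy, gx = np.mgrid[0:h, 0:w]
    sx = np.clip(np.round(gx + flow[0]).astype(int), 0, w - 1)
    sy = np.clip(np.round(gy + flow[1]).astype(int), 0, h - 1)
    return img[sy, sx]

def fuse(img_a, flow_a, img_b, flow_b, W=0.5):
    """I_o = W * warp(I_A, flow_A) + (1 - W) * warp(I_B, flow_B)."""
    return W * warp(img_a, flow_a) + (1.0 - W) * warp(img_b, flow_b)
```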
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210956950.6A CN115345781A (en) | 2022-08-10 | 2022-08-10 | Multi-view video stitching method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115345781A true CN115345781A (en) | 2022-11-15 |
Family
ID=83952205
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210956950.6A Pending CN115345781A (en) | 2022-08-10 | 2022-08-10 | Multi-view video stitching method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115345781A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117422617A (en) * | 2023-10-12 | 2024-01-19 | 华能澜沧江水电股份有限公司 | Method and system for realizing image stitching of video conference system |
CN117422617B (en) * | 2023-10-12 | 2024-04-09 | 华能澜沧江水电股份有限公司 | Method and system for realizing image stitching of video conference system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Song et al. | Channel attention based iterative residual learning for depth map super-resolution | |
TWI709107B (en) | Image feature extraction method and saliency prediction method including the same | |
CN111652966B (en) | Three-dimensional reconstruction method and device based on multiple visual angles of unmanned aerial vehicle | |
CN111105432B (en) | Unsupervised end-to-end driving environment perception method based on deep learning | |
CN110610526B (en) | Method for segmenting monocular image and rendering depth of field based on WNET | |
CN112288627B (en) | Recognition-oriented low-resolution face image super-resolution method | |
CN112019828B (en) | Method for converting 2D (two-dimensional) video into 3D video | |
CN111696035A (en) | Multi-frame image super-resolution reconstruction method based on optical flow motion estimation algorithm | |
CN113538243B (en) | Super-resolution image reconstruction method based on multi-parallax attention module combination | |
CN113034563A (en) | Self-supervision type monocular depth estimation method based on feature sharing | |
Yuan et al. | Multiview scene image inpainting based on conditional generative adversarial networks | |
CN114782596A (en) | Voice-driven human face animation generation method, device, equipment and storage medium | |
CN113808005A (en) | Video-driving-based face pose migration method and device | |
CN116563459A (en) | Text-driven immersive open scene neural rendering and mixing enhancement method | |
CN115345781A (en) | Multi-view video stitching method based on deep learning | |
CN116957931A (en) | Method for improving image quality of camera image based on nerve radiation field | |
CN116823908A (en) | Monocular image depth estimation method based on multi-scale feature correlation enhancement | |
CN115330935A (en) | Three-dimensional reconstruction method and system based on deep learning | |
CN115170921A (en) | Binocular stereo matching method based on bilateral grid learning and edge loss | |
CN113362240A (en) | Image restoration method based on lightweight feature pyramid model | |
CN112950481A (en) | Water bloom shielding image data collection method based on image mosaic network | |
Zhao et al. | 3dfill: Reference-guided image inpainting by self-supervised 3d image alignment | |
CN110766732A (en) | Robust single-camera depth map estimation method | |
LIU et al. | A Lightweight and Efficient Infrared Pedestrian Semantic Segmentation Method | |
Zhuang et al. | Dimensional transformation mixer for ultra-high-definition industrial camera dehazing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||