CN113139910B - Video completion method - Google Patents
- Publication number: CN113139910B
- Application number: CN202010066844.1A
- Authority: CN (China)
- Prior art keywords: missing; video; video sequence; depth map; network
- Prior art date
- Legal status: Active (an assumption, not a legal conclusion)
Classifications
- G06T5/77 — Retouching; inpainting; scratch removal
- G06T5/50 — Image enhancement or restoration using two or more images, e.g. averaging or subtraction
- G06T7/50 — Depth or shape recovery
- G06T7/70 — Determining position or orientation of objects or cameras
- H04N21/44 — Processing of video elementary streams
- G06T2207/10016 — Video; image sequence
- G06T2207/20221 — Image fusion; image merging
Abstract
The invention provides a video completion method that completes a missing video sequence (a video sequence with missing content) by reconstructing a three-dimensional scene. The method comprises the following steps: step 1, identifying the depth map of each frame in the missing video sequence through a preset depth-map network; step 2, identifying the relative camera pose between every two adjacent frames in the missing video sequence through a preset pose network; step 3, based on the depth maps and the relative camera poses, fusing the depth maps with a truncated signed distance function so as to construct a three-dimensional scene corresponding to the video background of the missing video sequence; step 4, projecting the three-dimensional scene into the missing video sequence using the relative camera poses and the camera intrinsics, so as to fill the defective region of each frame and obtain a completed video sequence; and step 5, performing secondary completion on the completed video sequence with a preset missing-region completion network, so as to form a complete video sequence without missing content.
Description
Technical Field
The invention belongs to the field of computer-vision video restoration and relates to a video completion method, in particular to a video completion method that estimates the depth information of a video sequence and reconstructs a three-dimensional background scene.
Background
The goal of video completion is to plausibly fill in the missing or damaged part of each frame of a video using the surrounding region and the adjacent frames. Unlike single-image completion, video completion must consider not only the image content near the missing region but also the content of neighbouring frames; that is, spatial consistency and temporal consistency must be satisfied simultaneously.
Although traditional patch-matching methods achieve a certain effect, they consume a large amount of computation, so in recent years video completion has been greatly advanced by neural-network-based methods. Some of these methods (e.g., references [1] and [2]) train a large convolutional neural network to fill in the missing region directly. However, because of the limited receptive field of the convolutional network, long-term temporal consistency is often not well maintained, and the results frequently contain blurry or implausible content. In addition, some optical-flow-based methods (e.g., reference [3]) first estimate the optical flow of the video, then complete the flow with a convolutional neural network, and finally propagate known content along the completed flow to fill the missing parts of the original video. However, flow-based methods maintain good consistency only between nearby frames; consistency between distant frames is weaker.
In summary, when the background of the video to be processed is complex and the region to be repaired is large, existing video repair methods generally perform poorly and tend to produce blurry results; moreover, the convolutional networks of existing methods cannot adequately capture long-term temporal consistency. Traditional matching-based methods suffer from heavy computation and depend on the assumption that adjacent frames are similar. These methods therefore struggle to complete a missing video sequence well.
[1] Ya-Liang Chang, Zhe Yu Liu, Kuan-Ying Lee, and Winston Hsu. Free-form video inpainting with 3D gated convolution and temporal PatchGAN. In ICCV, 2019.
[2] Seoung Wug Oh, Sungho Lee, Joon-Young Lee, and Seon Joo Kim. Onion-peel networks for deep video completion. In ICCV, 2019.
[3] Rui Xu, Xiaoxiao Li, Bolei Zhou, and Chen Change Loy. Deep flow-guided video inpainting. In CVPR, 2019.
Disclosure of Invention
In order to solve the above problems, the invention provides a video completion method that can effectively complete videos with large missing regions and complex backgrounds, adopting the following technical scheme:
the invention provides a video completion method which is characterized in that a missing video sequence with content missing is completed by reconstructing a three-dimensional scene, and the method comprises the following steps: step 1, identifying a depth map of each frame in a missing video sequence through a preset depth map network; step 2, identifying the relative camera pose between every two adjacent frames in the missing video sequence through a preset pose network; step 3, fusing the depth maps based on the depth maps and the relative camera pose by using a truncated symbol distance function so as to construct a three-dimensional scene corresponding to the video background in the missing video sequence; step 4, projecting the three-dimensional scene to the missing video sequence by using the relative camera pose and the camera internal parameters so as to complete the defective area of each frame to obtain a complete video sequence; and 5, performing secondary completion on the completed video sequence by using a preset missing completion network so as to form a complete video sequence without content missing.
The video completion method provided by the invention may further have the technical feature that step 3 comprises the following substeps: step 3-1, preprocessing the depth maps by filtering outliers using the median and standard deviation statistics to obtain preprocessed depth maps; and step 3-2, performing three-dimensional reconstruction with the truncated signed distance function according to the preprocessed depth maps and the relative camera poses.
The video completion method provided by the invention may further have the technical feature that the truncated signed distance function uses the camera coordinate system of the first frame of the missing video sequence as the world coordinate system.
The video completion method provided by the invention may further have the technical feature that, when the three-dimensional scene is projected into the missing video sequence in step 4, the points of the three-dimensional scene are projected into each frame of the missing video sequence, and two 2 × 2 max-pooling operations are applied to each frame to increase the number of valid pixels, thereby completing the fill-in.
The video completion method provided by the invention may further have the technical feature that the depth-map network and the pose network are jointly trained in advance by minimizing a projection loss function and a smoothness loss function, the joint training being unsupervised.
The invention also provides a video completion device that completes a missing video sequence with missing content by reconstructing a three-dimensional scene, comprising: a depth-map identification part, storing a preset depth-map network, for identifying the depth map of each frame in the missing video sequence through the depth-map network; a camera-pose identification part, storing a preset pose network, for identifying the relative camera pose between every two adjacent frames in the missing video sequence through the pose network; a three-dimensional scene construction part, for fusing the depth maps with a truncated signed distance function, based on the depth maps and the relative camera poses, so as to construct a three-dimensional scene corresponding to the video background of the missing video sequence; a video completion part, for projecting the three-dimensional scene into the missing video sequence using the relative camera poses and the camera intrinsics, so as to fill the defective region of each frame and obtain a completed video sequence; and a video secondary-completion part, storing a preset missing-region completion network, for performing secondary completion on the completed video sequence so as to form a complete video sequence without missing content.
Action and Effect of the invention
According to the video completion method, the depth map of each frame in the missing video sequence and the camera pose between adjacent frames are estimated, the three-dimensional background of the video is reconstructed, the reconstructed scene is projected back onto the original image planes to obtain a preliminary repair result, and the remaining small regions are then completed. The method can therefore plausibly complete content that never appears in the original video, accurately recover occluded regions, and maintain long-term temporal consistency throughout the missing video sequence. For repair tasks with large missing regions and complex backgrounds, the method produces completed videos of good quality and exhibits good robustness and generality.
Drawings
FIG. 1 is a flow chart of a video completion method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a video completion method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating the effect of a video completion method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the test results of the video completion method of the present invention on a DAVIS 2016 dataset;
FIG. 5 is a diagram illustrating the test results of the video completion method on the KITTI data set according to the embodiment of the present invention;
FIG. 6 is a comparison between the video completion method and the references [1] and [2] according to the embodiment of the present invention; and
fig. 7 is a block diagram of a video completion apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical means, features, objectives and effects of the invention easy to understand, the video completion method of the invention is described below with reference to the embodiments and the accompanying drawings.
< example >
In this embodiment, the video completion method is implemented on a computer: the method is written as a computer program, and a user can import a missing video sequence into the computer and run the program, whereupon the computer processes the missing video sequence according to the video completion method.
Fig. 1 is a flowchart of a video completion method according to an embodiment of the present invention.
As shown in fig. 1, the video completion method includes the following steps:
Step 1: identify the depth map of each frame of video image in the missing video sequence using the depth-map network, and then proceed to step 2.

Step 2: identify the relative camera pose between every two adjacent frames of video images in the missing video sequence using the pose network, and then proceed to step 3.
In this embodiment, the depth-map network and the pose network used in steps 1 and 2 are trained in advance. Specifically, the DAVIS 2016 and KITTI datasets are used as training sets; only the RGB video sequences of the datasets are used, no labels are needed, and the frame size is set to 240 × 424 (DAVIS 2016) or 256 × 832 (KITTI).
At the start of training, the depth-map network and the pose network are initialized with weights pre-trained on KITTI, and then optimized on the training set with the Adam optimizer, using coefficients β = (0.9, 0.999) and a learning rate of 1e-4.
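The optimizer configuration described above can be sketched as follows. The network modules here are stand-in placeholders, not the actual depth-map and pose architectures of the embodiment; only the stated Adam hyperparameters (β = (0.9, 0.999), learning rate 1e-4) are taken from the text.

```python
import torch

# Stand-in modules; the embodiment's actual depth-map and pose networks
# are not specified here, so simple layers act as placeholders.
depth_net = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
pose_net = torch.nn.Linear(6, 6)

# Joint optimization of both networks with the hyperparameters stated
# in the text: Adam with beta = (0.9, 0.999) and learning rate 1e-4.
optimizer = torch.optim.Adam(
    list(depth_net.parameters()) + list(pose_net.parameters()),
    lr=1e-4,
    betas=(0.9, 0.999),
)
```

Passing the concatenated parameter lists of both networks to a single optimizer is what makes the training joint: one backward pass through the combined loss updates both networks together.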
The training process requires the camera intrinsics of the videos. Intrinsics typical of a consumer camera are used, consistent with those used in the following steps; the specific values chosen do not affect the final completion result.
Step 3: construct the three-dimensional scene of the video background, and then proceed to step 4. In this embodiment, step 3 specifically includes the following substeps:
Step 3-1: preprocess the depth maps, filtering outliers using the median and standard deviation statistics to obtain preprocessed depth maps.
In this embodiment, the median and the standard deviation are computed over the depth values of all pixels of the whole depth map, and depth values that differ from the median by more than 3 standard deviations are filtered out.
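As an illustration, the 3-standard-deviation filter described above can be sketched in a few lines of NumPy. The convention of marking filtered pixels with a depth of 0 is an assumption for the sketch, not something the text specifies.

```python
import numpy as np

def filter_depth_outliers(depth, num_std=3.0):
    """Filter depth-map outliers: values farther than num_std standard
    deviations from the median are marked invalid (set to 0 here)."""
    positive = depth[depth > 0]            # statistics over valid depths
    med = np.median(positive)
    std = np.std(positive)
    out = depth.copy()
    out[np.abs(depth - med) > num_std * std] = 0.0
    return out
```

A usage example: a depth map of mostly 1.0 m values with a single spurious 100.0 m reading keeps the inliers and zeros out the outlier.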
Step 3-2: based on the depth maps and the relative camera poses, fuse the depth maps with a truncated signed distance function (TSDF) to construct a three-dimensional scene corresponding to the video background of the missing video sequence.

In this embodiment, the reconstructed three-dimensional scene is represented by a truncated signed distance function. The TSDF uses the camera coordinate system of the first frame of the missing video sequence as the world coordinate system, where the camera coordinate system is determined by the camera intrinsics. The three-dimensional resolution of the TSDF volume is adjusted dynamically with the depth maps and camera poses; on average, each of the x, y and z dimensions contains about 500 voxels.
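A minimal sketch of a single TSDF integration step is given below, for the case where the camera frame of the frame being fused coincides with the world frame (as for the first frame above). The function name, the truncation distance, and the flat voxel-array layout are illustrative assumptions; a production system would use a full volumetric pipeline over many posed frames.

```python
import numpy as np

def tsdf_update(tsdf, weights, voxel_centers, depth, K, trunc=0.05):
    """One TSDF integration step for a depth map whose camera frame
    coincides with the world frame. voxel_centers is (N, 3); tsdf and
    weights are (N,) arrays carrying the running fusion state."""
    z = voxel_centers[:, 2]
    # Pinhole projection of each voxel center with intrinsics K.
    u = np.round(K[0, 0] * voxel_centers[:, 0] / z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * voxel_centers[:, 1] / z + K[1, 2]).astype(int)
    h, w = depth.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    # Signed distance from voxel to observed surface, scaled and
    # truncated to [-1, 1].
    d = depth[v[valid], u[valid]] - z[valid]
    sdf = np.clip(d / trunc, -1.0, 1.0)
    keep = d > -trunc                  # skip voxels far behind the surface
    idx = np.where(valid)[0][keep]
    # Standard weighted running-average TSDF fusion rule.
    tsdf[idx] = (tsdf[idx] * weights[idx] + sdf[keep]) / (weights[idx] + 1.0)
    weights[idx] += 1.0
    return tsdf, weights
```

Voxels in front of the observed surface receive positive values, voxels just behind it negative values, and voxels far behind the surface are left untouched; the zero level set of the fused volume is the reconstructed background geometry.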
Step 4: project the three-dimensional scene into the missing video sequence using the relative camera poses and the camera intrinsics, fill the defective region of each frame to obtain a completed video sequence, and then proceed to step 5.
In step 4 of this embodiment, Marching Cubes is used to extract an RGB point cloud from the reconstructed scene, and the point cloud is then re-projected back into each frame of video image of the missing video sequence using the camera intrinsics and the relative camera poses to fill the missing regions. After projection, two 2 × 2 max-pooling operations are applied to increase the number of valid pixels in each frame.
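The hole-filling effect of the two pooling passes can be illustrated with the sketch below. The text does not state the stride or padding of the 2 × 2 pooling, so this sketch assumes stride 1 with zero padding, which dilates valid (non-zero) projected pixels into neighbouring empty pixels; invalid pixels are assumed to be marked 0.

```python
import numpy as np

def maxpool_fill(img):
    """One 2x2 max-pooling pass (stride 1, zero padding on the far
    edges) that spreads valid (non-zero) pixels into adjacent holes."""
    h, w = img.shape
    padded = np.zeros((h + 1, w + 1), dtype=img.dtype)
    padded[:h, :w] = img
    # Max over each 2x2 window anchored at (i, j).
    stacked = np.stack([padded[:h, :w], padded[1:, :w],
                        padded[:h, 1:], padded[1:, 1:]])
    return stacked.max(axis=0)

# Applying the pass twice, as in the method, grows the set of valid pixels.
projected = np.zeros((4, 4))
projected[2, 2] = 5.0            # a single valid projected pixel
filled = maxpool_fill(maxpool_fill(projected))
```

Each pass lets an empty pixel inherit the value of a valid neighbour within its 2 × 2 window, so two passes densify the sparse re-projection before the secondary completion network handles whatever remains.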
Step 5: perform secondary completion on the completed video sequence with the preset missing-region completion network, so as to form a complete video sequence without missing content.
In this embodiment, the missing-region completion network is an existing video completion network (for example, the network of reference [2] in the background art), which performs the final completion of the pixel regions of each frame that remain unfilled after step 4.
An effect diagram of the video completion method of this embodiment is shown in fig. 3.
After running the video completion method, this embodiment further evaluates the final complete video sequence on test data. The test data are sequences from the DAVIS 2016 and KITTI datasets that were not used in training, and the mask of an object in the video is used as the region to be completed. For DAVIS 2016 the provided labels are used directly; for KITTI, Mask R-CNN is used to extract object masks.
After testing, the inputs, outputs and intermediate results of the method on the DAVIS 2016 and KITTI datasets are shown in fig. 4 and fig. 5, respectively, which show that videos with large missing areas can be completed well.
Further comparison with previous methods shows that the proposed method achieves a better completion effect than existing methods, and is particularly effective for complex backgrounds and large missing areas. Specifically, a comparison of the completion results of the proposed method and references [1] and [2] is shown in fig. 6: the results of [1] and [2] still exhibit large unfilled areas and residual shadows after completion, while the video frames repaired by the proposed method are of clearly better quality.
As described above, this embodiment provides a video completion method comprising steps 1 to 5. In practical applications, these steps can be packaged into a device that directly completes a missing video sequence; this embodiment therefore also provides a video completion device.
Fig. 7 is a block diagram of a video completion apparatus according to an embodiment of the present invention.
As shown in fig. 7, the video completion device 100 includes a depth-map identification unit 11, a camera-pose identification unit 12, a three-dimensional scene construction unit 13, a video completion unit 14, a video secondary-completion unit 15, a device communication unit 16, and a device control unit 17 that controls the above units.
The device communication unit 16 is used for data communication between the respective components of the video complementing device 100 and between the video complementing device 100 and another device or system. The device control unit 17 stores a computer program for controlling each component of the video complementing device 100.
The depth-map identification unit 11 is used to identify the depth map of each frame of the missing video sequence.

The camera-pose identification unit 12 is used to identify the relative camera pose between every two adjacent frames in the missing video sequence.

In this embodiment, the depth-map network and the pose network trained in steps 1 and 2 operate in the device as the depth-map identification unit 11 and the camera-pose identification unit 12, respectively, to perform the corresponding identification operations.
The three-dimensional scene construction unit 13 is used to fuse the depth maps with a truncated signed distance function, based on the depth maps and the relative camera poses, so as to construct a three-dimensional scene corresponding to the video background of the missing video sequence.
The video completion unit 14 is used to project the three-dimensional scene into the missing video sequence using the relative camera poses and the camera intrinsics, so as to fill the defective region of each frame and obtain a completed video sequence.

The video secondary-completion unit 15 is used to perform secondary completion on the completed video sequence to form a complete video sequence without missing content.
In this embodiment, the three-dimensional scene construction unit 13, the video completion unit 14, and the video secondary-completion unit 15 correspond to steps 3 to 5 of the video completion method, respectively; their working principles are consistent with the descriptions of the corresponding steps and are not repeated here.
Effects and effects of the embodiments
According to the video completion method provided by this embodiment, the depth map of each frame in the missing video sequence and the camera pose between adjacent frames are estimated, the three-dimensional background of the video is reconstructed, the reconstructed scene is projected back onto the original image planes to obtain a preliminary repair result, and the remaining small regions are then completed. The method can therefore plausibly complete content that never appears in the original video, accurately recover occluded regions, and maintain long-term temporal consistency throughout the missing video sequence. For repair tasks with large missing regions and complex backgrounds, the method produces completed videos of good quality and exhibits good robustness and generality.
The above-described embodiments are merely illustrative of specific embodiments of the present invention, and the present invention is not limited to the description of the above-described embodiments.
Claims (5)
1. A video completion method is characterized in that a missing video sequence with content missing is completed by reconstructing a three-dimensional scene, and comprises the following steps:
step 1, identifying the missing video sequence through a preset depth map network to obtain a depth map of each frame of missing video image;
step 2, identifying the relative camera pose between every two adjacent frames in the missing video sequence through a preset pose network;
step 3, based on the depth map and the relative camera pose, fusing the depth maps with a truncated signed distance function so as to construct a three-dimensional scene corresponding to the video background of the missing video sequence;
step 4, projecting the three-dimensional scene to the missing video sequence by using the relative camera pose and camera internal parameters so as to complete the defective area of each frame to obtain a complete video sequence;
step 5, performing secondary completion on the completed video sequence by utilizing a preset missing completion network to form a complete video sequence without content missing, wherein the missing completion network is a conventional video completion network,
wherein step 3 comprises the following substeps:
step 3-1, preprocessing the depth map, and filtering outliers in the depth map by using median and standard deviation statistics to obtain a preprocessed depth map;
step 3-2, performing three-dimensional reconstruction through the truncated signed distance function according to the preprocessed depth map and the relative camera pose,

wherein the reconstructed three-dimensional scene is represented using the truncated signed distance function, with a scene three-dimensional resolution that is dynamically adjusted with the depth map and the relative camera pose.
2. The video completion method according to claim 1, wherein:
wherein the truncated signed distance function uses the camera coordinate system corresponding to the first frame in the missing video sequence as the world coordinate system.
3. The video completion method according to claim 1, wherein:
wherein, when the three-dimensional scene is projected into the missing video sequence in step 4, the points of the three-dimensional scene are projected into each frame of the missing video sequence, and two 2 × 2 max-pooling operations are applied to each frame to increase the number of valid pixels, thereby completing the fill-in.
4. The video completion method according to claim 1, wherein:
wherein the depth map network and the pose network complete joint training in advance through a minimum projection loss function and a smooth loss function,
the joint training adopts an unsupervised training method.
5. A video completion apparatus for completing a missing video sequence with a missing content by reconstructing a three-dimensional scene, comprising:
the depth map identification part is used for identifying the missing video sequence through the depth map network to obtain a depth map of each frame of missing video image;
the camera pose recognition part is used for recognizing the relative camera pose between every two adjacent frames in the missing video sequence through the pose network;
a three-dimensional scene construction part, for fusing the depth maps with a truncated signed distance function, based on the depth map and the relative camera pose, so as to construct a three-dimensional scene corresponding to the video background of the missing video sequence;
the video complementing device is used for projecting the three-dimensional scene into the missing video sequence by utilizing the relative camera pose and the camera internal parameters so as to complement the defective area of each frame to obtain a complemented video sequence;
the video secondary complementing network is stored for secondarily complementing the complementing video sequence through the missing complementing network so as to form a complete video sequence without content missing, wherein the missing complementing network is a conventional video complementing network,
wherein the three-dimensional scene constructing section constructs the three-dimensional scene by:
preprocessing the depth map, and filtering outliers in the depth map by using median and standard deviation statistics to obtain a preprocessed depth map;
performing three-dimensional reconstruction through the truncated signed distance function according to the preprocessed depth map and the relative camera pose,

wherein the reconstructed three-dimensional scene is represented using the truncated signed distance function, the three-dimensional resolution of the scene used by the truncated signed distance function being dynamically adjusted with the depth map and the relative camera pose.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202010066844.1A (CN113139910B) | 2020-01-20 | 2020-01-20 | Video completion method |
Publications (2)
| Publication Number | Publication Date |
| --- | --- |
| CN113139910A | 2021-07-20 |
| CN113139910B | 2022-10-18 |
Family
ID=76809793
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
| --- | --- | --- | --- |
| CN202010066844.1A | CN113139910B (Active) | 2020-01-20 | 2020-01-20 |
Country Status (1)
| Country | Link |
| --- | --- |
| CN (1) | CN113139910B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| CN116402953B * | 2023-04-26 | 2024-04-19 | 华中科技大学 | Wave surface reconstruction method and device based on binocular data on floating platform |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107292965A (en) * | 2017-08-03 | 2017-10-24 | Qingdao Research Institute of Beihang University | A mutual occlusion processing method based on a depth image data stream |
WO2017201751A1 (en) * | 2016-05-27 | 2017-11-30 | Peking University Shenzhen Graduate School | Hole filling method and device for virtual viewpoint video or image, and terminal |
CN108765481A (en) * | 2018-05-25 | 2018-11-06 | HiScene (Shanghai) Information Technology Co., Ltd. | Monocular video depth estimation method, device, terminal and storage medium |
CN109903372A (en) * | 2019-01-28 | 2019-06-18 | Institute of Automation, Chinese Academy of Sciences | Depth map super-resolution completion method and high-quality three-dimensional reconstruction method and system |
US20190244414A1 (en) * | 2016-07-13 | 2019-08-08 | Naked Labs Austria Gmbh | Efficient Volumetric Reconstruction with Depth Sensors |
CN110490928A (en) * | 2019-07-05 | 2019-11-22 | Tianjin University | A camera pose estimation method based on a deep neural network |
Non-Patent Citations (3)
Title |
---|
"Inpainting for videos with dynamic objects using texture and structure reconstruction"; V.V. Voronin et al.; Proceedings of the SPIE; 2015-12-31; pp. 94970Y-1 to 94970Y-9 * |
"A 3D modeling method using corner features to optimize Kinect Fusion"; Meng Tengteng et al.; Journal of Jiangsu University of Science and Technology (Natural Science Edition); 2018-04-30; Vol. 32, No. 2; pp. 252-256 * |
"Performance evaluation of traffic video background reconstruction based on low-rank matrix recovery"; Chen Chuan et al.; Computer Engineering and Design; 2017-05-31; Vol. 38, No. 5; pp. 1301-1307 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Gur et al. | Single image depth estimation trained via depth from defocus cues | |
US9609307B1 (en) | Method of converting 2D video to 3D video using machine learning | |
CN109785236B (en) | Image super-resolution method based on super-pixel and convolutional neural network | |
Vitoria et al. | Semantic image inpainting through improved wasserstein generative adversarial networks | |
CN105469375A (en) | Method and device for processing high dynamic range panorama | |
CN112184585A (en) | Image completion method and system based on semantic edge fusion | |
CN116997933A (en) | Method and system for constructing facial position map | |
CN114463237B (en) | Real-time video rain removing method based on global motion compensation and inter-frame time domain correlation | |
CN113139910B (en) | Video completion method | |
CN114419102B (en) | Multi-target tracking detection method based on frame difference time sequence motion information | |
CN106251348A (en) | A kind of self adaptation multi thread towards depth camera merges background subtraction method | |
Chen et al. | Flow supervised neural radiance fields for static-dynamic decomposition | |
CN110188640B (en) | Face recognition method, face recognition device, server and computer readable medium | |
Zhang et al. | Semantic prior guided face inpainting | |
Arora et al. | Augmentation of Images through DCGANs | |
CN116843893A (en) | Three-dimensional image segmentation method and system based on attention mechanism multi-scale convolutional neural network | |
Zhang et al. | Single image dehazing via reinforcement learning | |
US20230131418A1 (en) | Two-dimensional (2d) feature database generation | |
Gsaxner et al. | DeepDR: Deep Structure-Aware RGB-D Inpainting for Diminished Reality | |
Fan et al. | Collaborative three-dimensional completion of color and depth in a specified area with superpixels | |
Ngo et al. | Singe Image Dehazing With Unsharp Masking and Color Gamut Expansion | |
CN117893952B (en) | Video mosaic defect detection method based on deep learning | |
Zhang et al. | Low-light Image Enhancement with Domain Adaptation | |
Yang et al. | A novel stereo image self-inpainting network for autonomous robots | |
CN114818992B (en) | Image data analysis method, scene estimation method and 3D fusion method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||