CN112019828B - Method for converting 2D (two-dimensional) video into 3D video - Google Patents

Method for converting 2D (two-dimensional) video into 3D video

Info

Publication number
CN112019828B
Authority
CN
China
Prior art keywords
image
model
depth
data set
depth estimation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010819481.4A
Other languages
Chinese (zh)
Other versions
CN112019828A (en)
Inventor
唐杰
李进
李庆瑜
戴立言
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI WONDERTEK SOFTWARE CO Ltd
Original Assignee
SHANGHAI WONDERTEK SOFTWARE CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI WONDERTEK SOFTWARE CO Ltd filed Critical SHANGHAI WONDERTEK SOFTWARE CO Ltd
Priority to CN202010819481.4A priority Critical patent/CN112019828B/en
Publication of CN112019828A publication Critical patent/CN112019828A/en
Application granted granted Critical
Publication of CN112019828B publication Critical patent/CN112019828B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 - Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 - Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106 - Processing image signals
    • H04N 13/111 - Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 - Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 - Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106 - Processing image signals
    • H04N 13/128 - Adjusting depth or disparity

Abstract

The invention relates to the technical field of video dimension conversion and provides a method for converting 2D (two-dimensional) video into 3D video, which comprises the following steps. S1: collecting and expanding an open-source RGB-D image data set to form a depth estimation data set, and constructing and training a depth estimation model with it. S2: collecting 4K high-definition pictures to make an image restoration data set, expanding the data set, and constructing and training an image restoration model with it. S3: extracting the original image mask with a pre-trained Mask-RCNN model, adjusting the resolution of the original image and the mask, sending them into the depth estimation model, calculating the original left and right projection images from the depth map, and sending the left and right projection images into the image restoration model to restore the black hole areas. By combining a deep learning algorithm with the traditional algorithm and using deep learning models to replace the depth map estimation and black hole filling algorithms of the traditional DIBR method, 2D/3D conversion of ultra-high-resolution images is realized.

Description

Method for converting 2D (two-dimensional) video into 3D video
Technical Field
The invention relates to the technical field of video dimension conversion, in particular to a method for converting 2D (two-dimensional) video into 3D video, which involves processing techniques such as computer image processing, computer vision, deep learning and CUDA (compute unified device architecture) high-performance programming.
Background
At present, high-speed, high-capacity, low-latency 5G communication is becoming widespread, making the Internet of Everything possible. Compared with 2D video, 3D video is rich in scene depth information; the parallax-synthesized images produced by a 2D/3D conversion algorithm match the stereoscopic perception of human eyes in the real world, and an immersive experience can be obtained with VR devices.
At present, 2D/3D conversion methods fall mainly into end-to-end methods based on deep learning and traditional methods based on DIBR (depth-image-based rendering).
(1) Deep learning end-to-end method
Deep learning has achieved great success in many computer-vision fields thanks to high-performance computing platforms; in areas such as video classification, understanding and action recognition it far exceeds traditional methods. However, deep learning still falls short of the human visual system in versatility, flexibility and adaptability, and "end-to-end" deep learning solutions can run into fundamental difficulties on complex visual tasks.
The deep-learning-based method relies on massive 3D movie data and performs stereoscopic conversion directly on the original image by building an end-to-end model architecture: inputting a left view yields the right view. This end-to-end architecture simplifies the stereoscopic conversion process, but because it lacks depth estimation, the image depth can contradict visual common sense in complex scenes. Experiments show that the model's processing speed is very slow for high-definition video at 2K resolution and above, and in fast-moving scenes the temporal continuity of the synthesized video degrades severely, causing jitter of the picture and subtitles.
(2) Conventional method of DIBR
The traditional DIBR-based method mainly comprises the following steps: depth map estimation, depth map preprocessing, three-dimensional image transformation, black hole area filling, and synthesized-image post-processing. This method processes high-resolution RGB-D video quickly. However, experiments show that the low-quality depth information provided by a camera cannot be aligned with the contour edges of the source image, which often causes artifacts, irregular object deformation and large black hole areas in the synthesized image after the three-dimensional projection transformation.
Disclosure of Invention
In view of the above problems, the invention aims to provide a method for converting 2D video into 3D video that combines a deep learning algorithm with a traditional algorithm and uses deep learning models to replace the depth map estimation and black hole filling algorithms of the traditional DIBR method. Based on the open-source Tensorflow framework, parallel computation is accelerated with a CUDA engine, and 2D/3D conversion of ultra-high-resolution images is realized. The 2D/3D conversion method solves the picture and subtitle jitter and the object deformation and distortion caused by inaccurate depth estimation in complex scenes, solves the problem that large black hole areas formed by source-image warping cannot be filled, and solves the problem of slow stereoscopic conversion of images at 2K resolution and above.
The above object of the present invention is achieved by the following technical solutions:
a method of 2D to 3D conversion of video, comprising the steps of:
s1: collecting an open-source RGB-D image data set, expanding the open-source RGB-D image data set to form a depth estimation data set, constructing a depth estimation model through the depth estimation data set, and training the depth estimation model;
s2: collecting a 4K high-definition picture to manufacture an image restoration data set, expanding the image restoration data set, constructing an image restoration model through the image restoration data set, and training the image restoration model;
s3: and extracting an original image Mask by using a pre-training Mask-RCNN model, adjusting the resolution of the original image and the Mask, sending the original image and the Mask into the depth estimation model, calculating original left and right projection images according to the depth image, and sending the left and right projection images into the image restoration model respectively to restore the black hole area.
Further, in step S1, collecting an open-source RGB-D image data set and expanding the open-source RGB-D image data set to form a depth estimation data set, specifically:
collecting the open-source RGB-D image data set, taking the RGB image in the RGB-D data as the model input and the depth image D as the depth annotation image;
expanding the data set with data enhancement techniques including random left-right flipping, random color shift and added camera noise;
dividing the enhanced depth estimation data set to form different data sets including a depth training set, a depth verification set and a depth test set, wherein the depth training set is used for adjusting model parameters of the depth estimation model to obtain a local optimal solution, the depth verification set is used for verifying whether the model parameters of the depth estimation model are reliable in a training process, and the depth test set is used for evaluating the generalization performance of the depth estimation model.
Further, in step S1, the depth estimation model is constructed by the depth estimation data set, specifically:
the depth estimation model is used for learning the mapping from the monocular RGB image to the depth image and realizing the depth estimation of the 2D video;
the depth estimation model uses an encoder-decoder basic framework; skip connections are used between the encoder and decoder, and the corresponding feature maps are concatenated along the channel direction to obtain a more detailed depth estimate;
through the encoder-decoder infrastructure, an input image is mapped into a single-channel depth image in which pixel values represent object depth.
Further, in step S1, the depth estimation model is trained, specifically:
extracting a single-channel segmentation Mask of the input image by using a pre-training Mask-RCNN model, adjusting the resolution of the input image and the single-channel segmentation Mask and connecting the input image and the single-channel segmentation Mask according to a channel direction;
inputting the input image into the depth estimation model, outputting the single-channel depth image, and calculating loss values for the single-channel depth image and the depth annotation image through a loss function including L2 distance loss and TV total variation loss;
and testing the precision of the depth estimation model by using the depth verification set, and testing the performance of the depth estimation model by using the depth test set.
Further, in step S2, a 4K high definition picture is collected to produce an image restoration data set, and the image restoration data set is expanded, specifically:
generating black curves with different positions, different shapes and different thicknesses on the 4K high-definition picture by using an open source library OpenCV so as to simulate a black hole area caused by three-dimensional image transformation;
expanding the image restoration data set by adopting a data enhancement technology including random left-right turning, random color variation and camera noise increase;
And dividing the image restoration data set to form different data sets including a restoration training set, a restoration verification set and a restoration testing set.
Further, in step S2, an image restoration model is constructed by the image restoration data set, specifically:
the image restoration model uses the infrastructure of an encoder-decoder, uses a skip connection between the encoder and the decoder, uses an activation function Tanh at the last layer of the decoder, and maps a feature map into an output image of three channels.
Further, in step S3, the image inpainting model is trained, specifically:
inputting an input image into the image restoration model, and outputting an output image of the three channels;
calculating reconstruction loss for the output image and the original map of the three channels using a loss function including L2 distance loss;
and testing the model precision by using the repair verification set, and testing the performance of the model by using the repair test set.
Further, in step S3, extracting an original image Mask using the pre-trained Mask-RCNN model, adjusting resolutions of the original image and the Mask, and sending the original image and the Mask to the depth estimation model, specifically:
and sending the RGB image to be predicted into the pre-trained Mask-RCNN model to obtain an instance segmentation mask, concatenating the predicted image and the mask along the channel direction, and inputting them into the depth estimation model to obtain a predicted depth map.
Further, in step S3, the original left and right projection maps are calculated according to the depth map, specifically:
for the predicted depth map, combining the distance f from the human eyes to the focal plane, the interocular distance t_c and the visual field difference d, the 2D image is orthographically projected to the positions of the two eye viewpoints, and the projection templates are initialized with a left projection template LD = 0 and a right projection template RD = 0;
the 2D image pixel coordinate offset Δx is computed from the depth map and these parameters (the exact expression is given in the patent only as a formula image);
the left projection LD(x, y) = I(x, y) + Δx and the right projection RD(x, y) = I(x, y) - Δx are obtained.
Further, in step S3, the left and right projection views are respectively sent to the image restoration model to restore the black hole area, specifically:
the left and right projection maps LD and RD are respectively input into the image restoration model to obtain the restored maps LDI and RDI, and the black hole areas in the projection maps are filled with the pixel values of the restored images, where:
LD(x, y) = LDI(x, y) where LD(x, y) = 0 (black hole area), otherwise LD(x, y) is kept;
RD(x, y) = RDI(x, y) where RD(x, y) = 0 (black hole area), otherwise RD(x, y) is kept.
compared with the prior art, the invention has at least one of the following beneficial effects:
(1) the invention fully considers the respective advantages of the deep learning algorithm and the traditional DIBR algorithm, and replaces the steps of depth map acquisition, depth map preprocessing and black hole filling in the traditional DIBR algorithm by using the deep learning restoration model and the depth estimation model, thereby simplifying the processing flow of the traditional algorithm.
(2) The invention packages the 2D/3D algorithm on the Tensorflow framework and uses Tensorflow to parallelize low-level image operations such as traversal and lookup, so that the 2D/3D algorithm runs efficiently in a GPU environment.
(3) By exploiting the high generalization and strong fitting ability of deep learning models, the image restoration model and depth estimation model constructed by the method render better results than the traditional algorithm; specifically, the method reduces object deformation and image jitter caused by inaccurate depth estimation and improves the detail texture of the repaired black hole regions.
Drawings
FIG. 1 is a general flow chart of a method for converting 2D video into 3D video according to the present invention;
FIG. 2 is a schematic diagram of the present invention in which an original image and a mask are sent into the depth estimation model to generate a depth image and the left and right projection images;
FIG. 3 is a schematic diagram of the present invention repairing a depth image containing a black hole into a complete image;
FIG. 4 is a schematic diagram of the deep network model inference process of the present invention;
FIG. 5 is a schematic diagram of a composite left and right view of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
To overcome the defects of the prior art, the invention fully combines the respective advantages of the traditional DIBR algorithm and the deep-learning-based algorithm and provides an efficient and stable three-dimensional image transformation technique: a 2D/3D conversion algorithm for images with a resolution of 2K and above.
The core of the invention is a 2D/3D conversion technique for ultra-high-definition video that addresses the problems of existing conversion algorithms: inaccurate depth estimation of the source image, difficulty in handling large black areas, and temporal and spatial discontinuities such as jitter in the synthesized video frames.
The present invention will be described in detail below for the purpose of making the object and technical solution of the present invention clearer.
In this case, 4K high definition video is used, and the technology of the present invention is used to realize 2D/3D three-dimensional conversion.
The method comprises the following preparation steps: constructing a depth estimation data set; building and training the depth estimation model; constructing an image restoration data set; building and training the restoration model.
the method comprises the following 2D/3D conversion steps: extracting an original image Mask by using a pre-training Mask-RCNN model; adjusting the resolution of an original image and a mask and sending the resolution into a trained depth estimation model; calculating original left and right projection images according to the depth image; and respectively sending the left projection drawing and the right projection drawing into the trained image repairing model to repair the black hole area.
Examples
As shown in FIG. 1, this embodiment provides a specific method for converting video from 2D to 3D. Deep learning models replace the depth map estimation and black hole filling algorithms of the conventional DIBR method, and the high generalization and high accuracy of the deep learning models make up for the shortcomings of the conventional algorithm. On the basis of the Tensorflow framework, an efficient, high-quality conversion technique is realized.
The invention is a 2D/3D conversion method for resolutions of 2K and above. Based on the open-source Tensorflow framework, a CUDA engine accelerates parallel computation to realize 2D/3D conversion of ultra-high-resolution images: the original image and mask are sent into the depth estimation model to generate the left and right projection images as shown in FIG. 2, an image containing black holes is restored into a complete image as shown in FIG. 3, the deep network model inference flow is shown in FIG. 4, and the left and right views are synthesized as shown in FIG. 5. The method specifically comprises the following steps:
(1) Producing a depth estimation dataset
For an open-source RGB-D image data set, the RGB image is the model input and the depth image D is the annotation image. For example, the invention may use data sets such as NYU Depth Dataset V2 and MegaDepth; these data sets only illustrate the invention and do not limit it.
To further improve the generalization and noise resistance of the model, the experiment expands the data set with data enhancement techniques including random flipping, random color shift and added camera noise; random Gaussian noise and salt-and-pepper noise are used to simulate camera noise. The enhanced data set is divided into a training set, a validation set and a test set in a 7:2:1 ratio. The training set is used to adjust the model parameters toward a local optimum, the validation set is used to check during training whether the model parameters are reliable, and the test set is used to evaluate the generalization performance of the model.
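As an illustration of the augmentation and split described above, the following Python sketch applies a random left-right flip, a random color shift, Gaussian and salt-and-pepper noise, and a 7:2:1 split; the function names, value ranges and noise levels are assumptions, not taken from the patent.

```python
import numpy as np

def augment_pair(rgb, depth, rng=np.random):
    """Augment one RGB image and its depth annotation together."""
    if rng.rand() < 0.5:                                # random left-right flip
        rgb, depth = rgb[:, ::-1], depth[:, ::-1]
    rgb = rgb.astype(np.float32)
    rgb += rng.uniform(-20, 20, size=(1, 1, 3))         # random color shift
    rgb += rng.normal(0, 5, size=rgb.shape)             # Gaussian camera noise
    sp = rng.rand(*rgb.shape[:2])                       # salt-and-pepper noise
    rgb[sp < 0.005] = 0.0
    rgb[sp > 0.995] = 255.0
    return np.clip(rgb, 0, 255).astype(np.uint8), depth

def split_dataset(n_samples, rng=np.random):
    """Return index arrays for a 7:2:1 train/validation/test split."""
    idx = rng.permutation(n_samples)
    n_train, n_val = int(0.7 * n_samples), int(0.2 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```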
(2) Constructing the depth estimation model Model_depth
The depth estimation model is built on a basic model such as the open-source U-Net; it learns the mapping from a monocular RGB image to a depth image and realizes depth estimation for 2D video. For the depth estimation task, the model uses an encoder-decoder infrastructure. All convolution kernels in the encoder are 3 × 3, the convolution padding parameter is set to 1, and the sliding stride is set to 1 so that the feature-map size is unchanged after convolution. The activation function is LeakyReLU. For downsampling, max pooling is used with padding 2, a 5 × 5 template and stride 2. In the decoder, the upsampling layers use bilinear interpolation to enlarge the feature-map resolution by a factor of 2, the convolution-layer parameters are the same as in the encoder, and the last layer uses the Tanh activation function. Skip connections between the encoder and decoder concatenate the corresponding feature maps along the channel direction, which yields a more refined depth estimate. Through this encoder-decoder structure, the input image is mapped into a single-channel depth image O_d, whose pixel values represent object depth.
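A minimal Keras sketch of the encoder-decoder just described is given below: 3 × 3 convolutions that keep the spatial size, LeakyReLU activations, 5 × 5 max pooling with stride 2 for downsampling, bilinear ×2 upsampling with channel-wise skip connections in the decoder, and a single-channel Tanh output. The number of levels, the channel widths and the 4-channel (RGB + mask) input are assumptions for illustration, not the patent's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # 3x3 convolution, stride 1, "same" padding keeps the feature-map size.
    x = layers.Conv2D(filters, 3, strides=1, padding="same")(x)
    return layers.LeakyReLU()(x)

def build_depth_model(input_shape=(540, 960, 4)):   # RGB image + mask channel
    inputs = layers.Input(shape=input_shape)
    x, skips = inputs, []
    for f in (32, 64):                               # encoder
        x = conv_block(x, f)
        skips.append(x)
        # 5x5 max pooling, stride 2, halves the resolution.
        x = layers.MaxPooling2D(pool_size=5, strides=2, padding="same")(x)
    x = conv_block(x, 128)                           # bottleneck
    for f, skip in zip((64, 32), reversed(skips)):   # decoder
        x = layers.UpSampling2D(size=2, interpolation="bilinear")(x)
        x = layers.Concatenate()([x, skip])          # skip connection (channels)
        x = conv_block(x, f)
    # Single-channel depth output O_d with Tanh on the last layer.
    depth = layers.Conv2D(1, 3, padding="same", activation="tanh")(x)
    return tf.keras.Model(inputs, depth)
```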
(3) Training depth estimation model
First, the batch size is set to 128 and 128 source pictures I are randomly drawn from the training set; the pre-trained Mask-RCNN model extracts a single-channel mask for each input picture I. The resolutions of the input picture I and the mask are then adjusted to 540 × 960 and the two are concatenated along the channel direction; a good instance segmentation mask helps to obtain a depth map with an accurate parallax relationship. The merged data are input into the depth estimation model, which outputs the single-channel map O_d. The loss between O_d and the labelled depth map is then computed. The loss function combines an L2 distance loss and a TV (total variation) loss: the L2 loss lets the model learn the mapping, while the TV loss keeps the predicted depth map locally continuous and thus improves the binocular visual effect. The total loss is Loss = λ1·TV + λ2·L2, with weights λ1 = 0.3 and λ2 = 0.7. The parameter-update optimizer is Adam with an initial learning rate lr = 1e-3; the model is trained for 300 epochs, the learning rate is lowered by 15% every 10 epochs, and model accuracy is tested on the validation set. If the accuracy drops 5 times in a row, training stops; otherwise training runs to the end of the schedule. After training, model performance is tested on the test set; if the score reaches the set threshold the model is kept, otherwise the model parameters are randomly re-initialized and the training steps are repeated until the score is reached.
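A hedged sketch of the loss and training schedule above: Loss = λ1·TV + λ2·L2 with λ1 = 0.3 and λ2 = 0.7, Adam with lr = 1e-3, a 15% learning-rate reduction every 10 epochs, and early stopping after 5 consecutive degradations. The use of tf.image.total_variation, the val_loss monitor and the normalization of each term are assumptions.

```python
import tensorflow as tf

LAMBDA_TV, LAMBDA_L2 = 0.3, 0.7

def depth_loss(y_true, y_pred):
    # L2 distance between the predicted depth O_d and the depth annotation.
    l2 = tf.reduce_mean(tf.square(y_true - y_pred))
    # Total-variation term keeps the predicted depth map locally continuous.
    tv = tf.reduce_mean(tf.image.total_variation(y_pred))
    return LAMBDA_TV * tv + LAMBDA_L2 * l2

# Reduce the learning rate by 15% every 10 epochs; stop after 5 degradations.
lr_schedule = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch, lr: lr * 0.85 if epoch > 0 and epoch % 10 == 0 else lr)
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)

# model = build_depth_model()   # from the architecture sketch above
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
#               loss=depth_loss)
# model.fit(train_ds, validation_data=val_ds, epochs=300,
#           callbacks=[lr_schedule, early_stop])
```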
(4) Making an image restoration dataset
Local 4K high-definition pictures are used as the training data source. To further improve the generalization and robustness of the model, this embodiment expands the data set with data enhancement techniques including random left-right flipping, random color shift and added camera noise; random Gaussian noise and salt-and-pepper noise are used to simulate camera noise. To simulate the black hole areas caused by image warping, the open-source library OpenCV is used to draw black curves of different positions, shapes and thicknesses on each image I_i. The enhanced data set is divided into a training set, a validation set and a test set in a 7:2:1 ratio.
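An illustrative OpenCV sketch of drawing black curves of random position, shape and thickness onto a frame to simulate black hole areas; the number of curves, point counts, step sizes and thickness range are assumptions.

```python
import cv2
import numpy as np

def add_black_curves(image, max_curves=5, rng=np.random):
    """Return a copy of `image` with random black polyline "holes" drawn in."""
    h, w = image.shape[:2]
    corrupted = image.copy()
    for _ in range(rng.randint(1, max_curves + 1)):
        start = np.array([rng.randint(0, w), rng.randint(0, h)])
        # A jagged curve: cumulative random steps from the start point.
        pts = start + np.cumsum(rng.randint(-80, 81, size=(8, 2)), axis=0)
        pts = np.clip(pts, [0, 0], [w - 1, h - 1]).astype(np.int32)
        thickness = rng.randint(2, 21)
        cv2.polylines(corrupted, [pts.reshape(-1, 1, 2)], isClosed=False,
                      color=(0, 0, 0), thickness=thickness)
    return corrupted
```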
(5) Constructing the image restoration model Model_inpaint
The model structure is similar to that of the depth estimation model, except that for the image restoration task the max pooling layers are replaced by average pooling layers.
The structure also adopts the open-source U-Net as the basic model: in the encoder, the padding parameter of the 3 × 3 convolutional layers is set to 1, and the downsampling layers use average pooling with stride 2; the activation function is LeakyReLU. In the decoder, the upsampling layers enlarge the feature maps by a factor of 2 with bilinear interpolation, skip connections are used between the encoder and decoder, and the last layer of the decoder uses the Tanh activation function to map the feature map into a three-channel output image O_i.
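A hedged sketch of the restoration model, mirroring the depth-model sketch above but using average pooling for downsampling and a three-channel Tanh output; the channel widths, pooling window and number of levels are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_inpaint_model(input_shape=(540, 960, 3)):
    inputs = layers.Input(shape=input_shape)
    x, skips = inputs, []
    for f in (32, 64):                                        # encoder
        x = layers.LeakyReLU()(layers.Conv2D(f, 3, padding="same")(x))
        skips.append(x)
        # Average pooling with stride 2 replaces the max pooling layer.
        x = layers.AveragePooling2D(pool_size=2, strides=2, padding="same")(x)
    x = layers.LeakyReLU()(layers.Conv2D(128, 3, padding="same")(x))
    for f, skip in zip((64, 32), reversed(skips)):            # decoder
        x = layers.UpSampling2D(size=2, interpolation="bilinear")(x)
        x = layers.Concatenate()([x, skip])                   # skip connection
        x = layers.LeakyReLU()(layers.Conv2D(f, 3, padding="same")(x))
    # Three-channel restored image O_i, Tanh on the last layer.
    out = layers.Conv2D(3, 3, padding="same", activation="tanh")(x)
    return tf.keras.Model(inputs, out)
```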
(6) Training image restoration model
First, the batch size is set to 128 and 128 source picture pairs are randomly drawn from the training set; the picture resolution is adjusted to 540 × 960 and the data are input into the image restoration model, which produces the three-channel output image O_i. The reconstruction loss between O_i and the original picture is computed using L2 and L1 distance losses. The parameter-update optimizer is Adam with an initial learning rate lr = 1e-3; the model is trained for 300 epochs, the learning rate is lowered by 15% every 15 epochs, and model accuracy is tested on the validation set. If the accuracy drops 5 times in a row, training stops; otherwise training runs to the end of the schedule. After training, model performance is tested on the test set; if the score reaches the set threshold the model is kept, otherwise the model parameters are randomly re-initialized and the training steps are repeated until the score is reached.
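A minimal sketch of the reconstruction loss for the restoration model, combining the L2 and L1 distances between the output O_i and the clean original frame; equal weights are an assumption, since the text does not give them.

```python
import tensorflow as tf

def inpaint_loss(y_true, y_pred):
    l2 = tf.reduce_mean(tf.square(y_true - y_pred))   # L2 distance loss
    l1 = tf.reduce_mean(tf.abs(y_true - y_pred))      # L1 distance loss
    return l2 + l1                                    # equal weighting assumed
```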
(7) Monocular image depth estimation
The RGB image to be predicted is sent into the pre-trained Mask-RCNN model to obtain an instance segmentation mask; the image and the mask are concatenated along the channel direction and input into the depth estimation model Model_depth to obtain the predicted depth map.
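A hedged sketch of this inference step: obtain an instance segmentation mask from the pre-trained Mask-RCNN, concatenate it with the RGB frame along the channel axis and run the depth model. Here `mask_rcnn` and `depth_model` stand for the two pre-trained models and are assumptions; their loading and the mask's value range are not specified in the patent.

```python
import cv2
import numpy as np

def predict_depth(rgb, mask_rcnn, depth_model, size=(960, 540)):
    """Return a predicted depth map for one RGB frame (H x W x 3, uint8)."""
    small = cv2.resize(rgb, size)                    # resized to 540 x 960 x 3
    mask = mask_rcnn(small)                          # assumed (540, 960, 1) mask
    x = np.concatenate([small / 255.0, mask], axis=-1)   # channel-wise concat
    x = x[np.newaxis].astype(np.float32)             # add a batch dimension
    return depth_model.predict(x)[0, ..., 0]         # (540, 960) depth map
```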
(8) Three-dimensional image conversion
In this case, CBA video with a resolution of 3840 × 2160 is used as the video to be converted, and the video stream is read with OpenCV. First, the captured source picture I is downscaled to 960 × 540, a segmentation mask is extracted with the pre-trained Mask-RCNN, and the mask and I_960×540 are concatenated along the channel direction and input into the depth estimation model of this case to obtain the depth map I_depth. The resolution of I_depth is then restored to 3840 × 1920. Combining the distance f from the human eyes to the focal plane, the interocular distance t_c and the visual field difference d, the pixel coordinate offset Δx of the source image I_pre(x, y) is calculated (the exact expression is given in the patent only as a formula image).
The projection templates are initialized as T(x, y) = 0, and the source image I is orthographically projected to the two eye viewpoint positions, giving the left projection LI(x, y) = I(x, y) + Δx and the right projection RI(x, y) = I(x, y) - Δx.
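An illustrative sketch of the forward projection: each source pixel is shifted horizontally by a disparity Δx derived from its depth and written into left and right projection templates initialized to zero, so pixels that receive no value remain black holes. Since the patent gives its Δx expression only as a formula image, the disparity used here is a generic DIBR-style approximation and is purely an assumption, as are the parameter values.

```python
import numpy as np

def project_left_right(image, depth, t_c=0.06, f=2.0, scale=30.0):
    """Forward-project `image` (H x W x 3) into left/right views using `depth` (H x W)."""
    h, w = depth.shape
    left = np.zeros_like(image)                      # projection templates = 0
    right = np.zeros_like(image)
    # Assumed disparity: zero at the focal plane, larger for nearer pixels.
    dx = np.round(scale * (t_c / 2.0) *
                  (1.0 - f / np.maximum(depth, 1e-6))).astype(np.int64)
    xs = np.arange(w)
    for y in range(h):
        xl = np.clip(xs + dx[y], 0, w - 1)           # LI(x, y) = I(x, y) + dx
        xr = np.clip(xs - dx[y], 0, w - 1)           # RI(x, y) = I(x, y) - dx
        left[y, xl] = image[y, xs]
        right[y, xr] = image[y, xs]
    return left, right
```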
(9) Black hole filling
The resolutions of the left and right projection maps LI and RI are reduced to 1920 × 1080, the maps are input into the image restoration model of the invention, and the restored maps LII and RII are output. Using the restored images directly at the original resolution would give blurred results, so in this case the restored images are first upscaled back to 3840 × 2160, giving LII_3840×2160 and RII_3840×2160, and only the black hole areas of the projection maps are filled with the pixel values of the repaired images, where:
LI(x, y) = LII(x, y) where LI(x, y) = 0 (black hole area), otherwise LI(x, y) is kept;
RI(x, y) = RII(x, y) where RI(x, y) = 0 (black hole area), otherwise RI(x, y) is kept.
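A minimal sketch of the filling rule just described: every pixel of the projection map that is still a black hole (value 0 in all channels) is replaced by the corresponding pixel of the repaired image, and all other pixels are kept unchanged.

```python
import numpy as np

def fill_black_holes(projection, repaired):
    """Fill zero-valued (black hole) pixels of `projection` from `repaired`."""
    hole = np.all(projection == 0, axis=-1, keepdims=True)   # black-hole mask
    return np.where(hole, repaired, projection)

# left_filled  = fill_black_holes(LI, LII)   # LI / LII: left projection / repair
# right_filled = fill_black_holes(RI, RII)
```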
it should be noted that the present embodiment is only used for example, and all values, CBA video, loss functions, etc. contained in the present embodiment are only some specific examples, which are used for illustrating the content of the present invention and are not used for limiting the present invention.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
It should be noted that the above embodiments can be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (9)

1. A method for 2D to 3D conversion of video, comprising the steps of:
S1: collecting an open-source RGB-D image data set, expanding the open-source RGB-D image data set to form a depth estimation data set, constructing a depth estimation model through the depth estimation data set, and training the depth estimation model;
s2: collecting a 4K high-definition picture to manufacture an image restoration data set, expanding the image restoration data set, constructing an image restoration model through the image restoration data set, and training the image restoration model;
s3: and extracting an original image Mask by using a pre-training Mask-RCNN model, adjusting the resolution of the original image and the Mask, sending the original image and the Mask into the depth estimation model, calculating original left and right projection images according to the depth image, and sending the left and right projection images into the image restoration model respectively to restore the black hole area.
2. The method for 2D to 3D conversion of video according to claim 1, wherein in step S1, an open-source RGB-D image data set is collected and augmented to form a depth estimation data set, specifically:
collecting the open-source RGB-D image data set, inputting an RGB image in the RGB-D image as a model, and taking a depth image D as a depth annotation image;
Expanding a data set by adopting a data enhancement technology including random left-right turning, random color shift and camera noise increase;
dividing the enhanced depth estimation data set to form different data sets including a depth training set, a depth verification set and a depth test set, wherein the depth training set is used for adjusting model parameters of the depth estimation model to obtain a local optimal solution, the depth verification set is used for verifying whether the model parameters of the depth estimation model are reliable in a training process, and the depth test set is used for evaluating the generalization performance of the depth estimation model.
3. The method for 2D to 3D conversion of video according to claim 2, wherein in step S1, the depth estimation model is constructed from the depth estimation data set, specifically:
the depth estimation model is used for learning the mapping from the monocular RGB image to the depth image and realizing the depth estimation of the 2D video;
the depth estimation model uses the basic framework of a coder-decoder, jump connection is used between the coder and the decoder, and corresponding characteristic graphs are connected according to the channel direction to obtain more detailed depth estimation;
Through the encoder-decoder infrastructure, an input image is mapped into a single-channel depth image in which pixel values represent object depth.
4. The method for converting video from 2D to 3D according to claim 3, wherein in step S1, the depth estimation model is trained, specifically:
extracting a single-channel segmentation Mask of the input image by using a pre-training Mask-RCNN model, adjusting the resolution of the input image and the single-channel segmentation Mask and connecting the input image and the single-channel segmentation Mask according to a channel direction;
inputting the input image into the depth estimation model, outputting the single-channel depth image, and calculating loss values for the single-channel depth image and the depth annotation image through a loss function including L2 distance loss and TV total variation loss;
and testing the precision of the depth estimation model by using the depth verification set, and testing the performance of the depth estimation model by using the depth test set.
5. The method for 2D to 3D conversion of video according to claim 1, wherein in step S2, a 4K high definition picture production image restoration data set is collected and expanded, specifically:
Generating black curves with different positions, different shapes and different thicknesses on the 4K high-definition picture by using an open source library OpenCV so as to simulate a black hole area caused by three-dimensional image transformation;
expanding the image restoration data set by adopting a data enhancement technology including random left-right turning, random color variation and camera noise increase;
and dividing the image restoration data set to form different data sets including a restoration training set, a restoration verification set and a restoration testing set.
6. The method for 2D to 3D conversion of video according to claim 5, wherein in step S2, an image restoration model is constructed from the image restoration data set, specifically:
the image restoration model uses the infrastructure of the encoder-decoder, uses a skip connection between the encoder and decoder, uses the activation function Tanh at the last layer of the decoder, and maps the feature map to a three-channel output image.
7. The method for converting 2D to 3D of video according to claim 6, wherein in step S3, the image inpainting model is trained, specifically:
inputting an input image into the image restoration model, and outputting an output image of the three channels;
Calculating reconstruction loss for the three-channel output image and the original map using a loss function including L2 distance loss;
and testing the model precision by using the repair verification set, and testing the performance of the model by using the repair test set.
8. The method for converting 2D to 3D of video according to claim 1, wherein in step S3, the pre-trained Mask-RCNN model is used to extract the original image Mask, adjust the original image and Mask resolution and send them to the depth estimation model, specifically:
and sending the RGB image to be predicted into the pre-trained Mask-RCNN model to obtain an instance segmentation mask, concatenating the predicted image and the mask along the channel direction, and inputting them into the depth estimation model to obtain a predicted depth map.
9. The method for converting a video from 2D to 3D according to claim 1, wherein in step S3, the left and right projection views are respectively fed into the image restoration model to restore black hole regions, specifically:
and respectively inputting the left projection drawing LD and the right projection drawing RD into the image restoration model to obtain restoration drawings LDI and RDI, and filling the pixel values of the restored image into the black hole area in the projection drawings, wherein:
LD(x, y) = LDI(x, y) where LD(x, y) = 0 (black hole area), otherwise LD(x, y) is kept;
RD(x, y) = RDI(x, y) where RD(x, y) = 0 (black hole area), otherwise RD(x, y) is kept.
CN202010819481.4A 2020-08-14 2020-08-14 Method for converting 2D (two-dimensional) video into 3D video Active CN112019828B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010819481.4A CN112019828B (en) 2020-08-14 2020-08-14 Method for converting 2D (two-dimensional) video into 3D video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010819481.4A CN112019828B (en) 2020-08-14 2020-08-14 Method for converting 2D (two-dimensional) video into 3D video

Publications (2)

Publication Number Publication Date
CN112019828A CN112019828A (en) 2020-12-01
CN112019828B (en) 2022-07-19

Family

ID=73504511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010819481.4A Active CN112019828B (en) 2020-08-14 2020-08-14 Method for converting 2D (two-dimensional) video into 3D video

Country Status (1)

Country Link
CN (1) CN112019828B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI754487B (en) * 2020-12-11 2022-02-01 大陸商深圳市博浩光電科技有限公司 System for converting two-dimensional image to three-dimensional images using deep learning and method thereof
CN112967200A (en) * 2021-03-05 2021-06-15 北京字跳网络技术有限公司 Image processing method, apparatus, electronic device, medium, and computer program product
CN113989349B (en) * 2021-10-25 2022-11-25 北京百度网讯科技有限公司 Image generation method, training method of image processing model, and image processing method
CN115761565B (en) * 2022-10-09 2023-07-21 名之梦(上海)科技有限公司 Video generation method, device, equipment and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109741383A (en) * 2018-12-26 2019-05-10 西安电子科技大学 Picture depth estimating system and method based on empty convolution sum semi-supervised learning
CN109978786A (en) * 2019-03-22 2019-07-05 北京工业大学 A kind of Kinect depth map restorative procedure based on convolutional neural networks
CN111325693A (en) * 2020-02-24 2020-06-23 西安交通大学 Large-scale panoramic viewpoint synthesis method based on single-viewpoint RGB-D image

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9609307B1 (en) * 2015-09-17 2017-03-28 Legend3D, Inc. Method of converting 2D video to 3D video using machine learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109741383A (en) * 2018-12-26 2019-05-10 西安电子科技大学 Picture depth estimating system and method based on empty convolution sum semi-supervised learning
CN109978786A (en) * 2019-03-22 2019-07-05 北京工业大学 A kind of Kinect depth map restorative procedure based on convolutional neural networks
CN111325693A (en) * 2020-02-24 2020-06-23 西安交通大学 Large-scale panoramic viewpoint synthesis method based on single-viewpoint RGB-D image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Virtual viewpoint rendering method based on depth-guided hole filling; 娄达平 (Lou Daping); 《计算机应用与软件》 (Computer Applications and Software); 2017-06-30; Vol. 34, No. 6; full text *

Also Published As

Publication number Publication date
CN112019828A (en) 2020-12-01

Similar Documents

Publication Publication Date Title
CN112019828B (en) Method for converting 2D (two-dimensional) video into 3D video
CN112543317B (en) Method for converting high-resolution monocular 2D video into binocular 3D video
CN111028150B (en) Rapid space-time residual attention video super-resolution reconstruction method
CN100576934C (en) Virtual visual point synthesizing method based on the degree of depth and block information
CN101771893B (en) Video frequency sequence background modeling based virtual viewpoint rendering method
EP2595116A1 (en) Method for generating depth maps for converting moving 2d images to 3d
CN111105432B (en) Unsupervised end-to-end driving environment perception method based on deep learning
CN111739082B (en) Stereo vision unsupervised depth estimation method based on convolutional neural network
CN110113593B (en) Wide baseline multi-view video synthesis method based on convolutional neural network
KR102141319B1 (en) Super-resolution method for multi-view 360-degree image and image processing apparatus
CN110930500A (en) Dynamic hair modeling method based on single-view video
CN115883764B (en) Underwater high-speed video frame inserting method and system based on data collaboration
Li et al. Deep sketch-guided cartoon video inbetweening
Bleyer et al. Temporally consistent disparity maps from uncalibrated stereo videos
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN112634127B (en) Unsupervised stereo image redirection method
CN113436254B (en) Cascade decoupling pose estimation method
CN111652922B (en) Binocular vision-based monocular video depth estimation method
CN112927348A (en) High-resolution human body three-dimensional reconstruction method based on multi-viewpoint RGBD camera
TWI754487B (en) System for converting two-dimensional image to three-dimensional images using deep learning and method thereof
Evain et al. A lightweight neural network for monocular view generation with occlusion handling
Eisert et al. Volumetric video–acquisition, interaction, streaming and rendering
Caviedes et al. Real time 2D to 3D conversion: Technical and visual quality requirements
CN116546183B (en) Dynamic image generation method and system with parallax effect based on single frame image
CN105096352A (en) Significance-driven depth image compression method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant