CN112019828B - Method for converting 2D (two-dimensional) video into 3D video - Google Patents

Method for converting 2D (two-dimensional) video into 3D video

Info

Publication number
CN112019828B
Authority
CN
China
Prior art keywords
image
model
depth
data set
depth estimation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010819481.4A
Other languages
Chinese (zh)
Other versions
CN112019828A (en)
Inventor
唐杰
李进
李庆瑜
戴立言
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI WONDERTEK SOFTWARE CO Ltd
Original Assignee
SHANGHAI WONDERTEK SOFTWARE CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI WONDERTEK SOFTWARE CO Ltd filed Critical SHANGHAI WONDERTEK SOFTWARE CO Ltd
Priority to CN202010819481.4A priority Critical patent/CN112019828B/en
Publication of CN112019828A publication Critical patent/CN112019828A/en
Application granted granted Critical
Publication of CN112019828B publication Critical patent/CN112019828B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 - Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 - Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106 - Processing image signals
    • H04N 13/111 - Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 - Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 - Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106 - Processing image signals
    • H04N 13/128 - Adjusting depth or disparity

Abstract

The invention relates to the technical field of video dimension conversion and provides a method for converting 2D (two-dimensional) video into 3D video, which comprises the following steps. S1: collecting and expanding an open-source RGB-D image data set to form a depth estimation data set, and constructing and training a depth estimation model with it. S2: collecting 4K high-definition pictures to make an image restoration data set, expanding the data set, and constructing and training an image restoration model with it. S3: extracting the original image mask with a pre-trained Mask-RCNN model, adjusting the resolution of the original image and the mask, sending them into the depth estimation model, calculating the original left and right projection images from the depth map, and sending the left and right projection images into the image restoration model to restore the black hole areas. By combining a deep learning algorithm with the traditional algorithm and using deep learning models to replace the depth map estimation and black hole filling algorithms of the traditional DIBR method, 2D/3D conversion of ultra-high-resolution images is realized.

Description

Method for converting 2D (two-dimensional) video into 3D video
Technical Field
The invention relates to the technical field of video dimension conversion, in particular to a method for converting 2D (two-dimensional) video into 3D video, which involves processing techniques such as computer image processing, computer vision, deep learning and CUDA (compute unified device architecture) high-performance programming.
Background
At present, high-speed, high-capacity, low-latency 5G communication is becoming widespread, making the Internet of Everything possible. Compared with 2D video, 3D video is rich in scene depth information; the parallax-synthesized images produced by a 2D/3D conversion algorithm match the stereoscopic perception of human eyes in the real world, and an immersive experience can be obtained with VR devices.
At present, 2D/3D conversion methods fall mainly into end-to-end methods based on deep learning and traditional methods based on DIBR (depth-image-based rendering).
(1) Deep learning end-to-end method
Deep learning has achieved great success in many computer-vision fields thanks to high-performance computing platforms; in areas such as video classification, understanding and action recognition it far exceeds traditional methods. However, deep learning still falls short of the human visual system in versatility, flexibility and adaptability, and "end-to-end" deep learning solutions can run into fundamental difficulties on complex visual tasks.
The deep-learning-based method relies on massive 3D movie data and performs stereoscopic conversion directly on the original image by building an end-to-end model architecture: inputting a left view yields the right view. This end-to-end architecture simplifies the stereoscopic conversion process, but because it lacks depth estimation, the image depth can contradict visual common sense in complex scenes. Experiments show that the model's processing speed is very slow for high-definition video at 2K resolution and above, and in fast-moving scenes the temporal continuity of the synthesized video degrades severely, causing jitter of the picture and subtitles.
(2) Conventional method of DIBR
The traditional DIBR-based method mainly comprises the following steps: depth map estimation, depth map preprocessing, three-dimensional image transformation, black hole area filling, and synthesized-image post-processing. This method processes high-resolution RGB-D video quickly. However, experiments show that the low-quality depth information provided by a camera cannot be aligned with the contour edges of the source image, which often causes artifacts, irregular object deformation and large black hole areas in the synthesized image after the three-dimensional projection transformation.
Disclosure of Invention
In view of the above problems, the invention aims to provide a method for converting 2D video into 3D video that combines a deep learning algorithm with a traditional algorithm and uses deep learning models to replace the depth map estimation and black hole filling algorithms of the traditional DIBR method. Based on the open-source Tensorflow framework, parallel computation is accelerated with a CUDA engine, and 2D/3D conversion of ultra-high-resolution images is realized. The 2D/3D conversion method solves the picture and subtitle jitter and the object deformation and distortion caused by inaccurate depth estimation in complex scenes, solves the problem that large black hole areas formed by source-image warping cannot be filled, and solves the problem of slow stereoscopic conversion of images at 2K resolution and above.
The above object of the present invention is achieved by the following technical solutions:
a method of 2D to 3D conversion of video, comprising the steps of:
s1: collecting an open-source RGB-D image data set, expanding the open-source RGB-D image data set to form a depth estimation data set, constructing a depth estimation model through the depth estimation data set, and training the depth estimation model;
s2: collecting a 4K high-definition picture to manufacture an image restoration data set, expanding the image restoration data set, constructing an image restoration model through the image restoration data set, and training the image restoration model;
s3: and extracting an original image Mask by using a pre-training Mask-RCNN model, adjusting the resolution of the original image and the Mask, sending the original image and the Mask into the depth estimation model, calculating original left and right projection images according to the depth image, and sending the left and right projection images into the image restoration model respectively to restore the black hole area.
Further, in step S1, collecting an open-source RGB-D image data set and expanding the open-source RGB-D image data set to form a depth estimation data set, specifically:
collecting the open-source RGB-D image data set, taking the RGB image in the RGB-D data as the model input and the depth image D as the depth annotation image;
expanding the data set with data enhancement techniques including random left-right flipping, random color shift and added camera noise;
dividing the enhanced depth estimation data set to form different data sets including a depth training set, a depth verification set and a depth test set, wherein the depth training set is used for adjusting model parameters of the depth estimation model to obtain a local optimal solution, the depth verification set is used for verifying whether the model parameters of the depth estimation model are reliable in a training process, and the depth test set is used for evaluating the generalization performance of the depth estimation model.
Further, in step S1, the depth estimation model is constructed by the depth estimation data set, specifically:
the depth estimation model is used for learning the mapping from the monocular RGB image to the depth image and realizing the depth estimation of the 2D video;
the depth estimation model uses an encoder-decoder basic framework; skip connections are used between the encoder and decoder, and the corresponding feature maps are concatenated along the channel direction to obtain a more detailed depth estimate;
through the encoder-decoder infrastructure, an input image is mapped into a single-channel depth image in which pixel values represent object depth.
Further, in step S1, the depth estimation model is trained, specifically:
extracting a single-channel segmentation Mask of the input image by using a pre-training Mask-RCNN model, adjusting the resolution of the input image and the single-channel segmentation Mask and connecting the input image and the single-channel segmentation Mask according to a channel direction;
inputting the input image into the depth estimation model, outputting the single-channel depth image, and calculating loss values for the single-channel depth image and the depth annotation image through a loss function including L2 distance loss and TV total variation loss;
and testing the precision of the depth estimation model by using the depth verification set, and testing the performance of the depth estimation model by using the depth test set.
Further, in step S2, a 4K high definition picture is collected to produce an image restoration data set, and the image restoration data set is expanded, specifically:
generating black curves with different positions, different shapes and different thicknesses on the 4K high-definition picture by using an open source library OpenCV so as to simulate a black hole area caused by three-dimensional image transformation;
expanding the image restoration data set by adopting a data enhancement technology including random left-right turning, random color variation and camera noise increase;
And dividing the image restoration data set to form different data sets including a restoration training set, a restoration verification set and a restoration testing set.
Further, in step S2, an image restoration model is constructed by the image restoration data set, specifically:
the image restoration model uses the infrastructure of an encoder-decoder, uses a skip connection between the encoder and the decoder, uses an activation function Tanh at the last layer of the decoder, and maps a feature map into an output image of three channels.
Further, in step S3, the image inpainting model is trained, specifically:
inputting an input image into the image restoration model, and outputting an output image of the three channels;
calculating reconstruction loss for the output image and the original map of the three channels using a loss function including L2 distance loss;
and testing the model precision by using the repair verification set, and testing the performance of the model by using the repair test set.
Further, in step S3, extracting an original image Mask using the pre-trained Mask-RCNN model, adjusting resolutions of the original image and the Mask, and sending the original image and the Mask to the depth estimation model, specifically:
and sending the RGB image to be predicted into the pre-trained Mask-RCNN model to obtain an instance segmentation mask, concatenating the predicted image and the mask along the channel direction, and inputting them into the depth estimation model to obtain a predicted depth map.
Further, in step S3, the original left and right projection maps are calculated according to the depth map, specifically:
for the predicted depth map, combining the distance f from the human eyes to the focal plane, the interocular distance t_c and the visual field difference d, the 2D image is orthographically projected to the positions of the two eye viewpoints, and the projection templates are initialized with a left projection template LD = 0 and a right projection template RD = 0;
the 2D image pixel coordinate offset Δx is computed from the depth map and these parameters (the exact expression is given in the patent only as a formula image);
the left projection LD(x, y) = I(x, y) + Δx and the right projection RD(x, y) = I(x, y) - Δx are obtained.
Further, in step S3, the left and right projection views are respectively sent to the image restoration model to restore the black hole area, specifically:
the left and right projection maps LD and RD are respectively input into the image restoration model to obtain the restored maps LDI and RDI, and the black hole areas in the projection maps are filled with the pixel values of the restored images, where:
LD(x, y) = LDI(x, y) where LD(x, y) = 0 (black hole area), otherwise LD(x, y) is kept;
RD(x, y) = RDI(x, y) where RD(x, y) = 0 (black hole area), otherwise RD(x, y) is kept.
compared with the prior art, the invention has at least one of the following beneficial effects:
(1) the invention fully considers the respective advantages of the deep learning algorithm and the traditional DIBR algorithm, and replaces the steps of depth map acquisition, depth map preprocessing and black hole filling in the traditional DIBR algorithm by using the deep learning restoration model and the depth estimation model, thereby simplifying the processing flow of the traditional algorithm.
(2) The invention packages the 2D/3D algorithm on the Tensorflow framework and uses Tensorflow to parallelize low-level image operations such as traversal and lookup, so that the 2D/3D algorithm runs efficiently in a GPU environment.
(3) By exploiting the high generalization and strong fitting ability of deep learning models, the image restoration model and depth estimation model constructed by the method render better results than the traditional algorithm; specifically, the method reduces object deformation and image jitter caused by inaccurate depth estimation and improves the detail texture of the repaired black hole regions.
Drawings
FIG. 1 is a general flow chart of a method for converting 2D video into 3D video according to the present invention;
FIG. 2 is a schematic diagram of the present invention in which an original image and a mask are sent into the depth estimation model to generate a depth image and the left and right projection images;
FIG. 3 is a schematic diagram of the present invention repairing a depth image containing a black hole into a complete image;
FIG. 4 is a schematic diagram of the deep network model inference process of the present invention;
FIG. 5 is a schematic diagram of a composite left and right view of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
To overcome the defects of the prior art, the invention fully combines the respective advantages of the traditional DIBR algorithm and the deep-learning-based algorithm and provides an efficient and stable three-dimensional image transformation technique: a 2D/3D conversion algorithm for images with a resolution of 2K and above.
The core of the invention is a 2D/3D conversion technique for ultra-high-definition video that addresses the problems of existing conversion algorithms: inaccurate depth estimation of the source image, difficulty in handling large black areas, and temporal and spatial discontinuities such as jitter in the synthesized video frames.
The present invention will be described in detail below for the purpose of making the object and technical solution of the present invention clearer.
In this case, 4K high definition video is used, and the technology of the present invention is used to realize 2D/3D three-dimensional conversion.
The method comprises the following preparation steps: constructing a depth estimation data set; building and training the depth estimation model; constructing an image restoration data set; building and training the restoration model.
the method comprises the following 2D/3D conversion steps: extracting an original image Mask by using a pre-training Mask-RCNN model; adjusting the resolution of an original image and a mask and sending the resolution into a trained depth estimation model; calculating original left and right projection images according to the depth image; and respectively sending the left projection drawing and the right projection drawing into the trained image repairing model to repair the black hole area.
Examples
As shown in FIG. 1, this embodiment provides a specific method for converting video from 2D to 3D. Deep learning models replace the depth map estimation and black hole filling algorithms of the conventional DIBR method, and the high generalization and high accuracy of the deep learning models make up for the shortcomings of the conventional algorithm. On the basis of the Tensorflow framework, an efficient, high-quality conversion technique is realized.
The invention is a 2D/3D conversion method for resolutions of 2K and above. Based on the open-source Tensorflow framework, a CUDA engine accelerates parallel computation to realize 2D/3D conversion of ultra-high-resolution images: the original image and mask are sent into the depth estimation model to generate the left and right projection images as shown in FIG. 2, an image containing black holes is restored into a complete image as shown in FIG. 3, the deep network model inference flow is shown in FIG. 4, and the left and right views are synthesized as shown in FIG. 5. The method specifically comprises the following steps:
(1) Producing a depth estimation dataset
For an open-source RGB-D image data set, the RGB image is the model input and the depth image D is the annotation image. For example, the invention may use data sets such as NYU Depth Dataset V2 and MegaDepth; these data sets only illustrate the invention and do not limit it.
To further improve the generalization and noise resistance of the model, the experiment expands the data set with data enhancement techniques including random flipping, random color shift and added camera noise; random Gaussian noise and salt-and-pepper noise are used to simulate camera noise. The enhanced data set is divided into a training set, a validation set and a test set in a 7:2:1 ratio. The training set is used to adjust the model parameters toward a local optimum, the validation set is used to check during training whether the model parameters are reliable, and the test set is used to evaluate the generalization performance of the model.
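As an illustration of the augmentation and split described above, the following Python sketch applies a random left-right flip, a random color shift, Gaussian and salt-and-pepper noise, and a 7:2:1 split; the function names, value ranges and noise levels are assumptions, not taken from the patent.

```python
import numpy as np

def augment_pair(rgb, depth, rng=np.random):
    """Augment one RGB image and its depth annotation together."""
    if rng.rand() < 0.5:                                # random left-right flip
        rgb, depth = rgb[:, ::-1], depth[:, ::-1]
    rgb = rgb.astype(np.float32)
    rgb += rng.uniform(-20, 20, size=(1, 1, 3))         # random color shift
    rgb += rng.normal(0, 5, size=rgb.shape)             # Gaussian camera noise
    sp = rng.rand(*rgb.shape[:2])                       # salt-and-pepper noise
    rgb[sp < 0.005] = 0.0
    rgb[sp > 0.995] = 255.0
    return np.clip(rgb, 0, 255).astype(np.uint8), depth

def split_dataset(n_samples, rng=np.random):
    """Return index arrays for a 7:2:1 train/validation/test split."""
    idx = rng.permutation(n_samples)
    n_train, n_val = int(0.7 * n_samples), int(0.2 * n_samples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```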
(2) Constructing the depth estimation model Model_depth
The depth estimation model is built on a basic model such as the open-source U-Net; it learns the mapping from a monocular RGB image to a depth image and realizes depth estimation for 2D video. For the depth estimation task, the model uses an encoder-decoder infrastructure. All convolution kernels in the encoder are 3 × 3, the convolution padding parameter is set to 1, and the sliding stride is set to 1 so that the feature-map size is unchanged after convolution. The activation function is LeakyReLU. For downsampling, max pooling is used with padding 2, a 5 × 5 template and stride 2. In the decoder, the upsampling layers use bilinear interpolation to enlarge the feature-map resolution by a factor of 2, the convolution-layer parameters are the same as in the encoder, and the last layer uses the Tanh activation function. Skip connections between the encoder and decoder concatenate the corresponding feature maps along the channel direction, which yields a more refined depth estimate. Through this encoder-decoder structure, the input image is mapped into a single-channel depth image O_d, whose pixel values represent object depth.
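A minimal Keras sketch of the encoder-decoder just described is given below: 3 × 3 convolutions that keep the spatial size, LeakyReLU activations, 5 × 5 max pooling with stride 2 for downsampling, bilinear ×2 upsampling with channel-wise skip connections in the decoder, and a single-channel Tanh output. The number of levels, the channel widths and the 4-channel (RGB + mask) input are assumptions for illustration, not the patent's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # 3x3 convolution, stride 1, "same" padding keeps the feature-map size.
    x = layers.Conv2D(filters, 3, strides=1, padding="same")(x)
    return layers.LeakyReLU()(x)

def build_depth_model(input_shape=(540, 960, 4)):   # RGB image + mask channel
    inputs = layers.Input(shape=input_shape)
    x, skips = inputs, []
    for f in (32, 64):                               # encoder
        x = conv_block(x, f)
        skips.append(x)
        # 5x5 max pooling, stride 2, halves the resolution.
        x = layers.MaxPooling2D(pool_size=5, strides=2, padding="same")(x)
    x = conv_block(x, 128)                           # bottleneck
    for f, skip in zip((64, 32), reversed(skips)):   # decoder
        x = layers.UpSampling2D(size=2, interpolation="bilinear")(x)
        x = layers.Concatenate()([x, skip])          # skip connection (channels)
        x = conv_block(x, f)
    # Single-channel depth output O_d with Tanh on the last layer.
    depth = layers.Conv2D(1, 3, padding="same", activation="tanh")(x)
    return tf.keras.Model(inputs, depth)
```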
(3) Training depth estimation model
First, the batch size is set to 128 and 128 source pictures I are randomly drawn from the training set; the pre-trained Mask-RCNN model extracts a single-channel mask for each input picture I. The resolutions of the input picture I and the mask are then adjusted to 540 × 960 and the two are concatenated along the channel direction; a good instance segmentation mask helps to obtain a depth map with an accurate parallax relationship. The merged data are input into the depth estimation model, which outputs the single-channel map O_d. The loss between O_d and the labelled depth map is then computed. The loss function combines an L2 distance loss and a TV (total variation) loss: the L2 loss lets the model learn the mapping, while the TV loss keeps the predicted depth map locally continuous and thus improves the binocular visual effect. The total loss is Loss = λ1·TV + λ2·L2, with weights λ1 = 0.3 and λ2 = 0.7. The parameter-update optimizer is Adam with an initial learning rate lr = 1e-3; the model is trained for 300 epochs, the learning rate is lowered by 15% every 10 epochs, and model accuracy is tested on the validation set. If the accuracy drops 5 times in a row, training stops; otherwise training runs to the end of the schedule. After training, model performance is tested on the test set; if the score reaches the set threshold the model is kept, otherwise the model parameters are randomly re-initialized and the training steps are repeated until the score is reached.
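A hedged sketch of the loss and training schedule above: Loss = λ1·TV + λ2·L2 with λ1 = 0.3 and λ2 = 0.7, Adam with lr = 1e-3, a 15% learning-rate reduction every 10 epochs, and early stopping after 5 consecutive degradations. The use of tf.image.total_variation, the val_loss monitor and the normalization of each term are assumptions.

```python
import tensorflow as tf

LAMBDA_TV, LAMBDA_L2 = 0.3, 0.7

def depth_loss(y_true, y_pred):
    # L2 distance between the predicted depth O_d and the depth annotation.
    l2 = tf.reduce_mean(tf.square(y_true - y_pred))
    # Total-variation term keeps the predicted depth map locally continuous.
    tv = tf.reduce_mean(tf.image.total_variation(y_pred))
    return LAMBDA_TV * tv + LAMBDA_L2 * l2

# Reduce the learning rate by 15% every 10 epochs; stop after 5 degradations.
lr_schedule = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch, lr: lr * 0.85 if epoch > 0 and epoch % 10 == 0 else lr)
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)

# model = build_depth_model()   # from the architecture sketch above
# model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
#               loss=depth_loss)
# model.fit(train_ds, validation_data=val_ds, epochs=300,
#           callbacks=[lr_schedule, early_stop])
```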
(4) Making an image restoration dataset
Local 4K high-definition pictures are used as the training data source. To further improve the generalization and robustness of the model, this embodiment expands the data set with data enhancement techniques including random left-right flipping, random color shift and added camera noise; random Gaussian noise and salt-and-pepper noise are used to simulate camera noise. To simulate the black hole areas caused by image warping, the open-source library OpenCV is used to draw black curves of different positions, shapes and thicknesses on each image I_i. The enhanced data set is divided into a training set, a validation set and a test set in a 7:2:1 ratio.
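An illustrative OpenCV sketch of drawing black curves of random position, shape and thickness onto a frame to simulate black hole areas; the number of curves, point counts, step sizes and thickness range are assumptions.

```python
import cv2
import numpy as np

def add_black_curves(image, max_curves=5, rng=np.random):
    """Return a copy of `image` with random black polyline "holes" drawn in."""
    h, w = image.shape[:2]
    corrupted = image.copy()
    for _ in range(rng.randint(1, max_curves + 1)):
        start = np.array([rng.randint(0, w), rng.randint(0, h)])
        # A jagged curve: cumulative random steps from the start point.
        pts = start + np.cumsum(rng.randint(-80, 81, size=(8, 2)), axis=0)
        pts = np.clip(pts, [0, 0], [w - 1, h - 1]).astype(np.int32)
        thickness = rng.randint(2, 21)
        cv2.polylines(corrupted, [pts.reshape(-1, 1, 2)], isClosed=False,
                      color=(0, 0, 0), thickness=thickness)
    return corrupted
```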
(5) Constructing the image restoration model Model_inpaint
The model structure is similar to that of the depth estimation model, except that for the image restoration task the max pooling layers are replaced by average pooling layers.
The structure also adopts the open-source U-Net as the basic model: in the encoder, the padding parameter of the 3 × 3 convolutional layers is set to 1, and the downsampling layers use average pooling with stride 2; the activation function is LeakyReLU. In the decoder, the upsampling layers enlarge the feature maps by a factor of 2 with bilinear interpolation, skip connections are used between the encoder and decoder, and the last layer of the decoder uses the Tanh activation function to map the feature map into a three-channel output image O_i.
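A hedged sketch of the restoration model, mirroring the depth-model sketch above but using average pooling for downsampling and a three-channel Tanh output; the channel widths, pooling window and number of levels are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_inpaint_model(input_shape=(540, 960, 3)):
    inputs = layers.Input(shape=input_shape)
    x, skips = inputs, []
    for f in (32, 64):                                        # encoder
        x = layers.LeakyReLU()(layers.Conv2D(f, 3, padding="same")(x))
        skips.append(x)
        # Average pooling with stride 2 replaces the max pooling layer.
        x = layers.AveragePooling2D(pool_size=2, strides=2, padding="same")(x)
    x = layers.LeakyReLU()(layers.Conv2D(128, 3, padding="same")(x))
    for f, skip in zip((64, 32), reversed(skips)):            # decoder
        x = layers.UpSampling2D(size=2, interpolation="bilinear")(x)
        x = layers.Concatenate()([x, skip])                   # skip connection
        x = layers.LeakyReLU()(layers.Conv2D(f, 3, padding="same")(x))
    # Three-channel restored image O_i, Tanh on the last layer.
    out = layers.Conv2D(3, 3, padding="same", activation="tanh")(x)
    return tf.keras.Model(inputs, out)
```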
(6) Training image restoration model
First, the batch size is set to 128 and 128 source picture pairs are randomly drawn from the training set; the picture resolution is adjusted to 540 × 960 and the data are input into the image restoration model, which produces the three-channel output image O_i. The reconstruction loss between O_i and the original picture is computed using L2 and L1 distance losses. The parameter-update optimizer is Adam with an initial learning rate lr = 1e-3; the model is trained for 300 epochs, the learning rate is lowered by 15% every 15 epochs, and model accuracy is tested on the validation set. If the accuracy drops 5 times in a row, training stops; otherwise training runs to the end of the schedule. After training, model performance is tested on the test set; if the score reaches the set threshold the model is kept, otherwise the model parameters are randomly re-initialized and the training steps are repeated until the score is reached.
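A minimal sketch of the reconstruction loss for the restoration model, combining the L2 and L1 distances between the output O_i and the clean original frame; equal weights are an assumption, since the text does not give them.

```python
import tensorflow as tf

def inpaint_loss(y_true, y_pred):
    l2 = tf.reduce_mean(tf.square(y_true - y_pred))   # L2 distance loss
    l1 = tf.reduce_mean(tf.abs(y_true - y_pred))      # L1 distance loss
    return l2 + l1                                    # equal weighting assumed
```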
(7) Monocular image depth estimation
The RGB image to be predicted is sent into the pre-trained Mask-RCNN model to obtain an instance segmentation mask; the image and the mask are concatenated along the channel direction and input into the depth estimation model Model_depth to obtain the predicted depth map.
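A hedged sketch of this inference step: obtain an instance segmentation mask from the pre-trained Mask-RCNN, concatenate it with the RGB frame along the channel axis and run the depth model. Here `mask_rcnn` and `depth_model` stand for the two pre-trained models and are assumptions; their loading and the mask's value range are not specified in the patent.

```python
import cv2
import numpy as np

def predict_depth(rgb, mask_rcnn, depth_model, size=(960, 540)):
    """Return a predicted depth map for one RGB frame (H x W x 3, uint8)."""
    small = cv2.resize(rgb, size)                    # resized to 540 x 960 x 3
    mask = mask_rcnn(small)                          # assumed (540, 960, 1) mask
    x = np.concatenate([small / 255.0, mask], axis=-1)   # channel-wise concat
    x = x[np.newaxis].astype(np.float32)             # add a batch dimension
    return depth_model.predict(x)[0, ..., 0]         # (540, 960) depth map
```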
(8) Three-dimensional image conversion
In this case, CBA video with a resolution of 3840 × 2160 is used as the video to be converted, and the video stream is read with OpenCV. First, the captured source picture I is downscaled to 960 × 540, a segmentation mask is extracted with the pre-trained Mask-RCNN, and the mask and I_960×540 are concatenated along the channel direction and input into the depth estimation model of this case to obtain the depth map I_depth. The resolution of I_depth is then restored to 3840 × 1920. Combining the distance f from the human eyes to the focal plane, the interocular distance t_c and the visual field difference d, the pixel coordinate offset Δx of the source image I_pre(x, y) is calculated (the exact expression is given in the patent only as a formula image).
The projection templates are initialized as T(x, y) = 0, and the source image I is orthographically projected to the two eye viewpoint positions, giving the left projection LI(x, y) = I(x, y) + Δx and the right projection RI(x, y) = I(x, y) - Δx.
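An illustrative sketch of the forward projection: each source pixel is shifted horizontally by a disparity Δx derived from its depth and written into left and right projection templates initialized to zero, so pixels that receive no value remain black holes. Since the patent gives its Δx expression only as a formula image, the disparity used here is a generic DIBR-style approximation and is purely an assumption, as are the parameter values.

```python
import numpy as np

def project_left_right(image, depth, t_c=0.06, f=2.0, scale=30.0):
    """Forward-project `image` (H x W x 3) into left/right views using `depth` (H x W)."""
    h, w = depth.shape
    left = np.zeros_like(image)                      # projection templates = 0
    right = np.zeros_like(image)
    # Assumed disparity: zero at the focal plane, larger for nearer pixels.
    dx = np.round(scale * (t_c / 2.0) *
                  (1.0 - f / np.maximum(depth, 1e-6))).astype(np.int64)
    xs = np.arange(w)
    for y in range(h):
        xl = np.clip(xs + dx[y], 0, w - 1)           # LI(x, y) = I(x, y) + dx
        xr = np.clip(xs - dx[y], 0, w - 1)           # RI(x, y) = I(x, y) - dx
        left[y, xl] = image[y, xs]
        right[y, xr] = image[y, xs]
    return left, right
```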
(9) Black hole filling
The resolutions of the left and right projection maps LI and RI are reduced to 1920 × 1080, the maps are input into the image restoration model of the invention, and the restored maps LII and RII are output. Using the restored images directly at the original resolution would give blurred results, so in this case the restored images are first upscaled back to 3840 × 2160, giving LII_3840×2160 and RII_3840×2160, and only the black hole areas of the projection maps are filled with the pixel values of the repaired images, where:
LI(x, y) = LII(x, y) where LI(x, y) = 0 (black hole area), otherwise LI(x, y) is kept;
RI(x, y) = RII(x, y) where RI(x, y) = 0 (black hole area), otherwise RI(x, y) is kept.
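A minimal sketch of the filling rule just described: every pixel of the projection map that is still a black hole (value 0 in all channels) is replaced by the corresponding pixel of the repaired image, and all other pixels are kept unchanged.

```python
import numpy as np

def fill_black_holes(projection, repaired):
    """Fill zero-valued (black hole) pixels of `projection` from `repaired`."""
    hole = np.all(projection == 0, axis=-1, keepdims=True)   # black-hole mask
    return np.where(hole, repaired, projection)

# left_filled  = fill_black_holes(LI, LII)   # LI / LII: left projection / repair
# right_filled = fill_black_holes(RI, RII)
```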
it should be noted that the present embodiment is only used for example, and all values, CBA video, loss functions, etc. contained in the present embodiment are only some specific examples, which are used for illustrating the content of the present invention and are not used for limiting the present invention.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
It should be noted that the above embodiments can be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (9)

1. A method for 2D to 3D conversion of video, comprising the steps of:
S1: collecting an open-source RGB-D image data set, expanding the open-source RGB-D image data set to form a depth estimation data set, constructing a depth estimation model through the depth estimation data set, and training the depth estimation model;
s2: collecting a 4K high-definition picture to manufacture an image restoration data set, expanding the image restoration data set, constructing an image restoration model through the image restoration data set, and training the image restoration model;
s3: and extracting an original image Mask by using a pre-training Mask-RCNN model, adjusting the resolution of the original image and the Mask, sending the original image and the Mask into the depth estimation model, calculating original left and right projection images according to the depth image, and sending the left and right projection images into the image restoration model respectively to restore the black hole area.
2. The method for 2D to 3D conversion of video according to claim 1, wherein in step S1, an open-source RGB-D image data set is collected and augmented to form a depth estimation data set, specifically:
collecting the open-source RGB-D image data set, inputting an RGB image in the RGB-D image as a model, and taking a depth image D as a depth annotation image;
Expanding a data set by adopting a data enhancement technology including random left-right turning, random color shift and camera noise increase;
dividing the enhanced depth estimation data set to form different data sets including a depth training set, a depth verification set and a depth test set, wherein the depth training set is used for adjusting model parameters of the depth estimation model to obtain a local optimal solution, the depth verification set is used for verifying whether the model parameters of the depth estimation model are reliable in a training process, and the depth test set is used for evaluating the generalization performance of the depth estimation model.
3. The method for 2D to 3D conversion of video according to claim 2, wherein in step S1, the depth estimation model is constructed from the depth estimation data set, specifically:
the depth estimation model is used for learning the mapping from the monocular RGB image to the depth image and realizing the depth estimation of the 2D video;
the depth estimation model uses the basic framework of a coder-decoder, jump connection is used between the coder and the decoder, and corresponding characteristic graphs are connected according to the channel direction to obtain more detailed depth estimation;
Through the encoder-decoder infrastructure, an input image is mapped into a single-channel depth image in which pixel values represent object depth.
4. The method for converting video from 2D to 3D according to claim 3, wherein in step S1, the depth estimation model is trained, specifically:
extracting a single-channel segmentation Mask of the input image by using a pre-training Mask-RCNN model, adjusting the resolution of the input image and the single-channel segmentation Mask and connecting the input image and the single-channel segmentation Mask according to a channel direction;
inputting the input image into the depth estimation model, outputting the single-channel depth image, and calculating loss values for the single-channel depth image and the depth annotation image through a loss function including L2 distance loss and TV total variation loss;
and testing the precision of the depth estimation model by using the depth verification set, and testing the performance of the depth estimation model by using the depth test set.
5. The method for 2D to 3D conversion of video according to claim 1, wherein in step S2, a 4K high definition picture production image restoration data set is collected and expanded, specifically:
Generating black curves with different positions, different shapes and different thicknesses on the 4K high-definition picture by using an open source library OpenCV so as to simulate a black hole area caused by three-dimensional image transformation;
expanding the image restoration data set by adopting a data enhancement technology including random left-right turning, random color variation and camera noise increase;
and dividing the image restoration data set to form different data sets including a restoration training set, a restoration verification set and a restoration testing set.
6. The method for 2D to 3D conversion of video according to claim 5, wherein in step S2, an image restoration model is constructed from the image restoration data set, specifically:
the image restoration model uses the infrastructure of the encoder-decoder, uses a skip connection between the encoder and decoder, uses the activation function Tanh at the last layer of the decoder, and maps the feature map to a three-channel output image.
7. The method for converting 2D to 3D of video according to claim 6, wherein in step S3, the image inpainting model is trained, specifically:
inputting an input image into the image restoration model, and outputting an output image of the three channels;
Calculating reconstruction loss for the three-channel output image and the original map using a loss function including L2 distance loss;
and testing the model precision by using the repair verification set, and testing the performance of the model by using the repair test set.
8. The method for converting 2D to 3D of video according to claim 1, wherein in step S3, the pre-trained Mask-RCNN model is used to extract the original image Mask, adjust the original image and Mask resolution and send them to the depth estimation model, specifically:
and sending the RGB image to be predicted into the pre-trained Mask-RCNN model to obtain an instance segmentation mask, concatenating the predicted image and the mask along the channel direction, and inputting them into the depth estimation model to obtain a predicted depth map.
9. The method for converting a video from 2D to 3D according to claim 1, wherein in step S3, the left and right projection views are respectively fed into the image restoration model to restore black hole regions, specifically:
and respectively inputting the left projection drawing LD and the right projection drawing RD into the image restoration model to obtain restoration drawings LDI and RDI, and filling the pixel values of the restored image into the black hole area in the projection drawings, wherein:
LD(x, y) = LDI(x, y) where LD(x, y) = 0 (black hole area), otherwise LD(x, y) is kept;
RD(x, y) = RDI(x, y) where RD(x, y) = 0 (black hole area), otherwise RD(x, y) is kept.
CN202010819481.4A 2020-08-14 2020-08-14 Method for converting 2D (two-dimensional) video into 3D video Active CN112019828B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010819481.4A CN112019828B (en) 2020-08-14 2020-08-14 Method for converting 2D (two-dimensional) video into 3D video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010819481.4A CN112019828B (en) 2020-08-14 2020-08-14 Method for converting 2D (two-dimensional) video into 3D video

Publications (2)

Publication Number Publication Date
CN112019828A CN112019828A (en) 2020-12-01
CN112019828B (en) 2022-07-19

Family

ID=73504511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010819481.4A Active CN112019828B (en) 2020-08-14 2020-08-14 Method for converting 2D (two-dimensional) video into 3D video

Country Status (1)

Country Link
CN (1) CN112019828B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI754487B (en) * 2020-12-11 2022-02-01 大陸商深圳市博浩光電科技有限公司 System for converting two-dimensional image to three-dimensional images using deep learning and method thereof
CN112967200A (en) * 2021-03-05 2021-06-15 北京字跳网络技术有限公司 Image processing method, apparatus, electronic device, medium, and computer program product
CN113989349B (en) * 2021-10-25 2022-11-25 北京百度网讯科技有限公司 Image generation method, training method of image processing model, and image processing method
CN115761565B (en) * 2022-10-09 2023-07-21 名之梦(上海)科技有限公司 Video generation method, device, equipment and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109741383A (en) * 2018-12-26 2019-05-10 西安电子科技大学 Picture depth estimating system and method based on empty convolution sum semi-supervised learning
CN109978786A (en) * 2019-03-22 2019-07-05 北京工业大学 A kind of Kinect depth map restorative procedure based on convolutional neural networks
CN111325693A (en) * 2020-02-24 2020-06-23 西安交通大学 Large-scale panoramic viewpoint synthesis method based on single-viewpoint RGB-D image

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9609307B1 (en) * 2015-09-17 2017-03-28 Legend3D, Inc. Method of converting 2D video to 3D video using machine learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109741383A (en) * 2018-12-26 2019-05-10 西安电子科技大学 Picture depth estimating system and method based on empty convolution sum semi-supervised learning
CN109978786A (en) * 2019-03-22 2019-07-05 北京工业大学 A kind of Kinect depth map restorative procedure based on convolutional neural networks
CN111325693A (en) * 2020-02-24 2020-06-23 西安交通大学 Large-scale panoramic viewpoint synthesis method based on single-viewpoint RGB-D image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Virtual viewpoint rendering method based on depth-guided hole filling; 娄达平 (Lou Daping); 《计算机应用与软件》 (Computer Applications and Software); 2017-06-30; Vol. 34, No. 6; full text *

Also Published As

Publication number Publication date
CN112019828A (en) 2020-12-01

Similar Documents

Publication Publication Date Title
CN112019828B (en) Method for converting 2D (two-dimensional) video into 3D video
CN112543317B (en) Method for converting high-resolution monocular 2D video into binocular 3D video
CN111028150B (en) Rapid space-time residual attention video super-resolution reconstruction method
CN100576934C (en) Virtual visual point synthesizing method based on the degree of depth and block information
CN101771893B (en) Video frequency sequence background modeling based virtual viewpoint rendering method
EP2595116A1 (en) Method for generating depth maps for converting moving 2d images to 3d
CN111105432B (en) Unsupervised end-to-end driving environment perception method based on deep learning
CN111739082B (en) Stereo vision unsupervised depth estimation method based on convolutional neural network
CN110113593B (en) Wide baseline multi-view video synthesis method based on convolutional neural network
KR102141319B1 (en) Super-resolution method for multi-view 360-degree image and image processing apparatus
CN110930500A (en) Dynamic hair modeling method based on single-view video
CN115883764B (en) Underwater high-speed video frame inserting method and system based on data collaboration
Li et al. Deep sketch-guided cartoon video inbetweening
Bleyer et al. Temporally consistent disparity maps from uncalibrated stereo videos
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN112634127B (en) Unsupervised stereo image redirection method
CN113436254B (en) Cascade decoupling pose estimation method
CN111652922B (en) Binocular vision-based monocular video depth estimation method
CN112927348A (en) High-resolution human body three-dimensional reconstruction method based on multi-viewpoint RGBD camera
TWI754487B (en) System for converting two-dimensional image to three-dimensional images using deep learning and method thereof
Evain et al. A lightweight neural network for monocular view generation with occlusion handling
Eisert et al. Volumetric video–acquisition, interaction, streaming and rendering
Caviedes et al. Real time 2D to 3D conversion: Technical and visual quality requirements
CN116546183B (en) Dynamic image generation method and system with parallax effect based on single frame image
CN105096352A (en) Significance-driven depth image compression method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant