CN110111244A - Image conversion, depth map prediction and model training method, device and electronic equipment - Google Patents


Info

Publication number
CN110111244A
Authority
CN
China
Prior art keywords
prediction
view
network model
right view
depth map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910381527.6A
Other languages
Chinese (zh)
Other versions
CN110111244B (en)
Inventor
吴方印
陈平
杨东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201910381527.6A
Publication of CN110111244A
Application granted
Publication of CN110111244B
Status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/08 Projecting images onto non-planar surfaces, e.g. geodetic screens
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the present invention provide image conversion, depth map prediction and model training methods, apparatuses and electronic devices. A two-dimensional (2D) image to be converted into a three-dimensional (3D) image can be acquired; the 2D image is input, as a first monocular view for generating the 3D image, into a pre-trained depth map prediction network model; the depth map prediction network model is obtained by training an initial depth map prediction network model and an initial camera parameter prediction network model based on multiple different 3D film source samples; a first predicted depth map output by the depth map prediction network model is obtained; the first predicted depth map is processed based on the first predicted depth map, the camera parameters of the 2D image, a preset camera imaging formula and a preset first sampling mode, to obtain a second monocular view; and the 3D image is generated based on the first monocular view and the second monocular view. It can be seen that, with embodiments of the present invention, depth prediction based on a single 2D image can be realized and the 2D image can be converted into a 3D image.

Description

Image conversion, depth map prediction and model training method, device and electronic equipment
Technical field
The present invention relates to the technical field of converting 2D images into 3D images, and more particularly to image conversion, depth map prediction and model training methods, apparatuses and electronic devices.
Background art
Currently, a 2D image can be converted into a 3D image by a depth map prediction method, in which the training process of the depth map prediction network model is usually as follows: consecutive frames of 2D images from one video source are input into an initial depth map prediction network model to obtain predicted depth maps; a loss function value is calculated against the ground-truth depth maps; the network parameters of the depth map prediction network model are adjusted according to the loss function, finally obtaining the trained depth map prediction network model. The 3D image is then obtained from the depth map predicted by the trained depth map prediction network model. Such a depth map prediction network model is obtained by training a single network model with a large number of consecutive 2D images and the ground-truth depth map of each 2D image.
In the process of implementing the present invention, the inventors found that the prior art has at least the following problem:
when the prior art performs depth map prediction, ground-truth depth maps must be available for supervision before the depth map prediction network model can be trained.
Summary of the invention
The purpose of embodiments of the present invention is to provide image conversion, depth map prediction and model training methods, apparatuses and electronic devices, so as to train a depth map prediction network model without ground-truth depth map supervision and to convert a 2D image into a 3D image. The specific technical solution is as follows:
In a first aspect, an embodiment of the present invention provides a method for converting a two-dimensional (2D) image into a three-dimensional (3D) image, the method comprising:
acquiring a 2D image to be converted into a 3D image;
inputting the 2D image, as a first monocular view for generating the 3D image, into a pre-trained depth map prediction network model, the depth map prediction network model being obtained by training an initial depth map prediction network model and an initial camera parameter prediction network model based on multiple different 3D film source samples, the first monocular view being a left view or a right view;
obtaining a first predicted depth map output by the depth map prediction network model;
processing the first predicted depth map based on the first predicted depth map, the camera parameters of the 2D image, a preset camera imaging formula and a preset first sampling mode, to obtain a second monocular view, the second monocular view being a right view or a left view corresponding to the first monocular view;
generating the 3D image based on the first monocular view and the second monocular view.
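For illustration only, this first-aspect pipeline can be sketched as follows, assuming PyTorch; `depth_net`, `warp_to_other_view` and the camera parameters `K` and `T` are hypothetical placeholders standing in for the trained model, the camera-imaging-plus-sampling step and the camera parameters of the 2D image, not the patent's actual implementation:

```python
import torch

# Minimal sketch of the claimed 2D-to-3D conversion, under the assumptions
# named above; the warp helper applies the preset camera imaging formula
# followed by the preset first sampling mode.
def convert_2d_to_3d(image_2d, depth_net, K, T, warp_to_other_view):
    left_view = image_2d                          # 2D image as the first monocular view
    with torch.no_grad():
        depth = depth_net(left_view)              # first predicted depth map
    right_view = warp_to_other_view(left_view, depth, K, T)  # second monocular view
    return left_view, right_view                  # left/right pair forming the 3D image
```

A concrete candidate for `warp_to_other_view` is sketched after the mapping-and-sampling steps below.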
Optionally, the training process of the depth map prediction network model includes:
acquiring multiple different 3D film sources shot by different cameras as training samples, wherein each training sample includes a left view and a corresponding right view;
inputting the right views of a preset number of training samples into the initial depth map prediction network model, to obtain a second predicted right depth map of each training sample output by the initial depth map prediction network model;
inputting the left views and right views of the preset number of training samples into the initial camera parameter prediction network model, to obtain predicted camera parameters of each training sample output by the initial camera parameter prediction network model;
processing the second predicted right depth map based on the second predicted right depth map of each training sample, the predicted camera parameters of each training sample, the preset camera imaging formula and a preset second sampling mode, to obtain a second predicted right view of each training sample;
calculating a loss value according to a preset loss function, the true right view of each training sample and its corresponding second predicted right view;
judging, according to the loss value, whether the initial depth map prediction network model and the initial camera parameter prediction network model have both converged to stability;
if they have converged to stability, incrementing the training count and judging whether a preset number of training iterations has been reached; if the preset number of training iterations has not been reached, returning to the step of inputting the right views of the preset number of training samples into the initial depth map prediction network model to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model; if the preset number of training iterations has been reached, taking the current initial depth map prediction network model as the trained depth map prediction network model;
if they have not converged to stability, incrementing the training count, adjusting the network parameters of the initial depth map prediction network model and the initial camera parameter prediction network model, and returning to the step of inputting the right views of the preset number of training samples into the depth map prediction network model being trained, to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model.
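As a rough illustration of this training procedure, the loop below is a hedged sketch assuming PyTorch; the optimizer choice, the `warp` and `loss_fn` helpers, the interface of `cam_net` (assumed to return the intrinsic matrix K and rotation-translation matrix T) and the folding of the convergence test into a fixed iteration budget are all assumptions, not the patent's specified behavior:

```python
import torch

# Sketch of the claimed training loop: the depth network sees only right
# views, the camera network sees concatenated left/right pairs, and the
# supervision signal is the true right view itself (no ground-truth depth).
def train(depth_net, cam_net, loader, loss_fn, warp, max_iters):
    params = list(depth_net.parameters()) + list(cam_net.parameters())
    opt = torch.optim.Adam(params)                # assumed optimizer
    it = 0
    while it < max_iters:                         # preset number of training iterations
        for left, right in loader:                # each sample: left view + right view
            pred_depth_r = depth_net(right)       # second predicted right depth map
            K, T = cam_net(torch.cat([left, right], dim=1))  # predicted camera parameters
            pred_right = warp(left, pred_depth_r, K, T)      # sample the left view
            loss = loss_fn(pred_right, right, pred_depth_r)  # compare with true right view
            opt.zero_grad()
            loss.backward()
            opt.step()                            # adjust network parameters
            it += 1
            if it >= max_iters:
                break
    return depth_net, cam_net
```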
Optionally, the predicted camera parameters include: predicted camera intrinsic parameters and predicted rotation-translation parameters;
the preset camera imaging formula is:

$$P_s \sim K \, T_{t \to s} \, D_t(P_t) \, K^{-1} P_t$$

where $\sim$ denotes the mapping operation; $P_s$ is the coordinate of a binocular image reference point in the left view, $P_t$ is the coordinate of the binocular image reference point in the right view, $K$ is the camera intrinsic matrix, $K^{-1}$ is the inverse matrix of the camera intrinsic matrix, $D_t(P_t)$ is the depth at the point $P_t$, and $T_{t \to s}$ is the rotation-translation matrix; the camera intrinsic matrix comprises the 4 parameters $(f_x, f_y, x_0, y_0)$, where $f_x$ and $f_y$ are the camera focal lengths, and $x_0$ and $y_0$ are the principal point coordinates;
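For reference, the standard pinhole form of this intrinsic matrix (a textbook convention, not reproduced from the patent drawings) is:

$$K = \begin{pmatrix} f_x & 0 & x_0 \\ 0 & f_y & y_0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad K^{-1} = \begin{pmatrix} 1/f_x & 0 & -x_0/f_x \\ 0 & 1/f_y & -y_0/f_y \\ 0 & 0 & 1 \end{pmatrix},$$

with $P_t = (u, v, 1)^{\mathsf{T}}$ in homogeneous pixel coordinates, so that $K^{-1}P_t$ back-projects a pixel to a camera ray, $D_t(P_t)$ scales it to a 3D point, $T_{t \to s}$ moves it into the other camera's frame, and $K$ projects it back to a pixel.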
the step of processing the second predicted right depth map based on the second predicted right depth map of each training sample, the predicted camera parameters of each training sample, the preset camera imaging formula and the preset second sampling mode to obtain the second predicted right view of each training sample comprises:
substituting the second predicted right depth map, the predicted camera intrinsic parameters and the predicted rotation-translation parameters into the preset camera imaging formula, to obtain second predicted mapping coordinates of the right view in the left view;
sampling the left view according to the second predicted mapping coordinates of the right view in the left view, to obtain the second predicted right view.
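The two steps above (projection by the camera imaging formula, then sampling) can be sketched as follows, assuming PyTorch with batched intrinsics `K` of shape (B, 3, 3) and rotation-translation matrices `T` of shape (B, 3, 4); using bilinear `grid_sample` for the sampling step is an assumption, not the patent's stated choice:

```python
import torch
import torch.nn.functional as F

# Hedged sketch: map every right-view pixel P_t into the left view via
# P_s ~ K T D_t(P_t) K^{-1} P_t, then bilinearly sample the left view there.
def warp(left, depth_right, K, T):
    B, _, H, W = left.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=left.device),
        torch.arange(W, device=left.device),
        indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], 0).float().reshape(1, 3, -1)   # P_t, homogeneous
    rays = torch.inverse(K) @ pix                                    # K^{-1} P_t
    points = rays * depth_right.reshape(B, 1, -1)                    # D_t(P_t) K^{-1} P_t
    points_h = torch.cat([points, torch.ones(B, 1, H * W, device=left.device)], 1)
    proj = K @ (T @ points_h)                                        # K T_{t->s} (...)
    z = proj[:, 2].clamp(min=1e-6)                                   # perspective divide
    u, v = proj[:, 0] / z, proj[:, 1] / z                            # mapping coordinates
    grid = torch.stack([2 * u / (W - 1) - 1, 2 * v / (H - 1) - 1], -1)
    return F.grid_sample(left, grid.reshape(B, H, W, 2), align_corners=True)
```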
Optionally, the initial depth map prediction network model is: a network based on the Visual Geometry Group (VGG) or U-net network structure;
the initial camera parameter prediction network model comprises: 5 convolutional layers followed by a split into an upper branch and a lower branch, each branch comprising 2 convolutional layers, 1 average pooling layer and 1 fully connected (FC) layer;
the preset loss function comprises: an SSIM+L1 loss function and a first-order gradient loss function;
a loss value $\mathcal{L}_1$ is obtained from the predicted right view and the true right view according to the SSIM+L1 loss function;
the SSIM+L1 loss function formula is:

$$\mathcal{L}_1 = \frac{1}{N}\sum_{n=1}^{N}\left[\alpha \, \frac{1-\mathrm{SSIM}\big(R_n,\hat{R}_n\big)}{2} + (1-\alpha)\,\big\lVert R_n-\hat{R}_n \big\rVert_1\right]$$

where $\mathcal{L}_1$ denotes the loss value; $N$ denotes that $N$ samples are taken each time; $R$ denotes the right view; the weight $\alpha$ is 0.85; $R_n$ denotes the true right view and $\hat{R}_n$ denotes the predicted right view; $\mathrm{SSIM}(R_n,\hat{R}_n)$ denotes the structural similarity between the predicted right view and the true right view; $\lVert R_n-\hat{R}_n\rVert_1$ denotes the L1 absolute value error between the predicted right view and the true right view;
a loss value $\mathcal{L}_2$ is obtained from the predicted right depth map and the actual right view according to the first-order gradient loss function;
the first-order gradient loss function formula is:

$$\mathcal{L}_2 = \frac{1}{N}\sum_{i,j}\big|\partial_x d_{i,j}\big|\, e^{-\left|\partial_x R_{i,j}\right|} + \big|\partial_y d_{i,j}\big|\, e^{-\left|\partial_y R_{i,j}\right|}$$

where $\mathcal{L}_2$ denotes the loss value; $\partial_x d_{i,j}$ denotes the first derivative of the right depth map in the x direction; $N$ denotes that $N$ samples are taken each time; $\partial_y d_{i,j}$ denotes the first derivative of the right depth map in the y direction; $\partial_x R_{i,j}$ denotes the first derivative of the right view in the x direction; $\partial_y R_{i,j}$ denotes the first derivative of the right view in the y direction; $i, j$ denote pixel coordinates;
the final third loss function is $\mathcal{L}_3 = \mathcal{L}_1 + \mathcal{L}_2$.
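Read literally, these two terms match a photometric SSIM+L1 term and an edge-aware first-order gradient term; the sketch below implements them under that reading, assuming PyTorch, a 3x3 mean-filter SSIM, and an unweighted sum for the final loss (the relative weighting of the two terms is an assumption):

```python
import torch
import torch.nn.functional as F

# Mean-filter SSIM over 3x3 windows (standard constants C1, C2).
def ssim_index(x, y, C1=0.01 ** 2, C2=0.03 ** 2):
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    var_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    cov = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * cov + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den

# SSIM+L1 term with alpha = 0.85, as in the formula above.
def ssim_l1_loss(pred_right, right, alpha=0.85):
    ssim_term = (1 - ssim_index(pred_right, right)) / 2
    l1_term = (pred_right - right).abs()
    return (alpha * ssim_term + (1 - alpha) * l1_term).mean()

# Edge-aware first-order gradient term: depth gradients are down-weighted
# where the right view itself has strong gradients.
def gradient_loss(depth, image):
    dx_d = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    dy_d = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    dx_i = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

# Final third loss L3 = L1 + L2 (1:1 weighting assumed).
def total_loss(pred_right, right, pred_depth_right):
    return ssim_l1_loss(pred_right, right) + gradient_loss(pred_depth_right, right)
```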
Optionally, the first monocular view is a left view, and the first predicted depth map is a first predicted left depth map;
the step of processing the first predicted depth map based on the first predicted depth map, the camera parameters of the 2D image, the preset camera imaging formula and the preset first sampling mode to obtain the second monocular view comprises:
processing the first predicted left depth map based on the first predicted left depth map, the camera parameters of the 2D image, the preset camera imaging formula and the preset first sampling mode, to obtain a first predicted right view;
the step of generating the 3D image based on the first monocular view and the second monocular view comprises:
generating the 3D image based on the left view and the first predicted right view.
Optionally, the camera parameters of the 2D image comprise: the camera intrinsic parameters and rotation-translation parameters of the 2D image;
the step of processing the first predicted left depth map based on the first predicted left depth map, the camera parameters of the 2D image, the preset camera imaging formula and the preset first sampling mode to obtain the first predicted right view comprises:
substituting the first predicted left depth map, the camera intrinsic parameters of the 2D image and the rotation-translation parameters into the preset camera imaging formula, to obtain first predicted mapping coordinates of the left view in the right view;
sampling the left view according to the first predicted mapping coordinates of the left view in the right view, to obtain the first predicted right view.
In a second aspect, an embodiment of the present invention provides a training method for a depth map prediction network model, the method comprising:
acquiring multiple different 3D film sources shot by different cameras as training samples, wherein each training sample includes a left view and a corresponding right view;
inputting the right views of a preset number of training samples into an initial depth map prediction network model, to obtain a second predicted right depth map of each training sample output by the initial depth map prediction network model;
inputting the left views and right views of the preset number of training samples into an initial camera parameter prediction network model, to obtain predicted camera parameters of each training sample output by the initial camera parameter prediction network model;
processing the second predicted right depth map based on the second predicted right depth map of each training sample, the predicted camera parameters of each training sample, a preset camera imaging formula and a preset second sampling mode, to obtain a second predicted right view of each training sample;
calculating a loss value according to a preset loss function, the true right view of each training sample and its corresponding second predicted right view;
judging, according to the loss value, whether the initial depth map prediction network model and the initial camera parameter prediction network model have both converged to stability;
if they have converged to stability, incrementing the training count and judging whether a preset number of training iterations has been reached; if the preset number of training iterations has not been reached, returning to the step of inputting the right views of the preset number of training samples into the initial depth map prediction network model to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model; if the preset number of training iterations has been reached, taking the current initial depth map prediction network model as the trained depth map prediction network model;
if they have not converged to stability, incrementing the training count, adjusting the network parameters of the initial depth map prediction network model and the initial camera parameter prediction network model, and returning to the step of inputting the right views of the preset number of training samples into the depth map prediction network model being trained, to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model.
Optionally,
the predicted camera parameters include: predicted camera intrinsic parameters and predicted rotation-translation parameters;
the preset camera imaging formula is:

$$P_s \sim K \, T_{t \to s} \, D_t(P_t) \, K^{-1} P_t$$

where $\sim$ denotes the mapping operation; $P_s$ is the coordinate of a binocular image reference point in the left view, $P_t$ is the coordinate of the binocular image reference point in the right view, $K$ is the camera intrinsic matrix, $K^{-1}$ is the inverse matrix of the camera intrinsic matrix, $D_t(P_t)$ is the depth at the point $P_t$, and $T_{t \to s}$ is the rotation-translation matrix; the camera intrinsic matrix comprises the 4 parameters $(f_x, f_y, x_0, y_0)$, where $f_x$ and $f_y$ are the camera focal lengths, and $x_0$ and $y_0$ are the principal point coordinates;
the step of processing the second predicted right depth map based on the second predicted right depth map of each training sample, the predicted camera parameters of each training sample, the preset camera imaging formula and the preset second sampling mode to obtain the second predicted right view of each training sample comprises:
substituting the second predicted right depth map, the predicted camera intrinsic parameters and the predicted rotation-translation parameters into the preset camera imaging formula, to obtain second predicted mapping coordinates of the right view in the left view;
sampling the left view according to the second predicted mapping coordinates of the right view in the left view, to obtain the second predicted right view.
Optionally,
the initial depth map prediction network model is: a network based on the Visual Geometry Group (VGG) or U-net network structure;
the initial camera parameter prediction network model comprises: 5 convolutional layers followed by a split into an upper branch and a lower branch, each branch comprising 2 convolutional layers, 1 average pooling layer and 1 FC layer;
the preset loss function comprises: an SSIM+L1 loss function and a first-order gradient loss function;
a loss value $\mathcal{L}_1$ is obtained from the predicted right view and the true right view according to the SSIM+L1 loss function;
the SSIM+L1 loss function formula is:

$$\mathcal{L}_1 = \frac{1}{N}\sum_{n=1}^{N}\left[\alpha \, \frac{1-\mathrm{SSIM}\big(R_n,\hat{R}_n\big)}{2} + (1-\alpha)\,\big\lVert R_n-\hat{R}_n \big\rVert_1\right]$$

where $\mathcal{L}_1$ denotes the loss value; $N$ denotes that $N$ samples are taken each time; $R$ denotes the right view; the weight $\alpha$ is 0.85; $R_n$ denotes the true right view and $\hat{R}_n$ denotes the predicted right view; $\mathrm{SSIM}(R_n,\hat{R}_n)$ denotes the structural similarity between the predicted right view and the true right view; $\lVert R_n-\hat{R}_n\rVert_1$ denotes the L1 absolute value error between the predicted right view and the true right view;
a loss value $\mathcal{L}_2$ is obtained from the predicted right depth map and the actual right view according to the first-order gradient loss function;
the first-order gradient loss function formula is:

$$\mathcal{L}_2 = \frac{1}{N}\sum_{i,j}\big|\partial_x d_{i,j}\big|\, e^{-\left|\partial_x R_{i,j}\right|} + \big|\partial_y d_{i,j}\big|\, e^{-\left|\partial_y R_{i,j}\right|}$$

where $\mathcal{L}_2$ denotes the loss value; $\partial_x d_{i,j}$ denotes the first derivative of the right depth map in the x direction; $N$ denotes that $N$ samples are taken each time; $\partial_y d_{i,j}$ denotes the first derivative of the right depth map in the y direction; $\partial_x R_{i,j}$ denotes the first derivative of the right view in the x direction; $\partial_y R_{i,j}$ denotes the first derivative of the right view in the y direction; $i, j$ denote pixel coordinates;
the final third loss function is $\mathcal{L}_3 = \mathcal{L}_1 + \mathcal{L}_2$.
In a third aspect, an embodiment of the present invention provides a depth map prediction method, the method comprising:
obtaining a first monocular view for which a depth map is to be predicted;
inputting the first monocular view into a pre-trained depth map prediction network model, the depth map prediction network model being obtained by training based on any of the above training methods, the first monocular view being a left view or a right view;
obtaining a first predicted depth map output by the depth map prediction network model.
In a fourth aspect, an embodiment of the present invention provides an apparatus for converting a two-dimensional (2D) image into a three-dimensional (3D) image, the apparatus comprising:
a first 2D image acquisition unit, configured to acquire a 2D image to be converted into a 3D image;
a first 2D image input unit, configured to input the 2D image, as a first monocular view for generating the 3D image, into a pre-trained depth map prediction network model, the depth map prediction network model being obtained by training an initial depth map prediction network model and an initial camera parameter prediction network model based on multiple different 3D film source samples, the first monocular view being a left view or a right view;
a first predicted depth map acquisition unit, configured to obtain a first predicted depth map output by the depth map prediction network model;
a second monocular view acquisition unit, configured to process the first predicted depth map based on the first predicted depth map, the camera parameters of the 2D image, a preset camera imaging formula and a preset first sampling mode to obtain a second monocular view, the second monocular view being a right view or a left view corresponding to the first monocular view;
a 3D image generation unit, configured to generate the 3D image based on the first monocular view and the second monocular view.
Optionally, the apparatus includes a depth map prediction network model training unit; the depth map prediction network model training unit comprises:
a training sample acquisition module, configured to acquire multiple different 3D film sources shot by different cameras as training samples, wherein each training sample includes a left view and a corresponding right view;
a second predicted right depth map acquisition module, configured to input the right views of a preset number of training samples into the initial depth map prediction network model, to obtain a second predicted right depth map of each training sample output by the initial depth map prediction network model;
a predicted camera parameter acquisition module, configured to input the left views and right views of the preset number of training samples into the initial camera parameter prediction network model, to obtain predicted camera parameters of each training sample output by the initial camera parameter prediction network model;
a second predicted right view acquisition module, configured to process the second predicted right depth map based on the second predicted right depth map of each training sample, the predicted camera parameters of each training sample, the preset camera imaging formula and the preset second sampling mode, to obtain a second predicted right view of each training sample;
a loss value calculation module, configured to calculate a loss value according to the preset loss function, the true right view of each training sample and its corresponding second predicted right view;
a convergence judgment module, configured to judge, according to the loss value, whether the initial depth map prediction network model and the initial camera parameter prediction network model have both converged to stability;
a training count judgment module, configured to, if they have converged to stability, increment the training count and judge whether a preset number of training iterations has been reached; if the preset number of training iterations has not been reached, trigger the second predicted right depth map acquisition module to input the right views of the preset number of training samples into the initial depth map prediction network model to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model; if the preset number of training iterations has been reached, take the current initial depth map prediction network model as the trained depth map prediction network model;
a network parameter adjustment module, configured to, if they have not converged to stability, increment the training count, adjust the network parameters of the initial depth map prediction network model and the initial camera parameter prediction network model, and trigger the second predicted right depth map acquisition module to input the right views of the preset number of training samples into the depth map prediction network model being trained, to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model.
Optionally, the predicted camera parameters include: predicted camera intrinsic parameters and predicted rotation-translation parameters;
the preset camera imaging formula is:

$$P_s \sim K \, T_{t \to s} \, D_t(P_t) \, K^{-1} P_t$$

where $\sim$ denotes the mapping operation; $P_s$ is the coordinate of a binocular image reference point in the left view, $P_t$ is the coordinate of the binocular image reference point in the right view, $K$ is the camera intrinsic matrix, $K^{-1}$ is the inverse matrix of the camera intrinsic matrix, $D_t(P_t)$ is the depth at the point $P_t$, and $T_{t \to s}$ is the rotation-translation matrix; the camera intrinsic matrix comprises the 4 parameters $(f_x, f_y, x_0, y_0)$, where $f_x$ and $f_y$ are the camera focal lengths, and $x_0$ and $y_0$ are the principal point coordinates;
the second predicted right view acquisition module is specifically configured to:
substitute the second predicted right depth map, the predicted camera intrinsic parameters and the predicted rotation-translation parameters into the preset camera imaging formula, to obtain second predicted mapping coordinates of the right view in the left view;
sample the left view according to the second predicted mapping coordinates of the right view in the left view, to obtain the second predicted right view.
Optionally, the initial depth map prediction network model is: a network based on the Visual Geometry Group (VGG) or U-net network structure;
the initial camera parameter prediction network model comprises: 5 convolutional layers followed by a split into an upper branch and a lower branch, each branch comprising 2 convolutional layers, 1 average pooling layer and 1 fully connected (FC) layer;
the preset loss function comprises: an SSIM+L1 loss function and a first-order gradient loss function;
a loss value $\mathcal{L}_1$ is obtained from the predicted right view and the true right view according to the SSIM+L1 loss function;
the SSIM+L1 loss function formula is:

$$\mathcal{L}_1 = \frac{1}{N}\sum_{n=1}^{N}\left[\alpha \, \frac{1-\mathrm{SSIM}\big(R_n,\hat{R}_n\big)}{2} + (1-\alpha)\,\big\lVert R_n-\hat{R}_n \big\rVert_1\right]$$

where $\mathcal{L}_1$ denotes the loss value; $N$ denotes that $N$ samples are taken each time; $R$ denotes the right view; the weight $\alpha$ is 0.85; $R_n$ denotes the true right view and $\hat{R}_n$ denotes the predicted right view; $\mathrm{SSIM}(R_n,\hat{R}_n)$ denotes the structural similarity between the predicted right view and the true right view; $\lVert R_n-\hat{R}_n\rVert_1$ denotes the L1 absolute value error between the predicted right view and the true right view;
a loss value $\mathcal{L}_2$ is obtained from the predicted right depth map and the actual right view according to the first-order gradient loss function;
the first-order gradient loss function formula is:

$$\mathcal{L}_2 = \frac{1}{N}\sum_{i,j}\big|\partial_x d_{i,j}\big|\, e^{-\left|\partial_x R_{i,j}\right|} + \big|\partial_y d_{i,j}\big|\, e^{-\left|\partial_y R_{i,j}\right|}$$

where $\mathcal{L}_2$ denotes the loss value; $\partial_x d_{i,j}$ denotes the first derivative of the right depth map in the x direction; $N$ denotes that $N$ samples are taken each time; $\partial_y d_{i,j}$ denotes the first derivative of the right depth map in the y direction; $\partial_x R_{i,j}$ denotes the first derivative of the right view in the x direction; $\partial_y R_{i,j}$ denotes the first derivative of the right view in the y direction; $i, j$ denote pixel coordinates;
the final third loss function is $\mathcal{L}_3 = \mathcal{L}_1 + \mathcal{L}_2$.
Optionally, the first monocular view is a left view, and the first predicted depth map is a first predicted left depth map;
the second monocular view acquisition unit comprises: a first predicted right view acquisition module;
the first predicted right view acquisition module is configured to process the first predicted left depth map based on the first predicted left depth map, the camera parameters of the 2D image, the preset camera imaging formula and the preset first sampling mode, to obtain a first predicted right view;
the 3D image generation unit is specifically configured to: generate the 3D image based on the left view and the first predicted right view.
Optionally, the camera parameters of the 2D image comprise: the camera intrinsic parameters and rotation-translation parameters of the 2D image;
the first predicted right view acquisition module is specifically configured to:
substitute the first predicted left depth map, the camera intrinsic parameters of the 2D image and the rotation-translation parameters into the preset camera imaging formula, to obtain first predicted mapping coordinates of the left view in the right view;
sample the left view according to the first predicted mapping coordinates of the left view in the right view, to obtain the first predicted right view.
In a fifth aspect, an embodiment of the present invention provides a training apparatus for a depth map prediction network model, the apparatus comprising:
a training sample acquisition module, configured to acquire multiple different 3D film sources shot by different cameras as training samples, wherein each training sample includes a left view and a corresponding right view;
a second predicted right depth map acquisition module, configured to input the right views of a preset number of training samples into the initial depth map prediction network model, to obtain a second predicted right depth map of each training sample output by the initial depth map prediction network model;
a predicted camera parameter acquisition module, configured to input the left views and right views of the preset number of training samples into the initial camera parameter prediction network model, to obtain predicted camera parameters of each training sample output by the initial camera parameter prediction network model;
a second predicted right view acquisition module, configured to process the second predicted right depth map based on the second predicted right depth map of each training sample, the predicted camera parameters of each training sample, the preset camera imaging formula and the preset second sampling mode, to obtain a second predicted right view of each training sample;
a loss value calculation module, configured to calculate a loss value according to the preset loss function, the true right view of each training sample and its corresponding second predicted right view;
a convergence judgment module, configured to judge, according to the loss value, whether the initial depth map prediction network model and the initial camera parameter prediction network model have both converged to stability;
a training count judgment module, configured to, if they have converged to stability, increment the training count and judge whether a preset number of training iterations has been reached; if the preset number of training iterations has not been reached, trigger the second predicted right depth map acquisition module to input the right views of the preset number of training samples into the initial depth map prediction network model to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model; if the preset number of training iterations has been reached, take the current initial depth map prediction network model as the trained depth map prediction network model;
a network parameter adjustment module, configured to, if they have not converged to stability, increment the training count, adjust the network parameters of the initial depth map prediction network model and the initial camera parameter prediction network model, and trigger the second predicted right depth map acquisition module to input the right views of the preset number of training samples into the depth map prediction network model being trained, to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model.
Optionally,
the predicted camera parameters include: predicted camera intrinsic parameters and predicted rotation-translation parameters;
the preset camera imaging formula is:

$$P_s \sim K \, T_{t \to s} \, D_t(P_t) \, K^{-1} P_t$$

where $\sim$ denotes the mapping operation; $P_s$ is the coordinate of a binocular image reference point in the left view, $P_t$ is the coordinate of the binocular image reference point in the right view, $K$ is the camera intrinsic matrix, $K^{-1}$ is the inverse matrix of the camera intrinsic matrix, $D_t(P_t)$ is the depth at the point $P_t$, and $T_{t \to s}$ is the rotation-translation matrix; the camera intrinsic matrix comprises the 4 parameters $(f_x, f_y, x_0, y_0)$, where $f_x$ and $f_y$ are the camera focal lengths, and $x_0$ and $y_0$ are the principal point coordinates;
the second predicted right view acquisition module is specifically configured to:
substitute the second predicted right depth map, the predicted camera intrinsic parameters and the predicted rotation-translation parameters into the preset camera imaging formula, to obtain second predicted mapping coordinates of the right view in the left view;
sample the left view according to the second predicted mapping coordinates of the right view in the left view, to obtain the second predicted right view.
Optionally,
the initial depth map prediction network model is: a network based on the Visual Geometry Group (VGG) or U-net network structure;
the initial camera parameter prediction network model comprises: 5 convolutional layers followed by a split into an upper branch and a lower branch, each branch comprising 2 convolutional layers, 1 average pooling layer and 1 fully connected (FC) layer;
the preset loss function comprises: an SSIM+L1 loss function and a first-order gradient loss function;
a loss value $\mathcal{L}_1$ is obtained from the predicted right view and the true right view according to the SSIM+L1 loss function;
the SSIM+L1 loss function formula is:

$$\mathcal{L}_1 = \frac{1}{N}\sum_{n=1}^{N}\left[\alpha \, \frac{1-\mathrm{SSIM}\big(R_n,\hat{R}_n\big)}{2} + (1-\alpha)\,\big\lVert R_n-\hat{R}_n \big\rVert_1\right]$$

where $\mathcal{L}_1$ denotes the loss value; $N$ denotes that $N$ samples are taken each time; $R$ denotes the right view; the weight $\alpha$ is 0.85; $R_n$ denotes the true right view and $\hat{R}_n$ denotes the predicted right view; $\mathrm{SSIM}(R_n,\hat{R}_n)$ denotes the structural similarity between the predicted right view and the true right view; $\lVert R_n-\hat{R}_n\rVert_1$ denotes the L1 absolute value error between the predicted right view and the true right view;
a loss value $\mathcal{L}_2$ is obtained from the predicted right depth map and the actual right view according to the first-order gradient loss function;
the first-order gradient loss function formula is:

$$\mathcal{L}_2 = \frac{1}{N}\sum_{i,j}\big|\partial_x d_{i,j}\big|\, e^{-\left|\partial_x R_{i,j}\right|} + \big|\partial_y d_{i,j}\big|\, e^{-\left|\partial_y R_{i,j}\right|}$$

where $\mathcal{L}_2$ denotes the loss value; $\partial_x d_{i,j}$ denotes the first derivative of the right depth map in the x direction; $N$ denotes that $N$ samples are taken each time; $\partial_y d_{i,j}$ denotes the first derivative of the right depth map in the y direction; $\partial_x R_{i,j}$ denotes the first derivative of the right view in the x direction; $\partial_y R_{i,j}$ denotes the first derivative of the right view in the y direction; $i, j$ denote pixel coordinates;
the final third loss function is $\mathcal{L}_3 = \mathcal{L}_1 + \mathcal{L}_2$.
In a sixth aspect, an embodiment of the present invention provides a depth map prediction apparatus, the apparatus comprising:
a first monocular view acquisition unit, configured to obtain a first monocular view for which a depth map is to be predicted;
a first monocular view input unit, configured to input the first monocular view into a pre-trained depth map prediction network model, the depth map prediction network model being obtained by training based on the above training method, the first monocular view being a left view or a right view;
a first depth map acquisition unit, configured to obtain a first predicted depth map output by the depth map prediction network model.
In a seventh aspect, an embodiment of the present invention provides an electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;
the memory is configured to store a computer program;
the processor is configured to, when executing the program stored in the memory, implement the steps of any of the above methods for converting a two-dimensional (2D) image into a three-dimensional (3D) image.
In an eighth aspect, an embodiment of the present invention provides an electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;
the memory is configured to store a computer program;
the processor is configured to, when executing the program stored in the memory, implement the steps of the training method of any of the above depth map prediction network models.
In a ninth aspect, an embodiment of the present invention provides an electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;
the memory is configured to store a computer program;
the processor is configured to, when executing the program stored in the memory, implement the steps of the depth map prediction method of any of the above images.
An embodiment of the present invention further provides a computer-readable storage medium having a computer program stored therein, where the computer program, when executed by a processor, implements the steps of any of the above methods for converting a 2D image into a 3D image; or implements the steps of the training method of any of the above depth map prediction network models; or implements the steps of the depth map prediction method of any of the above images.
An embodiment of the present invention further provides a computer program product comprising instructions which, when run on a computer, cause the computer to execute any of the above methods for converting a 2D image into a 3D image; or implement the training method of any of the above depth map prediction network models; or implement the depth map prediction method of any of the above images.
Beneficial effects of embodiments of the present invention:
With the image conversion, depth map prediction and model training methods, apparatuses and electronic devices provided by embodiments of the present invention, a 2D image to be converted into a 3D image can be acquired; the 2D image is input, as a first monocular view for generating the 3D image, into a pre-trained depth map prediction network model; the depth map prediction network model is obtained by training an initial depth map prediction network model and an initial camera parameter prediction network model based on multiple different 3D film source samples; the first monocular view is a left view or a right view; a first predicted depth map output by the depth map prediction network model is obtained; the first predicted depth map is processed based on the first predicted depth map, the camera parameters of the 2D image, a preset camera imaging formula and a preset first sampling mode, to obtain a second monocular view; the second monocular view is a right view or a left view corresponding to the first monocular view; and the 3D image is generated based on the first monocular view and the second monocular view. It can be seen that, with embodiments of the present invention, training the initial depth map prediction network model based on 3D images requires no ground-truth depth map supervision; the depth map prediction network model can be trained, and a 2D image can be converted into a 3D image.
Of course, implementing any product or method of the present invention does not necessarily require achieving all of the above advantages at the same time.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a flowchart of a method for converting a two-dimensional (2D) image into a three-dimensional (3D) image according to an embodiment of the present invention;
Fig. 2 is a structural diagram of a depth prediction network according to an embodiment of the present invention;
Fig. 3 is a training principle diagram of a depth map prediction network model and a camera parameter prediction network according to an embodiment of the present invention;
Fig. 4 is a structural diagram of a camera parameter prediction network according to an embodiment of the present invention;
Fig. 5 is a training flowchart of a depth map prediction network model according to an embodiment of the present invention;
Fig. 6 is a flowchart of a depth map prediction method for an image according to an embodiment of the present invention;
Fig. 7 is a structural schematic diagram of an apparatus for converting a 2D image into a 3D image according to an embodiment of the present invention;
Fig. 8 is a structural schematic diagram of a training apparatus for a depth map prediction network model according to an embodiment of the present invention;
Fig. 9 is a structural schematic diagram of a depth map prediction apparatus for an image according to an embodiment of the present invention;
Fig. 10 is a structural schematic diagram of an electronic device according to an embodiment of the present invention;
Fig. 11 is a structural schematic diagram of another electronic device according to an embodiment of the present invention;
Fig. 12 is a structural schematic diagram of yet another electronic device according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
In order to train a depth map prediction network model without ground-truth depth map supervision and to convert a 2D image into a 3D image, embodiments of the present invention provide image conversion, depth map prediction and model training methods, apparatuses and electronic devices.
The image conversion, depth map prediction and model training methods provided by embodiments of the present invention can be applied to any electronic device that needs to perform image conversion, depth map prediction and model training, such as a computer or a mobile terminal, which is not specifically limited here. For convenience, it is hereinafter referred to as the electronic device.
The method for converting a two-dimensional (2D) image into a three-dimensional (3D) image provided by an embodiment of the present invention, as shown in Fig. 1, specifically includes the following processing flow:
Step S101: acquire a 2D image to be converted into a 3D image.
In an implementation, the electronic device can acquire a 2D image to be converted into a 3D image.
Step S102: input the 2D image, as a first monocular view for generating the 3D image, into a pre-trained depth map prediction network model; the depth map prediction network model is obtained by training an initial depth map prediction network model and an initial camera parameter prediction network model based on multiple different 3D film source samples; the first monocular view is a left view or a right view.
In an implementation, when the 2D image is used as the first monocular view for generating the 3D image, the 2D image can be used as the left view for generating the 3D image.
The depth map prediction network model used in this embodiment can be a network based on the VGG (Visual Geometry Group) or U-net network structure as shown in Fig. 2, comprising: an encoding part and a decoding part; the encoding part passes through 14 convolutional layers, and the decoding part passes through 14 convolutional layers.
The encoding and decoding of the encoding and decoding parts of the depth map prediction network model of the embodiment of the present invention can be seen in the encoding/decoding table shown in Table 1.
Table 1
As shown in Table 1, the encoding part comprises a first cascaded down-sampling network through a seventh cascaded down-sampling network. Each down-sampling cascade network includes two convolutional layers; of course, the structure of the cascade networks can be adjusted according to actual needs.
In an implementation, taking the right view as an example, the encoding part performs channel-increasing and size-reducing processing on the right view in the sample through two convolutions respectively, obtaining the encoded down-sampled image output by the last convolutional layer. As shown in Table 1, a right view of size 256*512*3 is input into the first cascaded down-sampling network, where 256 can denote the width of the right view, 512 can denote the height of the right view, and 3 can denote the number of channels of the right view. The first cascaded down-sampling network includes conv1 (the first convolutional layer) and conv2 (the second convolutional layer); conv1 performs dimension-increasing convolution processing on the 256*512*3 right view to obtain feature map 1 of 256*512*32, and conv2 performs size-reducing convolution processing on feature map 1 to obtain feature map 2 of 128*256*32; feature map 2 is then convolved by conv3 (the third convolutional layer) to obtain feature map 3 of 128*256*64. And so on, until conv14 (the fourteenth convolutional layer) convolution processing finally yields a down-sampled image of 2*4*512. The down-sampled image is then passed to the decoding part.
The decoding part comprises a first cascaded up-sampling network through a seventh cascaded up-sampling network. Each up-sampling cascade network includes an up-sampling operation and two convolutional layers; of course, the structure of the cascade networks can be adjusted according to actual needs. Each up-sampling cascade network includes a size-increasing up-sampling bilinear interpolation and two convolutional layers, one of which performs dimension-reduction processing while the other convolution does not.
The decoding part performs the first up-sampling on the down-sampled image obtained from the encoding part: bilinear interpolation increases the size of the 2*4*512 image to obtain intermediate up-sampled image 1 of 4*8*512; conv1 (the first convolutional layer) convolves intermediate up-sampled image 1 to obtain up-sampled feature map 1 of 4*8*512, and up-sampled feature map 1 is then convolved by conv2 (the second convolutional layer) to obtain up-sampled feature map 2. Note that the two convolutions here perform no channel-reduction processing; this is required by the model and can be adjusted according to the actual situation.
Feature map 2 then undergoes the second size-increasing up-sampling bilinear interpolation to obtain intermediate up-sampled image 2 of 8*16*512; conv3 (the third convolutional layer) convolves intermediate up-sampled image 2 to obtain up-sampled feature map 3 of 8*16*512, and up-sampled feature map 3 is then convolved by conv4 (the fourth convolutional layer) to obtain up-sampled feature map 4. Note that the two convolutions here perform no channel-reduction processing; this is required by the model and can be adjusted according to the actual situation.
Feature map 4 then undergoes the third size-increasing up-sampling bilinear interpolation to obtain intermediate up-sampled image 3 of 16*32*512; conv5 (the fifth convolutional layer) performs channel-reducing convolution on intermediate up-sampled image 3 to obtain up-sampled feature map 5 of 16*32*512, and up-sampled feature map 5 is then convolved by conv6 (the sixth convolutional layer) to obtain up-sampled feature map 6. And so on. Note that a right depth map is output at Conv8, Conv10, Conv12 and Conv14 respectively, shown as Conv8_out, Conv10_out, Conv12_out and Conv14_out in the table; this is equivalent to one sample outputting 4 predicted right depth maps, and the loss values are finally averaged over these 4 predicted right depth maps.
It should be noted that seven cascaded sampling networks are provided in this optional embodiment of the present invention; in the actual implementation process, more or fewer than seven cascaded sampling networks can be configured according to the specific requirements of the implementing personnel.
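As an illustration of this encoder-decoder layout, the following is a minimal sketch assuming PyTorch; kernel sizes, strides and the exact channel widths are assumptions consistent with the sizes quoted above (256*512*3 down to 2*4*512 and back), and the four multi-scale outputs at Conv8/Conv10/Conv12/Conv14 are collapsed into a single output head for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of the 14+14-layer encoder-decoder depth network described
# above; not the patent's exact configuration.
class DepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        enc_chans = [3, 32, 64, 128, 256, 512, 512, 512]   # 7 stages x 2 convs = 14 layers
        enc = []
        for c_in, c_out in zip(enc_chans[:-1], enc_chans[1:]):
            enc += [nn.Conv2d(c_in, c_out, 3, 1, 1), nn.ELU(),   # increase channels
                    nn.Conv2d(c_out, c_out, 3, 2, 1), nn.ELU()]  # halve the size
        self.encoder = nn.Sequential(*enc)                 # 256x512x3 -> 2x4x512

        dec_chans = [512, 512, 512, 256, 128, 64, 32, 16]  # 7 stages x 2 convs = 14 layers
        self.decoder = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, 1, 1), nn.ELU(),
                          nn.Conv2d(c_out, c_out, 3, 1, 1), nn.ELU())
            for c_in, c_out in zip(dec_chans[:-1], dec_chans[1:])])
        self.head = nn.Conv2d(dec_chans[-1], 1, 3, 1, 1)

    def forward(self, x):
        x = self.encoder(x)
        for block in self.decoder:                         # bilinear upsample, then convolve
            x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=True)
            x = block(x)
        return F.softplus(self.head(x))                    # softplus output layer, depth > 0
```

Under these assumptions, `DepthNet()(torch.rand(1, 3, 256, 512))` returns a 1*1*256*512 predicted depth map.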
Step S103: obtain the first predicted depth map output by the depth map prediction network model.
In a specific embodiment, if the first monocular view is used as the left view, the first predicted depth map is the first predicted left depth map.
Step S104: process the first predicted depth map based on the first predicted depth map, the camera parameters of the 2D image, the preset camera imaging formula and the preset first sampling mode, to obtain the second monocular view; the second monocular view is a right view or a left view corresponding to the first monocular view.
In an implementation, if the first monocular view is used as the left view, the first predicted left depth map can be processed based on the first predicted left depth map, the camera parameters of the 2D image, the preset camera imaging formula and the preset first sampling mode, to obtain a first predicted right view. The camera parameters of the 2D image can be the camera parameters obtained during model training, or can be preset by the user.
In a specific embodiment, if the acquired 2D image to be converted into a 3D image is used as the left view, the first predicted left depth map, the camera intrinsic parameters of the 2D image and the rotation-translation parameters can be substituted into the preset camera imaging formula to obtain first predicted mapping coordinates of the left view in the right view; the left view is then sampled according to the first predicted mapping coordinates of the left view in the right view, to obtain the first predicted right view.
The preset camera imaging formula can be the camera motion imaging formula:

$$P_s \sim K \, T_{t \to s} \, D_t(P_t) \, K^{-1} P_t$$

where $\sim$ denotes the mapping operation; $P_s$ is the coordinate of a binocular image reference point in the left view, $P_t$ is the coordinate of the binocular image reference point in the right view, $K$ is the camera intrinsic matrix, $K^{-1}$ is the inverse matrix of the camera intrinsic matrix, $D_t(P_t)$ is the depth at the point $P_t$, and $T_{t \to s}$ is the rotation-translation matrix; the camera intrinsic matrix comprises the 4 parameters $(f_x, f_y, x_0, y_0)$, where $f_x$ and $f_y$ are the camera focal lengths, and $x_0$ and $y_0$ are the principal point coordinates.
The camera parameters of the 2D image may include: the camera intrinsic parameters and rotation-translation parameters of the 2D image. In an implementation, if the first monocular view is a right view, the second monocular view is the left view corresponding to the first monocular view; this is not specifically limited here.
Step S105: generate the 3D image based on the first monocular view and the second monocular view.
In a specific embodiment, the 3D image can be generated based on the left view and the first predicted right view. The left view can be used as the picture seen by the left eye and the right view as the picture seen by the right eye; the left view and the right view are watched through existing 3D equipment to obtain the 3D video corresponding to the 2D video data to be converted. Alternatively, the left view and the right view are processed in an existing manner of processing a left view and a right view into a 3D video, to obtain the 3D video corresponding to the 2D video data to be converted. This is not specifically limited in the embodiment of the present invention.
It can be seen that, with the embodiment of the present invention, training the initial depth map prediction network model based on 3D images requires no ground-truth depth map supervision; the depth map prediction network model can be trained, and a 2D image can be converted into a 3D image.
In addition, the second monocular view in this embodiment is obtained by processing the first predicted depth map based on the first predicted depth map, the camera parameters of the 2D image, the preset camera imaging formula and the preset first sampling mode. The camera parameters are referenced during prediction, so that the obtained predicted right view is more realistic.
The training schematic diagram of the depth map prediction network model and the camera parameter prediction network provided by an embodiment of the present invention, as shown in Fig. 3, may include:
301 is the depth map prediction network model, whose network structure is shown in Fig. 2; 302 is the camera parameter prediction network model, whose network structure is shown in Fig. 4. The right views R of a preset number of training samples are input into the initial depth map prediction network model to obtain the second predicted right depth map Z' of each training sample output by the initial depth map prediction network model. At the same time, the left views and right views of the preset number of training samples input into the depth map prediction network model are first concatenated (concat) into a 6-channel image and input into the initial camera parameter prediction network model, to obtain the 10 predicted camera parameters of each training sample output by the initial camera parameter prediction network model, including: the camera intrinsic matrix K, the inverse camera intrinsic matrix K^{-1} and the rotation-translation matrix T. According to the predicted camera parameters and the predicted right depth map Z', and according to the preset camera imaging formula $P_s \sim K\,T_{t \to s}\,D_t(P_t)\,K^{-1}P_t$, where $\sim$ is the mapping operation, the mapping coordinates of the right view predicted in the left view are obtained, and the pixels of the left view are obtained from the left view by the sampler; according to the coordinates in the right view corresponding to the same pixels in the left view, the predicted right view is obtained. Loss values are calculated from the predicted right view and the true right view R according to the preset loss function, and whether the depth map prediction network model and the camera parameter prediction network model have converged to stability is judged according to the loss values.
The structure of the camera parameter prediction network is shown in Fig. 4, comprising: 5 convolutional layers followed by a split into an upper branch and a lower branch, each branch comprising 2 convolutional layers, 1 average pooling layer and 1 fully connected (FC) layer. First, the image of size 256*512*6 formed by splicing the left view with the corresponding right view (where 6 is the number of channels) is input into the down-sampling cascade network; each pass through a down-sampling cascade network reduces the size and increases the number of channels, yielding one down-sampled image, and each down-sampling cascade network can have one convolutional layer that reduces size and increases the channel count after convolution. After 5 down-sampling cascade networks of size reduction and channel increase, a fifth down-sampled image of 8*16*512 is obtained. The fifth down-sampled image is split into an upper branch and a lower branch. One branch increases dimension and reduces size through 1 convolutional layer to obtain a 4*8*512 image; the 4*8*512 image then passes through 1 convolutional layer to increase dimension and reduce size, obtaining a 2*4*1024 image; a 1*1*1024 image is obtained using 1 average pooling layer, and a first 1*1*6 fully connected image is obtained using 1 fully connected layer (FC), outputting 6 parameters which can be the rotation-translation parameters. The other branch increases dimension and reduces size through 1 convolutional layer to obtain a 4*8*512 image; the 4*8*512 image then passes through 1 convolutional layer to increase dimension and reduce size, obtaining a 2*4*1024 image; a 1*1*1024 image is obtained using 1 average pooling layer, and a second 1*1*4 fully connected image is then obtained through 1 fully connected layer (FC), outputting 4 parameters which can be the camera intrinsic parameters: the camera focal lengths fx and fy and the principal point coordinates x0 and y0. The activation function of the output layer is the softplus function, or there is no activation function; the remaining layers use the relu activation function.
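As an illustration of this layout, the following is a minimal sketch assuming PyTorch; kernel sizes, strides and intermediate channel widths are assumptions chosen to reproduce the quoted sizes (256*512*6 input, 8*16*512 shared feature, 2*4*1024 per branch), not the patent's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of the camera parameter prediction network: 5 shared
# convolutional layers, then two branches of 2 convolutions, average
# pooling and one FC layer each; one branch emits 6 rotation-translation
# parameters, the other the 4 intrinsics (fx, fy, x0, y0).
class CamNet(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [6, 32, 64, 128, 256, 512]
        trunk = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            trunk += [nn.Conv2d(c_in, c_out, 3, 2, 1), nn.ReLU()]  # halve size, widen
        self.trunk = nn.Sequential(*trunk)                         # 256x512x6 -> 8x16x512

        def branch(n_out):
            return nn.Sequential(
                nn.Conv2d(512, 512, 3, 2, 1), nn.ReLU(),           # 8x16x512 -> 4x8x512
                nn.Conv2d(512, 1024, 3, 2, 1), nn.ReLU(),          # 4x8x512 -> 2x4x1024
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),             # average pool -> 1024
                nn.Linear(1024, n_out))                            # the FC layer
        self.pose_head = branch(6)                                 # rotation-translation
        self.intr_head = branch(4)                                 # fx, fy, x0, y0

    def forward(self, left_right_concat):                          # 6-channel concat input
        feat = self.trunk(left_right_concat)
        pose = self.pose_head(feat)                                # no output activation
        intr = F.softplus(self.intr_head(feat))                    # softplus output layer
        return pose, intr
```

Under these assumptions, `pose, intr = CamNet()(torch.cat([left, right], dim=1))` returns the 6 + 4 = 10 predicted camera parameters mentioned above; building the matrices K and T from them is a separate step.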
The training method of the depth map prediction network model provided in an embodiment of the present invention can be as shown in Figure 5. Figure 5 is a training flow chart of the depth map prediction network model in a method for converting a two-dimensional 2D image into a three-dimensional 3D image according to an embodiment of the present invention, comprising:

Step S501: obtain multiple different 3D film sources shot by different cameras as training samples; wherein each training sample includes a left view and a corresponding right view.
Step S502: input the right views of a preset number of training samples into the initial depth map prediction network model, to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model.

In one implementation, the right views of 8 training samples may be input into the initial depth map prediction network model to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model. The initial depth map prediction network model may be a network based on the Visual Geometry Group (VGG) or U-net network structure;

the encoding part may include 14 convolutional layers and the decoding part may include 14 convolutional layers; see Table 1 for details. The activation functions may be the softplus function and the elu function, where the softplus function can be used as the output-layer activation and the remaining layers use the elu activation function.
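Table 1 itself is not reproduced here; as a rough, abbreviated sketch under assumed channel widths, a U-net-style encoder-decoder with elu activations internally and a softplus output could look like the following (the model described above has 14 encoder and 14 decoder convolutions; only a few are shown):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthNet(nn.Module):
    """Abbreviated U-net-style depth predictor. elu is used on internal
    layers and softplus on the output, as stated in the text; layer counts
    and channel widths here are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Conv2d(3, 32, 3, stride=2, padding=1)
        self.enc2 = nn.Conv2d(32, 64, 3, stride=2, padding=1)
        self.dec1 = nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1)
        self.dec2 = nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1)  # 64 = 32 dec + 32 skip

    def forward(self, x):
        e1 = F.elu(self.enc1(x))
        e2 = F.elu(self.enc2(e1))
        d1 = F.elu(self.dec1(e2))
        d1 = torch.cat([d1, e1], dim=1)      # U-net skip connection
        return F.softplus(self.dec2(d1))     # positive-valued depth map
```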
Step S503: input the left views and right views of the preset number of training samples into the initial camera parameter prediction network model, to obtain the predicted camera parameters of each training sample output by the initial camera parameter prediction network model.

In one implementation, the predicted camera parameters may include: predicted camera intrinsic parameters and predicted rotation-translation parameters, where the predicted camera intrinsic parameters are output through the softplus activation function and the predicted rotation-translation parameters may be output without an activation function. The preset camera imaging formula may be: Ps ~ K · Tt→s · Dt(Pt) · K⁻¹ · Pt, where ~ denotes the mapping operation; Ps is the coordinate of a binocular-image reference point in the left view, Pt is the coordinate of the binocular-image reference point in the right view, K is the camera intrinsic matrix, K⁻¹ is the inverse of the camera intrinsic matrix, Dt(Pt) is the depth at the point Pt, and Tt→s is the rotation-translation matrix. The camera intrinsic matrix contains the 4 parameters (fx, fy, x0, y0), where fx and fy are the camera focal lengths and x0 and y0 are the principal point coordinates.
Step S504: process the second predicted right depth map based on the second predicted right depth map of each training sample, the predicted camera parameters of each training sample, the preset camera imaging formula and a preset second sampling mode, to obtain the second predicted right view of each training sample.

In a specific embodiment, the second predicted right depth map, the predicted camera intrinsic parameters and the predicted rotation-translation parameters may first be substituted into the preset camera imaging formula to obtain the second predicted mapping coordinates of the right view in the left view; the left view is then sampled according to the second predicted mapping coordinates of the right view in the left view, to obtain the second predicted right view.
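A hedged sketch of this projection-and-sampling step follows: each right-view pixel Pt with predicted depth Dt(Pt) is mapped through K · Tt→s · Dt(Pt) · K⁻¹ · Pt to a coordinate in the left view, and the left view is sampled bilinearly at those coordinates. The use of torch.nn.functional.grid_sample and the 4x4 / 3x3 matrix shapes are implementation assumptions, not details taken from the text.

```python
import torch
import torch.nn.functional as F

def warp_left_to_right(left, right_depth, K, K_inv, T):
    """Synthesize a predicted right view: back-project each right-view pixel
    with its predicted depth, transform by the rotation-translation matrix,
    re-project with K, and bilinearly sample the left view at the resulting
    coordinates. A sketch; assumes per-batch 3x3 K, 3x3 K_inv and 4x4 T."""
    b, _, h, w = right_depth.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().reshape(1, 3, -1)

    cam = (K_inv @ pix) * right_depth.reshape(b, 1, -1)   # Dt(Pt) * K^-1 * Pt
    cam_h = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)  # homogeneous points
    proj = K @ (T @ cam_h)[:, :3]                         # K * Tt->s * (...)
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)       # Ps, divided by depth

    # scale pixel coordinates to [-1, 1] for grid_sample
    u = uv[:, 0] / (w - 1) * 2 - 1
    v = uv[:, 1] / (h - 1) * 2 - 1
    grid = torch.stack([u, v], dim=-1).reshape(b, h, w, 2)
    return F.grid_sample(left, grid, align_corners=True)  # bilinear sampler
```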
Step S505: calculate loss values according to the preset loss function, the true right view in each training sample and its corresponding second predicted right view.

The preset loss function includes: an SSIM+L1 loss function and a first-order gradient loss function;
a loss value Ls is obtained from the predicted right view and the true right view according to the SSIM+L1 loss function;

the SSIM+L1 loss function formula is:

Ls = (1/N) Σ [ α · (1 − SSIM(R, R̂)) / 2 + (1 − α) · |R − R̂| ]

where Ls denotes the loss value; N denotes that N samples are taken each time; R denotes the right view; the weight α is 0.85; R is the true right view and R̂ is the predicted right view; (1 − SSIM(R, R̂)) / 2 measures the structural similarity between the predicted right view and the true right view; |R − R̂| is the absolute-value error L1 between the predicted right view and the true right view.
a loss value Lg is obtained from the predicted right depth map and the actual right view according to the first-order gradient loss function;

the first-order gradient loss function formula is:

Lg = (1/N) Σi,j ( |∂x Z'(i,j)| · e^(−|∂x R(i,j)|) + |∂y Z'(i,j)| · e^(−|∂y R(i,j)|) )

where Lg denotes the loss value, ∂x Z' denotes the first derivative of the right depth map in the x direction, N denotes that N samples are taken each time, ∂y Z' denotes the first derivative of the right depth map in the y direction, ∂x R denotes the first derivative of the right view in the x direction, and ∂y R denotes the first derivative of the right view in the y direction; i, j denote pixel coordinates.

The final third loss function is: L = Ls + Lg.
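Under the reconstruction above, the two loss terms can be sketched as follows; the 3x3 average-pooling SSIM window and the exponential edge weighting are assumptions consistent with the definitions given, not details stated in the text.

```python
import torch
import torch.nn.functional as F

def appearance_loss(pred_right, true_right, alpha=0.85):
    """SSIM + L1 photometric loss with alpha = 0.85 weighting the SSIM term,
    as in the text. A simplified single-scale SSIM over a 3x3 window."""
    mu_x = F.avg_pool2d(pred_right, 3, 1, 1)
    mu_y = F.avg_pool2d(true_right, 3, 1, 1)
    sigma_x = F.avg_pool2d(pred_right ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(true_right ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(pred_right * true_right, 3, 1, 1) - mu_x * mu_y
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    ssim_term = ((1 - ssim) / 2).clamp(0, 1).mean()
    l1_term = (pred_right - true_right).abs().mean()
    return alpha * ssim_term + (1 - alpha) * l1_term

def gradient_loss(right_depth, right_view):
    """Edge-aware first-order gradient loss: depth gradients are penalized
    less across strong image edges (the assumed standard form)."""
    dzdx = (right_depth[..., :, 1:] - right_depth[..., :, :-1]).abs()
    dzdy = (right_depth[..., 1:, :] - right_depth[..., :-1, :]).abs()
    didx = (right_view[..., :, 1:] - right_view[..., :, :-1]).abs().mean(1, keepdim=True)
    didy = (right_view[..., 1:, :] - right_view[..., :-1, :]).abs().mean(1, keepdim=True)
    return (dzdx * torch.exp(-didx)).mean() + (dzdy * torch.exp(-didy)).mean()
```

The final loss is then the sum `appearance_loss(pred_right, right) + gradient_loss(depth, right)`.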
Step S506: judge, according to the loss values, whether the initial depth map prediction network model and the initial camera parameter prediction network model have converged to stability.

In one implementation, whether the initial depth map prediction network model and the initial camera parameter prediction network model have converged to stability can be judged from the magnitude and fluctuation of the loss values.

If the judgment result is yes, i.e., the models have converged to stability, step S507 is executed; if the judgment result is no, i.e., the models have not converged to stability, step S509 is executed.
Step S507: increase the training-iteration count by one, and judge whether a preset number of training iterations has been reached.

If the judgment result is yes, i.e., the preset number of training iterations has been reached, step S508 is executed; if the judgment result is no, i.e., the preset number of training iterations has not been reached, the process returns to step S502.
Step S508: determine that the current initial depth map prediction network model is the trained depth map prediction network model. The process ends.

Step S509: adjust the network parameters of the initial depth map prediction network model and the initial camera parameter prediction network model, and return to step S502.
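Putting steps S502-S509 together, a hypothetical training loop might look like the sketch below; `build_matrices` (assembling K, K⁻¹ and Tt→s from the 4 intrinsic and 6 rotation-translation outputs) is a hypothetical helper, and the loss-fluctuation convergence test is simplified to a single threshold.

```python
import itertools
import torch

def train(depth_net, cam_net, loader, preset_iters, stable_eps=1e-4):
    opt = torch.optim.Adam(list(depth_net.parameters()) + list(cam_net.parameters()),
                           lr=1e-4)
    prev, count = None, 0
    for left, right in itertools.cycle(loader):                    # return to S502 as needed
        depth = depth_net(right)                                   # S502: predicted right depth
        pose, intr = cam_net(left, right)                          # S503: predicted camera params
        K, K_inv, T = build_matrices(intr, pose)                   # hypothetical helper
        pred_right = warp_left_to_right(left, depth, K, K_inv, T)  # S504: synthesize right view
        loss = appearance_loss(pred_right, right) + gradient_loss(depth, right)  # S505
        converged = prev is not None and abs(prev - loss.item()) < stable_eps    # S506
        count += 1                                                 # S507/S509: iteration count
        if converged and count >= preset_iters:                    # S508: training complete
            break
        if not converged:                                          # S509: adjust parameters
            opt.zero_grad()
            loss.backward()
            opt.step()
        prev = loss.item()
    return depth_net
```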
It can be seen that, with the embodiment of the present invention, the initial depth map prediction network model is trained on the basis of 3D images; no ground-truth depth maps are needed for supervision, a depth map prediction network model can be trained, and the conversion of 2D images into 3D images is realized.

In addition, during the training of the depth map prediction network model, the embodiment of the present invention adds a camera parameter prediction network model to learn the camera parameters, which eliminates the influence of differing camera parameters on depth map prediction during model training, so that the depth maps predicted by the depth map prediction network model are more realistic, the depth layering of the 3D image is richer, and the stereoscopic effect is stronger.
The depth map prediction method for an image provided in an embodiment of the present invention is shown in Figure 6; the specific processing flow of the method includes:

Step S601: obtain a first monocular view whose depth map is to be predicted.

Step S602: input the first monocular view into a pre-trained depth map prediction network model; the depth map prediction network model may be obtained by training with the training method of Figure 5; the first monocular view is a left view or a right view.

Step S603: obtain the first predicted depth map output by the depth map prediction network model.
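At inference time only the depth network is needed; a minimal sketch, assuming `depth_net` was trained as above and `view` is a (1, 3, H, W) tensor holding the first monocular view:

```python
import torch

@torch.no_grad()
def predict_depth(depth_net, view):
    depth_net.eval()
    return depth_net(view)  # the first predicted depth map, shape (1, 1, H, W)
```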
It can be seen that, with the embodiment of the present invention, the initial depth map prediction network model is trained on the basis of 3D images; no ground-truth depth maps are needed for supervision, a depth map prediction network model can be trained, and the conversion of 2D images into 3D images is realized.

In addition, during the training of the depth map prediction network model, the embodiment of the present invention adds a camera parameter prediction network model to learn the camera parameters, which reduces the influence of differing camera parameters on depth map prediction during model training, so that the depth maps predicted by the depth map prediction network model are more realistic, the depth layering of the 3D image is richer, and the stereoscopic effect is stronger.
An embodiment of the present invention provides a structural schematic diagram of a device for converting a two-dimensional 2D image into a three-dimensional 3D image, as shown in Figure 7, comprising:

a first 2D image acquisition unit 701, configured to obtain a two-dimensional 2D image to be converted into a three-dimensional 3D image;

a first 2D image input unit 702, configured to input the 2D image, as a first monocular view for generating the 3D image, into a pre-trained depth map prediction network model; the depth map prediction network model is obtained by training an initial depth map prediction network model and an initial camera parameter prediction network model on the basis of multiple different 3D film source samples; the first monocular view is a left view or a right view;

a first predicted depth map acquisition unit 703, configured to obtain the first predicted depth map output by the depth map prediction network model;

a second monocular view acquisition unit 704, configured to process the first predicted depth map based on the first predicted depth map, the camera parameters of the 2D image, the preset camera imaging formula and a preset first sampling mode, to obtain a second monocular view; the second monocular view is a right view or a left view corresponding to the first monocular view;

a 3D image generation unit 705, configured to generate the 3D image based on the first monocular view and the second monocular view.

Optionally, the device includes a depth map prediction network model training unit; the depth map prediction network model training unit comprises:
a training sample acquisition module, configured to obtain multiple different 3D film sources shot by different cameras as training samples; wherein each training sample includes a left view and a corresponding right view;

a second predicted right depth map acquisition module, configured to input the right views of a preset number of training samples into the initial depth map prediction network model, to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model;

a predicted camera parameter acquisition module, configured to input the left views and right views of the preset number of training samples into the initial camera parameter prediction network model, to obtain the predicted camera parameters of each training sample output by the initial camera parameter prediction network model;

a second predicted right view acquisition module, configured to process the second predicted right depth map based on the second predicted right depth map of each training sample, the predicted camera parameters of each training sample, the preset camera imaging formula and a preset second sampling mode, to obtain the second predicted right view of each training sample;

a loss value calculation module, configured to calculate loss values according to the preset loss function, the true right view in each training sample and its corresponding second predicted right view;

a convergence judgment module, configured to judge, according to the loss values, whether the initial depth map prediction network model and the initial camera parameter prediction network model have converged to stability;

a training-iteration judgment module, configured to, if the models have converged to stability, increase the training-iteration count by one and judge whether a preset number of training iterations has been reached; if the preset number has not been reached, trigger the second predicted right depth map acquisition module to input the right views of the preset number of training samples into the initial depth map prediction network model, to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model; if the preset number has been reached, the current initial depth map prediction network model is the trained depth map prediction network model;

a network parameter adjustment module, configured to, if the models have not converged to stability, increase the training-iteration count by one, adjust the network parameters of the initial depth map prediction network model and the initial camera parameter prediction network model, and trigger the second predicted right depth map acquisition module to input the right views of the preset number of training samples into the depth map prediction network model to be trained, to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model.
Optionally, the predicted camera parameters include: predicted camera intrinsic parameters and predicted rotation-translation parameters;

the preset camera imaging formula is: Ps ~ K · Tt→s · Dt(Pt) · K⁻¹ · Pt, where ~ denotes the mapping operation;

Ps is the coordinate of a binocular-image reference point in the left view, Pt is the coordinate of the binocular-image reference point in the right view, K is the camera intrinsic matrix, K⁻¹ is the inverse of the camera intrinsic matrix, Dt(Pt) is the depth at the point Pt, and Tt→s is the rotation-translation matrix; the camera intrinsic matrix contains the 4 parameters (fx, fy, x0, y0), where fx and fy are the camera focal lengths and x0 and y0 are the principal point coordinates;

the second predicted right view acquisition module is specifically configured to:

substitute the second predicted right depth map, the predicted camera intrinsic parameters and the predicted rotation-translation parameters into the preset camera imaging formula, to obtain the second predicted mapping coordinates of the right view in the left view;

sample the left view according to the second predicted mapping coordinates of the right view in the left view, to obtain the second predicted right view.
Optionally, the initial depth map prediction network model is: a network based on the Visual Geometry Group (VGG) or U-net network structure;

the initial camera parameter prediction network model comprises: a split into two branches after 5 convolutional layers, each branch containing 2 convolutional layers, 1 average pooling layer and 1 fully connected (FC) layer;

the preset loss function includes: an SSIM+L1 loss function and a first-order gradient loss function;

a loss value Ls is obtained from the predicted right view and the true right view according to the SSIM+L1 loss function;

the SSIM+L1 loss function formula is:

Ls = (1/N) Σ [ α · (1 − SSIM(R, R̂)) / 2 + (1 − α) · |R − R̂| ]

where Ls denotes the loss value; N denotes that N samples are taken each time; R denotes the right view; the weight α is 0.85; R is the true right view and R̂ is the predicted right view; (1 − SSIM(R, R̂)) / 2 measures the structural similarity between the predicted right view and the true right view; |R − R̂| is the absolute-value error L1 between the predicted right view and the true right view;

a loss value Lg is obtained from the predicted right depth map and the actual right view according to the first-order gradient loss function;

the first-order gradient loss function formula is:

Lg = (1/N) Σi,j ( |∂x Z'(i,j)| · e^(−|∂x R(i,j)|) + |∂y Z'(i,j)| · e^(−|∂y R(i,j)|) )

where Lg denotes the loss value, ∂x Z' denotes the first derivative of the right depth map in the x direction, N denotes that N samples are taken each time, ∂y Z' denotes the first derivative of the right depth map in the y direction, ∂x R denotes the first derivative of the right view in the x direction, and ∂y R denotes the first derivative of the right view in the y direction; i, j denote pixel coordinates.

The final third loss function is: L = Ls + Lg.
Optionally, the first monocular view is a left view, and the first predicted depth map is a first predicted left depth map;

the second monocular view acquisition unit comprises: a first predicted right view acquisition module;

the first predicted right view acquisition module is configured to process the first predicted left depth map based on the first predicted left depth map, the camera parameters of the 2D image, the preset camera imaging formula and the preset first sampling mode, to obtain a first predicted right view;

the 3D image generation unit is specifically configured to: generate the 3D image based on the left view and the first predicted right view.

Optionally, the camera parameters of the 2D image comprise: the camera intrinsic parameters and rotation-translation parameters of the 2D image;

the first predicted right view acquisition module is specifically configured to:

substitute the first predicted left depth map, the camera intrinsic parameters of the 2D image and the rotation-translation parameters into the preset camera imaging formula, to obtain the first predicted mapping coordinates of the left view in the right view;

sample the left view according to the first predicted mapping coordinates of the left view in the right view, to obtain the first predicted right view.
It can be seen that, with the embodiment of the present invention, the initial depth map prediction network model is trained on the basis of 3D images; no ground-truth depth maps are needed for supervision, a depth map prediction network model can be trained, and the conversion of 2D images into 3D images is realized.

In addition, the second monocular view in this embodiment is obtained after the first predicted depth map is processed based on the first predicted depth map, the camera parameters of the 2D image, the preset camera imaging formula and the preset first sampling mode. Since the camera parameters are referenced during prediction, the predicted right view obtained is more realistic.
An embodiment of the present invention provides a structural schematic diagram of a training device for a depth map prediction network model, as shown in Figure 8, comprising:

a training sample acquisition module 801, configured to obtain multiple different 3D film sources shot by different cameras as training samples; wherein each training sample includes a left view and a corresponding right view;

a second predicted right depth map acquisition module 802, configured to input the right views of a preset number of training samples into the initial depth map prediction network model, to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model;

a predicted camera parameter acquisition module 803, configured to input the left views and right views of the preset number of training samples into the initial camera parameter prediction network model, to obtain the predicted camera parameters of each training sample output by the initial camera parameter prediction network model;

a second predicted right view acquisition module 804, configured to process the second predicted right depth map based on the second predicted right depth map of each training sample, the predicted camera parameters of each training sample, the preset camera imaging formula and a preset second sampling mode, to obtain the second predicted right view of each training sample;

a loss value calculation module 805, configured to calculate loss values according to the preset loss function, the true right view in each training sample and its corresponding second predicted right view;

a convergence judgment module 806, configured to judge, according to the loss values, whether the initial depth map prediction network model and the initial camera parameter prediction network model have converged to stability;

a training-iteration judgment module 807, configured to, if the models have converged to stability, increase the training-iteration count by one and judge whether a preset number of training iterations has been reached; if the preset number has not been reached, trigger the second predicted right depth map acquisition module to input the right views of the preset number of training samples into the initial depth map prediction network model, to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model; if the preset number has been reached, the current initial depth map prediction network model is the trained depth map prediction network model;

a network parameter adjustment module 808, configured to, if the models have not converged to stability, increase the training-iteration count by one, adjust the network parameters of the initial depth map prediction network model and the initial camera parameter prediction network model, and trigger the second predicted right depth map acquisition module to input the right views of the preset number of training samples into the depth map prediction network model to be trained, to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model.
Optionally, the predicted camera parameters include: predicted camera intrinsic parameters and predicted rotation-translation parameters;

the preset camera imaging formula is: Ps ~ K · Tt→s · Dt(Pt) · K⁻¹ · Pt, where ~ denotes the mapping operation;

Ps is the coordinate of a binocular-image reference point in the left view, Pt is the coordinate of the binocular-image reference point in the right view, K is the camera intrinsic matrix, K⁻¹ is the inverse of the camera intrinsic matrix, Dt(Pt) is the depth at the point Pt, and Tt→s is the rotation-translation matrix; the camera intrinsic matrix contains the 4 parameters (fx, fy, x0, y0), where fx and fy are the camera focal lengths and x0 and y0 are the principal point coordinates;

the second predicted right view acquisition module is specifically configured to:

substitute the second predicted right depth map, the predicted camera intrinsic parameters and the predicted rotation-translation parameters into the preset camera imaging formula, to obtain the second predicted mapping coordinates of the right view in the left view;

sample the left view according to the second predicted mapping coordinates of the right view in the left view, to obtain the second predicted right view.
Optionally,

the initial depth map prediction network model is: a network based on the Visual Geometry Group (VGG) or U-net network structure;

the initial camera parameter prediction network model comprises: a split into two branches after 5 convolutional layers, each branch containing 2 convolutional layers, 1 average pooling layer and 1 fully connected (FC) layer;

the preset loss function includes: an SSIM+L1 loss function and a first-order gradient loss function;

a loss value Ls is obtained from the predicted right view and the true right view according to the SSIM+L1 loss function;

the SSIM+L1 loss function formula is:

Ls = (1/N) Σ [ α · (1 − SSIM(R, R̂)) / 2 + (1 − α) · |R − R̂| ]

where Ls denotes the loss value; N denotes that N samples are taken each time; R denotes the right view; the weight α is 0.85; R is the true right view and R̂ is the predicted right view; (1 − SSIM(R, R̂)) / 2 measures the structural similarity between the predicted right view and the true right view; |R − R̂| is the absolute-value error L1 between the predicted right view and the true right view;

a loss value Lg is obtained from the predicted right depth map and the actual right view according to the first-order gradient loss function;

the first-order gradient loss function formula is:

Lg = (1/N) Σi,j ( |∂x Z'(i,j)| · e^(−|∂x R(i,j)|) + |∂y Z'(i,j)| · e^(−|∂y R(i,j)|) )

where Lg denotes the loss value, ∂x Z' denotes the first derivative of the right depth map in the x direction, N denotes that N samples are taken each time, ∂y Z' denotes the first derivative of the right depth map in the y direction, ∂x R denotes the first derivative of the right view in the x direction, and ∂y R denotes the first derivative of the right view in the y direction; i, j denote pixel coordinates.

The final third loss function is: L = Ls + Lg.
It can be seen that, with the embodiment of the present invention, the initial depth map prediction network model is trained on the basis of 3D images; no ground-truth depth maps are needed for supervision, a depth map prediction network model can be trained, and the conversion of 2D images into 3D images is realized.

In addition, during the training of the depth map prediction network model, the embodiment of the present invention adds a camera parameter prediction network model to learn the camera parameters, which reduces the influence of differing camera parameters on depth map prediction during model training, so that the depth maps predicted by the depth map prediction network model are more realistic, the depth layering of the 3D image is richer, and the stereoscopic effect is stronger.
An embodiment of the present invention provides a structural schematic diagram of a depth map prediction device for an image, as shown in Figure 9, comprising:

a first monocular view acquisition unit 901, configured to obtain a first monocular view whose depth map is to be predicted;

a first monocular view input unit 902, configured to input the first monocular view into a pre-trained depth map prediction network model; the depth map prediction network model is obtained by training with the device of Figure 8; the first monocular view is a left view or a right view;

a first depth map acquisition unit 903, configured to obtain the first predicted depth map output by the depth map prediction network model.

It can be seen that, with the embodiment of the present invention, the initial depth map prediction network model is trained on the basis of 3D images; no ground-truth depth maps are needed for supervision, a depth map prediction network model can be trained, and the conversion of 2D images into 3D images is realized.

In addition, during the training of the depth map prediction network model, the embodiment of the present invention adds a camera parameter prediction network model to learn the camera parameters, which reduces the influence of differing camera parameters on depth map prediction during model training, so that the depth maps predicted by the depth map prediction network model are more realistic, the depth layering of the 3D image is richer, and the stereoscopic effect is stronger.
An embodiment of the present invention further provides an electronic device, as shown in Figure 10, comprising a processor 1001, a communication interface 1002, a memory 1003 and a communication bus 1004, wherein the processor 1001, the communication interface 1002 and the memory 1003 communicate with one another via the communication bus 1004;

the memory 1003 is configured to store a computer program;

the processor 1001 is configured to implement the following steps when executing the program stored in the memory 1003:

obtaining a two-dimensional 2D image to be converted into a three-dimensional 3D image;

inputting the 2D image, as a first monocular view for generating the 3D image, into a pre-trained depth map prediction network model; the depth map prediction network model is obtained by training an initial depth map prediction network model and an initial camera parameter prediction network model on the basis of multiple different 3D film source samples; the first monocular view is a left view or a right view;

obtaining the first predicted depth map output by the depth map prediction network model;

processing the first predicted depth map based on the first predicted depth map, the camera parameters of the 2D image, the preset camera imaging formula and a preset first sampling mode, to obtain a second monocular view; the second monocular view is a right view or a left view corresponding to the first monocular view;

generating a 3D image based on the first monocular view and the second monocular view.
An embodiment of the present invention further provides another electronic device, as shown in Figure 11, comprising a processor 1101, a communication interface 1102, a memory 1103 and a communication bus 1104, wherein the processor 1101, the communication interface 1102 and the memory 1103 communicate with one another via the communication bus 1104;

the memory 1103 is configured to store a computer program;

the processor 1101 is configured to, when executing the program stored in the memory 1103, obtain multiple different 3D film sources shot by different cameras as training samples; wherein each training sample includes a left view and a corresponding right view;

input the right views of a preset number of training samples into the initial depth map prediction network model, to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model;

input the left views and right views of the preset number of training samples into the initial camera parameter prediction network model, to obtain the predicted camera parameters of each training sample output by the initial camera parameter prediction network model;

process the second predicted right depth map based on the second predicted right depth map of each training sample, the predicted camera parameters of each training sample, the preset camera imaging formula and a preset second sampling mode, to obtain the second predicted right view of each training sample;

calculate loss values according to the preset loss function, the true right view in each training sample and its corresponding second predicted right view;

judge, according to the loss values, whether the initial depth map prediction network model and the initial camera parameter prediction network model have both converged to stability;

if they have converged to stability, increase the training-iteration count by one and judge whether a preset number of training iterations has been reached; if the preset number of training iterations has not been reached, return to the step of inputting the right views of the preset number of training samples into the initial depth map prediction network model to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model; if the preset number of training iterations has been reached, the current initial depth map prediction network model is the trained depth map prediction network model;

if they have not converged to stability, increase the training-iteration count by one, adjust the network parameters of the initial depth map prediction network model and the initial camera parameter prediction network model, and return to the step of inputting the right views of the preset number of training samples into the depth map prediction network model to be trained, to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model.
An embodiment of the present invention further provides another electronic device, as shown in Figure 12, comprising a processor 1201, a communication interface 1202, a memory 1203 and a communication bus 1204, wherein the processor 1201, the communication interface 1202 and the memory 1203 communicate with one another via the communication bus 1204;

the memory 1203 is configured to store a computer program;

the processor 1201 is configured to, when executing the program stored in the memory 1203, obtain a first monocular view whose depth map is to be predicted;

input the first monocular view into a pre-trained depth map prediction network model; the depth map prediction network model is obtained by training with any of the above training methods; the first monocular view is a left view or a right view;

obtain the first predicted depth map output by the depth map prediction network model.
The communication bus mentioned for the above electronic devices may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of representation, only one thick line is used in the figures, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the above electronic device and other devices.

The memory may include a random access memory (RAM) or a non-volatile memory (NVM), for example at least one magnetic disk storage. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
In another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored; when the computer program is executed by a processor, it implements the steps of any of the above methods for converting a two-dimensional 2D image into a three-dimensional 3D image, or the steps of any of the above training methods for a depth map prediction network model, or the steps of any of the above depth map prediction methods for an image.

In another embodiment of the present invention, a computer program product containing instructions is further provided, which, when run on a computer, causes the computer to execute any method for converting a two-dimensional 2D image into a three-dimensional 3D image in the above embodiments, or any of the above training methods for a depth map prediction network model, or any of the above depth map prediction methods for an image.

In the above embodiments, the implementation may be realized wholly or partly by software, hardware, firmware or any combination thereof. When implemented in software, it may be realized wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired (such as coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (such as infrared, radio, microwave) means. The computer-readable storage medium may be any usable medium accessible to the computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk or magnetic tape), an optical medium (for example, a DVD) or a semiconductor medium (for example, a solid state disk (SSD)), etc.
It should be noted that, in this document, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device including that element.

Each embodiment in this specification is described in a related manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the embodiments of the device, the computer-readable storage medium and the computer program product are described relatively simply since they are substantially similar to the method embodiments; for relevant parts, refer to the partial explanation of the method embodiments.

The above is merely a description of preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (23)

1. A method for converting a two-dimensional 2D image into a three-dimensional 3D image, characterized in that the method comprises:

obtaining a two-dimensional 2D image to be converted into a three-dimensional 3D image;

inputting the 2D image, as a first monocular view for generating the 3D image, into a pre-trained depth map prediction network model; the depth map prediction network model being obtained by training an initial depth map prediction network model and an initial camera parameter prediction network model on the basis of multiple different 3D film source samples; the first monocular view being a left view or a right view;

obtaining a first predicted depth map output by the depth map prediction network model;

processing the first predicted depth map based on the first predicted depth map, camera parameters of the 2D image, a preset camera imaging formula and a preset first sampling mode, to obtain a second monocular view; the second monocular view being a right view or a left view corresponding to the first monocular view;

generating the 3D image based on the first monocular view and the second monocular view.
2. The method according to claim 1, characterized in that the training process of the depth map prediction network model comprises:

obtaining multiple different 3D film sources shot by different cameras as training samples; wherein each training sample includes a left view and a corresponding right view;

inputting the right views of a preset number of training samples into the initial depth map prediction network model, to obtain a second predicted right depth map of each training sample output by the initial depth map prediction network model;

inputting the left views and right views of the preset number of training samples into the initial camera parameter prediction network model, to obtain predicted camera parameters of each training sample output by the initial camera parameter prediction network model;

processing the second predicted right depth map based on the second predicted right depth map of each training sample, the predicted camera parameters of each training sample, the preset camera imaging formula and a preset second sampling mode, to obtain a second predicted right view of each training sample;

calculating loss values according to a preset loss function, the true right view in each training sample and its corresponding second predicted right view;

judging, according to the loss values, whether the initial depth map prediction network model and the initial camera parameter prediction network model have both converged to stability;

if they have converged to stability, increasing the training-iteration count by one and judging whether a preset number of training iterations has been reached; if the preset number of training iterations has not been reached, returning to the step of inputting the right views of the preset number of training samples into the initial depth map prediction network model to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model; if the preset number of training iterations has been reached, taking the current initial depth map prediction network model as the trained depth map prediction network model;

if they have not converged to stability, increasing the training-iteration count by one, adjusting the network parameters of the current initial depth map prediction network model and the initial camera parameter prediction network model, and returning to the step of inputting the right views of the preset number of training samples into the depth map prediction network model to be trained, to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model.
3. The method according to claim 2, characterized in that:

the predicted camera parameters include: predicted camera intrinsic parameters and predicted rotation-translation parameters;

the preset camera imaging formula is: Ps ~ K · Tt→s · Dt(Pt) · K⁻¹ · Pt,

wherein ~ denotes the mapping operation;

Ps is the coordinate of a binocular-image reference point in the left view, Pt is the coordinate of the binocular-image reference point in the right view, K is the camera intrinsic matrix, K⁻¹ is the inverse of the camera intrinsic matrix, Dt(Pt) is the depth at the point Pt, and Tt→s is the rotation-translation matrix; the camera intrinsic matrix contains the 4 parameters (fx, fy, x0, y0), wherein fx and fy are the camera focal lengths and x0 and y0 are the principal point coordinates;

the step of processing the second predicted right depth map based on the second predicted right depth map of each training sample, the predicted camera parameters of each training sample, the preset camera imaging formula and the preset second sampling mode, to obtain the second predicted right view of each training sample, comprises:

substituting the second predicted right depth map, the predicted camera intrinsic parameters and the predicted rotation-translation parameters into the preset camera imaging formula, to obtain second predicted mapping coordinates of the right view in the left view;

sampling the left view according to the second predicted mapping coordinates of the right view in the left view, to obtain the second predicted right view.
4. The method according to claim 2, characterized in that:

the initial depth map prediction network model is: a network based on the Visual Geometry Group (VGG) or U-net network structure;

the initial camera parameter prediction network model comprises: a split into two branches after 5 convolutional layers, each branch containing 2 convolutional layers, 1 average pooling layer and 1 fully connected (FC) layer;

the preset loss function includes: an SSIM+L1 loss function and a first-order gradient loss function;

a loss value Ls is obtained from the predicted right view and the true right view according to the SSIM+L1 loss function;

the SSIM+L1 loss function formula is:

Ls = (1/N) Σ [ α · (1 − SSIM(R, R̂)) / 2 + (1 − α) · |R − R̂| ]

wherein Ls denotes the loss value; N denotes that N samples are taken each time; R denotes the right view; the weight α is 0.85; R is the true right view and R̂ is the predicted right view; (1 − SSIM(R, R̂)) / 2 measures the structural similarity between the predicted right view and the true right view; |R − R̂| is the absolute-value error L1 between the predicted right view and the true right view;

a loss value Lg is obtained from the predicted right depth map and the actual right view according to the first-order gradient loss function;

the first-order gradient loss function formula is:

Lg = (1/N) Σi,j ( |∂x Z'(i,j)| · e^(−|∂x R(i,j)|) + |∂y Z'(i,j)| · e^(−|∂y R(i,j)|) )

wherein Lg denotes the loss value, ∂x Z' denotes the first derivative of the right depth map in the x direction, N denotes that N samples are taken each time, ∂y Z' denotes the first derivative of the right depth map in the y direction, ∂x R denotes the first derivative of the right view in the x direction, and ∂y R denotes the first derivative of the right view in the y direction; i, j denote pixel coordinates;

the final third loss function is: L = Ls + Lg.
5. The method according to claim 1, characterized in that:

the first monocular view is a left view, and the first predicted depth map is a first predicted left depth map;

the step of processing the first predicted depth map based on the first predicted depth map, the camera parameters of the 2D image, the preset camera imaging formula and the preset first sampling mode, to obtain the second monocular view, comprises:

processing the first predicted left depth map based on the first predicted depth map, the camera parameters of the 2D image, the preset camera imaging formula and the preset first sampling mode, to obtain a first predicted right view;

the step of generating the 3D image based on the first monocular view and the second monocular view comprises:

generating the 3D image based on the left view and the first predicted right view.
6. The method according to claim 5, characterized in that:

the camera parameters of the 2D image comprise: camera intrinsic parameters and rotation-translation parameters of the 2D image;

the step of processing the first predicted left depth map based on the first predicted left depth map, the camera parameters of the 2D image, the preset camera imaging formula and the preset first sampling mode, to obtain the first predicted right view, comprises:

substituting the first predicted left depth map, the camera intrinsic parameters of the 2D image and the rotation-translation parameters into the preset camera imaging formula, to obtain first predicted mapping coordinates of the left view in the right view;

sampling the left view according to the first predicted mapping coordinates of the left view in the right view, to obtain the first predicted right view.
7. A training method for a depth map prediction network model, characterized by comprising:

obtaining multiple different 3D film sources shot by different cameras as training samples; wherein each training sample includes a left view and a corresponding right view;

inputting the right views of a preset number of training samples into the initial depth map prediction network model, to obtain a second predicted right depth map of each training sample output by the initial depth map prediction network model;

inputting the left views and right views of the preset number of training samples into the initial camera parameter prediction network model, to obtain predicted camera parameters of each training sample output by the initial camera parameter prediction network model;

processing the second predicted right depth map based on the second predicted right depth map of each training sample, the predicted camera parameters of each training sample, a preset camera imaging formula and a preset second sampling mode, to obtain a second predicted right view of each training sample;

calculating loss values according to a preset loss function, the true right view in each training sample and its corresponding second predicted right view;

judging, according to the loss values, whether the initial depth map prediction network model and the initial camera parameter prediction network model have both converged to stability;

if they have converged to stability, increasing the training-iteration count by one and judging whether a preset number of training iterations has been reached; if the preset number of training iterations has not been reached, returning to the step of inputting the right views of the preset number of training samples into the initial depth map prediction network model to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model; if the preset number of training iterations has been reached, taking the current initial depth map prediction network model as the trained depth map prediction network model;

if they have not converged to stability, increasing the training-iteration count by one, adjusting the network parameters of the current initial depth map prediction network model and the initial camera parameter prediction network model, and returning to the step of inputting the right views of the preset number of training samples into the depth map prediction network model to be trained, to obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model.
8. The method according to claim 7, characterized in that:

the predicted camera parameters include: predicted camera intrinsic parameters and predicted rotation-translation parameters;

the preset camera imaging formula is: Ps ~ K · Tt→s · Dt(Pt) · K⁻¹ · Pt,

wherein ~ denotes the mapping operation;

Ps is the coordinate of a binocular-image reference point in the left view, Pt is the coordinate of the binocular-image reference point in the right view, K is the camera intrinsic matrix, K⁻¹ is the inverse of the camera intrinsic matrix, Dt(Pt) is the depth at the point Pt, and Tt→s is the rotation-translation matrix; the camera intrinsic matrix contains the 4 parameters (fx, fy, x0, y0), wherein fx and fy are the camera focal lengths and x0 and y0 are the principal point coordinates;

the step of processing the second predicted right depth map based on the second predicted right depth map of each training sample, the predicted camera parameters of each training sample, the preset camera imaging formula and the preset second sampling mode, to obtain the second predicted right view of each training sample, comprises:

substituting the second predicted right depth map, the predicted camera intrinsic parameters and the predicted rotation-translation parameters into the preset camera imaging formula, to obtain second predicted mapping coordinates of the right view in the left view;

sampling the left view according to the second predicted mapping coordinates of the right view in the left view, to obtain the second predicted right view.
9. The method according to claim 8, characterized in that:

the initial depth map prediction network model is: a network based on the Visual Geometry Group (VGG) or U-net network structure;

the initial camera parameter prediction network model comprises: a split into two branches after 5 convolutional layers, each branch containing 2 convolutional layers, 1 average pooling layer and 1 fully connected (FC) layer;

the preset loss function includes: an SSIM+L1 loss function and a first-order gradient loss function;

a loss value Ls is obtained from the predicted right view and the true right view according to the SSIM+L1 loss function;

the SSIM+L1 loss function formula is:

Ls = (1/N) Σ [ α · (1 − SSIM(R, R̂)) / 2 + (1 − α) · |R − R̂| ]

wherein Ls denotes the loss value; N denotes that N samples are taken each time; R denotes the right view; the weight α is 0.85; R is the true right view and R̂ is the predicted right view; (1 − SSIM(R, R̂)) / 2 measures the structural similarity between the predicted right view and the true right view; |R − R̂| is the absolute-value error L1 between the predicted right view and the true right view;

a loss value Lg is obtained from the predicted right depth map and the actual right view according to the first-order gradient loss function;

the first-order gradient loss function formula is:

Lg = (1/N) Σi,j ( |∂x Z'(i,j)| · e^(−|∂x R(i,j)|) + |∂y Z'(i,j)| · e^(−|∂y R(i,j)|) )

wherein Lg denotes the loss value, ∂x Z' denotes the first derivative of the right depth map in the x direction, N denotes that N samples are taken each time, ∂y Z' denotes the first derivative of the right depth map in the y direction, ∂x R denotes the first derivative of the right view in the x direction, and ∂y R denotes the first derivative of the right view in the y direction; i, j denote pixel coordinates;

the final third loss function is: L = Ls + Lg.
10. A depth map prediction method for an image, characterized by comprising:

obtaining a first monocular view whose depth map is to be predicted;

inputting the first monocular view into a pre-trained depth map prediction network model; the depth map prediction network model being obtained by training with the training method according to any one of claims 7 to 9; the first monocular view being a left view or a right view;

obtaining a first predicted depth map output by the depth map prediction network model.
11. A device for converting a two-dimensional 2D image into a three-dimensional 3D image, characterized in that the device comprises:

a first 2D image acquisition unit, configured to obtain a two-dimensional 2D image to be converted into a three-dimensional 3D image;

a first 2D image input unit, configured to input the 2D image, as a first monocular view for generating the 3D image, into a pre-trained depth map prediction network model; the depth map prediction network model being obtained by training an initial depth map prediction network model and an initial camera parameter prediction network model on the basis of multiple different 3D film source samples; the first monocular view being a left view or a right view;

a first predicted depth map acquisition unit, configured to obtain the first predicted depth map output by the depth map prediction network model;

a second monocular view acquisition unit, configured to process the first predicted depth map based on the first predicted depth map, camera parameters of the 2D image, a preset camera imaging formula and a preset first sampling mode, to obtain a second monocular view; the second monocular view being a right view or a left view corresponding to the first monocular view;

a 3D image generation unit, configured to generate the 3D image based on the first monocular view and the second monocular view.
12. The device according to claim 11, wherein the device includes a depth map prediction network model training unit; the depth map prediction network model training unit comprises:
A training sample obtaining module, configured to obtain multiple different 3D film sources shot by different cameras as training samples; wherein each training sample includes a left view and a corresponding right view;
A second predicted right depth map obtaining module, configured to input the right views of a preset number of training samples into the initial depth map prediction network model, and obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model;
A predicted camera parameter obtaining module, configured to input the left views and right views of the preset number of training samples into the initial camera parameter prediction network model, and obtain the predicted camera parameters of each training sample output by the initial camera parameter prediction network model;
A second predicted right view obtaining module, configured to process the second predicted right depth map based on the second predicted right depth map of each training sample, the predicted camera parameters of each training sample, the preset camera imaging formula, and a preset second sampling mode, to obtain the second predicted right view of each training sample;
A loss value computing module, configured to compute a loss value according to a preset loss function, the true right view in each training sample, and its corresponding second predicted right view;
A convergence judgment module, configured to judge, according to the loss value, whether the initial depth map prediction network model and the initial camera parameter prediction network model have converged to stability;
A training count judgment module, configured to, if they have converged to stability, increase the training count by one and judge whether a preset training count has been reached; if the preset training count has not been reached, trigger the second predicted right depth map obtaining module to input the right views of the preset number of training samples into the initial depth map prediction network model and obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model; if the preset training count has been reached, take the current initial depth map prediction network model as the trained depth map prediction network model;
A network parameter adjusting module, configured to, if they have not converged to stability, increase the training count by one, adjust the network parameters of the current initial depth map prediction network model and the initial camera parameter prediction network model, and trigger the second predicted right depth map obtaining module to input the right views of the preset number of training samples into the depth map prediction network model to be trained, obtaining the second predicted right depth map of each training sample output by the initial depth map prediction network model (a sketch of this training loop follows the claim).
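A minimal sketch, under stated assumptions, of the training loop these modules describe. `warp_fn` stands in for the camera-imaging-formula warp of claim 13 and `loss_fn` for the combined loss of claim 14; the Adam optimizer, learning rate, stability tolerance, and round budget are illustrative choices the patent does not fix:

```python
import torch

def train_depth_net(depth_net, pose_net, loader, warp_fn, loss_fn,
                    max_rounds=30, tol=1e-4, lr=1e-4):
    opt = torch.optim.Adam(
        list(depth_net.parameters()) + list(pose_net.parameters()), lr=lr)
    prev, rounds = float("inf"), 0
    while rounds < max_rounds:              # preset training count
        for left, right in loader:          # preset number of samples
            depth_r = depth_net(right)      # second predicted right depth map
            intr, pose = pose_net(torch.cat([left, right], dim=1))
            pred_right = warp_fn(left, depth_r, intr, pose)
            loss = loss_fn(pred_right, right, depth_r)
            opt.zero_grad()
            loss.backward()                 # adjust network parameters
            opt.step()
        rounds += 1
        if abs(prev - loss.item()) < tol:   # treated as converged to stability
            break
        prev = loss.item()
    return depth_net                        # trained depth map prediction model
```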
13. The device according to claim 12, wherein:
The predicted camera parameters include: predicted camera intrinsic parameters and predicted rotation-translation parameters;
The preset camera imaging formula is:

$$P_s \sim K\, T_{t \to s}\, D_t(P_t)\, K^{-1} P_t$$

where $\sim$ denotes the mapping operation; $P_s$ is the coordinate of a binocular-image reference point in the left view, $P_t$ is the coordinate of the binocular-image reference point in the right view, $K$ is the camera intrinsic matrix, $K^{-1}$ is the inverse of the camera intrinsic matrix, $D_t(P_t)$ is the depth at the point $P_t$, and $T_{t \to s}$ is the rotation-translation matrix; the camera intrinsic matrix includes the 4 parameters $(f_x, f_y, x_0, y_0)$, where $f_x$ and $f_y$ are the camera focal lengths and $x_0$ and $y_0$ are the principal point coordinates;
The second predicted right view obtaining module is specifically configured to:
Substitute the second predicted right depth map, the predicted camera intrinsic parameters, and the predicted rotation-translation parameters into the preset camera imaging formula, to obtain the second predicted mapping points of the right view in the left view;
Sample the left view according to the second predicted mapping points of the right view in the left view, to obtain the second predicted right view (a sketch of this warp follows the claim).
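A minimal sketch of the two steps just recited: substitute the predicted right depth map, the intrinsics $K$ (3×3), and the rotation-translation matrix $T$ (3×4) into $P_s \sim K\,T_{t\to s}\,D_t(P_t)\,K^{-1}P_t$ to get each right-view pixel's mapping point in the left view, then bilinearly sample the left view there. The tensor shapes and matrix layout are assumptions:

```python
import torch
import torch.nn.functional as F

def warp_right_view(left, depth_r, K, T_t2s):
    b, _, h, w = left.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).reshape(3, -1)
    rays = K.inverse() @ pix                              # K^-1 * Pt
    cam = rays.unsqueeze(0) * depth_r.reshape(b, 1, -1)   # Dt(Pt) K^-1 Pt
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w)], 1)  # homogeneous 3D point
    proj = K @ (T_t2s @ cam_h)                            # K T Dt(Pt) K^-1 Pt
    px = proj[:, 0] / proj[:, 2].clamp(min=1e-6)          # mapping point, x
    py = proj[:, 1] / proj[:, 2].clamp(min=1e-6)          # mapping point, y
    gx = 2.0 * px / (w - 1) - 1.0                         # normalize to [-1,1]
    gy = 2.0 * py / (h - 1) - 1.0
    grid = torch.stack([gx, gy], -1).reshape(b, h, w, 2)
    return F.grid_sample(left, grid, align_corners=True)  # sample the left view
```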
14. The device according to claim 12, wherein:
The initial depth map prediction network model is: a network based on the Visual Geometry Group (VGG) or U-net network structure;
The initial camera parameter prediction network model includes: 5 convolutional layers followed by a split into two parallel branches, each branch including 2 convolutional layers, 1 average pooling layer, and 1 fully connected (FC) layer;
The preset loss function includes: an SSIM+L1 loss function and a first-order gradient loss function;
A loss value $\ell_1$ is obtained from the predicted right view and the true right view according to the SSIM+L1 loss function;

The SSIM+L1 loss function formula is:

$$\ell_1 = \frac{1}{N}\sum_{n=1}^{N}\left[\alpha\,\frac{1-\mathrm{SSIM}\!\left(r^{n},\hat r^{n}\right)}{2}+(1-\alpha)\left\lVert r^{n}-\hat r^{n}\right\rVert_1\right]$$

where $\ell_1$ denotes the loss value; $N$ denotes that N samples are taken each time; $r$ denotes the right view; the weight $\alpha$ is 0.85; $r^{n}$ denotes the true right view and $\hat r^{n}$ the predicted right view; $\mathrm{SSIM}(r^{n},\hat r^{n})$ denotes the structural similarity between the predicted right view and the true right view; $\lVert r^{n}-\hat r^{n}\rVert_1$ denotes the absolute value error (L1) between the predicted right view and the true right view;
A loss value $\ell_2$ is obtained from the predicted right depth map and the actual right view according to the first-order gradient loss function;

The first-order gradient loss function formula is:

$$\ell_2 = \frac{1}{N}\sum_{i,j}\left|\partial_x d_{ij}\right|e^{-\left|\partial_x I_{ij}\right|}+\left|\partial_y d_{ij}\right|e^{-\left|\partial_y I_{ij}\right|}$$

where $\ell_2$ denotes the loss value; $\partial_x d$ and $\partial_y d$ denote the first derivatives of the right depth map in the x and y directions; $N$ denotes that N samples are taken each time; $\partial_x I$ and $\partial_y I$ denote the first derivatives of the right view in the x and y directions; $i, j$ denote pixel coordinates;

The final third loss function is $\ell_3 = \ell_1 + \ell_2$ (sketches of this network head and these losses follow the claim).
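A minimal PyTorch sketch of the camera parameter network just described: a 5-layer convolutional trunk that splits into two parallel heads, each with 2 convolutional layers, average pooling, and a fully connected layer. Assigning one head to the 4 intrinsics (fx, fy, x0, y0) and the other to 6 rotation-translation parameters, as well as all channel widths and strides, are assumptions:

```python
import torch
import torch.nn as nn

class CameraParamNet(nn.Module):
    def __init__(self, in_ch=6):              # left + right views, concatenated
        super().__init__()
        chans = [in_ch, 16, 32, 64, 128, 256]
        self.trunk = nn.Sequential(*[         # the 5-layer convolutional trunk
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                nn.ReLU(inplace=True))
            for i in range(5)])

        def head(out_dim):                    # 2 conv + 1 avg pool + 1 FC
            return nn.Sequential(
                nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(256, out_dim))

        self.intrinsics_head = head(4)        # assumed: fx, fy, x0, y0
        self.pose_head = head(6)              # assumed: 3 rotation + 3 translation

    def forward(self, pair):
        feat = self.trunk(pair)
        return self.intrinsics_head(feat), self.pose_head(feat)
```

And a sketch of the two loss terms, matching the formulas above; the 3×3 SSIM window and its constants follow common practice rather than anything fixed by the patent:

```python
import torch
import torch.nn.functional as F

def ssim(x, y):
    """Simplified SSIM over 3x3 average-pooled windows."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sx = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sy = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sxy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sxy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sx + sy + c2)
    return num / den

def ssim_l1_loss(pred_r, true_r, alpha=0.85):
    """alpha * (1 - SSIM)/2 + (1 - alpha) * L1, with the stated alpha = 0.85."""
    ssim_term = ((1.0 - ssim(pred_r, true_r)) / 2.0).clamp(0, 1).mean()
    l1_term = (pred_r - true_r).abs().mean()
    return alpha * ssim_term + (1.0 - alpha) * l1_term

def gradient_loss(depth_r, true_r):
    """First-order gradient loss: depth gradients weighted by e^{-|image
    gradient|}, so the depth map stays smooth except at image edges."""
    dx_d = (depth_r[..., :, 1:] - depth_r[..., :, :-1]).abs()
    dy_d = (depth_r[..., 1:, :] - depth_r[..., :-1, :]).abs()
    dx_i = (true_r[..., :, 1:] - true_r[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (true_r[..., 1:, :] - true_r[..., :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

def third_loss(pred_r, true_r, depth_r):
    """Final third loss: the sum of the two terms above."""
    return ssim_l1_loss(pred_r, true_r) + gradient_loss(depth_r, true_r)
```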
15. The device according to claim 11, wherein:
The first monocular view is a left view, and the first predicted depth map is a first predicted left depth map;
The second monocular view obtaining unit comprises: a first predicted right view obtaining module;
The first predicted right view obtaining module is configured to process the first predicted left depth map based on the camera parameters of the 2D image, the preset camera imaging formula, and the preset first sampling mode, to obtain a first predicted right view;
The 3D image generation unit is specifically configured to: generate a 3D image based on the left view and the first predicted right view.
16. The device according to claim 15, wherein:
The camera parameters of the 2D image include: the camera intrinsic parameters and rotation-translation parameters of the 2D image;
The first predicted right view obtaining module is specifically configured to:
Substitute the first predicted left depth map, the camera intrinsic parameters of the 2D image, and the rotation-translation parameters into the preset camera imaging formula, to obtain the first predicted mapping points of the left view in the right view;
Sample the left view according to the first predicted mapping points of the left view in the right view, to obtain the first predicted right view (an example of constructing the rotation-translation matrix follows the claim).
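For this inference case, the rotation-translation matrix of a plain stereo rig can be as simple as an identity rotation plus a horizontal baseline shift. A hypothetical helper (the baseline value is an assumption) producing a T usable with the warp sketched after claim 13:

```python
import torch

def stereo_extrinsics(baseline=0.054):
    T = torch.eye(3, 4)     # [R | t] with R = I (no rotation)
    T[0, 3] = baseline      # translate by the stereo baseline along x
    return T
```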
17. A training device for a depth map prediction network model, comprising:
A training sample obtaining module, configured to obtain multiple different 3D film sources shot by different cameras as training samples; wherein each training sample includes a left view and a corresponding right view;
A second predicted right depth map obtaining module, configured to input the right views of a preset number of training samples into an initial depth map prediction network model, and obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model;
A predicted camera parameter obtaining module, configured to input the left views and right views of the preset number of training samples into an initial camera parameter prediction network model, and obtain the predicted camera parameters of each training sample output by the initial camera parameter prediction network model;
A second predicted right view obtaining module, configured to process the second predicted right depth map based on the second predicted right depth map of each training sample, the predicted camera parameters of each training sample, a preset camera imaging formula, and a preset second sampling mode, to obtain the second predicted right view of each training sample;
A loss value computing module, configured to compute a loss value according to a preset loss function, the true right view in each training sample, and its corresponding second predicted right view;
A convergence judgment module, configured to judge, according to the loss value, whether the initial depth map prediction network model and the initial camera parameter prediction network model have converged to stability;
A training count judgment module, configured to, if they have converged to stability, increase the training count by one and judge whether a preset training count has been reached; if the preset training count has not been reached, trigger the second predicted right depth map obtaining module to input the right views of the preset number of training samples into the initial depth map prediction network model and obtain the second predicted right depth map of each training sample output by the initial depth map prediction network model; if the preset training count has been reached, take the current initial depth map prediction network model as the trained depth map prediction network model;
A network parameter adjusting module, configured to, if they have not converged to stability, increase the training count by one, adjust the network parameters of the initial depth map prediction network model and the initial camera parameter prediction network model, and trigger the second predicted right depth map obtaining module to input the right views of the preset number of training samples into the depth map prediction network model to be trained, obtaining the second predicted right depth map of each training sample output by the initial depth map prediction network model.
18. The device according to claim 17, wherein:
The predicted camera parameters include: predicted camera intrinsic parameters and predicted rotation-translation parameters;
The preset camera imaging formula is:

$$P_s \sim K\, T_{t \to s}\, D_t(P_t)\, K^{-1} P_t$$

where $\sim$ denotes the mapping operation; $P_s$ is the coordinate of a binocular-image reference point in the left view, $P_t$ is the coordinate of the binocular-image reference point in the right view, $K$ is the camera intrinsic matrix, $K^{-1}$ is the inverse of the camera intrinsic matrix, $D_t(P_t)$ is the depth at the point $P_t$, and $T_{t \to s}$ is the rotation-translation matrix; the camera intrinsic matrix includes the 4 parameters $(f_x, f_y, x_0, y_0)$, where $f_x$ and $f_y$ are the camera focal lengths and $x_0$ and $y_0$ are the principal point coordinates;
The second predicted right view obtaining module is specifically configured to:
Substitute the second predicted right depth map, the predicted camera intrinsic parameters, and the predicted rotation-translation parameters into the preset camera imaging formula, to obtain the second predicted mapping points of the right view in the left view;
Sample the left view according to the second predicted mapping points of the right view in the left view, to obtain the second predicted right view.
19. The device according to claim 18, wherein:
The initial depth map prediction network model is: a network based on the Visual Geometry Group (VGG) or U-net network structure;
The initial camera parameter prediction network model includes: 5 convolutional layers followed by a split into two parallel branches, each branch including 2 convolutional layers, 1 average pooling layer, and 1 fully connected (FC) layer;
The preset loss function includes: an SSIM+L1 loss function and a first-order gradient loss function;
A loss value $\ell_1$ is obtained from the predicted right view and the true right view according to the SSIM+L1 loss function;

The SSIM+L1 loss function formula is:

$$\ell_1 = \frac{1}{N}\sum_{n=1}^{N}\left[\alpha\,\frac{1-\mathrm{SSIM}\!\left(r^{n},\hat r^{n}\right)}{2}+(1-\alpha)\left\lVert r^{n}-\hat r^{n}\right\rVert_1\right]$$

where $\ell_1$ denotes the loss value; $N$ denotes that N samples are taken each time; $r$ denotes the right view; the weight $\alpha$ is 0.85; $r^{n}$ denotes the true right view and $\hat r^{n}$ the predicted right view; $\mathrm{SSIM}(r^{n},\hat r^{n})$ denotes the structural similarity between the predicted right view and the true right view; $\lVert r^{n}-\hat r^{n}\rVert_1$ denotes the absolute value error (L1) between the predicted right view and the true right view;
A loss value $\ell_2$ is obtained from the predicted right depth map and the actual right view according to the first-order gradient loss function;

The first-order gradient loss function formula is:

$$\ell_2 = \frac{1}{N}\sum_{i,j}\left|\partial_x d_{ij}\right|e^{-\left|\partial_x I_{ij}\right|}+\left|\partial_y d_{ij}\right|e^{-\left|\partial_y I_{ij}\right|}$$

where $\ell_2$ denotes the loss value; $\partial_x d$ and $\partial_y d$ denote the first derivatives of the right depth map in the x and y directions; $N$ denotes that N samples are taken each time; $\partial_x I$ and $\partial_y I$ denote the first derivatives of the right view in the x and y directions; $i, j$ denote pixel coordinates;

The final third loss function is $\ell_3 = \ell_1 + \ell_2$.
20. A depth map prediction device for an image, wherein the device comprises:
A first monocular view obtaining unit, configured to obtain a first monocular view whose depth map is to be predicted;
A first monocular view input unit, configured to input the first monocular view into a pre-trained depth map prediction network model; the depth map prediction network model is obtained by training using the training method of any one of claims 7 to 9; the first monocular view is a left view or a right view;
A first depth map obtaining unit, configured to obtain the first predicted depth map output by the depth map prediction network model (a minimal inference sketch follows the claim).
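A minimal inference sketch for this device; the TorchScript checkpoint name and format are assumptions:

```python
import torch

def predict_depth(view, checkpoint="depth_net.pt"):
    """Return the first predicted depth map for one monocular view
    (left or right), using a pre-trained depth map prediction model."""
    depth_net = torch.jit.load(checkpoint)
    depth_net.eval()
    with torch.no_grad():
        return depth_net(view)
```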
21. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
The memory is configured to store a computer program;
The processor, when executing the program stored in the memory, implements the method steps of any one of claims 1-6.
22. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
The memory is configured to store a computer program;
The processor, when executing the program stored in the memory, implements the method steps of any one of claims 7-9.
23. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
The memory is configured to store a computer program;
The processor, when executing the program stored in the memory, implements the method steps of claim 10.
CN201910381527.6A 2019-05-08 2019-05-08 Image conversion, depth map prediction and model training method and device and electronic equipment Active CN110111244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910381527.6A CN110111244B (en) 2019-05-08 2019-05-08 Image conversion, depth map prediction and model training method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN110111244A (en) 2019-08-09
CN110111244B CN110111244B (en) 2024-01-26

Family

ID=67488916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910381527.6A Active CN110111244B (en) 2019-05-08 2019-05-08 Image conversion, depth map prediction and model training method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110111244B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018046964A1 (en) * 2016-09-12 2018-03-15 Ucl Business Plc Predicting depth from image data using a statistical model
CN106412560A (en) * 2016-09-28 2017-02-15 湖南优象科技有限公司 Three-dimensional image generating method based on depth map
CN109255831A (en) * 2018-09-21 2019-01-22 南京大学 The method that single-view face three-dimensional reconstruction and texture based on multi-task learning generate

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111429501A (en) * 2020-03-25 2020-07-17 贝壳技术有限公司 Depth map prediction model generation method and device and depth map prediction method and device
CN111445518A (en) * 2020-03-25 2020-07-24 贝壳技术有限公司 Image conversion method and device, depth map prediction method and device
CN111445518B (en) * 2020-03-25 2023-04-18 如你所视(北京)科技有限公司 Image conversion method and device, depth map prediction method and device
CN112468828A (en) * 2020-11-25 2021-03-09 深圳大学 Code rate allocation method and device for panoramic video, mobile terminal and storage medium
CN112468828B (en) * 2020-11-25 2022-06-17 深圳大学 Code rate distribution method and device for panoramic video, mobile terminal and storage medium
CN112530003A (en) * 2020-12-11 2021-03-19 北京奇艺世纪科技有限公司 Three-dimensional human hand reconstruction method and device and electronic equipment
CN112530003B (en) * 2020-12-11 2023-10-27 北京奇艺世纪科技有限公司 Three-dimensional human hand reconstruction method and device and electronic equipment
CN116740158A (en) * 2023-08-14 2023-09-12 小米汽车科技有限公司 Image depth determining method, device and storage medium
CN116740158B (en) * 2023-08-14 2023-12-05 小米汽车科技有限公司 Image depth determining method, device and storage medium

Also Published As

Publication number Publication date
CN110111244B (en) 2024-01-26

Similar Documents

Publication Publication Date Title
CN110111244A (en) Image conversion, depth map prediction and model training method, device and electronic equipment
JP7392227B2 (en) Feature pyramid warping for video frame interpolation
US20180063504A1 (en) Selective culling of multi-dimensional data sets
CN110599395B (en) Target image generation method, device, server and storage medium
JP2019534606A (en) Method and apparatus for reconstructing a point cloud representing a scene using light field data
EP3511909A1 (en) Image processing method and device for projecting image of virtual reality content
US11871127B2 (en) High-speed video from camera arrays
CN113034380A (en) Video space-time super-resolution method and device based on improved deformable convolution correction
Li et al. PolarGlobe: A web-wide virtual globe system for visualizing multidimensional, time-varying, big climate data
CN105493501A (en) Virtual video camera
CN109934764A (en) Processing method, device, terminal, server and the storage medium of panoramic video file
JP2018109958A (en) Method and apparatus for encoding signal transporting data to reconfigure sparse matrix
CN109120869A (en) Double light image integration methods, integration equipment and unmanned plane
CN111667438B (en) Video reconstruction method, system, device and computer readable storage medium
CN109934307A (en) Disparity map prediction model training method, prediction technique, device and electronic equipment
CN113852829A (en) Method and device for encapsulating and decapsulating point cloud media file and storage medium
CN115359173A (en) Virtual multi-view video generation method and device, electronic equipment and storage medium
CN111243085A (en) Training method and device for image reconstruction network model and electronic equipment
CN110324585B (en) SLAM system implementation method based on high-speed mobile platform
CN107707830A (en) Panoramic video based on one-way communication plays camera system
CN112785634A (en) Computer device and synthetic depth map generation method
CN115272667A (en) Farmland image segmentation model training method and device, electronic equipment and medium
JP5086120B2 (en) Depth information acquisition method, depth information acquisition device, program, and recording medium
CN115209185A (en) Video frame insertion method and device and readable storage medium
JP2011210232A (en) Image conversion device, image generation system, image conversion method, and image generation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant