CN110223382B - Single-frame image free viewpoint three-dimensional model reconstruction method based on deep learning - Google Patents


Info

Publication number
CN110223382B
CN110223382B (application CN201910509328.9A)
Authority
CN
China
Prior art keywords
viewpoint
camera
dimensional model
coordinates
free viewpoint
Prior art date
Legal status
Active
Application number
CN201910509328.9A
Other languages
Chinese (zh)
Other versions
CN110223382A (en)
Inventor
杨路
李佑华
杨经纶
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910509328.9A priority Critical patent/CN110223382B/en
Publication of CN110223382A publication Critical patent/CN110223382A/en
Application granted granted Critical
Publication of CN110223382B publication Critical patent/CN110223382B/en


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 - Computer-aided design [CAD]
    • G06F30/20 - Design optimisation, verification or simulation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects

Abstract

The invention discloses a single-frame image free viewpoint three-dimensional model reconstruction method based on deep learning, which comprises the following steps: generating training samples; acquiring the high-level semantics of the picture with a feature extraction network; decoupling the image semantics through a decoupling network into a viewpoint-independent three-dimensional model point cloud and camera viewpoint parameters; reconstructing the viewpoint-independent three-dimensional model; estimating the camera viewpoint and generating a free viewpoint; generating the free viewpoint three-dimensional model; and training the deep learning model. The method can simply and efficiently reconstruct a free viewpoint three-dimensional model from a single-frame image, improves the generalization of the model and widens the application range.

Description

Single-frame image free viewpoint three-dimensional model reconstruction method based on deep learning
Technical Field
The invention relates to the field of three-dimensional model reconstruction, in particular to a single-frame image free viewpoint three-dimensional model reconstruction method based on deep learning.
Background
The free viewpoint three-dimensional model builds on the ordinary three-dimensional model and preserves a consistent stereoscopic impression as the viewing angle is switched, so it can provide a more realistic and natural visual environment for stereoscopic multimedia. Because of the complexity of three-dimensional models, the traditional way of generating free viewpoint three-dimensional models is costly: workers must manually render the model under different viewpoints to produce additional viewpoint-specific models, which is inefficient and cumbersome. How to generate free viewpoint three-dimensional models simply and efficiently has long been a research hotspot and holds great application potential.
The viewpoint-independent three-dimensional model can be regarded as a special free viewpoint three-dimensional model at the initial viewpoint: the two share the same shape and differ only in viewpoint. A free viewpoint three-dimensional model can therefore be generated by applying a viewpoint transformation to the viewpoint-independent three-dimensional model. Viewpoint-independent three-dimensional models have broad application prospects in pose estimation, object tracking, target detection and related fields. In three-dimensional object pose estimation, a researcher matches features of a pre-established viewpoint-independent three-dimensional model against the two-dimensional contour in the image to estimate the pose; in three-dimensional object tracking and detection, the viewpoint of an object often changes significantly, and the camera must track the three-dimensional motion of the object independently of the viewpoint, so that feature extraction and result matching can be carried out conveniently and efficiently.
At present, deep learning has achieved great success in single-frame image three-dimensional reconstruction. Using the strong feature extraction capability of convolutional neural networks, a researcher can fully extract prior knowledge such as shape semantics and viewpoint semantics from a single-frame image, obtain high-level abstract semantic features with strong generalization ability, map them to geometric parameters with specific meaning, and use these parameters to guide the reconstruction of the three-dimensional model. However, much current work binds the learning of the three-dimensional shape and of the viewpoint together, so the generated three-dimensional model is only valid for a single camera viewpoint, lacks flexibility, cannot change with the viewpoint, and is limited in practical application. How to widen the application range of such generation methods and improve the efficiency of three-dimensional model generation remains an open problem.
Disclosure of Invention
The invention aims to solve the technical problem of providing a method for reconstructing a three-dimensional model of a free viewpoint of a single-frame image based on deep learning, which can simply and efficiently reconstruct the three-dimensional model of the free viewpoint from the single-frame image.
In order to solve the technical problems, the invention adopts the technical scheme that:
a single-frame image free viewpoint three-dimensional model reconstruction method based on deep learning comprises the following steps:
step one: sampling and rendering a CAD model to generate the real-shape point cloud at the initial viewpoint and single-frame images at different viewpoints and distances;
step two: gradually acquiring high-level semantics of the image through deepening of a feature extraction network;
step three: converting the high-level semantics of the image through a decoupling network, and outputting point cloud coordinates of the viewpoint-independent three-dimensional model and camera viewpoint parameters;
step four: correcting the point cloud coordinates of the viewpoint-independent three-dimensional model output by the decoupling network, and performing three-dimensional shape reconstruction by triangular plate fitting to obtain a viewpoint-independent three-dimensional model;
step five: camera viewpoint parameters output by the decoupling network are subjected to homogeneous transformation to obtain camera viewpoints, and transformation is performed on the basis to generate free viewpoints;
step six: multiplying the free viewpoint by the viewpoint-independent three-dimensional model to obtain a free viewpoint three-dimensional model;
step seven: inputting the training sample into a neural network for automatic training, gradually updating network parameters, and optimizing a free viewpoint three-dimensional model to obtain an optimal result.
Further, in the step one, OpenGL is adopted to sample and render the CAD model to generate a training sample.
Further, in step two, ResNet is used as a feature extraction network, that is, feature extraction is performed on the input image of each training sample by using the following formula:
$$F_n^{(i,j)} = \mathrm{ResNet}\left(I_n^{(i,j)}\right), \quad n = 1, \dots, N$$

wherein N is a positive integer; $F_n^{(i,j)}$ represents the semantic information generated from the jth image of the ith CAD training sample in the nth class; ResNet represents the feature extraction network; and $I_n^{(i,j)}$ represents the jth image of the ith CAD training sample in the nth class.
Further, the third step is specifically:
converting the extracted high-level semantics by applying the global average pooling layer and the fully connected layer of the neural network to the convolved feature map, to obtain the viewpoint-independent three-dimensional model point cloud and the camera viewpoint parameters; the camera viewpoint parameters include: the camera pose, represented by Euler angles comprising the pitch angle pitch (γ), the roll angle roll (β) and the yaw angle yaw (α); and the camera coordinates, represented by the coordinates (t_x, t_y, t_z) of the camera in the initial coordinate system.
Further, the fourth step is specifically:
correcting the point cloud coordinates output by the decoupling network, and fitting triangular facets to the densely distributed point cloud to form a viewpoint-independent three-dimensional surface model;
correcting the negative values in the point cloud coordinates required for the viewpoint-independent three-dimensional reconstruction by the following formula:
$$\hat{Y} = \mathrm{ReLU}(Y)$$

wherein Y represents the output of the decoupling network, $\hat{Y}$ represents the corrected final output, and the ReLU function denotes the rectified linear unit, which sets negative values to zero.
Further, the fifth step is specifically:
the camera pose and coordinates are obtained by applying a homogeneous transformation to the camera viewpoint parameters output by the decoupling network; the three Euler angles, namely the pitch angle pitch (γ), the roll angle roll (β) and the yaw angle yaw (α), are used to compute a rotation matrix representing the camera pose; t_x, t_y, t_z represent the distance of the camera from the CAD model in the initial coordinate system, i.e. the camera coordinates; the coordinates of the CAD model in the camera coordinate system are obtained through the homogeneous transformation, as follows:
$$R = R_z(\alpha)\, R_y(\gamma)\, R_x(\beta), \qquad t = (t_x, t_y, t_z)^T$$

$$\begin{bmatrix} x' \\ y' \\ z' \\ 1 \end{bmatrix} = T \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}, \qquad T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$$
wherein x, y, z represent the fixed coordinates of the CAD model; x', y', z' represent the CAD model coordinates in the camera coordinate system; R is the rotation matrix computed from the Euler angles pitch (γ), roll (β) and yaw (α), and represents the camera pose; t_x, t_y, t_z represent the distance of the camera from the CAD model in the initial coordinate system, i.e. the camera coordinates; T is the pose transformation matrix representing the camera viewpoint, comprising pose and coordinates;
generating a free viewpoint on the basis of the estimated camera viewpoint: taking the CAD model as the sphere center, the camera coordinates are changed so that the camera moves on the sphere, the camera pose is adjusted so that the camera points at the CAD model, the corresponding R' and t' are recorded, and the free viewpoint is obtained according to the following formulas:
$$x^2 + y^2 + z^2 = t_x^2 + t_y^2 + t_z^2$$

$$t' = (x, y, z)$$

$$T' = \begin{bmatrix} R' & t'^T \\ 0 & 1 \end{bmatrix}$$
wherein t', i.e. (x, y, z), represents the coordinates of the free viewpoint, R' represents the camera pose at that moment, and T' represents the pose transformation matrix at that moment, representing the pose and coordinates of the free viewpoint.
Further, the sixth step is:
multiplying the learned viewpoint-independent three-dimensional model by the free viewpoint to obtain a three-dimensional model of the free viewpoint by the following formula:
$$\mathrm{Model}_c = T' \cdot \mathrm{Model}_i$$

wherein Model_i represents the viewpoint-independent three-dimensional model and Model_c represents the free viewpoint three-dimensional model.
Further, the seventh step is specifically:
training the deep learning model with a weighted sum of the chamfer distance and the earth mover's distance between the three-dimensional model predicted by the network and the real free viewpoint three-dimensional model, as in the following formulas:
$$loss_{EMD} = \min_{f:\, P \to Q}\ \sum_{x \in P} \| x - f(x) \|_2$$

$$loss_{CD} = \sum_{x \in P} \min_{y \in Q} \| x - y \|_2^2 + \sum_{y \in Q} \min_{x \in P} \| x - y \|_2^2$$

$$Loss = \lambda_1\, loss_{EMD} + \lambda_2\, loss_{CD}$$

wherein loss_EMD and loss_CD respectively represent the earth mover's distance loss and the chamfer distance loss between the three-dimensional model predicted by the network and the real free viewpoint three-dimensional model; λ1, λ2 represent the loss weights; P represents the point cloud predicted by the network; Q represents the real free viewpoint three-dimensional model; ||·||_2 represents the two-norm; f(x) represents a one-to-one (bijective) mapping from P to Q.
Compared with the prior art, the invention has the following beneficial effects: 1) based on a deep learning neural network, the three-dimensional reconstruction task is completed through training and learning, which improves reconstruction efficiency and reduces operational difficulty; 2) the neural network decoupling separates three-dimensional shape reconstruction from viewpoint learning, so that both the viewpoint-independent three-dimensional model and the camera viewpoint estimate are obtained, making the method highly capable; 3) on this basis, the learned camera viewpoint can be converted into a free viewpoint and multiplied with the viewpoint-independent three-dimensional model to obtain the three-dimensional model at the free viewpoint, giving the method a wide application range.
Drawings
FIG. 1 is a flow chart of a method for reconstructing a free viewpoint three-dimensional model of a single frame image based on deep learning;
FIG. 2 is a schematic diagram of training sample generation;
FIG. 3 is a data transmission diagram of a feature extraction network;
FIG. 4 is a graph of image semantic decoupling;
fig. 5 is a schematic diagram of a free viewpoint three-dimensional model reconstruction process.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. According to the method, through a deep learning neural network, the viewpoint-independent three-dimensional model reconstruction in the three-dimensional reconstruction is decoupled from a camera viewpoint estimation task. On one hand, the viewpoint-independent three-dimensional model does not change along with the change of the camera viewpoint, and has remarkable mobility and adaptability; on the other hand, a free viewpoint can be derived from the estimated camera viewpoint and multiplied by the viewpoint-independent three-dimensional model to obtain a free viewpoint three-dimensional model, so that the application range of the method is expanded.
It is worth noting that the neural network decoupling of the present invention does not rely on any forcible constraint; it is guided automatically by the physical computation in which the viewpoint-independent model and the free viewpoint jointly generate the free viewpoint three-dimensional model. This demonstrates that the neural network can be guided and regularized through physical computation, without resorting to strongly supervised, black-box learning, so the decoupling has a degree of interpretability and is persuasive.
The invention can be implemented on Windows and Linux platforms, and the programming language can be freely chosen, for example Python.
As shown in fig. 1, the method for reconstructing a free viewpoint three-dimensional model of a single frame image based on deep learning includes the following steps:
Step one: Generating training samples
OpenGL is used to sample and render the CAD model to generate training samples for neural network training. Of course, the method for generating training samples is not limited to OpenGL; any method that achieves the same technical effect may be used.
A CAD data set is selected. The ModelNet data set provided by Princeton is widely used in computer vision, computer graphics, robotics and cognitive science; its coverage is comprehensive and its model objects are of high quality. ModelNet contains 662 object categories and 127,915 CAD models, including ten categories of orientation-aligned data, and provides several subsets. This embodiment selects the ModelNet40 subset, which contains 40 classes of CAD models.
A single-frame image is generated. Using the camera simulation function of OpenGL, the three-dimensional object is placed at a suitable position in the scene, and the camera is set to different angles and distances so that the object is projected onto the two-dimensional image plane; a fixed 224 x 224 viewport is used to capture the rendered images. 24 images are generated for each CAD model, and the camera angle and distance corresponding to each image are recorded, as shown on the right side of fig. 2.
The real shape of the CAD model is sampled. In this embodiment, OpenGL is used to fix the CAD model at a suitable position, and point cloud sampling is performed on the model to obtain its real shape; 4096 points are collected in total, as shown on the left side of fig. 2.
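The following NumPy sketch illustrates one way to draw 4096 points uniformly from the surface of a triangle mesh, given its vertex and face arrays; the function name and the mesh layout are assumptions made for illustration only and stand in for the OpenGL-based sampling described above.

```python
import numpy as np

def sample_point_cloud(vertices, faces, n_points=4096):
    """Uniformly sample n_points from a triangle mesh (vertices: (V, 3), faces: (F, 3) indices)."""
    tri = vertices[faces]                                   # (F, 3, 3) triangle corners
    # Triangle areas: 0.5 * |(v1 - v0) x (v2 - v0)|; used as sampling weights.
    areas = 0.5 * np.linalg.norm(
        np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]), axis=1)
    idx = np.random.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Random barycentric coordinates inside each selected triangle.
    u, v = np.random.rand(n_points, 1), np.random.rand(n_points, 1)
    flip = (u + v) > 1.0
    u[flip], v[flip] = 1.0 - u[flip], 1.0 - v[flip]
    t = tri[idx]
    return t[:, 0] + u * (t[:, 1] - t[:, 0]) + v * (t[:, 2] - t[:, 0])
```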
The training set and the test set are divided. For convenience of training, the obtained sample data set is shuffled, and 4/5 of each category is extracted to form the training data set; the remaining 1/5 of each category is stored separately and used to test the model performance on that category.
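A minimal sketch of this per-category 4/5 to 1/5 split, assuming the samples of each category are held in Python lists keyed by class name; the dictionary layout is an assumption for illustration.

```python
import random

def split_dataset(samples_by_class, train_fraction=0.8, seed=0):
    """Shuffle each category and split it into 4/5 training and 1/5 test data."""
    rng = random.Random(seed)
    train, test = {}, {}
    for cls, samples in samples_by_class.items():
        samples = list(samples)
        rng.shuffle(samples)
        cut = int(len(samples) * train_fraction)
        train[cls], test[cls] = samples[:cut], samples[cut:]
    return train, test
```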
Step two: high level semantic extraction of images
Feature extraction is performed on the input image of each training sample to obtain its high-level semantic features. Because each CAD model has 24 input images at different angles and distances, the 24 images of each sample are processed with the same feature extraction model. In this embodiment, ResNet is used as the feature extraction network, but feature extraction networks with other numbers of layers may be used in other embodiments; networks of different depths, such as ResNet50 and ResNet101, can be chosen for three-dimensional models of different complexity.
The input picture of each training sample may be feature extracted by:
$$F_n^{(i,j)} = \mathrm{ResNet}\left(I_n^{(i,j)}\right), \quad n = 1, \dots, N$$

wherein N is a positive integer; in this embodiment N is 40, representing the 40 training data categories of the ModelNet40 subset. $F_n^{(i,j)}$ represents the semantic information generated from the jth image of the ith CAD training sample in the nth class, with j at most 24; ResNet represents the feature extraction network; and $I_n^{(i,j)}$ represents the jth image of the ith CAD training sample in the nth class.
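As an illustration of this step, the following PyTorch/torchvision sketch extracts the convolutional feature map of the 24 renderings of one CAD model; the choice of ResNet50, the random input tensor and the batch layout are illustrative assumptions only.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Keep conv1 through conv5_x and drop the classification head,
# so the network outputs a spatial feature map instead of class scores.
backbone = resnet50()
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])

images = torch.randn(24, 3, 224, 224)       # the 24 renderings of one CAD model
with torch.no_grad():
    features = feature_extractor(images)    # shape: (24, 2048, 7, 7)
print(features.shape)
```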
The core of ResNet consists of shortcut connections and building blocks. The whole feature extraction network is divided into five parts: conv1 is an ordinary convolutional layer, while conv2_x, conv3_x, conv4_x and conv5_x are building blocks composed of convolutional layers and shortcut connections, as shown on the left side of fig. 4.
When a shortcut connection participates in a building block, two cases are distinguished according to whether the number of channels of the input feature map equals the number of channels of the output feature map. The relationship between the input and the output is as follows:
$$F(x) = \begin{cases} \mathrm{conv}(x) + x, & c(x) = c(F(x)) \\ \mathrm{conv}(x) + Wx, & c(x) \neq c(F(x)) \end{cases}$$

wherein x represents the input feature map, F(x) represents the output feature map, conv(·) denotes the stacked convolutions of the building block, c(·) denotes the number of channels of a feature map, and W is a convolution operation that adjusts the number of channels on the shortcut branch.
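A compact PyTorch sketch of such a building block, written only to make the two shortcut cases concrete; the layer sizes and the class name are illustrative assumptions rather than the exact configuration used by the invention.

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Building block: output = convolutional branch + (identity or 1x1-projected) shortcut."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, stride, 1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, 1, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        if in_channels == out_channels and stride == 1:
            self.shortcut = nn.Identity()   # channel counts match: identity shortcut
        else:
            # W: 1x1 convolution that adjusts the channel count of the shortcut
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1, stride, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.branch(x) + self.shortcut(x))
```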
Step three: high-level semantic decoupling of images
The extracted high-level semantics are converted by applying the global average pooling layer and the fully connected layer of the neural network to the convolved feature map, yielding the viewpoint-independent three-dimensional model point cloud and the camera viewpoint parameters. The camera viewpoint parameters include: the camera pose, represented by the Euler angles (pitch (γ), roll (β), yaw (α)); and the camera position, represented by the coordinates (t_x, t_y, t_z) of the camera in the initial coordinate system, with the CAD model at the origin.
In global average pooling, the feature map of each channel is averaged over its whole spatial extent, producing a single output element per channel, i.e. a global semantic mean. The relationship between the input and the output of the global average pooling layer is as follows:
GAP(i)=mean(conv(i))
wherein GAP(i) is the output of the global average pooling layer for channel i, conv(i) is the input feature map of channel i, mean() takes the mean over the whole spatial extent of the input data, and i indexes the input and output channels;
and in the full-connection layer, the output of the global average pooling is used as the input of the full-connection layer, and the point cloud coordinates of the free viewpoint three-dimensional model and the total number of the camera viewpoint estimation parameters are matched by setting the number of neurons of the full-connection layer. The input and output of the fully connected layer are related by the following formula:
$$Y_j = \sum_{i} w_{ij} X_i + b_j$$

wherein X represents the input data and i indexes the input channels; Y represents the output data and j indexes the output channels, whose total number equals the number of three-dimensional model point cloud coordinates plus the number of camera viewpoint parameters; w and b are the weights and bias of the fully connected layer.
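A minimal PyTorch sketch of the decoupling head described in this step: global average pooling followed by a fully connected layer whose output is split into 4096 x 3 point coordinates and 6 camera viewpoint parameters. The module name, the input channel count and the random input are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DecouplingHead(nn.Module):
    """Map a (B, C, H, W) feature map to a point cloud and camera viewpoint parameters."""
    def __init__(self, in_channels=2048, n_points=4096, n_view_params=6):
        super().__init__()
        self.n_points = n_points
        self.gap = nn.AdaptiveAvgPool2d(1)                       # global average pooling
        self.fc = nn.Linear(in_channels, n_points * 3 + n_view_params)

    def forward(self, feat):
        x = self.gap(feat).flatten(1)                            # (B, C)
        out = self.fc(x)                                         # (B, 3*N + 6)
        points = out[:, :self.n_points * 3].view(-1, self.n_points, 3)
        view_params = out[:, self.n_points * 3:]                 # (pitch, roll, yaw, tx, ty, tz)
        return points, view_params

head = DecouplingHead()
points, view = head(torch.randn(2, 2048, 7, 7))
print(points.shape, view.shape)                                  # (2, 4096, 3) and (2, 6)
```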
Step four: viewpoint-independent three-dimensional model reconstruction
The point cloud coordinates output by the decoupling network are corrected, and triangular facets are fitted to the densely distributed point cloud to form a continuous, accurate and well-behaved three-dimensional surface model that is independent of the viewpoint.
The negative values in the point cloud coordinates needed for the viewpoint-independent reconstruction are corrected by:
$$\hat{Y} = \mathrm{ReLU}(Y)$$

wherein Y represents the output of the decoupling network, $\hat{Y}$ represents the corrected final output, and the ReLU function denotes the rectified linear unit, which sets negative values to zero.
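As a small illustration of this correction, a NumPy sketch that clamps negative point coordinates to zero; fitting triangular facets to the corrected point cloud would follow as a separate step and is not shown here.

```python
import numpy as np

def correct_points(raw_points):
    """ReLU correction: clamp negative point cloud coordinates to zero."""
    return np.maximum(raw_points, 0.0)

corrected = correct_points(np.random.randn(4096, 3))
assert corrected.min() >= 0.0
```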
Step five: camera viewpoint estimation and free viewpoint generation
The camera pose and coordinates are obtained by applying a homogeneous transformation to the camera viewpoint parameters output by the decoupling network. The three Euler angles, namely the pitch angle pitch (γ), the roll angle roll (β) and the yaw angle yaw (α), are used to compute a rotation matrix representing the camera pose; t_x, t_y, t_z represent the distance of the camera from the CAD model in the initial coordinate system, i.e. the camera coordinates. The coordinates of the CAD model in the camera coordinate system are obtained through the homogeneous transformation, as follows:
$$R = R_z(\alpha)\, R_y(\gamma)\, R_x(\beta), \qquad t = (t_x, t_y, t_z)^T$$

$$\begin{bmatrix} x' \\ y' \\ z' \\ 1 \end{bmatrix} = T \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}, \qquad T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$$
wherein x, y, z represent the fixed position of the CAD model; x', y', z' represent the CAD model coordinates in the camera coordinate system; R is the rotation matrix computed from the Euler angles pitch (γ), roll (β) and yaw (α), and represents the camera pose; t_x, t_y, t_z represent the distance of the camera from the CAD model in the initial coordinate system, i.e. the camera coordinates; T is the pose transformation matrix representing the camera viewpoint (including pose and coordinates).
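A NumPy sketch of this viewpoint transformation; the Euler angle convention (yaw about z, pitch about y, roll about x) and the helper names are assumptions, since the text does not fix a particular convention.

```python
import numpy as np

def rotation_from_euler(pitch, roll, yaw):
    """Rotation matrix R, here assembled as R = Rz(yaw) @ Ry(pitch) @ Rx(roll) (assumed convention)."""
    cx, sx = np.cos(roll), np.sin(roll)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cz, sz = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def pose_matrix(R, t):
    """Homogeneous 4x4 pose transformation T = [[R, t], [0, 1]]."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# Transform the CAD model points (N, 3) into the camera coordinate system.
R = rotation_from_euler(pitch=0.1, roll=0.0, yaw=0.5)
T = pose_matrix(R, np.array([0.0, 0.0, 2.0]))
points = np.random.rand(4096, 3)
points_h = np.hstack([points, np.ones((4096, 1))])     # homogeneous coordinates
points_cam = (T @ points_h.T).T[:, :3]
```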
Based on the estimated camera viewpoint, a free viewpoint is generated: taking the CAD model as the sphere center, the camera position is changed so that the camera moves on the sphere, the camera pose is adjusted so that the camera points at the CAD model, the corresponding R' and t' are recorded, and the free viewpoint is obtained according to the following formulas:
$$x^2 + y^2 + z^2 = t_x^2 + t_y^2 + t_z^2$$

$$t' = (x, y, z)$$

$$T' = \begin{bmatrix} R' & t'^T \\ 0 & 1 \end{bmatrix}$$
wherein t', i.e. (x, y, z), represents the position of the free viewpoint, R' represents the camera pose at that moment, and T' represents the pose transformation matrix at that moment, representing the free viewpoint (including pose and coordinates).
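A NumPy sketch of generating a free viewpoint on the sphere around the model; the look-at construction used to aim the camera at the CAD model is one common choice and is an assumption rather than the formula prescribed by the invention.

```python
import numpy as np

def look_at_rotation(cam_pos, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Rotation R' that orients a camera at cam_pos toward the target (standard look-at)."""
    z = target - cam_pos
    z = z / np.linalg.norm(z)                   # forward axis, toward the CAD model
    x = np.cross(up, z)
    x = x / np.linalg.norm(x)                   # right axis
    y = np.cross(z, x)                          # camera up axis
    return np.stack([x, y, z], axis=0)

def free_viewpoint(t, azimuth, elevation):
    """Place the camera on the sphere of radius |t| centred on the model and build T'."""
    r = np.linalg.norm(t)
    cam_pos = r * np.array([np.cos(elevation) * np.cos(azimuth),
                            np.cos(elevation) * np.sin(azimuth),
                            np.sin(elevation)])
    T_free = np.eye(4)
    T_free[:3, :3] = look_at_rotation(cam_pos)  # R'
    T_free[:3, 3] = cam_pos                     # t'
    return T_free
```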
Step six: free viewpoint three-dimensional model generation
The viewpoint-independent three-dimensional model and the free viewpoint three-dimensional model differ only in viewpoint, i.e. in camera pose and position, while their shape information is the same. The learned viewpoint-independent three-dimensional model is multiplied by the free viewpoint according to the following formula to obtain the three-dimensional model at the free viewpoint:
$$\mathrm{Model}_c = T' \cdot \mathrm{Model}_i$$

wherein Model_i represents the viewpoint-independent three-dimensional model and Model_c represents the free viewpoint three-dimensional model.
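A short NumPy sketch of this multiplication in homogeneous coordinates; the function name is hypothetical, and T_free is assumed to be a 4 x 4 free viewpoint matrix such as the one produced by the previous sketch.

```python
import numpy as np

def apply_viewpoint(model_points, T_free):
    """Model_c = T' * Model_i: transform the viewpoint-independent point cloud to the free viewpoint."""
    n = model_points.shape[0]
    homog = np.hstack([model_points, np.ones((n, 1))])   # (N, 4) homogeneous coordinates
    return (T_free @ homog.T).T[:, :3]                   # free viewpoint point cloud, (N, 3)
```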
Step seven: Deep learning model training
The deep learning model is trained with a weighted sum of the chamfer distance and the earth mover's distance between the three-dimensional model predicted by the network and the real free viewpoint three-dimensional model, as in the following formulas:
$$loss_{EMD} = \min_{f:\, P \to Q}\ \sum_{x \in P} \| x - f(x) \|_2$$

$$loss_{CD} = \sum_{x \in P} \min_{y \in Q} \| x - y \|_2^2 + \sum_{y \in Q} \min_{x \in P} \| x - y \|_2^2$$

$$Loss = \lambda_1\, loss_{EMD} + \lambda_2\, loss_{CD}$$

wherein loss_EMD and loss_CD respectively represent the earth mover's distance loss and the chamfer distance loss between the three-dimensional model predicted by the network and the real free viewpoint three-dimensional model; λ1, λ2 represent the loss weights; P represents the point cloud predicted by the network; Q represents the real free viewpoint three-dimensional model; ||·||_2 represents the two-norm; f(x) represents a one-to-one (bijective) mapping from P to Q.
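A NumPy/SciPy sketch of the two losses; here the earth mover's distance uses an exact optimal one-to-one assignment, which in practice is usually replaced by an approximation for 4096-point clouds, and the loss weights are placeholders.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def chamfer_distance(P, Q):
    """Sum of squared nearest-neighbour distances in both directions."""
    d = cdist(P, Q)                                    # pairwise Euclidean distances
    return (d.min(axis=1) ** 2).sum() + (d.min(axis=0) ** 2).sum()

def earth_mover_distance(P, Q):
    """Exact EMD via an optimal bijection between equally sized point sets."""
    d = cdist(P, Q)
    row, col = linear_sum_assignment(d)                # one-to-one mapping f: P -> Q
    return d[row, col].sum()

def total_loss(P, Q, lam1=1.0, lam2=1.0):
    return lam1 * earth_mover_distance(P, Q) + lam2 * chamfer_distance(P, Q)
```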
Unlike conventional three-dimensional reconstruction methods, the deep-learning-based single-frame image free viewpoint three-dimensional model reconstruction method of the invention automatically recovers the three-dimensional model from a single image without manual operation. More importantly, the invention separates the generation of the viewpoint-independent three-dimensional model from the estimation of the camera viewpoint while also generating the free viewpoint three-dimensional model. The viewpoint-independent three-dimensional model is convenient for pose estimation, tracking, detection and related fields; the free viewpoint three-dimensional model can be used to expand three-dimensional data sets, reducing data cost and improving the efficiency of three-dimensional reconstruction. Overall, compared with traditional work, the invention provides a more flexible three-dimensional model reconstruction method that, while fulfilling the basic reconstruction task, improves the generalization of the model and widens its application range.

Claims (7)

1. A single-frame image free viewpoint three-dimensional model reconstruction method based on deep learning is characterized by comprising the following steps:
step one: sampling and rendering a CAD model to generate the real-shape point cloud at the initial viewpoint and single-frame images at different viewpoints and distances;
step two: gradually acquiring high-level semantics of the image through deepening of a feature extraction network;
step three: converting the high-level semantics of the image through a decoupling network, and outputting point cloud coordinates of the viewpoint-independent three-dimensional model and camera viewpoint parameters;
step four: correcting the point cloud coordinates of the viewpoint-independent three-dimensional model output by the decoupling network, and performing three-dimensional shape reconstruction by triangular plate fitting to obtain a viewpoint-independent three-dimensional model;
step five: camera viewpoint parameters output by the decoupling network are subjected to homogeneous transformation to obtain camera viewpoints, and transformation is performed on the basis to generate free viewpoints;
step six: multiplying the free viewpoint by the viewpoint-independent three-dimensional model to obtain a free viewpoint three-dimensional model;
step seven: inputting the training sample into a neural network for automatic training, gradually updating network parameters, and optimizing a free viewpoint three-dimensional model to obtain an optimal result.
2. The method for reconstructing the free viewpoint three-dimensional model of single frame image based on deep learning of claim 1, wherein in the first step, the CAD model is sampled and rendered by OpenGL to generate training samples.
3. The method for reconstructing the free viewpoint three-dimensional model of the single frame image based on the deep learning of claim 1, wherein in the second step, ResNet is used as the feature extraction network, that is, the feature extraction is performed on the input image of each training sample by using the following formula:
$$F_n^{(i,j)} = \mathrm{ResNet}\left(I_n^{(i,j)}\right), \quad n = 1, \dots, N$$

wherein N is a positive integer; $F_n^{(i,j)}$ represents the semantic information generated from the jth image of the ith CAD training sample in the nth class; ResNet represents the feature extraction network; and $I_n^{(i,j)}$ represents the jth image of the ith CAD training sample in the nth class.
4. The method for reconstructing the free viewpoint three-dimensional model of the single frame image based on the deep learning as claimed in claim 1, wherein the third step is specifically as follows:
converting the extracted high-level semantics by applying the global average pooling layer and the fully connected layer of the neural network to the convolved feature map, to obtain the viewpoint-independent three-dimensional model point cloud and the camera viewpoint parameters; the camera viewpoint parameters include: the camera pose, represented by Euler angles comprising a pitch angle γ, a roll angle β and a yaw angle α; and the camera coordinates, represented by the coordinates (t_x, t_y, t_z) of the camera in the initial coordinate system.
5. The method for reconstructing the free viewpoint three-dimensional model of the single frame image based on the deep learning as claimed in claim 1, wherein the step four is specifically as follows:
correcting the point cloud coordinates output by the decoupling network, and fitting triangular facets to the densely distributed point cloud to form a viewpoint-independent three-dimensional surface model;
correcting the negative values in the point cloud coordinates required for the viewpoint-independent three-dimensional reconstruction by the following formula:
$$\hat{Y} = \mathrm{ReLU}(Y)$$

wherein Y represents the output of the decoupling network, $\hat{Y}$ represents the corrected final output, and the ReLU function denotes the rectified linear unit, which sets negative values to zero.
6. The method for reconstructing the free viewpoint three-dimensional model of the single frame image based on the deep learning as claimed in claim 1, wherein the step five is specifically as follows:
camera pose and coordinates are obtained by applying a homogeneous transformation to the camera viewpoint parameters output by the decoupling network; the three Euler angles, namely a pitch angle γ, a roll angle β and a yaw angle α, are used to compute a rotation matrix representing the camera pose; t_x, t_y, t_z represent the distance of the camera from the CAD model in the initial coordinate system, i.e. the camera coordinates; the coordinates of the CAD model in the camera coordinate system are obtained through the homogeneous transformation, as follows:
$$R = R_z(\alpha)\, R_y(\gamma)\, R_x(\beta), \qquad t = (t_x, t_y, t_z)^T$$

$$\begin{bmatrix} x' \\ y' \\ z' \\ 1 \end{bmatrix} = T \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}, \qquad T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}$$
wherein x, y, z represent the fixed coordinates of the CAD model; x', y', z' represent the CAD model coordinates in the camera coordinate system; R is a rotation matrix computed from the Euler angles comprising the pitch angle γ, the roll angle β and the yaw angle α, and represents the camera pose; t_x, t_y, t_z represent the distance of the camera from the CAD model in the initial coordinate system, i.e. the camera coordinates; T is a pose transformation matrix representing the camera viewpoint, comprising pose and coordinates;
generating a free viewpoint on the basis of the estimated camera viewpoint: taking the CAD model as the sphere center, changing the camera coordinates so that the camera moves on the sphere, adjusting the camera pose so that the camera points at the CAD model, recording R' and t' at that moment, and obtaining the free viewpoint according to the following formulas:
$$x^2 + y^2 + z^2 = t_x^2 + t_y^2 + t_z^2$$

$$t' = (x, y, z)$$

$$T' = \begin{bmatrix} R' & t'^T \\ 0 & 1 \end{bmatrix}$$
wherein t', i.e. (x, y, z), represents the coordinates of the free viewpoint, R' represents the pose of the camera at that time, and T' represents the pose transformation matrix at that time, representing the pose and coordinates of the free viewpoint.
7. The method for reconstructing the free viewpoint three-dimensional model of the single frame image based on the deep learning as claimed in claim 1, wherein the sixth step is:
multiplying the learned viewpoint-independent three-dimensional model by the free viewpoint to obtain a three-dimensional model of the free viewpoint by the following formula:
$$\mathrm{Model}_c = T' \cdot \mathrm{Model}_i$$

wherein Model_i represents the viewpoint-independent three-dimensional model and Model_c represents the free viewpoint three-dimensional model.
CN201910509328.9A 2019-06-13 2019-06-13 Single-frame image free viewpoint three-dimensional model reconstruction method based on deep learning Active CN110223382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910509328.9A CN110223382B (en) 2019-06-13 2019-06-13 Single-frame image free viewpoint three-dimensional model reconstruction method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910509328.9A CN110223382B (en) 2019-06-13 2019-06-13 Single-frame image free viewpoint three-dimensional model reconstruction method based on deep learning

Publications (2)

Publication Number Publication Date
CN110223382A CN110223382A (en) 2019-09-10
CN110223382B true CN110223382B (en) 2021-02-12

Family

ID=67816761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910509328.9A Active CN110223382B (en) 2019-06-13 2019-06-13 Single-frame image free viewpoint three-dimensional model reconstruction method based on deep learning

Country Status (1)

Country Link
CN (1) CN110223382B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11375176B2 (en) * 2019-02-05 2022-06-28 Nvidia Corporation Few-shot viewpoint estimation
CN110675488B (en) * 2019-09-24 2023-02-28 电子科技大学 Method for constructing modeling system of creative three-dimensional voxel model based on deep learning
CN110798673B (en) * 2019-11-13 2021-03-19 南京大学 Free viewpoint video generation and interaction method based on deep convolutional neural network
CN113569761B (en) * 2021-07-30 2023-10-27 广西师范大学 Student viewpoint estimation method based on deep learning
CN115375884B (en) * 2022-08-03 2023-05-30 北京微视威信息科技有限公司 Free viewpoint synthesis model generation method, image drawing method and electronic device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101719286A (en) * 2009-12-09 2010-06-02 北京大学 Multiple viewpoints three-dimensional scene reconstructing method fusing single viewpoint scenario analysis and system thereof
CN106778687A (en) * 2017-01-16 2017-05-31 大连理工大学 Method for viewing points detecting based on local evaluation and global optimization

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100583127C (en) * 2008-01-14 2010-01-20 浙江大学 An identification method for movement by human bodies irrelevant with the viewpoint based on stencil matching
US9767598B2 (en) * 2012-05-31 2017-09-19 Microsoft Technology Licensing, Llc Smoothing and robust normal estimation for 3D point clouds
JP5295416B1 (en) * 2012-08-01 2013-09-18 ヤフー株式会社 Image processing apparatus, image processing method, and image processing program
US10389994B2 (en) * 2016-11-28 2019-08-20 Sony Corporation Decoder-centric UV codec for free-viewpoint video streaming
US11665308B2 (en) * 2017-01-31 2023-05-30 Tetavi, Ltd. System and method for rendering free viewpoint video for sport applications

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101719286A (en) * 2009-12-09 2010-06-02 北京大学 Multiple viewpoints three-dimensional scene reconstructing method fusing single viewpoint scenario analysis and system thereof
CN106778687A (en) * 2017-01-16 2017-05-31 大连理工大学 Method for viewing points detecting based on local evaluation and global optimization

Also Published As

Publication number Publication date
CN110223382A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
CN110223382B (en) Single-frame image free viewpoint three-dimensional model reconstruction method based on deep learning
CN108416840B (en) Three-dimensional scene dense reconstruction method based on monocular camera
CN111462329B (en) Three-dimensional reconstruction method of unmanned aerial vehicle aerial image based on deep learning
CN108921926B (en) End-to-end three-dimensional face reconstruction method based on single image
CN111968217B (en) SMPL parameter prediction and human body model generation method based on picture
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN111553949B (en) Positioning and grabbing method for irregular workpiece based on single-frame RGB-D image deep learning
CN111899328B (en) Point cloud three-dimensional reconstruction method based on RGB data and generation countermeasure network
CN109815847B (en) Visual SLAM method based on semantic constraint
CN111062326B (en) Self-supervision human body 3D gesture estimation network training method based on geometric driving
CN113822284B (en) RGBD image semantic segmentation method based on boundary attention
CN112767467B (en) Double-image depth estimation method based on self-supervision deep learning
CN110135277B (en) Human behavior recognition method based on convolutional neural network
CN113392584B (en) Visual navigation method based on deep reinforcement learning and direction estimation
CN111797692B (en) Depth image gesture estimation method based on semi-supervised learning
CN113313732A (en) Forward-looking scene depth estimation method based on self-supervision learning
US20220351463A1 (en) Method, computer device and storage medium for real-time urban scene reconstruction
CN110930306A (en) Depth map super-resolution reconstruction network construction method based on non-local perception
CN115147545A (en) Scene three-dimensional intelligent reconstruction system and method based on BIM and deep learning
CN113593043B (en) Point cloud three-dimensional reconstruction method and system based on generation countermeasure network
CN116188550A (en) Self-supervision depth vision odometer based on geometric constraint
CN117132651A (en) Three-dimensional human body posture estimation method integrating color image and depth image
CN112669452A (en) Object positioning method based on convolutional neural network multi-branch structure
CN114897955B (en) Depth completion method based on micro-geometric propagation
CN113592947B (en) Method for realizing visual odometer by semi-direct method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant