CN108765333B - Depth map completion method based on a deep convolutional neural network

Info

Publication number
CN108765333B
CN108765333B
Authority
CN
China
Prior art keywords
depth
rgb
neural network
depth map
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810505428.XA
Other languages
Chinese (zh)
Other versions
CN108765333A (en)
Inventor
袁书聪
青春美
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201810505428.XA
Publication of CN108765333A
Application granted
Publication of CN108765333B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 - Image enhancement or restoration
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/10 - Image acquisition modality
    • G06T2207/10028 - Range image; Depth image; 3D point clouds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20081 - Training; Learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T2207/20 - Special algorithmic details
    • G06T2207/20084 - Artificial neural networks [ANN]

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a depth map completion method based on a deep convolutional neural network, which comprises the following steps: 1) extracting samples and labels from the depth pictures and RGB pictures in the training data, and cropping square picture blocks; 2) performing data augmentation, including rotation and distortion operations, on the square picture-block samples extracted from the training data; 3) training a deep convolutional neural network on the augmented training data; 4) preprocessing the depth map and RGB picture to be processed; 5) passing the preprocessed depth map and RGB picture through the trained neural network to complete the depth. The method makes full use of the structural information in the RGB picture and the mutual relationship between the left and right depths, and uses the powerful feature-extraction capability of the neural network to solve the problem of the low quality of depth maps acquired by devices, so that they can be better applied in industrial and everyday fields.

Description

Depth map completion method based on a deep convolutional neural network
Technical Field
The invention relates to the technical fields of autonomous driving and depth reconstruction, and in particular to a depth map completion method based on a deep convolutional neural network.
Background
With the development of science and technology, depth cameras are gradually entering people's lives. An ordinary camera captures visible light and images it on a planar picture, where the value of each pixel is the intensity of its red, green, and blue components; in a picture taken by a depth camera, the value of each pixel is the distance from the camera's imaging plane to that point.
The use of and demand for high-quality depth maps has increased in both industry and entertainment. In the industrial field, the depth map is a necessary input to unmanned-vehicle navigation systems: without it, the surrounding environment cannot be perceived. In robotics, the depth map provides positioning guidance for the operation of robots and mechanical arms. In the smart home, gesture-based human-computer interaction is gradually replacing the traditional key-press interaction. In gaming, motion-sensing games, virtual reality, and augmented reality all need the depth pictures acquired by a depth camera. One day, the depth camera may well be as standard as the visible-light camera.
The depth cameras currently on the market can be roughly divided into two types. One type is based on infrared light, such as Kinect, Kinect2, LeapMotion, and RealSense; these can be further subdivided into devices based on coded light and on TOF techniques. The other type is based on binocular matching, whose principle is similar to the binocular vision of human eyes: a depth map is computed from two visible-light pictures of the same scene. However, each method has serious drawbacks. Depth cameras based on infrared light are only practical in indoor environments; outdoors, excessive noise renders the device unusable, and even indoors noise remains a problem. With binocular cameras, occlusion makes the depth of some regions unobtainable.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a depth map completion method based on a deep convolutional neural network. The method combines the depth map with the RGB picture and, through the feature-extraction capability of deep learning, distinguishes continuously smooth regions from steep edge regions according to the RGB picture, thereby guiding the smoothing and completion of the depth map. It solves the problem of the low quality of depth maps acquired by devices, so that they can be better applied in industrial and everyday fields.
The technical solution provided by the invention is as follows: a depth map completion method based on a deep convolutional neural network, comprising the following steps:
1) extracting samples and labels from the depth pictures and RGB pictures in the training data, and cropping square picture blocks;
2) performing data augmentation, including rotation and distortion operations, on the square picture-block samples extracted from the training data;
3) training a deep convolutional neural network on the augmented training data;
4) preprocessing the depth map and RGB picture to be processed;
5) passing the preprocessed depth map and RGB picture through the trained neural network to complete the depth.
In step 1), to improve the training efficiency of the deep convolutional neural network, the known accurate training-label depth map with complete edges, the RGB map, and the depth map to be completed are cut into square picture blocks of fixed size; this processing does not affect the learning effect of the neural network.
In step 2), the RGB image, the depth map to be completed, and the training-label depth map are subjected to the same set of transformations, including rotation, scaling up or down, and flipping, which improves robustness and avoids overfitting.
In step 3), a neural network is constructed and trained. For a depth camera based on infrared light there is only one group of RGB and to-be-completed depth data, so the training input of the network comprises a group of RGB square picture blocks and a group of square picture blocks of the depth map to be completed, and the label is the completed square picture block; the input data passes through feature-extraction convolutions to obtain rich features, feature screening is performed by a multi-scale receptive-field residual network, and finally MSE is adopted as the cost function. For a depth camera with a binocular-matching structure, RGB and depth each have a left group and a right group; the input is the left and right RGB rectangular picture blocks and the rectangular picture blocks of the depth maps to be completed, and the label is the completed depth map of the left or right view; the input data passes through feature-extraction convolutional layers to obtain rich features, feature screening is performed by a multi-scale receptive-field residual network, and finally MSE is adopted as the cost function. The neural network is then trained by back-propagation. The multi-scale receptive-field residual network is a neural-network submodule: its input is a rectangular picture block, which is convolved with kernels of different sizes; the picture is edge-padded according to each kernel size so that the feature maps produced by the convolutions have the same scale; the feature matrices are then superposed or averaged per channel, and the input of the module is also cascaded directly to the output of the module.
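As a concrete illustration, the following is a minimal PyTorch sketch of such a multi-scale receptive-field residual submodule; the class name, channel count, and kernel sizes are illustrative assumptions rather than values fixed by the patent, and the channel-wise average could equally be replaced by channel-wise concatenation, as the text allows.

```python
import torch
import torch.nn as nn

class MultiScaleResidualBlock(nn.Module):
    """Convolve the input with kernels of several sizes, pad each so the
    spatial size is unchanged, merge the responses per channel, and cascade
    the block input directly to its output."""
    def __init__(self, channels, kernel_sizes=(3, 5, 9, 11)):
        super().__init__()
        # One convolution per kernel size; padding k//2 keeps the feature
        # maps at the same spatial scale, as the description requires.
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2)
             for k in kernel_sizes])

    def forward(self, x):
        # Channel-wise average of the multi-scale responses (the patent also
        # allows channel-wise superposition instead of averaging).
        merged = torch.stack([b(x) for b in self.branches]).mean(dim=0)
        return merged + x  # residual: input cascaded directly to the output
```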
In step 4), the completed depth map is no longer needed when the network is used; the inputs are the depth map to be completed and the RGB map. Both are given the same preprocessing, with the pixel values of the pictures normalized to between 0 and 1.
In step 5), the preprocessed depth map to be completed and the RGB picture are propagated forward through the trained neural network to obtain the completed depth map.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention is the first to realize RGB-guided depth map completion based on deep learning, overcoming the blurred edges, information loss, and low precision of depth maps produced by traditional methods.
2. The invention uses a multi-scale receptive-field residual network, which makes the neural network easier to train.
3. The invention is the first to realize a deep-learning-based left-right depth map cross-checking function for binocular-matching depth cameras, combining left and right depth and RGB information. This overcomes the limitation of traditional methods, which are too simple, cannot exploit global information, and complete the depth using only the left and right depth maps.
4. The invention is the first to realize a deep-learning-based occlusion-completion function for binocular-matching depth cameras, solving the depth-map holes caused by viewpoint occlusion in binocular-matching cameras.
5. The invention is simple to use, fast, and widely applicable in industry, robotics, entertainment, and other fields.
Drawings
FIG. 1 is the deep network architecture diagram for the infrared depth camera of the present invention.
FIG. 2 is the architecture diagram of the multi-scale receptive-field residual network.
FIG. 3 is the deep network architecture diagram for the binocular-matching depth camera of the present invention.
Detailed Description
The present invention will be further described with reference to the following specific examples.
The depth map completion method based on a deep convolutional neural network provided by this embodiment proceeds as follows:
1) acquiring input data and enhancing
Training data is obtained, and data sets such as SUN3D, Middlebury data sets and the like with depth maps are obtained on the web at the same time. In order to improve the universality of the deep learning network and prevent overfitting, data enhancement operation is carried out on the obtained data. And for pictures in the same group, the depth map to be completed, the RGB map and the completed depth map are included. The three pictures are subjected to the same random transformation, such as scaling, small-angle rotation, picture brightness enhancement and the like, and then RGB and depth of the pictures are normalized to be between 0 and 1.
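A minimal augmentation sketch of the above, assuming OpenCV and NumPy; the function name, parameter ranges, and max-based depth normalization are illustrative assumptions, not values taken from the patent.

```python
import random
import cv2
import numpy as np

def augment_group(rgb, depth_in, depth_gt, max_angle=5.0):
    """Apply the same random rotation/scaling to all three pictures in a
    group, jitter the RGB brightness, then normalize everything to [0, 1]."""
    h, w = rgb.shape[:2]
    angle = random.uniform(-max_angle, max_angle)   # small-angle rotation
    scale = random.uniform(0.9, 1.1)                # mild scaling
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    rgb = cv2.warpAffine(rgb, m, (w, h))
    depth_in = cv2.warpAffine(depth_in, m, (w, h))
    depth_gt = cv2.warpAffine(depth_gt, m, (w, h))
    # Brightness jitter applies to the RGB picture only.
    rgb = np.clip(rgb.astype(np.float32) * random.uniform(0.8, 1.2), 0, 255)
    # Normalize RGB and depth to [0, 1] (dividing depth by its max is one
    # possible normalization; the patent only states the target range).
    rgb = rgb / 255.0
    depth_in = depth_in.astype(np.float32) / max(float(depth_in.max()), 1e-6)
    depth_gt = depth_gt.astype(np.float32) / max(float(depth_gt.max()), 1e-6)
    return rgb, depth_in, depth_gt
```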
2) Converting the complete picture into square or rectangular picture blocks
For a depth camera based on infrared light, the complete picture is divided into groups of square picture blocks with identical center positions and identical sizes; the input of each group comprises an RGB square picture block and a square picture block of the depth map to be completed, and the output is the completed square picture block of the depth map. For a depth camera with a basic binocular-matching structure, the complete picture is divided into long strip-shaped rectangular picture blocks whose horizontal center positions are the same. Dividing the complete picture into square or rectangular picture blocks accelerates the training of the network.
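The block extraction could look like the following hypothetical sketch; the block size and stride are assumptions, since the patent fixes neither.

```python
def extract_square_blocks(img, size=64, stride=32):
    """Yield square blocks on a regular grid; applying this with the same
    size/stride to RGB, input depth, and label depth gives blocks whose
    center positions coincide across the three pictures."""
    h, w = img.shape[:2]
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            yield img[y:y + size, x:x + size]

def extract_strips(img, height=64, stride=32):
    """Yield full-width horizontal strips for the binocular-matching case,
    so every strip shares the same horizontal center position."""
    h = img.shape[0]
    for y in range(0, h - height + 1, stride):
        yield img[y:y + height, :]
```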
3) Training of deep convolutional neural networks
3.1) For an infrared-light-based depth camera, the structure of the network is shown in FIG. 1. Each group of data is propagated forward through the network: the RGB square picture block and the square picture block of the depth map to be completed pass through their respective feature-extraction networks, where rich features are extracted by deep convolutional layers. The outputs of the RGB and depth feature-extraction networks then pass through the multi-scale receptive-field residual network, which screens the extracted features. The multi-scale receptive-field residual network is a neural-network submodule: its input, a rectangular picture block, is convolved with kernels of different sizes (such as 3×3, 5×5, 9×9, and 11×11); the picture edges are padded according to each kernel size so that the feature maps produced by the convolutions have the same scale; the feature matrices are then superposed or averaged per channel, and the module input is also cascaded directly to the module output, by channel-wise superposition or direct addition, as shown in FIG. 2. Completing the depth map requires detail information at small scales as well as structural information at large scales, so multi-scale receptive fields perform better, while the residual training scheme makes training more accurate. The screened features are then fused, through a fully connected layer or by superposing the two channels, and further convolutional layers produce the network output, which is compared with the ground-truth depth-map square picture block to compute the MSE error. The network parameters are then adjusted with the back-propagation algorithm using a stochastic gradient descent optimizer.
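Building on the MultiScaleResidualBlock sketched earlier, a minimal PyTorch rendering of this FIG. 1 pipeline might look as follows; the layer widths and depths, and the choice of channel-superposition fusion over a fully connected layer, are assumptions the patent leaves open.

```python
import torch
import torch.nn as nn
# Uses MultiScaleResidualBlock from the sketch above.

class InfraredCompletionNet(nn.Module):
    def __init__(self, feat=32):
        super().__init__()
        # Separate feature-extraction branches for RGB and the input depth.
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        self.depth_branch = nn.Sequential(
            nn.Conv2d(1, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        # Multi-scale receptive-field residual screening of each branch.
        self.rgb_screen = MultiScaleResidualBlock(feat)
        self.depth_screen = MultiScaleResidualBlock(feat)
        # Fuse the two branches by channel superposition, then predict depth.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, 1, 3, padding=1))

    def forward(self, rgb, depth):
        f_rgb = self.rgb_screen(self.rgb_branch(rgb))
        f_depth = self.depth_screen(self.depth_branch(depth))
        return self.fuse(torch.cat([f_rgb, f_depth], dim=1))
```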
3.2) For binocular-matching depth cameras, the network structure is shown in FIG. 3. Each group of data is first propagated forward through the network: the rectangular picture blocks of the left-view RGB image and depth map pass through their respective feature-extraction convolutional networks to obtain feature matrices, and then through their respective multi-scale receptive-field residual networks for feature screening, yielding feature matrix 1 and feature matrix 2; the rectangular picture blocks of the right-view RGB image and depth map likewise yield feature matrix 3 and feature matrix 4. Feature matrices 1-4 are then passed through a fully connected layer to obtain the network output, which is compared with the rectangular picture block at the corresponding position of the ground-truth left-view depth map to compute the MSE error. The network parameters are then adjusted with the back-propagation algorithm using a stochastic gradient descent optimizer.
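A compact sketch of this FIG. 3 pipeline, again reusing the MultiScaleResidualBlock from above. The strip dimensions and channel counts are assumptions, and a practical system would downsample the feature matrices before the fully connected layer; the flat dimension here is only workable for small strips.

```python
import torch
import torch.nn as nn
# Uses MultiScaleResidualBlock from the sketch above.

class BinocularCompletionNet(nn.Module):
    def __init__(self, feat=8, strip_h=32, strip_w=128):
        super().__init__()
        def branch(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(),
                MultiScaleResidualBlock(feat))
        self.left_rgb, self.left_depth = branch(3), branch(1)
        self.right_rgb, self.right_depth = branch(3), branch(1)
        # Fully connected fusion of the four flattened feature matrices.
        self.head = nn.Linear(4 * feat * strip_h * strip_w, strip_h * strip_w)
        self.out_shape = (strip_h, strip_w)

    def forward(self, lrgb, ldepth, rrgb, rdepth):
        feats = [self.left_rgb(lrgb), self.left_depth(ldepth),
                 self.right_rgb(rrgb), self.right_depth(rdepth)]  # matrices 1-4
        flat = torch.cat([f.flatten(1) for f in feats], dim=1)
        h, w = self.out_shape
        return self.head(flat).view(-1, 1, h, w)  # completed left-view strip
```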
3.3) One forward propagation and one backward propagation over all the data constitutes an epoch, and the training process of the network can be accelerated with a GPU. In each epoch, N groups of data are first randomly drawn from the data set, and the network parameters are then optimized with the forward- and back-propagation procedures of steps 3.1) and 3.2). This continues until the error between the network output and the label no longer decreases significantly, or a specified number of iterations is reached.
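An assumed per-epoch training loop for the infrared variant might read as follows; the hyper-parameters, the stopping test, and the data layout (a list of tensor triples) are illustrative, not prescribed by the patent.

```python
import random
import torch
import torch.nn as nn

def train(net, dataset, epochs=100, n_groups=32, lr=1e-3, device="cuda"):
    """dataset: list of (rgb, depth_in, depth_gt) tensor triples."""
    net = net.to(device)
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    mse = nn.MSELoss()
    for epoch in range(epochs):
        batch = random.sample(dataset, n_groups)   # N random groups per epoch
        total = 0.0
        for rgb, depth_in, depth_gt in batch:
            rgb, depth_in, depth_gt = (
                t.to(device) for t in (rgb, depth_in, depth_gt))
            pred = net(rgb, depth_in)              # forward propagation
            loss = mse(pred, depth_gt)             # MSE against the label block
            opt.zero_grad()
            loss.backward()                        # back-propagation
            opt.step()                             # stochastic gradient descent
            total += loss.item()
        # Stop early once the loss no longer decreases appreciably (omitted).
```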
4) Preprocessing of the depth map to be completed
When the network is used, the completed depth map is no longer needed; the inputs are the depth map to be completed and the RGB map. Both are given the same preprocessing, with the pixel values of the pictures normalized to between 0 and 1.
5) Completion by the neural network
The preprocessed depth map to be completed and the RGB picture are propagated forward through the trained neural network to obtain the completed depth map.
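A hypothetical end-to-end inference sketch, combining the normalization of step 4) with the forward pass of step 5); the tensor layouts and the max-based depth normalization are assumptions.

```python
import torch

def complete_depth(net, rgb, depth, device="cuda"):
    """rgb: HxWx3 uint8 array; depth: HxW array. Returns the completed map."""
    rgb = torch.as_tensor(rgb, dtype=torch.float32).permute(2, 0, 1) / 255.0
    depth = torch.as_tensor(depth, dtype=torch.float32).unsqueeze(0)
    depth = depth / max(float(depth.max()), 1e-6)   # normalize to [0, 1]
    with torch.no_grad():                           # single forward pass
        out = net(rgb.unsqueeze(0).to(device), depth.unsqueeze(0).to(device))
    return out.squeeze().cpu().numpy()
```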
The above-described embodiments are merely preferred embodiments of the present invention, and the scope of the invention is not limited thereto; changes made according to the form and principle of the present invention shall all fall within the protection scope of the invention.

Claims (2)

1. A depth map completion method based on a deep convolutional neural network, characterized by comprising the following steps:
1) extracting samples and labels from the depth maps and RGB maps in the training data, and cropping square picture blocks;
2) performing data augmentation, including rotation and distortion operations, on the square picture-block samples extracted from the training data;
3) training a deep convolutional neural network on the augmented training data;
constructing and training a neural network, wherein for a depth camera based on infrared light there is only one group of RGB images and depth maps to be completed, so the training input of the network comprises a group of square picture blocks of the RGB image and a group of square picture blocks of the depth map to be completed, the label is the completed square picture block, the input data passes through feature-extraction convolutions to obtain rich features, feature screening is then performed by a multi-scale receptive-field residual network, and finally MSE is adopted as the cost function; for a depth camera with a binocular-matching structure, the RGB images and depth maps each have a left group and a right group, the input is the rectangular picture blocks of the RGB images and the rectangular picture blocks of the depth maps to be completed, the label is the completed depth map of the left or right view, the input data passes through feature-extraction convolutional layers to obtain rich features, feature screening is performed by a multi-scale receptive-field residual network, and finally MSE is adopted as the cost function; the neural network is then trained by back-propagation; the multi-scale receptive-field residual network is a neural-network submodule, the input of the module is a rectangular picture block, which is convolved with kernels of different sizes, the picture is edge-padded according to each kernel size so that the feature scales obtained by the convolutions are consistent, the feature matrices are then superposed or averaged per channel, and the input of the module is also cascaded directly to the output of the module;
4) preprocessing the depth map and RGB map to be processed;
5) completing the depth by passing the preprocessed depth map and RGB map through the trained neural network.
2. The depth map completion method based on a deep convolutional neural network according to claim 1, characterized in that: in step 4), the completed depth map is no longer needed when the network is used; the inputs are the depth map to be completed and the RGB map, both are given the same preprocessing, and the pixel values of the pictures are normalized to between 0 and 1.
CN201810505428.XA 2018-05-24 2018-05-24 Depth map completion method based on a deep convolutional neural network Active CN108765333B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810505428.XA CN108765333B (en) 2018-05-24 2018-05-24 Depth map completion method based on a deep convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810505428.XA CN108765333B (en) 2018-05-24 2018-05-24 Depth map completion method based on a deep convolutional neural network

Publications (2)

Publication Number Publication Date
CN108765333A CN108765333A (en) 2018-11-06
CN108765333B true CN108765333B (en) 2021-08-10

Family

ID=64005505

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810505428.XA Active CN108765333B (en) 2018-05-24 2018-05-24 Depth map completion method based on a deep convolutional neural network

Country Status (1)

Country Link
CN (1) CN108765333B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109493308B (en) * 2018-11-14 2021-10-26 吉林大学 Medical image synthesis and classification method for generating confrontation network based on condition multi-discrimination
CN109658352B (en) * 2018-12-14 2021-09-14 深圳市商汤科技有限公司 Image information optimization method and device, electronic equipment and storage medium
CN109829863B (en) * 2019-01-22 2021-06-25 深圳市商汤科技有限公司 Image processing method and device, electronic equipment and storage medium
CN109993169A (en) * 2019-04-11 2019-07-09 山东浪潮云信息技术有限公司 One kind is based on character type method for recognizing verification code end to end
CN113111909B (en) * 2021-03-04 2024-03-12 西北工业大学 Self-learning method for SAR target recognition with incomplete training target visual angle
CN115457101B (en) * 2022-11-10 2023-03-24 武汉图科智能科技有限公司 Edge-preserving multi-view depth estimation and ranging method for unmanned aerial vehicle platform

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105825484A (en) * 2016-03-23 2016-08-03 华南理工大学 Depth image denoising and enhancing method based on deep learning
CN107204010A (en) * 2017-04-28 2017-09-26 中国科学院计算技术研究所 A kind of monocular image depth estimation method and system
CN107808131A (en) * 2017-10-23 2018-03-16 华南理工大学 Dynamic gesture identification method based on binary channel depth convolutional neural networks

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105825484A (en) * 2016-03-23 2016-08-03 华南理工大学 Depth image denoising and enhancing method based on deep learning
CN107204010A (en) * 2017-04-28 2017-09-26 中国科学院计算技术研究所 A kind of monocular image depth estimation method and system
CN107808131A (en) * 2017-10-23 2018-03-16 华南理工大学 Dynamic gesture identification method based on binary channel depth convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deep Joint Image Filtering; Yijun Li; IEEE; 2016-07-26; pp. 1-14 *

Also Published As

Publication number Publication date
CN108765333A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
CN108765333B (en) Depth map completion method based on a deep convolutional neural network
CN109508681B (en) Method and device for generating human body key point detection model
KR102319177B1 (en) Method and apparatus, equipment, and storage medium for determining object pose in an image
WO2019223382A1 (en) Method for estimating monocular depth, apparatus and device therefor, and storage medium
CN107274445B (en) Image depth estimation method and system
CN110570371A (en) image defogging method based on multi-scale residual error learning
CN109003297B (en) Monocular depth estimation method, device, terminal and storage medium
CN109509156B (en) Image defogging processing method based on generation countermeasure model
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN110060230B (en) Three-dimensional scene analysis method, device, medium and equipment
CN112907573B (en) Depth completion method based on 3D convolution
CN111768452A (en) Non-contact automatic mapping method based on deep learning
US11361534B2 (en) Method for glass detection in real scenes
CN113822951A (en) Image processing method, image processing device, electronic equipment and storage medium
CN108259764A (en) Video camera, image processing method and device applied to video camera
WO2023065665A1 (en) Image processing method and apparatus, device, storage medium and computer program product
CN117197388A (en) Live-action three-dimensional virtual reality scene construction method and system based on generation of antagonistic neural network and oblique photography
CN113076953A (en) Black car detection method, system, device and storage medium
CN114792354B (en) Model processing method and device, storage medium and electronic equipment
CN115294453A (en) Saliency detection method based on RGB-D multichannel information fusion
CN115496788A (en) Deep completion method using airspace propagation post-processing module
CN114648604A (en) Image rendering method, electronic device, storage medium and program product
CN115482285A (en) Image alignment method, device, equipment and storage medium
CN108268533A (en) A kind of Image Feature Matching method for image retrieval
TWI814503B (en) Method for training depth identification model, identifying depth of image and related devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant