CN110349087B

CN110349087B - RGB-D image high-quality grid generation method based on adaptive convolution

Info

Publication number: CN110349087B
Application number: CN201910609314.4A
Authority: CN
Inventors: 张东九; 冼楚华; 杨煜; 钱锟; 李桂清
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2019-07-08
Filing date: 2019-07-08
Publication date: 2021-02-12
Anticipated expiration: 2039-07-08
Also published as: CN110349087A

Abstract

The invention discloses a method for generating a high-quality RGB-D image grid based on adaptive convolution, which comprises the following steps: 1) constructing a training data set; 2) data amplification and normalization; 3) constructing an adaptive convolutional layer; 4) constructing a depth image completion network and a super-resolution network and training; 5) and inputting the test data into the two trained networks in sequence, outputting the repaired high-resolution picture and further converting the high-resolution picture into a high-quality grid. The data set constructed by the method solves the problem that a high-quality large-scale data set is lacked in the depth image completion field at present; by using the coding and decoding structure and the cross-layer connection structure, the low-layer characteristics and the high-layer characteristics in the data can be effectively fused, and meanwhile, the redundancy of parameters is avoided; the problem that the current method is difficult to generate a complete depth image with high quality can be effectively solved by using the adaptive convolution structure. The invention can solve the problems of low data precision and large missing area of the current kinect.

Description

RGB-D image high-quality grid generation method based on adaptive convolution

Technical Field

The invention relates to the technical field of high-quality three-dimensional grid generation, in particular to a method for generating a high-quality grid of an RGB-D image based on adaptive convolution.

Background

With the great application of depth sensors in the fields of automatic driving, augmented reality, indoor navigation, safe payment, scene reconstruction and the like, the demands for obtaining high-precision depth information and subsequent high-quality three-dimensional reconstruction results become more and more important. Although great progress has been made in depth sensing technology recently, on one hand, commercial grade RGB-D cameras such as Microsoft Kinect, Intel real sense and Google Tango devices still have the disadvantage that the lack of depth data often appears in the collected depth image when the collected surface is too smooth, high light, too fine, too close to or far away from the camera. These situations are frequently encountered in large rooms, in bar-like objects and in scenes with intense light. Even at home, depth images typically lack more than 50% of the pixels. On the other hand, limited by the lower resolution of the depth camera, the point cloud reconstructed from the sensor data is too sparse. The raw data from these depth sensor scans may be less suitable for use as described above for three-dimensional reconstruction applications.

The fast generation of high-quality grid data mainly has two key parts: first, data completion, i.e., recovery of the missing depth data due to various adverse factors, is performed. Then, data super-resolution, i.e. complete point cloud data of low resolution from the previous step, generates point cloud data of high resolution. And finally, further generating grid data from the point cloud data.

Many indoor RGB-D data completion and super-resolution methods based on the conventional method are unsatisfactory in effect, and recently, a few methods based on deep learning have certain effect but have the following main disadvantages: 1) non-end-to-end learning causes the method to fail in real time; 2) the larger field of convolution causes the destruction of edge information.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides an RGB-D image high-quality grid generation method based on adaptive convolution.

In order to achieve the purpose, the technical scheme provided by the invention is as follows: the method for generating the RGB-D image high-quality grid based on the adaptive convolution comprises the following steps:

1) constructing a training data set;

2) data amplification and normalization;

3) constructing an adaptive convolutional layer;

4) constructing a depth image completion network and a super-resolution network and training;

5) and inputting the test data into the two trained networks in sequence, outputting the repaired high-resolution picture and further converting the high-resolution picture into a high-quality grid.

In the step 1), basic data including an RGBD data set NYU-DATASET and an RGBD-SCENE-DATASET, wherein the two data of NYU-DATASET and RGBD-SCENE-DATASET comprise an indoor SCENE color image I acquired by using kinect v1_RGBAnd the corresponding depth image I with missing band_DincAnd repairing the depth image with band deletion by using a method based on a Poisson equation and a sparse matrix to obtain a complete depth image I_DcFor training.

In the step 2), data amplification comprises horizontal turning and expansion operation on the missing region of the depth image so as to obtain training data with different missing proportions; data normalization refers to scaling all image pixel values between 0 and 1 for a color image and performing the following processing for a depth image:

wherein the content of the first and second substances,

representing the minimum and maximum values, respectively, of the pixels in the depth image before normalization.

In step 3), an adaptive convolution layer is constructed, and the operation of the adaptive convolution is as follows:

wherein x is_iIs a certain point in the tensor, x_jIs x_iNeighborhood point of, m_jIs x_jCorresponding mask, ω being the standard convolution operation, b being an offset,. indicates a multiplication operation by elements,. Ψ (M)_i) Is a weight value regular term;

the adaptive convolutional layer gives different weights to different regions of the image, so that the depth network can better learn effective features in the image; network Net for semantic filling_fillEdge enhanced network Net_refineAnd super-resolution network Net_sr，m_jThe calculation methods of (a) are different, and specifically, the following are as follows:

network Net for semantic filling_fill：

Wherein x is judged_jThe basis for whether it is valid is that in the current feature, x_jWhether the pixel value of (a) is 0;

for edge enhanced network Net_refine：

Wherein x is judged_jWhether the criterion is valid is that in the current RGB image, x_jWhether the pixel difference from the center of the corresponding sliding window is less than 5 pixel values;

for super-resolution network Net_sr：

In step 4), a depth image completion network for completing the task and a super-resolution network for the super-resolution task are respectively constructed and trained, specifically as follows:

a. depth image completion network

The completion network is constructed by adopting a multi-scale coding and decoding network, and the network Net is filled by semantics_fillAnd edge enhanced network Net_refineThe two parts are formed and sequentially connected in sequence, wherein all standard convolutions except the last layer are replaced by adaptive convolutions;

network Net for semantic filling_fillInputting depth image I with missing_DincThe tensor form is H W1, wherein H is the height of the image, and W is the width of the image; through Net_fillObtaining a complete depth image I after semantic completion_DoutThen mix I_DoutAnd a color image I_RGBInput edge enhancement network Net together_refineRefining and adjusting to finally obtain a deletion repair result I_repairThe output tensor form is the same as the input image size; the network loss function is composed of loss of a missing area and loss of a non-missing area respectively, and the weight ratio is 10: 1;

semantic filling network Net_fillThe method comprises the following steps that a U-shaped neural network (U-Net) is adopted as a basic structure, the network comprises an encoder and a decoder, the encoder is used for encoding image information and converting a feature space, the decoder is used for decoding high-order information, and the two parts adopt a 5-layer convolutional neural network architecture;

the encoder adopts a five-layer structure, each layer respectively comprises two operations of adaptive convolution and batch regularization, a leak-relu is used as an activation function, the sizes of convolution kernels are respectively 7 × 7,5 × 5,3 × 3 and 3 × 3, the convolution step length is all 2, the height and width of each layer of features are reduced to half of the original height and width, and 0 complementing processing is carried out on the boundary of an input image; the number of convolution kernels is 16,32,64,128 and 128 respectively; all the missing areas are finally repaired by continuously extracting features in different sizes to fill the missing areas;

the decoder is 5 layers of structures equally, and every layer contains four operations of upsampling, feature splicing, adaptive convolution and batch regularization, adopts leak relu as the activation function, carries out cross-layer connection between encoder and the decoder, and the output of every encoder all copies the concatenation with the same size's after the decoder upsampling signature graph output promptly to as the input of decoder, specifically be: after the input of the previous layer is sampled and spliced with the corresponding encoder features with the same size, the current adaptive convolution is input for feature learning, the sizes of convolution kernels are all 3 x 3, the step lengths are all 1 x 1, and the number of the convolution kernels is 128,64,32,16 and 1 respectively;

the last layer of the network is a convolution layer with the convolution kernel size of 1 x 1 and is used for channel transformation and numerical value interval mapping of features;

b. constructing super-resolution networks

The super-resolution task adopts a method of fusing global features and local features, and uses Manhattan distance as a loss function for optimization;

for super-resolution network Net_srAdopting a dense connection block (dense block) as a basic structure, performing up-sampling through sub-pixel convolution, replacing all standard convolutions with adaptive convolutions, and using 1 × 1 convolution for channel adjustment in the last layer of the network; the network uses five dense connection blocks to extract features, each dense connection block uses two times of adaptive convolution, the sizes of convolution kernels are all 3 x 3, the step length is 1, 0 is supplemented to the periphery of input to keep the feature sizes of input and output consistent, and the number of the convolution kernels is 64; the input and the output of the dense connection block are connected in a cross-layer way, namely the input and the output of the dense connection block are connected inSplicing the line characteristic dimensions and then taking the spliced line characteristic dimensions as the input of the next dense connecting block; the network learns richer information by continuously fusing features from low dimension to high dimension; the up-sampling factor of the sub-pixel convolution is 4, and at the end of the network, a standard convolution with the convolution kernel size of 1 x 1 takes relu as an activation function;

c. training the constructed network

Designing a corresponding loss function for the constructed network, and optimizing the loss function by using an Adam method to finally obtain the trained network.

In the step 5), high-quality indoor scene grid data is generated by the high-resolution point cloud data repaired by the neural network through a rolling Ball method Ball pitching method.

Compared with the prior art, the invention has the following advantages and beneficial effects:

1. aiming at the condition that a high-quality RGB-D data set is not disclosed in the prior art, a method for constructing a high-low quality indoor scene RGB-D image data set is provided.

2. Aiming at a consumption level or a missing depth information map acquired by a depth camera of mobile equipment, a depth information map repairing algorithm combining RGB color image features and fusion convolution operation is provided.

3. Aiming at a low-quality and high-noise depth image, a method for denoising and enhancing features by utilizing RGB color image semantic information is provided.

4. A method for reconstructing super-resolution of depth images by a point cloud-based convolution network is provided, wherein the method is based on RGB color image semantic features.

Drawings

FIG. 1 is a logic flow diagram of the present invention.

FIG. 2 is an architecture diagram of a semantic filling network.

Detailed Description

The present invention will be further described with reference to the following specific examples.

As shown in fig. 1, the method for generating a high-quality mesh of an RGB-D image based on adaptive convolution according to this embodiment includes the following steps:

step 1, constructing a training data set

1.1 construction of indoor scene completion dataset databasecomplete

More than one hundred thousand sets of RGBD images of indoor scenes are arranged in an open data set of New York university, the RGBD images are collected from https:// cs.nyu.edu/. about.silberman/datasets/nyu _ depth _ v2.html website, and N is more than or equal to 9000 and less than or equal to N_RGBD10000 or less, 640 x 480 resolution, wherein the RGBD image comprises a color image I_RGBIncomplete depth image I_DincAnd repairing the band-missing depth image by using the colorization method of Ant Levin's to obtain a complete depth image I_Dc. First, the depth image and the RGB image are aligned and noise reduced, and then the RGBD data is cropped around, resulting in 557 × 423 resolution images. Subsequently, extracting masks from all data in the data set to construct a Mask image set Mask_Depth：

In the formula (I), the compound is shown in the specification,

is a mask in which the depth image corresponds to missing information;

the above-mentioned generated I_DincAnd corresponding to I_RGBAnd I_DcNew training data is formed, and a completion task training data set is further constructed.

1.2 construction of indoor scene super resolution dataset DatabaseSR

Two sets of pairs of depth maps M with the same content are acquired from http:// rgbd-dataset.cs. washington. edu/dataset/rgbd-scenes-v2, and the down-sampling factor of each image is 4 x. A set of bit low resolution depth maps I_LRThe other set is a high resolution depth map I_HRAnd performing data amplification on the horizontal turning of the image pairs to finally obtain 24000-25000 pairs of data pairs.

Step 2, data amplification and normalization

And the data amplification comprises horizontal inversion and expansion operation on the missing region of the depth image so as to obtain training data with different missing proportions.

For color image I_RGBThe pixel values are scaled to 0-1 by dividing the pixel values by 255. For depth image I_DincFirst, noise reduction processing is performed, and the depth values are set to zero at 200 or less and 40000 or more. Calculating the minimum value for each image separately

And maximum value

Then, the following treatments were carried out:

step 3, constructing an adaptive convolutional layer

The core here is to build a suitable adaptive convolution layer to replace the standard convolution to better serve the task. The flow of the adaptive convolutional layer is as follows:

wherein x is_iIs a certain point in the tensor, x_jIs x_iNeighborhood point of, m_jIs x_jCorresponding mask, ω being the standard convolution operation, b being an offset,. indicates a multiplication operation by elements,. Ψ (M)_i) Is a weight regularization term. The adaptive convolution gives different weights to different regions, and compared with the traditional convolution, the network can learn effective characteristics better. Network Net for semantic filling_fillEdge enhanced network Net_refineAnd super-resolution network Net_sr，m_jThe calculation of (c) is different. The calculation is as follows.

Network Net for semantic filling_fill：

Wherein x is judged_jThe basis for whether it is valid is that in the current feature, x_jIs 0.

For edge enhanced network Net_refine：

Wherein x is judged_jWhether the criterion is valid or not is that in the current RGB image, x_jWhether the pixel difference from the center of the corresponding sliding window is less than 5 pixel values.

For super-resolution network Net_sr：

Step 4, constructing a depth image completion network and a super-resolution network and training

a. Constructing a depth image completion network

The depth image completion network is constructed by adopting a multi-scale coding and decoding structure network and is filled with a semantic network Net_fillAnd edge enhanced network Net_refineThe two parts are composed and connected sequentially in sequence, wherein all standard convolutions except the last layer are replaced by adaptive convolutions.

Network Net for semantic filling_fillInputting depth image I with missing_DincThe tensor form is H W1, wherein H is the height of the image, and W is the width of the image; through Net_fillObtaining a complete depth image I after semantic completion_Dout. Then adding I_DoutAnd a color image I_rgbInput Net together_refineRefining and adjusting to finally obtain a cavity filling result I_FillThe output tensor form is also H × W × 1. The network loss function is composed of loss of a missing area and loss of a non-missing area respectively, and the weight ratio is 1: 10.

As shown in FIG. 2, Net_fillThe method adopts U-Net as a basic structure, and an encoder and a decoder both adopt a 5-layer convolutional neural network architecture;

the encoder adopts a five-layer structure, each layer respectively comprises two operations of adaptive convolution and batch regularization, a leak-relu is used as an activation function, the sizes of convolution kernels are respectively 7 × 7,5 × 5,3 × 3 and 3 × 3, the convolution step sizes are all 2, the characteristic height and width of each coding layer are reduced to half of the original height and width, and 0 complementing processing is carried out on the boundary of an input image. The number of convolution kernels is 16,32,64,128 and 128 respectively. By continuously extracting features in different sizes to fill the missing regions, all the missing regions are finally repaired.

The decoder is also of a 5-layer structure, each layer comprises four operations of upsampling, feature splicing, adaptive convolution and regularization, and the leak relu is also adopted as an activation function. Each decoder layer firstly performs upsampling on the input of the previous layer, then splices the upsampled input with the corresponding features in the encoder with the same size, and then inputs the current adaptive convolution for feature learning. The convolution kernel sizes are all 3 × 3, the step sizes are all 1 × 1, and the number of convolution kernels is 128,64,32,16,1, respectively.

The last layer of the network is a common convolution layer with convolution kernel size of 1 x 1, and is the same as the channel transformation and the value interval mapping of the features.

c. Constructing super-resolution networks

The super-resolution task adopts a method of fusing global features and local features and uses Manhattan distance as a loss function for optimization.

For super-resolution network Net_srInputting the result I obtained by the completion network_FillObtaining complete depth with high resolution ratio by semantic extraction of dense connection blocks and up-sampling of sub-pixel convolutionAnd measuring the image and finally converting into a grid.

Super-resolution network Net_srAnd (3) adopting a dense connecting block as a basic structure, and performing up-sampling through sub-pixel convolution. Likewise, all standard convolutions are replaced with adaptive convolutions. Similarly, the last layer of the network uses 1 × 1 convolution for channel tuning. The model uses five dense connection blocks to extract features, each dense connection block comprises two times of adaptive convolution, the sizes of convolution kernels are all 3 x 3, the step length is 1, 0 is supplemented to the periphery of the input to keep the feature sizes of the input and the output consistent, and the number of the convolution kernels is 64. And the input and the output of the dense connecting block are connected in a cross-layer manner, namely the input and the output of the dense connecting block are spliced in characteristic dimensions and then serve as the input of the next dense connecting block. The network learns richer information by constantly fusing features from low dimension to high dimension. The up-sampling factor of the sub-pixel convolution is 4 and the feature becomes 4 times higher and wider after passing through this layer. At the end of the network, there is a standard convolution with a convolution kernel size of 1 x 1 and relu as the activation function.

Training a neural network: and dividing the data set into a training set, a verification set and a test set according to the ratio of 7:2:1, and respectively training the completion network and the super-resolution network. And evaluating the model in real time and calculating evaluation indexes by using the verification set, and performing performance test on the trained network by using the test set. The processor of the used equipment is Intel i7-7700, and the video card is Invitta 1080 ti;

for the completion task, Net_fillInput as a depth map I_inTraining is carried out for one day by using the batch size of 4 and the learning rate of 0.001, then the training is continued by reducing the learning rate to 0.0001, and the whole process takes three days. The training process takes the mean square error between the network output and the true value as the loss function. Net_refineWill be input into I_rgbExtracted weight sum Net_fillAnd the corresponding input is multiplied by the corresponding element, and is convolved by a standard convolution kernel with fixed parameters, so that the method has no trainable parameters and is quick to execute.

For the super-resolution task, Net_srIs input as_lrIn a batch size of 8Training, the learning rate is 0.0001. Training takes 200 batches of models to converge.

Step 5, inputting the test data into the two trained networks in sequence, outputting the repaired high-resolution picture and further converting the high-resolution picture into a high-quality grid, wherein the method specifically comprises the following steps:

and generating high-quality indoor scene grid data by using the high-resolution point cloud data repaired by the neural network through a Ball Pivoting method.

The above-mentioned embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereto, so that the changes in the shape and principle of the present invention should be covered within the protection scope of the present invention.

Claims

1. The method for generating the RGB-D image high-quality grid based on the adaptive convolution is characterized by comprising the following steps of:

1) constructing a training data set;

2) data amplification and normalization;

3) constructing an adaptive convolution layer, wherein the operation of the adaptive convolution is as follows:

different weights are given to different regions of the image by the adaptive convolution, so that the effective features in the image can be better learned by the depth network; network Net for semantic filling_fillEdge enhanced network Net_refineAnd super-resolution meshNet (Net)_sr，m_jThe calculation methods of (a) are different, and specifically, the following are as follows:

network Net for semantic filling_fill：

for edge enhanced network Net_refine：

for super-resolution network Net_sr：

2. The method for generating a high quality mesh for RGB-D images based on adaptive convolution of claim 1, wherein: in the step 1), basic data including an RGBD data set NYU-DATASET and an RGBD-SCENE-DATASET, wherein the two data of NYU-DATASET and RGBD-SCENE-DATASET comprise an indoor SCENE color image I acquired by using kinect v1_RGBAnd the corresponding depth image I with missing band_DincAnd using the equation based on Poisson's equation and sparse momentRepairing the depth image with band missing by the array method to obtain a complete depth image I_DcFor training.

3. The method for generating a high quality mesh for RGB-D images based on adaptive convolution of claim 1, wherein: in step 4), a depth image completion network for completing the task and a super-resolution network for the super-resolution task are respectively constructed and trained, specifically as follows:

a. depth image completion network

The completion network is constructed by adopting a multi-scale coding and decoding network, and the network Net is filled by semantics_fillAnd edge enhanced network Net_refineThe two parts are formed and sequentially connected in sequence, wherein all convolutions except the last layer are replaced by adaptive convolutions;

semantic filling network Net_fillThe method comprises the following steps that a U-shaped neural network U-Net is adopted as a basic structure, the network comprises an encoder and a decoder, the encoder is used for encoding image information and converting a feature space, the decoder is used for decoding high-order information, and the two parts adopt a 5-layer convolutional neural network architecture;

the encoder adopts a five-layer structure, each layer respectively comprises two operations of adaptive convolution and batch regularization, a leak-relu is used as an activation function, the sizes of convolution kernels are respectively 7 × 7,5 × 5,3 × 3 and 3 × 3, the convolution step length is all 2, the height and width of the features are reduced to half after each time of adaptive convolution, and 0 complementing processing is carried out on the boundary of an input image to eliminate the difference of convolution edge regions; the number of convolution kernels is 16,32,64,128 and 128 respectively; all the missing areas are finally repaired by continuously extracting features in different sizes to fill the missing areas;

the last layer of the network is standard convolution with convolution kernel size of 1 x 1, and is used for channel transformation and numerical value interval mapping of features;

b. constructing super-resolution networks

for super-resolution network Net_srSampling is carried out by sub-pixel convolution by adopting a dense connection block as a basic structure, all standard convolutions are replaced by adaptive convolutions, and the last layer of the network uses 1 × 1 convolution for channel adjustment; the network uses five dense connection blocks to extract features, each dense connection block comprises two times of adaptive convolution, the sizes of convolution kernels are all 3 x 3, the step length is 1, 0 is supplemented to the periphery of input to keep the feature sizes of input and output consistent, and the number of the convolution kernels is 64; performing cross-layer connection on the input and the output of the dense connecting blocks, namely splicing the input and the output of the dense connecting blocks in characteristic dimensions and then using the spliced input and the output as the input of the next dense connecting block; from low dimension to high dimension by constant fusionTo enable the network to learn richer information; the up-sampling factor of the sub-pixel convolution is 4, and at the end of the network, a standard convolution with the convolution kernel size of 1 x 1 takes relu as an activation function;

c. training the constructed network

4. The adaptive convolution-based RGB-D image high quality mesh generation method of claim 1, wherein: in the step 5), high-quality indoor scene grid data is generated by rolling Ball method from the high-resolution point cloud data repaired by the neural network.