CN113033645A - Multi-scale fusion depth image enhancement method and device for RGB-D image - Google Patents
- Publication number
- CN113033645A (application number CN202110290784.6A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- depth image
- module
- rgb
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/253 — Pattern recognition; Analysing; Fusion techniques of extracted features (G — Physics; G06 — Computing; G06F — Electric digital data processing)
- G06F18/214 — Pattern recognition; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/045 — Computing arrangements based on biological models; Neural networks; Architecture; Combinations of networks
- G06N3/08 — Computing arrangements based on biological models; Neural networks; Learning methods
Abstract
The invention discloses a multi-scale fusion depth image enhancement method and device for RGB-D images. The invention comprises: (1) a dual-branch progressive fusion scheme in which the input RGB image and depth image complement each other during depth prediction — the depth input preserves the overall structure of the image, while color information fills in missing pixel values; (2) a hybrid multi-scale loss function, designed from an analysis of the noise distribution of real data, which can still generate high-quality, sharp images even when the real image data is noisy. The method makes reasonable use of the respective characteristics of the RGB image and the depth image, ensures that the feature information obtained from the color image assists in restoring the depth image, and finally predicts a complete depth image, significantly improving depth image quality.
Description
Technical Field
The invention relates to image processing technology, and in particular to a multi-scale fusion depth image enhancement method and device for RGB-D images.
Background
Depth image enhancement techniques fall into two categories. The first works at the hardware level, improving device quality and accuracy or refining the design scheme to capture higher-quality depth images. The second designs algorithms based on image processing principles to enhance the depth image at the software level. Hardware improvements must contend with cost and physical constraints, whereas software algorithms are cheap to develop and largely free of such limitations, so their advantages are more pronounced.
In recent years, deep learning has made remarkable progress in conventional RGB image enhancement, and many of its ideas have been carried over to depth image enhancement. Jeon et al. chose a Laplacian pyramid depth network as the basic network structure and proposed LapDEN, which can produce a clean, sharp depth image from the raw depth image but cannot accurately recover large holes and object edges in the original image. Zhang et al. proposed generating depth image datasets from RGB-D streams using 3D reconstruction; in contrast to LapDEN, they focus mainly on estimating larger unobserved depth values, but fail to remove noise and holes from low-quality RGB-D images. All of these methods aim to enhance the quality of depth images captured by an RGB-D camera, yet the critical problems of depth noise, depth holes and depth discontinuities remain poorly solved.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a multi-scale fusion depth image enhancement method and device for RGB-D images that can solve the problems of depth noise, depth holes and depth discontinuity.
The technical scheme is as follows: the method for enhancing the multi-scale fusion depth image of the RGB-D image comprises the following steps:
(1) establishing a multi-scale fusion network model, wherein the multi-scale fusion network model comprises a depth image processing branch, an RGB image processing branch and a multi-scale fusion prediction branch, the depth image processing branch is used for extracting characteristic information from a depth image, the RGB image processing branch is used for extracting the characteristic information from an RGB image paired with the depth image, and the multi-scale fusion prediction branch is used for gradually fusing the characteristic information extracted by the depth image processing branch and the RGB image processing branch in the order from low scale to high scale to predict an enhanced depth image;
(2) acquiring a plurality of image pairs comprising depth images and corresponding RGB images as samples, taking the reference depth images after the depth images are enhanced as sample labels, inputting the established multi-scale fusion network model, and carrying out network training;
(3) and inputting the depth image to be enhanced and the corresponding RGB image into the trained multi-scale fusion network model to obtain an enhanced image.
Further, the depth image processing branch in step (1) is specifically a residual learning network comprising a first convolution module, a second residual module, a third residual module and a fourth residual module connected in sequence, the scale of each of the four modules being reduced to 1/2 of the previous one.
Further, the RGB image processing branch in step (1) is specifically a full convolution network comprising a first convolution module, a second convolution module, a third convolution module and a fourth convolution module connected in sequence, the scale of each of the four modules being reduced to 1/2 of the previous one; each convolution module comprises three convolution layers and a maximum pooling layer, the first convolution layer being a dilated (atrous) convolution and the other two being standard convolution layers.
Further, the multi-scale fusion prediction branch in step (1) comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a fifth convolution module and a sixth convolution module connected in sequence; the scales of the first to fourth convolution modules increase by a factor of 2 in sequence, and the scales of the fifth and sixth convolution modules are the same as that of the fourth convolution module,
wherein the first convolution module comprises two convolution layers whose inputs are the output of the fourth residual module of the depth image processing branch and the output of the fourth convolution module of the RGB image processing branch, respectively; the second convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers taking as input the outputs of the third residual module of the depth image processing branch and the third convolution module of the RGB image processing branch, respectively; the third convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers taking as input the outputs of the second residual module of the depth image processing branch and the second convolution module of the RGB image processing branch, respectively; and the fourth convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers taking as input the outputs of the first convolution modules of the depth image processing branch and of the RGB image processing branch, respectively.
Further, the loss function adopted in step (2) for network training is:

L = Σ_{l=1}^{4} ω_l·L_d^(l) + L_sp

in which L denotes the total loss, L_d^(l) the data loss at scale l, and L_sp the structure-preservation loss; y denotes the real (reference) depth image in the sample and ŷ the predicted depth image; l = 1, 2, 3, 4 index the image scales from large to small, corresponding in turn to the scales of the first convolution module and the second, third and fourth residual modules of the depth image processing branch; ŷ^l denotes the predicted depth image at scale l and y^l the real depth image at that scale; H denotes the height and W the width of the input image; and ∇y_{i,j}, ∇ŷ_{i,j} denote the differences (i.e. gradient values) between the pixel in row i, column j and its left and upper neighbouring pixels, in y and ŷ respectively.
The multi-scale fusion depth image enhancement device for RGB-D images of the invention comprises a processor and a computer program stored in a memory and executable on the processor, the processor implementing the above method when executing the computer program.
Beneficial effects: compared with the prior art, the invention has the following notable advantages:
1. The dual-branch design exploits the correlation between the RGB image and the depth image; the rich features and object contour boundary information obtained from the RGB image assist in enhancing the depth image;
2. The progressive fusion of multi-scale high-dimensional features fills in details step by step from low-resolution to high-resolution images, making the enhancement process more robust and less prone to boundary blurring and depth noise, thereby solving the problems of depth noise, depth holes and depth discontinuity.
Drawings
FIG. 1 is an architecture diagram of a multi-scale converged network model provided by the present invention;
FIG. 2 is a graph of the predicted effect of the present invention;
FIG. 3 is a comparison of predicted images between the present invention and other methods;
FIG. 4 is a comparison of desktop details between the present invention and other methods;
FIG. 5 is a graph comparing the effect of different fusion modes and different Loss selections on depth image restoration;
FIG. 6 is a graph comparing the effect of using different inputs on depth image restoration.
Detailed Description
The embodiment provides a method for enhancing a multi-scale fusion depth image of an RGB-D image, which comprises the following steps:
(1) establishing a multi-scale fusion network model, wherein the multi-scale fusion network model comprises a depth image processing branch, an RGB image processing branch and a multi-scale fusion prediction branch, as shown in FIG. 1, the depth image processing branch is used for extracting characteristic information from a depth image, the RGB image processing branch is used for extracting the characteristic information from an RGB image paired with the depth image, and the multi-scale fusion prediction branch is used for gradually fusing the characteristic information extracted by the depth image processing branch and the RGB image processing branch in the order from low scale to high scale to predict an enhanced depth image.
The depth image processing branch is specifically a residual learning network comprising a first convolution module and second, third and fourth residual modules connected in sequence, the scale of each of the four modules being reduced to 1/2 of the previous one. The branch input is the original depth image. Its differences from the reference image are mainly holes and noise, while most of the low-frequency information in the two is similar, so a residual learning network is used: the deep network extracts rich feature information from the image, while skip connections superimpose the input onto the output, ensuring that no information from the original depth image is lost. The residual network only needs to learn the difference between the original image and the reference image, and therefore converges faster. The residual module used in the invention is that of ResNetV2, proposed in [He K, Zhang X, Ren S, et al. Identity mappings in deep residual networks [C]//European Conference on Computer Vision, 2016].
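As an illustration, a pre-activation residual module of the ResNetV2 kind the branch builds on might look as follows in PyTorch. The channel widths, kernel sizes and the 1×1 projection shortcut are illustrative assumptions, not details taken from the patent:

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation residual block (BN -> ReLU -> Conv, ResNetV2 style); the
    skip connection superimposes the input onto the output so no original
    depth information is lost."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        # 1x1 projection when the shape changes; plain identity otherwise
        self.skip = (nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
                     if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + self.skip(x)
```

With stride 2 the block also realizes the 1/2 scale reduction between successive modules described above.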
The RGB image processing branch is specifically a full convolution network comprising first, second, third and fourth convolution modules connected in sequence, the scale of each of the four modules being reduced to 1/2 of the previous one; each convolution module comprises three convolution layers and a maximum pooling layer, the first convolution layer being a dilated (atrous) convolution and the other two standard convolution layers. The branch input is the RGB image paired with the depth image. Under the assumptions of the Lambertian reflectance and spherical harmonic illumination models, an RGB image can be decomposed into two parts, a reflectance map and a shading map, defined as follows:
I(R,D,L)=S(N(D),L)×R
where I is the RGB image, D the depth map, N the object surface normal, L the illumination condition, S the shading map derived from the normal map N and the illumination condition L, and R the reflectance map. The formula shows that the RGB image contains partial information about the depth image and that the two are highly correlated; although the mapping is complex, a deep network can learn and fit it, allowing the RGB image to assist in enhancing the depth image. However, what the RGB image mainly contributes to depth image enhancement is a subset of features, such as overall structural information and object contours, and the features obtainable from an RGB image are very rich, so a full convolution network structure is chosen.
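One stage of the RGB branch as described — a dilated first convolution, two further convolutions, and a 2× max-pool that halves the scale — could be sketched as below. The 3×3 kernel size, channel widths and dilation rate are assumptions for illustration; the patent only fixes the layer types:

```python
import torch
import torch.nn as nn

class RGBConvModule(nn.Module):
    """One RGB-branch convolution module: dilated (atrous) conv to widen the
    receptive field, two standard convs, then a 2x max-pool halving the scale."""
    def __init__(self, in_ch, out_ch, dilation=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),  # reduces the scale to 1/2, as in the text
        )

    def forward(self, x):
        return self.block(x)
```

Chaining four such modules yields the four RGB feature maps at successively halved scales that the fusion branch consumes.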
The multi-scale fusion prediction branch comprises first through sixth convolution modules connected in sequence. The scales of the first to fourth convolution modules increase by a factor of 2 in sequence, and the fifth and sixth convolution modules have the same scale as the fourth. The first convolution module comprises two convolution layers whose inputs are the output of the fourth residual module of the depth image processing branch and the output of the fourth convolution module of the RGB image processing branch, respectively. The second convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers taking as input the outputs of the third residual module of the depth image processing branch and the third convolution module of the RGB image processing branch, respectively. The third convolution module likewise comprises two convolution layers and a transposed convolution layer, its two convolution layers taking as input the outputs of the second residual module of the depth image processing branch and the second convolution module of the RGB image processing branch. The fourth convolution module comprises two convolution layers and a transposed convolution layer, its two convolution layers taking as input the outputs of the first convolution modules of the depth image processing branch and of the RGB image processing branch. This branch mainly fuses the multi-scale high-dimensional features generated by the depth image processing branch and the RGB image processing branch and predicts a high-quality depth image.
Because the scene image exhibits some perspective distortion, different objects in the picture have different apparent sizes, while the receptive field of a convolution layer is fixed. For visual elements of larger scale, the receptive field covers only part of the element, so boundary blurring easily occurs; for elements of smaller scale, the receptive field may be swallowed up by other visual elements, making small objects hard to distinguish. The invention therefore adopts progressive fusion of multi-scale high-dimensional features, filling in details step by step from low-resolution to high-resolution images. Because a depth image is less information-rich than a color image, the feature dimension used to recover it must be reduced: the high-dimensional features produced by the depth image processing branch and the RGB image processing branch are each reduced to 16-dimensional features through two convolution layers, discarding complex useless features while retaining important ones. The two are then concatenated and passed through a transposed convolution layer to produce a depth image whose resolution is double that of the original feature map. Finally, predicted depth images at three scales are obtained.
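One fusion-and-upsample step of this branch might be sketched as follows in PyTorch. Only the 16-dimensional squeeze through two convolution layers, the concatenation, and the 2× transposed-convolution upsampling come from the text; the remaining channel widths and kernel sizes are assumptions:

```python
import torch
import torch.nn as nn

class FusionUpStep(nn.Module):
    """One fusion step: squeeze each branch's high-dimensional features to 16
    channels via two conv layers, concatenate them with the coarser prediction
    features, and double the resolution with a transposed convolution."""
    def __init__(self, depth_ch, rgb_ch, pred_ch):
        super().__init__()
        self.squeeze_depth = nn.Sequential(
            nn.Conv2d(depth_ch, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, 3, padding=1))
        self.squeeze_rgb = nn.Sequential(
            nn.Conv2d(rgb_ch, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, 3, padding=1))
        # kernel 4, stride 2, padding 1 exactly doubles the spatial size
        self.up = nn.ConvTranspose2d(pred_ch + 32, pred_ch,
                                     kernel_size=4, stride=2, padding=1)

    def forward(self, pred, feat_depth, feat_rgb):
        fused = torch.cat([pred,
                           self.squeeze_depth(feat_depth),
                           self.squeeze_rgb(feat_rgb)], dim=1)
        return self.up(fused)
```

Repeating such a step from the lowest scale upward yields the progressively refined multi-scale predictions described above.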
(2) And acquiring a plurality of image pairs comprising the depth image and the corresponding RGB image as samples, taking the reference depth image after the depth image enhancement as a sample label, inputting the established multi-scale fusion network model, and performing network training.
Wherein the loss function adopted during network training is:

L = Σ_{l=1}^{4} ω_l·L_d^(l) + L_sp

in which L denotes the total loss, L_d^(l) the data loss at scale l, and L_sp the structure-preservation loss; y denotes the real (reference) depth image in the sample and ŷ the predicted depth image; l = 1, 2, 3, 4 index the image scales from large to small, corresponding in turn to the scales of the first convolution module and the second, third and fourth residual modules of the depth image processing branch; ω_l denotes the weighting factor of scale l, set in this embodiment to [0.2, 0.4, 0.7, 1.0]; ŷ^l denotes the predicted depth image at scale l and y^l the real depth image at that scale; H denotes the height and W the width of the input image; and ∇y_{i,j}, ∇ŷ_{i,j} denote the differences (i.e. gradient values) between the pixel in row i, column j and its left and upper neighbouring pixels, in y and ŷ respectively.
The data loss L_d is a combination of the L1 and L2 loss functions. The L2 loss is fast to differentiate but sensitive to outliers, which is why many image restoration tasks use it. However, some reference images in the dataset used by the invention still contain holes or noise; using only the L2 loss would make the model pay excessive attention to local holes and noise, harming its generalization. The invention therefore applies the L2 loss on the small-scale images, ensuring the image as a whole shows no large deviation, and the L1 loss on the large-scale images, ensuring that image details can be restored. In addition, the depth images of interest have significant discontinuities at the edges between foreground and background regions, which conventional losses such as L1 and L2 struggle to preserve; the structure-preservation loss L_sp therefore focuses on retaining the gradient values between adjacent pixels. Since scaling the reference image down by bilinear interpolation loses detail, L_sp is applied only to the image at the largest scale.
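A minimal sketch of this hybrid multi-scale loss in PyTorch follows. The per-scale weights and the "gradient term on the largest scale only" rule come from the text; the exact split between "large" scales (L1) and "small" scales (L2) is an assumption, since the patent does not state where the boundary lies:

```python
import torch

def hybrid_multiscale_loss(preds, refs, weights=(0.2, 0.4, 0.7, 1.0)):
    """preds/refs: lists of depth maps at scales l = 1..4, ordered large -> small.
    L1 data loss on the large scales, L2 (MSE) on the small scales, plus a
    gradient (structure-preservation) term on the largest scale only."""
    loss = torch.tensor(0.0)
    for l, (p, y) in enumerate(zip(preds, refs)):
        if l < 2:                                    # large scales: L1 (assumed split)
            loss = loss + weights[l] * (p - y).abs().mean()
        else:                                        # small scales: L2
            loss = loss + weights[l] * ((p - y) ** 2).mean()
    # structure preservation on the largest scale: match the per-pixel
    # differences with the left and upper neighbours
    p, y = preds[0], refs[0]
    gx = lambda t: t[..., :, 1:] - t[..., :, :-1]    # left-neighbour difference
    gy = lambda t: t[..., 1:, :] - t[..., :-1, :]    # upper-neighbour difference
    loss = loss + (gx(p) - gx(y)).abs().mean() + (gy(p) - gy(y)).abs().mean()
    return loss
```

The gradient term is zero for any constant offset between prediction and reference, so it penalizes only smoothed or shifted edges, which is what preserves the depth discontinuities.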
(3) And inputting the depth image to be enhanced and the corresponding RGB image into the trained multi-scale fusion network model to obtain an enhanced image.
The embodiment also provides a multi-scale fusion depth image enhancement device of an RGB-D image, which comprises a processor and a computer program stored on a memory and capable of running on the processor, wherein the processor realizes the method when executing the computer program.
The following is a simulation verification for the present invention.
(1) Training details and parameter settings
The training image dataset for this experiment came from SUN RGB-D, which contains 3389 image pairs captured with an Xtion device, 1159 pairs with a RealSense device, 1449 pairs with a Kinect v1 and 3784 pairs with a Kinect v2, for a total of 9781 pairs. We selected 800 pairs from the NYUv2 subset contained in the dataset as the test set and the remaining 8981 pairs as the training set. Because the resolution of RGB-D images collected by different devices is not uniform, before training all images (RGB images and depth images) were center-cropped to 400 × 560 and used as the network input.
The experiment trained the model on the PyTorch platform with an Adam optimizer, an initial learning rate of 0.0001, and a decay factor of 0.5 applied before the 2nd, 4th, 7th, 10th and 15th epochs. The batch size was 20 and the model was trained for 100 epochs in total. Training used a server with an NVIDIA Tesla V100 with 32 GB of video memory.
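The optimizer and learning-rate schedule described above can be reproduced with standard PyTorch components; the single conv layer here is only a stand-in for the real fusion network, and the shortened epoch loop is for illustration:

```python
import torch

# Stand-in model: one conv over a 4-channel RGB-D input; the real network is
# the multi-scale fusion model described in this patent.
model = torch.nn.Conv2d(4, 1, kernel_size=3, padding=1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate around the 2nd, 4th, 7th, 10th and 15th epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[2, 4, 7, 10, 15], gamma=0.5)

for epoch in range(16):   # the patent trains for 100 epochs; shortened here
    # ... one pass over the training set (batch size 20) would go here ...
    scheduler.step()
```

After all five milestones have passed, the learning rate sits at 1e-4 · 0.5⁵ and stays there for the remaining epochs.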
(2) Comparison of Experimental results
First, the prediction results of the invention are compared with the reference image. As shown in FIG. 2, the invention fills the holes well and preserves the depth discontinuities between objects.
Next, the invention was compared with several models published in recent years, using RMSE (root mean square error), MAE (mean absolute error) and SSIM (structural similarity) as evaluation indexes. RMSE and MAE directly measure depth prediction accuracy, while SSIM measures the structural similarity between the predicted image and the reference image from an overall perspective. The advanced methods compared are: FCRN, a full convolution residual neural network for depth image prediction from color images; SharpNet, a multi-task prediction model based on color images covering depth prediction, normal estimation and boundary contour prediction; LapDEN, a depth image restoration model with a pyramid-like structure operating on the depth image; and HFM-Net, a multi-level prediction model based on both color and depth images, originally proposed for normal estimation and changed here to depth image prediction. SharpNet was originally proposed for multi-task generation from color images, but this is also a disadvantage: using the same encoder and similar decoders for three tasks leaves no room to specialize for the depth image restoration problem. The invention is also compared with the latest pyramid image restoration method LapDEN, which applies the image super-resolution idea of LapSRN to restoring depth images, whereas the model of the invention exploits a fusion structure between the color and depth images to enhance the depth image. Table 1 (best results marked in bold) shows that the method of the invention outperforms the existing methods on every index, in both image accuracy and structural similarity. As shown in FIG. 3, the result of the invention is clearly closer to the reference image than the other two methods, with great improvement in noise removal, hole filling and depth discontinuity preservation. The detail views in FIG. 4 show that, owing to the size and material of the objects on the desktop, the device has difficulty collecting data there and large holes easily appear; both the depth-image-based LapDEN and the RGB-based SharpNet predictions deviate considerably. The invention fuses the characteristics of RGB-D images, better filling the holes and, to a certain extent, protecting the depth discontinuity at object boundaries. Table 1 also shows that the model of the invention runs faster than the other methods.
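The two accuracy indexes are straightforward to compute; a minimal NumPy sketch is below. SSIM is typically computed with a library routine (e.g. scikit-image's structural_similarity) rather than by hand, so only RMSE and MAE are shown:

```python
import numpy as np

def rmse(pred, ref):
    """Root mean square error between predicted and reference depth maps."""
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

def mae(pred, ref):
    """Mean absolute error between predicted and reference depth maps."""
    return float(np.mean(np.abs(pred - ref)))
```

RMSE penalizes large local errors (such as unfilled holes) more heavily than MAE, which is why the two are reported together.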
In addition, although the method is supervised, by fusing RGB image features it can still restore image regions that are blurred in the reference image. For example, for the faucet in FIG. 2, the original depth image lacks depth values over the whole faucet and the reference image clearly smooths that region away, whereas the method of the invention shows the faucet distinctly, with depth values consistent over the whole faucet.
TABLE 1 comparison of test set Performance for advanced methods
(3) Ablation experiment
In order to investigate the effect of each network component and fusion protocol on the final performance, ablation studies were performed on the NYUV2 validation dataset. The quantitative comparison is summarized in table 2.
Table 2 comparison of performance of ablation experiments
Selection of input branches: to determine whether the RGB image and the depth image each make a distinct contribution to the network, the two branches were deleted in turn, reducing the network to a plain encoder-decoder structure. Comparing the "RGB", "Depth" and "RGB + Depth" rows in Table 2, the "RGB + Depth" network is clearly better than the other two. FIG. 6 also shows that although the depth predicted from the single RGB branch is much worse than from the single Depth branch, the model with the RGB branch added repairs holes and noisy regions better. The reason is that the RGB image is highly correlated with the depth image and can assist in enhancing it.
Selection of the feature fusion mode: existing methods typically fuse multimodal features by matrix addition (add) or concatenation (concatenate). To compare them, only the feature fusion operation in the fusion branch was changed, leaving the other network components and settings unchanged; the results are denoted "add" and "splice" respectively. The "add" result is somewhat worse than "splice". This is because the RGB and depth features are heterogeneous data, and addition forces the model to treat the two kinds of data implicitly in the same way, which degrades performance. Concatenation, by contrast, does not destroy the independence of the original features, and the subsequent convolutional layer can learn weights for the different features by itself. This is also why existing methods choose addition or concatenation according to the source and correlation of the features. It is likewise apparent from FIG. 5 (g-l) that concatenation-based fusion recovers objects in the depth image well and protects boundary discontinuities.
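The two fusion operations compared in this ablation differ only in whether the modalities share channels, which a two-line PyTorch example makes concrete:

```python
import torch

depth_feat = torch.randn(1, 16, 8, 8)   # features from the depth branch
rgb_feat = torch.randn(1, 16, 8, 8)     # features from the RGB branch

# "add": the heterogeneous modalities are forced into the same channel slots
added = depth_feat + rgb_feat

# "splice" (concatenate): both feature sets are kept intact, and the next
# conv layer can learn its own weighting between them
spliced = torch.cat([depth_feat, rgb_feat], dim=1)
```

Addition keeps the channel count at 16 while concatenation doubles it to 32, so concatenation costs more parameters in the following layer in exchange for preserving each modality's features.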
Selection of the loss function: besides the modifications to the model, the choice of the loss function also influences the final result. Using the L1 loss function for all layers makes the predicted image overly detailed, while using L2 makes the predicted image blurred. It is clear from fig. 5(d-f) that the hybrid loss function of the present invention both preserves the clarity of the overall structure and retains the depth discontinuities at object edges.
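Why L2 blurs while L1 keeps edges sharp can be seen from a one-line statistic (a hypothetical NumPy illustration, not part of the patent): the constant that minimizes an L2 loss over a neighbourhood is its mean, whereas the L1 minimizer is its median.

```python
import numpy as np

# A 1-D "edge": depth jumps from 1.0 to 5.0. If a model predicts a single
# value for this neighbourhood, the L2-optimal answer is the mean (a
# blurred, in-between depth that exists on no surface), while the
# L1-optimal answer is the median (an actual surface depth, keeping the
# edge sharp).
neighbourhood = np.array([1.0, 1.0, 1.0, 5.0, 5.0])

l2_optimal = neighbourhood.mean()      # 2.6 -> blurred in-between depth
l1_optimal = np.median(neighbourhood)  # 1.0 -> a true surface depth

print(l2_optimal, l1_optimal)
```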
In conclusion, the analysis of the results shows that the multi-scale fusion network proposed by the invention fuses multi-scale RGB image features with depth image features: it retains the structural information of the original depth image while exploiting the high resolution and rich features of the RGB image, and thus effectively resolves holes, noise and similar defects in the depth image. Furthermore, since the reference images of the data set are imperfect, the invention also designs a hybrid loss function to ensure the generation of high-quality depth images. Extensive experimental results show that the method is clearly superior to existing state-of-the-art methods on the depth image enhancement task. Ablation studies further prove that the proposed RGB-D multi-scale fusion scheme outperforms generation from a single RGB image or a single depth image, and that the hybrid loss function converges better and yields better results than the L1 and L2 loss functions.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Claims (6)
1. A multi-scale fusion depth image enhancement method of an RGB-D image is characterized by comprising the following steps:
(1) establishing a multi-scale fusion network model, wherein the multi-scale fusion network model comprises a depth image processing branch, an RGB image processing branch and a multi-scale fusion prediction branch, the depth image processing branch is used for extracting characteristic information from a depth image, the RGB image processing branch is used for extracting the characteristic information from an RGB image paired with the depth image, and the multi-scale fusion prediction branch is used for gradually fusing the characteristic information extracted by the depth image processing branch and the RGB image processing branch in the order from low scale to high scale to predict an enhanced depth image;
(2) acquiring a plurality of image pairs each comprising a depth image and a corresponding RGB image as samples, taking the enhanced reference depth image of each depth image as the sample label, inputting them into the established multi-scale fusion network model, and carrying out network training;
(3) inputting the depth image to be enhanced and its corresponding RGB image into the trained multi-scale fusion network model to obtain the enhanced depth image.
2. The method for multi-scale fusion depth image enhancement of RGB-D images according to claim 1, wherein: the depth image processing branch in the step (1) is specifically a residual learning network comprising a first convolution module, a second residual module, a third residual module and a fourth residual module connected in sequence, the scales of the four modules being successively reduced by 1/2.
3. The method of multi-scale fusion depth image enhancement of RGB-D images according to claim 2, wherein: the RGB image processing branch in the step (1) is specifically a full convolution network comprising a first convolution module, a second convolution module, a third convolution module and a fourth convolution module connected in sequence, the scales of the four modules being successively reduced by 1/2; each convolution module comprises three convolution layers and a maximum pooling layer, the first convolution layer being a dilated convolution and the remaining two convolution layers being ordinary convolutions.
4. The method of multi-scale fusion depth image enhancement of RGB-D images according to claim 3, wherein: the multi-scale fusion prediction branch in the step (1) comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a fifth convolution module and a sixth convolution module connected in sequence, the scales of the first to fourth convolution modules being successively increased by a factor of 2, and the scales of the fifth and sixth convolution modules being the same as that of the fourth convolution module,
wherein the first convolution module comprises two convolution layers respectively input to an output of a fourth residual module in the depth image processing branch and an output of a fourth convolution module of the RGB image processing branch, the second convolution module includes two convolution layers and a transposed convolution layer, the two convolution layers being input to an output of the third residual module of the depth image processing branch and an output of the third convolution module of the RGB image processing branch, respectively, the third convolution module includes two convolution layers and a transposed convolution layer, the two convolution layers being input to an output of the second residual module of the depth image processing branch and an output of the second convolution module of the RGB image processing branch, respectively, the fourth convolution module includes two convolution layers and a transposed convolution layer, and the two convolution layers are respectively input to the output of the first convolution module of the depth image processing branch and the output of the first convolution module of the RGB image processing branch.
5. The method of multi-scale fusion depth image enhancement of RGB-D images according to claim 4, wherein: the loss function adopted in the step (2) for network training is as follows:
in the formula, L represents the total loss,which is indicative of a loss of data,representing the loss of structure retention, y represents the true depth image in the sample,a depth image representing a prediction of the model, respectively representing the image scales from large to small, specifically the scales of the first convolution module, the second residual error module, the third residual error module and the fourth residual error module of the depth image processing branch in turn, omegalThe weight coefficients representing the different scales l,representing a predicted depth image of the l-scale, ylA real depth image representing the l-scale, H represents the height of the input image, W represents the width of the input image,respectively represent y,The difference value between the pixel of the ith row and the jth column and the adjacent pixel at the left and upper sides thereof, namely the gradient value.
6. A multi-scale fusion depth image enhancement apparatus for RGB-D images, comprising a processor and a computer program stored on a memory and executable on the processor, wherein: the processor, when executing the program, implements the method of any of claims 1-5.
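A loss of the shape defined in claim 5 can be sketched numerically (a NumPy illustration consistent with the symbol definitions, assuming an L1-style data term per scale; the exact patented formula may differ): the data term compares pixels, the structure-retention term compares left/upper-neighbour gradients, and the two are summed over scales with weights ω_l.

```python
import numpy as np

def gradient(img):
    """Differences to the left and upper neighbours (the 'gradient value')."""
    gx = img - np.roll(img, 1, axis=1)  # difference to the left neighbour
    gy = img - np.roll(img, 1, axis=0)  # difference to the upper neighbour
    gx[:, 0] = 0.0                      # no left neighbour in column 0
    gy[0, :] = 0.0                      # no upper neighbour in row 0
    return gx, gy

def hybrid_loss(preds, targets, weights):
    """Sum over scales l of w_l * (data loss + structure-retention loss)."""
    total = 0.0
    for y_hat, y, w in zip(preds, targets, weights):
        h, wd = y.shape
        data = np.abs(y_hat - y).sum() / (h * wd)
        gxh, gyh = gradient(y_hat)
        gx, gy = gradient(y)
        structure = (np.abs(gxh - gx) + np.abs(gyh - gy)).sum() / (h * wd)
        total += w * (data + structure)
    return total

rng = np.random.default_rng(0)
y = [rng.random((s, s)) for s in (8, 4)]     # "real" depth at two scales
perfect = hybrid_loss(y, y, [1.0, 0.5])      # identical prediction -> 0
noisy = hybrid_loss([img + 0.1 for img in y], y, [1.0, 0.5])
print(perfect, noisy)
```

Note that a constant offset only hits the data term (its gradients are unchanged), which is exactly the division of labour between the two terms: pixel fidelity versus edge structure.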
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110290784.6A CN113033645A (en) | 2021-03-18 | 2021-03-18 | Multi-scale fusion depth image enhancement method and device for RGB-D image |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113033645A true CN113033645A (en) | 2021-06-25 |
Family
ID=76472165
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110290784.6A Pending CN113033645A (en) | 2021-03-18 | 2021-03-18 | Multi-scale fusion depth image enhancement method and device for RGB-D image |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113033645A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023050381A1 (en) * | 2021-09-30 | 2023-04-06 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Image and video coding using multi-sensor collaboration |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170032222A1 (en) * | 2015-07-30 | 2017-02-02 | Xerox Corporation | Cross-trained convolutional neural networks using multimodal images |
CN108197587A (en) * | 2018-01-18 | 2018-06-22 | 中科视拓(北京)科技有限公司 | A kind of method that multi-modal recognition of face is carried out by face depth prediction |
CN110349087A (en) * | 2019-07-08 | 2019-10-18 | 华南理工大学 | RGB-D image superior quality grid generation method based on adaptability convolution |
CN111104532A (en) * | 2019-12-30 | 2020-05-05 | 华南理工大学 | RGBD image joint recovery method based on double-current network |
CN111832592A (en) * | 2019-04-20 | 2020-10-27 | 南开大学 | RGBD significance detection method and related device |
CN111915619A (en) * | 2020-06-05 | 2020-11-10 | 华南理工大学 | Full convolution network semantic segmentation method for dual-feature extraction and fusion |
CN112101410A (en) * | 2020-08-05 | 2020-12-18 | 中国科学院空天信息创新研究院 | Image pixel semantic segmentation method and system based on multi-modal feature fusion |
Non-Patent Citations (1)
Title |
---|
代具亭等: "基于彩色-深度图像和深度学习的场景语义分割网络", 《科学技术与工程》, vol. 18, no. 20, 18 July 2018 (2018-07-18), pages 286 - 291 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||