CN113033645A - Multi-scale fusion depth image enhancement method and device for RGB-D image - Google Patents

Multi-scale fusion depth image enhancement method and device for RGB-D image

Info

Publication number
CN113033645A
CN113033645A (application CN202110290784.6A)
Authority
CN
China
Prior art keywords
convolution
depth image
module
rgb
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110290784.6A
Other languages
Chinese (zh)
Inventor
赖水长
过洁
郭延文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202110290784.6A priority Critical patent/CN113033645A/en
Publication of CN113033645A publication Critical patent/CN113033645A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a multi-scale fusion depth image enhancement method and device for RGB-D images. The invention includes: (1) a dual-branch progressive fusion scheme in which the RGB image and the depth image complement each other during depth prediction, the depth input preserving the overall structure of the image while the color input fills in missing pixel values; (2) a hybrid multi-scale loss function designed from an analysis of the noise distribution of real data, so that high-quality, sharp depth images can still be generated even when the real image data are noisy. The method makes reasonable use of the respective characteristics of the RGB image and the depth image, ensures that the feature information obtained from the color image assists in repairing the depth image, and finally predicts a complete depth image whose quality is significantly improved.

Description

Multi-scale fusion depth image enhancement method and device for RGB-D image
Technical Field
The invention relates to an image processing technology, in particular to a method and a device for enhancing a multi-scale fusion depth image of an RGB-D image.
Background
Depth image enhancement techniques can be divided into two categories: one works at the hardware level, improving device quality and accuracy or refining the design scheme so as to capture higher-quality depth images; the other designs algorithms based on image processing principles and enhances the depth image at the software level. Hardware approaches must contend with cost and physical limitations, whereas software algorithms have low development cost and far fewer constraints, so their advantages are more pronounced.
In recent years, deep learning has made remarkable progress in conventional RGB image enhancement, and many of its ideas have been applied to depth image enhancement. Jeon et al. chose a Laplacian-pyramid depth network as the basic network structure and proposed LapDEN, which can generate a clean, sharp depth image from the original depth image but cannot accurately recover large-area holes and object edges. Zhang et al. proposed generating depth image datasets from RGB-D streams using 3D reconstruction; in contrast to LapDEN, they focus mainly on estimating large unobserved depth values but fail to eliminate the noise and holes of low-quality RGB-D images. All of these methods aim to enhance the quality of depth images captured by an RGB-D camera, yet the critical issues of depth noise, depth holes and depth discontinuities remain poorly solved.
Disclosure of Invention
Purpose of the invention: aiming at the problems in the prior art, the invention provides a multi-scale fusion depth image enhancement method and device for RGB-D images that can address depth noise, depth holes and depth discontinuity.
The technical scheme is as follows: the method for enhancing the multi-scale fusion depth image of the RGB-D image comprises the following steps:
(1) establishing a multi-scale fusion network model, wherein the multi-scale fusion network model comprises a depth image processing branch, an RGB image processing branch and a multi-scale fusion prediction branch, the depth image processing branch is used for extracting characteristic information from a depth image, the RGB image processing branch is used for extracting the characteristic information from an RGB image paired with the depth image, and the multi-scale fusion prediction branch is used for gradually fusing the characteristic information extracted by the depth image processing branch and the RGB image processing branch in the order from low scale to high scale to predict an enhanced depth image;
(2) acquiring a plurality of image pairs comprising depth images and corresponding RGB images as samples, taking the reference depth images after the depth images are enhanced as sample labels, inputting the established multi-scale fusion network model, and carrying out network training;
(3) inputting the depth image to be enhanced and the corresponding RGB image into the trained multi-scale fusion network model to obtain the enhanced image.
Further, the depth image processing branch in step (1) is specifically a residual learning network, and comprises a first convolution module, a second residual module, a third residual module and a fourth residual module which are connected in sequence, the scales of the four modules being successively reduced by 1/2.
Further, the RGB image processing branch in step (1) is specifically a full convolution network comprising a first convolution module, a second convolution module, a third convolution module and a fourth convolution module which are connected in sequence, the scales of the four modules being successively reduced by 1/2; each convolution module comprises three convolution layers and a maximum pooling layer, the first convolution layer being a dilated (atrous) convolution and the other two being ordinary convolution layers.
Further, the multi-scale fusion prediction branch in step (1) comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a fifth convolution module and a sixth convolution module which are connected in sequence; the scales of the first, second, third and fourth convolution modules are successively increased by a factor of 2, and the scales of the fifth and sixth convolution modules are the same as that of the fourth convolution module,
wherein the first convolution module comprises two convolution layers whose inputs are respectively the output of the fourth residual module of the depth image processing branch and the output of the fourth convolution module of the RGB image processing branch; the second convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers respectively receiving the output of the third residual module of the depth image processing branch and the output of the third convolution module of the RGB image processing branch; the third convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers respectively receiving the output of the second residual module of the depth image processing branch and the output of the second convolution module of the RGB image processing branch; and the fourth convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers respectively receiving the output of the first convolution module of the depth image processing branch and the output of the first convolution module of the RGB image processing branch.
Further, the loss function adopted in the step (2) for network training is as follows:
L = L_data + L_ss

L_data = Σ_{l=1}^{4} ω_l · (1/(H_l·W_l)) · Σ_{i=1}^{H_l} Σ_{j=1}^{W_l} |ŷ^l_{i,j} - y^l_{i,j}|^{p_l}

L_ss = (1/(H·W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} |∇ŷ_{i,j} - ∇y_{i,j}|

in the formula, L denotes the total loss, L_data the data loss and L_ss the structure-preserving loss; y denotes the reference depth image in the sample (the sample label) and ŷ the depth image predicted by the model; l = 1, 2, 3, 4 index the image scales from large to small, corresponding in turn to the scales of the first convolution module, second residual module, third residual module and fourth residual module of the depth image processing branch; ω_l denotes the weight coefficient of scale l; ŷ^l denotes the predicted depth image at scale l and y^l the real depth image at scale l; H denotes the height of the input image and W its width (H_l and W_l those at scale l); p_l selects the L1 or L2 form of the data term at scale l; and ∇y_{i,j}, ∇ŷ_{i,j} denote, for y and ŷ respectively, the difference between the pixel in the i-th row and j-th column and its left and upper neighbouring pixels, i.e. the gradient value.
The device for enhancing the multi-scale fusion depth image of the RGB-D image comprises a processor and a computer program which is stored on a memory and can run on the processor, wherein the processor realizes the method when executing the computer program.
Advantageous effects: compared with the prior art, the invention has the following remarkable advantages:
1. the design of the double branches utilizes the correlation between the RGB image and the depth image, and the rich characteristics and the object contour boundary information obtained from the RGB image have an auxiliary effect on the enhancement of the depth image;
2. by adopting the method of gradually fusing the multi-scale high-dimensional features, the details can be gradually filled from low resolution to high resolution images, so that the robustness in the enhancement process is stronger, the problems of boundary blurring, depth noise and the like are not easy to occur, and the problems of depth noise, depth holes and depth discontinuity are solved.
Drawings
FIG. 1 is an architecture diagram of a multi-scale converged network model provided by the present invention;
FIG. 2 is a graph of the predicted effect of the present invention;
FIG. 3 is a comparison of images predicted by the present invention and by the comparison methods;
FIG. 4 compares desktop details predicted by the present invention and by the comparison methods;
FIG. 5 is a graph comparing the effect of different fusion modes and different Loss selections on depth image restoration;
FIG. 6 is a graph comparing the effect of using different inputs on depth image restoration.
Detailed Description
The embodiment provides a method for enhancing a multi-scale fusion depth image of an RGB-D image, which comprises the following steps:
(1) establishing a multi-scale fusion network model, wherein the multi-scale fusion network model comprises a depth image processing branch, an RGB image processing branch and a multi-scale fusion prediction branch, as shown in FIG. 1, the depth image processing branch is used for extracting characteristic information from a depth image, the RGB image processing branch is used for extracting the characteristic information from an RGB image paired with the depth image, and the multi-scale fusion prediction branch is used for gradually fusing the characteristic information extracted by the depth image processing branch and the RGB image processing branch in the order from low scale to high scale to predict an enhanced depth image.
The depth image processing branch is specifically a residual learning network and comprises a first convolution module, a second residual module, a third residual module and a fourth residual module which are connected in sequence, the scales of the four modules being successively reduced by 1/2. The input of this branch is the original depth image. The difference between the original depth image and the reference image consists mainly of holes and noise, while most of their low-frequency information is similar; a residual learning network therefore allows a deep network to extract rich feature information from the image while the skip connections superimpose the input onto the output, ensuring that the information of the original depth image is not lost. The residual network only needs to learn the difference between the original image and the reference image and thus converges faster. The residual module used in the invention is the one used in ResNet-v2, proposed in: He K, Zhang X, Ren S, et al. Identity mappings in deep residual networks [C] // European Conference on Computer Vision, 2016.
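For concreteness, a minimal PyTorch sketch of such a depth branch follows (PyTorch is the platform used in the experiments later in this description). Only the pre-activation residual design and the successive 1/2 downsampling follow the text above; the channel counts, kernel sizes and the use of a single residual block per module are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation residual block in the spirit of ResNet-v2 (He et al., ECCV 2016)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
        )
        # 1x1 shortcut so the skip connection matches shape when downsampling.
        self.skip = (nn.Conv2d(in_ch, out_ch, 1, stride=stride)
                     if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x):
        return self.body(x) + self.skip(x)

class DepthBranch(nn.Module):
    """First convolution module followed by three residual modules, each halving the scale."""
    def __init__(self, channels=(32, 64, 128, 256)):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(1, channels[0], 3, padding=1), nn.ReLU(inplace=True))
        self.res2 = PreActResidualBlock(channels[0], channels[1], stride=2)  # 1/2 scale
        self.res3 = PreActResidualBlock(channels[1], channels[2], stride=2)  # 1/4 scale
        self.res4 = PreActResidualBlock(channels[2], channels[3], stride=2)  # 1/8 scale

    def forward(self, depth):
        f1 = self.conv1(depth)
        f2 = self.res2(f1)
        f3 = self.res3(f2)
        f4 = self.res4(f3)
        return f1, f2, f3, f4  # multi-scale features handed to the fusion branch
```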
The RGB image processing branch is specifically a full convolution network comprising a first convolution module, a second convolution module, a third convolution module and a fourth convolution module which are connected in sequence, the scales of the four modules being successively reduced by 1/2; each convolution module comprises three convolution layers and a maximum pooling layer, the first convolution layer being a dilated (atrous) convolution and the other two being ordinary convolution layers. The input of this branch is the RGB image paired with the depth image. Under the assumptions of the Lambertian reflectance and spherical-harmonic illumination model, an RGB image can be decomposed into two parts, a reflectance map and a shading (illumination) map, defined as follows:
I(R,D,L)=S(N(D),L)×R
where I is the RGB image, D is the depth map, N is the object surface normal, L is the illumination condition, S is the shading map derived from the normal map N and the illumination condition L, and R is the reflectance map. From this formula it can be seen that the RGB image contains partial information about the depth image and that the two are highly correlated; although the mapping relationship is complex, a deep network can learn and fit it, so the RGB image can assist in enhancing the depth image. The contribution of the RGB image to depth image enhancement, however, mainly lies in features such as overall structure information and object contours, and since the features obtainable from an RGB image are quite rich, a full convolution network structure is chosen for this branch.
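A sketch of one such convolution module, and of the four-module branch, is shown below under the same assumptions as before. Only the layout of one dilated convolution, two ordinary convolutions and a max-pooling layer is stated in the text; the channel counts, the dilation rate and the decision to skip pooling in the first module (so that the four feature scales line up with the depth-branch sketch above) are illustrative choices.

```python
import torch.nn as nn

class RGBConvModule(nn.Module):
    """One module of the RGB branch: a dilated (atrous) convolution, two ordinary
    convolutions and an optional max-pooling layer that halves the spatial scale."""
    def __init__(self, in_ch, out_ch, dilation=2, pool=True):
        super().__init__()
        layers = [
            # The dilated first layer enlarges the receptive field without extra parameters.
            nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        ]
        if pool:
            layers.append(nn.MaxPool2d(2))  # 1/2 downsampling
        self.layers = nn.Sequential(*layers)

    def forward(self, x):
        return self.layers(x)

class RGBBranch(nn.Module):
    """Four convolution modules in sequence; the assumed scales (1, 1/2, 1/4, 1/8)
    mirror the depth branch so that the fusion branch can pair the features."""
    def __init__(self, channels=(3, 32, 64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList(
            RGBConvModule(channels[i], channels[i + 1], pool=(i > 0)) for i in range(4))

    def forward(self, rgb):
        feats = []
        x = rgb
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats
```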
The multi-scale fusion prediction branch comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a fifth convolution module and a sixth convolution module which are connected in sequence; the scales of the first, second, third and fourth convolution modules are successively increased by a factor of 2, and the scales of the fifth and sixth convolution modules are the same as that of the fourth convolution module. The first convolution module comprises two convolution layers whose inputs are respectively the output of the fourth residual module of the depth image processing branch and the output of the fourth convolution module of the RGB image processing branch; the second convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers receiving the output of the third residual module of the depth image processing branch and the output of the third convolution module of the RGB image processing branch, respectively; the third convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers receiving the output of the second residual module of the depth image processing branch and the output of the second convolution module of the RGB image processing branch, respectively; the fourth convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers receiving the output of the first convolution module of the depth image processing branch and the output of the first convolution module of the RGB image processing branch, respectively. This branch mainly fuses the multi-scale high-dimensional features generated by the depth image processing branch and the RGB image processing branch and predicts a high-quality depth image. Because a certain amount of perspective distortion exists in a scene image, different objects in the picture have different visual sizes, whereas the receptive field of a convolution layer is fixed: for visual elements of larger scale the receptive field covers only part of the element, so boundary blurring easily occurs, while for visual elements of smaller scale the receptive field also takes in surrounding elements, so small objects easily become indistinct. The branch therefore fuses the multi-scale high-dimensional features gradually, filling in details step by step from low-resolution to high-resolution images. Because a depth image does not carry as much information as a color image, the feature dimensionality used to recover the depth image is reduced: the high-dimensional features generated by the depth image processing branch and by the RGB image processing branch are each reduced to 16 dimensions by a convolution layer, removing complex useless features while retaining the important ones. The two reduced features are then concatenated and processed by the transposed convolution layer to generate a feature map whose resolution is doubled compared with the original feature map. Finally, predicted depth images at three scales are obtained.
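A sketch of one fusion step under the description above follows. The "reduce each branch to 16 dimensions, concatenate, upsample 2x with a transposed convolution" pattern is taken from the text, while the kernel sizes, the per-scale prediction head and passing the previously fused features in by concatenation are assumptions of this sketch (the text's first fusion module, for instance, has no transposed convolution, and the fifth and sixth convolution modules that refine the final output are omitted here).

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """One fusion step of the prediction branch: two convolution layers reduce the
    depth-branch and RGB-branch features to 16 dimensions each (discarding redundant
    features while keeping the important ones), the reduced features are concatenated
    together with the previously fused features (if any), and a transposed convolution
    produces a feature map at twice the resolution."""
    def __init__(self, depth_ch, rgb_ch, prev_ch=0, reduced=16):
        super().__init__()
        self.reduce_depth = nn.Conv2d(depth_ch, reduced, 3, padding=1)
        self.reduce_rgb = nn.Conv2d(rgb_ch, reduced, 3, padding=1)
        # kernel 4, stride 2, padding 1 gives an exact 2x upsampling
        self.up = nn.ConvTranspose2d(2 * reduced + prev_ch, reduced, 4, stride=2, padding=1)
        self.predict = nn.Conv2d(reduced, 1, 3, padding=1)  # per-scale depth prediction

    def forward(self, depth_feat, rgb_feat, prev=None):
        fused = torch.cat([self.reduce_depth(depth_feat), self.reduce_rgb(rgb_feat)], dim=1)
        if prev is not None:
            fused = torch.cat([fused, prev], dim=1)
        up = self.up(fused)          # fused feature map at twice the input resolution
        return up, self.predict(up)  # features for the next module and a depth prediction
```

Chaining four such modules from the lowest-scale features (fourth residual module of the depth branch and fourth convolution module of the RGB branch) up to the full-resolution features reproduces the progressive low-to-high-scale fusion described above.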
(2) Acquiring a plurality of image pairs, each comprising a depth image and the corresponding RGB image, as samples; taking the reference depth image obtained after enhancement of the depth image as the sample label; and inputting them into the established multi-scale fusion network model for network training.
Wherein, the loss function adopted during network training is as follows:
L = L_data + L_ss

L_data = Σ_{l=1}^{4} ω_l · (1/(H_l·W_l)) · Σ_{i=1}^{H_l} Σ_{j=1}^{W_l} |ŷ^l_{i,j} - y^l_{i,j}|^{p_l}

L_ss = (1/(H·W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} |∇ŷ_{i,j} - ∇y_{i,j}|

in the formula, L denotes the total loss, L_data the data loss and L_ss the structure-preserving loss; y denotes the reference depth image in the sample and ŷ the depth image predicted by the model; l = 1, 2, 3, 4 index the image scales from large to small, corresponding in turn to the scales of the first convolution module, second residual module, third residual module and fourth residual module of the depth image processing branch; ω_l denotes the weight coefficient of scale l, set in this embodiment to [0.2, 0.4, 0.7, 1.0]; ŷ^l denotes the predicted depth image at scale l and y^l the real depth image at scale l; H denotes the height of the input image and W its width (H_l and W_l those at scale l); p_l chooses between the L1 and L2 forms of the data term at scale l, as explained below; and ∇y_{i,j}, ∇ŷ_{i,j} denote, for y and ŷ respectively, the difference between the pixel in the i-th row and j-th column and its left and upper neighbouring pixels, i.e. the gradient value.
L_data is a combination of the L1 and L2 loss functions. The L2 loss has fast derivatives but is sensitive to outliers, which is why many image restoration tasks use it. However, some reference images in the dataset used by the invention still contain holes or noise; using only the L2 loss would make the model pay too much attention to local holes and noise and harm its generalization. The invention therefore adopts the L2 loss on the small-scale images to keep the whole image from deviating strongly, and the L1 loss on the large-scale image to ensure that image details can be restored. In addition, the depth images of interest have significant discontinuities at the edges between foreground and background regions, which conventional losses such as L1 and L2 struggle to preserve. The L_ss term therefore focuses on preserving the gradients between adjacent pixels. Since the reference image itself is scaled down by bilinear interpolation, which loses detail, L_ss is applied only to the image at the largest scale.
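A sketch of this hybrid loss is given below, assuming the model returns one predicted depth map per loss scale, ordered from large to small. The split of L1 (largest scale) versus L2 (three smaller scales) follows the "L1 on the large scale, L2 on the small scales" reading of the text and should be treated as an assumption, as should the bilinear downsampling of the reference inside the loss.

```python
import torch
import torch.nn.functional as F

def hybrid_multiscale_loss(preds, target, weights=(0.2, 0.4, 0.7, 1.0)):
    """preds: list of predicted depth maps from large (full) to small scale (l = 1..4);
    target: full-resolution reference depth (N x 1 x H x W). Scales, weights and the
    L1/L2 split follow the description above; the exact split is an assumption."""
    loss = 0.0
    for l, (pred, w) in enumerate(zip(preds, weights), start=1):
        # Downsample the reference to the prediction's scale (bilinear, as in the text).
        y_l = F.interpolate(target, size=pred.shape[-2:], mode='bilinear', align_corners=False)
        if l == 1:
            loss = loss + w * F.l1_loss(pred, y_l)   # L1 keeps detail at the large scale
        else:
            loss = loss + w * F.mse_loss(pred, y_l)  # L2 keeps small scales from drifting
    # Structure-preserving term: match gradients (left/upper neighbour differences)
    # on the largest scale only, to keep depth discontinuities at object edges sharp.
    pred_full, y_full = preds[0], target
    grad = lambda t: (t[..., :, 1:] - t[..., :, :-1], t[..., 1:, :] - t[..., :-1, :])
    (px, py), (yx, yy) = grad(pred_full), grad(y_full)
    loss = loss + F.l1_loss(px, yx) + F.l1_loss(py, yy)
    return loss
```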
(3) Inputting the depth image to be enhanced and the corresponding RGB image into the trained multi-scale fusion network model to obtain the enhanced image.
The embodiment also provides a multi-scale fusion depth image enhancement device of an RGB-D image, which comprises a processor and a computer program stored on a memory and capable of running on the processor, wherein the processor realizes the method when executing the computer program.
The following is a simulation verification for the present invention.
(1) Training details and parameter settings
The training image dataset for this experiment comes from SUN RGB-D, which contains 3389 image pairs captured with the Xtion device, 1159 pairs captured with the RealSense device, 1449 pairs captured with Kinect v1 and 3784 pairs captured with Kinect v2, 9781 pairs in total. We selected 800 pairs from the NYUv2 subset contained in the dataset as the test set and the remaining 8981 pairs as the training set. Before training, because the resolutions of the RGB-D images collected by the different devices are not the same, all images (RGB images and depth images) are center-cropped to 400 x 560 and used as the input of the network.
The experiments trained the model on the PyTorch platform with the Adam optimizer, an initial learning rate of 0.0001, and a decay of 0.5 applied before the 2nd, 4th, 7th, 10th and 15th training epochs. The batch size was 20 and the model was trained for 100 epochs in total. Training used a server with an NVIDIA Tesla V100 GPU with 32 GB of video memory.
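The training configuration described above can be reproduced with a sketch along the following lines. The data loader is assumed to yield (depth, RGB, reference) triples, hybrid_multiscale_loss refers to the loss sketch given earlier, and the epochs at which MultiStepLR applies the 0.5 decay are an approximation of the schedule quoted above.

```python
import torch
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms

def train(model: torch.nn.Module, dataset: Dataset, device="cuda"):
    """Training loop reproducing the reported settings; the dataset is assumed to yield
    (depth, rgb, reference) tensors already center-cropped to 400 x 560."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    # Halve the learning rate around the 2nd, 4th, 7th, 10th and 15th epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[2, 4, 7, 10, 15], gamma=0.5)
    loader = DataLoader(dataset, batch_size=20, shuffle=True, num_workers=4)
    for epoch in range(100):
        for depth, rgb, reference in loader:
            preds = model(depth.to(device), rgb.to(device))
            loss = hybrid_multiscale_loss(preds, reference.to(device))  # sketched above
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()

# The 400 x 560 center crop applied to every RGB and depth image before training:
center_crop = transforms.CenterCrop((400, 560))  # (height, width) ordering assumed
```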
(2) Comparison of Experimental results
First, the prediction results of the invention are compared with the reference image. As shown in FIG. 2, the invention fills holes well and preserves the depth discontinuities between objects.
Next, the invention is compared with several advanced models published in recent years, using RMSE (root mean square error), MAE (mean absolute error) and SSIM (structural similarity) as evaluation indices. RMSE and MAE directly measure depth prediction accuracy, while SSIM measures the structural similarity between the predicted image and the reference image from an overall perspective. The advanced methods used for comparison are: FCRN, a full convolution residual neural network for depth image prediction based on color images; SharpNet, a multi-task prediction model based on color images covering depth prediction, normal estimation and boundary contour prediction; LapDEN, a depth image restoration model with a pyramid-like structure operating on the depth image; and HFM-Net, a multi-level prediction model based on the color image and the depth image, originally used for normal estimation and adapted here to predict depth images. SharpNet was originally proposed for multi-task prediction from color images, but this is also a drawback: the same encoder and similar decoders are shared among three tasks, so it cannot be specialized for the depth image restoration problem. The invention is also compared with the latest pyramid-based restoration method LapDEN, which borrows the Laplacian-pyramid image super-resolution idea to restore depth images, whereas the model of the invention exploits a fusion structure between the color and depth images to enhance the depth image. Table 1 (best results in bold) shows that the method of the invention is superior to the existing methods on every index, in both image accuracy and structural similarity. As shown in FIG. 3, the results of the method are clearly closer to the reference image than those of the other two methods, with large improvements in noise removal, hole filling and depth discontinuity preservation. The detail views in FIG. 4 show that, owing to the size and material of the objects on the desktop, the device has difficulty collecting data there and large holes are easily produced; both LapDEN, which relies only on the depth image, and SharpNet, which predicts from the RGB image, deviate considerably in this region. By fusing the characteristics of the RGB-D images, the invention fills the holes better and protects the discontinuity of object boundary depths to a certain extent. Table 1 also shows that the model of the invention runs faster than the other methods.
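The three evaluation indices can be computed directly from the predicted and reference depth maps, for example as below; using scikit-image's structural_similarity for SSIM is this sketch's choice, as the patent does not name an implementation.

```python
import numpy as np
from skimage.metrics import structural_similarity

def depth_metrics(pred, reference):
    """pred, reference: H x W depth maps as float numpy arrays in the same units."""
    diff = pred - reference
    rmse = float(np.sqrt(np.mean(diff ** 2)))   # root mean square error
    mae = float(np.mean(np.abs(diff)))          # mean absolute error
    ssim = structural_similarity(pred, reference,
                                 data_range=reference.max() - reference.min())
    return {"RMSE": rmse, "MAE": mae, "SSIM": ssim}
```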
In addition, although the method is supervised, it can still restore image regions that are blurred in the reference image by fusing the features of the RGB image. For the faucet in FIG. 2, for example, the original depth image lacks depth values over the whole faucet and the reference image clearly smooths the faucet away, whereas the method of the invention recovers the faucet much more distinctly, with depth values close to those of the faucet as a whole.
TABLE 1 Test-set performance comparison with the advanced methods
[Table 1 is provided as an image in the original publication.]
(3) Ablation experiment
In order to investigate the effect of each network component and fusion scheme on the final performance, ablation studies were performed on the NYUv2 validation dataset. The quantitative comparison is summarized in Table 2.
TABLE 2 Performance comparison of the ablation experiments
[Table 2 is provided as an image in the original publication.]
Selection of input branches: to determine whether the RGB image and the depth image each contribute noticeably to the network, the two branches are deleted from the model in turn, turning the network into a simple encoder-decoder structure. Comparing the "RGB", "Depth" and "RGB + Depth" branch results in Table 2, it is evident that the "RGB + Depth" network is clearly better than the other two. FIG. 6 also shows that, although predicting depth from the single RGB branch is much less effective than from the single Depth branch, the model with the RGB branch added repairs holes and noisy regions better. This is because the RGB image is highly correlated with the depth image and can assist in enhancing it.
Selection of the feature fusion mode: existing methods typically use matrix addition (add) or concatenation (concatenate) for multimodal feature fusion. To compare them, the feature fusion operation in the fusion branch is changed while all other network components and settings are kept unchanged; the results are denoted "add" and "splice", respectively. It can be seen that the "add" result is somewhat less effective than "splice". This is because the RGB image features and depth image features are heterogeneous data, and addition means that the model implicitly treats these two different kinds of data in the same way, which can degrade performance. Concatenation, by contrast, does not destroy the independence of the original features, and the subsequent convolutional layer can learn the weights of the different features by itself. For this reason, existing methods likewise choose addition or concatenation according to the source and correlation of the features. It is also apparent from FIG. 5 (g-l) that the concatenation fusion mode recovers objects in the depth image well and protects the discontinuities at boundaries.
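The two fusion operators compared here can be written down directly; the 1x1 convolution that mixes the concatenated channels back to the original width is an illustrative choice so that both variants produce tensors of the same shape.

```python
import torch
import torch.nn as nn

def fuse_add(rgb_feat, depth_feat):
    # "add": element-wise sum treats the two heterogeneous features identically.
    return rgb_feat + depth_feat

class FuseConcat(nn.Module):
    # "splice": concatenation keeps the features independent; the following
    # convolution can learn its own weighting between the two modalities.
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat, depth_feat):
        return self.mix(torch.cat([rgb_feat, depth_feat], dim=1))
```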
Selection of the loss function: besides modifications to the model, the choice of the loss function also influences the final result. Using the L1 loss function for all scales leaves the predicted image with excessive detail, while using L2 everywhere blurs the prediction. As FIG. 5 (d-f) shows, the hybrid loss of the present invention preserves both the clarity of the overall structure and the depth discontinuities at object edges.
In conclusion, the above analysis shows that the multi-scale fusion network provided by the invention fuses multi-scale RGB image features and depth image features, retaining the structural information of the original depth image while also exploiting the high resolution and rich features of the RGB image, and thus solves the hole and noise problems of the depth image well. Furthermore, since the reference images of the dataset are not perfect, the invention also designs a hybrid loss function to ensure that high-quality depth images are generated. Extensive experimental results show that the method is clearly superior to existing advanced methods on the depth image enhancement task. The ablation studies also show that the proposed RGB-D multi-scale fusion scheme outperforms generation from a single RGB image or a single depth image, and that the hybrid loss function converges better and gives better results than the L1 and L2 loss functions alone.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (6)

1. A multi-scale fusion depth image enhancement method of an RGB-D image is characterized by comprising the following steps:
(1) establishing a multi-scale fusion network model, wherein the multi-scale fusion network model comprises a depth image processing branch, an RGB image processing branch and a multi-scale fusion prediction branch, the depth image processing branch is used for extracting characteristic information from a depth image, the RGB image processing branch is used for extracting the characteristic information from an RGB image paired with the depth image, and the multi-scale fusion prediction branch is used for gradually fusing the characteristic information extracted by the depth image processing branch and the RGB image processing branch in the order from low scale to high scale to predict an enhanced depth image;
(2) acquiring a plurality of image pairs comprising depth images and corresponding RGB images as samples, taking the reference depth images after the depth images are enhanced as sample labels, inputting the established multi-scale fusion network model, and carrying out network training;
(3) inputting the depth image to be enhanced and the corresponding RGB image into the trained multi-scale fusion network model to obtain the enhanced image.
2. The method for multi-scale fusion depth image enhancement of RGB-D images according to claim 1, wherein: the depth image processing branch in step (1) is specifically a residual learning network, and comprises a first convolution module, a second residual module, a third residual module and a fourth residual module which are connected in sequence, the scales of the four modules being successively reduced by 1/2.
3. The method of multi-scale fusion depth image enhancement of RGB-D images according to claim 2, wherein: the RGB image processing branch in step (1) is specifically a full convolution network comprising a first convolution module, a second convolution module, a third convolution module and a fourth convolution module which are connected in sequence, the scales of the four modules being successively reduced by 1/2; each convolution module comprises three convolution layers and a maximum pooling layer, the first convolution layer being a dilated (atrous) convolution and the other two being ordinary convolution layers.
4. The method of multi-scale fusion depth image enhancement of RGB-D images according to claim 3, wherein: the multi-scale fusion prediction branch in step (1) comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a fifth convolution module and a sixth convolution module which are connected in sequence; the scales of the first, second, third and fourth convolution modules are successively increased by a factor of 2, and the scales of the fifth and sixth convolution modules are the same as that of the fourth convolution module,
wherein the first convolution module comprises two convolution layers whose inputs are respectively the output of the fourth residual module of the depth image processing branch and the output of the fourth convolution module of the RGB image processing branch; the second convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers respectively receiving the output of the third residual module of the depth image processing branch and the output of the third convolution module of the RGB image processing branch; the third convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers respectively receiving the output of the second residual module of the depth image processing branch and the output of the second convolution module of the RGB image processing branch; and the fourth convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers respectively receiving the output of the first convolution module of the depth image processing branch and the output of the first convolution module of the RGB image processing branch.
5. The method of multi-scale fusion depth image enhancement of RGB-D images according to claim 4, wherein: the loss function adopted in the step (2) for network training is as follows:
L = L_data + L_ss

L_data = Σ_{l=1}^{4} ω_l · (1/(H_l·W_l)) · Σ_{i=1}^{H_l} Σ_{j=1}^{W_l} |ŷ^l_{i,j} - y^l_{i,j}|^{p_l}

L_ss = (1/(H·W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} |∇ŷ_{i,j} - ∇y_{i,j}|

in the formula, L denotes the total loss, L_data the data loss and L_ss the structure-preserving loss; y denotes the true depth image in the sample and ŷ the depth image predicted by the model; l = 1, 2, 3, 4 index the image scales from large to small, corresponding in turn to the scales of the first convolution module, second residual module, third residual module and fourth residual module of the depth image processing branch; ω_l denotes the weight coefficient of scale l; ŷ^l denotes the predicted depth image at scale l and y^l the real depth image at scale l; H denotes the height of the input image and W its width (H_l and W_l those at scale l); p_l selects the L1 or L2 form of the data term at scale l; and ∇y_{i,j}, ∇ŷ_{i,j} denote, for y and ŷ respectively, the difference between the pixel in the i-th row and j-th column and its left and upper neighbouring pixels, i.e. the gradient value.
6. A multi-scale fusion depth image enhancement apparatus for RGB-D images, comprising a processor and a computer program stored on a memory and executable on the processor, wherein: the processor, when executing the program, implements the method of any of claims 1-5.
CN202110290784.6A 2021-03-18 2021-03-18 Multi-scale fusion depth image enhancement method and device for RGB-D image Pending CN113033645A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110290784.6A CN113033645A (en) 2021-03-18 2021-03-18 Multi-scale fusion depth image enhancement method and device for RGB-D image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110290784.6A CN113033645A (en) 2021-03-18 2021-03-18 Multi-scale fusion depth image enhancement method and device for RGB-D image

Publications (1)

Publication Number Publication Date
CN113033645A true CN113033645A (en) 2021-06-25

Family

ID=76472165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110290784.6A Pending CN113033645A (en) 2021-03-18 2021-03-18 Multi-scale fusion depth image enhancement method and device for RGB-D image

Country Status (1)

Country Link
CN (1) CN113033645A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170032222A1 (en) * 2015-07-30 2017-02-02 Xerox Corporation Cross-trained convolutional neural networks using multimodal images
CN108197587A (en) * 2018-01-18 2018-06-22 中科视拓(北京)科技有限公司 A kind of method that multi-modal recognition of face is carried out by face depth prediction
CN111832592A (en) * 2019-04-20 2020-10-27 南开大学 RGBD significance detection method and related device
CN110349087A (en) * 2019-07-08 2019-10-18 华南理工大学 RGB-D image superior quality grid generation method based on adaptability convolution
CN111104532A (en) * 2019-12-30 2020-05-05 华南理工大学 RGBD image joint recovery method based on double-current network
CN111915619A (en) * 2020-06-05 2020-11-10 华南理工大学 Full convolution network semantic segmentation method for dual-feature extraction and fusion
CN112101410A (en) * 2020-08-05 2020-12-18 中国科学院空天信息创新研究院 Image pixel semantic segmentation method and system based on multi-modal feature fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
代具亭 (Dai Juting) et al., "Scene semantic segmentation network based on color-depth images and deep learning", Science Technology and Engineering, vol. 18, no. 20, 18 July 2018 (2018-07-18), pages 286-291 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023050381A1 (en) * 2021-09-30 2023-04-06 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Image and video coding using multi-sensor collaboration

Similar Documents

Publication Publication Date Title
Liu et al. Trident dehazing network
Liu et al. End-to-end single image fog removal using enhanced cycle consistent adversarial networks
CN111292264B (en) Image high dynamic range reconstruction method based on deep learning
Wang et al. Haze concentration adaptive network for image dehazing
CN107818554B (en) Information processing apparatus and information processing method
CN112767468A (en) Self-supervision three-dimensional reconstruction method and system based on collaborative segmentation and data enhancement
CN111754438B (en) Underwater image restoration model based on multi-branch gating fusion and restoration method thereof
Lu et al. Deep texture and structure aware filtering network for image smoothing
CN113450288B (en) Single image rain removing method and system based on deep convolutional neural network and storage medium
CN112884669B (en) Image restoration method based on multi-scale content attention mechanism, storage medium and terminal
Cheng et al. Zero-shot image super-resolution with depth guided internal degradation learning
Pan et al. MIEGAN: Mobile image enhancement via a multi-module cascade neural network
CN112581370A (en) Training and reconstruction method of super-resolution reconstruction model of face image
Guo et al. Single image dehazing based on fusion strategy
CN110418139B (en) Video super-resolution restoration method, device, equipment and storage medium
CN111179196B (en) Multi-resolution depth network image highlight removing method based on divide-and-conquer
KR102311796B1 (en) Method and Apparatus for Deblurring of Human Motion using Localized Body Prior
CN114897742B (en) Image restoration method with texture and structural features fused twice
CN112116543A (en) Image restoration method, system and device based on detection type generation framework
Chang et al. UIDEF: A real-world underwater image dataset and a color-contrast complementary image enhancement framework
CN112801911B (en) Method and device for removing text noise in natural image and storage medium
Zheng et al. Double-branch dehazing network based on self-calibrated attentional convolution
Chen et al. Attention-based broad self-guided network for low-light image enhancement
CN113033645A (en) Multi-scale fusion depth image enhancement method and device for RGB-D image
CN112598604A (en) Blind face restoration method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination