CN113033645A - Multi-scale fusion depth image enhancement method and device for RGB-D image - Google Patents
- Publication number
- CN113033645A (application number CN202110290784.6A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- depth image
- module
- rgb
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/253 — Pattern recognition; Analysing; Fusion techniques of extracted features (G — Physics; G06 — Computing; G06F — Electric digital data processing)
- G06F18/214 — Pattern recognition; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/045 — Computing arrangements based on biological models; Neural networks; Architecture; Combinations of networks
- G06N3/08 — Computing arrangements based on biological models; Neural networks; Learning methods
Abstract
The invention discloses a multi-scale fusion depth image enhancement method and device for RGB-D images. The invention comprises: (1) a dual-branch progressive fusion scheme in which the input RGB image and depth image complement each other during depth prediction — the depth input preserves the overall structure of the image, while color information fills in missing pixel values; (2) a hybrid multi-scale loss function, designed from an analysis of the noise distribution of real data, which can still generate high-quality, sharp images even when the real image data is noisy. The method makes reasonable use of the respective characteristics of the RGB image and the depth image, ensures that the feature information obtained from the color image assists in restoring the depth image, and finally predicts a complete depth image, significantly improving depth image quality.
Description
Technical Field
The invention relates to image processing technology, and in particular to a multi-scale fusion depth image enhancement method and device for RGB-D images.
Background
Depth image enhancement techniques fall into two categories. The first works at the hardware level, improving device quality and accuracy or refining the design scheme to capture higher-quality depth images. The second designs algorithms based on image processing principles to enhance the depth image at the software level. Hardware improvements must contend with cost and physical constraints, whereas software algorithms are cheap to develop and largely free of such limitations, so their advantages are more pronounced.
In recent years, deep learning has made remarkable progress in conventional RGB image enhancement, and many of its ideas have been carried over to depth image enhancement. Jeon et al. chose a Laplacian pyramid depth network as the basic network structure and proposed LapDEN, which can produce a clean, sharp depth image from the raw depth image but cannot accurately recover large holes and object edges in the original image. Zhang et al. proposed generating depth image datasets from RGB-D streams using 3D reconstruction; in contrast to LapDEN, they focus mainly on estimating larger unobserved depth values, but fail to remove noise and holes from low-quality RGB-D images. All of these methods aim to enhance the quality of depth images captured by an RGB-D camera, yet the critical problems of depth noise, depth holes and depth discontinuities remain poorly solved.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a multi-scale fusion depth image enhancement method and device for RGB-D images that can solve the problems of depth noise, depth holes and depth discontinuity.
The technical scheme is as follows: the method for enhancing the multi-scale fusion depth image of the RGB-D image comprises the following steps:
(1) establishing a multi-scale fusion network model, wherein the multi-scale fusion network model comprises a depth image processing branch, an RGB image processing branch and a multi-scale fusion prediction branch, the depth image processing branch is used for extracting characteristic information from a depth image, the RGB image processing branch is used for extracting the characteristic information from an RGB image paired with the depth image, and the multi-scale fusion prediction branch is used for gradually fusing the characteristic information extracted by the depth image processing branch and the RGB image processing branch in the order from low scale to high scale to predict an enhanced depth image;
(2) acquiring a plurality of image pairs comprising depth images and corresponding RGB images as samples, taking the reference depth images after the depth images are enhanced as sample labels, inputting the established multi-scale fusion network model, and carrying out network training;
(3) and inputting the depth image to be enhanced and the corresponding RGB image into the trained multi-scale fusion network model to obtain an enhanced image.
Further, the depth image processing branch in step (1) is specifically a residual learning network comprising a first convolution module, a second residual module, a third residual module and a fourth residual module connected in sequence, the scale of each of the four modules being reduced to 1/2 of the previous one.
Further, the RGB image processing branch in step (1) is specifically a full convolution network comprising a first convolution module, a second convolution module, a third convolution module and a fourth convolution module connected in sequence, the scale of each of the four modules being reduced to 1/2 of the previous one; each convolution module comprises three convolution layers and a maximum pooling layer, the first convolution layer being a dilated (atrous) convolution and the other two being standard convolution layers.
Further, the multi-scale fusion prediction branch in step (1) comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a fifth convolution module and a sixth convolution module connected in sequence; the scales of the first to fourth convolution modules increase by a factor of 2 in sequence, and the scales of the fifth and sixth convolution modules are the same as that of the fourth convolution module,
wherein the first convolution module comprises two convolution layers whose inputs are the output of the fourth residual module of the depth image processing branch and the output of the fourth convolution module of the RGB image processing branch, respectively; the second convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers taking as input the outputs of the third residual module of the depth image processing branch and the third convolution module of the RGB image processing branch, respectively; the third convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers taking as input the outputs of the second residual module of the depth image processing branch and the second convolution module of the RGB image processing branch, respectively; and the fourth convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers taking as input the outputs of the first convolution modules of the depth image processing branch and of the RGB image processing branch, respectively.
Further, the loss function adopted in step (2) for network training is:

L = Σ_{l=1}^{4} ω_l·L_d^(l) + L_sp

in which L denotes the total loss, L_d^(l) the data loss at scale l, and L_sp the structure-preservation loss; y denotes the real (reference) depth image in the sample and ŷ the predicted depth image; l = 1, 2, 3, 4 index the image scales from large to small, corresponding in turn to the scales of the first convolution module and the second, third and fourth residual modules of the depth image processing branch; ŷ^l denotes the predicted depth image at scale l and y^l the real depth image at that scale; H denotes the height and W the width of the input image; and ∇y_{i,j}, ∇ŷ_{i,j} denote the differences (i.e. gradient values) between the pixel in row i, column j and its left and upper neighbouring pixels, in y and ŷ respectively.
The multi-scale fusion depth image enhancement device for RGB-D images of the invention comprises a processor and a computer program stored in a memory and executable on the processor, the processor implementing the above method when executing the computer program.
Beneficial effects: compared with the prior art, the invention has the following notable advantages:
1. The dual-branch design exploits the correlation between the RGB image and the depth image; the rich features and object contour boundary information obtained from the RGB image assist in enhancing the depth image;
2. The progressive fusion of multi-scale high-dimensional features fills in details step by step from low-resolution to high-resolution images, making the enhancement process more robust and less prone to boundary blurring and depth noise, thereby solving the problems of depth noise, depth holes and depth discontinuity.
Drawings
FIG. 1 is an architecture diagram of a multi-scale converged network model provided by the present invention;
FIG. 2 is a graph of the predicted effect of the present invention;
FIG. 3 is a comparison of predicted images between the present invention and other methods;
FIG. 4 is a comparison of desktop details between the present invention and other methods;
FIG. 5 is a graph comparing the effect of different fusion modes and different Loss selections on depth image restoration;
FIG. 6 is a graph comparing the effect of using different inputs on depth image restoration.
Detailed Description
The embodiment provides a method for enhancing a multi-scale fusion depth image of an RGB-D image, which comprises the following steps:
(1) establishing a multi-scale fusion network model, wherein the multi-scale fusion network model comprises a depth image processing branch, an RGB image processing branch and a multi-scale fusion prediction branch, as shown in FIG. 1, the depth image processing branch is used for extracting characteristic information from a depth image, the RGB image processing branch is used for extracting the characteristic information from an RGB image paired with the depth image, and the multi-scale fusion prediction branch is used for gradually fusing the characteristic information extracted by the depth image processing branch and the RGB image processing branch in the order from low scale to high scale to predict an enhanced depth image.
The depth image processing branch is specifically a residual learning network comprising a first convolution module and second, third and fourth residual modules connected in sequence, the scale of each of the four modules being reduced to 1/2 of the previous one. The branch input is the original depth image. Its differences from the reference image are mainly holes and noise, while most of the low-frequency information in the two is similar, so a residual learning network is used: the deep network extracts rich feature information from the image, while skip connections superimpose the input onto the output, ensuring that no information from the original depth image is lost. The residual network only needs to learn the difference between the original image and the reference image, and therefore converges faster. The residual module used in the invention is that of ResNetV2, proposed in [He K, Zhang X, Ren S, et al. Identity mappings in deep residual networks [C]//European Conference on Computer Vision, 2016].
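As an illustration, a pre-activation residual module of the ResNetV2 kind the branch builds on might look as follows in PyTorch. The channel widths, kernel sizes and the 1×1 projection shortcut are illustrative assumptions, not details taken from the patent:

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Pre-activation residual block (BN -> ReLU -> Conv, ResNetV2 style); the
    skip connection superimposes the input onto the output so no original
    depth information is lost."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        # 1x1 projection when the shape changes; plain identity otherwise
        self.skip = (nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
                     if stride != 1 or in_ch != out_ch else nn.Identity())

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + self.skip(x)
```

With stride 2 the block also realizes the 1/2 scale reduction between successive modules described above.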
The RGB image processing branch is specifically a full convolution network comprising first, second, third and fourth convolution modules connected in sequence, the scale of each of the four modules being reduced to 1/2 of the previous one; each convolution module comprises three convolution layers and a maximum pooling layer, the first convolution layer being a dilated (atrous) convolution and the other two standard convolution layers. The branch input is the RGB image paired with the depth image. Under the assumptions of the Lambertian reflectance and spherical harmonic illumination models, an RGB image can be decomposed into two parts, a reflectance map and a shading map, defined as follows:
I(R,D,L)=S(N(D),L)×R
where I is the RGB image, D the depth map, N the object surface normal, L the illumination condition, S the shading map derived from the normal map N and the illumination condition L, and R the reflectance map. The formula shows that the RGB image contains partial information about the depth image and that the two are highly correlated; although the mapping is complex, a deep network can learn and fit it, allowing the RGB image to assist in enhancing the depth image. However, what the RGB image mainly contributes to depth image enhancement is a subset of features, such as overall structural information and object contours, and the features obtainable from an RGB image are very rich, so a full convolution network structure is chosen.
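One stage of the RGB branch as described — a dilated first convolution, two further convolutions, and a 2× max-pool that halves the scale — could be sketched as below. The 3×3 kernel size, channel widths and dilation rate are assumptions for illustration; the patent only fixes the layer types:

```python
import torch
import torch.nn as nn

class RGBConvModule(nn.Module):
    """One RGB-branch convolution module: dilated (atrous) conv to widen the
    receptive field, two standard convs, then a 2x max-pool halving the scale."""
    def __init__(self, in_ch, out_ch, dilation=2):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),  # reduces the scale to 1/2, as in the text
        )

    def forward(self, x):
        return self.block(x)
```

Chaining four such modules yields the four RGB feature maps at successively halved scales that the fusion branch consumes.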
The multi-scale fusion prediction branch comprises first through sixth convolution modules connected in sequence. The scales of the first to fourth convolution modules increase by a factor of 2 in sequence, and the fifth and sixth convolution modules have the same scale as the fourth. The first convolution module comprises two convolution layers whose inputs are the output of the fourth residual module of the depth image processing branch and the output of the fourth convolution module of the RGB image processing branch, respectively. The second convolution module comprises two convolution layers and a transposed convolution layer, the two convolution layers taking as input the outputs of the third residual module of the depth image processing branch and the third convolution module of the RGB image processing branch, respectively. The third convolution module likewise comprises two convolution layers and a transposed convolution layer, its two convolution layers taking as input the outputs of the second residual module of the depth image processing branch and the second convolution module of the RGB image processing branch. The fourth convolution module comprises two convolution layers and a transposed convolution layer, its two convolution layers taking as input the outputs of the first convolution modules of the depth image processing branch and of the RGB image processing branch. This branch mainly fuses the multi-scale high-dimensional features generated by the depth image processing branch and the RGB image processing branch and predicts a high-quality depth image.
Because the scene image exhibits some perspective distortion, different objects in the picture have different apparent sizes, while the receptive field of a convolution layer is fixed. For visual elements of larger scale, the receptive field covers only part of the element, so boundary blurring easily occurs; for elements of smaller scale, the receptive field may be swallowed up by other visual elements, making small objects hard to distinguish. The invention therefore adopts progressive fusion of multi-scale high-dimensional features, filling in details step by step from low-resolution to high-resolution images. Because a depth image is less information-rich than a color image, the feature dimension used to recover it must be reduced: the high-dimensional features produced by the depth image processing branch and the RGB image processing branch are each reduced to 16-dimensional features through two convolution layers, discarding complex useless features while retaining important ones. The two are then concatenated and passed through a transposed convolution layer to produce a depth image whose resolution is double that of the original feature map. Finally, predicted depth images at three scales are obtained.
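One fusion-and-upsample step of this branch might be sketched as follows in PyTorch. Only the 16-dimensional squeeze through two convolution layers, the concatenation, and the 2× transposed-convolution upsampling come from the text; the remaining channel widths and kernel sizes are assumptions:

```python
import torch
import torch.nn as nn

class FusionUpStep(nn.Module):
    """One fusion step: squeeze each branch's high-dimensional features to 16
    channels via two conv layers, concatenate them with the coarser prediction
    features, and double the resolution with a transposed convolution."""
    def __init__(self, depth_ch, rgb_ch, pred_ch):
        super().__init__()
        self.squeeze_depth = nn.Sequential(
            nn.Conv2d(depth_ch, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, 3, padding=1))
        self.squeeze_rgb = nn.Sequential(
            nn.Conv2d(rgb_ch, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, 3, padding=1))
        # kernel 4, stride 2, padding 1 exactly doubles the spatial size
        self.up = nn.ConvTranspose2d(pred_ch + 32, pred_ch,
                                     kernel_size=4, stride=2, padding=1)

    def forward(self, pred, feat_depth, feat_rgb):
        fused = torch.cat([pred,
                           self.squeeze_depth(feat_depth),
                           self.squeeze_rgb(feat_rgb)], dim=1)
        return self.up(fused)
```

Repeating such a step from the lowest scale upward yields the progressively refined multi-scale predictions described above.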
(2) And acquiring a plurality of image pairs comprising the depth image and the corresponding RGB image as samples, taking the reference depth image after the depth image enhancement as a sample label, inputting the established multi-scale fusion network model, and performing network training.
Wherein the loss function adopted during network training is:

L = Σ_{l=1}^{4} ω_l·L_d^(l) + L_sp

in which L denotes the total loss, L_d^(l) the data loss at scale l, and L_sp the structure-preservation loss; y denotes the real (reference) depth image in the sample and ŷ the predicted depth image; l = 1, 2, 3, 4 index the image scales from large to small, corresponding in turn to the scales of the first convolution module and the second, third and fourth residual modules of the depth image processing branch; ω_l denotes the weighting factor of scale l, set in this embodiment to [0.2, 0.4, 0.7, 1.0]; ŷ^l denotes the predicted depth image at scale l and y^l the real depth image at that scale; H denotes the height and W the width of the input image; and ∇y_{i,j}, ∇ŷ_{i,j} denote the differences (i.e. gradient values) between the pixel in row i, column j and its left and upper neighbouring pixels, in y and ŷ respectively.
The data loss L_d is a combination of the L1 and L2 loss functions. The L2 loss is fast to differentiate but sensitive to outliers, which is why many image restoration tasks use it. However, some reference images in the dataset used by the invention still contain holes or noise; using only the L2 loss would make the model pay excessive attention to local holes and noise, harming its generalization. The invention therefore applies the L2 loss on the small-scale images, ensuring the image as a whole shows no large deviation, and the L1 loss on the large-scale images, ensuring that image details can be restored. In addition, the depth images of interest have significant discontinuities at the edges between foreground and background regions, which conventional losses such as L1 and L2 struggle to preserve; the structure-preservation loss L_sp therefore focuses on retaining the gradient values between adjacent pixels. Since scaling the reference image down by bilinear interpolation loses detail, L_sp is applied only to the image at the largest scale.
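A minimal sketch of this hybrid multi-scale loss in PyTorch follows. The per-scale weights and the "gradient term on the largest scale only" rule come from the text; the exact split between "large" scales (L1) and "small" scales (L2) is an assumption, since the patent does not state where the boundary lies:

```python
import torch

def hybrid_multiscale_loss(preds, refs, weights=(0.2, 0.4, 0.7, 1.0)):
    """preds/refs: lists of depth maps at scales l = 1..4, ordered large -> small.
    L1 data loss on the large scales, L2 (MSE) on the small scales, plus a
    gradient (structure-preservation) term on the largest scale only."""
    loss = torch.tensor(0.0)
    for l, (p, y) in enumerate(zip(preds, refs)):
        if l < 2:                                    # large scales: L1 (assumed split)
            loss = loss + weights[l] * (p - y).abs().mean()
        else:                                        # small scales: L2
            loss = loss + weights[l] * ((p - y) ** 2).mean()
    # structure preservation on the largest scale: match the per-pixel
    # differences with the left and upper neighbours
    p, y = preds[0], refs[0]
    gx = lambda t: t[..., :, 1:] - t[..., :, :-1]    # left-neighbour difference
    gy = lambda t: t[..., 1:, :] - t[..., :-1, :]    # upper-neighbour difference
    loss = loss + (gx(p) - gx(y)).abs().mean() + (gy(p) - gy(y)).abs().mean()
    return loss
```

The gradient term is zero for any constant offset between prediction and reference, so it penalizes only smoothed or shifted edges, which is what preserves the depth discontinuities.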
(3) And inputting the depth image to be enhanced and the corresponding RGB image into the trained multi-scale fusion network model to obtain an enhanced image.
The embodiment also provides a multi-scale fusion depth image enhancement device of an RGB-D image, which comprises a processor and a computer program stored on a memory and capable of running on the processor, wherein the processor realizes the method when executing the computer program.
The following is a simulation verification for the present invention.
(1) Training details and parameter settings
The training image dataset for this experiment came from SUN RGB-D, which contains 3389 image pairs captured with an Xtion device, 1159 pairs with a RealSense device, 1449 pairs with a Kinect v1 and 3784 pairs with a Kinect v2, for a total of 9781 pairs. We selected 800 pairs from the NYUv2 subset contained in the dataset as the test set and the remaining 8981 pairs as the training set. Because the resolution of RGB-D images collected by different devices is not uniform, before training all images (RGB images and depth images) were center-cropped to 400 × 560 and used as the network input.
The experiment trained the model on the PyTorch platform with an Adam optimizer, an initial learning rate of 0.0001, and a decay factor of 0.5 applied before the 2nd, 4th, 7th, 10th and 15th epochs. The batch size was 20 and the model was trained for 100 epochs in total. Training used a server with an NVIDIA Tesla V100 with 32 GB of video memory.
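The optimizer and learning-rate schedule described above can be reproduced with standard PyTorch components; the single conv layer here is only a stand-in for the real fusion network, and the shortened epoch loop is for illustration:

```python
import torch

# Stand-in model: one conv over a 4-channel RGB-D input; the real network is
# the multi-scale fusion model described in this patent.
model = torch.nn.Conv2d(4, 1, kernel_size=3, padding=1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate around the 2nd, 4th, 7th, 10th and 15th epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[2, 4, 7, 10, 15], gamma=0.5)

for epoch in range(16):   # the patent trains for 100 epochs; shortened here
    # ... one pass over the training set (batch size 20) would go here ...
    scheduler.step()
```

After all five milestones have passed, the learning rate sits at 1e-4 · 0.5⁵ and stays there for the remaining epochs.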
(2) Comparison of Experimental results
First, the prediction results of the invention are compared with the reference image. As shown in FIG. 2, the invention fills the holes well and preserves the depth discontinuities between objects.
Next, the invention was compared with several models published in recent years, using RMSE (root mean square error), MAE (mean absolute error) and SSIM (structural similarity) as evaluation indexes. RMSE and MAE directly measure depth prediction accuracy, while SSIM measures the structural similarity between the predicted image and the reference image from an overall perspective. The advanced methods compared are: FCRN, a full convolution residual neural network for depth image prediction from color images; SharpNet, a multi-task prediction model based on color images covering depth prediction, normal estimation and boundary contour prediction; LapDEN, a depth image restoration model with a pyramid-like structure operating on the depth image; and HFM-Net, a multi-level prediction model based on both color and depth images, originally proposed for normal estimation and changed here to depth image prediction. SharpNet was originally proposed for multi-task generation from color images, but this is also a disadvantage: using the same encoder and similar decoders for three tasks leaves no room to specialize for the depth image restoration problem. The invention is also compared with the latest pyramid image restoration method LapDEN, which applies the image super-resolution idea of LapSRN to restoring depth images, whereas the model of the invention exploits a fusion structure between the color and depth images to enhance the depth image. Table 1 (best results marked in bold) shows that the method of the invention outperforms the existing methods on every index, in both image accuracy and structural similarity. As shown in FIG. 3, the result of the invention is clearly closer to the reference image than the other two methods, with great improvement in noise removal, hole filling and depth discontinuity preservation. The detail views in FIG. 4 show that, owing to the size and material of the objects on the desktop, the device has difficulty collecting data there and large holes easily appear; both the depth-image-based LapDEN and the RGB-based SharpNet predictions deviate considerably. The invention fuses the characteristics of RGB-D images, better filling the holes and, to a certain extent, protecting the depth discontinuity at object boundaries. Table 1 also shows that the model of the invention runs faster than the other methods.
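The two accuracy indexes are straightforward to compute; a minimal NumPy sketch is below. SSIM is typically computed with a library routine (e.g. scikit-image's structural_similarity) rather than by hand, so only RMSE and MAE are shown:

```python
import numpy as np

def rmse(pred, ref):
    """Root mean square error between predicted and reference depth maps."""
    return float(np.sqrt(np.mean((pred - ref) ** 2)))

def mae(pred, ref):
    """Mean absolute error between predicted and reference depth maps."""
    return float(np.mean(np.abs(pred - ref)))
```

RMSE penalizes large local errors (such as unfilled holes) more heavily than MAE, which is why the two are reported together.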
In addition, although the method is supervised, by fusing RGB image features it can still restore image regions that are blurred in the reference image. For example, for the faucet in FIG. 2, the original depth image lacks depth values over the whole faucet and the reference image clearly smooths that region away, whereas the method of the invention shows the faucet distinctly, with depth values consistent over the whole faucet.
TABLE 1 comparison of test set Performance for advanced methods
(3) Ablation experiment
In order to investigate the effect of each network component and fusion protocol on the final performance, ablation studies were performed on the NYUV2 validation dataset. The quantitative comparison is summarized in table 2.
Table 2 comparison of performance of ablation experiments
Selection of input branches: to determine whether the RGB image and the depth image each make a distinct contribution to the network, the two branches were deleted in turn, reducing the network to a plain encoder-decoder structure. Comparing the "RGB", "Depth" and "RGB + Depth" rows in Table 2, the "RGB + Depth" network is clearly better than the other two. FIG. 6 also shows that although the depth predicted from the single RGB branch is much worse than from the single Depth branch, the model with the RGB branch added repairs holes and noisy regions better. The reason is that the RGB image is highly correlated with the depth image and can assist in enhancing it.
Selection of the feature fusion mode: existing methods typically fuse multimodal features by matrix addition (add) or concatenation (concatenate). To compare them, only the feature fusion operation in the fusion branch was changed, leaving the other network components and settings unchanged; the results are denoted "add" and "splice" respectively. The "add" result is somewhat worse than "splice". This is because the RGB and depth features are heterogeneous data, and addition forces the model to treat the two kinds of data implicitly in the same way, which degrades performance. Concatenation, by contrast, does not destroy the independence of the original features, and the subsequent convolutional layer can learn weights for the different features by itself. This is also why existing methods choose addition or concatenation according to the source and correlation of the features. It is likewise apparent from FIG. 5 (g-l) that concatenation-based fusion recovers objects in the depth image well and protects boundary discontinuities.
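The two fusion operations compared in this ablation differ only in whether the modalities share channels, which a two-line PyTorch example makes concrete:

```python
import torch

depth_feat = torch.randn(1, 16, 8, 8)   # features from the depth branch
rgb_feat = torch.randn(1, 16, 8, 8)     # features from the RGB branch

# "add": the heterogeneous modalities are forced into the same channel slots
added = depth_feat + rgb_feat

# "splice" (concatenate): both feature sets are kept intact, and the next
# conv layer can learn its own weighting between them
spliced = torch.cat([depth_feat, rgb_feat], dim=1)
```

Addition keeps the channel count at 16 while concatenation doubles it to 32, so concatenation costs more parameters in the following layer in exchange for preserving each modality's features.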
Selection of the loss function: besides the modifications to the model, the choice of the loss function also influences the final result. Using the L1 loss function for all layers makes the predicted image overly detailed, while using L2 makes the predicted image blurred. It is clear from fig. 5(d-f) that the hybrid loss function of the present invention both preserves the clarity of the overall structure and retains the depth discontinuities at object edges.
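Why L2 blurs while L1 keeps edges sharp can be seen from a one-line statistic (a hypothetical NumPy illustration, not part of the patent): the constant that minimizes an L2 loss over a neighbourhood is its mean, whereas the L1 minimizer is its median.

```python
import numpy as np

# A 1-D "edge": depth jumps from 1.0 to 5.0. If a model predicts a single
# value for this neighbourhood, the L2-optimal answer is the mean (a
# blurred, in-between depth that exists on no surface), while the
# L1-optimal answer is the median (an actual surface depth, keeping the
# edge sharp).
neighbourhood = np.array([1.0, 1.0, 1.0, 5.0, 5.0])

l2_optimal = neighbourhood.mean()      # 2.6 -> blurred in-between depth
l1_optimal = np.median(neighbourhood)  # 1.0 -> a true surface depth

print(l2_optimal, l1_optimal)
```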
In conclusion, the analysis of the results shows that the multi-scale fusion network proposed by the invention fuses multi-scale RGB image features with depth image features: it retains the structural information of the original depth image while exploiting the high resolution and rich features of the RGB image, and thus effectively resolves holes, noise and similar defects in the depth image. Furthermore, since the reference images of the data set are imperfect, the invention also designs a hybrid loss function to ensure the generation of high-quality depth images. Extensive experimental results show that the method is clearly superior to existing state-of-the-art methods on the depth image enhancement task. Ablation studies further prove that the proposed RGB-D multi-scale fusion scheme outperforms generation from a single RGB image or a single depth image, and that the hybrid loss function converges better and yields better results than the L1 and L2 loss functions.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Claims (6)
1. A multi-scale fusion depth image enhancement method of an RGB-D image is characterized by comprising the following steps:
(1) establishing a multi-scale fusion network model, wherein the multi-scale fusion network model comprises a depth image processing branch, an RGB image processing branch and a multi-scale fusion prediction branch, the depth image processing branch is used for extracting characteristic information from a depth image, the RGB image processing branch is used for extracting the characteristic information from an RGB image paired with the depth image, and the multi-scale fusion prediction branch is used for gradually fusing the characteristic information extracted by the depth image processing branch and the RGB image processing branch in the order from low scale to high scale to predict an enhanced depth image;
(2) acquiring a plurality of image pairs each comprising a depth image and a corresponding RGB image as samples, taking the enhanced reference depth image of each depth image as the sample label, inputting them into the established multi-scale fusion network model, and carrying out network training;
(3) inputting the depth image to be enhanced and its corresponding RGB image into the trained multi-scale fusion network model to obtain the enhanced depth image.
2. The method for multi-scale fusion depth image enhancement of RGB-D images according to claim 1, wherein: the depth image processing branch in the step (1) is specifically a residual learning network comprising a first convolution module, a second residual module, a third residual module and a fourth residual module connected in sequence, the scales of the four modules being successively reduced by 1/2.
3. The method of multi-scale fusion depth image enhancement of RGB-D images according to claim 2, wherein: the RGB image processing branch in the step (1) is specifically a full convolution network comprising a first convolution module, a second convolution module, a third convolution module and a fourth convolution module connected in sequence, the scales of the four modules being successively reduced by 1/2; each convolution module comprises three convolution layers and a maximum pooling layer, the first convolution layer being a dilated convolution and the remaining two convolution layers being ordinary convolutions.
4. The method of multi-scale fusion depth image enhancement of RGB-D images according to claim 3, wherein: the multi-scale fusion prediction branch in the step (1) comprises a first convolution module, a second convolution module, a third convolution module, a fourth convolution module, a fifth convolution module and a sixth convolution module connected in sequence, the scales of the first to fourth convolution modules being successively increased by a factor of 2, and the scales of the fifth and sixth convolution modules being the same as that of the fourth convolution module,
wherein the first convolution module comprises two convolution layers respectively input to an output of a fourth residual module in the depth image processing branch and an output of a fourth convolution module of the RGB image processing branch, the second convolution module includes two convolution layers and a transposed convolution layer, the two convolution layers being input to an output of the third residual module of the depth image processing branch and an output of the third convolution module of the RGB image processing branch, respectively, the third convolution module includes two convolution layers and a transposed convolution layer, the two convolution layers being input to an output of the second residual module of the depth image processing branch and an output of the second convolution module of the RGB image processing branch, respectively, the fourth convolution module includes two convolution layers and a transposed convolution layer, and the two convolution layers are respectively input to the output of the first convolution module of the depth image processing branch and the output of the first convolution module of the RGB image processing branch.
5. The method of multi-scale fusion depth image enhancement of RGB-D images according to claim 4, wherein: the loss function adopted in the step (2) for network training is as follows:
in the formula, L represents the total loss,which is indicative of a loss of data,representing the loss of structure retention, y represents the true depth image in the sample,a depth image representing a prediction of the model, respectively representing the image scales from large to small, specifically the scales of the first convolution module, the second residual error module, the third residual error module and the fourth residual error module of the depth image processing branch in turn, omegalThe weight coefficients representing the different scales l,representing a predicted depth image of the l-scale, ylA real depth image representing the l-scale, H represents the height of the input image, W represents the width of the input image,respectively represent y,The difference value between the pixel of the ith row and the jth column and the adjacent pixel at the left and upper sides thereof, namely the gradient value.
6. A multi-scale fusion depth image enhancement apparatus for RGB-D images, comprising a processor and a computer program stored on a memory and executable on the processor, wherein: the processor, when executing the program, implements the method of any of claims 1-5.
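A loss of the shape defined in claim 5 can be sketched numerically (a NumPy illustration consistent with the symbol definitions, assuming an L1-style data term per scale; the exact patented formula may differ): the data term compares pixels, the structure-retention term compares left/upper-neighbour gradients, and the two are summed over scales with weights ω_l.

```python
import numpy as np

def gradient(img):
    """Differences to the left and upper neighbours (the 'gradient value')."""
    gx = img - np.roll(img, 1, axis=1)  # difference to the left neighbour
    gy = img - np.roll(img, 1, axis=0)  # difference to the upper neighbour
    gx[:, 0] = 0.0                      # no left neighbour in column 0
    gy[0, :] = 0.0                      # no upper neighbour in row 0
    return gx, gy

def hybrid_loss(preds, targets, weights):
    """Sum over scales l of w_l * (data loss + structure-retention loss)."""
    total = 0.0
    for y_hat, y, w in zip(preds, targets, weights):
        h, wd = y.shape
        data = np.abs(y_hat - y).sum() / (h * wd)
        gxh, gyh = gradient(y_hat)
        gx, gy = gradient(y)
        structure = (np.abs(gxh - gx) + np.abs(gyh - gy)).sum() / (h * wd)
        total += w * (data + structure)
    return total

rng = np.random.default_rng(0)
y = [rng.random((s, s)) for s in (8, 4)]     # "real" depth at two scales
perfect = hybrid_loss(y, y, [1.0, 0.5])      # identical prediction -> 0
noisy = hybrid_loss([img + 0.1 for img in y], y, [1.0, 0.5])
print(perfect, noisy)
```

Note that a constant offset only hits the data term (its gradients are unchanged), which is exactly the division of labour between the two terms: pixel fidelity versus edge structure.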
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110290784.6A CN113033645A (en) | 2021-03-18 | 2021-03-18 | Multi-scale fusion depth image enhancement method and device for RGB-D image |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113033645A true CN113033645A (en) | 2021-06-25 |
Family
ID=76472165
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110290784.6A Pending CN113033645A (en) | 2021-03-18 | 2021-03-18 | Multi-scale fusion depth image enhancement method and device for RGB-D image |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113033645A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023050381A1 (en) * | 2021-09-30 | 2023-04-06 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Image and video coding using multi-sensor collaboration |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170032222A1 (en) * | 2015-07-30 | 2017-02-02 | Xerox Corporation | Cross-trained convolutional neural networks using multimodal images |
CN108197587A (en) * | 2018-01-18 | 2018-06-22 | 中科视拓(北京)科技有限公司 | A kind of method that multi-modal recognition of face is carried out by face depth prediction |
CN110349087A (en) * | 2019-07-08 | 2019-10-18 | 华南理工大学 | RGB-D image superior quality grid generation method based on adaptability convolution |
CN111104532A (en) * | 2019-12-30 | 2020-05-05 | 华南理工大学 | RGBD image joint recovery method based on double-current network |
CN111832592A (en) * | 2019-04-20 | 2020-10-27 | 南开大学 | RGBD significance detection method and related device |
CN111915619A (en) * | 2020-06-05 | 2020-11-10 | 华南理工大学 | Full convolution network semantic segmentation method for dual-feature extraction and fusion |
CN112101410A (en) * | 2020-08-05 | 2020-12-18 | 中国科学院空天信息创新研究院 | Image pixel semantic segmentation method and system based on multi-modal feature fusion |
Non-Patent Citations (1)
Title |
---|
代具亭等: "基于彩色-深度图像和深度学习的场景语义分割网络", 《科学技术与工程》, vol. 18, no. 20, 18 July 2018 (2018-07-18), pages 286 - 291 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||