CN115330850A - Sparse image depth completion method, system and equipment - Google Patents
Sparse image depth completion method, system and equipment
- Publication number
- CN115330850A CN115330850A CN202210848854.XA CN202210848854A CN115330850A CN 115330850 A CN115330850 A CN 115330850A CN 202210848854 A CN202210848854 A CN 202210848854A CN 115330850 A CN115330850 A CN 115330850A
- Authority
- CN
- China
- Prior art keywords
- module
- depth
- feature map
- new feature
- deconvolution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20112—Image segmentation details
- G06T2207/20132—Image cropping
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Measurement Of Optical Distance (AREA)
- Image Processing (AREA)
- Optical Radar Systems And Details Thereof (AREA)
Abstract
The invention discloses a sparse image depth completion method, system and equipment. First, RGB image data of an image to be processed is acquired by a camera, and sparse depth data depth of the image to be processed is acquired by a laser radar; the RGB image data is passed through a semantic segmentation network to generate semantic image data semantic of the image to be processed. Then, the RGB image data, the semantic image data semantic and the sparse depth data depth are input into a sparse image depth completion network to obtain the final depth completion result. The sparse image depth completion network comprises an RGB image feature extraction module, a semantic image feature extraction module, a sparse depth feature extraction module, an MAFF feature fusion module and a weighted fusion module. Experiments show that the sparse image depth completion network achieves higher accuracy.
Description
Technical Field
The invention belongs to the technical field of image data processing and image enhancement, and relates to a sparse image depth completion method, system and equipment, in particular to a sparse image depth completion method, system and equipment based on fusion of RGB image and semantic image guidance.
Background
Estimation of dense depth measurements is crucial in various 3D vision and robotics applications, such as augmented and mixed reality, scene reconstruction, autonomous driving, and obstacle avoidance. To obtain reliable depth predictions in outdoor scenes, measurements from various sensors are used. The most commonly used sensors include stereo RGB camera pairs, light detection and ranging (LiDAR), and time-of-flight cameras. Among these sensors, lidar is considered the most reliable and efficient, allowing accurate depth measurements in outdoor environments. However, lidar depth measurements are sparse, and a significant amount of depth data is missing. For example, the Velodyne HDL-64e lidar used for mobile applications and in the KITTI dataset generates depth maps in which only 5.9% of the pixels carry valid depth values. Such sparse depth maps cannot be used directly in the application areas above. It is therefore important to estimate a dense depth map from sparse measurements. This is considered a challenging problem because the measured depth values account for only 5.9% of the full depth map.
To solve this problem, prior art exists that uses deep-learning-based methods to achieve dense depth completion. These methods use convolutional neural networks to combine sparse lidar data with different modalities, such as RGB images, affinity matrices and surface normals; Huang et al. used multi-scale features (https://ieeexplore.ieee.org/abstract/document/8946876/), and Qiu et al. introduced surface normal information (https://openaccess.thecvf.com/content_CVPR_2019/html/Qiu_DeepLiDAR_Deep_Surface_Normal_Guided_Depth_Prediction_for_Outdoor_Scene_CVPR_2019_paper.html). These modalities serve as guides and greatly help to recover the missing depth values in sparse maps. The idea is to actively fuse features between the different modalities. Most existing methods adopt a dual-branch network structure to perform feature fusion. For example, DeepLiDAR, FusionNet and PENet use an encoder-decoder architecture to perform early and late fusion between color images and lidar sparse depth maps to achieve dense depth completion. Such methods use convolutional neural networks to combine sparse lidar data with other modalities, and the combination of early and late fusion between modalities enhances the depth completion capability.
Gu et al. added an additional structural loss (https://ieeexplore.ieee.org/abstract/document/9357967), and Chen et al. used an L2 loss combined with a smooth L1 loss (https://openaccess.thecvf.com/content_ICCV_2019/html/Chen_Learning_Joint_2D-3D_Representations_for_Depth_Completion_ICCV_2019_paper.html). Furthermore, Uhrig et al. used different sparsity-invariant convolutions (https://ieeexplore.ieee.org/abstract/document/8374553/). Eldesokey et al. added an exploration of uncertainty (https://openaccess.thecvf.com/content_CVPR_2020/html/Eldesokey_Uncertainty-Aware_CNNs_for_Depth_Completion_Uncertainty_From_Beginning_to_End_CVPR_2020_paper.html), Tang et al. improved the multi-modal fusion strategy (https://ieeexplore.ieee.org/document/9286883), and so on, further improving performance.
Disclosure of Invention
Based on deep learning theory and methods, the invention aims to design a sparse image depth completion network under the guidance of RGB images and semantic image data: a three-branch feature extraction module better extracts the features of the different modal data, and a multi-modal fusion module makes better use of the RGB image data features and semantic image data features to guide the completion of sparse image depth into dense depth.
The method adopts the technical scheme that: a sparse image depth completion method comprises the following steps:
step 1: acquiring RGB image data of an image to be processed through a camera, and acquiring sparse depth data depth of the image to be processed through a laser radar; the RGB image data generates semantic image data semantic of an image to be processed through a semantic segmentation network;
step 2: inputting RGB image data, semantic image data semantic and sparse depth data depth into the sparse image depth completion network to obtain a final depth completion result;
the sparse image depth completion network comprises an RGB image feature extraction module, a semantic image feature extraction module, a sparse depth feature extraction module, an MAFF feature fusion module and a weighting fusion module;
the RGB image feature extraction module inputs RGB image data and sparse depth data depth and outputs a depth completion intermediate result C-depth and a confidence weight C-confidence;
the semantic image feature extraction module inputs semantic image data semantic, sparse depth data depth and C-depth and outputs a depth completion intermediate result S-depth and a confidence weight S-confidence;
the sparse depth feature extraction module inputs sparse depth data depth, C-depth and S-depth and outputs a depth completion intermediate result D-depth and a confidence weight D-confidence;
the MAFF feature fusion module is used for fusing semantic image features, RGB image features and sparse depth features in the semantic image feature extraction module and the sparse depth feature extraction module;
and the weighted fusion module is used for performing weighted fusion on the outputs of the RGB image feature extraction module, the semantic image feature extraction module and the sparse depth feature extraction module.
The technical scheme adopted by the system of the invention is as follows: a sparse image depth completion system comprises an information acquisition module and a depth completion module:
the information acquisition module is used for acquiring RGB image data of the image to be processed through the camera and acquiring sparse depth data depth of the image to be processed through the laser radar; the RGB image data generates semantic image data semantic of an image to be processed through a semantic segmentation network;
the depth completion module is used for inputting the RGB image data, the semantic image data semantic and the sparse depth data depth into the sparse image depth completion network to obtain a final depth completion result;
the sparse image depth completion network comprises an RGB image feature extraction module, a semantic image feature extraction module, a sparse depth feature extraction module, an MAFF feature fusion module and a weighting fusion module;
the RGB image feature extraction module inputs RGB image data and sparse depth data depth and outputs a depth completion intermediate result C-depth and a confidence weight C-confidence;
the semantic image feature extraction module inputs semantic image data semantic, sparse depth data depth and C-depth and outputs a depth completion intermediate result S-depth and a confidence weight S-confidence;
the sparse depth feature extraction module inputs sparse depth data depth, C-depth and S-depth and outputs a depth completion intermediate result D-depth and a confidence weight D-confidence;
the MAFF feature fusion module is used for fusing semantic image features, RGB image features and sparse depth features in the semantic image feature extraction module and the sparse depth feature extraction module;
and the weighted fusion module is used for performing weighted fusion on the outputs of the RGB image feature extraction module, the semantic image feature extraction module and the sparse depth feature extraction module.
The technical scheme adopted by the equipment of the invention is as follows: a sparse image depth completion apparatus comprising:
one or more processors;
a storage device to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the sparse image depth completion method.
On the basis of fully analyzing the shortcomings of existing depth completion models, the invention provides a new depth completion model from three aspects: changing the network structure, adding semantic image data and a semantic branch, and introducing the multi-modal fusion module MAFF.
Drawings
FIG. 1 is a schematic diagram of a sparse image depth completion network according to an embodiment of the present invention;
FIG. 2 is a block diagram of an MAFF feature fusion module according to an embodiment of the present invention;
FIG. 3 is a flow chart of sparse image depth completion network training according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a depth completion result according to an embodiment of the present invention.
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention is further described in detail with reference to the accompanying drawings and examples, it is to be understood that the embodiments described herein are merely illustrative and explanatory of the present invention and are not restrictive thereof.
The invention provides a sparse image depth completion method, which comprises the following steps:
step 1: acquiring RGB image data of an image to be processed through a camera, and acquiring sparse depth data depth of the image to be processed through a laser radar; the RGB image data generates semantic image data semantic of an image to be processed through a semantic segmentation network;
Step 2: inputting RGB image data, semantic image data semantic and sparse depth data depth into a sparse image depth completion network to obtain a final depth completion result;
referring to fig. 1, the sparse image depth completion network of the present embodiment includes an RGB image feature extraction module, a semantic image feature extraction module, a sparse depth feature extraction module, an MAFF feature fusion module, and a weighted fusion module; the RGB image feature extraction module inputs RGB image data and sparse depth data depth and outputs a depth completion intermediate result C-depth and a confidence weight C-confidence; the semantic image feature extraction module inputs semantic image data semantic, sparse depth data depth and C-depth and outputs a depth completion intermediate result S-depth and a confidence weight S-confidence; the sparse depth feature extraction module inputs sparse depth data depth, C-depth and S-depth and outputs a depth completion intermediate result D-depth and a confidence weight D-confidence; the MAFF feature fusion module is used for fusing semantic image features, RGB image features and sparse depth features in the semantic image feature extraction module and the sparse depth feature extraction module; and the weighted fusion module is used for performing weighted fusion on the outputs of the RGB image feature extraction module, the semantic image feature extraction module and the sparse depth feature extraction module.
The RGB image feature extraction module of this embodiment is built on ResNet. Its inputs are the RGB image data and the sparse depth data depth, which pass in sequence through the 0th conventional convolution layer, the 1st to 5th residual convolution modules, the 1st to 5th deconvolution modules, the 6th to 10th residual convolution modules, the 6th to 10th deconvolution modules and the 1st conventional convolution layer of the RGB image feature extraction module, producing in turn a feature map cd_0 of 1216 × 320 × 32, a feature map cd_1 of 608 × 160 × 64, a feature map cd_2 of 304 × 80 × 128, a feature map cd_3 of 152 × 40 × 256, a feature map cd_4 of 76 × 20 × 512, a feature map cd_5 of 38 × 10 × 1024, a deconvolution feature map d_cd_1 of 76 × 20 × 512, a deconvolution feature map d_cd_2 of 152 × 40 × 256, a deconvolution feature map d_cd_3 of 304 × 80 × 128, a deconvolution feature map d_cd_4 of 608 × 160 × 64, a deconvolution feature map d_cd_5 of 1216 × 320 × 32, a feature map cd_6 of 608 × 160 × 64, a feature map cd_7 of 304 × 80 × 128, a feature map cd_8 of 152 × 40 × 256, a feature map cd_9 of 76 × 20 × 512, a feature map cd_10 of 38 × 10 × 1024, a deconvolution feature map d_cd_6 of 76 × 20 × 512, a deconvolution feature map d_cd_7 of 152 × 40 × 256, a deconvolution feature map d_cd_8 of 304 × 80 × 128, a deconvolution feature map d_cd_9 of 608 × 160 × 64 and a deconvolution feature map d_cd_10 of 1216 × 320 × 32. The pairs cd_1 and d_cd_4, cd_2 and d_cd_3, cd_3 and d_cd_2, cd_4 and d_cd_1, and cd_0 and d_cd_5 are added correspondingly to generate a first, a second, a third, a fourth and a fifth new feature map of unchanged size. The first new feature map and cd_6, the second new feature map and cd_7, the third new feature map and cd_8, and the fourth new feature map and cd_9 are added correspondingly to generate a sixth, a seventh, an eighth and a ninth new feature map of unchanged size. The ninth new feature map and d_cd_6, the eighth new feature map and d_cd_7, the seventh new feature map and d_cd_8, the sixth new feature map and d_cd_9, and the fifth new feature map and d_cd_10 are added correspondingly to generate a tenth, an eleventh, a twelfth, a thirteenth and a fourteenth new feature map of unchanged size. The first to thirteenth new feature maps are used in sequence as the inputs of the 5th deconvolution module, the 4th deconvolution module, the 3rd deconvolution module, the 2nd deconvolution module, the 6th residual convolution module, the 7th residual convolution module, the 8th residual convolution module, the 9th residual convolution module, the 10th residual convolution module, the 6th deconvolution module, the 7th deconvolution module, the 8th deconvolution module, the 9th deconvolution module and the 10th deconvolution module, which continue to participate in forward propagation; the final first conventional convolution layer outputs a confidence weight C-confidence of size 1216 × 352 × 1 and a depth completion intermediate result C-depth of size 1216 × 352 × 1.
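The residual convolution modules and deconvolution modules referred to in this branch (and reused in the two branches below) are not specified beyond the feature-map shapes they produce, so the following is only a minimal PyTorch sketch: the kernel sizes, strides, normalization layers and the 1 × 1 projection on the skip path are assumptions chosen so that each residual convolution module halves the spatial resolution and doubles the channel count, while each deconvolution module does the reverse.

```python
import torch.nn as nn

class ResidualConvModule(nn.Module):
    """Downsampling residual block: e.g. 1216x320x32 -> 608x160x64 (shapes per the text)."""

    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection so the shortcut matches the new shape (assumed detail).
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.body(x) + self.skip(x))


class DeconvModule(nn.Module):
    """Upsampling block producing the deconvolution feature maps d_cd_*, d_sd_*, d_d_*."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.up(x)
```

The element-wise additions between equally sized encoder and decoder feature maps (for example cd_1 and d_cd_4) are performed outside these blocks, as described above.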
The inputs of the semantic image feature extraction module of this embodiment are the semantic image data semantic, the sparse depth data depth and the intermediate result C-depth output by the RGB image feature extraction module. They pass in sequence through the 0th conventional convolution layer, the 1st fusion module, the 1st residual convolution module, the 2nd fusion module, the 2nd residual convolution module, the 3rd fusion module, the 3rd residual convolution module, the 4th fusion module, the 4th residual convolution module, the 5th fusion module, the 5th residual convolution module, the 1st to 5th deconvolution modules, the 6th to 10th residual convolution modules, the 6th to 10th deconvolution modules and the 1st conventional convolution output layer, producing in turn a feature map sd_0 of 1216 × 320 × 32, a feature map F1 of 1216 × 320 × 32, a feature map sd_1 of 608 × 160 × 64, a feature map F2 of 608 × 160 × 64, a feature map sd_2 of 304 × 80 × 128, a feature map F3 of 304 × 80 × 128, a feature map sd_3 of 152 × 40 × 256, a feature map F4 of 152 × 40 × 256, a feature map sd_4 of 76 × 20 × 512, a feature map F5 of 76 × 20 × 512, a feature map sd_5 of 38 × 10 × 1024, a deconvolution feature map d_sd_1 of 76 × 20 × 512, a deconvolution feature map d_sd_2 of 152 × 40 × 256, a deconvolution feature map d_sd_3 of 304 × 80 × 128, a deconvolution feature map d_sd_4 of 608 × 160 × 64, a deconvolution feature map d_sd_5 of 1216 × 320 × 32, a feature map sd_6 of 608 × 160 × 64, a feature map sd_7 of 304 × 80 × 128, a feature map sd_8 of 152 × 40 × 256, a feature map sd_9 of 76 × 20 × 512, a feature map sd_10 of 38 × 10 × 1024, a deconvolution feature map d_sd_6 of 76 × 20 × 512, a deconvolution feature map d_sd_7 of 152 × 40 × 256, a deconvolution feature map d_sd_8 of 304 × 80 × 128, a deconvolution feature map d_sd_9 of 608 × 160 × 64 and a deconvolution feature map d_sd_10 of 1216 × 320 × 32.
The fourteenth new feature map and sd_0, the thirteenth new feature map and sd_1, the twelfth new feature map and sd_2, the eleventh new feature map and sd_3, and the tenth new feature map and sd_4 (the new feature maps being those generated by the RGB image feature extraction module) are added correspondingly to generate a new feature map A, a new feature map B, a new feature map C, a new feature map D and a new feature map E of unchanged size. The new feature maps E and d_sd_1, D and d_sd_2, C and d_sd_3, B and d_sd_4, and A and d_sd_5 are added correspondingly to generate a new feature map F, a new feature map G, a new feature map H, a new feature map I and a new feature map J of unchanged size. The new feature maps I and sd_6, H and sd_7, G and sd_8, and F and sd_9 are added correspondingly to generate a new feature map K, a new feature map L, a new feature map M and a new feature map N of unchanged size. The new feature maps N and d_sd_6, M and d_sd_7, L and d_sd_8, K and d_sd_9, and J and d_sd_10 are added correspondingly to generate a new feature map O, a new feature map P, a new feature map Q, a new feature map R and a new feature map S of unchanged size.
The new feature map A, the feature map F6, the new feature map B, the feature map F7, the new feature map C, the feature map F8, the new feature map D, the feature map F9, the new feature map E, the feature map F10, the new feature maps F, G, H, I and J, the new feature maps K, L, M and N, the feature map sd_10, and the new feature maps O, P, Q, R and S respectively serve as the inputs of the 1st fusion module, the 1st residual convolution module, the 2nd fusion module, the 2nd residual convolution module, the 3rd fusion module, the 3rd residual convolution module, the 4th fusion module, the 4th residual convolution module, the 5th fusion module, the 5th residual convolution module, the 6th deconvolution module, the 7th deconvolution module, the 8th deconvolution module, the 9th deconvolution module, the 10th deconvolution module, the 7th residual convolution module, the 8th residual convolution module, the 9th residual convolution module, the 10th residual convolution module, the 6th deconvolution module, the 7th deconvolution module, the 8th deconvolution module, the 9th deconvolution module and the 10th deconvolution module, which continue to participate in forward propagation; the final first conventional convolution layer outputs a confidence weight S-confidence of size 1216 × 320 × 1 and a depth completion intermediate result S-depth of size 1216 × 320 × 1.
The inputs of the sparse depth feature extraction module of this embodiment are the intermediate result C-depth output by the RGB image feature extraction module, the intermediate result S-depth output by the semantic image feature extraction module, and the sparse depth image data. They pass in sequence through the 0th conventional convolution layer of the sparse depth feature extraction module, the 1st fusion module, the 1st residual convolution module, the 2nd fusion module, the 3rd residual convolution module, the 4th residual convolution module, the 3rd fusion module, the 5th residual convolution module, the 6th residual convolution module, the 4th fusion module, the 7th residual convolution module, the 8th residual convolution module, the 5th fusion module, the 9th residual convolution module, the 10th residual convolution module, the 1st to 5th deconvolution modules, the 6th to 10th residual convolution modules, the 6th to 10th deconvolution modules and the 1st conventional convolution output layer, producing in turn a feature map d_0 of 1216 × 320 × 32, a feature map F6 of 1216 × 320 × 32, a feature map d_1 of 608 × 160 × 64, a feature map d_2 of 608 × 160 × 64, a feature map F7 of 608 × 160 × 64, a feature map d_3 of 304 × 80 × 128, a feature map d_4 of 304 × 80 × 128, a feature map F8 of 304 × 80 × 128, a feature map d_5 of 152 × 40 × 256, a feature map d_6 of 152 × 40 × 256, a feature map F9 of 152 × 40 × 256, a feature map d_7 of 76 × 20 × 512, a feature map d_8 of 76 × 20 × 512, a feature map F10 of 76 × 20 × 512, a feature map d_9 of 38 × 10 × 1024, a feature map d_10 of 38 × 10 × 1024, a deconvolution feature map d_d_1 of 76 × 20 × 512, a deconvolution feature map d_d_2 of 152 × 40 × 256, a deconvolution feature map d_d_3 of 304 × 80 × 128, a deconvolution feature map d_d_4 of 608 × 160 × 64 and a deconvolution feature map d_d_5 of 1216 × 320 × 32. d_0 and the new feature map S, d_2 and the new feature map R, d_4 and the new feature map Q, d_6 and the new feature map P, and d_8 and the new feature map O are added correspondingly to generate a new feature map (1), a new feature map (2), a new feature map (3), a new feature map (4) and a new feature map (5) of unchanged size; the new feature map (5) and d_d_1, the new feature map (4) and d_d_2, the new feature map (3) and d_d_3, the new feature map (2) and d_d_4, and the new feature map (1) and d_d_5 are added correspondingly to generate a new feature map (6), a new feature map (7), a new feature map (8), a new feature map (9) and a new feature map (10) of unchanged size.
d_0, the new feature map (1), d_1, d_2, the new feature map (2), d_3, d_4, the new feature map (3), d_5, d_6, the new feature map (4), d_7, d_8, the new feature map (5), d_9, d_10, the new feature map (6), the new feature map (7), the new feature map (8), the new feature map (9) and the new feature map (10) respectively serve as the inputs of the 6th fusion module, the 1st residual convolution module, the 2nd residual convolution module, the 7th fusion module, the 3rd residual convolution module, the 4th residual convolution module, the 8th fusion module, the 5th residual convolution module, the 6th residual convolution module, the 9th fusion module, the 7th residual convolution module, the 8th residual convolution module, the 10th fusion module, the 9th residual convolution module, the 10th residual convolution module, the 1st deconvolution module, the 2nd deconvolution module, the 3rd deconvolution module and the 4th deconvolution module; the 5th deconvolution module and the 1st conventional convolution layer continue to participate in forward propagation; the final first conventional convolution layer outputs a confidence weight D-confidence of size 1216 × 320 × 1 and a depth completion intermediate result D-depth of size 1216 × 320 × 1.
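For orientation only, the three feature extraction branches and the confidence-weighted fusion of their outputs (fig. 1) can be summarized by the following PyTorch sketch. The branch internals are abstracted behind placeholder modules, and normalizing the three confidence maps with a pixel-wise softmax is an assumption of the sketch; this embodiment only states that the branch outputs are fused by weighting with their confidence maps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseImageDepthCompletionNet(nn.Module):
    """Structural sketch of the three-branch network of fig. 1.

    rgb_branch, semantic_branch and depth_branch are placeholders for the RGB
    image, semantic image and sparse depth feature extraction modules; each is
    assumed to return an intermediate depth map and a confidence map of shape
    (B, 1, H, W).
    """

    def __init__(self, rgb_branch, semantic_branch, depth_branch):
        super().__init__()
        self.rgb_branch = rgb_branch            # -> (C-depth, C-confidence)
        self.semantic_branch = semantic_branch  # -> (S-depth, S-confidence)
        self.depth_branch = depth_branch        # -> (D-depth, D-confidence)

    def forward(self, rgb, semantic, sparse_depth):
        # Branch 1: RGB image data + sparse depth -> C-depth, C-confidence.
        c_depth, c_conf = self.rgb_branch(rgb, sparse_depth)
        # Branch 2: semantic image + sparse depth + C-depth -> S-depth, S-confidence.
        s_depth, s_conf = self.semantic_branch(semantic, sparse_depth, c_depth)
        # Branch 3: sparse depth + C-depth + S-depth -> D-depth, D-confidence.
        d_depth, d_conf = self.depth_branch(sparse_depth, c_depth, s_depth)

        # Weighted fusion of the three intermediate results; the pixel-wise
        # softmax normalization of the confidence maps is an assumption.
        weights = F.softmax(torch.cat([c_conf, s_conf, d_conf], dim=1), dim=1)
        depths = torch.cat([c_depth, s_depth, d_depth], dim=1)
        final_depth = (weights * depths).sum(dim=1, keepdim=True)
        return final_depth, (c_depth, s_depth, d_depth)
```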
Please refer to fig. 2. The MAFF feature fusion module of this embodiment is composed of, connected in sequence, a concatenation layer, a local attention layer and a global attention layer arranged in parallel, a Sigmoid layer, a second local attention layer and global attention layer arranged in parallel, and a second Sigmoid layer. The local attention layer consists, in order, of 1 conventional convolution layer, 1 BN layer, 1 Leaky ReLU activation function layer, 1 conventional convolution layer and 1 BN layer; the global attention layer consists, in order, of 1 global pooling layer, 1 conventional convolution layer, 1 BN layer, 1 ReLU activation function layer, 1 conventional convolution layer and 1 BN layer.
In the first pass, the RGB image features (rgb), the semantic image features (semantic) and the sparse depth map features (depth) are concatenated into Cat_feature, which is fed into the local attention module and the global attention module in parallel and then into the Sigmoid layer, yielding the attention map Att_map1. The fusion result is computed as:

Fusion1 = Cat_feature * Att_map1 + (1 - Att_map1) * depth;

In the second pass, Fusion1 is fed into the local attention module and the global attention module in parallel and then into the Sigmoid layer, yielding Att_map2. The fusion result is computed as:

Fusion2 = Cat_feature * Att_map2 + (1 - Att_map2) * depth.
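For illustration, a minimal PyTorch sketch of the MAFF feature fusion module is given below. The local and global attention layers follow the structures listed above (conv, BN, Leaky ReLU, conv, BN and global pooling, conv, BN, ReLU, conv, BN respectively); combining the two attention outputs by element-wise addition before the Sigmoid, the channel reduction ratio r, and the 1 × 1 reduction of the concatenated features back to the depth-feature channel count (so that the fusion formula is dimensionally consistent) are assumptions of the sketch rather than details stated in this embodiment.

```python
import torch
import torch.nn as nn

class LocalAttention(nn.Module):
    """Local attention: conv, BN, Leaky ReLU, conv, BN (reduction ratio r assumed)."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch // r, 1), nn.BatchNorm2d(ch // r),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(ch // r, ch, 1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return self.body(x)

class GlobalAttention(nn.Module):
    """Global attention: global pooling, conv, BN, ReLU, conv, BN."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // r, 1), nn.BatchNorm2d(ch // r),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch, 1), nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return self.body(x)

class MAFF(nn.Module):
    """Two-pass multi-modal attention feature fusion, following the formulas above."""
    def __init__(self, ch):
        super().__init__()
        # 1x1 reduction of the concatenation back to ch channels (assumed detail).
        self.reduce = nn.Conv2d(3 * ch, ch, 1)
        self.local1, self.global1 = LocalAttention(ch), GlobalAttention(ch)
        self.local2, self.global2 = LocalAttention(ch), GlobalAttention(ch)

    def forward(self, rgb_feat, sem_feat, depth_feat):
        # Cat_feature: concatenated RGB, semantic and sparse-depth features.
        cat_feature = self.reduce(torch.cat([rgb_feat, sem_feat, depth_feat], dim=1))
        # First pass: parallel local/global attention, summed, then Sigmoid -> Att_map1.
        att1 = torch.sigmoid(self.local1(cat_feature) + self.global1(cat_feature))
        fusion1 = cat_feature * att1 + (1 - att1) * depth_feat
        # Second pass on Fusion1 -> Att_map2, then the same fusion rule.
        att2 = torch.sigmoid(self.local2(fusion1) + self.global2(fusion1))
        fusion2 = cat_feature * att2 + (1 - att2) * depth_feat
        return fusion2
```

In the branches above, one fusion module of this form would be instantiated per fusion point, with ch equal to the channel count of the feature maps being fused (32, 64, 128, 256 or 512).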
the sparse image depth completion network of the embodiment is a trained sparse image depth completion network; please refer to fig. 3, the training process includes the following sub-steps:
step 2.1: acquiring a plurality of RGB images, wherein the RGB images generate semantic image data semantic through a semantic segmentation network, and the RGB image data, the semantic image data semantic, sparse depth data depth and real depth labels form a data set;
dividing a data set into a training set, a testing set and a verification set; the training set and the verification set respectively comprise RGB image data, semantic image data semantic, sparse depth data depth and real depth labels; the test set only contains RGB image data and sparse depth data depth;
in this embodiment, a data set used for training the sparse image depth completion network is a KITTI open source data set, a KITTI (https:// www.shapenet.org /) is one of the most authoritative data sets in the current depth completion field, and includes more than 93000 depth maps and radar scanning information and RGB image information corresponding thereto, and also provides corresponding camera parameters for each image, so that depth completion under the guidance of RGB image data can be realized by using RGB image data information of the data set.
The KITTI open-source data set provides RGB image data and corresponding sparse depth maps obtained by projecting 3D lidar points onto the corresponding image frames. The resolution of the RGB image data is 1241 × 376; approximately 5% of the pixels in the sparse depth map are valid (depth value > 0), and 16% of the pixels in the corresponding dense depth labels are valid. The data set does not include corresponding semantic maps and has no label data for semantic segmentation training, so this embodiment uses a trained semantic segmentation model to convert the RGB images into semantic image data.
The data set comprises 86000 training samples, 7000 verification samples and 1000 testing samples, corresponding semantic images are generated through an HRNet network model, and each sample corresponds to 1 piece of RGB image data, 1 piece of sparse depth map data, 1 piece of semantic image data, 1 piece of depth label and camera parameters (5 samples and camera parameters corresponding to the samples are randomly selected from the samples as input during training).
In order to eliminate the influence of abnormal data, this embodiment uses the NumPy library provided by Python to normalize the RGB image data, the semantic image data and the sparse depth data separately, so that the value ranges of the pixel values and depth values of the RGB image data, the semantic image data and the sparse depth data are limited to [0, 1]. To diversify the data, a random cropping operation of the same size is performed on the four kinds of data in each sample (RGB image data, semantic image data, sparse depth data and real depth label data), so that the image sizes are unified to 1216 × 320.
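A minimal NumPy sketch of this preprocessing is given below; normalizing depth by a fixed maximum value (85 m is used here as an assumed constant for the KITTI depth range) and the H × W (× C) array layout are assumptions, since the text only states that the values are scaled into [0, 1] and that a shared random crop of 1216 × 320 is applied.

```python
import numpy as np

def normalize_and_crop(rgb, semantic, sparse_depth, gt_depth,
                       crop_h=320, crop_w=1216, max_depth=85.0):
    """Scale every modality into [0, 1] and apply one shared random crop so the
    four arrays stay pixel-aligned. max_depth = 85 m is an assumed constant."""
    rgb = rgb.astype(np.float32) / 255.0
    semantic = semantic.astype(np.float32) / 255.0
    sparse_depth = np.clip(sparse_depth.astype(np.float32) / max_depth, 0.0, 1.0)
    gt_depth = np.clip(gt_depth.astype(np.float32) / max_depth, 0.0, 1.0)

    h, w = rgb.shape[:2]                       # e.g. 376 x 1241 for KITTI
    top = np.random.randint(0, h - crop_h + 1)
    left = np.random.randint(0, w - crop_w + 1)

    def crop(x):
        return x[top:top + crop_h, left:left + crop_w]

    return crop(rgb), crop(semantic), crop(sparse_depth), crop(gt_depth)
```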
Step 2.2: setting a loss function weight parameter, an optimization mode, a learning rate and a maximum iteration number;
The loss function is L_total:
L_total = L(D) + αL(D_cs) + βL(D_csd);

where L(D) denotes the main loss, L(D_cs) denotes the first depth completion intermediate result loss, and L(D_csd) denotes the second depth completion intermediate result loss; α and β are hyper-parameters, initially set to α = β = 0.2 and decayed to 0 as the number of training rounds increases; P_v denotes the pixels with valid depth values in the real depth label of a training sample, p is a single pixel, gt denotes the real depth label, D_p denotes the prediction result, and ||X|| denotes the two-norm of X;
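The exact expression of L(D) is not reproduced above, so the following sketch assumes that each loss term is the mean squared two-norm error over the valid-pixel set P_v, and it interprets D_cs and D_csd as the intermediate results C-depth and S-depth; both points are assumptions made for illustration.

```python
import torch

def masked_l2(pred, gt):
    """Mean squared two-norm error over the valid-pixel set P_v (gt > 0)."""
    valid = gt > 0
    diff = (pred - gt)[valid]
    return (diff ** 2).mean() if diff.numel() > 0 else pred.sum() * 0.0

def total_loss(final_depth, c_depth, s_depth, gt, alpha=0.2, beta=0.2):
    """L_total = L(D) + alpha * L(D_cs) + beta * L(D_csd), with alpha and beta
    starting at 0.2 and decayed toward 0 by the caller as training progresses;
    D_cs and D_csd are taken here to be C-depth and S-depth (an assumption)."""
    return (masked_l2(final_depth, gt)
            + alpha * masked_l2(c_depth, gt)
            + beta * masked_l2(s_depth, gt))
```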
step 2.3: inputting the RGB image data, semantic image data, sparse depth data depth and real depth label data in the training set into the sparse image depth completion network in sequence for network training, computing the loss between the obtained result and the real depth label, and back-propagating the gradient;
step 2.4: in this embodiment, 100 epochs are set to train the parameters of the network model, and after each epoch is trained, the model is verified on a verification set to calculate the RMSE error. Training is stopped when the RMSE error is not decreasing for 10 consecutive epochs.
In this embodiment, training is first performed for 10 rounds with a learning rate of 0.001, then for 10 rounds with the learning rate changed to 0.0005, then for 10 rounds with 0.0001, and finally for 10 rounds with 0.00001, and the model that performs best on the verification set is saved.
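A sketch of the training procedure of steps 2.3 to 2.5 is given below; it reuses total_loss from the loss sketch above and validate_rmse from the evaluation sketch further below. The Adam optimizer and the batch layout of the data loader are assumptions, since the optimization mode is only referred to above as a parameter to be set.

```python
import torch

def train(model, train_loader, val_loader, device="cuda"):
    """Train for at most 100 epochs, stepping the learning rate through
    0.001, 0.0005, 0.0001 and 0.00001 every 10 epochs, validating the RMSE
    after every epoch and stopping early after 10 epochs without improvement."""
    lr_schedule = {0: 1e-3, 10: 5e-4, 20: 1e-4, 30: 1e-5}
    optimizer = torch.optim.Adam(model.parameters(), lr=lr_schedule[0])
    best_rmse, patience, bad_epochs = float("inf"), 10, 0

    for epoch in range(100):
        if epoch in lr_schedule:
            for group in optimizer.param_groups:
                group["lr"] = lr_schedule[epoch]

        model.train()
        for rgb, semantic, sparse_depth, gt in train_loader:
            rgb, semantic = rgb.to(device), semantic.to(device)
            sparse_depth, gt = sparse_depth.to(device), gt.to(device)
            final_depth, (c_depth, s_depth, _) = model(rgb, semantic, sparse_depth)
            loss = total_loss(final_depth, c_depth, s_depth, gt)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Model selection: keep the weights with the lowest validation RMSE.
        rmse = validate_rmse(model, val_loader, device)
        if rmse < best_rmse:
            best_rmse, bad_epochs = rmse, 0
            torch.save(model.state_dict(), "best_model.pth")
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
```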
Step 2.5: and taking the model with the minimum RMSE error in the verification set as a trained sparse image depth completion network. And testing the model on a test set to evaluate the generalization capability of the model.
The present embodiment evaluates the proposed depth completion model on the KITTI data set. The evaluation indexes are RMSE, MAE, iRMSE and iMAE (RMSE denotes the root mean square error, MAE the mean absolute error, iRMSE the root mean square of the inverse-depth error between the real value and the predicted value, and iMAE the mean absolute value of the inverse-depth error between the real value and the predicted value); the actual depth completion results of the model are also evaluated visually. The test set division adopts the data set division strategy of Hu et al. (86000 groups of data are used for training the model, 7000 groups for validation and 1000 groups for testing). The RMSE of the completion result is 758.549 mm and the MAE is 207.171 mm.
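The four evaluation indexes can be computed as in the following sketch. Reporting RMSE and MAE in millimetres and iRMSE and iMAE in 1/km follows the usual KITTI depth completion convention and is assumed here; validate_rmse is the helper referenced in the training sketch above.

```python
import torch

def depth_metrics(pred_m, gt_m):
    """RMSE/MAE in millimetres and iRMSE/iMAE in 1/km over valid pixels;
    inputs are depth maps in metres."""
    valid = gt_m > 0
    pred, gt = pred_m[valid], gt_m[valid]
    pred = pred.clamp(min=1e-3)                # avoid division by zero below
    err_mm = (pred - gt) * 1000.0
    inv_err_km = 1000.0 / pred - 1000.0 / gt   # inverse-depth error in 1/km
    return {
        "RMSE": torch.sqrt((err_mm ** 2).mean()),
        "MAE": err_mm.abs().mean(),
        "iRMSE": torch.sqrt((inv_err_km ** 2).mean()),
        "iMAE": inv_err_km.abs().mean(),
    }

@torch.no_grad()
def validate_rmse(model, val_loader, device="cuda"):
    """Mean validation RMSE used for model selection in the training sketch."""
    model.eval()
    total, batches = 0.0, 0
    for rgb, semantic, sparse_depth, gt in val_loader:
        pred, _ = model(rgb.to(device), semantic.to(device), sparse_depth.to(device))
        total += depth_metrics(pred.cpu(), gt)["RMSE"].item()
        batches += 1
    return total / max(batches, 1)
```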
In addition, the present embodiment also compares the actual performance of the model on depth completion, as shown in fig. 4 (from left to right: RGB image data, sparse depth data, actual depth labels and the model completion result). Compared with the sparse depth data, the model completion result fills in part of the missing depth values, and the completed depth values are effective and reliable compared with the actual depth labels.
On the basis of fully analyzing the shortcomings of existing depth completion models, the invention changes the network structure, adds semantic image features to guide the completion together with the RGB image, and introduces a multi-modal fusion module to fuse the features of the different modalities so as to guide depth completion more accurately; experiments show that the invention achieves good prediction precision.
It should be understood that the above description of the preferred embodiments is illustrative, and not restrictive, and that various changes and modifications may be made therein by those skilled in the art without departing from the scope of the invention as defined in the appended claims.
Claims (8)
1. A sparse image depth completion method is characterized by comprising the following steps:
step 1: acquiring RGB image data of an image to be processed through a camera, and acquiring sparse depth data depth of the image to be processed through a laser radar; the RGB image data generates semantic image data semantic of an image to be processed through a semantic segmentation network;
step 2: inputting RGB image data, semantic image data semantic and sparse depth data depth into the sparse image depth completion network to obtain a final depth completion result;
the sparse image depth completion network comprises an RGB image feature extraction module, a semantic image feature extraction module, a sparse depth feature extraction module, an MAFF feature fusion module and a weighting fusion module;
the RGB image feature extraction module inputs RGB image data and sparse depth data depth and outputs a depth completion intermediate result C-depth and a confidence weight C-confidence;
the semantic image feature extraction module inputs semantic image data semantic, sparse depth data depth and C-depth and outputs a depth completion intermediate result S-depth and a confidence weight S-confidence;
the sparse depth feature extraction module inputs sparse depth data depth, C-depth and S-depth and outputs a depth completion intermediate result D-depth and a confidence weight D-confidence;
the MAFF feature fusion module is used for fusing semantic image features, RGB image features and sparse depth features in the semantic image feature extraction module and the sparse depth feature extraction module;
and the weighted fusion module is used for performing weighted fusion on the outputs of the RGB image feature extraction module, the semantic image feature extraction module and the sparse depth feature extraction module.
2. The sparse image depth completion method of claim 1, wherein: the RGB image feature extraction module comprises a 0th conventional convolution layer, 1st to 5th residual convolution modules, 1st to 5th deconvolution modules, 6th to 10th residual convolution modules, 6th to 10th deconvolution modules and a 1st conventional convolution layer which are connected in sequence, producing in turn a feature map cd_0 of 1216 × 320 × 32, a feature map cd_1 of 608 × 160 × 64, a feature map cd_2 of 304 × 80 × 128, a feature map cd_3 of 152 × 40 × 256, a feature map cd_4 of 76 × 20 × 512, a feature map cd_5 of 38 × 10 × 1024, a deconvolution feature map d_cd_1 of 76 × 20 × 512, a deconvolution feature map d_cd_2 of 152 × 40 × 256, a deconvolution feature map d_cd_3 of 304 × 80 × 128, a deconvolution feature map d_cd_4 of 608 × 160 × 64, a deconvolution feature map d_cd_5 of 1216 × 320 × 32, a feature map cd_6 of 608 × 160 × 64, a feature map cd_7 of 304 × 80 × 128, a feature map cd_8 of 152 × 40 × 256, a feature map cd_9 of 76 × 20 × 512, a feature map cd_10 of 38 × 10 × 1024, a deconvolution feature map d_cd_6 of 76 × 20 × 512, a deconvolution feature map d_cd_7 of 152 × 40 × 256, a deconvolution feature map d_cd_8 of 304 × 80 × 128, a deconvolution feature map d_cd_9 of 608 × 160 × 64 and a deconvolution feature map d_cd_10 of 1216 × 320 × 32; cd_1 and d_cd_4, cd_2 and d_cd_3, cd_3 and d_cd_2, cd_4 and d_cd_1, and cd_0 and d_cd_5 are added correspondingly to generate a first new feature map, a second new feature map, a third new feature map, a fourth new feature map and a fifth new feature map of unchanged size; the first new feature map and cd_6, the second new feature map and cd_7, the third new feature map and cd_8, and the fourth new feature map and cd_9 are added correspondingly to generate a sixth new feature map, a seventh new feature map, an eighth new feature map and a ninth new feature map of unchanged size; the ninth new feature map and d_cd_6, the eighth new feature map and d_cd_7, the seventh new feature map and d_cd_8, the sixth new feature map and d_cd_9, and the fifth new feature map and d_cd_10 are added correspondingly to generate a tenth new feature map, an eleventh new feature map, a twelfth new feature map, a thirteenth new feature map and a fourteenth new feature map of unchanged size; the first to thirteenth new feature maps are used in sequence as the inputs of the 5th deconvolution module, the 4th deconvolution module, the 3rd deconvolution module, the 2nd deconvolution module, the 6th residual convolution module, the 7th residual convolution module, the 8th residual convolution module, the 9th residual convolution module, the 10th residual convolution module, the 6th deconvolution module, the 7th deconvolution module, the 8th deconvolution module, the 9th deconvolution module and the 10th deconvolution module, which continue to participate in forward propagation; the final first conventional convolution layer outputs a confidence weight C-confidence of size 1216 × 352 × 1 and a depth completion intermediate result C-depth of size 1216 × 352 × 1.
3. The sparse image depth completion method of claim 1, wherein: the semantic image feature extraction module comprises a 0th conventional convolution layer, a 1st fusion module, a 1st residual convolution module, a 2nd fusion module, a 2nd residual convolution module, a 3rd fusion module, a 3rd residual convolution module, a 4th fusion module, a 4th residual convolution module, a 5th fusion module, a 5th residual convolution module, 1st to 5th deconvolution modules, 6th to 10th residual convolution modules, 6th to 10th deconvolution modules and a 1st conventional output layer which are connected in sequence, producing in turn a feature map sd_0 of 1216 × 320 × 32, a feature map F1 of 1216 × 320 × 32, a feature map sd_1 of 608 × 160 × 64, a feature map F2 of 608 × 160 × 64, a feature map sd_2 of 304 × 80 × 128, a feature map F3 of 304 × 80 × 128, a feature map sd_3 of 152 × 40 × 256, a feature map F4 of 152 × 40 × 256, a feature map sd_4 of 76 × 20 × 512, a feature map F5 of 76 × 20 × 512, a feature map sd_5 of 38 × 10 × 1024, a deconvolution feature map d_sd_1 of 76 × 20 × 512, a deconvolution feature map d_sd_2 of 152 × 40 × 256, a deconvolution feature map d_sd_3 of 304 × 80 × 128, a deconvolution feature map d_sd_4 of 608 × 160 × 64, a deconvolution feature map d_sd_5 of 1216 × 320 × 32, a feature map sd_6 of 608 × 160 × 64, a feature map sd_7 of 304 × 80 × 128, a feature map sd_8 of 152 × 40 × 256, a feature map sd_9 of 76 × 20 × 512, a feature map sd_10 of 38 × 10 × 1024, a deconvolution feature map d_sd_6 of 76 × 20 × 512, a deconvolution feature map d_sd_7 of 152 × 40 × 256, a deconvolution feature map d_sd_8 of 304 × 80 × 128, a deconvolution feature map d_sd_9 of 608 × 160 × 64 and a deconvolution feature map d_sd_10 of 1216 × 320 × 32; the fourteenth new feature map and sd_0, the thirteenth new feature map and sd_1, the twelfth new feature map and sd_2, the eleventh new feature map and sd_3, and the tenth new feature map and sd_4, the new feature maps being those generated by the RGB image feature extraction module, are added correspondingly to generate a new feature map A, a new feature map B, a new feature map C, a new feature map D and a new feature map E of unchanged size;
the new feature maps E and d_sd_1, D and d_sd_2, C and d_sd_3, B and d_sd_4, and A and d_sd_5 are added correspondingly to generate a new feature map F, a new feature map G, a new feature map H, a new feature map I and a new feature map J of unchanged size; the new feature maps I and sd_6, H and sd_7, G and sd_8, and F and sd_9 are added correspondingly to generate a new feature map K, a new feature map L, a new feature map M and a new feature map N of unchanged size; the new feature maps N and d_sd_6, M and d_sd_7, L and d_sd_8, K and d_sd_9, and J and d_sd_10 are added correspondingly to generate a new feature map O, a new feature map P, a new feature map Q, a new feature map R and a new feature map S of unchanged size; the new feature map A, the feature map F6, the new feature map B, the feature map F7, the new feature map C, the feature map F8, the new feature map D, the feature map F9, the new feature map E, the feature map F10, the new feature maps F, G, H, I and J, the new feature maps K, L, M and N, the feature map sd_10, and the new feature maps O, P, Q, R and S respectively serve as the inputs of the 1st fusion module, the 1st residual convolution module, the 2nd fusion module, the 2nd residual convolution module, the 3rd fusion module, the 3rd residual convolution module, the 4th fusion module, the 4th residual convolution module, the 5th fusion module, the 5th residual convolution module, the 6th deconvolution module, the 7th deconvolution module, the 8th deconvolution module, the 9th deconvolution module, the 10th deconvolution module, the 7th residual convolution module, the 8th residual convolution module, the 9th residual convolution module, the 10th residual convolution module, the 6th deconvolution module, the 7th deconvolution module, the 8th deconvolution module, the 9th deconvolution module and the 10th deconvolution module, which continue to participate in forward propagation; the final first conventional convolution layer outputs a confidence weight S-confidence of size 1216 × 320 × 1 and a depth completion intermediate result S-depth of size 1216 × 320 × 1.
4. The sparse image depth completion method of claim 1, wherein: the sparse depth feature extraction module comprises a 0th conventional convolution layer, a 1st fusion module, a 1st residual convolution module, a 2nd fusion module, a 3rd residual convolution module, a 4th residual convolution module, a 3rd fusion module, a 5th residual convolution module, a 6th residual convolution module, a 4th fusion module, a 7th residual convolution module, an 8th residual convolution module, a 5th fusion module, a 9th residual convolution module, a 10th residual convolution module, 1st to 5th deconvolution modules, 6th to 10th residual convolution modules, 6th to 10th deconvolution modules and a 1st conventional convolution output layer which are connected in sequence, producing in turn a feature map d_0 of 1216 × 320 × 32, a feature map F6 of 1216 × 320 × 32, a feature map d_1 of 608 × 160 × 64, a feature map d_2 of 608 × 160 × 64, a feature map F7 of 608 × 160 × 64, a feature map d_3 of 304 × 80 × 128, a feature map d_4 of 304 × 80 × 128, a feature map F8 of 304 × 80 × 128, a feature map d_5 of 152 × 40 × 256, a feature map d_6 of 152 × 40 × 256, a feature map F9 of 152 × 40 × 256, a feature map d_7 of 76 × 20 × 512, a feature map d_8 of 76 × 20 × 512, a feature map F10 of 76 × 20 × 512, a feature map d_9 of 38 × 10 × 1024, a feature map d_10 of 38 × 10 × 1024, a deconvolution feature map d_d_1 of 76 × 20 × 512, a deconvolution feature map d_d_2 of 152 × 40 × 256, a deconvolution feature map d_d_3 of 304 × 80 × 128, a deconvolution feature map d_d_4 of 608 × 160 × 64 and a deconvolution feature map d_d_5 of 1216 × 320 × 32; d_0 and the new feature map S, d_2 and the new feature map R, d_4 and the new feature map Q, d_6 and the new feature map P, and d_8 and the new feature map O are added correspondingly to generate a new feature map (1), a new feature map (2), a new feature map (3), a new feature map (4) and a new feature map (5) of unchanged size; the new feature map (5) and d_d_1, the new feature map (4) and d_d_2, the new feature map (3) and d_d_3, the new feature map (2) and d_d_4, and the new feature map (1) and d_d_5 are added correspondingly to generate a new feature map (6), a new feature map (7), a new feature map (8), a new feature map (9) and a new feature map (10) of unchanged size;
d_0, the new feature map (1), d_1, d_2, the new feature map (2), d_3, d_4, the new feature map (3), d_5, d_6, the new feature map (4), d_7, d_8, the new feature map (5), d_9, d_10, the new feature map (6), the new feature map (7), the new feature map (8), the new feature map (9) and the new feature map (10) respectively serve as the inputs of the 6th fusion module, the 1st residual convolution module, the 2nd residual convolution module, the 7th fusion module, the 3rd residual convolution module, the 4th residual convolution module, the 8th fusion module, the 5th residual convolution module, the 6th residual convolution module, the 9th fusion module, the 7th residual convolution module, the 8th residual convolution module, the 10th fusion module, the 9th residual convolution module, the 10th residual convolution module, the 1st deconvolution module, the 2nd deconvolution module, the 3rd deconvolution module and the 4th deconvolution module; the 5th deconvolution module and the 1st conventional convolution layer continue to participate in forward propagation; the final first conventional convolution layer outputs a confidence weight D-confidence of size 1216 × 320 × 1 and a depth completion intermediate result D-depth of size 1216 × 320 × 1.
5. The sparse image depth completion method of claim 1, wherein: the MAFF feature fusion module is composed of, connected in sequence, a concatenation layer, a local attention layer and a global attention layer arranged in parallel, a Sigmoid layer, a second local attention layer and global attention layer arranged in parallel, and a second Sigmoid layer; the local attention layer consists, in order, of 1 conventional convolution layer, 1 BN layer, 1 Leaky ReLU activation function layer, 1 conventional convolution layer and 1 BN layer; the global attention layer consists, in order, of 1 global pooling layer, 1 conventional convolution layer, 1 BN layer, 1 ReLU activation function layer, 1 conventional convolution layer and 1 BN layer.
6. The sparse image depth completion method of any one of claims 1 to 5, wherein: the sparse image depth completion network in the step 2 is a trained sparse image depth completion network; the training process comprises the following substeps:
step 2.1: acquiring a plurality of RGB images, wherein the RGB images generate semantic image data semantic through a semantic segmentation network, and the RGB image data, the semantic image data semantic, the sparse depth data depth and a real depth label form a data set;
dividing the data set into a training set, a testing set and a verification set; the training set and the verification set respectively comprise RGB image data, semantic image data semantic, sparse depth data depth and real depth labels; the test set only comprises RGB image data and sparse depth data depth;
step 2.2: setting a loss function weight parameter, an optimization mode, a learning rate and a maximum iteration number;
the loss function is L_total:
L_total = L(D) + αL(D_cs) + βL(D_csd);

where L(D) denotes the main loss, L(D_cs) denotes the first depth completion intermediate result loss, and L(D_csd) denotes the second depth completion intermediate result loss; α and β are hyper-parameters, initially set to α = β = 0.2 and decayed to 0 as the number of training rounds increases; P_v denotes the pixels with valid depth values in the real depth label of a training sample, p is a single pixel, gt denotes the real depth label, D_p denotes the prediction result, and ||X|| denotes the two-norm of X;
step 2.3: inputting the RGB image data, semantic image data, sparse depth data depth and real depth label data in the training set into the sparse image depth completion network in sequence for network training, computing the loss between the obtained result and the real depth label, and back-propagating the gradient;
step 2.4: setting T epochs to train network parameters, verifying the network on a verification set after each epoch is trained, calculating RMSE errors, and stopping training when the RMSE errors are not reduced within R epochs continuously; wherein T, R is a preset value;
step 2.5: and taking the network with the minimum RMSE error in the verification set as a trained sparse image deep completion network.
7. A sparse image depth completion system is characterized in that: the system comprises an information acquisition module and a depth completion module:
the information acquisition module is used for acquiring RGB image data of the image to be processed through the camera and acquiring sparse depth data depth of the image to be processed through the laser radar; the RGB image data generates semantic image data semantic of an image to be processed through a semantic segmentation network;
the depth completion module is used for inputting the RGB image data, the semantic image data semantic and the sparse depth data depth into the sparse image depth completion network to obtain a final depth completion result;
the sparse image depth completion network comprises an RGB image feature extraction module, a semantic image feature extraction module, a sparse depth feature extraction module, an MAFF feature fusion module and a weighting fusion module;
the RGB image feature extraction module inputs RGB image data and sparse depth data depth and outputs a depth completion intermediate result C-depth and a confidence weight C-confidence;
the semantic image feature extraction module inputs semantic image data semantic, sparse depth data depth and C-depth and outputs a depth completion intermediate result S-depth and a confidence weight S-confidence;
the sparse depth feature extraction module inputs sparse depth data depth, C-depth and S-depth and outputs a depth completion intermediate result D-depth and a confidence weight D-confidence;
the MAFF feature fusion module is used for fusing semantic image features, RGB image features and sparse depth features in the semantic image feature extraction module and the sparse depth feature extraction module;
and the weighted fusion module is used for performing weighted fusion on the outputs of the RGB image feature extraction module, the semantic image feature extraction module and the sparse depth feature extraction module.
8. A sparse image depth complementing apparatus, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the sparse image depth complementing method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210848854.XA CN115330850A (en) | 2022-07-19 | 2022-07-19 | Sparse image depth completion method, system and equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210848854.XA CN115330850A (en) | 2022-07-19 | 2022-07-19 | Sparse image depth completion method, system and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115330850A true CN115330850A (en) | 2022-11-11 |
Family
ID=83917831
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210848854.XA Pending CN115330850A (en) | 2022-07-19 | 2022-07-19 | Sparse image depth completion method, system and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115330850A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |