CN117557887A - Sparse depth completion training method and related equipment - Google Patents

Sparse depth completion training method and related equipment

Info

Publication number
CN117557887A
CN117557887A (application CN202311544514.9A)
Authority
CN
China
Prior art keywords
processed
depth map
model
image sample
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311544514.9A
Other languages
Chinese (zh)
Inventor
郭佳伟
周鸿钧
丁宁
张爱东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese University of Hong Kong Shenzhen
Shenzhen Institute of Artificial Intelligence and Robotics
Original Assignee
Chinese University of Hong Kong Shenzhen
Shenzhen Institute of Artificial Intelligence and Robotics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese University of Hong Kong Shenzhen, Shenzhen Institute of Artificial Intelligence and Robotics filed Critical Chinese University of Hong Kong Shenzhen
Priority to CN202311544514.9A priority Critical patent/CN117557887A/en
Publication of CN117557887A publication Critical patent/CN117557887A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/096 Transfer learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The embodiment of the application discloses a sparse depth completion training method, image processing equipment and a computer readable storage medium, which are used for performing sparse depth completion training while improving the accuracy of the learned scene global features. The method comprises the following steps: obtaining an image sample to be processed and a corresponding sparse depth map sample; obtaining a supervision dense depth map corresponding to the image sample to be processed; obtaining near features and far features of the sparse depth map sample with a student model and fusing them to obtain a fused feature map; determining a weight for each channel in the fused feature map to obtain a first feature map after channel attention enhancement; determining a weight for each region to obtain a predicted dense depth map after spatial attention enhancement; and obtaining the student model with the first training completed when the loss between the predicted dense depth map and the supervision dense depth map meets the convergence condition.

Description

Sparse depth completion training method and related equipment
Technical Field
The embodiment of the application relates to the field of sparse depth completion, in particular to a sparse depth completion training method, image processing equipment and a computer readable storage medium.
Background
Scene depth perception is widely used in many fields, such as robotics, autonomous driving, and augmented reality. Conventional scene depth acquisition relies on various sensors such as radar, stereo cameras, and RGB-D cameras. However, these devices suffer from high cost, high power consumption, limited measurement range, and similar drawbacks. In recent years, encoder-decoder depth estimation techniques based on deep learning have developed rapidly, and depth completion methods based on sparse depth have been intensively studied by many researchers because of their high feasibility.
The existing sparse depth completion training method comprises: obtaining an image sample to be processed and a sparse depth map sample corresponding to the image sample to be processed; inputting them into a teacher model to obtain a dense depth map of the image sample to be processed output by the teacher model; inputting the image sample to be processed and the corresponding sparse depth map sample into a student model to obtain a predicted dense depth map output by the student model; and obtaining a training-completed student model when the loss between the predicted dense depth map and the dense depth map output by the teacher model meets the convergence condition.
However, the sparse depth map sample is sparse, and the existing sparse depth completion training method supervises model training only by means of ground-truth sparse depth values, multi-view geometric constraints, and knowledge distillation based on a depth completion model, so its ability to extract scene global features is weak and the accuracy of the learned scene global features is low.
Disclosure of Invention
The embodiment of the application provides a sparse depth completion training method, image processing equipment and a computer readable storage medium, which are used for performing sparse depth completion training while improving the accuracy of the learned scene global features.
In a first aspect, an embodiment of the present application provides a sparse depth completion training method, including:
obtaining an image sample to be processed and a sparse depth map sample corresponding to the image sample to be processed;
obtaining a supervision dense depth map corresponding to the image sample to be processed; wherein the supervision dense depth map comprises depth associations between scene areas and global features of the scene areas;
inputting the image sample to be processed and the sparse depth map sample corresponding to the image sample to be processed into a student model; obtaining, by the student model, near features and far features of the sparse depth map sample; fusing the near features and the far features of the sparse depth map sample to obtain a fused feature map; determining the importance degree of each channel in the fused feature map; determining the weight corresponding to each channel based on the importance degree of each channel; determining a first feature map with enhanced channel attention based on the weight corresponding to each channel; determining the spatial relevance between regions in the first feature map; determining the weight corresponding to each region based on the spatial relevance between regions; determining a second feature map with enhanced spatial attention based on the weight corresponding to each region; and obtaining a predicted dense depth map corresponding to the image sample to be processed output by the student model; wherein the second feature map is the predicted dense depth map;
and when the loss between the predicted dense depth map and the supervision dense depth map meets a convergence condition, obtaining a student model with the first training completed.
In a second aspect, an embodiment of the present application provides a sparse depth completion method, including:
obtaining an image to be processed and a sparse depth map corresponding to the image to be processed;
inputting the to-be-processed image and a sparse depth map corresponding to the to-be-processed image into a pre-trained student model, obtaining near features and far features of the sparse depth map by the student model, fusing the near features and the far features of the sparse depth map to obtain a fused feature map, determining the importance degree of each channel in the fused feature map, determining the weight corresponding to each channel based on the importance degree of each channel, determining a first feature map with enhanced channel attention based on the weight corresponding to each channel, determining the spatial relevance between each region in the first feature map, determining the weight corresponding to each region based on the spatial relevance between each region, determining a second feature map with enhanced spatial attention based on the weight corresponding to each region, and obtaining a dense depth map corresponding to the to-be-processed image output by the student model; wherein the second feature map is the dense depth map.
In a third aspect, an embodiment of the present application provides an image processing apparatus, including:
the device comprises a central processing unit, a memory, an input/output interface, a wired or wireless network interface and a power supply;
the memory is a short-term memory or a persistent memory;
the central processor is configured to communicate with the memory and execute instruction operations in the memory to perform the foregoing sparse depth completion training method or sparse depth completion method.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the foregoing sparse depth completion training method or sparse depth completion method.
In a fifth aspect, embodiments of the present application provide a computer program product comprising instructions that, when run on a computer, cause the computer to perform the foregoing sparse depth completion training method or sparse depth completion method.
From the above technical solutions, the embodiments of the present application have the following advantages: near features and far features of a sparse depth map sample are obtained by a student model and fused to obtain a fused feature map; a weight corresponding to each channel in the fused feature map is determined to obtain a first feature map after channel attention enhancement; a weight corresponding to each region is determined to obtain a predicted dense depth map after spatial attention enhancement; and the student model with the first training completed is obtained when the loss between the predicted dense depth map and the supervision dense depth map meets the convergence condition. The extraction capability for scene global features is strong, so the accuracy of the learned scene global features is high.
Drawings
Fig. 1 is a schematic flow chart of a sparse depth completion training method disclosed in an embodiment of the present application;
FIG. 2 is a schematic diagram of depth accuracy comparison according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram for improving knowledge distillation accuracy by introducing a stereoscopic model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an attention mechanism module according to an embodiment of the present disclosure;
FIG. 5 is a flow chart of another sparse depth completion training method disclosed in embodiments of the present application;
FIG. 6 is a graph comparing the overall effect of an embodiment incorporating a stereoscopic model with the overall effect of an existing embodiment as disclosed in the embodiments of the present application;
FIG. 7 is a graph comparing the overall effect of another embodiment incorporating a stereoscopic model disclosed in the embodiments of the present application with the overall effect of the prior embodiment;
FIG. 8 is a graph comparing the overall effect of yet another embodiment incorporating a stereoscopic model disclosed in the embodiments of the present application with the overall effect of the prior embodiments;
FIG. 9 is a graph comparing the overall effect of yet another embodiment incorporating a stereoscopic model disclosed in the embodiments of the present application with the overall effect of the prior embodiments;
FIG. 10 is a graph comparing the overall effect of an embodiment incorporating a sparse depth based attention mechanism with the overall effect of an existing embodiment, as disclosed in the embodiments of the present application;
FIG. 11 is a graph comparing the overall effect of another embodiment of the disclosure that incorporates sparse depth based attention mechanisms with the overall effect of existing embodiments;
FIG. 12 is a graph comparing the overall effect of yet another embodiment of the disclosure that incorporates sparse depth based attention mechanisms with the overall effect of existing embodiments;
fig. 13 is a flow chart of a depth completion method of sparse depth according to an embodiment of the present disclosure;
fig. 14 is a schematic structural diagram of an image processing apparatus disclosed in an embodiment of the present application.
Detailed Description
The embodiment of the application provides a sparse depth completion training method, image processing equipment and a computer readable storage medium, which are used for performing sparse depth completion training while improving the accuracy of the learned scene global features.
Referring to fig. 1, fig. 1 is a flow chart of a sparse depth completion training method disclosed in an embodiment of the present application, where the method includes:
101. Obtaining an image sample to be processed and a sparse depth map sample corresponding to the image sample to be processed.
In this embodiment, when performing sparse depth completion training, the image sample to be processed and the sparse depth map sample corresponding to the image sample to be processed may be obtained.
102. Obtaining a supervision dense depth map corresponding to an image sample to be processed; wherein the supervised dense depth map includes depth associations between the scene regions, and global features of the scene regions.
After the image sample to be processed is obtained, a supervision dense depth map corresponding to the image sample to be processed can be obtained; wherein the supervised dense depth map includes depth associations between the scene regions, and global features of the scene regions.
The method for obtaining the supervision dense depth map corresponding to the image sample to be processed may be as follows: the image sample to be processed and the sparse depth map corresponding to the image sample to be processed are sequentially input into N pre-trained teacher models, and the N teacher models respectively output dense depth maps corresponding to the image sample to be processed; then, for the dense depth map output by each teacher model, the image sample to be processed and the dense depth map corresponding to that teacher model are input into a new view angle image generation module to obtain a dense depth map of the target view angle corresponding to that teacher model output by the new view angle image generation module; the dense depth map of the target view angle corresponding to the teacher model is then compared in similarity with the ground-truth map of the target view angle to obtain an error feature map corresponding to the teacher model; finally, the supervision dense depth map corresponding to the image sample to be processed is determined based on the error feature maps corresponding to the teacher models. It will be appreciated that other methods than the above may also be used to obtain the supervision dense depth map corresponding to the image sample to be processed, which is not limited herein.
103. Inputting the image sample to be processed and the sparse depth map sample corresponding to the image sample to be processed into a student model; obtaining, by the student model, near features and far features of the sparse depth map sample; fusing the near features and the far features of the sparse depth map sample to obtain a fused feature map; determining the importance degree of each channel in the fused feature map; determining the weight corresponding to each channel based on the importance degree of each channel; determining a first feature map with enhanced channel attention based on the weight corresponding to each channel; determining the spatial relevance between regions in the first feature map; determining the weight corresponding to each region based on the spatial relevance between regions; determining a second feature map with enhanced spatial attention based on the weight corresponding to each region; and obtaining a predicted dense depth map corresponding to the image sample to be processed output by the student model; wherein the second feature map is the predicted dense depth map.
After the supervision dense depth map corresponding to the image sample to be processed is obtained, the image sample to be processed and the sparse depth map sample corresponding to the image sample to be processed can be input into a student model; the student model obtains near features and far features of the sparse depth map sample, fuses them to obtain a fused feature map, determines the importance degree of each channel in the fused feature map, determines the weight corresponding to each channel based on the importance degree of each channel, determines a first feature map with enhanced channel attention based on the weight corresponding to each channel, determines the spatial relevance between regions in the first feature map, determines the weight corresponding to each region based on that spatial relevance, and determines a second feature map with enhanced spatial attention based on the weight corresponding to each region; the predicted dense depth map corresponding to the image sample to be processed output by the student model is thereby obtained, wherein the second feature map is the predicted dense depth map.
104. When the loss between the predicted dense depth map and the supervision dense depth map meets the convergence condition, obtaining a student model with the first training completed.
After obtaining the predicted dense depth map corresponding to the image sample to be processed output by the student model, when the loss between the predicted dense depth map and the supervised dense depth map meets the convergence condition, the student model with the first training completed can be obtained.
In the embodiment of the present application, near features and far features of a sparse depth map sample can be obtained by a student model and fused to obtain a fused feature map; a first feature map after channel attention enhancement is determined from the weight corresponding to each channel in the fused feature map, and a predicted dense depth map after spatial attention enhancement is determined from the weight corresponding to each region; when the loss between the predicted dense depth map and the supervision dense depth map meets the convergence condition, a student model with the first training completed is obtained. The extraction capability for scene global features is strong, so the accuracy of the learned scene global features is high.
In the embodiment of the present application, there may be multiple methods for obtaining the supervision dense depth map corresponding to the image sample to be processed. Based on the sparse depth completion training method shown in fig. 1, one of these methods is described below.
In this embodiment, when performing sparse depth completion training, the image sample to be processed and the sparse depth map sample corresponding to the image sample to be processed may be obtained.
Specifically, the sparse depth map refers to an image in which only some pixels have depth values and other pixels have no depth values.
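For illustration only (not part of the claimed method), a sparse depth map can be represented as an array in which only a few entries carry depth values, as in the following Python (NumPy) sketch with made-up values:

```python
import numpy as np

# Hypothetical illustration: a 4x6 depth image where only a few pixels carry
# measured depth values (in meters); all other pixels are 0, i.e. no depth.
sparse_depth = np.zeros((4, 6), dtype=np.float32)
sparse_depth[1, 2] = 3.5   # a measured point 3.5 m away
sparse_depth[2, 4] = 12.0  # a measured point 12.0 m away

valid_mask = sparse_depth > 0  # pixels that have depth values
print(valid_mask.sum(), "of", sparse_depth.size, "pixels have depth")
```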
After the image sample to be processed is obtained, a supervision dense depth map corresponding to the image sample to be processed can be obtained; wherein the supervised dense depth map includes depth associations between the scene regions, and global features of the scene regions.
Specifically, the supervised dense depth map is the supervision information in deep learning, and is the label or annotation information used for guiding the training of the student model.
The method for obtaining the supervision dense depth map corresponding to the image sample to be processed may be as follows: firstly, the image sample to be processed and the sparse depth map corresponding to the image sample to be processed are sequentially input into N pre-trained teacher models, and the N teacher models respectively output dense depth maps corresponding to the image sample to be processed; then, for the dense depth map output by each teacher model, the image sample to be processed and the dense depth map corresponding to that teacher model are input into a new view angle image generation module to obtain a dense depth map of the target view angle corresponding to that teacher model output by the new view angle image generation module; next, the dense depth map of the target view angle corresponding to the teacher model is compared in similarity with the ground-truth map of the target view angle to obtain an error feature map corresponding to the teacher model; finally, the supervision dense depth map corresponding to the image sample to be processed is determined based on the error feature maps corresponding to the teacher models, wherein N is an integer greater than or equal to 2.
The method for determining the supervision dense depth map corresponding to the image sample to be processed based on the error feature map corresponding to each teacher model may be that the weight corresponding to each pixel point of each teacher model is determined based on the error feature map corresponding to each teacher model, and then the supervision dense depth map corresponding to the image sample to be processed is determined based on the weight corresponding to each pixel point of each teacher model.
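For illustration only, one way such a per-pixel weighting could be realized is sketched below in Python (PyTorch); the softmax over negative error values is an assumed weighting scheme (smaller error, larger weight), not one prescribed by the text.

```python
import torch

def fuse_teacher_depths(teacher_depths, teacher_errors):
    """Sketch: per-pixel weighted fusion of N teacher dense depth maps.

    teacher_depths: tensor of shape (N, H, W), dense depths from N teachers
    teacher_errors: tensor of shape (N, H, W), per-pixel error maps
    """
    weights = torch.softmax(-teacher_errors, dim=0)           # (N, H, W), assumed scheme
    supervised_dense = (weights * teacher_depths).sum(dim=0)  # (H, W) supervision dense depth map
    return supervised_dense
```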
Specifically, N pre-trained teacher models can be obtained before training. The image sample to be processed I_i and the sparse depth map sample corresponding to I_i are sequentially input into the N pre-trained teacher models, and the N teacher models respectively output dense depth maps {d_h^k}, k ∈ N, corresponding to I_i. The depth values of the dense depth maps output by the teacher models differ, and so does their precision. In order to screen out the highest-precision depth value of each pixel from the dense depth maps output by the teacher models, the new view angle image generation function (warping function f_ω) of the new view angle image generation module can be applied to the image sample I_i, as shown in formula one below:

I_t'^k = f_ω(I_i, d_h^k)    (formula one)

Specifically, based on the image sample to be processed I_i and the dense depth maps {d_h^k} output by the teacher models, this function generates new view angle images {I_t'^k} using the new view position information (acquired by a Pose Network). Subsequently, each new view angle image is compared in image similarity with the ground-truth image of that view angle to generate an error map E_k. Through adaptive ensemble judgment (Adaptive Ensemble) of the magnitudes of the error map values, the depth error judgment of the dense depth maps output by the teacher models is completed, and the optimal supervision information d_d is distilled and screened out.
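For illustration only, the following Python (PyTorch) sketch outlines the warping, error-map and adaptive-ensemble steps just described; the photometric error metric and the per-pixel minimum-error selection are assumptions about how the adaptive ensemble judgment could be implemented, and warp_fn stands in for the warping function f_ω.

```python
import torch

def adaptive_ensemble(image_i, gt_target_view, teacher_depths, warp_fn):
    """Sketch of warping, error maps, and adaptive ensemble (assumptions noted).

    image_i:        (3, H, W) image sample to be processed I_i
    gt_target_view: (3, H, W) ground-truth image at the target view angle
    teacher_depths: list of (H, W) dense depth maps {d_h^k} from the N teachers
    warp_fn:        warping function f_omega (pose and intrinsics assumed inside)
    """
    error_maps = []
    for d_hk in teacher_depths:
        i_t_k = warp_fn(image_i, d_hk)                    # formula one: I_t'^k = f_omega(I_i, d_h^k)
        e_k = (i_t_k - gt_target_view).abs().mean(dim=0)  # per-pixel error map E_k (photometric, assumed)
        error_maps.append(e_k)
    errors = torch.stack(error_maps)                      # (N, H, W)
    depths = torch.stack(list(teacher_depths))            # (N, H, W)
    best = errors.argmin(dim=0, keepdim=True)             # lowest-error teacher per pixel
    d_d = depths.gather(0, best).squeeze(0)               # distilled supervision d_d
    return d_d
```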
Before determining the supervision dense depth map corresponding to the image sample to be processed based on the error feature maps corresponding to the teacher models, the image sample to be processed and a stereo matching image sample that is stereo-matched with the image sample to be processed may also be input into a stereoscopic vision model; the stereoscopic vision model determines parallax information between the image sample to be processed and the corresponding stereo matching image sample, determines a stereo dense depth map corresponding to the image sample to be processed based on the parallax, and outputs the stereo dense depth map corresponding to the image sample to be processed. Then the image sample to be processed and the corresponding stereo dense depth map are input into the new view angle image generation module to obtain a stereo dense depth map of the target view angle output by the new view angle image generation module. Finally, the stereo dense depth map of the target view angle corresponding to the student model is compared in similarity with the ground-truth map corresponding to the target view angle to obtain an error feature map corresponding to the student model. In this case, the method for determining the supervision dense depth map corresponding to the image sample to be processed may be to determine it based on the error feature maps corresponding to the teacher models and the error feature map corresponding to the student model.
Specifically, the student model of the embodiment of the present application may introduce a stereoscopic vision model with higher precision to improve the precision of the dense depth map output by the student model. Referring to fig. 2, fig. 2 is a schematic diagram of depth accuracy comparison disclosed in the embodiment of the present application. As can be seen from fig. 2, the upper image is the detected original image (such as an image sample to be processed), the middle image is the output of the depth completion model (such as a dense depth map corresponding to the image sample to be processed output by a teacher model), and the lower image is the output of the stereoscopic vision model (such as a dense depth map corresponding to the image sample to be processed output by a student model incorporating the stereoscopic vision model). Specifically, before training, the supervision information loss may be optimized; for example, a stereoscopic vision model (Stereo Network) with higher precision may be introduced to improve the precision of the depth supervision information based on the depth completion teacher models. Referring to fig. 3, fig. 3 is a schematic diagram of improving knowledge distillation accuracy by introducing a stereoscopic vision model according to an embodiment of the present application. As can be seen from fig. 3, the image sample to be processed I_i and the stereo matching image sample I_r that is stereo-matched with the image sample to be processed may be input into the stereoscopic vision model to generate a higher-precision depth value (stereo dense depth map) d_s corresponding to I_i; a corresponding error map (the error feature map corresponding to the student model) E_s is then likewise generated through the warping function f_ω. This error map is adaptively ensembled and compared with the error maps E_k of the depth completion teacher models, so that depth supervision information d_d with higher accuracy is distilled and screened out.
It is to be understood that the parallax information refers to the amount of horizontal pixel displacement between corresponding points in the two images, i.e. the image sample to be processed and its corresponding stereo matching image sample, in stereo vision. For example, for a pair of left and right images (for example, the image sample to be processed is the left image and the corresponding stereo matching image sample is the right image), if a pixel point in the left image is shifted by 10 pixels to the right relative to the position of the corresponding pixel point in the right image, then the parallax between the two pixel points is 10. The parallax information may be used to calculate a depth value for each pixel point, thereby generating a dense depth map. The stereoscopic vision model can be optimized based on accurate stereo matching and aggregation of the matching error E(D), as shown in formula two below:

E(D) = Σ_x C(x, d_x)    (formula two)

From formula two, a depth value with higher accuracy can be estimated, where x = (u, v) represents the left-image pixel coordinate in stereo matching, d_x represents the disparity value at pixel coordinate x, and C(x, d_x) represents the matching error of pixel coordinate x in stereo matching.
It is worth mentioning that the stereoscopic model can find the corresponding points between the left and right images by matching the left and right images, and can extract parallax information more accurately, so as to generate a more accurate depth map.
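For illustration only, the standard stereo geometry relation depth = focal length × baseline / disparity (a general fact, not spelled out in the text above) can be sketched in Python (NumPy) as follows:

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, eps=1e-6):
    """Standard stereo geometry: depth = f * B / disparity.

    disparity:  (H, W) horizontal pixel displacement between matched points
    focal_px:   camera focal length in pixels (assumed known)
    baseline_m: distance between the left and right cameras in meters (assumed known)
    """
    return focal_px * baseline_m / np.maximum(disparity, eps)

# e.g. a 10-pixel disparity with f = 700 px and B = 0.54 m gives ~37.8 m depth
print(disparity_to_depth(np.array([[10.0]]), 700.0, 0.54))
```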
The method for determining the supervision dense depth map corresponding to the image sample to be processed based on the error feature maps corresponding to the teacher models and the error feature map corresponding to the student model may be as follows: the weight corresponding to each pixel point of each teacher model and of the student model is determined based on the error feature map corresponding to each teacher model and the error feature map corresponding to the student model, and then the supervision dense depth map corresponding to the image sample to be processed is determined based on the weight corresponding to each pixel point of each teacher model and of the student model.
Specifically, it is assumed that a picture is required to be subjected to depth estimation, and a teacher model and a student model output their depth estimation results, respectively. By comparing the output results of the teacher model and the student model, an error feature map therebetween can be calculated. Then, according to the error feature map, the weight of each model on each pixel point, namely the importance of the teacher model and the student model on each pixel point, can be determined. After the weights for each model on each pixel point are determined, these weights may be used to generate a supervised dense depth map. Specifically, for each pixel, the final depth estimation result of the pixel may be calculated according to the weights of the teacher model and the student model on the pixel, and the depth estimation results of the teacher model and the student model on the pixel. These final depth estimation results constitute a supervised dense depth map for the image samples to be processed.
The method for determining the supervision dense depth map corresponding to the image sample to be processed based on the error feature maps corresponding to the teacher models and the error feature map corresponding to the student model may also be as follows: determine, based on the error feature map corresponding to each teacher model, the depth value error corresponding to each pixel point of that teacher model, and determine the depth value error corresponding to each pixel point of the student model; then, for each pixel point, if the depth value error of the student model at that pixel point is greater than or equal to the depth value error of the target teacher model at that pixel point, take the depth value of the target teacher model at that pixel point as the target depth value of that pixel point, and if the depth value error of the student model at that pixel point is smaller than the depth value error of the target teacher model at that pixel point, take the depth value of the student model at that pixel point as the target depth value of that pixel point; finally, obtain the supervision dense depth map corresponding to the image sample to be processed based on the target depth value corresponding to each pixel point.
Specifically, a depth value with higher accuracy can be learned by adaptively ensembling and comparing the depth value errors of the depth completion model (teacher model) and the stereoscopic vision model (the student model incorporating the stereoscopic vision model), as shown in formula three below:

d_d^{u,v} = d_s^{u,v}, if E_s^{u,v} < E_d^{u,v}    (formula three)

where E_d^{u,v} is the depth value error corresponding to each pixel point of the depth completion model (teacher model), E_s^{u,v} is the depth value error corresponding to each pixel point of the stereoscopic vision model, and d_d^{u,v} and d_s^{u,v} are the target depth values of each pixel point in the distilled supervision depth result map and the stereoscopic vision model result map, respectively. When E_d^{u,v} is greater than E_s^{u,v}, d_s^{u,v} is assigned to d_d^{u,v}.
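For illustration only, the per-pixel selection of formula three can be sketched in Python (PyTorch) as follows:

```python
import torch

def distill_supervision(d_teacher, d_stereo, err_teacher, err_stereo):
    """Sketch of formula three: keep the stereo depth wherever its error is smaller.

    d_teacher, d_stereo:     (H, W) dense depth maps
    err_teacher, err_stereo: (H, W) per-pixel depth value errors E_d and E_s
    """
    return torch.where(err_stereo < err_teacher, d_stereo, d_teacher)
```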
After the supervision dense depth map corresponding to the image sample to be processed is obtained, the image sample to be processed and the sparse depth map sample corresponding to the image sample to be processed can be input into a student model; the student model obtains near features and far features of the sparse depth map sample, fuses them to obtain a fused feature map, determines the importance degree of each channel in the fused feature map, determines the weight corresponding to each channel based on the importance degree of each channel, determines a first feature map with enhanced channel attention based on the weight corresponding to each channel, determines the spatial relevance between regions in the first feature map, determines the weight corresponding to each region based on that spatial relevance, and determines a second feature map with enhanced spatial attention based on the weight corresponding to each region; the predicted dense depth map corresponding to the image sample to be processed output by the student model is thereby obtained, wherein the second feature map is the predicted dense depth map.
Specifically, in order to obtain a dense scene depth map from a single image (the image sample to be processed) and the sparse depth map corresponding to that single image (the sparse depth map sample corresponding to the image sample to be processed), the embodiment of the present application provides a depth completion function f_θ, which can be expressed as formula four:

d_i = f_θ(I_i, z_i, K_i)    (formula four)

As shown in formula four, the image sample to be processed I_i, the sparse depth map z_i corresponding to the image sample to be processed, and the camera intrinsics K_i corresponding to the image sample to be processed can be input, and the predicted dense depth map d_i corresponding to the image sample to be processed is output.
It should be understood that the embodiments of the present application may meet various needs of users, and two examples thereof are described below.
A. If the user needs to obtain an RGB image of the image to be processed output by the student model, then, referring to formula four, the camera intrinsics K_i corresponding to the image sample to be processed are also used as an input of the student model; the image sample to be processed I_i, the sparse depth map z_i corresponding to the image sample to be processed, and K_i are input into the student model together to obtain a predicted RGB image of the image to be processed output by the student model, where the predicted RGB image is an image that combines the predicted dense depth map d_i with the information carried by the intrinsics K_i corresponding to the image sample to be processed (information such as the image size and image color of the image sample to be processed). When the loss between the predicted RGB image and the supervision RGB image meets the convergence condition, a student model with the first training completed is obtained, where the supervision RGB image is an image that combines the supervision dense depth map with the information carried by the intrinsics K_i corresponding to the image sample to be processed (information such as the image size and image color of the image sample to be processed). It is understood that a sparse depth map refers to a depth map in which only some pixels have depth values and other pixels have no depth values. A dense depth map is a depth map obtained by performing depth completion on the sparse depth map, in which almost all pixels have depth values. Both sparse and dense depth maps are grayscale images. Relative to a sparse depth map, a dense depth map provides more information to describe the geometry and structure of a scene. RGB information refers to the color information of each pixel in an image, where R represents the luminance value of the red channel, G represents the luminance value of the green channel, and B represents the luminance value of the blue channel. In the RGB color space, each pixel consists of the luminance values of these three channels, and different colors can be obtained by different combinations of luminance values. For example, red may be represented as (255, 0, 0), where the luminance value of the R channel is 255 and the luminance values of the G and B channels are 0.
B. If the user does not need to obtain an RGB image of the image to be processed output by the student model, the camera intrinsics K_i corresponding to the image sample to be processed in formula four can be removed, and only the image sample to be processed I_i and the sparse depth map z_i corresponding to the image sample to be processed are input into the student model to obtain the predicted dense depth map d_i corresponding to the image sample to be processed output by the student model. A student model with the first training completed is obtained when the loss between the predicted dense depth map and the supervision dense depth map satisfies the convergence condition.
Specifically, the student model of the embodiment of the present application introduces an attention mechanism module based on sparse depth values and embeds it into the encoder of the depth completion model to realize extraction of scene global features. Referring to fig. 4, fig. 4 is a schematic diagram of the attention mechanism module disclosed in the embodiment of the present application. As can be seen from fig. 4, the attention mechanism module includes a minimum pooling layer, a maximum pooling layer, a fusion module, a channel attention module and a spatial attention module. The main idea of the attention mechanism module is that, when the features of the sparse depth map F are extracted, the near features in the sparse depth map are extracted through the minimum pooling layer, and the far features in the sparse depth map are extracted through the maximum pooling layer. These near and far features are then fused together by the fusion module through a 1 × 1 convolution layer. Next, a channel attention function A_c is introduced to extract the more identifiable features in each channel and give them greater weight, generating a feature-enhanced feature map F′ by multiplication. Subsequently, a spatial attention function A_s is introduced to learn the features with greater spatial relevance in each region and give them greater weight, generating a feature-enhanced feature map F″ by multiplication (weighted summation). Finally, extraction of the scene global features is achieved to obtain the predicted dense depth map, as shown in formulas five and six below:

F′ = A_c(F) · F    (formula five)

F″ = A_s(F′) · F′    (formula six)
It can be understood that, besides the method of implementing extraction of the global feature of the scene by the channel attention function and the spatial attention function, the extraction can be implemented by other functions corresponding to reasonable attention mechanisms, which is not limited herein.
It should be noted that the working principle of the channel attention module is to learn the importance degree of each channel, so as to assign different weight coefficients to each channel, so as to strengthen important features and inhibit non-important features. In this way, the channel attention module can adaptively learn the importance of each channel, thereby improving the expressive power of the features. The spatial attention module may enhance features with spatial relevance by capturing spatial relationships between different regions in the feature map. For example, in an image classification task, if an object in a picture appears in the upper left corner, then the region adjacent to the object (e.g., lower left corner, upper right corner) is likely to also contain features associated with the object. Thus, the spatial attention module may give greater weight to these neighboring regions to enhance these features with spatial relevance. Conversely, for those areas remote from the object, the spatial attention module will be given less weight, as these areas are likely to be independent of the object. Through the combination of the two modules, the extraction of the global features of the scene can be realized.
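For illustration only, the following PyTorch sketch mirrors the attention mechanism module described above; the pooling window, the channel-reduction ratio, and the 7 × 7 spatial convolution are assumptions, since the text does not specify layer sizes.

```python
import torch
import torch.nn as nn

class SparseDepthAttention(nn.Module):
    """Sketch of the described attention mechanism module (layer sizes are assumptions).

    Near features via min pooling, far features via max pooling, fused by a 1x1
    convolution, then channel attention A_c and spatial attention A_s re-weight
    the fused feature map (formulas five and six).
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)  # 1x1 fusion
        # Channel attention A_c: global pooling + small MLP -> per-channel weights
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        # Spatial attention A_s: 7x7 conv over channel statistics -> per-pixel weights
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        near = -nn.functional.max_pool2d(-feat, 3, stride=1, padding=1)  # min pooling
        far = nn.functional.max_pool2d(feat, 3, stride=1, padding=1)     # max pooling
        f = self.fuse(torch.cat([near, far], dim=1))                     # fused feature map F
        f1 = self.channel_mlp(f) * f                                     # F' = A_c(F) * F
        stats = torch.cat([f1.mean(dim=1, keepdim=True),
                           f1.max(dim=1, keepdim=True).values], dim=1)
        f2 = self.spatial_conv(stats) * f1                               # F'' = A_s(F') * F'
        return f2
```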
After obtaining the predicted dense depth map corresponding to the image sample to be processed output by the student model, when the loss between the predicted dense depth map and the supervised dense depth map meets the convergence condition, the student model with the first training completed can be obtained.
Specifically, during training of the student model, the attention mechanism module may be introduced into the student model (Student Model) to be trained. For example, on the basis of an encoder-decoder depth completion network, a sparse depth based attention mechanism can be introduced and embedded into the encoder of the depth completion model, so as to improve the network's ability to extract scene global features. By inputting image-sparse depth pairs (I_i, z_i) into the improved student model (student network), a depth map d_i with higher accuracy is learned. The generated depth map d_i and the distilled supervision map d_d are used to compute the L1 Loss function, optimizing the training of the student network model.
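For illustration only, one training iteration with the L1 Loss against the distilled supervision d_d could look like the following Python (PyTorch) sketch; the model interface and the valid-pixel mask are assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(student, optimizer, image_i, sparse_z, distilled_d):
    """Sketch of one optimization step (interface names are assumptions).

    student:     the student depth completion network f_theta
    image_i:     (B, 3, H, W) image samples to be processed
    sparse_z:    (B, 1, H, W) corresponding sparse depth map samples
    distilled_d: (B, 1, H, W) distilled supervision d_d (supervision dense depth map)
    """
    pred_d = student(image_i, sparse_z)                   # predicted dense depth map d_i
    valid = distilled_d > 0                               # supervise only defined pixels (assumption)
    loss = F.l1_loss(pred_d[valid], distilled_d[valid])   # L1 Loss against d_d
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```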
Referring to fig. 5, fig. 5 is a flow chart of another sparse depth completion training method disclosed in the embodiment of the present application. As can be seen from fig. 5, the student model includes an encoder and a decoder, where the encoder introduces the attention mechanism; the image sample to be processed and the sparse depth map sample corresponding to the image sample to be processed can be input into the student model to obtain the predicted dense depth map corresponding to the image sample to be processed output by the student model. The method for acquiring the supervision dense depth map may be as follows: the image sample to be processed and the sparse depth map sample corresponding to the image sample to be processed are sequentially input into N pre-trained teacher models, and the N teacher models respectively output dense depth maps corresponding to the image sample to be processed; the image sample to be processed and the stereo matching image sample stereo-matched with the image sample to be processed are input into the stereoscopic vision model, which determines the parallax information between the image sample to be processed and the corresponding stereo matching image sample and determines the stereo dense depth map corresponding to the image sample to be processed based on the parallax; the stereo dense depth map of the target view angle output by the stereoscopic vision model is compared in similarity with the ground-truth map corresponding to the target view angle to obtain an error feature map corresponding to the stereoscopic vision model; and the supervision dense depth map corresponding to the image sample to be processed is determined based on the error feature maps corresponding to the teacher models and the error feature map corresponding to the stereoscopic vision model. When the loss between the predicted dense depth map and the supervision dense depth map meets the convergence condition, a student model with the first training completed is obtained.
After the predicted dense depth map corresponding to the image sample to be processed output by the student model is obtained, the image sample to be processed and the sparse depth map sample corresponding to the image sample to be processed can also be input into the new view angle image generation module to obtain a predicted dense depth image of the target view angle output by the new view angle image generation module; then the loss between the predicted dense depth map of the target view angle and the ground-truth depth map of the target view angle is calculated according to a loss function, and when the loss meets the convergence condition, a student model with the second training completed is obtained.
Specifically, the generated depth map d_i is passed through the warping function f_ω to generate a new view angle image I_t^i, and then the photometric error L1 Loss and the structural error SSIM Loss of the loss function are calculated between the new view angle image and the ground-truth image of that view angle, further optimizing the training of the student network model (student model).
It should be understood that the photometric error L1 Loss refers to the absolute error between the predicted value and the true value, and the SSIM Loss refers to the structural similarity loss, which is used to measure the structural similarity between two images. Both loss functions are evaluation indexes in image processing for measuring the similarity between images. Specifically, the photometric error L1 Loss refers to the average of the absolute differences between pixel values of the predicted image I_t^i and the ground-truth image of the view angle, while the structural error SSIM Loss refers to the average structural similarity between the two images, covering three aspects: brightness, contrast and structure. Brightness and contrast are related to the calculation of color information, while the structural aspect is not. For example, if the color information of two images is identical but their brightness and contrast are different, then the SSIM Loss between them will be large and the L1 Loss will be small. On the other hand, if the color information of the two images is different but their brightness and contrast are the same, the L1 Loss between them will be large and the SSIM Loss will be small.
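For illustration only, a combined photometric L1 Loss and SSIM Loss could be sketched as follows in Python (PyTorch); the 3 × 3 window and the weighting factor alpha are assumptions, since the text does not specify them.

```python
import torch

def photometric_and_ssim_loss(pred_img, gt_img, alpha=0.85, c1=0.01**2, c2=0.03**2):
    """Sketch of the photometric L1 + SSIM supervision (weighting alpha is an assumption).

    pred_img: (B, 3, H, W) warped new-view image I_t^i generated from d_i
    gt_img:   (B, 3, H, W) ground-truth image at that view angle
    """
    l1 = (pred_img - gt_img).abs().mean()

    # Simplified SSIM computed over 3x3 neighborhoods
    pool = torch.nn.AvgPool2d(3, 1)
    mu_x, mu_y = pool(pred_img), pool(gt_img)
    sigma_x = pool(pred_img ** 2) - mu_x ** 2
    sigma_y = pool(gt_img ** 2) - mu_y ** 2
    sigma_xy = pool(pred_img * gt_img) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    ssim_loss = torch.clamp((1 - ssim) / 2, 0, 1).mean()

    return alpha * ssim_loss + (1 - alpha) * l1
```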
The following description continues with the above example for various needs of the user.
A. If the user needs to obtain an RGB image of the image to be processed output by the student model, the three indexes of brightness, contrast and structure of the loss function photometric error L1 Loss and structural error SSIM Loss can be calculated, so as to enhance the optimization of the student network model (student model) training.
B. If the user does not need to obtain an RGB image of the image to be processed output by the student model, only the structural index of the loss function photometric error L1 Loss and structural error SSIM Loss may be calculated, so as to enhance the optimization of the student network model (student model) training.
For ease of comparison, the embodiments of the present application are compared with existing methods as follows. Specifically, the advantages of the embodiments of the present application over existing embodiments can be illustrated in terms of overall effect from two perspectives: introducing the stereoscopic vision model, and introducing the sparse depth based attention mechanism.
(1) The depth completion accuracy of the embodiment of the present application that introduces the stereoscopic vision model is superior to that of the existing knowledge distillation based depth estimation method. Referring to figs. 6 to 9: fig. 6 is a graph comparing the overall effect of an embodiment introducing the stereoscopic vision model with the overall effect of an existing embodiment, and figs. 7 to 9 are graphs comparing the overall effect of further embodiments introducing the stereoscopic vision model with the overall effect of existing embodiments. In figs. 6 to 9, the upper images Image It and Ir in each figure are the left and right images of a single frame in the KITTI data set, respectively; Depth map (Ours) and Depth map (Baseline) are the depth completion results of the present invention and the existing method, respectively; and Error map (Ours) and Error map (Baseline) are the depth completion error maps of the present invention and the existing method, respectively, where a darker color indicates a lower error and higher accuracy. As can be seen from the boxes in figs. 6 to 9, the depth completion result of the embodiment of the present application has lower error and higher precision.
(2) Effect of the embodiment of the present application that introduces the sparse depth based attention mechanism: scene global features can be extracted more effectively. Referring to figs. 10 to 12: fig. 10 is a graph comparing the overall effect of an embodiment introducing the sparse depth based attention mechanism with the overall effect of an existing embodiment, and figs. 11 and 12 are graphs comparing the overall effect of further such embodiments with the overall effect of existing embodiments. In figs. 10 to 12, Image and Sparse Depth in each figure are the image and sparse depth map of a single frame in the VOID data set, respectively; Depth map (Ours) and Depth map (Baseline) are the depth completion results of the embodiment introducing the sparse depth based attention mechanism and the existing embodiment, respectively; and Error map (Ours) and Error map (Baseline) are the corresponding depth completion error maps, where a darker color indicates a lower error and higher accuracy. As can be seen from the boxes in Depth map and Error map, even when sparse depth values are missing, the depth completion result of the embodiment introducing the sparse depth based attention mechanism is better; the scene global features can be extracted more effectively and the missing depth values can be completed.
It will be appreciated that, in addition to the above-described method of determining the supervised dense depth map corresponding to the image sample to be processed based on the error feature map corresponding to each teacher model and the error feature map corresponding to the student model, other reasonable methods may be used to obtain the supervised dense depth map, which is not specifically limited herein.
It is also understood that the stereoscopic vision model in the embodiments of the present application may be replaced by other visual generation models, which is not limited herein. The embodiments of the present application may also introduce the attention mechanism into other encoder-decoder depth completion models, which is not specifically limited herein. The sparse depth-guided attention mechanism in the embodiments of the present application may also be replaced by an attention mechanism guided by other sparse features (such as plane normal vectors), which is not limited herein.
In this embodiment, the near features and the far features of the sparse depth map sample may be obtained by the student model and fused to obtain a fused feature map; a first feature map with enhanced channel attention is determined based on the weight corresponding to each channel in the fused feature map, and a predicted dense depth map with enhanced spatial attention is determined based on the weight corresponding to each region; when the loss between the predicted dense depth map and the supervised dense depth map satisfies the convergence condition, a student model that has completed the first training is obtained. The extraction capability for the scene global features is strong, so the accuracy of the learned scene global features is high. Secondly, knowledge distillation methods based on existing depth completion models (teacher models) suffer from learning the errors of the existing models and are prone to erroneous supervision; the embodiments of the present application introduce a stereoscopic vision model to improve the accuracy of knowledge distillation and avoid learning the inherent errors of the existing depth completion models (teacher models). Finally, a sparse depth-guided attention mechanism is introduced into the encoder of the depth completion model, which enhances the extraction of the scene global features.
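For illustration only, the following is a minimal sketch of the per-pixel, error-guided fusion of teacher predictions and the stereoscopic-branch prediction into a supervised dense depth map, following the minimum-error selection described in claim 7. The function name and array conventions are hypothetical; it assumes the candidate dense depth maps and their error feature maps are already available as same-shaped arrays.

```python
import numpy as np

def fuse_supervision(depth_maps, error_maps):
    """Per-pixel fusion of candidate dense depth maps into a supervised dense depth map.

    depth_maps: list of (H, W) arrays, dense depth predictions from the N teacher
                models and from the stereoscopic branch of the student model.
    error_maps: list of (H, W) arrays, per-pixel depth-value errors of the
                corresponding predictions (lower means more reliable).
    At every pixel, the depth value of the candidate with the smallest error is kept.
    """
    depths = np.stack(depth_maps)      # (K, H, W)
    errors = np.stack(error_maps)      # (K, H, W)
    best = np.argmin(errors, axis=0)   # index of the lowest-error source per pixel
    return np.take_along_axis(depths, best[None], axis=0)[0]

# Hypothetical usage with two teacher predictions and one stereoscopic prediction.
h, w = 4, 5
preds = [np.random.rand(h, w) * 10 for _ in range(3)]
errs = [np.random.rand(h, w) for _ in range(3)]
supervised = fuse_supervision(preds, errs)
print(supervised.shape)  # (4, 5)
```

The weighted-combination variants of claims 4 and 6 could be obtained, for example, by replacing the arg-min selection with per-pixel weights derived from the negated errors; the exact weighting is not prescribed here.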
The foregoing describes a sparse depth completion training method in the embodiment of the present application, and the following describes a sparse depth completion method in the embodiment of the present application, please refer to fig. 13, fig. 13 is a schematic flow diagram of a sparse depth completion method disclosed in the embodiment of the present application, where the method includes:
1301. Obtaining an image to be processed and a sparse depth map corresponding to the image to be processed.
In this embodiment, when sparse depth completion is performed, the image to be processed and the sparse depth map corresponding to the image to be processed may be obtained.
1302. Inputting the to-be-processed image and a sparse depth map corresponding to the to-be-processed image into a pre-trained student model, obtaining near features and far features of the sparse depth map by the student model, fusing the near features and the far features of the sparse depth map to obtain a fused feature map, determining the importance degree of each channel in the fused feature map, determining the weight corresponding to each channel based on the importance degree of each channel, determining a first feature map with enhanced channel attention based on the weight corresponding to each channel, determining the spatial relevance between each region in the first feature map, determining the weight corresponding to each region based on the spatial relevance between each region, determining a second feature map with enhanced spatial attention based on the weight corresponding to each region, and obtaining a dense depth map corresponding to the to-be-processed image output by the student model; wherein the second feature map is a dense depth map.
After the image to be processed and the sparse depth map corresponding to the image to be processed are obtained, they may be input into the pre-trained student model. The student model obtains the near features and the far features of the sparse depth map, fuses them to obtain a fused feature map, determines the importance degree of each channel in the fused feature map, determines the weight corresponding to each channel based on the importance degree of each channel, determines the first feature map with enhanced channel attention based on the weight corresponding to each channel, determines the spatial relevance between the regions in the first feature map, determines the weight corresponding to each region based on that spatial relevance, and determines the second feature map with enhanced spatial attention based on the weight corresponding to each region, thereby obtaining the dense depth map corresponding to the image to be processed output by the student model; the second feature map is the dense depth map.
Specifically, in order to obtain a dense scene depth map from a single image (the image to be processed) and the sparse depth map corresponding to that single image (the sparse depth map corresponding to the image to be processed), the embodiment of the present application provides a depth completion function f_θ, as shown in equation four above. As shown in equation four, the image to be processed I_i, the sparse depth map z_i corresponding to the image to be processed, and the camera intrinsic parameters K_i corresponding to the image to be processed are taken as inputs, and the predicted dense depth map d_i corresponding to the image to be processed is output, i.e. d_i = f_θ(I_i, z_i, K_i).
Specifically, the student model of the embodiment of the present application introduces an attention mechanism module based on sparse depth values and embeds it into the encoder of the depth completion model to extract the global features of the scene. Referring to fig. 4, the attention mechanism module comprises a minimum pooling layer, a maximum pooling layer, a fusion module, a channel attention module and a spatial attention module. The main idea of the attention mechanism module is as follows: when the features of the sparse depth map F are extracted, the near features in the sparse depth map are extracted through the minimum pooling layer, and the far features in the sparse depth map are extracted through the maximum pooling layer. These near and far features are then fused together by the fusion module through a 1 x 1 convolution layer. Next, a channel attention function is introduced to extract the more discriminative features in each channel and give them greater weight, and an enhanced feature map F' is generated by multiplication. Subsequently, a spatial attention function is introduced to learn the features with greater spatial relevance in each region and give them greater weight, and an enhanced feature map F'' is generated by multiplication (weighted summation). Finally, the global features of the scene are extracted to obtain the dense depth map corresponding to the image to be processed, as specifically shown in equation five and equation six.
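For illustration only, the following is a minimal PyTorch-style sketch of a sparse-depth-guided attention block of the kind described above (minimum pooling for near features, maximum pooling for far features, 1 x 1 fusion, then channel and spatial attention). The class name, pooling kernel size, reduction ratio and layer choices are assumptions; equations five and six of the disclosure are not reproduced, and the handling of missing (zero) sparse depth values is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseDepthGuidedAttention(nn.Module):
    """Sketch of a sparse-depth-guided attention block (all names are illustrative)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)  # 1x1 fusion of near/far features
        # Channel attention: weight each channel by its importance.
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: weight each region by its spatial relevance.
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, sparse_feat: torch.Tensor) -> torch.Tensor:
        near = -F.max_pool2d(-sparse_feat, kernel_size=3, stride=1, padding=1)  # min pooling -> near features
        far = F.max_pool2d(sparse_feat, kernel_size=3, stride=1, padding=1)     # max pooling -> far features
        fused = self.fuse(torch.cat([near, far], dim=1))                        # fused feature map
        f_channel = fused * self.channel_att(fused)                             # channel-attention-enhanced map (F')
        pooled = torch.cat([f_channel.mean(dim=1, keepdim=True),
                            f_channel.max(dim=1, keepdim=True).values], dim=1)
        return f_channel * self.spatial_att(pooled)                             # spatial-attention-enhanced map (F'')

# Hypothetical usage on a single-scale encoder feature map.
block = SparseDepthGuidedAttention(channels=32)
x = torch.rand(1, 32, 64, 80)
print(block(x).shape)  # torch.Size([1, 32, 64, 80])
```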
In the embodiment of the present application, the image to be processed and the sparse depth map corresponding to the image to be processed may be obtained and input into the pre-trained student model. The student model obtains the near features and the far features of the sparse depth map, fuses them to obtain a fused feature map, determines the importance degree of each channel in the fused feature map and the corresponding channel weights, determines the first feature map with enhanced channel attention based on those weights, determines the spatial relevance between the regions in the first feature map and the corresponding region weights, and determines the second feature map with enhanced spatial attention based on those weights, thereby obtaining the dense depth map corresponding to the image to be processed output by the student model, where the second feature map is the dense depth map. Since the extraction capability for the scene global features is strong, the scene global features corresponding to the dense depth map obtained based on the pre-trained student model have high accuracy.
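For illustration only, the following is a minimal sketch of the inference flow of steps 1301 to 1302: feeding an image and its sparse depth map to a pre-trained student model and reading back the dense depth map. The DummyStudentModel stub, the tensor sizes and the call signature are hypothetical placeholders; the actual network architecture and pre-processing are not reproduced from the disclosure.

```python
import torch
import torch.nn as nn

class DummyStudentModel(nn.Module):
    """Stand-in for the trained student model; the real network is not reproduced here."""
    def forward(self, image: torch.Tensor, sparse_depth: torch.Tensor) -> torch.Tensor:
        # A trained student model would predict a dense depth map; this stub just
        # returns the sparse map so that the call pattern can be demonstrated.
        return sparse_depth

student_model = DummyStudentModel()
student_model.eval()

image = torch.rand(1, 3, 352, 1216)         # image to be processed (B, 3, H, W), KITTI-like size
sparse_depth = torch.rand(1, 1, 352, 1216)  # sparse depth map, zeros at missing pixels

with torch.no_grad():
    dense_depth = student_model(image, sparse_depth)  # dense depth map corresponding to the image
print(dense_depth.shape)  # torch.Size([1, 1, 352, 1216])
```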
The foregoing describes the sparse depth completion training method and the sparse depth completion method in the embodiments of the present application; the following describes an image processing apparatus in the embodiments of the present application. Referring to fig. 14, an embodiment of an image processing apparatus 1400 in the embodiments of the present application includes:
a central processor 1401, a memory 1405, an input/output interface 1404, a wired or wireless network interface 1403, and a power supply 1402;
memory 1405 is transient memory or persistent memory;
the central processor 1401 is configured to communicate with the memory 1405 and execute the instructions in the memory 1405, so as to perform the methods of the embodiments shown in fig. 1 or fig. 13.
Embodiments of the present application also provide a computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of the embodiments shown in fig. 1 or 13.
Embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the embodiments shown in fig. 1 or fig. 13 described above.
It should be understood that, although the steps in the flowcharts of the above embodiments are displayed sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated by the arrows. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments; the order of performing these sub-steps or stages is not necessarily sequential either, and they may be performed in turn or alternately with at least some of the other steps, sub-steps or stages.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (10)

1. A depth completion training method for sparse depth, comprising:
obtaining an image sample to be processed and a sparse depth map sample corresponding to the image sample to be processed;
obtaining a supervision dense depth map corresponding to the image sample to be processed; wherein the supervision dense depth map comprises depth associations between scene areas and global features of the scene areas;
inputting the image sample to be processed and a sparse depth map sample corresponding to the image sample to be processed into a student model, obtaining near features and far features of the sparse depth map sample by the student model, fusing the near features and the far features of the sparse depth map sample to obtain a fused feature map, determining the importance degree of each channel in the fused feature map, determining the weight corresponding to each channel based on the importance degree of each channel, determining a first feature map with enhanced channel attention based on the weight corresponding to each channel, determining the spatial relevance between each region in the first feature map, determining the weight corresponding to each region based on the spatial relevance between each region, determining a second feature map with enhanced spatial attention based on the weight corresponding to each region, and obtaining a predicted dense depth map corresponding to the image sample to be processed output by the student model; wherein the second feature map is the predicted dense depth map;
And when the loss between the predicted dense depth map and the supervision dense depth map meets a convergence condition, obtaining a student model with the first training completed.
2. The method of claim 1, wherein the student model comprises a new view angle image generation module;
after the predicted dense depth map corresponding to the image sample to be processed output by the student model is obtained, the method further includes:
inputting the image sample to be processed and the sparse depth map sample corresponding to the image sample to be processed into the new view angle image generation module to obtain a predicted dense depth map of the target view angle output by the new view angle image generation module;
and calculating the loss between the predicted dense depth map of the target visual angle and the live depth map of the target visual angle according to a loss function, and obtaining a second training-completed student model when the loss meets a convergence condition.
3. The method according to claim 2, wherein the obtaining a supervised dense depth map corresponding to the image sample to be processed comprises:
sequentially inputting the image sample to be processed and the sparse depth map sample corresponding to the image sample to be processed into N teacher models trained in advance, and respectively outputting dense depth maps corresponding to the image sample to be processed by the N teacher models; wherein N is an integer greater than or equal to 2;
Inputting the image sample to be processed and the dense depth map corresponding to the teacher model into the new view angle image generation module aiming at the dense depth map output by each teacher model to obtain the dense depth map of the target view angle corresponding to the teacher model output by the new view angle image generation module;
comparing the similarity of the dense depth map of the target visual angle corresponding to the teacher model with the live map of the target visual angle to obtain an error feature map corresponding to the teacher model;
and determining a supervision dense depth map corresponding to the image sample to be processed based on the error feature map corresponding to each teacher model.
4. A method according to claim 3, wherein said determining a supervised dense depth map for the image samples to be processed based on the error profile for each teacher model comprises:
determining the weight of each teacher model corresponding to each pixel point based on the error feature map corresponding to each teacher model;
and determining a supervision dense depth map corresponding to the image sample to be processed based on the weight corresponding to each pixel point of each teacher model.
5. A method according to claim 3, wherein the student model comprises a stereoscopic vision model;
Before the determining the supervision dense depth map corresponding to the image sample to be processed based on the error feature map corresponding to each teacher model, the method further includes:
inputting the image sample to be processed and a stereo matching image sample which is in stereo matching with the image sample to be processed into the stereoscopic vision model, determining, by the stereoscopic vision model, parallax information between the image sample to be processed and the stereo matching image sample corresponding to the image sample to be processed, determining a stereoscopic dense depth map corresponding to the image sample to be processed based on the parallax information, and obtaining the stereoscopic dense depth map corresponding to the image sample to be processed output by the stereoscopic vision model;
inputting the image sample to be processed and the stereoscopic dense depth map corresponding to the image sample to be processed into the new view angle image generation module to obtain the stereoscopic dense depth map of the target view angle output by the new view angle image generation module;
comparing the similarity between the stereoscopic dense depth map of the target visual angle corresponding to the student model and the live map corresponding to the target visual angle to obtain an error feature map corresponding to the student model;
the determining the supervision dense depth map corresponding to the image sample to be processed based on the error feature map corresponding to each teacher model comprises the following steps:
And determining a supervision dense depth map corresponding to the image sample to be processed based on the error feature map corresponding to each teacher model and the error feature map corresponding to the student model.
6. The method of claim 5, wherein the determining a supervised dense depth map for the image sample to be processed based on the error feature map for each teacher model and the error feature map for the student model comprises:
determining the weight of each teacher model and the student model corresponding to each pixel point based on the error feature map corresponding to each teacher model and the error feature map corresponding to the student model;
and determining a supervision dense depth map corresponding to the image sample to be processed based on the weight corresponding to each pixel point of each teacher model and the student model.
7. The method of claim 5, wherein the determining a supervised dense depth map for the image sample to be processed based on the error feature map for each teacher model and the error feature map for the student model comprises:
determining a depth value error corresponding to each teacher model at each pixel point and a depth value error corresponding to the student model at each pixel point based on the error feature map corresponding to each teacher model and the error feature map corresponding to the student model;
For each pixel point, if the depth value error of the student model corresponding to the pixel point is greater than or equal to the depth value error of the target teacher model corresponding to the pixel point, taking the depth value of the target teacher model corresponding to the pixel point as the target depth value corresponding to the pixel point;
if the depth value error of the student model corresponding to the pixel point is smaller than the depth value error of the target teacher model corresponding to the pixel point, taking the depth value of the student model corresponding to the pixel point as the target depth value corresponding to the pixel point;
and obtaining a supervision dense depth map corresponding to the image sample to be processed based on the target depth value corresponding to each pixel point.
8. A depth completion method of sparse depth, comprising:
obtaining an image to be processed and a sparse depth map corresponding to the image to be processed;
inputting the to-be-processed image and a sparse depth map corresponding to the to-be-processed image into a pre-trained student model, obtaining near features and far features of the sparse depth map by the student model, fusing the near features and the far features of the sparse depth map to obtain a fused feature map, determining the importance degree of each channel in the fused feature map, determining the weight corresponding to each channel based on the importance degree of each channel, determining a first feature map with enhanced channel attention based on the weight corresponding to each channel, determining the spatial relevance between each region in the first feature map, determining the weight corresponding to each region based on the spatial relevance between each region, determining a second feature map with enhanced spatial attention based on the weight corresponding to each region, and obtaining a dense depth map corresponding to the to-be-processed image output by the student model; wherein the second feature map is the dense depth map.
9. An image processing apparatus, characterized by comprising:
a central processing unit and a memory;
the memory is a short-term memory or a persistent memory;
the central processor is configured to communicate with the memory and to execute instruction operations in the memory to perform the method of any one of claims 1 to 7 or claim 8.
10. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 7 or claim 8.
CN202311544514.9A 2023-11-17 2023-11-17 Sparse depth completion training method and related equipment Pending CN117557887A (en)

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination