CN112560875B - Depth information complement model training method, device, equipment and storage medium - Google Patents
Depth information complement model training method, device, equipment and storage medium
- Publication number: CN112560875B (application CN202011563584.5A)
- Authority: CN (China)
- Prior art keywords: feature, depth image, image, training, enhancement
- Legal status: Active
Classifications
- G06V10/44 (Image or video recognition or understanding; extraction of image or video features)—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06F18/214 (Pattern recognition; design or setup of recognition systems)—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/253 (Pattern recognition; fusion techniques)—Fusion techniques of extracted features
- G06N3/045 (Neural networks; architecture)—Combinations of networks
- G06N3/08 (Neural networks)—Learning methods
Abstract
The application discloses a depth information complement model training method, device, equipment and storage medium, and relates to artificial intelligence technologies such as computer vision, deep learning and automatic driving. One embodiment of the method comprises: acquiring training data, wherein the training data comprises a first sparse depth image, a corresponding first color image and a corresponding first dense depth image; and taking the first sparse depth image and the first color image as input and the first dense depth image as output, training to obtain a depth information complement model, wherein in the training process, features extracted from the first sparse depth image and the first color image are fused and the fused features are then enhanced. Because the sparse depth image and the corresponding color image are taken as input during training, and their fused features are enhanced, a model that effectively acquires dense depth images can be trained.
Description
Technical Field
The embodiments of the application relate to the field of computer technology, in particular to artificial intelligence technologies such as computer vision, deep learning and automatic driving, and specifically to a depth information complement model training method, device, equipment and storage medium.
Background
The depth image may be applied to three-dimensional object detection, three-dimensional scene reconstruction, and the like. Currently, high-precision depth sensors such as radar devices may be used to acquire depth images. Common radar devices include 16-line, 32-line and 64-line devices. However, the depth images acquired by these radar devices are very sparse, which severely limits the effectiveness of the related applications. Therefore, depth information completion needs to be performed on the acquired sparse depth image.
Currently, depth information completion is mainly implemented by interpolation. Specifically, a sparse depth image is taken as input, and an interpolation method such as nearest-neighbor interpolation, bicubic interpolation or bilinear interpolation is used to interpolate a dense depth image from the sparse depth points.
Disclosure of Invention
The embodiment of the application provides a depth information completion model training method, device, equipment and storage medium.
In a first aspect, an embodiment of the present application provides a depth information complement model training method, including: acquiring training data, wherein the training data comprises a first sparse depth image, a corresponding first color image and a corresponding first dense depth image; and taking the first sparse depth image and the first color image as input, taking the first dense depth image as output, and training to obtain a depth information complement model, wherein in the training process, features extracted from the first sparse depth image and the first color image are subjected to feature fusion, and then feature enhancement is performed.
In a second aspect, an embodiment of the present application provides a depth information complement model training apparatus, including: a training data acquisition module configured to acquire training data, wherein the training data comprises a first sparse depth image, a corresponding first color image, and a corresponding first dense depth image; the model training module is configured to take a first sparse depth image and a first color image as input, take a first dense depth image as output, and train to obtain a depth information complement model, wherein in the training process, features extracted from the first sparse depth image and the first color image are subjected to feature fusion, and then feature enhancement is performed.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method as described in any implementation of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
The depth information completion model training method, device, equipment and storage medium provided by the embodiment of the application firstly acquire training data; and then taking the first sparse depth image and the first color image as input, taking the first dense depth image as output, and training to obtain a depth information complement model. In the training process, the sparse depth image and the corresponding color image are taken as input, and the features of the sparse depth image and the color image are fused for feature enhancement, so that a model for effectively acquiring the dense depth image can be trained.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings. The drawings are for better understanding of the present solution and do not constitute a limitation of the present application. Wherein:
FIG. 1 is a flow chart of one embodiment of a depth information completion model training method according to the present application;
FIG. 2 is a flow chart of yet another embodiment of a depth information completion model training method according to the present application;
FIG. 3 is a schematic diagram of a network training architecture of a depth information completion model;
FIG. 4 is a schematic diagram of the basic training unit;
FIG. 5 is a channel level enhancement flow diagram;
FIG. 6 is a pixel level enhancement flow chart;
FIG. 7 is a flow chart of one embodiment of a depth information completion model optimization method according to the present application;
FIG. 8 is a schematic structural diagram of one embodiment of a depth information completion model training apparatus according to the present application;
Fig. 9 is a block diagram of an electronic device for implementing a depth information completion model training method of an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present application to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 illustrates a flow 100 of one embodiment of a depth information completion model training method according to the present application. The depth information complement model training method comprises the following steps:
step 101, training data is acquired.
In this embodiment, the execution subject of the depth information complement model training method may acquire training data. The training data may include a first sparse depth image, a first color image and a first dense depth image. The pixel value of each pixel of a depth image may be the distance between the sensor and the corresponding point on the acquired object surface. The first sparse depth image is a depth image whose valid pixels are sparse, and the first dense depth image is a depth image whose valid pixels are denser. The pixel value of each pixel of a color image may be the color value of the corresponding point on the acquired object surface. In general, all colors perceived by human vision are obtained by varying and superposing the three color channels red (R), green (G) and blue (B). Commonly used sensors are radar devices and cameras: the camera collects color images, and the radar device acquires depth images. Radar devices include 16-line, 32-line, 64-line devices and the like; the more lines, the denser the acquired depth image and the more expensive the device. Radar devices and cameras are typically mounted on autonomous vehicles to capture images of the surroundings during travel. Autonomous vehicles may include, but are not limited to, unmanned automobiles, unmanned autonomous excavators, and the like.
The first sparse depth image, the first color image and the first dense depth image correspond to each other. For example, the first sparse depth image, the first dense depth image, and the first color image are images acquired of the same scene at the same time. For another example, the first sparse depth image and the first color image are images acquired of the same scene at the same time, and the first dense depth image is generated based on the first sparse depth image. For another example, the first dense depth image and the first color image are images acquired of the same scene at the same time, and the first sparse depth image is generated based on the first dense depth image.
In some optional implementations of this embodiment, the executing body may first acquire a first dense depth image and a corresponding first color image, and then randomly sample the first dense depth image to generate the first sparse depth image. For example, a relatively dense depth image is acquired with a radar device having a large number of lines and used as the first dense depth image, while a camera correspondingly collects a color image as the first color image. The corresponding first sparse depth image can then be generated by randomly sampling the relatively dense depth image, which enriches the ways in which training samples can be generated.
In some optional implementations of this embodiment, the executing body may first acquire a first sparse depth image and a first color image, and then project adjacent frames of the first sparse depth image into one frame to generate the first dense depth image. For example, sparse depth images are continuously acquired with a radar device, while a camera correspondingly collects a color image as the first color image. The corresponding first dense depth image can be generated by projecting adjacent sparse depth frames into one frame. For the first sparse depth image, the sparse depth image acquired by the radar equipment can be used directly, or the first dense depth image can be randomly sampled to generate it, which further enriches the ways in which training samples can be generated.
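As an illustrative sketch of the random-sampling implementation described above (the function name, the keep ratio and the use of zero to mark missing depth are assumptions for illustration, not details taken from the application), a first sparse depth image could be generated from a first dense depth image roughly as follows:

```python
import numpy as np

def make_sparse_from_dense(dense_depth, keep_ratio=0.05, seed=None):
    """Randomly sample a dense depth map to simulate a sparse one."""
    rng = np.random.default_rng(seed)
    valid = dense_depth > 0                          # assume 0 marks missing depth
    keep = rng.random(dense_depth.shape) < keep_ratio
    return np.where(valid & keep, dense_depth, 0.0)  # keep a random subset of valid pixels
```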
Step 102, taking the first sparse depth image and the first color image as input, taking the first dense depth image as output, and training to obtain a depth information complement model.
In this embodiment, the execution body may take the first sparse depth image and the first color image as inputs, take the first dense depth image as outputs, and train to obtain the depth information complement model.
The depth information complement model can be obtained through supervised training by using a deep learning method. In the training process, features extracted from the first sparse depth image and the first color image are subjected to feature fusion, and then feature enhancement is performed, so that a model for effectively acquiring a dense depth image is trained.
In some optional implementations of this embodiment, the executing entity may perform depth information completion using the depth information complement model obtained by training. Specifically, the execution subject may input the sparse depth image to be complemented and the corresponding color image into the depth information complement model, so as to effectively obtain the corresponding dense depth image. Depth information completion has wide application prospects and can be applied to fields such as computer vision and automatic driving, for example three-dimensional object detection and three-dimensional scene reconstruction.
According to the depth information completion model training method provided by the embodiment of the application, training data are firstly obtained; and then taking the first sparse depth image and the first color image as input, taking the first dense depth image as output, and training to obtain a depth information complement model. In the training process, the sparse depth image and the corresponding color image are taken as input, and the features of the sparse depth image and the color image are fused for feature enhancement, so that a model for effectively acquiring the dense depth image can be trained.
With continued reference to fig. 2, a flow 200 of yet another embodiment of a depth information completion model training method in accordance with the present application is shown. The depth information complement model comprises K (K is a positive integer) basic training units, and the depth information complement model training method comprises the following steps:
in step 201, training data is acquired.
In this embodiment, the specific operation of step 201 is described in detail in step 101 in the embodiment shown in fig. 1, and will not be described herein.
Step 202, inputting the input depth image and the first color image into a current basic training unit to obtain a current depth image.
In this embodiment, the execution subject of the depth information complement model training method may input the input depth image and the first color image into the current basic training unit to obtain the current depth image.
The depth information complement model may be a multi-stage network structure, and includes K cascaded basic training units, where a depth image output by a previous basic training unit may be input to a current basic training unit, and a depth image output by the current basic training unit may be input to a next basic training unit. In the training process, the current basic training unit performs feature fusion on features extracted from the input depth image and the first color image respectively, and then performs feature enhancement, so that a depth image denser than the input depth image is output, and the current stage training is completed.
It should be noted that, if the current basic training unit is the first basic training unit, there is no previous basic training unit, and the input depth image is the first sparse depth image. If the current basic training unit is not the first basic training unit, the input depth image is the depth image output by the previous basic training unit.
Step 203, the current depth image and the first color image are input to the next basic training unit.
In this embodiment, the executing body may input the current depth image and the first color image into the next basic training unit, and the next basic training unit performs operations similar to step 202, until the Kth basic training unit completes the training of the Kth stage.
Step 204, after the Kth-stage training, outputting the final depth image.
In this embodiment, after the Kth-stage training, the depth image output by the Kth basic training unit is the final depth image.
In step 205, parameters of the K basic training units are adjusted based on residuals between the depth images output by the K basic training units and the first dense depth image.
In this embodiment, for any one basic training unit, the execution body may calculate a residual error between the depth image output by the basic training unit and the first dense depth image, and adjust parameters of the basic training unit.
The learning objective of each basic training unit is the residual between the depth image it outputs and the first dense depth image, so that the output depth image is iteratively brought closer to the first dense depth image, thereby yielding a high-quality dense depth image.
For ease of understanding, fig. 3 shows a schematic diagram of the network training structure of the depth information completion model. As shown in fig. 3, the sparse depth image and the color image are input to the basic training unit 1, and the dense depth image 1 is output. Wherein the dense depth image 1 is denser than the sparse depth image. The dense depth image 1 and the color image are input to the basic training unit 2, and the dense depth image 2 is output. Wherein the dense depth image 2 is denser than the dense depth image 1. After K iterations, the basic training unit K outputs a dense depth image K. Wherein the dense depth image K is denser than the dense depth image 1 to the dense depth image K-1.
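The multi-stage structure of FIG. 3 can be summarized with the following PyTorch-style sketch (a hypothetical illustration, not the application's implementation): each basic training unit refines the depth map produced by the previous one, and every stage output is supervised against the first dense depth image. The BasicTrainingUnit module is sketched after the description of FIG. 4; the L1 residual loss and equal per-stage weighting are assumptions.

```python
import torch
import torch.nn as nn

class CascadedCompletionModel(nn.Module):
    """K cascaded basic training units; each stage refines the previous depth output."""
    def __init__(self, num_stages):
        super().__init__()
        self.stages = nn.ModuleList([BasicTrainingUnit() for _ in range(num_stages)])

    def forward(self, sparse_depth, color):
        outputs, depth = [], sparse_depth          # stage 1 takes the sparse depth image
        for stage in self.stages:
            depth = stage(depth, color)            # each stage outputs a denser depth image
            outputs.append(depth)
        return outputs                             # per-stage outputs for supervision

def training_step(model, sparse_depth, color, dense_gt, optimizer):
    optimizer.zero_grad()
    outputs = model(sparse_depth, color)
    valid = (dense_gt > 0).float()                 # supervise only valid ground-truth pixels
    # Each stage learns the residual between its output and the first dense depth image.
    loss = sum((((out - dense_gt) * valid).abs().sum() / valid.sum().clamp(min=1))
               for out in outputs)
    loss.backward()
    optimizer.step()
    return loss.item()
```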
As can be seen from fig. 2, compared with the embodiment corresponding to fig. 1, the flow 200 of the depth information complement model training method in this embodiment highlights the model training step. The solution described in this embodiment thus provides a depth information complement model with a multi-stage network structure. During training, the output depth image becomes progressively denser through the multi-stage iteration, so that a model for effectively acquiring high-quality dense depth images is trained.
With further reference to fig. 4, a schematic structural diagram of the basic training unit is shown. As shown in fig. 4, the basic training unit may include an encoding module (diagonal filled boxes), a feature enhancement module (no filled boxes), and a decoding module (vertical filled boxes).
The number of encoding modules may be two: one for feature extraction of the input depth image and the other for feature extraction of the first color image. An encoding module may include a plurality of convolution layers and performs feature extraction through a series of convolution operations, each using a k×k convolution kernel and c channels. For the input depth image, c=1; for the first color image, c=3. The encoding module may be a commonly used feature extraction network, including but not limited to the VGG (Visual Geometry Group) model, ResNet (Residual Network), etc.
The features of the input depth image and the first color image may be fused and input to the feature enhancement module. The fusion is channel fusion, that is, the 1-channel feature of the input depth image and the 3-channel feature of the first color image are concatenated along the channel dimension to obtain a 4-channel fusion feature. The feature enhancement module may be configured to perform feature enhancement on this fusion feature, and may include at least one of channel-level feature enhancement and pixel-level feature enhancement. When used for both channel-level and pixel-level feature enhancement, the feature enhancement module may also fuse the channel-level and pixel-level features.
The decoding module may include the same number of convolution layers as the encoding module and is the inverse operation of the encoding module. The decoding module likewise may include a plurality of convolution layers and restores the depth image through a series of convolution operations, each using a k×k convolution kernel and c channels; for the decoding module, c=4. The decoding module may be built from commonly used network structures, including but not limited to the VGG model, ResNet, etc.
The basic training unit can sum the output of the decoding module and the input depth image and then output the sum, so that the depth image output by the basic training unit is denser than the depth image recovered by the decoding module.
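A minimal PyTorch-style sketch of such a basic training unit is shown below (an illustration under assumptions, not the application's code): layer counts, kernel sizes and hidden widths are arbitrary, and the encoders are reduced to plain convolution stacks instead of VGG/ResNet backbones; only the overall structure (two encoders, channel fusion into a 4-channel feature, feature enhancement, decoding, and summation with the input depth) follows the description above. The FeatureEnhancement module is sketched after the descriptions of FIGS. 5 and 6.

```python
import torch
import torch.nn as nn

class BasicTrainingUnit(nn.Module):
    """Two encoders -> channel fusion -> feature enhancement -> decoder -> sum with input depth."""
    def __init__(self, hidden=32):
        super().__init__()
        # Encoder for the input depth image (c=1) and for the first color image (c=3).
        self.depth_enc = nn.Sequential(nn.Conv2d(1, hidden, 3, padding=1), nn.ReLU(),
                                       nn.Conv2d(hidden, 1, 3, padding=1))
        self.color_enc = nn.Sequential(nn.Conv2d(3, hidden, 3, padding=1), nn.ReLU(),
                                       nn.Conv2d(hidden, 3, 3, padding=1))
        self.enhance = FeatureEnhancement(4)       # operates on the 4-channel fusion feature
        # Decoder with the same number of convolution layers, restoring a 1-channel depth map.
        self.decoder = nn.Sequential(nn.Conv2d(4, hidden, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(hidden, 1, 3, padding=1))

    def forward(self, depth, color):
        fused = torch.cat([self.depth_enc(depth), self.color_enc(color)], dim=1)  # channel fusion
        enhanced = self.enhance(fused)
        restored = self.decoder(enhanced)
        return depth + restored                    # output is denser than the restored map alone
```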
With further reference to fig. 5, a channel-level enhancement flow diagram is shown. As shown in fig. 5, the feature enhancement module is configured to perform channel-level enhancement on the fusion feature F (c×w×h) of the input depth image and the first color image, where c is the number of channels (4 here), w is the width of the matrix and h is the height of the matrix. The channel-level enhancement flow is as follows:
First, the feature F (c×w×h) is transformed by convolution and matrix dimension-changing (e.g., reshape) operations into two features: a feature Q_c (c×(h*w)) and a feature H_c ((h*w)×c).
Thereafter, matrix multiplication is performed on the feature Q_c (c×(h*w)) and the feature H_c ((h*w)×c) to obtain a matrix M_c (c×c).
Then, a regression (e.g., softmax) is applied to the matrix M_c (c×c) to obtain a new weight M_c′ (c×c).
In addition, a convolution operation is performed on the feature F (c×w×h) to obtain a new feature F_c′ (c×w×h).
Subsequently, matrix multiplication is performed on the weight M_c′ (c×c) and the feature F_c′ (c×w×h) to obtain an enhanced feature F_h (c×w×h).
Finally, pixel-level addition is performed on the enhanced feature F_h (c×w×h) and the feature F_c′ (c×w×h) to obtain the channel-level enhanced feature F_c (c×w×h).
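A PyTorch-style sketch of this channel-level enhancement is given below (an illustrative reading of the flow above; the 1×1 convolutions used to produce Q_c, H_c and F_c′, and the batch-dimension handling, are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelLevelEnhancement(nn.Module):
    """Channel attention: Q_c x H_c -> softmax weights M_c' -> reweight F_c' -> add F_c'."""
    def __init__(self, channels):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, 1)
        self.to_h = nn.Conv2d(channels, channels, 1)
        self.to_f = nn.Conv2d(channels, channels, 1)

    def forward(self, feat):                                   # feat: (b, c, h, w)
        b, c, h, w = feat.shape
        q = self.to_q(feat).view(b, c, h * w)                  # Q_c: c x (h*w)
        k = self.to_h(feat).view(b, c, h * w).transpose(1, 2)  # H_c: (h*w) x c
        m = F.softmax(torch.bmm(q, k), dim=-1)                 # M_c': c x c (regression)
        f_c = self.to_f(feat)                                  # F_c': c x w x h
        f_h = torch.bmm(m, f_c.view(b, c, h * w)).view(b, c, h, w)  # F_h: c x w x h
        return f_h + f_c                                       # pixel-level addition -> F_c
```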
With further reference to fig. 6, a pixel-level enhancement flow chart is shown. As shown in fig. 6, the feature enhancement module is configured to perform pixel-level enhancement on the fusion feature F (c×w×h) of the input depth image and the first color image, where c is the number of channels (4 here), w is the width of the matrix and h is the height of the matrix. The pixel-level enhancement flow is as follows:
First, the feature F (c×w×h) is transformed by convolution and matrix dimension-changing (e.g., reshape) operations into two features: a feature Q_p ((h*w)×c) and a feature H_p (c×(h*w)).
Thereafter, matrix multiplication is performed on the feature Q_p ((h*w)×c) and the feature H_p (c×(h*w)) to obtain a matrix M_p ((h*w)×(h*w)).
Then, a regression (e.g., softmax) is applied to the matrix M_p ((h*w)×(h*w)) to obtain a new weight M_p′ ((h*w)×(h*w)).
In addition, a convolution operation is performed on the feature F (c×w×h) to obtain a new feature F_p′ (c×w×h).
Subsequently, matrix multiplication is performed on the weight M_p′ ((h*w)×(h*w)) and the feature F_p′ (c×w×h) to obtain an enhanced feature F_h′ (c×w×h).
Finally, pixel-level addition is performed on the enhanced feature F_h′ (c×w×h) and the feature F_p′ (c×w×h) to obtain the pixel-level enhanced feature F_p (c×w×h).
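A matching PyTorch-style sketch of the pixel-level enhancement follows (again an illustrative reading; the 1×1 convolutions and the reshape used for the final multiplication are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelLevelEnhancement(nn.Module):
    """Pixel (spatial) attention: Q_p x H_p -> softmax weights M_p' -> reweight F_p' -> add F_p'."""
    def __init__(self, channels):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, 1)
        self.to_h = nn.Conv2d(channels, channels, 1)
        self.to_f = nn.Conv2d(channels, channels, 1)

    def forward(self, feat):                                    # feat: (b, c, h, w)
        b, c, h, w = feat.shape
        q = self.to_q(feat).view(b, c, h * w).transpose(1, 2)   # Q_p: (h*w) x c
        k = self.to_h(feat).view(b, c, h * w)                   # H_p: c x (h*w)
        m = F.softmax(torch.bmm(q, k), dim=-1)                  # M_p': (h*w) x (h*w)
        f_p = self.to_f(feat)                                   # F_p': c x w x h
        f_h = torch.bmm(f_p.view(b, c, h * w), m).view(b, c, h, w)  # F_h': c x w x h
        return f_h + f_p                                        # pixel-level addition -> F_p
```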
When channel-level feature enhancement and pixel-level feature enhancement are performed simultaneously, the feature enhancement module may fuse the channel-level enhanced feature F_c (c×w×h) and the pixel-level enhanced feature F_p (c×w×h) based on learnable parameters to obtain the fusion feature. The specific formula is as follows:
F = a × F_c (c×w×h) + b × F_p (c×w×h)
where F is the fusion feature, and a and b are learnable parameters.
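Put together, the combined feature enhancement module referenced in the basic-training-unit sketch above could look like this (the learnable scalars a and b come from the formula above; initializing them to 1 is an assumption):

```python
import torch
import torch.nn as nn

class FeatureEnhancement(nn.Module):
    """F = a * F_c + b * F_p, with learnable fusion weights a and b."""
    def __init__(self, channels):
        super().__init__()
        self.channel_branch = ChannelLevelEnhancement(channels)
        self.pixel_branch = PixelLevelEnhancement(channels)
        self.a = nn.Parameter(torch.ones(1))
        self.b = nn.Parameter(torch.ones(1))

    def forward(self, feat):
        return self.a * self.channel_branch(feat) + self.b * self.pixel_branch(feat)
```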
With further reference to fig. 7, a flow 700 of one embodiment of a depth information completion model optimization method according to the present application is shown. The depth information complement model optimization method comprises the following steps:
step 701, obtaining test data.
In this embodiment, the execution subject of the depth information complement model optimization method may acquire test data. Wherein the test data may include a second sparse depth image, a second color image, and a second dense depth image. The second sparse depth image, the second color image, and the second dense depth image correspond.
It should be noted that the test data are similar to the training data in the embodiment shown in fig. 1, and will not be described here again.
Step 702, inputting the second sparse depth image and the second color image into the depth information complement model to obtain a complemented dense depth image.
In this embodiment, the execution body may input the second sparse depth image and the second color image into the depth information complement model to obtain the complemented dense depth image. The complemented dense depth image is the depth image produced by the depth information complement model based on the second sparse depth image and the second color image, and is denser than the second sparse depth image.
Step 703, calculating a residual between the complemented dense depth image and the second dense depth image, and optimizing the depth information complement model if the residual is greater than a preset threshold.
In this embodiment, the execution body may calculate the residual between the complemented dense depth image and the second dense depth image, and optimize the depth information complement model if the residual is greater than a preset threshold. Ways of optimizing the depth information complement model may include, but are not limited to: increasing the number of basic training units, increasing the number of training iterations, adding training data, etc.
The residual can be used to evaluate the performance of the depth information complement model: the smaller the residual, the better the model performance; the larger the residual, the worse the model performance. If the residual is greater than the preset threshold, the model performance is poor and the depth information complement model needs to be optimized; if the residual is not greater than the preset threshold, the model performance is good and the depth information complement model does not need to be optimized.
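A minimal sketch of this test-stage check is shown below (a hypothetical helper; the mean-absolute-error residual, the use of the final-stage output, and the data-loader format are assumptions):

```python
import torch

def needs_optimization(model, test_loader, threshold):
    """Return True if the model's residual on the test data exceeds the preset threshold."""
    model.eval()
    total_err, total_px = 0.0, 0
    with torch.no_grad():
        for sparse_depth, color, dense_gt in test_loader:
            pred = model(sparse_depth, color)[-1]     # completed dense depth from the final stage
            valid = dense_gt > 0
            total_err += (pred[valid] - dense_gt[valid]).abs().sum().item()
            total_px += int(valid.sum())
    residual = total_err / max(total_px, 1)
    # If True: add basic training units, train for more iterations, or add training data.
    return residual > threshold
```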
According to the depth information complement model optimization method provided by the embodiment of the application, test data are first obtained; then the second sparse depth image and the second color image are input into the depth information complement model to obtain a complemented dense depth image; finally, the residual between the complemented dense depth image and the second dense depth image is calculated, and the depth information complement model is optimized if the residual is greater than a preset threshold. In the training stage, the depth information complement model is obtained using the training data. In the test stage, the performance of the depth information complement model is tested using the test data. When the residual between the complemented dense depth image output by the depth information complement model and the second dense depth image is too large, the performance of the depth information complement model is improved, thereby realizing self-optimization of the whole system.
With further reference to fig. 8, as an implementation of the method shown in the foregoing drawings, the present application provides an embodiment of a depth information complement model training apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 1, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 8, the depth information complement model training apparatus 800 of the present embodiment may include: a training data acquisition module 801 and a model training module 802. Wherein, the training data acquisition module 801 is configured to acquire training data, wherein the training data comprises a first sparse depth image, a corresponding first color image, and a corresponding first dense depth image; the model training module 802 is configured to take a first sparse depth image and a first color image as input, take a first dense depth image as output, and train to obtain a depth information complement model, wherein in the training process, features extracted from the first sparse depth image and the first color image are fused, and then feature enhancement is performed.
In the present embodiment, in the depth information complement model training apparatus 800: the specific processing and technical effects of training data acquisition module 801 and model training module 802 may refer to the relevant descriptions of steps 101-102 in the corresponding embodiment of fig. 1, and are not described herein.
In some optional implementations of this embodiment, the depth information complement model includes K basic training units, K being a positive integer; and model training module 802 is further configured to: inputting an input depth image and a first color image into a current basic training unit to obtain a current depth image, inputting the current depth image and the first color image into a next basic training unit, and performing K-stage training to output a final depth image, wherein the input depth image is a first sparse depth image if the current basic training unit is a first basic training unit, and the input depth image is a depth image output by a previous basic training unit if the current basic training unit is not the first basic training unit; and adjusting parameters of the K basic training units based on residual errors between the depth images output by the K basic training units and the first dense depth image.
In some optional implementations of this embodiment, the basic training unit includes an encoding module, a feature enhancement module, and a decoding module, where the encoding module includes a plurality of convolution layers, and is configured to perform feature extraction on the input depth image and the first color image, respectively, and the feature enhancement module is configured to perform feature enhancement on fusion features of the input depth image and the first color image, and the decoding module includes convolution layers, which are the same number as the encoding modules, and is a reverse operation of the encoding module, and the basic training unit sums and outputs an output of the decoding module and the input depth image.
In some alternative implementations of the present embodiments, the feature enhancement module is used for channel-level feature enhancement and pixel-level feature enhancement, as well as channel-level feature and pixel-level feature fusion.
In some alternative implementations of the present embodiment, model training module 802 includes a channel-level feature enhancer module configured to: for the feature F (c×w×h), obtain a feature Q_c (c×(h*w)) and a feature H_c ((h*w)×c) by convolution and matrix dimension-changing operations, where c is the number of channels, w is the width and h is the height; perform matrix multiplication on the feature Q_c (c×(h*w)) and the feature H_c ((h*w)×c) to obtain a matrix M_c (c×c); perform a regression operation on the matrix M_c (c×c) to obtain a weight M_c′ (c×c); perform a convolution operation on the feature F (c×w×h) to obtain a feature F_c′ (c×w×h); perform matrix multiplication on the weight M_c′ (c×c) and the feature F_c′ (c×w×h) to obtain an enhanced feature F_h (c×w×h); and perform pixel-level addition on the enhanced feature F_h (c×w×h) and the feature F_c′ (c×w×h) to obtain a channel-level enhanced feature F_c (c×w×h).
In some alternative implementations of the present embodiment, model training module 802 includes a pixel-level feature enhancer module configured to: for the feature F (c×w×h), obtain a feature Q_p ((h*w)×c) and a feature H_p (c×(h*w)) by convolution and matrix dimension-changing operations, where c is the number of channels, w is the width and h is the height; perform matrix multiplication on the feature Q_p ((h*w)×c) and the feature H_p (c×(h*w)) to obtain a matrix M_p ((h*w)×(h*w)); perform a regression operation on the matrix M_p ((h*w)×(h*w)) to obtain a weight M_p′ ((h*w)×(h*w)); perform a convolution operation on the feature F (c×w×h) to obtain a feature F_p′ (c×w×h); perform matrix multiplication on the weight M_p′ ((h*w)×(h*w)) and the feature F_p′ (c×w×h) to obtain an enhanced feature F_h′ (c×w×h); and perform pixel-level addition on the enhanced feature F_h′ (c×w×h) and the feature F_p′ (c×w×h) to obtain a pixel-level enhanced feature F_p (c×w×h).
In some alternative implementations of the present embodiment, model training module 802 includes a feature fusion sub-module configured to: fuse the channel-level enhanced feature F_c (c×w×h) and the pixel-level enhanced feature F_p (c×w×h) based on learnable parameters to obtain a fusion feature.
In some optional implementations of the present embodiment, the depth information complement model training apparatus 800 further includes: a test data acquisition module configured to acquire test data, wherein the test data includes a second sparse depth image, a corresponding second color image, and a corresponding second dense depth image; an information complement module configured to input the second sparse depth image and the second color image into the depth information complement model to obtain a complemented dense depth image; and a model optimization module configured to calculate a residual between the complemented dense depth image and the second dense depth image, and to optimize the depth information complement model if the residual is greater than a preset threshold.
In some optional implementations of this embodiment, the manner in which the depth information completion model is optimized includes at least one of: increasing the number of basic training units, increasing the number of training times, and increasing training data.
In some optional implementations of the present embodiment, the training data acquisition module 801 is further configured to: acquiring a first dense depth image and a first color image; and randomly sampling the first dense depth image to generate a first sparse depth image.
In some optional implementations of the present embodiment, the training data acquisition module 801 is further configured to: acquiring a first sparse depth image and a first color image; and projecting adjacent multiframes in the first sparse depth image into one frame to generate a first dense depth image.
According to embodiments of the present application, there is also provided an electronic device, a readable storage medium, and a program product.
As shown in fig. 9, a block diagram of an electronic device according to a depth information completion model training method according to an embodiment of the present application is shown. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 9, the electronic device includes: one or more processors 901, a memory 902, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). In fig. 9, a single processor 901 is taken as an example.
Memory 902 is a non-transitory computer-readable storage medium provided herein. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the depth information complement model training method provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the depth information complement model training method provided by the present application.
The memory 902 is used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules (e.g., the training data acquisition module 801 and the model training module 802 shown in fig. 8) corresponding to the depth information complement model training method in the embodiments of the present application. The processor 901 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 902, i.e., implements the depth information complement model training method in the above-described method embodiment.
The memory 902 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for a function; the storage data area may store data created according to the use of the electronic device of the depth information complement model training method, and the like. In addition, the memory 902 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 902 optionally includes memory remotely located relative to processor 901, which may be connected to the electronics of the depth information complement model training method via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the depth information complement model training method may further include: an input device 903 and an output device 904. The processor 901, memory 902, input devices 903, and output devices 904 may be connected by a bus or other means, for example in fig. 9.
The input device 903 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device of the depth information complement model training method, and may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a joystick, one or more mouse buttons, a track ball, and the like. The output device 904 may include a display device, auxiliary lighting means (e.g., LEDs), tactile feedback means (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASIC (application specific integrated circuit), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the application, training data are firstly obtained; and then taking the first sparse depth image and the first color image as input, taking the first dense depth image as output, and training to obtain a depth information complement model. In the training process, the sparse depth image and the corresponding color image are taken as input, and the features of the sparse depth image and the color image are fused for feature enhancement, so that a model for effectively acquiring the dense depth image can be trained.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions disclosed in the present application can be achieved, and are not limited herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.
Claims (20)
1. A depth information complement model training method comprises the following steps:
acquiring training data, wherein the training data comprises a first sparse depth image, a corresponding first color image and a corresponding first dense depth image;
taking the first sparse depth image and the first color image as input, taking the first dense depth image as output, and training to obtain a depth information complement model, wherein in the training process, features respectively extracted from the first sparse depth image and the first color image are subjected to feature fusion and then feature enhancement, the depth information complement model comprises K cascaded basic training units, and each basic training unit comprises a feature enhancement module used for channel-level feature enhancement and pixel-level feature enhancement, and for channel-level feature and pixel-level feature fusion;
The channel-level feature enhancement step of the feature enhancement module comprises the following steps:
for the feature F (c×w×h), obtaining a feature Q_c (c×(h*w)) and a feature H_c ((h*w)×c) by convolution and matrix dimension-changing operations, where c is the number of channels, w is the width and h is the height;
performing matrix multiplication on the feature Q_c (c×(h*w)) and the feature H_c ((h*w)×c) to obtain a matrix M_c (c×c);
performing a regression operation on the matrix M_c (c×c) to obtain a weight M_c′ (c×c);
performing a convolution operation on the feature F (c×w×h) to obtain a feature F_c′ (c×w×h);
performing matrix multiplication on the weight M_c′ (c×c) and the feature F_c′ (c×w×h) to obtain an enhanced feature F_h (c×w×h);
performing pixel-level addition on the enhanced feature F_h (c×w×h) and the feature F_c′ (c×w×h) to obtain a channel-level enhanced feature F_c (c×w×h).
2. The method of claim 1, wherein,
the training to obtain a depth information complement model takes the first sparse depth image and the first color image as input and the first dense depth image as output includes:
inputting an input depth image and the first color image into a current basic training unit to obtain a current depth image, inputting the current depth image and the first color image into a next basic training unit, and performing K-stage training to output a final depth image, wherein if the current basic training unit is a first basic training unit, the input depth image is the first sparse depth image, and if the current basic training unit is not the first basic training unit, the input depth image is the depth image output by the last basic training unit;
And adjusting parameters of the K basic training units based on residual errors between the depth images output by the K basic training units and the first dense depth image.
3. The method of claim 2, wherein the basic training unit further comprises an encoding module and a decoding module, the encoding module comprises a plurality of convolution layers for respectively performing feature extraction on the input depth image and the first color image, the feature enhancement module is used for performing feature enhancement on the fusion features of the input depth image and the first color image, the decoding module comprises the same number of convolution layers as the encoding module, and is a reverse operation of the encoding module, and the basic training unit adds and outputs the output of the decoding module and the input depth image.
4. A method according to claim 3, wherein the pixel-level feature enhancement step of the feature enhancement module comprises:
for the feature F (c×w×h), obtaining a feature Q_p ((h*w)×c) and a feature H_p (c×(h*w)) by convolution and matrix dimension-changing operations, where c is the number of channels, w is the width and h is the height;
performing matrix multiplication on the feature Q_p ((h*w)×c) and the feature H_p (c×(h*w)) to obtain a matrix M_p ((h*w)×(h*w));
performing a regression operation on the matrix M_p ((h*w)×(h*w)) to obtain a weight M_p′ ((h*w)×(h*w));
performing a convolution operation on the feature F (c×w×h) to obtain a feature F_p′ (c×w×h);
performing matrix multiplication on the weight M_p′ ((h*w)×(h*w)) and the feature F_p′ (c×w×h) to obtain an enhanced feature F_h′ (c×w×h);
performing pixel-level addition on the enhanced feature F_h′ (c×w×h) and the feature F_p′ (c×w×h) to obtain a pixel-level enhanced feature F_p (c×w×h).
5. The method of claim 4, wherein the feature fusion step of the feature enhancement module comprises:
based on learnable parameters, fusing the channel-level enhanced feature F_c (c×w×h) and the pixel-level enhanced feature F_p (c×w×h) to obtain a fusion feature.
6. The method according to one of claims 1-5, wherein the method further comprises:
obtaining test data, wherein the test data comprises a second sparse depth image, a corresponding second color image and a corresponding second dense depth image;
inputting the second sparse depth image and the second color image into the depth information complement model to obtain a complement dense depth image;
calculating a residual error between the full dense depth image and the second dense depth image, and optimizing the depth information full model if the residual error is greater than a preset threshold.
7. The method of claim 6, wherein optimizing the depth information complement model comprises at least one of: increasing the number of basic training units, increasing the number of training times, and increasing training data.
8. The method according to one of claims 1-5, wherein the acquiring training data comprises:
acquiring the first dense depth image and the first color image;
and randomly sampling the first dense depth image to generate the first sparse depth image.
9. The method according to one of claims 1-5, wherein the acquiring training data comprises:
acquiring the first sparse depth image and the first color image;
projecting multiple adjacent frames of the first sparse depth image into a single frame to generate the first dense depth image.
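A rough sketch of the multi-frame projection in claim 9, under stated assumptions: the 4×4 relative poses (adjacent frame to reference frame) and the 3×3 camera intrinsic matrix are known, and occlusion handling is omitted. The claim itself does not specify these details.

```python
import torch


def accumulate_adjacent_frames(sparse_depths, poses, intrinsics):
    """Densify a reference sparse depth frame by projecting the valid points
    of adjacent frames into the reference view (claim 9, illustrative only)."""
    ref = sparse_depths[0].clone()                                   # reference frame, shape (h, w)
    h, w = ref.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    for depth, pose in zip(sparse_depths[1:], poses):
        valid = depth > 0
        z = depth[valid]
        pix = torch.stack([xs[valid] * z, ys[valid] * z, z])         # pixel coords scaled by depth, (3, n)
        cam = torch.linalg.inv(intrinsics) @ pix                     # back-project to 3-D camera points
        cam = torch.cat([cam, torch.ones(1, cam.shape[1])])          # homogeneous coordinates, (4, n)
        proj = intrinsics @ (pose @ cam)[:3]                         # transform and re-project, (3, n)
        u = (proj[0] / proj[2]).round().long()
        v = (proj[1] / proj[2]).round().long()
        keep = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (proj[2] > 0)
        u, v, z_new = u[keep], v[keep], proj[2][keep]
        empty = ref[v, u] == 0                                       # only fill pixels still missing
        ref[v[empty], u[empty]] = z_new[empty]
    return ref
```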
10. A depth information completion model training device, comprising:
a training data acquisition module configured to acquire training data, wherein the training data comprises a first sparse depth image, a corresponding first color image, and a corresponding first dense depth image;
a model training module configured to take the first sparse depth image and the first color image as input and the first dense depth image as output, and to train a depth information completion model, wherein, during training, features extracted from the first sparse depth image and the first color image are subjected to feature fusion and feature enhancement, the depth information completion model comprises K cascaded basic training units, and each basic training unit comprises a feature enhancement module used for channel-level feature enhancement, pixel-level feature enhancement, and fusion of the channel-level and pixel-level features;
wherein the model training module comprises a channel-level feature enhancement sub-module configured to:
for the feature F (c×w×h), obtain a feature Q_c (c×(h*w)) and a feature H_c ((h*w)×c) through convolution and matrix reshaping operations, where c is the number of channels, w is the width, and h is the height;
perform matrix multiplication on the feature Q_c (c×(h*w)) and the feature H_c ((h*w)×c) to obtain a matrix M_c (c×c);
perform a regression operation on the matrix M_c (c×c) to obtain a weight M_c′ (c×c);
perform a convolution operation on the feature F (c×w×h) to obtain a feature F_c′ (c×w×h);
perform matrix multiplication on the weight M_c′ (c×c) and the feature F_c′ (c×w×h) to obtain an enhanced feature F_h (c×w×h);
perform pixel-level addition on the enhanced feature F_h (c×w×h) and the feature F_c′ (c×w×h) to obtain a channel-level enhanced feature F_c (c×w×h).
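A minimal sketch of the channel-level enhancement steps recited in claim 10, analogous to the pixel-level sketch after claim 4. Interpreting the 'regression operation' as a softmax is an assumption.

```python
import torch
import torch.nn as nn


class ChannelLevelEnhancement(nn.Module):
    """Channel-level enhancement sketched from claim 10: a c x c channel
    attention map reweights the convolved feature and the result is added
    back pixel-wise."""

    def __init__(self, c: int):
        super().__init__()
        self.q_conv = nn.Conv2d(c, c, kernel_size=1)
        self.h_conv = nn.Conv2d(c, c, kernel_size=1)
        self.f_conv = nn.Conv2d(c, c, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        q = self.q_conv(feat).view(b, c, h * w)                        # Q_c: c x (h*w)
        k = self.h_conv(feat).view(b, c, h * w).permute(0, 2, 1)       # H_c: (h*w) x c
        m = torch.softmax(torch.bmm(q, k), dim=-1)                     # weight M_c': c x c
        f = self.f_conv(feat)                                          # F_c': c x w x h
        attended = torch.bmm(m, f.view(b, c, h * w))                   # F_h: c x (h*w)
        return attended.view(b, c, h, w) + f                           # pixel-level addition -> F_c
```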
11. The apparatus of claim 10, wherein the model training module is further configured to:
input an input depth image and the first color image into a current basic training unit to obtain a current depth image, input the current depth image and the first color image into the next basic training unit, and perform K stages of training to output a final depth image, wherein if the current basic training unit is the first basic training unit, the input depth image is the first sparse depth image, and otherwise the input depth image is the depth image output by the preceding basic training unit;
and adjust parameters of the K basic training units based on residual errors between the depth images output by the K basic training units and the first dense depth image.
12. The apparatus of claim 11, wherein the basic training unit further comprises an encoding module and a decoding module, the encoding module comprises a plurality of convolution layers for respectively extracting features from the input depth image and the first color image, the feature enhancement module is configured to perform feature enhancement on the fused features of the input depth image and the first color image, the decoding module comprises the same number of convolution layers as the encoding module and performs the reverse operation of the encoding module, and the basic training unit adds the output of the decoding module to the input depth image and outputs the result.
13. The apparatus of claim 12, wherein the model training module comprises a pixel-level feature enhancement sub-module configured to:
for the feature F (c×w×h), obtain a feature Q_p ((h*w)×c) and a feature H_p (c×(h*w)) through convolution and matrix reshaping operations, where c is the number of channels, w is the width, and h is the height;
perform matrix multiplication on the feature Q_p ((h*w)×c) and the feature H_p (c×(h*w)) to obtain a matrix M_p ((h*w)×(h*w));
perform a regression operation on the matrix M_p ((h*w)×(h*w)) to obtain a weight M_p′ ((h*w)×(h*w));
perform a convolution operation on the feature F (c×w×h) to obtain a feature F_p′ (c×w×h);
perform matrix multiplication on the weight M_p′ ((h*w)×(h*w)) and the feature F_p′ (c×w×h) to obtain an enhanced feature F_h′ (c×w×h);
perform pixel-level addition on the enhanced feature F_h′ (c×w×h) and the feature F_p′ (c×w×h) to obtain a pixel-level enhanced feature F_p (c×w×h).
14. The apparatus of claim 13, wherein the model training module comprises a feature fusion sub-module configured to:
fuse the channel-level enhanced feature F_c (c×w×h) and the pixel-level enhanced feature F_p (c×w×h) based on learnable parameters to obtain a fused feature.
15. The apparatus according to one of claims 10-14, wherein the apparatus further comprises:
a test data acquisition module configured to acquire test data, wherein the test data includes a second sparse depth image, a corresponding second color image, and a corresponding second dense depth image;
an information completion module configured to input the second sparse depth image and the second color image into the depth information completion model to obtain a completed dense depth image;
a model optimization module configured to calculate a residual between the completed dense depth image and the second dense depth image, and to optimize the depth information completion model if the residual is greater than a preset threshold.
16. The apparatus of claim 15, wherein the manner of optimizing the depth information completion model comprises at least one of: increasing the number of basic training units, increasing the number of training iterations, and adding more training data.
17. The apparatus of one of claims 10-14, wherein the training data acquisition module is further configured to:
acquire the first dense depth image and the first color image;
and randomly sample the first dense depth image to generate the first sparse depth image.
18. The apparatus of one of claims 10-14, wherein the training data acquisition module is further configured to:
acquire the first sparse depth image and the first color image;
and project multiple adjacent frames of the first sparse depth image into a single frame to generate the first dense depth image.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011563584.5A CN112560875B (en) | 2020-12-25 | 2020-12-25 | Depth information complement model training method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112560875A CN112560875A (en) | 2021-03-26 |
CN112560875B true CN112560875B (en) | 2023-07-28 |
Family
ID=75032363
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011563584.5A Active CN112560875B (en) | 2020-12-25 | 2020-12-25 | Depth information complement model training method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112560875B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112541482B (en) * | 2020-12-25 | 2024-04-02 | 北京百度网讯科技有限公司 | Depth information complement model training method, device, equipment and storage medium |
CN113780349B (en) * | 2021-08-09 | 2023-07-11 | 深圳奥锐达科技有限公司 | Training sample set acquisition method, model training method and related device |
CN113763447B (en) * | 2021-08-24 | 2022-08-26 | 合肥的卢深视科技有限公司 | Method for completing depth map, electronic device and storage medium |
CN114358055A (en) * | 2021-12-16 | 2022-04-15 | 中国人民解放军战略支援部队信息工程大学 | Wireless communication signal specification identification method and system based on deep learning |
CN114638869A (en) * | 2022-02-16 | 2022-06-17 | 奥比中光科技集团股份有限公司 | Method and device for acquiring depth image, depth system, terminal and storage medium |
CN114897955B (en) * | 2022-04-25 | 2023-04-18 | 电子科技大学 | Depth completion method based on micro-geometric propagation |
CN117953029B (en) * | 2024-03-27 | 2024-06-07 | 北京科技大学 | General depth map completion method and device based on depth information propagation |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019174377A1 (en) * | 2018-03-14 | 2019-09-19 | 大连理工大学 | Monocular camera-based three-dimensional scene dense reconstruction method |
WO2019178253A1 (en) * | 2018-03-13 | 2019-09-19 | Magic Leap, Inc. | Image-enhanced depth sensing using machine learning |
WO2020000390A1 (en) * | 2018-06-29 | 2020-01-02 | Baidu.Com Times Technology (Beijing) Co., Ltd. | Systems and methods for depth estimation via affinity learned with convolutional spatial propagation networks |
CN110992271A (en) * | 2020-03-04 | 2020-04-10 | 腾讯科技(深圳)有限公司 | Image processing method, path planning method, device, equipment and storage medium |
WO2020119527A1 (en) * | 2018-12-11 | 2020-06-18 | 中国科学院深圳先进技术研究院 | Human action recognition method and apparatus, and terminal device and storage medium |
CN111932464A (en) * | 2020-09-18 | 2020-11-13 | 北京百度网讯科技有限公司 | Super-resolution model using and training method, device, equipment and medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9633282B2 (en) * | 2015-07-30 | 2017-04-25 | Xerox Corporation | Cross-trained convolutional neural networks using multimodal images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |