CN115760942A - Monocular depth estimation method and device based on neural network and edge computing chip - Google Patents


Info

Publication number
CN115760942A
Authority
CN
China
Prior art keywords
network
depth estimation
tensor
convolution
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211330237.7A
Other languages
Chinese (zh)
Inventor
孟子阳
沈王天
尤政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202211330237.7A
Publication of CN115760942A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a monocular depth estimation method and device based on a neural network and an edge computing chip, wherein the method comprises the following steps: acquiring a training data set of camera images, and training a depth estimation network with the training data set to obtain a trained depth estimation network; quantizing the network parameters of the trained depth estimation network using a plurality of pictures from the training data set to obtain a convolution tensor; dividing the convolution tensor using an optimal tensor-division mode to generate network code data; and computing, from a real-time camera image and the network code data, the depth map estimated for that image by the trained depth estimation network. Deployment and operation of the monocular depth estimation network are realized on an ultra-low-power computing chip, which is of great significance for nano-type unmanned platforms that can only use a monocular camera as a perception unit, and can markedly improve such a platform's understanding of its scene.

Description

Monocular depth estimation method and device based on neural network and edge computing chip
Technical Field
The invention relates to the technical field of image processing, in particular to a monocular depth estimation method and device based on a neural network and an edge computing chip.
Background
Depth information is crucial for many tasks performed by robots, such as mapping, positioning, and obstacle avoidance. Existing depth sensors (e.g., LiDAR, structured-light sensors) are typically bulky, heavy, and power-hungry. These limitations make them unsuitable for nano-robotic platforms (e.g., nano unmanned aerial vehicles). In contrast, nano-type unmanned platforms are low-cost, compact, and energy-efficient, which motivates the use of monocular cameras for depth estimation.
Previous research on monocular depth estimation has mainly focused on improving accuracy; the resulting algorithms have high computational complexity and memory footprints, which makes them difficult to deploy on robotic systems, particularly on platforms with limited computational resources and power budgets. A key challenge is therefore balancing the accuracy of the algorithm against its demands on computational resources.
Disclosure of Invention
The present invention is directed to solving, at least in part, one of the technical problems in the related art.
Therefore, the invention provides a monocular depth estimation method based on a neural network and an edge computing chip, which focuses on a lighter monocular depth estimation network and on its quantization and deployment on a low-power chip. The method adopts a lighter encoder at the network design level, prunes the network with a network pruning method, performs 8-bit quantization of the trained network parameters, and deploys the computation of the network on a GAP8 chip using a reasonable partitioning mode; through these operations, real-time operation of the monocular depth estimation network is realized on the low-power GAP8 chip. The invention is of great significance for intelligent applications on nano robot platforms. Experiments show that the method delivers reliable results at an acceptable speed and has good engineering application value.
Another objective of the present invention is to provide a monocular depth estimation device based on a neural network and an edge computing chip.
In order to achieve the above object, an aspect of the present invention provides a monocular depth estimation method based on a neural network and an edge computing chip, including:
acquiring a training data set of a camera image, and training a depth estimation network by using the training data set to obtain a trained depth estimation network;
quantizing the network parameters of the trained depth estimation network by using a plurality of pictures in the training data set to obtain a convolution tensor;
dividing the convolution tensor by using an optimal tensor division mode to generate network code data;
and computing, by using the real-time camera image and the network code data, the depth map estimated for the real-time camera image through the trained depth estimation network.
In addition, the monocular depth estimation method based on the neural network and the edge computing chip according to the above embodiment of the present invention may further have the following additional technical features:
further, in an embodiment of the present invention, after obtaining the trained depth estimation network, the method further includes: and pruning the trained depth estimation network by using a NetAdapt algorithm to filter a preset number of convolution kernels in the network.
Further, in an embodiment of the present invention, the quantizing the network parameters of the trained depth estimation network by using multiple pictures in a training data set to obtain a convolution tensor includes:
acquiring a plurality of pictures in the training data set as the input of the trained depth estimation network to obtain the reference range [α_t, β_t) of the tensor t of each convolutional layer, and mapping it to an N-bit pure-integer tensor t̂:

t̂ = round((t - α_t) / ε_t)

ε_t = (β_t - α_t) / (2^N - 1)

where ε_t is a scaling factor.
Further, in an embodiment of the present invention, the pruning the trained depth estimation network by using a NetAdapt algorithm to filter out a preset number of convolution kernels in the network includes: performing multiple rounds of iterative pruning processing on the trained depth estimation network by using a NetAdapt pruning method, and in each round of iteration, selectively deleting a preset number of convolution kernels from each layer of the network to obtain a plurality of sub-networks; and selecting the sub-network with the highest accuracy from the plurality of sub-networks to perform the next iteration.
Further, in an embodiment of the present invention, the dividing the convolution tensor by the optimal tensor division manner to generate the network code data includes: and selecting an optimal tensor division mode by using an AutoTiler network tool to divide the convolution tensor so as to encapsulate the calculation process of the trained depth estimation network on the ultra-low power consumption edge calculation chip into code data expressed by the C language.
In order to achieve the above object, another aspect of the present invention provides a monocular depth estimation device based on a neural network and an edge computing chip, including:
the network training module is used for acquiring a training data set of the camera image and training the depth estimation network by using the training data set to obtain a trained depth estimation network;
the convolution quantization module is used for performing quantization operation on the network parameters of the trained depth estimation network by utilizing a plurality of pictures in the training data set to obtain a convolution tensor;
the tensor dividing module is used for dividing the convolution tensor by utilizing an optimal tensor dividing mode to generate network code data;
and the depth estimation module is used for obtaining a depth map estimated by the real-time camera image through network calculation of the trained depth estimation network by utilizing the real-time camera image and the network code data.
The monocular depth estimation method and device based on a neural network and an edge computing chip provided by the embodiments of the invention use a convolutional neural network to extract depth features of a picture for encoding and decoding, so as to estimate the depth values of the picture's different pixels; the trained network parameters are quantized, and each layer's convolution operation is decomposed, so that the network can be deployed on an ultra-low-power edge computing chip.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a monocular depth estimation method based on a neural network and an edge computing chip according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a GAP8 chip architecture according to an embodiment of the invention;
FIG. 3 is a schematic diagram of a monocular depth estimation neural network according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating the operation effect of the ultra-low power chip-based monocular depth estimation network according to the embodiment of the present invention;
fig. 5 is a schematic diagram of a NetAdapt iterative pruning method according to an embodiment of the present invention;
FIG. 6 is a schematic representation of a convolutional layer feature space according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a monocular depth estimation device based on a neural network and an edge computing chip according to an embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
The following describes a monocular depth estimation method and apparatus based on a neural network and an edge computing chip according to an embodiment of the present invention with reference to the accompanying drawings.
FIG. 1 is a flowchart of a monocular depth estimation method based on a neural network and an edge computing chip according to an embodiment of the present invention.
As shown in fig. 1, the method includes, but is not limited to, the steps of:
s1, acquiring a training data set of a camera image, and training a depth estimation network by using the training data set to obtain a trained depth estimation network;
s2, carrying out quantization operation on the network parameters of the trained depth estimation network by using a plurality of pictures in the training data set to obtain a convolution tensor;
s3, dividing the convolution tensor by using an optimal tensor dividing mode to generate network code data;
and S4, obtaining a depth map estimated by the real-time camera image through network calculation of the trained depth estimation network by utilizing the real-time camera image and the network code data.
Specifically, in the embodiment of the present invention, a suitable depth data set is selected for training the parameters of the depth estimation network so that the estimation accuracy of the network meets the requirements; the NYU Depth v2 dataset is used in this embodiment;
in order to further lighten the given network, pruning is carried out on the trained network, some convolution kernels in the network are removed, and the network is pruned by adopting a NetAdapt algorithm.
The network parameters are quantized by first selecting several pictures as input to obtain the reference range [α_t, β_t) of the tensor t of each convolutional layer, which is then mapped to an N-bit pure-integer tensor t̂:

t̂ = round((t - α_t) / ε_t)

ε_t = (β_t - α_t) / (2^N - 1)

where ε_t, commonly referred to as the scaling factor, scales the tensor from a floating-point to an integer representation. The quantization flow enforces this integer representation for all tensors in the network. Quantization was performed using the NNTOOL tool developed by GWT (GreenWaves Technologies).
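As an illustration only (not the NNTOOL implementation), the per-tensor affine quantization just described can be sketched in plain Python; the function names and list-based tensors are hypothetical:

```python
def quantize_tensor(t, alpha_t, beta_t, n_bits=8):
    """Map floating-point values in [alpha_t, beta_t) to N-bit integers.

    eps_t = (beta_t - alpha_t) / (2**N - 1) is the scaling factor from
    the description; values are clipped to the calibrated range first.
    """
    eps_t = (beta_t - alpha_t) / (2 ** n_bits - 1)
    q = [round((min(max(x, alpha_t), beta_t) - alpha_t) / eps_t) for x in t]
    return q, eps_t

def dequantize_tensor(q, eps_t, alpha_t):
    # Approximate recovery of the original values: t ≈ eps_t * q + alpha_t
    return [eps_t * x + alpha_t for x in q]

# Calibrate the range from a few sample activations, then quantize to 8 bits
activations = [-1.0, -0.5, 0.0, 0.5, 1.0]
q, eps = quantize_tensor(activations, alpha_t=-1.0, beta_t=1.0)
```

Dequantizing the integers recovers each value to within one quantization step ε_t, which is the precision loss the calibration range trades against.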
Deploying a model on the GAP8 chip means exploiting the hardware platform by generating C code that directly controls the underlying memory and computation. The main difficulty is maximizing parallel execution across all available cores while minimizing data-transfer overhead. On the GAP8 chip, the main challenge is the limited L1 memory (64 kB), which forces the deployment tool to solve an optimization problem: the tensors in the network computation are divided into smaller chunks, called tiles, that are moved between L2 and L1 memory. The invention uses the AutoTiler tool to select the best partitioning and packages it into generated C code.
The computation of the network on the ultra-low-power edge computing chip is packaged as code expressed in the C language. Before calling this code, a camera image must be acquired: the image is obtained through an interface in the gap_sdk and stored into an array; the image data are then fed in, and the estimated depth information is obtained through the computation of the network. This process is run in a loop to obtain a depth map estimated from the camera image in real time.
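The acquire-infer-repeat cycle described here can be outlined as follows; `capture_frame` and `run_network` are hypothetical stand-ins for the gap_sdk camera interface and the generated network code, neither of which appears in this document:

```python
def depth_stream(capture_frame, run_network, n_frames):
    """Sketch of the real-time loop: grab a frame, run the quantized
    network on it, collect the estimated depth map, and repeat."""
    depths = []
    for _ in range(n_frames):
        image = capture_frame()          # camera interface fills an array
        depth_map = run_network(image)   # quantized network inference
        depths.append(depth_map)
    return depths
```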
This embodiment is deployed on the AIdeck platform, which mainly comprises a GreenWaves Technologies GAP8 chip and a Himax camera; it extends the available computing capability so that complex artificial-intelligence workloads can run on a nano-type unmanned platform. The GAP8 chip is a commercial embedded RISC-V multi-core processor from the PULP open-source project. The core of the GAP8 consists of an advanced RISC-V MCU and a programmable eight-core cluster, whose architecture is shown in fig. 2. The code is implemented in Python, with PyTorch as the deep learning framework.
In this embodiment, the NYU Depth v2 dataset is selected to train and prune the network and to verify the final deployment result. The network structure follows that of FastDepth: the MobileNetV1 of the front-end encoder is replaced by the lighter MobileNetV2 network, and several upsampling layers and cross-layer connections are added at the back end; the network structure is shown in fig. 3. The NetAdapt pruning method is used to perform at most 23 rounds of iterative pruning on the network; the main principle of the pruning algorithm is shown in fig. 5. In each iteration, NetAdapt selects and deletes some convolution kernels from each layer of the network to obtain several sub-networks, fine-tunes them, and selects the sub-network with the highest accuracy for the next iteration.
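A minimal sketch of this iterative selection loop, under the simplifying assumption that a network is just a list of per-layer kernel counts and that the fine-tuning step is omitted (all names are illustrative, not NetAdapt's actual API):

```python
def netadapt_prune(network, evaluate, prune_layer, num_layers,
                   kernels_per_round=2, rounds=23):
    """Each round: build one candidate sub-network per layer by removing
    a few convolution kernels from that layer, then keep the candidate
    with the highest accuracy (short fine-tuning would happen between
    the two steps in the real method)."""
    for _ in range(rounds):
        candidates = [prune_layer(network, layer, kernels_per_round)
                      for layer in range(num_layers)]
        network = max(candidates, key=evaluate)  # most accurate sub-network
    return network
```

The accuracy proxy `evaluate` decides which layer loses kernels in each round; in the real algorithm this is measured after fine-tuning on the training set.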
The invention adopts the NNTOOL tool to perform 8-bit quantization of the network parameters so that the chip can run the network faster. NNTOOL is an NN mapping tool developed by GWT (GreenWaves Technologies) as part of the GAP8 software development kit. NNTOOL performs "layer fusion" and post-training calibration and quantization (8/16 bit), folding and fusing each batch-normalization (BN) layer into its predecessor convolutional layer, which avoids additional intermediate buffers and reduces memory consumption. The invention uses NNTOOL to perform 8-bit post-training quantization of the network and adopts a Conv-BN-ReLU fusion mode, which simplifies quantization and deployment and reduces post-quantization error. The tensor output by a Conv operation requires a higher-precision representation than the inputs and weights, using 32 bits; this does not mean that a complete tensor of 32-bit elements must be generated and stored during the computation. Rather, each element is produced by the Conv in 32 bits but is reduced to 8 bits immediately after passing through the ReLU or BN+ReLU operator. This not only greatly reduces the loss of precision but also speeds up the computation and avoids wasting memory.
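The Conv-BN fusion mentioned above relies on the standard folding identity: scaling each output channel's weights by γ_c / sqrt(σ²_c + ε) and adjusting the bias accordingly makes the folded convolution alone equal to convolution followed by batch normalization. A small sketch with list-based per-channel tensors, illustrative only:

```python
import math

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold a BN layer into its predecessor convolution, per channel.

    weight: list of per-output-channel weight lists; bias: per-channel.
    After folding, the folded conv alone equals conv followed by BN.
    """
    folded_w, folded_b = [], []
    for c in range(len(bias)):
        scale = gamma[c] / math.sqrt(var[c] + eps)
        folded_w.append([w * scale for w in weight[c]])
        folded_b.append((bias[c] - mean[c]) * scale + beta[c])
    return folded_w, folded_b
```

Because the folded layer is a single convolution, it quantizes as one operator instead of two, which is why the fusion reduces both error and memory traffic.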
The main difficulty in deploying a model on the GAP8 chip is maximizing parallel execution across all available cores while minimizing data-transfer overhead. On the GAP8, the main challenge is the limited L1 memory (64 kB), which forces the deployment tool to solve an optimization problem: the tensors in the network computation are divided into smaller chunks, called tiles, that are moved between L2 and L1 memory. This problem splits into two separate parts: 1) a set of optimized kernels that run exclusively on L1 data tiles; 2) a tiling solver that defines the optimal tile sizes and generates the code for the associated data transfers between L2 and L1, including double buffering of all tensors. With reference to fig. 6, each layer in a CNN operates on a three-dimensional input tensor representing a feature space (one feature map per channel) and produces a new 3-D activation tensor as output. In particular, a convolutional layer consists of a linear transform that maps the K_in input feature maps to the K_out output feature maps using K_in × K_out convolution filters (or a weight matrix). Thus, in any convolutional layer, we can identify three distinct data spaces that can be divided into tiles along one or more of the three dimensions (i.e., W, H, and K in fig. 6). Similar considerations apply to the other layers of the CNN, allowing them to be handled in the same manner. The invention uses the AutoTiler tool to explore a subset of this space, select the best tiling configuration, and package it into generated C code, with the cluster DMA controller efficiently moving data between L2 and L1 memory.
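A deliberately simplified sketch of the tiling arithmetic: given an activation tensor with W × K one-byte elements per row and an L1 budget, choose the largest tile height that fits, with double buffering doubling the resident footprint. The real AutoTiler solver also accounts for weights, output tiles, and kernel geometry; this toy version only illustrates the memory constraint:

```python
def choose_tile_height(width, channels, l1_budget=64 * 1024,
                       bytes_per_elem=1, double_buffered=True):
    """Largest tile height (rows of a H x W x K activation tensor) whose
    working set fits the cluster L1 budget; with double buffering two
    copies of each tile are resident at once while DMA fills the next."""
    copies = 2 if double_buffered else 1
    row_bytes = width * channels * bytes_per_elem
    return max(1, l1_budget // (copies * row_bytes))
```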
In this example, the parameters are set as follows: the chip CL (cluster) frequency is set to 175 MHz, the FC (fabric controller) frequency to 250 MHz, the allocated L1 memory to 46736 bytes, and the allocated L2 memory to 250000 bytes; the test results of this embodiment are shown in fig. 4. From left to right, the columns of the figure show the original image (grayscale), the ground-truth depth map, and the depth maps output by the original FastDepth network, the newly built monocular depth estimation network, and the networks pruned for 7, 15, and 23 iterations, all running on the GAP8. Although rebuilding and pruning the network to improve the running speed reduces the depth estimation accuracy compared with the original FastDepth, regions of greater and lesser depth in the picture can still be clearly distinguished, which is sufficient for the downstream tasks of the unmanned platform.
The monocular depth estimation method based on a neural network and an edge computing chip can be applied to robot platforms with extremely tight power budgets. By deploying the lightweight monocular depth estimation neural network on the ultra-low-power chip, inference at a 10 fps frame rate is achieved with only 393 mW of power consumption, so the method can be carried on a nano-type unmanned platform and used for its intelligent applications.
In order to implement the foregoing embodiments, as shown in fig. 7, this embodiment further provides a monocular depth estimation device 10 based on a neural network and an edge computing chip, where the device 10 includes a network training module 100, a convolution quantization module 200, a tensor division module 300, and a depth estimation module 400.
The network training module 100 is configured to acquire a training data set of a camera image, and train the depth estimation network by using the training data set to obtain a trained depth estimation network;
a convolution quantization module 200, configured to perform quantization operation on the network parameters of the trained depth estimation network by using multiple pictures in the training data set to obtain a convolution tensor;
the tensor division module 300 is configured to divide the convolution tensor by using an optimal tensor division manner to generate network code data;
and the depth estimation module 400 is configured to obtain a depth map estimated by the real-time camera image through network calculation of the trained depth estimation network by using the real-time camera image and the network code data.
Further, the device further includes, after the network training module 100:
and the network pruning module is used for carrying out pruning processing on the trained depth estimation network by utilizing a NetAdapt algorithm so as to filter out a preset number of convolution kernels in the network.
Further, the convolution quantization module 200 is further configured to:
obtaining a plurality of pictures in the training data set as the input of the trained depth estimation network to obtain the reference range [α_t, β_t) of the tensor t of each convolutional layer, and mapping it to an N-bit pure-integer tensor t̂:

t̂ = round((t - α_t) / ε_t)

ε_t = (β_t - α_t) / (2^N - 1)

where ε_t is a scaling factor.
Further, the network pruning module is further configured to:
performing multiple rounds of iterative pruning processing on the trained depth estimation network by using a NetAdapt pruning method, and in each round of iteration, respectively selecting and deleting a preset number of convolution kernels from each layer of the network to obtain a plurality of sub-networks;
and selecting the sub-network with the highest accuracy from the plurality of sub-networks to perform the next iteration.
Further, the tensor division module 300 is further configured to: and selecting an optimal tensor division mode by using an AutoTiler network tool to divide the convolution tensor so as to encapsulate the calculation process of the trained depth estimation network on the ultra-low power consumption edge calculation chip into code data expressed by the C language.
The monocular depth estimation device based on a convolutional neural network and an ultra-low-power edge computing chip can be applied to robot platforms with extremely tight power budgets. By deploying the lightweight monocular depth estimation neural network on the ultra-low-power chip, inference at a 10 fps frame rate is achieved with only 393 mW of power consumption, so the device can be carried on a nano-type unmanned platform and used for its intelligent applications.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless explicitly specified otherwise.

Claims (10)

1. A monocular depth estimation method based on a neural network and an edge computing chip is characterized by comprising the following steps of:
acquiring a training data set of a camera image, and training a depth estimation network by using the training data set to obtain a trained depth estimation network;
quantizing the network parameters of the trained depth estimation network by using a plurality of pictures in the training data set to obtain a convolution tensor;
dividing the convolution tensor by using an optimal tensor division mode to generate network code data;
and computing, by using the real-time camera image and the network code data, the depth map estimated for the real-time camera image through the trained depth estimation network.
2. The method of claim 1, wherein after obtaining the trained depth estimation network, the method further comprises:
and pruning the trained depth estimation network by using a NetAdapt algorithm to filter a preset number of convolution kernels in the network.
3. The method of claim 1, wherein quantizing the network parameters of the trained depth estimation network using the plurality of pictures in the training data set to obtain a convolution tensor comprises:
acquiring a plurality of pictures in the training data set as the input of the trained depth estimation network to obtain the reference range [α_t, β_t) of the tensor t of each convolutional layer, and mapping it to an N-bit pure-integer tensor t̂:

t̂ = round((t - α_t) / ε_t)

ε_t = (β_t - α_t) / (2^N - 1)

where ε_t is a scaling factor.
4. The method of claim 2, wherein pruning the trained depth estimation network using the NetAdapt algorithm to filter out a preset number of convolution kernels in the network comprises:
performing multiple rounds of iterative pruning processing on the trained depth estimation network by using a NetAdapt pruning method, and in each round of iteration, respectively selecting and deleting a preset number of convolution kernels from each layer of the network to obtain a plurality of sub-networks;
and selecting the sub-network with the highest accuracy from the plurality of sub-networks to perform the next iteration.
5. The method of claim 1, wherein the partitioning the convolution tensor using the optimal tensor partitioning approach to generate network code data comprises:
and selecting an optimal tensor division mode by using an AutoTiler network tool to divide the convolution tensor so as to package the calculation process of the trained depth estimation network on the ultra-low power consumption edge calculation chip into code data expressed by the C language.
6. A monocular depth estimation device based on a convolutional neural network and an ultra-low power consumption edge computing chip is characterized by comprising:
the network training module is used for acquiring a training data set of the camera image and training the depth estimation network by using the training data set to obtain a trained depth estimation network;
the convolution quantization module is used for performing quantization operation on the network parameters of the trained depth estimation network by utilizing a plurality of pictures in the training data set to obtain a convolution tensor;
the tensor dividing module is used for dividing the convolution tensor by utilizing an optimal tensor dividing mode to generate network code data;
and the depth estimation module is used for obtaining a depth map estimated by the real-time camera image through network calculation of the trained depth estimation network by utilizing the real-time camera image and the network code data.
7. The apparatus of claim 6, further comprising, after the network training module:
and the network pruning module is used for pruning the trained depth estimation network by utilizing a NetAdapt algorithm so as to filter a preset number of convolution kernels in the network.
8. The apparatus of claim 6, wherein the convolutional quantization module is further configured to:
obtaining a plurality of pictures in the training data set as the input of the trained depth estimation network to obtain the reference range [α_t, β_t) of the tensor t of each convolutional layer, and mapping it to an N-bit pure-integer tensor t̂:

t̂ = round((t - α_t) / ε_t)

ε_t = (β_t - α_t) / (2^N - 1)

where ε_t is a scaling factor.
9. The apparatus of claim 7, wherein the network pruning module is further configured to:
performing multiple rounds of iterative pruning processing on the trained depth estimation network by using a NetAdapt pruning method, and in each round of iteration, respectively selecting and deleting a preset number of convolution kernels from each layer of the network to obtain a plurality of sub-networks;
and selecting the sub-network with the highest accuracy from the plurality of sub-networks to perform the next iteration.
10. The apparatus of claim 6, wherein the tensor division module is further configured to: and selecting an optimal tensor division mode by using an AutoTiler network tool to divide the convolution tensor so as to package the calculation process of the trained depth estimation network on the ultra-low power consumption edge calculation chip into code data expressed by the C language.
CN202211330237.7A 2022-10-27 2022-10-27 Monocular depth estimation method and device based on neural network and edge computing chip Pending CN115760942A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211330237.7A CN115760942A (en) 2022-10-27 2022-10-27 Monocular depth estimation method and device based on neural network and edge computing chip


Publications (1)

Publication Number Publication Date
CN115760942A true CN115760942A (en) 2023-03-07

Family

ID=85354390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211330237.7A Pending CN115760942A (en) 2022-10-27 2022-10-27 Monocular depth estimation method and device based on neural network and edge computing chip

Country Status (1)

Country Link
CN (1) CN115760942A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination