CN112053374A - 3D target bounding box estimation system based on GIoU - Google Patents

3D target bounding box estimation system based on GIoU

Info

Publication number
CN112053374A
Authority
CN
China
Prior art keywords
giou
bounding box
max
point cloud
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010805891.3A
Other languages
Chinese (zh)
Inventor
杨武
孟涟肖
唐盖盖
苘大鹏
吕继光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN202010805891.3A priority Critical patent/CN112053374A/en
Publication of CN112053374A publication Critical patent/CN112053374A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10032Satellite or aerial image; Remote sensing
    • G06T2207/10044Radar image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of computer vision and pattern recognition, and particularly relates to a GIoU-based 3D target bounding box estimation system comprising a radar point cloud preprocessing module, a 2D image preprocessing module and a GIoU-based multi-source fusion module. Point cloud features are obtained through the radar point cloud preprocessing module, image features are obtained through the 2D image preprocessing module, the point cloud features and the image features are fused by the GIoU-based multi-source fusion module, and finally the estimation result of the 3D target bounding box is output. The invention solves the problem of low estimation accuracy of existing 3D target bounding boxes, can obviously improve the calibration accuracy of 3D targets, and achieves high-accuracy 3D target bounding box estimation.

Description

3D target bounding box estimation system based on GIoU
Technical Field
The invention belongs to the technical field of computer vision and pattern recognition, and particularly relates to a 3D target bounding box estimation system based on a GIoU.
Background
In recent years, unmanned driving has received wide attention from enterprises, researchers and the general public. At present there are two distinct routes to achieving it. One is the progressive approach adopted by traditional automotive enterprises: starting from existing driver-assistance systems, functions such as automatic steering and active collision avoidance are gradually added to realize conditional autonomy, and full unmanned driving is reached once cost and the related technologies meet the necessary requirements. The other is the "one-step" approach chosen mainly by high-tech IT enterprises, which aims directly at the final goal of unmanned driving: autonomy is not reached through human-machine cooperation, and human participation cannot be relied upon to guarantee the absolute safety of automatic driving. The latter route is more challenging and riskier, and therefore needs innovative algorithms and efficient, robust systems to support it. Against this need, object detection and localization are particularly important, because they determine whether an intelligent unmanned system can accurately "see" the scene in front of it and provide a large amount of useful information for decision making or planning. 3D object detection is an important topic in automatic driving and robotics, and detection accuracy is the current difficulty of 3D object detection technology.
Bounding box regression is one of the most fundamental components in many 2D/3D computer vision tasks: target localization, multi-target detection, target tracking and so on all rely on it. In recent years neural network technology has flourished, and its strong nonlinear fitting capability is well suited to the bounding box regression problem. The main trend for improving application performance with neural networks is to propose a better backbone or better strategies for extracting reliable local features. Although the latest convolutional neural networks have realized 2D target detection in complex environments, in practical application scenarios ordinary 2D target detection cannot provide all the information required for sensing the environment: it only gives the position of a target object in a 2D picture and the confidence of the corresponding category, whereas in the real three-dimensional world objects have three-dimensional shapes, and most applications require information such as the length, width, height and deflection angle of the target object. In recent years researchers have proposed several 3D target bounding box estimation methods, but the accuracy of their results is low because the loss functions of the neural network models they adopt are not defined precisely enough, so effectively improving 3D target bounding box estimation accuracy remains an open challenge.
Disclosure of Invention
The invention aims to solve the problem of low estimation accuracy of existing 3D target bounding boxes, and provides a 3D target bounding box estimation system based on GIoU (Generalized Intersection over Union).
The purpose of the invention is realized by the following technical scheme: the system comprises a radar point cloud preprocessing module, a 2D image preprocessing module and a GIoU-based multi-source fusion module. The radar point cloud preprocessing module converts input radar point cloud data into a digital feature representation of fixed dimensionality and transmits the digital features of the radar point cloud data to the GIoU-based multi-source fusion module. The 2D image preprocessing module converts input 2D image data into a digital feature representation of fixed dimensionality and transmits the digital features of the 2D image data to the GIoU-based multi-source fusion module. The GIoU-based multi-source fusion module fuses the digital features of the radar point cloud data and the digital features of the 2D image data into a 3D target bounding box estimation result, namely the coordinates of the 8 vertices of the predicted 3D bounding box, B_P = (x_1^P, y_1^P, z_1^P, …, x_8^P, y_8^P, z_8^P);
The GIoU-based multi-source fusion module is a Dense neural network model that uses GIoU LOSS as its loss function; GIoU LOSS is calculated as follows:
Step 1: input the coordinates of the 8 vertices of the real 3D bounding box, B_T = (x_1^T, y_1^T, z_1^T, …, x_8^T, y_8^T, z_8^T);
Step 2: calculate the length L_T, width W_T and height H_T of the real 3D bounding box, and the length L_P, width W_P and height H_P of the predicted 3D bounding box;
Step 3: of the real 3D bounding box and the predicted 3D bounding box, take the one whose center point is closer to the origin and obtain the coordinates of its upper-right vertex, MAX = (x_MAX, y_MAX, z_MAX); obtain the coordinates of the lower-left vertex of the 3D bounding box whose center point is farther from the origin, MIN = (x_MIN, y_MIN, z_MIN);
Step 4: over all vertex coordinates of the real 3D bounding box and the predicted 3D bounding box, take the minimum x, y, z values x_MIN, y_MIN, z_MIN and the maximum x, y, z values x_MAX, y_MAX, z_MAX;
Step 5: calculate the length L_C, width W_C and height H_C of the smallest bounding box B_C that can enclose both the real 3D bounding box and the predicted 3D bounding box:
L_C = x_MAX - x_MIN
W_C = y_MAX - y_MIN
H_C = z_MAX - z_MIN
Step 6: calculate the value of GIoU:
IoU = V_I / (V_T + V_P - V_I)
GIoU = IoU - (V_C - (V_T + V_P - V_I)) / V_C
where V_I is the intersection volume of the real and predicted 3D bounding boxes, computed from the MAX and MIN vertices obtained in step 3; V_T is the volume of the real 3D bounding box, V_T = L_T * W_T * H_T; V_P is the volume of the predicted 3D bounding box, V_P = L_P * W_P * H_P; and V_C is the volume of B_C, V_C = L_C * W_C * H_C;
Step 7: calculate the value of the loss function GIoU LOSS:
GIoU LOSS = 1 - GIoU.
The present invention may further comprise:
The Dense neural network model comprises a three-layer structure. The first layer of the Dense neural network model is an input layer, in which the number of neurons equals the dimension of the input features; each neuron corresponds in turn to one dimension of the input vector and passes it directly to the neurons of the second layer, the input features comprising the digital features of the radar point cloud data and the digital features of the 2D image data. The second layer of the Dense neural network model is a Dense layer, which comprises a stack of several Dense (fully connected) layers and realizes the mapping from the input variables to the output variables. The third layer of the Dense neural network model is an output layer, which corresponds to the regression values of the 3D bounding box.
The radar point cloud preprocessing module is a PointNet neural network model; the PointNet network adopts a symmetric function (max-pooling) to achieve permutation invariance over the unordered three-dimensional point set. The 2D image preprocessing module is a Resnet50 neural network model, which learns the residual representation between input and output by using multiple parameter layers.
The invention has the beneficial effects that:
Point cloud features are obtained through the radar point cloud preprocessing module, image features are obtained through the 2D image preprocessing module, the point cloud features and the image features are fused by the GIoU-based multi-source fusion module, and finally the estimation result of the 3D target bounding box is output. The invention solves the problem of low estimation accuracy of existing 3D target bounding boxes, can obviously improve the calibration accuracy of 3D targets, and achieves high-accuracy 3D target bounding box estimation.
Drawings
Fig. 1 is a general framework schematic of the present invention.
FIG. 2 is a schematic diagram of a GIoU-based multi-source fusion module structure according to the present invention.
Fig. 3 is a general operational flow diagram of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention discloses a GIoU-based 3D target bounding box estimation system comprising a radar point cloud preprocessing module, a 2D image preprocessing module and a GIoU-based multi-source fusion module. Point cloud features are obtained through the radar point cloud preprocessing module and image features through the 2D image preprocessing module; the point cloud features and the image features are then fused by the GIoU-based multi-source fusion module, and finally the estimation result of the 3D target bounding box is output. The invention solves the problem of low estimation accuracy of existing 3D target bounding boxes, can obviously improve the calibration accuracy of 3D targets, and achieves high-accuracy 3D target bounding box estimation.
A GIoU-based 3D target bounding box estimation system comprises a radar point cloud preprocessing module, a 2D image preprocessing module and a GIoU-based multi-source fusion module. The radar point cloud preprocessing module converts input radar point cloud data into a digital feature representation of fixed dimensionality and transmits the digital features of the radar point cloud data to the GIoU-based multi-source fusion module. The 2D image preprocessing module converts input 2D image data into a digital feature representation of fixed dimensionality and transmits the digital features of the 2D image data to the GIoU-based multi-source fusion module. The GIoU-based multi-source fusion module fuses the digital features of the radar point cloud data and the digital features of the 2D image data into a 3D target bounding box estimation result, namely the coordinates of the 8 vertices of the predicted 3D bounding box, B_P = (x_1^P, y_1^P, z_1^P, …, x_8^P, y_8^P, z_8^P);
The multi-source fusion module based on the GIoU is a Dense neural network model, and the GIoU LOSS is used as a LOSS function; the GIoU LOSS calculation method comprises the following steps:
step 1: inputting 8 vertex coordinates B of a real 3D bounding boxT=(xT 1,yT 1,zT 1,…,xT 8,yT 8,zT 8);
Step 2: calculating the length L of the real 3D bounding boxTWidth WTAnd height HT(ii) a Calculating the predicted 3D bounding box length LPWidth WPAnd height HP
And step 3: selecting a real 3D boundary frame and a 3D boundary frame with a center point closer to an original point from the predicted 3D boundary frame, and acquiring a vertex coordinate MAX (x) of the upper right corner of the 3D boundary frameMAX,yMAX,zMAX) And acquiring the coordinates MIN (x) of the vertex at the lower left corner of the 3D bounding box with the central point far away from the originMIN,yMIN,zMIN);
And 4, step 4: selecting the minimum x, y and z values x in the coordinates of all the vertexes of the real 3D bounding box and the predicted 3D bounding boxMIN、yMIN、zMINAnd the maximum x, y, Z values xMAX、yMAX、ZMAX
And 5: calculating a minimum bounding box B that can enclose the true 3D bounding box and the predicted 3D bounding boxcLength L ofCWidth WCAnd height HC
LC=xMAX-xMIN
WC=yMAX-yMIN
HC=zMAX-ZNIN
Step 6: calculating the value of the GIoU;
Figure BDA0002629104380000041
Figure BDA0002629104380000042
wherein, VTVolume of true 3D bounding box, VT=LT*WT*HT;VPFor the predicted volume of the 3D bounding box, VP=LP*WP*HP;VcIs BcVolume of (V)c=Lc*Wc*Hc
And 7: calculating the value of the LOSS function GIoU LOSS;
GIoU LOSS=1-GIoU。
Example 1:
The technical scheme of the invention comprises the following steps.
With reference to Fig. 3, the overall process of the invention is as follows:
First, the GIoU-based 3D target bounding box estimation device is built. The system consists of a radar point cloud preprocessing module, a 2D image preprocessing module and a GIoU-based multi-source fusion module. The radar point cloud preprocessing module is a PointNet neural network model, the 2D image preprocessing module is a Resnet50 neural network model, and the GIoU-based multi-source fusion module is a Dense neural network model.
The radar point cloud preprocessing module can convert point cloud data into a digital feature representation of fixed dimensions.
The 2D image preprocessing module can convert 2D image data into a fixed-dimension digital feature representation.
The GIoU-based multi-source fusion module can fuse the digital features of the point cloud data with the digital features of the 2D image data, and finally output the 3D target bounding box estimate.
Preferably, the PointNet network of the radar point cloud preprocessing module adopts a symmetric function (max-pooling) to achieve permutation invariance over the unordered three-dimensional point set, which enables high-precision fusion of the point cloud data.
Preferably, the Resnet50 network of the 2D image preprocessing module learns the residual representation between input and output by using a number of parameter layers, instead of directly trying to learn the input-to-output mapping with those layers as a general CNN does. Learning the residual in this way converges faster and is more effective than directly learning the mapping between input and output.
Preferably, the Dense network of the GIoU-based multi-source fusion module uses GIoU as its loss function; compared with the conventional mean squared error or mean absolute error loss, this guides the network more accurately, during training, in the direction that improves the 3D target bounding box estimation accuracy.
Second, the GIoU-based 3D target bounding box estimation device is initialized. The method comprises the following steps:
1) Radar point cloud data are input into the radar point cloud preprocessing module to obtain the digital feature representation of the point cloud.
2) The 2D image data are input into the 2D image preprocessing module to obtain the digital feature representation of the 2D image.
3) The digital feature representation of the point cloud and the digital feature representation of the 2D image are combined and input into the GIoU-based multi-source fusion module.
Third, the GIoU-based 3D target bounding box estimation device receives the source data file. The method comprises the following steps: the corresponding digital feature representation of a sample for 3D target detection is obtained through steps 1) and 2) of the second stage, and this representation is then input into the trained GIoU-based multi-source fusion module, whose output is the 3D target bounding box estimation result.
Fourth, one 3D target bounding box estimation is completed.
With reference to fig. 1, the estimation framework of the GIoU-based 3D target object bounding box according to the present invention includes a radar point cloud preprocessing module, a 2D image preprocessing module, and a GIoU-based multi-source fusion module.
The radar point cloud preprocessing module is a PointNet neural network model that can convert point cloud data into a digital feature representation of fixed dimensionality. The PointNet network adopts a symmetric function (max-pooling) to achieve permutation invariance over the unordered three-dimensional point set, enabling high-precision fusion of the point cloud data.
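The following minimal sketch (a PyTorch illustration under the assumption that per-point features have already been extracted, not the patent's PointNet code; sizes are illustrative) shows why a symmetric max-pooling function is insensitive to the ordering of the points:

import torch

# 1024 per-point feature vectors of dimension 64 (sizes are illustrative)
points = torch.rand(1024, 64)
shuffled = points[torch.randperm(points.shape[0])]   # same set, different order

# symmetric max-pooling over the point dimension gives one global descriptor
global_feat = points.max(dim=0).values
assert torch.equal(global_feat, shuffled.max(dim=0).values)  # order does not matter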
The 2D image preprocessing module is a Resnet50 neural network model that can convert 2D image data into a fixed-dimension digital feature representation. It learns residual representations between input and output by using multiple parameter layers, instead of directly attempting to learn the input-to-output mapping with those layers as a general CNN does. Learning the residual in this way converges faster and is more effective than directly learning the mapping between input and output.
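As a minimal sketch of this residual idea (a simplified basic block in PyTorch, not Resnet50's actual bottleneck design; the channel count is an illustrative assumption), the parameter layers learn only the residual F(x) and the block outputs F(x) + x:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Parameter layers learn the residual F(x); the block outputs F(x) + x."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.F = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # identity shortcut plus the learned residual
        return torch.relu(self.F(x) + x)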
The GIoU-based multi-source fusion module is a Dense neural network model that can fuse the digital features of the point cloud data with the digital features of the 2D image data and finally output the 3D target bounding box estimate. In particular, the Dense network takes GIoU as its loss function; compared with the conventional mean squared error or mean absolute error loss, this guides the network more accurately, during training, in the direction that improves the 3D target bounding box estimation accuracy.
Referring to fig. 2, the GIoU-based multi-source fusion module of the present invention includes a three-layer structure, in which,
The first layer is an input layer, in which the number of neurons equals the dimension of the input features; each neuron corresponds in turn to one dimension of the input vector and passes it directly to the neurons of the second layer. The input features include the digital features of the point cloud and the digital features of the 2D image.
The second layer is a Dense layer, comprising a stack of several Dense (fully connected) layers, and realizes the mapping from the input variables to the output variables;
The third layer is an output layer, which corresponds to the regression values of the 3D bounding box, specifically the center point coordinates and the length, width and height of the 3D bounding box.
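A minimal sketch of this three-layer structure, assuming PyTorch; the feature dimensions, hidden width and the 24-value (8-vertex) output are illustrative assumptions, and the center-plus-length/width/height parameterization mentioned above could be regressed instead:

import torch
import torch.nn as nn

class GIoUFusionHead(nn.Module):
    def __init__(self, point_feat_dim: int = 1024, image_feat_dim: int = 2048, hidden: int = 512):
        super().__init__()
        in_dim = point_feat_dim + image_feat_dim        # input layer: one neuron per feature dimension
        self.dense = nn.Sequential(                     # second layer: a stack of Dense (fully connected) layers
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.out = nn.Linear(hidden, 24)                # output layer: 8 vertices x (x, y, z)

    def forward(self, point_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([point_feat, image_feat], dim=-1)
        return self.out(self.dense(fused)).view(-1, 8, 3)

During training, the 8 predicted vertices would be compared against the real vertices with the GIoU LOSS, a sketch of which follows the pseudo code below.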
In the model training process, the GIoU loss is calculated after the output layer at every iteration, which guides the network more accurately in the direction that improves the 3D target bounding box estimation precision.
The pseudo code to compute GIoU is as follows:
1) The coordinates of the 8 vertices of the two 3D bounding boxes are known: B_T = (x_1^T, y_1^T, z_1^T, …, x_8^T, y_8^T, z_8^T) and B_P = (x_1^P, y_1^P, z_1^P, …, x_8^P, y_8^P, z_8^P), where B_T denotes the real bounding box and B_P the predicted bounding box.
2) Calculate the length, width and height of the two bounding boxes, obtaining L_T, W_T, H_T and L_P, W_P, H_P.
3) Calculate the volumes of the two bounding boxes, obtaining V_T = L_T * W_T * H_T and V_P = L_P * W_P * H_P.
4) Of B_T and B_P, take the upper-right vertex coordinates MAX = (x_MAX, y_MAX, z_MAX) of the bounding box whose center point is closer to the origin, and the lower-left vertex coordinates MIN = (x_MIN, y_MIN, z_MIN) of the other bounding box.
5) Calculate the differences between the corresponding coordinates of MIN and MAX: X_I = x_MIN - x_MAX, Y_I = y_MIN - y_MAX, Z_I = z_MIN - z_MAX.
6) Calculate the intersection of B_T and B_P, V_I = X_I * Y_I * Z_I. If V_I ≤ 0, there is no intersection; let V_I = 0.
7) Over all vertex coordinates of B_T and B_P, take the minimum x, y, z values x_MIN, y_MIN, z_MIN and the maximum x, y, z values x_MAX, y_MAX, z_MAX.
8) Calculate the length, width and height of the smallest bounding box B_C that can enclose B_T and B_P: L_C = x_MAX - x_MIN, W_C = y_MAX - y_MIN, H_C = z_MAX - z_MIN; the volume of B_C is V_C = L_C * W_C * H_C.
9) Calculate
GIoU = V_I / (V_T + V_P - V_I) - (V_C - (V_T + V_P - V_I)) / V_C
The value range of GIoU is [-1, 1].
10) Calculate GIoU LOSS: GIoU LOSS = 1 - GIoU. The value range of GIoU LOSS is [0, 2].
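A minimal runnable sketch of this pseudo code, assuming PyTorch; it treats both boxes as axis-aligned (consistent with the coordinate-wise operations above) and clamps the overlap on each axis at zero, a slight strengthening of the single V_I ≤ 0 check in step 6; the function name and tensor layout are illustrative:

import torch

def giou_loss_3d(B_T: torch.Tensor, B_P: torch.Tensor) -> torch.Tensor:
    """B_T, B_P: (..., 8, 3) tensors holding the 8 vertex coordinates of each box."""
    t_min, t_max = B_T.min(dim=-2).values, B_T.max(dim=-2).values
    p_min, p_max = B_P.min(dim=-2).values, B_P.max(dim=-2).values

    V_T = (t_max - t_min).prod(dim=-1)          # volume of the real box
    V_P = (p_max - p_min).prod(dim=-1)          # volume of the predicted box

    # intersection volume, with the overlap on each axis clamped at zero
    overlap = (torch.minimum(t_max, p_max) - torch.maximum(t_min, p_min)).clamp(min=0)
    V_I = overlap.prod(dim=-1)

    # smallest enclosing box B_C and its volume
    V_C = (torch.maximum(t_max, p_max) - torch.minimum(t_min, p_min)).prod(dim=-1)

    union = V_T + V_P - V_I
    giou = V_I / union - (V_C - union) / V_C    # value range [-1, 1]
    return (1.0 - giou).mean()                  # GIoU LOSS, value range [0, 2]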
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (3)

1. A GIoU-based 3D object bounding box estimation system, characterized in that: the system comprises a radar point cloud preprocessing module, a 2D image preprocessing module and a GIoU-based multi-source fusion module; the radar point cloud preprocessing module converts input radar point cloud data into a digital feature representation of fixed dimensionality and transmits the digital features of the radar point cloud data to the GIoU-based multi-source fusion module; the 2D image preprocessing module converts input 2D image data into a digital feature representation of fixed dimensionality and transmits the digital features of the 2D image data to the GIoU-based multi-source fusion module; the GIoU-based multi-source fusion module fuses the digital features of the radar point cloud data and the digital features of the 2D image data into a 3D target bounding box estimation result, namely the coordinates of the 8 vertices of the predicted 3D bounding box, B_P = (x_1^P, y_1^P, z_1^P, …, x_8^P, y_8^P, z_8^P);
The multi-source fusion module based on the GIoU is a Dense neural network model, and the GIoU LOSS is used as a LOSS function; the GIoU LOSS calculation method comprises the following steps:
step 1: inputting 8 vertex coordinates B of a real 3D bounding boxT=(xT 1,yT 1,zT 1,…,xT 8,yT 8,zT 8);
Step 2: calculating the length L of the real 3D bounding boxTWidth WTAnd height HT(ii) a Calculating the predicted 3D bounding box length LPWidth WPAnd height HP
And step 3: selecting a real 3D boundary frame and a 3D boundary frame with a center point closer to an original point from the predicted 3D boundary frame, and acquiring a vertex coordinate MAX (x) of the upper right corner of the 3D boundary frameMAX,yMAX,zMAX) And acquiring the coordinates MIN (x) of the vertex at the lower left corner of the 3D bounding box with the central point far away from the originMIN,yMIN,zMIN);
And 4, step 4: selecting the minimum x, y and z values x in the coordinates of all the vertexes of the real 3D bounding box and the predicted 3D bounding boxMIN、yMIN、zMINAnd the maximum x, y, z values xMAX、yMAX、zMAX
And 5: calculating a minimum bounding box B that can enclose the true 3D bounding box and the predicted 3D bounding boxcLength L ofCWidth WCAnd height HC
LC=xMAX-xMIN
WC=yMAX-yMIN
HC=zMAX-zMIN
Step 6: calculating the value of the GIoU;
Figure FDA0002629104370000011
Figure FDA0002629104370000012
wherein, VTVolume of true 3D bounding box, VT=LT*WT*HT;VPFor the predicted volume of the 3D bounding box, VP=LP*WP*HP;VcIs BcVolume of (V)c=Lc*Wc*Hc
And 7: calculating the value of the LOSS function GloU LOSS;
GIoU LOSS=1-GIoU。
2. The GIoU-based 3D object bounding box estimation system as claimed in claim 1, wherein: the Dense neural network model comprises a three-layer structure; the first layer of the Dense neural network model is an input layer, in which the number of neurons equals the dimension of the input features, each neuron corresponding in turn to one dimension of the input vector and passing it directly to the neurons of the second layer, the input features comprising the digital features of the radar point cloud data and the digital features of the 2D image data; the second layer of the Dense neural network model is a Dense layer, which comprises a stack of several Dense (fully connected) layers and realizes the mapping from the input variables to the output variables; the third layer of the Dense neural network model is an output layer, which corresponds to the regression values of the 3D bounding box.
3. The GIoU-based 3D object bounding box estimation system as claimed in claim 1 or 2, wherein: the radar point cloud preprocessing module is a PointNet neural network model, and the PointNet network adopts a symmetric function (max-pooling) to achieve permutation invariance over the unordered three-dimensional point set; the 2D image preprocessing module is a Resnet50 neural network model that learns the residual representation between input and output by using multiple parameter layers.
CN202010805891.3A 2020-08-12 2020-08-12 3D target bounding box estimation system based on GIoU Pending CN112053374A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010805891.3A CN112053374A (en) 2020-08-12 2020-08-12 3D target bounding box estimation system based on GIoU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010805891.3A CN112053374A (en) 2020-08-12 2020-08-12 3D target bounding box estimation system based on GIoU

Publications (1)

Publication Number Publication Date
CN112053374A true CN112053374A (en) 2020-12-08

Family

ID=73601727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010805891.3A Pending CN112053374A (en) 2020-08-12 2020-08-12 3D target bounding box estimation system based on GIoU

Country Status (1)

Country Link
CN (1) CN112053374A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190147245A1 (en) * 2017-11-14 2019-05-16 Nuro, Inc. Three-dimensional object detection for autonomous robotic systems using image proposals
CN108171217A (en) * 2018-01-29 2018-06-15 深圳市唯特视科技有限公司 A kind of three-dimension object detection method based on converged network
US20200160559A1 (en) * 2018-11-16 2020-05-21 Uatc, Llc Multi-Task Multi-Sensor Fusion for Three-Dimensional Object Detection
CN111027401A (en) * 2019-11-15 2020-04-17 电子科技大学 End-to-end target detection method with integration of camera and laser radar
CN110988912A (en) * 2019-12-06 2020-04-10 中国科学院自动化研究所 Road target and distance detection method, system and device for automatic driving vehicle
CN111242041A (en) * 2020-01-15 2020-06-05 江苏大学 Laser radar three-dimensional target rapid detection method based on pseudo-image technology
CN111339880A (en) * 2020-02-19 2020-06-26 北京市商汤科技开发有限公司 Target detection method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JUN XU et al.: "3D-GIoU: 3D Generalized Intersection over Union for Object Detection in Point Cloud", Sensors *
张爱武 (ZHANG Aiwu) et al.: "Multi-feature convolutional neural network semantic segmentation method for road 3D point clouds" (道路三维点云多特征卷积神经网络语义分割方法), Chinese Journal of Lasers (中国激光) *

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201208

RJ01 Rejection of invention patent application after publication