CN110909623B - Three-dimensional target detection method and three-dimensional target detector - Google Patents

Three-dimensional target detection method and three-dimensional target detector

Info

Publication number
CN110909623B
CN110909623B
Authority
CN
China
Prior art keywords
dimensional
target detection
dimensional target
network
point cloud
Prior art date
Legal status
Active
Application number
CN201911052349.9A
Other languages
Chinese (zh)
Other versions
CN110909623A (en)
Inventor
吴飞
陈�峰
黄庆花
季一木
荆晓远
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201911052349.9A
Publication of CN110909623A
Application granted
Publication of CN110909623B
Legal status: Active

Classifications

    • G (PHYSICS) › G06 (COMPUTING; CALCULATING OR COUNTING) › G06V (IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING) › G06V 20/00 (Scenes; Scene-specific elements) › G06V 20/60 (Type of objects) › G06V 20/64 (Three-dimensional objects)
    • G (PHYSICS) › G06 (COMPUTING; CALCULATING OR COUNTING) › G06F (ELECTRIC DIGITAL DATA PROCESSING) › G06F 18/00 (Pattern recognition) › G06F 18/20 (Analysing) › G06F 18/24 (Classification techniques) › G06F 18/241 (Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches)
    • G (PHYSICS) › G06 (COMPUTING; CALCULATING OR COUNTING) › G06N (COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS) › G06N 3/00 (Computing arrangements based on biological models) › G06N 3/02 (Neural networks) › G06N 3/04 (Architecture, e.g. interconnection topology) › G06N 3/045 (Combinations of networks)
    • G (PHYSICS) › G06 (COMPUTING; CALCULATING OR COUNTING) › G06N (COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS) › G06N 3/00 (Computing arrangements based on biological models) › G06N 3/02 (Neural networks) › G06N 3/08 (Learning methods)

Abstract

The invention provides a three-dimensional target detection method and a three-dimensional target detector. The three-dimensional target detection method mainly comprises the following steps: performing semantic segmentation on the image data of a three-dimensional target detection data set to obtain semantic predictions; projecting the semantic predictions into the point cloud space and screening points of specific categories to form a viewing cone; constructing a three-dimensional target detection network and taking the viewing cone as its input; strengthening the sensitivity of the three-dimensional target detection network to targets through a loss function; and optimizing the three-dimensional target detection network. The invention greatly reduces the time and computational cost of three-dimensional detection, simplifies the input, and improves real-time target detection performance, so that a good detection effect is obtained while real-time detection is maintained.

Description

Three-dimensional target detection method and three-dimensional target detector
Technical Field
The invention relates to a three-dimensional target detection method and a three-dimensional target detector, and belongs to the field of pattern recognition.
Background
Point cloud based three-dimensional target detection plays an important role in real life, for example in autonomous driving, domestic robots, augmented reality, and virtual reality. Compared with traditional image-based target detection methods, point clouds provide more accurate depth information for locating objects and delineating their shapes. However, owing to factors such as non-uniform sampling of three-dimensional space, the effective range of the sensor, and occlusion and the relative positions of objects, point clouds are sparser than traditional image data and their density varies greatly from one region to another.
To solve the above problems, current methods generally rely on manually extracted features so that the three-dimensional point cloud can be processed by a corresponding target detector. However, this requires the entire point cloud as input, consumes a large amount of computing resources, and cannot achieve real-time detection.
In view of the above, it is necessary to provide a three-dimensional target detection method to solve the above problems.
Disclosure of Invention
The invention aims to provide a three-dimensional target detection method which can obtain good detection effect while maintaining real-time detection.
In order to achieve the above object, the present invention provides a three-dimensional target detection method, which mainly comprises the following steps:
step 1: performing semantic segmentation on image data of the three-dimensional target detection data set to obtain semantic prediction;
step 2: projecting the semantic prediction obtained in step 1 into a point cloud space, and screening points of a specific category to form a viewing cone;
step 3: constructing a three-dimensional target detection network, and taking the viewing cone obtained in step 2 as the input of the three-dimensional target detection network;
step 4: strengthening the sensitivity of the three-dimensional target detection network to targets through a loss function;
step 5: optimizing the three-dimensional target detection network.
Optionally, in step 1, semantic segmentation is performed on the image data of the three-dimensional target detection data set by using the DeepLabv3+ algorithm, specifically comprising the following steps:
step 11: pre-training on a Cityscapes dataset by a DeepLabv3+ algorithm;
step 12: manually marking image data of the three-dimensional target detection data set, and finely adjusting a manually marked semantic label through a DeepLabv3+ algorithm;
step 13: each pixel in the image data is classified by semantic segmentation to obtain a semantic prediction.
Optionally, step 2 specifically includes the following steps:
step 21: projecting the region of each category in each semantic prediction into a point cloud space using a known projection matrix, such that the category attribute of each region of the point cloud space is consistent with the category attribute of each region of the corresponding semantic prediction;
step 22: screening and extracting points of a specific category from the original point cloud space to form a viewing cone.
Optionally, in step 3, the three-dimensional target detection network is built using the PyTorch deep learning framework and comprises: a grid-based point cloud feature extractor, a convolution intermediate extraction layer, and a regional preselection network, wherein the output of the grid-based point cloud feature extractor serves as the input of the convolution intermediate extraction layer, and the output of the convolution intermediate extraction layer serves as the input of the regional preselection network.
Optionally, the grid-based point cloud feature extractor consists of a linear layer, a batch normalization layer and a nonlinear activation layer;
the convolution intermediate extraction layer comprises three convolution intermediate modules, and each convolution intermediate module is formed by sequentially connecting a three-dimensional convolution layer, a batch normalization layer and a nonlinear activation layer;
the regional preselection network consists of three full-volume modules.
Optionally, in step 4, a focal loss function is used to solve the problem of imbalance between positive and negative anchor points in the regional preselection network, where the focal loss function is:
FL(p_t) = -α_t·(1 - p_t)^γ·log(p_t),
where p_t is the probability estimated by the three-dimensional target detection network, and α_t and γ are hyper-parameter adjustment coefficients.
Optionally, in step 4, the loss function is:
L_total = β₁·L_cls + β₂·(L_reg_θ + L_reg_other) + β₃·L_dir + β₄·L_corner, where L_cls is the classification loss, L_reg_θ is the angle loss of the three-dimensional candidate box, L_reg_other is the correction loss for the remaining parameters of the three-dimensional candidate box, L_dir is the direction loss, L_corner is the vertex-coordinate loss of the three-dimensional candidate box, and β₁, β₂, β₃, β₄ are hyper-parameters.
Optionally, step 5 specifically includes: training and optimizing the three-dimensional target detection network on the KITTI data set.
Optionally, in step 5, a stochastic gradient descent method and an Adam optimizer are used to train and optimize the three-dimensional target detection network.
In order to achieve the above object, the present invention further provides a three-dimensional target detector, which applies the three-dimensional target detection method.
The invention has the following beneficial effects: it greatly reduces the time and computational cost of three-dimensional detection, simplifies the input, and improves real-time target detection performance, so that a good detection effect is obtained while real-time detection is maintained.
Drawings
FIG. 1 is a flow chart of a three-dimensional object detection method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
The invention discloses a three-dimensional target detection method and a three-dimensional target detector applying the same. Since the specific structure of the three-dimensional object detector can be set according to actual conditions, it is not described in detail here, and the following will mainly describe the three-dimensional object detection method in detail.
As shown in fig. 1, the three-dimensional target detection method of the present invention mainly includes the following steps:
step 1: performing semantic segmentation on image data of the three-dimensional target detection data set to obtain semantic prediction;
step 2: projecting the semantic prediction obtained in step 1 into a point cloud space, and screening points of a specific category to form a viewing cone;
step 3: constructing a three-dimensional target detection network, and taking the viewing cone obtained in step 2 as the input of the three-dimensional target detection network;
step 4: strengthening the sensitivity of the three-dimensional target detection network to targets through a loss function;
step 5: optimizing the three-dimensional target detection network.
The following will specifically explain step 1 to step 5.
In step 1, semantic segmentation is performed on the image data of the three-dimensional target detection data set using the DeepLabv3+ algorithm (a semantic segmentation algorithm). Since the image data of the three-dimensional target detection data set does not contain segmentation labels, the image data must first be manually annotated. The procedure comprises the following steps:
step 11: pre-training the DeepLabv3+ algorithm on the Cityscapes data set for 200 training epochs;
step 12: manually annotating the image data of the three-dimensional target detection data set, and fine-tuning on the manually annotated semantic labels for 50 epochs with the DeepLabv3+ algorithm;
step 13: each pixel in the image data is classified by semantic segmentation to obtain a semantic prediction.
In step 2, the semantic prediction obtained in step 1 is projected into the point cloud space, and points of a specific category are screened to form a viewing cone, specifically through the following steps:
step 21: projecting the region of each category in each semantic prediction into a point cloud space by using a known projection matrix, so that the category attribute of each region in the point cloud space is consistent with the category attribute of each region corresponding to the semantic prediction;
step 22: screening and extracting points of the specific category from the original point cloud space to form a viewing cone.
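By way of illustration, steps 21 and 22 could be realized as in the following minimal sketch; the function name, the (3, 4) projection-matrix convention, and the array layouts are assumptions for illustration and are not specified by the patent:

```python
import numpy as np

def extract_frustum(points, proj, semantic_map, target_classes):
    """points: (N, 3) LiDAR coordinates; proj: (3, 4) camera projection
    matrix; semantic_map: (H, W) per-pixel class ids from DeepLabv3+."""
    h, w = semantic_map.shape
    homo = np.hstack([points, np.ones((points.shape[0], 1))])   # (N, 4)
    uvw = homo @ proj.T                                         # (N, 3)
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)                     # pixel column
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)                     # pixel row
    in_img = (uvw[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    keep = np.zeros(points.shape[0], dtype=bool)
    # keep a point only if its pixel carries one of the screened classes
    keep[in_img] = np.isin(semantic_map[v[in_img], u[in_img]], target_classes)
    return points[keep]          # the viewing cone fed to the network
```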
In step 3, the three-dimensional target detection network is built using the PyTorch deep learning framework and comprises three parts: a grid-based point cloud feature extractor, a convolution intermediate extraction layer, and a regional preselection network. The output of the grid-based point cloud feature extractor serves as the input of the convolution intermediate extraction layer, and the output of the convolution intermediate extraction layer serves as the input of the regional preselection network.
Specifically, the grid-based point cloud feature extractor comprises a linear layer, a batch normalization layer, and a nonlinear activation layer. When it is used, the viewing cone is partitioned in order by a three-dimensional grid of a set size, and all the point cloud points within each grid cell serve as the input of the extractor.
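A minimal PyTorch sketch of such a grid-based extractor is given below; it assumes a VoxelNet-style encoder in which the points gathered in one grid cell pass through the linear, batch normalization, and activation layers and are then max-pooled into one feature per cell. Channel widths and the per-cell point count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GridFeatureExtractor(nn.Module):
    """Linear -> BatchNorm -> ReLU over the points of each grid cell,
    followed by a max over the cell's points (illustrative widths)."""
    def __init__(self, in_ch=4, out_ch=128):
        super().__init__()
        self.linear = nn.Linear(in_ch, out_ch)
        self.bn = nn.BatchNorm1d(out_ch)
        self.relu = nn.ReLU()

    def forward(self, cells):             # cells: (num_cells, max_pts, in_ch)
        n, t, c = cells.shape
        x = self.linear(cells.reshape(n * t, c))
        x = self.relu(self.bn(x)).reshape(n, t, -1)
        return x.max(dim=1).values        # one feature vector per grid cell
```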
In the convolution intermediate extraction layer, in order to enlarge the receptive field and obtain more context, the invention uses three convolution intermediate modules, each formed by sequentially connecting a three-dimensional convolution layer, a batch normalization layer, and a nonlinear activation layer. The layer takes the output of the grid-based point cloud feature extractor as its input and converts the three-dimensional feature volume into a two-dimensional pseudo-image feature as its final output.
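One possible form of this layer is sketched below; the channel width and the depth-wise strides are assumptions chosen only to show how the three-dimensional feature volume is collapsed into a two-dimensional pseudo-image:

```python
import torch.nn as nn

class ConvMiddleLayer(nn.Module):
    """Three (Conv3d -> BatchNorm3d -> ReLU) modules in sequence; the depth
    axis is strided down and finally folded into the channel axis."""
    def __init__(self, ch=128):
        super().__init__()
        blocks = []
        for _ in range(3):
            blocks += [nn.Conv3d(ch, ch, 3, stride=(2, 1, 1), padding=1),
                       nn.BatchNorm3d(ch), nn.ReLU()]
        self.net = nn.Sequential(*blocks)

    def forward(self, x):                 # x: (B, C, D, H, W)
        x = self.net(x)                   # depth D shrinks with each stride
        b, c, d, h, w = x.shape
        return x.reshape(b, c * d, h, w)  # two-dimensional pseudo-image
```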
The input of the regional preselection network is the output of the convolution intermediate extraction layer. Its architecture consists of three full convolution modules, each comprising a downsampling convolution layer and several further convolution layers, with a batch normalization layer and a nonlinear activation layer applied after every convolution layer. The output of each full convolution module is then upsampled to feature maps of the same size and concatenated into a whole; finally, three two-dimensional convolution layers are applied for the desired learning targets, generating a probability score map, regression offsets, and direction predictions.
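The following sketch shows a regional preselection network of this shape; block depths, channel widths, upsampling factors, and the anchor count are illustrative assumptions:

```python
import torch
import torch.nn as nn

def _fcn_block(cin, cout, n_convs):
    """One full convolution module: a strided downsampling conv followed by
    ordinary convs, each with BatchNorm and ReLU."""
    layers = [nn.Conv2d(cin, cout, 3, stride=2, padding=1),
              nn.BatchNorm2d(cout), nn.ReLU()]
    for _ in range(n_convs):
        layers += [nn.Conv2d(cout, cout, 3, padding=1),
                   nn.BatchNorm2d(cout), nn.ReLU()]
    return nn.Sequential(*layers)

class RegionPreselectionNetwork(nn.Module):
    def __init__(self, in_ch=256, ch=128, n_anchors=2, box_dim=7):
        super().__init__()
        self.b1 = _fcn_block(in_ch, ch, 3)
        self.b2 = _fcn_block(ch, ch, 5)
        self.b3 = _fcn_block(ch, ch, 5)
        # upsample every block's output to the size of the first block
        self.up1 = nn.ConvTranspose2d(ch, ch, 1, stride=1)
        self.up2 = nn.ConvTranspose2d(ch, ch, 2, stride=2)
        self.up3 = nn.ConvTranspose2d(ch, ch, 4, stride=4)
        self.cls = nn.Conv2d(3 * ch, n_anchors, 1)            # score map
        self.reg = nn.Conv2d(3 * ch, n_anchors * box_dim, 1)  # regression
        self.dir = nn.Conv2d(3 * ch, n_anchors * 2, 1)        # direction

    def forward(self, x):
        f1 = self.b1(x)
        f2 = self.b2(f1)
        f3 = self.b3(f2)
        f = torch.cat([self.up1(f1), self.up2(f2), self.up3(f3)], dim=1)
        return self.cls(f), self.reg(f), self.dir(f)
```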
In step 4, because the viewing cone loses the original context information during point cloud screening, and target point cloud data lacking such a reference makes the detection task more difficult, a dedicated loss function must be added to the three-dimensional target detection network to strengthen its sensitivity to targets. The loss function L_total is as follows:
L_total = β₁·L_cls + β₂·(L_reg_θ + L_reg_other) + β₃·L_dir + β₄·L_corner,
where L_cls is the classification loss, L_reg_θ is the angle loss of the three-dimensional candidate box, L_reg_other is the correction loss for the remaining parameters of the three-dimensional candidate box, L_dir is the direction loss, and L_corner is the vertex-coordinate loss of the three-dimensional candidate box; the hyper-parameters β₁, β₂, β₃, β₄ are set to 1.0, 2.0, 0.2, and 0.5, respectively.
L_reg_θ and L_reg_other are computed from the following regression residuals:
Δx = (x_g - x_a)/d_a, Δy = (y_g - y_a)/d_a, Δz = (z_g - z_a)/h_a,
Δw = log(w_g/w_a), Δl = log(l_g/l_a), Δh = log(h_g/h_a),
Δθ = θ_g - θ_a,
where (x_g, y_g, z_g, w_g, l_g, h_g, θ_g) are the parameters of the three-dimensional candidate box described by the semantic label, (x_a, y_a, z_a, w_a, l_a, h_a, θ_a) are the anchor parameters, and d_a = √((l_a)² + (w_a)²) is the diagonal of the anchor cuboid detection box. Anchors are an important part of mainstream target detection frameworks and their extensions: by presetting a group of fixed detection boxes of different scales at different positions, almost all positions and scales are covered, and each fixed detection box is responsible for detecting the targets whose intersection-over-union with it exceeds a threshold (a value preset for training, commonly 0.5 or 0.7). This removes the need for a multi-scale sliding-window traversal and genuinely achieves both accuracy and speed.
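For concreteness, the residual encoding above can be written as the following sketch (function and variable names are illustrative, not taken from the patent):

```python
import numpy as np

def encode_residuals(gt, anchor):
    """gt and anchor are (x, y, z, w, l, h, theta) box parameters; returns
    the regression targets (dx, dy, dz, dw, dl, dh, dtheta) defined above."""
    xg, yg, zg, wg, lg, hg, tg = gt
    xa, ya, za, wa, la, ha, ta = anchor
    d_a = np.sqrt(la ** 2 + wa ** 2)      # diagonal of the anchor box
    return np.array([(xg - xa) / d_a, (yg - ya) / d_a, (zg - za) / ha,
                     np.log(wg / wa), np.log(lg / la), np.log(hg / ha),
                     tg - ta])
```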
In step 4, in order to solve the problem of imbalance between the positive and negative anchor points in the regional preselection network, the invention further addresses this drawback through a focal loss function:
FL(p_t) = -α_t·(1 - p_t)^γ·log(p_t),
where p_t is the probability estimated by the three-dimensional target detection network, and α_t and γ are hyper-parameter adjustment coefficients, set to 0.5 and 2, respectively.
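A minimal sketch of this focal loss with the stated values α_t = 0.5 and γ = 2 might read:

```python
import torch

def focal_loss(p, target, alpha_t=0.5, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t); p holds predicted
    foreground probabilities, target marks positive anchors with 1."""
    p_t = torch.where(target == 1, p, 1.0 - p)
    return (-alpha_t * (1.0 - p_t).pow(gamma)
            * torch.log(p_t.clamp(min=1e-6))).mean()
```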
For the angle θ_p, the angle loss L_reg_θ of the three-dimensional candidate box can be expressed as:
L_reg_θ = SmoothL1(sin(θ_p - Δθ)),
while the correction loss L_reg_other for the remaining parameters of the three-dimensional candidate box applies the SmoothL1 function to the differences Δx, Δy, Δz, Δw, Δl, Δh, Δθ.
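Illustratively, the angle loss can be computed as below, assuming PyTorch's built-in smooth L1 stands in for the SmoothL1 function:

```python
import torch
import torch.nn.functional as F

def angle_loss(theta_p, delta_theta):
    """SmoothL1 applied to sin(theta_p - delta_theta); the sine removes the
    pi ambiguity in box orientation, which the direction loss resolves."""
    diff = torch.sin(theta_p - delta_theta)
    return F.smooth_l1_loss(diff, torch.zeros_like(diff))
```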
The vertex-coordinate loss L_corner of the three-dimensional candidate box is composed as follows:
L_corner = Σ_NS Σ_NH min(‖P - P*‖, ‖P - P**‖),
where the sums over NS and NH traverse all three-dimensional candidate boxes, and P, P*, and P** respectively denote the vertices of the predicted three-dimensional candidate box, the vertices of the three-dimensional candidate box of the semantic label, and the vertices of that box after the semantic label's heading is inverted.
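Combining the components with the stated hyper-parameters, L_total could be assembled as in this sketch:

```python
def total_loss(l_cls, l_reg_theta, l_reg_other, l_dir, l_corner,
               betas=(1.0, 2.0, 0.2, 0.5)):
    """L_total = b1*L_cls + b2*(L_reg_theta + L_reg_other)
                 + b3*L_dir + b4*L_corner."""
    b1, b2, b3, b4 = betas
    return (b1 * l_cls + b2 * (l_reg_theta + l_reg_other)
            + b3 * l_dir + b4 * l_corner)
```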
In step 5, the three-dimensional target detection network is trained and optimized on the KITTI data set. The specific parameters and implementation are as follows: training is performed on a 1080Ti GPU using stochastic gradient descent with the Adam optimizer; the network is trained for 200,000 iterations (160 epochs) with an initial learning rate of 0.0002 and an exponential decay factor of 0.8 applied every 15 epochs.
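A training-loop sketch matching these settings is shown below; the model, the KITTI data loader, and the loss helper are hypothetical placeholders (the total_loss sketch above is reused), while the schedule values come from the text:

```python
import torch

def train(model, train_loader, compute_loss_terms):
    """model, train_loader and compute_loss_terms (returning the five loss
    components) are hypothetical placeholders; lr 0.0002 and the 0.8 decay
    every 15 epochs over 160 epochs are the stated settings."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=15,
                                                gamma=0.8)
    for epoch in range(160):
        for frustum, targets in train_loader:
            optimizer.zero_grad()
            loss = total_loss(*compute_loss_terms(model(frustum), targets))
            loss.backward()
            optimizer.step()
        scheduler.step()   # learning rate *= 0.8 every 15th epoch
```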
To verify the detection effect of the invention, tests were conducted on vehicles, pedestrians, and cyclists under conditions of different difficulty, and the method was compared with several existing target detection methods, including multi-view 3D (MV3D), MV3D with laser detection and ranging (MV3D-LIDAR), frustum point cloud (F-PointNet), the multi-view target detection network (AVOD), its full convolution variant (AVOD-FCN), and the voxel network (VoxelNet).
As shown in tables 1 and 2 below, the present invention can obtain relatively good test results under different conditions.
TABLE 1. AP-value comparison for three-dimensional detection on the KITTI data set (the table is provided as an image in the original publication)
TABLE 2. AP-value comparison for bird's-eye-view detection on the KITTI data set (the table is provided as an image in the original publication)
Furthermore, as shown in Table 3 below, although the method of the invention is not the one that takes the least time, considering that it additionally runs a semantic segmentation method, it still achieves a good detection effect while maintaining real-time detection.
TABLE 3. Time required to process a scene with different methods on the KITTI data set (the table is provided as an image in the original publication)
In conclusion, the invention greatly reduces the time and computational cost of three-dimensional detection, simplifies the input, and improves real-time target detection performance, obtaining a good detection effect while maintaining real-time detection.
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims (9)

1. A three-dimensional target detection method is characterized by mainly comprising the following steps:
step 1: performing semantic segmentation on image data of the three-dimensional target detection data set to obtain semantic prediction;
step 2: projecting the semantic prediction obtained in step 1 into a point cloud space, and screening points of specified categories to form viewing cones;
step 3: constructing a three-dimensional target detection network, and taking the viewing cone obtained in step 2 as the input of the three-dimensional target detection network; in step 3, the three-dimensional target detection network is built using the PyTorch deep learning framework and comprises: a grid-based point cloud feature extractor, a convolution intermediate extraction layer, and a regional preselection network, wherein the output of the grid-based point cloud feature extractor serves as the input of the convolution intermediate extraction layer, and the output of the convolution intermediate extraction layer serves as the input of the regional preselection network;
step 4: strengthening the sensitivity of the three-dimensional target detection network to targets through a loss function;
step 5: optimizing the three-dimensional target detection network.
2. The three-dimensional object detection method according to claim 1, characterized in that: in step 1, semantic segmentation is carried out on the image data of the three-dimensional target detection data set by using the DeepLabv3+ algorithm, specifically comprising the following steps:
step 11: pre-training on a Cityscapes dataset by a DeepLabv3+ algorithm;
step 12: manually marking image data of the three-dimensional target detection data set, and finely adjusting a manually marked semantic label through a DeepLabv3+ algorithm;
step 13: each pixel in the image data is classified by semantic segmentation to obtain a semantic prediction.
3. The three-dimensional object detection method according to claim 1, characterized in that: the step 2 specifically comprises the following steps:
step 21: projecting the region of each category in each semantic prediction into a point cloud space by using a known projection matrix, so that the category attribute of each region in the point cloud space is consistent with the category attribute of each region corresponding to the semantic prediction;
step 22: screening and extracting points of the specified category from the original point cloud space to form a viewing cone.
4. The three-dimensional object detection method according to claim 1, characterized in that:
the point cloud feature extractor using the grid consists of a linear layer, a batch normalization layer and a nonlinear activation layer;
the convolution intermediate extraction layer comprises three convolution intermediate modules, and each convolution intermediate module is formed by sequentially connecting a three-dimensional convolution layer, a batch normalization layer and a nonlinear activation layer;
the regional preselection network consists of three full-volume modules.
5. The three-dimensional target detection method according to claim 1, characterized in that: in step 4, the problem of imbalance between positive and negative anchor points in the regional preselection network is solved by using a focal loss function, wherein the focal loss function is:
FL(p_t) = -α_t·(1 - p_t)^γ·log(p_t),
where p_t is the probability estimated by the three-dimensional target detection network, and α_t and γ are hyper-parameter adjustment coefficients.
6. The three-dimensional object detection method according to claim 1, characterized in that: in step 4, the loss function is:
L_total = β₁·L_cls + β₂·(L_reg_θ + L_reg_other) + β₃·L_dir + β₄·L_corner, where L_cls is the classification loss, L_reg_θ is the angle loss of the three-dimensional candidate box, L_reg_other is the correction loss for the remaining parameters of the three-dimensional candidate box, L_dir is the direction loss, L_corner is the vertex-coordinate loss of the three-dimensional candidate box, and β₁, β₂, β₃, β₄ are hyper-parameters.
7. The three-dimensional target detection method according to claim 1, characterized in that step 5 specifically comprises: training and optimizing the three-dimensional target detection network on the KITTI data set.
8. The three-dimensional object detection method according to claim 7, characterized in that: in step 5, a stochastic gradient descent method and an Adam optimizer are used to train and optimize the three-dimensional target detection network.
9. A three-dimensional object detector, characterized by: the three-dimensional object detector applies the three-dimensional object detection method of any one of claims 1 to 8.
CN201911052349.9A 2019-10-31 2019-10-31 Three-dimensional target detection method and three-dimensional target detector Active CN110909623B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911052349.9A CN110909623B (en) 2019-10-31 2019-10-31 Three-dimensional target detection method and three-dimensional target detector

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911052349.9A CN110909623B (en) 2019-10-31 2019-10-31 Three-dimensional target detection method and three-dimensional target detector

Publications (2)

Publication Number Publication Date
CN110909623A CN110909623A (en) 2020-03-24
CN110909623B (en) 2022-10-04

Family

ID=69816198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911052349.9A Active CN110909623B (en) 2019-10-31 2019-10-31 Three-dimensional target detection method and three-dimensional target detector

Country Status (1)

Country Link
CN (1) CN110909623B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4145338A4 (en) * 2020-05-13 2023-06-21 Huawei Technologies Co., Ltd. Target detection method and apparatus
CN112183358B (en) * 2020-09-29 2024-04-23 新石器慧通(北京)科技有限公司 Training method and device for target detection model
CN113984037B (en) * 2021-09-30 2023-09-12 电子科技大学长三角研究院(湖州) Semantic map construction method based on target candidate frame in any direction
CN113887538B (en) * 2021-11-30 2022-03-25 北京的卢深视科技有限公司 Model training method, face recognition method, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564097A (en) * 2017-12-05 2018-09-21 华南理工大学 Multi-scale target detection method based on deep convolutional neural networks
CN109145713A (en) * 2018-07-02 2019-01-04 南京师范大学 Small-object semantic segmentation method combined with target detection
CN109523552A (en) * 2018-10-24 2019-03-26 青岛智能产业技术研究院 Three-dimensional object detection method based on frustum point cloud
CN109784333A (en) * 2019-01-22 2019-05-21 中国科学院自动化研究所 Three-dimensional target detection method and system based on point cloud channel features


Also Published As

Publication number Publication date
CN110909623A (en) 2020-03-24


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant