CN115457539A - 3D target detection algorithm based on multiple sensors - Google Patents

3D target detection algorithm based on multiple sensors

Info

Publication number
CN115457539A
CN115457539A
Authority
CN
China
Prior art keywords
image
lidar
features
feature
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211045902.8A
Other languages
Chinese (zh)
Inventor
姚健
胡超
邬伟杰
邱伟斌
顾剑峰
虞祝豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unicom Shanghai Industrial Internet Co Ltd
Original Assignee
China Unicom Shanghai Industrial Internet Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unicom Shanghai Industrial Internet Co Ltd filed Critical China Unicom Shanghai Industrial Internet Co Ltd
Priority to CN202211045902.8A
Publication of CN115457539A
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/56 Extraction of image or video features relating to colour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to the technical field of 3D target detection, in particular to a multi-sensor-based 3D target detection algorithm, which comprises the following steps: S1, data construction and data preprocessing; S2, camera image feature extraction; S3, LiDAR-Image Fusion Module: a LiDAR-Image Fusion module is constructed to hierarchically associate the semantic features generated from the RGB image with the LiDAR point cloud features, and the point-wise image features are fused with the LiDAR features to obtain the final enhanced features; and S4, LiDAR point cloud feature extraction. A 3D target detection algorithm based on the LiDAR-Image Fusion module is designed; the LiDAR-Image Fusion module refines the LiDAR feature points and maps them point by point into the image, finally giving a high-precision 3D detection network that ranks among the top results on the open source data sets.

Description

3D target detection algorithm based on multiple sensors
Technical Field
The invention relates to the technical field of 3D target detection, in particular to a multi-sensor-based 3D target detection algorithm.
Background
Currently popular 3D target detection algorithms collect multi-dimensional image and point cloud information from different types of sensors, such as 2D monocular cameras, stereo cameras and radar/LiDAR, and fuse them, which has already achieved good accuracy in 3D target detection. Camera images contain a large number of semantic features such as color and texture but lack depth information about objects. LiDAR point cloud data comes in the form of discrete points and is sparse, unordered and unevenly distributed; it contains depth information about targets, but the distribution of bounding boxes becomes ambiguous when detecting geometrically similar objects. Therefore, current 3D target detection algorithms fuse sensors such as cameras and LiDAR to obtain more accurate detection results.
According to how the sensors are used, current sensor fusion methods fall mainly into two categories: cascade methods that use different sensors at different stages, and fusion methods that jointly reason over multiple sensor inputs. Both can effectively fuse information from multiple sensors, but each still has shortcomings. The cascade methods do not exploit the complementarity between sensors, and their performance is limited by each single stage. The joint-reasoning methods first convert the 3D point cloud data into BEV data through perspective projection and voxelization, which loses a lot of information, and they establish only a rough correspondence between voxel features and semantic features, introducing many detection interference terms.
To address these existing problems, the present invention designs a multi-sensor-based 3D object detection algorithm.
Disclosure of Invention
The present invention is directed to a multi-sensor based 3D object detection algorithm to solve the above problems.
In order to achieve the purpose, the invention provides the following technical scheme:
A 3D target detection algorithm based on multiple sensors comprises the following steps:
S1, data construction and data preprocessing: the open source data sets commonly used for 3D object detection are the KITTI Dataset and the SUN-RGBD Dataset; if training is to be carried out on one's own data set, the data format must be prepared to be consistent with these two data sets, and LiDAR point cloud features and RGB image features are extracted after the data are preprocessed;
S2, camera image feature extraction: the RGB image features are extracted by a simple backbone composed of four lightweight convolution blocks, which extracts semantic image information through a set of convolution operations; each convolution block consists of two 3 × 3 convolution layers, each followed by a BN layer and a ReLU activation function; the second convolution layer in each block uses a stride of 2 to expand the receptive field and save GPU memory; the semantic features extracted by the four lightweight convolution blocks are fused scale by scale with LiDAR point cloud features of different scales to enrich the LiDAR point cloud features; four parallel transposed convolutions with different strides are used to recover the image resolution and finally obtain a feature map with the same size as the original RGB image; the four transposed convolutions of different scales are connected in a concat manner, and the enhanced LiDAR point cloud features finally generate more accurate proposals;
step S3, liDAR-Image Fusion Module: constructing a LiDAR-Image Fusion module to correspond the layering of semantic features generated by the RGB Image to LiDAR point cloud features, and fusing point-wise Image features with the LiDAR features to obtain final enhancement features;
S4, LiDAR point cloud feature extraction: the LiDAR branch takes the point cloud as input and generates 3D proposals; feature extraction is performed by four pairs of Set Abstraction (SA) modules and Feature Propagation (FP) modules, and the feature map generated by the transposed convolutions in step S2, which has the same size as the original image, is fed into the LiDAR-Image Fusion module for feature fusion with the fourth FP module and is used for the final point segmentation and 3D proposal generation.
In a preferred embodiment of the present invention, the RGB image in S1 has dimensions W × H × 3, where H and W are the height and width of the feature map, 3 is the number of channels, and the point cloud has dimensions N × 3, where N is the number of points.
As a preferred embodiment of the present invention, the input to the four lightweight convolution blocks in S2 is 1280 × 384, the channels are [3, 64, 128, 256, 512], and the stride of the 3 × 3 convolutions is 2; the number of LiDAR point cloud feature points is 16384, and the channels of the point features are [96, 256, 512, 1024].
As a preferred embodiment of the present invention, the LiDAR-Image Fusion module constructed in S3 includes a grid generator (Grid Generator) and an image sampler (Image Sampler), and its calculation steps are as follows:
firstly, LiDAR points are projected onto the image based on a mapping matrix M: taking the LiDAR point cloud and the mapping matrix M as input, the correspondence between the LiDAR point cloud and the camera picture at different resolutions is output; given a point cloud point p = (x, y, z), its image position p' = (x', y') is obtained from the mapping formula:
p'=M×p (1)
wherein M has dimensions 3 × 4 and p is taken in homogeneous coordinates.
Secondly, the semantic features of each point are obtained with the image sampler: taking the sampling position p' and the image feature map F as input, a point-wise image feature denoted V is generated for each sampling position; since the sampling position generally does not fall exactly on a pixel but between adjacent pixels, bilinear interpolation is used to obtain the image features at continuous coordinates:
V(p) = K(F(N(p')))    (2)
where V(p) is the image feature of point p, K is the bilinear interpolation function, and F(N(p')) are the image features at the pixels neighboring the sampling position p';
Finally, the LiDAR point cloud feature F_P and the image feature F_I are first mapped to the same number of channels through a fully connected layer, then compressed into a single-channel weight map w through another fully connected layer, and finally normalized with a sigmoid function, according to the formula:
w = σ(W · tanh(μF_P + νF_I))    (3)
where W, μ and ν are learnable weights and σ is the sigmoid activation function;
after the weight w is obtained, the LiDAR point cloud feature F_P and the semantic feature F_I are concatenated (concat) as follows:
F_LI = F_P || wF_I    (4).
As a preferred embodiment of the present invention, the LiDAR branch in S4 takes the point cloud as input; the number of input feature points is 16384 and the channels of the point features are [96, 256, 512, 1024]; feature extraction is carried out through four pairs of set abstraction (SA) and feature propagation (FP) modules, which are feature extraction methods from the PointNet++ algorithm, the SA module containing a sampling layer;
for the sampling layer, assume the point cloud data consists of N points and N1 points are sampled from them; the input size is N × (d + C), where d is the xyz three-dimensional coordinate and C is the dimension of the per-point attribute features (generally 0), and after the sampling layer the output is N1 × (d + C); for the grouping layer, N1 sub-regions are found with the N1 sampled points as centers, each sub-region containing k points of dimension (d + C), so the output is N1 × k × (d + C);
the successive SA modules downsample the original points to obtain fewer feature points, and a hierarchical feature propagation strategy based on distance-weighted interpolation, i.e. the FP (Feature Propagation) module, is used to realize the segmentation task on the original feature points;
after LiDAR feature extraction through the four pairs of SA and FP modules, the feature map generated by the transposed convolutions in step S2, which has the same size as the original image, is fed into the LiDAR-Image Fusion module for feature fusion with the fourth FP module and is used for the final point segmentation and 3D proposal generation.
Compared with the prior art, the invention has the beneficial effects that:
1. In the invention, a 3D target detection algorithm based on the LiDAR-Image Fusion module is designed; the LiDAR-Image Fusion module refines the LiDAR feature points and maps them point by point into the image, so that a high-precision 3D detection network is finally obtained which ranks among the top results on the open source data sets.
Drawings
FIG. 1 is a schematic diagram of a camera image feature extraction process according to the present invention;
FIG. 2 is a schematic view of a LiDAR-Image Fusion module flow structure according to the present invention;
FIG. 3 is a schematic diagram of the SA module according to the present invention;
FIG. 4 is a schematic diagram of an overall network framework structure for LiDAR point cloud feature extraction according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, rather than all embodiments, and all other embodiments obtained by a person of ordinary skill in the art without any creative work based on the embodiments of the present invention belong to the protection scope of the present invention.
While several embodiments of the present invention will be described more fully hereinafter with reference to the accompanying drawings, in order to facilitate an understanding of the invention, the invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, but rather should be construed to provide a more complete disclosure of the invention.
It will be understood that when an element is referred to as being "secured to" another element, it can be directly on the other element or intervening elements may also be present, that when an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present, and that the terms "vertical", "horizontal", "left", "right" and the like are used herein for descriptive purposes only.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs, and the use of the terms herein in the specification of the present invention are for the purpose of describing particular embodiments only and are not intended to be limiting of the invention, and the use of the term "and/or" herein includes any and all combinations of one or more of the associated listed items.
In an embodiment, referring to FIGS. 1-4, the present invention provides a technical solution:
A 3D target detection algorithm based on multiple sensors comprises the following steps:
S1, data construction and data preprocessing: the open source data sets commonly used for 3D object detection are the KITTI Dataset and the SUN-RGBD Dataset; if training is to be carried out on one's own data set, the data format must be prepared to be consistent with these two data sets, and LiDAR point cloud features and RGB image features are extracted after the data are preprocessed;
wherein the RGB image in S1 has dimensions W × H × 3, where H and W are the height and width of the feature map, 3 is the number of channels, and the point cloud has dimensions N × 3, where N is the number of points;
S2, camera image feature extraction: the RGB image features are extracted by a simple backbone composed of four lightweight convolution blocks, which extracts semantic image information through a set of convolution operations; each convolution block consists of two 3 × 3 convolution layers, each followed by a BN layer and a ReLU activation function; the second convolution layer in each block uses a stride of 2 to expand the receptive field and save GPU memory; the semantic features extracted by the four lightweight convolution blocks are fused scale by scale with LiDAR point cloud features of different scales to enrich the LiDAR point cloud features; four parallel transposed convolutions with different strides are used to recover the image resolution and finally obtain a feature map with the same size as the original RGB image; the four transposed convolutions of different scales are connected in a concat manner, and the enhanced LiDAR point cloud features finally generate more accurate proposals, as shown in FIG. 1;
specifically, the input to the four lightweight convolution blocks in S2 is 1280 × 384, the channels are [3, 64, 128, 256, 512], and the stride of the 3 × 3 convolutions is 2; the number of LiDAR point cloud feature points is 16384, and the channels of the point features are [96, 256, 512, 1024];
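For illustration, the camera branch of S2 could be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions, not the patent's own code: the class and parameter names (ImageBackbone, out_channels) are invented, and the kernel sizes of the four transposed convolutions and the width of the concatenated full-resolution map are chosen only to make the shapes work out.

```python
import torch
import torch.nn as nn

class ImageBackbone(nn.Module):
    """Sketch of the four lightweight convolution blocks described in S2.

    Each block holds two 3x3 convolutions (BN + ReLU), the second with
    stride 2; four parallel transposed convolutions bring every scale
    back to the input resolution before a channel-wise concat.
    """
    def __init__(self, channels=(3, 64, 128, 256, 512), out_channels=32):
        super().__init__()
        self.blocks = nn.ModuleList()
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            self.blocks.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=1, padding=1),
                nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
                nn.Conv2d(c_out, c_out, 3, stride=2, padding=1),  # stride 2: larger receptive field, less memory
                nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            ))
        # one transposed convolution per scale (strides 2/4/8/16) back to full resolution
        self.deconvs = nn.ModuleList([
            nn.ConvTranspose2d(c, out_channels, kernel_size=2 ** (i + 1), stride=2 ** (i + 1))
            for i, c in enumerate(channels[1:])
        ])

    def forward(self, x):
        scale_feats = []                   # per-scale semantic features
        for block in self.blocks:
            x = block(x)
            scale_feats.append(x)
        # restore each scale to the original resolution and concat
        full_res = torch.cat([d(f) for d, f in zip(self.deconvs, scale_feats)], dim=1)
        return scale_feats, full_res

feats, fmap = ImageBackbone()(torch.randn(1, 3, 384, 1280))
print(fmap.shape)  # torch.Size([1, 128, 384, 1280])
```

In this reading, each entry of scale_feats would be fused with the SA module of the matching scale through the LiDAR-Image Fusion module, while full_res is the feature map of original image size that is fused with the fourth FP module in step S4.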
step S3, liDAR-Image Fusion Module: constructing a LiDAR-Image Fusion module to correspond the layering of semantic features generated by the RGB Image to LiDAR point cloud features, and fusing point-wise Image features with the LiDAR features to obtain final enhancement features;
specifically, the LiDAR-Image Fusion module constructed in S3 includes a grid generator (Grid Generator) and an image sampler (Image Sampler), and its calculation steps are as follows:
firstly, LiDAR points are projected onto the image based on a mapping matrix M: taking the LiDAR point cloud and the mapping matrix M as input, the correspondence between the LiDAR point cloud and the camera picture at different resolutions is output; given a point cloud point p = (x, y, z), its image position p' = (x', y') is obtained from the mapping formula:
p'=M×p (1)
wherein M has dimensions 3 × 4 and p is taken in homogeneous coordinates.
Secondly, the semantic features of each point are obtained with the image sampler: taking the sampling position p' and the image feature map F as input, a point-wise image feature denoted V is generated for each sampling position; since the sampling position generally does not fall exactly on a pixel but between adjacent pixels, bilinear interpolation is used to obtain the image features at continuous coordinates:
V(p) = K(F(N(p')))    (2)
where V(p) is the image feature of point p, K is the bilinear interpolation function, and F(N(p')) are the image features at the pixels neighboring the sampling position p';
Finally, the LiDAR point cloud feature F_P and the image feature F_I are first mapped to the same number of channels through a fully connected layer, then compressed into a single-channel weight map w through another fully connected layer, and finally normalized with a sigmoid function, according to the formula:
w = σ(W · tanh(μF_P + νF_I))    (3)
where W, μ and ν are learnable weights and σ is the sigmoid activation function;
after the weight w is obtained, the LiDAR point cloud feature F_P and the semantic feature F_I are concatenated (concat) as follows:
F_LI = F_P || wF_I    (4).
the LiDAR-Image Fusion module flow diagram is shown in FIG. 2 below:
S4, LiDAR point cloud feature extraction: the LiDAR branch takes the point cloud as input and generates 3D proposals; feature extraction is performed by four pairs of Set Abstraction (SA) modules and Feature Propagation (FP) modules, and the feature map generated by the transposed convolutions in step S2, which has the same size as the original image, is fed into the LiDAR-Image Fusion module for feature fusion with the fourth FP module for the final point segmentation and 3D proposal generation;
specifically, in S4 the LiDAR branch takes the point cloud as input, the number of input feature points is 16384 and the channels of the point features are [96, 256, 512, 1024]; feature extraction is carried out through four pairs of set abstraction (SA) and feature propagation (FP) modules, which are feature extraction methods from the PointNet++ algorithm, the SA module containing a sampling layer, as shown in FIG. 3;
for the sampling layer, assume the point cloud data consists of N points and N1 points are sampled from them; the input size is N × (d + C), where d is the xyz three-dimensional coordinate and C is the dimension of the per-point attribute features (generally 0), and after the sampling layer the output is N1 × (d + C); for the grouping layer, N1 sub-regions are found with the N1 sampled points as centers, each sub-region containing k points of dimension (d + C), so the output is N1 × k × (d + C);
the successive SA modules downsample the original points to obtain fewer feature points, and a hierarchical feature propagation strategy based on distance-weighted interpolation, i.e. the FP (Feature Propagation) module, is used to realize the segmentation task on the original feature points;
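To make the sampling and grouping layers concrete, here is a simplified PyTorch sketch of the two operations used inside an SA module. The helper names and demo sizes are illustrative, the grouping is simplified to k-nearest neighbours (a strict PointNet++ ball query would also enforce a radius), and a real SA module would additionally run a shared MLP and max pooling over each group.

```python
import torch


def farthest_point_sampling(xyz, n_sample):
    """Sampling layer: choose n_sample well-spread points out of N (returns indices)."""
    n = xyz.shape[0]
    idx = torch.zeros(n_sample, dtype=torch.long)
    dist = torch.full((n,), float('inf'))
    farthest = 0                                        # start from an arbitrary point
    for i in range(n_sample):
        idx[i] = farthest
        d = ((xyz - xyz[farthest]) ** 2).sum(dim=1)     # squared distance to the newest center
        dist = torch.minimum(dist, d)                   # distance to the nearest chosen center
        farthest = int(torch.argmax(dist))              # next center: farthest from all chosen centers
    return idx


def knn_group(xyz, centers, k):
    """Grouping layer, simplified to k nearest neighbours per center; output (N1, k, 3)."""
    d2 = ((centers[:, None, :] - xyz[None, :, :]) ** 2).sum(dim=-1)   # (N1, N) squared distances
    knn_idx = d2.topk(k, largest=False).indices                       # k nearest per center
    return xyz[knn_idx]


# demo with small sizes (the patent uses 16384 input points)
xyz = torch.randn(2048, 3)
centers = xyz[farthest_point_sampling(xyz, 512)]        # N -> N1
groups = knn_group(xyz, centers, k=16)                  # N1 x k x (d + C) with C = 0
print(groups.shape)                                     # torch.Size([512, 16, 3])
```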
after LiDAR feature extraction through the four pairs of SA and FP modules, the feature map generated by the transposed convolutions in step S2, which has the same size as the original image, is fed into the LiDAR-Image Fusion module for feature fusion with the fourth FP module and is used for the final point segmentation and 3D proposal generation; the overall network framework is shown in FIG. 4.
Detailed description of the preferred embodiment
The specific implementation of the 3D target detection algorithm based on the LiDAR-Image Fusion module is divided into the following parts:
step1 data preparation
The open source data sets for 3D object detection mainly comprise the KITTI Dataset, the SUN-RGBD Dataset and the like; if training is to be carried out on one's own data set, the data format must be prepared to be consistent with these two data sets, and after the data are preprocessed, LiDAR point cloud features and RGB image features are extracted;
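As a rough illustration of this preparation step, the snippet below reads one KITTI-format frame (velodyne .bin point cloud, image_2 PNG and the calibration file) and assembles the 3 × 4 mapping matrix used later in equation (1). The directory layout and field names follow the public KITTI convention, the paths are placeholders, and none of this code is taken from the patent.

```python
import numpy as np
from PIL import Image


def load_kitti_frame(root, frame_id):
    """Read one KITTI-format sample: RGB image, LiDAR points, calibration."""
    image = np.asarray(Image.open(f"{root}/image_2/{frame_id}.png"))        # H x W x 3
    points = np.fromfile(f"{root}/velodyne/{frame_id}.bin",
                         dtype=np.float32).reshape(-1, 4)                    # N x 4 (x, y, z, intensity)
    calib = {}
    with open(f"{root}/calib/{frame_id}.txt") as f:
        for line in f:
            if ":" in line:
                key, val = line.split(":", 1)
                calib[key.strip()] = np.array(val.split(), dtype=np.float32)
    # mapping matrix M = P2 * R0_rect * Tr_velo_to_cam (LiDAR frame -> image plane)
    P2 = calib["P2"].reshape(3, 4)
    R0 = np.eye(4)
    R0[:3, :3] = calib["R0_rect"].reshape(3, 3)
    Tr = np.eye(4)
    Tr[:3, :4] = calib["Tr_velo_to_cam"].reshape(3, 4)
    M = P2 @ R0 @ Tr                                                         # 3 x 4, as in equation (1)
    return image, points, M
```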
step2 training phase
The input image has size 1280 × 384 × 3, the image feature channels are [64, 128, 256, 512], and image features are extracted through the 4 convolution blocks. The LiDAR point cloud input has 16384 points and the point cloud feature channels are [96, 256, 512, 1024]; point cloud features are first extracted through the SA and FP modules. The features of the first convolution block are enhanced through the LiDAR-Image Fusion module and fused with the first SA module, and the following three convolution blocks are fused with the following three SA modules in the same way. The SA and FP modules are then stacked through skip links; specifically, the raw point cloud data is stacked with FP4, SA1 with FP3, SA2 with FP2, and SA3 and SA4 with FP1, giving the fused feature FP4. After the four convolution blocks, the transposed convolutions generate a feature map with the same size as the original picture, which is used as input to the LiDAR-Image Fusion module for feature fusion with FP4. A sigmoid focal loss function is used for the classification part, and cross entropy loss and smooth L1 loss functions are used for the segmentation part; when the loss functions converge, training is finished, and the final 3D detection categories and 3D detection boxes are obtained;
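The loss terms named above could be sketched as follows. This is only an assumed arrangement: the patent does not give weights or exact targets, the smooth L1 term is interpreted here as the 3D box regression loss, and all tensor names are invented for the example.

```python
import torch
import torch.nn.functional as F


def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss on sigmoid scores for the classification branch (targets are 0/1 floats)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()


def detection_loss(cls_logits, cls_targets, seg_logits, seg_targets, box_pred, box_targets):
    """Total loss: focal loss (classification) + cross entropy (point segmentation)
    + smooth L1 (3D box regression); equal weights assumed for illustration."""
    loss_cls = sigmoid_focal_loss(cls_logits, cls_targets)
    loss_seg = F.cross_entropy(seg_logits, seg_targets)   # seg_targets: class indices
    loss_box = F.smooth_l1_loss(box_pred, box_targets)
    return loss_cls + loss_seg + loss_box
```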
step3 test phase
The test samples are fed into the network for inference, and the predictions of the classification and segmentation parts are compared with the ground truth to compute the mAP, which gives the final accuracy.
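For reference, a bare-bones per-class average precision routine is sketched below (standard all-point interpolation of the precision-recall curve). The matching of predicted 3D boxes to ground truth by IoU, which decides the true/false positive flags, is assumed to have been done beforehand and is not shown.

```python
import numpy as np


def average_precision(scores, is_tp, num_gt):
    """All-point AP for one class; is_tp marks each detection as TP/FP after IoU matching."""
    order = np.argsort(-np.asarray(scores, dtype=np.float64))
    tp = np.asarray(is_tp, dtype=np.float64)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-9)
    # precision envelope, then sum of (recall step) x (interpolated precision)
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    steps = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[steps + 1] - mrec[steps]) * mpre[steps + 1]))


# mAP is the mean of the per-class AP values
ap_car = average_precision([0.9, 0.8, 0.7], [1, 1, 0], num_gt=2)
print(round(ap_car, 3))  # 1.0 (both ground-truth boxes recovered before any false positive)
```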
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that various changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (5)

1. A 3D target detection algorithm based on multiple sensors, comprising the following steps:
S1, data construction and data preprocessing: the open source data sets commonly used for 3D object detection are the KITTI Dataset and the SUN-RGBD Dataset; if training is to be carried out on one's own data set, the data format must be prepared to be consistent with these two data sets, and LiDAR point cloud features and RGB image features are extracted after the data are preprocessed;
S2, camera image feature extraction: the RGB image features are extracted by a simple backbone composed of four lightweight convolution blocks, which extracts semantic image information through a set of convolution operations; each convolution block consists of two 3 × 3 convolution layers, each followed by a BN layer and a ReLU activation function; the second convolution layer in each block uses a stride of 2 to expand the receptive field and save GPU memory; the semantic features extracted by the four lightweight convolution blocks are fused scale by scale with LiDAR point cloud features of different scales to enrich the LiDAR point cloud features; four parallel transposed convolutions with different strides are used to recover the image resolution and finally obtain a feature map with the same size as the original RGB image; the four transposed convolutions of different scales are connected in a concat manner, and the enhanced LiDAR point cloud features finally generate more accurate proposals;
step S3, liDAR-Image Fusion Module: constructing a LiDAR-Image Fusion module to correspond the layering of semantic features generated by the RGB Image to LiDAR point cloud features, and fusing point-wise Image features with the LiDAR features to obtain final enhancement features;
s4, liDAR point cloud feature extraction: the method comprises the steps that LiDAR features are that point cloud is used as input and 3D (three-dimensional) pro-spots are generated, four pairs of Set Abstraction modules SA are adopted, wherein SA are Set Abstraction and a Feature propagation module FP, the FP is Feature propagation and performs Feature extraction, and Feature maps which are generated by the step S2 through transposition and convolution and have the same size as an original Image are placed into a LiDAR-Image Fusion module to perform Feature Fusion with a fourth FP module for final point segmentation and 3D pro-spots generation.
2. The multi-sensor based 3D object detection algorithm of claim 1, wherein: the RGB image in S1 has dimensions W × H × 3, where H and W are the height and width of the feature map, 3 is the number of channels, and the point cloud has dimensions N × 3, where N is the number of points.
3. The multi-sensor based 3D object detection algorithm of claim 1, wherein: the input to the four lightweight convolution blocks in S2 is 1280 × 384, the channels are [3, 64, 128, 256, 512], and the stride of the 3 × 3 convolutions is 2; the number of LiDAR point cloud feature points is 16384, and the channels of the point features are [96, 256, 512, 1024].
4. The multi-sensor based 3D object detection algorithm of claim 1, wherein: the LiDAR-Image Fusion module constructed in S3 includes a grid generator (Grid Generator) and an image sampler (Image Sampler), and its calculation steps are as follows:
firstly, LiDAR points are projected onto the image based on a mapping matrix M: taking the LiDAR point cloud and the mapping matrix M as input, the correspondence between the LiDAR point cloud and the camera picture at different resolutions is output; given a point cloud point p = (x, y, z), its image position p' = (x', y') is obtained from the mapping formula:
p'=M×p (1)
wherein M has dimensions 3 × 4 and p is taken in homogeneous coordinates.
Secondly, the semantic features of each point are obtained with the image sampler: taking the sampling position p' and the image feature map F as input, a point-wise image feature denoted V is generated for each sampling position; since the sampling position generally does not fall exactly on a pixel but between adjacent pixels, bilinear interpolation is used to obtain the image features at continuous coordinates:
V(p) = K(F(N(p')))    (2)
where V(p) is the image feature of point p, K is the bilinear interpolation function, and F(N(p')) are the image features at the pixels neighboring the sampling position p';
Finally, the LiDAR point cloud feature F_P and the image feature F_I are first mapped to the same number of channels through a fully connected layer, then compressed into a single-channel weight map w through another fully connected layer, and finally normalized with a sigmoid function, according to the formula:
w = σ(W · tanh(μF_P + νF_I))    (3)
where W, μ and ν are learnable weights and σ is the sigmoid activation function;
after the weight w is obtained, the LiDAR point cloud feature F_P and the semantic feature F_I are concatenated (concat) as follows:
F_LI = F_P || wF_I    (4).
5. The multi-sensor based 3D object detection algorithm of claim 1, wherein: in S4 the LiDAR branch takes the point cloud as input, the number of input feature points is 16384 and the channels of the point features are [96, 256, 512, 1024]; feature extraction is performed through four pairs of set abstraction (SA) and feature propagation (FP) modules, which are feature extraction methods from the PointNet++ algorithm, the SA module containing a sampling layer;
for the sampling layer, assume the point cloud data consists of N points and N1 points are sampled from them; the input size is N × (d + C), where d is the xyz three-dimensional coordinate and C is the dimension of the per-point attribute features (generally 0), and after the sampling layer the output is N1 × (d + C); for the grouping layer, N1 sub-regions are found with the N1 sampled points as centers, each sub-region containing k points of dimension (d + C), so the output is N1 × k × (d + C);
the successive SA modules downsample the original points to obtain fewer feature points, and a hierarchical feature propagation strategy based on distance-weighted interpolation, i.e. the FP (Feature Propagation) module, is used to realize the segmentation task on the original feature points;
after LiDAR feature extraction through the four pairs of SA and FP modules, the feature map generated by the transposed convolutions in step S2, which has the same size as the original image, is fed into the LiDAR-Image Fusion module for feature fusion with the fourth FP module and is used for the final point segmentation and 3D proposal generation.
CN202211045902.8A 2022-08-30 2022-08-30 3D target detection algorithm based on multiple sensors Pending CN115457539A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211045902.8A CN115457539A (en) 2022-08-30 2022-08-30 3D target detection algorithm based on multiple sensors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211045902.8A CN115457539A (en) 2022-08-30 2022-08-30 3D target detection algorithm based on multiple sensors

Publications (1)

Publication Number Publication Date
CN115457539A true CN115457539A (en) 2022-12-09

Family

ID=84301835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211045902.8A Pending CN115457539A (en) 2022-08-30 2022-08-30 3D target detection algorithm based on multiple sensors

Country Status (1)

Country Link
CN (1) CN115457539A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115953765A (en) * 2023-03-14 2023-04-11 江苏天一航空工业股份有限公司 Obstacle identification method for automatic driving of vehicle

Similar Documents

Publication Publication Date Title
CN110264416B (en) Sparse point cloud segmentation method and device
CN110674829B (en) Three-dimensional target detection method based on graph convolution attention network
CN110458939B (en) Indoor scene modeling method based on visual angle generation
CN114708585B (en) Attention mechanism-based millimeter wave radar and vision fusion three-dimensional target detection method
CN111832655B (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN106875437B (en) RGBD three-dimensional reconstruction-oriented key frame extraction method
CN106340036A (en) Binocular stereoscopic vision-based stereo matching method
CN111860666A (en) 3D target detection method based on point cloud and image self-attention mechanism fusion
CN115116049B (en) Target detection method and device, electronic equipment and storage medium
CN117274756A (en) Fusion method and device of two-dimensional image and point cloud based on multi-dimensional feature registration
CN115147545A (en) Scene three-dimensional intelligent reconstruction system and method based on BIM and deep learning
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN116778288A (en) Multi-mode fusion target detection system and method
CN116563488A (en) Three-dimensional target detection method based on point cloud body column
CN114677479A (en) Natural landscape multi-view three-dimensional reconstruction method based on deep learning
CN115482268A (en) High-precision three-dimensional shape measurement method and system based on speckle matching network
CN115457539A (en) 3D target detection algorithm based on multiple sensors
CN104796624A (en) Method for editing and propagating light fields
CN116071484B (en) Billion-pixel-level large scene light field intelligent reconstruction method and billion-pixel-level large scene light field intelligent reconstruction device
CN111274901A (en) Gesture depth image continuous detection method based on depth gating recursion unit
CN115330935A (en) Three-dimensional reconstruction method and system based on deep learning
CN115239559A (en) Depth map super-resolution method and system for fusion view synthesis
CN114155406A (en) Pose estimation method based on region-level feature fusion
CN114638866A (en) Point cloud registration method and system based on local feature learning
Huang et al. Obmo: One bounding box multiple objects for monocular 3d object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination