CN114118247A - Anchor-frame-free 3D target detection method based on multi-sensor fusion

Anchor-frame-free 3D target detection method based on multi-sensor fusion

Info

Publication number
CN114118247A
CN114118247A
Authority
CN
China
Prior art keywords
point cloud
feature
features
laser point
anchor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111384455.4A
Other languages
Chinese (zh)
Inventor
田炜
殷凌眉
邓振文
黄禹尧
谭大艺
韩帅
余卓平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202111384455.4A priority Critical patent/CN114118247A/en
Publication of CN114118247A publication Critical patent/CN114118247A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256 Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Length Measuring Devices By Optical Means (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to an anchor-frame-free 3D target detection method based on multi-sensor fusion, which comprises the following steps: acquiring a color RGB image and a laser point cloud; performing semantic segmentation on the color RGB image to obtain the class information of each pixel; enhancing the features of the laser point cloud with the class information to obtain an enhanced laser point cloud; performing point cloud geometric feature encoding and point cloud visibility feature encoding on the enhanced laser point cloud to obtain geometric features and visibility features; stacking the geometric features and the visibility features to obtain stacked features; inputting the stacked features into a multi-layer feature extraction network, extracting feature information at different levels, and stacking the feature information of each level to obtain fused features; and feeding the fused features to an anchor-frame-free target detector to obtain the 3D target detection result. Compared with the prior art, the invention enhances 3D target detection performance through multi-modal data fusion and the complementary advantages of the sensors, achieving accurate and fast detection.

Description

Anchor-frame-free 3D target detection method based on multi-sensor fusion
Technical Field
The invention relates to the technical field of machine vision, and in particular to an anchor-frame-free 3D target detection method based on multi-sensor fusion.
Background
3D target detection is widely used in scenes such as autonomous driving, robotics and augmented reality. Compared with common 2D detection, 3D detection additionally provides the length, width, height and yaw angle of the target object, and is an important perception basis for three-dimensional scene understanding and autonomous decision planning.
Lidar and cameras are the most commonly used vehicle-mounted sensors in the field of three-dimensional target detection. Images collected by cameras carry rich semantic and contextual information, while lidar acquires accurate spatial information; both are mainstream sensors for autonomous driving. In addition, many current autonomous driving algorithms adopt sensor fusion, which combines the advantages of different sensors and mitigates problems such as poor illumination and occlusion.
Target detection algorithms are mainly divided into one-stage and two-stage methods. A two-stage algorithm completes detection in two steps: it first generates candidate regions and then classifies them, as in the R-CNN series. A one-stage algorithm detects in a single pass and does not search for candidate regions separately; SSD is a typical example. Two-stage detectors such as Faster R-CNN first generate candidate boxes and then classify each box (and refine its position), which is slow because the detection and classification stages are run many times; one-stage detectors predict all bounding boxes in a single forward pass, so they are faster and well suited to mobile deployment. All of these algorithms rely on carefully designed anchor boxes to handle targets in different regions and of different sizes. Researchers generally regard these pre-designed parameters as key to the success of a target detection model, and previous experiments have shown that anchor hyper-parameters have a considerable influence on a model's predictive ability.
However, the above anchor-based target detection methods have the following disadvantages:
(1) Hyper-parameters such as the size, aspect ratio and number of anchor boxes are difficult to tune, and the accuracy obtained with different hyper-parameters can fluctuate by about 4%. Because the anchor hyper-parameters are preset, the robustness of the network model is reduced; when the model is applied to a brand-new dataset, the anchor hyper-parameters must be redesigned, which increases the complexity of tuning the model parameters.
(2) Even with careful design, the aspect ratios and sizes of the anchor boxes must be fixed when the model is built. This creates a serious problem: when the detection model faces a target set with large shape variation, and especially small targets, detection becomes markedly more difficult.
(3) To obtain a better recall rate, dense anchors are usually placed on every feature layer, but most of them are labelled as negative samples during training. The excess of negative samples aggravates the imbalance between positive and negative samples and biases the model toward the background.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an anchor-frame-free 3D target detection method based on multi-sensor fusion, so that 3D target detection can be performed accurately and quickly.
The purpose of the invention can be realized by the following technical scheme: a multi-sensor fusion-based anchor-frame-free 3D target detection method comprises the following steps:
s1, acquiring a color RGB image and a laser point cloud;
s2, performing semantic segmentation on the color RGB image to obtain the category information of each pixel;
s3, enhancing the characteristics of the laser point cloud by utilizing the category information of each pixel to obtain an enhanced laser point cloud;
s4, respectively carrying out point cloud geometric feature coding and point cloud visibility feature coding according to the enhanced laser point cloud so as to respectively obtain geometric features and visibility features;
s5, stacking the geometric features and the visibility features to obtain stacking features;
s6, inputting the stacking features into a multi-layer feature extraction network, extracting feature information of different levels, and stacking the feature information of each level to obtain fusion features;
and S7, outputting the fusion characteristics to the anchor frame-free target detector to obtain a 3D target detection result.
Further, in the step S2, a semantic segmentation network is specifically adopted to perform semantic segmentation on the color RGB image, where the semantic segmentation network includes U-Net, SegNet, DeepLab or BiSeNet;
the category information of each pixel in the step S2 includes 4 categories of vehicle, pedestrian, non-motor vehicle, and background.
Further, the specific process of step S3 is as follows: through projection transformation from the lidar coordinate system to the camera coordinate system and then from the camera coordinate system to the image coordinate system, the spatial point cloud is associated with the pixel positions of the corresponding color RGB image, and the semantic information carried by the color RGB image pixels is appended to each point projected from the lidar coordinate system into the image coordinate system. This is equivalent to adding a new data dimension to the originally input laser point cloud and giving the laser point cloud a category attribute, thereby realizing semantic enhancement of the point cloud.
Further, the specific process of performing point cloud geometric feature encoding in step S4 is as follows:
firstly, a three-dimensional grid is set on the horizontal plane within the ego-vehicle sensing range, with W cells along the X axis, H cells along the Y axis and 1 cell along the Z axis, thereby forming pillars, and the laser point cloud is mapped to the corresponding pillars;
the features contained in each laser point in a pillar comprise (x, y, z, i, t, c, ox, oy, dx, dy, dz), where (x, y, z, i, t) are the original features of the laser point, namely its spatial rectangular coordinates, reflectivity and timestamp; c is the classification feature given by image semantic segmentation; (ox, oy) is the offset of the laser point from the pillar center axis in x and y; and (dx, dy, dz) is the deviation of the laser point from the mean position (x̄, ȳ, z̄) of all laser points;
all enhanced laser points in a pillar are then feature-encoded by a multi-layer perceptron to generate point cloud geometric features with a fixed feature dimension D, and max pooling is performed along the point-count direction, so that the encoded geometric feature has size (W, H, D).
Further, the specific process of performing point cloud visibility feature encoding in step S4 is as follows:
firstly, a three-dimensional grid is set on the horizontal plane within the ego-vehicle sensing range, with W cells along the X axis, H cells along the Y axis and O cells along the Z axis, thereby forming voxels;
each laser point in the laser point cloud forms a line segment with the lidar center; the voxels through which the line segment passes are marked as free, the voxel containing the laser point is marked as occupied, and the remaining voxels are marked as unknown;
the voxels along the Z-axis direction form a feature vector, which is converted by a convolution layer into a visibility feature of length K, so that the encoded visibility feature has size (W, H, K).
Further, the step S5 specifically stacks the geometric features and the visibility features along the depth direction: the geometric features and the visibility features are each mapped into the grid image generated from the bird's-eye view of the stereoscopic grid and stacked along the depth direction to obtain stacked features of size (B, (D + K), H, W), where B, (D + K), H and W correspond to batch, depth, height and width, respectively.
Further, in the step S6, the multi-layer feature extraction network specifically extracts three feature maps of different scales through convolution layers with different strides; the feature map of each scale passes through a deconvolution layer with the corresponding stride so that it is enlarged to the same size, and the maps are finally stacked along the depth direction to obtain the fused features.
Further, the anchor-frame-free target detector in step S7 includes five detection heads, specifically, a key point heat map detection head, a local offset detection head, a z-axis positioning detection head, a 3D target size detection head, and a direction detection head.
Furthermore, the five detection heads regress the fused features through convolution layers to obtain, for each pixel of the fused feature map, the probability that it is a class center point, its offset from the ground-truth center point along the X and Y directions, its offset from the ground-truth center point along the Z direction, and the geometric size and orientation of the three-dimensional detection box.
Further, step S7 specifically predicts the object centers on the bird's-eye-view plane by using the five detection heads and regresses the different attributes of the 3D bounding boxes; finally, the outputs of the five detection heads are combined to generate the 3D target detection result.
Compared with the prior art, the invention provides an anchor-free, one-stage, end-to-end 3D point cloud target detection method with simple post-processing. The fused feature map extracted by the backbone network is connected to five different detection heads to predict object centers on the bird's-eye-view plane and regress the different attributes of the 3D bounding boxes. In the backbone network, the color RGB image is passed through a semantic segmentation network to obtain the class of each pixel, and this class information expands the features of the laser point cloud. In addition, according to the laser beam propagation principle, the space between the sensor and a laser point is not blocked by obstacles and can be regarded as free space, so the 3D target detection task can exploit image semantic features and spatial visibility features to improve detection accuracy; finally, fused features are obtained with a multi-layer feature extraction network. The invention thus enhances the raw point cloud data with image semantic information in the data preprocessing stage, reconstructs the spatial visibility state of the point cloud with a ray casting algorithm, and aggregates the point-set geometric and semantic features and the point-set visibility features with convolutional neural networks, thereby ensuring the accuracy of 3D target detection.
The method combines the aggregated point-set geometric features, point-set visibility features and point-set semantic features into an end-to-end three-dimensional target detection framework, realizing data fusion of the point cloud and the image. It innovatively introduces the concept of spatial visibility of the point cloud and effectively reconstructs the occupancy state of the point cloud space, which helps to accurately perceive the real state of the three-dimensional scene. In addition, during the inference phase each heat-map peak is picked up by a max pooling operation, and no multiple regressed anchors are tiled to a single location, so conventional NMS is not needed; the whole detector can therefore run on a typical CNN accelerator or GPU, saving CPU resources for other critical tasks of autonomous driving.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a flow chart of 3D object detection data processing in an embodiment;
FIG. 3 is a schematic diagram of semantic feature and visibility feature extraction;
FIG. 4 is a diagram of a target detection overall network architecture;
FIG. 5 is a schematic view of a voxel state labeling;
FIG. 6 is a block diagram of five detector modules.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
Examples
As shown in fig. 1, a method for detecting an anchor-frame-free 3D target based on multi-sensor fusion includes the following steps:
s1, acquiring a color RGB image and a laser point cloud;
s2, performing semantic segmentation on the color RGB image to obtain the category information of each pixel, wherein in practical application a mainstream semantic segmentation network can be used, including network architectures such as U-Net, SegNet, DeepLab or BiSeNet;
the color RGB image is input into the mainstream semantic segmentation network to obtain the category of each pixel of the color image, the main categories in traffic scenes comprising vehicles, pedestrians, non-motor vehicles, background and the like;
s3, enhancing the features of the laser point cloud with the category information of each pixel to obtain an enhanced laser point cloud, wherein the category information of each pixel is assigned to the laser point cloud through the extrinsic parameters between the sensors; that is, the laser point cloud is aligned with the pixels of the color image through the sensor extrinsics, so that a category attribute is given to the laser point cloud;
the specific process is as follows: through projection transformation from a laser radar coordinate system to a camera coordinate system and then from the camera coordinate system to an image coordinate system, pixel positions of a space point cloud and a corresponding color RGB image are associated, semantic information represented by the color RGB image pixels is supplemented to each point cloud projected from the laser radar coordinate system to the image coordinate system, and the method is equivalent to adding a new data dimension to the originally input laser point cloud and endowing the laser point cloud with a category attribute, so that the enhancement of point cloud semantics is realized;
s4, point cloud geometric feature encoding and point cloud visibility feature encoding are performed on the enhanced laser point cloud to obtain geometric features and visibility features respectively, wherein the specific process of point cloud geometric feature encoding is as follows:
firstly, a three-dimensional grid is set on the horizontal plane within the ego-vehicle sensing range, with W cells along the X axis, H cells along the Y axis and 1 cell along the Z axis, thereby forming pillars, and the laser point cloud is mapped to the corresponding pillars;
the features contained in each laser point in a pillar comprise (x, y, z, i, t, c, ox, oy, dx, dy, dz), where (x, y, z, i, t) are the original features of the laser point, namely its spatial rectangular coordinates, reflectivity and timestamp; c is the classification feature given by image semantic segmentation; (ox, oy) is the offset of the laser point from the pillar center axis in x and y; and (dx, dy, dz) is the deviation of the laser point from the mean position (x̄, ȳ, z̄) of all laser points;
all enhanced laser points in a pillar are then feature-encoded by a multi-layer perceptron to generate point cloud geometric features with a fixed feature dimension D, and max pooling is performed along the point-count direction, so that the encoded geometric feature has size (W, H, D). A sketch of this pillar encoder follows.
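Below is a minimal PyTorch sketch of such a pillar geometric-feature encoder. The class name PillarEncoder, the channel sizes, the dense (pillar, point) layout and the omission of a padding mask are assumptions made for illustration, not the patent's exact implementation.

```python
# A minimal sketch: per-point MLP, max pooling over the points of a pillar,
# and scattering the pooled features into the (D, H, W) bird's-eye-view grid.
import torch
import torch.nn as nn

class PillarEncoder(nn.Module):
    def __init__(self, in_dim=11, feat_dim=64, grid=(432, 496)):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, feat_dim),
                                 nn.BatchNorm1d(feat_dim),
                                 nn.ReLU())
        self.W, self.H = grid
        self.feat_dim = feat_dim

    def forward(self, pillar_points, pillar_coords):
        """pillar_points: (P, N, 11) points per non-empty pillar (padded for simplicity)
           pillar_coords: (P, 2) integer (x, y) grid index of each pillar (long tensor)"""
        P, N, C = pillar_points.shape
        feats = self.mlp(pillar_points.reshape(P * N, C)).reshape(P, N, -1)
        feats = feats.max(dim=1).values                                 # max-pool over points -> (P, D)
        bev = feats.new_zeros(self.feat_dim, self.H, self.W)
        bev[:, pillar_coords[:, 1], pillar_coords[:, 0]] = feats.t()    # scatter to (D, H, W)
        return bev
```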
the specific process of point cloud visibility feature encoding is as follows:
firstly, a three-dimensional grid is set on the horizontal plane within the ego-vehicle sensing range, with W cells along the X axis, H cells along the Y axis and O cells along the Z axis, thereby forming voxels;
each laser point in the laser point cloud forms a line segment with the lidar center; the voxels through which the line segment passes are marked as free, the voxel containing the laser point is marked as occupied, and the remaining voxels are marked as unknown;
the voxels along the Z-axis direction form a feature vector, which is converted by a convolution layer into a visibility feature of length K, so that the encoded visibility feature has size (W, H, K);
s5, the geometric features and the visibility features are stacked to obtain stacked features, specifically by stacking them along the depth direction: the geometric features and the visibility features are each mapped into the grid image generated from the bird's-eye view of the stereoscopic grid and stacked along the depth direction to obtain stacked features of size (B, (D + K), H, W), where B, (D + K), H and W correspond to batch, depth, height and width, respectively;
s6, the stacked features are input into a multi-layer feature extraction network, feature information at different levels is extracted, and the feature information of each level is stacked to obtain fused features. In this embodiment, the multi-layer feature extraction network extracts three feature maps of different scales through convolution layers with different strides; the feature map of each scale passes through a deconvolution layer with the corresponding stride to be enlarged to the same size, and the maps are finally stacked along the depth direction to obtain the fused features. A sketch of such a backbone is given below.
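The following PyTorch sketch shows one possible layout of this multi-scale backbone. The channel counts, the strides and the choice of three branches at 1×, 1/2× and 1/4× resolution are assumptions used for illustration.

```python
# A sketch of a three-branch strided-convolution backbone with transposed
# convolutions that restore all branches to the input resolution.
import torch
import torch.nn as nn

def conv_block(cin, cout, stride):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU())

class MultiScaleBackbone(nn.Module):
    def __init__(self, cin=96):                          # cin = D + K (assumed value)
        super().__init__()
        self.down1 = conv_block(cin, 64, 1)               # full resolution
        self.down2 = conv_block(64, 128, 2)               # 1/2 resolution
        self.down3 = conv_block(128, 256, 2)              # 1/4 resolution
        self.up1 = nn.ConvTranspose2d(64, 128, 1, stride=1)
        self.up2 = nn.ConvTranspose2d(128, 128, 2, stride=2)
        self.up3 = nn.ConvTranspose2d(256, 128, 4, stride=4)

    def forward(self, x):                                 # x: (B, D+K, H, W) stacked BEV features
        f1 = self.down1(x)
        f2 = self.down2(f1)
        f3 = self.down3(f2)
        # stack the upsampled maps along the depth direction -> fused features
        return torch.cat([self.up1(f1), self.up2(f2), self.up3(f3)], dim=1)   # (B, 384, H, W)
```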
s7, outputting the fusion features to an anchor-frame-free target detector to obtain a 3D target detection result, in this embodiment, the anchor-frame-free target detector includes five detection heads, specifically, a key point heat map detection head, a local offset detection head, a z-axis positioning detection head, a 3D target size detection head, and a direction detection head;
the five detection heads regress the fused features through convolution layers to obtain, for each pixel of the fused feature map, the probability that it is a class center point, its offset from the ground-truth center point along the X and Y directions, its offset from the ground-truth center point along the Z direction, and the geometric size and orientation of the three-dimensional detection box;
the five detection heads predict the object centers on the bird's-eye-view plane and regress the different attributes of the 3D bounding boxes, and finally the outputs of the five detection heads are combined to generate the 3D target detection result. A sketch of the heads follows.
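A minimal PyTorch sketch of the five parallel heads is shown below. The intermediate channel width, the number of classes and the sin/cos encoding of the heading are assumptions; the patent only specifies which quantities are regressed.

```python
# Five parallel convolutional heads over the fused BEV feature map.
import torch
import torch.nn as nn

class AnchorFreeHeads(nn.Module):
    def __init__(self, cin=384, num_classes=3):
        super().__init__()
        def head(cout):
            return nn.Sequential(nn.Conv2d(cin, 64, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(64, cout, 1))
        self.heatmap = head(num_classes)   # probability of being a class center point
        self.offset  = head(2)             # x, y offset to the true center
        self.z       = head(1)             # z-axis location
        self.size    = head(3)             # length, width, height of the 3D box
        self.heading = head(2)             # sin/cos encoding of the yaw (assumption)

    def forward(self, feat):
        return {'heatmap': self.heatmap(feat).sigmoid(),
                'offset': self.offset(feat), 'z': self.z(feat),
                'size': self.size(feat), 'heading': self.heading(feat)}
```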
In summary, the present invention performs 3D target detection based on a neural network; its input data comprise color images and laser point clouds, and the main data processing flow is shown in fig. 2:
1. the category of each pixel of the color image is obtained through a semantic segmentation network, and the category information is assigned to the laser point cloud through the extrinsic parameters between the sensors, enhancing the features of the laser point cloud to obtain the enhanced laser point cloud;
2. geometric feature encoding and visibility feature encoding are performed on the enhanced laser point cloud respectively (as shown in FIG. 3), and the bird's-eye-view features obtained by each module are stacked to obtain the stacked features;
3. the stacked features are input into a multi-layer feature extraction network, feature information at different levels is extracted, and the feature information of each level is stacked to obtain the fused features;
4. the category, offset, Z value, size and orientation of the 3D target are obtained from the fused features through the different detection heads.
In the neural network training process, the invention uses a data augmentation method combining images and point clouds to alleviate the problem of class imbalance and thereby improve the accuracy of 3D target detection. The augmentation mainly comprises random flipping along the X axis and random insertion of ground-truth targets. Random insertion of ground-truth targets means storing the image 2D boxes, image semantic segmentation pixels, ground-truth point clouds and 3D boxes of all ground-truth targets in the dataset, and randomly selecting some of them to insert into the currently retrieved sample; all laser points of the current sample that fall inside the contour region formed by the randomly selected ground-truth point cloud are deleted. A sketch of these augmentations is given below.
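The following numpy sketch illustrates the two augmentations under assumed data layouts (boxes as (cx, cy, cz, l, w, h, yaw), a ground-truth database of stored target points and boxes); the deletion of existing points inside an inserted target's contour is noted but omitted for brevity.

```python
# A minimal sketch of random X-axis flipping and ground-truth target insertion.
import numpy as np

def random_flip_x(points, boxes, prob=0.5):
    """Flip the scene across the X axis (y -> -y); boxes are (cx, cy, cz, l, w, h, yaw)."""
    if np.random.rand() < prob:
        points = points.copy(); boxes = boxes.copy()
        points[:, 1] *= -1
        boxes[:, 1] *= -1
        boxes[:, 6] *= -1          # mirror the heading angle
    return points, boxes

def gt_sampling(points, boxes, gt_database, num_insert=5):
    """Randomly paste stored ground-truth targets (points and 3D boxes) into the frame.

    gt_database: list of dicts {'points': (M, C), 'box': (7,)} built from the dataset.
    NB: the patent also deletes existing laser points that fall inside the inserted
    target's contour region; that step is omitted here for brevity.
    """
    picks = np.random.choice(len(gt_database),
                             size=min(num_insert, len(gt_database)), replace=False)
    for k in picks:
        sample = gt_database[k]
        points = np.vstack([points, sample['points']])
        boxes = np.vstack([boxes, sample['box'][None, :]])
    return points, boxes
```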
In this embodiment, the main input sensor data comprise a color image and a laser point cloud. The color image is passed through a semantic segmentation network to obtain the class of each pixel, and this class information expands the features of the laser point cloud. In addition, according to the laser beam propagation principle, the space between the sensor and a laser point is not blocked by obstacles and can be regarded as free space. Therefore, the 3D target detection task can exploit the image semantic features and the spatial visibility features to improve detection accuracy. Finally, the fused features are obtained with a multi-layer feature extraction network, and the 3D target detection task is completed by the detection heads; the overall network structure of the target detection is shown in FIG. 4.
Specifically, the method comprises the following steps:
Step 1: the category of each pixel of the color image is classified by a mainstream semantic segmentation method (such as DeepLabV3+), the laser point cloud is projected onto the image to obtain the image semantic information of each laser point, and the features of the laser point cloud are enhanced;
Step 2: pillars (W, H, 1) are divided in the sensing region space, the feature-enhanced laser point cloud is mapped to each pillar, the features of the laser points in a pillar are further enhanced with the geometric and position information of all points in the pillar, and the features within each pillar are extracted by a multi-layer perceptron to obtain the geometric features (W, H, D);
Step 3: voxels (W, H, O) are divided in the sensing region space, the spatial visibility state of each voxel is obtained with a ray casting algorithm as shown in FIG. 5, and high-dimensional features are further extracted through a convolution layer to obtain the visibility features (W, H, K);
Step 4: the geometric features and the visibility features are stacked along the depth direction and input into a multi-layer feature extraction network, feature information at different levels is extracted, and the feature information of each level is stacked to obtain the fused features; the structure of the multi-layer feature extraction network is shown in FIG. 4;
Step 5: the feature map output by the multi-layer feature extraction network is connected to five different detection heads (as shown in fig. 6) to predict the object centers on the bird's-eye-view plane and regress the different attributes of the 3D bounding boxes.
For the above steps, further detailed description is given:
1. Enhancing point cloud features with image semantic features
The semantic segmentation model is pre-trained (e.g. on the Cityscapes dataset), and the segmentation results contain 4 classes (class_id): Car, Pedestrian, Cyclist and Background.
The spatial point cloud can be associated with the pixel positions of the corresponding image by projective transformation from the lidar coordinate system to the camera coordinate system and then from the camera coordinate system to the image coordinate system. In this way, the semantic information carried by the image pixels is appended to each point projected from the lidar coordinate system into the image coordinate system, i.e. a new data dimension is added to the originally input point cloud, thereby realizing semantic enhancement of the point cloud. Finally, the original point set is expanded from 4 dimensions (x, y, z, r) to 5 dimensions (x, y, z, r, class_id).
2. Dynamic voxelization
With hard voxelization, the number of non-empty pillars and the number of points per pillar are limited. For example, for each input sample a maximum number of non-empty pillars must be set; the number of non-empty pillars among the H × W pillars generally does not exceed this maximum, and otherwise the pillars are randomly sampled. Likewise, if the number of points per pillar is set to N, points are randomly sampled in any pillar that exceeds N points, and pillars with fewer than N points are zero-padded. Each processed point carries a feature of D = 9 dimensions, forming a point cloud set of dimension (N, D) within one pillar; if P pillars are selected, the input is finally converted into a point cloud set of dimension (P, N, D).
With dynamic voxelization, by contrast, all points within each pillar are retained for learning pillar-level features. For each input sample we obtain a point set of dimension (N′, D), where N′ is the total number of points in the sample. Dynamic voxelization does not fix the number of points per pillar; instead, it stores the coordinates of the pillar in which each point lies, so that pillar-level features can be computed accordingly. A sketch of this step follows.
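The following numpy sketch shows dynamic voxelization under assumed grid parameters (the perception range and pillar size are borrowed from common KITTI-style settings and are not stated in the patent): every in-range point is kept and tagged with its pillar index, with no padding or sampling.

```python
# Dynamic voxelization: keep all points, record the pillar index of each point.
import numpy as np

def dynamic_voxelize(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
                     pillar_size=(0.16, 0.16)):
    """Return per-point pillar coordinates; no points are dropped and none are padded.

    points: (N', D) enhanced point features with x in column 0 and y in column 1.
    """
    in_range = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
                (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[in_range]
    ix = ((pts[:, 0] - x_range[0]) / pillar_size[0]).astype(int)
    iy = ((pts[:, 1] - y_range[0]) / pillar_size[1]).astype(int)
    coords = np.stack([ix, iy], axis=1)            # (N'', 2) pillar index of each point
    return pts, coords
```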
3. Point set visibility feature encoding
The number of input points is N, each point containing x, y, z spatial coordinates. First, all voxels are initialized to the unknown state, and the ray casting algorithm is then executed for each ray. If a traversed voxel is already occupied, the traversal simply continues. The final voxel reached by a ray is marked as occupied, and all other voxels traversed by the ray are marked as free. Finally, the whole three-dimensional point cloud space is represented by the three states unknown, occupied and free.
This stage mainly consists of reconstructing the spatial visibility state, computing the voxel-level visibility features, and finally integrating them into the three-dimensional target detection framework. The two-dimensional case is described first, which makes the subsequent extension to three-dimensional space easier to understand.
After the voxel coordinates traversed by the rays are obtained, the visibility state of each voxel in the whole three-dimensional space can be determined. The visibility of the three-dimensional space is defined here as three different states, namely Unknown space, Occupied space and Free space, each characterized by a specific numerical value. These voxel-level visibility features are stacked along the z axis and then aggregated using 1 × 1 convolution layers, resulting in a visibility feature of dimension (H, W, K). A simplified sketch of the visibility reconstruction follows.
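The sketch below reconstructs the three-state visibility grid in numpy. It replaces exact voxel traversal (e.g. Amanatides-Woo) with dense sampling along each ray for brevity, and the grid origin, resolution and extent are assumptions rather than values given in the patent.

```python
# Mark voxels along each lidar ray as free, the end voxel as occupied,
# and leave everything else unknown.
import numpy as np

UNKNOWN, FREE, OCCUPIED = 0, 1, 2

def visibility_grid(points, origin=(0.0, 0.0, 0.0), voxel=0.4,
                    lo=(0.0, -40.0, -3.0), shape=(176, 200, 10)):
    grid = np.full(shape, UNKNOWN, dtype=np.int8)
    origin = np.asarray(origin); lo = np.asarray(lo)

    def to_idx(p):
        return np.floor((p - lo) / voxel).astype(int)

    for p in points[:, :3]:
        # sample the ray at half-voxel steps (approximation of exact traversal)
        n_steps = int(np.linalg.norm(p - origin) / (voxel * 0.5)) + 1
        for t in np.linspace(0.0, 1.0, n_steps, endpoint=False):
            idx = to_idx(origin + t * (p - origin))
            if np.all(idx >= 0) and np.all(idx < shape) and grid[tuple(idx)] != OCCUPIED:
                grid[tuple(idx)] = FREE
        end = to_idx(p)
        if np.all(end >= 0) and np.all(end < shape):
            grid[tuple(end)] = OCCUPIED
    return grid   # later stacked along z and aggregated by a 1x1 convolution
```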
4. Anchor-free detection heads
The anchor-free target detector consists of five detection heads: a key-point heat-map detection head, a local offset detection head, a z-axis positioning detection head, a 3D target size detection head and an orientation detection head. FIG. 6 shows some details of the anchor-free target detector.
The heat-map detection head and the offset detection head predict the key-point heat map and the local offset regression map. The key-point heat map is used to find the location of the center of each target object in the bird's-eye view. The offset regression map not only helps the heat map find a more accurate object center in the bird's-eye view, but also compensates for the discretization error introduced by pillarization.
The offset regression detection head has two main functions. First, it eliminates the error caused by pillarization, in which floating-point object centers are assigned to integer pillar positions in the bird's-eye view. Second, it plays an important role in refining the object centers predicted by the heat map, especially when the heat map predicts a wrong center: once the heat map predicts a center that is several pixels away from the true center, the offset detection head can mitigate or even eliminate this pixel error relative to the true object center. A square region of radius r around the object's center pixel is selected in the offset regression map; the farther a pixel is from the object center, the larger its offset value, and the offset is trained with an L1 loss.
After a target object has been located in the bird's-eye view, only its x-y position is available, so the z-axis positioning detection head is required to regress the z value, which is likewise trained with an L1 loss function. A sketch of these regression losses is given below.
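A minimal PyTorch sketch of the L1 regression losses for the offset and z heads, evaluated only at ground-truth center pixels; the tensor layout and the helper name center_l1_loss are assumptions.

```python
# L1 regression loss collected at the ground-truth center pixels of a head output.
import torch
import torch.nn.functional as F

def center_l1_loss(pred_map, target, centers):
    """pred_map: (B, C, H, W) head output; target: (M, C) regression targets;
       centers:  (M, 3) long tensor of (batch, y, x) ground-truth center pixels."""
    b, y, x = centers[:, 0], centers[:, 1], centers[:, 2]
    pred = pred_map[b, :, y, x]                 # (M, C) predictions at the centers
    return F.l1_loss(pred, target)
```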
The method first enhances the raw point cloud data with the image semantic information in the data preprocessing stage; secondly, it reconstructs the spatial visibility state of the point cloud with a ray casting algorithm; then it aggregates the geometric and semantic features and the visibility features of the point set with convolutional neural networks.
The upsampled feature map output by the network is then connected to five different detection heads to predict the object centers on the bird's-eye-view plane and regress the different attributes of the 3D bounding boxes. Finally, the outputs of the five detection heads are combined to generate the detection result. The key-point heat-map detection head predicts the centers of objects in the bird's-eye-view plane, and each object is encoded as a small region centered on a heat peak.
The method combines the aggregated point-set geometric features, point-set visibility features and point-set semantic features into an end-to-end three-dimensional target detection framework, realizing data fusion of the point cloud and the image. It innovatively introduces the concept of spatial visibility of the point cloud and effectively reconstructs the occupancy state of the point cloud space, which helps to accurately perceive the real state of the three-dimensional scene. In addition, during the inference phase each heat peak is picked up by a max pooling operation, after which no multiple regressed anchors are tiled to a single location, so conventional NMS is not needed. This allows the entire detector to run on a typical CNN accelerator or GPU, saving CPU resources for other critical tasks of autonomous driving. A sketch of this NMS-free decoding step follows.
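The sketch below shows the max-pooling-based peak picking used instead of NMS; the 3×3 kernel and the top-K value are assumptions.

```python
# NMS-free decoding: keep local maxima of the heat map via 3x3 max pooling,
# then take the top-K peaks as object-center candidates.
import torch
import torch.nn.functional as F

def extract_peaks(heatmap, k=100):
    """heatmap: (B, C, H, W) sigmoid class-center scores."""
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap)                  # keep only local maxima
    b, c, h, w = peaks.shape
    scores, idx = peaks.view(b, -1).topk(k)                # flatten classes and pixels
    cls = torch.div(idx, h * w, rounding_mode='floor')
    rem = idx % (h * w)
    ys = torch.div(rem, w, rounding_mode='floor')
    xs = rem % w
    return scores, cls, ys, xs                             # per-object center candidates
```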

Claims (10)

1. A multi-sensor fusion-based anchor-frame-free 3D target detection method is characterized by comprising the following steps:
s1, acquiring a color RGB image and a laser point cloud;
s2, performing semantic segmentation on the color RGB image to obtain the category information of each pixel;
s3, enhancing the characteristics of the laser point cloud by utilizing the category information of each pixel to obtain an enhanced laser point cloud;
s4, respectively carrying out point cloud geometric feature coding and point cloud visibility feature coding according to the enhanced laser point cloud so as to respectively obtain geometric features and visibility features;
s5, stacking the geometric features and the visibility features to obtain stacking features;
s6, inputting the stacking features into a multi-layer feature extraction network, extracting feature information of different levels, and stacking the feature information of each level to obtain fusion features;
and S7, outputting the fusion characteristics to the anchor frame-free target detector to obtain a 3D target detection result.
2. The multi-sensor fusion-based anchor-frame-free 3D object detection method according to claim 1, wherein the step S2 specifically performs semantic segmentation on the color RGB image by using a semantic segmentation network, and the semantic segmentation network comprises U-Net, SegNet, DeepLab or BiSeNet;
the category information of each pixel in the step S2 includes 4 categories of vehicle, pedestrian, non-motor vehicle, and background.
3. The anchor-frame-free 3D object detection method based on multi-sensor fusion according to claim 1, wherein the specific process of the step S3 is as follows: through projection transformation from the lidar coordinate system to the camera coordinate system and then from the camera coordinate system to the image coordinate system, the spatial point cloud is associated with the pixel positions of the corresponding color RGB image, and the semantic information carried by the color RGB image pixels is appended to each point projected from the lidar coordinate system into the image coordinate system, which is equivalent to adding a new data dimension to the originally input laser point cloud and giving the laser point cloud a category attribute, thereby realizing semantic enhancement of the point cloud.
4. The anchor-frame-free 3D target detection method based on multi-sensor fusion as claimed in claim 1, wherein the specific process of point cloud geometric feature encoding in step S4 is as follows:
firstly, a three-dimensional grid is set on the horizontal plane within the ego-vehicle sensing range, with W cells along the X axis, H cells along the Y axis and 1 cell along the Z axis, thereby forming pillars, and the laser point cloud is mapped to the corresponding pillars;
the features contained in each laser point in a pillar comprise (x, y, z, i, t, c, ox, oy, dx, dy, dz), where (x, y, z, i, t) are the original features of the laser point, namely its spatial rectangular coordinates, reflectivity and timestamp; c is the classification feature given by image semantic segmentation; (ox, oy) is the offset of the laser point from the pillar center axis in x and y; and (dx, dy, dz) is the deviation of the laser point from the mean position (x̄, ȳ, z̄) of all laser points;
all enhanced laser points in a pillar are then feature-encoded by a multi-layer perceptron to generate point cloud geometric features with a fixed feature dimension D, and max pooling is performed along the point-count direction, so that the encoded geometric feature has size (W, H, D).
5. The anchor-frame-free 3D object detection method based on multi-sensor fusion as claimed in claim 4, wherein the specific process of point cloud visibility feature encoding in step S4 is as follows:
firstly, a three-dimensional grid is set on the horizontal plane within the ego-vehicle sensing range, with W cells along the X axis, H cells along the Y axis and O cells along the Z axis, thereby forming voxels;
each laser point in the laser point cloud forms a line segment with the lidar center; the voxels through which the line segment passes are marked as free, the voxel containing the laser point is marked as occupied, and the remaining voxels are marked as unknown;
the voxels along the Z-axis direction form a feature vector, which is converted by a convolution layer into a visibility feature of length K, so that the encoded visibility feature has size (W, H, K).
6. The multi-sensor fusion-based anchor-frame-free 3D object detection method according to claim 5, wherein the step S5 specifically stacks the geometric features and the visibility features along the depth direction: the geometric features and the visibility features are each mapped into the grid image generated from the bird's-eye view of the stereoscopic grid and stacked along the depth direction to obtain stacked features of size (B, (D + K), H, W), where B, (D + K), H and W correspond to batch, depth, height and width, respectively.
7. The method for detecting an anchor-frame-free 3D object based on multi-sensor fusion as claimed in claim 1, wherein the multi-layer feature extraction network in step S6 extracts three feature maps of different scales through convolution layers with different strides, the feature map of each scale passes through a deconvolution layer with the corresponding stride to be enlarged to the same size, and the maps are finally stacked along the depth direction to obtain the fused features.
8. The method for detecting the anchor-frame-free 3D object based on the multi-sensor fusion as claimed in claim 1, wherein the anchor-frame-free object detector in the step S7 comprises five detection heads, specifically, a key point heat map detection head, a local offset detection head, a z-axis positioning detection head, a 3D object size detection head and an orientation detection head.
9. The method according to claim 8, wherein the five detection heads regress the fused features through convolution layers to obtain, for each pixel of the fused feature map, the probability that it is a class center point, its offset from the ground-truth center along the X and Y directions, its offset from the ground-truth center along the Z direction, and the geometric size and orientation of the three-dimensional detection box.
10. The method for detecting an anchor-frame-free 3D object based on multi-sensor fusion according to claim 9, wherein the step S7 specifically predicts the center of each object on the bird's-eye-view plane by using the five detection heads and regresses different attributes of the 3D bounding box, and finally the outputs of the five detection heads are combined to generate the 3D target detection result.
CN202111384455.4A 2021-11-18 2021-11-18 Anchor-frame-free 3D target detection method based on multi-sensor fusion Pending CN114118247A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111384455.4A CN114118247A (en) 2021-11-18 2021-11-18 Anchor-frame-free 3D target detection method based on multi-sensor fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111384455.4A CN114118247A (en) 2021-11-18 2021-11-18 Anchor-frame-free 3D target detection method based on multi-sensor fusion

Publications (1)

Publication Number Publication Date
CN114118247A true CN114118247A (en) 2022-03-01

Family

ID=80439224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111384455.4A Pending CN114118247A (en) 2021-11-18 2021-11-18 Anchor-frame-free 3D target detection method based on multi-sensor fusion

Country Status (1)

Country Link
CN (1) CN114118247A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115424022A (en) * 2022-11-03 2022-12-02 南方电网数字电网研究院有限公司 Power transmission corridor ground point cloud segmentation method and device and computer equipment
CN115424022B (en) * 2022-11-03 2023-03-03 南方电网数字电网研究院有限公司 Power transmission corridor ground point cloud segmentation method and device and computer equipment
CN117152422A (en) * 2023-10-31 2023-12-01 国网湖北省电力有限公司超高压公司 Ultraviolet image anchor-free frame target detection method, storage medium and electronic equipment
CN117152422B (en) * 2023-10-31 2024-02-13 国网湖北省电力有限公司超高压公司 Ultraviolet image anchor-free frame target detection method, storage medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination