CN114118247A - Anchor-frame-free 3D target detection method based on multi-sensor fusion

Anchor-frame-free 3D target detection method based on multi-sensor fusion

Info

Publication number
CN114118247A
CN114118247A
Authority
CN
China
Prior art keywords
point cloud
feature
features
laser point
anchor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111384455.4A
Other languages
Chinese (zh)
Inventor
田炜
殷凌眉
邓振文
黄禹尧
谭大艺
韩帅
余卓平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202111384455.4A priority Critical patent/CN114118247A/en
Publication of CN114118247A publication Critical patent/CN114118247A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G06F18/256 Fusion techniques of classification results, e.g. of results related to same input data of results relating to different input data, e.g. multimodal recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Length Measuring Devices By Optical Means (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to an anchor-frame-free 3D target detection method based on multi-sensor fusion, which comprises the following steps: acquiring a color RGB image and a laser point cloud; performing semantic segmentation on the color RGB image to obtain the class information of each pixel; enhancing the features of the laser point cloud with the class information to obtain an enhanced laser point cloud; performing point cloud geometric feature encoding and point cloud visibility feature encoding on the enhanced laser point cloud to obtain geometric features and visibility features; stacking the geometric features and the visibility features to obtain stacked features; inputting the stacked features into a multi-layer feature extraction network, extracting feature information at different levels, and stacking the feature information of each level to obtain fused features; and feeding the fused features to an anchor-frame-free target detector to obtain the 3D target detection result. Compared with the prior art, the invention enhances 3D target detection performance through multi-modal data fusion and the complementary advantages of the sensors, achieving accurate and fast detection.

Description

Anchor-frame-free 3D target detection method based on multi-sensor fusion
Technical Field
The invention relates to the technical field of machine vision, and in particular to an anchor-frame-free 3D target detection method based on multi-sensor fusion.
Background
3D target detection is widely used in scenes such as autonomous driving, robotics and augmented reality. Compared with common 2D detection, 3D detection additionally provides the length, width, height and yaw angle of the target object, and is an important perception basis for three-dimensional scene understanding and autonomous decision planning.
Lidar and cameras are the most commonly used vehicle-mounted sensors in the field of three-dimensional target detection. Images collected by cameras carry rich semantic and contextual information, while lidar acquires accurate spatial information; both are mainstream sensors for autonomous driving. In addition, many current autonomous driving algorithms adopt sensor fusion, which combines the advantages of different sensors and mitigates problems such as poor illumination and occlusion.
Target detection algorithms are mainly divided into one-stage and two-stage methods. A two-stage algorithm completes detection in two steps: it first generates candidate regions and then classifies them, as in the R-CNN series. A one-stage algorithm detects in a single pass and does not search for candidate regions separately; SSD is a typical example. Two-stage detectors such as Faster R-CNN first generate candidate boxes and then classify each box (and refine its position), which is slow because the detection and classification stages are run many times; one-stage detectors predict all bounding boxes in a single forward pass, so they are faster and well suited to mobile deployment. All of these algorithms rely on carefully designed anchor boxes to handle targets in different regions and of different sizes. Researchers generally regard these pre-designed parameters as key to the success of a target detection model, and previous experiments have shown that anchor hyper-parameters have a considerable influence on a model's predictive ability.
However, the above anchor-based target detection methods have the following disadvantages:
(1) Hyper-parameters such as the size, aspect ratio and number of anchor boxes are difficult to tune, and the accuracy obtained with different hyper-parameters can fluctuate by about 4%. Because the anchor hyper-parameters are preset, the robustness of the network model is reduced; when the model is applied to a brand-new dataset, the anchor hyper-parameters must be redesigned, which increases the complexity of tuning the model parameters.
(2) Even with careful design, the aspect ratios and sizes of the anchor boxes must be fixed when the model is built. This creates a serious problem: when the detection model faces a target set with large shape variation, and especially small targets, detection becomes markedly more difficult.
(3) To obtain a better recall rate, dense anchors are usually placed on every feature layer, but most of them are labelled as negative samples during training. The excess of negative samples aggravates the imbalance between positive and negative samples and biases the model toward the background.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an anchor-frame-free 3D target detection method based on multi-sensor fusion, so that 3D target detection can be performed accurately and quickly.
The purpose of the invention can be realized by the following technical scheme: a multi-sensor fusion-based anchor-frame-free 3D target detection method comprises the following steps:
s1, acquiring a color RGB image and a laser point cloud;
s2, performing semantic segmentation on the color RGB image to obtain the category information of each pixel;
s3, enhancing the characteristics of the laser point cloud by utilizing the category information of each pixel to obtain an enhanced laser point cloud;
s4, respectively carrying out point cloud geometric feature coding and point cloud visibility feature coding according to the enhanced laser point cloud so as to respectively obtain geometric features and visibility features;
s5, stacking the geometric features and the visibility features to obtain stacking features;
s6, inputting the stacking features into a multi-layer feature extraction network, extracting feature information of different levels, and stacking the feature information of each level to obtain fusion features;
and S7, outputting the fusion characteristics to the anchor frame-free target detector to obtain a 3D target detection result.
Further, in the step S2, a semantic segmentation network is specifically adopted to perform semantic segmentation on the color RGB image, where the semantic segmentation network includes U-Net, SegNet, DeepLab or BiSeNet;
the category information of each pixel in the step S2 includes 4 categories of vehicle, pedestrian, non-motor vehicle, and background.
Further, the specific process of step S3 is as follows: through projection transformation from the lidar coordinate system to the camera coordinate system and then from the camera coordinate system to the image coordinate system, the spatial point cloud is associated with the pixel positions of the corresponding color RGB image, and the semantic information carried by the color RGB image pixels is appended to each point projected from the lidar coordinate system into the image coordinate system. This is equivalent to adding a new data dimension to the originally input laser point cloud and giving the laser point cloud a category attribute, thereby realizing semantic enhancement of the point cloud.
Further, the specific process of performing point cloud geometric feature encoding in step S4 is as follows:
firstly, a three-dimensional grid is set on the horizontal plane within the ego-vehicle sensing range, with W cells along the X axis, H cells along the Y axis and 1 cell along the Z axis, thereby forming pillars, and the laser point cloud is mapped to the corresponding pillars;
the features contained in each laser point in a pillar comprise (x, y, z, i, t, c, ox, oy, dx, dy, dz), where (x, y, z, i, t) are the original features of the laser point, namely its spatial rectangular coordinates, reflectivity and timestamp; c is the classification feature given by image semantic segmentation; (ox, oy) is the offset of the laser point from the pillar center axis in x and y; and (dx, dy, dz) is the deviation of the laser point from the mean position (x̄, ȳ, z̄) of all laser points;
all enhanced laser points in a pillar are then feature-encoded by a multi-layer perceptron to generate point cloud geometric features with a fixed feature dimension D, and max pooling is performed along the point-count direction, so that the encoded geometric feature has size (W, H, D).
Further, the specific process of performing point cloud visibility feature encoding in step S4 is as follows:
firstly, a three-dimensional grid is set on the horizontal plane within the ego-vehicle sensing range, with W cells along the X axis, H cells along the Y axis and O cells along the Z axis, thereby forming voxels;
each laser point in the laser point cloud forms a line segment with the lidar center; the voxels through which the line segment passes are marked as free, the voxel containing the laser point is marked as occupied, and the remaining voxels are marked as unknown;
the voxels along the Z-axis direction form a feature vector, which is converted by a convolution layer into a visibility feature of length K, so that the encoded visibility feature has size (W, H, K).
Further, the step S5 specifically stacks the geometric features and the visibility features along the depth direction: the geometric features and the visibility features are each mapped into the grid image generated from the bird's-eye view of the stereoscopic grid and stacked along the depth direction to obtain stacked features of size (B, (D + K), H, W), where B, (D + K), H and W correspond to batch, depth, height and width, respectively.
Further, in the step S6, the multi-layer feature extraction network specifically extracts three feature maps of different scales through convolution layers with different strides; the feature map of each scale passes through a deconvolution layer with the corresponding stride so that it is enlarged to the same size, and the maps are finally stacked along the depth direction to obtain the fused features.
Further, the anchor-frame-free target detector in step S7 includes five detection heads, specifically, a key point heat map detection head, a local offset detection head, a z-axis positioning detection head, a 3D target size detection head, and a direction detection head.
Furthermore, the five detection heads regress the fused features through convolution layers to obtain, for each pixel of the fused feature map, the probability that it is a class center point, its offset from the ground-truth center point along the X and Y directions, its offset from the ground-truth center point along the Z direction, and the geometric size and orientation of the three-dimensional detection box.
Further, step S7 specifically predicts the object centers on the bird's-eye-view plane by using the five detection heads and regresses the different attributes of the 3D bounding boxes; finally, the outputs of the five detection heads are combined to generate the 3D target detection result.
Compared with the prior art, the invention provides an anchor-free, one-stage, end-to-end 3D point cloud target detection method with simple post-processing. The fused feature map extracted by the backbone network is connected to five different detection heads to predict object centers on the bird's-eye-view plane and regress the different attributes of the 3D bounding boxes. In the backbone network, the color RGB image is passed through a semantic segmentation network to obtain the class of each pixel, and this class information expands the features of the laser point cloud. In addition, according to the laser beam propagation principle, the space between the sensor and a laser point is not blocked by obstacles and can be regarded as free space, so the 3D target detection task can exploit image semantic features and spatial visibility features to improve detection accuracy; finally, fused features are obtained with a multi-layer feature extraction network. The invention thus enhances the raw point cloud data with image semantic information in the data preprocessing stage, reconstructs the spatial visibility state of the point cloud with a ray casting algorithm, and aggregates the point-set geometric and semantic features and the point-set visibility features with convolutional neural networks, thereby ensuring the accuracy of 3D target detection.
The method combines the aggregated point-set geometric features, point-set visibility features and point-set semantic features into an end-to-end three-dimensional target detection framework, realizing data fusion of the point cloud and the image. It innovatively introduces the concept of spatial visibility of the point cloud and effectively reconstructs the occupancy state of the point cloud space, which helps to accurately perceive the real state of the three-dimensional scene. In addition, during the inference phase each heat-map peak is picked up by a max pooling operation, and no multiple regressed anchors are tiled to a single location, so conventional NMS is not needed; the whole detector can therefore run on a typical CNN accelerator or GPU, saving CPU resources for other critical tasks of autonomous driving.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a flow chart of 3D object detection data processing in an embodiment;
FIG. 3 is a schematic diagram of semantic feature and visibility feature extraction;
FIG. 4 is a diagram of a target detection overall network architecture;
FIG. 5 is a schematic view of a voxel state labeling;
FIG. 6 is a block diagram of five detector modules.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
Examples
As shown in fig. 1, a method for detecting an anchor-frame-free 3D target based on multi-sensor fusion includes the following steps:
s1, acquiring a color RGB image and a laser point cloud;
s2, performing semantic segmentation on the color RGB image to obtain the category information of each pixel, wherein in practical application a mainstream semantic segmentation network can be used, including network architectures such as U-Net, SegNet, DeepLab or BiSeNet;
the color RGB image is input into the mainstream semantic segmentation network to obtain the category of each pixel of the color image, the main categories in traffic scenes comprising vehicles, pedestrians, non-motor vehicles, background and the like;
s3, enhancing the features of the laser point cloud with the category information of each pixel to obtain an enhanced laser point cloud, wherein the category information of each pixel is assigned to the laser point cloud through the extrinsic parameters between the sensors; that is, the laser point cloud is aligned with the pixels of the color image through the sensor extrinsics, so that a category attribute is given to the laser point cloud;
the specific process is as follows: through projection transformation from a laser radar coordinate system to a camera coordinate system and then from the camera coordinate system to an image coordinate system, pixel positions of a space point cloud and a corresponding color RGB image are associated, semantic information represented by the color RGB image pixels is supplemented to each point cloud projected from the laser radar coordinate system to the image coordinate system, and the method is equivalent to adding a new data dimension to the originally input laser point cloud and endowing the laser point cloud with a category attribute, so that the enhancement of point cloud semantics is realized;
s4, point cloud geometric feature encoding and point cloud visibility feature encoding are performed on the enhanced laser point cloud to obtain geometric features and visibility features respectively, wherein the specific process of point cloud geometric feature encoding is as follows:
firstly, a three-dimensional grid is set on the horizontal plane within the ego-vehicle sensing range, with W cells along the X axis, H cells along the Y axis and 1 cell along the Z axis, thereby forming pillars, and the laser point cloud is mapped to the corresponding pillars;
the features contained in each laser point in a pillar comprise (x, y, z, i, t, c, ox, oy, dx, dy, dz), where (x, y, z, i, t) are the original features of the laser point, namely its spatial rectangular coordinates, reflectivity and timestamp; c is the classification feature given by image semantic segmentation; (ox, oy) is the offset of the laser point from the pillar center axis in x and y; and (dx, dy, dz) is the deviation of the laser point from the mean position (x̄, ȳ, z̄) of all laser points;
all enhanced laser points in a pillar are then feature-encoded by a multi-layer perceptron to generate point cloud geometric features with a fixed feature dimension D, and max pooling is performed along the point-count direction, so that the encoded geometric feature has size (W, H, D). A sketch of this pillar encoder follows.
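Below is a minimal PyTorch sketch of such a pillar geometric-feature encoder. The class name PillarEncoder, the channel sizes, the dense (pillar, point) layout and the omission of a padding mask are assumptions made for illustration, not the patent's exact implementation.

```python
# A minimal sketch: per-point MLP, max pooling over the points of a pillar,
# and scattering the pooled features into the (D, H, W) bird's-eye-view grid.
import torch
import torch.nn as nn

class PillarEncoder(nn.Module):
    def __init__(self, in_dim=11, feat_dim=64, grid=(432, 496)):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, feat_dim),
                                 nn.BatchNorm1d(feat_dim),
                                 nn.ReLU())
        self.W, self.H = grid
        self.feat_dim = feat_dim

    def forward(self, pillar_points, pillar_coords):
        """pillar_points: (P, N, 11) points per non-empty pillar (padded for simplicity)
           pillar_coords: (P, 2) integer (x, y) grid index of each pillar (long tensor)"""
        P, N, C = pillar_points.shape
        feats = self.mlp(pillar_points.reshape(P * N, C)).reshape(P, N, -1)
        feats = feats.max(dim=1).values                                 # max-pool over points -> (P, D)
        bev = feats.new_zeros(self.feat_dim, self.H, self.W)
        bev[:, pillar_coords[:, 1], pillar_coords[:, 0]] = feats.t()    # scatter to (D, H, W)
        return bev
```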
the specific process of point cloud visibility feature encoding is as follows:
firstly, a three-dimensional grid is set on the horizontal plane within the ego-vehicle sensing range, with W cells along the X axis, H cells along the Y axis and O cells along the Z axis, thereby forming voxels;
each laser point in the laser point cloud forms a line segment with the lidar center; the voxels through which the line segment passes are marked as free, the voxel containing the laser point is marked as occupied, and the remaining voxels are marked as unknown;
the voxels along the Z-axis direction form a feature vector, which is converted by a convolution layer into a visibility feature of length K, so that the encoded visibility feature has size (W, H, K);
s5, the geometric features and the visibility features are stacked to obtain stacked features, specifically by stacking them along the depth direction: the geometric features and the visibility features are each mapped into the grid image generated from the bird's-eye view of the stereoscopic grid and stacked along the depth direction to obtain stacked features of size (B, (D + K), H, W), where B, (D + K), H and W correspond to batch, depth, height and width, respectively;
s6, the stacked features are input into a multi-layer feature extraction network, feature information at different levels is extracted, and the feature information of each level is stacked to obtain fused features. In this embodiment, the multi-layer feature extraction network extracts three feature maps of different scales through convolution layers with different strides; the feature map of each scale passes through a deconvolution layer with the corresponding stride to be enlarged to the same size, and the maps are finally stacked along the depth direction to obtain the fused features. A sketch of such a backbone is given below.
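The following PyTorch sketch shows one possible layout of this multi-scale backbone. The channel counts, the strides and the choice of three branches at 1×, 1/2× and 1/4× resolution are assumptions used for illustration.

```python
# A sketch of a three-branch strided-convolution backbone with transposed
# convolutions that restore all branches to the input resolution.
import torch
import torch.nn as nn

def conv_block(cin, cout, stride):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU())

class MultiScaleBackbone(nn.Module):
    def __init__(self, cin=96):                          # cin = D + K (assumed value)
        super().__init__()
        self.down1 = conv_block(cin, 64, 1)               # full resolution
        self.down2 = conv_block(64, 128, 2)               # 1/2 resolution
        self.down3 = conv_block(128, 256, 2)              # 1/4 resolution
        self.up1 = nn.ConvTranspose2d(64, 128, 1, stride=1)
        self.up2 = nn.ConvTranspose2d(128, 128, 2, stride=2)
        self.up3 = nn.ConvTranspose2d(256, 128, 4, stride=4)

    def forward(self, x):                                 # x: (B, D+K, H, W) stacked BEV features
        f1 = self.down1(x)
        f2 = self.down2(f1)
        f3 = self.down3(f2)
        # stack the upsampled maps along the depth direction -> fused features
        return torch.cat([self.up1(f1), self.up2(f2), self.up3(f3)], dim=1)   # (B, 384, H, W)
```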
s7, outputting the fusion features to an anchor-frame-free target detector to obtain a 3D target detection result, in this embodiment, the anchor-frame-free target detector includes five detection heads, specifically, a key point heat map detection head, a local offset detection head, a z-axis positioning detection head, a 3D target size detection head, and a direction detection head;
the five detection heads regress the fused features through convolution layers to obtain, for each pixel of the fused feature map, the probability that it is a class center point, its offset from the ground-truth center point along the X and Y directions, its offset from the ground-truth center point along the Z direction, and the geometric size and orientation of the three-dimensional detection box;
the five detection heads predict the object centers on the bird's-eye-view plane and regress the different attributes of the 3D bounding boxes, and finally the outputs of the five detection heads are combined to generate the 3D target detection result. A sketch of the heads follows.
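A minimal PyTorch sketch of the five parallel heads is shown below. The intermediate channel width, the number of classes and the sin/cos encoding of the heading are assumptions; the patent only specifies which quantities are regressed.

```python
# Five parallel convolutional heads over the fused BEV feature map.
import torch
import torch.nn as nn

class AnchorFreeHeads(nn.Module):
    def __init__(self, cin=384, num_classes=3):
        super().__init__()
        def head(cout):
            return nn.Sequential(nn.Conv2d(cin, 64, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(64, cout, 1))
        self.heatmap = head(num_classes)   # probability of being a class center point
        self.offset  = head(2)             # x, y offset to the true center
        self.z       = head(1)             # z-axis location
        self.size    = head(3)             # length, width, height of the 3D box
        self.heading = head(2)             # sin/cos encoding of the yaw (assumption)

    def forward(self, feat):
        return {'heatmap': self.heatmap(feat).sigmoid(),
                'offset': self.offset(feat), 'z': self.z(feat),
                'size': self.size(feat), 'heading': self.heading(feat)}
```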
In summary, the present invention performs 3D target detection based on a neural network; its input data comprise color images and laser point clouds, and the main data processing flow is shown in fig. 2:
1. the category of each pixel of the color image is obtained through a semantic segmentation network, and the category information is assigned to the laser point cloud through the extrinsic parameters between the sensors, enhancing the features of the laser point cloud to obtain the enhanced laser point cloud;
2. geometric feature encoding and visibility feature encoding are performed on the enhanced laser point cloud respectively (as shown in FIG. 3), and the bird's-eye-view features obtained by each module are stacked to obtain the stacked features;
3. the stacked features are input into a multi-layer feature extraction network, feature information at different levels is extracted, and the feature information of each level is stacked to obtain the fused features;
4. the category, offset, Z value, size and orientation of the 3D target are obtained from the fused features through the different detection heads.
In the neural network training process, the invention uses a data augmentation method combining images and point clouds to alleviate the problem of class imbalance and thereby improve the accuracy of 3D target detection. The augmentation mainly comprises random flipping along the X axis and random insertion of ground-truth targets. Random insertion of ground-truth targets means storing the image 2D boxes, image semantic segmentation pixels, ground-truth point clouds and 3D boxes of all ground-truth targets in the dataset, and randomly selecting some of them to insert into the currently retrieved sample; all laser points of the current sample that fall inside the contour region formed by the randomly selected ground-truth point cloud are deleted. A sketch of these augmentations is given below.
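The following numpy sketch illustrates the two augmentations under assumed data layouts (boxes as (cx, cy, cz, l, w, h, yaw), a ground-truth database of stored target points and boxes); the deletion of existing points inside an inserted target's contour is noted but omitted for brevity.

```python
# A minimal sketch of random X-axis flipping and ground-truth target insertion.
import numpy as np

def random_flip_x(points, boxes, prob=0.5):
    """Flip the scene across the X axis (y -> -y); boxes are (cx, cy, cz, l, w, h, yaw)."""
    if np.random.rand() < prob:
        points = points.copy(); boxes = boxes.copy()
        points[:, 1] *= -1
        boxes[:, 1] *= -1
        boxes[:, 6] *= -1          # mirror the heading angle
    return points, boxes

def gt_sampling(points, boxes, gt_database, num_insert=5):
    """Randomly paste stored ground-truth targets (points and 3D boxes) into the frame.

    gt_database: list of dicts {'points': (M, C), 'box': (7,)} built from the dataset.
    NB: the patent also deletes existing laser points that fall inside the inserted
    target's contour region; that step is omitted here for brevity.
    """
    picks = np.random.choice(len(gt_database),
                             size=min(num_insert, len(gt_database)), replace=False)
    for k in picks:
        sample = gt_database[k]
        points = np.vstack([points, sample['points']])
        boxes = np.vstack([boxes, sample['box'][None, :]])
    return points, boxes
```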
In this embodiment, the main input sensor data comprise a color image and a laser point cloud. The color image is passed through a semantic segmentation network to obtain the class of each pixel, and this class information expands the features of the laser point cloud. In addition, according to the laser beam propagation principle, the space between the sensor and a laser point is not blocked by obstacles and can be regarded as free space. Therefore, the 3D target detection task can exploit the image semantic features and the spatial visibility features to improve detection accuracy. Finally, the fused features are obtained with a multi-layer feature extraction network, and the 3D target detection task is completed by the detection heads; the overall network structure of the target detection is shown in FIG. 4.
Specifically, the method comprises the following steps:
Step 1: the category of each pixel of the color image is classified by a mainstream semantic segmentation method (such as DeepLabV3+), the laser point cloud is projected onto the image to obtain the image semantic information of each laser point, and the features of the laser point cloud are enhanced;
Step 2: pillars (W, H, 1) are divided in the sensing region space, the feature-enhanced laser point cloud is mapped to each pillar, the features of the laser points in a pillar are further enhanced with the geometric and position information of all points in the pillar, and the features within each pillar are extracted by a multi-layer perceptron to obtain the geometric features (W, H, D);
Step 3: voxels (W, H, O) are divided in the sensing region space, the spatial visibility state of each voxel is obtained with a ray casting algorithm as shown in FIG. 5, and high-dimensional features are further extracted through a convolution layer to obtain the visibility features (W, H, K);
Step 4: the geometric features and the visibility features are stacked along the depth direction and input into a multi-layer feature extraction network, feature information at different levels is extracted, and the feature information of each level is stacked to obtain the fused features; the structure of the multi-layer feature extraction network is shown in FIG. 4;
Step 5: the feature map output by the multi-layer feature extraction network is connected to five different detection heads (as shown in fig. 6) to predict the object centers on the bird's-eye-view plane and regress the different attributes of the 3D bounding boxes.
For the above steps, further detailed description is given:
1. Enhancing point cloud features with image semantic features
The semantic segmentation model is pre-trained (e.g. on the Cityscapes dataset), and the segmentation results contain 4 classes (class_id): Car, Pedestrian, Cyclist and Background.
The spatial point cloud can be associated with the pixel positions of the corresponding image by projective transformation from the lidar coordinate system to the camera coordinate system and then from the camera coordinate system to the image coordinate system. In this way, the semantic information carried by the image pixels is appended to each point projected from the lidar coordinate system into the image coordinate system, i.e. a new data dimension is added to the originally input point cloud, thereby realizing semantic enhancement of the point cloud. Finally, the original point set is expanded from 4 dimensions (x, y, z, r) to 5 dimensions (x, y, z, r, class_id).
2. Dynamic voxelization
With hard voxelization, the number of non-empty pillars and the number of points per pillar are limited. For example, for each input sample a maximum number of non-empty pillars must be set; the number of non-empty pillars among the H × W pillars generally does not exceed this maximum, and otherwise the pillars are randomly sampled. Likewise, if the number of points per pillar is set to N, points are randomly sampled in any pillar that exceeds N points, and pillars with fewer than N points are zero-padded. Each processed point carries a feature of D = 9 dimensions, forming a point cloud set of dimension (N, D) within one pillar; if P pillars are selected, the input is finally converted into a point cloud set of dimension (P, N, D).
With dynamic voxelization, by contrast, all points within each pillar are retained for learning pillar-level features. For each input sample we obtain a point set of dimension (N′, D), where N′ is the total number of points in the sample. Dynamic voxelization does not fix the number of points per pillar; instead, it stores the coordinates of the pillar in which each point lies, so that pillar-level features can be computed accordingly. A sketch of this step follows.
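The following numpy sketch shows dynamic voxelization under assumed grid parameters (the perception range and pillar size are borrowed from common KITTI-style settings and are not stated in the patent): every in-range point is kept and tagged with its pillar index, with no padding or sampling.

```python
# Dynamic voxelization: keep all points, record the pillar index of each point.
import numpy as np

def dynamic_voxelize(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
                     pillar_size=(0.16, 0.16)):
    """Return per-point pillar coordinates; no points are dropped and none are padded.

    points: (N', D) enhanced point features with x in column 0 and y in column 1.
    """
    in_range = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
                (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[in_range]
    ix = ((pts[:, 0] - x_range[0]) / pillar_size[0]).astype(int)
    iy = ((pts[:, 1] - y_range[0]) / pillar_size[1]).astype(int)
    coords = np.stack([ix, iy], axis=1)            # (N'', 2) pillar index of each point
    return pts, coords
```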
3. Point set visibility feature encoding
The number of input points is N, each point containing x, y, z spatial coordinates. First, all voxels are initialized to the unknown state, and the ray casting algorithm is then executed for each ray. If a traversed voxel is already occupied, the traversal simply continues. The final voxel reached by a ray is marked as occupied, and all other voxels traversed by the ray are marked as free. Finally, the whole three-dimensional point cloud space is represented by the three states unknown, occupied and free.
This stage mainly consists of reconstructing the spatial visibility state, computing the voxel-level visibility features, and finally integrating them into the three-dimensional target detection framework. The two-dimensional case is described first, which makes the subsequent extension to three-dimensional space easier to understand.
After the voxel coordinates traversed by the rays are obtained, the visibility state of each voxel in the whole three-dimensional space can be determined. The visibility of the three-dimensional space is defined here as three different states, namely Unknown space, Occupied space and Free space, each characterized by a specific numerical value. These voxel-level visibility features are stacked along the z axis and then aggregated using 1 × 1 convolution layers, resulting in a visibility feature of dimension (H, W, K). A simplified sketch of the visibility reconstruction follows.
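The sketch below reconstructs the three-state visibility grid in numpy. It replaces exact voxel traversal (e.g. Amanatides-Woo) with dense sampling along each ray for brevity, and the grid origin, resolution and extent are assumptions rather than values given in the patent.

```python
# Mark voxels along each lidar ray as free, the end voxel as occupied,
# and leave everything else unknown.
import numpy as np

UNKNOWN, FREE, OCCUPIED = 0, 1, 2

def visibility_grid(points, origin=(0.0, 0.0, 0.0), voxel=0.4,
                    lo=(0.0, -40.0, -3.0), shape=(176, 200, 10)):
    grid = np.full(shape, UNKNOWN, dtype=np.int8)
    origin = np.asarray(origin); lo = np.asarray(lo)

    def to_idx(p):
        return np.floor((p - lo) / voxel).astype(int)

    for p in points[:, :3]:
        # sample the ray at half-voxel steps (approximation of exact traversal)
        n_steps = int(np.linalg.norm(p - origin) / (voxel * 0.5)) + 1
        for t in np.linspace(0.0, 1.0, n_steps, endpoint=False):
            idx = to_idx(origin + t * (p - origin))
            if np.all(idx >= 0) and np.all(idx < shape) and grid[tuple(idx)] != OCCUPIED:
                grid[tuple(idx)] = FREE
        end = to_idx(p)
        if np.all(end >= 0) and np.all(end < shape):
            grid[tuple(end)] = OCCUPIED
    return grid   # later stacked along z and aggregated by a 1x1 convolution
```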
4. Anchor-free detection heads
The anchor-free target detector consists of five detection heads: a key-point heat-map detection head, a local offset detection head, a z-axis positioning detection head, a 3D target size detection head and an orientation detection head. FIG. 6 shows some details of the anchor-free target detector.
The heat-map detection head and the offset detection head predict the key-point heat map and the local offset regression map. The key-point heat map is used to find the location of the center of each target object in the bird's-eye view. The offset regression map not only helps the heat map find a more accurate object center in the bird's-eye view, but also compensates for the discretization error introduced by pillarization.
The offset regression detection head has two main functions. First, it eliminates the error caused by pillarization, in which floating-point object centers are assigned to integer pillar positions in the bird's-eye view. Second, it plays an important role in refining the object centers predicted by the heat map, especially when the heat map predicts a wrong center: once the heat map predicts a center that is several pixels away from the true center, the offset detection head can mitigate or even eliminate this pixel error relative to the true object center. A square region of radius r around the object's center pixel is selected in the offset regression map; the farther a pixel is from the object center, the larger its offset value, and the offset is trained with an L1 loss.
After a target object has been located in the bird's-eye view, only its x-y position is available, so the z-axis positioning detection head is required to regress the z value, which is likewise trained with an L1 loss function. A sketch of these regression losses is given below.
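A minimal PyTorch sketch of the L1 regression losses for the offset and z heads, evaluated only at ground-truth center pixels; the tensor layout and the helper name center_l1_loss are assumptions.

```python
# L1 regression loss collected at the ground-truth center pixels of a head output.
import torch
import torch.nn.functional as F

def center_l1_loss(pred_map, target, centers):
    """pred_map: (B, C, H, W) head output; target: (M, C) regression targets;
       centers:  (M, 3) long tensor of (batch, y, x) ground-truth center pixels."""
    b, y, x = centers[:, 0], centers[:, 1], centers[:, 2]
    pred = pred_map[b, :, y, x]                 # (M, C) predictions at the centers
    return F.l1_loss(pred, target)
```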
The method first enhances the raw point cloud data with the image semantic information in the data preprocessing stage; secondly, it reconstructs the spatial visibility state of the point cloud with a ray casting algorithm; then it aggregates the geometric and semantic features and the visibility features of the point set with convolutional neural networks.
The upsampled feature map output by the network is then connected to five different detection heads to predict the object centers on the bird's-eye-view plane and regress the different attributes of the 3D bounding boxes. Finally, the outputs of the five detection heads are combined to generate the detection result. The key-point heat-map detection head predicts the centers of objects in the bird's-eye-view plane, and each object is encoded as a small region centered on a heat peak.
The method combines the aggregated point-set geometric features, point-set visibility features and point-set semantic features into an end-to-end three-dimensional target detection framework, realizing data fusion of the point cloud and the image. It innovatively introduces the concept of spatial visibility of the point cloud and effectively reconstructs the occupancy state of the point cloud space, which helps to accurately perceive the real state of the three-dimensional scene. In addition, during the inference phase each heat peak is picked up by a max pooling operation, after which no multiple regressed anchors are tiled to a single location, so conventional NMS is not needed. This allows the entire detector to run on a typical CNN accelerator or GPU, saving CPU resources for other critical tasks of autonomous driving. A sketch of this NMS-free decoding step follows.
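The sketch below shows the max-pooling-based peak picking used instead of NMS; the 3×3 kernel and the top-K value are assumptions.

```python
# NMS-free decoding: keep local maxima of the heat map via 3x3 max pooling,
# then take the top-K peaks as object-center candidates.
import torch
import torch.nn.functional as F

def extract_peaks(heatmap, k=100):
    """heatmap: (B, C, H, W) sigmoid class-center scores."""
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    peaks = heatmap * (pooled == heatmap)                  # keep only local maxima
    b, c, h, w = peaks.shape
    scores, idx = peaks.view(b, -1).topk(k)                # flatten classes and pixels
    cls = torch.div(idx, h * w, rounding_mode='floor')
    rem = idx % (h * w)
    ys = torch.div(rem, w, rounding_mode='floor')
    xs = rem % w
    return scores, cls, ys, xs                             # per-object center candidates
```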

Claims (10)

1. A multi-sensor fusion-based anchor-frame-free 3D target detection method is characterized by comprising the following steps:
s1, acquiring a color RGB image and a laser point cloud;
s2, performing semantic segmentation on the color RGB image to obtain the category information of each pixel;
s3, enhancing the characteristics of the laser point cloud by utilizing the category information of each pixel to obtain an enhanced laser point cloud;
s4, respectively carrying out point cloud geometric feature coding and point cloud visibility feature coding according to the enhanced laser point cloud so as to respectively obtain geometric features and visibility features;
s5, stacking the geometric features and the visibility features to obtain stacking features;
s6, inputting the stacking features into a multi-layer feature extraction network, extracting feature information of different levels, and stacking the feature information of each level to obtain fusion features;
and S7, outputting the fusion characteristics to the anchor frame-free target detector to obtain a 3D target detection result.
2. The multi-sensor fusion-based anchor-frame-free 3D object detection method according to claim 1, wherein the step S2 specifically performs semantic segmentation on the color RGB image by using a semantic segmentation network, and the semantic segmentation network comprises U-Net, SegNet, DeepLab or BiSeNet;
the category information of each pixel in the step S2 includes 4 categories of vehicle, pedestrian, non-motor vehicle, and background.
3. The anchor-frame-free 3D object detection method based on multi-sensor fusion according to claim 1, wherein the specific process of the step S3 is as follows: through projection transformation from the lidar coordinate system to the camera coordinate system and then from the camera coordinate system to the image coordinate system, the spatial point cloud is associated with the pixel positions of the corresponding color RGB image, and the semantic information carried by the color RGB image pixels is appended to each point projected from the lidar coordinate system into the image coordinate system, which is equivalent to adding a new data dimension to the originally input laser point cloud and giving the laser point cloud a category attribute, thereby realizing semantic enhancement of the point cloud.
4. The anchor-frame-free 3D target detection method based on multi-sensor fusion as claimed in claim 1, wherein the specific process of point cloud geometric feature encoding in step S4 is as follows:
firstly, a three-dimensional grid is set on the horizontal plane within the ego-vehicle sensing range, with W cells along the X axis, H cells along the Y axis and 1 cell along the Z axis, thereby forming pillars, and the laser point cloud is mapped to the corresponding pillars;
the features contained in each laser point in a pillar comprise (x, y, z, i, t, c, ox, oy, dx, dy, dz), where (x, y, z, i, t) are the original features of the laser point, namely its spatial rectangular coordinates, reflectivity and timestamp; c is the classification feature given by image semantic segmentation; (ox, oy) is the offset of the laser point from the pillar center axis in x and y; and (dx, dy, dz) is the deviation of the laser point from the mean position (x̄, ȳ, z̄) of all laser points;
all enhanced laser points in a pillar are then feature-encoded by a multi-layer perceptron to generate point cloud geometric features with a fixed feature dimension D, and max pooling is performed along the point-count direction, so that the encoded geometric feature has size (W, H, D).
5. The anchor-frame-free 3D object detection method based on multi-sensor fusion as claimed in claim 4, wherein the specific process of point cloud visibility feature encoding in step S4 is as follows:
firstly, a three-dimensional grid is set on the horizontal plane within the ego-vehicle sensing range, with W cells along the X axis, H cells along the Y axis and O cells along the Z axis, thereby forming voxels;
each laser point in the laser point cloud forms a line segment with the lidar center; the voxels through which the line segment passes are marked as free, the voxel containing the laser point is marked as occupied, and the remaining voxels are marked as unknown;
the voxels along the Z-axis direction form a feature vector, which is converted by a convolution layer into a visibility feature of length K, so that the encoded visibility feature has size (W, H, K).
6. The multi-sensor fusion-based anchor-frame-free 3D object detection method according to claim 5, wherein the step S5 specifically stacks the geometric features and the visibility features along the depth direction: the geometric features and the visibility features are each mapped into the grid image generated from the bird's-eye view of the stereoscopic grid and stacked along the depth direction to obtain stacked features of size (B, (D + K), H, W), where B, (D + K), H and W correspond to batch, depth, height and width, respectively.
7. The method for detecting an anchor-frame-free 3D object based on multi-sensor fusion as claimed in claim 1, wherein the multi-layer feature extraction network in step S6 extracts three feature maps of different scales through convolution layers with different strides, the feature map of each scale passes through a deconvolution layer with the corresponding stride to be enlarged to the same size, and the maps are finally stacked along the depth direction to obtain the fused features.
8. The method for detecting the anchor-frame-free 3D object based on the multi-sensor fusion as claimed in claim 1, wherein the anchor-frame-free object detector in the step S7 comprises five detection heads, specifically, a key point heat map detection head, a local offset detection head, a z-axis positioning detection head, a 3D object size detection head and an orientation detection head.
9. The method according to claim 8, wherein the five detection heads regress the fused features through convolution layers to obtain, for each pixel of the fused feature map, the probability that it is a class center point, its offset from the ground-truth center along the X and Y directions, its offset from the ground-truth center along the Z direction, and the geometric size and orientation of the three-dimensional detection box.
10. The method for detecting an anchor-frame-free 3D object based on multi-sensor fusion according to claim 9, wherein the step S7 specifically predicts the center of each object on the bird's-eye-view plane by using the five detection heads and regresses different attributes of the 3D bounding box, and finally the outputs of the five detection heads are combined to generate the 3D target detection result.
CN202111384455.4A 2021-11-18 2021-11-18 Anchor-frame-free 3D target detection method based on multi-sensor fusion Pending CN114118247A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111384455.4A CN114118247A (en) 2021-11-18 2021-11-18 Anchor-frame-free 3D target detection method based on multi-sensor fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111384455.4A CN114118247A (en) 2021-11-18 2021-11-18 Anchor-frame-free 3D target detection method based on multi-sensor fusion

Publications (1)

Publication Number Publication Date
CN114118247A true CN114118247A (en) 2022-03-01

Family

ID=80439224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111384455.4A Pending CN114118247A (en) 2021-11-18 2021-11-18 Anchor-frame-free 3D target detection method based on multi-sensor fusion

Country Status (1)

Country Link
CN (1) CN114118247A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115424022A (en) * 2022-11-03 2022-12-02 南方电网数字电网研究院有限公司 Power transmission corridor ground point cloud segmentation method and device and computer equipment
CN115424022B (en) * 2022-11-03 2023-03-03 南方电网数字电网研究院有限公司 Power transmission corridor ground point cloud segmentation method and device and computer equipment
CN117152422A (en) * 2023-10-31 2023-12-01 国网湖北省电力有限公司超高压公司 Ultraviolet image anchor-free frame target detection method, storage medium and electronic equipment
CN117152422B (en) * 2023-10-31 2024-02-13 国网湖北省电力有限公司超高压公司 Ultraviolet image anchor-free frame target detection method, storage medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination