CN115457257A - Cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud - Google Patents

Cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud

Info

Publication number
CN115457257A
Authority
CN
China
Prior art keywords
point cloud
sampling
point
candidate
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211033470.9A
Other languages
Chinese (zh)
Inventor
王博思
孙棣华
赵敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202211033470.9A priority Critical patent/CN115457257A/en
Publication of CN115457257A publication Critical patent/CN115457257A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08 Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a cross-modal vehicle detection method based on multi-layer local sampling and three-dimensional view cone point clouds, which comprises the following steps. Step 1: acquire point cloud data. Step 2: determine first sampling points from the point cloud data by down-sampling and record them as candidate points. Step 3: taking each candidate point as a center, determine its K nearest neighbor points with a KNN algorithm under an abstract feature constraint, record the set consisting of each candidate point and its K nearest neighbor points as a local region, and extract features from each local region with PointNet to generate a feature vector, so that each local region corresponds to one candidate point and one feature vector. Step 4: judge whether the number of sampling points has stopped decreasing; if not, execute step 5, and if so, execute step 7. Step 5: determine second sampling points from the candidate points by down-sampling. Step 6: take the second sampling points as candidate points and repeat steps 3 and 4. Step 7: determine the local features of each point in the point cloud.

Description

Cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud
Technical Field
The invention belongs to the field of autonomous driving, and in particular relates to a cross-modal vehicle detection method based on multi-layer local sampling and three-dimensional view cone point clouds.
Background
As vehicle automation levels rise, the traditional approach of acquiring single-sensor data, or simply combining multi-sensor data, can no longer meet detection requirements, and cross-modal data fusion solutions have emerged. A modality is a form of data. Visual information and point clouds are data of two different modalities; although they exist in different forms, both are used to keep improving the detection accuracy of autonomous-driving targets. Emerging cross-modal data fusion can combine not only sensor data of different modalities but also 3D point cloud data of different modalities by relying on technologies such as neural networks and deep learning, bringing innovation to the field of target detection.
In practical applications, according to the level at which data from sensors of different modalities is fused, current cross-modal data fusion algorithms for autonomous driving can be divided into three categories: data-level fusion, feature-level fusion, and decision-level fusion. Data-level fusion is a low-level fusion mode; it can greatly enrich the data describing a detection target, and the mutual compensation and combination of information from sensors of different modalities overcomes the shortage of information in any single modality. Feature-level fusion first extracts features from the sensor data of each modality separately, then fuses the extracted features, and finally obtains a detection result from the fused output. Many deep-learning-based fusion methods are implemented by cascading or weighting the features that a neural network extracts from different sensors, such as AVOD, MV3D, RoarNet, PointFusion, and F-PointNet. Decision-level fusion algorithms extract features from the acquired information and output decision information, which each sensor can produce independently; all decision information is then fused, and the fusion result is analyzed to make the final decision.
To address the disorder and lack of structure of point clouds, PointNet introduced a scheme that applies deep learning directly to point clouds to extract features. The PointNet network as a whole is divided into a classification network and a segmentation network. Although PointNet is simple and efficient, its structure shows that it maps each point from a low dimension to a high dimension with an MLP and then aggregates the features of all mapped points through max pooling; that is, the whole process always operates either on a single point or on all points. It therefore cannot extract fine local features of an object well, which affects the performance of object segmentation.
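For orientation, the PointNet design described here (a shared per-point MLP followed by max pooling over all points) can be sketched in a few lines of PyTorch; the layer widths below are illustrative assumptions, not the configuration used by PointNet or by this patent.

```python
import torch
import torch.nn as nn

class MiniPointNet(nn.Module):
    """Minimal PointNet-style encoder: a shared per-point MLP followed by
    max pooling over all points. Layer widths are illustrative only."""
    def __init__(self, in_dim=3, feat_dim=128):
        super().__init__()
        # 1x1 convolutions act as an MLP shared across points.
        self.mlp = nn.Sequential(
            nn.Conv1d(in_dim, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, feat_dim, 1), nn.ReLU(),
        )

    def forward(self, pts):                      # pts: (B, N, in_dim)
        x = self.mlp(pts.transpose(1, 2))        # (B, feat_dim, N)
        return x.max(dim=2).values               # (B, feat_dim) global feature

# Example: encode a batch of 2 clouds with 1024 points each -> (2, 128).
feat = MiniPointNet()(torch.randn(2, 1024, 3))
```

Because the max pooling runs over all points at once, the encoder captures only a global signature of the cloud, which is exactly the limitation the multi-layer local sampling below is designed to overcome.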
Therefore, an accurate feature extraction method is required to extract local features of the point cloud.
Disclosure of Invention
In view of the above, the present invention provides a cross-modal vehicle detection method based on multi-layer local sampling and three-dimensional view cone point cloud.
The purpose of the invention is realized by the following technical scheme:
the invention provides a cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud, which comprises the following steps:
Step 1: acquiring point cloud data of a view cone candidate region, wherein the number of points in the point cloud data is recorded as N;
Step 2: determining N1 first sampling points from the point cloud data by a down-sampling method, and recording the first sampling points as candidate points;
Step 3: taking each candidate point as a center, determining K nearest neighbor points by a KNN algorithm with an abstract feature constraint, recording the set consisting of each candidate point and its corresponding K nearest neighbor points as a local region, and performing feature extraction on each local region with PointNet to generate a feature vector, so that each local region corresponds to one candidate point and one feature vector;
Step 4: judging whether the number of sampling points has stopped decreasing; if not, executing step 5, and if so, executing step 7;
Step 5: determining N2 second sampling points from the candidate points by a down-sampling method;
Step 6: taking the second sampling points as candidate points, and repeating steps 3 and 4;
Step 7: determining the local features of each point in the point cloud data.
Further, the down-sampling method is farthest point sampling.
Further, determining the K nearest neighbor points with the KNN algorithm under the abstract feature constraint comprises acquiring the K points that are closest to the candidate point in both three-dimensional space distance and feature space distance.
Further, acquiring the point cloud data of the view cone candidate region comprises:
acquiring multi-modal data of a region to be detected, wherein the multi-modal data comprises RGB image data, binocular vision point cloud data, and lidar point cloud data;
extracting image features with an image detection network from the RGB image data, and generating a two-dimensional candidate region with a region proposal network;
generating the view cone candidate region from the binocular vision point cloud data and the two-dimensional candidate region;
and acquiring the point cloud data of the view cone candidate region from the binocular vision point cloud data and the lidar point cloud data.
Further, the image detection network includes Faster R-CNN.
Further, the cross-modal vehicle detection method based on multi-layer local sampling and three-dimensional view cone point clouds further comprises: fusing the image features and the local features of each point to generate a three-dimensional detection box; and performing vehicle detection on the region to be detected using the three-dimensional detection box.
Further, the loss function for fusing the image features and the local features of each point cloud is:
[Multi-task fusion loss function; shown as an image in the original publication.]
where N is the number of input fused point clouds, the first offset term represents the offset between the corner positions of the ground-truth box and of the prediction box obtained for the i-th input fused point cloud, the second offset term represents the offset between the predicted box and the anchor box, L_score represents the score loss of the function, and L_stn represents the spatial transformation regularization loss.
The invention has the beneficial effects that:
according to the invention, the characteristic learning of the network on the point cloud local area is realized through multilayer local sampling and multilayer characteristic extraction, so that the extracted local characteristics of the point cloud are more accurate, the classification precision is improved, and the subsequent target detection is more accurate.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings, in which:
FIG. 1 is a flow diagram of a cross-modal vehicle detection method based on multi-layer local sampling and three-dimensional view cone point clouds, according to an embodiment of the application;
FIG. 2 is a schematic diagram of local sampling according to one embodiment of the present application;
FIG. 3 is a schematic diagram illustrating feature extraction according to one embodiment of the present application;
FIG. 4 is a schematic diagram of a multi-layer local sampling shown in accordance with one embodiment of the present application;
FIG. 5 is a schematic diagram illustrating multi-level feature learning according to one embodiment of the present application;
FIG. 6 is a schematic diagram of local region feature extraction according to one embodiment of the present application;
FIG. 7 is a diagram illustrating sampling point selection within a constrained local region according to one embodiment of the present application;
FIG. 8 is a schematic diagram of a cross-modal data feature level fusion network architecture according to an embodiment of the present application;
fig. 9 is a schematic view of a viewing cone candidate region according to an embodiment of the present application.
Detailed Description
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. It should be understood that the preferred embodiments are illustrative of the invention only and are not limiting upon the scope of the invention.
The application provides a cross-modal vehicle detection method based on multi-layer local sampling and three-dimensional view cone point clouds. The method extracts the features of each point of the vehicle to be detected through multi-layer local sampling and multi-level feature learning, so that the point cloud features are extracted more accurately and classification accuracy can be improved.
Fig. 8 is a schematic diagram of a cross-modal data feature-level fusion network architecture according to an embodiment of the present application. As shown in Fig. 8, the cross-modal feature-level fusion network proposed in this application takes three kinds of modal data as input: RGB images, binocular vision point clouds, and lidar point clouds. The binocular vision point cloud and the lidar point cloud are both point clouds, but they come from different sensors: the binocular vision point cloud is generated by a binocular camera and the lidar point cloud by a lidar, so the two belong to different modalities. The RGB image and the binocular vision point cloud come from the same sensor.
In one embodiment, the cross-modal data feature-level fusion network comprises three modules: an image feature extraction module, a view cone region candidate module, and a fusion bounding-box regression module.
The image feature extraction module can use Faster R-CNN as the image detection network to extract image features, and a region proposal network (RPN) generates two-dimensional candidate regions in preparation for the subsequent extraction of view cone candidate regions. Most 3D sensors, particularly real-time depth sensors such as lidar, produce data at a resolution much lower than that of the RGB images acquired by a camera. Therefore, a two-dimensional target region can be extracted from the image with a widely used two-dimensional target detection network, and the target can be classified.
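As an illustration of this stage, an off-the-shelf 2D detector can supply the candidate boxes; the sketch below uses torchvision's Faster R-CNN implementation on a dummy KITTI-sized image, and the `weights` argument follows recent torchvision releases (an assumption about the environment, not part of the patent).

```python
import torch
import torchvision

# Off-the-shelf Faster R-CNN used only to produce 2D candidate boxes.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

with torch.no_grad():
    # A dummy KITTI-sized RGB image with values in [0, 1].
    preds = detector([torch.rand(3, 375, 1242)])

boxes = preds[0]["boxes"]    # (M, 4) candidate regions as (xmin, ymin, xmax, ymax)
scores = preds[0]["scores"]  # confidence of each candidate
labels = preds[0]["labels"]  # COCO class indices (vehicle classes can be filtered here)
```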
After the two-dimensional candidate region is obtained, in order to make the subsequent three-dimensional detection of the target more accurate, the two-dimensional candidate region is first lifted into a three-dimensional view cone candidate region using depth information, as shown in Fig. 9. The view cone region candidate module can generate a view cone candidate region by combining the two-dimensional candidate region produced by the RPN with the binocular vision point cloud. The view cone candidate region narrows the detection range in three-dimensional space, making the detection result more accurate. The point cloud features of the vehicle within the view cone candidate region are then extracted by a multi-stage feature extraction method, which can be summarized as follows: the input point cloud is first divided into several overlapping local regions; more accurate features are then extracted from each local region, in a manner similar to a CNN; the local features are aggregated into higher-level features; and this process is repeated until the features of all points in the point cloud are obtained.
The fusion bounding-box regression module can fuse the extracted image features and point cloud features (e.g., the local features of the point cloud) and generate an accurate three-dimensional detection box from the coordinate transformation of the view cone point cloud. The three-dimensional detection box can then be used to detect vehicles in the region to be detected.
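As a reference point, the feature-level fusion performed by this module can be sketched as a simple concatenation of each point's local feature with the image-level and point-cloud-level global features; the tensor shapes and the concatenation scheme below are illustrative assumptions, not the patent's exact network.

```python
import torch

def fuse_features(point_feats, img_global, pc_global):
    """Concatenate each point's local feature with the image-level and
    point-cloud-level global features (simple feature-level fusion).
    Shapes: point_feats (N, C), img_global (Ci,), pc_global (Cp,)."""
    n = point_feats.shape[0]
    glob = torch.cat([img_global, pc_global]).expand(n, -1)   # broadcast to all points
    return torch.cat([point_feats, glob], dim=1)              # (N, C + Ci + Cp)

# Example: 128 sampled points with 128-dim local features -> (128, 448).
fused = fuse_features(torch.randn(128, 128), torch.randn(256), torch.randn(64))
```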
For vehicle detection, multi-modal data of the region to be detected can first be acquired, including RGB image data, binocular vision point cloud data, and lidar point cloud data. Image features are then extracted from the RGB image data with the image detection network, and a two-dimensional candidate region is generated with the region proposal network; the view cone candidate region is generated from the binocular vision point cloud data and the two-dimensional candidate region; and finally the point cloud data of the view cone candidate region is obtained from the binocular vision point cloud data and the lidar point cloud data. Once the point cloud data of the view cone candidate region is determined, a multi-stage extraction network can be used to extract the features of the points within it.
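A minimal sketch of how a view cone candidate region can be carved out of a point cloud, assuming the points are already expressed in the camera frame and a 3x4 projection matrix (e.g., KITTI's P2) is available; the matrix values and box below are illustrative only.

```python
import numpy as np

def frustum_points(points_xyz, P, box2d):
    """Select the points whose image-plane projection falls inside a 2D
    candidate box, i.e. the view cone (frustum) candidate region.
    P is an assumed 3x4 camera projection matrix; points are assumed to be
    in the camera coordinate frame (otherwise apply a lidar-to-camera
    transform first). box2d = (xmin, ymin, xmax, ymax)."""
    pts_h = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])  # (N, 4)
    uvw = pts_h @ P.T                                               # (N, 3)
    u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]
    xmin, ymin, xmax, ymax = box2d
    mask = (u >= xmin) & (u <= xmax) & (v >= ymin) & (v <= ymax) & (uvw[:, 2] > 0)
    return points_xyz[mask]

# Example with random points and an illustrative projection matrix.
pts = np.random.uniform(-10, 10, size=(5000, 3))
P = np.array([[700., 0., 600., 0.],
              [0., 700., 180., 0.],
              [0., 0., 1., 0.]])
region = frustum_points(pts, P, (500, 150, 700, 250))
```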
Fig. 1 is a flowchart of a cross-modal vehicle detection method based on multi-layer local sampling and three-dimensional view cone point cloud according to an embodiment of the present application.
Step 1: and acquiring point cloud data of the view cone candidate area in the area to be detected, wherein the number of the point cloud data is recorded as N.
Step 2: and determining N1 first sampling points from the point cloud data by a down-sampling method, and recording the first sampling points as candidate points. In some embodiments, the downsampling method may include a farthest point sampling.
And step 3: and taking each candidate point as a center, determining K nearest neighbor points by adopting a KNN algorithm with abstract feature constraint, marking a set consisting of each candidate point and the corresponding K nearest neighbor points as a local region, performing feature extraction on each local region by adopting PointNet to generate a feature vector, and enabling each local region to correspond to one candidate point and one feature vector.
And 4, step 4: and (5) judging whether the number of the sampling points is not reduced any more, if not, executing the step 5, and if so, executing the step 7.
And 5: determining N2 second sample points from the candidate points by a down-sampling method. In some embodiments, N2 may be equal to N1.
Step 6: and taking the second sampling point as a candidate point, and repeatedly executing the steps 3 and 4.
And 7: local features of each point cloud in the point cloud data are determined.
In step 2, the input point set may be down-sampled by iteratively applying farthest point sampling (FPS). After farthest point sampling, the selected local region point set is defined as the K points that are nearest, in Euclidean distance, to a center o within the spherical region of radius r centered on o, as shown in Fig. 2. After farthest point sampling has selected a local region, feature extraction is performed with PointNet in the region's own coordinate system, as shown in Fig. 3.
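A minimal numpy sketch of the local-region selection just described (the K nearest points within a sphere of radius r around a center o); the radius and K values are illustrative assumptions.

```python
import numpy as np

def ball_query(center, pts, radius, k):
    """Return up to k indices of points lying within `radius` of `center`,
    ordered by Euclidean distance (the local-region selection of Fig. 2)."""
    d = np.linalg.norm(pts - center, axis=1)
    inside = np.where(d <= radius)[0]
    return inside[np.argsort(d[inside])][:k]

# Example: a 0.5-unit neighborhood of the first sampled point, at most 32 points.
pts = np.random.rand(2048, 3)
nbrs = ball_query(pts[0], pts, radius=0.5, k=32)
local_region = pts[nbrs] - pts[0]   # re-express in the region's local frame
```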
In step 3, after farthest point sampling and feature extraction on a single local region, a new sampling point is obtained that carries both the position of the original point in the global point cloud and a feature vector; the sampling point thus holds all of the geometric and feature information of the K surrounding points in its local region. Sampling and feature extraction are then performed on all of the divided local regions, finally yielding a new set of sampling points, as shown in Fig. 4. To further strengthen the multi-stage feature extraction capability of the network, the whole multi-layer local sampling procedure is repeated several times to realize multi-stage feature learning, as shown in Fig. 5. However, if only the Euclidean distance between points is considered when selecting a local region, the situation in the left panel of Fig. 6 easily occurs. In Fig. 6, points marked "a" are class-a points and points marked "b" are class-b points. Ideally, if a class-a point is surrounded by many nearby points of the same class, then, because the selection rule is the K closest points, the network will extract class-a features and the target segmentation and detection result will be class a, as shown in the right panel of Fig. 6. In practice, however, a class-a point may also be surrounded by a large number of nearby class-b points; if the class-b points outnumber the class-a points, the point is erroneously judged to be a class-b point under this selection rule, as shown in the left panel of Fig. 6.
In practice, the more similar two points are, the closer the abstract features extracted by multi-layer local sampling become; that is, the distance between the feature vectors of similar points is smaller. In the three-dimensional coordinate space of Fig. 7(a) and the feature space of Fig. 7(b), the points marked "b" are class-b points and the remaining unmarked points are class-a points. As shown in Fig. 7(a), in a fixed three-dimensional space the point cloud coordinates are (x, y, z), and for the class-a point at the coordinate origin the nearest points in Euclidean distance are class-b points, which interferes with feature extraction. In the n-dimensional feature space of Fig. 7(b), however, the more similar the features of points are, the more easily they group together and the closer their abstract feature vectors lie. Therefore, taking the abstract feature vector as an additional constraint when selecting sampling points gives the network better guidance: it learns the features of points that are close to the center point in both three-dimensional distance and feature-space distance, feature extraction is not disturbed by irrelevant points, and the network becomes more robust. Determining the K nearest neighbor points in step 3 with the KNN algorithm under the abstract feature constraint means that the nearest points to the center are determined by considering both the Euclidean distance from the center point and the distance between the points' abstract feature vectors. Adopting the KNN algorithm with the abstract feature constraint therefore yields higher target detection accuracy.
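Putting the pieces together, a minimal, self-contained numpy sketch of the multi-layer local sampling of steps 2-7 with the feature-constrained neighbor selection might look as follows; the helper functions, the weighting factor alpha, and the max-pooling stand-in for PointNet are illustrative assumptions, not the patent's actual network.

```python
import numpy as np

def farthest_point_sample(pts, m):
    """Pick m point indices by iterative farthest point sampling."""
    idx = [0]
    d = np.linalg.norm(pts - pts[0], axis=1)
    for _ in range(m - 1):
        idx.append(int(d.argmax()))
        d = np.minimum(d, np.linalg.norm(pts - pts[idx[-1]], axis=1))
    return np.array(idx)

def constrained_knn(center_xyz, center_feat, pts, feats, k, alpha=1.0):
    """K nearest neighbors under a joint 3D-distance plus feature-distance
    score (the abstract feature constraint); alpha is an assumed weight."""
    score = (np.linalg.norm(pts - center_xyz, axis=1)
             + alpha * np.linalg.norm(feats - center_feat, axis=1))
    return np.argsort(score)[:k]

def encode_region(region_pts, region_feats):
    """Placeholder for PointNet on one local region: max-pool the features."""
    return region_feats.max(axis=0)

def multilayer_local_sampling(points, feats, n_samples, k):
    """Steps 2-7: repeatedly down-sample candidate points and re-encode their
    local regions until the number of sampling points stops decreasing."""
    candidates, cand_feats = points, feats
    while True:
        sel = farthest_point_sample(candidates, n_samples)        # steps 2 / 5
        new_pts = candidates[sel]
        new_feats = []
        for i in sel:                                             # step 3
            nbrs = constrained_knn(candidates[i], cand_feats[i],
                                   candidates, cand_feats, k)
            new_feats.append(encode_region(candidates[nbrs], cand_feats[nbrs]))
        new_feats = np.stack(new_feats)
        if len(new_pts) >= len(candidates):                       # step 4
            return new_pts, new_feats                             # step 7
        candidates, cand_feats = new_pts, new_feats               # step 6

# Example: 1024 points with 16-dim initial features, 128 samples per level.
pts = np.random.rand(1024, 3).astype(np.float32)
init_feats = np.random.rand(1024, 16).astype(np.float32)
sampled_pts, local_feats = multilayer_local_sampling(pts, init_feats, 128, 32)
```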
On the basis of three-dimensional instance segmentation, three-dimensional target localization is realized with residuals. Three-dimensional target localization does not regress the absolute three-dimensional position of the target object, since its offset from the sensor can vary over a wide range: in the KITTI dataset, for example, the offset ranges from 5 meters to possibly more than 50 meters. To address this, the point cloud center of the target object is predicted in a mask coordinate system: after three-dimensional instance segmentation, the points in the view cone belonging to the target category of interest are extracted, and their coordinates are normalized to improve the overall translation invariance of the algorithm.
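The coordinate normalization mentioned here amounts to re-centering the segmented object points on their centroid; a minimal sketch, assuming the segmented points are given as an (N, 3) array:

```python
import numpy as np

def to_mask_frame(segmented_pts):
    """Re-center the instance-segmented object points on their centroid,
    yielding mask-frame coordinates invariant to global translation."""
    centroid = segmented_pts.mean(axis=0)
    return segmented_pts - centroid, centroid

# Example: normalize a segmented vehicle point cloud before box regression.
obj = np.random.rand(300, 3) * [4.0, 1.8, 1.5] + [20.0, -2.0, 0.5]
obj_local, center = to_mask_frame(obj)
```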
The method treats the input three-dimensional point cloud data as anchor points densely arranged in the view cone region, and fuses the global features of the vehicle target (for example, the global image features and the global point cloud features from the binocular camera) with every single-point feature extracted by the network. After the single-point features are fused, each point is associated with a three-dimensional anchor box whose category and scale match the corresponding vehicle, and the anchor box is then classified and refined using the new features produced by the fusion. The point cloud multi-level feature extraction network outputs the new feature of each fused point (i.e., the feature obtained after fusing the global point cloud features of the binocular camera with the local point cloud features of the lidar), and the loss function of the whole process is:
[Multi-task fusion loss function; shown as an image in the original publication.]
In the above formula, N is the number of input fused points (i.e., fused features), the first offset term represents the offset between the corner positions of the ground-truth box and of the prediction box obtained for the i-th input fused point, the second offset term represents the offset between the predicted box and the anchor box, L_score represents the score loss of the function, and L_stn represents the spatial transformation regularization loss. The coordinate regression function of the fusion network is:
[Coordinate regression function; shown as an image in the original publication.]
where x represents the difference between the prediction box and the ground-truth box, and the coordinate quantities participating in the regression are the seven values of the three-dimensional bounding box: the three-dimensional coordinates of the vehicle target, the length, width, and height of the bounding box, and its orientation angle.
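For reference, the regression function described here (a function of the prediction/ground-truth difference x over seven box parameters) is consistent with the standard smooth-L1 (Huber) form widely used for box regression; since the patent reproduces the formula only as an image, the following LaTeX is an assumed reconstruction, not a transcription.

```latex
% Assumed smooth-L1 (Huber) form of the coordinate regression function,
% with x_{ij} the difference between the j-th predicted and ground-truth
% box value (7 values per box) for the i-th fused point.
\[
  \ell(x) =
  \begin{cases}
    0.5\,x^{2}, & |x| < 1,\\[2pt]
    |x| - 0.5,  & \text{otherwise},
  \end{cases}
  \qquad
  L_{\text{reg}} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{7}\ell\!\left(x_{ij}\right).
\]
```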
Table 1 shows a comparison of 3D AP on the KITTI validation set for the vehicle (Car) class with different feature extraction methods, where "v0" denotes the detection algorithm that uses PointNet as the point cloud feature extraction network and "v1" denotes the detection algorithm that uses the multi-stage feature extraction point cloud segmentation method provided by this application as the point cloud feature extraction network.
[Table 1: 3D AP comparison on the KITTI validation set, Car class; shown as an image in the original publication.]
As can be seen from Table 1, detection accuracy is improved when the multi-stage feature extraction point cloud segmentation method provided by this application is used as the point cloud feature extraction network.
Finally, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud is characterized by comprising the following steps:
step 1: acquiring point cloud data of a view cone candidate region in a region to be detected, wherein the number of points in the point cloud data is recorded as N;
step 2: determining N1 first sampling points from the point cloud data by a down-sampling method, and recording the first sampling points as candidate points;
step 3: taking each candidate point as a center, determining K nearest neighbor points by a KNN algorithm with an abstract feature constraint, recording a set consisting of each candidate point and its corresponding K nearest neighbor points as a local region, and performing feature extraction on each local region with PointNet to generate a feature vector, so that each local region corresponds to one candidate point and one feature vector;
step 4: judging whether the number of sampling points has stopped decreasing; if not, executing step 5, and if so, executing step 7;
step 5: determining N2 second sampling points from the candidate points by a down-sampling method;
step 6: taking the second sampling points as candidate points, and repeating steps 3 and 4;
step 7: determining local features of each point in the point cloud data.
2. The cross-modal vehicle detection method based on multi-layered local sampling and three-dimensional view cone point cloud of claim 1, wherein the down-sampling method is a farthest point sampling.
3. The cross-modal vehicle detection method based on multi-layered local sampling and three-dimensional view cone point cloud of claim 1, wherein determining the K nearest neighbor points using the KNN algorithm with the abstract feature constraint comprises obtaining the K points that are closest to the candidate point in both three-dimensional space distance and feature space distance.
4. The cross-modal vehicle detection method based on multi-layer local sampling and three-dimensional view cone point cloud of claim 1, wherein obtaining point cloud data of a view cone candidate region comprises:
acquiring multi-modal data of a region to be detected, wherein the multi-modal data comprises RGB image data, binocular vision point cloud data, and lidar point cloud data;
extracting image features with an image detection network from the RGB image data, and generating a two-dimensional candidate region with a region proposal network;
generating the view cone candidate region from the binocular vision point cloud data and the two-dimensional candidate region;
and acquiring the point cloud data of the view cone candidate region from the binocular vision point cloud data and the lidar point cloud data.
5. The cross-modal vehicle detection method based on multi-layered local sampling and three-dimensional view cone point cloud of claim 4, wherein the image detection network comprises Faster R-CNN.
6. The cross-modal vehicle detection method based on multi-layer local sampling and three-dimensional view cone point cloud of claim 4, further comprising:
fusing the image features and the local features of each point cloud to generate a three-dimensional detection box;
and performing vehicle detection on the region to be detected using the three-dimensional detection box.
7. The cross-modal vehicle detection method based on multi-layered local sampling and three-dimensional view cone point cloud of claim 6, wherein the loss function for fusing the image features and the local features of each point cloud is:
[Multi-task fusion loss function; shown as an image in the original publication.]
wherein N is the number of input fused point clouds, the first offset term represents the offset between the corner positions of the ground-truth box and of the prediction box obtained for the i-th input fused point cloud, the second offset term represents the offset between the predicted box and the anchor box, L_score represents the score loss of the function, and L_stn represents the spatial transformation regularization loss.
CN202211033470.9A 2022-08-26 2022-08-26 Cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud Pending CN115457257A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211033470.9A CN115457257A (en) 2022-08-26 2022-08-26 Cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211033470.9A CN115457257A (en) 2022-08-26 2022-08-26 Cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud

Publications (1)

Publication Number Publication Date
CN115457257A true CN115457257A (en) 2022-12-09

Family

ID=84300491

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211033470.9A Pending CN115457257A (en) 2022-08-26 2022-08-26 Cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud

Country Status (1)

Country Link
CN (1) CN115457257A (en)

Similar Documents

Publication Publication Date Title
CN111201451B (en) Method and device for detecting object in scene based on laser data and radar data of scene
CN109829398B (en) Target detection method in video based on three-dimensional convolution network
CN112233097B (en) Road scene other vehicle detection system and method based on space-time domain multi-dimensional fusion
US6826293B2 (en) Image processing device, singular spot detection method, and recording medium upon which singular spot detection program is recorded
JP4429298B2 (en) Object number detection device and object number detection method
CN110766758B (en) Calibration method, device, system and storage device
JP7091686B2 (en) 3D object recognition device, image pickup device and vehicle
CN114495064A (en) Monocular depth estimation-based vehicle surrounding obstacle early warning method
Farag A lightweight vehicle detection and tracking technique for advanced driving assistance systems
CN114089329A (en) Target detection method based on fusion of long and short focus cameras and millimeter wave radar
CN114972968A (en) Tray identification and pose estimation method based on multiple neural networks
Wang et al. Radar ghost target detection via multimodal transformers
El Bouazzaoui et al. Enhancing rgb-d slam performances considering sensor specifications for indoor localization
CN116486287A (en) Target detection method and system based on environment self-adaptive robot vision system
CN112712066B (en) Image recognition method and device, computer equipment and storage medium
WO2024015891A1 (en) Image and depth sensor fusion methods and systems
CN117496401A (en) Full-automatic identification and tracking method for oval target points of video measurement image sequences
CN117333846A (en) Detection method and system based on sensor fusion and incremental learning in severe weather
JP4918615B2 (en) Object number detection device and object number detection method
CN115937520A (en) Point cloud moving target segmentation method based on semantic information guidance
CN115457257A (en) Cross-modal vehicle detection method based on multilayer local sampling and three-dimensional view cone point cloud
CN112766100A (en) 3D target detection method based on key points
Somawirata et al. Road and Obstacle Detection for Autonomous Electrical Vehicle Robot
JP2009205695A (en) Apparatus and method for detecting the number of objects
CN111666953A (en) Tidal zone surveying and mapping method and device based on semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination