CN116758506A - Three-dimensional vehicle detection method based on point cloud and image fusion - Google Patents
Info
- Publication number
- CN116758506A (application CN202310786009.9A)
- Authority
- CN
- China
- Prior art keywords
- point cloud
- fusion
- image
- roi
- detection method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/64—Three-dimensional objects
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a three-dimensional vehicle detection method based on point cloud and image fusion, in particular a method that uses a laser radar-camera sensor fusion architecture to extract color and texture information from the image and size and position information from the point cloud data. The method is realized through the following steps: preprocessing the point cloud data and the image data; designing an improved key point module for feature extraction; completing feature fusion of the point cloud and image data with an improved region-of-interest fusion module; and reducing redundancy of the detection boxes with improved non-maximum suppression. The method mainly addresses the difficulty unmanned vehicles have in accurately detecting vehicle obstacles and estimating their distance in traffic scenes, realizes information complementation between different modalities, provides position, contour, speed and other information for navigation, obstacle avoidance and path planning, and compensates for the weakness of unmanned vehicles in estimating target distance.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a three-dimensional vehicle detection method based on point cloud and image fusion.
Background
In recent years, unmanned driving technology has developed rapidly and become a focus of attention; it is of great significance to the development of fields such as transportation, intelligent manufacturing and service robots, and to the improvement of human quality of life. The growing number of motor vehicles puts unprecedented pressure on the traffic environment, and research institutions and enterprises around the world have invested large amounts of manpower and material resources in research and exploration aimed at improving driving safety and reducing the frequency of traffic accidents.
With the success of deep learning in the field of object detection, vehicle detection methods for unmanned vehicles have developed rapidly, but they still have major shortcomings in real traffic environments. First, the traffic environment is intricate and the degree of occlusion between vehicles varies, which makes vehicle detection difficult; second, uneven illumination easily causes false detections and missed detections. Most existing unmanned-vehicle vision systems use a monocular camera and cannot obtain the depth of the vehicle to be detected, so the sensor must continuously sense the vehicle's position while driving and perform many image processing operations, which leads to inaccurate positioning and excessive latency and seriously affects the operating efficiency of the unmanned vehicle. The main research direction in current environment perception is to fuse point cloud data and image information so that their features complement each other: the point cloud data and the image are processed separately and then fused by sub-networks to output results for three-dimensional vehicle detection.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a three-dimensional vehicle detection method based on point cloud and image fusion, which reduces the false detection rate and missed detection rate of three-dimensional vehicle detection.
The three-dimensional vehicle detection method based on point cloud and image fusion is realized through the following steps:
step one, preprocessing point cloud data and image data;
step two, designing a key point module for feature extraction;
step three, completing feature fusion of the point cloud and the image data by using an improved region-of-interest (ROI) fusion module;
and step four, reducing redundancy of the detection boxes by using improved non-maximum suppression.
The preprocessing step includes removing edge clutter points and noise points from the laser radar data with a statistical filtering algorithm, reducing the amount of computation for subsequent detection, and registering the laser radar and the monocular camera in time and space so that the coordinate systems of the two sensors correspond and are unified, laying the foundation for multi-modal object detection.
The key point module performs feature extraction on the point cloud data and the image data respectively to generate key points of the vehicle to be detected, and estimates the respective regions of interest from these key points.
The region-of-interest fusion module extracts local features from the point cloud and image regions, fuses the 3D ROI of the point cloud data with the corresponding 2D ROI of the image data using an attention-based fusion strategy, and uses the fused local features to predict the three-dimensional detection box.
The improved non-maximum suppression module replaces the fixed threshold with an adaptive threshold, adjusting the non-maximum suppression threshold according to the density of the vehicle distribution so as to improve the accuracy of three-dimensional vehicle detection as much as possible.
The invention has the following beneficial effects: 1. Filtering the point cloud data and cropping the image data reduces the amount of data to be processed. 2. Feature fusion of the point cloud data and the image data reduces the false detection rate and missed detection rate of vehicles in unmanned driving scenes and compensates for the lack of depth information in a monocular camera. 3. Fusing information from multiple modalities provides multiple views of the environmental information in an unmanned driving scene, which enhances the robustness of the data in complex environments and provides a more accurate basis for three-dimensional vehicle detection.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic flow chart of steps one to four in the present invention.
FIG. 2 is a diagram of the sensor calibration structure used in data preprocessing in the present invention.
Fig. 3 is a diagram of the sensor time synchronization structure in the present invention.
Fig. 4 is a block diagram of a region of interest attention fusion module in accordance with the present invention.
FIG. 5 is a flow chart of improved non-maximum suppression in the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples.
The three-dimensional vehicle detection method based on point cloud and image fusion is realized through the following steps, as shown in Fig. 1:
step one, preprocessing the point cloud data and image data; step two, designing a key point module for feature extraction; step three, completing feature fusion of the point cloud and image data using an improved region-of-interest fusion module; step four, reducing redundancy of the detection boxes using improved non-maximum suppression.
In the first step, the specific steps include:
A data set is collected according to different traffic scenes with different vehicle detection difficulties. Because outliers in point cloud data are sparsely distributed, whether outliers exist is judged from the local density of the point cloud. First, the neighborhood points of each point are found according to its position, and the distance from each point to its neighborhood points is computed. These distances follow a Gaussian distribution, so their mean and variance can be obtained by modeling the distance parameter, with the following calculation formulas:

μ = (1/(m·k)) Σᵢ Σⱼ d_ij  (1)

σ = √( (1/(m·k)) Σᵢ Σⱼ (d_ij − μ)² )  (2)

where d_ij is the distance from the i-th of the m points to the j-th of its k neighborhood points. The confidence interval of the mean distance, i.e. the distance threshold, is then computed: a point lying outside this interval is judged to be an outlier, and a point within it is not. The threshold calculation formula is as follows:
d=μ+λσ (3)
where λ is a coefficient of the threshold. By cropping the point cloud and the image to fixed sizes, feature matching is achieved between multiple views with the same aspect ratio.
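As a concrete illustration of the statistical filtering step above, the following Python sketch removes points whose mean neighborhood distance exceeds μ + λσ; the parameter values (k = 20, lam = 1.0) and the use of SciPy's k-d tree are assumptions for illustration, not details taken from the patent.

```python
# Hedged sketch of the statistical outlier filter described above
# (k-nearest-neighbour distances, Gaussian threshold d = mu + lambda*sigma).
import numpy as np
from scipy.spatial import cKDTree

def statistical_outlier_filter(points: np.ndarray, k: int = 20, lam: float = 1.0) -> np.ndarray:
    """Remove points whose mean distance to their k neighbours exceeds mu + lam*sigma."""
    tree = cKDTree(points[:, :3])
    # query k+1 neighbours because the closest neighbour of a point is the point itself
    dists, _ = tree.query(points[:, :3], k=k + 1)
    mean_dist = dists[:, 1:].mean(axis=1)          # per-point mean neighbour distance
    mu, sigma = mean_dist.mean(), mean_dist.std()  # global Gaussian model of the distances
    threshold = mu + lam * sigma                   # equation (3): d = mu + lambda*sigma
    return points[mean_dist <= threshold]
```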
The measured values in the different sensor coordinate systems are converted into a common coordinate system: an accurate coordinate conversion relationship is established among the laser radar coordinate system, the three-dimensional world coordinate system, the camera coordinate system, the image coordinate system and the pixel coordinate system, and the measurement points in the laser radar coordinate system are converted through this chain into the pixel coordinate system of the camera, realizing spatial synchronization of the two sensors. The multi-sensor spatial calibration structure is shown in Fig. 2.
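A minimal sketch of the coordinate chain just described, assuming an extrinsic rotation R, translation t and camera intrinsic matrix K obtained from calibration; the names and the helper function are illustrative, not the patent's implementation.

```python
# Hedged sketch: lidar point -> camera frame via extrinsics [R|t] -> pixel via intrinsics K.
import numpy as np

def lidar_to_pixel(pt_lidar: np.ndarray, R: np.ndarray, t: np.ndarray, K: np.ndarray):
    """pt_lidar: (3,) point; R: (3, 3); t: (3,); K: (3, 3). Returns (u, v) or None."""
    pt_cam = R @ pt_lidar + t          # lidar coordinate system -> camera coordinate system
    if pt_cam[2] <= 0:                 # behind the camera, cannot be projected
        return None
    uvw = K @ pt_cam                   # camera coordinates -> homogeneous pixel coordinates
    return uvw[0] / uvw[2], uvw[1] / uvw[2]
```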
Because the sampling frequencies of the camera and the laser radar differ, the two sensors must also be synchronized in time after spatial calibration. The sampling period of the laser radar is 50 ms (a sampling frequency of 20 Hz), while the camera captures 30 images per second (a sampling period of about 33.3 ms). Time synchronization between the sensors is realized by interval sampling: 100 ms, the least common multiple of the two sensors' sampling periods, is selected as the sampling period of the fusion system. The sensor time registration scheme is shown in Fig. 3.
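The interval-sampling synchronization can be sketched as follows; the pairing tolerance and function name are assumptions, and the sketch simply keeps one lidar/camera pair per 100 ms fusion cycle (every second scan aligned with every third image).

```python
# Illustrative sketch of the interval-sampling synchronisation: with a 50 ms lidar
# period and a ~33.3 ms camera period, the frames align every 100 ms (their LCM).
def pair_frames(lidar_stamps, camera_stamps, period_s=0.100, tol_s=0.005):
    """Keep one lidar/camera timestamp pair per 100 ms fusion cycle."""
    pairs = []
    t = min(lidar_stamps[0], camera_stamps[0])
    t_end = max(lidar_stamps[-1], camera_stamps[-1])
    while t <= t_end:
        lid = min(lidar_stamps, key=lambda s: abs(s - t))   # nearest lidar scan
        cam = min(camera_stamps, key=lambda s: abs(s - t))  # nearest camera frame
        if abs(lid - t) <= tol_s and abs(cam - t) <= tol_s:
            pairs.append((lid, cam))
        t += period_s
    return pairs
```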
In the second step, the specific steps include:
the key points in the point cloud are extracted by the SA layer in the PointNet++, the input point cloud is subjected to downsampling operation through the sampling layer, the downsampling operation adopts the furthest point sampling to obtain the center of the local area, and the point cloud data are respectively input to the F-FPS module and the D-FPS module. For the positioning of objects in the point cloud, especially far-small and occluded targets, the texture and color of the objects have certain importance, so that an image segmentation network is adopted to guide the selection of key points, and the segmentation characteristics of the point cloud are realized by using a characteristic propagation (FP) layer in PointNet++. The output of the SA layer is transmitted into the FP layer, the interpolation method based on the weighted average of the K nearest inverse distances is used for carrying out up-sampling operation on the points, and the calculation formula is as follows:
w in i (x)=1/d(x,x i ) p Representing an i-th neighbor point x corresponding to a certain point and k nearest neighbors i Inverse square of the euclidean distance between them. The interpolation point feature is connected with the jump connection point feature in the corresponding SA layer. The connected features pass through a PointNet of one unit, the properties of the connected features are similar to 1X 1 convolution in convolution, the feature vectors of points are updated by using a shared full connection layer and a ReLu layer, and finally the FN layer outputs semantic features corresponding to the same points of the original data. The image contains rich color and texture information, so that additional data information can be provided for the point cloud, and the accuracy of three-dimensional detection is improved. After the image is input, the image enters a deep v3+ neural network to obtain the segmentation characteristics and classification scores of the pixels, so that the selection of key points can ignore the background.
In the third step, the specific steps include:
and obtaining a central point of the target according to the position and the characteristics of the key point, and obtaining the spatial offset between the central point and the corresponding true value by adopting a partial network of Votenote. Firstly, inputting the point cloud with XYZ axes in the previous step into a network, carrying out up-sampling operation on the point cloud, learning deep features, and then outputting a subset of M points with XYZ coordinates and rich C-dimensional feature vectors. These point subsets with (3+C) dimensions are referred to as seed points, and votes are generated from each seed point independently by a shared voting module implemented by a fully connected layer, reLu, and batch normalized multi-layer perceptron (MLP). The multilayer perceptron characterizes the seed i To output the offset delta x of the European space i And a characteristic offset Δf i So that it votes v i =[y i ;g i ]From y i =x i +Δx i And g i =f i +Δf i The composition is formed. When the seed point s i When the three-dimensional offset is positioned on the surface of an object, the 3D offset is predicted as shown in the following formula:
where M is the number of seeds on the surface of the object,is the position x of the seed i Offset to the center of the three-dimensional bounding box of the object in which it is located.
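The voting step can be sketched in PyTorch as below; the layer sizes and the use of nn.Linear with batch normalization are assumptions in the spirit of VoteNet, not the exact network of the patent.

```python
# Rough sketch of a VoteNet-style voting module: a shared MLP maps each seed
# (x_i, f_i) to offsets (dx_i, df_i); the vote is (x_i + dx_i, f_i + df_i).
import torch
import torch.nn as nn

class VotingModule(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.BatchNorm1d(feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.BatchNorm1d(feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 3 + feat_dim),   # dx (3 values) and df (feat_dim values)
        )

    def forward(self, seed_xyz: torch.Tensor, seed_feat: torch.Tensor):
        # seed_xyz: (M, 3) seed positions, seed_feat: (M, C) seed features
        out = self.mlp(seed_feat)
        dx, df = out[:, :3], out[:, 3:]
        return seed_xyz + dx, seed_feat + df     # vote positions y_i and features g_i
```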
A 3D ROI pooling layer performs pooling on the points surrounding each of the previously obtained center points, learns the local features of the cluster of points near each center point, and encodes each ROI with a 3D bounding box. Each region of interest is parameterized with the obtained center point (x_o, y_o, z_o) as its center, with enlarged dimensions as the length and width of the region of interest and an enlarged height as its height value. The dimensions of the region of interest can thus be defined as (x_o, y_o, z_o, δ+h_i, δ+w_i, δ+l_i), where δ is an expansion parameter. The points in each 3D ROI are moved to positions relative to the center point so that local features can be learned better, and the points within the 3D ROI are then clustered by a sub-network of stacked multi-layer perceptron layers to extract the pooled features of the local ROI. After the 3D ROI is obtained, it is projected onto the image to generate the corresponding 2D ROI, and the local texture features of the 2D ROI are learned by a subsequent 2D ROI pooling layer in a manner similar to the processing of the 3D ROI. Finally, the 3D ROI of the point cloud and the corresponding 2D ROI of the image are fused through an attention fusion strategy, and the fused local features are used for 3D bounding box prediction.
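The ROI construction and its projection to the image plane can be illustrated as follows; the expansion parameter value and the 3×4 projection matrix P are assumed inputs, and the helpers are illustrative rather than the patented implementation.

```python
# Hedged sketch: build a 3D ROI centred at (x_o, y_o, z_o) with dimensions
# enlarged by delta, then project its box corners to an axis-aligned 2D ROI.
import numpy as np

def make_3d_roi(center, size, delta=0.5):
    """center: (x_o, y_o, z_o); size: (h, w, l). Returns (x_o, y_o, z_o, h+delta, w+delta, l+delta)."""
    h, w, l = size
    return (*center, h + delta, w + delta, l + delta)

def project_roi_to_2d(corners_3d, P):
    """corners_3d: (8, 3) box corners in camera coordinates; P: (3, 4) projection matrix."""
    pts = np.hstack([corners_3d, np.ones((8, 1))]) @ P.T     # homogeneous projection
    pts = pts[:, :2] / pts[:, 2:3]                           # perspective divide
    x1, y1 = pts.min(axis=0)
    x2, y2 = pts.max(axis=0)
    return x1, y1, x2, y2                                    # axis-aligned 2D ROI
```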
An attention fusion algorithm is used to fully fuse the two ROIs, weighting the ROI of the two-dimensional image and the ROI of the point cloud data channel by channel. The ROI in the image is F_i ∈ ℝ^(N×N×C) and the ROI in the point cloud is F_l ∈ ℝ^(N×N×C). A channel concatenation operation fuses F_i and F_l, and the fused data is denoted F_f ∈ ℝ^(N×N×2C). F_f is then passed to a global average pooling operation, whose output is F_fa ∈ ℝ^(2C). F_fa is passed through a fully connected layer to give F_fc ∈ ℝ^(2C), and F_fc is reshaped to F_1 ∈ ℝ^(2×C). The Softmax function is then applied to F_1 along the channel dimension:

A_c = e^(a_c) / (e^(a_c) + e^(b_c)),  B_c = e^(b_c) / (e^(a_c) + e^(b_c))  (6)

where a and b are the first and second row vectors of F_1, A and B are the resulting channel weight vectors applied to F_l and F_i respectively, and C is the number of channels of each ROI. Finally, the ROIs are fused through the attention mechanism according to the following calculation formula:

F_w = A ⊙ F_l + B ⊙ F_i  (7)

where F_w is the fused feature of F_l and F_i. The attention mechanism fusion structure is shown in Fig. 4.
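A hedged PyTorch sketch of the channel-attention fusion in formulas (6) and (7): the two ROI feature maps are concatenated, globally average-pooled, passed through a fully connected layer, reshaped to 2×C, and a softmax over the two sources yields the weights A and B used to form F_w. The layer shapes and the batch dimension are assumptions for illustration.

```python
# Hedged sketch of the ROI channel-attention fusion described above.
import torch
import torch.nn as nn

class ROIAttentionFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Linear(2 * channels, 2 * channels)
        self.channels = channels

    def forward(self, F_l: torch.Tensor, F_i: torch.Tensor) -> torch.Tensor:
        # F_l, F_i: (B, C, N, N) feature maps of the point-cloud ROI and image ROI
        F_f = torch.cat([F_l, F_i], dim=1)                  # channel concatenation -> (B, 2C, N, N)
        F_fa = F_f.mean(dim=(2, 3))                         # global average pooling -> (B, 2C)
        F_1 = self.fc(F_fa).view(-1, 2, self.channels)      # fully connected layer, reshape to (B, 2, C)
        weights = torch.softmax(F_1, dim=1)                 # softmax over the two sources per channel
        A = weights[:, 0].unsqueeze(-1).unsqueeze(-1)       # weights for F_l, shape (B, C, 1, 1)
        B = weights[:, 1].unsqueeze(-1).unsqueeze(-1)       # weights for F_i
        return A * F_l + B * F_i                            # fused ROI feature F_w
```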
In the fourth step, the specific steps include:
according to the object distribution density, the threshold value of non-maximum value suppression is adaptively adjusted, when the object to be detected is denser, the correct prediction frame is reserved as far as possible, and when the object to be detected is sparser, the redundant prediction frame is removed by adopting the threshold value transferred to by the smaller non-maximum value. The density of the predicted candidate frames is defined as the maximum IoU of all the rest of the real frames, and the density of the target to be detected is expressed as follows:
wherein i is a generated prediction candidate frame, and j is a real frame. The threshold value thereof can be defined as follows:
N_M = max(N_t, d_M)  (9)
m is the prediction candidate frame with the highest confidence score in the set. Finally, combining with Soft-NMS to obtain confidence penalty formula as follows:
s_i = s_i,  if IoU(M, b_i) < N_M;  s_i = s_i · (1 − IoU(M, b_i)),  if IoU(M, b_i) ≥ N_M  (10)

where s_i is the confidence score of candidate box b_i. Through this confidence penalty formula, the threshold selection of the NMS becomes adaptive; the improved non-maximum suppression architecture is shown in Fig. 5.
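The density-adaptive Soft-NMS of formulas (8) to (10) can be sketched as follows; boxes are treated as 2D axis-aligned rectangles for brevity (the patent applies the rule to 3D boxes), and the default values of N_t and the score threshold are assumptions.

```python
# Hedged sketch of density-adaptive Soft-NMS: the per-box threshold is
# N_M = max(N_t, d_M), and overlapping boxes receive a linear confidence penalty.
import numpy as np

def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def adaptive_soft_nms(boxes, scores, densities, N_t=0.5, score_thr=0.05):
    """Keep boxes with Soft-NMS, raising the threshold where objects are dense."""
    order = list(np.argsort(scores)[::-1])       # boxes sorted by confidence
    scores = scores.copy()
    keep = []
    while order:
        m = order.pop(0)
        keep.append(m)
        N_M = max(N_t, densities[m])             # equation (9): N_M = max(N_t, d_M)
        for i in order:
            o = iou(boxes[m], boxes[i])
            if o >= N_M:
                scores[i] *= (1.0 - o)           # equation (10): linear confidence penalty
        order = [i for i in order if scores[i] >= score_thr]
        order.sort(key=lambda i: scores[i], reverse=True)
    return keep
```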
The invention and current mainstream algorithms were tested in the same traffic scenes on the same data set. The data of the compared algorithms all come from the official KITTI website, and three difficulty levels, easy, moderate and hard, were selected for comparative analysis. The detection methods are divided into image-based detection methods, point-cloud-based detection methods, and methods fusing point cloud and image; the test comparison data are shown in Table 1. Pure-image three-dimensional vehicle detection methods are faster, but there is still considerable room to improve their accuracy; pure point-cloud three-dimensional vehicle detection methods are far more accurate than image-based methods, but slightly less accurate than vehicle detection algorithms that fuse point cloud and image.
Table 1. Comparison of time and accuracy of three-dimensional vehicle detection algorithms
One algorithm was selected from each detection modality for experimental comparison with the proposed method, taking the more complex traffic scenes in the data set as the comparison environment. The vehicle detection accuracy of the proposed method is 83.68%, and the average time for processing a single frame of data is 0.11 s. In traffic scenes with vehicle occlusion, poor illumination, and small distant vehicles, the method achieves good detection results and effectively reduces the false detection rate and missed detection rate.
Claims (5)
1. A three-dimensional vehicle detection method based on point cloud and image fusion, characterized by comprising the following steps:
step one, preprocessing point cloud data and image data;
step two, designing a key point module for feature extraction;
step three, completing feature fusion of the point cloud and the image data by using an improved region-of-interest fusion module;
and step four, reducing redundancy of the detection boxes by using improved non-maximum suppression.
2. The three-dimensional vehicle detection method based on point cloud and image fusion according to claim 1, characterized in that the first step specifically comprises:
the laser radar and the monocular camera acquire data and respectively output point cloud data and image data, the joint calibration in space is realized according to the joint calibration principle of the sensors, and the time synchronization among the sensors is realized by selecting an interval sampling mode; and reducing the interference of outliers in the point cloud data through a statistical filtering algorithm.
3. The three-dimensional vehicle detection method based on point cloud and image fusion according to claim 1, characterized in that the second step specifically comprises:
the improved key point module performs feature extraction in the point cloud data and the image data respectively to generate key points of a vehicle to be detected, the key point module in the point cloud performs feature extraction by an SA layer in PointNet++, and simultaneously performs downsampling and depth feature extraction on the point cloud, and downsampled 3D points and corresponding features are obtained; the image data is input to the deeplabv3+ neural network to obtain the segmentation characteristics and classification scores of the pixels, so that the selection of the key points can ignore the background.
4. The three-dimensional vehicle detection method based on point cloud and image fusion according to claim 1, characterized in that the third step specifically comprises:
and obtaining the central point of the target vehicle according to the positions and the characteristics of the key points, and acquiring the spatial offset between the central point and the corresponding true value by adopting a partial network of Votenote. And carrying out pooling operation on surrounding points of the previously obtained center points by adopting a 3D ROI pooling layer, learning local features of cluster points near each center point, and carrying out coding operation on the ROI by using a 3D boundary box. After the 3D ROI is obtained, it is projected onto the image to generate a corresponding 2D ROI, whose local texture features are learned by the 2D ROI pooling layer followed. Finally, the fusion of the 3D ROI and the 2D ROI is done using an improved attention fusion strategy, these local features being fused together for prediction of the 3D bounding box.
5. The three-dimensional vehicle detection method based on point cloud and image fusion according to claim 1, characterized in that the fourth step specifically comprises:
the improved target detection under the non-maximum suppression self-adaptive traffic scene is designed, the non-maximum suppression threshold value is self-adaptively adjusted according to the distribution density of the vehicles to be detected, and the detection accuracy of the vehicles is improved. According to the density of the object distribution, the threshold value of non-maximum suppression is self-adaptively adjusted, when the object to be detected is denser, the correct prediction frame is reserved as far as possible, and when the object to be detected is sparser, the redundant prediction frame is removed by adopting the smaller threshold value of non-maximum suppression.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310786009.9A CN116758506A (en) | 2023-06-29 | 2023-06-29 | Three-dimensional vehicle detection method based on point cloud and image fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310786009.9A CN116758506A (en) | 2023-06-29 | 2023-06-29 | Three-dimensional vehicle detection method based on point cloud and image fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116758506A true CN116758506A (en) | 2023-09-15 |
Family
ID=87949495
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310786009.9A Pending CN116758506A (en) | 2023-06-29 | 2023-06-29 | Three-dimensional vehicle detection method based on point cloud and image fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116758506A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116968758A (en) * | 2023-09-19 | 2023-10-31 | 江西五十铃汽车有限公司 | Vehicle control method and device based on three-dimensional scene representation |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 