CN115187964A - Automatic driving decision-making method based on multi-sensor data fusion and SoC chip - Google Patents

Automatic driving decision-making method based on multi-sensor data fusion and SoC chip

Info

Publication number
CN115187964A
Authority
CN
China
Prior art keywords: data, image, point cloud, layer, target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211082826.8A
Other languages
Chinese (zh)
Inventor
王嘉诚
张少仲
张栩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongcheng Hualong Computer Technology Co Ltd
Original Assignee
Zhongcheng Hualong Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongcheng Hualong Computer Technology Co Ltd filed Critical Zhongcheng Hualong Computer Technology Co Ltd
Priority to CN202211082826.8A priority Critical patent/CN115187964A/en
Publication of CN115187964A publication Critical patent/CN115187964A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/28Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network with correlation of data from several navigational instruments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/582Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of traffic signs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/588Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Remote Sensing (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Automation & Control Theory (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses an automatic driving decision-making method based on multi-sensor data fusion and an SoC chip, belonging to the technical fields of machine learning and automatic driving. An image sensor acquires image data of the road and inputs it into a trained image target detection neural network model, which performs lane image target detection and outputs target detection data of the lane image. A laser radar collects 3D point cloud data, which are input into a trained point cloud target detection neural network model for obstacle target detection; the result is fused with the obstacle information output by a binocular camera to generate obstacle position and distance data. The lane image data and the obstacle position and distance data are then fused, and the road condition information used for driving is corrected and serves as the basis for the automatic driving decision. The scheme of the invention fully meets the real-time requirements of automatic driving scenarios, and by fusing data from different sensors it greatly improves the accuracy of road condition analysis.

Description

Automatic driving decision-making method based on multi-sensor data fusion and SoC chip
Technical Field
The invention belongs to the technical field of machine learning and automatic driving, and particularly relates to an automatic driving decision method based on multi-sensor data fusion and an SoC chip.
Background
Automatic driving technology is receiving growing attention from vehicle manufacturers, who are investing increasing manpower and material resources in developing automatic driving vehicles, with some even targeting mass production within the next 5-10 years. The realization of automatic driving can be divided into three stages: cognition, judgment and control. Current automatic driving technology still has many problems in the cognition stage, such as road and pedestrian recognition, and in the judgment stage, such as condition judgment and path generation.
With the rapid development of artificial intelligence in recent years, its application in the field of automatic driving has become increasingly common. The Chinese patent with publication number CN114708566A discloses an automatic driving target detection method based on improved YOLOv4, with the following specific steps: acquire a common target detection data set and preprocess it with Mosaic augmentation; construct a new non-maximum suppression algorithm, Soft-CIOU-NMS, from NMS, Soft-NMS and the CIOU loss function; improve the YOLOv4 feature extraction network by extending the original three-scale prediction of YOLOv4 to four-scale prediction; replace the ordinary convolutions of YOLOv4 with depthwise separable convolutions to speed up detection; and improve the YOLOv4 network structure by adding a CBAM attention mechanism to enhance feature extraction. However, when images alone are used as the basis for judgment, deviations in the moving image may distort the detection and classification steps, and errors in the set thresholds, in image cropping or in feature extraction may lead to wrong judgments about the vehicle's operation and thus to erroneous commands.
With the continuous improvement and popularization of 3D equipment such as laser radars and depth cameras, automatic driving in real three-dimensional scenes has become possible, which raises the requirements of automatic driving systems for recognizing and detecting targets in complex scenes while meeting demands for safety and convenience. In automatic driving devices, data are usually acquired by image sensors, laser sensors and radar, and the data from several sensors are combined for comprehensive analysis so that the relevant operations can be carried out according to the analysis results. 2D target detection cannot meet the environment perception requirements of unmanned vehicles, whereas 3D target detection can identify object categories together with information such as length, width, height and rotation angle in three-dimensional space. Applying 3D target detection to the targets in a scene allows the autonomous vehicle to estimate their actual positions and accurately predict and plan its own behavior and path, thereby avoiding collisions and violations, greatly reducing traffic accidents and helping realize intelligent urban traffic.
To solve the problem of wrong operating decisions caused by the dynamic image deviation of a single image sensor, the Chinese patent with publication number CN114782729A provides a real-time target detection method based on laser radar and vision fusion, comprising the following steps: acquire camera image data and three-dimensional laser radar scan points of the vehicle's surroundings, convert the point cloud data into a local rectangular coordinate system, and preprocess the 3D point cloud; perform density clustering on the preprocessed 3D point cloud data and extract 3D regions of interest of targets and the corresponding point cloud features; screen out sparse clusters from the target 3D regions of interest, map them to the corresponding regions of the image, extract image features and fuse them with the point cloud features; and input the point cloud features and image features of all regions of interest into an SSD detector to locate and identify targets. The point cloud feature extraction algorithm is PointNet++, PointNet, VoxelNet or the SECOND algorithm. However, this technical scheme generally suffers from the low detection speed of point cloud feature extraction algorithms: the target detection speed of PointNet is only 5.7 Hz; the PointNet++ model was proposed for dense point cloud data sets, and its performance on sparse laser radar point clouds hardly meets requirements; the VoxelNet algorithm uses 3D convolution, which makes the computation excessive, with a processing speed of only 4.4 Hz; and although the SECOND algorithm improves on VoxelNet and raises the processing speed to 20 Hz, it still struggles to meet the real-time requirements of automatic driving scenarios.
Disclosure of Invention
The invention provides an automatic driving decision-making method based on multi-sensor data fusion and an SoC (system on chip) chip, aiming to solve the prior-art problems of inefficient road condition information processing and of misjudgments caused by relying on a single image sensor in automatic driving.
To solve the above technical problems, the automatic driving decision is made on the basis of multi-sensor data fusion, with different neural network models trained to detect image sensor data and laser radar sensor data respectively. The specific scheme is as follows:
the automatic driving decision-making method based on multi-sensor data fusion comprises the following steps:
S1: an RGB image sensor collects image data of the road on which the vehicle is driving, the image data comprising lane line data, vehicle data, pedestrian data and traffic sign data;
S2: the lane line data, vehicle data, pedestrian data and traffic sign data are input into a trained image target detection neural network model, which performs lane image feature extraction and feature fusion and outputs target detection data of the lane image, the image target detection neural network model adopting the YOLOv7 target detection algorithm;
S3: a laser radar collects 3D point cloud data, which are input into a trained point cloud target detection neural network model for distance feature extraction and feature fusion; the model outputs target position and distance data, which are fused with the target position and distance information output by a binocular camera to generate the final obstacle position and distance data, the neural network model adopting the PointPillar target detection algorithm;
S4: the lane image data generated in step S2 and the obstacle position and distance data generated in step S3 are fused, each sensor is checked for errors, and the road condition information for vehicle driving is corrected;
S5: corresponding decisions are made according to the road condition information corrected in step S4 and applied to automatic driving.
Preferably, the training process of the image target detection neural network model in step S2 specifically includes the following steps:
s2-1: establishing a data set of lanes, pedestrians and traffic signs, wherein the data set is used for training a neural network model;
s2-2: preprocessing the lane, pedestrian and traffic sign data sets to generate RGB format images with set resolution;
s2-3: sequentially enabling the format image to pass through an image feature extraction layer, an image feature fusion layer and an image target detection layer of a YOLOv7 network to obtain a neural network model;
s2-4: checking whether the training times reach a set target or not, if not, repeating the step S2-3 until the set training times are reached, and storing the neural network model as an image target detection neural network model.
Preferably, the training process of the point cloud target detection neural network model in step S3 specifically includes the following steps:
s3-1: establishing a laser radar data set, wherein the data set is used for training a point cloud target detection neural model;
s3-2: preprocessing the laser radar data set to generate format point cloud data;
s3-3: sequentially passing the format point cloud data through a feature conversion layer, a feature extraction layer and a target detection layer of a PointPillar network to obtain a neural network model;
s3-4: checking whether the training times reach a set target or not, if not, repeating the step S3-3 until the set training times are reached, and storing the neural network model as a point cloud target detection neural network model.
Preferably, in the data fusion of step S4, the processed image data and the radar data are matched in a decision-layer fusion manner, and the obstacle position and distance detection results generated from the radar data are mapped onto the coordinates of the image data to form a comprehensive feature map.
Preferably, the YOLOv7 network model comprises an image input layer, an image feature extraction layer, an image feature fusion layer and an image target detection layer; the image input layer aligns input images; the image feature extraction layer further comprises a plurality of convolution layers, a batch normalization layer and a maximum pooling layer and is used for enriching the features of the aligned images and extracting the features of lanes, vehicles and pedestrians; the image feature fusion layer is used for fusing features extracted at different stages, so that the accuracy of the features is improved; and the image target detection layer detects the road condition information characteristics of the fused characteristic graph and outputs an image detection result.
Preferably, the PointPillar network model comprises a point cloud feature conversion layer, a point cloud feature extraction layer and a point cloud target detection layer; the point cloud feature conversion layer converts the input point cloud into a sparse pseudo image; the point cloud feature extraction layer processes the pseudo image to obtain features of a high layer; the point cloud target detection layer detects the position and the distance of a target through a regression 3D frame.
Preferably, the traffic sign data are detected with an improved lightweight convolutional neural network, which uses dilated convolution to implement a sliding-window method and uses statistical information from the data set to accelerate the forward propagation of the network, thereby improving the efficiency of traffic sign detection.
Preferably, the format of the point cloud data input into the pillar feature layer is P × N × D, where P is the number of selected pillars, N is the maximum number of points stored in each pillar, and D is the dimensional attribute of each point.
Preferably, the dimensional attribute of each point is 9-dimensional data, characterized as:

D = (x, y, z, r, x_c, y_c, z_c, x_p, y_p)

wherein (x, y, z, r) is the original laser radar point cloud data, (x, y, z) are the three-dimensional coordinates, r is the reflected laser intensity, (x_c, y_c, z_c) is the offset of the point from the center of the N points in its pillar, and (x_p, y_p) is the offset of the point from the pillar's center coordinates.
An automatic driving decision SoC chip based on multi-sensor data fusion comprises a general-purpose processor and a neural network processor; the general-purpose processor controls the operation of the neural network processor through custom instructions, and the neural network processor is used to execute the above method.
Compared with the prior art, the invention has the following technical effects:
1. The invention combines the laser radar with a binocular camera to detect the position and distance of obstacles, efficiently exploiting the accurate spatial information of the point cloud data while using the binocular camera to compensate for the positioning errors of the laser radar in severe environments, thereby broadening the applicable range of obstacle detection and meeting the robustness requirements of obstacle detection in automatic driving scenarios.
2. The invention adopts the PointPillar network model to process laser radar data. It operates on pillars rather than voxels, requires no manual tuning of the binning in the vertical direction, and represents the point cloud with pillars so that 3D point cloud detection can be performed with 2D convolutions alone. This greatly reduces the amount of computation and raises the processing speed above 62 Hz, effectively meeting the real-time requirements of automatic driving scenarios.
3. For different tasks, the invention uses different neural network models, giving full play to the advantages of each model, allowing the data to be processed synchronously before the comprehensive data fusion and improving the completeness of the decision method.
4. The method combines the lane image data, the image data of other traffic participants, the traffic sign image data and the obstacle position and distance data in a comprehensive decision-layer fusion, so that drivable areas and obstacles are judged more accurately in automatic driving scenarios, target recognition is good, and the vehicle's perception accuracy of its surroundings is improved.
Drawings
FIG. 1 is a flow chart of an automated driving decision method based on multi-sensor data fusion in accordance with the present invention;
FIG. 2 is a flow chart of a YOLOv7 network model structure of the automatic driving decision method based on multi-sensor data fusion according to the present invention;
FIG. 3 is a flow chart of a PointPillar network model structure of the multi-sensor data fusion-based automatic driving decision method of the present invention.
In the figure: 1. an image input layer; 2. an image feature extraction layer; 3. an image target detection layer; 4. a point cloud feature conversion layer; 5. a point cloud feature extraction layer; 6. a point cloud target detection layer; 21. a convolution module; 22. a first pooling module; 23. a second pooling module; 24. a third pooling module; 41. point cloud data; 42. stacking column data; 43. acquiring characteristic data; 44. pseudo image data.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be described in detail and completely with reference to the accompanying drawings.
Referring to fig. 1-3, the present invention provides an automatic driving decision method based on multi-sensor data fusion, comprising the following steps:
S1: An RGB image sensor collects image data of the road on which the vehicle is driving, including lane line data, vehicle data, pedestrian data, traffic sign data and data of other traffic participants.
S2: The lane line data, vehicle data, pedestrian data and traffic sign data are input into a trained image target detection neural network model, which performs lane image feature extraction and feature fusion and outputs target detection data of the lane image; the neural network model adopts the YOLOv7 target detection algorithm.
S3: A laser radar collects 3D point cloud data, which are input into a trained point cloud target detection neural network model for distance feature extraction and feature fusion; the model outputs target position and distance data, which are fused with the target position and distance information output by a binocular camera to generate the final obstacle position and distance data; the neural network model adopts the PointPillar target detection algorithm. This efficiently exploits the accurate spatial information of the point cloud data while using the binocular camera to compensate for the positioning errors of the laser radar in severe environments, broadening the applicable range of obstacle detection and meeting the robustness requirements of obstacle detection in automatic driving scenarios.
S4: The lane image data generated in step S2 and the obstacle position and distance data generated in step S3 are fused, each sensor is checked for errors, and the road condition information for vehicle driving is corrected.
S5: Corresponding decisions are made according to the road condition information corrected in step S4 and applied to automatic driving.
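The following is a minimal Python sketch of the S1-S5 decision flow described above; each callable is a placeholder standing in for the corresponding trained model or fusion step, and all names in the sketch are illustrative assumptions rather than terms from the disclosure.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class AutoDrivePipeline:
    detect_lane_targets: Callable[[Any], Any]       # S2: trained YOLOv7 image target detector
    detect_obstacles: Callable[[Any], Any]          # S3: trained PointPillar point cloud detector
    fuse_stereo: Callable[[Any, Any], Any]          # S3: fusion with binocular camera output
    fuse_decision_layer: Callable[[Any, Any], Any]  # S4: decision-layer fusion and correction
    plan: Callable[[Any], Any]                      # S5: decision from corrected road condition info

    def step(self, rgb_frame: Any, lidar_points: Any, stereo_obstacles: Any) -> Any:
        lane_targets = self.detect_lane_targets(rgb_frame)                  # S2
        obstacles = self.fuse_stereo(self.detect_obstacles(lidar_points),   # S3
                                     stereo_obstacles)
        road_condition = self.fuse_decision_layer(lane_targets, obstacles)  # S4
        return self.plan(road_condition)                                    # S5
```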
The training process of the image target detection neural network model in step S2 specifically comprises the following steps:
S2-1: A data set of lanes, pedestrians and traffic signs is established, covering lane conditions such as normal, crowded, night, no lane lines, shadows, arrows, glare, curves and intersections under different weather and climate conditions, as well as obstacles such as pedestrians, animals and non-motor vehicles; the data set is used to train the neural network model.
In this embodiment, in order to fully train the YOLOv7 neural network model, the TuSimple data set is used together with the CULane data set to train lane and vehicle target detection, and the RESIDE data set is used to train the detection of other traffic participants in road traffic.
S2-2: The data set of lanes, pedestrians and traffic signs is preprocessed to generate RGB images of a set resolution; matching the requirements of the YOLOv7 image input layer 1, the formatted images are RGB three-channel images with a resolution of 640 × 640.
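A minimal Python sketch of this pre-processing step, assuming a simple letterbox resize to the 640 × 640 RGB three-channel format expected by the image input layer 1; the use of OpenCV and the gray padding value are assumptions for illustration.

```python
import cv2
import numpy as np

def preprocess_image(path: str, size: int = 640) -> np.ndarray:
    img = cv2.imread(path)                                   # BGR image, H x W x 3
    h, w = img.shape[:2]
    scale = size / max(h, w)                                 # keep aspect ratio
    resized = cv2.resize(img, (int(w * scale), int(h * scale)))
    canvas = np.full((size, size, 3), 114, dtype=np.uint8)   # gray letterbox padding (assumed value)
    canvas[: resized.shape[0], : resized.shape[1]] = resized
    rgb = cv2.cvtColor(canvas, cv2.COLOR_BGR2RGB)            # RGB three-channel format
    return rgb.astype(np.float32) / 255.0                    # normalized 640 x 640 x 3 input
```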
S2-3: and sequentially passing the format image through an image feature extraction layer 2, an image feature fusion layer and an image target detection layer 3 of a YOLOv7 network to obtain a neural network model.
S2-4: checking whether the training times reach a set target, if not, repeating the step S2-3 until the set training times are reached, and storing the neural network model as an image target detection neural network model.
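A minimal PyTorch sketch of the training loop implied by steps S2-3 and S2-4; the model, loss function and data loader are placeholders standing in for the YOLOv7 network and the data sets described above, and the epoch count stands for the "set training times" checked in step S2-4.

```python
import torch

def train_detector(model, loss_fn, data_loader, epochs=100, lr=1e-3):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for epoch in range(epochs):                      # S2-4: repeat until the set number of training passes
        for images, targets in data_loader:          # S2-2: batches of pre-processed 640 x 640 RGB images
            preds = model(images.to(device))         # S2-3: feature extraction, feature fusion, detection
            loss = loss_fn(preds, targets)           # detection loss (placeholder)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    torch.save(model.state_dict(), "image_target_detection_model.pt")   # store the trained model
```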
In step S3, the training process of the point cloud target detection neural network model specifically comprises the following steps:
S3-1: A laser radar data set is established and used to train the point cloud target detection neural network model; data sets such as the LiDAR-Video Driving Dataset, KITTI, PandaSet, Waymo, Lyft Level 5, DAIR-V2X and nuScenes can be used.
S3-2: and preprocessing the laser radar data set to generate format point cloud data.
The data format of the point cloud input into the Pillar feature layer is P × N × D, where P is the number of selected pillars, N is the maximum number of points stored in each pillar, and D is the dimensional attribute of each point.
The dimensional attribute of each point is 9-dimensional data, characterized as:

D = (x, y, z, r, x_c, y_c, z_c, x_p, y_p)

wherein (x, y, z, r) is the original laser radar point cloud data, (x, y, z) are the three-dimensional coordinates, r is the reflected laser intensity, (x_c, y_c, z_c) is the offset of the point from the center of the N points in its pillar, and (x_p, y_p) is the offset of the point from the pillar's center coordinates.
In this embodiment, the number of pillars P is 30000 and the maximum number of points stored in each pillar is 20. If a pillar contains more than 20 points, 20 points are randomly sampled and the rest are discarded; if it contains fewer than 20 points, it is padded with zeros. The point cloud data format input into the pillar feature layer is therefore P × N × D (30000 × 20 × 9).
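A minimal NumPy sketch of this pre-processing, assuming a BEV grid over illustrative detection ranges and a 0.16 m pillar size (both assumptions, not values from the disclosure): points are grouped into pillars, at most P = 30000 pillars and N = 20 points per pillar are kept (random sub-sampling for fuller pillars, zero padding for emptier ones), and each point is expanded to the 9-dimensional feature described above.

```python
import numpy as np

def build_pillars(points, x_range=(0.0, 81.92), y_range=(-40.96, 40.96),
                  pillar_size=0.16, max_pillars=30000, max_points=20):
    """points: (M, 4) array of (x, y, z, reflectance) lidar returns."""
    keep = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    points = points[keep]
    ix = ((points[:, 0] - x_range[0]) // pillar_size).astype(np.int64)
    iy = ((points[:, 1] - y_range[0]) // pillar_size).astype(np.int64)
    keys = ix * 1_000_000 + iy                                    # one key per occupied pillar
    feats = np.zeros((max_pillars, max_points, 9), dtype=np.float32)
    coords = np.zeros((max_pillars, 2), dtype=np.int64)
    for p, key in enumerate(np.unique(keys)[:max_pillars]):
        pts = points[keys == key]
        if len(pts) > max_points:                                 # more than 20 points: random sub-sampling
            pts = pts[np.random.choice(len(pts), max_points, replace=False)]
        n = len(pts)                                              # fewer points: rows stay zero-padded
        cx = (key // 1_000_000 + 0.5) * pillar_size + x_range[0]
        cy = (key % 1_000_000 + 0.5) * pillar_size + y_range[0]
        feats[p, :n, 0:4] = pts                                   # x, y, z, r
        feats[p, :n, 4:7] = pts[:, :3] - pts[:, :3].mean(axis=0)  # offset from the pillar's point center
        feats[p, :n, 7] = pts[:, 0] - cx                          # offset from the pillar's x center
        feats[p, :n, 8] = pts[:, 1] - cy                          # offset from the pillar's y center
        coords[p] = key // 1_000_000, key % 1_000_000
    return feats, coords            # (30000, 20, 9) pillar features and (30000, 2) BEV grid indices
```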
S3-3: and sequentially passing the format point cloud data through a point cloud feature conversion layer 4, a point cloud feature extraction layer 5 and a point cloud target detection layer 6 of the PointPillar network to obtain a neural network model.
S3-4: checking whether the training times reach a set target, if not, repeating the step S3-3 until the set training times are reached, and storing the neural network model as a point cloud target detection neural network model.
The YOLOv7 network model comprises an image input layer 1, an image feature extraction layer 2, an image feature fusion layer and an image target detection layer 3; the image input layer aligns the input images; the image feature extraction layer further comprises a plurality of convolution layers, a batch normalization layer and a maximum pooling layer and is used for enriching the features of the aligned images and extracting the features of lanes, vehicles and pedestrians; the image feature fusion layer is used for fusing features extracted at different stages, so that the accuracy rate of the features is improved; and the image target detection layer detects the road condition information characteristics of the fused characteristic graph and outputs an image detection result.
The image feature extraction layer 2 comprises a convolution module 21, a first pooling module 22, a second pooling module 23 and a third pooling module 24 arranged in sequence. The convolution module 21 outputs a 4-times down-sampled feature map B; the first pooling module 22 receives and processes the feature map B and outputs an 8-times down-sampled feature map C; the second pooling module 23 receives and processes the feature map C and outputs a 16-times down-sampled feature map D; and the third pooling module 24 receives and processes the feature map D and outputs a 32-times down-sampled feature map E. The convolution module 21 consists of four CBR convolution layers and an ELAN layer arranged in sequence. The first pooling module 22, second pooling module 23 and third pooling module 24 each consist of a max pooling layer MP1 followed by an ELAN layer.
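A simplified PyTorch sketch of the image feature extraction layer 2 described above, producing the 4×/8×/16×/32× down-sampled feature maps B, C, D and E; the channel widths and the internal structure of the CBR and ELAN stand-ins are illustrative assumptions, not the exact YOLOv7 configuration.

```python
import torch
import torch.nn as nn

def cbr(c_in, c_out, k=3, s=1):
    # CBR = Conv + BatchNorm + ReLU, as named in the description above
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

class ELAN(nn.Module):
    # Stand-in for the ELAN aggregation block: two parallel CBR branches, concatenated and fused.
    def __init__(self, c):
        super().__init__()
        self.b1 = cbr(c, c // 2, 1)
        self.b2 = nn.Sequential(cbr(c, c // 2, 1), cbr(c // 2, c // 2))
        self.fuse = cbr(c, c, 1)
    def forward(self, x):
        return self.fuse(torch.cat([self.b1(x), self.b2(x)], dim=1))

class PoolModule(nn.Module):
    # Max pooling layer MP1 followed by an ELAN layer; halves the spatial resolution.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(nn.MaxPool2d(2, 2), cbr(c_in, c_out, 1), ELAN(c_out))
    def forward(self, x):
        return self.body(x)

class ImageFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        # Convolution module 21: four CBR layers (two strided) plus an ELAN layer -> 4x down-sampling
        self.conv_module = nn.Sequential(cbr(3, 32), cbr(32, 64, s=2), cbr(64, 64),
                                         cbr(64, 128, s=2), ELAN(128))
        self.pool1 = PoolModule(128, 256)   # -> feature map C, 8x down-sampled
        self.pool2 = PoolModule(256, 512)   # -> feature map D, 16x down-sampled
        self.pool3 = PoolModule(512, 1024)  # -> feature map E, 32x down-sampled
    def forward(self, x):
        b = self.conv_module(x)             # feature map B, 4x down-sampled
        c = self.pool1(b)
        d = self.pool2(c)
        e = self.pool3(d)
        return b, c, d, e

# Example: a 640 x 640 RGB image yields 160/80/40/20-pixel feature maps B/C/D/E.
feats = ImageFeatureExtractor()(torch.zeros(1, 3, 640, 640))
```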
The image target detection layer 3 performs pyramid pooling on the feature map E and outputs target detection results at three different sizes through three branches, each consisting of a RepVGG block layer (REP) and a convolution layer (CONV).
The PointPillar network model comprises a point cloud feature conversion layer 4, a point cloud feature extraction layer 5 and a point cloud target detection layer 6. The point cloud feature conversion layer 4 converts the input point cloud into a sparse pseudo image; the point cloud feature extraction layer 5 processes the pseudo image to obtain higher-level features; and the point cloud target detection layer 6 performs bounding-box regression with an SSD detection head to detect the position and distance of targets as 3D boxes.
The point cloud feature conversion layer 4 gathers the input P × N × D (30000 × 20 × 9) point cloud data 41 into stacked pillar data 42, then applies a simplified PointNet and a 1 × 1 convolution to each pillar to obtain learned feature data 43, and finally scatters the features back to their original positions according to the pillar indices to obtain pseudo image data 44. The size of the pseudo image data 44 is H × W × C (512 × 512 × 64), where H is the pixel height of the pseudo image, W is its pixel width and C is its number of channels.
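A minimal PyTorch sketch of the point cloud feature conversion layer 4: a simplified PointNet (per-point linear layer, batch normalization, ReLU and a maximum over the points of each pillar) followed by scattering the pillar features back onto the 512 × 512 × 64 pseudo image; the layer widths are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    def __init__(self, d_in=9, c_out=64):
        super().__init__()
        self.linear = nn.Linear(d_in, c_out, bias=False)   # equivalent to a 1 x 1 convolution per point
        self.bn = nn.BatchNorm1d(c_out)
    def forward(self, pillars):                            # pillars: (B, P, N, D)
        b, p, n, _ = pillars.shape
        x = self.linear(pillars)                           # (B, P, N, C)
        x = self.bn(x.view(-1, x.shape[-1])).view(b, p, n, -1)
        x = torch.relu(x)
        return x.max(dim=2).values                         # (B, P, C): one learned feature per pillar

def scatter_to_pseudo_image(pillar_feats, coords, h=512, w=512):
    """Scatter per-pillar features back to their grid cells, giving a (B, C, H, W) pseudo image.
    coords: (B, P, 2) integer (row, col) indices of each pillar on the BEV grid."""
    b, p, c = pillar_feats.shape
    canvas = pillar_feats.new_zeros(b, c, h * w)
    flat = coords[..., 0] * w + coords[..., 1]                                    # (B, P)
    canvas.scatter_(2, flat.unsqueeze(1).expand(-1, c, -1), pillar_feats.transpose(1, 2))
    return canvas.view(b, c, h, w)

# Example with random data shaped like the embodiment (P=30000, N=20, D=9, C=64, H=W=512).
pillars = torch.rand(1, 30000, 20, 9)
coords = torch.randint(0, 512, (1, 30000, 2))
pseudo_image = scatter_to_pseudo_image(PillarFeatureNet()(pillars), coords)       # (1, 64, 512, 512)
```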
The processing flow of the point cloud feature extraction layer 5 comprises three steps: progressively down-sampling the input pseudo image to form pyramid features; up-sampling the corresponding features to a uniform size; and concatenating the uniformly sized features. The down-sampling is performed by a sequence of blocks Block(S, L, F), where S is the stride relative to the pseudo image, L is the number of 3 × 3 2D convolution layers and F is the number of output channels. The up-sampling operation is denoted Up(S_in, S_out, F), where S_in and S_out are the input and output strides; each up-sampling step yields F output features, and the features of the branches are finally concatenated together.
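A minimal PyTorch sketch of this backbone, with Block(S, L, F) implemented as a strided stack of 3 × 3 convolutions and Up(S_in, S_out, F) as a transposed convolution; the concrete strides, layer counts and channel widths are illustrative assumptions rather than values from the disclosure.

```python
import torch
import torch.nn as nn

def block(c_in, f, l, s):
    # Block(S, L, F): first 3x3 conv applies stride S, followed by L-1 further 3x3 convs with F channels.
    layers = [nn.Conv2d(c_in, f, 3, stride=s, padding=1, bias=False),
              nn.BatchNorm2d(f), nn.ReLU(inplace=True)]
    for _ in range(l - 1):
        layers += [nn.Conv2d(f, f, 3, padding=1, bias=False),
                   nn.BatchNorm2d(f), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

def up(c_in, f, s_in, s_out):
    # Up(S_in, S_out, F): transposed convolution bringing a stride-S_in map back to stride S_out.
    k = s_in // s_out
    return nn.Sequential(nn.ConvTranspose2d(c_in, f, k, stride=k, bias=False),
                         nn.BatchNorm2d(f), nn.ReLU(inplace=True))

class PointPillarBackbone(nn.Module):
    def __init__(self, c_in=64):
        super().__init__()
        self.b1 = block(c_in, 64, l=4, s=2)    # stride 2 relative to the pseudo image
        self.b2 = block(64, 128, l=6, s=2)     # stride 4
        self.b3 = block(128, 256, l=6, s=2)    # stride 8
        self.u1 = up(64, 128, s_in=2, s_out=2)
        self.u2 = up(128, 128, s_in=4, s_out=2)
        self.u3 = up(256, 128, s_in=8, s_out=2)
    def forward(self, x):
        x1 = self.b1(x)
        x2 = self.b2(x1)
        x3 = self.b3(x2)
        # The three up-sampled feature maps are concatenated at a common stride.
        return torch.cat([self.u1(x1), self.u2(x2), self.u3(x3)], dim=1)

features = PointPillarBackbone()(torch.zeros(1, 64, 512, 512))   # -> (1, 384, 256, 256)
```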
The traffic sign data are detected with an improved lightweight convolutional neural network, which uses dilated convolution to implement a sliding-window method and uses statistical information from the data set to accelerate the forward propagation of the network, thereby improving the efficiency of traffic sign detection.
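A minimal PyTorch sketch of the underlying idea: a lightweight convolutional network applied fully convolutionally, where a dilated 3 × 3 convolution enlarges the receptive field so that a single forward pass scores every window position instead of cropping and classifying windows one by one. The channel counts, dilation rate and number of sign classes are illustrative assumptions; the disclosure's specific "improved" architecture is not reproduced here.

```python
import torch
import torch.nn as nn

class DilatedSlidingWindowDetector(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(                        # lightweight backbone
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # The dilated 3x3 convolution enlarges the receptive field without further
        # down-sampling, acting as a dense sliding window over the feature map.
        self.window = nn.Conv2d(32, 64, 3, padding=2, dilation=2)
        self.classifier = nn.Conv2d(64, num_classes + 1, 1)   # per-position sign scores (+ background)
    def forward(self, x):
        return self.classifier(torch.relu(self.window(self.features(x))))

# One pass over a 640 x 640 image yields a 160 x 160 grid of window scores.
scores = DilatedSlidingWindowDetector()(torch.zeros(1, 3, 640, 640))
```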
In the data fusion of step S4, the processed image data and the radar data are matched in a decision-layer fusion manner, and the obstacle position and distance detection results generated from the radar data are mapped onto the coordinates of the image data to form a comprehensive feature map. Both the obstacle information data and the image data are converted into BEV (bird's eye view) coordinates. The obstacle information data can be regarded as a multi-channel image in polar coordinates whose channels carry Doppler features, and after coordinate conversion it can be regarded as a multi-channel image in the BEV; the image data can likewise be regarded as a multi-channel image in the BEV after coordinate conversion. With the two kinds of data in the same coordinate system, they are fused at multiple scales in a Concat-based manner. Judging the road surface on which the vehicle is driving with the fused data makes the determination of drivable areas and obstacles in the automatic driving scenario more accurate, provides good target recognition and improves the accuracy of the vehicle's perception of its surroundings.
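A minimal PyTorch sketch of this Concat-based multi-scale fusion, assuming both inputs have already been converted to BEV maps in the same coordinate frame; the grid size, channel counts and the use of three scales are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BEVConcatFusion(nn.Module):
    def __init__(self, img_channels=64, obs_channels=8, out_channels=64, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.fuse = nn.ModuleList(
            nn.Conv2d(img_channels + obs_channels, out_channels, 3, padding=1) for _ in scales
        )
    def forward(self, img_bev, obs_bev):
        """img_bev, obs_bev: (B, C, H, W) BEV maps already in the same coordinate system."""
        fused = []
        for conv, s in zip(self.fuse, self.scales):
            i = nn.functional.avg_pool2d(img_bev, s) if s > 1 else img_bev
            o = nn.functional.avg_pool2d(obs_bev, s) if s > 1 else obs_bev
            fused.append(conv(torch.cat([i, o], dim=1)))   # Concat-based fusion at each scale
        return fused                                       # multi-scale comprehensive feature maps

maps = BEVConcatFusion()(torch.zeros(1, 64, 256, 256), torch.zeros(1, 8, 256, 256))
```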
An automatic driving decision SoC chip based on multi-sensor data fusion comprises a general-purpose processor and a neural network processor; the general-purpose processor controls the operation of the neural network processor through custom instructions, and the neural network processor is used to execute the above method.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, it is possible to make various changes and modifications without departing from the inventive concept, and these changes and modifications are all within the scope of the present invention.

Claims (10)

1. The automatic driving decision method based on multi-sensor data fusion is characterized by comprising the following steps of:
S1: an RGB image sensor collects image data of the road on which the vehicle is driving, wherein the image data comprises lane line data, vehicle data, pedestrian data and traffic sign data;
s2: inputting the lane line data, the vehicle data, the pedestrian data and the traffic sign data into a trained image target detection neural network model, performing lane image feature extraction and feature fusion, and outputting target detection data of a lane image, wherein the image target detection neural network model adopts a YOLOv7 target detection algorithm;
s3: collecting 3D point cloud data by a laser radar, inputting the point cloud data into a trained point cloud target detection neural network model, performing distance feature extraction and feature fusion, outputting target position and distance data, fusing the target position and distance data with target position and distance information output by a binocular camera, and generating final obstacle position and distance data, wherein the neural network model adopts a PointPillar target detection algorithm;
s4: carrying out data fusion on the lane image data generated in the step S2 and the obstacle position distance data generated in the step S3, analyzing whether errors exist in all sensors or not, and correcting the road condition information of vehicle driving;
s5: and making corresponding decisions according to the road condition information corrected in the step S4, and applying the decisions to automatic driving.
2. The multi-sensor data fusion-based automatic driving decision method according to claim 1, wherein the training process of the image target detection neural model in the step S2 specifically comprises the following steps:
s2-1: establishing a data set of lanes, pedestrians and traffic signs, wherein the data set is used for training a neural network model;
s2-2: preprocessing the lane, pedestrian and traffic sign data sets to generate RGB format images with set resolution;
s2-3: sequentially enabling the format image to pass through an image feature extraction layer, an image feature fusion layer and an image target detection layer of a YOLOv7 network to obtain a neural network model;
s2-4: checking whether the training times reach a set target, if not, repeating the step S2-3 until the set training times are reached, and storing the neural network model as an image target detection neural network model.
3. The multi-sensor data fusion-based automatic driving decision method according to claim 1, wherein the training process of the point cloud target detection neural model in the step S3 specifically comprises the following steps:
s3-1: establishing a laser radar data set, wherein the data set is used for training a point cloud target detection neural model;
s3-2: preprocessing the laser radar data set to generate format point cloud data;
s3-3: sequentially passing the format point cloud data through a feature conversion layer, a feature extraction layer and a target detection layer of a PointPillar network to obtain a neural network model;
s3-4: checking whether the training times reach a set target or not, if not, repeating the step S3-3 until the set training times are reached, and storing the neural network model as a point cloud target detection neural network model.
4. The multi-sensor data fusion-based automatic driving decision method according to claim 1, wherein the data fusion of step S4 is performed by matching the processed image data with the radar data in a decision layer fusion manner, and mapping the obstacle position and distance detection result generated by the radar data to the coordinates of the image data to form a comprehensive feature map.
5. The multi-sensor data fusion-based automatic driving decision method according to claim 1, wherein the YOLOv7 network model comprises an image input layer, an image feature extraction layer, an image feature fusion layer and an image target detection layer; the image input layer aligns input images; the image feature extraction layer further comprises a plurality of convolution layers, a batch normalization layer and a maximum pooling layer and is used for enriching the features of the aligned images and extracting the features of lanes, vehicles and pedestrians; the image feature fusion layer is used for fusing features extracted at different stages, so that the accuracy of the features is improved; and the image target detection layer detects the road condition information characteristics of the fused characteristic graph and outputs an image detection result.
6. The multi-sensor data fusion-based automatic driving decision method according to claim 1, wherein the PointPillar network model comprises a point cloud feature conversion layer, a point cloud feature extraction layer and a point cloud target detection layer; the point cloud feature conversion layer converts the input point cloud into a sparse pseudo image; the point cloud feature extraction layer processes the pseudo image to obtain the features of a high layer; the point cloud target detection layer detects the position and the distance of a target through a regression 3D frame.
7. The multi-sensor data fusion-based automatic driving decision method according to claim 1, characterized in that the traffic sign data are detected with an improved lightweight convolutional neural network, the lightweight convolutional neural network uses dilated convolution to implement a sliding-window method, and statistical information in the data set is used to accelerate the forward propagation of the network, so as to improve the efficiency of traffic sign detection.
8. The multi-sensor data fusion-based automatic driving decision method of claim 6, wherein the point cloud data format of the input Pillar feature layer is P x N x D, where P is the selected Pillar number, N is the maximum point cloud number stored by each Pillar, and D is the dimensional attribute of the point cloud.
9. The multi-sensor data fusion-based automatic driving decision method of claim 8, wherein the dimensional attribute of each point is 9-dimensional data, characterized as:

D = (x, y, z, r, x_c, y_c, z_c, x_p, y_p)

wherein (x, y, z, r) is the original laser radar point cloud data, (x, y, z) are the three-dimensional coordinates, r is the reflected laser intensity, (x_c, y_c, z_c) is the offset of the point from the center of the N points in its pillar, and (x_p, y_p) is the offset of the point from the pillar's center coordinates.
10. An automatic driving decision SoC chip based on multi-sensor data fusion is characterized in that the SoC chip comprises a general processor and a neural network processor; the general purpose processor controls the operation of a neural network processor through custom instructions, the neural network processor being configured to perform the method of any one of claims 1-9.
CN202211082826.8A 2022-09-06 2022-09-06 Automatic driving decision-making method based on multi-sensor data fusion and SoC chip Pending CN115187964A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211082826.8A CN115187964A (en) 2022-09-06 2022-09-06 Automatic driving decision-making method based on multi-sensor data fusion and SoC chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211082826.8A CN115187964A (en) 2022-09-06 2022-09-06 Automatic driving decision-making method based on multi-sensor data fusion and SoC chip

Publications (1)

Publication Number Publication Date
CN115187964A true CN115187964A (en) 2022-10-14

Family

ID=83523212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211082826.8A Pending CN115187964A (en) 2022-09-06 2022-09-06 Automatic driving decision-making method based on multi-sensor data fusion and SoC chip

Country Status (1)

Country Link
CN (1) CN115187964A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886477A (en) * 2017-09-20 2018-04-06 武汉环宇智行科技有限公司 Unmanned neutral body vision merges antidote with low line beam laser radar
US20210094580A1 (en) * 2019-09-30 2021-04-01 Toyota Jidosha Kabushiki Kaisha Driving control apparatus for automated driving vehicle, stop target, and driving control system
CN113420637A (en) * 2021-06-18 2021-09-21 北京轻舟智航科技有限公司 Laser radar detection method under multi-scale aerial view angle in automatic driving
CN114397877A (en) * 2021-06-25 2022-04-26 南京交通职业技术学院 Intelligent automobile automatic driving system
CN114120115A (en) * 2021-11-19 2022-03-01 东南大学 Point cloud target detection method for fusing point features and grid features
CN114359181A (en) * 2021-12-17 2022-04-15 上海应用技术大学 Intelligent traffic target fusion detection method and system based on image and point cloud

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
伍晓晖 et al.: "交通标志识别方法综述" [A survey of traffic sign recognition methods], 《计算机工程与应用》 [Computer Engineering and Applications] *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984802A (en) * 2023-03-08 2023-04-18 安徽蔚来智驾科技有限公司 Target detection method, computer-readable storage medium and driving equipment
CN116229452A (en) * 2023-03-13 2023-06-06 无锡物联网创新中心有限公司 Point cloud three-dimensional target detection method based on improved multi-scale feature fusion
CN116229452B (en) * 2023-03-13 2023-11-17 无锡物联网创新中心有限公司 Point cloud three-dimensional target detection method based on improved multi-scale feature fusion
CN116453087A (en) * 2023-03-30 2023-07-18 无锡物联网创新中心有限公司 Automatic driving obstacle detection method of data closed loop
CN116453087B (en) * 2023-03-30 2023-10-20 无锡物联网创新中心有限公司 Automatic driving obstacle detection method of data closed loop
CN117111055A (en) * 2023-06-19 2023-11-24 山东高速集团有限公司 Vehicle state sensing method based on thunder fusion
CN117197019A (en) * 2023-11-07 2023-12-08 山东商业职业技术学院 Vehicle three-dimensional point cloud image fusion method and system
CN117944059A (en) * 2024-03-27 2024-04-30 南京师范大学 Track planning method based on vision and radar feature fusion
CN117944059B (en) * 2024-03-27 2024-05-31 南京师范大学 Track planning method based on vision and radar feature fusion

Similar Documents

Publication Publication Date Title
CN111583337B (en) Omnibearing obstacle detection method based on multi-sensor fusion
CN109948661B (en) 3D vehicle detection method based on multi-sensor fusion
CN115187964A (en) Automatic driving decision-making method based on multi-sensor data fusion and SoC chip
CN112149550B (en) Automatic driving vehicle 3D target detection method based on multi-sensor fusion
CN111832655B (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN110738121A (en) front vehicle detection method and detection system
CN111563415A (en) Binocular vision-based three-dimensional target detection system and method
CN113936139A (en) Scene aerial view reconstruction method and system combining visual depth information and semantic segmentation
CN115049700A (en) Target detection method and device
CN113095152B (en) Regression-based lane line detection method and system
CN114821507A (en) Multi-sensor fusion vehicle-road cooperative sensing method for automatic driving
CN117274749B (en) Fused 3D target detection method based on 4D millimeter wave radar and image
CN113378647B (en) Real-time track obstacle detection method based on three-dimensional point cloud
Kanchana et al. Computer vision for autonomous driving
Song et al. Automatic detection and classification of road, car, and pedestrian using binocular cameras in traffic scenes with a common framework
CN114155414A (en) Novel unmanned-driving-oriented feature layer data fusion method and system and target detection method
CN112950786A (en) Vehicle three-dimensional reconstruction method based on neural network
CN114820931B (en) Virtual reality-based CIM (common information model) visual real-time imaging method for smart city
CN116403186A (en) Automatic driving three-dimensional target detection method based on FPN Swin Transformer and Pointernet++
US20220371606A1 (en) Streaming object detection and segmentation with polar pillars
CN113611008B (en) Vehicle driving scene acquisition method, device, equipment and medium
CN113762195A (en) Point cloud semantic segmentation and understanding method based on road side RSU
CN112766100A (en) 3D target detection method based on key points
Yuan et al. Real-time long-range road estimation in unknown environments
Zhang et al. End-to-end BEV perception via homography matrix

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20221014)