CN117789193A - Multimode data fusion 3D target detection method based on secondary enhancement

Multimode data fusion 3D target detection method based on secondary enhancement

Info

Publication number
CN117789193A
CN117789193A (application CN202311714141.5A)
Authority
CN
China
Prior art keywords
data
point cloud
enhancement
instance
detecting
Prior art date
Legal status
Pending
Application number
CN202311714141.5A
Other languages
Chinese (zh)
Inventor
袁豆豆
梁艳菊
Current Assignee
Wuxi Internet Of Things Innovation Center Co ltd
Zhongwei Wuchuang Intelligent Technology Shanghai Co ltd
Original Assignee
Wuxi Internet Of Things Innovation Center Co ltd
Zhongwei Wuchuang Intelligent Technology Shanghai Co ltd
Priority date
Filing date
Publication date
Application filed by Wuxi Internet Of Things Innovation Center Co ltd and Zhongwei Wuchuang Intelligent Technology Shanghai Co ltd
Priority to CN202311714141.5A
Publication of CN117789193A
Legal status: Pending (Current)

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a multi-modal data fusion 3D target detection method based on secondary enhancement, which comprises the following steps: S1) using a camera and a lidar as the two sensing data sources to acquire multi-view images and a lidar point cloud; S2) performing a first enhancement on the point cloud data, which consists of N points each containing only three-dimensional coordinate values; S3) pillarizing the once-enhanced point cloud data and performing a second enhancement on the pillar data; S4) processing the twice-enhanced data with a multi-scale 3D target detection network to obtain the final 3D target detection result. By enhancing the perception data twice, the invention fully exploits the characteristics of each data modality, can accurately perceive nearby and distant obstacles in complex scenes or poor environments, and has stronger robustness, thereby effectively improving the obstacle detection capability of an autonomous vehicle in difficult scenarios such as extreme weather and long range.

Description

Multimode data fusion 3D target detection method based on secondary enhancement
Technical Field
The invention relates to multi-modal data fusion 3D target detection methods, and in particular to a multi-modal data fusion 3D target detection method based on secondary enhancement.
Background
Conventional perception algorithms that perform only a single enhancement on camera and lidar sensor data suffer from low accuracy and insufficient robustness, and struggle to cope with complex and difficult driving scenes such as areas crowded with pedestrians and vehicles, high-brightness or low-illumination conditions, and rain or snow.
In autonomous driving, it is difficult to handle perception tasks in extreme or otherwise difficult scenarios using a single sensor as the perception data source. A camera sensor provides image data containing texture and color information, but struggles to obtain effective data under high brightness, low illumination or poor visibility. A lidar sensor provides point cloud data over a wide range and fully captures the surrounding geometric structure, but lacks texture and color information.
Existing multi-modal perception schemes generally use a camera sensor and a lidar sensor as data sources. Multiple cameras mounted on an autonomous vehicle acquire RGB image data covering 360 degrees around the ego vehicle. The RGB images contain rich semantic information and lend themselves to 2D segmentation and classification of obstacles, but they lack depth information, so the position of an obstacle in 3D space cannot be obtained accurately. The lidar sensor, typically mounted on the roof of the autonomous vehicle, acquires point cloud data covering the full 360-degree range. The point cloud covers a larger range and contains rich geometric information, but lacks texture and color. Existing schemes add category information or image features to the point cloud data via an image segmentation or feature extraction network, using the projection relationship between the point cloud and the image.
For example, some existing schemes use an auxiliary network to process the image data and use its output to enhance the point cloud data, so that the point cloud carries characteristics of the image data and the perception effect is ultimately improved. Specifically, a 2D instance mask and class are obtained from an image instance segmentation network, and a 3D segmented instance and the center coordinates of each instance are then obtained from the projection relationship between the original point cloud and the image. Each point in the original point cloud is 3-dimensional, consisting of the x-, y- and z-axis coordinate values. Adding the 3D segmentation instance and the instance center coordinates to the original point cloud yields enhanced point cloud data of (3+C+3) dimensions: the 3-dimensional coordinate values, a C-dimensional class vector (C categories, background included), and the 3-dimensional coordinates of the instance center. The enhanced point cloud is then fed into an existing point-cloud-based 3D target detection model for detection.
Although the enhanced point cloud data can improve the perception effect of the model to some extent, the obstacle is incomplete under the sensor's viewing angle, so its full shape is hard to recover, and this incomplete data has a considerable influence on the detection result.
Disclosure of Invention
The technical problem the invention aims to solve is to provide a multi-modal data fusion 3D target detection method based on secondary enhancement, which enhances the perception data twice, fully exploits the characteristics of each data modality, can accurately perceive nearby and distant obstacles in complex scenes or poor environments, and has stronger robustness.
The technical solution adopted to solve the above problem is a multi-modal data fusion 3D target detection method based on secondary enhancement, comprising the following steps: S1) using a camera and a lidar as the two sensing data sources to acquire multi-view images and a lidar point cloud; S2) performing a first enhancement on the point cloud data, which consists of N points each containing only three-dimensional coordinate values; S3) pillarizing the once-enhanced point cloud data and performing a second enhancement on the pillar data; S4) processing the twice-enhanced data with a multi-scale 3D target detection network to obtain the final 3D target detection result.
Further, the step S2 includes: S21) inputting the multi-view images and obtaining 2D instance masks and instance category labels from an instance segmentation network; S22) obtaining 3D segmentation instances according to the projection relationship between the original point cloud and the images; S23) performing outlier detection on each 3D segmentation instance and filtering out the small number of background points it contains; S24) calculating the instance center coordinates from the filtered 3D segmentation instances, and adding the 3D segmentation instances and the instance center coordinates to the original point cloud data to obtain the enhanced point cloud data.
Further, the enhanced point cloud data in the step S3 comprises the 3-dimensional coordinate values, a C-dimensional class vector, and the 3-dimensional coordinates of the instance center.
Further, the step S3 includes: S31) pillarizing the input enhanced point cloud data; S32) extracting features from each non-empty pillar with a simplified PointNet to obtain a feature map; S33) feeding the feature map into a pillar occupancy probability prediction network, obtaining three lower-resolution feature maps after three downsamplings, upsampling each of them back to the size of the original feature map, concatenating them along the channel dimension and reducing the dimensionality, and finally obtaining, for each pillar, the probability that it belongs to some target; S34) after a dimension change, appending the pillar occupancy probability to each point in each pillar to obtain the enhanced pillars.
Further, the step S31 specifically includes: dividing the plane into H×W grid cells and accordingly dividing the 3-dimensional space into H×W pillars, each pillar containing K points of dimension (3+C+3); pillars with fewer than K points are padded with zeros, and points in excess of K are randomly sampled.
Further, the simplified PointNet in the step S32 comprises a linear layer, a batch normalization layer, a ReLU and a max pooling layer, and extracts features from each non-empty pillar to obtain a feature map of size H×W×C1; the step S34 finally obtains enhanced pillars of size H×W×K×(3+C+3+1).
Further, the step S4 includes: S41) performing feature extraction with the simplified PointNet to obtain a feature map of size H×W×C2; S42) feeding the feature map into a backbone network and generating three feature maps of different sizes through a series of downsampling, upsampling and element-wise addition operations; S43) feeding the three feature maps into detection heads of different scales to detect targets of different sizes.
Further, in the step S43 the small feature map is used for detecting large vehicles, the medium feature map for detecting small vehicles, and the large feature map for detecting pedestrians and riders.
Compared with the prior art, the invention has the following beneficial effects: the multi-modal data fusion 3D target detection method based on secondary enhancement further enhances the input data of the perception module and thereby strengthens its perception capability. After category and related information is obtained through the image segmentation network, the complete shape of the obstacle is predicted from the first-enhanced data by fully exploiting the semantic and geometric information, further enhancing the input data. The invention effectively improves the obstacle detection capability of an autonomous vehicle in difficult scenarios such as extreme weather and long range.
Drawings
FIG. 1 is a schematic diagram of the first enhancement of the point cloud data according to the present invention;
FIG. 2 is a schematic diagram of the second enhancement of the once-enhanced point cloud data according to the present invention;
FIG. 3 is a diagram of the multi-scale 3D object detection network according to the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples.
In an autonomous driving scene, the data collected by the sensors around the vehicle is limited; obstacle perception based only on camera or lidar data cannot fully sense the surrounding environment and lacks robustness. Some recent multi-modal perception schemes perform a single data enhancement of the point cloud using image data, so that the point cloud acquires category information, the center of the instance it belongs to, and other information. The enhanced point cloud then contains both geometric and semantic information. However, due to self-occlusion, external occlusion or data loss, the acquired point cloud is incomplete and only reflects part of the target's shape. The 3D segmentation instance obtained from the projection relationship between such a point cloud and the image, and the instance center computed from it, are therefore inaccurate: the center is merely that of the partial point cloud of the target, so the enhanced data cannot fully represent the target and the best detection effect is hard to achieve. The invention therefore performs a second enhancement on top of the original enhancement idea: combining the geometric information of the original point cloud with the semantic information obtained in the first enhancement, it predicts the complete shape of the target and strengthens the perception of obstacles, in particular the detection of distant small objects and occluded objects.
The instance segmentation network, instance segmentation mask, outlier filtering, PointNet, and downsampling and upsampling used in the invention are described as follows:
(1) Instance segmentation network: an instance segmentation network is a deep learning model designed to perform instance segmentation tasks. These networks typically consist of two main components: object detection and semantic segmentation. The object detection part is responsible for locating and identifying the different object instances, which are usually represented by bounding boxes. The semantic segmentation part assigns a class label to each pixel, classifying every pixel in the image into a category. The combination of the two enables the network to segment each object instance and assign it a unique identifier.
(2) Instance segmentation mask: an instance segmentation mask assigns each pixel in an image to its corresponding object instance, typically implemented as a mask map. The mask is an image with the same resolution as the original, in which the object instance each pixel belongs to is encoded with a distinct pixel value or color. Instance segmentation masks can be used to accurately extract and segment each object in the image and to perform various analysis and understanding tasks.
(3) Outlier filtering: owing to segmentation errors, a segmented 3D instance may include a small number of background points, which can be removed with statistics-based or density-estimation-based methods.
(4) PointNet: PointNet is a deep learning network structure for processing point cloud data; its main objective is to learn feature representations of objects from point clouds, and it can be used for tasks such as point cloud classification, segmentation and semantic analysis. The invention uses a simplified version of PointNet comprising only a linear layer, a batch normalization layer, a ReLU and a max pooling layer.
(5) Downsampling and upsampling: downsampling and upsampling are two operations commonly used in deep learning and signal processing to adjust the resolution or size of data. Downsampling refers to reducing the resolution or size of data, often accompanied by loss of information. Upsampling refers to increasing the resolution or size of data, typically accompanied by interpolation or padding of information.
The invention likewise adopts a camera and a lidar as the two sensing data sources, but enhances the perception data twice, fully exploiting the characteristics of each data modality. It can accurately perceive nearby and distant obstacles in complex scenes or poor environments, has stronger robustness, and aims to solve the problem that obstacles are difficult to detect accurately for an autonomous vehicle in extreme weather, at long range and in similar scenarios. The invention provides a multi-modal data fusion 3D target detection method based on secondary enhancement, which comprises the following steps:
S1) using a camera and a lidar as the two sensing data sources to acquire multi-view images and a lidar point cloud;
S2) performing a first enhancement on the point cloud data, which consists of N points each containing only three-dimensional coordinate values;
S3) pillarizing the once-enhanced point cloud data and performing a second enhancement on the pillar data;
S4) processing the twice-enhanced data with a multi-scale 3D target detection network to obtain the final 3D target detection result.
Referring to FIG. 1, the first enhancement is applied to the point cloud data, which consists of N points containing only three-dimensional coordinate values. It specifically comprises the following steps:
first, a multi-view image is input, and a 2D instance mask and an instance category label are obtained by an instance segmentation network.
Then 3D segmentation instances are obtained according to the projection relationship between the original point cloud and the image. Because the segmentation results of the instance segmentation network and the projection of the original point cloud onto the image both contain some error, outlier detection must be performed on each 3D segmentation instance to filter out the small number of background points it contains; a sketch of such a filter is shown below.
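As a concrete illustration of this filtering step, the following sketch applies a simple statistical criterion (distance to the instance centroid compared against the mean plus a multiple of the standard deviation). The threshold factor and the NumPy implementation are assumptions for illustration; the invention only requires that some statistics- or density-based method removes the stray background points.

```python
import numpy as np

def filter_instance_outliers(instance_points: np.ndarray, std_factor: float = 2.0) -> np.ndarray:
    """Remove background points mistakenly assigned to one 3D segmentation instance.

    instance_points: (M, 3) array of x, y, z coordinates of the instance.
    std_factor: hypothetical threshold multiplier; points whose distance to the
        instance centroid exceeds mean + std_factor * std are treated as outliers.
    """
    centroid = instance_points.mean(axis=0)
    dists = np.linalg.norm(instance_points - centroid, axis=1)
    keep = dists <= dists.mean() + std_factor * dists.std()
    return instance_points[keep]
```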
Finally, the instance center coordinates are computed from the filtered 3D segmentation instances, and the 3D segmentation instances and the instance center coordinates are added to the original point cloud data to obtain the enhanced point cloud. The enhanced point cloud is (3+C+3)-dimensional, comprising the 3-dimensional coordinate values, a C-dimensional class vector (C categories, background included), and the 3-dimensional coordinates of the instance center.
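A minimal sketch of this first enhancement is given below, assuming the 2D instance mask and a point-to-pixel projection are already available. The function name, the pixel_uv and instance_class inputs, and the convention that the background occupies the last class channel are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def first_enhancement(points: np.ndarray,        # (N, 3) raw lidar points
                      pixel_uv: np.ndarray,      # (N, 2) projected pixel coordinates (assumed given)
                      instance_mask: np.ndarray, # (H_img, W_img) instance ids, 0 = background
                      instance_class: dict,      # instance id -> class index in [0, C-1]
                      num_classes: int) -> np.ndarray:
    """Attach a C-dim class vector and a 3-dim instance center to every point."""
    n = points.shape[0]
    class_vec = np.zeros((n, num_classes), dtype=np.float32)
    centers = np.zeros((n, 3), dtype=np.float32)

    u, v = pixel_uv[:, 0].astype(int), pixel_uv[:, 1].astype(int)
    inst_ids = instance_mask[v, u]               # instance id of each point

    for inst_id in np.unique(inst_ids):
        sel = inst_ids == inst_id
        if inst_id == 0:                         # background (last channel by assumption)
            class_vec[sel, num_classes - 1] = 1.0
            continue
        inst_pts = filter_instance_outliers(points[sel])   # filter sketched above
        class_vec[sel, instance_class[inst_id]] = 1.0
        centers[sel] = inst_pts.mean(axis=0)     # center of the filtered instance points

    # enhanced point cloud: (N, 3 + C + 3)
    return np.concatenate([points, class_vec, centers], axis=1)
```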
With continued reference to FIG. 2, the second enhancement is applied to the pillar data. It specifically comprises the following steps:
the input enhanced point cloud data is first columnar. Specifically, the plane is divided into H×W plane grids, then the 3-dimensional space can be divided into H×W columns according to the number of grids, each column contains K points, the dimension of each point is (3+C+3), columns with less than K points are filled with 0, and random sampling is carried out on the more than K points.
Then features are extracted from each non-empty pillar with the simplified PointNet (comprising a linear layer, a batch normalization layer, a ReLU and a max pooling layer) to obtain a feature map of size H×W×C1.
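A hedged PyTorch sketch of this simplified PointNet is shown below; the batched (B, H, W, K, D) tensor layout and the channel size C1 are assumptions.

```python
import torch
import torch.nn as nn

class SimplePointNet(nn.Module):
    """Simplified PointNet: linear layer -> batch norm -> ReLU -> max pool over the K points."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, pillars: torch.Tensor) -> torch.Tensor:
        # pillars: (B, H, W, K, in_dim) -> pillar feature map (B, H, W, out_dim)
        B, H, W, K, D = pillars.shape
        x = self.relu(self.bn(self.linear(pillars.reshape(-1, D))))  # point-wise features
        x = x.reshape(B, H, W, K, -1)
        return x.max(dim=3).values                                   # max pool over the K points
```

Under these assumptions, SimplePointNet(3 + C + 3, C1) applied to the pillar tensor yields the H×W×C1 feature map described above.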
The feature map obtained above is fed into the pillar occupancy probability prediction network, shown in the dashed box on the right of FIG. 2. Three lower-resolution feature maps are obtained after three downsamplings; each is then upsampled back to the size of the original feature map (H×W×C1), the upsampled maps are concatenated along the channel dimension and reduced in dimensionality, and finally the probability that each pillar belongs to some target is obtained.
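The occupancy prediction network can be sketched as below: three stride-2 convolution blocks give progressively lower resolutions, each output is upsampled back to the input size, the results are concatenated and reduced with a 1×1 convolution, and a sigmoid yields the per-pillar probability. The channel counts, kernel sizes and bilinear upsampling are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PillarOccupancyNet(nn.Module):
    """Predicts, for each pillar, the probability that it belongs to some target."""
    def __init__(self, c1: int):
        super().__init__()
        def down(cin, cout):                             # stride-2 block for downsampling
            return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.down1, self.down2, self.down3 = down(c1, c1), down(c1, c1), down(c1, c1)
        self.reduce = nn.Conv2d(3 * c1, 1, kernel_size=1)  # dimensionality reduction to 1 channel

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C1, H, W) pillar feature map
        d1 = self.down1(feat)                            # three progressively lower resolutions
        d2 = self.down2(d1)
        d3 = self.down3(d2)
        size = feat.shape[-2:]
        ups = [F.interpolate(d, size=size, mode='bilinear', align_corners=False)
               for d in (d1, d2, d3)]                    # upsample each back to H x W
        fused = torch.cat(ups, dim=1)                    # concatenate along channels
        return torch.sigmoid(self.reduce(fused))         # (B, 1, H, W) occupancy probability
```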
The pillar occupancy probability (H×W×1) is reshaped to H×W×K×1 and appended to each point in each pillar, finally yielding the enhanced pillars of size H×W×K×(3+C+3+1).
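Continuing the sketches above (the pillars and occupancy tensors are the hypothetical outputs of the previous examples), appending the probability to every point is a simple broadcast and concatenation:

```python
import torch

# pillars:   (B, H, W, K, 3 + C + 3) enhanced pillar points (from the pillarization sketch)
# occupancy: (B, 1, H, W) pillar occupancy probabilities (from PillarOccupancyNet)
K = pillars.shape[3]
occ = occupancy.permute(0, 2, 3, 1)                   # (B, H, W, 1)
occ = occ.unsqueeze(3).expand(-1, -1, -1, K, -1)      # (B, H, W, K, 1)
enhanced_pillars = torch.cat([pillars, occ], dim=-1)  # (B, H, W, K, 3 + C + 3 + 1)
```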
FIG. 3 shows the multi-scale 3D object detection network of the invention, which processes the twice-enhanced data to obtain the final 3D object detection result. It specifically comprises the following steps:
First, feature extraction is performed using the simplified PointNet to obtain a feature map of size H×W×C2.
The feature map is then fed into the backbone network and subjected to a series of downsampling, upsampling and element-wise addition operations to generate three feature maps of different sizes.
Finally, the three feature maps are fed into detection heads of different scales to detect targets of different sizes: the small feature map is used for detecting large vehicles such as trucks, the medium feature map for detecting small vehicles such as cars, and the large feature map for detecting pedestrians and riders.
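A hedged sketch of such a backbone and head arrangement is given below: three stages produce feature maps of decreasing resolution, a top-down pass refines them by upsampling and element-wise addition, and each scale feeds its own detection head. The channel counts, strides, number of anchors and box parameterization are illustrative assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleBackbone(nn.Module):
    """Produces three feature maps of different sizes for three detection heads."""
    def __init__(self, c2: int):
        super().__init__()
        def block(cin, cout, stride):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.stage1 = block(c2, c2, 1)       # large feature map: pedestrians and riders
        self.stage2 = block(c2, c2, 2)       # medium feature map: small vehicles (cars)
        self.stage3 = block(c2, c2, 2)       # small feature map: large vehicles (trucks)
        self.lateral1 = block(c2, c2, 1)
        self.lateral2 = block(c2, c2, 1)

    def forward(self, x: torch.Tensor):
        f1 = self.stage1(x)                  # H x W
        f2 = self.stage2(f1)                 # H/2 x W/2
        f3 = self.stage3(f2)                 # H/4 x W/4
        # top-down refinement: upsample and add element-wise, then smooth
        f2 = self.lateral2(f2 + F.interpolate(f3, size=f2.shape[-2:], mode='nearest'))
        f1 = self.lateral1(f1 + F.interpolate(f2, size=f1.shape[-2:], mode='nearest'))
        return f1, f2, f3                    # large, medium, small feature maps

def make_head(c2: int, num_anchors: int = 2, box_dim: int = 7, num_cls: int = 1) -> nn.Module:
    """One detection head per scale; predicts box parameters and class scores per anchor."""
    return nn.Conv2d(c2, num_anchors * (box_dim + num_cls), kernel_size=1)
```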
In summary, by combining the semantic information contained in the image data with the geometric information contained in the point cloud data, the invention enhances the perception data twice, fully exploits the characteristics of each data modality, can accurately perceive nearby and distant obstacles in complex scenes or poor environments, and has stronger robustness. Its specific advantages are:
(1) Two-stage data enhancement: unlike schemes that enhance the point cloud data only once using the image data, the invention enhances the point cloud data twice. It fully exploits the texture, color and other information in the image data, then predicts the complete shape of the target from the first-enhanced data to obtain re-enhanced data, finally improving detection accuracy and robustness.
(2) Network integration: the invention integrates multiple network structures. A simplified PointNet extracts features from the non-empty pillars; the pillar occupancy prediction network shown in the dashed box of FIG. 2 predicts the probability that each pillar is occupied by a target; the backbone network shown in the dashed box of FIG. 3 outputs multi-scale features; and multi-scale detection heads process these features, with heads of different scales detecting targets of different sizes.
Of course, the invention may also enhance the point cloud data using data from other sensors, such as ultrasonic radar. Likewise, other image segmentation networks may be used for instance segmentation of the multi-view images, other network structures may be used to predict the probability that each pillar belongs to some target, other feature extraction networks may serve as the backbone, and detection heads of other structures may process the resulting features; these variants are not enumerated in detail here.
While the invention has been described with reference to the preferred embodiments, it is not intended to limit the invention thereto, and it is to be understood that other modifications and improvements may be made by those skilled in the art without departing from the spirit and scope of the invention, which is therefore defined by the appended claims.

Claims (8)

1. A multi-modal data fusion 3D target detection method based on secondary enhancement, characterized by comprising the following steps:
S1) using a camera and a lidar as the two sensing data sources to acquire multi-view images and a lidar point cloud;
S2) performing a first enhancement on the point cloud data, which consists of N points each containing only three-dimensional coordinate values;
S3) pillarizing the once-enhanced point cloud data and performing a second enhancement on the pillar data;
S4) processing the twice-enhanced data with a multi-scale 3D target detection network to obtain the final 3D target detection result.
2. The multi-modal data fusion 3D target detection method based on secondary enhancement as claimed in claim 1, wherein the step S2 includes:
S21) inputting the multi-view images and obtaining 2D instance masks and instance category labels from an instance segmentation network;
S22) obtaining 3D segmentation instances according to the projection relationship between the original point cloud and the images;
S23) performing outlier detection on each 3D segmentation instance and filtering out the small number of background points it contains;
S24) calculating the instance center coordinates from the filtered 3D segmentation instances, and adding the 3D segmentation instances and the instance center coordinates to the original point cloud data to obtain the enhanced point cloud data.
3. The multi-modal data fusion 3D target detection method based on secondary enhancement as claimed in claim 2, wherein the enhanced point cloud data in the step S3 comprises the 3-dimensional coordinate values, a C-dimensional class vector, and the 3-dimensional coordinates of the instance center.
4. The multi-modal data fusion 3D target detection method based on secondary enhancement as claimed in claim 1, wherein the step S3 includes:
S31) pillarizing the input enhanced point cloud data;
S32) extracting features from each non-empty pillar with a simplified PointNet to obtain a feature map;
S33) feeding the feature map into a pillar occupancy probability prediction network, obtaining three lower-resolution feature maps after three downsamplings, upsampling each of them back to the size of the original feature map, concatenating them along the channel dimension and reducing the dimensionality, and finally obtaining, for each pillar, the probability that it belongs to some target;
S34) after a dimension change, appending the pillar occupancy probability to each point in each pillar to obtain the enhanced pillars.
5. The multi-modal data fusion 3D target detection method based on secondary enhancement as claimed in claim 4, wherein the step S31 specifically includes: dividing the plane into H×W grid cells and accordingly dividing the 3-dimensional space into H×W pillars, each pillar containing K points of dimension (3+C+3); pillars with fewer than K points are padded with zeros, and points in excess of K are randomly sampled.
6. The multi-modal data fusion 3D target detection method based on secondary enhancement as claimed in claim 5, wherein the simplified PointNet in the step S32 comprises a linear layer, a batch normalization layer, a ReLU and a max pooling layer, and extracts features from each non-empty pillar to obtain a feature map of size H×W×C1; the step S34 finally obtains enhanced pillars of size H×W×K×(3+C+3+1).
7. The multi-modal data fusion 3D target detection method based on secondary enhancement as claimed in claim 1, wherein the step S4 includes:
S41) performing feature extraction with the simplified PointNet to obtain a feature map of size H×W×C2;
S42) feeding the feature map into a backbone network and generating three feature maps of different sizes through a series of downsampling, upsampling and element-wise addition operations;
S43) feeding the three feature maps into detection heads of different scales to detect targets of different sizes.
8. The multi-modal data fusion 3D target detection method based on secondary enhancement as claimed in claim 7, wherein in the step S43 the small feature map is used for detecting large vehicles, the medium feature map for detecting small vehicles, and the large feature map for detecting pedestrians and riders.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311714141.5A CN117789193A (en) 2023-12-13 2023-12-13 Multimode data fusion 3D target detection method based on secondary enhancement

Publications (1)

Publication Number Publication Date
CN117789193A (en) 2024-03-29

Family

ID=90399188

Country Status (1)

Country Link
CN (1) CN117789193A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination