CN116129233A - Automatic driving scene panoramic segmentation method based on multi-mode fusion perception - Google Patents

Automatic driving scene panoramic segmentation method based on multi-mode fusion perception

Info

Publication number
CN116129233A
CN116129233A
Authority
CN
China
Prior art keywords
dimensional
point cloud
network
center
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310153288.5A
Other languages
Chinese (zh)
Inventor
余谦 (Yu Qian)
马利庄 (Ma Lizhuang)
张志忠 (Zhang Zhizhong)
谭鑫 (Tan Xin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202310153288.5A priority Critical patent/CN116129233A/en
Publication of CN116129233A publication Critical patent/CN116129233A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • G06V10/811 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data the classifiers operating on different input data, e.g. multi-modal recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778 Active pattern-learning, e.g. online learning of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses an automatic driving scene panoramic segmentation method based on multi-mode fusion perception. First, the point cloud is projected onto two-dimensional image feature maps to obtain per-point image features, and these image features are fed into the backbone network together with the three-dimensional features of the point cloud. Next, the bottom layer of the backbone network is partitioned according to the camera fields of view and matched to the image feature maps; the matched features are flattened into a sequence and sent into a multi-head self-attention network that learns the correlation between the two modalities. Finally, the backbone network outputs clustering information, and during clustering a multi-layer perceptron learns the cosine similarity between cluster centers to decide whether centers should be merged. The invention addresses the problem that panoramic segmentation in automatic driving scenes cannot make good use of multi-modal information, and the two levels of fusion greatly strengthen the learning of semantic information.

Description

Automatic driving scene panoramic segmentation method based on multi-mode fusion perception
Technical Field
The invention relates to the technical field of perception for automatic driving scenes, and in particular to a method for panoramic segmentation that uses data from multiple different modalities simultaneously.
Background
3D visual perception has become a vital task in a wide range of applications including autonomous driving, robotics and virtual reality. Within 3D visual perception, panoramic segmentation is a comprehensive task combining semantic segmentation, object detection and instance segmentation. It is very challenging because it requires predicting a semantic label for every point of the background classes, such as trees and roads, while simultaneously recognizing instances of the object classes, such as cars, bicycles and pedestrians.
Panoramic segmentation methods can be broadly divided into two categories: detection-based methods and detection-free methods. Detection-based methods typically adopt a two-stage training process that first generates object detection boxes with instance labels and then merges these instance predictions into the semantic segmentation. In contrast, detection-free methods predict the semantic category information and the information required for instance clustering at the same time. During training both are supervised directly by the labels, and at inference time each point belonging to an object category is assigned to the nearest object center, a post-processing step that requires no parameters. Detection-free methods achieve better performance because the two kinds of supervision are trained jointly in a single model. However, both families of methods suffer from under-segmentation or over-segmentation errors. Previous studies attribute this to the fact that laser point clouds are typically sparse and unevenly distributed, so deep neural networks behave very differently on near and far point clouds: for most three-dimensional networks, distant objects occupy only a few points and appear too small in the view to be detected effectively, while nearby objects tend to be so large that they are over-segmented.
Images, by contrast, provide rich texture and color information. Because the pixels in an image are dense, even small objects carry enough information to be identified. This makes images a natural complement to the laser point cloud for scene perception. Moreover, most automatic driving systems are equipped with both cameras and a lidar, so the multi-modal fusion concept has a very broad development prospect.
Disclosure of Invention
The invention aims to provide an automatic driving scene panorama segmentation method based on multi-mode fusion perception. The method fuses local detail texture information with global semantic information, comprising point-to-pixel fusion at the input stage and high-level voxel-to-image fusion. The point-to-pixel fusion exploits the geometric relationship between the lidar and image camera coordinate systems and corrects the misalignment caused by their asynchronous capture times as the vehicle moves. The global semantic fusion performs multi-head self-attention feature learning on voxel features and image features in the deep layers of the network. Both components are designed as lightweight plug-and-play modules. Finally, during instance clustering, the model consults the pixel-fusion information again and uses color and texture for a secondary clustering step, which alleviates the over-segmentation of large objects.
The specific technical solution for achieving the aim of the invention is as follows:
An automatic driving scene panoramic segmentation method based on multi-mode fusion perception, characterized in that multi-modal feature fusion is used to strengthen the feature extraction capability of the network and the clustering of large objects; the method comprises the following specific steps:
Step 1: image feature generation and projection
A1: Construct a data set from the RGB images of the surrounding environment and the lidar point clouds received from the on-board sensors. The camera images of a scene in the data set are fed together into a pre-trained image semantic segmentation network to obtain a plurality of minimum-size two-dimensional feature maps and half-size two-dimensional feature maps; all parameters of the image semantic segmentation network are kept frozen during training.
A2: Construct a transformation matrix from the relative positions of the sensors on the data-collection vehicle and project the point cloud onto the corresponding half-size two-dimensional feature map; read features from the feature map at the pixel coordinates of the projected points to form the image features of the point cloud. Points that cannot be projected onto any two-dimensional feature map are filtered out directly, and points that project onto several two-dimensional feature maps at once take their features from one of those maps chosen at random.
Step 2: backbone network feature initialization and training
B1: With the image features of the point cloud and the three-dimensional coordinate geometric features of the raw point cloud obtained in step 1, pass the image features through a dimension-reducing multi-layer perceptron and the geometric features through a dimension-expanding multi-layer perceptron so that both reach the same dimensionality. Then partition both features according to a spatial grid and pool them separately, which removes the mismatch in point counts between the two modalities. Finally, add the two modal features of each grid cell together to obtain the multi-modal fused network initialization input features, and use a cylindrical grid in a polar coordinate system as the backbone network data structure for feature learning.
B2: Train the semantic segmentation branch of the network with focal loss and cross-entropy loss, and train the center-point probability regression and the center-point coordinate offset regression with L2 loss.
Step 3: fusion of high-dimensional features at the bottom of the backbone network
C1: Take the three-dimensional feature map at the bottommost layer of the backbone network and, using the orientation angles of the cameras on the vehicle, divide it by polar angle into as many regions as there are cameras, each region corresponding to one of the minimum-size two-dimensional feature maps from A2. Flatten the three-dimensional features and the two-dimensional features of each corresponding region and concatenate them into serialized information.
C2: Form the positional encoding of the serialized information from the absolute positions of the three-dimensional grid-cell centers in the whole space and the pixel coordinates on the minimum-size two-dimensional feature map. Add the positional encoding to the serialized information and feed it into a multi-head self-attention network of linear complexity, which learns high-dimensional cross-modal information fusion while keeping the feature dimensionality unchanged.
Step 4: integration and correction of the network outputs
D1: The network outputs grid-level semantic segmentation predictions, assigning a semantic class label to every grid cell of the three-dimensional space that contains points; each point then inherits the class label of the grid cell it lies in. This yields the class of every point in the input point cloud and completes point-level semantic segmentation.
D2: The network outputs a center-point probability map under the bird's-eye view, which is filtered with max pooling of kernel size 5 to obtain the coordinates of candidate center points. Using the transformation matrix constructed in A2, project the candidate center points onto the corresponding pixels of the corresponding half-size two-dimensional feature map and read the two-dimensional image features at each center point. A multi-layer perceptron then learns the cosine similarity between every center point and the others, and center points whose semantic segmentation result is a large object are merged by thresholding the similarity to obtain the final instance target centers, which improves the clustering of large objects. Here large objects are buses, trucks, fire engines and other engineering and construction vehicles.
D3: The network outputs, under the bird's-eye view, the coordinate offset of every grid cell from the center point it belongs to. The coordinates of the occupied grid cells are shifted by the predicted offsets to obtain new positions, and each cell is then assigned to its nearest center point; all cells belonging to the same center point are treated as one instance, whose semantic category is obtained by majority voting over the semantic class labels of all its points. The final output therefore labels each point whose semantic segmentation result is a background category with the corresponding class label, and labels each foreground point with both a class label and an instance number, predicting the instance's identifier within the scene and the category it belongs to.
In step A2, a transformation matrix is constructed from the relative positions of the sensors on the data-collection vehicle and the point cloud is projected onto the corresponding pixels of the corresponding half-size two-dimensional feature map, specifically: an extrinsic matrix is built from the relative pose of the lidar and the RGB camera, and multiplying the three-dimensional point coordinates by the extrinsic matrix transforms the points from the lidar coordinate system to the RGB camera coordinate system; an intrinsic matrix is then built from the internal parameters of the RGB camera, and multiplying the point coordinates in the camera coordinate system by the intrinsic matrix gives the exact position of each three-dimensional point on the two-dimensional image, thereby projecting the three-dimensional point cloud onto the two-dimensional image.
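As an illustration of this projection, below is a minimal sketch assuming a 4 x 4 lidar-to-camera extrinsic matrix and a 3 x 3 pinhole intrinsic matrix; the function and variable names, and the half-size indexing hint in the final comment, are assumptions made for illustration rather than the patent's implementation.

```python
import numpy as np

def project_points_to_image(points_xyz, T_cam_from_lidar, K, img_w, img_h):
    """Project N x 3 lidar points into pixel coordinates of one camera.

    points_xyz       : (N, 3) points in the lidar frame
    T_cam_from_lidar : (4, 4) extrinsic matrix (lidar -> camera)
    K                : (3, 3) camera intrinsic matrix
    Returns pixel coordinates (N, 2) and a mask of points that land inside the image.
    """
    n = points_xyz.shape[0]
    pts_h = np.hstack([points_xyz, np.ones((n, 1))])      # homogeneous coords (N, 4)
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]       # camera frame (N, 3)
    in_front = pts_cam[:, 2] > 0                          # keep points in front of the camera
    uvw = (K @ pts_cam.T).T                               # (N, 3)
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)    # perspective divide
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < img_w) & (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    return uv, in_front & inside

# Features of visible points could then be read from the half-size feature map,
# e.g. feat = feature_map[:, (v * 0.5).astype(int), (u * 0.5).astype(int)].
```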
Feature learning in step B1 uses a cylindrical grid in a polar coordinate system as the backbone network data structure, specifically: the arc length of a grid cell grows along the range axis in polar coordinates, which matches the near-dense, far-sparse characteristic of the lidar; this balances the difference in the number of points between near and far cells and improves the feature extraction capability of the network. With polar coordinates, the angle axis also corresponds directly to the field-of-view angle of each image camera, so the high-dimensional feature fusion can be restricted to one sixth of the scene area, reducing the consumption of computational resources.
Training the semantic segmentation branch of the network with focal loss and cross-entropy loss in step B2, specifically: when computing the loss, the cross-entropy loss is used as the classification loss, and at the same time the predicted class probabilities are used to generate a penalty term that down-weights classes predicted with high confidence, forming a focal loss. This reduces the share of easily classified categories in the loss, improves learning on difficult categories, and alleviates the severe class imbalance of the automatic driving scene segmentation task.
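For illustration, a minimal focal-loss sketch in the commonly used form is given below, assuming PyTorch logits and integer labels; the focusing parameter gamma and the ignore index are illustrative defaults, as the patent does not specify them.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, ignore_index=255):
    """Cross-entropy down-weighted for well-classified (high-confidence) examples.

    logits  : (N, C) raw class scores
    targets : (N,) integer class labels
    """
    ce = F.cross_entropy(logits, targets, reduction="none", ignore_index=ignore_index)
    pt = torch.exp(-ce)                      # predicted probability of the true class
    return ((1.0 - pt) ** gamma * ce).mean()

# The segmentation branch could then be trained with, e.g.,
#   loss = focal_loss(pred, label) + F.cross_entropy(pred, label)
# plus L2 losses on the center heat map and the center offset regression.
```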
The linear-complexity multi-head self-attention network described in step C2, specifically: a single multi-head self-attention operation of quadratic computational complexity is split into two. The first pass replaces the input query with a learned parameter, compressing the sequence to a constant length independent of the input length; the second pass uses the output of the first pass as key and value and the original input as query, restoring the constant-length sequence to the original input size so that the output is aligned with the input. This reduces the computational complexity from quadratic to linear while retaining the feature learning capability.
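Below is a minimal sketch of such a two-pass attention, built on PyTorch's standard nn.MultiheadAttention; the number of learned queries m and the module names are illustrative assumptions, not values taken from the patent.

```python
import torch
import torch.nn as nn

class TwoPassLinearAttention(nn.Module):
    """Attention in O(N*m): compress the sequence with m learned queries, then expand back."""

    def __init__(self, dim, num_heads=8, m=64):
        super().__init__()
        self.latent = nn.Parameter(torch.randn(m, dim))     # learned queries, constant length m
        self.compress = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.expand = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                   # x: (B, N, dim)
        b = x.shape[0]
        q = self.latent.unsqueeze(0).expand(b, -1, -1)      # (B, m, dim)
        z, _ = self.compress(q, x, x)                       # pass 1: learned query, input as key/value
        y, _ = self.expand(x, z, z)                         # pass 2: input as query, compressed sequence as key/value
        return y                                            # (B, N, dim), aligned with the input
```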
In step D2, the cosine similarity between each center point and the others is learned by a multi-layer perceptron, and center points whose semantic segmentation result is a large object are merged by thresholding to obtain the final instance target centers, specifically: after the two-dimensional image features of a center point are obtained, its three-dimensional features and two-dimensional image features are added together and passed through a multi-layer perceptron to obtain the multi-modal features of the center point. Cosine similarity is computed between the multi-modal features of every pair of center points; center points whose similarity exceeds a threshold are regarded as the same instance, and a new center point, computed as the mean of their coordinates, is used as the final instance target center.
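A minimal sketch of this merge step is given below, assuming PyTorch tensors; the MLP layout, the similarity threshold and the simple union-find grouping are illustrative choices rather than the patent's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterMerger(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # fuse the (already summed) 3D + 2D features of each candidate center
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feat3d, feat2d, centers_xy, threshold=0.8):
        """feat3d, feat2d: (K, dim) per-candidate features; centers_xy: (K, 2) BEV coordinates."""
        f = self.mlp(feat3d + feat2d)                                          # multi-modal center features
        sim = F.cosine_similarity(f.unsqueeze(1), f.unsqueeze(0), dim=-1)      # (K, K) pairwise similarity

        # group centers whose pairwise similarity exceeds the threshold (tiny union-find)
        k = f.shape[0]
        parent = list(range(k))
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i in range(k):
            for j in range(i + 1, k):
                if sim[i, j] > threshold:
                    parent[find(i)] = find(j)

        groups = {}
        for i in range(k):
            groups.setdefault(find(i), []).append(i)
        # each merged group's final center is the mean of its members' coordinates
        return torch.stack([centers_xy[idx].mean(dim=0) for idx in groups.values()])
```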
The invention has the beneficial effects that:
two different layers of multi-mode fusion modes are used, so that the overall characteristic extraction capability and the final panoramic segmentation effect of the network are improved.
The cylindrical grid division structure based on the polar coordinate system is used, the problem of unbalanced distribution caused by near-density and far-distance scattering of laser radar data is solved, and the network characteristic learning capability is improved.
The secondary clustering method is provided, and a plurality of center points belonging to the same large object can be merged by utilizing color and texture information which is not available in the laser radar, so that the clustering effect of the large object which is difficult to process in the panoramic segmentation task is improved.
Drawings
FIG. 1 is a flow chart of a pixel level fusion of the present invention;
FIG. 2 is a high-level semantic feature fusion flow chart of the present invention;
FIG. 3 is a flow chart of the present invention.
Detailed description of the preferred embodiments
To facilitate understanding, the present invention is described in detail below with reference to the accompanying drawings and embodiments.
Referring to fig. 1, in step 1 of the present invention, picture data is input into a 2D image semantic segmentation network, and a minimum size feature map and a half size feature map are obtained, where the minimum size feature map is used for subsequent network global feature fusion, and the half size feature map is used for pixel level fusion.
S100: The backbone semantic segmentation network adopted for the image-modality data is SwiftNet-18, which is in common current use.
S110: the minimum size feature map is the feature map of the last layer of the downsampling of the S100 network.
S120: the half-size feature map is the penultimate layer feature map of the S100 network downsampled and then upsampled.
Referring to fig. 1, in step 1 of the present invention, the point cloud is projected onto the half-size feature map according to the coordinate transformation between the image camera and the lidar, to obtain image features.
S130: the position proportion of each point on the x axis and the y axis of each image can be calculated through the internal and external parameters of the camera and the three-dimensional coordinates of the point cloud, the legal range of the numerical value is 0 to 1, if the numerical value with the negative value or the numerical value larger than 1 appears, the point is out of the visual field of the image, and the same point is in the legal proportion range in at most two images, which means that the point appears in two images at the same time, or is not in any image range, which means that the point is not in the visual field range of 6 images of the camera.
S140: after the x-axis and y-axis proportion of the image where the point is located is obtained, the image feature corresponding to the point can be obtained from the half-size feature map, the feature of one image is taken from the points in the plurality of images, and the points not in the images are directly ignored.
S150: the points ignored by S140 belong to a subset of the number of original point clouds.
Referring to fig. 1, in step 2 of the present invention, the obtained image features of the points and the geometric features of the raw points are each learned by their own multi-layer perceptron and then pooled into the same spatial grid. Features are concatenated grid cell by grid cell, and the cells that contain points are initialized as the input features of the backbone network.
S160: and the multi-layer perceptron network is used for independently learning the image characteristics of the point cloud in the camera view and the geometric position characteristics of the original point cloud respectively.
S170: limiting the whole scene to an area of 0-50 meters, dividing the whole space into grids according to 480 units of a length axis, 360 units of an angle axis and 32 units of a height axis, maximizing a pool of features learned by S160 into corresponding grids, and connecting the features of two modes together by taking the grids as units to obtain a three-dimensional backbone network initialization feature S200.
Referring to fig. 2, in step 3 of the present invention, the polar-grid features of the bottommost layer of the backbone network are extracted and, using the camera orientation angles provided by the data set, the feature map is divided by polar angle into 6 regions, one per image. At the same time, the minimum-size 2D feature maps from step 1 are extracted, and the features of each corresponding region are flattened and concatenated into a longer piece of serialized information.
S210-220: all are backbone networks, and the backbone network used in the invention is of a U-shaped structure, and downsampling S210 is performed first, and then upsampling S220 is performed.
S230: the obtained three-dimensional polar grid features after the downsampling in S210.
S240: and according to the correspondence between the polar coordinate angle and the angle occupied by the visual field of the image camera, dividing the three-dimensional polar coordinate feature after downsampling into 6 blocks according to the angle, wherein each block corresponds to one image.
S250: and (3) tiling the corresponding grids and the feature images, and then adding the respective coordinate positions to encode the grids and the feature images into a long sequence.
S260: the linear multi-head self-attention mechanism module specifically comprises an operation of splitting a traditional multi-head self-attention mechanism into two lightweight classes, wherein in the first time, a key and a value use a long sequence output by S250, and a query uses a fixed-length parameter learned by a network to reduce the sequence length in calculation. The query of the second operation uses the output of S250, while the key and value use the output of the first operation to achieve the effect of restoring the feature to the original length. And finally reducing the calculated amount from the square level of the sequence length to the linear level.
Referring to fig. 3, in step 4 of the present invention, the coordinates of candidate center points are computed from the center-point heat map predicted by the network; the candidate centers are then projected onto the half-size feature map to obtain their image features, a multi-layer perceptron learns the cosine similarity between candidate centers, and centers whose similarity is too high are merged, achieving the secondary clustering effect. Once the center points are obtained, they are combined with the grid coordinate offset results: the center point nearest to each shifted grid cell is computed, giving the set of grid cells of each center and a preliminary clustering result. The final panoramic segmentation output of the network is then determined from the preliminary clustering result and the semantic category predictions.
S310: the result of the thermodynamic diagram prediction of the center point is a probability prediction of whether each pixel with the size of 480 x 360 and the value between 0 and 1 is the center point or not under the view angle of the bird's eye view.
S320: and the grid coordinate displacement result is the displacement deviation prediction of each pixel with the size of 480 x 360 x 2 from the center point of the grid coordinate displacement result under the view angle of the aerial view.
S330: the semantic segmentation prediction result is the probability of each category to which each grid with the size of 480 x 360 x 32 x category number belongs.
S340: and in the center point prediction and secondary clustering process, coordinates of candidate center points are calculated by using a center point thermodynamic diagram predicted by a network, then the candidate center points are projected onto a half-size feature map to obtain image features of the candidate center points, cosine similarity among the candidate center points is learned by a multi-layer perceptron, and center points with too high similarity are integrated, so that the secondary clustering effect is achieved.
S350: the category irrelevant clustering process is specifically implemented by combining the grid coordinate displacement results after the central points are acquired, calculating the central point closest to each grid displacement, namely obtaining a grid set of each central point, and obtaining a preliminary clustering result.
S360: and the example principal component voting is specifically implemented in such a way that after the preliminary clustering result is obtained, if the semantic category of each point in each clustering set is not uniform, the final category of the set is determined according to the category with the largest point number in the set at the moment, and the uniformity of the final panoramic segmentation prediction result is ensured.

Claims (6)

1. An automatic driving scene panorama segmentation method based on multi-mode fusion perception, characterized by comprising the following specific steps:
Step 1: image feature generation and projection
A1: constructing a data set from the RGB images of the surrounding environment and the lidar point clouds received from the on-board sensors; feeding the camera images of a scene in the data set together into a pre-trained image semantic segmentation network to obtain a plurality of minimum-size two-dimensional feature maps and half-size two-dimensional feature maps, all parameters of the image semantic segmentation network being kept frozen during training;
A2: constructing a transformation matrix from the relative positions of the sensors on the data-collection vehicle and projecting the point cloud onto the corresponding half-size two-dimensional feature map; reading features from the feature map at the pixel coordinates of the projected points to form the image features of the point cloud, points that cannot be projected onto any two-dimensional feature map being filtered out directly, and points that project onto several two-dimensional feature maps at once taking their features from one of those maps chosen at random;
Step 2: backbone network feature initialization and training
B1: after obtaining the image features of the point cloud and the three-dimensional coordinate geometric features of the raw point cloud in step 1, passing the image features through a dimension-reducing multi-layer perceptron and the geometric features through a dimension-expanding multi-layer perceptron so that both reach the same dimensionality; partitioning both features according to a spatial grid and pooling them separately, which removes the mismatch in point counts between the two modalities; finally adding the two modal features of each grid cell together to obtain the multi-modal fused network initialization input features, and using a cylindrical grid in a polar coordinate system as the backbone network data structure for feature learning;
B2: training the semantic segmentation branch of the network with focal loss and cross-entropy loss, and training the center-point probability regression and the center-point coordinate offset regression with L2 loss;
Step 3: fusion of high-dimensional features at the bottom of the backbone network
C1: taking the three-dimensional feature map at the bottommost layer of the backbone network and, using the orientation angles of the cameras on the vehicle, dividing it by polar angle into as many regions as there are cameras, each region corresponding to one of the minimum-size two-dimensional feature maps from A2; flattening the three-dimensional features and the two-dimensional features of each corresponding region and concatenating them into serialized information;
C2: forming the positional encoding of the serialized information from the absolute positions of the three-dimensional grid-cell centers in the whole space and the pixel coordinates on the minimum-size two-dimensional feature map; adding the positional encoding to the serialized information and feeding it into a multi-head self-attention network of linear complexity, which learns high-dimensional cross-modal information fusion while keeping the feature dimensionality unchanged;
Step 4: integration and correction of the network outputs
D1: the network outputting grid-level semantic segmentation predictions that assign a semantic class label to every grid cell of the three-dimensional space containing points, each point inheriting the class label of the grid cell it lies in, thereby obtaining the class of every point in the input point cloud and completing point-level semantic segmentation;
D2: the network outputting a center-point probability map under the bird's-eye view, which is filtered with max pooling of kernel size 5 to obtain the coordinates of candidate center points; projecting, with the transformation matrix constructed in step A2, the candidate center points onto the corresponding pixels of the corresponding half-size two-dimensional feature map and reading the two-dimensional image features at each center point; learning, with a multi-layer perceptron, the cosine similarity between every center point and the others, and merging by thresholding the center points whose semantic segmentation result is a large object to obtain the final instance target centers and improve the clustering of large objects, the large objects being buses, trucks, fire engines and other engineering and construction vehicles;
D3: the network outputting, under the bird's-eye view, the coordinate offset of every grid cell from the center point it belongs to; shifting the coordinates of the occupied grid cells by the predicted offsets to obtain new positions and assigning each cell to its nearest center point, all cells belonging to the same center point being treated as one instance whose semantic category is obtained by majority voting over the semantic class labels of all its points; the final output labeling each point whose semantic segmentation result is a background category with the corresponding class label, and labeling each foreground point with a class label and an instance number, thereby predicting the instance's identifier within the scene and the category it belongs to.
2. The automatic driving scene panorama segmentation method based on multi-mode fusion perception according to claim 1, characterized in that in step A2 a transformation matrix is constructed from the relative positions of the sensors on the data-collection vehicle and the point cloud is projected onto the corresponding pixels of the corresponding half-size two-dimensional feature map, specifically: an extrinsic matrix is built from the relative pose of the lidar and the RGB camera, and multiplying the three-dimensional point coordinates by the extrinsic matrix transforms the points from the lidar coordinate system to the RGB camera coordinate system; an intrinsic matrix is then built from the internal parameters of the RGB camera, and multiplying the point coordinates in the camera coordinate system by the intrinsic matrix gives the exact position of each three-dimensional point on the two-dimensional image, thereby projecting the three-dimensional point cloud onto the two-dimensional image.
3. The automatic driving scene panorama segmentation method based on multi-mode fusion perception according to claim 1, characterized in that in step B1 a cylindrical grid in a polar coordinate system is used as the backbone network data structure for feature learning, specifically: the arc length of a grid cell grows along the range axis in polar coordinates, which matches the near-dense, far-sparse characteristic of the lidar, balances the difference in the number of points between near and far cells, and improves the feature extraction capability of the network; with polar coordinates the angle axis also corresponds directly to the field-of-view angle of each image camera, so the high-dimensional feature fusion can be restricted to one sixth of the scene area, reducing the consumption of computational resources.
4. The automatic driving scene panorama segmentation method based on multi-mode fusion perception according to claim 1, characterized in that step B2 trains the semantic segmentation branch of the network with focal loss and cross-entropy loss, specifically: when computing the loss, the cross-entropy loss is used as the classification loss, and the predicted class probabilities are used to generate a penalty term that down-weights classes predicted with high confidence, forming a focal loss; this reduces the share of easily classified categories in the loss, improves learning on difficult categories, and alleviates the severe class imbalance of the automatic driving scene segmentation task.
5. The automatic driving scene panorama segmentation method based on multi-mode fusion perception according to claim 1, characterized in that the linear-complexity multi-head self-attention network in step C2 is specifically: a single multi-head self-attention operation of quadratic computational complexity is split into two, the first pass replacing the input query with a learned parameter and compressing the sequence to a constant length independent of the input length, and the second pass using the output of the first pass as key and value and the original input as query, restoring the constant-length sequence to the original input size so that the output is aligned with the input, which reduces the computational complexity from quadratic to linear while retaining the feature learning capability.
6. The automatic driving scene panorama segmentation method based on multi-mode fusion perception according to claim 1, characterized in that in step D2 a multi-layer perceptron learns the cosine similarity between each center point and the others, and center points whose semantic segmentation result is a large object are merged by thresholding to obtain the final instance target centers, specifically: after the two-dimensional image features of a center point are obtained, its three-dimensional features and two-dimensional image features are added together and passed through a multi-layer perceptron to obtain the multi-modal features of the center point; cosine similarity is computed between the multi-modal features of every pair of center points, center points whose similarity exceeds a threshold are regarded as the same instance, and a new center point computed as the mean of their coordinates is used as the final instance target center.
CN202310153288.5A 2023-02-23 2023-02-23 Automatic driving scene panoramic segmentation method based on multi-mode fusion perception Pending CN116129233A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310153288.5A CN116129233A (en) 2023-02-23 2023-02-23 Automatic driving scene panoramic segmentation method based on multi-mode fusion perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310153288.5A CN116129233A (en) 2023-02-23 2023-02-23 Automatic driving scene panoramic segmentation method based on multi-mode fusion perception

Publications (1)

Publication Number Publication Date
CN116129233A true CN116129233A (en) 2023-05-16

Family

ID=86300952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310153288.5A Pending CN116129233A (en) 2023-02-23 2023-02-23 Automatic driving scene panoramic segmentation method based on multi-mode fusion perception

Country Status (1)

Country Link
CN (1) CN116129233A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912488A (en) * 2023-06-14 2023-10-20 中国科学院自动化研究所 Three-dimensional panorama segmentation method and device based on multi-view camera
CN116912488B (en) * 2023-06-14 2024-02-13 中国科学院自动化研究所 Three-dimensional panorama segmentation method and device based on multi-view camera
CN117274619A (en) * 2023-11-21 2023-12-22 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Cross-domain target recognition method based on style fusion contrast learning
CN117274619B (en) * 2023-11-21 2024-02-09 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Cross-domain target recognition method based on style fusion contrast learning
CN117351213A (en) * 2023-12-06 2024-01-05 杭州蓝芯科技有限公司 Box body segmentation positioning method and system based on 3D vision
CN117351213B (en) * 2023-12-06 2024-03-05 杭州蓝芯科技有限公司 Box body segmentation positioning method and system based on 3D vision
CN117746518A (en) * 2024-02-19 2024-03-22 河海大学 Multi-mode feature fusion and classification method

Similar Documents

Publication Publication Date Title
CN111626217B (en) Target detection and tracking method based on two-dimensional picture and three-dimensional point cloud fusion
CN116129233A (en) Automatic driving scene panoramic segmentation method based on multi-mode fusion perception
CN111832655B (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN111563415B (en) Binocular vision-based three-dimensional target detection system and method
CN113128348B (en) Laser radar target detection method and system integrating semantic information
CN113111887B (en) Semantic segmentation method and system based on information fusion of camera and laser radar
CN110689008A (en) Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
CN111753698A (en) Multi-mode three-dimensional point cloud segmentation system and method
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN111027581A (en) 3D target detection method and system based on learnable codes
CN112541460B (en) Vehicle re-identification method and system
CN113312983A (en) Semantic segmentation method, system, device and medium based on multi-modal data fusion
Shi et al. An improved lightweight deep neural network with knowledge distillation for local feature extraction and visual localization using images and LiDAR point clouds
CN116486368A (en) Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene
CN112257668A (en) Main and auxiliary road judging method and device, electronic equipment and storage medium
WO2021026855A1 (en) Machine vision-based image processing method and device
CN112990049A (en) AEB emergency braking method and device for automatic driving of vehicle
CN116804744A (en) Target detection method based on vehicle-mounted laser radar and image information
CN116129234A (en) Attention-based 4D millimeter wave radar and vision fusion method
CN112233079B (en) Method and system for fusing images of multiple sensors
CN112818837B (en) Aerial photography vehicle weight recognition method based on attitude correction and difficult sample perception
CN114648549A (en) Traffic scene target detection and positioning method fusing vision and laser radar
CN112529917A (en) Three-dimensional target segmentation method, device, equipment and storage medium
Chen et al. Traffic accident detection based on deformable frustum proposal and adaptive space segmentation
CN117036895B (en) Multi-task environment sensing method based on point cloud fusion of camera and laser radar

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination