CN113569803A - Multi-mode data fusion lane target detection method and system based on multi-scale convolution


Info

Publication number: CN113569803A
Application number: CN202110921918.XA
Authority: CN (China)
Prior art keywords: lane, fusion, channel, data, target detection
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 张国英 (Zhang Guoying), 高鑫 (Gao Xin), 熊一瑾 (Xiong Yijin)
Current/Original Assignee: China University of Mining and Technology Beijing (CUMTB)
Filing date: 2021-08-12
Publication date: 2021-10-29
Classifications

    • G06F18/253 — Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Fusion techniques of extracted features
    • G06N3/045 — Physics; Computing; Computing arrangements based on biological models; Neural networks; Architecture; Combinations of networks
    • G06N3/08 — Physics; Computing; Computing arrangements based on biological models; Neural networks; Learning methods
    • G06T17/20 — Physics; Computing; Image data processing or generation; Three dimensional [3D] modelling; Finite element generation, e.g. wire-frame surface description, tesselation


Abstract

The invention discloses a multi-modal data fusion lane target detection method and system based on multi-scale convolution. First, the raw lane lidar point cloud data are preprocessed to obtain three-channel pseudo point cloud data aligned with the RGB image. The lane three-channel pseudo point cloud data are then fused with the corresponding RGB image data. In the convolution stage after this multi-modal fusion, the fused feature channels are recomputed with multi-scale convolution to correct their weights. Finally, the corrected lane multi-modal fusion data are input into a pre-established and trained lane target detection model, which predicts lane targets and yields the lane target detection result. The method improves the utilization of the advantageous features of different modalities, effectively improves the accuracy of the target detection model, offers good real-time performance, and is applicable to different target detection models.

Description

Multi-mode data fusion lane target detection method and system based on multi-scale convolution
Technical Field
The invention relates to the technical field of target detection, in particular to a multi-mode data fusion lane target detection method and system based on multi-scale convolution.
Background
In recent years, object detection has made remarkable progress in face recognition, image recognition, video recognition, automatic driving, and other fields. Object detection is particularly important in automatic driving: to ensure safety on the road, all pedestrians and vehicles on the road surface must be detected accurately. Pedestrians are the group at highest risk on the road and the most vulnerable in traffic accidents, so pedestrian recognition and pedestrian trajectory prediction are research hotspots in the current safe-driving field. Vehicles are the main agents in a driving scene, and detecting other vehicles running on the road is the most important part of obstacle detection, which is significant for safe driving, emergency collision avoidance, and the like.
The current mainstream approach to target detection in automatic driving operates on camera images; data from other sensors, such as lidar point clouds that carry depth information, are difficult to apply because of their high computational complexity. However, relying on camera images alone has serious limitations: illumination changes degrade the detection effect, detection accuracy is low when people and vehicles are dense, and targets whose color resembles the background are prone to false and missed detections. Target detection that relies on purely visual images loses depth information, is easily affected by environment and weather, and its accuracy is hard to improve in application scenarios while lacking robustness. By contrast, lidar point cloud data, in which every point carries depth information, have natural advantages for these problems, so multi-modal fusion for target detection in autonomous driving is both reasonable and necessary. The current multi-modal fusion schemes, however, have several disadvantages:
1. High computation cost: point cloud data have high computational complexity, and computing directly on the point cloud consumes a large amount of time and computing resources, severely hurting real-time performance and failing to meet the requirements of automatic driving.
2. The characteristics of multi-modal data are ignored: one feasible method is to project the depth information in the point cloud onto a two-dimensional plane and fuse the depth image with the camera image, but the camera image and the lidar point cloud are data of different modalities with large differences. Crudely placing the multi-modal data into one feature space for fusion can suppress their respective advantages and can even introduce noise.
3. The importance of the multi-modal data is not computed: in the feature extraction process of a deep learning network, some feature channels contribute greatly to the detection result while others contribute little, so fusing with a fixed ratio is unreasonable.
Disclosure of Invention
The invention aims to provide a multi-modal data fusion lane target detection method and system based on multi-scale convolution.
The purpose of the invention is realized by the following technical scheme:
A method of multi-modal data fusion lane target detection based on multi-scale convolution, the method comprising:
step 1, firstly, preprocessing the raw lidar point cloud data to obtain three-channel pseudo point cloud data aligned with an RGB image;
step 2, performing multi-modal data fusion on the three-channel pseudo point cloud data obtained in step 1 and the corresponding RGB image data, wherein in the multi-modal data fusion process the feature channels of the different modalities are weighted and the fusion weight of each feature channel is acquired adaptively according to its importance;
step 3, in the convolution stage after the multi-modal data fusion, recomputing the fused feature channels with multi-scale convolution so as to correct the weights of the fused feature channels;
and step 4, inputting the lane multi-modal fusion data corrected in step 3 into a pre-established and trained lane target detection model, and predicting lane targets to obtain the lane target detection result.
According to the technical solution provided by the invention, the method improves the utilization of the advantageous features of different modalities, effectively improves the accuracy of the lane target detection model, offers good real-time performance, and is applicable to different target detection models.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for multi-modal data fusion lane target detection based on multi-scale convolution according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a multi-modal data fusion process according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a multi-scale calibration process according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a system according to an embodiment of the present invention;
fig. 5 is a structural diagram of a lane object detection model according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them, and they do not limit the present invention. All other embodiments obtained by those skilled in the art from the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a method for multi-modal data fusion lane target detection based on multi-scale convolution according to an embodiment of the present invention, where the method includes:
step 1, firstly, preprocessing lane original laser radar point cloud data to obtain three-channel pseudo point cloud data aligned with an RGB image;
In this step, the raw lane lidar point cloud data are the original point cloud data collected by the lidar device, stored as a .bin file; the lane RGB image is a three-channel color image acquired by a monocular camera.
The preprocessing specifically comprises the following steps:
Because the raw lidar point cloud is sparse, target detection directly on the raw point cloud is not ideal, so the lidar point cloud is first completed with a K-nearest-neighbor algorithm; specifically, K-nearest-neighbor-search point cloud interpolation is applied with K = 2: the 3 nearest points of each blank pixel are found by the distance between points, and a weighted average according to the normalized pixel distance is computed as the value of the blank pixel, yielding dense point cloud data. This completion makes the point cloud represent the target object more clearly and improves detection accuracy to a certain extent.
The dense point cloud data are then projected as a front view onto the imaging plane of the monocular camera and aligned with the RGB image acquired by that camera, and the reflectivity of the dense point cloud data is obtained; the projection result is a single-channel image of the same size as the RGB image, whose pixel values are the reflectivity at the positions where the dense point cloud aligns with the RGB pixels.
The height information and the depth information of the dense point cloud data are projected in the same way to obtain two further single-channel images, and the three single-channel images are stacked along the channel dimension to obtain three-channel pseudo point cloud data aligned with the RGB image.
The upper half of the three-channel pseudo point cloud data can further be cropped to better reduce the influence of irrelevant background.
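For illustration only, a minimal Python/NumPy sketch of this preprocessing is given below. It is a sketch under assumptions, not the patent's implementation: the KITTI-style .bin layout (x, y, z, reflectance), the 3×4 camera projection matrix `P`, the function name, and the use of scipy's `cKDTree` are all our own choices. The patent states K = 2 while searching the 3 nearest points of each blank pixel; the sketch follows the 3-nearest-points reading.

```python
import numpy as np
from scipy.spatial import cKDTree

def lidar_to_pseudo_pointcloud(bin_path, P, img_h, img_w, k=3):
    """Project a lidar scan to a 3-channel (reflectance, height, depth)
    image aligned with the RGB frame, then densify blank pixels by
    K-nearest-neighbor interpolation, as the patent describes."""
    pts = np.fromfile(bin_path, dtype=np.float32).reshape(-1, 4)  # x,y,z,r (KITTI-style assumption)
    pts = pts[pts[:, 0] > 0]                       # keep points in front of the camera
    # homogeneous projection onto the image plane (P: assumed 3x4 camera matrix)
    uvw = (P @ np.c_[pts[:, :3], np.ones(len(pts))].T).T
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    keep = (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    u, v, pts = u[keep], v[keep], pts[keep]

    sparse = np.zeros((img_h, img_w, 3), np.float32)
    sparse[v, u, 0] = pts[:, 3]                    # reflectance channel
    sparse[v, u, 1] = pts[:, 2]                    # height channel
    sparse[v, u, 2] = pts[:, 0]                    # depth channel

    # densify blank pixels: inverse-distance weights, normalized (one reading
    # of the patent's "weighted average according to the normalized pixel distance")
    filled = np.argwhere(sparse.any(axis=2))
    blank = np.argwhere(~sparse.any(axis=2))
    dist, idx = cKDTree(filled).query(blank, k=k)  # k nearest filled pixels per blank pixel
    w = 1.0 / np.maximum(dist, 1e-6)
    w /= w.sum(axis=1, keepdims=True)
    vals = sparse[filled[:, 0], filled[:, 1]]      # (n_filled, 3) channel values
    sparse[blank[:, 0], blank[:, 1]] = np.einsum('nk,nkc->nc', w, vals[idx])
    return sparse                                  # H x W x 3 pseudo point cloud
```

Cropping the upper half of the result, as suggested above, is then a simple slice such as `pseudo[img_h // 2:]` (again an assumption about where the crop line lies).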
Step 2, performing multi-modal data fusion on the lane three-channel pseudo point cloud data obtained in step 1 and the corresponding RGB image data;
In the multi-modal data fusion process, the feature channels of the different modalities are weighted, and the fusion weight of each feature channel is acquired adaptively according to its importance;
fig. 2 is a schematic diagram of a process of multi-modal data fusion according to an embodiment of the present invention, where the specific process is as follows:
First, convolution is applied to the lane RGB image data and the three-channel pseudo point cloud data (Pseudo LiDAR in Fig. 2) with kernels of size 3×3 and 5×5 respectively, and all operations in the two branches keep the number of output channels consistent. Specifically, after the RGB image data and the three-channel pseudo point cloud data are input, primary low-dimensional features are first extracted from each; features are then extracted from the RGB image data with a 3×3 convolution kernel, while the three-channel pseudo point cloud data are processed again with a 5×5 convolution kernel so that the depth information is better utilized.
The output features of the two branches are then merged, each channel feature being the point-by-point sum of the two branches, and global average pooling is applied to further refine the features; the formula for global average pooling is:
$$s_c = F_{gp}(U_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j)$$
where $H$ and $W$ are the size parameters of a single feature channel, its height and width respectively; $U$ is the set of feature channels after the outputs of the two branches are merged; the $c$-th element of $s$ is computed by shrinking the merged two-branch output $U$ over the spatial extent $H \times W$; and $(i, j)$ are the coordinates of a feature point on the feature channel.
Feature mapping is then applied to the average-pooled features by a fully-connected operation, whose formula is:
$$z = F_{fc}(s) = \delta\big(B(Ws)\big)$$
where $\delta$ is the ReLU activation function; $B$ denotes a batch normalization operation; $W$ is a $d \times C$ dimensional weight matrix (in this patent $d$ = 8, and $C$ is the channel dimension before branch merging); and $s$ is the output of the global average pooling.
A softmax mapping then yields the weight of each channel, which is assigned to the feature channels of each modality's data; the probability values returned by the softmax function also represent the attention weights of the feature channels at the different spatial scales.
For all feature channels of a single branch, the feature values of each channel are multiplied by the attention weight at the corresponding position, giving the weighted single-branch result.
The two weighted branch results are then fused by an add operation, completing the multi-modal data fusion.
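The fusion just described can be read as a selective-kernel-style, two-branch channel attention. The PyTorch sketch below follows that reading, and is a sketch only: the 3×3/5×5 kernel split, the GAP → FC → softmax weight computation, the reduction dimension d = 8, and the final weighted add come from the text above, while the module name, channel counts, and layer layout are assumptions.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Two-branch fusion: 3x3 conv on RGB features, 5x5 conv on pseudo point
    cloud features, then per-channel attention weights computed from the
    merged features (GAP -> FC -> softmax) and a weighted add."""
    def __init__(self, channels: int, d: int = 8):
        super().__init__()
        self.rgb_conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.pcd_conv = nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.gap = nn.AdaptiveAvgPool2d(1)          # s_c = F_gp(U_c)
        self.fc = nn.Sequential(                     # z = delta(B(W s)), W: d x C
            nn.Linear(channels, d, bias=False),
            nn.BatchNorm1d(d), nn.ReLU(inplace=True))
        self.attn = nn.Linear(d, channels * 2)       # one weight vector per branch

    def forward(self, rgb_feat, pcd_feat):
        a, b = self.rgb_conv(rgb_feat), self.pcd_conv(pcd_feat)
        u = a + b                                    # point-by-point merge of the branches
        s = self.gap(u).flatten(1)                   # global average pooling
        z = self.fc(s)
        w = torch.softmax(self.attn(z).view(-1, 2, a.size(1)), dim=1)
        wa = w[:, 0, :, None, None]                  # attention weights, RGB branch
        wb = w[:, 1, :, None, None]                  # attention weights, pseudo point cloud branch
        return wa * a + wb * b                       # weighted add = fused features
```

For example, `MultiModalFusion(64)(torch.rand(2, 64, 96, 320), torch.rand(2, 64, 96, 320))` returns a fused tensor of the same (2, 64, 96, 320) shape.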
Step 3, in the convolution process after multi-modal data fusion, calculating the fusion characteristic channel again by using multi-scale convolution so as to correct the weight of the fusion characteristic channel;
In this step, to correct the weights of the fused feature channels and improve model accuracy, the embodiment of the present invention applies multi-scale convolution correction after the second ResNet block of the feature extraction stage and computes the feature-channel weights under 3 different receptive fields; Fig. 3 is a schematic diagram of the multi-scale correction process described in the embodiment of the present invention. Specifically:
A Split operation is performed on the multi-modal fused data, i.e., high-dimensional features are extracted through convolution kernels of 3 different sizes; channel weights are then computed for the high-dimensional features by Fuse and Select operations in turn, i.e., the weight values of all feature channels are corrected.
The Split operation uses three convolution kernels, 3×3, 5×5 and 7×7, to extract multi-path features.
The Fuse operation gates the multi-path features, passing the information they carry on to the next layer of neurons.
The Select operation performs a weighted summation of the Fuse output with the three weight matrices to obtain the output vector of each branch.
Finally, the output vectors of the three branches are summed by an add operation to obtain the correction weights of the feature channels.
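Correspondingly, a hedged PyTorch sketch of the Split/Fuse/Select correction is given below. The 3×3/5×5/7×7 kernels, the per-channel softmax over the three paths, and the final add follow the steps above; everything else (names, layer layout, reuse of GAP → FC as the Fuse descriptor) is an assumption.

```python
import torch
import torch.nn as nn

class MultiScaleCorrection(nn.Module):
    """Split: extract features with 3x3 / 5x5 / 7x7 kernels.
    Fuse: sum the paths, then GAP and FC to a compact descriptor.
    Select: per-channel softmax weights over the three paths; the
    weighted sum is the corrected fused feature."""
    def __init__(self, channels: int, d: int = 8):
        super().__init__()
        self.paths = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, k, padding=k // 2, bias=False),
                          nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
            for k in (3, 5, 7)])                      # Split: 3 receptive fields
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(nn.Linear(channels, d, bias=False),
                                nn.BatchNorm1d(d), nn.ReLU(inplace=True))
        self.attn = nn.Linear(d, channels * 3)        # one weight matrix per path

    def forward(self, x):
        feats = torch.stack([p(x) for p in self.paths], dim=1)  # (N, 3, C, H, W)
        u = feats.sum(dim=1)                                    # Fuse
        z = self.fc(self.gap(u).flatten(1))
        w = torch.softmax(self.attn(z).view(-1, 3, x.size(1)), dim=1)
        return (w[..., None, None] * feats).sum(dim=1)          # Select + add
```

For example, applying `MultiScaleCorrection(256)` after the second ResNet block (256 channels is our assumption for that stage) keeps the (N, 256, H, W) shape while re-weighting the fused channels.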
Through these two feature-channel weight calculations, during fusion and after it, the utilization of the advantageous features of the different modalities is greatly improved, which effectively improves the accuracy of the target detection model.
Fig. 5 shows the structure of a lane target detection model according to an embodiment of the present invention: on the Faster R-CNN framework, the classification and regression of lane targets are completed using an FPN and an RPN to obtain the lane target detection result.
Step 4, inputting the lane multi-modal fusion data corrected in step 3 into the pre-established and trained lane target detection model, and predicting lane targets to obtain the lane target detection result.
In this step, the weight-corrected lane multi-modal fusion data are sent in turn to the FPN and the RPN of the lane target detection model, and the RPN completes the classification and regression of lane targets from the 256-channel feature layers output by the last convolutional layer. The FPN (Feature Pyramid Network) detects targets of different scales: after feature information at different scales is extracted, a lateral-connection structure gives the feature maps of all scales high-level semantic information. The RPN (Region Proposal Network) generates candidate regions, combining anchor boxes of prior sizes to distinguish background from foreground and to bring the anchor boxes closer to the real targets.
Classification completes the category judgment of the lane target, regression frames the detected target with a bounding box (the finally obtained anchor box), and the classification and regression results together constitute the lane target detection result.
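For the detection stage, the sketch below wires a 6-channel fused input (RGB plus the three pseudo point cloud channels) into torchvision's Faster R-CNN with a ResNet-50 FPN backbone and RPN. The patent only specifies the Faster R-CNN/FPN/RPN combination; widening the first convolution and the normalization statistics to 6 channels, the class count, and the input size are our assumptions, and the code assumes torchvision ≥ 0.13.

```python
import torch
import torch.nn as nn
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Faster R-CNN with ResNet-50 FPN backbone and RPN, as in Fig. 5;
# 3 classes (e.g. background / pedestrian / vehicle) is an assumption.
model = fasterrcnn_resnet50_fpn(weights=None, weights_backbone=None, num_classes=3)

# Assumption: accept a 6-channel fused input instead of plain RGB.
model.backbone.body.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2,
                                      padding=3, bias=False)
model.transform.image_mean = [0.0] * 6   # neutral normalization for the sketch
model.transform.image_std = [1.0] * 6

model.eval()
with torch.no_grad():
    fused = [torch.rand(6, 384, 1280)]   # one fused lane frame (size illustrative)
    detections = model(fused)            # list of {'boxes', 'labels', 'scores'}
```

In training mode the same model is called as `model(images, targets)` and returns the classification and regression losses, which matches the classification/regression training described above.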
Based on the above method, an embodiment of the present invention further provides a system for multi-modal data fusion lane target detection based on multi-scale convolution. Fig. 4 is a schematic structural diagram of the system according to the embodiment of the present invention; the system includes:
the data preprocessing module, used for preprocessing the raw lidar point cloud data to obtain three-channel pseudo point cloud data aligned with the RGB image;
the multi-modal data fusion module, used for performing multi-modal data fusion on the three-channel pseudo point cloud data obtained by the data preprocessing module and the corresponding RGB image data; in the multi-modal data fusion process, the module weights the feature channels of the different modalities and acquires the fusion weight of each feature channel adaptively according to its importance;
the multi-scale correction module, used for recomputing the fused feature channels with multi-scale convolution in the convolution stage after the multi-modal data fusion module performs the fusion, so as to correct the weights of the fused feature channels;
and the lane target prediction module, used for inputting the multi-modal fusion data corrected by the multi-scale correction module into a pre-established and trained lane target detection model and predicting lane targets to obtain the lane target detection result.
The specific implementation process of each module in the system is described in the embodiment of the method.
It is noted that matters well known to those skilled in the art are not described in detail herein.
In summary, the method and system of the embodiment of the invention have the advantages that:
1. The method makes better use of the lidar point cloud data: by using the generated and cropped three-channel pseudo point cloud image information, it avoids the time consumption and resource occupation caused by computing on a large number of three-dimensional points;
2. The adopted fusion method computes the weights of all feature channels reasonably and effectively, improving the utilization of the advantageous features of the two modalities, a clear advantage over the existing fixed-fusion-weight schemes;
3. After high-dimensional features are extracted from the fused data, the feature-channel weights are computed again and the weights of the fused data's feature channels are corrected; the two rounds of feature-channel weight calculation strongly highlight the feature channels that contribute most to the detection result and improve the detection accuracy of the model.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims. The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.

Claims (6)

1. A multi-modal data fusion lane target detection method based on multi-scale convolution, characterized by comprising the following steps:
step 1, firstly, preprocessing the raw lidar point cloud data to obtain three-channel pseudo point cloud data aligned with an RGB image;
step 2, performing multi-modal data fusion on the three-channel pseudo point cloud data obtained in step 1 and the corresponding RGB image data, wherein in the multi-modal data fusion process the feature channels of the different modalities are weighted and the fusion weight of each feature channel is acquired adaptively according to its importance;
step 3, in the convolution stage after the multi-modal data fusion, recomputing the fused feature channels with multi-scale convolution so as to correct the weights of the fused feature channels;
and step 4, inputting the lane multi-modal fusion data corrected in step 3 into a pre-established and trained lane target detection model, and predicting lane targets to obtain the lane target detection result.
2. The method for multi-modal data fusion lane target detection based on multi-scale convolution according to claim 1, wherein the process of step 1 is specifically as follows:
firstly, completing the raw lidar point cloud data, specifically by applying K-nearest-neighbor-search point cloud interpolation with K = 2: the 3 nearest points of each blank pixel are found by the distance between points, and a weighted average according to the normalized pixel distance is computed as the value of the blank pixel, yielding dense point cloud data;
then projecting the dense point cloud data as a front view onto the imaging plane of a monocular camera, aligning it with the RGB image acquired by the monocular camera, and obtaining the reflectivity of the dense point cloud data;
and projecting the height information and the depth information of the dense point cloud data to obtain two further single-channel images, and stacking the three single-channel images to obtain three-channel pseudo point cloud data aligned with the RGB image.
3. The method for multi-modal data fusion lane target detection based on multi-scale convolution according to claim 1, wherein the process of step 2 is specifically as follows:
firstly, performing convolution on the RGB image data and the three-channel pseudo point cloud data with kernels of size 3×3 and 5×5 respectively, wherein all operations performed by the two branches keep the number of output channels consistent;
combining the output characteristics of the two branches, wherein each channel characteristic is formed by adding the two branches point by point;
then carrying out global average pooling to further refine the characteristics; wherein, the formula for global average pooling is as follows:
$$s_c = F_{gp}(U_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j)$$
wherein $H$ and $W$ are the size parameters of a single feature channel, respectively its height and width; $U$ is the set of feature channels after the outputs of the two branches are merged; the $c$-th element of $s$ is computed by shrinking the merged two-branch output $U$ over the spatial extent $H \times W$; and $(i, j)$ are the coordinates of a feature point on the feature channel;
and performing feature mapping on the average-pooled features by a fully-connected operation, wherein the formula of the fully-connected operation is:
$$z = F_{fc}(s) = \delta\big(B(Ws)\big)$$
wherein $\delta$ is the ReLU activation function; $B$ denotes a batch normalization operation; $W$ is a $d \times C$ dimensional weight matrix; and $s$ is the output of the global average pooling;
then mapping by using a softmax function to obtain the weight of each channel, and assigning the weight of the characteristic channel of each modal data; the probability value returned by the softmax function also represents the attention weight of the characteristic channel in different spatial scales;
multiplying the characteristic value of all characteristic channels of a single branch by the attention weight of the corresponding position to obtain a single branch result after weighted calculation;
and fusing the two branch results by add operation to realize multi-mode data fusion.
4. The method for multi-modal data fusion lane target detection based on multi-scale convolution according to claim 1, wherein the process of step 3 is specifically as follows:
performing Split operation on the data after multi-modal fusion, namely extracting high-dimensional features through convolution kernels with 3 different sizes; calculating channel weights for the high-dimensional features by sequentially using Fuse and Select operations, namely correcting the weight values of all the feature channels;
the Split operation uses three convolution kernels, 3 × 3, 5 × 5 and 7 × 7, to extract multi-path features;
the Fuse operation gates the multi-path features, passing the information they carry on to the next layer of neurons;
performing weighted summation operation on the output of the Fuse operation and the three weight matrixes by Select operation to obtain output vectors of all branches;
and finally, adding the output vectors of the three branches by add operation to obtain the correction weight of the characteristic channel.
5. The method for multi-modal data fusion lane target detection based on multi-scale convolution according to claim 1, wherein in step 4, lane multi-modal fusion data after weight correction is sequentially sent to the FPN and RPN of the lane target detection model, and the RPN completes classification and regression of lane targets according to the 256-channel feature layers output by the last layer of convolution layer;
and the classification finishes the classification judgment of the lane target, the regression uses a boundary frame to frame the detected target, and the classification and regression results are the lane target detection results.
6. A system for multi-modal data fusion lane target detection based on multi-scale convolution, the system comprising:
the lane original laser radar point cloud data preprocessing module is used for acquiring three-channel pseudo point cloud data aligned with the lane RGB image;
the lane multi-mode data fusion module is used for performing multi-mode data fusion on the lane three-channel pseudo-point cloud data and the corresponding RGB image; carrying out weight assignment on the feature channels of different modal data, and acquiring fusion weights of all the feature channels in a self-adaptive manner according to the importance degree;
the multi-scale correction module is used for calculating the fused characteristic channel again by using multi-scale convolution so as to correct the weight of the fused characteristic channel;
and the lane target prediction module is used for inputting the multi-modal fusion data corrected by the multi-scale correction module into a pre-established and trained lane target detection model and predicting lane targets to obtain a lane target detection result.
Application CN202110921918.XA (priority and filing date 2021-08-12): Multi-mode data fusion lane target detection method and system based on multi-scale convolution — CN113569803A (en), pending.

Priority Applications (1)

CN202110921918.XA — Multi-mode data fusion lane target detection method and system based on multi-scale convolution (priority date 2021-08-12, filing date 2021-08-12).


Publications (1)

CN113569803A — publication date 2021-10-29.

Family

ID=78171312

Family Applications (1)

CN202110921918.XA (CN113569803A, pending) — Multi-mode data fusion lane target detection method and system based on multi-scale convolution.

Country Status (1)

Country: CN — CN113569803A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842313A (en) * 2022-05-10 2022-08-02 北京易航远智科技有限公司 Target detection method and device based on pseudo-point cloud, electronic equipment and storage medium
CN114842313B (en) * 2022-05-10 2024-05-31 北京易航远智科技有限公司 Target detection method and device based on pseudo point cloud, electronic equipment and storage medium


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200160559A1 (en) * 2018-11-16 2020-05-21 Uatc, Llc Multi-Task Multi-Sensor Fusion for Three-Dimensional Object Detection
CN112572325A (en) * 2019-09-30 2021-03-30 福特全球技术公司 Adaptive sensor fusion
CN111967373A (en) * 2020-08-14 2020-11-20 东南大学 Self-adaptive enhanced fusion real-time instance segmentation method based on camera and laser radar
CN112731436A (en) * 2020-12-17 2021-04-30 浙江大学 Multi-mode data fusion travelable area detection method based on point cloud up-sampling

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Yonglin Tian: "Adaptive and Azimuth-Aware Fusion Network of Multimodal Local Features for 3D Object Detection", arXiv:1910.04392, https://arxiv.org/abs/1910.04392 *
Key Laboratory of Agricultural Land Quality and Monitoring, Ministry of Natural Resources (自然资源部农用地质量与监控重点实验室), ed.: "China Agricultural Land Quality Development Research Report, 2019 Edition" (《中国农用地质量发展研究报告 2019版》), 30 April 2020
Chen Ying (陈莹): "模态自适应权值学习机制下的…" [title truncated in the source; roughly "Under a modality-adaptive weight learning mechanism…"], Optics and Precision Engineering (《光学精密工程》) *


Similar Documents

Publication — Title
CN110929692B (en) Three-dimensional target detection method and device based on multi-sensor information fusion
CN110175576B (en) Driving vehicle visual detection method combining laser point cloud data
CN108983219B (en) Fusion method and system for image information and radar information of traffic scene
EP4152204A1 (en) Lane line detection method, and related apparatus
CN113111887B (en) Semantic segmentation method and system based on information fusion of camera and laser radar
CN110738121A (en) front vehicle detection method and detection system
US20230213643A1 (en) Camera-radar sensor fusion using local attention mechanism
CN116685874A (en) Camera-laser radar fusion object detection system and method
JP2016009487A (en) Sensor system for determining distance information on the basis of stereoscopic image
CN114495064A (en) Monocular depth estimation-based vehicle surrounding obstacle early warning method
CN112825192A (en) Object identification system and method based on machine learning
CN117058646B (en) Complex road target detection method based on multi-mode fusion aerial view
CN113643345A (en) Multi-view road intelligent identification method based on double-light fusion
CN117274749B (en) Fused 3D target detection method based on 4D millimeter wave radar and image
CN116830164A (en) LiDAR decorrelated object detection system and method
Dewangan et al. Towards the design of vision-based intelligent vehicle system: methodologies and challenges
CN115937819A (en) Three-dimensional target detection method and system based on multi-mode fusion
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
Kühnl et al. Visual ego-vehicle lane assignment using spatial ray features
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN116704304A (en) Multi-mode fusion target detection method of mixed attention mechanism
Jie et al. Llformer: An efficient and real-time lidar lane detection method based on transformer
CN113255779A (en) Multi-source perception data fusion identification method and system and computer readable storage medium
CN117808689A (en) Depth complement method based on fusion of millimeter wave radar and camera
CN112529011A (en) Target detection method and related device

Legal Events

Code — Description
PB01 — Publication
SE01 — Entry into force of request for substantive examination
RJ01 — Rejection of invention patent application after publication

Application publication date: 2021-10-29