CN116129234A - Attention-based 4D millimeter wave radar and vision fusion method - Google Patents

Info

Publication number
CN116129234A
Authority
CN
China
Prior art keywords
features
image
feature
bev
radar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310237553.8A
Other languages
Chinese (zh)
Inventor
彭树生
刁天涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology
Priority to CN202310237553.8A
Publication of CN116129234A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses an attention-based 4D millimeter wave radar and vision fusion method, which comprises the following steps: extracting radar backbone features from the 4D millimeter wave radar point cloud data in voxel format to obtain the radar features in BEV space; extracting backbone features from the image data to obtain the 2D features of the image; projecting the 2D image features through a view projection module, densely predicting depth by classification, and obtaining the image features in BEV space from the predicted image depth and the camera extrinsic parameters; and finally, fusing the millimeter wave radar and visual features at the feature level through a designed fusion module, using an attention mechanism to reasonably allocate the weights of the radar and visual features. The invention solves the problems that 4D millimeter wave radar and vision depend on each other and that their weights are difficult to allocate.

Description

Attention-based 4D millimeter wave radar and vision fusion method
Technical Field
The invention relates to the technical field of radar and vision fusion, and in particular to an attention-based 4D millimeter wave radar and vision fusion method.
Background
Millimeter wave radar and computer vision technology are widely used in fields such as autonomous driving, security, and intelligent transportation. Millimeter wave radar has strong penetrating power and is largely unaffected by illumination or by weather such as rain and snow, but it cannot provide high-precision target recognition and tracking information. In contrast, computer vision provides richer target information but is more strongly affected by illumination, weather, and similar factors.
In general, radar and vision fusion strategies fall into three categories: decision-level fusion (commonly called post-fusion), feature-level fusion (mid-level fusion), and data-level fusion (pre-fusion). Decision-level fusion combines the final output of the radar-based model, such as 3D bounding boxes, with the output of the visual detector, such as 2D bounding boxes, through a filtering algorithm. Feature-level fusion projects the final result of one modality onto the deep-learning feature maps of the other modality and then performs information fusion with a subsequent fusion network. Data-level fusion directly fuses the raw data of the two modalities and then outputs the final result with a single neural network.
Each fusion strategy has advantages and disadvantages, but post-fusion is the one commonly used in industry because it is flexible and more robust: the outputs of the different modalities are integrated through manually designed algorithms and rules, and different modalities can be given different priorities under different conditions, so the failure of a single sensor is handled more gracefully. However, post-fusion also has many drawbacks: the information is not fully exploited, the system pipeline becomes more complex (and the longer the pipeline, the more easily problems arise), and the maintenance cost grows as rules accumulate. Academia favors pre-fusion schemes, which better exploit the end-to-end nature of neural networks. However, current pre-fusion schemes can hardly be deployed directly, because their robustness is not yet considered sufficient for practical requirements; in particular, when the radar signal is faulty, they can barely cope.
In a practical environment, the following problems are encountered:
(1) Inaccurate radar and camera extrinsic parameters: calibration errors, or bumping and shaking while the vehicle is driving, make the extrinsic parameters inaccurate, so directly projecting the point cloud onto the image produces deviations.
(2) Camera noise: lens smudges, frozen frames, or even damage to a camera mean that a point projected onto the image cannot find the corresponding feature or picks up the wrong feature.
(3) Radar noise: besides the occlusion caused by dirt, the characteristics of the radar itself cause return points to be missing for some low-reflectivity objects.
Some methods, such as deep fusion, already provide some tolerance to problems (1) and (2), but they are ineffective against the missing point clouds caused by the radar noise of problem (3): because these methods all query image features through point cloud coordinates, they fail as soon as the point cloud is missing.
Disclosure of Invention
In order to solve the above problems in the prior art, the invention aims to provide an attention-based 4D millimeter wave radar and vision fusion perception method.
The invention is implemented in the following steps. An attention-based 4D millimeter wave radar and vision fusion method comprises the following steps:
Step 1: radar feature extraction: extract radar backbone features from the 4D millimeter wave radar data in voxel format to obtain the BEV features of the radar data;
Step 2: image feature extraction: extract backbone features from the image data to obtain the 2D features of the image, and project each image feature pixel back into 3D space to form the BEV features of the image;
Step 3: feature fusion: introduce an attention mechanism and perform attention-encoded fusion of the radar and visual features obtained in step 1 and step 2 in the BEV space to obtain comprehensive target information;
Step 4: target detection: perform target detection using the fused BEV feature information.
Preferably, in step 1, radar backbone features are extracted from the 4D millimeter wave radar data in voxel format, and the step of obtaining the radar feature data comprises:
selecting voxels as the input of the point cloud BEV feature extraction network, where the BEV feature extraction network takes VoxelNet as the backbone and adds a feature pyramid network. The backbone divides the three-dimensional point cloud into a certain number of voxels and, after random sampling and normalization of the points, performs local feature extraction on each non-empty voxel with several voxel feature encoding layers to obtain voxel-level features; a 3D convolution module then further abstracts the features (enlarging the receptive field and learning a geometric spatial representation) to obtain the BEV features of the point cloud. The feature pyramid network further refines the BEV features: a bottom-up path extracts features from the BEV features and a top-down path combines and refines them, producing feature maps at different resolutions that all contain the semantic information of the original deepest feature map.
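For illustration only, the following is a minimal sketch of one voxel feature encoding layer of the kind referred to above, assuming the point cloud has already been grouped into voxels padded to a fixed number of points per voxel; the tensor shapes, layer sizes, and names are assumptions, not part of the claimed method.

```python
# Minimal sketch of one voxel feature encoding (VFE) layer, assuming voxels are
# already grouped into a dense tensor `points` of shape
# (num_voxels, max_points_per_voxel, point_dim) with a boolean `mask` marking
# the valid (non-padded) points in each voxel.
import torch
import torch.nn as nn


class VFELayer(nn.Module):
    """Point-wise MLP followed by per-voxel max pooling; the pooled voxel
    feature is concatenated back onto every point feature."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim // 2)
        self.norm = nn.LayerNorm(out_dim // 2)

    def forward(self, points: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # points: (V, T, C_in), mask: (V, T) with True for real points
        x = torch.relu(self.norm(self.linear(points)))       # (V, T, C/2)
        x = x * mask.unsqueeze(-1)                            # zero out padded points
        voxel_feat = x.max(dim=1, keepdim=True).values        # (V, 1, C/2) per-voxel summary
        x = torch.cat([x, voxel_feat.expand_as(x)], dim=-1)   # (V, T, C)
        return x


if __name__ == "__main__":
    pts = torch.randn(128, 32, 7)        # 128 voxels, up to 32 points, 7 features per point
    msk = torch.rand(128, 32) > 0.3
    feat = VFELayer(7, 64)(pts, msk)
    print(feat.shape)                    # torch.Size([128, 32, 64])
```

Stacking several such layers and max-pooling over the point dimension yields one feature vector per voxel, which the 3D convolution module can then abstract further.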
Preferably, in step 2, the specific step of extracting the image backbone features of the image data to obtain the 2D features of the image comprises:
taking a Swin Transformer as the backbone network and adding a feature pyramid network layer to obtain the 2D features of the image, then explicitly estimating the depth information of the image through a view projection module to complete the construction of the BEV view of the image. The backbone model adopts a hierarchical structure with four stages in total; each stage reduces the resolution of the input feature map and gradually enlarges the receptive field. At the input, patch embedding divides the image into several small patches and embeds them; each stage then consists of two parts, a patch merging layer and a Swin Transformer block.
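As an illustrative sketch only, the following shows the hierarchical downsampling bookkeeping described above (patch embedding at the input, then a patch merging step that halves the spatial resolution and doubles the channel width); the window attention blocks are omitted, and all sizes are assumed values rather than the patent's.

```python
# Minimal sketch of patch embedding and patch merging in a Swin-style backbone.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    def __init__(self, in_ch=3, embed_dim=96, patch=4):
        super().__init__()
        # Non-overlapping patches: a strided conv is equivalent to splitting the
        # image into patch x patch blocks and linearly projecting them.
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):                        # (B, 3, H, W)
        return self.proj(x)                      # (B, C, H/4, W/4)


class PatchMerging(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # (B, C, H, W) with even H, W
        # Gather each 2x2 neighbourhood onto the channel axis, then project.
        x0, x1 = x[:, :, 0::2, 0::2], x[:, :, 1::2, 0::2]
        x2, x3 = x[:, :, 0::2, 1::2], x[:, :, 1::2, 1::2]
        x = torch.cat([x0, x1, x2, x3], dim=1)   # (B, 4C, H/2, W/2)
        x = x.permute(0, 2, 3, 1)                # channels-last for the linear layer
        x = self.reduction(x)                    # (B, H/2, W/2, 2C)
        return x.permute(0, 3, 1, 2)             # (B, 2C, H/2, W/2)


if __name__ == "__main__":
    img = torch.randn(1, 3, 224, 224)
    f = PatchEmbed()(img)                        # (1, 96, 56, 56)
    f = PatchMerging(96)(f)                      # (1, 192, 28, 28)
    print(f.shape)
```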
Preferably, each image feature pixel is projected back into 3D space, and the specific process of constructing the BEV features of the image is:
a discrete set of depth values is generated for each pixel of the image. The depth values are generated as follows: within the viewing frustum from 1 m to 60 m in front of the camera, a candidate depth value is placed every 1 m, and N points are sampled along this ray. The depth information of each feature point is predicted and represented, after a softmax, as a D-dimensional vector, where D is the number of candidate depths spaced 1 m apart within the 1 m to 60 m range. The depth information obtained for each pixel is used to weight the image features at the same position, generating a pseudo point cloud shaped like a truncated pyramid. The camera extrinsic and intrinsic parameters are used to transform the previously obtained frustum features into coordinates in 3D space, after which the features are flattened. The specific process is: the range of the BEV view defines the size of each grid cell, and the features projected into a given cell are aggregated into that cell. Since multiple features may fall into the same cell in the top-down view, the image point cloud is quantized along the x and y dimensions with a fixed step size, the features in each BEV cell are aggregated by a BEV pooling operation, and the features are unfolded along the z-axis.
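A minimal sketch of the lifting step described above follows, under the assumption that the per-pixel depth distribution is predicted by a small convolutional head and combined with the image feature by an outer product; the layer names, channel counts, and the bin count used in the example are illustrative.

```python
# Minimal sketch of "lifting" image features over D candidate depths: a head
# predicts a softmax depth distribution, and the outer product of that
# distribution with the image feature produces the frustum-shaped pseudo
# point cloud of features.
import torch
import torch.nn as nn


class DepthLift(nn.Module):
    def __init__(self, in_ch: int, feat_ch: int, num_depth_bins: int):
        super().__init__()
        self.depth_head = nn.Conv2d(in_ch, num_depth_bins, kernel_size=1)
        self.feat_head = nn.Conv2d(in_ch, feat_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C_in, H, W) image features from the backbone + FPN
        depth = self.depth_head(x).softmax(dim=1)          # (B, D, H, W) depth probabilities
        feat = self.feat_head(x)                           # (B, C, H, W)
        # Outer product: each feature vector is spread over the D depth bins,
        # weighted by the predicted probability of that depth.
        frustum = depth.unsqueeze(1) * feat.unsqueeze(2)   # (B, C, D, H, W)
        return frustum


if __name__ == "__main__":
    lift = DepthLift(in_ch=256, feat_ch=80, num_depth_bins=61)  # 61 bins as in the embodiment
    out = lift(torch.randn(2, 256, 32, 88))
    print(out.shape)    # torch.Size([2, 80, 61, 32, 88])
```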
Preferably, in step 3, the radar and image features are fused, and the fusion step comprises:
spatially, first concatenating the radar and image features along the channel dimension, applying global max pooling and global average pooling respectively, passing each result through a 3×3 convolution kernel, and then applying a sigmoid activation to generate the final spatial attention map, which weights every pixel of the radar and image BEV features;
along the channel dimension, concatenating the radar features and the spatially attended image features, obtaining the channel weights through average pooling, a 3×3 convolution kernel and a sigmoid operation, and multiplying the concatenated features by these weights to obtain the final fused features.
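The following is a minimal sketch of such a spatial-then-channel attention fusion over the two BEV maps, assuming CBAM-style pooling statistics; the kernel sizes and channel counts are illustrative choices (for instance, a 1×1 convolution is used in the channel branch here), not the patent's exact configuration.

```python
# Minimal sketch of attention-based fusion of radar and camera BEV features:
# a spatial attention map re-weights every BEV cell, then channel attention
# re-weights the concatenated channels.
import torch
import torch.nn as nn


class BEVAttentionFusion(nn.Module):
    def __init__(self, radar_ch: int, img_ch: int):
        super().__init__()
        fused_ch = radar_ch + img_ch
        # Spatial attention: 2 pooled maps -> 1 attention value per BEV cell.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)
        # Channel attention: squeeze to one weight per channel.
        self.channel_conv = nn.Conv2d(fused_ch, fused_ch, kernel_size=1)

    def forward(self, radar_bev: torch.Tensor, img_bev: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([radar_bev, img_bev], dim=1)           # (B, Cr+Ci, H, W)

        # ---- spatial attention over the BEV grid ----
        max_map = fused.max(dim=1, keepdim=True).values          # (B, 1, H, W)
        avg_map = fused.mean(dim=1, keepdim=True)                # (B, 1, H, W)
        spatial = torch.sigmoid(self.spatial_conv(torch.cat([max_map, avg_map], dim=1)))
        radar_bev = radar_bev * spatial                          # weight every BEV pixel
        img_bev = img_bev * spatial

        # ---- channel attention over the re-weighted, concatenated features ----
        fused = torch.cat([radar_bev, img_bev], dim=1)
        squeeze = fused.mean(dim=(2, 3), keepdim=True)           # global average pooling
        channel = torch.sigmoid(self.channel_conv(squeeze))      # (B, Cr+Ci, 1, 1)
        return fused * channel


if __name__ == "__main__":
    fuse = BEVAttentionFusion(radar_ch=128, img_ch=80)
    out = fuse(torch.randn(2, 128, 180, 180), torch.randn(2, 80, 180, 180))
    print(out.shape)    # torch.Size([2, 208, 180, 180])
```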
Compared with the prior art, the invention has the notable advantage that it unifies the multi-modal features in a shared bird's-eye-view (BEV) representation space, which preserves geometric and semantic information well and solves the problems of mutual dependence and difficult weight allocation when fusing 4D millimeter wave radar and vision.
Drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a schematic diagram of a picture feature encoding module according to the present invention.
Fig. 3 is a schematic diagram of a feature fusion module of the present invention.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings. The following examples will help those skilled in the art to further understand the invention, but are not intended to limit it in any way. It should be noted that variations and modifications can be made by those skilled in the art without departing from the inventive concept; these all fall within the scope of the invention.
As shown in Fig. 1, an attention-based 4D millimeter wave radar and vision fusion method specifically comprises the following steps:
step 1: and (3) radar feature extraction: extracting radar trunks of the 4D millimeter wave radar data by adopting a voxel format to obtain BEV characteristics of the radar data;
when the main features of the radar are extracted, the point cloud of the millimeter wave radar is sparse, and the feature extraction method of dense point cloud is difficult to directly use. In addition, most point-level feature extraction methods only can be used for fusing the features of local information, and the relevance between the local information and the whole information is not strong. Therefore, in the field of autopilot, point-level features are not directly used for 3D object detection tasks. For the 4D millimeter wave radar, the point cloud is also a three-dimensional point cloud, so the feature extraction mode of the laser radar point cloud is also applicable to the 4D millimeter wave radar, and voxel-based feature expression is selected: voxel. The network design is mainly based on VoxelNet and is added with an FPN (feature pyramid network). The method comprises the steps that a backbone network divides a three-dimensional point cloud into a certain number of voxels, after random sampling and normalization of points, local feature extraction is carried out on each non-empty voxel by using a plurality of voxel feature coding layers to obtain voxel-level features, and then the features are further abstracted (the receptive field is increased and geometric space representation is learned) through a 3D convolution module to obtain BEV features of the point cloud; further refinement of BEV features via the feature pyramid network results in feature maps of different resolutions that all contain semantic information of the original deepest feature map by extracting features from BEV features using a bottom-up path and combining and refining the features using a top-down path.
Through practical tests, the radar point characteristics can be effectively extracted through the method.
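For illustration, a minimal sketch of the top-down feature pyramid refinement described above is given below; the number of pyramid levels and the channel counts are assumptions, and the merge uses simple nearest-neighbour upsampling plus addition.

```python
# Minimal sketch of an FPN over multi-resolution BEV maps: lateral 1x1 convs,
# a top-down path that upsamples the deeper map and adds it in, and a 3x3
# smoothing conv per output level.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BEVFPN(nn.Module):
    def __init__(self, in_channels=(64, 128, 256), out_ch=128):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1) for _ in in_channels])

    def forward(self, feats):
        # feats: list of BEV maps from shallow (high-res) to deep (low-res)
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down: start from the deepest map and merge it into shallower ones,
        # so every level inherits the semantics of the deepest feature map.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [s(x) for s, x in zip(self.smooth, laterals)]


if __name__ == "__main__":
    maps = [torch.randn(1, 64, 200, 200), torch.randn(1, 128, 100, 100), torch.randn(1, 256, 50, 50)]
    outs = BEVFPN()(maps)
    print([o.shape for o in outs])
```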
Step 2: extracting image features: extracting main features of the image data to obtain 2D features of the image, and projecting each picture feature pixel back into a 3D space through projection to form BEV features of the image;
when the main features of the image are extracted, the calculated amount of the high-resolution and pixel-rich image, whether the image is a transducer or a CNN (convolutional neural network), is large, and the later-stage calculation force requirement is high. Thus, a Swin transducer with hierarchical design is used, including sliding window operations. The network design is mainly based on a Swin Transformer and an FPN layer is added, and the 2d characteristic of the image can be effectively extracted. In order to obtain the characteristics of the image in the BEV space, a view projection module is designed, and as shown in fig. 2, the module explicitly estimates the depth information of the image, and the construction of the BEV view angle of the image is completed. For the obtained 2D image features, each feature pixel is dispersed to D discrete points along the camera ray, the depth probability distribution of each pixel in the image is predicted, relevant features are scaled according to the corresponding depth probability, an image feature point cloud is obtained, then the image feature point cloud is quantized along the x and y dimensions by using a fixed step size, features are aggregated in each BEV grid by using BEV pooling operation, and then the features are unfolded along the z axis.
The whole model of the backbone network adopts a layered structure, and has 4 stages in total, the resolution of the input feature map can be reduced in each stage, the receptive field is gradually enlarged, and the effect of reducing the calculated amount is achieved. At the beginning of the input, patch editing is performed, the image is divided into several small blocks and embedded in editing. Each stage contains two parts, namely a Patch merge (except that the first block is a linear layer) and a Swin transducer module; for the view projection module, a discrete set of depth values is first generated for each pixel of the image, so that the network can select the appropriate depth itself during model training. The method for generating depth values for pixels is that in a view cone 1 m to 60 m from a camera, there is an optional depth value every 1 m (thus 61 optional discrete depth values for each pixel), so N points can be sampled on this straight line, then the network needs to predict the depth information (distribution over depth) of this feature point, and the depth information is represented by a D-dimensional vector through softmax, D represents a distance in the range of 1 m to 60 m, that is, d=61, so that each position on D represents a probability value of the pixel in this depth range. By defining the range of BEV viewing angles, the size of each grid is defined, and the features projected to the corresponding grid are summarized into one grid. There may be multiple features in the same grid in the top view, the image point cloud is quantized along the x, y dimensions using a fixed step size, the features are aggregated in each BEV grid using a BEV pooling operation, and the features are expanded along the z-axis.
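A minimal sketch of the BEV pooling step described above follows, under the assumption that the frustum feature points have already been transformed into ego-frame coordinates with the camera intrinsics and extrinsics; the grid extents and step size are illustrative, and sum aggregation is used as the scatter operation.

```python
# Minimal sketch of BEV pooling: frustum feature points with ego-frame (x, y)
# coordinates are quantized into a fixed grid, and all features falling into
# the same BEV cell are accumulated with a scatter-add.
import torch


def bev_pool(points_xy: torch.Tensor, feats: torch.Tensor,
             x_range=(-51.2, 51.2), y_range=(-51.2, 51.2), step=0.8) -> torch.Tensor:
    """points_xy: (N, 2) ego-frame coordinates, feats: (N, C)."""
    nx = int((x_range[1] - x_range[0]) / step)
    ny = int((y_range[1] - y_range[0]) / step)

    ix = torch.floor((points_xy[:, 0] - x_range[0]) / step).long()
    iy = torch.floor((points_xy[:, 1] - y_range[0]) / step).long()
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)   # drop points outside the BEV range
    ix, iy, feats = ix[keep], iy[keep], feats[keep]

    grid = feats.new_zeros(nx * ny, feats.shape[1])
    grid.index_add_(0, ix * ny + iy, feats)                # aggregate per cell
    return grid.view(nx, ny, -1).permute(2, 0, 1)          # (C, nx, ny) BEV feature map


if __name__ == "__main__":
    pts = torch.rand(10000, 2) * 100 - 50
    f = torch.randn(10000, 80)
    print(bev_pool(pts, f).shape)    # torch.Size([80, 128, 128])
```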
Step 3: feature fusion: through introducing an attention mechanism, performing attention coding fusion on the radar and visual data obtained in the step 1 and the step 2 in the BEV space to obtain comprehensive target information;
when feature fusion is performed, as shown in fig. 3, in order to more effectively extract important features from the point cloud features and the image features, the thought of spatial attention is applied, the feature mapping and the correlation degree of the target detection task can be selectively combined to generate an attention mapping, the relative importance of the image and the point cloud data is reflected, and more important point cloud features and image features can be extracted according to the attention mapping. To effectively fuse the BEV features of the camera and radar, a channel attention extraction approach is used. For two characteristics of different channel numbers, the two characteristics are overlapped and connected, and are fused with the learnable static weights, and important channels are selected in a channel attention extraction mode, so that more important fusion characteristics can be obtained.
Step 4: and (3) target detection: and performing target detection by using the fused BEV characteristic information.
And finally, inputting the fusion characteristic into a detection head based on a transducer to obtain a final target detection result.
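Purely as an illustration of one possible Transformer-based detection head on the fused BEV features, the sketch below uses a small set of learnable object queries that cross-attend to the flattened BEV map and are decoded into class scores and box parameters; the query count, decoder depth, and box parameterization are assumptions and do not come from the patent.

```python
# Minimal sketch of a query-based detection head over a fused BEV feature map.
import torch
import torch.nn as nn


class BEVTransformerHead(nn.Module):
    def __init__(self, bev_ch=208, d_model=256, num_queries=200, num_classes=10):
        super().__init__()
        self.input_proj = nn.Conv2d(bev_ch, d_model, 1)
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.cls_head = nn.Linear(d_model, num_classes)
        self.box_head = nn.Linear(d_model, 9)   # e.g. (x, y, z, w, l, h, sin, cos, velocity)

    def forward(self, bev: torch.Tensor):
        # bev: (B, C, H, W) fused BEV features
        mem = self.input_proj(bev).flatten(2).transpose(1, 2)     # (B, H*W, d_model)
        q = self.queries.unsqueeze(0).expand(bev.shape[0], -1, -1)
        hs = self.decoder(q, mem)                                 # (B, num_queries, d_model)
        return self.cls_head(hs), self.box_head(hs)


if __name__ == "__main__":
    head = BEVTransformerHead()
    cls, box = head(torch.randn(1, 208, 100, 100))
    print(cls.shape, box.shape)   # (1, 200, 10) (1, 200, 9)
```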
The invention is an attention-based BEV feature-level fusion method: modality-specific backbone networks extract the features of the image and of the 4D millimeter wave radar respectively, the features are converted into a unified BEV representation and fused through an attention mechanism, and the detection head finally outputs the target detection result.
In the invention, the radar point cloud processing and the image processing are carried out independently; each is encoded by a neural network, projected into the unified BEV space, and then fused in that BEV space. In this case there is no primary or secondary dependence between radar and vision, so a flexibility close to that of post-fusion is achieved: a single modality can perform the task on its own, adding further modalities greatly improves performance, and when one modality is missing or noisy the whole system does not produce a destructive result. By introducing an attention mechanism, the method also realizes adaptive fusion of radar and visual data and improves the accuracy of perception tasks (such as target tracking and recognition).
The invention is described in further detail below in connection with specific embodiments.
Example 1
Radar feature extraction: extract radar backbone features from the 4D millimeter wave radar data in voxel format to obtain the BEV features of the radar data.
Image feature extraction: extract the image backbone features from the image data to obtain the 2D features of the image; by predicting the depth probability distribution of each pixel in the image, spread each feature pixel over D discrete points along the camera ray and scale the relevant features by the corresponding depth probabilities to obtain an image feature point cloud; then quantize the image point cloud along the x and y dimensions, aggregate the features in each BEV grid cell with a BEV pooling operation, and unfold the features along the z-axis to obtain the image features in BEV space.
Feature fusion: weight every pixel of the radar and image BEV features spatially, balancing the importance of the radar and image features; then, along the channel dimension, concatenate the extracted radar features and image features, apply channel attention learning to the concatenated channels, and reasonably allocate the weights of the two sensors' features, further improving detection accuracy.
Target detection: feed the fused features into a 3D target detection head to obtain the 3D target detection result.
Parts or structures of the invention that are not specifically described may be existing technologies or existing products and are not described here. The foregoing description is only illustrative of the invention and is not intended to limit its scope; all equivalent structures or equivalent processes, and all direct or indirect applications in other related fields, are included in the scope of the invention.

Claims (5)

1. An attention-based 4D millimeter wave radar and vision fusion method, characterized by comprising the following steps:
step 1: radar feature extraction: extracting radar backbone features from the 4D millimeter wave radar data in voxel format to obtain the BEV features of the radar data;
step 2: image feature extraction: extracting backbone features from the image data to obtain the 2D features of the image, and projecting each image feature pixel back into 3D space to form the BEV features of the image;
step 3: feature fusion: introducing an attention mechanism and performing attention-encoded fusion of the radar and visual features obtained in step 1 and step 2 in the BEV space to obtain comprehensive target information;
step 4: target detection: performing target detection using the fused BEV feature information.
2. The attention-based 4D millimeter wave radar and vision fusion method of claim 1, wherein in step 1 the radar backbone features are extracted from the 4D millimeter wave radar data in voxel format, and the step of obtaining the radar feature data comprises:
selecting voxels as the input of the point cloud BEV feature extraction network, where the BEV feature extraction network takes VoxelNet as the backbone and adds a feature pyramid network; the backbone divides the three-dimensional point cloud into a certain number of voxels and, after random sampling and normalization of the points, performs local feature extraction on each non-empty voxel with several voxel feature encoding layers to obtain voxel-level features; a 3D convolution module then further abstracts the features (enlarging the receptive field and learning a geometric spatial representation) to obtain the BEV features of the point cloud; the feature pyramid network further refines the BEV features, with a bottom-up path extracting features from the BEV features and a top-down path combining and refining them, producing feature maps at different resolutions that all contain the semantic information of the original deepest feature map.
3. The attention-based 4D millimeter wave radar and vision fusion method of claim 1, wherein the specific step of extracting the image backbone features of the image data in step 2 to obtain the 2D features of the image comprises:
taking a Swin Transformer as the backbone network and adding a feature pyramid network layer to obtain the 2D features of the image, then explicitly estimating the depth information of the image through a view projection module to complete the construction of the BEV view of the image, wherein the backbone model adopts a hierarchical structure with four stages in total, each stage reduces the resolution of the input feature map and gradually enlarges the receptive field, and at the input a patch embedding divides the image into several small patches and embeds them; each stage comprises two parts, namely a patch merging layer and a Swin Transformer block.
4. The attention-based 4D millimeter wave radar and vision fusion method of claim 1, wherein each image feature pixel is projected back into 3D space, and the specific process of forming the BEV features of the image is:
a discrete set of depth values is generated for each pixel of the image, the depth values being generated as follows: within the viewing frustum from 1 m to 60 m in front of the camera, a candidate depth value is placed every 1 m, and N points are sampled along this ray; the depth information of each feature point is predicted and represented, after a softmax, as a D-dimensional vector, where D is the number of candidate depths spaced 1 m apart within the 1 m to 60 m range; the depth information obtained for each pixel is used to weight the image features at the same position, generating a pseudo point cloud shaped like a truncated pyramid; the camera extrinsic and intrinsic parameters are used to transform the previously obtained frustum features into coordinates in 3D space, after which the features are flattened, the specific process being: the range of the BEV view defines the size of each grid cell, and the features projected into a given cell are aggregated into that cell; since multiple features may fall into the same cell in the top-down view, the image point cloud is quantized along the x and y dimensions with a fixed step size, the features in each BEV cell are aggregated by a BEV pooling operation, and the features are unfolded along the z-axis.
5. The attention-based 4D millimeter wave radar and vision fusion method of claim 1, wherein the radar and image features are fused in step 3, and the fusion step comprises:
spatially, first concatenating the radar and image features along the channel dimension, applying global max pooling and global average pooling respectively, passing each result through a 3×3 convolution kernel, and then applying a sigmoid activation to generate the final spatial attention map, which weights every pixel of the radar and image BEV features;
along the channel dimension, concatenating the radar features and the spatially attended image features, obtaining the channel weights through average pooling, a 3×3 convolution kernel and a sigmoid operation, and multiplying the concatenated features by these weights to obtain the final fused features.
CN202310237553.8A 2023-03-14 2023-03-14 Attention-based 4D millimeter wave radar and vision fusion method Pending CN116129234A (en)

Priority Applications (1)

Application number: CN202310237553.8A; Priority date: 2023-03-14; Filing date: 2023-03-14; Title: Attention-based 4D millimeter wave radar and vision fusion method


Publications (1)

Publication number: CN116129234A; Publication date: 2023-05-16

Family

ID=86304759

Family Applications (1)

Application number: CN202310237553.8A; Title: Attention-based 4D millimeter wave radar and vision fusion method; Priority date: 2023-03-14; Filing date: 2023-03-14; Status: Pending

Country Status (1)

Country Link
CN (1) CN116129234A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274749A (en) * 2023-11-22 2023-12-22 电子科技大学 Fused 3D target detection method based on 4D millimeter wave radar and image
CN117274749B (en) * 2023-11-22 2024-01-23 电子科技大学 Fused 3D target detection method based on 4D millimeter wave radar and image

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination