CN116682105A - Millimeter wave radar and visual feature attention fusion target detection method - Google Patents

Millimeter wave radar and visual feature attention fusion target detection method

Info

Publication number
CN116682105A
CN116682105A (application CN202310590332.9A)
Authority
CN
China
Prior art keywords
radar
image
point cloud
feature
millimeter
Prior art date
Legal status
Pending
Application number
CN202310590332.9A
Other languages
Chinese (zh)
Inventor
纪元法
张芬兰
孙希延
李晶晶
付文涛
梁维彬
赵松克
Current Assignee
Nanning Guidian Electronic Technology Research Institute Co ltd
Guilin University of Electronic Technology
Original Assignee
Nanning Guidian Electronic Technology Research Institute Co ltd
Guilin University of Electronic Technology
Priority date
Filing date
Publication date
Application filed by Nanning Guidian Electronic Technology Research Institute Co ltd, Guilin University of Electronic Technology filed Critical Nanning Guidian Electronic Technology Research Institute Co ltd
Priority to CN202310590332.9A
Publication of CN116682105A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a target detection method based on millimeter-wave radar and visual feature attention fusion, comprising the steps of: acquiring millimeter-wave radar point cloud data and visual image information; preprocessing the point cloud data and fusing it with the visual image at the data layer; performing preliminary detection on the image and extracting image features; associating the radar point cloud with the image information by target and extracting radar features; feeding the image features and radar features into a feature attention fusion network for fusion; and decoding and outputting the target detection results with a 3D box decoder. The invention uses the millimeter-wave radar cross-section intensity to adaptively adjust the spatial footprint of the point cloud projected onto the image, solving the problem that existing radar point cloud projections onto the image have a fixed size. It further proposes a radar-image feature attention fusion network that resolves the uneven weight distribution between millimeter-wave radar and visual image features during feature fusion, with the advantage of improving the accuracy and robustness of target detection.

Description

A target detection method based on millimeter-wave radar and visual feature attention fusion

Technical Field

The invention relates to the technical field of sensor feature fusion for target detection, and in particular to a target detection method based on millimeter-wave radar and visual feature attention fusion.

Background Art

At present, the sensors used for target detection mainly include visual cameras, infrared sensors, lidar, and millimeter-wave radar. Cameras are widely used because of their large detection range, the rich raw target information they capture, and their strong classification ability, but they cannot obtain the depth of a target and are easily affected by weather and lighting. In recent years, cameras have improved greatly in 2D target detection with high accuracy, but a large gap remains in 3D target detection when only visual image input is used. Millimeter-wave radar works around the clock in all weather and can extract the target's relative distance, velocity, and radar cross-section intensity from the echo signal, giving it strong anti-interference ability; however, its point clouds are very sparse and susceptible to external clutter and noise. The shortcomings of a single sensor are therefore generally addressed by fusing multiple sensors. Comparatively, there has been more research on fusing lidar and cameras for target detection, but because of lidar's narrow detection range and high price, more and more research has turned to the fusion of millimeter-wave radar and cameras. Combining the strengths of millimeter-wave radar and visual cameras, their fusion lets the two sensors complement each other, making the fused information richer and more comprehensive, while avoiding the heavy computation and high cost of lidar.

Fusion methods for millimeter-wave radar and visual cameras are generally divided into three types: data-level fusion, decision-level fusion, and feature-level fusion, of which feature-level fusion leaves considerable room for research. However, because millimeter-wave radar point cloud data is sparse, its representation ability is weak, and feature weights are distributed unevenly when radar features and image features are fused; most existing methods merely use radar features to assist image features in target detection. The key issues in processing millimeter-wave radar point clouds are therefore how to make full use of the point cloud data to enhance its spatial information, improve its representation ability, and distribute the weights of radar features and image features reasonably.

Summary of the Invention

In view of the above shortcomings of the prior art, the present invention provides a target detection method based on millimeter-wave radar and visual feature attention fusion, to address the problems in the prior art of sparse millimeter-wave radar point clouds, weak representation ability, and uneven feature weight distribution when radar features and image features are fused.

To achieve the above and other related objects, the present invention provides a target detection method based on millimeter-wave radar and visual feature attention fusion, comprising the following steps:

Step 1: acquire millimeter-wave radar point cloud data and frame images from a visual camera;

Step 2: preprocess the radar point cloud, project the radar points onto the image plane to obtain a radar image, use the radar cross-section intensity to control the spatial footprint of the projected radar points on the image, and fuse the result with the visual image information at the data layer;

Step 3: perform preliminary detection on the image and extract image features;

Step 4: associate the radar point cloud with targets in the image and extract radar features;

Step 5: feed the image features and radar features into the feature attention fusion module, which distributes the weights of the two feature types appropriately;

Step 6: combine the results of the first regression by the network detection head in step 3 and the second regression in step 5, and use the 3D box decoder to decode and output the target detection results.

Preferably, in step 2 of the present invention, the specific steps of preprocessing the radar point cloud and projecting the radar points onto the image plane to obtain a radar image are as follows:

scan the radar detection points multiple times and accumulate the data to increase point cloud density;

load the point cloud data, filter points by distance using a suitable distance threshold, add a z-offset to the radar point cloud, and remove radar points with abnormal velocities;

a radar point is represented as a 3D point detected by the radar in the egocentric coordinate system and parameterized as P_radar(x, y, z, v_x, v_y, σ), where (x, y, z) is the target position, v_x and v_y are the target's radial velocity components in the x and y directions, and σ is the target's scattering cross-section intensity;

based on the nuScenes dataset, project the millimeter-wave radar points onto the image plane using the camera intrinsic and extrinsic parameters given in the dataset; through coordinate transformation, convert the radar points from the ego coordinate system to the image coordinate system, generating a radar image of the same size as the visual image;

for the radar point cloud data projected onto the image plane, initially extend each point's pixels to a height range of 0.5-2.0 m vertically and a width range of 0.2-1.5 m horizontally, establishing a preliminary association between the radar information and the camera pixels;

adaptively adjust the specific horizontal width and vertical height of each radar point's pixels on the radar image according to the point's scattering cross-section intensity σ, obtaining a two-dimensional point cloud pixel area of variable size.

Preferably, in step 4 of the present invention, the specific steps of associating the radar point cloud with targets in the 2D image and extracting radar features are as follows:

generate a 3D viewing frustum from the depth, observation angle, and 3D size of the 3D bounding box obtained by the preliminary regression in step 3, together with the camera calibration matrix;

expand the radar points into fixed-size 3D pillars to increase the point cloud association rate, project the pillars into the pixel coordinate system for matching against the 2D bounding boxes, and simultaneously project the pillars into the camera coordinate system for depth matching against the constructed 3D frustum;

radar feature extraction: for each radar detection associated with an object, generate three radar heatmap channels centered on and contained within the object's 2D bounding box, where the heatmap's width and height are proportional to the size of the 2D bounding box and the heatmap values are determined by the normalized object depth and the x and y components of v_x, v_y in the egocentric coordinate system.

Preferably, in step 5 of the present invention, the specific steps of feeding the image features and radar features into the feature attention fusion module, learning weights, and distributing the weights of the two feature types appropriately are as follows:

the feature attention fusion network mainly consists of a channel attention module (CAM) and convolutional layers of different sizes: the first is Conv1×1, a convolutional layer with kernel size 1×1, stride (1,1), and padding (0,0); the second is Conv3×3, with kernel size 3×3, stride (1,1), and padding (1,1); the third is Conv7×7, with kernel size 7×7, stride (1,1), and padding (3,3);

the radar features pass through one Conv1×1 and two Conv3×3 branches for weight extraction; the attention weight matrices from the three branches are added element-wise and passed through the 7×7 convolutional layer to further extract feature weights and generate the spatial attention information of the radar features; the image features pass through Conv1×1, Conv3×3, and Conv1×1 in sequence to extract image feature weights, which are then multiplied element-wise with the result of processing the original image features through the channel attention module to generate the channel attention information of the image features;

concatenate and fuse the spatial attention information of the radar features and the channel attention information of the image features to generate the image-radar feature tensor, then use the detection head to perform a second regression to obtain the target's depth, velocity, orientation angle, attributes, and other information.

As described above, the features and beneficial effects of the target detection method based on millimeter-wave radar and visual feature attention fusion of the present invention are:

(1) A method is proposed that uses the millimeter-wave radar scattering cross-section intensity to adaptively adjust the spatial footprint of the radar point cloud projected onto the image: the extended height and width of each point projected onto the image are adaptively adjusted by its scattering cross-section intensity, enriching the spatial information of the radar points;

(2) Compared with the prior art, an attention learning mechanism is added to the feature fusion module. Given the sparsity of millimeter-wave radar point clouds, the radar features are processed into spatial attention information and the image features into channel attention information, and the two streams of attention information are then fused, redistributing the weights of the millimeter-wave radar features and visual image features reasonably, enhancing the radar point cloud information, and improving the accuracy and robustness of 3D target detection.

Brief Description of the Drawings

Fig. 1 is a schematic flow chart of the algorithm of the present invention.

Fig. 2 is a schematic diagram of the target detection network structure based on millimeter-wave radar and visual feature attention fusion of the present invention.

Fig. 3 is a schematic diagram of the structure of the feature attention fusion network proposed by the present invention.

Detailed Description of Embodiments

To make the purpose and technical solution of the present invention clearer, the technical solution is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the following specific embodiments are only used to explain the present invention and are not intended to limit its scope.

As shown in Fig. 2, the target detection network for millimeter-wave radar and visual feature attention fusion consists of five modules: data preprocessing, data-layer fusion, feature extraction, feature attention fusion, and target detection.

Data preprocessing module: preprocesses the millimeter-wave radar point cloud data and generates the radar image.

The detailed procedure is as follows:

1. Download the nuScenes dataset and read the front-view millimeter-wave radar point cloud data and the frame images of the visual camera. The millimeter-wave radar point cloud data is mainly information read from the radar echo signal, including the radar-to-target distance, position, target velocity, and the target's scattering cross-section intensity.

2. Preprocess the radar point cloud. The specific implementation includes:

(1) scan the radar detection points multiple times and accumulate the data to increase the density of the point cloud data;

(2) load the point cloud data, compute the radar-to-target distance (i.e., the depth information), filter points by distance using a suitable threshold, and add a z-offset to the radar point cloud; then sort the filtered points by distance in ascending order. The distance filter is expressed as:

1 m ≤ d ≤ 100 m

where d is the distance from the radar to the target point.
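For illustration, a minimal Python sketch of this filtering step follows; the array layout and the z-offset value are assumptions, as the text does not fix them:

```python
import numpy as np

def filter_radar_points(points, d_min=1.0, d_max=100.0, z_offset=0.5):
    """Filter radar points by distance, add a z-offset, and sort near-to-far.

    points: (N, 6) array of (x, y, z, vx, vy, sigma) in the ego frame.
    The 1 m <= d <= 100 m gate follows the expression above; the z-offset
    value 0.5 is an assumed placeholder.
    """
    d = np.linalg.norm(points[:, :3], axis=1)          # radar-to-target distance
    kept = points[(d >= d_min) & (d <= d_max)].copy()  # filter by distance threshold
    kept[:, 2] += z_offset                             # add z bias to the point cloud
    order = np.argsort(np.linalg.norm(kept[:, :3], axis=1))
    return kept[order]                                 # sorted small-to-large by distance
```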

(3) represent each radar point as a 3D point detected by the radar in the egocentric coordinate system, parameterized as P_radar(x, y, z, v_x, v_y, σ), where (x, y, z) is the target position, v_x and v_y are the target's radial velocity components in the x and y directions, and σ is the target's scattering cross-section intensity.

3. Based on the nuScenes dataset, project the millimeter-wave radar points onto the image plane using the camera intrinsic and extrinsic parameters given in the dataset. Using the coordinate transformation formulas, convert the radar points from the ego coordinate system to the image coordinate system and generate a radar image of the same size as the visual image. The coordinate transformation proceeds by first converting the radar points from the ego coordinate system to the global coordinate system, then from the global coordinate system to the camera coordinate system, and finally from the camera coordinate system to the image coordinate system.
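This projection chain can be sketched as follows; the 4×4 homogeneous transforms and the 3×3 intrinsic matrix K are assumed to come from the nuScenes calibration records, and the function is an illustrative sketch rather than the patent's exact implementation:

```python
import numpy as np

def project_to_image(points_ego, T_ego2global, T_global2cam, K):
    """Project ego-frame radar points through global and camera frames to pixels.

    points_ego: (N, 3); T_ego2global, T_global2cam: (4, 4) homogeneous
    transforms; K: (3, 3) camera intrinsics.
    """
    homo = np.hstack([points_ego, np.ones((len(points_ego), 1))])  # (N, 4)
    pts_cam = (T_global2cam @ T_ego2global @ homo.T)[:3]           # camera frame
    in_front = pts_cam[2] > 0                       # keep points in front of the camera
    uvw = K @ pts_cam[:, in_front]
    uv = (uvw[:2] / uvw[2]).T                       # perspective divide -> pixel coords
    return uv, pts_cam[2, in_front]                 # pixel positions and depths
```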

4. Because millimeter-wave radar point clouds are sparse, for the radar point cloud data projected onto the image plane, first extend each point's pixels to a height range of 0.5-2.0 m vertically and a width range of 0.2-1.5 m horizontally. This step establishes a preliminary association between the radar information and the camera pixels and spatially expands the radar point cloud information. Then adaptively adjust the specific horizontal width w_p and vertical height h_p of each radar point's pixels on the radar image according to the point's scattering cross-section intensity σ, expressed as:

where σ(·) denotes the function of the scattering cross-section intensity; w′_p and h′_p denote the specific point cloud width and height values; and S_p denotes the point cloud pixel area.

Through the above operations, a two-dimensional point cloud pixel area of variable size is obtained, solving the existing problem that radar point clouds projected onto the image have a fixed size.
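Because the σ(·) mapping itself is not reproduced above, the sketch below assumes a simple linear normalization of the RCS value into the stated 0.2-1.5 m width and 0.5-2.0 m height ranges; the normalization bounds and the linear form are hypothetical:

```python
import numpy as np

def rcs_extent(sigma, sigma_min=-10.0, sigma_max=25.0):
    """Map an RCS intensity (dBsm) to a (width, height) extent in metres.

    Linear interpolation into the 0.2-1.5 m width and 0.5-2.0 m height
    ranges stated above; the dBsm bounds and the linear form are assumptions.
    """
    t = np.clip((sigma - sigma_min) / (sigma_max - sigma_min), 0.0, 1.0)
    w_p = 0.2 + t * (1.5 - 0.2)    # horizontal width in metres
    h_p = 0.5 + t * (2.0 - 0.5)    # vertical height in metres
    return w_p, h_p                # pixel area S_p follows once projected to pixels
```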

Data-layer fusion module: project the depth information obtained from the millimeter-wave radar echoes and the velocity components v_x, v_y along the X and Y axes as pixel values into visual image channels; at pixel positions without radar echoes, set the corresponding radar channel values to zero; finally, convert the visual image augmented with radar information into a three-channel image, realizing the fusion of the millimeter-wave radar point cloud data and the visual image at the data layer.
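A sketch of this rasterization step, assuming the per-point pixel extents come from the RCS-adaptive step above and that each point is extended upward from its projected position:

```python
import numpy as np

def rasterize_radar(uv, depth, vx, vy, wh_px, image_hw):
    """Rasterize radar detections into a 3-channel (d, vx, vy) radar image.

    uv: (N, 2) projected pixel positions; wh_px: (N, 2) per-point extents in
    pixels; image_hw: (H, W). Pixels without radar echoes stay zero.
    """
    H, W = image_hw
    radar_img = np.zeros((H, W, 3), dtype=np.float32)
    for (u, v), d, x, y, (w, h) in zip(uv, depth, vx, vy, wh_px):
        u0, u1 = int(max(u - w / 2, 0)), int(min(u + w / 2, W - 1))
        v0, v1 = int(max(v - h, 0)), int(min(v, H - 1))   # extend upward (assumed)
        radar_img[v0:v1 + 1, u0:u1 + 1] = (d, x, y)       # depth and velocity as pixel values
    return radar_img
```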

Feature extraction module: includes the extraction of visual image features and the extraction of millimeter-wave radar features.

1. Extraction of visual image features. The detailed procedure is as follows:

(1) The result of fusing the millimeter-wave radar point cloud data with the visual image at the data layer serves as the input to a CenterNet center point detection network with an improved DLA-34 backbone. The data-layer fusion result is three-channel visual image information carrying radar information, I_{i+r} ∈ R^{W×H×C}, where W and H are the width and height of the image, respectively. After downsampling, the network generates the predicted keypoint heatmap Ŷ ∈ [0, 1]^{(W/R)×(H/R)×C}, where R is the downsampling rate and C is the number of object categories. A focal loss is established over the generated heatmap:

L_k = -(1/N) · Σ_{xyc} { (1 - Ŷ_xyc)^α · log(Ŷ_xyc),                 if Y_xyc = 1
                         (1 - Y_xyc)^β · (Ŷ_xyc)^α · log(1 - Ŷ_xyc),  otherwise }

where N is the number of keypoints in the image; α and β are hyperparameters, usually α = 2 and β = 4; and Y_xyc is the ground-truth heatmap value of the target generated by a Gaussian kernel.
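A PyTorch sketch of this penalty-reduced focal loss, matching the form above (α = 2, β = 4 by default):

```python
import torch

def centernet_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Focal loss over keypoint heatmaps.

    pred, gt: (B, C, H, W) tensors; gt holds the Gaussian-splatted
    ground-truth heatmap values Y_xyc in [0, 1].
    """
    pos = gt.eq(1).float()                    # keypoint centres (Y_xyc = 1)
    neg = 1.0 - pos
    pos_loss = pos * (1 - pred).pow(alpha) * torch.log(pred + eps)
    neg_loss = neg * (1 - gt).pow(beta) * pred.pow(alpha) * torch.log(1 - pred + eps)
    n = pos.sum().clamp(min=1)                # N = number of keypoints
    return -(pos_loss + neg_loss).sum() / n
```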

(2) A primary regression head, composed of a 256-channel Conv3×3 convolutional layer and a Conv1×1 convolutional layer that produces the required output, predicts the target's center point, 2D size (W and H), center offset, depth, orientation angle, 3D size, and other information on the image. This provides a rough 3D box and a precise 2D box for each detected target.

2. Extraction of millimeter-wave radar features. The detailed procedure is as follows:

(1) As described in the data preprocessing module, convert the depth information obtained from the millimeter-wave radar echoes and the velocity components v_x, v_y along the X and Y axes into pixel values, generating a three-channel raw radar point cloud image of the same size as the visual image. Then convert the radar points into the camera coordinate system using the rigid transformation:

X_image = R · X_radar + T

where X_image and X_radar are the coordinates of the radar point cloud in the camera and millimeter-wave radar coordinate systems, respectively, and R and T are the rotation matrix and translation vector, respectively.

(2) For the generated radar point cloud image, to supplement the radar's height information, expand each radar point into a pillar of size (1.5, 0.2, 0.2) along the [x, y, z] directions, enhancing the spatial information of the point cloud, and project the pillars into the pixel coordinate system to associate them with the 2D bounding boxes.

(3) Using the 2D bounding boxes, estimated depth information, and camera calibration matrix output by the visual image feature extraction above, create a 3D ROI frustum region for each object; simultaneously project the pillars into the camera coordinate system for depth matching against the constructed 3D frustum, ignoring points outside the frustum; the size of the frustum region is controlled by the parameter δ.
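A sketch of the frustum gate; the multiplicative form est_depth · (1 ± δ) of the δ-controlled depth window is an assumption, as the text does not give the exact formula:

```python
import numpy as np

def associate_in_frustum(radar_uv, radar_depths, box2d, est_depth, delta=0.1):
    """Select radar pillar points inside an object's 2D box and depth frustum.

    radar_uv: (N, 2) pillar projections in pixels; box2d: (u0, v0, u1, v1);
    est_depth: depth from the primary regression head; delta controls the
    frustum size. Returns the index of the closest matching point, or None.
    """
    u, v = radar_uv[:, 0], radar_uv[:, 1]
    in_box = (u >= box2d[0]) & (u <= box2d[2]) & (v >= box2d[1]) & (v <= box2d[3])
    in_depth = np.abs(radar_depths - est_depth) <= delta * est_depth
    idx = np.where(in_box & in_depth)[0]
    if idx.size == 0:
        return None                              # points outside the frustum are ignored
    return idx[np.argmin(radar_depths[idx])]     # keep the closest detection
```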

(4) After associating the radar points with the corresponding objects, use the depth and velocity information of the radar point cloud to build complementary features for the image, generating three radar heatmap channels (d, v_x, v_y). The width and height of the radar heatmap are proportional to the target's 2D box, and its heatmap values are determined by the target's depth d and the X and Y components of the radial velocity (v_x and v_y) in the ego coordinate system, expressed as follows:

F^j_{x,y,i} = (1/M_i) · f_i,   for |x - c_{x,j}| ≤ α·w_j and |y - c_{y,j}| ≤ α·h_j (0 elsewhere)

where M_i is the normalization factor; i = 1, 2, 3 indexes the feature channels; f_i is the feature value of the three channels (d, v_x, v_y); c_{x,j} and c_{y,j} are the x- and y-axis coordinates of the j-th target's center point on the image; w_j and h_j are the width and height of the j-th target's 2D box; and α is a hyperparameter that controls the width and height of the heatmap region relative to the 2D box. If two targets have overlapping heatmap regions, the region with the smaller heatmap value is usually selected.
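A sketch that fills the three heatmap channels per the expression above; the normalization factors M_i are assumed values, and overlaps keep the smaller heatmap value as stated:

```python
import numpy as np

def radar_heatmap(image_hw, objects, alpha=0.3):
    """Build the 3-channel (d, vx, vy) radar heatmap.

    objects: dicts with centre (cx, cy), 2D box size (w, h), and matched radar
    values (d, vx, vy). alpha scales the heatmap region relative to the box;
    the M_i normalizers below are assumed, not given in the text.
    """
    H, W = image_hw
    F = np.zeros((3, H, W), dtype=np.float32)
    M = np.array([100.0, 20.0, 20.0])                # assumed normalization factors M_i
    for obj in objects:
        f = np.array([obj["d"], obj["vx"], obj["vy"]]) / M
        u0 = int(max(obj["cx"] - alpha * obj["w"], 0))
        u1 = int(min(obj["cx"] + alpha * obj["w"], W - 1))
        v0 = int(max(obj["cy"] - alpha * obj["h"], 0))
        v1 = int(min(obj["cy"] + alpha * obj["h"], H - 1))
        region = F[:, v0:v1 + 1, u0:u1 + 1]
        fb = np.broadcast_to(f[:, None, None], region.shape)
        # on overlap, keep the smaller heatmap value; zero means no echo yet
        np.copyto(region, np.where(region == 0, fb, np.minimum(region, fb)))
    return F
```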

Feature attention fusion module: the visual image features and millimeter-wave radar features are fed together into the feature attention fusion module. The structure of the feature attention fusion network is shown in Fig. 3; it comprises a channel attention module (CAM) and three different attention weight generation units. The first is Conv1×1, a convolutional layer with kernel size 1×1, stride (1,1), and padding (0,0); the second is Conv3×3, with kernel size 3×3, stride (1,1), and padding (1,1); the third is Conv7×7, with kernel size 7×7, stride (1,1), and padding (3,3).

The specific implementation steps of the feature attention fusion network are as follows:

(1) Use one Conv1×1 and two Conv3×3 branches to extract weights from the radar features obtained in step 4, add the attention weight matrices of the three branches element-wise, and pass the result through the 7×7 convolutional layer to further extract feature weights and generate the spatial attention information of the radar features;

(2) Process the generated image features through Conv1×1, Conv3×3, and Conv1×1 in sequence to extract image feature weights, then multiply them element-wise with the result of processing the original image features through the channel attention module to generate the channel attention information of the image features;

(3) Concatenate and fuse the spatial attention information of the radar features from (1) and the channel attention information of the image features from (2) to generate the image-radar feature tensor.
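A PyTorch sketch of the Fig. 3 structure under stated assumptions: channel counts are free choices, the CAM reduction ratio is assumed, and whether each attention map is applied multiplicatively before concatenation is not fixed by the text:

```python
import torch
import torch.nn as nn

class CAM(nn.Module):
    """Channel attention module (the M_c equation is given further below)."""
    def __init__(self, c, r=4):                       # reduction ratio r is assumed
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(), nn.Linear(c // r, c))
    def forward(self, x):
        b, c, _, _ = x.shape
        gate = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        return gate.view(b, c, 1, 1) * x              # gate applied to the features

class FeatureAttentionFusion(nn.Module):
    """Radar spatial-attention path and image channel-attention path, concatenated."""
    def __init__(self, c_img, c_rad):
        super().__init__()
        self.r1 = nn.Conv2d(c_rad, c_rad, 1, 1, 0)    # Conv1x1 branch
        self.r2 = nn.Conv2d(c_rad, c_rad, 3, 1, 1)    # Conv3x3 branch
        self.r3 = nn.Conv2d(c_rad, c_rad, 3, 1, 1)    # Conv3x3 branch
        self.r7 = nn.Conv2d(c_rad, c_rad, 7, 1, 3)    # Conv7x7 refinement
        self.iw = nn.Sequential(                      # image weight branch
            nn.Conv2d(c_img, c_img, 1, 1, 0),
            nn.Conv2d(c_img, c_img, 3, 1, 1),
            nn.Conv2d(c_img, c_img, 1, 1, 0),
        )
        self.cam = CAM(c_img)
    def forward(self, img_feat, rad_feat):
        rad_att = self.r7(self.r1(rad_feat) + self.r2(rad_feat) + self.r3(rad_feat))
        img_att = self.iw(img_feat) * self.cam(img_feat)   # element-wise multiply
        return torch.cat([img_att, rad_att], dim=1)        # image-radar feature tensor
```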

Specifically, the method of processing the image features through the CAM to establish the channel attention mechanism is as follows:

The input visual image features I_img ∈ R^{W×H×3} are processed by max pooling and average pooling respectively, yielding two feature maps I′_img ∈ R^{1×1×3}; the two feature maps are each passed through a two-layer perceptron network; the output features are added element-wise, and an activation function is applied to generate the image channel features, which are finally multiplied element-wise with the original visual image features. This process is expressed as follows:

M_c(I_img) = Sigmoid(MLP(avgpool(I_img)) + MLP(maxpool(I_img)))

where Sigmoid(·) is the activation function; MLP is the perceptron network used for the matrix operations; avgpool(·) and maxpool(·) denote average pooling and max pooling, respectively; and M_c(I_img) is the output of the image channel attention module.
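The equation can be transcribed directly as below; the two-layer perceptron's hidden width (a reduction ratio of 4) is an assumption:

```python
import torch
import torch.nn as nn

def cam_gate(img, mlp):
    """M_c(I_img) = Sigmoid(MLP(avgpool(I_img)) + MLP(maxpool(I_img))).

    img: (B, C, H, W); mlp: the shared two-layer perceptron over C channels.
    Returns the channel gate to be multiplied element-wise with img.
    """
    b, c, _, _ = img.shape
    avg = img.mean(dim=(2, 3))                      # average pooling to (B, C)
    mx = img.amax(dim=(2, 3))                       # max pooling to (B, C)
    return torch.sigmoid(mlp(avg) + mlp(mx)).view(b, c, 1, 1)

# usage: a shared MLP over 64 channels with an assumed reduction ratio of 4
mlp = nn.Sequential(nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 64))
feat = torch.randn(2, 64, 32, 32)
out = cam_gate(feat, mlp) * feat                    # gated image features
```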

Target detection module: the output of the feature attention fusion is used as the input to the second regression of the CenterNet detection head, recomputing the target's depth, velocity, orientation angle, and attribute information. The secondary regression head consists of three Conv3×3 convolutional layers with a Conv1×1 convolutional layer as output. Combining the information from image feature extraction with the depth, velocity, orientation angle, and attribute information from the secondary regression head, the 3D bounding box decoder recovers the 3D target detection results; for depth and pose, which appear in both regression heads, only the more accurate secondary regression results are used.
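A sketch of the secondary regression head (three Conv3×3 layers with a Conv1×1 output); the 256-channel width and interleaved ReLUs are assumptions carried over from the primary head's description:

```python
import torch.nn as nn

def secondary_head(c_in, c_out):
    """Three Conv3x3 layers followed by a Conv1x1 output layer.

    One such head per regressed quantity, e.g. c_out=1 for depth or
    2 for velocity (the output widths are illustrative examples).
    """
    return nn.Sequential(
        nn.Conv2d(c_in, 256, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(256, c_out, 1),
    )
```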

It should be noted that the above description is only intended to illustrate the technical solution of the present invention, not to limit it; those of ordinary skill in the art may make various changes without departing from the spirit of the present invention.

Claims (8)

1. A target detection method based on millimeter-wave radar and visual feature attention fusion, characterized in that the method comprises the following steps:

Step 1: acquire millimeter-wave radar point cloud data and frame images from a visual camera;

Step 2: preprocess the radar point cloud data, then fuse the preprocessed radar point cloud data with the frame images at the data layer;

Step 3: perform preliminary detection on the frame images and extract image features;

Step 4: associate the radar point cloud data with targets in the frame images and extract radar features;

Step 5: feed the image features and the radar features into the feature attention fusion module, which distributes the weights of the two feature types appropriately;

Step 6: decode and output the target detection results with the 3D box decoder.

2. The target detection method based on millimeter-wave radar and visual feature attention fusion according to claim 1, characterized in that in step 2, the specific steps of preprocessing the radar point cloud data are: scan the radar detection points multiple times and accumulate the data to increase point cloud density; load the point cloud data, filter points by distance using a suitable distance threshold, add a z-offset to the radar point cloud, and remove radar points with abnormal velocities; represent each radar point as a 3D point detected by the radar in the egocentric coordinate system, parameterized as P_radar(x, y, z, v_x, v_y, σ), where (x, y, z) is the target position, v_x and v_y are the target's radial velocity components in the x and y directions, and σ is the target's scattering cross-section intensity; and project the millimeter-wave radar points onto the image plane using the camera intrinsic and extrinsic parameters given in the dataset.

3. The target detection method based on millimeter-wave radar and visual feature attention fusion according to claim 2, characterized in that the specific steps of projecting the millimeter-wave radar point cloud onto the image plane are: through coordinate transformation, convert the millimeter-wave radar points from the ego coordinate system to the image coordinate system, generating a radar image of the same size as the visual image; initially extend the point cloud pixels to a height range of 0.5-2.0 m vertically and a width range of 0.2-1.5 m horizontally, establishing a preliminary association between the radar information and the camera pixels; and adaptively adjust the specific horizontal width and vertical height of each radar point's pixels on the radar image according to the point's scattering cross-section intensity σ, obtaining a two-dimensional point cloud pixel area of variable size.

4. The target detection method based on millimeter-wave radar and visual feature attention fusion according to claim 1, characterized in that in step 2, the specific steps of fusing the preprocessed radar point cloud data with the frame images at the data layer are: first project the depth information obtained from the millimeter-wave radar echoes and the velocity components v_x, v_y along the X and Y axes as pixel values into the visual image channels, setting the corresponding radar channels to zero at pixel positions without radar echoes; finally, convert the visual image augmented with radar information into a three-channel image, realizing the fusion of the millimeter-wave radar point cloud data and the frame images at the data layer.

5. The target detection method based on millimeter-wave radar and visual feature attention fusion according to claim 4, characterized in that in step 3, the specific steps of performing preliminary detection on the frame images and extracting image features are: feed the result of fusing the millimeter-wave radar point cloud data with the frame images at the data layer into a CenterNet center point detection network with a DLA-34 backbone, perform preliminary detection on the images, and regress with the detection head to obtain the target's image feature information, including a rough 3D bounding box, depth, observation angle, 2D size, and velocity.

6. The target detection method based on millimeter-wave radar and visual feature attention fusion according to claim 5, characterized in that in step 4, the specific steps of associating the radar point cloud data with targets in the frame images and extracting radar features are: generate a 3D viewing frustum from the regressed image feature information of the target and the camera calibration matrix, and associate radar detections with the target within the frustum region; expand the radar point cloud data into fixed-size 3D pillars to increase the point cloud association rate, project the pillars into the pixel coordinate system for association with the 2D bounding boxes, and simultaneously project the pillars into the camera coordinate system for depth matching against the constructed 3D frustum, ignoring points outside the frustum; and use the depth and velocity information of the radar point cloud data to build complementary features for the image, generating three-channel (d, v_x, v_y) radar features.

7. The target detection method based on millimeter-wave radar and visual feature attention fusion according to claim 6, characterized in that in step 5, the specific steps of feeding the image features and the radar features into the feature attention fusion module are: use one Conv1×1 and two Conv3×3 branches to extract weights from the radar features obtained in step 4, add the attention weight matrices of the three branches element-wise, and pass the result through the 7×7 convolutional layer to further extract feature weights and generate the spatial attention information of the radar features; process the generated image features through Conv1×1, Conv3×3, and Conv1×1 in sequence to extract image feature weights, then multiply them element-wise with the result of processing the original image features through the channel attention module to generate the channel attention information of the image features; concatenate and fuse the spatial attention information of the radar features and the channel attention information of the image features to generate the image-radar feature tensor; and use the detection head to perform a second regression on the generated image-radar feature tensor to obtain the target's depth, velocity, orientation angle, and attribute information.

8. The target detection method based on millimeter-wave radar and visual feature attention fusion according to claim 7, characterized in that in step 6, the specific steps of decoding and outputting the target detection results with the 3D box decoder are: combine the image feature information with the information regressed after the feature attention fusion module, recover the 3D target detection results through the 3D bounding box decoder, and output them for display on the visual image.
CN202310590332.9A 2023-05-24 2023-05-24 Millimeter wave radar and visual feature attention fusion target detection method Pending CN116682105A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310590332.9A CN116682105A (en) 2023-05-24 2023-05-24 Millimeter wave radar and visual feature attention fusion target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310590332.9A CN116682105A (en) 2023-05-24 2023-05-24 Millimeter wave radar and visual feature attention fusion target detection method

Publications (1)

Publication Number Publication Date
CN116682105A (en) 2023-09-01

Family

ID=87777992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310590332.9A Pending CN116682105A (en) 2023-05-24 2023-05-24 Millimeter wave radar and visual feature attention fusion target detection method

Country Status (1)

Country Link
CN (1) CN116682105A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117079416A (en) * 2023-10-16 2023-11-17 德心智能科技(常州)有限公司 Multi-person 5D radar falling detection method and system based on artificial intelligence algorithm
CN117079416B (en) * 2023-10-16 2023-12-26 德心智能科技(常州)有限公司 Multi-person 5D radar falling detection method and system based on artificial intelligence algorithm
CN117237613A (en) * 2023-11-03 2023-12-15 华诺星空技术股份有限公司 Foreign object intrusion detection method, equipment and storage medium based on convolutional neural network
CN119785012A (en) * 2025-03-06 2025-04-08 合肥工业大学 Three-dimensional object detection method based on multi-view and point cloud BEV feature fusion

Similar Documents

Publication Publication Date Title
CN111429514B (en) Laser radar 3D real-time target detection method integrating multi-frame time sequence point cloud
EP3916628B1 (en) Object identification method and device
CN109655019B (en) A cargo volume measurement method based on deep learning and 3D reconstruction
CN111201451B (en) Method and device for detecting object in scene based on laser data and radar data of scene
Zhang et al. Fusion of images and point clouds for the semantic segmentation of large-scale 3D scenes based on deep learning
CN116682105A (en) Millimeter wave radar and visual feature attention fusion target detection method
CN111046781B (en) A Robust 3D Object Detection Method Based on Ternary Attention Mechanism
CN111899172A (en) Vehicle target detection method oriented to remote sensing application scene
CN113989758B (en) Anchor guide 3D target detection method and device for automatic driving
CN114511846B (en) A real-time 3D object detection method based on point cloud cross-view feature conversion
CN117422971A (en) Bimodal target detection method and system based on cross-modal attention mechanism fusion
CN107341488A (en) A kind of SAR image target detection identifies integral method
CN115115917B (en) 3D point cloud object detection method based on attention mechanism and image feature fusion
CN116958934A (en) A target detection method for autonomous vehicles based on radar and vision fusion
CN117274749A (en) Fused 3D target detection method based on 4D millimeter wave radar and image
CN117037001A (en) Unmanned aerial vehicle aerial photography small target detection method for improving YOLOv7
CN116129234A (en) Attention-based 4D millimeter wave radar and vision fusion method
CN117197606A (en) Automatic driving-oriented sparse convolution feature distillation point cloud target detection method
CN117542010A (en) 3D target detection method based on fusion of image and 4D millimeter wave radar
CN118799727A (en) A multimodal lidar point cloud target detection method based on bidirectional fusion
Engels et al. 3d object detection from lidar data using distance dependent feature extraction
CN115908829A (en) Point column-based two-order multi-attention mechanism 3D point cloud target detection method
Li et al. A sea–sky–line detection method for long wave infrared image based on improved Swin Transformer
Bi et al. Machine vision
CN113537397B (en) Target detection and image definition joint learning method based on multi-scale feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination