CN114639115A - 3D pedestrian detection method based on fusion of human body key points and laser radar - Google Patents
- Publication number: CN114639115A
- Application number: CN202210155255.XA
- Authority: CN (China)
- Prior art keywords: pedestrian, detection, radar, network, human body
- Legal status: Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a 3D pedestrian detection method that fuses human body key points with a laser radar, comprising key point detection, 3D feature extraction, and pedestrian position prediction. The method makes full use of the pedestrian key points in the image and the depth features in the point cloud data, strengthens pedestrian target recognition through the joint use of point cloud and image information, effectively improves the precision of 3D pedestrian detection, and overcomes the lack of color information in point cloud features and the lack of three-dimensional position for image targets. The method has great significance and application value in fields such as intelligent robots, augmented reality, and automatic driving.
Description
Technical Field
The invention belongs to the technical field of 3D target detection, and relates to a 3D pedestrian detection method based on fusion of human body key points and a laser radar.
Background
The 3D pedestrian detection task is driven by application scenarios such as automatic driving, augmented reality, and intelligent robots, and is currently one of the research hotspots in computer vision. In these scenarios, humans, as the main acting subjects, are the most common detection targets. Especially in vehicle automatic driving, uncertainty often originates from pedestrians or riders. The flexibility of people in the traffic environment and their importance demand higher detection precision for pedestrians. However, problems such as small pedestrian targets, insufficient features, and background interference pose great challenges to 3D pedestrian detection.
The laser radar is an optical remote-sensing technology that acquires target information by detecting the laser light scattered by distant objects, combining traditional radar with modern laser technology. By detecting the laser scattered from the surface of a target object it obtains information, and it is widely applied in distance measurement, speed measurement, scanning, and target detection. In automatic driving, the surrounding environment is perceived mainly through a laser radar scanner in order to plan the vehicle's route and control the vehicle to reach a preset destination safely. Compared with traditional measurement technologies, laser radar data acquisition offers high measurement precision, high detection efficiency, all-weather operation, and non-contact detection.
Human body key point detection is a fundamental task in computer vision and a precursor to human action recognition, behavior analysis, human-computer interaction, and the like. Human skeleton key points are important for describing human posture and predicting human behavior, and form the basis of many computer vision tasks such as action classification, abnormal behavior detection, and automatic driving. Key point detection reaches roughly 80 percent precision in human body recognition and performs well in human behavior prediction. Therefore, studying point cloud and image based three-dimensional pedestrian detection in combination with human key points has great significance and application value in the field of automatic driving.
Existing 3D target detection methods mainly perform target recognition on point cloud data alone. Such algorithms achieve good detection precision, but because they use no image data and thus lack color information, parts of the background are easily mistaken for pedestrians. Human key point detection, for its part, has mainly been applied to human target detection and behavior prediction in 2D scenes, where the position, size, and other characteristics of pedestrians in three-dimensional space are lost.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a 3D pedestrian detection method that fuses human body key points with a laser radar, combining the high ranging precision of point clouds with the strong pedestrian recognition capability of human body key points to detect 3D pedestrian targets in three-dimensional space. The specific technical scheme of the invention is as follows:
A 3D pedestrian detection method fusing human body key points and a laser radar, wherein a fisheye camera and a laser radar are mounted at the front end of a vehicle to acquire visible light images and 3D space radar point cloud data of the areas in front of and beside the vehicle, comprises the following steps:
S1: detecting human body key points based on the visible light image: the positions of human body key points are extracted from the image, the connection relations between key points are inferred, and the detection-frame position of each pedestrian is traced back;
S2: building a radar 3D point cloud feature extraction network based on human body key points to extract features: on top of a voxel-based radar signal detection network, registration with the two-dimensional image is performed through feature dimension reduction, and the key point positions are accurately introduced according to the registration result so that the network extracts three-dimensional features around the key point positions;
S3: training and prediction of the pedestrian 3D position detection network: a neural network regresses the center position and the length, width, and height of each pedestrian in 3D space and is trained with a loss function to obtain the final pedestrian 3D detection network; the final prediction result is given after detection-frame post-processing.
Further, the specific process of step S1 is as follows:
S1-1: annotating training data: an OpenPose key point recognition algorithm is trained on the key point data and annotations of the MSCOCO data set to detect human body key points in the input images, covering 14 key points: head, neck, left and right shoulders, left and right elbows, left and right wrists, left and right hips, left and right knees, left and right ankles;
S1-2: the OpenPose key point detection algorithm is trained with the data of step S1-1 to obtain a pedestrian key point detection network; the visible light images acquired by the fisheye camera are input to the trained network to obtain the key point detections of all pedestrians, the pedestrian to which each key point belongs is determined by the Hungarian algorithm, and finally the 2D key points of every person in the image are output;
S1-3: the maximum and minimum position coordinates of each pedestrian are obtained from the key point ownership information, and a human body candidate region is generated bottom-up to give the pedestrian candidate-frame result.
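To make step S1-3 concrete, the following is a minimal Python sketch of deriving a pedestrian candidate frame from one person's 2D key points; the array layout, the padding margin, and the function name are illustrative assumptions, not part of the patented method.

```python
import numpy as np

def candidate_box_from_keypoints(keypoints, margin=0.1):
    """Derive a 2D pedestrian candidate box from one person's key points.

    keypoints: (K, 2) array of (x, y) pixel coordinates of the 14 joints
    detected for a single pedestrian (an assumed layout).
    margin: fractional padding, since joints lie inside the silhouette.
    """
    pts = np.asarray(keypoints, dtype=float)
    x_min, y_min = pts.min(axis=0)   # minimum position coordinates
    x_max, y_max = pts.max(axis=0)   # maximum position coordinates
    w, h = x_max - x_min, y_max - y_min
    # Pad slightly so the head and feet extremities are enclosed.
    return (x_min - margin * w, y_min - margin * h,
            x_max + margin * w, y_max + margin * h)
```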
Further, the radar 3D point cloud feature extraction network based on human body key points in step S2 comprises a voxel division module, a feature mapping matching module, a feature enhancement module, and a prediction module connected in series, wherein:
① the voxel division module: the 3D space radar point cloud carries three-dimensional spatial information, with width W, height H, and depth D along the X, Y, Z axes; the point cloud is divided into uniform cuboid voxels whose minimum width, height, and depth units are v_W, v_H, v_D respectively, so that the three-dimensional voxel grid generated after division has size W' = W/v_W, H' = H/v_H, D' = D/v_D; after division, a voxel cell containing more than T radar points is marked non-empty (otherwise empty), and T points are randomly sampled in each non-empty cell as that voxel's feature (a voxelization sketch follows this module list);
② the feature mapping matching module: point cloud features are extracted from the output of the voxel division module through a three-layer convolution network and split into two paths; each path uses a feature addition layer to reduce the dimensionality of the 3D space radar point cloud, the two reduction directions corresponding to a radar front view and a two-dimensional bird's-eye view respectively; the radar front view direction coincides with the visible light image direction, signal registration between the radar front view and the visible light image is performed in this direction, and the human body key points obtained in step S1 are introduced into the radar front view;
③ the feature enhancement module: based on the output of the feature mapping matching module, features are stacked and enhanced from the two-dimensional bird's-eye-view direction; the layers preceding the feature addition layers, i.e. the features of the three-layer convolution network in the feature mapping matching module, are used to build a feature pyramid; features are summed over the 3-pixel neighborhood around each local extremum in the radar front view and the two-dimensional bird's-eye view, and the sums are then concatenated with the bird's-eye-view features to obtain the enhanced features;
④ the prediction module: built on the enhanced features to predict the position of the pedestrian 3D detection frame, it comprises three fully connected layers for feature abstraction, each downsampling by 1/2, followed by two prediction branches that output the pedestrian category and the pedestrian 3D coordinates respectively.
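As a reading aid for the voxel division module ①, the following is a minimal Python sketch of dividing a point cloud into uniform voxels and sampling T points from each non-empty cell; the dictionary-based grid and the default values of v_W, v_H, v_D and T are assumptions for illustration only.

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.4), T=35):
    """Divide lidar points into uniform cuboid voxels.

    points: (N, 3) array of (x, y, z) radar points.
    voxel_size: minimum units (v_W, v_H, v_D) of width, height, depth.
    T: cells with more than T points are non-empty; T points are then
       randomly sampled from each non-empty cell as its feature.
    """
    idx = np.floor(points / np.asarray(voxel_size)).astype(np.int64)
    cells = {}
    for key, pt in zip(map(tuple, idx), points):
        cells.setdefault(key, []).append(pt)   # group points by cell
    features = {}
    for key, pts in cells.items():
        if len(pts) > T:                       # non-empty voxel
            sel = np.random.choice(len(pts), T, replace=False)
            features[key] = np.asarray(pts)[sel]
    return features
```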
Further, the specific process of step S3 is as follows:
S3-1: the radar point cloud feature extraction network based on human body key point information built in step S2 is trained as a whole on a 3D pedestrian detection data set, and a focal loss function is introduced to optimize the prediction result (a code sketch of this loss follows these steps); its mathematical form is:
FL(p_t) = -(1 - p_t)^γ · log(p_t)
where y denotes the label, taking values in {+1, -1} for binary classification, and p denotes the predicted probability that the sample belongs to class 1, in the range 0 to 1; for convenience p_t is written in place of p, with p_t denoting the probability that the sample belongs to the positive class; γ is the focusing parameter, γ ≥ 0, and (1 - p_t)^γ is the modulating factor: by increasing the weight of hard-to-classify samples, training focuses the model on the small, hard targets, finally yielding a high-precision pedestrian 3D detection network;
S3-2: after the 3D pedestrian detection network is obtained by the training of S3-1, the key point image and the radar point cloud are input directly into the network during detection; the network predicts the 3D pedestrian detection frames, and the final result is output after non-maximum suppression.
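The focal loss of step S3-1 can be written out directly; below is a minimal NumPy sketch under the binary-label convention of the text, with γ = 2 as an assumed (commonly used) default.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Focal loss FL(p_t) = -(1 - p_t)^gamma * log(p_t).

    p: predicted probability that the sample belongs to class 1.
    y: label in {+1, -1}.
    gamma: focusing parameter, gamma >= 0; larger values up-weight
           hard samples such as small pedestrian targets.
    """
    p_t = p if y == 1 else 1.0 - p   # probability of the true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t)
```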
The invention has the beneficial effects that:
1. The method realizes 3D detection of pedestrian targets by combining the human body key points in the image with the depth information of the laser radar point cloud. It makes full use of the pedestrian key points in the image and the depth features in the point cloud data, strengthens pedestrian target recognition from point cloud and image information, effectively improves the precision of 3D pedestrian detection, and overcomes the lack of image color information in point cloud features and the low precision of three-dimensional target recognition from images alone;
2. The method has great significance and application value in fields such as automatic driving.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below so that the features and advantages of the present invention can be understood more clearly. The drawings are schematic and should not be construed as limiting the present invention in any way; a person skilled in the art can obtain other drawings from them without inventive effort. In the drawings:
FIG. 1 is a schematic diagram of a main flow of 3D pedestrian detection according to the present invention;
FIG. 2 is a schematic diagram of the overall design framework of the method of the present invention;
FIG. 3 is a schematic diagram of pyramid feature enhancement in the region proposal network according to the present invention.
Detailed Description
In order that the above objects, features, and advantages of the present invention may be understood more clearly, the invention is described in further detail below with reference to the accompanying drawings. It should be noted that, in the absence of conflict, the embodiments of the present invention and the features in the embodiments may be combined with each other.
Numerous specific details are set forth in the following description to facilitate a thorough understanding of the present invention; however, the invention may also be practiced in ways other than those described here, and the scope of the present invention is therefore not limited to the specific embodiments disclosed below.
Research on point cloud based 3D pedestrian detection shows that although point cloud data performs excellently in detecting large targets such as vehicles, it is not accurate enough for the 3D pedestrian detection task. The reasons are that pedestrians are small targets in the overall road scene and are easily disturbed by the background; that, because of the pedestrian's non-rigid structure, radar scanning yields less point cloud information than for vehicles, with some features missing entirely; and that the lack of image color information leaves no reference when recognizing pedestrians, so the predicted detection result cannot be further calibrated and the detection accuracy for pedestrian targets remains low.
The invention provides a method for 3D pedestrian target detection that fuses human key points with the laser radar point cloud: the 3D features of pedestrians, carrying depth information, are recognized from the point cloud data, and the pedestrians recognized from the point cloud are calibrated against the pedestrian targets recognized through the human key points in the image, realizing 3D pedestrian target detection.
FIG. 1 is a schematic diagram of the 3D pedestrian detection process. The overall idea is as follows: a 3D target detection algorithm on the radar point cloud recognizes the depth information and makes a preliminary category prediction of pedestrians, while a human key point detection scheme recognizes the key point features of the human body from the image information and detects the pedestrian target through the connectivity of the limbs; the 3D pedestrian detection result predicted from the radar point cloud is then calibrated with the pedestrian information predicted from the human body key points, and the final prediction result is output. The overall algorithm is trained as a model, tested and verified, and the detection results are analyzed.
Specifically, as shown in FIGS. 1-2, a 3D pedestrian detection method fusing human body key points and a laser radar, wherein a fisheye camera and a laser radar are mounted at the front end of a vehicle to acquire visible light images and 3D space radar point cloud data of the areas in front of and beside the vehicle, comprises the following steps:
S1: detecting human body key points based on the visible light image: the positions of human body key points are extracted from the image, the connection relations between key points are inferred, and the detection-frame position of each pedestrian is traced back; the specific process is as follows:
S1-1: annotating training data: an OpenPose key point recognition algorithm is trained on the key point data and annotations of the MSCOCO data set to detect human body key points in the input images, covering 14 key points: head, neck, left and right shoulders, left and right elbows, left and right wrists, left and right hips, left and right knees, left and right ankles;
S1-2: the OpenPose key point detection algorithm is trained with the data of step S1-1 to obtain a pedestrian key point detection network; the visible light images acquired by the fisheye camera are input to the trained network to obtain the key point detections of all pedestrians, the pedestrian to which each key point belongs is determined by the Hungarian algorithm (see the matching sketch after these steps), and finally the 2D key points of every person in the image are output;
S1-3: the maximum and minimum position coordinates of each pedestrian are obtained from the key point ownership information, and a human body candidate region is generated bottom-up to give the pedestrian candidate-frame result.
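The Hungarian assignment of step S1-2 can be illustrated with SciPy's linear_sum_assignment; in OpenPose the score matrix comes from part-affinity-field integrals, which are outside this sketch, so the matrix below is a stand-in assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_joints(score):
    """Assign candidate joints of one body part to pedestrians.

    score: (num_candidates, num_pedestrians) connection-quality matrix.
    The Hungarian algorithm maximizes the total connection score.
    """
    rows, cols = linear_sum_assignment(-score)  # negate to maximize
    return list(zip(rows, cols))                # (joint, pedestrian)

# Example: three candidate left wrists scored against two pedestrians.
s = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.4, 0.3]])
print(match_joints(s))  # [(0, 0), (1, 1)]; wrist 2 stays unmatched
```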
S2: building a radar 3D point cloud feature extraction network based on human body key points to extract features: on top of a voxel-based radar signal detection network, registration with the two-dimensional image is performed through feature dimension reduction, and the key point positions are accurately introduced according to the registration result so that the network extracts three-dimensional features around the key point positions;
S3: training and prediction of the pedestrian 3D position detection network: a neural network regresses the center position and the length, width, and height of each pedestrian in 3D space and is trained with a loss function to obtain the final pedestrian 3D detection network; the final prediction result is given after detection-frame post-processing.
In some embodiments, the radar 3D point cloud feature extraction network based on human body key points in step S2 comprises a voxel division module, a feature mapping matching module, a feature enhancement module, and a prediction module connected in series, wherein:
① the voxel division module: the 3D space radar point cloud carries three-dimensional spatial information, with width W, height H, and depth D along the X, Y, Z axes; the point cloud is divided into uniform cuboid voxels whose minimum width, height, and depth units are v_W, v_H, v_D respectively, so that the three-dimensional voxel grid generated after division has size W' = W/v_W, H' = H/v_H, D' = D/v_D; after division, a voxel cell containing more than T radar points is marked non-empty (otherwise empty), and T points are randomly sampled in each non-empty cell as that voxel's feature;
② the feature mapping matching module: point cloud features are extracted from the output of the voxel division module through a three-layer convolution network and split into two paths; each path uses a feature addition layer to reduce the dimensionality of the 3D space radar point cloud, the two reduction directions corresponding to a radar front view and a two-dimensional bird's-eye view respectively; the radar front view direction coincides with the visible light image direction, signal registration between the radar front view and the visible light image is performed in this direction, and the human body key points obtained in step S1 are introduced into the radar front view.
Taking a KITTI data acquisition vehicle as an example, a spatial point m in the laser radar coordinate system is converted into a point n in the camera coordinate system by
n = P_rect^(i) · R_rect^(0) · T_velo→cam · m
where R_rect^(0) denotes the corrected (rectifying) camera rotation matrix that brings the images into one plane; for the actual calculation it is expanded to a 4 × 4 matrix by appending a fourth row and column whose diagonal element is 1. T_velo→cam = [R t; 0 1] is the transformation matrix from the laser radar to the No. 0 gray-scale camera coordinate system, where R denotes the rotation matrix and t the translation vector. P_rect^(i) denotes the corrected camera projection matrix, expressed as
P_rect^(i) = [f_u 0 c_u -f_u·b_x^(i); 0 f_v c_v 0; 0 0 1 0]
where b_x^(i) is the offset of the i-th camera from the No. 0 gray-scale camera along the X axis (i = 2 when projecting points from the laser radar point cloud coordinate system onto the left color image), f_u and f_v are the focal lengths of the camera, and c_u and c_v are the offsets of the principal point (see the projection sketch after this module list).
③ the feature enhancement module: based on the output of the feature mapping matching module, features are stacked and enhanced from the two-dimensional bird's-eye-view direction; the layers preceding the feature addition layers, i.e. the features of the three-layer convolution network in the feature mapping matching module, are used to build a feature pyramid; features are summed over the 3-pixel neighborhood around each local extremum in the radar front view and the two-dimensional bird's-eye view, and the sums are then concatenated with the bird's-eye-view features to obtain the enhanced features;
④ the prediction module: built on the enhanced features to predict the position of the pedestrian 3D detection frame, it comprises three fully connected layers for feature abstraction, each downsampling by 1/2, followed by two prediction branches that output the pedestrian category and the pedestrian 3D coordinates respectively.
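Putting the calibration matrices of the feature mapping matching module together, the following is a minimal Python sketch of projecting lidar points into the left color image under the KITTI convention described above; the matrix names mirror the text, and the function itself is illustrative rather than the patented implementation.

```python
import numpy as np

def project_lidar_to_image(pts, Tr_velo_to_cam, R0_rect, P_rect_2):
    """Project lidar points m into camera-2 pixels (KITTI, i = 2).

    pts: (N, 3) points in the laser radar coordinate system.
    Tr_velo_to_cam: (4, 4) [R t; 0 1], lidar -> No. 0 gray-scale camera.
    R0_rect: (4, 4) rectifying rotation, expanded with a trailing 1.
    P_rect_2: (3, 4) corrected projection matrix of the left color camera.
    """
    hom = np.hstack([pts, np.ones((len(pts), 1))])         # homogeneous m
    cam = (P_rect_2 @ R0_rect @ Tr_velo_to_cam @ hom.T).T  # n = P·R·T·m
    uv = cam[:, :2] / cam[:, 2:3]                          # perspective divide
    return uv, cam[:, 2]                                   # pixels, depth
```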
In some embodiments, the specific process of step S3 is:
S3-1: the radar point cloud feature extraction network based on human body key point information built in step S2 is trained as a whole on a 3D pedestrian detection data set, and a focal loss function is introduced to optimize the prediction result; its mathematical form is:
FL(p_t) = -(1 - p_t)^γ · log(p_t)
where y denotes the label, taking values in {+1, -1} for binary classification, and p denotes the predicted probability that the sample belongs to class 1, in the range 0 to 1; for convenience p_t is written in place of p, with p_t denoting the probability that the sample belongs to the positive class; γ is the focusing parameter, γ ≥ 0, and (1 - p_t)^γ is the modulating factor: by increasing the weight of hard-to-classify samples, training focuses the model on the small, hard targets, finally yielding a high-precision pedestrian 3D detection network;
S3-2: after the 3D pedestrian detection network is obtained by the training of S3-1, the key point image and the radar point cloud are input directly into the network during detection; the network predicts the 3D pedestrian detection frames, and the final result is output after non-maximum suppression.
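Finally, the non-maximum suppression of step S3-2 can be sketched as greedy suppression over the bird's-eye-view footprints of the predicted 3D frames; axis-aligned boxes and the 0.5 IoU threshold are simplifying assumptions for illustration.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Greedy non-maximum suppression over axis-aligned 2D boxes.

    boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,) confidences.
    Returns indices of the detections to keep, highest score first.
    """
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Overlap of the current top box with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = ((boxes[rest, 2] - boxes[rest, 0]) *
                  (boxes[rest, 3] - boxes[rest, 1]))
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thr]   # drop boxes overlapping too much
    return keep
```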
In summary, the invention provides a 3D pedestrian detection method fusing human body key points and a laser radar that can be applied in many fields such as intelligent robots, augmented reality, and automatic driving. In automatic driving, for example, the optical images acquired by the camera and the point clouds scanned by the laser radar are combined by the method of the invention into a multi-sensor-fusion mode of 3D pedestrian detection.
In the present invention, the terms "first", "second", "third" and "fourth" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The term "plurality" means two or more unless expressly limited otherwise.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (4)
1. A 3D pedestrian detection method fusing human body key points and a laser radar, wherein a fisheye camera and a laser radar are mounted at the front end of a vehicle to acquire visible light images and 3D space radar point cloud data of the areas in front of and beside the vehicle, the method comprising the following steps:
S1: detecting human body key points based on the visible light image: the positions of human body key points are extracted from the image, the connection relations between key points are inferred, and the detection-frame position of each pedestrian is traced back;
S2: building a radar 3D point cloud feature extraction network based on human body key points to extract features: on top of a voxel-based radar signal detection network, registration with the two-dimensional image is performed through feature dimension reduction, and the key point positions are accurately introduced according to the registration result so that the network extracts three-dimensional features around the key point positions;
S3: training and prediction of the pedestrian 3D position detection network: a neural network regresses the center position and the length, width, and height of each pedestrian in 3D space and is trained with a loss function to obtain the final pedestrian 3D detection network; the final prediction result is given after detection-frame post-processing.
2. The 3D pedestrian detection method fusing human body key points and a laser radar according to claim 1, wherein the specific process of step S1 is as follows:
S1-1: annotating training data: an OpenPose key point recognition algorithm is trained on the key point data and annotations of the MSCOCO data set to detect human body key points in the input images, covering 14 key points: head, neck, left and right shoulders, left and right elbows, left and right wrists, left and right hips, left and right knees, left and right ankles;
S1-2: the OpenPose key point detection algorithm is trained with the data of step S1-1 to obtain a pedestrian key point detection network; the visible light images acquired by the fisheye camera are input to the trained network to obtain the key point detections of all pedestrians, the pedestrian to which each key point belongs is determined by the Hungarian algorithm, and finally the 2D key points of every person in the image are output;
S1-3: the maximum and minimum position coordinates of each pedestrian are obtained from the key point ownership information, and a human body candidate region is generated bottom-up to give the pedestrian candidate-frame result.
3. The 3D pedestrian detection method fusing human body key points and a laser radar according to claim 1 or 2, wherein the radar 3D point cloud feature extraction network based on human body key points in step S2 comprises a voxel division module, a feature mapping matching module, a feature enhancement module, and a prediction module connected in series, wherein:
① the voxel division module: the 3D space radar point cloud carries three-dimensional spatial information, with width W, height H, and depth D along the X, Y, Z axes; the point cloud is divided into uniform cuboid voxels whose minimum width, height, and depth units are v_W, v_H, v_D respectively, so that the three-dimensional voxel grid generated after division has size W' = W/v_W, H' = H/v_H, D' = D/v_D; after division, a voxel cell containing more than T radar points is marked non-empty (otherwise empty), and T points are randomly sampled in each non-empty cell as that voxel's feature;
② the feature mapping matching module: point cloud features are extracted from the output of the voxel division module through a three-layer convolution network and split into two paths; each path uses a feature addition layer to reduce the dimensionality of the 3D space radar point cloud, the two reduction directions corresponding to a radar front view and a two-dimensional bird's-eye view respectively; the radar front view direction coincides with the visible light image direction, signal registration between the radar front view and the visible light image is performed in this direction, and the human body key points obtained in step S1 are introduced into the radar front view;
③ the feature enhancement module: based on the output of the feature mapping matching module, features are stacked and enhanced from the two-dimensional bird's-eye-view direction; the layers preceding the feature addition layers, i.e. the features of the three-layer convolution network in the feature mapping matching module, are used to build a feature pyramid; features are summed over the 3-pixel neighborhood around each local extremum in the radar front view and the two-dimensional bird's-eye view, and the sums are then concatenated with the bird's-eye-view features to obtain the enhanced features;
④ the prediction module: built on the enhanced features to predict the position of the pedestrian 3D detection frame, it comprises three fully connected layers for feature abstraction, each downsampling by 1/2, followed by two prediction branches that output the pedestrian category and the pedestrian 3D coordinates respectively.
4. The 3D pedestrian detection method fusing human body key points and a laser radar according to any one of claims 1 to 3, wherein the specific process of step S3 is as follows:
S3-1: the radar point cloud feature extraction network based on human body key point information built in step S2 is trained as a whole on a 3D pedestrian detection data set, and a focal loss function is introduced to optimize the prediction result; its mathematical form is:
FL(p_t) = -(1 - p_t)^γ · log(p_t)
where y denotes the label, taking values in {+1, -1} for binary classification, and p denotes the predicted probability that the sample belongs to class 1, in the range 0 to 1; for convenience p_t is written in place of p, with p_t denoting the probability that the sample belongs to the positive class; γ is the focusing parameter, γ ≥ 0, and (1 - p_t)^γ is the modulating factor: by increasing the weight of hard-to-classify samples, training focuses the model on the small, hard targets, finally yielding a high-precision pedestrian 3D detection network;
S3-2: after the 3D pedestrian detection network is obtained by the training of S3-1, the key point image and the radar point cloud are input directly into the network during detection; the network predicts the 3D pedestrian detection frames, and the final result is output after non-maximum suppression.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210155255.XA (granted as CN114639115B) | 2022-02-21 | 2022-02-21 | Human body key point and laser radar fused 3D pedestrian detection method
Publications (2)

Publication Number | Publication Date
---|---
CN114639115A | 2022-06-17
CN114639115B | 2024-07-05
Family

ID=81946596

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202210155255.XA (active, granted as CN114639115B) | Human body key point and laser radar fused 3D pedestrian detection method | 2022-02-21 | 2022-02-21

Country Status (1)

Country | Link
---|---
CN | CN114639115B (en)
2022-02-21: application CN202210155255.XA filed in China; patent granted as CN114639115B (active).
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111127667A (en) * | 2019-11-19 | 2020-05-08 | 西北大学 | Point cloud initial registration method based on region curvature binary descriptor |
CN111243093A (en) * | 2020-01-07 | 2020-06-05 | 腾讯科技(深圳)有限公司 | Three-dimensional face grid generation method, device, equipment and storage medium |
US20210365697A1 (en) * | 2020-05-20 | 2021-11-25 | Toyota Research Institute, Inc. | System and method for generating feature space data |
CN111898405A (en) * | 2020-06-03 | 2020-11-06 | 东南大学 | Three-dimensional human ear recognition method based on 3DHarris key points and optimized SHOT characteristics |
CN113313822A (en) * | 2021-06-30 | 2021-08-27 | 深圳市豪恩声学股份有限公司 | 3D human ear model construction method, system, device and medium |
CN113807366A (en) * | 2021-09-16 | 2021-12-17 | 电子科技大学 | Point cloud key point extraction method based on deep learning |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230219578A1 (en) * | 2022-01-07 | 2023-07-13 | Ford Global Technologies, Llc | Vehicle occupant classification using radar point cloud |
US12017657B2 (en) * | 2022-01-07 | 2024-06-25 | Ford Global Technologies, Llc | Vehicle occupant classification using radar point cloud |
CN114881906A (en) * | 2022-06-24 | 2022-08-09 | 福建省海峡智汇科技有限公司 | Method and system for fusing laser point cloud and visible light image |
CN114862957A (en) * | 2022-07-08 | 2022-08-05 | 西南交通大学 | Subway car bottom positioning method based on 3D laser radar |
Similar Documents
Publication | Title
---|---
CN110415342B | Three-dimensional point cloud reconstruction device and method based on multi-fusion sensor
CN113111887B | Semantic segmentation method and system based on information fusion of camera and laser radar
CN110443898A | AR intelligent terminal target identification system and method based on deep learning
CN110852182B | Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
CN114639115B | Human body key point and laser radar fused 3D pedestrian detection method
CN114359181B | Intelligent traffic target fusion detection method and system based on image and point cloud
CN114114312A | Three-dimensional target detection method based on fusion of multi-focal-length camera and laser radar
CN114494248B | Three-dimensional target detection system and method based on point cloud and images under different visual angles
CN113688738B | Target identification system and method based on laser radar point cloud data
CN113298781B | Mars surface three-dimensional terrain detection method based on image and point cloud fusion
TWI745204B | High-efficiency LiDAR object detection method based on deep learning
Ouyang et al. | A CGANs-based scene reconstruction model using lidar point cloud
CN112270694B | Method for detecting urban environment dynamic target based on laser radar scanning pattern
CN116486287A | Target detection method and system based on environment self-adaptive robot vision system
CN114966696A | Transformer-based cross-modal fusion target detection method
Alidoost et al. | Y-shaped convolutional neural network for 3D roof elements extraction to reconstruct building models from a single aerial image
Priya et al. | 3DYOLO: Real-time 3D object detection in 3D point clouds for autonomous driving
CN118429524A | Binocular stereoscopic vision-based vehicle running environment modeling method and system
CN118038226A | Road safety monitoring method based on LiDAR and thermal infrared visible light information fusion
CN112233079B | Method and system for fusing images of multiple sensors
CN117372697A | Point cloud segmentation method and system for single-mode sparse orbit scene
CN116386003A | Three-dimensional target detection method based on knowledge distillation
CN113836975A | Binocular vision unmanned aerial vehicle obstacle avoidance method based on YOLOV3
Yang et al. | Analysis of Model Optimization Strategies for a Low-Resolution Camera-Lidar Fusion Based Road Detection Network
Nagiub et al. | 3D Object Detection for Autonomous Driving: A Comprehensive Review
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant