CN117437404B - Multi-mode target detection method based on virtual point cloud
- Publication number
- CN117437404B (application CN202311400412.XA)
- Authority
- CN
- China
- Prior art keywords
- point cloud
- network
- target detection
- virtual point
- virtual
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/08—Learning methods
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V2201/07—Target detection
Abstract
The present invention relates to the technical field of multimodal target detection, and specifically to a multimodal target detection method based on a virtual point cloud, comprising the following detection steps: inputting an image into a neural network and extracting features from the image to obtain key points of the image; constructing a virtual point cloud from the key point information in a virtual point cloud construction network; voxelizing the virtual point cloud together with the real point cloud associated with the image to obtain a voxel organization; inputting the voxelized organization into a target detection network to obtain a detection result; jointly updating the parameters of the neural network, the virtual point cloud construction network and the target detection network to obtain a multimodal target detection model composed of these three networks; and inputting an image to be classified into the multimodal target detection model to obtain its category. The present invention can effectively improve the accuracy of target detection.
Description
Technical Field
The present invention relates to the technical field of multimodal target detection, and in particular to a multimodal target detection method based on a virtual point cloud.
Background
Multimodal target detection refers to a technique that fuses information from multiple different types of sensors or data sources, such as lidar, cameras and radar, to detect and localize targets. Its purpose is to improve the accuracy and robustness of target detection while enabling a more comprehensive understanding of complex scenes.
At present, there are three main approaches to multimodal environment perception: (1) use multiple sensors to acquire data of each modality and superimpose and fuse the modal data before perception, also called pre-fusion; (2) design a neural network for each modality, use the networks to extract the required local and global features, and superimpose and fuse the modality-specific features at the feature level, also called feature fusion; (3) make logical trade-offs among the perception results of the individual modalities and combine them into a final result, also called post-fusion.
In practical target detection it is found that point cloud data are relatively sparse and the point positions are unordered, so the above existing techniques are prone to missed and false detections, which greatly reduces detection accuracy.
Summary of the Invention
In order to avoid and overcome the technical problems in the prior art, the present invention provides a multimodal target detection method based on a virtual point cloud, which can effectively improve the accuracy of target detection.
To achieve the above object, the present invention provides the following technical solution:
A multimodal target detection method based on a virtual point cloud comprises the following detection steps:
S1. Input an image into a neural network and extract features from the image to obtain the key points of the image.
S2. Construct a virtual point cloud from the key point information in the virtual point cloud construction network.
The specific steps of step S2 are as follows:
S21. Input the Gaussian map of each key point into the coordinate prediction network to obtain the predicted offset of the Gaussian map.
S22. Based on the SMOKE algorithm, compute the mean and variance of the depths of all key points by statistical analysis and, combined with the predicted Gaussian-map offset, compute the three-dimensional coordinates of each key point. The depth conversion formula is:
z_t = μ_z + δ_z·σ_z
S23. Input the key points into the confidence network to obtain the confidence of each key point.
S24. Select the key points whose confidence falls within a set range and, using their depth values together with the camera intrinsic matrix, compute a set number of virtual points in the point cloud space together with their coordinates.
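For illustration only, the following is a minimal sketch of how a key point with a predicted depth offset can be lifted into the point cloud space through the camera intrinsic matrix, as described in steps S22 and S24. It assumes a pinhole camera model and SMOKE-style depth recovery; the intrinsic values, depth statistics and function names are illustrative and not part of the claimed method.

```python
import numpy as np

def recover_depth(delta_z, mu_z, sigma_z):
    # SMOKE-style depth recovery: z = mu_z + delta_z * sigma_z, where mu_z and
    # sigma_z are depth statistics and delta_z is the network-predicted offset.
    return mu_z + delta_z * sigma_z

def backproject_keypoints(uv, z, K):
    """Lift 2D keypoints (N, 2) with depths (N,) to 3D camera-frame points (N, 3)."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (uv[:, 0] - cx) * z / fx
    y = (uv[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Example: two keypoints with predicted depth offsets (illustrative values only)
K = np.array([[721.5, 0.0, 609.6],
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])            # KITTI-like intrinsics, for illustration
uv = np.array([[400.0, 180.0], [650.0, 200.0]])
z = recover_depth(np.array([0.1, -0.3]), mu_z=28.0, sigma_z=16.3)
virtual_points = backproject_keypoints(uv, z, K)
```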
S3. Voxelize the virtual point cloud together with the real point cloud of the image to obtain a voxel organization.
S4. Input the voxelized organization into the target detection network to obtain its detection result, which is the image category corresponding to the voxelized organization.
S5. Jointly update the parameters of the neural network, the virtual point cloud construction network and the target detection network to obtain a multimodal target detection model composed of these three networks.
The key point loss function of the coordinate prediction network and the target loss function of the target detection network are combined into a joint loss function, which is used to update the parameters of the multimodal target detection model composed of the neural network, the virtual point cloud construction network and the target detection network, so as to obtain the optimal multimodal target detection model. The number of 3D key points that require expansion with a virtual uniform point cloud is recorded; it indirectly reflects the accuracy of the monocular network, and when this number is small a large loss weight μ_vp is assigned, further improving the training efficiency of the first-stage monocular network. The loss-weight optimization is computed as follows:
where ΔLoss_i and ΔLoss_{i-1} are the loss values of the current and previous training rounds, n is the number of training rounds completed, N is the number of virtual points constructed in the current round that fall within the 3D space range, N_max is the number of 3D key points that the key point network is set to select, and β is an adjustable small constant.
The total loss is the sum of the two losses:
Loss = μ1·L1 + (1 − μ2)·L2
where L1 is the localization loss of the 3D key points and L2 is the loss of the final prediction result.
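For illustration, the following is a minimal sketch of a two-part joint loss whose key point weight grows when few of the selected key points yield virtual points inside the 3D range, reflecting the idea that a less accurate monocular branch should receive a larger loss weight. The specific weighting rule, names and values are assumptions for illustration and do not reproduce the exact formula of the invention.

```python
import torch

def joint_loss(keypoint_loss, detection_loss, n_valid_virtual, n_max, beta=1e-6):
    """Illustrative two-part joint loss.

    The keypoint term gets a larger weight when only a few of the N_max selected
    keypoints produced virtual points inside the 3D range, i.e. when the monocular
    branch still needs more supervision. The weighting rule here is an assumption
    for illustration, not the exact formula of the invention.
    """
    ratio = n_valid_virtual / (n_max + beta)   # fraction of usable virtual points
    mu = 1.0 - ratio                           # small N  ->  large keypoint weight
    return mu * keypoint_loss + (1.0 - mu) * detection_loss

# Example: scalar losses from the two branches
l1 = torch.tensor(0.8, requires_grad=True)   # 3D keypoint localization loss
l2 = torch.tensor(1.2, requires_grad=True)   # final detection loss
total = joint_loss(l1, l2, n_valid_virtual=18, n_max=30)
total.backward()
```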
S6. Input the image to be classified into the multimodal target detection model to obtain its category.
As a further aspect of the present invention, the specific steps of step S1 are as follows:
S11. Input the image into the neural network, a DLA-34 network, for feature extraction to obtain the corresponding feature map.
S12. Based on CenterNet, obtain the camera coordinates of each point cloud point on the feature map.
S13. Convert the camera coordinates of the point cloud points into projection points on the XY plane of the camera coordinate system using a conversion formula.
S14. For each projection point, compute a two-dimensional Gaussian probability distribution centered at the projection point to generate a Gaussian map.
S15. Sum the Gaussian maps generated for all projection points to form the heat map, where each Gaussian map is generated from the two-dimensional Gaussian distribution described in step S14.
S16. Select the pixels with the highest two-dimensional Gaussian probability in the heat map as the key points.
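As an illustration of steps S14 to S16, the following is a minimal sketch of CenterNet-style Gaussian heat map generation and key point selection. The unnormalized Gaussian form, the fixed σ and the value of k are assumptions consistent with the description above rather than the exact formulation of the invention.

```python
import numpy as np

def gaussian_map(h, w, center, sigma=2.0):
    """Unnormalized 2D Gaussian centered at a projected keypoint (CenterNet-style)."""
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center                     # center given as (u, v) pixel coordinates
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def build_heatmap(h, w, centers, sigma=2.0):
    # Sum the per-keypoint Gaussian maps into a single heat map, as in step S15.
    heat = np.zeros((h, w), dtype=np.float32)
    for c in centers:
        heat += gaussian_map(h, w, c, sigma)
    return heat

def top_k_keypoints(heat, k=30):
    """Pick the k pixels with the highest heat values; return (u, v) coords and scores."""
    flat = heat.ravel()
    idx = np.argpartition(flat, -k)[-k:]
    idx = idx[np.argsort(-flat[idx])]
    vs, us = np.unravel_index(idx, heat.shape)
    return np.stack([us, vs], axis=1), flat[idx]
```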
As a further aspect of the present invention, the specific steps of step S3 are as follows: voxelize the obtained virtual point cloud together with the real point cloud corresponding to the image to obtain the processed voxel organization; divide the voxel organization equally into voxel blocks; and then feature-encode the points within each voxel block.
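A minimal sketch of this voxelization step follows, assuming axis-aligned voxels of fixed size, a cap of a few points per voxel and a simple mean-pooling voxel encoder; the voxel size, point cap and feature layout are illustrative choices, not values fixed by the invention.

```python
import numpy as np
from collections import defaultdict

def voxelize(points, voxel_size=(0.2, 0.2, 0.4), max_pts=5):
    """Group points (N, F) by voxel index; the first three features are x, y, z.
    Returns a dict mapping voxel index -> list of up to max_pts points."""
    voxels = defaultdict(list)
    idx = np.floor(points[:, :3] / np.asarray(voxel_size)).astype(np.int64)
    for p, key in zip(points, map(tuple, idx)):
        if len(voxels[key]) < max_pts:
            voxels[key].append(p)
    return voxels

def encode_voxel(pts):
    """Simple voxel feature: per-voxel mean of the point features
    (coordinates, reflectance, real/virtual flag, confidence)."""
    return np.asarray(pts).mean(axis=0)

# Points carry [x, y, z, reflectance, is_virtual, confidence]; random demo data.
real = np.random.rand(1000, 6); real[:, 4:] = 0.0
virtual = np.random.rand(50, 6); virtual[:, 4] = 1.0
voxels = voxelize(np.vstack([real, virtual]))
features = {k: encode_voxel(v) for k, v in voxels.items()}
```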
Compared with the prior art, the present invention has the following beneficial effects:
1. The present invention proposes a two-stage multimodal target detection method based on a virtual point cloud, i.e. the target information detected from the image is used to construct a virtual point cloud that assists point-cloud-based target detection. The method first constructs a virtual point cloud from the image-detected target information, increasing the density of the point cloud and thereby improving the representation of target features. Second, it adds a point cloud feature dimension to distinguish real from virtual points and uses voxels with confidence encoding to strengthen the correlation of the point cloud. Finally, the loss function is designed with the proportion coefficient of the virtual points and supervised training of the image detection branch is added, which improves the training efficiency of the two-stage network, avoids the error accumulation problem of two-stage end-to-end network models, and effectively improves the accuracy and robustness of the target detection system.
Brief Description of the Drawings
FIG. 1 is a flow chart of the main detection steps of the present invention.
FIG. 2 is a diagram of the overall model structure of the present invention.
FIG. 3 is a schematic diagram of the construction positions of virtual points within a voxel in the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Referring to FIGS. 1 to 3, in an embodiment of the present invention, a multimodal target detection method based on a virtual point cloud mainly consists of multimodal sensor data input, a neural network, a virtual point cloud construction network, a target detection network and joint loss training. First, the image is fed into the DLA-34 backbone network for feature extraction, and a regression network then produces the predicted coordinates and confidences of a set number of target 3D key points. Next, according to the generated key point information, corresponding virtual points are constructed in the lidar point cloud, an additional point cloud feature dimension is added to distinguish virtual from real points, and the predicted target confidence is included in the feature encoding; the virtual points are then fed, together with the real point cloud, into a voxel-based 3D target detection network. Meanwhile, to avoid the error accumulation problem of a two-stage end-to-end cascaded network, the loss function is designed with the proportion of virtual points and supervised training of the image detection branch is added, thereby improving the training efficiency of the image processing module.
The detection network proposed in the present invention also performs data expansion on the voxel block in which a virtual point lies, as shown in FIG. 3. The method is as follows: the position of the corresponding voxel block is determined from the position of the virtual point, and whether a voxel already exists at that position is checked. If it does, the voxel at that position is marked by appending a confidence value after its echo (reflectance) feature. If no real point exists at that position, the spatial distribution of the whole voxel block is considered and points are added according to a uniform distribution strategy. Since a single voxel is designed to hold at most five points and the voxel-based 3D detection method is not very sensitive in the height direction, a rectangular cross-section is chosen and four points are constructed uniformly on it; these are added to the overall point cloud data together with the virtual point.
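A minimal sketch of this in-voxel expansion follows. It assumes the four uniformly constructed points are placed at the quarter positions of a rectangular cross-section at the virtual point's height; the exact placement and the voxel size are illustrative assumptions rather than values fixed by the invention.

```python
import numpy as np

def expand_voxel(vp, voxel_size=(0.2, 0.2, 0.4)):
    """Given a virtual point [x, y, z, refl, is_virtual, conf] whose voxel contains
    no real points, construct four extra points spread uniformly over a rectangular
    cross-section of the voxel at the virtual point's height."""
    vx, vy, _ = voxel_size
    origin = np.floor(vp[:3] / np.asarray(voxel_size)) * np.asarray(voxel_size)
    offsets = np.array([[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.75, 0.75]])
    extra = []
    for ox, oy in offsets:
        p = vp.copy()
        p[0] = origin[0] + ox * vx
        p[1] = origin[1] + oy * vy   # z, reflectance, flag and confidence are inherited
        extra.append(p)
    return np.stack(extra)
```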
The main content of the present invention is as follows:
a. Image-based key point detection. The monocular image is passed through the feature extraction network to obtain a feature map of corresponding size, and, following the idea of CenterNet, the projections of the target 3D key points onto the 2D image are predicted directly. The ground-truth 3D key points in the point cloud data are projected onto the camera plane with the camera formula and encoded into 2D Gaussian maps. A Gaussian map is a two-dimensional probability distribution function that assigns higher probability values to pixels near the object center and lower values to pixels farther from the center. For each key point, a Gaussian map is generated by computing the two-dimensional Gaussian probability distribution centered at the key point position; the standard deviation of the Gaussian is usually set to a fixed value, which determines the distribution of probability values around the key point. The Gaussian maps of all key points are then summed to produce the final heat map, which represents the probability that each pixel belongs to a particular object class; the pixel with the highest probability value in each heat map is taken as the position of the corresponding key point. The image feature map is passed through a series of networks, and the prediction head outputs the predicted offset of the Gaussian map, where K denotes the camera intrinsic matrix. Then, drawing on the idea of SMOKE, the mean and variance of the 3D key point depths are first computed statistically and combined with the depth offset predicted by the prediction head to obtain the predicted 3D coordinates of the key point [x_p, y_p, z_p].
b. Constructing the virtual point cloud. The N key points with the highest confidence are selected from the final feature map of the neural network; using the predicted depth z and the camera intrinsic transformation matrix, N virtual 3D points [x_vp, y_vp, z_vp] in the point cloud space are obtained. To prevent them from exceeding the front-view range of the real point cloud, these virtual points are filtered to obtain N' points, which are added to the point cloud data; their reflection intensity is replaced by the average value of the whole point cloud.
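The following is a minimal sketch of filtering the virtual points to the front-view range of the real point cloud, assigning them the mean reflection intensity and appending a real/virtual flag and a confidence channel, as described above. The range limits are illustrative KITTI-like values rather than values fixed by the invention.

```python
import numpy as np

def merge_virtual_points(real_xyzr, virtual_xyz, conf,
                         x_range=(0.0, 70.4), y_range=(-40.0, 40.0), z_range=(-3.0, 1.0)):
    """Filter virtual points to the front-view range of the real cloud, give them the
    mean reflectance, and append a real/virtual flag and a confidence channel."""
    keep = ((virtual_xyz[:, 0] >= x_range[0]) & (virtual_xyz[:, 0] < x_range[1]) &
            (virtual_xyz[:, 1] >= y_range[0]) & (virtual_xyz[:, 1] < y_range[1]) &
            (virtual_xyz[:, 2] >= z_range[0]) & (virtual_xyz[:, 2] < z_range[1]))
    virtual_xyz, conf = virtual_xyz[keep], conf[keep]

    mean_refl = real_xyzr[:, 3].mean()
    virtual = np.hstack([virtual_xyz,
                         np.full((len(virtual_xyz), 1), mean_refl),   # reflectance
                         np.ones((len(virtual_xyz), 1)),              # is_virtual = 1
                         conf[:, None]])                              # confidence
    real = np.hstack([real_xyzr,
                      np.zeros((len(real_xyzr), 1)),                  # is_virtual = 0
                      np.zeros((len(real_xyzr), 1))])                 # no confidence
    return np.vstack([real, virtual])
```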
c. Target detection based on point cloud voxelization. The subsequent point cloud 3D target detection network is based on voxel features. Its main idea is to divide the entire 3D space into voxel blocks of equal size along the x, y and z axes. The points within each voxel block are feature-encoded, taking full account of their global and local features, to obtain voxel features, and 3D convolution is then applied for target detection.
Specific embodiment: all images are resized to a uniform size (1280*384*3) and input into the network. Features are first extracted with the DLA-34 backbone network to obtain a feature layer, and the feature layer is then passed through a prediction head and a regression head to obtain the parameters required for 3D position prediction. The prediction head predicts the 2D centers and classes of the targets by generating heat maps, and the regression head regresses the offsets needed to convert the 2D centers into 3D coordinates. The points with the highest feature values in the heat map are taken as key points; the mean and variance of the 3D key point depths are computed statistically and combined with the depth offset predicted by the prediction head to obtain the predicted 3D coordinates of the key points. Not all key points are center points: each target has only one center point, and the key points include the center point and several surrounding points. With the coordinates of the key points and their heat map feature values, i.e. their confidences, a virtual point cloud is constructed; combined with the camera intrinsic transformation matrix, N virtual 3D points in the point cloud space are obtained, each virtual point is extended with a confidence data dimension, and the virtual points are fed, together with the real point cloud, into the voxel-based 3D point cloud target detection network.
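To illustrate the head structure described in this embodiment, the following is a minimal PyTorch-style sketch of a heat map prediction head and a regression head on top of a backbone feature map such as the one produced by DLA-34. The channel counts, the downsampling factor and the set of regressed quantities are assumptions, not the exact configuration of the invention.

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Heat map head (2D centers + classes) and regression head (offsets and a
    depth offset) applied to a backbone feature map, e.g. from DLA-34."""
    def __init__(self, in_ch=64, num_classes=3, reg_dim=3):
        super().__init__()
        def head(out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(256, out_ch, 1))
        self.heatmap = head(num_classes)   # per-class keypoint heat maps
        self.regress = head(reg_dim)       # e.g. (dx, dy, delta_z) per location

    def forward(self, feat):
        return torch.sigmoid(self.heatmap(feat)), self.regress(feat)

# Example: a 1280x384 image downsampled by 4 gives a 320x96 feature map
feat = torch.randn(1, 64, 96, 320)
heads = DetectionHeads()
heat, reg = heads(feat)   # heat: (1, 3, 96, 320), reg: (1, 3, 96, 320)
```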
The multimodal detection model proposed in the present invention was evaluated on the KITTI dataset, and the results were compared with several lidar-only and multimodal 3D object detection methods. For vehicle detection, the proposed network performs well: its detection accuracy exceeds that of classic 3D point cloud detection networks and some multi-sensor information fusion networks, reaching a vehicle detection accuracy of 86.9%.
The 3D detection network proposed in the present invention performs well in detecting unobstructed targets and can also detect targets well even when they are occluded. It also performs well for long-range targets. The improvement in accuracy is mainly due to the joint processing of image and laser point cloud information: constructing a virtual point cloud from image key points means that the point clouds of distant targets in the point cloud space are no longer sparse, so distant and small objects are detected better.
Different strategies were tried during network training, including adding a loss bias weight and directly summing the two loss terms, and they were compared through the training convergence process. After the bias weight was introduced into the loss function, the model converged noticeably faster and the detection results also improved. This approach not only balances the two loss terms better but also better expresses the importance of the different detection modalities, improving model performance.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent replacement or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical solution and inventive concept of the present invention, shall fall within the scope of protection of the present invention.
Claims (3)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311400412.XA CN117437404B (en) | 2023-10-26 | 2023-10-26 | Multi-mode target detection method based on virtual point cloud |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311400412.XA CN117437404B (en) | 2023-10-26 | 2023-10-26 | Multi-mode target detection method based on virtual point cloud |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117437404A (en) | 2024-01-23 |
CN117437404B (en) | 2024-07-19 |
Family
ID=89549356
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311400412.XA Active CN117437404B (en) | 2023-10-26 | 2023-10-26 | Multi-mode target detection method based on virtual point cloud |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117437404B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105654492A (en) * | 2015-12-30 | 2016-06-08 | Harbin Institute of Technology | Robust real-time three-dimensional (3D) reconstruction method based on consumer camera |
CN113205466A (en) * | 2021-05-10 | 2021-08-03 | Nanjing University of Aeronautics and Astronautics | Incomplete point cloud completion method based on hidden space topological structure constraint |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3749976A4 (en) * | 2019-01-30 | 2021-08-18 | Baidu.com Times Technology (Beijing) Co., Ltd. | Deep learning based feature extraction for lidar localization of autonomous driving vehicles |
CN112699806B (en) * | 2020-12-31 | 2024-09-24 | Ropeok Technology Group Co., Ltd. | Three-dimensional point cloud target detection method and device based on three-dimensional heat map |
US20230080678A1 (en) * | 2021-08-26 | 2023-03-16 | The Hong Kong University Of Science And Technology | Method and electronic device for performing 3d point cloud object detection using neural network |
CN114187357A (en) * | 2021-12-10 | 2022-03-15 | Beijing Baidu Netcom Science and Technology Co., Ltd. | High-precision map production method and device, electronic equipment and storage medium |
CN114359660B (en) * | 2021-12-20 | 2022-08-26 | Hefei University of Technology | Multi-modal target detection method and system suitable for modal intensity change |
2023-10-26: Application CN202311400412.XA filed in China (CN); granted as patent CN117437404B, status active.
Also Published As
Publication number | Publication date |
---|---|
CN117437404A (en) | 2024-01-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |