CN117437404A - Multi-modal target detection method based on virtual point cloud - Google Patents

Multi-modal target detection method based on virtual point cloud

Info

Publication number
CN117437404A
CN117437404A
Authority
CN
China
Prior art keywords
point cloud
network
target detection
virtual point
key points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311400412.XA
Other languages
Chinese (zh)
Other versions
CN117437404B (en)
Inventor
Cheng Teng
Ni Hao
Zhang Qiang
Shi Qin
Wang Wenchong
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202311400412.XA priority Critical patent/CN117437404B/en
Publication of CN117437404A publication Critical patent/CN117437404A/en
Application granted granted Critical
Publication of CN117437404B publication Critical patent/CN117437404B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of multi-modal target detection, and in particular to a multi-modal target detection method based on virtual point cloud, comprising the following detection steps: inputting a picture into a neural network and extracting its features to obtain the key points of the picture; constructing a virtual point cloud from the key point information in a virtual point cloud construction network; voxelizing the virtual point cloud together with the real point cloud of the picture to obtain a voxel structure; inputting the voxelized structure into a target detection network to obtain a detection result; jointly updating the parameters of the neural network, the virtual point cloud construction network and the target detection network to obtain a multi-modal target detection model consisting of these three networks; and inputting pictures to be classified into the multi-modal target detection model to obtain their categories. The method can effectively improve the accuracy of target detection.

Description

Multi-modal target detection method based on virtual point cloud
Technical Field
The invention relates to the technical field of multi-modal target detection, and in particular to a multi-modal target detection method based on virtual point cloud.
Background
Multi-modal target detection refers to a technique that fuses information from multiple different types of sensors or data sources, such as lidar, cameras and radar, for target detection and localization. It aims to improve the accuracy and robustness of target detection while enabling a more comprehensive understanding of complex scenes.
Currently, there are three main multi-modal environment perception approaches: 1. acquiring each modality's data with multiple sensors and superimposing and fusing the modal data before perception, also called pre-fusion; 2. designing a neural network for each modality, using these networks to extract the required local and global features, and superimposing and fusing the corresponding modal features at the feature level, also called feature fusion; 3. logically accepting or rejecting the perception results of the individual modalities to synthesize a final result, also called post-fusion.
In practical target detection it is found that point cloud data is sparse and the point positions are unordered, so missed detections and false detections easily occur with the prior art, which greatly affects the accuracy of target detection.
Disclosure of Invention
In order to avoid and overcome the technical problems in the prior art, the invention provides a multi-modal target detection method based on virtual point cloud, which can effectively improve the accuracy of target detection.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A multi-modal target detection method based on virtual point cloud comprises the following detection steps:
S1, inputting a picture into a neural network and extracting the features of the picture to obtain the key points of the picture;
S2, constructing a virtual point cloud from the key point information in a virtual point cloud construction network;
S3, voxelizing the virtual point cloud and the real point cloud of the picture to obtain a voxel structure;
S4, inputting the voxelized structure into a target detection network to obtain a detection result;
S5, jointly updating the parameters of the neural network, the virtual point cloud construction network and the target detection network to obtain a multi-modal target detection model consisting of these three networks;
S6, inputting the pictures to be classified into the multi-modal target detection model to obtain the categories of the pictures.
As a still further aspect of the invention, the specific steps of step S1 are as follows:
S11, inputting the picture into a neural network, here a DLA-34 network, for feature extraction to obtain a corresponding feature map;
S12, acquiring the camera coordinates of each point cloud point on the feature map based on CenterNet;
S13, converting the camera coordinates of the point cloud into projection points on the XY plane of the camera coordinate system through a conversion formula;
S14, calculating a two-dimensional Gaussian probability distribution centered on the position of each projection point to generate a Gaussian map;
S15, summing the Gaussian maps generated by all the projection points to form a heat map;
S16, selecting the pixels with the maximum two-dimensional Gaussian probability in the heat map as key points.
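For illustration, a minimal Python sketch of steps S14-S16 follows, assuming the projection points are already given in pixel coordinates and using a fixed, illustrative Gaussian standard deviation (the patent does not specify these values):

    import numpy as np

    def gaussian_map(h, w, center, sigma=2.0):
        # S14: two-dimensional Gaussian probability distribution centered
        # on one projection point (cx, cy).
        ys, xs = np.mgrid[0:h, 0:w]
        cx, cy = center
        return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

    def heat_map(h, w, centers, sigma=2.0):
        # S15: sum the Gaussian maps of all projection points.
        hm = np.zeros((h, w))
        for c in centers:
            hm += gaussian_map(h, w, c, sigma)
        return hm

    def top_keypoints(hm, k=50):
        # S16: take the k pixels with the highest probability as key points.
        idx = np.argsort(hm.ravel())[::-1][:k]
        ys, xs = np.unravel_index(idx, hm.shape)
        return list(zip(xs.tolist(), ys.tolist()))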
As a still further aspect of the invention, the specific steps of step S2 are as follows:
S21, inputting the Gaussian maps of the key points into a coordinate prediction network to obtain predicted values of the Gaussian map offsets;
S22, calculating the mean and variance of the depths of all the key points by mathematical statistics based on the SMOKE algorithm, and combining them with the predicted Gaussian map offsets to obtain the three-dimensional coordinates of the key points;
S23, inputting the key points into a confidence network to obtain the confidence corresponding to each key point;
S24, selecting the key points whose confidence lies within a set range, and combining the depth values of these key points with the camera intrinsic matrix to compute a set number of virtual point cloud points and their coordinates in the point cloud space.
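A minimal sketch of steps S22 and S24, assuming key points given as pixel coordinates with a predicted depth; the depth decoding follows the SMOKE convention z_p = μ_z + δ_z·σ_z, and all names are illustrative:

    import numpy as np

    def decode_depth(delta_z, mu_z, sigma_z):
        # SMOKE-style depth recovery: z_p = mu_z + delta_z * sigma_z.
        return mu_z + delta_z * sigma_z

    def backproject(keypoints, depths, K):
        # Lift 2D key points (u, v) with depth z to virtual 3D points
        # [x_vp, y_vp, z_vp] via the inverse camera intrinsic matrix K.
        K_inv = np.linalg.inv(K)
        pts = [K_inv @ np.array([u * z, v * z, z])
               for (u, v), z in zip(keypoints, depths)]
        return np.stack(pts)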
As a still further aspect of the invention, the specific steps of step S3 are as follows: voxelizing the obtained virtual point cloud and the real point cloud corresponding to the picture to obtain a voxel structure; dividing the voxel structure equally into voxel blocks; then feature-encoding the point cloud in each voxel block; and finally inputting the encoded voxel blocks into the target detection network to predict the category of the picture.
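A minimal sketch of this voxelization step, assuming the merged point array carries extra feature dimensions (the real/virtual flag and the confidence mentioned elsewhere in the description); the voxel size and per-voxel point cap are illustrative:

    import numpy as np

    def voxelize(points, voxel_size=(0.2, 0.2, 0.2), max_pts=5):
        # points: (N, D) array, columns 0-2 are x, y, z; the remaining
        # columns are features (e.g. real/virtual flag, confidence).
        coords = np.floor(points[:, :3] / np.asarray(voxel_size)).astype(np.int64)
        voxels = {}
        for c, p in zip(map(tuple, coords), points):
            bucket = voxels.setdefault(c, [])
            if len(bucket) < max_pts:  # keep at most max_pts points per voxel
                bucket.append(p)
        return voxels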
As a still further aspect of the invention, the key point loss function of the coordinate prediction network and the target loss function of the target detection network are combined into a joint loss function; the parameters of the multi-modal target detection model consisting of the neural network, the virtual point cloud construction network and the target detection network are updated through the joint loss function to obtain the optimal multi-modal target detection model.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention provides a two-stage multi-modal target detection method based on virtual point cloud: target information detected from the image is used to construct a virtual point cloud that assists point-cloud-based target detection. The method first constructs a virtual point cloud from the image detection target information, increasing the density of the point cloud and thereby strengthening the expression of target features. Second, a feature dimension is added to the point cloud to distinguish real points from virtual points, and voxels carrying confidence codes are used to enhance the relevance of the point cloud. Finally, a loss function is designed using the proportion of virtual points, adding supervised training of the image detection stage; this improves the training efficiency of the two-stage network, avoids the model error accumulation typical of two-stage end-to-end network models, and effectively improves the accuracy and robustness of the target detection system.
Drawings
FIG. 1 is a flow chart of the main detection steps of the present invention.
Fig. 2 is an overall block diagram of a model in accordance with the present invention.
Fig. 3 is a schematic view of the virtual point cloud construction positions within a voxel in the present invention.
Detailed Description
The following clearly and completely describes the embodiments of the present invention with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 1 to 3, in an embodiment of the present invention, a multi-modal target detection method based on virtual point cloud mainly comprises multi-modal sensor data input, a neural network, a virtual point cloud construction network, a target detection network, and joint loss training. First, the image is fed into the DLA-34 backbone network for feature extraction, and a regression network produces coordinate predictions and target confidences for a certain number of target 3D key points. Corresponding virtual points are then constructed in the lidar point cloud from the generated key point information; a point cloud feature dimension is added to distinguish virtual points from real points, the output target confidence is incorporated into the feature encoding, and the encoded virtual points are fed together with the real point cloud into voxel-based 3D target detection. Meanwhile, to avoid the model error accumulation problem of a two-stage end-to-end serial network, a loss function is designed using the proportion of virtual points, adding supervised training of image detection and thus improving the training efficiency of the image processing module.
The detection network provided by the invention also performs data expansion on the voxel blocks where the virtual points lie, as shown in fig. 3, as follows. The position of the corresponding voxel block is determined from the position information of the virtual point, and the block is marked according to whether voxels already exist at that position. If real points exist there, the voxel is marked by appending a confidence value after the echo (reflection intensity) feature. If no real point exists at that position, the spatial distribution of the whole voxel block is considered and points are added according to a uniform distribution strategy. A single voxel is designed to hold at most 5 point cloud points; since voxel-based 3D object detection methods pay little attention to the height direction, a rectangular cross-section is selected and four points are uniformly constructed on it, which are added to the overall point cloud data together with the virtual points. A sketch of this expansion is given below.
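A minimal sketch of this voxel-block expansion; the placement of the four uniform points on the rectangular cross-section is an illustrative guess rather than the exact layout of fig. 3:

    import numpy as np

    def expand_voxel(center, size, z_plane, conf):
        # Uniformly construct four points on a rectangular cross-section
        # of the voxel block at height z_plane; each point is tagged as
        # virtual (flag 0) with the key point confidence appended.
        cx, cy, _ = center
        dx, dy, _ = size
        offsets = [(-0.25, -0.25), (-0.25, 0.25), (0.25, -0.25), (0.25, 0.25)]
        # columns: x, y, z, reflectance placeholder, real/virtual flag, confidence
        return np.array([[cx + ox * dx, cy + oy * dy, z_plane, 0.0, 0.0, conf]
                         for ox, oy in offsets])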
The main content of the invention is as follows:
a. Image-based key point detection. The monocular image passes through a feature extraction network to obtain a feature map of corresponding size, and, following the idea of CenterNet, the projection points of the target 3D key points on the 2D image are predicted directly. The ground-truth 3D key points in the point cloud data are converted to camera-plane projections through the camera projection formula and then encoded into 2D Gaussian maps. A Gaussian map is a two-dimensional probability distribution that assigns higher probability values to pixels near the center of the object and lower values to pixels farther from it. For each key point, a Gaussian map is generated by computing a two-dimensional Gaussian probability distribution centered on the key point location; the standard deviation of the Gaussian is usually set to a fixed value, which determines the spread of probability values around the key point. The Gaussian maps of all key points are then summed to generate a final heat map representing the likelihood that each pixel belongs to a particular object class, and the pixel with the highest probability value in each heat map is taken as the position of the corresponding key point. The image feature map passes through a series of networks, and the prediction head outputs the predicted Gaussian map offset, where K is the camera intrinsic matrix. Then, borrowing the idea of SMOKE, the mean and variance of the 3D key point depths are first computed by mathematical statistics and combined with the depth offset predicted by the prediction head to obtain the predicted 3D coordinates of the key point [x_p, y_p, z_p].
b. Construction of the virtual point cloud. The N key points with the highest confidence are selected from the final heat map; taking the predicted depth z and combining it with the camera intrinsic transformation matrix yields N virtual 3D points [x_vp, y_vp, z_vp] in the point cloud space. To avoid exceeding the front-view range of the real point cloud, the virtual points are screened and filtered, giving N' points that are added to the point cloud data; their reflection intensity is replaced by the mean value over the whole point cloud.
c. Target detection based on point cloud voxelization. The downstream point cloud 3D object detection network is based on voxel features. The main idea is to divide the whole 3D space into voxel blocks of equal size along the three axes x, y, z, feature-encode the point cloud in each voxel block while fully considering global and local features to obtain voxel features, and then perform target detection with 3D convolutions.
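A minimal sketch of a simple per-voxel feature encoding (the patent does not name the exact encoder, so a mean-style encoding stands in for the global/local feature extraction), reusing the voxelize sketch above:

    import numpy as np

    def encode_voxels(voxels):
        # Encode each voxel block as the mean of its point features.
        coords, feats = [], []
        for c, pts in voxels.items():
            coords.append(c)
            feats.append(np.stack(pts).mean(axis=0))
        return np.asarray(coords), np.stack(feats)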
Specific example: all pictures are resized to a uniform size (1280 x 384 x 3) and input into the network; features are extracted by the DLA-34 backbone to obtain a feature layer, and the parameters required for 3D position prediction are obtained through a prediction head and a regression head. The prediction head predicts the 2D center and class of the target by generating a heat map, while the regression head regresses the offsets required to convert the 2D center to 3D coordinates, etc. The points with the highest feature values in the heat map are taken as key points; the mean and variance of the 3D key point depths are computed by mathematical statistics and combined with the depth offset predicted by the prediction head to obtain the predicted 3D coordinates of the key points. The key points are not all center points: each target has only one center point, and the key points comprise the center point and surrounding points. From the key point coordinates and the heat map values at the key points, i.e. the confidences, the virtual point cloud is constructed: combining with the camera intrinsic transformation matrix yields N virtual 3D points in the point cloud space, a confidence data dimension is added to each virtual point, and the N virtual points are fed together with the real point cloud into the voxel-based point cloud 3D target detection network.
Experiments were performed on the multi-modal detection model proposed by the present invention using the KITTI dataset, and the results were compared with several lidar-only and multi-modal 3D object detection methods. For vehicle detection, the proposed network performs excellently: its detection accuracy surpasses classical 3D point cloud detection networks and certain multi-sensor information fusion networks, reaching 86.9% for vehicle detection.
The 3D detection network provided by the invention performs excellently on unobstructed targets and can still detect well even when the target is occluded. It also performs well on long-range target detection. The accuracy gain comes mainly from processing the image and the lidar point cloud information simultaneously: the virtual point cloud constructed from image key points keeps the point cloud of distant targets from being sparse, giving better detection of distant and small objects.
Different approaches were tried during network training, including adding a loss bias weight and directly summing the two partial losses, and were compared by their training convergence. After the bias weight is introduced into the loss function, the model converges noticeably faster and the detection effect partly improves. This approach not only balances the two partial losses better, but also better expresses the importance of the different detection modalities, improving the performance of the model.
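A minimal sketch of such a weighted joint loss, following the formulas given in claim 5; the schedule for the virtual-point weight is an illustrative assumption, since the full loss optimization formula is not reproduced in the text:

    def virtual_point_weight(n_valid, n_max, beta=1e-6):
        # Assign a larger key point loss weight when few constructed
        # virtual points fall within the 3D space range (n_valid small
        # relative to the configured maximum n_max).
        return 1.0 - n_valid / (n_max + beta)

    def joint_loss(l1, l2, mu1, mu2):
        # Total loss as in the claims: Loss = mu1 * L1 + (1 - mu2) * L2,
        # with L1 the 3D key point localization loss and L2 the loss of
        # the final prediction result.
        return mu1 * l1 + (1.0 - mu2) * l2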
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical scheme and inventive concept of the present invention, shall be covered by the protection scope of the present invention.

Claims (5)

1. A multi-modal target detection method based on virtual point cloud, characterized by comprising the following detection steps:
S1, inputting a picture into a neural network and extracting the features of the picture to obtain the key points of the picture;
S2, constructing a virtual point cloud from the key point information in a virtual point cloud construction network;
S3, voxelizing the virtual point cloud and the real point cloud of the picture to obtain a voxel structure;
S4, inputting the voxelized structure into a target detection network to obtain a detection result;
S5, jointly updating the parameters of the neural network, the virtual point cloud construction network and the target detection network to obtain a multi-modal target detection model consisting of these three networks;
S6, inputting the pictures to be classified into the multi-modal target detection model to obtain the categories of the pictures.
2. The multi-modal target detection method based on virtual point cloud as claimed in claim 1, wherein the specific steps of step S1 are as follows:
S11, inputting the picture into a neural network, here a DLA-34 network, for feature extraction to obtain a corresponding feature map;
S12, acquiring the camera coordinates of each point cloud point on the feature map based on CenterNet;
S13, converting the camera coordinates of the point cloud into projection points on the XY plane of the camera coordinate system through a conversion formula;
S14, calculating a two-dimensional Gaussian probability distribution centered on the position of each projection point to generate a Gaussian map;
S15, summing the Gaussian maps generated by all the projection points to form a heat map, the Gaussian kernel being generated as G(x, y) = exp(-((x - p_x)^2 + (y - p_y)^2) / (2σ^2)), where (p_x, p_y) is the projection point position and σ is the fixed standard deviation;
S16, selecting the pixels with the maximum two-dimensional Gaussian probability in the heat map as key points.
3. The multi-modal target detection method based on virtual point cloud as claimed in claim 2, wherein the specific steps of step S2 are as follows:
S21, inputting the Gaussian maps of the key points into a coordinate prediction network to obtain predicted values of the Gaussian map offsets;
S22, calculating the mean and variance of the depths of all the key points by mathematical statistics based on the SMOKE algorithm, and combining them with the predicted Gaussian map offsets to obtain the three-dimensional coordinates of the key points, the depth conversion formula being:
z_p = μ_z + δ_z σ_z
where μ_z and σ_z are the depth mean and standard deviation and δ_z is the predicted depth offset;
S23, inputting the key points into a confidence network to obtain the confidence corresponding to each key point;
S24, selecting the key points whose confidence lies within a set range, and combining the depth values of these key points with the camera intrinsic matrix to compute a set number of virtual point cloud points and their coordinates in the point cloud space.
4. The multi-modal target detection method based on virtual point cloud as claimed in claim 3, wherein the specific steps of step S3 are as follows: voxelizing the obtained virtual point cloud and the real point cloud corresponding to the picture to obtain a voxel structure; dividing the voxel structure equally into voxel blocks; then feature-encoding the point cloud in each voxel block; and finally inputting the encoded voxel blocks into the target detection network to predict the category of the picture.
5. The multi-modal target detection method based on virtual point cloud as claimed in claim 4, wherein the key point loss function of the coordinate prediction network and the target loss function of the target detection network are combined into a joint loss function; the parameters of the multi-modal target detection model consisting of the neural network, the virtual point cloud construction network and the target detection network are updated through the joint loss function to obtain the optimal multi-modal target detection model. The number of 3D key points for which the virtual uniform point cloud needs to be expanded is recorded, indirectly reflecting the accuracy of the monocular network; when this number is small, a large loss weight μ_vp is assigned, further improving the training efficiency of the first-stage monocular network. In the loss optimization calculation, ΔLoss_i and ΔLoss_{i-1} are the loss values of the current round and the previous round, N is the number of training rounds, n is the number of virtual points constructed in the current round that fall within the 3D space range, N_max is the number of 3D key points configured for the key point network, and β is an adjustable small constant.
The total loss is the sum of the two losses:
Loss = μ_1 * L_1 + (1 - μ_2) * L_2
where L_1 is the localization loss of the 3D key points and L_2 is the loss of the final prediction result.
CN202311400412.XA 2023-10-26 2023-10-26 Multi-modal target detection method based on virtual point cloud Active CN117437404B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311400412.XA CN117437404B (en) 2023-10-26 2023-10-26 Multi-modal target detection method based on virtual point cloud

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311400412.XA CN117437404B (en) 2023-10-26 2023-10-26 Multi-modal target detection method based on virtual point cloud

Publications (2)

Publication Number Publication Date
CN117437404A true CN117437404A (en) 2024-01-23
CN117437404B CN117437404B (en) 2024-07-19

Family

ID=89549356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311400412.XA Active CN117437404B (en) 2023-10-26 2023-10-26 Multi-modal target detection method based on virtual point cloud

Country Status (1)

Country Link
CN (1) CN117437404B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654492A (en) * 2015-12-30 2016-06-08 哈尔滨工业大学 Robust real-time three-dimensional (3D) reconstruction method based on consumer camera
CN113205466A (en) * 2021-05-10 2021-08-03 南京航空航天大学 Incomplete point cloud completion method based on hidden space topological structure constraint
US20210365712A1 (en) * 2019-01-30 2021-11-25 Baidu Usa Llc Deep learning-based feature extraction for lidar localization of autonomous driving vehicles
CN114359660A (en) * 2021-12-20 2022-04-15 合肥工业大学 Multi-modal target detection method and system suitable for modal intensity change
WO2022141720A1 (en) * 2020-12-31 2022-07-07 罗普特科技集团股份有限公司 Three-dimensional heat map-based three-dimensional point cloud target detection method and device
US20230080678A1 (en) * 2021-08-26 2023-03-16 The Hong Kong University Of Science And Technology Method and electronic device for performing 3d point cloud object detection using neural network
EP4194807A1 (en) * 2021-12-10 2023-06-14 Beijing Baidu Netcom Science Technology Co., Ltd. High-precision map construction method and apparatus, electronic device, and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654492A (en) * 2015-12-30 2016-06-08 哈尔滨工业大学 Robust real-time three-dimensional (3D) reconstruction method based on consumer camera
US20210365712A1 (en) * 2019-01-30 2021-11-25 Baidu Usa Llc Deep learning-based feature extraction for lidar localization of autonomous driving vehicles
WO2022141720A1 (en) * 2020-12-31 2022-07-07 罗普特科技集团股份有限公司 Three-dimensional heat map-based three-dimensional point cloud target detection method and device
CN113205466A (en) * 2021-05-10 2021-08-03 南京航空航天大学 Incomplete point cloud completion method based on hidden space topological structure constraint
US20230080678A1 (en) * 2021-08-26 2023-03-16 The Hong Kong University Of Science And Technology Method and electronic device for performing 3d point cloud object detection using neural network
EP4194807A1 (en) * 2021-12-10 2023-06-14 Beijing Baidu Netcom Science Technology Co., Ltd. High-precision map construction method and apparatus, electronic device, and storage medium
CN114359660A (en) * 2021-12-20 2022-04-15 合肥工业大学 Multi-modal target detection method and system suitable for modal intensity change

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZECHEN LIU et al.: "SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation", arXiv:2002.10111v1, 24 February 2020 (2020-02-24), pages 1-10 *
WANG Hongren et al.: "Research on a Two-Stage Object Detection Method Based on Keypoint Detection", Journal of Integration Technology, vol. 10, no. 5, 30 September 2021 (2021-09-30), pages 34-42 *

Also Published As

Publication number Publication date
CN117437404B (en) 2024-07-19

Similar Documents

Publication Publication Date Title
Yi et al. Segvoxelnet: Exploring semantic context and depth-aware features for 3d vehicle detection from point cloud
CN108648161A (en) The binocular vision obstacle detection system and method for asymmetric nuclear convolutional neural networks
CN113052109A (en) 3D target detection system and 3D target detection method thereof
CN111998862B (en) BNN-based dense binocular SLAM method
Wang et al. VoPiFNet: Voxel-Pixel Fusion Network for Multi-Class 3D Object Detection
CN115512132A (en) 3D target detection method based on point cloud data and multi-view image data fusion
CN116246119A (en) 3D target detection method, electronic device and storage medium
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN114067075A (en) Point cloud completion method and device based on generation of countermeasure network
CN115100741B (en) Point cloud pedestrian distance risk detection method, system, equipment and medium
CN114608522B (en) Obstacle recognition and distance measurement method based on vision
CN116664856A (en) Three-dimensional target detection method, system and storage medium based on point cloud-image multi-cross mixing
CN116703996A (en) Monocular three-dimensional target detection algorithm based on instance-level self-adaptive depth estimation
CN115511759A (en) Point cloud image depth completion method based on cascade feature interaction
CN117576665B (en) Automatic driving-oriented single-camera three-dimensional target detection method and system
Feng et al. Object detection and localization based on binocular vision for autonomous vehicles
CN116778262B (en) Three-dimensional target detection method and system based on virtual point cloud
CN117315518A (en) Augmented reality target initial registration method and system
Lyu et al. 3DOPFormer: 3D occupancy perception from multi-camera images with directional and distance enhancement
Miao et al. 3D Object Detection with Normal-map on Point Clouds.
Zhao et al. DHA: Lidar and vision data fusion-based on road object classifier
CN117437404B (en) Multi-modal target detection method based on virtual point cloud
CN115937520A (en) Point cloud moving target segmentation method based on semantic information guidance
CN113569803A (en) Multi-mode data fusion lane target detection method and system based on multi-scale convolution
Vatavu et al. Environment perception using dynamic polylines and particle based occupancy grids

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant