CN111401264A - Vehicle target detection method and device, computer equipment and storage medium - Google Patents


Info

Publication number
CN111401264A
Authority
CN
China
Prior art keywords
target
vehicle
cloud data
point cloud
dimensional
Prior art date
Legal status
Pending
Application number
CN202010194177.5A
Other languages
Chinese (zh)
Inventor
周康明
郭义波
Current Assignee
Shanghai Eye Control Technology Co Ltd
Original Assignee
Shanghai Eye Control Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Eye Control Technology Co Ltd filed Critical Shanghai Eye Control Technology Co Ltd
Priority to CN202010194177.5A
Publication of CN111401264A


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/584Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles

Abstract

The application relates to a vehicle target detection method, apparatus, computer device and storage medium. Point cloud data of a target three-dimensional space are acquired and 2D voxelized; the 2D voxelized point cloud data are input into a preset three-dimensional vehicle detection network comprising a high-resolution network (HRNet) and a region generation network (RPN); the output results of the three-dimensional vehicle detection network are then decoded and screened to obtain the detection frames of all vehicle targets in the target three-dimensional space. The method can greatly improve the performance of three-dimensional vehicle target detection, making the detection results more accurate.

Description

Vehicle target detection method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of object detection technologies, and in particular, to a vehicle object detection method, apparatus, computer device, and storage medium.
Background
Robots and unmanned vehicles sense the surrounding spatial environment through 3D target detection technology and plan paths according to the sensed environment, thereby realizing automatic control of the robot or safe driving of the unmanned vehicle.
Common 3D detection algorithms can be divided into methods based on monocular 2D images, methods based on binocular 2D images, methods based on 3D laser point clouds, and methods that use both 2D images and 3D laser point clouds. Through these detection algorithms, information such as the position of the vehicle target in the three-dimensional space (the center point coordinates x, y and z), the length, width and height of the vehicle, and the yaw angle of the vehicle in the three-dimensional space (the angle between the vehicle head and the positive direction of the y axis) is detected, so that the vehicle target in the three-dimensional space is completely detected.
However, existing 3D vehicle detection algorithms cannot accurately detect every vehicle target.
Disclosure of Invention
In view of the above, it is necessary to provide a vehicle object detection method, apparatus, computer device and storage medium to solve the above technical problems.
In a first aspect, an embodiment of the present application provides a vehicle object detection method, including:
acquiring point cloud data of a target three-dimensional space;
performing 2D voxelization on the point cloud data of the target three-dimensional space;
inputting the point cloud data subjected to 2D voxelization into a preset three-dimensional vehicle detection network; the three-dimensional vehicle detection network comprises a high-resolution network HRNet network and a region generation network RPN network;
and decoding and screening the output result of the three-dimensional vehicle detection network to obtain detection frames of all vehicle targets in the target three-dimensional space.
In one embodiment, the 2D voxelization of the point cloud data in the target three-dimensional space includes:
dividing a target three-dimensional space into a plurality of small spaces; the point cloud data are randomly distributed in each small space;
converting the target three-dimensional space into a matrix with the same specification, wherein the cells in the matrix correspond to the small spaces in the target three-dimensional space one to one;
if the small space in the target three-dimensional space has point cloud data, filling a first value into the corresponding cell in the matrix, and if the small space in the target three-dimensional space does not have the point cloud data, filling a second value into the corresponding cell in the matrix; wherein the first value is different from the second value.
In one embodiment, the inputting the point cloud data after 2D voxelization into a preset three-dimensional vehicle detection network includes:
inputting the 2D voxelized point cloud data into an HRNet network, and extracting the depth characteristics of the 2D voxelized point cloud data;
inputting the depth features into an RPN network to obtain a classification result and a regression result of the 2D voxelized point cloud data; the classification result comprises a prediction probability result of whether a target is a vehicle target and a prediction result of the yaw angle category; the regression result predicts the deviation between the regressed anchor box and the true value; the anchor boxes represent preselected boxes generated in advance.
In one embodiment, the decoding and screening the output result of the three-dimensional vehicle detection network to obtain the detection frames of all vehicle targets in the target three-dimensional space includes:
acquiring anchor point frames with the highest probability in the prediction probability results of the vehicle targets in a preset number;
decoding the regression results of the anchor point frames in the preset number, and determining the actual output values of the anchor point frames in the preset number according to the decoding results and the prediction results of the yaw angle categories;
and screening the actual output values of the anchor points in the preset number through a non-maximum suppression algorithm and the prediction probability results of the vehicle targets corresponding to the anchor points in the preset number to obtain the detection frames of all the vehicle targets in the target three-dimensional space.
In one embodiment, the acquiring point cloud data of a target three-dimensional space includes:
and acquiring the coordinate position of each object surface point in the target three-dimensional space through a laser radar to obtain point cloud data.
In one embodiment, the training process of the three-dimensional vehicle detection network includes:
acquiring point cloud data of a plurality of sample three-dimensional spaces; the point cloud data of the sample three-dimensional space comprises standard marking frames of all vehicle targets in the sample three-dimensional space;
respectively carrying out 2D voxelization on the point cloud data of the plurality of sample three-dimensional spaces;
inputting the 2D voxelized sample point cloud data into an initial HRNet network of an initial three-dimensional vehicle detection network for depth feature extraction, and inputting the extracted depth features into an initial RPN network of the initial three-dimensional vehicle detection network to obtain classification results and regression results of the 2D voxelized sample point cloud data;
and determining the value of a loss function of the initial three-dimensional vehicle detection network according to the classification result and the regression result of the 2D voxelized point cloud data of each sample until the variation amplitude of the loss function is within a preset range, thereby obtaining the three-dimensional vehicle detection network.
In one embodiment, the method further comprises:
determining a coordinate point matrix with the same specification as the depth feature according to the depth feature extracted by the initial HRNet network;
generating, with each coordinate point in the coordinate point matrix as a center, anchor boxes whose number corresponds to the preset yaw angles;
coding a positive anchor point frame in the anchor point frames to obtain a coded regression label standard value; the positive anchor point frame indicates that the intersection ratio of the anchor point frame and the corresponding standard marking frame is greater than a preset value;
and coding the vehicle target category of the positive anchor point frame to obtain a coded vehicle target classification label standard value.
In one embodiment, the loss function includes a regression loss function, a yaw angle classification loss function, and a vehicle target classification loss function; the classification result of each 2D voxelized sample point cloud data comprises a prediction probability result of a vehicle target and a prediction result of a yaw angle category;
determining a value of a loss function of the initial three-dimensional vehicle detection network according to the classification result and the regression result of the 2D voxelized point cloud data of each sample, including:
determining the value of a regression loss function according to the regression result of the 2D voxelized point cloud data of each sample and the encoded regression label standard value;
determining the value of a yaw angle classification loss function according to the prediction result of the yaw angle classification and the coded yaw angle classification label standard value;
and determining the value of the vehicle target classification loss function according to the prediction probability result of the vehicle target and the encoded standard value of the vehicle target classification label.
In a second aspect, an embodiment of the present application provides a vehicle object detection apparatus, including:
the acquisition module is used for acquiring point cloud data of a target three-dimensional space;
the conversion module is used for carrying out 2D voxelization on the point cloud data of the target three-dimensional space;
the detection module is used for inputting the point cloud data subjected to 2D voxelization into a preset three-dimensional vehicle detection network; the three-dimensional vehicle detection network comprises a high-resolution network HRNet network and a region generation network RPN network;
and the processing module is used for decoding and screening the output result of the three-dimensional vehicle detection network to obtain the detection frames of all vehicle targets in the target three-dimensional space.
In a third aspect, an embodiment of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of any one of the methods provided in the embodiments of the first aspect when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of any one of the methods provided in the embodiments of the first aspect.
According to the vehicle target detection method, apparatus, computer device and storage medium, point cloud data of a target three-dimensional space are acquired and 2D voxelized, the 2D voxelized point cloud data are input into a preset three-dimensional vehicle detection network comprising a high-resolution network HRNet and a region generation network RPN, and the output results of the three-dimensional vehicle detection network are then decoded and screened to obtain the detection frames of all vehicle targets in the target three-dimensional space. By 2D voxelizing the three-dimensional point cloud data, the coordinate information of the point cloud is discarded and only the position information of the point cloud in the voxel space is retained; this retained position information can still represent the geometric shape of the vehicle target, so the data volume is reduced without losing the geometric shape of the vehicle target. The resulting 2D voxel data are input into the HRNet network to extract features, and the feature extraction and fusion capability of HRNet yields depth features that represent the position and geometric shape of the vehicle target with high discriminative power. Applying these depth features in the RPN network can greatly improve the performance of three-dimensional vehicle target detection, making the detection results more accurate.
Drawings
FIG. 1 is a diagram illustrating an exemplary embodiment of a vehicle target detection method;
FIG. 2 is a schematic flow chart of a vehicle object detection method according to an embodiment;
FIG. 3 is a schematic flow chart diagram illustrating a vehicle object detection method according to another embodiment;
FIG. 4 is a schematic flow chart diagram illustrating a vehicle object detection method according to another embodiment;
fig. 4a is a schematic diagram of extracting depth features of an HRNet network provided in an embodiment;
FIG. 4b is a diagram illustrating an RPN network output regression and classification results, as provided in an embodiment;
FIG. 5 is a schematic flow chart diagram illustrating a vehicle object detection method according to another embodiment;
FIG. 6 is a schematic diagram of three-dimensional point cloud data provided in accordance with an embodiment;
FIG. 7 is a schematic flow chart diagram illustrating a vehicle object detection method according to another embodiment;
FIG. 8 is a schematic flow chart diagram illustrating a vehicle object detection method according to another embodiment;
FIG. 9 is a schematic illustration of a vehicle object detection method according to another embodiment;
fig. 10 is a block diagram of a vehicle object detection device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Referring to fig. 1, the present application provides an application environment of the vehicle object detection method, in which the computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor is used to provide computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system, a computer program, and a database, and the internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database is used for storing data of the vehicle object detection method. The network interface is used for communicating with an external terminal through a network connection. The computer program is executed by the processor to implement the vehicle object detection method. It is to be understood that the internal structure of the computer device shown in fig. 1 is an example and is not intended to be limiting.
The embodiments of the application provide a vehicle target detection method and apparatus, computer device and storage medium, which can more accurately detect vehicle targets in a three-dimensional space. The following embodiments, with reference to the drawings, describe in detail the technical solutions of the present application and how they solve the above technical problems. The following specific embodiments may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. In addition, in the vehicle object detection method provided by the present application, the execution subject of fig. 2 to fig. 9 is a computer device. The execution subject of fig. 2 to 9 may also be a vehicle target detection apparatus, which may be implemented as part or all of the computer device by software, hardware, or a combination of software and hardware.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments.
In an embodiment, fig. 2 provides a vehicle target detection method. This embodiment relates to the specific process in which, after point cloud data of a target three-dimensional space are acquired by a computer device, the point cloud data are 2D voxelized, the 2D voxelized point cloud data are input into a preset three-dimensional vehicle detection network, and the output result of the three-dimensional vehicle detection network is decoded and screened. As shown in fig. 2, the method includes:
s101, point cloud data of a target three-dimensional space are obtained.
The target three-dimensional space represents a three-dimensional space to be detected, and the point cloud data represents a set formed by positions of object surface points in the x, y and z coordinate spaces of the three-dimensional space.
Illustratively, the computer device acquires point cloud data of the target three-dimensional space, which may be acquired by a 3D scanning device, for example, by a lidar. It is to be understood that since it is point cloud data of a three-dimensional space, the point cloud data acquired in this step is three-dimensional data.
S102, performing 2D voxelization on the point cloud data of the target three-dimensional space.
Based on the point cloud data of the target three-dimensional space acquired as described above, the computer device performs 2D Voxelization on the point cloud data, wherein Voxelization refers to converting a geometric representation of an object into a voxel representation closest to the object, resulting in a volume data set that not only contains surface information of the model, but also describes internal properties of the model. That is, after the three-dimensional point cloud data is 2D voxelized, the point cloud data is converted into 2D data. The 2D voxelization of the point cloud data of the target three-dimensional space may be performed by matrix conversion or by conversion using a preset algorithm model, which is not limited in this embodiment.
S103, inputting the point cloud data subjected to 2D voxelization into a preset three-dimensional vehicle detection network; the three-dimensional vehicle detection network comprises a high-resolution network HRNet network and a region generation network RPN network.
Based on the 2D voxelized point cloud data, inputting the point cloud data into a preset three-dimensional vehicle detection Network, wherein the three-dimensional vehicle detection Network includes a High Resolution Network (HRNet) and a Region generation Network (RPN). The HRNet has the following advantages as a general feature extraction backbone network: firstly, the high resolution is kept in the whole process of feature extraction, and secondly, features with different resolutions are crossed and fused in the process of feature extraction. The advantages ensure that the fusion of the multi-scale features is realized while the high-resolution features are kept in the process of passing through the HRNet, so that more abstract and discriminative features can be obtained.
Because the HRNet network can only process regular 2D data, the point cloud data are converted into 2D voxel data in step S102 so that they can be directly input into the HRNet network to obtain highly abstract features. Through the feature extraction and fusion capability of HRNet, depth features that represent the position and geometric shape of the vehicle target with high discriminative power can be obtained, and applying these depth features in the RPN network improves the performance of 3D vehicle target detection, making the obtained output frames more accurate.
And S104, decoding and screening the output result of the three-dimensional vehicle detection network to obtain detection frames of all vehicle targets in the target three-dimensional space.
After the 2D voxelized point cloud data pass through the three-dimensional vehicle detection network, the obtained output results are the classification result and the regression result output by the RPN network. The results output by the RPN network are prediction results, which can only be converted into actual output values by decoding; the optimal results are then screened out of the actual output values to obtain the detection frames of all vehicle targets in the target three-dimensional space.
According to the vehicle target detection method provided by this embodiment, point cloud data of a target three-dimensional space are acquired and 2D voxelized, the 2D voxelized point cloud data are input into a preset three-dimensional vehicle detection network comprising a high-resolution network HRNet and a region generation network RPN, and the output results of the three-dimensional vehicle detection network are then decoded and screened to obtain the detection frames of all vehicle targets in the target three-dimensional space. By 2D voxelizing the three-dimensional point cloud data, the coordinate information of the point cloud is discarded and only the position information of the point cloud in the voxel space is retained; this retained position information can still represent the geometric shape of the vehicle target, so the data volume is reduced without losing the geometric shape of the vehicle target. The resulting 2D voxel data are input into the HRNet network to extract features, the feature extraction and fusion capability of HRNet yields depth features that represent the position and geometric shape of the vehicle target with high discriminative power, and applying these depth features in the RPN network can greatly improve the performance of three-dimensional vehicle target detection, making the detection results more accurate.
On the basis of the foregoing embodiment, an embodiment of the present application further provides a vehicle target detection method, which relates to a specific process of performing 2D voxelization on point cloud data of a target three-dimensional space by using a computer device, as shown in fig. 3, where the step S102 includes:
s201, dividing a target three-dimensional space into a plurality of small spaces; the point cloud data are randomly distributed in each small space.
The point cloud data acquired at the beginning in the above steps is of a target three-dimensional space, that is, the point cloud data is distributed in the target three-dimensional space, and then after the target three-dimensional space is divided into a plurality of small spaces, the point cloud data can be randomly distributed in each small space.
Illustratively, take the target three-dimensional space as the space with x in (0, 70.4), y in (-40, 40) and z in (-3, 1), where the x, y and z ranges are given as an example for an existing lidar. The point cloud data are distributed in this target three-dimensional space. The target three-dimensional space is divided into 1408 parts along the x axis, 1600 parts along the y axis, and 40 parts along the z axis, so that 1600 × 1408 × 40 small spaces with a size of 0.05 × 0.05 × 0.1 are obtained. It should be understood that this division of the x, y and z axes is only an example; in practical applications, other numbers of divisions may be used according to the actual situation, which is not limited in this embodiment.
S202, converting the target three-dimensional space into a matrix with the same specification, wherein the cells in the matrix correspond to the small spaces in the target three-dimensional space one by one.
After the target three-dimensional space is divided into a plurality of small spaces, it is converted into a matrix with the same specification. In the converted matrix, each cell corresponds to one small space in the target three-dimensional space, that is, the converted matrix has the same number of cells as there are small spaces in the target three-dimensional space.
For example, all cells in the initially generated matrix are empty. Taking the division in step S201 of the target three-dimensional space into 1408 parts along the x axis, 1600 parts along the y axis and 40 parts along the z axis as an example, an all-zero matrix with a shape of 1600 × 1408 × 40 is generated. Each data position in the matrix has a one-to-one correspondence with a small space of the point cloud space.
S203, if the point cloud data exists in the small space in the target three-dimensional space, filling a first value in the corresponding cell in the matrix, and if the point cloud data does not exist in the small space in the target three-dimensional space, filling a second value in the corresponding cell in the matrix; wherein the first value is different from the second value.
After the matrix is generated, in this step a value is assigned to each cell in the matrix, that is, to each position in the matrix: if there is point cloud in the small space corresponding to a cell, the matrix position (cell) corresponding to that small space is assigned a first value, for example 1, and otherwise a second value, for example 0. After all cells in the matrix have been assigned in turn, 3-dimensional 2D voxelized data are obtained. For example, the matrix may be viewed as a stack of 40 layers of two-dimensional matrices of size 1600 × 1408.
In this embodiment, when the matrix having the same specification as the target three-dimensional space containing the plurality of small spaces is filled with numerical values, the positions where point cloud exists and the positions where it does not exist are assigned different values, so that the positions where the point cloud exists can be completely represented. Of course, in order to make the geometric information of the point cloud more accurate and complete, the number of small spaces may be increased when dividing the target three-dimensional space: the more (and smaller) the small spaces into which the target three-dimensional space is divided, the more accurate and complete the geometric information of the voxelized point cloud.
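A minimal sketch of this 2D voxelization is given below, assuming the example ranges and cell sizes described above (x: 0 to 70.4, y: -40 to 40, z: -3 to 1, cells of 0.05 × 0.05 × 0.1); the function name and the use of NumPy are illustrative rather than taken from the patent.

```python
import numpy as np

def voxelize_2d(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0),
                z_range=(-3.0, 1.0), cell=(0.05, 0.05, 0.1)):
    """Convert an (N, 3) point cloud into an occupancy grid of shape (ny, nx, nz).

    Each cell holds 1 if at least one point falls inside the corresponding
    small space and 0 otherwise; the exact point coordinates are discarded.
    """
    nx = int(round((x_range[1] - x_range[0]) / cell[0]))   # 1408
    ny = int(round((y_range[1] - y_range[0]) / cell[1]))   # 1600
    nz = int(round((z_range[1] - z_range[0]) / cell[2]))   # 40
    grid = np.zeros((ny, nx, nz), dtype=np.float32)        # all-zero matrix

    # Keep only points inside the target three-dimensional space.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
            (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1]))
    p = points[mask]

    # Map each point to the index of its small space and mark that cell occupied.
    ix = ((p[:, 0] - x_range[0]) / cell[0]).astype(np.int64)
    iy = ((p[:, 1] - y_range[0]) / cell[1]).astype(np.int64)
    iz = ((p[:, 2] - z_range[0]) / cell[2]).astype(np.int64)
    grid[iy, ix, iz] = 1.0                                  # first value = 1, second value = 0
    return grid

# Example: random points produce a 1600 x 1408 x 40 occupancy grid.
pts = np.random.uniform([0, -40, -3], [70.4, 40, 1], size=(1000, 3))
print(voxelize_2d(pts).shape)  # (1600, 1408, 40)
```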
In one embodiment, as shown in fig. 4, the step S103 of inputting the point cloud data after 2D voxelization into the preset three-dimensional vehicle detection network includes:
s301, inputting the point cloud data subjected to 2D voxelization into an HRNet network, and extracting the depth features of the point cloud data subjected to 2D voxelization.
The three-dimensional vehicle detection network is composed of an HRNet network and an RPN network, and specifically, the HRNet network and the RPN network can be referred to the description in step S103. In this step, the 2D voxelized point cloud data is input to the HRNet network to extract the depth features of the 2D voxelized point cloud data.
For example, take the 2D voxelized point cloud data listed in step S202 above, that is, 2D voxel data with a shape of 1600 × 1408 × 40. These can be regarded as 3-dimensional 2D voxelized data, with 40 treated as the height dimension; just as color image data usually treat the third dimension as a channel dimension, 40 can be referred to here as the channel number. The data size after voxelization is thus 1600 × 1408 with 40 channels. These data are input into the HRNet network to extract features. Fig. 4a is a schematic diagram of the HRNet network. In order to reduce the amount of computation and increase the inference speed, grouped convolution is used for the first two convolutional layers, with 8 groups in the first layer and 16 groups in the second layer. A depth feature with a shape of 200 × 176 × 128 is then obtained after passing through the HRNet network.
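As a rough illustration of the grouped-convolution input layers described above, a PyTorch sketch follows; only the 40-channel input and the group counts of 8 and 16 come from this passage, while the kernel sizes, strides and output channel counts are assumptions, and the HRNet body itself is omitted.

```python
import torch
import torch.nn as nn

# Sketch of the first two grouped convolutions before the HRNet body.
# Channel counts, kernel sizes and strides (other than groups=8 and groups=16)
# are assumed for illustration only.
stem = nn.Sequential(
    nn.Conv2d(40, 64, kernel_size=3, stride=2, padding=1, groups=8),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1, groups=16),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

# A small dummy input; the real voxel grid described above is 1 x 40 x 1600 x 1408.
voxels = torch.zeros(1, 40, 160, 128)
print(stem(voxels).shape)  # torch.Size([1, 64, 40, 32])
```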
S302, inputting the depth features into the RPN network to obtain a classification result and a regression result of the 2D voxelized point cloud data; the classification result comprises a prediction probability result of whether a target is a vehicle target and a prediction result of the yaw angle category; the regression result predicts the deviation between the regressed anchor box and the true value; the anchor boxes represent preselected boxes generated in advance.
After the depth features of the 2D voxelized point cloud data are extracted, they are input into the RPN network. The RPN network consists of a regression branch and classification branches, each composed of multiple convolutional layers. The regression branch is responsible for predicting the deviation between the anchor box (anchor-box) regressed for the vehicle target and the true value of the vehicle target, where an anchor box (anchor-box) is a preselected box generated in advance in target detection; target detection is realized by performing regression and classification on these preselected boxes. That is, the regression branch outputs the deviations of the regressed anchor box, namely Δx, Δy, Δz, Δw, Δl, Δh and Δθ (these seven parameters are described in the embodiments below), so the output result of the regression branch is a matrix with a shape of 200 × 176 × 4 × 7.
The classification branches comprise a vehicle classification branch and a yaw angle classification branch. The vehicle classification branch is responsible for predicting whether a target is a vehicle; the obtained prediction probability result gives the probability of each of the two classes, vehicle or not vehicle. If vehicle is represented by 1 and not-vehicle by 0, the output prediction probability is a continuous value between [0, 1], and the output vehicle target prediction probability result is a matrix with a shape of 200 × 176 × 4 × 2. The yaw angle classification branch is responsible for predicting the yaw angle category; its output is the prediction result of the yaw angle category, of which there are two in total (the yaw angle is positive or negative), and the matrix output by the yaw angle classification branch has a shape of 200 × 176 × 4 × 2.
Referring to fig. 4b, the data shape above each convolution box in the figure represents the shape of the data after the input passes through that convolutional layer. The RPN network has three outputs: a vehicle classification head car_cls_header (corresponding to the vehicle classification branch), a yaw angle classification head header_cls_header (corresponding to the yaw angle classification branch), and a detection box regression head reg_header (corresponding to the regression branch); the three output shapes are 200 × 176 × 4 × 2, 200 × 176 × 4 × 2 and 200 × 176 × 4 × 7, respectively.
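The three output heads can be sketched as follows in PyTorch; the head names mirror those mentioned above, but the internal structure (a single 1×1 convolution per head) and the batch dimension are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class RPNHeads(nn.Module):
    """Sketch of the three RPN output heads on a 200 x 176 x 128 feature map.

    Each position carries 4 anchor boxes; per anchor, the heads output
    2 vehicle-class scores, 2 yaw-direction scores and 7 regression values.
    A single 1x1 convolution per head is an assumption for illustration.
    """
    def __init__(self, in_channels=128, num_anchors=4):
        super().__init__()
        self.car_cls_header = nn.Conv2d(in_channels, num_anchors * 2, 1)
        self.yaw_cls_header = nn.Conv2d(in_channels, num_anchors * 2, 1)
        self.reg_header = nn.Conv2d(in_channels, num_anchors * 7, 1)

    def forward(self, feat):
        n, _, h, w = feat.shape
        car = self.car_cls_header(feat).permute(0, 2, 3, 1).reshape(n, h, w, 4, 2)
        yaw = self.yaw_cls_header(feat).permute(0, 2, 3, 1).reshape(n, h, w, 4, 2)
        reg = self.reg_header(feat).permute(0, 2, 3, 1).reshape(n, h, w, 4, 7)
        return car, yaw, reg

feat = torch.zeros(1, 128, 200, 176)
car, yaw, reg = RPNHeads()(feat)
print(car.shape, yaw.shape, reg.shape)  # (1,200,176,4,2) (1,200,176,4,2) (1,200,176,4,7)
```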
In the embodiment, the point cloud data after 2D voxelization is directly input into the HRNet network, so that the high abstract characteristics can be obtained, and then the high abstract characteristics are input into the RPN network to output the regression result and the classification result, and the accuracy of vehicle target detection is greatly improved.
In an embodiment, as shown in fig. 5, the step S104 of "decoding and screening the output result of the three-dimensional vehicle detection network to obtain the detection frames of all vehicle targets in the target three-dimensional space" includes:
s401, acquiring a preset number of anchor points with highest probability in the prediction probability result of the vehicle target.
In this step, based on the step S103, after the 2D voxelized point cloud data is input into the three-dimensional vehicle detection network, the output result of the three-dimensional vehicle detection network is obtained, and in combination with the step S302, the output result is the regression result and the classification result, and in this step, a preset number of anchor frames with the highest probability in the prediction probability results of the vehicle targets in the classification result are obtained.
Illustratively, according to the classification result, the first 1000 anchor boxes (i.e. anchor-boxes; the remainder of this description uses the term anchor-box) with the highest probability are retained, in descending order of the predicted probability of the vehicle target. For example, the predicted probabilities of the vehicle target are distributed between [0, 1], and the 1000 anchor-boxes with the highest probabilities are selected.
S402, decoding the regression results of the anchor point frames in the preset number, and determining the actual output values of the anchor point frames in the preset number according to the decoding results and the prediction results of the yaw angle types.
And decoding the regression results of the preset number of anchor-boxes, namely decoding the obtained 1000 anchor-boxes.
Illustratively, the decoding formula is as follows:
x^g = Δx · d^a + x^a,  y^g = Δy · d^a + y^a,  z^g = Δz · h^a + z^a
w^g = w^a · e^(Δw),  l^g = l^a · e^(Δl),  h^g = h^a · e^(Δh),  θ^g = Δθ + θ^a,  where d^a = sqrt((l^a)^2 + (w^a)^2)
where Δx, Δy, Δz, Δl, Δw, Δh and Δθ are the encoded regression values of the anchor-box, the superscript g represents the value to be recovered for the anchor-box (the true value during encoding), and the superscript a represents the anchor-box regressed by the RPN network. The actual output values corresponding to the 1000 anchor-boxes can thus be decoded according to the decoding formula. It should be noted that, after decoding, it is also necessary to determine, in combination with the yaw angle classification result, whether a sign should be applied to the yaw angle output by the regression; after the sign is applied, the actual output values are obtained.
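Since the decoding formulas above are reconstructed from a standard anchor offset scheme (the patent itself embeds them only as images), the following NumPy sketch of the decoding step should be read under that assumption; the function name and the example anchors are illustrative.

```python
import numpy as np

def decode_boxes(deltas, anchors):
    """Decode regression outputs back to boxes (x, y, z, w, l, h, theta).

    `deltas` and `anchors` have shape (K, 7). The offset scheme (diagonal
    normalization, log-scaled sizes) is an assumed standard encoding; the
    yaw sign is afterwards fixed using the yaw angle classification result.
    """
    xa, ya, za, wa, la, ha, ta = np.split(anchors, 7, axis=1)
    dx, dy, dz, dw, dl, dh, dt = np.split(deltas, 7, axis=1)
    diag = np.sqrt(la ** 2 + wa ** 2)            # anchor diagonal on the ground plane
    x = dx * diag + xa
    y = dy * diag + ya
    z = dz * ha + za
    w = np.exp(dw) * wa
    l = np.exp(dl) * la
    h = np.exp(dh) * ha
    theta = dt + ta
    return np.concatenate([x, y, z, w, l, h, theta], axis=1)

# Example: decode the 1000 highest-probability anchors.
anchors = np.tile([35.2, 0.0, 0.5, 1.6, 3.9, 1.56, 0.0], (1000, 1))
deltas = np.zeros((1000, 7))
print(decode_boxes(deltas, anchors).shape)  # (1000, 7)
```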
S403, screening the actual output values of the anchor frames in the preset number through a non-maximum suppression algorithm and the prediction probability results of the vehicle targets corresponding to the anchor frames in the preset number to obtain the detection frames of all the vehicle targets in the target three-dimensional space.
After the actual output values of the anchor frames (anchor-boxes) in the preset number are obtained, the final detection frames of all the vehicle targets in the target three-dimensional space are screened out by adopting a non-maximum suppression algorithm according to the actual output values of the anchor-boxes in the preset number and by combining the prediction probability results of the vehicle targets corresponding to the anchor-boxes in the preset number.
For example, from these 1000 anchor-boxes, the anchor-box with the highest predicted probability is taken first, and the 3D intersection-over-union (IoU) between it and the remaining 999 anchor-boxes is computed. If the 3D IoU is greater than a certain threshold, for example 0.1, the overlap between the two three-dimensional anchor-boxes is considered too large and the lower-scoring anchor-box is discarded; if the 3D IoU is less than 0.1, the overlap is considered small and the anchor-box is retained. After the 999 comparisons, suppose for example that 500 anchor-boxes remain; the anchor-box with the highest prediction probability among these 500 is then taken and compared in the same way with the remaining 499 anchor-boxes, discarding those whose 3D IoU with it exceeds 0.1 and retaining the others. This operation is repeated, and the final result is the set of detection frames of all vehicle targets in the target three-dimensional space.
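The screening step can be sketched as follows; note that, for brevity, this sketch uses axis-aligned bird's-eye-view IoU rather than the true 3D IoU described above, which is an assumed simplification.

```python
import numpy as np

def bev_iou(box, boxes):
    """Axis-aligned bird's-eye-view IoU between one box and many boxes.

    Boxes are (x, y, z, w, l, h, theta); rotation is ignored here for brevity,
    whereas the method above uses a true 3D IoU between rotated boxes.
    """
    x1 = np.maximum(box[0] - box[4] / 2, boxes[:, 0] - boxes[:, 4] / 2)
    x2 = np.minimum(box[0] + box[4] / 2, boxes[:, 0] + boxes[:, 4] / 2)
    y1 = np.maximum(box[1] - box[3] / 2, boxes[:, 1] - boxes[:, 3] / 2)
    y2 = np.minimum(box[1] + box[3] / 2, boxes[:, 1] + boxes[:, 3] / 2)
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = box[3] * box[4]
    area_b = boxes[:, 3] * boxes[:, 4]
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.1):
    """Greedy non-maximum suppression: keep the highest-scoring box, discard
    remaining boxes whose overlap with it exceeds the threshold, repeat."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        ious = bev_iou(boxes[best], boxes[rest])
        order = rest[ious <= iou_thresh]
    return boxes[keep]

boxes = np.random.rand(1000, 7)   # decoded actual output values
scores = np.random.rand(1000)     # vehicle target prediction probabilities
print(nms(boxes, scores).shape)
```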
The embodiment decodes the result output by the three-dimensional vehicle detection network, screens out the optimal result after decoding and outputs the optimal result as the final result, thereby ensuring the accuracy of outputting the detection frames of all vehicle targets in the target three-dimensional space finally.
In an embodiment, the acquiring point cloud data of the target three-dimensional space in the step S101 includes: and acquiring the coordinate position of each object surface point in the target three-dimensional space through a laser radar to obtain point cloud data.
The lidar comprises a transmitter and a receiver. In practical applications, the lidar can be mounted on the vehicle body, for example at the bottom or on the roof of the vehicle. Specifically, the lidar transmitter continuously emits laser pulses in all directions; when the laser hits an object surface it is reflected, and the receiver receives the reflected laser. The distance from the object surface point to the lidar is calculated from the time interval between transmission and reception and the speed of light, and is then converted into the position of the object surface point in three-dimensional space.
Fig. 6 shows the effective range of the lidar. In the coordinate system of the three-dimensional space in fig. 6, the positive x direction is the forward direction of the vehicle, the positive y direction is from left to right, and the positive z direction points from the ground to the sky; fig. 6 marks the z, x and y directions. The black solid frame in fig. 6 marks the 3D frame of the vehicle target, and the black dots are the three-dimensional point cloud.
With the method provided by this embodiment, the coordinate position of each object surface point in the target three-dimensional space is obtained, so that the point cloud data of the target three-dimensional space can be acquired efficiently, accurately and stably.
In one embodiment, as shown in fig. 7, the training process of the three-dimensional vehicle detection network includes:
s501, point cloud data of a plurality of sample three-dimensional spaces are obtained; the point cloud data of the sample three-dimensional space comprises standard marking frames of all vehicle targets in the sample three-dimensional space.
In order to ensure that the training of the three-dimensional vehicle detection network is more stable and robust and the vehicle target detection is more accurate, a large number of diversified training samples need to be obtained, and point cloud data of various sample three-dimensional spaces need to be obtained. The acquired point cloud data of the sample three-dimensional space comprises standard labeling frames of the vehicle targets in the sample three-dimensional space, namely, in order to enable the initial three-dimensional vehicle detection network to learn the labeling of the vehicle targets in the sample three-dimensional space, the vehicle targets are labeled in the pre-acquired sample data.
And S502, performing 2D voxelization on the point cloud data of the plurality of sample three-dimensional spaces respectively.
After the point cloud data of the plurality of sample three-dimensional spaces are obtained, this step performs 2D voxelization on them; the voxelization process may refer to the description of step S201 above and is not described here again.
S503, inputting the 2D voxelized sample point cloud data into an initial HRNet network of an initial three-dimensional vehicle detection network for depth feature extraction, and inputting the extracted depth features into an initial RPN network of the initial three-dimensional vehicle detection network to obtain classification results and regression results of the 2D voxelized sample point cloud data.
This step inputs each sample's 2D voxelized point cloud data into the initial HRNet network and the initial RPN network of the initial three-dimensional vehicle detection network for training. It should be noted that the processes involved in steps S501 to S503 have all been described in the foregoing embodiments; for the detailed processes, reference may be made to those embodiments, and they are not repeated here.
S504, determining the value of a loss function of the initial three-dimensional vehicle detection network according to the classification result and the regression result of the 2D voxelized point cloud data of each sample until the variation amplitude of the loss function is within a preset range, and obtaining the three-dimensional vehicle detection network.
Taking one training pass of the initial three-dimensional vehicle detection network as an example, the classification result and the regression result output by the initial RPN network and their corresponding coding labels are substituted into a preset loss function to obtain the value of the loss function, which is used to guide the direction of the next round of training of the initial three-dimensional vehicle detection network. The coding labels are the encoded values of the actual classification result and regression result corresponding to each predicted classification result and regression result; an embodiment explaining how these encoded values are obtained is provided below.
Optionally, in an embodiment, the loss function includes a regression loss function, a yaw angle classification loss function and a vehicle target classification loss function, and the classification result of each 2D voxelized sample point cloud data comprises a prediction probability result of the vehicle target and a prediction result of the yaw angle category. Determining the value of the loss function of the initial three-dimensional vehicle detection network according to the classification result and the regression result of the 2D voxelized sample point cloud data then includes: determining the value of the regression loss function according to the regression result of each 2D voxelized sample point cloud data and the encoded regression label standard value; determining the value of the yaw angle classification loss function according to the prediction result of the yaw angle category and the encoded yaw angle classification label standard value; and determining the value of the vehicle target classification loss function according to the prediction probability result of the vehicle target and the encoded vehicle target classification label standard value.
Illustratively, preset loss functions are given for the classification result and the regression result respectively: the loss function of the regression result is the regression loss function, and the loss functions of the classification results are the yaw angle classification loss function and the vehicle target classification loss function. Correspondingly, the coding labels are the encoded regression label standard value, the encoded yaw angle classification label standard value and the encoded vehicle target classification label standard value, respectively.
For example, smooth-L1 loss is selected for the regression loss function, and focal loss for the yaw angle classification loss function and the vehicle target classification loss function. During training, the learning strategy of the network is Adam, a first-order optimization algorithm that can replace the traditional stochastic gradient descent (SGD) process and iteratively update the neural network weights based on the training data. The initial learning rate is set to 0.001 and decreases gradually with iterative training, being multiplied by 0.5 every 10 epochs (1 epoch equals one pass of training over all samples in the training set; in other words, an epoch means going through the whole data set once). After 100 epochs of iterative training, the trained model is saved, and the three-dimensional vehicle detection network is obtained.
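A minimal PyTorch sketch of combining the three loss terms during training is given below, assuming focal loss for the two classification heads and smooth-L1 loss for regression as read above; the equal weighting of the terms and the restriction of the regression loss to positive anchors (mentioned only in a comment) are assumptions.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss over one-hot float targets (assumed form of the classification loss)."""
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = prob * targets + (1 - prob) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def detection_loss(reg_pred, reg_label, car_logits, car_label, yaw_logits, yaw_label):
    """Total loss = regression (smooth-L1) + vehicle classification + yaw classification.

    In practice the regression and yaw terms would be computed only over
    positive anchors; equal weighting of the three terms is assumed here.
    """
    loss_reg = F.smooth_l1_loss(reg_pred, reg_label)
    loss_car = focal_loss(car_logits, car_label)
    loss_yaw = focal_loss(yaw_logits, yaw_label)
    return loss_reg + loss_car + loss_yaw

# Dummy tensors just to show the call; shapes are per-anchor slices for brevity.
reg_pred, reg_label = torch.zeros(8, 7), torch.zeros(8, 7)
cls_pred, cls_label = torch.zeros(8, 2), torch.zeros(8, 2)
print(detection_loss(reg_pred, reg_label, cls_pred, cls_label, cls_pred, cls_label))

# Training uses Adam with an initial learning rate of 0.001, halved every 10 epochs:
# optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
```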
In the embodiment, a large amount of point cloud data of a sample three-dimensional space are obtained, and the point cloud data are subjected to 2D voxelization respectively and then input into an initial three-dimensional vehicle detection network formed by the HRNet and the RPN network for training, so that the HRNet and the RPN network can obtain high-abstraction characteristics, and the high-abstraction characteristics are input into the RPN network to output regression results and classification results, so that the accuracy of vehicle target detection is greatly improved.
In one embodiment, as shown in fig. 8, the method further comprises:
s601, according to the depth features extracted by the initial HRNet network, determining a coordinate point matrix with the same specification as the depth features.
In this embodiment, during training of the three-dimensional vehicle detection network, obtaining the coding labels for the regression result and the classification result of the point cloud data of the sample three-dimensional space can be understood as obtaining the true values in advance; these true values are subsequently compared with the predicted values output by the three-dimensional vehicle detection network during training.
Specifically, in this step, based on the depth features of the point cloud data of the sample three-dimensional space extracted by the initial HRNet network, a coordinate point matrix with the same specification as the depth features is determined. For example, the depth feature shape is 200 × 176 × 128, where 128 is the feature dimension. The corresponding bird's-eye view covers a two-dimensional range of (-40, 40) in the y-axis direction and (0, 70.4) in the x-axis direction; 200 coordinate points are taken uniformly in the range -40 to 40 and 176 coordinate points uniformly in the range 0 to 70.4, and they are finally combined into a coordinate point matrix with a shape of 200 × 176 × 3, where 3 represents the three dimensions x, y and z. The z-axis coordinates all take the value 0.5, that is, the center points of the vehicle targets are all assumed to be 0.5 m above the ground.
And S602, taking each coordinate point in the coordinate point matrix as a center, and generating anchor point frames with the number corresponding to the preset yaw angle.
After the coordinate point matrix is obtained, anchor boxes (anchor-boxes) whose number corresponds to the preset yaw angles are generated with each coordinate point in the coordinate point matrix as the center. For example, with these coordinate points as the centers of the anchor-boxes, 4 anchor-boxes with yaw angles of (0, 0.79, 1.57, 2.37) radians (i.e. 0, 45, 90 and 135 degrees respectively) are generated, and all generated anchor-boxes have a length, width and height of (3.9, 1.6, 1.56). It should be noted that these length, width and height values are taken from the average vehicle target in the KITTI data set as an example, which is not limited in this embodiment. The number of anchor-boxes is therefore 200 × 176 × 4 = 140800, and the shape of the generated anchor-box matrix is 200 × 176 × 4 × 7, where the last dimension 7 represents seven parameters: the center point coordinates x, y, z of the anchor-box, the width, length and height (w, l, h) of the anchor-box, and the yaw angle of the anchor-box.
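The anchor generation just described can be sketched as follows; the uniform spacing over the bird's-eye-view range and the function name are assumptions made for illustration.

```python
import numpy as np

def generate_anchors(ny=200, nx=176, y_range=(-40.0, 40.0), x_range=(0.0, 70.4),
                     z_center=0.5, size=(1.6, 3.9, 1.56),
                     yaws=(0.0, 0.79, 1.57, 2.37)):
    """Generate the 200 x 176 x 4 anchor boxes as (x, y, z, w, l, h, theta)."""
    ys = np.linspace(y_range[0], y_range[1], ny)
    xs = np.linspace(x_range[0], x_range[1], nx)
    yy, xx = np.meshgrid(ys, xs, indexing="ij")          # (ny, nx) grids of centers
    anchors = np.zeros((ny, nx, len(yaws), 7), dtype=np.float32)
    anchors[..., 0] = xx[..., None]                      # x center
    anchors[..., 1] = yy[..., None]                      # y center
    anchors[..., 2] = z_center                           # center assumed 0.5 m above ground
    anchors[..., 3:6] = size                             # w, l, h (KITTI vehicle average)
    anchors[..., 6] = np.asarray(yaws)                   # one anchor per preset yaw
    return anchors

anchors = generate_anchors()
print(anchors.shape, anchors.reshape(-1, 7).shape[0])  # (200, 176, 4, 7) 140800
```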
S603, encoding the positive anchor point frame in the anchor point frames to obtain a regression label standard value after encoding; the positive anchor point frame indicates that the intersection ratio of the anchor point frame and the corresponding standard marking frame is greater than a preset value.
After the anchor-boxes are generated, the positive anchor-boxes among them are encoded, where a positive anchor-box is one whose intersection-over-union with the corresponding standard labeling box is greater than a preset value.
Illustratively, the encoding is performed with the generated 140800 anchor-boxes and the corresponding true values. The true values are the ground-truth values of the vehicle targets in the point cloud scene; for example, if a scene contains 3 vehicles, the ground-truth matrix is 3 × 7, where the last dimension 7 has the same meaning as the last dimension 7 of the anchor-box. A label is assigned to each anchor-box using the anchor-boxes and the ground truths, according to the following rule: the IoU between all anchor-boxes and the ground truths is computed in the bird's-eye-view plane; if some ground truth has an IoU greater than 0.6 with an anchor-box, the classification label of that anchor-box is the positive class, and the ground truth with the largest IoU is selected for calculating the regression label described below. If the IoU of an anchor-box with all ground truths is less than 0.45, the classification label of the anchor-box is the negative class. If the IoU of an anchor-box with all ground truths is between 0.45 and 0.6, the anchor-box does not participate in the calculation.
For the anchor-box with the classification label as the positive class, the regression label of the anchor-box can be obtained by encoding in the following way. The encoding method is as follows:
Δx = (x^g − x^a) / d^a,  Δy = (y^g − y^a) / d^a,  Δz = (z^g − z^a) / h^a
Δw = log(w^g / w^a),  Δl = log(l^g / l^a),  Δh = log(h^g / h^a),  Δθ = θ^g − θ^a,  where d^a = sqrt((l^a)^2 + (w^a)^2)
where Δx, Δy, Δz, Δl, Δw, Δh and Δθ are the encoded regression labels of the anchor-box, the superscript g represents the true value corresponding to the anchor-box, and the superscript a represents the anchor-box regressed by the three-dimensional vehicle detection network. The matrix shape of the encoded regression labels is 200 × 176 × 4 × 7.
S604, coding the yaw angle category of the positive anchor point frame to obtain a coded yaw angle classification label standard value, and coding the vehicle target category of the positive anchor point frame to obtain a coded vehicle target classification label standard value.
For the positive anchor-boxes mentioned above, the regression labels can be calculated using the above formulas, but the yaw angle category of the positive anchor-boxes also needs to be coded: if the yaw angle of the ground truth corresponding to a positive anchor-box is greater than 0, it is labeled as the positive category, and if the yaw angle is less than 0, it is labeled as the negative category. (For example, the yaw angle of a vehicle target in the KITTI data set ranges from -pi to pi and can be positive or negative, whereas the anchor-box yaw angles in this embodiment only take the four values 0, 45, 90 and 135 degrees, all of which are positive, so it is necessary to judge whether the predicted yaw angle category is positive or negative.) The shape of the encoded yaw angle classification label is 200 × 176 × 4 × 2, which is a one-hot coding result. After the yaw angle category is coded, the vehicle target category of the positive anchor-boxes also needs to be coded; the vehicle target is classified into two categories, vehicle and not vehicle, and the shape of the encoded classification label is 200 × 176 × 4 × 2, where the last dimension 2 indicates the one-hot encoding result for the 0/1 classes.
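A sketch of the label assignment and regression-label encoding described in steps S603 and S604 follows; the axis-aligned IoU input and the encoding formulas follow the assumed standard offset scheme noted earlier rather than anything stated verbatim in the patent, and the function names are illustrative.

```python
import numpy as np

def encode_boxes(gt, anchors):
    """Inverse of the decode step: turn matched ground-truth boxes into
    regression labels (dx, dy, dz, dw, dl, dh, dtheta) for positive anchors."""
    xa, ya, za, wa, la, ha, ta = np.split(anchors, 7, axis=1)
    xg, yg, zg, wg, lg, hg, tg = np.split(gt, 7, axis=1)
    diag = np.sqrt(la ** 2 + wa ** 2)
    return np.concatenate([(xg - xa) / diag, (yg - ya) / diag, (zg - za) / ha,
                           np.log(wg / wa), np.log(lg / la), np.log(hg / ha),
                           tg - ta], axis=1)

def assign_labels(ious, pos_thresh=0.6, neg_thresh=0.45):
    """ious: (num_anchors, num_gt) bird's-eye-view IoU matrix.

    Returns a per-anchor class label (1 positive, 0 negative, -1 ignored)
    and the index of the best-matching ground truth for each anchor.
    """
    best_gt = ious.argmax(axis=1)
    best_iou = ious.max(axis=1)
    labels = np.full(len(ious), -1, dtype=np.int64)   # 0.45..0.6: ignored
    labels[best_iou > pos_thresh] = 1                  # positive anchors
    labels[best_iou < neg_thresh] = 0                  # negative anchors
    return labels, best_gt

ious = np.random.rand(140800, 3)        # IoU of each anchor with 3 ground-truth boxes
labels, matched = assign_labels(ious)
print(labels.shape, matched.shape)
```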
In the embodiment, the regression and classification results of the point cloud data of the sample three-dimensional space after passing through the three-dimensional vehicle detection network are encoded in advance to obtain the corresponding encoding label, and when the three-dimensional vehicle detection network is trained, the encoding label and the regression and classification results output by the three-dimensional vehicle detection network can be substituted into the preset loss function, so that the three-dimensional vehicle detection network is guided according to the value of the loss function, the trained three-dimensional vehicle detection network is more stable and robust, and a vehicle target can be detected more accurately.
In one embodiment, as shown in fig. 9, there is also provided a vehicle target detection procedure including:
S1, acquiring and labeling laser point cloud data of a three-dimensional space through a laser radar;
S2, performing 2D voxelization on the point cloud data;
S3, inputting the voxelized point cloud data into the HRNet network in the three-dimensional vehicle detection network to extract depth features;
S4, inputting the depth features into the RPN network in the three-dimensional vehicle detection network, and outputting regression and classification results of each anchor-box;
S5, generating an anchor-box at each position of the depth feature map, and encoding to obtain the coding labels of the regression and classification results of each anchor-box;
S6, training and optimizing the three-dimensional vehicle detection network based on the preset loss function, the regression and classification results output by the RPN and the coding labels of the regression and classification results, until the trained three-dimensional vehicle detection network is obtained;
S7, acquiring point cloud data of a target three-dimensional space by using a laser radar;
S8, performing 2D voxelization on the point cloud data of the target three-dimensional space;
S9, inputting the 2D voxelized point cloud data into the trained three-dimensional vehicle detection network to output regression and classification results, and decoding and screening the regression and classification results to obtain the detection frames of the vehicle targets in the target three-dimensional space.
For each step provided in this embodiment, the implementation principle and technical effect thereof can be found in the description of the foregoing embodiments, and are not described herein again.
It should be understood that although the various steps in the flow charts of fig. 2-9 are shown in order as indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the performance of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least some of the steps in fig. 2-9 may include multiple sub-steps or multiple stages that are not necessarily performed at the same moment but may be performed at different moments, and the order of performance of these sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 10, there is provided a vehicle object detecting device including: an acquisition module 10, a conversion module 11, a detection module 12 and a processing module 13, wherein,
the acquisition module 10 is used for acquiring point cloud data of a target three-dimensional space;
the conversion module 11 is configured to perform 2D voxelization on point cloud data of a target three-dimensional space;
the detection module 12 is configured to input the 2D voxelized point cloud data into a preset three-dimensional vehicle detection network; the three-dimensional vehicle detection network comprises a high-resolution network HRNet network and a region generation network RPN network;
and the processing module 13 is used for decoding and screening the output result of the three-dimensional vehicle detection network to obtain detection frames of all vehicle targets in the target three-dimensional space.
In one embodiment, the conversion module 11 includes:
a dividing unit for dividing the target three-dimensional space into a plurality of small spaces; the point cloud data are randomly distributed in each small space;
the conversion unit is used for converting the target three-dimensional space into a matrix with the same specification, and the cells in the matrix correspond to the small spaces in the target three-dimensional space one to one;
the assignment unit is used for filling a first value into the corresponding cell in the matrix if the point cloud data exists in the small space in the target three-dimensional space, and filling a second value into the corresponding cell in the matrix if the point cloud data does not exist in the small space in the target three-dimensional space; wherein the first value is different from the second value.
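As a hedged illustration of the dividing, conversion and assignment units above, a binary 2D voxelization might be sketched as follows; the detection range and cell size are assumptions chosen for illustration, and the first and second values are taken as 1 and 0:

```python
import numpy as np

def voxelize_2d(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0), cell=0.1):
    """Binary 2D voxelization: each cell of the matrix is 1 if any point falls into the
    corresponding small space, otherwise 0. Ranges and cell size are assumed values."""
    grid_x = int(round((x_range[1] - x_range[0]) / cell))
    grid_y = int(round((y_range[1] - y_range[0]) / cell))
    grid = np.zeros((grid_y, grid_x), dtype=np.float32)   # second value (no point) = 0

    ix = ((points[:, 0] - x_range[0]) / cell).astype(np.int64)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(np.int64)
    valid = (ix >= 0) & (ix < grid_x) & (iy >= 0) & (iy < grid_y)
    grid[iy[valid], ix[valid]] = 1.0                      # first value (point present) = 1
    return grid
```

With these assumed ranges the resulting matrix has 800 × 704 cells; the actual specification follows however the target three-dimensional space is divided into small spaces as described above.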
In one embodiment, the detection module 12 includes:
the feature extraction unit is used for inputting the 2D voxelized point cloud data into an HRNet network and extracting the depth features of the 2D voxelized point cloud data;
the classification regression unit is used for inputting the depth features into the RPN to obtain a classification result and a regression result of the 2D voxelized point cloud data; the classification result comprises a prediction probability result of whether a target is a vehicle target and a prediction result of the yaw angle category; the regression result is the predicted deviation between the anchor point frame and the true value.
In one embodiment, the processing module 13 includes:
the acquiring unit is used for acquiring a preset number of anchor point frames with the highest probability in the prediction probability results of the vehicle targets;
the decoding unit is used for decoding the regression results of the anchor point frames in the preset number and determining the actual output values of the anchor point frames in the preset number according to the decoding results and the prediction results of the yaw angle categories;
and the screening unit is used for screening the actual output values of the anchor frames in the preset number through a non-maximum suppression algorithm and the prediction probability results of the vehicle targets corresponding to the anchor frames in the preset number to obtain the detection frames of all the vehicle targets in the target three-dimensional space.
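A simplified sketch of the decoding and screening performed by these units is given below; the decoding inverts the encoding formulas given earlier, the yaw angle category correction is omitted, and the axis-aligned BEV non-maximum suppression and all thresholds are assumptions rather than values prescribed by the patent:

```python
import numpy as np

def decode_boxes(deltas, anchors):
    """Invert the encoding: recover (x, y, z, l, w, h, theta) from the regression output."""
    xa, ya, za, la, wa, ha, ta = np.split(anchors, 7, axis=1)
    dx, dy, dz, dl, dw, dh, dt = np.split(deltas, 7, axis=1)
    da = np.sqrt(la ** 2 + wa ** 2)
    return np.concatenate(
        [dx * da + xa, dy * da + ya, dz * ha + za,
         np.exp(dl) * la, np.exp(dw) * wa, np.exp(dh) * ha, dt + ta], axis=1)

def screen_detections(scores, boxes, top_k=100, score_thresh=0.3, iou_thresh=0.1):
    """Keep the top_k highest-scoring anchors, then apply a simplified axis-aligned
    BEV non-maximum suppression (thresholds and the approximation are assumptions)."""
    order = np.argsort(-scores)[:top_k]
    keep = []
    for i in order:
        if scores[i] < score_thresh:
            break                                # remaining anchors score even lower
        suppressed = False
        for j in keep:
            # axis-aligned BEV overlap between candidate i and an already kept box j
            xi, yi, li, wi = boxes[i, 0], boxes[i, 1], boxes[i, 3], boxes[i, 4]
            xj, yj, lj, wj = boxes[j, 0], boxes[j, 1], boxes[j, 3], boxes[j, 4]
            ox = max(0.0, min(xi + li / 2, xj + lj / 2) - max(xi - li / 2, xj - lj / 2))
            oy = max(0.0, min(yi + wi / 2, yj + wj / 2) - max(yi - wi / 2, yj - wj / 2))
            inter = ox * oy
            union = li * wi + lj * wj - inter
            if union > 0 and inter / union > iou_thresh:
                suppressed = True
                break
        if not suppressed:
            keep.append(i)
    return [boxes[i] for i in keep]
```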
In an embodiment, the obtaining module 10 is specifically configured to obtain, by using a laser radar, coordinate positions of surface points of each object in a target three-dimensional space in the target three-dimensional space, so as to obtain point cloud data.
In one embodiment, the apparatus further comprises:
the system comprises a sample acquisition module, a data acquisition module and a data acquisition module, wherein the sample acquisition module is used for acquiring point cloud data of a plurality of sample three-dimensional spaces; the point cloud data of the sample three-dimensional space comprises standard marking frames of all vehicle targets in the sample three-dimensional space;
the sample data conversion module is used for respectively carrying out 2D voxelization on the point cloud data of the plurality of sample three-dimensional spaces;
the training module is used for inputting the 2D voxelized sample point cloud data into an initial HRNet network of an initial three-dimensional vehicle detection network for depth feature extraction, and inputting the extracted depth features into an initial RPN network of the initial three-dimensional vehicle detection network to obtain classification results and regression results of the 2D voxelized sample point cloud data;
and the optimization module is used for determining the value of a loss function of the initial three-dimensional vehicle detection network according to the classification result and the regression result of the 2D voxelized point cloud data of each sample until the variation amplitude of the loss function is within a preset range, so as to obtain the three-dimensional vehicle detection network.
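As an illustrative assumption of how "the variation amplitude of the loss function is within a preset range" might be checked, a minimal stopping criterion could look as follows; the window size and tolerance are assumed values, not values specified by the patent:

```python
def loss_variation_within_range(loss_history, window=10, tol=1e-3):
    """Return True when the loss has varied by at most `tol` over the last `window` iterations."""
    if len(loss_history) < window:
        return False
    recent = loss_history[-window:]
    return max(recent) - min(recent) <= tol
```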
In one embodiment, the apparatus further comprises:
the determining module is used for determining a coordinate point matrix with the same specification as the depth feature specification according to the depth feature extracted by the initial HRNet network;
the generating module is used for generating anchor point frames with the number corresponding to the preset yaw angle by taking each coordinate point in the coordinate point matrix as a center;
the encoding module is used for encoding the positive anchor point frames in the anchor point frames to obtain the encoded regression label standard values, wherein a positive anchor point frame indicates that the intersection ratio of the anchor point frame and the corresponding standard marking frame is greater than a preset value; and is further used for coding the yaw angle category of the positive anchor point frames to obtain the coded yaw angle classification label standard value, and coding the vehicle target category of the positive anchor point frames to obtain the coded vehicle target classification label standard value.
In one embodiment, the loss function includes a regression loss function, a yaw angle classification loss function, a vehicle target classification loss function; the classification result of each 2D voxelized sample point cloud data comprises a prediction probability result of a vehicle target and a prediction result of a yaw angle category; the optimization module is specifically configured to determine a value of a regression loss function according to a regression result of each 2D voxelized sample point cloud data and the encoded regression label standard value; determining the value of a yaw angle classification loss function according to the prediction result of the yaw angle classification and the coded yaw angle classification label standard value; and determining the value of the vehicle target classification loss function according to the prediction probability result of the vehicle target and the encoded standard value of the vehicle target classification label.
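A minimal sketch of how the three loss terms might be combined is given below; the Smooth-L1 and cross-entropy forms and the weights are assumptions for illustration, not the specific loss functions prescribed by the patent:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 distance averaged over all elements (assumed form of the regression loss)."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta).mean()

def cross_entropy(probs, one_hot_label, eps=1e-7):
    """Cross-entropy against a one-hot label (assumed form of the two classification losses)."""
    return float(-(one_hot_label * np.log(probs + eps)).sum(axis=-1).mean())

def detection_loss(reg_pred, reg_label, yaw_probs, yaw_label, cls_probs, cls_label,
                   w_reg=2.0, w_yaw=0.2, w_cls=1.0):
    """Weighted sum of the regression, yaw-angle classification and vehicle target
    classification losses; the weights are illustrative assumptions."""
    return (w_reg * smooth_l1(reg_pred, reg_label)
            + w_yaw * cross_entropy(yaw_probs, yaw_label)
            + w_cls * cross_entropy(cls_probs, cls_label))
```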
The implementation principle and technical effect of all the vehicle target detection devices provided by the embodiments are similar to those of the vehicle target detection method embodiments, and are not described herein again.
For specific limitations of the vehicle object detection device, reference may be made to the above limitations of the vehicle object detection method, which are not described herein again. The respective modules in the above vehicle object detection apparatus may be wholly or partially implemented by software, hardware, or a combination thereof. Each module may be embedded in hardware form in, or be independent of, a processor in the computer device, or may be stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, the internal structure of which may be as described above in fig. 1. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a vehicle object detection method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 1 is merely a block diagram of some of the structures associated with the present solution and does not constitute a limitation on the computer devices to which the present solution applies; a particular computer device may include more or fewer components than those shown, or combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring point cloud data of a target three-dimensional space;
2D voxelization is carried out on the point cloud data of the target three-dimensional space;
inputting the point cloud data subjected to 2D voxelization into a preset three-dimensional vehicle detection network; the three-dimensional vehicle detection network comprises a high-resolution network HRNet network and a region generation network RPN network;
and decoding and screening the output result of the three-dimensional vehicle detection network to obtain detection frames of all vehicle targets in the target three-dimensional space.
The implementation principle and technical effect of the computer device provided by the above embodiment are similar to those of the above method embodiment, and are not described herein again.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring point cloud data of a target three-dimensional space;
2D voxelization is carried out on the point cloud data of the target three-dimensional space;
inputting the point cloud data subjected to 2D voxelization into a preset three-dimensional vehicle detection network; the three-dimensional vehicle detection network comprises a high-resolution network HRNet network and a region generation network RPN network;
and decoding and screening the output result of the three-dimensional vehicle detection network to obtain detection frames of all vehicle targets in the target three-dimensional space.
The implementation principle and technical effect of the computer-readable storage medium provided by the above embodiments are similar to those of the above method embodiments, and are not described herein again.
It will be understood by those of ordinary skill in the art that all or a portion of the processes of the methods of the embodiments described above may be implemented by a computer program stored on a non-volatile computer-readable storage medium; when executed, the program may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and although the description thereof is relatively specific and detailed, they should not be construed as limiting the scope of the invention. It should be noted that, for a person of ordinary skill in the art, several variations and modifications can be made without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (11)

1. A vehicle object detection method, characterized in that the method comprises:
acquiring point cloud data of a target three-dimensional space;
2D voxelization is carried out on the point cloud data of the target three-dimensional space;
inputting the point cloud data subjected to 2D voxelization into a preset three-dimensional vehicle detection network; the three-dimensional vehicle detection network comprises a high-resolution network HRNet network and a region generation network RPN network;
and decoding and screening the output result of the three-dimensional vehicle detection network to obtain detection frames of all vehicle targets in the target three-dimensional space.
2. The method of claim 1, wherein the 2D voxelization of the point cloud data of the target three-dimensional space comprises:
dividing the target three-dimensional space into a plurality of small spaces; the point cloud data are randomly distributed in each small space;
converting the target three-dimensional space into a matrix with the same specification, wherein the cells in the matrix correspond to the small spaces in the target three-dimensional space one to one;
if the point cloud data exists in the small space in the target three-dimensional space, filling a first value in the corresponding cell in the matrix, and if the point cloud data does not exist in the small space in the target three-dimensional space, filling a second value in the corresponding cell in the matrix; wherein the first value is different from the second value.
3. The method according to claim 1 or 2, wherein the inputting the 2D voxelized point cloud data into a preset three-dimensional vehicle detection network comprises:
inputting the 2D voxelized point cloud data into the HRNet network, and extracting the depth features of the 2D voxelized point cloud data;
inputting the depth features into the RPN to obtain a classification result and a regression result of the 2D voxelized point cloud data; the classification result comprises a prediction probability result of whether a target is a vehicle target and a prediction result of a yaw angle category; the regression result is a prediction result of the deviation between the anchor point frame and the true value; the anchor point frame represents a pre-selected frame generated in advance.
4. The method according to claim 3, wherein the decoding and screening the output result of the three-dimensional vehicle detection network to obtain the detection frames of all vehicle targets in the target three-dimensional space comprises:
acquiring a preset number of anchor point frames with highest probability in the prediction probability result of the vehicle target;
decoding the regression results of the anchor point frames in the preset number, and determining the actual output values of the anchor point frames in the preset number according to the decoding results and the prediction results of the yaw angle categories;
and screening the actual output values of the anchor point frames in the preset number through a non-maximum suppression algorithm and the prediction probability results of the vehicle targets corresponding to the anchor point frames in the preset number to obtain the detection frames of all the vehicle targets in the target three-dimensional space.
5. The method of claim 1 or 2, wherein the obtaining point cloud data of a target three-dimensional space comprises:
and acquiring the coordinate position of each object surface point in the target three-dimensional space through a laser radar to obtain the point cloud data.
6. The method according to claim 1 or 2, wherein the training process of the three-dimensional vehicle detection network comprises:
acquiring point cloud data of a plurality of sample three-dimensional spaces; the point cloud data of the sample three-dimensional space comprises standard marking frames of all vehicle targets in the sample three-dimensional space;
respectively performing 2D voxelization on the point cloud data of the plurality of sample three-dimensional spaces;
inputting each 2D voxelized sample point cloud data into an initial HRNet network of an initial three-dimensional vehicle detection network for depth feature extraction, and inputting the extracted depth features into an initial RPN network of the initial three-dimensional vehicle detection network to obtain a classification result and a regression result of each 2D voxelized sample point cloud data;
and determining the value of a loss function of the initial three-dimensional vehicle detection network according to the classification result and the regression result of the 2D voxelized point cloud data of each sample until the variation amplitude of the loss function is in a preset range, thereby obtaining the three-dimensional vehicle detection network.
7. The method of claim 6, further comprising:
determining a coordinate point matrix with the same specification as the depth feature according to the depth feature extracted by the initial HRNet network;
generating anchor point frames with the number corresponding to a preset yaw angle by taking each coordinate point in the coordinate point matrix as a center;
encoding the positive anchor point frames in the anchor point frames to obtain the encoded regression label standard values; the positive anchor point frame indicates that the intersection ratio of the anchor point frame and the corresponding standard marking frame is greater than a preset value;
and coding the yaw angle category of the positive anchor point frame to obtain the coded yaw angle classification label standard value, and coding the vehicle target category of the positive anchor point frame to obtain the coded vehicle target classification label standard value.
8. The method of claim 7, wherein the loss function comprises a regression loss function, a yaw angle classification loss function, a vehicle target classification loss function; the classification result of each 2D voxelized sample point cloud data comprises a prediction probability result of a vehicle target and a prediction result of a yaw angle category;
determining a value of a loss function of the initial three-dimensional vehicle detection network according to the classification result and the regression result of the 2D voxelized point cloud data of each sample, including:
determining the value of the regression loss function according to the regression result of the 2D voxelized point cloud data of each sample and the encoded regression label standard value;
determining the value of the yaw angle classification loss function according to the prediction result of the yaw angle category and the coded yaw angle classification label standard value;
and determining the value of the vehicle target classification loss function according to the prediction probability result of the vehicle target and the coded vehicle target classification label standard value.
9. A vehicle object detection apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring point cloud data of a target three-dimensional space;
the conversion module is used for carrying out 2D voxelization on the point cloud data of the target three-dimensional space;
the detection module is used for inputting the point cloud data subjected to 2D voxelization into a preset three-dimensional vehicle detection network; the three-dimensional vehicle detection network comprises a high-resolution network HRNet network and a region generation network RPN network;
and the processing module is used for decoding and screening the output result of the three-dimensional vehicle detection network to obtain the detection frames of all vehicle targets in the target three-dimensional space.
10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 8 when executing the computer program.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN202010194177.5A 2020-03-19 2020-03-19 Vehicle target detection method and device, computer equipment and storage medium Pending CN111401264A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010194177.5A CN111401264A (en) 2020-03-19 2020-03-19 Vehicle target detection method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111401264A true CN111401264A (en) 2020-07-10

Family

ID=71434339

Country Status (1)

Country Link
CN (1) CN111401264A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778688A (en) * 2015-03-27 2015-07-15 华为技术有限公司 Method and device for registering point cloud data
US20170213093A1 (en) * 2016-01-27 2017-07-27 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for detecting vehicle contour based on point cloud data
US20190011566A1 (en) * 2017-07-04 2019-01-10 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for identifying laser point cloud data of autonomous vehicle
US20200012876A1 (en) * 2017-09-25 2020-01-09 Tencent Technology (Shenzhen) Company Limited Text detection method, storage medium, and computer device
US20190258225A1 (en) * 2017-11-17 2019-08-22 Kodak Alaris Inc. Automated 360-degree dense point object inspection
CN108171217A (en) * 2018-01-29 2018-06-15 深圳市唯特视科技有限公司 A kind of three-dimension object detection method based on converged network
CN108898628A (en) * 2018-06-21 2018-11-27 北京纵目安驰智能科技有限公司 Three-dimensional vehicle object's pose estimation method, system, terminal and storage medium based on monocular
CN110364260A (en) * 2019-07-05 2019-10-22 昆山杜克大学 Autism earlier evaluations apparatus and system based on indicative language paradigm
CN110738121A (en) * 2019-09-17 2020-01-31 北京科技大学 front vehicle detection method and detection system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YIN ZHOU et al.: "VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection", ARXIV, pages 1-10 *
常明; 潘荔君; 孟宪纲; 徐玉健; 徐凯: "Research on Slope Deformation Monitoring Based on a Three-Dimensional Laser Scanner", 大地测量与地球动力学, no. 05, pages 97-101 *
杨恺 et al.: "Vehicle Detection Method Based on Deep Learning", 计算机与网络, pages 61-64 *
王荣辉; 徐红岩: "Research on Vehicle Detection in UAV Images Based on Deep Learning", 江西测绘, no. 03, pages 23-26 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112288667A (en) * 2020-11-02 2021-01-29 上海智驾汽车科技有限公司 Three-dimensional target detection method based on fusion of laser radar and camera
CN113822292A (en) * 2021-02-05 2021-12-21 深圳信息职业技术学院 Vehicle characteristic information storage method and device, computer equipment and storage medium
CN113822292B (en) * 2021-02-05 2022-05-20 深圳信息职业技术学院 Vehicle characteristic information storage method and device, computer equipment and storage medium
CN113111709A (en) * 2021-03-10 2021-07-13 北京爱笔科技有限公司 Vehicle matching model generation method and device, computer equipment and storage medium
CN113111709B (en) * 2021-03-10 2023-12-29 北京爱笔科技有限公司 Vehicle matching model generation method, device, computer equipment and storage medium
CN114140660A (en) * 2021-10-31 2022-03-04 苏州浪潮智能科技有限公司 Vehicle detection method, device, equipment and medium
CN114140660B (en) * 2021-10-31 2024-02-09 苏州浪潮智能科技有限公司 Vehicle detection method, device, equipment and medium
CN114459471A (en) * 2022-01-30 2022-05-10 中国第一汽车股份有限公司 Positioning information determination method and device, electronic equipment and storage medium
CN114459471B (en) * 2022-01-30 2023-08-11 中国第一汽车股份有限公司 Positioning information determining method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111401264A (en) Vehicle target detection method and device, computer equipment and storage medium
CN113819890B (en) Distance measuring method, distance measuring device, electronic equipment and storage medium
CN111797650B (en) Obstacle identification method, obstacle identification device, computer equipment and storage medium
CN111353512B (en) Obstacle classification method, obstacle classification device, storage medium and computer equipment
CN111353969A (en) Method and device for determining drivable area of road and computer equipment
CN110246183B (en) Wheel grounding point detection method, device and storage medium
WO2020053611A1 (en) Electronic device, system and method for determining a semantic grid of an environment of a vehicle
CN110807350A (en) System and method for visual SLAM for scan matching
CN113267761B (en) Laser radar target detection and identification method, system and computer readable storage medium
CN111144304A (en) Vehicle target detection model generation method, vehicle target detection method and device
CN110782531A (en) Method and computing device for processing three-dimensional point cloud data
CN112154448A (en) Target detection method and device and movable platform
CN114119992A (en) Multi-mode three-dimensional target detection method and device based on image and point cloud fusion
Khalil et al. Licanet: Further enhancement of joint perception and motion prediction based on multi-modal fusion
CN113658257B (en) Unmanned equipment positioning method, device, equipment and storage medium
CN115273039A (en) Small obstacle detection method based on camera
CN111401190A (en) Vehicle detection method, device, computer equipment and storage medium
CN114118247A (en) Anchor-frame-free 3D target detection method based on multi-sensor fusion
Danapal et al. Sensor fusion of camera and LiDAR raw data for vehicle detection
CN113255779A (en) Multi-source perception data fusion identification method and system and computer readable storage medium
CN112257668A (en) Main and auxiliary road judging method and device, electronic equipment and storage medium
CN116129234A (en) Attention-based 4D millimeter wave radar and vision fusion method
CN111144361A (en) Road lane detection method based on binaryzation CGAN network
CN116468768A (en) Scene depth completion method based on conditional variation self-encoder and geometric guidance
CN115937259A (en) Moving object detection method and device, flight equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination