CN111144304A - Vehicle target detection model generation method, vehicle target detection method and device - Google Patents

Vehicle target detection model generation method, vehicle target detection method and device

Info

Publication number
CN111144304A
Authority
CN
China
Prior art keywords
features
point cloud
original
target detection
vehicle target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911371122.0A
Other languages
Chinese (zh)
Inventor
周康明
郭义波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eye Control Technology Co Ltd
Original Assignee
Shanghai Eye Control Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Eye Control Technology Co Ltd filed Critical Shanghai Eye Control Technology Co Ltd
Priority to CN201911371122.0A
Publication of CN111144304A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/50 - Context or environment of the image
    • G06V 20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features

Abstract

The application relates to a method for generating a vehicle target detection model, a vehicle target detection method, an apparatus, a computer device, and a storage medium. The method comprises the following steps: extracting features from an acquired point cloud sample; inputting the obtained original features into a point cloud segmentation network and detecting foreground points in the point cloud sample; fusing the original features of the foreground points to generate shallow features; repeatedly fusing the multi-scale features generated from the shallow features to generate deep features; inputting the deep features into a regional candidate network and outputting a regression result and a classification result; and completing the training of the vehicle target detection model according to the regression result and the classification result. In the method, foreground/background point prediction on the point cloud and the final target detection are performed in sequence, realizing end-to-end training with a simple implementation process. Useful information is extracted while background point clouds are eliminated, which accelerates the generation and use of the model and improves the precision of vehicle target detection.

Description

Vehicle target detection model generation method, vehicle target detection method and device
Technical Field
The present application relates to the field of vehicle detection technologies, and in particular, to a method and an apparatus for generating a vehicle target detection model, a computer device, and a storage medium, and a method and an apparatus for detecting a vehicle target, a computer device, and a storage medium.
Background
With the development of artificial intelligence technology, Internet technology, and the automobile industry, vehicles are becoming increasingly intelligent. Automotive intelligence includes driving assistance and unmanned driving. For unmanned driving, safety is one of the most critical concerns. Accurate perception of the surrounding environment by the unmanned vehicle is the basis for ensuring safety, and vehicle target detection based on machine vision is an important part of this perception.
In recent years, deep learning has developed rapidly, and convolutional neural networks are widely used in image recognition and target detection. Many deep learning models have also been applied to three-dimensional vehicle target detection, such as VoxelNet (a network that learns three-dimensional spatial information layer by layer from point clouds), SECOND (Sparsely Embedded Convolutional Detection), and Frustum PointNet (a three-dimensional target detection network). These models take laser point clouds received from a lidar or images obtained from a camera as input, and compute the position, size, and yaw angle of a vehicle target in three-dimensional space. In addition, three-dimensional vehicle target detection models based on deep learning have been generated in the related art; these models mainly obtain shallow features from the point cloud, perform deep feature fusion on the shallow features to obtain deep features, and perform regression and classification on the deep features to obtain detection results. However, because existing models are trained and run on all points of the point cloud, there are many interference factors during training and detection, and the detection results are not accurate enough.
Disclosure of Invention
Based on this, it is necessary to provide a method, an apparatus, a computer device and a storage medium for generating an end-to-end vehicle object detection model, and a method, an apparatus, a computer device and a storage medium for vehicle object detection, which can simplify the vehicle object detection process.
In order to achieve the above object, in one aspect, an embodiment of the present application provides a method for generating a vehicle object detection model, where the method includes:
extracting the characteristics of the obtained point cloud sample to obtain original characteristics;
inputting the original features into a point cloud segmentation network, and detecting foreground points in the point cloud sample;
fusing the original features of the foreground points to generate shallow features;
repeatedly fusing the multi-scale features generated according to the shallow features to generate deep features;
inputting the deep features into a regional candidate network, and outputting a regression result and a classification result;
and training the vehicle target detection model according to the regression result and the classification result to generate a trained vehicle target detection model.
On the other hand, the embodiment of the application also provides a vehicle target detection method, which comprises the following steps:
acquiring a point cloud to be detected;
and inputting the point cloud to be detected into a vehicle target detection model to generate a detection result of the vehicle target, wherein the vehicle target detection model is generated by adopting the method described above.
On the other hand, the embodiment of the present application further provides a device for generating a vehicle object detection model, where the device includes:
the characteristic extraction module is used for extracting the characteristics of the acquired point cloud sample to obtain original characteristics;
the point cloud segmentation module is used for inputting the original features into a point cloud segmentation network and detecting foreground points in the point cloud sample;
the shallow feature generation module is used for fusing the original features of the foreground points to generate shallow features;
the deep feature generation module is used for repeatedly fusing the multi-scale features generated according to the shallow features to generate deep features;
the training data generation module is used for inputting the deep features into the regional candidate network and outputting a regression result and a classification result;
and the model training module is used for training the vehicle target detection model according to the regression result and the classification result to generate a trained vehicle target detection model.
In another aspect, an embodiment of the present application further provides a vehicle target detection apparatus, where the apparatus includes:
the point cloud to be detected acquisition module is used for acquiring point cloud to be detected;
and the detection result generation module is used for inputting the point cloud to be detected into the vehicle target detection model to generate a detection result of the vehicle target, wherein the vehicle target detection model is generated by adopting the method described above.
In yet another aspect, an embodiment of the present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the method when executing the computer program.
In yet another aspect, the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the method.
According to the method and the device for generating the vehicle target detection model, the computer equipment and the storage medium, the original characteristics are obtained by performing characteristic extraction on the acquired point cloud sample; inputting the original features into a point cloud segmentation network, and detecting foreground points in the point cloud sample; fusing the original features of the foreground points to generate shallow features; repeatedly fusing the multi-scale features generated according to the shallow features to generate deep features; and inputting the deep features into the regional candidate network, and outputting a regression result and a classification result. In the technical scheme, the foreground and background point prediction and the final target detection of the point cloud are sequentially carried out, end-to-end training is realized, and the implementation process is simple. In addition, foreground points in the point cloud sample are detected by using the point cloud segmentation network, and shallow features and deep features are generated based on original features of the foreground points. The method has the advantages that the extraction and shallow fusion of useful information are realized under the condition that the background point cloud is eliminated, the generation and use efficiency of the vehicle target detection model can be accelerated, and the interference of the background point cloud on the vehicle target detection is eliminated, so that the vehicle target detection precision is improved.
Drawings
FIG. 1 is a diagram of an application environment of a method for generating a vehicle object detection model in one embodiment;
FIG. 2 is a schematic flow chart diagram illustrating a method for generating a vehicle object detection model in one embodiment;
FIG. 3 is a schematic diagram of a sample point cloud obtained in one embodiment;
FIG. 4 is a diagram illustrating the extraction of raw features using a feature extraction network, according to one embodiment;
FIG. 5 is a schematic diagram illustrating an exemplary process for foreground point segmentation of a point cloud sample;
FIG. 6a is a schematic view of a bird's eye view plane of a point cloud sample in one embodiment;
FIG. 6b is a schematic diagram of a grid division of a bird's eye view plane of a point cloud sample according to an embodiment;
FIG. 7 is a diagram illustrating a point cloud segmentation network performing foreground point segmentation on a point cloud sample in accordance with an embodiment;
FIG. 8 is a flow diagram illustrating the determination of foreground points using local features, full scene features, and raw features in one embodiment;
FIG. 9 is a diagram illustrating shallow feature generation via the shallow feature fusion network in one embodiment;
FIG. 10 is a diagram illustrating deep feature generation via the deep fusion network and the regression and classification results obtained via the regional candidate network in one embodiment;
FIG. 11 is a schematic flow chart diagram illustrating a method for vehicle object detection in one embodiment;
FIG. 12 is a block diagram showing an example of a device for generating a vehicle object detection model;
FIG. 13 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The generation method of the vehicle target detection model provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 and the server 104 communicate via a network. The point cloud data may be pre-stored in the server 104, and may be called from the server 104 when the vehicle target detection model needs to be trained; or the data can be acquired in real time through a laser radar and other acquisition devices. The terminal 102 is pre-deployed with a vehicle target detection model to be trained, and the vehicle target detection model may be composed of a plurality of network combinations based on deep learning, and the network combinations may include, but are not limited to, a feature extraction network, a point cloud segmentation network, a shallow feature fusion network, a deep feature fusion network, and a region candidate network. Specifically, the terminal 102 performs feature extraction on the acquired point cloud sample to obtain an original feature; inputting the original features into a point cloud segmentation network, and detecting foreground points in the point cloud sample; fusing the original features of the foreground points to generate shallow features; repeatedly fusing the multi-scale features generated according to the shallow features to generate deep features; inputting the deep features into a regional candidate network, and outputting a regression result and a classification result; and training the vehicle target detection model according to the regression result and the classification result to generate a trained vehicle target detection model. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, and tablet computers, and the server 104 may be implemented by an independent server or a server cluster formed by a plurality of servers.
In one embodiment, as shown in fig. 2, a method for generating a vehicle object detection model is provided, which is described by taking the method as an example applied to the terminal 102 in fig. 1, and includes the following steps:
step 202, extracting the features of the acquired point cloud to obtain original features.
The point cloud sample may be a three-dimensional laser point cloud collected by a collecting device such as a lidar. The collecting device may be arranged on an unmanned automobile or other automatic equipment; it emits laser into the surrounding space and receives the laser points reflected by objects, and these reflected points form the laser point cloud. As shown in fig. 3, a point cloud sample is acquired by a lidar mounted on the roof of the unmanned automobile. In fig. 3, the black dots are the three-dimensional point cloud, the black solid frame marks the position of a vehicle, the points inside the black solid frame are foreground points, and the remaining points are background points. Each point records the position of the target in three-dimensional space as (x, y, z) coordinates. Assuming that the number of points input into the vehicle target detection model is n, the shape of the point cloud data is n × 3. The original features may be obtained through a three-dimensional deep learning network capable of directly processing point cloud data, such as a network of the PointNet (point cloud deep learning network) series, e.g., PointNet or PointNet++. Taking feature extraction with PointNet as an example, fig. 4 shows a schematic diagram of feature extraction by PointNet in one embodiment. After obtaining the point cloud sample, the terminal inputs it into the vehicle target detection model to be trained and extracts features through the PointNet feature extraction network in the model, obtaining original features with shape n × 128, where 128 is the dimensionality of the original features.
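For illustration only, the sketch below applies a PointNet-style shared MLP (1 × 1 convolutions applied identically to every point) to an n × 3 input to produce n × 128 per-point features; the intermediate layer widths are assumptions, not the exact configuration used in this embodiment.

```python
import torch
import torch.nn as nn

class SharedMLP(nn.Module):
    """PointNet-style shared MLP: the same weights are applied to every point."""
    def __init__(self, in_dim=3, out_dim=128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv1d(in_dim, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, out_dim, 1),
        )

    def forward(self, points):          # points: (batch, n, 3)
        x = points.transpose(1, 2)      # (batch, 3, n) for Conv1d
        feats = self.layers(x)          # (batch, 128, n)
        return feats.transpose(1, 2)    # (batch, n, 128) "original features"

# Example: 16384 points with (x, y, z) coordinates -> 16384 x 128 original features
raw = torch.rand(1, 16384, 3)
original_features = SharedMLP()(raw)    # shape (1, 16384, 128)
```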
In this embodiment, the point cloud sample acquired by the terminal may be obtained by sampling the collected original point cloud. As shown in fig. 3, the number of collected original points is usually large; if all of them were input into the model to extract original features, the processing speed of the algorithm would be affected. Therefore, without affecting the performance of the algorithm, the original point cloud may be sampled to reduce the number of points in the point cloud sample input to the vehicle target detection model to be trained. Illustratively, in fig. 3 the x-coordinate range is (0, 70.4 meters) and represents the forward direction; the y-coordinate range is (-40 meters, 40 meters) and represents the left-to-right direction; the z-coordinate range is (0, 4 meters) and points from the ground toward the sky. Because the point cloud becomes gradually sparser as the distance along the forward x-axis increases, a hierarchical sampling strategy can be used: points farther along the x-axis may be left unsampled, while closer points are sampled more heavily. For example, 50% of the points are randomly retained in the range of 0-20 meters on the x-axis, 70% in the range of 20-30 meters, and 80% in the range of 30-40 meters; no sampling is performed from 40 to 70.4 meters, as shown in the sketch below.
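A small NumPy sketch of this hierarchical sampling; the band boundaries and retention ratios are taken from the example above, everything else (function name, dummy data) is illustrative.

```python
import numpy as np

def hierarchical_sample(points, keep_ratios=((0, 20, 0.5), (20, 30, 0.7), (30, 40, 0.8))):
    """Randomly thin the point cloud more aggressively close to the sensor.

    points: (N, 3) array of (x, y, z); x is the forward direction.
    keep_ratios: (x_min, x_max, keep_probability) bands; points beyond the
    last band (here x >= 40 m) are kept unchanged, matching the text.
    """
    keep = np.ones(len(points), dtype=bool)
    for x_min, x_max, p in keep_ratios:
        in_band = (points[:, 0] >= x_min) & (points[:, 0] < x_max)
        drop = in_band & (np.random.rand(len(points)) > p)
        keep[drop] = False
    return points[keep]

# Dummy cloud covering x: 0-70.4 m, y: -40-40 m, z: 0-4 m
raw_cloud = np.random.rand(100000, 3) * [70.4, 80.0, 4.0] - [0.0, 40.0, 0.0]
sampled = hierarchical_sample(raw_cloud)   # fewer points near the sensor, all far points kept
```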
Step 204, inputting the original features into a point cloud segmentation network, and detecting foreground points in the point cloud.
The point cloud segmentation network can use the existing semantic segmentation network or a newly designed segmentation network. Specifically, the point cloud segmentation network can perform feature fusion on the input original features to generate full scene features, and classify the foreground points and the background points of the point cloud samples on the basis of the full scene features. Further, since the three-dimensional laser point cloud is sparse in the entire three-dimensional space, the center of the three-dimensional target is far away from the three-dimensional point cloud. Therefore, local features can be fused on the point cloud segmentation network, and point cloud samples are classified on the basis of combination of the local features and the full scene features, so that the accuracy of model detection can be improved.
And step 206, fusing the original features of the foreground points to generate shallow features.
And step 208, repeatedly fusing the multi-scale features generated according to the shallow features to generate deep features.
Specifically, the obtained original features of the foreground points are input into a shallow feature fusion network, and shallow features are obtained through pooling fusion and the like. And then, continuously inputting the shallow feature into a depth feature fusion network for depth fusion. The depth feature fusion network can perform down-sampling and up-sampling operations on the input shallow features to obtain low-level features and high-level features of the shallow features, and perform repeated feature fusion on the obtained high-level features and the low-level features to obtain deep features. The depth feature fusion Network may use an HRNet (High Resolution Network).
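The embodiment uses HRNet for this step; the toy sketch below only illustrates the general idea of keeping a high-resolution branch, adding a downsampled branch, and repeatedly exchanging information between resolutions. The two-branch structure, layer widths, and input shape are assumptions for illustration and do not reproduce the HRNet architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchFusion(nn.Module):
    """Toy multi-scale repeated fusion: one high-resolution and one downsampled branch
    that repeatedly exchange information before the high-resolution output is returned."""
    def __init__(self, channels=128):
        super().__init__()
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)
        self.high_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.low_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, exchanges=2):     # x: (B, C, H, W) shallow feature map
        high, low = x, self.down(x)
        for _ in range(exchanges):         # repeated fusion between the two resolutions
            up = F.interpolate(low, size=high.shape[2:], mode="bilinear", align_corners=False)
            high, low = self.high_conv(high + up), self.low_conv(low + self.down(high))
        return high                        # "deep" features at the high resolution

deep = TwoBranchFusion()(torch.rand(1, 128, 200, 176))   # (1, 128, 200, 176)
```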
And step 210, inputting the deep features into the regional candidate network, and outputting a regression result and a classification result.
Specifically, before the deep features are input into the regional candidate network, a plurality of anchor boxes may be generated on the deep features and labeled according to the detection requirements of the vehicle object detection model. Annotated labels may include, but are not limited to, regression labels, vehicle classification labels, and yaw angle classification labels. After the marking is finished, the deep features are input into a regional candidate network, each anchor frame is compared with the Ground-Truth (true value) of the vehicle target through the regional candidate network, and results such as the deviation value of the anchor frames and the Ground-Truth and the classification probability of each anchor frame are output as a regression result and a classification result and used as data for training the vehicle target detection model.
And 212, training the vehicle target detection model according to the regression result and the classification result to generate a trained vehicle target detection model.
Specifically, after the labels, the regression results, and the classification results of the anchor point frames are obtained, the selected loss function may be used to perform iterative training on the vehicle target detection model according to the marked point cloud samples and the set hyper-parameters. After training is completed, the model parameters with the highest accuracy can be selected to set the vehicle target detection model, and the vehicle target detection model is generated.
In the method for generating the vehicle target detection model, the foreground and background point prediction and the final target detection of the point cloud are sequentially carried out, so that end-to-end training is realized, and the implementation process is simple. In addition, foreground points in the point cloud sample are detected by using the point cloud segmentation network, and shallow features and deep features are generated based on original features of the foreground points. The method has the advantages that the extraction and shallow fusion of useful information are realized under the condition that the background point cloud is eliminated, the generation and use efficiency of the vehicle target detection model can be accelerated, and the interference of the background point cloud on the vehicle target detection is eliminated, so that the vehicle target detection precision is improved.
In one embodiment, as shown in fig. 5, inputting the original features into a point cloud segmentation network, and detecting foreground points in the point cloud, includes:
step 502, dividing the aerial view plane of the point cloud sample to obtain a plurality of grids.
The bird's-eye view plane of the point cloud sample is the two-dimensional projection of the point cloud onto the x-y plane. Fig. 6a shows the bird's-eye view plane of the point cloud sample shown in fig. 3; in fig. 6a, the x-coordinate range of the bird's-eye view plane is (0, 70.4 meters) and the y-coordinate range is (-40 meters, 40 meters). Specifically, the grid center spacing and grid size may be preset, and the bird's-eye view plane is divided into a plurality of grids covering uniform spatial regions. As shown in fig. 6b, if the grids are divided at intervals of 0.1 meter, the size of each grid is 0.1 meter × 0.1 meter, and the bird's-eye view plane is divided into 800 × 704 grids of the same size. In the following embodiments, the bird's-eye view plane shown in figs. 6a and 6b is used as an example.
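For illustration, the following sketch maps each point to a flat index into the 800 × 704 grid using the 0.1 m cell size and coordinate ranges from the example; the concrete indexing scheme is an assumption, not taken from the patent.

```python
import numpy as np

def bev_cell_index(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0), cell=0.1):
    """Map (x, y, z) points to a flat bird's-eye-view cell index (row-major over 800 x 704)."""
    col = np.floor((points[:, 0] - x_range[0]) / cell).astype(int)   # 0..703 along x
    row = np.floor((points[:, 1] - y_range[0]) / cell).astype(int)   # 0..799 along y
    n_cols = int(round((x_range[1] - x_range[0]) / cell))
    n_rows = int(round((y_range[1] - y_range[0]) / cell))
    col = np.clip(col, 0, n_cols - 1)
    row = np.clip(row, 0, n_rows - 1)
    return row * n_cols + col

pts = np.array([[0.05, -39.95, 0.3], [70.35, 39.95, 1.2]])
print(bev_cell_index(pts))   # [0 563199] -> the first and last of the 800 * 704 cells
```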
And step 504, mapping the original features to the aerial view plane to obtain the original features in each grid, and fusing the original features in each grid to obtain the local features of each grid.
Specifically, in the present embodiment, the point cloud segmentation network may segment the point cloud sample by combining the local features and the full scene features. Fig. 7 shows the network structure of the point cloud segmentation network used in this embodiment; it includes a local feature fusion module 701, a full scene feature fusion module 702, and a convolution module 703. The local feature fusion module 701 performs fusion according to the position of each point in the bird's-eye view plane: the original features of the points falling in the same grid are fused through max pooling. For example, if the shape of the original features is N × 128 and the numbers of points falling in the grids are M1, M2, M3, ..., M(800×704) (where 800 × 704 is the total number of grids), then the original features of the points in the grids have shapes M1 × 128, M2 × 128, M3 × 128, ..., M(800×704) × 128, respectively. Pooling and fusing the original features within each grid yields 800 × 704 local features, each with shape 1 × 128.
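A minimal sketch of the per-grid fusion just described: the original features of the points falling in the same cell are max-pooled into one 1 × 128 local feature, using a flat cell index as in the earlier sketch. The loop-based implementation and dummy data are for clarity only, not an efficient or authoritative implementation.

```python
import torch

def grid_max_pool(features, cell_idx, n_cells):
    """Max-pool the N x 128 original features of the points falling in each grid cell.

    features: (N, 128) original features; cell_idx: (N,) flat cell index per point.
    Returns (n_cells, 128); cells that contain no points stay at zero.
    """
    local = torch.zeros(n_cells, features.shape[1])
    for c in torch.unique(cell_idx):
        mask = cell_idx == c
        local[c] = features[mask].max(dim=0).values   # one 1 x 128 local feature per cell
    return local

feats = torch.rand(16384, 128)                        # original features of 16384 points
cells = torch.randint(0, 800 * 704, (16384,))         # each point's grid cell
local_features = grid_max_pool(feats, cells, 800 * 704)   # (563200, 128)
```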
Step 506, according to the original features and the local features, a point cloud segmentation network is adopted to detect foreground points in the point cloud sample.
Specifically, the original features may be subjected to pooling fusion by the full scene fusion module 702, so as to obtain the full scene features of the point cloud sample. Then, on the basis of the full scene features and the local features, the point cloud samples are classified through the convolution module 703, and foreground points in the point cloud samples are output. Further, in this embodiment, the category of the point cloud sample may also be predicted by combining the original features. By classifying and segmenting foreground points and background points of the point cloud sample according to three combined features of the original feature, the local feature and the full-scene feature, useful information can be extracted while background point clouds are eliminated, and therefore the accuracy of model detection is improved.
In one embodiment, as shown in fig. 8, an implementation process for detecting a foreground point in a point cloud according to a local feature, a full scene feature and an original feature is described, which includes:
and 802, copying local features of each grid according to the number of points in the point cloud sample in each grid.
And step 804, copying the full scene features according to the number of points in the point cloud samples.
Specifically, with continued reference to fig. 7, after the 1 × 128 local feature of each grid is obtained, the local feature of each grid may be copied according to the number of points of the point cloud sample in that grid, that is, copied M1, M2, M3, ..., M(800×704) times respectively, yielding features of shape M1 × 128, M2 × 128, M3 × 128, ..., M(800×704) × 128, which together form an N × 128 feature. Similarly, the full scene feature can be copied N times according to the total number N of points in the point cloud sample, so as to obtain another N × 128 feature.
And 806, splicing the copied local features, the copied full-scene features and the original features to generate spliced features.
And 808, performing convolution processing on the splicing characteristics, and outputting foreground points in the point cloud.
Specifically, the copied local features, the copied full scene features, and the original features are all N × 128 features, so the three can be spliced to obtain an N × 384 feature. The N × 384 feature is convolved by the convolution module in fig. 7, and an N × 2 segmentation result is output. In this embodiment, the point cloud segmentation network may be trained separately in advance. During training, the shape of the label is N × 1: points inside a Ground-Truth box are labeled 1, and points outside are labeled 0. The loss function of the point cloud segmentation network may be focal loss.
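A sketch of the copy-splice-classify step described above, under two stated assumptions: the full-scene feature is taken to be a max pooling over all per-point original features, and the convolution module is approximated by per-point linear layers (equivalent to 1 × 1 convolutions). Shapes follow the example in the text.

```python
import torch
import torch.nn as nn

def segment_foreground(original, local_per_cell, cell_idx, head):
    """Splice per-point original features with the copied local feature of the point's
    grid cell and the copied full-scene feature, then classify each point."""
    n = original.shape[0]
    local = local_per_cell[cell_idx]                      # (N, 128): local feature copied per point
    scene = original.max(dim=0, keepdim=True).values      # (1, 128): full-scene feature (assumed max pooling)
    scene = scene.expand(n, -1)                           # (N, 128): copied N times
    spliced = torch.cat([original, local, scene], dim=1)  # (N, 384) spliced feature
    return head(spliced)                                  # (N, 2) foreground/background scores

original = torch.rand(16384, 128)                         # per-point original features
local_per_cell = torch.rand(800 * 704, 128)               # one local feature per grid cell
cell_idx = torch.randint(0, 800 * 704, (16384,))          # each point's grid cell
head = nn.Sequential(nn.Linear(384, 128), nn.ReLU(), nn.Linear(128, 2))
fg_mask = segment_foreground(original, local_per_cell, cell_idx, head).argmax(dim=1) == 1
```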
In one embodiment, fusing the original features of the foreground points to generate a shallow feature includes: and fusing the original features of the foreground points of each grid, and filling data in each grid which does not contain the foreground points to generate the shallow features.
Specifically, fig. 9 shows the network structure of the shallow feature fusion network used in this embodiment. If the number of predicted foreground points is P, the original features of the foreground points have shape P × 128, and these features are input into the shallow feature fusion network. The original features of the foreground points in each grid are then pooled and fused: assuming the number of foreground points in a grid is P', the features in that grid have shape P' × 128, and a 1 × 128 feature is obtained through max pooling. Thus, each small grid containing foreground points yields a 1 × 128 feature, while each small grid without foreground points is filled with 1 × 128 all-zero data. Finally, the obtained 1 × 128 features are placed at the corresponding grid positions, and the resulting shallow fused feature has shape 800 × 704 × 128.
In one embodiment, the regional candidate network includes a regression branch, a vehicle classification branch, and a yaw angle classification branch; inputting the deep features into a regional candidate network, and outputting a regression result and a classification result, wherein the method comprises the following steps: comparing each anchor point frame marked in advance on the deep features with the real value of the vehicle target through regression branches, and outputting a regression result; predicting the vehicle category of each anchor point frame through the vehicle classification branch, and outputting a vehicle classification result; and predicting the yaw angle category of each anchor point frame through the yaw angle classification branch, and outputting a yaw angle classification result.
Specifically, in this embodiment, the deep features may be obtained through an HRNet network. Fig. 10 is a schematic diagram of obtaining deep features through the HRNet network. The HRNet network maintains a high-resolution representation throughout the process: it starts with a high-resolution sub-network as the first stage, adds high-to-low resolution sub-networks one by one to form more stages, and connects the multi-resolution sub-networks in parallel. Information is repeatedly exchanged among the parallel multi-resolution sub-networks throughout the process to perform repeated multi-scale fusion, which makes the model detection more accurate. In fig. 10, the deep feature output by the HRNet network has shape 200 × 176 × 128.
The following describes the generation and labeling of anchor boxes on the deep features. Specifically, uniform coordinate points may be set along the x-axis and y-axis directions of the bird's-eye view plane. For example, 176 coordinate points are taken uniformly in the range of 0 to 70.4 meters on the x-axis and 200 coordinate points in the range of -40 to 40 meters on the y-axis, which are combined into a coordinate point matrix of shape 200 × 176 × 3 (the z-coordinates all take a fixed value, for example 0.5 meters, i.e., the center points of vehicle targets are assumed to be 0.5 meters above the ground). These coordinates are taken as the centers of anchor-boxes, and anchor-boxes with a preset shape and preset yaw angles are generated. For example, at each center 4 anchor-boxes are generated with yaw angles of (0, 0.79, 1.57, 2.37) radians, i.e., 0, 45, 90, and 135 degrees, and the length, width, and height of all generated anchor-boxes are (3.9, 1.6, 1.56) meters; these size values are chosen as the mean of the vehicle targets in the kitti data set, which is currently the largest international evaluation data set for computer vision algorithms in autonomous driving scenes. Thus the number of anchor-boxes is 200 × 176 × 4 = 140800, and the generated anchor-box matrix has shape 200 × 176 × 4 × 7, where the last dimension 7 represents the center point coordinates (x, y, z) of the anchor-box, its width, length, and height (w, l, h), and its yaw angle.
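A minimal NumPy sketch of this anchor generation, using the center spacing, fixed z, box size, and yaw angles from the example; the ordering of the last dimension follows the text (x, y, z, w, l, h, yaw), and everything not stated there is an illustrative assumption.

```python
import numpy as np

def generate_anchors():
    """Build the 200 x 176 x 4 anchor-box matrix (last dimension: x, y, z, w, l, h, yaw)."""
    xs = np.linspace(0.0, 70.4, 176)
    ys = np.linspace(-40.0, 40.0, 200)
    yaws = np.array([0.0, 0.79, 1.57, 2.37])             # 0, 45, 90, 135 degrees
    anchors = np.zeros((200, 176, 4, 7))
    yy, xx = np.meshgrid(ys, xs, indexing="ij")           # (200, 176) center grids
    anchors[..., 0] = xx[..., None]                       # x center
    anchors[..., 1] = yy[..., None]                       # y center
    anchors[..., 2] = 0.5                                 # fixed z of the assumed target center
    anchors[..., 3:6] = (1.6, 3.9, 1.56)                  # w, l, h from the kitti vehicle mean
    anchors[..., 6] = yaws                                # 4 yaw angles per position
    return anchors.reshape(-1, 7)

anchors = generate_anchors()
print(anchors.shape)                                      # (140800, 7)
```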
Then, encoding is performed using the generated 140800 anchor-boxes and the ground-truth (real values). Specifically, the ground-truth is the real value of the vehicle targets; for example, if 3 vehicles are included in the scene, the ground-truth matrix is 3 × 7, where the last dimension 7 has the same meaning as the last dimension 7 of the anchor-box. A score is calculated for each anchor-box against the ground-truth as follows: IoU (Intersection over Union, where the intersection is the overlapping area of two regions and the union is the sum of the two areas minus the overlapping area; for example, if the intersection area is 20 and the union area is 100, the IoU is 0.2) is computed between all anchor-boxes and all ground-truth boxes in the bird's-eye view plane. For each anchor-box, if its IoU with some ground-truth exceeds a first threshold (e.g., 0.6), the vehicle classification label of that anchor-box may be marked as positive (labeled 1), and the ground-truth with the largest IoU with it is selected for calculating the regression label. If the IoU between the anchor-box and all ground-truth boxes is less than a second threshold (e.g., 0.45), the vehicle classification label of the anchor-box may be marked as negative (labeled 0). If the IoU between the anchor-box and all ground-truth boxes lies between the second threshold and the first threshold, the anchor-box does not participate in the calculation. For an anchor-box whose classification label is positive, its regression label can be obtained by encoding as follows:
Δx = (x^g - x^a) / d^a, Δy = (y^g - y^a) / d^a, Δz = (z^g - z^a) / h^a, where d^a = sqrt((l^a)^2 + (w^a)^2)
Δl = (l^g - l^a) / l^a, Δw = (w^g - w^a) / w^a, Δh = (h^g - h^a) / h^a
Δθ = |θ^g| - θ^a
wherein, Δ x, Δ y, Δ z, Δ l, Δ w, Δ h, Δ θ are regression labels of the anchor-box after one-hot coding, g superscript represents the value of ground-channel corresponding to the normal anchor-box, and a superscript represents the value of anchor-box. In addition, for the anchor-box of the positive type, one-hot coding is also needed for the yaw angle category. Specifically, if the yaw angle of the group-route corresponding to the positive type anchor-box is greater than 0 °, the type of yaw angle of the anchor-box may be marked as positive, and the opposite positive may be marked as negative (the range of yaw angles of the vehicle target in the kitti data set is-pi to pi, and there are positive and negative values). The regression tag matrix shape after one-hot encoding is 200 × 176 × 4 × 7, and the vehicle classification tag shape after one-hot encoding is 200 × 176 × 4 × 2. The last dimension 2 indicates that the 0,1 class is one-hot coded. The shape of the coded yaw angle classification label is 200 × 176 × 4 × 2.
After the encoding of the anchor boxes on the deep features is complete, the deep features may be fed into the RPN (Region Proposal Network, referred to above as the regional candidate network). Continuing with fig. 10, the RPN contains 3 branches: a regression branch, a vehicle classification branch, and a yaw angle classification branch, each consisting of multiple convolution layers. The regression branch predicts the deviation of the anchor-box from the real vehicle target; its output matrix has shape 200 × 176 × 4 × 7, corresponding to the regression label. The vehicle classification branch predicts whether an anchor-box is a vehicle and outputs the probability of each category; in this embodiment there are two categories (vehicle or not), so its output shape is 200 × 176 × 4 × 2, corresponding to the vehicle classification label. The yaw angle classification branch predicts the yaw angle category, of which there are also two (positive or negative); its output matrix shape is 200 × 176 × 4 × 2, corresponding to the yaw angle classification label.
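A minimal sketch of a three-branch head of this kind. The number of convolution layers per branch and the channel widths are assumptions, and the outputs are left in channel-first layout rather than reshaped to 200 × 176 × 4 × k.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Three branches over the deep feature map: regression, vehicle classification,
    and yaw-angle classification, each a small stack of convolutions."""
    def __init__(self, in_channels=128, anchors_per_loc=4):
        super().__init__()
        def branch(out_per_anchor):
            return nn.Sequential(
                nn.Conv2d(in_channels, 128, 3, padding=1), nn.ReLU(),
                nn.Conv2d(128, anchors_per_loc * out_per_anchor, 1),
            )
        self.reg = branch(7)       # deviation from the anchor box (x, y, z, w, l, h, yaw)
        self.cls = branch(2)       # vehicle / not vehicle
        self.yaw = branch(2)       # yaw-angle sign: positive / negative

    def forward(self, deep):                       # deep: (B, 128, 200, 176)
        return self.reg(deep), self.cls(deep), self.yaw(deep)

reg, cls, yaw = RPNHead()(torch.rand(1, 128, 200, 176))
print(reg.shape, cls.shape, yaw.shape)             # (1, 28, 200, 176) (1, 8, 200, 176) (1, 8, 200, 176)
```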
Further, after the regression label, the vehicle classification label, and the yaw angle classification label are obtained, together with the regression and classification results output by the RPN against these three labels, the model may be iteratively trained using the selected loss functions. Smooth L1 loss may be selected as the regression loss function, and focal loss as the vehicle classification and yaw angle classification loss functions. In this embodiment, the hyper-parameters for training the vehicle target detection model are preferably set as follows: the optimizer is the Adam optimization algorithm (a first-order optimization algorithm that can replace the traditional stochastic gradient descent process); the initial learning rate is set to 0.001 and gradually decreases during iterative training, e.g., every 10 epochs (1 epoch is one pass over all samples in the training set) the learning rate is reduced to 0.5 of its previous value. After 100 epochs of iterative training, the vehicle target detection model with the best accuracy is saved.
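A minimal sketch of this training setup (Adam, initial learning rate 0.001, halved every 10 epochs, 100 epochs, Smooth L1 for regression). The placeholder model and dummy batch are assumptions for illustration, and the focal loss terms for the two classification branches are only indicated in the comments.

```python
import torch
from torch import nn, optim

model = nn.Conv2d(128, 28, 1)                                               # stand-in for the full detection model
optimizer = optim.Adam(model.parameters(), lr=0.001)                        # Adam, initial learning rate 0.001
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)   # halve the learning rate every 10 epochs
reg_loss_fn = nn.SmoothL1Loss()                                             # Smooth L1 for the regression branch

deep, target = torch.rand(1, 128, 200, 176), torch.rand(1, 28, 200, 176)    # dummy batch
for epoch in range(100):                                                    # 100-epoch iterative training
    optimizer.zero_grad()
    loss = reg_loss_fn(model(deep), target)   # focal loss terms for the vehicle and
    loss.backward()                           # yaw-angle classification branches would be added here
    optimizer.step()
    scheduler.step()
```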
In one embodiment, a vehicle object detection method is also provided, which may be applied to a second terminal installed on an unmanned automobile, which may be an electronic device with strong data storage and computing capabilities. The trained vehicle target detection model obtained by the method is arranged in the second terminal. When the unmanned automobile works, the point cloud collecting device arranged on the unmanned automobile can transmit the collected point cloud to be detected to the second terminal, and the position information, the size information, the yaw angle and the like of the vehicle target in the surrounding environment of the unmanned automobile are obtained through detection of a trained vehicle target detection model arranged on the second terminal. Specifically, as shown in fig. 11, the vehicle object detection method may include the steps of:
step 1101, point clouds to be detected are obtained.
The point cloud to be detected is acquired by an acquisition device such as a lidar arranged on the unmanned automobile, for example on its roof. The point cloud to be detected may be obtained by sampling the collected original point cloud. Hierarchical sampling may be used: the closer the points are to the acquisition device, the denser the point cloud and the heavier the computation, so closer points are sampled more heavily. For example, only 50% of the points are retained in the range of 0-20 meters; 70% are randomly retained in the range of 20-30 meters; 80% are randomly retained in the range of 30-40 meters; and no sampling is performed from 40 to 70.4 meters.
Step 1102, inputting the point cloud to be detected into a vehicle target detection model, and extracting features through a feature extraction network in the vehicle target detection model.
Specifically, the feature extraction network may use a three-dimensional deep learning network capable of directly performing feature extraction on point cloud data, and may be a PointNet feature extraction network, for example.
And 1103, dividing the bird's-eye view plane of the point cloud to be detected to obtain small grids of uniform spatial area.
And 1104, inputting the features into a point cloud segmentation network, and detecting foreground points in the point cloud through the point cloud segmentation network.
Specifically, the foreground points in the point cloud can be classified and segmented by combining the local features, the full scene features, and the features extracted by the feature extraction network, which improves the accuracy of foreground point classification. The local features may be obtained by mapping the features produced by the PointNet feature extraction network onto the bird's-eye view plane and pooling and fusing the features of the points to be detected within each grid. Reference may be made to the schematic diagram of the point cloud segmentation network in fig. 7.
Step 1105, inputting the characteristics of the foreground points into the shallow fusion network for characteristic fusion to obtain the shallow characteristics.
Specifically, the features of the foreground points within each grid are pooled and fused, and grids without foreground points are padded with all-zero data. Reference may be made to the schematic diagram of the shallow feature fusion network in fig. 9.
Step 1106, inputting the shallow features into a deep fusion network for multi-scale repeated fusion to obtain deep features.
Step 1107, the deep features are input into the regional candidate network to obtain a classification result and a regression result. The classification result comprises a vehicle classification result and a yaw angle classification result.
Specifically, the vehicle classification results may first be sorted by the score of the vehicle category from large to small, and a preset number (for example, 1000) of the highest-scoring anchor-box classification results and regression results are retained. Since the obtained regression results are encoded values, the preset number of regression results need to be decoded to obtain the real outputs. The decoding formulas are the inverse of the formulas used in training:
x^g = Δx * d^a + x^a, y^g = Δy * d^a + y^a, z^g = Δz * h^a + z^a, where d^a = sqrt((l^a)^2 + (w^a)^2)
l^g = Δl * l^a + l^a, w^g = Δw * w^a + w^a, h^g = Δh * h^a + h^a
θ^g = Δθ + θ^a
After decoding, the sign of the output yaw angle can be determined by combining the regression output with the yaw angle classification result. After the real outputs are obtained, NMS (Non-Maximum Suppression) may be used to screen the preset number of real outputs and output the optimal results as the final result. Illustratively, if the preset number is 1000: among the 1000 anchor-boxes and their scores, the anchor-box with the highest vehicle classification score is taken and saved first. IoU is then calculated between this anchor-box and the remaining 999 anchor-boxes. If the IoU is greater than a certain threshold, such as 0.1, the two anchor-boxes are considered to overlap too much and the lower-scoring one is discarded; otherwise, the lower-scoring anchor-box is retained. After the comparison with the 999 anchor-boxes is finished, the above operations are repeated on the retained anchor-boxes, and the optimal results are screened out and output as the final result.
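The following sketch illustrates this NMS screening. The 0.1 IoU threshold follows the text, but the IoU here is computed on axis-aligned bird's-eye-view boxes as a simplification (box rotation is ignored); treat it as an illustration of the suppression logic rather than the exact computation used in this embodiment.

```python
import numpy as np

def iou_bev_axis_aligned(box, boxes):
    """Axis-aligned bird's-eye-view IoU between one box and many boxes.
    Boxes are (x, y, w, l) centers/sizes; rotation is ignored in this simplification."""
    x1 = np.maximum(box[0] - box[3] / 2, boxes[:, 0] - boxes[:, 3] / 2)
    x2 = np.minimum(box[0] + box[3] / 2, boxes[:, 0] + boxes[:, 3] / 2)
    y1 = np.maximum(box[1] - box[2] / 2, boxes[:, 1] - boxes[:, 2] / 2)
    y2 = np.minimum(box[1] + box[2] / 2, boxes[:, 1] + boxes[:, 2] / 2)
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = box[2] * box[3] + boxes[:, 2] * boxes[:, 3] - inter
    return inter / union

def nms(boxes, scores, iou_threshold=0.1):
    """Keep the highest-scoring box, drop boxes overlapping it above the threshold, repeat."""
    order = np.argsort(-scores)
    keep = []
    while len(order) > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        ious = iou_bev_axis_aligned(boxes[best], boxes[rest])
        order = rest[ious <= iou_threshold]
    return keep

boxes = np.array([[10.0, -3.0, 1.6, 3.9], [10.1, -3.1, 1.6, 3.9], [30.0, 5.0, 1.6, 3.9]])
print(nms(boxes, np.array([0.9, 0.8, 0.7])))       # [0, 2]: the near-duplicate box is suppressed
```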
It should be understood that although the steps in the flowcharts of figs. 1-11 are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least some of the steps in figs. 1-11 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and the order of their execution is not necessarily sequential; they may be performed in turn or in alternation with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 12, there is provided an apparatus 1200 for generating a vehicle object detection model, including: a feature extraction module 1201, a point cloud segmentation module 1202, a shallow feature generation module 1203, a deep feature generation module 1204, a training data generation module 1205, and a model training module 1206, wherein:
a feature extraction module 1201, configured to perform feature extraction on the obtained point cloud sample to obtain an original feature;
a point cloud segmentation module 1202, configured to input the original features into a point cloud segmentation network, and detect a foreground point in the point cloud sample;
a shallow feature generation module 1203, configured to fuse the original features of the foreground points to generate a shallow feature;
a deep feature generation module 1204, configured to repeatedly fuse the multi-scale features generated according to the shallow features to generate deep features;
a training data generating module 1205 for inputting the deep features into the regional candidate network and outputting regression and classification results;
and the model training module 1206 is used for training the vehicle target detection model according to the regression and classification results to generate a trained vehicle target detection model.
In one embodiment, the point cloud segmentation module 1202 is specifically configured to map the original features into an aerial view plane to obtain original features in each mesh, and fuse the original features in each mesh to obtain local features of each mesh; and detecting foreground points in the point cloud sample by adopting a point cloud segmentation network according to the original features and the local features.
In one embodiment, the point cloud segmentation module 1202 is specifically configured to fuse the original features to generate full scene features; and detecting foreground points in the point cloud sample according to the local features, the full scene features and the original features.
In one embodiment, the point cloud segmentation module 1202 is specifically configured to copy local features of each mesh according to the number of points in the point cloud sample in each mesh; copying the full scene features according to the number of points in the point cloud samples; splicing the copied local features, the copied full-scene features and the original features to generate splicing features; and performing convolution processing on the splicing characteristics, and outputting foreground points in the point cloud sample.
In an embodiment, the shallow feature generation module 1203 is specifically configured to fuse the original features of the foreground points of each grid, and perform data filling on each grid that does not include the foreground points, so as to generate the shallow features.
In one embodiment, the training data generating module 1205 is specifically configured to compare each anchor point frame labeled in advance on the deep features with the true value of the vehicle target through a regression branch, and output a regression result; predicting the vehicle category of each anchor point frame through the vehicle classification branch, and outputting a vehicle classification result; and predicting the yaw angle category of each anchor point frame through the yaw angle classification branch, and outputting a yaw angle classification result.
For specific definition of the vehicle object model generation apparatus 1200, the above definition of the vehicle object model generation method can be referred to, and will not be described herein again. The respective modules in the vehicle target model generation apparatus 1200 described above may be entirely or partially implemented by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 13. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of generating a vehicle object model. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the structure shown in fig. 13 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer devices to which the solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
extracting the characteristics of the obtained point cloud sample to obtain original characteristics; inputting the original features into a point cloud segmentation network, and detecting foreground points in the point cloud sample; fusing the original features of the foreground points to generate shallow features; repeatedly fusing the multi-scale features generated according to the shallow features to generate deep features; inputting the deep features into a regional candidate network, and outputting a training label; and training the vehicle target detection model according to the training labels to generate the trained vehicle target detection model.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
dividing a bird's-eye view plane of the point cloud to obtain a plurality of grids; mapping the original features to an aerial view plane to obtain the original features in each grid, and fusing the original features in each grid to obtain local features of each grid; and detecting foreground points in the point cloud sample by adopting a point cloud segmentation network according to the original features and the local features.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
fusing the original features to generate full scene features; and detecting foreground points in the point cloud sample according to the local features, the full scene features and the original features.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
copying the local features of each grid according to the number of points in the point cloud samples in each grid; copying the full scene features according to the number of points in the point cloud samples; splicing the copied local features, the copied full-scene features and the original features to generate splicing features; and performing convolution processing on the splicing characteristics, and outputting foreground points in the point cloud sample.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
and fusing the original features of the foreground points of each grid, and filling data in each grid which does not contain the foreground points to generate the shallow features.
In one embodiment, the regional candidate network includes a regression branch, a vehicle classification branch, and a yaw angle classification branch; the processor, when executing the computer program, further performs the steps of:
comparing each anchor point frame marked in advance on the deep features with the real value of the vehicle target through regression branches, and outputting a regression result; predicting the vehicle category of each anchor point frame through the vehicle classification branch, and outputting a vehicle classification result; and predicting the yaw angle category of each anchor point frame through the yaw angle classification branch, and outputting a yaw angle classification result.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
extracting the characteristics of the obtained point cloud sample to obtain original characteristics; inputting the original features into a point cloud segmentation network, and detecting foreground points in the point cloud sample; fusing the original features of the foreground points to generate shallow features; repeatedly fusing the multi-scale features generated according to the shallow features to generate deep features; inputting the deep features into a regional candidate network, and outputting a training label; and training the vehicle target detection model according to the training labels to generate the trained vehicle target detection model.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of generating a vehicle object detection model, the method comprising:
extracting the characteristics of the obtained point cloud sample to obtain original characteristics;
inputting the original features into a point cloud segmentation network, and detecting foreground points in the point cloud sample;
fusing the original features of the foreground points to generate shallow features;
repeatedly fusing the multi-scale features generated according to the shallow features to generate deep features;
inputting the deep features into a regional candidate network, and outputting a regression result and a classification result;
and training a vehicle target detection model according to the regression result and the classification result to generate a trained vehicle target detection model.
2. The method of claim 1, wherein inputting the original features into a point cloud segmentation network, and detecting foreground points in the point cloud sample, comprises:
dividing a bird's-eye view plane of the point cloud to obtain a plurality of grids;
mapping the original features to the bird's-eye view plane to obtain the original features in each grid, and fusing the original features in each grid to obtain local features of each grid;
and detecting foreground points in the point cloud sample by adopting the point cloud segmentation network according to the original features and the local features.
3. The method of claim 2, wherein detecting foreground points in the point cloud sample using the point cloud segmentation network based on the original features and the local features comprises:
fusing the original features to generate full scene features;
and detecting foreground points in the point cloud sample according to the local features, the full scene features and the original features.
4. The method of claim 3, wherein detecting foreground points in the point cloud sample from the local features, the full scene features, and the original features comprises:
copying the local features of each grid according to the number of points in the point cloud sample in each grid;
copying the full scene features according to the number of points in the point cloud sample;
splicing the copied local features, the copied full scene features and the original features to generate spliced features;
and performing convolution processing on the spliced features, and outputting the foreground points in the point cloud sample.
5. The method of claim 2, wherein fusing the original features of the foreground points to generate shallow features comprises:
fusing the original features of the foreground points in each grid, and performing data filling on each grid that does not contain foreground points, to generate the shallow features.
6. The method of claim 1, wherein inputting the deep features into a regional candidate network, and outputting a regression result and a classification result, comprises:
comparing each anchor box pre-marked on the deep features with the ground-truth value of the vehicle target through the regional candidate network, and outputting a regression result;
predicting the vehicle category of each anchor box, and outputting a vehicle classification result;
and predicting the yaw angle category of each anchor box, and outputting a yaw angle classification result.
7. A vehicle object detection method, characterized in that the method comprises:
acquiring a point cloud to be detected;
inputting the point cloud to be detected into a vehicle target detection model to generate a detection result of a vehicle target, wherein the vehicle target detection model is generated by adopting the method of any one of claims 1 to 6.
8. An apparatus for generating a vehicle object detection model, the apparatus comprising:
the feature extraction module is used for extracting features of the acquired point cloud sample to obtain original features;
the point cloud segmentation module is used for inputting the original features into a point cloud segmentation network and detecting foreground points in the point cloud sample;
the shallow feature generation module is used for fusing the original features of the foreground points to generate shallow features;
the deep feature generation module is used for repeatedly fusing the multi-scale features generated according to the shallow features to generate deep features;
the training data generation module is used for inputting the deep features into a regional candidate network and outputting a regression result and a classification result;
and the model training module is used for training the vehicle target detection model according to the regression result and the classification result to generate a trained vehicle target detection model.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN201911371122.0A 2019-12-26 2019-12-26 Vehicle target detection model generation method, vehicle target detection method and device Pending CN111144304A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911371122.0A CN111144304A (en) 2019-12-26 2019-12-26 Vehicle target detection model generation method, vehicle target detection method and device

Publications (1)

Publication Number Publication Date
CN111144304A true CN111144304A (en) 2020-05-12

Family

ID=70520719

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911371122.0A Pending CN111144304A (en) 2019-12-26 2019-12-26 Vehicle target detection model generation method, vehicle target detection method and device

Country Status (1)

Country Link
CN (1) CN111144304A (en)

Patent Citations (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106127153A (en) * 2016-06-24 2016-11-16 南京林业大学 The traffic sign recognition methods of Vehicle-borne Laser Scanning cloud data
CN107169421A (en) * 2017-04-20 2017-09-15 华南理工大学 A kind of car steering scene objects detection method based on depth convolutional neural networks
CN107576960A (en) * 2017-09-04 2018-01-12 苏州驾驶宝智能科技有限公司 The object detection method and system of vision radar Spatial-temporal Information Fusion
CN107871126A (en) * 2017-11-22 2018-04-03 西安翔迅科技有限责任公司 Model recognizing method and system based on deep-neural-network
CN108230329A (en) * 2017-12-18 2018-06-29 孙颖 Semantic segmentation method based on multiple dimensioned convolutional neural networks
CN108171217A (en) * 2018-01-29 2018-06-15 深圳市唯特视科技有限公司 A kind of three-dimension object detection method based on converged network
CN108427912A (en) * 2018-02-05 2018-08-21 西安电子科技大学 Remote sensing image object detection method based on the study of dense target signature
CN110321910A (en) * 2018-03-29 2019-10-11 中国科学院深圳先进技术研究院 Feature extracting method, device and equipment towards cloud
CN108872991A (en) * 2018-05-04 2018-11-23 上海西井信息科技有限公司 Target analyte detection and recognition methods, device, electronic equipment, storage medium
CN109345510A (en) * 2018-09-07 2019-02-15 百度在线网络技术(北京)有限公司 Object detecting method, device, equipment, storage medium and vehicle
CN109598290A (en) * 2018-11-22 2019-04-09 上海交通大学 A kind of image small target detecting method combined based on hierarchical detection
CN109684929A (en) * 2018-11-23 2019-04-26 中国电建集团成都勘测设计研究院有限公司 Terrestrial plant ECOLOGICAL ENVIRONMENTAL MONITORING method based on multi-sources RS data fusion
CN109800628A (en) * 2018-12-04 2019-05-24 华南理工大学 A kind of network structure and detection method for reinforcing SSD Small object pedestrian detection performance
CN109754006A (en) * 2018-12-26 2019-05-14 清华大学 A kind of view and the stereoscopic vision content categorizing method and system of point cloud fusion
CN109829399A (en) * 2019-01-18 2019-05-31 武汉大学 A kind of vehicle mounted road scene point cloud automatic classification method based on deep learning
CN109932730A (en) * 2019-02-22 2019-06-25 东华大学 Laser radar object detection method based on multiple dimensioned monopole three dimensional detection network
CN109948661A (en) * 2019-02-27 2019-06-28 江苏大学 A kind of 3D vehicle checking method based on Multi-sensor Fusion
CN110032962A (en) * 2019-04-03 2019-07-19 腾讯科技(深圳)有限公司 A kind of object detecting method, device, the network equipment and storage medium
CN110119728A (en) * 2019-05-23 2019-08-13 哈尔滨工业大学 Remote sensing images cloud detection method of optic based on Multiscale Fusion semantic segmentation network
CN110276267A (en) * 2019-05-28 2019-09-24 江苏金海星导航科技有限公司 Method for detecting lane lines based on Spatial-LargeFOV deep learning network
CN110335270A (en) * 2019-07-09 2019-10-15 华北电力大学(保定) Transmission line of electricity defect inspection method based on the study of hierarchical regions Fusion Features
CN110516620A (en) * 2019-08-29 2019-11-29 腾讯科技(深圳)有限公司 Method for tracking target, device, storage medium and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SHAOSHUAI SHI et al.: "PointRCNN: 3D Object Proposal Generation and Detection From Point Cloud", pages 1 - 10 *
孙江冬 (SUN Jiangdong): "Research on Vehicle Detection Methods Based on LiDAR Point Clouds" (基于激光雷达点云的车辆检测方法研究), vol. 2018, no. 11, pages 035 - 44 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667522A (en) * 2020-06-04 2020-09-15 上海眼控科技股份有限公司 Three-dimensional laser point cloud densification method and equipment
CN111723854A (en) * 2020-06-08 2020-09-29 杭州像素元科技有限公司 Method and device for detecting traffic jam of highway and readable storage medium
CN111723854B (en) * 2020-06-08 2023-08-29 杭州像素元科技有限公司 Expressway traffic jam detection method, equipment and readable storage medium
CN112149677A (en) * 2020-09-14 2020-12-29 上海眼控科技股份有限公司 Point cloud semantic segmentation method, device and equipment
CN112598635A (en) * 2020-12-18 2021-04-02 武汉大学 Point cloud 3D target detection method based on symmetric point generation
CN112598635B (en) * 2020-12-18 2024-03-12 武汉大学 Point cloud 3D target detection method based on symmetric point generation
US11398097B2 (en) 2020-12-21 2022-07-26 Beihang University Target detection method based on fusion of prior positioning of millimeter-wave radar and visual feature
CN112560972A (en) * 2020-12-21 2021-03-26 北京航空航天大学 Target detection method based on millimeter wave radar prior positioning and visual feature fusion
CN112560972B (en) * 2020-12-21 2021-10-08 北京航空航天大学 Target detection method based on millimeter wave radar prior positioning and visual feature fusion
CN112733929A (en) * 2021-01-07 2021-04-30 南京工程学院 Improved method for detecting small target and shielded target of Yolo underwater image
CN112801164A (en) * 2021-01-22 2021-05-14 北京百度网讯科技有限公司 Training method, device and equipment of target detection model and storage medium
CN112801164B (en) * 2021-01-22 2024-02-13 北京百度网讯科技有限公司 Training method, device, equipment and storage medium of target detection model
CN114565644A (en) * 2022-03-02 2022-05-31 湖南中科助英智能科技研究院有限公司 Three-dimensional moving object detection method, device and equipment
CN114494609A (en) * 2022-04-02 2022-05-13 中国科学技术大学 3D target detection model construction method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN111144304A (en) Vehicle target detection model generation method, vehicle target detection method and device
CN111353512B (en) Obstacle classification method, obstacle classification device, storage medium and computer equipment
CN111199206A (en) Three-dimensional target detection method and device, computer equipment and storage medium
CN112232293B (en) Image processing model training method, image processing method and related equipment
CN111079632A (en) Training method and device of text detection model, computer equipment and storage medium
CN111353969A (en) Method and device for determining drivable area of road and computer equipment
US20220156483A1 (en) Efficient three-dimensional object detection from point clouds
CN111814794A (en) Text detection method and device, electronic equipment and storage medium
CN110807491A (en) License plate image definition model training method, definition detection method and device
Pereira et al. Advancing autonomous surface vehicles: A 3d perception system for the recognition and assessment of docking-based structures
CN113761999A (en) Target detection method and device, electronic equipment and storage medium
CN111292275A (en) Point cloud data filtering method and device based on complex ground and computer equipment
CN110992404B (en) Target tracking method, device and system and storage medium
CN115546519A (en) Matching method for image and millimeter wave radar target for extracting pseudo-image features
CN112862730B (en) Point cloud feature enhancement method and device, computer equipment and storage medium
CN114841227A (en) Modifying a set of parameters characterizing a computer vision model
KC Enhanced pothole detection system using YOLOX algorithm
CN114118247A (en) Anchor-frame-free 3D target detection method based on multi-sensor fusion
CN114595738A (en) Method for generating training data for recognition model and method for generating recognition model
CN110097077B (en) Point cloud data classification method and device, computer equipment and storage medium
Al Mamun et al. Efficient lane marking detection using deep learning technique with differential and cross-entropy loss.
CN114972492A (en) Position and pose determination method and device based on aerial view and computer storage medium
CN115797665A (en) Image feature-based image and single-frame millimeter wave radar target matching method
CN116311114A (en) Method and device for generating drivable region, electronic equipment and storage medium
CN116543143A (en) Training method of target detection model, target detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination