CN116740668A - Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium


Info

Publication number: CN116740668A
Application number: CN202311029637.9A
Authority: CN (China)
Prior art keywords: sample, voxelized, point cloud, voxel, features
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN116740668B
Inventors: 位硕权, 马也驰, 华炜, 鲍虎军
Current Assignee: Zhejiang Lab
Original Assignee: Zhejiang Lab
Application filed by Zhejiang Lab

Classifications

    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V 20/64 Three-dimensional objects
    • G06V 2201/07 Target detection

Abstract

The application relates to a three-dimensional target detection method, a three-dimensional target detection device, computer equipment and a storage medium, wherein the three-dimensional target detection method comprises the following steps: acquiring a plurality of images to be detected, wherein the images to be detected are generated based on a multi-camera; acquiring voxelized image features of a corresponding preset voxel space based on the plurality of images to be detected; acquiring voxelized point cloud features corresponding to the preset voxel space based on an offline point cloud map, and fusing the voxelized image features and the voxelized point cloud features to obtain voxel fusion features; and acquiring a recognition result of the target object based on the voxel fusion features. The method solves the technical problem of low three-dimensional target detection precision based on the multi-camera in the related art, has low calculation consumption, remarkably improves the target detection precision while the detection time remains almost unchanged, and expands the application range of intelligent driving perception technology.

Description

Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer vision, and in particular, to a method and apparatus for detecting a three-dimensional object, a computer device, and a storage medium.
Background
Intelligent driving technology is one of the important development directions of the future automobile industry. In the existing intelligent driving technology, the perception technology is used as a core part of the intelligent driving technology, can help automobiles to recognize surrounding environments, road conditions, obstacles and the like, and provides important information support for intelligent driving. Currently, the main sensors of intelligent driving perception include various sensors such as cameras, lidar, ultrasonic detectors, and the like. In the conventional monocular image sensing technology, due to lack of depth information in an image, certain limitations exist for tasks such as distance estimation, obstacle detection, scene understanding and the like.
In the related art, a multi-view image technology is generally adopted, and image data is acquired through a plurality of cameras, so that more accurate and complete scene information and target depth information are obtained, which better supports the perception and decision-making of intelligent driving. However, in the related art, only the image features acquired by the multi-view camera are considered in the process of three-dimensional target detection, and the target is then identified based on the voxelized image features; other available information is not fully utilized for optimization, so that the feature representation is not rich and comprehensive enough. Therefore, the three-dimensional object detection accuracy based on the multi-camera in the related art is low.
Aiming at the technical problem of low three-dimensional target detection precision based on a multi-camera in the related art, no effective solution has been proposed so far.
Disclosure of Invention
Based on the above, the application provides a three-dimensional target detection method, a three-dimensional target detection device, a three-dimensional target detection computer device and a three-dimensional target detection storage medium, so as to solve the technical problem of low three-dimensional target detection precision based on a multi-camera in the related technology.
In a first aspect, the present application provides a method for three-dimensional object detection, the method comprising:
acquiring a plurality of images to be detected, wherein the images to be detected are generated based on a multi-camera;
based on a plurality of images to be detected, voxel image features of corresponding preset voxel spaces are obtained;
acquiring voxelized point cloud characteristics corresponding to the preset voxel space based on an offline point cloud map, and fusing the voxelized image characteristics and the voxelized point cloud characteristics to obtain voxel fusion characteristics;
and acquiring a recognition result of the target object based on the voxel fusion characteristics.
In one embodiment, a database stores in advance voxelized map features corresponding to an offline point cloud map, and the acquiring voxelized point cloud features corresponding to the preset voxel space includes:
Acquiring current positioning information;
and determining the preset voxel space based on the current positioning information, and intercepting the voxelized map features based on the preset voxel space to obtain the voxelized point cloud features.
In one embodiment, the acquiring the voxelized point cloud characteristics corresponding to the preset voxel space includes:
acquiring current positioning information;
determining the preset voxel space based on the current positioning information, and intercepting the offline point cloud map based on the preset voxel space to obtain offline point cloud data;
and carrying out voxelized conversion and feature extraction on the offline point cloud data to obtain the voxelized point cloud features.
In one embodiment, based on the plurality of images to be detected, acquiring the voxel image features of the corresponding preset voxel space includes:
extracting features of a plurality of images to be detected to obtain a plurality of corresponding two-dimensional image features;
voxel sampling is carried out on a plurality of two-dimensional image features, and voxel sampling features corresponding to the preset voxel space are obtained;
and carrying out feature extraction on the voxelized sampling features to obtain the voxelized image features.
In one embodiment, the feature extracting the voxelized sampling feature to obtain the voxelized image feature includes:
compressing the first dimension and the second dimension of the voxelized sampling feature to the same dimension to obtain a compressed sampling feature;
performing feature extraction on the compressed sampling features based on two-dimensional convolution kernels to obtain compressed image features;
and carrying out dimension recovery on the first dimension and the second dimension of the compressed image feature to obtain the voxelized image feature.
In one embodiment, the obtaining, based on the voxel fusion feature, a recognition result of the target object includes:
performing feature extraction on the voxel fusion features based on two-dimensional convolution kernels to obtain bird's eye view features;
and acquiring a recognition result of the target object based on the aerial view characteristic.
In one embodiment, the method further comprises:
acquiring an initial three-dimensional target detection model, wherein the initial three-dimensional target detection model is used for executing the three-dimensional target detection method;
inputting a plurality of sample images and sample point cloud data synchronized with the sample images to the initial three-dimensional target detection model, the sample images determined based on a multi-view camera dataset;
Extracting sample voxelized image features of a plurality of sample images based on the initial three-dimensional target detection model, wherein the sample voxelized image features and the sample voxelized point cloud features are determined based on the same sample voxel space;
training the initial three-dimensional target detection model based on a first loss function, wherein the first loss function is determined based on the sample voxelized image characteristics and sample voxelized point cloud characteristics, and the sample voxelized point cloud characteristics are acquired based on the sample point cloud data.
In one embodiment, the acquiring the sample voxelized point cloud characteristic includes:
voxel sampling is carried out on the sample point cloud data based on external parameters of the point cloud acquisition equipment, so that sample voxel point cloud data are obtained;
and carrying out feature extraction on the sample voxelized point cloud data to obtain the sample voxelized point cloud features.
In one embodiment, the method further comprises:
acquiring an initial three-dimensional target detection model, wherein the initial three-dimensional target detection model is used for executing the three-dimensional target detection method;
inputting a plurality of sample images into the initial three-dimensional target detection model to obtain sample center point coordinates, wherein the sample images are determined based on a multi-camera data set;
Determining a sample elliptical feature map based on the sample center coordinates, and determining a second loss function based on the distance between the sample elliptical feature map and a label elliptical feature map, the label elliptical feature map being determined based on label center point coordinates of a sample image;
training the initial three-dimensional object detection model based on the second loss function.
In one embodiment, the determining of the sample ellipse feature map includes:
determining a major axis and a minor axis of the sample ellipse feature map based on the sample center point coordinates;
setting a characteristic value corresponding to the sample center point as a first true value, setting a characteristic value corresponding to the sample elliptic characteristic diagram boundary as a second true value, and determining the characteristic value inside the sample elliptic characteristic diagram based on Gaussian distribution to obtain the sample elliptic characteristic diagram.
In a second aspect, the present application also provides a three-dimensional object detection apparatus, the apparatus comprising:
the image acquisition module is used for acquiring a plurality of images to be detected, and the images to be detected are generated based on a multi-camera;
the feature extraction module is used for acquiring voxel image features of a corresponding preset voxel space based on a plurality of images to be detected;
The feature fusion module is used for acquiring voxelized point cloud features corresponding to the preset voxel space based on an offline point cloud map, and fusing the voxelized image features and the voxelized point cloud features to obtain voxel fusion features;
and the target identification module is used for acquiring the identification result of the target object based on the voxel fusion characteristics.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring a plurality of images to be detected, wherein the images to be detected are generated based on a multi-camera;
based on a plurality of images to be detected, voxel image features of corresponding preset voxel spaces are obtained;
acquiring voxelized point cloud characteristics corresponding to the preset voxel space based on an offline point cloud map, and fusing the voxelized image characteristics and the voxelized point cloud characteristics to obtain voxel fusion characteristics;
and acquiring a recognition result of the target object based on the voxel fusion characteristics.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
Acquiring a plurality of images to be detected, wherein the images to be detected are generated based on a multi-camera;
based on a plurality of images to be detected, voxel image features of corresponding preset voxel spaces are obtained;
acquiring voxelized point cloud characteristics corresponding to the preset voxel space based on an offline point cloud map, and fusing the voxelized image characteristics and the voxelized point cloud characteristics to obtain voxel fusion characteristics;
and acquiring a recognition result of the target object based on the voxel fusion characteristics.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of:
acquiring a plurality of images to be detected, wherein the images to be detected are generated based on a multi-camera;
based on a plurality of images to be detected, voxel image features of corresponding preset voxel spaces are obtained;
acquiring voxelized point cloud characteristics corresponding to the preset voxel space based on an offline point cloud map, and fusing the voxelized image characteristics and the voxelized point cloud characteristics to obtain voxel fusion characteristics;
and acquiring a recognition result of the target object based on the voxel fusion characteristics.
The application provides a three-dimensional target detection method, a three-dimensional target detection device, computer equipment and a storage medium, wherein the three-dimensional target detection method comprises the following steps: acquiring a plurality of images to be detected, wherein the images to be detected are generated based on a multi-camera; acquiring voxelized image features of a corresponding preset voxel space based on the plurality of images to be detected; acquiring voxelized point cloud features corresponding to the preset voxel space based on an offline point cloud map, and fusing the voxelized image features and the voxelized point cloud features to obtain voxel fusion features; and acquiring a recognition result of the target object based on the voxel fusion features. The voxelized image features corresponding to the multi-view images and the voxelized point cloud features corresponding to the offline point cloud map are obtained in the same voxel space and fused, which improves the richness and accuracy of the voxel features; further, compared with real-time point cloud data, the offline point cloud map adopted in the application has a higher data volume and quality, and can further improve the richness and accuracy of the voxel features. Therefore, by improving the voxel feature quality, the method solves the technical problem of low three-dimensional target detection precision based on the multi-camera in the related art, has low calculation consumption, obviously improves the target detection precision while the detection time remains almost unchanged, and expands the application range of the intelligent driving perception technology.
Drawings
FIG. 1 is an application environment diagram of a three-dimensional object detection method according to an embodiment of the present application;
FIG. 2 is a flow chart of a three-dimensional object detection method according to an embodiment of the application;
FIG. 3 is a flowchart of an offline point cloud map fusion method according to an embodiment of the present application;
FIG. 4 is a flow chart of a three-dimensional object detection method according to another embodiment of the present application;
FIG. 5 is a schematic diagram of a training process of a three-dimensional object detection model according to an embodiment of the present application;
FIG. 6 is a block diagram of a three-dimensional object detection device according to an embodiment of the present application;
fig. 7 is an internal structural view of a computer device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The three-dimensional target detection method provided by the embodiment of the application can be applied to an application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a communication network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be located on a cloud or other network server. The terminal 102 may be configured as an in-vehicle auxiliary system in a driving scene, including various sensing devices such as a camera and a radar, and various information processing devices such as an in-vehicle electronic control unit. The terminal 102 and the server 104 exchange various kinds of data, such as driving data, environment data and operation data, through the communication network. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers.
Referring to fig. 2, fig. 2 is a flow chart of a three-dimensional object detection method according to an embodiment of the application.
In one embodiment, as shown in fig. 2, the three-dimensional object detection method includes:
s202: a plurality of images to be detected are acquired, and the images to be detected are generated based on a multi-camera.
Specifically, a multi-view image at a certain moment, namely an image to be detected, is obtained through a multi-view camera. It can be understood that by arranging the multi-camera in different directions, sensing images in different directions can be obtained, and then three-dimensional information of the current space can be obtained.
Illustratively, the multi-view cameras in the present embodiment are arranged based on the driving scene. The horizontal field of view of each camera is about 90 degrees; two horizontally forward-facing cameras are arranged right in front of the car body to obtain accurate depth information, and four cameras are arranged at the front and rear sides of the car body to detect the surrounding environment.
S204: based on a plurality of images to be detected, voxel image features of a corresponding preset voxel space are obtained.
Specifically, image feature extraction is carried out on the images to be detected respectively, and the image features are mapped to a three-dimensional voxel space based on the mapping relation between the image features and a preset voxel space to obtain voxelized image features; or directly mapping a plurality of images to be detected to a three-dimensional voxel space, and further extracting voxel characteristics to obtain voxel image characteristics. It will be appreciated that the specific manner in which the voxel image features are obtained from the image to be detected is not limited in this embodiment.
The preset voxel space in the embodiment is determined based on the range of each axis of the three-dimensional space coordinate system and the voxel size. For example, the preset voxel space is initialized by taking the center of the vehicle as the origin, the value ranges of the X, Y and Z axes are respectively [-7 meters, 70.6 meters], [-30.4 meters, 30.4 meters] and [-3 meters, 1 meter], and the voxel size is [0.2 meters, 0.2 meters, 0.2 meters], so that the size of the preset voxel space is 20×304×388.
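The following is a minimal sketch (not part of the original description) showing how the axis ranges and the 0.2-meter voxel size quoted above yield the 20×304×388 grid; the coordinate convention (X forward, Y left, Z up) is an assumption.

```python
import numpy as np

# Assumed vehicle coordinate convention: X forward, Y left, Z up (not stated explicitly above).
x_range = (-7.0, 70.6)        # metres
y_range = (-30.4, 30.4)       # metres
z_range = (-3.0, 1.0)         # metres
voxel_size = (0.2, 0.2, 0.2)  # metres per voxel along X, Y, Z

def grid_shape(x_range, y_range, z_range, voxel_size):
    """Number of voxels along each axis, returned as (Z, Y, X)."""
    nx = int(round((x_range[1] - x_range[0]) / voxel_size[0]))
    ny = int(round((y_range[1] - y_range[0]) / voxel_size[1]))
    nz = int(round((z_range[1] - z_range[0]) / voxel_size[2]))
    return nz, ny, nx

print(grid_shape(x_range, y_range, z_range, voxel_size))  # (20, 304, 388)
```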
S206: and acquiring voxelized point cloud characteristics corresponding to a preset voxel space based on the offline point cloud map, and fusing the voxelized image characteristics and the voxelized point cloud characteristics to obtain voxel fusion characteristics.
Specifically, corresponding voxelized point cloud features are obtained based on a preset voxel space, feature fusion is carried out on the voxelized image features and the voxelized point cloud features, voxel fusion features are obtained, and then analysis and identification are carried out on the voxel fusion features. It can be appreciated that the voxel fusion feature contains both image information and point cloud information, which is more rich and comprehensive than the individual image feature description information.
The voxelized point cloud features in the embodiment are features of point cloud data corresponding to a preset voxel space in an offline point cloud map; the offline point cloud map is offline data which is stored in a database in advance and describes map information through the point cloud.
It should be noted that, the voxelized point cloud features may be directly stored in the database, so as to facilitate direct extraction; or the offline point cloud map is stored in a database, and the point cloud data in the offline point cloud map are intercepted in real time and feature extraction is performed in the three-dimensional target detection process so as to obtain voxelized point cloud features.
S208: based on the voxel fusion characteristics, a recognition result of the target object is obtained.
Specifically, after the voxel fusion characteristics are obtained, the voxel fusion characteristics are analyzed and identified, so that the identification result of the target object is obtained.
Illustratively, the recognition result of the target object in the present embodiment includes, but is not limited to: the number of targets; the three-dimensional information of each target, comprising the coordinates of the target center point, the length, width and height of the target, and the heading angle of the target; the class of each target, comprising: vehicles, pedestrians and cyclists; and the confidence score of each target.
In the embodiment, the voxelized image features corresponding to the multi-view images and the voxelized point cloud features corresponding to the offline point cloud map are obtained in the same voxel space and fused, so that the richness and accuracy of the voxel features are improved; further, compared with real-time point cloud data, the offline point cloud map adopted in the application has a higher data volume and quality, and can further improve the richness and accuracy of the voxel features. Therefore, by improving the quality of the voxel features, the method solves the technical problem of low three-dimensional target detection precision based on the multi-camera in the related art, has low calculation consumption, obviously improves the target detection precision while the detection time remains almost unchanged, and expands the application range of the intelligent driving perception technology.
In another embodiment, the database stores in advance voxelized map features corresponding to an offline point cloud map, and the obtaining voxelized point cloud features corresponding to a preset voxel space includes:
step 1: acquiring current positioning information;
step 2: and determining a preset voxel space based on the current positioning information, and intercepting the voxelized map features based on the preset voxel space to obtain voxelized point cloud features.
Specifically, in this embodiment, the database stores the voxel map features in advance, where the voxel map features are features obtained by voxel conversion and feature extraction of the offline point cloud map.
Specifically, current positioning information is acquired based on navigation equipment, radar equipment and the like, and a currently positioned position point is used as a specific voxel point of a preset voxel space, so that a corresponding preset voxel space is determined.
Illustratively, the currently located location point is taken as the center voxel point of the preset voxel space.
Specifically, after a preset voxel space is determined, feature data corresponding to the preset voxel space is intercepted in the voxelized map feature to be used as voxelized point cloud features.
In the embodiment, the corresponding voxel space is determined through the current position, and then the voxelized map features are intercepted to obtain voxelized point cloud features, and the method that the voxelized map features are pre-stored in a database and the features are directly read avoids the need of carrying out feature extraction on an offline point cloud map in real time in the three-dimensional target detection process, so that the static acceleration of the offline point cloud map is realized, and the three-dimensional target detection efficiency is improved.
In another embodiment, obtaining the voxelized point cloud characteristics corresponding to the preset voxel space includes:
step 1: acquiring current positioning information;
step 2: determining a preset voxel space based on the current positioning information, and intercepting an offline point cloud map based on the preset voxel space to obtain offline point cloud data;
step 3: and carrying out voxelized conversion and feature extraction on the offline point cloud data to obtain voxelized point cloud features.
Specifically, in this embodiment, an offline point cloud map is pre-stored in the database, where the offline point cloud map includes point cloud information related to the surrounding environment.
Specifically, current positioning information is acquired based on navigation equipment, radar equipment and the like, and a currently positioned position point is used as a specific voxel point of a preset voxel space, so that a corresponding preset voxel space is determined.
Illustratively, the currently located location point is taken as the center voxel point of the preset voxel space.
Specifically, after a preset voxel space is determined, intercepting data corresponding to the preset voxel space, namely offline point cloud data, from an offline point cloud map, mapping the point cloud data to the preset voxel space to realize voxelization, and extracting features of the voxelized point cloud data to obtain voxelized point cloud features.
Referring to fig. 3, fig. 3 is a flowchart illustrating an offline point cloud map fusion method according to an embodiment of the application.
For example, as shown in fig. 3, the positioning information Pose of the image acquisition device at the current acquisition time, the offline point cloud Map and the voxelized image feature Voxel_Feature of the current frame are taken as input, and the offline point cloud Map and the voxelized image feature Voxel_Feature are converted into the same coordinate system through the current positioning information Pose. Offline point cloud data is then intercepted from the offline point cloud Map according to the preset voxel space, voxelization is carried out, and a sparse 3D convolutional neural network is adopted to extract voxelized features, so that the voxelized point cloud feature Voxel_Map is obtained. The voxelized image feature Voxel_Feature and the voxelized point cloud feature Voxel_Map are combined in the feature dimension, and feature fusion is carried out by a 3D convolutional neural network, a Transformer or other methods to obtain the voxel fusion feature Voxel_Feature_Map.
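As an illustration only, the following sketch shows one possible form of the fusion step described above, with assumed channel sizes and an assumed 3D-convolution fusion head; a Transformer-based fusion would be a drop-in alternative, and none of the layer choices are prescribed by the description.

```python
import torch
import torch.nn as nn

class VoxelFusion(nn.Module):
    """Concatenate the voxelized image feature and the voxelized point cloud feature along
    the channel dimension and fuse them with a small 3D convolutional block (assumed form)."""
    def __init__(self, img_channels=8, map_channels=16, out_channels=16):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv3d(img_channels + map_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, voxel_feature, voxel_map):
        # voxel_feature: (B, C_img, D, H, W); voxel_map: (B, C_map, D, H, W), same voxel space
        return self.fuse(torch.cat([voxel_feature, voxel_map], dim=1))

fusion = VoxelFusion()
# Small grid for illustration; the description uses a 20 x 304 x 388 preset voxel space.
voxel_feature = torch.zeros(1, 8, 4, 30, 38)
voxel_map = torch.zeros(1, 16, 4, 30, 38)
voxel_feature_map = fusion(voxel_feature, voxel_map)  # (1, 16, 4, 30, 38)
```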
In the embodiment, the corresponding voxel space is determined through the current position, and then the offline point cloud map is intercepted, subjected to voxel conversion and feature extraction, so that the voxel point cloud features are obtained, the feature acquisition process is simple, a complex point cloud feature extraction network is not needed, and the efficiency of three-dimensional target detection is improved.
In another embodiment, based on a plurality of images to be detected, acquiring the voxelized image features of the corresponding preset voxel space comprises:
step 1: extracting features of a plurality of images to be detected to obtain a plurality of corresponding two-dimensional image features;
step 2: voxel sampling is carried out on the plurality of two-dimensional image features, and voxel sampling features corresponding to a preset voxel space are obtained;
step 3: and extracting the characteristics of the voxelized sampling characteristics to obtain voxelized image characteristics.
By way of example, a general convolutional neural network or a Transformer network may be selected to fully extract the image features; ResNet34 is adopted as the image feature extraction module in this embodiment, and weights fully pre-trained on the ImageNet dataset are loaded to improve the feature extraction capability and reduce the training difficulty.
For example, a preset Voxel space volume_image is initialized, points in the volume_image are projected into two-dimensional Image features through camera internal parameters and external parameters, and three-dimensional feature sampling accuracy is improved through a linear interpolation method. The same three-dimensional space point can be projected onto a plurality of images, the characteristic value of the point which cannot be projected is filled with 0, and the extracted characteristics of different images are spliced in sequence to ensure that the channel numbers are the same.
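A minimal sketch of the voxel sampling step described above, under assumed conventions (pinhole camera model, vehicle-to-camera extrinsics, bilinear interpolation of the 2D feature map); the function and parameter names are illustrative and not taken from the description.

```python
import numpy as np

def sample_voxel_features(voxel_centers, image_feature, intrinsics, extrinsics):
    """voxel_centers: (V, 3) in vehicle coordinates; image_feature: (C, H, W) 2D image feature;
    intrinsics: (3, 3); extrinsics: (4, 4) vehicle-to-camera transform. Returns (V, C), with
    zeros for voxels that do not project into the image."""
    V = voxel_centers.shape[0]
    C, H, W = image_feature.shape
    homo = np.concatenate([voxel_centers, np.ones((V, 1))], axis=1)   # (V, 4) homogeneous points
    cam = (extrinsics @ homo.T)[:3]                                   # (3, V) camera coordinates
    valid = cam[2] > 1e-3                                             # keep points in front of the camera
    pix = intrinsics @ cam
    z = np.where(np.abs(pix[2]) < 1e-6, 1e-6, pix[2])
    u, v = pix[0] / z, pix[1] / z
    valid &= (u >= 0) & (u <= W - 1) & (v >= 0) & (v <= H - 1)
    out = np.zeros((V, C), dtype=image_feature.dtype)
    u0, v0 = np.floor(u[valid]).astype(int), np.floor(v[valid]).astype(int)
    du, dv = u[valid] - u0, v[valid] - v0
    u1, v1 = np.minimum(u0 + 1, W - 1), np.minimum(v0 + 1, H - 1)
    # Bilinear (linear) interpolation of the feature map at the projected sub-pixel location.
    f = (image_feature[:, v0, u0] * (1 - du) * (1 - dv)
         + image_feature[:, v0, u1] * du * (1 - dv)
         + image_feature[:, v1, u0] * (1 - du) * dv
         + image_feature[:, v1, u1] * du * dv)
    out[valid] = f.T
    return out
```

Features sampled from the different cameras in this way would then be concatenated in order, as stated above.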
For example, in the process of extracting features from the voxelized sampling features, a three-dimensional convolutional neural network, a sparse convolutional neural network and the like can be selected. Alternatively, before the voxelized image features are extracted, a voxel attention mechanism may be applied: a set of 1×1 convolution kernels combined with a self-attention mechanism estimates the probability that each voxel in the three-dimensional space belongs to the foreground, and the voxelized image features are then extracted on the basis of this voxel attention.
In the embodiment, two-dimensional feature extraction, voxel sampling and voxel feature extraction operations are sequentially performed on the image to be detected, and the voxel image feature extraction process is simple and easy to realize, so that the operation cost of the three-dimensional target detection model is reduced and the detection efficiency is improved.
In another embodiment, feature extraction of the voxelized sample features to obtain voxelized image features includes:
step 1: compressing the first dimension and the second dimension of the voxelized sampling feature to the same dimension to obtain a compressed sampling feature;
step 2: performing feature extraction on the compressed sampling features based on two-dimensional convolution kernels to obtain compressed image features;
step 3: and carrying out dimension recovery on the first dimension and the second dimension of the compressed image feature to obtain the voxelized image feature.
For example, the first dimension and the second dimension in the present embodiment may be set as a long dimension and a wide dimension, respectively.
Illustratively, to increase the detection speed, this example employs a special two-dimensional convolution to extract the three-dimensional features. First, the length dimension and the width dimension of the preset voxel space are compressed to the same dimension to reduce the dimensionality of the voxelized sampling feature Voxel_Sample_Feature; feature extraction is then performed through a plurality of two-dimensional convolution layers with a convolution kernel size of 3×1, and the 192-dimensional features are compressed to 8 dimensions to obtain the compressed image features; finally the result is reshaped to the original dimensions, recovering the length dimension and the width dimension, to obtain the voxelized image feature Voxel_Feature.
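The following is a minimal sketch (assumed tensor layout and an assumed intermediate channel width) of the compress-convolve-restore trick described above: the two spatial dimensions are merged into one, a stack of 3×1 two-dimensional convolutions reduces the 192-channel feature to 8 channels, and the merged dimension is then reshaped back.

```python
import torch
import torch.nn as nn

class CompressedVoxelEncoder(nn.Module):
    def __init__(self, in_channels=192, out_channels=8):
        super().__init__()
        mid = 64  # intermediate width: an assumption, not stated in the description
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, mid, kernel_size=(3, 1), padding=(1, 0)),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, out_channels, kernel_size=(3, 1), padding=(1, 0)),
        )

    def forward(self, voxel_sample_feature):
        # voxel_sample_feature: (B, C, D, H, W) -> merge H and W into one dimension
        b, c, d, h, w = voxel_sample_feature.shape
        x = voxel_sample_feature.reshape(b, c, d, h * w)  # treat (D, H*W) as a 2D "image"
        x = self.conv(x)                                  # (B, out_channels, D, H*W)
        return x.reshape(b, -1, d, h, w)                  # restore the length and width dimensions

encoder = CompressedVoxelEncoder()
voxel_feature = encoder(torch.zeros(1, 192, 4, 30, 38))  # small grid for illustration
```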
In the embodiment, the voxel sampling characteristics are compressed, then the two-dimensional convolution kernel is used for carrying out characteristic extraction and then carrying out dimension recovery, so that the excessive operand caused by the direct characteristic extraction through the three-dimensional convolution kernel is avoided, the operation cost in the characteristic extraction process is reduced, and the efficiency of three-dimensional target detection is improved.
In another embodiment, based on the voxel fusion feature, obtaining the recognition result of the target object includes:
step 1: performing feature extraction on the voxel fusion features based on two-dimensional convolution kernels to obtain bird's eye view features;
Step 2: based on the aerial view characteristics, a recognition result of the target object is obtained.
Specifically, feature stitching is performed on voxel fusion features in the height direction, and feature extraction is performed through a series of two-dimensional convolution kernels, so that two-dimensional aerial view features are obtained. And analyzing and identifying the aerial view features to finally obtain identification results such as the position, the size, the angle and the like of the three-dimensional target.
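As a sketch only (layer sizes assumed), the bird's-eye-view step above can be written as stacking the voxel fusion feature along the height dimension into the channel dimension and applying ordinary two-dimensional convolutions:

```python
import torch
import torch.nn as nn

class BEVEncoder(nn.Module):
    def __init__(self, voxel_channels=16, depth=20, bev_channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(voxel_channels * depth, bev_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(bev_channels, bev_channels, kernel_size=3, padding=1),
        )

    def forward(self, voxel_fusion_feature):
        # (B, C, D, H, W) -> stack the height dimension into the channels -> (B, C*D, H, W)
        b, c, d, h, w = voxel_fusion_feature.shape
        return self.conv(voxel_fusion_feature.reshape(b, c * d, h, w))
```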
Referring to fig. 4, fig. 4 is a flow chart of a three-dimensional object detection method according to another embodiment of the application.
Illustratively, as shown in FIG. 4, the input is a plurality of images {Image[i], 0 ≤ i ≤ N-1}, where i represents the camera index and N represents the number of cameras at a certain time. The image feature extraction module is used for extracting features of the multi-view images; the voxel sampling module is used for sampling the two-dimensional image features to obtain voxelized sampling features; the voxel feature extraction module is used for extracting three-dimensional features of the voxelized sampling features to obtain voxelized image features; the point cloud map fusion module is used for carrying out feature fusion on the voxelized image features and the voxelized point cloud features to obtain voxel fusion features; the aerial view feature extraction module is used for extracting aerial view features based on the voxel fusion features; the recognition module is used for analyzing and recognizing the aerial view features to acquire the recognition result of the target object; and the loss module is used for training the three-dimensional target detection model in the training process.
In another embodiment, the method further comprises:
step 1: acquiring an initial three-dimensional target detection model, wherein the initial three-dimensional target detection model is used for executing a three-dimensional target detection method;
step 2: inputting a plurality of sample images and sample point cloud data synchronized with the sample images to an initial three-dimensional target detection model, the sample images being determined based on a multi-view camera dataset;
step 3: extracting sample voxelized image features of a plurality of sample images based on an initial three-dimensional target detection model, wherein the sample voxelized image features and the sample voxelized point cloud features are determined based on the same sample voxel space;
step 4: training an initial three-dimensional target detection model based on a first loss function, wherein the first loss function is determined based on sample voxelized image features and sample voxelized point cloud features, and the sample voxelized point cloud features are acquired based on sample point cloud data.
Specifically, an initial three-dimensional object detection model is obtained, wherein the initial three-dimensional object detection model is an untrained network model for executing the three-dimensional object detection method.
Specifically, a sample image and sample point cloud data synchronous with the sample image are input into an initial three-dimensional target detection model, corresponding sample voxelized image features and sample voxelized point cloud features are extracted through the initial three-dimensional target detection model, and the sample voxelized image features and the sample voxelized point cloud features are determined based on the same sample voxel space in the voxelized process.
Specifically, the sample voxelized point cloud characteristics are used as characteristic standard parameters, a first loss function is established based on the sample voxelized image characteristics and the sample voxelized point cloud characteristics, and parameters in an initial three-dimensional target detection model are trained through the first loss function, so that the first loss function meets convergence conditions, and a trained three-dimensional target detection model is obtained.
The sample image is determined based on the multi-camera data set, and the sample point cloud data are data which are acquired in real time and are synchronous with the acquisition time of the sample image.
In this embodiment, the first loss function is established by acquiring the point cloud features and combining them with the image features, and the parameters of the initial three-dimensional target detection model are then adjusted through the first loss function, so that the accuracy of model feature extraction is improved and the accuracy of three-dimensional target detection is further improved. The method in this embodiment is applied only in the training stage and adds no computation in the detection stage, i.e. the detection time is not increased, so the detection precision is improved while the detection speed is guaranteed.
In another embodiment, the process of obtaining the sample voxelized point cloud features includes:
step 1: based on external parameters of the point cloud acquisition equipment, voxel sampling is carried out on sample point cloud data to obtain sample voxel point cloud data;
Step 2: and carrying out feature extraction on the sample voxelized point cloud data to obtain the sample voxelized point cloud features.
Specifically, converting sample point cloud data into the same coordinate system of sample voxelized image features through external parameters of point cloud acquisition equipment, and voxelizing the sample point cloud data by using the same voxel space to obtain sample voxelized point cloud data; and then, carrying out feature extraction on the voxelized point cloud data to obtain the voxelized point cloud features of the sample.
Referring to fig. 5, fig. 5 is a schematic diagram of a training process of a three-dimensional object detection model according to an embodiment of the application.
For example, as shown in fig. 5, the input of the training phase is point cloud data PointCloud that is time-synchronized with the multi-view images, where n is the number of points, and the voxelized image feature Voxel_Feature of the current frame has a size of 8×20×304×388. The point cloud data PointCloud is converted into the same coordinate system as Voxel_Feature through the external parameter RT_PointCloud of the point cloud acquisition device, and is voxelized using the same voxel space to obtain the voxelized point cloud data Voxel_PointCloud. A trained CenterPoint network model is adopted to extract point cloud features from the voxelized point cloud data, thereby obtaining the voxelized point cloud feature Voxel_PointCloud_Feature with a size of 16×20×304×388. Then, further feature extraction may be carried out on the voxelized image feature Voxel_Feature by means of a three-dimensional convolutional neural network, a Transformer and the like to obtain a refined Voxel_Image_Feature, whose size of 16×20×304×388 is the same as that of the voxelized point cloud feature. Finally, the loss is calculated through knowledge distillation, and the voxel feature extraction network is optimized through the first loss function.
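A minimal sketch of the knowledge-distillation loss described above; the description does not state the exact distance measure, so the mean-squared-error term and the optional non-empty-voxel mask below are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(voxel_image_feature, voxel_pointcloud_feature, mask=None):
    """Both features share the shape (B, 16, 20, 304, 388) in the example above.
    `mask` optionally restricts the loss to non-empty voxels (an assumption)."""
    if mask is not None:
        return F.mse_loss(voxel_image_feature[mask], voxel_pointcloud_feature[mask])
    return F.mse_loss(voxel_image_feature, voxel_pointcloud_feature)
```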
In the embodiment, the point cloud data are sampled and the characteristics are extracted successively to obtain the voxel point cloud characteristics, and the characteristic extraction mode is simple, so that the training efficiency of the initial three-dimensional target detection model is improved.
In another embodiment, the method further comprises:
step 1: acquiring an initial three-dimensional target detection model, wherein the initial three-dimensional target detection model is used for executing a three-dimensional target detection method;
step 2: inputting a plurality of sample images into an initial three-dimensional target detection model to obtain sample center point coordinates, wherein the sample images are determined based on a multi-camera data set;
step 3: determining a sample ellipse feature map based on the sample center point coordinates, and determining a second loss function based on the distance between the sample ellipse feature map and a tag ellipse feature map, the tag ellipse feature map being determined based on the tag center point coordinates of the sample image;
step 4: the initial three-dimensional object detection model is trained based on the second loss function.
Specifically, the recognition result in the training process in this embodiment includes the coordinates of the center point of the sample. After the coordinates of the sample center point are obtained, an elliptical boundary frame is determined based on the coordinates of the sample center point, and a sample elliptical feature map is defined based on the elliptical boundary frame. And determining a label ellipse feature map corresponding to the center point coordinate in the sample label based on the same method, determining a focal_loss second loss function by combining the sample ellipse feature map and the label ellipse feature map, and training an initial three-dimensional target detection model based on the second loss function so as to regress the center point coordinate.
It can be appreciated that, since the difficulty of estimating the depth of a target with a multi-view vision method increases as the target moves away from the host vehicle, and the influence of lateral error and depth error is different in the intelligent driving field, the regression of the target center point coordinates in this example adopts the optimized second loss function.
Optionally, in this embodiment, other loss functions may be further established to train the initial three-dimensional target detection model. For example, to detect the motion direction of the target center point, the sine value and cosine value of the direction are regressed with an L1_Loss loss function; regression of the offset of the target center point adopts an L1_Loss loss function; and regression of the target length, width and height and the target height adopts a SmoothL1_Loss loss function, where the losses of different detection branches are assigned different weights.
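A minimal sketch of how the additional regression losses listed above could be combined; the head outputs, dictionary keys and branch weights are illustrative assumptions, not taken from the description.

```python
import torch
import torch.nn.functional as F

def regression_loss(pred, target, weights=(1.0, 1.0, 1.0)):
    """pred/target: dicts with 'direction' (sin, cos), 'offset' (dx, dy) and
    'size' (length, width, height, z) tensors; the branch weights are illustrative."""
    w_dir, w_off, w_size = weights
    loss = w_dir * F.l1_loss(pred["direction"], target["direction"])       # heading sin/cos
    loss = loss + w_off * F.l1_loss(pred["offset"], target["offset"])      # centre-point offset
    loss = loss + w_size * F.smooth_l1_loss(pred["size"], target["size"])  # box dimensions / height
    return loss
```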
In the embodiment, the loss function is optimized based on the elliptic characteristic diagram, so that the initial three-dimensional target detection model is ensured to be suitable for a scene with inconsistent transverse information and depth information, and the accuracy of the model and the three-dimensional target detection accuracy are improved.
In another embodiment, the determining of the sample ellipse feature map includes:
step 1: determining a major axis and a minor axis of the sample ellipse feature map based on the sample center point coordinates;
Step 2: setting a characteristic value corresponding to a sample center point as a first true value, setting a characteristic value corresponding to a sample elliptic characteristic diagram boundary as a second true value, and determining the characteristic value inside the sample elliptic characteristic diagram based on Gaussian distribution to obtain the sample elliptic characteristic diagram.
Illustratively, a Gaussian ellipse is used in this embodiment to create the sample ellipse feature map, and the major axis and the minor axis of the ellipse change with the change of the center point coordinates.
Specifically, i is the target index, K is the total number of ground-truth targets in the current frame, x_label[i] and y_label[i] are the coordinates of the center point of the target ground-truth label, i.e. the center of the ellipse, and A[y_label[i]] and B[x_label[i]] are respectively the major and minor axes of the ellipse, which change with the target coordinates; the ratio of the major axis to the minor axis increases as the target moves away from the host vehicle. In the sample ellipse feature map, the true value at [x_label[i], y_label[i]] is 1, the true values on the ellipse boundary and outside the ellipse are 0, and the true values from the ellipse center to the boundary points follow a Gaussian distribution, which gives the true values inside the ellipse.
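The following sketch generates a sample ellipse feature map of the kind described above; the Gaussian fall-off rate and the way the axis lengths grow with distance from the host vehicle are not specified in the text, so they appear here as assumptions.

```python
import numpy as np

def ellipse_heatmap(height, width, centers, axes):
    """centers: list of (x, y) target centre coordinates on the feature-map grid;
    axes: list of (a, b) semi-axes in grid cells. Returns a (height, width) heatmap where
    each centre is 1, the ellipse boundary and outside are 0, and values inside decay
    with a Gaussian (sigma assumed)."""
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    heatmap = np.zeros((height, width), dtype=np.float32)
    for (cx, cy), (a, b) in zip(centers, axes):
        r2 = ((xs - cx) / b) ** 2 + ((ys - cy) / a) ** 2  # 0 at the centre, 1 on the boundary
        gauss = np.exp(-r2 / (2 * 0.5 ** 2))              # Gaussian fall-off towards the boundary
        gauss[r2 >= 1.0] = 0.0                            # boundary and outside are 0
        heatmap = np.maximum(heatmap, gauss)              # keep the strongest response per cell
    return heatmap

# Example: one target at (10, 8) with semi-axes (3, 5) on a 20 x 30 map.
hm = ellipse_heatmap(20, 30, centers=[(10, 8)], axes=[(3, 5)])
```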
In the embodiment, the method for establishing the sample ellipse characteristic diagram and the internal true value is simple and easy to calculate, so that the training process is accelerated, and the training efficiency of the initial three-dimensional target detection model is improved.
In another embodiment, based on the above embodiment, experimental verification of the three-dimensional object detection method was performed. Specifically, in this embodiment a nuScenes-like dataset is adopted, which includes image data from six surround-view cameras and time-synchronized laser radar data; the images are RGB color images with a resolution of 720×1280. The horizontal field of view of each camera is about 90 degrees. Two horizontally forward-facing cameras are arranged right in front of the car body to obtain accurate depth information, and four cameras are arranged at the front and rear sides to detect surrounding targets.
Specifically, in the training and testing stage, this example verifies multiple optimization modes, namely, using the knowledge distillation module (the optimized first loss function) alone, using the point cloud map fusion module alone, using the optimized Focal_Loss second loss module alone, and using the optimization schemes in combination. Finally, training and inference tests are carried out on the various schemes using the nuScenes-like dataset produced in this example, and the schemes are compared with the currently popular multi-view-image-based three-dimensional target detection schemes BEVDepth, BEVDet and BEVFormer, as well as with the basic model without any optimization module. Under the same training set and model parameter optimization method, the 3D mAP precision comparison of various indexes on the verification set is shown in the following table:
Specifically, as can be seen from the table, on the dataset adopted in this example, the optimization schemes of this example improve the accuracy of three-dimensional target detection for each object class, and the highest accuracy is obtained by combining multiple optimization schemes, which verifies the effectiveness of the optimization schemes of this example.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides a three-dimensional object detection device for realizing the three-dimensional object detection method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the three-dimensional object detection device provided below may be referred to the limitation of the three-dimensional object detection method hereinabove, and will not be repeated here.
In one embodiment, as shown in fig. 6, there is provided a three-dimensional object detection apparatus including:
an image acquisition module 10, configured to acquire a plurality of images to be detected, where the plurality of images to be detected are generated based on a multi-camera;
the feature extraction module 20 is configured to obtain voxel image features of a corresponding preset voxel space based on a plurality of images to be detected;
the feature extraction module 20 is further configured to perform feature extraction on a plurality of images to be detected, so as to obtain a plurality of corresponding two-dimensional image features;
voxel sampling is carried out on the plurality of two-dimensional image features, and voxel sampling features corresponding to a preset voxel space are obtained;
extracting the characteristics of the voxelized sampling characteristics to obtain voxelized image characteristics;
The feature extraction module 20 is further configured to compress the first dimension and the second dimension of the voxelized sampling feature to the same dimension, so as to obtain a compressed sampling feature;
performing feature extraction on the compressed sampling features based on two-dimensional convolution kernels to obtain compressed image features;
performing dimension recovery on the first dimension and the second dimension of the compressed image feature to obtain a voxelized image feature;
the feature fusion module 30 is configured to obtain a voxelized point cloud feature corresponding to a preset voxel space based on an offline point cloud map, and fuse the voxelized image feature and the voxelized point cloud feature to obtain a voxel fusion feature;
the feature fusion module 30 is further configured to obtain current positioning information;
determining a preset voxel space based on the current positioning information, and intercepting the voxelized map features based on the preset voxel space to obtain voxelized point cloud features;
the feature fusion module 30 is further configured to obtain current positioning information;
determining a preset voxel space based on the current positioning information, and intercepting an offline point cloud map based on the preset voxel space to obtain offline point cloud data;
performing voxelized conversion and feature extraction on the offline point cloud data to obtain voxelized point cloud features;
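A minimal sketch of this branch, assuming the offline point cloud map is a point array in the map frame and using per-voxel mean points as a stand-in for the feature extractor, which the embodiment does not fix:

```python
# Hedged sketch: crop the offline point cloud map around the current positioning,
# voxelize it, and use per-voxel mean points as a simple voxelized point cloud feature.
import numpy as np


def crop_and_voxelize(map_points, pose_xyz, extent, voxel_size):
    """map_points: (N, 3) offline point cloud map in the map frame
    pose_xyz:   (3,) current positioning of the ego vehicle
    extent:     (3,) half-size of the preset voxel space in meters
    voxel_size: (3,) edge length of one voxel
    returns a dense (X, Y, Z, 3) array of per-voxel mean-point features
    """
    extent = np.asarray(extent, dtype=np.float32)
    voxel_size = np.asarray(voxel_size, dtype=np.float32)
    local = map_points - np.asarray(pose_xyz, dtype=np.float32)   # shift to the ego origin
    mask = np.all(np.abs(local) < extent, axis=1)                  # intercept the preset voxel space
    local = local[mask]

    dims = (2 * extent / voxel_size).astype(int)                   # voxel grid resolution
    idx = np.floor((local + extent) / voxel_size).astype(int)
    idx = np.clip(idx, 0, dims - 1)

    feat = np.zeros((*dims, 3), dtype=np.float32)
    count = np.zeros(tuple(dims), dtype=np.float32)
    flat = np.ravel_multi_index(idx.T, tuple(dims))                # one flat voxel index per point
    np.add.at(feat.reshape(-1, 3), flat, local)
    np.add.at(count.reshape(-1), flat, 1.0)
    feat /= np.maximum(count[..., None], 1.0)                      # per-voxel mean point
    return feat
```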
A target recognition module 40, configured to obtain a recognition result of the target object based on the voxel fusion feature;
the target recognition module 40 is further configured to perform feature extraction on the voxel fusion features based on a two-dimensional convolution kernel to obtain bird's eye view features;
acquiring a recognition result of the target object based on the aerial view characteristic;
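The fusion and recognition path can be illustrated as below; the channel sizes, the concatenation-based fusion, and the CenterNet-style heatmap head are assumptions, as the embodiment only states that the fused voxel features are reduced to bird's eye view features by a two-dimensional convolution kernel before recognition.

```python
# Sketch of the fusion-and-recognition path: concatenate the two voxel features,
# collapse the height axis into channels to form a bird's eye view map, apply a
# 2D convolution kernel, and predict a per-class center heatmap.
import torch
import torch.nn as nn


class FusionBEVHead(nn.Module):
    def __init__(self, img_c: int, pcd_c: int, z_bins: int, num_classes: int):
        super().__init__()
        fused_c = (img_c + pcd_c) * z_bins                 # height collapsed into channels
        self.bev_conv = nn.Sequential(
            nn.Conv2d(fused_c, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.heatmap = nn.Conv2d(128, num_classes, kernel_size=1)

    def forward(self, voxel_img_feat, voxel_pcd_feat):
        # both inputs: (B, C, X, Y, Z) over the same preset voxel space
        fused = torch.cat([voxel_img_feat, voxel_pcd_feat], dim=1)   # voxel fusion feature
        b, c, x, y, z = fused.shape
        bev = fused.permute(0, 1, 4, 2, 3).reshape(b, c * z, x, y)   # bird's eye view feature
        bev = self.bev_conv(bev)
        return self.heatmap(bev).sigmoid()                            # per-class center heatmap
```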
the three-dimensional target detection device further comprises a first training module;
the first training module is used for acquiring an initial three-dimensional target detection model, and the initial three-dimensional target detection model is used for executing a three-dimensional target detection method;
inputting a plurality of sample images and sample point cloud data synchronized with the sample images to an initial three-dimensional target detection model, the sample images being determined based on a multi-view camera dataset;
extracting sample voxelized image features of a plurality of sample images based on an initial three-dimensional target detection model, wherein the sample voxelized image features and the sample voxelized point cloud features are determined based on the same sample voxel space;
training an initial three-dimensional target detection model based on a first loss function, wherein the first loss function is determined based on sample voxelized image characteristics and sample voxelized point cloud characteristics, and the sample voxelized point cloud characteristics are acquired based on sample point cloud data;
The first training module is further configured to perform voxel sampling on the sample point cloud data based on extrinsic parameters of the point cloud acquisition device to obtain sample voxelized point cloud data;
extracting characteristics of the sample voxelized point cloud data to obtain sample voxelized point cloud characteristics;
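One plausible, non-authoritative realization of the first loss function is an L2 alignment between the sample voxelized image features and the sample voxelized point cloud features over the same sample voxel space, restricted to voxels actually occupied by points; the exact distance and masking are not specified by this embodiment.

```python
# Hedged sketch of the first loss function as a feature-alignment (distillation) loss.
import torch


def first_loss(img_voxel_feat: torch.Tensor,
               pcd_voxel_feat: torch.Tensor) -> torch.Tensor:
    """Both tensors: (B, C, X, Y, Z) over the same sample voxel space."""
    # only supervise voxels where the point cloud branch actually has features
    occupied = (pcd_voxel_feat.abs().sum(dim=1, keepdim=True) > 0).float()
    diff = (img_voxel_feat - pcd_voxel_feat) ** 2 * occupied
    return diff.sum() / occupied.sum().clamp(min=1.0)
```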
the three-dimensional target detection device further comprises a second training module;
the second training module is used for acquiring an initial three-dimensional target detection model, and the initial three-dimensional target detection model is used for executing a three-dimensional target detection method;
inputting a plurality of sample images into an initial three-dimensional target detection model to obtain sample center point coordinates, wherein the sample images are determined based on a multi-camera data set;
determining a sample ellipse feature map based on the sample center point coordinates, and determining a second loss function based on the distance between the sample ellipse feature map and a label ellipse feature map, the label ellipse feature map being determined based on the label center point coordinates of the sample image;
training an initial three-dimensional target detection model based on a second loss function;
the second training module is further configured to determine the major axis and the minor axis of the sample ellipse feature map based on the sample center point coordinates;
setting the feature value corresponding to the sample center point as a first true value, setting the feature value corresponding to the boundary of the sample ellipse feature map as a second true value, and determining the feature values inside the sample ellipse feature map based on a Gaussian distribution to obtain the sample ellipse feature map.
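The construction of the sample ellipse feature map and a simple distance-based second loss can be sketched as follows. The axis inputs, the boundary value of 0.1, and the mean-squared distance are assumptions for illustration; the embodiment only requires the center to take a first true value, the boundary a second true value, and the interior to follow a Gaussian distribution.

```python
# Hedged sketch of the sample ellipse feature map and a second loss on map distance.
import math
import torch


def ellipse_gaussian_map(shape, center, axes, boundary_value=0.1):
    """shape:  (H, W) of the feature map
    center: (cx, cy) sample center point in map coordinates
    axes:   (a, b) semi-major and semi-minor axis lengths in pixels (assumed inputs)
    """
    h, w = shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    # normalized elliptical radius: 0 at the center point, 1 on the ellipse boundary
    r2 = ((xs - center[0]) / axes[0]) ** 2 + ((ys - center[1]) / axes[1]) ** 2
    k = -math.log(boundary_value)            # makes the boundary take the second true value
    return torch.exp(-k * r2)                # first true value 1 at the center, Gaussian fall-off inside


def second_loss(pred_map: torch.Tensor, label_map: torch.Tensor) -> torch.Tensor:
    # distance between the sample ellipse feature map and the label ellipse feature map
    return ((pred_map - label_map) ** 2).mean()
```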
The respective modules in the above three-dimensional object detection device may be implemented in whole or in part by software, hardware, or a combination thereof. Each of the above modules may be embedded in, or independent of, a processor of the computer device in the form of hardware, or may be stored in a memory of the computer device in the form of software, so that the processor can invoke and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in FIG. 7. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The communication interface of the computer device is used for wired or wireless communication with an external terminal, and the wireless communication may be implemented through Wi-Fi, a mobile cellular network, NFC (near field communication), or other technologies. The computer program is executed by the processor to implement a three-dimensional object detection method. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device of the computer device may be a touch layer covering the display screen, or keys, a trackball, or a touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse.
It will be appreciated by those skilled in the art that the structure shown in FIG. 7 is merely a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program, and the processor, when executing the computer program, implementing the following steps:
acquiring a plurality of images to be detected, wherein the images to be detected are generated based on a multi-camera;
acquiring voxelized image features of a corresponding preset voxel space based on the plurality of images to be detected;
acquiring voxelized point cloud features corresponding to the preset voxel space based on an offline point cloud map, and fusing the voxelized image features and the voxelized point cloud features to obtain voxel fusion features;
and acquiring a recognition result of the target object based on the voxel fusion features.
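Tying the hypothetical helpers from the earlier sketches together, an end-to-end pass over the four steps above might look like the following; every component name and shape here comes from those sketches and is an assumption, not the claimed implementation.

```python
# Purely illustrative glue reusing the hypothetical helpers sketched earlier:
# voxel_sample_image_features, CompressedVoxelConv, crop_and_voxelize, FusionBEVHead.
import torch


def detect(images_feat_2d, voxel_centers, intrinsics, extrinsics, img_hw,
           map_points, pose_xyz, extent, voxel_size, grid_xyz,
           voxel_conv, head):
    x, y, z = grid_xyz
    # steps 1-2: voxelized image features of the preset voxel space
    sampled = voxel_sample_image_features(images_feat_2d, voxel_centers,
                                          intrinsics, extrinsics, img_hw)    # (C, X*Y*Z)
    img_vox = sampled.view(1, -1, x, y, z)          # assumes voxel_centers ordered with Z fastest
    img_vox = voxel_conv(img_vox)                   # CompressedVoxelConv instance
    # step 3: voxelized point cloud features from the offline map
    pcd_vox = torch.from_numpy(crop_and_voxelize(map_points, pose_xyz,
                                                 extent, voxel_size))        # (X, Y, Z, 3)
    pcd_vox = pcd_vox.permute(3, 0, 1, 2).unsqueeze(0).float()               # (1, 3, X, Y, Z)
    # step 4: fusion, bird's eye view features, and the recognition result
    return head(img_vox, pcd_vox)                   # FusionBEVHead instance
```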
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements the following steps:
acquiring a plurality of images to be detected, wherein the images to be detected are generated based on a multi-camera;
acquiring voxelized image features of a corresponding preset voxel space based on the plurality of images to be detected;
acquiring voxelized point cloud features corresponding to the preset voxel space based on an offline point cloud map, and fusing the voxelized image features and the voxelized point cloud features to obtain voxel fusion features;
and acquiring a recognition result of the target object based on the voxel fusion features.
In one embodiment, a computer program product is provided, including a computer program, and the computer program, when executed by a processor, implements the following steps:
acquiring a plurality of images to be detected, wherein the images to be detected are generated based on a multi-camera;
acquiring voxelized image features of a corresponding preset voxel space based on the plurality of images to be detected;
acquiring voxelized point cloud features corresponding to the preset voxel space based on an offline point cloud map, and fusing the voxelized image features and the voxelized point cloud features to obtain voxel fusion features;
and acquiring a recognition result of the target object based on the voxel fusion features.
The user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in the present application are all information and data authorized by the user or fully authorized by all parties.
Those skilled in the art will appreciate that all or part of the procedures in the above method embodiments may be implemented by a computer program stored in a non-volatile computer-readable storage medium, and the computer program, when executed, may include the procedures of the embodiments of the above methods. Any reference to memory, database, or other medium used in the embodiments provided in the present application may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided in the present application may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database and the like. The processor referred to in the embodiments provided in the present application may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic device, or a data processing logic device based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered to be within the scope described in this specification.
The above examples merely illustrate several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the application. It should be noted that several variations and modifications may be made by those skilled in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (14)

1. A method of three-dimensional object detection, the method comprising:
acquiring a plurality of images to be detected, wherein the images to be detected are generated based on a multi-camera;
acquiring voxelized image features of a corresponding preset voxel space based on the plurality of images to be detected;
acquiring voxelized point cloud features corresponding to the preset voxel space based on an offline point cloud map, and fusing the voxelized image features and the voxelized point cloud features to obtain voxel fusion features;
and acquiring a recognition result of the target object based on the voxel fusion features.
2. The method according to claim 1, wherein voxelized map features corresponding to the offline point cloud map are stored in a database in advance, and the acquiring voxelized point cloud features corresponding to the preset voxel space comprises:
acquiring current positioning information;
and determining the preset voxel space based on the current positioning information, and intercepting the voxelized map features based on the preset voxel space to obtain the voxelized point cloud features.
3. The method according to claim 1, wherein the acquiring the voxelized point cloud characteristics corresponding to the preset voxel space includes:
acquiring current positioning information;
determining the preset voxel space based on the current positioning information, and intercepting the offline point cloud map based on the preset voxel space to obtain offline point cloud data;
and carrying out voxelized conversion and feature extraction on the offline point cloud data to obtain the voxelized point cloud features.
4. The method according to claim 1, wherein acquiring the voxelized image features of the corresponding preset voxel space based on the plurality of images to be detected comprises:
performing feature extraction on the plurality of images to be detected to obtain a plurality of corresponding two-dimensional image features;
performing voxel sampling on the plurality of two-dimensional image features to obtain voxelized sampling features corresponding to the preset voxel space;
and performing feature extraction on the voxelized sampling features to obtain the voxelized image features.
5. The method of claim 4, wherein the performing feature extraction on the voxelized sampling features to obtain the voxelized image features comprises:
compressing the first dimension and the second dimension of the voxelized sampling feature to the same dimension to obtain a compressed sampling feature;
performing feature extraction on the compressed sampling features based on a two-dimensional convolution kernel to obtain compressed image features;
and carrying out dimension recovery on the first dimension and the second dimension of the compressed image feature to obtain the voxelized image feature.
6. The method according to claim 1, wherein obtaining the identification result of the target object based on the voxel fusion feature comprises:
performing feature extraction on the voxel fusion features based on a two-dimensional convolution kernel to obtain bird's eye view features;
and acquiring a recognition result of the target object based on the bird's eye view features.
7. The three-dimensional object detection method according to claim 1, characterized in that the method further comprises:
acquiring an initial three-dimensional target detection model, wherein the initial three-dimensional target detection model is used for executing the three-dimensional target detection method;
inputting a plurality of sample images and sample point cloud data synchronized with the sample images to the initial three-dimensional target detection model, the sample images determined based on a multi-view camera dataset;
extracting sample voxelized image features of a plurality of sample images based on the initial three-dimensional target detection model, wherein the sample voxelized image features and the sample voxelized point cloud features are determined based on the same sample voxel space;
training the initial three-dimensional target detection model based on a first loss function, wherein the first loss function is determined based on the sample voxelized image features and the sample voxelized point cloud features, and the sample voxelized point cloud features are acquired based on the sample point cloud data.
8. The method of claim 7, wherein the process of obtaining the sample voxelized point cloud features comprises:
performing voxel sampling on the sample point cloud data based on extrinsic parameters of a point cloud acquisition device to obtain sample voxelized point cloud data;
and carrying out feature extraction on the sample voxelized point cloud data to obtain the sample voxelized point cloud features.
9. The three-dimensional object detection method according to claim 1, characterized in that the method further comprises:
acquiring an initial three-dimensional target detection model, wherein the initial three-dimensional target detection model is used for executing the three-dimensional target detection method;
inputting a plurality of sample images into the initial three-dimensional target detection model to obtain sample center point coordinates, wherein the sample images are determined based on a multi-camera data set;
determining a sample ellipse feature map based on the sample center point coordinates, and determining a second loss function based on the distance between the sample ellipse feature map and a label ellipse feature map, the label ellipse feature map being determined based on label center point coordinates of the sample image;
training the initial three-dimensional object detection model based on the second loss function.
10. The method of claim 9, wherein the determining of the sample ellipse feature map comprises:
determining a major axis and a minor axis of the sample ellipse feature map based on the sample center point coordinates;
and setting the feature value corresponding to the sample center point as a first true value, setting the feature value corresponding to the boundary of the sample ellipse feature map as a second true value, and determining the feature values inside the sample ellipse feature map based on a Gaussian distribution to obtain the sample ellipse feature map.
11. A three-dimensional object detection device, the device comprising:
the image acquisition module is used for acquiring a plurality of images to be detected, and the images to be detected are generated based on a multi-camera;
the feature extraction module is used for acquiring voxelized image features of a corresponding preset voxel space based on the plurality of images to be detected;
the feature fusion module is used for acquiring voxelized point cloud features corresponding to the preset voxel space based on an offline point cloud map, and fusing the voxelized image features and the voxelized point cloud features to obtain voxel fusion features;
and the target identification module is used for acquiring the identification result of the target object based on the voxel fusion characteristics.
12. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 10 when the computer program is executed.
13. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 10.
14. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 10.
CN202311029637.9A 2023-08-16 2023-08-16 Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium Active CN116740668B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311029637.9A CN116740668B (en) 2023-08-16 2023-08-16 Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311029637.9A CN116740668B (en) 2023-08-16 2023-08-16 Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116740668A (en) 2023-09-12
CN116740668B CN116740668B (en) 2023-11-14

Family

ID=87903016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311029637.9A Active CN116740668B (en) 2023-08-16 2023-08-16 Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116740668B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117475410A (en) * 2023-12-27 2024-01-30 山东海润数聚科技有限公司 Three-dimensional target detection method, system, equipment and medium based on foreground point screening
CN117765226A (en) * 2024-02-22 2024-03-26 之江实验室 Track prediction method, track prediction device and storage medium

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222395A (en) * 2019-10-21 2020-06-02 杭州飞步科技有限公司 Target detection method and device and electronic equipment
CN111160214A (en) * 2019-12-25 2020-05-15 电子科技大学 3D target detection method based on data fusion
CN111429514A (en) * 2020-03-11 2020-07-17 浙江大学 Laser radar 3D real-time target detection method fusing multi-frame time sequence point clouds
CN111860695A (en) * 2020-08-03 2020-10-30 上海高德威智能交通系统有限公司 Data fusion and target detection method, device and equipment
CN112149550A (en) * 2020-09-21 2020-12-29 华南理工大学 Automatic driving vehicle 3D target detection method based on multi-sensor fusion
CN114359856A (en) * 2020-09-30 2022-04-15 北京万集科技股份有限公司 Feature fusion method and device, server and computer readable storage medium
CN113011317A (en) * 2021-03-16 2021-06-22 青岛科技大学 Three-dimensional target detection method and detection device
CN113111978A (en) * 2021-06-11 2021-07-13 之江实验室 Three-dimensional target detection system and method based on point cloud and image data
CN113920499A (en) * 2021-10-27 2022-01-11 江苏大学 Laser point cloud three-dimensional target detection model and method for complex traffic scene
CN114092780A (en) * 2021-11-12 2022-02-25 天津大学 Three-dimensional target detection method based on point cloud and image data fusion
CN114463737A (en) * 2022-01-28 2022-05-10 复旦大学 3D target detection method and system based on implicit expression in 3D modeling
CN115205845A (en) * 2022-05-31 2022-10-18 北京迈格威科技有限公司 Target detection method, computer program product and electronic equipment
CN115063539A (en) * 2022-07-19 2022-09-16 上海人工智能创新中心 Image dimension increasing method and three-dimensional target detection method
CN115424225A (en) * 2022-08-22 2022-12-02 南京邮电大学 Three-dimensional real-time target detection method for automatic driving system
CN116486368A (en) * 2023-04-03 2023-07-25 浙江工业大学 Multi-mode fusion three-dimensional target robust detection method based on automatic driving scene

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117475410A (en) * 2023-12-27 2024-01-30 山东海润数聚科技有限公司 Three-dimensional target detection method, system, equipment and medium based on foreground point screening
CN117475410B (en) * 2023-12-27 2024-03-15 山东海润数聚科技有限公司 Three-dimensional target detection method, system, equipment and medium based on foreground point screening
CN117765226A (en) * 2024-02-22 2024-03-26 之江实验室 Track prediction method, track prediction device and storage medium
CN117765226B (en) * 2024-02-22 2024-06-04 之江实验室 Track prediction method, track prediction device and storage medium

Also Published As

Publication number Publication date
CN116740668B (en) 2023-11-14

Similar Documents

Publication Publication Date Title
CN116740668B (en) Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium
CN109683699B (en) Method and device for realizing augmented reality based on deep learning and mobile terminal
CN116740669B (en) Multi-view image detection method, device, computer equipment and storage medium
CN112990010B (en) Point cloud data processing method and device, computer equipment and storage medium
JP2017134814A (en) Vehicle contour detection method based on point group data, and device
CN114298982B (en) Image annotation method and device, computer equipment and storage medium
CN111047630A (en) Neural network and target detection and depth prediction method based on neural network
CN111127516A (en) Target detection and tracking method and system without search box
CN116310673A (en) Three-dimensional target detection method based on fusion of point cloud and image features
CN113537047A (en) Obstacle detection method, obstacle detection device, vehicle and storage medium
CN116152334A (en) Image processing method and related equipment
CN112991537A (en) City scene reconstruction method and device, computer equipment and storage medium
KR20190060679A (en) Apparatus and method for learning pose of a moving object
CN116310368A (en) Laser radar 3D target detection method
CN117911827A (en) Multi-mode target detection method, device, equipment and storage medium
AU2023203583A1 (en) Method for training neural network model and method for generating image
CN114612572A (en) Laser radar and camera external parameter calibration method and device based on deep learning
CN115222815A (en) Obstacle distance detection method, obstacle distance detection device, computer device, and storage medium
CN114972599A (en) Method for virtualizing scene
CN114119757A (en) Image processing method, apparatus, device, medium, and computer program product
CN116758517B (en) Three-dimensional target detection method and device based on multi-view image and computer equipment
CN110196638B (en) Mobile terminal augmented reality method and system based on target detection and space projection
Lee et al. LiDAR translation based on empirical approach between sunny and foggy for driving simulation
Liu et al. CAF-RCNN: multimodal 3D object detection with cross-attention
CN115861316B (en) Training method and device for pedestrian detection model and pedestrian detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant