CN114332845A - 3D target detection method and device

Publication number: CN114332845A
Application number: CN202011057005.XA
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 王子辰, 钮敏哲, 张晓鹏, 许春景
Applicant and current assignee: Huawei Technologies Co Ltd
Legal status: Pending
Prior art keywords: laser point, module, point cloud, feature, initial

Abstract

Embodiments of the present application disclose a 3D target detection method and device, which can be applied to the computer vision field within the field of artificial intelligence. The method comprises: first obtaining the 2D information corresponding to each laser point in a laser point cloud; inputting the laser point cloud and the 2D information into two three-dimensional sparse convolution modules respectively to obtain a first feature and a second feature; cascading the first feature and the second feature and inputting the result into a two-dimensional convolution module to obtain a third feature (namely a fusion feature); cascading the third feature with the first feature and the second feature to obtain a fourth feature (namely a combined feature); and finally performing 3D target detection using the fourth feature. The embodiments of the application fuse the laser point cloud and the 2D information at the feature level, which improves 3D target detection performance while preserving the original features of the laser point cloud, so that 3D target detection remains robust in complex scenes where the camera fails, such as at night or in rain and fog.

Description

3D target detection method and device
Technical Field
The present application relates to the field of computer vision, and in particular, to a method and apparatus for 3D target detection.
Background
Object detection is a traditional task in the field of computer vision. Compared with image classification, object detection requires not only identifying the category of an object but also outputting its minimum bounding box. Two-dimensional (2D) object detection, as shown in the left part of fig. 1, typically takes a 2D image as input and outputs the category of each object in the image together with a 2D bounding box (also referred to as a 2D target box). However, in a real three-dimensional (3D) environment, objects have three-dimensional shapes, and most application scenarios require information such as the length, width, height, and orientation of the target object. For example, in the field of automatic driving, with the wide deployment of sensors such as lidar and depth (RGB-D) cameras, 2D target detection can no longer meet the requirements of some scenes, so 3D target detection is performed. As shown in the right part of fig. 1, 3D target detection outputs a 3D bounding box of the target object (also referred to as a 3D target box, which includes information such as length, width, height, and rotation angle) together with the classification category of the target object, and can be applied to various application scenarios, for example, the detection task of an autonomous vehicle.
As shown in fig. 2, taking the application of 3D target detection in the field of automatic driving as an example, sensor data such as lidar, millimeter-wave radar, ultrasonic sensors, laser detection and ranging (LiDAR) systems, and RGB-D cameras can serve as input data, and different types of sensors have their respective advantages and disadvantages: lidar has low resolution and produces sparse laser point cloud data, but offers good ranging capability and environmental adaptability; RGB image data is denser and better suited to target recognition, but is strongly affected by the environment and fails completely at night or in rain and fog; ultrasonic sensors detect close-range targets well but cannot measure long distances; millimeter-wave radar can accurately measure target velocity but cannot perform target recognition. Therefore, fusing sensor data of multiple modalities (that is, sensor data acquired by different sensors) can improve the redundancy and accuracy of the detection model.
However, most current 3D target detection models take a pure lidar point cloud as input, and their performance far exceeds that of multi-modal 3D detection. How to fuse multi-modal data more reasonably and effectively, so as to improve the performance of 3D target detection, is therefore a problem that urgently needs to be solved.
Disclosure of Invention
The embodiments of the present application provide a 3D target detection method and device, which obtain the 2D information corresponding to each laser point in a laser point cloud from the laser point cloud and a 2D image, and fuse the laser point cloud and the 2D information at the feature level. This improves 3D target detection performance while preserving the original features of the 3D laser point cloud, so that 3D target detection remains robust in complex scenes where the camera fails, such as at night or in rain and fog.
Based on this, the embodiment of the present application provides the following technical solutions:
In a first aspect, an embodiment of the present application first provides a 3D object detection method, which may be used in the computer vision field within the field of artificial intelligence. The method includes: the execution device first acquires an image and a laser point cloud at a certain moment and at a certain position through a sensor apparatus deployed on the execution device (for example, a camera and a lidar), and then obtains the two-dimensional information corresponding to the laser point cloud from the acquired image and the laser point cloud. After obtaining the laser point cloud and its corresponding two-dimensional information, the execution device performs convolution operations on the laser point cloud and the two-dimensional information through a three-dimensional first sparse convolution module and a three-dimensional second sparse convolution module respectively, thereby obtaining a first feature corresponding to the laser point cloud and a second feature corresponding to the two-dimensional information. The execution device then cascades the first feature and the second feature and inputs the cascaded result into a two-dimensional convolution module for a convolution operation, obtaining a third feature; because the first feature and the second feature are fused in it, the third feature may be called a fusion feature. It should be noted that in the embodiments of the present application, cascading features means stacking them to obtain a new cascaded feature. For example, assuming that the first feature is 1 × 2 × 3 and the second feature is also 1 × 2 × 3, where 1 is the number of channels and 2 × 3 is the size of the first/second feature, the cascaded feature obtained after cascading the first feature and the second feature is 2 × 2 × 3, where 2 is the number of channels after cascading and 2 × 3 is the size of the cascaded feature. After obtaining the third feature, the execution device cascades the previously obtained first feature and second feature with the third feature to obtain a fourth feature, which may also be referred to as a combined feature. The execution device may then input the fourth feature into a classification regression module, for example a classification regression head, so as to output the 3D target box and the classification category to which the target object in the 3D target box belongs. The target object may also be referred to as an object of interest.
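For ease of understanding, the following minimal sketch illustrates the cascading (channel-wise concatenation) operation described above, using the 1 × 2 × 3 example from the text. Python with PyTorch is used here only for illustration; the embodiments do not prescribe any particular framework.

```python
# Illustrative sketch of cascading two features along the channel dimension.
import torch

first_feature = torch.randn(1, 2, 3)   # 1 channel, spatial size 2 x 3
second_feature = torch.randn(1, 2, 3)  # 1 channel, spatial size 2 x 3

# Concatenating along the channel dimension yields a 2 x 2 x 3 cascaded feature.
cascaded = torch.cat([first_feature, second_feature], dim=0)
print(cascaded.shape)  # torch.Size([2, 2, 3])
```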
In the above embodiment of the present application, the execution device first obtains the 2D information corresponding to each laser point in the laser point cloud (i.e., the 3D information), then inputs the 3D information and the 2D information into two three-dimensional sparse convolution modules respectively to obtain a first feature and a second feature, cascades the first feature and the second feature and inputs the result into a two-dimensional convolution module to obtain a third feature (i.e., a fusion feature), cascades the third feature with the original first feature and second feature to obtain a fourth feature (i.e., a combined feature), and finally uses the fourth feature to perform 3D target detection through a classification regression module. Because the 2D information corresponding to each laser point is obtained from the laser point cloud and the 2D image, and the laser point cloud and the 2D information are fused at the feature level, the original features of the 3D laser point cloud are preserved while the 3D target detection performance is improved, and 3D target detection remains robust in complex scenes where the camera fails, such as at night or in rain and fog.
In a possible design of the first aspect, the execution device may obtain the two-dimensional information corresponding to the laser point cloud from the acquired image and the laser point cloud as follows: first, the execution device performs semantic segmentation on the acquired image through a semantic segmentation model to obtain a semantic segmentation score for each pixel in the image, where the segmentation score represents the probability that the pixel belongs to each classification category, and the semantic segmentation scores of all pixels in the image constitute a semantic segmentation map. Then, the execution device projects the laser point cloud onto the image and onto the semantic segmentation map obtained from the image, so as to obtain the target RGB information and the target semantic segmentation score corresponding to each laser point in the laser point cloud; the target RGB information and the target semantic segmentation scores constitute the two-dimensional information.
The above embodiment of the present application explains how the 2D information is obtained from the image and the laser point cloud: semantic segmentation is first performed on the image to obtain a semantic segmentation map, and the laser point cloud is then projected onto the image and the semantic segmentation map respectively to obtain the 2D information. This is readily realizable.
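For ease of understanding, the following is a minimal sketch of how laser points might be projected onto the image and the semantic segmentation map to collect the per-point 2D information. The array names and the 3 × 4 projection matrix `lidar_to_image` are assumptions made only for illustration; they are not prescribed by the embodiments.

```python
import numpy as np

def gather_2d_info(points_xyz, image_rgb, seg_scores, lidar_to_image):
    """points_xyz: (N, 3) laser points; image_rgb: (H, W, 3);
    seg_scores: (H, W, C) per-pixel class probabilities; lidar_to_image: (3, 4)."""
    n = points_xyz.shape[0]
    homo = np.concatenate([points_xyz, np.ones((n, 1))], axis=1)      # (N, 4) homogeneous coords
    proj = homo @ lidar_to_image.T                                    # (N, 3)
    depth = np.clip(proj[:, 2], 1e-6, None)
    u = (proj[:, 0] / depth).astype(int)                              # pixel column
    v = (proj[:, 1] / depth).astype(int)                              # pixel row
    h, w = image_rgb.shape[:2]
    valid = (proj[:, 2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    rgb = np.zeros((n, 3))
    scores = np.zeros((n, seg_scores.shape[2]))
    rgb[valid] = image_rgb[v[valid], u[valid]]        # target RGB information per laser point
    scores[valid] = seg_scores[v[valid], u[valid]]    # target semantic segmentation score
    return np.concatenate([rgb, scores], axis=1)      # the two-dimensional information
```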
In a possible design of the first aspect, the input data of a sparse convolution module may first be voxelized and then input into the sparse convolution module for the convolution operation. Accordingly, the execution device may voxelize the laser point cloud to obtain a voxelized laser point cloud, and the first sparse convolution module then performs a convolution operation on the voxelized laser point cloud to obtain the first feature.
In the above embodiment of the present application, the input data of the three-dimensional sparse convolution module needs to be voxelized first, and the voxelized input data is then fed into the sparse convolution module for the convolution operation. Therefore, in this embodiment the laser point cloud is voxelized first to obtain the voxelized laser point cloud, and the first sparse convolution module then performs the convolution operation, which offers flexibility.
In a possible design of the first aspect, the execution device likewise performs voxelization on the two-dimensional information to obtain voxelized two-dimensional information, and the second sparse convolution module then performs a convolution operation on the voxelized two-dimensional information to obtain the second feature.
In the above embodiment of the present application, voxelization is performed not only on the laser point cloud but also on the two-dimensional information; voxelizing the laser point cloud and the two-dimensional information separately is flexible and easy to implement.
In a possible design of the first aspect, the execution device may instead cascade the laser point cloud and the two-dimensional information to obtain multi-modal information and perform a single voxelization on the multi-modal information to obtain voxelized multi-modal information; the voxelized multi-modal information evidently contains both the voxelized laser point cloud and the voxelized two-dimensional information. The first sparse convolution module then performs a convolution operation on the voxelized laser point cloud to obtain the first feature, and the second sparse convolution module performs a convolution operation on the voxelized two-dimensional information to obtain the second feature.
The above embodiment of the present application describes another voxelization mode: the laser point cloud and the two-dimensional information are cascaded into multi-modal information, so that a single voxelization of the multi-modal information voxelizes the laser point cloud and the 2D information at the same time, saving one voxelization operation.
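For ease of understanding, the following rough sketch illustrates the "cascade first, voxelize once" variant: each laser point is cascaded with its 2D information, and a single voxelization then groups points into voxels. The voxel size and the per-voxel averaging used here are assumptions for illustration only.

```python
import numpy as np

def voxelize(points_xyz, point_features, voxel_size=(0.05, 0.05, 0.1)):
    """Returns integer voxel coordinates and one mean feature per occupied voxel."""
    coords = np.floor(points_xyz / np.asarray(voxel_size)).astype(np.int32)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    feats = np.zeros((uniq.shape[0], point_features.shape[1]))
    np.add.at(feats, inverse, point_features)          # sum features per voxel
    feats /= np.bincount(inverse)[:, None]              # average within each voxel
    return uniq, feats   # sparse (coordinates, features) pair for the sparse conv modules

# multi_modal = np.concatenate([laser_points, two_d_info], axis=1)   # cascade per point
# voxel_coords, voxel_feats = voxelize(laser_points[:, :3], multi_modal)
```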
In a possible design of the first aspect, in some embodiments of the present application, the input data of the two-dimensional convolution module may first be de-voxelized, de-voxelization being the operation opposite to voxelization; the third feature is de-voxelized and then input into the two-dimensional convolution module for the convolution operation.
In the above embodiment of the present application, the input data of the two-dimensional convolution module needs to be de-voxelized (i.e., converted into a dense representation), de-voxelization being the operation opposite to voxelization, and the de-voxelized third feature is input into the two-dimensional convolution module for the convolution operation, which offers flexibility.
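As a hedged illustration of the de-voxelization step (not a prescribed implementation), sparse per-voxel features can be scattered back into a dense grid so that a two-dimensional convolution module can consume them; the handling of the height axis below is an assumption for illustration.

```python
import numpy as np

def devoxelize(voxel_coords, voxel_feats, grid_shape):
    """voxel_coords: (M, 3) integer indices; voxel_feats: (M, C); grid_shape: tuple (D, H, W)."""
    dense = np.zeros(grid_shape + (voxel_feats.shape[1],), dtype=voxel_feats.dtype)
    dense[voxel_coords[:, 0], voxel_coords[:, 1], voxel_coords[:, 2]] = voxel_feats
    # Collapse the height axis so the result is a dense 2D feature map (H, W, D*C).
    d, h, w, c = dense.shape
    return dense.transpose(1, 2, 0, 3).reshape(h, w, d * c)
```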
A second aspect of the embodiments of the present application further provides a training method for a 3D target detection model, where the 3D target detection model includes a three-dimensional first sparse convolution module, a three-dimensional second sparse convolution module, a two-dimensional convolution module, and a classification regression module. The method may include: first, a training device acquires an initial training set; any training sample in the initial training set may be called an initial training sample, and each initial training sample includes one frame of laser point cloud and one piece of two-dimensional information (which may be called an initial laser point cloud and initial two-dimensional information). The initial two-dimensional information is obtained from the initial laser point cloud and an initial image, and the initial laser point cloud corresponds to the initial image, that is, they are sensor data acquired by different types of sensors at the same time and the same position. The training device then constructs a first training set from the initial training set as follows: the local laser points and local two-dimensional information of target objects, such as cars, trucks, and pedestrians, are extracted from each initial training sample in the initial training set, and the local laser points and local two-dimensional information of the target objects in each initial training sample are randomly copied to obtain first training samples; the first training samples constitute the first training set. Likewise, each first training sample in the first training set also includes one frame of laser point cloud and one piece of two-dimensional information (which may be called a first laser point cloud and first two-dimensional information). The first training samples in the first training set are thus training samples to which data enhancement has been applied. The training device performs convolution operations on the first laser point cloud and the first two-dimensional information through the first sparse convolution module and the second sparse convolution module respectively, thereby obtaining a first feature corresponding to the first laser point cloud and a second feature corresponding to the first two-dimensional information. The training device then cascades the first feature and the second feature and inputs the cascaded result into the two-dimensional convolution module for a convolution operation to obtain a third feature. After obtaining the third feature, the training device cascades the previously obtained first feature and second feature with the third feature to obtain a fourth feature, which may also be called a combined feature.
After obtaining the fourth feature, the training device may input the fourth feature into the classification regression module, for example a classification regression head, so as to obtain a 3D predicted target box and the predicted classification category to which the target object in the 3D predicted target box belongs. The target object may also be referred to as an object of interest. The training device then performs iterative training on the 3D target detection model with a target loss function, according to the 3D ground-truth target box annotated in the training sample, the true classification category of the target object in the 3D ground-truth target box, the 3D predicted target box, and the predicted classification category of the target object in the 3D predicted target box.
In the embodiment of the application, how to use each module in the training device to perform iterative training on each module in the 3D target detection model is specifically described, and the training process is easy to implement.
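For ease of understanding, the following hedged sketch shows one way a first training sample could be built from an initial one, by randomly copying the local laser points (and their 2D information) of labelled target objects; the data layout and copy probability are assumptions for illustration, not the prescribed augmentation.

```python
import random
import numpy as np

def build_first_sample(initial_points, initial_2d, object_masks, copy_prob=0.5):
    """object_masks: list of boolean masks marking the points of each target object."""
    points, info_2d = [initial_points], [initial_2d]
    for mask in object_masks:
        if random.random() < copy_prob:           # randomly copy this object's local points
            points.append(initial_points[mask])
            info_2d.append(initial_2d[mask])
    return np.concatenate(points, axis=0), np.concatenate(info_2d, axis=0)
```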
In a possible design of the second aspect, the initial two-dimensional information includes initial RGB information corresponding to each initial laser point in an initial laser point cloud obtained by projecting the initial laser point cloud to an initial image, and an initial semantic segmentation score corresponding to each initial laser point obtained by projecting the initial laser point cloud to an initial semantic segmentation map, where the initial semantic segmentation score is a probability that each pixel point in the initial image obtained by performing semantic segmentation on the initial image by using a semantic segmentation model belongs to each classification category, and the initial semantic segmentation score of each pixel point in the initial image constitutes the initial semantic segmentation map. Similarly, the first two-dimensional information includes first RGB information corresponding to each first laser point in the first laser point cloud obtained by projecting the first laser point cloud to the first image, and a first semantic segmentation score corresponding to each first laser point obtained by projecting the first laser point cloud to the first semantic segmentation map, the first semantic segmentation score is a probability that each pixel point in the first image obtained by performing semantic segmentation on the first image through the semantic segmentation model belongs to each classification category, and the first semantic segmentation score of each pixel point in the first image constitutes the first semantic segmentation map.
The above embodiment of the present application specifically describes what the two-dimensional information consists of, which is realizable.
In a possible design of the second aspect, the input data of a sparse convolution module may first be voxelized and then input into the sparse convolution module for the convolution operation. Accordingly, the execution device may voxelize the first laser point cloud to obtain a voxelized first laser point cloud, and the first sparse convolution module then performs a convolution operation on the voxelized first laser point cloud to obtain the first feature.
In the above embodiment of the present application, the input data of the three-dimensional sparse convolution module needs to be voxelized first, and the voxelized input data is then fed into the sparse convolution module for the convolution operation. Therefore, in this embodiment the first laser point cloud is voxelized first to obtain the voxelized first laser point cloud, and the first sparse convolution module then performs the convolution operation, which offers flexibility.
In a possible design of the second aspect, the executing device also performs voxelization on the first two-dimensional information to obtain voxelized first two-dimensional information, and then performs convolution operation on the voxelized first two-dimensional information by using the second sparse convolution module to obtain the second feature.
In the above embodiments of the present application, it is not only necessary to perform voxelization processing on the first laser point cloud, but also to perform voxelization processing on the first two-dimensional information, which is flexible.
In a possible design of the second aspect, the execution device may instead cascade the first laser point cloud and the first two-dimensional information to obtain multi-modal information and perform a single voxelization on the multi-modal information to obtain voxelized multi-modal information; the voxelized multi-modal information evidently contains both the voxelized first laser point cloud and the voxelized first two-dimensional information. The first sparse convolution module then performs a convolution operation on the voxelized first laser point cloud to obtain the first feature, and the second sparse convolution module performs a convolution operation on the voxelized first two-dimensional information to obtain the second feature.
The above embodiment of the present application describes another voxelization mode: the first laser point cloud and the first two-dimensional information are cascaded into multi-modal information, so that a single voxelization of the multi-modal information voxelizes the first laser point cloud and the first two-dimensional information at the same time, saving one voxelization operation.
In a possible design of the second aspect, in some embodiments of the present application, the input data of the two-dimensional convolution module may first be de-voxelized, de-voxelization being the operation opposite to voxelization; the third feature is de-voxelized and then input into the two-dimensional convolution module for the convolution operation.
In the above embodiment of the present application, the input data of the two-dimensional convolution module needs to be de-voxelized (i.e., converted into a dense representation), de-voxelization being the operation opposite to voxelization, and the de-voxelized third feature is input into the two-dimensional convolution module for the convolution operation, which offers flexibility.
Training on all training samples in the entire training set once is called one training round (i.e., epoch = 1); 1 epoch equals one pass over all training samples in the training set, and the epoch count indicates how many rounds of training have been performed on the training set. Therefore, in a possible design of the second aspect, the above steps may be performed repeatedly until the number of training rounds on the first training set reaches a first preset number of rounds; for example, if the first preset number of rounds is 15, the training device repeatedly performs steps 1001 to 1006 fifteen times, where the first training set in each round is reconstructed from the initial training set, so that the first training set is not identical across training rounds (i.e., data enhancement is applied each time), which improves the performance of the 3D detection model. When the number of training rounds on the first training set reaches the first preset number of rounds (for example, 15), the first training set is no longer constructed; instead, the initial training set is used directly as the new first training set (only data enhancement such as rotation and flipping may be retained), and the 3D target detection model is iteratively trained on the initial training set according to the above steps until the number of training rounds using the initial training set reaches a second preset number of rounds (for example, 5).
In the above embodiment of the present application, a first training set is constructed from the initial training set in each epoch until the first preset number of epochs (i.e., the first preset number of rounds, for example, 15) is reached, after which training is performed directly on the initial training set until the second preset number of epochs (i.e., the second preset number of rounds, for example, 5) is reached.
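The training schedule above can be summarised by the following schematic sketch. The function arguments are placeholder callables, not an actual API, and the round counts are the example values given in the text.

```python
def train_3d_detector(model, initial_training_set, build_first_training_set,
                      train_one_epoch, first_preset_rounds=15, second_preset_rounds=5):
    """build_first_training_set / train_one_epoch are placeholder callables."""
    for epoch in range(first_preset_rounds + second_preset_rounds):
        if epoch < first_preset_rounds:
            # Early rounds: rebuild an augmented first training set from the initial set each time.
            train_set = build_first_training_set(initial_training_set)
        else:
            # Later rounds: the initial training set itself (light augmentation such as rotation/flip only).
            train_set = initial_training_set
        train_one_epoch(model, train_set)
```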
A third aspect of the embodiments of the present application provides an execution device, where the execution device has a function of implementing the method of the first aspect or any one of the possible implementation manners of the first aspect. The function can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above.
A fourth aspect of the embodiments of the present application provides a training apparatus having a function of implementing a method according to any one of the second aspect and the second possible implementation manner. The function can be realized by hardware, and can also be realized by executing corresponding software by hardware. The hardware or software includes one or more modules corresponding to the functions described above.
A fifth aspect of the present embodiment provides an execution device, which may include a memory, a processor, and a bus system, where the memory is configured to store a program, and the processor is configured to call the program stored in the memory to execute the method according to the first aspect of the present embodiment or any one of the possible implementation manners of the first aspect.
A sixth aspect of the embodiments of the present application provides a training apparatus, which may include a memory, a processor, and a bus system, where the memory is used to store a program, and the processor is used to call the program stored in the memory to execute the method according to any one of the second aspect and the possible implementation manner of the second aspect of the embodiments of the present application.
A seventh aspect of the present application provides a computer-readable storage medium having stored therein instructions, which, when run on a computer, cause the computer to perform the method of the first aspect or any one of the possible implementations of the first aspect, or cause the computer to perform the method of the second aspect or any one of the possible implementations of the second aspect.
An eighth aspect of embodiments of the present application provides a computer program product, which when run on a computer, causes the computer to perform the method of any one of the above-mentioned first aspect or first possible implementation manner, or causes the computer to perform the method of any one of the above-mentioned second aspect or second possible implementation manner.
A ninth aspect of the embodiments of the present application provides a chip. The chip includes at least one processor and at least one interface circuit, the interface circuit being coupled to the processor. The at least one interface circuit is configured to perform transceiving functions and send instructions to the at least one processor, and the at least one processor is configured to run a computer program or instructions and has the function of implementing the method of the first aspect or any possible implementation of the first aspect, or of the second aspect or any possible implementation of the second aspect. This function may be implemented by hardware, by software, or by a combination of hardware and software, and the hardware or software includes one or more modules corresponding to the above functions. In addition, the interface circuit is used to communicate with modules other than the chip; for example, the interface circuit may send the 3D object detection model obtained by the on-chip processor to various intelligent driving agents (e.g., unmanned driving, assisted driving, etc.) for motion planning (e.g., driving behavior decision, global path planning, etc.).
Drawings
Fig. 1 is a schematic diagram illustrating a difference between 2D object detection and 3D object detection provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of multi-sensor data applied to an autonomous vehicle;
FIG. 3 is a schematic diagram of a conventional 3D object detection method;
FIG. 4 is another schematic diagram of a conventional 3D object detection method;
FIG. 5 is a schematic diagram of a two-dimensional convolution module according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a three-dimensional sparse convolution module according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an artificial intelligence body framework provided by an embodiment of the present application;
fig. 8 is an application flow of the 3D object detection method and a model for performing 3D object detection according to the embodiment of the present application;
FIG. 9 is a system architecture diagram of a 3D object detection system provided in an embodiment of the present application;
fig. 10 is a schematic flowchart of a training method of a 3D object detection model according to an embodiment of the present application;
fig. 11 is a schematic flowchart of obtaining two-dimensional information according to an embodiment of the present application;
fig. 12 is a schematic diagram illustrating that the training apparatus provided in the embodiment of the present application performs a convolution operation after performing a voxelization process on the first laser point cloud and the first two-dimensional information respectively;
fig. 13 is another schematic diagram of the training apparatus provided in the embodiment of the present application performing a convolution operation after performing a voxelization process on the first laser point cloud and the first two-dimensional information;
FIG. 14 is a schematic diagram of an exception condition for data enhancement provided by an embodiment of the present application;
FIG. 15 is an example of an adaptive data enhancement strategy proposed by an embodiment of the present application;
fig. 16 is a schematic diagram of a 3D object detection method provided in an embodiment of the present application;
fig. 17 is a schematic diagram illustrating that the executing device performs a convolution operation after performing a voxelization process on the laser point cloud and the two-dimensional information respectively according to the embodiment of the present application;
fig. 18 is another schematic diagram of performing a convolution operation after performing a voxelization process on the laser point cloud and the two-dimensional information by the performing apparatus according to the embodiment of the present application;
FIG. 19 is a schematic diagram of an output 3D object box and classification categories to which the object belongs according to an embodiment of the present application;
FIG. 20 is a diagram illustrating an application scenario provided by an embodiment of the present application;
fig. 21 is a schematic diagram of another application scenario provided in the embodiment of the present application;
fig. 22 is a schematic diagram of another application scenario provided in the embodiment of the present application;
fig. 23 is a schematic diagram of another application scenario provided in the embodiment of the present application;
fig. 24 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 25 is a schematic structural diagram of a training apparatus according to an embodiment of the present application;
fig. 26 is another schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 27 is a schematic view of another embodiment of a training apparatus according to the present application;
fig. 28 is a schematic structural diagram of a chip according to an embodiment of the present disclosure.
Detailed Description
The embodiments of the present application provide a 3D target detection method and device, which obtain the 2D information corresponding to each laser point in a laser point cloud from the laser point cloud and a 2D image, and fuse the laser point cloud and the 2D information at the feature level. This improves 3D target detection performance while preserving the original features of the 3D laser point cloud, so that 3D target detection remains robust in complex scenes where the camera fails, such as at night or in rain and fog.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the various embodiments of the application and how objects of the same nature can be distinguished. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Before introducing the embodiments of the present application, the 3D object detection methods in common use today are briefly introduced, to make the subsequent embodiments easier to understand.
(1) One of the common methods for 3D object detection
In this method, the laser point cloud collected by a lidar is used directly as the input data of a 3D target detection model, and after processing by the model, the 3D target detection boxes and the classification category to which the target object (also called the object of interest) in each 3D target detection box belongs are output.
However, this 3D target detection method uses pure lidar as input; the resolution of lidar is low and the laser point cloud data is sparse, so missed detections are a prominent problem.
(2) Second common method for 3D object detection
The Multi-View 3D object detection network (MV3D) proposes a multi-view fusion method based on LiDAR and RGB images (also referred to as two-dimensional images); specifically, it takes multi-modal sensor data as input and detects target objects in three-dimensional space. As shown in fig. 3, the method uses three views. The first view converts the laser point cloud acquired by the lidar into a bird's eye view (BEV); the BEV has one dimension fewer than the original laser point cloud, turning the three-dimensional laser point cloud into a two-dimensional bird's-eye-view image. The second view is the front view of the lidar, and the third is the RGB image acquired by the camera. MV3D uses the sensor data of these three views as the input data of the model: a 3D candidate box is regressed from the features obtained by the convolution layers on the two-dimensional BEV image, the 3D candidate box is then projected into the feature maps of the other two views and ROI pooling is performed, the resulting features of the three views are fused by a deep fusion module, the fused features are used for the final classification and regression, and the 3D target detection boxes and the classification category of the target object in each 3D target detection box are output.
However, for 3D target detection the features of all three dimensions of the laser point cloud are important. This method converts the laser point cloud acquired by the lidar into the BEV view, losing the dimension information in the z direction (i.e., height), and the front view of the lidar loses the dimension information in the y direction, so the input data always loses the information of one dimension, which greatly reduces the performance of 3D target detection.
(3) Third common method for 3D target detection
Another 3D target detection approach is Frustum-PointNet (F-PointNet). As shown in fig. 4, this 3D target detection model consists of three modules: a frustum proposal module, a 3D instance segmentation module, and an amodal 3D box estimation module. First, a mature 2D object detector is used to extract two-dimensional object regions from the RGB image and classify the target objects; the known camera projection matrix is then used to project each 2D target box into 3D space to generate a view frustum (i.e., bounded by the near and far planes specified by the depth sensor range) that contains the 3D search space of the target object, and all laser points within the frustum are collected to form a frustum point cloud. For the laser point cloud collected by the lidar within the frustum (each point contains the three dimensions x, y, z and the lidar reflection intensity), 3D instance segmentation is performed on each laser point through a PointNet, and the segmented point cloud features are passed through a T-Net to generate a translation so as to estimate the 3D target box more accurately. Unlike existing multi-modal fusion techniques, this method uses multi-step training and completely decouples the 2D object detection model from the 3D target detection model, generating candidate boxes on the basis of a more mature 2D object detector and thereby reducing the missed detections caused by the sparsity of the 3D point cloud.
However, this method uses the view frustums generated by the 2D object detector as the input of the 3D detection model and therefore depends entirely on the performance of the 2D object detector; due to the limitations of the RGB camera, it can only be used in simple scenes and fails completely in complex scenes (e.g., at night, in rain and fog, in occluded environments, etc.).
Existing 3D target detection methods can be classified, according to their input data, into pure lidar point cloud methods, pure image methods, methods that fuse laser point clouds with images, and others. Among the many detection methods, however, the pure lidar point cloud methods perform better: intuitively, a fusion method has more information available than a pure laser point cloud method, yet its detection accuracy is not as good. This is because most current fusion methods are unreasonable. For example, MV3D above converts the laser point cloud into the BEV view and loses one dimension of information, so the detection performance of the 3D target detection model suffers greatly; and F-PointNet regresses from the view frustum generated by the 2D target box as a candidate region, relying too heavily on the detection result of the 2D object detector, so it fails completely in complex scenes such as night. In short, the existing fusion methods based on sensor data such as laser point clouds and images are unreasonable and do not establish the correspondence between the 2D information and the 3D information.
Since the laser point cloud is sparse and the image is dense, establishing the correspondence between the two kinds of modal information is important, and how to fuse the multi-modal data reasonably and effectively is an urgent problem to be solved. In addition, the two-dimensional image captured by a camera fails in complex scenes such as night, so another problem to be solved is how to make the detection performance of the features fused from multi-modal sensor data more robust to complex scenes. Based on this, the embodiments of the present application provide a 3D target detection method that obtains the 2D information corresponding to each laser point in the laser point cloud from the laser point cloud and the 2D image, and fuses the laser point cloud and the 2D information at the feature level, so that the original features of the 3D laser point cloud are retained while the 3D target detection performance is improved, and 3D target detection remains robust in complex scenes where the camera fails, such as at night or in rain and fog.
Since the embodiments of the present application relate to a lot of related knowledge about target detection, in order to better understand the scheme of the embodiments of the present application, the following first introduces related terms and concepts that may be related to the embodiments of the present application. It should be understood that the related conceptual explanations may be limited by the specific details of the embodiments of the present application, but do not mean that the present application is limited to the specific details, and that the specific details of the embodiments may vary from one embodiment to another, and are not limited herein.
(1) Laser point cloud
A laser point cloud may also be called laser point cloud data. The laser information received by a laser sensor such as a lidar or a three-dimensional laser scanner is presented in the form of a point cloud: the set of points on the surface of a measured object obtained by a measuring instrument is called a point cloud, and if the measuring instrument is a laser sensor, the obtained point cloud is called a laser point cloud (a 32-beam lidar typically produces tens of thousands of laser points at one moment). The laser information contained in the laser point cloud can be written as [x, y, z, intensity], which represents the three-dimensional coordinates, in the lidar coordinate system, of the target position onto which each laser point is projected, together with the reflection intensity of that laser point.
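Since each laser point carries [x, y, z, intensity], a frame of laser point cloud can be held in an (N, 4) array, as in the following illustrative sketch (the values shown are made up for illustration).

```python
import numpy as np

point_cloud = np.array([
    [12.4, -3.1, 0.8, 0.27],   # x, y, z in the lidar coordinate system, reflection intensity
    [12.5, -3.0, 0.9, 0.31],
], dtype=np.float32)
xyz, intensity = point_cloud[:, :3], point_cloud[:, 3]
```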
(2) Voxel (voxel) and voxelization (voxelization)
A voxel is short for volume element; a volume containing voxels can be represented by volume rendering or by extracting a polygonal isosurface at a given threshold contour. The voxel is the smallest unit of digital data in the partitioning of three-dimensional space and is used in fields such as three-dimensional imaging, scientific data, and medical imaging. It is conceptually similar to the smallest unit of two-dimensional space, the pixel, which is used in the image data of two-dimensional computer images. Some true three-dimensional displays use voxels to describe their resolution, for example, a display that can show 512 x 512 voxels.
Voxelization is the conversion of a geometric representation of an object into a voxel representation that is closest to the object.
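As a small hedged illustration of this definition, continuous 3D coordinates can be mapped to the nearest cell of a voxel occupancy grid; the bounds and voxel size below are arbitrary assumptions.

```python
import numpy as np

voxel_size = 0.2
origin = np.array([0.0, -40.0, -3.0])          # lower corner of the covered space
grid = np.zeros((400, 400, 20), dtype=bool)    # voxel occupancy grid

def mark_occupied(points_xyz):
    idx = np.floor((points_xyz - origin) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < grid.shape), axis=1)
    grid[tuple(idx[inside].T)] = True           # nearest voxel representation of the points
```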
(3) Convolutional Neural Networks (CNN)
A CNN is a deep neural network with a convolution structure and is a deep learning architecture. A CNN contains a feature extractor composed of convolution layers and sub-sampling layers. The feature extractor can be viewed as a filter, and the convolution process can be viewed as convolving a trainable filter with an input image or a convolved feature map. A convolution layer is a layer of neurons in the convolutional neural network that performs convolution processing on the input signal. In a convolution layer, one neuron may be connected to only some of the neighbouring neurons. A convolution layer usually contains several feature maps, and each feature map may be composed of neural units arranged in a rectangle. Neural units of the same feature map share weights, and the shared weights are the convolution kernel. Weight sharing can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of an image are the same as those of the other parts, so image information learned in one part can also be used in another part; the same learned image information can therefore be used at all locations of the image. In the same convolution layer, multiple convolution kernels can be used to extract different image information; in general, the more convolution kernels, the richer the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix of random size, and can be learned to obtain reasonable weights during the training process of CNN. In addition, sharing weights brings the direct benefit of reducing connections between layers of CNN while reducing the risk of over-fitting.
(4) Two-dimensional convolution module
In the embodiments of the present application, the two-dimensional convolution module can be regarded as a special CNN that contains at least a preset number of convolution layers. Specifically, an image to be recognized is input into the two-dimensional convolution module, and features are extracted through operations such as convolution and pooling. As shown in fig. 5, convolution changes the receptive field of the image; after multi-channel features are obtained, they are straightened into a one-dimensional vector through pooling, and category information is finally output through a fully connected layer. The two-dimensional convolution module described in the embodiments of the present application may specifically be a two-dimensional model such as VGG, ResNet, Inception, or EfficientNet, or another two-dimensional model; the specific form of the two-dimensional convolution module provided in the present application is not limited here.
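For ease of understanding, the following minimal sketch mirrors the description of fig. 5: convolution and pooling extract multi-channel features, which are flattened and passed through a fully connected layer to output category information. The channel counts and image size are assumptions for illustration; PyTorch is used only as an example framework.

```python
import torch
import torch.nn as nn

two_d_conv_module = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),          # straighten into a one-dimensional vector
    nn.Linear(32, 10),     # fully connected layer outputting category scores
)
logits = two_d_conv_module(torch.randn(1, 3, 224, 224))
```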
(5) Three-dimensional sparse convolution module
Unlike the 2D object detection scenario, the input used for 3D object detection typically includes a 3D laser point cloud in addition to a two-dimensional image. Unlike an image, a laser point cloud is sparse in its spatial distribution (whereas the pixels of an image are dense and regularly arranged). To facilitate processing of the point cloud information, the usual practice is to divide the 3D space into a certain number of voxels of the same size and to perform the analysis in units of voxels (analogous to the pixels of an image). Analysing the laser point cloud in 3D space requires 3D convolution, performed by what is called a three-dimensional sparse convolution module. As shown in fig. 6, which is a schematic diagram of a three-dimensional sparse convolution module, its function is similar to that of a two-dimensional convolution module, except that the input of the three-dimensional sparse convolution consists of two parts: coordinates and features. A single voxel may contain several laser points, or none at all. Taking the currently mainstream public 3D detection dataset KITTI as an example, under a common voxel size only about 5% of voxels contain valid laser point data. Therefore, because of the sparsity of the laser point cloud, a 3D sparse convolution module is needed to improve the computation speed.
Taking the application of the sparse convolution module to the field of automatic driving as an example: dividing the laser point cloud collected by a lidar deployed on an autonomous vehicle into voxels produces approximately 5k-8k occupied voxels with a sparsity of about 0.005. Applying ordinary 3D convolution directly would consume enormous computation time and memory, whereas sparse convolution limits the sparsity of the output by the sparsity of the input data, greatly reducing the amount of computation of subsequent convolution operations.
The three-dimensional sparse convolution module described in the embodiments of the present application may specifically be a three-dimensional model such as SECOND, CBGS, or CenterPoint, or another three-dimensional model; the specific form of the three-dimensional sparse convolution module provided in the present application is not limited here.
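For ease of understanding, the following sketch shows only the two-part input layout (coordinates and features) described above, in which only occupied voxels are stored; it is not tied to any particular sparse-convolution library, and the container name and grid shape are assumptions for illustration.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SparseVoxelTensor:
    coordinates: np.ndarray   # (M, 3) integer voxel indices of the occupied voxels
    features: np.ndarray      # (M, C) one feature vector per occupied voxel
    spatial_shape: tuple      # full (D, H, W) voxel grid, most of which stays empty

sparse_input = SparseVoxelTensor(
    coordinates=np.array([[5, 120, 88], [5, 121, 88]]),
    features=np.random.randn(2, 4).astype(np.float32),
    spatial_shape=(40, 1600, 1408),
)
```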
(6) Loss function
In training a neural network (e.g., a CNN), because we want the output of the neural network to be as close as possible to the value we actually want to predict, the weight matrices of the layers of the neural network can be updated according to the difference between the predicted value of the current network and the actually desired target value (of course, there is usually an initialization process before the first update, in which parameters are pre-configured for each layer of the neural network). It is therefore necessary to define in advance how to compare the difference between the predicted value and the target value; this is the role of the loss function or objective function, important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the neural network becomes the process of reducing this loss as much as possible.
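The target loss function used by the embodiments is not spelled out here; purely as a hedged illustration of the concept, a loss commonly used for detection combines a classification term with a smooth-L1 box regression term, as sketched below with PyTorch.

```python
import torch.nn.functional as F

def detection_loss(cls_logits, cls_labels, box_preds, box_targets, box_weight=2.0):
    cls_loss = F.cross_entropy(cls_logits, cls_labels)     # classification category term
    reg_loss = F.smooth_l1_loss(box_preds, box_targets)    # 3D box parameter term
    return cls_loss + box_weight * reg_loss                 # smaller loss = closer to the target
```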
(7) Back propagation algorithm
In the training process of the neural network, a Back Propagation (BP) algorithm can be adopted to correct the size of parameters in the initial neural network model, so that the reconstruction error loss of the neural network model is smaller and smaller. Specifically, the error loss is generated by transmitting the input signal in the forward direction until the output, and the parameters in the initial neural network model are updated by reversely propagating the error loss information, so that the error loss is converged. The back propagation algorithm is a back propagation motion with error loss as a dominant factor, aiming at obtaining the optimal parameters of the neural network model, such as a weight matrix.
Embodiments of the present application are described below with reference to the accompanying drawings. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The overall workflow of the artificial intelligence system will be described first, please refer to fig. 7, fig. 7 shows a schematic structural diagram of an artificial intelligence body framework, which is explained below from two dimensions of "intelligent information chain" (horizontal axis) and "IT value chain" (vertical axis). Where "intelligent information chain" reflects a list of processes processed from the acquisition of data. For example, the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making and intelligent execution and output can be realized. In this process, the data undergoes a "data-information-knowledge-wisdom" refinement process. The 'IT value chain' reflects the value of the artificial intelligence to the information technology industry from the bottom infrastructure of the human intelligence, information (realization of providing and processing technology) to the industrial ecological process of the system.
(1) Infrastructure
The infrastructure provides computing power support for the artificial intelligent system, realizes communication with the outside world, and realizes support through a foundation platform. Communicating with the outside through a sensor; the computing power is provided by intelligent chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, FPGA and the like); the basic platform comprises distributed computing framework, network and other related platform guarantees and supports, and can comprise cloud storage and computing, interconnection and intercommunication networks and the like. For example, sensors and external communications acquire data that is provided to intelligent chips in a distributed computing system provided by the base platform for computation.
(2) Data
Data at the upper level of the infrastructure is used to represent the data source for the field of artificial intelligence. The data relates to graphs, images, voice and texts, and also relates to the data of the Internet of things of traditional equipment, including service data of the existing system and sensing data such as force, displacement, liquid level, temperature, humidity and the like.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference refers to the process of simulating the intelligent inference mode of humans in a computer or intelligent system, in which the machine uses formalized information to reason about and solve problems according to an inference control strategy; a typical function is searching and matching.
The decision-making refers to a process of making a decision after reasoning intelligent information, and generally provides functions of classification, sequencing, prediction and the like.
(4) General capabilities
After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent product and industrial application
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and put it into practical use. The application fields mainly include: intelligent terminals, intelligent manufacturing, intelligent transportation, smart homes, intelligent medical treatment, intelligent security, automatic driving, safe cities, etc.
The embodiment of the present application can be applied to the field of computer vision in the field of artificial intelligence, and in particular to the field of target detection. Specifically, with reference to fig. 7, the data acquired by the infrastructure in the embodiment of the present application are an image to be detected and a laser point cloud, and a 3D target frame and the classification category to which the target object in the 3D target frame belongs are then obtained through a series of operations of the 3D target detection method provided by the embodiment of the present application.
Referring to fig. 8, fig. 8 is an application flow of the 3D target detection method and of a model for performing 3D target detection, which may be referred to as a 3D target detection model. The 3D target detection model may specifically include a three-dimensional first sparse convolution module 801, a three-dimensional second sparse convolution module 802, a two-dimensional convolution module 803, and a classification regression module 804. Based on the 3D target detection model, semantic segmentation is first performed on the acquired two-dimensional image to obtain a semantic segmentation map, where the semantic segmentation map is composed of the semantic segmentation scores of the pixel points in the image, and the semantic segmentation scores represent the probabilities that the pixel points belong to the respective classification categories. The 3D laser point cloud (also referred to as 3D information) is then projected onto the image and the semantic segmentation map according to a projection transformation, so as to obtain the RGB information and semantic segmentation score corresponding to each laser point; the RGB information and semantic segmentation scores corresponding to the laser points form two-dimensional information (also referred to as 2D information). The 3D information and the 2D information are then voxelized and respectively input into the two three-dimensional sparse convolution modules (i.e., the first sparse convolution module 801 and the second sparse convolution module 802 shown in fig. 8) to obtain a first feature and a second feature. The first feature and the second feature are cascaded and input into the two-dimensional convolution module (i.e., the convolution module 803 shown in fig. 8) to obtain a third feature (also referred to as a fusion feature), and the third feature is cascaded with the first feature and the second feature to obtain a fourth feature (also referred to as a combination feature). Finally, the fourth feature is input into the classification regression module 804 shown in fig. 8 (for example, the classification regression module 804 may specifically be a classification regression head), and the 3D target frame and the classification category to which the target object in the 3D target frame belongs are obtained.
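As a structural illustration of the flow in fig. 8, the following sketch strings the stages together as plain Python callables; every parameter name (semantic_segment, project_points, voxelize, and the stand-ins for modules 801 to 804) is a hypothetical placeholder rather than a concrete implementation:

import numpy as np
from typing import Callable

def detect_3d(points: np.ndarray,            # (N, 4+) laser point cloud: x, y, z, r, ...
              image: np.ndarray,             # (H, W, 3) RGB image
              semantic_segment: Callable,    # image -> (H, W, C) per-pixel class scores
              project_points: Callable,      # points + image + seg map -> per-point 2D information
              voxelize: Callable,            # per-point data -> voxelized features
              sparse_conv_a: Callable,       # stand-in for the first sparse convolution module 801
              sparse_conv_b: Callable,       # stand-in for the second sparse convolution module 802
              conv_2d: Callable,             # stand-in for the two-dimensional convolution module 803
              head: Callable):               # stand-in for the classification regression module 804
    seg_map = semantic_segment(image)                          # semantic segmentation scores
    info_2d = project_points(points, image, seg_map)           # per-point RGB + segmentation scores
    feat_1 = sparse_conv_a(voxelize(points))                   # first feature (3D information branch)
    feat_2 = sparse_conv_b(voxelize(info_2d))                  # second feature (2D information branch)
    fused = conv_2d(np.concatenate([feat_1, feat_2], axis=0))  # third feature (fusion feature)
    combined = np.concatenate([feat_1, feat_2, fused], axis=0) # fourth feature (combination feature)
    return head(combined)                                      # 3D target frames + classification categories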
To facilitate understanding of the present solution, first, a system architecture of a 3D object detection system provided in the present embodiment is described with reference to fig. 9, please refer to fig. 9, and fig. 9 is a system architecture diagram of a 3D object detection system 200 provided in the present embodiment. In fig. 9, the 3D object detection system 200 includes an execution device 210, a training device 220, a database 230, a client device 240, a data storage system 250, and a data acquisition device 260, where the execution device 210 includes a calculation module 211 and an input/output (I/O) interface 212, and the calculation module 211 is substantially the 3D object detection model 201 provided in the embodiment of the present application.
In the training phase, the data acquisition device 260 may be configured to obtain an open-source large-scale data set (i.e., a training set) required by a user, and store the data set in the database 230, where each training data (i.e., a training sample) in the data set includes a 3D laser point cloud, a 2D image, and two-dimensional information corresponding to each frame of laser point cloud. The training device 220 performs iterative training on the 3D target detection model 201 based on the data set maintained in the database 230 to obtain a mature 3D target detection model 201, that is, obtain the trained 3D target detection model 201, where the 3D target detection model 201 may specifically include the three-dimensional first sparse convolution module 801, the three-dimensional second sparse convolution module 802, the two-dimensional convolution module 803, and the classification regression module 804 described above with reference to fig. 8. The 3D object detection model 201 trained by the training apparatus 220 can be applied to different systems or apparatuses.
In the inference phase, the data acquisition device 260, such as a camera and a laser radar mounted on a wheeled mobile device, may be configured to acquire target data (in this embodiment, the target data is a 3D laser point cloud and a 2D image), and store the target data in the data storage system 250, and the execution device 210 may call the data, the code, and the like in the data storage system 250 to perform processing, and may also store the data, the instruction, and the like in the data storage system 250. The data storage system 250 may be disposed in the execution device 210 or the data storage system 250 may be an external memory with respect to the execution device 210. The calculation module 211 processes the laser point cloud and the target image stored in the data storage system 250 through the trained 3D target detection model 201 to obtain a 3D target frame and a classification category (i.e., a detection result) to which a target object in the 3D target frame belongs, and sends the detection result to an external device through the I/O interface 212, for example, to a client device 240 such as a mobile phone or a personal computer.
In some embodiments of the present application, a "user" may also input data to the I/O interface 212 through the client device 240. For example, the client device 240 may be a camera device and a laser radar mounted on an autonomous vehicle; the image captured by the camera device and the laser point cloud collected by the laser radar are input as input data to the computing module 211 of the execution device 210, the 3D target detection model 201 in the computing module 211 performs 3D target detection on the input image and laser point cloud to obtain a detection result, and the detection result is then output to the camera device or directly displayed on a display interface (if any) of the execution device 210. In addition, in some embodiments of the present application, the client device 240 may also be integrated in the execution device 210. For example, when the execution device 210 is an autonomous vehicle, the image may be directly captured by a camera of the autonomous vehicle and the laser point cloud collected by its laser radar, or the image and the laser point cloud sent by another device (e.g., a mobile phone) may be received; the calculation module 211 in the autonomous vehicle then performs 3D target detection on the image and the laser point cloud to obtain a detection result, and directly presents the detection result on a display interface (for example, that of the mobile phone). The product forms of the execution device 210 and the client device 240 are not limited herein.
It should be noted that fig. 9 is only a schematic diagram of a 3D object detection system architecture provided in an embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 9, the data storage system 250 is an external memory with respect to the execution device 210, and in other cases, the data storage system 250 may also be disposed in the execution device 210; in fig. 9, the client device 240 is an external device with respect to the execution device 210, and in other cases, the client device 240 may be integrated in the execution device 210.
It should be further noted that, in some embodiments of the present application, the 3D object detection system 200 may also be split into a plurality of sub-modules/sub-units to jointly implement the solution provided in the embodiments of the present application, which is not limited herein.
It should be further noted that, the training of the 3D object detection model 201 described in the foregoing embodiment may be implemented on the cloud side, for example, the training device 220 on the cloud side (the training device 220 may be disposed on one or more servers or virtual machines) may obtain a training set, and train the 3D object detection model 201 according to training data in the training set to obtain the trained 3D object detection model 201, and then the trained 3D object detection model 201 is sent to the execution device 210 for application, for example, a method for executing 3D object detection is sent to the execution device 210, for example, in the system architecture corresponding to fig. 9, the training device 220 performs overall training on the 3D object detection model 201, and the trained 3D object detection model 201 is sent to the execution device 210 for use; the training of the 3D object detection model 201 described in the above embodiment may also be implemented on the terminal side, that is, the training device 220 may be located on the terminal side, for example, a training set may be obtained by a terminal device (e.g., a mobile phone, a smart watch, etc.), a wheeled mobile device (e.g., an autonomous vehicle, an assisted driving vehicle, etc.), etc., and the 3D object detection model 201 may be trained according to a plurality of training data in the training set to obtain the trained 3D object detection model 201, where the trained 3D object detection model 201 may be directly used by the terminal device, or may be sent by the terminal device to other devices for use. The embodiment of the present application does not specifically limit on which device (cloud side or terminal side) the 3D object detection model 201 is trained or applied.
It should be noted that, in some embodiments of the present application, the execution device 210 deployed with the trained 3D target detection model 201 may further include a camera and a laser radar, which are configured to collect environmental information around the execution device, so as to obtain the target image and the target laser point cloud to be input.
It should be further noted that, in the above embodiments of the present application, the 3D object detection model 201 trained by the training device 220 may be applied to different systems or devices (i.e., the execution device 210). For example, the execution device 210 may be various terminal-side devices with a display interface, such as a camera, a video recorder, or an edge device of a smart home; the execution device 210 may also be an end-side device such as a mobile phone, a personal computer, a computer workstation, a tablet computer, a smart wearable device (e.g., a smart watch, a smart bracelet, a smart headset, and the like), a game machine, a set-top box, or a media consumption device; the execution device 210 may also be various wheeled mobile devices (wheeled construction equipment, autonomous vehicles, assisted driving vehicles, etc.), and an autonomous vehicle may be a car, truck, motorcycle, bus, boat, airplane, helicopter, lawn mower, recreational vehicle, playground vehicle, construction equipment, trolley, golf cart, train, cart, and the like. Any device that can deploy the 3D object detection model 201 described in the embodiment of the present application may be considered as the execution device 210 described in the embodiment of the present application, and is not limited herein.
With reference to the above description, the embodiments of the present application provide a 3D target detection method and a training method of a 3D target detection model, which can be applied to an inference phase and a training phase of the 3D target detection model, respectively. Since the flow of the training phase and the reasoning phase are different, the following description starts with the two phases separately.
First, training phase
In the embodiment of the present application, the training phase is the process in which the training device 220 in fig. 9 performs the training operation on the 3D object detection model 201 by using the training data. Specifically, referring to fig. 10, fig. 10 is a schematic flowchart of a training method for a 3D target detection model according to an embodiment of the present disclosure, where the 3D target detection model includes a three-dimensional first sparse convolution module, a three-dimensional second sparse convolution module, a two-dimensional convolution module, and a classification regression module, and the method includes the following steps:
1001. and constructing a first training set according to the initial training set, wherein a first training sample in the first training set is a training sample obtained by extracting a target object from each initial training sample in the initial training set and randomly copying the target object in each initial training sample, the initial training sample comprises initial laser point cloud and initial two-dimensional information, and the first training sample comprises the first laser point cloud and the first two-dimensional information.
Firstly, a training device acquires an initial training set, any one training sample in the initial training set can be called as an initial training sample, each initial training sample in the initial training set comprises a frame of laser point cloud and two-dimensional information (can be called as an initial laser point cloud and initial two-dimensional information), the initial two-dimensional information is obtained according to the initial laser point cloud and an initial image, and the initial laser point cloud corresponds to the initial image, namely sensor data acquired by different types of sensors at the same time and the same position.
Then, the training device constructs a first training set according to the initial training set, and the construction process is as follows: local laser points and local two-dimensional information of target objects, such as automobiles, trucks and pedestrians, are extracted from each initial training sample in the initial training set, and the local laser points and local two-dimensional information of the target objects are randomly copied into each initial training sample to obtain the first training samples, which together form the first training set. Likewise, each first training sample in the first training set also includes a frame of laser point cloud and one piece of two-dimensional information (which may be referred to as a first laser point cloud and first two-dimensional information).
Specifically, in some embodiments of the present application, as shown in fig. 11, the initial two-dimensional information includes initial RGB information corresponding to each initial laser point in an initial laser point cloud obtained by projecting the initial laser point cloud to an initial image, and an initial semantic segmentation score corresponding to each initial laser point obtained by projecting the initial laser point cloud to an initial semantic segmentation map, where the initial semantic segmentation score is a probability that each pixel point in the initial image obtained by performing semantic segmentation on the initial image through a semantic segmentation model belongs to each classification category, and the initial semantic segmentation score of each pixel point in the initial image constitutes the initial semantic segmentation map.
Similarly, the first two-dimensional information includes first RGB information corresponding to each first laser point in the first laser point cloud obtained by projecting the first laser point cloud to the first image, and a first semantic segmentation score corresponding to each first laser point obtained by projecting the first laser point cloud to the first semantic segmentation map, the first semantic segmentation score is a probability that each pixel point in the first image obtained by performing semantic segmentation on the first image through the semantic segmentation model belongs to each classification category, and the first semantic segmentation score of each pixel point in the first image constitutes the first semantic segmentation map.
It should be noted that, in the embodiment of the present application, the process of constructing the first training set according to the initial training set may also be referred to as label data enhancement (gt-augmentation), which is a commonly used data enhancement method in 3D object detection. Its principle is to increase the number of instances in the training samples by randomly copying the local laser point clouds occupied by target objects such as "cars", "motorcycles" and "bicycles" in the initial training samples, together with the corresponding two-dimensional information, and pasting them into each initial training sample, so as to alleviate the problem of unbalanced sample numbers. The processing is as follows: in the training process, since each frame of initial laser point cloud corresponds to an initial image (for example, an image and a frame of laser point cloud captured from the vehicle at the same position and moment), gt-augmentation extracts the local laser point cloud of a certain object (such as a bicycle) from each frame of laser point cloud in all initial training samples, together with the local pixels of the bicycle in the corresponding image and the local semantic segmentation scores of the bicycle in the corresponding semantic segmentation map (which can be considered to be extracted together), and randomly pastes them into other initial training samples.
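A simplified numpy sketch of this copy-and-paste operation, assuming axis-aligned 3D boxes and two-dimensional information stored row-aligned with the laser points; both assumptions are made only for illustration, and a real gt-augmentation implementation would additionally check for collisions with existing objects:

import numpy as np

def points_in_box(points: np.ndarray, box_min: np.ndarray, box_max: np.ndarray) -> np.ndarray:
    # Boolean mask of the laser points lying inside an axis-aligned 3D box.
    return np.all((points[:, :3] >= box_min) & (points[:, :3] <= box_max), axis=1)

def gt_augment(src_points, src_info_2d, src_box, dst_points, dst_info_2d):
    # Copy the local laser points and local 2D information of one target object
    # (bounded by src_box = (box_min, box_max)) from a source sample and paste
    # them into a destination sample; collision checks are omitted here.
    mask = points_in_box(src_points, *src_box)
    aug_points = np.concatenate([dst_points, src_points[mask]], axis=0)
    aug_info_2d = np.concatenate([dst_info_2d, src_info_2d[mask]], axis=0)
    return aug_points, aug_info_2d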
1002. And performing convolution operation on the first laser point cloud through the first sparse convolution module to obtain a first characteristic, and performing convolution operation on the first two-dimensional information through the second sparse convolution module to obtain a second characteristic.
After the training equipment constructs a first training set according to the initial training set, first training samples in the first training set are training samples subjected to data enhancement, the first training samples comprise first laser point clouds and first two-dimensional information, and the training equipment executes convolution operation on the first laser point clouds and the first two-dimensional information through a first sparse convolution module and a second sparse convolution module respectively, so that first characteristics corresponding to the first laser point clouds and second characteristics corresponding to the first two-dimensional information are obtained respectively.
It should be noted that, in some embodiments of the present application, the input data of the sparse convolution module may be subjected to voxelization first, and then input to the sparse convolution module to perform convolution operation, so that the specific manner of the training apparatus performing convolution operation on the first laser point cloud and the first two-dimensional information through the first sparse convolution module and the second sparse convolution module respectively may be, but is not limited to, the following manner:
the first mode can be specifically referred to fig. 12, where fig. 12 is a schematic diagram of performing a convolution operation after a training device performs voxelization processing on a first laser point cloud and first two-dimensional information, specifically, the training device performs voxelization processing on the first laser point cloud to obtain a voxelized first laser point cloud, and then performs a convolution operation on the voxelized first laser point cloud by using a first sparse convolution module to obtain a first feature; similarly, the training device performs voxelization processing on the first two-dimensional information to obtain voxelized first two-dimensional information, and then performs convolution operation on the voxelized first two-dimensional information through the second sparse convolution module to obtain the second feature.
A second mode, specifically referring to fig. 13, fig. 13 is another schematic diagram of a training device performing a voxelization operation on a first laser point cloud and first two-dimensional information and then performing a convolution operation, specifically, the training device first cascades the first laser point cloud and the first two-dimensional information to obtain multi-modal information, then performs a voxelization operation on the multi-modal information once to obtain the voxelized multi-modal information, obviously, the voxelized multi-modal information includes the voxelized first laser point cloud and the voxelized first two-dimensional information, and finally performs a convolution operation on the voxelized first laser point cloud by using a first sparse convolution module to obtain a first feature, and performs a convolution operation on the voxelized first two-dimensional information by using a second sparse convolution module to obtain a second feature.
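A minimal numpy sketch of the voxelization step used in either mode, assuming the per-point feature vectors are simply averaged within each voxel; the voxel size and the averaging rule are illustrative assumptions, and real implementations usually also cap the number of points per voxel:

import numpy as np

def voxelize(points: np.ndarray, voxel_size: float = 0.1):
    # Group per-point feature vectors (x, y, z, r, t, and optionally the 2D information)
    # into voxels by quantizing the x, y, z coordinates and averaging the points per voxel.
    coords = np.floor(points[:, :3] / voxel_size).astype(np.int64)
    unique_coords, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    counts = np.bincount(inverse, minlength=len(unique_coords))
    features = np.zeros((len(unique_coords), points.shape[1]), dtype=np.float64)
    for dim in range(points.shape[1]):
        features[:, dim] = np.bincount(inverse, weights=points[:, dim],
                                       minlength=len(unique_coords)) / counts
    return unique_coords, features   # sparse voxel coordinates and voxel features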
1003. And after the first characteristic and the second characteristic are cascaded, inputting the first characteristic and the second characteristic into a convolution module for convolution operation to obtain a third characteristic.
After the training equipment obtains a first feature corresponding to the first laser point cloud and a second feature corresponding to the first two-dimensional information, the first feature and the second feature are cascaded, and the cascaded first feature and the cascaded second feature are input into a two-dimensional convolution module to be subjected to convolution operation, so that a third feature is obtained.
It should be noted that in the embodiment of the present application, the cascading between features refers to superimposing features to obtain a new cascading feature, for example, assuming that the first feature is 1 × 2 × 3 and the second feature is also 1 × 2 × 3, where 1 represents the number of channels and 2 × 3 represents the size of the first/second feature, the cascading feature obtained after cascading the first feature and the second feature is 2 × 2 × 3, 2 is the number of channels after cascading, and 2 × 3 is the size of the cascading feature after cascading.
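Written with torch, the channel-wise cascade in this example looks as follows; the tensor contents are random and purely illustrative:

import torch

first = torch.randn(1, 2, 3)    # 1 channel, 2 x 3 feature map
second = torch.randn(1, 2, 3)   # 1 channel, 2 x 3 feature map
cascaded = torch.cat([first, second], dim=0)   # cascade along the channel dimension
print(cascaded.shape)           # torch.Size([2, 2, 3]): 2 channels, same 2 x 3 size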
It should be noted that, in some embodiments of the present application, input data of the two-dimensional convolution module may be subjected to de-voxelization first, where the de-voxelization is an operation process opposite to the voxelization, and then input into the two-dimensional convolution module to perform convolution operation after the third feature is de-voxelization, specifically, the training device concatenates the first feature and the second feature to obtain a concatenated feature, then performs de-voxelization on the concatenated feature to obtain a de-voxelization concatenated feature, and finally the training device inputs the de-voxelization concatenated feature into the two-dimensional convolution module to perform convolution operation to obtain a third feature, that is, a fusion feature.
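One common way to interpret the de-voxelization is to scatter the sparse voxel features back onto a dense grid and then collapse the height dimension into channels, yielding a bird's-eye-view map that a two-dimensional convolution can consume; the sketch below assumes that interpretation, starts from an already dense volume, and uses arbitrary shapes:

import torch
import torch.nn as nn

# Assumed cascaded feature: (C, D, H, W) with C channels, D height (z) bins, H x W ground plane.
cascaded = torch.randn(2, 4, 64, 64)

# "De-voxelization" here: fold the z-bins into the channel dimension to get a 2D BEV feature map.
bev = cascaded.reshape(2 * 4, 64, 64).unsqueeze(0)    # (1, 8, 64, 64)

conv_2d = nn.Conv2d(in_channels=8, out_channels=16, kernel_size=3, padding=1)
third_feature = conv_2d(bev)                          # fusion feature, (1, 16, 64, 64)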
1004. And cascading the first feature, the second feature and the third feature to obtain a fourth feature.
After the training device obtains the third feature, the first feature, the second feature and the third feature obtained before are cascaded to obtain a fourth feature, which may also be referred to as a combined feature, and the cascading manner is similar to that of the first feature and the second feature, which is not described herein again.
1005. And inputting the fourth features into a classification regression module to obtain the 3D prediction target frame and the prediction classification category to which the target object in the 3D prediction target frame belongs.
After the training device obtains the fourth feature, the fourth feature may be input into the classification regression module, for example into a classification regression head, so as to obtain the 3D prediction target frame and the prediction classification category to which the target object in the 3D prediction target frame belongs. Wherein the target object may also be referred to as a target object. Taking the right diagram of fig. 1 as an illustration: the number of obtained 3D prediction target frames is 2, their specific sizes and angles are shown in the right diagram of fig. 1, and the prediction classification categories to which the target objects in the two 3D prediction target frames belong are "car" and "tricycle", respectively.
1006. And performing iterative training on the 3D target prediction model by using a target loss function according to the 3D real target frame, the real classification category to which the target object in the 3D real target frame belongs, the 3D prediction target frame and the prediction classification category to which the target object in the 3D prediction target frame belongs.
After the training equipment obtains the 3D prediction target frame and the prediction classification category to which the target object in the 3D prediction target frame belongs, iterative training is carried out on the 3D target prediction model by using a target loss function according to the 3D real target frame marked in the training sample, the real classification category to which the target object in the 3D real target frame belongs, and the prediction classification category to which the target object in the 3D prediction target frame and the 3D prediction target frame belong.
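A schematic sketch of one such training iteration, in which the target loss is represented by a simple sum of a box regression term and a classification term; the actual loss terms, anchor matching, and model internals are not specified by this step, so everything below is a stand-in:

import torch
import torch.nn as nn

def target_loss(pred_boxes, pred_logits, gt_boxes, gt_labels):
    # Stand-in target loss: smooth-L1 on the 3D box parameters plus
    # cross-entropy on the classification categories.
    box_loss = nn.functional.smooth_l1_loss(pred_boxes, gt_boxes)
    cls_loss = nn.functional.cross_entropy(pred_logits, gt_labels)
    return box_loss + cls_loss

# One illustrative update step with random stand-in predictions and labels.
pred_boxes = torch.randn(4, 7, requires_grad=True)    # x, y, z, length, width, height, yaw
pred_logits = torch.randn(4, 10, requires_grad=True)  # 10 classification categories
gt_boxes = torch.randn(4, 7)
gt_labels = torch.randint(0, 10, (4,))

loss = target_loss(pred_boxes, pred_logits, gt_boxes, gt_labels)
loss.backward()   # gradients flow back into the parameters of the 3D target detection model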
1007. And repeating the steps 1001 to 1006 until the training round of the first training set reaches a first preset round.
Training on all training samples in the entire training set once is called one training round (i.e., one epoch); an epoch count of several indicates that several rounds of training have been performed on the training set. Therefore, in some embodiments of the present application, steps 1001 to 1006 may be repeatedly performed until the training round of the first training set reaches a first preset round. For example, if the first preset round is 5, the training apparatus will repeatedly perform steps 1001 to 1006 for 5 rounds, where the first training set in each round is reconstructed according to the initial training set, so that the first training set of each training round is not identical (i.e., data enhancement is performed each time), in order to improve the performance of the 3D detection model.
1008. And taking the initial training set as a new first training set, and repeatedly executing the steps 1001 to 1006 until the training round of the new first training set reaches a second preset round.
It should be noted that, in some embodiments of the present application, an adaptive data enhancement strategy is also proposed. The principle of the gt-augmentation data enhancement method is to randomly copy and paste the local laser point clouds occupied by target objects such as "cars", "motorcycles" and "bicycles" in the initial training samples, together with the corresponding two-dimensional information, into each initial training sample to increase the number of instances in the training samples and alleviate the problem of unbalanced sample numbers. However, many unreasonable situations can occur during pasting, for example the pasting position coincides with an original object, so that the model cannot recognize these abnormal samples, which degrades the recognition performance. Fig. 14 is a schematic diagram of such an abnormal situation: the motorcycle and the bicycle are duplicated enhanced samples, and their pasting positions are mixed with an originally existing obstacle (such as a wall), which is harmful to the training of the model. Therefore, when the training round of the first training set reaches the first preset round (e.g., 15 rounds), the gt-augmentation method is no longer used to construct the first training set; instead, the initial training set is directly used as the new first training set (only data enhancement methods such as rotation and flipping may be retained), and the 3D target detection model is iteratively trained according to the initial training set in steps 1001 to 1006 until the training round on the initial training set reaches the second preset round (e.g., 5 rounds). In this way the 3D target detection model sees more real training samples, the performance loss caused by an excessive gap between the sample space generated by data enhancement and the original input sample space is avoided, and the data enhancement efficiency is further improved. For example, as shown in fig. 15, fig. 15 is an example of the adaptive data enhancement strategy proposed in the present application: for categories with small size (i.e., sparse initial laser point clouds) and a small number of instances, such as motorcycle and bicycle, gt-augmentation is no longer applied once the whole training process reaches three quarters (assuming the total number of epochs is 20, this is epoch 15), and the model is iteratively trained with the initial training set in the remaining 5 training rounds.
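Summarized as an epoch-based switch, the adaptive strategy can be sketched as follows, assuming 20 total epochs with gt-augmentation disabled in the last quarter; build_gt_augmented_set and train_one_epoch are hypothetical helpers standing in for step 1001 and steps 1002 to 1006 respectively:

TOTAL_EPOCHS = 20
FIRST_PRESET_ROUND = 15   # epochs trained with gt-augmentation

def train_adaptive(model, initial_set, build_gt_augmented_set, train_one_epoch):
    for epoch in range(TOTAL_EPOCHS):
        if epoch < FIRST_PRESET_ROUND:
            # Rebuild the first training set each round by copy-pasting target objects.
            training_set = build_gt_augmented_set(initial_set)
        else:
            # Switch back to real samples only (rotation/flip enhancement may be retained).
            training_set = initial_set
        train_one_epoch(model, training_set)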
Second, reasoning phase
According to the training phase, the training device obtains a trained 3D target detection model, and the trained 3D target detection model can be used for performing 3D target detection by the execution device, specifically referring to fig. 16, where fig. 16 is a 3D target detection method provided in the embodiment of the present application, the method may include the following steps:
1601. and obtaining two-dimensional information corresponding to the laser point cloud according to the obtained image and the laser point cloud.
The execution equipment firstly acquires an image and a laser point cloud at a certain moment at a certain position through a sensor device (such as a camera, a laser radar and the like) deployed on the execution equipment, and then obtains two-dimensional information corresponding to the laser point cloud according to the acquired image and the laser point cloud.
Specifically, the step of obtaining, by the execution device, the two-dimensional information corresponding to the laser point cloud according to the obtained image and the laser point cloud may be performed by: firstly, performing semantic segmentation on an acquired image through a semantic segmentation model by an execution device to obtain a semantic segmentation score of each pixel point in the image, wherein the segmentation score is used for expressing the probability that each pixel point belongs to each classification category, and the semantic segmentation score of each pixel point in the image forms a semantic segmentation graph. And then, the execution equipment respectively projects the laser point clouds to the image and the semantic segmentation map obtained based on the image to obtain target RGB information and target semantic segmentation scores which respectively correspond to the laser point clouds in the laser point clouds, and the target RGB information and the target semantic segmentation scores form the two-dimensional information.
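A numpy sketch of this projection-and-gather step, assuming the lidar-to-image projection has already been composed into a single 3 x 4 matrix P (the transformation formula later in the text shows how such a matrix is composed); boundary handling is simplified:

import numpy as np

def build_2d_info(points: np.ndarray,       # (N, >=3) laser points, x, y, z first
                  image: np.ndarray,        # (H, W, 3) RGB image
                  seg_scores: np.ndarray,   # (H, W, C) semantic segmentation scores
                  P: np.ndarray):           # (3, 4) lidar-to-image projection matrix (assumed given)
    # Project each laser point into the image and gather its RGB value and semantic
    # segmentation score; out-of-image points are simply clamped here, whereas a real
    # pipeline would typically mask them out.
    homo = np.concatenate([points[:, :3], np.ones((len(points), 1))], axis=1)   # (N, 4)
    uvw = homo @ P.T                                        # (N, 3) homogeneous pixel coordinates
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-6, None)
    h, w = image.shape[:2]
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    rgb = image[v, u]                # (N, 3) per-point RGB information
    scores = seg_scores[v, u]        # (N, C) per-point semantic segmentation scores
    return np.concatenate([rgb, scores], axis=1)            # (N, 3 + C) two-dimensional information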
1602. And performing convolution operation on the laser point cloud through a three-dimensional first sparse convolution module to obtain a first characteristic, and performing convolution operation on the two-dimensional information through a three-dimensional second sparse convolution module to obtain a second characteristic.
After the execution equipment acquires the laser point cloud and the two-dimensional information corresponding to the laser point cloud, convolution operation is performed on the laser point cloud and the two-dimensional information through a three-dimensional first sparse convolution module and a three-dimensional second sparse convolution module respectively, and therefore a first feature corresponding to the laser point cloud and a second feature corresponding to the two-dimensional information are obtained respectively. This process is similar to step 1002 in FIG. 10, and will not be described herein.
It should be noted that, in some embodiments of the present application, the input data of the sparse convolution module may be subjected to voxelization first, and then input into the sparse convolution module to perform convolution operation, so that the specific manner of the execution device performing convolution operation on the laser point cloud and the two-dimensional information through the first sparse convolution module and the second sparse convolution module may be, but is not limited to, the following manner:
the first method can specifically refer to fig. 17, where fig. 17 is a schematic diagram of performing a convolution operation after performing a voxelization process on the laser point cloud and the two-dimensional information by the execution device, specifically, performing a voxelization process on the laser point cloud by the execution device to obtain a voxelized laser point cloud, and then performing a convolution operation on the voxelized laser point cloud by the first sparse convolution module to obtain a first feature; similarly, the executing device performs voxelization processing on the two-dimensional information to obtain voxelized two-dimensional information, and then performs convolution operation on the voxelized two-dimensional information through the second sparse convolution module to obtain the second feature.
A second mode, specifically referring to fig. 18, fig. 18 is another schematic diagram of an executing device performing a voxelization operation on the laser point cloud and the two-dimensional information and then performing a convolution operation, specifically, the executing device first concatenates the laser point cloud and the two-dimensional information to obtain multi-modal information, then performs a voxelization operation on the multi-modal information to obtain the voxelized multi-modal information, obviously, the voxelized multi-modal information includes the voxelized laser point cloud and the voxelized two-dimensional information, and finally performs a convolution operation on the voxelized laser point cloud by using a first sparse convolution module to obtain a first feature, and performs a convolution operation on the voxelized two-dimensional information by using a second sparse convolution module to obtain a second feature.
1603. And after the first characteristic and the second characteristic are cascaded, inputting the first characteristic and the second characteristic into a two-dimensional convolution module for convolution operation to obtain a third characteristic.
After the execution device obtains a first feature corresponding to the laser point cloud and a second feature corresponding to the two-dimensional information, the first feature and the second feature are cascaded, and the cascaded first feature and the cascaded second feature are input into a two-dimensional convolution module to be subjected to convolution operation, so that a third feature is obtained, and the third feature can be called as a fusion feature because the first feature and the second feature are fused.
It should be noted that in the embodiment of the present application, the cascading between features refers to superimposing features to obtain a new cascading feature, for example, assuming that the first feature is 1 × 2 × 3 and the second feature is also 1 × 2 × 3, where 1 represents the number of channels and 2 × 3 represents the size of the first/second feature, the cascading feature obtained after cascading the first feature and the second feature is 2 × 2 × 3, 2 is the number of channels after cascading, and 2 × 3 is the size of the cascading feature after cascading.
It should be noted that, in some embodiments of the present application, input data of the two-dimensional convolution module may be subjected to de-voxelization first, where the de-voxelization is an operation process opposite to the voxelization, and then input into the two-dimensional convolution module to perform convolution operation after the third feature is subjected to de-voxelization, specifically, the execution device concatenates the first feature and the second feature to obtain a concatenated feature, then performs de-voxelization on the concatenated feature to obtain a de-voxelization concatenated feature, and finally, the execution device inputs the de-voxelization concatenated feature into the two-dimensional convolution module to perform convolution operation to obtain a third feature, that is, a fusion feature.
1604. And cascading the first feature, the second feature and the third feature to obtain a fourth feature.
After the execution device obtains the third feature, the first feature, the second feature, and the third feature obtained before are further cascaded to obtain a fourth feature, which may also be referred to as a combined feature, and the cascading manner is similar to the cascading manner of the first feature and the second feature, which is not described herein again.
1605. And inputting the fourth features into a classification regression module to obtain the 3D target frame and the classification category to which the target object in the 3D target frame belongs.
After the executing device obtains the fourth feature, the fourth feature may be input to the classification regression module, for example, a classification regression header may be input, so as to output the 3D target frame and the classification category to which the target object in the 3D target frame belongs. Wherein the target object may also be referred to as a target object. For example, fig. 19 is an exemplary illustration: the number of the obtained 3D object frames is 3, the specific size and angle thereof are shown in fig. 19, and the classification categories to which the object objects in the 3D object frames belong are "car", and "bicycle", respectively.
In the above embodiment of the present application, the execution device first obtains the 2D information corresponding to each laser point in the laser point cloud (i.e., the 3D information), then inputs the 3D information and the 2D information into the two three-dimensional sparse convolution modules respectively to obtain a first feature and a second feature, concatenates the first feature and the second feature and inputs the result into a two-dimensional convolution module to obtain a third feature (i.e., a fusion feature), concatenates the third feature with the original first feature and second feature to obtain a fourth feature (i.e., a combination feature), and finally uses the fourth feature to realize 3D target detection through the classification regression module. In the embodiment of the present application, the 2D information corresponding to each laser point is obtained from the laser point cloud and the 2D image, and feature fusion is performed at the feature layer using the laser point cloud and the 2D information, so that the original features of the 3D laser point cloud are retained while the 3D target detection performance is improved, and 3D target detection has good robustness in complex scenes where the camera fails, such as at night or in rainy and foggy weather.
To facilitate understanding of the 3D object detection method according to the embodiment shown in fig. 16, the following example is illustrated: first, an execution device on which a camera and a laser radar are deployed acquires a frame of laser point cloud and an image at a certain moment through the laser radar and the camera respectively. For the image, the execution device first uses a 2D semantic segmentation model to obtain a semantic segmentation map of the image, where the semantic segmentation map is composed of the semantic segmentation scores corresponding to each pixel point in the image. For example, the number of categories of the nuScenes data set is 11 (namely car, pedestrian, bus, barrier, traffic cone, truck, trailer, motorcycle, construction vehicle, bicycle, and background), so the semantic segmentation score of each pixel point is the probability of it belonging to each of these 11 categories.
Each 3D laser point in the frame of laser point cloud collected by the execution device is represented as (x, y, z, r) or (x, y, z, r, t), where [ x, y, z ] represents the position coordinates of the laser point in three-dimensional space, r represents the reflection intensity of the laser point, and t represents the time sequence, that is, the time when the laser radar scans on the object, that is, the time when the laser radar obtains [ x, y, z ]. The laser point cloud is projected to the image and a semantic segmentation graph obtained based on the image through a projection transformation (such as translation, rotation and the like), and the complete transformation process is as the formula:
T = T(camera←ego) · T(ego_tc←ego_tl) · T(ego←lidar)

where T denotes a coordinate transformation matrix, ego←lidar represents the conversion from the laser radar coordinate system to the vehicle body coordinate system, ego_tc←ego_tl represents the conversion from the vehicle body coordinate system at the laser radar timestamp to the vehicle body coordinate system at the camera timestamp, and camera←ego represents the conversion from the vehicle body coordinate system to the camera coordinate system.
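Interpreted as 4 x 4 homogeneous matrices, the composition above can be sketched with numpy as below; the individual matrices come from sensor calibration and ego poses and are replaced here by identity placeholders:

import numpy as np

# Assumed 4 x 4 homogeneous transforms (identity placeholders for illustration).
T_ego_from_lidar = np.eye(4)    # laser radar coordinate system -> vehicle body (at lidar timestamp)
T_egoc_from_egol = np.eye(4)    # vehicle body at lidar timestamp -> vehicle body at camera timestamp
T_cam_from_ego = np.eye(4)      # vehicle body -> camera coordinate system

# Complete lidar-to-camera transform, applied right-to-left to a homogeneous point.
T = T_cam_from_ego @ T_egoc_from_egol @ T_ego_from_lidar

point_lidar = np.array([1.0, 2.0, 0.5, 1.0])    # a laser point in homogeneous coordinates
point_cam = T @ point_lidar                     # the same point in camera coordinates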
After the laser point cloud is projected to the image and the semantic segmentation map, RGB information corresponding to each laser point in the laser point cloud and a semantic segmentation score corresponding to each laser point can be obtained, the RGB information corresponding to each laser point and the semantic segmentation score corresponding to each laser point are used as two-dimensional information (i.e., 2D input described below), the two-dimensional information is actually a one-to-one correspondence relationship between the laser point and each pixel on the image, and then the 2D information and the original 3D laser point cloud (i.e., 3D input described below) are cascaded to obtain multi-modal information (i.e., Multimodal input described below). In some embodiments, cascading may not be performed, and the advantage of cascading is that only one voxelization process needs to be performed on the multi-modal information, and if cascading is not performed, the one voxelization process needs to be performed on the laser point cloud and the 2D information respectively. The two-dimensional information, the laser point cloud and the multi-modal information are represented as follows:
3D input: (x, y, z, r, t)
2D input: (r, g, b, car, pedestrian, bus, barrier, …, background)
Multimodal input: (x, y, z, r, t, r, g, b, car, pedestrian, bus, barrier, …, background)
wherein 3D input is a certain laser point in the laser point cloud, 2D input is the 2D information corresponding to each laser point after the laser point cloud has been projected twice (onto the image and onto the semantic segmentation map), [car, pedestrian, bus, barrier, …, background] is the semantic segmentation score of the pixel point corresponding to the 3D input, and [r, g, b] is the RGB value of the pixel point corresponding to the 3D input.
Then, the executing device performs voxelization processing on the 2D input and the 3D input respectively, or directly performs voxelization processing on the Multimodal input once. The voxelized laser point cloud and the voxelized 2D information are then passed through two 3D sparse convolution modules (i.e., the first sparse convolution module and the second sparse convolution module); the two sparse convolution modules only need to have the same number of output channels, and their numbers of convolution layers can differ. After the two sparse convolution modules perform the convolution operation, a first feature (i.e., the 3D feature described below) corresponding to the laser point cloud and a second feature (i.e., the 2D feature described below) corresponding to the two-dimensional information are obtained respectively. The two features are then cascaded (i.e., channel-cascaded) and passed through a 2D convolution module (whose number of convolution layers can be preset) for a convolution operation, to obtain a third feature (which may also be referred to as a fusion feature, hereinafter the Multimodal feature) after the two features are fused. Finally, the obtained fusion feature is concatenated with the first feature and the second feature (again a channel cascade, similar to the above) to obtain the final fourth feature (which may also be referred to as a combined feature, hereinafter the Combined feature). The first feature, second feature, third feature, and fourth feature are represented as follows:
3D feature = Conv3D(3D input)
2D feature = Conv3D(2D input)
Multimodal feature = Conv2D(Concatenate(3D feature, 2D feature))
Combined feature = Concatenate(3D feature, 2D feature, Multimodal feature)
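Read as network modules, the four expressions above correspond to a structure like the following sketch, which substitutes dense 3D convolutions for the sparse convolution modules purely for illustration; all channel counts and grid sizes are arbitrary assumptions:

import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    # Illustrative stand-in for the feature-level fusion: two 3D branches, a 2D fusion
    # convolution on the cascaded BEV features, and a final channel cascade.
    def __init__(self, c3d: int = 5, c2d: int = 14, c_out: int = 16, z_bins: int = 4):
        super().__init__()
        self.branch_3d = nn.Conv3d(c3d, c_out, kernel_size=3, padding=1)   # dense stand-in for module 801
        self.branch_2d = nn.Conv3d(c2d, c_out, kernel_size=3, padding=1)   # dense stand-in for module 802
        self.fuse = nn.Conv2d(2 * c_out * z_bins, c_out, kernel_size=3, padding=1)  # stand-in for module 803

    def forward(self, vox_3d, vox_2d):
        f3d = self.branch_3d(vox_3d)                  # 3D feature
        f2d = self.branch_2d(vox_2d)                  # 2D feature
        cat = torch.cat([f3d, f2d], dim=1)            # channel cascade
        n, c, d, h, w = cat.shape
        bev = cat.reshape(n, c * d, h, w)             # fold z-bins into channels ("de-voxelization")
        fused = self.fuse(bev)                        # Multimodal (fusion) feature
        combined = torch.cat([f3d.reshape(n, -1, h, w),
                              f2d.reshape(n, -1, h, w), fused], dim=1)   # Combined feature
        return combined

# Example shapes: 5-dim 3D input, 14-dim 2D input (r, g, b + 11 class scores), 4 z-bins, 32 x 32 grid.
out = FusionSketch()(torch.randn(1, 5, 4, 32, 32), torch.randn(1, 14, 4, 32, 32))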
Finally, the Combined feature is input into the classification regression module (e.g., into the classification regression head), so as to output the 3D object box and the classification category to which the object in the 3D object box belongs. Wherein the target object may also be referred to as a target object. In the embodiment of the application, the fourth feature incorporates the feature of the original laser point cloud (i.e., the first feature), which ensures that the 3D target detection model still works normally when the camera fails, such as at night or in rainy and foggy weather, so that the 3D target detection model has good robustness in such special scenes.
In order to more intuitively appreciate the beneficial effects brought by the embodiment of the application, the technical effects are further compared below. In order to fairly compare the advantages and disadvantages of various 3D target detection algorithms, the application verifies the effectiveness of the 3D target detection method provided by the embodiment of the application on the nuScenes and KITTI data sets. nuScenes is a large-scale autonomous driving data set published by the autonomous driving company nuTonomy in 2019. The data set contains not only camera and lidar data, but also radar data. It consists of 1000 scenes containing 1.4 million images, 400,000 lidar scans (determining the distance between objects) and 1.1 million three-dimensional bounding boxes (objects detected with a combination of RGB camera, radar and lidar). There are 10 target detection classes, with a severe class imbalance problem, so the data set is of practical significance. The KITTI data set was created by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago. It is used to evaluate the performance of computer vision technologies such as stereo imaging (stereo), optical flow (optical flow), visual odometry (visual odometry), 3D object detection (object detection) and 3D tracking (tracking) in a vehicle-mounted environment. KITTI contains real image data collected in urban, rural, highway and other scenes; each image contains at most 15 vehicles and 30 pedestrians with various degrees of occlusion and truncation. The 3D target detection training set contains 3712 samples, the validation set 3769, and the test set 7518. Table 1 below shows the detection results of the 3D target detection method provided in this embodiment and other existing 3D target detection methods on the nuScenes data set, and table 2 shows the corresponding results on the KITTI data set.
Table 1: Detection performance of the 3D target detection method of the present application and other 3D target detection methods on the nuScenes data set
Table 2: detection performance of the 3D target detection method and other 3D target detection methods on KITTI (transmission time interval)
As shown in table 1, the 3D target detection method provided by the present application participated in the 2020 nuScenes detection challenge and achieved the top performance: its mAP exceeds the second-place CenterPoint by 3 points and its NDS by 1.5 points, and it exceeds last year's challenge champion CBGS (Megvii) by 12 mAP points and 5.7 NDS points. In addition, on the KITTI data set the 3D target detection method provided by the application uses the SECOND model as the baseline to verify the effectiveness of the method; the detection performance is shown in table 2, where the results of both 3D target detection and BEV detection are consistently improved, and the improvements on the pedestrian and cyclist categories are particularly notable.
The 3D target detection method and the trained 3D target detection model provided by the embodiment of the application can be used for carrying out 3D detection on various objects in the fields of intelligent security, safe cities, intelligent terminals and the like, and a plurality of application scenes of falling to products are introduced below.
(1) Automatic driving scenario
Autonomous driving is currently a very popular research direction. With economic development, the number of automobiles worldwide keeps increasing, and traffic jams, difficulty in finding parking spaces, difficulty in driving and frequent accidents are becoming more and more common. Autonomous driving technology has become the latest development direction of the entire automotive industry; it can comprehensively improve the safety and comfort of driving and meet higher-level market demands. Autonomous driving is a complete software and hardware interactive system whose core technologies include hardware (automobile manufacturing technology, autonomous driving chips), autonomous driving software, high-precision maps, sensor communication networks, and the like. As shown in fig. 20, fig. 20 illustrates a top-down layered architecture of an autonomous vehicle; defined interfaces may exist between the systems for transmitting data between them, to ensure the real-time performance and integrity of the data. The following briefly introduces the various systems:
a. environment sensing system
The environmental perception is the most basic part in the automatic driving vehicle, and no matter the driving behavior decision or the global path planning is made, the corresponding judgment, decision and planning are carried out on the basis of the environmental perception according to the real-time perception result of the road traffic environment, so that the intelligent driving of the vehicle is realized. The environment sensing system mainly utilizes various sensors to obtain related environment information so as to complete construction of an environment model and knowledge expression of a traffic scene, the used sensors comprise a camera, a single-line radar (SICK), a four-line radar (IBEO), a three-dimensional laser radar (HDL-64E) and the like, wherein the camera is mainly responsible for traffic light detection, lane line detection, road sign detection, vehicle identification and the like; the laser sensor is mainly responsible for detection, identification and tracking of dynamic/static obstacles and accurate positioning of the laser sensor, for example, laser emitted by a three-dimensional laser radar generally collects external environment information at the frequency of 10FPS, returns a laser point cloud at each moment, and finally sends the acquired real-time laser point cloud to an autonomous decision-making system for further decision-making and planning.
b. Autonomous decision making system
The autonomous decision making system is a key component of an autonomous vehicle and mainly comprises two core subsystems: behavior decision making and motion planning. The behavior decision subsystem obtains a globally optimal driving route by running a global planning layer to determine the specific driving task, and outputs information such as the positions and orientations of objects around the vehicle according to each frame of laser point cloud and image sent in real time by the environment perception system, using the trained 3D target detection model deployed on the autonomous vehicle. Finally, based on road traffic rules and driving experience, it decides on a reasonable driving behavior according to the localization of the vehicle itself and information such as the positions and orientations of surrounding objects, and sends the driving behavior instruction to the motion planning subsystem; the motion planning subsystem plans a feasible driving trajectory based on indexes such as safety and stability according to the received driving behavior instruction and the current environment perception information, and sends the trajectory to the control system.
c. Control system
The control system is in particular also divided into two parts: the system comprises a control subsystem and an execution subsystem, wherein the control subsystem is used for converting a feasible driving track generated by the autonomous decision system into specific execution instructions of each execution module and transmitting the specific execution instructions to the execution subsystem; the execution subsystem receives the execution instruction from the control subsystem and then sends the execution instruction to each control object to reasonably control the steering, braking, accelerator, gear and the like of the vehicle, so that the vehicle automatically runs to complete corresponding driving operation.
It should be noted that the general architecture of the autonomous vehicle shown in fig. 20 is only illustrative, and in practical applications, more or fewer systems/subsystems or modules may be included, and each system/subsystem or module may include multiple components, which is not limited herein.
Based on the autonomous vehicle shown in fig. 20, the autonomous vehicle performs target detection on the acquired image and the laser point cloud in real time to locate positions of pedestrians, obstacles, vehicles, and the like, and then executes a corresponding driving strategy. In practical application, the automatic driving vehicle can know the surrounding traffic conditions through video information/images collected by a camera or laser point clouds collected by a laser radar, obtain a target object in front of the automatic driving vehicle based on a deployed trained 3D target detection model, and draw a safe and reliable route based on the target object so as to navigate the road in front. Compared with 2D target detection, the 3D target detection can provide the position, the size and the direction of an object in a three-dimensional environment, is a very important part in an environment perception module, and accurately detects the object in the environment, so that the automatic driving safety trip guarantee is realized.
(2) Image processing scene (for example, mobile phone)
With the rapid growth of the economy and continuous social progress, people have higher expectations of daily entertainment; the camera functions of terminal devices (such as mobile phones) are becoming more and more refined, 2D target detection is mature, and research on real-time 3D target detection of moving everyday objects is increasing. As shown in fig. 21, the 3D target detection model provided in the embodiment of the present application may be deployed on a terminal such as a mobile phone, and multi-modal feature fusion is performed by using the sparse 3D laser point cloud obtained by the depth camera of the mobile phone and the picture taken by the camera, so as to implement the 3D target detection method described in the embodiment of the present application, which can greatly improve user experience and increase the fun of shooting.
(3) Intelligent robot interaction scenario
Intelligent robots will enter thousands of households in the future and are expected to become assistants to humans, able to perceive the surrounding environment and act accordingly. In practical applications, an intelligent robot can collect images and laser point clouds of the surrounding environment and perform 3D target detection on them through the trained 3D target detection model deployed on the robot, so as to locate a certain target. For example, referring to fig. 22, a person in a room is doing housework and needs a bowl, so he asks the robot butler to hand him the bowl. After receiving the instruction, the robot butler first senses the surrounding environment by acquiring images and laser point clouds of it, locates the bowl by applying the 3D target detection method of the embodiment of the present application based on the deployed trained 3D target detection model, and can then perform a series of subsequent actions.
(4) Mobile robot navigation
With the progress of science and technology, mobile robots are widely used in various industries. A mobile robot is an intelligent device that moves under autonomous control and performs work automatically; it can receive user commands, run pre-programmed programs, and even move autonomously without human intervention. Nowadays, in homes, shopping malls, restaurants and other settings, robots using laser navigation account for the majority and cover almost all indoor scenes. In a restaurant, a food delivery robot can move freely without bumping into guests even in a crowd; in a shopping mall, a lost visitor can be guided to a destination by tapping the large screen on the robot; at home, a sweeping robot knows which places to sweep and which not to.
As shown in fig. 23, the trained 3D target detection model according to the embodiment of the present application is deployed on the mobile robot. Similar to an autonomous vehicle, mobile robot navigation first requires accurate perception of objects in the three-dimensional environment. A robot is generally equipped with a laser radar and a camera, and the 3D target detection method provided in the embodiment of the present application can further improve 3D target detection performance and ensure correct and safe navigation of the robot.
(5) 3D point cloud data annotation
Compared with 2D data, 3D data has one more dimension, and 3D point cloud data is sparse and difficult to label. At present, training deep networks is inseparable from data; the 3D target detection model provided in the embodiment of the present application can assist 3D point cloud data annotation, reducing labor cost while ensuring annotation quality and improving annotation efficiency.
It should be noted that the trained target detection model described in this application may be applied not only to the application scenarios described in fig. 20 to fig. 23, but also to various subdivided fields of the artificial intelligence field, such as the image processing field, the computer vision field, the semantic analysis field, and so on; the trained 3D target detection model provided in the embodiment of the present application may be applied to any field and device in which a neural network can be used, which is not enumerated here.
On the basis of the corresponding embodiment, in order to better implement the above-mentioned scheme of the embodiment of the present application, the following also provides a related device for implementing the above-mentioned scheme. Specifically referring to fig. 24, fig. 24 is a schematic structural diagram of an execution device according to an embodiment of the present application, where the execution device 2400 includes: the system comprises an acquisition module 2401, a first operation module 2402, a second operation module 2403, a cascade module 2404 and a detection module 2405, wherein the acquisition module 2401 is used for obtaining two-dimensional information corresponding to laser point cloud according to an acquired image and the laser point cloud; a first operation module 2402, configured to perform convolution operation on the laser point cloud through a three-dimensional first sparse convolution module to obtain a first feature, and perform convolution operation on the two-dimensional information through a three-dimensional second sparse convolution module to obtain a second feature; a second operation module 2403, configured to cascade the first feature and the second feature, and input the cascade to a two-dimensional convolution module for convolution operation to obtain a third feature; a cascading module 2404, configured to cascade the first feature, the second feature, and the third feature to obtain a fourth feature; the detection module 2405 is configured to input the fourth feature into a classification regression module to obtain a 3D target frame and a classification category to which a target object in the 3D target frame belongs.
In the above embodiment of the present application, the execution device 2400 first obtains, through the obtaining module 2401, the 2D information corresponding to each laser point in the laser point cloud (the laser point cloud itself being 3D information). The first operation module 2402 then inputs the laser point cloud and the 2D information into two three-dimensional sparse convolution modules respectively to obtain a first feature and a second feature; the second operation module 2403 cascades the first feature and the second feature and inputs the result into a two-dimensional convolution module to obtain a third feature (i.e., a fusion feature); the third feature is further cascaded with the original first feature and second feature to obtain a fourth feature (i.e., a combined feature); finally, 3D target detection is implemented from the fourth feature through the classification regression module. In the embodiment of the present application, the 2D information corresponding to each laser point is obtained from the laser point cloud and the 2D image, and feature fusion of the laser point cloud and the 2D information is performed at the feature layer, so that the original features of the 3D laser point cloud are retained while the 3D target detection performance is improved, and 3D target detection has good robustness in complex scenes where the camera fails, such as at night or in rain and fog.
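For illustration only, the following minimal sketch outlines the data flow of the above modules in PyTorch-style pseudo-modules. It is a sketch under stated assumptions, not the implementation of the embodiments: dense Conv3d layers stand in for the three-dimensional sparse convolution modules, collapsing the height dimension with a max operation is one assumed way of obtaining a dense 2D (bird's-eye-view) feature before the two-dimensional convolution module, and the class name, channel sizes and 7-parameter box encoding are illustrative placeholders.

```python
import torch
import torch.nn as nn

class FusionDetectorSketch(nn.Module):
    def __init__(self, c_pts=4, c_2d=4, c_feat=32, n_classes=3):
        super().__init__()
        # Stand-ins for the three-dimensional first/second sparse convolution modules.
        self.backbone_pts = nn.Conv3d(c_pts, c_feat, 3, padding=1)
        self.backbone_2d = nn.Conv3d(c_2d, c_feat, 3, padding=1)
        # Two-dimensional convolution module applied to the cascaded features.
        self.fuse_2d = nn.Conv2d(2 * c_feat, c_feat, 3, padding=1)
        # Classification/regression head: class scores plus 7 box parameters
        # (x, y, z, length, width, height, yaw) per location.
        self.head = nn.Conv2d(3 * c_feat, n_classes + 7, 1)

    def forward(self, vox_pts, vox_2d):
        # vox_pts: (B, c_pts, D, H, W) voxelized laser point cloud
        # vox_2d:  (B, c_2d,  D, H, W) voxelized two-dimensional information
        f1 = self.backbone_pts(vox_pts)            # first feature
        f2 = self.backbone_2d(vox_2d)              # second feature
        # Collapse the height dimension to obtain dense 2D maps ("de-voxelized").
        f1_bev = f1.max(dim=2).values
        f2_bev = f2.max(dim=2).values
        f3 = self.fuse_2d(torch.cat([f1_bev, f2_bev], dim=1))   # third (fusion) feature
        f4 = torch.cat([f1_bev, f2_bev, f3], dim=1)             # fourth (combined) feature
        return self.head(f4)                       # 3D box parameters + class scores
```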
In one possible design, the obtaining module 2401 is specifically configured to: performing semantic segmentation on the obtained image through a semantic segmentation model to obtain semantic segmentation scores of all pixel points in the image, wherein the semantic segmentation scores are used for expressing the probability that all the pixel points belong to respective classification categories, and the semantic segmentation scores of all the pixel points in the image form a semantic segmentation graph; and then projecting the laser point cloud to the image to obtain target RGB information corresponding to each laser point in the laser point cloud, projecting the laser point cloud to the semantic segmentation map to obtain a target semantic segmentation score corresponding to each laser point, wherein the target RGB information and the target semantic segmentation score form the two-dimensional information.
In the above embodiment of the present application, it is explained how the 2D information is obtained from the image and the laser point cloud: the image is first semantically segmented to obtain a semantic segmentation map, and the laser point cloud is then projected onto the image and the semantic segmentation map respectively to obtain the 2D information, which is readily implementable.
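As a rough sketch of this projection step, the snippet below assumes the laser points are already expressed in the camera coordinate frame and that a single 3x4 pinhole projection matrix is available; real calibration chains typically involve additional lidar-to-camera transforms, and the function and parameter names are illustrative.

```python
import numpy as np

def gather_2d_info(points_xyz, calib_P, rgb_image, seg_scores):
    """Project each laser point into the image and look up its 2D information.

    points_xyz: (N, 3) laser points, assumed to be in the camera frame.
    calib_P:    (3, 4) camera projection matrix (assumed pinhole model).
    rgb_image:  (H, W, 3) image.
    seg_scores: (H, W, C) per-pixel semantic segmentation scores.
    """
    n = points_xyz.shape[0]
    homo = np.concatenate([points_xyz, np.ones((n, 1))], axis=1)   # (N, 4)
    uvw = homo @ calib_P.T                                         # (N, 3)
    z_safe = np.where(uvw[:, 2] > 1e-6, uvw[:, 2], 1e-6)
    u = (uvw[:, 0] / z_safe).astype(int)
    v = (uvw[:, 1] / z_safe).astype(int)
    h, w = rgb_image.shape[:2]
    # Points behind the camera or outside the image have no valid 2D information.
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (uvw[:, 2] > 0)
    u, v = np.clip(u, 0, w - 1), np.clip(v, 0, h - 1)
    target_rgb = rgb_image[v, u]      # (N, 3) target RGB information per laser point
    target_seg = seg_scores[v, u]     # (N, C) target semantic segmentation score
    two_d_info = np.concatenate([target_rgb, target_seg], axis=1)
    return two_d_info, valid
```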
In one possible design, the first operation module 2402 is specifically configured to: performing voxelization processing on the laser point cloud to obtain a voxelized laser point cloud; and then performing convolution operation on the voxelized laser point cloud through a three-dimensional first sparse convolution module.
In the above embodiment of the present application, the input data of a three-dimensional sparse convolution module needs to be voxelized first, and the voxelized input data is then fed into the sparse convolution module for the convolution operation. Therefore, in the embodiment of the present application, the laser point cloud is first voxelized to obtain the voxelized laser point cloud, and the convolution operation is then performed by the first sparse convolution module, which provides flexibility.
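The following is a minimal voxelization sketch, assuming averaging as the per-voxel aggregation rule; the embodiments do not prescribe a particular aggregation, and the voxel size and point cloud range are assumed inputs.

```python
import numpy as np

def voxelize(points, features, voxel_size, pc_range):
    """Assign each laser point to a voxel and average the features per voxel.

    points:     (N, 3) x/y/z coordinates of the laser points.
    features:   (N, C) per-point features (e.g., raw point attributes).
    voxel_size: (3,) voxel edge length along x/y/z.
    pc_range:   (3,) minimum x/y/z of the considered point cloud range.
    Returns the integer voxel coordinates and the averaged feature per voxel.
    """
    coords = np.floor((points - np.asarray(pc_range)) / np.asarray(voxel_size)).astype(np.int64)
    uniq, inverse = np.unique(coords, axis=0, return_inverse=True)
    voxel_feats = np.zeros((uniq.shape[0], features.shape[1]))
    counts = np.bincount(inverse).reshape(-1, 1)
    np.add.at(voxel_feats, inverse, features)   # sum features falling in the same voxel
    voxel_feats /= counts                       # average per occupied voxel
    return uniq, voxel_feats
```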
In a possible design, the first operation module 2402 is further specifically configured to: performing voxelization processing on the two-dimensional information to obtain voxelized two-dimensional information; and then performing convolution operation on the voxelized two-dimensional information through a three-dimensional second sparse convolution module.
In the above embodiments of the present application, it is necessary to perform voxelization processing on the laser point cloud and also perform voxelization processing on the two-dimensional information, which is flexible, and the embodiments of the present application perform voxelization processing on the laser point cloud and the two-dimensional information respectively, which is easy to operate.
In a possible design, the first operation module 2402 is further specifically configured to: cascading the laser point cloud and the two-dimensional information to obtain multi-mode information; performing voxelization processing on the multi-modal information to obtain voxelized multi-modal information, wherein the voxelized multi-modal information comprises voxelized laser point cloud and voxelized two-dimensional information; performing convolution operation on the voxelized laser point cloud through a three-dimensional first sparse convolution module to obtain a first characteristic, and performing convolution operation on the voxelized two-dimensional information through a three-dimensional second sparse convolution module to obtain a second characteristic.
In the above embodiment of the present application, another voxelization mode is described: the laser point cloud and the two-dimensional information are cascaded to obtain the multi-modal information, so that a single voxelization of the multi-modal information voxelizes the laser point cloud and the 2D information at the same time, saving one voxelization operation.
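A sketch of this single-pass variant is shown below; it reuses the assumed voxelize helper from the earlier sketch, assumes the point cloud is stored as (x, y, z, intensity), and splits the voxelized multi-modal feature back into the two modalities for the two sparse branches.

```python
import numpy as np

def build_multimodal_voxels(points, two_d_info, voxel_size, pc_range):
    """Cascade point cloud and 2D info, then voxelize once (illustrative only).

    points:     (N, 4) laser points as (x, y, z, intensity) -- assumed layout.
    two_d_info: (N, C) 2D information aligned per laser point.
    """
    multimodal = np.concatenate([points, two_d_info], axis=1)          # (N, 4 + C)
    coords, vox_feats = voxelize(points[:, :3], multimodal, voxel_size, pc_range)
    # One voxelization covers both modalities; split for the two sparse modules.
    vox_points = vox_feats[:, :points.shape[1]]   # voxelized laser point cloud part
    vox_2d = vox_feats[:, points.shape[1]:]       # voxelized two-dimensional part
    return coords, vox_points, vox_2d
```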
In one possible design, the second operation module 2403 is specifically configured to: cascading the first characteristic and the second characteristic to obtain a cascading characteristic; performing de-voxelization processing on the cascade characteristic to obtain a de-voxelized cascade characteristic; and inputting the de-voxelized cascade features into a two-dimensional convolution module for convolution operation.
In the above embodiments of the present application, the input data of the two-dimensional convolution module needs to be de-voxelized (i.e., represented in a dense manner); de-voxelization is the inverse of the voxelization operation. The de-voxelized cascade feature is input into the two-dimensional convolution module for the convolution operation to obtain the third feature, which provides flexibility.
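As a sketch of what "representing in a dense manner" can look like, the snippet below scatters sparse voxel features back into a dense grid; the exact dense layout used by the embodiments is not specified, so the shape conventions here are assumptions.

```python
import numpy as np

def devoxelize(coords, voxel_feats, grid_shape):
    """Scatter sparse voxel features into a dense grid ("de-voxelization").

    coords:      (M, 3) integer voxel coordinates (x, y, z indices).
    voxel_feats: (M, C) feature per occupied voxel.
    grid_shape:  (X, Y, Z) number of voxels along each axis.
    Returns a dense (C, X, Y, Z) array suitable for an ordinary convolution module.
    """
    c = voxel_feats.shape[1]
    dense = np.zeros((c, *grid_shape), dtype=voxel_feats.dtype)
    dense[:, coords[:, 0], coords[:, 1], coords[:, 2]] = voxel_feats.T
    return dense
```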
It should be noted that the contents of information interaction, execution processes, and the like between the modules/units in the execution device 2400 may be specifically applied to various application scenarios in the foregoing corresponding method embodiments in the present application, and the specific contents may refer to descriptions in the foregoing method embodiments in the present application, and are not described herein again.
Next, another related apparatus provided in the embodiment of the present application is introduced. Specifically, referring to fig. 25, fig. 25 is a schematic structural diagram of a training apparatus provided in the embodiment of the present application, and the training apparatus 2500 includes: a building module 2501, a first operation module 2502, a second operation module 2503, a cascade module 2504, a prediction module 2505, and a training module 2506. The building module 2501 is configured to construct a first training set according to an initial training set; an initial training sample in the initial training set includes an initial laser point cloud and initial two-dimensional information corresponding to the initial laser point cloud, a first training sample in the first training set includes a first laser point cloud and first two-dimensional information corresponding to the first laser point cloud, the first training sample is a training sample obtained by extracting target objects from the initial training samples in the initial training set and randomly copying the extracted target objects into each initial training sample, the initial training sample is any one of the training samples in the initial training set, and the first training sample is any one of the training samples in the first training set. The first operation module 2502 is configured to perform a convolution operation on the first laser point cloud through a three-dimensional first sparse convolution module to obtain a first feature, and perform a convolution operation on the first two-dimensional information through a three-dimensional second sparse convolution module to obtain a second feature; the second operation module 2503 is configured to cascade the first feature and the second feature and input the result into a two-dimensional convolution module for a convolution operation to obtain a third feature; the cascade module 2504 is configured to cascade the first feature, the second feature and the third feature to obtain a fourth feature; the prediction module 2505 is configured to input the fourth feature into a classification regression module to obtain a 3D prediction target frame and a prediction classification category to which a target object in the 3D prediction target frame belongs; the training module 2506 is configured to perform iterative training on a model by using a target loss function according to a 3D real target frame, a real classification category to which a target object in the 3D real target frame belongs, the 3D prediction target frame, and the prediction classification category to which the target object in the 3D prediction target frame belongs, where the model includes the first sparse convolution module, the second sparse convolution module, the convolution module, and the classification regression module.
In the embodiment of the present application, how to perform iterative training on each module in the 3D object detection model by using each module in the training device 2500 is specifically described, and the training process is easy to implement.
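For intuition, the following sketch shows one way the building module's copy-paste construction of the first training set could look. It is an assumption-laden illustration: the sample field names ('points', 'two_d_info', 'boxes', 'box'), the number of copied objects, and the omission of collision checks and object placement are all simplifications, not details taken from the embodiments.

```python
import random

def build_first_training_set(initial_set, gt_bank, copies_per_sample=10):
    """Construct the first training set from the initial training set (sketch).

    initial_set: list of samples; each sample is a dict with keys
                 'points', 'two_d_info', 'boxes' (ground-truth 3D boxes).
    gt_bank:     target objects (points + 2D info + box) previously cropped
                 from all initial training samples.
    Each first training sample is an initial sample with a few randomly chosen
    target objects pasted in; overlap checks are omitted for brevity.
    """
    first_set = []
    for sample in initial_set:
        aug = {k: list(v) for k, v in sample.items()}   # shallow copy of the sample
        for obj in random.sample(gt_bank, k=min(copies_per_sample, len(gt_bank))):
            aug['points'] += obj['points']              # paste the object's laser points
            aug['two_d_info'] += obj['two_d_info']      # paste its per-point 2D info
            aug['boxes'].append(obj['box'])             # add its ground-truth box
        first_set.append(aug)
    return first_set
```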
In one possible design, the initial two-dimensional information includes initial RGB information corresponding to each initial laser point in the initial laser point cloud obtained by projecting the initial laser point cloud to an initial image, and an initial semantic segmentation score corresponding to each initial laser point obtained by projecting the initial laser point cloud to an initial semantic segmentation map, where the initial semantic segmentation score is a probability that each pixel point in the initial image obtained by performing semantic segmentation on the initial image by a semantic segmentation model belongs to each classification category, and the initial semantic segmentation score of each pixel point in the initial image constitutes the initial semantic segmentation map; the first two-dimensional information comprises first RGB information corresponding to each first laser point in the first laser point cloud obtained by projecting the first laser point cloud to a first image and a first semantic segmentation score corresponding to each first laser point obtained by projecting the first laser point cloud to a first semantic segmentation map, the first semantic segmentation score is the probability that each pixel point in the first image obtained by performing semantic segmentation on the first image through the semantic segmentation model belongs to each classification category, and the first semantic segmentation score of each pixel point in the first image forms the first semantic segmentation map.
In the above embodiment of the present application, the specific content of the two-dimensional information is described, and the scheme is readily implementable.
In one possible design, the first operating module 2502 is specifically configured to: performing voxelization processing on the first laser point cloud to obtain a voxelized first laser point cloud; performing, by the first sparse convolution module, a convolution operation on the voxelized first laser point cloud.
In the above embodiment of the present application, the input data of a three-dimensional sparse convolution module needs to be voxelized first, and the voxelized input data is then fed into the sparse convolution module for the convolution operation. Therefore, in the embodiment of the present application, the first laser point cloud is first voxelized to obtain the voxelized first laser point cloud, and the convolution operation is then performed by the first sparse convolution module, which provides flexibility.
In one possible design, the first operating module 2502 is further configured to: performing voxelization processing on the first two-dimensional information to obtain voxelized first two-dimensional information; performing, by the second sparse convolution module, a convolution operation on the voxelized first two-dimensional information.
In the above embodiments of the present application, it is not only necessary to perform voxelization processing on the first laser point cloud, but also to perform voxelization processing on the first two-dimensional information, which is flexible.
In one possible design, the first operating module 2502 is further configured to: cascading the first laser point cloud and the first two-dimensional information to obtain multi-mode information; performing voxelization processing on the multi-modal information to obtain voxelized multi-modal information, wherein the voxelized multi-modal information comprises voxelized first laser point cloud and voxelized first two-dimensional information; performing convolution operation on the voxelized first laser point cloud through the first sparse convolution module to obtain a first characteristic, and performing convolution operation on the voxelized first two-dimensional information through the second sparse convolution module to obtain a second characteristic.
In the above embodiment of the present application, another voxelization mode is described: the first laser point cloud and the first two-dimensional information are cascaded to obtain the multi-modal information, so that a single voxelization of the multi-modal information voxelizes the first laser point cloud and the first two-dimensional information at the same time, saving one voxelization operation.
In one possible design, the second operation module 2503 is specifically configured to: cascading the first characteristic and the second characteristic to obtain a cascading characteristic; performing de-voxelization processing on the cascade characteristic to obtain a de-voxelized cascade characteristic; and inputting the de-voxelized cascade features into the convolution module for convolution operation.
In the above embodiments of the present application, the input data of the two-dimensional convolution module needs to be de-voxelized (i.e., represented in a dense manner); de-voxelization is the inverse of the voxelization operation. The de-voxelized cascade feature is input into the two-dimensional convolution module for the convolution operation to obtain the third feature, which provides flexibility.
Training on all training samples in the entire training set once is called one training round (i.e., one epoch); an epoch count of N indicates that N rounds of training have been performed on the training set. Thus, in one possible design, the training module 2506 is further configured to: repeat the above steps performed by the building module 2501, the first operation module 2502, the second operation module 2503, the cascade module 2504, the prediction module 2505, and the training module 2506 until the number of training rounds (epochs) on the first training set reaches a first preset number of rounds; and then take the initial training set as a new first training set and repeat the above steps performed by these modules until the number of training rounds on the new first training set reaches a second preset number of rounds.
In the above embodiment of the present application, a first training set is constructed from the initial training set for each epoch until a first preset number of epochs (i.e., a first preset number of rounds, for example, 15) is reached; training is then performed directly on the initial training set until a second preset number of epochs (i.e., a second preset number of rounds, for example, 5) is reached.
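A minimal sketch of this two-stage schedule follows. The helper callables (collect_targets, build_first_training_set, train_one_epoch) are assumed to exist elsewhere and are passed in as arguments; the round counts 15 and 5 follow the example values in the text rather than fixed design choices.

```python
def train(model, initial_set, collect_targets, build_first_training_set,
          train_one_epoch, first_preset_rounds=15, second_preset_rounds=5):
    """Two-stage training schedule sketched from the description above."""
    gt_bank = collect_targets(initial_set)
    for _ in range(first_preset_rounds):
        # Each epoch rebuilds the augmented first training set from the initial set.
        first_set = build_first_training_set(initial_set, gt_bank)
        train_one_epoch(model, first_set)
    for _ in range(second_preset_rounds):
        # Afterwards train directly on the (un-augmented) initial training set.
        train_one_epoch(model, initial_set)
```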
It should also be noted that the contents of information interaction, execution process, and the like between the modules/units in the training device 2500 may be specifically applied to various application scenarios in the foregoing corresponding method embodiments in the present application, and the specific contents may refer to the descriptions in the foregoing method embodiments in the present application, and are not described herein again.
Referring to fig. 26, fig. 26 is a schematic structural diagram of an execution device provided in the embodiment of the present application. The execution device 2600 may be embodied as a terminal-side device, an edge device (e.g., a virtual reality (VR) device, a mobile phone, a tablet, a laptop, a smart wearable device, etc.), or a wheeled mobile device (e.g., an autonomous vehicle, an assisted driving vehicle, a smart robot, etc.), which is not limited herein. In this embodiment, the modules described in the embodiment corresponding to fig. 24 may be deployed on the execution device 2600 to implement the functions of the execution device 2400 in the embodiment corresponding to fig. 24. Specifically, the execution device 2600 includes: a receiver 2601, a transmitter 2602, a processor 2603, and a memory 2604 (where the number of processors 2603 in the execution device 2600 may be one or more, with one processor taken as an example in fig. 26), and the processor 2603 may include an application processor 26031 and a communication processor 26032. In some embodiments of the application, the receiver 2601, the transmitter 2602, the processor 2603, and the memory 2604 may be connected by a bus or other means.
The memory 2604 may include a read-only memory and a random access memory, and provides instructions and data to the processor 2603. A portion of the memory 2604 may also include a non-volatile random access memory (NVRAM). The memory 2604 stores operating instructions executable by the processor, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
The processor 2603 controls the operation of the execution device 2600. In particular implementations, various components of the execution device 2600 are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. For clarity of illustration, the various buses are referred to in the figures as a bus system.
The methods disclosed in the embodiments of the present application may be applied to the processor 2603 or implemented by the processor 2603. The processor 2603 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 2603 or by instructions in the form of software. The processor 2603 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The processor 2603 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as a RAM, a flash memory, a ROM, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 2604, and the processor 2603 reads the information in the memory 2604 and completes the steps of the above methods in combination with its hardware.
The receiver 2601 may be used to receive input numeric or character information and to generate signal inputs related to setting and function control of the execution device 2600. The transmitter 2602 may be used to output numeric or character information through a first interface; the transmitter 2602 may also be used to send instructions to a disk group through the first interface to modify data in the disk group; the transmitter 2602 may also include a display device such as a display screen.
In one embodiment, the processor 2603 is configured to execute the steps of the target detection method corresponding to fig. 16. For example, an image and a laser point cloud at a certain position at a certain moment are first acquired by sensor devices (e.g., a camera, a laser radar, etc.) disposed on the execution device, and the two-dimensional information corresponding to the laser point cloud is then obtained according to the acquired image and laser point cloud. After the laser point cloud and its corresponding two-dimensional information are obtained, convolution operations are performed on them through a three-dimensional first sparse convolution module and a three-dimensional second sparse convolution module respectively, so as to obtain a first feature corresponding to the laser point cloud and a second feature corresponding to the two-dimensional information. The first feature and the second feature are then cascaded and input into a two-dimensional convolution module for a convolution operation to obtain a third feature; because the first feature and the second feature are fused, the third feature may be called a fusion feature. After the third feature is obtained, it is cascaded with the previously obtained first feature and second feature to obtain a fourth feature, which may also be called a combined feature. The fourth feature is then input into a classification regression module (for example, a classification regression head) so as to output the 3D target frame and the classification category to which the target object in the 3D target frame belongs; the target object may also be referred to as the target.
In this embodiment, in another case, the processor 2603 is further configured to execute various applications of the trained target detection model in various application scenarios in the embodiments corresponding to fig. 20 to fig. 23, which is described with reference to the above embodiments and is not described herein again.
Referring to fig. 27, fig. 27 is a schematic structural diagram of a training apparatus provided in the embodiment of the present application. For convenience of description, only the portions related to the embodiment of the present application are shown; for specific technical details not disclosed, please refer to the method part of the embodiments of the present application. The modules of the training device 2500 described in the embodiment corresponding to fig. 25 may be deployed on the training apparatus 2700 to implement the functions of the training device in the embodiment corresponding to fig. 25. Specifically, the training apparatus 2700 is implemented by one or more servers; the training apparatus 2700 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 2722 (e.g., one or more processors), a memory 2732, and one or more storage media 2730 (e.g., one or more mass storage devices) storing application programs 2742 or data 2744. The memory 2732 and the storage medium 2730 may be transitory or persistent storage. The program stored on the storage medium 2730 may include one or more modules (not shown), and each module may include a series of instruction operations on the training apparatus. Furthermore, the central processing unit 2722 may be configured to communicate with the storage medium 2730 and execute the series of instruction operations in the storage medium 2730 on the training apparatus 2700.
The training apparatus 2700 may also include one or more power supplies 2726, one or more wired or wireless network interfaces 2750, one or more input/output interfaces 2758, and/or one or more operating systems 2741, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
In this embodiment of the application, the steps executed by the training device in the embodiment corresponding to fig. 10 may be implemented based on the structure shown in fig. 27, and details are not repeated here.
Also provided in embodiments of the present application is a computer program product, which when run on a computer, causes the computer to perform the steps performed by the apparatus for detection in the method as described in the aforementioned illustrated embodiments, or causes the computer to perform the steps performed by the apparatus for training in the method as described in the aforementioned illustrated embodiments.
Also provided in an embodiment of the present application is a computer-readable storage medium, in which a program for signal processing is stored, and when the program is run on a computer, the computer is caused to execute the steps performed by the detection device in the method described in the foregoing illustrated embodiment, or the computer is caused to execute the steps performed by the training device in the method described in the foregoing illustrated embodiment.
The detection device and the training device provided by the embodiment of the application can be chips, and the chips comprise: a processing unit, which may be for example a processor, and a communication unit, which may be for example an input/output interface, a pin or a circuit, etc. The processing unit may execute computer-executable instructions stored by the storage unit to cause the training device to perform the optimization method of the neural network described in the illustrated embodiment, or a chip within the detection device to perform the image processing method or the audio processing method described in the illustrated embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, and the like, and the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a Random Access Memory (RAM), and the like.
Specifically, please refer to fig. 28, fig. 28 is a schematic structural diagram of a chip provided in the embodiment of the present application, where the chip may be represented as a neural network processor NPU 200, and the NPU 200 is mounted on a main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks. The core portion of the NPU is an arithmetic circuit 2003, and the controller 2004 controls the arithmetic circuit 2003 to extract matrix data in the memory and perform multiplication.
In some implementations, the arithmetic circuit 2003 internally includes a plurality of processing units (PEs). In some implementations, the arithmetic circuitry 2003 is a two-dimensional systolic array. The arithmetic circuit 2003 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2003 is a general purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 2002 and buffers it in each PE in the arithmetic circuit. The arithmetic circuit takes the matrix a data from the input memory 2001 and performs matrix arithmetic with the matrix B, and partial results or final results of the obtained matrix are stored in an accumulator (accumulator) 2008.
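As a toy software analogy only (not a description of the NPU hardware), the following snippet illustrates the accumulate-as-you-go idea: the weight matrix B is held fixed while tiles of A stream through, and partial products are summed in an accumulator until the full product C = A @ B is obtained. The tile width of 2 is an arbitrary choice.

```python
import numpy as np

A = np.random.rand(4, 8)   # input matrix
B = np.random.rand(8, 4)   # weight matrix
accumulator = np.zeros((4, 4))
for k0 in range(0, 8, 2):                                   # stream A in column tiles
    accumulator += A[:, k0:k0 + 2] @ B[k0:k0 + 2, :]        # add the partial result
assert np.allclose(accumulator, A @ B)                      # final result equals A @ B
```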
The unified memory 2006 is used to store input data and output data. The weight data is directly transferred to the weight memory 2002 through a direct memory access controller (DMAC) 2005, and the input data is also transferred into the unified memory 2006 through the DMAC.
The bus interface unit (BIU) 2010 is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer (IFB) 2009. Specifically, the bus interface unit 2010 is used by the instruction fetch buffer 2009 to obtain instructions from the external memory, and is further used by the storage unit access controller 2005 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 2006 or to transfer weight data to the weight memory 2002 or to transfer input data to the input memory 2001.
The vector calculation unit 2007 includes a plurality of operation processing units, and further processes the output of the arithmetic circuit when necessary, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for non-convolution/fully-connected layer network computation in the neural network, such as batch normalization, pixel-level summation, and up-sampling of feature planes.
In some implementations, the vector calculation unit 2007 can store the vector of processed outputs to the unified memory 2006. For example, the vector calculation unit 2007 may apply a linear function and/or a nonlinear function to the output of the arithmetic circuit 2003, such as linear interpolation of the feature planes extracted by the convolutional layers, and further such as a vector of accumulated values, to generate the activation values. In some implementations, the vector calculation unit 2007 generates normalized values, pixel-level summed values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuit 2003, e.g., for use in subsequent layers in a neural network.
An instruction fetch buffer 2009 connected to the controller 2004 for storing instructions used by the controller 2004;
the unified memory 2006, the input memory 2001, the weight memory 2002, and the instruction fetch memory 2009 are all On-Chip memories. The external memory is private to the NPU hardware architecture.
The operations of the layers in the first neural network, the operations of the layers in the second neural network, and the joint iterative training process of the two neural networks shown above may be performed by the arithmetic circuit 2003 or the vector calculation unit 2007.
Wherein any of the aforementioned processors may be a general purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits configured to control the execution of the programs of the method of the first aspect.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, and may be implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus necessary general-purpose hardware, and certainly can also be implemented by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components and the like. Generally, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can be various, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software program implementation is the preferable implementation in most cases. Based on such understanding, the technical solutions of the present application may essentially be embodied in the form of a software product, which is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to execute the methods described in the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are wholly or partially generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that can be stored by a computer, or a data storage device such as a training device or a data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.

Claims (31)

1. A method of 3D object detection, comprising:
obtaining two-dimensional information corresponding to the laser point cloud according to the obtained image and the laser point cloud;
performing convolution operation on the laser point cloud through a three-dimensional first sparse convolution module to obtain a first characteristic, and performing convolution operation on the two-dimensional information through a three-dimensional second sparse convolution module to obtain a second characteristic;
after the first characteristic and the second characteristic are cascaded, inputting the first characteristic and the second characteristic into a two-dimensional convolution module for convolution operation to obtain a third characteristic;
cascading the first feature, the second feature and the third feature to obtain a fourth feature;
and inputting the fourth features into a classification regression module to obtain a 3D target frame and a classification category to which the target object in the 3D target frame belongs.
2. The method of claim 1, wherein obtaining two-dimensional information corresponding to the laser point cloud from the obtained image and the laser point cloud comprises:
performing semantic segmentation on the obtained image through a semantic segmentation model to obtain semantic segmentation scores of all pixel points in the image, wherein the semantic segmentation scores are used for expressing the probability that all the pixel points belong to respective classification categories, and the semantic segmentation scores of all the pixel points in the image form a semantic segmentation graph;
and projecting the laser point cloud to the image to obtain target RGB information corresponding to each laser point in the laser point cloud, projecting the laser point cloud to the semantic segmentation map to obtain a target semantic segmentation score corresponding to each laser point, wherein the target RGB information and the target semantic segmentation score form the two-dimensional information.
3. The method of any of claims 1-2, wherein the performing, by a three-dimensional first sparse convolution module, a convolution operation on the laser point cloud comprises:
performing voxelization processing on the laser point cloud to obtain a voxelized laser point cloud;
and performing convolution operation on the voxelized laser point cloud through a three-dimensional first sparse convolution module.
4. The method according to any one of claims 1-3, wherein said performing a convolution operation on said two-dimensional information by a second sparse convolution module in three dimensions comprises:
performing voxelization processing on the two-dimensional information to obtain voxelized two-dimensional information;
and performing convolution operation on the voxelized two-dimensional information through a three-dimensional second sparse convolution module.
5. The method of any one of claims 1-2, wherein convolving the laser point cloud with a first sparse convolution module in three dimensions to obtain a first feature and convolving the two-dimensional information with a second sparse convolution module in three dimensions to obtain a second feature comprises:
cascading the laser point cloud and the two-dimensional information to obtain multi-mode information;
performing voxelization processing on the multi-modal information to obtain voxelized multi-modal information, wherein the voxelized multi-modal information comprises voxelized laser point cloud and voxelized two-dimensional information;
performing convolution operation on the voxelized laser point cloud through a three-dimensional first sparse convolution module to obtain a first characteristic, and performing convolution operation on the voxelized two-dimensional information through a three-dimensional second sparse convolution module to obtain a second characteristic.
6. The method according to any one of claims 4-5, wherein the concatenating the first feature and the second feature into a two-dimensional convolution module for convolution comprises:
cascading the first characteristic and the second characteristic to obtain a cascading characteristic;
performing de-voxelization processing on the cascade characteristic to obtain a de-voxelized cascade characteristic;
and inputting the de-voxelized cascade features into a two-dimensional convolution module for convolution operation.
7. A training method of a 3D object detection model, wherein the model comprises a three-dimensional first sparse convolution module, a three-dimensional second sparse convolution module, a two-dimensional convolution module and a classification regression module, and the method comprises the following steps:
constructing a first training set according to an initial training set, wherein initial training samples in the initial training set comprise initial laser point clouds and initial two-dimensional information corresponding to the initial laser point clouds, first training samples in the first training set comprise first laser point clouds and first two-dimensional information corresponding to the first laser point clouds, the first training samples are training samples obtained by extracting target objects from all the initial training samples in the initial training set and randomly copying the target objects into each of the initial training samples, the initial training sample is any one of the training samples in the initial training set, and the first training sample is any one of the training samples in the first training set;
performing convolution operation on the first laser point cloud through the first sparse convolution module to obtain a first characteristic, and performing convolution operation on the first two-dimensional information through the second sparse convolution module to obtain a second characteristic;
after the first characteristic and the second characteristic are cascaded, inputting the first characteristic and the second characteristic into the convolution module for convolution operation to obtain a third characteristic;
cascading the first feature, the second feature and the third feature to obtain a fourth feature;
inputting the fourth features into the classification regression module to obtain a 3D prediction target frame and a prediction classification category to which a target object in the 3D prediction target frame belongs;
and performing iterative training on the model by using a target loss function according to the 3D real target frame, the real classification category to which the target object in the 3D real target frame belongs, the 3D prediction target frame and the prediction classification category to which the target object in the 3D prediction target frame belongs.
8. The method of claim 7,
the initial two-dimensional information comprises initial RGB information corresponding to each initial laser point in the initial laser point cloud obtained by projecting the initial laser point cloud to an initial image, and initial semantic segmentation scores corresponding to each initial laser point obtained by projecting the initial laser point cloud to an initial semantic segmentation map, wherein the initial semantic segmentation scores are probabilities that each pixel point in the initial image obtained by performing semantic segmentation on the initial image through a semantic segmentation model belongs to each classification category, and the initial semantic segmentation scores of each pixel point in the initial image form the initial semantic segmentation map;
the first two-dimensional information comprises first RGB information corresponding to each first laser point in the first laser point cloud obtained by projecting the first laser point cloud to a first image and a first semantic segmentation score corresponding to each first laser point obtained by projecting the first laser point cloud to a first semantic segmentation map, the first semantic segmentation score is the probability that each pixel point in the first image obtained by performing semantic segmentation on the first image through the semantic segmentation model belongs to each classification category, and the first semantic segmentation score of each pixel point in the first image forms the first semantic segmentation map.
9. The method of any of claims 7-8, wherein the performing, by the first sparse convolution module, a convolution operation on the first laser point cloud comprises:
performing voxelization processing on the first laser point cloud to obtain a voxelized first laser point cloud;
performing, by the first sparse convolution module, a convolution operation on the voxelized first laser point cloud.
10. The method according to any of claims 7-9, wherein said performing, by said second sparse convolution module, a convolution operation on said first two-dimensional information comprises:
performing voxelization processing on the first two-dimensional information to obtain voxelized first two-dimensional information;
performing, by the second sparse convolution module, a convolution operation on the voxelized first two-dimensional information.
11. The method of any of claims 7-8, wherein the convolving the first laser point cloud with the first sparse convolution module to obtain a first feature and the convolving the first two-dimensional information with the second sparse convolution module to obtain a second feature comprises:
cascading the first laser point cloud and the first two-dimensional information to obtain multi-mode information;
performing voxelization processing on the multi-modal information to obtain voxelized multi-modal information, wherein the voxelized multi-modal information comprises voxelized first laser point cloud and voxelized first two-dimensional information;
performing convolution operation on the voxelized first laser point cloud through the first sparse convolution module to obtain a first characteristic, and performing convolution operation on the voxelized first two-dimensional information through the second sparse convolution module to obtain a second characteristic.
12. The method according to any one of claims 10-11, wherein said concatenating the first feature and the second feature before inputting them to the convolution module for convolution comprises:
cascading the first characteristic and the second characteristic to obtain a cascading characteristic;
performing de-voxelization processing on the cascade characteristic to obtain a de-voxelized cascade characteristic;
and inputting the de-voxelized cascade features into the convolution module for convolution operation.
13. The method according to any one of claims 7-12, further comprising:
repeatedly executing the steps until the training round (epoch) of the first training set reaches a first preset round;
and taking the initial training set as a new first training set, and repeatedly executing the steps until the training round of the new first training set reaches a second preset round.
14. An execution device, the device comprising:
the acquisition module is used for acquiring two-dimensional information corresponding to the laser point cloud according to the acquired image and the laser point cloud;
the first operation module is used for performing convolution operation on the laser point cloud through a three-dimensional first sparse convolution module to obtain a first characteristic, and performing convolution operation on the two-dimensional information through a three-dimensional second sparse convolution module to obtain a second characteristic;
the second operation module is used for cascading the first characteristic and the second characteristic and then inputting the cascaded first characteristic and second characteristic into the two-dimensional convolution module for convolution operation to obtain a third characteristic;
a cascade module, configured to cascade the first feature, the second feature, and the third feature to obtain a fourth feature;
and the detection module is used for inputting the fourth features into a classification regression module to obtain a 3D target frame and a classification category to which the target object in the 3D target frame belongs.
15. The device according to claim 14, wherein the obtaining module is specifically configured to:
performing semantic segmentation on the obtained image through a semantic segmentation model to obtain semantic segmentation scores of all pixel points in the image, wherein the semantic segmentation scores are used for expressing the probability that all the pixel points belong to respective classification categories, and the semantic segmentation scores of all the pixel points in the image form a semantic segmentation graph;
and projecting the laser point cloud to the image to obtain target RGB information corresponding to each laser point in the laser point cloud, projecting the laser point cloud to the semantic segmentation map to obtain a target semantic segmentation score corresponding to each laser point, wherein the target RGB information and the target semantic segmentation score form the two-dimensional information.
16. The device according to any one of claims 14 to 15, characterized in that said first operating module is specifically configured to:
performing voxelization processing on the laser point cloud to obtain a voxelized laser point cloud;
and performing convolution operation on the voxelized laser point cloud through a three-dimensional first sparse convolution module.
17. The device according to any one of claims 14 to 16, wherein the first operating module is further configured to:
performing voxelization processing on the two-dimensional information to obtain voxelized two-dimensional information;
and performing convolution operation on the voxelized two-dimensional information through a three-dimensional second sparse convolution module.
18. The device according to any one of claims 14 to 15, wherein the first operating module is further configured to:
cascading the laser point cloud and the two-dimensional information to obtain multi-mode information;
performing voxelization processing on the multi-modal information to obtain voxelized multi-modal information, wherein the voxelized multi-modal information comprises voxelized laser point cloud and voxelized two-dimensional information;
performing convolution operation on the voxelized laser point cloud through a three-dimensional first sparse convolution module to obtain a first characteristic, and performing convolution operation on the voxelized two-dimensional information through a three-dimensional second sparse convolution module to obtain a second characteristic.
19. The device according to any one of claims 17 to 18, wherein the second operation module is specifically configured to:
cascading the first feature and the second feature to obtain a cascaded feature;
performing de-voxelization processing on the cascaded feature to obtain a de-voxelized cascaded feature;
and inputting the de-voxelized cascaded feature into the two-dimensional convolution module for convolution operation.
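One plausible reading of the de-voxelization step is scattering the sparse cascaded voxel features onto a dense bird's-eye-view grid before applying the two-dimensional convolution module. The sketch below assumes placeholder grid dimensions and a (z, y, x) index order; it is not taken from the patent.

# Illustrative sketch: scatter sparse voxel features to a dense BEV grid, then apply a 2D convolution.
import torch
import torch.nn as nn

def to_bev(coords, feats, grid=(16, 176, 200)):
    """coords: (M, 3) integer voxel indices in assumed (z, y, x) order; feats: (M, C)."""
    d, h, w = grid
    c = feats.shape[1]
    dense = torch.zeros(c, d * h * w)                              # dense (channel, voxel) buffer
    flat = coords[:, 0] * h * w + coords[:, 1] * w + coords[:, 2]  # linear cell index per voxel
    dense[:, flat] = feats.t()                                     # place each voxel feature at its cell
    return dense.view(c * d, h, w).unsqueeze(0)                    # stack depth into channels -> BEV map

coords = torch.tensor([[0, 1, 2], [3, 10, 20]])                    # two occupied voxels
feats = torch.randn(2, 8)                                          # cascaded features of those voxels
bev = to_bev(coords, feats)                                        # shape (1, 8 * 16, 176, 200)
f3 = nn.Conv2d(8 * 16, 64, kernel_size=3, padding=1)(bev)          # third (fusion) feature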
20. A training device, characterized in that the device comprises:
the construction module is used for constructing a first training set according to an initial training set, wherein an initial training sample in the initial training set comprises an initial laser point cloud and initial two-dimensional information corresponding to the initial laser point cloud, a first training sample in the first training set comprises a first laser point cloud and first two-dimensional information corresponding to the first laser point cloud, the first training sample is a training sample obtained by extracting target objects from all initial training samples in the initial training set and randomly copying the extracted target objects into the initial training samples, the initial training sample is any one training sample in the initial training set, and the first training sample is any one training sample in the first training set;
the first operation module is used for performing convolution operation on the first laser point cloud through a three-dimensional first sparse convolution module to obtain a first feature, and performing convolution operation on the first two-dimensional information through a three-dimensional second sparse convolution module to obtain a second feature;
the second operation module is used for cascading the first feature and the second feature and inputting the cascaded features into a two-dimensional convolution module for convolution operation to obtain a third feature;
a cascade module, configured to cascade the first feature, the second feature, and the third feature to obtain a fourth feature;
the prediction module is used for inputting the fourth feature into a classification regression module to obtain a 3D prediction target frame and a prediction classification category to which a target object in the 3D prediction target frame belongs;
the training module is used for performing iterative training on a model by using a target loss function according to a 3D real target frame, a real classification category to which a target object in the 3D real target frame belongs, the 3D prediction target frame and a prediction classification category to which the target object in the 3D prediction target frame belongs, wherein the model comprises the first sparse convolution module, the second sparse convolution module, the convolution module and the classification regression module.
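The construction of the first training set resembles copy-paste (ground-truth sampling) augmentation. A simplified sketch is given below; it assumes each sample carries per-object point masks, that the first columns of each point row are planar coordinates, and that object points keep their attached two-dimensional-information channels. Collision checks and other details of the claimed scheme are omitted.

# Illustrative sketch of copy-paste style augmentation over the initial training set.
import random
import numpy as np

def build_object_bank(samples):
    """samples: list of dicts with 'points' (N, C) arrays and 'object_masks' (list of boolean length-N arrays)."""
    bank = []
    for s in samples:
        for mask in s["object_masks"]:
            bank.append(s["points"][mask].copy())          # per-object points, 2D-info channels attached
    return bank

def paste_objects(sample_points, bank, max_paste=5):
    """Randomly copy objects from the bank into one sample's point cloud."""
    pasted = [sample_points]
    for obj in random.sample(bank, k=min(max_paste, len(bank))):
        shift = np.zeros(obj.shape[1], dtype=obj.dtype)
        shift[:2] = np.random.uniform(-10.0, 10.0, size=2)  # random planar translation of the copied object
        pasted.append(obj + shift)
    return np.concatenate(pasted, axis=0)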
21. The device according to claim 20, wherein
the initial two-dimensional information comprises initial RGB information corresponding to each initial laser point in the initial laser point cloud, obtained by projecting the initial laser point cloud onto an initial image, and an initial semantic segmentation score corresponding to each initial laser point, obtained by projecting the initial laser point cloud onto an initial semantic segmentation map, wherein the initial semantic segmentation score is the probability, obtained by performing semantic segmentation on the initial image through a semantic segmentation model, that each pixel point in the initial image belongs to each classification category, and the initial semantic segmentation scores of all pixel points in the initial image form the initial semantic segmentation map;
the first two-dimensional information comprises first RGB information corresponding to each first laser point in the first laser point cloud, obtained by projecting the first laser point cloud onto a first image, and a first semantic segmentation score corresponding to each first laser point, obtained by projecting the first laser point cloud onto a first semantic segmentation map, wherein the first semantic segmentation score is the probability, obtained by performing semantic segmentation on the first image through the semantic segmentation model, that each pixel point in the first image belongs to each classification category, and the first semantic segmentation scores of all pixel points in the first image form the first semantic segmentation map.
22. The device according to any one of claims 20 to 21, wherein the first operation module is specifically configured to:
performing voxelization processing on the first laser point cloud to obtain a voxelized first laser point cloud;
performing, by the first sparse convolution module, a convolution operation on the voxelized first laser point cloud.
23. The device according to any one of claims 20 to 22, wherein the first operation module is further configured to:
performing voxelization processing on the first two-dimensional information to obtain voxelized first two-dimensional information;
performing, by the second sparse convolution module, a convolution operation on the voxelized first two-dimensional information.
24. The device according to any one of claims 20 to 21, wherein the first operation module is further configured to:
cascading the first laser point cloud and the first two-dimensional information to obtain multi-modal information;
performing voxelization processing on the multi-modal information to obtain voxelized multi-modal information, wherein the voxelized multi-modal information comprises voxelized first laser point cloud and voxelized first two-dimensional information;
performing convolution operation on the voxelized first laser point cloud through the first sparse convolution module to obtain the first feature, and performing convolution operation on the voxelized first two-dimensional information through the second sparse convolution module to obtain the second feature.
25. The device according to any one of claims 23 to 24, wherein the second operation module is specifically configured to:
cascading the first feature and the second feature to obtain a cascaded feature;
performing de-voxelization processing on the cascaded feature to obtain a de-voxelized cascaded feature;
and inputting the de-voxelized cascaded feature into the convolution module for convolution operation.
26. The device according to any one of claims 20 to 25, wherein the training module is further configured to:
repeating the steps performed by the construction module, the first operation module, the second operation module, the cascade module, the prediction module and the training module until the number of training rounds (epochs) on the first training set reaches a first preset number of rounds;
and taking the initial training set as a new first training set, and repeating the steps performed by the construction module, the first operation module, the second operation module, the cascade module, the prediction module and the training module until the number of training rounds on the new first training set reaches a second preset number of rounds.
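A schematic reading of this two-phase schedule is sketched below: a first preset number of epochs on the augmented first training set, then a second preset number of epochs on the un-augmented initial set. train_one_epoch() is a hypothetical helper and the round counts are placeholders, not values from the patent.

# Illustrative sketch of the two-phase training schedule; train_one_epoch() is hypothetical.
def train(model, first_training_set, initial_training_set,
          first_preset_rounds=60, second_preset_rounds=20):
    for _ in range(first_preset_rounds):
        train_one_epoch(model, first_training_set)      # phase 1: copy-paste-augmented samples
    for _ in range(second_preset_rounds):
        train_one_epoch(model, initial_training_set)    # phase 2: original samples only
    return model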
27. An execution device, comprising a processor and a memory, the processor being coupled to the memory, wherein
the memory is used for storing a program;
and the processor is configured to execute the program in the memory to cause the execution device to perform the method of any one of claims 1 to 6.
28. A training device, comprising a processor and a memory, the processor being coupled to the memory, wherein
the memory is used for storing a program;
and the processor is configured to execute the program in the memory to cause the training device to perform the method of any one of claims 7 to 13.
29. A computer-readable storage medium comprising a program which, when run on a computer, causes the computer to perform the method of any one of claims 1-6 or causes the computer to perform the method of any one of claims 7-13.
30. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1-6 or cause the computer to perform the method of any one of claims 7-13.
31. A chip system, comprising a processor and a communication interface, the communication interface being coupled to the processor, the processor being configured to execute a computer program or instructions to cause the method of any of claims 1-6 to be performed or to cause the method of any of claims 7-13 to be performed.
CN202011057005.XA 2020-09-29 2020-09-29 3D target detection method and device Pending CN114332845A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011057005.XA CN114332845A (en) 2020-09-29 2020-09-29 3D target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011057005.XA CN114332845A (en) 2020-09-29 2020-09-29 3D target detection method and device

Publications (1)

Publication Number Publication Date
CN114332845A (en) 2022-04-12

Family

ID=81010989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011057005.XA Pending CN114332845A (en) 2020-09-29 2020-09-29 3D target detection method and device

Country Status (1)

Country Link
CN (1) CN114332845A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115183782A (en) * 2022-09-13 2022-10-14 毫末智行科技有限公司 Multi-modal sensor fusion method and device based on joint space loss
CN115183782B (en) * 2022-09-13 2022-12-09 毫末智行科技有限公司 Multi-modal sensor fusion method and device based on joint space loss
CN116758006A (en) * 2023-05-18 2023-09-15 广州广检建设工程检测中心有限公司 Scaffold quality detection method and device
CN116758006B (en) * 2023-05-18 2024-02-06 广州广检建设工程检测中心有限公司 Scaffold quality detection method and device

Similar Documents

Publication Publication Date Title
US20230127115A1 (en) Three-Dimensional Object Detection
US11688181B2 (en) Sensor fusion for autonomous machine applications using machine learning
US11593950B2 (en) System and method for movement detection
US11494937B2 (en) Multi-task multi-sensor fusion for three-dimensional object detection
US11482014B2 (en) 3D auto-labeling with structural and physical constraints
US11768292B2 (en) Three-dimensional object detection
WO2022104774A1 (en) Target detection method and apparatus
Jebamikyous et al. Autonomous vehicles perception (avp) using deep learning: Modeling, assessment, and challenges
WO2021218693A1 (en) Image processing method, network training method, and related device
US20230135088A1 (en) 3d surface reconstruction with point cloud densification using deep neural networks for autonomous systems and applications
CN114882457A (en) Model training method, lane line detection method and equipment
CN115273002A (en) Image processing method, device, storage medium and computer program product
CN114332845A (en) 3D target detection method and device
Shi et al. Grid-centric traffic scenario perception for autonomous driving: A comprehensive review
CN116048060A (en) 3D surface structure estimation based on real world data using neural networks for autonomous systems and applications
CN116051780A (en) 3D surface reconstruction using artificial intelligence with point cloud densification for autonomous systems and applications
US20230136860A1 (en) 3d surface structure estimation using neural networks for autonomous systems and applications
CN116868239A (en) Static occupancy tracking
US11544899B2 (en) System and method for generating terrain maps
CN115214708A (en) Vehicle intention prediction method and related device thereof
CN113066124A (en) Neural network training method and related equipment
CN114549610A (en) Point cloud data processing method and related device
WO2024093321A1 (en) Vehicle position acquiring method, model training method, and related device
Danylova et al. AUTOMATED NAVIGATION FOR UNMANNED GROUND VEHICLES IN LOGISTICS.
US20230398692A1 (en) System and method for unknown object manipulation from pure synthetic stereo data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination