CN115019034A - Detection model training method and device and object detection method and device

Info

Publication number
CN115019034A
Authority
CN
China
Prior art keywords
feature
sample
image
point cloud
grid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210615907.3A
Other languages
Chinese (zh)
Inventor
张达
苗振伟
刘挺
占新
卿泉
袁婷婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Cainiao Chuancheng Network Technology Co.,Ltd.
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202210615907.3A
Publication of CN115019034A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion of extracted features
    • G06V 10/82 Recognition or understanding using neural networks
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Abstract

The embodiment of the application provides a detection model training method and a detection model training device. The method comprises: acquiring at least one group of training data, wherein the training data comprises a plurality of frames of sample point clouds, a plurality of frames of sample images, and sample object detection results corresponding to the sample point clouds and the sample images; processing the multi-frame sample point clouds and the multi-frame sample images according to a feature extraction network in a detection model to obtain feature information corresponding to the multi-frame sample point clouds and the multi-frame sample images; processing the feature information according to a detection network in the detection model to obtain a first object detection result output by the detection model; and updating the model parameters of the detection model according to the first object detection result and the sample object detection result. The method provided by the application can effectively improve the accuracy of object detection.

Description

Detection model training method and device and object detection method and device
Technical Field
The embodiment of the application relates to an image processing technology, in particular to a detection model training method and device and an object detection method and device.
Background
With the continuous development of image processing technology, environmental awareness has become an important application, for example, an object detection module in environmental awareness can effectively detect an object existing in an environment.
At present, in the prior art, when performing target detection, a single sensor is usually used to acquire environmental data at the current time, and target detection is then realized based on that data. That is, in the prior art, target detection is usually realized based on a single frame of data acquired by a single sensor.
However, a single frame of environmental data acquired by a single independent sensor often lacks comprehensiveness, which may result in lower accuracy of target detection.
Disclosure of Invention
The embodiment of the application provides a detection model training method and device and an object detection method and device, and aims to solve the problem of low accuracy of target detection.
In a first aspect, an embodiment of the present application provides a detection model training method, including:
acquiring at least one group of training data, wherein the training data comprises a plurality of frames of sample point clouds, a plurality of frames of sample images, and sample object detection results corresponding to the sample point clouds and the sample images;
processing the multi-frame sample point cloud and the multi-frame sample image according to a feature extraction network in a detection model to obtain feature information corresponding to the multi-frame sample point cloud and the multi-frame sample image;
processing the feature information according to a detection network in the detection model to obtain a first object detection result output by the detection model;
and updating the model parameters of the detection model according to the first object detection result and the sample object detection result.
In a second aspect, an embodiment of the present application provides an object detection method, including:
acquiring a first point cloud and a first image acquired at a first moment;
acquiring multiple frames of second point clouds and multiple frames of second images acquired before the first moment;
processing the first point cloud, the first image, the plurality of frames of second point clouds and the plurality of frames of second images according to a detection model to obtain object detection results corresponding to the first point cloud and the first image,
wherein the detection model is a model trained according to the method of the first aspect.
In a third aspect, an embodiment of the present application provides a detection model training apparatus, including:
the acquisition module is used for acquiring at least one group of training data, wherein the training data comprises a plurality of frames of sample point clouds, a plurality of frames of sample images, and sample object detection results corresponding to the sample point clouds and the sample images;
the first processing module is used for processing the multi-frame sample point cloud and the multi-frame sample image according to a feature extraction network in a detection model to obtain feature information corresponding to the multi-frame sample point cloud and the multi-frame sample image;
the second processing module is used for processing the feature information according to a detection network in the detection model to obtain a first object detection result output by the detection model;
and the updating module is used for updating the model parameters of the detection model according to the first object detection result and the sample object detection result.
In a fourth aspect, an embodiment of the present application provides an object detection apparatus, including:
the first acquisition module is used for acquiring a first point cloud and a first image acquired at a first moment;
the second acquisition module is used for acquiring a plurality of frames of second point clouds and a plurality of frames of second images acquired before the first moment;
a processing module for processing the first point cloud, the first image, the plurality of frames of second point clouds and the plurality of frames of second images according to a detection model to obtain object detection results corresponding to the first point cloud and the first image,
wherein the detection model is a model trained according to the method of the first aspect.
In a fifth aspect, an embodiment of the present application provides an electronic device, including:
a memory for storing a program;
a processor for executing the program stored by the memory, the processor being configured to perform the method of the first or second aspect when the program is executed.
In a sixth aspect, embodiments of the present application provide a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the method according to the first aspect or the second aspect.
In a seventh aspect, the present application provides a computer program product, which includes a computer program, where the computer program is executed by a processor to perform the method according to the first aspect or the second aspect.
The embodiment of the application provides a detection model training method and a detection model training device. The method comprises: acquiring at least one group of training data, wherein the training data comprises a plurality of frames of sample point clouds, a plurality of frames of sample images, and sample object detection results corresponding to the sample point clouds and the sample images; processing the multi-frame sample point clouds and the multi-frame sample images according to the feature extraction network in the detection model to obtain the feature information corresponding to the multi-frame sample point clouds and the multi-frame sample images; processing the feature information according to the detection network in the detection model to obtain a first object detection result output by the detection model; and updating the model parameters of the detection model according to the first object detection result and the sample object detection result. At least one group of training data is acquired, where any group comprises a plurality of frames of sample point clouds, a plurality of frames of sample images, and the corresponding sample object detection results, and the detection model is then trained according to the training data. Specifically, the first object detection result is obtained by processing the training data sequentially with the feature extraction network and the detection network in the detection model, and the model parameters of the detection model are then updated according to the first object detection result and the sample object detection result in the training data. The detection model can thus be effectively trained on multi-frame point cloud data and multi-frame image data, which ensures that the trained detection model can process multi-frame point clouds and multi-frame images to realize object detection. Because multi-frame multi-modal data is used as support, the accuracy of object detection can be effectively improved.
An embodiment of the present application provides an object detection method and an object detection device. The method comprises: acquiring a first point cloud and a first image acquired at a first moment; acquiring multiple frames of second point clouds and multiple frames of second images acquired before the first moment; and processing the first point cloud, the first image, the multiple frames of second point clouds and the multiple frames of second images according to the detection model to obtain object detection results corresponding to the first point cloud and the first image. The detection model obtained through the above training processes multiple frames of images and multiple frames of point clouds, where the multiple frames of images comprise the first image acquired at the first moment and the multiple frames of second images acquired before the first moment, and the multiple frames of point clouds comprise the first point cloud acquired at the first moment and the multiple frames of second point clouds acquired before the first moment, so as to output the object detection result. Because the object detection result is determined according to multi-frame multi-modal environment data, the comprehensiveness and richness of the data on which the output object detection result depends can be effectively ensured, and the accuracy and effectiveness of object detection can be effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to these drawings without inventive exercise.
Fig. 1 is a schematic view of a target detection scenario provided in an embodiment of the present application;
FIG. 2 is a flowchart of a training method for a detection model according to an embodiment of the present disclosure;
fig. 3 is a second flowchart of a detection model training method according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a detection model provided in an embodiment of the present application;
fig. 5 is a schematic diagram illustrating a correspondence relationship between a multi-frame sample point cloud and a multi-frame sample image provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of feature points included in a grid provided by an embodiment of the present application;
fig. 7 is a schematic diagram illustrating implementation of region partition of a feature map provided in an embodiment of the present application;
fig. 8 is a schematic diagram illustrating an implementation of determining a set of areas according to an embodiment of the present application;
fig. 9 is a schematic diagram illustrating an implementation of a down-sampling process according to an embodiment of the present application;
fig. 10 is a third flowchart of a detection model training method provided in an embodiment of the present application;
fig. 11 is a flowchart of an object detection method provided by an embodiment of the present disclosure;
fig. 12 is a schematic flowchart of a detection model training method and an object detection method according to an embodiment of the present disclosure;
fig. 13 is a schematic structural diagram of a detection model training apparatus according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application;
fig. 15 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to better understand the technical solution of the present application, the related art related to the present application will be further described in detail below.
In fields such as automatic driving and robotics, environment perception is an important component of the algorithm system, and 3D target detection is a core algorithm module in the environment perception system. Taking automatic driving as an example, 3D object detection can detect in real time the dynamic objects and static obstacles existing around an autonomous vehicle, where the dynamic objects are, for example, people and vehicles, and the static obstacles are, for example, guideboards and road piles. By detecting the objects existing around the autonomous vehicle, safe, reasonable and reliable route planning and prediction can be provided for the vehicle. The target detection described in the present application may also be referred to as object detection; the two terms have the same meaning.
Thus, high-precision 3D object detection is a cornerstone of environment perception for automatic driving. An autonomous vehicle includes a number of sensors to obtain information about the surrounding environment. For example, the sensors may include a laser radar, an image sensor, an ultrasonic sensor, a millimeter wave radar, and the like. Among them, laser radars and image sensors are widely used in automatic driving systems.
For example, 3D target detection algorithms based on a single frame of laser point cloud exist in the prior art. Such algorithms locate, classify and recognize objects in three-dimensional space using point cloud information captured by a multi-line laser radar, and are mainly divided into point-based methods and grid-based methods. A point-based method takes the original point cloud as input, extracts characterization information of the point cloud using a typical point cloud feature extraction network, and then regresses the category and position of each object at the point level. A grid-based method first projects the point cloud onto a corresponding grid coordinate system, such as a bird's-eye-view 2D plane or 3D voxels, and then processes the rasterized point cloud information using a 2D target detection algorithm such as the Faster Region-based Convolutional Neural Network (Faster-RCNN), the Single Shot MultiBox Detector (SSD) or YOLO, or a 3D sparse convolutional network, to generate the final detection result. Compared with point-based methods, grid-based methods require fewer computing resources and achieve higher overall detection precision, and are therefore widely applied in current automatic driving systems. However, a detection method based on a single frame of laser point cloud relies only on the point cloud information at the current moment, and a large amount of useful historical information is lost, so the detection accuracy in complex urban environments is limited.
And, there is a 3D target detection algorithm based on single frame image information in the prior art, which is not described herein again.
That is, in the prior art, target detection may be performed based on a single frame of laser point cloud, or based on a single frame of image information. However, while the laser radar has the characteristics of all-weather operation, high ranging precision and rich three-dimensional information, it lacks important semantic information; and while the image sensor provides rich color and texture information and complete semantic information, it lacks important depth information. Therefore, performing target detection using only the single-frame environmental information acquired by a single sensor suffers from a lack of data comprehensiveness, and the target detection accuracy is low.
Further, because the accuracy of target detection is low, problems such as missed detections, false detections and inaccurate estimation can occur. In an automatic driving scenario, these can in turn expose the autonomous vehicle to risks such as unreasonable deceleration and sudden braking, and can cause serious potential safety hazards.
Aiming at the problems in the prior art, the application provides the following technical conception: because the information collected by a single sensor lacks comprehensiveness, the information collected by a plurality of sensors can be considered to be subjected to fusion processing so as to enhance comprehensiveness of the information. For example, image information acquired by an image sensor and point cloud data acquired by a laser radar may be fused for target detection. The method for fusing the image and the point cloud can solve the problem of multi-sensor information complementation to a great extent, but the point cloud image information can be confused due to unavoidable calibration errors between the image sensor and the laser radar, so that the target detection effect is limited. Meanwhile, the requirement on the computing capability of the sensing system is high due to the complexity of the overall design of information fusion to realize target detection.
Meanwhile, because the information provided by the environment data of a single frame lacks richness, the target detection can be carried out by adopting the environment data of multiple frames. However, if the target detection is performed by using only the multi-frame data collected by the single sensor, the target detection accuracy is still low due to the problem that the data collected by the single sensor lacks comprehensiveness.
Therefore, in combination with the above considerations, the present application proposes the idea of performing target detection by using multi-frame multi-modal data. Multi-frame means performing target detection using both the data acquired at the current time and data acquired at historical times, where the historical data can compensate for insufficient observation at the current time, so as to improve the detection accuracy. Multi-modal means performing target detection using data acquired by multiple sensors; for example, image data acquired by an image sensor and point cloud data acquired by a laser radar may be fused for target detection, so that the problem that data acquired by a single sensor lacks comprehensiveness can be effectively avoided, and the accuracy of target detection can be effectively improved.
For example, fig. 1 may be combined to understand, and fig. 1 is a schematic view of a target detection scenario provided in this embodiment of the present application.
As shown in fig. 1, when performing target detection, multiple frames of images and multiple frames of point clouds, such as image 1, image 2, image 3, …, point cloud 1, point cloud 2, point cloud 3, … shown in fig. 1, may be acquired. And then, object detection can be carried out according to the multi-frame image and the multi-frame point cloud, so that a detection result is obtained.
Based on the above description, the method provided by the present application is described in detail below with reference to specific embodiments.
It can be understood that the detection model training method and the object detection method provided in the present application include two parts, one part is the detection model training method, and this part is the training process for the detection model; the other part is an object detection method, and the part is directed to the application of a trained detection model. The following description will be made for each of these two parts.
Before describing the specific implementation of the method, it should be further noted that the execution main body in each embodiment of the present application may be a device with a data processing function, such as a local server, a cloud server, a processor, a chip, and the like. The specific execution main body can be selected and set according to actual requirements, which is not limited in this embodiment, and all devices having a data processing function can be used as the execution main body in this embodiment.
The following first describes a part of the training method of the detection model, that is, the training process of the detection model. Fig. 2 is a flowchart of a detection model training method provided in the embodiment of the present application.
As shown in fig. 2, the method includes:
s201, obtaining at least one group of training data, wherein the training data comprises a plurality of frames of sample point clouds, a plurality of frames of sample images, the sample point clouds and sample object detection results corresponding to the sample images.
In this embodiment, in order to perform model training on the detection model, at least one set of training data may be acquired first, for example. The training data of any group may include a plurality of frames of sample point clouds, a plurality of frames of sample images, and sample object detection results corresponding to the plurality of frames of sample point clouds and the plurality of frames of sample images.
It will be appreciated that the sample object detection results in the training data may be, for example, manually labeled or may also be machine labeled, but for the case of machine labeling, the premise is that the labeling correctness is guaranteed, that is, the sample object detection results in the training data are guaranteed to be accurate as a reference for training.
In a possible implementation manner, for any group of training data, the multi-frame sample point cloud may be, for example, a multi-frame point cloud acquired by a laser radar on the same device at different times. And the multi-frame sample image may be, for example, a multi-frame image acquired by an image sensor on the same device at different times. And, there may be a plurality of image sensors therein.
Furthermore, for example, the multi-frame sample point clouds may include a sample point cloud acquired at a first time and multi-frame sample point clouds acquired before the first time. Likewise, the multi-frame sample images may include a sample image acquired at the first time and multi-frame sample images acquired before the first time. The sample object detection result in the set of training data may be the detection result corresponding to the sample point cloud and the sample image acquired at the first time.
It can be understood that the first time may be a time at which object detection is currently required, and then it may be understood that, for an image sensor and a laser point cloud of the same device, a sample image and a sample point cloud acquired at the first time may be acquired, a multi-frame sample point cloud and a multi-frame sample image captured at a historical time (before the first time) may be acquired, and an object detection result corresponding to the sample image and the sample point cloud acquired at the first time may be used as a sample object detection result, so as to obtain a set of training data.
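For illustration only, the following sketch (not part of the patent; the data layout and all names are assumptions) shows one way a group of training data as described above could be assembled: the point cloud and image acquired at the first time, several history frames acquired before it, and the sample object detection result labeled for the first time.

```python
# Hypothetical container for one group of training data; field names are assumptions.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class TrainingGroup:
    point_clouds: List[np.ndarray]   # index 0 = sample point cloud at the first time, rest = history frames
    images: List[np.ndarray]         # aligned with point_clouds by acquisition time
    gt_boxes: np.ndarray             # sample object detection results for the first time (x, y, z, l, w, h, yaw)
    gt_labels: np.ndarray            # class index per labeled object

def build_group(current, history, labels):
    """current / history: (point_cloud, image) pairs; labels: (boxes, classes) for the first time."""
    frames = [current] + list(history)
    return TrainingGroup(
        point_clouds=[pc for pc, _ in frames],
        images=[img for _, img in frames],
        gt_boxes=labels[0],
        gt_labels=labels[1],
    )
```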
S202, processing the multi-frame sample point cloud and the multi-frame sample image according to the feature extraction network in the detection model to obtain feature information corresponding to the multi-frame sample point cloud and the multi-frame sample image.
After the training data is obtained, the detection model can be trained according to the training data, and the detection model in the embodiment can process multiple frames of point clouds and multiple frames of images, so that object detection results corresponding to the point clouds and the images are output.
In a possible implementation manner, the detection model in this embodiment may include a feature extraction network, where the feature extraction network may process a plurality of frame sample point clouds and a plurality of frame sample images, so as to obtain feature information corresponding to the plurality of frame sample point clouds and the plurality of frame sample images.
It should be noted that what is output after the feature extraction network processing is feature information that jointly corresponds to the multi-frame sample point clouds and the multi-frame sample images, rather than separate feature information for the point clouds and separate feature information for the images.
S203, processing the feature information according to the detection network in the detection model to obtain a first object detection result output by the detection model.
The detection model further comprises a detection network. After the feature extraction network outputs the feature information corresponding to the multi-frame sample point clouds and the multi-frame sample images, the detection network in the detection model can process the feature information so as to obtain a first object detection result output by the detection model.
In a possible implementation manner, the first object detection result may include, for the plurality of frames of sample images and sample point clouds, the position and classification information of each object in the sample point cloud and the sample image acquired at the first time.
The detection network processes the extracted feature information so as to output the position and classification information of each object in the sample point cloud and the sample image acquired at the first moment. In the actual implementation process, the specific structure of the detection network can be selected and set according to actual requirements; any network structure capable of realizing the target detection function may be used.
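The patent leaves the concrete detection network open. As one hedged illustration, a single-stage convolutional head over the fused bird's-eye-view feature information could output per-cell classification scores and box parameters; the PyTorch structure below is only one plausible instantiation, not the network prescribed by the application.

```python
# Minimal stand-in detection head: per-cell classification and box regression on BEV features.
import torch.nn as nn

class SimpleBEVHead(nn.Module):
    def __init__(self, in_channels, num_classes, box_dim=7):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
        )
        self.cls_head = nn.Conv2d(in_channels, num_classes, 1)  # per-cell classification scores
        self.reg_head = nn.Conv2d(in_channels, box_dim, 1)      # per-cell box (x, y, z, l, w, h, yaw)

    def forward(self, bev_features):                 # bev_features: (B, C, H, W) fused feature information
        x = self.shared(bev_features)
        return self.cls_head(x), self.reg_head(x)    # together they form the first object detection result
```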
And S204, updating the model parameters of the detection model according to the first object detection result and the sample object detection result.
After the detection model outputs the first object detection result, the model parameters of the detection model can be updated according to the first object detection result and the sample object detection result in the training data, so that the training of the detection model is realized.
It can be understood that, when the model parameters of the detection model are updated, the first object detection result is predicted by the detection model, whereas the sample object detection result is labeled in advance, so the correctness of the sample object detection result can be ensured.
Therefore, in a possible implementation manner, for example, the first object detection result and the sample object detection result may be processed by using a preset loss function, so as to determine a loss function value, wherein the specific form of the preset loss function may be selected and set according to actual requirements, as long as the loss function value can reflect the difference between the first object detection result and the sample object detection result.
It is understood that the objective of model optimization in this embodiment is to make the first object detection result output by the detection model as close as possible to the sample object detection result in the training data. Therefore, after the loss function value is determined, the model parameters of the detection model can be updated according to the loss function value, so that the detection model is optimized according to the loss function value, the gap between the first object detection result output by the detection model and the sample object detection result is reduced, and the correctness of the first object detection result output by the detection model is effectively ensured.
In this embodiment, a plurality of groups of training data exist, and when the detection model is trained and optimized, the same training process is executed for each group of training data, so that multiple rounds of training of the detection model are realized. In a possible implementation manner, when it is determined that the number of training rounds of the detection model reaches a preset number of rounds, or when it is determined that the detection accuracy of the detection model reaches a preset accuracy, it may be determined that the training of the detection model is finished, so as to obtain the trained detection model. The trained detection model can subsequently be used for object detection on point cloud data and image data.
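A minimal sketch of the outer training procedure of S201-S204 might look as follows, assuming a PyTorch-style detection model and data loader; the loss function (which must reflect the difference between the first object detection result and the sample object detection result) and the target assignment inside it are left abstract, since the application does not fix them.

```python
# Hedged sketch of one training round; detection_model, data_loader, optimizer and
# loss_fn are assumed to be PyTorch-style objects supplied by the caller.
def train_one_epoch(detection_model, data_loader, optimizer, loss_fn):
    detection_model.train()
    for batch in data_loader:
        # first object detection result output by the detection model
        prediction = detection_model(batch["point_clouds"], batch["images"])
        # compare against the labeled sample object detection result
        loss = loss_fn(prediction, batch["sample_object_detection_result"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()          # update the model parameters of the detection model
    return detection_model
```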
The detection model training method provided by the embodiment of the application comprises: acquiring at least one group of training data, wherein the training data comprises a plurality of frames of sample point clouds, a plurality of frames of sample images, and sample object detection results corresponding to the sample point clouds and the sample images; processing the multi-frame sample point clouds and the multi-frame sample images according to the feature extraction network in the detection model to obtain the feature information corresponding to the multi-frame sample point clouds and the multi-frame sample images; processing the feature information according to the detection network in the detection model to obtain a first object detection result output by the detection model; and updating the model parameters of the detection model according to the first object detection result and the sample object detection result. Because the detection model is trained with multi-frame point cloud data and multi-frame image data, the trained detection model can process multi-frame point clouds and multi-frame images to realize object detection, and because multi-frame multi-modal data is used as support, the accuracy of object detection can be effectively improved.
Based on the above introduction, a specific model structure and a processing procedure in the detection model in the present application are further described in detail with reference to a specific embodiment, and are described with reference to fig. 3 to 8, fig. 3 is a second flowchart of a detection model training method provided in the embodiment of the present application, fig. 4 is a schematic structural diagram of the detection model provided in the embodiment of the present application, fig. 5 is a schematic diagram of a correspondence relationship between a point cloud of multiple frames of samples and an image of the multiple frames of samples provided in the embodiment of the present application, fig. 6 is a schematic diagram of feature points included in a grid provided in the embodiment of the present application, fig. 7 is a schematic diagram of implementing region division of a feature map provided in the embodiment of the present application, and fig. 8 is a schematic diagram of implementing determining a region set provided in the embodiment of the present application.
As shown in fig. 3, the method includes:
s301, obtaining at least one group of training data, wherein the training data comprises a plurality of frames of sample point clouds, a plurality of frames of sample images, the sample point clouds and sample object detection results corresponding to the sample images.
The implementation manner of S301 is similar to the implementation manner of S201 described above, and is not described herein again.
S302, for any frame of sample image, projecting the sample image to corresponding sample point cloud according to calibration parameters between the image acquisition equipment and the point cloud acquisition equipment to obtain projected image information corresponding to the sample image.
Based on the above description, it can be determined that after at least one set of training data is determined, the multi-frame sample point cloud and the multi-frame sample image can be processed according to the feature extraction network in the detection model, so as to obtain feature information corresponding to the multi-frame sample point cloud and the multi-frame sample image.
In one possible implementation, for example, the model structure of the detection model may be understood in conjunction with fig. 4, as shown in fig. 4, a feature extraction network and a detection network are included in the detection model, and a feature encoding unit and a feature processing unit are included in the feature extraction network.
For any set of training data, after inputting the multi-frame sample image and the multi-frame sample point cloud to the detection model, for example, the multi-frame sample image and the multi-frame sample point cloud may be processed by a feature encoding unit in the feature extraction network first.
The feature encoding unit may have some differences in the processing of the multi-frame sample image and the multi-frame sample point cloud, and the following first describes the processing of the multi-frame sample image. The processing of each frame of sample image is similar, so that the processing process of the sample image is described below by taking any one of the multiple frames of sample images as an example, and the processing of the rest of sample images is not described in detail.
First, the relationship between the multi-frame sample images and the multi-frame sample point clouds will be described with reference to fig. 5. As can be understood from the above description, for any set of training data, the multi-frame sample point clouds and the multi-frame sample images include the sample point cloud and the sample image acquired at the first time, as well as the multi-frame sample point clouds and multi-frame sample images acquired before the first time.
For example, referring to fig. 5, taking an autonomous vehicle as an example, assuming that the autonomous vehicle A acquires an image 1 and a point cloud 1 at time t1, and assuming that object detection is currently required for the image 1 and the point cloud 1, time t1 may be the first time.
And, a multi-frame historical point cloud and a multi-frame historical image before time t1 are also needed as training data, for example, as shown in fig. 5, an image 2 and a point cloud 2 at time t2, an image 3 and a point cloud 3 at time t3, an image 4 and a point cloud 4 at time t4, an image 5 and a point cloud 5 at time t5, …, and an image j and a point cloud j at time tj, where j may be an integer greater than or equal to 2, may be obtained.
It will be appreciated that the time instants t2, t3, t4 are historical time instants prior to the time instant t1, and that for each time instant there is acquired image data and point cloud data. Therefore, in this embodiment, each frame of sample image has a corresponding sample point cloud, and the corresponding relationship here is acquired at the same time.
After understanding the corresponding relationship between the sample image and the sample point cloud, when processing the sample image, in a possible implementation manner, for example, the sample image may be projected onto the corresponding sample point cloud according to a calibration parameter between the image acquisition device and the point cloud acquisition device, so as to obtain projected image information corresponding to the sample image.
Wherein the image capturing device is used for capturing image data, and the image capturing device may be, for example, the image sensor described above. And a point cloud acquisition device for acquiring point cloud data, which may be, for example, the laser radar described above.
It can be understood that the difference between the installation positions of the image acquisition device and the point cloud acquisition device results in that the image data acquired by the image acquisition device and the point cloud data acquired by the point cloud acquisition device are in different coordinate systems. Therefore, for example, calibration parameters between the image acquisition device and the point cloud acquisition device can be determined according to the installation position of the point cloud acquisition device, the installation position of the image acquisition device, the appearance design parameters of the automatic driving vehicle and the projection parameters of each camera. The calibration parameters may indicate a correspondence between image pixels and points in the laser point cloud. The specific implementation of determining the calibration parameters between different sensors may be selected and set according to actual requirements, which is not limited in this embodiment.
When the sample image is projected onto the corresponding sample point cloud, for example, the sample image may be directly projected onto the corresponding sample point cloud to obtain projected image information corresponding to the sample image. Or, effective feature information can be extracted from the sample image by using the target detection network, and then the feature information of the sample image is projected onto the corresponding sample point cloud according to the calibration parameters, so that projected image information corresponding to the sample image is obtained.
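Assuming standard pinhole-camera conventions (an assumption, since the application does not specify the projection model), the projection of S302 could be sketched as follows: every lidar point is mapped through the calibration parameters into the image plane, and the image (or image-feature) value at that pixel is attached to the point as its projected image information.

```python
# Illustrative projection of image features onto a point cloud via calibration parameters.
import numpy as np

def project_image_to_points(points_xyz, image_features, lidar_to_cam, cam_intrinsics):
    """points_xyz: (N, 3) lidar points; image_features: (H, W, C);
    lidar_to_cam: (4, 4) extrinsic calibration; cam_intrinsics: (3, 3)."""
    n = points_xyz.shape[0]
    homog = np.hstack([points_xyz, np.ones((n, 1))])            # (N, 4) homogeneous lidar points
    cam_pts = (lidar_to_cam @ homog.T).T[:, :3]                 # points in the camera frame
    h, w, c = image_features.shape
    point_feat = np.zeros((n, c), dtype=image_features.dtype)   # default: no image information
    in_front = cam_pts[:, 2] > 0                                # only points in front of the camera project
    uvw = (cam_intrinsics @ cam_pts[in_front].T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    visible = (u >= 0) & (u < w) & (v >= 0) & (v < h)           # keep pixels inside the image
    idx = np.flatnonzero(in_front)[visible]
    point_feat[idx] = image_features[v[visible], u[visible]]
    return point_feat                                            # per-point projected image information
```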
And S303, projecting the projected image information corresponding to the sample image to a target image aiming at any frame of sample image to obtain a second projection image corresponding to the sample image, and performing feature extraction on the second projection image to obtain a second feature image corresponding to the sample image, wherein the second feature image comprises at least one second grid.
After the projected image information corresponding to the sample image is obtained, the feature map of the sample image can be determined according to the projected image information. In the following, any frame of sample image is also taken as an example for description, and details of the rest of sample images are not repeated.
In one possible implementation, after determining the projected image information corresponding to the sample image, for example, the projected image information corresponding to the sample image may be projected onto a target image, which in this embodiment may be a 2D bird's eye view.
After the projection, a second projection view corresponding to the sample image can be obtained. Then, feature extraction can be performed on the second projection diagram, so that a second feature diagram corresponding to the sample image is obtained. It can be understood that the 2D bird's eye view includes a plurality of grids, and therefore, after feature extraction is performed on the second projection view obtained after projection, at least one second grid may also be included in the obtained second feature view.
Through the calibration parameters between the image acquisition device and the point cloud acquisition device, multiple frames of sample images are projected onto the corresponding sample point clouds to obtain projected image information corresponding to each sample image, and the second feature map corresponding to each sample image is then determined from this projected image information. Because multiple frames of sample images are processed, the confusion between point cloud and image information caused by unavoidable calibration errors when processing a single frame can be mitigated, and the effect of the whole model can be effectively improved.
And projecting projected image information corresponding to the multi-frame sample images onto the 2D aerial view respectively, so that projection images corresponding to the sample images can be effectively obtained, and then, based on the projection images, second characteristic images of the sample images can be determined simply and effectively.
S304, projecting the sample point cloud to a target image aiming at any frame of sample point cloud to obtain a first projection image corresponding to the sample point cloud, and performing feature extraction on the first projection image to obtain a first feature image corresponding to the sample point cloud, wherein the first feature image comprises at least one first grid.
In this embodiment, projection processing may be performed on multiple frames of sample point clouds, where the processing of the multiple frames of sample point clouds is similar, so that the following also takes any one of the multiple frames of sample point clouds as an example, and introduces a processing procedure of the sample point clouds, and further details on processing of the remaining sample point clouds are omitted.
In one possible implementation, for example, the sample point cloud may be projected onto a target image, which may also be a 2D aerial view. The sample point cloud can be expressed as { x, y, z, intensity }, for example, where x, y, z are three-dimensional coordinate information of the point cloud, and intensity is intensity information of the point cloud.
After projection, a first projection view corresponding to the sample point cloud can be obtained. Then, feature extraction can be performed on the first projection graph, so that a first feature graph corresponding to the sample point cloud is obtained. It can also be understood that the 2D bird's eye view includes a plurality of grids, so that after feature extraction is performed on the first projection view obtained after projection, at least one first grid can be included in the obtained first feature view.
The multi-frame sample point clouds are respectively projected onto the 2D aerial view, so that the projection images corresponding to the sample point clouds can be effectively obtained, and then the first characteristic images of the sample point clouds can be simply and effectively determined based on the projection images.
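As a hedged illustration of projecting a sample point cloud {x, y, z, intensity} onto a 2D bird's-eye view, the sketch below rasterizes the points into grid cells; the grid range, cell size and the two hand-crafted channels are illustrative assumptions rather than parameters taken from the application.

```python
# Rasterize a point cloud onto a BEV grid; empty cells stay 0.
import numpy as np

def points_to_bev(points, x_range=(0.0, 70.4), y_range=(-40.0, 40.0), cell=0.4):
    """points: (N, 4) array of {x, y, z, intensity}. Returns per-point grid indices
    and a simple 2-channel BEV map (max height, max intensity) standing in for the
    first projection view."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)        # drop points outside the grid
    ix, iy, pts = ix[valid], iy[valid], points[valid]
    bev = np.zeros((2, nx, ny), dtype=np.float32)
    np.maximum.at(bev[0], (ix, iy), pts[:, 2])                    # max height per grid cell
    np.maximum.at(bev[1], (ix, iy), pts[:, 3])                    # max intensity per grid cell
    return ix, iy, bev
```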
It should be noted that, in terms of network design, the original information of the multi-frame sample point cloud and the multi-frame sample image is extracted independently, so that the three-dimensional information of the point cloud and the semantic information of the image can be retained to the greatest extent, the effectiveness of model training is effectively improved, and meanwhile, the accuracy and the comprehensiveness of the detection result output by the model can be ensured.
S305, aiming at any first grid in the first feature map, a plurality of feature points in the first grid are obtained.
Also taking any sample point cloud as an example, after obtaining the first feature map corresponding to the sample point cloud, the first feature map may include a plurality of first grids, and each first grid may include a plurality of feature points. Therefore, in the present embodiment, for any first grid in the first feature map, a plurality of feature points in the first grid can be obtained.
For example, it can be understood by referring to the diagram of fig. 6, as shown in fig. 6, it is assumed that 9 first grids are currently included in the first feature map, and a plurality of feature points may be included in each first grid. For example, the 1 st first grid is taken as an example, and the first grid includes feature points a, b, c and d.
The illustration of fig. 6 is merely an example, and is for facilitating understanding of the relationship among the feature map, the grid, and the feature points, and in an actual implementation process, the specific representation relationship among the feature map, the grid, and the feature points may be selected and set according to actual requirements.
S306, determining a correlation parameter corresponding to each feature point, wherein the correlation parameter is used for indicating the degree of correlation between the feature point and the first grid.
Then, for each feature point, a respective corresponding degree of correlation parameter may be determined, where the degree of correlation parameter is used to indicate a degree of correlation between the feature point and the first grid to which the feature point belongs. Or it may be understood that the relevance parameter is used to indicate how much the feature point contributes to the feature of the first grid to which it belongs.
In one possible implementation manner, for example, a lightweight learnable Multilayer Perceptron (MLP) network may be used to process each feature point in each first grid, so as to obtain a correlation degree parameter corresponding to each feature point.
Determining the correlation degree parameters with the MLP network can ensure efficient dynamic interaction of temporal multi-modal information while avoiding placing an excessive computational burden on the system.
After the correlation degree parameter of each feature point in the first grid is determined, the feature points can be fused by taking the correlation degree parameters as weights, so that the grid feature corresponding to the first grid can be obtained, and the overall first grid features of the first feature map can be effectively determined.
S307, obtaining grid characteristics corresponding to the first grid according to the correlation degree parameters corresponding to the characteristic points respectively, wherein the first grid characteristics comprise grid characteristics of a plurality of first grids in the first characteristic diagram respectively.
Taking any one of the first grids in the first feature map as an example, the grid feature corresponding to the first grid can be obtained from the correlation degree parameters corresponding to the feature points in the first grid. In a possible implementation manner, for example, the correlation degree parameter of each feature point in the first grid may be used as a weight, and the feature points are then fused, so as to obtain the grid feature of the first grid.
The above-described processing is performed for each first grid in the first feature map, so that the grid feature of each first grid can be obtained. Further, the first grid feature in this embodiment includes a grid feature of each of the plurality of first grids in the first feature map.
Meanwhile, it can be understood that, in the present embodiment, for each frame of sample point cloud, the corresponding first feature map is obtained through processing. Then, similarly, the first feature map of each frame of sample point cloud is obtained according to the above-described procedure, and the corresponding first grid feature is obtained.
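A minimal sketch of S305-S307, assuming PyTorch: a lightweight learnable MLP scores each feature point in a grid with a correlation degree parameter, and the points are fused with those scores as weights to form the grid feature. The softmax normalization of the weights is an implementation choice, not something the application prescribes.

```python
# Score each feature point in a grid and fuse the points into one grid feature.
import torch
import torch.nn as nn

class GridFeatureEncoder(nn.Module):
    def __init__(self, point_dim, hidden=32):
        super().__init__()
        self.score_mlp = nn.Sequential(            # lightweight learnable MLP
            nn.Linear(point_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, grid_points):                 # grid_points: (P, point_dim) points of one grid
        scores = self.score_mlp(grid_points)        # correlation degree parameter per feature point
        weights = torch.softmax(scores, dim=0)      # normalize so the weights sum to 1
        return (weights * grid_points).sum(dim=0)   # fused grid feature, shape (point_dim,)
```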
And S308, aiming at any second grid in the second feature map, acquiring a plurality of feature points in the second grid.
The above describes an implementation of determining a first grid feature of a first feature map, and an implementation of determining a second grid feature for a second feature map is similar.
Also taking any sample image as an example, after obtaining a second feature map corresponding to the sample image, the second feature map may include a plurality of second grids, and each second grid may include a plurality of feature points.
S309, determining a correlation parameter corresponding to each feature point, wherein the correlation parameter is used for indicating the degree of correlation between the feature point and the second grid.
And for each feature point, a corresponding correlation parameter may be determined, where the implementation of determining the correlation parameter of the feature point in the second grid is similar to the implementation of determining the correlation parameter of the feature point in the first grid described in S306, and is not described herein again.
And S310, obtaining grid characteristics corresponding to a second grid according to the correlation degree parameters corresponding to the characteristic points, wherein the second grid characteristics comprise grid characteristics of a plurality of second grids in a second characteristic diagram.
After obtaining the correlation parameters of the feature points in the second grid, taking any one of the second grids as an example, the grid feature of the second grid may be determined according to the correlation parameters of the feature points in the second grid. Determining the grid characteristics of each second grid in the second characteristic diagram to obtain the second grid characteristics of the second characteristic diagram, wherein the implementation manner is similar to that described in the above S307, and details are not repeated here.
Similarly, in the present embodiment, for each frame of sample image, the corresponding second feature map is obtained through processing. Then, similarly, the second feature map of each frame sample image is obtained according to the above-described procedure, and its corresponding second grid feature is obtained.
And determining the respective relevance parameters of the feature points in the second grid, and then fusing the feature points by taking the relevance parameters as the weights of the feature points, so that the grid features corresponding to the second grid can be obtained, and the overall second grid features of the second feature map can be effectively determined.
S311, for any one first feature map, carrying out region division on the first feature map to obtain N×M first regions, wherein N and M are integers greater than or equal to 1.
After the first feature maps corresponding to the sample point clouds are obtained, region division may further be performed on each first feature map. The processing procedures of the first feature maps are similar, so the processing is described below by taking any one of the first feature maps as an example, and the processing of the remaining first feature maps is not described again.
In one possible implementation, the first feature map may be subjected to region division, so as to obtain N×M first regions, where N and M are integers greater than or equal to 1. In an actual implementation process, the specific values of N and M may be selected and set according to actual requirements, and the values of N and M determine how many first regions the first feature map is divided into.
For example, it can be understood with reference to fig. 7, as shown in fig. 7, it is assumed that the first feature map 701 is currently subjected to region division, and the first feature map 701 is divided into 4 × 4 first regions, that is, 16 first regions, i.e., the region 1 to the region 16 shown in fig. 7, are obtained.
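For intuition, the following Python sketch (an illustrative assumption, not the claimed implementation) divides a feature map of shape (C, H, W) into N × M regions; it assumes for simplicity that H and W are divisible by N and M.

    import torch

    def divide_into_regions(feature_map: torch.Tensor, n: int, m: int):
        # feature_map: (C, H, W); returns n * m regions in row-major order.
        c, h, w = feature_map.shape
        rh, rw = h // n, w // m
        regions = []
        for i in range(n):
            for j in range(m):
                regions.append(feature_map[:, i * rh:(i + 1) * rh,
                                              j * rw:(j + 1) * rw])
        return regions

    # Example matching fig. 7: a 4 x 4 division yields 16 first regions.
    print(len(divide_into_regions(torch.randn(64, 128, 128), 4, 4)))  # 16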
S312, for any one of the second feature maps, the second feature map is subjected to region division to obtain N × M second regions.
In addition, in this embodiment, region division may also be performed on the second feature map of the sample image, where a region division manner of the second feature map is similar to the region division manner described above for the first feature map, and details are not repeated here.
It should be emphasized that, for the region division of the second feature map, the number of the divided second regions is also N × M, that is, the division manner of the second feature map is the same as the division manner of the first feature map.
S313, determining each first region and each second region at the same position as one region set according to the first region of each first feature map and the second region of each second feature map.
After region division is performed on the first feature map of each sample point cloud and the second feature map of each sample image, because the region division manner of each first feature map and that of each second feature map are the same (for example, each first feature map is divided into 4 × 4 first regions and each second feature map is divided into 4 × 4 second regions), the divided regions of the feature maps can be placed in one-to-one correspondence.
Therefore, in this embodiment, each first region and each second region at the same position may be determined as a region set according to the first region of each first feature map and the second region of each second feature map.
For example, in the above-described example, assuming that each of the first feature maps and each of the second feature maps are divided into 4 × 4 regions, there are 16 regions in total, that is, there are 16 positions of the regions. Then, for the 16 positions, each first region and each second region of the same position are determined as a region set.
For example, it can be understood in conjunction with fig. 8, as shown in fig. 8, it is assumed that a first feature map of the sample point cloud 1, a first feature map of the sample point cloud 2, a second feature map of the sample image 1, and a second feature map of the sample image 2 currently exist. It is assumed that these 4 feature maps are all divided into 4 × 4 regions as shown in fig. 7.
Then, each first region and each second region at the same position are determined as one region set. For example, referring to fig. 8, the regions at position 4 (the region 4 of the first feature map of the sample point cloud 1, the region 4 of the first feature map of the sample point cloud 2, the region 4 of the second feature map of the sample image 1, and the region 4 of the second feature map of the sample image 2) are determined as one region set, so that, for example, the region set 4 shown in fig. 8 can be obtained.
The same operation is performed for each position, so that N × M area sets can be obtained, for example, 16 area sets can be obtained in the schematic diagram in fig. 8.
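Continuing the sketch, the regions of several feature maps (first and second feature maps divided with the same N × M scheme) can be grouped by position into region sets; the helper divide_into_regions below is the illustrative function sketched above, not an element of the claims.

    import torch

    def build_region_sets(feature_maps, n, m):
        # One list of n * m regions per feature map, all in the same order.
        per_map = [divide_into_regions(fm, n, m) for fm in feature_maps]
        # A region set collects, for one position, the region of every map.
        return [[regions[pos] for regions in per_map] for pos in range(n * m)]

    # Example matching fig. 8: 2 point-cloud maps and 2 image maps, 4 x 4 division.
    maps = [torch.randn(64, 128, 128) for _ in range(4)]
    sets = build_region_sets(maps, 4, 4)
    print(len(sets), len(sets[0]))  # 16 region sets, each containing 4 regions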
S314, for any region set, inputting the first grid features corresponding to the first regions in the region set and the second grid features corresponding to the second regions in the region set into the self-attention network, so that the self-attention network outputs the sub-feature information corresponding to the region set.
After a plurality of region sets are obtained, each region set is processed. The processing of each region set is similar, so the description is given by taking any one region set as an example, and the processing of the remaining region sets is not repeated.
It can be determined from the above description that any one of the region sets includes a plurality of first regions and a plurality of second regions. Since each first feature map has corresponding first grid features, after the first feature map is divided into regions, each first region has the corresponding part of those first grid features, that is, the grid features of the first grids that fall within the region. Similarly, each second feature map has corresponding second grid features, so after region division is performed on the second feature map, each second region has corresponding second grid features.
In a possible implementation manner, currently, in order to determine the sub-feature information of the region set, a first grid feature corresponding to each first region in the region set and a second grid feature corresponding to each second region in the region set may be input into the self-attention network. And processing the input data by the self-attention network so as to output the sub-feature information corresponding to the region set.
The self-attention network may be, for example, a Transformer self-attention network, or another self-attention network, which is not limited in this embodiment. It can be understood that the self-attention network uniformly models the interrelations of features from different frames and different modalities through multi-layer nonlinear transformations, so that the effectiveness and accuracy of the feature information obtained jointly from the plurality of sample point clouds and the plurality of sample images can be effectively ensured.
The same processing is performed for each region set, so that sub-feature information corresponding to each region set can be obtained.
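As a hedged illustration of the per-region-set processing, the sketch below uses a standard Transformer encoder as the self-attention network; the actual network structure, layer count and channel width are not prescribed by this embodiment, and the flattening of grid features into a token sequence is an assumption made for the example.

    import torch
    import torch.nn as nn

    class RegionSetAttention(nn.Module):
        # Illustrative self-attention over the grid features of one region set.
        def __init__(self, channels: int = 64, heads: int = 4):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)

        def forward(self, grid_features: torch.Tensor) -> torch.Tensor:
            # grid_features: (1, L, C) - first and second grid features of all
            # regions in the set, flattened into a token sequence of length L.
            return self.encoder(grid_features)  # sub-feature information

    # Example: one region set contributing 256 grid-feature tokens.
    net = RegionSetAttention(channels=64, heads=4)
    print(net(torch.randn(1, 256, 64)).shape)  # torch.Size([1, 256, 64])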
And S315, splicing the sub-feature information of each region set to obtain feature information.
After each region is divided, the region set composed of the first region and the second region is processed, so as to obtain the sub-feature information of each region set. Further, in order to obtain the feature information corresponding to the plurality of sample point clouds and the plurality of sample images, the sub-feature information of each region set may be spliced, so as to obtain the feature information corresponding to the plurality of sample point clouds and the plurality of sample images.
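A minimal sketch of the splicing step, assuming each region set yields sub-feature information of shape (C, h, w) and the sets are arranged in row-major order over the N × M positions:

    import torch

    def splice_sub_features(sub_features, n, m):
        rows = []
        for i in range(n):
            # Concatenate one row of region sets along the width dimension.
            rows.append(torch.cat(sub_features[i * m:(i + 1) * m], dim=-1))
        # Concatenate the rows along the height dimension to obtain the
        # overall feature information.
        return torch.cat(rows, dim=-2)

    subs = [torch.randn(64, 32, 32) for _ in range(16)]
    print(splice_sub_features(subs, 4, 4).shape)  # torch.Size([64, 128, 128])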
By performing region division on each first feature map to obtain a plurality of first regions and on each second feature map to obtain a plurality of second regions, determining the first regions and second regions at the same position as one region set, and processing each region set separately, the amount of data handled in a single pass can be effectively reduced and the overall computation accuracy and efficiency of the processing system can be improved. In addition, when each region set is processed, the grid features of the regions in the set are fused by the self-attention network, so that unified modeling of the interrelations of features from different frames and different modalities can be effectively achieved, and unified feature information of the multi-frame multi-modal environment data can be effectively obtained.
And S316, processing the characteristic information according to the detection network in the detection model to obtain a first object detection result output by the object detection model.
And S317, updating model parameters of the detection model according to the first object detection result and the sample object detection result.
The implementation manners of S316 and S317 are similar to the implementation manners of S203 and S204 described above, and are not described herein again.
The detection model training method provided by the embodiment of the application trains the detection model according to multiple frames of sample images and multiple frames of sample point clouds. In the specific training process, each sample image is projected onto the corresponding sample point cloud through the calibration parameters between the image acquisition device and the point cloud acquisition device before the sample image is further processed, so that the confusion between point cloud and image information caused by unavoidable calibration errors in single-frame processing can be avoided, and the effect of the overall model can be effectively improved. The projected image information and the multiple frames of point clouds corresponding to the multiple frames of sample images are then projected onto the 2D bird's-eye view to obtain the projection images corresponding to the sample point clouds and the sample images, and features are extracted from these projection images, so that the feature maps corresponding to the sample point clouds and the sample images can be determined simply and effectively. Then, for each feature map, the grid feature of each grid is determined according to the respective correlation parameters of the feature points in the grid, so as to obtain the overall grid features corresponding to each feature map. Each feature map is then divided into regions, the regions at the same position are determined as a region set, each region set is processed by the self-attention network to determine its sub-feature information, and the sub-feature information of the region sets is spliced to obtain unified feature information of the multi-frame multi-modal data, so that the overall computation accuracy and efficiency of the processing system can be effectively improved. Finally, the object detection result is determined according to the processed feature information, and the detection model is trained according to the object detection result output by the model and the sample object detection result in the training data, so that it can be effectively ensured that the trained detection model can process multi-frame multi-modal environment data and output object detection results.
Based on the above description, in another possible implementation manner, after determining the feature information, further processing may be performed on the currently determined feature information, so as to obtain final feature information. The processing procedure is understood with reference to fig. 9, and fig. 9 is a schematic diagram of an implementation of the downsampling processing provided in the embodiment of the present application.
Based on the above description, it can be determined that, after performing region division on each of the first feature map and the second feature map, N × M region sets may be obtained, and for example, the N × M region sets may be determined as the original layer.
For example, as can be understood in conjunction with fig. 9, assuming that each first feature map and each second feature map are divided into 4 × 4 regions, 4 × 4 region sets can be obtained, and the 4 × 4 region sets can form the original layer shown as 901 in fig. 9, where the original layer 901 has a size of H1 × W1.
After the original layer is determined, downsampling processing is performed T times on the N × M region sets in the original layer to obtain T downsampling layers, where the i-th downsampling layer includes Pi × Qi region sets, T is an integer greater than or equal to 1, Pi and Qi are integers greater than or equal to 1, Pi is less than N, Qi is less than M, and i ranges from 1 to T.
After each downsampling, one downsampling layer is obtained. The downsampling processing can be understood as downsampling several region sets of the previous layer into one region set, so that the i-th downsampling layer includes Pi × Qi region sets, where Pi and Qi are integers greater than or equal to 1, Pi is less than N, and Qi is less than M.
In one possible implementation, for example, every 4 region sets in the original layer may be sequentially downsampled into 1 region set in the downsampling layer. In that case Pi may, for example, be equal to N/2^i, and Qi may, for example, be equal to M/2^i.
For example, as can be understood with reference to fig. 9, assume that the original layer includes 4 × 4 region sets. A first downsampling is then performed, in which every 4 region sets in the original layer are downsampled into 1 region set. After the 1st downsampling, the 1st downsampling layer shown as 902 in fig. 9 is obtained, where the 1st downsampling layer 902 has a size of H1/2 × W1/2. As shown in fig. 9, the obtained 1st downsampling layer 902 includes 2 × 2 region sets.
A second downsampling can then be performed, in which every 4 region sets in the 1st downsampling layer are downsampled into 1 region set. After the 2nd downsampling, the 2nd downsampling layer shown as 903 in fig. 9 is obtained, where the 2nd downsampling layer 903 has a size of H1/4 × W1/4. As shown in fig. 9, the obtained 2nd downsampling layer 903 includes a 1 × 1 region set.
Then, for the i-th downsampling layer among the T downsampling layers, the respective sub-feature information of the Pi × Qi region sets in that downsampling layer can be determined. The implementation of determining the sub-feature information of a region set is similar to that described above and is not repeated here.
The sub-feature information of the Pi × Qi region sets may then be spliced to obtain the intermediate feature information of the i-th downsampling layer.
And further, mapping the intermediate feature information of the ith down-sampling layer to the size of the feature information of the original layer to obtain the adjusted intermediate feature information, wherein the feature information of the original layer is the determined feature information corresponding to the multi-frame sample point cloud and the multi-frame sample image.
And then, fusing according to the adjusted intermediate characteristic information and the characteristic information of the original layer to obtain fused characteristic information.
For example, it can be understood with reference to fig. 9, as shown in fig. 9, for example, the intermediate feature information may be determined for the 1 st down-sampling layer 902, and then the size of the intermediate feature information of the 1 st down-sampling layer 902 may be adjusted, so as to obtain the adjusted intermediate feature information corresponding to the 1 st down-sampling layer 902. And, the intermediate feature information may also be determined for the 2 nd down-sampling layer 903, and then the size of the intermediate feature information of the 2 nd down-sampling layer 903 is adjusted to obtain the adjusted intermediate feature information corresponding to the 2 nd down-sampling layer 903.
At this time, the adjusted intermediate feature information corresponding to the 1st downsampling layer 902, the adjusted intermediate feature information corresponding to the 2nd downsampling layer 903, and the feature information corresponding to the original layer 901 are of the same size, and the 3 pieces of feature information are then fused to obtain the fused feature information.
And then determining the fused characteristic information as the characteristic information corresponding to the multi-frame sample point cloud and the multi-frame sample image, and then processing based on the characteristic information to obtain a first object detection result output by the object detection model.
It can be understood that the multi-scale sparse self-attention mechanism designed above (as shown in fig. 9) first performs T downsampling operations on the original layer to obtain T downsampling layers, then applies a self-attention module independently on each layer to determine the intermediate feature information corresponding to each downsampling layer, and then performs fusion according to the intermediate feature information and the feature information of the original layer to obtain the final feature information, so that the capability of perceiving information at different scales can be enhanced. Meanwhile, because point cloud data is highly sparse in space and the mechanism adopts downsampling processing, the features are sparsely encoded and repeated computation in completely sparse (empty) areas is avoided, which greatly improves the operating efficiency of the overall system. Therefore, through the process introduced above, the accuracy and efficiency of model processing can be effectively improved.
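The following Python sketch outlines the multi-scale processing under simplifying assumptions: average pooling stands in for merging 2 × 2 region sets into one, a caller-supplied attend function stands in for the per-layer self-attention processing, and the fusion is a simple summation; none of these choices is mandated by the embodiment.

    import torch
    import torch.nn.functional as F

    def multi_scale_fuse(original_features: torch.Tensor, t: int, attend):
        # original_features: (1, C, H, W) feature information of the original layer.
        fused = attend(original_features)
        layer = original_features
        for _ in range(t):
            # Downsample: 2 x 2 neighbouring region sets become one region set.
            layer = F.avg_pool2d(layer, kernel_size=2)
            intermediate = attend(layer)          # intermediate feature information
            # Map the intermediate feature information back to the original size.
            resized = F.interpolate(intermediate,
                                    size=original_features.shape[-2:],
                                    mode="bilinear", align_corners=False)
            fused = fused + resized               # fuse with the original layer
        return fused

    feats = torch.randn(1, 64, 128, 128)
    print(multi_scale_fuse(feats, t=2, attend=lambda x: x).shape)  # torch.Size([1, 64, 128, 128])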
Further, based on the above description, the following describes an implementation of obtaining at least one set of training data.
It is understood that, when obtaining training data, at least one set of original training data may be obtained first, either from network data or from local data. However, the network data or local data may not provide sufficient training data, and therefore, after the original training data is obtained, synthesized training data may be further obtained according to the original training data.
Referring to fig. 10, fig. 10 is a flowchart of a third method for training a detection model according to an embodiment of the present disclosure.
As shown in fig. 10, the method includes:
s1001, at least one group of original training data is obtained.
In this embodiment, at least one set of raw training data may be acquired first. The original training data is similar to the introduced training data, and the original training data may include a plurality of frames of sample point clouds, a plurality of frames of sample images, and sample object detection results corresponding to the sample point clouds and the sample images.
S1002, determining a sample point cloud and a sample image of at least one target object in the original training data.
After the original training data is determined, it is understood that the original training data includes a plurality of frames of point clouds and a plurality of frames of images that contain at least one object. For example, any one of the objects may be determined as a target object, and the sample point cloud and sample image of the at least one target object may then be cropped from the plurality of frames of sample point clouds and sample images.
S1003, at least one scene point cloud and a scene image are obtained.
In this embodiment, for example, it may be predetermined that there is at least one scene point cloud and at least one scene image. The scene point cloud and the scene image are used for providing different scenes, such as an outdoor scene, an indoor scene, a rainy scene, a sunny scene, and the like, and specific scene selection can be selected and set according to actual requirements.
Accordingly, the scene point cloud and the scene image are point cloud data and scene data acquired for the scenes, and may be acquired through a network, or may be acquired through local data, and the like, which is not limited in this embodiment.
S1004, aiming at any target object, synthesizing the sample point cloud and the scene point cloud of the target object to obtain a synthesized point cloud, and synthesizing the sample image and the scene image of the target object to obtain a synthesized image.
In the actual implementation process, all the objects existing in the sample point cloud and the sample image can be understood as target objects, and the processing for each target object is similar, so that only one target object is described below, and the implementation manners of the other objects are similar.
Specifically, the obtained sample point cloud of the target object and the introduced scene point cloud may be synthesized, so as to obtain a synthesized point cloud. And synthesizing the acquired sample image of the target object and the scene image to obtain a synthesized image.
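As a simplified illustration of the point cloud synthesis (the perturbation ranges below are arbitrary assumptions, and the corresponding image synthesis is omitted):

    import numpy as np

    def paste_object_into_scene(object_points: np.ndarray,
                                scene_points: np.ndarray) -> np.ndarray:
        # object_points: (K, 3) xyz points of the target object;
        # scene_points:  (P, 3) xyz points of the scene point cloud.
        yaw = np.random.uniform(-np.pi, np.pi)
        rot = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                        [np.sin(yaw),  np.cos(yaw), 0.0],
                        [0.0,          0.0,         1.0]])
        shift = np.random.uniform(-1.0, 1.0, size=3)     # random disturbance
        moved = object_points @ rot.T + shift
        # The synthesized point cloud is the scene plus the transformed object.
        return np.concatenate([scene_points, moved], axis=0)

    obj, scene = np.random.rand(200, 3), np.random.rand(5000, 3)
    print(paste_object_into_scene(obj, scene).shape)  # (5200, 3)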
S1005, determining consistency parameters corresponding to the synthetic point clouds respectively, and determining consistency parameters corresponding to the synthetic images respectively.
After the synthesized point clouds are obtained, in order to avoid the problems that the synthesized sample point clouds and the scene point clouds have conflicts, and the multi-frame occlusion relations of the image data are inconsistent, the consistency parameters of the synthesized point clouds and the consistency parameters of the synthesized images need to be further determined.
The consistency parameter is a parameter for indicating consistency of the synthesized point cloud data and the synthesized image data, and may be processed by using a data consistency network, or may be processed point by point and pixel by pixel to determine the consistency parameter.
S1006, determining the synthetic point cloud with the consistency parameter meeting the first preset condition as a target synthetic point cloud, and determining the synthetic image with the consistency parameter meeting the second preset condition as a target synthetic image.
After the consistency parameter is determined, the synthetic point cloud with the consistency parameter meeting the first preset condition can be determined as the target synthetic point cloud, and the synthetic image with the consistency parameter meeting the second preset condition can be determined as the target synthetic image.
The consistency parameter may be, for example, a numerical parameter, and the first preset condition may be that the consistency parameter is greater than or equal to a first threshold; alternatively, the consistency parameter may be a binary parameter that indicates, for example, whether the synthesized point cloud has consistency, and the first preset condition may then be that the consistency parameter indicates that the synthesized point cloud has consistency.
The second preset condition is similar to the first preset condition described above, except that the second preset condition is a condition set for the synthesized image, and specific implementation of the second preset condition is not described herein again.
And S1007, determining the target synthetic point cloud and the target synthetic image as a group of synthetic training data.
After determining the target synthetic point cloud and the target synthetic image, the target synthetic point cloud and the target synthetic image may be determined as a set of synthetic training data. Since there may be a plurality of target synthetic point clouds and target synthetic images, at least one set of synthetic training data may be obtained.
S1008, determining the at least one set of original training data and the at least one set of synthesized training data as the at least one set of training data.
Then, at least one set of original training data and at least one set of synthesized training data are determined as at least one set of training data, so as to obtain training data used for training the detection model.
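The assembly of the training data described in S1001-S1008 can be summarized by the following sketch; consistency_of stands in for the data-consistency check (network-based or point-by-point and pixel-by-pixel), and the threshold of 0.5 is an illustrative preset condition rather than a value taken from this embodiment.

    def build_training_data(original_sets, synthetic_candidates, consistency_of,
                            threshold: float = 0.5):
        synthetic_sets = []
        for cand in synthetic_candidates:
            point_ok = consistency_of(cand["synthetic_point_cloud"]) >= threshold
            image_ok = consistency_of(cand["synthetic_image"]) >= threshold
            if point_ok and image_ok:
                # Keep only target synthetic point clouds and target synthetic images.
                synthetic_sets.append(cand)
        # The training data is the original plus the retained synthetic data.
        return list(original_sets) + synthetic_sets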
It can be understood that the data-driven deep neural network largely depends on the quantity and quality of the used data, and the data collection amount of the automatic driving is huge, but still cannot exhaust all the situations which can occur in road traffic. In order to improve the data diversity under the limited data acquisition, the embodiment provides the time-series multi-modal data enhancement scheme introduced above, which can simultaneously ensure the consistency of the generated synthetic point cloud and the synthetic image data in time series and the consistency in the cross-modal data.
In the detection model training method provided by the embodiment of the application, all point cloud and image data of a certain target object in a continuous multi-frame automatic driving scene are first cropped out; this segment of point cloud and image data is then pasted into a new automatic driving scene after projection transformation and random disturbance processing; after the pasting, whether the point cloud data conflicts with the original scene is judged frame by frame, whether the multi-frame occlusion relations of the image data are consistent is judged, and finally only the consistent target synthetic data is retained. Therefore, it can be understood that the implementation process introduced above can effectively synthesize usable training data to improve the scene richness of the training data, and further effectively improve the accuracy and generalization of the detection model. Meanwhile, the scheme can also be applied directly to any automatic-driving acquisition data to provide richer automatic driving scene data.
The above embodiment describes a training process for a detection model, and after the training of the detection model is completed, multi-frame point cloud data and multi-frame image data may be processed according to the detection model, so as to obtain an object detection result.
Therefore, the present disclosure further provides an object detection method, which is described below with reference to specific embodiments. First, description is made with reference to fig. 11, and fig. 11 is a flowchart of an object detection method according to an embodiment of the present disclosure.
As shown in fig. 11, the method includes:
s1101, acquiring a first point cloud and a first image acquired at a first moment.
In this embodiment, it is assumed that the first point cloud and the first image are obtained by shooting at a first time, and the first point cloud and the first image have the same acquisition time and therefore have a corresponding relationship.
The first time point can be understood as a time point at which object detection is required, that is, object detection is currently required for the first point cloud and the first image acquired at the first time point.
S1102, acquiring multiple frames of second point clouds and multiple frames of second images acquired before the first moment.
Based on the above description, it can be determined that, when the detection model in this embodiment performs object detection, it processes the first point cloud and the first image acquired at the first time for which object detection is required, and may further process them together with historical point clouds and historical images.
Therefore, in this embodiment, a plurality of frames of second point clouds and a plurality of frames of second images acquired before the first time may also be acquired.
In one possible implementation, for example, all point clouds and all images acquired by the current device (for example, an autonomous vehicle) within a preset time period before the first time may be obtained, so as to determine the multiple frames of second point clouds and the multiple frames of second images. Alternatively, only part of the point clouds and images acquired by the current device within the preset time period before the first time may be obtained; the partial point clouds and images may be taken at intervals of a first time length, or may be sampled randomly. This embodiment does not limit the specific implementation of obtaining the second point clouds and the second images, as long as the second point clouds and the second images are acquired before the first time and include multiple frames.
It is understood that, similarly to the above description, there is a time-series correspondence relationship between the plurality of frames of the second point cloud and the plurality of frames of the second image that are currently acquired.
S1103, processing the first point cloud, the first image, the multiple frames of second point clouds and the multiple frames of second images according to the detection model to obtain object detection results corresponding to the first point cloud and the first image.
The detection model is a model obtained by training according to the detection model training method described in the above embodiments.
After the first point cloud, the first image, the plurality of frames of the second point cloud, and the plurality of frames of the second image described above are determined, the data may be processed according to the detection model. The detection model is obtained by training according to the introduced embodiment, so that processing of multiple frames of point clouds and multiple frames of images can be effectively realized, and an object detection result corresponding to the first point cloud and the first image is output. The output object detection result may include the location and classification information of each object in the first point cloud and the first image.
It should be noted that, during the application of the detection model, the internal processing procedure is similar to the processing procedure described above for training the detection model; the only difference is that no additional synthesized training data is needed during the application of the detection model.
The object detection method provided by the embodiment of the application includes: acquiring a first point cloud and a first image collected at a first time; acquiring multiple frames of second point clouds and multiple frames of second images collected before the first time; and processing the first point cloud, the first image, the multiple frames of second point clouds and the multiple frames of second images according to the detection model to obtain the object detection result corresponding to the first point cloud and the first image. That is, the detection model obtained by the above training processes multiple frames of images (the first image collected at the first time and the multiple frames of second images collected before the first time) and multiple frames of point clouds (the first point cloud collected at the first time and the multiple frames of second point clouds collected before the first time), and outputs the object detection result. Because the object detection result is determined according to multi-frame multi-modal environment data, the comprehensiveness and richness of the data on which the output object detection result depends can be effectively ensured, and the accuracy and effectiveness of object detection can be effectively improved.
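For completeness, a minimal inference sketch is given below; the model interface (a callable taking lists of point clouds and images) is an assumption made only for illustration.

    import torch

    @torch.no_grad()
    def detect_objects(model, first_point_cloud, first_image,
                       second_point_clouds, second_images):
        point_clouds = [first_point_cloud] + list(second_point_clouds)
        images = [first_image] + list(second_images)
        # The detection model fuses the multi-frame, multi-modal inputs and
        # returns the object detection result (for example, the locations and
        # classification information of the objects) corresponding to the
        # first point cloud and the first image.
        return model(point_clouds, images)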
On the basis of the above-described embodiments, a complete system description of the method provided by the embodiment of the present application is provided below with reference to fig. 12. Fig. 12 is a schematic flowchart of a detection model training method and an object detection method according to an embodiment of the present disclosure.
As shown in fig. 12, multiple frames of laser point clouds may be obtained first, where a single frame of laser point cloud may be obtained by a multi-line rotating lidar or a solid-state lidar, and the multiple frames of laser point clouds are accumulated from historically observed point clouds.
Multi-frame camera images can also be acquired, where a single-frame camera image can be acquired through the vehicle-mounted cameras, mainly including front and rear cameras, surround-view cameras, and other supplementary cameras, and the multiple frames of camera images are accumulated from historically observed images.
Then, through the camera calibration and projection unit, the calibration parameters between the image acquisition device and the point cloud acquisition device are obtained.
In the model training process, multiple frames of laser point clouds and multiple frames of camera images can be used as original training data. And obtaining at least one set of synthesized training data according to the original training data by the data synthesis unit, wherein the specific implementation of the synthesized training data can refer to the above description, thereby obtaining at least one set of training data. The synthetic training data is obtained by converting the target projection in the existing training data into different scenes, so that the diversity of data scenes is greatly enriched, and the speed and the precision of network detection training are improved.
And then inputting the original training data, the synthesized training data and the acquired calibration parameters into a feature extraction network, and determining feature information corresponding to multiple frames of laser point clouds and multiple frames of camera images. And then, inputting the characteristic information into a detection network to obtain a detection result output by the detection model.
The above-described process can be understood as a process of detecting a model during the model training process. After the training of the detection model is completed, the processing procedure of the detection model is similar in the specific application process of the detection model, except that the step of synthesizing the training data is not described above. For more detailed implementation, reference may be made to the above description, which is not repeated herein.
In summary, the detection model training method and the object detection method provided by the embodiment of the application provide an efficient deep learning architecture, so that sufficient and efficient fusion of multi-frame point cloud and multi-frame image information is realized, and the overall accuracy of sensing target detection is further improved. And an effective data enhancement mechanism is also provided, and the training efficiency and the test precision of the network are improved aiming at the data characteristics of multi-frame and multi-mode.
Fig. 13 is a schematic structural diagram of a detection model training apparatus according to an embodiment of the present application. As shown in fig. 13, the apparatus 130 includes: an obtaining module 1301, a first processing module 1302, a second processing module 1303, and an updating module 1304.
An obtaining module 1301, configured to obtain at least one set of training data, where the training data includes multiple frames of sample point clouds, multiple frames of sample images, and sample object detection results corresponding to the sample point clouds and the sample images;
a first processing module 1302, configured to process the multiple frames of sample point clouds and the multiple frames of sample images according to a feature extraction network in a detection model, so as to obtain feature information corresponding to the multiple frames of sample point clouds and the multiple frames of sample images;
the second processing module 1303, configured to process the feature information according to the detection network in the detection model, to obtain a first object detection result output by the object detection model;
an updating module 1304, configured to update a model parameter of the detection model according to the first object detection result and the sample object detection result.
In one possible design, the feature extraction network comprises a feature encoding unit and a feature processing unit;
the first processing module 1302 is specifically configured to:
processing the multiple frames of sample point clouds and the multiple frames of sample images according to the feature coding unit to obtain first grid features corresponding to the sample point clouds and second grid features corresponding to the sample images;
and processing each first grid feature and each second grid feature according to the feature processing unit to obtain the feature information.
In one possible design, the first processing module 1302 is specifically configured to:
for any frame of sample image, projecting the sample image onto corresponding sample point cloud according to calibration parameters between image acquisition equipment and point cloud acquisition equipment to obtain projected image information corresponding to the sample image;
obtaining a first feature map corresponding to each sample point cloud and a second feature map corresponding to each sample image according to the projected image information corresponding to each sample point cloud and each sample image;
obtaining a first grid feature corresponding to each sample point cloud according to the first feature map;
and obtaining a second grid feature corresponding to each sample image according to the second feature map.
In one possible design, the first processing module 1302 is specifically configured to:
for any frame of the sample point cloud, projecting the sample point cloud onto a target image to obtain a first projection graph corresponding to the sample point cloud, wherein the first feature graph comprises at least one first grid;
extracting features of the first projection drawing to obtain a first feature drawing corresponding to the sample point cloud;
for any frame of the sample image, projecting the projected image information corresponding to the sample image onto the target image to obtain a second projection image corresponding to the sample image, wherein the second feature image comprises at least one second grid;
and performing feature extraction on the second projection drawing to obtain a second feature drawing corresponding to the sample image.
In one possible design, the first processing module 1302 is specifically configured to:
aiming at any one first grid in the first feature map, acquiring a plurality of feature points in the first grid;
determining a correlation parameter corresponding to each feature point, wherein the correlation parameter is used for indicating the degree of correlation between the feature point and the first grid;
obtaining grid features corresponding to the first grid according to the correlation degree parameters corresponding to the feature points, wherein the first grid features comprise grid features of a plurality of first grids in the first feature diagram.
In one possible design, the first processing module 1302 is specifically configured to:
aiming at any one second grid in the second feature map, acquiring a plurality of feature points in the second grid;
determining a correlation parameter corresponding to each feature point, wherein the correlation parameter is used for indicating the degree of correlation between the feature point and the second grid;
and obtaining grid features corresponding to the second grid according to the correlation degree parameters corresponding to the feature points, wherein the second grid features comprise grid features of a plurality of second grids in the second feature diagram.
In one possible design, the second processing module 1303 is specifically configured to:
for any one first feature map, performing region division on the first feature map to obtain N × M first regions, where N and M are integers greater than or equal to 1;
aiming at any one second feature map, carrying out region division on the second feature map to obtain NxM second regions;
and obtaining the feature information according to the first region of each first feature map, the second region of each second feature map, each first grid feature and each second grid feature.
In one possible design, the second processing module 1303 is specifically configured to:
determining each first region and each second region at the same position as a region set according to the first region of each first feature map and the second region of each second feature map;
for any one of the area sets, inputting a first grid feature corresponding to each first area in the area set and a second grid feature corresponding to each second area in the area set to a self-attention network, so that the self-attention network outputs sub-feature information corresponding to the area set;
and splicing the sub-feature information of each region set to obtain the feature information.
In one possible design, the second processing module is further configured to:
determining the N × M region sets as an original layer after determining each first region and each second region at the same position as a region set according to the first region of each first feature map and the second region of each second feature map;
performing T times of downsampling processing on the N × M region sets in the original layer to obtain T downsampling layers, where the i-th downsampling layer includes Pi × Qi region sets, T is an integer greater than or equal to 1, Pi and Qi are integers greater than or equal to 1, Pi is less than N, Qi is less than M, and i ranges from 1 to T.
In one possible design, the second processing module 1303 is further configured to:
after the sub-feature information of each region set is spliced to obtain the feature information, determining, for the i-th downsampling layer among the T downsampling layers, the respective sub-feature information of the Pi × Qi region sets in the downsampling layer;
splicing the sub-feature information of the Pi × Qi region sets to obtain the intermediate feature information of the i-th downsampling layer;
mapping the intermediate characteristic information of the ith down-sampling layer to the size of the characteristic information of the original layer to obtain the adjusted intermediate characteristic information, wherein the characteristic information of the original layer is the characteristic information corresponding to the multi-frame sample point cloud and the multi-frame sample image;
and fusing according to the adjusted intermediate characteristic information and the characteristic information of the original layer to obtain fused characteristic information.
In one possible design, the obtaining module 1301 is specifically configured to:
acquiring at least one group of original training data;
determining a sample point cloud and a sample image of at least one target object in the original training data;
acquiring at least one scene point cloud and a scene image;
and determining the at least one set of training data according to the at least one set of original training data, the sample point cloud and the sample image of the at least one target, the scene point cloud and the scene image.
In one possible design, the obtaining module 1301 is specifically configured to:
for any one target object, synthesizing the sample point cloud of the target object and the scene point cloud to obtain a synthesized point cloud, and synthesizing the sample image of the target object and the scene image to obtain a synthesized image;
determining consistency parameters corresponding to the synthetic point clouds and consistency parameters corresponding to the synthetic images;
determining the synthetic point cloud with the consistency parameter meeting a first preset condition as a target synthetic point cloud, and determining the synthetic image with the consistency parameter meeting a second preset condition as a target synthetic image;
determining the target synthetic point cloud and the target synthetic image as a set of synthetic training data;
determining the at least one set of raw training data and the at least one set of synthetic training data as the at least one set of training data.
In a possible design, the multi-frame sample point cloud includes a sample point cloud acquired at a first time and a multi-frame sample point cloud acquired before the first time, the multi-frame sample image includes a sample image acquired at the first time and a multi-frame sample image acquired before the first time, and the sample object detection result is a detection result corresponding to the sample point cloud acquired at the first time and the sample image.
The apparatus provided in this embodiment may be used to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 14 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present application. As shown in fig. 14, the apparatus 140 includes: a first obtaining module 1401, a second obtaining module 1402 and a processing module 1403.
A first obtaining module 1401, configured to obtain a first point cloud and a first image acquired at a first time;
a second obtaining module 1402, configured to obtain multiple frames of second point clouds and multiple frames of second images acquired before the first time;
a processing module 1403, configured to process the first point cloud, the first image, the multiple frames of second point clouds, and the multiple frames of second images according to a detection model to obtain object detection results corresponding to the first point cloud and the first image,
the detection model is obtained by training according to the detection model training method in the embodiment.
The apparatus provided in this embodiment may be used to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 15 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application, and as shown in fig. 15, an electronic device 150 according to the embodiment includes: a processor 1501 and a memory 1502; wherein
A memory 1502 for storing computer-executable instructions;
The processor 1501 is configured to execute the computer-executable instructions stored in the memory to implement the steps performed by the electronic device in the above method embodiments. Reference may be made in particular to the description of the method embodiments above.
Alternatively, the memory 1502 may be separate or integrated with the processor 1501.
When the memory 1502 is provided separately, the electronic device further includes a bus 1503 for connecting the memory 1502 and the processor 1501.
The embodiment of the present application further provides a computer-readable storage medium, where a computer execution instruction is stored, and when a processor executes the computer execution instruction, the detection model training method or the object detection method executed by the electronic device is implemented.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application.
It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile storage NVM, such as at least one disk memory, and may also be a usb disk, a removable hard disk, a read-only memory, a magnetic or optical disk, etc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (19)

1. A detection model training method is characterized by comprising the following steps:
acquiring at least one group of training data, wherein the training data comprises a plurality of frames of sample point clouds, a plurality of frames of sample images, and sample object detection results corresponding to the sample point clouds and the sample images;
processing the multi-frame sample point cloud and the multi-frame sample image according to a feature extraction network in a detection model to obtain feature information corresponding to the multi-frame sample point cloud and the multi-frame sample image;
processing the characteristic information according to a detection network in the detection model to obtain a first object detection result output by the object detection model;
and updating the model parameters of the detection model according to the first object detection result and the sample object detection result.
2. The method according to claim 1, wherein the feature extraction network comprises a feature encoding unit and a feature processing unit;
the processing the multi-frame sample point cloud and the multi-frame sample image according to the feature extraction network in the detection model to obtain the feature information corresponding to the multi-frame sample point cloud and the multi-frame sample image comprises the following steps:
processing the multiple frames of sample point clouds and the multiple frames of sample images according to the feature coding unit to obtain first grid features corresponding to the sample point clouds and second grid features corresponding to the sample images;
and processing each first grid feature and each second grid feature according to the feature processing unit to obtain the feature information.
3. The method according to claim 2, wherein the processing the plurality of sample point clouds and the plurality of sample images according to the feature encoding unit to obtain first grid features corresponding to the sample point clouds and second grid features corresponding to the sample images comprises:
for any frame of sample image, projecting the sample image onto corresponding sample point cloud according to calibration parameters between image acquisition equipment and point cloud acquisition equipment to obtain projected image information corresponding to the sample image;
obtaining a first feature map corresponding to each sample point cloud and a second feature map corresponding to each sample image according to the projected image information corresponding to each sample point cloud and each sample image;
obtaining a first grid feature corresponding to each sample point cloud according to the first feature map;
and obtaining a second grid feature corresponding to each sample image according to the second feature map.
4. The method according to claim 3, wherein obtaining a first feature map corresponding to each of the plurality of sample point clouds and a second feature map corresponding to each of the plurality of sample images according to the projected image information corresponding to each of the plurality of sample point clouds and the plurality of sample images comprises:
for any frame of the sample point cloud, projecting the sample point cloud onto a target image to obtain a first projection graph corresponding to the sample point cloud, wherein the first feature graph comprises at least one first grid;
performing feature extraction on the first projection drawing to obtain a first feature drawing corresponding to the sample point cloud;
for any frame of the sample image, projecting the projected image information corresponding to the sample image onto the target image to obtain a second projection image corresponding to the sample image, wherein the second feature image comprises at least one second grid;
and performing feature extraction on the second projection drawing to obtain a second feature drawing corresponding to the sample image.
5. The method of claim 4, wherein obtaining the first grid feature corresponding to each sample point cloud according to the first feature map comprises:
aiming at any one first grid in the first feature map, acquiring a plurality of feature points in the first grid;
determining a correlation parameter corresponding to each feature point, wherein the correlation parameter is used for indicating the degree of correlation between the feature point and the first grid;
obtaining grid features corresponding to the first grid according to the correlation degree parameters corresponding to the feature points, wherein the first grid features comprise grid features of a plurality of first grids in the first feature diagram.
6. The method according to claim 4, wherein obtaining a second grid feature corresponding to each sample image according to the second feature map comprises:
aiming at any one second grid in the second feature map, acquiring a plurality of feature points in the second grid;
determining a correlation parameter corresponding to each feature point, wherein the correlation parameter is used for indicating the degree of correlation between the feature point and the second grid;
and obtaining grid features corresponding to the second grid according to the correlation degree parameters corresponding to the feature points, wherein the second grid features comprise grid features of a plurality of second grids in the second feature diagram.
7. The method according to any one of claims 2 to 6, wherein the processing each of the first grid features and each of the second grid features according to the feature processing unit to obtain the feature information includes:
for any first feature map, performing region division on the first feature map to obtain N × M first regions, wherein N and M are integers greater than or equal to 1;
for any second feature map, performing region division on the second feature map to obtain N × M second regions;
and obtaining the feature information according to the first region of each first feature map, the second region of each second feature map, each first grid feature and each second grid feature.
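A minimal sketch of the region division in claim 7: a feature map of shape (C, H, W) is split into N × M equal regions by reshaping, under the assumption that H and W are divisible by N and M; the concrete sizes are illustrative.

```python
import torch

def split_regions(feat: torch.Tensor, n: int, m: int) -> torch.Tensor:
    """Split a (C, H, W) feature map into N x M regions of shape
    (C, H // N, W // M); assumes H % N == 0 and W % M == 0."""
    c, h, w = feat.shape
    regions = feat.reshape(c, n, h // n, m, w // m)
    return regions.permute(1, 3, 0, 2, 4).reshape(n * m, c, h // n, w // m)

regions = split_regions(torch.randn(64, 200, 176), n=4, m=4)  # (16, 64, 50, 44)
```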
8. The method according to claim 7, wherein obtaining the feature information according to the first region of each of the first feature maps, the second region of each of the second feature maps, each of the first grid features, and each of the second grid features comprises:
determining each first region and each second region at the same position as a region set according to the first region of each first feature map and the second region of each second feature map;
for any one of the region sets, inputting a first grid feature corresponding to each first region in the region set and a second grid feature corresponding to each second region in the region set to a self-attention network, so that the self-attention network outputs sub-feature information corresponding to the region set;
and concatenating the sub-feature information of each region set to obtain the feature information.
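A minimal sketch of claim 8: the grid features of all co-located regions (one region per frame of point cloud and per frame of image) are flattened into one token sequence and passed through a self-attention network; the sub-feature information of every region set is then concatenated into the feature information. The single encoder layer, head count and token sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

dim = 64
attention = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

def fuse_region_set(region_tokens: torch.Tensor) -> torch.Tensor:
    """region_tokens: (frames, tokens_per_region, dim) grid features of all
    co-located regions in one region set. Returns the sub-feature information."""
    seq = region_tokens.reshape(1, -1, dim)   # one token sequence per region set
    return attention(seq)                     # (1, frames * tokens_per_region, dim)

# Example: 4 point-cloud frames + 4 image frames, 50 grid features per region,
# and a 4 x 4 division, i.e. 16 region sets.
sub_feature_info = [fuse_region_set(torch.randn(8, 50, dim)) for _ in range(16)]
feature_information = torch.cat(sub_feature_info, dim=1)   # concatenate all sets
```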
9. The method according to claim 7 or 8, wherein after determining each first region and each second region at the same position as a region set according to the first region of each first feature map and the second region of each second feature map, the method further comprises:
determining the N × M region sets as an original layer;
performing T times of downsampling processing on the N × M region sets in the original layer to obtain T downsampling layers, wherein the i-th downsampling layer comprises P_i × Q_i region sets, T is an integer greater than or equal to 1, P_i and Q_i are integers greater than or equal to 1, P_i is less than N, Q_i is less than M, and i takes values from 1 to T.
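A minimal sketch of claim 9, assuming each downsampling step halves the region-set grid by average pooling, so the i-th layer holds P_i × Q_i region sets with P_i less than N and Q_i less than M; the pooling operator and stride are illustrative choices, not the downsampling defined by the patent.

```python
import torch
import torch.nn.functional as F

def build_pyramid(region_grid: torch.Tensor, t: int):
    """region_grid: (C, N, M), one feature vector per region set of the
    original layer. Returns [original layer, layer_1, ..., layer_T], each
    downsampled layer holding a coarser grid of region sets."""
    layers = [region_grid]
    cur = region_grid.unsqueeze(0)                 # (1, C, N, M)
    for _ in range(t):
        cur = F.avg_pool2d(cur, kernel_size=2)     # halve both grid dimensions
        layers.append(cur.squeeze(0))
    return layers

layers = build_pyramid(torch.randn(64, 8, 8), t=2)   # grids: 8x8, 4x4, 2x2
```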
10. The method according to claim 9, wherein after the sub-feature information of each region set is concatenated to obtain the feature information, the method further comprises:
for the i-th downsampling layer of the T downsampling layers, determining the sub-feature information of each of the P_i × Q_i region sets in the downsampling layer;
concatenating the sub-feature information of the P_i × Q_i region sets to obtain the intermediate feature information of the i-th downsampling layer;
mapping the intermediate feature information of the i-th downsampling layer to the size of the feature information of the original layer to obtain adjusted intermediate feature information, wherein the feature information of the original layer is the feature information corresponding to the multi-frame sample point cloud and the multi-frame sample image;
and fusing the adjusted intermediate feature information with the feature information of the original layer to obtain fused feature information.
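A minimal sketch of claim 10, assuming bilinear interpolation as the mapping back to the original layer's size and element-wise addition as the fusion operation; both are illustrative substitutes for whatever mapping and fusion the detection model actually uses.

```python
import torch
import torch.nn.functional as F

def fuse_layers(original: torch.Tensor, intermediates: list) -> torch.Tensor:
    """original: (C, H, W) feature information of the original layer;
    intermediates: list of (C, h_i, w_i) intermediate feature information."""
    fused = original.clone()
    for inter in intermediates:
        up = F.interpolate(inter.unsqueeze(0), size=original.shape[1:],
                           mode="bilinear", align_corners=False)
        fused = fused + up.squeeze(0)              # fuse with the original layer
    return fused

fused = fuse_layers(torch.randn(64, 8, 8),
                    [torch.randn(64, 4, 4), torch.randn(64, 2, 2)])
```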
11. The method according to any one of claims 1-9, wherein said obtaining at least one set of training data comprises:
acquiring at least one set of original training data;
determining a sample point cloud and a sample image of at least one target object in the original training data;
acquiring at least one scene point cloud and a scene image;
and determining the at least one set of training data according to the at least one set of original training data, the sample point cloud and the sample image of the at least one target object, the scene point cloud, and the scene image.
12. The method of claim 11, wherein determining the at least one set of training data according to the at least one set of original training data, the sample point cloud and the sample image of the at least one target object, the scene point cloud, and the scene image comprises:
for any one target object, synthesizing the sample point cloud of the target object and the scene point cloud to obtain a synthesized point cloud, and synthesizing the sample image of the target object and the scene image to obtain a synthesized image;
determining consistency parameters respectively corresponding to the synthesized point clouds, and determining consistency parameters respectively corresponding to the synthesized images;
determining the synthesized point cloud with the consistency parameter meeting a first preset condition as a target synthesized point cloud, and determining the synthesized image with the consistency parameter meeting a second preset condition as a target synthesized image;
determining the target synthesized point cloud and the target synthesized image as a set of synthetic training data;
determining the at least one set of raw training data and the at least one set of synthetic training data as the at least one set of training data.
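A minimal sketch of the data synthesis in claim 12: an object's sample point cloud is merged into a scene point cloud and a simple consistency parameter (here, a nearest-neighbour clearance check via a k-d tree) decides whether the synthesized result is kept. The consistency measure and the threshold are assumptions, not the preset conditions defined by the claim.

```python
import numpy as np
from scipy.spatial import cKDTree

def synthesize(obj_points: np.ndarray, scene_points: np.ndarray,
               min_clearance: float = 0.5):
    """Paste an object's sample point cloud into a scene point cloud and
    return the synthesized cloud plus a crude consistency parameter."""
    merged = np.vstack([scene_points, obj_points])
    # Consistency: fraction of object points lying farther than min_clearance
    # from every pre-existing scene point (a simple collision check).
    dists, _ = cKDTree(scene_points).query(obj_points, k=1)
    consistency = float((dists > min_clearance).mean())
    return merged, consistency

# Keep only synthesized point clouds whose consistency parameter meets the
# (assumed) first preset condition.
obj = np.random.rand(100, 3) * 5.0 + 20.0      # stand-in sample point cloud
scene = np.random.rand(5000, 3) * 50.0         # stand-in scene point cloud
merged, c = synthesize(obj, scene)
if c > 0.95:
    target_synthesized_point_cloud = merged
```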
13. The method according to any one of claims 1 to 12, wherein the multi-frame sample point cloud includes a sample point cloud acquired at a first time and multiple frames of sample point clouds acquired before the first time, the multi-frame sample image includes a sample image acquired at the first time and multiple frames of sample images acquired before the first time, and the sample object detection result is a detection result corresponding to the sample point cloud and the sample image acquired at the first time.
14. An object detection method, comprising:
acquiring a first point cloud and a first image acquired at a first moment;
acquiring multiple frames of second point clouds and multiple frames of second images acquired before the first moment;
processing the first point cloud, the first image, the plurality of frames of second point clouds and the plurality of frames of second images according to a detection model to obtain object detection results corresponding to the first point cloud and the first image,
wherein the detection model is a model trained according to the method of any one of claims 1 to 13.
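A minimal sketch of how the object detection method of claim 14 could be invoked at inference time; `DetectionModel`, its input shapes and its output format are hypothetical placeholders rather than an interface defined by the patent.

```python
import torch
import torch.nn as nn

class DetectionModel(nn.Module):
    """Hypothetical stand-in for the trained detection model of claims 1-13."""
    def forward(self, pc_t, img_t, pcs_prev, imgs_prev):
        # A real model would fuse the point clouds and images; here we only
        # return an empty set of boxes to illustrate the calling convention.
        return {"boxes": torch.zeros(0, 7), "scores": torch.zeros(0)}

model = DetectionModel().eval()
with torch.no_grad():
    result = model(
        torch.randn(20000, 4),                          # first point cloud (x, y, z, intensity)
        torch.randn(3, 375, 1242),                      # first image
        [torch.randn(20000, 4) for _ in range(3)],      # second point clouds (earlier frames)
        [torch.randn(3, 375, 1242) for _ in range(3)],  # second images (earlier frames)
    )
# `result` is the object detection result for the first point cloud and image.
```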
15. A detection model training apparatus, comprising:
an acquisition module, which is used for acquiring at least one set of training data, wherein the training data comprises a plurality of frames of sample point clouds, a plurality of frames of sample images, and sample object detection results corresponding to the sample point clouds and the sample images;
the first processing module is used for processing the multi-frame sample point cloud and the multi-frame sample image according to a feature extraction network in a detection model to obtain feature information corresponding to the multi-frame sample point cloud and the multi-frame sample image;
the second processing module is used for processing the feature information according to a detection network in the detection model to obtain a first object detection result output by the detection model;
and the updating module is used for updating the model parameters of the detection model according to the first object detection result and the sample object detection result.
16. An object detection apparatus, characterized by comprising:
the first acquisition module is used for acquiring a first point cloud and a first image acquired at a first moment;
the second acquisition module is used for acquiring a plurality of frames of second point clouds and a plurality of frames of second images acquired before the first moment;
a processing module, configured to process the first point cloud, the first image, the plurality of frames of second point clouds, and the plurality of frames of second images according to a detection model to obtain object detection results corresponding to the first point cloud and the first image,
wherein the detection model is a model trained according to the method of any one of claims 1 to 13.
17. An electronic device, comprising:
a memory for storing a program;
a processor for executing the program stored by the memory, the processor being configured to perform the method of any of claims 1 to 13 or claim 14 when the program is executed.
18. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 13 or claim 14.
19. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 13 or claim 14.
CN202210615907.3A 2022-05-31 2022-05-31 Detection model training method and device and object detection method and device Pending CN115019034A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210615907.3A CN115019034A (en) 2022-05-31 2022-05-31 Detection model training method and device and object detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210615907.3A CN115019034A (en) 2022-05-31 2022-05-31 Detection model training method and device and object detection method and device

Publications (1)

Publication Number Publication Date
CN115019034A true CN115019034A (en) 2022-09-06

Family

ID=83070045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210615907.3A Pending CN115019034A (en) 2022-05-31 2022-05-31 Detection model training method and device and object detection method and device

Country Status (1)

Country Link
CN (1) CN115019034A (en)

Similar Documents

Publication Publication Date Title
CN113468967B (en) Attention mechanism-based lane line detection method, attention mechanism-based lane line detection device, attention mechanism-based lane line detection equipment and attention mechanism-based lane line detection medium
CN111832655A (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN113761999A (en) Target detection method and device, electronic equipment and storage medium
CN114049356B (en) Method, device and system for detecting structure apparent crack
CN116484971A (en) Automatic driving perception self-learning method and device for vehicle and electronic equipment
CN114758337B (en) Semantic instance reconstruction method, device, equipment and medium
CN114519819B (en) Remote sensing image target detection method based on global context awareness
CN117079163A (en) Aerial image small target detection method based on improved YOLOX-S
CN115661767A (en) Image front vehicle target identification method based on convolutional neural network
CN113361528B (en) Multi-scale target detection method and system
CN116258756B (en) Self-supervision monocular depth estimation method and system
US20230401837A1 (en) Method for training neural network model and method for generating image
CN115240168A (en) Perception result obtaining method and device, computer equipment and storage medium
CN116012483A (en) Image rendering method and device, storage medium and electronic equipment
CN115019034A (en) Detection model training method and device and object detection method and device
CN114359891A (en) Three-dimensional vehicle detection method, system, device and medium
CN115249269A (en) Object detection method, computer program product, storage medium, and electronic device
CN114155524A (en) Single-stage 3D point cloud target detection method and device, computer equipment and medium
CN113569600A (en) Method and device for identifying weight of object, electronic equipment and storage medium
CN117593470B (en) Street view reconstruction method and system based on AI model
González et al. Leveraging dynamic occupancy grids for 3D object detection in point clouds
CN116612059B (en) Image processing method and device, electronic equipment and storage medium
Sierra-González et al. Leveraging dynamic occupancy grids for 3d object detection in point clouds
CN115457490A (en) Vehicle target understanding method and system based on single model
CN117274175A (en) Insulator defect detection method based on improved neural network model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230706

Address after: 311121 Room 413, Floor 4, Building 3, No. 969, Wenyi West Road, Wuchang Subdistrict, Yuhang District, Hangzhou City, Zhejiang Province

Applicant after: Zhejiang Cainiao Chuancheng Network Technology Co.,Ltd.

Address before: Room 554, 5 / F, building 3, 969 Wenyi West Road, Wuchang Street, Yuhang District, Hangzhou City, Zhejiang Province

Applicant before: Alibaba (China) Co.,Ltd.