CN115761723A - 3D target detection method and device based on multi-sensor fusion - Google Patents

3D target detection method and device based on multi-sensor fusion

Info

Publication number
CN115761723A
CN115761723A (application number CN202211427433.6A)
Authority
CN
China
Prior art keywords
point cloud
target
target detection
training
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211427433.6A
Other languages
Chinese (zh)
Inventor
陆强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Network Technology Shanghai Co Ltd
Original Assignee
International Network Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Network Technology Shanghai Co Ltd filed Critical International Network Technology Shanghai Co Ltd
Priority to CN202211427433.6A
Publication of CN115761723A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a 3D target detection method and device based on multi-sensor fusion. The method comprises the following steps: acquiring target images of at least two continuous frames and point cloud data corresponding to the target images in time sequence; and inputting the acquired point cloud data and the target images into a 3D target detection model to obtain a target detection result output by the 3D target detection model. The 3D target detection model performs time sequence fusion and modal fusion on the features respectively extracted from the point cloud data and the target images, and then performs 3D target detection to obtain the target detection result. The 3D target detection model is obtained by training based on training images, image feature true values corresponding to the training images, point cloud training data, point cloud true values corresponding to the point cloud training data, and target true values. By detecting the target to be detected through the 3D target detection model and combining the time sequence fusion of each modality, the invention improves the accuracy of 3D detection, in particular the target detection rate and the estimation of motion information such as velocity.

Description

3D target detection method and device based on multi-sensor fusion
Technical Field
The invention relates to the technical field of image processing, in particular to a 3D target detection method and device based on multi-sensor fusion.
Background
In an automatic driving system, 3D object detection is a very important task in the perception module. Compared with planar (2D) object detection, it can directly provide the real position, shape, size, orientation and category of objects in the surrounding environment, and thus directly provides a basis for downstream decision-making such as prediction, planning and motion control.
For camera images, relatively mature network designs exist that capture the high-level features describing object categories. However, detection results obtained from planar images alone cannot directly provide useful decision bases for the automatic driving task, since information such as the precise position, shape and size of the target is missing. In an automatic driving system, the point cloud data of another sensor, the lidar, provides real spatial information of the surrounding 3D space, such as distance depth and surface shape. However, the sensing characteristics of the lidar determine the disorder and sparsity of the point cloud data, and such unstructured data are difficult to process directly with a conventional convolutional neural network.
In recent years, the task of image-based target detection has developed rapidly. Lidar-based 3D target detection methods can generally provide precise localization by means of the distance information contained in the point cloud, but the sparsity of the point cloud means that such methods perform well only on large objects, and processing methods based on point cloud voxels or projections lose detail and struggle with small objects. Camera-based methods can better exploit the dense image data to detect small objects, but depth ambiguity and perspective deformation make them uncompetitive in localization and shape estimation, so their overall performance lags considerably. Even the fusion-based algorithms mostly fuse only at the level of a single-frame view or of region proposals, which both damages the point cloud information and ignores the detail of the object surface structure; some fusion-based detection algorithms first generate region proposals on the image and only then acquire 3D information from the corresponding part of the point cloud, so their performance is directly limited by factors such as the perspective deformation and mutual occlusion of objects in the image.
Disclosure of Invention
The invention provides a 3D target detection method and device based on multi-sensor fusion, which are used to overcome the defect of inaccurate 3D target detection results in the prior art and to improve 3D detection precision through time sequence fusion of multiple modalities.
The invention provides a 3D target detection method based on multi-sensor fusion, which comprises the following steps: acquiring target images of at least two continuous frames and point cloud data corresponding to the target images in time sequence; inputting the acquired point cloud data and the target images into a 3D target detection model to obtain a target detection result output by the 3D target detection model; wherein the 3D target detection model performs time sequence fusion and modal fusion on the features respectively extracted from the point cloud data and the target images, and then performs 3D target detection to obtain the target detection result; and the 3D target detection model is obtained by training based on a training image, an image feature true value corresponding to the training image, point cloud training data, a point cloud true value corresponding to the point cloud training data, and a target true value.
According to the 3D target detection method based on multi-sensor fusion provided by the invention, the 3D target detection model comprises: a point cloud feature extraction layer, used for respectively extracting features of each frame of input point cloud data to obtain point cloud features corresponding to each frame of point cloud data; an image feature extraction layer, used for respectively extracting features of each frame of input target image to obtain image features corresponding to each frame of target image; a time sequence fusion layer, which carries out time sequence fusion based on the extracted image features of each frame and the input point cloud data of each frame to obtain time sequence fusion features;
the modal fusion layer is used for performing modal fusion on the time sequence fusion feature and the time sequence point cloud feature corresponding to the time sequence fusion feature to obtain a modal fusion feature; and the target detection layer is used for carrying out 3D target detection on the modal fusion characteristics to obtain a target detection result.
According to the 3D target detection method based on multi-sensor fusion provided by the invention, the time sequence fusion is carried out based on the extracted image features of each frame and the input point cloud data of each frame, and the method comprises the following steps: according to the input point cloud data of each frame, carrying out view transformation on the corresponding frame image characteristics to obtain view transformation characteristics; and selecting the view transformation characteristics of two adjacent frames, and mapping the view transformation characteristics of the previous frame to the view transformation characteristics of the current frame to obtain the time sequence fusion characteristics.
According to the 3D target detection method based on multi-sensor fusion provided by the invention, the view transformation is carried out on the corresponding frame image characteristics according to the input point cloud data of each frame, and the method comprises the following steps: respectively projecting each frame of point cloud data to corresponding frame image features, and endowing each pixel position of the corresponding image features with depth dimension coordinates based on the depth information of the point cloud data to obtain projection fusion features with three-dimensional pixel coordinates; and converting the pixel coordinates of the projection fusion characteristics to a radar coordinate system based on a preset coordinate conversion matrix to obtain view transformation characteristics.
According to the 3D target detection method based on multi-sensor fusion provided by the invention, the characteristic extraction is respectively carried out on the input point cloud data of each frame, and the method comprises the following steps: respectively uniformly dividing single-frame point cloud data into a plurality of grid columns with preset sizes; and respectively extracting the features of each grid column based on a preset sampling threshold value to obtain the corresponding point cloud features.
According to the 3D target detection method based on multi-sensor fusion provided by the invention, based on the preset sampling threshold, the characteristic extraction is respectively carried out on each grid column, and the method comprises the following steps: based on the fact that the number of point clouds in a single grid column is not smaller than the preset sampling threshold value, sampling point clouds in corresponding grid columns based on the preset sampling threshold value to obtain sampling features; based on the fact that the number of point clouds in a single grid column is smaller than the sampling threshold value, filling the corresponding grid column according to preset numerical values to obtain filling characteristics; and obtaining corresponding frame point cloud characteristics based on all the sampling characteristics and all the filling characteristics.
According to the 3D target detection method based on multi-sensor fusion provided by the invention, the training of the 3D target detection model comprises the following steps: acquiring a multi-frame training image, multi-frame point cloud training data and a target true value of the same target; the point cloud training data is used as input data for training, the target truth value is used as a label, and a pre-constructed 3D target detection model is trained to obtain a first training model; taking the training image as input data used for training, taking the target truth value as a label, and training the first training model to obtain a second training model; and taking the point cloud training data and the training image as input data for training, taking the target truth value as a label, and training the second training model to obtain a 3D target detection model.
The invention also provides a 3D target detection device based on multi-sensor fusion, which comprises: the data acquisition module is used for acquiring target images of at least two continuous frames and point cloud data corresponding to the target image time sequence; the 3D target detection module is used for inputting the acquired point cloud data and the target image into a 3D target detection model to obtain a target detection result output by the 3D target detection model; the 3D target detection model carries out time sequence fusion and modal fusion based on the characteristics respectively extracted from the point cloud data and the target image, and then carries out 3D target detection to obtain a target detection result; the 3D target detection model is obtained based on training of a training image, an image characteristic true value corresponding to the training image, point cloud training data, a point cloud true value corresponding to the point cloud training data and a target true value.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the multi-sensor fusion-based 3D target detection method.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the multi-sensor fusion based 3D object detection method as described in any of the above.
The present invention also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of any one of the above-described multi-sensor fusion based 3D object detection methods.
According to the 3D target detection method and device based on multi-sensor fusion provided by the invention, the target to be detected is detected through the 3D target detection model, and by combining the time sequence fusion of each modality the accuracy of 3D detection is improved, in particular the target detection rate and the estimation of motion information such as velocity; in addition, single-modality input is supported, so that the failure of one sensor does not affect the normal operation of the model.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a 3D object detection method based on multi-sensor fusion provided by the invention;
FIG. 2 is a second schematic flowchart of a 3D object detection method based on multi-sensor fusion according to the present invention;
FIG. 3 is a schematic structural diagram of a 3D object detection device based on multi-sensor fusion provided by the invention;
fig. 4 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 shows a flow chart of a 3D target detection method based on multi-sensor fusion, which includes:
s11, acquiring target images of at least two continuous frames and point cloud data corresponding to the target images in time sequence;
s12, inputting the acquired point cloud data and the target image into a 3D target detection model to obtain a target detection result output by the 3D target detection model; the 3D target detection model carries out time sequence fusion and modal fusion based on the characteristics respectively extracted from the point cloud data and the target image, and then carries out 3D target detection to obtain a target detection result; the 3D target detection model is obtained based on training images, image characteristic true values corresponding to the training images, point cloud training data, point cloud true values corresponding to the point cloud training data and target true values.
It should be noted that the step numbers S11, S12, etc. in this specification do not represent the execution order of the multi-sensor fusion-based 3D target detection method; the multi-sensor fusion-based 3D target detection method of the present invention is described in detail below with reference to fig. 2.
And S11, acquiring target images of at least two continuous frames and point cloud data corresponding to the target image time sequence.
In this embodiment, acquiring target images of at least two continuous frames includes: acquiring a video stream over a target time period for a target area to be detected; and extracting video frames from the video stream to obtain at least two continuous frames of target images. It should be added that, after the video frames are extracted, the method further includes: sampling the extracted video frames based on a preset sampling frequency to obtain the target images.
In an alternative embodiment, acquiring at least two consecutive frames of the target image comprises: acquiring images to be detected which are continuously shot in a target time period of a target area to be detected; and sampling from the continuously shot detection images based on a preset sampling frequency to obtain a target image.
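By way of illustration only, the following Python sketch shows one way to sample consecutive target images from a video stream at a preset sampling frequency; the use of OpenCV, the file path and the frequency value are assumptions of this example rather than part of the disclosed method.

```python
# Illustrative sketch only: sample consecutive target images from a video
# stream at a preset sampling frequency (assumed helper, not the patented code).
import cv2

def sample_target_images(video_path: str, sample_hz: float, num_frames: int = 2):
    """Return `num_frames` consecutive frames sampled at roughly `sample_hz`."""
    cap = cv2.VideoCapture(video_path)
    video_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(video_fps / sample_hz))   # frames to skip between samples
    frames, index = [], 0
    while len(frames) < num_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```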
The target image is captured by a camera on the vehicle, for example a camera mounted on the vehicle body. In addition, the vehicle may be a car, a ship, an aircraft or another vehicle for carrying people or goods; the vehicle may be a private car or an operating vehicle, such as a shared car, an online ride-hailing car, a taxi, a bus, a school bus, a truck, a coach, a train, a subway or a tram.
In addition, the point cloud data corresponding to the target image time sequence is obtained, and the method comprises the following steps: acquiring a radar point cloud data set aiming at a target area to be detected; and acquiring point cloud data of corresponding time sequence from the radar point cloud data set based on the time sequence corresponding to each frame of target image.
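A minimal sketch of this time-sequence matching between image frames and radar sweeps is given below; the timestamp lists and the tolerance value are illustrative assumptions only.

```python
# Pair each image frame with the lidar sweep closest to it in time
# (assumed nearest-timestamp strategy; tolerance max_dt is a placeholder).
def match_pointclouds_to_images(image_stamps, lidar_stamps, max_dt=0.05):
    """For each image timestamp, return the index of the nearest lidar sweep, or None."""
    matches = []
    for t_img in image_stamps:
        j = min(range(len(lidar_stamps)), key=lambda k: abs(lidar_stamps[k] - t_img))
        matches.append(j if abs(lidar_stamps[j] - t_img) <= max_dt else None)
    return matches
```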
S12, inputting the acquired point cloud data and the target image into a 3D target detection model to obtain a target detection result output by the 3D target detection model; the 3D target detection model carries out time sequence fusion and modal fusion based on the characteristics respectively extracted from the point cloud data and the target image, and then carries out 3D target detection to obtain a target detection result; the 3D target detection model is obtained based on training images, image characteristic true values corresponding to the training images, point cloud training data, point cloud true values corresponding to the point cloud training data and target true values.
In this embodiment, referring to fig. 2, the 3D target detection model includes: the point cloud feature extraction layer, which is used for respectively extracting features of the input point cloud data of each frame to obtain point cloud features corresponding to each frame of point cloud data; the image feature extraction layer, which is used for respectively extracting features of each input frame of target image to obtain image features corresponding to each frame of target image; the time sequence fusion layer, which carries out time sequence fusion based on the extracted image features of each frame and the input point cloud data of each frame to obtain time sequence fusion features; the modal fusion layer, which is used for performing modal fusion on the time sequence fusion features and the corresponding time sequence point cloud features to obtain modal fusion features; and the target detection layer, which is used for carrying out 3D target detection on the modal fusion features to obtain a target detection result.
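The patent does not disclose concrete network structures for these layers. The following PyTorch sketch is therefore only a structural stand-in showing how the five described layers could be composed; every sub-module passed to the constructor is a placeholder assumption.

```python
# Structural sketch only: composition of the five described layers.
import torch.nn as nn

class MultiSensorFusion3DDetector(nn.Module):
    def __init__(self, pc_backbone, img_backbone, temporal_fusion, modal_fusion, det_head):
        super().__init__()
        self.pc_backbone = pc_backbone          # point cloud feature extraction layer
        self.img_backbone = img_backbone        # image feature extraction layer
        self.temporal_fusion = temporal_fusion  # time sequence fusion layer
        self.modal_fusion = modal_fusion        # modal fusion layer
        self.det_head = det_head                # target detection layer

    def forward(self, point_clouds, images):
        # per-frame features for each modality
        pc_feats = [self.pc_backbone(pc) for pc in point_clouds]
        img_feats = [self.img_backbone(im) for im in images]
        # time sequence fusion uses the image features together with the raw point clouds
        seq_feat = self.temporal_fusion(img_feats, point_clouds)
        # modal fusion combines the fused feature with the corresponding point cloud
        # feature (here assumed to be the current frame's feature)
        fused = self.modal_fusion(seq_feat, pc_feats[-1])
        # the head outputs center heatmap hm, 3D size prediction and offset error reg
        return self.det_head(fused)
```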
It should be noted that the target detection result includes a center point prediction result hm, a 3D size prediction result, and a prediction error reg. It should be added that hm is defined on a down-sampled grid of the original resolution: the true center point coordinates are divided by a preset value and rounded down to integers, so the quantization error between hm and the actual center point position is carried by reg, which makes it convenient to recover the actual center point position from hm and reg together.
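The following worked example (with an assumed down-sampling factor of 4) illustrates why reg is needed to recover the actual center position from the down-sampled hm grid.

```python
# Worked example: hm lives on a grid down-sampled by `stride`, so the
# quantization remainder must be stored in reg (stride and coordinates assumed).
import numpy as np

stride = 4                              # assumed down-sampling factor
true_center = np.array([123.7, 58.2])   # center in input-resolution coordinates
hm_cell = np.floor(true_center / stride).astype(int)   # integer heatmap cell: [30, 14]
reg = true_center / stride - hm_cell                   # fractional error stored in reg
# at inference the actual center is recovered from the peak cell plus reg:
recovered = (hm_cell + reg) * stride                   # == [123.7, 58.2]
```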
Specifically, the method for extracting features of each frame of point cloud data comprises the following steps: respectively uniformly dividing single-frame point cloud data into a plurality of grid columns with preset sizes; and respectively extracting the features of each grid column based on a preset sampling threshold value to obtain the corresponding point cloud features.
Furthermore, the method for uniformly dividing the single-frame point cloud data into a plurality of grid columns with preset sizes includes: dividing single-frame point cloud data into a plurality of grids with preset sizes; and extending each grid in the Z-axis direction to obtain a corresponding grid column.
In addition, based on the preset sampling threshold, the feature extraction is respectively carried out on each grid column, and the method comprises the following steps: based on the fact that the number of point clouds in a single grid column is not smaller than a preset sampling threshold, sampling the point clouds in the corresponding grid columns based on the preset sampling threshold to obtain sampling characteristics; based on the fact that the number of point clouds in a single grid column is smaller than a sampling threshold value, filling the corresponding grid column according to preset numerical values to obtain filling characteristics; and obtaining corresponding frame point cloud characteristics based on all the sampling characteristics and all the filling characteristics.
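A hedged sketch of this grid-column step is given below; the cell size, detection range and sampling threshold are assumed values, not ones specified by the patent.

```python
# Sketch of dividing a single sweep into XY grid columns (extended along Z) and
# sampling or zero-padding each occupied column to a fixed number of points.
import numpy as np

def pillarize(points, cell=0.16, max_points=32, x_range=(0.0, 70.4), y_range=(-40.0, 40.0)):
    """points: (N, C) array with x, y, z, ... -> dict {(ix, iy): (max_points, C)}."""
    in_range = (points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) & \
               (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1])
    points = points[in_range]
    ix = ((points[:, 0] - x_range[0]) // cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) // cell).astype(int)
    pillars = {}
    for key in set(zip(ix.tolist(), iy.tolist())):
        mask = (ix == key[0]) & (iy == key[1])
        pts = points[mask]
        if len(pts) >= max_points:            # enough points: randomly sample down
            sel = np.random.choice(len(pts), max_points, replace=False)
            pillars[key] = pts[sel]
        else:                                 # too few points: pad with a preset value (0)
            pad = np.zeros((max_points - len(pts), points.shape[1]), dtype=points.dtype)
            pillars[key] = np.concatenate([pts, pad], axis=0)
    return pillars
```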
It should be noted that, in an alternative embodiment, feature extraction may be performed on each frame of input point cloud data by a pillar feature network (PFN, Pillar Feature Net), so as to convert the input point cloud data into a sparse, regularly structured representation.
In addition, in this embodiment, performing time-series fusion based on the extracted image features of each frame and the input point cloud data of each frame includes: according to the input point cloud data of each frame, carrying out view transformation on the corresponding frame image characteristics to obtain view transformation characteristics; and selecting the view transformation characteristics of two adjacent frames, and mapping the view transformation characteristics of the previous frame to the view transformation characteristics of the current frame to obtain the time sequence fusion characteristics.
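The patent does not state how the previous frame's view transformation features are mapped onto the current frame. The sketch below assumes this mapping is a bird's-eye-view warp by the ego-motion between the two frames followed by channel concatenation; both choices are illustrative assumptions.

```python
# Assumed temporal fusion: warp the previous frame's view-transformed (BEV)
# feature into the current frame's coordinates, then concatenate channels.
import torch
import torch.nn.functional as F

def fuse_adjacent_bev(prev_bev, curr_bev, theta):
    """prev_bev, curr_bev: (1, C, H, W) BEV features; theta: (1, 2, 3) affine ego-motion."""
    grid = F.affine_grid(theta, size=list(prev_bev.shape), align_corners=False)
    prev_warped = F.grid_sample(prev_bev, grid, align_corners=False)  # previous frame in current coords
    return torch.cat([prev_warped, curr_bev], dim=1)                  # time sequence fusion feature
```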
Furthermore, according to the input point cloud data of each frame, the view transformation is performed on the corresponding image features of the frame, which includes: respectively projecting each frame of point cloud data to the corresponding frame of image features, and endowing each pixel position of the corresponding image features with a depth dimension coordinate based on the depth information of the point cloud data to obtain a projection fusion feature with a three-dimensional pixel coordinate; and converting the pixel coordinates of the projection fusion characteristics to a radar coordinate system based on a preset coordinate conversion matrix to obtain view transformation characteristics.
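The following sketch illustrates such a view transformation under assumed pinhole intrinsics K and a lidar-to-camera extrinsic matrix T_lidar2cam (both placeholders): the lidar points give each covered pixel a depth coordinate, and the resulting three-dimensional pixel coordinates are then transformed into the radar coordinate system.

```python
# Assumed view transformation: project lidar points into the image to assign
# per-pixel depth, lift those pixels to 3D, and express them in the lidar frame.
import numpy as np

def view_transform(points_lidar, K, T_lidar2cam, img_hw):
    """points_lidar: (N, >=3); K: 3x3 intrinsics; T_lidar2cam: 4x4 extrinsics."""
    H, W = img_hw
    # 1) project lidar points into the image to give pixel positions a depth dimension
    homog = np.c_[points_lidar[:, :3], np.ones(len(points_lidar))]
    pts_cam = (T_lidar2cam @ homog.T).T[:, :3]
    uvz = (K @ pts_cam.T).T
    uvz = uvz[uvz[:, 2] > 1e-6]                       # keep points in front of the camera
    u, v, z = uvz[:, 0] / uvz[:, 2], uvz[:, 1] / uvz[:, 2], uvz[:, 2]
    depth = np.zeros((H, W))
    valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    depth[v[valid].astype(int), u[valid].astype(int)] = z[valid]
    # 2) lift every pixel that received a depth back to 3D camera coordinates
    vv, uu = np.nonzero(depth)
    zz = depth[vv, uu]
    xyz_cam = (np.linalg.inv(K) @ np.c_[uu * zz, vv * zz, zz].T).T
    # 3) express those 3D pixel positions in the radar (lidar) coordinate system
    T_cam2lidar = np.linalg.inv(T_lidar2cam)
    xyz_lidar = (T_cam2lidar @ np.c_[xyz_cam, np.ones(len(xyz_cam))].T).T[:, :3]
    return xyz_lidar
```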
In an alternative embodiment, training a 3D object detection model comprises: acquiring a multi-frame training image, multi-frame point cloud training data and a target true value of the same target; the method comprises the steps that point cloud training data are used as input data used for training, a target truth value is used as a label, a pre-constructed 3D target detection model is trained, and a first training model is obtained; training the first training model by taking the training image as input data used for training and taking the target truth value as a label to obtain a second training model; and taking the point cloud training data and the training image as input data used for training, taking the target truth value as a label, and training the second training model to obtain the 3D target detection model.
It should be noted that, when the pre-constructed 3D target detection model is trained, only point cloud training data are input and no training images are input, so that the model is trained for the case where only point clouds are available; when the first training model is trained, only training images are input and no point cloud training data are input, so that the model is trained for the case where only images are available; and when the second training model is trained, the networks other than the target detection layer are frozen, namely the point cloud feature extraction layer, the image feature extraction layer, the time sequence fusion layer and the modal fusion layer in the present application, so that only the target detection layer is trained. This further improves the detection precision of the whole 3D target detection model, in particular the target detection rate and the estimation of motion information such as velocity; in addition, single-modality input is supported, so that the model can still run normally when one of the sensors fails.
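The three-stage schedule described above can be outlined as follows; the optimizer, loss function, epoch count and data loaders are placeholders, and the freezing step simply follows the statement that all layers except the target detection layer are frozen in the third stage.

```python
# Outline of the assumed three-stage training procedure (placeholders throughout).
import torch

def freeze(module):
    for p in module.parameters():
        p.requires_grad = False

def train_stage(model, loader, loss_fn, epochs=10, lr=1e-3):
    optim = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(epochs):
        for batch in loader:
            pred = model(batch["point_clouds"], batch["images"])
            loss = loss_fn(pred, batch["target_truth"])
            optim.zero_grad(); loss.backward(); optim.step()

# stage 1: only point cloud training data are input            -> first training model
# stage 2: only training images are input                      -> second training model
# stage 3: both modalities, with every layer except the target detection layer frozen:
# freeze(model.pc_backbone); freeze(model.img_backbone)
# freeze(model.temporal_fusion); freeze(model.modal_fusion)
# train_stage(model, fused_loader, loss_fn)                    -> final 3D target detection model
```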
In summary, the embodiment of the present invention detects the target to be detected through the 3D target detection model and, by combining the time sequence fusion of each modality, improves the precision of 3D detection, in particular the target detection rate and the estimation of motion information such as velocity; in addition, single-modality input is supported, so that the failure of one sensor does not affect the normal operation of the model.
The following describes the multi-sensor fusion-based 3D object detection apparatus provided in the present invention, and the multi-sensor fusion-based 3D object detection apparatus described below and the multi-sensor fusion-based 3D object detection method described above may be referred to in correspondence with each other.
Fig. 3 shows a schematic structural diagram of a 3D object detection device based on multi-sensor fusion, which includes:
the data acquisition module 31 is used for acquiring target images of at least two continuous frames and point cloud data corresponding to the target image time sequence;
the 3D target detection module 32 inputs the acquired point cloud data and the target image into the 3D target detection model to obtain a target detection result output by the 3D target detection model; the 3D target detection model carries out time sequence fusion and modal fusion based on the characteristics respectively extracted from the point cloud data and the target image, and then carries out 3D target detection to obtain a target detection result; the 3D target detection model is obtained based on training images, image characteristic true values corresponding to the training images, point cloud training data, point cloud true values corresponding to the point cloud training data and target true values.
In this embodiment, the data acquiring module 31 includes: the video acquisition unit is used for acquiring a video stream in a target time period for a target area to be detected; and the image extraction unit is used for extracting the video frames based on the video stream to obtain at least two continuous frames of target images. It should be added that the data obtaining module 31 further includes: and the sampling unit is used for sampling the extracted video frame based on a preset sampling frequency to obtain a target image.
In an optional embodiment, the data obtaining module 31 includes: acquiring images to be detected which are continuously shot in a target time period of a target area to be detected; and sampling from the continuously shot detection images based on a preset sampling frequency to obtain a target image.
In addition, the data acquisition module 31 further includes: the point cloud data acquisition unit is used for acquiring a radar point cloud data set aiming at a target area to be detected; and the point cloud data screening unit is used for acquiring point cloud data of corresponding time sequences from the radar point cloud data set based on the time sequences corresponding to the target images of the frames.
The 3D target detection module 32 comprises a data input unit, a 3D target detection model unit and a data output unit, wherein the data input unit is used for inputting the acquired point cloud data and the target image into the 3D target detection model unit; the 3D target detection model unit is used for carrying out target detection according to the input point cloud data and the target image to obtain a target detection result; and the data output unit is used for outputting the target detection result obtained by the 3D target detection model unit.
Still further, a 3D object detection model unit, comprising: the point cloud feature extraction unit is used for respectively extracting features of the input point cloud data of each frame to obtain point cloud features corresponding to the point cloud data of each frame; the image feature extraction unit is used for respectively extracting features of each input frame of target image to obtain image features corresponding to each frame of target image; the time sequence fusion unit is used for carrying out time sequence fusion on the basis of the extracted image features of each frame and the input point cloud data of each frame to obtain time sequence fusion features; the modal fusion unit is used for performing modal fusion on the time sequence fusion feature and the point cloud feature of the time sequence corresponding to the time sequence fusion feature to obtain a modal fusion feature; and the target detection unit is used for carrying out 3D target detection on the mode fusion characteristics to obtain a target detection result.
Wherein, the point cloud characteristic extraction unit includes: the grid dividing subunit uniformly divides the single-frame point cloud data into a plurality of grid columns with preset sizes; and the point cloud feature extraction subunit is used for respectively extracting features of each grid column based on a preset sampling threshold value to obtain corresponding point cloud features.
Further, the grid dividing subunit includes: a mesh division unit which divides the single-frame point cloud data into a plurality of meshes with preset sizes; and the grid column acquiring unit is used for extending each grid in the Z-axis direction to obtain a corresponding grid column.
The point cloud feature extraction subunit comprises: a sampling subunit, which, when the number of point clouds in a single grid column is not less than the preset sampling threshold, samples the point clouds in the corresponding grid column based on the preset sampling threshold to obtain sampling features; a filling subunit, which, when the number of point clouds in a single grid column is less than the sampling threshold, fills the corresponding grid column with preset values to obtain filling features; and a point cloud feature obtaining subunit, which obtains the point cloud features of the corresponding frame based on all the sampling features and all the filling features.
In addition, in this embodiment, the timing fusion unit includes: the view transformation subunit performs view transformation on the corresponding frame image characteristics according to the input point cloud data of each frame to obtain view transformation characteristics; and the time sequence feature fusion subunit selects the view transformation features of two adjacent frames, and maps the view transformation features of the previous frame to the view transformation features of the current frame to obtain the time sequence fusion features.
Further, the view transformation subunit includes: the depth information fusion unit is used for respectively projecting each frame of point cloud data to the corresponding frame of image characteristics, endowing each pixel position of the corresponding image characteristics with a depth dimension coordinate based on the depth information of the point cloud data, and obtaining a projection fusion characteristic with a three-dimensional pixel coordinate; and the coordinate conversion unit is used for converting the pixel coordinates of the projection fusion characteristics to a radar coordinate system based on a preset coordinate conversion matrix to obtain the view transformation characteristics.
In an optional embodiment, the apparatus further comprises a training module for training the 3D object detection model. Specifically, a training module comprising: the training data acquisition unit is used for acquiring multi-frame training images, multi-frame point cloud training data and a target true value of the same target; the first training unit is used for training a pre-constructed 3D target detection model by taking point cloud training data as input data used for training and target truth values as labels to obtain a first training model; the second training unit is used for training the first training model by taking the training image as input data used for training and taking the target truth value as a label to obtain a second training model; and the third training unit is used for training the second training model by taking the point cloud training data and the training image as input data used for training and taking the target truth value as a label to obtain the 3D target detection model.
In summary, the embodiment of the present invention detects the target to be detected through the 3D target detection model and, by combining the time sequence fusion of each modality, improves the precision of 3D detection, in particular the target detection rate and the estimation of motion information such as velocity; in addition, single-modality input is supported, so that the failure of one sensor does not affect the normal operation of the model.
Fig. 4 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 4: a processor (processor) 41, a communication Interface (communication Interface) 42, a memory (memory) 43 and a communication bus 44, wherein the processor 41, the communication Interface 42 and the memory 43 complete communication with each other through the communication bus 44. Processor 41 may invoke logic instructions in memory 43 to perform a multi-sensor fusion based 3D object detection method comprising: acquiring target images of at least two continuous frames and point cloud data corresponding to the target image time sequence; inputting the acquired point cloud data and the target image into a 3D target detection model to obtain a target detection result output by the 3D target detection model; the 3D target detection model carries out time sequence fusion and modal fusion based on the characteristics respectively extracted from the point cloud data and the target image, and then carries out 3D target detection to obtain a target detection result; the 3D target detection model is obtained based on training images, image characteristic true values corresponding to the training images, point cloud training data, point cloud true values corresponding to the point cloud training data and target true values.
Furthermore, the logic instructions in the memory 43 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being stored on a non-transitory computer-readable storage medium, wherein when the computer program is executed by a processor, a computer is capable of executing the multi-sensor fusion-based 3D object detection method provided by the above methods, the method comprising: acquiring target images of at least two continuous frames and point cloud data corresponding to the target image time sequence; inputting the acquired point cloud data and the target image into a 3D target detection model to obtain a target detection result output by the 3D target detection model; the 3D target detection model carries out time sequence fusion and modal fusion based on the characteristics respectively extracted from the point cloud data and the target image, and then carries out 3D target detection to obtain a target detection result; the 3D target detection model is obtained based on training images, image characteristic true values corresponding to the training images, point cloud training data, point cloud true values corresponding to the point cloud training data and target true values.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program being implemented by a processor to perform the multi-sensor fusion-based 3D object detection method provided by the above methods, the method including: acquiring target images of at least two continuous frames and point cloud data corresponding to the time sequence of the target images; inputting the acquired point cloud data and the target image into a 3D target detection model to obtain a target detection result output by the 3D target detection model; the 3D target detection model carries out time sequence fusion and modal fusion based on the characteristics respectively extracted from the point cloud data and the target image, and then carries out 3D target detection to obtain a target detection result; the 3D target detection model is obtained based on training images, image characteristic true values corresponding to the training images, point cloud training data, point cloud true values corresponding to the point cloud training data and target true values.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A 3D target detection method based on multi-sensor fusion, characterized by comprising the following steps:
acquiring target images of at least two continuous frames and point cloud data corresponding to the target images in time sequence;
inputting the acquired point cloud data and the target image into a 3D target detection model to obtain a target detection result output by the 3D target detection model;
wherein the 3D target detection model performs time sequence fusion and modal fusion on the features respectively extracted from the point cloud data and the target image, and then performs 3D target detection to obtain the target detection result;
the 3D target detection model is obtained based on training of a training image, an image characteristic true value corresponding to the training image, point cloud training data, a point cloud true value corresponding to the point cloud training data and a target true value.
2. The multi-sensor fusion-based 3D object detection method according to claim 1, wherein the 3D object detection model comprises:
the point cloud feature extraction layer is used for respectively extracting features of the input point cloud data of each frame to obtain point cloud features corresponding to the point cloud data of each frame;
the image feature extraction layer is used for respectively extracting features of input target images of each frame to obtain image features corresponding to the target images of each frame;
the time sequence fusion layer carries out time sequence fusion based on the extracted image characteristics of each frame and the input point cloud data of each frame to obtain time sequence fusion characteristics;
the modal fusion layer is used for performing modal fusion on the time sequence fusion feature and the time sequence point cloud feature corresponding to the time sequence fusion feature to obtain a modal fusion feature;
and the target detection layer is used for carrying out 3D target detection on the modal fusion characteristics to obtain a target detection result.
3. The multi-sensor fusion-based 3D object detection method according to claim 2, wherein the performing time-series fusion based on the extracted image features of each frame and the input point cloud data of each frame comprises:
according to the input point cloud data of each frame, carrying out view transformation on the corresponding frame image characteristics to obtain view transformation characteristics;
and selecting the view transformation characteristics of two adjacent frames, and mapping the view transformation characteristics of the previous frame to the view transformation characteristics of the current frame to obtain the time sequence fusion characteristics.
4. The multi-sensor fusion-based 3D target detection method according to claim 3, wherein the view transformation of the corresponding frame image features according to the input frame point cloud data comprises:
respectively projecting each frame of point cloud data to corresponding frame image features, and endowing each pixel position of the corresponding image features with depth dimension coordinates based on the depth information of the point cloud data to obtain projection fusion features with three-dimensional pixel coordinates;
and converting the pixel coordinates of the projection fusion characteristics to a radar coordinate system based on a preset coordinate conversion matrix to obtain view transformation characteristics.
5. The multi-sensor fusion-based 3D target detection method according to claim 2, wherein the performing feature extraction on each frame of point cloud data respectively comprises:
respectively uniformly dividing single-frame point cloud data into a plurality of grid columns with preset sizes;
and respectively extracting the features of each grid column based on a preset sampling threshold value to obtain the corresponding point cloud features.
6. The multi-sensor fusion-based 3D target detection method according to claim 5, wherein the performing feature extraction on each grid pillar based on a preset sampling threshold comprises:
based on the fact that the number of point clouds in a single grid column is not smaller than the preset sampling threshold, sampling the point clouds in the corresponding grid columns based on the preset sampling threshold to obtain sampling characteristics;
based on the fact that the number of point clouds in a single grid column is smaller than the sampling threshold value, filling the corresponding grid column according to preset numerical values to obtain filling characteristics;
and obtaining corresponding frame point cloud characteristics based on all the sampling characteristics and all the filling characteristics.
7. The multi-sensor fusion-based 3D object detection method of claim 2, wherein training the 3D object detection model comprises:
acquiring a multi-frame training image, multi-frame point cloud training data and a target true value of the same target;
the point cloud training data is used as input data for training, the target truth value is used as a label, and a pre-constructed 3D target detection model is trained to obtain a first training model;
training the first training model by taking the training image as input data used for training and the target truth value as a label to obtain a second training model;
and taking the point cloud training data and the training image as input data for training, taking the target truth value as a label, and training the second training model to obtain a 3D target detection model.
8. A 3D target detection device based on multi-sensor fusion, characterized by comprising:
the data acquisition module is used for acquiring target images of at least two continuous frames and point cloud data corresponding to the target image time sequence;
the 3D target detection module is used for inputting the acquired point cloud data and the target image into a 3D target detection model to obtain a target detection result output by the 3D target detection model;
the 3D target detection model carries out time sequence fusion and modal fusion based on the characteristics respectively extracted from the point cloud data and the target image, and then carries out 3D target detection to obtain a target detection result;
the 3D target detection model is obtained based on training of a training image, an image characteristic true value corresponding to the training image, point cloud training data, a point cloud true value corresponding to the point cloud training data and a target true value.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the multi-sensor fusion based 3D object detection method according to any of claims 1 to 7.
10. A non-transitory computer readable storage medium, having stored thereon a computer program, wherein the computer program, when being executed by a processor, is adapted to carry out the steps of the multi-sensor fusion based 3D object detection method according to any one of claims 1 to 7.
CN202211427433.6A 2022-11-14 2022-11-14 3D target detection method and device based on multi-sensor fusion Pending CN115761723A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211427433.6A CN115761723A (en) 2022-11-14 2022-11-14 3D target detection method and device based on multi-sensor fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211427433.6A CN115761723A (en) 2022-11-14 2022-11-14 3D target detection method and device based on multi-sensor fusion

Publications (1)

Publication Number Publication Date
CN115761723A true CN115761723A (en) 2023-03-07

Family

ID=85371220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211427433.6A Pending CN115761723A (en) 2022-11-14 2022-11-14 3D target detection method and device based on multi-sensor fusion

Country Status (1)

Country Link
CN (1) CN115761723A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740669A (en) * 2023-08-16 2023-09-12 之江实验室 Multi-view image detection method, device, computer equipment and storage medium
CN116740669B (en) * 2023-08-16 2023-11-14 之江实验室 Multi-view image detection method, device, computer equipment and storage medium
CN117312828A (en) * 2023-09-28 2023-12-29 光谷技术有限公司 Public facility monitoring method and system
CN117173692A (en) * 2023-11-02 2023-12-05 安徽蔚来智驾科技有限公司 3D target detection method, electronic device, medium and driving device
CN117173692B (en) * 2023-11-02 2024-02-02 安徽蔚来智驾科技有限公司 3D target detection method, electronic device, medium and driving device

Similar Documents

Publication Publication Date Title
CN115761723A (en) 3D target detection method and device based on multi-sensor fusion
CN108509820B (en) Obstacle segmentation method and device, computer equipment and readable medium
US11747444B2 (en) LiDAR-based object detection and classification
CN110163930A (en) Lane line generation method, device, equipment, system and readable storage medium storing program for executing
CN108470174B (en) Obstacle segmentation method and device, computer equipment and readable medium
CN111209825B (en) Method and device for dynamic target 3D detection
CN112001226B (en) Unmanned 3D target detection method, device and storage medium
CN112912890A (en) Method and system for generating synthetic point cloud data using generative models
KR102095842B1 (en) Apparatus for Building Grid Map and Method there of
CN110879994A (en) Three-dimensional visual inspection detection method, system and device based on shape attention mechanism
CN104077808A (en) Real-time three-dimensional face modeling method used for computer graph and image processing and based on depth information
CN112446227A (en) Object detection method, device and equipment
CN111257882B (en) Data fusion method and device, unmanned equipment and readable storage medium
US11544898B2 (en) Method, computer device and storage medium for real-time urban scene reconstruction
CN112883790A (en) 3D object detection method based on monocular camera
CN112154448A (en) Target detection method and device and movable platform
CN116740668B (en) Three-dimensional object detection method, three-dimensional object detection device, computer equipment and storage medium
CN116168384A (en) Point cloud target detection method and device, electronic equipment and storage medium
CN116310673A (en) Three-dimensional target detection method based on fusion of point cloud and image features
CN117197388A (en) Live-action three-dimensional virtual reality scene construction method and system based on generation of antagonistic neural network and oblique photography
CN115147798A (en) Method, model and device for predicting travelable area and vehicle
CN113255779B (en) Multi-source perception data fusion identification method, system and computer readable storage medium
CN114155414A (en) Novel unmanned-driving-oriented feature layer data fusion method and system and target detection method
WO2021098666A1 (en) Hand gesture detection method and device, and computer storage medium
CN117422884A (en) Three-dimensional target detection method, system, electronic equipment and storage medium

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination