CN113780257A - Multi-modal fusion weakly supervised vehicle target detection method and system

Info

Publication number
CN113780257A
Authority
CN
China
Prior art keywords
point cloud
prediction box
box
target detection
features
Prior art date
2021-11-12
Legal status
Granted
Application number
CN202111338590.5A
Other languages
Chinese (zh)
Other versions
CN113780257B (en)
Inventor
唐作进
戴捷
孙波
马铜伟
李道胜
Current Assignee
Zidong Information Technology Suzhou Co., Ltd.
Original Assignee
Zidong Information Technology Suzhou Co., Ltd.
Priority date
2021-11-12
Filing date
2021-11-12
Publication date
2021-12-10
Application filed by Zidong Information Technology Suzhou Co., Ltd.
Priority to CN202111338590.5A
Publication of CN113780257A
Application granted
Publication of CN113780257B
Legal status: Active

Classifications

    • G06F18/214 — G (Physics); G06 (Computing; Calculating or Counting); G06F (Electric Digital Data Processing); Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/253 — G (Physics); G06 (Computing; Calculating or Counting); G06F (Electric Digital Data Processing); Pattern recognition; Analysing; Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Traffic Control Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-modal fusion weakly supervised vehicle target detection method and system. The method comprises: acquiring 3D point cloud data and image data; acquiring 3D prediction box parameters and their features from the 3D point cloud data, and acquiring a 2D point cloud map and its features; fusing the features of the 3D prediction boxes with the features of the image to obtain first-stage fusion features and generating a 2D target detection box from them; fusing the features of the 3D prediction boxes with the features of the 2D point cloud map to obtain second-stage fusion features and generating 3D candidate prediction boxes from them; and filtering and screening the 3D candidate prediction boxes based on the 2D target detection box and a preset supervision confidence threshold between the image and the point cloud, then outputting a 3D target detection box for detecting the target object in the scene. Because the method obtains the point cloud features and image features without depending on labels, it greatly reduces the dependence of 3D target detection on semantic labels and significantly improves detection precision.

Description

Multi-modal fusion weakly supervised vehicle target detection method and system
Technical Field
The invention relates to the technical field of target detection, and in particular to a multi-modal fusion weakly supervised vehicle target detection method and system.
Background
A key task in scene understanding is the detection of three-dimensional objects, which has become a hot research problem in application fields such as autonomous driving. The purpose of 3D target detection is to detect and localize the 3D bounding box of each object from the input sensor data. Most existing 3D object detectors are based on fully supervised learning: a large number of 3D bounding boxes must be annotated manually in irregular point cloud data, and the time cost of this labeling process greatly limits the application of 3D target detection in scenes lacking 3D labels.
Weakly supervised detection can effectively reduce the dependence of target detection on training labels, but existing weakly supervised object detectors mainly address two-dimensional detection rather than three-dimensional detection. A method that achieves weakly supervised or even unsupervised learning for 3D object detection would greatly reduce the detector's dependence on training labels and cut labeling cost. Studying weakly supervised or semi-supervised 3D object detector models adapted to scenes lacking 3D labels therefore has very important practical significance.
On the other hand, most existing target detection methods rely on a visual sensor alone, which is highly susceptible to interference from illumination, visibility, and other outdoor factors, so the accuracy of depth information acquired by a visual sensor alone cannot be guaranteed.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the problems in the prior art and to provide a multi-modal fusion weakly supervised vehicle target detection method and system that obtain point cloud features and image features without depending on labels, greatly reducing the dependence of 3D target detection on semantic labels, significantly improving detection precision, and further improving the accuracy and applicability of target detection.
To solve the above technical problem, the invention provides a multi-modal fusion weakly supervised vehicle target detection method comprising the following steps:
acquiring 3D laser point cloud data and image data in a scene;
acquiring 3D prediction box parameters based on the 3D laser point cloud data, and performing grid-pooling feature extraction on the 3D prediction box parameters to obtain the features of the 3D prediction boxes; acquiring a 2D point cloud map based on the 3D laser point cloud data, and performing feature extraction on the 2D point cloud map to obtain its features;
fusing the features of the 3D prediction boxes with the features of the image obtained from the image data to obtain first-stage fusion features, and generating a 2D target detection box based on the first-stage fusion features; fusing the features of the 3D prediction boxes with the features of the 2D point cloud map to obtain second-stage fusion features, and generating 3D candidate prediction boxes based on the second-stage fusion features;
and filtering and screening the 3D candidate prediction boxes based on the 2D target detection box and a preset supervision confidence threshold between the image and the point cloud, and outputting a 3D target detection box for detecting the target object in the scene, wherein the 3D target detection box is the filtered and screened 3D candidate prediction box.
In one embodiment of the invention, the 3D laser point cloud data in the scene are acquired by a lidar device, and the image data in the scene are acquired by an RGB image acquisition device.
In one embodiment of the invention, the method for acquiring the 3D prediction box parameters based on the 3D laser point cloud data comprises the following steps:
presetting range anchors for the 3D laser point cloud data under ground-truth supervision, performing feature learning on the 3D laser point cloud data within the range anchors through a PointNet network to extract 3D laser point cloud features, and acquiring the 3D prediction box parameters based on those features.
In an embodiment of the invention, the method for performing grid-pooling feature extraction on the 3D prediction box parameters to obtain the features of the 3D prediction boxes comprises:
learning the 3D prediction box parameters with a PointNet network to obtain 3D prediction boxes with continuous parameters, then deleting overlapping 3D prediction boxes to obtain the features of the 3D prediction boxes.
In one embodiment of the invention, the method for acquiring a 2D point cloud map based on the 3D laser point cloud data comprises:
projecting the 3D laser point cloud data with the preset anchors to generate a 2D point cloud map based on the same anchors.
In one embodiment of the invention, the method for acquiring the features of the image based on the image data comprises:
extracting the features of the image from the image data with a trained pre-training model.
In one embodiment of the invention, the method for generating a 2D target detection box based on the first-stage fusion features comprises:
classifying, regressing, and projecting the first-stage fusion features to generate the 2D target detection box.
In one embodiment of the invention, generating 3D candidate prediction boxes based on the second-stage fusion features comprises:
feeding the second-stage fusion features into an attention-based encoder and decoder for processing to obtain the 3D candidate prediction boxes.
In one embodiment of the invention, the method for filtering and screening the 3D candidate prediction boxes based on the 2D target detection box and the preset supervision confidence threshold between the image and the point cloud comprises:
projecting each 3D candidate prediction box into a 2D candidate prediction box and judging whether the similarity between the 2D candidate prediction box and the 2D target detection box exceeds a preset similarity threshold; if not, continuing to traverse the 3D candidate prediction boxes; if so, further judging whether the confidence of the 3D candidate prediction box exceeds the preset supervision confidence threshold; if not, returning to traverse the 3D candidate prediction boxes; if so, outputting the 3D candidate prediction box.
In addition, the invention provides a multi-modal fusion weakly supervised vehicle target detection system comprising:
a data acquisition module, configured to acquire 3D laser point cloud data and image data in a scene;
a point cloud data processing module, configured to acquire 3D prediction box parameters based on the 3D laser point cloud data, perform grid-pooling feature extraction on the 3D prediction box parameters to obtain the features of the 3D prediction boxes, acquire a 2D point cloud map based on the 3D laser point cloud data, and perform feature extraction on the 2D point cloud map to obtain its features;
a feature fusion module, configured to fuse the features of the 3D prediction boxes with the features of the image obtained from the image data to obtain first-stage fusion features, generate a 2D target detection box based on the first-stage fusion features, fuse the features of the 3D prediction boxes with the features of the 2D point cloud map to obtain second-stage fusion features, and generate 3D candidate prediction boxes based on the second-stage fusion features;
and a network supervision module, configured to filter and screen the 3D candidate prediction boxes based on the 2D target detection box and the preset supervision confidence threshold between the image and the point cloud, and output a 3D target detection box for detecting the target object in the scene, wherein the 3D target detection box is the filtered and screened 3D candidate prediction box.
Compared with the prior art, the technical scheme of the invention has the following advantages:
the method obtains the point cloud features and the image features without depending on labels, greatly reducing the dependence of 3D target detection on semantic labels; it fuses the point cloud features with the image features and with the 2D point cloud map features in multiple stages, and outputs the 3D target detection box for detecting the target object in the scene under network supervision, significantly improving detection precision and further improving the accuracy and applicability of target detection.
Drawings
So that the present disclosure may be more readily and clearly understood, the invention is described in further detail below with reference to specific embodiments, examples of which are illustrated in the accompanying drawings.
FIG. 1 is a flow chart of the multi-modal fusion weakly supervised vehicle target detection method of the invention.
FIG. 2 is another flow chart of the multi-modal fusion weakly supervised vehicle target detection method of the invention.
FIG. 3 is a schematic diagram of the hardware structure of the multi-modal fusion weakly supervised vehicle target detection system of the invention.
Wherein the reference numerals are as follows: 10. data acquisition module; 20. point cloud data processing module; 30. feature fusion module; 40. network supervision module.
Detailed Description
The present invention is further described below in conjunction with the accompanying drawings and specific examples, so that those skilled in the art may better understand and practice it; the examples, however, are not intended to limit the invention.
Example one
Referring to FIG. 1 and FIG. 2, the present embodiment provides a multi-modal fusion weakly supervised vehicle target detection method comprising the following steps:
S100: acquiring 3D laser point cloud data and image data in a scene;
S200: acquiring 3D prediction box parameters based on the 3D laser point cloud data, and performing grid-pooling feature extraction on the 3D prediction box parameters to obtain the features of the 3D prediction boxes; acquiring a 2D point cloud map based on the 3D laser point cloud data, and performing feature extraction on the 2D point cloud map to obtain its features;
S300: fusing the features of the 3D prediction boxes with the features of the image obtained from the image data to obtain first-stage fusion features, and generating a 2D target detection box based on the first-stage fusion features; fusing the features of the 3D prediction boxes with the features of the 2D point cloud map to obtain second-stage fusion features, and generating 3D candidate prediction boxes based on the second-stage fusion features;
S400: filtering and screening the 3D candidate prediction boxes based on the 2D target detection box and the preset supervision confidence threshold between the image and the point cloud, and outputting a 3D target detection box for detecting the target object in the scene, wherein the 3D target detection box is the filtered and screened 3D candidate prediction box.
The scene described in the present disclosure may be a scene around the vehicle, including the scene in front and the scenes to the sides, for example the scene ahead of the host vehicle.
In the multi-modal fusion weakly supervised vehicle target detection method disclosed by the invention, the 3D laser point cloud data and the image data are acquired from the same scene.
The 3D laser point cloud data in the scene are acquired by a lidar device, and the image data in the scene are acquired by an RGB image acquisition device; for example, color image data can be acquired in an arbitrary scene with an ordinary RGB camera. A 32- or 64-line lidar is mounted above the vehicle; taking the lidar as the origin, the vehicle coordinate system is converted to the lidar point cloud coordinate system, and the 3D laser point cloud data are obtained through a rotation matrix and a translation matrix, as sketched below.
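As a hedged illustration of this coordinate conversion, the sketch below applies a rotation matrix R and a translation vector t to map lidar points into the vehicle coordinate system; the calibration values shown (identity rotation, a 1.9 m mount height) are illustrative assumptions, since the patent does not give the actual extrinsics.

```python
import numpy as np

def lidar_to_vehicle(points, R, t):
    """Map an (N, 3) array of lidar points into the vehicle coordinate system.

    points: (N, 3) xyz coordinates in the lidar coordinate system.
    R:      (3, 3) rotation matrix from lidar axes to vehicle axes.
    t:      (3,)   position of the lidar origin in the vehicle frame.
    """
    return points @ R.T + t

# Hypothetical calibration: axes aligned, lidar mounted 1.9 m above the
# vehicle origin. Real values come from extrinsic calibration.
R = np.eye(3)
t = np.array([0.0, 0.0, 1.9])

scan = np.random.rand(1000, 3) * 50.0   # stand-in for a 32/64-line lidar scan
scan_vehicle = lidar_to_vehicle(scan, R, t)
```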
By fusing the features of the 3D laser point cloud data and the image data from the same scene, the multi-modal fusion weakly supervised vehicle target detection method disclosed by the invention further improves the accuracy of target detection.
For the method of the above embodiment, in step S200, the method for acquiring the 3D prediction box parameters based on the 3D laser point cloud data comprises: presetting range anchors for the 3D laser point cloud data using a small amount of ground-truth supervision, performing feature learning on the 3D laser point cloud data within the range anchors through a PointNet network to extract 3D laser point cloud features, and acquiring the 3D prediction box parameters based on those features.
When the 3D laser point cloud data pass through the PointNet network, the dimension of the feature points, the number of seed points, and the feature radius r of the seed points are set according to the x, y, z coordinates (length, width, height) and the depth d of the data. Several feature extraction layers generate a small number of high-quality seed points carrying local proposals; these seed points serve as the center points of the 3D prediction boxes, and a VoteNet network votes on the center seed points to obtain the 3D prediction box parameters.
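A minimal sketch of the voting step described above, written in PyTorch; the module name `VotingLayer`, the 256-dimensional features, and the layer sizes are illustrative assumptions rather than the patent's exact network.

```python
import torch
import torch.nn as nn

class VotingLayer(nn.Module):
    """VoteNet-style voting: each seed point predicts an offset to an
    object center plus a feature residual, yielding box-center votes."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, 1), nn.ReLU(),
            nn.Conv1d(feat_dim, 3 + feat_dim, 1),  # 3 center offsets + residual
        )

    def forward(self, seed_xyz, seed_feat):
        # seed_xyz: (B, N, 3) seed coordinates; seed_feat: (B, C, N) seed features
        out = self.mlp(seed_feat)                 # (B, 3 + C, N)
        offset = out[:, :3, :].transpose(1, 2)    # (B, N, 3)
        vote_xyz = seed_xyz + offset              # voted 3D box centers
        vote_feat = seed_feat + out[:, 3:, :]     # updated seed features
        return vote_xyz, vote_feat
```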
For the method of the above embodiment, in step S200, the method for performing grid-pooling feature extraction on the 3D prediction box parameters to obtain the features of the 3D prediction boxes comprises: learning the 3D prediction box parameters with a PointNet network to obtain 3D prediction boxes with continuous parameters, then deleting overlapping 3D prediction boxes to obtain the features of the 3D prediction boxes.
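Deleting overlapping boxes is classically done with non-maximum suppression; the following is a minimal sketch under the assumption of axis-aligned bird's-eye-view boxes and an illustrative IoU threshold, since the patent does not specify the exact overlap criterion.

```python
import numpy as np

def bev_iou(a, b):
    """Axis-aligned IoU between two bird's-eye-view boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring box and delete boxes that overlap it."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[[bev_iou(boxes[i], boxes[j]) <= iou_thresh for j in rest]]
    return keep
```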
For the method of the above embodiment, in step S200, the method for acquiring a 2D point cloud map based on the 3D laser point cloud data comprises: projecting the 3D laser point cloud data with the preset anchors to generate a 2D point cloud map based on the same anchors. Because the point cloud data are affected by the sparsity of the lidar, the normalized point cloud density is used as the screening condition for the projection, after which features are extracted with a ResNet-50 residual network.
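A sketch of this projection step, assuming an illustrative grid range, resolution, and density threshold (none of which the patent fixes): points are binned onto a 2D grid, the per-cell count is normalized, and sparse cells are screened out before the map is passed to the ResNet-50 feature extractor.

```python
import numpy as np

def point_cloud_to_bev_density(points, x_range=(0.0, 70.0),
                               y_range=(-40.0, 40.0),
                               resolution=0.1, density_thresh=0.05):
    """Project 3D points onto a 2D grid and keep cells whose normalized
    point density passes the screening threshold (lidar sparsity guard)."""
    xs = ((points[:, 0] - x_range[0]) / resolution).astype(int)
    ys = ((points[:, 1] - y_range[0]) / resolution).astype(int)
    h = int((x_range[1] - x_range[0]) / resolution)
    w = int((y_range[1] - y_range[0]) / resolution)
    valid = (xs >= 0) & (xs < h) & (ys >= 0) & (ys < w)

    density = np.zeros((h, w), dtype=np.float32)
    np.add.at(density, (xs[valid], ys[valid]), 1.0)   # count points per cell
    density /= density.max() + 1e-9                   # normalized density
    density[density < density_thresh] = 0.0           # projection screening
    return density
```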
For the method of the above embodiment, in step S300, the method for acquiring the features of the image based on the image data comprises: extracting the features of the image from the image data with a trained pre-training model.
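One common realization of such a pre-training model, shown here as a hedged example, is a torchvision ResNet truncated before its classification head so that it emits a feature map; the specific backbone and weights are assumptions, since the patent does not name them.

```python
import torch
import torchvision

# Truncate a pretrained ResNet-50 before its pooling/classification head
# so it outputs a spatial feature map instead of class logits.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])
feature_extractor.eval()

with torch.no_grad():
    image = torch.rand(1, 3, 375, 1242)      # stand-in for one RGB frame
    features = feature_extractor(image)      # (1, 2048, 12, 39) feature map
```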
For the method of the above embodiment, in step S300, the method for generating a 2D target detection box based on the first-stage fusion features comprises: classifying, regressing, and projecting the first-stage fusion features to generate the 2D target detection box.
For the method of the above embodiment, also in step S300, generating 3D candidate prediction boxes based on the second-stage fusion features comprises: feeding the second-stage fusion features into an attention-based encoder and decoder for processing to obtain the 3D candidate prediction boxes. Specifically, the encoder and decoder comprise a query matrix, a key matrix, a value matrix, and multiple attention heads. A single attention head first takes the key matrix and the value matrix as input and performs feature cross-computation after a linear transformation; a position mask is then added so that the features between the global proposals and the local proposals are further learned; a Softmax layer then computes the score of the predicted target at each position, and the linearly transformed query matrix is directly fused with the predicted-target-score features and output.
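A minimal sketch of one such attention head, assuming scaled dot-product attention and an additive position mask; the exact projections and the fusion with the prediction scores in the patent's encoder-decoder may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHead(nn.Module):
    """Single attention head: linear projections of Q, K, V, a position
    mask added to the scores, and a Softmax over positions."""
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, query, key, value, pos_mask=None):
        # query: (B, Nq, D); key/value: (B, Nk, D); pos_mask: (B, Nq, Nk) or None
        q, k, v = self.q_proj(query), self.k_proj(key), self.v_proj(value)
        scores = q @ k.transpose(-2, -1) * self.scale  # feature cross-computation
        if pos_mask is not None:
            scores = scores + pos_mask                 # additive position mask
        attn = F.softmax(scores, dim=-1)               # per-position scores
        return attn @ v
```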
The 2D target detection box generated in step S300 is obtained by learning and training on image data of any type in any scene, and 2D object detection methods achieve good detection accuracy. The whole pipeline that finally generates the 2D target detection box from the image collected by the RGB color camera through multi-layer feature extraction can be regarded as a teacher network, and the whole pipeline that generates the 3D candidate prediction boxes from the point cloud data scanned by the lidar through feature extraction and feature fusion can be regarded as a student network.
For the method of the above embodiment, in step S400, the method for filtering and screening the 3D candidate prediction boxes based on the 2D target detection box and the preset supervision confidence threshold between the image and the point cloud comprises: projecting each 3D candidate prediction box into a 2D candidate prediction box and judging whether the similarity between the 2D candidate prediction box and the 2D target detection box exceeds the preset similarity threshold; if not, continuing to traverse the 3D candidate prediction boxes; if so, further judging whether the confidence of the 3D candidate prediction box exceeds the preset supervision confidence threshold; if not, returning to traverse the 3D candidate prediction boxes; if so, outputting the 3D candidate prediction box. A sketch of this screening loop follows.
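In the sketch below, the helper names `project_to_2d` and `iou_2d` and the two threshold values are hypothetical placeholders for the projection, the similarity measure, and the preset thresholds that the patent leaves unspecified.

```python
def filter_candidates(cands_3d, det_2d, project_to_2d, iou_2d,
                      sim_thresh=0.5, conf_thresh=0.7):
    """Traverse the 3D candidate prediction boxes; keep a candidate only if
    its 2D projection is similar enough to the teacher's 2D detection box
    AND its own confidence clears the supervision confidence threshold.

    cands_3d:      iterable of (box3d, confidence) pairs
    det_2d:        the 2D target detection box from the teacher network
    project_to_2d: maps a 3D box to its 2D image-plane box
    iou_2d:        similarity (e.g. IoU) between two 2D boxes
    """
    kept = []
    for box3d, conf in cands_3d:
        box2d = project_to_2d(box3d)
        if iou_2d(box2d, det_2d) <= sim_thresh:
            continue                  # similarity too low: next candidate
        if conf <= conf_thresh:
            continue                  # confidence below supervision threshold
        kept.append(box3d)            # output this 3D candidate
    return kept
```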
For the method of the above embodiment, the teacher network is used to supervise the student network, and the student network learns knowledge from the teacher network and evaluates its own target detection network. The confidence between the student network and the teacher network is evaluated against the preset supervision confidence threshold; when the confidence with respect to the teacher network is greater than or equal to the set threshold, the student network, under the supervision of the teacher network, finally outputs the 3D target detection box for detecting the target object in the scene, where the 3D target detection box is the filtered and screened 3D candidate prediction box.
The method obtains the point cloud features and the image features without depending on labels, greatly reducing the dependence of 3D target detection on semantic labels; it fuses the point cloud features with the image features and with the 2D point cloud map features in multiple stages, and outputs the 3D target detection box for detecting the target object in the scene under network supervision, significantly improving detection precision and further improving the accuracy and applicability of target detection.
Example two
The following introduces the multi-modal fusion weakly supervised vehicle target detection system disclosed by the second embodiment of the invention; the system described below and the method described above may be referred to correspondingly.
Referring to FIG. 3, the second embodiment of the invention discloses a multi-modal fusion weakly supervised vehicle target detection system, which comprises the following modules.
A data acquisition module 10, configured to acquire 3D laser point cloud data and image data in a scene;
a point cloud data processing module 20, configured to acquire 3D prediction box parameters based on the 3D laser point cloud data, perform grid-pooling feature extraction on the 3D prediction box parameters to obtain the features of the 3D prediction boxes, acquire a 2D point cloud map based on the 3D laser point cloud data, and perform feature extraction on the 2D point cloud map to obtain its features;
a feature fusion module 30, configured to fuse the features of the 3D prediction boxes with the features of the image obtained from the image data to obtain first-stage fusion features, generate a 2D target detection box based on the first-stage fusion features, fuse the features of the 3D prediction boxes with the features of the 2D point cloud map to obtain second-stage fusion features, and generate 3D candidate prediction boxes based on the second-stage fusion features;
and a network supervision module 40, configured to filter and screen the 3D candidate prediction boxes based on the 2D target detection box and the preset supervision confidence threshold between the image and the point cloud, and output a 3D target detection box for detecting the target object in the scene, wherein the 3D target detection box is the filtered and screened 3D candidate prediction box.
The multi-modal fusion weakly supervised vehicle target detection system may include corresponding modules that perform each or several of the steps in the above flow charts. Thus, each step or several steps in the flow charts may be performed by a respective module, and the system may comprise one or more of these modules. A module may be one or more hardware modules specifically configured to perform the respective step, may be implemented by a processor configured to perform the respective step, may be stored within a computer-readable medium for implementation by a processor, or some combination thereof.
The hardware architecture may be implemented using a bus architecture. The bus architecture may include any number of interconnecting buses and bridges depending on the specific application of the hardware and the overall design constraints. The bus connects together various circuits including one or more processors, memories, and/or hardware modules. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, external antennas, and the like.
Since the multi-modal fusion weakly supervised vehicle target detection system of this embodiment is used to implement the multi-modal fusion weakly supervised vehicle target detection method described above, its specific implementation and functions correspond to those of the method and can be found in the description of the method embodiment above; they are not repeated here.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to exhaust all embodiments here. Obvious variations or modifications may be made without departing from the spirit or scope of the invention.

Claims (10)

1. A multi-modal fusion weakly supervised vehicle target detection method, characterized by comprising the following steps:
acquiring 3D laser point cloud data and image data in a scene;
acquiring 3D prediction box parameters based on the 3D laser point cloud data, and performing grid-pooling feature extraction on the 3D prediction box parameters to obtain the features of the 3D prediction boxes; acquiring a 2D point cloud map based on the 3D laser point cloud data, and performing feature extraction on the 2D point cloud map to obtain its features;
fusing the features of the 3D prediction boxes with the features of the image obtained from the image data to obtain first-stage fusion features, and generating a 2D target detection box based on the first-stage fusion features; fusing the features of the 3D prediction boxes with the features of the 2D point cloud map to obtain second-stage fusion features, and generating 3D candidate prediction boxes based on the second-stage fusion features;
and filtering and screening the 3D candidate prediction boxes based on the 2D target detection box and a preset supervision confidence threshold between the image and the point cloud, and outputting a 3D target detection box for detecting the target object in the scene, wherein the 3D target detection box is the filtered and screened 3D candidate prediction box.
2. The multi-modal fusion weakly supervised vehicle target detection method of claim 1, wherein: the 3D laser point cloud data in the scene are acquired by a lidar device, and the image data in the scene are acquired by an RGB image acquisition device.
3. The multi-modal fusion weakly supervised vehicle target detection method of claim 1, wherein the method for acquiring the 3D prediction box parameters based on the 3D laser point cloud data comprises:
presetting range anchors for the 3D laser point cloud data under ground-truth supervision, performing feature learning on the 3D laser point cloud data within the range anchors through a PointNet network to extract 3D laser point cloud features, and acquiring the 3D prediction box parameters based on those features.
4. The multi-modal fusion weakly supervised vehicle target detection method of claim 1, wherein the method for performing grid-pooling feature extraction on the 3D prediction box parameters to obtain the features of the 3D prediction boxes comprises:
learning the 3D prediction box parameters with a PointNet network to obtain 3D prediction boxes with continuous parameters, then deleting overlapping 3D prediction boxes to obtain the features of the 3D prediction boxes.
5. The multi-modal fusion weakly supervised vehicle target detection method of claim 1, wherein the method for acquiring a 2D point cloud map based on the 3D laser point cloud data comprises:
projecting the 3D laser point cloud data with the preset anchors to generate a 2D point cloud map based on the same anchors.
6. The multi-modal fusion weakly supervised vehicle target detection method of claim 1, wherein the method for acquiring the features of the image based on the image data comprises:
extracting the features of the image from the image data with a trained pre-training model.
7. The multi-modal fusion weakly supervised vehicle target detection method of claim 1, wherein the method for generating a 2D target detection box based on the first-stage fusion features comprises:
classifying, regressing, and projecting the first-stage fusion features to generate the 2D target detection box.
8. The multi-modal fusion weakly supervised vehicle target detection method of claim 1, wherein generating 3D candidate prediction boxes based on the second-stage fusion features comprises:
feeding the second-stage fusion features into an attention-based encoder and decoder for processing to obtain the 3D candidate prediction boxes.
9. The multi-modal fusion weakly supervised vehicle target detection method of claim 1, wherein the method for filtering and screening the 3D candidate prediction boxes based on the 2D target detection box and the preset supervision confidence threshold between the image and the point cloud comprises:
projecting each 3D candidate prediction box into a 2D candidate prediction box and judging whether the similarity between the 2D candidate prediction box and the 2D target detection box exceeds a preset similarity threshold; if not, continuing to traverse the 3D candidate prediction boxes; if so, further judging whether the confidence of the 3D candidate prediction box exceeds the preset supervision confidence threshold; if not, returning to traverse the 3D candidate prediction boxes; if so, outputting the 3D candidate prediction box.
10. A multi-modal fusion weakly supervised vehicle target detection system, characterized by comprising:
a data acquisition module, configured to acquire 3D laser point cloud data and image data in a scene;
a point cloud data processing module, configured to acquire 3D prediction box parameters based on the 3D laser point cloud data, perform grid-pooling feature extraction on the 3D prediction box parameters to obtain the features of the 3D prediction boxes, acquire a 2D point cloud map based on the 3D laser point cloud data, and perform feature extraction on the 2D point cloud map to obtain its features;
a feature fusion module, configured to fuse the features of the 3D prediction boxes with the features of the image obtained from the image data to obtain first-stage fusion features, generate a 2D target detection box based on the first-stage fusion features, fuse the features of the 3D prediction boxes with the features of the 2D point cloud map to obtain second-stage fusion features, and generate 3D candidate prediction boxes based on the second-stage fusion features;
and a network supervision module, configured to filter and screen the 3D candidate prediction boxes based on the 2D target detection box and the preset supervision confidence threshold between the image and the point cloud, and output a 3D target detection box for detecting the target object in the scene, wherein the 3D target detection box is the filtered and screened 3D candidate prediction box.
CN202111338590.5A (priority 2021-11-12, filed 2021-11-12): Multi-modal fusion weakly supervised vehicle target detection method and system. Active. Granted publication: CN113780257B (en).

Priority Applications (1)

Application Number: CN202111338590.5A — Priority Date: 2021-11-12 — Filing Date: 2021-11-12 — Title: Multi-modal fusion weakly supervised vehicle target detection method and system (granted as CN113780257B)


Publications (2)

Publication Number — Publication Date
CN113780257A — 2021-12-10
CN113780257B — 2022-02-22

Family

ID=78873883

Family Applications (1)

Application Number: CN202111338590.5A — Title: Multi-modal fusion weakly supervised vehicle target detection method and system — Priority Date: 2021-11-12 — Filing Date: 2021-11-12 — Status: Active, granted as CN113780257B (en)

Country Status (1)

Country Link
CN (1) CN113780257B (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171217A (en) * 2018-01-29 2018-06-15 深圳市唯特视科技有限公司 A kind of three-dimension object detection method based on converged network
CN109543601A (en) * 2018-11-21 2019-03-29 电子科技大学 A kind of unmanned vehicle object detection method based on multi-modal deep learning
US20210082181A1 (en) * 2019-06-17 2021-03-18 Sensetime Group Limited Method and apparatus for object detection, intelligent driving method and device, and storage medium
CN113435232A (en) * 2020-03-23 2021-09-24 北京京东乾石科技有限公司 Object detection method, device, equipment and storage medium
CN112233097A (en) * 2020-10-19 2021-01-15 中国科学技术大学 Road scene other vehicle detection system and method based on space-time domain multi-dimensional fusion

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114519853A (en) * 2021-12-29 2022-05-20 西安交通大学 Three-dimensional target detection method and system based on multi-mode fusion
CN114881078A (en) * 2022-05-07 2022-08-09 安徽蔚来智驾科技有限公司 Method and system for screening data under predetermined scene
CN114842313A (en) * 2022-05-10 2022-08-02 北京易航远智科技有限公司 Target detection method and device based on pseudo-point cloud, electronic equipment and storage medium
CN114842313B (en) * 2022-05-10 2024-05-31 北京易航远智科技有限公司 Target detection method and device based on pseudo point cloud, electronic equipment and storage medium
CN115049827A (en) * 2022-05-19 2022-09-13 广州文远知行科技有限公司 Target object detection and segmentation method, device, equipment and storage medium
CN116030023A (en) * 2023-02-02 2023-04-28 泉州装备制造研究所 Point cloud detection method and system
CN117671320A (en) * 2023-05-30 2024-03-08 合肥辉羲智能科技有限公司 Point cloud three-dimensional target automatic labeling method and system based on multi-model fusion

Also Published As

Publication number Publication date
CN113780257B (en) 2022-02-22

Similar Documents

Publication Publication Date Title
CN113780257B (en) Multi-modal fusion weakly supervised vehicle target detection method and system
Li et al. Cross‐scene pavement distress detection by a novel transfer learning framework
US10373024B2 (en) Image processing device, object detection device, image processing method
CN114596555B (en) Obstacle point cloud data screening method and device, electronic equipment and storage medium
JP6700373B2 (en) Apparatus and method for learning object image packaging for artificial intelligence of video animation
CN113592905B (en) Vehicle driving track prediction method based on monocular camera
CN116188999A (en) Small target detection method based on visible light and infrared image data fusion
US11200455B2 (en) Generating training data for object detection
Zhang et al. Real-time lane detection by using biologically inspired attention mechanism to learn contextual information
Seo et al. Temporary traffic control device detection for road construction projects using deep learning application
Ammous et al. Improved YOLOv3-tiny for silhouette detection using regularisation techniques.
Kheder et al. Transfer learning based traffic light detection and recognition using CNN inception-V3 model
Shao et al. An efficient model for small object detection in the maritime environment
CN117789160A (en) Multi-mode fusion target detection method and system based on cluster optimization
Sekkat et al. Amodalsynthdrive: A synthetic amodal perception dataset for autonomous driving
CN116486239A (en) Image anomaly detection platform based on incremental learning and open set recognition algorithm
Saha et al. A newly proposed object detection method using faster R-CNN inception with ResNet based on Tensorflow
CN116052120A (en) Excavator night object detection method based on image enhancement and multi-sensor fusion
Afdhal et al. Evaluation of benchmarking pre-trained cnn model for autonomous vehicles object detection in mixed traffic
Xu A fusion-based approach to deep-learning and edge-cutting algorithms for identification and color recognition of traffic lights
US20230351734A1 (en) System and Method for Iterative Refinement and Curation of Images Driven by Visual Templates
Cortés et al. Semi-automatic tracking-based labeling tool for automotive applications
CN111126261B (en) Video data analysis method and device, raspberry group device and readable storage medium
US20240071105A1 (en) Cross-modal self-supervised learning for infrastructure analysis
Estrada et al. Object and Traffic Light Recognition Model Development Using Multi-GPU Architecture for Autonomous Bus.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant