CN112990229A - Multi-modal 3D target detection method, system, terminal and medium - Google Patents

Multi-modal 3D target detection method, system, terminal and medium

Info

Publication number
CN112990229A
CN112990229A
Authority
CN
China
Prior art keywords
features
point cloud
image
lidar point
anchor frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110263197.8A
Other languages
Chinese (zh)
Inventor
马超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202110263197.8A priority Critical patent/CN112990229A/en
Publication of CN112990229A publication Critical patent/CN112990229A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a multi-modal 3D target detection method and system, which respectively extract the features of an original image I and a corresponding LiDAR point cloud L; perform point and pixel feature fusion on the features of the original image I and the corresponding LiDAR point cloud L to form LiDAR point cloud features, and take the features of the original image I as image features to generate a 3D area proposal and a 2D area proposal respectively; and extract features from the 3D area proposal and the 2D area proposal respectively and fuse them to generate the final 3D target detection result. A corresponding terminal and medium are also provided. The method completes target detection by using the geometric constraint relation and the feature relevance between the modalities; completes the 3D target detection task through point-pixel level feature fusion in the first stage and region-proposal level feature fusion in the second stage; and generates high-quality area proposals using the geometric constraints between the image and the LiDAR point cloud.

Description

Multi-modal 3D target detection method, system, terminal and medium
Technical Field
The invention relates to a 3D target detection method, and in particular to a deep-network-based multi-modal 3D target detection method, system, terminal and medium.
Background
Target detection is an important direction in the field of computer vision, with broad application prospects and market value. With the development of various sensor and automotive technologies, the role of 3D target detection in the field of autonomous driving has gradually begun to emerge. The most commonly used sensors for 3D target detection tasks in autonomous driving are cameras and LiDAR, whose corresponding data types are images and LiDAR point clouds, respectively. Due to the complementarity of information between these two modalities, namely the image and the LiDAR point cloud, and the gradually decreasing price of LiDAR sensors, multi-modality-based 3D target detection methods have gradually become a focus of research at home and abroad.
Existing multi-modality-based 3D target detection methods are mainly classified into three types: pre-fusion, post-fusion, and deep fusion. 1. Pre-fusion: fusion is carried out at the input end; the different modalities are preprocessed and combined into a new representation. 2. Post-fusion: each modality is processed separately and independently until the final stage, where result-level fusion is carried out; since this scheme allows the final result to come from a single independent module, it has great redundancy. 3. Deep fusion: different modalities are fused hierarchically within a neural network, allowing features from different modalities to exchange information in the intermediate layers.
However, because of the large differences in the data formats and distributions of images and LiDAR point clouds, experts and scholars at home and abroad have proposed a number of algorithmic models that fuse the two different modalities. For deep fusion, much of the focus is on projecting the LiDAR point cloud onto a 2D plane before fusing it with the image information; however, during the projection of the 3D point cloud onto the 2D plane, a loss of information inevitably occurs. For pre-fusion or post-fusion, fusion only occurs at the input or output end, and the relevance and complementarity between the different modality data are not fully utilized.
Through search, the following results are found:
Chinese patent application CN111860666A, "3D target detection method based on the fusion of point cloud and image self-attention mechanism", published on 2020-10-30, first provides a multi-layer three-dimensional feature extraction method based on the three-dimensional point cloud. Then, a two-dimensional feature extraction method based on an image geometric and semantic feature voting mechanism is provided. Next, a geometric-principle method is provided, which converts the two-dimensional features into the 3D detection pipeline of the point cloud and transmits them to the point cloud structure. Finally, a multi-tower training scheme is provided, the cooperativity of the two-dimensional and three-dimensional feature gradient fusion is optimized, and further fine-tuning is performed according to the fusion result. The method uses camera parameters to lift the two-dimensional features to a three-dimensional channel, and adopts a self-attention mechanism with a multi-tower training method to realize organic gradient fusion of the two-dimensional and three-dimensional features, thereby overcoming the limitations of detection methods based on sparse three-dimensional point cloud data and fully utilizing the high resolution and rich texture information of images to supplement and optimize three-dimensional target detection, so as to realize accurate detection of three-dimensional targets. The method still has the following technical problems:
1. the input of the method is an RGB-D image, and the method is not suitable for the sparse point cloud condition of an outdoor scene.
2. According to the method, extraction of the RGB image candidate frame and three-dimensional point cloud branches are decoupled, and meanwhile, classification of the three-dimensional point cloud and the RGB image are also decoupled, so that complementary information between the image and the point cloud cannot be fully utilized.
Chinese patent application CN110827202A, "Target detection method, target detection device, computer equipment and storage medium", published on 2020-02-21, obtains point cloud data and a corresponding color image; fuses the point cloud data and the corresponding color image to obtain fused data; completes the point cloud data according to the fused data to obtain completed point cloud data; performs target detection according to the fused data to obtain an intermediate target detection result; acquires, from the completed point cloud data, the completed point cloud data of the area corresponding to the intermediate target detection result; and corrects the intermediate target detection result according to the acquired completed point cloud data to obtain the final target detection result. The method still has the following technical problems:
1. The feature fusion of the input point cloud data and RGB image remains at the level of feature concatenation, and the deep correlation between the attributes of the two modalities is not considered.
2. The supervision signal of this method's detection-result correction network comes from the completed point cloud data obtained by completing the point cloud data according to the fused data; however, because of inaccurate depth prediction at object edges, the completed point cloud introduces some noise into the correction network, thereby weakening its performance.
At present, no description or report of a technology similar to the present invention has been found, and no similar data have been collected at home or abroad.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a deep-network-based multi-modal 3D target detection method, system, terminal and medium, belonging to multi-modality-based deep-fusion target detection technology, which takes the original point cloud and the image as input to realize multi-modal 3D target detection.
According to an aspect of the present invention, there is provided a multi-modal 3D object detection method, including the steps of:
respectively extracting the characteristics of an original image I and the corresponding LiDAR point cloud L;
performing point and pixel feature fusion on the original image I and the corresponding LiDAR point cloud L features to form LiDAR point cloud features, and respectively generating a 3D area proposal and a 2D area proposal by using the features of the original image I as image features;
and respectively extracting features from the 3D area proposal and the 2D area proposal, and fusing to generate a final 3D target detection result.
Preferably, the separately extracting features of the raw image I and the corresponding LiDAR point cloud L includes:
obtaining an original image I, the corresponding LiDAR point cloud L, an image feature extractor FE_I and a LiDAR point cloud feature extractor FE_L;
inputting the original image I into the image feature extractor FE_I to obtain the features F_I of the original image I;
inputting the LiDAR point cloud L into the LiDAR point cloud feature extractor FE_L to obtain the features F_L of the LiDAR point cloud L.
Preferably, the performing feature fusion of points and pixels on the features of the original image I and the corresponding LiDAR point cloud L to form LiDAR point cloud features, and generating a 3D area proposal and a 2D area proposal respectively by using the features of the original image I as image features, includes:
according to the extracted features F_L of the LiDAR point cloud L, dividing the LiDAR point cloud L into foreground points L_f and background points L_b, and for every foreground point L_f, arranging a 3D anchor frame A_3D centered at L_f;
projecting the 3D anchor frame A_3D onto the image plane to obtain the corresponding 2D anchor frame A_2D, forming a projection relation between points in the LiDAR coordinate system and pixels in the image coordinate system;
obtaining the correspondence between the features F_I of the original image I and the features F_L of the LiDAR point cloud L according to the projection relation, and carrying out point-pixel feature fusion of the features F_I and F_L according to this correspondence;
taking the fused features as the LiDAR point cloud features F'_L, performing regression and classification tasks on the 3D anchor frame A_3D, calculating the regression error L_reg^3D and classification error L_cls^3D, and obtaining the regressed 3D anchor frame A'_3D; taking the features F_I of the original image I as the image features, performing regression and classification tasks on the 2D anchor frame A_2D, calculating the regression error L_reg^2D and classification error L_cls^2D, and obtaining the regressed 2D anchor frame A'_2D;
taking the top T highest-scoring 2D anchor frames A'_2D and 3D anchor frames A'_3D as the 2D region proposals Pro_2D and 3D region proposals Pro_3D, respectively;
projecting the 3D anchor frames A'_3D onto the image plane according to the projection relation to generate 2D anchor frames A''_2D, and calculating the error L_pro between the 2D anchor frames A''_2D and the 2D anchor frames A'_2D;
constructing a loss function L_RPN = α·(L_reg^3D + L_cls^3D) + β·(L_reg^2D + L_cls^2D) + γ·L_pro, wherein α, β and γ are coefficients, and updating the network parameters by minimizing the loss function until the network converges.
Preferably, the 3D anchor frame A_3D has the following size: 3.9 meters long, 1.6 meters wide and 1.5 meters high.
Preferably, the projection relationship between a point x in the LiDAR coordinate system and a pixel y in the image coordinate system is:

y = P_rect^(i) · R_rect^(0) · [R_velo^cam | t_velo^cam] · x

wherein the projection matrix P_rect^(i) is composed of the internal parameters f^(i), c^(i) and b^(i) of the camera sensor, R_rect^(0) is the corrective rotation matrix of camera No. 0, R_velo^cam is the rotation between the camera and LiDAR coordinate systems, and t_velo^cam is the translation vector between the camera and LiDAR coordinate systems.
Preferably, α, β, γ are 1, 1, 0.5, respectively.
Preferably, the extracting features from the 3D area proposal and the 2D area proposal, respectively, and fusing them to generate a final 3D target detection result, includes:
extracting features F_pro^2D and F_pro^3D from within the 2D region proposals Pro_2D and the 3D region proposals Pro_3D, respectively, and then taking the fused features of F_pro^2D and F_pro^3D as the features of the 3D region proposals Pro_3D for producing the 3D target detection result;
performing regression and classification on the 3D region proposals Pro_3D and calculating the regression loss L'_reg and classification loss L'_cls;
constructing a loss function L_refinement = L'_cls + L'_reg, and updating the network parameters by minimizing the loss function until the network converges.
According to another aspect of the present invention, there is provided a multimodal 3D object detection system comprising:
an initial feature extraction module which respectively extracts features of the original image I and the corresponding LiDAR point cloud L;
a region proposal generation module, which performs feature fusion of points and pixels on the original image I and the corresponding LiDAR point cloud L to form LiDAR point cloud features, and takes the features of the original image I as image features to respectively generate a 3D region proposal and a 2D region proposal;
and the target detection module is used for extracting features from the 3D area proposal and the 2D area proposal respectively and fusing the features to generate a final 3D target detection result.
According to a third aspect of the present invention, there is provided a terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the program being operable to perform any of the methods described above.
According to a fourth aspect of the invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, is operable to perform the method of any of the above.
Due to the adoption of the technical scheme, compared with the prior art, the invention has the following beneficial effects:
the multi-mode 3D target detection method, the system, the terminal and the medium provided by the invention aim at the problem that the correlation and complementarity between two modes, namely an image and a LiDAR point cloud, are not fully utilized by a fusion method in the prior art, and the target detection is completed by utilizing the geometric constraint relation and the characteristic correlation between the modes.
The multi-modal 3D target detection method, system, terminal and medium provided by the invention complete the 3D target detection task through point-pixel level feature fusion in the first stage and region-proposal level feature fusion in the second stage.
The multi-mode 3D target detection method, the multi-mode 3D target detection system, the multi-mode 3D target detection terminal and the multi-mode 3D target detection medium provided by the invention design a 2D-3D anchor frame coupling mechanism, and dynamic information interaction between an image and a point cloud can be realized through the mechanism.
According to the multi-mode 3D target detection method, the system, the terminal and the medium, designed point-pixel level feature fusion can effectively utilize geometric constraint between an image and LiDAR point cloud, meanwhile, information sharing is achieved, and finally a high-quality area proposal is generated.
According to the multi-mode 3D target detection method, the system, the terminal and the medium, the designed regional proposal level feature fusion can effectively learn the local features of the image and the LiDAR point cloud robustness, and the robust and high-quality detection is realized.
The multi-modal 3D target detection method, system, terminal and medium provided by the invention, compared with data enhancement methods in the prior art, achieve a greater performance improvement.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
fig. 1 is a flowchart of a multi-modal 3D object detection method according to an embodiment of the present invention.
Fig. 2 is a flowchart of a multi-modal 3D object detection method provided in a preferred embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a multi-modal 3D object detection system according to an embodiment of the present invention.
Detailed Description
The following examples illustrate the invention in detail: the embodiment is implemented on the premise of the technical scheme of the invention, and a detailed implementation mode and a specific operation process are given. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention.
Fig. 1 is a flowchart of a multi-modal 3D object detection method according to an embodiment of the present invention.
In the multi-modal 3D target detection method provided in this embodiment, the image and the original point cloud are first taken as input and respectively fed into independent feature extractors to extract the corresponding features; point-pixel level feature fusion is then performed in the first stage to generate high-quality 3D region proposals. The image and LiDAR point cloud features corresponding to the 3D region proposals are then sent to the second stage for fusion to generate the final detection result.
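For orientation only, the overall two-stage flow can be summarized in the following Python-style sketch. Every function name in it (extract_image_features, generate_proposals, refine_proposals, and so on) is hypothetical and serves purely to illustrate the order of operations described above; it is not an implementation disclosed by the patent.

```python
def detect_3d_objects(image, lidar_points, calib):
    # Stage 0: independent feature extraction (hypothetical helpers)
    f_i = extract_image_features(image)            # F_I
    f_l = extract_point_features(lidar_points)     # F_L

    # Stage 1: point-pixel level fusion and region proposal generation
    f_l_fused = fuse_point_pixel(f_i, f_l, lidar_points, calib)   # F'_L
    pro_2d, pro_3d = generate_proposals(f_i, f_l_fused)           # Pro_2D, Pro_3D

    # Stage 2: region-proposal level fusion and refinement
    f_pro_2d = roi_features_2d(f_i, pro_2d)
    f_pro_3d = roi_features_3d(f_l_fused, pro_3d)
    return refine_proposals(f_pro_2d, f_pro_3d, pro_3d)           # final 3D boxes
```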
As shown in fig. 1, the multi-modal 3D object detection method provided by this embodiment may include the following steps:
step 1: respectively extracting the characteristics of an original image I and the corresponding LiDAR point cloud L;
step 2: performing point and pixel feature fusion on the features extracted in the step 1 to form LiDAR point cloud features, and respectively generating a 3D region proposal and a 2D region proposal by taking the features of the image I extracted in the step 1 as image features;
and step 3: and (3) respectively extracting features from the 3D area proposal and the 2D area proposal obtained in the step (2) and fusing the features to generate a final 3D target detection result.
In step 1 of this embodiment, as a preferred embodiment, the steps of extracting features of the original image I and the corresponding LiDAR point cloud L respectively may include:
Step 1.1: obtaining an original image I, the corresponding LiDAR point cloud L, an image feature extractor FE_I and a LiDAR point cloud feature extractor FE_L;
Step 1.2: inputting the original image I into the image feature extractor FE_I to obtain the features F_I of the original image I, and at the same time inputting the LiDAR point cloud L into the LiDAR point cloud feature extractor FE_L to obtain the features F_L of the LiDAR point cloud L.
In step 2 of this embodiment, as a preferred embodiment, point and pixel feature fusion is performed on the features extracted in step 1 to form LiDAR point cloud features, and the features of the image I extracted in step 1 are used as image features to generate a 3D region proposal and a 2D region proposal, respectively, which may include the following steps:
Step 2.1: according to the features F_L of the LiDAR point cloud L extracted in step 1, the LiDAR point cloud L is divided into foreground points L_f and background points L_b, and for every foreground point L_f, a 3D anchor frame A_3D centered at L_f is arranged.
Step 2.2: the 3D anchor frame A_3D obtained in step 2.1 is projected onto the image plane to obtain the corresponding 2D anchor frame A_2D, giving the projection relation between points in the LiDAR coordinate system and pixels in the image coordinate system.
Step 2.3: according to the projection relation obtained in step 2.2, the correspondence between the features F_I of the original image I and the features F_L of the LiDAR point cloud L is obtained, and point-pixel feature fusion of F_I and F_L is carried out according to this correspondence.
Step 2.4: the features obtained by fusion in step 2.3 are taken as the LiDAR point cloud features F'_L; regression and classification tasks are performed on the 3D anchor frame A_3D generated in step 2.1, the regression error L_reg^3D and classification error L_cls^3D are calculated, and the regressed 3D anchor frame A'_3D is obtained; the features F_I of the original image I obtained in step 1 are taken as the image features, regression and classification tasks are performed on the 2D anchor frame A_2D generated in step 2.2, the regression error L_reg^2D and classification error L_cls^2D are calculated, and the regressed 2D anchor frame A'_2D is obtained; the top T highest-scoring 2D anchor frames A'_2D and 3D anchor frames A'_3D are then taken as the 2D region proposals Pro_2D and 3D region proposals Pro_3D, respectively.
Step 2.5: the 3D anchor frames A'_3D obtained in step 2.4 are projected onto the image plane according to the projection relation in step 2.2 to generate 2D anchor frames A''_2D, and the error L_pro between the 2D anchor frames A''_2D and the 2D anchor frames A'_2D from step 2.4 is calculated.
Step 2.6: a loss function L_RPN = α·(L_reg^3D + L_cls^3D) + β·(L_reg^2D + L_cls^2D) + γ·L_pro is constructed, wherein α, β and γ are coefficients, and the network parameters are updated by minimizing the loss function until the network converges.
In step 2.1 of this embodiment, as a specific application example, the 3D anchor frame A_3D has the following size: 3.9 meters long, 1.6 meters wide and 1.5 meters high.
In step 2.2 of this embodiment, as a preferred embodiment, the projection relationship between a point x in the LiDAR coordinate system and a pixel y in the image coordinate system is:

y = P_rect^(i) · R_rect^(0) · [R_velo^cam | t_velo^cam] · x

wherein the projection matrix P_rect^(i) is composed of the internal parameters f^(i), c^(i) and b^(i) of the camera sensor, R_rect^(0) is the corrective rotation matrix of camera No. 0, R_velo^cam is the rotation between the camera and LiDAR coordinate systems, and t_velo^cam is the translation vector between the camera and LiDAR coordinate systems.
In step 2.6 of this embodiment, α, β, γ are 1, 1, 0.5, respectively, as a specific application example.
In step 3 of this embodiment, as a preferred embodiment, extracting the image features and the LiDAR point cloud features corresponding to the region proposals obtained in step 2 and performing feature fusion to generate the final detection result may include the following steps:
Step 3.1: features F_pro^2D and F_pro^3D are extracted from within the 2D region proposals Pro_2D and the 3D region proposals Pro_3D, respectively; the fused features of F_pro^2D and F_pro^3D are then taken as the features of the 3D region proposals Pro_3D for producing the 3D target detection result; regression and classification are performed on the 3D region proposals Pro_3D, and the regression loss L'_reg and classification loss L'_cls are calculated.
Step 3.2: a loss function L_refinement = L'_cls + L'_reg is constructed, and the network parameters are updated by minimizing the loss function until the network converges.
Fig. 2 is a flowchart of a multi-modal 3D object detection method according to a preferred embodiment of the present invention.
As shown in fig. 2, the multi-modal 3D object detection method provided by the preferred embodiment may include the following steps:
step 1: and respectively inputting the original image I and the corresponding LiDAR point cloud L into independent feature extractors to extract features.
Step 2: and (3) performing point-pixel level feature fusion on the features extracted in the step (1) to form LiDAR point cloud features, taking the features of the image I extracted in the step (1) as image features, and generating a high-quality 3D area proposal and a 2D area proposal.
And step 3: and (3) respectively extracting features from the 3D region proposal and the 2D region proposal obtained in the step (2) and fusing the features to generate a final detection result.
As a preferred embodiment, step 1 comprises the steps of:
Step 1.1: obtaining the input original image I, the corresponding LiDAR point cloud L, an image feature extractor FE_I and a LiDAR point cloud feature extractor FE_L.
Step 1.2: inputting the image I into the image feature extractor FE_I to obtain the image features F_I, and at the same time inputting the LiDAR point cloud L into the LiDAR point cloud feature extractor FE_L to obtain the LiDAR point cloud features F_L.
As a preferred embodiment, step 2 comprises the steps of:
Step 2.1: according to the LiDAR point cloud features F_L learned in step 1.2, the point cloud is divided into foreground points L_f and background points L_b, and for every foreground point L_f, a 3D anchor frame A_3D centered at L_f is arranged; the size of the anchor frame is: 3.9 meters long, 1.6 meters wide and 1.5 meters high.
Step 2.2: the 3D anchor frame A_3D in step 2.1 is projected onto the image plane to obtain the corresponding 2D anchor frame A_2D. The projection relationship between a point x in the LiDAR coordinate system and a pixel y in the image coordinate system is:

y = P_rect^(i) · R_rect^(0) · [R_velo^cam | t_velo^cam] · x

wherein the projection matrix P_rect^(i) is composed of the internal parameters f^(i), c^(i) and b^(i) of the camera sensor, R_rect^(0) is the corrective rotation matrix of camera No. 0, R_velo^cam is the rotation between the camera and LiDAR coordinate systems, and t_velo^cam is the translation vector between the camera and LiDAR coordinate systems.
Step 2.3: according to the projection relation in step 2.2, the correspondence between the image features F_I from step 1.2 and the LiDAR point cloud features F_L is obtained, and point-pixel level feature fusion of F_I and F_L is performed according to this correspondence.
Step 2.4: the fused features from step 2.3 are taken as the new LiDAR point cloud features F'_L; regression and classification tasks are performed on the 3D anchor frames generated in step 2.1, the regression error L_reg^3D and classification error L_cls^3D are calculated, and the regressed 3D anchor frames A'_3D are obtained; with F_I from step 1.2 as the image features, regression and classification tasks are performed on the 2D anchor frames generated in step 2.2, the regression error L_reg^2D and classification error L_cls^2D are calculated, and the regressed 2D anchor frames A'_2D are obtained. The top T highest-scoring 2D anchor frames A'_2D and 3D anchor frames A'_3D are then taken as the 2D region proposals Pro_2D and 3D region proposals Pro_3D, respectively.
Step 2.5: the 3D anchor frames A'_3D in step 2.4 are projected onto the image plane according to the projection relation in step 2.2 to generate 2D anchor frames A''_2D, and the error L_pro between A''_2D and A'_2D from step 2.4 is calculated.
Step 2.6: a loss function L_RPN = α·(L_reg^3D + L_cls^3D) + β·(L_reg^2D + L_cls^2D) + γ·L_pro is constructed, and the network parameters are updated by minimizing the loss function until the network converges. In this example, α, β and γ are 1, 1 and 0.5, respectively.
As a preferred embodiment, step 3 comprises the steps of:
Step 3.1: with F_I from step 1 and F'_L from step 2 as the image and LiDAR point cloud features respectively, features F_pro^2D and F_pro^3D are extracted within the regions of the 2D region proposals Pro_2D and 3D region proposals Pro_3D from step 2.4; the fused features of F_pro^2D and F_pro^3D are then taken as the features of Pro_3D for producing the 3D target detection result. Regression and classification are performed on Pro_3D, and the regression loss L'_reg and classification loss L'_cls are calculated.
Step 3.2: a loss function L_refinement = L'_cls + L'_reg is constructed, and the network parameters are updated by minimizing the loss function until the network converges.
The technical solutions provided by the above embodiments of the present invention are further described below with reference to a specific application example.
The PointRCNN detector is taken as an example and serves as a reference network of the specific application example. The method flow provided by the above embodiment of the present invention includes:
the first step is as follows: and respectively inputting the original image I and the corresponding LiDAR point cloud L into independent feature extractors to extract features.
1.1) Obtaining the input original image I, the corresponding LiDAR point cloud L, the image feature extractor FE_I with its network parameters θ1, and the LiDAR point cloud feature extractor FE_L with its network parameters θ2.
Here FE_I is a ResNet-50 network, FE_L is a PointNet++ network, and θ1 and θ2 are the corresponding pre-trained model parameters, respectively.
1.2) The image I is input into the image feature extractor FE_I to obtain the image features F_I; at the same time, the LiDAR point cloud L is input into the LiDAR point cloud feature extractor FE_L to obtain the LiDAR point cloud features F_L.
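As a rough illustration of 1.1)-1.2), the sketch below builds the image extractor from torchvision's ResNet-50 and leaves the PointNet++ extractor as a user-supplied module; this is only an assumed setup consistent with the description above, not the inventors' code.

```python
import torch
import torchvision

# FE_I: ResNet-50 trunk (classification head removed), pre-trained parameters theta_1
image_extractor = torch.nn.Sequential(
    *list(torchvision.models.resnet50(pretrained=True).children())[:-2]
)

def extract_initial_features(image, lidar_points, point_extractor):
    """image: (1, 3, H, W) tensor; lidar_points: (1, N, 4) tensor (x, y, z, reflectance).

    point_extractor stands in for the PointNet++ network FE_L with parameters
    theta_2; a concrete PointNet++ implementation is outside this sketch.
    """
    f_i = image_extractor(image)          # image features F_I, shape (1, 2048, H/32, W/32)
    f_l = point_extractor(lidar_points)   # per-point features F_L, shape (1, N, C)
    return f_i, f_l
```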
The second step is that: and performing point-pixel level feature fusion on the features extracted in the first step, and generating a high-quality 3D region proposal.
2.1) According to the point cloud features F_L extracted by PointNet++, the LiDAR segmentation result is scored; a point with a score greater than or equal to 0.3 is judged to be a foreground point L_f, and a point with a score below 0.3 is judged to be a background point L_b. For every foreground point L_f, a 3D anchor frame A_3D centered at L_f is arranged; the size of the anchor frame is (3.9 meters long, 1.6 meters wide, 1.5 meters high).
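The segmentation-and-anchor step 2.1) might be sketched as follows in NumPy; the fixed yaw of 0 for each anchor is an assumption, since the patent only fixes the 0.3 score threshold and the 3.9 m x 1.6 m x 1.5 m anchor size.

```python
import numpy as np

def place_3d_anchors(points, fg_scores, threshold=0.3, size=(3.9, 1.6, 1.5)):
    """Place one 3D anchor A_3D, encoded as (x, y, z, l, w, h, yaw), at every
    foreground point L_f whose segmentation score is >= threshold.

    points: (N, 3) LiDAR coordinates; fg_scores: (N,) per-point foreground scores.
    """
    fg_mask = fg_scores >= threshold            # foreground points L_f
    centers = points[fg_mask]                   # (M, 3) anchor centres
    l, w, h = size
    boxes = np.tile([l, w, h, 0.0], (centers.shape[0], 1))   # fixed size, yaw = 0 (assumed)
    return np.hstack([centers, boxes])          # (M, 7) anchors A_3D
```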
2.2) The 3D anchor frame A_3D in 2.1) is projected onto the image plane to obtain the corresponding 2D anchor frame A_2D. The projection relationship between a point x in the LiDAR coordinate system and a pixel y in the image coordinate system is:

y = P_rect^(i) · R_rect^(0) · [R_velo^cam | t_velo^cam] · x

wherein the projection matrix P_rect^(i) is composed of the internal parameters f^(i), c^(i) and b^(i) of the camera sensor, R_rect^(0) is the corrective rotation matrix of camera No. 0, R_velo^cam is the rotation between the camera and LiDAR coordinate systems, and t_velo^cam is the translation vector between the camera and LiDAR coordinate systems.
2.3) According to the projection relation in 2.2), the correspondence between the image features F_I from 1.2) and the LiDAR point cloud features F_L is obtained; F_I and F_L are then concatenated along the channel dimension according to this correspondence to form the new feature F'_L = (F_I | F_L), which is fed into the PointNet++ network for further learning.
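A combined sketch of the projection in 2.2) and the channel concatenation in 2.3) is given below. The calibration matrix names (P_rect, R_rect, Tr_velo_to_cam, following the KITTI convention) and the nearest-pixel sampling are assumptions, and the projected coordinates are assumed to already be scaled to the feature-map resolution.

```python
import torch

def project_lidar_to_image(points, P_rect, R_rect, Tr_velo_to_cam):
    """Project LiDAR points (N, 3) to pixel coordinates (N, 2) using
    y = P_rect . R_rect . Tr_velo_to_cam . x in homogeneous coordinates."""
    ones = torch.ones(points.shape[0], 1, dtype=points.dtype)
    x = torch.cat([points, ones], dim=1)                 # (N, 4)
    y = (P_rect @ R_rect @ Tr_velo_to_cam @ x.T).T       # (N, 3)
    return y[:, :2] / y[:, 2:3]                          # perspective division

def point_pixel_fusion(f_i, f_l, pixel_uv):
    """Form F'_L = (F_I | F_L): gather image features at each projected pixel
    and concatenate them with the per-point features along the channel axis.

    f_i: (C_i, H, W) image feature map; f_l: (N, C_l) per-point features;
    pixel_uv: (N, 2) projected pixel coordinates at feature-map resolution.
    """
    u = pixel_uv[:, 0].round().long().clamp(0, f_i.shape[2] - 1)
    v = pixel_uv[:, 1].round().long().clamp(0, f_i.shape[1] - 1)
    sampled = f_i[:, v, u].t()                  # (N, C_i) image features per point
    return torch.cat([sampled, f_l], dim=1)     # (N, C_i + C_l) fused features F'_L
```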
2.4) With the fused features F'_L from 2.3) as the new LiDAR point cloud features, regression and classification tasks are performed on the 3D anchor frames generated in 2.1), the regression error L_reg^3D and classification error L_cls^3D are calculated, and the regressed 3D anchor frames A'_3D are obtained. With F_I from 1.2) as the image features, regression and classification tasks are performed on the 2D anchor frames generated in 2.2), the regression error L_reg^2D and classification error L_cls^2D are calculated, and the regressed 2D anchor frames A'_2D are obtained. The 9000 highest-scoring anchor frames are then taken as the region proposals Pro_2D and Pro_3D.
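The parallel regression and classification heads of 2.4) might take a form like the following sketch; the layer widths and the box parameterizations (7 values for a 3D box, 4 for a 2D box) are assumptions, since the patent does not fix them.

```python
import torch.nn as nn

class AnchorHead(nn.Module):
    """Objectness classification and box regression over per-anchor features."""
    def __init__(self, in_channels, box_dim):
        super().__init__()
        self.cls_layer = nn.Linear(in_channels, 1)         # anchor score
        self.reg_layer = nn.Linear(in_channels, box_dim)   # box residuals

    def forward(self, feats):              # feats: (num_anchors, in_channels)
        return self.cls_layer(feats), self.reg_layer(feats)

# one head on the fused point features F'_L for the 3D anchors A_3D,
# one head on the image features F_I for the 2D anchors A_2D
head_3d = AnchorHead(in_channels=128, box_dim=7)   # (x, y, z, l, w, h, yaw)
head_2d = AnchorHead(in_channels=256, box_dim=4)   # (x1, y1, x2, y2)
```

The highest-scoring anchors from each head are then kept as Pro_2D and Pro_3D, as described above.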
2.5) The 3D anchor frames A'_3D in 2.4) are projected onto the image plane according to the projection relation in 2.2) to generate 2D anchor frames A''_2D, and the error L_pro between A''_2D and A'_2D from 2.4) is calculated.
2.6) A loss function L_RPN = α·(L_reg^3D + L_cls^3D) + β·(L_reg^2D + L_cls^2D) + γ·L_pro is constructed, and the network parameters are updated by minimizing the loss function until the network converges.
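A minimal sketch of this first-stage loss follows. The smooth-L1 and binary cross-entropy choices for the individual terms are assumptions (the patent does not name them), while the weighting alpha = beta = 1, gamma = 0.5 follows the values given earlier.

```python
import torch.nn.functional as F

def rpn_loss(reg_3d, cls_3d, reg_2d, cls_2d, l_pro, alpha=1.0, beta=1.0, gamma=0.5):
    """L_RPN = alpha*(L_reg^3D + L_cls^3D) + beta*(L_reg^2D + L_cls^2D) + gamma*L_pro."""
    return alpha * (reg_3d + cls_3d) + beta * (reg_2d + cls_2d) + gamma * l_pro

# hypothetical example of how the individual terms could be computed:
# reg_3d = F.smooth_l1_loss(pred_boxes_3d, target_boxes_3d)
# cls_3d = F.binary_cross_entropy_with_logits(pred_scores_3d, target_labels_3d)
# l_pro  = F.smooth_l1_loss(projected_boxes_2d, pred_boxes_2d)   # A''_2D vs A'_2D
```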
The third step: and extracting the image and the LiDAR point cloud characteristics corresponding to the 3D area proposal generated in the second step, and performing characteristic fusion to generate a final detection result.
3.1) With F_I from the first step and F'_L from the second step as the image and LiDAR point cloud features respectively, features F_pro^2D and F_pro^3D are extracted within the regions of the 2D region proposals Pro_2D and 3D region proposals Pro_3D from 2.4). F_pro^2D and F_pro^3D are then concatenated along the channel dimension to form the new feature F'_pro = (F_pro^2D | F_pro^3D), which is taken as the feature of Pro_3D. Regression and classification are performed on Pro_3D, and the regression loss L'_reg and classification loss L'_cls are calculated.
3.2) A loss function L_refinement = L'_cls + L'_reg is constructed, and the network parameters are updated by minimizing the loss function until the network converges.
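To make the second-stage fusion of 3.1) and the refinement loss of 3.2) concrete, a hypothetical sketch is given below; the layer sizes and the choice of a shared hidden layer are assumptions not fixed by the patent.

```python
import torch
import torch.nn as nn

class ProposalRefinement(nn.Module):
    """Refine Pro_3D from the fused proposal features F'_pro = (F_pro^2D | F_pro^3D)."""
    def __init__(self, dim_2d, dim_3d, box_dim=7):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim_2d + dim_3d, 256), nn.ReLU())
        self.cls_head = nn.Linear(256, 1)          # confidence of the 3D proposal
        self.reg_head = nn.Linear(256, box_dim)    # 3D box refinement residuals

    def forward(self, f_pro_2d, f_pro_3d):         # each: (num_proposals, dim_*)
        f_pro = torch.cat([f_pro_2d, f_pro_3d], dim=1)   # channel concatenation
        x = self.shared(f_pro)
        return self.cls_head(x), self.reg_head(x)

# L_refinement = L'_cls + L'_reg is then computed from these two outputs
```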
The implementation effect is as follows:
according to the steps, the test is carried out on a common 3D target detection data set KITTI. The data set is divided into a training set, a validation set, and a test set. The data set has 3D detection average accuracy as an evaluation index. Table 1 is a comparison of the performance of the present invention on the KITTI dataset against existing 3D target detection methods. As shown in table 1, it can be seen that the method provided by the above embodiment of the present invention has a significantly better improvement on the reference model than other algorithms.
Table 1: performance comparison on the KITTI dataset between the present invention and existing 3D target detection methods.
Another embodiment of the present invention provides a multi-modal 3D object detection system, as shown in fig. 3, including:
an initial feature extraction module which respectively extracts features of the original image I and the corresponding LiDAR point cloud L;
a region proposal generation module, which performs feature fusion of points and pixels on the original image I and the corresponding LiDAR point cloud L to form LiDAR point cloud features, and takes the features of the original image I as image features to respectively generate a 3D region proposal and a 2D region proposal;
and the target detection module is used for extracting features from the 3D area proposal and the 2D area proposal respectively and fusing the features to generate a final 3D target detection result.
A third embodiment of the present invention provides a terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor being operable to execute the method according to any one of the above embodiments of the present invention when executing the program.
Optionally, a memory is provided for storing a program. The memory may comprise a volatile memory, such as a random access memory (RAM), for example a static random access memory (SRAM) or a double data rate synchronous dynamic random access memory (DDR SDRAM); the memory may also comprise a non-volatile memory, such as a flash memory. The memory is used to store computer programs (e.g., applications or functional modules implementing the above-described methods), computer instructions, etc., which may be stored in one or more memories in a partitioned manner, and the above computer programs, computer instructions, data, etc. may be invoked by a processor.
A processor for executing the computer program stored in the memory to implement the steps of the method according to the above embodiments. Reference may be made in particular to the description relating to the preceding method embodiment.
The processor and the memory may be separate structures or may be an integrated structure integrated together. When the processor and the memory are separate structures, the memory, the processor may be coupled by a bus.
A fourth embodiment of the invention provides a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of any of the above-mentioned embodiments of the invention.
The multi-modal 3D target detection method, the multi-modal 3D target detection system, the multi-modal terminal and the multi-modal 3D target detection medium provided by the embodiment of the invention solve the problem that the correlation and the complementarity between two modalities, namely an image and a LiDAR point cloud, are not fully utilized in the fusion method in the prior art, and the target detection is completed by utilizing the geometric constraint relationship and the characteristic correlation between the modalities; completing a 3D target detection task through the feature fusion of the point-pixel level of the first stage and the region proposal level of the second stage; dynamic information interaction between the image and the point cloud can be realized through a 2D-3D anchor frame coupling mechanism; the point-pixel level feature fusion can effectively utilize the geometric constraint between the image and the LiDAR point cloud, simultaneously realize the sharing of information and finally generate a high-quality area proposal; the region proposal level feature fusion can effectively learn the local features of the image and LiDAR point cloud robustness, and realize the detection of robustness and high quality; the data enhancement method achieves higher performance enhancement than the prior art.
It should be noted that, the steps in the method provided by the present invention may be implemented by using corresponding modules, devices, units, and the like in the system, and those skilled in the art may implement the composition of the system by referring to the technical solution of the method, that is, the embodiment in the method may be understood as a preferred example for constructing the system, and will not be described herein again.
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices provided by the present invention in purely computer readable program code means, the method steps can be fully programmed to implement the same functions by implementing the system and its various devices in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices thereof provided by the present invention can be regarded as a hardware component, and the devices included in the system and various devices thereof for realizing various functions can also be regarded as structures in the hardware component; means for performing the functions may also be regarded as structures within both software modules and hardware components for performing the methods.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (10)

1. A multi-modal 3D object detection method is characterized by comprising the following steps:
respectively extracting the characteristics of an original image I and the corresponding LiDAR point cloud L;
performing point and pixel feature fusion on the original image I and the corresponding LiDAR point cloud L features to form LiDAR point cloud features, and respectively generating a 3D area proposal and a 2D area proposal by using the features of the original image I as image features;
and respectively extracting features from the 3D area proposal and the 2D area proposal, and fusing to generate a final 3D target detection result.
2. The multi-modal 3D object detection method of claim 1, wherein the separately extracting features of a raw image I and a corresponding LiDAR point cloud L comprises:
obtaining an original image I, the corresponding LiDAR point cloud L, an image feature extractor FE_I and a LiDAR point cloud feature extractor FE_L;
inputting the original image I into the image feature extractor FE_I to obtain the features F_I of the original image I;
inputting the LiDAR point cloud L into the LiDAR point cloud feature extractor FE_L to obtain the features F_L of the LiDAR point cloud L.
3. The multi-modal 3D object detection method of claim 1, wherein the fusing of point and pixel features of the raw image I and corresponding LiDAR point cloud L to form LiDAR point cloud features, using the features of the raw image I as image features to generate a 3D and a 2D area proposal, respectively, comprises:
according to the extracted features F_L of the LiDAR point cloud L, dividing the LiDAR point cloud L into foreground points L_f and background points L_b, and for every foreground point L_f, arranging a 3D anchor frame A_3D centered at L_f;
projecting the 3D anchor frame A_3D onto the image plane to obtain the corresponding 2D anchor frame A_2D, forming a projection relation between points in the LiDAR coordinate system and pixels in the image coordinate system;
obtaining the correspondence between the features F_I of the original image I and the features F_L of the LiDAR point cloud L according to the projection relation, and carrying out point-pixel feature fusion of the features F_I and F_L according to this correspondence;
taking the fused features as the LiDAR point cloud features F'_L, performing regression and classification tasks on the 3D anchor frame A_3D, calculating the regression error L_reg^3D and classification error L_cls^3D, and obtaining the regressed 3D anchor frame A'_3D; taking the features F_I of the original image I as the image features, performing regression and classification tasks on the 2D anchor frame A_2D, calculating the regression error L_reg^2D and classification error L_cls^2D, and obtaining the regressed 2D anchor frame A'_2D;
taking the top T highest-scoring 2D anchor frames A'_2D and 3D anchor frames A'_3D as the 2D region proposals Pro_2D and 3D region proposals Pro_3D, respectively;
projecting the 3D anchor frames A'_3D onto the image plane according to the projection relation to generate 2D anchor frames A''_2D, and calculating the error L_pro between the 2D anchor frames A''_2D and the 2D anchor frames A'_2D;
constructing a loss function L_RPN = α·(L_reg^3D + L_cls^3D) + β·(L_reg^2D + L_cls^2D) + γ·L_pro, wherein α, β and γ are coefficients, and updating the network parameters by minimizing the loss function until the network converges.
4. The multi-modal 3D object detection method according to claim 3, wherein the 3D anchor frame A_3D has the following size: 3.9 meters long, 1.6 meters wide and 1.5 meters high.
5. The multi-modal 3D object detection method of claim 3, wherein the projection relationship between a point x in the LiDAR coordinate system and a pixel y in the image coordinate system is:

y = P_rect^(i) · R_rect^(0) · [R_velo^cam | t_velo^cam] · x

wherein the projection matrix P_rect^(i) is composed of the internal parameters f^(i), c^(i) and b^(i) of the camera sensor, R_rect^(0) is the corrective rotation matrix of camera No. 0, R_velo^cam is the rotation between the camera and LiDAR coordinate systems, and t_velo^cam is the translation vector between the camera and LiDAR coordinate systems.
6. The multi-modal 3D object detection method according to claim 3, wherein α, β, γ are 1, 1, 0.5, respectively.
7. The multi-modal 3D object detection method according to claim 3, wherein the extracting and fusing features from the 3D region proposal and the 2D region proposal respectively to generate a final 3D object detection result comprises:
extracting features F_pro^2D and F_pro^3D from within the 2D region proposals Pro_2D and the 3D region proposals Pro_3D, respectively, and then taking the fused features of F_pro^2D and F_pro^3D as the features of the 3D region proposals Pro_3D for producing the 3D target detection result;
performing regression and classification on the 3D region proposals Pro_3D and calculating the regression loss L'_reg and classification loss L'_cls;
constructing a loss function L_refinement = L'_cls + L'_reg, and updating the network parameters by minimizing the loss function until the network converges.
8. A multi-modal 3D object detection system, comprising:
an initial feature extraction module which respectively extracts features of the original image I and the corresponding LiDAR point cloud L;
a region proposal generation module, which performs feature fusion of points and pixels on the original image I and the corresponding LiDAR point cloud L to form LiDAR point cloud features, and takes the features of the original image I as image features to respectively generate a 3D region proposal and a 2D region proposal;
and the target detection module is used for extracting features from the 3D area proposal and the 2D area proposal respectively and fusing the features to generate a final 3D target detection result.
9. A terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, is operative to perform the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 7.
CN202110263197.8A 2021-03-11 2021-03-11 Multi-modal 3D target detection method, system, terminal and medium Pending CN112990229A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110263197.8A CN112990229A (en) 2021-03-11 2021-03-11 Multi-modal 3D target detection method, system, terminal and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110263197.8A CN112990229A (en) 2021-03-11 2021-03-11 Multi-modal 3D target detection method, system, terminal and medium

Publications (1)

Publication Number Publication Date
CN112990229A true CN112990229A (en) 2021-06-18

Family

ID=76334903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110263197.8A Pending CN112990229A (en) 2021-03-11 2021-03-11 Multi-modal 3D target detection method, system, terminal and medium

Country Status (1)

Country Link
CN (1) CN112990229A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114267041A (en) * 2022-03-01 2022-04-01 北京鉴智科技有限公司 Method and device for identifying object in scene
CN114463736A (en) * 2021-12-28 2022-05-10 天津大学 Multi-target detection method and device based on multi-mode information fusion

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948661A (en) * 2019-02-27 2019-06-28 江苏大学 A kind of 3D vehicle checking method based on Multi-sensor Fusion
CN110738121A (en) * 2019-09-17 2020-01-31 北京科技大学 front vehicle detection method and detection system
CN110929692A (en) * 2019-12-11 2020-03-27 中国科学院长春光学精密机械与物理研究所 Three-dimensional target detection method and device based on multi-sensor information fusion
CN111027401A (en) * 2019-11-15 2020-04-17 电子科技大学 End-to-end target detection method with integration of camera and laser radar
CN111079685A (en) * 2019-12-25 2020-04-28 电子科技大学 3D target detection method
CN111291714A (en) * 2020-02-27 2020-06-16 同济大学 Vehicle detection method based on monocular vision and laser radar fusion
CN111563442A (en) * 2020-04-29 2020-08-21 上海交通大学 Slam method and system for fusing point cloud and camera image data based on laser radar
CN111583337A (en) * 2020-04-25 2020-08-25 华南理工大学 Omnibearing obstacle detection method based on multi-sensor fusion
CN111860666A (en) * 2020-07-27 2020-10-30 湖南工程学院 3D target detection method based on point cloud and image self-attention mechanism fusion
CN112052860A (en) * 2020-09-11 2020-12-08 中国人民解放军国防科技大学 Three-dimensional target detection method and system

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948661A (en) * 2019-02-27 2019-06-28 江苏大学 A kind of 3D vehicle checking method based on Multi-sensor Fusion
CN110738121A (en) * 2019-09-17 2020-01-31 北京科技大学 front vehicle detection method and detection system
CN111027401A (en) * 2019-11-15 2020-04-17 电子科技大学 End-to-end target detection method with integration of camera and laser radar
CN110929692A (en) * 2019-12-11 2020-03-27 中国科学院长春光学精密机械与物理研究所 Three-dimensional target detection method and device based on multi-sensor information fusion
CN111079685A (en) * 2019-12-25 2020-04-28 电子科技大学 3D target detection method
CN111291714A (en) * 2020-02-27 2020-06-16 同济大学 Vehicle detection method based on monocular vision and laser radar fusion
CN111583337A (en) * 2020-04-25 2020-08-25 华南理工大学 Omnibearing obstacle detection method based on multi-sensor fusion
CN111563442A (en) * 2020-04-29 2020-08-21 上海交通大学 Slam method and system for fusing point cloud and camera image data based on laser radar
CN111860666A (en) * 2020-07-27 2020-10-30 湖南工程学院 3D target detection method based on point cloud and image self-attention mechanism fusion
CN112052860A (en) * 2020-09-11 2020-12-08 中国人民解放军国防科技大学 Three-dimensional target detection method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MING ZHU et al.: "Cross-Modality 3D Object Detection", arXiv:2008.10436v1 [cs.CV] *
翟少华: "Road obstacle perception and parameterized analysis based on image and point cloud fusion", China Master's Theses Full-text Database, Engineering Science and Technology II *
郑少武 et al.: "Vehicle detection in traffic environments based on the fusion of LiDAR point cloud and image information", Chinese Journal of Scientific Instrument *
马超: "Shallow and Deep Learning for Robust Online Object Tracking", Wanfang Data Knowledge Service Platform *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463736A (en) * 2021-12-28 2022-05-10 天津大学 Multi-target detection method and device based on multi-mode information fusion
CN114267041A (en) * 2022-03-01 2022-04-01 北京鉴智科技有限公司 Method and device for identifying object in scene
CN114267041B (en) * 2022-03-01 2022-05-13 北京鉴智科技有限公司 Method and device for identifying object in scene

Similar Documents

Publication Publication Date Title
Zhang et al. Learning signed distance field for multi-view surface reconstruction
Huang et al. Autonomous driving with deep learning: A survey of state-of-art technologies
Lin et al. Depth estimation from monocular images and sparse radar data
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
Yang et al. Self-supervised learning of depth inference for multi-view stereo
CN110689008A (en) Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
Biasutti et al. Lu-net: An efficient network for 3d lidar point cloud semantic segmentation based on end-to-end-learned 3d features and u-net
Zhang et al. Listereo: Generate dense depth maps from lidar and stereo imagery
CN112990229A (en) Multi-modal 3D target detection method, system, terminal and medium
Liu et al. Deep learning based multi-view stereo matching and 3D scene reconstruction from oblique aerial images
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
Shi et al. An improved lightweight deep neural network with knowledge distillation for local feature extraction and visual localization using images and LiDAR point clouds
Chen et al. Shape prior guided instance disparity estimation for 3d object detection
Du et al. Srh-net: Stacked recurrent hourglass network for stereo matching
Wang et al. CS2Fusion: Contrastive learning for Self-Supervised infrared and visible image fusion by estimating feature compensation map
Liu et al. Map-gen: An automated 3d-box annotation flow with multimodal attention point generator
Abdulwahab et al. Monocular depth map estimation based on a multi-scale deep architecture and curvilinear saliency feature boosting
Shi et al. Self-supervised learning of depth and ego-motion with differentiable bundle adjustment
CN104463962A (en) Three-dimensional scene reconstruction method based on GPS information video
CN117635989A (en) Binocular stereo matching method based on improved CFNet
Wu et al. Scene completeness-aware lidar depth completion for driving scenario
Li et al. Key supplement: Improving 3d car detection with pseudo point cloud
CN116958980A (en) Real-time scene text detection method
Chen et al. Monocular image depth prediction without depth sensors: An unsupervised learning method
Zhang et al. Reinforcing local structure perception for monocular depth estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210618