CN112990229A - Multi-modal 3D target detection method, system, terminal and medium - Google Patents

Multi-modal 3D target detection method, system, terminal and medium

Info

Publication number
CN112990229A
CN112990229A
Authority
CN
China
Prior art keywords
features
point cloud
image
lidar point
anchor frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110263197.8A
Other languages
Chinese (zh)
Inventor
马超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202110263197.8A priority Critical patent/CN112990229A/en
Publication of CN112990229A publication Critical patent/CN112990229A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a multi-modal 3D target detection method and system, which respectively extract the features of an original image I and a corresponding LiDAR point cloud L; perform point and pixel feature fusion on the features of the original image I and the corresponding LiDAR point cloud L to form LiDAR point cloud features, and take the features of the original image I as image features to generate a 3D area proposal and a 2D area proposal respectively; and extract features from the 3D area proposal and the 2D area proposal respectively and fuse them to generate the final 3D target detection result. A corresponding terminal and medium are also provided. The method completes target detection by using the geometric constraint relation and the feature relevance between the modalities; completes the 3D target detection task through point-pixel level feature fusion in the first stage and region-proposal level feature fusion in the second stage; and generates high-quality area proposals using the geometric constraints between the image and the LiDAR point cloud.

Description

Multi-modal 3D target detection method, system, terminal and medium
Technical Field
The invention relates to a 3D target detection method, and in particular to a deep-network-based multi-modal 3D target detection method, system, terminal and medium.
Background
Target detection is an important direction in the field of computer vision, with broad application prospects and market value. With the development of various sensor and automotive technologies, the role of 3D target detection in the field of autonomous driving has gradually begun to emerge. The most commonly used sensors for 3D target detection tasks in autonomous driving are cameras and LiDAR, whose corresponding data types are images and LiDAR point clouds, respectively. Due to the complementarity of information between these two modalities, namely the image and the LiDAR point cloud, and the gradually decreasing price of LiDAR sensors, multi-modality-based 3D target detection methods have gradually become a focus of research at home and abroad.
Existing multi-modality-based 3D target detection methods are mainly classified into three types: pre-fusion, post-fusion, and deep fusion. 1. Pre-fusion: fusion is carried out at the input end; the different modalities are preprocessed and combined into a new representation. 2. Post-fusion: each modality is processed separately and independently until the final stage, where result-level fusion is carried out; since this scheme allows the final result to come from a single independent module, it has great redundancy. 3. Deep fusion: different modalities are fused hierarchically within a neural network, allowing features from different modalities to exchange information in the intermediate layers.
However, because of the large differences in the data formats and distributions of images and LiDAR point clouds, experts and scholars at home and abroad have proposed a number of algorithmic models that fuse the two different modalities. For deep fusion, much of the focus is on projecting the LiDAR point cloud onto a 2D plane before fusing it with the image information; however, during the projection of the 3D point cloud onto the 2D plane, a loss of information inevitably occurs. For pre-fusion or post-fusion, fusion only occurs at the input or output end, and the relevance and complementarity between the different modality data are not fully utilized.
Through search, the following results are found:
Chinese patent application CN111860666A, "3D target detection method based on the fusion of point cloud and image self-attention mechanism", published on 2020-10-30, first provides a multi-layer three-dimensional feature extraction method based on the three-dimensional point cloud. Then, a two-dimensional feature extraction method based on an image geometric and semantic feature voting mechanism is provided. Next, a geometric-principle method is provided, which converts the two-dimensional features into the 3D detection pipeline of the point cloud and transmits them to the point cloud structure. Finally, a multi-tower training scheme is provided, the cooperativity of the two-dimensional and three-dimensional feature gradient fusion is optimized, and further fine-tuning is performed according to the fusion result. The method uses camera parameters to lift the two-dimensional features to a three-dimensional channel, and adopts a self-attention mechanism with a multi-tower training method to realize organic gradient fusion of the two-dimensional and three-dimensional features, thereby overcoming the limitations of detection methods based on sparse three-dimensional point cloud data and fully utilizing the high resolution and rich texture information of images to supplement and optimize three-dimensional target detection, so as to realize accurate detection of three-dimensional targets. The method still has the following technical problems:
1. the input of the method is an RGB-D image, and the method is not suitable for the sparse point cloud condition of an outdoor scene.
2. According to the method, extraction of the RGB image candidate frame and three-dimensional point cloud branches are decoupled, and meanwhile, classification of the three-dimensional point cloud and the RGB image are also decoupled, so that complementary information between the image and the point cloud cannot be fully utilized.
Chinese patent application CN110827202A, "Target detection method, target detection device, computer equipment and storage medium", published on 2020-02-21, obtains point cloud data and a corresponding color image; fuses the point cloud data and the corresponding color image to obtain fused data; completes the point cloud data according to the fused data to obtain completed point cloud data; performs target detection according to the fused data to obtain an intermediate target detection result; acquires, from the completed point cloud data, the completed point cloud data of the area corresponding to the intermediate target detection result; and corrects the intermediate target detection result according to the acquired completed point cloud data to obtain the final target detection result. The method still has the following technical problems:
1. The feature fusion of the input point cloud data and RGB image remains at the level of feature concatenation, and the deep correlation between the attributes of the two modalities is not considered.
2. The supervision signal of this method's detection-result correction network comes from the completed point cloud data obtained by completing the point cloud data according to the fused data; however, because of inaccurate depth prediction at object edges, the completed point cloud introduces some noise into the correction network, thereby weakening its performance.
At present, no description or report of a technology similar to the present invention has been found, and no similar data have been collected at home or abroad.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a deep-network-based multi-modal 3D target detection method, system, terminal and medium, belonging to multi-modality-based deep-fusion target detection technology, which takes the original point cloud and the image as input to realize multi-modal 3D target detection.
According to an aspect of the present invention, there is provided a multi-modal 3D object detection method, including the steps of:
respectively extracting the characteristics of an original image I and the corresponding LiDAR point cloud L;
performing point and pixel feature fusion on the original image I and the corresponding LiDAR point cloud L features to form LiDAR point cloud features, and respectively generating a 3D area proposal and a 2D area proposal by using the features of the original image I as image features;
and respectively extracting features from the 3D area proposal and the 2D area proposal, and fusing to generate a final 3D target detection result.
Preferably, the separately extracting features of the raw image I and the corresponding LiDAR point cloud L includes:
obtaining an original image I, the corresponding LiDAR point cloud L, an image feature extractor FE_I and a LiDAR point cloud feature extractor FE_L;
inputting the original image I into the image feature extractor FE_I to obtain the features F_I of the original image I;
inputting the LiDAR point cloud L into the LiDAR point cloud feature extractor FE_L to obtain the features F_L of the LiDAR point cloud L.
Preferably, the performing feature fusion of points and pixels on the features of the original image I and the corresponding LiDAR point cloud L to form LiDAR point cloud features, and generating a 3D area proposal and a 2D area proposal respectively by using the features of the original image I as image features, includes:
according to the extracted features F_L of the LiDAR point cloud L, dividing the LiDAR point cloud L into foreground points L_f and background points L_b, and for every foreground point L_f, arranging a 3D anchor frame A_3D centered at L_f;
projecting the 3D anchor frame A_3D onto the image plane to obtain the corresponding 2D anchor frame A_2D, forming a projection relation between points in the LiDAR coordinate system and pixels in the image coordinate system;
obtaining the correspondence between the features F_I of the original image I and the features F_L of the LiDAR point cloud L according to the projection relation, and carrying out point-pixel feature fusion of the features F_I and F_L according to this correspondence;
taking the fused features as the LiDAR point cloud features F'_L, performing regression and classification tasks on the 3D anchor frame A_3D, calculating the regression error L_reg^3D and classification error L_cls^3D, and obtaining the regressed 3D anchor frame A'_3D; taking the features F_I of the original image I as the image features, performing regression and classification tasks on the 2D anchor frame A_2D, calculating the regression error L_reg^2D and classification error L_cls^2D, and obtaining the regressed 2D anchor frame A'_2D;
taking the top T highest-scoring 2D anchor frames A'_2D and 3D anchor frames A'_3D as the 2D region proposals Pro_2D and 3D region proposals Pro_3D, respectively;
projecting the 3D anchor frames A'_3D onto the image plane according to the projection relation to generate 2D anchor frames A''_2D, and calculating the error L_pro between the 2D anchor frames A''_2D and the 2D anchor frames A'_2D;
constructing a loss function L_RPN = α·(L_reg^3D + L_cls^3D) + β·(L_reg^2D + L_cls^2D) + γ·L_pro, wherein α, β and γ are coefficients, and updating the network parameters by minimizing the loss function until the network converges.
Preferably, the 3D anchor frame A_3D has the following size: 3.9 meters long, 1.6 meters wide and 1.5 meters high.
Preferably, the projection relationship between a point x in the LiDAR coordinate system and a pixel y in the image coordinate system is:

y = P_rect^(i) · R_rect^(0) · [R_velo^cam | t_velo^cam] · x

wherein the projection matrix P_rect^(i) is composed of the internal parameters f^(i), c^(i) and b^(i) of the camera sensor, R_rect^(0) is the corrective rotation matrix of camera No. 0, R_velo^cam is the rotation between the camera and LiDAR coordinate systems, and t_velo^cam is the translation vector between the camera and LiDAR coordinate systems.
Preferably, α, β, γ are 1, 1, 0.5, respectively.
Preferably, the extracting features from the 3D area proposal and the 2D area proposal, respectively, and fusing them to generate a final 3D target detection result, includes:
extracting features F_pro^2D and F_pro^3D from within the 2D region proposals Pro_2D and the 3D region proposals Pro_3D, respectively, and then taking the fused features of F_pro^2D and F_pro^3D as the features of the 3D region proposals Pro_3D for producing the 3D target detection result;
performing regression and classification on the 3D region proposals Pro_3D and calculating the regression loss L'_reg and classification loss L'_cls;
constructing a loss function L_refinement = L'_cls + L'_reg, and updating the network parameters by minimizing the loss function until the network converges.
According to another aspect of the present invention, there is provided a multimodal 3D object detection system comprising:
an initial feature extraction module which respectively extracts features of the original image I and the corresponding LiDAR point cloud L;
a region proposal generation module, which performs feature fusion of points and pixels on the original image I and the corresponding LiDAR point cloud L to form LiDAR point cloud features, and takes the features of the original image I as image features to respectively generate a 3D region proposal and a 2D region proposal;
and the target detection module is used for extracting features from the 3D area proposal and the 2D area proposal respectively and fusing the features to generate a final 3D target detection result.
According to a third aspect of the present invention, there is provided a terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the program being operable to perform any of the methods described above.
According to a fourth aspect of the invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, is operable to perform the method of any of the above.
Due to the adoption of the technical scheme, compared with the prior art, the invention has the following beneficial effects:
the multi-mode 3D target detection method, the system, the terminal and the medium provided by the invention aim at the problem that the correlation and complementarity between two modes, namely an image and a LiDAR point cloud, are not fully utilized by a fusion method in the prior art, and the target detection is completed by utilizing the geometric constraint relation and the characteristic correlation between the modes.
The multi-modal 3D target detection method, system, terminal and medium provided by the invention complete the 3D target detection task through point-pixel level feature fusion in the first stage and region-proposal level feature fusion in the second stage.
The multi-mode 3D target detection method, the multi-mode 3D target detection system, the multi-mode 3D target detection terminal and the multi-mode 3D target detection medium provided by the invention design a 2D-3D anchor frame coupling mechanism, and dynamic information interaction between an image and a point cloud can be realized through the mechanism.
According to the multi-mode 3D target detection method, the system, the terminal and the medium, designed point-pixel level feature fusion can effectively utilize geometric constraint between an image and LiDAR point cloud, meanwhile, information sharing is achieved, and finally a high-quality area proposal is generated.
According to the multi-mode 3D target detection method, the system, the terminal and the medium, the designed regional proposal level feature fusion can effectively learn the local features of the image and the LiDAR point cloud robustness, and the robust and high-quality detection is realized.
The multi-modal 3D target detection method, system, terminal and medium provided by the invention, compared with data enhancement methods in the prior art, achieve a greater performance improvement.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
fig. 1 is a flowchart of a multi-modal 3D object detection method according to an embodiment of the present invention.
Fig. 2 is a flowchart of a multi-modal 3D object detection method provided in a preferred embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a multi-modal 3D object detection system according to an embodiment of the present invention.
Detailed Description
The following examples illustrate the invention in detail: the embodiment is implemented on the premise of the technical scheme of the invention, and a detailed implementation mode and a specific operation process are given. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention.
Fig. 1 is a flowchart of a multi-modal 3D object detection method according to an embodiment of the present invention.
In the multi-modal 3D target detection method provided in this embodiment, the image and the original point cloud are first taken as input and respectively fed into independent feature extractors to extract the corresponding features; point-pixel level feature fusion is then performed in the first stage to generate high-quality 3D region proposals. The image and LiDAR point cloud features corresponding to the 3D region proposals are then sent to the second stage for fusion to generate the final detection result.
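For orientation only, the overall two-stage flow can be summarized in the following Python-style sketch. Every function name in it (extract_image_features, generate_proposals, refine_proposals, and so on) is hypothetical and serves purely to illustrate the order of operations described above; it is not an implementation disclosed by the patent.

```python
def detect_3d_objects(image, lidar_points, calib):
    # Stage 0: independent feature extraction (hypothetical helpers)
    f_i = extract_image_features(image)            # F_I
    f_l = extract_point_features(lidar_points)     # F_L

    # Stage 1: point-pixel level fusion and region proposal generation
    f_l_fused = fuse_point_pixel(f_i, f_l, lidar_points, calib)   # F'_L
    pro_2d, pro_3d = generate_proposals(f_i, f_l_fused)           # Pro_2D, Pro_3D

    # Stage 2: region-proposal level fusion and refinement
    f_pro_2d = roi_features_2d(f_i, pro_2d)
    f_pro_3d = roi_features_3d(f_l_fused, pro_3d)
    return refine_proposals(f_pro_2d, f_pro_3d, pro_3d)           # final 3D boxes
```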
As shown in fig. 1, the multi-modal 3D object detection method provided by this embodiment may include the following steps:
step 1: respectively extracting the characteristics of an original image I and the corresponding LiDAR point cloud L;
step 2: performing point and pixel feature fusion on the features extracted in the step 1 to form LiDAR point cloud features, and respectively generating a 3D region proposal and a 2D region proposal by taking the features of the image I extracted in the step 1 as image features;
and step 3: and (3) respectively extracting features from the 3D area proposal and the 2D area proposal obtained in the step (2) and fusing the features to generate a final 3D target detection result.
In step 1 of this embodiment, as a preferred embodiment, the steps of extracting features of the original image I and the corresponding LiDAR point cloud L respectively may include:
Step 1.1: obtaining an original image I, the corresponding LiDAR point cloud L, an image feature extractor FE_I and a LiDAR point cloud feature extractor FE_L;
Step 1.2: inputting the original image I into the image feature extractor FE_I to obtain the features F_I of the original image I, and at the same time inputting the LiDAR point cloud L into the LiDAR point cloud feature extractor FE_L to obtain the features F_L of the LiDAR point cloud L.
In step 2 of this embodiment, as a preferred embodiment, point and pixel feature fusion is performed on the features extracted in step 1 to form LiDAR point cloud features, and the features of the image I extracted in step 1 are used as image features to generate a 3D region proposal and a 2D region proposal, respectively, which may include the following steps:
Step 2.1: according to the features F_L of the LiDAR point cloud L extracted in step 1, the LiDAR point cloud L is divided into foreground points L_f and background points L_b, and for every foreground point L_f, a 3D anchor frame A_3D centered at L_f is arranged.
Step 2.2: the 3D anchor frame A_3D obtained in step 2.1 is projected onto the image plane to obtain the corresponding 2D anchor frame A_2D, giving the projection relation between points in the LiDAR coordinate system and pixels in the image coordinate system.
Step 2.3: according to the projection relation obtained in step 2.2, the correspondence between the features F_I of the original image I and the features F_L of the LiDAR point cloud L is obtained, and point-pixel feature fusion of F_I and F_L is carried out according to this correspondence.
Step 2.4: the features obtained by fusion in step 2.3 are taken as the LiDAR point cloud features F'_L; regression and classification tasks are performed on the 3D anchor frame A_3D generated in step 2.1, the regression error L_reg^3D and classification error L_cls^3D are calculated, and the regressed 3D anchor frame A'_3D is obtained; the features F_I of the original image I obtained in step 1 are taken as the image features, regression and classification tasks are performed on the 2D anchor frame A_2D generated in step 2.2, the regression error L_reg^2D and classification error L_cls^2D are calculated, and the regressed 2D anchor frame A'_2D is obtained; the top T highest-scoring 2D anchor frames A'_2D and 3D anchor frames A'_3D are then taken as the 2D region proposals Pro_2D and 3D region proposals Pro_3D, respectively.
Step 2.5: the 3D anchor frames A'_3D obtained in step 2.4 are projected onto the image plane according to the projection relation in step 2.2 to generate 2D anchor frames A''_2D, and the error L_pro between the 2D anchor frames A''_2D and the 2D anchor frames A'_2D from step 2.4 is calculated.
Step 2.6: a loss function L_RPN = α·(L_reg^3D + L_cls^3D) + β·(L_reg^2D + L_cls^2D) + γ·L_pro is constructed, wherein α, β and γ are coefficients, and the network parameters are updated by minimizing the loss function until the network converges.
In step 2.1 of this embodiment, as a specific application example, the 3D anchor frame A_3D has the following size: 3.9 meters long, 1.6 meters wide and 1.5 meters high.
In step 2.2 of this embodiment, as a preferred embodiment, the projection relationship between a point x in the LiDAR coordinate system and a pixel y in the image coordinate system is:

y = P_rect^(i) · R_rect^(0) · [R_velo^cam | t_velo^cam] · x

wherein the projection matrix P_rect^(i) is composed of the internal parameters f^(i), c^(i) and b^(i) of the camera sensor, R_rect^(0) is the corrective rotation matrix of camera No. 0, R_velo^cam is the rotation between the camera and LiDAR coordinate systems, and t_velo^cam is the translation vector between the camera and LiDAR coordinate systems.
In step 2.6 of this embodiment, α, β, γ are 1, 1, 0.5, respectively, as a specific application example.
In step 3 of this embodiment, as a preferred embodiment, extracting the image features and the LiDAR point cloud features corresponding to the region proposals obtained in step 2 and performing feature fusion to generate the final detection result may include the following steps:
Step 3.1: features F_pro^2D and F_pro^3D are extracted from within the 2D region proposals Pro_2D and the 3D region proposals Pro_3D, respectively; the fused features of F_pro^2D and F_pro^3D are then taken as the features of the 3D region proposals Pro_3D for producing the 3D target detection result; regression and classification are performed on the 3D region proposals Pro_3D, and the regression loss L'_reg and classification loss L'_cls are calculated.
Step 3.2: a loss function L_refinement = L'_cls + L'_reg is constructed, and the network parameters are updated by minimizing the loss function until the network converges.
Fig. 2 is a flowchart of a multi-modal 3D object detection method according to a preferred embodiment of the present invention.
As shown in fig. 2, the multi-modal 3D object detection method provided by the preferred embodiment may include the following steps:
step 1: and respectively inputting the original image I and the corresponding LiDAR point cloud L into independent feature extractors to extract features.
Step 2: and (3) performing point-pixel level feature fusion on the features extracted in the step (1) to form LiDAR point cloud features, taking the features of the image I extracted in the step (1) as image features, and generating a high-quality 3D area proposal and a 2D area proposal.
And step 3: and (3) respectively extracting features from the 3D region proposal and the 2D region proposal obtained in the step (2) and fusing the features to generate a final detection result.
As a preferred embodiment, step 1 comprises the steps of:
Step 1.1: obtaining the input original image I, the corresponding LiDAR point cloud L, an image feature extractor FE_I and a LiDAR point cloud feature extractor FE_L.
Step 1.2: inputting the image I into the image feature extractor FE_I to obtain the image features F_I, and at the same time inputting the LiDAR point cloud L into the LiDAR point cloud feature extractor FE_L to obtain the LiDAR point cloud features F_L.
As a preferred embodiment, step 2 comprises the steps of:
Step 2.1: according to the LiDAR point cloud features F_L learned in step 1.2, the point cloud is divided into foreground points L_f and background points L_b, and for every foreground point L_f, a 3D anchor frame A_3D centered at L_f is arranged; the size of the anchor frame is: 3.9 meters long, 1.6 meters wide and 1.5 meters high.
Step 2.2: the 3D anchor frame A_3D in step 2.1 is projected onto the image plane to obtain the corresponding 2D anchor frame A_2D. The projection relationship between a point x in the LiDAR coordinate system and a pixel y in the image coordinate system is:

y = P_rect^(i) · R_rect^(0) · [R_velo^cam | t_velo^cam] · x

wherein the projection matrix P_rect^(i) is composed of the internal parameters f^(i), c^(i) and b^(i) of the camera sensor, R_rect^(0) is the corrective rotation matrix of camera No. 0, R_velo^cam is the rotation between the camera and LiDAR coordinate systems, and t_velo^cam is the translation vector between the camera and LiDAR coordinate systems.
Step 2.3: according to the projection relation in step 2.2, the correspondence between the image features F_I from step 1.2 and the LiDAR point cloud features F_L is obtained, and point-pixel level feature fusion of F_I and F_L is performed according to this correspondence.
Step 2.4: the fused features from step 2.3 are taken as the new LiDAR point cloud features F'_L; regression and classification tasks are performed on the 3D anchor frames generated in step 2.1, the regression error L_reg^3D and classification error L_cls^3D are calculated, and the regressed 3D anchor frames A'_3D are obtained; with F_I from step 1.2 as the image features, regression and classification tasks are performed on the 2D anchor frames generated in step 2.2, the regression error L_reg^2D and classification error L_cls^2D are calculated, and the regressed 2D anchor frames A'_2D are obtained. The top T highest-scoring 2D anchor frames A'_2D and 3D anchor frames A'_3D are then taken as the 2D region proposals Pro_2D and 3D region proposals Pro_3D, respectively.
Step 2.5: the 3D anchor frames A'_3D in step 2.4 are projected onto the image plane according to the projection relation in step 2.2 to generate 2D anchor frames A''_2D, and the error L_pro between A''_2D and A'_2D from step 2.4 is calculated.
Step 2.6: a loss function L_RPN = α·(L_reg^3D + L_cls^3D) + β·(L_reg^2D + L_cls^2D) + γ·L_pro is constructed, and the network parameters are updated by minimizing the loss function until the network converges. In this example, α, β and γ are 1, 1 and 0.5, respectively.
As a preferred embodiment, step 3 comprises the steps of:
Step 3.1: with F_I from step 1 and F'_L from step 2 as the image and LiDAR point cloud features respectively, features F_pro^2D and F_pro^3D are extracted within the regions of the 2D region proposals Pro_2D and 3D region proposals Pro_3D from step 2.4; the fused features of F_pro^2D and F_pro^3D are then taken as the features of Pro_3D for producing the 3D target detection result. Regression and classification are performed on Pro_3D, and the regression loss L'_reg and classification loss L'_cls are calculated.
Step 3.2: a loss function L_refinement = L'_cls + L'_reg is constructed, and the network parameters are updated by minimizing the loss function until the network converges.
The technical solutions provided by the above embodiments of the present invention are further described below with reference to a specific application example.
The PointRCNN detector is taken as an example and serves as a reference network of the specific application example. The method flow provided by the above embodiment of the present invention includes:
the first step is as follows: and respectively inputting the original image I and the corresponding LiDAR point cloud L into independent feature extractors to extract features.
1.1) Obtaining the input original image I, the corresponding LiDAR point cloud L, the image feature extractor FE_I with its network parameters θ1, and the LiDAR point cloud feature extractor FE_L with its network parameters θ2.
Here FE_I is a ResNet-50 network, FE_L is a PointNet++ network, and θ1 and θ2 are the corresponding pre-trained model parameters, respectively.
1.2) The image I is input into the image feature extractor FE_I to obtain the image features F_I; at the same time, the LiDAR point cloud L is input into the LiDAR point cloud feature extractor FE_L to obtain the LiDAR point cloud features F_L.
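As a rough illustration of 1.1)-1.2), the sketch below builds the image extractor from torchvision's ResNet-50 and leaves the PointNet++ extractor as a user-supplied module; this is only an assumed setup consistent with the description above, not the inventors' code.

```python
import torch
import torchvision

# FE_I: ResNet-50 trunk (classification head removed), pre-trained parameters theta_1
image_extractor = torch.nn.Sequential(
    *list(torchvision.models.resnet50(pretrained=True).children())[:-2]
)

def extract_initial_features(image, lidar_points, point_extractor):
    """image: (1, 3, H, W) tensor; lidar_points: (1, N, 4) tensor (x, y, z, reflectance).

    point_extractor stands in for the PointNet++ network FE_L with parameters
    theta_2; a concrete PointNet++ implementation is outside this sketch.
    """
    f_i = image_extractor(image)          # image features F_I, shape (1, 2048, H/32, W/32)
    f_l = point_extractor(lidar_points)   # per-point features F_L, shape (1, N, C)
    return f_i, f_l
```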
The second step is that: and performing point-pixel level feature fusion on the features extracted in the first step, and generating a high-quality 3D region proposal.
2.1) According to the point cloud features F_L extracted by PointNet++, the LiDAR segmentation result is scored; a point with a score greater than or equal to 0.3 is judged to be a foreground point L_f, and a point with a score below 0.3 is judged to be a background point L_b. For every foreground point L_f, a 3D anchor frame A_3D centered at L_f is arranged; the size of the anchor frame is (3.9 meters long, 1.6 meters wide, 1.5 meters high).
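The segmentation-and-anchor step 2.1) might be sketched as follows in NumPy; the fixed yaw of 0 for each anchor is an assumption, since the patent only fixes the 0.3 score threshold and the 3.9 m x 1.6 m x 1.5 m anchor size.

```python
import numpy as np

def place_3d_anchors(points, fg_scores, threshold=0.3, size=(3.9, 1.6, 1.5)):
    """Place one 3D anchor A_3D, encoded as (x, y, z, l, w, h, yaw), at every
    foreground point L_f whose segmentation score is >= threshold.

    points: (N, 3) LiDAR coordinates; fg_scores: (N,) per-point foreground scores.
    """
    fg_mask = fg_scores >= threshold            # foreground points L_f
    centers = points[fg_mask]                   # (M, 3) anchor centres
    l, w, h = size
    boxes = np.tile([l, w, h, 0.0], (centers.shape[0], 1))   # fixed size, yaw = 0 (assumed)
    return np.hstack([centers, boxes])          # (M, 7) anchors A_3D
```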
2.2) The 3D anchor frame A_3D in 2.1) is projected onto the image plane to obtain the corresponding 2D anchor frame A_2D. The projection relationship between a point x in the LiDAR coordinate system and a pixel y in the image coordinate system is:

y = P_rect^(i) · R_rect^(0) · [R_velo^cam | t_velo^cam] · x

wherein the projection matrix P_rect^(i) is composed of the internal parameters f^(i), c^(i) and b^(i) of the camera sensor, R_rect^(0) is the corrective rotation matrix of camera No. 0, R_velo^cam is the rotation between the camera and LiDAR coordinate systems, and t_velo^cam is the translation vector between the camera and LiDAR coordinate systems.
2.3) According to the projection relation in 2.2), the correspondence between the image features F_I from 1.2) and the LiDAR point cloud features F_L is obtained; F_I and F_L are then concatenated along the channel dimension according to this correspondence to form the new feature F'_L = (F_I | F_L), which is fed into the PointNet++ network for further learning.
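A combined sketch of the projection in 2.2) and the channel concatenation in 2.3) is given below. The calibration matrix names (P_rect, R_rect, Tr_velo_to_cam, following the KITTI convention) and the nearest-pixel sampling are assumptions, and the projected coordinates are assumed to already be scaled to the feature-map resolution.

```python
import torch

def project_lidar_to_image(points, P_rect, R_rect, Tr_velo_to_cam):
    """Project LiDAR points (N, 3) to pixel coordinates (N, 2) using
    y = P_rect . R_rect . Tr_velo_to_cam . x in homogeneous coordinates."""
    ones = torch.ones(points.shape[0], 1, dtype=points.dtype)
    x = torch.cat([points, ones], dim=1)                 # (N, 4)
    y = (P_rect @ R_rect @ Tr_velo_to_cam @ x.T).T       # (N, 3)
    return y[:, :2] / y[:, 2:3]                          # perspective division

def point_pixel_fusion(f_i, f_l, pixel_uv):
    """Form F'_L = (F_I | F_L): gather image features at each projected pixel
    and concatenate them with the per-point features along the channel axis.

    f_i: (C_i, H, W) image feature map; f_l: (N, C_l) per-point features;
    pixel_uv: (N, 2) projected pixel coordinates at feature-map resolution.
    """
    u = pixel_uv[:, 0].round().long().clamp(0, f_i.shape[2] - 1)
    v = pixel_uv[:, 1].round().long().clamp(0, f_i.shape[1] - 1)
    sampled = f_i[:, v, u].t()                  # (N, C_i) image features per point
    return torch.cat([sampled, f_l], dim=1)     # (N, C_i + C_l) fused features F'_L
```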
2.4) With the fused features F'_L from 2.3) as the new LiDAR point cloud features, regression and classification tasks are performed on the 3D anchor frames generated in 2.1), the regression error L_reg^3D and classification error L_cls^3D are calculated, and the regressed 3D anchor frames A'_3D are obtained. With F_I from 1.2) as the image features, regression and classification tasks are performed on the 2D anchor frames generated in 2.2), the regression error L_reg^2D and classification error L_cls^2D are calculated, and the regressed 2D anchor frames A'_2D are obtained. The 9000 highest-scoring anchor frames are then taken as the region proposals Pro_2D and Pro_3D.
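The parallel regression and classification heads of 2.4) might take a form like the following sketch; the layer widths and the box parameterizations (7 values for a 3D box, 4 for a 2D box) are assumptions, since the patent does not fix them.

```python
import torch.nn as nn

class AnchorHead(nn.Module):
    """Objectness classification and box regression over per-anchor features."""
    def __init__(self, in_channels, box_dim):
        super().__init__()
        self.cls_layer = nn.Linear(in_channels, 1)         # anchor score
        self.reg_layer = nn.Linear(in_channels, box_dim)   # box residuals

    def forward(self, feats):              # feats: (num_anchors, in_channels)
        return self.cls_layer(feats), self.reg_layer(feats)

# one head on the fused point features F'_L for the 3D anchors A_3D,
# one head on the image features F_I for the 2D anchors A_2D
head_3d = AnchorHead(in_channels=128, box_dim=7)   # (x, y, z, l, w, h, yaw)
head_2d = AnchorHead(in_channels=256, box_dim=4)   # (x1, y1, x2, y2)
```

The highest-scoring anchors from each head are then kept as Pro_2D and Pro_3D, as described above.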
2.5) The 3D anchor frames A'_3D in 2.4) are projected onto the image plane according to the projection relation in 2.2) to generate 2D anchor frames A''_2D, and the error L_pro between A''_2D and A'_2D from 2.4) is calculated.
2.6) A loss function L_RPN = α·(L_reg^3D + L_cls^3D) + β·(L_reg^2D + L_cls^2D) + γ·L_pro is constructed, and the network parameters are updated by minimizing the loss function until the network converges.
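A minimal sketch of this first-stage loss follows. The smooth-L1 and binary cross-entropy choices for the individual terms are assumptions (the patent does not name them), while the weighting alpha = beta = 1, gamma = 0.5 follows the values given earlier.

```python
import torch.nn.functional as F

def rpn_loss(reg_3d, cls_3d, reg_2d, cls_2d, l_pro, alpha=1.0, beta=1.0, gamma=0.5):
    """L_RPN = alpha*(L_reg^3D + L_cls^3D) + beta*(L_reg^2D + L_cls^2D) + gamma*L_pro."""
    return alpha * (reg_3d + cls_3d) + beta * (reg_2d + cls_2d) + gamma * l_pro

# hypothetical example of how the individual terms could be computed:
# reg_3d = F.smooth_l1_loss(pred_boxes_3d, target_boxes_3d)
# cls_3d = F.binary_cross_entropy_with_logits(pred_scores_3d, target_labels_3d)
# l_pro  = F.smooth_l1_loss(projected_boxes_2d, pred_boxes_2d)   # A''_2D vs A'_2D
```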
The third step: and extracting the image and the LiDAR point cloud characteristics corresponding to the 3D area proposal generated in the second step, and performing characteristic fusion to generate a final detection result.
3.1) With F_I from the first step and F'_L from the second step as the image and LiDAR point cloud features respectively, features F_pro^2D and F_pro^3D are extracted within the regions of the 2D region proposals Pro_2D and 3D region proposals Pro_3D from 2.4). F_pro^2D and F_pro^3D are then concatenated along the channel dimension to form the new feature F'_pro = (F_pro^2D | F_pro^3D), which is taken as the feature of Pro_3D. Regression and classification are performed on Pro_3D, and the regression loss L'_reg and classification loss L'_cls are calculated.
3.2) A loss function L_refinement = L'_cls + L'_reg is constructed, and the network parameters are updated by minimizing the loss function until the network converges.
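To make the second-stage fusion of 3.1) and the refinement loss of 3.2) concrete, a hypothetical sketch is given below; the layer sizes and the choice of a shared hidden layer are assumptions not fixed by the patent.

```python
import torch
import torch.nn as nn

class ProposalRefinement(nn.Module):
    """Refine Pro_3D from the fused proposal features F'_pro = (F_pro^2D | F_pro^3D)."""
    def __init__(self, dim_2d, dim_3d, box_dim=7):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim_2d + dim_3d, 256), nn.ReLU())
        self.cls_head = nn.Linear(256, 1)          # confidence of the 3D proposal
        self.reg_head = nn.Linear(256, box_dim)    # 3D box refinement residuals

    def forward(self, f_pro_2d, f_pro_3d):         # each: (num_proposals, dim_*)
        f_pro = torch.cat([f_pro_2d, f_pro_3d], dim=1)   # channel concatenation
        x = self.shared(f_pro)
        return self.cls_head(x), self.reg_head(x)

# L_refinement = L'_cls + L'_reg is then computed from these two outputs
```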
The implementation effect is as follows:
according to the steps, the test is carried out on a common 3D target detection data set KITTI. The data set is divided into a training set, a validation set, and a test set. The data set has 3D detection average accuracy as an evaluation index. Table 1 is a comparison of the performance of the present invention on the KITTI dataset against existing 3D target detection methods. As shown in table 1, it can be seen that the method provided by the above embodiment of the present invention has a significantly better improvement on the reference model than other algorithms.
Table 1: performance comparison on the KITTI dataset between the present invention and existing 3D target detection methods.
Another embodiment of the present invention provides a multi-modal 3D object detection system, as shown in fig. 3, including:
an initial feature extraction module which respectively extracts features of the original image I and the corresponding LiDAR point cloud L;
a region proposal generation module, which performs feature fusion of points and pixels on the original image I and the corresponding LiDAR point cloud L to form LiDAR point cloud features, and takes the features of the original image I as image features to respectively generate a 3D region proposal and a 2D region proposal;
and the target detection module is used for extracting features from the 3D area proposal and the 2D area proposal respectively and fusing the features to generate a final 3D target detection result.
A third embodiment of the present invention provides a terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor being operable to execute the method according to any one of the above embodiments of the present invention when executing the program.
Optionally, a memory is provided for storing a program. The memory may comprise a volatile memory, such as a random access memory (RAM), for example a static random access memory (SRAM) or a double data rate synchronous dynamic random access memory (DDR SDRAM); the memory may also comprise a non-volatile memory, such as a flash memory. The memory is used to store computer programs (e.g., applications or functional modules implementing the above-described methods), computer instructions, etc., which may be stored in one or more memories in a partitioned manner, and the above computer programs, computer instructions, data, etc. may be invoked by a processor.
A processor for executing the computer program stored in the memory to implement the steps of the method according to the above embodiments. Reference may be made in particular to the description relating to the preceding method embodiment.
The processor and the memory may be separate structures or may be an integrated structure integrated together. When the processor and the memory are separate structures, the memory, the processor may be coupled by a bus.
A fourth embodiment of the invention provides a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of any of the above-mentioned embodiments of the invention.
The multi-modal 3D target detection method, the multi-modal 3D target detection system, the multi-modal terminal and the multi-modal 3D target detection medium provided by the embodiment of the invention solve the problem that the correlation and the complementarity between two modalities, namely an image and a LiDAR point cloud, are not fully utilized in the fusion method in the prior art, and the target detection is completed by utilizing the geometric constraint relationship and the characteristic correlation between the modalities; completing a 3D target detection task through the feature fusion of the point-pixel level of the first stage and the region proposal level of the second stage; dynamic information interaction between the image and the point cloud can be realized through a 2D-3D anchor frame coupling mechanism; the point-pixel level feature fusion can effectively utilize the geometric constraint between the image and the LiDAR point cloud, simultaneously realize the sharing of information and finally generate a high-quality area proposal; the region proposal level feature fusion can effectively learn the local features of the image and LiDAR point cloud robustness, and realize the detection of robustness and high quality; the data enhancement method achieves higher performance enhancement than the prior art.
It should be noted that, the steps in the method provided by the present invention may be implemented by using corresponding modules, devices, units, and the like in the system, and those skilled in the art may implement the composition of the system by referring to the technical solution of the method, that is, the embodiment in the method may be understood as a preferred example for constructing the system, and will not be described herein again.
Those skilled in the art will appreciate that, in addition to implementing the system and its various devices provided by the present invention in purely computer readable program code means, the method steps can be fully programmed to implement the same functions by implementing the system and its various devices in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and various devices thereof provided by the present invention can be regarded as a hardware component, and the devices included in the system and various devices thereof for realizing various functions can also be regarded as structures in the hardware component; means for performing the functions may also be regarded as structures within both software modules and hardware components for performing the methods.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims (10)

1. A multi-modal 3D object detection method is characterized by comprising the following steps:
respectively extracting the characteristics of an original image I and the corresponding LiDAR point cloud L;
performing point and pixel feature fusion on the original image I and the corresponding LiDAR point cloud L features to form LiDAR point cloud features, and respectively generating a 3D area proposal and a 2D area proposal by using the features of the original image I as image features;
and respectively extracting features from the 3D area proposal and the 2D area proposal, and fusing to generate a final 3D target detection result.
2. The multi-modal 3D object detection method of claim 1, wherein the separately extracting features of a raw image I and a corresponding LiDAR point cloud L comprises:
obtaining an original image I, the corresponding LiDAR point cloud L, an image feature extractor FE_I and a LiDAR point cloud feature extractor FE_L;
inputting the original image I into the image feature extractor FE_I to obtain the features F_I of the original image I;
inputting the LiDAR point cloud L into the LiDAR point cloud feature extractor FE_L to obtain the features F_L of the LiDAR point cloud L.
3. The multi-modal 3D object detection method of claim 1, wherein the fusing of point and pixel features of the raw image I and corresponding LiDAR point cloud L to form LiDAR point cloud features, using the features of the raw image I as image features to generate a 3D and a 2D area proposal, respectively, comprises:
according to the extracted features F_L of the LiDAR point cloud L, dividing the LiDAR point cloud L into foreground points L_f and background points L_b, and for every foreground point L_f, arranging a 3D anchor frame A_3D centered at L_f;
projecting the 3D anchor frame A_3D onto the image plane to obtain the corresponding 2D anchor frame A_2D, forming a projection relation between points in the LiDAR coordinate system and pixels in the image coordinate system;
obtaining the correspondence between the features F_I of the original image I and the features F_L of the LiDAR point cloud L according to the projection relation, and carrying out point-pixel feature fusion of the features F_I and F_L according to this correspondence;
taking the fused features as the LiDAR point cloud features F'_L, performing regression and classification tasks on the 3D anchor frame A_3D, calculating the regression error L_reg^3D and classification error L_cls^3D, and obtaining the regressed 3D anchor frame A'_3D; taking the features F_I of the original image I as the image features, performing regression and classification tasks on the 2D anchor frame A_2D, calculating the regression error L_reg^2D and classification error L_cls^2D, and obtaining the regressed 2D anchor frame A'_2D;
taking the top T highest-scoring 2D anchor frames A'_2D and 3D anchor frames A'_3D as the 2D region proposals Pro_2D and 3D region proposals Pro_3D, respectively;
projecting the 3D anchor frames A'_3D onto the image plane according to the projection relation to generate 2D anchor frames A''_2D, and calculating the error L_pro between the 2D anchor frames A''_2D and the 2D anchor frames A'_2D;
constructing a loss function L_RPN = α·(L_reg^3D + L_cls^3D) + β·(L_reg^2D + L_cls^2D) + γ·L_pro, wherein α, β and γ are coefficients, and updating the network parameters by minimizing the loss function until the network converges.
4. The multi-modal 3D object detection method according to claim 3, wherein the 3D anchor frame A_3D has the following size: 3.9 meters long, 1.6 meters wide and 1.5 meters high.
5. The multi-modal 3D object detection method of claim 3, wherein the projection relationship between a point x in the LiDAR coordinate system and a pixel y in the image coordinate system is:

y = P_rect^(i) · R_rect^(0) · [R_velo^cam | t_velo^cam] · x

wherein the projection matrix P_rect^(i) is composed of the internal parameters f^(i), c^(i) and b^(i) of the camera sensor, R_rect^(0) is the corrective rotation matrix of camera No. 0, R_velo^cam is the rotation between the camera and LiDAR coordinate systems, and t_velo^cam is the translation vector between the camera and LiDAR coordinate systems.
6. The multi-modal 3D object detection method according to claim 3, wherein α, β, γ are 1, 1, 0.5, respectively.
7. The multi-modal 3D object detection method according to claim 3, wherein the extracting and fusing features from the 3D region proposal and the 2D region proposal respectively to generate a final 3D object detection result comprises:
extracting features F_pro^2D and F_pro^3D from within the 2D region proposals Pro_2D and the 3D region proposals Pro_3D, respectively, and then taking the fused features of F_pro^2D and F_pro^3D as the features of the 3D region proposals Pro_3D for producing the 3D target detection result;
performing regression and classification on the 3D region proposals Pro_3D and calculating the regression loss L'_reg and classification loss L'_cls;
constructing a loss function L_refinement = L'_cls + L'_reg, and updating the network parameters by minimizing the loss function until the network converges.
8. A multi-modal 3D object detection system, comprising:
an initial feature extraction module which respectively extracts features of the original image I and the corresponding LiDAR point cloud L;
a region proposal generation module, which performs feature fusion of points and pixels on the original image I and the corresponding LiDAR point cloud L to form LiDAR point cloud features, and takes the features of the original image I as image features to respectively generate a 3D region proposal and a 2D region proposal;
and the target detection module is used for extracting features from the 3D area proposal and the 2D area proposal respectively and fusing the features to generate a final 3D target detection result.
9. A terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, is operative to perform the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 7.
CN202110263197.8A 2021-03-11 2021-03-11 Multi-modal 3D target detection method, system, terminal and medium Pending CN112990229A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110263197.8A CN112990229A (en) 2021-03-11 2021-03-11 Multi-modal 3D target detection method, system, terminal and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110263197.8A CN112990229A (en) 2021-03-11 2021-03-11 Multi-modal 3D target detection method, system, terminal and medium

Publications (1)

Publication Number Publication Date
CN112990229A true CN112990229A (en) 2021-06-18

Family

ID=76334903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110263197.8A Pending CN112990229A (en) 2021-03-11 2021-03-11 Multi-modal 3D target detection method, system, terminal and medium

Country Status (1)

Country Link
CN (1) CN112990229A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114267041A (en) * 2022-03-01 2022-04-01 北京鉴智科技有限公司 Method and device for identifying object in scene
CN114463736A (en) * 2021-12-28 2022-05-10 天津大学 Multi-target detection method and device based on multi-mode information fusion

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948661A (en) * 2019-02-27 2019-06-28 江苏大学 A kind of 3D vehicle checking method based on Multi-sensor Fusion
CN110738121A (en) * 2019-09-17 2020-01-31 北京科技大学 front vehicle detection method and detection system
CN110929692A (en) * 2019-12-11 2020-03-27 中国科学院长春光学精密机械与物理研究所 Three-dimensional target detection method and device based on multi-sensor information fusion
CN111027401A (en) * 2019-11-15 2020-04-17 电子科技大学 End-to-end target detection method with integration of camera and laser radar
CN111079685A (en) * 2019-12-25 2020-04-28 电子科技大学 3D target detection method
CN111291714A (en) * 2020-02-27 2020-06-16 同济大学 Vehicle detection method based on monocular vision and laser radar fusion
CN111563442A (en) * 2020-04-29 2020-08-21 上海交通大学 Slam method and system for fusing point cloud and camera image data based on laser radar
CN111583337A (en) * 2020-04-25 2020-08-25 华南理工大学 Omnibearing obstacle detection method based on multi-sensor fusion
CN111860666A (en) * 2020-07-27 2020-10-30 湖南工程学院 3D target detection method based on point cloud and image self-attention mechanism fusion
CN112052860A (en) * 2020-09-11 2020-12-08 中国人民解放军国防科技大学 Three-dimensional target detection method and system

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948661A (en) * 2019-02-27 2019-06-28 江苏大学 A kind of 3D vehicle checking method based on Multi-sensor Fusion
CN110738121A (en) * 2019-09-17 2020-01-31 北京科技大学 front vehicle detection method and detection system
CN111027401A (en) * 2019-11-15 2020-04-17 电子科技大学 End-to-end target detection method with integration of camera and laser radar
CN110929692A (en) * 2019-12-11 2020-03-27 中国科学院长春光学精密机械与物理研究所 Three-dimensional target detection method and device based on multi-sensor information fusion
CN111079685A (en) * 2019-12-25 2020-04-28 电子科技大学 3D target detection method
CN111291714A (en) * 2020-02-27 2020-06-16 同济大学 Vehicle detection method based on monocular vision and laser radar fusion
CN111583337A (en) * 2020-04-25 2020-08-25 华南理工大学 Omnibearing obstacle detection method based on multi-sensor fusion
CN111563442A (en) * 2020-04-29 2020-08-21 上海交通大学 Slam method and system for fusing point cloud and camera image data based on laser radar
CN111860666A (en) * 2020-07-27 2020-10-30 湖南工程学院 3D target detection method based on point cloud and image self-attention mechanism fusion
CN112052860A (en) * 2020-09-11 2020-12-08 中国人民解放军国防科技大学 Three-dimensional target detection method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MING ZHU et al.: "Cross-Modality 3D Object Detection", arXiv:2008.10436v1 [cs.CV] *
翟少华: "Road obstacle perception and parameterized analysis based on image and point cloud fusion", China Master's Theses Full-text Database, Engineering Science and Technology II *
郑少武 et al.: "Vehicle detection in traffic environments based on the fusion of LiDAR point cloud and image information", Chinese Journal of Scientific Instrument *
马超: "Shallow and Deep Learning for Robust Online Object Tracking", Wanfang Data Knowledge Service Platform *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114463736A (en) * 2021-12-28 2022-05-10 天津大学 Multi-target detection method and device based on multi-mode information fusion
CN114267041A (en) * 2022-03-01 2022-04-01 北京鉴智科技有限公司 Method and device for identifying object in scene
CN114267041B (en) * 2022-03-01 2022-05-13 北京鉴智科技有限公司 Method and device for identifying object in scene

Similar Documents

Publication Publication Date Title
Zhang et al. Learning signed distance field for multi-view surface reconstruction
Huang et al. Autonomous driving with deep learning: A survey of state-of-art technologies
Lin et al. Depth estimation from monocular images and sparse radar data
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
Yang et al. Self-supervised learning of depth inference for multi-view stereo
CN110689008A (en) Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
Biasutti et al. Lu-net: An efficient network for 3d lidar point cloud semantic segmentation based on end-to-end-learned 3d features and u-net
Zhang et al. Listereo: Generate dense depth maps from lidar and stereo imagery
CN112990229A (en) Multi-modal 3D target detection method, system, terminal and medium
Liu et al. Deep learning based multi-view stereo matching and 3D scene reconstruction from oblique aerial images
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
Shi et al. An improved lightweight deep neural network with knowledge distillation for local feature extraction and visual localization using images and LiDAR point clouds
Chen et al. Shape prior guided instance disparity estimation for 3d object detection
Du et al. Srh-net: Stacked recurrent hourglass network for stereo matching
Wang et al. CS2Fusion: Contrastive learning for Self-Supervised infrared and visible image fusion by estimating feature compensation map
Liu et al. Map-gen: An automated 3d-box annotation flow with multimodal attention point generator
Abdulwahab et al. Monocular depth map estimation based on a multi-scale deep architecture and curvilinear saliency feature boosting
Shi et al. Self-supervised learning of depth and ego-motion with differentiable bundle adjustment
CN104463962A (en) Three-dimensional scene reconstruction method based on GPS information video
CN117635989A (en) Binocular stereo matching method based on improved CFNet
Wu et al. Scene completeness-aware lidar depth completion for driving scenario
Li et al. Key supplement: Improving 3d car detection with pseudo point cloud
CN116958980A (en) Real-time scene text detection method
Chen et al. Monocular image depth prediction without depth sensors: An unsupervised learning method
Zhang et al. Reinforcing local structure perception for monocular depth estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210618