CN113408584A - RGB-D multi-modal feature fusion 3D target detection method - Google Patents

RGB-D multi-modal feature fusion 3D target detection method

Info

Publication number
CN113408584A
Authority
CN
China
Prior art keywords
target
rgb
point cloud
detection
feature fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110545313.5A
Other languages
Chinese (zh)
Other versions
CN113408584B (en)
Inventor
陈光柱
侯睿
韩银贺
唐在作
茹青君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Technology
Original Assignee
Chengdu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Technology filed Critical Chengdu University of Technology
Priority to CN202110545313.5A priority Critical patent/CN113408584B/en
Publication of CN113408584A publication Critical patent/CN113408584A/en
Application granted granted Critical
Publication of CN113408584B publication Critical patent/CN113408584B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 18/253 Pattern recognition; fusion techniques of extracted features
    • G06F 18/214 Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/08 Neural networks; learning methods
    • G06T 7/11 Image analysis; region-based segmentation
    • G06T 7/136 Image analysis; segmentation or edge detection involving thresholding
    • G06T 7/194 Image analysis; segmentation involving foreground-background segmentation
    • G06T 7/73 Image analysis; determining position or orientation of objects or cameras using feature-based methods
    • G06T 2207/10024 Image acquisition modality; color image
    • G06T 2207/10028 Image acquisition modality; range image; depth image; 3D point clouds
    • G06T 2207/20081 Special algorithmic details; training; learning
    • G06T 2207/20084 Special algorithmic details; artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an RGB-D multi-modal feature fusion 3D target detection method. 3D target detection can obtain both the semantic information and the spatial size information of a target, which is of great significance for realizing intelligent 3D target detection. Specifically, the method comprises the following steps: firstly, an improved YOLOv3 target detection network model is used to obtain a 2D prior region, an RGB-D target saliency detection algorithm is proposed to extract the target pixels, and the target view cone point cloud is obtained through view cone projection; secondly, in order to remove outliers and reduce the number of points in the target view cone point cloud, a multi-modal feature fusion strategy is proposed to simplify the target view cone point cloud, and this strategy replaces the process of inferring the 3D target with a deep neural network; finally, the 3D bounding box of the target point cloud is generated with an axis-aligned bounding box (AABB) algorithm, and the pose coordinates of the target point cloud are calculated with a PCA algorithm. The invention has the following beneficial effects: the RGB-D multi-modal feature fusion 3D target detection method improves the detection precision of multi-scale targets in application scenes with only a small amount of 2D labeled data and no 3D labeled data, and offers good real-time performance and high precision.

Description

RGB-D multi-modal feature fusion 3D target detection method
Technical Field
The invention relates to the fields of computer vision, image recognition and target detection, in particular to an RGB-D multi-modal feature fusion 3D target detection method.
Background
Target detection, as an important branch of machine vision, involves the intersection of multiple disciplines and fields and is the basis of higher-level tasks such as target tracking, behavior recognition and scene flow estimation, which can only be realized once target detection itself is mature and well developed. Target detection identifies and locates target objects in a scene, such as automobiles, pedestrians and roads: target recognition distinguishes the objects of interest in the scene, obtains the category of each target object and gives its classification probability; target localization calibrates the position of an object of interest in the scene, generally framing its boundary with a rectangular box or a cuboid box. Target detection currently has huge application prospects and appears in fields such as face recognition, intelligent monitoring, intelligent workshops, intelligent transportation and unmanned driving. 3D target detection technology can obtain not only the semantic information of a target but also its spatial size information, and therefore has both research value and application prospects.
At present, traditional image processing methods often require a complex feature extractor designed for a specific detection target, and such algorithms generalize poorly, so it is difficult for traditional methods to achieve intelligent target detection. With the continuous development of artificial intelligence and computer vision technology, image recognition based on neural networks and deep learning has shown excellent performance, and 2D object detection algorithms have developed rapidly. Compared with traditional target detection methods, 2D target detection can efficiently detect many categories of targets, with high detection precision, strong generalization ability and good robustness. However, 2D object detection cannot acquire the actual 3D spatial information of an object (spatial pose coordinates, 3D size, etc.). 3D target detection, by contrast, expresses the actual spatial position of the detected target more accurately, helps to identify and locate targets precisely, and better guarantees the safety of interactive operations with those targets.
In recent years, as the accuracy of depth sensors such as 3D lidar and RGB-D cameras has improved, 3D target detection technology has developed in a breakthrough manner. As an important task in scene understanding, 3D target detection classifies objects of interest in 3D data and locates their 3D bounding boxes. While acquiring the semantic information of the object of interest, 3D target detection can also locate the target's 3D bounding box more accurately than 2D target detection, so 3D object detection techniques are more valuable than 2D object detection techniques. At present, 3D target detection is researched and applied mostly in the outdoor autonomous-driving field, while indoor 3D target detection research mainly focuses on 3D detection of everyday-scene targets, positioning of robotic arms, and workpiece grasping. These methods all rely on a large number of labeled data sets for a specific scene, which hinders their popularization in practical application scenarios.
Most existing 3D target detection methods require the construction of a large-scale 3D labeled data set, and constructing such a data set is difficult, so these methods struggle to realize 3D target detection under practical application requirements. Therefore, the RGB-D multi-modal feature fusion 3D target detection method proposed here, which can be effectively applied to 3D target detection in practical application scenes, has great research significance.
Disclosure of Invention
The main purpose of the invention is to provide an RGB-D multi-modal feature fusion 3D target detection method. By improving the YOLOv3 network model, the method raises the detection precision of multi-scale targets in the detection scene; at the same time, it realizes efficient 3D target detection using only a small amount of 2D labeled data and without relying on any 3D labeled data.
The invention is realized by adopting the following technical scheme: an RGB-D multimodal feature fusion 3D object detection method (hereinafter abbreviated MMFF-3D object detection method), comprising the steps of:
Step 1: preliminarily establish a target data set of the detection scene: collect pictures through a web crawler, take pictures in an actual workshop, and carry out 2D labeling of the data set.
Step 2: based on the multi-scale prediction characteristics of the YOLOv3 target detection framework, the convolutional backbone network DarkNet53 is further improved into the MD56 backbone network, and the resulting MD56-YOLOv3 target detection framework improves the 2D detection precision of the target; the MD56-YOLOv3 network is trained on the data set established in step 1.
Step 3: on the basis of the 2D rectangular region obtained in step 2, an RGB-D target saliency detection algorithm is constructed to obtain the pixel region of the target.
Step 4: on the basis of the target pixel region obtained in step 3, the target view cone point cloud is generated by aligning the depth image with the RGB image and applying view cone projection, and an RGB-D multi-modal feature fusion strategy is proposed to extract the simplified target view cone point cloud, replacing the deep-neural-network-based 3D target inference process. Finally, the 3D bounding box of the target point cloud is acquired with the axis-aligned bounding box algorithm, and the 3D pose coordinates of the target point cloud are calculated with the PCA algorithm.
The beneficial technical effects of the invention are as follows:
1. the 2D detection precision of multi-scale targets in the scene is effectively improved;
2. when the target is occluded, its pixels can still be effectively segmented and 3D detection can be realized;
3. 3D target detection is realized with only a small amount of 2D labeled data and no 3D labeled data;
4. the MMFF-3D target detection method meets the real-time and precision requirements of 3D target detection.
drawings
FIG. 1 is a MMFF-3D object detection network model framework schematic.
FIG. 2 is a schematic diagram of the MD56-YOLOv3 target detection framework.
FIG. 3 is a schematic diagram of the process of RGB-D target saliency detection algorithm acquiring a target pixel.
FIG. 4 is a schematic diagram of a process for implementing 3D target detection by an RGB-D multimodal feature fusion process.
FIG. 5 is a diagram of the target detection effect of an MMFF-3D target detection network model framework in an intelligent workshop application scenario.
Detailed description of the preferred embodiments
To facilitate understanding of the present invention, some background on object detection is first introduced. Object detection is one of the most basic and challenging problems in computer vision and has received much attention in related research. Its image recognition aspect identifies the category of an object of interest in a digital image and locates that object's position in the image; at the same time, object detection serves as basic research for visual processing tasks such as instance segmentation and target tracking. Object detection is a hot research direction in the field of image processing; in the stage when it was realized with traditional computer vision techniques, it mainly relied on designing complicated hand-crafted feature extractors. Compared with traditional object detection algorithms built on hand-crafted feature extractors, object detection algorithms based on deep neural networks have simpler structural designs, extract features automatically, and achieve high detection precision and good robustness. Therefore, the main research direction in the object detection field is currently based on deep learning and neural network technology. Moreover, while obtaining the semantic information of the object of interest, 3D object detection can locate the target's 3D bounding box more accurately than 2D object detection, so 3D object detection techniques are more valuable than 2D object detection techniques.
The following describes an embodiment of the present invention in detail with reference to the accompanying drawings, taking a workshop scenario as the concrete implementation and application.
FIG. 1 is a schematic diagram of the MMFF-3D object detection network model framework according to the present invention. With reference to the figure, the general implementation is as follows: establish a workshop-scene target data set; improve YOLOv3 to raise the detection efficiency for multi-scale workshop-scene targets; combine 2D target detection with RGB-D fused target saliency detection to generate the 3D target view cone point cloud; project the RGB-image target feature information into three-dimensional space and fuse it with the 3D point cloud density distribution information to obtain the simplified 3D target view cone point cloud; and generate the 3D target bounding box of the workshop scene with the AABB algorithm. Specifically, the method comprises the following steps. Firstly, based on the multi-scale prediction characteristics of the YOLOv3 target detection framework, the convolutional backbone network DarkNet53 is improved into the MD56 backbone network to raise the 2D detection accuracy for multi-scale workshop-scene targets. Then, for the 2D rectangular region obtained by 2D target detection, an RGB-D fused target saliency detection algorithm is constructed to obtain the target's pixels, which are used to generate the target view cone point cloud. Next, the target view cone point cloud is generated by aligning the depth image with the RGB image and applying view cone projection, and a simplified target view cone point cloud is extracted with the RGB-D multi-modal feature fusion strategy; this process replaces the deep-neural-network-based 3D target inference process. Finally, the 3D bounding box of the target point cloud is acquired with the axis-aligned bounding box algorithm, and the 3D pose coordinates of the target point cloud are calculated with the PCA algorithm.
Step 1: establish the workshop-scene target data set. 2000 RGB images containing the main workshop-scene targets are collected through a web crawler; this part of the data is used as the similar-domain data set (Similar Domains data set) in training. 3000 images of the digitized workshop scene actually containing the main targets (1500 RGB images and the corresponding 1500 depth images) are taken with a RealSense D435 depth camera; this part of the data is used as the Target data set in training.
Step 2: DarkNet53 is modified to improve the detection efficiency for multi-scale workshop-scene targets (see also FIG. 2).
Step 21: firstly, one feature extraction layer (kernel size 3 × 3, stride 1, padding 1) is added to each of the 3 scale feature extraction branches to enlarge the receptive fields of the prediction targets of the different scales y1, y2 and y3.
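As one way to realize this modification, the sketch below (PyTorch) inserts an extra 3 × 3 convolution (stride 1, padding 1) in front of each of the three scale branches; the class name, the channel widths and the BatchNorm/LeakyReLU pairing are illustrative assumptions rather than the exact MD56 definition.

    import torch.nn as nn

    class ExtraFeatureLayer(nn.Module):
        """One 3x3 conv (stride 1, padding 1) placed before a YOLO prediction head
        to enlarge the receptive field of that scale branch."""
        def __init__(self, channels):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.LeakyReLU(0.1, inplace=True),
            )

        def forward(self, x):
            return self.block(x)

    # One extra layer per scale branch (y1, y2, y3); channel widths assumed from YOLOv3.
    extra_layers = nn.ModuleList([ExtraFeatureLayer(c) for c in (1024, 512, 256)])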
Step 22: then, the input size of the network is adjusted from 416 to 448 so that more feature information can be extracted and the detection accuracy of the network improved.
Step 23: finally, for the whole network (as shown in FIG. 2(b)), the prediction results of the network need to be decoded into coordinate boxes and mapped to real values in image coordinates through coordinate-box decoding. Because the network parameters are randomly initialized, the initial coordinate boxes may exceed the actual coordinate boundaries. To limit the location range of each box, the sigmoid function is used to restrict the offset relative to the upper-left coordinate point (C_x, C_y) of the grid cell to the range 0-1, so that the position of each prediction box always stays inside its own grid cell. The width and height of the prediction box are decoded by multiplying the prior box by the corresponding sampling rate to scale it back to the actual image size, as shown in the following equations:
Y = (sigmoid(t_y) + C_y) * stride
X = (sigmoid(t_x) + C_x) * stride
W = (P_w * e^(t_w)) * stride
H = (P_h * e^(t_h)) * stride
where t_x, t_y, t_w and t_h are the predicted results, (C_x, C_y) are the coordinates of the upper-left corner of the grid cell in which the prediction box is located, P_w and P_h are the width and height of the prior box relative to the current grid size, and stride is the down-sampling rate of the corresponding scale with respect to the input image (i.e., the sampling rate).
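For illustration, the decoding equations above can be sketched in NumPy as below; the per-box calling convention and the example anchor values are assumptions for demonstration rather than part of the patented network.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
        """Map raw predictions (tx, ty, tw, th) of one grid cell back to image coordinates.

        cx, cy : upper-left corner of the grid cell containing the prediction box
        pw, ph : prior (anchor) box width/height relative to the current grid
        stride : down-sampling rate of this scale branch (e.g. 8, 16 or 32 for a 448 input)
        """
        x = (sigmoid(tx) + cx) * stride   # box centre x, kept inside the cell by the sigmoid
        y = (sigmoid(ty) + cy) * stride   # box centre y
        w = pw * np.exp(tw) * stride      # box width scaled back to the input image
        h = ph * np.exp(th) * stride      # box height
        return x, y, w, h

    # Example: a prediction in grid cell (7, 5) of the stride-32 branch.
    print(decode_box(0.2, -0.1, 0.3, 0.1, cx=7, cy=5, pw=3.6, ph=2.4, stride=32))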
Step 3: the RGB-D target saliency detection method further fuses depth-image threshold segmentation to segment the targets in the image (see also FIG. 3). The fused depth-image threshold segmentation algorithm proceeds as follows. When the rectangular target detection regions do not overlap, each region is divided into target and background according to a threshold F_t. When the rectangular regions of two targets overlap, the average depth values of the depth image inside the two rectangular regions are calculated (as indicated in FIG. 3). The target detection rectangle with the smaller average depth value is divided into target and background according to the threshold F_t. The rectangle with the larger average depth value is divided into target, foreground and background according to two thresholds, denoted F_t1 and F_t2. The thresholds F_t, F_t1 and F_t2 are obtained with an adaptive threshold calculation method. When the rectangular regions do not overlap, or when they overlap but the average depth value of a region is small (as indicated in FIG. 3), the number of thresholds is 1 and the threshold is F_t; the RGB target pixels P_rgb(x, y) are then further calculated in combination with the pixels P_d(x, y) obtained by thresholding the depth image. When the average depth value of a rectangular region is larger (as indicated in FIG. 3), the number of thresholds is 2 and the thresholds are F_t1 and F_t2; the RGB target pixel values P_rgb(x, y) are again further calculated in combination with the depth-thresholded pixel values P_d(x, y). The combination is given by the corresponding formulas (shown as images).
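One possible realization of this fused segmentation is sketched below in Python (OpenCV/NumPy). It is only an assumed reading of the description: Otsu's method stands in for the unspecified adaptive threshold calculation, the way the two thresholds F_t1 and F_t2 are derived is an assumption, the fusion of the RGB (GrabCut) mask with the depth mask is taken to be an intersection, and all function names are hypothetical.

    import cv2
    import numpy as np

    def otsu_threshold(values):
        """Otsu threshold on raw depth values (normalised to 8 bit internally)."""
        v = values.astype(np.float32)
        lo, hi = float(v.min()), float(v.max())
        if hi <= lo:
            return hi
        scaled = np.round((v - lo) / (hi - lo) * 255).astype(np.uint8).reshape(1, -1)
        t, _ = cv2.threshold(scaled, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return lo + t / 255.0 * (hi - lo)

    def depth_target_mask(depth_roi, overlap_large_depth=False):
        """Threshold the depth image inside one 2D detection rectangle.

        overlap_large_depth=False -> single threshold F_t (target / background)
        overlap_large_depth=True  -> two thresholds F_t1, F_t2 (target / foreground / background)
        """
        d = depth_roi.astype(np.float32)
        valid = d[d > 0]
        if valid.size == 0:
            return np.zeros(d.shape, dtype=bool)
        if not overlap_large_depth:
            ft = otsu_threshold(valid)                        # F_t
            return (d > 0) & (d <= ft)
        ft1 = otsu_threshold(valid)                           # F_t1: splits the near foreground off
        above = valid[valid > ft1]
        ft2 = otsu_threshold(above) if above.size else ft1    # F_t2: splits target from background
        return (d > ft1) & (d <= ft2)                         # keep the band interpreted as the target

    def fuse_rgb_depth_masks(rgb_mask, depth_mask):
        # Assumed fusion rule: a pixel belongs to the target only if both the RGB
        # segmentation mask and the depth segmentation mask select it.
        return rgb_mask & depth_mask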
Step 4: the simplified target view cone point cloud is extracted with the RGB-D multi-modal feature fusion strategy (see FIG. 4).
Step 41: firstly, the target feature pixels p_rgb_f in the pixel coordinate system of the RGB image are acquired with the Canny edge extraction algorithm and the Harris corner detection algorithm. Through the transformation between the pixel coordinate system and the image coordinate system of the RGB camera, and between the image coordinate system and the camera coordinate system, the target feature points P_rgb_f in the camera coordinate system of the RGB camera are obtained (the transformation equations are shown as images), where p'_rgb_f denotes a feature pixel belonging to the target in the image coordinate system of the RGB camera, P_rgb_f denotes a feature point belonging to the target in the camera coordinate system of the RGB camera, T_w2c is the extrinsic matrix of the RGB camera, and K_c is the intrinsic matrix of the RGB camera.
Step 42: secondly, according to the mapping relation between the RGB camera and the depth camera, the target feature points P_d_f in the camera coordinate system of the depth camera corresponding to P_rgb_f are obtained, as shown in the following equations:
P_rgb_f = T_d2rgb * P_d_f
T_d2rgb = T_w2c * T_w2d
where T_d2rgb is the mapping matrix from the camera coordinate system of the depth camera to the camera coordinate system of the RGB camera.
Step 43: finally, through the transformation between the pixel coordinate system and the image coordinate system of the depth camera, and between the image coordinate system and the camera coordinate system, the target feature points P_d_f in the camera coordinate system of the depth camera are obtained (the transformation equations are shown as images), where p'_d_f denotes a target feature pixel in the image coordinate system of the depth camera, P_d_f denotes a target feature point in the camera coordinate system of the depth camera, K_d^(-1) is the inverse of the intrinsic matrix of the depth camera, and T_w2d^(-1) is the inverse of the extrinsic matrix of the depth camera.
Integrating the above formulas yields the simplified three-dimensional target feature points P_f, as shown in the corresponding combined formula (also provided as an image), which involves K_c^(-1), T_w2c^(-1) and T_w2d^(-1).
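Because the exact transformation chain of steps 41-43 is only available here as formula images, the sketch below uses standard pinhole-camera back-projection as an assumed, conventional formulation: target feature pixels plus an aligned depth image yield 3D feature points, and an optional 4 x 4 extrinsic transform moves them into the desired camera frame. The function and parameter names are illustrative, not the patented equations.

    import numpy as np

    def feature_pixels_to_3d(pixels, depth_aligned, K, T=np.eye(4)):
        """Back-project target feature pixels to 3D feature points P_f.

        pixels        : (N, 2) array of (u, v) feature pixels from Canny + Harris
        depth_aligned : depth image aligned to the same view as the pixels, in metres
        K             : 3x3 intrinsic matrix of that view
        T             : optional 4x4 extrinsic transform into the desired target frame
        """
        fx, fy = K[0, 0], K[1, 1]
        cx, cy = K[0, 2], K[1, 2]
        pts = []
        for u, v in pixels:
            z = float(depth_aligned[int(v), int(u)])
            if z <= 0:                        # no valid depth measurement at this pixel
                continue
            pts.append([(u - cx) * z / fx,    # pixel -> camera coordinates (pinhole model)
                        (v - cy) * z / fy,
                        z, 1.0])
        if not pts:
            return np.empty((0, 3))
        pts = np.asarray(pts)
        return (T @ pts.T).T[:, :3]           # optional transform into the target frame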
step 44: background outlier point clouds are removed, and due to the fact that a depth camera is affected by external environment factors such as illumination, pixel points of a depth image are distorted, outliers exist in generated target view cone point clouds and easily appear in a boundary area of the target view cone point clouds. Therefore, the density of outliers in the target view cone point cloud is generally low, and the outliers are screened out by adopting a density clustering-based method. For the simplified 3D target characteristic point PfiPerforming point cloud density distribution calculation, density distribution function DiAs shown in the following equation:
Figure BDA0003073301300000067
in the formula, xi,yi,ziIs a certain target feature point PfiThe coordinates of (a). r isx,ry,rzIs a radius parameter of the point cloud density.
Finally, the point P_fi with the maximum density distribution value D_i is selected as the center of the point cloud density cluster, and points that are not in the clustered point cloud set are marked as outliers. The points P_fi with RGB features inside the cluster set are assembled into the simplified target view cone point cloud P. The RGB-D multi-modal feature fusion strategy fuses the RGB features and the density distribution features of the target: the RGB feature information guarantees that P consists of the feature point cloud P_rgb_f, while the density distribution information guarantees that P consists of points P_fi meeting the density requirement. Therefore, obtaining the simplified target view cone point cloud P screens out the outliers while simplifying the point cloud.
Step 5: the simplified target view cone point cloud P is obtained through the multi-modal feature fusion strategy, and the target point cloud is further processed to obtain the spatial size and pose information of the target. To obtain the 3D bounding box and pose information of the target point cloud, the 3D bounding box of the simplified target view cone point cloud P is generated with the Axis-Aligned Bounding Box (AABB) algorithm. Meanwhile, the orientation of the point cloud is estimated with principal component analysis, and the three principal eigenvectors obtained from the calculation are used as the pose coordinates of the target point cloud (see FIG. 4).
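A minimal sketch of this step follows (NumPy): the AABB is simply the per-axis minimum and maximum of the simplified point cloud P, and the pose axes are the three principal eigenvectors of its covariance matrix, as described above; the function name and return convention are assumptions.

    import numpy as np

    def aabb_and_pose(points):
        """Return the AABB corners (min, max) and the 3x3 pose axes from PCA."""
        mins, maxs = points.min(axis=0), points.max(axis=0)          # axis-aligned 3D bounding box
        centred = points - points.mean(axis=0)
        cov = np.cov(centred, rowvar=False)                          # 3x3 covariance of the cloud
        eigvals, eigvecs = np.linalg.eigh(cov)
        axes = eigvecs[:, np.argsort(eigvals)[::-1]]                 # principal axes, largest variance first
        return (mins, maxs), axes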
Step 6: training of the MMFF-3D network model. The framework only trains the MD56-YOLOv3 network model, and a pre-training plus fine-tuning transfer-learning method is adopted for the whole training process. By learning similar features on a similar-domain data set, this method improves the network model's ability to learn target features on the Target data set, which alleviates to some extent the problem of the Target data set being relatively small.
Step 61: in the actual training process, the weights obtained by training on the ImageNet data set are first used as the initial weights of the backbone network, and the Similar Domains data set is used for pre-training to obtain the pre-training weights.
Step 62: then, the workshop target data set is randomly divided into a training set, a validation set and a test set in the ratio 7 : 1 : 2. Training MD56-YOLOv3 on the Target data set is divided into two steps. In the first step, the pre-training weights of the first 184 layers of the backbone network MD56 are frozen, the Adam optimizer is adopted, the learning rate is set to 1e-3, the batch size is set to 20, and the number of iterations is set to 200 epochs. In the second step, the backbone network MD56 is unfrozen, the learning rate is reduced to 1e-4, the batch size is set to 5, and another 300 epochs are trained.
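The two-stage schedule of step 62 could look like the PyTorch sketch below; the names model, the data loaders and loss_fn are placeholders, and freezing by parameter index is only an approximation of freezing the first 184 layers named in the text.

    import torch

    def set_backbone_frozen(model, frozen, n_layers=184):
        """(Un)freeze the first n_layers parameter groups of the MD56 backbone."""
        for i, p in enumerate(model.backbone.parameters()):
            if i < n_layers:
                p.requires_grad = not frozen

    def train_stage(model, loader, loss_fn, lr, epochs, device="cuda"):
        model.to(device).train()
        opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=lr)
        for _ in range(epochs):
            for images, targets in loader:
                opt.zero_grad()
                loss = loss_fn(model(images.to(device)), targets)
                loss.backward()
                opt.step()

    # Stage 1: frozen backbone, learning rate 1e-3, batch size 20, 200 epochs.
    # set_backbone_frozen(model, frozen=True);  train_stage(model, loader_bs20, loss_fn, 1e-3, 200)
    # Stage 2: unfrozen backbone, learning rate 1e-4, batch size 5, 300 more epochs.
    # set_backbone_frozen(model, frozen=False); train_stage(model, loader_bs5,  loss_fn, 1e-4, 300)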
Step 7: the trained MMFF-3D target detection network model is used for testing, and part of the test results are shown in FIG. 5.
Experimental evaluation and verification show that the MMFF-3D target detection network framework achieves a good 2D target detection effect in the workshop, and the 3D target detection also achieves good detection precision, as shown in Tables 1 and 2.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
TABLE 1 Comparison of the detection results of the improved MD56-YOLOv3 and YOLOv3 (table provided as an image)
TABLE 2 Comparison of the detection results of MMFF-3D target detection with different backbone networks (table provided as an image)

Claims (8)

  1. The RGB-D multi-modal feature fusion 3D target detection method is characterized in that a deep learning technology and a 3D point cloud processing technology are combined and effectively applied to image data sets with only a small amount of 2D labels and no 3D labels, and the whole process comprises the following steps:
    Step 1: establishing a data set of the detection target, and collecting RGB images containing the main detection targets through a web crawler, this part of the data being used as the similar-domain data set in training; shooting data actually containing the main target images (RGB images and depth images) with a depth camera, this part of the data being used as the Target data set in training, the Target data set being divided into a training set and a test set;
    Step 2: improving the backbone network DarkNet53 of the YOLOv3 target detection network to obtain the MD56-YOLOv3 target detection network;
    Step 3: pre-training the MD56-YOLOv3 target detection network with the similar-domain data set of step 1, and then training the MD56-YOLOv3 target detection network by transfer learning with the training set of the Target data set of step 1;
    Step 4: proposing an RGB-D target saliency detection algorithm to segment the pixel region of the target inside the 2D rectangular box output by the MD56-YOLOv3 target detection network;
    Step 5: mapping and aligning the target pixel region segmented by the RGB-D target saliency detection algorithm to the depth image of the target, and generating the target view cone point cloud through view cone projection;
    Step 6: proposing an RGB-D multi-modal feature fusion strategy to simplify the target view cone point cloud obtained in step 5 and obtain the simplified target view cone point cloud;
    Step 7: generating the bounding box of the 3D target with the AABB algorithm and the PCA algorithm;
    Step 8: integrating all the algorithms involved in steps 1-7 into the RGB-D multi-modal feature fusion 3D target detection method, and testing with the test set of the Target data set collected in step 1.
  2. The RGB-D multi-modal feature fusion 3D object detection method as claimed in claim 1, wherein step 1 further includes the following step:
    2D-labeling the targets in the Target data set, and storing the labeled target category information and target position information in a text file.
  3. The RGB-D multi-modal feature fusion 3D object detection method as claimed in claim 1, wherein step 2 further includes the following steps:
    Step 21: adjusting the input target image size of the MD56-YOLOv3 target detection network from 416 to 448 so as to extract more feature information and improve the detection accuracy of the network;
    Step 22: adding one feature extraction layer (kernel size 3 × 3, stride 1, padding 1) at each of the 3 scale feature extraction branches of the YOLOv3 target detection network, so as to enlarge the receptive fields of the prediction targets of the different output scales y1, y2 and y3;
    Step 23: when the MD56-YOLOv3 target detection network outputs features, adjusting the corresponding feature dimensions, decoding the prediction results of the network into coordinate boxes, and mapping them to the real values of the target image coordinates.
  4. The RGB-D multi-modal feature fusion 3D object detection method of claim 1, wherein the GrabCut algorithm is required in step 3.
  5. The RGB-D multi-modal feature fusion 3D object detection method of claim 4, wherein the RGB-D object saliency detection method further comprises the following steps:
    Step 31: acquiring the target pixel region in the RGB image with the GrabCut algorithm, further acquiring the target pixel region in the depth image in combination with a threshold segmentation algorithm, and segmenting out the pixel region of the target inside the 2D rectangular box output by the MD56-YOLOv3 target detection network;
    Step 32: when the output 2D rectangular boxes do not overlap, dividing the image into a target pixel region and a background pixel region according to the threshold F_t;
    Step 33: when the output 2D rectangular boxes overlap, setting the thresholds to F_t1 and F_t2, and using F_t1 and F_t2 to divide the output 2D rectangular box into a target pixel region, a foreground pixel region and a background pixel region;
    Step 34: finally, combining the target pixel values P_rgb(x, y) obtained from the RGB image with the GrabCut algorithm and the pixel values P_d(x, y) obtained by threshold segmentation of the depth image, the combination being expressed by the corresponding formulas (shown as images).
  6. The RGB-D multi-modal feature fusion 3D object detection method of claim 1, wherein the RGB features of the target in step 6 are extracted by an edge extraction algorithm and a corner detection algorithm, the depth image features are the point cloud density distribution features of the target view cone point cloud, and the RGB-D multi-modal feature fusion fuses the RGB features with the point cloud density distribution features.
  7. The RGB-D multi-modal feature fusion 3D object detection method of claim 1, wherein step 6 further comprises the following steps:
    Step 61: based on the camera calibration principle, fusing the RGB feature information of the target into the 3D point cloud through coordinate transformation to obtain the feature 3D point cloud P_f with RGB features, as shown in the corresponding formula (provided as an image), in which K_c^(-1) is the inverse of the intrinsic matrix of the RGB camera, T_w2c^(-1) is the inverse of the extrinsic matrix of the RGB camera, T_w2d^(-1) is the inverse of the extrinsic matrix of the depth camera, and p'_rgb_f is a feature pixel in the RGB image;
    Step 62: calculating the density distribution features of the point cloud with a density distribution function (provided as an image) defined over the coordinates (x_i, y_i, z_i) of each feature point P_fi and the density radius parameters r_x, r_y and r_z;
    Step 63: selecting the point P_fi with the maximum density distribution value D_i as the center of the point cloud density cluster set, clustering with r_x, r_y, r_z as the radii of the cluster set, marking the points that are not in the cluster set as outliers, and assembling the points P_fi with RGB features inside the cluster set into the simplified target view cone point cloud.
  8. The RGB-D multi-modal feature fusion 3D object detection method as claimed in any one of claims 1-7, wherein, after an RGB-D image is input, the method can obtain not only the semantic category information of the target in the image but also the 3D spatial position information of the target.
CN202110545313.5A 2021-05-19 2021-05-19 RGB-D multi-modal feature fusion 3D target detection method Active CN113408584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110545313.5A CN113408584B (en) 2021-05-19 2021-05-19 RGB-D multi-modal feature fusion 3D target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110545313.5A CN113408584B (en) 2021-05-19 2021-05-19 RGB-D multi-modal feature fusion 3D target detection method

Publications (2)

Publication Number Publication Date
CN113408584A true CN113408584A (en) 2021-09-17
CN113408584B CN113408584B (en) 2022-07-26

Family

ID=77678851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110545313.5A Active CN113408584B (en) 2021-05-19 2021-05-19 RGB-D multi-modal feature fusion 3D target detection method

Country Status (1)

Country Link
CN (1) CN113408584B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113379793A (en) * 2021-05-19 2021-09-10 成都理工大学 On-line multi-target tracking method based on twin network structure and attention mechanism
CN113963044A (en) * 2021-09-30 2022-01-21 北京工业大学 RGBD camera-based intelligent loading method and system for cargo box
CN114170521A (en) * 2022-02-11 2022-03-11 杭州蓝芯科技有限公司 Forklift pallet butt joint identification positioning method
CN115578461A (en) * 2022-11-14 2023-01-06 之江实验室 Object attitude estimation method and device based on bidirectional RGB-D feature fusion
CN116580056A (en) * 2023-05-05 2023-08-11 武汉理工大学 Ship detection and tracking method and device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472534A (en) * 2019-07-31 2019-11-19 厦门理工学院 3D object detection method, device, equipment and storage medium based on RGB-D data
CN110689008A (en) * 2019-09-17 2020-01-14 大连理工大学 Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
AU2020101011A4 (en) * 2019-06-26 2020-07-23 Zhejiang University Method for identifying concrete cracks based on yolov3 deep learning model
CN111612728A (en) * 2020-05-25 2020-09-01 北京交通大学 3D point cloud densification method and device based on binocular RGB image
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN111723721A (en) * 2020-06-15 2020-09-29 中国传媒大学 Three-dimensional target detection method, system and device based on RGB-D
CN112651406A (en) * 2020-12-18 2021-04-13 浙江大学 Depth perception and multi-mode automatic fusion RGB-D significance target detection method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
AU2020101011A4 (en) * 2019-06-26 2020-07-23 Zhejiang University Method for identifying concrete cracks based on yolov3 deep learning model
CN110472534A (en) * 2019-07-31 2019-11-19 厦门理工学院 3D object detection method, device, equipment and storage medium based on RGB-D data
CN110689008A (en) * 2019-09-17 2020-01-14 大连理工大学 Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
CN111612728A (en) * 2020-05-25 2020-09-01 北京交通大学 3D point cloud densification method and device based on binocular RGB image
CN111723721A (en) * 2020-06-15 2020-09-29 中国传媒大学 Three-dimensional target detection method, system and device based on RGB-D
CN112651406A (en) * 2020-12-18 2021-04-13 浙江大学 Depth perception and multi-mode automatic fusion RGB-D significance target detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HANG LINGXIAO: "Research on indoor scene parsing algorithms based on RGB-D multi-modal images", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113379793A (en) * 2021-05-19 2021-09-10 成都理工大学 On-line multi-target tracking method based on twin network structure and attention mechanism
CN113379793B (en) * 2021-05-19 2022-08-12 成都理工大学 On-line multi-target tracking method based on twin network structure and attention mechanism
CN113963044A (en) * 2021-09-30 2022-01-21 北京工业大学 RGBD camera-based intelligent loading method and system for cargo box
CN113963044B (en) * 2021-09-30 2024-04-30 北京工业大学 Cargo box intelligent loading method and system based on RGBD camera
CN114170521A (en) * 2022-02-11 2022-03-11 杭州蓝芯科技有限公司 Forklift pallet butt joint identification positioning method
CN115578461A (en) * 2022-11-14 2023-01-06 之江实验室 Object attitude estimation method and device based on bidirectional RGB-D feature fusion
CN115578461B (en) * 2022-11-14 2023-03-10 之江实验室 Object attitude estimation method and device based on bidirectional RGB-D feature fusion
CN116580056A (en) * 2023-05-05 2023-08-11 武汉理工大学 Ship detection and tracking method and device, electronic equipment and storage medium
CN116580056B (en) * 2023-05-05 2023-11-17 武汉理工大学 Ship detection and tracking method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113408584B (en) 2022-07-26

Similar Documents

Publication Publication Date Title
Cui et al. Deep learning for image and point cloud fusion in autonomous driving: A review
Yang et al. Visual perception enabled industry intelligence: state of the art, challenges and prospects
CN113408584B (en) RGB-D multi-modal feature fusion 3D target detection method
Garcia-Garcia et al. A survey on deep learning techniques for image and video semantic segmentation
Garcia-Garcia et al. A review on deep learning techniques applied to semantic segmentation
Von Stumberg et al. Gn-net: The gauss-newton loss for multi-weather relocalization
CN110852182B (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
Zhou et al. Self‐supervised learning to visually detect terrain surfaces for autonomous robots operating in forested terrain
Li et al. Adaptive deep convolutional neural networks for scene-specific object detection
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
Geng et al. Using deep learning in infrared images to enable human gesture recognition for autonomous vehicles
CN111461039B (en) Landmark identification method based on multi-scale feature fusion
Wang et al. An overview of 3d object detection
CN105574545B (en) The semantic cutting method of street environment image various visual angles and device
CN110287798B (en) Vector network pedestrian detection method based on feature modularization and context fusion
CN116052222A (en) Cattle face recognition method for naturally collecting cattle face image
Xu et al. Segment as points for efficient and effective online multi-object tracking and segmentation
CN116385958A (en) Edge intelligent detection method for power grid inspection and monitoring
Gu et al. Embedded and real-time vehicle detection system for challenging on-road scenes
Tsutsui et al. Distantly supervised road segmentation
Wael A comprehensive vehicle-detection-and-tracking technique for autonomous driving
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
Zhang et al. Multi-FEAT: Multi-feature edge alignment for targetless camera-LiDAR calibration
An et al. RS-AUG: Improve 3D object detection on LiDAR with realistic simulator based data augmentation
CN117115555A (en) Semi-supervised three-dimensional target detection method based on noise data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant