CN113408584B - RGB-D multi-modal feature fusion 3D target detection method


Info

Publication number
CN113408584B
CN113408584B
Authority
CN
China
Prior art keywords
target
rgb
point cloud
detection
algorithm
Prior art date
Legal status
Active
Application number
CN202110545313.5A
Other languages
Chinese (zh)
Other versions
CN113408584A (en)
Inventor
陈光柱
侯睿
韩银贺
唐在作
茹青君
Current Assignee
Chengdu University of Technology
Original Assignee
Chengdu University of Technology
Priority date
Filing date
Publication date
Application filed by Chengdu University of Technology
Priority to CN202110545313.5A
Publication of CN113408584A
Application granted
Publication of CN113408584B
Legal status: Active
Anticipated expiration

Classifications

    • G06F18/253 Fusion techniques of extracted features
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N3/08 Neural networks; Learning methods
    • G06T7/11 Region-based segmentation
    • G06T7/136 Segmentation; Edge detection involving thresholding
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T2207/10024 Image acquisition modality; Color image
    • G06T2207/10028 Image acquisition modality; Range image; Depth image; 3D point clouds
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an RGB-D multi-modal feature fusion 3D target detection method. 3D target detection obtains both the semantic information and the spatial size information of a target and is therefore of great significance for intelligent 3D target detection. Specifically, the method comprises the following steps: first, an improved YOLOv3 target detection network model provides a 2D prior region, an RGB-D target saliency detection algorithm is proposed to extract the target pixels, and the target view cone point cloud is obtained by view cone projection; second, to remove outliers and reduce the number of points in the target view cone point cloud, a multi-modal feature fusion strategy is proposed to simplify the target view cone point cloud, and this strategy replaces deep-neural-network-based 3D target inference; finally, the 3D bounding box of the target point cloud is generated with the axis-aligned bounding box (AABB) algorithm, and the pose coordinates of the target point cloud are calculated with the PCA algorithm. The beneficial effects of the invention are as follows: the RGB-D multi-modal feature fusion 3D target detection method improves the detection accuracy of multi-scale scene targets in application scenes that have only a small amount of 2D annotation data and no 3D annotation data, and it offers good real-time performance and high accuracy.

Description

RGB-D multi-modal feature fusion 3D target detection method
Technical Field
The invention relates to the fields of computer vision, image recognition and target detection, in particular to an RGB-D multi-modal feature fusion 3D target detection method.
Background
Target detection, as an important branch of machine vision, spans multiple disciplines and fields and is the basis of higher-level tasks such as target tracking, behavior recognition and scene flow estimation, which can only mature once target detection itself is mature. Target detection identifies and locates the target objects in a scene, such as cars, pedestrians and roads. Target recognition distinguishes the objects of interest in the scene, obtains their categories and gives the classification probabilities; target localization calibrates the position of each object of interest in the scene, usually by framing its boundary with a rectangular box or a cuboid box. Target detection currently has enormous application prospects and already appears in fields such as face recognition, intelligent monitoring, intelligent workshops, intelligent transportation and autonomous driving. 3D target detection obtains not only the semantic information of the target but also its spatial size information, and therefore has both research value and application prospects.
At present, traditional image processing methods usually require a complex feature extractor designed for a specific detection target, and such algorithms generalize poorly, so traditional target detection methods struggle to realize intelligent target detection. With the continuous development of artificial intelligence and computer vision technology, image recognition based on neural networks and deep learning has achieved excellent performance, and 2D object detection algorithms have developed rapidly. Compared with traditional methods, 2D target detection can efficiently detect many categories of targets with high detection accuracy, strong generalization capability and good robustness. However, 2D target detection cannot acquire the actual 3D spatial information of a target (spatial pose coordinates, 3D size, etc.). 3D target detection therefore expresses the actual spatial position of the detected target more accurately, helps identify and locate targets precisely, and more effectively ensures the safety of interactive operations with the targets.
In recent years, as the accuracy of depth sensors such as 3D lidar and RGB-D cameras has improved, 3D target detection technology has made breakthrough progress. As an important task in scene understanding, 3D target detection classifies the targets of interest in 3D data and localizes their 3D bounding boxes. Compared with 2D target detection, 3D target detection not only acquires the semantic information of the target of interest but also localizes its 3D bounding box more accurately, so 3D target detection techniques are more valuable than 2D ones. At present, 3D target detection is mostly researched and applied in outdoor autonomous driving, while indoor 3D target detection research mainly focuses on 3D detection of daily-life scene targets, robot-arm positioning and workpiece grasping. These methods all rely on large annotated data sets for a specific scene, which hinders their adoption in practical application scenarios.
Most existing 3D target detection methods need large-scale 3D annotated data sets that are difficult to construct, which makes them hard to apply to practical detection requirements. Therefore, the RGB-D multi-modal feature fusion 3D target detection method proposed here, which can be effectively applied to 3D target detection in practical application scenes, has great research significance.
Disclosure of Invention
The main purpose of the invention is to provide an RGB-D multi-modal feature fusion 3D target detection method. By improving the YOLOv3 network model, the method improves the detection accuracy of multi-scale targets in the detection scene, and it realizes efficient 3D target detection with only a small amount of 2D annotation data and without relying on 3D annotation data.
The invention is realized by adopting the following technical scheme: an RGB-D multi-modal feature fusion 3D target detection method (hereinafter abbreviated as the MMFF-3D target detection method), comprising the following steps:
Step 1: preliminarily establish a target data set of the detection scene: collect pictures with a web crawler, take pictures in an actual workshop, and apply 2D annotations to the data set.
Step 2: based on the multi-scale prediction characteristics of the YOLOv3 target detection framework, further improve the convolutional backbone network DarkNet53 into the MD56 backbone network, obtaining the MD56-YOLOv3 target detection framework to raise the 2D detection accuracy of the target, and train the MD56-YOLOv3 network on the data set established in Step 1.
Step 3: on the basis of the 2D rectangular region obtained in Step 2, construct an RGB-D target saliency detection algorithm to obtain the pixel region of the target.
Step 4: on the basis of the target pixel region obtained in Step 3, generate the target view cone point cloud by aligning the depth image with the RGB image and applying view cone projection; propose an RGB-D multi-modal feature fusion strategy to extract a simplified target view cone point cloud, replacing deep-neural-network-based 3D target inference; finally, obtain the 3D bounding box of the target point cloud with the axis-aligned bounding box algorithm and calculate the 3D pose coordinates of the target point cloud with the PCA algorithm.
The beneficial technical effects of the invention are as follows:
1. the 2D detection accuracy of multi-scale targets in the scene is effectively improved;
2. when an object is occluded, the target pixels can still be segmented effectively and 3D detection realized;
3. 3D target detection is realized with only a small amount of 2D annotation data and no 3D annotation data;
4. the MMFF-3D target detection method meets the real-time and accuracy requirements of 3D target detection.
drawings
FIG. 1 is a schematic diagram of an MMFF-3D object detection network model framework.
FIG. 2 is a schematic diagram of the MD56-YOLOv3 target detection framework.
FIG. 3 is a schematic diagram of the process of RGB-D target saliency detection algorithm acquiring a target pixel.
FIG. 4 is a schematic diagram of a process for implementing 3D target detection by an RGB-D multimodal feature fusion process.
FIG. 5 is a diagram of the target detection effect of an MMFF-3D target detection network model framework in an intelligent workshop application scenario.
Detailed description of the preferred embodiments
To facilitate understanding of the present invention, some background on target detection is introduced first. Target detection is one of the most fundamental and challenging problems in computer vision and has received wide attention in related research. For image recognition, target detection recognizes the category of an object of interest in a digital image and locates its position in the image; at the same time, target detection serves as basic research for visual processing tasks such as instance segmentation and target tracking. As a hot research direction in image processing, target detection was, in the era of traditional computer vision, mainly realized by designing complicated hand-crafted feature extractors. Compared with traditional detectors built on hand-crafted feature extractors, target detection algorithms based on deep neural networks have simpler structural designs, extract features automatically, and achieve high detection accuracy and good robustness; accordingly, the main research direction in target detection is now based on deep learning and neural network technology. Moreover, compared with 2D target detection, 3D target detection localizes the 3D bounding box of the target while acquiring its semantic information, so 3D target detection techniques are more valuable than 2D ones.
An embodiment of the present invention is described in detail below, taking a workshop scenario as the concrete implementation and application, with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of the MMFF-3D target detection network model framework of the present invention. In combination with the diagram, the general implementation is as follows: establish a workshop-scene target data set; improve YOLOv3 to raise the detection performance for multi-scale workshop-scene targets; combine 2D target detection with RGB-D fused target saliency detection to generate the 3D target view cone point cloud; project the RGB image target feature information into 3D space and fuse it with the 3D point cloud density distribution information to obtain the simplified 3D target view cone point cloud; and generate the 3D target bounding box of the workshop scene with the AABB algorithm. Specifically: first, based on the multi-scale prediction characteristics of the YOLOv3 target detection framework, the convolutional backbone network DarkNet53 is improved into the MD56 backbone network to raise the 2D detection accuracy of multi-scale workshop-scene targets. Then, an RGB-D fused target saliency detection algorithm is constructed on the 2D rectangular region produced by the 2D detector to obtain the target pixels, which are used to generate the target view cone point cloud. Next, the target view cone point cloud is generated by aligning the depth image with the RGB image and applying view cone projection, and a simplified target view cone point cloud is extracted with the RGB-D multi-modal feature fusion strategy; this process replaces deep-neural-network-based 3D target inference. Finally, the 3D bounding box of the target point cloud is obtained with the axis-aligned bounding box algorithm, and the 3D pose coordinates of the target point cloud are calculated with the PCA algorithm.
Step 1: establish the workshop-scene target data set. 2000 RGB images containing the main targets of a workshop scene are collected with a web crawler; this part of the data is used as the similar-domain data set (Similar Domains data set) during training. 3000 images (1500 RGB images and the corresponding 1500 depth images) of a digitized workshop scene actually containing the main targets are taken with a RealSense D435 depth camera; this part of the data is used as the Target data set during training.
Step 2: modify DarkNet53 to improve the detection performance for multi-scale workshop-scene targets (see FIG. 2).
Step 21: first, add one feature extraction layer with a 3 × 3 convolution kernel, stride 1 and padding 1 at each of the 3 scale feature-extraction branches, enlarging the receptive fields of the y1, y2 and y3 branches for predicting targets of different scales (a sketch of such a layer follows).
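As an illustration of Step 21, the following is a minimal sketch of such an added feature-extraction layer, assuming a DarkNet-style Conv + BatchNorm + LeakyReLU composition; the patent only fixes the 3 × 3 kernel, stride 1 and padding 1, so the layer composition and channel count are assumptions.

```python
# Hypothetical sketch of the extra 3x3 feature-extraction layer added to each
# scale branch (y1, y2, y3). The Conv + BN + LeakyReLU composition is an
# assumption; the patent only specifies kernel 3x3, stride 1, padding 1.
import torch
import torch.nn as nn

class ExtraScaleConv(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # kernel 3x3 / stride 1 / padding 1 keeps the spatial size while
        # enlarging the receptive field of the scale branch
        return self.act(self.bn(self.conv(x)))

# Example: a 256-channel feature map of the mid-scale branch at 28x28 (448 / 16)
y2 = torch.randn(1, 256, 28, 28)
print(ExtraScaleConv(256)(y2).shape)  # torch.Size([1, 256, 28, 28])
```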
Step 22: then, adjust the network input size from 416 to 448 so that more feature information can be extracted and the detection accuracy of the network is improved.
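For reference, changing the input resolution also changes the grid sizes of the three prediction scales. A quick check, assuming the standard YOLOv3 down-sampling strides of 32, 16 and 8:

```python
# Grid sizes of the three prediction scales for 416 vs. 448 inputs,
# assuming the standard YOLOv3 strides of 32, 16 and 8.
for size in (416, 448):
    print(size, [size // s for s in (32, 16, 8)])
# 416 [13, 26, 52]
# 448 [14, 28, 56]
```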
Step 23: finally, for the whole network (as shown in FIG. 2(b)), the prediction results must be decoded from coordinate-box space and mapped to true values in image coordinates. Because the network parameters are randomly initialized, the initially predicted coordinate boxes may exceed the actual image boundaries. To limit the location range of each box, the offsets relative to the upper-left corner coordinates (C_x, C_y) of the grid cell are passed through a sigmoid function so that they stay in the range 0-1, keeping each predicted box inside its own grid cell. The width and height of the predicted box are decoded by converting the prior box to the actual image size through multiplication by the corresponding sampling rate, as shown in the following formulas.
Y = (sigmoid(t_y) + C_y) × stride
X = (sigmoid(t_x) + C_x) × stride
W = (P_w · e^(t_w)) × stride
H = (P_h · e^(t_h)) × stride
where t_x, t_y, t_w, t_h are the raw predictions, (C_x, C_y) are the coordinates of the upper-left corner of the grid cell containing the predicted box, P_w and P_h are the width and height of the prior box expressed at the current grid scale, and stride is the down-sampling factor (sampling rate) applied to the input image.
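The decoding above can be summarized with a short sketch. The variable names follow the formulas; the example grid cell, prior size and stride are illustrative only.

```python
# Minimal NumPy sketch of the coordinate-box decoding described above.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def decode_box(tx, ty, tw, th, cx, cy, prior_w, prior_h, stride):
    """Map a raw prediction (tx, ty, tw, th) back to image coordinates."""
    x = (sigmoid(tx) + cx) * stride      # center x; offset constrained to (0, 1)
    y = (sigmoid(ty) + cy) * stride      # center y
    w = prior_w * np.exp(tw) * stride    # width decoded from the prior box
    h = prior_h * np.exp(th) * stride    # height decoded from the prior box
    return x, y, w, h

# Example: grid cell (7, 5) on the stride-32 scale of a 448x448 input
print(decode_box(0.2, -0.1, 0.3, 0.1, cx=7, cy=5,
                 prior_w=3.6, prior_h=2.9, stride=32))
```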
Step 3: the RGB-D target saliency detection algorithm further fuses depth-image threshold segmentation to segment the targets in the image (see FIG. 3). The fused depth-image threshold segmentation proceeds as follows. For detected rectangular regions that do not overlap, each region is divided into target and background according to the threshold F_t. For two detected rectangular regions that overlap, the average depth values of the corresponding depth-image regions are calculated (the two cases illustrated in FIG. 3). The rectangular region with the smaller average depth value is divided into target and background according to the threshold F_t; the rectangular region with the larger average depth value is divided into target, foreground and background according to two thresholds F_t^1 and F_t^2. The thresholds F_t, F_t^1 and F_t^2 are obtained with an adaptive threshold calculation method. When the rectangular regions do not overlap, or when they overlap and the average depth value of the region is small (the first case in FIG. 3), a single threshold F_t is used and the RGB target pixel values P_rgb(x, y) are further combined with the thresholded depth-image pixel values P_d(x, y). When the average depth value of the overlapping region is larger (the second case in FIG. 3), the two thresholds F_t^1 and F_t^2 are used and the RGB target pixel values P_rgb(x, y) are likewise combined with the thresholded depth-image pixel values P_d(x, y). The combination formulas appear as equation images in the original filing.
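The depth-side part of this segmentation can be sketched as follows. The patent does not specify the adaptive threshold calculation, so Otsu's method over the depth histogram is used here purely as a stand-in assumption, and the near/far assignment of target, foreground and background is likewise an assumption.

```python
# Sketch of depth threshold segmentation inside one detected 2D box.
import numpy as np

def otsu_threshold(values, bins=256):
    """Simple Otsu threshold over a 1-D array of depth values (stand-in for
    the unspecified adaptive threshold calculation)."""
    hist, edges = np.histogram(values, bins=bins)
    prob = hist.astype(float) / max(hist.sum(), 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_t, best_var = centers[0], -1.0
    for k in range(1, bins):
        w0, w1 = prob[:k].sum(), prob[k:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (prob[:k] * centers[:k]).sum() / w0
        m1 = (prob[k:] * centers[k:]).sum() / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, centers[k]
    return best_t

def segment_box(depth_crop, overlapped_far=False):
    """Return a boolean target mask for the depth pixels of one 2D box."""
    d = depth_crop[depth_crop > 0]              # drop invalid depth readings
    if not overlapped_far:
        f_t = otsu_threshold(d)                 # single threshold F_t
        return depth_crop <= f_t                # nearer pixels taken as target
    # Overlapping box with the larger mean depth: two thresholds split it into
    # foreground occluder / target / background.
    f_t1 = otsu_threshold(d)
    f_t2 = otsu_threshold(d[d > f_t1])
    return (depth_crop > f_t1) & (depth_crop <= f_t2)
```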
and 4, step 4: and extracting a simplified target view cone point cloud by using an RGB-D multi-modal feature fusion strategy (see figure 4).
Step 41: first, the target feature pixels p_rgb_f in the pixel coordinate system of the RGB image are acquired with the Canny edge extraction algorithm and the Harris corner detection algorithm. Through the transformation between the pixel coordinate system and the image coordinate system of the RGB camera and the transformation between the image coordinate system and the camera coordinate system, the target feature points P_rgb_f in the camera coordinate system of the RGB camera are obtained. In these transformations, p_rgb_f is a feature pixel belonging to the target in the image coordinate system of the RGB camera, P_rgb_f is a feature point belonging to the target in the camera coordinate system of the RGB camera, T_w2c is the extrinsic matrix of the RGB camera, and K_c is the intrinsic matrix of the RGB camera (the transformation equations appear as formula images in the original filing).
Step 42: second, according to the mapping relation between the RGB camera and the depth camera, the target feature points P_d_f in the camera coordinate system of the depth camera that correspond to P_rgb_f are obtained, as shown in the following formulas:
P_rgb_f = T_d2rgb · P_d_f
T_d2rgb = T_w2c · T_w2d
where T_d2rgb is the mapping matrix from the camera coordinate system of the depth camera to the camera coordinate system of the RGB camera.
Step 43: finally, the target feature points P_d_f in the camera coordinate system of the depth camera are obtained through the transformation between the pixel coordinate system and the image coordinate system of the depth camera and the transformation between the image coordinate system and the camera coordinate system. In these transformations, the target feature pixels in the image coordinate system of the depth camera, the inverse of the intrinsic matrix of the depth camera and the inverse of the extrinsic matrix of the depth camera are used, and P_d_f denotes a target feature point in the camera coordinate system of the depth camera.
Combining the above transformations yields the simplified three-dimensional target feature points P_f (the combined formula appears as an equation image in the original filing).
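A hedged sketch of Steps 41-43 is given below: RGB feature pixels are back-projected with the depth value and the camera intrinsics and then moved into the depth camera frame. The pinhole back-projection, the use of an RGB-aligned depth image and the names K and T_d2rgb are assumptions about the exact form of the transformation equations, which appear only as formula images in the filing.

```python
# Sketch: lift RGB feature pixels (Canny edges / Harris corners) into 3D.
import numpy as np

def pixels_to_points(feature_px, depth_aligned, K, T_d2rgb=np.eye(4)):
    """feature_px: (N, 2) pixel coordinates (u, v) of RGB feature points.
    depth_aligned: depth image aligned to the RGB frame (metres).
    K: 3x3 RGB camera intrinsic matrix.
    T_d2rgb: 4x4 transform from the depth camera frame to the RGB camera frame.
    Returns (M, 3) feature points expressed in the depth camera frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    pts = []
    for u, v in feature_px:
        z = depth_aligned[int(v), int(u)]
        if z <= 0:                                   # skip missing depth
            continue
        # back-project through the intrinsics (pinhole model), homogeneous form
        p_rgb = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z, 1.0])
        # move into the depth camera frame using the inverse mapping
        p_d = np.linalg.inv(T_d2rgb) @ p_rgb
        pts.append(p_d[:3])
    return np.asarray(pts)
```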
and step 44: background outlier point clouds are removed, and depth image pixel points are distorted due to the fact that a depth camera is affected by external environment factors such as illumination, outliers exist in generated target view cone point clouds, and the outliers are prone to appearing in a boundary area of the target view cone point clouds. Therefore, the density of outliers in the target view cone point cloud is generally low, and the outliers are screened out by adopting a density clustering-based method. For the simplified 3D target characteristic point P fi Performing point cloud density distribution calculation, density distribution function D i As shown in the following equation:
Figure GDA0003683390840000067
in which N represents P fi Number of points in the range of the density radius of the center point cloud, x i ,y i ,z i Is a certain target feature point P fi Coordinate of (a), x j ,y j ,z j Represents by P fi Is the coordinate of a point in the density radius range of the central point cloud, wherein j belongs to [1, N ]]。r x ,r y ,r z Is a radius parameter of the point cloud density.
Finally, the point cloud P_fi with the maximum density distribution value D_i is selected as the center of the point cloud density cluster. Points that are not in the clustered point cloud set are marked as outliers. The points P_fi with RGB features that lie inside the clustered point cloud set are assembled into the simplified target view cone point cloud P. The RGB-D multi-modal feature fusion strategy thus fuses the RGB features and the density distribution features of the target: the RGB feature information ensures that P consists of feature point clouds P_rgb_f, and the density distribution information ensures that P consists of point clouds P_fi that meet the density requirement. The resulting simplified target view cone point cloud P therefore screens out outliers while simplifying the point cloud.
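A minimal sketch of this density screening is shown below; the neighbour-count density and the fixed radius values are assumptions standing in for the density distribution function given as an image in the filing.

```python
# Sketch of Step 44: keep points whose neighbourhood within the (r_x, r_y, r_z)
# radius is dense enough; the densest point seeds the cluster that becomes the
# simplified view cone point cloud.
import numpy as np

def density_filter(points, radius=(0.02, 0.02, 0.02), min_neighbors=5):
    """points: (N, 3) target view cone feature points. Returns inliers and the
    cluster center (the point with the highest neighbour count)."""
    scaled = points / np.asarray(radius)        # ellipsoidal radius -> unit sphere
    counts = np.zeros(len(points), dtype=int)
    for i, p in enumerate(scaled):
        dist = np.linalg.norm(scaled - p, axis=1)
        counts[i] = int((dist <= 1.0).sum()) - 1    # exclude the point itself
    center = points[np.argmax(counts)]              # densest point = cluster seed
    return points[counts >= min_neighbors], center
```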
Step 5: after the simplified target view cone point cloud P is obtained through the multi-modal feature fusion strategy, it is further processed to obtain the spatial size and pose information of the target. To obtain the 3D bounding box and pose of the target point cloud, the 3D bounding box of the simplified target view cone point cloud P is generated with the Axis-Aligned Bounding Box (AABB) algorithm; at the same time, the point cloud orientation is estimated with principal component analysis, and the three principal eigenvectors obtained are used as the pose coordinates of the target point cloud (see FIG. 4).
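A compact sketch of the AABB box and the PCA pose estimate follows; the eigenvector ordering (major axis first) is an illustrative choice.

```python
# Sketch of Step 5: axis-aligned bounding box + PCA pose for the simplified
# target view cone point cloud.
import numpy as np

def aabb_and_pose(points):
    """points: (N, 3) simplified target view cone point cloud P."""
    # AABB: per-axis extrema give the 3D bounding box and its size
    box_min, box_max = points.min(axis=0), points.max(axis=0)
    size = box_max - box_min
    # PCA: eigenvectors of the covariance matrix give the pose axes
    centroid = points.mean(axis=0)
    cov = np.cov((points - centroid).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    axes = eigvecs[:, np.argsort(eigvals)[::-1]]    # principal axes, major first
    return (box_min, box_max, size), centroid, axes
```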
Step 6: training of the MMFF-3D network model. Only the MD56-YOLOv3 network model in the framework needs to be trained. The whole training process adopts a pre-training and fine-tuning transfer learning method, which improves the ability of the network model to learn target features on the Target data set by first learning similar features on the similar-domain data set, and to a certain extent alleviates the relative scarcity of the Target data set.
Step 61: in the actual training process, the weights obtained by training on the ImageNet data set are first used as the initialization weights of the backbone network, and pre-training is then performed on the Similar Domains data set to obtain the pre-training weights.
Step 62: then, the workshop Target data set is randomly divided into a training set, a validation set and a test set at a ratio of 7 : 1 : 2. Training MD56-YOLOv3 on the Target data set proceeds in two steps. First, the pre-trained weights of the first 184 layers of the backbone network MD56 are frozen, the Adam optimizer is adopted, the learning rate is set to 1e-3, the batch size is set to 20, and 200 epochs are trained. Second, the backbone network MD56 is unfrozen, the learning rate is reduced to 1e-4, the batch size is set to 5, and another 300 epochs are trained.
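The two-phase schedule can be sketched as follows. The model object, its `frozen_layers` attribute and the data-loader factory are hypothetical placeholders; only the optimizer, learning rates, batch sizes and epoch counts come from the text above.

```python
# Hedged PyTorch-style sketch of the freeze / unfreeze transfer-learning schedule.
import torch

def train_two_phase(model, make_loader, train_one_epoch):
    # Phase 1: freeze the first 184 backbone layers, Adam, lr 1e-3, batch 20
    for p in model.frozen_layers.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
    for _ in range(200):
        train_one_epoch(model, make_loader(batch_size=20), opt)

    # Phase 2: unfreeze the backbone, Adam, lr 1e-4, batch 5, 300 more epochs
    for p in model.frozen_layers.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(300):
        train_one_epoch(model, make_loader(batch_size=5), opt)
```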
Step 7: the trained MMFF-3D target detection network model is tested; part of the test results are shown in FIG. 5.
Experimental evaluation and verification show that the MMFF-3D target detection network framework achieves a good 2D target detection effect in the workshop, while its 3D target detection also reaches good detection accuracy, as shown in Tables 1 and 2.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, numerous simple deductions or substitutions may be made without departing from the spirit of the invention, which shall be deemed to belong to the scope of the invention.
TABLE 1. Comparison of detection results between the improved MD56-YOLOv3 and YOLOv3 (the table is provided as an image in the original filing).
TABLE 2. Comparison of MMFF-3D target detection performance under different backbone networks (the table is provided as an image in the original filing).

Claims (6)

  1. An RGB-D multi-modal feature fusion 3D target detection method, characterized in that deep learning technology and 3D point cloud processing technology are combined and effectively applied to image data sets that have only a small amount of 2D annotations and no 3D annotations, the whole process comprising the following steps:
    step 1: establishing a data set of the detection target: collecting RGB images containing the main detection targets through a web crawler, this part of the data being used as the similar-domain data set during training; shooting RGB images and depth images actually containing the main targets with a depth camera, this part of the data being used as the Target data set during training and being divided into a training set and a test set;
    step 2: improving the backbone network DarkNet53 of the YOLOv3 target detection network to obtain the MD56-YOLOv3 target detection network;
    step 3: pre-training the MD56-YOLOv3 target detection network with the similar-domain data set of step 1, and then training the MD56-YOLOv3 target detection network by transfer learning with the training set of the Target data set of step 1;
    step 4: proposing an RGB-D target saliency detection algorithm to segment the pixel regions of the targets inside the 2D rectangular boxes output by the MD56-YOLOv3 target detection network;
    step 5: mapping and aligning the target pixel regions segmented by the RGB-D target saliency detection algorithm to the depth image of the target, and generating the target view cone point cloud by view cone projection;
    step 6: proposing an RGB-D multi-modal feature fusion strategy to simplify the target view cone point cloud obtained in step 5 and obtain the simplified target view cone point cloud;
    step 7: generating the bounding box of the 3D target with the AABB algorithm and the PCA algorithm;
    step 8: integrating all the algorithms involved in steps 1-7 into the RGB-D multi-modal feature fusion 3D target detection method and testing it with the test set of the Target data set collected in step 1;
    the improvement of the backbone network DarkNet53 of the YOLOv3 target detection network in step 2 comprises the following steps:
    step 21: adjusting the input target image size of the MD56-YOLOv3 target detection network from 416 to 448 to extract more feature information and improve the detection accuracy of the network;
    step 22: adding, at each of the 3 scale feature-extraction branches of the YOLOv3 target detection network, one feature extraction layer with a 3 × 3 convolution kernel, stride 1 and padding 1, respectively enlarging the receptive fields of the y1, y2 and y3 outputs for predicted targets of different scales;
    step 23: when the MD56-YOLOv3 target detection network outputs features, adjusting the corresponding feature dimensions, performing coordinate-box decoding on the prediction results of the network, and mapping them to the actual values of the target image coordinates through the coordinate-box decoding;
    the RGB-D target saliency detection algorithm in step 4 further comprises the following steps:
    step 41: acquiring the target pixel region in the RGB image based on the GrabCut algorithm, further acquiring the target pixel region in the depth image in combination with a threshold segmentation algorithm, and thereby segmenting the pixel region of the target inside the 2D rectangular box output by the MD56-YOLOv3 target detection network;
    step 42: when the output 2D rectangular boxes do not overlap, dividing the box into a target pixel region and a background pixel region according to the threshold F_t;
    step 43: when the output 2D rectangular boxes overlap, setting two thresholds F_t^1 and F_t^2 and using them to divide the output 2D rectangular box into a target pixel region, a foreground pixel region and a background pixel region;
    step 44: finally, combining the target pixel values P_rgb(x, y) obtained in the RGB image with the GrabCut algorithm and the threshold-segmented pixel values P_d(x, y) of the depth image to output the target pixels (the combination formulas are given as equation images in the original filing).
  2. The RGB-D multi-modal feature fusion 3D target detection method as claimed in claim 1, wherein step 1 further comprises the following step:
    2D-annotating the targets in the Target data set and storing the annotated target category information and target position information in a text file.
  3. The RGB-D multi-modal feature fusion 3D target detection method as claimed in claim 1, wherein the GrabCut algorithm is required in step 3.
  4. The RGB-D multi-modal feature fusion 3D target detection method as claimed in claim 1, wherein the RGB features of the target in step 6 are extracted by an edge extraction algorithm and a corner detection algorithm, the depth image features are the point cloud density distribution features of the target view cone point cloud, and the RGB-D multi-modal feature fusion fuses the RGB features with the point cloud density distribution features.
  5. The RGB-D multi-modal feature fusion 3D target detection method as claimed in claim 1, wherein step 6 further comprises the following steps:
    step 61: based on the camera calibration principle, fusing the RGB feature information of the target into the 3D point cloud through coordinate transformation to obtain the feature 3D point cloud P_f with RGB features; in the corresponding transformation, K_c^-1 denotes the inverse of the intrinsic matrix of the RGB camera, and the inverse of the extrinsic matrix of the RGB camera, the inverse of the extrinsic matrix of the depth camera and the feature pixels in the RGB image are also used (the transformation formula is given as an equation image in the original filing);
    step 62: calculating the point cloud density distribution feature, where N denotes the number of points within the density radius around the center point P_fi, (x_i, y_i, z_i) are the coordinates of a feature point P_fi, (x_j, y_j, z_j) with j ∈ [1, N] are the coordinates of the points within the density radius around P_fi, and (r_x, r_y, r_z) are the density radius parameters of the point cloud (the density distribution function is given as an equation image in the original filing);
    step 63: selecting the point cloud P_fi with the maximum density distribution value D_i as the center of the point cloud density cluster set, clustering with (r_x, r_y, r_z) as the radius of the point cloud density cluster set, marking the point clouds that are not in the point cloud density cluster set as outliers, and assembling the points P_fi with RGB features inside the point cloud density cluster set into the simplified target view cone point cloud.
  6. The RGB-D multi-modal feature fusion 3D target detection method as claimed in any one of claims 1-5, wherein, after an RGB-D image is input, the method obtains not only the semantic category information of the targets in the target image but also the 3D spatial position information of the targets.
CN202110545313.5A 2021-05-19 2021-05-19 RGB-D multi-modal feature fusion 3D target detection method Active CN113408584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110545313.5A CN113408584B (en) 2021-05-19 2021-05-19 RGB-D multi-modal feature fusion 3D target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110545313.5A CN113408584B (en) 2021-05-19 2021-05-19 RGB-D multi-modal feature fusion 3D target detection method

Publications (2)

Publication Number Publication Date
CN113408584A CN113408584A (en) 2021-09-17
CN113408584B true CN113408584B (en) 2022-07-26

Family

ID=77678851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110545313.5A Active CN113408584B (en) 2021-05-19 2021-05-19 RGB-D multi-modal feature fusion 3D target detection method

Country Status (1)

Country Link
CN (1) CN113408584B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113379793B (en) * 2021-05-19 2022-08-12 成都理工大学 On-line multi-target tracking method based on twin network structure and attention mechanism
CN113963044B (en) * 2021-09-30 2024-04-30 北京工业大学 Cargo box intelligent loading method and system based on RGBD camera
CN114170521B (en) * 2022-02-11 2022-06-17 杭州蓝芯科技有限公司 Forklift pallet butt joint identification positioning method
CN115578461B (en) * 2022-11-14 2023-03-10 之江实验室 Object attitude estimation method and device based on bidirectional RGB-D feature fusion
CN116580056B (en) * 2023-05-05 2023-11-17 武汉理工大学 Ship detection and tracking method and device, electronic equipment and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689008A (en) * 2019-09-17 2020-01-14 大连理工大学 Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
AU2020101011A4 (en) * 2019-06-26 2020-07-23 Zhejiang University Method for identifying concrete cracks based on yolov3 deep learning model
CN110472534A (en) * 2019-07-31 2019-11-19 厦门理工学院 3D object detection method, device, equipment and storage medium based on RGB-D data
CN111612728A (en) * 2020-05-25 2020-09-01 北京交通大学 3D point cloud densification method and device based on binocular RGB image
CN111723721A (en) * 2020-06-15 2020-09-29 中国传媒大学 Three-dimensional target detection method, system and device based on RGB-D
CN112651406A (en) * 2020-12-18 2021-04-13 浙江大学 Depth perception and multi-mode automatic fusion RGB-D significance target detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Indoor Scene Parsing Algorithms Based on RGB-D Multi-modal Images; Hang Lingxiao; China Master's Theses Full-text Database (Information Science and Technology); 2020-08-15; I138-686 *

Also Published As

Publication number Publication date
CN113408584A (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN113408584B (en) RGB-D multi-modal feature fusion 3D target detection method
Yang et al. Visual perception enabled industry intelligence: state of the art, challenges and prospects
Cui et al. Deep learning for image and point cloud fusion in autonomous driving: A review
Garcia-Garcia et al. A survey on deep learning techniques for image and video semantic segmentation
Sakaridis et al. Semantic foggy scene understanding with synthetic data
Von Stumberg et al. Gn-net: The gauss-newton loss for multi-weather relocalization
Wang et al. Data-driven based tiny-YOLOv3 method for front vehicle detection inducing SPP-net
Zhou et al. Self‐supervised learning to visually detect terrain surfaces for autonomous robots operating in forested terrain
CN110852182B (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
Alidoost et al. A CNN-based approach for automatic building detection and recognition of roof types using a single aerial image
Geng et al. Using deep learning in infrared images to enable human gesture recognition for autonomous vehicles
Wang et al. An overview of 3d object detection
CN107527054B (en) Automatic foreground extraction method based on multi-view fusion
CN105574545B (en) The semantic cutting method of street environment image various visual angles and device
CN116052222A (en) Cattle face recognition method for naturally collecting cattle face image
Xu et al. Segment as points for efficient and effective online multi-object tracking and segmentation
Gu et al. Embedded and real-time vehicle detection system for challenging on-road scenes
Tsutsui et al. Distantly supervised road segmentation
Sun et al. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
CN111523494A (en) Human body image detection method
Wael A comprehensive vehicle-detection-and-tracking technique for autonomous driving
Shuai et al. An improved YOLOv5-based method for multi-species tea shoot detection and picking point location in complex backgrounds
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
Zhang et al. Multi-FEAT: Multi-feature edge alignment for targetless camera-LiDAR calibration
CN117115555A (en) Semi-supervised three-dimensional target detection method based on noise data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant