CN113408584B - RGB-D multi-modal feature fusion 3D target detection method


Info

Publication number
CN113408584B
CN113408584B
Authority
CN
China
Prior art keywords
target
rgb
point cloud
detection
algorithm
Prior art date
Legal status
Active
Application number
CN202110545313.5A
Other languages
Chinese (zh)
Other versions
CN113408584A (en)
Inventor
陈光柱
侯睿
韩银贺
唐在作
茹青君
Current Assignee
Chengdu University of Technology
Original Assignee
Chengdu University of Technology
Priority date
Filing date
Publication date
Application filed by Chengdu University of Technology
Priority to CN202110545313.5A
Publication of CN113408584A
Application granted
Publication of CN113408584B
Legal status: Active
Anticipated expiration

Classifications

    • G06F18/253 Fusion techniques of extracted features
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N3/08 Neural networks; Learning methods
    • G06T7/11 Region-based segmentation
    • G06T7/136 Segmentation; Edge detection involving thresholding
    • G06T7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T2207/10024 Image acquisition modality; Color image
    • G06T2207/10028 Image acquisition modality; Range image; Depth image; 3D point clouds
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an RGB-D multi-modal feature fusion 3D target detection method. 3D target detection obtains both the semantic information and the spatial size information of a target and is therefore of great significance for intelligent 3D target detection. Specifically, the method comprises the following steps: first, an improved YOLOv3 target detection network model provides a 2D prior region, an RGB-D target saliency detection algorithm is proposed to extract the target pixels, and the target view cone point cloud is obtained by view cone projection; second, to remove outliers and reduce the number of points in the target view cone point cloud, a multi-modal feature fusion strategy is proposed to simplify the target view cone point cloud, and this strategy replaces deep-neural-network-based 3D target inference; finally, the 3D bounding box of the target point cloud is generated with the axis-aligned bounding box (AABB) algorithm, and the pose coordinates of the target point cloud are calculated with the PCA algorithm. The beneficial effects of the invention are as follows: the RGB-D multi-modal feature fusion 3D target detection method improves the detection accuracy of multi-scale scene targets in application scenes that have only a small amount of 2D annotation data and no 3D annotation data, and it offers good real-time performance and high accuracy.

Description

RGB-D multi-modal feature fusion 3D target detection method
Technical Field
The invention relates to the fields of computer vision, image recognition and target detection, in particular to an RGB-D multi-modal feature fusion 3D target detection method.
Background
Target detection, as an important branch of machine vision, spans multiple disciplines and fields and is the basis of higher-level tasks such as target tracking, behavior recognition and scene flow estimation, which can only mature once target detection itself is mature. Target detection identifies and locates the target objects in a scene, such as cars, pedestrians and roads. Target recognition distinguishes the objects of interest in the scene, obtains their categories and gives the classification probabilities; target localization calibrates the position of each object of interest in the scene, usually by framing its boundary with a rectangular box or a cuboid box. Target detection currently has enormous application prospects and already appears in fields such as face recognition, intelligent monitoring, intelligent workshops, intelligent transportation and autonomous driving. 3D target detection obtains not only the semantic information of the target but also its spatial size information, and therefore has both research value and application prospects.
At present, traditional image processing methods usually require a complex feature extractor designed for a specific detection target, and such algorithms generalize poorly, so traditional target detection methods struggle to realize intelligent target detection. With the continuous development of artificial intelligence and computer vision technology, image recognition based on neural networks and deep learning has achieved excellent performance, and 2D object detection algorithms have developed rapidly. Compared with traditional methods, 2D target detection can efficiently detect many categories of targets with high detection accuracy, strong generalization capability and good robustness. However, 2D target detection cannot acquire the actual 3D spatial information of a target (spatial pose coordinates, 3D size, etc.). 3D target detection therefore expresses the actual spatial position of the detected target more accurately, helps identify and locate targets precisely, and more effectively ensures the safety of interactive operations with the targets.
In recent years, as the accuracy of depth sensors such as 3D lidar and RGB-D cameras has improved, 3D target detection technology has made breakthrough progress. As an important task in scene understanding, 3D target detection classifies the targets of interest in 3D data and localizes their 3D bounding boxes. Compared with 2D target detection, 3D target detection not only acquires the semantic information of the target of interest but also localizes its 3D bounding box more accurately, so 3D target detection techniques are more valuable than 2D ones. At present, 3D target detection is mostly researched and applied in outdoor autonomous driving, while indoor 3D target detection research mainly focuses on 3D detection of daily-life scene targets, robot-arm positioning and workpiece grasping. These methods all rely on large annotated data sets for a specific scene, which hinders their adoption in practical application scenarios.
Most existing 3D target detection methods need large-scale 3D annotated data sets that are difficult to construct, which makes them hard to apply to practical detection requirements. Therefore, the RGB-D multi-modal feature fusion 3D target detection method proposed here, which can be effectively applied to 3D target detection in practical application scenes, has great research significance.
Disclosure of Invention
The main purpose of the invention is to provide an RGB-D multi-modal feature fusion 3D target detection method. By improving the YOLOv3 network model, the method improves the detection accuracy of multi-scale targets in the detection scene, and it realizes efficient 3D target detection with only a small amount of 2D annotation data and without relying on 3D annotation data.
The invention is realized by adopting the following technical scheme: an RGB-D multi-modal feature fusion 3D target detection method (hereinafter abbreviated as the MMFF-3D target detection method), comprising the following steps:
Step 1: preliminarily establish a target data set of the detection scene: collect pictures with a web crawler, take pictures in an actual workshop, and apply 2D annotations to the data set.
Step 2: based on the multi-scale prediction characteristics of the YOLOv3 target detection framework, further improve the convolutional backbone network DarkNet53 into the MD56 backbone network, obtaining the MD56-YOLOv3 target detection framework to raise the 2D detection accuracy of the target, and train the MD56-YOLOv3 network on the data set established in Step 1.
Step 3: on the basis of the 2D rectangular region obtained in Step 2, construct an RGB-D target saliency detection algorithm to obtain the pixel region of the target.
Step 4: on the basis of the target pixel region obtained in Step 3, generate the target view cone point cloud by aligning the depth image with the RGB image and applying view cone projection; propose an RGB-D multi-modal feature fusion strategy to extract a simplified target view cone point cloud, replacing deep-neural-network-based 3D target inference; finally, obtain the 3D bounding box of the target point cloud with the axis-aligned bounding box algorithm and calculate the 3D pose coordinates of the target point cloud with the PCA algorithm.
The beneficial technical effects of the invention are as follows:
1. the 2D detection accuracy of multi-scale targets in the scene is effectively improved;
2. when an object is occluded, the target pixels can still be segmented effectively and 3D detection realized;
3. 3D target detection is realized with only a small amount of 2D annotation data and no 3D annotation data;
4. the MMFF-3D target detection method meets the real-time and accuracy requirements of 3D target detection.
drawings
FIG. 1 is a schematic diagram of an MMFF-3D object detection network model framework.
FIG. 2 is a schematic diagram of the MD56-YOLOv3 target detection framework.
FIG. 3 is a schematic diagram of the process of RGB-D target saliency detection algorithm acquiring a target pixel.
FIG. 4 is a schematic diagram of a process for implementing 3D target detection by an RGB-D multimodal feature fusion process.
FIG. 5 is a diagram of the target detection effect of an MMFF-3D target detection network model framework in an intelligent workshop application scenario.
Detailed description of the preferred embodiments
To facilitate understanding of the present invention, some background on target detection is introduced first. Target detection is one of the most fundamental and challenging problems in computer vision and has received wide attention in related research. For image recognition, target detection recognizes the category of an object of interest in a digital image and locates its position in the image; at the same time, target detection serves as basic research for visual processing tasks such as instance segmentation and target tracking. As a hot research direction in image processing, target detection was, in the era of traditional computer vision, mainly realized by designing complicated hand-crafted feature extractors. Compared with traditional detectors built on hand-crafted feature extractors, target detection algorithms based on deep neural networks have simpler structural designs, extract features automatically, and achieve high detection accuracy and good robustness; accordingly, the main research direction in target detection is now based on deep learning and neural network technology. Moreover, compared with 2D target detection, 3D target detection localizes the 3D bounding box of the target while acquiring its semantic information, so 3D target detection techniques are more valuable than 2D ones.
An embodiment of the present invention is described in detail below, taking a workshop scenario as the concrete implementation and application, with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of the MMFF-3D target detection network model framework of the present invention. In combination with the diagram, the general implementation is as follows: establish a workshop-scene target data set; improve YOLOv3 to raise the detection performance for multi-scale workshop-scene targets; combine 2D target detection with RGB-D fused target saliency detection to generate the 3D target view cone point cloud; project the RGB image target feature information into 3D space and fuse it with the 3D point cloud density distribution information to obtain the simplified 3D target view cone point cloud; and generate the 3D target bounding box of the workshop scene with the AABB algorithm. Specifically: first, based on the multi-scale prediction characteristics of the YOLOv3 target detection framework, the convolutional backbone network DarkNet53 is improved into the MD56 backbone network to raise the 2D detection accuracy of multi-scale workshop-scene targets. Then, an RGB-D fused target saliency detection algorithm is constructed on the 2D rectangular region produced by the 2D detector to obtain the target pixels, which are used to generate the target view cone point cloud. Next, the target view cone point cloud is generated by aligning the depth image with the RGB image and applying view cone projection, and a simplified target view cone point cloud is extracted with the RGB-D multi-modal feature fusion strategy; this process replaces deep-neural-network-based 3D target inference. Finally, the 3D bounding box of the target point cloud is obtained with the axis-aligned bounding box algorithm, and the 3D pose coordinates of the target point cloud are calculated with the PCA algorithm.
Step 1: establish the workshop-scene target data set. 2000 RGB images containing the main targets of a workshop scene are collected with a web crawler; this part of the data is used as the similar-domain data set (Similar Domains data set) during training. 3000 images (1500 RGB images and the corresponding 1500 depth images) of a digitized workshop scene actually containing the main targets are taken with a RealSense D435 depth camera; this part of the data is used as the Target data set during training.
Step 2: modify DarkNet53 to improve the detection performance for multi-scale workshop-scene targets (see FIG. 2).
Step 21: first, add one feature extraction layer with a 3 × 3 convolution kernel, stride 1 and padding 1 at each of the 3 scale feature-extraction branches, enlarging the receptive fields of the y1, y2 and y3 branches for predicting targets of different scales (a sketch of such a layer follows).
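As an illustration of Step 21, the following is a minimal sketch of such an added feature-extraction layer, assuming a DarkNet-style Conv + BatchNorm + LeakyReLU composition; the patent only fixes the 3 × 3 kernel, stride 1 and padding 1, so the layer composition and channel count are assumptions.

```python
# Hypothetical sketch of the extra 3x3 feature-extraction layer added to each
# scale branch (y1, y2, y3). The Conv + BN + LeakyReLU composition is an
# assumption; the patent only specifies kernel 3x3, stride 1, padding 1.
import torch
import torch.nn as nn

class ExtraScaleConv(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # kernel 3x3 / stride 1 / padding 1 keeps the spatial size while
        # enlarging the receptive field of the scale branch
        return self.act(self.bn(self.conv(x)))

# Example: a 256-channel feature map of the mid-scale branch at 28x28 (448 / 16)
y2 = torch.randn(1, 256, 28, 28)
print(ExtraScaleConv(256)(y2).shape)  # torch.Size([1, 256, 28, 28])
```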
Step 22: then, adjust the network input size from 416 to 448 so that more feature information can be extracted and the detection accuracy of the network is improved.
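For reference, changing the input resolution also changes the grid sizes of the three prediction scales. A quick check, assuming the standard YOLOv3 down-sampling strides of 32, 16 and 8:

```python
# Grid sizes of the three prediction scales for 416 vs. 448 inputs,
# assuming the standard YOLOv3 strides of 32, 16 and 8.
for size in (416, 448):
    print(size, [size // s for s in (32, 16, 8)])
# 416 [13, 26, 52]
# 448 [14, 28, 56]
```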
Step 23: finally, for the whole network (as shown in FIG. 2(b)), the prediction results must be decoded from coordinate-box space and mapped to true values in image coordinates. Because the network parameters are randomly initialized, the initially predicted coordinate boxes may exceed the actual image boundaries. To limit the location range of each box, the offsets relative to the upper-left corner coordinates (C_x, C_y) of the grid cell are passed through a sigmoid function so that they stay in the range 0-1, keeping each predicted box inside its own grid cell. The width and height of the predicted box are decoded by converting the prior box to the actual image size through multiplication by the corresponding sampling rate, as shown in the following formulas.
Y = (sigmoid(t_y) + C_y) × stride
X = (sigmoid(t_x) + C_x) × stride
W = (P_w · e^(t_w)) × stride
H = (P_h · e^(t_h)) × stride
where t_x, t_y, t_w, t_h are the raw predictions, (C_x, C_y) are the coordinates of the upper-left corner of the grid cell containing the predicted box, P_w and P_h are the width and height of the prior box expressed at the current grid scale, and stride is the down-sampling factor (sampling rate) applied to the input image.
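The decoding above can be summarized with a short sketch. The variable names follow the formulas; the example grid cell, prior size and stride are illustrative only.

```python
# Minimal NumPy sketch of the coordinate-box decoding described above.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def decode_box(tx, ty, tw, th, cx, cy, prior_w, prior_h, stride):
    """Map a raw prediction (tx, ty, tw, th) back to image coordinates."""
    x = (sigmoid(tx) + cx) * stride      # center x; offset constrained to (0, 1)
    y = (sigmoid(ty) + cy) * stride      # center y
    w = prior_w * np.exp(tw) * stride    # width decoded from the prior box
    h = prior_h * np.exp(th) * stride    # height decoded from the prior box
    return x, y, w, h

# Example: grid cell (7, 5) on the stride-32 scale of a 448x448 input
print(decode_box(0.2, -0.1, 0.3, 0.1, cx=7, cy=5,
                 prior_w=3.6, prior_h=2.9, stride=32))
```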
Step 3: the RGB-D target saliency detection algorithm further fuses depth-image threshold segmentation to segment the targets in the image (see FIG. 3). The fused depth-image threshold segmentation proceeds as follows. For detected rectangular regions that do not overlap, each region is divided into target and background according to the threshold F_t. For two detected rectangular regions that overlap, the average depth values of the corresponding depth-image regions are calculated (the two cases illustrated in FIG. 3). The rectangular region with the smaller average depth value is divided into target and background according to the threshold F_t; the rectangular region with the larger average depth value is divided into target, foreground and background according to two thresholds F_t^1 and F_t^2. The thresholds F_t, F_t^1 and F_t^2 are obtained with an adaptive threshold calculation method. When the rectangular regions do not overlap, or when they overlap and the average depth value of the region is small (the first case in FIG. 3), a single threshold F_t is used and the RGB target pixel values P_rgb(x, y) are further combined with the thresholded depth-image pixel values P_d(x, y). When the average depth value of the overlapping region is larger (the second case in FIG. 3), the two thresholds F_t^1 and F_t^2 are used and the RGB target pixel values P_rgb(x, y) are likewise combined with the thresholded depth-image pixel values P_d(x, y). The combination formulas appear as equation images in the original filing.
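The depth-side part of this segmentation can be sketched as follows. The patent does not specify the adaptive threshold calculation, so Otsu's method over the depth histogram is used here purely as a stand-in assumption, and the near/far assignment of target, foreground and background is likewise an assumption.

```python
# Sketch of depth threshold segmentation inside one detected 2D box.
import numpy as np

def otsu_threshold(values, bins=256):
    """Simple Otsu threshold over a 1-D array of depth values (stand-in for
    the unspecified adaptive threshold calculation)."""
    hist, edges = np.histogram(values, bins=bins)
    prob = hist.astype(float) / max(hist.sum(), 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    best_t, best_var = centers[0], -1.0
    for k in range(1, bins):
        w0, w1 = prob[:k].sum(), prob[k:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (prob[:k] * centers[:k]).sum() / w0
        m1 = (prob[k:] * centers[k:]).sum() / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, centers[k]
    return best_t

def segment_box(depth_crop, overlapped_far=False):
    """Return a boolean target mask for the depth pixels of one 2D box."""
    d = depth_crop[depth_crop > 0]              # drop invalid depth readings
    if not overlapped_far:
        f_t = otsu_threshold(d)                 # single threshold F_t
        return depth_crop <= f_t                # nearer pixels taken as target
    # Overlapping box with the larger mean depth: two thresholds split it into
    # foreground occluder / target / background.
    f_t1 = otsu_threshold(d)
    f_t2 = otsu_threshold(d[d > f_t1])
    return (depth_crop > f_t1) & (depth_crop <= f_t2)
```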
and 4, step 4: and extracting a simplified target view cone point cloud by using an RGB-D multi-modal feature fusion strategy (see figure 4).
Step 41: first, the target feature pixels p_rgb_f in the pixel coordinate system of the RGB image are acquired with the Canny edge extraction algorithm and the Harris corner detection algorithm. Through the transformation between the pixel coordinate system and the image coordinate system of the RGB camera and the transformation between the image coordinate system and the camera coordinate system, the target feature points P_rgb_f in the camera coordinate system of the RGB camera are obtained. In these transformations, p_rgb_f is a feature pixel belonging to the target in the image coordinate system of the RGB camera, P_rgb_f is a feature point belonging to the target in the camera coordinate system of the RGB camera, T_w2c is the extrinsic matrix of the RGB camera, and K_c is the intrinsic matrix of the RGB camera (the transformation equations appear as formula images in the original filing).
Step 42: second, according to the mapping relation between the RGB camera and the depth camera, the target feature points P_d_f in the camera coordinate system of the depth camera that correspond to P_rgb_f are obtained, as shown in the following formulas:
P_rgb_f = T_d2rgb · P_d_f
T_d2rgb = T_w2c · T_w2d
where T_d2rgb is the mapping matrix from the camera coordinate system of the depth camera to the camera coordinate system of the RGB camera.
Step 43: finally, the target feature points P_d_f in the camera coordinate system of the depth camera are obtained through the transformation between the pixel coordinate system and the image coordinate system of the depth camera and the transformation between the image coordinate system and the camera coordinate system. In these transformations, the target feature pixels in the image coordinate system of the depth camera, the inverse of the intrinsic matrix of the depth camera and the inverse of the extrinsic matrix of the depth camera are used, and P_d_f denotes a target feature point in the camera coordinate system of the depth camera.
Combining the above transformations yields the simplified three-dimensional target feature points P_f (the combined formula appears as an equation image in the original filing).
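A hedged sketch of Steps 41-43 is given below: RGB feature pixels are back-projected with the depth value and the camera intrinsics and then moved into the depth camera frame. The pinhole back-projection, the use of an RGB-aligned depth image and the names K and T_d2rgb are assumptions about the exact form of the transformation equations, which appear only as formula images in the filing.

```python
# Sketch: lift RGB feature pixels (Canny edges / Harris corners) into 3D.
import numpy as np

def pixels_to_points(feature_px, depth_aligned, K, T_d2rgb=np.eye(4)):
    """feature_px: (N, 2) pixel coordinates (u, v) of RGB feature points.
    depth_aligned: depth image aligned to the RGB frame (metres).
    K: 3x3 RGB camera intrinsic matrix.
    T_d2rgb: 4x4 transform from the depth camera frame to the RGB camera frame.
    Returns (M, 3) feature points expressed in the depth camera frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    pts = []
    for u, v in feature_px:
        z = depth_aligned[int(v), int(u)]
        if z <= 0:                                   # skip missing depth
            continue
        # back-project through the intrinsics (pinhole model), homogeneous form
        p_rgb = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z, 1.0])
        # move into the depth camera frame using the inverse mapping
        p_d = np.linalg.inv(T_d2rgb) @ p_rgb
        pts.append(p_d[:3])
    return np.asarray(pts)
```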
and step 44: background outlier point clouds are removed, and depth image pixel points are distorted due to the fact that a depth camera is affected by external environment factors such as illumination, outliers exist in generated target view cone point clouds, and the outliers are prone to appearing in a boundary area of the target view cone point clouds. Therefore, the density of outliers in the target view cone point cloud is generally low, and the outliers are screened out by adopting a density clustering-based method. For the simplified 3D target characteristic point P fi Performing point cloud density distribution calculation, density distribution function D i As shown in the following equation:
Figure GDA0003683390840000067
in which N represents P fi Number of points in the range of the density radius of the center point cloud, x i ,y i ,z i Is a certain target feature point P fi Coordinate of (a), x j ,y j ,z j Represents by P fi Is the coordinate of a point in the density radius range of the central point cloud, wherein j belongs to [1, N ]]。r x ,r y ,r z Is a radius parameter of the point cloud density.
Finally, the point cloud P_fi with the maximum density distribution value D_i is selected as the center of the point cloud density cluster. Points that are not in the clustered point cloud set are marked as outliers. The points P_fi with RGB features that lie inside the clustered point cloud set are assembled into the simplified target view cone point cloud P. The RGB-D multi-modal feature fusion strategy thus fuses the RGB features and the density distribution features of the target: the RGB feature information ensures that P consists of feature point clouds P_rgb_f, and the density distribution information ensures that P consists of point clouds P_fi that meet the density requirement. The resulting simplified target view cone point cloud P therefore screens out outliers while simplifying the point cloud.
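A minimal sketch of this density screening is shown below; the neighbour-count density and the fixed radius values are assumptions standing in for the density distribution function given as an image in the filing.

```python
# Sketch of Step 44: keep points whose neighbourhood within the (r_x, r_y, r_z)
# radius is dense enough; the densest point seeds the cluster that becomes the
# simplified view cone point cloud.
import numpy as np

def density_filter(points, radius=(0.02, 0.02, 0.02), min_neighbors=5):
    """points: (N, 3) target view cone feature points. Returns inliers and the
    cluster center (the point with the highest neighbour count)."""
    scaled = points / np.asarray(radius)        # ellipsoidal radius -> unit sphere
    counts = np.zeros(len(points), dtype=int)
    for i, p in enumerate(scaled):
        dist = np.linalg.norm(scaled - p, axis=1)
        counts[i] = int((dist <= 1.0).sum()) - 1    # exclude the point itself
    center = points[np.argmax(counts)]              # densest point = cluster seed
    return points[counts >= min_neighbors], center
```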
Step 5: after the simplified target view cone point cloud P is obtained through the multi-modal feature fusion strategy, it is further processed to obtain the spatial size and pose information of the target. To obtain the 3D bounding box and pose of the target point cloud, the 3D bounding box of the simplified target view cone point cloud P is generated with the Axis-Aligned Bounding Box (AABB) algorithm; at the same time, the point cloud orientation is estimated with principal component analysis, and the three principal eigenvectors obtained are used as the pose coordinates of the target point cloud (see FIG. 4).
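A compact sketch of the AABB box and the PCA pose estimate follows; the eigenvector ordering (major axis first) is an illustrative choice.

```python
# Sketch of Step 5: axis-aligned bounding box + PCA pose for the simplified
# target view cone point cloud.
import numpy as np

def aabb_and_pose(points):
    """points: (N, 3) simplified target view cone point cloud P."""
    # AABB: per-axis extrema give the 3D bounding box and its size
    box_min, box_max = points.min(axis=0), points.max(axis=0)
    size = box_max - box_min
    # PCA: eigenvectors of the covariance matrix give the pose axes
    centroid = points.mean(axis=0)
    cov = np.cov((points - centroid).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    axes = eigvecs[:, np.argsort(eigvals)[::-1]]    # principal axes, major first
    return (box_min, box_max, size), centroid, axes
```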
Step 6: training of the MMFF-3D network model. Only the MD56-YOLOv3 network model in the framework needs to be trained. The whole training process adopts a pre-training and fine-tuning transfer learning method, which improves the ability of the network model to learn target features on the Target data set by first learning similar features on the similar-domain data set, and to a certain extent alleviates the relative scarcity of the Target data set.
Step 61: in the actual training process, the weights obtained by training on the ImageNet data set are first used as the initialization weights of the backbone network, and pre-training is then performed on the Similar Domains data set to obtain the pre-training weights.
Step 62: then, the workshop Target data set is randomly divided into a training set, a validation set and a test set at a ratio of 7 : 1 : 2. Training MD56-YOLOv3 on the Target data set proceeds in two steps. First, the pre-trained weights of the first 184 layers of the backbone network MD56 are frozen, the Adam optimizer is adopted, the learning rate is set to 1e-3, the batch size is set to 20, and 200 epochs are trained. Second, the backbone network MD56 is unfrozen, the learning rate is reduced to 1e-4, the batch size is set to 5, and another 300 epochs are trained.
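The two-phase schedule can be sketched as follows. The model object, its `frozen_layers` attribute and the data-loader factory are hypothetical placeholders; only the optimizer, learning rates, batch sizes and epoch counts come from the text above.

```python
# Hedged PyTorch-style sketch of the freeze / unfreeze transfer-learning schedule.
import torch

def train_two_phase(model, make_loader, train_one_epoch):
    # Phase 1: freeze the first 184 backbone layers, Adam, lr 1e-3, batch 20
    for p in model.frozen_layers.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
    for _ in range(200):
        train_one_epoch(model, make_loader(batch_size=20), opt)

    # Phase 2: unfreeze the backbone, Adam, lr 1e-4, batch 5, 300 more epochs
    for p in model.frozen_layers.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(300):
        train_one_epoch(model, make_loader(batch_size=5), opt)
```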
Step 7: the trained MMFF-3D target detection network model is tested; part of the test results are shown in FIG. 5.
Experimental evaluation and verification show that the MMFF-3D target detection network framework achieves a good 2D target detection effect in the workshop, while its 3D target detection also reaches good detection accuracy, as shown in Tables 1 and 2.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, numerous simple deductions or substitutions may be made without departing from the spirit of the invention, which shall be deemed to belong to the scope of the invention.
TABLE 1. Comparison of detection results between the improved MD56-YOLOv3 and YOLOv3 (the table is provided as an image in the original filing).
TABLE 2. Comparison of MMFF-3D target detection performance under different backbone networks (the table is provided as an image in the original filing).

Claims (6)

  1. An RGB-D multi-modal feature fusion 3D target detection method, characterized in that deep learning technology and 3D point cloud processing technology are combined and effectively applied to image data sets that have only a small amount of 2D annotations and no 3D annotations, the whole process comprising the following steps:
    step 1: establishing a data set of the detection target: collecting RGB images containing the main detection targets through a web crawler, this part of the data being used as the similar-domain data set during training; shooting RGB images and depth images actually containing the main targets with a depth camera, this part of the data being used as the Target data set during training and being divided into a training set and a test set;
    step 2: improving the backbone network DarkNet53 of the YOLOv3 target detection network to obtain the MD56-YOLOv3 target detection network;
    step 3: pre-training the MD56-YOLOv3 target detection network with the similar-domain data set of step 1, and then training the MD56-YOLOv3 target detection network by transfer learning with the training set of the Target data set of step 1;
    step 4: proposing an RGB-D target saliency detection algorithm to segment the pixel regions of the targets inside the 2D rectangular boxes output by the MD56-YOLOv3 target detection network;
    step 5: mapping and aligning the target pixel regions segmented by the RGB-D target saliency detection algorithm to the depth image of the target, and generating the target view cone point cloud by view cone projection;
    step 6: proposing an RGB-D multi-modal feature fusion strategy to simplify the target view cone point cloud obtained in step 5 and obtain the simplified target view cone point cloud;
    step 7: generating the bounding box of the 3D target with the AABB algorithm and the PCA algorithm;
    step 8: integrating all the algorithms involved in steps 1-7 into the RGB-D multi-modal feature fusion 3D target detection method and testing it with the test set of the Target data set collected in step 1;
    the improvement of the backbone network DarkNet53 of the YOLOv3 target detection network in step 2 comprises the following steps:
    step 21: adjusting the input target image size of the MD56-YOLOv3 target detection network from 416 to 448 to extract more feature information and improve the detection accuracy of the network;
    step 22: adding, at each of the 3 scale feature-extraction branches of the YOLOv3 target detection network, one feature extraction layer with a 3 × 3 convolution kernel, stride 1 and padding 1, respectively enlarging the receptive fields of the y1, y2 and y3 outputs for predicted targets of different scales;
    step 23: when the MD56-YOLOv3 target detection network outputs features, adjusting the corresponding feature dimensions, performing coordinate-box decoding on the prediction results of the network, and mapping them to the actual values of the target image coordinates through the coordinate-box decoding;
    the RGB-D target saliency detection algorithm in step 4 further comprises the following steps:
    step 41: acquiring the target pixel region in the RGB image based on the GrabCut algorithm, further acquiring the target pixel region in the depth image in combination with a threshold segmentation algorithm, and thereby segmenting the pixel region of the target inside the 2D rectangular box output by the MD56-YOLOv3 target detection network;
    step 42: when the output 2D rectangular boxes do not overlap, dividing the box into a target pixel region and a background pixel region according to the threshold F_t;
    step 43: when the output 2D rectangular boxes overlap, setting two thresholds F_t^1 and F_t^2 and using them to divide the output 2D rectangular box into a target pixel region, a foreground pixel region and a background pixel region;
    step 44: finally, combining the target pixel values P_rgb(x, y) obtained in the RGB image with the GrabCut algorithm and the threshold-segmented pixel values P_d(x, y) of the depth image to output the target pixels (the combination formulas are given as equation images in the original filing).
  2. The RGB-D multi-modal feature fusion 3D target detection method as claimed in claim 1, wherein step 1 further comprises the following step:
    2D-annotating the targets in the Target data set and storing the annotated target category information and target position information in a text file.
  3. The RGB-D multi-modal feature fusion 3D target detection method as claimed in claim 1, wherein the GrabCut algorithm is required in step 3.
  4. The RGB-D multi-modal feature fusion 3D target detection method as claimed in claim 1, wherein the RGB features of the target in step 6 are extracted by an edge extraction algorithm and a corner detection algorithm, the depth image features are the point cloud density distribution features of the target view cone point cloud, and the RGB-D multi-modal feature fusion fuses the RGB features with the point cloud density distribution features.
  5. The RGB-D multi-modal feature fusion 3D target detection method as claimed in claim 1, wherein step 6 further comprises the following steps:
    step 61: based on the camera calibration principle, fusing the RGB feature information of the target into the 3D point cloud through coordinate transformation to obtain the feature 3D point cloud P_f with RGB features; in the corresponding transformation, K_c^-1 denotes the inverse of the intrinsic matrix of the RGB camera, and the inverse of the extrinsic matrix of the RGB camera, the inverse of the extrinsic matrix of the depth camera and the feature pixels in the RGB image are also used (the transformation formula is given as an equation image in the original filing);
    step 62: calculating the point cloud density distribution feature, where N denotes the number of points within the density radius around the center point P_fi, (x_i, y_i, z_i) are the coordinates of a feature point P_fi, (x_j, y_j, z_j) with j ∈ [1, N] are the coordinates of the points within the density radius around P_fi, and (r_x, r_y, r_z) are the density radius parameters of the point cloud (the density distribution function is given as an equation image in the original filing);
    step 63: selecting the point cloud P_fi with the maximum density distribution value D_i as the center of the point cloud density cluster set, clustering with (r_x, r_y, r_z) as the radius of the point cloud density cluster set, marking the point clouds that are not in the point cloud density cluster set as outliers, and assembling the points P_fi with RGB features inside the point cloud density cluster set into the simplified target view cone point cloud.
  6. The RGB-D multi-modal feature fusion 3D target detection method as claimed in any one of claims 1-5, wherein, after an RGB-D image is input, the method obtains not only the semantic category information of the targets in the target image but also the 3D spatial position information of the targets.
CN202110545313.5A 2021-05-19 2021-05-19 RGB-D multi-modal feature fusion 3D target detection method Active CN113408584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110545313.5A CN113408584B (en) 2021-05-19 2021-05-19 RGB-D multi-modal feature fusion 3D target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110545313.5A CN113408584B (en) 2021-05-19 2021-05-19 RGB-D multi-modal feature fusion 3D target detection method

Publications (2)

Publication Number Publication Date
CN113408584A CN113408584A (en) 2021-09-17
CN113408584B true CN113408584B (en) 2022-07-26

Family

ID=77678851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110545313.5A Active CN113408584B (en) 2021-05-19 2021-05-19 RGB-D multi-modal feature fusion 3D target detection method

Country Status (1)

Country Link
CN (1) CN113408584B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113379793B (en) * 2021-05-19 2022-08-12 成都理工大学 On-line multi-target tracking method based on twin network structure and attention mechanism
CN113963044B (en) * 2021-09-30 2024-04-30 北京工业大学 Cargo box intelligent loading method and system based on RGBD camera
CN114170521B (en) * 2022-02-11 2022-06-17 杭州蓝芯科技有限公司 Forklift pallet butt joint identification positioning method
CN115578461B (en) * 2022-11-14 2023-03-10 之江实验室 Object attitude estimation method and device based on bidirectional RGB-D feature fusion
CN116580056B (en) * 2023-05-05 2023-11-17 武汉理工大学 Ship detection and tracking method and device, electronic equipment and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689008A (en) * 2019-09-17 2020-01-14 大连理工大学 Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
AU2020101011A4 (en) * 2019-06-26 2020-07-23 Zhejiang University Method for identifying concrete cracks based on yolov3 deep learning model
CN110472534A (en) * 2019-07-31 2019-11-19 厦门理工学院 3D object detection method, device, equipment and storage medium based on RGB-D data
CN111612728A (en) * 2020-05-25 2020-09-01 北京交通大学 3D point cloud densification method and device based on binocular RGB image
CN111723721A (en) * 2020-06-15 2020-09-29 中国传媒大学 Three-dimensional target detection method, system and device based on RGB-D
CN112651406A (en) * 2020-12-18 2021-04-13 浙江大学 Depth perception and multi-mode automatic fusion RGB-D significance target detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Indoor Scene Parsing Algorithms Based on RGB-D Multi-modal Images; Hang Lingxiao; China Master's Theses Full-text Database (Information Science and Technology); 2020-08-15; I138-686 *

Also Published As

Publication number Publication date
CN113408584A (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN113408584B (en) RGB-D multi-modal feature fusion 3D target detection method
Yang et al. Visual perception enabled industry intelligence: state of the art, challenges and prospects
Cui et al. Deep learning for image and point cloud fusion in autonomous driving: A review
Garcia-Garcia et al. A survey on deep learning techniques for image and video semantic segmentation
Sakaridis et al. Semantic foggy scene understanding with synthetic data
Von Stumberg et al. Gn-net: The gauss-newton loss for multi-weather relocalization
Wang et al. Data-driven based tiny-YOLOv3 method for front vehicle detection inducing SPP-net
Zhou et al. Self‐supervised learning to visually detect terrain surfaces for autonomous robots operating in forested terrain
CN110852182B (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
Alidoost et al. A CNN-based approach for automatic building detection and recognition of roof types using a single aerial image
Geng et al. Using deep learning in infrared images to enable human gesture recognition for autonomous vehicles
Wang et al. An overview of 3d object detection
CN107527054B (en) Automatic foreground extraction method based on multi-view fusion
CN105574545B (en) The semantic cutting method of street environment image various visual angles and device
CN116052222A (en) Cattle face recognition method for naturally collecting cattle face image
Xu et al. Segment as points for efficient and effective online multi-object tracking and segmentation
Gu et al. Embedded and real-time vehicle detection system for challenging on-road scenes
Tsutsui et al. Distantly supervised road segmentation
Sun et al. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
CN111523494A (en) Human body image detection method
Wael A comprehensive vehicle-detection-and-tracking technique for autonomous driving
Shuai et al. An improved YOLOv5-based method for multi-species tea shoot detection and picking point location in complex backgrounds
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
Zhang et al. Multi-FEAT: Multi-feature edge alignment for targetless camera-LiDAR calibration
CN117115555A (en) Semi-supervised three-dimensional target detection method based on noise data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant