CN113408584A - RGB-D multi-modal feature fusion 3D target detection method - Google Patents

RGB-D multi-modal feature fusion 3D target detection method

Info

Publication number
CN113408584A
Authority
CN
China
Prior art keywords
target
rgb
point cloud
detection
feature fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110545313.5A
Other languages
Chinese (zh)
Other versions
CN113408584B (en)
Inventor
陈光柱
侯睿
韩银贺
唐在作
茹青君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Technology
Original Assignee
Chengdu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Technology filed Critical Chengdu University of Technology
Priority to CN202110545313.5A priority Critical patent/CN113408584B/en
Publication of CN113408584A publication Critical patent/CN113408584A/en
Application granted granted Critical
Publication of CN113408584B publication Critical patent/CN113408584B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 18/253 Pattern recognition; fusion techniques of extracted features
    • G06F 18/214 Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/08 Neural networks; learning methods
    • G06T 7/11 Image analysis; region-based segmentation
    • G06T 7/136 Image analysis; segmentation or edge detection involving thresholding
    • G06T 7/194 Image analysis; segmentation involving foreground-background segmentation
    • G06T 7/73 Image analysis; determining position or orientation of objects or cameras using feature-based methods
    • G06T 2207/10024 Image acquisition modality; color image
    • G06T 2207/10028 Image acquisition modality; range image; depth image; 3D point clouds
    • G06T 2207/20081 Special algorithmic details; training; learning
    • G06T 2207/20084 Special algorithmic details; artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an RGB-D multi-modal feature fusion 3D target detection method. 3D target detection can obtain both the semantic information and the spatial size information of a target, which is of great significance for realizing intelligent 3D target detection. Specifically, the method comprises the following steps: firstly, an improved YOLOv3 target detection network model is used to obtain a 2D prior region, an RGB-D target saliency detection algorithm is proposed to extract the target pixels, and the target view cone point cloud is obtained through view cone projection; secondly, in order to remove outliers and reduce the number of points in the target view cone point cloud, a multi-modal feature fusion strategy is proposed to simplify the target view cone point cloud, and this strategy replaces the process of inferring the 3D target with a deep neural network; finally, the 3D bounding box of the target point cloud is generated with an axis-aligned bounding box (AABB) algorithm, and the pose coordinates of the target point cloud are calculated with a PCA algorithm. The invention has the following beneficial effects: the RGB-D multi-modal feature fusion 3D target detection method improves the detection precision of multi-scale targets in application scenes with only a small amount of 2D labeled data and no 3D labeled data, and offers good real-time performance and high precision.

Description

RGB-D multi-modal feature fusion 3D target detection method
Technical Field
The invention relates to the fields of computer vision, image recognition and target detection, in particular to an RGB-D multi-modal feature fusion 3D target detection method.
Background
Target detection, as an important branch of machine vision, involves the intersection of multiple disciplines and fields and is the basis of higher-level tasks such as target tracking, behavior recognition and scene flow estimation, which can only be realized once target detection itself is mature and well developed. Target detection identifies and locates target objects in a scene, such as automobiles, pedestrians and roads: target recognition distinguishes the objects of interest in the scene, obtains the category of each target object and gives its classification probability; target localization calibrates the position of an object of interest in the scene, generally framing its boundary with a rectangular box or a cuboid box. Target detection currently has huge application prospects and appears in fields such as face recognition, intelligent monitoring, intelligent workshops, intelligent transportation and unmanned driving. 3D target detection technology can obtain not only the semantic information of a target but also its spatial size information, and therefore has both research value and application prospects.
At present, traditional image processing methods often require a complex feature extractor designed for a specific detection target, and such algorithms generalize poorly, so it is difficult for traditional methods to achieve intelligent target detection. With the continuous development of artificial intelligence and computer vision technology, image recognition based on neural networks and deep learning has shown excellent performance, and 2D object detection algorithms have developed rapidly. Compared with traditional target detection methods, 2D target detection can efficiently detect many categories of targets, with high detection precision, strong generalization ability and good robustness. However, 2D object detection cannot acquire the actual 3D spatial information of an object (spatial pose coordinates, 3D size, etc.). 3D target detection, by contrast, expresses the actual spatial position of the detected target more accurately, helps to identify and locate targets precisely, and better guarantees the safety of interactive operations with those targets.
In recent years, as the accuracy of depth sensors such as 3D lidar and RGB-D cameras has improved, 3D target detection technology has developed in a breakthrough manner. As an important task in scene understanding, 3D target detection classifies objects of interest in 3D data and locates their 3D bounding boxes. While acquiring the semantic information of the object of interest, 3D target detection can also locate the target's 3D bounding box more accurately than 2D target detection, so 3D object detection techniques are more valuable than 2D object detection techniques. At present, 3D target detection is researched and applied mostly in the outdoor autonomous-driving field, while indoor 3D target detection research mainly focuses on 3D detection of everyday-scene targets, positioning of robotic arms, and workpiece grasping. These methods all rely on a large number of labeled data sets for a specific scene, which hinders their popularization in practical application scenarios.
Most existing 3D target detection methods require the construction of a large-scale 3D labeled data set, and constructing such a data set is difficult, so these methods struggle to realize 3D target detection under practical application requirements. Therefore, the RGB-D multi-modal feature fusion 3D target detection method proposed here, which can be effectively applied to 3D target detection in practical application scenes, has great research significance.
Disclosure of Invention
The main purpose of the invention is to provide an RGB-D multi-modal feature fusion 3D target detection method. By improving the YOLOv3 network model, the method raises the detection precision of multi-scale targets in the detection scene; at the same time, it realizes efficient 3D target detection using only a small amount of 2D labeled data and without relying on any 3D labeled data.
The invention is realized by adopting the following technical scheme: an RGB-D multimodal feature fusion 3D object detection method (hereinafter abbreviated MMFF-3D object detection method), comprising the steps of:
Step 1: preliminarily establish a target data set of the detection scene: collect pictures through a web crawler, take pictures in an actual workshop, and carry out 2D labeling of the data set.
Step 2: based on the multi-scale prediction characteristics of the YOLOv3 target detection framework, the convolutional backbone network DarkNet53 is further improved into the MD56 backbone network, and the resulting MD56-YOLOv3 target detection framework improves the 2D detection precision of the target; the MD56-YOLOv3 network is trained on the data set established in step 1.
Step 3: on the basis of the 2D rectangular region obtained in step 2, an RGB-D target saliency detection algorithm is constructed to obtain the pixel region of the target.
Step 4: on the basis of the target pixel region obtained in step 3, the target view cone point cloud is generated by aligning the depth image with the RGB image and applying view cone projection, and an RGB-D multi-modal feature fusion strategy is proposed to extract the simplified target view cone point cloud, replacing the deep-neural-network-based 3D target inference process. Finally, the 3D bounding box of the target point cloud is acquired with the axis-aligned bounding box algorithm, and the 3D pose coordinates of the target point cloud are calculated with the PCA algorithm.
The beneficial technical effects of the invention are as follows:
1. the 2D detection precision of multi-scale targets in the scene is effectively improved;
2. when the target is occluded, its pixels can still be effectively segmented and 3D detection can be realized;
3. 3D target detection is realized with only a small amount of 2D labeled data and no 3D labeled data;
4. the MMFF-3D target detection method meets the real-time and precision requirements of 3D target detection.
drawings
FIG. 1 is a MMFF-3D object detection network model framework schematic.
FIG. 2 is a schematic diagram of the MD56-YOLOv3 target detection framework.
FIG. 3 is a schematic diagram of the process of RGB-D target saliency detection algorithm acquiring a target pixel.
FIG. 4 is a schematic diagram of a process for implementing 3D target detection by an RGB-D multimodal feature fusion process.
FIG. 5 is a diagram of the target detection effect of an MMFF-3D target detection network model framework in an intelligent workshop application scenario.
Detailed description of the preferred embodiments
To facilitate understanding of the present invention, some background on object detection is first introduced. Object detection is one of the most basic and challenging problems in computer vision and has received much attention in related research. Its image recognition aspect identifies the category of an object of interest in a digital image and locates that object's position in the image; at the same time, object detection serves as basic research for visual processing tasks such as instance segmentation and target tracking. Object detection is a hot research direction in the field of image processing; in the stage when it was realized with traditional computer vision techniques, it mainly relied on designing complicated hand-crafted feature extractors. Compared with traditional object detection algorithms built on hand-crafted feature extractors, object detection algorithms based on deep neural networks have simpler structural designs, extract features automatically, and achieve high detection precision and good robustness. Therefore, the main research direction in the object detection field is currently based on deep learning and neural network technology. Moreover, while obtaining the semantic information of the object of interest, 3D object detection can locate the target's 3D bounding box more accurately than 2D object detection, so 3D object detection techniques are more valuable than 2D object detection techniques.
The following describes an embodiment of the present invention in detail with reference to the accompanying drawings, taking a workshop scenario as the concrete implementation and application.
FIG. 1 is a schematic diagram of the MMFF-3D object detection network model framework according to the present invention. With reference to the figure, the general implementation is as follows: establish a workshop-scene target data set; improve YOLOv3 to raise the detection efficiency for multi-scale workshop-scene targets; combine 2D target detection with RGB-D fused target saliency detection to generate the 3D target view cone point cloud; project the RGB-image target feature information into three-dimensional space and fuse it with the 3D point cloud density distribution information to obtain the simplified 3D target view cone point cloud; and generate the 3D target bounding box of the workshop scene with the AABB algorithm. Specifically, the method comprises the following steps. Firstly, based on the multi-scale prediction characteristics of the YOLOv3 target detection framework, the convolutional backbone network DarkNet53 is improved into the MD56 backbone network to raise the 2D detection accuracy for multi-scale workshop-scene targets. Then, for the 2D rectangular region obtained by 2D target detection, an RGB-D fused target saliency detection algorithm is constructed to obtain the target's pixels, which are used to generate the target view cone point cloud. Next, the target view cone point cloud is generated by aligning the depth image with the RGB image and applying view cone projection, and a simplified target view cone point cloud is extracted with the RGB-D multi-modal feature fusion strategy; this process replaces the deep-neural-network-based 3D target inference process. Finally, the 3D bounding box of the target point cloud is acquired with the axis-aligned bounding box algorithm, and the 3D pose coordinates of the target point cloud are calculated with the PCA algorithm.
Step 1: establish the workshop-scene target data set. 2000 RGB images containing the main workshop-scene targets are collected through a web crawler; this part of the data is used as the similar-domain data set (Similar Domains data set) in training. 3000 images of the digitized workshop scene actually containing the main targets (1500 RGB images and the corresponding 1500 depth images) are taken with a RealSense D435 depth camera; this part of the data is used as the Target data set in training.
Step 2: DarkNet53 is modified to improve the detection efficiency for multi-scale workshop-scene targets (see also FIG. 2).
Step 21: firstly, one feature extraction layer (kernel size 3 × 3, stride 1, padding 1) is added to each of the 3 scale feature extraction branches to enlarge the receptive fields of the prediction targets of the different scales y1, y2 and y3.
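As one way to realize this modification, the sketch below (PyTorch) inserts an extra 3 × 3 convolution (stride 1, padding 1) in front of each of the three scale branches; the class name, the channel widths and the BatchNorm/LeakyReLU pairing are illustrative assumptions rather than the exact MD56 definition.

    import torch.nn as nn

    class ExtraFeatureLayer(nn.Module):
        """One 3x3 conv (stride 1, padding 1) placed before a YOLO prediction head
        to enlarge the receptive field of that scale branch."""
        def __init__(self, channels):
            super().__init__()
            self.block = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.LeakyReLU(0.1, inplace=True),
            )

        def forward(self, x):
            return self.block(x)

    # One extra layer per scale branch (y1, y2, y3); channel widths assumed from YOLOv3.
    extra_layers = nn.ModuleList([ExtraFeatureLayer(c) for c in (1024, 512, 256)])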
Step 22: then, the input size of the network is adjusted from 416 to 448 so that more feature information can be extracted and the detection accuracy of the network improved.
Step 23: finally, for the whole network (as shown in FIG. 2(b)), the prediction results of the network need to be decoded into coordinate boxes and mapped to real values in image coordinates through coordinate-box decoding. Because the network parameters are randomly initialized, the initial coordinate boxes may exceed the actual coordinate boundaries. To limit the location range of each box, the sigmoid function is used to restrict the offset relative to the upper-left coordinate point (C_x, C_y) of the grid cell to the range 0-1, so that the position of each prediction box always stays inside its own grid cell. The width and height of the prediction box are decoded by multiplying the prior box by the corresponding sampling rate to scale it back to the actual image size, as shown in the following equations:
Y = (sigmoid(t_y) + C_y) * stride
X = (sigmoid(t_x) + C_x) * stride
W = (P_w * e^(t_w)) * stride
H = (P_h * e^(t_h)) * stride
where t_x, t_y, t_w and t_h are the predicted results, (C_x, C_y) are the coordinates of the upper-left corner of the grid cell in which the prediction box is located, P_w and P_h are the width and height of the prior box relative to the current grid size, and stride is the down-sampling rate of the corresponding scale with respect to the input image (i.e., the sampling rate).
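For illustration, the decoding equations above can be sketched in NumPy as below; the per-box calling convention and the example anchor values are assumptions for demonstration rather than part of the patented network.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
        """Map raw predictions (tx, ty, tw, th) of one grid cell back to image coordinates.

        cx, cy : upper-left corner of the grid cell containing the prediction box
        pw, ph : prior (anchor) box width/height relative to the current grid
        stride : down-sampling rate of this scale branch (e.g. 8, 16 or 32 for a 448 input)
        """
        x = (sigmoid(tx) + cx) * stride   # box centre x, kept inside the cell by the sigmoid
        y = (sigmoid(ty) + cy) * stride   # box centre y
        w = pw * np.exp(tw) * stride      # box width scaled back to the input image
        h = ph * np.exp(th) * stride      # box height
        return x, y, w, h

    # Example: a prediction in grid cell (7, 5) of the stride-32 branch.
    print(decode_box(0.2, -0.1, 0.3, 0.1, cx=7, cy=5, pw=3.6, ph=2.4, stride=32))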
Step 3: the RGB-D target saliency detection method further fuses depth-image threshold segmentation to segment the targets in the image (see also FIG. 3). The fused depth-image threshold segmentation algorithm proceeds as follows. When the rectangular target detection regions do not overlap, each region is divided into target and background according to a threshold F_t. When the rectangular regions of two targets overlap, the average depth values of the depth image inside the two rectangular regions are calculated (as indicated in FIG. 3). The target detection rectangle with the smaller average depth value is divided into target and background according to the threshold F_t. The rectangle with the larger average depth value is divided into target, foreground and background according to two thresholds, denoted F_t1 and F_t2. The thresholds F_t, F_t1 and F_t2 are obtained with an adaptive threshold calculation method. When the rectangular regions do not overlap, or when they overlap but the average depth value of a region is small (as indicated in FIG. 3), the number of thresholds is 1 and the threshold is F_t; the RGB target pixels P_rgb(x, y) are then further calculated in combination with the pixels P_d(x, y) obtained by thresholding the depth image. When the average depth value of a rectangular region is larger (as indicated in FIG. 3), the number of thresholds is 2 and the thresholds are F_t1 and F_t2; the RGB target pixel values P_rgb(x, y) are again further calculated in combination with the depth-thresholded pixel values P_d(x, y). The combination is given by the corresponding formulas (shown as images).
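One possible realization of this fused segmentation is sketched below in Python (OpenCV/NumPy). It is only an assumed reading of the description: Otsu's method stands in for the unspecified adaptive threshold calculation, the way the two thresholds F_t1 and F_t2 are derived is an assumption, the fusion of the RGB (GrabCut) mask with the depth mask is taken to be an intersection, and all function names are hypothetical.

    import cv2
    import numpy as np

    def otsu_threshold(values):
        """Otsu threshold on raw depth values (normalised to 8 bit internally)."""
        v = values.astype(np.float32)
        lo, hi = float(v.min()), float(v.max())
        if hi <= lo:
            return hi
        scaled = np.round((v - lo) / (hi - lo) * 255).astype(np.uint8).reshape(1, -1)
        t, _ = cv2.threshold(scaled, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return lo + t / 255.0 * (hi - lo)

    def depth_target_mask(depth_roi, overlap_large_depth=False):
        """Threshold the depth image inside one 2D detection rectangle.

        overlap_large_depth=False -> single threshold F_t (target / background)
        overlap_large_depth=True  -> two thresholds F_t1, F_t2 (target / foreground / background)
        """
        d = depth_roi.astype(np.float32)
        valid = d[d > 0]
        if valid.size == 0:
            return np.zeros(d.shape, dtype=bool)
        if not overlap_large_depth:
            ft = otsu_threshold(valid)                        # F_t
            return (d > 0) & (d <= ft)
        ft1 = otsu_threshold(valid)                           # F_t1: splits the near foreground off
        above = valid[valid > ft1]
        ft2 = otsu_threshold(above) if above.size else ft1    # F_t2: splits target from background
        return (d > ft1) & (d <= ft2)                         # keep the band interpreted as the target

    def fuse_rgb_depth_masks(rgb_mask, depth_mask):
        # Assumed fusion rule: a pixel belongs to the target only if both the RGB
        # segmentation mask and the depth segmentation mask select it.
        return rgb_mask & depth_mask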
Step 4: the simplified target view cone point cloud is extracted with the RGB-D multi-modal feature fusion strategy (see FIG. 4).
Step 41: firstly, the target feature pixels p_rgb_f in the pixel coordinate system of the RGB image are acquired with the Canny edge extraction algorithm and the Harris corner detection algorithm. Through the transformation between the pixel coordinate system and the image coordinate system of the RGB camera, and between the image coordinate system and the camera coordinate system, the target feature points P_rgb_f in the camera coordinate system of the RGB camera are obtained (the transformation equations are shown as images), where p'_rgb_f denotes a feature pixel belonging to the target in the image coordinate system of the RGB camera, P_rgb_f denotes a feature point belonging to the target in the camera coordinate system of the RGB camera, T_w2c is the extrinsic matrix of the RGB camera, and K_c is the intrinsic matrix of the RGB camera.
Step 42: secondly, according to the mapping relation between the RGB camera and the depth camera, the target feature points P_d_f in the camera coordinate system of the depth camera corresponding to P_rgb_f are obtained, as shown in the following equations:
P_rgb_f = T_d2rgb * P_d_f
T_d2rgb = T_w2c * T_w2d
where T_d2rgb is the mapping matrix from the camera coordinate system of the depth camera to the camera coordinate system of the RGB camera.
Step 43: finally, through the transformation between the pixel coordinate system and the image coordinate system of the depth camera, and between the image coordinate system and the camera coordinate system, the target feature points P_d_f in the camera coordinate system of the depth camera are obtained (the transformation equations are shown as images), where p'_d_f denotes a target feature pixel in the image coordinate system of the depth camera, P_d_f denotes a target feature point in the camera coordinate system of the depth camera, K_d^(-1) is the inverse of the intrinsic matrix of the depth camera, and T_w2d^(-1) is the inverse of the extrinsic matrix of the depth camera.
Integrating the above formulas yields the simplified three-dimensional target feature points P_f, as shown in the corresponding combined formula (also provided as an image), which involves K_c^(-1), T_w2c^(-1) and T_w2d^(-1).
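Because the exact transformation chain of steps 41-43 is only available here as formula images, the sketch below uses standard pinhole-camera back-projection as an assumed, conventional formulation: target feature pixels plus an aligned depth image yield 3D feature points, and an optional 4 x 4 extrinsic transform moves them into the desired camera frame. The function and parameter names are illustrative, not the patented equations.

    import numpy as np

    def feature_pixels_to_3d(pixels, depth_aligned, K, T=np.eye(4)):
        """Back-project target feature pixels to 3D feature points P_f.

        pixels        : (N, 2) array of (u, v) feature pixels from Canny + Harris
        depth_aligned : depth image aligned to the same view as the pixels, in metres
        K             : 3x3 intrinsic matrix of that view
        T             : optional 4x4 extrinsic transform into the desired target frame
        """
        fx, fy = K[0, 0], K[1, 1]
        cx, cy = K[0, 2], K[1, 2]
        pts = []
        for u, v in pixels:
            z = float(depth_aligned[int(v), int(u)])
            if z <= 0:                        # no valid depth measurement at this pixel
                continue
            pts.append([(u - cx) * z / fx,    # pixel -> camera coordinates (pinhole model)
                        (v - cy) * z / fy,
                        z, 1.0])
        if not pts:
            return np.empty((0, 3))
        pts = np.asarray(pts)
        return (T @ pts.T).T[:, :3]           # optional transform into the target frame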
step 44: background outlier point clouds are removed, and due to the fact that a depth camera is affected by external environment factors such as illumination, pixel points of a depth image are distorted, outliers exist in generated target view cone point clouds and easily appear in a boundary area of the target view cone point clouds. Therefore, the density of outliers in the target view cone point cloud is generally low, and the outliers are screened out by adopting a density clustering-based method. For the simplified 3D target characteristic point PfiPerforming point cloud density distribution calculation, density distribution function DiAs shown in the following equation:
Figure BDA0003073301300000067
in the formula, xi,yi,ziIs a certain target feature point PfiThe coordinates of (a). r isx,ry,rzIs a radius parameter of the point cloud density.
Finally, the point P_fi with the maximum density distribution value D_i is selected as the center of the point cloud density cluster, and points that are not in the clustered point cloud set are marked as outliers. The points P_fi with RGB features inside the cluster set are assembled into the simplified target view cone point cloud P. The RGB-D multi-modal feature fusion strategy fuses the RGB features and the density distribution features of the target: the RGB feature information guarantees that P consists of the feature point cloud P_rgb_f, while the density distribution information guarantees that P consists of points P_fi meeting the density requirement. Therefore, obtaining the simplified target view cone point cloud P screens out the outliers while simplifying the point cloud.
Step 5: the simplified target view cone point cloud P is obtained through the multi-modal feature fusion strategy, and the target point cloud is further processed to obtain the spatial size and pose information of the target. To obtain the 3D bounding box and pose information of the target point cloud, the 3D bounding box of the simplified target view cone point cloud P is generated with the Axis-Aligned Bounding Box (AABB) algorithm. Meanwhile, the orientation of the point cloud is estimated with principal component analysis, and the three principal eigenvectors obtained from the calculation are used as the pose coordinates of the target point cloud (see FIG. 4).
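A minimal sketch of this step follows (NumPy): the AABB is simply the per-axis minimum and maximum of the simplified point cloud P, and the pose axes are the three principal eigenvectors of its covariance matrix, as described above; the function name and return convention are assumptions.

    import numpy as np

    def aabb_and_pose(points):
        """Return the AABB corners (min, max) and the 3x3 pose axes from PCA."""
        mins, maxs = points.min(axis=0), points.max(axis=0)          # axis-aligned 3D bounding box
        centred = points - points.mean(axis=0)
        cov = np.cov(centred, rowvar=False)                          # 3x3 covariance of the cloud
        eigvals, eigvecs = np.linalg.eigh(cov)
        axes = eigvecs[:, np.argsort(eigvals)[::-1]]                 # principal axes, largest variance first
        return (mins, maxs), axes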
Step 6: training of the MMFF-3D network model. The framework only trains the MD56-YOLOv3 network model, and a pre-training plus fine-tuning transfer-learning method is adopted for the whole training process. By learning similar features on a similar-domain data set, this method improves the network model's ability to learn target features on the Target data set, which alleviates to some extent the problem of the Target data set being relatively small.
Step 61: in the actual training process, the weights obtained by training on the ImageNet data set are first used as the initial weights of the backbone network, and the Similar Domains data set is used for pre-training to obtain the pre-training weights.
Step 62: then, the workshop target data set is randomly divided into a training set, a validation set and a test set in the ratio 7 : 1 : 2. Training MD56-YOLOv3 on the Target data set is divided into two steps. In the first step, the pre-training weights of the first 184 layers of the backbone network MD56 are frozen, the Adam optimizer is adopted, the learning rate is set to 1e-3, the batch size is set to 20, and the number of iterations is set to 200 epochs. In the second step, the backbone network MD56 is unfrozen, the learning rate is reduced to 1e-4, the batch size is set to 5, and another 300 epochs are trained.
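The two-stage schedule of step 62 could look like the PyTorch sketch below; the names model, the data loaders and loss_fn are placeholders, and freezing by parameter index is only an approximation of freezing the first 184 layers named in the text.

    import torch

    def set_backbone_frozen(model, frozen, n_layers=184):
        """(Un)freeze the first n_layers parameter groups of the MD56 backbone."""
        for i, p in enumerate(model.backbone.parameters()):
            if i < n_layers:
                p.requires_grad = not frozen

    def train_stage(model, loader, loss_fn, lr, epochs, device="cuda"):
        model.to(device).train()
        opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=lr)
        for _ in range(epochs):
            for images, targets in loader:
                opt.zero_grad()
                loss = loss_fn(model(images.to(device)), targets)
                loss.backward()
                opt.step()

    # Stage 1: frozen backbone, learning rate 1e-3, batch size 20, 200 epochs.
    # set_backbone_frozen(model, frozen=True);  train_stage(model, loader_bs20, loss_fn, 1e-3, 200)
    # Stage 2: unfrozen backbone, learning rate 1e-4, batch size 5, 300 more epochs.
    # set_backbone_frozen(model, frozen=False); train_stage(model, loader_bs5,  loss_fn, 1e-4, 300)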
Step 7: the trained MMFF-3D target detection network model is used for testing, and part of the test results are shown in FIG. 5.
Experimental evaluation and verification show that the MMFF-3D target detection network framework achieves a good 2D target detection effect in the workshop, and the 3D target detection also achieves good detection precision, as shown in Tables 1 and 2.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
TABLE 1 Comparison of the detection results of the improved MD56-YOLOv3 and YOLOv3 (table provided as an image)
TABLE 2 Comparison of the detection results of MMFF-3D target detection with different backbone networks (table provided as an image)

Claims (8)

  1. The RGB-D multi-modal feature fusion 3D target detection method is characterized in that a deep learning technology and a 3D point cloud processing technology are combined and effectively applied to image data sets with only a small amount of 2D labels and no 3D labels, and the whole process comprises the following steps:
    Step 1: establishing a data set of the detection target, and collecting RGB images containing the main detection targets through a web crawler, this part of the data being used as the similar-domain data set in training; shooting data actually containing the main target images (RGB images and depth images) with a depth camera, this part of the data being used as the Target data set in training, the Target data set being divided into a training set and a test set;
    Step 2: improving the backbone network DarkNet53 of the YOLOv3 target detection network to obtain the MD56-YOLOv3 target detection network;
    Step 3: pre-training the MD56-YOLOv3 target detection network with the similar-domain data set of step 1, and then training the MD56-YOLOv3 target detection network by transfer learning with the training set of the Target data set of step 1;
    Step 4: proposing an RGB-D target saliency detection algorithm to segment the pixel region of the target inside the 2D rectangular box output by the MD56-YOLOv3 target detection network;
    Step 5: mapping and aligning the target pixel region segmented by the RGB-D target saliency detection algorithm to the depth image of the target, and generating the target view cone point cloud through view cone projection;
    Step 6: proposing an RGB-D multi-modal feature fusion strategy to simplify the target view cone point cloud obtained in step 5 and obtain the simplified target view cone point cloud;
    Step 7: generating the bounding box of the 3D target with the AABB algorithm and the PCA algorithm;
    Step 8: integrating all the algorithms involved in steps 1-7 into the RGB-D multi-modal feature fusion 3D target detection method, and testing with the test set of the Target data set collected in step 1.
  2. The RGB-D multi-modal feature fusion 3D object detection method as claimed in claim 1, wherein step 1 further includes the following step:
    2D-labeling the targets in the Target data set, and storing the labeled target category information and target position information in a text file.
  3. The RGB-D multi-modal feature fusion 3D object detection method as claimed in claim 1, wherein step 2 further includes the following steps:
    Step 21: adjusting the input target image size of the MD56-YOLOv3 target detection network from 416 to 448 so as to extract more feature information and improve the detection accuracy of the network;
    Step 22: adding one feature extraction layer (kernel size 3 × 3, stride 1, padding 1) at each of the 3 scale feature extraction branches of the YOLOv3 target detection network, so as to enlarge the receptive fields of the prediction targets of the different output scales y1, y2 and y3;
    Step 23: when the MD56-YOLOv3 target detection network outputs features, adjusting the corresponding feature dimensions, decoding the prediction results of the network into coordinate boxes, and mapping them to the real values of the target image coordinates.
  4. The RGB-D multi-modal feature fusion 3D object detection method of claim 1, wherein the GrabCut algorithm is required in step 3.
  5. The RGB-D multi-modal feature fusion 3D object detection method of claim 4, wherein the RGB-D object saliency detection method further comprises the following steps:
    Step 31: acquiring the target pixel region in the RGB image with the GrabCut algorithm, further acquiring the target pixel region in the depth image in combination with a threshold segmentation algorithm, and segmenting out the pixel region of the target inside the 2D rectangular box output by the MD56-YOLOv3 target detection network;
    Step 32: when the output 2D rectangular boxes do not overlap, dividing the image into a target pixel region and a background pixel region according to the threshold F_t;
    Step 33: when the output 2D rectangular boxes overlap, setting the thresholds to F_t1 and F_t2, and using F_t1 and F_t2 to divide the output 2D rectangular box into a target pixel region, a foreground pixel region and a background pixel region;
    Step 34: finally, combining the target pixel values P_rgb(x, y) obtained from the RGB image with the GrabCut algorithm and the pixel values P_d(x, y) obtained by threshold segmentation of the depth image, the combination being expressed by the corresponding formulas (shown as images).
  6. The RGB-D multi-modal feature fusion 3D object detection method of claim 1, wherein the RGB features of the target in step 6 are extracted by an edge extraction algorithm and a corner detection algorithm, the depth image features are the point cloud density distribution features of the target view cone point cloud, and the RGB-D multi-modal feature fusion fuses the RGB features with the point cloud density distribution features.
  7. The RGB-D multi-modal feature fusion 3D object detection method of claim 1, wherein step 6 further comprises the following steps:
    Step 61: based on the camera calibration principle, fusing the RGB feature information of the target into the 3D point cloud through coordinate transformation to obtain the feature 3D point cloud P_f with RGB features, as shown in the corresponding formula (provided as an image), in which K_c^(-1) is the inverse of the intrinsic matrix of the RGB camera, T_w2c^(-1) is the inverse of the extrinsic matrix of the RGB camera, T_w2d^(-1) is the inverse of the extrinsic matrix of the depth camera, and p'_rgb_f is a feature pixel in the RGB image;
    Step 62: calculating the density distribution features of the point cloud with a density distribution function (provided as an image) defined over the coordinates (x_i, y_i, z_i) of each feature point P_fi and the density radius parameters r_x, r_y and r_z;
    Step 63: selecting the point P_fi with the maximum density distribution value D_i as the center of the point cloud density cluster set, clustering with r_x, r_y, r_z as the radii of the cluster set, marking the points that are not in the cluster set as outliers, and assembling the points P_fi with RGB features inside the cluster set into the simplified target view cone point cloud.
  8. The RGB-D multi-modal feature fusion 3D object detection method as claimed in any one of claims 1-7, wherein, after an RGB-D image is input, the method can obtain not only the semantic category information of the target in the image but also the 3D spatial position information of the target.
CN202110545313.5A 2021-05-19 2021-05-19 RGB-D multi-modal feature fusion 3D target detection method Active CN113408584B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110545313.5A CN113408584B (en) 2021-05-19 2021-05-19 RGB-D multi-modal feature fusion 3D target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110545313.5A CN113408584B (en) 2021-05-19 2021-05-19 RGB-D multi-modal feature fusion 3D target detection method

Publications (2)

Publication Number Publication Date
CN113408584A true CN113408584A (en) 2021-09-17
CN113408584B CN113408584B (en) 2022-07-26

Family

ID=77678851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110545313.5A Active CN113408584B (en) 2021-05-19 2021-05-19 RGB-D multi-modal feature fusion 3D target detection method

Country Status (1)

Country Link
CN (1) CN113408584B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113379793A (en) * 2021-05-19 2021-09-10 成都理工大学 On-line multi-target tracking method based on twin network structure and attention mechanism
CN113963044A (en) * 2021-09-30 2022-01-21 北京工业大学 RGBD camera-based intelligent loading method and system for cargo box
CN114170521A (en) * 2022-02-11 2022-03-11 杭州蓝芯科技有限公司 Forklift pallet butt joint identification positioning method
CN115578461A (en) * 2022-11-14 2023-01-06 之江实验室 Object attitude estimation method and device based on bidirectional RGB-D feature fusion
CN116580056A (en) * 2023-05-05 2023-08-11 武汉理工大学 Ship detection and tracking method and device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472534A (en) * 2019-07-31 2019-11-19 厦门理工学院 3D object detection method, device, equipment and storage medium based on RGB-D data
CN110689008A (en) * 2019-09-17 2020-01-14 大连理工大学 Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
AU2020101011A4 (en) * 2019-06-26 2020-07-23 Zhejiang University Method for identifying concrete cracks based on yolov3 deep learning model
CN111612728A (en) * 2020-05-25 2020-09-01 北京交通大学 3D point cloud densification method and device based on binocular RGB image
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
CN111723721A (en) * 2020-06-15 2020-09-29 中国传媒大学 Three-dimensional target detection method, system and device based on RGB-D
CN112651406A (en) * 2020-12-18 2021-04-13 浙江大学 Depth perception and multi-mode automatic fusion RGB-D significance target detection method

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020181685A1 (en) * 2019-03-12 2020-09-17 南京邮电大学 Vehicle-mounted video target detection method based on deep learning
AU2020101011A4 (en) * 2019-06-26 2020-07-23 Zhejiang University Method for identifying concrete cracks based on yolov3 deep learning model
CN110472534A (en) * 2019-07-31 2019-11-19 厦门理工学院 3D object detection method, device, equipment and storage medium based on RGB-D data
CN110689008A (en) * 2019-09-17 2020-01-14 大连理工大学 Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
CN111612728A (en) * 2020-05-25 2020-09-01 北京交通大学 3D point cloud densification method and device based on binocular RGB image
CN111723721A (en) * 2020-06-15 2020-09-29 中国传媒大学 Three-dimensional target detection method, system and device based on RGB-D
CN112651406A (en) * 2020-12-18 2021-04-13 浙江大学 Depth perception and multi-mode automatic fusion RGB-D significance target detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HANG LINGXIAO: "Research on indoor scene parsing algorithms based on RGB-D multi-modal images", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113379793A (en) * 2021-05-19 2021-09-10 成都理工大学 On-line multi-target tracking method based on twin network structure and attention mechanism
CN113379793B (en) * 2021-05-19 2022-08-12 成都理工大学 On-line multi-target tracking method based on twin network structure and attention mechanism
CN113963044A (en) * 2021-09-30 2022-01-21 北京工业大学 RGBD camera-based intelligent loading method and system for cargo box
CN113963044B (en) * 2021-09-30 2024-04-30 北京工业大学 Cargo box intelligent loading method and system based on RGBD camera
CN114170521A (en) * 2022-02-11 2022-03-11 杭州蓝芯科技有限公司 Forklift pallet butt joint identification positioning method
CN115578461A (en) * 2022-11-14 2023-01-06 之江实验室 Object attitude estimation method and device based on bidirectional RGB-D feature fusion
CN115578461B (en) * 2022-11-14 2023-03-10 之江实验室 Object attitude estimation method and device based on bidirectional RGB-D feature fusion
CN116580056A (en) * 2023-05-05 2023-08-11 武汉理工大学 Ship detection and tracking method and device, electronic equipment and storage medium
CN116580056B (en) * 2023-05-05 2023-11-17 武汉理工大学 Ship detection and tracking method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113408584B (en) 2022-07-26

Similar Documents

Publication Publication Date Title
Cui et al. Deep learning for image and point cloud fusion in autonomous driving: A review
Yang et al. Visual perception enabled industry intelligence: state of the art, challenges and prospects
CN113408584B (en) RGB-D multi-modal feature fusion 3D target detection method
Garcia-Garcia et al. A survey on deep learning techniques for image and video semantic segmentation
Garcia-Garcia et al. A review on deep learning techniques applied to semantic segmentation
Von Stumberg et al. Gn-net: The gauss-newton loss for multi-weather relocalization
CN110852182B (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
Zhou et al. Self‐supervised learning to visually detect terrain surfaces for autonomous robots operating in forested terrain
Li et al. Adaptive deep convolutional neural networks for scene-specific object detection
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
Geng et al. Using deep learning in infrared images to enable human gesture recognition for autonomous vehicles
CN111461039B (en) Landmark identification method based on multi-scale feature fusion
Wang et al. An overview of 3d object detection
CN105574545B (en) The semantic cutting method of street environment image various visual angles and device
CN110287798B (en) Vector network pedestrian detection method based on feature modularization and context fusion
CN116052222A (en) Cattle face recognition method for naturally collecting cattle face image
Xu et al. Segment as points for efficient and effective online multi-object tracking and segmentation
CN116385958A (en) Edge intelligent detection method for power grid inspection and monitoring
Gu et al. Embedded and real-time vehicle detection system for challenging on-road scenes
Tsutsui et al. Distantly supervised road segmentation
Wael A comprehensive vehicle-detection-and-tracking technique for autonomous driving
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
Zhang et al. Multi-FEAT: Multi-feature edge alignment for targetless camera-LiDAR calibration
An et al. RS-AUG: Improve 3D object detection on LiDAR with realistic simulator based data augmentation
CN117115555A (en) Semi-supervised three-dimensional target detection method based on noise data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant