CN111079685A - 3D target detection method - Google Patents

3D target detection method

Info

Publication number
CN111079685A
Authority
CN
China
Prior art keywords
feature
target
network
feature map
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911354155.4A
Other languages
Chinese (zh)
Other versions
CN111079685B (en)
Inventor
王正宁
吕侠
何庆东
赵德明
张翔
蓝先迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201911354155.4A priority Critical patent/CN111079685B/en
Publication of CN111079685A publication Critical patent/CN111079685A/en
Application granted granted Critical
Publication of CN111079685B publication Critical patent/CN111079685B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/13 Satellite images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a 3D target detection method. Features are first extracted from a point cloud bird's-eye view, the target image to be detected, and a point cloud front view. 3D target suggestion frames are obtained from the bird's-eye view and the target image to be detected, and the feature maps obtained after fusion within each of the three branches (bird's-eye view, target image to be detected and front view) are then fused into a single feature map by pixel-by-pixel addition and averaging to obtain the final feature map fusion result. The 3D target suggestion frames are projected onto the finally fused feature map to form 2D target suggestion frames, the ROI region features corresponding to the 2D target suggestion frames are obtained, and finally the ROI region features are classified and regressed to obtain the final 3D target detection candidate frames. The invention effectively improves the detection and localization performance of the detection network for different targets of interest in 3D space under different environments, and alleviates the poor detection of pedestrians and vehicles caused by point cloud sparsity in road environments.

Description

3D target detection method
Technical Field
The invention belongs to the field of image processing and computer vision, and particularly relates to a 3D target detection method.
Background
With the vigorous development of artificial intelligence technology, intelligent vehicles centered on advanced driver-assistance systems (ADAS) and autonomous driving technology have become the development direction of future automobiles, and 3D target detection, as one of the key technologies, has always been a research hotspot in this field.
For 3D target detection, there are currently three main approaches.
The first is 3D target detection based on monocular RGB images, such as the method proposed by X. Chen et al. (Chen X, Kundu K, Zhang Z, et al. Monocular 3D Object Detection for Autonomous Driving [C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016), which uses object shape priors, context features and instance segmentation of the monocular image to generate 3D object suggestion frames; because only a monocular image is used, accurate depth information is inevitably missing.
The second is 3D target detection based on binocular RGB images, such as the 3DOP detection method also proposed by X. Chen et al. (Chen X, Kundu K, Zhu Y, et al. 3D Object Proposals Using Stereo Imagery for Accurate Object Class Detection [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017: 1-1), which generates 3D object suggestion frames by encoding object prior sizes and depth information (e.g. free space, point cloud density) into an energy function, and then regresses the 3D suggestion frames with the R-CNN method. At present only a few methods use stereo vision for 3D object detection.
The third is 3D target detection based on LiDAR point clouds. Most state-of-the-art 3D target detection methods rely on LiDAR data to provide accurate 3D information, although the way the point cloud is processed differs between detection methods. The F-PointNet method proposed by C. R. Qi et al. (C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum PointNets for 3D Object Detection from RGB-D Data," arXiv preprint arXiv:1711.08488, 2017) takes the raw point cloud as input and then localizes the 3D object based on the 2D detection and the frustum point cloud region predicted by a PointNet network. The Vote3Deep fast detection method proposed by M. Engelcke et al. (M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner, "Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks," Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 1355-1361, IEEE, 2017) quantizes the raw point cloud into a structured voxel grid and then uses 2D or 3D CNNs to detect 3D objects. The MV3D target detection method proposed by X. Chen et al. (X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3D object detection network for autonomous driving," IEEE CVPR, 2017) projects the point cloud into a 2D bird's-eye view and front view and then performs convolution processing with a convolutional neural network (CNN), also combining RGB image data in the process to obtain denser information. The 3D target detection method based on point cloud data proposed in Chinese patent application No. 201811371861.5 likewise projects the point cloud data onto a 2D bird's-eye view, extracts point cloud features through an ASPP network, and generates candidate target positions in 3D space.
Disclosure of Invention
In view of the above problems, the invention provides a 3D target detection method, which is realized by adopting a deep convolutional network and specifically comprises the following steps:
(1) preparing a data set to be processed, wherein the data set comprises an original road vehicle RGB image data set and a corresponding target point cloud data set;
(2) dividing the original road vehicle RGB image data set and the corresponding target point cloud data set in the step (1) into a training set, a verification set and a test set, wherein the verification set and the test set are used for testing the detection performance of the deep convolutional network after the deep convolutional network is trained;
(3) projecting the point cloud data in the training set in the step (2) to a two-dimensional plane to obtain a bird's-eye view;
(4) projecting the point cloud data in the training set in the step (2) to a cylindrical plane to obtain a front view;
(5) performing feature extraction on the bird's-eye view in step (3) by using a feature extraction network, and obtaining feature maps FB1 and FB2 after the convolutional layers, where the size of FB1 is 2 times that of FB2;
(6) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FB2 obtained in step (5) to obtain a feature map FB3;
(7) summing the feature maps FB3 and FB1 obtained in steps (5) and (6) to perform fusion and obtain a feature map FB;
(8) applying a two-dimensional image target detection method to the feature map FB to extract the target candidate positions in the bird's-eye view, thereby generating the candidate positions of the target in the bird's-eye view;
(9) performing feature extraction on the original road vehicle RGB images in the training set of step (2) by using a feature extraction network, and obtaining feature maps FG1 and FG2 after the convolutional layers, where the size of FG1 is 2 times that of FG2;
(10) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FG2 of step (9) to obtain a feature map FG3;
(11) summing the feature maps FG1 and FG3 obtained in steps (9) and (10) to perform fusion and obtain a feature map FG, and generating the candidate positions of targets in the original road vehicle RGB image by the two-dimensional image target detection method;
(12) feeding the positions of the target candidate frames in the bird's-eye view and the original road vehicle RGB image obtained in steps (8) and (11) into a 3D candidate frame generation sub-network, and performing spatial matching and combination to generate a 3D target suggestion frame;
(13) performing feature extraction on the front view in step (4) by using a feature extraction network, and obtaining feature maps FF1 and FF2 after the convolutional layers, where the size of FF1 is 2 times that of FF2;
(14) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FF2 obtained in step (13) to obtain a feature map FF3;
(15) summing the feature maps FF3 and FF1 obtained in steps (13) and (14) to perform fusion and obtain a feature map FF;
(16) fusing the feature maps FB, FG and FF obtained in steps (7), (11) and (15) by the element average method to obtain a feature map F;
(17) projecting the 3D target suggestion frame obtained in step (12) onto the feature map F to form a 2D target suggestion frame, and obtaining the ROI (region of interest) feature map corresponding to the 2D target suggestion frame;
(18) passing the ROI region feature map obtained in step (17) through two fully connected layers and then performing target classification and regression to obtain the final 3D target detection frame, i.e. the final result of the 3D target detection is a 3D frame drawn around the target;
thus the training of the deep convolutional network is completed with the training set, the optimal deep convolutional network training model is then selected with the verification set, and the test set is used for the later performance test of the selected optimal model or for practical application, thereby realizing the 3D target detection method based on data fusion.
In the 2D target detection, an SPP network is used instead of a simple pooling operation. The SPP network is more robust than simple pooling: because it processes feature regions of different aspect ratios and sizes, it improves the scale invariance of the images, converges more easily, and improves the accuracy of the experiments. In addition, the point cloud data (the point cloud bird's-eye view and the point cloud front view) and the RGB image data undergo multi-modal fusion, and the dense texture information of the RGB images is used to compensate for the sparsity of the point cloud data during localization. First, the 2D detection results of the point cloud bird's-eye view and the RGB image are spatially matched to obtain coarse 3D target candidate frames; then the feature information of the point cloud front view is extracted and multi-view feature fusion is performed to compensate for the missing spatial information and further refine the 3D target candidate frames, thereby achieving a good detection effect. With this method, the depth information (point cloud data) of targets in the scene can be acquired more accurately by the LiDAR, so that the spatial position of the targets of interest can be roughly obtained in three-dimensional space; dense texture information provided by the RGB image is then added for multi-modal data fusion. This effectively improves the detection and localization performance of the detection network for different targets of interest in 3D space under different environments, and alleviates the poor detection of pedestrians and vehicles caused by point cloud sparsity in road environments.
Drawings
FIG. 1 is a diagram of the deep convolutional network structure of the present invention
FIG. 2 is a network structure of a two-dimensional image target detection method of the present invention
FIG. 3 is a diagram of the SPP network structure of the present invention
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
The invention first provides a feature extraction method, which is used to extract features from the point cloud bird's-eye view, the point cloud front view and the target image to be detected. The feature maps extracted from each view are then fed into a 3D candidate frame generation sub-network, and 3D candidate suggestion frames are generated by a spatial matching method. Finally, a data fusion scheme is provided: the multi-modal data are fused, and the target frames are classified and regressed. The network structure of the whole invention is shown in Fig. 1.
In the deep convolutional network structure shown in Fig. 1, there are three branches from top to bottom. The input image of the first-row branch is the bird's-eye view obtained by projecting the point cloud data onto a two-dimensional plane; the input of the second-row branch is the unprocessed original road vehicle RGB image; the input of the third-row branch is the front view obtained by projecting the point cloud data onto a cylindrical plane. The inputs of the three branches are all processed by a VGG16 feature extraction network. The VGG16 feature extraction network contains 13 convolutional layers and 4 pooling layers in total, but no fully connected layers, and the 4 pooling layers divide the convolutional layers of the VGG16 feature extraction network into 5 groups: group 1 contains 2 convolutional layers whose parameter representation is Conv3-64; group 2 contains 2 convolutional layers whose parameter representation is Conv3-128; group 3 contains 3 convolutional layers whose parameter representation is Conv3-256; group 4 contains 3 convolutional layers whose parameter representation is Conv3-512; group 5 contains 3 convolutional layers whose parameter representation is Conv3-512. After each group of convolutional layers, max pooling is performed with 1 identical pooling layer, using a 2 × 2 pooling kernel with stride 2. The parameter representation Conv3-64 indicates that the convolution kernel size of the current layer is 3 × 3 and the output has 64 channels; Conv3-128 indicates a 3 × 3 kernel with 128 output channels; Conv3-256 indicates a 3 × 3 kernel with 256 output channels; Conv3-512 indicates a 3 × 3 kernel with 512 output channels. All convolution kernels in the VGG16 feature extraction network are uniformly set to 3 × 3 with stride 1.
Block1 in the first-row branch of the network structure shown in Fig. 1 (identical to Block1 in the second- and third-row branches) is detailed in Table 1 below; Table 1 corresponds to Block1 in Fig. 1. Block1 contains 5 feature maps in total. Each feature map except the first is obtained from the previous one through a group of convolutional layers plus a pooling layer of VGG16 (through groups 2, 3, 4 and 5, respectively); the convolutional layers of VGG16 are grouped with the maxpool layers as boundaries, so that in Table 1 group 1 has 2 convolutional layers, group 2 has 2, group 3 has 3, group 4 has 3, and group 5 also has 3. The first feature map in Block1 is obtained by passing the input image through the 1st group of convolutional layers of VGG16 plus its pooling layer. The feature map obtained after the last group of convolutional layers is first subjected to a 2-fold upsampling deconvolution operation (the deconvolution operations in the second- and third-row branches of the network in Fig. 1 are identical to that in the first-row branch, with the same parameters), where the parameters of the deconvolution layer are kernel_size = 4, padding = 1 and strides = 2, designed in advance according to the upsampling multiple. The result is then feature-fused with the feature map obtained from the conv4_3 layer of the VGG16 network (i.e. the 10th convolutional layer in Table 1; with one group of VGG16 convolutional layers bounded by each maxpool layer, conv4_3 denotes the 3rd convolutional layer of the 4th group), thereby generating the candidate positions of the target in the bird's-eye view. VGG16 is not strictly required here: the feature extraction network need not be a VGG16 network but may be another network K(2-5), where K denotes a network satisfying the requirements and (2-5) denotes the number of convolutional layers; all that is needed is that the last two extracted feature maps have a size relationship of 2 times. "The last two feature maps" means that, if image data is input, N feature maps are generated in total through the whole feature extraction network (CNN), including intermediate-result feature maps and the final output feature map (one intermediate feature map can be extracted after each group of convolutional layers), and the last two feature maps are the (N-1)-th and the N-th ones.
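As a concrete illustration of the 2-fold upsampling deconvolution and the element-wise fusion described above, a minimal PyTorch-style sketch is given below. PyTorch is an assumed framework (the patent does not prescribe an implementation), and the feature map sizes and the 512-channel width are illustrative assumptions; only the deconvolution parameters kernel_size = 4, stride = 2, padding = 1 are taken from the text.

```python
import torch
import torch.nn as nn

# Illustrative feature maps from two adjacent VGG16 stages (assumed sizes):
# the deeper map is half the spatial size of the shallower one it is fused with.
deep_feat = torch.randn(1, 512, 19, 19)      # e.g. output of the last conv group
skip_feat = torch.randn(1, 512, 38, 38)      # e.g. conv4_3 feature map, 2x larger

# 2-fold upsampling by deconvolution with the parameters stated in the text:
# kernel_size=4, stride=2, padding=1 exactly doubles height and width.
deconv = nn.ConvTranspose2d(in_channels=512, out_channels=512,
                            kernel_size=4, stride=2, padding=1)

upsampled = deconv(deep_feat)                # -> (1, 512, 38, 38)

# Fusion by element-wise summation (FB = FB1 + FB3 in the patent's notation).
fused = skip_feat + upsampled
print(fused.shape)                           # torch.Size([1, 512, 38, 38])
```

With these parameters the deconvolution output size is (H - 1) * 2 - 2 + 4 = 2H, which is why the element-wise summation with the 2-times-larger feature map is well defined.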
Similarly, the second-row branch of the network in Fig. 1 generates the candidate positions of targets in the detection image (i.e. the original road vehicle RGB image). The candidate positions generated by the first-row and second-row branches are fed together into a 3D candidate frame generation sub-network, and spatial matching is performed to generate the target suggestion frame in 3D space; this is the function of the 3D Proposal module in Fig. 1.
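The patent does not spell out the spatial matching criterion used by the 3D candidate frame generation sub-network. The sketch below is only one plausible reading, assuming that bird's-eye-view candidates are projected into the image plane (via the LiDAR-camera calibration, which is omitted here) and kept when they overlap an image candidate; the function names and the IoU threshold are hypothetical.

```python
def iou_2d(a, b):
    """Axis-aligned IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_candidates(bev_boxes_in_image, rgb_boxes, iou_thresh=0.5):
    """Keep the indices of bird's-eye-view candidates whose image-plane
    projection overlaps at least one RGB-image candidate (assumed rule)."""
    matched = []
    for i, pb in enumerate(bev_boxes_in_image):
        if any(iou_2d(pb, rb) >= iou_thresh for rb in rgb_boxes):
            matched.append(i)
    return matched
```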
After feature extraction, the third-row branch uses the same feature fusion method as the first two branches. The fused feature maps of the three branches are then taken out, and the total feature map fusion is carried out at the M positions in the network by pixel-by-pixel addition and averaging to obtain the final feature map fusion result. The 3D target suggestion frame generated at the 3D Proposal stage is projected onto the finally fused feature map to form a 2D target suggestion frame (2D Proposal), the ROI region features corresponding to the 2D target suggestion frame are obtained, and finally the ROI region features are passed through two fully connected layers for classification and Bbox regression to obtain the final 3D target detection candidate frame.
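The total fusion and the classification/regression head can be sketched as follows. This again assumes PyTorch; the common 512 x 38 x 38 feature size, the flattened 512 x 7 x 7 ROI feature, the 4096-unit fully connected layers, the number of classes and the 7-parameter 3D box encoding are all illustrative assumptions, since the text states only that the three feature maps are averaged pixel by pixel and that the ROI features pass through two fully connected layers before classification and box regression.

```python
import torch
import torch.nn as nn

# Branch feature maps after their own two-scale fusion (pixel-wise averaging
# requires them to share the same shape; sizes here are assumed).
FB = torch.randn(1, 512, 38, 38)   # bird's-eye-view branch
FG = torch.randn(1, 512, 38, 38)   # RGB-image branch
FF = torch.randn(1, 512, 38, 38)   # front-view branch

# Total feature map fusion by pixel-wise addition and averaging.
F_fused = (FB + FG + FF) / 3.0

# Head for each ROI feature: two fully connected layers, then separate
# classification and box-regression outputs.
num_classes = 4                                  # assumed
roi_feature = torch.randn(1, 512 * 7 * 7)        # assumed flattened ROI feature

fc_head = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
)
cls_layer = nn.Linear(4096, num_classes)         # target classification
reg_layer = nn.Linear(4096, 7)                   # assumed 3D box parameters
                                                 # (x, y, z, l, w, h, yaw)

h = fc_head(roi_feature)
scores, box_deltas = cls_layer(h), reg_layer(h)
```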
Table 1 further illustrates the parameters of the VGG16 feature extraction layers, where Conv3-64 indicates that the convolution kernel size of the current layer is 3 × 3 and the output has 64 channels.
TABLE 1 VGG16 network architecture (without fully connected layers)
Input image
Group 1: Conv3-64, Conv3-64, maxpool
Group 2: Conv3-128, Conv3-128, maxpool
Group 3: Conv3-256, Conv3-256, Conv3-256, maxpool
Group 4: Conv3-512, Conv3-512, Conv3-512, maxpool
Group 5: Conv3-512, Conv3-512, Conv3-512
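Since Table 1 matches the standard VGG16 convolutional trunk, it can be written down directly. The sketch below (PyTorch assumed) builds the 13 convolutional layers in 5 groups with 3 × 3 kernels and stride 1, and inserts the 4 max-pooling layers as Table 1 lists them (after groups 1 to 4), with no fully connected layers.

```python
import torch.nn as nn

def vgg16_trunk(in_channels=3):
    """VGG16 feature extractor of Table 1: 13 conv layers in 5 groups,
    3x3 kernels with stride 1 and padding 1, ReLU after each conv,
    2x2 max pooling (stride 2) after groups 1-4, no fully connected layers."""
    cfg = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]
    layers, c_in = [], in_channels
    for group_idx, (c_out, n_convs) in enumerate(cfg, start=1):
        for _ in range(n_convs):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),
                       nn.ReLU(inplace=True)]
            c_in = c_out
        if group_idx <= 4:            # the 4 pooling layers of the description
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

backbone = vgg16_trunk()              # accepts any (N, 3, H, W) input
```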
The invention provides a 3D target detection method, which belongs to the field of artificial intelligence.
The invention provides a 3D target detection method, which is realized by adopting a deep convolutional network and specifically comprises the following steps:
(1) preparing a data set to be processed, comprising an original road vehicle RGB image data set and a corresponding target point cloud data set O3D, whose elements are the three-dimensional coordinates of the point cloud data in the target point cloud data set (formulas 1 and 2, rendered as images in the original document, define these coordinates);
(2) dividing the original road vehicle RGB image data set and the corresponding target point cloud data set in the step (1) into a training set, a verification set and a test set, wherein the verification set and the test set are used for testing the detection performance of the deep convolutional network after the deep convolutional network is trained;
(3) projecting the point cloud data in the training set in the step (2) to a two-dimensional plane to obtain a bird's-eye view;
(4) projecting the point cloud data in the training set in the step (2) to a cylindrical plane to obtain a front view;
(5) performing feature extraction on the bird's-eye view in step (3) by using a feature extraction network (such as VGG16), and obtaining feature maps FB1 and FB2 after the convolutional layers, where the size of FB1 is 2 times that of FB2;
(6) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FB2 obtained in step (5) to obtain a feature map FB3;
(7) summing the feature maps FB3 and FB1 obtained in steps (5) and (6) to perform fusion and obtain a feature map FB;
FB = FB1 + FB3 (formula 3)
(8) applying a two-dimensional image target detection method to the feature map FB to extract the target candidate positions in the bird's-eye view, thereby generating the candidate positions of the target in the bird's-eye view;
In step (8), a specific network implementation of the two-dimensional image target detection method is shown in Figs. 2 and 3. In Fig. 2, after an image is input, feature extraction is performed with the first 7 convolutional layers of the VGG16 network to generate a multi-channel feature map; that is, the feature map is obtained after the input image passes through the 7th convolutional layer of the VGG16 network, and this feature map is then sent to the RPN network. Compared with the traditional Selective Search, using an RPN network to extract the target candidate frames, i.e. the target candidate positions, on the feature map is less time-consuming, and most importantly the RPN can be integrated into a deep learning network to realize end-to-end operation. For each position on the feature map there are 9 possible candidate windows, which realizes coverage of targets of different scales: the candidate window sizes are {128, 256, 512} combined with three aspect ratios {1:1, 1:2, 2:1}. The RPN network therefore outputs feature region windows of different sizes as the input of the next layer. Since the size of the feature region map fed to the next layer is not fixed, the common practice is to crop and stretch it, but this disturbs the original information. Therefore an SPP network is added afterwards, as shown in Fig. 3. Three pooling templates are used in the SPP network: M1 (4 × 4), M2 (2 × 2) and M3 (1 × 1). The 4 × 4 template of M1 means that the ROI map is divided into 4 × 4 = 16 image blocks and the average value of each block is taken, i.e. a 16-dimensional vector is obtained; M2 and M3 are interpreted in the same way. After the feature region windows of different sizes are processed with these templates, results of sizes 4 × 4 × 512, 2 × 2 × 512 and 1 × 1 × 512 are obtained. The three results are straightened into sizes 16 × 512, 4 × 512 and 1 × 512 respectively and then concatenated along one dimension, giving a final length of 21 × 512. The purpose is to transform the feature maps of feature regions of different sizes into a fully connected layer of uniform length. The SPP network is more robust than simple pooling; because it handles feature regions of different aspect ratios and sizes, it improves the scale invariance of the images, converges more easily, and improves the accuracy of the experiments.
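The SPP pooling just described (averaging over 4 × 4, 2 × 2 and 1 × 1 blocks, then concatenation into a 21 × 512 vector) can be reproduced with adaptive average pooling. The sketch below assumes PyTorch and a 512-channel ROI feature map of arbitrary spatial size.

```python
import torch
import torch.nn.functional as F

def spp_pool(roi_feature_map):
    """Spatial pyramid pooling with templates M1 (4x4), M2 (2x2), M3 (1x1).

    roi_feature_map: tensor of shape (C, H, W); H and W may vary because the
    RPN outputs feature region windows of different sizes and aspect ratios.
    Each template splits the map into blocks and averages every block, giving
    16 + 4 + 1 = 21 values per channel, i.e. a vector of length 21 * C.
    """
    x = roi_feature_map.unsqueeze(0)                       # (1, C, H, W)
    pooled = [F.adaptive_avg_pool2d(x, output_size=s).flatten(start_dim=2)
              for s in (4, 2, 1)]                          # 16, 4 and 1 blocks
    return torch.cat(pooled, dim=2).flatten()              # length 21 * C

vector = spp_pool(torch.randn(512, 23, 37))                # arbitrary ROI size
print(vector.numel())                                      # 21 * 512 = 10752
```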
(9) performing feature extraction on the original road vehicle RGB images in the training set of step (2) by using a feature extraction network (such as VGG16), and obtaining feature maps FG1 and FG2 after the convolutional layers, where the size of FG1 is 2 times that of FG2;
(10) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FG2 of step (9) to obtain a feature map FG3;
(11) summing the feature maps FG1 and FG3 obtained in steps (9) and (10) to perform fusion and obtain a feature map FG, and generating the candidate positions of targets in the original road vehicle RGB image by the two-dimensional image target detection method;
FG = FG1 + FG3 (formula 4)
(12) feeding the positions of the target candidate frames (i.e. the candidate positions of the target) in the bird's-eye view and the original road vehicle RGB image obtained in steps (8) and (11) into a 3D candidate frame generation sub-network, and performing spatial matching and combination to generate a 3D target suggestion frame;
(13) performing feature extraction on the front view in step (4) by using a feature extraction network (such as VGG16), and obtaining feature maps FF1 and FF2 after the convolutional layers, where the size of FF1 is 2 times that of FF2;
(14) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FF2 obtained in step (13) to obtain a feature map FF3;
(15) summing the feature maps FF3 and FF1 obtained in steps (13) and (14) to perform fusion and obtain a feature map FF;
FF = FF1 + FF3 (formula 5)
(16) Fusing the feature maps FB, FG and FF obtained in the steps (7), (11) and (15) by an element average value method to obtain a feature map F;
(17) removing the height information from the 3D target suggestion frame obtained in step (12) and projecting it onto the feature map F to form a 2D target suggestion frame (a top-down projection process), and obtaining the ROI (region of interest) feature map corresponding to the 2D target suggestion frame;
(18) passing the ROI region feature map obtained in step (17) through two fully connected layers and then performing target classification and regression to obtain the final 3D target detection frame, i.e. the final result of the 3D target detection is a 3D frame drawn around the target.
Thus the training of the deep convolutional network is completed with the training set, the optimal deep convolutional network training model is then selected with the verification set, and the test set is used for the later performance test of the selected optimal model or for practical application.
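Steps (3) and (4) above project the LiDAR point cloud onto a two-dimensional plane (bird's-eye view) and onto a cylindrical plane (front view). The discretization details are not given in the text; the numpy sketch below uses an assumed grid range, cell resolution and angular resolution, and encodes one value per cell (maximum height for the bird's-eye view, range for the front view) purely for illustration.

```python
import numpy as np

def bird_eye_view(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0), res=0.1):
    """Project (N, 3) LiDAR points onto the ground plane as a height map.
    Grid ranges and cell resolution (metres per cell) are assumed values."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = (x >= x_range[0]) & (x < x_range[1]) & \
           (y >= y_range[0]) & (y < y_range[1])
    x, y, z = x[keep], y[keep], z[keep]
    cols = ((x - x_range[0]) / res).astype(int)
    rows = ((y - y_range[0]) / res).astype(int)
    bev = np.zeros((int((y_range[1] - y_range[0]) / res),
                    int((x_range[1] - x_range[0]) / res)), dtype=np.float32)
    np.maximum.at(bev, (rows, cols), z)      # keep max height per occupied cell
    return bev

def front_view(points, h_res=np.radians(0.4), v_res=np.radians(0.4)):
    """Project (N, 3) LiDAR points onto a cylinder: columns indexed by azimuth,
    rows by elevation angle.  Angular resolutions are assumed values."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    azimuth = np.arctan2(y, x)
    elevation = np.arctan2(z, np.sqrt(x ** 2 + y ** 2))
    cols = (azimuth / h_res).astype(int)
    rows = (elevation / v_res).astype(int)
    cols -= cols.min()
    rows -= rows.min()
    fv = np.zeros((rows.max() + 1, cols.max() + 1), dtype=np.float32)
    fv[rows, cols] = np.sqrt(x ** 2 + y ** 2 + z ** 2)   # encode range per cell
    return fv
```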
Although illustrative embodiments of the present invention have been described above to facilitate the understanding of the present invention by those skilled in the art, it should be understood that the present invention is not limited in scope to the specific embodiments. Such variations are obvious and all the inventions utilizing the concepts of the present invention are intended to be protected.

Claims (4)

1. A3D target detection method is characterized in that the method is realized by adopting a deep convolutional network, and specifically comprises the following steps:
(1) preparing a data set to be processed, comprising an original road vehicle RGB image data set and a corresponding target point cloud data set O3D, whose elements are the three-dimensional coordinates of the point cloud data in the target point cloud data set (formulas 1 and 2, rendered as images in the original document, define these coordinates);
(2) dividing the original road vehicle RGB image data set and the corresponding target point cloud data set in the step (1) into a training set, a verification set and a test set, wherein the verification set and the test set are used for testing the detection performance of the deep convolutional network after the deep convolutional network is trained;
(3) projecting the point cloud data in the training set in the step (2) to a two-dimensional plane to obtain a bird's-eye view;
(4) projecting the point cloud data in the training set in the step (2) to a cylindrical plane to obtain a front view;
(5) performing feature extraction on the bird's-eye view in step (3) by using a feature extraction network, and obtaining feature maps FB1 and FB2 after the convolutional layers, where the size of FB1 is 2 times that of FB2;
(6) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FB2 obtained in step (5) to obtain a feature map FB3;
(7) summing the feature maps FB3 and FB1 obtained in steps (5) and (6) to perform fusion and obtain a feature map FB;
FB = FB1 + FB3 (formula 3)
(8) applying a two-dimensional image target detection method to the feature map FB to extract the target candidate positions in the bird's-eye view, thereby generating the target candidate positions in the bird's-eye view;
(9) performing feature extraction on the original road vehicle RGB images in the training set of step (2) by using a feature extraction network, and obtaining feature maps FG1 and FG2 after the convolutional layers, where the size of FG1 is 2 times that of FG2;
(10) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FG2 of step (9) to obtain a feature map FG3;
(11) summing the feature maps FG1 and FG3 obtained in steps (9) and (10) to perform fusion and obtain a feature map FG, and generating the target candidate positions in the original road vehicle RGB image by the two-dimensional image target detection method;
FG = FG1 + FG3 (formula 4)
(12) feeding the target candidate positions in the bird's-eye view and the original road vehicle RGB image obtained in steps (8) and (11) into a 3D candidate frame generation sub-network, and performing spatial matching and combination to generate a 3D target suggestion frame;
(13) performing feature extraction on the front view in step (4) by using a feature extraction network, and obtaining feature maps FF1 and FF2 after the convolutional layers, where the size of FF1 is 2 times that of FF2;
(14) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FF2 obtained in step (13) to obtain a feature map FF3;
(15) summing the feature maps FF3 and FF1 obtained in steps (13) and (14) to perform fusion and obtain a feature map FF;
FF = FF1 + FF3 (formula 5)
(16) Fusing the feature maps FB, FG and FF obtained in the steps (7), (11) and (15) by an element average value method to obtain a feature map F;
(17) projecting the 3D target suggestion frame obtained in the step (12) on a feature map F to form a 2D target suggestion frame, and obtaining an ROI (region of interest) feature map corresponding to the 2D target suggestion frame;
(18) passing the ROI region feature map obtained in step (17) through two fully connected layers and then performing target classification and regression to obtain the final 3D target detection frame, i.e. the final result of the 3D target detection is a 3D frame drawn around the target;
thus the training of the deep convolutional network is completed with the training set, the optimal deep convolutional network training model is then selected with the verification set, and the test set is used for the later test of the selected optimal model or for practical application, thereby realizing the 3D target detection method.
2. The 3D object detection method according to claim 1, wherein the feature extraction network in the step (5), the step (9) and the step (13) is VGG 16.
3. The 3D object detection method according to claim 2, wherein the structure of the deep convolutional network comprises three branches from top to bottom, wherein the input image of the first-row branch is the bird's-eye view obtained by projecting point cloud data onto a two-dimensional plane; the input of the second-row branch is the unprocessed original road vehicle RGB image; the input of the third-row branch is the front view obtained by projecting point cloud data onto a cylindrical plane; the inputs of the three branches are all processed by a VGG16 feature extraction network, the VGG16 feature extraction network comprises 13 convolutional layers and 4 pooling layers in total but no fully connected layers, and the 4 pooling layers divide the convolutional layers of the VGG16 feature extraction network into 5 groups, wherein group 1 comprises 2 convolutional layers whose parameter representation is Conv3-64; group 2 comprises 2 convolutional layers whose parameter representation is Conv3-128; group 3 comprises 3 convolutional layers whose parameter representation is Conv3-256; group 4 comprises 3 convolutional layers whose parameter representation is Conv3-512; group 5 comprises 3 convolutional layers whose parameter representation is Conv3-512; after each group of convolutional layers, max pooling is performed with 1 pooling layer with the same parameters, the pooling kernel size being 2 × 2 and the stride being 2; the parameter representation Conv3-64 indicates that the convolution kernel size of the current layer is 3 × 3 and the output has 64 channels; Conv3-128 indicates a 3 × 3 kernel with 128 output channels; Conv3-256 indicates a 3 × 3 kernel with 256 output channels; Conv3-512 indicates a 3 × 3 kernel with 512 output channels; the convolution kernel sizes in the VGG16 feature extraction network are uniformly set to 3 × 3 with stride 1;
in the structure of the deep convolutional network, a total of 5 feature maps are contained in Block1 in the first-row branch; each feature map except the first is obtained by processing the previous feature map through the corresponding group of convolutional layers plus pooling layer in the VGG16 feature extraction network, and the first feature map in Block1 is obtained by processing the input image through the 1st group of convolutional layers plus pooling layer in the VGG16 feature extraction network; a deconvolution operation of 2-fold upsampling is performed on the feature map obtained by processing the last group of convolutional layers and pooling layer of the VGG16 feature extraction network, feature fusion is performed between this feature map and the feature map obtained by the 3rd convolutional layer of the 4th group of the VGG16 feature extraction network, and the target candidate positions in the bird's-eye view are thereby generated, wherein the parameters of the deconvolution layer performing the deconvolution operation are: kernel_size = 4, padding = 1 and strides = 2, which are preset according to the multiple of upsampling;
in the structure of the deep convolutional network, Block1 in the second-row branch performs the same operations as Block1 in the first-row branch, and the deconvolution operation in the second-row branch and its parameters are the same as those in the first-row branch, so as to obtain the target candidate positions in the original road vehicle RGB image output by the second-row branch;
after the third row of branches are subjected to feature extraction, the feature fusion method is the same as that of the first row of branches and that of the second row of branches, the feature maps of the three rows of branches after respective fusion are taken out, and the total feature map fusion is carried out at M positions in the network through pixel-by-pixel addition averaging to obtain a final feature map fusion result.
4. The 3D object detection method according to claim 3, wherein the two-dimensional image object detection method in the steps (8) and (11) is implemented by a network: after an image is input, a multi-channel feature map is generated by adopting a VGG16 network, the feature map is sent to an RPN network in the next step, the RPN network is used for extracting a target candidate frame on the feature map, namely a target candidate position, each position on the feature map has 9 possible candidate windows, the coverage detection of targets with different scales is realized, the size of the candidate windows is {128, 256, 512} multiplied by three proportions {1:1,1:2,2:1}, so that feature region windows with different sizes are output by the RPN network and are used as the input of the next layer; adding an SPP network after extracting target candidate frames on a feature map, wherein 3 pooling templates M1(4 × 4), M2(2 × 2) and M3(1 × 1) are used in the SPP network, wherein the interpretation of the 4 × 4 template in M1 is that an ROI map is divided into 4 × 4 image blocks, namely 16 blocks, then the average value of each block is taken to obtain a 16-dimensional vector, the interpretation of M2 and M3 is the same as that of M1, after template processing is carried out on feature region windows with different sizes, results with the sizes of 4 × 4 × 512,2 × 2 × 512 and 1 × 1 × 512 are respectively obtained, the three results are respectively straightened to be the sizes of 16 × 512,4 × 512 and 1 × 512, and then are connected in one dimension, and the finally formed length is 21 × 512, so that the feature maps of feature regions with different sizes are uniformly converted into a fully connected layer with uniform length.
CN201911354155.4A 2019-12-25 2019-12-25 3D target detection method Active CN111079685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911354155.4A CN111079685B (en) 2019-12-25 2019-12-25 3D target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911354155.4A CN111079685B (en) 2019-12-25 2019-12-25 3D target detection method

Publications (2)

Publication Number Publication Date
CN111079685A true CN111079685A (en) 2020-04-28
CN111079685B CN111079685B (en) 2022-07-26

Family

ID=70317646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911354155.4A Active CN111079685B (en) 2019-12-25 2019-12-25 3D target detection method

Country Status (1)

Country Link
CN (1) CN111079685B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001226A (en) * 2020-07-07 2020-11-27 中科曙光(南京)计算技术有限公司 Unmanned 3D target detection method and device and storage medium
CN112052860A (en) * 2020-09-11 2020-12-08 中国人民解放军国防科技大学 Three-dimensional target detection method and system
CN112329678A (en) * 2020-11-12 2021-02-05 山东师范大学 Monocular pedestrian 3D positioning method based on information fusion
CN112488010A (en) * 2020-12-05 2021-03-12 武汉中海庭数据技术有限公司 High-precision target extraction method and system based on unmanned aerial vehicle point cloud data
CN112561997A (en) * 2020-12-10 2021-03-26 之江实验室 Robot-oriented pedestrian positioning method and device, electronic equipment and medium
CN112990229A (en) * 2021-03-11 2021-06-18 上海交通大学 Multi-modal 3D target detection method, system, terminal and medium
CN113011317A (en) * 2021-03-16 2021-06-22 青岛科技大学 Three-dimensional target detection method and detection device
CN113158763A (en) * 2021-02-23 2021-07-23 清华大学 Three-dimensional target detection method based on multi-view feature fusion of 4D millimeter waves and laser point clouds
WO2021226876A1 (en) * 2020-05-13 2021-11-18 华为技术有限公司 Target detection method and apparatus
CN114494248A (en) * 2022-04-01 2022-05-13 之江实验室 Three-dimensional target detection system and method based on point cloud and images under different visual angles
CN114998856A (en) * 2022-06-17 2022-09-02 苏州浪潮智能科技有限公司 3D target detection method, device, equipment and medium of multi-camera image
US11688177B2 (en) 2020-05-29 2023-06-27 Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. Obstacle detection method and device, apparatus, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018094373A1 (en) * 2016-11-21 2018-05-24 Nio Usa, Inc. Sensor surface object detection methods and systems
US20180150703A1 (en) * 2016-11-29 2018-05-31 Autoequips Tech Co., Ltd. Vehicle image processing method and system thereof
CN109597087A (en) * 2018-11-15 2019-04-09 天津大学 A kind of 3D object detection method based on point cloud data
CN109612406A (en) * 2018-12-14 2019-04-12 中铁隧道局集团有限公司 A kind of random detection method of shield tunnel segment assembly ring assembly quality
CN110543858A (en) * 2019-09-05 2019-12-06 西北工业大学 Multi-mode self-adaptive fusion three-dimensional target detection method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018094373A1 (en) * 2016-11-21 2018-05-24 Nio Usa, Inc. Sensor surface object detection methods and systems
US20180150703A1 (en) * 2016-11-29 2018-05-31 Autoequips Tech Co., Ltd. Vehicle image processing method and system thereof
CN108121948A (en) * 2016-11-29 2018-06-05 帷享科技有限公司 Vehicle Image Processing Method and System
CN109597087A (en) * 2018-11-15 2019-04-09 天津大学 A kind of 3D object detection method based on point cloud data
CN109612406A (en) * 2018-12-14 2019-04-12 中铁隧道局集团有限公司 A kind of random detection method of shield tunnel segment assembly ring assembly quality
CN110543858A (en) * 2019-09-05 2019-12-06 西北工业大学 Multi-mode self-adaptive fusion three-dimensional target detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN X ET AL: "Multi-view 3D object detection network for autonomous driving", IEEE *
刘俊生: "Research on vehicle detection methods based on the fusion of laser point cloud and image", China Master's Theses Full-text Database, Engineering Science and Technology II *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021226876A1 (en) * 2020-05-13 2021-11-18 华为技术有限公司 Target detection method and apparatus
US11688177B2 (en) 2020-05-29 2023-06-27 Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. Obstacle detection method and device, apparatus, and storage medium
CN112001226A (en) * 2020-07-07 2020-11-27 中科曙光(南京)计算技术有限公司 Unmanned 3D target detection method and device and storage medium
CN112001226B (en) * 2020-07-07 2024-05-28 中科曙光(南京)计算技术有限公司 Unmanned 3D target detection method, device and storage medium
CN112052860A (en) * 2020-09-11 2020-12-08 中国人民解放军国防科技大学 Three-dimensional target detection method and system
CN112052860B (en) * 2020-09-11 2023-12-01 中国人民解放军国防科技大学 Three-dimensional target detection method and system
CN112329678A (en) * 2020-11-12 2021-02-05 山东师范大学 Monocular pedestrian 3D positioning method based on information fusion
CN112488010A (en) * 2020-12-05 2021-03-12 武汉中海庭数据技术有限公司 High-precision target extraction method and system based on unmanned aerial vehicle point cloud data
CN112561997A (en) * 2020-12-10 2021-03-26 之江实验室 Robot-oriented pedestrian positioning method and device, electronic equipment and medium
CN113158763B (en) * 2021-02-23 2021-12-07 清华大学 Three-dimensional target detection method based on multi-view feature fusion of 4D millimeter waves and laser point clouds
US11397242B1 (en) 2021-02-23 2022-07-26 Tsinghua University 3D object detection method based on multi-view feature fusion of 4D RaDAR and LiDAR point clouds
CN113158763A (en) * 2021-02-23 2021-07-23 清华大学 Three-dimensional target detection method based on multi-view feature fusion of 4D millimeter waves and laser point clouds
CN112990229A (en) * 2021-03-11 2021-06-18 上海交通大学 Multi-modal 3D target detection method, system, terminal and medium
CN113011317B (en) * 2021-03-16 2022-06-14 青岛科技大学 Three-dimensional target detection method and detection device
CN113011317A (en) * 2021-03-16 2021-06-22 青岛科技大学 Three-dimensional target detection method and detection device
CN114494248A (en) * 2022-04-01 2022-05-13 之江实验室 Three-dimensional target detection system and method based on point cloud and images under different visual angles
CN114998856A (en) * 2022-06-17 2022-09-02 苏州浪潮智能科技有限公司 3D target detection method, device, equipment and medium of multi-camera image
CN114998856B (en) * 2022-06-17 2023-08-08 苏州浪潮智能科技有限公司 3D target detection method, device, equipment and medium for multi-camera image

Also Published As

Publication number Publication date
CN111079685B (en) 2022-07-26

Similar Documents

Publication Publication Date Title
CN111079685B (en) 3D target detection method
CN111160214B (en) 3D target detection method based on data fusion
Wang et al. Fusing bird’s eye view lidar point cloud and front view camera image for 3d object detection
CN109753885B (en) Target detection method and device and pedestrian detection method and system
US11182644B2 (en) Method and apparatus for pose planar constraining on the basis of planar feature extraction
CN101901343B (en) Remote sensing image road extracting method based on stereo constraint
CN113065546B (en) Target pose estimation method and system based on attention mechanism and Hough voting
Biasutti et al. Lu-net: An efficient network for 3d lidar point cloud semantic segmentation based on end-to-end-learned 3d features and u-net
CN111832655A (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN112529015A (en) Three-dimensional point cloud processing method, device and equipment based on geometric unwrapping
CN109285162A (en) A kind of image, semantic dividing method based on regional area conditional random field models
CN109919145B (en) Mine card detection method and system based on 3D point cloud deep learning
CN111046767B (en) 3D target detection method based on monocular image
CN106295613A (en) A kind of unmanned plane target localization method and system
CN110570440A (en) Image automatic segmentation method and device based on deep learning edge detection
CN112712546A (en) Target tracking method based on twin neural network
CN103955942A (en) SVM-based depth map extraction method of 2D image
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
Yang et al. Semantic segmentation in architectural floor plans for detecting walls and doors
CN102867171B (en) Label propagation and neighborhood preserving embedding-based facial expression recognition method
Li et al. An aerial image segmentation approach based on enhanced multi-scale convolutional neural network
CN104463962B (en) Three-dimensional scene reconstruction method based on GPS information video
Wang et al. Instance segmentation of point cloud captured by RGB-D sensor based on deep learning
Huang et al. ES-Net: An efficient stereo matching network
CN114462486A (en) Training method of image processing model, image processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant