CN110570457A - Three-dimensional object detection and tracking method based on stream data - Google Patents

Three-dimensional object detection and tracking method based on stream data

Info

Publication number
CN110570457A
CN110570457A (application CN201910725207.8A; granted as CN110570457B)
Authority
CN
China
Prior art keywords
frame
dimensional
frames
feature
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910725207.8A
Other languages
Chinese (zh)
Other versions
CN110570457B (en)
Inventor
黄凯
郭叙森
古剑锋
郭思璐
杨铖章
许子潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Sun Yat Sen University
Original Assignee
National Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Sun Yat Sen University filed Critical National Sun Yat Sen University
Priority to CN201910725207.8A priority Critical patent/CN110570457B/en
Publication of CN110570457A publication Critical patent/CN110570457A/en
Application granted granted Critical
Publication of CN110570457B publication Critical patent/CN110570457B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Abstract

The invention relates to the field of three-dimensional object detection and tracking, and in particular to a three-dimensional object detection and tracking method based on stream data. Key frame data consisting of point cloud data and image data is input and preprocessed, and features are extracted to obtain feature maps. The feature maps are then input into a candidate-box extraction network to obtain candidate boxes; using the candidate boxes, feature blocks are cropped from the feature maps and from the correlation feature maps and fed into regression networks to obtain the three-dimensional boxes of the detected objects and their offsets. The boxes for the frames between key frames are obtained by interpolation, and the targets in all frames are associated to obtain the tracking result. Since only key frames need to be detected, the invention accelerates stream-data detection, meets the real-time requirement of the autonomous-driving environment, and offers good stability; at the same time, point cloud information and image information are fused so that their strengths and weaknesses complement each other, improving the accuracy of object detection.

Description

Three-dimensional object detection and tracking method based on stream data
Technical Field
The invention relates to the field of three-dimensional target detection and tracking, in particular to a three-dimensional object detection and tracking method based on stream data.
Background
At present, visual perception tasks for autonomous driving are mainly divided into image-based, point-cloud-based, and image/point-cloud-fusion-based approaches, specifically:
1. Image-based methods, represented mainly by Mono3D and 3DOP. Because image data contains no depth information, additional hand-designed three-dimensional features must be added. However, RGB data alone and such special hand-crafted features make it hard for a neural network to learn 3D spatial information effectively, and they also limit the extensibility of this scheme. Furthermore, manual feature extraction is generally time-consuming, so such methods currently offer limited accuracy and progress slowly.
2. Point-cloud-based methods, which can be subdivided into three sub-branches:
① Object detection is performed directly on the point cloud using a 3D CNN, as in 3D FCN and Vote3Deep. These methods first structure the point cloud data (typically into three-dimensional voxels) and then extract features with three-dimensional convolutions. Because the point cloud is very sparse and the convolution must be carried out in three dimensions, the detection process is extremely time-consuming. In addition, the high cost limits the size of the receptive field, so a traditional 3D CNN cannot learn local features at different scales well. ② Network structures designed specifically for point clouds. For example, VoxelNet divides the point cloud into structural units such as voxels and extracts features only on non-empty units. Recently, with models such as PointNet, PointNet++, PointCNN, PointSIFT, OctNet and Dynamic Graph CNN, the focus of research has shifted to learning spatial geometric representations more effectively from unordered point cloud data. Taking PointNet as an example, that work proposes the concept of a symmetric function based on the permutation invariance and rotation invariance of point cloud data: by fitting a symmetric function with fully connected layers and a pooling layer, point cloud features can be extracted efficiently (a minimal sketch of this idea appears after this list). However, because such methods use fully connected layers, all points generally have to be processed, so the speed in large scenes (with very many points) still needs improvement. ③ Work represented by PIXOR, FaF and Complex-YOLO projects the point cloud onto a plane, such as a front view or a bird's eye view. The projection loses information in one dimension, but since almost all objects in an autonomous-driving scene lie on roughly the same plane, the influence of this loss on the detection result is very small. This approach reduces the 3D CNN to a 2D CNN, lowering the space and time complexity of the algorithm and making real-time detection possible. However, because of the sparsity of the point cloud, projected targets contain few points, so the feature information is insufficient; the effect is particularly unsatisfactory for small targets and distant objects.
3. Methods based on image and point cloud fusion. These methods fuse the rich texture information of the image with the depth information of the point cloud; representative work includes MV3D, Fusing BEV & FV, AVOD and F-PointNet. The first three project the point cloud onto one or several planes and optionally add hand-designed features, which are then fused with the RGB image. MV3D fuses at deep network layers, while Fusing BEV & FV argues that fusing before the RPN yields better detection; both need extra modules to fuse the data, which slows the model down and makes real-time operation hard to achieve. By reducing the hand-designed input features, AVOD achieves a degree of real-time performance. F-PointNet, on the other hand, first runs 2D object detection on the image to obtain a 2D bounding box, then projects that box into three-dimensional space to obtain the corresponding viewing frustum, and finally uses PointNet to segment the point cloud inside the frustum to obtain the target's three-dimensional bounding box. Its drawbacks are that the accuracy is bounded by the 2D detection stage and that it handles occlusion and similar conditions poorly.
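As a concrete illustration of the symmetric-function idea described in sub-branch ② above, the following is a minimal PyTorch-style sketch of a shared per-point MLP followed by max-pooling aggregation. It is only a sketch of the general principle, not PointNet itself or any network from this patent; the class name and layer widths are assumptions.

```python
import torch
import torch.nn as nn

class TinyPointFeature(nn.Module):
    """Illustrative symmetric function: shared per-point MLP + max pooling.

    Max pooling over the point dimension makes the output invariant to the
    order of the input points. Layer widths are arbitrary assumptions.
    """
    def __init__(self, in_dim: int = 3, feat_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 3) -> per-point features (batch, N, feat_dim)
        per_point = self.mlp(points)
        # symmetric aggregation: max over the point dimension
        global_feat, _ = per_point.max(dim=1)          # (batch, feat_dim)
        return global_feat

# usage: a batch of two clouds with 1024 XYZ points each
cloud = torch.randn(2, 1024, 3)
print(TinyPointFeature()(cloud).shape)                 # torch.Size([2, 128])
```

Because max pooling ignores the order of its inputs, the resulting global feature is unchanged under any permutation of the points, which is exactly the invariance the symmetric function is meant to provide.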
Disclosure of Invention
To overcome the poor detection accuracy and poor real-time performance of the prior art, the invention provides a three-dimensional object detection and tracking method based on stream data that can accurately detect and localize three-dimensional objects, improve detection speed, and achieve real-time detection.
To solve the above technical problems, the invention adopts the following technical scheme. The method for detecting and tracking three-dimensional objects based on stream data comprises the following steps:
Step one: inputting key frame data consisting of point cloud data and image data for the preceding and following frames, preprocessing the data, and projecting the point cloud data in the top-view direction to form a bird's eye view (BEV) map;
Step two: performing feature extraction on the two key frames from step one to obtain feature maps, and inputting the extracted feature maps into a candidate-box extraction module to obtain a candidate-box set for each of the two key frames;
Step three: using the candidate boxes to crop feature blocks from the feature maps and resize them, then inputting the blocks into a classification network and a box-regression network to obtain each object's category and three-dimensional box position;
Step four: performing a correlation operation on the features of the two key frames extracted in step two to obtain correlation feature maps, using the candidate boxes to crop feature blocks from the correlation feature maps and resize them, then inputting the blocks into a regression network to obtain the offsets of the objects' three-dimensional boxes between the two key frames;
Step five: obtaining the objects' three-dimensional boxes in the frames lying between the two key frames by interpolation, using the key-frame three-dimensional boxes and their corresponding offsets, thereby obtaining three-dimensional detection results for the objects in all frames;
Step six: according to the detection results, associating the objects across all frames to obtain the tracking result.
Preferably, in step one, the image data is normalized and then cropped to 1200x360 px; for the point cloud data, points within the range [-40, 40] x [0, 70] x [0, 2.5] m are taken, and points falling outside the image range are removed. Normalization subtracts the image mean from each pixel value and then divides by the image standard deviation.
Preferably, in step two, the feature extractor is based on the VGG16 structure with a feature pyramid added; features are extracted from the two key frames to obtain a point cloud feature map and an image feature map respectively. The two feature maps are input into an RPN (Region Proposal Network) for prediction, yielding a number of three-dimensional candidate boxes.
Preferably, in step two, non-maximum suppression is applied to the candidate boxes to obtain K extraction boxes. The RPN candidate boxes and the final predicted boxes are very dense and contain many overlapping boxes; non-maximum suppression screens out a representative subset.
Preferably, in step three, the extraction boxes crop the corresponding feature blocks from the point cloud feature map and the image feature map respectively; after the blocks are resized to the same size, multi-view fusion is performed, and classification and regression by a fully connected network yield each object's three-dimensional box.
Preferably, in step four, the convolutional cross-correlation features of the image feature maps and of the point cloud feature maps of the preceding and following key frames are computed to obtain correlation feature maps C_img^{t,t+τ} and C_pc^{t,t+τ}. The extraction boxes crop the corresponding feature blocks from C_img^{t,t+τ} and C_pc^{t,t+τ}, the blocks are resized to the same size, and the features of the two views are fused to obtain a fused feature map; the fused feature map is input into a fully connected network to obtain the offsets of the objects' three-dimensional boxes between the two key frames.
Preferably, in steps three and four, non-maximum suppression is applied to the three-dimensional boxes and box offsets for screening, which reduces the number of boxes and the computational burden.
Preferably, in step six, after the detection results for all frames are obtained, the three-dimensional boxes of different frames are associated using the IOU Tracker algorithm; specifically, an overlap threshold is set, and if the overlap of the objects' three-dimensional boxes in two consecutive frames exceeds the threshold they are judged to be the same object, otherwise they are not.
Compared with the prior art, the beneficial effects are: 1. The invention exploits the information redundancy between frames of stream data: object detection is performed only on key frames, and the detection boxes of the other frames are generated by interpolation. This accelerates stream-data detection with little impact on detection accuracy, and solves the problem that existing three-dimensional detection networks take too long on continuous scene data to meet the real-time requirement of the autonomous-driving environment.
2. The three-dimensional object detection method provided by the invention is highly accurate, because point cloud information and image information are fused so that their strengths and weaknesses complement each other. Compared with object detection using images alone, the method incorporates the depth data of the point cloud and can handle object occlusion; compared with three-dimensional detection based on the point cloud alone, it incorporates the rich texture information of the image, compensating for the information loss caused by the sparsity of the point cloud, and effectively reduces the miss rate, especially for distant and small objects that have almost no point cloud data.
3. The proposed method is more robust. On the one hand, only key frames are processed and the results for non-key frames are obtained by interpolating the key-frame results, so the detection boxes vary continuously between key frames and the tracks are stable; on the other hand, the method fuses point cloud and image data, so it performs well in a variety of scenes.
4. The proposed network is end-to-end, which helps the optimization find a globally optimal solution, simplifies training, and keeps the overall framework simple.
Drawings
FIG. 1 is a flow chart of the present invention;
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent; for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted. The positional relationships depicted in the drawings are for illustrative purposes only and are not to be construed as limiting the present patent.
The same or similar reference numerals in the drawings of the embodiments of the invention denote the same or similar components. In the description of the invention, terms such as "upper", "lower", "left", "right", "long" and "short" that indicate orientations or positional relationships are based on the orientations shown in the drawings; they are used only for convenience and simplicity of description and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation. These terms are therefore illustrative only and are not to be construed as limiting the patent; their specific meanings can be understood by those skilled in the art according to the specific situation.
The technical scheme of the invention is further described in detail by the following specific embodiments in combination with the attached drawings:
Examples
Fig. 1 shows an embodiment of a method for detecting and tracking a three-dimensional object based on stream data, comprising the following steps:
Step one: inputting key frame data consisting of point cloud data and image data for the preceding and following frames, and preprocessing the data. The image data is normalized and then cropped to 1200x360 px; for the point cloud data, points within the range [-40, 40] x [0, 70] x [0, 2.5] m are taken, and points falling outside the image range are removed. Normalization subtracts the image mean from each pixel value and then divides by the image standard deviation. The [-40, 40] x [0, 70] x [0, 2.5] m space is rasterized into an 800x700x5 three-dimensional tensor, i.e. each element of the tensor corresponds to a small cuboid of 0.1x0.1x0.5 m; the value of the element is the maximum height of all points inside the cuboid, or 0 if the cuboid contains no points. To account for the differing numbers of points in different cuboids, a density channel is added whose value is min(1.0, log(N+1)/log 16), so the final BEV map has size 800x700x6. Because the objects are moving and the coordinate-system origins differ, the data of the preceding and following key frames are transformed into the same coordinate system using the IMU data.
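The BEV rasterization described above can be summarized in a short NumPy sketch; the function name is hypothetical, and mapping the five height channels to 0.5 m slices of the z range is an interpretation of the 0.1x0.1x0.5 m cuboids mentioned in the text, not a quoted implementation.

```python
import numpy as np

def make_bev(points: np.ndarray) -> np.ndarray:
    """Rasterize a LiDAR cloud into an 800x700x6 BEV map (hypothetical helper).

    points: (N, 3) array of (x, y, z) in metres. Ranges follow the text:
    x in [-40, 40], y in [0, 70], z in [0, 2.5], cell size 0.1 x 0.1 x 0.5 m.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = (x >= -40) & (x < 40) & (y >= 0) & (y < 70) & (z >= 0) & (z < 2.5)
    x, y, z = x[keep], y[keep], z[keep]

    xi = ((x + 40) / 0.1).astype(int)        # 0 .. 799
    yi = (y / 0.1).astype(int)               # 0 .. 699
    zi = (z / 0.5).astype(int)               # 0 .. 4

    bev = np.zeros((800, 700, 6), dtype=np.float32)
    # height channels: maximum point height inside each 0.1 x 0.1 x 0.5 m cuboid
    np.maximum.at(bev, (xi, yi, zi), z)
    # density channel: min(1, log(N + 1) / log 16) over each 0.1 x 0.1 m column
    counts = np.zeros((800, 700), dtype=np.float32)
    np.add.at(counts, (xi, yi), 1.0)
    bev[:, :, 5] = np.minimum(1.0, np.log(counts + 1.0) / np.log(16.0))
    return bev
```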
Step two: performing feature extraction on the two key frames from step one. The feature extractor is based on the VGG16 structure with a feature pyramid added, and produces a point cloud feature map and an image feature map. The two feature maps are input into an RPN (Region Proposal Network) for prediction, yielding a number of three-dimensional candidate boxes. Non-maximum suppression is then applied to the candidate boxes to obtain K extraction boxes: the RPN candidates and the final predicted boxes are very dense with many overlapping boxes, and non-maximum suppression keeps only representative ones. The threshold of the algorithm is set to 0.8.
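For reference, a minimal sketch of the non-maximum suppression step mentioned above, written for axis-aligned top-view boxes; the patent applies the same idea to the three-dimensional candidate boxes, so the box parameterization and helper name here are simplifying assumptions.

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.8) -> list:
    """Greedy NMS on axis-aligned boxes (x1, y1, x2, y2); returns kept indices."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # intersection of the chosen box with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # keep only boxes that do not overlap the chosen box too much
        order = rest[iou <= iou_thresh]
    return keep
```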
Step three: using the candidate boxes to crop feature blocks from the feature maps and resize them, then feeding the blocks into a classification network and a box-regression network to obtain each object's category and three-dimensional box position. Here the candidate boxes are the extraction boxes remaining after screening; each extraction box crops the corresponding feature blocks from the point cloud feature map and the image feature map, and the blocks are resized to the same size. After multi-view fusion, the feature blocks are classified and regressed by a fully connected network to obtain the objects' categories and three-dimensional box positions.
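A minimal PyTorch-style sketch of the crop-resize-fuse-and-regress step described above; the bilinear resize to a 7x7 block, element-wise averaging as the fusion operator, the layer widths, and the 7-value box encoding are all illustrative assumptions rather than details stated in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Fuse a BEV feature block and an image feature block, then classify and regress."""
    def __init__(self, channels: int = 256, num_classes: int = 2):
        super().__init__()
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(channels * 7 * 7, 512), nn.ReLU())
        self.cls = nn.Linear(512, num_classes)   # object category
        self.box = nn.Linear(512, 7)             # assumed (x, y, z, l, w, h, yaw) encoding

    def forward(self, bev_block: torch.Tensor, img_block: torch.Tensor):
        # resize both cropped blocks to the same spatial size, then fuse by averaging
        bev = F.interpolate(bev_block, size=(7, 7), mode="bilinear", align_corners=False)
        img = F.interpolate(img_block, size=(7, 7), mode="bilinear", align_corners=False)
        fused = 0.5 * (bev + img)
        h = self.fc(fused)
        return self.cls(h), self.box(h)

# usage with dummy cropped blocks of different sizes (a batch of 4 proposals)
head = FusionHead()
cls_logits, boxes = head(torch.randn(4, 256, 10, 12), torch.randn(4, 256, 14, 9))
print(cls_logits.shape, boxes.shape)   # torch.Size([4, 2]) torch.Size([4, 7])
```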
Step four: computing the cross-correlation features of the two key frames extracted in step two to obtain correlation feature maps, using the candidate boxes to crop feature blocks from the correlation feature maps and resize them, then feeding the blocks into a regression network to obtain the offsets of the objects' three-dimensional boxes between the two key frames.
Specifically, the convolutional cross-correlation features of the image feature maps and of the point cloud feature maps are computed to obtain correlation feature maps C_img^{t,t+τ} and C_pc^{t,t+τ}. The extraction boxes crop the corresponding feature blocks from C_img^{t,t+τ} and C_pc^{t,t+τ}, the blocks are resized to the same size, and the features of the two views are fused to obtain a fused feature map; the fused feature map is input into a fully connected network to obtain the offsets of the objects' three-dimensional boxes between the two key frames.
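One possible reading of the convolutional cross-correlation above is a local correlation layer that compares each position of the frame-t feature map with a small displacement window in the frame-(t+τ) feature map, as sketched below; the window size, normalization, and function name are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def correlation(feat_t: torch.Tensor, feat_t2: torch.Tensor, max_disp: int = 4) -> torch.Tensor:
    """Local cross-correlation between two feature maps of shape (B, C, H, W).

    For each displacement (dx, dy) in [-max_disp, max_disp], the channel-wise
    dot product of feat_t with the shifted feat_t2 becomes one output channel,
    giving a (B, (2*max_disp+1)**2, H, W) correlation volume.
    """
    b, c, h, w = feat_t.shape
    padded = F.pad(feat_t2, [max_disp] * 4)            # pad left/right/top/bottom
    out = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = padded[:, :, dy:dy + h, dx:dx + w]
            out.append((feat_t * shifted).sum(dim=1, keepdim=True) / c)
    return torch.cat(out, dim=1)

# usage: correlation volume between the key frames t and t+τ
c_vol = correlation(torch.randn(1, 64, 50, 44), torch.randn(1, 64, 50, 44))
print(c_vol.shape)   # torch.Size([1, 81, 50, 44])
```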
Step five: obtaining the objects' three-dimensional boxes for the frames lying between the two key frames by interpolation, using the key-frame three-dimensional boxes and their corresponding offsets, thereby obtaining three-dimensional detection results for the objects in all frames.
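A minimal sketch of how the interpolation of non-key-frame boxes might look, linearly blending each key-frame box toward its predicted offset; the (x, y, z, l, w, h, yaw) box layout and the uniform time spacing are assumptions, since the text only states that interpolation is used.

```python
import numpy as np

def interpolate_boxes(box_t: np.ndarray, offset: np.ndarray, num_between: int) -> np.ndarray:
    """Linearly interpolate 3D boxes between key frames t and t+τ.

    box_t:  (7,) box at key frame t, assumed (x, y, z, l, w, h, yaw).
    offset: (7,) predicted change of the box from frame t to frame t+τ.
    Returns an array of shape (num_between, 7), one box per in-between frame.
    """
    alphas = np.arange(1, num_between + 1) / (num_between + 1)
    return box_t[None, :] + alphas[:, None] * offset[None, :]

# usage: two frames lie between the key frames
box_t = np.array([12.0, 3.5, 0.8, 4.2, 1.8, 1.6, 0.1])
offset = np.array([1.2, 0.1, 0.0, 0.0, 0.0, 0.0, 0.02])
print(interpolate_boxes(box_t, offset, num_between=2))
```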
Step six: according to the detection results, associating the three-dimensional boxes of different frames using the IOU Tracker algorithm; specifically, an overlap threshold is set, and if the overlap of the objects' three-dimensional boxes in two consecutive frames exceeds the threshold they are judged to be the same object, otherwise they are not.
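A simplified sketch of the IoU-based association described above, matching detections frame by frame against active tracks and starting a new track for every unmatched detection; the greedy matching strategy, the axis-aligned top-view IoU in place of full 3D overlap, and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np

def bev_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Axis-aligned top-view IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def associate(frames: list, iou_thresh: float = 0.5) -> list:
    """Greedy IoU tracker: frames is a list of (N_i, 4) arrays of boxes per frame."""
    tracks = []                               # each track: list of (frame_idx, box)
    for f_idx, boxes in enumerate(frames):
        used = set()
        for track in tracks:
            last_idx, last_box = track[-1]
            if last_idx != f_idx - 1:
                continue                      # track was not seen in the previous frame
            # pick the best still-unmatched detection for this track
            best, best_iou = None, iou_thresh
            for d, box in enumerate(boxes):
                if d not in used and bev_iou(last_box, box) >= best_iou:
                    best, best_iou = d, bev_iou(last_box, box)
            if best is not None:
                track.append((f_idx, boxes[best]))
                used.add(best)
        # unmatched detections start new tracks
        for d, box in enumerate(boxes):
            if d not in used:
                tracks.append([(f_idx, box)])
    return tracks
```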
In addition, in steps three and four, the three-dimensional boxes and box offsets are screened by non-maximum suppression with a threshold of 0.63, which reduces the number of boxes and the computational burden.
It should be understood that the above embodiment is merely an example given to illustrate the invention clearly and is not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; it is neither necessary nor possible to enumerate all embodiments exhaustively. Any modification, equivalent replacement or improvement made within the spirit and principle of the invention shall fall within the protection scope of the claims.

Claims (8)

1. A three-dimensional object detection and tracking method based on stream data, characterized by comprising the following steps:
Step one: inputting key frame data consisting of point cloud data and image data for the preceding and following frames, preprocessing the data, and projecting the point cloud data in the top-view direction to form a bird's eye view (BEV) map;
Step two: performing feature extraction on the two key frames from step one to obtain feature maps, and inputting the extracted feature maps into a candidate-box extraction module to obtain a candidate-box set for each of the two key frames;
Step three: using the candidate boxes to crop feature blocks from the feature maps and resize them, then inputting the blocks into a classification network and a box-regression network to obtain each object's category and three-dimensional box position;
Step four: performing a correlation operation on the features of the two key frames extracted in step two to obtain correlation feature maps, using the candidate boxes to crop feature blocks from the correlation feature maps and resize them, then inputting the blocks into a regression network to obtain the offsets of the objects' three-dimensional boxes between the two key frames;
Step five: obtaining the objects' three-dimensional boxes in the frames lying between the two key frames by interpolation, using the key-frame three-dimensional boxes and their corresponding offsets, thereby obtaining three-dimensional detection results for the objects in all frames;
Step six: according to the detection results, associating the objects across all frames to obtain the tracking result.
2. The three-dimensional object detection and tracking method based on stream data according to claim 1, wherein in step one the image data is normalized and then cropped to 1200x360 px; for the point cloud data, points within the range [-40, 40] x [0, 70] x [0, 2.5] m are taken, and points falling outside the image range are removed.
3. The three-dimensional object detection and tracking method based on stream data according to claim 2, wherein in step two the feature extraction is based on the VGG16 structure with a feature pyramid added, and features are extracted from the two key frames to obtain a point cloud feature map and an image feature map respectively; the two feature maps are input into an RPN for prediction to obtain a number of three-dimensional candidate boxes.
4. The three-dimensional object detection and tracking method based on stream data according to claim 3, wherein in step two non-maximum suppression is applied to the candidate boxes to obtain K extraction boxes.
5. The three-dimensional object detection and tracking method based on stream data according to claim 4, wherein in step three the extraction boxes crop the corresponding feature blocks from the point cloud feature map and the image feature map respectively; after the blocks are resized to the same size, multi-view fusion is performed, and classification and regression by a fully connected network yield each object's three-dimensional box.
6. The three-dimensional object detection and tracking method based on stream data according to claim 5, wherein in step four the convolutional cross-correlation features of the image feature maps and of the point cloud feature maps of the preceding and following key frames are computed to obtain correlation feature maps C_img^{t,t+τ} and C_pc^{t,t+τ}; the extraction boxes crop the corresponding feature blocks from C_img^{t,t+τ} and C_pc^{t,t+τ}, the blocks are resized to the same size and the features of the two views are fused to obtain a fused feature map C_fusion^{t,t+τ}; the fused feature map is input into a fully connected network to obtain the offsets of the objects' three-dimensional boxes between the two key frames.
7. The three-dimensional object detection and tracking method based on stream data according to claim 6, wherein in steps three and four non-maximum suppression is used to screen the three-dimensional boxes and box offsets.
8. The three-dimensional object detection and tracking method based on stream data according to claim 1, wherein in step six, after the detection results for all frames are obtained, the three-dimensional boxes of different frames are associated using the IOU Tracker algorithm; specifically, an overlap threshold is set, and if the overlap of the objects' three-dimensional boxes in two consecutive frames exceeds the threshold they are judged to be the same object, otherwise they are not.
CN201910725207.8A 2019-08-07 2019-08-07 Three-dimensional object detection and tracking method based on stream data Active CN110570457B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910725207.8A CN110570457B (en) 2019-08-07 2019-08-07 Three-dimensional object detection and tracking method based on stream data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910725207.8A CN110570457B (en) 2019-08-07 2019-08-07 Three-dimensional object detection and tracking method based on stream data

Publications (2)

Publication Number Publication Date
CN110570457A true CN110570457A (en) 2019-12-13
CN110570457B CN110570457B (en) 2023-01-06

Family

ID=68774806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910725207.8A Active CN110570457B (en) 2019-08-07 2019-08-07 Three-dimensional object detection and tracking method based on stream data

Country Status (1)

Country Link
CN (1) CN110570457B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111209840A (en) * 2019-12-31 2020-05-29 浙江大学 3D target detection method based on multi-sensor data fusion
CN111209825A (en) * 2019-12-31 2020-05-29 武汉中海庭数据技术有限公司 Method and device for dynamic target 3D detection
CN111222441A (en) * 2019-12-31 2020-06-02 深圳市人工智能与机器人研究院 Point cloud target detection and blind area target detection method and system based on vehicle-road cooperation
CN111461221A (en) * 2020-04-01 2020-07-28 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-source sensor fusion target detection method and system for automatic driving
CN111814674A (en) * 2020-07-08 2020-10-23 上海雪湖科技有限公司 Non-maximum suppression method of point cloud network based on FPGA
CN112365600A (en) * 2020-11-10 2021-02-12 中山大学 Three-dimensional object detection method
WO2021134285A1 (en) * 2019-12-30 2021-07-08 深圳元戎启行科技有限公司 Image tracking processing method and apparatus, and computer device and storage medium
CN115496977A (en) * 2022-09-14 2022-12-20 北京化工大学 Target detection method and device based on multi-mode sequence data fusion


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006072495A (en) * 2004-08-31 2006-03-16 Fuji Heavy Ind Ltd Three-dimensional object monitoring device
US20070208872A1 (en) * 2006-03-03 2007-09-06 Hon Hai Precision Industry Co., Ltd. System and method for processing streaming data
US20100045665A1 (en) * 2007-01-22 2010-02-25 Total Immersion Method and device for creating at least two key frames corresponding to a three-dimensional object
KR101763921B1 (en) * 2016-10-21 2017-08-01 (주)플럭스플래닛 Method and system for contents streaming
CN109146929A (en) * 2018-07-05 2019-01-04 中山大学 A kind of object identification and method for registering based under event triggering camera and three-dimensional laser radar emerging system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陆峰 (LU Feng) et al.: "基于多传感器数据融合的障碍物检测与跟踪" [Obstacle detection and tracking based on multi-sensor data fusion], 《军事交通学院学报》 [Journal of Military Transportation University] *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021134285A1 (en) * 2019-12-30 2021-07-08 深圳元戎启行科技有限公司 Image tracking processing method and apparatus, and computer device and storage medium
CN111209840A (en) * 2019-12-31 2020-05-29 浙江大学 3D target detection method based on multi-sensor data fusion
CN111209825A (en) * 2019-12-31 2020-05-29 武汉中海庭数据技术有限公司 Method and device for dynamic target 3D detection
CN111222441A (en) * 2019-12-31 2020-06-02 深圳市人工智能与机器人研究院 Point cloud target detection and blind area target detection method and system based on vehicle-road cooperation
CN111209840B (en) * 2019-12-31 2022-02-18 浙江大学 3D target detection method based on multi-sensor data fusion
CN111209825B (en) * 2019-12-31 2022-07-01 武汉中海庭数据技术有限公司 Method and device for dynamic target 3D detection
CN111222441B (en) * 2019-12-31 2024-04-23 深圳市人工智能与机器人研究院 Point cloud target detection and blind area target detection method and system based on vehicle-road cooperation
CN111461221A (en) * 2020-04-01 2020-07-28 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Multi-source sensor fusion target detection method and system for automatic driving
CN111814674A (en) * 2020-07-08 2020-10-23 上海雪湖科技有限公司 Non-maximum suppression method of point cloud network based on FPGA
CN112365600A (en) * 2020-11-10 2021-02-12 中山大学 Three-dimensional object detection method
CN112365600B (en) * 2020-11-10 2023-11-24 中山大学 Three-dimensional object detection method
CN115496977A (en) * 2022-09-14 2022-12-20 北京化工大学 Target detection method and device based on multi-mode sequence data fusion

Also Published As

Publication number Publication date
CN110570457B (en) 2023-01-06

Similar Documents

Publication Publication Date Title
CN110570457B (en) Three-dimensional object detection and tracking method based on stream data
US10719940B2 (en) Target tracking method and device oriented to airborne-based monitoring scenarios
CN112435325B (en) VI-SLAM and depth estimation network-based unmanned aerial vehicle scene density reconstruction method
CN110688905B (en) Three-dimensional object detection and tracking method based on key frame
CN108648161B (en) Binocular vision obstacle detection system and method of asymmetric kernel convolution neural network
CN111832655B (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN107204010A (en) A kind of monocular image depth estimation method and system
CN108648194B (en) Three-dimensional target identification segmentation and pose measurement method and device based on CAD model
US10679369B2 (en) System and method for object recognition using depth mapping
KR100560464B1 (en) Multi-view display system with viewpoint adaptation
CN111160291B (en) Human eye detection method based on depth information and CNN
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN111340922A (en) Positioning and mapping method and electronic equipment
CN103020606A (en) Pedestrian detection method based on spatio-temporal context information
CN113744315B (en) Semi-direct vision odometer based on binocular vision
CN110706269A (en) Binocular vision SLAM-based dynamic scene dense modeling method
CN110599522A (en) Method for detecting and removing dynamic target in video sequence
CN112085031A (en) Target detection method and system
CN112446882A (en) Robust visual SLAM method based on deep learning in dynamic scene
CN106651921B (en) Motion detection method and method for avoiding and tracking moving target
CN116051980B (en) Building identification method, system, electronic equipment and medium based on oblique photography
CN113920254B (en) Monocular RGB (Red Green blue) -based indoor three-dimensional reconstruction method and system thereof
CN114648639B (en) Target vehicle detection method, system and device
WO2023030062A1 (en) Flight control method and apparatus for unmanned aerial vehicle, and device, medium and program
KR20160039447A (en) Spatial analysis system using stereo camera.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant