CN115272416A - Vehicle and pedestrian detection tracking method and system based on multi-source sensor fusion - Google Patents

Vehicle and pedestrian detection tracking method and system based on multi-source sensor fusion Download PDF

Info

Publication number: CN115272416A
Application number: CN202210979600.1A
Authority: CN (China)
Prior art keywords: target, point cloud, detection, dimensional, tracking
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 张瑞亮, 胡政政, 王玮
Current Assignee: Taiyuan University of Technology (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Taiyuan University of Technology
Application filed by Taiyuan University of Technology
Priority to CN202210979600.1A
Publication of CN115272416A

    • G06T7/246: Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N3/04: Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology
    • G06N3/08: Neural networks; learning methods
    • G06T5/80: Image enhancement or restoration; geometric correction
    • G06T7/215: Analysis of motion; motion-based segmentation
    • G06V10/761: Image or video pattern matching; proximity, similarity or dissimilarity measures
    • G06V10/764: Recognition or understanding using pattern recognition or machine learning; classification, e.g. of video objects
    • G06V10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level; fusion of extracted features
    • G06V10/82: Recognition or understanding using neural networks
    • G06T2207/20081: Training; learning
    • G06T2207/20084: Artificial neural networks [ANN]
    • G06T2207/30241: Trajectory


Abstract

The invention discloses a vehicle and pedestrian detection and tracking method and system based on multi-source sensor fusion, which comprises the following steps: performing distortion removal and instance segmentation on the image data; acquiring a sparse point cloud depth map based on the laser radar point cloud coordinates; performing point cloud densification on the depth map to obtain a dense point cloud depth map; performing voxelization, encoding and feature mapping on the dense point cloud to obtain a mapped two-dimensional dense tensor; acquiring the two-dimensional target center point position based on a heat map and a Gaussian distribution function, performing one-stage feature regression on the dense tensor, and acquiring a preliminary target detection bounding box; acquiring refined size and orientation features of the bounding box together with a confidence prediction value; and tracking vehicles and pedestrians based on nearest-neighbor matching and the preliminary target detection bounding box. The invention alleviates the missed detection and misclassification of a single sensor, the scale shrinkage and enlargement of targets in the sensor field of view, and the missed detection and mislocation caused by compression distortion of depth features during fusion, and reduces the average scale error.

Description

Vehicle and pedestrian detection tracking method and system based on multi-source sensor fusion
Technical Field
The invention belongs to the technical field of environmental perception, and particularly relates to a vehicle and pedestrian detection and tracking method and system based on multi-source sensor fusion.
Background
Autonomous vehicles can relieve commuting-hour traffic congestion, reduce traffic accidents caused by human factors, and promote the development of the technology industry and the upgrading of travel services. The state has begun to issue policies promoting the autonomous driving industry, automobile-related enterprises have adjusted their research and development strategies accordingly, and the arrival of 5G technology is expected to trigger a new round of academic research on autonomous driving, a field that spans many large and complex disciplines. The environment perception layer realizes its functions mainly through sensors such as cameras, lidar, millimeter-wave radar, integrated navigation systems and ultrasonic radar; only a highly reliable and accurate perception layer can guarantee the effectiveness and safety of the two subsequent layers, so how to acquire environmental information more accurately and reliably through sensors has become a research hotspot. Current environment perception technology mainly covers three fields: simultaneous localization and mapping, target detection and target tracking. Simultaneous localization and mapping addresses the localization and map-building problems of the autonomous vehicle, and in this work that technology is used only to build a campus map. The invention therefore focuses on solving the problems of existing target detection, target tracking and multi-sensor fusion algorithms and on improving the corresponding evaluation indexes.
Target detection technology mainly comprises obstacle detection and traffic signal detection, and the main research objects of obstacle detection are obstacles such as vehicles and pedestrians. However, because the environmental features described by a single sensor differ in space and dimensionality, missed detections and misclassifications occur during detection, leading to large average scale error and average attribute error. When the posture of the detected object changes, existing axis-aligned bounding-box algorithms find it difficult to enumerate and fit all target states, so the average orientation error is large. Traditional lidar point cloud feature extraction algorithms are computationally expensive with poor real-time performance, and the average translation error is large.
Target tracking technology maintains stable identities for the targets detected by the sensors and tracks their positions across data frames to provide semantic support for system decisions; it is a prerequisite for correct decision making by an intelligent vehicle and is mainly applied to autonomous driving functions such as adaptive cruise control and automatic emergency braking. When the posture of a tracked object changes, its scale in the sensor field of view changes at the same time. When the scale shrinks, background in the scene is included in the three-dimensional target bounding box, which reduces the number of tracks in which at least 80% of a target's video frames are correctly tracked; when the scale grows, the three-dimensional bounding box can no longer fully express the target information, which increases the number of times the tracked target's identity is wrongly switched or exchanged. Existing algorithms track with three-dimensional bounding boxes, which requires heavy computation, resulting in short continuous tracking time and poor real-time performance. Moreover, the depth map obtained by projecting the sparse point cloud onto the image cannot fully reflect the target's features, so multi-target tracking accuracy is low.
To better realize the environment perception function of autonomous driving, multi-source sensor fusion can complement the advantages of each sensor and obtain better estimates of the attributes and states of environmental targets. However, because the heat map used when fusing point cloud and image is obtained from model predictions, the sparse spatial distribution of the three-dimensional point cloud compresses and distorts the depth features, so the centers of different detection boxes of the same target become too close to each other, causing missed detections and mislocation. Existing fusion schemes also handle the information volume of multi-source sensors poorly, so it is difficult for fusion-based detection and tracking to achieve both real-time performance and accuracy. Finally, the limited number of point cloud points inside a three-dimensional target bounding box may not carry enough information to predict features such as the shape and orientation of the whole object, so the fusion effect is poor.
Disclosure of Invention
The invention aims to provide a vehicle and pedestrian detection and tracking method and system based on multi-source sensor fusion, so as to solve the problems in the prior art.
In order to achieve the above object, the present invention provides a vehicle and pedestrian detection and tracking method based on multi-source sensor fusion, which includes:
acquiring image data, and performing distortion removal and instance segmentation processing on the image data to acquire a detection object range;
acquiring a sparse point cloud depth map based on the detection object range and the laser radar point cloud coordinates;
performing point cloud densification on the sparse point cloud depth map to obtain a dense point cloud depth map;
performing voxelization and coding processing on dense point cloud in the dense point cloud depth map, and performing feature mapping to obtain a dense tensor of the two-dimensional mapping of the dense point cloud features;
performing one-stage feature regression on the dense tensor to obtain a primary target detection bounding box;
performing two-stage feature regression on the preliminary target detection bounding box to obtain refined features of the size and the orientation of the preliminary target detection bounding box and a confidence coefficient predicted value;
and tracking the vehicles and the pedestrians based on the matching of the preliminary target detection bounding box and the nearest neighbor.
In another aspect, to achieve the above object, the present invention provides a vehicle and pedestrian detection and tracking system based on multi-source sensor fusion, including:
the system comprises an image processing module, a multi-source sensor feature level fusion module, a target detection module and a target tracking module;
the image processing module is used for acquiring image data, and performing distortion removal and instance segmentation processing on the image data to acquire a detection object range;
the multi-source sensor feature level fusion module is used for acquiring dense point cloud feature mapping according to the detection object range;
the target detection module performs one-stage and two-stage feature regression on the dense point cloud feature mapping, and obtains a target detection result through a multilayer perceptron;
and the target tracking module is used for tracking the target according to the characteristic regression result.
The invention has the technical effects that:
(1) Target detection. To address missed detection and misclassification by a single sensor, the invention performs feature-level fusion of the lidar point cloud and the camera image: each sensor completes feature extraction from its own data, and the feature information is then fed into the fusion model for unified processing, which reduces the workload of fusion and thereby reduces the average scale error and average attribute error. Secondly, to address the difficulty of fitting the target state when the posture of the detected object changes, the method abandons the anchor-based approach of exhaustively enumerating potential targets followed by post-processing and instead adopts an orientation-free heat-map center-point algorithm, which improves the detection accuracy of targets with orientation angles and further reduces the average scale error. Finally, to address the poor real-time performance caused by the resource consumption of lidar point cloud processing, the method abandons direct convolution of the point cloud: after image segmentation, the point cloud is projected into the segmented region of interest to generate a depth map, which is then point-cloud encoded and passed through feature extraction to generate the feature mapping, balancing real-time performance and accuracy as far as possible and further reducing the average translation error. The nuScenes detection score NDS of the method is 64.9%, the mean average precision mAP is 81.5%, the vehicle detection precision is 84.1%, and the pedestrian detection precision is 78.9%.
(2) Target tracking. Firstly, to address the scale shrinkage and enlargement in the sensor field of view caused by changes in the tracked object's posture, the invention adopts an orientation-free heat-map center-point detection scheme, which effectively avoids target-model update errors caused by object scale changes, increasing the number of tracks in which at least 80% of a target's video frames are correctly tracked while reducing the number of identity switches. Secondly, the target detection and data association modules are merged: the target center of the current frame is projected to the previous frame, the position offset between the two frames is computed with a center-point localization loss, a negative velocity estimate between adjacent frames is used to infer the position of the target center in the previous frame, and the closest object at that position in the previous frame is retrieved, improving real-time performance and tracking duration. Finally, to address the low multi-target tracking accuracy obtained by projecting the sparse point cloud onto the image, the sparse points and their depth values within the instance-segmented foreground regions are used to generate a viewing-frustum region of interest; points with depth information are randomly sampled within the two-dimensional foreground segmentation region, and the matched randomly sampled two-dimensional points are projected back into three-dimensional space around the sparse points after nearest-neighbor depth estimation to obtain virtual points. The resulting dense point cloud is voxelized and encoded, the obtained pillar features are extracted to generate the dense point cloud feature map, and the densification work is completed, thereby improving multi-target tracking accuracy. The multi-object tracking accuracy MOTA of the algorithm is 0.668, the multi-object tracking precision MOTP is 0.250, the number MT of tracks in which at least 80% of a target's video frames are correctly tracked is 4851, the number ML of tracks in which at most 20% of a target's video frames are correctly tracked is 1405, the number of false positives FP is 15769, and the number of false negatives FN is 18643.
(3) Multi-source sensor feature-level fusion. Firstly, to address the missed detections and mislocation caused by depth-feature compression distortion during fusion, the invention designs a Gaussian scattering kernel by solving for the minimum Gaussian radius through a case analysis and distributes the target's feature information as close as possible to the Gaussian peak at the center of each ground-truth target, so that prediction boxes whose intersection over union with the ground-truth detection box is greater than 0.7 are not deleted by default, reducing missed detections and mislocation. Secondly, to address the difficulty existing multi-source fusion algorithms have in balancing real-time performance and accuracy, the invention adopts an orientation-free center-point feature-level fusion scheme: the initial fusion of the multi-source sensors is completed through the dense point cloud feature map, which reduces the detector's search space and lets the fusion and decision layers make fuller use of computing resources to improve accuracy, balancing the real-time performance and accuracy of the system. Finally, to address the insufficient target features caused by the limited number of points inside the target box, the method uses bilinear interpolation to extract the required features at the center of each face of the three-dimensional bounding box and at the located heat-map center point from the three-dimensional detection boxes and feature-map information output by the first stage, concatenates all target-box features and passes them to a multilayer perceptron to obtain the two-stage attribute refinement and confidence score prediction, improving the fusion effect. According to the evaluation indexes of target detection and tracking, the detection and tracking results for vehicles and pedestrians obtained from data set tests and real-vehicle experiments are analyzed and compared in terms of accuracy and error, and then compared with the results of other algorithms, verifying the effectiveness and feasibility of the proposed detection and tracking algorithm.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, are included to provide a further understanding of the application, and the description of the exemplary embodiments of the application are intended to be illustrative of the application and are not intended to limit the application. In the drawings:
FIG. 1 is a schematic flow chart of a vehicle and pedestrian detection and tracking method based on multi-source sensor fusion according to an embodiment of the present invention;
fig. 2 is a target detection result diagram of a vehicle and pedestrian detection tracking algorithm based on multi-source sensor fusion in an embodiment of the present invention, where (a) is a vehicle detection result diagram of a data set test, (b) is a vehicle detection result diagram of an actual vehicle test, (c) is a pedestrian detection result diagram of the data set test, and (d) is a pedestrian detection result diagram of the actual vehicle test;
fig. 3 is a target tracking result diagram of a vehicle and pedestrian detection tracking algorithm based on multi-source sensor fusion in an embodiment of the present invention, where (a) is an MT result diagram of a data set test, (b) is an MT result diagram of an actual vehicle experiment, (c) is an ML result diagram of a data set test, and (d) is an ML result diagram of an actual vehicle experiment.
Detailed Description
It should be noted that, in the present application, the embodiments and features of the embodiments may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
It should be noted that the steps illustrated in the flow charts of the figures may be performed in a computer system such as a set of computer executable instructions and that, although a logical order is illustrated in the flow charts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
Example one
The embodiment provides a vehicle and pedestrian detection and tracking method and system based on multi-source sensor fusion, wherein the method comprises the following steps:
acquiring image data of a camera, and carrying out distortion removal and instance segmentation processing on the image data to obtain a segmented approximate range of a detection object;
and converting the point cloud coordinates of the laser radar into a segmentation range to obtain a sparse point cloud depth map, generating a virtual point cloud according to the sparse point cloud in the foreground entity range segmented by the example and the interested view cone area of the two-dimensional random sampling point, and back-projecting the virtual point cloud to a three-dimensional space to obtain dense point cloud. And performing point cloud voxelization and encoding, and extracting the obtained columnar unit features to obtain dense point cloud feature mapping.
Adopting a heat-map-based central point positioning method for the dense point cloud feature mapping, obtaining the minimum Gaussian radius value through case discussion, then solving the true value of the central point positioning in combination with the Gaussian distribution function, obtaining the preliminary target detection bounding box features through central point feature regression, and storing them on the central feature of the object;

further refining the features of the preliminary target detection bounding box through the two-stage target detection network: bilinear interpolation is used to extract the features at the central point of each face of the three-dimensional bounding box and at the previously located heat-map central point, these are concatenated into a vector containing the whole three-dimensional target bounding box, and the vector is passed to the multilayer perceptron to obtain the feature refinement and confidence prediction of the size and orientation of the preliminary target detection bounding box.
After the target center point is located through the heat map and the Gaussian distribution function, the negative velocity estimate between adjacent frames is first calculated using the target velocity features obtained from the first-stage network regression, so as to infer the position of the target center in the previous frame; then, after the motion features of the current-frame target bounding box are extracted, nearest-neighbor matching gives the probability that objects in the two frames are the same target, the successfully tracked current-frame target inherits the identity number of the previous frame, and a newly appearing target is assigned a new identity number. For targets that are no longer tracked because of track mismatch over 3 consecutive frames, the last known velocity is used to estimate and update their positions in the current frame.
In some embodiments, the process of de-distorting and instance-segmenting the image data comprises: first performing coordinate calibration for radial distortion and tangential distortion, and then dividing the image into regions according to their features, so that features differ between regions but are essentially the same within a region, obtaining an object detection box and a labeled pixel region.
In some embodiments, the process of converting the laser radar point cloud coordinates into the segmentation range to obtain the sparse point cloud depth map and densifying the point cloud comprises: starting from the lidar spatial coordinate system C_1, converting to C_c after rotation and translation, then converting to the image coordinate system C_pic according to the pinhole imaging principle, and finally converting to the pixel coordinate system C_pix, so that the two-dimensional image pixel coordinates are matched with the three-dimensional point cloud coordinates.
In some embodiments, the process of point cloud densification comprises: randomly sampling points with depth information in the two-dimensional foreground entity segmentation region, applying nearest-neighbor depth estimation around the viewing-frustum region of interest, taking the depth of the associated point cloud as the depth of the current pixel, and projecting the matched randomly sampled two-dimensional points back into three-dimensional space to obtain a virtual point cloud, so that the virtual points and the originally projected sparse points carry the category information from instance segmentation.
In some embodiments, the process of point cloud voxelization, encoding and feature mapping includes: by means of pillar-unit encoding, the irregular point cloud is aggregated into voxels, the voxel information of fixed length, width and height is processed with a sparse convolutional network, and the result is converted into a two-dimensional map view to avoid time-consuming three-dimensional convolution. The three-dimensional unordered point cloud data are segmented into pillars and point-cloud encoded to convert the three-dimensional point cloud coordinates into a network input form. The point cloud is uniformly encoded, at a certain step size, into a grid of height H and width W on the Cartesian x-y (top-down) plane, and the grid cells containing point cloud data are then stretched along the z axis to obtain the pillar unit set P. The points within each pillar are then encoded into 9-dimensional feature vectors {x, y, z, r, x_c, y_c, z_c, x_p, y_p}, where x, y, z are the spatial coordinates of the point, r is the reflectivity, x_c, y_c, z_c are the offsets of the point from the coordinates of all points in the pillar, and x_p = x − x_c, y_p = y − y_c are the offsets of the point from the pillar center in space; one frame of point cloud data is thus encoded into a dense tensor (D, P, N) with feature dimension D, P pillars and N points per pillar.
In some embodiments, the process of point cloud feature mapping includes: first, each dense tensor containing D-dimensional features is processed with an activation function to generate a dense tensor (C, P, N) with C channels, P pillars and N points per pillar; max pooling is then applied to each pillar to obtain a dense tensor of dimension (C, P); finally the map view is generated by stacking, where the (C, P) tensor generated in the previous step is scattered back to the original pillar positions according to the (x, y) coordinate index recorded for each pillar during pillar construction and stacking, creating a map view (C, H, W) of width W, height H and C channels. In summary, the conversion chain is

(D, P, N) → (C, P, N) → (C, P) → (C, H, W)

The final result is the two-dimensional dense tensor (C, H, W) of the point cloud features, with C channels, height H and width W, which satisfies the conditions for two-dimensional convolution.
In some embodiments, the process of one-stage target center point positioning comprises: applying a heat-map-based center-point positioning method to the dense point cloud feature map; discussing by cases the three configurations of the prediction box and the truth box, namely the prediction box circumscribed by circles of radius r_1 centered on the two corner points of the truth box, the prediction box inscribed in circles of radius r_2 centered on the two corner points of the truth box, and the mixed case in which a circle of radius r_3 centered on the left corner point of the truth box is inscribed while a circle of radius r_3 centered on the right corner point is circumscribed; obtaining the minimum Gaussian radius value from these cases to design the Gaussian scattering kernel; solving the true value of the center point positioning in combination with the Gaussian distribution function; obtaining the preliminary target detection bounding box features through center point feature regression; and distributing the target's feature information as close as possible to the Gaussian peak at the center of each ground-truth target.
In some embodiments, the process of one-stage target detection box feature regression includes: after obtaining the center point positioning true value Y_xyk, six object attributes, namely the center-point heat map information, the center-point positioning offset, the target size, the height above ground, the target orientation and the target velocity, are stored on the object's center feature, and combining the object attributes output by the six one-stage feature regressions yields a preliminary three-dimensional target bounding box with complete state information.
In some embodiments, the process of two-stage target detection box feature regression includes: the second-stage network extracts the required additional point features from the three-dimensional target bounding box and feature map information output by the first-stage network; for these five points (the face center points of the three-dimensional bounding box and the previously located heat-map center point), the corresponding features are extracted from the two-dimensional dense point cloud feature map obtained above using bilinear interpolation; the extracted point features are then concatenated into a vector containing the overall three-dimensional target bounding box features, and the vector is passed to the multilayer perceptron MLP to obtain the final prediction of the size and orientation parameters of the first-stage preliminary bounding box and the confidence score combining the two detection stages.
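A minimal sketch of this second-stage feature extraction is given below: it samples the BEV feature map with bilinear interpolation at five points per box and concatenates the results into the vector fed to the refinement MLP. The feature-map size, the point coordinates and the helper names are illustrative assumptions, not the patent's exact implementation.

```python
import numpy as np

def bilinear_sample(feat: np.ndarray, x: float, y: float) -> np.ndarray:
    """Bilinearly interpolate a (C, H, W) feature map at continuous BEV coordinates (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, feat.shape[2] - 1), min(y0 + 1, feat.shape[1] - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * feat[:, y0, x0] + wx * (1 - wy) * feat[:, y0, x1]
            + (1 - wx) * wy * feat[:, y1, x0] + wx * wy * feat[:, y1, x1])

def second_stage_features(bev: np.ndarray, box_points: np.ndarray) -> np.ndarray:
    """Concatenate features sampled at the 5 points of one first-stage box (center + face centers)."""
    return np.concatenate([bilinear_sample(bev, px, py) for px, py in box_points])

bev = np.random.rand(64, 100, 100)                       # first-stage BEV feature map (C, H, W)
points = np.array([[50.0, 50.0], [48.0, 50.0], [52.0, 50.0], [50.0, 48.0], [50.0, 52.0]])
vec = second_stage_features(bev, points)                 # 5 * C vector fed to the refinement MLP
```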
In some embodiments, the process of target tracking comprises: after the target center point is located through the heat map and the Gaussian distribution function, the negative velocity estimate between adjacent frames is first calculated using the target velocity features obtained from the first-stage network regression, so as to infer the position of the target center in the previous frame; then, after the motion features of the current-frame target bounding box are extracted, similarity is used to calculate the probability that nearest-neighbor-matched objects in the two frames are the same target; finally, a successfully tracked current-frame target inherits the identity number of the previous frame, and a newly appearing target is assigned a new identity number. For targets that are no longer tracked because of track mismatch over 3 consecutive frames, the last known velocity is used to estimate and update their positions in the current frame. In this way, targets containing multiple peak centers can be analyzed in the same heat map, and three-dimensional target tracking is simplified to searching and matching the nearest target center points to realize velocity prediction and tracking.
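A minimal sketch of this velocity-based nearest-neighbor association is given below; the data layout (per-object centers plus the regressed velocity) and the distance threshold are illustrative assumptions, not parameters from the patent.

```python
import numpy as np

def associate(prev_centers: np.ndarray, prev_ids: list, curr_centers: np.ndarray,
              curr_vel: np.ndarray, dt: float, max_dist: float = 2.0):
    """Greedy nearest-neighbor matching after projecting current centers back with negative velocity."""
    guessed_prev = curr_centers - curr_vel * dt          # negative velocity estimate of previous position
    ids, next_id, used = [], (max(prev_ids) + 1 if prev_ids else 0), set()
    for g in guessed_prev:
        d = np.linalg.norm(prev_centers - g, axis=1) if len(prev_centers) else np.array([])
        j = int(d.argmin()) if len(d) else -1
        if j >= 0 and d[j] < max_dist and j not in used:
            ids.append(prev_ids[j]); used.add(j)         # inherit the identity of the matched object
        else:
            ids.append(next_id); next_id += 1            # newly appeared target gets a new identity
    return ids

prev_c = np.array([[10.0, 2.0], [30.0, -1.0]])
curr_c = np.array([[10.5, 2.0], [33.0, -1.0], [5.0, 8.0]])
vel = np.array([[1.0, 0.0], [6.0, 0.0], [0.0, 0.0]])
print(associate(prev_c, [7, 8], curr_c, vel, dt=0.5))    # -> [7, 8, 9]
```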
Vehicle and pedestrian detection tracker based on multisource sensor fusion includes:
the multi-source sensor feature level fusion module comprises: and converting the laser radar point cloud coordinates into a segmentation range to obtain a sparse point cloud depth map, generating a virtual point cloud according to the sparse point cloud in the foreground entity range segmented by the example and the interested view cone region of the two-dimensional random sampling point, and back-projecting the virtual point cloud to a three-dimensional space to obtain dense point cloud. Performing point cloud voxelization and encoding, and extracting the obtained columnar unit features to obtain dense point cloud feature mapping;
a target detection module: used for solving the true value of the center point positioning in combination with the Gaussian distribution function, obtaining the preliminary target detection bounding box features through center point feature regression and storing them on the object's center feature, further refining them through the two-stage target detection network, extracting with bilinear interpolation the features at the center point of each face of the three-dimensional bounding box and at the heat-map center point, concatenating them into a vector containing the whole three-dimensional target bounding box, and passing the vector to the multilayer perceptron to obtain the feature refinement and confidence prediction of the size and orientation of the preliminary target detection bounding box;
a target tracking module: used for calculating the negative velocity estimate between adjacent frames from the target velocity features obtained by the first-stage network regression so as to infer the position of the target center in the previous frame, extracting the motion features of the current-frame target bounding box, using similarity calculation and nearest-neighbor matching to determine the probability that objects in the two frames are the same target, letting a successfully tracked current-frame target inherit the identity number of the previous frame, and assigning a new identity number to a newly appearing target.
Example two
As shown in fig. 1 to 3, the present embodiment provides a specific implementation of a vehicle and pedestrian detection and tracking method based on multi-source sensor fusion, including:
2.1 image Pre-processing
Firstly, an approximate range of a detected object can be obtained after image data of a camera is subjected to distortion removal and instance segmentation, a laser radar point cloud coordinate is converted into a segmentation range to obtain a sparse point cloud depth map, then a virtual point cloud is generated according to the sparse point cloud in a foreground entity range segmented by a target and an interested view cone area of a two-dimensional random sampling point and is back projected to a three-dimensional space, then point cloud voxelization and coding are carried out on the obtained dense point cloud, the obtained column unit features are extracted to generate dense point cloud feature mapping, and effective data are provided for subsequent fusion with a central point based on a heat map.
(1) Image de-distortion
The image captured by the camera is distorted; the distortion mainly comprises pincushion distortion, barrel distortion and tangential distortion. The coordinates are calibrated separately for radial and tangential distortion using the following equations to obtain the image coordinates used for projection:

x_pic_rec = x_pic(1 + k_1 r² + k_2 r⁴ + k_3 r⁶) + 2 p_1 x_pic y_pic + p_2(r² + 2 x_pic²)
y_pic_rec = y_pic(1 + k_1 r² + k_2 r⁴ + k_3 r⁶) + p_1(r² + 2 y_pic²) + 2 p_2 x_pic y_pic

where x_pic_rec and y_pic_rec are the coordinates in the corrected image coordinate system, x_pic and y_pic are the coordinates in the original image coordinate system, k_1, k_2, k_3, p_1 and p_2 are the respective distortion coefficients, r is the distance of the point from the imaging center, and k_c = [k_1 k_2 p_1 p_2 k_3] alone is used to indicate the distortion.
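For illustration only, a minimal sketch of image undistortion with OpenCV, using the same k_1, k_2, p_1, p_2, k_3 coefficient layout, is shown below; the intrinsic matrix, coefficient values and file name are hypothetical placeholders rather than calibration results from the patent.

```python
import cv2
import numpy as np

# Hypothetical intrinsics and distortion coefficients (placeholders, not values from the patent).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.30, 0.10, 0.001, 0.001, -0.02])   # k1, k2, p1, p2, k3

img = cv2.imread("frame.png")                          # camera image to be corrected
undistorted = cv2.undistort(img, K, dist)              # removes radial + tangential distortion
```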
(2) Instance partitioning
Instance segmentation refers to dividing an image into regions according to features in computer vision such that features differ between regions but are essentially the same within a region. In instance segmentation, the image is subjected to the above-mentioned object detection and semantic segmentation respectively to obtain an object detection box and a labeled pixel region.
2.2 coordinate transformation and densification of Point clouds
In order to match the pixel coordinates of the two-dimensional image with the three-dimensional point cloud coordinates, four coordinate systems are introduced: the lidar spatial coordinate system, the camera spatial coordinate system, the image coordinate system and the pixel coordinate system. The overall conversion idea is to start from the lidar spatial coordinate system C_1, convert to C_c after rotation and translation, then convert to the image coordinate system C_pic according to the pinhole imaging principle, and finally convert to the pixel coordinate system C_pix.
(1) Conversion between laser radar space coordinate system and camera space coordinate system
The conversion is accomplished mainly by rotation and translation matrices. Let a point M have coordinates (x_1, y_1, z_1) in C_1 and (x_c, y_c, z_c) in C_c. The translation matrix T_1 moves the origin of C_1 to that of C_c; then, rotating about the Z_1 axis through the rotation matrix R_z brings C_1 parallel to the X_c and Y_c axes of C_c; similarly, rotating about the X_1 axis through R_x brings C_1 parallel to the Y_c and Z_c axes, and rotating about the Y_1 axis through R_y brings C_1 parallel to the X_c and Z_c axes.
Taking R_z as an example, let the rotation angle be θ; from the trigonometric relationships we obtain:

Figure BDA0003799851590000141

Combining and simplifying the formula gives:

Figure BDA0003799851590000142

Written in matrix form:

Figure BDA0003799851590000143

R_z is complemented to a 3 × 3 matrix:

Figure BDA0003799851590000144
the rotation angles of the x-axis and the y-axis are respectively alpha and beta, and then the rotation matrix R is solved respectively by the same method x And R y Then C is obtained c Coordinates of (2), then the rotation matrix R 1 The general formula of (c) is:
Figure RE-GDA0003842608680000142
obtaining a rotation matrix R 1 Post-solving translation matrix T 1 The function of which is to convert the laser radar space coordinate system C 1 And a camera spatial coordinate system C c Coordinate transformation is carried out, let x d 、y d 、z d For the distance between the two XYZ axes, then:
Figure BDA0003799851590000151
in summary, the available rotation matrix R 1 And translation matrix T 1 C is to be 1 Conversion to C c
Figure BDA0003799851590000152
C is to be 1 Complement is a 4 × 1 matrix:
Figure BDA0003799851590000153
conversion between camera spatial coordinate system and image coordinate system
Let point M have coordinates (x_c, y_c, z_c) in C_c and coordinates (x_pic, y_pic) in C_pic. According to the pinhole imaging principle and the similar-triangle relationship, point M is projected into the image coordinate system through the focal point; let the focal length be f and let the displacement of the two cameras in the X_c direction be b; then:

Figure BDA0003799851590000154

The matrix expression is:

Figure BDA0003799851590000155

wherein

Figure BDA0003799851590000156

is the transformation matrix K_1.
(2) Conversion between image coordinate system and pixel coordinate system
Since the units of the image coordinate system and the pixel coordinate system are different, a conversion is also required. Let the coordinates X_pic and Y_pic in the image coordinate system correspond to u_0 and v_0 in the pixel coordinate system, with corresponding lengths d_x and d_y; then:

Figure BDA0003799851590000161

This can be expressed as the matrix:

Figure BDA0003799851590000162

wherein

Figure BDA0003799851590000163

is the transformation matrix K_2.
(3) Conversion between laser radar space coordinate system and pixel coordinate system
In summary, in the ideal case the projection matrix P for the combined transformation is:
Figure BDA0003799851590000164
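For illustration, a minimal sketch of this projection chain (extrinsic rotation and translation followed by the intrinsic matrices and perspective division) is given below; the matrix values are hypothetical placeholders rather than calibration results from the patent, and a single 3 × 3 pinhole matrix stands in for the combined intrinsics.

```python
import numpy as np

# Hypothetical calibration (placeholders): extrinsic R (3x3), T (3x1) and a combined intrinsic K (3x3).
R = np.eye(3)
T = np.array([[0.0], [-0.08], [-0.27]])
K = np.array([[720.0, 0.0, 600.0],
              [0.0, 720.0, 170.0],
              [0.0, 0.0, 1.0]])

def lidar_to_pixel(points_lidar: np.ndarray) -> np.ndarray:
    """Project N x 3 lidar points into pixel coordinates (u, v) plus depth."""
    pts_cam = R @ points_lidar.T + T             # lidar frame -> camera frame
    uvw = K @ pts_cam                            # camera frame -> pixel frame (homogeneous)
    depth = uvw[2]
    uv = uvw[:2] / depth                         # perspective division
    return np.vstack([uv, depth]).T              # N x 3: u, v, depth

pixels = lidar_to_pixel(np.random.rand(100, 3) * 20.0)  # toy point cloud
```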
(4) Generation of dense point cloud depth map
The near clipping plane is close to the camera and the far clipping plane is far from it; they represent the minimum and maximum distances the camera can render, and the range between them is the field of view, called the viewing-frustum region of interest, which can be derived from similar triangles. However, the frustum obtained in this way is too large, so an allowable deviation in the depth direction is also introduced to limit the lidar point cloud within the region and obtain a more accurate region of interest.

Since the sparse point cloud projected into the image instance segmentation range cannot fully reflect the target's features in the sparse point cloud depth map, the viewing-frustum region of interest is generated using the sparse points and their depth values. Because point cloud depths are basically consistent within the same foreground object, points with depth information can be randomly sampled in the two-dimensional foreground segmentation region. Nearest-neighbor depth estimation is applied around the frustum region of interest, the depth of the associated point cloud is taken as the depth of the current pixel, and the matched randomly sampled two-dimensional points are then projected back into three-dimensional space to obtain a virtual point cloud, so that the virtual points and the originally projected sparse points both carry the category information from instance segmentation.
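A minimal sketch of this densification step, under simplifying assumptions, is shown below: it randomly samples pixels inside one instance mask, assigns each sample the depth of its nearest projected lidar point, and back-projects the result; the mask, intrinsics and helper names are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def densify_instance(mask: np.ndarray, proj_uv: np.ndarray, proj_depth: np.ndarray,
                     K: np.ndarray, num_samples: int = 50) -> np.ndarray:
    """Generate virtual 3D points inside one instance mask (H x W bool).

    proj_uv: M x 2 pixel coordinates of lidar points projected into the image.
    proj_depth: M depths of those points. K: 3 x 3 camera intrinsics.
    """
    vs, us = np.nonzero(mask)                                    # pixels of the foreground instance
    idx = np.random.choice(len(us), size=min(num_samples, len(us)), replace=False)
    samples = np.stack([us[idx], vs[idx]], axis=1).astype(float)

    tree = cKDTree(proj_uv)                                      # nearest-neighbor search over sparse projections
    _, nn = tree.query(samples)                                  # index of nearest projected lidar point
    depth = proj_depth[nn]                                       # inherit its depth

    ones = np.ones((len(samples), 1))
    rays = np.linalg.inv(K) @ np.hstack([samples, ones]).T       # back-project pixels to camera rays
    return (rays * depth).T                                      # virtual points in the camera frame (N x 3)
```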
2.3 processing of dense Point clouds
(1) Voxelization and coding
Because directly applying neural network convolution to the three-dimensional point cloud requires demanding hardware, is very inefficient and has poor real-time performance, the method introduces the concept of volume pixels (voxels), analogous to two-dimensional pixels. Using pillar-unit encoding, the irregular point cloud is aggregated into voxels, the voxel information of fixed length, width and height is processed with a sparse convolutional network, and the result is converted into a two-dimensional map view so as to avoid time-consuming three-dimensional convolution. Although this loses a certain amount of information, the information within each voxel is aggregated together to reduce the loss.

After the voxelization preprocessing, in order to convert the three-dimensional point cloud coordinates into a network input form, the three-dimensional unordered point cloud data must be segmented into pillars and point-cloud encoded. The point cloud is uniformly encoded, at a certain step size, into a grid of height H and width W on the Cartesian x-y (top-down) plane only, and the grid cells containing point cloud data are then stretched along the z axis to obtain the pillar unit set P.

The points within each pillar are then encoded into 9-dimensional feature vectors {x, y, z, r, x_c, y_c, z_c, x_p, y_p}, where x, y, z are the spatial coordinates of the point, r is the reflectivity, x_c, y_c, z_c are the offsets of the point from the coordinates of all points in the pillar, and x_p = x − x_c, y_p = y − y_c are the offsets of the point from the pillar center in space.

However, the point cloud has a certain sparsity in space and the pillars do not necessarily contain enough points, so dense pillars must be obtained by truncation and filling before convolution. Truncation means that when the number of points in a pillar exceeds the threshold N, random down-sampling reduces the number of points; filling means that when the number of points in a pillar is below the threshold N, the remainder is padded with zeros. By this method, one frame of point cloud data is encoded into a dense tensor (D, P, N) with feature dimension D, P pillars and N points per pillar.
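A minimal sketch of this pillar encoding is given below; the grid resolution, coordinate ranges, point threshold and the PointPillars-style layout of the 9-dimensional features (with x_p, y_p taken as offsets from the geometric pillar center) are assumptions for illustration, not the patent's exact parameters.

```python
import numpy as np

def encode_pillars(points, grid=(0.16, 0.16), x_range=(0.0, 69.12),
                   y_range=(-39.68, 39.68), max_points=32):
    """Encode an M x 4 point cloud (x, y, z, r) into a (D=9, P, N) dense tensor plus pillar coords."""
    keep = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    points = points[keep]
    ix = ((points[:, 0] - x_range[0]) / grid[0]).astype(int)
    iy = ((points[:, 1] - y_range[0]) / grid[1]).astype(int)
    keys = ix * 100000 + iy                                      # one key per occupied pillar
    pillars, coords = [], []
    for key in np.unique(keys):
        pts = points[keys == key]
        if len(pts) > max_points:                                # truncation: random down-sampling
            pts = pts[np.random.choice(len(pts), max_points, replace=False)]
        gx, gy = key // 100000, key % 100000
        cx = gx * grid[0] + x_range[0] + grid[0] / 2             # pillar center in space
        cy = gy * grid[1] + y_range[0] + grid[1] / 2
        mean = pts[:, :3].mean(axis=0)
        feat = np.zeros((max_points, 9))                         # filling: pad with zeros up to N
        feat[:len(pts), :4] = pts
        feat[:len(pts), 4:7] = pts[:, :3] - mean                 # x_c, y_c, z_c offsets from the pillar mean
        feat[:len(pts), 7] = pts[:, 0] - cx                      # x_p offset from the pillar center
        feat[:len(pts), 8] = pts[:, 1] - cy                      # y_p
        pillars.append(feat)
        coords.append((gx, gy))
    return np.stack(pillars).transpose(2, 0, 1), np.array(coords)   # (D, P, N), (P, 2)

tensor, coords = encode_pillars(np.random.rand(2000, 4) * [60, 60, 3, 1] + [0, -30, -1, 0])
```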
(2) Dense point cloud feature mapping
First, each dense tensor containing D-dimensional features is processed with an activation function to generate a dense tensor (C, P, N) with C channels, P pillars and N points per pillar; max pooling is then applied to each pillar to obtain a dense tensor of dimension (C, P); finally the map view is generated by stacking: the (C, P) tensor generated in the previous step is scattered back to the original pillar positions according to the (x, y) coordinate index recorded for each pillar during pillar construction and stacking, creating a map view (C, H, W) of width W, height H and C channels.

In summary, the conversion chain is

(D, P, N) → (C, P, N) → (C, P) → (C, H, W)

The final result is the two-dimensional dense tensor (C, H, W) of the point cloud features, with C channels, height H and width W, which satisfies the conditions for two-dimensional convolution.
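A minimal sketch of the per-pillar max pooling and the scatter back to the (C, H, W) map view is shown below; the shapes and the recorded coordinate format are illustrative assumptions.

```python
import numpy as np

def pillars_to_bev(pillar_feats: np.ndarray, coords: np.ndarray, H: int, W: int, C: int):
    """pillar_feats: (C, P, N) per-point features after the activation function.
    coords: (P, 2) integer (x, y) grid indices recorded when the pillars were built."""
    pooled = pillar_feats.max(axis=2)                 # max pooling over the N points -> (C, P)
    bev = np.zeros((C, H, W), dtype=pillar_feats.dtype)
    bev[:, coords[:, 1], coords[:, 0]] = pooled       # scatter each pillar back to its (x, y) cell
    return bev                                        # (C, H, W) map view ready for 2D convolution

# Toy usage (shapes only): features from a hypothetical pointwise layer plus random pillar indices.
feats = np.random.rand(64, 120, 32)                   # (C, P, N)
coords = np.random.randint(0, 100, size=(120, 2))
bev_map = pillars_to_bev(feats, coords, H=100, W=100, C=64)
```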
2.4 one-stage target regression
In the first stage, the network adopts a heat map-based central point positioning method for dense point cloud feature mapping, obtains a minimum Gaussian radius value through classification discussion, then solves a true value of central point positioning by combining a Gaussian distribution function, and obtains a primary target detection boundary frame feature through central point feature regression.
(1) Thermal map-based target center point positioning
The dense point cloud features obtained in the previous step are input into a two-dimensional target detection neural network for processing, performing feature-level fusion of the point cloud and the image, where C is the number of channels and H and W are the height and width of the two-dimensional mapping.
(1) Predicted value of center point positioning

The heat map prediction is

Ŷ ∈ [0, 1]^((W/R) × (H/R) × k)

where Ŷ_xyk represents the value at position (x, y) of the k-th channel of the heat map, the number of channels k corresponds to the number of channels C of the dense point cloud feature map, and R represents the output stride of the network, set to R = 4 by default.

(2) True value of center point positioning

When the predicted value Ŷ_xyk = 1, the position (x, y) of the heat map is a key point, and the key point position represents the center of an object detection box; when the predicted value Ŷ_xyk = 0, position (x, y) of the heat map is background; more generally, the predicted value lies between 0 and 1. For any key point, its true center position p in the original image is divided by the downsampling factor, i.e. the stride R, and rounded down to obtain the object center position p̃ = ⌊p/R⌋. All real key points are distributed on the heat map, and a Gaussian convolution kernel is used to generate a two-dimensional normal distribution; the value at (x, y) of any channel k of the heat map corresponds to the center of a target bounding box, so the center point positioning true value Y_xyk can be expressed as a Gaussian function:

Y_xyk = exp( −((x − p̃_x)² + (y − p̃_y)²) / (2σ_p²) )    (3-2)

where the target scale-adaptive variance σ_p is related to the object width and height; σ_p and Y_xyk are now solved in detail by means of the Gaussian distribution function.
(2) Gaussian distribution function
(1) Categorical discussion solving gaussian radius
Define the intersection over union of the predicted target box and the ground-truth target box as the overlap region overlap. The calculation of the Gaussian radius is first divided into three cases for discussion according to the tangency between the prediction box and the truth box, where inscribed and circumscribed refer to the relationship between the prediction box and the two circles:
the method comprises the following steps: the two corner points of the prediction box and the truth box are taken as the centers and r is taken as 1 Circumscribed by a circle of radius
First, the critical value of the gaussian radius r is calculated:
Figure BDA0003799851590000201
one-dimensional quadratic equation sorted as r:
4*overlap*r 2 +2*overlap*(h+w)*r+(overlap-1)*(h*w)=0 (3-5)
it is equivalent to solving this one-dimensional quadratic equation for r.
Let a =4 × overlap, b =2 × overlap (h + w), c = (overlap-1) × (h × w)
According to the root discriminant formula and r needs to be greater than 0, then:
Figure BDA0003799851590000202
is recorded as r 1
Case two: the prediction box is inscribed in circles of radius r_2 centered on the two corner points of the truth box.

First, the critical value of the Gaussian radius r is calculated:

Figure BDA0003799851590000203

which is rearranged into a quadratic equation in r:

4r² − 2(h+w)r + (1−overlap)hw = 0    (3-7)

Let a = 4, b = −2(h+w), c = (1−overlap)(h*w). According to the root formula and the requirement that r be greater than 0:

Figure BDA0003799851590000211

and the result is denoted r_2.
Case three: a circle of radius r_3 centered on the left corner point of the truth box is inscribed in the prediction box, and a circle of radius r_3 centered on the right corner point of the truth box circumscribes the prediction box.

First, the critical value of the Gaussian radius r is calculated:

Figure BDA0003799851590000212

which is rearranged into an equation in r:

Figure BDA0003799851590000213

Then let a = 1, b = −(w+h), and

Figure BDA0003799851590000214

to obtain:

Figure BDA0003799851590000215

and the result is denoted r_3.
(2) Solving the center point positioning true value from the Gaussian radius
The minimum of the three Gaussian radii, $r = \min(r_1, r_2, r_3)$, is taken as the Gaussian kernel radius f(w, h), a function of the width w and height h of the target detection frame, i.e.

$f(w, h) = r = \min(r_1, r_2, r_3)$  (3-10)

Finally, the target-scale-adaptive standard deviation $\sigma_p$ of any center point p in category k is expressed by means of the Gaussian distribution function as:

$\sigma_p = \max(f(w, h), \tau)$  (3-11)

where τ = 2 is the minimum Gaussian radius within the allowable range, so that the center point positioning true value $Y_{xyk}$ of formula (3-2) is expressed as:

$Y_{xyk} = \exp\!\left(-\dfrac{(x - \tilde{p}_x)^{2} + (y - \tilde{p}_y)^{2}}{2\max(f(w, h), \tau)^{2}}\right)$

In summary, the Gaussian scattering kernel is designed by solving for the minimum Gaussian radius through the case-by-case discussion, so that the feature information of each target is distributed as much as possible near the Gaussian peak at the center of the true target. In this way, predicted boxes whose intersection-over-union with the ground-truth detection box is not less than 0.7 are not discarded by default, which reduces missed detections and localization errors.
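As a concrete illustration of the procedure above, the following minimal NumPy sketch computes the three candidate radii from the coefficients stated in the text, takes their minimum as f(w, h), and renders the Gaussian truth value of formula (3-2) onto one heat-map channel; all function and variable names are illustrative and not taken from the patent.

    import numpy as np

    def gaussian_radius(w, h, overlap=0.7):
        # Case one: circumscribed circles, 4*ov*r^2 + 2*ov*(h+w)*r + (ov-1)*h*w = 0
        a1, b1, c1 = 4 * overlap, 2 * overlap * (h + w), (overlap - 1) * h * w
        r1 = (-b1 + np.sqrt(b1 ** 2 - 4 * a1 * c1)) / (2 * a1)
        # Case two: inscribed circles, 4*r^2 - 2*(h+w)*r + (1-ov)*h*w = 0
        a2, b2, c2 = 4, -2 * (h + w), (1 - overlap) * h * w
        r2 = (-b2 + np.sqrt(b2 ** 2 - 4 * a2 * c2)) / (2 * a2)
        # Case three: mixed tangency, r^2 - (h+w)*r + (1-ov)*h*w/(1+ov) = 0
        a3, b3, c3 = 1, -(h + w), (1 - overlap) * h * w / (1 + overlap)
        r3 = (-b3 + np.sqrt(b3 ** 2 - 4 * a3 * c3)) / (2 * a3)
        return min(r1, r2, r3)                     # f(w, h) as in (3-10)

    def draw_center_truth(heatmap, center, w, h, tau=2.0):
        # heatmap: (H/R, W/R) slice for one class k; center: quantized (x, y) = floor(p / R)
        sigma = max(gaussian_radius(w, h), tau)    # sigma_p as in (3-11)
        ys, xs = np.indices(heatmap.shape)
        gauss = np.exp(-((xs - center[0]) ** 2 + (ys - center[1]) ** 2) / (2 * sigma ** 2))
        np.maximum(heatmap, gauss, out=heatmap)    # keep the per-pixel maximum as Y_xyk
        return heatmap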
(3) One-stage target detection frame feature regression
After the center point positioning true value $Y_{xyk}$ has been obtained, the center point heat map information, the center point positioning offset, the target size, the ground clearance, the target orientation and the target velocity are stored on the center feature of the object, and a three-dimensional bounding box with complete state information can be obtained by combining the object attributes output by the regression of these six first-stage features. The ground clearance helps to locate objects in three dimensions and restores the height information discarded in the feature mapping; the target orientation is regressed continuously through the sine and cosine of the yaw angle, from which the heading of the target is predicted, and both involve few parameters and are relatively simple to solve; the target velocity provides an effective basis for subsequent target tracking. Therefore only the first three items are analyzed in this section, and the remaining features are not described further.
(1) Center point heat map loss function
The focal loss from the classification task loss functions is constructed to address the extreme imbalance between the numbers of positive and negative samples; the center point heat map loss function can be expressed as:

$L_k = -\dfrac{1}{N}\sum_{xyk}\begin{cases}\left(1-\hat{Y}_{xyk}\right)^{\alpha}\log\hat{Y}_{xyk}, & Y_{xyk}=1\\ \left(1-Y_{xyk}\right)^{\beta}\hat{Y}_{xyk}^{\alpha}\log\left(1-\hat{Y}_{xyk}\right), & \text{otherwise}\end{cases}$

where N is the number of targets in the input picture, which is also the number of key points in the heat map; xyk indexes the coordinate points of the k classes of targets over all heat maps; $\hat{Y}_{xyk}$ is the detection prediction and $Y_{xyk}$ is the true value; $\log\hat{Y}_{xyk}$ and $\log\left(1-\hat{Y}_{xyk}\right)$ are the cross-entropy terms; $\left(1-\hat{Y}_{xyk}\right)^{\alpha}$ and $\hat{Y}_{xyk}^{\alpha}$ address the problem of the loss function being dominated by easy samples; and $\left(1-Y_{xyk}\right)^{\beta}$ adjusts the weight of the points around the center point to prevent false detections. The hyper-parameters α = 2 and β = 4 are typically used to balance the samples.
In the formula, when $Y_{xyk} = 1$: if the prediction $\hat{Y}_{xyk}$ is close to 1, i.e. an easily classified sample, the term $\left(1-\hat{Y}_{xyk}\right)^{\alpha}$ makes the loss small, so the training weight, that is the loss value, of that point is appropriately reduced; if $\hat{Y}_{xyk}$ is close to 0, i.e. a hard-to-classify sample, the term $\left(1-\hat{Y}_{xyk}\right)^{\alpha}$ makes the loss large, so center points that have not yet been learned receive a larger weight.

When $Y_{xyk} \neq 1$: if $\hat{Y}_{xyk}$ is close to 1, i.e. the point is predicted as lying close to the virtual center point, the term $\hat{Y}_{xyk}^{\alpha}$ increases the loss value and hence the training weight, but since such a point is close to the center point, $\left(1-Y_{xyk}\right)^{\beta}$ is needed to reduce the training weight slightly, thereby balancing the training; if $\hat{Y}_{xyk}$ is close to 0, i.e. the point is predicted as lying far from the virtual center point, $\left(1-Y_{xyk}\right)^{\beta}$ increases the loss value and hence the training weight, but since the point is far from the center point, $\hat{Y}_{xyk}^{\alpha}$ is needed to reduce the training weight slightly, again balancing the training.
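As a concrete illustration, a minimal NumPy sketch of the modified focal loss described above (with α = 2 and β = 4) might look as follows; the tensor layout (K, H/R, W/R) and all names are illustrative assumptions.

    import numpy as np

    def center_heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-12):
        # pred and gt: heat maps of the same shape, values in (0, 1); gt == 1 marks key points
        pos = (gt == 1.0)                              # key-point locations, Y_xyk = 1
        neg = ~pos                                     # all other locations
        pos_loss = ((1 - pred[pos]) ** alpha) * np.log(pred[pos] + eps)
        neg_loss = ((1 - gt[neg]) ** beta) * (pred[neg] ** alpha) * np.log(1 - pred[neg] + eps)
        n = max(int(pos.sum()), 1)                     # N: number of key points / targets
        return -(pos_loss.sum() + neg_loss.sum()) / n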
(2) Center point positioning offset loss function
As analyzed for the Gaussian distribution of the center point at any position in the preceding center point positioning, it can be seen from formula (3-2) that the true object center $\tilde{p}$ on the heat map is obtained from the object center position p on the original image by downsampling with stride R and then approximate rounding, so that a position deviation of the center point exists. In order to compensate for the quantization error caused by the output stride and to accurately predict the position of the center point of the bounding box in the input image, this patent additionally calculates the center point positioning offset, given by:

$o_p = \dfrac{p}{R} - \tilde{p}$

An L1 loss from the regression task loss functions, i.e. the mean absolute error between the predicted value and the true value, is constructed for supervised training, and the center point offset loss function is expressed as:

$L_{offset} = \dfrac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}} - \left(\dfrac{p}{R} - \tilde{p}\right)\right|$

where N is the number of targets in the input picture, which is also the number of key points in the heat map; p is the center point coordinate of a target in the picture; and $\hat{O}_{\tilde{p}}$ represents the offset predicted by the network.
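As a small numeric illustration of this quantization offset (the stride R = 4 follows the text, while the center coordinate is an arbitrary example):

    import numpy as np

    p = np.array([123.0, 57.0])      # object center in the original image (illustrative values)
    R = 4                            # output stride
    p_tilde = np.floor(p / R)        # quantized heat-map center: (30, 14)
    offset = p / R - p_tilde         # offset regression target: (0.75, 0.25)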
(3) Target size loss function
Let $\left(x_1^{(k)}, y_1^{(k)}, x_2^{(k)}, y_2^{(k)}\right)$ denote the bounding box coordinates of object k; the center point of the bounding box is then

$p_k = \left(\dfrac{x_1^{(k)} + x_2^{(k)}}{2},\ \dfrac{y_1^{(k)} + y_2^{(k)}}{2}\right)$

The key point estimate $\hat{Y}$ is used to predict the center points, after which the length and width of each target k are obtained by regression, with

$s_k = \left(x_2^{(k)} - x_1^{(k)},\ y_2^{(k)} - y_1^{(k)}\right)$

representing the target size after downsampling. To reduce the computational burden, this patent uses a single size prediction head with two channels at the output resolution,

$\hat{S} \in \mathbb{R}^{\frac{W}{R}\times\frac{H}{R}\times 2}$

to predict the target size. An L1 loss from the regression task loss functions, i.e. the mean absolute error between the predicted value and the true value, is constructed for supervised training, and the target size prediction loss function is expressed as:

$L_{size} = \dfrac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{p_k} - s_k\right|$

where N is the number of targets in the input picture, which is also the number of key points in the heat map, and $\hat{S}_{p_k}$ is the predicted target size at the center point coordinate $p_k$.
(4) Overall objective loss function
The overall objective loss function is a combination of the above three losses with different weights, namely:

$L_{det} = L_k + \lambda_{offset} L_{offset} + \lambda_{size} L_{size}$  (3-19)

where $\lambda_{offset} = 1$ and $\lambda_{size} = 0.1$.
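Reusing the focal-loss sketch given earlier, the combination in (3-19) can be sketched as follows; the argument shapes (offsets and sizes gathered at the N key points) and all names are illustrative assumptions.

    import numpy as np

    def detection_loss(heat_pred, heat_gt, off_pred, off_gt, size_pred, size_gt,
                       lambda_offset=1.0, lambda_size=0.1):
        # off_* and size_* are arrays of shape (N, 2) gathered at the N key points
        n = max(int((heat_gt == 1.0).sum()), 1)              # N: number of targets / key points
        l_k = center_heatmap_focal_loss(heat_pred, heat_gt)  # focal-loss sketch from above
        l_offset = np.abs(off_pred - off_gt).sum() / n       # L1 center point offset loss
        l_size = np.abs(size_pred - size_gt).sum() / n       # L1 target size loss
        return l_k + lambda_offset * l_offset + lambda_size * l_size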
2.5 Two-stage target detection frame feature regression
After the first-stage target features are obtained, the number of point cloud points within each three-dimensional detection is small, so predicting the shape and orientation of the whole object from such limited points may not provide enough information to recover the complete target features; for example, a single sensor usually sees only part of the target rather than its center point. To address this problem, feature regression is performed again in a second refinement stage using a lightweight point cloud feature extractor.
In the second stage, the required additional point features are extracted from the three-dimensional target bounding box and the feature map information output by the first stage. In theory the three-dimensional target bounding box has six face center points, but because the projections of the centers of the top and bottom faces coincide with the box center in the two-dimensional dense point cloud feature mapping, only the center points of the four side faces and the previously obtained target center point need to be considered. For these five points, the corresponding features are extracted from the two-dimensional dense point cloud feature mapping obtained in the second chapter by bilinear interpolation; the extracted point features are then concatenated into a vector containing the features of the whole three-dimensional target bounding box and passed to a multilayer perceptron (MLP) to obtain the refinement of the first-stage preliminary target bounding box size and orientation prediction parameters and the final confidence score prediction of the target detection combining the two stages. Let $IoU_t$ denote the intersection-over-union between the t-th three-dimensional target boundary prediction box and the ground-truth three-dimensional target box; the second-stage true confidence score $I_t$ of the t-th three-dimensional target boundary prediction box can then be expressed as:
$I_t = \min\left(1,\ \max\left(0,\ 2\times IoU_t - 0.5\right)\right)$  (3-20)

A binary cross-entropy loss from the classification task loss functions is constructed for supervised training, providing a surrogate loss for unbiased estimation that measures the distance between the predicted value and the true value:

$L_{score} = -I_t \log\hat{I}_t - \left(1 - I_t\right)\log\left(1 - \hat{I}_t\right)$

where $\hat{I}_t$ is the second-stage predicted confidence score of the t-th three-dimensional target boundary prediction box.
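A minimal sketch of this confidence target and its binary cross-entropy loss, assuming NumPy arrays of per-box IoUs and predicted scores (names are illustrative):

    import numpy as np

    def confidence_target(iou_t):
        return np.minimum(1.0, np.maximum(0.0, 2.0 * iou_t - 0.5))   # I_t as in (3-20)

    def score_loss(i_hat, iou_t, eps=1e-12):
        i_t = confidence_target(iou_t)
        return -np.mean(i_t * np.log(i_hat + eps) + (1.0 - i_t) * np.log(1.0 - i_hat + eps))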
During inference, the confidence of the target box produced by the first-stage network, i.e. the heat map value $\hat{Y}_{xyk}$ of formula (3-1) taken at the box center (denoted $\hat{Y}_t$ for the t-th box), is considered together with the second-stage score, and the confidence scores of the two stages are combined by their geometric mean to obtain the final predicted confidence score $\sqrt{\hat{Y}_t \cdot \hat{I}_t}$.
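The second-stage refinement described above can be sketched as follows: features are bilinearly interpolated from the BEV feature map at the box center and the four side-face centers, concatenated, and passed to a multilayer perceptron, after which the two confidence scores are fused by their geometric mean. The feature-map layout (C, H, W), the placeholder mlp callable and all names are illustrative assumptions.

    import numpy as np

    def bilinear_sample(feature_map, x, y):
        # feature_map: (C, H, W); (x, y) assumed to lie inside the feature map
        c, h, w = feature_map.shape
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
        dx, dy = x - x0, y - y0
        return ((1 - dx) * (1 - dy) * feature_map[:, y0, x0] +
                dx * (1 - dy) * feature_map[:, y0, x1] +
                (1 - dx) * dy * feature_map[:, y1, x0] +
                dx * dy * feature_map[:, y1, x1])

    def second_stage_score(feature_map, center_xy, side_face_centers_xy, mlp, y_hat):
        # Gather features at the box center and the four side-face centers (5 points in total).
        points = [center_xy] + list(side_face_centers_xy)
        feats = np.concatenate([bilinear_sample(feature_map, x, y) for x, y in points])
        i_hat = mlp(feats)                                   # second-stage confidence prediction
        return np.sqrt(np.clip(y_hat * i_hat, 0.0, None))    # geometric mean of the two stages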
3.0 Vehicle and pedestrian tracking based on multi-source sensor feature level fusion
Unlike most existing target tracking algorithms, which first detect targets and then perform data association, this method integrates target detection and data association in one neural network and represents each target by a center point without an intrinsic orientation. Like the other regression targets, the velocity estimate is supervised with an L1 loss against the true target displacement over the current time step. Once the center points have been obtained, the tracking process is greatly simplified. After the target center points are located through the heat map and the Gaussian distribution function, the negative velocity estimate between two consecutive frames is first computed from the target velocity feature regressed by the first-stage network, so that the position of the target center in the previous frame can be inferred. Then, after the motion features of the current-frame target bounding boxes are extracted, the probability that objects in the two frames are the same target is computed by similarity and nearest-neighbor matching. Finally, a successfully tracked current-frame target inherits the identity number of the previous frame, while a newly appearing target is assigned a new identity number. For objects that are no longer tracked for 3 consecutive frames due to track mismatch, this patent uses the last known velocity to estimate and update their positions in the current frame. In this way, targets with multiple peak centers can be analyzed in the same heat map, and three-dimensional target tracking is reduced to searching for and matching the nearest target center point for velocity prediction and tracking, which greatly reduces the computational load.
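A minimal sketch of one tracking step as described above: each current detection is projected back to the previous frame with its negative velocity, greedily matched to the nearest previous center, and identity numbers are inherited or newly assigned, while unmatched tracks are coasted with their last known velocity for up to 3 frames. The dictionary layout, the distance threshold and all names are illustrative assumptions, not part of the patent.

    import numpy as np

    def track_step(tracks, detections, dt, dist_thresh=2.0, max_age=3):
        # tracks: list of dicts with 'id', 'center' (x, y), 'velocity' (vx, vy), 'age'
        # detections: list of dicts with 'center' and 'velocity' from the first-stage regression
        next_id = max([t['id'] for t in tracks], default=-1) + 1
        unmatched = set(range(len(tracks)))
        for det in detections:
            # Back-project the detection to the previous frame using the negative velocity.
            prev_pos = np.asarray(det['center']) - np.asarray(det['velocity']) * dt
            best, best_d = None, dist_thresh
            for i in unmatched:
                d = np.linalg.norm(prev_pos - np.asarray(tracks[i]['center']))
                if d < best_d:
                    best, best_d = i, d
            if best is not None:                  # inherit the identity number of the previous frame
                det['id'] = tracks[best]['id']
                unmatched.discard(best)
            else:                                 # newly appearing target gets a new identity number
                det['id'] = next_id
                next_id += 1
            det['age'] = 0
        # Coast unmatched tracks with the last known velocity for up to max_age frames.
        for i in unmatched:
            t = tracks[i]
            t['age'] += 1
            if t['age'] <= max_age:
                t['center'] = (np.asarray(t['center']) + np.asarray(t['velocity']) * dt).tolist()
                detections.append(t)
        return detections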
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A vehicle and pedestrian detection tracking method based on multi-source sensor fusion is characterized by comprising the following steps:
acquiring image data, and performing distortion removal and instance segmentation processing on the image data to acquire a detection object range;
acquiring a sparse point cloud depth map based on the detection object range and the laser radar point cloud coordinates;
performing point cloud densification on the sparse point cloud depth map to obtain a dense point cloud depth map;
performing voxelization, encoding and feature mapping on the dense point cloud in the dense point cloud depth map to obtain a dense tensor of the two-dimensional mapping of the dense point cloud features;
performing one-stage feature regression on the dense tensor to obtain a primary target detection bounding box;
performing two-stage feature regression on the preliminary target detection bounding box to obtain refined features of the size and the orientation of the preliminary target detection bounding box and a confidence coefficient predicted value;
and tracking the vehicles and the pedestrians based on the preliminary target detection bounding box and nearest neighbor matching.
2. The multi-source sensor fusion based vehicle and pedestrian detection and tracking method of claim 1, wherein the process of performing de-distortion and instance segmentation processing on the image data comprises:
calibrating the coordinates of the radial distortion and tangential distortion of the image data, dividing the image data into regions, acquiring a labeled target detection frame and pixel area, and acquiring the detection object range based on the target detection frame and the pixel area.
3. The multi-source sensor fusion based vehicle and pedestrian detection and tracking method according to claim 1, wherein the process of obtaining a sparse point cloud depth map based on the detection object range and lidar point cloud coordinates comprises:
matching the pixel coordinates of the two-dimensional image with the coordinates of the three-dimensional point cloud based on the laser radar spatial coordinate system, the camera spatial coordinate system, the image coordinate system and the pixel coordinate system.
4. The multi-source sensor fusion-based vehicle and pedestrian detection and tracking method according to claim 1, wherein the point cloud densification of the sparse point cloud depth map comprises:
randomly sampling points in the two-dimensional foreground instance segmentation area carrying depth information, and performing depth estimation by nearest neighbor retrieval around the view frustum region of interest;
and taking the depth of the associated point cloud as the depth of the current pixel, and projecting the matched randomly sampled two-dimensional points back into three-dimensional space to obtain a virtual point cloud, so that the virtual point cloud and the originally projected sparse point cloud both carry the category information from the instance segmentation.
5. The method for detecting and tracking the vehicles and the pedestrians based on the multi-source sensor fusion as claimed in claim 1, wherein the process of performing the voxelization, encoding and feature mapping on the dense point cloud in the dense point cloud depth map comprises:
aggregating the irregular point cloud in the dense point cloud depth map into voxels by means of a columnar (pillar) unit encoding scheme, processing and fixing the length, width and height voxel information based on a sparse convolutional network, and converting the voxel information into a two-dimensional map view.
6. The method for detecting and tracking the vehicles and pedestrians based on the multi-source sensor fusion according to claim 1, wherein the step of performing the one-stage feature regression on the dense tensor comprises:
performing feature mapping on the dense point cloud by adopting a heat map-based central point positioning method, acquiring a minimum Gaussian radius value based on classification discussion, solving a true value of central point positioning based on a Gaussian distribution function, performing feature regression on the central point, acquiring feature information of a primary target detection boundary frame, storing object attributes on the central point of a detected object, and acquiring a primary three-dimensional target boundary frame by combining the object attributes; the object attributes comprise center point heat map information, center point positioning offset, target size, ground clearance, target orientation and target speed.
7. The multi-source sensor fusion-based vehicle and pedestrian detection and tracking method according to claim 1, wherein the process of performing two-stage feature regression on the feature information of the preliminary target detection bounding box comprises:
extracting additional point features based on the preliminary three-dimensional target bounding box and the feature information obtained by the one-stage feature regression, and extracting the corresponding features from the dense tensor of the dense point cloud feature mapping by a bilinear interpolation method;
connecting the corresponding features to obtain a feature vector of the whole three-dimensional target boundary frame;
and transmitting the feature vector of the whole three-dimensional target bounding box into a multi-layer perceptron MLP to obtain refined features and confidence degree prediction of the size and the orientation of the preliminary three-dimensional target bounding box.
8. The multi-source sensor fusion-based vehicle and pedestrian detection and tracking method according to claim 1, wherein the process of performing vehicle and pedestrian tracking based on the preliminary target detection bounding box and nearest neighbor matching comprises:
calculating and predicting the position of the center of the detection target in the previous frame based on the feature information obtained by the one-stage feature regression, extracting the motion features of the three-dimensional target bounding box of the current frame, calculating through similarity and nearest neighbor matching the probability that the detection targets of the previous and current frames are the same target, the successfully tracked current-frame target inheriting the identity number of the previous frame while a newly appearing target is assigned a new identity number, and simplifying the three-dimensional target tracking into searching for and matching the nearest target center point for velocity prediction and tracking.
9. A vehicle and pedestrian detection tracking system based on multi-source sensor fusion, comprising: the system comprises an image processing module, a multi-source sensor feature level fusion module, a target detection module and a target tracking module;
the image processing module is used for acquiring image data, and performing distortion removal and instance segmentation processing on the image data to acquire a detection object range;
the multi-source sensor feature level fusion module is used for acquiring dense point cloud feature mapping according to the detection object range;
the target detection module performs one-stage and two-stage feature regression on the dense point cloud feature mapping, and obtains a target detection result through a multilayer perceptron;
and the target tracking module is used for tracking the target according to the characteristic regression result.
10. The multi-source sensor fusion based vehicle and pedestrian detection and tracking system of claim 9, wherein:
the multi-source sensor feature level fusion module acquires a sparse point cloud depth map from the detection object range and the laser radar point cloud coordinates, performs point cloud densification on the sparse point cloud depth map to obtain a dense point cloud depth map, performs voxelization and encoding on the dense point cloud in the dense point cloud depth map, and performs feature mapping to obtain the dense point cloud feature mapping;
the target detection module solves the true value of the center point positioning by combining a Gaussian distribution function, obtains the preliminary target detection bounding box features through center point feature regression and stores them on the center feature of the object, and further refines the preliminary target detection bounding box features, wherein the center points of the faces of the preliminary target detection bounding box and the heat map center point features are extracted by bilinear interpolation, concatenated into a vector containing the whole three-dimensional target bounding box, and transmitted to a multilayer perceptron to obtain the feature refinement and confidence prediction of the size and orientation of the preliminary target detection bounding box, so as to obtain the final target detection result;
the target tracking module calculates and predicts the position of the center of the detection target in the previous frame according to the feature information obtained by the target detection module, extracts the motion features of the three-dimensional target bounding box of the current frame, calculates through similarity and nearest neighbor matching the probability that the detection targets of the previous and current frames are the same target, the successfully tracked current-frame target inheriting the identity number of the previous frame while a newly appearing target is assigned a new identity number, and simplifies the three-dimensional target tracking into searching for and matching the nearest target center point for target tracking.
CN202210979600.1A 2022-08-16 2022-08-16 Vehicle and pedestrian detection tracking method and system based on multi-source sensor fusion Pending CN115272416A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210979600.1A CN115272416A (en) 2022-08-16 2022-08-16 Vehicle and pedestrian detection tracking method and system based on multi-source sensor fusion

Publications (1)

Publication Number Publication Date
CN115272416A true CN115272416A (en) 2022-11-01

Family

ID=83750561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210979600.1A Pending CN115272416A (en) 2022-08-16 2022-08-16 Vehicle and pedestrian detection tracking method and system based on multi-source sensor fusion

Country Status (1)

Country Link
CN (1) CN115272416A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304992A (en) * 2023-05-22 2023-06-23 智道网联科技(北京)有限公司 Sensor time difference determining method, device, computer equipment and storage medium
CN116626630A (en) * 2023-07-25 2023-08-22 北京赛目科技股份有限公司 Object classification method and device, electronic equipment and storage medium
CN116626630B (en) * 2023-07-25 2023-09-29 北京赛目科技股份有限公司 Object classification method and device, electronic equipment and storage medium
CN117017276A (en) * 2023-10-08 2023-11-10 中国科学技术大学 Real-time human body tight boundary detection method based on millimeter wave radar
CN117612129A (en) * 2024-01-24 2024-02-27 苏州元脑智能科技有限公司 Vehicle dynamic perception method, system and dynamic perception model training method
CN117612129B (en) * 2024-01-24 2024-04-16 苏州元脑智能科技有限公司 Vehicle dynamic perception method, system and dynamic perception model training method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination