CN111626217B - Target detection and tracking method based on two-dimensional picture and three-dimensional point cloud fusion - Google Patents


Info

Publication number
CN111626217B
CN111626217B
Authority
CN
China
Prior art keywords
point cloud
dimensional
image
dimensional point
target
Prior art date
Legal status
Active
Application number
CN202010466491.4A
Other languages
Chinese (zh)
Other versions
CN111626217A (en)
Inventor
邬松渊
赵捷
Current Assignee
Ningbo Boden Intelligent Technology Co ltd
Original Assignee
Ningbo Boden Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Ningbo Boden Intelligent Technology Co ltd filed Critical Ningbo Boden Intelligent Technology Co ltd
Priority to CN202010466491.4A priority Critical patent/CN111626217B/en
Publication of CN111626217A publication Critical patent/CN111626217A/en
Application granted granted Critical
Publication of CN111626217B publication Critical patent/CN111626217B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection and tracking method based on the fusion of two-dimensional pictures and three-dimensional point clouds, which relates to the field of target detection and tracking for automatic driving and comprises the following steps: S100, pre-training a DeepLabv3+ model; S200, converting three-dimensional point cloud data into a specified format; S300, preprocessing the three-dimensional point cloud data in the specified format; S400, training a PointRCNN-DeepLabv3+ model; S500, updating and tracking the target state. According to the invention, each laser data point feature contains spatial information together with an image semantic segmentation result, which improves the recognition performance of PointRCNN and raises the accuracy of recognizing pedestrian targets that are small and highly similar to their surroundings.

Description

Target detection and tracking method based on two-dimensional picture and three-dimensional point cloud fusion
Technical Field
The invention relates to the field of automatic driving target detection and tracking, in particular to a target detection and tracking method based on fusion of a two-dimensional picture and a three-dimensional point cloud.
Background
At present, unmanned driving has reached the stage of L3-level deployment, and automobile OEMs, automatic-driving start-ups, automotive system suppliers and university research institutes have all made deployment their current focus. The core functional modules of automatic driving consist of a perception layer, a decision layer and a control layer. The perception layer mainly collects information about the surrounding environment through devices such as lidar, millimeter-wave radar and vision sensors. The unmanned system performs target detection on the acquired images, three-dimensional point clouds and other data, and recognition methods such as scene segmentation give the unmanned vehicle an understanding of its surroundings, enabling specific functions such as autonomous cruising, automatic lane changing, traffic sign recognition, traffic-jam autopilot and high-speed driving. Unlike vision sensors, lidar can effectively improve the accuracy with which the vehicle models its external environment. Combining research and practice, the key lidar technologies in automatic driving fall into three-dimensional point cloud segmentation, road extraction, environment modeling, obstacle detection and tracking, and fusion of information from multiple sensors. A lidar can produce millions of three-dimensional points per second, and common clustering algorithms cannot meet the requirement of real-time computation on such data. Three-dimensional point cloud segmentation partitions the point cloud according to the global and local features of its distribution in order to quickly extract useful object information, forming a number of independent subsets. Each subset is expected to correspond to a perceived target with physical meaning that reflects the geometric and pose characteristics of the target object. Three-dimensional point cloud segmentation is an important basis for the subsequent target classification and tracking performance of the lidar. Currently, three-dimensional point cloud segmentation and object detection methods based on deep learning prevail.
In general, deep neural networks require input with a normalized format, such as two-dimensional images or time-sequential speech. Raw three-dimensional point cloud data, however, are unordered point sets in space. Suppose a point cloud contains N three-dimensional points, each represented by (x, y, z) coordinates; even ignoring occlusion, viewpoint changes and so on, these points can be arranged in N! different orders. We therefore need to design a function whose value is independent of the order of the input data.
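To illustrate such an order-independent function, the following is a minimal numpy sketch (not from the patent) in the spirit of PointNet's symmetric max-pooling: a shared per-point mapping followed by a max over the point dimension yields the same feature regardless of point order.

```python
import numpy as np

def set_feature(points: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """points: (N, 3) unordered point set; weight: (3, C) shared per-point linear map."""
    per_point = np.maximum(points @ weight, 0.0)   # shared "MLP" applied to every point
    return per_point.max(axis=0)                   # max pooling removes order dependence

rng = np.random.default_rng(0)
pts = rng.normal(size=(128, 3))
w = rng.normal(size=(3, 16))
f1 = set_feature(pts, w)
f2 = set_feature(pts[rng.permutation(128)], w)     # same points, shuffled order
assert np.allclose(f1, f2)                         # identical feature vector
```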
In practice, training a deep neural network requires a large amount of labeled data. The labeling of three-dimensional point cloud data on the market is still mostly done manually, and annotators produce many false detections and missed detections and cannot guarantee precision. To address this pain point, an automatic labeling tool combined with a deep-learning algorithm is necessary.
Current three-dimensional point cloud target recognition methods can be divided into two major categories: grid-based methods and laser-point-based methods. Grid-based methods convert the unordered three-dimensional points into ordered features such as 3D voxels or 2D bird's-eye-view features and then apply 3D or 2D convolutional neural networks for 3D target recognition. To address the information loss in the data conversion of grid-based methods, the current mainstream approach is to fuse multiple sensors so that information can be supplemented and corrected. For example, MV3D-Net, which is being industrialized, fuses vision and laser point cloud information: it searches for target regions of interest only through the top view and front view of the three-dimensional point cloud and combines image features for target recognition, unlike earlier voxel-based methods, trading off computational complexity against the information lost in the feature conversion process. The AVOD model takes the bird's-eye view of the point cloud and the corresponding image as input, crops and scales them with a 3D anchor grid, fuses the features of the output regions of interest, and finally obtains target recognition results through a fully connected network. MMF processes the lidar data in two stages: on one hand, the original RGB image is combined with depth features and stitched into an RGBD image used as complementary information for feature extraction; on the other hand, the lidar data are converted to a bird's-eye view, coarse regions of interest are proposed by a deep network, and the laser point cloud features and image features within these regions are stitched and fused for bounding-box refinement, yielding more accurate recognition results. ContFuse performs deep continuous fusion of the three-dimensional point cloud and images across multiple scales and sensors through a two-stream network, achieving high-precision three-dimensional target detection and localization.
The other category comprises laser-point-based recognition methods, which extract effective features directly from the laser point cloud data and have become popular since PointNet and PointNet++ were proposed. Because PointNet needs no data-point preprocessing and handles the disorder of the point cloud with a pooling operation, information loss is effectively avoided and the final recognition results are relatively accurate. F-PointNet, the first network to use PointNet for target recognition, searches for 2D regions of interest with Mask R-CNN, obtains the laser points inside each region using depth information, and performs feature extraction and regression of the target bounding-box parameters through two PointNet stages. PointRCNN relies on laser point cloud data alone: a first-stage PointNet performs feature extraction and region-of-interest extraction, and a second stage performs target recognition and refinement, obtaining superior recognition results without supplementary image information.
Because of the 3D spatial characteristics of the laser point cloud, targets cannot overlap one another as they do in 2D space, so there are relatively few interfering factors and less difficulty in multi-target tracking. Most current 3D target tracking schemes follow tracking-by-detection: targets in three-dimensional space are identified by a recognition model, the recognition results of the current frame are compared and matched with the tracking results of several previous frames, and finally the tracking model is updated. At present, AB3DMOT, the tracking model with the highest processing frame rate in three-dimensional point cloud space, can track targets using only the PointRCNN model for point cloud target recognition and a 3D Kalman filter.
The labeling software commonly used in the market currently has the following problems:
1. Pedestrian recognition is difficult: with the excellent performance of deep learning in fields such as images and lidar, more and more capable target detection and tracking algorithms have been proposed. Because of its physical characteristics, lidar provides precise distance information that ordinary cameras lack and largely avoids mutual occlusion between targets, so it has received increasing attention from researchers during the development of automatic driving. Taking the car-class tracking benchmark of the KITTI test data set (a standard benchmark for automatic driving) as an example, multi-target tracking accuracy on laser point clouds can reach 88.89% at best. The effect of tracking pedestrians, however, is much poorer. Analyzing the two types of point cloud structures shows that car targets generally contain more laser points, occupy a larger space and exhibit obvious L-shaped or I-shaped structures in the point cloud, so the model recognizes them relatively easily. Pedestrian point clouds contain fewer laser points, the corresponding volume is smaller, and there is an obvious distance limitation: as distance increases, the number of pedestrian points decreases linearly and the points become sparse, which hinders recognition. Moreover, pedestrians can appear at any position in any scene, and scenes may contain background objects such as roadblocks, bushes and street lamps that resemble pedestrians to some degree, which further increases the difficulty of recognizing pedestrians from laser point cloud data.
2. Tracking models are highly complex: comparing the various tracking algorithms on the KITTI leaderboard shows that most of these models increase tracking accuracy at the expense of system complexity and computational cost, which makes it hard for researchers to analyze the individual modules. For example, when accuracy improves, researchers cannot effectively tell which parts of the system contribute most to the result, causing confusion. Excellent models such as FANTrack, DSM and extraCK may differ considerably in network structure and data processing, yet their tracking behavior is quite similar. Likewise, for JCSTD and MOTBeyondPixels the adverse effect of increased computational cost is quite obvious: despite excellent accuracy, the high computing performance and long run time they require keep them far from real-time tracking, which further raises the cost of depending on them.
3. The number of tracked targets varies: in multi-target tracking, ID switching is one of the most common problems; when several tracked targets touch or come close to each other, the tracking model cannot distinguish them effectively and their IDs are exchanged. Lidar data have a lower occlusion rate and no stacking or overlap, so at the same moment the space contains more targets than the image in the corresponding direction, which places stricter requirements on tracking stability and on handling the more varied number of targets in the laser point cloud space. Current target tracking in 3D space is mainly realized through filters, which depend heavily on the target-matching strategy, so the performance of such models is uneven.
Accordingly, those skilled in the art have been working to develop a method for target detection and tracking that achieves high efficiency and high accuracy.
Disclosure of Invention
In view of the above drawbacks of the prior art, the present invention aims to solve the problems of weak pedestrian-class recognition, complex tracking models, and the unsuitability of the conventional intersection-over-union matching criterion for current three-dimensional point cloud data sets.
The inventor has designed a target detection and tracking method based on the fusion of two-dimensional pictures and three-dimensional point clouds. It combines the feature extraction process, led by PointRCNN within AB3DMOT, with the DeepLabv3+ image segmentation results, so that each laser data point feature contains spatial information together with an image semantic segmentation result; at the same time, to address the shortcomings of AB3DMOT's association-matching algorithm, a new multi-condition joint judgment scheme suited to pedestrian trajectories is proposed.
In one embodiment of the invention, a target detection and tracking method based on fusion of a two-dimensional picture and a three-dimensional point cloud is provided, which comprises the following steps:
s100, pre-training a DeepLabv3+ model;
s200, converting the three-dimensional point cloud data into a specified format;
s300, preprocessing three-dimensional point cloud data in a specified format;
s400, training a PointRCNN-DeepLabv3+ model;
s500, updating and tracking the target state.
Optionally, in the target detection and tracking method based on two-dimensional image and three-dimensional point cloud fusion in the foregoing embodiment, step S100 includes reading the image files in Cityscapes, pre-training the DeepLabv3+ model on the truth files of the data set together with the corresponding image files, and using a specific loss function as the objective; training of the whole deep-learning framework ends when the accuracy no longer improves significantly, and the corresponding neural network parameters are saved.
Further, in the target detection and tracking method based on the fusion of the two-dimensional picture and the three-dimensional point cloud in the above embodiment, the specific loss function is formula (1), which enables the model to achieve accurate semantic segmentation of the image:
$L_{deeplabv3+}(x)=\sum w(x)\log(p_k(x))$, (1)
wherein
x is the pixel position on the two-dimensional plane, $a_k(x)$ denotes the value of the k-th channel corresponding to x in the final output layer of the neural network, $p_k(x)$ denotes the probability that the pixel belongs to class k, $w(x)$ denotes the classification-result vector of the ground-truth label at pixel position x, and $L_{deeplabv3+}(x)$ denotes the sum of the probabilities that x belongs to the class of the correct label.
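A minimal PyTorch sketch (an assumption, not the patent's implementation) of the pixel-wise objective of formula (1): the channel values $a_k(x)$ are turned into probabilities with a softmax, and the log-probability of each pixel's true class is accumulated; the sign convention (minimizing the negative log-likelihood) is the usual one and is assumed here.

```python
import torch
import torch.nn.functional as F

def seg_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: (B, K, H, W) raw channel values a_k(x); labels: (B, H, W) true class per pixel."""
    log_p = F.log_softmax(logits, dim=1)                   # log p_k(x)
    # gather the log-probability of the correct class at every pixel (w(x) acts as a one-hot)
    picked = log_p.gather(1, labels.unsqueeze(1)).squeeze(1)
    return -picked.mean()                                  # negative log-likelihood over pixels

logits = torch.randn(2, 19, 64, 64)                        # e.g. 19 Cityscapes classes
labels = torch.randint(0, 19, (2, 64, 64))
print(seg_loss(logits, labels))
```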
Optionally, in the target detection and tracking method based on two-dimensional image and three-dimensional point cloud fusion in any of the foregoing embodiments, the step S100 further includes:
s110, inputting the Cityscapes image data used in training, including the batch size, the number of images and the number of channels;
s120, the encoding network obtains feature maps with different dilation sizes through atrous (hole) convolution; after the feature maps are stacked and spliced they are fed into a subsequent convolutional network for feature extraction, finally yielding an effective encoded feature result;
s130, the decoding network supplements information through full convolution and the features of the corresponding layers of the encoding network, up-samples layer by layer, finally restores the original input image size, and outputs the classification information of each pixel.
Further, in the target detection and tracking method based on the two-dimensional image and the three-dimensional point cloud fusion in the above embodiment, the atrous convolutions use 1×1 and 3×3 kernels of different sizes and different sampling (dilation) rates.
Optionally, in the target detection and tracking method based on two-dimensional picture and three-dimensional point cloud fusion in any of the foregoing embodiments, the pre-training in step S100 further includes image semantic segmentation and image object classification:
s140, extracting image semantic segmentation information in the Cityscapes data set, and extracting classification information of target pixels;
s150, reading all image data, and configuring the image pixel classification meeting the requirements;
s160, deploying DeepLabv3+ as a microservice (docker container) in the GPU server.
Further, in the target detection and tracking method based on the two-dimensional picture and the three-dimensional point cloud fusion in the above-described embodiment, the image object classes include cars, trucks, pedestrians, riders, and the ground.
Optionally, in the target detection and tracking method based on two-dimensional image and three-dimensional point cloud fusion in any of the foregoing embodiments, step S100 further includes checking an effect of pre-training.
Further, in the target detection and tracking method based on two-dimensional image and three-dimensional point cloud fusion in the above embodiment, checking the effect of pre-training includes developing a visualization in python using the matplotlib library, comparing the results with the ground truth, and using the ratio of the intersection to the union of the ground-truth and predicted pixel sets of the image (i.e. MIoU) as the final criterion; the larger the value, the better the performance.
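A small numpy sketch (assumed, not the patent's implementation) of the MIoU check described above: the per-class intersection-over-union between ground-truth and predicted pixel labels is averaged over the classes that appear.

```python
import numpy as np

def mean_iou(gt: np.ndarray, pred: np.ndarray, num_classes: int) -> float:
    """gt, pred: (H, W) integer label maps; returns the mean IoU over present classes."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(gt == c, pred == c).sum()
        union = np.logical_or(gt == c, pred == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

gt = np.random.randint(0, 5, (64, 64))
pred = np.random.randint(0, 5, (64, 64))
print(mean_iou(gt, pred, 5))
```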
Optionally, in the target detection and tracking method based on two-dimensional image and three-dimensional point cloud fusion in any of the foregoing embodiments, the three-dimensional point cloud in step S200 comes from the multiple beams of a multi-line 3D lidar, whose horizontal and vertical fields of view are 360° and 40°, respectively, and whose horizontal range reaches 300 meters.
Optionally, in the target detection and tracking method based on the fusion of the two-dimensional image and the three-dimensional point cloud in any of the embodiments, the specified format in the step S200 is a format that is convenient for the three-dimensional point cloud algorithm to read in.
Optionally, in the target detection and tracking method based on two-dimensional image and three-dimensional point cloud fusion in any of the embodiments, the specified format in the step S200 is a pcd format.
Optionally, in the target detection and tracking method based on the fusion of the two-dimensional image and the three-dimensional point cloud in any of the foregoing embodiments, the preprocessing in step S300 includes increasing the number of points by up-sampling when the points are too sparse and reducing the number of points by down-sampling when they are too dense, so that the three-dimensional point cloud is distributed uniformly over the whole plane.
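A hedged numpy sketch of the density balancing described above: dense clouds are randomly down-sampled and sparse clouds are up-sampled by duplicating existing points with small jitter. The target point count and jitter scale are assumptions, not values given in the patent.

```python
import numpy as np

def balance_density(points: np.ndarray, target: int = 16384, rng=None) -> np.ndarray:
    """points: (N, 3) or (N, 4) lidar points; returns roughly `target` points."""
    rng = rng or np.random.default_rng(0)
    n = points.shape[0]
    if n > target:                                    # too dense: random down-sampling
        return points[rng.choice(n, target, replace=False)]
    if n < target:                                    # too sparse: jittered duplicates
        extra = points[rng.choice(n, target - n, replace=True)]
        extra = extra + rng.normal(scale=1e-3, size=extra.shape)
        return np.concatenate([points, extra], axis=0)
    return points

print(balance_density(np.random.rand(3000, 4)).shape)
```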
Optionally, in the target detection and tracking method based on two-dimensional image and three-dimensional point cloud fusion in any of the foregoing embodiments, the preprocessing in the foregoing step S300 further includes foreground and background extraction, where a loss function is formula (3):
$L_{fore}(p_u)=-\alpha_u(1-p_u)^{\beta}\log(p_u)$, (3)
wherein
$p_u$ denotes the predicted probability for foreground and background points respectively, $\alpha_u$ and $\beta$ are manually defined constants that control the weights of foreground and background points, and $L_{fore}(p_u)$ is a Focal Loss used to alleviate the class-imbalance problem when the ratio of foreground to background points is 1:3 or greater.
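A minimal PyTorch sketch (an assumption, not the patent's code) of the Focal Loss of formula (3). Here $\alpha_u$ is taken as alpha for foreground points and 1 - alpha for background points, a common convention; the values alpha = 0.25 and beta = 2 are usual defaults and are not specified in the text.

```python
import torch

def focal_loss(p_fore: torch.Tensor, is_fore: torch.Tensor,
               alpha: float = 0.25, beta: float = 2.0) -> torch.Tensor:
    """p_fore: (N,) predicted foreground probability; is_fore: (N,) 1.0 for foreground points."""
    p_u = torch.where(is_fore.bool(), p_fore, 1.0 - p_fore)       # probability of the true class
    alpha_u = torch.where(is_fore.bool(),
                          torch.full_like(p_fore, alpha),
                          torch.full_like(p_fore, 1.0 - alpha))
    return (-alpha_u * (1.0 - p_u).pow(beta) * torch.log(p_u.clamp_min(1e-8))).mean()

p = torch.rand(1000)
y = (torch.rand(1000) < 0.25).float()      # roughly 1:3 foreground/background, as in the text
print(focal_loss(p, y))
```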
Optionally, in the target detection and tracking method based on two-dimensional image and three-dimensional point cloud fusion in any of the foregoing embodiments, the step S400 includes:
s410, reading a specified format file in the KITTI;
s420, inputting a true value file based on the KITTI data set and combining a corresponding three-dimensional point cloud data file and an image file;
s430, fixing the DeepLabv3+ model weight;
s440, training a PointRCNN-DeepLabv3+ model;
s450, adopting a specific loss function as a target, and ending the training of the whole deep learning framework until the precision is not obviously improved;
s460, saving the corresponding neural network parameters.
Alternatively, in the target detection and tracking method based on the two-dimensional picture and three-dimensional point cloud fusion in any of the above embodiments, the specified format in step S410 is the pcd format.
Optionally, in the target detection and tracking method based on two-dimensional picture and three-dimensional point cloud fusion in any of the foregoing embodiments, step S440 further includes semantic segmentation and three-dimensional point cloud object classification:
s441, extracting the 3D bounding-box coordinate information in the KITTI data set together with the corresponding 2D bounding boxes of the left and right two-dimensional views, and extracting the related classification information;
s442, reading all three-dimensional point cloud data, and configuring target three-dimensional frame information and classification meeting the requirements;
s443, deploying the PointRCNN-DeepLabv3+ model as a microservice (docker container) in the GPU server.
Further, in the target detection and tracking method based on the two-dimensional picture and the three-dimensional point cloud fusion in the above embodiment, the three-dimensional point cloud object classes include cars, trucks, pedestrians, riders, and the ground.
Optionally, in the target detection and tracking method based on the two-dimensional image and the three-dimensional point cloud fusion in any of the foregoing embodiments, step S440 further includes checking the effect of the pre-training.
Further, in the target detection and tracking method based on the two-dimensional image and the three-dimensional point cloud fusion in the above embodiment, checking the effect of pre-training includes developing a visualization in python using the PCL (Point Cloud Library), comparing the results with the ground truth, and using the ratio of the intersection to the union of the ground-truth 3D bounding box and the predicted bounding box in three-dimensional space (i.e. IoU) as the final criterion; the larger the value, the better the performance.
Optionally, in the method for detecting and tracking a target based on fusion of a two-dimensional image and a three-dimensional point cloud in any embodiment, step S500 includes: during tracking, the algorithm model identifies targets with PointRCNN-DeepLabv3+, uses the intersection-over-union ratio and the distance ratio as joint matching conditions, matches the recognition results with the Hungarian algorithm, and updates and tracks the target state through a filter.
Optionally, in the target detection and tracking method based on two-dimensional image and three-dimensional point cloud fusion in any of the embodiments, the filter is a 3D Kalman filter.
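A minimal numpy sketch of a constant-velocity 3D Kalman filter of the kind used here and detailed in step S540 below. The 10-dimensional state (x, y, z, yaw, l, w, h plus three velocities) and the noise values are assumptions modelled on AB3DMOT, not values given in the patent.

```python
import numpy as np

class Track3DKalman:
    def __init__(self, box):                       # box: (x, y, z, yaw, l, w, h)
        self.x = np.zeros(10)                      # state: box parameters + (vx, vy, vz)
        self.x[:7] = box
        self.P = np.eye(10) * 10.0                 # state covariance
        self.F = np.eye(10)                        # constant-velocity transition (dt = 1)
        self.F[0, 7] = self.F[1, 8] = self.F[2, 9] = 1.0
        self.H = np.eye(7, 10)                     # only the 7 box parameters are observed
        self.Q = np.eye(10) * 0.01                 # process noise (assumed values)
        self.R = np.eye(7) * 0.1                   # measurement noise (assumed values)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:7]

    def update(self, z):                           # z: detected (x, y, z, yaw, l, w, h)
        y = np.asarray(z) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(10) - K @ self.H) @ self.P
```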
Optionally, in the target detection and tracking method based on two-dimensional image and three-dimensional point cloud fusion in any of the foregoing embodiments, the step S500 includes:
s510, training an AB3DMOT-MCM-DeepLabv3+ model, inputting Cityscapes image data and KITTI three-dimensional point cloud data, including the batch size, the number of images and corresponding channels, and the number of three-dimensional point cloud points and corresponding channels;
s520, the recognition network searches for target regions of interest through PointRCNN, performs semantic segmentation of the image with DeepLabv3+, takes the three-dimensional point cloud features of the regions of interest and the corresponding image semantic segmentation results as supplementary information, and inputs them to the second stage of PointRCNN to obtain accurate target recognition results;
s530, the data-matching module compares the recognized target parameters with the track prediction parameters and performs matching calculation using the distance ratio and the intersection-over-union ratio: matched tracks are updated; unmatched tracks are checked and deleted when the maximum memory time limit is exceeded, otherwise they remain unchanged; new tracks are created for unmatched recognition targets;
s540, a 3D Kalman filter predicts and updates the coordinates x, y, z, the size parameters, the yaw angle and the relative speed of each target track, using the creation and update scheme of a conventional Kalman filter.
Further, in the target detection and tracking method based on the two-dimensional image and the three-dimensional point cloud fusion in the above embodiment, step S510 further includes combining PointRCNN-DeepLabv3+ with the AB3DMOT multi-matching-condition tracking model, performing target recognition with PointRCNN-DeepLabv3+ and target tracking with the tracking model in AB3DMOT.
Optionally, in the target detection and tracking method based on two-dimensional image and three-dimensional point cloud fusion in any of the above embodiments, the matching-condition intersection-over-union ratio and distance ratio of step S530 are computed by formulas (5) and (6), respectively:
$IoU=\dfrac{S_a\cap S_b}{S_a\cup S_b}$, (5)
wherein $S_a$ denotes the 3D bounding volume of the trajectory prediction, $S_b$ the 3D bounding volume of the detection result, $S_a\cap S_b$ the intersection of the two volumes, $S_a\cup S_b$ their union, and IoU the ratio of the intersection to the union of the two 3D bounding regions; the computed value is compared with a set threshold as one of the matching criteria;
$PosR=\dfrac{dis(t_1,t_2)}{w}$, (6)
wherein $t_1$ and $t_2$ denote the center-point coordinates of the trajectory prediction and of the detection result, $dis(t_1,t_2)$ the Euclidean distance between the two center points, and $w$ the target width of the trajectory prediction; PosR is the ratio of the Euclidean distance between the centers of the two 3D bounding boxes to the width predicted by the tracker, and its value is compared with a set threshold as one of the matching criteria.
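A hedged numpy sketch of the two matching quantities of formulas (5) and (6). For brevity the 3D IoU is computed for axis-aligned boxes; the patent's boxes also carry a yaw angle, which a full implementation would take into account.

```python
import numpy as np

def iou_3d_axis_aligned(a, b):
    """a, b: (cx, cy, cz, l, w, h) axis-aligned 3D boxes; returns intersection over union."""
    a_min = np.array(a[:3]) - np.array(a[3:]) / 2
    a_max = np.array(a[:3]) + np.array(a[3:]) / 2
    b_min = np.array(b[:3]) - np.array(b[3:]) / 2
    b_max = np.array(b[:3]) + np.array(b[3:]) / 2
    inter = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None).prod()
    union = np.prod(a[3:]) + np.prod(b[3:]) - inter
    return inter / union

def pos_ratio(t1, t2, w):
    """Euclidean distance between prediction centre t1 and detection centre t2, scaled by width w."""
    return np.linalg.norm(np.array(t1) - np.array(t2)) / w

print(iou_3d_axis_aligned((0, 0, 0, 4, 2, 1.5), (0.5, 0, 0, 4, 2, 1.5)))
print(pos_ratio((0, 0, 0), (0.5, 0, 0), 2.0))
```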
According to the invention, the feature extraction process, led by PointRCNN within AB3DMOT, is combined with DeepLabv3+ image segmentation, so that each laser data point feature contains spatial information together with an image semantic segmentation result; this improves the recognition performance of PointRCNN and effectively raises the accuracy of recognizing pedestrian targets that are small and highly similar to their surroundings. To address the shortcomings of AB3DMOT's association-matching algorithm, a new multi-condition joint judgment scheme suited to pedestrian trajectories is proposed, which improves the target-matching capability, makes the model perform better on pedestrian tracking, and achieves the goals of high efficiency and high accuracy.
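A hedged numpy sketch of the point-image fusion idea summarized above: each lidar point is projected into the image with a camera projection matrix (assumed given, e.g. from the KITTI calibration files), the semantic-segmentation score vector at that pixel is looked up, and the scores are appended to the point feature as supplementary information. Function and variable names are illustrative only, and points behind the camera would need filtering in practice.

```python
import numpy as np

def append_semantic_features(points: np.ndarray, seg_scores: np.ndarray, P: np.ndarray) -> np.ndarray:
    """points: (N, 3) xyz; seg_scores: (H, W, K) softmax map from the segmenter; P: (3, 4) projection."""
    h, w, _ = seg_scores.shape
    homo = np.hstack([points, np.ones((points.shape[0], 1))])      # (N, 4) homogeneous coordinates
    uvz = homo @ P.T                                               # (N, 3) projective image coordinates
    u = np.clip((uvz[:, 0] / uvz[:, 2]).astype(int), 0, w - 1)
    v = np.clip((uvz[:, 1] / uvz[:, 2]).astype(int), 0, h - 1)
    return np.hstack([points, seg_scores[v, u]])                   # (N, 3 + K) fused point features

pts = np.random.rand(8, 3) * 10 + np.array([0, 0, 5])              # toy points in front of the camera
scores = np.random.rand(375, 1242, 5)
P = np.hstack([np.eye(3) * 700, np.array([[620], [187], [1]])])    # toy pinhole projection
print(append_semantic_features(pts, scores, P).shape)
```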
The conception, specific structure, and technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, features, and effects of the present invention.
Drawings
FIG. 1 is a flowchart illustrating a method according to an example embodiment;
FIG. 2 is a schematic diagram illustrating the DeepLabv3+ flow according to an exemplary embodiment;
FIG. 3 is a diagram illustrating the AB3DMOT-MCM-DeepLabv3+ architecture according to an exemplary embodiment.
Detailed Description
The following description of the preferred embodiments of the present invention refers to the accompanying drawings, which make the technical contents thereof more clear and easy to understand. The present invention may be embodied in many different forms of embodiments and the scope of the present invention is not limited to only the embodiments described herein.
In the drawings, like structural elements are referred to by like reference numerals and components having similar structure or function are referred to by like reference numerals. The dimensions and thickness of each component shown in the drawings are arbitrarily shown, and the present invention is not limited to the dimensions and thickness of each component. The thickness of the components is schematically and appropriately exaggerated in some places in the drawings for clarity of illustration.
The inventor combines the feature extraction process, led by PointRCNN within AB3DMOT, with the DeepLabv3+ image segmentation results, so that each laser data point feature contains spatial information together with an image semantic segmentation result, and proposes a new multi-condition joint judgment scheme suited to pedestrian trajectories to address the shortcomings of AB3DMOT's association-matching algorithm. The inventor has designed a target detection and tracking method based on the fusion of two-dimensional pictures and three-dimensional point clouds, as shown in FIG. 1, comprising the following steps:
S100, pre-training a DeepLabv3+ model: the image files in Cityscapes are read, the DeepLabv3+ model is pre-trained on the truth files of the image data set together with the corresponding image files, a specific loss function is used as the objective, training of the whole deep-learning framework ends when the accuracy no longer improves significantly, and the corresponding neural network parameters are saved; the inventor has defined the specific loss function as follows so that the model can achieve accurate semantic segmentation of the image:
$L_{deeplabv3+}(x)=\sum w(x)\log(p_k(x))$, (1)
wherein
x is the pixel position on the two-dimensional plane, $a_k(x)$ denotes the value of the k-th channel corresponding to x in the final output layer of the neural network, $p_k(x)$ denotes the probability that the pixel belongs to class k, $w(x)$ denotes the classification-result vector of the ground-truth label at pixel position x, and $L_{deeplabv3+}(x)$ denotes the sum of the probabilities that x belongs to the class of the correct label.
Step S100 is refined, as shown in fig. 2:
s110, inputting the Cityscapes image data used in pre-training, including the batch size, the number of images and the number of channels;
s120, the encoding network (Encoder) first extracts basic features of the image with a deep convolutional network (DCNN). To enlarge the receptive field of the filters so that they learn global and local information more accurately, the encoding network extracts features with atrous convolution (Atrous Conv); the specific atrous filters comprise 1×1 Conv, 3×3 Conv rate 6, 3×3 Conv rate 12, 3×3 Conv rate 18 and an Image Pooling layer. The different feature maps obtained by these operations are stacked and spliced, fed into a subsequent convolutional network, extracted with a 1×1 Conv, and finally yield an effective encoded feature result;
s130, the decoding network (Decoder) applies a 1×1 Conv to the low-level features output by the DCNN layer of the encoding network to obtain a feature map, up-samples the feature map finally output by the Encoder by a factor of 4 (Upsample by 4) so that it matches the size of the previous feature map, and concatenates (Concat) them into a new feature map. To obtain more effective features, the new feature map is processed by a 3×3 Conv and again up-sampled by a factor of 4 (Upsample by 4); up-sampling proceeds layer by layer until the original input image size is restored, and the classification result (Prediction) of each pixel is output;
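A compact, hypothetical PyTorch sketch of the ASPP-style encoder stage of s120: parallel 1×1 and dilated 3×3 convolutions at rates 6/12/18 plus image pooling, concatenated and fused by a 1×1 convolution. The 256 output channels follow common DeepLabv3+ practice; the exact layer configuration of the patented model is not specified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch: int, out_ch: int = 256):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1)
        self.atrous = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in (6, 12, 18)])
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.fuse = nn.Conv2d(out_ch * 5, out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [self.conv1x1(x)] + [conv(x) for conv in self.atrous]
        feats.append(F.interpolate(self.pool(x), size=(h, w), mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))      # stacked, spliced, then 1x1-fused

x = torch.randn(1, 512, 33, 33)                        # a DCNN feature map
print(ASPP(512)(x).shape)                              # torch.Size([1, 256, 33, 33])
```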
step S100 further includes image object classification and semantic segmentation, the image object classification including cars, trucks, pedestrians, riders and ground, specifically including:
s140, extracting image semantic segmentation information in the Cityscapes data set, and extracting classification information of target pixels;
s150, reading all image data, and configuring the image pixel classification meeting the requirements;
s160, deploying the model as a microservice (docker container) in the GPU server.
In addition, step S100 also includes verifying the effect of the pre-training, specifically by developing a visualization in python using the matplotlib library and then visually comparing the results with the ground truth.
S200, converting the three-dimensional point cloud data into a specified format: the three-dimensional point cloud data come from the multiple beams of a multi-line 3D lidar, whose horizontal and vertical fields of view are 360 degrees and 40 degrees respectively and whose horizontal range reaches 300 meters; the specified format is one that the three-dimensional point cloud algorithm can read in conveniently, such as the pcd format.
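As a hedged illustration of the format conversion in S200, the sketch below writes lidar points to a plain ASCII .pcd file; the field layout (x, y, z, intensity) and the output file name are assumptions, not details given in the patent.

```python
import numpy as np

def write_pcd(points: np.ndarray, path: str) -> None:
    """points: (N, 4) array of x, y, z, intensity written as an ASCII .pcd file."""
    n = points.shape[0]
    header = "\n".join([
        "# .PCD v0.7 - Point Cloud Data file format",
        "VERSION 0.7",
        "FIELDS x y z intensity",
        "SIZE 4 4 4 4",
        "TYPE F F F F",
        "COUNT 1 1 1 1",
        f"WIDTH {n}",
        "HEIGHT 1",
        "VIEWPOINT 0 0 0 1 0 0 0",
        f"POINTS {n}",
        "DATA ascii",
    ])
    with open(path, "w") as f:
        f.write(header + "\n")
        np.savetxt(f, points, fmt="%.4f")

write_pcd(np.random.rand(100, 4).astype(np.float32), "frame_000000.pcd")
```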
S300, preprocessing the three-dimensional point cloud data in the specified format: the preprocessing includes increasing the number of points by up-sampling when the points are too sparse and reducing the number of points by down-sampling when they are too dense, so that the three-dimensional point cloud is distributed uniformly over the whole plane; the preprocessing also includes foreground and background extraction with the following loss function:
$L_{fore}(p_u)=-\alpha_u(1-p_u)^{\beta}\log(p_u)$, (3)
wherein
$p_u$ denotes the predicted probability for foreground and background points respectively, $\alpha_u$ and $\beta$ are manually defined constants that control the weights of foreground and background points, and $L_{fore}(p_u)$ is a Focal Loss used to alleviate the class-imbalance problem when the numbers of foreground and background points differ greatly.
S400, training a PointRCNN-DeepLabv3+ model, comprising:
s410, reading a specified format file in the KITTI, which is generally in a pcd format;
s420, inputting a truth value file based on the KITTI data set (the last step) and combining the corresponding three-dimensional point cloud data file and the image file;
s430, fixing the DeepLabv3+ model weight;
s440, training the PointRCNN-DeepLabv3+ model, which involves semantic segmentation and three-dimensional point cloud object classification; the object classes include cars, trucks, pedestrians, riders and the ground, and the step specifically comprises:
s441, extracting the 3D bounding-box coordinate information in the KITTI data set together with the corresponding 2D bounding boxes of the left and right two-dimensional views, and extracting the related classification information;
s442, reading all three-dimensional point cloud data, and configuring target three-dimensional frame information and classification meeting the requirements;
s443, deploying the model as a microservice (docker container) in the GPU server;
s450, adopting a specific loss function as a target, and ending the training of the whole deep learning framework until the precision is not obviously improved;
s460, saving the corresponding neural network parameters.
Step S400 also includes verifying the effectiveness of the pre-training, specifically including developing a visualization in python using the PCL (point cloud library) library, and then performing a visual comparison of the results in combination with the truth values.
S500, target state updating and tracking: the algorithm model identifies targets with PointRCNN-DeepLabv3+, uses the intersection-over-union ratio and the distance ratio as joint matching conditions, matches the recognition results with the Hungarian algorithm, and updates and tracks the target state through a filter, for which a 3D Kalman filter is chosen. As shown in FIG. 3, the step specifically includes:
s510, training an AB3DMOT-MCM-DeepLabv3+ model, whose network structure combines PointRCNN-DeepLabv3+ with the AB3DMOT multi-matching-condition tracking model, i.e. 3D target detection and recognition are performed with PointRCNN-DeepLabv3+ and target tracking is then performed with the tracking model in AB3DMOT. The Cityscapes image data and the KITTI three-dimensional point cloud data are input to DeepLabv3+ and PointRCNN respectively, including the batch size, the number of images and corresponding channels, and the number of three-dimensional point cloud points and corresponding channels;
s520, the 3D target detection network searches for target regions of interest through PointRCNN, performs semantic segmentation of the image with DeepLabv3+, takes the three-dimensional point cloud features of the regions of interest and the corresponding image semantic segmentation results as supplementary information, and inputs them to the second stage of PointRCNN to obtain accurate target recognition results, which are then passed to data matching;
s530, the data-matching module compares the recognized target parameters with the track prediction parameters and performs matching calculation using the distance ratio PosR and the intersection-over-union ratio IoU (a sketch of this matching step follows after step S540). The specific process is: for each predicted trajectory and each bounding box of a recognized target, PosR and IoU are first computed and summed with weights of 0.5 each; when the sum exceeds the set threshold of 0.3, the pair is considered matched. For target trajectories whose sum is below 0.3, the IoU value is computed and matched separately, and when it exceeds the set threshold of 0.3 the pair is likewise classified as a trajectory match. All matching results are then updated: an unmatched track is deleted if the maximum memory time limit is exceeded, otherwise it remains unchanged, and new tracks are generated for unmatched detections. After all results have been processed, the 3D Kalman filter is updated and used for prediction in subsequent track-matching calculations. The inventor defines the matching conditions, the intersection-over-union ratio IoU and the distance ratio PosR, by formulas (5) and (6), respectively:
$IoU=\dfrac{S_a\cap S_b}{S_a\cup S_b}$, (5)
wherein $S_a$ denotes the 3D bounding volume of the trajectory prediction, $S_b$ the 3D bounding volume of the detection result, $S_a\cap S_b$ the intersection of the two volumes, $S_a\cup S_b$ their union, and IoU the ratio of the intersection to the union of the two 3D bounding regions; the computed value is compared with a set threshold as one of the matching criteria;
$PosR=\dfrac{dis(t_1,t_2)}{w}$, (6)
wherein $t_1$ and $t_2$ denote the center-point coordinates of the trajectory prediction and of the detection result, $dis(t_1,t_2)$ the Euclidean distance between the two center points, and $w$ the target width of the trajectory prediction; PosR is the ratio of the Euclidean distance between the centers of the two 3D bounding boxes to the width predicted by the tracker, and its value is compared with a set threshold as one of the matching criteria.
S540, a 3D Kalman filter predicts and updates the coordinates x, y, z, the size parameters, the yaw angle and the relative speed of each target track, using the creation and update scheme of a conventional Kalman filter.
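A hedged sketch of the Hungarian assignment used in step S530. The affinity matrix is assumed to be the 0.5/0.5 weighted combination of the IoU and PosR terms described above (how PosR is normalized into a similarity is an assumption of this sketch, not a detail given in the patent); pairs whose combined score stays below the 0.3 threshold fall back to an IoU-only check, as in the described procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(affinity: np.ndarray, iou: np.ndarray, thr: float = 0.3):
    """affinity, iou: (num_tracks, num_detections) matrices; returns matches and leftovers."""
    rows, cols = linear_sum_assignment(-affinity)          # maximise total affinity
    matched = []
    unmatched_t = set(range(affinity.shape[0]))
    unmatched_d = set(range(affinity.shape[1]))
    for i, j in zip(rows, cols):
        if affinity[i, j] >= thr or iou[i, j] >= thr:      # joint score, else IoU-only fallback
            matched.append((i, j))
            unmatched_t.discard(i)
            unmatched_d.discard(j)
    return matched, unmatched_t, unmatched_d

# toy example: 2 predicted tracks, 3 detections
aff = np.array([[0.6, 0.1, 0.0], [0.2, 0.5, 0.1]])
print(hungarian_match(aff, aff))
```

Unmatched tracks returned by this routine would then be aged and deleted once the maximum memory time limit is exceeded, and unmatched detections would seed new tracks, as described in s530.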
The foregoing describes in detail preferred embodiments of the present invention. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the invention by one of ordinary skill in the art without undue burden. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by the person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.

Claims (9)

1. The target detection and tracking method based on the fusion of the two-dimensional picture and the three-dimensional point cloud is characterized by comprising the following steps of:
s100, pre-training a DeepLabv3+ model, namely reading the image files in Cityscapes, pre-training the DeepLabv3+ model on a truth file of the image data set together with the corresponding image files, adopting a specific loss function as the objective, ending training of the whole deep-learning framework when the accuracy no longer improves significantly, and saving the corresponding neural network parameters, wherein the specific loss function is formula (1), which enables the model to achieve accurate semantic segmentation of the image:
$L_{deeplabv3+}(x)=\sum w(x)\log(p_k(x))$, (1);
wherein,
x is the pixel position on the two-dimensional plane, $a_k(x)$ denotes the value of the k-th channel corresponding to x in the final output layer of the neural network; $p_k(x)$ denotes the probability that the pixel belongs to class k; $w(x)$ denotes the classification-result vector of the ground-truth label at pixel position x, and $L_{deeplabv3+}(x)$ denotes the sum of the probabilities of the class to which the correct label belongs;
further comprises:
s110, inputting the Cityscapes image data used in training, including the batch size, the number of images and the number of channels;
s120, the encoding network obtains feature maps with different dilation sizes through atrous convolution; after the feature maps are stacked and spliced they are fed into a subsequent convolutional network for feature extraction, finally yielding an effective encoded feature result;
s130, the decoding network supplements information through full convolution and the features of the corresponding layers of the encoding network, up-samples layer by layer, finally restores the original input image size, and outputs the classification information of each pixel;
s200, converting the three-dimensional point cloud data into a specified format;
s300, preprocessing three-dimensional point cloud data in a specified format, wherein the preprocessing further comprises foreground and background extraction, and a loss function is shown as a formula (3):
$L_{fore}(p_u)=-\alpha_u(1-p_u)^{\beta}\log(p_u)$, (3);
wherein,
$p_u$ denotes the predicted probability for foreground and background points respectively, $\alpha_u$ and $\beta$ are manually defined constants that control the weights of foreground and background points, and $L_{fore}(p_u)$ is a Focal Loss used to alleviate the class-imbalance problem when the ratio of foreground to background points is 1:3 or greater;
s400, training a PointRCNN-DeepLabv3+ model;
s500, updating and tracking the target state.
2. The method for detecting and tracking a target based on two-dimensional picture and three-dimensional point cloud fusion according to claim 1, wherein the pre-training of step S100 further comprises semantic segmentation and image object classification:
s140, extracting image semantic segmentation information in the Cityscapes data set, and extracting classification information of target pixels;
s150, reading all image data, and configuring the image pixel classification meeting the requirements;
s160, deploying DeepLabv3+ as a microservice (docker container) in the GPU server.
3. The method for detecting and tracking a target based on two-dimensional image and three-dimensional point cloud fusion according to claim 2, wherein said step S100 further comprises checking the effect of pre-training.
4. A method for detecting and tracking targets based on two-dimensional pictures and three-dimensional point cloud fusion as claimed in claim 3, wherein checking the effect of pre-training comprises developing a visualization in python using the matplotlib library, comparing the results with the true values, and taking the ratio of the intersection to the union of the sets of true and predicted pixel values of the image as the final criterion, the larger the value the better the performance.
5. The method for detecting and tracking a target based on two-dimensional image and three-dimensional point cloud fusion according to claim 4, wherein the preprocessing in step S300 includes increasing the number of points by up-sampling when the points are too sparse, and reducing the number of corresponding three-dimensional point clouds by down-sampling when the points are too dense, so that the three-dimensional point clouds are uniformly distributed in the whole plane.
6. The method for detecting and tracking a target based on two-dimensional image and three-dimensional point cloud fusion according to claim 5, wherein the step S400 comprises:
s410, reading a specified format file in the KITTI;
s420, inputting a true value file based on the KITTI data set and combining a corresponding three-dimensional point cloud data file and an image file;
s430, fixing the DeepLabv3+ model weight;
s440, training a PointRCNN-DeepLabv3+ model;
s450, adopting a specific loss function as a target, and ending the training of the whole deep learning framework until the precision is not obviously improved;
s460, saving the corresponding neural network parameters.
7. The method for detecting and tracking a target based on two-dimensional image and three-dimensional point cloud fusion as claimed in claim 6, wherein said step S440 further comprises semantic segmentation and three-dimensional point cloud object classification:
s441, extracting the 3D bounding-box coordinate information in the KITTI data set together with the corresponding 2D bounding boxes of the left and right two-dimensional views, and extracting the related classification information;
s442, reading all three-dimensional point cloud data, and configuring target three-dimensional frame information and classification meeting the requirements;
s443, deploying the PointRCNN-DeepLabv3+ model as a microservice in the GPU server.
8. The method for detecting and tracking a target based on two-dimensional image and three-dimensional point cloud fusion according to claim 7, wherein the step S500 comprises:
s510, training an AB3DMOT-MCM-DeepLabv3+ model, inputting Cityscapes image data and KITTI three-dimensional point cloud data, including the batch size, the number of images and corresponding channels, and the number of three-dimensional point cloud points and corresponding channels;
s520, the recognition network searches for target regions of interest through PointRCNN, performs semantic segmentation of the image with DeepLabv3+, takes the three-dimensional point cloud features of the regions of interest and the corresponding image semantic segmentation results as supplementary information, and inputs them to the second stage of PointRCNN to obtain accurate target recognition results;
s530, the data-matching module compares the recognized target parameters with the track prediction parameters and performs matching calculation using the distance ratio and the intersection-over-union ratio: matched tracks are updated; unmatched tracks are checked and deleted when the maximum memory time limit is exceeded, otherwise they remain unchanged; new tracks are created for unmatched recognition targets;
s540, a 3D Kalman filter predicts and updates the coordinates x, y, z, the size parameters, the yaw angle and the relative speed of each target track, using the creation and update scheme of a conventional Kalman filter.
9. A method for detecting and tracking a target based on two-dimensional image and three-dimensional point cloud fusion according to claim 8, wherein the step S510 further comprises combining PointRCNN-DeepLabv3+ with the AB3DMOT multi-matching-condition tracking model, performing target recognition with PointRCNN-DeepLabv3+ and performing target tracking with the tracking model in AB3DMOT.
CN202010466491.4A 2020-05-28 2020-05-28 Target detection and tracking method based on two-dimensional picture and three-dimensional point cloud fusion Active CN111626217B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010466491.4A CN111626217B (en) 2020-05-28 2020-05-28 Target detection and tracking method based on two-dimensional picture and three-dimensional point cloud fusion

Publications (2)

Publication Number Publication Date
CN111626217A (en) 2020-09-04
CN111626217B (en) 2023-08-22

Family

ID=72259546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010466491.4A Active CN111626217B (en) 2020-05-28 2020-05-28 Target detection and tracking method based on two-dimensional picture and three-dimensional point cloud fusion

Country Status (1)

Country Link
CN (1) CN111626217B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487884A (en) * 2020-11-16 2021-03-12 香港中文大学(深圳) Traffic violation behavior detection method and device and computer readable storage medium
WO2022104774A1 (en) * 2020-11-23 2022-05-27 华为技术有限公司 Target detection method and apparatus
CN112598735B (en) * 2020-12-21 2024-02-27 西北工业大学 Single image object pose estimation method integrating three-dimensional model information
CN114758333B (en) * 2020-12-29 2024-02-13 北京瓦特曼科技有限公司 Identification method and system for unhooking hook of ladle lifted by travelling crane of casting crane
CN112686865B (en) * 2020-12-31 2023-06-02 重庆西山科技股份有限公司 3D view auxiliary detection method, system, device and storage medium
CN112712089A (en) * 2020-12-31 2021-04-27 的卢技术有限公司 Obstacle detection method, obstacle detection device, computer device, and storage medium
CN112329749B (en) * 2021-01-05 2021-04-27 新石器慧通(北京)科技有限公司 Point cloud labeling method and labeling equipment
CN112884705B (en) * 2021-01-06 2024-05-14 西北工业大学 Two-dimensional material sample position visualization method
CN112700429B (en) * 2021-01-08 2022-08-26 中国民航大学 Airport pavement underground structure disease automatic detection method based on deep learning
CN112862858A (en) * 2021-01-14 2021-05-28 浙江大学 Multi-target tracking method based on scene motion information
AU2021204585A1 (en) * 2021-03-30 2022-10-13 Sensetime International Pte. Ltd. Generating point cloud completion network and processing point cloud data
CN113239749B (en) * 2021-04-27 2023-04-07 四川大学 Cross-domain point cloud semantic segmentation method based on multi-modal joint learning
CN113189610A (en) * 2021-04-28 2021-07-30 中国科学技术大学 Map-enhanced autonomous driving multi-target tracking method and related equipment
CN113177969B (en) * 2021-04-29 2022-07-15 哈尔滨工程大学 Point cloud single-target tracking method of candidate seeds based on motion direction change
CN113239829B (en) * 2021-05-17 2022-10-04 哈尔滨工程大学 Cross-dimension remote sensing data target identification method based on space occupation probability characteristics
CN113255504B (en) * 2021-05-19 2022-07-22 燕山大学 Road side visual angle beyond visual range global fusion perception system based on deep learning
CN113111978B (en) * 2021-06-11 2021-10-01 之江实验室 Three-dimensional target detection system and method based on point cloud and image data
CN113421242B (en) * 2021-06-23 2023-10-27 河北科技大学 Welding spot appearance quality detection method and device based on deep learning and terminal
CN113378760A (en) * 2021-06-25 2021-09-10 北京百度网讯科技有限公司 Training target detection model and method and device for detecting target
CN113537316B (en) * 2021-06-30 2024-04-09 南京理工大学 Vehicle detection method based on 4D millimeter wave radar point cloud
CN113689393A (en) * 2021-08-19 2021-11-23 东南大学 Three-dimensional target detection algorithm based on image and point cloud example matching
CN113780446A (en) * 2021-09-16 2021-12-10 广州大学 Lightweight voxel deep learning method capable of being heavily parameterized
CN114119671B (en) * 2021-12-01 2022-09-09 清华大学 Multi-target tracking method based on occlusion compensation and used for three-dimensional space information fusion
CN114638954B (en) * 2022-02-22 2024-04-19 深圳元戎启行科技有限公司 Training method of point cloud segmentation model, point cloud data segmentation method and related device
CN115719443A (en) * 2022-12-01 2023-02-28 上海人工智能创新中心 Method and system for using 2D pre-training model as 3D downstream task backbone network
CN116758518B (en) * 2023-08-22 2023-12-01 安徽蔚来智驾科技有限公司 Environment sensing method, computer device, computer-readable storage medium and vehicle
CN117237401B (en) * 2023-11-08 2024-02-13 北京理工大学前沿技术研究院 Multi-target tracking method, system, medium and equipment for fusion of image and point cloud

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610084A (en) * 2017-09-30 2018-01-19 驭势科技(北京)有限公司 Method and apparatus for fusing information from a depth image and a laser point cloud map
CN110321910A (en) * 2018-03-29 2019-10-11 中国科学院深圳先进技术研究院 Point cloud-oriented feature extraction method, device and equipment
CN108509918A (en) * 2018-04-03 2018-09-07 中国人民解放军国防科技大学 Target detection and tracking method fusing laser point cloud and image
US10404261B1 (en) * 2018-06-01 2019-09-03 Yekutiel Josefsberg Radar target detection system for autonomous vehicles with ultra low phase noise frequency synthesizer
CN110414418A (en) * 2019-07-25 2019-11-05 电子科技大学 Road detection method based on multiscale fusion of image and lidar data
CN110675431A (en) * 2019-10-08 2020-01-10 中国人民解放军军事科学院国防科技创新研究院 Three-dimensional multi-target tracking method fusing image and laser point cloud
CN110751090A (en) * 2019-10-18 2020-02-04 宁波博登智能科技有限责任公司 Three-dimensional point cloud labeling method and device and electronic equipment
CN111062423A (en) * 2019-11-29 2020-04-24 中国矿业大学 Point cloud classification method of point cloud graph neural network based on self-adaptive feature fusion
CN110929692A (en) * 2019-12-11 2020-03-27 中国科学院长春光学精密机械与物理研究所 Three-dimensional target detection method and device based on multi-sensor information fusion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sourabh Vora et al. "PointPainting: Sequential Fusion for 3D Object Detection." arXiv:1911.10150v1, 2019, pp. 1-10. *

Also Published As

Publication number Publication date
CN111626217A (en) 2020-09-04

Similar Documents

Publication Publication Date Title
CN111626217B (en) Target detection and tracking method based on two-dimensional picture and three-dimensional point cloud fusion
CN111201451B (en) Method and device for detecting object in scene based on laser data and radar data of scene
CN113359810B (en) Unmanned aerial vehicle landing area identification method based on multiple sensors
Zhu et al. Overview of environment perception for intelligent vehicles
US20210390329A1 (en) Image processing method, device, movable platform, unmanned aerial vehicle, and storage medium
Chen et al. Vehicle detection in high-resolution aerial images via sparse representation and superpixels
CN113506318B (en) Three-dimensional target perception method under vehicle-mounted edge scene
CN111563415A (en) Binocular vision-based three-dimensional target detection system and method
Ahmad et al. An edge-less approach to horizon line detection
CN115049700A (en) Target detection method and device
Wang et al. An overview of 3d object detection
CN113095152B (en) Regression-based lane line detection method and system
Balaska et al. Enhancing satellite semantic maps with ground-level imagery
CN114325634A (en) Method for extracting passable area in high-robustness field environment based on laser radar
Zhang et al. Gc-net: Gridding and clustering for traffic object detection with roadside lidar
Dewangan et al. Towards the design of vision-based intelligent vehicle system: methodologies and challenges
Zhu et al. A review of 6d object pose estimation
Chiang et al. 3D point cloud classification for autonomous driving via dense-residual fusion network
Li et al. Real-time monocular joint perception network for autonomous driving
Luo et al. Dynamic multitarget detection algorithm of voxel point cloud fusion based on pointrcnn
Gökçe et al. Recognition of dynamic objects from UGVs using Interconnected Neuralnetwork-based Computer Vision system
Danapal et al. Sensor fusion of camera and LiDAR raw data for vehicle detection
Berrio et al. Fusing lidar and semantic image information in octree maps
CN113255779A (en) Multi-source perception data fusion identification method and system and computer readable storage medium
WO2023155903A1 (en) Systems and methods for generating road surface semantic segmentation map from sequence of point clouds

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 315048 room 5-1-1, building 22, East Zone, Ningbo new material innovation center, high tech Zone, Ningbo, Zhejiang Province

Applicant after: Ningbo Boden Intelligent Technology Co.,Ltd.

Address before: 315040 room 210-521, floor 2, building 003, No. 750, Chuangyuan Road, high tech Zone, Ningbo, Zhejiang

Applicant before: NINGBO BODEN INTELLIGENT TECHNOLOGY Co.,Ltd.

GR01 Patent grant