CN114923491A - Three-dimensional multi-target online tracking method based on feature fusion and distance fusion - Google Patents

Three-dimensional multi-target online tracking method based on feature fusion and distance fusion

Info

Publication number
CN114923491A
CN114923491A
Authority
CN
China
Prior art keywords
target
dimensional
distance
fusion
current frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210512577.5A
Other languages
Chinese (zh)
Inventor
达飞鹏
陈汶铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202210512577.5A
Publication of CN114923491A
Legal status: Pending (Current)

Classifications

    • G — PHYSICS
    • G01 — MEASURING; TESTING
    • G01C — MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 — Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26 — Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a three-dimensional multi-target online tracking method based on feature fusion and distance fusion, comprising the following steps: first, detection data for each target are acquired with a target detection algorithm; next, the fusion features and position features of the targets are extracted; then the position distances and feature distances between tracks and detections are calculated; then the feature distance weights are estimated and the final distance matrix is calculated from these weights; finally, data association is performed by greedy matching on the final distance matrix between tracks and detections to obtain the multi-target tracking tracks of the current frame. By extracting and fusing multi-modal features of the targets, features that fully express the target properties in different environments can be obtained; the invention fuses information from multiple modalities at different levels, improving the precision of multi-target tracking; and the fusion process requires no manual intervention, avoiding extensive manual experiments to search for suitable thresholds, so that a great deal of time is saved while the multi-target tracking precision is improved.

Description

Three-dimensional multi-target online tracking method based on feature fusion and distance fusion
Technical Field
The invention relates to the fields of machine learning and computer vision, in particular to a three-dimensional multi-target online tracking method based on feature fusion and distance fusion, which is particularly suitable for tracking targets in complex outdoor scenes.
Background
With the continuous development of computer technology, target tracking has gradually become a key technical direction in intelligent video analysis and multimedia applications such as intelligent surveillance, motion analysis and autonomous driving. Multiple targets are detected and tracked in complex scenes, their trajectories are analyzed and predicted in real time, and richer data are provided for practical applications.
Thanks to the rapid iteration of computer hardware and deep learning techniques, image- and point-cloud-based target detection has also become increasingly mature, and accurate detection results provide stronger data support for multi-target tracking. In real scenes, however, the precision of multi-target tracking algorithms is often disturbed by changes in the external environment. Owing to complex interactions and occlusions between targets and to external changes such as illumination and rain, single-modality features can hardly express the state and properties of a target fully. Moreover, the high-dimensional semantic features and the low-dimensional physical features of a target are not suited to direct fusion at the feature level, so if more modalities are to be exploited, the question of at which level to fuse them must be solved.
To address these problems, features of different modalities of the target are extracted, the channel weights of these features are adjusted by squeeze-and-excitation modules, and the features are fused, so that the features of every modality can contribute and the final tracking precision is improved; the motion of the targets is modeled to obtain motion information and hence the distances between targets, so that information from more modalities is also fused at the decision level; and the feature distance weight estimation network is learned in a data-driven manner, assigning accurate weights to the feature distances between targets, reducing the time consumed by manual experiments and improving efficiency.
Disclosure of Invention
Purpose of the invention: in order to overcome the defects of the existing multi-target tracking technology, the invention provides a three-dimensional multi-target online tracking method based on feature fusion and distance fusion.
Technical scheme: to achieve the above purpose, the invention adopts the following technical scheme:
a three-dimensional multi-target online tracking method based on feature fusion and distance fusion comprises the following steps:
step 1: detecting vehicles of the current image sequence and the point cloud sequence by using a target detector to obtain detection data;
step 2: extracting two-dimensional characteristics, three-dimensional characteristics and position information of a target in a current frame;
step 3: Fusing the two-dimensional features and the three-dimensional features of the targets in the current frame to obtain the fusion feature of each target, and predicting the position of each live track in the current frame by using Kalman filtering;
step 4: Calculating the fusion feature distances and the Euclidean position distances between all targets of the current frame and the historical tracks;
step 5: Fusing the feature distances and the Euclidean distances between the targets;
step 6: Performing data association according to the distances between the targets.
Further, step 1 specifically includes the following steps:
the target detector DSA-PV-RCNN is selected to perform target detection on the currently input image sequence and point cloud sequence; the detector is trained on the public KITTI dataset and the detection category is the vehicle. The bounding-box data of the M detections of the current frame are acquired:
D_i^t = (α_i, u_{1,i}, v_{1,i}, u_{2,i}, v_{2,i}, h_i, w_i, l_i, x_i, y_i, z_i, r_i), i = 1, …, M
where α represents the angle of the target relative to the sensor, u_1 and v_1 represent the horizontal and vertical coordinates of the upper-left corner of the two-dimensional bounding box in the image, u_2 and v_2 represent the horizontal and vertical coordinates of the lower-right corner of the two-dimensional bounding box in the image, h, w and l represent the height, width and length of the three-dimensional bounding box, x, y and z represent the coordinates of the central point of the three-dimensional bounding box in the point cloud, and r represents the rotation angle of the three-dimensional bounding box.
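For illustration only, the detection tuple above could be held in code as follows; the class and field names are assumptions introduced here, not part of the patent.

```python
from typing import NamedTuple

class Detection(NamedTuple):
    """One detection D_i^t of the current frame (illustrative field names)."""
    alpha: float   # observation angle of the target relative to the sensor
    u1: float      # horizontal image coordinate of the 2D box's upper-left corner
    v1: float      # vertical image coordinate of the 2D box's upper-left corner
    u2: float      # horizontal image coordinate of the 2D box's lower-right corner
    v2: float      # vertical image coordinate of the 2D box's lower-right corner
    h: float       # height of the 3D bounding box
    w: float       # width of the 3D bounding box
    l: float       # length of the 3D bounding box
    x: float       # x coordinate of the 3D box centre in the point cloud
    y: float       # y coordinate of the 3D box centre
    z: float       # z coordinate of the 3D box centre
    r: float       # rotation (yaw) angle of the 3D bounding box
```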
Further, step 2 specifically includes the following steps:
step 2.1: according to the detection data (u_{1,i}, v_{1,i}, u_{2,i}, v_{2,i}), the image patch at the corresponding position is cropped from the image and scaled to a fixed size of 224 × 224; the patch is then input into the two-dimensional feature extraction network to extract the two-dimensional feature F_i^{2D} of the target. Performing this operation on all M targets yields a two-dimensional feature for each target, which can be expressed as {F_1^{2D}, F_2^{2D}, …, F_M^{2D}};
step 2.2: according to (h_i, w_i, l_i, x_i, y_i, z_i, r_i) in the detection data, the points at the corresponding position in the point cloud are extracted and input into the three-dimensional feature extraction network to extract the three-dimensional feature F_i^{3D} of the target. Performing this operation on all M targets yields a three-dimensional feature for each target, which can be expressed as {F_1^{3D}, F_2^{3D}, …, F_M^{3D}};
step 2.3: the three-dimensional position information of the target is obtained from (x_i, y_i, z_i) in the detection data.
Further, step 3 specifically includes the following steps:
step 3.1: a squeeze-and-excitation module N_1 is used to adjust the channel weights of the target's two-dimensional feature, obtaining a new two-dimensional feature
F'_i^{2D} = N_1(F_i^{2D})
step 3.2: a squeeze-and-excitation module N_2 is used to adjust the channel weights of the target's three-dimensional feature, obtaining a new three-dimensional feature
F'_i^{3D} = N_2(F_i^{3D})
step 3.3: the new two-dimensional feature and three-dimensional feature of the target are spliced to obtain the fusion feature of the target
F_i = F'_i^{2D} ⊕ F'_i^{3D}
where ⊕ represents feature splicing (concatenation);
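For illustration, steps 3.1–3.3 could be realised as the following sketch; the layer sizes, reduction ratio and feature dimensions are assumptions, since the internal structure of N_1 and N_2 is only shown in FIG. 4.

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Channel re-weighting for a feature vector (assumed reduction ratio of 16)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:  # f: (batch, channels)
        return f * self.gate(f)                          # per-channel weights in (0, 1)

# N1 and N2 re-weight the 2D and 3D features, which are then concatenated (step 3.3).
se_2d = SqueezeExcitation(channels=512)    # N1; 2D feature size of 512 is assumed
se_3d = SqueezeExcitation(channels=1024)   # N2; 3D feature size of 1024 is assumed

def fuse_features(f2d: torch.Tensor, f3d: torch.Tensor) -> torch.Tensor:
    """Fusion feature F_i = N1(F_i^2D) (+) N2(F_i^3D), where (+) is concatenation."""
    return torch.cat([se_2d(f2d), se_3d(f3d)], dim=1)
```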
step 3.4: based on the detection data in the historical track set, a state equation of the tracked target is established; the state consists of 6 variables:
s = (x, y, z, Δx, Δy, Δz)
where (x, y, z) represents the three-dimensional position information of the target and (Δx, Δy, Δz) represents the difference between the target's three-dimensional position and its position in the previous frame; Kalman filter prediction is then applied to obtain the predicted three-dimensional position (x_j^{pred}, y_j^{pred}, z_j^{pred}) of each track in the current frame.
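Step 3.4 corresponds to a constant-velocity Kalman filter on the 3D box centre. The sketch below is one plausible parameterisation; the noise covariances are assumptions, since they are not specified here.

```python
import numpy as np

class CentreKalmanFilter:
    """Constant-velocity Kalman filter on the state (x, y, z, dx, dy, dz)."""
    def __init__(self, xyz: np.ndarray):
        self.x = np.zeros(6)
        self.x[:3] = xyz                       # initial position from the detection
        self.P = np.eye(6) * 10.0              # assumed initial uncertainty
        self.F = np.eye(6)
        self.F[:3, 3:] = np.eye(3)             # position += per-frame displacement
        self.H = np.hstack([np.eye(3), np.zeros((3, 3))])  # only the position is observed
        self.Q = np.eye(6) * 0.01              # assumed process noise
        self.R = np.eye(3) * 0.1               # assumed measurement noise

    def predict(self) -> np.ndarray:
        """Predict the track's 3D position in the current frame (used in step 4.2)."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:3]

    def update(self, z: np.ndarray) -> None:
        """Correct the state with the matched detection's (x, y, z)."""
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(6) - K @ self.H) @ self.P
```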
Further, step 4 specifically includes the following steps:
step 4.1: the feature distance between the i-th target of the current frame and the j-th track among the historical tracks is computed by the feature distance estimation network N_3:
d_{ij}^{feat} = N_3(F_i^t, F_j^{t-1})
where F_i^t represents the fusion feature of the i-th target of the current frame and F_j^{t-1} represents the fusion feature of the j-th historical track among all historical tracks up to frame t-1. Computing the feature distance between the historical tracks and all targets of the current frame yields the feature distance matrix D_{feat} ∈ R^{M×N}, where M represents the number of targets of the current frame and N represents the number of targets of the adjacent previous frame;
step 4.2: from the predicted positions obtained in step 3.4 and the detected positions extracted in step 2.3, the Euclidean distance is computed:
d_{ij}^{euc} = ‖(x_i, y_i, z_i) − (x_j^{pred}, y_j^{pred}, z_j^{pred})‖_2
where (x_i, y_i, z_i) represents the three-dimensional position information of the i-th target in the current frame and (x_j^{pred}, y_j^{pred}, z_j^{pred}) represents the predicted position of the j-th historical track in the current frame. Computing the Euclidean distance between the historical tracks and all targets of the current frame yields the Euclidean distance matrix D_{euc} ∈ R^{M×N}, where M represents the number of targets of the current frame and N represents the number of targets of the adjacent previous frame;
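The matrices of steps 4.1 and 4.2 can be assembled as in the following sketch; the pairwise call to N_3 is left abstract because its structure is given only in FIG. 6, and the nested loop is purely illustrative.

```python
import numpy as np
import torch

def euclidean_distance_matrix(det_xyz: np.ndarray, pred_xyz: np.ndarray) -> np.ndarray:
    """D_euc[i, j] = || detection i centre - predicted centre of track j ||_2.

    det_xyz:  (M, 3) detected 3D positions of the current frame
    pred_xyz: (N, 3) Kalman-predicted positions of the historical tracks
    """
    diff = det_xyz[:, None, :] - pred_xyz[None, :, :]   # (M, N, 3)
    return np.linalg.norm(diff, axis=-1)                # (M, N)

def feature_distance_matrix(det_feats, track_feats, n3) -> np.ndarray:
    """D_feat[i, j] = N_3(F_i^t, F_j^{t-1}); n3 is the feature distance estimation network."""
    M, N = len(det_feats), len(track_feats)
    D = np.zeros((M, N))
    with torch.no_grad():
        for i in range(M):
            for j in range(N):
                D[i, j] = float(n3(det_feats[i], track_feats[j]))
    return D
```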
further, the step 5 specifically includes the following steps:
step 5.1: estimating network N by feature distance weights 4 Calculating a characteristic distance weight A:
Figure BDA00036400040100000316
Figure BDA00036400040100000317
target fusion characteristics calculated in step 3.3;
step 5.2: and according to the weight, weighting and fusing the characteristic distance and Euclidean distance between the targets:
Figure BDA0003640004010000041
where denotes multiplication of corresponding elements.
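The following sketch illustrates step 5.2 under the assumption that A blends the two matrices as a convex combination; the patent states only that A is produced by N_4 and that corresponding elements are multiplied, so the exact blending rule here is an assumption.

```python
import numpy as np

def fuse_distances(D_feat: np.ndarray, D_euc: np.ndarray, A: np.ndarray) -> np.ndarray:
    """Element-wise fusion of the feature and Euclidean distance matrices.

    A is the (M, N) feature distance weight from N_4; the convex combination below
    is an assumed form consistent with element-wise multiplication.
    """
    return A * D_feat + (1.0 - A) * D_euc
```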
Further, step 6 specifically includes the following steps:
according to the distance matrix D_{final} between the current-frame targets and the historical tracks obtained in step 5, starting from the first row of the matrix, the column with the smallest distance in that row is selected; if this distance is lower than 2, the row and the column are regarded as representing the same target and are assigned the same identity (ID), and the column is set to 100 to prevent it from being matched a second time; if it is higher than 2, the target represented by the row is regarded as having no matching historical target and is assigned a new ID.
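The greedy association of step 6 follows directly from the description above (rows scanned in order, threshold 2, matched columns blocked with the value 100); the sketch below is one way to write it.

```python
import numpy as np

def greedy_associate(D_final: np.ndarray, threshold: float = 2.0, block: float = 100.0):
    """Greedy data association on the final distance matrix.

    Rows index current-frame detections, columns index historical tracks.
    Returns (matches, new_targets): matched (row, column) pairs keep the track's ID,
    unmatched rows receive new IDs.
    """
    D = D_final.copy()
    matches, new_targets = [], []
    for i in range(D.shape[0]):                 # start from the first row and go down
        j = int(np.argmin(D[i]))                # column with the smallest distance in this row
        if D[i, j] < threshold:
            matches.append((i, j))              # same identity (ID) as track j
            D[:, j] = block                     # set the column to 100 to prevent a second match
        else:
            new_targets.append(i)               # no matching historical target: assign a new ID
    return matches, new_targets
```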
Beneficial effects: the three-dimensional multi-target online tracking method based on feature fusion and distance fusion has the following advantages.
1) Building on the accurate detection results of DSA-PV-RCNN, the extracted high-dimensional semantic features, in addition to the bounding-box position information, express the target characteristics more fully.
2) The squeeze-and-excitation module based on the attention mechanism adjusts the channel weights of the target's two-dimensional and three-dimensional features, so that the more important parts of the original features are emphasized and play a larger role in the subsequent computation.
3) The fusion features obtained by fusing the multi-modal features express the state and properties of the target more fully and enhance the robustness of the algorithm in complex scenes.
4) The Euclidean distance obtained by modeling the target motion is fused at the decision level, so that information from more modalities of the target further strengthens the robustness of the algorithm in complex scenes.
5) The feature distance weight estimation network is learned in a data-driven manner, reducing the time consumed by manually searching for suitable parameters and improving the final tracking effect.
Drawings
FIG. 1 is an overall flow chart of a three-dimensional multi-target online tracking method based on feature fusion and distance fusion provided by the invention;
FIG. 2 is a network architecture diagram of a two-dimensional feature extraction network provided by the present invention;
FIG. 3 is a network architecture diagram of a three-dimensional feature extraction network provided by the present invention;
FIG. 4 is a network architecture diagram of the squeeze-and-excitation module used by the present invention;
FIG. 5 is an algorithmic flow diagram of a fusion module provided by the present invention;
fig. 6 is a network structure diagram of a characteristic distance estimation network provided by the present invention.
Fig. 7 is a diagram of a feature distance weight estimation network architecture provided by the present invention.
FIG. 8 is a flow chart of a data correlation algorithm used by the present invention.
FIG. 9 is a data set tracking result of the three-dimensional multi-target online tracking method based on feature fusion and distance fusion provided by the invention.
Fig. 10 is a visualization diagram of a tracking result of the three-dimensional multi-target online tracking method based on feature fusion and distance fusion, provided by the invention, wherein the upper diagram is a visualization result of a 115 th frame of a KITTI data set video sequence 0, and the lower diagram is a visualization result of a 100 th frame of a video sequence 2.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
As shown in the figure, the three-dimensional multi-target online tracking method based on feature fusion and distance fusion comprises the following steps:
step 1: detecting vehicles of the current image sequence and the point cloud sequence by using a target detector to obtain detection data;
step 2: extracting two-dimensional characteristics, three-dimensional characteristics and position information of a target in a current frame;
step 3: Fusing the two-dimensional features and the three-dimensional features of the targets in the current frame to obtain the fusion feature of each target, and predicting the position of each live track in the current frame by using Kalman filtering;
step 4: Calculating the fusion feature distances and the Euclidean position distances between all targets of the current frame and the historical tracks;
step 5: Fusing the feature distances and the Euclidean distances between the targets;
step 6: Performing data association according to the distances between the targets.
In this embodiment, step 1 specifically includes the following steps:
the target detector DSA-PV-RCNN is selected to perform target detection on the currently input image sequence and point cloud sequence; the detector is trained on the public KITTI dataset and the detection category is the vehicle. The bounding-box data of the M detections of the current frame are acquired:
D_i^t = (α_i, u_{1,i}, v_{1,i}, u_{2,i}, v_{2,i}, h_i, w_i, l_i, x_i, y_i, z_i, r_i), i = 1, …, M
where α represents the angle of the target relative to the sensor, u_1 and v_1 represent the horizontal and vertical coordinates of the upper-left corner of the two-dimensional bounding box in the image, u_2 and v_2 represent the horizontal and vertical coordinates of the lower-right corner of the two-dimensional bounding box in the image, h, w and l represent the height, width and length of the three-dimensional bounding box, x, y and z represent the coordinates of the central point of the three-dimensional bounding box in the point cloud, and r represents the rotation angle of the three-dimensional bounding box.
In this embodiment, step 2 specifically includes the following steps:
step 2.1: according to the detection data (u_{1,i}, v_{1,i}, u_{2,i}, v_{2,i}), the image patch at the corresponding position is cropped from the image and scaled to a fixed size of 224 × 224; the patch is then input into the two-dimensional feature extraction network to extract the two-dimensional feature F_i^{2D} of the target. Performing this operation on all M targets yields a two-dimensional feature for each target, which can be expressed as {F_1^{2D}, F_2^{2D}, …, F_M^{2D}};
step 2.2: according to (h_i, w_i, l_i, x_i, y_i, z_i, r_i) in the detection data, the points at the corresponding position in the point cloud are extracted and input into the three-dimensional feature extraction network to extract the three-dimensional feature F_i^{3D} of the target. Performing this operation on all M targets yields a three-dimensional feature for each target, which can be expressed as {F_1^{3D}, F_2^{3D}, …, F_M^{3D}};
step 2.3: the three-dimensional position information of the target is obtained from (x_i, y_i, z_i) in the detection data.
In this embodiment, step 3 specifically includes the following steps:
step 3.1: a squeeze-and-excitation module N_1 is used to adjust the channel weights of the target's two-dimensional feature, obtaining a new two-dimensional feature
F'_i^{2D} = N_1(F_i^{2D})
step 3.2: a squeeze-and-excitation module N_2 is used to adjust the channel weights of the target's three-dimensional feature, obtaining a new three-dimensional feature
F'_i^{3D} = N_2(F_i^{3D})
step 3.3: the new two-dimensional feature and three-dimensional feature of the target are spliced to obtain the fusion feature of the target
F_i = F'_i^{2D} ⊕ F'_i^{3D}
where ⊕ represents feature splicing (concatenation);
step 3.4: based on the detection data in the historical track set, a state equation of the tracked target is established; the state consists of 6 variables:
s = (x, y, z, Δx, Δy, Δz)
where (x, y, z) represents the three-dimensional position information of the target and (Δx, Δy, Δz) represents the difference between the target's three-dimensional position and its position in the previous frame; Kalman filter prediction is then applied to obtain the predicted three-dimensional position (x_j^{pred}, y_j^{pred}, z_j^{pred}) of each track in the current frame.
In this embodiment, step 4 specifically includes the following steps:
step 4.1: the feature distance between the i-th target of the current frame and the j-th track among the historical tracks is computed by the feature distance estimation network N_3:
d_{ij}^{feat} = N_3(F_i^t, F_j^{t-1})
where F_i^t represents the fusion feature of the i-th target of the current frame and F_j^{t-1} represents the fusion feature of the j-th historical track among all historical tracks up to frame t-1. Computing the feature distance between the historical tracks and all targets of the current frame yields the feature distance matrix D_{feat} ∈ R^{M×N}, where M represents the number of targets of the current frame and N represents the number of targets of the adjacent previous frame;
step 4.2: from the predicted positions obtained in step 3.4 and the detected positions extracted in step 2.3, the Euclidean distance is computed:
d_{ij}^{euc} = ‖(x_i, y_i, z_i) − (x_j^{pred}, y_j^{pred}, z_j^{pred})‖_2
where (x_i, y_i, z_i) represents the three-dimensional position information of the i-th target in the current frame and (x_j^{pred}, y_j^{pred}, z_j^{pred}) represents the predicted position of the j-th historical track in the current frame. Computing the Euclidean distance between the historical tracks and all targets of the current frame yields the Euclidean distance matrix D_{euc} ∈ R^{M×N}, where M represents the number of targets of the current frame and N represents the number of targets of the adjacent previous frame;
In this embodiment, step 5 specifically includes the following steps:
step 5.1: the feature distance weight A is computed by the feature distance weight estimation network N_4 from the target fusion features obtained in step 3.3:
A = N_4(F)
step 5.2: according to the weight A, the feature distance matrix D_{feat} and the Euclidean distance matrix D_{euc} between the targets are weighted and fused element-wise to obtain the final distance matrix D_{final}, where element-wise means multiplication of corresponding elements.
In this embodiment, step 6 specifically includes the following steps:
according to the distance matrix D_{final} between the current-frame targets and the historical tracks obtained in step 5, starting from the first row of the matrix, the column with the smallest distance in that row is selected; if this distance is lower than 2, the row and the column are regarded as representing the same target and are assigned the same identity (ID), and the column is set to 100 to prevent it from being matched a second time; if it is higher than 2, the target represented by the row is regarded as having no matching historical target and is assigned a new ID.
Examples
The invention provides a three-dimensional multi-target online tracking method based on feature fusion and distance fusion. Detection data are obtained on the image sequence and the point cloud sequence with the DSA-PV-RCNN target detector, and the vehicle is selected as the target category for multi-target tracking; the training dataset for the detector and the tracker is the KITTI dataset, which comprises 8008 images containing 24070 vehicles. The backbone of the two-dimensional feature extraction network of the algorithm is VGGNet, the backbone of the three-dimensional feature extraction network is PointNet, and the feature distance estimation network and the feature distance weight estimation network are trained with a contrastive loss and a hinge loss.
Experiment: the algorithm is tested on the KITTI dataset to verify the effectiveness of the proposed method for multi-target tracking.
The above description covers only the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principle of the invention, and such modifications and adaptations shall also fall within the protection scope of the invention.

Claims (7)

1. A three-dimensional multi-target online tracking method based on feature fusion and distance fusion, characterized by comprising the following steps:
step 1: detecting vehicles of the current image sequence and the point cloud sequence by using a target detector to obtain detection data;
step 2: extracting two-dimensional characteristics, three-dimensional characteristics and position information of a target in a current frame;
step 3: fusing the two-dimensional features and the three-dimensional features of the targets in the current frame to obtain the fusion feature of each target, and predicting the position of each live track in the current frame with Kalman filtering;
step 4: calculating the fusion feature distances and the Euclidean position distances between all targets of the current frame and the historical tracks;
step 5: fusing the feature distances and the Euclidean distances between the targets;
step 6: performing data association according to the distances between the targets.
2. The three-dimensional multi-target online tracking method based on feature fusion and distance fusion as claimed in claim 1, wherein the specific method of step 1 is as follows:
selecting the target detector DSA-PV-RCNN to perform target detection on the currently input image sequence and point cloud sequence, wherein the target detector is trained on the public KITTI dataset and the detection category is the vehicle; and acquiring the bounding-box data of the M detections of the current frame:
D_i^t = (α_i, u_{1,i}, v_{1,i}, u_{2,i}, v_{2,i}, h_i, w_i, l_i, x_i, y_i, z_i, r_i), i = 1, …, M
where α represents the angle of the target relative to the sensor, u_1 and v_1 represent the horizontal and vertical coordinates of the upper-left corner of the two-dimensional bounding box in the image, u_2 and v_2 represent the horizontal and vertical coordinates of the lower-right corner of the two-dimensional bounding box in the image, h, w and l represent the height, width and length of the three-dimensional bounding box, x, y and z represent the coordinates of the central point of the three-dimensional bounding box in the point cloud, and r represents the rotation angle of the three-dimensional bounding box.
3. The three-dimensional multi-target online tracking method based on feature fusion and distance fusion as claimed in claim 2, wherein the specific method of step 2 is:
step 2.1: according to the detection data (u_{1,i}, v_{1,i}, u_{2,i}, v_{2,i}), cropping the image patch at the corresponding position from the image and scaling it to a fixed size of 224 × 224; then inputting the patch into the two-dimensional feature extraction network to extract the two-dimensional feature F_i^{2D} of the target; performing this operation on all M targets yields a two-dimensional feature for each target, expressed as {F_1^{2D}, F_2^{2D}, …, F_M^{2D}};
step 2.2: according to (h_i, w_i, l_i, x_i, y_i, z_i, r_i) in the detection data, extracting the points at the corresponding position in the point cloud and inputting them into the three-dimensional feature extraction network to extract the three-dimensional feature F_i^{3D} of the target; performing this operation on all M targets yields a three-dimensional feature for each target, expressed as {F_1^{3D}, F_2^{3D}, …, F_M^{3D}};
step 2.3: obtaining the three-dimensional position information of the target from (x_i, y_i, z_i) in the detection data.
4. The three-dimensional multi-target online tracking method based on feature fusion and distance fusion as claimed in claim 3, wherein the specific method of step 3 is as follows:
step 3.1: using a squeeze-and-excitation module N_1 to adjust the channel weights of the target's two-dimensional feature, obtaining a new two-dimensional feature
F'_i^{2D} = N_1(F_i^{2D})
step 3.2: using a squeeze-and-excitation module N_2 to adjust the channel weights of the target's three-dimensional feature, obtaining a new three-dimensional feature
F'_i^{3D} = N_2(F_i^{3D})
step 3.3: splicing the new two-dimensional feature and three-dimensional feature of the target to obtain the fusion feature of the target
F_i = F'_i^{2D} ⊕ F'_i^{3D}
where ⊕ represents feature splicing;
step 3.4: establishing a state equation of the tracked target based on the detection data in the historical track set, the state consisting of 6 variables:
s = (x, y, z, Δx, Δy, Δz)
where (x, y, z) represents the three-dimensional position information of the target and (Δx, Δy, Δz) represents the difference between the target's three-dimensional position and its position in the previous frame; then applying Kalman filter prediction to obtain the predicted three-dimensional position (x_j^{pred}, y_j^{pred}, z_j^{pred}) of each track in the current frame.
5. The three-dimensional multi-target online tracking method based on feature fusion and distance fusion as claimed in claim 4, wherein the specific method of step 4 is as follows:
step 4.1: computing, by the feature distance estimation network N_3, the feature distance between the i-th target of the current frame and the j-th track among the historical tracks:
d_{ij}^{feat} = N_3(F_i^t, F_j^{t-1})
where F_i^t represents the fusion feature of the i-th target of the current frame and F_j^{t-1} represents the fusion feature of the j-th historical track among all historical tracks up to frame t-1; computing the feature distance between the historical tracks and all targets of the current frame yields the feature distance matrix D_{feat} ∈ R^{M×N}, where M represents the number of targets of the current frame and N represents the number of targets of the adjacent previous frame;
step 4.2: computing the Euclidean distance from the predicted positions obtained in step 3.4 and the detected positions extracted in step 2.3:
d_{ij}^{euc} = ‖(x_i, y_i, z_i) − (x_j^{pred}, y_j^{pred}, z_j^{pred})‖_2
where (x_i, y_i, z_i) represents the three-dimensional position information of the i-th target in the current frame and (x_j^{pred}, y_j^{pred}, z_j^{pred}) represents the predicted position of the j-th historical track in the current frame; computing the Euclidean distance between the historical tracks and all targets of the current frame yields the Euclidean distance matrix D_{euc} ∈ R^{M×N}, where M denotes the number of targets of the current frame and N denotes the number of targets of the adjacent previous frame.
6. The three-dimensional multi-target online tracking method based on feature fusion and distance fusion as claimed in claim 5, wherein the specific method of step 5 is:
step 5.1: computing the feature distance weight A by the feature distance weight estimation network N_4 from the target fusion features calculated in step 3.3:
A = N_4(F)
step 5.2: according to the weight A, weighting and fusing the feature distance matrix D_{feat} and the Euclidean distance matrix D_{euc} between the targets element-wise to obtain the final distance matrix D_{final}, element-wise meaning multiplication of corresponding elements.
7. The three-dimensional multi-target online tracking method based on feature fusion and distance fusion as claimed in claim 6, wherein the specific method of step 6 is as follows:
obtaining the distance matrix D_{final} between the current-frame targets and the historical tracks according to step 5; starting from the first row of the matrix, selecting the column with the smallest distance in that row; if this distance is lower than 2, regarding the row and the column as representing the same target, assigning them the same identity (ID), and setting the column to 100 to prevent a second match; if it is higher than 2, regarding the target represented by the row as having no matching historical target and assigning it a new ID.
CN202210512577.5A 2022-05-12 2022-05-12 Three-dimensional multi-target online tracking method based on feature fusion and distance fusion Pending CN114923491A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210512577.5A CN114923491A (en) 2022-05-12 2022-05-12 Three-dimensional multi-target online tracking method based on feature fusion and distance fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210512577.5A CN114923491A (en) 2022-05-12 2022-05-12 Three-dimensional multi-target online tracking method based on feature fusion and distance fusion

Publications (1)

Publication Number Publication Date
CN114923491A true CN114923491A (en) 2022-08-19

Family

ID=82808665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210512577.5A Pending CN114923491A (en) 2022-05-12 2022-05-12 Three-dimensional multi-target online tracking method based on feature fusion and distance fusion

Country Status (1)

Country Link
CN (1) CN114923491A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116523970A (en) * 2023-07-05 2023-08-01 之江实验室 Dynamic three-dimensional target tracking method and device based on secondary implicit matching
CN116523970B (en) * 2023-07-05 2023-10-20 之江实验室 Dynamic three-dimensional target tracking method and device based on secondary implicit matching
CN117237401A (en) * 2023-11-08 2023-12-15 北京理工大学前沿技术研究院 Multi-target tracking method, system, medium and equipment for fusion of image and point cloud
CN117237401B (en) * 2023-11-08 2024-02-13 北京理工大学前沿技术研究院 Multi-target tracking method, system, medium and equipment for fusion of image and point cloud

Similar Documents

Publication Publication Date Title
JP7147078B2 (en) Video frame information labeling method, apparatus, apparatus and computer program
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN110287826B (en) Video target detection method based on attention mechanism
US20220366576A1 (en) Method for target tracking, electronic device, and storage medium
CN112836639A (en) Pedestrian multi-target tracking video identification method based on improved YOLOv3 model
US20170316569A1 (en) Robust Anytime Tracking Combining 3D Shape, Color, and Motion with Annealed Dynamic Histograms
CN111127513A (en) Multi-target tracking method
CN114923491A (en) Three-dimensional multi-target online tracking method based on feature fusion and distance fusion
CN111797771B (en) Weak supervision video behavior detection method and system based on iterative learning
CN112489081B (en) Visual target tracking method and device
CN111626128A (en) Improved YOLOv 3-based pedestrian detection method in orchard environment
CN112884742B (en) Multi-target real-time detection, identification and tracking method based on multi-algorithm fusion
CN108564598B (en) Improved online Boosting target tracking method
CN110781262A (en) Semantic map construction method based on visual SLAM
CN113052873B (en) Single-target tracking method for on-line self-supervision learning scene adaptation
CN112069969A (en) Method and system for tracking highway monitoring video mirror-crossing vehicle
CN108734109B (en) Visual target tracking method and system for image sequence
CN112836640A (en) Single-camera multi-target pedestrian tracking method
CN113569882A (en) Knowledge distillation-based rapid pedestrian detection method
CN110827320B (en) Target tracking method and device based on time sequence prediction
CN114169425A (en) Training target tracking model and target tracking method and device
CN113129336A (en) End-to-end multi-vehicle tracking method, system and computer readable medium
Wang et al. Improving target detection by coupling it with tracking
CN117011346A (en) Blower image registration algorithm
CN101567088B (en) Method and device for detecting moving object

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination