CN115830075A

CN115830075A - Hierarchical association matching method for pedestrian multi-target tracking

Info

Publication number: CN115830075A
Application number: CN202310132470.2A
Authority: CN
Inventors: 梅桐
Original assignee: Wuhan Guangyinfei Technology Development Co ltd
Current assignee: Wuhan Guangyinfei Technology Development Co ltd
Priority date: 2023-02-20
Filing date: 2023-02-20
Publication date: 2023-03-21

Abstract

The invention discloses a hierarchical association matching method for pedestrian multi-target tracking, which comprises the following steps: step one: acquiring target detection boxes and detection scores; step two: predicting the position of each track in the current frame; step three: dividing the detection boxes into high-score and low-score detection boxes according to thresholds; step four: measuring the appearance feature distance using the cosine distance; step five: measuring the motion feature distance using the DIoU distance; step six: matching low-score detections with unmatched tracked targets using the DIoU distance; step seven: obtaining the bounding box and the identity ID of each target track. By adopting a hierarchical matching method, the invention introduces a strategy for matching low-score detections with tracked targets so as to recover occluded pedestrian targets from the low-score detection boxes, thereby reducing missed detections when matching detections to tracks.

Description

Hierarchical association matching method for pedestrian multi-target tracking

Technical Field

The invention relates to the technical field of multi-target tracking of pedestrians, in particular to a hierarchical association matching method for multi-target tracking of pedestrians.

Background

In surveillance video scenes, pedestrians are often crowded together, so that targets are partially or completely occluded and the track information of the tracked targets cannot be updated in time.

To handle mutual occlusion between targets and to further improve the accuracy and real-time performance of multi-target tracking algorithms, research has focused on establishing more reliable data association metrics. Most methods keep only the detection boxes whose detection scores are higher than a threshold and associate them with the tracks to obtain the identity of each track in the current frame. If targets whose detection scores fall below the threshold (for example, occluded targets) are discarded, a non-negligible number of real targets are lost and identity (ID) switches occur. If, however, all high-score and low-score detection boxes in each frame are considered, more false detections are introduced. It is therefore necessary to design a hierarchical association matching method for pedestrian multi-target tracking that reduces missed detections when matching detections to tracks.

Disclosure of Invention

The invention aims to provide a hierarchical association matching method for pedestrian multi-target tracking so as to solve the problems in the background technology.

In order to solve the technical problems, the invention provides the following technical scheme: a hierarchical association matching method for pedestrian multi-target tracking, comprising the following steps:

step one: acquiring a target detection box and a detection score;

step two: predicting the position of the track in the current frame;

step three: dividing the detection boxes into high-score and low-score detection boxes according to thresholds;

step four: measuring the appearance feature distance using the cosine distance;

step five: measuring the motion feature distance using the DIoU distance;

step six: matching low-score detections with unmatched tracked targets using the DIoU distance;

step seven: obtaining the bounding box and the identity ID of the target track.

According to the above technical solution, the method for obtaining the target detection box and the detection score further comprises:

first, a video frame is input, and its features are extracted through the global context information enhancement network of the MOT-CN network;

second, the extracted shared features are further decoupled into features for the individual tasks;

then, of the specific features output by the feature decoupling unit, the detection-specific feature is input to the detection branch and the ReID-specific feature is input to the feature embedding unit;

finally, the target detection boxes and the detection scores are obtained from the center point score, the center point offset and the target bounding box.

According to the above technical solution, the method for predicting the position of the track in the current frame further comprises:

before the high-score and low-score detection boxes are separated, a Kalman filter is used to predict the new position of each track in the current frame. The calculation formula is as follows:

predicted track positions = KF_predict(tracks)

wherein the tracks passed to KF_predict include the lost tracks.

According to the above technical solution, the specific steps of dividing the detection boxes into high-score and low-score detection boxes according to thresholds comprise:

according to the set high-score and low-score detection thresholds, all detection boxes are divided into two sets: detections whose score is higher than the high-score threshold are put into the high-score detection box set; detections whose score lies between the low-score threshold and the high-score threshold are put into the low-score detection box set.

According to the above technical solution, the method for measuring the appearance feature distance using the cosine distance specifically comprises:

first, data association is performed between the high-score detection boxes of the current frame and all tracked target tracks of the previous frame together with the lost tracks; that is, the cosine distance is used to compute the appearance similarity between each high-score detection box and the prediction box of each target track;

then, matching between the targets is completed using the Hungarian algorithm based on the appearance similarity.

According to the above technical solution, the method for measuring the motion feature distance using the DIoU distance further comprises:

first, the DIoU metric is used to compute the motion feature similarity between the remaining high-score detection boxes and the prediction boxes of the remaining target tracks;

then, the Hungarian algorithm is used to complete the matching between the targets.

According to the above technical solution, the method for matching low-score detections with unmatched tracked targets using the DIoU distance further comprises:

first, the DIoU metric is used to compute the motion feature similarity between the low-score detection boxes and the prediction boxes of the remaining target tracks;

then, the target tracks that are still unmatched are retained in the set of unmatched tracks, and all unmatched low-score detection boxes are deleted.

According to the above technical solution, the method for obtaining the bounding box and the identity ID of the target track comprises: for a remaining unmatched detection box, when its detection score is higher than a threshold value and it persists for more than two consecutive frames, a new target track is created; the target tracks in the current frame and the identity IDs of the targets are then output.

A hierarchical association matching system using the hierarchical association matching method, characterized in that: the system comprises:

the MOT-CN network architecture is used for obtaining a target detection frame and a detection score;

the track prediction module is used for predicting the position of the track in the current frame;

the dividing module is used for dividing the high-low detection frames according to a threshold value;

the analysis module is used for measuring the features of the detected targets;

and the matching acquisition module is used for acquiring the bounding box and the identity ID of each target track after the detections have been matched.

Compared with the prior art, the invention has the following beneficial effects: the hierarchical matching method adopted by the invention introduces a strategy for matching low-score detections with tracked targets, so as to recover occluded pedestrian targets from the low-score detection boxes and to reduce missed detections when matching detections to tracks.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

fig. 1 is a flowchart of a pedestrian multi-target tracking-oriented hierarchical association matching method according to an embodiment of the present invention;

fig. 2 is a schematic diagram illustrating a module composition of a pedestrian multi-target tracking-oriented hierarchical association matching system according to a second embodiment of the present invention;

fig. 3 is a schematic diagram of the overall structure of the MOT-CN network framework of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In a first embodiment, fig. 1 is a flowchart of a pedestrian multi-target tracking-oriented hierarchical association matching method provided in an embodiment of the present invention, where this embodiment is applicable to a surveillance video scene, and the method can be executed by the hierarchical association matching system provided in the embodiment of the present invention, as shown in fig. 1, the method specifically includes the following steps:

Step one: acquiring a target detection box and a detection score.

In the embodiment of the invention, the target detection boxes and detection scores are obtained using MOT-CN (Multi-Object Tracking for latency based on CenterNet). The MOT-CN network consists of four parts: a feature extraction unit, a feature decoupling unit, a detection branch and a feature embedding unit. MOT-CN takes a video frame as input, where H and W denote the height and width of the input video frame, respectively.

Illustratively, first, a video frame is input, and its features are extracted through the Global Context Enhancement Network (GCEN) of MOT-CN. GCEN comprises ResNet-34, a Deep Layer Aggregation (DLA) variant, a context-aware enhancement module (CEM) and a channel attention guided module (CAM). The CEM aims to enhance high-level semantic features and to pass the enhanced semantic information down to lower feature levels, exploring rich context information from multiple receptive fields. Meanwhile, when feature information is merged along the top-down path, a CAM block is added to each cross connection to refine the final integrated features at each level and to mitigate aliasing effects. The overall feature extraction process of GCEN can be expressed as a shared feature extraction operation applied to the input video frame, which yields the shared features.

Second, the extracted shared features are further decoupled into features for the individual tasks. The feature decoupling unit consists of two branches, each containing one convolution layer of a fixed kernel size, and decomposes the learned shared features into detection-specific information and ReID information so as to resolve the conflict in multi-task optimization. As shown by the grey decoupling block (Decouple block) in FIG. 3, the input of the feature decoupling unit is the shared features, and the outputs of its two branches are the detection-specific feature and the ReID-specific feature, respectively, each obtained by applying the convolution operation of the corresponding branch to the shared features.

Then, of the specific features output by the feature decoupling unit, the detection-specific feature is input to the detection branch and the ReID-specific feature is input to the feature embedding unit. The detection branch unit comprises three parallel branches, which respectively estimate the center point position of a target, the offset of the target center and the size of the target bounding box, so that the relevant targets can be located from the detection-specific information. The outputs of the detection branch unit are recorded, in order, as the center point score matrix F1, the center point offset matrix F2 and the target bounding box matrix F3. Each branch of the detection branch unit applies two convolution kernels to generate its final target feature; that is, F1, F2 and F3 are obtained by applying the convolution operations of the center point branch, the center point offset branch and the bounding box size branch, respectively, to the detection-specific feature. The feature embedding unit extracts a feature at each target center point for appearance feature matching in the subsequent tracking process, which improves tracking robustness. Its input is the ReID-specific feature, and the embedded feature is obtained by passing that input through a feature extraction network consisting of two convolutional layers.

Finally, the target detection boxes and the detection scores are obtained from the center point score, the center point offset and the target bounding box. Specifically, the values of the center point score matrix F1 are first sorted and the top N points are selected, giving a set of detection scores and a set of center point coordinates. Next, the offsets at the positions given by the center point coordinate set are read from the center point offset matrix F2, and the width and height values at the same positions are read from the target bounding box matrix F3. Finally, the set of target detection box positions for frame t is computed, where t denotes the index of the frame within the video. The detection boxes obtained in this step are further divided, according to their detection scores and the high-score and low-score thresholds, into high-score detection boxes and low-score detection boxes.
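Illustratively, the decoding of step one can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the function name `decode_detections`, the nested-list layout of the three maps, and the convention that F3 stores a (width, height) pair centered on the refined center point are all assumptions made here, and any scaling from feature-map coordinates back to image coordinates is omitted.

```python
def decode_detections(score_map, offset_map, size_map, top_n):
    """Select the top_n center points and build (x1, y1, x2, y2) boxes.

    score_map[y][x]  -> center point score (plays the role of F1)
    offset_map[y][x] -> (dx, dy) center offset (plays the role of F2)
    size_map[y][x]   -> (w, h) box size (plays the role of F3; layout assumed)
    """
    # Rank every grid cell by its center point score, highest first.
    cells = [(score_map[y][x], x, y)
             for y in range(len(score_map))
             for x in range(len(score_map[0]))]
    cells.sort(key=lambda cell: cell[0], reverse=True)

    boxes, scores = [], []
    for score, x, y in cells[:top_n]:
        dx, dy = offset_map[y][x]      # offset read at the selected position
        w, h = size_map[y][x]          # size read at the same position
        cx, cy = x + dx, y + dy        # refined center point
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
        scores.append(score)
    return boxes, scores
```

The returned scores are the values that the later threshold division operates on.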

Step two: predicting the position of the track in the current frame.

Illustratively, before the high-score and low-score detection boxes are separated, a Kalman filter is used to predict the new position of each track in the current frame. The calculation formula is as follows:

predicted track positions = KF_predict(tracks)

wherein the tracks passed to KF_predict include the lost tracks.
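Illustratively, the prediction of step two can be sketched as follows. The text does not specify the state vector or the noise model of the Kalman filter, so this sketch assumes a constant-velocity model applied independently to the center coordinates, with a hypothetical `process_noise` value; the names `kf_predict` and `predict_tracks` and the dictionary layout of a track are likewise assumptions.

```python
def kf_predict(mean, cov, dt=1.0, process_noise=1e-2):
    """One Kalman prediction step for a single axis.

    mean = [position, velocity]; cov = 2x2 covariance as nested lists.
    Constant-velocity model (assumed): F = [[1, dt], [0, 1]], Q = process_noise * I.
    """
    pos, vel = mean
    new_mean = [pos + dt * vel, vel]               # x' = F x
    (p00, p01), (p10, p11) = cov
    new_cov = [                                    # P' = F P F^T + Q
        [p00 + dt * (p01 + p10) + dt * dt * p11 + process_noise, p01 + dt * p11],
        [p10 + dt * p11, p11 + process_noise],
    ]
    return new_mean, new_cov


def predict_tracks(tracked, lost):
    """Predict tracked and lost tracks alike into the current frame."""
    tracks = tracked + lost
    for track in tracks:
        for axis in ("cx", "cy"):                  # center x and center y
            track[axis] = kf_predict(*track[axis])
    return tracks
```

Predicting the lost tracks together with the tracked ones keeps them available for re-association in the later matching stages.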

Step three: dividing the detection boxes into high-score and low-score detection boxes according to thresholds.

Illustratively, in the embodiment of the invention, a high-score detection threshold and a low-score detection threshold are set. Experimental analysis shows that the tracking performance is best when the high-score threshold is 0.4 and the low-score threshold is 0.1. All detection boxes are divided into two sets: detections whose score is higher than the high-score threshold are put into the high-score detection box set; detections whose score lies between the low-score threshold and the high-score threshold are put into the low-score detection box set.
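Illustratively, the division of step three can be sketched as follows. The two threshold values are those reported above; the function name `split_detections` and the handling of scores that fall exactly on a threshold are assumptions, since the text does not state whether the boundaries are inclusive.

```python
HIGH_THRESHOLD = 0.4  # high-score threshold reported in the text
LOW_THRESHOLD = 0.1   # low-score threshold reported in the text


def split_detections(boxes, scores, high=HIGH_THRESHOLD, low=LOW_THRESHOLD):
    """Split detections into a high-score set and a low-score set."""
    high_set, low_set = [], []
    for box, score in zip(boxes, scores):
        if score > high:
            high_set.append((box, score))
        elif score > low:
            # Assumption: a score exactly equal to `high` lands here,
            # and a score exactly equal to `low` is dropped.
            low_set.append((box, score))
        # Anything at or below `low` is treated as background and discarded.
    return high_set, low_set
```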

Step four: measuring the appearance feature distance using the cosine distance.

First, data association is performed between the high-score detection boxes of the current frame and all tracked target tracks of the previous frame together with the lost tracks; that is, the appearance similarity between each high-score detection box and the prediction box of each target track is computed. Because the appearance vector of a target is high-dimensional, the cosine measure is preferred: even in high dimensions it remains 1 for identical directions, 0 for orthogonal directions and -1 for opposite directions, whereas the Euclidean distance is affected by the dimension, has no fixed range and is harder to interpret. Therefore, the invention uses the cosine distance to compute the appearance similarity P1 of two targets, namely the cosine of the angle between the appearance feature of the detected target and the appearance feature of the tracked target, where both appearance features are obtained by the feature embedding unit of step one.

Then, matching between the targets is completed using the Hungarian algorithm based on the appearance similarity. The Hungarian algorithm finds the best match between detected targets and tracked targets according to the similarity matrix. Unmatched detection boxes are kept in the set of remaining high-score detection boxes, and unmatched target tracks are kept in the set of unmatched tracks.
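Illustratively, the appearance matching of step four can be sketched as follows. To stay free of third-party dependencies, the sketch finds the optimal one-to-one assignment by exhaustive search, which is only a stand-in for the Hungarian algorithm and is practical for a handful of targets; in practice a routine such as SciPy's `linear_sum_assignment` is a common choice. The function names and the gating value `min_sim=0.5` are assumptions and do not come from the text.

```python
import math
from itertools import permutations


def cosine_similarity(a, b):
    """Cosine of the angle between two appearance vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def best_assignment(sim, min_sim):
    """Exhaustive optimal assignment over a non-empty similarity matrix.

    Pairs with similarity below min_sim are left unmatched.
    Returns (matches, unmatched_rows, unmatched_cols).
    """
    n_rows, n_cols = len(sim), len(sim[0])
    transposed = n_rows > n_cols
    grid = [list(col) for col in zip(*sim)] if transposed else sim
    best_pairs, best_total = [], float("-inf")
    for perm in permutations(range(len(grid[0])), len(grid)):
        pairs = [(i, j) for i, j in enumerate(perm) if grid[i][j] >= min_sim]
        total = sum(grid[i][j] for i, j in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    if transposed:
        best_pairs = [(j, i) for i, j in best_pairs]
    rows = {r for r, _ in best_pairs}
    cols = {c for _, c in best_pairs}
    return (sorted(best_pairs),
            [r for r in range(n_rows) if r not in rows],
            [c for c in range(n_cols) if c not in cols])


def match_by_appearance(det_feats, track_feats, min_sim=0.5):
    """Match detections (rows) to tracks (columns) by cosine similarity."""
    if not det_feats or not track_feats:
        return [], list(range(len(det_feats))), list(range(len(track_feats)))
    sim = [[cosine_similarity(d, t) for t in track_feats] for d in det_feats]
    return best_assignment(sim, min_sim)
```

The two unmatched lists correspond to the remaining high-score detection boxes and the unmatched tracks that are handed to the next stage.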

Step five: measuring the motion feature distance using the DIoU distance.

In the embodiment of the invention, the DIoU metric is first used to compute the motion feature similarity between the remaining high-score detection boxes and the prediction boxes of the remaining target tracks. The formula combines the following quantities: the intersection-over-union (IoU) of the detection box and the prediction box; the center point coordinates of the detection box and of the prediction box; the Euclidean distance between these two center points; and the diagonal length of the smallest rectangular box covering both boxes. The center point of a target is low-dimensional data (dimension 2), for which the Euclidean distance is a suitable measure, whereas the appearance vector of step four is high-dimensional data, for which the cosine distance is suitable.

Then, the Hungarian algorithm is used to complete the matching between the targets. Unmatched detection boxes are kept in the set of remaining high-score detection boxes, and unmatched target tracks are kept in the set of unmatched tracks.
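Illustratively, the DIoU measure of step five can be sketched as follows. The sketch uses the commonly published definition, DIoU = IoU - d^2 / c^2, where d is the Euclidean distance between the two center points and c is the diagonal length of the smallest enclosing box; this agrees with the quantities listed above, but whether the embodiment uses exactly this form or a derived distance such as 1 - DIoU is an assumption. Boxes are assumed to be (x1, y1, x2, y2) tuples.

```python
def diou(box_a, box_b):
    """Distance-IoU of two (x1, y1, x2, y2) boxes, in the range (-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection over union.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared Euclidean distance between the two center points.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
        + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box covering both boxes.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
        + (max(ay2, by2) - min(ay1, by1)) ** 2

    return iou - d2 / c2 if c2 > 0 else iou
```

Unlike plain IoU, the value still distinguishes near misses from far misses when two boxes do not overlap at all, which is useful when an occluded pedestrian's box has drifted from its predicted position.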

Step six: matching low-score detections with unmatched tracked targets using the DIoU distance.

In the embodiment of the invention, the DIoU metric is first used to compute the motion feature similarity between the low-score detection boxes and the prediction boxes of the remaining target tracks. As in step five, the formula combines the intersection-over-union (IoU) of the detection box and the prediction box, the center point coordinates of the two boxes, the Euclidean distance between these center points, and the diagonal length of the smallest rectangular box covering both boxes.

Then, the target tracks that are still unmatched are retained in the set of unmatched tracks, and all unmatched low-score detection boxes are deleted.

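Step seven: obtaining the bounding box and the identity ID of the target track.

In the embodiment of the present invention, for a remaining unmatched detection box, if its detection score is higher than a threshold value and it persists for more than two consecutive frames, a new target track is created; the target tracks in the current frame and the identity IDs of the targets are then output.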
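Illustratively, the creation of new tracks from unmatched detections can be sketched as follows. The sketch follows the rule stated in the disclosure that a detection above a score threshold which persists for more than two consecutive frames becomes a new target track. Reading "more than two" as at least three frames, the class `Candidate`, the function `observe` and the ID counter are all assumptions, and the score threshold is left as a parameter because its value is not given.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional, Tuple

_track_ids = count(1)  # hypothetical generator of identity IDs


@dataclass
class Candidate:
    """An unmatched detection that may become a new track."""
    box: Optional[Tuple[float, float, float, float]] = None
    streak: int = 0                  # consecutive frames above the threshold
    track_id: Optional[int] = None   # set once the candidate is confirmed


def observe(candidate, box, score, score_threshold, min_streak=3):
    """Update a candidate with this frame's observation."""
    if box is None or score <= score_threshold:
        candidate.streak = 0         # the run of consecutive frames is broken
        return candidate
    candidate.box = box
    candidate.streak += 1
    if candidate.track_id is None and candidate.streak >= min_streak:
        candidate.track_id = next(_track_ids)  # confirmed: assign identity ID
    return candidate
```

Requiring several consecutive frames before assigning an identity keeps one-off false detections from spawning short-lived tracks.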

The basic idea of the embodiment of the invention is that each detection box is classified by comparing its detection score with the high-score detection threshold and the low-score detection threshold: if the score is higher than the high-score threshold, the detection box is classed as a high-score detection box; if the score lies between the low-score threshold and the high-score threshold, it is classed as a low-score detection box. The high-score and low-score detection boxes are then matched in a hierarchical sequence. First, similarity measurement and data association are performed between the high-score detection boxes and the tracks using appearance features and motion features. Next, similarity is computed again between the remaining unmatched tracks and the low-score detection boxes, and the two are associated, so that pedestrian targets among the low-score detection boxes are recovered and background detections are filtered out. For the motion information, the motion feature distance is measured using DIoU, which, when suppressing redundant boxes, considers both the overlapping area between the target boxes and the distance between the center points of the two bounding boxes.
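Illustratively, the three-stage hierarchy described above can be sketched as follows. The two similarity functions are passed in as parameters so that the control flow stays visible; `greedy_match` is a simple greedy matcher used here in place of the Hungarian algorithm, and the gating values `min_app` and `min_motion` are assumptions, none of which comes from the text.

```python
def greedy_match(sim, min_sim):
    """Greedy one-to-one matching by descending similarity (Hungarian stand-in)."""
    ranked = sorted(((s, i, j) for i, row in enumerate(sim)
                     for j, s in enumerate(row) if s >= min_sim), reverse=True)
    used_rows, used_cols, matches = set(), set(), []
    for _, i, j in ranked:
        if i not in used_rows and j not in used_cols:
            matches.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return matches


def _stage(dets, tracks, sim_fn, min_sim):
    """Run one association stage; return (pairs, unmatched dets, unmatched tracks)."""
    sim = [[sim_fn(d, t) for t in tracks] for d in dets]
    matches = greedy_match(sim, min_sim)
    hit_d = {i for i, _ in matches}
    hit_t = {j for _, j in matches}
    pairs = [(dets[i], tracks[j]) for i, j in matches]
    rest_d = [d for i, d in enumerate(dets) if i not in hit_d]
    rest_t = [t for j, t in enumerate(tracks) if j not in hit_t]
    return pairs, rest_d, rest_t


def hierarchical_match(high_dets, low_dets, tracks, appearance_sim, motion_sim,
                       min_app=0.5, min_motion=0.0):
    # Stage 1: high-score detections vs. all tracks, by appearance.
    m1, rest_high, rest_tracks = _stage(high_dets, tracks, appearance_sim, min_app)
    # Stage 2: remaining high-score detections vs. remaining tracks, by motion.
    m2, rest_high, rest_tracks = _stage(rest_high, rest_tracks, motion_sim, min_motion)
    # Stage 3: low-score detections vs. still-unmatched tracks, by motion.
    m3, _dropped_low, rest_tracks = _stage(low_dets, rest_tracks, motion_sim, min_motion)
    # Unmatched low-score detections are discarded; unmatched high-score
    # detections remain as candidates for new tracks.
    return m1 + m2 + m3, rest_high, rest_tracks
```

Only tracks that survive the first two stages unmatched are ever compared with low-score detections, which is how occluded pedestrians can be recovered without admitting every low-score box as a candidate.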

In a second embodiment, the present invention provides a hierarchical association matching system for pedestrian multi-target tracking. Fig. 2 is a schematic diagram of the module composition of this system; as shown in fig. 2, the system includes:

the MOT-CN network architecture, which is used for obtaining the target detection boxes and detection scores;

the track prediction module, which is used for predicting the position of the track in the current frame;

the dividing module, which is used for dividing the detection boxes into high-score and low-score detection boxes according to thresholds;

the analysis module, which is used for measuring the features of the detected targets.

In some embodiments of the invention, the MOT-CN network architecture comprises:

the feature extraction unit, which is used for extracting video frame features from the input video frame;

the feature decoupling unit, which is used for decomposing the learned shared features into detection-specific information and ReID information;

the detection branch unit, which is used for estimating the center point position of the target, the offset of the target center and the size of the target bounding box;

the feature embedding unit, which is used for extracting a feature at each target center point.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

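1. A hierarchical association matching method for pedestrian multi-target tracking, characterized in that the method specifically comprises the following steps:

step one: acquiring a target detection box and a detection score;

step two: predicting the position of the track in the current frame;

step three: dividing the detection boxes into high-score and low-score detection boxes according to thresholds;

step four: measuring the appearance feature distance using the cosine distance;

step five: measuring the motion feature distance using the DIoU distance;

step six: matching low-score detections with unmatched tracked targets using the DIoU distance;

step seven: obtaining the bounding box and the identity ID of the target track.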

2. The hierarchical association matching method for pedestrian multi-target tracking according to claim 1, characterized in that the method for obtaining the target detection box and the detection score further comprises:

first, a video frame is input, and its features are extracted through the global context information enhancement network of the MOT-CN network;

second, the extracted shared features are further decoupled into features for the individual tasks;

then, of the specific features output by the feature decoupling unit, the detection-specific feature is input to the detection branch and the ReID-specific feature is input to the feature embedding unit;

finally, the target detection boxes and the detection scores are obtained from the center point score, the center point offset and the target bounding box.

3. The hierarchical association matching method for pedestrian multi-target tracking according to claim 1, characterized in that the method for predicting the position of the track in the current frame further comprises:

before the high-score and low-score detection boxes are separated, a Kalman filter is used to predict the new position of each track in the current frame, the calculation formula being as follows:

predicted track positions = KF_predict(tracks)

wherein the tracks passed to KF_predict include the lost tracks.

4. The hierarchical association matching method for pedestrian multi-target tracking according to claim 1, characterized in that the specific steps of dividing the detection boxes into high-score and low-score detection boxes according to thresholds comprise:

according to the set high-score and low-score detection thresholds, all detection boxes are divided into two sets: detections whose score is higher than the high-score threshold are put into the high-score detection box set; detections whose score lies between the low-score threshold and the high-score threshold are put into the low-score detection box set.

5. The hierarchical association matching method for pedestrian multi-target tracking according to claim 1, characterized in that the method for measuring the appearance feature distance using the cosine distance specifically comprises:

first, data association is performed between the high-score detection boxes of the current frame and all tracked target tracks of the previous frame together with the lost tracks; that is, the cosine distance is used to compute the appearance similarity between each high-score detection box and the prediction box of each target track;

then, matching between the targets is completed using the Hungarian algorithm based on the appearance similarity.

6. The hierarchical association matching method for pedestrian multi-target tracking according to claim 1, characterized in that the method for measuring the motion feature distance using the DIoU distance further comprises:

first, the DIoU metric is used to compute the motion feature similarity between the remaining high-score detection boxes and the prediction boxes of the remaining target tracks;

then, the Hungarian algorithm is used to complete the matching between the targets.

7. The hierarchical association matching method for pedestrian multi-target tracking according to claim 1, characterized in that the method for matching low-score detections with unmatched tracked targets using the DIoU distance further comprises:

first, the DIoU metric is used to compute the motion feature similarity between the low-score detection boxes and the prediction boxes of the remaining target tracks;

then, the target tracks that are still unmatched are retained in the set of unmatched tracks, and all unmatched low-score detection boxes are deleted.

8. The hierarchical association matching method for pedestrian multi-target tracking according to claim 1, characterized in that the method for obtaining the bounding box and the identity ID of the target track comprises: for a remaining unmatched detection box, when its detection score is higher than a threshold value and it persists for more than two consecutive frames, a new target track is created; the target tracks in the current frame and the identity IDs of the targets are then output.

9. A hierarchical association matching system using the hierarchical association matching method according to any one of claims 1-8, characterized in that the system comprises:

the dividing module, which is used for dividing the detection boxes into high-score and low-score detection boxes according to thresholds;