CN114998999B

CN114998999B - Multi-target tracking method and device based on multi-frame input and track smoothing

Info

Publication number: CN114998999B
Application number: CN202210856428.0A
Authority: CN
Inventors: 张文广; 徐晓刚; 虞舒敏; 曹卫强
Original assignee: Zhejiang Gongshang University; Zhejiang Lab
Current assignee: Zhejiang Gongshang University; Zhejiang Lab
Priority date: 2022-07-21
Filing date: 2022-07-21
Publication date: 2022-12-06
Anticipated expiration: 2042-07-21
Also published as: CN114998999A

Abstract

The invention discloses a multi-target tracking method and a multi-target tracking device based on multi-frame input and track smoothing, wherein the method comprises the following steps: step S1: acquiring a pedestrian video data set, marking pedestrian coordinates and pedestrian tracks, and generating fragment type track data; step S2: constructing and training a pedestrian multi-target tracking network model based on multi-frame input and smooth track; and step S3: reasoning is carried out based on a pedestrian multi-target tracking network model obtained through training, and a current frame pedestrian target detection and feature extraction result and previous frames of pedestrian target detection and feature extraction results are obtained, namely the coordinates and the appearance features of a multi-frame image target are obtained; and step S4: and matching the shortest characteristic distance by using the coordinates and the appearance characteristics of the multi-frame image target, and smoothing the track by using a track curvature smoothing function to finally obtain the track of the current frame. The method has the advantages of low time consumption and good robustness to the shielding problem of the same kind of targets.

Description

Multi-target tracking method and device based on multi-frame input and track smoothing

Technical Field

The invention relates to the technical field of image recognition, in particular to a multi-target tracking method and device based on multi-frame input and track smoothing.

Background

With the wide deployment of urban public area monitoring cameras, the online detection and multi-target tracking technology for interested targets has significant academic and commercial values based on the requirements of public safety and emergency recourse.

Most of the current tracking algorithms for objects such as pedestrians use a detection network to obtain the position of an interested object, then use a ReID network to extract the appearance characteristic of the object, and finally use a Hungarian algorithm or a greedy algorithm to perform matching based on the distance measurement of a characteristic space. However, this approach has significant drawbacks: 1. when the targets are matched, only feature matching is carried out on the targets and the previous frame or frames, and the identity identification numbers are easy to exchange due to similar features of the blocked targets; 2. the selection of a fixed feature distance threshold is highly likely to result in a newly appearing target matching a track that has historically disappeared because there is no active track matching.

Based on the above two problems, the academic world mainly relies on proposing a network with better detection performance and proposing a network with stronger feature expression robustness, but as shown in fig. 3, in such a situation, because the occlusion of the same kind of object can lead to that a part of the appearance features of the object are covered by the appearance features of other objects, when two people meet, one person blocks the other person, and at the moment of occlusion, the appearance features of the blocked person also become the features of the occluded person, therefore, in a manner based on feature matching, the identity ID of the object is very easy to cause interchange, so as to generate a misconnection track as shown by the B indication line in fig. 3, the real situation should be the track shown by the a indication line in fig. 3, and this problem has not been solved well.

Based on the above, a pedestrian multi-target tracking method which is high in operation efficiency, resistant to similar target shielding and excellent in performance needs to be provided.

Disclosure of Invention

In order to solve the technical problems in the prior art, the invention provides a multi-target tracking method and a multi-target tracking device based on multi-frame input and track smoothing, and the specific technical scheme is as follows:

a multi-target tracking method based on multi-frame input and track smoothing comprises the following steps:

step S1: acquiring a pedestrian video data set, marking pedestrian coordinates and pedestrian tracks, and generating fragment type track data;

step S2: constructing and training a pedestrian multi-target tracking network model based on multi-frame input and smooth track;

and step S3: reasoning is carried out based on a pedestrian multi-target tracking network model obtained through training, and a current frame pedestrian target detection and feature extraction result and previous frames of pedestrian target detection and feature extraction results are obtained, namely the coordinates and the appearance features of a multi-frame image target are obtained;

and step S4: and matching the shortest characteristic distance by using the coordinates and the appearance characteristics of the multi-frame image target, and smoothing the track by using a track curvature smoothing function to finally obtain the track of the current frame.

Further, the step S1 specifically includes: marking pedestrians in the pedestrian video sequence frame by using marking software for the acquired open source pedestrian video, wherein the marking comprises marking a target frame and an Identification (ID) number of a target, and the ID numbers are accumulated from 1; then, cutting and binding the pedestrian video with a fixed length to generate a track segment, wherein the track segment is composed of 2m +1 image sequence frames, namely the data of the track segment is composed of m image frames before to m image frames after an image frame at a certain moment, and m is a positive integer.

Further, the pedestrian multi-target tracking network model is formed by combining a Yolov5-L main network and a multi-scale feature extraction module, the multi-scale feature extraction module and a target detection head of the Yolov5-L main network are arranged in parallel and input in the same mode, and the multi-scale feature extraction module is composed of a 3 × 256 convolution layer and a 1 × 256 × 3 convolution layer; the input image passes through a Yolov5-L main network and then passes through a multi-scale feature extraction module to output an appearance feature map with the same size as the input image, and then the appearance feature corresponding to the target frame is obtained by intercepting in the appearance feature map based on a preset frame to which the target frame detected by a target detection head belongs.

Further, the training of the pedestrian multi-target tracking network model is to train by adopting the fragment type trajectory data, simultaneously send image sequence frames of the fragment type trajectory data into the pedestrian multi-target tracking network model for reasoning, calculate to obtain coordinates of a target, namely a target frame and appearance characteristics, match the coordinates and the appearance characteristics of the target by adopting a shortest characteristic distance and a trajectory curvature smooth function based on the coordinates and the appearance characteristics of the target, and simultaneously utilize a total loss function to solve a gradient for reasoning the backward direction of the pedestrian multi-target tracking network model.

Further, the total loss function is a loss function combining the trajectory feature distance and the fitted loss function with a weighted average of the average L1 loss functions of the trajectory detection

，

A loss function representing the combined trajectory feature distance and fit,

represents the average L1 loss function of track segment object detection.

Further, the combined track characteristic distance and the fitted loss function are obtained by weighted average of a track characteristic distance loss function and a track curvature smooth loss function, and feature extraction and track matching of the pedestrian multi-target tracking network model are trained and learned;

the trajectory feature distance loss function is expressed as:

；

wherein,

，i∈[1，2m+1]representing the target frame in the ith image frame

And the ith image frame real label target frame

The characteristic distance is represented by a cosine function of a characteristic vector included angle, and 2m +1 is the number of image sequence frames of the track segment;

the trajectory curvature smoothing loss function is expressed as:

，

wherein x represents the number of target tracks formed in the track segment,

the curvature of the track of the jth predicted target is the average track of the 2m +1 frame image,

for the curvature of the corresponding real label trajectory, j ∈ [1,x],

The predicted target track and the actual label track are the average track curvature difference; the matching specifically comprises the steps that the predicted target track is matched with the real label track by adopting the rule of curve front end, middle end and rear end IOU matching;

thus, the combined trajectory feature distance and fitted penalty function is represented as:

，

，

wherein

And

are all weighted based on

The learning of the pedestrian multi-target tracking network model on feature extraction and track matching is supervised.

Further, the average L1 loss function of the track segment target detection is expressed as:

，

wherein,

represents the average L1 loss function of target detection for the ith frame image, and 2m +1 is the number of image sequence frames of the track segment.

Further, the step S4 specifically includes: the method comprises the following steps of adopting a trained pedestrian multi-target tracking network model, carrying out shortest characteristic distance matching by utilizing coordinates and appearance characteristics of multi-frame image targets, carrying out track smoothing, and enabling the weighted sum of the average characteristic distance and the track curvature of the target and the track target of the previous 2m frames to be the minimum by the track obtained by matching, wherein the weighted sum is expressed as:

，

，

wherein

Representing the apparent feature distance of the current predicted image frame k from the ith image frame in its 2m preceding frames,

and

are all weighted weights.

A multi-target tracking device based on multi-frame input and track smoothing comprises one or more processors and is used for realizing the multi-target tracking method based on multi-frame input and track smoothing.

A computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the multi-target tracking method based on multi-frame input and trajectory smoothing.

Compared with the prior art, the invention has the beneficial effects that: 1. by constructing the fragment type trajectory data set, different angles of the same target in a plurality of images can simultaneously supervise the learning of the feature extraction module, so that the learned appearance features are more robust; 2. in the continuous frames of the video, the motion of the pedestrian is linear in a short time, the track jump cannot occur, and the track jump caused by mismatching due to similar target shielding can be well filtered through track smoothing based on 2m +1 frames; 3. by sending a section of training track data set, the detection module can learn the target information of the same target at different moments at the same time, so that the detection performance is improved to a certain extent; 4. by sharing the cache mode with the characteristics and the detection result, only 1 frame of picture needs to be inferred during deployment, and the effect of inputting 2m +1 frame of picture is achieved.

Drawings

Fig. 1 is a schematic flow chart of a multi-target tracking method based on multi-frame input and track smoothing according to an embodiment of the present invention;

fig. 2 is a schematic diagram of an overall network framework of a multi-target tracking method based on multi-frame input and trajectory smoothing according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an embodiment of the present invention in which an identity exchange problem exists;

fig. 4 is a schematic structural diagram of a multi-target tracking apparatus based on multi-frame input and track smoothing according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and technical effects of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.

The invention provides a multi-frame input and smooth track-based multi-target tracking method, which aims to solve the problem of track misconnection caused by shielding of similar targets in the conventional multi-target tracking algorithm. The invention provides a single-stage network based on multi-frame input detection and ReID feature extraction, wherein Yolov5-L is used as a backbone of a network model, a target feature extraction module is added at the tail end of the network, and coordinates and appearance features of a pedestrian target are simultaneously obtained through the single-stage network; and adopting multi-frame input, matching multi-frame target characteristics and smoothing an online track to reduce the misconnection rate.

As shown in fig. 1, the method specifically includes the following steps:

specifically, the marking of the pedestrian coordinates and the pedestrian track refers to marking the pedestrians in the video sequence frame by using professional marking software, and comprises marking a target frame and an Identification (ID) number of the target, wherein the ID numbers are accumulated from 1;

the generation of fragment type track data refers to that historical video data is cut and bound in a fixed length to generate track fragments, overlapping frames or non-overlapping frames can exist among the track fragments, a track fragment before and after a certain time of the history is supposed to be composed of 2m +1 image sequence frames, namely the data of the track fragment is composed of m image frames from the front to the back of the image frame at the certain time, m is a positive integer, and the fragment type track data only binds image frame serial numbers and does not need to relate to repeated copying of images.

specifically, as shown in fig. 2, a multi-frame input pedestrian multi-target tracking network model based on the shortest feature distance and the smooth matching of the trajectory is mainly formed by combining a Yolov5-L main network and a multi-scale feature extraction module, the size of an input picture of the network model is 960 × 960, the multi-scale feature extraction module is arranged in parallel with a target detection head Detect of the Yolov5-L main network, the Yolov5-L main network performs target detection on an input image through a target detection head, the input of the multi-scale feature extraction module is the same as that of the target detection head, and the multi-scale feature extraction module is composed of one 3 × 256 convolution layer and one 1 × 256 × 3 convolution layer and performs feature extraction on a target; the method comprises the steps that an input image passes through a Yolov5-L main network and then passes through a multi-scale feature extraction module to finally obtain an appearance feature map with the same size as the input image, and finally, appearance features corresponding to a target frame are obtained in the appearance feature map by intercepting on the basis of a preset frame to which the target frame obtained by detection of a target detection head Detect belongs, wherein the dimension of the appearance features of a single target is 256 dimensions.

The training of the pedestrian multi-target tracking network model is to train by adopting the fragment type track data, send image sequence frames of the fragment type track data into the pedestrian multi-target tracking network model for reasoning, calculate to obtain coordinates of a target, namely a target frame and appearance characteristics, match the coordinates and the appearance characteristics of the target by adopting a shortest characteristic distance and a track curvature smooth function, and simultaneously solve gradient through a total loss function to carry out backward reasoning of the pedestrian multi-target tracking network model.

The feature extraction of the pedestrian multi-target tracking network model and the matching training of the track adopt a loss function combining track feature distance and fitting to supervise the learning of the feature extraction and the track matching of the network model. The loss function is obtained by weighted average of a track characteristic distance loss function and a track curvature smooth loss function.

For the distance of the feature space, the cosine function of the included angle of the feature vector is adopted to represent, and then the target frame in the ith image frame

And the ith image frame real label target frame

Is expressed as:

，i∈[1，2m+1]，

when the included angle of the feature vector is closer to 0 degree, the average predicted value is closer to the true value, and the feature distance is closer

The closer to 0, the characteristic distance otherwise

The closer to 1; thus, the trajectory feature distance loss function for a trajectory segment is represented as:

；

assuming that the track segments together form the track of x objects, calculating the average track curvature of the j-th object at 2m +1 frame image as

，j∈[1，x]Calculating the average track curvature difference between the predicted target track and the real label track as

In which

Matching the predicted track and the real label track by adopting a rule of curve front end, middle end and rear end IOU matching for the curvature of the corresponding real label track, namely, taking a coordinate of a first target, a coordinate of a middle target and a coordinate of a last disappearance in the predicted track and coordinates of first, middle and last time periods corresponding to the real label track to perform IOU matching, if the average IOU is maximum, considering the predicted track and the real track to be the same track, and calculating a loss function of fitting the predicted track and the real label track based on a calculation formula of an average track curvature difference value, namely a track curvature smooth loss function:

。

finally, the combined trajectory feature distance and fitted loss function is expressed as:

，

，

wherein

And

are all weighted based on

The pedestrian multi-target tracking network model detects and trains the L1 loss function in the commonly used single-frame target detection model, and assumes the target of the ith frame imageAverage L1 loss function representation of target detection

Then the average L1 loss function of target detection for all images of the track segment

And (4) showing.

Finally, the total loss function of the training of the multi-target tracking network model of the pedestrian is expressed as

。

And step S3: reasoning is carried out based on a pedestrian multi-target tracking network model obtained through training, and a current frame pedestrian detection and feature extraction result and previous frames of pedestrian detection and feature extraction results are obtained, namely the coordinates and the appearance features of a multi-frame image target are obtained;

specifically, a trained pedestrian multi-target tracking network model is used for detecting a pedestrian target frame of an obtained frame image and corresponding appearance characteristics of the frame image, a track segment is formed by front and back m frames of images in the training process, however, a track segment is inferred by 2m +1 formed by front 2m frames of images in actual application deployment, so that the model is not required to be inferred for 2m +1 times each time in actual inference, only 1 time is required to be inferred, and the result of the front 2m frames is obtained by previous cache.

Specifically, based on the image inference result of 2m +1 frame, the shortest feature distance matching and the track curvature smoothing principle are used, that is, the track obtained by matching makes the weighted sum of the average feature distance and the track curvature of the current predicted target and the track target of the previous 2m frames minimum, and the weighted sum is expressed as:

，

，

wherein

and

are all weighted weights.

Corresponding to the embodiment of the multi-target tracking method based on multi-frame input and track smoothing, the invention also provides an embodiment of the multi-target tracking device based on multi-frame input and track smoothing.

Referring to fig. 4, the multi-target tracking device based on multi-frame input and trajectory smoothing provided by the embodiment of the present invention includes one or more processors, and is configured to implement the multi-target tracking method based on multi-frame input and trajectory smoothing in the foregoing embodiment.

The embodiment of the multi-target tracking device based on multi-frame input and track smoothing can be applied to any equipment with data processing capability, such as computers and other equipment or devices. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. The software implementation is taken as an example, and as a device in a logical sense, a processor of any device with data processing capability reads corresponding computer program instructions in the nonvolatile memory into the memory for operation. In terms of hardware, as shown in fig. 4, a hardware structure diagram of an arbitrary device with data processing capability where a multi-frame input and trajectory smoothing-based multi-target tracking apparatus of the present invention is located is shown, except for the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 4, in an embodiment, the arbitrary device with data processing capability where the apparatus is located may also include other hardware according to an actual function of the arbitrary device with data processing capability, which is not described again.

The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.

For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.

The embodiment of the invention also provides a computer readable storage medium, on which a program is stored, and when the program is executed by a processor, the multi-target tracking method based on multi-frame input and track smoothing in the above embodiments is implemented.

The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing device described in any previous embodiment. The computer readable storage medium may also be an external storage device such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the practice of the invention as described in the foregoing examples, or that certain features may be substituted in the practice of the invention. All changes, equivalents and modifications which come within the spirit and scope of the invention are desired to be protected.

Claims

1. A multi-target tracking method based on multi-frame input and track smoothing is characterized by comprising the following steps:

specifically, the training of the pedestrian multi-target tracking network model is to train by adopting the fragment type track data, send image sequence frames of the fragment type track data into the pedestrian multi-target tracking network model for reasoning at the same time, calculate to obtain a coordinate of a target, namely a target frame and an appearance characteristic, match the coordinate and the appearance characteristic of the target by adopting a shortest characteristic distance and a track curvature smooth function based on the coordinate and the appearance characteristic of the target, and simultaneously utilize a total loss function to solve a gradient for backward reasoning of the pedestrian multi-target tracking network model;

the total loss function is a loss function combining the trajectory feature distance and the fitted loss function with the weighted average of the average L1 loss functions of the trajectory detection

，

A loss function representing the combined trajectory feature distance and fit,

an average L1 loss function representing track segment target detection;

the combined track characteristic distance and the fitted loss function are obtained by weighted average of a track characteristic distance loss function and a track curvature smooth loss function, and training and learning are carried out on characteristic extraction and track matching of the pedestrian multi-target tracking network model;

the trajectory feature distance loss function is expressed as:

；

wherein,

，i∈[1，2m+1]representing the target frame in the ith image frame

And the ith image frame real label target frame

the trajectory curvature smoothing loss function is expressed as:

，

wherein x represents the number of target tracks formed in the track segment,

for the curvature of the corresponding real label trajectory, j ∈ [1,x],

The predicted target track and the real label track are the average track curvature difference; the matching is specifically matching of a predicted target track and a real label track, and matching is carried out by adopting a rule of curve front end, middle end and rear end IOU matching;

，

，

wherein

And

are all weighted based on

The learning of a pedestrian multi-target tracking network model on feature extraction and track matching is supervised;

and step S4: the method comprises the following steps of carrying out shortest characteristic distance matching by utilizing coordinates and appearance characteristics of a multi-frame image target, carrying out track smoothing by utilizing a track curvature smoothing function, and finally obtaining the track of a current frame, wherein the method specifically comprises the following steps: the method comprises the following steps of adopting a trained pedestrian multi-target tracking network model, carrying out shortest characteristic distance matching by utilizing coordinates and appearance characteristics of multi-frame image targets, carrying out track smoothing, and enabling the weighted sum of the average characteristic distance and the track curvature of the target and the track target of the previous 2m frames to be the minimum by the track obtained by matching, wherein the weighted sum is expressed as:

，

，

wherein

and

are all weighted weights.

2. The multi-target tracking method based on multi-frame input and track smoothing as claimed in claim 1, wherein the step S1 specifically comprises: marking pedestrians in the pedestrian video sequence frame by using marking software for the acquired open source pedestrian video, wherein the marking comprises marking a target frame and an Identification (ID) number of a target, and the ID numbers are accumulated from 1; then, cutting and binding the pedestrian video with a fixed length to generate a track segment, wherein the track segment is composed of 2m +1 image sequence frames, namely the data of the track segment is composed of m image frames before to m image frames after an image frame at a certain moment, and m is a positive integer.

3. The multi-frame input and track smoothing-based multi-target tracking method according to claim 1, wherein the pedestrian multi-target tracking network model is formed by combining a Yolov5-L main network and a multi-scale feature extraction module, the multi-scale feature extraction module is arranged in parallel with a target detection head of the Yolov5-L main network and has the same input, and the multi-scale feature extraction module is composed of one 3 x 256 convolution layer and one 1 x 256 convolution layer; the input image passes through a Yolov5-L main network and then passes through a multi-scale feature extraction module to output an appearance feature map with the same size as the input image, and then the appearance feature corresponding to the target frame is obtained by intercepting in the appearance feature map based on a preset frame to which the target frame detected by a target detection head belongs.

4. The multi-target tracking method based on multi-frame input and track smoothing as claimed in claim 1, wherein the average L1 loss function of track segment target detection is represented as:

，

wherein,

represents the average L1 loss function for target detection of the ith frame image, and 2m +1 is the number of image sequence frames of the track segment.

5. A multi-target tracking device based on multi-frame input and trajectory smoothing, which is characterized by comprising one or more processors and is used for implementing the multi-target tracking method based on multi-frame input and trajectory smoothing as claimed in any one of claims 1 to 4.

6. A computer-readable storage medium, having stored thereon a program which, when executed by a processor, implements the multi-target tracking method based on multi-frame input and trajectory smoothing according to any one of claims 1 to 4.