CN114998999B - Multi-target tracking method and device based on multi-frame input and track smoothing - Google Patents

Multi-target tracking method and device based on multi-frame input and track smoothing Download PDF

Info

Publication number
CN114998999B
CN114998999B CN202210856428.0A CN202210856428A CN114998999B CN 114998999 B CN114998999 B CN 114998999B CN 202210856428 A CN202210856428 A CN 202210856428A CN 114998999 B CN114998999 B CN 114998999B
Authority
CN
China
Prior art keywords
track
target
frame
pedestrian
target tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210856428.0A
Other languages
Chinese (zh)
Other versions
CN114998999A (en
Inventor
张文广
徐晓刚
虞舒敏
曹卫强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Zhejiang Lab
Original Assignee
Zhejiang Gongshang University
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University, Zhejiang Lab filed Critical Zhejiang Gongshang University
Priority to CN202210856428.0A priority Critical patent/CN114998999B/en
Publication of CN114998999A publication Critical patent/CN114998999A/en
Application granted granted Critical
Publication of CN114998999B publication Critical patent/CN114998999B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-target tracking method and a multi-target tracking device based on multi-frame input and track smoothing, wherein the method comprises the following steps: step S1: acquiring a pedestrian video data set, marking pedestrian coordinates and pedestrian tracks, and generating fragment type track data; step S2: constructing and training a pedestrian multi-target tracking network model based on multi-frame input and smooth track; and step S3: reasoning is carried out based on a pedestrian multi-target tracking network model obtained through training, and a current frame pedestrian target detection and feature extraction result and previous frames of pedestrian target detection and feature extraction results are obtained, namely the coordinates and the appearance features of a multi-frame image target are obtained; and step S4: and matching the shortest characteristic distance by using the coordinates and the appearance characteristics of the multi-frame image target, and smoothing the track by using a track curvature smoothing function to finally obtain the track of the current frame. The method has the advantages of low time consumption and good robustness to the shielding problem of the same kind of targets.

Description

Multi-target tracking method and device based on multi-frame input and track smoothing
Technical Field
The invention relates to the technical field of image recognition, in particular to a multi-target tracking method and device based on multi-frame input and track smoothing.
Background
With the wide deployment of urban public area monitoring cameras, the online detection and multi-target tracking technology for interested targets has significant academic and commercial values based on the requirements of public safety and emergency recourse.
Most of the current tracking algorithms for objects such as pedestrians use a detection network to obtain the position of an interested object, then use a ReID network to extract the appearance characteristic of the object, and finally use a Hungarian algorithm or a greedy algorithm to perform matching based on the distance measurement of a characteristic space. However, this approach has significant drawbacks: 1. when the targets are matched, only feature matching is carried out on the targets and the previous frame or frames, and the identity identification numbers are easy to exchange due to similar features of the blocked targets; 2. the selection of a fixed feature distance threshold is highly likely to result in a newly appearing target matching a track that has historically disappeared because there is no active track matching.
Based on the above two problems, the academic world mainly relies on proposing a network with better detection performance and proposing a network with stronger feature expression robustness, but as shown in fig. 3, in such a situation, because the occlusion of the same kind of object can lead to that a part of the appearance features of the object are covered by the appearance features of other objects, when two people meet, one person blocks the other person, and at the moment of occlusion, the appearance features of the blocked person also become the features of the occluded person, therefore, in a manner based on feature matching, the identity ID of the object is very easy to cause interchange, so as to generate a misconnection track as shown by the B indication line in fig. 3, the real situation should be the track shown by the a indication line in fig. 3, and this problem has not been solved well.
Based on the above, a pedestrian multi-target tracking method which is high in operation efficiency, resistant to similar target shielding and excellent in performance needs to be provided.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides a multi-target tracking method and a multi-target tracking device based on multi-frame input and track smoothing, and the specific technical scheme is as follows:
a multi-target tracking method based on multi-frame input and track smoothing comprises the following steps:
step S1: acquiring a pedestrian video data set, marking pedestrian coordinates and pedestrian tracks, and generating fragment type track data;
step S2: constructing and training a pedestrian multi-target tracking network model based on multi-frame input and smooth track;
and step S3: reasoning is carried out based on a pedestrian multi-target tracking network model obtained through training, and a current frame pedestrian target detection and feature extraction result and previous frames of pedestrian target detection and feature extraction results are obtained, namely the coordinates and the appearance features of a multi-frame image target are obtained;
and step S4: and matching the shortest characteristic distance by using the coordinates and the appearance characteristics of the multi-frame image target, and smoothing the track by using a track curvature smoothing function to finally obtain the track of the current frame.
Further, the step S1 specifically includes: marking pedestrians in the pedestrian video sequence frame by using marking software for the acquired open source pedestrian video, wherein the marking comprises marking a target frame and an Identification (ID) number of a target, and the ID numbers are accumulated from 1; then, cutting and binding the pedestrian video with a fixed length to generate a track segment, wherein the track segment is composed of 2m +1 image sequence frames, namely the data of the track segment is composed of m image frames before to m image frames after an image frame at a certain moment, and m is a positive integer.
Further, the pedestrian multi-target tracking network model is formed by combining a Yolov5-L main network and a multi-scale feature extraction module, the multi-scale feature extraction module and a target detection head of the Yolov5-L main network are arranged in parallel and input in the same mode, and the multi-scale feature extraction module is composed of a 3 × 256 convolution layer and a 1 × 256 × 3 convolution layer; the input image passes through a Yolov5-L main network and then passes through a multi-scale feature extraction module to output an appearance feature map with the same size as the input image, and then the appearance feature corresponding to the target frame is obtained by intercepting in the appearance feature map based on a preset frame to which the target frame detected by a target detection head belongs.
Further, the training of the pedestrian multi-target tracking network model is to train by adopting the fragment type trajectory data, simultaneously send image sequence frames of the fragment type trajectory data into the pedestrian multi-target tracking network model for reasoning, calculate to obtain coordinates of a target, namely a target frame and appearance characteristics, match the coordinates and the appearance characteristics of the target by adopting a shortest characteristic distance and a trajectory curvature smooth function based on the coordinates and the appearance characteristics of the target, and simultaneously utilize a total loss function to solve a gradient for reasoning the backward direction of the pedestrian multi-target tracking network model.
Further, the total loss function is a loss function combining the trajectory feature distance and the fitted loss function with a weighted average of the average L1 loss functions of the trajectory detection
Figure 367742DEST_PATH_IMAGE002
Figure 231793DEST_PATH_IMAGE004
A loss function representing the combined trajectory feature distance and fit,
Figure 249427DEST_PATH_IMAGE006
represents the average L1 loss function of track segment object detection.
Further, the combined track characteristic distance and the fitted loss function are obtained by weighted average of a track characteristic distance loss function and a track curvature smooth loss function, and feature extraction and track matching of the pedestrian multi-target tracking network model are trained and learned;
the trajectory feature distance loss function is expressed as:
Figure 540731DEST_PATH_IMAGE008
wherein,
Figure 276606DEST_PATH_IMAGE010
,i∈[1,2m+1]representing the target frame in the ith image frame
Figure 100002_DEST_PATH_IMAGE012
And the ith image frame real label target frame
Figure 911725DEST_PATH_IMAGE014
The characteristic distance is represented by a cosine function of a characteristic vector included angle, and 2m +1 is the number of image sequence frames of the track segment;
the trajectory curvature smoothing loss function is expressed as:
Figure 49445DEST_PATH_IMAGE016
wherein x represents the number of target tracks formed in the track segment,
Figure 511650DEST_PATH_IMAGE018
the curvature of the track of the jth predicted target is the average track of the 2m +1 frame image,
Figure 100002_DEST_PATH_IMAGE020
for the curvature of the corresponding real label trajectory, j ∈ [1,x],
Figure 100002_DEST_PATH_IMAGE022
The predicted target track and the actual label track are the average track curvature difference; the matching specifically comprises the steps that the predicted target track is matched with the real label track by adopting the rule of curve front end, middle end and rear end IOU matching;
thus, the combined trajectory feature distance and fitted penalty function is represented as:
Figure 100002_DEST_PATH_IMAGE024
Figure 100002_DEST_PATH_IMAGE026
wherein
Figure 100002_DEST_PATH_IMAGE028
And
Figure 100002_DEST_PATH_IMAGE030
are all weighted based on
Figure 100002_DEST_PATH_IMAGE032
The learning of the pedestrian multi-target tracking network model on feature extraction and track matching is supervised.
Further, the average L1 loss function of the track segment target detection is expressed as:
Figure 100002_DEST_PATH_IMAGE034
wherein,
Figure 100002_DEST_PATH_IMAGE036
represents the average L1 loss function of target detection for the ith frame image, and 2m +1 is the number of image sequence frames of the track segment.
Further, the step S4 specifically includes: the method comprises the following steps of adopting a trained pedestrian multi-target tracking network model, carrying out shortest characteristic distance matching by utilizing coordinates and appearance characteristics of multi-frame image targets, carrying out track smoothing, and enabling the weighted sum of the average characteristic distance and the track curvature of the target and the track target of the previous 2m frames to be the minimum by the track obtained by matching, wherein the weighted sum is expressed as:
Figure 100002_DEST_PATH_IMAGE038
Figure 100002_DEST_PATH_IMAGE040
wherein
Figure 100002_DEST_PATH_IMAGE042
Representing the apparent feature distance of the current predicted image frame k from the ith image frame in its 2m preceding frames,
Figure DEST_PATH_IMAGE043
and
Figure 100002_DEST_PATH_IMAGE044
are all weighted weights.
A multi-target tracking device based on multi-frame input and track smoothing comprises one or more processors and is used for realizing the multi-target tracking method based on multi-frame input and track smoothing.
A computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the multi-target tracking method based on multi-frame input and trajectory smoothing.
Compared with the prior art, the invention has the beneficial effects that: 1. by constructing the fragment type trajectory data set, different angles of the same target in a plurality of images can simultaneously supervise the learning of the feature extraction module, so that the learned appearance features are more robust; 2. in the continuous frames of the video, the motion of the pedestrian is linear in a short time, the track jump cannot occur, and the track jump caused by mismatching due to similar target shielding can be well filtered through track smoothing based on 2m +1 frames; 3. by sending a section of training track data set, the detection module can learn the target information of the same target at different moments at the same time, so that the detection performance is improved to a certain extent; 4. by sharing the cache mode with the characteristics and the detection result, only 1 frame of picture needs to be inferred during deployment, and the effect of inputting 2m +1 frame of picture is achieved.
Drawings
Fig. 1 is a schematic flow chart of a multi-target tracking method based on multi-frame input and track smoothing according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an overall network framework of a multi-target tracking method based on multi-frame input and trajectory smoothing according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment of the present invention in which an identity exchange problem exists;
fig. 4 is a schematic structural diagram of a multi-target tracking apparatus based on multi-frame input and track smoothing according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples.
The invention provides a multi-frame input and smooth track-based multi-target tracking method, which aims to solve the problem of track misconnection caused by shielding of similar targets in the conventional multi-target tracking algorithm. The invention provides a single-stage network based on multi-frame input detection and ReID feature extraction, wherein Yolov5-L is used as a backbone of a network model, a target feature extraction module is added at the tail end of the network, and coordinates and appearance features of a pedestrian target are simultaneously obtained through the single-stage network; and adopting multi-frame input, matching multi-frame target characteristics and smoothing an online track to reduce the misconnection rate.
As shown in fig. 1, the method specifically includes the following steps:
step S1: acquiring a pedestrian video data set, marking pedestrian coordinates and pedestrian tracks, and generating fragment type track data;
specifically, the marking of the pedestrian coordinates and the pedestrian track refers to marking the pedestrians in the video sequence frame by using professional marking software, and comprises marking a target frame and an Identification (ID) number of the target, wherein the ID numbers are accumulated from 1;
the generation of fragment type track data refers to that historical video data is cut and bound in a fixed length to generate track fragments, overlapping frames or non-overlapping frames can exist among the track fragments, a track fragment before and after a certain time of the history is supposed to be composed of 2m +1 image sequence frames, namely the data of the track fragment is composed of m image frames from the front to the back of the image frame at the certain time, m is a positive integer, and the fragment type track data only binds image frame serial numbers and does not need to relate to repeated copying of images.
Step S2: constructing and training a pedestrian multi-target tracking network model based on multi-frame input and smooth track;
specifically, as shown in fig. 2, a multi-frame input pedestrian multi-target tracking network model based on the shortest feature distance and the smooth matching of the trajectory is mainly formed by combining a Yolov5-L main network and a multi-scale feature extraction module, the size of an input picture of the network model is 960 × 960, the multi-scale feature extraction module is arranged in parallel with a target detection head Detect of the Yolov5-L main network, the Yolov5-L main network performs target detection on an input image through a target detection head, the input of the multi-scale feature extraction module is the same as that of the target detection head, and the multi-scale feature extraction module is composed of one 3 × 256 convolution layer and one 1 × 256 × 3 convolution layer and performs feature extraction on a target; the method comprises the steps that an input image passes through a Yolov5-L main network and then passes through a multi-scale feature extraction module to finally obtain an appearance feature map with the same size as the input image, and finally, appearance features corresponding to a target frame are obtained in the appearance feature map by intercepting on the basis of a preset frame to which the target frame obtained by detection of a target detection head Detect belongs, wherein the dimension of the appearance features of a single target is 256 dimensions.
The training of the pedestrian multi-target tracking network model is to train by adopting the fragment type track data, send image sequence frames of the fragment type track data into the pedestrian multi-target tracking network model for reasoning, calculate to obtain coordinates of a target, namely a target frame and appearance characteristics, match the coordinates and the appearance characteristics of the target by adopting a shortest characteristic distance and a track curvature smooth function, and simultaneously solve gradient through a total loss function to carry out backward reasoning of the pedestrian multi-target tracking network model.
The feature extraction of the pedestrian multi-target tracking network model and the matching training of the track adopt a loss function combining track feature distance and fitting to supervise the learning of the feature extraction and the track matching of the network model. The loss function is obtained by weighted average of a track characteristic distance loss function and a track curvature smooth loss function.
For the distance of the feature space, the cosine function of the included angle of the feature vector is adopted to represent, and then the target frame in the ith image frame
Figure DEST_PATH_IMAGE045
And the ith image frame real label target frame
Figure 92411DEST_PATH_IMAGE014
Is expressed as:
Figure DEST_PATH_IMAGE046
,i∈[1,2m+1],
when the included angle of the feature vector is closer to 0 degree, the average predicted value is closer to the true value, and the feature distance is closer
Figure DEST_PATH_IMAGE048
The closer to 0, the characteristic distance otherwise
Figure DEST_PATH_IMAGE049
The closer to 1; thus, the trajectory feature distance loss function for a trajectory segment is represented as:
Figure 439210DEST_PATH_IMAGE008
assuming that the track segments together form the track of x objects, calculating the average track curvature of the j-th object at 2m +1 frame image as
Figure 431436DEST_PATH_IMAGE018
,j∈[1,x]Calculating the average track curvature difference between the predicted target track and the real label track as
Figure DEST_PATH_IMAGE050
In which
Figure 798964DEST_PATH_IMAGE020
Matching the predicted track and the real label track by adopting a rule of curve front end, middle end and rear end IOU matching for the curvature of the corresponding real label track, namely, taking a coordinate of a first target, a coordinate of a middle target and a coordinate of a last disappearance in the predicted track and coordinates of first, middle and last time periods corresponding to the real label track to perform IOU matching, if the average IOU is maximum, considering the predicted track and the real track to be the same track, and calculating a loss function of fitting the predicted track and the real label track based on a calculation formula of an average track curvature difference value, namely a track curvature smooth loss function:
Figure DEST_PATH_IMAGE051
finally, the combined trajectory feature distance and fitted loss function is expressed as:
Figure 482667DEST_PATH_IMAGE024
Figure 23370DEST_PATH_IMAGE026
wherein
Figure DEST_PATH_IMAGE052
And
Figure DEST_PATH_IMAGE053
are all weighted based on
Figure 73365DEST_PATH_IMAGE032
The learning of the pedestrian multi-target tracking network model on feature extraction and track matching is supervised.
The pedestrian multi-target tracking network model detects and trains the L1 loss function in the commonly used single-frame target detection model, and assumes the target of the ith frame imageAverage L1 loss function representation of target detection
Figure 611794DEST_PATH_IMAGE036
Then the average L1 loss function of target detection for all images of the track segment
Figure DEST_PATH_IMAGE054
And (4) showing.
Finally, the total loss function of the training of the multi-target tracking network model of the pedestrian is expressed as
Figure 278399DEST_PATH_IMAGE002
And step S3: reasoning is carried out based on a pedestrian multi-target tracking network model obtained through training, and a current frame pedestrian detection and feature extraction result and previous frames of pedestrian detection and feature extraction results are obtained, namely the coordinates and the appearance features of a multi-frame image target are obtained;
specifically, a trained pedestrian multi-target tracking network model is used for detecting a pedestrian target frame of an obtained frame image and corresponding appearance characteristics of the frame image, a track segment is formed by front and back m frames of images in the training process, however, a track segment is inferred by 2m +1 formed by front 2m frames of images in actual application deployment, so that the model is not required to be inferred for 2m +1 times each time in actual inference, only 1 time is required to be inferred, and the result of the front 2m frames is obtained by previous cache.
And step S4: and matching the shortest characteristic distance by using the coordinates and the appearance characteristics of the multi-frame image target, and smoothing the track by using a track curvature smoothing function to finally obtain the track of the current frame.
Specifically, based on the image inference result of 2m +1 frame, the shortest feature distance matching and the track curvature smoothing principle are used, that is, the track obtained by matching makes the weighted sum of the average feature distance and the track curvature of the current predicted target and the track target of the previous 2m frames minimum, and the weighted sum is expressed as:
Figure DEST_PATH_IMAGE055
Figure 793432DEST_PATH_IMAGE040
wherein
Figure DEST_PATH_IMAGE056
Representing the apparent feature distance of the current predicted image frame k from the ith image frame in its 2m preceding frames,
Figure 229092DEST_PATH_IMAGE043
and
Figure DEST_PATH_IMAGE057
are all weighted weights.
Corresponding to the embodiment of the multi-target tracking method based on multi-frame input and track smoothing, the invention also provides an embodiment of the multi-target tracking device based on multi-frame input and track smoothing.
Referring to fig. 4, the multi-target tracking device based on multi-frame input and trajectory smoothing provided by the embodiment of the present invention includes one or more processors, and is configured to implement the multi-target tracking method based on multi-frame input and trajectory smoothing in the foregoing embodiment.
The embodiment of the multi-target tracking device based on multi-frame input and track smoothing can be applied to any equipment with data processing capability, such as computers and other equipment or devices. The device embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. The software implementation is taken as an example, and as a device in a logical sense, a processor of any device with data processing capability reads corresponding computer program instructions in the nonvolatile memory into the memory for operation. In terms of hardware, as shown in fig. 4, a hardware structure diagram of an arbitrary device with data processing capability where a multi-frame input and trajectory smoothing-based multi-target tracking apparatus of the present invention is located is shown, except for the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 4, in an embodiment, the arbitrary device with data processing capability where the apparatus is located may also include other hardware according to an actual function of the arbitrary device with data processing capability, which is not described again.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the invention also provides a computer readable storage medium, on which a program is stored, and when the program is executed by a processor, the multi-target tracking method based on multi-frame input and track smoothing in the above embodiments is implemented.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing device described in any previous embodiment. The computer readable storage medium may also be an external storage device such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the practice of the invention as described in the foregoing examples, or that certain features may be substituted in the practice of the invention. All changes, equivalents and modifications which come within the spirit and scope of the invention are desired to be protected.

Claims (6)

1. A multi-target tracking method based on multi-frame input and track smoothing is characterized by comprising the following steps:
step S1: acquiring a pedestrian video data set, marking pedestrian coordinates and pedestrian tracks, and generating fragment type track data;
step S2: constructing and training a pedestrian multi-target tracking network model based on multi-frame input and smooth track;
specifically, the training of the pedestrian multi-target tracking network model is to train by adopting the fragment type track data, send image sequence frames of the fragment type track data into the pedestrian multi-target tracking network model for reasoning at the same time, calculate to obtain a coordinate of a target, namely a target frame and an appearance characteristic, match the coordinate and the appearance characteristic of the target by adopting a shortest characteristic distance and a track curvature smooth function based on the coordinate and the appearance characteristic of the target, and simultaneously utilize a total loss function to solve a gradient for backward reasoning of the pedestrian multi-target tracking network model;
the total loss function is a loss function combining the trajectory feature distance and the fitted loss function with the weighted average of the average L1 loss functions of the trajectory detection
Figure DEST_PATH_IMAGE002
Figure DEST_PATH_IMAGE004
A loss function representing the combined trajectory feature distance and fit,
Figure DEST_PATH_IMAGE006
an average L1 loss function representing track segment target detection;
the combined track characteristic distance and the fitted loss function are obtained by weighted average of a track characteristic distance loss function and a track curvature smooth loss function, and training and learning are carried out on characteristic extraction and track matching of the pedestrian multi-target tracking network model;
the trajectory feature distance loss function is expressed as:
Figure DEST_PATH_IMAGE008
wherein,
Figure DEST_PATH_IMAGE010
,i∈[1,2m+1]representing the target frame in the ith image frame
Figure DEST_PATH_IMAGE012
And the ith image frame real label target frame
Figure DEST_PATH_IMAGE014
The characteristic distance is represented by a cosine function of a characteristic vector included angle, and 2m +1 is the number of image sequence frames of the track segment;
the trajectory curvature smoothing loss function is expressed as:
Figure DEST_PATH_IMAGE016
wherein x represents the number of target tracks formed in the track segment,
Figure DEST_PATH_IMAGE018
the curvature of the track of the jth predicted target is the average track of the 2m +1 frame image,
Figure DEST_PATH_IMAGE020
for the curvature of the corresponding real label trajectory, j ∈ [1,x],
Figure DEST_PATH_IMAGE022
The predicted target track and the real label track are the average track curvature difference; the matching is specifically matching of a predicted target track and a real label track, and matching is carried out by adopting a rule of curve front end, middle end and rear end IOU matching;
thus, the combined trajectory feature distance and fitted penalty function is represented as:
Figure DEST_PATH_IMAGE024
Figure DEST_PATH_IMAGE026
wherein
Figure DEST_PATH_IMAGE028
And
Figure DEST_PATH_IMAGE030
are all weighted based on
Figure DEST_PATH_IMAGE032
The learning of a pedestrian multi-target tracking network model on feature extraction and track matching is supervised;
and step S3: reasoning is carried out based on a pedestrian multi-target tracking network model obtained through training, and a current frame pedestrian target detection and feature extraction result and previous frames of pedestrian target detection and feature extraction results are obtained, namely the coordinates and the appearance features of a multi-frame image target are obtained;
and step S4: the method comprises the following steps of carrying out shortest characteristic distance matching by utilizing coordinates and appearance characteristics of a multi-frame image target, carrying out track smoothing by utilizing a track curvature smoothing function, and finally obtaining the track of a current frame, wherein the method specifically comprises the following steps: the method comprises the following steps of adopting a trained pedestrian multi-target tracking network model, carrying out shortest characteristic distance matching by utilizing coordinates and appearance characteristics of multi-frame image targets, carrying out track smoothing, and enabling the weighted sum of the average characteristic distance and the track curvature of the target and the track target of the previous 2m frames to be the minimum by the track obtained by matching, wherein the weighted sum is expressed as:
Figure DEST_PATH_IMAGE034
Figure DEST_PATH_IMAGE036
wherein
Figure DEST_PATH_IMAGE038
Representing the apparent feature distance of the current predicted image frame k from the ith image frame in its 2m preceding frames,
Figure DEST_PATH_IMAGE039
and
Figure DEST_PATH_IMAGE040
are all weighted weights.
2. The multi-target tracking method based on multi-frame input and track smoothing as claimed in claim 1, wherein the step S1 specifically comprises: marking pedestrians in the pedestrian video sequence frame by using marking software for the acquired open source pedestrian video, wherein the marking comprises marking a target frame and an Identification (ID) number of a target, and the ID numbers are accumulated from 1; then, cutting and binding the pedestrian video with a fixed length to generate a track segment, wherein the track segment is composed of 2m +1 image sequence frames, namely the data of the track segment is composed of m image frames before to m image frames after an image frame at a certain moment, and m is a positive integer.
3. The multi-frame input and track smoothing-based multi-target tracking method according to claim 1, wherein the pedestrian multi-target tracking network model is formed by combining a Yolov5-L main network and a multi-scale feature extraction module, the multi-scale feature extraction module is arranged in parallel with a target detection head of the Yolov5-L main network and has the same input, and the multi-scale feature extraction module is composed of one 3 x 256 convolution layer and one 1 x 256 convolution layer; the input image passes through a Yolov5-L main network and then passes through a multi-scale feature extraction module to output an appearance feature map with the same size as the input image, and then the appearance feature corresponding to the target frame is obtained by intercepting in the appearance feature map based on a preset frame to which the target frame detected by a target detection head belongs.
4. The multi-target tracking method based on multi-frame input and track smoothing as claimed in claim 1, wherein the average L1 loss function of track segment target detection is represented as:
Figure DEST_PATH_IMAGE042
wherein,
Figure DEST_PATH_IMAGE044
represents the average L1 loss function for target detection of the ith frame image, and 2m +1 is the number of image sequence frames of the track segment.
5. A multi-target tracking device based on multi-frame input and trajectory smoothing, which is characterized by comprising one or more processors and is used for implementing the multi-target tracking method based on multi-frame input and trajectory smoothing as claimed in any one of claims 1 to 4.
6. A computer-readable storage medium, having stored thereon a program which, when executed by a processor, implements the multi-target tracking method based on multi-frame input and trajectory smoothing according to any one of claims 1 to 4.
CN202210856428.0A 2022-07-21 2022-07-21 Multi-target tracking method and device based on multi-frame input and track smoothing Active CN114998999B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210856428.0A CN114998999B (en) 2022-07-21 2022-07-21 Multi-target tracking method and device based on multi-frame input and track smoothing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210856428.0A CN114998999B (en) 2022-07-21 2022-07-21 Multi-target tracking method and device based on multi-frame input and track smoothing

Publications (2)

Publication Number Publication Date
CN114998999A CN114998999A (en) 2022-09-02
CN114998999B true CN114998999B (en) 2022-12-06

Family

ID=83021963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210856428.0A Active CN114998999B (en) 2022-07-21 2022-07-21 Multi-target tracking method and device based on multi-frame input and track smoothing

Country Status (1)

Country Link
CN (1) CN114998999B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115342822B (en) * 2022-10-18 2022-12-23 智道网联科技(北京)有限公司 Intersection track data rendering method, device and system
CN115880338B (en) * 2023-03-02 2023-06-02 浙江大华技术股份有限公司 Labeling method, labeling device and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135314A (en) * 2019-05-07 2019-08-16 电子科技大学 A kind of multi-object tracking method based on depth Trajectory prediction
CN110349187A (en) * 2019-07-18 2019-10-18 深圳大学 Method for tracking target, device and storage medium based on TSK Fuzzy Classifier
CN111767847A (en) * 2020-06-29 2020-10-13 佛山市南海区广工大数控装备协同创新研究院 Pedestrian multi-target tracking method integrating target detection and association
CN111797738A (en) * 2020-06-23 2020-10-20 同济大学 Multi-target traffic behavior fast extraction method based on video identification
CN114677633A (en) * 2022-05-26 2022-06-28 之江实验室 Multi-component feature fusion-based pedestrian detection multi-target tracking system and method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103854273B (en) * 2012-11-28 2017-08-25 天佑科技股份有限公司 A kind of nearly positive vertical view monitor video pedestrian tracking method of counting and device
US11341512B2 (en) * 2018-12-20 2022-05-24 Here Global B.V. Distinguishing between pedestrian and vehicle travel modes by mining mix-mode trajectory probe data
CN110378259A (en) * 2019-07-05 2019-10-25 桂林电子科技大学 A kind of multiple target Activity recognition method and system towards monitor video

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135314A (en) * 2019-05-07 2019-08-16 电子科技大学 A kind of multi-object tracking method based on depth Trajectory prediction
CN110349187A (en) * 2019-07-18 2019-10-18 深圳大学 Method for tracking target, device and storage medium based on TSK Fuzzy Classifier
CN111797738A (en) * 2020-06-23 2020-10-20 同济大学 Multi-target traffic behavior fast extraction method based on video identification
CN111767847A (en) * 2020-06-29 2020-10-13 佛山市南海区广工大数控装备协同创新研究院 Pedestrian multi-target tracking method integrating target detection and association
CN114677633A (en) * 2022-05-26 2022-06-28 之江实验室 Multi-component feature fusion-based pedestrian detection multi-target tracking system and method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Fusion Approach for Multi-Frame Optical Flow Estimation;Zhile Ren 等;《2019 IEEE Winter Conference on Applications of Computer Vision (WACV)》;20190307;全文 *
Aerial image object detection based on improved YOLOv5;Qing Wen 等;《2022 2nd International Conference on Consumer Electronics and Computer Engineering (ICCECE)》;20220221;全文 *
基于YOLOv3与卡尔曼滤波的多目标跟踪算法;任珈民等;《计算机应用与软件》;20200512(第05期);全文 *

Also Published As

Publication number Publication date
CN114998999A (en) 2022-09-02

Similar Documents

Publication Publication Date Title
CN114998999B (en) Multi-target tracking method and device based on multi-frame input and track smoothing
CN111627045B (en) Multi-pedestrian online tracking method, device and equipment under single lens and storage medium
Sakaridis et al. Map-guided curriculum domain adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation
Han et al. Mat: Motion-aware multi-object tracking
Shin Yoon et al. Pixel-level matching for video object segmentation using convolutional neural networks
CN109426805B (en) Method, apparatus and computer program product for object detection
CN113034541B (en) Target tracking method and device, computer equipment and storage medium
US9754178B2 (en) Long-term static object detection
Rajasegaran et al. Tracking people by predicting 3d appearance, location and pose
CN106803263A (en) A kind of method for tracking target and device
WO2012127815A1 (en) Moving object detecting apparatus and moving object detecting method
CN113159006B (en) Attendance checking method and system based on face recognition, electronic equipment and storage medium
CN111027555B (en) License plate recognition method and device and electronic equipment
CN110298867A (en) A kind of video target tracking method
David An intellectual individual performance abnormality discovery system in civic surroundings
CN114677633A (en) Multi-component feature fusion-based pedestrian detection multi-target tracking system and method
Tao et al. An adaptive frame selection network with enhanced dilated convolution for video smoke recognition
Chen et al. Multiperson tracking by online learned grouping model with nonlinear motion context
Liu et al. Real-time anomaly detection on surveillance video with two-stream spatio-temporal generative model
CN111382705A (en) Reverse behavior detection method and device, electronic equipment and readable storage medium
CN114742112A (en) Object association method and device and electronic equipment
CN111382606A (en) Tumble detection method, tumble detection device and electronic equipment
CN110378515A (en) A kind of prediction technique of emergency event, device, storage medium and server
Choudhury et al. Scale aware deep pedestrian detection
CN114529587A (en) Video target tracking method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant