CN117011335B - Multi-target tracking method and system based on self-adaptive double decoders - Google Patents

Multi-target tracking method and system based on self-adaptive double decoders

Info

Publication number
CN117011335B
CN117011335B (application CN202310926896.5A)
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310926896.5A
Other languages
Chinese (zh)
Other versions
CN117011335A (en)
Inventor
翟超 (Zhai Chao)
倪志祥 (Ni Zhixiang)
李玉军 (Li Yujun)
杨阳 (Yang Yang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Application filed by Shandong University filed Critical Shandong University
Priority to CN202310926896.5A priority Critical patent/CN117011335B/en
Publication of CN117011335A publication Critical patent/CN117011335A/en
Application granted granted Critical
Publication of CN117011335B publication Critical patent/CN117011335B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06T 7/277: Analysis of motion involving stochastic approaches, e.g. using Kalman filters

Abstract

The invention belongs to the technical field of deep learning and signal processing, and provides a multi-target tracking method and system based on an adaptive dual decoder. The method comprises the following steps: dividing video data into frames, and sequentially extracting bounding boxes and corresponding appearance features with a target detection and appearance feature extraction model based on an adaptive dual decoder; in the appearance information association matching, computing an appearance similarity matrix between existing tracks and current-frame detections and obtaining a pairwise maximum matching with the Hungarian algorithm; in the motion information association matching, predicting the boxes of existing tracks on the current frame with a width-prediction-based noise-adaptive Kalman filter, expanding the widths of the prediction boxes and detection boxes, computing the DIOU motion similarity matrix between them, and again obtaining a maximum matching with the Hungarian algorithm; finally outputting the ID number and prediction box of each target in each frame of the video. The invention has important application and research value for many fields and for advanced computer vision tasks.

Description

Multi-target tracking method and system based on self-adaptive double decoders
Technical Field
The invention belongs to the technical field of deep learning and signal processing, and particularly relates to a multi-target tracking method and system based on a self-adaptive double decoder.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Multi-object tracking (Multiple Object Tracking, MOT) is a challenging problem in the field of computer vision. Its purpose is to detect all objects of interest in a video scene and assign a unique identity (ID) to each, so as to obtain the trajectory information of all objects of interest, including the bounding box and corresponding ID number of each object in each frame. With the development of related technologies such as deep learning, MOT is widely applied in fields such as intelligent security, automatic driving, robot perception and marketing. Meanwhile, MOT is the basis of higher-level computer vision tasks such as behavior recognition, pose estimation and video analysis. For example, in intelligent security, the technique can simultaneously track all people present under video surveillance; on that basis, all appearances of a suspicious person in the footage can be retrieved and their behavior identified for subsequent investigation.
Common paradigms for multi-target tracking include a method of separate detection and feature extraction (Separate Detection and Embedding, SDE), a method of joint detection and feature extraction (Joint Detection and Embedding, JDE), and a method of joint detection and tracking (Joint Detection and Tracking, JDT). The SDE-based method sequentially completes 3 subtasks, namely, firstly, a target is positioned through a detection network, then, the characteristics of the target are extracted, and finally, the similarity among the targets is calculated through a data association algorithm and the targets are associated. The JDE method outputs the position and appearance characteristics of the targets in a network at the same time, calculates the similarity between the targets through a data association algorithm and associates the targets. Whereas the JDT method is to complete 3 sub-tasks in a single network, thereby completing the tracking process. The SDE paradigm has the advantage of being able to customize the most appropriate solution for 3 subtasks, with the disadvantage that the 3 subtasks are independent and cannot meet real-time requirements. The JDE paradigm has the advantages that target detection and feature extraction are fused, network computing efficiency can be greatly improved, real-time requirements are met, and the JDE paradigm is not an end-to-end method, but has good generalization capability and important research significance and application value. Although the JDT paradigm combines 3 subtasks to enable end-to-end global optimization, this approach often suffers from difficult training and inadequate generalization.
The representative method of the JDE paradigm, FairMOT, provides a simple and strong baseline for subsequent research. It uses a variant of Deep Layer Aggregation (DLA34) as the backbone network to extract shared features, adopts an Anchor-Free modeling approach for the detection task that estimates heat maps, center points, and box widths and heights, and learns 128-dimensional feature representations for the appearance feature extraction task. For data association, the Kalman Filter (KF) is a common classical method: it estimates the motion information of a track and can predict the bounding box position of an existing track on the current frame, which is then associated with the newly detected target bounding boxes on that frame. Specifically, a motion similarity matrix is obtained by computing the Intersection-over-Union (IOU) between the prediction boxes of existing tracks and the newly detected target boxes of the current frame, and the Hungarian algorithm then realizes the maximum matching between the two sets of boxes. Appearance information matching is analogous: appearance features are extracted for the existing tracks and for the current frame's latest detections, an appearance similarity matrix is formed, and the Hungarian algorithm again produces the maximum matching.
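The IOU-plus-Hungarian association just described can be sketched as follows. This is an illustrative reimplementation, not the patent's code: SciPy's `linear_sum_assignment` stands in for the Hungarian algorithm, boxes are assumed to be in (x1, y1, x2, y2) form, and the 0.5 acceptance threshold is a conventional choice rather than one stated here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(tracks, dets):
    """Pairwise IOU between track prediction boxes and detection boxes.
    Boxes are (x1, y1, x2, y2); shapes (N, 4) and (M, 4) give an (N, M) matrix."""
    tracks = np.asarray(tracks, dtype=float)[:, None, :]   # (N, 1, 4)
    dets = np.asarray(dets, dtype=float)[None, :, :]       # (1, M, 4)
    x1 = np.maximum(tracks[..., 0], dets[..., 0])
    y1 = np.maximum(tracks[..., 1], dets[..., 1])
    x2 = np.minimum(tracks[..., 2], dets[..., 2])
    y2 = np.minimum(tracks[..., 3], dets[..., 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_t = (tracks[..., 2] - tracks[..., 0]) * (tracks[..., 3] - tracks[..., 1])
    area_d = (dets[..., 2] - dets[..., 0]) * (dets[..., 3] - dets[..., 1])
    return inter / (area_t + area_d - inter + 1e-9)

def match(tracks, dets, iou_threshold=0.5):
    """Maximum matching: Hungarian assignment on the 1 - IOU cost matrix,
    keeping only pairs whose IOU clears the threshold."""
    iou = iou_matrix(tracks, dets)
    rows, cols = linear_sum_assignment(1.0 - iou)
    return [(r, c) for r, c in zip(rows, cols) if iou[r, c] >= iou_threshold]
```

The same `match` skeleton serves for any similarity matrix; only the matrix construction changes between motion and appearance association.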
The inventor finds that the existing multi-target tracking method FairMOT has the following defects:
when the position and appearance features of targets are output by a single network, FairMOT sets homogeneous branches for the target detection task and the appearance feature extraction task so as to treat the two tasks fairly. However, a network design that uses only homogeneous branches is insufficient: target detection and pedestrian re-identification are essentially different tasks, so when extracting features, each task should emphasize different, task-specific features. In addition, the traditional Kalman filter is insufficiently accurate in bounding box prediction and is prone to inaccurate width prediction; moreover, the noise carried by the observations changes during state updating, but the noise covariance matrix is a constant matrix and cannot change with it. Meanwhile, in crowded scenes, obtaining the motion similarity matrix by computing the IOU between the prediction boxes of existing tracks and the newly detected boxes of the current frame lacks expressive power and cannot finely model complex scenes.
Disclosure of Invention
In order to solve at least one technical problem in the background art, the invention provides a multi-target tracking method and system based on an adaptive dual decoder, which combines an adaptive dual-decoder structure with FairMOT, better alleviates the excessive competition between the target detection task and the appearance feature extraction task, realizes feature learning friendlier to each task, and improves the overall tracking performance of the multi-target tracking method. The method has good extensibility: the adaptive dual-decoder structure is applicable to improving different backbone networks under the JDE paradigm; the width-prediction-based noise-adaptive Kalman filter can replace the original standard Kalman filter in many MOT methods; and the motion information matching method based on width expansion and the DIOU (Distance-IOU) metric can be used in any MOT solution that includes the standard IOU motion information matching method, and has important theoretical significance and application value.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a first aspect of the present invention provides a multi-target tracking method based on an adaptive dual decoder, comprising:
acquiring video data, preprocessing the video data, and dividing the video data into continuous frame picture data according to frames;
based on continuous frame picture data and a trained target detection and appearance feature extraction model, detecting interesting targets and extracting appearance features of a current frame to obtain the latest detection frames and the corresponding appearance features of all interesting targets of the current frame;
the construction process of the target detection and appearance feature extraction model comprises the following steps: adopting an improved FairMOT based on an adaptive dual-decoder structure; on the existing encoder-decoder DLA34 variant, duplicating the last-layer IDA structure of the decoder, i.e., the multi-layer feature re-fusion structure, so that the decoder part has two IDA layers in total, one used for the detection task and the other for the appearance feature extraction task, and connecting a convolutional block attention structure after each of the two IDA layers;
based on the latest detection frames and the corresponding appearance characteristics of all the interested targets of the current frame, carrying out appearance information matching and motion information matching with the appearance characteristics of the existing track and the prediction frames of the current frame; and outputting the ID number and the boundary box information of each target in each frame of the video until all the picture frames are processed.
A second aspect of the present invention provides an adaptive dual decoder-based multi-target tracking system comprising:
a data preprocessing module configured to: acquiring video data, preprocessing the video data, and dividing the video data into continuous frame picture data according to frames;
a feature extraction module configured to: based on continuous frame picture data and a trained target detection and appearance feature extraction model, detecting interesting targets and extracting appearance features of a current frame to obtain the latest detection frames and the corresponding appearance features of all interesting targets of the current frame;
the construction process of the target detection and appearance feature extraction model comprises the following steps: adopting an improved FairMOT based on an adaptive dual-decoder structure; on the existing encoder-decoder DLA34 variant, duplicating the last-layer IDA structure of the decoder, i.e., the multi-layer feature re-fusion structure, so that the decoder part has two IDA layers in total, one used for the detection task and the other for the appearance feature extraction task, and connecting a convolutional block attention structure after each of the two IDA layers;
a multi-target tracking module configured to: based on the latest detection frames and the corresponding appearance characteristics of all the interested targets of the current frame, carrying out appearance information matching and motion information matching on the appearance characteristics of the existing track and the prediction frames of the current frame; and outputting the ID number and the boundary box information of each target in each frame of the video until all the picture frames are processed.
A third aspect of the present invention provides a computer-readable storage medium.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of a multi-object tracking method based on an adaptive dual decoder as described above.
A fourth aspect of the invention provides a computer device.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in an adaptive dual decoder based multi-target tracking method as described above when the program is executed.
Compared with the prior art, the invention has the beneficial effects that:
1. the invention takes the FairMOT framework of the JDE paradigm as its basic structure and uses the deeply aggregated encoder-decoder DLA34 variant as the backbone network in an Anchor-Free fashion, which can effectively fuse multi-scale and multi-level features and extract sufficiently expressive features for the subsequent target detection and appearance feature extraction tasks. On this basis, an adaptive dual-decoder structure, a width-prediction-based noise-adaptive Kalman filter, and a motion information matching method based on width expansion and the DIOU metric are introduced, which can significantly improve overall tracking performance over the original architecture and are of importance for application and research in fields such as intelligent security and behavior recognition.
2. The invention provides a self-adaptive double-decoder structure based on the original network architecture, and the optimal characteristics of the target detection task and the appearance characteristic task are respectively learned through two identical branches, so that the competition problem between the target detection task and the appearance characteristic extraction task in the same network is relieved, and the accuracy of multi-target tracking is improved on the basis of ensuring the real-time performance. In addition, the self-adaptive double-decoder structure can be also suitable for other similar JDE methods, and has good application potential.
3. The invention improves the original standard Kalman filter through a width-prediction-based noise-adaptive Kalman filter: prediction accuracy is improved by directly predicting the box width instead of the aspect ratio. In addition, confidence information is introduced into the Kalman update step, and the noise scale is adjusted adaptively by the confidence, so that it shrinks as confidence grows and grows as confidence falls, estimating and modeling the target's motion state better. Moreover, the width-prediction-based noise-adaptive Kalman filter can also improve other standard Kalman filtering methods and has good transferability.
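The confidence-adaptive noise idea can be sketched as below. The exact formulation used by the invention is not given in this passage; the sketch assumes the common NSA-style scaling in which the measurement covariance R is multiplied by (1 − confidence), so high-confidence detections are trusted more and low-confidence ones less.

```python
import numpy as np

def nsa_kalman_update(x, P, z, H, R, confidence):
    """One Kalman update step with confidence-adaptive measurement noise.
    The measurement covariance is scaled as R_adapt = (1 - confidence) * R
    (an assumed scaling): higher confidence -> smaller noise -> larger gain."""
    R_adapt = (1.0 - confidence) * R
    S = H @ P @ H.T + R_adapt                # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_new = x + K @ (z - H @ x)              # corrected state
    P_new = (np.eye(len(x)) - K @ H) @ P     # corrected covariance
    return x_new, P_new
```

In a 1-D example with P = R = 1, a confidence of 0.9 yields a gain of about 0.91, while a confidence of 0 recovers the standard gain of 0.5.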
4. The invention replaces the original standard IOU matching with a motion information matching method based on width expansion and the DIOU metric. Expanding the widths of the Kalman prediction boxes and the target detection boxes reduces false-negative matches; meanwhile, replacing the IOU metric with the DIOU metric models the distance between two bounding boxes more accurately, filters out false-positive matches, and improves ID assignment accuracy during tracking. This motion information matching method can directly replace the original standard IOU matching and can also be borrowed by other MOT methods.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a schematic diagram of a multi-objective tracking algorithm according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a FairMOT target detection and appearance feature extraction network according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a network structure of an adaptive dual-decoder architecture according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a detailed flow of data association matching according to an embodiment of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
In order to solve the technical problems mentioned in the background art, the invention adopts the original FairMOT architecture as a basic structure to realize multi-target tracking, and the structure can realize target detection and appearance feature extraction tasks in the same network and realize good tracking performance on the basis of ensuring real-time performance. On the basis, an adaptive double decoder structure, a noise adaptive Kalman filtering method based on width prediction and a motion information matching method based on width matching and DIOU measurement are introduced. The self-adaptive double-decoder structure can effectively relieve the excessive competition problem which can occur when the target detection task and the appearance characteristic extraction task are realized in the same network, and can self-adaptively learn the optimal characteristics of the respective tasks aiming at different tasks, thereby realizing the improvement of the overall tracking performance; the noise self-adaptive Kalman filtering based on the width prediction replaces the original mode of predicting the aspect ratio of the frame by a mode of directly predicting the width of the frame, and simultaneously the noise scale in the Kalman filtering can be adjusted by combining the detection confidence information through the noise self-adaptive method, so that the motion state of a target can be estimated more accurately by the Kalman filtering; the motion information matching mode based on the width expansion and the DIOU measurement is different from the traditional IOU matching method, the size of the frame is adjusted by expanding the width, the false negative matching condition in the motion information matching process can be reduced, meanwhile, more false positive matching can be filtered by using more accurate DIOU measurement, and the ID distribution precision in the tracking process is improved. 
The multi-target tracking method based on the self-adaptive double-decoder structure, the noise self-adaptive Kalman filtering based on the width prediction and the motion information matching method based on the width matching and the DIOU measurement is constructed, has important theoretical value for optimizing the multi-target tracking method, and has important application value for the fields of intelligent security and the like.
By combining the adaptive dual-decoder structure with FairMOT, the excessive competition between the target detection task and the appearance feature extraction task can be better alleviated, feature learning friendlier to each task is realized, and the overall tracking performance of the multi-target tracking method is improved. The method has good extensibility: the adaptive dual-decoder structure is applicable to improving different backbone networks under the JDE paradigm; the width-prediction-based noise-adaptive Kalman filter can replace the original standard Kalman filter in many MOT methods; and the motion information matching method based on width expansion and the DIOU metric can be used in any MOT solution that includes the standard IOU motion information matching method, and has important theoretical significance and application value.
Example 1
As shown in fig. 1, the present embodiment provides a multi-target tracking method based on an adaptive dual decoder, which includes the following steps:
step 1: acquiring video data, preprocessing the video data, and dividing the video data into continuous frame picture data according to frames;
step 2: sequentially inputting all the segmented frame picture data into a trained target detection and appearance feature extraction model, and detecting an interested target and extracting appearance features of the current frame to obtain all the latest detected target boundary frames of the current frame and corresponding appearance features of the current frame;
step 3: the appearance information matching process is used for forming an appearance similarity matrix by comparing appearance characteristics of a plurality of existing tracks with appearance characteristics of a plurality of latest detection frames of a current frame, and realizing maximum matching by a Hungary algorithm;
step 4: the motion information matching process is to predict a prediction frame of an existing track in a current frame through Kalman filtering, compare overlapping metrics of a plurality of Kalman prediction frames of the existing track and a plurality of latest detection frames of the current frame to form a motion similarity matrix, and realize maximum matching through a Hungary algorithm;
step 5: and (3) repeating the steps 1-4 until all the picture frames are processed by the algorithm, and outputting the ID number and the boundary box information of each target in each frame of the video.
In step 1, the size of the picture data in this embodiment is 1088×608.
As shown in fig. 2, in step 2, the target detection and appearance feature model uses a modified FairMOT based on an adaptive dual decoder structure.
The improvement is as follows: the backbone network of the original FairMOT model is an enhanced ResNet34; its deep-aggregation DLA34 variant better fuses semantic and spatial features by using more skip-layer connections, and the network uses deformable convolution in the feature decoder part to improve feature extraction capability. After the backbone network, different tasks are realized by setting homogeneous output heads: for the detection task, three branches are used for the heat map, center point offset, and box size respectively; for the re-identification task, a convolution layer with 128 convolution kernels is used to learn feature embeddings.
When the target detection task and the appearance feature extraction task are realized in the same network, competition between the two tasks is unavoidable. Although the original model alleviates this problem to some extent by setting homogeneous output heads, the two tasks are essentially different: the target detection task focuses on inter-class differences between foreground and background, while the appearance feature extraction task focuses on intra-class differences.
To solve this problem, the invention designs a homogeneous adaptive dual-decoder structure that lets each task learn the features most important to it. As shown in fig. 3, on the existing encoder-decoder DLA34 variant, the last-layer IDA (Iterative Deep Aggregation) structure of the decoder, i.e., the multi-layer feature re-fusion structure, is duplicated, so that the decoder part has two IDA layers, one used for the detection task and the other for the appearance feature extraction task; each task can thus learn fused features more suited to it during training. In addition, a Convolutional Block Attention Module (CBAM) is connected after each of the two IDA layers; this design accelerates network training while letting the network adaptively capture the features most relevant to each task, thereby improving tracker performance.
After the self-adaptive double-decoder structure, different tasks are realized by setting a homogeneous output head, for a detection task, three branches are respectively used for heat map, center point offset and frame size, and for an appearance feature extraction task, a convolution layer with 128 convolution kernels is used for learning feature embedding.
In step 2, the training process of the target detection and appearance feature extraction model is as follows:
step 1: constructing a data set: a training set is jointly constructed from several public data sets: ETH, CityPersons, CalTech, CUHK-SYSU, PRW and MOT17-half-train. For the data sets with only box annotations (ETH and CityPersons), the IDs are set to -1; for the remaining data sets with identity annotations, the total number of ID numbers nID is counted in advance to ensure unique ID numbers during training;
step 2: preprocessing the data set: videos in the data set are split into frame images; image augmentation methods such as HSV (hue, saturation, value) enhancement, rotation, scaling, translation and shearing are applied; then the heat map representation of each image and label information such as box offsets and sizes are generated;
step 3: training the target detection and appearance feature extraction model: using the DLA34 variant pre-trained on COCO as the initial weights, 30 epochs are trained on the constructed training set with an uncertainty loss, using the Adam optimizer with an initial learning rate of 1e-4.
After training, the linear classification layer behind the appearance feature extraction head is removed, so that the output of the appearance feature extraction head is 128-dimensional; the nID-dimensional classification vector obtained through Softmax during training is no longer used. This yields the final trained model.
After the latest detection boxes and corresponding appearance features of the current frame are obtained from the trained target detection and appearance feature extraction model: for the first frame, there are only the detections output by the detector and no existing tracks, so track information is initialized directly from the detection boxes, including the initialization of each track's Kalman filter, and the appearance feature of each detection box is taken as the initial appearance feature of its track. From the second frame onward, there are both the latest detections and appearance features of the current frame and the existing tracks obtained from past frames, and the appearance information association matching process can begin.
The method specifically comprises the following steps: when the input picture frame is the first frame, the targets detected on it are used to initialize new tracks, each detected target corresponding to a brand new track.
Specifically, the initialization of a new track mainly comprises initializing its motion state (position and velocity) and its appearance feature: the position is initialized to the position of the detection box in the picture frame, the velocity is initialized to 0, and the appearance feature is initialized to the appearance feature of the target's detection box. When the current input picture frame is the second frame or later, memory holds both the latest detection boxes of the current frame with their appearance features and the existing tracks obtained from past frames, and the appearance information association matching process can begin.
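The track initialization just described (position from the detection box, zero velocity, appearance from the detection's embedding) can be sketched as follows. The `Track` container and the 8-dimensional state layout (four box coordinates plus four velocity terms) are illustrative assumptions, not the patent's exact data structures.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    mean: np.ndarray      # motion state: box coordinates followed by velocities
    feature: np.ndarray   # appearance embedding (128-d in this method)

def init_tracks(det_boxes, det_features, next_id=1):
    """Initialize one new track per first-frame detection: position taken from
    the detection box, velocity components zeroed, appearance taken from the
    detection's embedding."""
    tracks = []
    for box, feat in zip(det_boxes, det_features):
        state = np.concatenate([np.asarray(box, float), np.zeros(4)])  # vel = 0
        tracks.append(Track(next_id, state, np.asarray(feat, float)))
        next_id += 1
    return tracks
```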
In step 3, assuming the number of existing tracks is N and the number of latest detections on the current frame is M, the appearance information association matching first compares the appearance features of all existing tracks with the appearance features corresponding to all latest detection frames of the current frame to form an N×M appearance similarity matrix, and then achieves maximum pairwise matching between the appearance features of all existing tracks and those of all latest detection frames of the current frame through the Hungarian algorithm with a threshold of 0.4;
for each existing track successfully matched with a latest detection of the current frame, the track's Kalman filtering state and appearance features are updated with the newly detected bounding box and its appearance features; the remaining unmatched existing tracks and unmatched latest detections of the current frame continue to the subsequent motion information association process.
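A minimal sketch of the appearance association step: cosine distances form the N×M matrix, and a greedy thresholded assignment stands in for the Hungarian algorithm the patent specifies (an exact maximum matching would use e.g. `scipy.optimize.linear_sum_assignment`). All names are illustrative:

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity between two feature vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def associate_appearance(track_feats, det_feats, thresh=0.4):
    # N x M appearance distance matrix (tracks x detections)
    cost = [[cosine_distance(t, d) for d in det_feats] for t in track_feats]
    # Greedy assignment in order of increasing distance, rejecting any
    # pair whose distance exceeds the threshold (0.4 in the patent).
    pairs = sorted((cost[i][j], i, j)
                   for i in range(len(track_feats))
                   for j in range(len(det_feats)))
    used_t, used_d, matches = set(), set(), []
    for c, i, j in pairs:
        if c <= thresh and i not in used_t and j not in used_d:
            matches.append((i, j))
            used_t.add(i)
            used_d.add(j)
    return matches
```

Unmatched indices on either side would then be handed to the motion association stage.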
In the above process, the appearance feature information is updated by the exponential moving average (Exponential Moving Average, EMA) method, which addresses the sensitivity of appearance features to detection quality. The appearance state e_i^t of the i-th track at frame t is updated as:

e_i^t = α·e_i^{t-1} + (1 − α)·f_i^t

where f_i^t denotes the appearance feature of the currently matched detection, i.e. the ReID (re-identification) feature of the target, and the momentum term α is taken as 0.9.
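The EMA update maps directly to code; this sketch omits the L2 renormalization that ReID pipelines commonly apply afterwards:

```python
def ema_update(track_feat, det_feat, alpha=0.9):
    # e_i^t = alpha * e_i^{t-1} + (1 - alpha) * f_i^t, with momentum
    # alpha = 0.9 as in the text; det_feat is the matched detection's
    # ReID feature.
    return [alpha * e + (1.0 - alpha) * f for e, f in zip(track_feat, det_feat)]
```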
In step 4, the motion information association matching first predicts the frames of all remaining existing tracks on the current frame through the width-prediction-based noise-adaptive Kalman filter to obtain their Kalman prediction frames on the current frame, then applies a width-expansion transformation to both the Kalman prediction frames and the remaining latest detection frames and computes the DIOU distance to obtain a motion similarity matrix, and finally achieves maximum matching between the Kalman prediction frames and the remaining detection frames through the Hungarian algorithm;
for each successfully matched pair of a remaining existing track and a remaining latest detection of the current frame, the track's Kalman filtering state and appearance features are updated with the latest detection frame and its appearance features.
The Kalman filter is used to estimate the motion state of a track. The width-prediction-based noise-adaptive Kalman filter differs from the traditional standard Kalman filter, whose estimated motion state variables are:
[x,y,a,h,vx,vy,va,vh]
where x, y are the centre coordinates of the target frame, a is its aspect ratio, h is its height, and vx, vy, va, vh are the velocities of x, y, a, and h respectively. Practice shows that better estimation performance is obtained by directly estimating the width of the bounding box, i.e. predicting:
[x,y,w,h,vx,vy,vw,vh]
where w represents the width of the bounding box and vw represents the speed of w.
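The constant-velocity prediction implied by this state vector can be sketched as a single step (a full Kalman filter also propagates a covariance matrix, omitted here):

```python
def kf_predict(state, dt=1.0):
    # One constant-velocity prediction step for the 8-dim state
    # [x, y, w, h, vx, vy, vw, vh]: each box component moves by its
    # velocity; the velocities themselves are kept unchanged.
    x, y, w, h, vx, vy, vw, vh = state
    return [x + dt * vx, y + dt * vy, w + dt * vw, h + dt * vh, vx, vy, vw, vh]
```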
In the state-update process of standard Kalman filtering, the noise scale is a constant matrix; the larger the noise scale, the greater the motion uncertainty, and conversely the smaller the uncertainty.
Therefore, this embodiment introduces the detection confidence to adaptively adjust the noise scale: when the Kalman filter is updated with a detection frame, the lower the confidence, the larger the noise scale should be, and the higher the confidence, the smaller. The adaptive change of the noise covariance with the detection confidence is realized by scaling the preset noise covariance R_k according to the detection confidence c_k in state k relative to the detector's confidence threshold c_t (taken as 0.4 in this embodiment); introducing the detection confidence into the state-update step makes the noise scale vary with the confidence.
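The patent's exact formula is rendered as an image and is not reproduced in this text. A common formulation from the NSA Kalman filter literature scales the measurement-noise covariance as R̃_k = (1 − c_k)·R_k; the sketch below uses that form as an assumption, so low-confidence detections inflate the noise and are trusted less:

```python
def nsa_noise_scale(R_k, c_k):
    # Assumed NSA form: R~_k = (1 - c_k) * R_k. A confidence c_k near 1
    # shrinks the measurement noise, so the filter trusts the detection
    # more; a low confidence leaves the noise large.
    return [[(1.0 - c_k) * v for v in row] for row in R_k]
```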
Assuming the number of remaining existing tracks is X and the number of remaining latest detections on the current frame is Y, the motion information association matching first predicts the frames of all remaining existing tracks on the current frame through the width-prediction-based noise-adaptive Kalman filter; here the detection confidence threshold c_t in the noise-adaptive adjustment strategy takes the value 0.4.
In the motion information association matching process, this embodiment does not directly calculate the overlap between the Kalman prediction frames and the target detection frames on the current frame, but first expands the widths of both sets of frames.
Specifically, instead of directly computing the overlap of the original frames [x, y, w, h] of the Kalman prediction frame and the target detection frame, the overlap of the width-expanded frames [x, y, w+Δw, h] is computed; this width-expansion strategy reduces some false negative matches. Meanwhile, instead of the IOU metric, the more accurate DIOU metric is used:
DIOU introduces a penalty term on top of the original IOU: DIOU = IOU − ρ²(b, b^gt)/c², where b and b^gt denote the centre points of the prediction frame and the detection frame respectively, ρ denotes the Euclidean distance between them, and c denotes the diagonal length of the smallest rectangle enclosing both bounding boxes; DIOU therefore models the normalized distance between the detection frame and the prediction frame. Its stricter requirement on the frames filters out more false positive matches, making ID assignment during tracking more accurate.
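The width-expanded DIOU described above can be sketched as follows for centre-format boxes [x, y, w, h]; the expansion amount `dw` is an assumed parameter:

```python
def expanded_diou(b1, b2, dw=0.0):
    # Expand both widths by dw, then compute DIoU = IoU - rho^2 / c^2,
    # where rho is the distance between the box centres and c is the
    # diagonal of the smallest rectangle enclosing both boxes.
    (x1, y1, w1, h1), (x2, y2, w2, h2) = b1, b2
    w1, w2 = w1 + dw, w2 + dw
    l1, r1, t1, btm1 = x1 - w1 / 2, x1 + w1 / 2, y1 - h1 / 2, y1 + h1 / 2
    l2, r2, t2, btm2 = x2 - w2 / 2, x2 + w2 / 2, y2 - h2 / 2, y2 + h2 / 2
    iw = max(0.0, min(r1, r2) - max(l1, l2))
    ih = max(0.0, min(btm1, btm2) - max(t1, t2))
    inter = iw * ih
    union = w1 * h1 + w2 * h2 - inter
    iou = inter / union if union > 0 else 0.0
    rho2 = (x1 - x2) ** 2 + (y1 - y2) ** 2       # centre distance squared
    cw = max(r1, r2) - min(l1, l2)
    ch = max(btm1, btm2) - min(t1, t2)
    c2 = cw * cw + ch * ch                        # enclosing diagonal squared
    return iou - (rho2 / c2 if c2 > 0 else 0.0)
```

Identical boxes score 1.0, while widely separated boxes go negative, which is what lets DIOU reject more false positive matches than plain IOU.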
As shown in fig. 4, an existing track remaining after the motion information association matching process is kept for 30 frames; if it is not successfully matched with any detection frame within that period, it is deleted from memory. Each remaining latest detection of the current frame initializes a brand-new track, including initializing the motion state of the track's Kalman filter and taking the detection's appearance feature as the track's initial appearance feature; if such a new track fails to match a detection in any of its first three consecutive frames, it is deleted from memory. At the output end, after all video frames have passed through the multi-target tracking algorithm, the ID and frame information of each target in each frame are output.
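The track lifecycle rules above (30-frame retention for lost tracks, deletion of tentative tracks that miss a match in their first three frames) can be sketched as a single predicate; the dictionary keys are illustrative:

```python
def should_delete(track, current_frame, max_lost=30):
    # track is a dict with "last_matched_frame" and "start_frame",
    # evaluated after the current frame's matching has completed.
    lost_for = current_frame - track["last_matched_frame"]
    age = current_frame - track["start_frame"]
    if lost_for > max_lost:        # lost for more than 30 frames
        return True
    if age < 3 and lost_for > 0:   # tentative track missed a match
        return True
    return False
```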
For the method provided by the invention, ablation experiments on the public MOT17-half-val data set verify the effectiveness of each component. The main evaluation metrics for measuring tracking performance are MOTA and IDF1:

MOTA = 1 − (FP + FN + IDSW) / GT

where FP is the total number of false positive detections in the video, FN is the total number of false negative detections in the video, IDSW is the number of ID switches, and GT is the total number of ground-truth label boxes;

IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN)

where IDTP is the total number of correct ID predictions in the video, IDFP is the total number of wrongly assigned IDs in the video, and IDFN is the total number of missed ID assignments in the video.
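The standard definitions of these two metrics reduce to simple arithmetic:

```python
def mota(fp, fn, idsw, gt):
    # MOTA = 1 - (FP + FN + IDSW) / GT
    return 1.0 - (fp + fn + idsw) / gt

def idf1(idtp, idfp, idfn):
    # IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN)
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)
```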
The ablation experiment takes the original FairMOT as the Baseline and introduces on top of it the three improvements proposed by the invention: the noise-adaptive Kalman filter based on width prediction (Noise Adaptive Kalman Filter based on Width prediction, WNSA-KF), the motion information matching method based on width expansion and DIOU measurement (WDIOU), and the adaptive dual decoder structure (Adaptive Dual Decoder, ADDecoder). The experimental results are shown in the following table (✓ marks the components enabled in each row):

Method                       WNSA-KF  WDIOU  ADDecoder  MOTA  IDF1
Baseline (FairMOT)              -       -        -      69.1  72.8
Baseline + WNSA-KF              ✓       -        -      69.1  73.4
Baseline + WDIOU                -       ✓        -      69.2  73.4
Baseline + ADDecoder            -       -        ✓      70.3  74.5
Baseline + WNSA-KF + WDIOU      ✓       ✓        -      69.2  73.5
Baseline + all three            ✓       ✓        ✓      70.4  74.7
Experimental results show that WNSA-KF alone indeed improves the accuracy of the Kalman filtering state estimation and avoids many false positive associations, increasing IDF1 by 0.6%; WDIOU matching alone reduces both false negative and false positive matches, increasing MOTA and IDF1 simultaneously; ADDecoder yields the largest tracking performance gain, increasing MOTA by 1.2% and IDF1 by 1.7%, which demonstrates that good target detection and appearance feature extraction performance are the foundation of tracker accuracy. Finally, using the three methods together brings the largest improvement over the original multi-target tracking method: MOTA increases by 1.3% and IDF1 by 1.9% over the original Baseline.
Example two
The present embodiment provides a multi-target tracking system based on an adaptive dual decoder, including:
a data preprocessing module configured to: acquiring video data, preprocessing the video data, and dividing the video data into continuous frame picture data according to frames;
a feature extraction module configured to: based on continuous frame picture data and a trained target detection and appearance feature extraction model, detecting interesting targets and extracting appearance features of a current frame to obtain the latest detection frames and the corresponding appearance features of all interesting targets of the current frame;
the construction process of the target detection and appearance feature extraction model comprises the following steps: adopting an improved FairMOT based on an adaptive double-decoder structure, copying the last layer IDA structure of a decoder, namely a multi-layer characteristic re-fusion structure, on the existing coder-decoder structure DLA34 variant, wherein the decoder part shares two layers of IDA structures, one layer of IDA structure is used for a detection task, the other layer of IDA structure is used for an appearance characteristic extraction task, and a convolution block attention structure is respectively connected after the two layers of IDA structures;
a multi-target tracking module configured to: based on the latest detection frames and the corresponding appearance characteristics of all the interested targets of the current frame, carrying out appearance information matching and motion information matching on the appearance characteristics of the existing track and the prediction frames of the current frame; and outputting the ID number and the boundary box information of each target in each frame of the video until all the picture frames are processed.
Example III
The present embodiment provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of a multi-target tracking method based on an adaptive dual decoder as described in embodiment one.
Example IV
The present embodiment provides a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps in an adaptive dual decoder-based multi-target tracking method as described in embodiment one when executing the program.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored on a computer readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A multi-target tracking method based on an adaptive dual decoder, comprising:
acquiring video data, preprocessing the video data, and dividing the video data into continuous frame picture data according to frames;
based on continuous frame picture data and a trained target detection and appearance feature extraction model, detecting interesting targets and extracting appearance features of a current frame to obtain the latest detection frames and corresponding appearance features of all interesting targets of the current frame;
the construction process of the target detection and appearance feature extraction model comprises the following steps: adopting an improved FairMOT based on an adaptive double-decoder structure, copying the last layer IDA structure of a decoder, namely a multi-layer characteristic re-fusion structure, on the existing coder-decoder structure DLA34 variant, wherein the decoder part shares two layers of IDA structures, one layer of IDA structure is used for a detection task, the other layer of IDA structure is used for an appearance characteristic extraction task, and a convolution block attention structure is respectively connected after the two layers of IDA structures;
based on the latest detection frames and the corresponding appearance characteristics of all the interested targets of the current frame, carrying out appearance information matching and motion information matching on the appearance characteristics of the existing track and the prediction frames of the current frame;
forming an appearance similarity matrix by comparing appearance characteristics of all existing tracks with appearance characteristics corresponding to all latest detection frames of the current frame; based on the appearance similarity matrix, carrying out maximum matching between appearance characteristics of all existing tracks and appearance characteristics of all latest detection frames of the current frame;
matching the remaining unmatched existing tracks or latest detections of the current frame through motion information, comprising: predicting a prediction frame of an existing track in the current frame through Kalman filtering, and comparing overlap metrics between the Kalman prediction frames of the existing tracks and the latest detection frames of the current frame to form a motion similarity matrix, comprising: predicting the frames of all remaining existing tracks on the current frame through a width-prediction-based noise-adaptive Kalman filter to obtain the Kalman prediction frames of the existing tracks on the current frame; applying a width-expansion transformation to the Kalman prediction frames and the remaining latest detection frames on the current frame, and calculating a DIOU distance, so as to obtain the motion similarity matrix;
based on the motion similarity matrix, carrying out maximum matching between the Kalman prediction frames and the remaining detection frames;
the noise scale is a constant matrix; detection confidence is introduced to adaptively adjust the noise scale; the adaptive change of the noise covariance with the detection confidence is achieved by the following formula:
wherein R_k is the preset noise covariance, c_t represents a confidence threshold of the detector, and c_k represents the detection confidence in the k state;
and outputting the ID number and the boundary box information of each target in each frame of the video until all the picture frames are processed.
2. The multi-target tracking method based on the adaptive dual decoder as claimed in claim 1, wherein when the input picture frame is the first frame, a new track is initialized by using a plurality of targets detected on the first frame, each detected target corresponds to a brand new track, the initialization of the new track mainly comprises the initialization of the motion state of the new track and the initialization of the appearance feature of the new track, the position of the new track is initialized to the position of the detection frame in the picture frame, the speed of the new track is initialized to 0, and the appearance feature of the new track is initialized to the appearance feature corresponding to the target detection frame; when the current input picture frame is the second frame or the later frame, the memory has the latest detection frame of the current frame and the corresponding appearance characteristics thereof, and also has the existing track obtained from the past frame, and the appearance information association matching process can be entered at this time.
3. The multi-target tracking method based on the adaptive dual decoder as claimed in claim 1, wherein in the appearance information correlation matching process, for the latest detection of the existing track successfully matched with the current frame, the Kalman filtering state and appearance characteristics of the track are updated through the latest detected bounding box and the appearance characteristics thereof;
in the motion information association matching process, for the latest detections of the remaining existing tracks and the remaining current frame that are successfully matched, the Kalman filtering state and the appearance characteristics of the track are updated through the latest detection frame and the appearance characteristics thereof.
4. A multi-target tracking method based on adaptive dual decoders as claimed in claim 3, wherein in the state updating process of the Kalman filter, detection confidence is introduced to adaptively adjust the noise scale; when the Kalman filter is updated according to the detection frame, the smaller the confidence, the larger the noise scale, and the larger the confidence, the smaller the noise scale.
5. The multi-target tracking method based on adaptive dual decoders as claimed in claim 1, wherein for the remaining existing tracks after the motion information association matching process, N frames are stored, and if a remaining existing track is not successfully matched with a certain detection frame in the process, it is deleted from the memory; for the remaining latest detections of the current frame, a brand new track is initialized, including initializing the motion state of the Kalman filter corresponding to the track, and the appearance characteristic corresponding to the detection is used as the initial appearance characteristic of the track.
6. An adaptive dual-decoder-based multi-target tracking system employing the adaptive dual-decoder-based multi-target tracking method as claimed in claim 1, comprising:
a data preprocessing module configured to: acquiring video data, preprocessing the video data, and dividing the video data into continuous frame picture data according to frames;
a feature extraction module configured to: based on continuous frame picture data and a trained target detection and appearance feature extraction model, detecting interesting targets and extracting appearance features of a current frame to obtain the latest detection frames and the corresponding appearance features of all interesting targets of the current frame;
the construction process of the target detection and appearance feature extraction model comprises the following steps: adopting an improved FairMOT based on an adaptive double-decoder structure, copying the last layer IDA structure of a decoder, namely a multi-layer characteristic re-fusion structure, on the existing coder-decoder structure DLA34 variant, wherein the decoder part shares two layers of IDA structures, one layer of IDA structure is used for a detection task, the other layer of IDA structure is used for an appearance characteristic extraction task, and a convolution block attention structure is respectively connected after the two layers of IDA structures;
a multi-target tracking module configured to: based on the latest detection frames and the corresponding appearance characteristics of all the interested targets of the current frame, carrying out appearance information matching and motion information matching on the appearance characteristics of the existing track and the prediction frames of the current frame; and outputting the ID number and the boundary box information of each target in each frame of the video until all the picture frames are processed.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of a multi-target tracking method based on an adaptive dual decoder as claimed in any of claims 1-5.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of a multi-target tracking method based on an adaptive dual decoder according to any of claims 1-5 when the program is executed.
CN202310926896.5A 2023-07-26 2023-07-26 Multi-target tracking method and system based on self-adaptive double decoders Active CN117011335B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310926896.5A CN117011335B (en) 2023-07-26 2023-07-26 Multi-target tracking method and system based on self-adaptive double decoders

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310926896.5A CN117011335B (en) 2023-07-26 2023-07-26 Multi-target tracking method and system based on self-adaptive double decoders

Publications (2)

Publication Number Publication Date
CN117011335A CN117011335A (en) 2023-11-07
CN117011335B true CN117011335B (en) 2024-04-09

Family

ID=88572177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310926896.5A Active CN117011335B (en) 2023-07-26 2023-07-26 Multi-target tracking method and system based on self-adaptive double decoders

Country Status (1)

Country Link
CN (1) CN117011335B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816690A (en) * 2018-12-25 2019-05-28 北京飞搜科技有限公司 Multi-target tracking method and system based on depth characteristic
CN111260689A (en) * 2020-01-16 2020-06-09 东华大学 Effective confidence enhancement correlation filtering visual tracking algorithm
WO2020155873A1 (en) * 2019-02-02 2020-08-06 福州大学 Deep apparent features and adaptive aggregation network-based multi-face tracking method
CN113610895A (en) * 2021-08-06 2021-11-05 烟台艾睿光电科技有限公司 Target tracking method and device, electronic equipment and readable storage medium
CN114167359A (en) * 2021-12-06 2022-03-11 南京天朗防务科技有限公司 Adaptive correlation filtering method, system and storage medium for weak and small targets
CN114332701A (en) * 2021-12-27 2022-04-12 北京航空航天大学 Target tracking method based on task distinguishing detection re-identification combined network
CN116309731A (en) * 2023-03-09 2023-06-23 江苏大学 Multi-target dynamic tracking method based on self-adaptive Kalman filtering
CN116402851A (en) * 2023-03-17 2023-07-07 中北大学 Infrared dim target tracking method under complex background

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827325B (en) * 2019-11-13 2022-08-09 阿波罗智联(北京)科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN115690146A (en) * 2021-07-29 2023-02-03 北京图森智途科技有限公司 Multi-target tracking method and device, computing equipment and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816690A (en) * 2018-12-25 2019-05-28 北京飞搜科技有限公司 Multi-target tracking method and system based on depth characteristic
WO2020155873A1 (en) * 2019-02-02 2020-08-06 福州大学 Deep apparent features and adaptive aggregation network-based multi-face tracking method
CN111260689A (en) * 2020-01-16 2020-06-09 东华大学 Effective confidence enhancement correlation filtering visual tracking algorithm
CN113610895A (en) * 2021-08-06 2021-11-05 烟台艾睿光电科技有限公司 Target tracking method and device, electronic equipment and readable storage medium
CN114167359A (en) * 2021-12-06 2022-03-11 南京天朗防务科技有限公司 Adaptive correlation filtering method, system and storage medium for weak and small targets
CN114332701A (en) * 2021-12-27 2022-04-12 北京航空航天大学 Target tracking method based on task distinguishing detection re-identification combined network
CN116309731A (en) * 2023-03-09 2023-06-23 江苏大学 Multi-target dynamic tracking method based on self-adaptive Kalman filtering
CN116402851A (en) * 2023-03-17 2023-07-07 中北大学 Infrared dim target tracking method under complex background

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Transformer multi-object tracking method based on dual decoders"; Wang Li; Xuan Shibin; Qin Xuyang; Li Ziwei; Journal of Computer Applications; 2023-07-05; Vol. 43, No. 6; 1919-1929 *
Mingming Wang; Weining Zhang; Yang Yang. "Particle filter-based target tracking algorithm with adaptive multi-feature fusion". 2011 International Conference on Multimedia Technology. 2011, 3595-3598. *
"Target-state-based adaptive Kalman filter"; Gao Tao; Cui Weiwei; Yuan Jianhua; Wang Shanmin; Radar & Countermeasure; 2017-12-15 (04); 19-22+37 *

Also Published As

Publication number Publication date
CN117011335A (en) 2023-11-07

Similar Documents

Publication Publication Date Title
CN109344725B (en) Multi-pedestrian online tracking method based on space-time attention mechanism
JP6625220B2 (en) Method and system for detecting the action of an object in a scene
CN107122736B (en) Human body orientation prediction method and device based on deep learning
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN114972418B (en) Maneuvering multi-target tracking method based on combination of kernel adaptive filtering and YOLOX detection
Foedisch et al. Adaptive real-time road detection using neural networks
CN111161315B (en) Multi-target tracking method and system based on graph neural network
CN112668483B (en) Single-target person tracking method integrating pedestrian re-identification and face detection
Ramirez-Quintana et al. Self-adaptive SOM-CNN neural system for dynamic object detection in normal and complex scenarios
CN105654508B (en) Monitor video method for tracking moving target and system based on adaptive background segmentation
Sukanya et al. A survey on object recognition methods
CN112926410A (en) Target tracking method and device, storage medium and intelligent video system
Dewangan et al. Real time object tracking for intelligent vehicle
CN113763424B (en) Real-time intelligent target detection method and system based on embedded platform
CN113763427B (en) Multi-target tracking method based on coarse-to-fine shielding processing
Freire-Obregón et al. Inflated 3D ConvNet context analysis for violence detection
Zhang et al. A coarse to fine indoor visual localization method using environmental semantic information
CN112434599A (en) Pedestrian re-identification method based on random shielding recovery of noise channel
CN113963026A (en) Target tracking method and system based on non-local feature fusion and online updating
Azari et al. Real time multiple object tracking and occlusion reasoning using adaptive kalman filters
CN112507859A (en) Visual tracking method for mobile robot
Wang et al. Multiple pedestrian tracking with graph attention map on urban road scene
Sharma Feature-based efficient vehicle tracking for a traffic surveillance system
CN114820765A (en) Image recognition method and device, electronic equipment and computer readable storage medium
Xing et al. Feature adaptation-based multipeak-redetection spatial-aware correlation filter for object tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant