CN113409361B - Multi-target tracking method and device, computer and storage medium - Google Patents

Multi-target tracking method and device, computer and storage medium

Info

Publication number
CN113409361B
CN113409361B (application number CN202110922602.2A)
Authority
CN
China
Prior art keywords
target
module
information
loss function
target tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110922602.2A
Other languages
Chinese (zh)
Other versions
CN113409361A (en)
Inventor
林涛
张炳振
刘宇鸣
邓普阳
张枭勇
陈振武
王宇
周勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Urban Transport Planning Center Co Ltd
Original Assignee
Shenzhen Urban Transport Planning Center Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Urban Transport Planning Center Co Ltd filed Critical Shenzhen Urban Transport Planning Center Co Ltd
Priority to CN202110922602.2A priority Critical patent/CN113409361B/en
Publication of CN113409361A publication Critical patent/CN113409361A/en
Application granted granted Critical
Publication of CN113409361B publication Critical patent/CN113409361B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20081: Training; Learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/30: Subject of image; Context of image processing
    • G06T 2207/30241: Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-target tracking method, a multi-target tracking device, a computer and a storage medium, and belongs to the technical field of artificial intelligence. First, a video is input into a fusion detection association module, down-sampling is performed to obtain feature maps, and the feature maps are input into a difference calculation network to obtain difference features. Second, a multi-task learning method in deep learning is used to obtain the target class, the target position information and the same trackID for the same target in different video frames. A trajectory prediction module then predicts the likely position of each target in the current frame from the target motion track information in consecutive frames and provides this as a reference to the fusion detection association module. Finally, the multi-target tracking information is output. The method solves the technical problems of the prior art that target tracking is inefficient, targets are easily lost and target IDs change easily; it improves the efficiency of multi-target tracking and avoids losing tracked targets.

Description

Multi-target tracking method and device, computer and storage medium
Technical Field
The application relates to a target tracking method, in particular to a multi-target tracking method, a multi-target tracking device, a computer and a storage medium, and belongs to the technical field of artificial intelligence.
Background
Multi-target tracking is the simultaneous tracking of multiple targets in a video, with application scenarios such as security and autonomous driving; in these scenarios the number of people and vehicles is uncertain and the characteristics of each target are uncertain, and tracking the targets is the basis of other applications (such as target positioning and target density calculation). Unlike single-target tracking, multi-target tracking assigns each target a unique ID and must ensure that the target is not lost during tracking. At the same time, the appearance of new targets and the disappearance of old targets are also problems that multi-target tracking has to solve.
At present there is much research on multi-target tracking. The main tracking strategy is DBT (detection-based tracking), in which the detection module and the data association module are independent: the video sequence first passes through a detection algorithm to obtain the position information of the targets, and the final trajectory result is obtained after the data association algorithm is executed.
A representative algorithm in multi-target tracking is the DeepSORT algorithm, which belongs to the data association algorithms in MOT (multi-object tracking) and can be combined with any detector to realize multi-target tracking. The algorithm combines a Kalman filtering algorithm and the Hungarian algorithm. The Kalman filter predicts the state of a detection frame in the next frame, and this prediction is matched with the detection result of the next frame. In the matching process the Hungarian algorithm is used, and the motion features obtained by Kalman filtering are fused with appearance features extracted by a CNN (convolutional neural network) to calculate a cost matrix.
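As an illustration of this prior-art association step, the sketch below shows one way a fused cost matrix can be built from a motion distance and an appearance distance and then solved with the Hungarian algorithm; the weighting factor lam, the simplified center-distance motion term and all variable names are assumptions made for illustration, not the cited algorithm's actual implementation.

# Hedged sketch of a DeepSORT-style association step (not the patent's method):
# fuse a motion distance and an appearance distance into one cost matrix and
# solve the assignment with the Hungarian algorithm.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_boxes, det_boxes, track_feats, det_feats, lam=0.5):
    """track_boxes/det_boxes: (T, 4)/(D, 4) boxes as (cx, cy, w, h);
    track_feats/det_feats: (T, F)/(D, F) L2-normalised appearance embeddings."""
    # Motion term: distance between predicted and detected box centres
    # (DeepSORT itself gates with a Mahalanobis distance from the Kalman state).
    motion = np.linalg.norm(track_boxes[:, None, :2] - det_boxes[None, :, :2], axis=-1)
    motion = motion / (motion.max() + 1e-6)            # normalise to [0, 1]
    appearance = 1.0 - track_feats @ det_feats.T       # cosine distance term
    cost = lam * motion + (1.0 - lam) * appearance     # fused cost matrix
    rows, cols = linear_sum_assignment(cost)           # Hungarian assignment
    return list(zip(rows.tolist(), cols.tolist()))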
MOT is mainly applied in scenarios such as security and autonomous driving, and these scenarios place high demands on the real-time performance of the algorithm. For a given level of hardware, the detection efficiency and detection accuracy of MOT should be improved as much as possible. In the prior art, MOT suffers from low efficiency in practical applications. Existing real-time MOT methods usually focus only on the data association step, which in essence completes only part of the MOT pipeline and cannot truly solve the efficiency problem.
In addition, different targets are often occluded in a real scene, which can cause problems of target loss, target ID change and the like in the MOT.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to determine the key or important part of the present invention, nor is it intended to limit the scope of the present invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
In view of this, in order to solve the technical problems of low target tracking efficiency, easy target loss and easy target ID change in the prior art, the present invention provides a multi-target tracking method, apparatus, computer and storage medium.
The fusion detection association module outputs the position and class information of different targets. The trajectory prediction module takes this information as input to learn the trajectory information of different types of targets, thereby improving the efficiency of target tracking and avoiding the loss of tracked targets.
A multi-target tracking method comprises the following steps:
S110, inputting the video into a fusion detection association module, performing down-sampling processing to obtain feature maps, and inputting the feature maps into a difference calculation network to obtain difference features;
S120, calculating a loss function;
S130, acquiring the target class, the target position information and the data association relationship between targets; inputting the target position information into a trajectory prediction module, learning the target movement by convolution operations, outputting predicted position information, forming motion rule information for different types of targets and transmitting it to a database and to the fusion detection association module;
S140, outputting the multi-target tracking information.
Preferably, the specific method for obtaining the feature map in step S110 is:
1) Performing 1/4 down-sampling on the video through convolution layer 1 to obtain feature map 1;
2) Performing 1/8 down-sampling on feature map 1 through convolution layer 2 to obtain feature map 2;
3) Performing 1/16 down-sampling on feature map 2 through convolution layer 3 to obtain feature map 3.
Preferably, the calculating the loss function in step S120 specifically includes the following three loss functions:
1) A target classification loss function;
2) A target location regression loss function;
3) Multi-objective cross entropy loss function.
Preferably, the calculation methods of the three loss functions in step S120 are specifically:
1) Target classification loss function L_cls (the formula itself is reproduced as an equation image in the original publication), wherein y_i represents the true class label of a target, x_i represents the model predicted value, and M represents the total number of target classes; c_{y_i} represents the class feature corresponding to label y_i; λ represents a class-feature balance coefficient that balances the influence of the class features on the overall loss function and takes the value 0.5.
The class features are randomly initialized at the beginning of training and updated at each training iteration. In the update formula (also given as an equation image), Δc_j represents the difference between the current data and the class feature, M represents the total number of target classes, c_{y_j} represents the class feature of label y_j, and x_i represents the model predicted value; Δc_j^t represents the difference between the current data and the class feature at the t-th iteration, after which c_j is updated accordingly, while the coefficient α is used to keep c_j stable and takes the value 0.5;
2) Target position regression loss function (formula given as an equation image), wherein t_i represents the model target predicted value, t_i* represents the target true value, and i takes the values x, y, w and h; x and y represent the coordinates of the center point of the detection frame, w represents the width of the detection frame and h represents the height of the detection frame, so the position and size of the target detection frame can be obtained by regressing x, y, w and h. If the target predicted position output by the trajectory prediction module is added, the target position regression loss function gains an additional term, wherein t'_i represents the position output by the trajectory prediction module and likewise contains x, y, w and h information;
3) Multi-target cross-entropy loss function (formula given as an equation image), wherein y_i represents the true class label of a target and x_i represents the model predicted value;
the fusion detection association module aims to generate the target class, the target position information and the trackID information of targets across different video frames, so the loss functions need to be weighted and summed into a total loss function, i.e. the loss function of the fusion detection association module needs to be calculated;
the loss function of the fusion detection association module is the weighted sum of the three losses above (formula given as an equation image), where the multi-task weight parameters can be set according to different task requirements.
Preferably, in S230, the target movement law is learned through a three-layer ConvLSTM network and the predicted position information is output; specifically, the first layer learns feature information of the target, the second layer learns position change information of the target between consecutive frames, and the third layer outputs the predicted position information.
A multi-target tracking device comprises a video input module, a fusion detection association module, a trajectory prediction module, an output module and a storage module; the video input module is connected in turn with the fusion detection association module and the output module; the video input module and the fusion detection association module are connected with the trajectory prediction module; the trajectory prediction module is connected with the storage module. The video input module is used for inputting video information; the fusion detection association module is used for acquiring the target class, the target position information and the data association relationship between targets and for outputting the target position information to the trajectory prediction module; the trajectory prediction module is used for acquiring the motion rule information of different types of targets and outputting it to the storage module and the fusion detection association module; the output module is used for outputting the target tracking result produced by the fusion detection association module; the storage module is used for storing the motion rule information of the different types of targets.
A computer comprising a memory storing a computer program and a processor implementing the steps of a multi-target tracking method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements a multi-target tracking method.
The invention has the following beneficial effects: in this scheme the detection algorithm and the data association algorithm are fused into one module, which reduces repeated calculation. The trajectory prediction module handles the matching of difficult targets well; the trackID generated from the target class, the target position information and the data association relationship between targets is more stable, which improves the accuracy of identifying the same target in the previous and the following frame and avoids frequent switching of the trackID. The problems of low computational efficiency and poor real-time performance of existing multi-target tracking technology are solved, and at the same time the method is robust against target loss caused by occlusion.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a fusion detection association module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a difference computing network according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a trajectory prediction module according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a ConvLSTM model according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a multi-target tracking device according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions and advantages of the embodiments of the present application more apparent, the following further detailed description of the exemplary embodiments of the present application with reference to the accompanying drawings makes it clear that the described embodiments are only a part of the embodiments of the present application, and are not exhaustive of all embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Embodiment 1, this embodiment is described with reference to fig. 1 to 3, and a multi-target tracking method includes the following steps:
s110, inputting the video into a fusion detection association module, performing down-sampling processing to obtain a feature map, and inputting the feature map into a difference calculation network to obtain difference features;
firstly, inputting a video to a fusion detection association module to obtain a target position and data association information at one time, wherein a model of the fusion detection association module specifically refers to fig. 2.
The specific method of obtaining the feature maps by down-sampling is as follows. Assume the size of the input video frame is 1280 × 720 (length × width, i.e. 1280 pixels long and 720 pixels wide); the image is first resized to 896 × 896 to facilitate subsequent processing. The down-sampling process is then (see the sketch after this list):
(1) The input image is 1/4 down-sampled by convolution layer 1 (convolution kernel size 8 × 8, stride 8) to obtain feature map 1 with size 224 × 224;
(2) Feature map 1 is 1/8 down-sampled by convolution layer 2 (convolution kernel size 2 × 2, stride 2) to obtain feature map 2 with size 112 × 112;
(3) Feature map 2 is then passed through convolution layer 3 (convolution kernel size 2 × 2, stride 2) for 1/16 down-sampling, obtaining feature map 3 with size 56 × 56.
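A minimal PyTorch-style sketch of this three-stage backbone is given below, as an assumption-laden illustration rather than the patented implementation: the channel widths are invented, and the strides are chosen so that an 896 × 896 input yields the 224 × 224, 112 × 112 and 56 × 56 feature maps stated above.

# Hedged sketch of the down-sampling backbone (channel widths c1, c2, c3 are assumed).
import torch
import torch.nn as nn

class DownsampleBackbone(nn.Module):
    def __init__(self, in_ch=3, c1=64, c2=128, c3=256):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, c1, kernel_size=8, stride=4, padding=2)  # 896 -> 224
        self.conv2 = nn.Conv2d(c1, c2, kernel_size=2, stride=2)                # 224 -> 112
        self.conv3 = nn.Conv2d(c2, c3, kernel_size=2, stride=2)                # 112 -> 56

    def forward(self, frame):                  # frame: (B, 3, 896, 896)
        fm1 = self.conv1(frame)                # feature map 1: (B, c1, 224, 224)
        fm2 = self.conv2(fm1)                  # feature map 2: (B, c2, 112, 112)
        fm3 = self.conv3(fm2)                  # feature map 3: (B, c3, 56, 56)
        return fm1, fm2, fm3

# Example: the three feature maps for one resized 896 x 896 frame.
fm1, fm2, fm3 = DownsampleBackbone()(torch.randn(1, 3, 896, 896))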
At this point, three feature maps of different sizes have been obtained from the image through the down-sampling process. Every frame of image in the fusion detection association module goes through this down-sampling calculation, and the six feature maps of the previous frame and the current frame are used as input and passed into the difference calculation network. The purpose is to calculate and fuse difference features at different scales, and finally a multi-task learning method is used to simultaneously predict the target class, the target position information and the data association relationship between targets.
The difference calculation network is mainly composed of two structures, DenseBlock and Transition. Specifically, a DenseBlock is composed of a BN layer + ReLU layer + 3 × 3 convolution layer, and the input and output feature maps of a DenseBlock have the same size. A Transition is composed of a BN layer + ReLU layer + 1 × 1 convolution layer + 2 × 2 average pooling layer, so the size of the feature map becomes 1/2 of the original size after each Transition. In the actual calculation, a total of six feature maps from the two frames are input into the difference calculation network. Similar to a twin network (Siamese Network), the difference calculation network also has two paths, corresponding respectively to the 3 feature maps of the previous frame and the 3 feature maps of the current frame. The two path networks are identical in structure but have different weights.
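A minimal sketch of the two building blocks described above, assuming arbitrary channel counts: BN + ReLU + 3 × 3 convolution for DenseBlock (size-preserving) and BN + ReLU + 1 × 1 convolution + 2 × 2 average pooling for Transition (halving the spatial size).

# Hedged sketch of the two building blocks of the difference calculation network.
import torch.nn as nn

class DenseBlock(nn.Module):
    """BN + ReLU + 3x3 convolution; input and output feature maps keep the same size."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.body(x)

class Transition(nn.Module):
    """BN + ReLU + 1x1 convolution + 2x2 average pooling; halves the spatial size."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.body(x)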
(1) First, feature map 1 with size 224 × 224 is input into each path; Transition1 reduces the size to 112 × 112, and the DenseBlock1 network then learns features, giving 112 × 112 features;
(2) The features obtained in the previous step are fused (added) with feature map 2 and passed on through the Transition2 and DenseBlock2 networks to obtain 56 × 56 features;
(3) Similarly, the features of the previous step are fused (added) with feature map 3 and passed into the DenseBlock3 network to learn further features;
(4) The previous frame and the current frame each yield a 56 × 56 feature map, and the difference between the two feature maps gives the difference feature with size 56 × 56 (a sketch of this per-path computation follows the list).
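Continuing the sketch and reusing the DenseBlock and Transition classes defined above, one path of the difference calculation network and the final difference step might look as follows; the channel widths follow the assumed backbone sketch, and the previous-frame path and current-frame path are two separate instances (identical structure, different weights).

# Hedged sketch of one path of the difference calculation network and the final
# difference step; reuses the DenseBlock and Transition classes sketched above.
import torch
import torch.nn as nn

class DifferencePath(nn.Module):
    def __init__(self, c1=64, c2=128, c3=256):
        super().__init__()
        self.transition1 = Transition(c1, c2)        # 224x224 -> 112x112
        self.dense1 = DenseBlock(c2)
        self.transition2 = Transition(c2, c3)        # 112x112 -> 56x56
        self.dense2 = DenseBlock(c3)
        self.dense3 = DenseBlock(c3)                 # stays at 56x56

    def forward(self, fm1, fm2, fm3):
        f = self.dense1(self.transition1(fm1))       # step (1)
        f = self.dense2(self.transition2(f + fm2))   # step (2): fuse feature map 2
        f = self.dense3(f + fm3)                     # step (3): fuse feature map 3
        return f

# Step (4): difference feature between the previous-frame and current-frame paths.
prev_path, cur_path = DifferencePath(), DifferencePath()
prev_maps = (torch.randn(1, 64, 224, 224), torch.randn(1, 128, 112, 112), torch.randn(1, 256, 56, 56))
cur_maps = (torch.randn(1, 64, 224, 224), torch.randn(1, 128, 112, 112), torch.randn(1, 256, 56, 56))
diff_feature = prev_path(*prev_maps) - cur_path(*cur_maps)   # (1, 256, 56, 56)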
S120, calculating the loss function; since the network aims to obtain, in a single pass, the target class, the target position information and the data association relationship between targets (that is, the trackID information in the tracking process), the loss function needs to be calculated accordingly.
The calculation of the loss function specifically includes the following three loss functions:
1) A target classification loss function;
2) A target location regression loss function;
3) Multi-objective cross entropy loss function.
The target classification loss function L_cls (the formula itself is reproduced as an equation image in the original publication) is defined as follows: y_i represents the true class label of a target, x_i represents the probability that the model predicts a positive sample, and M represents the total number of target classes; c_{y_i} represents the class feature corresponding to the target class label y_i; λ represents a class-feature balance coefficient that balances the influence of the class features on the overall loss function and takes the value 0.5.
The class features are randomly initialized at the beginning of training and then updated at each training iteration. In the update formula (also given as an equation image), Δc_j represents the difference between the current data and the class feature, M represents the total number of target classes, c_{y_j} represents the class feature of label y_j, and x_i represents the model predicted value; Δc_j^t represents the difference between the current data and the class feature at the t-th iteration, after which c_j is updated accordingly, while the coefficient α is used to keep c_j stable and takes the value 0.5.
The target position regression loss function (formula given as an equation image) is defined over t_i and t_i*, wherein t_i represents the model target predicted value, t_i* represents the target true value, and i takes the values x, y, w and h; x and y represent the coordinates of the center point of the detection frame, w represents the width of the detection frame and h represents the height of the detection frame, so the position and size of the target detection frame can be obtained by regressing x, y, w and h. If the target predicted position output by the trajectory prediction module is added, the target position regression loss function gains an additional term, wherein t'_i represents the position output by the trajectory prediction module and likewise contains x, y, w and h information.
The multi-target cross-entropy loss function (formula given as an equation image) is defined over y_i and x_i, wherein y_i represents the true class label of a target and x_i represents the model predicted value.
The fusion detection association module aims to generate the target class, the target position information and the trackID information of targets across different video frames, so the loss functions are weighted and summed into a total loss function, i.e. the loss function of the fusion detection association module needs to be calculated. The loss function of the fusion detection association module is the weighted sum of the three losses above (formula given as an equation image), where the multi-task weight parameters can be set according to different task requirements.
S130, acquiring the target class, the target position information and the data association relationship between targets; inputting the target position information into the trajectory prediction module, learning the target movement by convolution operations, outputting the predicted position information, forming motion rule information for different types of targets and transmitting it to the database and to the fusion detection association module;
the data association target is to obtain target trackID information in front and back video frames, and if a red vehicle appears in the previous frame and the red vehicle also appears in the current frame, the two vehicles can be judged to be the same trackID through data association. In order to find that the same object has the same trackID in different frames, a model should judge that the same object is closer to the space than different objects, and a common method in the prior art, namely a triplet loss function, is used in the MOT.
The specific algorithm implementation process is as follows: the difference features are followed by a fully connected layer, and the number of nodes N of the fully connected layer indicates that there are at most N different trackIDs (N is a hyper-parameter that can be modified according to the needs of the scene and usually takes the value N = 20000). The classification process classifies an object whenever it is detected. If the target has appeared before, it is classified into its corresponding trackID; otherwise it is a new target with classification label -1, the parameters of the fully connected layer are updated, and a new trackID is added so that the object can be recognized in the subsequent classification process. Meanwhile, during the updating of the model parameters, trackIDs that have not been detected for a long time are forgotten, ensuring that the total number of trackIDs recorded by the model does not exceed N.
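A minimal sketch of this association head is given below: a fully connected layer over the difference features with at most N trackID slots, where a previously seen target is classified into its existing trackID, a low-confidence target is treated as new (the label -1 case) and receives a fresh slot, and slots unseen for a long time are forgotten; the threshold, the age limit and the bookkeeping structure are assumptions.

# Hedged sketch of the trackID association head; the slot bookkeeping is an
# assumed illustration of the behaviour described above.
import torch
import torch.nn as nn

class TrackIDHead(nn.Module):
    def __init__(self, feat_dim=256, max_ids=20000, max_age=300):
        super().__init__()
        self.fc = nn.Linear(feat_dim, max_ids)     # one output node per possible trackID
        self.max_ids = max_ids
        self.max_age = max_age
        self.last_seen = {}                        # trackID slot -> frame index last seen
        self.next_slot = 0

    def forward(self, diff_feats):                 # diff_feats: (num_targets, feat_dim)
        return self.fc(diff_feats)                 # logits over trackID slots

    def assign(self, logits, frame_idx, new_target_threshold=0.5):
        """Pick a trackID per detected target; low scores are treated as new targets."""
        scores, slots = logits.softmax(dim=-1).max(dim=-1)
        ids = []
        for score, slot in zip(scores.tolist(), slots.tolist()):
            if score < new_target_threshold:       # "label -1" case: allocate a new trackID
                slot = self.next_slot % self.max_ids
                self.next_slot += 1
            self.last_seen[slot] = frame_idx
            ids.append(slot)
        # Forget trackIDs that have not been detected for a long time.
        self.last_seen = {s: t for s, t in self.last_seen.items()
                          if frame_idx - t <= self.max_age}
        return ids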
S140, outputting the multi-target tracking information.
Embodiment 2, this embodiment is described with reference to fig. 4, and the multi-target tracking method further includes a trajectory prediction module, which can learn the historical trajectory information of targets of different classes. The model structure of the trajectory prediction module is shown in fig. 4. The LSTM structure is a classical network structure for processing time-series data, while ConvLSTM is a network structure formed by combining an LSTM structure with convolution; the model structure is shown in fig. 5, wherein X_t denotes the input at time t and H_t denotes the output at time t, with the inputs and outputs at neighbouring moments denoted accordingly. Such a structure can not only establish a time-sequence relation, but also exploit the characteristic of convolution to describe the local spatial features of the image.
S210, inputting the target position information into the trajectory prediction module and calculating the output variables C and H in the LSTM; the model takes a sequence of successive image frames as input, e.g. X_t and X_{t+1}, and calculates C (cell output) and H (hidden state) for two consecutive frame inputs; C (cell output) and H (hidden state) are the output variables in the LSTM.
Wherein C represents a cell unit in the LSTM and is used for storing time sequence information and medium-term and long-term memory; h represents a hidden unit for storing the recent memory in the time sequence information.
S220, estimating C and H at the target moment from the C and H of past moments by means of convolution operations;
s230, learning a target movement rule through a three-layer ConvLSTM network, outputting predicted position information, and forming different types of target movement rule information;
wherein the first layer learns the characteristic information of the target; a second layer learns position change information of the target between consecutive frames; the third layer outputs predicted position information.
S240, the motion rule information of the different types of targets is transmitted to the database and to the fusion detection association module respectively. When a target is occluded and the fusion detection association module cannot recognize it in the image information of the current frame, the motion track in the next frame can be predicted from the motion rule information of the different types of targets obtained by training the trajectory prediction model.
In a traffic monitoring scene, the visual angle of the camera is generally fixed, so that the vehicle tracks in the pictures shot by the camera have certain similarity. The rule can be obtained through automatic learning of a special neural network structure. The track learning prediction module can also store the learned motion rule information in a database for a long time, and can be called at any time when the fusion detection association module needs to use the motion rule information.
After the target movement law has been trained and learned, a frame of image and the position information of the current target are input, and the trajectory prediction model outputs the position information of the target at the next moment, comprising x, y, w and h. The predicted target position can be added into the target position loss function of the fusion detection association module, improving the accuracy of position recognition.
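A minimal sketch of such a trajectory predictor is given below: a hand-written ConvLSTM cell (all gates computed by one convolution over the concatenated input and hidden state) stacked three times in the encode / position / decode roles described above, followed by a small regression head that outputs (x, y, w, h); the channel sizes, the grid-shaped input encoding and the regression head are assumptions.

# Hedged sketch of a three-layer ConvLSTM trajectory predictor.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size=k, padding=k // 2)

    def forward(self, x, state):
        h, c = state                                     # hidden state H and cell state C
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()     # update the cell (long-term memory)
        h = o.sigmoid() * c.tanh()                       # update the hidden state (recent memory)
        return h, c

class TrajectoryPredictor(nn.Module):
    def __init__(self, in_ch=1, hid_ch=16, grid=16):
        super().__init__()
        self.hid_ch, self.grid = hid_ch, grid
        self.layers = nn.ModuleList([
            ConvLSTMCell(in_ch, hid_ch),    # layer 1: feature information of the target
            ConvLSTMCell(hid_ch, hid_ch),   # layer 2: position change between consecutive frames
            ConvLSTMCell(hid_ch, hid_ch),   # layer 3: feeds the predicted-position output
        ])
        self.head = nn.Linear(hid_ch * grid * grid, 4)   # -> (x, y, w, h)

    def forward(self, seq):                              # seq: (B, T, in_ch, grid, grid)
        b = seq.size(0)
        states = [(seq.new_zeros(b, self.hid_ch, self.grid, self.grid),
                   seq.new_zeros(b, self.hid_ch, self.grid, self.grid))
                  for _ in self.layers]
        for t in range(seq.size(1)):
            x = seq[:, t]
            for idx, cell in enumerate(self.layers):
                states[idx] = cell(x, states[idx])
                x = states[idx][0]
        return self.head(x.flatten(1))                   # predicted (x, y, w, h) for the next moment

# Example: predict the next-frame box from 5 past position maps (assumed encoding).
next_box = TrajectoryPredictor()(torch.randn(1, 5, 1, 16, 16))   # shape (1, 4)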
The track prediction module predicts different positions of different types of targets, and optionally stores output results in a database for the fusion detection association module to utilize the information.
English appearing in the present embodiment or the drawings is explained below
1) ConvLSTM-Encode: the convolutional long short-term memory encoding layer;
2) ConvLSTM-Position: the convolutional long short-term memory position layer;
3) ConvLSTM-Decode: the convolutional long short-term memory decoding layer;
4) trackID, the same target should have the same trackID in different frames;
5) CNN: a Convolutional Neural Network. The key parameters are the size and the step size of the convolution kernel, the size of the convolution kernel influences the influence range of the convolution kernel in the image, and the step size influences the distance of each movement of the convolution kernel.
Embodiment 3, the embodiment is described with reference to fig. 6, and the multi-target tracking device of the embodiment includes a video input module, a fusion detection association module, a trajectory prediction module, an output module and a storage module; the video input module is connected in turn with the fusion detection association module and the output module; the video input module and the fusion detection association module are connected with the trajectory prediction module; the trajectory prediction module is connected with the storage module; the video input module is used for inputting video information; the fusion detection association module is used for acquiring the target class, the target position information and the data association relationship between targets and for outputting the target position information to the trajectory prediction module; the trajectory prediction module is used for acquiring the motion rule information of different types of targets and outputting it to the storage module and the fusion detection association module; the output module is used for outputting the target tracking result produced by the fusion detection association module; the storage module is used for storing the motion rule information of the different types of targets.
The video input module inputs the video to the fusion detection association module; the fusion detection association module obtains the target class, the target position information and the data association relationship between targets, transmits the target position information to the trajectory prediction module, and transmits the target class and the data association relationship to the output module; the trajectory prediction module obtains the motion rule information of the different types of targets from the received target position information and transmits it to the storage module and to the fusion detection association module; when target tracking in the fusion detection association module is lost, the next video frame can be predicted from the target motion rule information.
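A minimal sketch of this data flow between the modules is given below; every class, method and attribute name is a hypothetical placeholder standing in for the video input, fusion detection association, trajectory prediction, output and storage modules described above.

# Hedged sketch of the device data flow; all module interfaces are hypothetical.
from typing import Iterable

def run_tracking(frames: Iterable, fusion_module, trajectory_module, storage, output):
    motion_rules = None                           # motion-rule information fed back to fusion
    for frame in frames:
        # Fusion detection association: class, position and trackID in one pass,
        # optionally helped by the motion rules from the previous step.
        result = fusion_module.process(frame, hint=motion_rules)
        output.emit(result)                       # tracking result (classes + trackIDs)
        # Trajectory prediction learns per-class motion rules from the positions.
        motion_rules = trajectory_module.predict(result.positions)
        storage.save(motion_rules)                # keep motion-rule information for reuse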
The key technology of the invention is as follows:
1. The invention fuses the detection algorithm and the data association algorithm into one module, reducing repeated calculation; the target position information and the data association information between consecutive frames are obtained with a single calculation.
2. The detection association module learns multi-scale information of the video frames, performs difference feature learning at the different scales and fuses the features across scales on this basis, and finally outputs the result using a multi-task learning method.
3. The trajectory prediction module can learn historical trajectory information to help predict the target trajectory, avoiding target loss caused by occlusion.
4. The invention fuses the detection module and the data association module into the same neural network and, by sharing the same underlying features, reduces the amount of calculation and shortens the running time.
The traditional DeepSORT algorithm runs at 26 FPS (frames per second, i.e. how many frames can be processed per second; the higher the FPS, the faster the algorithm, a standard measure of the execution speed of an algorithm), while the algorithm of the invention runs at 33 FPS.
The computer device of the present invention may be a device comprising a processor, a memory and the like, for example a single-chip microcomputer containing a central processing unit. The processor is used for implementing the steps of the multi-target tracking method described above when executing the computer program stored in the memory.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the device (such as audio data, a phonebook, etc.), and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
Computer-readable storage medium embodiments
The computer-readable storage medium of the present invention may be any form of storage medium readable by the processor of a computer device, including but not limited to non-volatile memory, ferroelectric memory, etc.; a computer program is stored on the computer-readable storage medium, and when the computer program stored in the memory is read and executed by the processor of the computer device, the steps of the multi-target tracking method described above can be implemented.
The computer program comprises computer program code which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier wave signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions; for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals, as required by legislation and patent practice.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as disclosed herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the present invention is intended to be illustrative and not restrictive of the scope of the invention, which is defined by the appended claims.

Claims (4)

1. A multi-target tracking method is characterized by comprising the following steps:
s110, inputting the video into a fusion detection association module, performing down-sampling processing to obtain a feature map, inputting the feature map into a difference calculation network to obtain difference features, wherein the specific method for obtaining the feature map comprises the following steps:
1) Performing 1/4 down-sampling on the video through convolution layer 1 to obtain feature map 1;
2) Performing 1/8 down-sampling on feature map 1 through convolution layer 2 to obtain feature map 2;
3) Performing 1/16 down-sampling on feature map 2 through convolution layer 3 to obtain feature map 3;
s120, calculating a loss function, specifically including the following three loss functions:
1) A target classification loss function;
2) A target location regression loss function;
3) A multi-objective cross entropy loss function;
s120, calculating a loss function, specifically:
1) Target classification loss function L_cls, the formula of which is reproduced as an equation image in the original publication, wherein y_i represents the true class label of a target, x_i represents the model predicted value, and M represents the total number of target classes; c_{y_i} represents the class feature corresponding to the target class label y_i; λ represents a class-feature balance coefficient and takes the value 0.5;
the class features are randomly initialized at the beginning of training and updated at each training iteration, the update formula likewise being given as an equation image, wherein Δc_j represents the difference between the current data and the class feature, M represents the total number of target classes, c_{y_j} represents the class feature corresponding to the target class label y_j, and x_i represents the model predicted value; Δc_j^t represents the difference between the current data and the class feature at the t-th iteration, after which c_j is updated accordingly, while α is used to keep c_j stable and takes the value 0.5;
2) Target position regression loss function, given as an equation image, wherein t_i represents the model target predicted value, t_i* represents the target true value, and i takes the values x, y, w and h; x and y represent the coordinates of the center point of the detection frame, w represents the width of the detection frame and h represents the height of the detection frame; the position and size of the target detection frame can be obtained by regressing x, y, w and h; if the target predicted position output by the trajectory prediction module is added, the target position regression loss function changes accordingly, wherein t'_i represents the position output by the trajectory prediction module and contains x, y, w and h information;
3) Multi-target cross-entropy loss function, given as an equation image, wherein y_i represents the true class label of a target and x_i represents the model predicted value;
s130, acquiring data association relation among the target type, the target position information and the target; inputting target position information into a track prediction module, learning target movement by using convolution operation, outputting predicted position information, forming different types of target motion rule information and transmitting the different types of target motion rule information to a database and a fusion detection association module;
learning target movement by using convolution operation, and outputting predicted position information, wherein the predicted position information specifically comprises characteristic information of a first-layer learning target; a second layer learns position change information of the target between consecutive frames; the third layer outputs the predicted position information;
s140 outputs the multi-target tracking.
2. A multi-target tracking device, for implementing the multi-target tracking method of claim 1, comprising a video input module, a fusion detection association module, a trajectory prediction module, an output module and a storage module; the video input module is sequentially connected with the fusion detection association module and the output module; the video input module and the fusion detection association module are connected with the track prediction module; the track prediction module is connected with the storage module; the video input module is used for inputting video information; the fusion detection association module is used for acquiring the data association relation among the target category, the target position information and the target and outputting the target position information to the track prediction module; the track prediction module is used for acquiring the motion rule information of different types of targets; outputting the motion rule information of the targets of different types to a storage module and a fusion detection association module; the output module is used for outputting the target tracking result output by the fusion detection correlation module; the storage module is used for storing different types of target motion rule information.
3. A computer comprising a memory storing a computer program and a processor, the processor implementing the steps of a multi-target tracking method as claimed in claim 1 when executing the computer program.
4. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a multi-target tracking method according to claim 1.
CN202110922602.2A 2021-08-12 2021-08-12 Multi-target tracking method and device, computer and storage medium Active CN113409361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110922602.2A CN113409361B (en) 2021-08-12 2021-08-12 Multi-target tracking method and device, computer and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110922602.2A CN113409361B (en) 2021-08-12 2021-08-12 Multi-target tracking method and device, computer and storage medium

Publications (2)

Publication Number Publication Date
CN113409361A CN113409361A (en) 2021-09-17
CN113409361B true CN113409361B (en) 2023-04-18

Family

ID=77688703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110922602.2A Active CN113409361B (en) 2021-08-12 2021-08-12 Multi-target tracking method and device, computer and storage medium

Country Status (1)

Country Link
CN (1) CN113409361B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114022509B (en) * 2021-09-24 2024-06-14 北京邮电大学 Target tracking method based on monitoring video of multiple animals and related equipment
CN113993172B (en) * 2021-10-24 2022-10-25 河南大学 Ultra-dense network switching method based on user movement behavior prediction
CN114170271B (en) * 2021-11-18 2024-04-12 安徽清新互联信息科技有限公司 Multi-target tracking method, equipment and storage medium with self-tracking consciousness
CN114419102B (en) * 2022-01-25 2023-06-06 江南大学 Multi-target tracking detection method based on frame difference time sequence motion information
CN116309692B (en) * 2022-09-08 2023-10-20 广东省机场管理集团有限公司工程建设指挥部 Method, device and medium for binding airport security inspection personal packages based on deep learning
CN117541625B (en) * 2024-01-05 2024-03-29 大连理工大学 Video multi-target tracking method based on domain adaptation feature fusion
CN117593340B (en) * 2024-01-18 2024-04-05 东方空间(江苏)航天动力有限公司 Method, device and equipment for determining swing angle of carrier rocket servo mechanism

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827325B (en) * 2019-11-13 2022-08-09 阿波罗智联(北京)科技有限公司 Target tracking method and device, electronic equipment and storage medium
CN111127513B (en) * 2019-12-02 2024-03-15 北京交通大学 Multi-target tracking method
CN112001225B (en) * 2020-07-06 2023-06-23 西安电子科技大学 Online multi-target tracking method, system and application
CN111882580B (en) * 2020-07-17 2023-10-24 元神科技(杭州)有限公司 Video multi-target tracking method and system
CN111898504B (en) * 2020-07-20 2022-07-26 南京邮电大学 Target tracking method and system based on twin circulating neural network

Also Published As

Publication number Publication date
CN113409361A (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN113409361B (en) Multi-target tracking method and device, computer and storage medium
CN110245659B (en) Image salient object segmentation method and device based on foreground and background interrelation
US10671855B2 (en) Video object segmentation by reference-guided mask propagation
CN112651995B (en) Online multi-target tracking method based on multifunctional aggregation and tracking simulation training
CN111062413A (en) Road target detection method and device, electronic equipment and storage medium
CN111382686B (en) Lane line detection method based on semi-supervised generation confrontation network
Patil et al. MsEDNet: Multi-scale deep saliency learning for moving object detection
CN112036381B (en) Visual tracking method, video monitoring method and terminal equipment
CN115546705B (en) Target identification method, terminal device and storage medium
CN114549913A (en) Semantic segmentation method and device, computer equipment and storage medium
CN115880499B (en) Occluded target detection model training method, occluded target detection model training device, medium and device
Liang et al. Cross-scene foreground segmentation with supervised and unsupervised model communication
CN117036397A (en) Multi-target tracking method based on fusion information association and camera motion compensation
CN116129386A (en) Method, system and computer readable medium for detecting a travelable region
CN115661767A (en) Image front vehicle target identification method based on convolutional neural network
CN114360051A (en) Fine-grained behavior identification method based on progressive hierarchical weighted attention network
CN116861262B (en) Perception model training method and device, electronic equipment and storage medium
CN116580063B (en) Target tracking method, target tracking device, electronic equipment and storage medium
CN115100565B (en) Multi-target tracking method based on spatial correlation and optical flow registration
CN114998814B (en) Target video generation method and device, computer equipment and storage medium
CN114972434B (en) Cascade detection and matching end-to-end multi-target tracking system
CN112818743B (en) Image recognition method and device, electronic equipment and computer storage medium
CN114511740A (en) Vehicle image classification method, vehicle track restoration method, device and equipment
CN114444597B (en) Visual tracking method and device based on progressive fusion network
CN116821699B (en) Perception model training method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant