CN113409361A - Multi-target tracking method, device, computer and storage medium - Google Patents

Multi-target tracking method, device, computer and storage medium Download PDF

Info

Publication number
CN113409361A
CN113409361A CN202110922602.2A
Authority
CN
China
Prior art keywords
target
module
information
target tracking
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110922602.2A
Other languages
Chinese (zh)
Other versions
CN113409361B (en)
Inventor
林涛
张炳振
刘宇鸣
邓普阳
张枭勇
陈振武
王宇
周勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Urban Transport Planning Center Co Ltd
Original Assignee
Shenzhen Urban Transport Planning Center Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Urban Transport Planning Center Co Ltd
Priority to CN202110922602.2A priority Critical patent/CN113409361B/en
Publication of CN113409361A publication Critical patent/CN113409361A/en
Application granted granted Critical
Publication of CN113409361B publication Critical patent/CN113409361B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-target tracking method, device, computer and storage medium, belonging to the technical field of artificial intelligence. First, a video is input into a fusion detection association module, down-sampled to obtain feature maps, and the feature maps are fed into a difference calculation network to obtain difference features. Second, a multi-task learning method in deep learning yields the target category, the target position information, and the same trackID for the same target across different video frames. A trajectory prediction module then predicts the likely position of each target in the current frame from its motion trajectory over consecutive frames and provides this as a reference for the fusion detection association module. Finally, the multi-target tracking information is output. The method addresses the problems of low tracking efficiency, easily lost targets and frequently changing target IDs in the prior art, improving the efficiency of multi-target tracking and avoiding tracking loss.

Description

Multi-target tracking method, device, computer and storage medium
Technical Field
The application relates to a target tracking method, in particular to a multi-target tracking method, a multi-target tracking device, a computer and a storage medium, and belongs to the technical field of artificial intelligence.
Background
Multi-target tracking simultaneously tracks a plurality of targets in a video. In typical application scenarios such as security surveillance and autonomous driving, the number of people and vehicles is not known in advance and the appearance of each target varies, yet tracking these targets is the basis of other applications (such as target localization and target density calculation). Unlike single-target tracking, multi-target tracking must assign a unique ID to each target and ensure that no target is lost during tracking. The appearance of new targets and the disappearance of old targets are also problems that multi-target tracking must handle.
Multi-target tracking has been studied extensively. The dominant strategy is DBT (detection-based tracking), in which the detection module and the data association module are independent: a video sequence first passes through a detection algorithm to obtain target position information, and a data association algorithm is then executed to obtain the final trajectory result.
A representative algorithm is DeepSORT, a data association algorithm in MOT (multi-object tracking) that can be combined with any detector to realize multi-target tracking. DeepSORT combines Kalman filtering with the Hungarian algorithm: a Kalman filter predicts the state of each detection box in the next frame, and the prediction is matched with the detection results of that frame. During matching, the Hungarian algorithm is applied to a cost matrix computed by fusing the motion features obtained from Kalman filtering with appearance features extracted by a CNN (convolutional neural network).
MOT is mainly applied in scenarios such as security surveillance and autonomous driving, which place high demands on real-time performance. For a given hardware level, the detection efficiency and accuracy of MOT should be improved as much as possible. In the prior art, MOT suffers from low efficiency in practical applications: existing real-time MOT methods usually focus only on the data association step, which is only one part of the MOT pipeline, and therefore cannot truly solve the efficiency problem.
In addition, targets frequently occlude one another in real scenes, which causes problems such as target loss and target ID switching in MOT.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to determine the key or critical elements of the present invention, nor is it intended to limit the scope of the present invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
In view of this, in order to solve the technical problems of low target tracking efficiency, easy target loss and easy target ID change in the prior art, the present invention provides a multi-target tracking method, apparatus, computer and storage medium.
The fusion detection association module outputs the position and category information of the different targets. The trajectory prediction module takes this information as input to learn the trajectory patterns of different target categories, which improves tracking efficiency and avoids tracking loss.
A multi-target tracking method comprises the following steps:
s110, inputting the video into a fusion detection correlation module, performing down-sampling processing to obtain a feature map, and inputting the feature map into a difference calculation network to obtain difference features;
s120, calculating a loss function;
s130, acquiring data association relation among the target type, the target position information and the target; inputting target position information into a track prediction module, learning target movement by using convolution operation, outputting predicted position information, forming different types of target motion rule information and transmitting the different types of target motion rule information to a database and a fusion detection association module;
s140 outputs multi-target tracking.
Preferably, the specific method for obtaining the feature map in step S110 is:
1) carrying out 1/4 down-sampling on the video frame through convolutional layer 1 to obtain feature map 1;
2) carrying out 1/8 down-sampling on feature map 1 through convolutional layer 2 to obtain feature map 2;
3) carrying out 1/16 down-sampling on feature map 2 through convolutional layer 3 to obtain feature map 3.
Preferably, the calculating the loss function in step S120 specifically includes the following three loss functions:
1) a target classification loss function;
2) a target location regression loss function;
3) multi-objective cross entropy loss function.
Preferably, the calculation methods of the three loss functions in step S120 are specifically:
1) Target classification loss function $L_{cls}$: it is computed from the true category label $y$ of the target, the model prediction $\hat{y}$, the total number of target categories $N$, and the category feature $c_y$ associated with label $y$; a class-feature balance coefficient $\alpha$, which balances the influence of the category feature on the whole loss function, takes the value 0.5. The category features are randomly initialized at the beginning of training and then updated at each training iteration; in the update formula, $\Delta_t$ denotes the difference between the current data and the category features at the $t$-th iteration and is used to correct the stored category feature into the updated value $c_y'$, while a momentum coefficient $\beta$, also taking the value 0.5, guarantees the stability of the category features.
2) Target position regression loss function $L_{reg}$: here $\hat{p}$ denotes the predicted value of the model and $p$ the true value, where $p$ ranges over $(x, y, w, h)$; $(x, y)$ are the coordinates of the center point of the detection box, $w$ is the width of the detection box and $h$ is its height, so the position and size of the target detection box are obtained through regression. If the target position predicted by the trajectory prediction module is added, the target position regression loss additionally contains a term for $p_{traj}$, the position output by the trajectory prediction module, which likewise contains $(x, y, w, h)$ information.
3) Multi-target cross-entropy loss function: the cross entropy between the true category label $y$ and the model prediction $\hat{y}$,
$L_{id} = -\sum_{i} y_i \log \hat{y}_i$.
The fusion detection association module aims to generate the target category, the target position information and the trackID information of targets across different video frames, so the above loss functions are weighted and summed into the total loss function of the module:
$L = \lambda_1 L_{cls} + \lambda_2 L_{reg} + \lambda_3 L_{id}$,
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are multi-task weight parameters that can be set according to the requirements of the different tasks.
Preferably, in step S130, the target movement law is learned through a three-layer ConvLSTM network and the predicted position information is output: the first layer learns the feature information of the target; the second layer learns the position change of the target between consecutive frames; the third layer outputs the predicted position information.
A multi-target tracking device comprises a video input module, a fusion detection association module, a trajectory prediction module, an output module and a storage module; the video input module is connected in sequence with the fusion detection association module and the output module; the video input module and the fusion detection association module are connected with the trajectory prediction module; the trajectory prediction module is connected with the storage module. The video input module is used for inputting video information; the fusion detection association module is used for acquiring the data association relation among the target category, the target position information and the targets, and for outputting the target position information to the trajectory prediction module; the trajectory prediction module is used for acquiring the motion rule information of different target categories and outputting it to the storage module and the fusion detection association module; the output module is used for outputting the target tracking result produced by the fusion detection association module; the storage module is used for storing the motion rule information of the different target categories.
A computer comprising a memory storing a computer program and a processor implementing the steps of a multi-target tracking method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements a multi-target tracking method.
The invention has the following beneficial effects. The scheme fuses the detection algorithm and the data association algorithm into one module, which reduces repeated calculation. The trajectory prediction module handles the matching of difficult targets well: the trackIDs generated from the data association relation among target category, target position information and targets are more stable, the accuracy of identifying the same target in consecutive frames is improved, and frequent trackID switching is avoided. The method solves the problems of low computational efficiency and poor real-time performance in existing multi-target tracking technology, while remaining robust to target loss caused by occlusion.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart of a method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a fusion detection association module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a difference computing network according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a trajectory prediction module according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a ConvLSTM model according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a multi-target tracking device according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions and advantages of the embodiments of the present application clearer, exemplary embodiments of the present application are described in further detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, not an exhaustive list of all embodiments. It should be noted that the embodiments and the features of the embodiments in the present application may be combined with each other without conflict.
Embodiment 1, this embodiment is described with reference to fig. 1 to 3, and a multi-target tracking method includes the following steps:
s110, inputting the video into a fusion detection correlation module, performing down-sampling processing to obtain a feature map, and inputting the feature map into a difference calculation network to obtain difference features;
First, the video is input to the fusion detection association module, which obtains the target positions and the data association information in a single pass; the model structure of the fusion detection association module is shown in fig. 2.
The feature maps are obtained by down-sampling as follows. Assume the size of the input video frame is 1280 × 720 (i.e. 1280 pixels by 720 pixels); the image is first resized to 896 × 896 to simplify subsequent processing. The down-sampling process is then:
(1) the input image is down-sampled to 1/4 of the original resolution through convolution layer 1 (convolution kernel size 8 × 8, stride 8) to obtain feature map 1, of size 224 × 224;
(2) feature map 1 is down-sampled to 1/8 through convolution layer 2 (convolution kernel size 2 × 2, stride 2) to obtain feature map 2, of size 112 × 112;
(3) feature map 2 is down-sampled to 1/16 through convolution layer 3 (convolution kernel size 2 × 2, stride 2) to obtain feature map 3, of size 56 × 56, as sketched below.
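A minimal PyTorch sketch of such a down-sampling backbone is given here; it is an illustration rather than the patent's exact network. The channel counts are arbitrary assumptions, and the first convolution is given stride 4 in this sketch so that an 896 × 896 input actually yields the 224 × 224, 112 × 112 and 56 × 56 feature maps listed above.

```python
import torch
import torch.nn as nn

class DownsampleBackbone(nn.Module):
    """Produces three feature maps at 1/4, 1/8 and 1/16 of the input resolution."""
    def __init__(self, in_ch=3, ch=(32, 64, 128)):
        super().__init__()
        # stride 4 assumed here so that 896 -> 224
        self.conv1 = nn.Conv2d(in_ch, ch[0], kernel_size=8, stride=4, padding=2)
        self.conv2 = nn.Conv2d(ch[0], ch[1], kernel_size=2, stride=2)
        self.conv3 = nn.Conv2d(ch[1], ch[2], kernel_size=2, stride=2)

    def forward(self, x):
        f1 = self.conv1(x)   # 896x896 -> 224x224 (feature map 1)
        f2 = self.conv2(f1)  # 224x224 -> 112x112 (feature map 2)
        f3 = self.conv3(f2)  # 112x112 -> 56x56   (feature map 3)
        return f1, f2, f3

if __name__ == "__main__":
    frame = torch.randn(1, 3, 896, 896)
    f1, f2, f3 = DownsampleBackbone()(frame)
    print(f1.shape, f2.shape, f3.shape)
```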
At this point, three feature maps of different sizes have been obtained from the image through down-sampling. Every frame entering the fusion detection association module goes through this down-sampling calculation, and the six feature maps of the previous and current frames are passed together into the difference calculation network. The purpose is to compute and fuse difference features at different scales, and finally to use a multi-task learning method to simultaneously predict the target category, the target position information and the data association relation among targets.
The difference calculation network mainly consists of two structures, DenseBlock and Transition. A DenseBlock is composed of a BN layer + ReLU layer + 3 × 3 convolution layer, and its input and output feature maps have the same size. A Transition is composed of a BN layer + ReLU layer + 1 × 1 convolution layer + 2 × 2 average pooling layer, so the feature-map size is halved after each Transition. In actual computation, the six feature maps of the previous and current frames are input into the difference calculation network together. Similar to a Siamese (twin) network, the difference calculation network has two paths, corresponding to the three feature maps of the previous frame and the three feature maps of the current frame respectively. The two paths are identical in structure but do not share weights.
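A small PyTorch sketch of these two building blocks as just described (BN + ReLU + 3 × 3 convolution for DenseBlock, BN + ReLU + 1 × 1 convolution + 2 × 2 average pooling for Transition); the channel counts are assumptions for illustration.

```python
import torch.nn as nn

class DenseBlock(nn.Module):
    """BN + ReLU + 3x3 conv; output spatial size equals input spatial size."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.block(x)

class Transition(nn.Module):
    """BN + ReLU + 1x1 conv + 2x2 average pooling; halves the feature-map size."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.block(x)
```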
(1) In each path, feature map 1 of size 224 × 224 is first passed through Transition1, which reduces it to 112 × 112, and then through the DenseBlock1 network to learn features, giving a 112 × 112 feature;
(2) the features obtained in the previous step are fused (added) with feature map 2 and then passed through the Transition2 and DenseBlock2 networks to obtain 56 × 56 features;
(3) similarly, the features from the previous step are fused (added) with feature map 3 and passed into the DenseBlock3 network for further feature learning;
(4) the previous frame and the current frame each yield a 56 × 56 feature map, and the difference between the two gives a difference feature of size 56 × 56; a sketch of this two-path computation follows.
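The sketch reuses the DenseBlock and Transition classes from the previous listing and the channel counts of the backbone sketch; those counts and the element-wise subtraction used for the final difference are assumptions made for illustration.

```python
import torch.nn as nn

class DifferencePath(nn.Module):
    """One path of the difference network; the two paths share structure but not weights."""
    def __init__(self, ch=(32, 64, 128)):
        super().__init__()
        self.trans1 = Transition(ch[0], ch[1])  # 224 -> 112
        self.dense1 = DenseBlock(ch[1])
        self.trans2 = Transition(ch[1], ch[2])  # 112 -> 56
        self.dense2 = DenseBlock(ch[2])
        self.dense3 = DenseBlock(ch[2])

    def forward(self, f1, f2, f3):
        x = self.dense1(self.trans1(f1))      # (1) 224x224 -> 112x112
        x = self.dense2(self.trans2(x + f2))  # (2) fuse with feature map 2 -> 56x56
        x = self.dense3(x + f3)               # (3) fuse with feature map 3
        return x                              # 56x56 path output

class DifferenceNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.prev_path = DifferencePath()  # previous-frame path
        self.curr_path = DifferencePath()  # current-frame path (separate weights)

    def forward(self, prev_maps, curr_maps):
        p = self.prev_path(*prev_maps)
        c = self.curr_path(*curr_maps)
        return c - p                       # (4) 56x56 difference feature
```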
S120, calculating the loss function. Since the network obtains the target category, the target position information and the target data association relationship (i.e. the trackID information in the tracking process) in a single pass, a corresponding loss function needs to be calculated.
The calculation of the loss function specifically includes the following three loss functions:
1) a target classification loss function;
2) a target location regression loss function;
3) multi-objective cross entropy loss function.
The target classification loss function $L_{cls}$ is calculated as follows: $y$ denotes the true category label of the target, $\hat{y}$ the probability that the model predicts for a positive sample, $N$ the total number of target categories, and $c_y$ the category feature associated with label $y$; a class-feature balance coefficient $\alpha$, which balances the influence of the category feature on the whole loss function, takes the value 0.5. The category features are randomly initialized at the beginning of training and then updated at each training iteration. In the update formula, $\Delta_t$ denotes the difference between the current data and the category features at the $t$-th iteration; this difference is used to correct the stored category feature, giving the updated value $c_y'$, while a momentum coefficient $\beta$, also taking the value 0.5, guarantees the stability of the category features across iterations.
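The exact classification-loss equations appear only as images in the original filing. The sketch below therefore assumes a center-loss-style formulation that matches the description: cross entropy plus a class-feature term weighted by α = 0.5, with per-class features updated each iteration using a momentum β = 0.5.

```python
import torch
import torch.nn.functional as F

class ClassificationLossWithCenters:
    """Cross entropy plus a class-feature term; an assumed formulation, not the patent's exact one."""
    def __init__(self, num_classes, feat_dim, alpha=0.5, beta=0.5):
        self.centers = torch.randn(num_classes, feat_dim)  # random init at start of training
        self.alpha = alpha  # class-feature balance coefficient
        self.beta = beta    # momentum keeping the class features stable

    def __call__(self, logits, feats, labels):
        ce = F.cross_entropy(logits, labels)
        # distance of each sample's feature to the stored feature of its class
        center_term = ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()
        return ce + self.alpha * center_term

    @torch.no_grad()
    def update_centers(self, feats, labels):
        # delta_t: difference between the current data and the stored class features
        for c in labels.unique():
            delta = feats[labels == c].mean(dim=0) - self.centers[c]
            self.centers[c] = self.centers[c] + self.beta * delta
```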
The target position regression loss function $L_{reg}$ compares the model's predicted value $\hat{p}$ with the true value $p$, where $p$ ranges over $(x, y, w, h)$: $(x, y)$ are the coordinates of the center point of the detection box, $w$ is the width of the detection box and $h$ is its height, so the position and size of the target detection box are obtained through regression. If the target position predicted by the trajectory prediction module is added, the target position regression loss additionally contains a term for $p_{traj}$, the position output by the trajectory prediction module, which likewise contains $(x, y, w, h)$ information.
The multi-target cross-entropy loss function is the cross entropy between the true category label $y$ and the model prediction $\hat{y}$:
$L_{id} = -\sum_{i} y_i \log \hat{y}_i$.
The fusion detection association module aims to generate the target category, the target position information and the trackID information of targets across different video frames, so these loss functions are weighted and summed into the total loss function of the module:
$L = \lambda_1 L_{cls} + \lambda_2 L_{reg} + \lambda_3 L_{id}$,
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are multi-task weight parameters that can be set according to the requirements of the different tasks.
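As a rough illustration of how the three terms might be combined during training, the following sketch uses smooth-L1 for the regression term and equal default weights; both choices are assumptions, since the exact forms and weights are left open.

```python
import torch.nn.functional as F

def fusion_module_loss(cls_logits, labels,
                       box_pred, box_gt, box_traj,
                       id_logits, track_ids,
                       w=(1.0, 1.0, 1.0)):
    """Weighted sum of classification, position-regression and trackID losses."""
    l_cls = F.cross_entropy(cls_logits, labels)
    # regression over (x, y, w, h); the trajectory-predicted box adds a second term
    l_reg = F.smooth_l1_loss(box_pred, box_gt) + F.smooth_l1_loss(box_pred, box_traj)
    l_id = F.cross_entropy(id_logits, track_ids)  # multi-target cross entropy over trackIDs
    return w[0] * l_cls + w[1] * l_reg + w[2] * l_id
```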
S130, acquiring the data association relation among the target category, the target position information and the targets; inputting the target position information into the trajectory prediction module, learning the target movement with convolution operations, outputting predicted position information, and forming motion rule information for the different target categories, which is transmitted to the database and the fusion detection association module;
the data association target is to obtain target trackID information in front and rear video frames, and if a red vehicle appears in the previous frame and the red vehicle also appears in the current frame, the two vehicles can be judged to be the same trackID through data association. In order to find that the same object has the same trackID in different frames, a model should judge that the same object is closer to the space than different objects, and a common method in the prior art, namely a triplet loss function, is used in the MOT.
The specific implementation is as follows. The difference features are followed by a fully connected layer whose number of nodes N means that at most N different trackIDs can be represented (N is a hyper-parameter that can be adjusted for the scene; a typical value is N = 20000). Whenever an object is detected, it is classified: if the target has appeared before, it is classified to its existing trackID; otherwise it is a new target with classification label -1, a new trackID is allocated, and the parameters of the fully connected layer are updated so that the object can be recognized in subsequent classifications. Meanwhile, during the model parameter update, trackIDs that have not been detected for a long time are forgotten, ensuring that the total number of trackIDs recorded by the model does not exceed N.
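A rough sketch of such a trackID head follows. The description above specifies only the fully connected layer of size N and the label -1 for new targets; the confidence threshold, the allocation of fresh IDs and the forgetting bookkeeping in this sketch are simplified assumptions.

```python
import torch
import torch.nn as nn

class TrackIDHead(nn.Module):
    """Classifies difference features into at most N trackIDs."""
    def __init__(self, feat_dim, n_ids=20000):
        super().__init__()
        self.fc = nn.Linear(feat_dim, n_ids)  # one node per possible trackID
        self.last_seen = {}                   # trackID -> frame index it was last detected in

    def forward(self, diff_feat):
        return self.fc(diff_feat)             # logits over existing trackIDs

    def assign(self, diff_feat, frame_idx, score_thresh=0.5):
        prob, tid = self.forward(diff_feat).softmax(-1).max(-1)
        if prob.item() < score_thresh:
            return -1                          # new target: label -1, a fresh ID is allocated later
        self.last_seen[int(tid)] = frame_idx   # record when this trackID was last seen
        return int(tid)
```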
S140, outputting the multi-target tracking information.
Embodiment 2 is described with reference to fig. 4. The multi-target tracking method further includes a trajectory prediction module, which can learn the historical trajectory information of targets of different categories; its model structure is shown in fig. 4. The LSTM is a classical network structure for processing time-series data, while ConvLSTM is a network structure formed by combining the LSTM structure with convolution; its model structure is shown in fig. 5, wherein
$X_{t-1}$ and $X_t$ denote the inputs at times $t-1$ and $t$, and $H_{t-1}$, $H_t$ and $H_{t+1}$ denote the outputs at times $t-1$, $t$ and $t+1$. This structure not only establishes the temporal relation between frames but also exploits the characteristics of convolution to describe the local spatial features of the image.
S210, inputting the target position information into the trajectory prediction module and calculating the output variables C and H of the LSTM. The model is fed a sequence of successive image frames, e.g. X_t and X_{t+1}; for two consecutive frame inputs, C (the cell state) and H (the hidden state) are calculated. C and H are the output variables of the LSTM.
Here C represents the cell unit of the LSTM, which stores the medium- and long-term memory of the time-series information; H represents the hidden unit, which stores the recent memory of the time-series information.
S220, estimating C and H at the target time from the C and H of past times by means of convolution operations;
s230, learning a target movement rule through a three-layer ConvLSTM network, outputting predicted position information, and forming different types of target movement rule information;
wherein the first layer learns the feature information of the target; the second layer learns the position change of the target between consecutive frames; the third layer outputs the predicted position information.
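ConvLSTM is not part of the core PyTorch library, so the following is an assumed, compact implementation of a single ConvLSTM cell together with the three-layer stack described above; the hidden width and kernel size are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: an LSTM whose gates are computed by convolution."""
    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        self.hidden_ch = hidden_ch
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)  # cell state C: medium/long-term memory
        h = torch.sigmoid(o) * torch.tanh(c)                         # hidden state H: recent memory
        return h, c

class TrajectoryPredictor(nn.Module):
    """Three stacked ConvLSTM layers: target features, position change, predicted position."""
    def __init__(self, in_ch=1, hid=16):
        super().__init__()
        self.layers = nn.ModuleList([
            ConvLSTMCell(in_ch, hid),  # layer 1: feature information of the target
            ConvLSTMCell(hid, hid),    # layer 2: position change between consecutive frames
            ConvLSTMCell(hid, in_ch),  # layer 3: predicted position information
        ])

    def forward(self, seq):
        # seq: (T, B, C, H, W) sequence of target-position maps
        T, B, _, Hh, Ww = seq.shape
        states = [(torch.zeros(B, l.hidden_ch, Hh, Ww, device=seq.device),
                   torch.zeros(B, l.hidden_ch, Hh, Ww, device=seq.device))
                  for l in self.layers]
        out = None
        for t in range(T):
            x = seq[t]
            for k, layer in enumerate(self.layers):
                h, c = layer(x, states[k])
                states[k] = (h, c)
                x = h
            out = x
        return out  # map from which the next-frame position can be decoded
```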
S240, transmitting the motion rule information of the different target categories to the database and to the fusion detection association module respectively. When a target is occluded and the fusion detection association module cannot recognize it in the current frame, the motion trajectory in the next frame can still be predicted from the motion rule information of the different target categories obtained by training the trajectory prediction model.
In a traffic monitoring scene the camera viewpoint is generally fixed, so vehicle trajectories in the footage from one camera share a certain similarity. This regularity can be learned automatically by a dedicated neural network structure. The trajectory prediction module can also store the learned motion rule information in a database for the long term, to be retrieved whenever the fusion detection association module needs it.
After the target movement law has been learned in training, a frame of image and the current target position information are input, and the trajectory prediction model outputs the target position at the next moment, comprising x, y, w and h. The predicted target position can be added into the target position loss function of the fusion detection association module, improving the accuracy of position recognition.
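For example, the regression term could be extended with the trajectory-predicted box roughly as follows; the smooth-L1 form and the 0.5 weight on the trajectory term are assumptions.

```python
import torch.nn.functional as F

def position_loss(box_pred, box_gt, box_traj, traj_weight=0.5):
    """Regression over (x, y, w, h) with an extra term toward the trajectory-predicted box."""
    loss = F.smooth_l1_loss(box_pred, box_gt)
    if box_traj is not None:  # trajectory prediction available for this target
        loss = loss + traj_weight * F.smooth_l1_loss(box_pred, box_traj)
    return loss
```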
The trajectory prediction module predicts positions separately for different target categories and can optionally store the output results in a database for the fusion detection association module to use.
The English terms appearing in this embodiment or in the drawings are explained as follows:
1) ConvLSTM-Encode: convolutional long short-term memory encoding layer;
2) ConvLSTM-Position: convolutional long short-term memory position layer;
3) ConvLSTM-Decode: convolutional long short-term memory decoding layer;
4) trackID: the same target should have the same trackID in different frames;
5) CNN: convolutional neural network. Its key parameters are the convolution kernel size and the stride; the kernel size determines the receptive field of the kernel in the image, and the stride determines how far the kernel moves at each step.
Embodiment 3 is described with reference to fig. 6. The multi-target tracking device of this embodiment includes a video input module, a fusion detection association module, a trajectory prediction module, an output module and a storage module; the video input module is connected in sequence with the fusion detection association module and the output module; the video input module and the fusion detection association module are connected with the trajectory prediction module; the trajectory prediction module is connected with the storage module. The video input module is used for inputting video information; the fusion detection association module is used for acquiring the data association relation among the target category, the target position information and the targets, and for outputting the target position information to the trajectory prediction module; the trajectory prediction module is used for acquiring the motion rule information of different target categories and outputting it to the storage module and the fusion detection association module; the output module is used for outputting the target tracking result produced by the fusion detection association module; the storage module is used for storing the motion rule information of the different target categories.
The video input module feeds the video to the fusion detection association module; the fusion detection association module obtains the data association relation among the target category, the target position information and the targets, transmits the target position information to the trajectory prediction module, and transmits the target category and data association relation to the output module. The trajectory prediction module derives motion rule information for the different target categories from the received target position information and transmits it to both the storage module and the fusion detection association module. When the fusion detection association module loses track of a target, its position in the next video frame can be predicted from the target motion rule information.
The key technology of the invention is as follows:
1. The invention fuses the detection algorithm and the data association algorithm into one module, reducing repeated calculation: a single computation yields both the target position information and the data association information between continuous frames.
2. The fusion detection association module learns multi-scale information of the video frames, performs difference feature learning at different scales, and fuses features across scales on that basis; the final result is output using a multi-task learning method.
3. The trajectory prediction module can learn historical trajectory information and help predict target trajectories, avoiding target loss caused by occlusion.
4. The invention fuses the detection module and the data association module into the same neural network and shares the same low-level features, reducing the amount of computation and shortening the running time: the traditional DeepSORT algorithm runs at 26 FPS (FPS is the number of frames that can be processed per second; higher FPS means a faster algorithm and is a standard measure of execution speed), while the proposed algorithm runs at 33 FPS.
The computer device of the present invention may be a device including a processor, a memory and the like, for example a single-chip microcomputer including a central processing unit. The processor is used for implementing the steps of the above multi-target tracking method when executing the computer program stored in the memory.
The Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
Computer-readable storage medium embodiments
The computer-readable storage medium of the present invention may be any form of storage medium readable by the processor of a computer device, including but not limited to non-volatile memory, ferroelectric memory, and the like. A computer program is stored on the computer-readable storage medium; when the stored computer program is read and executed by the processor of the computer device, the above-mentioned steps of the multi-target tracking method can be implemented.
The computer program comprises computer program code which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (8)

1. A multi-target tracking method is characterized by comprising the following steps:
s110, inputting the video into a fusion detection correlation module, performing down-sampling processing to obtain a feature map, and inputting the feature map into a difference calculation network to obtain difference features;
s120, calculating a loss function;
s130, acquiring data association relation among the target type, the target position information and the target; inputting target position information into a track prediction module, learning target movement by using convolution operation, outputting predicted position information, forming different types of target motion rule information and transmitting the different types of target motion rule information to a database and a fusion detection association module;
s140 outputs multi-target tracking.
2. The method according to claim 1, wherein the specific method for obtaining the feature map in step S110 is:
1) 1/4 downsampling the video through the convolutional layer 1 to obtain a characteristic diagram 1;
2) carrying out 1/8 downsampling on the characteristic diagram 1 through the convolutional layer 2 to obtain a characteristic diagram 2;
3) the characteristic diagram 2 is subjected to 1/16 downsampling by the convolutional layer 3 to obtain the characteristic diagram 3.
3. The method according to claim 2, wherein the calculating the loss function at step S120 specifically includes the following three loss functions:
1) a target classification loss function;
2) a target location regression loss function;
3) multi-objective cross entropy loss function.
4. The method according to claim 3, wherein the three loss functions of step S120 are calculated by:
1) the target classification loss function $L_{cls}$ is computed from the true category label $y$ of the target, the model prediction $\hat{y}$, the total number of target categories $N$, the category feature $c_y$ of label $y$, and a class-feature balance coefficient $\alpha$ whose value is 0.5; the category features are randomly initialized at the beginning of training and updated at each training iteration, wherein $\Delta_t$ denotes the difference between the current data and the category features at the $t$-th iteration and is used to correct the stored category feature into the updated value $c_y'$, and a momentum coefficient $\beta$ whose value is 0.5 guarantees the stability of the category features;
2) the target position regression loss function compares the model's predicted value $\hat{p}$ with the true value $p$, where $p$ ranges over $(x, y, w, h)$, $(x, y)$ being the coordinates of the center point of the detection box, $w$ the width of the detection box and $h$ its height, so that the position and size of the target detection box are obtained through regression; if the target position predicted by the trajectory prediction module is added, the target position regression loss function additionally contains a term for $p_{traj}$, the position output by the trajectory prediction module, which likewise contains $(x, y, w, h)$ information;
3) the multi-target cross-entropy loss function is the cross entropy between the true category label $y$ and the model prediction $\hat{y}$: $L_{id} = -\sum_{i} y_i \log \hat{y}_i$.
5. The method according to claim 4, wherein the learning of the target movement by convolution operation in S130 outputs predicted position information, wherein the first layer learns feature information of the target; the second layer learns position change information of the target between consecutive frames; and the third layer outputs the predicted position information.
6. A multi-target tracking device is characterized by comprising a video input module, a fusion detection association module, a track prediction module, an output module and a storage module; the video input module is sequentially connected with the fusion detection association module and the output module; the video input module and the fusion detection association module are connected with the track prediction module; the track prediction module is connected with the storage module; the video input module is used for inputting video information; the fusion detection association module is used for acquiring the data association relation among the target category, the target position information and the target and outputting the target position information to the track prediction module; the track prediction module is used for acquiring the motion rule information of different types of targets; outputting the motion rule information of the targets of different types to a storage module and a fusion detection association module; the output module is used for outputting the target tracking result output by the fusion detection correlation module; the storage module is used for storing the motion rule information of different types of targets.
7. A computer comprising a memory storing a computer program and a processor, the processor implementing the steps of a multi-target tracking method as claimed in any one of claims 1 to 5 when executing the computer program.
8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a multi-target tracking method according to any one of claims 1 to 5.
CN202110922602.2A 2021-08-12 2021-08-12 Multi-target tracking method and device, computer and storage medium Active CN113409361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110922602.2A CN113409361B (en) 2021-08-12 2021-08-12 Multi-target tracking method and device, computer and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110922602.2A CN113409361B (en) 2021-08-12 2021-08-12 Multi-target tracking method and device, computer and storage medium

Publications (2)

Publication Number Publication Date
CN113409361A true CN113409361A (en) 2021-09-17
CN113409361B CN113409361B (en) 2023-04-18

Family

ID=77688703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110922602.2A Active CN113409361B (en) 2021-08-12 2021-08-12 Multi-target tracking method and device, computer and storage medium

Country Status (1)

Country Link
CN (1) CN113409361B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113993172A (en) * 2021-10-24 2022-01-28 河南大学 Ultra-dense network switching method based on user movement behavior prediction
CN114022509A (en) * 2021-09-24 2022-02-08 北京邮电大学 Target tracking method based on monitoring videos of multiple animals and related equipment
CN114170271A (en) * 2021-11-18 2022-03-11 安徽清新互联信息科技有限公司 Multi-target tracking method with self-tracking consciousness, equipment and storage medium
CN114419102A (en) * 2022-01-25 2022-04-29 江南大学 Multi-target tracking detection method based on frame difference time sequence motion information
CN116309692A (en) * 2022-09-08 2023-06-23 广东省机场管理集团有限公司工程建设指挥部 Method, device and medium for binding airport security inspection personal packages based on deep learning
CN117541625A (en) * 2024-01-05 2024-02-09 大连理工大学 Video multi-target tracking method based on domain adaptation feature fusion
CN117593340A (en) * 2024-01-18 2024-02-23 东方空间(江苏)航天动力有限公司 Method, device and equipment for determining swing angle of carrier rocket servo mechanism

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111127513A (en) * 2019-12-02 2020-05-08 北京交通大学 Multi-target tracking method
CN111882580A (en) * 2020-07-17 2020-11-03 元神科技(杭州)有限公司 Video multi-target tracking method and system
CN111898504A (en) * 2020-07-20 2020-11-06 南京邮电大学 Target tracking method and system based on twin circulating neural network
CN112001225A (en) * 2020-07-06 2020-11-27 西安电子科技大学 Online multi-target tracking method, system and application
US20210142489A1 (en) * 2019-11-13 2021-05-13 Beijing Baidu Netcom Science And Technology Co., Ltd. Target tracking method, device, electronic apparatus and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210142489A1 (en) * 2019-11-13 2021-05-13 Beijing Baidu Netcom Science And Technology Co., Ltd. Target tracking method, device, electronic apparatus and storage medium
CN111127513A (en) * 2019-12-02 2020-05-08 北京交通大学 Multi-target tracking method
CN112001225A (en) * 2020-07-06 2020-11-27 西安电子科技大学 Online multi-target tracking method, system and application
CN111882580A (en) * 2020-07-17 2020-11-03 元神科技(杭州)有限公司 Video multi-target tracking method and system
CN111898504A (en) * 2020-07-20 2020-11-06 南京邮电大学 Target tracking method and system based on twin circulating neural network

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114022509A (en) * 2021-09-24 2022-02-08 北京邮电大学 Target tracking method based on monitoring videos of multiple animals and related equipment
CN113993172B (en) * 2021-10-24 2022-10-25 河南大学 Ultra-dense network switching method based on user movement behavior prediction
CN113993172A (en) * 2021-10-24 2022-01-28 河南大学 Ultra-dense network switching method based on user movement behavior prediction
CN114170271B (en) * 2021-11-18 2024-04-12 安徽清新互联信息科技有限公司 Multi-target tracking method, equipment and storage medium with self-tracking consciousness
CN114170271A (en) * 2021-11-18 2022-03-11 安徽清新互联信息科技有限公司 Multi-target tracking method with self-tracking consciousness, equipment and storage medium
CN114419102A (en) * 2022-01-25 2022-04-29 江南大学 Multi-target tracking detection method based on frame difference time sequence motion information
CN114419102B (en) * 2022-01-25 2023-06-06 江南大学 Multi-target tracking detection method based on frame difference time sequence motion information
CN116309692A (en) * 2022-09-08 2023-06-23 广东省机场管理集团有限公司工程建设指挥部 Method, device and medium for binding airport security inspection personal packages based on deep learning
CN116309692B (en) * 2022-09-08 2023-10-20 广东省机场管理集团有限公司工程建设指挥部 Method, device and medium for binding airport security inspection personal packages based on deep learning
CN117541625A (en) * 2024-01-05 2024-02-09 大连理工大学 Video multi-target tracking method based on domain adaptation feature fusion
CN117541625B (en) * 2024-01-05 2024-03-29 大连理工大学 Video multi-target tracking method based on domain adaptation feature fusion
CN117593340A (en) * 2024-01-18 2024-02-23 东方空间(江苏)航天动力有限公司 Method, device and equipment for determining swing angle of carrier rocket servo mechanism
CN117593340B (en) * 2024-01-18 2024-04-05 东方空间(江苏)航天动力有限公司 Method, device and equipment for determining swing angle of carrier rocket servo mechanism

Also Published As

Publication number Publication date
CN113409361B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN113409361B (en) Multi-target tracking method and device, computer and storage medium
CN110245659B (en) Image salient object segmentation method and device based on foreground and background interrelation
CN111539370B (en) Image pedestrian re-identification method and system based on multi-attention joint learning
CN110415277B (en) Multi-target tracking method, system and device based on optical flow and Kalman filtering
CN112132119B (en) Passenger flow statistical method and device, electronic equipment and storage medium
CN111062413A (en) Road target detection method and device, electronic equipment and storage medium
CN110781262B (en) Semantic map construction method based on visual SLAM
CN112651995B (en) Online multi-target tracking method based on multifunctional aggregation and tracking simulation training
CN111382686B (en) Lane line detection method based on semi-supervised generation confrontation network
CN112215255B (en) Training method of target detection model, target detection method and terminal equipment
Akan et al. Stretchbev: Stretching future instance prediction spatially and temporally
CN111767847B (en) Pedestrian multi-target tracking method integrating target detection and association
Patil et al. Msednet: multi-scale deep saliency learning for moving object detection
CN114049382A (en) Target fusion tracking method, system and medium in intelligent network connection environment
CN112036381B (en) Visual tracking method, video monitoring method and terminal equipment
Dinh et al. Transfer learning for vehicle detection using two cameras with different focal lengths
CN113160283A (en) Target tracking method based on SIFT under multi-camera scene
CN115063447A (en) Target animal motion tracking method based on video sequence and related equipment
CN115546705A (en) Target identification method, terminal device and storage medium
CN117036397A (en) Multi-target tracking method based on fusion information association and camera motion compensation
CN115661767A (en) Image front vehicle target identification method based on convolutional neural network
CN116861262B (en) Perception model training method and device, electronic equipment and storage medium
CN115100565B (en) Multi-target tracking method based on spatial correlation and optical flow registration
US20230298335A1 (en) Computer-implemented method, data processing apparatus and computer program for object detection
CN116129386A (en) Method, system and computer readable medium for detecting a travelable region

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant