CN112734800A - Multi-target tracking system and method based on joint detection and characterization extraction - Google Patents
- Publication number
- CN112734800A CN112734800A CN202011510839.1A CN202011510839A CN112734800A CN 112734800 A CN112734800 A CN 112734800A CN 202011510839 A CN202011510839 A CN 202011510839A CN 112734800 A CN112734800 A CN 112734800A
- Authority
- CN
- China
- Prior art keywords
- target
- frame
- candidate
- joint detection
- characterization
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20092—Interactive image processing based on input by user
- G06T2207/20104—Interactive definition of region of interest [ROI]
Abstract
The invention discloses a multi-target tracking system and method based on joint detection and characterization extraction, and relates to the field of computer vision tracking. The technical scheme reduces the number of network parameters to be trained and the computational cost, and improves both the efficiency of the algorithm and the multi-target tracking precision.
Description
Technical Field
The invention relates to the field of computer vision tracking, in particular to a multi-target tracking system and method based on joint detection and characterization extraction.
Background
With the rapid development of Internet technology, the continuously improving performance of devices such as smartphones and computers, and falling manufacturing costs, vast amounts of image and video data are generated every moment. As the saying goes, "a picture is worth a thousand words": images and videos contain huge amounts of valuable information. How to use these data quickly and accurately has become an urgent problem. Computer vision technology, now developing rapidly, can use the powerful computing capability of computers to process image data in place of the human eye, and has become a core technology in many fields.
Multi-object tracking (MOT) is an important research direction in the field of computer vision. Its task is to continuously track and locate multiple targets in a video sequence, such as pedestrians on a street or vehicles on a road, while keeping their identity information unchanged, and then derive the motion trajectory of each target. Multi-target tracking not only detects the spatio-temporal information of targets in a video accurately, but also provides a great deal of valuable information for pose prediction, action recognition, behavior analysis, and so on. Multi-target tracking algorithms are widely used in intelligent video surveillance, autonomous driving, intelligent robots, intelligent human-computer interaction, intelligent transportation, sports video analysis, and other fields, and have become a popular research direction in recent years.
The multi-target tracking problem is an extension of the single-target tracking problem. Given a particular target, the task of single-target tracking is to continuously track that target in the scene. The task of multi-target tracking is to track a series of targets of interest in the scene, such as pedestrians and vehicles. Compared with single-target tracking, multi-target tracking must therefore complete two additional tasks:
(1) judging changes in the number of targets in the scene, and completing the initialization of new trajectories and the termination of old trajectories;
(2) maintaining the identity information of the tracked targets.
Currently, tracking-by-detection is the mainstream paradigm of multi-target tracking, and it can be divided into the following two independent subtasks:
target detection, which detects target positions in the current image;
data association, which associates the detection results with existing trajectories.
Researchers often use pre-trained target detection models directly, so the video multi-target tracking problem reduces to a data association problem over the detection results. To obtain an optimal association result, the association cost and the optimization algorithm, the two key links of data association, have become the research focus of tracking-by-detection algorithms.
A preliminary multi-target tracking method is designed in the domestic patent "Multi-target tracking method, device, electronic equipment and storage medium" (application No. 202010573301.9), but it does not consider the frequent occlusion between targets in the scene, so trajectories break frequently. The domestic patent application No. 202010605987.5, titled "An integrated target detection and association pedestrian multi-target tracking method", proposes a model that performs target detection and target feature extraction simultaneously, but its target association step uses only a simple threshold discrimination method, so the method cannot obtain the optimal matching between targets in scenes where multiple similar targets appear at the same time.
The domestic patent "A vehicle multi-target tracking method based on a target center point" (application No. 20201059041.1) integrates a vehicle detection model and a tracking model in one network, which greatly reduces the amount of computation and the running time and simplifies the tracking-by-detection pipeline.
Accordingly, those skilled in the art are devoted to developing a multi-target tracking system and method based on joint detection and characterization extraction.
Disclosure of Invention
In view of the above drawbacks of the prior art, the technical problems to be solved by the present invention are: 1. how to improve the running speed of the algorithm while maintaining its accuracy; 2. how to organically combine the two major links of target detection and data association, and improve the tracking precision by comprehensively utilizing information.
In order to achieve the above purpose, the present invention provides a multi-target tracking system based on joint detection and characterization extraction, which is characterized by comprising a joint detection and characterization extraction module, a trajectory prediction module and a candidate frame screening module, wherein the joint detection and characterization extraction module is composed of a backbone network, a region selection network, a target boundary frame regressor and a characterization extractor.
Furthermore, the trajectory prediction module adopts a linear motion model, infers the possible position of the tracked target in the current video frame according to the motion information of the trajectory, and corrects the existing trajectory to reduce errors.
Furthermore, the candidate frame screening module adopts a non-maximum suppression algorithm with identity transfer, which can screen out the optimal candidate frames by confidence and simultaneously completes the data association of detection candidate frames and trajectories through identity transfer.
Further, the backbone network adopts a backbone network capable of extracting image features, and a feature pyramid network is established on the basis of the backbone network.
Further, the target bounding box regressor and the characterization extractor both adopt a deep neural network structure, and the target bounding box regressor uses a full-connection layer network.
A multi-target tracking method based on joint detection and characterization extraction comprises the following steps:
step 1, making the active track set and the inactive track set empty sets; inputting the video frame sequence frame by frame into the backbone network to obtain a feature table of the current frame image;
step 2, generating candidate frames according to the information in the feature table by using the trajectory prediction module and the RPN, bounding box regressor and characterization extractor in the joint detection and characterization extraction module;
step 3, screening the optimal candidate frames from the candidate frames by a non-maximum suppression method with identity transfer;
step 4, updating the tracks according to the screening result, including track generation, extension and deletion;
and step 5, if the current frame is not the last frame of the video, returning to step 1; otherwise, ending.
Further, the step 2 further comprises:
step 2.1, detecting a target in the image;
2.2, predicting the possible position of the track;
step 2.3, generating a candidate frame;
and 2.4, extracting the characterization vectors.
Further, step 3 adopts a non-maximum suppression method with identity transfer, and specifically includes:
step 3.1, clustering the input candidate frames according to the intersection over union (IoU) between target bounding boxes: candidate frames belonging to the same target are clustered into one class using their spatial relationship, and candidate frames not belonging to the same target are separated;
3.2, if a certain cluster in the clustering result contains a candidate frame with an identity label, transmitting the identity label of the candidate frame to all candidate frames in the cluster;
and 3.3, deleting the candidate box with the non-maximum confidence coefficient in each cluster, and only keeping the candidate box with the maximum confidence coefficient in the cluster.
Further, the step 4 further includes:
step 4.1, updating the track in the active track set;
step 4.2, comparing the characterization vectors between the inactive track set and the screening result, and performing a re-identification operation;
and 4.3, updating the tracks successfully re-identified in the inactive track set and adding them to the active track set; treating the screening results that fail re-identification as new targets, creating tracks for them, and adding those to the active track set.
Further, the adopted re-identification method is a short-term method based on Euclidean distance between the characterization vectors.
Technical effects
1. A joint detection and characterization extraction module is provided. It can both detect target positions in the image and extract target appearance characterizations for subsequent re-identification, greatly reducing the number of network parameters to be trained and the computational cost.
2. A candidate box generation module is designed. The module generates detection candidate boxes by searching for target positions in the current image, directionally generates trajectory candidate boxes corresponding to existing trajectories, and extracts target features within the candidate boxes, so that target positions in the image can be detected accurately, the subsequent data association step is greatly facilitated, and the algorithm efficiency is noticeably improved.
3. A candidate box screening module is designed. The module adopts a non-maximum suppression algorithm with identity transfer, which screens the most accurate target bounding boxes by a unified criterion and noticeably improves tracking precision; the association between candidate boxes and existing trajectories is completed efficiently through the identity transfer operation.
The conception, the specific structure and the technical effects of the present invention will be further described with reference to the accompanying drawings to fully understand the objects, the features and the effects of the present invention.
Drawings
FIG. 1 is a system flow diagram of a preferred embodiment of the present invention;
FIG. 2 is a block diagram of a joint detection and characterization extraction model according to a preferred embodiment of the present invention;
FIG. 3 is a schematic diagram of a regression network for the target bounding box according to a preferred embodiment of the present invention;
FIG. 4 is a schematic diagram of a token extraction network in accordance with a preferred embodiment of the present invention.
Detailed Description
The technical contents of the preferred embodiments of the present invention will be more clearly and easily understood by referring to the drawings attached to the specification. The present invention may be embodied in many different forms of embodiments and the scope of the invention is not limited to the embodiments set forth herein.
As shown in fig. 1, a target tracking method based on a joint detection and characterization extraction module includes the following steps:
the first step is as follows: movement track setSet of inactivation trajectoriesConverting the video frame sequence I to { I ═ I0,i1,...,iT-1Inputting the frames into the main network of the module to obtain the characteristic table F of the current frame imaget;
The second step is that: based on the characteristic table FtThe candidate frame C is generated by the following four stepst。
2.1 detecting objects in the image. The RPN generates a reference bounding box on each pixel of the image and based on the feature table FtFrom which to find areas where there is a possibility of targets
2.2 predicting the possible positions of the trajectory. The track prediction module infers the possible position of the tracked target in the current video frame according to the motion information of the trackWith RPN outputDifferent, output of prediction moduleWith identity information of the corresponding track, which will be between the candidate box and the trackThe association of (a) provides convenience.
2.3 generating candidate boxes. Bounding box with low precisionAndand characteristic table FtInputting the candidate frame C into a target bounding box grouping module to obtain a candidate frame Ct=Dt+Bt. Therein called DtTo detect candidate boxes, BtIs a track candidate box. In this step, the process is carried out,will automatically pass on to Bt。
2.4 extracting the characterization vector. The step is prepared for a subsequent pedestrian re-identification link. Algorithm will frame candidates CtAnd characteristic table FtThe input is input into a characterization extractor of the module, and a characterization vector of each candidate box is calculated.
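The merging of the two candidate-box sources in step 2.3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, array layout, and the convention of using -1 for "no identity yet" are all assumptions.

```python
import numpy as np

def generate_candidates(det_boxes, track_boxes, track_ids):
    """Merge RPN detections (no identity yet) with predicted track boxes.

    det_boxes:   (N, 4) boxes [x1, y1, x2, y2] from the RPN
    track_boxes: (M, 4) boxes from the trajectory prediction module
    track_ids:   (M,) identity labels carried over from the trajectories
    """
    boxes = np.vstack([det_boxes, track_boxes]).astype(float)
    # -1 marks a detection candidate that carries no identity information
    ids = np.concatenate([np.full(len(det_boxes), -1, dtype=int),
                          np.asarray(track_ids, dtype=int)])
    return boxes, ids

det = np.array([[0, 0, 10, 10]])
trk = np.array([[20, 20, 30, 30]])
boxes, ids = generate_candidates(det, trk, [7])
```

The combined set then flows into the regressor and the screening step, where the identity labels carried by the trajectory boxes enable the identity transfer described below.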
The third step: and screening candidate frames.
Candidate frame C generated in the second steptComprises two parts: 1) detection candidate frame D from RPNt(ii) a 2) Trajectory candidate box B from the prediction modulet. Neither of these can be directly regarded as a tracking result of the current frame. Firstly, detecting that a candidate frame is not associated with a track, so that the candidate frame does not carry identity information; secondly, because the prediction accuracy of the prediction module is limited, the direct use of the trajectory candidate box will make the trajectory accuracy not high. The method adopts a non-maximum inhibition method with identity transfer from CtScreening out the optimal candidate frame C'tThe method comprises the following steps:
3.1 clustering. According to the intersection ratio (IoU) between the target bounding boxes, the candidate box set C is processedtAnd clustering, namely clustering the candidate frames belonging to the same target into one class by using the spatial relationship among the candidate frames, and distinguishing the candidate frames not belonging to the same target.
3.2 identity transfer. And if a certain cluster in the clustering result contains a candidate frame with the identity label, transmitting the identity label of the candidate frame to all candidate frames in the cluster.
3.3 inhibition. And deleting the candidate box with non-maximum confidence in each cluster, and only keeping the candidate box with the maximum confidence in the cluster.
This step is to mix CtScreening is optimal candidate frame C'tOf which is C't=D′t+B′t,D′tA bounding box, B ', representing the screening result without the identity information yet'tIs a bounding box with identity information.
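The three-step screening above (IoU clustering, identity transfer, suppression) can be sketched as follows. This is one hedged reading of the described procedure, not the patented implementation: the greedy clustering around the highest-confidence box, the 0.5 IoU threshold, and -1 as the "no identity" marker are all assumptions.

```python
import numpy as np

def iou(a, b):
    # a, b: boxes as [x1, y1, x2, y2]
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def nms_with_identity(boxes, scores, ids, iou_thresh=0.5):
    order = list(np.argsort(scores)[::-1])        # highest confidence first
    kept_boxes, kept_scores, kept_ids = [], [], []
    while order:
        best = order[0]
        # 3.1 clustering: boxes overlapping the current best form one cluster
        cluster = [j for j in order if iou(boxes[best], boxes[j]) >= iou_thresh]
        # 3.2 identity transfer: any labelled member lends its id to the cluster
        labelled = [ids[j] for j in cluster if ids[j] >= 0]
        kept_id = labelled[0] if labelled else -1
        # 3.3 suppression: keep only the highest-confidence member
        kept_boxes.append(boxes[best])
        kept_scores.append(scores[best])
        kept_ids.append(kept_id)
        order = [j for j in order if j not in cluster]
    return np.array(kept_boxes), np.array(kept_scores), np.array(kept_ids)

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
ids = np.array([-1, 3, -1])       # box 1 is a trajectory box with identity 3
b, s, i = nms_with_identity(boxes, scores, ids)
```

Here the high-confidence detection box inherits identity 3 from the overlapping trajectory box, so the association between detections and tracks falls out of the suppression step itself.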
The fourth step: and (4) track processing, including track generation, extension and deletion.
4.1 update the track. According to B'tThe corresponding track in T is updated according to the identity information and the position information in the T; deleting ' B ' in activity track set T 'tCorrelating the traces and adding them into the inactivation trace set T';
4.2 re-identification.
Because targets in the scene frequently occlude one another, a candidate box in D'_t, which carries no identity label in the screening result, may be a newly appearing target, but it may also belong to part of the trajectory of an occluded target. To reduce trajectory fragmentation while keeping the algorithm online and real-time, a short-term pedestrian re-identification method is adopted to judge whether a box in D'_t is an occluded target: first, the trajectories in the inactive trajectory set T' are retained for an additional T_s frames, during which the trajectory prediction module continues to predict their positions; then, the distance between the characterization vectors of D'_t and of the trajectories in T' is used to judge whether the two are the same target. To reduce the false re-identification rate, the following criteria are set: first, the distance between the two characterization vectors must be smaller than a threshold; second, the intersection over union between the two bounding boxes must be greater than a threshold.
After the re-identification step, the trajectories in the inactive trajectory set T' that are successfully re-identified are updated and added to the active trajectory set T. A candidate box in D'_t that fails re-identification is a newly appearing target; a new trajectory is created for it and added to the active trajectory set T.
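The dual-threshold re-identification criterion described above can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary layout of the inactive tracks, the specific threshold values, and the function names are hypothetical.

```python
import numpy as np

def iou(a, b):
    # a, b: boxes as [x1, y1, x2, y2]
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def reidentify(cand_box, cand_vec, lost_tracks, dist_thresh=0.6, iou_thresh=0.3):
    """lost_tracks: {track_id: (predicted_box, characterization_vector)}.

    A lost track is matched only if BOTH criteria hold: small Euclidean
    distance between characterization vectors AND large IoU with the
    position predicted by the trajectory prediction module.
    """
    best_id, best_dist = -1, float("inf")
    for tid, (pred_box, vec) in lost_tracks.items():
        d = float(np.linalg.norm(np.asarray(cand_vec) - np.asarray(vec)))
        if d < dist_thresh and iou(cand_box, pred_box) > iou_thresh and d < best_dist:
            best_id, best_dist = tid, d
    return best_id  # -1 means: treat the candidate as a new target

lost = {4: ([0.0, 0.0, 10.0, 10.0], [1.0, 0.0])}
matched = reidentify([1.0, 1.0, 11.0, 11.0], [1.1, 0.0], lost)
```

Requiring both thresholds keeps the check cheap (no bipartite matching) while reducing false re-identifications, in line with the criteria set above.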
The fifth step: if the current frame is not the last frame of the video, returning to the first step; otherwise, ending.
As shown in fig. 2, the multi-target tracking system based on joint detection and characterization extraction uses the designed joint detection and characterization extraction module as its core skeleton, with a trajectory prediction module and a candidate box screening module added to complete the multi-target tracking task. The joint detection and characterization extraction module consists of a backbone network, a region selection network (RPN), a target bounding box regressor and a characterization extractor. The module can both detect target positions in the image and extract the characterization vector of each target.
The backbone network adopts any backbone capable of extracting image features, such as AlexNet, VGG, the ResNet series, the Inception series, the DenseNet series, the ResNeXt series, and so on. In addition, a Feature Pyramid Network (FPN) is built on top of the backbone, so that target positions can be detected accurately based on feature tables of different scales.
The region selection network adopts the RPN structure of Faster R-CNN, which searches the image for regions containing objects. The RPN first generates a large number of reference bounding boxes (anchors) at each pixel location in the image. Next, it looks up the features corresponding to each reference bounding box in the feature table and judges whether a target exists inside it; at the same time, a bounding box regression method makes the reference bounding box fit the actual position of the target as closely as possible. Typically, the RPN generates reference bounding boxes with aspect ratios of {1:2, 1:1, 2:1}. In practical applications, suitable aspect ratios can be chosen according to the characteristics of the targets of interest to improve accuracy and efficiency.
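One common way to realize the 1:2, 1:1, 2:1 aspect ratios mentioned above is to fix the anchor area and vary only the shape; the sketch below assumes that convention and a single hypothetical scale (32 pixels), neither of which is specified by the patent.

```python
import numpy as np

def make_anchors(cx, cy, scale=32.0, ratios=(0.5, 1.0, 2.0)):
    """Reference boxes centered at (cx, cy); each ratio r = height/width,
    with the area held fixed at scale**2 so only the shape varies."""
    anchors = []
    for r in ratios:
        w = scale / np.sqrt(r)
        h = scale * np.sqrt(r)
        anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

a = make_anchors(0.0, 0.0)
```

In a full RPN this generation is repeated at every feature-map location, and the regression head then refines each anchor toward the true target position.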
As shown in fig. 3 and 4, both the target bounding box regressor and the characterization extractor adopt deep neural network structures. Deep neural networks have excellent fitting and feature representation capabilities and can effectively improve the accuracy of the algorithm. The target bounding box regressor shown in fig. 3 uses a 4-layer fully connected network (layers numbered 1 to 4). From the feature table and a bounding box with poor localization precision, it produces a more precisely localized bounding box and a corresponding confidence. The characterization extractor shown in fig. 4 uses a 3-layer fully connected network (layers numbered 5 to 7). It extracts the characterization vector of a target from the feature table and the target bounding box. The generated characterization vectors satisfy the following property: given a distance metric, the distance between the characterization vectors of the same target in different frames of the video is sufficiently small, while the distance between the characterization vectors of different targets is sufficiently large.
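A minimal numpy sketch of the two fully connected heads described above, using random untrained weights purely to show the layer structure and output shapes: a 4-layer regressor producing box offsets plus a confidence, and a 3-layer extractor producing a characterization vector. All layer widths (256-d input, 128-d embedding, 5-d regression output) are assumptions; the patent does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dims):
    # random (untrained) weights, purely to illustrate the layer structure
    return [(rng.standard_normal((i, o)) * 0.01, np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def forward(layers, x):
    for k, (w, b) in enumerate(layers):
        x = x @ w + b
        if k < len(layers) - 1:
            x = np.maximum(x, 0.0)   # ReLU between fully connected layers
    return x

feat_dim = 256                                  # assumed RoI feature size
regressor = mlp([feat_dim, 256, 256, 256, 5])   # 4 FC layers: 4 box offsets + confidence
extractor = mlp([feat_dim, 256, 256, 128])      # 3 FC layers: characterization vector
x = rng.standard_normal((2, feat_dim))          # features of 2 candidate boxes
```

A trained version of the extractor would additionally be optimized with a metric-learning loss so that the distance property stated above holds.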
The trajectory prediction module infers the possible position of a tracked target in the current video frame from the motion information of its trajectory and corrects the existing trajectory to reduce errors. This effectively reduces the search space and improves tracking precision. The trajectory prediction module predicts the most likely position of each trajectory at the current time based on a linear motion model.
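A constant-velocity sketch of such a linear motion model: the next box center is extrapolated from the last two observed centers, and the box size is carried over unchanged. The fixed-size assumption and the function name are illustrative, not taken from the patent.

```python
def predict_next(track_boxes):
    """track_boxes: list of [x1, y1, x2, y2], ordered oldest to newest.
    Extrapolate the center with constant velocity; keep the latest size."""
    (px1, py1, px2, py2), (x1, y1, x2, y2) = track_boxes[-2], track_boxes[-1]
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    vx = cx - (px1 + px2) / 2          # per-frame center displacement
    vy = cy - (py1 + py2) / 2
    w, h = x2 - x1, y2 - y1
    ncx, ncy = cx + vx, cy + vy
    return [ncx - w / 2, ncy - h / 2, ncx + w / 2, ncy + h / 2]
```

For a target moving 5 pixels per frame along x, the predicted box simply continues that motion, which is exactly what seeds the trajectory candidate boxes B_t in step 2.2.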
The candidate box screening module employs a non-maximum suppression algorithm with identity transfer. Unlike ordinary non-maximum suppression, suppression with identity transfer is performed after clustering: if a cluster in the clustering result contains a candidate box with an identity label, that identity label is transferred to all candidate boxes in the cluster. The module can screen out the optimal candidate boxes by confidence and completes the data association between detection candidate boxes and trajectories through identity transfer, thereby avoiding complex similarity measurement and bipartite graph assignment procedures.
The embodiment of the application also provides electronic equipment which comprises a processor and a memory.
The memory is used for storing a computer program;
the processor is used for realizing any one of the multi-target tracking methods when executing the program stored in the memory.
Embodiments of the present application may also provide a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the computer program implements any one of the multi-target tracking methods described above.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.
Claims (10)
1. The multi-target tracking system based on joint detection and characterization extraction is characterized by comprising a joint detection and characterization extraction module, a trajectory prediction module and a candidate frame screening module, wherein the joint detection and characterization extraction module is composed of a backbone network, a region selection network, a target boundary frame regressor and a characterization extractor.
2. The multi-target tracking system based on joint detection and feature extraction as claimed in claim 1, wherein the trajectory prediction module uses a linear motion model to infer the possible position of the tracked target in the current video frame according to the motion information of the trajectory, and corrects the existing trajectory to reduce the error.
3. The multi-target tracking system based on joint detection and characterization extraction according to claim 1, wherein the candidate frame screening module adopts a non-maximum suppression algorithm with identity transfer, which can screen out the optimal candidate frames by confidence and simultaneously completes the data association of detection candidate frames and trajectories through identity transfer.
4. The multi-target tracking system based on joint detection and feature extraction as claimed in claim 1, wherein the backbone network adopts a backbone network capable of extracting image features, and a feature pyramid network is established on the basis of the backbone network.
5. The joint detection and characterization extraction based multi-target tracking system of claim 1 wherein the target bounding box regressor and the characterization extractor both employ a deep neural network structure, the target bounding box regressor using a full connectivity layer network.
6. A multi-target tracking method based on joint detection and characterization extraction is characterized by comprising the following steps:
step 1, making an active track set as an empty set and an inactive track set as an empty set; inputting the video frame sequence into the backbone network frame by frame to obtain a feature table of the current frame image;
step 2, generating a candidate frame by utilizing the functions of the RPN, the boundary frame regressor, the characterization extractor and the like in the track prediction module and the joint detection and characterization extraction module according to the information in the feature table;
step 3, screening out the optimal candidate frames from the candidate frames by adopting a non-maximum suppression method with identity transfer;
step 4, updating the track according to the screening result, including track generation, extension and deletion;
and 5, if the current frame is not the last frame of the video, returning to the first step, and if not, ending.
7. The multi-target tracking method based on joint detection and characterization extraction as claimed in claim 6, wherein said step 2 further comprises:
step 2.1, detecting a target in the image;
2.2, predicting the possible position of the track;
step 2.3, generating a candidate frame;
and 2.4, extracting the characterization vectors.
8. The multi-target tracking method based on joint detection and characterization extraction according to claim 6, wherein the step 3 adopts a non-maximum suppression method with identity transfer, specifically comprising:
step 3.1, clustering the input candidate frames according to the intersection over union (IoU) between target bounding boxes: candidate frames belonging to the same target are clustered into one class using their spatial relationship, and candidate frames not belonging to the same target are separated;
3.2, if a certain cluster in the clustering result contains a candidate frame with an identity label, transmitting the identity label of the candidate frame to all candidate frames in the cluster;
and 3.3, deleting the candidate box with the non-maximum confidence coefficient in each cluster, and only keeping the candidate box with the maximum confidence coefficient in the cluster.
9. The multi-target tracking method based on joint detection and characterization extraction as claimed in claim 6, wherein said step 4 further comprises:
step 4.1, updating the track in the active track set;
4.2, comparing the characteristics between the inactivation track set and the screening result, and carrying out re-identification operation;
and 4.3, updating the tracks successfully re-identified in the inactivation track set, adding the tracks into the active track set, taking the screening result of the re-identification failure as a new target, creating tracks for the tracks and adding the tracks into the active track set.
10. The multi-target tracking method based on joint detection and feature extraction as claimed in claim 9, wherein the re-recognition method is a short-term one based on Euclidean distance between feature vectors.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011510839.1A CN112734800A (en) | 2020-12-18 | 2020-12-18 | Multi-target tracking system and method based on joint detection and characterization extraction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112734800A true CN112734800A (en) | 2021-04-30 |
Family
ID=75603418
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011510839.1A Pending CN112734800A (en) | 2020-12-18 | 2020-12-18 | Multi-target tracking system and method based on joint detection and characterization extraction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112734800A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113469118A (en) * | 2021-07-20 | 2021-10-01 | 京东科技控股股份有限公司 | Multi-target pedestrian tracking method and device, electronic equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109919974A (en) * | 2019-02-21 | 2019-06-21 | University of Shanghai for Science and Technology | Online multi-object tracking method based on multi-candidate association in the R-FCN framework |
CN110991272A (en) * | 2019-11-18 | 2020-04-10 | Northeastern University | Multi-target vehicle track identification method based on video tracking |
CN111080673A (en) * | 2019-12-10 | 2020-04-28 | Tsinghua Shenzhen International Graduate School | Anti-occlusion target tracking method |
CN111126152A (en) * | 2019-11-25 | 2020-05-08 | State Grid Info-Telecom Yili Technology Co., Ltd. | Video-based multi-target pedestrian detection and tracking method |
CN111476116A (en) * | 2020-03-24 | 2020-07-31 | Nanjing New Generation Artificial Intelligence Research Institute Co., Ltd. | Rotor unmanned aerial vehicle system for vehicle detection and tracking and detection and tracking method |
CN111639551A (en) * | 2020-05-12 | 2020-09-08 | Huazhong University of Science and Technology | Online multi-target tracking method and system based on Siamese network and long- and short-term cues |
History
- 2020-12-18: Application CN202011510839.1A filed (CN); published as CN112734800A, status Pending
Non-Patent Citations (1)
Title |
---|
Pei Mingtao (裴明涛): "Video Event Analysis and Understanding" (视频事件分析与理解), Beijing Institute of Technology Press * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Marvasti-Zadeh et al. | Deep learning for visual tracking: A comprehensive survey | |
CN109858390B (en) | Human skeleton behavior identification method based on end-to-end space-time diagram learning neural network | |
US20220383535A1 (en) | Object Tracking Method and Device, Electronic Device, and Computer-Readable Storage Medium | |
Han et al. | Adaptive discriminative deep correlation filter for visual object tracking | |
US20220254157A1 (en) | Video 2D Multi-Person Pose Estimation Using Multi-Frame Refinement and Optimization | |
Kim et al. | CDT: Cooperative detection and tracking for tracing multiple objects in video sequences | |
Shuai et al. | Multi-object tracking with siamese track-rcnn | |
Li et al. | When object detection meets knowledge distillation: A survey | |
Chang et al. | Fast Random-Forest-Based Human Pose Estimation Using a Multi-scale and Cascade Approach | |
CN114898403A (en) | Pedestrian multi-target tracking method based on Attention-JDE network | |
Wu et al. | FSANet: Feature-and-spatial-aligned network for tiny object detection in remote sensing images | |
Urdiales et al. | An improved deep learning architecture for multi-object tracking systems | |
Pang et al. | Analysis of computer vision applied in martial arts | |
CN114926859A (en) | Pedestrian multi-target tracking method in dense scene combined with head tracking | |
Song et al. | Detection and tracking of safety helmet based on DeepSort and YOLOv5 | |
Zhang et al. | Residual memory inference network for regression tracking with weighted gradient harmonized loss | |
Gu et al. | Real-time streaming perception system for autonomous driving | |
CN112734800A (en) | Multi-target tracking system and method based on joint detection and characterization extraction | |
Zhang et al. | Action detection with two-stream enhanced detector | |
Yang et al. | Explorations on visual localization from active to passive | |
CN116245913A (en) | Multi-target tracking method based on hierarchical context guidance | |
Yi et al. | Single online visual object tracking with enhanced tracking and detection learning | |
Nalaie et al. | AttTrack: Online deep attention transfer for multi-object tracking | |
Narmadha et al. | Robust Deep Transfer Learning Based Object Detection and Tracking Approach. | |
Dou et al. | Boosting cnn-based pedestrian detection via 3d lidar fusion in autonomous driving |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||