CN111080673B - Anti-occlusion target tracking method

Anti-occlusion target tracking method

Info

Publication number: CN111080673B
Application number: CN201911261618.2A
Authority: CN (China)
Prior art keywords: target, tracking, candidate, detection
Other languages: Chinese (zh)
Other versions: CN111080673A (application publication)
Inventors: 张盛, 易梦云, 徐赫
Original and current assignee: Shenzhen International Graduate School of Tsinghua University
Application filed by Shenzhen International Graduate School of Tsinghua University
Legal status: Active (granted)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention provides an anti-occlusion target tracking method. First, for an input video or image sequence, a target detector is applied to each frame to obtain detection-based candidates, and, from the target detection result of the current frame, a Kalman filter predicts the position of each target in the next frame to obtain tracking-based candidates. The confidence of each candidate is calculated with a confidence score formula, and a non-maximum suppression algorithm yields the final candidates. Candidates of adjacent frames are then fed into a feature matching network, and the matching degree between targets is computed by a cascade matching algorithm: the detection-based candidates are matched by the similarity of features extracted with a deep neural network, while the tracking-based candidates are matched by bounding-box intersection-over-union (IoU) overlap. Finally, the position of the target in the current frame is determined from the target matching result of adjacent frames, and the target motion track is output. The method detects and tracks targets even when they are occluded, improving tracking precision and performance.

Description

Anti-occlusion target tracking method
Technical Field
The invention relates to the technical field of target tracking, in particular to an anti-occlusion target tracking method.
Background
In recent years, with the continuous development of deep neural networks and the continuous improvement of GPU computing power, deep-learning-based methods have made breakthrough progress on computer vision tasks. Computer vision technologies such as target detection, target recognition, target tracking, and pedestrian re-identification have developed rapidly and are widely applied in industries and fields such as intelligent monitoring, human-computer interaction, virtual and augmented reality, and medical image analysis.
Multi-object tracking is a classic computer vision task; the regions of interest obtained by target tracking are the basis for further high-level vision analysis, and the accuracy of target tracking directly affects the performance of a computer vision system. Most existing multi-target tracking methods adopt the tracking-by-detection paradigm: given the output of a target detector, detections belonging to the same target are associated into motion tracks across frames. Such methods depend heavily on the detection results. However, in many practical applications, especially in crowded scenes, the detector output is usually not accurate enough because of interactions between targets, appearance similarity, and frequent occlusion, which seriously degrades tracking accuracy and performance.
Some existing multi-target tracking algorithms retrain the target detector on a large-scale dataset to obtain more accurate detections; however, they ignore the motion information in the video images and are not efficient enough. Other methods extract features with deeper neural networks to obtain more robust target representations; however, appearance-based features alone are ill-suited to the appearance-similarity problem, and the real-time performance of such algorithms is hard to guarantee. In view of the above, it is desirable to provide a new anti-occlusion target tracking method to handle target occlusion and interaction.
Disclosure of Invention
The invention provides an anti-occlusion target tracking method to solve the above problems.
In order to solve the above problems, the technical solution adopted by the present invention is as follows:
An anti-occlusion target tracking method comprises the following steps:
S1: inputting a video or an image sequence frame by frame into a target detector to obtain a target detection result, wherein the target detection result constitutes the detection-based candidates and comprises the bounding boxes and detection confidences of all targets in each frame of image;
S2: generating tracking-based candidates for each frame of image from the target detection result by using a joint detection-and-tracking framework, wherein the framework performs tracking motion estimation on the detection result through a Kalman filter and camera motion compensation to obtain the tracking-based candidates;
S3: screening the detection-based candidates and the tracking-based candidates with a non-maximum suppression algorithm according to their confidences, to obtain the screened detection-based candidates and the screened tracking-based candidates;
S4: extracting the apparent features of all screened detection-based and tracking-based candidates of the current frame with a pre-trained deep neural network;
S5: calculating the target matching degree between adjacent frames with a cascade matching algorithm, wherein the screened detection-based candidates are matched to the existing tracks of adjacent frames by apparent-feature similarity, and the screened tracking-based candidates are matched to the target bounding boxes of the existing tracks of adjacent frames by bounding-box intersection-over-union (IoU);
S6: determining the position of each target in the current frame from the target matching degree of adjacent frames, thereby outputting the target motion track.
Preferably, the target detector is an SDP target detector.
Preferably, the confidence is given by the following confidence score formula:

s_trk^t = max(0, s_det^{t-1} · (1 − log(1 + α · N_trk))) · 1(N_det ≥ 2)

where s_det^{t-1} is the detection confidence of the (t−1)-th frame, s_trk^t is the tracking confidence of the t-th frame, N_det is the number of detection-based candidates in the track to be associated, N_trk is the number of tracking-based candidates since the track was last associated, 1(·) is a binary indicator function that takes the value 1 when its condition is true and 0 otherwise, and the parameter α is a constant.
Preferably, screening the detection-based candidates and the tracking-based candidates with a non-maximum suppression algorithm to obtain the screened detection-based candidates and the screened tracking-based candidates comprises the following steps: S21: sorting all detection-based and tracking-based candidates by confidence score to obtain a candidate list; S22: moving the detection-based or tracking-based candidate with the highest confidence from the candidate list to a final output list; S23: computing the bounding-box intersection-over-union (IoU) between that highest-confidence candidate and the other candidates, and deleting the candidates whose bounding-box IoU exceeds a preset threshold; S24: repeating the process until the candidate list is empty, the final output list then containing the screened detection-based and tracking-based candidates.
Preferably, the preset threshold is 0.3-0.5.
Preferably, the deep neural network is a GoogLeNet-based network comprising the layers from the input layer through the inception_4e layer, followed by a 1×1 convolutional layer.
Preferably, the loss function used to train the neural network is:

l_triplet(I_i, I_j, I_k) = m + d(I_i, I_j) − d(I_i, I_k)

where I_i and I_j are pictures of the same identity, I_i and I_k are pictures of different identities, d denotes the Euclidean distance, and m is a constant margin.
Preferably, the step of calculating the target matching degree of the adjacent frames by using the cascade matching algorithm comprises the following steps: S51: obtaining the target detection result of the first frame and generating a track for each target, so as to obtain the initial track set T, the candidate set C consisting of the screened detection-based candidates and the screened tracking-based candidates, and their apparent features F, and constructing the set C_m of all matched candidates and the set C_u of all unmatched candidates; S52: performing feature-similarity calculation between the screened detection-based candidates C_det and the initial track set T, and updating the matched candidate set C_m, the unmatched candidate set C_u, and the initial track set T according to the matching result; S53: performing bounding-box intersection-over-union (IoU) matching between the screened tracking-based candidates C_trk and the target bounding boxes of the updated initial track set T, and updating the matched candidate set C_m and the unmatched candidate set C_u according to the matching result.
Preferably, each candidate bounding box in the matched candidate set C_m is connected to the corresponding track segment in the initial track set T; each candidate in the unmatched candidate set C_u is initialized as a new track; an unmatched track in the initial track set T is marked as a temporary track segment, and if it is not matched in the following N consecutive frames, the temporary track segment is considered to have ended and is deleted from the initial track set T. The value of N is 5-8.
The invention also provides a computer-readable storage medium in which a computer program is stored; when executed by a processor, the program implements the steps of any of the methods described above.
The beneficial effects of the invention are as follows: an anti-occlusion target tracking method is provided in which, through the combined action of the joint detection-and-tracking framework and the cascade matching algorithm, better candidates can be generated for target cascade matching even when targets occlude each other interactively and the detector output is inaccurate. This alleviates inaccurate detection during target interaction and occlusion and reduces the influence of occlusion on the tracking result, thereby achieving accurate tracking under occlusion.
Furthermore, the method is simple to implement and computationally cheap: the algorithm reaches a running speed of 30 frames per second on a GPU, enabling real-time tracking. Compared with traditional target tracking methods, it requires low computational cost, has strong anti-occlusion capability, and runs in real time.
Drawings
FIG. 1 is a schematic diagram of an anti-occlusion target tracking method in an embodiment of the present invention.
FIG. 2 is a schematic diagram of a method for obtaining the screened detection-based candidates and the screened tracking-based candidates according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of a method for calculating the target matching degree of adjacent frames with the cascade matching algorithm in an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the embodiments of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and the embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that when an element is referred to as being "secured to" or "disposed on" another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element. In addition, the connection may be for either a fixing function or a circuit connection function.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like indicate orientations or positional relationships based on those shown in the drawings, are used merely for convenience and simplicity of description of the embodiments of the present invention, do not indicate or imply that the referenced device or element must have a particular orientation or be constructed and operated in a particular orientation, and are not to be construed as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
As shown in FIG. 1, the present invention provides an anti-occlusion target tracking method, which comprises the following steps:
S1: inputting a video or an image sequence frame by frame into a target detector to obtain a target detection result, wherein the target detection result constitutes the detection-based candidates and comprises the bounding boxes and detection confidences of all targets in each frame of image;
S2: generating tracking-based candidates for each frame of image from the target detection result by using a joint detection-and-tracking framework, wherein the framework performs tracking motion estimation on the detection result through a Kalman filter and camera motion compensation to obtain the tracking-based candidates;
Taking the N-th frame as an example, the target bounding-box positions output by the SDP target detector for the current frame are taken as the detection-based candidates of the N-th frame. Meanwhile, the same bounding-box positions are input into the Kalman filter, which estimates the bounding-box positions in the next frame as the tracking-based candidates of the (N+1)-th frame.
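As an illustration of this step, the following is a minimal Python sketch of a constant-velocity Kalman filter over the bounding box; the state layout, noise settings, and box parameterization are illustrative assumptions rather than values taken from the patent, and the camera motion compensation mentioned above is omitted.

```python
import numpy as np

class BoxKalmanFilter:
    """Constant-velocity Kalman filter over a box (cx, cy, w, h).

    A minimal sketch: the predicted box for the next frame serves as the
    tracking-based candidate; the detector box associated to the track
    is fed back through update().
    """

    def __init__(self, box, dt=1.0):
        # State: [cx, cy, w, h, vcx, vcy, vw, vh]
        self.x = np.concatenate([np.asarray(box, dtype=float), np.zeros(4)])
        self.P = np.eye(8) * 10.0                          # state covariance
        self.F = np.eye(8)                                 # transition matrix
        self.F[:4, 4:] = np.eye(4) * dt                    # position += velocity * dt
        self.H = np.hstack([np.eye(4), np.zeros((4, 4))])  # only the box is observed
        self.Q = np.eye(8) * 1e-2                          # process noise (assumed)
        self.R = np.eye(4) * 1.0                           # measurement noise (assumed)

    def predict(self):
        """Propagate one frame ahead; returns the tracking-based candidate box."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4]

    def update(self, box):
        """Correct the state with the detection associated to this track."""
        z = np.asarray(box, dtype=float)
        y = z - self.H @ self.x                            # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)           # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(8) - K @ self.H) @ self.P
```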
S3: screening the detection-based candidates and the tracking-based candidates with a non-maximum suppression algorithm according to their confidences, to obtain the screened detection-based candidates and the screened tracking-based candidates;
S4: extracting the apparent features of all screened detection-based and tracking-based candidates of the current frame with a pre-trained deep neural network;
In one embodiment of the invention, the apparent feature is a 512-dimensional depth feature.
S5: calculating the target matching degree between adjacent frames with a cascade matching algorithm, wherein the screened detection-based candidates are matched to the existing tracks of adjacent frames by apparent-feature similarity, and the screened tracking-based candidates are matched to the target bounding boxes of the existing tracks of adjacent frames by bounding-box intersection-over-union (IoU);
S6: determining the position of each target in the current frame from the target matching degree of adjacent frames, thereby outputting the target motion track.
In one embodiment of the invention, the object detector is an SDP object detector.
The confidence is given by the following confidence score formula:

s_trk^t = max(0, s_det^{t-1} · (1 − log(1 + α · N_trk))) · 1(N_det ≥ 2)

where s_det^{t-1} is the detection confidence of the (t−1)-th frame, s_trk^t is the tracking confidence of the t-th frame, N_det is the number of detection-based candidates in the track to be associated, N_trk is the number of tracking-based candidates since the track was last associated, 1(·) is a binary indicator function that takes the value 1 when its condition is true and 0 otherwise, and the parameter α is a constant.
In one embodiment of the invention, the α value is 0.05.
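For illustration, the scoring of a tracking-based candidate can be sketched as below. The original equation appears in the source only as an image; the functional form used here follows the reconstruction given above and should be read as an assumption, not as the patent's literal formula.

```python
import math

def tracklet_confidence(s_det_prev, n_det, n_trk, alpha=0.05):
    """Tracking confidence of a candidate predicted from an existing track.

    s_det_prev: detection confidence at the last frame in which the track
                was associated with a detection (s_det^{t-1}).
    n_det:      number of detection-based candidates in the track (N_det).
    n_trk:      number of tracking-based candidates since the last
                association (N_trk).
    The logarithmic decay and the N_det >= 2 gate follow the reconstruction
    above and are assumptions.
    """
    gate = 1.0 if n_det >= 2 else 0.0              # the indicator 1(.)
    decay = 1.0 - math.log(1.0 + alpha * n_trk)    # confidence decays while the track is unmatched
    return max(0.0, s_det_prev * decay) * gate
```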
As shown in FIG. 2, screening the detection-based candidates and the tracking-based candidates with the non-maximum suppression algorithm to obtain the screened detection-based candidates and the screened tracking-based candidates includes the following steps:
S21 (not shown in the figure): sorting all detection-based and tracking-based candidates by confidence score to obtain a candidate list;
S22 (not shown in the figure): moving the detection-based or tracking-based candidate with the highest confidence from the candidate list to a final output list;
S23 (not shown in the figure): computing the bounding-box IoU between that highest-confidence candidate and the other candidates, and deleting the candidates whose bounding-box IoU exceeds the preset threshold;
S24 (not shown in the figure): repeating the process until the candidate list is empty, the final output list then containing the screened detection-based and tracking-based candidates.
In one embodiment of the invention, the predetermined threshold is 0.3-0.5.
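For illustration, steps S21 to S24 can be sketched as a standard confidence-sorted non-maximum suppression over the pooled detection-based and tracking-based candidates; the helper names and the 0.4 threshold (within the stated 0.3-0.5 range) are assumptions.

```python
def iou(a, b):
    """Bounding-box intersection-over-union for boxes in (x1, y1, x2, y2) form."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def screen_candidates(candidates, iou_threshold=0.4):
    """candidates: list of (box, confidence) pooled from detection and tracking.
    Returns the screened list (steps S21-S24)."""
    pending = sorted(candidates, key=lambda c: c[1], reverse=True)       # S21
    kept = []
    while pending:                                                       # S24
        best = pending.pop(0)                                            # S22
        kept.append(best)
        pending = [c for c in pending
                   if iou(best[0], c[0]) <= iou_threshold]               # S23
    return kept
```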
The deep neural network is a GoogLeNet-based network comprising the layers from the input layer through the inception_4e layer, followed by a 1×1 convolutional layer. The network input picture size is 160×80, and the output target feature is 512-dimensional. The network is pre-trained on a large-scale pedestrian re-identification dataset with the loss function:

l_triplet(I_i, I_j, I_k) = m + d(I_i, I_j) − d(I_i, I_k)

where I_i and I_j are pictures of the same identity, I_i and I_k are pictures of different identities, d denotes the Euclidean distance, and m is a constant margin.
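For illustration, the stated loss can be evaluated directly on extracted feature vectors as in the sketch below; clamping at zero and the margin value 0.2 are conventional choices assumed here, not spelled out in the text.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """l_triplet = max(0, m + d(anchor, positive) - d(anchor, negative)).

    anchor and positive are 512-d features of the same identity, negative is
    a feature of a different identity; d is the Euclidean distance.
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, margin + d_pos - d_neg)
```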
As shown in FIG. 3, the step of calculating the target matching degree of adjacent frames with the cascade matching algorithm includes the following steps:
S51 (not shown in the figure): obtaining the target detection result of the first frame and generating a track for each target, so as to obtain the initial track set T, the candidate set C consisting of the screened detection-based candidates and the screened tracking-based candidates, and their apparent features F; and constructing the set C_m of all matched candidates and the set C_u of all unmatched candidates.
S52 (not shown in the figure): performing feature-similarity calculation between the screened detection-based candidates C_det and the initial track set T, and updating the matched candidate set C_m, the unmatched candidate set C_u, and the initial track set T according to the matching result.
In an embodiment of the invention, a Hungarian algorithm is used for feature similarity matching.
S53 (not shown in the figure): selecting the selected candidate item based on tracking
Figure SMS_36
And the updated initial trajectory set->
Figure SMS_37
The target bounding box carries out bounding box intersection and matching of the matching degree of gravity, and the matched candidate item set is updated according to the matching result>
Figure SMS_38
The unmatched candidate set->
Figure SMS_39
In one embodiment of the invention, the Hungarian algorithm is used for the bounding-box IoU matching.
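A compact sketch of the two-stage cascade (S52, then S53) using SciPy's Hungarian solver is given below, reusing the iou helper from the non-maximum suppression sketch above; the track/candidate attribute names and the cost thresholds are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign(cost, max_cost):
    """Hungarian assignment; pairs whose cost exceeds max_cost are discarded.
    Returns matched (row, col) pairs and the unmatched row indices."""
    rows, cols = linear_sum_assignment(cost)
    pairs = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
    matched_rows = {r for r, _ in pairs}
    unmatched_rows = [r for r in range(cost.shape[0]) if r not in matched_rows]
    return pairs, unmatched_rows

def cascade_match(tracks, det_candidates, trk_candidates):
    """Stage 1 (S52): appearance matching of detection-based candidates;
    stage 2 (S53): IoU matching of tracking-based candidates against the
    tracks left unmatched by stage 1."""
    # S52: cosine distance between (assumed L2-normalized) apparent features.
    feat_cost = np.array([[1.0 - float(np.dot(t.feature, c.feature))
                           for c in det_candidates] for t in tracks])
    pairs_det, leftover = assign(feat_cost, max_cost=0.4)
    # S53: 1 - IoU against the tracks that stage 1 left unmatched.
    remaining = [tracks[i] for i in leftover]
    iou_cost = np.array([[1.0 - iou(t.box, c.box)
                          for c in trk_candidates] for t in remaining])
    pairs_trk, _ = assign(iou_cost, max_cost=0.7)
    return pairs_det, pairs_trk
```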
Further, each candidate bounding box in the matched candidate set C_m is connected to the corresponding track segment in the initial track set T; each candidate in the unmatched candidate set C_u is initialized as a new track; an unmatched track in the initial track set T is marked as a temporary track segment, and if it is not matched in the following N consecutive frames, the temporary track segment is considered to have ended and is deleted from the initial track set T, where N is generally 5-8.
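The track bookkeeping described above can be sketched as follows; the Track fields and the max_lost counter standing in for N are illustrative assumptions.

```python
class Track:
    """Minimal track record for the lifecycle rules above."""
    _next_id = 0

    def __init__(self, box, feature):
        self.box, self.feature = box, feature
        self.id = Track._next_id
        Track._next_id += 1
        self.lost_frames = 0          # consecutive frames without a match

def update_tracks(tracks, matched, unmatched_candidates, max_lost=6):
    """matched: (track, candidate) pairs from the cascade matcher;
    max_lost plays the role of N (5-8 in the text)."""
    matched_ids = {id(t) for t, _ in matched}
    for track, cand in matched:       # connect each matched candidate to its track
        track.box, track.feature = cand.box, cand.feature
        track.lost_frames = 0
    for track in tracks:              # unmatched tracks become temporary segments
        if id(track) not in matched_ids:
            track.lost_frames += 1
    for cand in unmatched_candidates: # unmatched candidates start new tracks
        tracks.append(Track(cand.box, cand.feature))
    # a temporary segment unmatched for N consecutive frames has ended
    tracks[:] = [t for t in tracks if t.lost_frames <= max_lost]
```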
On the MOT17 public multi-target pedestrian tracking dataset, the tracking results of the present invention are shown in the table below. The method is superior to the other listed techniques on most metrics, particularly F1 score, tracking rate, number of ID switches, and accuracy, and it runs at real-time speed. The reduction in ID switches shows that the extracted apparent features strengthen the tracker's recognition ability and reduce tracking errors when targets interact and occlude each other. The improvements in false positives and tracking rate indicate the effectiveness of the proposed anti-occlusion target tracking method.
TABLE 1 Test results

Method         Accuracy  F1 score  Tracking rate  Loss rate  False positives  False negatives  ID switches  Speed (fps)
HISP           44.6      38.8      15.1%          38.8%      25,478           276,395          10,617       4.7
SORT           43.1      39.8      12.5%          42.3%      28,398           287,582          4,852        143.3
FPSN           44.9      48.4      16.5%          35.8%      33,757           269,952          7,136        10.1
MASS           46.9      46.0      16.9%          36.3%      25,773           269,116          4,478        17.1
OTCD           44.6      38.8      15.1%          38.8%      25,478           276,359          3,573        46.5
The invention  47.4      50.1      16.8%          37.2%      26,910           267,331          2,760        35.7
All or part of the flow of the methods of the embodiments may be implemented by a computer program that instructs related hardware; the program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments. The computer program comprises computer program code, which may be in source code, object code, executable-file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media exclude electrical carrier signals and telecommunications signals.
The foregoing is a detailed description of the invention with reference to specific preferred embodiments, and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several equivalent substitutions or obvious modifications can be made without departing from the spirit of the invention, and all such variants are considered to fall within the scope of the invention.

Claims (10)

1. An anti-occlusion target tracking method is characterized by comprising the following steps:
S1: inputting a video or an image sequence frame by frame into a target detector to obtain a target detection result, wherein the target detection result constitutes the detection-based candidates and comprises the bounding boxes and detection confidences of all targets in each frame of image;
S2: generating tracking-based candidates for each frame of image from the target detection result by using a joint detection-and-tracking framework, wherein the framework performs tracking motion estimation on the target detection result through a Kalman filter and camera motion compensation to obtain the tracking-based candidates;
S3: screening the detection-based candidates and the tracking-based candidates with a non-maximum suppression algorithm according to their confidences, to obtain the screened detection-based candidates and the screened tracking-based candidates;
S4: extracting the apparent features of all screened detection-based and tracking-based candidates of the current frame with a pre-trained deep neural network;
S5: calculating the target matching degree between adjacent frames with a cascade matching algorithm, wherein the screened detection-based candidates are matched to the existing tracks of adjacent frames by apparent-feature similarity, and the screened tracking-based candidates are matched to the target bounding boxes of the existing tracks of adjacent frames by bounding-box intersection-over-union (IoU);
S6: determining the position of each target in the current frame from the target matching degree of adjacent frames, thereby outputting the target motion track.
2. The anti-occlusion target tracking method of claim 1, wherein the target detector is an SDP target detector.
3. The anti-occlusion target tracking method of claim 1, wherein the confidence is given by the following confidence score formula:

s_trk^t = max(0, s_det^{t-1} · (1 − log(1 + α · N_trk))) · 1(N_det ≥ 2)

where s_det^{t-1} is the detection confidence of the (t−1)-th frame, s_trk^t is the tracking confidence of the t-th frame, N_det is the number of detection-based candidates in the track to be associated, N_trk is the number of tracking-based candidates since the track was last associated, 1(·) is a binary indicator function that takes the value 1 when its condition is true and 0 otherwise, and the parameter α is a constant.
4. The anti-occlusion target tracking method of claim 1, wherein the step of screening the detection-based candidate and the tracking-based candidate using a non-maximum suppression algorithm to obtain the screened detection-based candidate and the screened tracking-based candidate comprises the steps of:
S21: sorting all detection-based and tracking-based candidates by confidence score to obtain a candidate list;
S22: moving the detection-based or tracking-based candidate with the highest confidence from the candidate list to a final output list;
S23: computing the bounding-box intersection-over-union (IoU) between that highest-confidence candidate and the other candidates, and deleting the candidates whose bounding-box IoU exceeds a preset threshold;
S24: repeating steps S21 to S23 until the candidate list is empty, the final output list then containing the screened detection-based and tracking-based candidates.
5. The anti-occlusion target tracking method of claim 4, wherein the preset threshold is 0.3-0.5.
6. The anti-occlusion target tracking method of claim 1, wherein the deep neural network is a GoogLeNet-based network comprising the layers from the input layer through the inception_4e layer, followed by a 1×1 convolutional layer.
7. The anti-occlusion target tracking method of claim 6, wherein the loss function used to train the neural network is:

l_triplet(I_i, I_j, I_k) = m + d(I_i, I_j) − d(I_i, I_k)

where I_i and I_j are pictures of the same identity, I_i and I_k are pictures of different identities, d denotes the Euclidean distance, and m is a constant margin.
8. The anti-occlusion target tracking method of claim 1, wherein calculating the target matching degree of adjacent frames using the cascade matching algorithm comprises the following steps:
S51: obtaining the target detection result of the first frame and generating a track for each target, so as to obtain the initial track set T, the candidate set C consisting of the screened detection-based candidates and the screened tracking-based candidates, and their apparent features F, and constructing the set C_m of all matched candidates and the set C_u of all unmatched candidates;
S52: performing feature-similarity calculation between the screened detection-based candidates C_det and the initial track set T, and updating the matched candidate set C_m, the unmatched candidate set C_u, and the initial track set T according to the matching result;
S53: performing bounding-box intersection-over-union (IoU) matching between the screened tracking-based candidates C_trk and the target bounding boxes of the updated initial track set T, and updating the matched candidate set C_m and the unmatched candidate set C_u according to the matching result.
9. The anti-occlusion target tracking method of claim 8, wherein each candidate bounding box in the matched candidate set C_m is connected to the corresponding track segment in the initial track set T; each candidate in the unmatched candidate set C_u is initialized as a new track; an unmatched track in the initial track set T is marked as a temporary track segment, and if it is not matched in the following N consecutive frames, the temporary track segment is considered to have ended and is deleted from the initial track set T; the value of N is 5-8.
10. A computer-readable storage medium in which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
CN201911261618.2A (priority date 2019-12-10, filing date 2019-12-10), Anti-occlusion target tracking method, Active, granted as CN111080673B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911261618.2A 2019-12-10 2019-12-10 Anti-occlusion target tracking method (granted as CN111080673B)

Publications (2)

Publication Number Publication Date
CN111080673A CN111080673A (en) 2020-04-28
CN111080673B (en) 2023-04-18

Family

ID=70313832

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911261618.2A Anti-occlusion target tracking method 2019-12-10 2019-12-10 (Active; granted as CN111080673B)

Country Status (1)

Country Link
CN (1) CN111080673B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113689462A (en) * 2020-05-19 2021-11-23 深圳绿米联创科技有限公司 Target processing method and device and electronic equipment
CN112016445B (en) * 2020-08-27 2022-04-19 重庆科技学院 Monitoring video-based remnant detection method
CN112509338B (en) * 2020-09-11 2022-02-22 博云视觉(北京)科技有限公司 Method for detecting traffic jam event through silent low-point video monitoring
CN112734800A (en) * 2020-12-18 2021-04-30 上海交通大学 Multi-target tracking system and method based on joint detection and characterization extraction
CN112883819B (en) * 2021-01-26 2023-12-08 恒睿(重庆)人工智能技术研究院有限公司 Multi-target tracking method, device, system and computer readable storage medium
CN112990072A (en) * 2021-03-31 2021-06-18 广州敏视数码科技有限公司 Target detection and tracking method based on high and low dual thresholds
CN113223051A (en) * 2021-05-12 2021-08-06 北京百度网讯科技有限公司 Trajectory optimization method, apparatus, device, storage medium, and program product
CN114972418B (en) * 2022-03-30 2023-11-21 北京航空航天大学 Maneuvering multi-target tracking method based on combination of kernel adaptive filtering and YOLOX detection

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109919981A (en) * 2019-03-11 2019-06-21 南京邮电大学 A kind of multi-object tracking method of the multiple features fusion based on Kalman filtering auxiliary

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11144761B2 (en) * 2016-04-04 2021-10-12 Xerox Corporation Deep data association for online multi-class multi-object tracking

Similar Documents

Publication Publication Date Title
CN111080673B (en) Anti-occlusion target tracking method
CN110516556B (en) Multi-target tracking detection method and device based on Darkflow-deep Sort and storage medium
US11094070B2 (en) Visual multi-object tracking based on multi-Bernoulli filter with YOLOv3 detection
CN109360226B (en) Multi-target tracking method based on time series multi-feature fusion
Elhoseny Multi-object detection and tracking (MODT) machine learning model for real-time video surveillance systems
Fernando et al. Tracking by prediction: A deep generative model for mutli-person localisation and tracking
US10402627B2 (en) Method and apparatus for determining identity identifier of face in face image, and terminal
Huang et al. Robust object tracking by hierarchical association of detection responses
Zhou et al. Improving video saliency detection via localized estimation and spatiotemporal refinement
Rezatofighi et al. Joint probabilistic data association revisited
CN106778712B (en) Multi-target detection and tracking method
US20050216274A1 (en) Object tracking method and apparatus using stereo images
CN113191180B (en) Target tracking method, device, electronic equipment and storage medium
Mukherjee et al. Gaussian mixture model with advanced distance measure based on support weights and histogram of gradients for background suppression
Gan et al. Online CNN-based multiple object tracking with enhanced model updates and identity association
CN110991397B (en) Travel direction determining method and related equipment
CN110349188B (en) Multi-target tracking method, device and storage medium based on TSK fuzzy model
Kim et al. Multiple player tracking in soccer videos: an adaptive multiscale sampling approach
Cai et al. A real-time visual object tracking system based on Kalman filter and MB-LBP feature matching
Al-Shakarji et al. Robust multi-object tracking with semantic color correlation
CN111161325A (en) Three-dimensional multi-target tracking method based on Kalman filtering and LSTM
Iraei et al. Object tracking with occlusion handling using mean shift, Kalman filter and edge histogram
Idan et al. Fast shot boundary detection based on separable moments and support vector machine
Poiesi et al. Tracking multiple high-density homogeneous targets
Lit et al. Multiple object tracking with gru association and kalman prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant