CN114332157A

CN114332157A - Long-term tracking method controlled by double thresholds

Info

Publication number: CN114332157A
Application number: CN202111527248.XA
Authority: CN
Inventors: 邓宸伟; 王旭辰; 韩煜祺; 唐林波
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2021-12-14
Filing date: 2021-12-14
Publication date: 2022-04-12
Anticipated expiration: 2041-12-14
Also published as: CN114332157B

Abstract

The invention provides a long-term tracking method controlled by double thresholds that performs well across different tracking scenes. The method combines a verification network with a twin (Siamese) network and uses dual-threshold control to distinguish the situations that arise during long-term tracking, ensuring the long-term robustness of the algorithm. Fusing the two types of network exploits the strengths of each and compensates for the weaknesses of the other, so that the method adapts to long-term tracking scenes. The MDNet-based verification network makes use of information in subsequent frames through online training, which compensates for the lack of up-to-date target information in twin-network-based tracking algorithms; the twin-network-based tracking algorithm replaces the network prediction step of MDNet-based tracking with template matching, which requires little computation and addresses the poor real-time performance of MDNet-based tracking algorithms.

Description

Long-term tracking method controlled by double thresholds

Technical Field

The invention relates to the technical field of computer vision, and in particular to a long-term tracking method based on double-threshold control.

Background

In computer vision, target tracking is an active research topic with high research value in fields such as video surveillance, intelligent transportation, human-computer interaction, and medical diagnosis. The basic workflow of target tracking is to give the position of a target in the first frame of a video sequence and then locate the same target in each subsequent frame by means of an algorithm. According to the tracking duration, tracking algorithms fall into two branches: short-term tracking and long-term tracking. In short-term tracking the target always remains in the camera's field of view, and the main research problem is how to locate the target quickly and accurately in subsequent frames.

In recent years, short-term tracking algorithms have developed rapidly and have achieved remarkable results on data sets such as OTB and UAV123. In long-term tracking, however, the motion of the target and the camera is uncertain, and short-term tracking algorithms struggle to cope with situations such as changes in camera viewing angle, long-term occlusion of the target, and the target leaving the field of view and later returning to it. At the same time, long-term tracking is closer to practical applications than short-term tracking and therefore has higher research value.

Short-term tracking mainly comprises two kinds of methods. One is the classification-based verification method, such as CNN-based classification tracking, which learns the target appearance online and discriminates the target from the background to achieve tracking. The other is the matching-based regression method, such as the twin network method, which extracts features of the target in the first frame with a neural network trained offline, performs template matching in subsequent frames, and selects the best candidate region as the target position. However, when the target is occluded or leaves the field of view during long-term tracking, an online-updated classification tracker is easily disturbed by noise, and updating it with wrong background information causes the tracking algorithm to fail. An offline-trained matching tracker extracts only the target appearance of the first frame, so it easily fails in long-term tracking when the camera viewing angle changes or the target appearance deforms in subsequent frames.

At present, long-term tracking frameworks mostly handle situations such as target disappearance by introducing a neural-network-based detector that re-detects over the field of view to relocate the target. This scheme has three disadvantages. First, the detector needs a large amount of data for offline training and generalizes poorly, which increases the preparatory workload. Second, to obtain good detection results the network is designed to be deep, which increases the computation during tracking and impairs real-time performance. Third, because the network is deep, it is inconvenient to train it online on the target features of the current frame, so it cannot adapt online to changes in the appearance of the tracked target.

Therefore, a long-term tracking method is needed to integrate two types of algorithms for short-term tracking, so as to achieve the effect of stable operation in long-term tracking.

Disclosure of Invention

In view of this, the invention provides a dual-threshold-controlled long-term tracking method, which can integrate two types of algorithms for short-term tracking to achieve the effect of stable operation in long-term tracking.

In order to achieve the purpose, the technical scheme of the invention is as follows:

A long-term tracking method controlled by double thresholds comprises the following specific steps:

S1, constructing a dual-network structure consisting of a twin network and a verification network, wherein the twin network comprises a feature extraction network and an RPN network; and calibrating the initial frame to obtain a target and its target position frame.

S2, initializing the twin network by using a target position frame corresponding to the initial frame to obtain a search area and a matching template of the initial frame; meanwhile, initializing the verification network and updating the parameters of the verification network.

With the initial frame as the current frame to be processed, S3 is executed.

S3, using the feature extraction network to perform multi-scale feature extraction on the current frame within the search area updated in the previous frame, obtaining matching template features and search area features; the RPN network performs a matching operation on the matching template features and the search area features to obtain a matching tracking result; the matching tracking result is sent to the verification network.

S4, the verification network evaluates the tracking state of the matching tracking result of the current frame to obtain a tracking result score.

S5, setting a template updating condition and a search area transformation condition, and judging the tracking result score:

When the tracking result score is greater than the template updating threshold for N consecutive frames, the current search area and the current matching template are not updated; the next frame is taken directly as the current frame to be processed, and the process returns to S3.

When the tracking result score is less than the template updating threshold and greater than the search area transformation threshold for N consecutive frames, the twin network and the verification network are reinitialized using the target position given by the matching tracking result, the parameters of the verification network are updated, and the current matching template is updated; the next frame is then taken as the current frame to be processed, and the process returns to S3.

When the tracking result score is less than the search area transformation threshold for N consecutive frames, the search area is expanded to the whole frame and the twin network performs global matching tracking over it, while the verification network scores the global matching tracking result; when the global matching score is higher than the search area transformation threshold, the search area is restored from the whole frame to its original size; the next frame is then taken as the current frame to be processed, and the process returns to S3.

The process ends when all frames in the video to be processed have been processed.
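The three branches of S5 can be sketched as a small decision function. This is only an illustration under stated assumptions, not the patented implementation: the names `Action` and `decide` and the threshold values used in the usage note are hypothetical, and the behaviour for windows that satisfy none of the three conditions (too little history, or mixed scores) is an assumption, because the text does not specify it.

```python
from enum import Enum
from typing import Sequence


class Action(Enum):
    KEEP = "keep the current template and search area"
    UPDATE_TEMPLATE = "reinitialize both networks and update the template"
    GLOBAL_SEARCH = "expand the search area to the whole frame"


def decide(scores: Sequence[float], n: int,
           template_thr: float, search_thr: float) -> Action:
    """Map the last n verification scores to one of the three S5 branches."""
    if not search_thr < template_thr:
        raise ValueError("search_thr must be below template_thr")
    recent = list(scores)[-n:]
    if len(recent) < n:
        return Action.KEEP  # assumption: not enough history yet, keep tracking
    if all(s < search_thr for s in recent):
        return Action.GLOBAL_SEARCH
    if all(search_thr < s < template_thr for s in recent):
        return Action.UPDATE_TEMPLATE
    # Either all scores exceed template_thr, or the window is mixed
    # (assumption: a mixed window is treated like the first branch).
    return Action.KEEP
```

For example, with hypothetical thresholds 0.6 and 0.3 and N = 3, the scores `[0.5, 0.4, 0.45]` select `Action.UPDATE_TEMPLATE`, while `[0.1, 0.2, 0.05]` select `Action.GLOBAL_SEARCH`.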

Further, initializing the twin network by using the target position corresponding to the initial frame to obtain the search area and the matching template of the initial frame, initializing the verification network, and updating the verification network parameters, wherein the specific method comprises the following steps:

The twin network initialization method comprises the following steps: extracting the features of the target as the matching template; and taking the range of four times the target position frame, around that frame, as the search area of the initial frame.

The verification network initialization method comprises the following steps: randomly generating, by a Gaussian distribution method, position frames whose intersection-over-union (IoU) with the target position frame is greater than 0.7 as positive samples, and generating, by a uniform random method, position frames whose IoU with the target position frame is less than 0.3 as negative samples; and feeding the positive samples and negative samples into the verification network for training to initialize the parameters of the verification network.

Further, the number of positive samples is 500, and the number of negative samples is 5000.

Further, a feature extraction network is used for carrying out multi-scale feature extraction in a search area updated in the previous frame to obtain matching template features and search area features, and the specific method comprises the following steps:

The matching template is passed through a convolutional neural network to obtain 6 × 6 × 256 matching template features; the search area is passed through a convolutional neural network to obtain 22 × 22 × 256 search area features.

Further, the RPN network performs matching operation according to the matching template features and the search area features to obtain a matching tracking result, and the specific method is as follows:

The matching template features and the search area features are input into the classification branch of the RPN network to obtain a classification response map.

The matching template features and the search area features are input into the regression branch of the RPN network to obtain a regression response map.

The position of the target is adjusted according to the classification response map and the regression response map to obtain the matching tracking result.
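One way to combine the two response maps is to pick the anchor with the highest foreground score and apply its regressed offsets. The sketch below assumes the usual RPN box parameterization (center shifts scaled by the anchor size, log-scale width and height changes) and an assumed channel layout in which each anchor contributes a (background, foreground) pair and a (dx, dy, dw, dh) quadruple; the text specifies neither, and `decode_best_box` is a hypothetical name.

```python
import numpy as np


def decode_best_box(cls_map, reg_map, anchors):
    """Return the refined box (cx, cy, w, h) of the best-scoring anchor.

    cls_map: (H, W, 2k) classification response map
    reg_map: (H, W, 4k) regression response map
    anchors: (H, W, k, 4) anchor boxes as (cx, cy, w, h)
    """
    h, w, c = cls_map.shape
    k = c // 2
    cls = cls_map.reshape(h, w, k, 2)   # assumed layout: (background, foreground)
    reg = reg_map.reshape(h, w, k, 4)   # assumed layout: (dx, dy, dw, dh)

    # Softmax over the two classes gives a foreground probability per anchor.
    e = np.exp(cls - cls.max(axis=-1, keepdims=True))
    fg = e[..., 1] / e.sum(axis=-1)

    i, j, a = np.unravel_index(np.argmax(fg), fg.shape)
    ax, ay, aw, ah = anchors[i, j, a]
    dx, dy, dw, dh = reg[i, j, a]

    # Standard RPN decoding (assumption, not stated in the text).
    box = np.array([ax + dx * aw, ay + dy * ah, aw * np.exp(dw), ah * np.exp(dh)])
    return box, float(fg[i, j, a])
```

The returned score here is only the twin network's own matching confidence; in the method described, the tracking result score used for threshold control comes from the verification network.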

Advantageous effects:

1. The invention provides a target tracking method based on a verification network and a twin network that performs well across different tracking scenes. The invention combines the verification network and the twin network in a complete framework for long-term target tracking. It uses the twin-network-based SiamRPN technique to extract target features and match templates; it uses the verification network to update the template, separating background information from target information as far as possible and yielding an accurate matching score; and it uses dual-threshold control to distinguish the situations that arise during long-term tracking, ensuring the long-term robustness of the algorithm. Fusing the two types of network exploits the strengths of each and compensates for the weaknesses of the other, so that the method adapts to long-term tracking scenes. The MDNet-based verification network makes use of information in subsequent frames through online training, which compensates for the lack of up-to-date target information in twin-network-based tracking algorithms; the twin-network-based tracking algorithm replaces the network prediction step of MDNet-based tracking with template matching, which requires little computation and addresses the poor real-time performance of MDNet-based tracking algorithms.

2. Introducing the MDNet-based verification network solves the problem that a general neural network cannot be updated online because of its large computation. Experiments show that the main reason tracking fails when the target returns to the field of view after disappearing is not that the tracker cannot adapt to the change in target appearance. Instead, the position at which the target reappears is uncertain and may fall outside the search range of the tracker, which causes the tracking algorithm to fail. The invention therefore expands the search area under dual-threshold control when the target disappears, replacing the re-detection operation of general long-term tracking frameworks. This avoids introducing an additional detection neural network, reduces computation, and improves tracking efficiency.

Drawings

FIG. 1 is a block diagram of an embodiment of the present invention.

FIG. 2 is a block diagram of the twin network portion of the present invention.

FIG. 3 is a block diagram of a regression network portion of the present invention.

Detailed Description

The invention is described in detail below by way of example with reference to the accompanying drawings.

As shown in fig. 1, the present invention provides a long-term tracking method with dual-threshold control, which comprises the following specific steps:

S1, constructing a dual-network structure consisting of a twin network and a verification network, calibrating the initial frame of the video to be processed, and giving the target in the initial frame and the corresponding target position frame.

In the embodiment of the invention, the main operation of twin network initialization is to extract the features of the target, obtaining a 6 × 6 × 256 feature map as the matching template. Centered on the target, a range four times the size of the target position frame is used as the search area of the next frame. A properly sized search area avoids the redundant computation of a global search and also prevents the tracking algorithm from failing because the target moves outside the search area.
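The search area computation can be illustrated as follows. The text does not say whether "four times" refers to the area or to each side of the target position frame; this sketch assumes the area reading, so each side is scaled by 2, and exposes the factor as the hypothetical parameter `side_scale` (pass 4.0 for the other reading). Clipping to the image border is also an assumption.

```python
def search_region(box, image_size, side_scale=2.0):
    """Return the search area (x0, y0, x1, y1) centered on the target.

    box: target position frame as (cx, cy, w, h)
    image_size: (width, height) of the frame
    side_scale: per-side enlargement; 2.0 gives four times the area (assumption)
    """
    cx, cy, w, h = box
    img_w, img_h = image_size
    half_w = 0.5 * w * side_scale
    half_h = 0.5 * h * side_scale
    # Clip to the image so the region never leaves the frame (assumption).
    x0 = max(0.0, cx - half_w)
    y0 = max(0.0, cy - half_h)
    x1 = min(float(img_w), cx + half_w)
    y1 = min(float(img_h), cy + half_h)
    return x0, y0, x1, y1
```

For a 20 × 10 target centered at (100, 100) in a 640 × 480 frame, this yields the region (80, 90, 120, 110).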

The main operation of verification network initialization is as follows. According to the position of the target in the initial frame, position frames whose intersection-over-union (IoU) with the target position frame is greater than 0.7 are randomly generated by a Gaussian distribution method as positive samples, and position frames whose IoU with the target position frame is less than 0.3 are generated by a uniform random method as negative samples. The positive and negative samples are fed into the verification network for training, which initializes the parameters of the verification network, adapts the network to the initial appearance of the target, and separates the target from the background as far as possible. In the embodiment of the invention, 500 positive samples and 5000 negative samples are used. In subsequent tracking, the matching operation is performed within the search area obtained in the previous frame, using the matching template obtained at initialization, to obtain the matching tracking result of the current frame; the search area of the next frame is then calculated from this matching tracking result.
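The sample generation can be sketched with rejection sampling. Only the IoU limits (0.7 and 0.3) and the sample counts (500 and 5000) come from the text; the standard deviations of the Gaussian perturbation, the choice to keep negative boxes at the target's size, and the names `iou` and `draw_samples` are assumptions made for illustration. The sketch also assumes the target is much smaller than the frame, otherwise the negative loop may not terminate.

```python
import numpy as np


def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def draw_samples(target, image_size, n_pos=500, n_neg=5000, seed=0):
    """Draw positive (IoU > 0.7) and negative (IoU < 0.3) position frames."""
    rng = np.random.default_rng(seed)
    x0, y0, x1, y1 = target
    w, h = x1 - x0, y1 - y0
    img_w, img_h = image_size
    pos, neg = [], []
    while len(pos) < n_pos:
        # Gaussian jitter of center and scale (standard deviations are assumed).
        dx, dy = rng.normal(0.0, 0.1, size=2) * (w, h)
        s = np.exp(rng.normal(0.0, 0.05))
        cx, cy = x0 + w / 2 + dx, y0 + h / 2 + dy
        cand = (cx - s * w / 2, cy - s * h / 2, cx + s * w / 2, cy + s * h / 2)
        if iou(cand, target) > 0.7:
            pos.append(cand)
    while len(neg) < n_neg:
        # Uniformly placed centers over the whole frame.
        cx, cy = rng.uniform(0, img_w), rng.uniform(0, img_h)
        cand = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
        if iou(cand, target) < 0.3:
            neg.append(cand)
    return pos, neg
```

The image patches cropped at these position frames, labeled positive or negative, would then form the training set used to initialize the verification network.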

S3, using the feature extraction network to perform multi-scale feature extraction on the current frame within the search area updated in the previous frame, obtaining the matching template features and the search area features. The matching template is passed through a convolutional neural network to obtain 6 × 6 × 256 matching template features; the search area is passed through a convolutional neural network to obtain 22 × 22 × 256 search area features.

As shown in fig. 2, the RPN network performs matching operation according to the matching template features and the search area features to obtain a matching tracking result; and sending the matching tracking result to a verification network.

The matching template features are input into the classification branch of the RPN network and convolved with a convolution kernel to obtain the template feature frame; the search area features are input into the classification branch of the RPN network and convolved with a convolution kernel to obtain the search area frame features; the search area frame features are then convolved using the template feature frame as the convolution kernel, yielding a 17 × 17 × 2k classification response map. The size of the template feature frame is 4 × 4 × (2k × 256), and the size of the search area frame features is 20 × 20 × 256. Here k is the number of anchors: each position has k anchors of different shapes, so k groups of foreground/background two-class template features are obtained.

The matching template features and the search area features are likewise input into the regression branch of the RPN network and processed in the same way as in the classification branch, yielding a 17 × 17 × 4k regression response map, where 4 denotes the offsets in four directions, that is, the fine adjustment of the tracking frame obtained by regression.
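The sizes quoted above are consistent with an unpadded, stride-1 correlation: sliding a 4 × 4 kernel over a 20 × 20 map gives 20 − 4 + 1 = 17 positions per side. The NumPy sketch below checks this with random data. The value k = 5, the function name `correlate`, and the 4 × 4 × (4k × 256) kernel size for the regression branch (inferred from "processed in the same way") are assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def correlate(search_feat, template_kernels):
    """Valid cross-correlation of a feature map with template-derived kernels.

    search_feat: (H, W, C) search area frame features
    template_kernels: (n_out, C, kh, kw) template feature frame, one kernel per output channel
    """
    kh, kw = template_kernels.shape[2:]
    # windows has shape (H - kh + 1, W - kw + 1, C, kh, kw)
    windows = sliding_window_view(search_feat, (kh, kw), axis=(0, 1))
    return np.einsum("ijcuv,ncuv->ijn", windows, template_kernels)


k = 5  # assumed number of anchors per position; the text leaves k symbolic
rng = np.random.default_rng(0)
search = rng.standard_normal((20, 20, 256))
cls_kernels = rng.standard_normal((2 * k, 256, 4, 4))
reg_kernels = rng.standard_normal((4 * k, 256, 4, 4))

cls_map = correlate(search, cls_kernels)  # shape (17, 17, 2k)
reg_map = correlate(search, reg_kernels)  # shape (17, 17, 4k)
print(cls_map.shape, reg_map.shape)
```

In a real tracker the kernels would come from the template branch rather than a random generator; only the shape arithmetic is being checked here.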

S4, as shown in fig. 3, the verification network performs tracking state evaluation on the matching tracking result of the current frame to obtain a tracking result score.

S5, setting a template updating condition and a search area transformation condition, and judging the tracking result score. A dual-threshold control method is introduced in order to judge the state of the tracked target in the current frame during long-term tracking and to change the tracking operation accordingly.

The first threshold is the template updating threshold. During long-term tracking, the appearance of the target changes because of factors such as the observation angle and the motion of the target. In experiments, when the appearance of the tracked target changes gradually, the twin network can track adaptively over a short time, but if the first-frame template is used for matching and tracking over a long time, the tracking frame drifts. The template updating threshold is introduced to determine the point in time at which the template needs updating. When the tracking score is smaller than the template updating threshold, the target appearance has probably changed, and the template of the tracking twin network and the parameters of the verification network are updated together.

The second threshold is the search area transformation threshold. During long-term tracking, the target may be occluded or may leave the field of view; if the template is wrongly updated at such a moment, it becomes contaminated and subsequent tracking fails. Experiments show that in most cases the appearance of the target after reappearing differs little from its appearance before occlusion, so matching can succeed by relying on the generalization of the twin network. The main cause of failure is that the position at which the target reappears is uncertain and often lies outside the search area, so a search area transformation threshold is introduced.

When the score is greater than the template updating threshold for N consecutive frames, the target appearance in the current frame has changed little relative to the initial frame, so matching tracking simply continues with the next frame.

When the score is smaller than the template updating threshold and greater than the search area transformation threshold for N consecutive frames, the matching tracking result of the current frame is fairly accurate, but the target appearance differs considerably from its appearance in the initial frame, and continuing to match against the old template may cause tracking errors. The twin network and the verification network are therefore reinitialized in the same way as for the first frame, with the matching tracking result of the previous frame as the initialization input; the matching template and the verification network parameters are updated in turn, and tracking resumes once the update is finished.

In the embodiment of the invention, unlike general MDNet-based tracking algorithms, the verification network is used only to verify the tracking result and not to predict the position of the target. During tracking, tracking results whose score is greater than the template updating threshold are saved as positive samples, and a network updating operation is performed when the score falls below the template updating threshold. Positive and negative samples are generated in the same way as at initialization: positive samples are taken from the most recent 100 successfully tracked frames, and negative samples are generated from the most recent 20 successfully tracked frames. The samples are fed into the verification network for iterative training at a positive-to-negative ratio of 1:3, with 10 iterations and a batch size of 128. When the iterations finish, the update of the verification network is complete, and verification of the tracking results of subsequent frames continues.
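With a batch size of 128 and a positive-to-negative ratio of 1:3, each training batch holds 32 positive and 96 negative samples. The sketch below shows how the 10 batches could be assembled from the stored sample pools; `make_batches` is a hypothetical helper, and sampling with replacement is an assumption, since the text does not describe how samples are drawn.

```python
import random


def make_batches(pos_pool, neg_pool, iterations=10, batch_size=128,
                 pos_to_neg=(1, 3), seed=0):
    """Yield (positives, negatives) pairs for each update iteration."""
    rng = random.Random(seed)
    p, n = pos_to_neg
    n_pos = batch_size * p // (p + n)   # 32 for a 1:3 ratio and batch size 128
    n_neg = batch_size - n_pos          # 96
    for _ in range(iterations):
        # Sampling with replacement (assumption).
        pos = rng.choices(pos_pool, k=n_pos)
        neg = rng.choices(neg_pool, k=n_neg)
        yield pos, neg
```

Each yielded pair would feed one training step of the verification network; the training step itself is outside the scope of this sketch.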

When the score is smaller than the search area transformation threshold for N consecutive frames, the target has probably been occluded by an obstacle or has left the field of view. The search area is then expanded from the previous search area to the whole frame. Matching tracking is carried out over the full image while the verification network scores the matching result; when the score is higher than the search area transformation threshold, the search area is restored from the whole frame to its original size and subsequent tracking continues. The process ends when all frames in the video to be processed have been processed.
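The switch between local and global search can be modeled as a two-state machine. The sketch below is a hypothetical illustration: the class name `SearchArea` is invented, and restoring the local search area as soon as a single global score exceeds the threshold is one reading of the text, which does not say whether several consecutive high scores are required.

```python
class SearchArea:
    """Tracks whether matching runs in the local search area or over the whole frame."""

    def __init__(self, n, search_thr):
        self.n = n                    # number of consecutive low scores required
        self.search_thr = search_thr  # search area transformation threshold
        self.low_streak = 0
        self.is_global = False

    def update(self, score):
        """Feed one verification score; return True while the search is global."""
        if self.is_global:
            # Assumption: one score above the threshold restores local search.
            if score > self.search_thr:
                self.is_global = False
                self.low_streak = 0
        else:
            self.low_streak = self.low_streak + 1 if score < self.search_thr else 0
            if self.low_streak >= self.n:
                self.is_global = True
        return self.is_global
```

With N = 3 and a hypothetical threshold of 0.3, three consecutive scores below 0.3 switch the tracker to global search, and the first later score above 0.3 switches it back.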

In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A long-term tracking method controlled by double thresholds is characterized by comprising the following specific steps:

S1, constructing a dual-network structure consisting of a twin network and a verification network, wherein the twin network comprises a feature extraction network and an RPN network; calibrating an initial frame to obtain a target and its target position frame;

S2, initializing the twin network by using the target position frame corresponding to the initial frame to obtain a search area and a matching template of the initial frame; meanwhile, initializing the verification network and updating the parameters of the verification network;

with the initial frame as the current frame to be processed, S3 is executed;

S3, using the feature extraction network to perform multi-scale feature extraction on the current frame within the search area updated in the previous frame, obtaining matching template features and search area features; the RPN network performs a matching operation on the matching template features and the search area features to obtain a matching tracking result; the matching tracking result is sent to the verification network;

S4, the verification network evaluates the tracking state of the matching tracking result of the current frame to obtain a tracking result score;

when the tracking result score is greater than the template updating threshold for N consecutive frames, the current search area and the current matching template are not updated, the next frame is taken directly as the current frame to be processed, and the process returns to S3;

when the tracking result score is less than the template updating threshold and greater than the search area transformation threshold for N consecutive frames, the twin network and the verification network are reinitialized using the target position given by the matching tracking result, the parameters of the verification network are updated, and the current matching template is updated; the next frame is then taken as the current frame to be processed, and the process returns to S3;

when the tracking result score is less than the search area transformation threshold for N consecutive frames, the search area is expanded to the whole frame and the twin network performs global matching tracking over it, while the verification network scores the global matching tracking result, and when the global matching score is higher than the search area transformation threshold, the search area is restored from the whole frame to its original size; the next frame is then taken as the current frame to be processed, and the process returns to S3;

and the process ends when all frames in the video to be processed have been processed.

2. The method of claim 1, wherein the twin network is initialized by using the target position corresponding to the initial frame to obtain the search area and the matching template of the initial frame, and the verification network is initialized to update the verification network parameters, and the specific method is as follows:

the twin network initialization method comprises the following steps: extracting the features of the target as the matching template; taking the range of four times the target position frame, around that frame, as the search area of the initial frame;

the verification network initialization method comprises the following steps: randomly generating, by a Gaussian distribution method, position frames whose intersection-over-union (IoU) with the target position frame is greater than 0.7 as positive samples, and generating, by a uniform random method, position frames whose IoU with the target position frame is less than 0.3 as negative samples; and feeding the positive samples and negative samples into the verification network for training to initialize the parameters of the verification network.

3. The method of claim 2, wherein the number of positive samples is 500 and the number of negative samples is 5000.

4. The method of claim 1, wherein the multi-scale feature extraction is performed in the search area updated in the previous frame by using a feature extraction network to obtain the matching template features and the search area features, and the specific method is as follows:

the matching template is passed through a convolutional neural network to obtain 6 × 6 × 256 matching template features; the search area is passed through a convolutional neural network to obtain 22 × 22 × 256 search area features.

5. The method of claim 4, wherein the RPN performs matching operation according to the matching template features and the search area features to obtain the matching tracking result, and the specific method is as follows:

the matching template features and the search area features are input into the classification branch of the RPN network to obtain a classification response map;

the matching template features and the search area features are input into the regression branch of the RPN network to obtain a regression response map;