CN114757967A - Multi-scale anti-occlusion target tracking method based on manual feature fusion - Google Patents

Multi-scale anti-occlusion target tracking method based on manual feature fusion

Info

Publication number
CN114757967A
CN114757967A (application CN202210288518.4A)
Authority
CN
China
Prior art keywords
target
response
frame
scale
index
Prior art date
Legal status
Pending
Application number
CN202210288518.4A
Other languages
Chinese (zh)
Inventor
白永强
李乐
陈杰
窦丽华
邓方
甘明刚
蔡涛
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202210288518.4A priority Critical patent/CN114757967A/en
Publication of CN114757967A publication Critical patent/CN114757967A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence


Abstract

The invention provides a multi-scale anti-occlusion target tracking method based on manual feature fusion that is highly robust, computationally light, and able to run on most tracking platforms. A scale pool supports multi-resolution sampling, so scale changes of the target during motion are handled quickly. Fusing the response results of the two feature filters according to the average peak-to-correlation energy (APCE) index combines the advantages of the HOG and CN features and improves the tracker's discriminative power. The tracking result of each frame is checked for occlusion, and a dedicated SVM re-detector searches the area near the target's last known position, improving the tracker's resistance to occlusion. The tracking results of the two filters are evaluated independently, and each learning rate is adjusted according to its APCE index, realizing separate adaptive updating.

Description

Multi-scale anti-occlusion target tracking method based on manual feature fusion
Technical Field
The invention relates to the technical field of digital image processing, in particular to a multi-scale anti-occlusion target tracking method based on manual feature fusion.
Background
In everyday applications of computer vision technology, it is often necessary to stably track a specific target in a video sequence and obtain information such as its position and size in the frame in real time.
Currently, mainstream target tracking algorithms fall into two categories. The first is correlation-filter-based target tracking, which is fast and tracks well. Its basic principle is to train a filter template on the initial frame, compute the similarity between candidate samples and the filter template in subsequent frames, output a response map, and take the position of the maximum response value as the target position. The second is deep-learning-based target tracking, which emerged later but has developed very rapidly. Its basic idea is to separate the target from the background using the highly expressive depth features extracted by a convolutional neural network. Early deep-learning trackers used a pre-trained convolutional neural network to extract depth features and then combined them with correlation filtering to train the tracker. The Siamese-network tracker, a deep-learning method popular in recent years, performs similarity matching between the target and candidate boxes in subsequent frames and selects the candidate-box position with the highest similarity as the target position; it performs well in both robustness and real-time operation. However, deep-learning-based trackers usually need GPU support, place high demands on hardware, and are difficult to run on platforms with limited power and cost budgets, which restricts their range of application.
Common tracking tasks frequently involve deformation, scale change, occlusion, and similar conditions, so a tracking algorithm must be highly robust; at the same time, to acquire target information in real time, the algorithm cannot be too complex and must meet real-time requirements. Although the traditional correlation-filtering tracker has a large speed advantage, its performance in common complex scenes still needs improvement. To improve robustness, many improved methods have been proposed in recent years, but they usually sacrifice real-time performance, losing the fundamental advantage of correlation-filtering tracking and limiting its range of application.
Disclosure of Invention
In view of the above, the invention provides a multi-scale anti-occlusion target tracking method based on manual feature fusion that achieves strong robustness with a small amount of computation and can run on most tracking platforms.
To achieve this aim, the invention provides a multi-scale anti-occlusion target tracking method based on manual feature fusion. Based on a correlation filtering model, two independent filters are trained using the histogram of oriented gradients (HOG) feature and the color naming (CN) feature. The target scale is determined through multi-resolution sampling, the two response results are evaluated and adaptively fused, and the fused result is checked for occlusion. If occlusion is judged to have occurred, a re-detector is started; otherwise the target position is determined from the fused response, the learning rate of each filter is adjusted according to the APCE value of its response, and the model is updated. The method then checks whether the current frame is the last frame: if so, it terminates; otherwise it returns to multi-resolution sampling to determine the target scale. During re-detection, the method advances to the next frame; if the target is not detected, it keeps advancing frame by frame until the target is detected and the target position is updated. It then checks whether the current frame is the last frame, terminating if so and otherwise returning to multi-resolution sampling to determine the target scale.
The method comprises the following steps:
step 1, initializing an HOG characteristic filter and a CN characteristic filter according to initial information and a kernel correlation filtering principle;
step 2, multi-resolution sampling, HOG characteristics are extracted, and a HOG characteristic filter is used for calculating response to determine the current scale;
step 3, extracting CN characteristics according to the size determined in the step 2, and calculating the response result of the CN filter;
step 4, performing self-adaptive fusion on the response results of the step 2 and the step 3;
step 5, evaluating the fused response result by using an APCE index, judging that the target is shielded when the index is lower than a set threshold value, starting a redetector to redetect, entering the next frame, if the target is not detected, entering the next frame again until the target is detected, updating the position of the target, judging whether the frame is the last frame, if the frame is the last frame, ending, otherwise, returning to execute the step 2; when the index is higher than or equal to the threshold value, judging that the target is not shielded, and executing the step 6;
step 6, determining the position of the target according to the fused response results, adjusting respective learning rates according to the APCE values of the response results of the two filters, and updating the model; and (4) judging whether the frame is the last frame or not, if so, ending, otherwise, returning to execute the step (2).
When the learning rate is adjusted according to the APCE index of each response, a threshold a and a threshold b are set, with a > b. When the evaluation index is above threshold a, the learning rate is fixed at 0.01; when the evaluation index lies between thresholds a and b, the learning rate is dynamically adjusted according to the index, varying between 0 and 0.01; when the evaluation index is below threshold b, the model is not updated.
The specific determination of the target scale through multi-resolution sampling is as follows:
The HOG feature filter obtained by training in step 1 is used; a scale pool of [1.05, 1, 0.95] is designed on the basis of the original scale, corresponding to 1.05 times the original size, the original size, and 0.95 times the original size respectively. Samples are taken at the three resolutions, HOG features are extracted, responses are computed, and the scale corresponding to the maximum response value is taken as the current scale.
In step 4, the response results computed in steps 2 and 3 are evaluated according to the average peak-to-correlation energy index, and the two response results are adaptively fused according to the evaluation indexes: when the two indexes differ greatly, the more reliable response result is selected as the final response result; when the difference is small, linear weighted fusion is performed.
Wherein the re-detector is designed according to the principle of a linear soft-margin support vector machine.
Advantageous effects
Multi-resolution sampling with the designed scale pool quickly handles scale changes of the target during motion. Fusing the response results of the two feature filters according to the average peak-to-correlation energy index combines the advantages of the HOG and CN features and improves the tracker's discriminative power. The tracking result of each frame is checked for occlusion, and a dedicated SVM re-detector searches the area near the target's last known position, improving the tracker's resistance to occlusion. The tracking results of the two filters are evaluated independently, and each learning rate is adjusted according to its APCE index, realizing separate adaptive updating.
Drawings
FIG. 1 is a schematic flow chart of a multi-scale anti-occlusion target tracking algorithm based on manual feature fusion in an embodiment of the present invention.
Fig. 2(a) shows an input image of a certain frame in the embodiment of the present invention.
Fig. 2(b) (c) (d) are schematic diagrams of multi-resolution sampling in the embodiment of the present invention.
FIG. 3(a) is a diagram of a certain frame under normal tracking conditions in the embodiment of the present invention.
Fig. 3(b) shows the result of the HOG signature filter response in the embodiment of the present invention.
Fig. 3(c) shows the response result of the CN characteristic filter in the embodiment of the present invention.
Fig. 3(d) shows the response result after fusion in the embodiment of the present invention.
FIG. 4(a) is a frame when occlusion occurs in the embodiment of the present invention.
Fig. 4(b) shows the response result of the HOG feature filter when occlusion occurs in the embodiment of the present invention.
FIG. 4(c) is the response result of the CN feature filter when occlusion occurs in the embodiment of the present invention.
FIG. 4(d) is the fused response result when occlusion occurs in the embodiment of the present invention.
Fig. 4(e) is a diagram illustrating the re-detection of the SVM re-detector.
Detailed Description
The invention is described in detail below by way of example with reference to the accompanying drawings.
The method is based on correlation filtering and adopts an anti-occlusion target tracking algorithm based on manual feature fusion, combining the advantages of shape features and color features. A simple and efficient scale-handling method achieves accurate prediction of scale changes. The tracking result of each frame is evaluated; once the target is judged lost, re-detection is started to search for it again. The two filters are updated independently according to their respective response results, with the learning rate adjusted adaptively, improving the anti-interference capability of the tracking algorithm. The method is suitable for target tracking problems in computer vision applications and can be widely applied in fields such as intelligent monitoring systems, human-computer interaction, and automatic driving.
The multi-scale anti-occlusion target tracking method based on manual feature fusion initializes a histogram of oriented gradients (HOG) feature filter and a color naming (CN) feature filter following the kernelized correlation filtering principle, according to the tracking-target information selected by a manual bounding box in the initial frame. In each subsequent frame, the method first performs multi-resolution sampling with the HOG feature filter to determine the optimal scale, then solves for the response of the CN feature filter, evaluates the responses of the two filters with the average peak-to-correlation energy (APCE) index, and fuses them adaptively. The final fusion result is evaluated, and if occlusion is judged to have occurred, the support vector machine is started for re-detection. The HOG feature filter and the CN feature filter are updated independently, with each learning rate adjusted according to the APCE index of its response. Two thresholds are designed empirically, a high threshold a and a low threshold b: when the evaluation index is above threshold a, the learning rate is fixed at 0.01; when the evaluation index lies between thresholds a and b, the learning rate is dynamically adjusted according to the index, varying between 0 and 0.01; when the evaluation index is below threshold b, the model is not updated. The implementation flowchart of this embodiment is shown in FIG. 1, and the specific steps are as follows:
Step 1, filter initialization:
in the first frame of the video sequence, the initialization of the histogram of oriented gradients feature filter and the color naming feature filter is first performed. In the initial frame, the initial position and size of the target are determined by manually framing the target. According to the initial frame target information and the principle of a KCF tracker, HOG characteristics and CN characteristics are respectively extracted to train two filters independently.
Wherein the optimization problem of the filter is
$$\min_{w}\ \sum_{i}\left(f(x_{i}) - y_{i}\right)^{2} + \lambda\,\lVert w\rVert^{2},\qquad f(x) = w^{T}x$$
where x denotes a sample, y its label, w the weight vector, and λ a regularization parameter that controls model complexity and prevents overfitting. Writing the above formula in matrix form gives
$$\min_{w}\ \lVert Xw - y\rVert^{2} + \lambda\,\lVert w\rVert^{2}$$
where X = [x_1, x_2, ..., x_n]^T and y is a column vector composed of the sample labels. The closed-form solution in the complex field is w = (X^H X + λI)^{-1} X^H y. The samples are projected into a high-dimensional space using a Gaussian kernel function, whose expression is
$$\kappa(x, x') = \exp\!\left(-\frac{\lVert x - x'\rVert^{2}}{\sigma^{2}}\right)$$
The solution of the kernelized filter optimization problem is α = (K + λI)^{-1} y. Using the diagonalization property of circulant matrices, the filter parameters are expressed in the Fourier domain as
$$\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda}$$

where ŷ is the Fourier transform of the sample labels and k^{xx} is the first row of the kernel matrix K = C(k^{xx}).
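As an illustration of the training step above, the following sketch implements the Fourier-domain solution α̂ = ŷ/(k̂^{xx} + λ) with a Gaussian kernel correlation. The patch size, σ, and λ values are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def gaussian_kernel_correlation(x1, x2, sigma=0.5):
    """Kernel correlation k^{x1 x2} for all cyclic shifts of two 2-D
    patches, computed via FFTs (as in kernelized correlation filters)."""
    c = np.fft.ifft2(np.fft.fft2(x1) * np.conj(np.fft.fft2(x2))).real
    d2 = (np.sum(x1 ** 2) + np.sum(x2 ** 2) - 2.0 * c) / x1.size
    return np.exp(-np.maximum(d2, 0.0) / (sigma ** 2))

def train_filter(x, y, lam=1e-4):
    """Solve alpha_hat = y_hat / (k_hat^{xx} + lambda) in the Fourier domain."""
    kxx = gaussian_kernel_correlation(x, x)
    return np.fft.fft2(y) / (np.fft.fft2(kxx) + lam)

def detect(alpha_hat, x_model, z):
    """Response map of a trained filter on a candidate patch z."""
    kxz = gaussian_kernel_correlation(x_model, z)
    return np.fft.ifft2(alpha_hat * np.fft.fft2(kxz)).real
```

Running `detect` on the training patch itself should reproduce a response close to the Gaussian label, with the peak at the label's center.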
Step 2, multi-resolution sampling, HOG feature extraction, and response calculation by using an HOG feature filter to determine the current scale:
The invention designs a scale pool to handle the scale problem. The scale pool is [0.95, 1, 1.05], corresponding to a slightly smaller scale, an unchanged scale, and a slightly larger scale. To reduce computation, only the HOG feature is extracted during multi-resolution sampling; the response is computed with the HOG feature filter, and the scale corresponding to the maximum response value is taken as the current scale.
FIG. 2(a) is a frame of the input image during tracking, and FIG. 2(b) illustrates multi-resolution sampling. At the position determined in the previous frame, with the search size determined in the previous frame as the base size, sampling is performed at three resolutions: 0.95 times, 1 times, and 1.05 times the base size. After the HOG features are extracted, the response of the HOG feature filter is computed, and the size corresponding to the maximum response value is taken as the current target size. Note that the search area of a correlation filter includes both the target itself and background, so the search area in FIG. 2(b) is larger than the tracked target.
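The scale-pool search above can be sketched as follows. `respond` stands in for the trained HOG filter (a hypothetical callback that maps a fixed-size patch to a response map), and the nearest-neighbour resize is a crude stand-in for a proper image resize.

```python
import numpy as np

def _resize(patch, size):
    """Nearest-neighbour resize to a fixed template size (illustrative)."""
    ys = np.linspace(0, patch.shape[0] - 1, size[0]).astype(int)
    xs = np.linspace(0, patch.shape[1] - 1, size[1]).astype(int)
    return patch[np.ix_(ys, xs)]

def select_scale(frame, center, base_size, respond, scale_pool=(0.95, 1.0, 1.05)):
    """Sample the search window at three resolutions and keep the scale
    whose filter response peaks highest."""
    cy, cx = center
    best_scale, best_peak = 1.0, -np.inf
    for s in scale_pool:
        h, w = int(base_size[0] * s), int(base_size[1] * s)
        y0, x0 = max(0, cy - h // 2), max(0, cx - w // 2)
        patch = frame[y0:y0 + h, x0:x0 + w]
        # every candidate is resized back to the filter's template size
        peak = respond(_resize(patch, base_size)).max()
        if peak > best_peak:
            best_scale, best_peak = s, peak
    return best_scale
```

In a full tracker, `respond` would extract HOG features and evaluate the HOG filter; here any callable with the same shape contract will do.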
Step 3, CN features are extracted according to the size determined in step 2, and the response result of the CN filter is calculated.
The histogram of oriented gradients (HOG) feature mainly represents the contour of the target, while the color naming (CN) feature maps the 3-channel color values of an RGB image to 11 color channels: black, blue, brown, gray, green, orange, pink, purple, red, white, and yellow, expressing the target's color well. In the initial frame of the image sequence, the invention trains the two filters independently with the HOG and CN features according to the kernelized correlation filter principle. During tracking, the responses of the two filters are evaluated.
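A rough illustration of the 11-channel CN mapping described above. The real CN descriptor uses a probabilistic lookup table learned from image data, so the nearest-prototype assignment and the RGB prototype values here are simplifying assumptions.

```python
import numpy as np

# Hypothetical prototype colors for the 11 CN channels (illustrative values).
CN_PROTOTYPES = {
    "black": (0, 0, 0), "blue": (0, 0, 255), "brown": (139, 69, 19),
    "gray": (128, 128, 128), "green": (0, 128, 0), "orange": (255, 165, 0),
    "pink": (255, 192, 203), "purple": (128, 0, 128), "red": (255, 0, 0),
    "white": (255, 255, 255), "yellow": (255, 255, 0),
}

def cn_features(image):
    """Map an (H, W, 3) RGB image to an (H, W, 11) one-hot color-name map
    by nearest prototype (a crude stand-in for the learned CN mapping)."""
    protos = np.array(list(CN_PROTOTYPES.values()), dtype=float)   # (11, 3)
    d = np.linalg.norm(image[..., None, :] - protos, axis=-1)      # (H, W, 11)
    onehot = np.zeros_like(d)
    idx = d.argmin(axis=-1)
    h, w = image.shape[:2]
    onehot[np.arange(h)[:, None], np.arange(w)[None, :], idx] = 1.0
    return onehot
```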
Step 4, the response results of steps 2 and 3 are adaptively fused.
The present invention evaluates the responses of the HOG feature filter and the CN feature filter using the average peak-to-correlation energy (APCE) index, defined as follows:
$$\mathrm{APCE} = \frac{\left|y_{\max} - y_{\min}\right|^{2}}{\operatorname{mean}\left(\sum_{w,h}\left(y_{w,h} - y_{\min}\right)^{2}\right)}$$
wherein y ismaxRefers to the maximum response value, yminRefers to the minimum response value, yw,hAnd (3) referring to a response value at (w, h), wherein the APCE index reflects the reliability degree of a tracking result, the more reliable the tracking result is, the closer the response graph is to an ideal two-dimensional Gaussian distribution, and the larger the APCE index is. And when the APCE indexes of the response results of the two filters are greatly different, selecting the most reliable response result as the final response result, and performing linear weighted fusion when the difference between the two response results is not large. The formula used for fusion is as follows:
$$y_{\mathrm{fused}} = \begin{cases} y_{\mathrm{HOG}}, & \mathrm{APCE\_HOG} - \mathrm{APCE\_CN} > T \\ y_{\mathrm{CN}}, & \mathrm{APCE\_CN} - \mathrm{APCE\_HOG} > T \\ \dfrac{\mathrm{APCE\_HOG}\cdot y_{\mathrm{HOG}} + \mathrm{APCE\_CN}\cdot y_{\mathrm{CN}}}{\mathrm{APCE\_HOG} + \mathrm{APCE\_CN}}, & \text{otherwise} \end{cases}$$

where T is the threshold on the index difference that switches between selection and weighted fusion.
Wherein APCE _ HOG refers to the APCE value of the HOG characteristic filter response result, and APCE _ CN refers to the APCE value of the CN characteristic filter response result. Fig. 3(a) shows a certain frame in the normal tracking situation in the embodiment of the present invention. Fig. 3(b) shows the result of the HOG feature filter response of the current frame in the embodiment of the present invention, and the corresponding APCE index is 45.30. Fig. 3(c) shows the response result of the CN characteristic filter of the current frame in the embodiment of the present invention, and the corresponding APCE index is 54.12. And according to a fusion formula, fusing the response results of the two characteristic filters. Fig. 3(d) shows the fused response result in the embodiment of the present invention, and the corresponding APCE index is 64.76.
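A minimal sketch of the APCE computation and the fusion behaviour described above. The gap threshold and the APCE-proportional blend weights are assumptions, since the text does not publish the exact switching constant or weighting.

```python
import numpy as np

def apce(resp):
    """Average peak-to-correlation energy of a 2-D response map."""
    peak, trough = resp.max(), resp.min()
    return (peak - trough) ** 2 / np.mean((resp - trough) ** 2)

def fuse(resp_hog, resp_cn, gap=15.0):
    """Pick the clearly more reliable map when the APCE gap is large,
    otherwise blend the two with APCE-proportional weights."""
    a_h, a_c = apce(resp_hog), apce(resp_cn)
    if a_h - a_c > gap:
        return resp_hog
    if a_c - a_h > gap:
        return resp_cn
    w = a_h / (a_h + a_c)
    return w * resp_hog + (1.0 - w) * resp_cn
```

A sharp single-peak response map yields a high APCE, while a noisy map yields a low one, so the selection branch fires exactly when one filter is clearly more confident.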
Step 5, occlusion judgment is performed on the final response result to determine whether to carry out re-detection.
During target motion the target's appearance changes, so the re-detector must be able to adapt to appearance changes. The adopted strategy is as follows: the fused response result is evaluated with the APCE index, with a threshold of 20 set empirically. When the index is below the threshold, the target is judged occluded and the support vector machine is started for re-detection; the method advances to the next frame, and if the target is not detected it keeps advancing until the target is detected and the target position is updated. It then checks whether the current frame is the last frame, terminating if so and otherwise returning to step 2. When the index is greater than or equal to the threshold, the target is judged unoccluded and step 6 is executed.
FIG. 4(a) shows a frame in which occlusion occurs in the embodiment of the present invention; at this point the tracking target is completely occluded. FIG. 4(b) shows the response of the HOG feature filter under occlusion: the maximum response value is 0.21 and the APCE value is 12.54. FIG. 4(c) shows the response of the CN feature filter under occlusion: the maximum value is 0.26 and the APCE value is 9.92. FIG. 4(d) shows the fused response under occlusion: the maximum value is 0.21 and the APCE value is 17.14.
FIG. 4(e) illustrates re-detection by the SVM re-detector. A threshold of 20 is set empirically for the APCE index; when the APCE value of the final response falls below this empirical threshold, the target is judged occluded, the SVM detector is started, and the area near the target's last known position is searched until the target is detected.
Step 6, the target position is determined from the fused response result, the learning rate of each filter is adjusted according to the APCE value of its response, and the model is updated.
The HOG feature filter and the CN feature filter are updated independently, with each learning rate adjusted according to the average peak-to-correlation energy index of its response. Two thresholds are designed empirically: the high threshold a is 33 and the low threshold b is 25. When the evaluation index is above threshold a, the learning rate is fixed at 0.01; when the evaluation index lies between thresholds a and b, the learning rate is dynamically adjusted according to the index, varying between 0 and 0.01; when the evaluation index is below threshold b, the model is not updated. The learning rate η adjustment rule is:
$$\eta = \begin{cases} 0.01, & \mathrm{APCE} > a \\ 0.01 \cdot \dfrac{\mathrm{APCE} - b}{a - b}, & b \le \mathrm{APCE} \le a \\ 0, & \mathrm{APCE} < b \end{cases}$$
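The η rule can be sketched directly; the linear interpolation between the two thresholds is one plausible reading of "dynamically adjusted according to the index, varying between 0 and 0.01", not a formula confirmed by the source.

```python
def learning_rate(apce_value, a=33.0, b=25.0, eta_max=0.01):
    """Piecewise learning-rate schedule: fixed above the high threshold,
    linearly scaled between the thresholds, zero (no model update) below
    the low threshold. Defaults use the thresholds quoted in the text."""
    if apce_value > a:
        return eta_max
    if apce_value >= b:
        return eta_max * (apce_value - b) / (a - b)
    return 0.0
```

A returned rate of 0 corresponds to skipping the model update entirely for that filter on that frame.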
The invention also trains a soft-margin support vector machine for target re-detection, using the HOG feature to represent the target. The optimization problem of the support vector machine is:
$$\min_{w,b,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\xi_{i}$$
$$\text{s.t.}\quad y_{i}\left(w^{T}x_{i} + b\right) \ge 1 - \xi_{i},\quad i = 1, 2, \ldots, m$$
$$\xi_{i} \ge 0,\quad i = 1, 2, \ldots, m$$
where ξ_i are slack variables, w is the normal vector of the hyperplane, b is the bias term, and C is the penalty parameter. The Lagrangian of the above problem is obtained by the method of Lagrange multipliers, with multipliers α_i ≥ 0 and β_i ≥ 0:
$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{m}\xi_{i} + \sum_{i=1}^{m}\alpha_{i}\left(1 - \xi_{i} - y_{i}\left(w^{T}x_{i} + b\right)\right) - \sum_{i=1}^{m}\beta_{i}\xi_{i}$$
Setting the partial derivatives of the above formula with respect to w, b, and ξ to 0 yields:
$$w = \sum_{i=1}^{m}\alpha_{i}y_{i}x_{i}$$
$$0 = \sum_{i=1}^{m}\alpha_{i}y_{i}$$
$$C = \alpha_{i} + \beta_{i}$$
Substituting these back yields the dual problem:
$$\max_{\alpha}\ \sum_{i=1}^{m}\alpha_{i} - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_{i}\alpha_{j}y_{i}y_{j}x_{i}^{T}x_{j}$$
$$\text{s.t.}\quad \sum_{i=1}^{m}\alpha_{i}y_{i} = 0$$
$$0 \le \alpha_{i} \le C,\quad i = 1, 2, \ldots, m$$
Solving the above for α gives the soft-margin SVM model:
$$f(x) = \operatorname{sign}\!\left(w^{T}x + b\right) = \operatorname{sign}\!\left(\sum_{i=1}^{m}\alpha_{i}y_{i}x_{i}^{T}x + b\right)$$
During target motion the appearance may change, and the SVM re-detector must adapt to this change; when the evaluation index is high, the SVM re-detector is updated with the following formula:
$$w_{t+1} = w_{t} + \tau_{t}\,y_{i}x_{i}$$
$$\tau_{t} = \frac{l_{t}}{\lVert x\rVert^{2} + \frac{1}{2C}}$$
where w_{t+1} is the updated weight vector, (x_i, y_i) is a sample and its label, l_t is the loss function, and C is the aggressiveness parameter.
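The update above matches the passive-aggressive (PA-II style) online update with hinge loss. The following sketch assumes a linear model over feature vectors and C = 1 by default; both are illustrative choices.

```python
import numpy as np

def hinge_loss(w, x, y):
    """Hinge loss l_t = max(0, 1 - y * w^T x) for label y in {-1, +1}."""
    return max(0.0, 1.0 - y * float(w @ x))

def pa_update(w, x, y, C=1.0):
    """One online step: w_{t+1} = w_t + tau_t * y * x with
    tau_t = l_t / (||x||^2 + 1/(2C))."""
    l = hinge_loss(w, x, y)
    tau = l / (float(x @ x) + 1.0 / (2.0 * C))
    return w + tau * y * x
```

When a sample is already classified with sufficient margin, l_t = 0 and the weights are left unchanged (the "passive" case); otherwise the step size grows with the loss (the "aggressive" case).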
In summary, the above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A multi-scale anti-occlusion target tracking method based on manual feature fusion is characterized in that two independent filters are trained by utilizing directional gradient histogram features and color naming features based on a relevant filtering model, target dimensions are determined through multi-resolution sampling, two response results are evaluated and subjected to self-adaptive fusion, the occlusion judgment is carried out on the final fusion result, if occlusion is determined to occur, a redetector is started to carry out redetection, if occlusion is not determined to occur, the position of a target is determined according to the fused response result, respective learning rate adjustment is carried out according to APCE values of the response results of the two filters, model updating is carried out, whether the frame is the last frame is judged, if the frame is the last frame, the operation is finished, and if not, multi-resolution sampling is returned to determine the target dimensions; and entering the next frame during re-detection, if the target is not detected, entering the next frame again until the target is detected, updating the position of the target, then judging whether the target is the last frame, if the target is the last frame, ending, and otherwise, returning to perform multi-resolution sampling again to determine the target scale.
2. The method of claim 1, comprising the steps of:
step 1, initializing an HOG characteristic filter and a CN characteristic filter according to initial information and a kernel correlation filtering principle;
step 2, multi-resolution sampling is carried out, HOG characteristics are extracted, and a HOG characteristic filter is used for calculating response to determine the current scale;
step 3, according to the size determined in the step 2, extracting CN characteristics and calculating the response result of the CN filter;
step 4, carrying out self-adaptive fusion on the response results of the step 2 and the step 3;
step 5, evaluating the fused response result by using an APCE index, judging that the target is shielded when the index is lower than a set threshold value, starting a redetector to redetect, entering the next frame, if the target is not detected, entering the next frame again until the target is detected, updating the position of the target, judging whether the frame is the last frame, if the frame is the last frame, ending, otherwise, returning to execute the step 2; when the index is higher than or equal to the threshold value, judging that the target is not shielded, and executing the step 6;
step 6, determining the position of the target according to the fused response results, adjusting respective learning rates according to the APCE values of the response results of the two filters, and updating the model; and (4) judging whether the frame is the last frame or not, if so, ending, otherwise, returning to execute the step (2).
3. The method according to claim 1 or 2, wherein, when the learning rate is adjusted according to the APCE index of each response, a threshold a and a threshold b are set with a > b; the learning rate is fixed at 0.01 when the evaluation index is above threshold a; the learning rate is dynamically adjusted according to the index, varying between 0 and 0.01, when the evaluation index is between thresholds a and b; and no model update is performed when the evaluation index is below threshold b.
4. The method according to claim 1, wherein determining the target scale by multi-resolution sampling specifically comprises:
using the HOG feature filter trained in step 1, designing a scale pool [1.05, 1, 0.95] on the basis of the current scale, corresponding respectively to 1.05 times the original size, the original size, and 0.95 times the original size; sampling and extracting HOG features at the three resolutions, computing the responses, and taking the scale corresponding to the maximum response value as the current scale.
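The scale-pool selection of claim 4 can be sketched as follows; `response_at_scale` is a hypothetical callback standing in for the crop–extract-HOG–correlate pipeline, which the claim describes but does not detail:

```python
import numpy as np

def select_scale(response_at_scale, scale_pool=(1.05, 1.0, 0.95)):
    """Evaluate the HOG-filter response at each scale in the pool and
    return the scale whose response peak is largest (claim 4).
    `response_at_scale(s)` is assumed to resample the search window at
    relative scale s, extract HOG features, and correlate them with the
    trained filter, returning the resulting response map."""
    peaks = [response_at_scale(s).max() for s in scale_pool]
    return scale_pool[int(np.argmax(peaks))]
```

Because the pool is relative (±5% around the current size), the tracked scale drifts smoothly across frames rather than jumping.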
5. The method according to claim 1, wherein, in step 4, the response results computed in steps 2 and 3 are each evaluated with the average peak-to-correlation energy (APCE) index, and adaptive fusion of the two response results is performed according to these evaluation indexes: when the two indexes differ greatly, the more reliable response result is selected as the final response result; when the difference is small, linear weighted fusion is performed.
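One way to realize the switch-or-blend rule of claim 5 is sketched below; the relative-gap threshold and APCE-proportional weights are illustrative assumptions, since the claim leaves the exact "large difference" criterion and weighting open:

```python
import numpy as np

def apce(r):
    """APCE reliability index of a response map."""
    m = r.min()
    return (r.max() - m) ** 2 / np.mean((r - m) ** 2)

def fuse_responses(r_hog, r_cn, gap=0.5):
    """Adaptive fusion of the HOG and CN response maps by APCE (claim 5).
    `gap` (relative APCE difference counted as 'large') is a hypothetical
    tuning parameter, as are the normalized-APCE weights."""
    a_h, a_c = apce(r_hog), apce(r_cn)
    total = a_h + a_c
    if abs(a_h - a_c) / total > gap:          # indexes differ greatly:
        return r_hog if a_h > a_c else r_cn   # keep the reliable map only
    w = a_h / total                           # close: linear weighting
    return w * r_hog + (1.0 - w) * r_cn
```

Hard switching discards a degraded cue outright (e.g. CN under illumination change), while blending exploits both cues when each is trustworthy.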
6. The method of claim 1, wherein the re-detector is designed according to the principle of a linear soft-margin support vector machine.
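A minimal stand-in for such a re-detector is sketched below as a soft-margin linear SVM trained by hinge-loss sub-gradient descent; in the tracker it would be trained on target-versus-background feature vectors and used to re-score candidate windows after occlusion. The training rule, parameters, and helper names are illustrative, not taken from the patent:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=500):
    """Soft-margin linear SVM via sub-gradient descent on the
    regularized hinge loss; labels y must be in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        mask = y * (X @ w + b) < 1                 # margin-violating samples
        grad_w = w - C * (y[mask, None] * X[mask]).sum(axis=0) / n
        grad_b = -C * y[mask].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def svm_score(w, b, X):
    """Decision values for candidate windows; the highest-scoring
    window would be taken as the re-detected target."""
    return X @ w + b
```

The soft margin (penalty C) matters here because target and background patches are rarely linearly separable in feature space.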
CN202210288518.4A 2022-03-22 2022-03-22 Multi-scale anti-occlusion target tracking method based on manual feature fusion Pending CN114757967A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210288518.4A CN114757967A (en) 2022-03-22 2022-03-22 Multi-scale anti-occlusion target tracking method based on manual feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210288518.4A CN114757967A (en) 2022-03-22 2022-03-22 Multi-scale anti-occlusion target tracking method based on manual feature fusion

Publications (1)

Publication Number Publication Date
CN114757967A true CN114757967A (en) 2022-07-15

Family

ID=82327910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210288518.4A Pending CN114757967A (en) 2022-03-22 2022-03-22 Multi-scale anti-occlusion target tracking method based on manual feature fusion

Country Status (1)

Country Link
CN (1) CN114757967A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036740A (en) * 2023-08-04 2023-11-10 上海第二工业大学 Anti-occlusion tracking method for moving target

Similar Documents

Publication Publication Date Title
CN111797716B (en) Single target tracking method based on Siamese network
CN106803247B (en) Microangioma image identification method based on multistage screening convolutional neural network
CN110458038B (en) Small data cross-domain action identification method based on double-chain deep double-current network
CN112184752A (en) Video target tracking method based on pyramid convolution
CN109741318B (en) Real-time detection method of single-stage multi-scale specific target based on effective receptive field
CN110059586B (en) Iris positioning and segmenting system based on cavity residual error attention structure
CN108268859A (en) A kind of facial expression recognizing method based on deep learning
CN108182447B (en) Adaptive particle filter target tracking method based on deep learning
CN111582086A (en) Fatigue driving identification method and system based on multiple characteristics
CN107392919B (en) Adaptive genetic algorithm-based gray threshold acquisition method and image segmentation method
CN111008639B (en) License plate character recognition method based on attention mechanism
CN110569782A (en) Target detection method based on deep learning
CN110298297A (en) Flame identification method and device
CN106023257A (en) Target tracking method based on rotor UAV platform
CN107909008A (en) Video target tracking method based on multichannel convolutive neutral net and particle filter
CN109903339B (en) Video group figure positioning detection method based on multi-dimensional fusion features
CN112329784A (en) Correlation filtering tracking method based on space-time perception and multimodal response
CN113327272B (en) Robustness long-time tracking method based on correlation filtering
CN113592894B (en) Image segmentation method based on boundary box and co-occurrence feature prediction
Yu et al. Improvement of face recognition algorithm based on neural network
CN110633727A (en) Deep neural network ship target fine-grained identification method based on selective search
CN110728694A (en) Long-term visual target tracking method based on continuous learning
Song et al. Feature extraction and target recognition of moving image sequences
CN111199245A (en) Rape pest identification method
CN114757967A (en) Multi-scale anti-occlusion target tracking method based on manual feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination