CN113838093B - Self-adaptive multi-feature fusion tracking method based on spatial regularization correlation filter - Google Patents

Self-adaptive multi-feature fusion tracking method based on spatial regularization correlation filter

Info

Publication number
CN113838093B
CN113838093B
Authority
CN
China
Prior art keywords
tracking
target
correlation filter
feature
frame image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111121555.8A
Other languages
Chinese (zh)
Other versions
CN113838093A (en)
Inventor
Liu Bing (刘冰)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202111121555.8A priority Critical patent/CN113838093B/en
Publication of CN113838093A publication Critical patent/CN113838093A/en
Application granted granted Critical
Publication of CN113838093B publication Critical patent/CN113838093B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a self-adaptive multi-feature fusion tracking method based on a spatial regularization correlation filter, which comprises the steps of: extracting the depth convolution features of a target with a residual convolution network; extracting the manual features of the target with a gradient histogram and a gray-scale map; fusing the depth convolution features and the manual features; sending the fusion features of the first frame image and the fusion features of the current frame tracking result into two spatial regularization correlation filters respectively; and accumulating and summing the confidence values of the two spatial regularization correlation filters to obtain the tracking result of frame t+1. The invention can express the appearance of the target effectively, accurately and robustly, and improves tracking precision and robustness while reducing the calculation cost, thereby improving the tracking performance of the tracking algorithm.

Description

Self-adaptive multi-feature fusion tracking method based on spatial regularization correlation filter
Technical Field
The invention relates to the field of computer vision, in particular to a self-adaptive multi-feature fusion tracking method based on a spatial regularization correlation filter.
Background
Visual object tracking is a key problem in the field of computer vision, particularly the continuous tracking of objects in video sequences. It touches almost every aspect of real-life scenarios, from human-machine interaction to autonomous driving, and has attracted attention in practical applications such as medical diagnosis, activity recognition and video surveillance. Most importantly, continuously monitoring moving objects with visual object tracking technology has a wider range of applications than searching for objects in static images, which makes visual object tracking a very active research area in computer vision, with many new tracking algorithms proposed each year.
After an initial bounding box is specified for a target object in the first frame of the video sequence, the visual target tracking algorithm is responsible for estimating the bounding box of this target object in the subsequent consecutive video frames. A good tracking algorithm needs to achieve a balance between accuracy, robustness and efficiency in all practical application fields. Although tracking performance has improved significantly over the last decade, visual object tracking remains a challenging task due to occlusion, scale changes, illumination changes, low resolution, fast motion, motion blur and deformation in video scenes.
Currently, although various tracking algorithms have been proposed, most share similar components. In all tracking algorithms, one of the key factors for improving tracking performance is the appearance representation of the target object. Existing visual target tracking algorithms can be broadly divided into two categories according to their appearance representation strategies: tracking algorithms based on traditional manual features and tracking algorithms based on deep convolution features. Tracking algorithms based on traditional manual features use low-level features such as SIFT, gradient histograms and color histograms to represent local features of the target object. In contrast, deep convolution features can learn, from a given image, the global and local features that best represent the appearance of an object. Thus, a tracking algorithm with an appearance model based on deep convolution features can achieve more accurate and robust tracking performance, because deep convolution features better represent the appearance of the target. However, tracking algorithms based on convolutional structures have low computational efficiency and a large computational load, and do not meet the real-time requirement between consecutive video frames.
While representations based on multi-feature fusion have proven to be an effective way to improve the performance of tracking algorithms, the existing multi-feature fusion tracking algorithms simply fuse various manual features together or extract multi-layer deep convolution features from a convolutional network. In various challenging video sequences, the previously proposed tracking algorithms based on multiple manual features or multi-layer deep convolution features are prone to tracking failures.
Disclosure of Invention
The technical problems solved by the invention are as follows: the existing tracking algorithms based on manual features are computationally fast, but they adopt low-level features to represent local information of the target object and therefore cannot describe the target object accurately; the existing tracking algorithms based on deep convolution features have low computational efficiency and a large computational load, do not meet the real-time requirement between consecutive video frames, and are prone to tracking failure. The self-adaptive multi-feature fusion visual target tracking method based on the spatial regularization correlation filter provided by the invention fully utilizes the complementary advantages of traditional manual features and deep convolution features, can represent the appearance of the target effectively, accurately and robustly, and improves tracking precision and robustness while reducing the calculation cost, thereby improving the tracking performance of the tracking algorithm.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: the adaptive multi-feature fusion tracking method based on the spatial regularization correlation filter comprises the following steps:
(1) Reading the video frame, determining the tracking target, and marking the tracking target with a target frame in the first frame image.
(2) Extracting the depth convolution features of the target with a residual convolution network, and extracting the manual features of the target with a gradient histogram and a gray-scale map.
(3) Fusing the depth convolution features and the manual features.
(4) Sending the fusion features of the first frame image and the fusion features of the current frame tracking result into two spatial regularization correlation filters respectively.
(5) Accumulating and summing the confidence values of the two spatial regularization correlation filters to obtain the tracking result of frame t+1.
By adopting the technical scheme, the invention has the following beneficial technical effects.
The tracking algorithm of the invention achieves satisfactory tracking performance on five benchmark data sets: OTB-2015, Temple-Color, PTB-TIR, UAV20L and VOT-2016. Experimental results show that the tracking performance of the tracking algorithm on long-term video sequences, short-term video sequences and infrared video sequences is remarkably improved. Experiments on highly crowded video scenes from the Temple-Color benchmark data set further verify the effectiveness, accuracy and robustness of the tracking algorithm.
Specifically, the invention fully utilizes the complementary advantages of multi-layer deep convolution features and manual features, using a feature fusion technique based on manual-feature and deep-convolution-feature representations. This fusion-based representation is extended to multiple color channels and the results are concatenated together to form a new representation. Through this feature fusion technique, the tracking algorithm can learn the appearance model of the target accurately and robustly, which has obvious advantages for improving tracking performance.
Second, the alternating direction method of multipliers (ADMM) is introduced into the spatially regularized correlation filter, solving the problems of how to accurately search for the target against a complex background and how to remove the influence of redundant boundaries.
Drawings
FIG. 1 is a diagram of an overall tracking framework of the present invention.
Detailed Description
Referring to fig. 1, the method of the present invention comprises the steps of:
(1) Reading a video frame, determining the tracking target, and marking the tracking target with a target frame in the first frame image;
(2) Extracting the depth convolution features of the target with a residual convolution network based on ResNet-101, and extracting the manual features of the target with a histogram of oriented gradients (HOG) and a gray-scale map (Grayscale);
(3) Fusing the depth convolution features and the manual features;
(4) Sending the fusion features of the first frame image and the fusion features of the current frame tracking result into two spatial regularization correlation filters respectively;
(5) Accumulating and summing the confidence values of the two spatial regularization correlation filters, and optimizing the final confidence value with a Newton algorithm to obtain a more accurate tracking result for frame t+1 (a sketch of this step follows the list).
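The following NumPy sketch makes step (5) concrete: the two confidence maps are summed and the integer peak is refined with a few Newton iterations on a local quadratic model built from finite differences. The iteration count, the finite-difference scheme and the toy response maps are assumptions introduced for illustration; the patent states only that a Newton algorithm optimizes the final confidence value.

```python
import numpy as np

def newton_refine(resp, iters=3):
    """Refine the integer peak of a confidence map to sub-pixel
    accuracy with Newton iterations on a finite-difference
    quadratic model of the response surface (illustrative)."""
    r, c = np.unravel_index(np.argmax(resp), resp.shape)
    y, x = float(r), float(c)
    for _ in range(iters):
        ri, ci = int(round(y)), int(round(x))
        if not (1 <= ri < resp.shape[0] - 1 and 1 <= ci < resp.shape[1] - 1):
            break
        # Finite-difference gradient and Hessian at the current point.
        gy = (resp[ri + 1, ci] - resp[ri - 1, ci]) / 2.0
        gx = (resp[ri, ci + 1] - resp[ri, ci - 1]) / 2.0
        hyy = resp[ri + 1, ci] - 2.0 * resp[ri, ci] + resp[ri - 1, ci]
        hxx = resp[ri, ci + 1] - 2.0 * resp[ri, ci] + resp[ri, ci - 1]
        hxy = (resp[ri + 1, ci + 1] - resp[ri + 1, ci - 1]
               - resp[ri - 1, ci + 1] + resp[ri - 1, ci - 1]) / 4.0
        H = np.array([[hyy, hxy], [hxy, hxx]])
        if abs(np.linalg.det(H)) < 1e-12:
            break
        # Newton step toward the stationary point of the local quadratic.
        step = np.linalg.solve(H, np.array([gy, gx]))
        y, x = y - step[0], x - step[1]
    return y, x

# Step (5): sum the confidence maps of the two filters, then refine.
resp_first = np.random.rand(50, 50)    # first-frame filter response (toy)
resp_curr = np.random.rand(50, 50)     # current-frame filter response (toy)
peak_y, peak_x = newton_refine(resp_first + resp_curr)
```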
More specifically, a residual convolution network based on ResNet-101 is employed to extract the depth convolution features of the target, and the manual features of the target are extracted based on the histogram of oriented gradients (HOG) and the gray-scale map (Grayscale). In FIG. 1(e), the parameter set-up, training and updating of the depth residual convolution network ResNet-101 of the invention are consistent with the description in the literature: K. He, X. Zhang, S. Ren, J. Sun, "Deep residual learning for image recognition," in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
The fusion process of the depth convolution features and the traditional manual features is as follows: the convolution features and the traditional manual features are extracted separately from the first frame image with its target frame and from the tracking result of the current frame, and the manual features are fused with the depth convolution features as one layer of the convolution features. The target frame is indicated in red (the small frame in the first frame image in FIG. 1), and the region containing the target, sampled from the first frame image of the video sequence, is indicated by a yellow dotted line (the larger dotted frame in the first frame image in FIG. 1), as shown in FIG. 1(a). In general, the first frame image is important for the tracking algorithm, because it corresponds to the true target position and target size. Thus, the invention uses the first frame image information to correct target drift in each frame. To better handle changes in target size, 5 image blocks of different sizes are sampled for the t-th frame image, with a scale factor of 1.01 over the 5 scale steps (a sampling sketch follows this paragraph). Then, at the t-th frame, depth residual convolution features and manual fusion features are extracted for these 5 image blocks of different sizes, as shown in FIG. 1(d) and FIG. 1(e). Finally, in order to effectively exploit the fusion of depth convolution features and manual features for tracking, the invention provides a feature fusion learning framework with two spatially regularized correlation filters. The fusion features of the first frame image and the fusion features of the current frame tracking result are sent into the two spatial regularization correlation filters respectively, as shown in FIG. 1(f). The confidence values of the two spatial correlation filters are added and then refined with a Newton step to obtain the tracking result of the next video frame t+1, and the tracking results of the first frame image and the current frame image are used to continuously update the subsequent frames of the video sequence, as shown in FIG. 1(g).
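A minimal sampling sketch follows, under the reading that "5 image blocks, scale factor 1.01" means five crops around the current target scaled by powers of 1.01; that reading, and the omission of resizing the crops to the filter's template size (e.g. with cv2.resize), are assumptions of this sketch.

```python
import numpy as np

def sample_scaled_patches(frame, cx, cy, w, h, n_scales=5, factor=1.01):
    """Crop n_scales patches centered at (cx, cy) whose sizes are
    scaled by factor**e for e in {-2, -1, 0, 1, 2} (illustrative
    reading of the 5-block, 1.01-factor sampling)."""
    H, W = frame.shape[:2]
    exps = np.arange(n_scales) - (n_scales - 1) / 2.0
    patches = []
    for e in exps:
        sw, sh = int(round(w * factor ** e)), int(round(h * factor ** e))
        x0 = max(0, int(round(cx - sw / 2.0)))
        y0 = max(0, int(round(cy - sh / 2.0)))
        patches.append(frame[y0:min(H, y0 + sh), x0:min(W, x0 + sw)])
    return patches

frame = np.random.rand(240, 320)                  # toy gray frame
blocks = sample_scaled_patches(frame, 160, 120, 60, 40)
```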
When the tracking algorithm fails on some video frames due to occlusion, deformation, rapid motion and the like, the tracking algorithm of the invention can re-track the remainder of the video sequence based on the true target frame of the first frame image. If a tracking failure is detected, the first frame image and the most recent result are used to reposition the target in the next video frame, and the process is repeated until the last frame image of the video sequence.
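A sketch of this failure handling is given below; the confidence-threshold test is an assumption introduced for illustration, since the patent does not state how a tracking failure is detected, only that the first frame's true target frame is used for repositioning.

```python
import numpy as np

def is_tracking_failure(resp, tau=0.15):
    """Assumed heuristic: declare failure when the peak of the
    summed confidence map falls below a threshold tau."""
    return float(resp.max()) < tau

resp_first = np.random.rand(50, 50)        # first-frame filter response (toy)
resp_sum = np.random.rand(50, 50) * 0.1    # summed confidence map (toy, low)
if is_tracking_failure(resp_sum):
    # Reposition from the first-frame filter alone and continue tracking.
    r, c = np.unravel_index(np.argmax(resp_first), resp_first.shape)
```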
For visual target tracking, it is of paramount importance to learn a robust and accurate appearance model in various complex environments. At the same time, an accurate and robust representation of the target object is also very important for improving tracking performance, although this often comes at the expense of runtime efficiency. In the tracking algorithm of the invention, the basic idea of feature learning is to take the fusion of the deep convolution features and the traditional manual features as the appearance features of the target object. By fusing traditional manual features with deep convolution features, the feature representation of the target appearance is enhanced; the fusion provides a more comprehensive appearance representation for more accurate and robust tracking results.
The deep convolution features are based on the residual network ResNet-101, which is widely used in the field of image classification; here, ResNet-101 is incorporated into the visual target tracking algorithm to accomplish the tracking task. The manual features are based on HOG and Grayscale, two methods commonly used in traditional tracking algorithms to represent the appearance of a target object. In general, deep convolution features are sensitive to illumination changes, so in indoor environments a tracking algorithm based on deep convolution features has difficulty achieving ideal tracking performance. However, tracking algorithms based on HOG and Grayscale manual features are very useful when the target object undergoes severe illumination changes or full or partial occlusion, and when the target object moves outside the camera's field of view. This is because HOG and Grayscale are strongly invariant to illumination changes caused by highlights, illumination intensity and shadows. Therefore, in the feature learning process, the invention fully utilizes the complementary advantages of deep convolution features and traditional manual features to express the appearance of the moving object.
For the manual features, the invention uses a 31-dimensional HOG feature and grayscale feature vectors obtained from a 4x4 cell grid over 5 scales. These manual features are then combined with the depth convolution features as additional channel features to describe the appearance of the target object.
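The fusion sketch below stacks the manual features onto the depth convolution features along the channel axis; the channel counts and the assumption that all maps are already resampled to a common M x N grid are illustrative.

```python
import numpy as np

def fuse_features(deep_feat, hog_feat, gray_feat):
    """Channel-wise fusion: manual features are appended to the
    deep features as additional channels (all maps assumed to be
    resampled to the same M x N grid)."""
    assert deep_feat.shape[:2] == hog_feat.shape[:2] == gray_feat.shape[:2]
    return np.concatenate([deep_feat, hog_feat, gray_feat], axis=2)

M, N = 32, 32
deep = np.random.rand(M, N, 256)   # one ResNet-101 layer (toy values/dims)
hog = np.random.rand(M, N, 31)     # 31-channel HOG from 4x4 cells
gray = np.random.rand(M, N, 1)     # grayscale channel
fused = fuse_features(deep, hog, gray)    # shape (32, 32, 288)
```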
After feature learning, two types of features are obtained to describe the target object. The two types of fusion features are transformed into the continuous spatial domain by two spatially regularized correlation filters. For each feature channel k, $x_k$ denotes a training sample, and $\{x_k\}_{k=1}^{t}$ is the sample set used to train the multi-channel correlation filter, where t denotes the t-th video frame. Every training sample $x_k$ consists of a D-dimensional feature map of size $M \times N$ extracted from the target area of each video frame image, and $y_k$ is a Gaussian label function. For any spatial position $(m,n) \in \Omega := \{0,\dots,M-1\} \times \{0,\dots,N-1\}$ in each video frame image, there is a D-dimensional feature vector $x_k(m,n) \in \mathbb{R}^D$. Thus, the spatially regularized correlation filter may be defined as:

$$\varepsilon(f) = \sum_{k=1}^{t} \alpha_k \left\| \sum_{d=1}^{D} x_k^d \star f^d - y_k \right\|^2 + \sum_{d=1}^{D} \left\| w \odot f^d \right\|^2 \qquad (1)$$

where $\star$ is the spatial correlation operation, $\odot$ is the Hadamard product, and $\alpha_k \ge 0$ is the weight of each training sample $x_k$, which determines the influence of each sample. $w$ and $f^d$ denote the regularization weights and the d-th channel of the correlation filter, respectively; $w$ determines the importance of $f^d$ according to spatial position. $x_k^d$ denotes the d-th channel of the training sample $x_k$, and $\varepsilon(f)$ is the error that the actual output of the correlation filter minimizes. Thus $\sum_{d=1}^{D} \| w \odot f^d \|^2$ is the spatial regularization term of the correlation filter. To solve this linear least-squares problem, equation (1) can be converted to a standard equation in the Fourier domain using Parseval's formula. The spatially regularized correlation filter can be learned effectively from samples collected from the historical tracking results. If $z$ is a feature map of size $M \times N$ extracted from a video frame image, the classification score $s_f(z)$ at every location can be evaluated with the discrete Fourier transform (DFT):

$$s_f(z) = \mathcal{F}^{-1}\left( \sum_{d=1}^{D} \hat{z}^d \odot \overline{\hat{f}^d} \right) \qquad (2)$$

where $s_f(z)$ is equivalent to a linear regression based on the correlation filter, $\hat{\cdot}$ denotes the DFT and $\mathcal{F}^{-1}$ the inverse DFT. By adding the spatially regularized weights $w$ to the correlation filter $f$, unnecessary boundary effects can be eliminated. The value of the weight $w$ varies with spatial position: the farther from the target area, the more heavily $w$ penalizes the background region around the target, and vice versa.
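As a concrete illustration of equation (2), the NumPy sketch below evaluates the per-location classification scores with FFTs, summing the per-channel products before the inverse transform; the array shapes and the convention of conjugating the filter spectrum (which realizes correlation rather than convolution) are assumptions of this sketch, not the patent's reference implementation.

```python
import numpy as np

def detection_scores(z, f):
    """Eq. (2) sketch: per-channel product in the Fourier domain,
    summed over the D channels, followed by an inverse DFT.
    z, f: real arrays of shape (M, N, D)."""
    z_hat = np.fft.fft2(z, axes=(0, 1))
    f_hat = np.fft.fft2(f, axes=(0, 1))
    s_hat = np.sum(z_hat * np.conj(f_hat), axis=2)   # correlation convention
    return np.real(np.fft.ifft2(s_hat))

M, N, D = 32, 32, 288
scores = detection_scores(np.random.rand(M, N, D), np.random.rand(M, N, D))
```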
The over-fitting problem that may occur in occlusion environments is alleviated by passively updating the correlation filter. Because the correlation filter with the spatial regularization term in equation (1) poses a convex optimization problem, the invention can efficiently solve for the global optimal solution of the correlation filter f with the alternating direction method of multipliers (ADMM). However, unlike the STRCF tracking algorithm, the tracking algorithm of the invention does not add the temporal regularization term $\|f - f_{t-1}\|^2$, and solves the linear regression problem with the spatial regularization term only. First, an auxiliary variable g is introduced with the constraint f = g, and with step-size parameter $\beta$ the augmented Lagrangian form is obtained:

$$L(f,g,s) = \frac{1}{2} \left\| \sum_{d=1}^{D} x^d \star f^d - y \right\|^2 + \frac{1}{2} \sum_{d=1}^{D} \left\| w \odot g^d \right\|^2 + s^{\top}(f-g) + \frac{\beta}{2} \|f-g\|^2 \qquad (3)$$

where s is the Lagrange multiplier. When $h = \frac{1}{\beta} s$, equation (3) is expressed as:

$$L(f,g,h) = \frac{1}{2} \left\| \sum_{d=1}^{D} x^d \star f^d - y \right\|^2 + \frac{1}{2} \sum_{d=1}^{D} \left\| w \odot g^d \right\|^2 + \frac{\beta}{2} \|f-g+h\|^2 \qquad (4)$$

The invention uses the ADMM algorithm to alternately solve the following sub-problems:

$$\begin{cases} f^{(i+1)} = \arg\min_{f} \left\{ \frac{1}{2} \left\| \sum_{d=1}^{D} x^d \star f^d - y \right\|^2 + \frac{\beta}{2} \left\| f - g^{(i)} + h^{(i)} \right\|^2 \right\} \\ g^{(i+1)} = \arg\min_{g} \left\{ \frac{1}{2} \sum_{d=1}^{D} \left\| w \odot g^d \right\|^2 + \frac{\beta}{2} \left\| f^{(i+1)} - g + h^{(i)} \right\|^2 \right\} \\ h^{(i+1)} = h^{(i)} + f^{(i+1)} - g^{(i+1)} \end{cases} \qquad (5)$$

where $f^{(i+1)}$, $g^{(i+1)}$ and $h^{(i+1)}$ respectively denote the alternately solved sub-problems, and i denotes the number of alternating iterations.
For the filter f, applying Parseval's theorem to the first line of equation (5), f can equivalently be obtained in the Fourier domain as:

$$\hat f^{(i+1)} = \arg\min_{\hat f} \left\{ \frac{1}{2} \left\| \sum_{d=1}^{D} \hat x^d \odot \hat f^d - \hat y \right\|^2 + \frac{\beta}{2} \left\| \hat f - \hat g^{(i)} + \hat h^{(i)} \right\|^2 \right\} \qquad (6)$$

where $\hat f$ in equation (6) is the discrete Fourier transform of the correlation filter f. Equation (6) decomposes into MN sub-problems, one for each spatial position j:

$$V_j(\hat f) = \arg\min \left\{ \frac{1}{2} \left\| V_j(\hat x)^{\top} V_j(\hat f) - \hat y_j \right\|^2 + \frac{\beta}{2} \left\| V_j(\hat f) - V_j(\hat g) + V_j(\hat h) \right\|^2 \right\} \qquad (7)$$

Here the j-th element $\hat y_j$ of the label depends only on the j-th element $V_j(\hat f)$ of the correlation filter; $V_j(\hat x)$, $V_j(\hat f)$, $V_j(\hat g)$ and $V_j(\hat h)$ each represent the vectorization of the corresponding scalar-valued function, formed from the j-th element in all D channels. Solving for the minimum of equation (7) (i.e., setting the derivative of equation (7) to zero) yields the closed-form solution of $V_j(\hat f)$:

$$V_j(\hat f) = \left( V_j(\hat x) V_j(\hat x)^{\top} + \beta I \right)^{-1} q_j, \qquad q_j = V_j(\hat x)\,\hat y_j + \beta \left( V_j(\hat g) - V_j(\hat h) \right) \qquad (8)$$

where I denotes the identity matrix and $q_j$ is a vector of the above form. $V_j(\hat x) V_j(\hat x)^{\top}$ is a matrix of rank 1, so by the Sherman-Morrison formula:

$$V_j(\hat f) = \frac{1}{\beta} \left( q_j - \frac{V_j(\hat x)\, V_j(\hat x)^{\top} q_j}{\beta + V_j(\hat x)^{\top} V_j(\hat x)} \right) \qquad (9)$$

Finally, f can be obtained by the inverse Fourier transform of $\hat f$.
In the HCDC-SRCF tracking algorithm provided by the invention, the derivation of the sub-problem g and the updating process of the step-size parameter β are the same as in the literature: F. Li, C. Tian, W. Zuo, L. Zhang, M.-H. Yang, "Learning spatial-temporal regularized correlation filters for visual tracking," in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4904-4913.
In visual object tracking, it is important to estimate candidate targets for the next frame quickly and accurately. There are two traditional prediction methods: probability estimation and exhaustive search. The invention uses the strategy of the literature "M. Danelljan, G. Hager, F. Shahbaz Khan, M. Felsberg, Learning spatially regularized correlation filters for visual tracking, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4310-4318" to generate target candidates for the next frame. The aim of the invention is to estimate the position of the target object in the next frame from the positions in the current frame and the first frame. The model can be updated robustly by correcting past errors using the information of the first frame image; in particular, when the tracking algorithm fails to track the target, the target can easily be re-tracked.
The invention mainly adopts two strategies to improve the localization and tracking performance on the target object in each video frame image. First, compared with other target detection and search schemes, the invention relies on a simple correlation filter that predicts a target frame for each video frame image in each sequence during tracking. The correlation filter f can be dynamically and incrementally maintained and updated to determine the position $p_t$ and size $\alpha_t$ of the target in each video frame t. Most importantly, the invention adds the spatial regularization term $\sum_{d=1}^{D} \|w \odot f^d\|^2$ to the correlation filter to improve the localization accuracy of the tracking algorithm in each video frame t. Second, the advantages of the spatio-temporal regularized correlation filter tracking algorithm STRCF, the spatially regularized correlation filter tracking algorithm SRDCF and the BACF tracking algorithm are comprehensively considered, and ADMM is added to the spatially regularized correlation filter to achieve accurate and robust localization and tracking.
In addition, ADMM is used to eliminate the interference of the background with the target object, further improving the tracking performance of the algorithm. ADMM can be used to model the appearance of the tracked target object, and the proposed tracking algorithm does not use negative samples when training the correlation filter model. In the HCDC-SRCF tracking algorithm provided by the invention, each frame fuses ADMM and the spatial regularization term into the tracking framework, and the fusion features are then sent into the correlation filter to better locate the target object and further enhance tracking robustness. During tracking, the invention uses online updates of the target representation to describe changes in the appearance of the target object. In each video frame, according to the tracking results of the first frame and the current frame, the correlation filter is trained iteratively with the multi-channel fusion of depth convolution features and manual features, and the spatial correlation filter is continuously updated to determine the position and size of the target object in the next frame.
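A minimal sketch of the online update is shown below; the running linear blend and its learning rate eta are assumptions, since the patent states only that the current-frame filter is continuously updated from the first-frame and current-frame tracking results.

```python
import numpy as np

def update_filter(f_model, f_new, eta=0.02):
    """Assumed online update: blend the filter trained on the
    latest tracking result into the running model (the first-frame
    filter is kept fixed and is not updated)."""
    return (1.0 - eta) * f_model + eta * f_new

f_model = np.random.rand(32, 32, 8)   # running current-frame filter (toy)
f_new = np.random.rand(32, 32, 8)     # filter trained on frame t result (toy)
f_model = update_filter(f_model, f_new)
```

Keeping the first-frame filter fixed matches the drift-correction role that the invention assigns to the first frame image.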
Algorithm 1 gives the overall flow of the HCDC-SRCF tracking algorithm proposed by the present invention.
Those skilled in the art will appreciate that all or part of the steps in the above implementation method may be implemented by a program instructing the related hardware; the program may be stored in a computer-readable storage medium, and, when executed, comprises the steps of the above method. The storage medium is, for example: ROM/RAM, a magnetic disk, an optical disk, etc.

Claims (5)

1. The self-adaptive multi-feature fusion tracking method based on the spatial regularization correlation filter is characterized by comprising the following steps of:
(1) Reading a video frame, determining a tracking target, and marking the tracking target by a target frame in a first frame image;
(2) Extracting the depth convolution characteristics of the target by adopting a residual convolution network, and extracting the manual characteristics of the target by adopting a gradient histogram and a gray level map;
(3) Carrying out feature fusion on the depth convolution features and the manual features;
(4) Respectively sending the fusion features of the first frame image and the fusion features of the tracking result of the current frame image into two spatial regularization correlation filters; the spatial regularization correlation filter is expressed as:

$$\varepsilon(f) = \sum_{k=1}^{t} \alpha_k \left\| \sum_{d=1}^{D} x_k^d \star f^d - y_k \right\|^2 + \sum_{d=1}^{D} \left\| w \odot f^d \right\|^2$$

where $\star$ is the spatial correlation operation, $\odot$ is the Hadamard product, $\alpha_k \ge 0$ denotes the weight of each training sample $x_k$, $w$ and $f^d$ denote respectively the regularization weights and the d-th channel of the correlation filter, $\varepsilon(f)$ is the minimized error value, $w$ determines the importance of $f^d$ according to the spatial position of the correlation filter, and $\sum_{d=1}^{D} \|w \odot f^d\|^2$ is the spatial regularization term of the correlation filter;
(5) Accumulating and summing the confidence values of the two spatial regularization correlation filters to obtain the tracking result of frame t+1, and continuously updating the subsequent frames of the video sequence with the tracking results of the first frame image and the current frame image;

introducing an alternating direction method of multipliers into the spatial regularization correlation filter to solve for the global optimal solution of the correlation filter f, the process being as follows:

introducing an auxiliary variable g with the constraint f = g, and obtaining, with step-size parameter $\beta$, the augmented Lagrangian form:

$$L(f,g,s) = \frac{1}{2} \left\| \sum_{d=1}^{D} x^d \star f^d - y \right\|^2 + \frac{1}{2} \sum_{d=1}^{D} \left\| w \odot g^d \right\|^2 + s^{\top}(f-g) + \frac{\beta}{2} \|f-g\|^2$$

where s is the Lagrange multiplier; when $h = \frac{1}{\beta} s$, the above formula is expressed as:

$$L(f,g,h) = \frac{1}{2} \left\| \sum_{d=1}^{D} x^d \star f^d - y \right\|^2 + \frac{1}{2} \sum_{d=1}^{D} \left\| w \odot g^d \right\|^2 + \frac{\beta}{2} \|f-g+h\|^2$$

the following sub-problems are solved alternately with the alternating direction method of multipliers:

$$\begin{cases} f^{(i+1)} = \arg\min_{f} \left\{ \frac{1}{2} \left\| \sum_{d=1}^{D} x^d \star f^d - y \right\|^2 + \frac{\beta}{2} \left\| f - g^{(i)} + h^{(i)} \right\|^2 \right\} \\ g^{(i+1)} = \arg\min_{g} \left\{ \frac{1}{2} \sum_{d=1}^{D} \left\| w \odot g^d \right\|^2 + \frac{\beta}{2} \left\| f^{(i+1)} - g + h^{(i)} \right\|^2 \right\} \\ h^{(i+1)} = h^{(i)} + f^{(i+1)} - g^{(i+1)} \end{cases}$$

where $f^{(i+1)}$, $g^{(i+1)}$ and $h^{(i+1)}$ respectively denote the alternately solved sub-problems, and i denotes the number of alternating iterations;

for the filter f, the first line of the above formula equivalently yields f in the Fourier domain as:

$$\hat f^{(i+1)} = \arg\min_{\hat f} \left\{ \frac{1}{2} \left\| \sum_{d=1}^{D} \hat x^d \odot \hat f^d - \hat y \right\|^2 + \frac{\beta}{2} \left\| \hat f - \hat g^{(i)} + \hat h^{(i)} \right\|^2 \right\}$$

where $\hat f$ is the discrete Fourier transform of the correlation filter f; each sample $\hat x$ comprises all D channels, and the problem is further decomposed into MN sub-problems:

$$V_j(\hat f) = \arg\min \left\{ \frac{1}{2} \left\| V_j(\hat x)^{\top} V_j(\hat f) - \hat y_j \right\|^2 + \frac{\beta}{2} \left\| V_j(\hat f) - V_j(\hat g) + V_j(\hat h) \right\|^2 \right\}$$

where the j-th element $\hat y_j$ of the label depends only on the j-th element $V_j(\hat f)$ of the correlation filter; $V_j(\hat x)$, $V_j(\hat f)$, $V_j(\hat g)$ and $V_j(\hat h)$ each represent the vectorization of the corresponding scalar-valued function, formed from the j-th element in all D channels; the closed-form solution of $V_j(\hat f)$ is:

$$V_j(\hat f) = \left( V_j(\hat x) V_j(\hat x)^{\top} + \beta I \right)^{-1} q_j, \qquad q_j = V_j(\hat x)\,\hat y_j + \beta \left( V_j(\hat g) - V_j(\hat h) \right)$$

where I is the identity matrix and $q_j$ is a vector of the above form; $V_j(\hat x) V_j(\hat x)^{\top}$ is a matrix of rank 1, so by the Sherman-Morrison formula:

$$V_j(\hat f) = \frac{1}{\beta} \left( q_j - \frac{V_j(\hat x)\, V_j(\hat x)^{\top} q_j}{\beta + V_j(\hat x)^{\top} V_j(\hat x)} \right)$$

finally, f is obtained by the inverse Fourier transform of $\hat f$.
2. The adaptive multi-feature fusion tracking method based on spatial regularization correlation filter of claim 1, wherein: the feature fusion in step (3) fuses the manual features, as one layer of the convolution features, with the depth convolution features.
3. The adaptive multi-feature fusion tracking method based on spatial regularization correlation filter of claim 1 or 2, characterized in that: the first frame image information is used to correct for target drift in each frame.
4. The adaptive multi-feature fusion tracking method based on spatial regularization correlation filter of claim 3, wherein: when a tracking failure is detected, the true target frame of the first frame image is used to reposition the target in the next video frame, and the process is repeated until the last frame image of the video sequence.
5. The adaptive multi-feature fusion tracking method based on spatial regularization correlation filter of claim 1, 2 or 4, characterized in that: step (5) further comprises optimizing the final confidence value using a newton algorithm.
CN202111121555.8A 2021-09-24 2021-09-24 Self-adaptive multi-feature fusion tracking method based on spatial regularization correlation filter Active CN113838093B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111121555.8A CN113838093B (en) 2021-09-24 2021-09-24 Self-adaptive multi-feature fusion tracking method based on spatial regularization correlation filter

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111121555.8A CN113838093B (en) 2021-09-24 2021-09-24 Self-adaptive multi-feature fusion tracking method based on spatial regularization correlation filter

Publications (2)

Publication Number Publication Date
CN113838093A CN113838093A (en) 2021-12-24
CN113838093B 2024-03-19

Family

ID=78969766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111121555.8A Active CN113838093B (en) 2021-09-24 2021-09-24 Self-adaptive multi-feature fusion tracking method based on spatial regularization correlation filter

Country Status (1)

Country Link
CN (1) CN113838093B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359663A (en) * 2018-08-22 2019-02-19 南京信息工程大学 A kind of video tracing method returned based on color cluster and space-time canonical
CN109461172A (en) * 2018-10-25 2019-03-12 南京理工大学 Manually with the united correlation filtering video adaptive tracking method of depth characteristic
CN111402303A (en) * 2020-02-17 2020-07-10 绍兴文理学院 Target tracking architecture based on KFSTRCF
CN111429485A (en) * 2020-04-07 2020-07-17 东北大学 Cross-modal filtering tracking method based on self-adaptive regularization and high-reliability updating
WO2020228446A1 (en) * 2019-05-13 2020-11-19 腾讯科技(深圳)有限公司 Model training method and apparatus, and terminal and storage medium
CN112330716A (en) * 2020-11-11 2021-02-05 南京邮电大学 Space-time channel constraint correlation filtering tracking method based on abnormal suppression
CN113344973A (en) * 2021-06-09 2021-09-03 南京信息工程大学 Target tracking method based on space-time regularization and feature reliability evaluation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150146928A1 (en) * 2013-11-27 2015-05-28 Electronics And Telecommunications Research Institute Apparatus and method for tracking motion based on hybrid camera

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359663A (en) * 2018-08-22 2019-02-19 南京信息工程大学 A kind of video tracing method returned based on color cluster and space-time canonical
CN109461172A (en) * 2018-10-25 2019-03-12 南京理工大学 Manually with the united correlation filtering video adaptive tracking method of depth characteristic
WO2020228446A1 (en) * 2019-05-13 2020-11-19 腾讯科技(深圳)有限公司 Model training method and apparatus, and terminal and storage medium
CN111402303A (en) * 2020-02-17 2020-07-10 绍兴文理学院 Target tracking architecture based on KFSTRCF
CN111429485A (en) * 2020-04-07 2020-07-17 东北大学 Cross-modal filtering tracking method based on self-adaptive regularization and high-reliability updating
CN112330716A (en) * 2020-11-11 2021-02-05 南京邮电大学 Space-time channel constraint correlation filtering tracking method based on abnormal suppression
CN113344973A (en) * 2021-06-09 2021-09-03 南京信息工程大学 Target tracking method based on space-time regularization and feature reliability evaluation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于相关判别模型的视觉跟踪算法研究";张海洋;《中国优秀硕士学位论文全文数据库 信息科技辑》;第I138-1010页 *

Also Published As

Publication number Publication date
CN113838093A (en) 2021-12-24

Similar Documents

Publication Publication Date Title
CN108090919B (en) Improved kernel correlation filtering tracking method based on super-pixel optical flow and adaptive learning factor
CN113807187B (en) Unmanned aerial vehicle video multi-target tracking method based on attention feature fusion
CN108549839B (en) Adaptive feature fusion multi-scale correlation filtering visual tracking method
CN108734723B (en) Relevant filtering target tracking method based on adaptive weight joint learning
CN108776975B (en) Visual tracking method based on semi-supervised feature and filter joint learning
CN107369166B (en) Target tracking method and system based on multi-resolution neural network
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN111080675B (en) Target tracking method based on space-time constraint correlation filtering
CN112132149B (en) Semantic segmentation method and device for remote sensing image
CN107240122A (en) Video target tracking method based on space and time continuous correlation filtering
CN110276785B (en) Anti-shielding infrared target tracking method
CN111311647B (en) Global-local and Kalman filtering-based target tracking method and device
CN110942471B (en) Long-term target tracking method based on space-time constraint
CN107358623A (en) A kind of correlation filtering track algorithm based on conspicuousness detection and robustness size estimation
CN111260738A (en) Multi-scale target tracking method based on relevant filtering and self-adaptive feature fusion
CN111310582A (en) Turbulence degradation image semantic segmentation method based on boundary perception and counterstudy
CN110084201B (en) Human body action recognition method based on convolutional neural network of specific target tracking in monitoring scene
Huang et al. Siamatl: Online update of siamese tracking network via attentional transfer learning
CN112183675B (en) Tracking method for low-resolution target based on twin network
CN112329784A (en) Correlation filtering tracking method based on space-time perception and multimodal response
CN112785622A (en) Long-time tracking method and device for unmanned ship on water surface and storage medium
Fu et al. Robust multi-kernelized correlators for UAV tracking with adaptive context analysis and dynamic weighted filters
Teng et al. Robust multi-scale ship tracking via multiple compressed features fusion
CN113033356B (en) Scale-adaptive long-term correlation target tracking method
CN111161323B (en) Complex scene target tracking method and system based on correlation filtering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant