CN112330716A

CN112330716A - Space-time channel constraint correlation filtering tracking method based on abnormal suppression

Info

Publication number: CN112330716A
Application number: CN202011251969.8A
Authority: CN
Inventors: 范保杰; 王雪艳
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2020-11-11
Filing date: 2020-11-11
Publication date: 2021-02-05
Anticipated expiration: 2040-11-11
Also published as: CN112330716B

Abstract

The invention discloses a space-time channel constraint correlation filtering tracking method based on abnormal suppression, which comprises the following steps: step S1, extracting the HOG feature, the first depth feature and the second depth feature of the t frame; step S2, carrying out fusion processing on the HOG feature, the first depth feature and the second depth feature to obtain a first fusion feature X, and then determining the position and the scale of a target in the t frame image based on the first fusion feature X and a filter; step S3, updating the filter based on the spatial-temporal channel constraint related filtering model which can restrain abnormity according to the characteristic diagram of the t-th frame; and step S4, repeating the steps S2-S4 until all the frame tracking is finished, and finally obtaining the tracking result. The invention uses manual characteristics to combine depth characteristics to remarkably improve the characteristic representation capability of the target template and passes the characteristic_2,1The norm realizes self-adaptive channel characteristic selection, and effectively solves the problems of boundary effect and background clutter.

Description

Space-time channel constraint correlation filtering tracking method based on abnormal suppression

Technical Field

The invention relates to the technical field of computer vision image processing, in particular to a spatio-temporal channel constraint correlation filtering tracking method based on abnormal suppression.

Background

Target tracking is a very popular research topic in the field of computer vision and has been widely applied to video surveillance, unmanned driving, human-computer interaction and the like. Target tracking is given the position and scale of the target in the first frame, thereby predicting the position of the target in subsequent video frames.

Discriminant correlation filter based trackers have achieved excellent results in many common video reference datasets and competitions. Starting from the dominant MOSSE filter, the discriminant correlation filter based tracker achieves very good performance in visual tracking. The KCF uses the multiple-channel HOG features to build an appearance model of the target, thereby significantly improving the performance of the algorithm. The HOG characteristic has good tracking performance on color change, and the CN characteristic has good tracking effect on target deformation and motion blur. Thus, stack proposes a fusion algorithm that uses the combined features of HOG + CN to track the target. The BACF multiplies the target center region by a fixed binary matrix to obtain real samples, and proposes an effective ADMM method to learn the filter. CACF proposes a new framework that can add more context information and context information to the learned filter. The AutoTrack trains the tracker by using self-adaptive space-time regularization, and provides a hyper-parameter correlation filtering tracker method based on self-adaptively adjusting space-time constraint terms.

Recently, convolutional neural networks have been widely used for correlation filters in order to achieve a more accurate and comprehensive representation of the appearance of the target. Feature representations combining manual features and depth features have been widely used in some trackers. C-COT performs sub-grid tracking by learning discriminant continuous convolution operators. ECO, a lightweight version of C-COT, uses decomposition convolution operators and generation of sample space to reduce model complexity and increase computation speed. LADCF selects only 5% of the handcrafted and 20% of the deep features for filter learning using a method with adaptive spatial feature selection. ASRCF provides an adaptive spatial regularization method, which can effectively acquire spatial weight to adapt to the change of target appearance and realize stronger tracking performance. GFS-DCF introduced a tracking framework for Res-Net50 containing rich semantic information and, in addition, introduced a group sparse feature selection method to learn adaptive feature selection and achieve superior performance.

However, when the tracking target is interfered by occlusion, fast motion and background noise, the target sample used for training the filter may be interfered, resulting in a gradual decrease in the discrimination of the filter template.

Therefore, the invention provides a spatio-temporal channel constraint correlation filtering tracking method based on abnormal suppression, aiming at the problems.

Disclosure of Invention

In view of this, in order to solve the problem that the filter learning capability is degraded when the target template is interfered by the existing target tracking method, the invention provides a spatio-temporal channel constraint correlation filtering tracking method based on abnormal suppression.

In order to achieve the above object, the present invention provides a spatio-temporal channel constraint correlation filtering tracking method based on abnormal suppression, which comprises the following steps:

step S1, acquiring a region of a target in a t frame according to the position and the scale of the target in the t-1 frame image, taking the region as a target region, and extracting the HOG feature, the first depth feature and the second depth feature of the target region;

step S2, carrying out fusion processing on the HOG feature, the first depth feature and the second depth feature of the t-th frame to obtain a fused first fusion feature X, and then determining the position and the scale of a target in the image of the t-th frame based on the first fusion feature X and a filter, wherein the filter is trained on the target image in the image of the t-1 th frame in advance;

step S3, updating the filter to obtain a new filter based on a spatial-temporal channel constraint related filtering model capable of inhibiting abnormity according to the characteristic diagram of the t-th frame;

and step S4, repeating the steps S2-S4 until all the frame tracking is finished, and finally obtaining the tracking result.

Further, in the step S1, a preset ResNet50 depth model is adopted to extract the Conv4-3 as the first depth feature and the Conv4-6 as the second depth feature.

Further, in step S2, the determining the position and the scale of the target in the t-th frame image specifically includes:

and acquiring 7 second fusion features X with sequentially increasing scales according to the first fusion feature X, performing Fourier transform on the 7 second feature X by using the filter to obtain a Fourier domain, performing convolution operation to obtain a response map, taking the position of the maximum response value in the response map as the position of the target in the t-th frame image, and taking the scale corresponding to the maximum response value in the response map as the target scale in the t-th frame image.

Further, step S3 specifically includes:

301, the spatial domain expression of the spatial-temporal channel constraint related filtering model capable of suppressing the abnormality is as follows:

in formula (1), y is an ideal Gaussian function, and is a spatial correlation operator, p and q represent the position difference of two peaks in two response maps in two-dimensional space, [ psi [_p,q]Indicating a shifting operation performed in order to make two peaks coincide with each other,

representing the filter on the kth channel at the t-1 frame,

representing the filter on the kth channel at the tth frame,

and

respectively representing the characteristic samples and filters, λ, from K channels₁And λ₂Is a regular parameter, gamma is an abnormal punishment parameter;

step 302, converting the space-time channel constraint related filtering model capable of suppressing the abnormity into a Fourier domain by using Parseval's theorem, and obtaining an expression as follows:

in the formula (2), the upper criterion Λ is the discrete fourier transform of a given signal,

for the discrete fourier transform of the introduced filter auxiliary variables, T is the size of the input data, F is an orthogonal matrix of size T x T, which maps the arbitrary T-dimensional vectorized signal into the fourier domain,

is defined as M_t-1[ψ_p,q]Discrete fourier transform of (d);

step 303, constructing formula (2) in step S302 as an augmented lagrangian function, where the expression is:

in the formula (3), the first and second groups,

in order to be a lagrange multiplier,

is a fourier domain transform of the lagrange multiplier,

expressed as the transpose of the lagrange multiplier fourier domain transform, mu is a penalty factor,

for the discrete fourier transform of the introduced filter auxiliary variables,

is the discrete fourier transform of the filter auxiliary variables introduced in the t-1 th frame.

Step 304, given

And

converting the augmented Lagrange function of step S303 into a subproblem solving filter

In the formula (4), the first and second groups,

expressed as the transpose of the lagrange multiplier fourier domain transform,

representing the solution of the filter at the t +1 th frame.

Solving for each channel yields:

in the formula (5), the first and second groups,

representing the filter auxiliary variables on the kth channel at the tth frame.

Step 305, given

And

converting the augmented Lagrange function of step S303 toSolving auxiliary variables for sub-problems

Equation (6) is then decomposed into N subproblems, each expressed as:

solving the formula (7) and optimizing by using a Sherman Morrison formula to obtain:

in the formula (8), the first and second groups,

and

it should be noted that the above variables are not of practical significance, but are merely combined for computational convenience.

Step 306, give

And

updating Lagrange multiplier terms

In the formula (9), the reaction mixture,

is the lagrangian multiplier term at the ith iteration,

and

is solved by the formulas in steps S305 and S304 in the (i + 1) th iteration;

step 307, updating the regular penalty factor μ:

μ(i+1)＝min(βμ(i),μ_max) (10)

in the formula (10), β is 1.5, μ_max＝1；

Step 308, update the filter template

The compound of the formula (11),

representing the updated filter template at the t-th frame,

representing the filter template at frame t-1,

representing the result of solving the formula of step S304,

is the online update rate.

Further, before performing step S1, it is determined whether the t-th frame is the first frame in the video sequence, if not, step S1 is performed, if so, the position of the target and the relevant filter are initialized, after the initialization operation is completed, it is determined whether tracking of all frames is currently completed, if so, the tracking result is output, and if not, step S1 is returned to.

The invention has the beneficial effects that:

the invention significantly improves the feature representation capability of the target template in the aspect of feature representation by combining manual features with depth features. In the spatial-temporal channel constraint correlation filtering model based on suppressible abnormity_2,1The norm realizes self-adaptive channel characteristic selection, and effectively solves the problems of boundary effect and background clutter ground; similarity constraint is carried out on the target template from the time sequence through a time constraint item, and historical frame information is reserved for learning of the filter, so that the degradation problem of the filter is relieved; more robust and more accurate target tracking is realized by suppressing the abnormal constraint item; by optimizing the model using the ADMM algorithm, the time complexity is significantly reduced and the computation speed is increased.

Drawings

Fig. 1 is a flow chart of a first embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

For better illustration of the present invention, two terms are explained first, the overall name of HOG being Historgram ordered Gradients, representing Histogram of Oriented Gradients; ADMM is generally called Alternating Direction Method of Multipliers, and is expressed as the Alternating Direction multiplier Method.

Example 1

Referring to fig. 1, the present embodiment provides a spatio-temporal channel constraint correlation filtering tracking method based on suppressible anomalies, the method uses a lemming sequence in a target tracking reference data set (OTB100) as a verification set, the video size is 640 × 480, and total 1336 frames include illumination changes, scale scaling, occlusion, fast motion, background clutter and other appearance significant changes; before the method is used, namely in a first frame, initializing the position of a target to be tracked and a relevant filter, and completing target tracking of a subsequent frame based on the method, wherein the general flow chart is shown in fig. 1, and the method comprises the following steps:

step S1, obtaining the area of the target in the t frame according to the target position and scale in the t-1 frame image, taking the area as the target area, and extracting the HOG feature, the first depth feature and the second depth feature of the target area.

Specifically, in the embodiment, a preset ResNet50 depth model is adopted to extract Conv4-3 as a first depth feature and Conv4-6 as a second depth feature.

Step S2, fusion processing is carried out on the HOG feature, the first depth feature and the second depth feature of the t-th frame to obtain a first fusion feature X after fusion, then the position and the scale of the target in the image of the t-th frame are determined based on the first fusion feature X and a filter, and the filter is trained by the target image in the image of the t-1 th frame in advance.

Specifically, in this embodiment, the determining the position and the scale of the target in the t-th frame image specifically includes: and acquiring 7 second fusion features X with sequentially increasing scales according to the first fusion feature X, performing Fourier transform on the 7 second feature X by using the filter to obtain a Fourier domain, performing convolution operation to obtain a response map, taking the position of the maximum response value in the response map as the position of the target in the t-th frame image, and taking the scale corresponding to the maximum response value in the response map as the target scale in the t-th frame image.

specifically, step S3 includes the following sub-steps:

representing the filter on the kth channel at the t-1 frame,

representing the filter on the kth channel at the tth frame,

and

in the above formula, the first term is a ridge regression term, so that the response map and the label y are fitted as much as possible; the second term being the filter h_tChannel regularization penalty term of using l_2,1Norm to realize self-adaptive feature selection; the third term is a temporal regularization term, wherein

Represents the filter on the kth channel at the t-1 frame; the fourth term is an abnormal term inhibition term, wherein the response graph of the t-1 frame

Can simplify M_t-1[ψ_p,q]And (4) showing.

is defined as M_t-1[ψ_p,q]Discrete fourier transform of (d);

in the formula (3), the first and second groups,

in order to be a lagrange multiplier,

is a fourier domain transform of the lagrange multiplier,

Step 304, given

And

In the formula (4), the first and second groups,

expressed as the transpose of the lagrange multiplier fourier domain transform,

representing the solution of the filter at the t +1 th frame.

Solving for each channel yields:

in the formula (5), the first and second groups,

Step 305, given

And

converting the augmented Lagrange function of step S303 into a subproblem solving auxiliary variable

Equation (6) is then decomposed into N subproblems, each expressed as:

in the formula (8), the first and second groups,

and

it is to be noted that; the above variables are not of practical significance, but are merely combined for computational convenience.

Step 306, give

And

updating Lagrange multiplier terms

In the formula (9), the reaction mixture,

is the lagrangian multiplier term at the ith iteration,

and

is solved by the formulas in steps S305 and S304 in the (i + 1) th iteration;

step 307, updating the regular penalty factor μ:

μ(i+1)＝min(βμ(i),μ_max) (10)

in the formula (10), β is 1.5, μ_max＝1；

Step 308, update the filter template

The compound of the formula (11),

filter template updated at the time of representing the t-th frame

Representing the filter template at frame t-1,

representing the result of solving the formula of step S304,

is the online update rate.

According to the technical scheme, the target tracking is carried out on the video sequence, no deviation of the tracked target is generated all the time in the scenes with shielding, rapid motion or background clutter, and the tracking precision is high.

The invention significantly improves the feature representation capability of the target template in the aspect of feature representation by combining manual features with depth features. In the spatial-temporal channel constraint correlation filtering model based on suppressible abnormity_2,1The norm realizes self-adaptive channel characteristic selection, and effectively solves the problems of boundary effect and background clutter ground; the filter degradation problem is alleviated through a time constraint term; more robust and more accurate target tracking is realized by suppressing the abnormal constraint item; by optimizing the model using the ADMM algorithm, the time complexity is significantly reduced and the computation speed is increased.

It should be noted that the above is only a specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes and substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A spatial-temporal channel constraint correlation filtering tracking method based on suppressive abnormality is characterized by comprising the following steps:

2. The method for spatio-temporal channel constraint correlation filtering tracking based on suppressible abnormality of claim 1, characterized in that in said step S1, a preset ResNet50 depth model is used to extract Conv4-3 as the first depth feature and Conv4-6 as the second depth feature.

3. The method for spatio-temporal channel constraint correlation filtering tracking based on suppressible abnormality according to claim 2, characterized in that in said step S2, said determining the position and scale of the target in the t-th frame image is specifically:

4. The method for spatio-temporal channel constraint correlation filtering tracking based on suppressible abnormalities as claimed in claim 3, wherein said step S3 specifically comprises:

representing the filter on the kth channel at the t-1 frame,

representing the filter on the kth channel at the tth frame,

and

for the discrete fourier transform of the filter auxiliary variables introduced in the t-th frame,

filter assistance introduced for t-1 frameDiscrete Fourier transform of variables, T being the size of the input data, F being an orthogonal matrix of size T x T, F mapping any T-dimensional vectorized signal into the Fourier domain,

is defined as M_t-1[ψ_p,q]Discrete fourier transform of (d);

in the formula (3), the first and second groups,

in order to be a lagrange multiplier,

is a fourier domain transform of the lagrange multiplier,

discrete Fourier transform of filter auxiliary variables introduced in the t-1 frame;

step 304, given

And

In the formula (4), the first and second groups,

expressed as the transpose of the lagrange multiplier fourier domain transform,

represents the solution of the filter at the t +1 th frame;

solving for each channel yields:

in the formula (5), the first and second groups,

represents the filter auxiliary variable on the kth channel at the tth frame;

step 305, given

And

Equation (6) is then decomposed into N subproblems, each expressed as:

in the formula (8), the first and second groups,

and

step 306, give

And

updating Lagrange multiplier terms

In the formula (9), the reaction mixture,

is the lagrangian multiplier term at the ith iteration,

and

is solved by the formulas in steps S305 and S304 in the (i + 1) th iteration;

step 307, updating the regular penalty factor μ:

μ(i+1)＝min(βμ(i),μ_max) (10)

in the formula (10), β is 1.5, μ_max＝1；

Step 308, update the filter template

In the formula (11), the reaction mixture,

representing the updated filter template at the t-th frame,

representing the filter template at frame t-1,

representing the result of solving the formula of step S304,

is the online update rate.

5. The method as claimed in claim 4, wherein before step S1, it is determined whether the t-th frame is the first frame in the video sequence, if not, step S1 is performed, if it is, the position of the target and the correlation filter are initialized, and after the initialization operation is completed, it is determined whether the tracking of all frames is currently completed, if yes, the tracking result is output, and if not, step S1 is returned to.