CN112819865B - Correlation filtering tracking method based on self-adaptive regular feature joint time correlation - Google Patents

Correlation filtering tracking method based on self-adaptive regular feature joint time correlation

Info

Publication number
CN112819865B
CN112819865B CN202110214541.4A
Authority
CN
China
Prior art keywords
frame
target
feature
convolution
windowing
Prior art date
Legal status
Active
Application number
CN202110214541.4A
Other languages
Chinese (zh)
Other versions
CN112819865A (en)
Inventor
刘龙
惠志轩
杨尚其
Current Assignee
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN202110214541.4A
Publication of CN112819865A
Application granted
Publication of CN112819865B
Legal status: Active
Anticipated expiration


Classifications

    • G06T7/20 Analysis of motion
    • G06T7/262 Analysis of motion using transform domain methods, e.g. Fourier domain methods
    • G06T7/248 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20056 Discrete and fast Fourier transform [DFT, FFT]
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Abstract

The invention discloses a correlation filtering tracking method based on adaptive regularization features and joint temporal correlation, which specifically comprises the following steps: step 1, selecting a video sequence to be tracked and initializing its first frame; step 2, determining the center position of the target in the second frame of the tracking video sequence and estimating the target scale in the second frame; step 3, determining the position of the target in frame t of the tracking video sequence and estimating the target scale in frame t, where t > 2. The method solves the problem that, when the target scale changes, fixed windowing causes the filter to learn an incomplete target or too much background information.

Description

Correlation filtering tracking method based on self-adaptive regular feature joint time correlation
Technical Field
The invention belongs to the technical field of video image tracking in machine vision, and relates to a correlation filtering tracking method based on adaptive regularization features and joint temporal correlation.
Background
With the rapid development of computer technology, target tracking has become one of the hot topics in computer vision research. Visual target tracking uses an algorithm to continuously mark the position of the tracked target in each frame of a video sequence, so as to obtain motion parameters of the target such as position, velocity and acceleration; these can then be further processed and analysed to understand the target's behaviour and to complete higher-level tasks. As an important branch of computer vision, it has a wide range of applications in science and technology, national defense, aerospace, medicine and health, and the national economy, such as intelligent video surveillance, human-computer interaction, robotics and autonomous driving. Correlation filtering target tracking methods have the advantages of high processing speed and high tracking accuracy.
In correlation filtering it is important to extract sample features of the target region, and two types of features are commonly extracted: (1) conventional features, i.e. manually designed features such as image color histograms, histograms of oriented gradients (HOG) and local binary patterns (LBP); (2) convolutional features, i.e. features extracted with a deep convolutional neural network (CNN). Feature maps output by the later convolutional layers of a CNN (called deep features) carry rich semantic information and translation invariance and are robust to changes of the target, but for target tracking, which requires accurate localization, deep features alone are insufficient because their spatial resolution is too low to position the tracked target precisely. Feature maps output by the earlier convolutional layers (called shallow features) retain high spatial detail, which is very beneficial for target localization, but they are not robust to changes in the target's appearance.
Existing correlation filtering tracking methods extract features from a rectangular region around the target in the first frame and train a correlation filter on these features by ridge regression. In each subsequent frame, after the filter and the features have been updated, the filter is correlated with the features of the search region of the frame to be tracked; a response map is obtained from this correlation, and the position of the maximum response value in the response map is the target position. Correlation filtering based on CNN features replaces the conventional features with multi-layer convolutional features, and after the multi-layer correlation filter responses are obtained the target position is inferred by weighted fusion.
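As a point of reference for the prior-art pipeline described above, the following is a minimal single-channel sketch of a generic correlation filter tracker (train by ridge regression in the Fourier domain, correlate the filter with the search-region features, take the peak of the response). It is purely illustrative and is not the method of the invention.

```python
import numpy as np

def train_filter(x, y, lam=1e-4):
    # x: search-region feature (H x W), y: Gaussian label map (H x W)
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(X) * Y / (np.conj(X) * X + lam)    # filter in the Fourier domain

def detect(alpha_hat, z):
    # z: feature of the search region in the frame to be tracked
    response = np.real(np.fft.ifft2(alpha_hat * np.fft.fft2(z)))
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    return (dy, dx), response                          # peak location gives the target shift
```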
(1) The prior art does not take changes of the target scale into account: the original fixed windowing can cause the filter to learn an incomplete target or too much background information, so the filter has poor robustness, the tracking precision is low, and tracking may ultimately fail.
(2) When existing methods use convolutional features for correlation filtering tracking, the target position is inferred by weighted fusion of the multi-layer correlation filter responses, but the weights are fixed in every frame rather than adaptive; they therefore cannot meet the needs of the tracker, which reduces the robustness and precision of the tracker.
Disclosure of Invention
The invention aims to provide a correlation filtering tracking method based on adaptive regularization features and joint temporal correlation, which solves the problem that, when the target scale changes, the original fixed windowing causes the filter to learn an incomplete target or too much background information.
The technical scheme adopted by the invention is a correlation filtering tracking method based on adaptive regularization features and joint temporal correlation, which specifically comprises the following steps:
step 1, selecting a video sequence to be tracked, and initializing a first frame of the video sequence;
step 2, determining the central position of a target in a second frame of the tracking video sequence, and estimating the target scale in the second frame;
step 3, determining the position of the target in frame t of the tracking video sequence, and estimating the target scale in frame t, where t > 2.
The invention is also characterized in that:
the specific process of the step 1 is as follows:
step 1.1, manually frame the target region in the first frame of the video sequence to be tracked to obtain the target center position coordinate p_1 and the target scale s_1, with p_1 = [x_1, y_1]^T, where x_1, y_1 are the coordinates of the center position of the first-frame target of the video sequence on the x-axis and y-axis with the upper-left corner of the image as the origin, and s_1 = [h_1, w_1]^T, where h_1, w_1 are the length and width of the first-frame target region;
step 1.2, according to the target center position p_1 and the target scale s_1 in the first frame of the video sequence, determine the first-frame training search area I_1 of the video sequence;
step 1.3, use a convolutional neural network to extract the hierarchical convolutional features of the first-frame training search area I_1 of the video sequence, obtaining the convolutional features f_1^l of the search area I_1; f_1^l is a (k×h_1)×(k×w_1)×c matrix, where k×h_1, k×w_1 are the length and width of the convolutional feature map and c is the number of channels of the convolutional feature map; f_1^l is the first-frame convolutional feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, where f_1^low is the shallow-layer feature, f_1^mid the middle-layer feature and f_1^high the high-layer feature of the network;
step 1.4, window the convolutional features f_1^l according to the size of the target region and the size of the first-frame training search area I_1; the windowed convolutional features are f_1'^l;
step 1.5, train a correlation filter α_1^l from the windowed convolutional features f_1'^l of step 1.4; α_1^l is a (k×h_1)×(k×w_1) matrix, with l corresponding to the feature layer; transform the trained correlation filter into the frequency domain.
The specific windowing process in step 1.4 is as follows: the total window function after windowing is the superposition of two window functions, a cosine window and a Gaussian window, where:
the cosine window is determined by the size of the convolutional features of the first-frame training search area, i.e. by the size of f_1^l; since the size of f_1^l does not change, the cosine window function w_cos remains unchanged;
the first frame training search area gaussian window is determined by the following equation (1):
where (m, n) are the coordinates of each point in the Gaussian window, M = (k×h_1)/2, N = (k×w_1)/2, and δ is a regulating factor;
the total window function after windowing is shown in the following formula (2):
the specific process of step 1.5 is as follows:
order theTo->Cyclic shift samples generated by performing cyclic shift of m and n elements in length and width respectively, wherein the label corresponding to each cyclic shift sample is a soft label and is formed by Gaussian function +>Generating epsilon as Gaussian variance, at which time ridge regression is used to train the correlation filter alpha 1 l The following formula (3) shows:
in formula (3), x represents a convolution operation, y is the size h 1 ×w 1 The matrix of m-th row and n-th column of elements y m,n =y (m, n) =y, λ is a regularization coefficient, and in order to accelerate the operation, the following equation (4) is obtained by transforming equation (3) into the frequency domain:
wherein,an ith feature channel representing a windowed feature of the training search area of frame 1, ∈a is a discrete fourier transform, +.>The horizontal lines on the letters represent complex conjugates, the Hadamard product of the matrix, and c the number of characteristic channels.
The specific process of the step 2 is as follows:
step 2.1, determine the detection search area Z_2 of the second frame, where Z_2 is a rectangular region cut out of the second frame centered at the first-frame target center p_1 with length and width k×s_1 (k times the first-frame target scale), k > 1;
step 2.2, extract the convolutional features of the second-frame detection search area Z_2; each feature is a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; the feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, is respectively the shallow-layer, middle-layer or high-layer feature of the network;
step 2.3, window the detection search area features of the second frame; the windowed features are given by formula (5):
step 2.4, compute by formula (6) the correlation filter response r_2^l of the correlation filter α_1^l on the windowed detection features:
where ∧ denotes the discrete Fourier transform and the hatted quantity denotes the i-th feature channel of the windowed features of the second-frame detection search area;
step 2.5, estimate the target center position p_2 of the second frame by formula (7):
where the factors are, respectively, the adaptive weights between the layer responses, the sub-peak suppression parameters and the temporal-association control parameters between the previous and current frames;
step 2.6, from the center position p_2 of the second-frame target of the tracking video sequence obtained in step 2.5, estimate the target scale s_2 = [h_2, w_2]^T of the second frame using the DSST method;
step 2.7, determine the training search area I_2 of the second frame from the target center position p_2 in the second-frame image and the first-frame scale s_1, and update the correlation filter α_1^l based on the training search area I_2.
The specific process of the step 3 is as follows:
step 3.1, determine the training search area I_{t-1} of frame t-1 from the target center position p_{t-1} in the previous frame and the first-frame target scale s_1;
step 3.2, extract the convolutional features f_{t-1}^l of the training search area I_{t-1} of frame t-1; the convolutional feature is a (k×h_1)×(k×w_1)×c matrix, where f_{t-1}^l is the convolutional feature extracted from layer l of the convolutional network;
step 3.3, window the frame t-1 training search area features f_{t-1}^l; the windowed convolutional features f_{t-1}'^l are given by formula (8):
wherein,
step 3.4, train a correlation filter from the windowed convolutional features f_{t-1}'^l; the filter is a (k×h_1)×(k×w_1) matrix and is trained as shown in formula (9):
to accelerate the computation, transforming formula (9) into the frequency domain yields formula (10):
where the hatted quantity denotes the i-th feature channel of the windowed features of the frame t-1 training search area, ∧ denotes the discrete Fourier transform, and the horizontal bar over a letter denotes the complex conjugate;
step 3.5, determine the detection search area Z_t of frame t, where Z_t is a rectangular region cut out of frame t centered at the frame t-1 target center p_{t-1} with length and width k×s_1 (k times the first-frame target scale), k > 1, used as the detection search area of frame t;
step 3.6, extract the convolutional features of the detection search area Z_t of frame t; each feature is a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; the feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, is respectively the shallow-layer, middle-layer or high-layer feature of the network;
step 3.7, window the detection search area convolutional features of frame t; the windowed convolutional features are given by formula (11):
step 3.8, compute by formula (12) the correlation filter response r_t^l of the correlation filter on the windowed detection features:
where ∧ denotes the discrete Fourier transform and the hatted quantity denotes the i-th feature channel of the windowed features of the frame-t detection search area;
step 3.9, estimate the target center position p_t in frame t by formula (13):
where the factors are, respectively, the adaptive weights between the layer responses, the sub-peak suppression parameters and the temporal-association control parameters between the previous and current frames;
step 3.10, from the center position p_t of the frame-t target of the tracking video sequence obtained in step 3.9, estimate the target scale s_t = [h_t, w_t]^T of frame t using the DSST method;
step 3.11, determine the training search area I_t of frame t from the target center position p_t in the frame-t image and the first-frame scale s_1, and update the correlation filter based on the training search area I_t.
And step 3.12, repeating the steps 3.1-3.11 until tracking is finished.
The method has the advantage of improving tracking precision and robustness in the situation where, when the scale changes during tracking, the target features would otherwise be learned incompletely or too much background would be learned; it also realizes adaptive distribution of the weights among the response maps, which improves the precision and robustness of the tracker.
Drawings
FIG. 1 is a comparison of the correlation filtering tracking method based on adaptive regularization features and joint temporal correlation of the present invention with existing tracking algorithms in terms of precision;
FIG. 2 is a comparison of the correlation filtering tracking method based on adaptive regularization features and joint temporal correlation of the present invention with existing tracking algorithms in terms of success rate.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The correlation filtering tracking method based on adaptive regularization features and joint temporal correlation of the invention specifically comprises the following steps:
step 1, selecting a video sequence to be tracked, and initializing a first frame of the sequence;
Step 1.1, manually frame the target region in the first frame to obtain the target center position coordinate p_1 and the target scale s_1, with p_1 = [x_1, y_1]^T, where x_1, y_1 are the coordinates of the center position of the first-frame target on the x-axis and y-axis with the upper-left corner of the image as the origin, and s_1 = [h_1, w_1]^T, where h_1, w_1 are the length and width of the first-frame target region; t denotes the current frame of the video;
Step 1.2, determine the first-frame training search area I_1 from the target center position p_1 and target scale s_1 in the first frame, as follows:
in the image acquired at the first frame, cut out a rectangular region centered at p_1 with length and width k×s_1, k > 1, as the first-frame training search area I_1, where k is a specified parameter;
Step 1.3, extract the hierarchical convolutional features of the first-frame training search area I_1 with the VGG-19 convolutional neural network and use bilinear interpolation to make the size of the convolutional feature maps consistent with the size of the input image, obtaining the convolutional features f_1^l of the region; f_1^l is a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; f_1^l denotes the first-frame convolutional feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, where f_1^low is the shallow-layer feature, f_1^mid the middle-layer feature and f_1^high the high-layer feature of the network.
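As an illustration of this step, the following is a minimal sketch of hierarchical feature extraction with a pretrained VGG-19 and bilinear interpolation back to the patch size. It assumes PyTorch/torchvision rather than the MatConvNet environment used in the embodiment, and the chosen layer indices (conv3_4, conv4_4, conv5_4) are an assumption consistent with using the conv3/conv4/conv5 blocks.

```python
# Sketch only: extract shallow/mid/high VGG-19 features for a search-region patch
# and resize each map back to the patch size with bilinear interpolation.
# Assumes torchvision's pretrained VGG-19; the layer indices are illustrative.
import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
LAYERS = {"low": 16, "mid": 25, "high": 34}   # conv3_4, conv4_4, conv5_4 (assumed)

def extract_features(patch):
    # patch: 1 x 3 x (k*h1) x (k*w1) tensor, already normalized for VGG
    feats, x = {}, patch
    with torch.no_grad():
        for idx, layer in enumerate(vgg):
            x = layer(x)
            for name, stop in LAYERS.items():
                if idx == stop:
                    feats[name] = F.interpolate(
                        x, size=patch.shape[-2:], mode="bilinear", align_corners=False)
    return feats   # each value: 1 x c_l x (k*h1) x (k*w1)
```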
Step 1.4, window the features according to the target size and the size of the first-frame training search area. The final total window function is the superposition of two window functions, a cosine window and a Gaussian window, and is called the adaptive regular feature window. Specifically:
the cosine window is determined by the size of the first-frame training search area features, i.e. by the size of f_1^l; since the size of f_1^l does not change, the window function w_cos remains unchanged;
the Gaussian window of the first-frame training search area is determined by the size of the first-frame training search area and the first-frame target scale, where (m, n) are the coordinates of each point of the Gaussian window, M = (k×h_1)/2, N = (k×w_1)/2, and δ is a regulating factor. When the target size changes, the Gaussian variance σ_1 of the Gaussian window follows the change of the target and is proportional to the target area: when the target grows, the larger variance makes the Gaussian function flatter, its suppression of the target-region features weakens, and more of the target is exposed; conversely, when the target shrinks, the Gaussian function becomes sharper and the suppression of the background increases, so that the filter does not learn too much background.
The windowed feature is finally taken as the final feature f_1'^l.
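A minimal sketch of this adaptive regular feature window is given below. The exact superposition rule of formula (2) and the exact definition of the adaptive variance are images not reproduced in this text, so the element-wise product of the two windows and the choice sigma = delta * sqrt(target area) are assumptions made for illustration only.

```python
# Sketch of the adaptive regular feature window: a fixed cosine (Hann) window
# combined with a Gaussian window whose variance follows the current target area.
# Both the combination rule and the variance definition below are assumptions.
import numpy as np

def adaptive_window(feat_h, feat_w, target_h, target_w, delta=0.43):
    w_cos = np.outer(np.hanning(feat_h), np.hanning(feat_w))   # fixed cosine window
    M, N = feat_h / 2.0, feat_w / 2.0                           # window centre
    sigma = delta * np.sqrt(target_h * target_w)                # assumed: grows with target area
    m, n = np.meshgrid(np.arange(feat_h), np.arange(feat_w), indexing="ij")
    w_gauss = np.exp(-((m - M) ** 2 + (n - N) ** 2) / (2.0 * sigma ** 2))
    return w_cos * w_gauss                                      # assumed superposition
```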
Step 1.5, train the correlation filter α^l based on the extracted features; α^l is a (k×h_1)×(k×w_1) matrix, with l corresponding to the feature layer. The specific method is as follows:
Let f_1'^l(m, n) be the cyclic-shift sample generated by cyclically shifting f_1'^l by m elements along its length and n elements along its width; the label corresponding to each cyclic-shift sample is a soft label generated by a Gaussian function with Gaussian variance ε; ridge regression is then used to train the correlation filter α^l:
In formula (1), * denotes the convolution operation, y is a matrix of size h_1×w_1 whose element in the m-th row and n-th column is y_{m,n} = y(m, n), and λ is the regularization coefficient. To accelerate the computation, formula (1) is transformed into the frequency domain to obtain
where the hatted quantity denotes the i-th feature channel of the windowed features of the training search area of frame t = 1, ∧ denotes the discrete Fourier transform, the horizontal bar over a letter denotes the complex conjugate, ⊙ denotes the Hadamard product of matrices, and c is the number of feature channels. In order to explain the filter update more concisely later, the numerator and the denominator of formula (2) are recorded separately.
Step 2, determine the target center position p_2 = [x_2, y_2]^T in the second frame of the video image and estimate the target scale s_2 = [h_2, w_2]^T.
Step 2.1, determine the detection search area Z_2 of the second frame, where Z_2 is a rectangular region cut out of the second frame centered at the first-frame target center p_1 with length and width k×s_1, k > 1, used as the detection search area of the second frame.
Step 2.2, extract the convolutional features of the second-frame detection search area and use bilinear interpolation to make the size of the feature maps consistent with the input; the interpolated feature is still a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; the feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, is respectively the shallow-layer, middle-layer or high-layer feature of the network.
Step 2.3, window the detection search area features of the second frame; the final total window function is again the superposition of a cosine window and a Gaussian window (the adaptive regular feature window), where:
the cosine window is determined by the size of the second-frame detection search area features, which is consistent with the size of f_1^l and does not change, so the window function w_cos remains unchanged;
the Gaussian window of the second-frame detection search area is determined by the size of the second-frame detection search area and the first-frame target scale; the windowed feature is taken as the final feature.
Step 2.4, compute by formula (3) the correlation filter response r_2^l of the correlation filter on the windowed detection features:
where ∧ denotes the discrete Fourier transform.
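A sketch of this detection step is shown below, reusing the numerator/denominator representation from step 1.5. The exact expression of formula (3) is an image not reproduced here, so the channel summation shown is an assumption.

```python
# Sketch (assumed form): correlation response of one layer's filter, given the
# stored numerator A / denominator B and the windowed detection-region features.
import numpy as np

def layer_response(A, B, z_feat):
    # z_feat: H x W x C windowed feature of the detection search area
    z_hat = np.fft.fft2(z_feat, axes=(0, 1))
    r_hat = np.sum(A * z_hat, axis=2) / B        # sum over feature channels
    return np.real(np.fft.ifft2(r_hat))          # spatial response map r_t^l
```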
Step 2.5, estimate the target center position p_2 of the second frame as follows:
where the factors are, respectively, the adaptive weights between the layer responses, the sub-peak suppression parameters and the temporal-association control parameters between the previous and current frames; the two parameters are determined as follows:
Record the maximum value and the secondary peak of each layer's response map. When the ratio of the secondary peak to the maximum value of a response map exceeds a given threshold γ, the response map is considered not ideal; otherwise it is ideal. The invention adaptively weights the response maps by formula (5):
If the frame to be tracked is the second frame, then:
After the sub-peak suppression parameters and the temporal-association control parameters have been obtained, the adaptive weights are obtained from them, and the target center position p_2 of the second frame is then obtained according to formula (4).
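The sketch below illustrates the sub-peak suppression idea just described: a layer whose secondary peak is too close to its main peak (ratio above γ) is treated as unreliable and down-weighted in the fusion. The actual weighting formulas (4) and (5) are images not reproduced in this text, so the peak-value weights and the hard gating used here are assumptions.

```python
# Sketch of sub-peak suppression and adaptive fusion of the layer response maps.
# The gating rule and the use of the peak value as the weight are assumptions.
import numpy as np
from scipy.ndimage import maximum_filter

def subpeak_weight(r, gamma=0.8):
    peak = r.max()
    local_max = (r == maximum_filter(r, size=5))       # candidate peaks
    peaks = np.sort(r[local_max])[::-1]
    second = peaks[1] if peaks.size > 1 else 0.0
    reliable = (second / (peak + 1e-12)) <= gamma      # sub-peak ratio test
    return peak if reliable else 0.0                   # assumed: unreliable layer gets no weight

def fuse_responses(responses, gamma=0.8):
    w = np.array([subpeak_weight(r, gamma) for r in responses], dtype=float)
    w /= (w.sum() + 1e-12)
    return sum(wi * ri for wi, ri in zip(w, responses))    # fused response map
```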
Step 2.6, scale estimation
After the center position p_2 of the target in the second-frame image of the video sequence has been obtained, the target scale s_2 = [h_2, w_2]^T of the second frame is estimated using the existing DSST method.
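DSST itself trains a dedicated one-dimensional scale correlation filter over a scale pyramid; the snippet below is only a highly simplified stand-in that picks, among a set of pre-cropped candidate scales, the one whose translation response is strongest, and is not the full DSST procedure.

```python
# Highly simplified illustration of scale estimation over a scale pyramid.
# patches_by_scale maps a relative scale factor to a feature patch already
# resized to the filter size; respond_fn returns that patch's response map.
import numpy as np

def estimate_scale(patches_by_scale, respond_fn):
    best_scale, best_score = 1.0, -np.inf
    for s, patch in patches_by_scale.items():
        score = respond_fn(patch).max()
        if score > best_score:
            best_scale, best_score = s, score
    return best_scale
```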
Step 2.7, update of the correlation filter α_1^l model:
Determine the training search area I_2 of the second frame from the new target center position p_2 and the first-frame scale s_1, as follows:
In the image acquired at the second frame, cut out a rectangular region centered at p_2 with length and width k×s_1, k > 1, as the training search area I_2, where k is a designated parameter. Then use the VGG-19 convolutional neural network to extract the features of the rectangular image block I_2 of the target area and use bilinear interpolation to make the size of the convolutional feature maps consistent with the size of the input image, obtaining the convolutional features f_2^l of the region; f_2^l is a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; f_2^l is the second-frame convolutional feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, comprising the shallow-layer, middle-layer and high-layer features of the network.
Window the above features; the final total window function is again the superposition of a cosine window and a Gaussian window (the adaptive regular feature window), where:
the cosine window is determined by the size of the second-frame training search area features, which is consistent with the size of f_1^l and does not change, so the window function w_cos remains unchanged;
the Gaussian window of the second-frame training search area is determined by the size of the second-frame training search area and the second-frame target scale; the windowed feature is taken as the final feature. After it has been obtained, each filter is updated by
where η is the learning rate and c is the number of feature channels; the hatted quantity denotes the i-th feature channel of the windowed features of the training search area of frame t = 2.
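The update formula itself is an image not reproduced in this text; the sketch below assumes the usual linear interpolation of the stored numerator and denominator with learning rate η, as in DSST/HCF-style trackers.

```python
# Sketch (assumed form): running update of one layer's filter model with
# learning rate eta, interpolating the stored numerator A and denominator B.
def update_model(A_old, B_old, A_new, B_new, eta=0.01):
    A = (1.0 - eta) * A_old + eta * A_new
    B = (1.0 - eta) * B_old + eta * B_new
    return A, B
```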
Step 3, determine the target position p_t = [x_t, y_t]^T and the target scale s_t = [h_t, w_t]^T of frame t (t > 2).
Step 3.1, determine the training search area I_{t-1} of frame t-1 from the previous-frame target center position p_{t-1} and the first-frame target scale s_1, as follows:
In the image acquired at frame t-1, cut out a rectangular region centered at p_{t-1} with length and width (k×h_1, k×w_1), k > 1, as the frame t-1 training search area I_{t-1}.
Step 3.2, extract the convolutional features of the frame t-1 training search area and use bilinear interpolation to make the size of the feature maps consistent with the input, obtaining f_{t-1}^l, a (k×h_1)×(k×w_1)×c matrix which is the convolutional feature extracted from layer l of the convolutional network.
Step 3.3, window the frame t-1 training search area features f_{t-1}^l; the final total window function is again the superposition of a cosine window and a Gaussian window (the adaptive regular feature window), where:
the cosine window is determined by the size of the frame t-1 training search area features, which is consistent with the size of f_1^l and does not change, so the window function w_cos remains unchanged;
the Gaussian window of the frame t-1 training search area is determined by the size of the frame t-1 training search area and the frame t-1 target scale; the windowed feature is taken as the final feature f_{t-1}'^l.
Step 3.4, train the correlation filter α^l from the extracted features; α^l is a (k×h_1)×(k×w_1) matrix, with l corresponding to the feature layer. The specific method is as follows:
Let f_{t-1}'^l(m, n) be the cyclic-shift sample generated by cyclically shifting f_{t-1}'^l by m elements along its length and n elements along its width; the label corresponding to each cyclic-shift sample is a soft label generated by a Gaussian function with Gaussian variance ε; ridge regression is then used to train the correlation filter α^l:
In formula (7), * denotes the convolution operation, y is an h×w matrix whose element in the m-th row and n-th column is y_{m,n} = y(m, n), and λ is the regularization coefficient. To accelerate the computation, formula (7) is transformed into the frequency domain to obtain
where the hatted quantity denotes the i-th feature channel of the windowed features of the frame t-1 training search area, ∧ denotes the discrete Fourier transform, the horizontal bar over a letter denotes the complex conjugate, ⊙ denotes the Hadamard product of matrices, and c is the number of feature channels.
Step 3.5, determine the detection search area Z_t of frame t, where Z_t is a rectangular region cut out of frame t centered at the frame t-1 target center p_{t-1} with length and width k×s_1, k > 1, used as the detection search area of frame t.
Step 3.6, extract the features of the detection search area Z_t of frame t and use bilinear interpolation to make the size of the feature maps identical to the input; the interpolated feature is still a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; the feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, is respectively the shallow-layer, middle-layer or high-layer feature of the network.
Step 3.7, window the detection search area features of frame t; the final total window function is again the superposition of a cosine window and a Gaussian window (the adaptive regular feature window), where:
the cosine window is determined by the size of the frame-t detection search area features, which is consistent with the size of f_1^l and does not change, so the window function w_cos remains unchanged;
the Gaussian window of the frame-t detection search area is determined by the size of the frame-t detection search area and the frame t-1 target scale; the windowed feature is taken as the final feature.
Step 3.8, compute by formula (9) the correlation filter response r_t^l of the correlation filter on the windowed detection features:
where ∧ denotes the discrete Fourier transform.
Step 3.9, estimate the target center position p_t as follows:
where the factors are, respectively, the adaptive weights between the layer responses, the sub-peak suppression parameters and the temporal-association control parameters between the previous and current frames; the two parameters are determined as follows:
Record the maximum value and the secondary peak of each layer's response map. When the ratio of the secondary peak to the maximum value of a response map exceeds a given threshold γ, the response map is considered not ideal; otherwise it is ideal. The response maps are adaptively weighted by formula (11):
Record the maximum positions in the three response maps that generated the previous-frame target position p_{t-1}:
where t-1 denotes the previous frame and the recorded quantities are the coordinates of the maximum positions of the three response maps in frame t-1; likewise, record the maximum positions in the three response maps that generate the target position p_t of the current frame:
where t represents the frame to be tracked.
Since a video sequence is not a single image but has a certain temporal correlation, the target position changes little between adjacent frames; using this property, μ_i can be determined according to formula (14):
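The sketch below illustrates the temporal-association idea just described: a layer whose response peak jumps far from its location in the previous frame is treated as unreliable. Formula (14) is an image not reproduced in this text, so the hard distance threshold used here is an assumption.

```python
# Sketch of the temporal-association control parameter mu_i for the three layers.
# The gating rule and threshold are assumptions standing in for formula (14).
import numpy as np

def temporal_weights(peaks_prev, peaks_curr, max_shift=20.0):
    # peaks_prev / peaks_curr: [(x, y), ...] per-layer peak coordinates at t-1 and t
    mu = []
    for (x0, y0), (x1, y1) in zip(peaks_prev, peaks_curr):
        shift = np.hypot(x1 - x0, y1 - y0)
        mu.append(1.0 if shift <= max_shift else 0.0)
    return mu
```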
Step 3.10, estimating the scale;
After the center position p_t of the target in the frame-t image of the video sequence has been obtained, the target scale s_t = [h_t, w_t]^T of frame t is estimated using the existing DSST method.
Step 3.11, update of the correlation filter model:
Determine the training search area I_t of frame t from the new target center position p_t in frame t and the first-frame scale s_1, as follows:
In the image acquired at frame t, cut out a rectangular region centered at p_t with length and width k×s_1, k > 1, as the training search area I_t, where k is a designated parameter. Then use the VGG-19 convolutional neural network to extract the features of the rectangular image block I_t of the target area and use bilinear interpolation to make the size of the convolutional feature maps consistent with the size of the input image, obtaining the convolutional features f_t^l of the region; f_t^l is a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; f_t^l is the frame-t convolutional feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, comprising the shallow-layer, middle-layer and high-layer features of the network.
Window the above features; the final total window function is again the superposition of a cosine window and a Gaussian window (the adaptive regular feature window), where:
the cosine window is determined by the size of the frame-t training search area features, which is consistent with the size of f_1^l and does not change, so the window function w_cos remains unchanged;
the Gaussian window of the frame-t training search area is determined by the size of the frame-t training search area and the frame-t target scale; the windowed feature is taken as the final feature. After it has been obtained, each filter is updated by
where η is the learning rate and c is the number of feature channels; the hatted quantity denotes the i-th feature channel of the windowed features of the frame-t training search area.
Step 3.12, when a new frame of the video sequence arrives, repeat steps 3.1-3.11 until tracking is finished.
Examples
The algorithm is evaluated on the OTB-100 dataset. The development environment is Matlab R2018b with the MatConvNet-GPU deep learning library; the processor is an AMD Ryzen 7 1700 Eight-Core Processor and the GPU is a GTX-1060. The algorithm uses the same parameters for all test videos, set as follows: regularization parameter λ = 10^-4, adjustment factor δ = 0.43, learning rate η = 0.01, Gaussian variance ε = 0.3, search-area adjustment parameter k = 2; the features of the conv3, conv4 and conv5 layers of the VGG-19 network are selected as output features. The proposed algorithm is evaluated experimentally by comparison with advanced tracking methods.
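For convenience, the experimental parameters listed above are collected below; the variable names are illustrative, only the values come from the text.

```python
# Experimental parameters from the embodiment above (names are illustrative).
PARAMS = dict(
    regularization_lambda=1e-4,   # ridge-regression regularizer
    window_delta=0.43,            # Gaussian-window adjustment factor
    learning_rate_eta=0.01,       # filter model update rate
    label_sigma_epsilon=0.3,      # Gaussian soft-label variance
    search_area_k=2,              # search-region scale factor
    vgg19_blocks=(3, 4, 5),       # VGG-19 conv blocks used as features
)
```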
The algorithm of the present invention is denoted HZXT, and it is evaluated by comparison with three representative trackers: SRDCF and BACF, which are based on correlation filtering, and HCF, which is based on deep learning. First, comparison plots of the tracking algorithms in terms of success rate and precision are drawn, as shown in FIGS. 1 and 2. Compared with the other algorithms, the algorithm of the invention achieves excellent results in both precision and success rate: in FIG. 1 its precision reaches 0.834, higher than the other algorithms, and in FIG. 2 it is also superior to the other algorithms. The experimental results in Table 1 show that, across 11 different challenge attributes, the success rate of the algorithm is either optimal or sub-optimal; in particular, for Fast Motion (FM), Motion Blur (MB), Deformation (DEF) and Occlusion (OCC), its success rate is superior to the other currently popular correlation filtering algorithms, demonstrating that the proposed algorithm further enhances the robustness of the tracking algorithm.
TABLE 1

Claims (3)

1. A correlation filtering tracking method based on adaptive regularization features and joint temporal correlation, characterized by specifically comprising the following steps:
step 1, selecting a video sequence to be tracked, and initializing a first frame of the video sequence;
the specific process of the step 1 is as follows:
step 1.1, manually frame the target region in the first frame of the video sequence to be tracked to obtain the target center position coordinate p_1 and the target scale s_1, with p_1 = [x_1, y_1]^T, where x_1, y_1 are the coordinates of the center position of the first-frame target of the video sequence on the x-axis and y-axis with the upper-left corner of the image as the origin, and s_1 = [h_1, w_1]^T, where h_1, w_1 are the length and width of the first-frame target region;
step 1.2, according to the target center position p_1 and the target scale s_1 in the first frame of the video sequence, determine the first-frame training search area I_1 of the video sequence;
step 1.3, use a convolutional neural network to extract the hierarchical convolutional features of the first-frame training search area I_1 of the video sequence, obtaining the convolutional features f_1^l of the search area I_1; f_1^l is a (k×h_1)×(k×w_1)×c matrix, where k×h_1, k×w_1 are the length and width of the convolutional feature map and c is the number of channels of the convolutional feature map; f_1^l is the first-frame convolutional feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, where f_1^low is the shallow-layer feature, f_1^mid the middle-layer feature and f_1^high the high-layer feature of the network;
step 1.4, window the convolutional features f_1^l according to the size of the target region and the size of the first-frame training search area I_1; the windowed convolutional features are f_1'^l;
step 1.5, train a correlation filter α^l from the windowed convolutional features f_1'^l of step 1.4; α^l is a (k×h_1)×(k×w_1) matrix, with l corresponding to the feature layer; transform the trained correlation filter into the frequency domain;
Step 2, determining the central position of a target in a second frame of the tracking video sequence, and estimating the target scale in the second frame;
the specific process of the step 2 is as follows:
step 2.1, determine the detection search area Z_2 of the second frame, where Z_2 is a rectangular region cut out of the second frame centered at the first-frame target center p_1 with length and width k×s_1 (k times the first-frame target scale), k > 1;
step 2.2, extract the convolutional features of the second-frame detection search area Z_2; each feature is a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; the feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, is respectively the shallow-layer, middle-layer or high-layer feature of the network;
step 2.3, window the detection search area features of the second frame; the windowed features are given by formula (5):
step 2.4, compute by formula (6) the correlation filter response r_2^l of the correlation filter α_1^l on the windowed detection features:
where ∧ denotes the discrete Fourier transform and the hatted quantity denotes the i-th feature channel of the windowed features of the frame t = 2 detection search area;
step 2.5, estimate the target center position p_2 of the second frame by formula (7):
where the factors are, respectively, the adaptive weights between the layer responses, the sub-peak suppression parameters and the temporal-association control parameters between the previous and current frames;
step 2.6, from the center position p_2 of the second-frame target of the tracking video sequence obtained in step 2.5, estimate the target scale s_2 = [h_2, w_2]^T of the second frame using the DSST method;
step 2.7, determine the training search area I_2 of the second frame from the target center position p_2 in the second-frame image and the first-frame scale s_1, and update the correlation filter α_1^l based on the training search area I_2;
step 3, determine the position of the target in frame t of the tracking video sequence and estimate the target scale in frame t, where t > 2;
the specific process of the step 3 is as follows:
step 3.1, determine the training search area I_{t-1} of frame t-1 from the target center position p_{t-1} in the previous frame and the first-frame target scale s_1;
step 3.2, extract the convolutional features f_{t-1}^l of the training search area I_{t-1} of frame t-1; the convolutional feature is a (k×h_1)×(k×w_1)×c matrix, where f_{t-1}^l is the convolutional feature extracted from layer l of the convolutional network;
step 3.3, window the frame t-1 training search area features f_{t-1}^l; the windowed convolutional features f_{t-1}'^l are given by formula (8):
wherein,
step 3.4, train a correlation filter from the windowed convolutional features f_{t-1}'^l; the filter is a (k×h_1)×(k×w_1) matrix and is trained as shown in formula (9):
to accelerate the computation, transforming formula (9) into the frequency domain yields formula (10):
where the hatted quantity denotes the i-th feature channel of the windowed features of the frame t-1 training search area, ∧ denotes the discrete Fourier transform, and the horizontal bar over a letter denotes the complex conjugate;
step 3.5, determine the detection search area Z_t of frame t, where Z_t is a rectangular region cut out of frame t centered at the frame t-1 target center p_{t-1} with length and width k×s_1 (k times the first-frame target scale), k > 1, used as the detection search area of frame t;
step 3.6, extract the convolutional features of the detection search area Z_t of frame t; each feature is a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; the feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, is respectively the shallow-layer, middle-layer or high-layer feature of the network;
step 3.7, window the detection search area convolutional features of frame t; the windowed convolutional features are given by formula (11):
step 3.8, compute by formula (12) the correlation filter response r_t^l of the correlation filter on the windowed detection features:
where ∧ denotes the discrete Fourier transform;
step 3.9, estimate the target center position p_t in frame t by formula (13):
where the factors are, respectively, the adaptive weights between the layer responses, the sub-peak suppression parameters and the temporal-association control parameters between the previous and current frames;
step 3.10, from the center position p_t of the frame-t target of the tracking video sequence obtained in step 3.9, estimate the target scale s_t = [h_t, w_t]^T of frame t using the DSST method;
step 3.11, determine the training search area I_t of frame t from the target center position p_t in the frame-t image and the first-frame scale s_1, and update the correlation filter based on the training search area I_t;
and step 3.12, repeating the steps 3.1-3.11 until tracking is finished.
2. The correlation filtering tracking method based on adaptive regularization features and joint temporal correlation of claim 1, characterized in that the specific windowing process in step 1.4 is as follows:
the total window function after windowing is the superposition of two window functions, a cosine window and a Gaussian window, where:
the cosine window is determined by the size of the convolutional features f_1^l of the first-frame training search area; since the size of f_1^l does not change, the cosine window function w_cos remains unchanged;
the first frame training search area gaussian window is determined by the following equation (1):
where (m, n) are the coordinates of each point in the Gaussian window, M = (k×h_1)/2, N = (k×w_1)/2, and δ is a regulating factor;
the total window function after windowing is shown in the following formula (2):
3. The correlation filtering tracking method based on adaptive regularization features and joint temporal correlation of claim 2, characterized in that the specific process of step 1.5 is as follows:
Let f_1'^l(m, n) be the cyclic-shift sample generated by cyclically shifting f_1'^l by m elements along its length and n elements along its width; the label corresponding to each cyclic-shift sample is a soft label generated by a Gaussian function with Gaussian variance ε; ridge regression is then used to train the correlation filter α_1^l, as shown in formula (3):
In formula (3), * denotes the convolution operation, y is a matrix of size h_1×w_1 whose element in the m-th row and n-th column is y_{m,n} = y(m, n), and λ is a regularization coefficient; to accelerate the computation, formula (3) is transformed into the frequency domain to obtain formula (4):
where the hatted quantity denotes the i-th feature channel of the windowed features of the first-frame training search area, ∧ denotes the discrete Fourier transform, the horizontal bar over a letter denotes the complex conjugate, ⊙ denotes the Hadamard product of matrices, and c is the number of feature channels.
CN202110214541.4A 2021-02-26 2021-02-26 Correlation filtering tracking method based on self-adaptive regular feature joint time correlation Active CN112819865B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110214541.4A CN112819865B (en) 2021-02-26 2021-02-26 Correlation filtering tracking method based on self-adaptive regular feature joint time correlation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110214541.4A CN112819865B (en) 2021-02-26 2021-02-26 Correlation filtering tracking method based on self-adaptive regular feature joint time correlation

Publications (2)

Publication Number Publication Date
CN112819865A CN112819865A (en) 2021-05-18
CN112819865B true CN112819865B (en) 2024-02-09

Family

ID=75863930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110214541.4A Active CN112819865B (en) 2021-02-26 2021-02-26 Correlation filtering tracking method based on self-adaptive regular feature joint time correlation

Country Status (1)

Country Link
CN (1) CN112819865B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327273B (en) * 2021-06-15 2023-12-19 中国人民解放军火箭军工程大学 Infrared target tracking method based on variable window function correlation filtering

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015163830A1 (en) * 2014-04-22 2015-10-29 Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi Target localization and size estimation via multiple model learning in visual tracking
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges
CN110223323A (en) * 2019-06-02 2019-09-10 西安电子科技大学 Method for tracking target based on the adaptive correlation filtering of depth characteristic
CN111383249A (en) * 2020-03-02 2020-07-07 西安理工大学 Target tracking method based on multi-region layer convolution characteristics

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109272530B (en) * 2018-08-08 2020-07-21 北京航空航天大学 Target tracking method and device for space-based monitoring scene

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015163830A1 (en) * 2014-04-22 2015-10-29 Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi Target localization and size estimation via multiple model learning in visual tracking
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges
CN110223323A (en) * 2019-06-02 2019-09-10 西安电子科技大学 Method for tracking target based on the adaptive correlation filtering of depth characteristic
CN111383249A (en) * 2020-03-02 2020-07-07 西安理工大学 Target tracking method based on multi-region layer convolution characteristics

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张伟; 温显斌. Kernel correlation filtering tracking algorithm based on multiple features and scale estimation. Journal of Tianjin University of Technology, 2020, No. 03, full text. *
王守义; 周海英; 杨阳. Kernel correlation adaptive target tracking based on convolutional features. Journal of Image and Graphics, 2017, No. 09, full text. *

Also Published As

Publication number Publication date
CN112819865A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN108053419B (en) Multi-scale target tracking method based on background suppression and foreground anti-interference
CN110084836B (en) Target tracking method based on deep convolution characteristic hierarchical response fusion
CN108734723B (en) Relevant filtering target tracking method based on adaptive weight joint learning
CN108062531B (en) Video target detection method based on cascade regression convolutional neural network
CN112184752A (en) Video target tracking method based on pyramid convolution
CN109741366B (en) Related filtering target tracking method fusing multilayer convolution characteristics
CN107689052B (en) Visual target tracking method based on multi-model fusion and structured depth features
CN109859241B (en) Adaptive feature selection and time consistency robust correlation filtering visual tracking method
CN110276785B (en) Anti-shielding infrared target tracking method
CN110175649B (en) Rapid multi-scale estimation target tracking method for re-detection
CN111311647B (en) Global-local and Kalman filtering-based target tracking method and device
CN111724411B (en) Multi-feature fusion tracking method based on opposite-impact algorithm
CN112651998B (en) Human body tracking algorithm based on attention mechanism and double-flow multi-domain convolutional neural network
CN109035300B (en) Target tracking method based on depth feature and average peak correlation energy
CN111612817A (en) Target tracking method based on depth feature adaptive fusion and context information
CN110111370B (en) Visual object tracking method based on TLD and depth multi-scale space-time features
CN110276784B (en) Correlation filtering moving target tracking method based on memory mechanism and convolution characteristics
CN106887012A (en) A kind of quick self-adapted multiscale target tracking based on circular matrix
CN116343267B (en) Human body advanced semantic clothing changing pedestrian re-identification method and device of clothing shielding network
CN108830170A (en) A kind of end-to-end method for tracking target indicated based on layered characteristic
CN112819865B (en) Correlation filtering tracking method based on self-adaptive regular feature joint time correlation
CN111027586A (en) Target tracking method based on novel response map fusion
CN113963026A (en) Target tracking method and system based on non-local feature fusion and online updating
CN112131991B (en) Event camera-based data association method
CN112883928A (en) Multi-target tracking algorithm based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant