CN112819865B - Correlation filtering tracking method based on self-adaptive regular feature joint time correlation - Google Patents

Correlation filtering tracking method based on self-adaptive regular feature joint time correlation

Info

Publication number
CN112819865B
CN112819865B CN202110214541.4A
Authority
CN
China
Prior art keywords
frame
target
feature
convolution
windowing
Prior art date
Legal status
Active
Application number
CN202110214541.4A
Other languages
Chinese (zh)
Other versions
CN112819865A (en)
Inventor
刘龙
惠志轩
杨尚其
Current Assignee
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN202110214541.4A
Publication of CN112819865A
Application granted
Publication of CN112819865B
Legal status: Active
Anticipated expiration


Classifications

    • G06T7/20 Analysis of motion
    • G06T7/262 Analysis of motion using transform domain methods, e.g. Fourier domain methods
    • G06T7/248 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments, involving reference images or patches
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20056 Discrete and fast Fourier transform [DFT, FFT]
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Abstract

The invention discloses a correlation filtering tracking method based on adaptive regularization features and joint temporal correlation, which specifically comprises the following steps: step 1, selecting a video sequence to be tracked and initializing its first frame; step 2, determining the center position of the target in the second frame of the tracking video sequence and estimating the target scale in the second frame; step 3, determining the position of the target in frame t of the tracking video sequence and estimating the target scale in frame t, where t > 2. The method solves the problem that, when the target scale changes, fixed windowing causes the filter to learn an incomplete target or too much background information.

Description

Correlation filtering tracking method based on self-adaptive regular feature joint time correlation
Technical Field
The invention belongs to the technical field of video image tracking in machine vision, and relates to a correlation filtering tracking method based on adaptive regularization features and joint temporal correlation.
Background
With the rapid development of computer technology, target tracking has become one of the hot topics in computer vision research. Visual target tracking uses an algorithm to continuously mark the position of the tracked target in each frame of a video sequence, so as to obtain motion parameters of the target such as position, velocity and acceleration; these can then be further processed and analysed to understand the target's behaviour and to complete higher-level tasks. As an important branch of computer vision, it has a wide range of applications in science and technology, national defense, aerospace, medicine and health, and the national economy, such as intelligent video surveillance, human-computer interaction, robotics and autonomous driving. Correlation filtering target tracking methods have the advantages of high processing speed and high tracking accuracy.
In correlation filtering it is important to extract sample features of the target region, and two types of features are commonly extracted: (1) conventional features, i.e. manually designed features such as image color histograms, histograms of oriented gradients (HOG) and local binary patterns (LBP); (2) convolutional features, i.e. features extracted with a deep convolutional neural network (CNN). Feature maps output by the later convolutional layers of a CNN (called deep features) carry rich semantic information and translation invariance and are robust to changes of the target, but for target tracking, which requires accurate localization, deep features alone are insufficient because their spatial resolution is too low to position the tracked target precisely. Feature maps output by the earlier convolutional layers (called shallow features) retain high spatial detail, which is very beneficial for target localization, but they are not robust to changes in the target's appearance.
Existing correlation filtering tracking methods extract features from a rectangular region around the target in the first frame and train a correlation filter on these features by ridge regression. In each subsequent frame, after the filter and the features have been updated, the filter is correlated with the features of the search region of the frame to be tracked; a response map is obtained from this correlation, and the position of the maximum response value in the response map is the target position. Correlation filtering based on CNN features replaces the conventional features with multi-layer convolutional features, and after the multi-layer correlation filter responses are obtained the target position is inferred by weighted fusion.
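As a point of reference for the prior-art pipeline described above, the following is a minimal single-channel sketch of a generic correlation filter tracker (train by ridge regression in the Fourier domain, correlate the filter with the search-region features, take the peak of the response). It is purely illustrative and is not the method of the invention.

```python
import numpy as np

def train_filter(x, y, lam=1e-4):
    # x: search-region feature (H x W), y: Gaussian label map (H x W)
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(X) * Y / (np.conj(X) * X + lam)    # filter in the Fourier domain

def detect(alpha_hat, z):
    # z: feature of the search region in the frame to be tracked
    response = np.real(np.fft.ifft2(alpha_hat * np.fft.fft2(z)))
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    return (dy, dx), response                          # peak location gives the target shift
```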
(1) The prior art does not take changes of the target scale into account: the original fixed windowing can cause the filter to learn an incomplete target or too much background information, so the filter has poor robustness, the tracking precision is low, and tracking may ultimately fail.
(2) When existing methods use convolutional features for correlation filtering tracking, the target position is inferred by weighted fusion of the multi-layer correlation filter responses, but the weights are fixed in every frame rather than adaptive; they therefore cannot meet the needs of the tracker, which reduces the robustness and precision of the tracker.
Disclosure of Invention
The invention aims to provide a correlation filtering tracking method based on adaptive regularization features and joint temporal correlation, which solves the problem that, when the target scale changes, the original fixed windowing causes the filter to learn an incomplete target or too much background information.
The technical scheme adopted by the invention is a correlation filtering tracking method based on adaptive regularization features and joint temporal correlation, which specifically comprises the following steps:
step 1, selecting a video sequence to be tracked, and initializing a first frame of the video sequence;
step 2, determining the central position of a target in a second frame of the tracking video sequence, and estimating the target scale in the second frame;
step 3, determining the position of the target in frame t of the tracking video sequence, and estimating the target scale in frame t, where t > 2.
The invention is also characterized in that:
the specific process of the step 1 is as follows:
step 1.1, manually frame the target region in the first frame of the video sequence to be tracked to obtain the target center position coordinate p_1 and the target scale s_1, with p_1 = [x_1, y_1]^T, where x_1, y_1 are the coordinates of the center position of the first-frame target of the video sequence on the x-axis and y-axis with the upper-left corner of the image as the origin, and s_1 = [h_1, w_1]^T, where h_1, w_1 are the length and width of the first-frame target region;
step 1.2, according to the target center position p_1 and the target scale s_1 in the first frame of the video sequence, determine the first-frame training search area I_1 of the video sequence;
step 1.3, use a convolutional neural network to extract the hierarchical convolutional features of the first-frame training search area I_1 of the video sequence, obtaining the convolutional features f_1^l of the search area I_1; f_1^l is a (k×h_1)×(k×w_1)×c matrix, where k×h_1, k×w_1 are the length and width of the convolutional feature map and c is the number of channels of the convolutional feature map; f_1^l is the first-frame convolutional feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, where f_1^low is the shallow-layer feature, f_1^mid the middle-layer feature and f_1^high the high-layer feature of the network;
step 1.4, window the convolutional features f_1^l according to the size of the target region and the size of the first-frame training search area I_1; the windowed convolutional features are f_1'^l;
step 1.5, train a correlation filter α_1^l from the windowed convolutional features f_1'^l of step 1.4; α_1^l is a (k×h_1)×(k×w_1) matrix, with l corresponding to the feature layer; transform the trained correlation filter into the frequency domain.
The specific windowing process in step 1.4 is as follows: the total window function after windowing is the superposition of two window functions, a cosine window and a Gaussian window, where:
the cosine window is determined by the size of the convolutional features of the first-frame training search area, i.e. by the size of f_1^l; since the size of f_1^l does not change, the cosine window function w_cos remains unchanged;
the first frame training search area gaussian window is determined by the following equation (1):
where (m, n) are the coordinates of each point in the Gaussian window, M = (k×h_1)/2, N = (k×w_1)/2, and δ is a regulating factor;
the total window function after windowing is shown in the following formula (2):
the specific process of step 1.5 is as follows:
order theTo->Cyclic shift samples generated by performing cyclic shift of m and n elements in length and width respectively, wherein the label corresponding to each cyclic shift sample is a soft label and is formed by Gaussian function +>Generating epsilon as Gaussian variance, at which time ridge regression is used to train the correlation filter alpha 1 l The following formula (3) shows:
in formula (3), x represents a convolution operation, y is the size h 1 ×w 1 The matrix of m-th row and n-th column of elements y m,n =y (m, n) =y, λ is a regularization coefficient, and in order to accelerate the operation, the following equation (4) is obtained by transforming equation (3) into the frequency domain:
wherein,an ith feature channel representing a windowed feature of the training search area of frame 1, ∈a is a discrete fourier transform, +.>The horizontal lines on the letters represent complex conjugates, the Hadamard product of the matrix, and c the number of characteristic channels.
The specific process of the step 2 is as follows:
step 2.1, determine the detection search area Z_2 of the second frame, where Z_2 is a rectangular region cut out of the second frame centered at the first-frame target center p_1 with length and width k×s_1 (k times the first-frame target scale), k > 1;
step 2.2, extract the convolutional features of the second-frame detection search area Z_2; each feature is a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; the feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, is respectively the shallow-layer, middle-layer or high-layer feature of the network;
step 2.3, window the detection search area features of the second frame; the windowed features are given by formula (5):
step 2.4, compute by formula (6) the correlation filter response r_2^l of the correlation filter α_1^l on the windowed detection features:
where ∧ denotes the discrete Fourier transform and the hatted quantity denotes the i-th feature channel of the windowed features of the second-frame detection search area;
step 2.5, estimate the target center position p_2 of the second frame by formula (7):
where the factors are, respectively, the adaptive weights between the layer responses, the sub-peak suppression parameters and the temporal-association control parameters between the previous and current frames;
step 2.6, from the center position p_2 of the second-frame target of the tracking video sequence obtained in step 2.5, estimate the target scale s_2 = [h_2, w_2]^T of the second frame using the DSST method;
step 2.7, determine the training search area I_2 of the second frame from the target center position p_2 in the second-frame image and the first-frame scale s_1, and update the correlation filter α_1^l based on the training search area I_2.
The specific process of the step 3 is as follows:
step 3.1, determine the training search area I_{t-1} of frame t-1 from the target center position p_{t-1} in the previous frame and the first-frame target scale s_1;
step 3.2, extract the convolutional features f_{t-1}^l of the training search area I_{t-1} of frame t-1; the convolutional feature is a (k×h_1)×(k×w_1)×c matrix, where f_{t-1}^l is the convolutional feature extracted from layer l of the convolutional network;
step 3.3, window the frame t-1 training search area features f_{t-1}^l; the windowed convolutional features f_{t-1}'^l are given by formula (8):
wherein,
step 3.4, train a correlation filter from the windowed convolutional features f_{t-1}'^l; the filter is a (k×h_1)×(k×w_1) matrix and is trained as shown in formula (9):
to accelerate the computation, transforming formula (9) into the frequency domain yields formula (10):
where the hatted quantity denotes the i-th feature channel of the windowed features of the frame t-1 training search area, ∧ denotes the discrete Fourier transform, and the horizontal bar over a letter denotes the complex conjugate;
step 3.5, determine the detection search area Z_t of frame t, where Z_t is a rectangular region cut out of frame t centered at the frame t-1 target center p_{t-1} with length and width k×s_1 (k times the first-frame target scale), k > 1, used as the detection search area of frame t;
step 3.6, extract the convolutional features of the detection search area Z_t of frame t; each feature is a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; the feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, is respectively the shallow-layer, middle-layer or high-layer feature of the network;
step 3.7, window the detection search area convolutional features of frame t; the windowed convolutional features are given by formula (11):
step 3.8, compute by formula (12) the correlation filter response r_t^l of the correlation filter on the windowed detection features:
where ∧ denotes the discrete Fourier transform and the hatted quantity denotes the i-th feature channel of the windowed features of the frame-t detection search area;
step 3.9, estimate the target center position p_t in frame t by formula (13):
where the factors are, respectively, the adaptive weights between the layer responses, the sub-peak suppression parameters and the temporal-association control parameters between the previous and current frames;
step 3.10, from the center position p_t of the frame-t target of the tracking video sequence obtained in step 3.9, estimate the target scale s_t = [h_t, w_t]^T of frame t using the DSST method;
step 3.11, determine the training search area I_t of frame t from the target center position p_t in the frame-t image and the first-frame scale s_1, and update the correlation filter based on the training search area I_t.
And step 3.12, repeating the steps 3.1-3.11 until tracking is finished.
The method has the advantage of improving tracking precision and robustness in the situation where, when the scale changes during tracking, the target features would otherwise be learned incompletely or too much background would be learned; it also realizes adaptive distribution of the weights among the response maps, which improves the precision and robustness of the tracker.
Drawings
FIG. 1 is a comparison of the correlation filtering tracking method based on adaptive regularization features and joint temporal correlation of the present invention with existing tracking algorithms in terms of precision;
FIG. 2 is a comparison of the correlation filtering tracking method based on adaptive regularization features and joint temporal correlation of the present invention with existing tracking algorithms in terms of success rate.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The correlation filtering tracking method based on adaptive regularization features and joint temporal correlation of the invention specifically comprises the following steps:
step 1, selecting a video sequence to be tracked, and initializing a first frame of the sequence;
Step 1.1, manually frame the target region in the first frame to obtain the target center position coordinate p_1 and the target scale s_1, with p_1 = [x_1, y_1]^T, where x_1, y_1 are the coordinates of the center position of the first-frame target on the x-axis and y-axis with the upper-left corner of the image as the origin, and s_1 = [h_1, w_1]^T, where h_1, w_1 are the length and width of the first-frame target region; t denotes the current frame of the video;
Step 1.2, determine the first-frame training search area I_1 from the target center position p_1 and target scale s_1 in the first frame, as follows:
in the image acquired at the first frame, cut out a rectangular region centered at p_1 with length and width k×s_1, k > 1, as the first-frame training search area I_1, where k is a specified parameter;
Step 1.3, extract the hierarchical convolutional features of the first-frame training search area I_1 with the VGG-19 convolutional neural network and use bilinear interpolation to make the size of the convolutional feature maps consistent with the size of the input image, obtaining the convolutional features f_1^l of the region; f_1^l is a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; f_1^l denotes the first-frame convolutional feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, where f_1^low is the shallow-layer feature, f_1^mid the middle-layer feature and f_1^high the high-layer feature of the network.
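As an illustration of this step, the following is a minimal sketch of hierarchical feature extraction with a pretrained VGG-19 and bilinear interpolation back to the patch size. It assumes PyTorch/torchvision rather than the MatConvNet environment used in the embodiment, and the chosen layer indices (conv3_4, conv4_4, conv5_4) are an assumption consistent with using the conv3/conv4/conv5 blocks.

```python
# Sketch only: extract shallow/mid/high VGG-19 features for a search-region patch
# and resize each map back to the patch size with bilinear interpolation.
# Assumes torchvision's pretrained VGG-19; the layer indices are illustrative.
import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
LAYERS = {"low": 16, "mid": 25, "high": 34}   # conv3_4, conv4_4, conv5_4 (assumed)

def extract_features(patch):
    # patch: 1 x 3 x (k*h1) x (k*w1) tensor, already normalized for VGG
    feats, x = {}, patch
    with torch.no_grad():
        for idx, layer in enumerate(vgg):
            x = layer(x)
            for name, stop in LAYERS.items():
                if idx == stop:
                    feats[name] = F.interpolate(
                        x, size=patch.shape[-2:], mode="bilinear", align_corners=False)
    return feats   # each value: 1 x c_l x (k*h1) x (k*w1)
```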
Step 1.4, window the features according to the target size and the size of the first-frame training search area. The final total window function is the superposition of two window functions, a cosine window and a Gaussian window, and is called the adaptive regular feature window. Specifically:
the cosine window is determined by the size of the first-frame training search area features, i.e. by the size of f_1^l; since the size of f_1^l does not change, the window function w_cos remains unchanged;
the Gaussian window of the first-frame training search area is determined by the size of the first-frame training search area and the first-frame target scale, where (m, n) are the coordinates of each point of the Gaussian window, M = (k×h_1)/2, N = (k×w_1)/2, and δ is a regulating factor. When the target size changes, the Gaussian variance σ_1 of the Gaussian window follows the change of the target and is proportional to the target area: when the target grows, the larger variance makes the Gaussian function flatter, its suppression of the target-region features weakens, and more of the target is exposed; conversely, when the target shrinks, the Gaussian function becomes sharper and the suppression of the background increases, so that the filter does not learn too much background.
The windowed feature is finally taken as the final feature f_1'^l.
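A minimal sketch of this adaptive regular feature window is given below. The exact superposition rule of formula (2) and the exact definition of the adaptive variance are images not reproduced in this text, so the element-wise product of the two windows and the choice sigma = delta * sqrt(target area) are assumptions made for illustration only.

```python
# Sketch of the adaptive regular feature window: a fixed cosine (Hann) window
# combined with a Gaussian window whose variance follows the current target area.
# Both the combination rule and the variance definition below are assumptions.
import numpy as np

def adaptive_window(feat_h, feat_w, target_h, target_w, delta=0.43):
    w_cos = np.outer(np.hanning(feat_h), np.hanning(feat_w))   # fixed cosine window
    M, N = feat_h / 2.0, feat_w / 2.0                           # window centre
    sigma = delta * np.sqrt(target_h * target_w)                # assumed: grows with target area
    m, n = np.meshgrid(np.arange(feat_h), np.arange(feat_w), indexing="ij")
    w_gauss = np.exp(-((m - M) ** 2 + (n - N) ** 2) / (2.0 * sigma ** 2))
    return w_cos * w_gauss                                      # assumed superposition
```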
Step 1.5, train the correlation filter α^l based on the extracted features; α^l is a (k×h_1)×(k×w_1) matrix, with l corresponding to the feature layer. The specific method is as follows:
Let f_1'^l(m, n) be the cyclic-shift sample generated by cyclically shifting f_1'^l by m elements along its length and n elements along its width; the label corresponding to each cyclic-shift sample is a soft label generated by a Gaussian function with Gaussian variance ε; ridge regression is then used to train the correlation filter α^l:
In formula (1), * denotes the convolution operation, y is a matrix of size h_1×w_1 whose element in the m-th row and n-th column is y_{m,n} = y(m, n), and λ is the regularization coefficient. To accelerate the computation, formula (1) is transformed into the frequency domain to obtain
where the hatted quantity denotes the i-th feature channel of the windowed features of the training search area of frame t = 1, ∧ denotes the discrete Fourier transform, the horizontal bar over a letter denotes the complex conjugate, ⊙ denotes the Hadamard product of matrices, and c is the number of feature channels. In order to explain the filter update more concisely later, the numerator and the denominator of formula (2) are recorded separately.
Step 2, determine the target center position p_2 = [x_2, y_2]^T in the second frame of the video image and estimate the target scale s_2 = [h_2, w_2]^T.
Step 2.1, determine the detection search area Z_2 of the second frame, where Z_2 is a rectangular region cut out of the second frame centered at the first-frame target center p_1 with length and width k×s_1, k > 1, used as the detection search area of the second frame.
Step 2.2, extract the convolutional features of the second-frame detection search area and use bilinear interpolation to make the size of the feature maps consistent with the input; the interpolated feature is still a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; the feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, is respectively the shallow-layer, middle-layer or high-layer feature of the network.
Step 2.3, window the detection search area features of the second frame; the final total window function is again the superposition of a cosine window and a Gaussian window (the adaptive regular feature window), where:
the cosine window is determined by the size of the second-frame detection search area features, which is consistent with the size of f_1^l and does not change, so the window function w_cos remains unchanged;
the Gaussian window of the second-frame detection search area is determined by the size of the second-frame detection search area and the first-frame target scale; the windowed feature is taken as the final feature.
Step 2.4, compute by formula (3) the correlation filter response r_2^l of the correlation filter on the windowed detection features:
where ∧ denotes the discrete Fourier transform.
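A sketch of this detection step is shown below, reusing the numerator/denominator representation from step 1.5. The exact expression of formula (3) is an image not reproduced here, so the channel summation shown is an assumption.

```python
# Sketch (assumed form): correlation response of one layer's filter, given the
# stored numerator A / denominator B and the windowed detection-region features.
import numpy as np

def layer_response(A, B, z_feat):
    # z_feat: H x W x C windowed feature of the detection search area
    z_hat = np.fft.fft2(z_feat, axes=(0, 1))
    r_hat = np.sum(A * z_hat, axis=2) / B        # sum over feature channels
    return np.real(np.fft.ifft2(r_hat))          # spatial response map r_t^l
```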
Step 2.5, estimate the target center position p_2 of the second frame as follows:
where the factors are, respectively, the adaptive weights between the layer responses, the sub-peak suppression parameters and the temporal-association control parameters between the previous and current frames; the two parameters are determined as follows:
Record the maximum value and the secondary peak of each layer's response map. When the ratio of the secondary peak to the maximum value of a response map exceeds a given threshold γ, the response map is considered not ideal; otherwise it is ideal. The invention adaptively weights the response maps by formula (5):
If the frame to be tracked is the second frame, then:
After the sub-peak suppression parameters and the temporal-association control parameters have been obtained, the adaptive weights are obtained from them, and the target center position p_2 of the second frame is then obtained according to formula (4).
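The sketch below illustrates the sub-peak suppression idea just described: a layer whose secondary peak is too close to its main peak (ratio above γ) is treated as unreliable and down-weighted in the fusion. The actual weighting formulas (4) and (5) are images not reproduced in this text, so the peak-value weights and the hard gating used here are assumptions.

```python
# Sketch of sub-peak suppression and adaptive fusion of the layer response maps.
# The gating rule and the use of the peak value as the weight are assumptions.
import numpy as np
from scipy.ndimage import maximum_filter

def subpeak_weight(r, gamma=0.8):
    peak = r.max()
    local_max = (r == maximum_filter(r, size=5))       # candidate peaks
    peaks = np.sort(r[local_max])[::-1]
    second = peaks[1] if peaks.size > 1 else 0.0
    reliable = (second / (peak + 1e-12)) <= gamma      # sub-peak ratio test
    return peak if reliable else 0.0                   # assumed: unreliable layer gets no weight

def fuse_responses(responses, gamma=0.8):
    w = np.array([subpeak_weight(r, gamma) for r in responses], dtype=float)
    w /= (w.sum() + 1e-12)
    return sum(wi * ri for wi, ri in zip(w, responses))    # fused response map
```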
Step 2.6, scale estimation
After the center position p_2 of the target in the second-frame image of the video sequence has been obtained, the target scale s_2 = [h_2, w_2]^T of the second frame is estimated using the existing DSST method.
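DSST itself trains a dedicated one-dimensional scale correlation filter over a scale pyramid; the snippet below is only a highly simplified stand-in that picks, among a set of pre-cropped candidate scales, the one whose translation response is strongest, and is not the full DSST procedure.

```python
# Highly simplified illustration of scale estimation over a scale pyramid.
# patches_by_scale maps a relative scale factor to a feature patch already
# resized to the filter size; respond_fn returns that patch's response map.
import numpy as np

def estimate_scale(patches_by_scale, respond_fn):
    best_scale, best_score = 1.0, -np.inf
    for s, patch in patches_by_scale.items():
        score = respond_fn(patch).max()
        if score > best_score:
            best_scale, best_score = s, score
    return best_scale
```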
Step 2.7, update of the correlation filter α_1^l model:
Determine the training search area I_2 of the second frame from the new target center position p_2 and the first-frame scale s_1, as follows:
In the image acquired at the second frame, cut out a rectangular region centered at p_2 with length and width k×s_1, k > 1, as the training search area I_2, where k is a designated parameter. Then use the VGG-19 convolutional neural network to extract the features of the rectangular image block I_2 of the target area and use bilinear interpolation to make the size of the convolutional feature maps consistent with the size of the input image, obtaining the convolutional features f_2^l of the region; f_2^l is a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; f_2^l is the second-frame convolutional feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, comprising the shallow-layer, middle-layer and high-layer features of the network.
Window the above features; the final total window function is again the superposition of a cosine window and a Gaussian window (the adaptive regular feature window), where:
the cosine window is determined by the size of the second-frame training search area features, which is consistent with the size of f_1^l and does not change, so the window function w_cos remains unchanged;
the Gaussian window of the second-frame training search area is determined by the size of the second-frame training search area and the second-frame target scale; the windowed feature is taken as the final feature. After it has been obtained, each filter is updated by
where η is the learning rate and c is the number of feature channels; the hatted quantity denotes the i-th feature channel of the windowed features of the training search area of frame t = 2.
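The update formula itself is an image not reproduced in this text; the sketch below assumes the usual linear interpolation of the stored numerator and denominator with learning rate η, as in DSST/HCF-style trackers.

```python
# Sketch (assumed form): running update of one layer's filter model with
# learning rate eta, interpolating the stored numerator A and denominator B.
def update_model(A_old, B_old, A_new, B_new, eta=0.01):
    A = (1.0 - eta) * A_old + eta * A_new
    B = (1.0 - eta) * B_old + eta * B_new
    return A, B
```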
Step 3, determine the target position p_t = [x_t, y_t]^T and the target scale s_t = [h_t, w_t]^T of frame t (t > 2).
Step 3.1, determine the training search area I_{t-1} of frame t-1 from the previous-frame target center position p_{t-1} and the first-frame target scale s_1, as follows:
In the image acquired at frame t-1, cut out a rectangular region centered at p_{t-1} with length and width (k×h_1, k×w_1), k > 1, as the frame t-1 training search area I_{t-1}.
Step 3.2, extract the convolutional features of the frame t-1 training search area and use bilinear interpolation to make the size of the feature maps consistent with the input, obtaining f_{t-1}^l, a (k×h_1)×(k×w_1)×c matrix which is the convolutional feature extracted from layer l of the convolutional network.
Step 3.3, window the frame t-1 training search area features f_{t-1}^l; the final total window function is again the superposition of a cosine window and a Gaussian window (the adaptive regular feature window), where:
the cosine window is determined by the size of the frame t-1 training search area features, which is consistent with the size of f_1^l and does not change, so the window function w_cos remains unchanged;
the Gaussian window of the frame t-1 training search area is determined by the size of the frame t-1 training search area and the frame t-1 target scale; the windowed feature is taken as the final feature f_{t-1}'^l.
Step 3.4, train the correlation filter α^l from the extracted features; α^l is a (k×h_1)×(k×w_1) matrix, with l corresponding to the feature layer. The specific method is as follows:
Let f_{t-1}'^l(m, n) be the cyclic-shift sample generated by cyclically shifting f_{t-1}'^l by m elements along its length and n elements along its width; the label corresponding to each cyclic-shift sample is a soft label generated by a Gaussian function with Gaussian variance ε; ridge regression is then used to train the correlation filter α^l:
In formula (7), * denotes the convolution operation, y is an h×w matrix whose element in the m-th row and n-th column is y_{m,n} = y(m, n), and λ is the regularization coefficient. To accelerate the computation, formula (7) is transformed into the frequency domain to obtain
where the hatted quantity denotes the i-th feature channel of the windowed features of the frame t-1 training search area, ∧ denotes the discrete Fourier transform, the horizontal bar over a letter denotes the complex conjugate, ⊙ denotes the Hadamard product of matrices, and c is the number of feature channels.
Step 3.5, determine the detection search area Z_t of frame t, where Z_t is a rectangular region cut out of frame t centered at the frame t-1 target center p_{t-1} with length and width k×s_1, k > 1, used as the detection search area of frame t.
Step 3.6, extract the features of the detection search area Z_t of frame t and use bilinear interpolation to make the size of the feature maps identical to the input; the interpolated feature is still a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; the feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, is respectively the shallow-layer, middle-layer or high-layer feature of the network.
Step 3.7, window the detection search area features of frame t; the final total window function is again the superposition of a cosine window and a Gaussian window (the adaptive regular feature window), where:
the cosine window is determined by the size of the frame-t detection search area features, which is consistent with the size of f_1^l and does not change, so the window function w_cos remains unchanged;
the Gaussian window of the frame-t detection search area is determined by the size of the frame-t detection search area and the frame t-1 target scale; the windowed feature is taken as the final feature.
Step 3.8, compute by formula (9) the correlation filter response r_t^l of the correlation filter on the windowed detection features:
where ∧ denotes the discrete Fourier transform.
Step 3.9, estimate the target center position p_t as follows:
where the factors are, respectively, the adaptive weights between the layer responses, the sub-peak suppression parameters and the temporal-association control parameters between the previous and current frames; the two parameters are determined as follows:
Record the maximum value and the secondary peak of each layer's response map. When the ratio of the secondary peak to the maximum value of a response map exceeds a given threshold γ, the response map is considered not ideal; otherwise it is ideal. The response maps are adaptively weighted by formula (11):
Record the maximum positions in the three response maps that generated the previous-frame target position p_{t-1}:
where t-1 denotes the previous frame and the recorded quantities are the coordinates of the maximum positions of the three response maps in frame t-1; likewise, record the maximum positions in the three response maps that generate the target position p_t of the current frame:
where t represents the frame to be tracked.
Since a video sequence is not a single image but has a certain temporal correlation, the target position changes little between adjacent frames; using this property, μ_i can be determined according to formula (14):
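The sketch below illustrates the temporal-association idea just described: a layer whose response peak jumps far from its location in the previous frame is treated as unreliable. Formula (14) is an image not reproduced in this text, so the hard distance threshold used here is an assumption.

```python
# Sketch of the temporal-association control parameter mu_i for the three layers.
# The gating rule and threshold are assumptions standing in for formula (14).
import numpy as np

def temporal_weights(peaks_prev, peaks_curr, max_shift=20.0):
    # peaks_prev / peaks_curr: [(x, y), ...] per-layer peak coordinates at t-1 and t
    mu = []
    for (x0, y0), (x1, y1) in zip(peaks_prev, peaks_curr):
        shift = np.hypot(x1 - x0, y1 - y0)
        mu.append(1.0 if shift <= max_shift else 0.0)
    return mu
```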
Step 3.10, estimating the scale;
After the center position p_t of the target in the frame-t image of the video sequence has been obtained, the target scale s_t = [h_t, w_t]^T of frame t is estimated using the existing DSST method.
Step 3.11, update of the correlation filter model:
Determine the training search area I_t of frame t from the new target center position p_t in frame t and the first-frame scale s_1, as follows:
In the image acquired at frame t, cut out a rectangular region centered at p_t with length and width k×s_1, k > 1, as the training search area I_t, where k is a designated parameter. Then use the VGG-19 convolutional neural network to extract the features of the rectangular image block I_t of the target area and use bilinear interpolation to make the size of the convolutional feature maps consistent with the size of the input image, obtaining the convolutional features f_t^l of the region; f_t^l is a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; f_t^l is the frame-t convolutional feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, comprising the shallow-layer, middle-layer and high-layer features of the network.
Window the above features; the final total window function is again the superposition of a cosine window and a Gaussian window (the adaptive regular feature window), where:
the cosine window is determined by the size of the frame-t training search area features, which is consistent with the size of f_1^l and does not change, so the window function w_cos remains unchanged;
the Gaussian window of the frame-t training search area is determined by the size of the frame-t training search area and the frame-t target scale; the windowed feature is taken as the final feature. After it has been obtained, each filter is updated by
where η is the learning rate and c is the number of feature channels; the hatted quantity denotes the i-th feature channel of the windowed features of the frame-t training search area.
Step 3.12, when a new frame of the video sequence arrives, repeat steps 3.1-3.11 until tracking is finished.
Examples
The algorithm is evaluated on the OTB-100 dataset. The development environment is Matlab R2018b with the MatConvNet-GPU deep learning library; the processor is an AMD Ryzen 7 1700 Eight-Core Processor and the GPU is a GTX-1060. The algorithm uses the same parameters for all test videos, set as follows: regularization parameter λ = 10^-4, adjustment factor δ = 0.43, learning rate η = 0.01, Gaussian variance ε = 0.3, search-area adjustment parameter k = 2; the features of the conv3, conv4 and conv5 layers of the VGG-19 network are selected as output features. The proposed algorithm is evaluated experimentally by comparison with advanced tracking methods.
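For convenience, the experimental parameters listed above are collected below; the variable names are illustrative, only the values come from the text.

```python
# Experimental parameters from the embodiment above (names are illustrative).
PARAMS = dict(
    regularization_lambda=1e-4,   # ridge-regression regularizer
    window_delta=0.43,            # Gaussian-window adjustment factor
    learning_rate_eta=0.01,       # filter model update rate
    label_sigma_epsilon=0.3,      # Gaussian soft-label variance
    search_area_k=2,              # search-region scale factor
    vgg19_blocks=(3, 4, 5),       # VGG-19 conv blocks used as features
)
```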
The algorithm of the present invention is denoted HZXT, and it is evaluated by comparison with three representative trackers: SRDCF and BACF, which are based on correlation filtering, and HCF, which is based on deep learning. First, comparison plots of the tracking algorithms in terms of success rate and precision are drawn, as shown in FIGS. 1 and 2. Compared with the other algorithms, the algorithm of the invention achieves excellent results in both precision and success rate: in FIG. 1 its precision reaches 0.834, higher than the other algorithms, and in FIG. 2 it is also superior to the other algorithms. The experimental results in Table 1 show that, across 11 different challenge attributes, the success rate of the algorithm is either optimal or sub-optimal; in particular, for Fast Motion (FM), Motion Blur (MB), Deformation (DEF) and Occlusion (OCC), its success rate is superior to the other currently popular correlation filtering algorithms, demonstrating that the proposed algorithm further enhances the robustness of the tracking algorithm.
TABLE 1

Claims (3)

1. A correlation filtering tracking method based on adaptive regularization features and joint temporal correlation, characterized by specifically comprising the following steps:
step 1, selecting a video sequence to be tracked, and initializing a first frame of the video sequence;
the specific process of the step 1 is as follows:
step 1.1, manually frame the target region in the first frame of the video sequence to be tracked to obtain the target center position coordinate p_1 and the target scale s_1, with p_1 = [x_1, y_1]^T, where x_1, y_1 are the coordinates of the center position of the first-frame target of the video sequence on the x-axis and y-axis with the upper-left corner of the image as the origin, and s_1 = [h_1, w_1]^T, where h_1, w_1 are the length and width of the first-frame target region;
step 1.2, according to the target center position p_1 and the target scale s_1 in the first frame of the video sequence, determine the first-frame training search area I_1 of the video sequence;
step 1.3, use a convolutional neural network to extract the hierarchical convolutional features of the first-frame training search area I_1 of the video sequence, obtaining the convolutional features f_1^l of the search area I_1; f_1^l is a (k×h_1)×(k×w_1)×c matrix, where k×h_1, k×w_1 are the length and width of the convolutional feature map and c is the number of channels of the convolutional feature map; f_1^l is the first-frame convolutional feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, where f_1^low is the shallow-layer feature, f_1^mid the middle-layer feature and f_1^high the high-layer feature of the network;
step 1.4, window the convolutional features f_1^l according to the size of the target region and the size of the first-frame training search area I_1; the windowed convolutional features are f_1'^l;
step 1.5, train a correlation filter α^l from the windowed convolutional features f_1'^l of step 1.4; α^l is a (k×h_1)×(k×w_1) matrix, with l corresponding to the feature layer; transform the trained correlation filter into the frequency domain;
Step 2, determining the central position of a target in a second frame of the tracking video sequence, and estimating the target scale in the second frame;
the specific process of the step 2 is as follows:
step 2.1, determine the detection search area Z_2 of the second frame, where Z_2 is a rectangular region cut out of the second frame centered at the first-frame target center p_1 with length and width k×s_1 (k times the first-frame target scale), k > 1;
step 2.2, extract the convolutional features of the second-frame detection search area Z_2; each feature is a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; the feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, is respectively the shallow-layer, middle-layer or high-layer feature of the network;
step 2.3, window the detection search area features of the second frame; the windowed features are given by formula (5):
step 2.4, compute by formula (6) the correlation filter response r_2^l of the correlation filter α_1^l on the windowed detection features:
where ∧ denotes the discrete Fourier transform and the hatted quantity denotes the i-th feature channel of the windowed features of the frame t = 2 detection search area;
step 2.5, estimate the target center position p_2 of the second frame by formula (7):
where the factors are, respectively, the adaptive weights between the layer responses, the sub-peak suppression parameters and the temporal-association control parameters between the previous and current frames;
step 2.6, from the center position p_2 of the second-frame target of the tracking video sequence obtained in step 2.5, estimate the target scale s_2 = [h_2, w_2]^T of the second frame using the DSST method;
step 2.7, determine the training search area I_2 of the second frame from the target center position p_2 in the second-frame image and the first-frame scale s_1, and update the correlation filter α_1^l based on the training search area I_2;
step 3, determine the position of the target in frame t of the tracking video sequence and estimate the target scale in frame t, where t > 2;
the specific process of the step 3 is as follows:
step 3.1, determine the training search area I_{t-1} of frame t-1 from the target center position p_{t-1} in the previous frame and the first-frame target scale s_1;
step 3.2, extract the convolutional features f_{t-1}^l of the training search area I_{t-1} of frame t-1; the convolutional feature is a (k×h_1)×(k×w_1)×c matrix, where f_{t-1}^l is the convolutional feature extracted from layer l of the convolutional network;
step 3.3, window the frame t-1 training search area features f_{t-1}^l; the windowed convolutional features f_{t-1}'^l are given by formula (8):
wherein,
step 3.4, train a correlation filter from the windowed convolutional features f_{t-1}'^l; the filter is a (k×h_1)×(k×w_1) matrix and is trained as shown in formula (9):
to accelerate the computation, transforming formula (9) into the frequency domain yields formula (10):
where the hatted quantity denotes the i-th feature channel of the windowed features of the frame t-1 training search area, ∧ denotes the discrete Fourier transform, and the horizontal bar over a letter denotes the complex conjugate;
step 3.5, determine the detection search area Z_t of frame t, where Z_t is a rectangular region cut out of frame t centered at the frame t-1 target center p_{t-1} with length and width k×s_1 (k times the first-frame target scale), k > 1, used as the detection search area of frame t;
step 3.6, extract the convolutional features of the detection search area Z_t of frame t; each feature is a (k×h_1)×(k×w_1)×c matrix, where (k×h_1, k×w_1) are the length and width of the feature map and c is the number of channels of the convolutional feature map; the feature extracted from layer l of the convolutional network, l ∈ {low, mid, high}, is respectively the shallow-layer, middle-layer or high-layer feature of the network;
step 3.7, window the detection search area convolutional features of frame t; the windowed convolutional features are given by formula (11):
step 3.8, compute by formula (12) the correlation filter response r_t^l of the correlation filter on the windowed detection features:
where ∧ denotes the discrete Fourier transform;
step 3.9, estimate the target center position p_t in frame t by formula (13):
where the factors are, respectively, the adaptive weights between the layer responses, the sub-peak suppression parameters and the temporal-association control parameters between the previous and current frames;
step 3.10, from the center position p_t of the frame-t target of the tracking video sequence obtained in step 3.9, estimate the target scale s_t = [h_t, w_t]^T of frame t using the DSST method;
step 3.11, determine the training search area I_t of frame t from the target center position p_t in the frame-t image and the first-frame scale s_1, and update the correlation filter based on the training search area I_t;
and step 3.12, repeating the steps 3.1-3.11 until tracking is finished.
2. The correlation filtering tracking method based on adaptive regularization features and joint temporal correlation of claim 1, characterized in that the specific windowing process in step 1.4 is as follows:
the total window function after windowing is the superposition of two window functions, a cosine window and a Gaussian window, where:
the cosine window is determined by the size of the convolutional features f_1^l of the first-frame training search area; since the size of f_1^l does not change, the cosine window function w_cos remains unchanged;
the first frame training search area gaussian window is determined by the following equation (1):
where (m, n) are the coordinates of each point in the Gaussian window, M = (k×h_1)/2, N = (k×w_1)/2, and δ is a regulating factor;
the total window function after windowing is shown in the following formula (2):
3. The correlation filtering tracking method based on adaptive regularization features and joint temporal correlation of claim 2, characterized in that the specific process of step 1.5 is as follows:
Let f_1'^l(m, n) be the cyclic-shift sample generated by cyclically shifting f_1'^l by m elements along its length and n elements along its width; the label corresponding to each cyclic-shift sample is a soft label generated by a Gaussian function with Gaussian variance ε; ridge regression is then used to train the correlation filter α_1^l, as shown in formula (3):
In formula (3), * denotes the convolution operation, y is a matrix of size h_1×w_1 whose element in the m-th row and n-th column is y_{m,n} = y(m, n), and λ is a regularization coefficient; to accelerate the computation, formula (3) is transformed into the frequency domain to obtain formula (4):
where the hatted quantity denotes the i-th feature channel of the windowed features of the first-frame training search area, ∧ denotes the discrete Fourier transform, the horizontal bar over a letter denotes the complex conjugate, ⊙ denotes the Hadamard product of matrices, and c is the number of feature channels.
CN202110214541.4A 2021-02-26 2021-02-26 Correlation filtering tracking method based on self-adaptive regular feature joint time correlation Active CN112819865B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110214541.4A CN112819865B (en) 2021-02-26 2021-02-26 Correlation filtering tracking method based on self-adaptive regular feature joint time correlation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110214541.4A CN112819865B (en) 2021-02-26 2021-02-26 Correlation filtering tracking method based on self-adaptive regular feature joint time correlation

Publications (2)

Publication Number Publication Date
CN112819865A CN112819865A (en) 2021-05-18
CN112819865B true CN112819865B (en) 2024-02-09

Family

ID=75863930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110214541.4A Active CN112819865B (en) 2021-02-26 2021-02-26 Correlation filtering tracking method based on self-adaptive regular feature joint time correlation

Country Status (1)

Country Link
CN (1) CN112819865B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327273B (en) * 2021-06-15 2023-12-19 中国人民解放军火箭军工程大学 Infrared target tracking method based on variable window function correlation filtering

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015163830A1 (en) * 2014-04-22 2015-10-29 Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi Target localization and size estimation via multiple model learning in visual tracking
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges
CN110223323A (en) * 2019-06-02 2019-09-10 西安电子科技大学 Method for tracking target based on the adaptive correlation filtering of depth characteristic
CN111383249A (en) * 2020-03-02 2020-07-07 西安理工大学 Target tracking method based on multi-region layer convolution characteristics

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109272530B (en) * 2018-08-08 2020-07-21 北京航空航天大学 Target tracking method and device for space-based monitoring scene

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015163830A1 (en) * 2014-04-22 2015-10-29 Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi Target localization and size estimation via multiple model learning in visual tracking
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges
CN110223323A (en) * 2019-06-02 2019-09-10 西安电子科技大学 Method for tracking target based on the adaptive correlation filtering of depth characteristic
CN111383249A (en) * 2020-03-02 2020-07-07 西安理工大学 Target tracking method based on multi-region layer convolution characteristics

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张伟; 温显斌. Kernel correlation filtering tracking algorithm based on multiple features and scale estimation. Journal of Tianjin University of Technology, 2020, No. 03, full text. *
王守义; 周海英; 杨阳. Kernel correlation adaptive target tracking based on convolutional features. Journal of Image and Graphics, 2017, No. 09, full text. *

Also Published As

Publication number Publication date
CN112819865A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN108053419B (en) Multi-scale target tracking method based on background suppression and foreground anti-interference
CN110084836B (en) Target tracking method based on deep convolution characteristic hierarchical response fusion
CN108734723B (en) Relevant filtering target tracking method based on adaptive weight joint learning
CN108062531B (en) Video target detection method based on cascade regression convolutional neural network
CN112184752A (en) Video target tracking method based on pyramid convolution
CN109741366B (en) Related filtering target tracking method fusing multilayer convolution characteristics
CN107689052B (en) Visual target tracking method based on multi-model fusion and structured depth features
CN109859241B (en) Adaptive feature selection and time consistency robust correlation filtering visual tracking method
CN110276785B (en) Anti-shielding infrared target tracking method
CN110175649B (en) Rapid multi-scale estimation target tracking method for re-detection
CN111311647B (en) Global-local and Kalman filtering-based target tracking method and device
CN111724411B (en) Multi-feature fusion tracking method based on opposite-impact algorithm
CN112651998B (en) Human body tracking algorithm based on attention mechanism and double-flow multi-domain convolutional neural network
CN109035300B (en) Target tracking method based on depth feature and average peak correlation energy
CN111612817A (en) Target tracking method based on depth feature adaptive fusion and context information
CN110111370B (en) Visual object tracking method based on TLD and depth multi-scale space-time features
CN110276784B (en) Correlation filtering moving target tracking method based on memory mechanism and convolution characteristics
CN106887012A (en) A kind of quick self-adapted multiscale target tracking based on circular matrix
CN116343267B (en) Human body advanced semantic clothing changing pedestrian re-identification method and device of clothing shielding network
CN108830170A (en) A kind of end-to-end method for tracking target indicated based on layered characteristic
CN112819865B (en) Correlation filtering tracking method based on self-adaptive regular feature joint time correlation
CN111027586A (en) Target tracking method based on novel response map fusion
CN113963026A (en) Target tracking method and system based on non-local feature fusion and online updating
CN112131991B (en) Event camera-based data association method
CN112883928A (en) Multi-target tracking algorithm based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant