Kernel correlation filtering tracking method based on feature fusion and self-adaptive blocking
Technical Field
The invention relates to the technical field of computer vision, in particular to a kernel correlation filtering tracking method based on feature fusion and self-adaptive blocking.
Background
Target tracking is widely applied in robotics, video surveillance, automation and other fields, and is an important research topic in computer vision. More precisely, the problem studied here is single-target short-term tracking: the state of the target, namely its position and size, is given in the first frame of an image sequence, and the target must then be tracked in each subsequent image to obtain its position and size. The main difficulty is that the target may deform, be occluded or move abruptly during tracking, and such changes are hard to predict in advance. Specifically, the difficulties include occlusion, in-plane rotation, out-of-plane rotation, fast motion, illumination change, background clutter, motion blur and large scale change.
The quality of the target model is the key to successful tracking, so target tracking algorithms are divided, according to how the target is modeled, into generative models and discriminative models. A generative model describes the appearance of the target directly through online learning, then searches for the target by evaluating the joint probability of a sample and the target, and takes the image region that best matches the target model as the estimate of the current true target to complete target positioning. Generative methods are generally based on template matching or subspace models, and describe the target well by using global features. In contrast, a discriminative model makes full use of both target and background information, treats target tracking as a foreground/background classification problem, and judges whether a sample is target or background by computing its conditional probability, thereby reducing the complexity of the algorithm.
Tracking methods based on kernel correlation filtering belong to the discriminative class. Owing to their good tracking performance and very high computational efficiency, they have in recent years become the most active research topic in target tracking apart from deep learning methods. Such a method generally trains a classifier during tracking on a training data set collected from the previous frame, uses the classifier to detect the position of the target in the current frame, and then replaces the training data set with data collected from the current frame in order to update the classifier. Tracking based on kernel correlation filtering performs well in terms of speed and scale change and can run in real time, but because it represents the target with a single feature and has no mechanism for handling target occlusion, its tracking accuracy suffers, and tracking easily fails under background clutter, target occlusion and similar conditions.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a kernel correlation filtering tracking method based on feature fusion and self-adaptive blocking. While meeting the speed requirement of real-time tracking, the method improves the handling of problems such as scale change, occlusion, target rotation and background clutter.
The technical scheme for realizing the aim of the invention is as follows:
A kernel correlation filtering tracking method based on feature fusion and self-adaptive blocking comprises the following steps:
1) Acquiring the initial size and position of the target sub-blocks: inputting the first frame image of the tracking video to determine the size and initial position of the tracking target, setting the total number of frames of the tracking video as N, partitioning the target into blocks according to the target pixel ratio, setting the total number of sub-blocks as m, and obtaining the initial size and position of each target sub-block from the geometric relationship between the initial size and position of the tracking target and each sub-block;
2) Modeling and training the target sub-blocks: inputting the images of the remaining frames, obtaining all samples of a target sub-block by cyclic shift, calculating the base samples of the HOG and CN features of the target sub-block, fusing the HOG features and CN features by a multi-channel technique, and modeling the target sub-block with the fused features, wherein the target sub-block model so built is a classifier, and training the target sub-block model with a ridge regression function;
3) Obtaining the target position of the current sub-block: using a Gaussian kernel function, computing the kernel correlation matrix of all positive samples (the target) and negative samples (the background) of the target sub-block together with the classifier weight coefficients, computing the response values of all candidate target positions in the sub-block, and detecting the position of maximum response, thereby obtaining the target position of the current sub-block in the current frame;
4) Updating the target sub-block samples and classifier weight coefficients: updating the target sub-block samples and the classifier weight coefficients according to the target position of the current sub-block in the current frame, in preparation for target position detection of the corresponding sub-block in the next frame;
5) Determining the final position of the current-frame target: after all sub-blocks have been tracked, giving a larger weight coefficient to a target sub-block with a larger response value, and determining the final position of the current-frame target by weighted average;
6) Repeating the steps 2) to 5) until the last frame of the tracking video is completed.
The partitioning of the target according to the target pixel ratio in step 1) includes: after the target has been partitioned according to the target pixel ratio, reading the ground-truth text file of the tracked video with the corresponding Matlab toolkit, wherein the text contains the calibrated real data of the tracked video, namely the target size and target centre position of each frame; then obtaining the size and initial position of each sub-block from the geometric relationship between the initial position of the target and the sub-blocks, and taking the centre coordinate of each sub-block as the initial target position of that sub-block in the first frame.
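The geometric relationship between the target and its sub-blocks can be illustrated with a minimal sketch. It assumes the target box is cut into a regular grid of `rows` by `cols` equal sub-blocks; the function name `split_into_subblocks`, the grid arguments and the sample numbers are hypothetical and are not taken from the specification, which chooses the grid from the target pixel ratio.

```python
def split_into_subblocks(cx, cy, width, height, rows, cols):
    """Cut a target box (centre cx, cy; size width x height) into rows x cols
    equal sub-blocks and return (block_w, block_h, centres)."""
    block_w = width / cols
    block_h = height / rows
    left = cx - width / 2.0   # x coordinate of the left edge of the target box
    top = cy - height / 2.0   # y coordinate of the top edge of the target box
    centres = []
    for r in range(rows):
        for c in range(cols):
            # Centre of the sub-block in grid row r, grid column c.
            centres.append((left + (c + 0.5) * block_w,
                            top + (r + 0.5) * block_h))
    return block_w, block_h, centres


# Hypothetical first-frame target: centre (50, 50), 40 wide, 60 high, 3 x 2 grid.
block_w, block_h, centres = split_into_subblocks(50.0, 50.0, 40.0, 60.0,
                                                 rows=3, cols=2)
```

Each returned centre would serve as the initial target position of one sub-block in the first frame.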
The specific process of obtaining all samples of the target sub-block by cyclic shift in step 2) is as follows: given a vector $x = [x_1, x_2, \ldots, x_n]$, each element of $x$ is shifted one position to the right in turn to obtain a row vector; n cyclic shift vectors are obtained after n shifts and are arranged in order into a matrix, which is the circulant matrix $C(x)$ generated by $x$. Then, a base sample of the sub-block target area is calculated from the centre position of the sub-block, i.e. the centre coordinates (a, b), and the size of the sub-block, i.e. its length and width, and cyclic shift sampling is performed on the base sample to obtain the cyclic samples of the sub-block target area.
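The cyclic shift sampling described above can be sketched in a few lines. The helper names `cyclic_shift_right` and `circulant` and the sample vector are illustrative assumptions; a practical tracker never builds this matrix explicitly, since the Fourier-domain formulas make it unnecessary.

```python
def cyclic_shift_right(v):
    """Move every element one position to the right; the last wraps to the front."""
    return v[-1:] + v[:-1]


def circulant(base):
    """Stack the n cyclic shifts of `base` as rows, starting with `base` itself."""
    rows = []
    row = list(base)
    for _ in range(len(base)):
        rows.append(row)
        row = cyclic_shift_right(row)
    return rows


# Hypothetical base sample with n = 4 elements.
C = circulant([1, 2, 3, 4])
# C == [[1, 2, 3, 4], [4, 1, 2, 3], [3, 4, 1, 2], [2, 3, 4, 1]]
```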
The specific process of fusing HOG features and CN features by utilizing the multichannel technology in the step 2) is as follows:
(1) Acquiring the HOG features and CN features of the target area through the corresponding Matlab toolkit, obtaining 31-dimensional HOG features and 11-dimensional CN features, and reducing the dimension of the CN features by principal component analysis, the reduced CN features being 2-dimensional;
(2) Carrying out weighted fusion of the 31-dimensional HOG features, the 2-dimensional CN features and the 1-dimensional gray-scale feature by a multi-channel technique to obtain a 34-dimensional fused feature, wherein the multi-channel fusion formula is as follows:
$$k^{xx'} = \exp\left(-\frac{1}{\sigma^{2}}\left(\|x\|^{2} + \|x'\|^{2} - 2F^{-1}\left(\sum_{c}\hat{x}_{c}^{*}\odot\hat{x}'_{c}\right)\right)\right)$$
wherein $k^{xx'}$ is the kernel function, $x_c$ is the fused HOG and CN feature of the c-th channel, $\sigma$ is the bandwidth of the Gaussian kernel function, $F^{-1}$ represents the inverse Fourier transform, the hat denotes the discrete Fourier transform, $*$ the complex conjugate and $\odot$ element-wise multiplication.
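As an illustration of how a multi-channel Gaussian kernel of this kind can be evaluated, the sketch below sums the per-channel cross-correlations in the Fourier domain and compares the result with a direct computation over every cyclic shift. It is one-dimensional, uses a naive O(n²) DFT so that it needs only the standard library, and the function names and the two-channel sample data are hypothetical; real HOG, CN and gray-scale channels are two-dimensional feature maps.

```python
import cmath
import math


def dft(v):
    """Naive discrete Fourier transform of a sequence."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * math.pi * t * k / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse discrete Fourier transform."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * t * k / n) for k in range(n)) / n
            for t in range(n)]


def multichannel_gaussian_kernel(x, z, sigma):
    """x, z: lists of channels, each a list of n real values.
    Returns the kernel value for every cyclic shift m of z relative to x."""
    n = len(x[0])
    spectrum = [0j] * n
    for xc, zc in zip(x, z):
        xf, zf = dft(xc), dft(zc)
        for k in range(n):
            spectrum[k] += xf[k].conjugate() * zf[k]   # summed over channels
    corr = [c.real for c in idft(spectrum)]
    energy = (sum(v * v for ch in x for v in ch)
              + sum(v * v for ch in z for v in ch))
    # max(0, .) guards against tiny negative values caused by rounding.
    return [math.exp(-max(0.0, energy - 2.0 * corr[m]) / sigma ** 2)
            for m in range(n)]


def direct_kernel(x, z, sigma):
    """Reference: squared distance between x and z read with offset m, all channels."""
    n = len(x[0])
    out = []
    for m in range(n):
        d2 = sum((xc[t] - zc[(t + m) % n]) ** 2
                 for xc, zc in zip(x, z) for t in range(n))
        out.append(math.exp(-d2 / sigma ** 2))
    return out


# Hypothetical two-channel sample; z is x cyclically shifted right by one position.
x = [[1.0, 2.0, 0.5, -1.0], [0.0, 1.0, 1.0, 0.0]]
z = [ch[-1:] + ch[:-1] for ch in x]
k_fft = multichannel_gaussian_kernel(x, z, sigma=2.0)
k_direct = direct_kernel(x, z, sigma=2.0)
peak_shift = k_fft.index(max(k_fft))
```

In this sketch the kernel is largest at the offset that realigns the two samples, which is the property the detection step relies on.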
In step 2), training the sub-block target model with the ridge regression function means finding a regression function $f(z) = w^{T}z$ that minimizes the squared error between the regression values of the samples and the regression targets; w is solved as follows:
(1) The loss function used in training is:
$$\min_{w}\sum_{i}\left(f(x_{i}) - y_{i}\right)^{2} + \lambda\|w\|^{2}$$
wherein x is a target sample, y is a regression target, w is the classifier weight coefficient, $\lambda$ is a regularization parameter, f is the classification function, and $\|\cdot\|$ is a norm operation;
(2) Under the condition that the samples are linearly separable, the ridge regression solution is:
$$w = (X^{T}X + \lambda I)^{-1}X^{T}y$$
wherein X is the circulant matrix formed by cyclic shifts of the sample data, I is the identity matrix, and T represents the transposition operation;
(3) Written in the Fourier domain as:
$$\hat{w} = \frac{\hat{x}^{*}\odot\hat{y}}{\hat{x}^{*}\odot\hat{x} + \lambda}$$
wherein the hat denotes the discrete Fourier transform, $*$ the complex conjugate, and the multiplication $\odot$ and the division are element-wise.
The specific process of obtaining the target position of the current sub-block in the current frame in step 3) is as follows:
(1) The solution w of the regression function f(z) is expressed as a linear combination of the target samples x, mapped into a high-dimensional space by $\varphi(\cdot)$, with the dual-space coefficients $\alpha$:
$$w = \sum_{i}\alpha_{i}\varphi(x_{i})$$
thus, the regression problem translates into:
$$f(z) = w^{T}\varphi(z) = \sum_{i}\alpha_{i}k(z, x_{i})$$
wherein $k(z, x_i)$ is the kernel function, and a Gaussian kernel function is selected as the kernel function:
$$k^{xz} = \exp\left(-\frac{1}{\sigma^{2}}\left(\|x\|^{2} + \|z\|^{2} - 2F^{-1}\left(\hat{x}^{*}\odot\hat{z}\right)\right)\right)$$
wherein z represents a detection sample, $\sigma$ is the bandwidth of the Gaussian kernel function, and $*$ represents the complex conjugate;
(2) The specific solving steps of the sub-block response values are as follows:
$$f(z) = \sum_{i}\alpha_{i}k(z, x_{i})$$
solving the above formula by kernel theory and ridge regression gives:
$$\alpha = (K + \lambda I)^{-1}y$$
K is a circulant matrix; applying the Fourier transform to it gives the solution of $\alpha$ in the dual space:
$$\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xz} + \lambda}$$
wherein $\hat{y}$ is the discrete Fourier transform of the regression target y and $\hat{k}^{xz}$ is the discrete Fourier transform of the first row of the circulant kernel matrix $K = C(k^{xz})$, which during training is evaluated with z equal to x;
the final response position of the sub-block is obtained as follows:
according to the formula $\hat{f}(z) = \hat{k}^{xz}\odot\hat{\alpha}$, regression detection is carried out on all samples in the sub-block and the corresponding response values in the frequency domain are calculated; the inverse Fourier transform is then used to convert $\hat{f}(z)$ back to the time domain, and the position with the maximum response value is the target position of the sub-block.
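The training and detection steps of a kernel correlation filter can be sketched on a one-dimensional signal as follows. This is a simplified illustration under stated assumptions, not the sub-block tracker itself: it uses a single channel, a naive O(n²) DFT so that only the standard library is needed, a kernel autocorrelation of the training sample with itself for training, and hypothetical function names and sample data. The regularization value 0.001 follows the example in this description.

```python
import cmath
import math


def dft(v):
    """Naive discrete Fourier transform of a sequence."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * math.pi * t * k / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse discrete Fourier transform."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * t * k / n) for k in range(n)) / n
            for t in range(n)]


def gaussian_correlation(x, z, sigma):
    """Gaussian kernel between x and every cyclic shift of z."""
    xf, zf = dft(x), dft(z)
    corr = [c.real for c in idft([a.conjugate() * b for a, b in zip(xf, zf)])]
    energy = sum(v * v for v in x) + sum(v * v for v in z)
    return [math.exp(-max(0.0, energy - 2.0 * c) / sigma ** 2) for c in corr]


def train(x, y, sigma, lam):
    """Dual coefficients in the Fourier domain: y_hat / (k_hat + lambda)."""
    kf = dft(gaussian_correlation(x, x, sigma))
    yf = dft(y)
    return [b / (a + lam) for a, b in zip(kf, yf)]


def detect(alpha_f, x, z, sigma):
    """Response over all cyclic shifts of z and the index of its maximum."""
    kf = dft(gaussian_correlation(x, z, sigma))
    response = [r.real for r in idft([a * b for a, b in zip(kf, alpha_f)])]
    peak = max(range(len(response)), key=response.__getitem__)
    return peak, response


n = 16
# Hypothetical base sample and a Gaussian-shaped regression target peaked at index 0.
x = [math.sin(0.7 * t) + 0.5 * math.cos(2.1 * t) for t in range(n)]
y = [math.exp(-min(t, n - t) ** 2 / (2 * 1.5 ** 2)) for t in range(n)]

shift = 5
z = x[-shift:] + x[:-shift]          # the "next frame": x moved right by 5 samples
alpha_f = train(x, y, sigma=1.0, lam=0.001)
peak, response = detect(alpha_f, x, z, sigma=1.0)
```

In this sketch the index of the response maximum equals the displacement applied to the sample, which is how the position of maximum response yields the new target position of a sub-block.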
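In step 4), the sub-block target samples and the classifier weight coefficients are updated; in the updating process each frame of image generates a classifier, which is combined with the classifier trained in step 2) to obtain a classifier trained in real time:
$$\hat{\alpha}_{t} = (1-\eta)\hat{\alpha}_{t-1} + \eta\hat{\alpha}, \qquad x_{t} = (1-\eta)x_{t-1} + \eta x$$
wherein $\eta$ represents the update weight of the trained classifier, here $\eta = 0.125$, the subscript t-1 denotes the coefficients and sample carried over from the previous frame, and the unsubscripted $\hat{\alpha}$ and x are those computed from the current frame t; the result prepares for target position detection of the corresponding sub-block in the next frame (i.e. frame t+1);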
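The role of the update weight can be illustrated with a minimal sketch. It assumes the linear interpolation update commonly used by kernel correlation filter trackers, in which the stored model keeps a fraction 1 - η of its previous value and takes a fraction η from the model computed on the current frame; this rule, the function name `interpolate` and the sample values are assumptions made for illustration.

```python
def interpolate(previous, current, eta=0.125):
    """Blend the stored model with the one computed on the current frame.
    Works element-wise on any equally long sequences of real or complex numbers."""
    return [(1.0 - eta) * p + eta * c for p, c in zip(previous, current)]


# Hypothetical classifier coefficients from the previous and the current frame.
alpha_prev = [1.0, 2.0]
alpha_new = [3.0, 4.0]
alpha_model = interpolate(alpha_prev, alpha_new)   # eta = 0.125 as in this description
```

With η = 0.125 the model changes slowly, so a single poorly tracked frame has only a limited effect on the classifier.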
In step 5), a target sub-block with a larger response value is given a larger weight coefficient $w_i$, wherein $f(z)_i$ is the maximum response position of the i-th sub-block, and the final position F of the current-frame target is determined as the weighted average of the sub-block positions $f(z)_i$ with the weights $w_i$.
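The weighted average of step 5) can be sketched as follows. The weight rule used here, each weight being the maximum response of a sub-block divided by the sum of the maximum responses of all sub-blocks, is an assumption chosen only so that larger responses receive larger weights; the function name `fuse_positions` and the sample numbers are likewise hypothetical.

```python
def fuse_positions(positions, responses):
    """positions: one (x, y) target estimate per sub-block.
    responses: the maximum response value of each sub-block (assumed positive).
    Returns the fused (x, y) position and the weights that were used."""
    total = sum(responses)
    if total <= 0:
        raise ValueError("responses must have a positive sum")
    weights = [r / total for r in responses]          # weights sum to 1
    fx = sum(w * p[0] for w, p in zip(weights, positions))
    fy = sum(w * p[1] for w, p in zip(weights, positions))
    return (fx, fy), weights


# Two hypothetical sub-blocks; the second responds three times as strongly.
final_position, weights = fuse_positions([(10.0, 10.0), (20.0, 20.0)], [1.0, 3.0])
```

The fused position is pulled towards the estimate of the sub-block with the stronger response.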
According to the method, HOG features and CN features with complementary properties are fused to enrich the features of the target sample; the target is partitioned into blocks, and the final target position is obtained by decision fusion of the target positions of the valid sub-blocks. The handling of problems such as scale change, occlusion, target rotation and background clutter is thereby improved while the speed requirement of real-time tracking is met.
Drawings
FIG. 1 is a schematic diagram of the framework of the embodiment;
FIG. 2a is a schematic diagram of a target block when the pixel ratio of the target is greater than 1 in the embodiment;
FIG. 2b is a schematic diagram of a target block when the pixel ratio of the target is less than 1 in the embodiment;
FIG. 2c is a schematic diagram of a target block when the pixel ratio of the target is equal to 1 in the embodiment;
FIG. 3 is a schematic diagram of feature fusion in an embodiment.
Detailed Description
The present invention will now be further illustrated, but not limited, by the following figures and examples.
Examples:
Referring to FIG. 1, a kernel correlation filtering tracking method based on feature fusion and self-adaptive blocking includes the following steps:
1) Inputting the first frame image of the tracking video to determine the size and initial position of the tracking target, setting the total number of frames of the tracking video as N, and partitioning different targets into blocks according to the pixel ratio of the target, as shown in FIG. 2a to FIG. 2c: let θ be the value of the pixel ratio of the target after rounding up; when the pixel ratio of the target is greater than 1, the vertical direction of the target is divided into 3×θ blocks; when the pixel ratio of the target is less than 1, the vertical direction of the target is divided into 2×θ blocks; when the pixel ratio of the target is less than 1, the vertical direction of the target is divided into 3×θ blocks; when the pixel ratio of the target is equal to 1, the horizontal direction is divided into 3×θ blocks. After the target has been partitioned, the total number of sub-blocks is set as m, and the ground-truth text file of the tracked video is read with the corresponding Matlab toolkit, the text containing the calibrated real data of the tracked video, namely the target size and target centre position of each frame; the size and initial position of each sub-block are then obtained from the geometric relationship between the initial position of the target and the sub-blocks, the centre coordinate of each sub-block being taken as the initial target position of that sub-block in the first frame;
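2) Inputting the images of the remaining frames and obtaining all samples of a target sub-block by cyclic shift: given a vector $x$, each element of $x$ is shifted one position to the right in turn to obtain a row vector, and n cyclic shift vectors are obtained after n shifts, each shift being equivalent to multiplying $x$ by a permutation matrix P; the n cyclic shift vectors are then arranged in order into a matrix, which is the circulant matrix $C(x)$ generated by $x$, wherein the vector $x$, the permutation matrix P and the circulant matrix $C(x)$ are respectively expressed as follows:
$$x = [x_{1}, x_{2}, \ldots, x_{n}]^{T}$$
$$P = \begin{bmatrix} 0 & 0 & \cdots & 0 & 1 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}$$
$$C(x) = \begin{bmatrix} x_{1} & x_{2} & \cdots & x_{n} \\ x_{n} & x_{1} & \cdots & x_{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ x_{2} & x_{3} & \cdots & x_{1} \end{bmatrix}$$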
Cyclic shift sampling of the target sub-block increases the number of target sub-block samples and thus improves the accuracy of target detection. Then, a base sample of the sub-block target area is calculated from the centre position of the sub-block, i.e. the centre coordinates (a, b), and its size, i.e. the length and width of the sub-block, and cyclic shift sampling is performed on the base sample to obtain the cyclic samples of the sub-block target area. The HOG features and CN features are then fused by a multi-channel technique, the feature fusion being shown in FIG. 3, and the specific process of fusing the HOG features and CN features is as follows:
(1) Acquiring the HOG features of the target area through the corresponding Matlab toolkit, finally obtaining 31-dimensional HOG features, and acquiring the CN features of the target area through the corresponding Matlab toolkit, wherein the CN features comprise the 11 color attributes red, yellow, white, purple, pink, orange, green, gray, brown, blue and black, giving 11 dimensions; the dimension of the CN features is reduced by principal component analysis, the reduced CN features being 2-dimensional;
(2) Carrying out weighted fusion of the 31-dimensional HOG features, the 2-dimensional CN features and the 1-dimensional gray-scale feature by a multi-channel technique to obtain a 34-dimensional fused feature, wherein the multi-channel fusion formula is as follows:
$$k^{xx'} = \exp\left(-\frac{1}{\sigma^{2}}\left(\|x\|^{2} + \|x'\|^{2} - 2F^{-1}\left(\sum_{c}\hat{x}_{c}^{*}\odot\hat{x}'_{c}\right)\right)\right)$$
wherein $k^{xx'}$ is the kernel function, $x_c$ is the fused HOG and CN feature of the c-th channel, $\sigma$ is the bandwidth of the Gaussian kernel function, $F^{-1}$ represents the inverse Fourier transform, the hat denotes the discrete Fourier transform, $*$ the complex conjugate and $\odot$ element-wise multiplication,
The sub-block target is modeled with the fused features, the target sub-block model so built being a classifier, and the sub-block target model is trained with a ridge regression function. Supposing that the sub-block has n training samples in total, $x = [x_{1}, x_{2}, \ldots, x_{n}]$, with regression targets $y = [y_{1}, y_{2}, \ldots, y_{n}]$, the purpose of training is to find a regression function $f(z) = w^{T}z$ that minimizes the squared error between the regression values of the samples and the regression targets, wherein w is solved as follows:
(1) The loss function used in training is:
$$\min_{w}\sum_{i}\left(f(x_{i}) - y_{i}\right)^{2} + \lambda\|w\|^{2}$$
wherein x is a target sample, y is a regression target, w is the classifier weight coefficient, $\lambda$ is a regularization parameter that prevents overfitting and takes the value 0.001 in this example, f is the classification function, and $\|\cdot\|$ is a norm operation;
(2) Under the condition that the samples are linearly separable, differentiating the loss function and setting the derivative equal to 0 gives the closed-form solution of the weight coefficient w:
$$w = (X^{T}X + \lambda I)^{-1}X^{T}y$$
wherein X is the circulant matrix formed by cyclic shifts of the sample data, I is the identity matrix, and T represents the transposition operation;
(3) Written in the Fourier domain as:
$$\hat{w} = \frac{\hat{x}^{*}\odot\hat{y}}{\hat{x}^{*}\odot\hat{x} + \lambda}$$
3) Using a Gaussian kernel function, computing the kernel correlation matrix of all positive samples (the target) and negative samples (the background) of the target sub-block together with the classifier weight coefficients, computing the response values of all candidate target positions in the sub-block, and detecting the position of maximum response, thereby obtaining the target position of the current sub-block in the current frame. In the nonlinear case, a kernel function k is defined to map the input samples into a high-dimensional space $\varphi(x)$, and the solution w of the regression function f(z) is expressed as a linear combination of the mapped target samples with the dual-space coefficients $\alpha$:
$$w = \sum_{i}\alpha_{i}\varphi(x_{i})$$
for predicting the target of the next sub-block, the response function f(z) is expressed in the high-dimensional space as:
$$f(z) = w^{T}\varphi(z) = \sum_{i=1}^{n}\alpha_{i}k(z, x_{i})$$
wherein $k(z, x_i)$ is the kernel function; a kernel function is conventionally expressed in the form $k(x, x') = \varphi^{T}(x)\varphi(x')$, and the kernel function selected in this example is the Gaussian kernel function:
$$k^{xz} = \exp\left(-\frac{1}{\sigma^{2}}\left(\|x\|^{2} + \|z\|^{2} - 2F^{-1}\left(\hat{x}^{*}\odot\hat{z}\right)\right)\right)$$
wherein z is the detection sample, $\sigma$ is the bandwidth of the Gaussian kernel function, and $*$ represents the complex conjugate,
the specific solving steps of the sub-block response values are as follows:
$$f(z) = \sum_{i}\alpha_{i}k(z, x_{i})$$
the solution obtained by using kernel function theory and the ridge regression solution is:
$$\alpha = (K + \lambda I)^{-1}y$$
K is a circulant matrix; applying the Fourier transform to it gives the solution of $\alpha$ in the dual space:
$$\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xz} + \lambda}$$
wherein $\hat{y}$ is the discrete Fourier transform of the regression target y, and $\hat{k}^{xz}$ is the discrete Fourier transform of the first row of the circulant kernel matrix $K = C(k^{xz})$, i.e. of the base sample $k^{xz}$ of the circulant kernel matrix, which during training is evaluated with z equal to x. The final response position of the sub-block is then obtained as follows:
according to the formula $\hat{f}(z) = \hat{k}^{xz}\odot\hat{\alpha}$, regression detection is carried out on all samples in the sub-block and the corresponding response values in the frequency domain are calculated; the inverse Fourier transform is then used to convert $\hat{f}(z)$ back to the time domain, and the position with the maximum response value is the target position of the sub-block;
4) Updating the target samples and classifier weight coefficients of the valid sub-blocks according to the target position of the current sub-block in the current frame, and repositioning the invalid sub-blocks,
the validity of a sub-block being judged from the peak-to-sidelobe ratio (PSR) of the response map of the sub-block target detection area and the Euclidean distance $L_i$ between the position of the maximum response value of the current sub-block and the centre coordinate position of the target. The PSR calculation formula is as follows:
$$PSR = \frac{\max(f(z)) - \mu}{\delta}$$
wherein $\max(f(z))$ represents the maximum of the response values of the candidate targets of the sub-block, i.e. the final tracking result of the i-th sub-block in the t-th frame, and $\mu$ and $\delta$ represent the mean and variance of the target sub-block.
The Euclidean distance $L_i$ formula is as follows:
$$L_{i} = \sqrt{(a_{1} - a_{2})^{2} + (b_{1} - b_{2})^{2}}$$
wherein $(a_1, b_1)$ are the centre coordinates of the target and $(a_2, b_2)$ are the centre coordinates of the sub-block target; when the PSR value of the sub-block target area is less than 70, or the Euclidean distance $L_i$ between the coordinate position of the maximum response value of the current sub-block and the centre coordinate position of the target is greater than half of the diagonal length of the sub-block, the sub-block is determined to be invalid,
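in the updating process, each sub-block target generates a classifier; for a sub-block determined to be valid, this classifier is combined with the classifier already trained in step 2) to obtain a classifier trained in real time, and the updating formula is as follows:
$$\hat{\alpha}_{t} = (1-\eta)\hat{\alpha}_{t-1} + \eta\hat{\alpha}, \qquad x_{t} = (1-\eta)x_{t-1} + \eta x$$
where the subscript t-1 denotes the coefficients and sample carried over from the previous frame and the unsubscripted $\hat{\alpha}$ and x are those computed from the current frame t;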
for a sub-block that is determined to be invalid, we relocate it, the relocation formula is:
wherein η represents an update weight of the training classifier, in this example η=0.125, and the obtained real-time trained classifier prepares for target position detection of the corresponding sub-block of the next frame;
5) After all the sub-blocks have been tracked according to steps 2) to 4), a target sub-block with a larger response value is regarded as more reliable and is given a larger weight coefficient $w_i$; with $f(z)_i$ denoting the maximum response position of the i-th sub-block, the final position F of the current-frame target is then determined as the weighted average of the positions $f(z)_i$ with the weights $w_i$;
6) Repeating the steps 2) to 5) until the last frame of the tracking video is completed.
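The validity test applied to each sub-block in step 4) can be sketched as follows. The sketch makes two simplifying assumptions that are not taken from this description: the sidelobe is taken to be every response value except the single peak value, and δ is computed as the standard deviation of the sidelobe, as is usual for the peak-to-sidelobe ratio. The function names and the sample response values are hypothetical; the threshold 70 and the half-diagonal rule follow the text of step 4).

```python
import math
import statistics


def peak_sidelobe_ratio(response):
    """(peak - mean of sidelobe) / spread of sidelobe for a flat list of responses."""
    peak = max(response)
    sidelobe = list(response)
    sidelobe.remove(peak)                 # drop one occurrence of the peak value
    mu = statistics.mean(sidelobe)
    delta = statistics.pstdev(sidelobe)   # assumption: standard deviation
    if delta == 0.0:
        return math.inf if peak > mu else 0.0
    return (peak - mu) / delta


def is_valid_subblock(response, peak_xy, target_xy, block_w, block_h, psr_min=70.0):
    """A sub-block is invalid when its PSR is below psr_min or when its peak lies
    farther from the target centre than half the sub-block diagonal."""
    distance = math.hypot(peak_xy[0] - target_xy[0], peak_xy[1] - target_xy[1])
    half_diagonal = 0.5 * math.hypot(block_w, block_h)
    return peak_sidelobe_ratio(response) >= psr_min and distance <= half_diagonal


# Hypothetical response maps: one with a sharp peak, one almost flat.
sharp = [0.0, 0.01] * 10 + [1.0]
flat = [0.0, 0.01] * 10 + [0.02]
psr_sharp = peak_sidelobe_ratio(sharp)

near = is_valid_subblock(sharp, (51.0, 50.0), (50.0, 50.0), 10.0, 10.0)
far = is_valid_subblock(sharp, (60.0, 50.0), (50.0, 50.0), 10.0, 10.0)
weak = is_valid_subblock(flat, (51.0, 50.0), (50.0, 50.0), 10.0, 10.0)
```

Only sub-blocks that pass both tests would have their samples and classifier coefficients updated; the others would be repositioned.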