CN111260691A - Spatio-temporal canonical correlation filtering tracking method based on context-aware regression - Google Patents

Spatio-temporal canonical correlation filtering tracking method based on context-aware regression

Info

Publication number
CN111260691A
Authority
CN
China
Prior art keywords: target, context, filter, regression, temporal
Prior art date
Legal status
Granted
Application number
CN202010059049.XA
Other languages
Chinese (zh)
Other versions
CN111260691B (en)
Inventor
胡众义
邹绵璐
陈昌足
吴奇
Current Assignee
Wenzhou University
Original Assignee
Wenzhou University
Priority date
Filing date
Publication date
Application filed by Wenzhou University
Priority to CN202010059049.XA
Publication of CN111260691A
Application granted
Publication of CN111260691B
Legal status: Active
Anticipated expiration


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 — Image analysis
    • G06T 7/20 — Analysis of motion
    • G06T 7/246 — Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/248 — Analysis of motion using feature-based methods involving reference images or patches
    • G06T 2207/00 — Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 — Image acquisition modality
    • G06T 2207/10016 — Video; image sequence
    • G06T 2207/10024 — Color image
    • G06T 2207/20 — Special algorithmic details
    • G06T 2207/20024 — Filtering details
    • G06T 2207/20048 — Transform domain processing
    • G06T 2207/20056 — Discrete and fast Fourier transform [DFT, FFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a spatio-temporal regularized correlation filtering tracking method based on context-aware regression, which comprises the following steps: 1. give an initialization sample, determine the target template, initialize the target context, and sample the frame; 2. extract features of the target context; 3. construct the regression label shape and fit the target structure; 4. introduce temporal and spatial regularization terms and construct the objective equation, which constrains large temporal and spatial variations of the video sequence and prevents overfitting, then solve the objective equation to obtain the spatial regularization matrix (the filter); 5. perform the correlation filtering operation between the filter and the target image block, compute the correlation response values, select the maximum response value as the target position, and update the target template (result parameters) according to the determined target position; 6. repeat steps 2, 3, 4, and 5 until the video sequence ends. The framework of the invention is robust, and the tracking accuracy, success rate, and robustness obtained in tests on the public dataset OTB100 (Object Tracking Benchmark v1.0) are generally superior to several existing state-of-the-art methods.

Description

Spatio-temporal canonical correlation filtering tracking method based on context-aware regression
Technical Field
The invention belongs to the field of computer-vision target tracking and in particular relates to a spatio-temporal regularized correlation filtering tracking method based on context-aware regression, which better addresses the low tracking accuracy, success rate, and robustness of some existing target tracking methods.
Background
Online target tracking is a hot research topic in the field of computer vision: it is of great significance for high-level vision research such as abnormal-event detection, motion analysis, and scene understanding, and has broad application prospects in video surveillance, autonomous driving, human-computer interaction, and other fields.
In complex scenes, factors such as mutual occlusion between targets, occlusion of the target by static background objects, target rotation, and changes in target scale cause the tracked trajectory to drift away from the target, especially when the target is occluded for a long time or the motion state of the target template is not updated for a long time, which reduces the accuracy and success rate of target tracking.
With the rise of artificial intelligence, target tracking has become one of the key concerns in the field.
Disclosure of Invention
Aiming at the problems of low tracking accuracy and success rate caused by factors such as occlusion interference, target rotation, and scale change in target tracking, the invention provides a spatio-temporal regularized correlation filtering tracking method based on context-aware regression.
The technical scheme of the invention is as follows: a spatio-temporal regularized correlation filtering tracking method based on context-aware regression comprises the following steps.
In step S1, an initialization sample is given, an initial target template and an initial target context are determined, and an initial frame is sampled. The sample training set is $\{(x_t, y'_t)\}_{t=1}^{T}$, and each sample $x_t \in \mathbb{R}^{M \times N \times D}$ consists of $D$ feature maps of size $M \times N$; here $x_t$ denotes a sample, $y'_t$ denotes its label, $T$ denotes the number of samples, and $D$ denotes the number of feature channels.
Initializing the target position: the target position of the initial frame is given, i.e. the x and y coordinates, width, and height of the target, and the algorithm extracts feature information from this initial position. The specific processing is:

$$I_{position} = [x, y, w, h]$$

where $I_{position}$ denotes the target position and $[x, y, w, h]$ are the coordinate position, width, and height of the target, which together form the rectangular box framing the target.
Target background filling: to ensure the robustness of the method, to cope with large changes in the appearance of the target during tracking, and to obtain more discriminative information, the target region is padded with context. The specific processing is:

$$I_{sam\_sz} = I_{base\_sz} + I_{p\_con}$$

where $I_{sam\_sz}$ denotes the padded target sample, $I_{base\_sz}$ denotes the original target size, and $I_{p\_con}$ denotes the padded context information.
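As an illustration of this padding step, the following minimal Python sketch (not the patented implementation; the padding ratio `context_ratio` is an assumed parameter) enlarges the target box so the sample also contains surrounding context:

```python
def pad_target_region(position, context_ratio=2.5):
    """Sketch of target background filling: enlarge the target box
    I_position = [x, y, w, h] so the sample also contains context,
    i.e. I_sam_sz = I_base_sz + I_p_con."""
    x, y, w, h = position
    sample_w = int(round(w * context_ratio))
    sample_h = int(round(h * context_ratio))
    cx, cy = x + w / 2.0, y + h / 2.0      # keep the sample centred on the target
    return [int(round(cx - sample_w / 2.0)),
            int(round(cy - sample_h / 2.0)),
            sample_w, sample_h]

# example: a 40x60 target at (100, 80) becomes a 100x150 padded sample
print(pad_target_region([100, 80, 40, 60]))   # -> [70, 35, 100, 150]
```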
In step S2, after the target position is initialized and the background of the target position is partially filled, feature extraction is performed to fuse the Hog and CN features. Because the Hog feature counts the direction gradient of local appearance of the image, and the CN color feature counts the color of the target, the shielding and deformation of the target can be better processed.
Feature extraction is performed on the candidate context targets using HOG and CN features: the HOG feature counts the oriented gradients of local image regions, the CN color feature counts the target colors, and fusing the gradient and color cues gives better handling of target occlusion and deformation. The specific processing is as follows.
Gradient feature extraction (HOG):

$$G_x(x, y) = H(x+1, y) - H(x-1, y)$$
$$G_y(x, y) = H(x, y+1) - H(x, y-1)$$

where $G_x(x, y)$, $G_y(x, y)$, and $H(x, y)$ respectively denote the horizontal gradient, the vertical gradient, and the pixel value at pixel point $(x, y)$ of the input image.
Color feature (CN): the color feature is a global feature, describes surface properties of a scene corresponding to an image or an image area, can be free from the influence of image rotation and translation change, and is free from the influence of scale change after normalization.
In step S3, after regression fitting and introduction of context information, reconstructing a regression label according to the target spatial structure change, and performing a smoothing operation on the sample label, the fitting is approximately gaussian (1), where σ represents gaussian bandwidth, m, and n represents sample xtRepresents the regression smoothing parameter.
Figure BDA0002373809760000031
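A minimal sketch of generating such a Gaussian regression label, assuming the peak is centred on the target as in standard correlation-filter trackers:

```python
import numpy as np

def gaussian_label(M, N, sigma):
    """2-D Gaussian label of size M x N as in (1):
    y(m, n) = exp(-((m - M/2)^2 + (n - N/2)^2) / (2 * sigma^2))."""
    m = np.arange(M).reshape(-1, 1) - M / 2.0
    n = np.arange(N).reshape(1, -1) - N / 2.0
    return np.exp(-(m ** 2 + n ** 2) / (2.0 * sigma ** 2))

y = gaussian_label(50, 50, sigma=2.0)
assert abs(y[25, 25] - 1.0) < 1e-12      # maximum response at the centre
```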
In step S4, a constraint term is added, a temporal and spatial regularization term is introduced, an objective equation (2) is established, and a temporal regularization parameter μ and a spatial regularization matrix w (filter) are obtained by minimizing the objective solution, where c isdFilter representing channel d, c current frame filter, ci-1Representing the previous frame filter. After introduction of context information, the sample
Figure BDA0002373809760000032
The structure changes.
Figure BDA0002373809760000033
In step S5, a temporal regularization parameter μ and a spatial regularization matrix w (filter) are solved, a correlation filtering operation, i.e., a convolution operation, is performed on the samples in the fourier domain, after the final convolution operation, the response values of the target samples and the filter are scored, the maximum score is determined as the target region, and then the target template is updated. The function of the time regular parameter is to control the error between the obtained filtering of the front frame and the back frame as much as possible and minimize the error. The function of the spatial regular matrix w (filter) is to suppress the background region and highlight the target region by weight distribution.
The purpose is to constrain the objective solution and prevent overfitting: the objective equation is constrained in time and space through the temporal regularization parameter $\mu$ and the spatial regularization matrix $w$ (the filter), with temporal regularization term $\frac{\mu}{2}\|c - c_{i-1}\|^2$ and spatial regularization term $\frac{1}{2}\sum_{d=1}^{D}\|w \cdot c^d\|^2$.
The temporal regularization parameter and the spatial regularization matrix are solved by minimizing the objective equation. Because the objective equation has a complex form, term merging, a Lagrangian constraint, and an ADMM subproblem decomposition are introduced in the solving process; after decomposition, with auxiliary variable $g$ (subject to $c = g$), scaled Lagrange multiplier $h$, and penalty parameter $\gamma$, formulas (3), (4), and (5) are obtained. The specific processing is as follows:

$$c^{(i+1)} = \arg\min_{c}\ \left\|\sum_{d=1}^{D} x_t^d * c^d - y'_t\right\|^2 + \mu\left\|c - c_{i-1}\right\|^2 + \gamma\left\|c - g^{(i)} + h^{(i)}\right\|^2 \qquad (3)$$

$$g^{(i+1)} = \arg\min_{g}\ \sum_{d=1}^{D}\left\|w \cdot g^d\right\|^2 + \gamma\left\|c^{(i+1)} - g + h^{(i)}\right\|^2 \qquad (4)$$

$$h^{(i+1)} = h^{(i)} + c^{(i+1)} - g^{(i+1)} \qquad (5)$$
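For a single feature channel (D = 1), this splitting can be sketched as follows; the closed forms used for the c- and g-steps follow from (3) and (4) under the convolution model, and the regularization weights `mu` and `gamma` are assumed values, not those of the patent:

```python
import numpy as np

def admm_filter(x, y, w, c_prev, mu=15.0, gamma=1.0, iters=4):
    """ADMM sketch of (3)-(5) for one channel: x is the sample, y the Gaussian
    label, w the spatial regularization weights, c_prev the previous filter."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    C_prev = np.fft.fft2(c_prev)
    g = np.zeros_like(x, dtype=np.float64)
    h = np.zeros_like(x, dtype=np.float64)
    c = c_prev.copy()
    for _ in range(iters):
        # subproblem (3): per-frequency closed form for the filter c
        G, Hm = np.fft.fft2(g), np.fft.fft2(h)
        C = (np.conj(X) * Y + mu * C_prev + gamma * (G - Hm)) \
            / (np.conj(X) * X + mu + gamma)
        c = np.real(np.fft.ifft2(C))
        # subproblem (4): elementwise closed form under the spatial weights w
        g = gamma * (c + h) / (w ** 2 + gamma)
        # multiplier update (5)
        h = h + c - g
    return c
```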
In step S6, the operations of steps S2, S3, S4, and S5 are performed for each subsequent frame until the sequence ends.
Steps S2, S3, S4, and S5 are repeated in a loop over the picture frames, performing the same operations on each frame until the end of the video sequence. The flow in the example of Fig. 1 shows that, for the frames after the initial one, feature extraction is performed, the objective equation is constructed, the spatial regularization matrix is solved, the correlation filtering operation is performed, the target position is determined, and the template is then updated. The one difference is that the target template of the initial frame is given, whereas the target template of each subsequent frame is updated from the target position of the preceding frame.
The invention provides a spatio-temporal regularized correlation filtering tracking method based on context-aware regression, which has the following beneficial effects compared with the prior art:
the method is mainly characterized in that a relevant filtering mechanism is based, context-aware regression is provided, a target structure can be well fitted, in addition, a spatial regular constraint term can be well constrained and solved, a spatial matrix w (filter) is constructed, relevant filtering is carried out, a target area is well highlighted, a background area is restrained, and errors between a front filter and a rear filter can be well restrained by a time regular constraint term. By fusing the features of Hog (histogram) and CN (color name), the problem that the appearance of the target changes greatly, such as rotation, scale change and the like, can be solved. The method of the invention achieves good effects in the aspects of accuracy, success rate and robustness, and has good value and prospect in practical application.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a schematic diagram of the target location annotation of the present invention;
FIG. 3 is a schematic representation of Gaussian smoothing according to the present invention;
FIG. 4 is a diagram of the correlation filter architecture of the present invention;
FIG. 5 is a diagram of the detailed correlation filtering operation used in the present invention.
Detailed Description
For completeness and clarity, the technical solutions in the embodiments of the present invention are described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit it.
Referring to fig. 1, the present invention provides a technical solution: a spatio-temporal canonical correlation filtering tracking method based on context-aware regression comprises the following steps:
Step S1: an initialization sample is given, an initial target template and an initial target context are determined, and an initial frame is sampled. The sample training set is $\{(x_t, y'_t)\}_{t=1}^{T}$, and each sample $x_t \in \mathbb{R}^{M \times N \times D}$ consists of $D$ feature maps of size $M \times N$; here $x_t$ denotes a sample, $y'_t$ denotes its label, $T$ denotes the number of samples, and $D$ denotes the number of feature channels.
Initializing the target position: the target position of the initial frame is given, i.e. the x and y coordinates, width, and height of the target, and the algorithm extracts feature information from this initial position. The specific processing is:

$$I_{position} = [x, y, w, h]$$

where $I_{position}$ denotes the target position and $[x, y, w, h]$ are the coordinate position, width, and height of the target, which together form the rectangular box framing the target; in Fig. 2, the red rectangle marks this target position.
Target background filling: to ensure the robustness of the method, to cope with large changes in the appearance of the target during tracking, and to obtain more discriminative information, the region between the red rectangle and the blue rectangle in Fig. 2 is filled with target context information. The specific processing is:

$$I_{sam\_sz} = I_{base\_sz} + I_{p\_con}$$

where $I_{sam\_sz}$ denotes the padded target sample, $I_{base\_sz}$ denotes the original target size, and $I_{p\_con}$ denotes the padded context information.
Step S2: after the target position is initialized and part of the background around it is filled, feature extraction is performed, fusing HOG and CN features. Because the HOG feature captures the oriented gradients of the local appearance of the image while the CN color feature captures the target colors, occlusion and deformation of the target can be handled better.
Feature extraction is performed on the candidate context targets using HOG and CN features: the HOG feature counts the oriented gradients of local image regions, the CN color feature counts the target colors, and fusing the gradient and color cues gives better handling of target occlusion and deformation. The specific processing is as follows.
Gradient feature extraction (HOG):

$$G_x(x, y) = H(x+1, y) - H(x-1, y)$$
$$G_y(x, y) = H(x, y+1) - H(x, y-1)$$

The gradient processing turns the handling of the image into a mathematical calculation: derivatives are taken at the different pixel positions and, since the x and y directions are separated, partial derivatives with respect to $x$ and $y$ are taken after the conversion. In the formulas, $G_x(x, y)$, $G_y(x, y)$, and $H(x, y)$ respectively denote the horizontal gradient, the vertical gradient, and the pixel value at pixel point $(x, y)$ of the input image.
Color feature (CN): the color feature is a global feature, describes surface properties of a scene corresponding to an image or an image area, can be free from the influence of image rotation and translation change, and is free from the influence of scale change after normalization. Color features are visual features that are commonly used in image retrieval applications because colors are often very related to objects and scenes contained in an image, and color features are based on pixel point features, and all pixels belonging to an image or an image region have their own contribution.
Step S3: regression fitting. After context information is introduced, the regression label is reconstructed according to the change in the target spatial structure and a smoothing operation is applied, fitting the approximately Gaussian form (1), where $\sigma$ denotes the Gaussian bandwidth (acting as the regression smoothing parameter) and $(m, n)$ indexes positions within sample $x_t$. As shown in Fig. 3, the figure has two parts: the upper part shows the original regression label and the lower part shows the regression label after the context is introduced; "→" indicates the regression label corresponding to a sample, and "↓" indicates the change of the sample after context information is introduced.

$$y'_t(m, n) = \exp\left(-\frac{(m - M/2)^2 + (n - N/2)^2}{2\sigma^2}\right) \qquad (1)$$
Step S4: a constraint term is added, temporal and spatial regularization terms are introduced, and the objective equation (2) is established; minimizing this objective yields the temporal regularization parameter $\mu$ and the spatial regularization matrix $w$ (the filter), where $c^d$ denotes the filter of channel $d$, $c$ denotes the current-frame filter, and $c_{i-1}$ denotes the previous-frame filter. After context information is introduced, the structure of sample $x_t$ changes.

$$\min_{c}\ \frac{1}{2}\left\|\sum_{d=1}^{D} x_t^d * c^d - y'_t\right\|^2 + \frac{1}{2}\sum_{d=1}^{D}\left\|w \cdot c^d\right\|^2 + \frac{\mu}{2}\left\|c - c_{i-1}\right\|^2 \qquad (2)$$
Step S5: the temporal regularization parameter $\mu$ and the spatial regularization matrix $w$ (the filter) are solved, and the correlation filtering operation, i.e. a convolution, is performed on the samples in the Fourier domain. After the convolution, the response values between the target samples and the filter are scored, the maximum score determines the target region, and the target template is then updated. The role of the temporal regularization parameter is to control the error between the filters of consecutive frames and keep it as small as possible; the role of the spatial regularization matrix $w$ (the filter) is to suppress the background region and highlight the target region through its weight distribution. As shown in Fig. 4, the legend mainly shows the convolution between an image and the filter: the computer converts each pixel of the image into data, with different numbers and sizes representing different colors in the range 0-255. The filter used here is the spatial regularization matrix obtained in step S4, and the correlation filtering operation slides the filter over the input image in a sliding-window fashion; each convolution yields one response value, and finally the maximum response value determines the target position. In the dot-product operation of the legend, an image block of the same size as the filter is first taken from the input image and multiplied elementwise with the corresponding positions of the filter to obtain a new matrix block; the entries of this block are then summed (Σ denotes the summation), giving the response value, so each convolution yields exactly one value.
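Computationally, the sliding-window correlation just described is usually realized as an elementwise product in the Fourier domain. A minimal single-channel sketch (circular correlation, equivalent to the sliding window up to boundary handling):

```python
import numpy as np

def detect(sample, filt):
    """Correlation response map between a sample patch and the filter;
    the location of the maximum response gives the target position."""
    S = np.fft.fft2(sample)
    F = np.fft.fft2(filt, s=sample.shape)
    response = np.real(np.fft.ifft2(S * np.conj(F)))  # circular cross-correlation
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    return (dy, dx), response.max()
```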
The purpose of adding constraint terms is to optimize the objective solution and prevent overfitting: the objective equation is constrained in time and space through the temporal regularization parameter $\mu$ and the spatial regularization matrix $w$ (the filter), with temporal regularization term $\frac{\mu}{2}\|c - c_{i-1}\|^2$ and spatial regularization term $\frac{1}{2}\sum_{d=1}^{D}\|w \cdot c^d\|^2$.
The solution of the temporal regularization parameter and the spatial regularization matrix $w$ (the filter) is realized by minimizing the objective equation. Because the objective equation has a complex form, term merging, a Lagrangian constraint, and an ADMM subproblem decomposition are introduced in the solving process; after decomposition, with auxiliary variable $g$ (subject to $c = g$), scaled Lagrange multiplier $h$, and penalty parameter $\gamma$, formulas (3), (4), and (5) are obtained:

$$c^{(i+1)} = \arg\min_{c}\ \left\|\sum_{d=1}^{D} x_t^d * c^d - y'_t\right\|^2 + \mu\left\|c - c_{i-1}\right\|^2 + \gamma\left\|c - g^{(i)} + h^{(i)}\right\|^2 \qquad (3)$$

$$g^{(i+1)} = \arg\min_{g}\ \sum_{d=1}^{D}\left\|w \cdot g^d\right\|^2 + \gamma\left\|c^{(i+1)} - g + h^{(i)}\right\|^2 \qquad (4)$$

$$h^{(i+1)} = h^{(i)} + c^{(i+1)} - g^{(i+1)} \qquad (5)$$
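As a check on the g-step, subproblem (4) has an elementwise closed-form minimizer, obtained by setting the derivative with respect to $g$ to zero (routine calculus under the splitting assumed above):

$$\frac{\partial}{\partial g}\left[\left\|w \cdot g\right\|^{2} + \gamma\left\|c - g + h\right\|^{2}\right] = 2\,w^{2} \cdot g - 2\,\gamma\,(c + h - g) = 0 \;\Longrightarrow\; g = \frac{\gamma\,(c + h)}{w^{2} + \gamma}$$

with all operations taken elementwise; this is the g-update used in the ADMM sketch given earlier.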
As shown in Fig. 5, starting from the initial frame of the sample images, the correlation filtering operation is performed; the operations on the initial frame and on the subsequent frames are separated into two stages, because a subsequent frame needs the updated target template from the previous frame, whereas the target template of the first frame is given. In Fig. 5 the operations on the initial frame are therefore: perform the correlation operation between the filter and the sample image to obtain confidence scores, then determine the target position from the maximum score. The basic steps are the same for the subsequent frames, with feature extraction and template updating performed after target localization. "→" in the figure indicates the flow of operations; "FFT" denotes the Fourier transform and "IFFT" the inverse Fourier transform, and because correlation becomes an elementwise product in the frequency domain, computation there is fast.
In step S6, the operations of steps S2, S3, S4, and S5 are performed for each subsequent frame until the sequence ends. As shown in Fig. 1, the initial frame and the subsequent frames are depicted, with "…" in the legend indicating omitted picture frames.
Steps S2, S3, S4, and S5 are repeated in a loop over the picture frames, performing the same operations on each frame until the end of the video sequence. The flow in the example of Fig. 1 shows that, for the frames after the initial one, feature extraction is performed, the objective equation is constructed, the spatial regularization matrix is solved, the correlation filtering operation is performed, the target position is determined, and the template is then updated. The one difference is that the target template of the initial frame is given, whereas the target template of each subsequent frame is updated from the target position of the preceding frame.
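Reusing the sketches above (`pad_target_region`, `gaussian_label`, `admm_filter`, `detect`), the overall loop of Fig. 1 can be outlined as follows; this is a schematic single-channel outline on grayscale frames with assumed parameters, not the patented implementation:

```python
import numpy as np

def track(frames, init_position, sigma=2.0):
    """Schematic tracking loop: S1 initialization, then S2-S5 per frame.
    Returns the final padded search window [x, y, w, h]."""
    x, y, w, h = pad_target_region(init_position)          # S1: pad with context
    crop = lambda img: img[y:y + h, x:x + w].astype(np.float64)
    patch = crop(frames[0])                                # S2 stand-in: raw pixels
    label = gaussian_label(h, w, sigma)                    # S3: regression label
    wreg = np.full((h, w), 10.0)                           # assumed spatial weights:
    wreg[h // 4:3 * h // 4, w // 4:3 * w // 4] = 1.0       # penalize background more
    filt = admm_filter(patch, label, wreg, np.zeros((h, w)))  # S4: solve the filter
    for frame in frames[1:]:
        (dy, dx), _ = detect(crop(frame), filt)            # S5: response-map peak
        dy = (dy + h // 2) % h - h // 2                    # wrap the peak to a
        dx = (dx + w // 2) % w - w // 2                    # signed displacement
        y += dy; x += dx                                   # move the search window
        filt = admm_filter(crop(frame), label, wreg, filt) # update template (S6 loop)
    return [x, y, w, h]
```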
In summary, the invention adopts a spatio-temporal regularized correlation filtering method based on context-aware regression: on the basis of correlation filtering it adjusts to target structure changes, introduces context information, and fits the regression equation according to the structural change. The HOG and CN features are fused to provide more discriminative information. Constraint terms are then added: spatially, the filter structure is adjusted so the target region is highlighted and the background region suppressed; temporally, the error between the filters constructed for consecutive frames is constrained. Theoretical analysis shows that, through feature fusion, target-structure fitting, and temporal-spatial constraints, large changes in target appearance during tracking can be handled better, and experiments on the public dataset OTB100 (Object Tracking Benchmark v1.0) show that the tracking results are superior to some existing state-of-the-art algorithms in accuracy, success rate, and robustness.
It will be appreciated by persons skilled in the art that the invention is not limited to details of the foregoing embodiments, and that the invention can be embodied in other specific forms without departing from the spirit or scope of the invention. In addition, various modifications and alterations of this invention may be made by those skilled in the art without departing from the spirit and scope of this invention, and such modifications and alterations should also be viewed as being within the scope of this invention.

Claims (7)

1. A spatio-temporal regularized correlation filtering tracking method based on context-aware regression, characterized by comprising the following steps:
step S1: an initialization sample is given, an initial target template and an initial target context are determined, and an initial frame is sampled; the sample training set is $\{(x_t, y'_t)\}_{t=1}^{T}$, and each sample $x_t \in \mathbb{R}^{M \times N \times D}$ is composed of $D$ feature maps of size $M \times N$, where $x_t$ denotes a sample, $y'_t$ denotes its label, $T$ denotes the number of samples, and $D$ denotes the number of feature channels,
initializing the target position: the target position of the initial frame is given, i.e. the x and y coordinates, width, and height of the target, and the algorithm extracts feature information from this initial position, specifically:

$$I_{position} = [x, y, w, h]$$

where $I_{position}$ denotes the target position and $[x, y, w, h]$ are the coordinate position, width, and height of the target, which together form the rectangular box framing the target,
target background filling: to ensure the robustness of the method, to cope with large changes in the appearance of the target during tracking, and to obtain more discriminative information, the following is performed:

$$I_{sam\_sz} = I_{base\_sz} + I_{p\_con}$$

where $I_{sam\_sz}$ denotes the padded target sample, $I_{base\_sz}$ denotes the original target size, and $I_{p\_con}$ denotes the padded context information,
step S2: after the target position is initialized, part of the background around the target is filled, and feature extraction is then performed, fusing HOG (Histogram of Oriented Gradients) and CN (Color Names) features,
feature extraction is performed on the candidate context targets using the HOG and CN features: the HOG feature counts the oriented gradients of local image regions, the CN color feature counts the target colors, and fusing the gradient and color cues gives better handling of target occlusion and deformation, specifically:
gradient feature extraction (HOG):

$$G_x(x, y) = H(x+1, y) - H(x-1, y)$$
$$G_y(x, y) = H(x, y+1) - H(x, y-1)$$

where $G_x(x, y)$, $G_y(x, y)$, and $H(x, y)$ respectively denote the horizontal gradient, the vertical gradient, and the pixel value at pixel point $(x, y)$ of the input image,
color feature (CN): the color feature is a global feature that describes the surface properties of the scene corresponding to an image or image region; it is unaffected by image rotation and translation and, after normalization, also unaffected by scale change,
step S3: regression fitting; after context information is introduced, the regression label is reconstructed according to the change in the target spatial structure and a smoothing operation is applied, fitting the approximately Gaussian form (1), where $y'_t$ denotes the fitted Gaussian form, $\sigma$ denotes the Gaussian bandwidth (acting as the regression smoothing parameter), and $(m, n)$ indexes positions within sample $x_t$:

$$y'_t(m, n) = \exp\left(-\frac{(m - M/2)^2 + (n - N/2)^2}{2\sigma^2}\right) \qquad (1)$$
step S4: a constraint term is added, temporal and spatial regularization terms are introduced, and the objective equation (2) is established; minimizing this objective yields the temporal regularization parameter $\mu$ and the spatial regularization matrix $w$ (the filter), where $c^d$ denotes the filter of channel $d$, $c$ denotes the current-frame filter, $c_{i-1}$ denotes the previous-frame filter, and the structure of sample $x_t$ changes after context information is introduced,
the purpose of adding the constraint terms is to optimize the objective solution, prevent overfitting, and constrain the objective equation in time and space through the temporal regularization parameter $\mu$ and the spatial regularization matrix $w$ (the filter), with temporal regularization term $\frac{\mu}{2}\|c - c_{i-1}\|^2$ and spatial regularization term $\frac{1}{2}\sum_{d=1}^{D}\|w \cdot c^d\|^2$:

$$\min_{c}\ \frac{1}{2}\left\|\sum_{d=1}^{D} x_t^d * c^d - y'_t\right\|^2 + \frac{1}{2}\sum_{d=1}^{D}\left\|w \cdot c^d\right\|^2 + \frac{\mu}{2}\left\|c - c_{i-1}\right\|^2 \qquad (2)$$
Step S5: solving a time regular parameter mu and a space regular matrix w (filter), carrying out related filtering operation on the samples in a Fourier domain, namely convolution operation, after the final convolution operation, scoring the response values of the target samples and the filter, determining the maximum score as a target area, then updating a target template, wherein the time regular parameter has the function of controlling the error between the two frames of filters before and after the target samples and the filter so that the error is as small as possible, and the space regular matrix w (filter) has the function of restraining a background area and highlighting the target area through weight distribution,
the purpose of adding constraint terms is to optimize the solution of the object, prevent overfitting, constrain the object equation temporally and spatially, the temporal regularization parameter mu and the spatial regularization matrix w (filter), the temporal regularization term
Figure FDA0002373809750000031
Spatial regularization term
Figure FDA0002373809750000032
The time regular parameter and the space regular matrix are solved by minimizing an object equation, the form of the object equation is complex, merging, Lagrange constraint and ADMM subproblem decomposition are introduced in the solving process, and the following formulas (3), (4) and (5) are obtained after decomposition, wherein the specific processing modes are as follows:
Figure FDA0002373809750000033
Figure FDA0002373809750000034
h(i+1)=h(i)+c(i+1)-g(i+1)(5)
step S6: the subsequent frames are processed through steps S2, S3, S4, and S5 until the sequence ends; the loop over the video-sequence frames performs the same operations on each frame until the video sequence ends: feature extraction is performed on the initial frame and on the subsequent frames, the objective equation is constructed, the spatial regularization matrix is solved, the correlation filtering operation is performed, the target position is determined, and the template is then updated, the difference being that the target template of the initial frame is given, whereas the target template of a subsequent frame is updated from the target position of the preceding frame.
2. The spatio-temporal regularized correlation filtering tracking method based on context-aware regression according to claim 1, wherein in step S1 a sample target is initialized, the context of the selected target is enlarged, and the feature extraction processing is specifically:
initializing the target position: the target position of the initial frame is given, i.e. the x and y coordinates, width, and height of the target, and the algorithm extracts feature information from this initial position, specifically:

$$I_{position} = [x, y, w, h]$$

where $I_{position}$ denotes the target position and $[x, y, w, h]$ are the coordinate position, width, and height of the target, which together form the rectangular box surrounding the target,
target background filling: to ensure the robustness of the method, to cope with large changes in the appearance of the target during tracking, and to obtain more discriminative information, the following is performed:

$$I_{sam\_sz} = I_{base\_sz} + I_{p\_con}$$

where $I_{sam\_sz}$ denotes the padded target sample, $I_{base\_sz}$ denotes the original target size, and $I_{p\_con}$ denotes the padded context information.
3. The spatio-temporal regularized correlation filtering tracking method based on context-aware regression according to claim 1, wherein in step S2 feature extraction is performed on the candidate context targets using HOG and CN features; the HOG feature counts the oriented gradients of local image regions, the CN color feature counts the target colors, and fusing the gradient and color cues gives better handling of target occlusion and deformation, specifically:
gradient feature extraction (HOG):

$$G_x(x, y) = H(x+1, y) - H(x-1, y)$$
$$G_y(x, y) = H(x, y+1) - H(x, y-1)$$

where $G_x(x, y)$, $G_y(x, y)$, and $H(x, y)$ respectively denote the horizontal gradient, the vertical gradient, and the pixel value at pixel point $(x, y)$ of the input image,
color feature (CN): the color feature is a global feature that describes the surface properties of the scene corresponding to an image or image region; it is unaffected by image rotation and translation and, after normalization, also unaffected by scale change.
4. The spatio-temporal regularized correlation filtering tracking method based on context-aware regression according to claim 1, wherein in the fitting of the regression target in step S3, the target structure changes after context information is introduced, the regression structure of the target is adjusted, and the target label shape is smoothly fitted as in (1):

$$y'_t(m, n) = \exp\left(-\frac{(m - M/2)^2 + (n - N/2)^2}{2\sigma^2}\right) \qquad (1)$$

where $y'_t$ denotes the fitted Gaussian form, $\sigma$ denotes the Gaussian bandwidth (acting as the regression smoothing parameter), and $(m, n)$ indexes positions within sample $x_t$.
5. The spatio-temporal regularized correlation filtering tracking method based on context-aware regression according to claim 1, wherein step S4 introduces the temporal and spatial regularization terms whose purpose is to constrain the objective solution, prevent overfitting, and constrain the objective equation in time and space through the temporal regularization parameter $\mu$ and the spatial regularization matrix $w$ (the filter), with temporal regularization term $\frac{\mu}{2}\|c - c_{i-1}\|^2$ and spatial regularization term $\frac{1}{2}\sum_{d=1}^{D}\|w \cdot c^d\|^2$; the temporal regularization term controls the difference between the filters of the previous and the next frame to keep it as small as possible, and the spatial regularization term operates in space, weighting the filter so as to highlight the target region and suppress the background region.
6. The spatio-temporal regularized correlation filtering tracking method based on context-aware regression according to claim 1, wherein the solution of the temporal regularization parameter and the spatial regularization matrix in step S5 is realized by minimizing the objective equation (2), specifically:

$$\min_{c}\ \frac{1}{2}\left\|\sum_{d=1}^{D} x_t^d * c^d - y'_t\right\|^2 + \frac{1}{2}\sum_{d=1}^{D}\left\|w \cdot c^d\right\|^2 + \frac{\mu}{2}\left\|c - c_{i-1}\right\|^2 \qquad (2)$$

the temporal regularization parameter $\mu$ and the spatial regularization matrix $w$ (the filter) are solved, and the correlation filtering operation, i.e. a convolution, is performed on the samples in the Fourier domain; after the convolution, the response values between the target samples and the filter are scored, the maximum score determines the target region, and the target template is then updated; the role of the temporal regularization parameter is to control the error between the filters obtained for consecutive frames and keep it as small as possible, and the role of the spatial regularization matrix (the filter) is to suppress the background region and highlight the target region through weight distribution,
because the objective equation has a complex form, term merging, a Lagrangian constraint, and an ADMM subproblem decomposition are introduced in the solving process, and after decomposition the following formulas (3), (4), and (5) are obtained:

$$c^{(i+1)} = \arg\min_{c}\ \left\|\sum_{d=1}^{D} x_t^d * c^d - y'_t\right\|^2 + \mu\left\|c - c_{i-1}\right\|^2 + \gamma\left\|c - g^{(i)} + h^{(i)}\right\|^2 \qquad (3)$$

$$g^{(i+1)} = \arg\min_{g}\ \sum_{d=1}^{D}\left\|w \cdot g^d\right\|^2 + \gamma\left\|c^{(i+1)} - g + h^{(i)}\right\|^2 \qquad (4)$$

$$h^{(i+1)} = h^{(i)} + c^{(i+1)} - g^{(i+1)} \qquad (5)$$
7. The method according to claim 1, wherein step S6 is mainly implemented as a loop operation in which steps S2, S3, S4, and S5 are repeated until the video sequence ends, specifically:
steps S2, S3, S4, and S5 are repeated in a loop over the picture frames, i.e. the same operations are performed on each frame until the video sequence ends: feature extraction is performed on the frames after the initial frame, the objective equation is constructed, the spatial regularization matrix is solved, the correlation filtering operation is performed, the target position is determined, and the template is then updated, the difference being that the target template of the initial frame is given, whereas the target template of a subsequent frame is updated from the target position of the preceding frame.
CN202010059049.XA 2020-01-18 2020-01-18 Space-time regular correlation filtering tracking method based on context awareness regression Active CN111260691B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010059049.XA CN111260691B (en) 2020-01-18 2020-01-18 Space-time regular correlation filtering tracking method based on context awareness regression


Publications (2)

Publication Number Publication Date
CN111260691A true CN111260691A (en) 2020-06-09
CN111260691B CN111260691B (en) 2023-04-25

Family

ID=70947253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010059049.XA Active CN111260691B (en) 2020-01-18 2020-01-18 Space-time regular correlation filtering tracking method based on context awareness regression

Country Status (1)

Country Link
CN (1) CN111260691B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107358623A (en) * 2017-07-12 2017-11-17 武汉大学 A kind of correlation filtering track algorithm based on conspicuousness detection and robustness size estimation
CN108549839A (en) * 2018-03-13 2018-09-18 华侨大学 The multiple dimensioned correlation filtering visual tracking method of self-adaptive features fusion
WO2020000253A1 (en) * 2018-06-27 2020-01-02 潍坊学院 Traffic sign recognizing method in rain and snow

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FISCHER G, et al.: "Maximum power point tracking method using a modified for Power Generation Iet" *
CHEN Qianru; LIU Risheng; FAN Xin; LI Haojie: "Robust object tracking via adaptive fusion of multiple correlation filters" (in Chinese) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111951298A (en) * 2020-06-25 2020-11-17 湖南大学 Target tracking method fusing time series information
CN111951298B (en) * 2020-06-25 2024-03-08 湖南大学 Target tracking method integrating time sequence information
CN112183493A (en) * 2020-11-05 2021-01-05 北京澎思科技有限公司 Target tracking method, device and computer readable storage medium
CN115440314A (en) * 2022-09-06 2022-12-06 湖南艾科瑞生物工程有限公司 Agarose gel electrophoresis performance detection method and related equipment
CN115440314B (en) * 2022-09-06 2023-08-15 湖南艾科瑞生物工程有限公司 Agarose gel electrophoresis performance detection method and related equipment

Also Published As

Publication number Publication date
CN111260691B (en) 2023-04-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant