CN111612817A

CN111612817A - Target tracking method based on depth feature adaptive fusion and context information

Info

Publication number: CN111612817A
Application number: CN202010375319.8A
Authority: CN
Inventors: 纪元法; 何传骥; 孙希延; 付文涛; 严素清; 符强; 王守华; 黄建华
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2020-05-07
Filing date: 2020-05-07
Publication date: 2020-09-01

Abstract

The invention discloses a target tracking method based on depth feature adaptive fusion and context information, which comprises the steps of firstly, obtaining a first frame image of a video image sequence, and establishing a deep layer feature model and a shallow layer feature model based on a context sensing framework; then, a plurality of second frame images of the video image sequence are obtained, and deep layer feature responses and shallow layer feature responses of corresponding tracking targets are calculated by utilizing the deep layer feature models and the shallow layer feature models; obtaining the position of the tracking target in the corresponding second frame image according to the response sum after the deep layer feature response and the shallow layer feature response are adaptively fused; and judging the average peak value correlation energy based on a threshold value, and updating the deep layer feature model and the shallow layer feature model until the video image sequence is finished, so that the target can be effectively tracked, and the accuracy is high.

Description

Target tracking method based on depth feature adaptive fusion and context information

Technical Field

The invention relates to the technical field of image processing, in particular to a target tracking method based on depth feature adaptive fusion and context information.

Background

Visual tracking has always been a major concern in the field of computer vision. In recent years, a visual tracking algorithm based on correlation filtering develops rapidly, and has certain advantages in tracking speed and precision. However, the research of target tracking still has certain difficulties, and external interference factors such as target shielding, rapid movement, illumination change and the like directly influence the performance of the tracking algorithm.

The method based on the correlation filtering and the method based on the deep learning are the current mainstream target tracking algorithm, wherein the correlation filtering is one of the research hotspots in the current target tracking field by virtue of the speed advantage brought by the rapid calculation characteristic of the correlation filtering in the frequency domain. Bolme et al first introduced a correlation filtering algorithm into the field of target tracking, and proposed a MOSSE algorithm, but the gray scale features cannot accurately describe the appearance of a target. Henriques and the like propose a CSK algorithm by improving a kernel function, complete the intensive sampling of samples in a cyclic shift mode, and solve the problem of insufficient training samples, but the CSK algorithm adopts single-channel gray scale characteristics and has limited characteristic description capability. Henriques et al then proposed a kernel correlation filtering KCF algorithm, introducing a kernel function into the ridge regression, extending the gray features of a single channel to the HOG features of multiple channels. Possegger et al use color histogram features to describe the appearance of the target, with good results. Martin Danelljan and the like extend a CSK algorithm by utilizing color attribute Characteristics (CN), reduce the operation amount by Principal Component Analysis (PCA) dimension reduction, and improve the tracking precision.

The method only adopts a single characteristic to describe the appearance of the target, and the identification degree of the target is poor and the target is easy to be interfered in a complex scene. The traditional related filtering tracking algorithm has a serious boundary effect, the appearance of a tracked target cannot be well described by adopting artificial features, and a proper feature fusion strategy is lacked, so that the accuracy of the algorithm is seriously influenced by the factors, and the target cannot be effectively tracked.

Disclosure of Invention

The invention aims to provide a target tracking method based on depth feature adaptive fusion and context information, which improves the accuracy of an algorithm and can effectively track a target.

In order to achieve the above object, the present invention provides a target tracking method based on depth feature adaptive fusion and context information, comprising:

acquiring a first frame of image of a video image sequence, and establishing a deep layer feature model and a shallow layer feature model based on a context-aware framework;

acquiring a plurality of second frame images of the video image sequence, and calculating a deep layer feature response and a shallow layer feature response of a corresponding tracking target by using the deep layer feature model and the shallow layer feature model;

obtaining the position of the tracking target in the corresponding second frame image according to the response sum after the deep layer feature response and the shallow layer feature response are adaptively fused;

and judging the average peak value correlation energy based on a threshold value, and updating the deep layer feature model and the shallow layer feature model until the video image sequence is ended.

The acquiring a first frame of image of a video image sequence and establishing a deep layer feature model and a shallow layer feature model based on a context-aware framework includes:

and introducing a background sample of the target in the first frame of image of the acquired video image sequence as context information into template learning, simultaneously acquiring three-layer convolution characteristics of a target area and three-layer convolution characteristics of four image blocks of the target, namely the upper image block, the lower image block, the left image block, the right image block and the left image block, and calculating a deep layer characteristic model by using a detection formula.

The acquiring a first frame of image of a video image sequence, and establishing a deep layer feature model and a shallow layer feature model based on a context-aware framework, further includes:

and establishing a shallow feature model according to the color histogram feature and the HOG feature in the background sample.

Acquiring a plurality of second frame images of the video image sequence, and calculating a deep layer feature response and a shallow layer feature response of a corresponding tracking target by using the deep layer feature model and the shallow layer feature model, wherein the method comprises the following steps:

and sequentially acquiring a plurality of second frame images, extracting deep features and shallow features of the tracking target corresponding to the second frame images based on the target position, and calculating corresponding deep feature responses and shallow feature responses by using the deep feature model and the shallow feature model.

Wherein, obtaining a plurality of second frame images of the video image sequence, and calculating a deep layer feature response and a shallow layer feature response of a corresponding tracking target by using the deep layer feature model and the shallow layer feature model, further comprises:

and performing depth feature extraction by using a VGG-NET-19 depth network, and adjusting the size of the corresponding deep feature image by using a bilinear interpolation method.

Obtaining the position of the tracking target in the corresponding second frame image according to the response sum after the deep layer feature response and the shallow layer feature response are adaptively fused, wherein the obtaining of the position of the tracking target in the corresponding second frame image comprises:

and sorting the local maximum values of the deep layer feature response and the shallow layer feature response in an ascending order to serve as candidate states, and obtaining the candidate states with set overall loss and corresponding weight coefficients based on a minimized loss function.

Wherein, obtaining the position of the tracking target in the corresponding second frame image according to the response sum after the self-adaptive fusion of the deep layer feature response and the shallow layer feature response, further comprises:

and combining the response graphs of the deep layer features and the shallow layer features by using a self-adaptive feature fusion strategy, fusing the loss weight coefficients, introducing a relaxation variable and a distance function to perform prediction quality evaluation, and determining a main peak and an interference peak to obtain the position of the tracking target in the corresponding second frame image.

Wherein the determining an average peak correlation energy based on a threshold and updating the deep layer feature model and the shallow layer feature model until the video image sequence ends comprises:

and calculating corresponding average peak correlation energy based on the corresponding second frame image, and updating the deep layer feature model and the shallow layer feature model when the average peak correlation energy is larger than a historical average peak correlation energy average value until the video image sequence is ended.

The invention relates to a target tracking method based on depth feature adaptive fusion and context information, which comprises the steps of firstly, obtaining a first frame image of a video image sequence, and establishing a deep layer feature model and a shallow layer feature model based on a context sensing framework; then, a plurality of second frame images of the video image sequence are obtained, and deep layer feature responses and shallow layer feature responses of corresponding tracking targets are calculated by utilizing the deep layer feature models and the shallow layer feature models; obtaining the position of the tracking target in the corresponding second frame image according to the response sum after the deep layer feature response and the shallow layer feature response are adaptively fused; and judging the average peak value correlation energy based on a threshold value, and updating the deep layer feature model and the shallow layer feature model until the video image sequence is finished, so that the target can be effectively tracked, and the accuracy is high.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic step diagram of a target tracking method based on depth feature adaptive fusion and context information according to the present invention.

Fig. 2 is a graph comparing accuracy curves of 8 algorithms provided by the present invention.

Fig. 3 is a graph comparing success rate curves of 8 algorithms provided by the present invention.

FIG. 4 is a flowchart illustrating a target tracking method based on depth feature adaptive fusion and context information according to the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.

In the description of the present invention, it is to be understood that the terms "upper", "lower", "left", "right", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience in describing the present invention and simplifying the description, but do not indicate or imply that the referred devices or elements must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention. Further, in the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.

Referring to fig. 1, the present invention provides a target tracking method based on depth feature adaptive fusion and context information, including:

s101, obtaining a first frame of image of a video image sequence, and establishing a deep layer feature model and a shallow layer feature model based on a context-aware framework.

In particular, in the correlation filtering tracking, the background information around the target has a very significant influence on the performance of the tracker. The related filtering tracker easily causes boundary effect due to the characteristics of the cyclic samples, the boundary effect can be effectively limited but background information is reduced through the cosine window, and when a target is deformed or the background is disorderedTracking failures may result. Aiming at the problem, a background sample of a target in a first frame image of an acquired video image sequence is taken as context information and introduced into template learning, and meanwhile, three-layer convolution characteristics of the target area and three-layer convolution characteristics of four image blocks around the target, namely the upper layer, the lower layer, the left layer and the right layer, are acquired and taken as a training template of a depth feature model. And the four image blocks of the target, namely the upper, lower, left and right image blocks are context image blocks. Establishing a shallow feature model according to the color histogram feature and the HOG feature in the background sample, and tracking the object n₀∈RⁿExtracting k context image blocks n around_i∈RⁿThe corresponding circulant matrix is N₀∈R^n×nAnd N_i∈R^n×n. And calculating a deep layer characteristic model by using a detection formula:

wherein λ is₂For the regularization parameters, in Fourier form, complex conjugate transform, z is a circulant matrix, r_dIndicating the location of the target, ⊙ is a dot product operation,

as a result of the fourier transform of the depth features of the target region,

and the result is obtained after the Fourier transform of the depth features of the area around the target.

Is the dot product of the depth features of the target image,

the dot product of the depth features of the image blocks around the target.

The method has the advantages that multilayer convolution characteristics are collected to serve as deep characteristics, the color histogram characteristics and the HOG characteristics serve as shallow characteristics, the appearance of the target can be accurately and effectively described, a context-aware frame (context-aware) is introduced, four image blocks, namely an upper image block, a lower image block, a left image block, a right image block and a left image block are collected around the target, the boundary effect is reduced, the.

S102, obtaining a plurality of second frame images of the video image sequence, and calculating a deep layer feature response and a shallow layer feature response of the corresponding tracking target by using the deep layer feature model and the shallow layer feature model.

Specifically, a plurality of second frame images except the first frame image are sequentially acquired, deep features and shallow features of a tracking target corresponding to the second frame images are extracted based on the target position, and deep feature responses and shallow feature responses corresponding to the tracking target are calculated by using the deep feature models and the shallow feature models, wherein the deep feature responses are calculated by using a VGG-NET-19 depth network to extract the depth features, and by taking a MotorRolling sequence in an OTB100 data set as an example, conv5-4 with more semantic information and conv3-4 and conv4-4 with more detail information in a feature map are selected to describe the appearance of the target_ikDepending on the feature map where positions i and k are adjacent, the feature vector for position i is then expressed as:

x_i＝∑β_ikm_k

after the feature maps of the conv3-4, conv4-4 and conv5-4 convolutional layers are subjected to bilinear interpolation and visualization processing, shallow feature maps such as conv3-4 and conv4-4 have higher resolution, and the contour of the target can be described more accurately. As the depth increases, the conv5-4 deep features describe the range of regions where the target is located, and the brightness is higher. Through sequence comparison, when the shape and the background of the tracked target change simultaneously, the extracted depth features can still distinguish the target.

Calculation method of shallow feature responseThe method comprises the following steps: the shallow feature is mainly a manual feature, comprises RGB pixels, HOG, CN and the like, contains detailed information such as texture, color and the like, has high spatial resolution and is suitable for high-precision positioning. The method extracts the color histogram feature and the HOG feature as the shallow feature, and the color histogram and the HOG response value are represented by a feature image psi of an M-channel_x:H→R^MCalculated and defined on a finite grid g.

response_hist(x)＝g(ψ_x)

S103, obtaining the position of the tracking target in the corresponding second frame image according to the response sum after the deep layer feature response and the shallow layer feature response are adaptively fused.

Specifically, the depth features encode high-level semantic information, are insensitive to external deformation and can be used for coarse positioning, and the shallow features have higher detail resolution and are suitable for accurate positioning. The two features are treated separately, the deep feature is responsible for robustness, the shallow feature emphasizes accuracy, the two features are fused in a self-adaptive mode, and feature complementation is achieved. Collecting three-layer convolution characteristics as depth characteristics, color histogram characteristics and HOG characteristics as shallow characteristics, respectively training a relevant filter by the two characteristics, constructing two independent appearance models, and combining response graphs of the two characteristics by adopting a self-adaptive characteristic fusion strategy:

y_β(t)＝β_dy_d(t)+β_sy_s(t)

wherein, y_dRepresenting the deep feature score, y_sDenotes the shallow feature fraction, y_βRepresents the total score obtained by weighting two scores, β ═ β_d,β_s) Weights representing the dark and light scores.

The response graph can reflect the accuracy and robustness of target positioning, the accuracy is related to the response sharpness degree around the predicted target, and the sharper the main peak is, the stronger the accuracy is; robustness is related to the interval from the main peak to the interference peak, and the larger the distance from the main peak to the secondary peak is, the stronger the robustness is. In order to evaluate the reliability of the prediction target, a prediction quality evaluation method is adopted:

where y represents the detection score function of the image search region and y (t) ∈ R is the position t ∈ R²Target prediction score of, t^*Representing candidate prediction targets. The delta distance function is defined as:

lead-in slack variable ξ ═ ξ_t*{y_βCo-estimating the fractional weight β and the target state t based on a formula combining response maps of the two features^*Maximizing the quality assessment by minimizing the loss function yields:

in actual operation, local maximum values are respectively searched from deep layer scores and shallow layer scores, the local maximum values are sorted and screened according to response values in ascending order and then serve as limited candidate states omega, and each state t is optimized through a minimization loss function^*∈ Ω, and then selecting the candidate state t with the set, i.e., minimum, overall loss^*As the final prediction result, the corresponding weight coefficient β is obtained (β)_d,β_s). The method adopts a deep and shallow feature adaptive fusion strategy to combine the response graphs of the two features, adaptively changes the feature fusion weight according to different tracking backgrounds, effectively adapts to various tracking scenes, and improves the tracking performance.

And S104, judging the average peak value correlation energy based on a threshold value, and updating the deep layer feature model and the shallow layer feature model until the video image sequence is finished.

In particular, the selection of the model update strategy has a significant impact on the performance of the correlation filter. During the tracking process, the tracking process is inevitably interfered by various factors, such as target loss, background occlusion, blurring and the like, and the updating of the model by using wrong information may cause the tracking to drift or even fail. And calculating corresponding average peak correlation energy based on the corresponding second frame image, evaluating the confidence degree of the response by adopting the Average Peak Correlation Energy (APCE), and judging the reliability of the tracking result. The index can reflect the reliability of the target tracking result and the fluctuation degree of the response diagram, and can be calculated by the following formula:

wherein, F_maxIndicates the highest response, F_minRepresents the lowest response, F_w,hRepresenting the response at location (w, h), mean () represents the average of the values in parentheses. When the target generates deformation and background interference, the response diagram generates severe fluctuation and multi-peak interference occurs, and the APCE is in a lower state. When the target is not disturbed, the response diagram has a sharp and well-defined (unambiguated) peak, while the APCE is in a higher state. Thus, the model is not updated when the APCE value decreases significantly, only when the APCE value and F_maxThe model is updated when the ratio of the updated model is larger than the historical mean value, so that the model updating times are reduced, and the model drift condition is also reduced.

And when the average peak correlation energy is larger than the historical average peak correlation energy mean value, namely a threshold value, updating the deep layer feature model and the shallow layer feature model until the video image sequence is ended. The updated model is as follows:

wherein, α_{deep_t-1}Depth feature model for previous frame, α_{deep_t}A depth feature model for the current frame, η_deepLearning rate for depth features α_{shallow_t-1}Model of shallow features for the previous frame, α_{shallow_t}Is a shallow feature model of the current frame, η_shallowThe learning rate of the shallow features.

For example, using OTB100 data set evaluates the tracking performance of the method. The evaluation results were compared in terms of both Precision map (Precision plot) and Success rate (Success plot) maps. The precision graph adopts CLE (Central location error), and CLE is defined as the Euclidean distance between the target coordinate value detected by the tracking method and the actually marked target coordinate value. The success rate refers to the overlapping rate of the boundary frames, and the target boundary frame r detected by the given detection method_tAnd the actual labeled target bounding box r_aThe overlap ratio is defined as:

where, U and @, respectively, represent the union of two regions, and | · | represents the number of pixels.

The experimental hardware environment is a Win10 operating system, an Intel Core i7-8750H (2.20GHz) processor, an 8GB memory and Matlab R2018a, the result of comparing the algorithm with the SRDCF, STAPLE, LCT, RPT, SAMF, KCF and CSK tracking algorithm on an OTB100 data set is shown in a comparison graph of the precision curve and the success rate curve of the 8 algorithms provided in figures 2 and 3, OUR is the method provided by the invention, and the method has the advantages that the precision and the success rate are both in the first place, the precision is improved by 0.2% compared with the second-place LCT algorithm, the precision is improved by 1.5% compared with the SRDCF, and the precision is improved by 10.8% compared with the KCF. On the success rate, the algorithm is improved by 0.2% compared with SRDCF and 20.9% compared with KCF. It can be seen that the invention has better tracking performance.

As shown in fig. 4, the process of the target tracking method based on the depth feature adaptive fusion and the context information specifically includes: reading a first frame image of a video sequence, determining a rectangular region of a tracked target, extracting three-layer convolution features of the target region and three-layer convolution features of four image blocks of the target, namely an upper image block, a lower image block, a left image block, a right image block and a left image block as deep-layer features, extracting color histogram features and HOG features as shallow-layer features, and respectively calculating a deep-layer feature model and a shallow-layer feature model; secondly, reading a next frame of image, extracting deep layer features and shallow layer features of the target at the target position predicted by the previous frame, and calculating deep layer feature response and shallow layer feature response of the target of the current frame according to the feature model calculated by the previous frame; then, adopting a deep-layer and shallow-layer feature self-adaptive fusion strategy, calculating a weight value by finding a candidate state with minimum overall loss in the deep-layer and shallow-layer response scores to obtain the optimal fusion proportion of the two features, and calculating the position with the maximum response value according to the fused response sum, wherein the position is the position (x, y) of the tracking target in the image; and finally, calculating the APCE value of the current frame according to the fused response value, if the APCE value of the current frame is larger than the historical APCE average value, namely APCE is larger than APCE _ mean, judging that the reliability of the image tracking result of the frame is higher, updating the deep layer feature model and the shallow layer feature model, and if not, not updating until the video sequence is finished, effectively tracking the target and having higher accuracy.

While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A target tracking method based on depth feature adaptive fusion and context information is characterized by comprising the following steps:

2. The method for tracking a target based on adaptive fusion of depth features and context information as claimed in claim 1, wherein the obtaining a first frame image of a video image sequence and establishing a deep feature model and a shallow feature model based on a context-aware framework comprises:

3. The method for tracking a target based on adaptive fusion of depth features and context information as claimed in claim 2, wherein the obtaining a first frame image of a video image sequence and establishing a deep feature model and a shallow feature model based on a context-aware framework further comprises:

4. The method for tracking a target based on adaptive fusion of deep and shallow features and contextual information according to claim 3, wherein obtaining a plurality of second frame images of the video image sequence and calculating a deep feature response and a shallow feature response of the corresponding tracked target by using the deep feature model and the shallow feature model comprises:

5. The method as claimed in claim 4, wherein the method for tracking the target based on the adaptive fusion of the deep and shallow features and the context information comprises obtaining a plurality of second frame images of the video image sequence and calculating a deep feature response and a shallow feature response of the corresponding tracked target by using the deep feature model and the shallow feature model, and further comprising:

6. The method for tracking the target based on the adaptive fusion of the deep and shallow features and the context information as claimed in claim 5, wherein obtaining the position of the tracked target in the corresponding second frame image according to the sum of the responses after the adaptive fusion of the deep feature response and the shallow feature response comprises:

7. The method for tracking the target based on the adaptive fusion of the deep and shallow features and the context information as claimed in claim 6, wherein the position of the tracked target in the corresponding second frame image is obtained according to the sum of the responses after the adaptive fusion of the deep feature response and the shallow feature response, further comprising:

8. The method for tracking an object based on adaptive fusion of depth features and context information as claimed in claim 7, wherein the determining the average peak correlation energy based on the threshold and updating the deep feature model and the shallow feature model until the end of the video image sequence comprises: