CN113538509A - Visual tracking method and device based on adaptive correlation filtering feature fusion learning - Google Patents

Visual tracking method and device based on adaptive correlation filtering feature fusion learning

Info

Publication number
CN113538509A
Authority
CN
China
Prior art keywords
feature
response
adaptive
fusion
tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110615306.8A
Other languages
Chinese (zh)
Other versions
CN113538509B (en)
Inventor
朱鹏飞 (Zhu Pengfei)
于洪涛 (Yu Hongtao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202110615306.8A priority Critical patent/CN113538509B/en
Publication of CN113538509A publication Critical patent/CN113538509A/en
Application granted granted Critical
Publication of CN113538509B publication Critical patent/CN113538509B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/20: Analysis of motion
    • G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20081: Training; Learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20212: Image combination
    • G06T2207/20221: Image fusion; Image merging
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/30: Subject of image; Context of image processing
    • G06T2207/30241: Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a visual tracking method and device based on adaptive correlation-filtering feature fusion learning. In the method, two feature extractors convert the input tracking image into shallow and deep feature representations; a correlation filtering module learns and updates the correlation filter coefficients with the aid of an adaptive temporal smoothing penalty, and the target feature maps are then correlated with the updated filters to obtain response maps of the shallow and deep features. The feature fusion weights output by the adaptive feature response fusion learning module at the previous time step are used to weight and fuse the shallow and deep response maps into a final target feature response map, while the fusion weights output at the current time step are used to fuse the response maps of the next frame. Based on the final target feature response map, the scale of the current target is obtained by multi-scale search, realizing visual tracking. The device comprises a processor and a memory. The invention effectively addresses challenging problems in the tracking process such as occlusion, illumination change and complex backgrounds.

Description

Visual tracking method and device based on adaptive correlation filtering feature fusion learning
Technical Field
The invention relates to the field of visual tracking, in particular to a robust visual tracking method and device based on adaptive correlation filtering feature fusion learning.
Background
Visual target tracking is a challenging task in the field of computer vision with many practical applications, for example traffic monitoring, human-computer interaction, autonomous robots and autonomous driving. Although existing tracking methods have improved markedly in accuracy and robustness, problems such as illumination change, target occlusion, target deformation and background clutter are still not well solved.
In recent years, tracking models based on Discriminative Correlation Filters (DCFs) have received increasing attention due to their high accuracy and efficiency. Model updating based on multi-feature fusion and temporal constraints has significantly improved the precision and robustness of DCF-based trackers. Despite this progress, existing DCF-based trackers still have two limitations: non-robust feature representation and poor adaptability to diverse scenes.
The feature response map is the result of the correlation operation between the target feature map and the discriminative filter, and its quality strongly affects target localization accuracy. A rich feature response provides an accurate and robust representation of the object for visual tracking. Currently, a common approach to feature response map fusion is to integrate multiple response maps with fusion weights of equal size. This coarse combination of multi-feature responses improves the quality of the target representation to some extent, but it also undermines the original advantages of each response map. For example, the response of the hand-crafted Histogram of Oriented Gradients (HOG) feature contains a sharp target peak that can be used for accurate localization, but it is often disturbed by the responses of similar objects. In contrast, response maps produced by Convolutional Neural Networks (CNNs) carry high-confidence target semantics, yet they are only suitable for coarse spatial localization of the target. Naively adding the two weakens both the discrimination and the precise localization expressed by the original response maps. The Staple model (a tracking method based on fusing the response maps of a color histogram and HOG features) selects a feature fusion weight through comparative experiments and then combines the HOG feature and the color histogram with this weight, exploiting the discriminative power of the HOG response and the robustness of the color histogram to target deformation. However, a weight selected through extensive comparison experiments on a validation set is difficult to generalize to other scenes with different tracking challenge attributes. The MCCT (a fusion tracking method combining response-map self-evaluation with an expert strategy) and UPDT (a fusion tracking method for shallow and deep features based on response-map confidence evaluation) models instead solve for the fusion weights of shallow and deep feature responses using a self-evaluation value of the current target feature response. Such self-evaluation-based fusion avoids the extra cost of model fine-tuning, and its tracking accuracy is superior to fixed fusion weights. However, the evaluation of the fused response is strongly influenced by the current tracking quality; as tracking errors accumulate, tracking drift easily occurs.
In addition, temporal constraints can maintain reliable model updates and thereby ensure robust target feature responses. Temporal smoothing can overcome the model corruption caused by unreliable target representations; in particular, a temporal smoothing penalty influences the direction and speed of model updating so as to resist short-term challenges such as target occlusion or appearance change. The STRCF [1] tracking model (spatial-temporal regularized correlation filter tracking method) reduces the model corruption caused by target occlusion by introducing temporal regularization into a DCF-based tracker. Considering the interference of background noise, the ARCF model (correlation filter tracking method based on temporal distortion suppression) imposes a consistency constraint on the feature response maps of adjacent frames during filter learning, so that the filter coefficients are updated and learned from target information rather than background information. In addition, the LADCF model (correlation filter tracking method based on temporal consistency constraints) achieves temporal consistency by restricting the filter update to remain close to its historical value, so that the obtained target representation is more robust to temporal appearance changes of the scene or target.
Although the above methods achieve some improvement in tracking performance, a fixed temporal smoothing penalty ignores the variability of tracking situations and is applicable only to a limited set of tracking scenarios. Furthermore, the video sequences used for tracking typically contain a wide variety of scene changes, such as illumination changes and occlusion.
To this end, two challenges currently facing correlation-filtering-based tracking models are summarized: how to form a robust target response, and how to maintain reliable model updates.
Disclosure of Invention
The invention provides a robust visual tracking method and device based on adaptive correlation-filtering feature fusion learning, which effectively address challenging problems in the tracking process such as occlusion, illumination change and complex backgrounds, as described in detail below:
in a first aspect, a robust visual tracking method based on adaptive correlation filtering feature fusion learning includes the following steps:
the two feature extractors extract the input tracking image into shallow and deep feature representations, the correlation filtering module completes the learning and updating of the correlation filter coefficients with the aid of an adaptive temporal smoothing penalty, and the target feature maps are then correlated with the updated correlation filters to obtain response maps of the shallow and deep features;
the feature fusion weights output by the adaptive feature response fusion learning module at the previous time step are weighted and fused with the response maps of the shallow and deep features to form the final target feature response map, and the feature fusion weights output at the current time step are used to fuse the feature response maps of the next frame;
and based on the final target feature response map, the scale of the current target is obtained by multi-scale search, realizing visual tracking.
In one embodiment, the adaptive temporal smoothing correlation filtering module uses the overall quality difference and the confidence difference of the target feature response maps between adjacent video frames as the basis for obtaining the degree of variation of the current target and scene;
the obtained degree of variation of the target and scene is then used to design a temporal smoothing penalty whose strength is adjusted adaptively, which is used for reliable updating of the adaptive correlation filtering model.
In an embodiment, the temporal smoothing penalty with adaptively adjusted strength is specifically:
[formula rendered as an image in the original publication]
where the first term of the function e(.), |R_{max,t-1} - R_max|, represents the maximum-confidence gap between adjacent frames; ||R_{t-1} - R||_2 represents the difference in overall response quality between adjacent frames; γ is a hyperparameter; R = β_t ⊙ R^D + (I - β_t) ⊙ R^S denotes the feature response integrated by the adaptive fusion weight β_t, which is learned by the adaptive feature response fusion learning module; R_{t-1} denotes the fused response of the previous frame; and R_{max,t-1} and R_max denote the maximum confidence of the previous and current frames, respectively.
The adaptive feature response fusion learning module introduces Histogram of Oriented Gradients and convolutional neural network features to integrate the feature responses; the desired correlation filters are solved by two independent correlation filter training processes.
In one embodiment, the adaptive feature response fusion learning module learns the optimal weights for fusing the histogram of oriented gradients and convolutional neural network feature response maps by a ridge regression method supervised with the Gaussian response ground truth.
The optimal weight is expressed as:
[formula rendered as an image in the original publication]
where β_t denotes the optimal feature response fusion matrix; β̃_t is a reference value for β_t that provides prior information and avoids model degradation; and μ is a hyperparameter.
In a second aspect, a robust visual tracking apparatus based on adaptive correlation filter feature fusion learning, the apparatus comprising: a processor and a memory, the memory having stored therein program instructions, the processor calling the program instructions stored in the memory to cause the apparatus to perform the method steps of any of the first aspects.
In a third aspect, a computer-readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps of any one of the first aspect.
The technical scheme provided by the invention has the beneficial effects that:
1. The method fuses deep and hand-crafted features in a supervised learning manner and effectively exploits the complementary characteristics of the two kinds of features; compared with existing methods that fuse multiple features with fixed weights or a response-map self-evaluation strategy, it is more intelligent and effective and can integrate a higher-quality feature response map for target localization;
2. Addressing the inherent temporal nature of visual target tracking, the invention proposes, on top of the existing fixed-strength temporal smoothing, a temporal smoothing constraint that is adjusted adaptively according to real-time changes in tracking quality; this constraint method has better adaptability and generalization;
3. Extensive experiments on five challenging visual tracking benchmarks demonstrate the superiority of the invention over other advanced methods; for example, compared with the recent tracking methods ECO (tracking method based on efficient factorized convolution) and STRCF [1] (spatial-temporal regularized correlation filter tracking method), AFLCF obtains AUC (target overlap precision) gains of 1.9% and 4.4% respectively on the LaSOT benchmark; in addition, visualization experiments and tracking-scene attribute experiments verify that the method can effectively cope with challenging problems such as target deformation, occlusion, illumination variation, camera motion, fast target motion and background clutter;
4. The invention is updated online, requires no massive labeled training data, has a small model size and low hardware requirements, and can be applied to mobile devices such as unmanned aerial vehicles, unmanned vehicles and autonomous robots.
Drawings
FIG. 1 is a flow chart of a robust visual tracking method based on adaptive correlation filter feature fusion learning;
FIG. 2 is a schematic diagram of a video sequence with scene and target changes, used for evaluating the IOU (intersection-over-union, i.e., target overlap ratio) score of each video frame under temporal smoothing penalties of different strengths;
When the scene or the target (the rider's head in the figure) does not change significantly, a stronger temporal smoothing penalty yields a higher tracking IOU score; for example, in the two pictures of the first column (the adjacent 40th and 41st frames of the video sequence) the target moves smoothly and the scene shows no obvious change. When the target or scene changes drastically, a weaker temporal smoothing penalty achieves a higher tracking IOU score; for example, in the second column (the 69th and 70th frames of the video sequence) the target undergoes obvious fast motion and severe motion blur. The figure illustrates that a larger temporal smoothing penalty is suitable only when the target and scene show no obvious variation, whereas a smaller penalty achieves higher tracking performance when the target or background varies severely. Therefore, a temporal smoothing penalty that adapts to the changes of the tracking scene and target is designed, which can effectively improve the tracking performance of the model.
FIG. 3 is a comparison of the performance of the method with 18 advanced tracking methods on the OTB-2015 benchmark;
The left plot compares the distance precision of the tracking methods, where the value after each method name is the distance precision it achieved; the right plot compares the overlap success rate, where the value after each method name is the overlap success rate it achieved. The experiments show that AFLCF, the tracking method corresponding to the invention, achieves leading tracking performance.
FIG. 4 is a diagram comparing the performance of the method with nine advanced tracking methods in a variety of scenes with different tracking challenges;
The value after each tracking method name indicates the overlap success rate it achieved. The challenges mainly include illumination variation, target deformation, occlusion, camera motion, fast target motion and background clutter. The experimental results show that the method is clearly superior to existing tracking methods and is more robust to the various tracking challenges.
Fig. 5 is a schematic structural diagram of a robust visual tracking device based on adaptive correlation filtering feature fusion learning.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
To address the above problems, an embodiment of the invention provides a tracking model based on adaptive feature response fusion learning for DCFs. The model uses adaptive supervised learning, instead of manual setting or self-evaluation, to solve for the weights used to fuse multiple features, and then integrates the response maps of the shallow HOG feature and the deep CNN feature with the learned weights. The solved optimal fusion weights can fully exploit the complementary advantages of the multi-feature responses, thereby yielding a robust target representation. Meanwhile, the design of the adaptive temporal smoothing penalty ensures that the model can still be updated reliably in scenes with motion changes.
Fig. 1 shows the overall structure of the tracking method provided by an embodiment of the invention. The embodiment uses the histogram of oriented gradients and a convolutional neural network as extractors of shallow and deep features; the model is updated and the filter coefficients are solved with a correlation filtering method equipped with an adaptive temporal smoothing penalty; and the adaptive feature response fusion learning module learns, in the current frame, the optimal fusion weights for the response maps of the HOG feature and the CNN feature. The solved fusion weights are then used to integrate the multi-feature response maps in the next frame. The fixed-strength temporally smoothed correlation filter tracking method STRCF [1] serves as the baseline model of the embodiment of the invention.
The objective function of STRCF [1] is shown in equation (1); for simplicity, the index t of the current time is uniformly omitted (e.g., x_t is written as x),
argmin_w (1/2) || Σ_{k=1}^{K} x^k * w^k − y ||^2 + (1/2) Σ_{k=1}^{K} || M ⊙ w^k ||^2 + (λ/2) || w − w_{t−1} ||^2    (1)
where x = [x^1, ..., x^k, ..., x^K] denotes the input feature map consisting of K channels, and x^k denotes the feature of the k-th channel in the input feature map. w denotes the correlation filter to be learned, and * denotes the convolution operator. y is the Gaussian ground-truth target response (a default ground truth in machine learning, not detailed further in the embodiment of the invention). The second term is the spatial regularization borrowed from SRDCF (correlation filter tracking model based on spatial regularization), where M is a Gaussian spatial weight and k is the index of the k-th channel of the correlation filter. ⊙ denotes element-wise multiplication. ||w − w_{t−1}||^2 is the temporal smoothing constraint used to prevent model corruption, λ is the penalty coefficient, and w_{t−1} is the correlation filter at the previous time step.
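For concreteness, the sketch below evaluates the three terms of equation (1) with numpy; the (K, H, W) array layout, the FFT-based circular cross-correlation, and all variable names are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def strcf_objective(x, w, y, M, w_prev, lam):
    """Evaluate the three terms of an STRCF-style objective (illustrative sketch).

    x, w, w_prev : (K, H, W) feature map, current filter, previous filter
    y            : (H, W) Gaussian ground-truth response
    M            : (H, W) Gaussian spatial weight
    lam          : temporal smoothing penalty coefficient
    """
    # Data term: per-channel circular cross-correlation (via the FFT), summed over
    # channels and compared with the Gaussian ground-truth response y.
    X = np.fft.fft2(x, axes=(-2, -1))
    W = np.fft.fft2(w, axes=(-2, -1))
    response = np.real(np.fft.ifft2(np.sum(X * np.conj(W), axis=0)))
    data_term = 0.5 * np.sum((response - y) ** 2)

    # Spatial regularization: the Gaussian mask M suppresses filter energy away from the target.
    spatial_term = 0.5 * np.sum((M[None] * w) ** 2)

    # Temporal smoothing: penalize deviation from the previous filter with strength lam.
    temporal_term = 0.5 * lam * np.sum((w - w_prev) ** 2)

    return data_term + spatial_term + temporal_term
```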
First, adaptive feature response fusion learning
Robust object representations with explicit semantics and spatial details enable trackers to overcome various object changes. The tracker introduces Histogram of Oriented Gradients (HOG) and Convolutional Neural Network (CNN) features to integrate the feature responses. The CNN features have more channels than HOG, so two independent correlation filter training processes are required to solve for the desired correlation filters. The training samples are denoted x^D and x^S, and the corresponding filters to be solved are denoted w^D and w^S.
Specifically, x^D denotes the training sample expressed by CNN features, x^S the training sample expressed by HOG features, w^D the filter corresponding to x^D, and w^S the filter corresponding to x^S. The feature response map is written as the convolution of the sample x with the filter w, as shown in equation (2), where R^D and R^S denote the response maps of the CNN and HOG features, respectively.
To facilitate the subsequent fusion of the feature response maps, R^D and R^S are resized to the same size.
R^D = Σ_{k=1}^{K1} x_D^k * w_D^k,  R^S = Σ_{k=1}^{K2} x_S^k * w_S^k    (2)
where K1 is the number of channels of the CNN features, set to 512 in the embodiment of the invention, and K2 is the number of channels of the HOG feature, set to 31; in specific implementations the embodiment of the invention does not limit these values, which are set as needed in practical applications.
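To make the two-branch response computation of equation (2) concrete, the following sketch cross-correlates each feature map with its filter in the Fourier domain and resizes the HOG response to the size of the CNN response; the channel counts of 512 and 31 follow the text above, while the bilinear resizing and the helper names are assumptions made only for this illustration.

```python
import numpy as np
from scipy.ndimage import zoom

def response_map(x, w):
    """Sum of per-channel circular cross-correlations between features x and filter w (FFT-based)."""
    X = np.fft.fft2(x, axes=(-2, -1))
    W = np.fft.fft2(w, axes=(-2, -1))
    return np.real(np.fft.ifft2(np.sum(X * np.conj(W), axis=0)))

def shallow_and_deep_responses(x_D, w_D, x_S, w_S):
    """R^D from CNN features (e.g. 512 channels) and R^S from HOG features (e.g. 31 channels)."""
    R_D = response_map(x_D, w_D)
    R_S = response_map(x_S, w_S)
    # Resize R^S to R^D's spatial size so the two maps can later be fused element-wise.
    R_S = zoom(R_S, (R_D.shape[0] / R_S.shape[0], R_D.shape[1] / R_S.shape[1]), order=1)
    return R_D, R_S
```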
Innovatively, the AFLCF model (a tracking method based on adaptive correlation-filter feature response fusion learning) of the embodiment of the invention learns the optimal weights for fusing the HOG and CNN feature response maps by ridge regression supervised with the Gaussian response ground truth y. The corresponding objective function can be expressed as:
argmin_{β_t} || β_t ⊙ R^D + (I − β_t) ⊙ R^S − y ||^2 + μ || β_t − β̃_t ||^2    (3)
where β_t denotes the optimal feature response fusion matrix, and β̃_t is a reference value for β_t (its specific setting is given by a formula rendered as an image in the original publication) that provides prior information and avoids model degradation. I is the identity matrix and μ is a hyperparameter. The second term of the above objective function is a regularization term used to prevent overfitting of the model.
To improve computational efficiency, Parseval's theorem [2] can be used to transform equation (2) into the Fourier domain, giving equation (4) [formula rendered as an image in the original publication], where R̂^D, x̂^D and ŵ^D are the Fourier-domain representations of the CNN response map, the CNN feature and the corresponding filter, and R̂^S, x̂^S and ŵ^S are the Fourier-domain representations of the HOG response map, the HOG feature and the corresponding filter.
Accordingly, equation (3) is transformed into equation (5) [formula rendered as an image in the original publication], where ŷ is the representation of the Gaussian response ground truth y in the Fourier domain.
Finally, the objective function corresponding to equation (5) can be solved to obtain the analytic solution for β_t, given as equation (6) [formula rendered as an image in the original publication], where ^T denotes transposition and the division in equation (6) is performed element-wise.
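As an illustration of the ridge-regression idea behind equations (3) to (6), the sketch below solves for the per-element fusion weight directly in the spatial domain instead of the Fourier-domain formulation used by the patent; the closed form follows from differentiating the per-element objective, and the clipping step and the choice of reference weight are assumptions made only for this example.

```python
import numpy as np

def fuse_weight(R_D, R_S, y, beta_ref, mu=0.01):
    """Per-element ridge-regression fusion weight (spatial-domain sketch).

    Independently for every element, minimizes
        (beta * R_D + (1 - beta) * R_S - y)^2 + mu * (beta - beta_ref)^2,
    whose closed-form solution is written below.
    """
    d = R_D - R_S
    beta = ((y - R_S) * d + mu * beta_ref) / (d * d + mu)
    return np.clip(beta, 0.0, 1.0)  # keep the weights in a convex-combination range (assumption)
```

In this sketch the reference value beta_ref could, for instance, be the weight matrix learned in the previous frame, so that the regularizer keeps the fusion weights from drifting abruptly between frames.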
Second, adaptive temporal smoothing
Correlation-filter-based trackers use a temporal smoothing penalty to enforce similarity between the current filter and the previous one in order to prevent model corruption. Intuitively, when the target undergoes a sharp appearance change or occlusion, the required filter coefficients should not be too similar to the filter at the previous time step, so the strength of the temporal smoothing penalty should be reduced appropriately; otherwise, the opposite holds.
To verify this assumption, the embodiment of the invention numerically tests temporal smoothing penalties of different strengths. As shown in Fig. 2, the adjacent 40th and 41st frames of a video sequence experience no significant target or scene variation, and a larger penalty factor (λ = 20, 30) achieves a better IOU (intersection-over-union, i.e., target overlap ratio) score than a weaker temporal penalty (λ = 0, 1). In contrast, the 69th and 70th frames experience significant fast target motion and motion blur, so the penalty factors λ = 0 and 1 achieve better IOU scores than the larger settings (λ = 20, 30). Moreover, a large penalty also risks causing tracking drift in subsequent frames, because a large penalty coefficient forces adjacent filters to be too similar and the model cannot adapt to complex, changing scenes. Therefore, the optimal setting of λ should adapt to the changing tracking scenario. Based on this analysis and experimental verification, and inspired by the peak-to-sidelobe ratio (PSR) [3], the embodiment of the invention proposes an adaptive temporal smoothing penalty. The PSR is a measure of feature response quality that can evaluate whether the current target feature response is good.
[Equation (7), the adaptive temporal smoothing penalty; the formula appears as an image in the original publication.]
where R = β_t ⊙ R^D + (I − β_t) ⊙ R^S denotes the feature response integrated by the adaptive fusion weight β_t, which is learned by the adaptive feature response fusion learning module; R_{t−1} denotes the fused response of the previous frame; γ is a hyperparameter; and R_{max,t−1} and R_max denote the maximum confidence of the previous and current frames, respectively.
The first term of the function e(.), |R_{max,t−1} − R_max|, represents the maximum-confidence gap between adjacent frames, and ||R_{t−1} − R||_2 represents the difference in overall response quality between adjacent frames. In practice, the maximum confidence and the overall response quality measure the accuracy and the robustness of the tracker, respectively: the maximum confidence determines the localization of the target, while the overall response map carrying the target semantics affects the distinction between foreground and background. The change of the tracker's performance in adjacent frames can therefore be measured by the sum of these two terms, which also indirectly reflects the degree of scene change in the current frame. Accordingly, the embodiment of the invention concludes that this change measure becomes larger as the degree of tracking scene change increases, and λ becomes correspondingly smaller; otherwise the opposite holds. The adaptive temporal smoothing penalty thus adapts to changing tracking scenarios and maintains reliable model updates.
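The sketch below illustrates one way the response-quality differences described above could drive the penalty strength; the exact functional form of equation (7) is given only as an image in the original publication, so the exponential mapping used here is an assumption that merely matches the stated behaviour (the larger the change measure, the smaller λ) and reuses the hyperparameter γ.

```python
import numpy as np

def adaptive_lambda(R, R_prev, gamma=17.0):
    """Adapt the temporal smoothing penalty to the amount of target/scene change.

    R, R_prev : fused response maps of the current and previous frame
    Returns a penalty that shrinks as the change measure grows (assumed exponential form).
    """
    conf_gap = abs(R.max() - R_prev.max())        # |R_max,t-1 - R_max|
    quality_gap = np.linalg.norm(R - R_prev)      # ||R_t-1 - R||_2
    return gamma * np.exp(-(conf_gap + quality_gap))
```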
Third, tracking process description and implementation conditions
The workflow of the method is shown in Fig. 1. The adaptive correlation filtering tracking model consists of two feature extractors (a histogram of oriented gradients and a convolutional neural network), a correlation filtering model equipped with adaptive temporal smoothing, and an adaptive feature response fusion learning module. The feature extractors extract shallow and deep feature representations of the input tracking image. The correlation filtering model takes the spatial-temporally regularized STRCF [1] model as its base model; on top of STRCF's fixed temporal smoothing penalty, the embodiment of the invention introduces an adaptively adjusted temporal smoothing penalty, which indirectly evaluates the degree of abrupt change in the tracking scene from the current tracking quality of the model and adjusts the penalty strength accordingly. Specifically, when the scene or target changes significantly along the time axis, the corresponding temporal smoothing strength is reduced; conversely, when the scene change is mild, the temporal smoothing penalty strength should be increased moderately.
During tracking, the embodiment of the invention adaptively adjusts the temporal smoothing strength according to the difference in tracking quality between adjacent video frames, thereby ensuring reliable updates of the model and of the correlation filters. The tracking image input to the model is extracted into shallow and deep feature representations, and the target feature maps are correlated with the updated correlation filters to obtain the response maps R^S and R^D of the shallow and deep features. The feature fusion weight β_t learned by the adaptive feature response fusion learning module is used to fuse the feature response maps of the next frame, yielding a more robust target representation, while the response maps R^S and R^D of the current video frame are weighted and fused with the previously learned fusion weights to form the final target feature response. In the fused feature response map, the position of the maximum response value is the position of the target, and the scale of the current target is obtained by a multi-scale search method.
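A minimal per-frame sketch of the fusion-and-localization step described above is given below; feature extraction, filter updating and scale estimation are left out, and the function and variable names are placeholders rather than the patent's code.

```python
import numpy as np

def locate_target(R_D, R_S, beta_prev):
    """Fuse the deep and shallow response maps with last frame's weights and localize the peak.

    R_D, R_S  : (H, W) response maps of the CNN and HOG features (already the same size)
    beta_prev : (H, W) fusion weights learned in the previous frame
    """
    R = beta_prev * R_D + (1.0 - beta_prev) * R_S      # R = beta ⊙ R^D + (I - beta) ⊙ R^S
    row, col = np.unravel_index(np.argmax(R), R.shape)
    return (row, col), R                               # the peak position gives the target center
```

The fused map R would then be passed both to the fusion-weight solver, which produces β_t for the next frame, and to the adaptive penalty computation.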
In the embodiment of the invention, HOG is taken as the shallow feature and the Conv4 layer of the VGG-M model as the deep target feature representation. Target scale estimation is performed with a search over 7 scales with a scale step of 1.01. The number of ADMM iterations is set to 2. The hyperparameter μ in equation (3) is set to 0.01 and γ in equation (7) is set to 17. The AFLCF model was implemented in MATLAB 2017a with the MatConvNet toolbox. All simulation experiments were performed on a PC equipped with an Intel i7-8700 CPU, 16 GB RAM and an NVIDIA GTX 1050 Ti GPU.
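The following hedged sketch mirrors the 7-scale search with a 1.01 step quoted above; response_at_scale is a hypothetical callable standing in for patch extraction plus correlation at a given relative scale, and is not part of the original disclosure.

```python
import numpy as np

def multi_scale_search(response_at_scale, base_size, n_scales=7, step=1.01):
    """Evaluate the response at several relative scales and keep the best one.

    response_at_scale : callable taking a scale factor and returning a response map
                        (hypothetical stand-in for patch extraction plus correlation)
    base_size         : (width, height) of the target estimated in the previous frame
    """
    scales = step ** (np.arange(n_scales) - n_scales // 2)   # 1.01^-3 ... 1.01^3
    peaks = [response_at_scale(s).max() for s in scales]
    best = scales[int(np.argmax(peaks))]
    return (base_size[0] * best, base_size[1] * best), best
```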
Simulation results: to verify the effectiveness of the AFLCF method, it is compared with other state-of-the-art trackers on five different types of visual target tracking benchmarks: OTB-2015, VisDrone2019, NfS, DTB70 and LaSOT. The experimental results are evaluated mainly by two criteria, distance precision (Accuracy) and overlap success rate (AUC). Fig. 3 compares the performance of the AFLCF model with 18 published tracking methods on the OTB-2015 benchmark, and Table 1 lists the comparison with the tracking models whose performance ranks in the top five. Tables 2-5 report the performance of the AFLCF model compared with other advanced tracking methods on the four benchmarks VisDrone2019, NfS, DTB70 and LaSOT. Fig. 4 compares the method with nine advanced tracking methods in six scenes with different tracking challenges; the test results show that the method is clearly superior to existing tracking methods and more robust. On all five benchmarks, the proposed AFLCF tracking model obtains top tracking precision, effectively verifying the effectiveness of the method. In addition, the five benchmarks cover short-term tracking, UAV tracking, high-frame-rate video tracking, high-diversity tracking and large-scale scene tracking, respectively, and the highly competitive tracking performance obtained on them verifies the strong generalization and adaptability of the method to different tracking scenes.
The compared tracking methods referred to in the invention are as follows: SRDCF (correlation filter tracking method based on spatial regularization); KCF (kernelized correlation filter tracking method); fDSST (correlation filter tracking method based on multi-scale search); SiameseFC (tracking method based on a twin fully convolutional neural network); Staple (tracking method based on fusing the response maps of color histogram and histogram of oriented gradients features); BACF (background-aware tracking method); Staple-CA (context-aware version of the fusion of color histogram and HOG response maps); DSiam (dynamic twin network based tracking method); PTAV (tracking method based on parallel tracking and verification); ECO (tracking method based on efficient factorized convolution); ECO-HC (efficient factorized convolution tracking method using only shallow features); STRCF [1] (spatial-temporal regularized correlation filter tracking method); DeepSTRCF [1] (spatial-temporal regularized correlation filter tracking method with deep features); MCCT (fusion tracking method combining response-map self-evaluation and an expert strategy); TRACA (context-aware tracking method combined with deep features); CSRDCF (correlation filter tracking method based on channel and spatial reliability); UPDT-RCG (shallow and deep feature fusion tracking method based on response-map confidence evaluation, applying a Gaussian mask instead of a cosine window); HCFT-star (tracking method based on hierarchical feature fusion); ASRCF (correlation filter tracking method based on adaptive spatial regularization); ARCF (correlation filter tracking method based on temporal distortion suppression); AutoTrack (correlation filter tracking method based on automatic spatio-temporal regularization).
Table 1: Comparison of the tracking performance of 19 methods on the OTB-2015 benchmark (top five selected)
[table rendered as an image in the original publication]
Table 2: Performance comparison of various tracking methods on the VisDrone2019 benchmark
[table rendered as an image in the original publication]
Table 3: Performance comparison of various tracking methods on the NfS benchmark
[table rendered as an image in the original publication]
Table 4: Performance comparison of various tracking methods on the DTB70 benchmark
[table rendered as an image in the original publication]
Table 5: Performance comparison of the top five among all correlation-filter-based tracking methods on the LaSOT benchmark
[table rendered as an image in the original publication]
Comparing the AFLCF tracking method with the nine advanced tracking methods in tracking scenes with different challenges, where the challenges mainly include illumination variation, target deformation, occlusion, camera motion, fast target motion and background clutter, the experimental results show that the method is clearly superior to existing tracking methods and is more robust to the various tracking challenges.
Based on the same inventive concept, an embodiment of the invention further provides a robust visual tracking device based on adaptive correlation-filtering feature fusion learning, the device comprising: a processor and a memory, the memory storing program instructions, and the processor calling the program instructions stored in the memory to cause the device to perform the following method steps:
the two feature extractors extract the input tracking image into shallow and deep feature representations, the correlation filtering module completes the learning and updating of the correlation filter coefficients with the aid of the adaptive temporal smoothing penalty, and the target feature maps are then correlated with the updated correlation filters to obtain response maps of the shallow and deep features;
the feature fusion weights output by the adaptive feature response fusion learning module at the previous time step are weighted and fused with the response maps of the shallow and deep features to form the final target feature response map, and the feature fusion weights output at the current time step are used to fuse the feature response maps of the next frame;
and based on the final target feature response map, the scale of the current target is obtained by multi-scale search, realizing visual tracking.
In one embodiment, the adaptive temporal smoothing correlation filtering module uses the overall quality difference and the confidence difference of the target feature response maps between adjacent video frames as the basis for obtaining the degree of variation of the current target and scene;
the obtained degree of variation of the target and scene is then used to design a temporal smoothing penalty whose strength is adjusted adaptively, which is used for reliable updating of the adaptive correlation filtering model.
In another embodiment, the temporal smoothing penalty with adaptively adjusted strength is specifically:
[formula rendered as an image in the original publication]
where the first term of the function e(.), |R_{max,t-1} - R_max|, represents the maximum-confidence gap between adjacent frames; ||R_{t-1} - R||_2 represents the difference in overall response quality between adjacent frames; γ is a hyperparameter; R = β_t ⊙ R^D + (I - β_t) ⊙ R^S denotes the feature response integrated by the adaptive fusion weight β_t, which is learned by the adaptive feature response fusion learning module; R_{t-1} denotes the fused response of the previous frame; and R_{max,t-1} and R_max denote the maximum confidence of the previous and current frames, respectively.
In specific implementations, the adaptive feature response fusion learning module introduces Histogram of Oriented Gradients and convolutional neural network features to integrate the feature responses; the desired correlation filters are solved by two independent correlation filter training processes.
In one embodiment, the adaptive feature response fusion learning module learns the optimal weights for fusing the histogram of oriented gradients and convolutional neural network feature response maps by a ridge regression method supervised with the Gaussian response ground truth.
The optimal weight is expressed as:
[formula rendered as an image in the original publication]
where β_t denotes the optimal feature response fusion matrix; β̃_t is a reference value for β_t that provides prior information and avoids model degradation; and μ is a hyperparameter.
It should be noted that the device description in the above embodiment corresponds to the method description in the embodiments and will not be repeated here.
The processor 1 and the memory 2 may be implemented by any device with computing capability, such as a computer, a single-chip microcomputer or a microcontroller; in specific implementations, the embodiment of the invention does not limit them, and they are selected according to the needs of practical applications.
The memory 2 and the processor 1 transmit data signals through the bus 3, which is not described in detail in the embodiment of the present invention.
Based on the same inventive concept, an embodiment of the present invention further provides a computer-readable storage medium, where the storage medium includes a stored program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the method steps in the foregoing embodiments.
The computer readable storage medium includes, but is not limited to, flash memory, hard disk, solid state disk, and the like.
It should be noted that the descriptions of the readable storage medium in the above embodiments correspond to the descriptions of the method in the embodiments, and the descriptions of the embodiments of the present invention are not repeated here.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the invention are brought about in whole or in part when the computer program instructions are loaded and executed on a computer.
The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium or a semiconductor medium, etc.
References:
[1] Li, F., Tian, C., Zuo, W., Zhang, L., Yang, M.: Learning Spatial-Temporal Regularized Correlation Filters for Visual Tracking. In: CVPR, 2018, pp. 4904-4913.
[2] Gonzalez, R. C., Woods, R. E.: Digital Image Processing, 2nd Edition. Prentice Hall, New York, 2002.
[3] Bolme, D. S., Beveridge, J. R., Draper, B. A., Lui, Y. M.: Visual object tracking using adaptive correlation filters. In: IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2544-2550.
in the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-described embodiments of the present invention are merely provided for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (8)

1. A visual tracking method based on adaptive correlation filtering feature fusion learning is characterized by comprising the following steps:
extracting the input tracking image into shallow and deep feature representations with two feature extractors, completing the learning and updating of the correlation filter coefficients by a correlation filtering module with the aid of an adaptive temporal smoothing penalty, and correlating the target feature maps with the updated correlation filters to obtain response maps of the shallow and deep features;
weighting and fusing the feature fusion weights output by the adaptive feature response fusion learning module at the previous time step with the response maps of the shallow and deep features to form a final target feature response map, wherein the feature fusion weights output at the current time step are used to fuse the feature response maps of the next frame;
and based on the final target feature response map, obtaining the scale of the current target by multi-scale search, thereby realizing visual tracking.
2. The visual tracking method based on the adaptive correlation filtering feature fusion learning according to claim 1,
the adaptive temporal smoothing correlation filtering module uses the overall quality difference and the confidence difference of the target feature response maps between adjacent video frames as the basis for obtaining the degree of variation of the current target and scene;
and the obtained degree of variation of the target and scene is used to design a temporal smoothing penalty whose strength is adjusted adaptively, which is used for reliable updating of the adaptive correlation filtering model.
3. The visual tracking method based on adaptive correlation filtering feature fusion learning according to claim 1, wherein the temporal smoothing penalty with adaptively adjusted strength is specifically:
[formula rendered as an image in the original publication]
where the first term of the function e(.), |R_{max,t-1} - R_max|, represents the maximum-confidence gap between adjacent frames; ||R_{t-1} - R||_2 represents the difference in overall response quality between adjacent frames; γ is a hyperparameter; R = β_t ⊙ R^D + (I - β_t) ⊙ R^S denotes the feature response integrated by the adaptive fusion weight β_t, which is learned by the adaptive feature response fusion learning module; R_{t-1} denotes the fused response of the previous frame; and R_{max,t-1} and R_max denote the maximum confidence of the previous and current frames, respectively.
4. The visual tracking method based on adaptive correlation filtering feature fusion learning according to claim 1, wherein the adaptive feature response fusion learning module introduces histogram of oriented gradients and convolutional neural network features to integrate the feature responses; the desired correlation filters are solved by two independent correlation filter training processes.
5. The visual tracking method based on adaptive correlation filtering feature fusion learning according to claim 1, wherein the adaptive feature response fusion learning module learns the optimal weights for fusing the histogram of oriented gradients and convolutional neural network feature response maps by a ridge regression method supervised with the Gaussian response ground truth.
6. The visual tracking method based on adaptive correlation filtering feature fusion learning according to claim 5,
the optimal weight is expressed as:
[formula rendered as an image in the original publication]
where β_t denotes the optimal feature response fusion matrix; β̃_t is a reference value for β_t that provides prior information and avoids model degradation; and μ is a hyperparameter.
7. An apparatus for visual tracking based on adaptive correlation filter feature fusion learning, the apparatus comprising: a processor and a memory, the memory having stored therein program instructions, the processor calling upon the program instructions stored in the memory to cause the apparatus to perform the method steps of any of claims 1-6.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps of any of claims 1-6.
CN202110615306.8A 2021-06-02 2021-06-02 Visual tracking method and device based on adaptive correlation filtering feature fusion learning Active CN113538509B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110615306.8A CN113538509B (en) 2021-06-02 2021-06-02 Visual tracking method and device based on adaptive correlation filtering feature fusion learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110615306.8A CN113538509B (en) 2021-06-02 2021-06-02 Visual tracking method and device based on adaptive correlation filtering feature fusion learning

Publications (2)

Publication Number Publication Date
CN113538509A true CN113538509A (en) 2021-10-22
CN113538509B CN113538509B (en) 2022-09-27

Family

ID=78095088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110615306.8A Active CN113538509B (en) 2021-06-02 2021-06-02 Visual tracking method and device based on adaptive correlation filtering feature fusion learning

Country Status (1)

Country Link
CN (1) CN113538509B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419102A (en) * 2022-01-25 2022-04-29 江南大学 Multi-target tracking detection method based on frame difference time sequence motion information

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges
CN110348492A (en) * 2019-06-24 2019-10-18 昆明理工大学 A kind of correlation filtering method for tracking target based on contextual information and multiple features fusion
CN110738684A (en) * 2019-09-12 2020-01-31 昆明理工大学 target tracking method based on correlation filtering fusion convolution residual learning
CN111080675A (en) * 2019-12-20 2020-04-28 电子科技大学 Target tracking method based on space-time constraint correlation filtering
CN111401178A (en) * 2020-03-09 2020-07-10 蔡晓刚 Video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering
CN111612817A (en) * 2020-05-07 2020-09-01 桂林电子科技大学 Target tracking method based on depth feature adaptive fusion and context information
CN111723632A (en) * 2019-11-08 2020-09-29 珠海达伽马科技有限公司 Ship tracking method and system based on twin network
US10861170B1 (en) * 2018-11-30 2020-12-08 Snap Inc. Efficient human pose tracking in videos
CN112163020A (en) * 2020-09-30 2021-01-01 上海交通大学 Multi-dimensional time series anomaly detection method and system
CN112329784A (en) * 2020-11-23 2021-02-05 桂林电子科技大学 Correlation filtering tracking method based on space-time perception and multimodal response

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10861170B1 (en) * 2018-11-30 2020-12-08 Snap Inc. Efficient human pose tracking in videos
CN109816689A (en) * 2018-12-18 2019-05-28 昆明理工大学 A kind of motion target tracking method that multilayer convolution feature adaptively merges
CN110348492A (en) * 2019-06-24 2019-10-18 昆明理工大学 A kind of correlation filtering method for tracking target based on contextual information and multiple features fusion
CN110738684A (en) * 2019-09-12 2020-01-31 昆明理工大学 target tracking method based on correlation filtering fusion convolution residual learning
CN111723632A (en) * 2019-11-08 2020-09-29 珠海达伽马科技有限公司 Ship tracking method and system based on twin network
CN111080675A (en) * 2019-12-20 2020-04-28 电子科技大学 Target tracking method based on space-time constraint correlation filtering
CN111401178A (en) * 2020-03-09 2020-07-10 蔡晓刚 Video target real-time tracking method and system based on depth feature fusion and adaptive correlation filtering
CN111612817A (en) * 2020-05-07 2020-09-01 桂林电子科技大学 Target tracking method based on depth feature adaptive fusion and context information
CN112163020A (en) * 2020-09-30 2021-01-01 上海交通大学 Multi-dimensional time series anomaly detection method and system
CN112329784A (en) * 2020-11-23 2021-02-05 桂林电子科技大学 Correlation filtering tracking method based on space-time perception and multimodal response

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DAQUN LI ET AL.: "Siamese Visual Tracking With Deep Features and Robust Feature Fusion", IEEE Access *
DEEPAYAN CHAKRABARTI ET AL.: "Evolutionary Clustering", ACM SIGKDD International Conference on Knowledge Discovery, 2006-08-20/23, Philadelphia, PA (US) *
YUAN Yue (袁越): "Multi-scale target tracking combining correlation filtering and deep networks", China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology series *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419102A (en) * 2022-01-25 2022-04-29 江南大学 Multi-target tracking detection method based on frame difference time sequence motion information
CN114419102B (en) * 2022-01-25 2023-06-06 江南大学 Multi-target tracking detection method based on frame difference time sequence motion information

Also Published As

Publication number Publication date
CN113538509B (en) 2022-09-27

Similar Documents

Publication Publication Date Title
CN111354017B (en) Target tracking method based on twin neural network and parallel attention module
CN111291679A (en) Target specific response attention target tracking method based on twin network
CN109859241B (en) Adaptive feature selection and time consistency robust correlation filtering visual tracking method
CN111260688A (en) Twin double-path target tracking method
CN111582349B (en) Improved target tracking algorithm based on YOLOv3 and kernel correlation filtering
CN110472577B (en) Long-term video tracking method based on adaptive correlation filtering
CN109087337B (en) Long-time target tracking method and system based on hierarchical convolution characteristics
CN112446900B (en) Twin neural network target tracking method and system
CN111583294B (en) Target tracking method combining scale self-adaption and model updating
CN109727272B (en) Target tracking method based on double-branch space-time regularization correlation filter
Zhang et al. A background-aware correlation filter with adaptive saliency-aware regularization for visual tracking
CN110992401A (en) Target tracking method and device, computer equipment and storage medium
CN111968153A (en) Long-time target tracking method and system based on correlation filtering and particle filtering
CN112329784A (en) Correlation filtering tracking method based on space-time perception and multimodal response
He et al. Towards robust visual tracking for unmanned aerial vehicle with tri-attentional correlation filters
Fu et al. DR 2 track: towards real-time visual tracking for UAV via distractor repressed dynamic regression
Fu et al. Robust multi-kernelized correlators for UAV tracking with adaptive context analysis and dynamic weighted filters
CN113538509B (en) Visual tracking method and device based on adaptive correlation filtering feature fusion learning
CN109146928B (en) Target tracking method for updating gradient threshold judgment model
CN108257148B (en) Target suggestion window generation method of specific object and application of target suggestion window generation method in target tracking
CN112884799A (en) Target tracking method in complex scene based on twin neural network
CN112614161A (en) Three-dimensional object tracking method based on edge confidence
CN112767440A (en) Target tracking method based on SIAM-FC network
CN111539985A (en) Self-adaptive moving target tracking method fusing multiple features
CN115984325A (en) Target tracking method for target volume searching space-time regularization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant