CN113963026A - Target tracking method and system based on non-local feature fusion and online updating

Info

Publication number
CN113963026A
CN113963026A
Authority
CN
China
Prior art keywords
target
template
features
image
searched
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111255280.7A
Other languages
Chinese (zh)
Inventor
李爱民
刘腾
刘笑含
周福珍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology
Priority to CN202111255280.7A
Publication of CN113963026A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Abstract

The invention discloses a target tracking method and system based on non-local feature fusion and online updating. The method comprises the following steps: acquiring a video sequence, wherein the first frame of the video serves as the template image and the current frame serves as the image to be searched; inputting the template image and the image to be searched into an offline-trained Siamese network, and respectively extracting high-level features and low-level features of each; fusing the low-level features and the high-level features through a non-local feature pyramid module to extract target template features and features of the target region to be searched; matching the features of the target region to be searched with the target template features to obtain an initial target tracking result; and judging whether a threshold condition is met according to the average peak-to-correlation energy and the maximum-response change of the initial target tracking result, and if not, updating the target template features to obtain the final target tracking result.

Description

Target tracking method and system based on non-local feature fusion and online updating
Technical Field
The invention belongs to the technical field of computer vision and image processing, and particularly relates to a target tracking method and system based on non-local feature fusion and online updating.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
The basic process of target tracking is to acquire the position of a target in the first frame of a video and continuously locate the target in subsequent frames. During tracking, the tracker may encounter various problems, such as deformation, occlusion, and disappearance of the target; external changes such as illumination variation and background clutter also strongly affect the tracked target, posing great challenges to the stability of the tracker.
Traditional target tracking algorithms usually model the target with discriminative and generative models; classical algorithms include KCF, SCM, and the like. Since AlexNet achieved significant performance in image classification in 2012, deep learning has gained wide attention. It has since been widely applied to target detection and target tracking, and target tracking methods based on classification networks and on twin (Siamese) networks have been proposed in succession.
Owing to its simplicity and efficiency, the Siamese network has received wide attention in the field of target tracking. SiamFC, proposed by Bertinetto et al., converts the target tracking problem into a similarity measurement between the target template and the search region, and a series of target tracking algorithms were subsequently derived, such as SiamRPN, SiamRPN++, CFNet, DaSiamRPN, C-RPN, SiamMask, StructSiam, and SA-Siam. SiamRPN combines a region proposal network (RPN) with the Siamese network: after the Siamese network extracts features, the feature maps are fed into classification and regression branches. The target tracking task is thereby converted into a one-shot detection task, avoiding online updating and greatly improving running speed. Later, SiamRPN++ applied ResNet to the Siamese target tracker and integrated multi-layer fusion and a depthwise cross-correlation mechanism, reducing parameters and improving accuracy. To improve the generalization ability of the model, DaSiamRPN generates positive samples through data augmentation and constructs negative samples, greatly enhancing generalization. CFNet differs from SiamFC in that it introduces correlation filtering and integrates it into the Siamese framework, greatly improving the tracking rate of the model. C-RPN, building on SiamRPN, addresses the imbalance between similar samples and between positive and negative samples by cascading multiple RPNs. SiamMask adds a mask branch to SiamRPN so that detection results are more accurate, and StructSiam reduces the sensitivity of the model by designing several local structures, converting the similarity comparison problem into a comparison of local feature blocks. SA-Siam encodes the target with two branches, appearance and semantics, to improve tracking performance.
Attention mechanisms are often regarded as a means of improving tracking performance. Hu et al. introduced a squeeze-and-excitation network (SENet) block into the deep tracking architecture to learn channel-wise attention. SENet improves the representational power of a CNN by explicitly modeling the interdependence between the channels of its convolutional features. Further, Woo et al. proposed an efficient spatial and channel attention learning module to improve the representational power of CNNs. Zhu et al. developed an end-to-end flow correlation tracking framework (FlowTrack) with spatio-temporal attention that leverages the rich flow information in successive frames to improve representation and tracking accuracy. CSR-DCF constructs a spatial attention map to constrain correlation-filter learning by preprocessing the foreground segmentation, but the introduction of segmentation increases the amount of computation, and the segmentation accuracy can seriously affect tracking performance. RASNet reformulates the correlation filter within a Siamese tracking framework and introduces spatial and channel attention mechanisms to achieve high Siamese tracking performance. Other work adopts a spatial attention module guided by pixel-wise correlation and a channel attention module guided by channel-wise correlation to improve the discriminative power of features for Siamese tracking. SiamAttn proposes a new Siamese attention mechanism that combines self-attention and cross-branch attention with deformable operations to enhance the discriminative representation of targets.
Although Siamese-network-based trackers have achieved excellent performance in the field of target tracking, many problems remain, mainly in two aspects: template matching and feature extraction. In feature extraction, when a backbone network extracts target features, the features of the last layer are generally output as the input of subsequent tasks. High-level features contain more semantic information and are more beneficial to the target classification task, but they are less satisfactory in the target localization task, because high-level features lack the spatial information that shallow features carry, and it is this information that localizes the target well. Template matching has three major problems: 1) in a one-shot target tracker, the target appearance information of the first frame is extracted as a template and used for matching throughout the tracking task; since the appearance of the target usually changes constantly in a video sequence, this mechanism is highly susceptible to losing the target. 2) Conventional template update strategies, dominated by linear updates, make the tracker update the template at a constant rate; since the target appearance changes nonlinearly from frame to frame, drift occurs in target tracking. 3) Because the tracking task requires the tracker to run in real time, the template update network cannot be too complex.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a target tracking method and system based on non-local feature fusion and online updating; through comprehensive improvements to the Siamese-network tracker, it achieves an optimal balance between accuracy and real-time performance.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
the target tracking method based on non-local feature fusion and online updating comprises the following steps:
acquiring a video sequence, wherein a first frame of a video is used as a template image, and a current frame is used as an image to be searched;
inputting the template image and the image to be searched into an offline-trained Siamese network, and respectively extracting high-level features and low-level features of the template image and the image to be searched;
fusing the low-level features and the high-level features through a non-local feature pyramid module to extract target template features and features of the target region to be searched;
matching the features of the target region to be searched with the target template features to obtain an initial target tracking result;
and judging whether a threshold condition is met according to the average peak-to-correlation energy and the maximum-response change of the initial target tracking result, and if not, updating the target template features to obtain the target tracking result.
Further, fusing the low-level features and the high-level features through the non-local feature pyramid module comprises:
the last two layers of the Siamese network output the high-level features and the low-level features, and the features output by the non-local feature pyramid module are matrix-spliced with the high-level features of the last layer.
Further, the loss function adopted during offline training of the Siamese network is:
L = log(1 + e^(−λf))
where λ ∈ {−1, +1} denotes the ground-truth label and f denotes the actual matching score of a template pair.
Further, before the target template features are obtained, the bounding box is used to map the target on the template image, scaled down proportionally, to the corresponding position on the feature map.
Further, after the initial target tracking result is output, a similarity function is used to calculate the similarity score of the feature maps of the template image and the image to be searched, specifically:
F(z′, x) = φ(z′) ∗ φ(x) + b, with z′ = Ratio(z)
where F(·) is the predicted confidence score, ∗ denotes cross-correlation, Ratio(·) maps the bounding box, scaled down proportionally, into the feature map, b represents the actual error, φ(·) denotes the convolutional feature embedding of each network, z denotes the feature of the template image, z′ denotes the feature of the region obtained by mapping the bounding box proportionally onto the target, and x denotes the feature of the image to be searched.
Furthermore, during online updating, an attention mechanism is introduced, and the spatio-temporal information of the template image features and the updated features is extracted to enhance the target representation.
Further, the process of online updating includes:
performing convolution operations on the template features of the initial frame with two convolution layers, and flattening the convolved features to obtain Z′0 and Z″0, then performing a correlation-matrix operation on Z′0 and Z″0 to obtain a feature matrix;
applying a sigmoid function and an average pooling operation to the feature matrix, performing a convolution operation on the updated features with a convolution layer, and performing element-wise multiplication of the two results to obtain the final target tracking features.
One or more embodiments provide a target tracking system based on non-local feature fusion and online updating, comprising:
an image acquisition module configured to: acquiring a video sequence, wherein a first frame of a video is used as a template image, and a current frame is used as an image to be searched;
a target tracking module configured to: inputting the template image and the image to be searched into an offline-trained Siamese network, and respectively extracting high-level features and low-level features of the template image and the image to be searched;
fusing the low-level features and the high-level features through a non-local feature pyramid module to extract target template features and features of the target region to be searched;
matching the features of the target region to be searched with the target template features to obtain an initial target tracking result;
and judging whether a threshold condition is met according to the average peak-to-correlation energy and the maximum-response change of the initial target tracking result, and if not, updating the target template features to obtain the target tracking result.
One or more embodiments provide a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any target tracking method based on non-local feature fusion and online updating as described above.
One or more embodiments provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the non-local feature fusion and online update based object tracking methods described above.
The above one or more technical solutions have the following beneficial effects:
(1) The invention fuses low-level and high-level features through a non-local module. High-level features have good semantic properties, suffer little interference, and are more effective for classifying the target; however, the last layer's semantic features are not robust or accurate enough for localizing the target, whereas shallow convolutional features carry more spatial information and can capture the target's detailed position. Fusing the two therefore localizes the target accurately and improves the accuracy and robustness of target tracking.
(2) The method introduces an attention mechanism into the online update model and extracts the spatio-temporal information of the template features and the updated features to enhance the target representation, while also obtaining the context information of the target well. This improves the tracker's ability to adapt to target changes online, so that the tracker remains strongly robust when the target undergoes deformation and occlusion.
(3) The invention does not update the template directly; instead, the extracted first-frame template features are fed into an attention module, so the template always retains the most original information of the target, and even if erroneous information enters the update module, the tracker is not seriously affected.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention without limiting it.
FIG. 1 is a flowchart of a real-time target tracking method for non-local feature fusion and online updating according to an embodiment of the present invention;
FIG. 2 is a non-local feature pyramid module in an embodiment of the invention;
FIG. 3 is an online update module in an embodiment of the invention;
FIG. 4 is a diagram illustrating tracking effects in an embodiment of the present invention;
fig. 5(a)-5(b) are graphs comparing the accuracy and precision of the embodiment of the present invention with other state-of-the-art models on OTB-100.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
As shown in fig. 1, the present embodiment discloses a target tracking method based on non-local feature fusion and online updating, which comprises the following steps:
s1: acquiring a video sequence, wherein a first frame of a video is used as a template image, and a current frame is used as an image to be searched;
s2: inputting the template image and the image to be searched into an offline-trained Siamese network, and respectively extracting high-level features and low-level features of the template image and the image to be searched;
s3: fusing the low-level features and the high-level features through a non-local feature pyramid module to extract target template features and features of the target region to be searched;
s4: matching the features of the target region to be searched with the target template features to obtain an initial target tracking result;
s5: judging whether a threshold condition is met according to the average peak-to-correlation energy (APCE) and the change of the maximum response F_max of the initial target tracking result; if not, updating the target template features, extracting the features of the template image and enhancing the target representation with the spatio-temporal information of the updated features, to obtain the target tracking result.
As one or more embodiments, in order to fully extract the feature relationship between the target and its surrounding environment, after the initial target tracking result is output, a similarity function is used to calculate the similarity score of the feature maps of the template image and the image to be searched:
F(z′, x) = φ(z′) ∗ φ(x) + b, with z′ = Ratio(z)
where F(·) is the predicted confidence score, ∗ denotes cross-correlation, Ratio(·) maps the bounding box, scaled down proportionally, into the feature map, b represents the actual error, φ(·) denotes the convolutional feature embedding of each network, z denotes the feature of the template image, z′ denotes the feature of the region obtained by mapping the bounding box proportionally onto the target, and x denotes the feature of the image to be searched.
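For illustration, the following is a minimal sketch of how such a score can be computed, assuming PyTorch and a SiamFC-style interpretation of the formula, with the template embedding cross-correlated over the search embedding; the function name and tensor shapes are assumptions, not the patent's implementation:

```python
import torch
import torch.nn.functional as F

def similarity_score(z_feat: torch.Tensor, x_feat: torch.Tensor,
                     bias: float = 0.0) -> torch.Tensor:
    """F(z', x) = φ(z') ∗ φ(x) + b.

    z_feat: (C, Hz, Wz) embedded template-region feature φ(z')
    x_feat: (C, Hx, Wx) embedded search-image feature φ(x)
    Returns an (H', W') response map of confidence scores.
    """
    # F.conv2d performs cross-correlation, so the template acts as the kernel
    response = F.conv2d(x_feat.unsqueeze(0), z_feat.unsqueeze(0))
    return response.squeeze() + bias
```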
For one or more embodiments, the loss function employed during offline training of the two backbone networks is the logistic loss:
L = log(1 + e^(−λf))
where λ ∈ {−1, +1} denotes the ground-truth label and f denotes the actual matching score of a template pair.
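A minimal sketch of this loss, assuming PyTorch (softplus(x) = log(1 + e^x) is the numerically stable form of the expression above; the function name is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

def logistic_loss(score: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    # L = log(1 + e^{-λf}), averaged over a batch of template-search pairs;
    # score holds the predicted matching scores f, label holds λ ∈ {-1, +1}
    return F.softplus(-label * score).mean()
```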
The specific tracking process is shown in algorithm table 1.
TABLE 1 (the algorithm listing is reproduced as images in the original publication)
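Since the listing survives only as an image, the loop it describes can be summarized with the following hedged Python sketch of steps S1-S5; net, locate_peak, fmax_apce (sketched further below), and the parameter names are illustrative assumptions rather than the patent's identifiers:

```python
from statistics import mean

def track(frames, net, xi_max=0.8, xi_apce=0.2):
    """S1-S3: build fused template features from the first frame;
    S4: match each subsequent frame; S5: conditionally update."""
    template = net.extract_fused(frames[0])
    hist_fmax, hist_apce = [], []
    for frame in frames[1:]:
        search = net.extract_fused(frame)       # fused search-region features
        response = net.match(template, search)  # template matching
        box = locate_peak(response)             # initial tracking result
        f_max, apce = fmax_apce(response)       # quality measures of the response
        # update only when both measures exceed a proportion of their history
        if hist_fmax and f_max > xi_max * mean(hist_fmax) \
                and apce > xi_apce * mean(hist_apce):
            template = net.update(template, search)
        hist_fmax.append(f_max)
        hist_apce.append(apce)
        yield box
```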
As one or more embodiments, extracting the target template features and the features of the target region to be searched by fusing the low-level features and the high-level features through the non-local feature pyramid module specifically comprises:
inputting the low-level features X_low ∈ R^(C_l×H_l×W_l) and the high-level features X_high ∈ R^(C_h×H_h×W_h) of the template image and of the image to be searched into the non-local pyramid model for the pyramid pooling operation, obtaining the fused target template features and features of the target region to be searched,
where C, H, and W are the number of channels, height, and width of the feature, respectively, and the subscripts l and h denote the bottom level and the top level. As shown in fig. 2, the specific steps of fusing the low-level and high-level features of the template image and the image to be searched are as follows:
(1) The low-level feature X_low is passed through two 1×1 convolutions to obtain two dimension-reduced low-level features Φ′_low and Φ″_low; Φ′_low and Φ″_low are respectively fed into the pyramid pooling layer for the pooling operation, and the output pooling results are concatenated and flattened to obtain η_l1 and η_l2:
Φ′_low = Conv1(X_low), Φ″_low = Conv2(X_low);
η_l1 = Rshape(P(Φ′_low)), η_l2 = Rshape(P(Φ″_low));
where P(·) denotes pyramid pooling followed by concatenation and Rshape(·) denotes flattening.
(2) The high-level feature X_high is passed through a 1×1 convolution to obtain Φ′_high, and Φ′_high is flattened to obtain η_h:
Φ′_high = Conv1(X_high);
η_h = Rshape(Φ′_high);
(3) From η_l1 and η_h, the similarity matrix γ is obtained:
γ = η_h^T · η_l1
where ^T denotes matrix transposition, so that each entry of γ measures the similarity between one position of the high-level feature and one position of the pooled low-level feature.
(4) A normalization operation is applied to the similarity matrix γ to obtain γ̄; the normalized similarity matrix γ̄ is matrix-multiplied with η_l2, and the result is matrix-spliced with the high-level feature X_high to obtain the output fused feature O_out:
O_out = Concat(X_high, Rshape(γ̄ · η_l2))
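The four steps above can be collected into the following PyTorch sketch; the intermediate channel width, the pooling scales, and the use of softmax as the normalization are assumptions made to keep the illustration runnable, not the patent's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalFeaturePyramid(nn.Module):
    def __init__(self, c_low: int, c_high: int, c_mid: int = 256,
                 pool_sizes=(1, 3, 6)):
        super().__init__()
        self.conv1 = nn.Conv2d(c_low, c_mid, 1)    # 1x1 conv -> Φ'_low
        self.conv2 = nn.Conv2d(c_low, c_mid, 1)    # 1x1 conv -> Φ''_low
        self.conv_h = nn.Conv2d(c_high, c_mid, 1)  # 1x1 conv -> Φ'_high
        self.pool_sizes = pool_sizes
        self.c_mid = c_mid

    def pyramid(self, x: torch.Tensor) -> torch.Tensor:
        # step (1): pool at several scales, flatten, and concatenate -> (B, C, S)
        return torch.cat(
            [F.adaptive_avg_pool2d(x, s).flatten(2) for s in self.pool_sizes],
            dim=2)

    def forward(self, x_low: torch.Tensor, x_high: torch.Tensor) -> torch.Tensor:
        eta_l1 = self.pyramid(self.conv1(x_low))             # (B, C, S)
        eta_l2 = self.pyramid(self.conv2(x_low))             # (B, C, S)
        b, _, h, w = x_high.shape
        eta_h = self.conv_h(x_high).flatten(2)               # step (2): (B, C, HW)
        gamma = torch.matmul(eta_h.transpose(1, 2), eta_l1)  # step (3): (B, HW, S)
        gamma = F.softmax(gamma, dim=-1)                     # step (4): normalize
        fused = torch.matmul(gamma, eta_l2.transpose(1, 2))  # (B, HW, C)
        fused = fused.transpose(1, 2).reshape(b, self.c_mid, h, w)
        return torch.cat([x_high, fused], dim=1)             # matrix splicing O_out
```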
This design has the following advantage: when a convolutional neural network extracts target features, usually only the last layer's features are used by the classifier, which is effective for some high-level vision problems, since high-level features have good semantic properties, suffer little interference, and are effective for classifying the target. However, the last layer's semantic features are not robust or accurate enough for localizing the target, while shallow convolutional features carry more spatial information and can capture the target's detailed position, allowing accurate localization. Therefore, to improve the accuracy and robustness of target tracking, the low-level and high-level features are fused through the non-local module.
By designing the non-local pyramid module, the low-level and high-level features are fused to obtain more accurate target information, laying a solid foundation for subsequent target localization and making the tracker more robust.
In one or more embodiments, whether the model needs to be updated is judged by comparing the changes of the average peak-to-correlation energy APCE and the maximum response F_max against a threshold condition.
When APCE suddenly decreases, no model update is performed; the model is updated online only when APCE and F_max both exceed their historical averages by the set proportions.
The maximum response value F_max is calculated as: F_max = max(response)
The APCE value is calculated as:
APCE = |F_max − F_min|² / mean( Σ_{w,h} (F_{w,h} − F_min)² )
where F_max, F_min, and F_{w,h} denote, respectively, the maximum response value, the minimum response value, and the response value of the element in row w and column h of the final response map.
F_max > ξ_max · mean(F_max)
APCE > ξ_apce · mean(APCE)
where mean(F_max) and mean(APCE) denote the historical averages of the F_max and APCE values over the previous frames, and ξ_max and ξ_apce denote two thresholds.
When this threshold condition is met, the template is updated; preferably, ξ_max and ξ_apce are set to 0.8 and 0.2, respectively.
Since the appearance of the target in a video sequence usually changes, the target is easily lost if the tracking template cannot be updated in time. We therefore judge the changes of the average peak-to-correlation energy APCE and the maximum response value F_max: when both values drop suddenly, the target has most likely been occluded or lost, and updating the template would contaminate it.
When APCE suddenly decreases, the target has generally been occluded or lost, and the model is not updated, which avoids model drift. This update scheme effectively distinguishes the different influences that target appearance change, occlusion, and target loss exert on tracking, and improves the robustness of the algorithm. The model is updated only when APCE and F_max both exceed their respective historical averages by the proportions ξ_max and ξ_apce, which greatly reduces model drift on the one hand and reduces the number of model updates on the other, yielding an acceleration effect.
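A minimal NumPy sketch of the APCE computation and this update decision (the function names are illustrative assumptions; the 0.8 and 0.2 defaults follow the preferred thresholds stated above):

```python
import numpy as np

def fmax_apce(response: np.ndarray):
    """APCE = |F_max - F_min|^2 / mean((F_wh - F_min)^2) over the response map."""
    f_max, f_min = response.max(), response.min()
    apce = (f_max - f_min) ** 2 / np.mean((response - f_min) ** 2)
    return f_max, apce

def should_update(f_max, apce, hist_fmax, hist_apce,
                  xi_max=0.8, xi_apce=0.2) -> bool:
    # update the template only when both measures exceed the given proportion
    # of their historical averages; a sudden drop therefore blocks the update
    return (f_max > xi_max * np.mean(hist_fmax) and
            apce > xi_apce * np.mean(hist_apce))
```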
As one or more embodiments, during online updating an attention mechanism is introduced, and the spatio-temporal information of the template features and the updated features is extracted.
Let the template feature of the initial frame be Z_0 and the subsequently updated feature be X_n.
Preferably, the update factor is set to 5, i.e., an update is made every 5 frames.
First, two convolution layers perform convolution operations on the template feature Z_0 of the initial frame, and the convolved features are flattened to obtain Z′_0 and Z″_0:
Z′_0 = Conv1(Rshape(Z_0)), Z″_0 = Conv2(Rshape(Z_0))
Then a correlation-matrix operation is performed on Z′_0 and Z″_0 to obtain the feature matrix Γ ∈ R^(HW×HW), and Γ is reshaped into Γ ∈ R^(HW×H×W) to capture the relationship between each pair of sub-regions:
Γ = Rshape(Z′_0^T · Z″_0)
A sigmoid function and an average pooling operation are applied to the feature matrix Γ to obtain Γ′ ∈ R^(1×H×W); simultaneously, a convolution layer performs a convolution operation on X_n, and element-wise multiplication of the two results yields the final feature V:
V = Γ′ ⊙ Conv(X_n)
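The update module can be sketched in PyTorch as follows; the 1×1 convolutions on the flattened template and averaging over the HW dimension as the pooling step are assumptions chosen to match the shapes stated above (Γ ∈ R^(HW×HW) → Γ′ ∈ R^(1×H×W)):

```python
import torch
import torch.nn as nn

class AttentionUpdate(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, 1)   # -> Z'_0
        self.conv2 = nn.Conv1d(channels, channels, 1)   # -> Z''_0
        self.conv_x = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, z0: torch.Tensor, xn: torch.Tensor) -> torch.Tensor:
        """z0: (B, C, H, W) initial-frame template feature Z_0;
        xn: (B, C, H, W) updated feature X_n from the current frame."""
        b, c, h, w = z0.shape
        z_flat = z0.flatten(2)                                # Rshape: (B, C, HW)
        z1, z2 = self.conv1(z_flat), self.conv2(z_flat)
        gamma = torch.matmul(z1.transpose(1, 2), z2)          # Γ: (B, HW, HW)
        gamma = gamma.view(b, h * w, h, w)                    # Γ: (B, HW, H, W)
        gamma = torch.sigmoid(gamma).mean(1, keepdim=True)    # Γ': (B, 1, H, W)
        return gamma * self.conv_x(xn)                        # V = Γ' ⊙ Conv(X_n)
```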
The advantage is that if only the tracker's template were updated, the demands on template quality would be extremely high: once the template contains errors, they can only accumulate, and target tracking eventually fails. Since visual tracking is a dynamic process with changing scenes, the target object has a strong spatio-temporal relationship between successive frames, and the target appearance can be modeled using features from different frames and positions. Introducing an attention mechanism into the online update model extracts the spatio-temporal information of the template features and the update features to enhance the target representation while also obtaining the target's context information well, improving the tracker's ability to adapt to target changes online and keeping it strongly robust when the target undergoes deformation and occlusion.
Experimental results and comparisons are shown in fig. 4 and fig. 5: fig. 4 shows the tracking effect of the present invention, and fig. 5(a)-5(b) compare the accuracy and precision with other state-of-the-art models on OTB-100.
In this embodiment, the template is not updated directly when an update is due; instead, the template and the extracted first-frame template features are fed into an Attention module, so that the template always retains the most original information of the target, and even if erroneous information enters the update module, the tracker is not seriously affected.
Example two
The embodiment provides a target tracking system based on non-local feature fusion and online update, which comprises:
an image acquisition module configured to: acquiring a video sequence, wherein a first frame of a video is used as a template image, and a current frame is used as an image to be searched;
a target tracking module configured to: inputting the template image and the image to be searched into an offline-trained Siamese network, and respectively extracting high-level features and low-level features of the template image and the image to be searched;
fusing the low-level features and the high-level features through a non-local feature pyramid module to extract target template features and features of the target region to be searched;
matching the features of the target region to be searched with the target template features to obtain an initial target tracking result;
and judging whether a threshold condition is met according to the average peak-to-correlation energy and the maximum-response change of the initial target tracking result, and if not, updating the target template features to obtain the target tracking result.
EXAMPLE III
The embodiment of the present specification provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the program to implement the steps of the non-local feature fusion and online update based target tracking method in the first embodiment.
Example four
The embodiment of the present specification provides a computer-readable storage medium, on which a computer program is stored, where the computer program is used to implement, when executed by a processor, the steps of the target tracking method based on non-local feature fusion and online update in the first embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The target tracking method based on non-local feature fusion and online updating is characterized by comprising the following steps of:
acquiring a video sequence, wherein a first frame of a video is used as a template image, and a current frame is used as an image to be searched;
inputting the template image and the image to be searched into an offline-trained Siamese network, and respectively extracting high-level features and low-level features of the template image and the image to be searched;
fusing the low-level features and the high-level features through a non-local feature pyramid module to extract target template features and features of the target region to be searched;
matching the features of the target region to be searched with the target template features to obtain an initial target tracking result;
and judging whether a threshold condition is met according to the average peak-to-correlation energy and the maximum-response change of the initial target tracking result, and if not, updating the target template features to obtain the target tracking result.
2. The non-local feature fusion and online update based target tracking method of claim 1, wherein fusing the low-level features and the high-level features through the non-local feature pyramid module comprises:
the last two layers of the Siamese network output the high-level features and the low-level features, and the features output by the non-local feature pyramid module are matrix-spliced with the high-level features of the last layer.
3. The target tracking method based on non-local feature fusion and online update as claimed in claim 1, wherein the loss function adopted during offline training of the Siamese network is:
L = log(1 + e^(−λf))
where λ ∈ {−1, +1} denotes the ground-truth label and f denotes the actual matching score of a template pair.
4. The non-local feature fusion and online update based target tracking method according to claim 1, wherein, before the target template features are obtained, a bounding box is used to map the target on the template image, scaled down proportionally, to the corresponding position on the feature map.
5. The target tracking method based on non-local feature fusion and online update as claimed in claim 2, wherein after the initial target tracking result is output, a similarity function is used to calculate a similarity score of the feature maps of the template image and the image to be searched, specifically:
F(z′, x) = φ(z′) ∗ φ(x) + b, with z′ = Ratio(z)
where F(·) is the predicted confidence score, ∗ denotes cross-correlation, Ratio(·) maps the bounding box, scaled down proportionally, into the feature map, b represents the actual error, φ(·) denotes the convolutional feature embedding of each network, z denotes the feature of the template image, z′ denotes the feature of the region obtained by mapping the bounding box proportionally onto the target, and x denotes the feature of the image to be searched.
6. The non-local feature fusion and online update based target tracking method according to claim 1, wherein an attention mechanism is introduced during online updating to extract the features of the template image and enhance the target representation with the spatio-temporal information of the updated features.
7. The non-local feature fusion and online update based target tracking method according to claim 1, wherein the online update process comprises:
performing convolution operations on the template features of the initial frame with two convolution layers, and flattening the convolved features to obtain Z′0 and Z″0, then performing a correlation-matrix operation on Z′0 and Z″0 to obtain a feature matrix;
applying a sigmoid function and an average pooling operation to the feature matrix, performing a convolution operation on the updated features with a convolution layer, and performing element-wise multiplication of the two results to obtain the final target tracking features.
8. A target tracking system based on non-local feature fusion and online updating, characterized by comprising:
an image acquisition module configured to: acquiring a video sequence, wherein a first frame of a video is used as a template image, and a current frame is used as an image to be searched;
a target tracking module configured to: inputting the template image and the image to be searched into an offline-trained Siamese network, and respectively extracting high-level features and low-level features of the template image and the image to be searched;
fusing the low-level features and the high-level features through a non-local feature pyramid module to extract target template features and features of the target region to be searched;
matching the features of the target region to be searched with the target template features to obtain an initial target tracking result;
and judging whether a threshold condition is met according to the average peak-to-correlation energy and the maximum-response change of the initial target tracking result, and if not, updating the target template features to obtain the target tracking result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the non-local feature fusion and online update based object tracking method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the non-local feature fusion and online update based object tracking method of any one of claims 1 to 7.
CN202111255280.7A 2021-10-27 2021-10-27 Target tracking method and system based on non-local feature fusion and online updating Pending CN113963026A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111255280.7A 2021-10-27 2021-10-27 Target tracking method and system based on non-local feature fusion and online updating

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111255280.7A 2021-10-27 2021-10-27 Target tracking method and system based on non-local feature fusion and online updating

Publications (1)

Publication Number Publication Date
CN113963026A true CN113963026A (en) 2022-01-21

Family

ID=79467497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111255280.7A Pending CN113963026A (en) 2021-10-27 2021-10-27 Target tracking method and system based on non-local feature fusion and online updating

Country Status (1)

Country Link
CN (1) CN113963026A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114757972A (en) * 2022-04-15 2022-07-15 中国电信股份有限公司 Target tracking method and device, electronic equipment and computer readable storage medium
CN114757972B (en) * 2022-04-15 2023-10-10 中国电信股份有限公司 Target tracking method, device, electronic equipment and computer readable storage medium
CN115661207A (en) * 2022-11-14 2023-01-31 南昌工程学院 Target tracking method and system based on space consistency matching and weight learning
CN117333514A (en) * 2023-12-01 2024-01-02 科大讯飞股份有限公司 Single-target video tracking method, device, storage medium and equipment
CN117333514B (en) * 2023-12-01 2024-04-16 科大讯飞股份有限公司 Single-target video tracking method, device, storage medium and equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination