CN113963026A - Target tracking method and system based on non-local feature fusion and online updating

Info

Publication number
CN113963026A
CN113963026A
Authority
CN
China
Prior art keywords
target
template
features
image
searched
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111255280.7A
Other languages
Chinese (zh)
Inventor
李爱民
刘腾
刘笑含
周福珍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology
Priority to CN202111255280.7A
Publication of CN113963026A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Abstract

The invention discloses a target tracking method and system based on non-local feature fusion and online updating. The method comprises the following steps: acquiring a video sequence, wherein the first frame of the video serves as the template image and the current frame serves as the image to be searched; inputting the template image and the image to be searched into an offline-trained Siamese network, and respectively extracting high-level features and low-level features of each; fusing the low-level features and the high-level features through a non-local feature pyramid module to extract target template features and features of the target region to be searched; matching the features of the target region to be searched with the target template features to obtain an initial target tracking result; and judging whether a threshold condition is met according to the average peak-to-correlation energy and the maximum-response change of the initial target tracking result, and if not, updating the target template features to obtain the final target tracking result.

Description

Target tracking method and system based on non-local feature fusion and online updating
Technical Field
The invention belongs to the technical field of computer vision and image processing, and particularly relates to a target tracking method and system based on non-local feature fusion and online updating.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
The basic process of target tracking is to acquire the position of a target in the first frame of a video and continuously locate the target in subsequent frames. During tracking, the tracker may encounter various problems, such as deformation, occlusion, and disappearance of the target; external changes such as illumination variation and background clutter also strongly affect the tracked target, posing great challenges to the stability of the tracker.
Traditional target tracking algorithms usually model the target with discriminative and generative models; classical algorithms include KCF, SCM, and the like. Since AlexNet achieved significant performance in image classification in 2012, deep learning has gained wide attention. It has since been widely applied to target detection and target tracking, and target tracking methods based on classification networks and on twin (Siamese) networks have been proposed in succession.
Owing to its simplicity and efficiency, the Siamese network has received wide attention in the field of target tracking. SiamFC, proposed by Bertinetto et al., converts the target tracking problem into a similarity measurement between the target template and the search region, and a series of target tracking algorithms were subsequently derived, such as SiamRPN, SiamRPN++, CFNet, DaSiamRPN, C-RPN, SiamMask, StructSiam, and SA-Siam. SiamRPN combines a region proposal network (RPN) with the Siamese network: after the Siamese network extracts features, the feature maps are fed into classification and regression branches. The target tracking task is thereby converted into a one-shot detection task, avoiding online updating and greatly improving running speed. Later, SiamRPN++ applied ResNet to the Siamese target tracker and integrated multi-layer fusion and a depthwise cross-correlation mechanism, reducing parameters and improving accuracy. To improve the generalization ability of the model, DaSiamRPN generates positive samples through data augmentation and constructs negative samples, greatly enhancing generalization. CFNet differs from SiamFC in that it introduces correlation filtering and integrates it into the Siamese framework, greatly improving the tracking rate of the model. C-RPN, building on SiamRPN, addresses the imbalance between similar samples and between positive and negative samples by cascading multiple RPNs. SiamMask adds a mask branch to SiamRPN so that detection results are more accurate, and StructSiam reduces the sensitivity of the model by designing several local structures, converting the similarity comparison problem into a comparison of local feature blocks. SA-Siam encodes the target with two branches, appearance and semantics, to improve tracking performance.
Attention mechanisms are often regarded as a means of improving tracking performance. Hu et al. introduced a squeeze-and-excitation network (SENet) block into the deep tracking architecture to learn channel-wise attention. SENet improves the representational power of a CNN by explicitly modeling the interdependence between the channels of its convolutional features. Further, Woo et al. proposed an efficient spatial and channel attention learning module to improve the representational power of CNNs. Zhu et al. developed an end-to-end flow correlation tracking framework (FlowTrack) with spatio-temporal attention that leverages the rich flow information in successive frames to improve representation and tracking accuracy. CSR-DCF constructs a spatial attention map to constrain correlation-filter learning by preprocessing the foreground segmentation, but the introduction of segmentation increases the amount of computation, and the segmentation accuracy can seriously affect tracking performance. RASNet reformulates the correlation filter within a Siamese tracking framework and introduces spatial and channel attention mechanisms to achieve high Siamese tracking performance. Other work adopts a spatial attention module guided by pixel-wise correlation and a channel attention module guided by channel-wise correlation to improve the discriminative power of features for Siamese tracking. SiamAttn proposes a new Siamese attention mechanism that combines self-attention and cross-branch attention with deformable operations to enhance the discriminative representation of targets.
Although Siamese-network-based trackers have achieved excellent performance in the field of target tracking, many problems remain, mainly in two aspects: template matching and feature extraction. In feature extraction, when a backbone network extracts target features, the features of the last layer are generally output as the input of subsequent tasks. High-level features contain more semantic information and are more beneficial to the target classification task, but they are less satisfactory in the target localization task, because high-level features lack the spatial information that shallow features carry, and it is this information that localizes the target well. Template matching has three major problems: 1) in a one-shot target tracker, the target appearance information of the first frame is extracted as a template and used for matching throughout the tracking task; since the appearance of the target usually changes constantly in a video sequence, this mechanism is highly susceptible to losing the target. 2) Conventional template update strategies, dominated by linear updates, make the tracker update the template at a constant rate; since the target appearance changes nonlinearly from frame to frame, drift occurs in target tracking. 3) Because the tracking task requires the tracker to run in real time, the template update network cannot be too complex.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a target tracking method and system based on non-local feature fusion and online updating; through comprehensive improvements to the Siamese-network tracker, it achieves an optimal balance between accuracy and real-time performance.
In order to achieve the above object, one or more embodiments of the present invention provide the following technical solutions:
the target tracking method based on non-local feature fusion and online updating comprises the following steps:
acquiring a video sequence, wherein a first frame of a video is used as a template image, and a current frame is used as an image to be searched;
inputting the template image and the image to be searched into an offline-trained Siamese network, and respectively extracting high-level features and low-level features of the template image and the image to be searched;
fusing the low-level features and the high-level features through a non-local feature pyramid module to extract target template features and features of the target region to be searched;
matching the features of the target region to be searched with the target template features to obtain an initial target tracking result;
and judging whether a threshold condition is met according to the average peak-to-correlation energy and the maximum-response change of the initial target tracking result, and if not, updating the target template features to obtain the target tracking result.
Further, fusing the low-level features and the high-level features through the non-local feature pyramid module comprises:
the last two layers of the Siamese network output the high-level features and the low-level features, and the features output by the non-local feature pyramid module are matrix-spliced with the high-level features of the last layer.
Further, the loss function adopted during offline training of the Siamese network is:
L = log(1 + e^(−λf))
where λ ∈ {−1, +1} denotes the ground-truth label and f denotes the actual matching score of a template pair.
Further, before the target template features are obtained, the bounding box is used to map the target on the template image, scaled down proportionally, to the corresponding position on the feature map.
Further, after the initial target tracking result is output, a similarity function is used to calculate the similarity score of the feature maps of the template image and the image to be searched, specifically:
F(z′, x) = φ(z′) ∗ φ(x) + b, with z′ = Ratio(z)
where F(·) is the predicted confidence score, ∗ denotes cross-correlation, Ratio(·) maps the bounding box, scaled down proportionally, into the feature map, b represents the actual error, φ(·) denotes the convolutional feature embedding of each network, z denotes the feature of the template image, z′ denotes the feature of the region obtained by mapping the bounding box proportionally onto the target, and x denotes the feature of the image to be searched.
Furthermore, during online updating, an attention mechanism is introduced, and the spatio-temporal information of the template image features and the updated features is extracted to enhance the target representation.
Further, the process of online updating includes:
performing convolution operations on the template features of the initial frame with two convolution layers, and flattening the convolved features to obtain Z′0 and Z″0, then performing a correlation-matrix operation on Z′0 and Z″0 to obtain a feature matrix;
applying a sigmoid function and an average pooling operation to the feature matrix, performing a convolution operation on the updated features with a convolution layer, and performing element-wise multiplication of the two results to obtain the final target tracking features.
One or more embodiments provide a target tracking system based on non-local feature fusion and online updating, comprising:
an image acquisition module configured to: acquiring a video sequence, wherein a first frame of a video is used as a template image, and a current frame is used as an image to be searched;
a target tracking module configured to: inputting the template image and the image to be searched into an offline-trained Siamese network, and respectively extracting high-level features and low-level features of the template image and the image to be searched;
fusing the low-level features and the high-level features through a non-local feature pyramid module to extract target template features and features of the target region to be searched;
matching the features of the target region to be searched with the target template features to obtain an initial target tracking result;
and judging whether a threshold condition is met according to the average peak-to-correlation energy and the maximum-response change of the initial target tracking result, and if not, updating the target template features to obtain the target tracking result.
One or more embodiments provide a computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of any target tracking method based on non-local feature fusion and online updating as described above.
One or more embodiments provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the non-local feature fusion and online update based object tracking methods described above.
The above one or more technical solutions have the following beneficial effects:
(1) The invention fuses low-level and high-level features through a non-local module. High-level features have good semantic properties, suffer little interference, and are more effective for classifying the target; however, the last layer's semantic features are not robust or accurate enough for localizing the target, whereas shallow convolutional features carry more spatial information and can capture the target's detailed position. Fusing the two therefore localizes the target accurately and improves the accuracy and robustness of target tracking.
(2) The method introduces an attention mechanism into the online update model and extracts the spatio-temporal information of the template features and the updated features to enhance the target representation, while also obtaining the context information of the target well. This improves the tracker's ability to adapt to target changes online, so that the tracker remains strongly robust when the target undergoes deformation and occlusion.
(3) The invention does not update the template directly; instead, the extracted first-frame template features are fed into an attention module, so the template always retains the most original information of the target, and even if erroneous information enters the update module, the tracker is not seriously affected.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain the invention without limiting it.
FIG. 1 is a flowchart of a real-time target tracking method for non-local feature fusion and online updating according to an embodiment of the present invention;
FIG. 2 is a non-local feature pyramid module in an embodiment of the invention;
FIG. 3 is an online update module in an embodiment of the invention;
FIG. 4 is a diagram illustrating tracking effects in an embodiment of the present invention;
fig. 5(a)-5(b) are graphs comparing the accuracy and precision of the embodiment of the present invention with other state-of-the-art models on OTB-100.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Example one
As shown in fig. 1, the present embodiment discloses a target tracking method based on non-local feature fusion and online updating, which comprises the following steps:
s1: acquiring a video sequence, wherein a first frame of a video is used as a template image, and a current frame is used as an image to be searched;
s2: inputting the template image and the image to be searched into an offline-trained Siamese network, and respectively extracting high-level features and low-level features of the template image and the image to be searched;
s3: fusing the low-level features and the high-level features through a non-local feature pyramid module to extract target template features and features of the target region to be searched;
s4: matching the features of the target region to be searched with the target template features to obtain an initial target tracking result;
s5: judging whether a threshold condition is met according to the average peak-to-correlation energy (APCE) and the change of the maximum response F_max of the initial target tracking result; if not, updating the target template features, extracting the features of the template image and enhancing the target representation with the spatio-temporal information of the updated features, to obtain the target tracking result.
As one or more embodiments, in order to fully extract the feature relationship between the target and its surrounding environment, after the initial target tracking result is output, a similarity function is used to calculate the similarity score of the feature maps of the template image and the image to be searched:
F(z′, x) = φ(z′) ∗ φ(x) + b, with z′ = Ratio(z)
where F(·) is the predicted confidence score, ∗ denotes cross-correlation, Ratio(·) maps the bounding box, scaled down proportionally, into the feature map, b represents the actual error, φ(·) denotes the convolutional feature embedding of each network, z denotes the feature of the template image, z′ denotes the feature of the region obtained by mapping the bounding box proportionally onto the target, and x denotes the feature of the image to be searched.
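For illustration, the following is a minimal sketch of how such a score can be computed, assuming PyTorch and a SiamFC-style interpretation of the formula, with the template embedding cross-correlated over the search embedding; the function name and tensor shapes are assumptions, not the patent's implementation:

```python
import torch
import torch.nn.functional as F

def similarity_score(z_feat: torch.Tensor, x_feat: torch.Tensor,
                     bias: float = 0.0) -> torch.Tensor:
    """F(z', x) = φ(z') ∗ φ(x) + b.

    z_feat: (C, Hz, Wz) embedded template-region feature φ(z')
    x_feat: (C, Hx, Wx) embedded search-image feature φ(x)
    Returns an (H', W') response map of confidence scores.
    """
    # F.conv2d performs cross-correlation, so the template acts as the kernel
    response = F.conv2d(x_feat.unsqueeze(0), z_feat.unsqueeze(0))
    return response.squeeze() + bias
```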
For one or more embodiments, the loss function employed during offline training of the two backbone networks is the logistic loss:
L = log(1 + e^(−λf))
where λ ∈ {−1, +1} denotes the ground-truth label and f denotes the actual matching score of a template pair.
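A minimal sketch of this loss, assuming PyTorch (softplus(x) = log(1 + e^x) is the numerically stable form of the expression above; the function name is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

def logistic_loss(score: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    # L = log(1 + e^{-λf}), averaged over a batch of template-search pairs;
    # score holds the predicted matching scores f, label holds λ ∈ {-1, +1}
    return F.softplus(-label * score).mean()
```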
The specific tracking process is shown in algorithm table 1.
TABLE 1 (the algorithm listing is reproduced as images in the original publication)
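Since the listing survives only as an image, the loop it describes can be summarized with the following hedged Python sketch of steps S1-S5; net, locate_peak, fmax_apce (sketched further below), and the parameter names are illustrative assumptions rather than the patent's identifiers:

```python
from statistics import mean

def track(frames, net, xi_max=0.8, xi_apce=0.2):
    """S1-S3: build fused template features from the first frame;
    S4: match each subsequent frame; S5: conditionally update."""
    template = net.extract_fused(frames[0])
    hist_fmax, hist_apce = [], []
    for frame in frames[1:]:
        search = net.extract_fused(frame)       # fused search-region features
        response = net.match(template, search)  # template matching
        box = locate_peak(response)             # initial tracking result
        f_max, apce = fmax_apce(response)       # quality measures of the response
        # update only when both measures exceed a proportion of their history
        if hist_fmax and f_max > xi_max * mean(hist_fmax) \
                and apce > xi_apce * mean(hist_apce):
            template = net.update(template, search)
        hist_fmax.append(f_max)
        hist_apce.append(apce)
        yield box
```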
As one or more embodiments, extracting the target template features and the features of the target region to be searched by fusing the low-level features and the high-level features through the non-local feature pyramid module specifically comprises:
inputting the low-level features X_low ∈ R^(C_l×H_l×W_l) and the high-level features X_high ∈ R^(C_h×H_h×W_h) of the template image and of the image to be searched into the non-local pyramid model for the pyramid pooling operation, obtaining the fused target template features and features of the target region to be searched,
where C, H, and W are the number of channels, height, and width of the feature, respectively, and the subscripts l and h denote the bottom level and the top level. As shown in fig. 2, the specific steps of fusing the low-level and high-level features of the template image and the image to be searched are as follows:
(1) The low-level feature X_low is passed through two 1×1 convolutions to obtain two dimension-reduced low-level features Φ′_low and Φ″_low; Φ′_low and Φ″_low are respectively fed into the pyramid pooling layer for the pooling operation, and the output pooling results are concatenated and flattened to obtain η_l1 and η_l2:
Φ′_low = Conv1(X_low), Φ″_low = Conv2(X_low);
η_l1 = Rshape(P(Φ′_low)), η_l2 = Rshape(P(Φ″_low));
where P(·) denotes pyramid pooling followed by concatenation and Rshape(·) denotes flattening.
(2) The high-level feature X_high is passed through a 1×1 convolution to obtain Φ′_high, and Φ′_high is flattened to obtain η_h:
Φ′_high = Conv1(X_high);
η_h = Rshape(Φ′_high);
(3) From η_l1 and η_h, the similarity matrix γ is obtained:
γ = η_h^T · η_l1
where ^T denotes matrix transposition, so that each entry of γ measures the similarity between one position of the high-level feature and one position of the pooled low-level feature.
(4) A normalization operation is applied to the similarity matrix γ to obtain γ̄; the normalized similarity matrix γ̄ is matrix-multiplied with η_l2, and the result is matrix-spliced with the high-level feature X_high to obtain the output fused feature O_out:
O_out = Concat(X_high, Rshape(γ̄ · η_l2))
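The four steps above can be collected into the following PyTorch sketch; the intermediate channel width, the pooling scales, and the use of softmax as the normalization are assumptions made to keep the illustration runnable, not the patent's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalFeaturePyramid(nn.Module):
    def __init__(self, c_low: int, c_high: int, c_mid: int = 256,
                 pool_sizes=(1, 3, 6)):
        super().__init__()
        self.conv1 = nn.Conv2d(c_low, c_mid, 1)    # 1x1 conv -> Φ'_low
        self.conv2 = nn.Conv2d(c_low, c_mid, 1)    # 1x1 conv -> Φ''_low
        self.conv_h = nn.Conv2d(c_high, c_mid, 1)  # 1x1 conv -> Φ'_high
        self.pool_sizes = pool_sizes
        self.c_mid = c_mid

    def pyramid(self, x: torch.Tensor) -> torch.Tensor:
        # step (1): pool at several scales, flatten, and concatenate -> (B, C, S)
        return torch.cat(
            [F.adaptive_avg_pool2d(x, s).flatten(2) for s in self.pool_sizes],
            dim=2)

    def forward(self, x_low: torch.Tensor, x_high: torch.Tensor) -> torch.Tensor:
        eta_l1 = self.pyramid(self.conv1(x_low))             # (B, C, S)
        eta_l2 = self.pyramid(self.conv2(x_low))             # (B, C, S)
        b, _, h, w = x_high.shape
        eta_h = self.conv_h(x_high).flatten(2)               # step (2): (B, C, HW)
        gamma = torch.matmul(eta_h.transpose(1, 2), eta_l1)  # step (3): (B, HW, S)
        gamma = F.softmax(gamma, dim=-1)                     # step (4): normalize
        fused = torch.matmul(gamma, eta_l2.transpose(1, 2))  # (B, HW, C)
        fused = fused.transpose(1, 2).reshape(b, self.c_mid, h, w)
        return torch.cat([x_high, fused], dim=1)             # matrix splicing O_out
```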
This design has the following advantage: when a convolutional neural network extracts target features, usually only the last layer's features are used by the classifier, which is effective for some high-level vision problems, since high-level features have good semantic properties, suffer little interference, and are effective for classifying the target. However, the last layer's semantic features are not robust or accurate enough for localizing the target, while shallow convolutional features carry more spatial information and can capture the target's detailed position, allowing accurate localization. Therefore, to improve the accuracy and robustness of target tracking, the low-level and high-level features are fused through the non-local module.
By designing the non-local pyramid module, the low-level and high-level features are fused to obtain more accurate target information, laying a solid foundation for subsequent target localization and making the tracker more robust.
In one or more embodiments, whether the model needs to be updated is judged by comparing the changes of the average peak-to-correlation energy APCE and the maximum response F_max against a threshold condition.
When APCE suddenly decreases, no model update is performed; the model is updated online only when APCE and F_max both exceed their historical averages by the set proportions.
The maximum response value F_max is calculated as: F_max = max(response)
The APCE value is calculated as:
APCE = |F_max − F_min|² / mean( Σ_{w,h} (F_{w,h} − F_min)² )
where F_max, F_min, and F_{w,h} denote, respectively, the maximum response value, the minimum response value, and the response value of the element in row w and column h of the final response map.
F_max > ξ_max · mean(F_max)
APCE > ξ_apce · mean(APCE)
where mean(F_max) and mean(APCE) denote the historical averages of the F_max and APCE values over the previous frames, and ξ_max and ξ_apce denote two thresholds.
When this threshold condition is met, the template is updated; preferably, ξ_max and ξ_apce are set to 0.8 and 0.2, respectively.
Since the appearance of the target in a video sequence usually changes, the target is easily lost if the tracking template cannot be updated in time. We therefore judge the changes of the average peak-to-correlation energy APCE and the maximum response value F_max: when both values drop suddenly, the target has most likely been occluded or lost, and updating the template would contaminate it.
When APCE suddenly decreases, the target has generally been occluded or lost, and the model is not updated, which avoids model drift. This update scheme effectively distinguishes the different influences that target appearance change, occlusion, and target loss exert on tracking, and improves the robustness of the algorithm. The model is updated only when APCE and F_max both exceed their respective historical averages by the proportions ξ_max and ξ_apce, which greatly reduces model drift on the one hand and reduces the number of model updates on the other, yielding an acceleration effect.
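A minimal NumPy sketch of the APCE computation and this update decision (the function names are illustrative assumptions; the 0.8 and 0.2 defaults follow the preferred thresholds stated above):

```python
import numpy as np

def fmax_apce(response: np.ndarray):
    """APCE = |F_max - F_min|^2 / mean((F_wh - F_min)^2) over the response map."""
    f_max, f_min = response.max(), response.min()
    apce = (f_max - f_min) ** 2 / np.mean((response - f_min) ** 2)
    return f_max, apce

def should_update(f_max, apce, hist_fmax, hist_apce,
                  xi_max=0.8, xi_apce=0.2) -> bool:
    # update the template only when both measures exceed the given proportion
    # of their historical averages; a sudden drop therefore blocks the update
    return (f_max > xi_max * np.mean(hist_fmax) and
            apce > xi_apce * np.mean(hist_apce))
```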
As one or more embodiments, during online updating an attention mechanism is introduced, and the spatio-temporal information of the template features and the updated features is extracted.
Let the template feature of the initial frame be Z_0 and the subsequently updated feature be X_n.
Preferably, the update factor is set to 5, i.e., an update is made every 5 frames.
First, two convolution layers perform convolution operations on the template feature Z_0 of the initial frame, and the convolved features are flattened to obtain Z′_0 and Z″_0:
Z′_0 = Conv1(Rshape(Z_0)), Z″_0 = Conv2(Rshape(Z_0))
Then a correlation-matrix operation is performed on Z′_0 and Z″_0 to obtain the feature matrix Γ ∈ R^(HW×HW), and Γ is reshaped into Γ ∈ R^(HW×H×W) to capture the relationship between each pair of sub-regions:
Γ = Rshape(Z′_0^T · Z″_0)
A sigmoid function and an average pooling operation are applied to the feature matrix Γ to obtain Γ′ ∈ R^(1×H×W); simultaneously, a convolution layer performs a convolution operation on X_n, and element-wise multiplication of the two results yields the final feature V:
V = Γ′ ⊙ Conv(X_n)
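The update module can be sketched in PyTorch as follows; the 1×1 convolutions on the flattened template and averaging over the HW dimension as the pooling step are assumptions chosen to match the shapes stated above (Γ ∈ R^(HW×HW) → Γ′ ∈ R^(1×H×W)):

```python
import torch
import torch.nn as nn

class AttentionUpdate(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, 1)   # -> Z'_0
        self.conv2 = nn.Conv1d(channels, channels, 1)   # -> Z''_0
        self.conv_x = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, z0: torch.Tensor, xn: torch.Tensor) -> torch.Tensor:
        """z0: (B, C, H, W) initial-frame template feature Z_0;
        xn: (B, C, H, W) updated feature X_n from the current frame."""
        b, c, h, w = z0.shape
        z_flat = z0.flatten(2)                                # Rshape: (B, C, HW)
        z1, z2 = self.conv1(z_flat), self.conv2(z_flat)
        gamma = torch.matmul(z1.transpose(1, 2), z2)          # Γ: (B, HW, HW)
        gamma = gamma.view(b, h * w, h, w)                    # Γ: (B, HW, H, W)
        gamma = torch.sigmoid(gamma).mean(1, keepdim=True)    # Γ': (B, 1, H, W)
        return gamma * self.conv_x(xn)                        # V = Γ' ⊙ Conv(X_n)
```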
The advantage is that if only the tracker's template were updated, the demands on template quality would be extremely high: once the template contains errors, they can only accumulate, and target tracking eventually fails. Since visual tracking is a dynamic process with changing scenes, the target object has a strong spatio-temporal relationship between successive frames, and the target appearance can be modeled using features from different frames and positions. Introducing an attention mechanism into the online update model extracts the spatio-temporal information of the template features and the update features to enhance the target representation while also obtaining the target's context information well, improving the tracker's ability to adapt to target changes online and keeping it strongly robust when the target undergoes deformation and occlusion.
Experimental results and comparisons are shown in fig. 4 and fig. 5: fig. 4 shows the tracking effect of the present invention, and fig. 5(a)-5(b) compare the accuracy and precision with other state-of-the-art models on OTB-100.
In this embodiment, the template is not updated directly when an update is due; instead, the template and the extracted first-frame template features are fed into an Attention module, so that the template always retains the most original information of the target, and even if erroneous information enters the update module, the tracker is not seriously affected.
Example two
The embodiment provides a target tracking system based on non-local feature fusion and online update, which comprises:
an image acquisition module configured to: acquiring a video sequence, wherein a first frame of a video is used as a template image, and a current frame is used as an image to be searched;
a target tracking module configured to: inputting the template image and the image to be searched into an offline-trained Siamese network, and respectively extracting high-level features and low-level features of the template image and the image to be searched;
fusing the low-level features and the high-level features through a non-local feature pyramid module to extract target template features and features of the target region to be searched;
matching the features of the target region to be searched with the target template features to obtain an initial target tracking result;
and judging whether a threshold condition is met according to the average peak-to-correlation energy and the maximum-response change of the initial target tracking result, and if not, updating the target template features to obtain the target tracking result.
EXAMPLE III
The embodiment of the present specification provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the program to implement the steps of the non-local feature fusion and online update based target tracking method in the first embodiment.
Example four
The embodiment of the present specification provides a computer-readable storage medium, on which a computer program is stored, where the computer program is used to implement, when executed by a processor, the steps of the target tracking method based on non-local feature fusion and online update in the first embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. The target tracking method based on non-local feature fusion and online updating is characterized by comprising the following steps of:
acquiring a video sequence, wherein a first frame of a video is used as a template image, and a current frame is used as an image to be searched;
inputting the template image and the image to be searched into an offline-trained Siamese network, and respectively extracting high-level features and low-level features of the template image and the image to be searched;
fusing the low-level features and the high-level features through a non-local feature pyramid module to extract target template features and features of the target region to be searched;
matching the features of the target region to be searched with the target template features to obtain an initial target tracking result;
and judging whether a threshold condition is met according to the average peak-to-correlation energy and the maximum-response change of the initial target tracking result, and if not, updating the target template features to obtain the target tracking result.
2. The non-local feature fusion and online update based target tracking method of claim 1, wherein fusing the low-level features and the high-level features through the non-local feature pyramid module comprises:
the last two layers of the Siamese network output the high-level features and the low-level features, and the features output by the non-local feature pyramid module are matrix-spliced with the high-level features of the last layer.
3. The target tracking method based on non-local feature fusion and online update as claimed in claim 1, wherein the loss function adopted during offline training of the Siamese network is:
L = log(1 + e^(−λf))
where λ ∈ {−1, +1} denotes the ground-truth label and f denotes the actual matching score of a template pair.
4. The non-local feature fusion and online update based target tracking method according to claim 1, wherein, before the target template features are obtained, a bounding box is used to map the target on the template image, scaled down proportionally, to the corresponding position on the feature map.
5. The target tracking method based on non-local feature fusion and online update as claimed in claim 2, wherein after the initial target tracking result is output, a similarity function is used to calculate a similarity score of the feature maps of the template image and the image to be searched, specifically:
F(z′, x) = φ(z′) ∗ φ(x) + b, with z′ = Ratio(z)
where F(·) is the predicted confidence score, ∗ denotes cross-correlation, Ratio(·) maps the bounding box, scaled down proportionally, into the feature map, b represents the actual error, φ(·) denotes the convolutional feature embedding of each network, z denotes the feature of the template image, z′ denotes the feature of the region obtained by mapping the bounding box proportionally onto the target, and x denotes the feature of the image to be searched.
6. The non-local feature fusion and online update based target tracking method according to claim 1, wherein an attention mechanism is introduced during online updating to extract the features of the template image and enhance the target representation with the spatio-temporal information of the updated features.
7. The non-local feature fusion and online update based target tracking method according to claim 1, wherein the online update process comprises:
performing convolution operations on the template features of the initial frame with two convolution layers, and flattening the convolved features to obtain Z′0 and Z″0, then performing a correlation-matrix operation on Z′0 and Z″0 to obtain a feature matrix;
applying a sigmoid function and an average pooling operation to the feature matrix, performing a convolution operation on the updated features with a convolution layer, and performing element-wise multiplication of the two results to obtain the final target tracking features.
8. A target tracking system based on non-local feature fusion and online updating, characterized by comprising:
an image acquisition module configured to: acquiring a video sequence, wherein a first frame of a video is used as a template image, and a current frame is used as an image to be searched;
a target tracking module configured to: inputting the template image and the image to be searched into an offline-trained Siamese network, and respectively extracting high-level features and low-level features of the template image and the image to be searched;
fusing the low-level features and the high-level features through a non-local feature pyramid module to extract target template features and features of the target region to be searched;
matching the features of the target region to be searched with the target template features to obtain an initial target tracking result;
and judging whether a threshold condition is met according to the average peak-to-correlation energy and the maximum-response change of the initial target tracking result, and if not, updating the target template features to obtain the target tracking result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the non-local feature fusion and online update based object tracking method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the non-local feature fusion and online update based object tracking method of any one of claims 1 to 7.
CN202111255280.7A 2021-10-27 2021-10-27 Target tracking method and system based on non-local feature fusion and online updating Pending CN113963026A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111255280.7A 2021-10-27 2021-10-27 Target tracking method and system based on non-local feature fusion and online updating

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111255280.7A 2021-10-27 2021-10-27 Target tracking method and system based on non-local feature fusion and online updating

Publications (1)

Publication Number Publication Date
CN113963026A true CN113963026A (en) 2022-01-21

Family

ID=79467497

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111255280.7A Pending CN113963026A (en) 2021-10-27 2021-10-27 Target tracking method and system based on non-local feature fusion and online updating

Country Status (1)

Country Link
CN (1) CN113963026A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114757972A (en) * 2022-04-15 2022-07-15 中国电信股份有限公司 Target tracking method and device, electronic equipment and computer readable storage medium
CN114757972B (en) * 2022-04-15 2023-10-10 中国电信股份有限公司 Target tracking method, device, electronic equipment and computer readable storage medium
CN115661207A (en) * 2022-11-14 2023-01-31 南昌工程学院 Target tracking method and system based on space consistency matching and weight learning
CN117333514A (en) * 2023-12-01 2024-01-02 科大讯飞股份有限公司 Single-target video tracking method, device, storage medium and equipment
CN117333514B (en) * 2023-12-01 2024-04-16 科大讯飞股份有限公司 Single-target video tracking method, device, storage medium and equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination