CN114926652A - Twin tracking method and system based on interactive and convergent feature optimization - Google Patents
- Publication number
- CN114926652A (application CN202210600748.XA)
- Authority
- CN
- China
- Prior art keywords
- features
- template
- layer
- module
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention relates to a twin tracking method and system based on interactive and convergent feature optimization. The method comprises the following steps: initializing a template image and a search area image; constructing a feature extraction network to obtain template multi-layer features and search area multi-layer features; constructing a gated double-view aggregation module to optimize the template multi-layer features; constructing a semantic-guided attention module to realize coarse-grained feature optimization of the search area; constructing a correlation graph aggregation module to realize fine-grained feature optimization of the search area; and constructing a head prediction network to predict the position of the target in the current frame. Through self-attention aggregation and interaction of the template features and the search area features, the method and system enhance the salient features of the target and suppress background noise, thereby obtaining more stable, robust and accurate tracking results in challenging scenes.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a twin tracking method and system based on interactive and convergent feature optimization.
Background
In the field of computer vision, target tracking is one of the most important and active research topics, with wide application in autonomous driving, intelligent security, human-computer interaction, unmanned aerial vehicles and the like. Given arbitrary coordinate information of an object in the first frame, a single-target tracker or tracking system aims to continuously predict the spatial position of that object throughout the subsequent video sequence.
In recent years, the application of twin (Siamese) networks in the field of target tracking has developed rapidly. This benefits not only from the better feature representations obtained through deep learning, but also from the real-time tracking speed achievable through parameter sharing and offline training, so twin networks are becoming the mainstream of tracking research. The basic idea of a twin-network-based tracking algorithm is as follows: the target region corresponding to the ground-truth box in the first frame of the video is taken as the template, and each subsequent frame is taken as the search area. During tracking, the region most similar to the template is matched within the search area and used as the predicted position of the target in the current frame. SiamFC (Bertinetto L, Valmadre J, Henriques J F, et al. Fully-convolutional siamese networks for object tracking. Proceedings of the European Conference on Computer Vision Workshops, 2016, pp.850-865.) and SiamRPN (Li B, Yan J, Wu W, et al. High performance visual tracking with siamese region proposal network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp.8971-8980.) obtain an appearance model by training offline on large datasets and perform no parameter updates while tracking online; such trackers are therefore not only accurate but also notably fast. However, because the template is fixed, they are not particularly sensitive to changes in target appearance and are susceptible to interference from similar objects and complex backgrounds. To accommodate changes in target appearance, CFNet (Valmadre J, Bertinetto L, Henriques J F, et al. End-to-end representation learning for correlation filter based tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp.5000-5008.) and RASNet (Wang Q, Teng Z, Xing J, et al. Learning attentions: residual attentional siamese network for high performance online visual tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp.4854-4863.) embed a correlation filter and attention branches, respectively, updating a local module through the filter parameters. GradNet (Li P, Chen B, Ouyang W, et al. GradNet: Gradient-guided network for visual object tracking. Proceedings of the IEEE International Conference on Computer Vision, 2019, pp.6161-6170.) and UpdateNet (Zhang L, Gonzalez-Garcia A, van de Weijer J, et al. Learning the model update for siamese trackers. Proceedings of the IEEE International Conference on Computer Vision, 2019, pp.4009-4018.) achieve template parameter updates through iterative network learning. In contrast to GradNet and UpdateNet, which directly update the parameters of the first-frame template, MemDTC (Yang T, Chan A B. Visual tracking via dynamic memory networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, pp.360-374.) adds a memory unit that stores reliable target templates during tracking, so that the effective information of the first-frame template is fully preserved and the tracker can recover quickly from drift. In addition, to improve the twin tracker's ability to discriminate against similar objects and complex backgrounds, DaSiamRPN (Zhu Z, Wang Q, Li B, et al. Distractor-aware siamese networks for visual object tracking. Proceedings of the European Conference on Computer Vision, 2018, pp.103-119.) designs a distractor-aware module with online incremental learning. Nocal-Siam (Tan H, Zhang X, Zhang Z, et al. Nocal-Siam: refining visual features and response with advanced non-local blocks for real-time siamese tracking. IEEE Transactions on Image Processing, 2021, vol.30, pp.2656-2668.) exploits the long-range dependencies of non-local attention to enhance the learning of target-related feature weights. SiamDW (Zhang Z, Peng H. Deeper and wider siamese networks for real-time visual tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp.4591-4600.) designs deeper and wider network architectures for the twin tracker, further exploiting the feature extraction and discrimination capability of deep networks.
Although twin tracking algorithms have made progress in designing deeper and wider backbone networks, better matching methods, more accurate output representations, and more efficient online updating mechanisms, an effective solution for scenes with similar distractors, complex backgrounds and occlusion is still lacking.
Disclosure of Invention
The invention aims to provide a twin tracking method and system based on interactive and convergent feature optimization, which help obtain more stable, robust and accurate tracking results in complex environments.
To achieve this purpose, the invention adopts the following technical scheme: a twin tracking method based on interactive and convergent feature optimization, comprising the following steps:

S1, initializing a template image and a search area image;

S2, constructing a feature extraction network, inputting the template image and the search area image, and acquiring the corresponding template multi-layer features F_z and search area multi-layer features F_x;

S3, constructing a gated double-view aggregation module GDA to optimize the template multi-layer features: inputting F_z into the GDA module to obtain the optimized template multi-layer features F̂_z;

S4, constructing a semantic-guided attention module SGA to realize coarse-grained feature optimization of the search area: inputting F̂_z and the search area multi-layer features F_x into the SGA module to obtain the coarse-grained optimized search area features F̃_x;

S5, constructing a correlation graph aggregation module CGA to realize fine-grained feature optimization of the search area: inputting F̂_z and F̃_x into the CGA module to obtain the fine-grained optimized search area features F̄_x;

S6, constructing a head prediction network, inputting F̂_z and F̄_x, and predicting the position of the target in the current frame.
Further, the specific implementation method of step S1 is:

Cutting out a template image of size 3 × 127 × 127 from the first-frame image according to the target ground-truth bounding box given in the first frame; starting from the second frame, cutting out a search area image of size 3 × 255 × 255 with the center coordinates of the previous frame's target prediction bounding box as the reference point.
Further, the specific implementation method of step S2 is as follows:

Adopting ResNet-50 as the feature extraction network, taking the template image and the search area image as input, and acquiring the template multi-layer features F_z = {f_z^1, ..., f_z^l} and the search area multi-layer features F_x = {f_x^1, ..., f_x^l}, where l represents the total number of layers of extracted template or search area features, and f_z^i and f_x^i (i ∈ [1, l]) respectively denote the i-th layer template features and search area features.
Further, the specific implementation method of step S3 is:

The GDA module comprises three sub-modules: local-view attention LA, global-view attention GA, and an aggregation gate. The LA module highlights the high-frequency information of the local view. For a single-layer template feature f of size C × H × W, the local-view attention feature F_LA is expressed as:

F_LA = f + f ⊙ σ(BN(W_2 * f_hf))

where W_1 and W_2 are learnable convolution parameters of sizes (C/r) × C and C × (C/r), with r the channel compression parameter; BN(·) denotes batch normalization; σ denotes the sigmoid function; and ⊙ denotes element-wise multiplication. The high-frequency feature f_hf is obtained by subtracting the local mean:

f_hf = f′ − AvgPool(f′; ks, s), with f′ = δ(BN(W_1 * f))

where f′ is the feature mapped by the W_1 convolution; AvgPool(·) denotes average pooling, used to obtain the average signal strength of local areas; ks and s denote the window size and stride, respectively; and δ denotes the nonlinear activation function, here ReLU.

The LA module focuses on a fixed receptive field and aggregates local-area information through convolution, while the GA module aggregates global information of different receptive fields through the interaction of multi-layer features. For a set of l-layer features F = {x_1, x_2, ..., x_l} and any two layers x_i and x_j, three convolutional layers θ(·), φ(·) and g(·) first linearly map x_i to the 'query', 'key' and 'value' feature maps Q, K_1 and V_1:

Q = θ(x_i)
K_1 = φ(x_i)
V_1 = g(x_i)

Meanwhile, the feature x_j shares the convolutional layers φ(·) and g(·) to obtain the corresponding feature maps K_2 and V_2:

K_2 = φ(x_j)
V_2 = g(x_j)

Then the 'keys' and 'values' of each layer are concatenated to obtain global representations K and V of the multi-layer features, of spatial length S = l × H × W, where l denotes the total number of feature layers queried:

K = [φ(x_i) || φ(x_j)]
V = [g(x_i) || g(x_j)]

where [· || ·] denotes concatenation of the features along the spatial dimension.

The attention feature y_i is then derived from the standard non-local attention formula:

y_i = softmax_j(Q K^T) V

where softmax_j denotes the softmax operation along the j dimension.

Finally, a convolutional layer ξ(·) transforms y_i so that its channel number matches the original feature map, and adds it in residual fashion:

z_i = x_i + ξ(y_i)

For the template multi-layer features F_z, the i-th layer feature f_z^i updated through the GA module according to the above formulas is denoted F_GA, with the query taken from f_z^i and the keys and values aggregated over all template layers.

The LA and GA modules reduce channel redundancy to some extent through the channel compression parameters r and m, and respectively yield the attention features F_LA and F_GA under the local and global views. On this basis, the gating mechanism of the aggregation-gate module adaptively fuses F_LA and F_GA, thereby enhancing the effective representation of salient features. For the input features F_LA and F_GA, the two are first concatenated, a 1 × 1 convolutional layer learns their correlation, and a sigmoid function yields the normalized correlation matrix W_gate, taken as the weight matrix of F_LA; the weight matrix of F_GA is then 1 − W_gate. Feature weighting is realized by element-wise multiplication of the weight matrices with the features, and the finally obtained optimized feature f̂_z^i is expressed as:

f̂_z^i = W_gate ⊙ F_LA + (1 − W_gate) ⊙ F_GA
further, the specific implementation method of step S4 is:
optimized features for GDA module outputThe SGA module extracts global semantic information of spatial dimension layer by layer, generates a target semantic attention matrix, and then compares the target semantic attention matrix with the characteristics of a search areaInteracting layer by layer to obtain coarse-grained optimized search region characteristicsIn particular, for the ith layer template featuresGenerated target semantic attention moment arrayExpressed as:
wherein GAP (·) represents global average pooling in spatial dimensions, and σ is sigmoid function;
then, the SGA module adopts global view attention aggregation multi-layer characteristics, and the global view attention aggregation multi-layer characteristics share parameters with the global view attention in the GDA module to reduce actual calculation amount; through the interaction of target semantic information and the 'inquiry', 'key' and 'value' characteristics of the search area, the ith layer of search area characteristicsThe resulting Q, K and V are expressed as:
wherein the content of the first and second substances,
in the above equation, i represents the current layer and j represents other multi-layer features.
Further, the specific implementation method of step S5 is:

The CGA module takes F̂_z and F̃_x as input. On the one hand, it computes the correlation between the spatial pixels of the search area and the template as a whole; on the other hand, it computes a local correlation based only on the salient features of the template. By fusing the global and local correlations and strengthening the relations between spatial positions with graph convolution, an attention map associated with the target features is constructed. Specifically, for the i-th layer optimized template feature f̂_z^i and optimized search area feature f̃_x^i, the template is first sliced along space and along channels to obtain the spatial features Z_s and the channel features Z_c, where N_1 = H_1 × W_1. For a given pixel of the search area, its correlation with the template spatial features is first computed to obtain the spatial correlation map S_1:

S_1 = Corr(Z_s, f̃_x^i)

where Corr(·) is the correlation function, here the inner product.

Then, on the basis of S_1, retrieval of the template's global information is realized by computing the correlation with the channel features, yielding the correlation map S_2:

S_2 = Corr(Z_c, S_1)

The global correlation of a pixel of the search area with the template is then condensed by MaxPool(·), the max pooling operation, with ks and s denoting the window size and stride, respectively; the local correlation is computed analogously, based only on the salient features of the template.

Finally, the two correlation maps are fused by adding corresponding elements, and their graph relations are constructed, thereby enhancing the association between positions; the obtained correlation map is added to the search area feature to yield the finer-grained optimized feature:

f̄_x^i = f̃_x^i ⊕ GCN(S)

where GCN(·) is a two-layer graph convolution network and ⊕ denotes the addition of corresponding elements; this yields the fine-grained optimized search area features F̄_x.
The invention also provides a twin tracking system based on interactive and convergent feature optimization, comprising a memory, a processor, and computer program instructions stored in the memory and executable by the processor, which, when executed by the processor, carry out the above method steps.
Compared with the prior art, the invention has the following beneficial effects: the method and system enhance the salient features of the target and suppress background noise through self-attention aggregation and interaction of the template features and the search area features. Specifically, a novel interaction and aggregation network is employed, comprising a gated double-view attention module, a semantic-guided attention module, and a correlation graph aggregation module. The gated double-view attention module aggregates the outputs of the local-view and global-view attention sub-modules via a gating mechanism, enhancing salient and discriminative target features. The semantic-guided attention module extracts semantic information of the target and uses it as a prior to guide the feature optimization of the search area. Further, for the optimized template and search area features, local and global similarities are constructed in the correlation graph aggregation module, and the spatial position relations are strengthened through a graph convolution network.
Drawings
FIG. 1 is a flow chart of a method implementation of an embodiment of the present invention.
FIG. 2 shows the precision comparison between the method of the present invention and other target tracking methods under different attributes in the embodiment of the present invention.
Detailed Description
The invention is further explained by the following embodiments in conjunction with the drawings.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for describing particular embodiments only and is not intended to limit the exemplary embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well; furthermore, the terms "comprises" and/or "comprising" specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in FIG. 1, the present embodiment provides a twin tracking method based on interactive and convergent feature optimization, which includes the following steps:

S1, initializing the template image and the search area image. The specific implementation method is as follows:

Cutting out a template image of size 3 × 127 × 127 from the first-frame image according to the target ground-truth bounding box given in the first frame; starting from the second frame, cutting out a search area image of size 3 × 255 × 255 with the center coordinates of the previous frame's target prediction bounding box as the reference point. The target prediction bounding box refers to the target position predicted in each frame, given in the form (x, y, w, h), where (x, y) denotes the center position of the prediction bounding box, and w and h denote its width and height, respectively.
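The initialization step can be sketched as follows. The patent fixes only the crop sizes (127 and 255) and the center reference point; the mean-value padding at image borders and the rounding of the center are assumptions borrowed from common SiamFC-style implementations.

```python
import numpy as np

def center_crop(img, cx, cy, size):
    """Crop a size x size patch centered at (cx, cy); regions outside the
    image are filled with the per-channel mean (an assumed padding rule)."""
    h, w, _ = img.shape
    half = size // 2
    pad_val = img.mean(axis=(0, 1))            # per-channel mean
    out = np.tile(pad_val, (size, size, 1))    # (size, size, channels)
    x0 = int(round(cx)) - half
    y0 = int(round(cy)) - half
    # intersection of the crop window with the image
    sx0, sy0 = max(x0, 0), max(y0, 0)
    sx1, sy1 = min(x0 + size, w), min(y0 + size, h)
    if sx0 < sx1 and sy0 < sy1:
        out[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = img[sy0:sy1, sx0:sx1]
    return out

frame = np.random.rand(480, 640, 3)            # stand-in video frame (H, W, C)
bx, by, bw, bh = 320.0, 240.0, 60.0, 80.0      # (x, y, w, h): (x, y) = center
template = center_crop(frame, bx, by, 127)     # 127 x 127 template patch
search = center_crop(frame, bx, by, 255)       # 255 x 255 search patch
```

The same routine serves both crops; only the size argument differs between the template (first frame) and the search area (subsequent frames).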
S2, constructing a feature extraction network, inputting the template image and the search area image, and obtaining the corresponding template multi-layer features F_z and search area multi-layer features F_x. The specific implementation method is as follows:

Adopting ResNet-50 or an improved variant thereof as the feature extraction network, taking the template image and the search area image as input, and acquiring the template multi-layer features F_z = {f_z^1, ..., f_z^l} and the search area multi-layer features F_x = {f_x^1, ..., f_x^l}, where l represents the total number of layers of extracted template or search area features, and f_z^i and f_x^i (i ∈ [1, l]) respectively denote the i-th layer template features and search area features.
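The actual backbone is ResNet-50; the stub below only illustrates the bookkeeping of collecting l = 3 multi-layer feature maps {f^1, ..., f^l} from successive stages. The random 3×3 convolutions, stage count, and equal channel widths are all simplifications, not the patent's network.

```python
import numpy as np

rng = np.random.default_rng(5)

def conv_stage(x, c_out, stride=2):
    """Stand-in for one backbone stage: a random 3x3 valid convolution with
    the given stride, followed by ReLU. Only the shapes matter here."""
    c_in, h, w = x.shape
    k = rng.standard_normal((c_out, c_in, 3, 3)) * 0.1
    ho, wo = (h - 3) // stride + 1, (w - 3) // stride + 1
    out = np.zeros((c_out, ho, wo))
    for i in range(ho):
        for j in range(wo):
            patch = x[:, i * stride:i * stride + 3, j * stride:j * stride + 3]
            out[:, i, j] = np.tensordot(k, patch, axes=3)
    return np.maximum(out, 0)

def backbone(img, channels=(8, 16, 16, 16)):
    """Return the last three stage outputs as the multi-layer features
    {f^1, ..., f^l} with l = 3, mimicking the use of several deep stages."""
    x, feats = img, []
    for c in channels:
        x = conv_stage(x, c)
        feats.append(x)
    return feats[-3:]

F_z = backbone(rng.standard_normal((3, 127, 127)))   # template features
```

Running the 3×127×127 template through the four strided stages yields spatial sizes 63 → 31 → 15 → 7, of which the last three maps form F_z.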
S3, constructing a gated double-view aggregation module GDA to optimize the template multi-layer features: inputting F_z into the GDA module to obtain the optimized template multi-layer features F̂_z. The specific implementation method is as follows:
1) The GDA module includes three sub-modules: local-view attention (LA), global-view attention (GA), and an aggregation gate (aggregation gate). The LA module highlights the high-frequency information of the local view. To save parameters, it is designed with a bottleneck structure, and it highlights detail features through a residual connection, thereby protecting the original features and preventing effective information from being erroneously erased. Moreover, owing to the locally-connected nature of convolution, each element of a feature map essentially represents the embedded features and signal strength of a specific region of the previous layer's feature map, so high-frequency discriminative information can be obtained by subtracting the average signal strength. When the difference between the target and the background is not significant, such high-frequency information is important for enhancing discrimination. Thus, for a single-layer template feature f of size C × H × W, the local-view attention feature F_LA is expressed as:

F_LA = f + f ⊙ σ(BN(W_2 * f_hf))

where W_1 and W_2 are learnable convolution parameters of sizes (C/r) × C and C × (C/r), with r the channel compression parameter; BN(·) denotes batch normalization; σ denotes the sigmoid function; and ⊙ denotes element-wise multiplication. The high-frequency feature f_hf is obtained by subtracting the local mean:

f_hf = f′ − AvgPool(f′; ks, s), with f′ = δ(BN(W_1 * f))

where f′ is the feature mapped by the W_1 convolution; AvgPool(·) denotes average pooling, used to obtain the average signal strength of local areas; ks and s denote the window size and stride, respectively; and δ denotes the nonlinear activation function, here ReLU.
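A minimal NumPy sketch of this LA branch follows. Batch normalization is omitted, the 1×1 bottleneck weights W_1/W_2 are random stand-ins, and the average pooling uses stride 1 with edge padding so the local mean can be subtracted shape-for-shape; those are assumptions on top of the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_pool_same(x, ks):
    """Mean over a ks x ks window, stride 1, edge padding, so the output
    keeps the spatial size of x (C, H, W)."""
    p = ks // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), mode="edge")
    C, H, W = x.shape
    out = np.zeros_like(x)
    for i in range(ks):
        for j in range(ks):
            out += xp[:, i:i + H, j:j + W]
    return out / (ks * ks)

def local_attention(f, r=4, ks=3):
    """LA sketch: bottleneck 1x1 convs, high-frequency part by subtracting
    the local average, sigmoid gate, residual re-weighting."""
    C = f.shape[0]
    W1 = rng.standard_normal((C // r, C)) * 0.1   # compress C -> C/r
    W2 = rng.standard_normal((C, C // r)) * 0.1   # expand C/r -> C
    fp = np.maximum(np.einsum("oc,chw->ohw", W1, f), 0)  # delta = ReLU
    f_hf = fp - avg_pool_same(fp, ks)                    # high frequency
    attn = 1.0 / (1.0 + np.exp(-np.einsum("oc,chw->ohw", W2, f_hf)))
    return f + f * attn                                  # residual gating

f_la = local_attention(rng.standard_normal((16, 15, 15)))
```

The residual form `f + f * attn` guarantees the original feature survives even where the learned gate closes, matching the "protect the original features" motivation stated above.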
2) The LA module focuses on a fixed receptive field and aggregates local-area information through convolution, while the GA module aggregates global information of different receptive fields through the interaction of multi-layer features. Standard non-local attention captures dependencies between different positions through pixel-level interaction and thereby greatly improves performance on visual tasks, but it considers neither the information interaction between different network layers nor the importance of different receptive fields for mining semantic information. In view of this, standard non-local attention is extended here to cross-layer non-local attention, which aggregates multi-layer semantic information into the current layer through interaction across different receptive fields, thereby obtaining richer feature representations. For convenience of presentation and without loss of generality, for a set of l-layer features F = {x_1, x_2, ..., x_l} and any two layers x_i and x_j, three convolutional layers θ(·), φ(·) and g(·) first linearly map x_i to the 'query', 'key' and 'value' feature maps Q, K_1 and V_1:

Q = θ(x_i)
K_1 = φ(x_i)
V_1 = g(x_i)

Meanwhile, the feature x_j shares the convolutional layers φ(·) and g(·) to obtain the corresponding feature maps K_2 and V_2:

K_2 = φ(x_j)
V_2 = g(x_j)

Then the 'keys' and 'values' of each layer are concatenated to obtain global representations K and V of the multi-layer features, of spatial length S = l × H × W, where l denotes the total number of feature layers queried:

K = [φ(x_i) || φ(x_j)]
V = [g(x_i) || g(x_j)]

where [· || ·] denotes concatenation of the features along the spatial dimension.

The attention feature y_i is then derived from the standard non-local attention formula:

y_i = softmax_j(Q K^T) V

where softmax_j denotes the softmax operation along the j dimension.

Finally, a convolutional layer ξ(·) transforms y_i so that its channel number matches the original feature map, and adds it in residual fashion:

z_i = x_i + ξ(y_i)

For the template multi-layer features F_z, the i-th layer feature f_z^i updated through the GA module according to the above formulas is denoted F_GA, with the query taken from f_z^i and the keys and values aggregated over all template layers.
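The cross-layer aggregation just described can be sketched concretely in NumPy. The projection matrices stand in for the learned convolutions θ, φ, g and ξ (random here), and all layers are assumed to share one spatial size so the keys and values can be concatenated to length S = l × H × W.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_layer_attention(feats, i, d=8):
    """Update layer i of `feats` (a list of l maps, each (C, H, W)) by
    attending over keys/values concatenated from every layer."""
    C, H, W = feats[i].shape
    theta = rng.standard_normal((d, C)) * 0.1  # query projection (theta)
    phi = rng.standard_normal((d, C)) * 0.1    # shared key projection (phi)
    g = rng.standard_normal((d, C)) * 0.1      # shared value projection (g)
    xi = rng.standard_normal((C, d)) * 0.1     # output projection (xi)

    flat = [x.reshape(C, -1) for x in feats]         # (C, H*W) per layer
    Q = theta @ flat[i]                              # (d, H*W)
    K = np.concatenate([phi @ x for x in flat], 1)   # (d, S), S = l*H*W
    V = np.concatenate([g @ x for x in flat], 1)     # (d, S)
    attn = softmax(Q.T @ K, axis=1)                  # (H*W, S), softmax over j
    y = V @ attn.T                                   # (d, H*W)
    return feats[i] + (xi @ y).reshape(C, H, W)      # residual update

F_z = [rng.standard_normal((16, 7, 7)) for _ in range(3)]
f_ga = cross_layer_attention(F_z, i=1)
```

Because φ and g are shared across layers, adding more layers enlarges only S (the attended positions), not the parameter count, which is the point of the cross-layer extension.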
3) The LA and GA modules reduce channel redundancy to some extent through the channel compression parameters r and m, and respectively yield the attention features F_LA and F_GA under the local and global views. On this basis, the gating mechanism of the aggregation-gate module adaptively fuses F_LA and F_GA, thereby enhancing the effective representation of salient features. For the input features F_LA and F_GA, the two are first concatenated, a 1 × 1 convolutional layer learns their correlation, and a sigmoid function yields the normalized correlation matrix W_gate, taken as the weight matrix of F_LA; the weight matrix of F_GA is then 1 − W_gate. Feature weighting is realized by multiplying the weight matrices with the features element by element (Hadamard product), and the finally obtained optimized feature f̂_z^i is expressed as:

f̂_z^i = W_gate ⊙ F_LA + (1 − W_gate) ⊙ F_GA
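A minimal sketch of the aggregation gate follows. It assumes, as one plausible reading, a single-channel gate produced by a 1 × 1 convolution (random stand-in) over the concatenated features and broadcast across channels; the patent does not pin down the gate's channel count.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregation_gate(f_la, f_ga):
    """Blend local- and global-view features: a 1x1 conv over their
    concatenation gives W_gate for f_la; f_ga gets 1 - W_gate."""
    C = f_la.shape[0]
    w = rng.standard_normal((1, 2 * C)) * 0.1            # 1x1 conv, 2C -> 1
    cat = np.concatenate([f_la, f_ga], axis=0)           # (2C, H, W)
    w_gate = sigmoid(np.einsum("oc,chw->ohw", w, cat))   # (1, H, W) in (0, 1)
    return w_gate * f_la + (1.0 - w_gate) * f_ga         # element-wise blend

f_la = rng.standard_normal((16, 7, 7))
f_ga = rng.standard_normal((16, 7, 7))
f_opt = aggregation_gate(f_la, f_ga)
```

Since the two weights sum to one at every position, the output is a convex combination: each element of the fused feature lies between the corresponding LA and GA values, so neither view can be fully suppressed and amplified at once.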
s4, constructing a semantic-guided attention module SGA to realize coarse-grained feature optimization of the search areaAnd search area multi-layer featuresInputting the obtained data into SGA module to obtain coarse-grained optimized search region characteristicsThe specific implementation method comprises the following steps:
1) optimized features for GDA module outputThe SGA module extracts global semantic information of space dimensions layer by layer, generates a target semantic attention matrix and then compares the target semantic attention matrix with the characteristics of a search areaInteracting layer by layer to obtain coarse-grained optimized search region characteristicsIn particular, for the ith layer template featuresGenerated target semantic attention moment arrayExpressed as:
wherein GAP (-) represents the global average pooling in spatial dimension, and σ is sigmoid function.
2) The SGA module aggregates the multi-layer features with global-view attention, which shares parameters with the global-view attention in the GDA module to reduce the actual amount of computation. Specifically, through the interaction of the target semantic information with the "query", "key" and "value" features of the search area, the Q, K and V obtained for the i-th layer search-area feature x_i are expressed as:

wherein, in the above equations, i denotes the current layer and j denotes the other multi-layer features.
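One plausible reading of this layer-by-layer interaction, sketched under the assumption that the per-channel semantic attention from each template layer simply gates the corresponding search-region feature before the shared "query"/"key"/"value" mappings (the function name and this simplification are hypothetical, not the claimed equations):

```python
import numpy as np

def sga_modulate(template_feats, search_feats):
    """Coarse-grained optimisation: per-layer channel attention from the
    template modulates the corresponding search-region feature.

    template_feats, search_feats : lists of (C, H, W) arrays, one entry per
    backbone layer; the i-th template feature gates the i-th search feature.
    """
    out = []
    for z_i, x_i in zip(template_feats, search_feats):
        # semantic attention: sigmoid of the spatial mean, per channel
        a_i = 1.0 / (1.0 + np.exp(-z_i.mean(axis=(1, 2), keepdims=True)))
        out.append(a_i * x_i)  # broadcast (C, 1, 1) over (C, H, W)
    return out
```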
S5, a correlation-graph aggregation module CGA is constructed to realize fine-grained feature optimization of the search area; F_z' and F_x' are input into the CGA module to obtain the fine-grained optimized search-region features F_x''. The specific implementation method is as follows:

The CGA module takes F_z' and F_x' as input. On the one hand, it calculates the correlation between the spatial pixels of the search area and the global template; on the other hand, it calculates the local correlation based only on the salient features of the template. By fusing the global and local correlations and strengthening the relation between spatial positions with graph convolution, an attention map associated with the target feature is constructed.
1) For the i-th layer template optimized feature z_i' and search-area optimized feature x_i', the template is first sliced along the spatial and channel dimensions respectively to obtain the spatial features and channel features, where N_1 = H_1 × W_1. For a certain pixel in the search area, the correlation with the template spatial features is first calculated to obtain the spatial correlation map S_1, expressed as:

where Corr(·) is a correlation calculation function, here taken as an inner product.
Then, on the basis of S_1, retrieval of the global template information is realized by calculating the correlation with the channel features; the correlation map S_2 obtained at this point is expressed as:

For simplicity of description, the global correlation of a certain pixel in the search area with the template is expressed as:

where MaxPool(·) is the max-pooling operation, and ks and s denote the window size and stride, respectively.
The local correlation is then expressed as:
2) The two correlation maps are fused by element-wise addition, and their graph relations are constructed, thereby enhancing the association between positions. The resulting correlation map is added to the search-area features x_i' to obtain the finer-grained optimized feature x_i'', expressed as:

wherein GCN(·) is a two-layer graph convolutional network and ⊕ denotes element-wise addition; the fine-grained optimized search-region features F_x'' are thus obtained.
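The global correlation step can be illustrated as follows (a simplified NumPy sketch: the template is flattened into spatial features and inner products are taken with every search pixel; the per-pixel maximum stands in for the MaxPool-based selection of salient template responses and is not the exact claimed operation):

```python
import numpy as np

def correlation_maps(template, search):
    """Pixel-wise correlation between a template and a search region.

    template : (C, Hz, Wz), search : (C, Hx, Wx).
    S1[n, p] is the inner product of template spatial feature n with
    search pixel p (global correlation); `local` keeps, per search pixel,
    only the strongest template response (a stand-in for the salient
    local correlation).
    """
    c = template.shape[0]
    t = template.reshape(c, -1)   # (C, N1) template spatial features
    s = search.reshape(c, -1)     # (C, Hx*Wx) search pixels
    s1 = t.T @ s                  # (N1, Hx*Wx) inner-product correlation
    local = s1.max(axis=0)        # strongest template response per pixel
    return s1, local
```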
S6, a head prediction network is constructed; F_z' and F_x'' are input into it to predict the position of the target in the current frame.
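The head prediction network itself is not detailed here; Siamese trackers of this family (e.g. SiamCAR, used as the baseline below) typically correlate template and search features channel-wise before the classification and regression heads. A hedged NumPy sketch of that depth-wise cross-correlation (the function name is hypothetical):

```python
import numpy as np

def depthwise_xcorr(template, search):
    """Per-channel (depth-wise) cross-correlation of the template over the
    search feature, in 'valid' mode: each channel of the template slides
    over the same channel of the search map and responses are summed
    spatially, giving the response map a tracking head consumes."""
    c, hz, wz = template.shape
    _, hx, wx = search.shape
    oh, ow = hx - hz + 1, wx - wz + 1
    out = np.zeros((c, oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = search[:, i:i + hz, j:j + wz]
            out[:, i, j] = (patch * template).sum(axis=(1, 2))
    return out
```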
The present embodiment also provides a twin tracking system based on interactive and convergent feature optimization, comprising a memory, a processor and computer program instructions stored on the memory and executable by the processor, which when executed by the processor, implement the above-mentioned method steps.
In this embodiment, the OTB100 data set is used for comparison and verification. Fig. 2 shows the precision comparison between FRIA-Track and other target tracking methods under different attributes. Table 1 shows the success-rate comparison between the method proposed by the present invention and other target tracking methods on the OTB100 data set.
TABLE 1 comparison of the present invention with other target tracking methods
As can be seen from Fig. 2, the FRIA-Track method of the present invention achieves the best level under 8 attributes, and its performance under 10 attributes exceeds that of the baseline algorithm SiamCAR. As can be seen from Table 1, the method of the present invention achieves the best success rate compared with the other target tracking methods.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to preferred embodiments of the present invention; other and further embodiments may be devised without departing from its basic scope, which is determined by the claims that follow. Any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the present invention falls within the protection scope of the technical solution of the present invention.
Claims (7)
1. A twin tracking method based on interactive and convergent feature optimization is characterized by comprising the following steps:
s1, initializing a template image and a search area image;
S2, constructing a feature extraction network, inputting the template image and the search-area image, and acquiring the corresponding template multi-layer features F_z and search-area multi-layer features F_x;

S3, constructing a gated dual-view aggregation module GDA to optimize the template multi-layer features, and inputting F_z into the GDA module to obtain the optimized template multi-layer features F_z';

S4, constructing a semantics-guided attention module SGA to realize coarse-grained feature optimization of the search area, and inputting F_z' and the search-area multi-layer features F_x into the SGA module to obtain the coarse-grained optimized search-region features F_x';

S5, constructing a correlation-graph aggregation module CGA to realize fine-grained feature optimization of the search area, and inputting F_z' and F_x' into the CGA module to obtain the fine-grained optimized search-region features F_x''.
2. The twin tracking method based on interaction and convergent feature optimization according to claim 1, wherein the step S1 is implemented by:
cutting out a template image of size 3 × 127 × 127 from the first frame image according to the target ground-truth bounding box given in the first frame; and starting from the second frame, cutting out a search-area image of size 3 × 255 × 255 with the center coordinates of the target bounding box predicted in the previous frame as the reference point.
3. The twin tracking method based on interaction and convergent feature optimization according to claim 1, wherein the step S2 is implemented by:
adopting ResNet-50 as the feature extraction network, taking the template image and the search-area image as input, and acquiring the template multi-layer features F_z and the search-area multi-layer features F_x, where l represents the total number of layers of the extracted template or search-area features, z_i and x_i respectively represent the i-th layer template feature and search-area feature, and i ∈ [1, l].
4. The twin tracking method based on interactive and convergent feature optimization according to claim 1, wherein the specific implementation method of step S3 is:
the GDA module comprises three sub-modules: local-view attention LA, global-view attention GA, and an aggregation gate; the LA module is used for highlighting the high-frequency information of the local perspective; for a single-layer template feature z_i of size C × H × W, the local-view attention feature z_i^LA is expressed as:

wherein W_2 is a learnable convolution parameter whose size is determined by the channel compression parameter r; BN(·) represents batch normalization; σ represents the sigmoid function; ⊙ represents element-wise multiplication; the high-frequency feature is obtained by subtracting the local mean, expressed as:

in the formula, W_1 is a learnable convolution parameter, and the corresponding term is the feature map obtained through the W_1 convolution; AvgPool(·) denotes average pooling, used to obtain the average signal strength of a local area; ks and s denote the window size and stride, respectively; δ denotes the non-linear activation function, here ReLU;
The LA module focuses on a fixed receptive field and aggregates the information of a local area through convolution operations; the GA module is used to aggregate the global information of different receptive fields through the interaction of multi-layer features. For a set of l-layer features F = {x_1, x_2, ..., x_l} and arbitrary two-layer features x_i and x_j, three convolutional layers θ(·), φ(·) and g(·) are first used to linearly map x_i, obtaining the "query", "key" and "value" feature maps Q, K_1 and V_1, namely
Q=θ(x i )
K 1 =φ(x i )
V 1 =g(x i )
meanwhile, the feature x_j shares the convolutional layers φ(·) and g(·) to obtain the corresponding feature maps K_2 and V_2, namely
K 2 =φ(x j )
V 2 =g(x j )
Then, the "keys" and "values" of each layer are spliced together respectively to obtain the global representations K and V of the multi-layer features, where S = l × H × W and l denotes the total number of feature layers queried; the global features K and V are then expressed as:
K=[φ(x i )||φ(x j )]
V=[g(x i )||g(x j )]
wherein [·||·] represents splicing the features along the spatial dimension;
Thus, according to the standard non-local attention formula, the attention feature y_i is expressed as:

y_i = softmax_j(QK^T)V

where softmax_j(·) denotes the softmax operation along the j dimension. Finally, a convolutional layer ξ(·) transforms y_i so that its channel number matches the original feature map, and the result is added to the original feature map in residual form, obtaining:

x_i^GA = ξ(y_i) + x_i
for template multi-layer featuresAccording to the above formula, wherein the ith layer is characterizedUpdates through the GA Module are represented as:
Wherein, the first and the second end of the pipe are connected with each other,
the LA module and the GA module reduce channel redundancy to a certain extent through the channel compression parameters r and m, and obtain the attention features F^LA and F^GA under the local and global perspectives, respectively; on this basis, the gating mechanism of the aggregation gate module adaptively fuses F^LA and F^GA, thereby enhancing the effective representation of salient features; for the input features F^LA and F^GA, the two features are first spliced together, a 1 × 1 convolutional layer is used to learn the correlation between them, and a sigmoid function yields the normalized correlation matrix W_gate; letting W_gate denote the weight matrix of F^LA, the weight matrix of F^GA is expressed as 1 − W_gate; then, feature weighting is realized by multiplying the weight matrix with the features element by element, and the finally obtained optimized feature is expressed as:

F' = W_gate ⊙ F^LA + (1 − W_gate) ⊙ F^GA
5. the twin tracking method based on interaction and convergent feature optimization according to claim 1, wherein the step S4 is implemented by:
for the optimized features F_z' output by the GDA module, the SGA module extracts the global semantic information of the spatial dimensions layer by layer, generates a target semantic attention matrix, and then interacts with the search-area features F_x layer by layer to obtain the coarse-grained optimized search-region features F_x'; specifically, for the i-th layer template feature z_i', the generated target semantic attention matrix A_i is expressed as:

A_i = σ(GAP(z_i'))

wherein GAP(·) represents global average pooling over the spatial dimensions, and σ is the sigmoid function;
then, the SGA module aggregates the multi-layer features with global-view attention, which shares parameters with the global-view attention in the GDA module to reduce the actual amount of computation; through the interaction of the target semantic information with the "query", "key" and "value" features of the search area, the Q, K and V obtained for the i-th layer search-area feature x_i are expressed as:

wherein, in the above equations, i denotes the current layer and j denotes the other multi-layer features.
6. The twin tracking method based on interactive and convergent feature optimization according to claim 1, wherein the specific implementation method of step S5 is:
the CGA module takes F_z' and F_x' as input; on the one hand, the correlation between the spatial pixels of the search area and the global template is calculated; on the other hand, the local correlation is calculated based only on the salient features of the template; by fusing the global and local correlations and strengthening the relation between spatial positions with graph convolution, an attention map associated with the target feature is constructed; specifically, for the i-th layer template optimized feature z_i' and search-area optimized feature x_i', the template is first sliced along the spatial and channel dimensions respectively to obtain the spatial features and channel features, where N_1 = H_1 × W_1; for a certain pixel in the search area, the correlation with the template spatial features is first calculated to obtain the spatial correlation map S_1, expressed as:

wherein Corr(·) is a correlation calculation function, implemented as an inner product;
then, on the basis of S_1, retrieval of the global template information is realized by calculating the correlation with the channel features; the correlation map S_2 obtained at this point is expressed as:

the global correlation of a certain pixel in the search area with the template is then expressed as:

wherein MaxPool(·) is the max-pooling operation, and ks and s respectively denote the window size and stride;
the local correlation is then expressed as:
finally, the two correlation maps are fused by element-wise addition, and their graph relations are constructed, thereby enhancing the association between positions; the resulting correlation map is added to the search-area features x_i' to obtain the finer-grained optimized feature x_i'', expressed as:
7. A twin tracking system based on interactive and convergent feature optimization, comprising a memory, a processor and computer program instructions stored on the memory and executable by the processor, which, when executed by the processor, implement the method steps of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210600748.XA CN114926652A (en) | 2022-05-30 | 2022-05-30 | Twin tracking method and system based on interactive and convergent feature optimization |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210600748.XA CN114926652A (en) | 2022-05-30 | 2022-05-30 | Twin tracking method and system based on interactive and convergent feature optimization |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114926652A true CN114926652A (en) | 2022-08-19 |
Family
ID=82812296
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210600748.XA Pending CN114926652A (en) | 2022-05-30 | 2022-05-30 | Twin tracking method and system based on interactive and convergent feature optimization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114926652A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115457259A (en) * | 2022-09-14 | 2022-12-09 | 华洋通信科技股份有限公司 | Image rapid saliency detection method based on multi-channel activation optimization |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Kejriwal et al. | High performance loop closure detection using bag of word pairs | |
CN111144364B (en) | Twin network target tracking method based on channel attention updating mechanism | |
CN106780631B (en) | Robot closed-loop detection method based on deep learning | |
Dusmanu et al. | Multi-view optimization of local feature geometry | |
Fu et al. | Fast ORB-SLAM without keypoint descriptors | |
CN112651262A (en) | Cross-modal pedestrian re-identification method based on self-adaptive pedestrian alignment | |
Hu et al. | Semantic SLAM based on improved DeepLabv3⁺ in dynamic scenarios | |
CN115439507A (en) | Three-dimensional video target tracking method based on multi-level mutual enhancement and relevant pyramid | |
Sehgal et al. | Lidar-monocular visual odometry with genetic algorithm for parameter optimization | |
Urdiales et al. | An improved deep learning architecture for multi-object tracking systems | |
CN114926652A (en) | Twin tracking method and system based on interactive and convergent feature optimization | |
Liu et al. | Learning optical flow and scene flow with bidirectional camera-lidar fusion | |
Zeng et al. | NCT: noise-control multi-object tracking | |
Tsintotas et al. | The revisiting problem in simultaneous localization and mapping | |
Huang et al. | Correlation-filter based scale-adaptive visual tracking with hybrid-scheme sample learning | |
Bazeille et al. | Combining odometry and visual loop-closure detection for consistent topo-metrical mapping | |
CN116797799A (en) | Single-target tracking method and tracking system based on channel attention and space-time perception | |
CN116245913A (en) | Multi-target tracking method based on hierarchical context guidance | |
CN115880332A (en) | Target tracking method for low-altitude aircraft visual angle | |
CN115830631A (en) | One-person one-file system construction method based on posture-assisted occluded human body re-recognition | |
Jiang et al. | Semantic closed-loop based visual mapping algorithm for automated valet parking | |
Zhang et al. | Rt-track: robust tricks for multi-pedestrian tracking | |
Tan et al. | Online visual tracking via background-aware Siamese networks | |
Cai et al. | Explicit invariant feature induced cross-domain crowd counting | |
CN116152298B (en) | Target tracking method based on self-adaptive local mining |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |