CN114926652A - Twin tracking method and system based on interactive and convergent feature optimization - Google Patents
- Publication number
- CN114926652A (application CN202210600748.XA)
- Authority
- CN
- China
- Prior art keywords
- features
- template
- layer
- module
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention relates to a twin tracking method and system based on interactive and convergent feature optimization. The method comprises the following steps: initializing a template image and a search area image; constructing a feature extraction network to obtain template multi-layer features and search area multi-layer features; constructing a gated double-view aggregation module to optimize the template multi-layer features; constructing a semantic-guided attention module to realize coarse-grained feature optimization of the search area; constructing a correlation graph aggregation module to realize fine-grained feature optimization of the search area; and constructing a head prediction network to predict the position of the target in the current frame. Through self-attention aggregation and interaction of the template features and the search area features, the method and system enhance the salient features of the target and suppress background noise, thereby obtaining more stable, robust and accurate tracking results in challenging scenes.
Description
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a twin tracking method and system based on interactive and convergent feature optimization.
Background
In the field of computer vision, target tracking is one of the most important and active research topics, with wide application in autonomous driving, intelligent security, human-computer interaction, unmanned aerial vehicles and the like. Given arbitrary coordinate information of an object in the first frame, a single-target tracker or tracking system aims to continuously predict the spatial position of that object throughout the subsequent video sequence.
In recent years, the application of twin (Siamese) networks in the field of target tracking has developed rapidly. This benefits not only from the better feature representations obtained through deep learning, but also from the real-time tracking speed achievable through parameter sharing and offline training, so twin networks are becoming the mainstream of tracking research. The basic idea of a twin-network-based tracking algorithm is as follows: the target region corresponding to the ground-truth box in the first frame of the video is taken as the template, and each subsequent frame is taken as the search area. During tracking, the region most similar to the template is matched within the search area and used as the predicted position of the target in the current frame. SiamFC (Bertinetto L, Valmadre J, Henriques J F, et al. Fully-convolutional siamese networks for object tracking. Proceedings of the European Conference on Computer Vision Workshops, 2016, pp.850-865.) and SiamRPN (Li B, Yan J, Wu W, et al. High performance visual tracking with siamese region proposal network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp.8971-8980.) obtain an appearance model by training offline on large datasets and perform no parameter updates while tracking online; such trackers are therefore not only accurate but also notably fast. However, because the template is fixed, they are not particularly sensitive to changes in target appearance and are susceptible to interference from similar objects and complex backgrounds. To accommodate changes in target appearance, CFNet (Valmadre J, Bertinetto L, Henriques J F, et al. End-to-end representation learning for correlation filter based tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp.5000-5008.) and RASNet (Wang Q, Teng Z, Xing J, et al. Learning attentions: residual attentional siamese network for high performance online visual tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp.4854-4863.) embed a correlation filter and attention branches, respectively, updating a local module through the filter parameters. GradNet (Li P, Chen B, Ouyang W, et al. GradNet: Gradient-guided network for visual object tracking. Proceedings of the IEEE International Conference on Computer Vision, 2019, pp.6161-6170.) and UpdateNet (Zhang L, Gonzalez-Garcia A, van de Weijer J, et al. Learning the model update for siamese trackers. Proceedings of the IEEE International Conference on Computer Vision, 2019, pp.4009-4018.) achieve template parameter updates through iterative network learning. In contrast to GradNet and UpdateNet, which directly update the parameters of the first-frame template, MemDTC (Yang T, Chan A B. Visual tracking via dynamic memory networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, pp.360-374.) adds a memory unit that stores reliable target templates during tracking, so that the effective information of the first-frame template is fully preserved and the tracker can recover quickly from drift. In addition, to improve the twin tracker's ability to discriminate against similar objects and complex backgrounds, DaSiamRPN (Zhu Z, Wang Q, Li B, et al. Distractor-aware siamese networks for visual object tracking. Proceedings of the European Conference on Computer Vision, 2018, pp.103-119.) designs a distractor-aware module with online incremental learning. Nocal-Siam (Tan H, Zhang X, Zhang Z, et al. Nocal-Siam: refining visual features and response with advanced non-local blocks for real-time siamese tracking. IEEE Transactions on Image Processing, 2021, vol.30, pp.2656-2668.) exploits the long-range dependencies of non-local attention to enhance the learning of target-related feature weights. SiamDW (Zhang Z, Peng H. Deeper and wider siamese networks for real-time visual tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp.4591-4600.) designs deeper and wider network architectures for the twin tracker, further exploiting the feature extraction and discrimination capability of deep networks.
Although twin tracking algorithms have made progress in designing deeper and wider backbone networks, better matching methods, more accurate output representations, and more efficient online updating mechanisms, an effective solution for scenes with similar distractors, complex backgrounds and occlusion is still lacking.
Disclosure of Invention
The invention aims to provide a twin tracking method and system based on interactive and convergent feature optimization, which help obtain more stable, robust and accurate tracking results in complex environments.
To achieve this purpose, the invention adopts the following technical scheme: a twin tracking method based on interactive and convergent feature optimization, comprising the following steps:

S1, initializing a template image and a search area image;

S2, constructing a feature extraction network, inputting the template image and the search area image, and acquiring the corresponding template multi-layer features F_z and search area multi-layer features F_x;

S3, constructing a gated double-view aggregation module GDA to optimize the template multi-layer features: inputting F_z into the GDA module to obtain the optimized template multi-layer features F̂_z;

S4, constructing a semantic-guided attention module SGA to realize coarse-grained feature optimization of the search area: inputting F̂_z and the search area multi-layer features F_x into the SGA module to obtain the coarse-grained optimized search area features F̃_x;

S5, constructing a correlation graph aggregation module CGA to realize fine-grained feature optimization of the search area: inputting F̂_z and F̃_x into the CGA module to obtain the fine-grained optimized search area features F̄_x;

S6, constructing a head prediction network, inputting F̂_z and F̄_x, and predicting the position of the target in the current frame.
Further, the specific implementation method of step S1 is:

Cutting out a template image of size 3 × 127 × 127 from the first-frame image according to the target ground-truth bounding box given in the first frame; starting from the second frame, cutting out a search area image of size 3 × 255 × 255 with the center coordinates of the previous frame's target prediction bounding box as the reference point.
Further, the specific implementation method of step S2 is as follows:

Adopting ResNet-50 as the feature extraction network, taking the template image and the search area image as input, and acquiring the template multi-layer features F_z = {f_z^1, ..., f_z^l} and the search area multi-layer features F_x = {f_x^1, ..., f_x^l}, where l represents the total number of layers of extracted template or search area features, and f_z^i and f_x^i (i ∈ [1, l]) respectively denote the i-th layer template features and search area features.
Further, the specific implementation method of step S3 is:

The GDA module comprises three sub-modules: local-view attention LA, global-view attention GA, and an aggregation gate. The LA module highlights the high-frequency information of the local view. For a single-layer template feature f of size C × H × W, the local-view attention feature F_LA is expressed as:

F_LA = f + f ⊙ σ(BN(W_2 * f_hf))

where W_1 and W_2 are learnable convolution parameters of sizes (C/r) × C and C × (C/r), with r the channel compression parameter; BN(·) denotes batch normalization; σ denotes the sigmoid function; and ⊙ denotes element-wise multiplication. The high-frequency feature f_hf is obtained by subtracting the local mean:

f_hf = f′ − AvgPool(f′; ks, s), with f′ = δ(BN(W_1 * f))

where f′ is the feature mapped by the W_1 convolution; AvgPool(·) denotes average pooling, used to obtain the average signal strength of local areas; ks and s denote the window size and stride, respectively; and δ denotes the nonlinear activation function, here ReLU.

The LA module focuses on a fixed receptive field and aggregates local-area information through convolution, while the GA module aggregates global information of different receptive fields through the interaction of multi-layer features. For a set of l-layer features F = {x_1, x_2, ..., x_l} and any two layers x_i and x_j, three convolutional layers θ(·), φ(·) and g(·) first linearly map x_i to the 'query', 'key' and 'value' feature maps Q, K_1 and V_1:

Q = θ(x_i)
K_1 = φ(x_i)
V_1 = g(x_i)

Meanwhile, the feature x_j shares the convolutional layers φ(·) and g(·) to obtain the corresponding feature maps K_2 and V_2:

K_2 = φ(x_j)
V_2 = g(x_j)

Then the 'keys' and 'values' of each layer are concatenated to obtain global representations K and V of the multi-layer features, of spatial length S = l × H × W, where l denotes the total number of feature layers queried:

K = [φ(x_i) || φ(x_j)]
V = [g(x_i) || g(x_j)]

where [· || ·] denotes concatenation of the features along the spatial dimension.

The attention feature y_i is then derived from the standard non-local attention formula:

y_i = softmax_j(Q K^T) V

where softmax_j denotes the softmax operation along the j dimension.

Finally, a convolutional layer ξ(·) transforms y_i so that its channel number matches the original feature map, and adds it in residual fashion:

z_i = x_i + ξ(y_i)

For the template multi-layer features F_z, the i-th layer feature f_z^i updated through the GA module according to the above formulas is denoted F_GA, with the query taken from f_z^i and the keys and values aggregated over all template layers.

The LA and GA modules reduce channel redundancy to some extent through the channel compression parameters r and m, and respectively yield the attention features F_LA and F_GA under the local and global views. On this basis, the gating mechanism of the aggregation-gate module adaptively fuses F_LA and F_GA, thereby enhancing the effective representation of salient features. For the input features F_LA and F_GA, the two are first concatenated, a 1 × 1 convolutional layer learns their correlation, and a sigmoid function yields the normalized correlation matrix W_gate, taken as the weight matrix of F_LA; the weight matrix of F_GA is then 1 − W_gate. Feature weighting is realized by element-wise multiplication of the weight matrices with the features, and the finally obtained optimized feature f̂_z^i is expressed as:

f̂_z^i = W_gate ⊙ F_LA + (1 − W_gate) ⊙ F_GA
further, the specific implementation method of step S4 is:
optimized features for GDA module outputThe SGA module extracts global semantic information of spatial dimension layer by layer, generates a target semantic attention matrix, and then compares the target semantic attention matrix with the characteristics of a search areaInteracting layer by layer to obtain coarse-grained optimized search region characteristicsIn particular, for the ith layer template featuresGenerated target semantic attention moment arrayExpressed as:
wherein GAP (·) represents global average pooling in spatial dimensions, and σ is sigmoid function;
then, the SGA module adopts global view attention aggregation multi-layer characteristics, and the global view attention aggregation multi-layer characteristics share parameters with the global view attention in the GDA module to reduce actual calculation amount; through the interaction of target semantic information and the 'inquiry', 'key' and 'value' characteristics of the search area, the ith layer of search area characteristicsThe resulting Q, K and V are expressed as:
wherein the content of the first and second substances,
in the above equation, i represents the current layer and j represents other multi-layer features.
Further, the specific implementation method of step S5 is:

The CGA module takes F̂_z and F̃_x as input. On the one hand, it computes the correlation between the spatial pixels of the search area and the template as a whole; on the other hand, it computes a local correlation based only on the salient features of the template. By fusing the global and local correlations and strengthening the relations between spatial positions with graph convolution, an attention map associated with the target features is constructed. Specifically, for the i-th layer optimized template feature f̂_z^i and optimized search area feature f̃_x^i, the template is first sliced along space and along channels to obtain the spatial features Z_s and the channel features Z_c, where N_1 = H_1 × W_1. For a given pixel of the search area, its correlation with the template spatial features is first computed to obtain the spatial correlation map S_1:

S_1 = Corr(Z_s, f̃_x^i)

where Corr(·) is the correlation function, here the inner product.

Then, on the basis of S_1, retrieval of the template's global information is realized by computing the correlation with the channel features, yielding the correlation map S_2:

S_2 = Corr(Z_c, S_1)

The global correlation of a pixel of the search area with the template is then condensed by MaxPool(·), the max pooling operation, with ks and s denoting the window size and stride, respectively; the local correlation is computed analogously, based only on the salient features of the template.

Finally, the two correlation maps are fused by adding corresponding elements, and their graph relations are constructed, thereby enhancing the association between positions; the obtained correlation map is added to the search area feature to yield the finer-grained optimized feature:

f̄_x^i = f̃_x^i ⊕ GCN(S)

where GCN(·) is a two-layer graph convolution network and ⊕ denotes the addition of corresponding elements; this yields the fine-grained optimized search area features F̄_x.
The invention also provides a twin tracking system based on interactive and convergent feature optimization, comprising a memory, a processor, and computer program instructions stored in the memory and executable by the processor, which, when executed by the processor, carry out the above method steps.
Compared with the prior art, the invention has the following beneficial effects: the method and system enhance the salient features of the target and suppress background noise through self-attention aggregation and interaction of the template features and the search area features. Specifically, a novel interaction and aggregation network is employed, comprising a gated double-view attention module, a semantic-guided attention module, and a correlation graph aggregation module. The gated double-view attention module aggregates the outputs of the local-view and global-view attention sub-modules via a gating mechanism, enhancing salient and discriminative target features. The semantic-guided attention module extracts semantic information of the target and uses it as a prior to guide the feature optimization of the search area. Further, for the optimized template and search area features, local and global similarities are constructed in the correlation graph aggregation module, and the spatial position relations are strengthened through a graph convolution network.
Drawings
FIG. 1 is a flow chart of a method implementation of an embodiment of the present invention.
FIG. 2 shows the precision comparison between the method of the present invention and other target tracking methods under different attributes in the embodiment of the present invention.
Detailed Description
The invention is further explained by the following embodiments in conjunction with the drawings.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for describing particular embodiments only and is not intended to limit the exemplary embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well; furthermore, the terms "comprises" and/or "comprising" specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in FIG. 1, the present embodiment provides a twin tracking method based on interactive and convergent feature optimization, which includes the following steps:

S1, initializing the template image and the search area image. The specific implementation method is as follows:

Cutting out a template image of size 3 × 127 × 127 from the first-frame image according to the target ground-truth bounding box given in the first frame; starting from the second frame, cutting out a search area image of size 3 × 255 × 255 with the center coordinates of the previous frame's target prediction bounding box as the reference point. The target prediction bounding box refers to the target position predicted in each frame, given in the form (x, y, w, h), where (x, y) denotes the center position of the prediction bounding box, and w and h denote its width and height, respectively.
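The initialization step can be sketched as follows. The patent fixes only the crop sizes (127 and 255) and the center reference point; the mean-value padding at image borders and the rounding of the center are assumptions borrowed from common SiamFC-style implementations.

```python
import numpy as np

def center_crop(img, cx, cy, size):
    """Crop a size x size patch centered at (cx, cy); regions outside the
    image are filled with the per-channel mean (an assumed padding rule)."""
    h, w, _ = img.shape
    half = size // 2
    pad_val = img.mean(axis=(0, 1))            # per-channel mean
    out = np.tile(pad_val, (size, size, 1))    # (size, size, channels)
    x0 = int(round(cx)) - half
    y0 = int(round(cy)) - half
    # intersection of the crop window with the image
    sx0, sy0 = max(x0, 0), max(y0, 0)
    sx1, sy1 = min(x0 + size, w), min(y0 + size, h)
    if sx0 < sx1 and sy0 < sy1:
        out[sy0 - y0:sy1 - y0, sx0 - x0:sx1 - x0] = img[sy0:sy1, sx0:sx1]
    return out

frame = np.random.rand(480, 640, 3)            # stand-in video frame (H, W, C)
bx, by, bw, bh = 320.0, 240.0, 60.0, 80.0      # (x, y, w, h): (x, y) = center
template = center_crop(frame, bx, by, 127)     # 127 x 127 template patch
search = center_crop(frame, bx, by, 255)       # 255 x 255 search patch
```

The same routine serves both crops; only the size argument differs between the template (first frame) and the search area (subsequent frames).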
S2, constructing a feature extraction network, inputting the template image and the search area image, and obtaining the corresponding template multi-layer features F_z and search area multi-layer features F_x. The specific implementation method is as follows:

Adopting ResNet-50 or an improved variant thereof as the feature extraction network, taking the template image and the search area image as input, and acquiring the template multi-layer features F_z = {f_z^1, ..., f_z^l} and the search area multi-layer features F_x = {f_x^1, ..., f_x^l}, where l represents the total number of layers of extracted template or search area features, and f_z^i and f_x^i (i ∈ [1, l]) respectively denote the i-th layer template features and search area features.
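The actual backbone is ResNet-50; the stub below only illustrates the bookkeeping of collecting l = 3 multi-layer feature maps {f^1, ..., f^l} from successive stages. The random 3×3 convolutions, stage count, and equal channel widths are all simplifications, not the patent's network.

```python
import numpy as np

rng = np.random.default_rng(5)

def conv_stage(x, c_out, stride=2):
    """Stand-in for one backbone stage: a random 3x3 valid convolution with
    the given stride, followed by ReLU. Only the shapes matter here."""
    c_in, h, w = x.shape
    k = rng.standard_normal((c_out, c_in, 3, 3)) * 0.1
    ho, wo = (h - 3) // stride + 1, (w - 3) // stride + 1
    out = np.zeros((c_out, ho, wo))
    for i in range(ho):
        for j in range(wo):
            patch = x[:, i * stride:i * stride + 3, j * stride:j * stride + 3]
            out[:, i, j] = np.tensordot(k, patch, axes=3)
    return np.maximum(out, 0)

def backbone(img, channels=(8, 16, 16, 16)):
    """Return the last three stage outputs as the multi-layer features
    {f^1, ..., f^l} with l = 3, mimicking the use of several deep stages."""
    x, feats = img, []
    for c in channels:
        x = conv_stage(x, c)
        feats.append(x)
    return feats[-3:]

F_z = backbone(rng.standard_normal((3, 127, 127)))   # template features
```

Running the 3×127×127 template through the four strided stages yields spatial sizes 63 → 31 → 15 → 7, of which the last three maps form F_z.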
S3, constructing a gated double-view aggregation module GDA to optimize the template multi-layer features: inputting F_z into the GDA module to obtain the optimized template multi-layer features F̂_z. The specific implementation method is as follows:
1) The GDA module includes three sub-modules: local-view attention (LA), global-view attention (GA), and an aggregation gate (aggregation gate). The LA module highlights the high-frequency information of the local view. To save parameters, it is designed with a bottleneck structure, and it highlights detail features through a residual connection, thereby protecting the original features and preventing effective information from being erroneously erased. Moreover, owing to the locally-connected nature of convolution, each element of a feature map essentially represents the embedded features and signal strength of a specific region of the previous layer's feature map, so high-frequency discriminative information can be obtained by subtracting the average signal strength. When the difference between the target and the background is not significant, such high-frequency information is important for enhancing discrimination. Thus, for a single-layer template feature f of size C × H × W, the local-view attention feature F_LA is expressed as:

F_LA = f + f ⊙ σ(BN(W_2 * f_hf))

where W_1 and W_2 are learnable convolution parameters of sizes (C/r) × C and C × (C/r), with r the channel compression parameter; BN(·) denotes batch normalization; σ denotes the sigmoid function; and ⊙ denotes element-wise multiplication. The high-frequency feature f_hf is obtained by subtracting the local mean:

f_hf = f′ − AvgPool(f′; ks, s), with f′ = δ(BN(W_1 * f))

where f′ is the feature mapped by the W_1 convolution; AvgPool(·) denotes average pooling, used to obtain the average signal strength of local areas; ks and s denote the window size and stride, respectively; and δ denotes the nonlinear activation function, here ReLU.
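A minimal NumPy sketch of this LA branch follows. Batch normalization is omitted, the 1×1 bottleneck weights W_1/W_2 are random stand-ins, and the average pooling uses stride 1 with edge padding so the local mean can be subtracted shape-for-shape; those are assumptions on top of the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_pool_same(x, ks):
    """Mean over a ks x ks window, stride 1, edge padding, so the output
    keeps the spatial size of x (C, H, W)."""
    p = ks // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)), mode="edge")
    C, H, W = x.shape
    out = np.zeros_like(x)
    for i in range(ks):
        for j in range(ks):
            out += xp[:, i:i + H, j:j + W]
    return out / (ks * ks)

def local_attention(f, r=4, ks=3):
    """LA sketch: bottleneck 1x1 convs, high-frequency part by subtracting
    the local average, sigmoid gate, residual re-weighting."""
    C = f.shape[0]
    W1 = rng.standard_normal((C // r, C)) * 0.1   # compress C -> C/r
    W2 = rng.standard_normal((C, C // r)) * 0.1   # expand C/r -> C
    fp = np.maximum(np.einsum("oc,chw->ohw", W1, f), 0)  # delta = ReLU
    f_hf = fp - avg_pool_same(fp, ks)                    # high frequency
    attn = 1.0 / (1.0 + np.exp(-np.einsum("oc,chw->ohw", W2, f_hf)))
    return f + f * attn                                  # residual gating

f_la = local_attention(rng.standard_normal((16, 15, 15)))
```

The residual form `f + f * attn` guarantees the original feature survives even where the learned gate closes, matching the "protect the original features" motivation stated above.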
2) The LA module focuses on a fixed receptive field and aggregates local-area information through convolution, while the GA module aggregates global information of different receptive fields through the interaction of multi-layer features. Standard non-local attention captures dependencies between different positions through pixel-level interaction and thereby greatly improves performance on visual tasks, but it considers neither the information interaction between different network layers nor the importance of different receptive fields for mining semantic information. In view of this, standard non-local attention is extended here to cross-layer non-local attention, which aggregates multi-layer semantic information into the current layer through interaction across different receptive fields, thereby obtaining richer feature representations. For convenience of presentation and without loss of generality, for a set of l-layer features F = {x_1, x_2, ..., x_l} and any two layers x_i and x_j, three convolutional layers θ(·), φ(·) and g(·) first linearly map x_i to the 'query', 'key' and 'value' feature maps Q, K_1 and V_1:

Q = θ(x_i)
K_1 = φ(x_i)
V_1 = g(x_i)

Meanwhile, the feature x_j shares the convolutional layers φ(·) and g(·) to obtain the corresponding feature maps K_2 and V_2:

K_2 = φ(x_j)
V_2 = g(x_j)

Then the 'keys' and 'values' of each layer are concatenated to obtain global representations K and V of the multi-layer features, of spatial length S = l × H × W, where l denotes the total number of feature layers queried:

K = [φ(x_i) || φ(x_j)]
V = [g(x_i) || g(x_j)]

where [· || ·] denotes concatenation of the features along the spatial dimension.

The attention feature y_i is then derived from the standard non-local attention formula:

y_i = softmax_j(Q K^T) V

where softmax_j denotes the softmax operation along the j dimension.

Finally, a convolutional layer ξ(·) transforms y_i so that its channel number matches the original feature map, and adds it in residual fashion:

z_i = x_i + ξ(y_i)

For the template multi-layer features F_z, the i-th layer feature f_z^i updated through the GA module according to the above formulas is denoted F_GA, with the query taken from f_z^i and the keys and values aggregated over all template layers.
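The cross-layer aggregation just described can be sketched concretely in NumPy. The projection matrices stand in for the learned convolutions θ, φ, g and ξ (random here), and all layers are assumed to share one spatial size so the keys and values can be concatenated to length S = l × H × W.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_layer_attention(feats, i, d=8):
    """Update layer i of `feats` (a list of l maps, each (C, H, W)) by
    attending over keys/values concatenated from every layer."""
    C, H, W = feats[i].shape
    theta = rng.standard_normal((d, C)) * 0.1  # query projection (theta)
    phi = rng.standard_normal((d, C)) * 0.1    # shared key projection (phi)
    g = rng.standard_normal((d, C)) * 0.1      # shared value projection (g)
    xi = rng.standard_normal((C, d)) * 0.1     # output projection (xi)

    flat = [x.reshape(C, -1) for x in feats]         # (C, H*W) per layer
    Q = theta @ flat[i]                              # (d, H*W)
    K = np.concatenate([phi @ x for x in flat], 1)   # (d, S), S = l*H*W
    V = np.concatenate([g @ x for x in flat], 1)     # (d, S)
    attn = softmax(Q.T @ K, axis=1)                  # (H*W, S), softmax over j
    y = V @ attn.T                                   # (d, H*W)
    return feats[i] + (xi @ y).reshape(C, H, W)      # residual update

F_z = [rng.standard_normal((16, 7, 7)) for _ in range(3)]
f_ga = cross_layer_attention(F_z, i=1)
```

Because φ and g are shared across layers, adding more layers enlarges only S (the attended positions), not the parameter count, which is the point of the cross-layer extension.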
3) The LA and GA modules reduce channel redundancy to some extent through the channel compression parameters r and m, and respectively yield the attention features F_LA and F_GA under the local and global views. On this basis, the gating mechanism of the aggregation-gate module adaptively fuses F_LA and F_GA, thereby enhancing the effective representation of salient features. For the input features F_LA and F_GA, the two are first concatenated, a 1 × 1 convolutional layer learns their correlation, and a sigmoid function yields the normalized correlation matrix W_gate, taken as the weight matrix of F_LA; the weight matrix of F_GA is then 1 − W_gate. Feature weighting is realized by multiplying the weight matrices with the features element by element (Hadamard product), and the finally obtained optimized feature f̂_z^i is expressed as:

f̂_z^i = W_gate ⊙ F_LA + (1 − W_gate) ⊙ F_GA
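A minimal sketch of the aggregation gate follows. It assumes, as one plausible reading, a single-channel gate produced by a 1 × 1 convolution (random stand-in) over the concatenated features and broadcast across channels; the patent does not pin down the gate's channel count.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregation_gate(f_la, f_ga):
    """Blend local- and global-view features: a 1x1 conv over their
    concatenation gives W_gate for f_la; f_ga gets 1 - W_gate."""
    C = f_la.shape[0]
    w = rng.standard_normal((1, 2 * C)) * 0.1            # 1x1 conv, 2C -> 1
    cat = np.concatenate([f_la, f_ga], axis=0)           # (2C, H, W)
    w_gate = sigmoid(np.einsum("oc,chw->ohw", w, cat))   # (1, H, W) in (0, 1)
    return w_gate * f_la + (1.0 - w_gate) * f_ga         # element-wise blend

f_la = rng.standard_normal((16, 7, 7))
f_ga = rng.standard_normal((16, 7, 7))
f_opt = aggregation_gate(f_la, f_ga)
```

Since the two weights sum to one at every position, the output is a convex combination: each element of the fused feature lies between the corresponding LA and GA values, so neither view can be fully suppressed and amplified at once.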
s4, constructing a semantic-guided attention module SGA to realize coarse-grained feature optimization of the search areaAnd search area multi-layer featuresInputting the obtained data into SGA module to obtain coarse-grained optimized search region characteristicsThe specific implementation method comprises the following steps:
1) optimized features for GDA module outputThe SGA module extracts global semantic information of space dimensions layer by layer, generates a target semantic attention matrix and then compares the target semantic attention matrix with the characteristics of a search areaInteracting layer by layer to obtain coarse-grained optimized search region characteristicsIn particular, for the ith layer template featuresGenerated target semantic attention moment arrayExpressed as:
wherein GAP (-) represents the global average pooling in spatial dimension, and σ is sigmoid function.
2) The SGA module aggregates the multi-layer features with global-view attention, which shares parameters with the global-view attention in the GDA module to reduce the actual amount of computation. Specifically, through the interaction of the target semantic information with the "query", "key" and "value" features of the search area, the Q, K and V obtained for the i-th layer search-area feature x_i are expressed as:

wherein, in the above equations, i denotes the current layer and j denotes the other multi-layer features.
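One plausible reading of this layer-by-layer interaction, sketched under the assumption that the per-channel semantic attention from each template layer simply gates the corresponding search-region feature before the shared "query"/"key"/"value" mappings (the function name and this simplification are hypothetical, not the claimed equations):

```python
import numpy as np

def sga_modulate(template_feats, search_feats):
    """Coarse-grained optimisation: per-layer channel attention from the
    template modulates the corresponding search-region feature.

    template_feats, search_feats : lists of (C, H, W) arrays, one entry per
    backbone layer; the i-th template feature gates the i-th search feature.
    """
    out = []
    for z_i, x_i in zip(template_feats, search_feats):
        # semantic attention: sigmoid of the spatial mean, per channel
        a_i = 1.0 / (1.0 + np.exp(-z_i.mean(axis=(1, 2), keepdims=True)))
        out.append(a_i * x_i)  # broadcast (C, 1, 1) over (C, H, W)
    return out
```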
S5, a correlation-graph aggregation module CGA is constructed to realize fine-grained feature optimization of the search area; F_z' and F_x' are input into the CGA module to obtain the fine-grained optimized search-region features F_x''. The specific implementation method is as follows:

The CGA module takes F_z' and F_x' as input. On the one hand, it calculates the correlation between the spatial pixels of the search area and the global template; on the other hand, it calculates the local correlation based only on the salient features of the template. By fusing the global and local correlations and strengthening the relation between spatial positions with graph convolution, an attention map associated with the target feature is constructed.
1) For the i-th layer template optimized feature z_i' and search-area optimized feature x_i', the template is first sliced along the spatial and channel dimensions respectively to obtain the spatial features and channel features, where N_1 = H_1 × W_1. For a certain pixel in the search area, the correlation with the template spatial features is first calculated to obtain the spatial correlation map S_1, expressed as:

where Corr(·) is a correlation calculation function, here taken as an inner product.
Then, on the basis of S_1, retrieval of the global template information is realized by calculating the correlation with the channel features; the correlation map S_2 obtained at this point is expressed as:

For simplicity of description, the global correlation of a certain pixel in the search area with the template is expressed as:

where MaxPool(·) is the max-pooling operation, and ks and s denote the window size and stride, respectively.
The local correlation is then expressed as:
2) The two correlation maps are fused by element-wise addition, and their graph relations are constructed, thereby enhancing the association between positions. The resulting correlation map is added to the search-area features x_i' to obtain the finer-grained optimized feature x_i'', expressed as:

wherein GCN(·) is a two-layer graph convolutional network and ⊕ denotes element-wise addition; the fine-grained optimized search-region features F_x'' are thus obtained.
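The global correlation step can be illustrated as follows (a simplified NumPy sketch: the template is flattened into spatial features and inner products are taken with every search pixel; the per-pixel maximum stands in for the MaxPool-based selection of salient template responses and is not the exact claimed operation):

```python
import numpy as np

def correlation_maps(template, search):
    """Pixel-wise correlation between a template and a search region.

    template : (C, Hz, Wz), search : (C, Hx, Wx).
    S1[n, p] is the inner product of template spatial feature n with
    search pixel p (global correlation); `local` keeps, per search pixel,
    only the strongest template response (a stand-in for the salient
    local correlation).
    """
    c = template.shape[0]
    t = template.reshape(c, -1)   # (C, N1) template spatial features
    s = search.reshape(c, -1)     # (C, Hx*Wx) search pixels
    s1 = t.T @ s                  # (N1, Hx*Wx) inner-product correlation
    local = s1.max(axis=0)        # strongest template response per pixel
    return s1, local
```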
S6, a head prediction network is constructed; F_z' and F_x'' are input into it to predict the position of the target in the current frame.
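The head prediction network itself is not detailed here; Siamese trackers of this family (e.g. SiamCAR, used as the baseline below) typically correlate template and search features channel-wise before the classification and regression heads. A hedged NumPy sketch of that depth-wise cross-correlation (the function name is hypothetical):

```python
import numpy as np

def depthwise_xcorr(template, search):
    """Per-channel (depth-wise) cross-correlation of the template over the
    search feature, in 'valid' mode: each channel of the template slides
    over the same channel of the search map and responses are summed
    spatially, giving the response map a tracking head consumes."""
    c, hz, wz = template.shape
    _, hx, wx = search.shape
    oh, ow = hx - hz + 1, wx - wz + 1
    out = np.zeros((c, oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = search[:, i:i + hz, j:j + wz]
            out[:, i, j] = (patch * template).sum(axis=(1, 2))
    return out
```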
The present embodiment also provides a twin tracking system based on interactive and convergent feature optimization, comprising a memory, a processor and computer program instructions stored on the memory and executable by the processor, which when executed by the processor, implement the above-mentioned method steps.
In this embodiment, the OTB100 data set is used for comparison and verification. Fig. 2 shows the precision comparison between FRIA-Track and other target tracking methods under different attributes. Table 1 shows the success-rate comparison between the method proposed by the present invention and other target tracking methods on the OTB100 data set.
TABLE 1 comparison of the present invention with other target tracking methods
As can be seen from Fig. 2, the FRIA-Track method of the present invention achieves the best level under 8 attributes, and its performance under 10 attributes exceeds that of the baseline algorithm SiamCAR. As can be seen from Table 1, the method of the present invention achieves the best success rate compared with the other target tracking methods.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to preferred embodiments of the present invention; other and further embodiments may be devised without departing from its basic scope, which is determined by the claims that follow. Any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the present invention falls within the protection scope of the technical solution of the present invention.
Claims (7)
1. A twin tracking method based on interactive and convergent feature optimization is characterized by comprising the following steps:
s1, initializing a template image and a search area image;
S2, constructing a feature extraction network, inputting the template image and the search-area image, and acquiring the corresponding template multi-layer features F_z and search-area multi-layer features F_x;

S3, constructing a gated dual-view aggregation module GDA to optimize the template multi-layer features, and inputting F_z into the GDA module to obtain the optimized template multi-layer features F_z';

S4, constructing a semantics-guided attention module SGA to realize coarse-grained feature optimization of the search area, and inputting F_z' and the search-area multi-layer features F_x into the SGA module to obtain the coarse-grained optimized search-region features F_x';

S5, constructing a correlation-graph aggregation module CGA to realize fine-grained feature optimization of the search area, and inputting F_z' and F_x' into the CGA module to obtain the fine-grained optimized search-region features F_x''.
2. The twin tracking method based on interaction and convergent feature optimization according to claim 1, wherein the step S1 is implemented by:
cutting out a template image of size 3 × 127 × 127 from the first frame image according to the target ground-truth bounding box given in the first frame; and starting from the second frame, cutting out a search-area image of size 3 × 255 × 255 with the center coordinates of the target bounding box predicted in the previous frame as the reference point.
3. The twin tracking method based on interaction and convergent feature optimization according to claim 1, wherein the step S2 is implemented by:
adopting ResNet-50 as the feature extraction network, taking the template image and the search-area image as input, and acquiring the template multi-layer features F_z and the search-area multi-layer features F_x, where l represents the total number of layers of the extracted template or search-area features, z_i and x_i respectively represent the i-th layer template feature and search-area feature, and i ∈ [1, l].
4. The twin tracking method based on interactive and convergent feature optimization according to claim 1, wherein the specific implementation method of step S3 is:
the GDA module comprises three sub-modules: local-view attention LA, global-view attention GA, and an aggregation gate; the LA module is used for highlighting the high-frequency information of the local perspective; for a single-layer template feature z_i of size C × H × W, the local-view attention feature z_i^LA is expressed as:

wherein W_2 is a learnable convolution parameter whose size is determined by the channel compression parameter r; BN(·) represents batch normalization; σ represents the sigmoid function; ⊙ represents element-wise multiplication; the high-frequency feature is obtained by subtracting the local mean, expressed as:

in the formula, W_1 is a learnable convolution parameter, and the corresponding term is the feature map obtained through the W_1 convolution; AvgPool(·) denotes average pooling, used to obtain the average signal strength of a local area; ks and s denote the window size and stride, respectively; δ denotes the non-linear activation function, here ReLU;
The LA module focuses on a fixed receptive field and aggregates the information of a local area through convolution operations; the GA module is used to aggregate the global information of different receptive fields through the interaction of multi-layer features. For a set of l-layer features F = {x_1, x_2, ..., x_l} and arbitrary two-layer features x_i and x_j, three convolutional layers θ(·), φ(·) and g(·) are first used to linearly map x_i, obtaining the "query", "key" and "value" feature maps Q, K_1 and V_1, namely
Q=θ(x i )
K 1 =φ(x i )
V 1 =g(x i )
meanwhile, the feature x_j shares the convolutional layers φ(·) and g(·) to obtain the corresponding feature maps K_2 and V_2, namely
K 2 =φ(x j )
V 2 =g(x j )
Then, the "keys" and "values" of each layer are spliced together respectively to obtain the global representations K and V of the multi-layer features, where S = l × H × W and l denotes the total number of feature layers queried; the global features K and V are then expressed as:
K=[φ(x i )||φ(x j )]
V=[g(x i )||g(x j )]
wherein [·||·] represents splicing the features along the spatial dimension;
Thus, according to the standard non-local attention formula, the attention feature y_i is expressed as:

y_i = softmax_j(QK^T)V

where softmax_j(·) denotes the softmax operation along the j dimension. Finally, a convolutional layer ξ(·) transforms y_i so that its channel number matches the original feature map, and the result is added to the original feature map in residual form, obtaining:

x_i^GA = ξ(y_i) + x_i
for template multi-layer featuresAccording to the above formula, wherein the ith layer is characterizedUpdates through the GA Module are represented as:
Wherein, the first and the second end of the pipe are connected with each other,
the LA module and the GA module reduce channel redundancy to a certain extent through the channel compression parameters r and m, and obtain the attention features F^LA and F^GA under the local and global perspectives, respectively; on this basis, the gating mechanism of the aggregation gate module adaptively fuses F^LA and F^GA, thereby enhancing the effective representation of salient features; for the input features F^LA and F^GA, the two features are first spliced together, a 1 × 1 convolutional layer is used to learn the correlation between them, and a sigmoid function yields the normalized correlation matrix W_gate; letting W_gate denote the weight matrix of F^LA, the weight matrix of F^GA is expressed as 1 − W_gate; then, feature weighting is realized by multiplying the weight matrix with the features element by element, and the finally obtained optimized feature is expressed as:

F' = W_gate ⊙ F^LA + (1 − W_gate) ⊙ F^GA
5. the twin tracking method based on interaction and convergent feature optimization according to claim 1, wherein the step S4 is implemented by:
for the optimized features F_z' output by the GDA module, the SGA module extracts the global semantic information of the spatial dimensions layer by layer, generates a target semantic attention matrix, and then interacts with the search-area features F_x layer by layer to obtain the coarse-grained optimized search-region features F_x'; specifically, for the i-th layer template feature z_i', the generated target semantic attention matrix A_i is expressed as:

A_i = σ(GAP(z_i'))

wherein GAP(·) represents global average pooling over the spatial dimensions, and σ is the sigmoid function;
then, the SGA module aggregates the multi-layer features with global-view attention, which shares parameters with the global-view attention in the GDA module to reduce the actual amount of computation; through the interaction of the target semantic information with the "query", "key" and "value" features of the search area, the Q, K and V obtained for the i-th layer search-area feature x_i are expressed as:

wherein, in the above equations, i denotes the current layer and j denotes the other multi-layer features.
6. The twin tracking method based on interactive and convergent feature optimization according to claim 1, wherein the specific implementation method of step S5 is:
the CGA module takes F_z' and F_x' as input; on the one hand, the correlation between the spatial pixels of the search area and the global template is calculated; on the other hand, the local correlation is calculated based only on the salient features of the template; by fusing the global and local correlations and strengthening the relation between spatial positions with graph convolution, an attention map associated with the target feature is constructed; specifically, for the i-th layer template optimized feature z_i' and search-area optimized feature x_i', the template is first sliced along the spatial and channel dimensions respectively to obtain the spatial features and channel features, where N_1 = H_1 × W_1; for a certain pixel in the search area, the correlation with the template spatial features is first calculated to obtain the spatial correlation map S_1, expressed as:

wherein Corr(·) is a correlation calculation function, implemented as an inner product;
then, on the basis of S_1, retrieval of the global template information is realized by calculating the correlation with the channel features; the correlation map S_2 obtained at this point is expressed as:

the global correlation of a certain pixel in the search area with the template is then expressed as:

wherein MaxPool(·) is the max-pooling operation, and ks and s respectively denote the window size and stride;
the local correlation is then expressed as:
finally, the two correlation maps are fused by element-wise addition, and their graph relations are constructed, thereby enhancing the association between positions; the resulting correlation map is added to the search-area features x_i' to obtain the finer-grained optimized feature x_i'', expressed as:
7. A twin tracking system based on interactive and convergent feature optimization, comprising a memory, a processor and computer program instructions stored on the memory and executable by the processor, which, when executed by the processor, implement the method steps of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210600748.XA CN114926652A (en) | 2022-05-30 | 2022-05-30 | Twin tracking method and system based on interactive and convergent feature optimization |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210600748.XA CN114926652A (en) | 2022-05-30 | 2022-05-30 | Twin tracking method and system based on interactive and convergent feature optimization |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114926652A true CN114926652A (en) | 2022-08-19 |
Family
ID=82812296
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210600748.XA Pending CN114926652A (en) | 2022-05-30 | 2022-05-30 | Twin tracking method and system based on interactive and convergent feature optimization |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114926652A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115457259A (en) * | 2022-09-14 | 2022-12-09 | 华洋通信科技股份有限公司 | Image rapid saliency detection method based on multi-channel activation optimization |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Kejriwal et al. | High performance loop closure detection using bag of word pairs | |
CN111144364B (en) | Twin network target tracking method based on channel attention updating mechanism | |
CN106780631B (en) | Robot closed-loop detection method based on deep learning | |
Dusmanu et al. | Multi-view optimization of local feature geometry | |
Fu et al. | Fast ORB-SLAM without keypoint descriptors | |
CN112651262A (en) | Cross-modal pedestrian re-identification method based on self-adaptive pedestrian alignment | |
Hu et al. | Semantic SLAM based on improved DeepLabv3⁺ in dynamic scenarios | |
CN115439507A (en) | Three-dimensional video target tracking method based on multi-level mutual enhancement and relevant pyramid | |
Sehgal et al. | Lidar-monocular visual odometry with genetic algorithm for parameter optimization | |
Urdiales et al. | An improved deep learning architecture for multi-object tracking systems | |
CN114926652A (en) | Twin tracking method and system based on interactive and convergent feature optimization | |
Liu et al. | Learning optical flow and scene flow with bidirectional camera-lidar fusion | |
Zeng et al. | NCT: noise-control multi-object tracking | |
Tsintotas et al. | The revisiting problem in simultaneous localization and mapping | |
Huang et al. | Correlation-filter based scale-adaptive visual tracking with hybrid-scheme sample learning | |
Bazeille et al. | Combining odometry and visual loop-closure detection for consistent topo-metrical mapping | |
CN116797799A (en) | Single-target tracking method and tracking system based on channel attention and space-time perception | |
CN116245913A (en) | Multi-target tracking method based on hierarchical context guidance | |
CN115880332A (en) | Target tracking method for low-altitude aircraft visual angle | |
CN115830631A (en) | One-person one-file system construction method based on posture-assisted occluded human body re-recognition | |
Jiang et al. | Semantic closed-loop based visual mapping algorithm for automated valet parking | |
Zhang et al. | Rt-track: robust tricks for multi-pedestrian tracking | |
Tan et al. | Online visual tracking via background-aware Siamese networks | |
Cai et al. | Explicit invariant feature induced cross-domain crowd counting | |
CN116152298B (en) | Target tracking method based on self-adaptive local mining |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |