CN116630845A - Target tracking method combining hierarchical encoder and parallel attention mechanism - Google Patents


Info

Publication number
CN116630845A
CN116630845A (application CN202310488020.7A)
Authority
CN
China
Prior art keywords
features
feature
template
search
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310488020.7A
Other languages
Chinese (zh)
Inventor
符强
王阳
纪元法
孙希延
任风华
严素清
付文涛
黄建华
梁维彬
贾茜子
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanning Guidian Electronic Technology Research Institute Co ltd
Guilin University of Electronic Technology
Original Assignee
Nanning Guidian Electronic Technology Research Institute Co ltd
Guilin University of Electronic Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanning Guidian Electronic Technology Research Institute Co ltd, Guilin University of Electronic Technology filed Critical Nanning Guidian Electronic Technology Research Institute Co ltd
Priority to CN202310488020.7A priority Critical patent/CN116630845A/en
Publication of CN116630845A publication Critical patent/CN116630845A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of computer vision, and in particular to a target tracking method combining a hierarchical encoder and a parallel attention mechanism. The method comprises: preprocessing a video sequence to obtain an initial template image and an initial search image; performing feature extraction on the initial template image and the initial search image with an improved VGG16 feature-extraction backbone to obtain template features and search features; processing the template features and the search features through a parallel attention mechanism to obtain two weighted features; fusing the two features and inputting the result separately into the last two convolution blocks of the improved VGG16 backbone and into a DWConv-Transformer encoder, thereby obtaining the final template features and final search features; and, after a further pass through the parallel attention mechanism, performing cross-correlation convolution between the final template features and the final search features to obtain a similarity score map, the coordinates of whose maximum point are mapped back to the original frame to give the tracking result.

Description

Target tracking method combining hierarchical encoder and parallel attention mechanism
Technical Field
The invention relates to the technical field of computer vision, and in particular to a Siamese-network single-target tracking method combining a Transformer encoder and a parallel attention mechanism.
Background
As a popular research direction in computer vision, target tracking is a challenging task. By modeling the appearance and motion of a target, the technology predicts the target's trajectory and obtains its position at future moments, and it is therefore widely used in intelligent traffic monitoring, intelligent human-computer interaction, military reconnaissance, and other fields.
Prior work applied the Siamese network to the tracking task, converting tracking into a similarity-matching problem and greatly simplifying its solution. Other work introduced the region proposal network into the tracking framework, using it for foreground-background classification and bounding-box regression, which effectively improved the accuracy of the predicted bounding box. Some work improves model training by controlling the distribution of the training data set with an efficient sampling strategy, and some improves tracker performance by improving the feature-extraction backbone. However, most Siamese-network-based tracking methods use a single feature-processing pathway, and in complex tracking environments such as occlusion by foreign objects, illumination change, and fast motion, the tracker is prone to drift.
Disclosure of Invention
The invention aims to provide a target tracking method combining a hierarchical encoder and a parallel attention mechanism, so as to solve the problem that existing trackers drift in complex tracking environments. The method comprises the following steps:
preprocessing a video sequence to obtain an initial template image and an initial search image;
performing feature extraction on the initial template image and the initial search image with an improved VGG16 feature-extraction backbone to obtain template features and search features;
based on the template features and the search features, focusing through a parallel attention mechanism on the weight differences of the target across feature channels and on the dependencies between feature channels, to obtain channel features carrying channel-information weights and spatial features carrying position-information weights;
fusing the channel features carrying channel-information weights with the spatial features carrying position-information weights and inputting the fused features separately into the last two convolution blocks of the improved VGG16 backbone and into a DWConv-Transformer encoder for processing, to obtain final template features and final search features;
and, after processing through the parallel attention mechanism, performing cross-correlation convolution between the final template features and the final search features to obtain a similarity score map, and mapping the coordinates of the maximum point in the map back to the original frame to obtain the tracking result.
Wherein the preprocessing includes cropping and RGB-mean padding.
The feature extraction of the initial template image and the initial search image by using the improved feature extraction backbone network VGG16 to obtain template features and search features includes:
optimizing the original feature extraction backbone network VGG16 to obtain an improved feature extraction backbone network VGG16;
inputting the initial template image and the initial search image into the improved feature extraction backbone network VGG16 to obtain template features and search features.
The optimizing of the original VGG16 backbone to obtain the improved VGG16 feature-extraction backbone includes:
reducing the original five max-pooling layers of the VGG16 backbone to three, introducing a cropping module to crop the outermost features, and introducing dilated (hole) convolution in the fourth and fifth layers of the network, to obtain the improved VGG16 feature-extraction backbone.
The focusing, based on the template features and the search features, on the weight differences of the target across feature channels and on the dependencies between feature channels through a parallel attention mechanism, to obtain channel features carrying channel-information weights and spatial features carrying position-information weights, comprises the following steps:
inputting the template features and the search features into a channel attention mechanism, which adaptively assigns different weights to different feature channels according to how strongly each channel of the feature map responds to different target descriptors, thereby adjusting each channel's importance for different targets and obtaining channel features carrying channel-information weights;
and inputting the template features and the search features into a spatial attention mechanism, which describes the dependencies within the feature channels and obtains information weights for different spatial positions in the feature map, to obtain spatial features carrying position-information weights.
The step of inputting the template features and the search features into the channel attention mechanism, which adaptively assigns different weights to different feature channels according to how strongly each channel of the feature map responds to different target descriptors, thereby adjusting each channel's importance for different targets and obtaining channel features carrying channel-information weights, comprises the following steps:
applying two-dimensional adaptive average pooling to the input template features and search features to obtain a description matrix over the channel dimension;
processing the description matrix by matrix transformation and one-dimensional convolution to obtain a processing matrix;
and applying the channel-information weights, obtained from the processing matrix after a matrix transformation (reshape) and a sigmoid activation function, to the template features and the search features, to obtain the channel features carrying channel-information weights.
The step of inputting the template features and the search features into the spatial attention mechanism, which describes the dependencies within the feature channels and obtains information weights for different spatial positions in the feature map to obtain spatial features carrying position-information weights, comprises the following steps:
performing average pooling and max pooling on the input template features and search features to obtain a first pooled feature and a second pooled feature, respectively;
stacking the first pooled feature and the second pooled feature to obtain a stacked feature;
passing the template features and the search features through a standard convolution and multiplying the result with the stacked feature to obtain a product feature;
and passing the product feature through a convolution layer with a 7×7 kernel and a sigmoid activation function to obtain the spatial features carrying position-information weights.
Wherein the last two convolution blocks of the improved VGG16 backbone comprise three convolution layers with 3×3 kernels and two convolution layers with 1×1 kernels.
The step of fusing the channel features carrying channel-information weights with the spatial features carrying position-information weights and inputting the fused features separately into the last two convolution blocks of the improved VGG16 backbone and into a DWConv-Transformer encoder for processing, to obtain final template features and final search features, comprises the following steps:
fusing the channel features carrying channel-information weights with the spatial features carrying position-information weights, downsampling the fused features with a standard convolution, and inputting them into the DWConv-Transformer encoder, where convolution projection mapping is performed with depthwise-separable convolutions of preset stride and 3×3 kernel, and the query, key, and value are obtained through matrix transformation;
taking the query, key, and value as the input of a multi-head attention module, computing the similarity between the query and the key via the Einstein summation convention, obtaining the attention weight matrix after a Softmax function, and finally dot-multiplying the value with the attention weight matrix to obtain attention-weighted vectors;
processing the attention-weighted vectors with Layer Normalization and a multilayer perceptron (Multilayer Perceptron, MLP), residually adding the result to the vectors, and finally obtaining the processed features through matrix transformation;
fusing the channel features carrying channel-information weights with the spatial features carrying position-information weights and inputting the fused features into the last two convolution blocks of the improved VGG16 backbone to obtain output features;
and adding the processed features and the output features element-wise to obtain the final template features and final search features.
According to the target tracking method combining a hierarchical encoder and a parallel attention mechanism, an initial template image and an initial search image are obtained by preprocessing a video sequence; features are extracted from the two images with an improved VGG16 feature-extraction backbone to obtain template features and search features; through a parallel attention mechanism, the weight differences of the target across feature channels and the dependencies between feature channels are focused on, yielding channel features carrying channel-information weights and spatial features carrying position-information weights; these two features are fused and input separately into the last two convolution blocks of the improved VGG16 backbone and into a DWConv-Transformer encoder, giving the final template features and final search features; and, after a further pass through the parallel attention mechanism, cross-correlation convolution between the final template features and the final search features yields a similarity score map, whose maximum-point coordinates are mapped back to the original frame to give the tracking result. The method can track accurately in complex scenes such as illumination change and interference from similar objects.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a step diagram of a target tracking method combining a hierarchical encoder and a parallel attention mechanism provided by the present invention.
Fig. 2 is a schematic diagram of a target tracking method combining a hierarchical encoder and a parallel attention mechanism provided by the present invention.
Fig. 3 is a schematic block diagram of the DWConv-TransformerBlock encoder of fig. 2.
Fig. 4 is a schematic block diagram of the SAM of fig. 2.
Fig. 5 is a schematic block diagram of the CAM of fig. 2.
Fig. 6 is a flow chart of a method of object tracking combining a hierarchical encoder and a parallel attention mechanism provided by the present invention.
Fig. 7 is a comparison of the performance of the method provided by the present invention versus other methods on an OTB100 dataset.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
Referring to fig. 1 to 7, the present invention provides a target tracking method combining a hierarchical encoder and a parallel attention mechanism, comprising the following steps:
s1, preprocessing a video sequence to obtain an initial template image and an initial search image;
specifically, the preprocessing includes clipping and RGB mean filling.
The input video sequence is preprocessed by cropping and RGB-mean padding to obtain the initial template image and the search image. The specific operation is as follows: the first frame of the input video sequence is taken as the initial target template and cropped around the target center; when the crop window extends beyond the image boundary, the out-of-range region is filled with the constant per-channel RGB mean. The preprocessed template image has a resolution of 127×127 and a channel dimension of 3. The search image is obtained in the same way from the remaining frames of the video sequence, with a final resolution of 255×255 and a channel dimension of 3.
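The cropping-and-padding step above can be sketched as follows. This is an illustrative reconstruction, not the patent's code: the function and variable names are my own, and resizing to 127×127 / 255×255 is left out so the sketch stays self-contained.

```python
import numpy as np

def crop_with_mean_fill(frame, cx, cy, size):
    """Crop a size x size patch centered at (cx, cy) from an (H, W, 3) frame.
    Pixels that fall outside the frame are filled with the per-channel RGB
    mean, mirroring the 'cropping and RGB-mean padding' preprocessing."""
    h, w, c = frame.shape
    mean = frame.reshape(-1, c).mean(axis=0)
    # Start from a patch filled entirely with the RGB mean.
    patch = np.tile(mean, (size, size, 1)).astype(frame.dtype)
    half = size // 2
    x0, y0 = cx - half, cy - half                 # top-left of the crop window
    # Overlap between the crop window and the frame.
    fx0, fy0 = max(x0, 0), max(y0, 0)
    fx1, fy1 = min(x0 + size, w), min(y0 + size, h)
    if fx0 < fx1 and fy0 < fy1:
        patch[fy0 - y0:fy1 - y0, fx0 - x0:fx1 - x0] = frame[fy0:fy1, fx0:fx1]
    return patch
```

Centering the crop on a target near the image border exercises the mean-fill branch: the out-of-range corner of the patch takes the per-channel mean while the rest copies the frame.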
S2, performing feature extraction on the initial template image and the initial search image by utilizing an improved feature extraction backbone network VGG16 to obtain template features and search features;
VGG16 is a deep neural network for feature processing of input images;
the specific method is as follows:
s21, optimizing an original feature extraction backbone network VGG16 to obtain an improved feature extraction backbone network VGG16;
specifically, the original 5 layers of the original feature extraction backbone network VGG16 are reserved as 3 layers in a pooling mode, a clipping module is introduced to clip the features of the outermost layer, and cavity convolution is introduced into the fourth layer and the fifth layer of the network, so that the improved feature extraction backbone network VGG16 is obtained.
Optimization of VGG 16: 1) The original 5-layer maximum pooling is reserved as 3 layers, so that the problem that the performance of a tracker is affected due to too small resolution of output characteristics caused by too long network stride is avoided; 2) The cutting module is introduced to cut out the outermost layer characteristics influenced by the filling operation, so that the influence of potential position deviation caused by the filling operation is reduced; 3) Hole convolution is introduced in the fourth layer and the fifth layer of the network to relieve the problem of insufficient receptive field caused by network stride shortening.
Dilated convolution expands the receptive field exponentially without losing resolution. The dilated layers are stacked as

F_{i+1} = F_i *_{2^i} k_i,  i = 0, 1, …, n−2

where k_i is a 3×3 sliding filter, *_{2^i} denotes convolution with dilation rate 2^i, and F_i is the feature output after the i-th convolution. From this construction, the receptive field of each element in a layer is determined by the convolution result of the previous layer: after convolving F_i with a filter of dilation rate 2^i (i = 0, 1, 2, …), the receptive field of each element in F_{i+1} is (2^{i+2} − 1) × (2^{i+2} − 1).
Therefore, introducing dilated convolution into the backbone network VGG16 effectively alleviates the reduced receptive field caused by the shortened network stride.
S22, inputting the initial template image and the initial search image into the improved feature extraction backbone network VGG16 to obtain template features and search features.
Specifically, the template image and the search image, after the cropping and RGB-mean-padding preprocessing, are input into the VGG16 network. After the first three convolution blocks, the template features have a resolution of 11×11 with channel dimension 256, and the search features have a resolution of 27×27 with channel dimension 256. The first three convolution blocks comprise convolution layers with 3×3 kernels and three max-pooling layers.
S3, based on the template features and the search features, focusing through a parallel attention mechanism on the weight differences of the target across feature channels and on the dependencies between feature channels, to obtain channel features carrying channel-information weights and spatial features carrying position-information weights;
the specific method is as follows:
s31, inputting the template features and the search features into a channel attention mechanism, and adaptively endowing different feature channels with different weights according to the response degrees of the feature images to different target descriptors to adjust the importance degrees of the channels to different targets so as to obtain channel features containing channel information weights;
specifically, the two-dimensional adaptive average pooling operation is utilized to the inputted template features and the searching features to obtain a description matrix of the channel dimension; the description matrix is subjected to matrix transformation and one-dimensional convolution treatment to obtain a treatment matrix; and endowing the channel information weight obtained by the processing matrix after the matrix transformation reshape and sigmoid activation function with the template feature and the search feature to obtain the channel feature containing the channel information weight.
The channel attention mechanism (Channel Attention Mechanism, CAM) adaptively assigns different weights to different feature channels according to how strongly each channel of the feature map responds to different target descriptors, thereby adjusting each channel's importance for different targets. For an input feature F_c ∈ R^{C×h×w}, a description matrix of the channel dimension, a_c ∈ R^{C×1×1}, is first obtained by two-dimensional adaptive average pooling:

a_c = (1 / (h·w)) Σ_{i=1}^{h} Σ_{j=1}^{w} F_c(i, j)

where h and w are the height and width of the feature map and (i, j) indexes the pixel locations on the input feature map F_c. a_c is then processed by matrix transformation and one-dimensional convolution to obtain b_c ∈ R^{C×1×1}; the one-dimensional convolution performs information interaction across channels, and the number of interacting channels N is adaptively adjusted according to the number of input channels:

N = |log₂(C)/α + β/α|_odd

where C is the number of input feature channels, |x|_odd denotes the odd number nearest to x, α = 2, and β = 1.

The channel-information weights obtained from b_c after a matrix transformation (reshape) and a sigmoid activation function σ are applied to the input feature F_c, giving the final channel self-attention output F_c′ = σ(reshape(b_c)) ⊗ F_c.
S32, inputting the template features and the search features into a spatial attention mechanism, which describes the dependencies within the feature channels and obtains information weights for different spatial positions in the feature map, to obtain spatial features carrying position-information weights.
Specifically, average pooling and max pooling are performed on the input template features and search features to obtain a first pooled feature and a second pooled feature, respectively; the first and second pooled features are stacked to obtain a stacked feature; the template features and the search features are passed through a standard convolution and multiplied with the stacked feature to obtain a product feature; and the product feature is passed through a convolution layer with a 7×7 kernel and a sigmoid activation function to obtain the spatial features carrying position-information weights.
The spatial attention mechanism (Spatial Attention Mechanism, SAM) describes the dependencies within the feature channels and obtains information weights for different spatial positions in the feature map. For an input feature F_s ∈ R^{C×H×W}, where C, H, and W are the number of channels, the height, and the width of the input feature, average pooling and max pooling are first applied to F_s to obtain F_A and F_M. F_A and F_M are stacked to obtain a_s ∈ R^{(2C/r)×H×W}, where r is the channel reduction factor. F_s is passed through a standard convolution layer with a 1×1 kernel, and the result is multiplied element-wise with a_s to obtain b_s.

b_s is then passed through a convolution layer with a 7×7 kernel and a sigmoid activation function to obtain the spatial features carrying the position-information weights, where ω_ij denotes the information weight of position (i, j) on the spatial attention feature.

As shown in fig. 4, for an input feature F_in of size C×H×W, the spatial attention mechanism (Spatial Attention Mechanism, SAM) produces F_out of size 1×H×W; the specific operation can be written as

F_out = σ(conv_{7×7}(Cat(F_A, F_M) ⊗ conv_{1×1}(F_in)))

where conv_{7×7} and conv_{1×1} denote the convolution layers with 7×7 and 1×1 kernels, respectively, σ denotes the sigmoid function, and Cat denotes feature stacking.
S4, fusing the channel features carrying channel-information weights with the spatial features carrying position-information weights, and inputting the fused features separately into the last two convolution blocks of the improved VGG16 backbone and into a DWConv-Transformer encoder for processing, to obtain final template features and final search features;
the specific method is as follows:
s41, fusing the channel characteristics containing the channel information weight and the spatial characteristics containing the position information weight, performing downsampling through a standard convolution, inputting the fused channel characteristics and the spatial characteristics containing the position information weight into a DWConv-transform encoder, performing convolution projection mapping in the DWConv-transform encoder by using depth separable convolution with a preset step length and a convolution kernel size of 3 multiplied by 3, and obtaining query, key and value through matrix transformation;
the modified transducer structure was named DWConv-transducer.
Specifically, the features processed by the first parallel attention module as shown in fig. 3 are input to the DWConv-transform encoder through a convolution layer as follows:
using step size S for features of parallel attention processing, respectively q =1,S v =S k =2, a convolution kernel size of 3×3 depth divisibleThe deconvolution projection mapping is carried out, the query Q, the key K and the value V are obtained through matrix transformation, and the calculation process is as follows:
where DWConv represents a depth-separable convolution, conv 3×3 Representing a standard convolution, F representing the input features.
S42, taking the query, the key and the value as input of a multi-head attention module, calculating the similarity between the query and the key in a manner of Einstein summation convention, obtaining an attention weight matrix after Softmax, and finally carrying out dot multiplication on the value and the attention weight matrix to obtain a vector with attention weight;
Specifically, Q, K and V are used as the inputs of the multi-head attention module; the similarity between Q and K is calculated via the Einstein summation convention, the attention weight matrix is obtained after Softmax, and finally the attention-weighted vector is obtained by dot multiplication of the value V with the attention weight matrix. The calculation process is as follows:

Attention(Q, K, V) = Softmax(QK^T / sqrt(d_k)) V

where d_k is the scaling factor of the attention weight matrix.
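A minimal single-head sketch of this computation (the multi-head split into sub-dimensions is omitted for brevity) can be expressed directly with Einstein summation notation; shapes follow the projection step above and are assumptions:

```python
import torch

def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention via Einstein summation.
    q: (B, Nq, d), k: (B, Nk, d), v: (B, Nk, d)."""
    d_k = q.shape[-1]
    scores = torch.einsum('bqd,bkd->bqk', q, k) / d_k ** 0.5  # Q-K similarity
    weights = torch.softmax(scores, dim=-1)                   # attention weight matrix
    return torch.einsum('bqk,bkd->bqd', weights, v)           # weighted values

q = torch.randn(1, 441, 256)
k = torch.randn(1, 121, 256)
v = torch.randn(1, 121, 256)
out = attention(q, k, v)   # attention-weighted vector, (1, 441, 256)
```

Note that the output keeps the query token count (441) even though K and V were downsampled, so the spatial resolution of the feature is preserved.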
S43, processing the attention-weighted vector through Layer Normalization and a multi-layer perceptron (Multilayer Perceptron, MLP), performing residual addition with the vector, and finally obtaining processing features through matrix transformation;
specifically, the vector obtained by the multi-head attention module is processed by Layer Normalization and a multi-layer perceptron (MLP) and then subjected to residual addition; finally, matrix transformation yields processing features with the same resolution and channel dimension as the original features input into the DWConv-Transformer encoder.
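The tail of the encoder block described in S43 can be sketched as follows; the 4× hidden expansion of the MLP and the GELU activation are common Transformer defaults and are assumptions here, not taken from the text:

```python
import torch
import torch.nn as nn

class EncoderTail(nn.Module):
    """LayerNorm -> MLP, residual addition with the attention output,
    then matrix transformation back to a (B, C, H, W) feature map."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        tokens = tokens + self.mlp(self.norm(tokens))  # residual addition
        # reshape tokens back to the input resolution and channel dimension
        return tokens.transpose(1, 2).reshape(tokens.shape[0], -1, h, w)

x = torch.randn(1, 441, 256)           # attention-weighted token sequence
feat = EncoderTail(256)(x, 21, 21)     # (1, 256, 21, 21) processing features
```

The output matches the resolution and channel dimension of the feature that entered the encoder, which is what allows the later element-wise addition with the convolution branch in S45.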
S44, fusing the channel features containing the channel information weights and the spatial features containing the position information weights, and inputting the fused features into the latter two-layer convolution block of the improved feature extraction backbone network VGG16 to obtain output features;
specifically, the features subjected to parallel attention processing are input into the latter two-layer convolution block of the backbone network VGG16, obtaining template features with a resolution of 5×5 and a channel dimension of 256 and search features with a resolution of 21×21 and a channel dimension of 256, where the latter two-layer convolution block comprises three convolution layers with a 3×3 convolution kernel and two convolution layers with a 1×1 convolution kernel.
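The latter two-layer convolution block can be sketched as below. The text specifies only the layer counts (three 3×3 and two 1×1 convolutions) and the 256-channel output; the interleaving order, intermediate widths and ReLU activations are assumptions:

```python
import torch
import torch.nn as nn

# Hedged sketch of the "latter two-layer convolution block": three 3x3
# conv layers and two 1x1 conv layers ending at 256 channels. Padding=1
# on the 3x3 layers keeps the spatial resolution unchanged.
tail_block = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 1),            nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 1),            nn.ReLU(inplace=True),
    nn.Conv2d(256, 256, 3, padding=1),
)

search = torch.randn(1, 256, 21, 21)   # fused attention features (search branch)
out = tail_block(search)               # (1, 256, 21, 21): resolution preserved
```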
And S45, performing element addition of the processing features and the output features to obtain the final template features and the final search features.
Specifically, the features obtained by the multi-layer DWConv-Transformer encoder and the features output by the latter two-layer convolution block of VGG16 are element-wise added to obtain the final template features and final search features.
S5, processing the final template feature with the parallel attention mechanism and then performing cross-correlation convolution with the final search feature to obtain a similarity score map of the two, and mapping the coordinates of the maximum point in the similarity score map back to the original image to obtain the tracking result.
Specifically, the final template feature obtained in step S4 is processed by the parallel attention mechanism to increase the characterization capability of the template; cross-correlation convolution is then performed with the final search feature to obtain a similarity score map of the two, and the coordinates of the maximum point in the similarity score map are mapped back to the original image to obtain the tracking result.
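The cross-correlation step can be sketched with `conv2d`, using the template feature as the convolution kernel in the style of Siamese trackers such as SiamFC; the feature shapes follow the 5×5 / 21×21 resolutions stated in S44 and the single-sample batch is an assumption:

```python
import torch
import torch.nn.functional as F

def xcorr(template: torch.Tensor, search: torch.Tensor) -> torch.Tensor:
    """Cross-correlation convolution: slide the template feature over the
    search feature to produce a similarity score map."""
    return F.conv2d(search, template)   # template acts as the kernel

z = torch.randn(1, 256, 5, 5)     # final template feature
x = torch.randn(1, 256, 21, 21)   # final search feature
score = xcorr(z, x)               # (1, 1, 17, 17) similarity score map
peak = score.flatten().argmax()   # index of the maximum point, to be
                                  # mapped back to original-image coordinates
```

Each position of the 17×17 score map corresponds to one placement of the 5×5 template within the 21×21 search region, which is why the maximum-point coordinates can be mapped back to a location in the original image.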
The invention designs a target tracking method combining a hierarchical encoder and a parallel attention mechanism, whose key technical means are mainly as follows:
1. VGG16 is improved to serve as the feature extraction backbone network; a clipping block is designed to eliminate the potential position deviation caused by padding operations in the network, and dilated convolution is combined to solve the problem of an insufficient receptive field caused by shortening the network stride.
2. The invention designs a parallel attention mechanism that focuses on the weight differences of targets across different feature channels and the dependency relationships between feature channels, so as to adjust the attention center of the tracker and enhance the expression capability of the features.
3. The invention utilizes the capability of the hierarchical Transformer structure to capture long-range temporal context information and the capability of the convolution network to capture local context information to improve the robustness of the tracker in complex environments; the linear mapping in the Transformer encoder is replaced with convolution mapping, and standard convolution is replaced with depth-separable convolution, thereby reducing the introduction of extra parameters.
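The parameter saving from swapping a standard 3×3 convolution for a depth-separable one can be checked numerically; the 256-channel width below is an illustrative assumption:

```python
import torch.nn as nn

# Standard 3x3 convolution, 256 -> 256 channels.
std = nn.Conv2d(256, 256, 3, padding=1)

# Depth-separable equivalent: depthwise 3x3 (groups=channels) + pointwise 1x1.
dws = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1, groups=256),
    nn.Conv2d(256, 256, 1),
)

n_std = sum(p.numel() for p in std.parameters())  # 256*256*9 + 256 = 590080
n_dws = sum(p.numel() for p in dws.parameters())  # 2304+256 + 65536+256 = 68352
```

At this width, the depth-separable version uses roughly one eighth of the parameters of the standard convolution, consistent with the stated goal of reducing extra parameters.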
4. The proposed target tracking method combining a hierarchical encoder and a parallel attention mechanism is evaluated on the public dataset OTB100; its tracking precision (Precision) and tracking success rate (AUC) are significantly improved and superior to those of mainstream algorithms such as SiamFC, SiamRPN and DaSiamRPN. The results are shown in fig. 7, where Ours denotes the performance obtained by the method provided by the invention.
The above disclosure is merely a preferred embodiment of the target tracking method combining a hierarchical encoder and a parallel attention mechanism of the present invention; it should be understood that the scope of the present invention is not limited thereto, and those skilled in the art will appreciate that equivalent modifications of all or part of the above embodiments still fall within the scope defined by the claims of the present invention.

Claims (9)

1. A method of object tracking combining a hierarchical encoder and a parallel attention mechanism, comprising the steps of:
preprocessing a video sequence to obtain an initial template image and an initial search image;
performing feature extraction on the initial template image and the initial search image by utilizing an improved feature extraction backbone network VGG16 to obtain template features and search features;
focusing the weight difference of the target on different characteristic channels and the dependency relationship between the characteristic channels through a parallel attention mechanism based on the template characteristics and the search characteristics to obtain channel characteristics containing channel information weights and spatial characteristics containing position information weights;
the channel features containing the channel information weights and the spatial features containing the position information weights are fused and then respectively input into the latter two-layer convolution block and the DWConv-Transformer encoder of the improved feature extraction backbone network VGG16 for processing, so that final template features and final search features are obtained;
and processing the final template feature with the parallel attention mechanism and then performing cross-correlation convolution with the final search feature to obtain a similarity score map of the two, and mapping the coordinates of the maximum point in the similarity score map back to the original image to obtain the tracking result.
2. The method of object tracking combining a hierarchical encoder and a parallel attention mechanism of claim 1,
the preprocessing includes clipping and RGB mean filling.
3. The method of object tracking combining a hierarchical encoder and a parallel attention mechanism of claim 2,
the feature extraction of the initial template image and the initial search image by using the improved feature extraction backbone network VGG16 to obtain template features and search features includes:
optimizing the original feature extraction backbone network VGG16 to obtain an improved feature extraction backbone network VGG16;
inputting the initial template image and the initial search image into the improved feature extraction backbone network VGG16 to obtain template features and search features.
4. The method for object tracking combining a hierarchical encoder and a parallel attention mechanism of claim 3,
the optimizing the original feature extraction backbone network VGG16 to obtain an improved feature extraction backbone network VGG16 includes:
and reducing the original 5 max-pooling layers of the original feature extraction backbone network VGG16 to 3 layers, introducing a clipping module to clip the outermost features, and introducing dilated convolution in the fourth and fifth layers of the network, to obtain the improved feature extraction backbone network VGG16.
5. The method for object tracking combining a hierarchical encoder and a parallel attention mechanism of claim 4,
focusing the dependency relationship between the weight difference of the target in different characteristic channels and the characteristic channels through a parallel attention mechanism based on the template characteristics and the search characteristics to obtain channel characteristics containing channel information weights and spatial characteristics containing position information weights, wherein the method comprises the following steps:
inputting the template features and the search features into a channel attention mechanism, and adaptively endowing different feature channels with different weights according to the response degrees of the feature images of the different channels to different target descriptors so as to adjust the importance degrees of the channels to different targets and obtain channel features containing channel information weights;
and inputting the template features and the search features into a spatial attention mechanism, describing the dependency relationship between the internal information of the feature channels, and acquiring the information weights of different spatial positions in the feature map to obtain the spatial features containing the position information weights.
6. The method for object tracking combining a hierarchical encoder and a parallel attention mechanism of claim 5,
the step of inputting the template features and the search features into a channel attention mechanism, adaptively giving different weights to different feature channels according to the response degrees of the different channels of a feature map to different target descriptors to adjust the importance degrees of the channels to different targets, and obtaining channel features containing channel information weights, comprising:
obtaining a description matrix of channel dimensions by utilizing two-dimensional self-adaptive average pooling operation on the inputted template features and the searching features;
the description matrix is subjected to matrix transformation and one-dimensional convolution treatment to obtain a treatment matrix;
and endowing the channel information weight obtained by the processing matrix after the matrix transformation reshape and sigmoid activation function with the template feature and the search feature to obtain the channel feature containing the channel information weight.
7. The method for object tracking combining a hierarchical encoder and a parallel attention mechanism of claim 6,
the step of inputting the template features and the search features into a spatial attention mechanism, describing the dependency relationship between the internal information of the feature channels, and obtaining the information weights of different spatial positions in the feature map to obtain the spatial features containing the position information weights, comprises the following steps:
respectively carrying out average pooling operation and maximum pooling operation on the inputted template features and the searching features to obtain a first pooling feature and a second pooling feature;
stacking the first pooled feature and the second pooled feature to obtain a stacked feature;
multiplying the template feature and the search feature with the stacking feature through a standard convolution to obtain a product feature;
and the product characteristic is subjected to a convolution layer with the convolution kernel size of 7 multiplied by 7 and a sigmoid activation function to obtain the spatial characteristic containing the position information weight.
8. The method for object tracking combining a hierarchical encoder and a parallel attention mechanism of claim 7,
the latter two-layer convolution block of the improved feature extraction backbone network VGG16 comprises 3 layers of convolution layers with a convolution kernel size of 3 x 3 and 2 layers of convolution layers with a convolution kernel size of 1 x 1.
9. The method for object tracking combining a hierarchical encoder and a parallel attention mechanism of claim 8,
the step of fusing the channel features containing the channel information weights and the spatial features containing the position information weights and then respectively inputting them into the latter two-layer convolution block and the DWConv-Transformer encoder of the improved feature extraction backbone network VGG16 for processing, to obtain final template features and final search features, comprises the following steps:
the channel features containing the channel information weights and the spatial features containing the position information weights are fused, downsampled through a standard convolution, and then input into the DWConv-Transformer encoder; convolution projection mapping is performed in the DWConv-Transformer encoder using depth-separable convolution with a preset step size and a 3×3 convolution kernel, and the query, key and value are obtained through matrix transformation;
taking the query, the key and the value as inputs of a multi-head attention module, calculating the similarity between the query and the key in a manner of Einstein summation convention, obtaining an attention weight matrix after Softmax, and finally performing dot multiplication on the value and the attention weight matrix to obtain a vector with attention weight;
processing the attention-weighted vector through Layer Normalization and a multi-layer perceptron, performing residual addition with the vector, and finally obtaining processing features through matrix transformation;
the channel characteristics containing the channel information weight and the spatial characteristics containing the position information weight are fused and then input into a later two-layer convolution block of the improved characteristic extraction backbone network VGG16, so that output characteristics are obtained;
and carrying out element addition on the processing features and the output features to obtain final template features and final search features.