CN116563355A - Target tracking method based on space-time interaction attention mechanism - Google Patents
Target tracking method based on space-time interaction attention mechanism
- Publication number
- CN116563355A (application number CN202310523575.0A)
- Authority
- CN
- China
- Prior art keywords
- feature
- space
- query
- frame
- attention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/337—Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods involving reference images or patches
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/048—Activation functions
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06V10/766—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
- G06V10/771—Feature selection, e.g. selecting representative features from a multi-dimensional feature space
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06T2207/10016—Video; Image sequence
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention belongs to the technical field of target tracking in computer vision, and in particular relates to a target tracking method based on a space-time interaction attention mechanism, which comprises the following steps. Step 1, feature extraction: acquire query image data and template image data of the target to be tracked and extract the corresponding feature information. Step 2, space-time feature enhancement: enhance the memory frame image features and the query frame image features with an attention mechanism. Aiming at the problem that tracking performance degrades when the target undergoes occlusion, deformation, background interference and other complex conditions, the invention introduces a feature enhancement module that uses temporal attention, spatial attention and a self-attention mechanism to weight, respectively, the temporal features and the spatial features of the memory frame images and the query frame image features, thereby strengthening the expressive power of the memory frames and the query frame, making the features richer, and improving the robustness and accuracy of target tracking.
Description
Technical Field
The invention relates to the technical field of target tracking in computer vision, and in particular to a target tracking method based on a space-time interaction attention mechanism.
Background
Target tracking is one of the important research directions in the field of computer vision and is widely applied in intelligent driving, unmanned driving, human-computer interaction and other fields. The target tracking task needs to maintain accurate tracking of an object in a continuous video sequence, so as to obtain the complete motion trajectory of the object and to calculate its position and size in different image frames. The development of target tracking technology plays an irreplaceable role in high-level video processing tasks such as behavior understanding, reasoning and decision making, and is also a basic technology for target recognition, behavior analysis, video compression coding, video understanding and the like. Although research on target tracking has advanced significantly over the past few years and many efficient algorithms have emerged to address challenging problems in various scenarios, many difficulties remain, such as target occlusion, illumination variation, scale change and background interference, so target tracking research is still a demanding task. To solve these problems, a more accurate and robust tracker needs to be proposed.
In 2016, with the advent of SiamFC, tracking frameworks based on the Siamese network became the mainstream framework for single-target tracking algorithms. Subsequently, SiamRPN introduced a region proposal network into the Siamese network and obtained excellent tracking results on several benchmarks. SiamCorners introduced an improved corner pooling layer on top of the Siamese network, converting bounding-box prediction into the prediction of a pair of diagonal corners, with good performance. Most current Siamese-network-based trackers use the initial frame as the template, a strategy that carries some risk: when the tracked object is severely deformed or occluded, the target cannot be tracked well. To improve this, some trackers introduce a template update mechanism or use multiple templates, which can enhance robustness to some extent, but the benefit is limited and inevitably increases the computational cost. In addition, these trackers use only the appearance information of the memory frames and do not fully exploit the rich temporal context information in the historical frame sequence. Meanwhile, Siamese-network-based tracking algorithms do not pay attention to the inter-frame and intra-frame correlations of the video sequence, so the target cannot form the corresponding associations in time and space.
Disclosure of Invention
(I) Technical problems to be solved
Aiming at the shortcomings of the prior art, the invention provides a target tracking method based on a space-time interaction attention mechanism, which solves the problems that it is difficult to establish associations across spatio-temporal context information and that tracking performance degrades when the target undergoes challenges such as deformation and occlusion.
(II) Technical solution
To achieve the above purpose, the invention adopts the following technical solution:
A target tracking method based on a space-time interaction attention mechanism comprises the following steps:
step 1: extracting features; acquiring query image data and template image data of a target to be tracked and extracting corresponding characteristic information;
step 2: enhancing space-time characteristics; enhancing the memory frame image characteristics and the query frame image characteristics by using an attention mechanism;
step 3: constructing a space-time interaction model; performing space-time interaction between the memory frame branch and the query frame branch to obtain the corresponding interaction weights, and using these weights to perform a secondary screening of the enhanced memory frame and query frame feature information, so as to obtain features more beneficial to tracking;
step 4: updating a template; using a space-time memory network to update the template;
step 5: inputting the response map into a classification-regression network, and training the whole network model according to the loss function, so as to track the target in the video.
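The five steps above can be wired together as in the following minimal sketch (PyTorch-style, for illustration only; every function name and argument below is a placeholder introduced for this sketch, not a name used by the invention):

```python
# Illustrative wiring of steps 1-5; the callables passed in are hypothetical
# stand-ins for the modules described in the following paragraphs.
def track_step(backbone, enhance_memory, enhance_query, interact, read_memory, head,
               memory_frames, label_maps, query_frame):
    f_m = backbone(memory_frames, label_maps)   # step 1: memory-frame features F_m
    f_q = backbone(query_frame)                 # step 1: query-frame feature F_q
    f_m_e = enhance_memory(f_m)                 # step 2: temporal + spatial attention -> F_m'
    f_q_e = enhance_query(f_q)                  # step 2: self-attention -> F_q'
    f_m_i, f_q_i = interact(f_m_e, f_q_e)       # step 3: space-time interaction -> F_m'', F_q''
    y = read_memory(f_m_i, f_q_i)               # step 4: memory reader, composite feature map y
    return head(y)                              # step 5: classification / centerness / regression maps
```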
Further, the feature extraction in step 1 is performed as follows. The query frame image, the memory frame images and the label maps are first obtained through data preprocessing. In the memory frame branch, GoogleNet is used as the backbone network to extract the memory frame feature F_m: the first convolution layer of the backbone and an additional convolution layer g process the memory frame and the label map respectively, their outputs are added and fed into the remaining backbone layers, and T memory frame feature maps are generated in this way; finally a nonlinear convolution layer reduces the feature dimension of F_m to 512. In the query branch, the query image is input to obtain the feature map F_q; the overall structure is the same as that of the memory frame branch but the parameters are different, and a nonlinear convolution layer finally yields the feature F_q with feature dimension 512.
Further, the space-time feature enhancement module for the memory frame images in step 2 mainly includes a temporal attention module and a spatial attention module. The temporal attention module is mainly used to enhance the temporal features of the sequence, emphasizing important temporal information by weighting and filtering out irrelevant information; the spatial attention module is mainly used to enhance the spatial features, weighting the target region and the background region of the image so that attention is focused on the region where the target is located.
Further, the temporal attention module in step 2 performs a weighted average of the sequence features with different weights at different moments of the sequence to obtain a temporal feature representation. First, the memory frame image feature F_m extracted by the backbone network is split into T feature vectors {f_1, f_2, ..., f_T}, where T is the sequence length. Then, the feature f_t of each time step undergoes one linear transformation that maps it into a new feature space, the attention coefficients are computed with a SoftMax function, and the features are weighted and fused according to these coefficients to obtain the weighted feature vector F_m1.
Further, the spatial attention module in step 2 first takes the template frame feature F_m obtained from the feature extraction network and compresses the image feature through a global average pooling layer and a max pooling layer in the network model, keeping the first dimension B of the image feature unchanged and obtaining feature vectors of size 1×C, where C is the number of channels. The two pooled results are then concatenated and fed into a 3×3 convolution layer to perform a convolution with kernel size 3, which raises the dimension of the compressed feature vectors. Finally a Sigmoid function is applied so that the output is constrained between 0 and 1, i.e. a spatial channel weight is generated for each spatial channel; the feature output by the spatial attention model is denoted F_m2. Finally, the temporal attention and the spatial attention in the space-time feature enhancement module weight the temporal feature and the spatial feature of the memory frame image respectively, and the memory frame feature map is obtained as:
F_m' = concat(F_m1, F_m2)
Further, in step 2 the query frame image features are enhanced: each element vector of the input sequence of query image features is automatically weighted and summed with the other vectors, the similarity between features is then computed, key features are extracted, and the weight of each feature is recomputed according to these key features to obtain a more enhanced feature representation. The specific operations are as follows:
S21. The feature matrix F_q obtained from the backbone network first undergoes three linear-transformation convolution operations and is projected into three different feature spaces, giving three feature representations: Query, Key and Value.
S22. The Query and the Key are multiplied (with transposition) to obtain a matrix energy that describes their similarity.
S23. The energy matrix is normalized with the SoftMax function so that each feature point is assigned a weight between 0 and 1, giving the attention weight coefficient matrix attention1, which mainly serves to highlight the feature points more important to the model.
S24. Finally, all Value vectors are weighted and summed with the weight coefficients attention1 to obtain the final weight vectors, all of which are reconstructed into a new feature matrix following the shape of the original feature matrix; the feature output by the query image feature enhancement module is denoted F_q'.
Furthermore, the space-time interaction model in the step 3 mainly performs space-time interaction on the feature information on the memory frame branch and the query frame branch processed by the feature enhancement module to obtain a feature representation more favorable for the tracking task, and the specific operation steps are as follows:
S31. The memory frame feature F_m' and the query frame feature F_q' obtained in step 2 are taken as the inputs of the space-time interaction attention module and are processed by a reshape transformation into feature matrices of the same spatial size, B×C×H×W.
S32. Three convolution layers apply the Query and Key transformations to the memory frame branch F_m' and the Value transformation to the query frame branch F_q'. The Query and the Key are multiplied (with transposition) to obtain a matrix describing their correlation. The similarity matrix is then normalized with the SoftMax function so that each feature point obtains a weight between 0 and 1, giving the attention weight attention21. This attention weight is used to weight all Value vectors of the query frame F_q', yielding the weighted memory frame feature F_q,m. Finally, the weighted feature is spliced with the memory frame branch to obtain the space-time interactive memory frame feature F_m'', with the formula:
F_m'' = concat(F_q,m, F_m') = F_q,m + F_m'
S33. Three convolution layers apply the Query and Key transformations to the query frame branch F_q' and the Value transformation to the memory frame branch F_m'. The Query and the Key are multiplied (with transposition) to obtain a matrix describing their correlation. The similarity matrix is then normalized with the SoftMax function so that each feature point obtains a weight between 0 and 1, giving the attention weight attention22. This attention weight is used to weight all Value vectors of the memory frame F_m', yielding the weighted query feature F_m,q. Finally, the weighted feature is spliced with the query frame branch to obtain the space-time interactive query frame feature F_q'', with the formula:
F_q'' = concat(F_m,q, F_q') = F_m,q + F_q'
Further, the template updating mechanism in step 4 uses a space-time memory network (memory reader) and updates the target template mainly with the historical information of the memory frames. The similarity between each pixel of the memory frame feature F_m'' output in step 3 and the query frame feature F_q'' is first computed to obtain a similarity matrix; the similarity matrix serves as a soft weight map and is multiplied with the memory frame F_m'', adaptively retrieving the information stored in the memory frames and selecting from it the information most useful to the query frame, thereby updating the template. Finally, the retrieved information is concatenated with the query frame feature F_q'' along the channel dimension to generate the final composite feature map y.
Further, in step 5 the classification-regression network produces the response maps. A lightweight classification convolution network ω_cls encodes the feature map y obtained in step 4, and a linear convolution layer with a 1×1 kernel reduces the dimension of the ω_cls output to 1, giving the final classification response map R_cls ∈ R^(1×H×W). The classification network also computes a centerness response, denoted R_ctr ∈ R^(1×H×W); in the inference phase, R_ctr is multiplied with R_cls to suppress the classification confidence scores of pixels far from the target center. In the regression task, a lightweight convolution network ω_reg encodes the feature map y obtained in step 4 and reduces the output feature dimension to 4, generating the regression response map R_reg ∈ R^(4×H×W) used for bounding box estimation.
Further, the loss function in step 5 comprises a classification loss L_cls using the cross-entropy loss function, a regression loss L_reg using the IoU loss function, and a centerness loss L_ctr using the Sigmoid cross-entropy loss function. The total loss function is expressed as:
L = L_cls + λ1·L_reg + λ2·L_ctr
where λ1 and λ2 are both hyper-parameters.
(III) Beneficial effects
Compared with the prior art, the invention provides a target tracking method based on a space-time interaction attention mechanism with the following beneficial effects:
Aiming at the problem that tracking performance degrades when the target undergoes occlusion, deformation, background interference and other complex conditions, the invention introduces a feature enhancement module that uses temporal attention, spatial attention and a self-attention mechanism to weight, respectively, the temporal features and the spatial features of the memory frame images and the query frame image features. This strengthens the expressive power of the memory frames and the query frame, makes the features richer, and thereby improves the robustness and accuracy of target tracking.
The invention also introduces a space-time interaction module that performs spatio-temporal information interaction between the two enhanced branch features, realizing a secondary screening of the features. The information of the memory frames can thus be fully used and dynamically combined with the information in the query frame, so that the target is tracked more accurately, effectively alleviating complex problems such as the difficulty the model has in establishing associations across spatio-temporal context information.
Drawings
FIG. 1 is a block diagram of the overall network architecture of the present invention;
FIG. 2 is a diagram of a feature enhancement module architecture of the present invention;
FIG. 3 is a diagram of a space-time interactive module architecture of the present invention;
FIG. 4 is a template update flow diagram;
FIG. 5 shows the evaluation results on the GOT-10k test dataset.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples
As shown in FIGS. 1-5, a target tracking method based on a space-time interaction attention mechanism according to an embodiment of the present invention includes the following steps:
step 1: extracting features; query image data and template image data of a target to be tracked are obtained, and corresponding characteristic information is extracted.
Step 2: enhancing space-time characteristics; and enhancing the memory frame image characteristics and the query frame image characteristics by using an attention mechanism.
Step 3: constructing a space-time interaction model; space-time interaction is performed between the memory frame branch and the query frame branch to obtain the corresponding interaction weights, and these weights are used to perform a secondary screening of the enhanced memory frame and query frame feature information, so as to obtain features more beneficial to tracking.
Step 4: updating the template; template updating is performed using the space-time memory network.
Step 5: the response map is input into the classification-regression network, and the whole network model is trained according to the loss function, so as to track the target in the video.
In the feature extraction of step 1, the query frame image, the memory frame images and the label maps are first obtained through data preprocessing. In the memory frame branch, GoogleNet is used as the backbone network to extract the memory frame feature F_m: the first convolution layer of the backbone and an additional convolution layer g process the memory frame and the label map respectively, their outputs are added and fed into the remaining backbone layers, and T memory frame feature maps are generated in this way; finally a nonlinear convolution layer reduces the feature dimension of F_m to 512. In the query branch, the query image is input to obtain the feature map F_q; the overall structure is the same as that of the memory frame branch but the parameters are different, and a nonlinear convolution layer finally yields the feature F_q with feature dimension 512.
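As an illustration of this two-branch extraction, the following sketch uses a small stand-in CNN in place of the GoogLeNet backbone; the layer sizes, strides and the exact split between the first convolution layer and the remaining layers are assumptions made for the sketch, not details taken from the invention. The query branch would share the same structure without the label-map path and with its own parameters.

```python
# Minimal sketch of the memory-frame branch (step 1); a small CNN stands in
# for the GoogLeNet backbone, so all channel sizes here are illustrative.
import torch.nn as nn

class MemoryBranch(nn.Module):
    def __init__(self, out_dim=512):
        super().__init__()
        self.first_conv = nn.Conv2d(3, 64, 7, stride=2, padding=3)   # processes the memory frame
        self.label_conv = nn.Conv2d(1, 64, 7, stride=2, padding=3)   # extra convolution layer g for the label map
        self.rest = nn.Sequential(                                   # stand-in for the remaining backbone layers
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1024, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # nonlinear convolution layer that reduces the feature dimension to 512
        self.reduce = nn.Sequential(nn.Conv2d(1024, out_dim, 1), nn.ReLU(inplace=True))

    def forward(self, frames, label_maps):
        # frames: (T, 3, H, W), label_maps: (T, 1, H, W) -> T memory feature maps (T, 512, H', W')
        x = self.first_conv(frames) + self.label_conv(label_maps)
        return self.reduce(self.rest(x))
```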
The space-time feature enhancement module used in step 2 to enhance the memory frame images comprises a temporal attention module and a spatial attention module. The temporal attention module is mainly used to enhance the temporal features of the sequence, emphasizing important temporal information by weighting and filtering out irrelevant information; the spatial attention module is mainly used to enhance the spatial features, weighting the target region and the background region of the image so that attention is focused on the region where the target is located.
In step 2 of the invention, a temporal attention module is adopted to enhance the memory frame features: at different moments of the sequence, the sequence features are weighted-averaged with different weights to obtain a temporal feature representation. First, the memory frame image feature F_m extracted by the backbone network is split into T feature vectors {f_1, f_2, ..., f_T}, where T is the sequence length. Then, the feature f_t of each time step undergoes one linear transformation that maps it into a new feature space, the attention coefficients are computed with a SoftMax function, and the features are weighted and fused according to these coefficients to obtain the weighted feature vector F_m1.
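A minimal sketch of such a temporal attention module is given below; pooling each frame into a global descriptor and the particular scoring head are assumptions made for the sketch.

```python
# Sketch of temporal attention: each of the T memory-frame features is scored
# by a learned linear map, the scores are normalized over time with SoftMax,
# and the features are fused by a weighted sum (the fused result plays the
# role of F_m1). Shapes and the scoring head are assumptions.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        self.proj = nn.Linear(channels, channels)   # linear transform of each time step
        self.score = nn.Linear(channels, 1)         # one attention logit per frame

    def forward(self, f_m):
        # f_m: (B, T, C, H, W)
        b, t, c, h, w = f_m.shape
        desc = f_m.mean(dim=(3, 4))                 # per-frame descriptors (B, T, C)
        desc = torch.tanh(self.proj(desc))          # map into a new feature space
        alpha = torch.softmax(self.score(desc), dim=1)   # (B, T, 1) attention coefficients
        alpha = alpha.view(b, t, 1, 1, 1)
        return (alpha * f_m).sum(dim=1)             # fused feature: (B, C, H, W)
```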
In step 2 of the invention, the spatial attention module used to enhance the memory frame features first takes the template frame feature F_m obtained from the feature extraction network and compresses the image feature through a global average pooling layer and a max pooling layer, keeping the first dimension B of the image feature unchanged and obtaining feature vectors of size 1×C, where C is the number of channels. The two pooled results are then concatenated and fed into a 3×3 convolution layer to perform a convolution with kernel size 3, which raises the dimension of the compressed feature vectors. Finally a Sigmoid function is applied so that the output is constrained between 0 and 1, i.e. a spatial channel weight is generated for each spatial channel; the feature output by the spatial attention model is denoted F_m2. Finally, the temporal attention and the spatial attention in the space-time feature enhancement module weight the temporal feature and the spatial feature of the memory frame image respectively, and the memory frame feature map is obtained as:
F_m' = concat(F_m1, F_m2)
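The spatial attention branch can be sketched as follows; the placement of the 3×3 convolution on the pooled 1×1 map is one possible reading of the description above, and the channel sizes are assumptions. The final concatenation F_m' = concat(F_m1, F_m2) would then combine this output with the temporal attention output.

```python
# Sketch of the spatial attention branch: global average and max pooling
# compress the feature, the pooled results are concatenated, a 3x3 convolution
# restores the channel dimension, and a Sigmoid produces weights in [0, 1]
# that re-weight the memory feature (the result plays the role of F_m2).
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.conv = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, f_m):
        # f_m: (B, C, H, W); the batch dimension B is left unchanged
        pooled = torch.cat([self.avg_pool(f_m), self.max_pool(f_m)], dim=1)  # (B, 2C, 1, 1)
        weights = torch.sigmoid(self.conv(pooled))                           # (B, C, 1, 1), in [0, 1]
        return f_m * weights                                                 # weighted feature
```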
the step 2 of the invention enhances the image characteristics of the query frame, firstly, each element vector in the input sequence of the image characteristics of the query is automatically weighted and summed with other vectors, then the similarity between the characteristics is calculated, key characteristics are extracted, and the weight of each characteristic is recalculated according to the key characteristics to obtain more enhanced characteristic representation, and the specific operation is as follows:
S21. The feature matrix F_q obtained from the backbone network first undergoes three linear-transformation convolution operations and is projected into three different feature spaces, giving three feature representations: Query, Key and Value.
S22. The Query and the Key are multiplied (with transposition) to obtain a matrix energy that describes their similarity.
S23. The energy matrix is normalized with the SoftMax function so that each feature point obtains a weight between 0 and 1, giving the attention weight coefficient matrix attention1, which mainly serves to highlight the feature points more important to the model.
S24. Finally, all Value vectors are weighted and summed with the weight coefficients attention1 to obtain the final weight vectors, all of which are reconstructed into a new feature matrix following the shape of the original feature matrix; the feature output by the query image feature enhancement module is denoted F_q'.
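Steps S21 to S24 correspond to a standard self-attention block; a minimal sketch is shown below, where the reduced Query/Key channel width is an assumption.

```python
# Sketch of the query-frame self-attention (S21-S24): 1x1 convolutions produce
# Query, Key and Value, the Query-Key similarity is normalized with SoftMax
# (attention1), and the Value vectors are summed with those weights, then
# reshaped back to the original feature shape.
import torch
import torch.nn as nn

class QuerySelfAttention(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)   # linear-transform convolutions (S21)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, f_q):
        b, c, h, w = f_q.shape
        q = self.q(f_q).flatten(2).transpose(1, 2)       # (B, HW, C/8)
        k = self.k(f_q).flatten(2)                       # (B, C/8, HW)
        v = self.v(f_q).flatten(2)                       # (B, C, HW)
        energy = torch.bmm(q, k)                         # (B, HW, HW) similarity matrix (S22)
        attention1 = torch.softmax(energy, dim=-1)       # weights in [0, 1] (S23)
        out = torch.bmm(v, attention1.transpose(1, 2))   # weighted sum of Values (S24)
        return out.view(b, c, h, w)                      # enhanced query feature
```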
The space-time interaction model in the step 3 is mainly to perform space-time interaction on the feature information on the memory frame branch and the query frame branch which are processed by the feature enhancement module, so as to obtain the feature representation which is more favorable for the tracking task, and the specific operation steps are as follows:
S31. The memory frame feature F_m' and the query frame feature F_q' obtained in step 2 are taken as the inputs of the space-time interaction attention module and are processed by a reshape transformation into feature matrices of the same spatial size, B×C×H×W.
S32. Three convolution layers apply the Query and Key transformations to the memory frame branch F_m' and the Value transformation to the query frame branch F_q'. The Query and the Key are multiplied (with transposition) to obtain a matrix describing their correlation. The similarity matrix is then normalized with the SoftMax function so that each feature point obtains a weight between 0 and 1, giving the attention weight attention21. This attention weight is used to weight all Value vectors of the query frame F_q', yielding the weighted memory frame feature F_q,m. Finally, the weighted feature is spliced with the memory frame branch to obtain the space-time interactive memory frame feature F_m'', with the formula:
F_m'' = concat(F_q,m, F_m') = F_q,m + F_m'
S33. Three convolution layers apply the Query and Key transformations to the query frame branch F_q' and the Value transformation to the memory frame branch F_m'. The Query and the Key are multiplied (with transposition) to obtain a matrix describing their correlation. The similarity matrix is then normalized with the SoftMax function so that each feature point obtains a weight between 0 and 1, giving the attention weight attention22. This attention weight is used to weight all Value vectors of the memory frame F_m', yielding the weighted query feature F_m,q. Finally, the weighted feature is spliced with the query frame branch to obtain the space-time interactive query frame feature F_q'', with the formula:
F_q'' = concat(F_m,q, F_q') = F_m,q + F_q'
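One direction of this interaction (S31 and S32) can be sketched as follows; the reverse direction (S33) is obtained by swapping the roles of the two branches. The reduced Query/Key channel width is an assumption, and the additive fusion follows the formula F_m'' = F_q,m + F_m' given above.

```python
# Sketch of one interaction direction: Query and Key come from the enhanced
# memory feature F_m', Value comes from the enhanced query feature F_q', and
# the weighted result is fused back into the memory branch.
import torch
import torch.nn as nn

class InteractionBranch(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, f_src, f_val):
        # f_src provides Query/Key (e.g. F_m'), f_val provides Value (e.g. F_q');
        # both are (B, C, H, W) after the reshape in S31.
        b, c, h, w = f_src.shape
        q = self.q(f_src).flatten(2).transpose(1, 2)             # (B, HW, C/8)
        k = self.k(f_src).flatten(2)                             # (B, C/8, HW)
        v = self.v(f_val).flatten(2)                             # (B, C, HW)
        attention21 = torch.softmax(torch.bmm(q, k), dim=-1)     # (B, HW, HW)
        weighted = torch.bmm(v, attention21.transpose(1, 2)).view(b, c, h, w)  # F_q,m
        return weighted + f_src                                  # F_m'' = F_q,m + F_m'
```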
the template updating mechanism in the step 4 of the invention uses a space-time memory network (memory reader), updates the target template mainly by using the history information of the memory frame, and firstly calculates the memory frame F output in the step 3 m "each pixel and query frame F q "similarity between the frames, obtain the similarity matrix, the similarity matrix is regarded as the mapping of soft weight and memory frame F m "multiplying, adaptively searching information stored in memory frame, searching the most useful information related to inquiry frame from the information stored in memory frame so as to implement updating of template, finally combining the read information with inquiry frame characteristic F q "splice along channel dimension, generate final composite feature map y.
In step 5 of the invention, the classification-regression network produces the response maps. A lightweight classification convolution network ω_cls encodes the feature map y obtained in step 4, and a linear convolution layer with a 1×1 kernel reduces the dimension of the ω_cls output to 1, giving the final classification response map R_cls ∈ R^(1×H×W). The classification network also computes a centerness response, denoted R_ctr ∈ R^(1×H×W); in the inference phase, R_ctr is multiplied with R_cls to suppress the classification confidence scores of pixels far from the target center. In the regression task, a lightweight convolution network ω_reg encodes the feature map y obtained in step 4 and reduces the output feature dimension to 4, generating the regression response map R_reg ∈ R^(4×H×W) used for bounding box estimation.
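The classification-regression head can be sketched as follows; the depth and width of the convolutional towers are assumptions, and only the 1-channel, 1-channel and 4-channel output convolutions follow the description above.

```python
# Sketch of the classification-regression head (step 5): two lightweight
# convolutional towers stand in for omega_cls and omega_reg, followed by 1x1
# convolutions that produce R_cls (1 ch), R_ctr (1 ch) and R_reg (4 ch).
import torch
import torch.nn as nn

class ClsRegHead(nn.Module):
    def __init__(self, in_channels=1024, mid_channels=256):
        super().__init__()
        def tower():
            return nn.Sequential(
                nn.Conv2d(in_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
            )
        self.cls_tower, self.reg_tower = tower(), tower()   # stand-ins for omega_cls / omega_reg
        self.cls_out = nn.Conv2d(mid_channels, 1, 1)        # R_cls in R^(1xHxW)
        self.ctr_out = nn.Conv2d(mid_channels, 1, 1)        # R_ctr in R^(1xHxW)
        self.reg_out = nn.Conv2d(mid_channels, 4, 1)        # R_reg in R^(4xHxW)

    def forward(self, y, inference=False):
        c = self.cls_tower(y)
        r_cls, r_ctr = self.cls_out(c), self.ctr_out(c)
        r_reg = self.reg_out(self.reg_tower(y))
        if inference:   # at inference, centerness suppresses off-center classification scores
            r_cls = torch.sigmoid(r_cls) * torch.sigmoid(r_ctr)
        return r_cls, r_ctr, r_reg
```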
The loss function on which the overall network model is trained comprises a classification loss, a regression loss and a centerness loss: the classification loss L_cls uses the cross-entropy loss function, the regression loss L_reg uses the IoU loss function, and the centerness loss L_ctr uses the Sigmoid cross-entropy loss function. The total loss function is expressed as:
L = L_cls + λ1·L_reg + λ2·L_ctr
where λ1 and λ2 are both hyper-parameters.
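A sketch of this total loss is shown below; the (l, t, r, b) parameterization of the IoU loss and the default values of λ1 and λ2 are assumptions made for the sketch.

```python
# Sketch of L = L_cls + lambda1 * L_reg + lambda2 * L_ctr with cross-entropy
# classification, an IoU regression loss and Sigmoid cross-entropy centerness.
import torch
import torch.nn.functional as F

def iou_loss(pred, target, eps=1e-6):
    # pred/target: (N, 4) distances to the left, top, right, bottom box edges
    inter_w = torch.min(pred[:, 0], target[:, 0]) + torch.min(pred[:, 2], target[:, 2])
    inter_h = torch.min(pred[:, 1], target[:, 1]) + torch.min(pred[:, 3], target[:, 3])
    inter = inter_w.clamp(min=0) * inter_h.clamp(min=0)
    area_p = (pred[:, 0] + pred[:, 2]) * (pred[:, 1] + pred[:, 3])
    area_t = (target[:, 0] + target[:, 2]) * (target[:, 1] + target[:, 3])
    iou = inter / (area_p + area_t - inter + eps)
    return (1.0 - iou).mean()

def total_loss(cls_logits, cls_labels, reg_pred, reg_target, ctr_logits, ctr_target,
               lambda1=1.0, lambda2=1.0):
    # cls_labels / ctr_target are per-pixel targets in [0, 1] matching the logit shapes
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_labels)   # classification loss
    l_reg = iou_loss(reg_pred, reg_target)                               # IoU regression loss
    l_ctr = F.binary_cross_entropy_with_logits(ctr_logits, ctr_target)   # centerness loss
    return l_cls + lambda1 * l_reg + lambda2 * l_ctr
```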
In the field of target tracking, the performance of a tracker degrades when the tracked target encounters complex conditions such as occlusion, deformation and background interference. To improve this situation, the invention uses multiple memory frames in the template branch and a space-time feature enhancement module that combines spatio-temporal context information to enhance the original features, so that the expressive power of the memory frames and the query frame is strengthened, the features become richer, and the robustness and accuracy of target tracking are improved. Meanwhile, Siamese-based tracking algorithms do not pay attention to the inter-frame and intra-frame correlations of the video sequence, so the target cannot form the corresponding associations in time and space, and some trackers use only the appearance information of the memory frames without fully exploiting the rich temporal context information in the historical frame sequence. Aiming at this problem, the invention builds a space-time interaction model with an attention mechanism, so that the feature information of the memory frame branch and the query frame branch interact; the obtained interaction weights are then used to perform a secondary screening of the enhanced feature information, yielding features more beneficial to tracking. The space-time interaction model can exploit the information in the memory frames and dynamically combine it with the information in the query frame to realize spatio-temporal feature interaction, thereby improving the robustness and accuracy of target tracking.
The target tracking method based on the space-time interaction attention mechanism mainly uses the official GOT-10k, OTB-100 and LaSOT datasets to train the network model, and uses the GOT-10k evaluation tool to test the training effect of the method. Comparing the data in Table 1 shows that the target tracking algorithm provided by the invention performs better on the test dataset than the weights trained by other algorithms.
By comparing the evaluation results shown in FIG. 5, it can be clearly observed that, in the target tracking method based on the space-time interaction attention mechanism, the attention-based feature enhancement and the enhanced screening of features by the space-time interaction module effectively improve the tracking performance of the tracker.
Finally, it should be noted that the foregoing is only a preferred embodiment of the present invention and is not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or replace some of their technical features with equivalents. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. A target tracking method based on a space-time interaction attention mechanism, characterized by comprising the following steps:
step 1: extracting features; acquiring query image data and template image data of a target to be tracked and extracting corresponding characteristic information;
step 2: enhancing space-time characteristics; enhancing the memory frame image characteristics and the query frame image characteristics by using an attention mechanism;
step 3: constructing a space-time interaction model; performing space-time interaction between the memory frame branch and the query frame branch to obtain the corresponding interaction weights, and using these weights to perform a secondary screening of the enhanced memory frame and query frame feature information, so as to obtain features more beneficial to tracking;
step 4: updating a template; using a space-time memory network to update the template;
step 5: inputting the response map into a classification-regression network, and training the whole network model according to the loss function, so as to track the target in the video.
2. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein the feature extraction in step 1 is performed as follows: the query frame image, the memory frame images and the label maps are first obtained through data preprocessing; in the memory frame branch, GoogleNet is used as the backbone network to extract the memory frame feature F_m, the first convolution layer of the backbone and an additional convolution layer g being used to process the memory frame and the label map respectively, their outputs being added and fed into the remaining backbone layers to generate T memory frame feature maps, after which a nonlinear convolution layer reduces the feature dimension of F_m to 512; in the query branch, the query image is input to obtain the feature map F_q, the overall structure being the same as that of the memory frame branch but with different parameters, and a nonlinear convolution layer finally yields the feature F_q with feature dimension 512.
3. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein the space-time feature enhancement module for the memory frame images in step 2 mainly comprises a temporal attention module and a spatial attention module; the temporal attention module is mainly used to enhance the temporal features of the sequence, emphasizing important temporal information by weighting and filtering out irrelevant information, and the spatial attention module is mainly used to enhance the spatial features, weighting the target region and the background region of the image so that attention is focused on the region where the target is located.
4. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein the temporal attention module in step 2 performs a weighted average of the sequence features with different weights at different moments of the sequence to obtain a temporal feature representation; first, the memory frame image feature F_m extracted by the backbone network is split into T feature vectors {f_1, f_2, ..., f_T}, where T is the sequence length; then, the feature f_t of each time step undergoes one linear transformation that maps it into a new feature space, the attention coefficients are computed with a SoftMax function, and the features are weighted and fused according to these coefficients to obtain the weighted feature vector F_m1.
5. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein the spatial attention module in step 2 first takes the template frame feature F_m obtained from the feature extraction network and compresses the image feature through a global average pooling layer and a max pooling layer in the network model, keeping the first dimension B of the image feature unchanged and obtaining feature vectors of size 1×C, where C is the number of channels; the two pooled results are then concatenated and fed into a 3×3 convolution layer to perform a convolution with kernel size 3, which raises the dimension of the compressed feature vectors; finally a Sigmoid function is applied so that the output is constrained between 0 and 1, i.e. a spatial channel weight is generated for each spatial channel, and the feature output by the spatial attention model is denoted F_m2; finally, the temporal attention and the spatial attention in the space-time feature enhancement module weight the temporal feature and the spatial feature of the memory frame image respectively, and the memory frame feature map is obtained as:
F_m' = concat(F_m1, F_m2).
6. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein in step 2 the query frame image features are enhanced: each element vector of the input sequence of query image features is automatically weighted and summed with the other vectors, the similarity between features is then computed, key features are extracted, and the weight of each feature is recomputed according to these key features to obtain a more enhanced feature representation, with the following specific operations:
S21. The feature matrix F_q obtained from the backbone network first undergoes three linear-transformation convolution operations and is projected into three different feature spaces, giving three feature representations: Query, Key and Value;
S22. The Query and the Key are multiplied (with transposition) to obtain a matrix energy that describes their similarity;
S23. The energy matrix is normalized with the SoftMax function so that each feature point is assigned a weight between 0 and 1, giving the attention weight coefficient matrix attention1, which mainly serves to highlight the feature points more important to the model;
S24. Finally, all Value vectors are weighted and summed with the weight coefficients attention1 to obtain the final weight vectors, all of which are reconstructed into a new feature matrix following the shape of the original feature matrix; the feature output by the query image feature enhancement module is denoted F_q'.
7. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein the space-time interaction model in step 3 mainly performs space-time interaction on the feature information of the memory frame branch and the query frame branch processed by the feature enhancement module, so as to obtain a feature representation more beneficial to the tracking task, with the following specific operation steps:
S31. The memory frame feature F_m' and the query frame feature F_q' obtained in step 2 are taken as the inputs of the space-time interaction attention module and are processed by a reshape transformation into feature matrices of the same spatial size, B×C×H×W;
S32. Three convolution layers apply the Query and Key transformations to the memory frame branch F_m' and the Value transformation to the query frame branch F_q'; the Query and the Key are multiplied (with transposition) to obtain a matrix describing their correlation; the similarity matrix is then normalized with the SoftMax function so that each feature point obtains a weight between 0 and 1, giving the attention weight attention21; this attention weight is used to weight all Value vectors of the query frame F_q', yielding the weighted memory frame feature F_q,m; finally, the weighted feature is spliced with the memory frame branch to obtain the space-time interactive memory frame feature F_m'', with the formula:
F_m'' = concat(F_q,m, F_m') = F_q,m + F_m'
S33. Three convolution layers apply the Query and Key transformations to the query frame branch F_q' and the Value transformation to the memory frame branch F_m'; the Query and the Key are multiplied (with transposition) to obtain a matrix describing their correlation; the similarity matrix is then normalized with the SoftMax function so that each feature point obtains a weight between 0 and 1, giving the attention weight attention22; this attention weight is used to weight all Value vectors of the memory frame F_m', yielding the weighted query feature F_m,q; finally, the weighted feature is spliced with the query frame branch to obtain the space-time interactive query frame feature F_q'', with the formula:
F_q'' = concat(F_m,q, F_q') = F_m,q + F_q'.
8. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein the template updating mechanism in step 4 uses a space-time memory network (memory reader) and updates the target template mainly with the historical information of the memory frames: the similarity between each pixel of the memory frame feature F_m'' output in step 3 and the query frame feature F_q'' is first computed to obtain a similarity matrix; the similarity matrix serves as a soft weight map and is multiplied with the memory frame F_m'', adaptively retrieving the information stored in the memory frames and selecting from it the information most useful to the query frame, thereby updating the template; finally, the retrieved information is concatenated with the query frame feature F_q'' along the channel dimension to generate the final composite feature map y.
9. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein in step 5 the classification-regression network produces the response maps as follows: a lightweight classification convolution network ω_cls encodes the feature map y obtained in step 4, and a linear convolution layer with a 1×1 kernel reduces the dimension of the ω_cls output to 1, giving the final classification response map R_cls ∈ R^(1×H×W); the classification network also computes a centerness response, denoted R_ctr ∈ R^(1×H×W), and in the inference phase R_ctr is multiplied with R_cls to suppress the classification confidence scores of pixels far from the target center; in the regression task, a lightweight convolution network ω_reg encodes the feature map y obtained in step 4 and reduces the output feature dimension to 4, generating the regression response map R_reg ∈ R^(4×H×W) used for bounding box estimation.
10. The target tracking method based on the space-time interaction attention mechanism according to claim 1, wherein the loss function in step 5 comprises a classification loss L_cls using the cross-entropy loss function, a regression loss L_reg using the IoU loss function, and a centerness loss L_ctr using the Sigmoid cross-entropy loss function; the total loss function is expressed as:
L = L_cls + λ1·L_reg + λ2·L_ctr
where λ1 and λ2 are both hyper-parameters.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310523575.0A CN116563355A (en) | 2023-05-10 | 2023-05-10 | Target tracking method based on space-time interaction attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310523575.0A CN116563355A (en) | 2023-05-10 | 2023-05-10 | Target tracking method based on space-time interaction attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116563355A (en) | 2023-08-08 |
Family
ID=87503039
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310523575.0A Pending CN116563355A (en) | 2023-05-10 | 2023-05-10 | Target tracking method based on space-time interaction attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116563355A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116953653A (en) * | 2023-09-19 | 2023-10-27 | 成都远望科技有限责任公司 | Networking echo extrapolation method based on multiband weather radar |
CN116953653B (en) * | 2023-09-19 | 2023-12-26 | 成都远望科技有限责任公司 | Networking echo extrapolation method based on multiband weather radar |
CN117522925A (en) * | 2024-01-05 | 2024-02-06 | 成都合能创越软件有限公司 | Method and system for judging object motion state in mobile camera under attention mechanism |
CN117522925B (en) * | 2024-01-05 | 2024-04-16 | 成都合能创越软件有限公司 | Method and system for judging object motion state in mobile camera under attention mechanism |
CN118172390A (en) * | 2024-05-15 | 2024-06-11 | 南京新越阳科技有限公司 | Target tracking method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||