CN116563355A - Target tracking method based on space-time interaction attention mechanism - Google Patents

Target tracking method based on space-time interaction attention mechanism

Info

Publication number
CN116563355A
Authority
CN
China
Prior art keywords
feature
space
query
frame
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310523575.0A
Other languages
Chinese (zh)
Inventor
黄丹丹
于斯宇
刘智
王英志
白昱
王一雯
杨明婷
胡力洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun University of Science and Technology
Priority to CN202310523575.0A
Publication of CN116563355A
Legal status: Pending (current)


Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T 7/00 Image analysis
            • G06T 7/30 Determination of transform parameters for the alignment of images, i.e. image registration
              • G06T 7/33 Determination of transform parameters using feature-based methods
                • G06T 7/337 Determination of transform parameters using feature-based methods involving reference images or patches
          • G06T 2207/00 Indexing scheme for image analysis or image enhancement
            • G06T 2207/10 Image acquisition modality
              • G06T 2207/10016 Video; Image sequence
            • G06T 2207/20 Special algorithmic details
              • G06T 2207/20081 Training; Learning
              • G06T 2207/20084 Artificial neural networks [ANN]
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00 Computing arrangements based on biological models
            • G06N 3/02 Neural networks
              • G06N 3/04 Architecture, e.g. interconnection topology
                • G06N 3/0464 Convolutional networks [CNN, ConvNet]
                • G06N 3/048 Activation functions
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V 10/00 Arrangements for image or video recognition or understanding
            • G06V 10/40 Extraction of image or video features
              • G06V 10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
            • G06V 10/70 Arrangements using pattern recognition or machine learning
              • G06V 10/764 Arrangements using classification, e.g. of video objects
              • G06V 10/766 Arrangements using regression, e.g. by projecting features on hyperplanes
              • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                • G06V 10/771 Feature selection, e.g. selecting representative features from a multi-dimensional feature space
                • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
                • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                  • G06V 10/806 Fusion of extracted features
              • G06V 10/82 Arrangements using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
      • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
        • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
          • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of target tracking in computer vision, and in particular relates to a target tracking method based on a space-time interaction attention mechanism, which comprises the following steps. Step 1: feature extraction; acquire query image data and template image data of the target to be tracked and extract the corresponding feature information. Step 2: space-time feature enhancement; enhance the memory frame image features and the query frame image features with an attention mechanism. Aiming at the degradation of tracking performance when the target encounters complex conditions such as occlusion, deformation and background interference, the invention introduces a feature enhancement module that uses temporal attention, spatial attention and a self-attention mechanism to weight, respectively, the temporal features and spatial features of the memory frame images and the query frame image features. This strengthens the expressive power of the memory frames and the query frame, makes the features richer, and thereby improves the robustness and accuracy of target tracking.

Description

Target tracking method based on space-time interaction attention mechanism
Technical Field
The invention relates to the technical field of target tracking in computer vision, and in particular to a target tracking method based on a space-time interaction attention mechanism.
Background
Target tracking is one of the important research directions in the field of computer vision and is widely applied in intelligent driving, autonomous driving, human-computer interaction and other fields. A target tracking task must keep accurate track of the object throughout a continuous video sequence in order to obtain its complete motion trajectory and to compute its position and size in different image frames. The development of target tracking plays an irreplaceable role in high-level video processing tasks such as behavior understanding, reasoning and decision making, and it is also a basic technology for target recognition, behavior analysis, video compression coding, video understanding and the like. Although research on target tracking has advanced significantly over the past few years and many efficient algorithms have emerged to address challenging problems in various scenarios, many difficulties remain, such as object occlusion, illumination variation, scale change and background interference, so target tracking is still a hard research problem. To address these problems, a more accurate and robust tracker needs to be proposed.
In 2016, with the advent of SiamFC, Siamese-network-based tracking frameworks became the mainstream framework for single-target tracking algorithms. Afterwards, SiamRPN introduced a region proposal network into the Siamese network and obtained excellent tracking results on several benchmarks. SiamCorners introduced an improved corner pooling layer on top of the Siamese network, converting bounding-box estimation into the prediction of a pair of diagonal corners, with good performance. Most current Siamese-network-based trackers use the initial frame as the template, which brings certain risks: when the tracked object is severely deformed or occluded, the target cannot be tracked well. To improve this, some trackers introduce a template update mechanism or use multiple templates, which can enhance robustness to some extent, but the gain is limited and the computational cost inevitably increases. In addition, these trackers use only the appearance information of the memory frames and do not fully exploit the rich temporal context information in the historical frame sequence. Meanwhile, Siamese-network-based tracking algorithms pay no attention to the inter-frame and intra-frame correlations of the video sequence, so the target cannot form the corresponding associations in time and space.
Disclosure of Invention
(I) Technical problems to be solved
Aiming at the shortcomings of the prior art, the invention provides a target tracking method based on a space-time interaction attention mechanism, which solves the problems that spatio-temporal context information is difficult to associate and that tracking performance degrades when the target encounters challenges such as deformation and occlusion.
(II) Technical solution
To achieve the above purpose, the invention adopts the following technical solution:
A target tracking method based on a space-time interaction attention mechanism comprises the following steps:
Step 1: feature extraction; acquire query image data and template image data of the target to be tracked and extract the corresponding feature information;
Step 2: space-time feature enhancement; enhance the memory frame image features and the query frame image features with an attention mechanism;
Step 3: construction of the space-time interaction model; perform space-time interaction between the memory frame branch and the query frame branch to obtain the corresponding interaction weights, and use the interaction weights for a second screening of the enhanced memory-frame and query-frame feature information to obtain features more beneficial to tracking;
Step 4: template update; use a space-time memory network to update the template;
Step 5: obtain the response maps through the classification-regression network, and train the whole network model according to the loss function to realize tracking of the target in the video.
Further, the feature extraction in step 1 is performed as follows: the query frame image, the memory frame images and the label map are first obtained through data preprocessing. In the memory frame branch, GoogleNet is used as the backbone network to extract the feature F_m: the first convolution layer of the backbone and an additional convolution layer g process the memory frames and the label map respectively, their outputs are added and passed through the remaining backbone layers to generate T memory frame feature maps, and a final nonlinear convolution layer reduces the feature dimension of F_m to 512. In the query branch, the feature map F_q is obtained after the query image is input; its overall structure is the same as that of the memory frame branch but with different parameters, and a nonlinear convolution layer finally yields the feature F_q with dimension 512.
Further, the space-time feature enhancement module for the memory frame images in step 2 mainly includes a temporal attention module and a spatial attention module; the temporal attention module is mainly used to enhance the temporal features in the sequence, weighting and emphasizing important temporal information while filtering out irrelevant information, and the spatial attention module is mainly used to enhance the spatial features, weighting the target region and the background region in the image so that attention is focused on the region where the target is located.
Further, at different moments of the sequence, the temporal attention module in step 2 performs a weighted average of the sequence features with different weights to obtain a temporal feature representation. First, the memory frame image feature F_m extracted by the backbone network is split into T feature vectors {f_1, f_2, ..., f_T}, where T denotes the sequence length; the feature f_t of each time step then undergoes one linear transformation that maps it into a new feature space, attention coefficients are computed with a SoftMax function, and the features are weighted and fused according to these coefficients to obtain the weighted feature vector F_m1.
Further, the spatial attention module in step 2 first performs feature compression on the template frame feature F_m obtained from the feature extraction network: a global average pooling layer and a max pooling layer in the network model compress the image feature while keeping its first dimension B unchanged, each producing a feature vector of size 1×C, where C is the number of channels. The two pooled results are then spliced and fed into a 3×3 convolution layer for a convolution with kernel size 3, the compressed feature vector is raised back in dimension, and finally a Sigmoid activation constrains the output to between 0 and 1, generating the weight corresponding to each spatial channel; the feature output by the spatial attention model is denoted F_m2. Finally, the temporal attention and the spatial attention in the space-time feature enhancement module weight the temporal features and the spatial features of the memory frame images respectively, and the memory frame feature mapping is obtained as:
F_m' = concat(F_m1, F_m2).
Further, in step 2 the query frame image features are enhanced: each element vector in the input sequence of query image features is automatically weighted and summed with the other vectors, the similarity between features is then computed, key features are extracted, and the weight of each feature is recomputed according to these key features to obtain a stronger feature representation. The specific operations are as follows:
S21, the feature vector matrix F_q produced by the backbone network first undergoes three linear-transformation convolution operations that project it into three different feature spaces, giving three feature representations: Query, Key and Value.
S22, the Query and the Key undergo a transposed multiplication to obtain a matrix, energy, that describes their similarity.
S23, the energy matrix is normalized with a SoftMax function so that every feature point is assigned a weight between 0 and 1, producing the attention weight coefficient matrix attention1, which mainly highlights the feature points that are more important to the model.
S24, finally, all Value vectors are weighted and summed with the weight coefficients attention1 to obtain the final weight vectors, all final weight vectors are reassembled into a new feature matrix according to the shape of the original feature matrix, and the feature output by the query image feature enhancement module is denoted F_q'.
Further, the space-time interaction model in step 3 mainly performs space-time interaction between the feature information of the memory frame branch and of the query frame branch processed by the feature enhancement module, so as to obtain a feature representation more favorable to the tracking task. The specific operation steps are:
S31, the memory frame feature F_m' and the query frame feature F_q' obtained in step 2 are taken as the input of the space-time interaction attention module and are reshaped into feature matrices of the same spatial size, B×C×H×W.
S32, three convolution layers apply the Query and Key transformations to the memory frame branch F_m' and the Value transformation to the query frame branch F_q'; the Query and the Key undergo a transposed multiplication to obtain a matrix describing their correlation. The similarity matrix is then normalized with a SoftMax function so that every feature point receives a weight between 0 and 1, giving the attention weight attention21, which is used to weight all Value vectors of the query frame F_q' and obtain the weighted memory frame feature F_q,m. Finally, the weighted feature is spliced with the memory frame branch to obtain the space-time interactive memory frame feature F_m'', expressed as:
F_m'' = concat(F_q,m, F_m') = F_q,m + F_m'
S33, three convolution layers apply the Query and Key transformations to the query frame branch F_q' and the Value transformation to the memory frame branch F_m'; the Query and the Key undergo a transposed multiplication to obtain a matrix describing their correlation. The similarity matrix is then normalized with a SoftMax function so that every feature point receives a weight between 0 and 1, giving the attention weight attention22, which is used to weight all Value vectors of the memory frame F_m' and obtain the weighted query feature F_m,q. Finally, the weighted feature is spliced with the query frame branch to obtain the space-time interactive query frame feature F_q'', expressed as:
F_q'' = concat(F_m,q, F_q') = F_m,q + F_q'.
further, the template updating mechanism in step 4 uses a space-time memory network (memory reader), updates the target template mainly by using the history information of the memory frame, and calculates the memory frame F output in step 3 m "AND INQUIRY Frames F q Similarity between each pixel in the' to obtain a similarity matrix, wherein the similarity matrix is used as a soft weight mapping and memory frame F m "multiplying, adaptively searching information stored in memory frame, searching the most useful information related to inquiry frame from the information stored in memory frame so as to implement updating of template, finally combining the read information with inquiry frame characteristic F q "splice along channel dimension, generate final composite feature map y.
Further, in step 5 the classification-regression network produces the response maps: a lightweight classification convolution network ω_cls encodes the feature map y obtained in step 4, and a linear convolution layer with a 1×1 kernel reduces the output of ω_cls to one channel, giving the final classification response map R_cls ∈ R^(1×H×W). The classification network also computes a centrality response R_ctr ∈ R^(1×H×W); in the inference phase, R_ctr is multiplied with R_cls to suppress the classification confidence scores of pixels far from the target center. In the regression task, a lightweight convolution network ω_reg encodes the feature map y, the output feature dimension is reduced to 4, and a regression response map R_reg ∈ R^(4×H×W) is generated for bounding-box estimation.
Further, the loss function in step 5 comprises the classification loss L_cls, which uses a cross entropy loss function, the regression loss L_reg, which uses an IoU loss function, and the centrality loss L_ctr, which uses a Sigmoid cross entropy loss function. The total loss function is expressed as:
L = L_cls + λ_1·L_reg + λ_2·L_ctr
where λ_1 and λ_2 are both hyperparameters.
(III) Beneficial effects
Compared with the prior art, the invention provides a target tracking method based on a space-time interaction attention mechanism with the following beneficial effects:
Aiming at the degradation of tracking performance when the target encounters complex conditions such as occlusion, deformation and background interference, the invention introduces a feature enhancement module that uses temporal attention, spatial attention and a self-attention mechanism to weight, respectively, the temporal features and spatial features of the memory frame images and the query frame image features, thereby strengthening the expressive power of the memory frames and the query frame, making the features richer and improving the robustness and accuracy of target tracking.
By introducing the space-time interaction module, the invention performs space-time information interaction between the two enhanced branch features, realizing a second screening of the features; the information of the memory frames can be fully used and dynamically combined with the information in the query frame, the target can be tracked more accurately, and the difficulty of the model in associating spatio-temporal context information is effectively alleviated.
Drawings
FIG. 1 is a block diagram of the overall network architecture of the present invention;
FIG. 2 is a diagram of a feature enhancement module architecture of the present invention;
FIG. 3 is a diagram of a space-time interactive module architecture of the present invention;
FIG. 4 is a template update flow diagram;
FIG. 5 shows the evaluation results on the GOT-10k test dataset.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Examples
As shown in FIG. 1 to FIG. 5, a target tracking method based on a space-time interaction attention mechanism according to an embodiment of the present invention includes the following steps:
Step 1: feature extraction; query image data and template image data of the target to be tracked are obtained, and the corresponding feature information is extracted.
Step 2: space-time feature enhancement; the memory frame image features and the query frame image features are enhanced with an attention mechanism.
Step 3: construction of the space-time interaction model; space-time interaction is performed between the memory frame branch and the query frame branch to obtain the corresponding interaction weights, and the interaction weights are used for a second screening of the enhanced memory-frame and query-frame feature information to obtain features more beneficial to tracking.
Step 4: template update; the template is updated with a space-time memory network.
Step 5: the response maps are obtained through the classification-regression network, and the whole network model is trained according to the loss function to realize tracking of the target in the video.
In step 1 of the invention, feature extraction is performed as follows: the query frame image, the memory frame images and the label map are first obtained through data preprocessing. In the memory frame branch, GoogleNet is used as the backbone network to extract the feature F_m: the first convolution layer of the backbone and an additional convolution layer g process the memory frames and the label map respectively, their outputs are added and passed through the remaining backbone layers to generate T memory frame feature maps, and a final nonlinear convolution layer reduces the feature dimension of F_m to 512. In the query branch, the feature map F_q is obtained after the query image is input; its overall structure is the same as that of the memory frame branch but with different parameters, and a nonlinear convolution layer finally yields the feature F_q with dimension 512.
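The two-branch extraction described above can be sketched as follows. This is a minimal PyTorch-style illustration, not the patented implementation: the small convolutional stack standing in for GoogleNet, the channel sizes and the layer names are all assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn

class MemoryBranch(nn.Module):
    def __init__(self, out_dim=512):
        super().__init__()
        self.first_conv = nn.Conv2d(3, 64, 7, stride=2, padding=3)  # stands in for the backbone's first layer
        self.label_conv = nn.Conv2d(1, 64, 7, stride=2, padding=3)  # extra convolution g for the label map
        self.rest = nn.Sequential(                                  # stands in for the remaining backbone stages
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1024, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # nonlinear convolution layer that reduces the feature dimension to 512
        self.reduce = nn.Sequential(nn.Conv2d(1024, out_dim, 1), nn.ReLU(inplace=True))

    def forward(self, memory_frames, label_maps):
        # memory_frames: (B*T, 3, H, W); label_maps: (B*T, 1, H, W)
        x = self.first_conv(memory_frames) + self.label_conv(label_maps)  # fuse frame and label information
        return self.reduce(self.rest(x))                                  # F_m with 512 channels

class QueryBranch(nn.Module):
    def __init__(self, out_dim=512):
        super().__init__()
        # same overall structure as the memory branch but with its own (unshared) parameters
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 256, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1024, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.reduce = nn.Sequential(nn.Conv2d(1024, out_dim, 1), nn.ReLU(inplace=True))

    def forward(self, query_frame):
        return self.reduce(self.backbone(query_frame))  # F_q with 512 channels
```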
The space-time feature enhancement module used for the memory frame images in step 2 mainly includes a temporal attention module and a spatial attention module; the temporal attention module is mainly used to enhance the temporal features in the sequence, weighting and emphasizing important temporal information while filtering out irrelevant information, and the spatial attention module is mainly used to enhance the spatial features, weighting the target region and the background region in the image so that attention is focused on the region where the target is located.
The temporal attention module adopted in step 2 of the invention to enhance the memory frame features performs, at different moments of the sequence, a weighted average of the sequence features with different weights to obtain a temporal feature representation. First, the memory frame image feature F_m extracted by the backbone network is split into T feature vectors {f_1, f_2, ..., f_T}, where T denotes the sequence length; the feature f_t of each time step then undergoes one linear transformation that maps it into a new feature space, attention coefficients are computed with a SoftMax function, and the features are weighted and fused according to these coefficients to obtain the weighted feature vector F_m1.
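A minimal sketch of this temporal weighting, assuming the memory features arrive as (B, T, C) vectors (for example after spatial pooling); the tanh scoring layer and the pooling choice are illustrative assumptions rather than details given in the text.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        self.proj = nn.Linear(channels, channels)   # linear transform of each time-step feature f_t
        self.score = nn.Linear(channels, 1)         # produces one attention logit per time step

    def forward(self, f_seq):
        # f_seq: (B, T, C) -- the T memory-frame feature vectors {f_1, ..., f_T}
        h = torch.tanh(self.proj(f_seq))             # map into a new feature space
        alpha = torch.softmax(self.score(h), dim=1)  # attention coefficients over the T steps
        f_m1 = (alpha * f_seq).sum(dim=1)            # weighted fusion -> F_m1, shape (B, C)
        return f_m1, alpha
```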
In step 2 of the invention, the spatial attention module used to enhance the memory frame features first performs feature compression on the template frame feature F_m obtained from the feature extraction network: a global average pooling layer and a max pooling layer compress the image feature while keeping its first dimension B unchanged, each producing a feature vector of size 1×C, where C is the number of channels. The two pooled results are then spliced and fed into a 3×3 convolution layer for a convolution with kernel size 3, the compressed feature vector is raised back in dimension, and finally a Sigmoid activation constrains the output to between 0 and 1, generating the weight corresponding to each spatial channel; the feature output by the spatial attention model is denoted F_m2. Finally, the temporal attention and the spatial attention in the space-time feature enhancement module weight the temporal features and the spatial features of the memory frame images respectively, and the memory frame feature mapping is obtained as:
F_m' = concat(F_m1, F_m2)
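A minimal sketch of the pooling + kernel-size-3 convolution + Sigmoid gating described above. Because the text describes 1×C pooled descriptors, the sketch treats the gating as a 1-D convolution over the channel axis and multiplies the resulting per-channel weights back onto F_m; that interpretation, and all layer shapes, are assumptions.

```python
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(2, 1, kernel_size=3, padding=1)  # fuses the avg- and max-pooled descriptors

    def forward(self, f_m):
        # f_m: (B, C, H, W) memory-frame features from the backbone
        b, c, _, _ = f_m.shape
        avg = f_m.mean(dim=(2, 3))                  # (B, C) global average pooling
        mx = f_m.amax(dim=(2, 3))                   # (B, C) global max pooling
        desc = torch.stack([avg, mx], dim=1)        # (B, 2, C) spliced pooled results
        w = torch.sigmoid(self.conv(desc)).view(b, c, 1, 1)  # weights in (0, 1), one per channel
        return f_m * w                              # F_m2: re-weighted memory features
```

The enhanced memory feature F_m' would then be obtained by concatenating F_m1 and F_m2 as in the formula above.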
the step 2 of the invention enhances the image characteristics of the query frame, firstly, each element vector in the input sequence of the image characteristics of the query is automatically weighted and summed with other vectors, then the similarity between the characteristics is calculated, key characteristics are extracted, and the weight of each characteristic is recalculated according to the key characteristics to obtain more enhanced characteristic representation, and the specific operation is as follows:
s21, firstly, processing a feature vector matrix F obtained by backbone network q Three linear transformation convolution operations are carried out, and the three linear transformation convolution operations are projected to three groups of different feature spaces to obtain three feature representations, namely: query, key, value.
S22, performing transposed multiplication operation on the Query and the Key to obtain a matrix energy capable of describing similarity of the Query and the Key.
S23, performing SoftMax function normalization processing on the energy matrix to obtain a weight between 0 and 1 for each feature point, and obtaining an attention weight coefficient matrix attention1 which is mainly used for highlighting feature points which are more important to the model.
S24, finally, weighting and summing all Value vectors by using the weight coefficient attribute 1 to obtain a final weight vector, reconstructing all final weight vectors into a new feature matrix according to the shape of the original feature matrix, and finally, processing the output feature by the query image feature enhancement module to be expressed as F q '。
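A minimal sketch of steps S21-S24: three 1×1 convolutions project F_q to Query/Key/Value, the Query-Key product gives the energy matrix, SoftMax gives attention1, and the Values are re-weighted and reshaped back. The reduced projection width is an illustrative assumption.

```python
import torch
import torch.nn as nn

class QuerySelfAttention(nn.Module):
    def __init__(self, channels=512, reduced=64):
        super().__init__()
        self.to_q = nn.Conv2d(channels, reduced, 1)   # S21: Query projection
        self.to_k = nn.Conv2d(channels, reduced, 1)   # S21: Key projection
        self.to_v = nn.Conv2d(channels, channels, 1)  # S21: Value projection

    def forward(self, f_q):
        # f_q: (B, C, H, W) query-frame features
        b, c, h, w = f_q.shape
        q = self.to_q(f_q).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.to_k(f_q).flatten(2)                   # (B, C', HW)
        v = self.to_v(f_q).flatten(2)                   # (B, C, HW)
        energy = torch.bmm(q, k)                        # S22: (B, HW, HW) similarity matrix
        attention1 = torch.softmax(energy, dim=-1)      # S23: weights between 0 and 1
        out = torch.bmm(v, attention1.transpose(1, 2))  # S24: weighted sum of the Values
        return out.view(b, c, h, w)                     # F_q': reshaped to the original layout
```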
The space-time interaction model in step 3 mainly performs space-time interaction between the feature information of the memory frame branch and of the query frame branch processed by the feature enhancement module, so as to obtain a feature representation more favorable to the tracking task. The specific operation steps are:
S31, the memory frame feature F_m' and the query frame feature F_q' obtained in step 2 are taken as the input of the space-time interaction attention module and are reshaped into feature matrices of the same spatial size, B×C×H×W.
S32, three convolution layers apply the Query and Key transformations to the memory frame branch F_m' and the Value transformation to the query frame branch F_q'; the Query and the Key undergo a transposed multiplication to obtain a matrix describing their correlation. The similarity matrix is then normalized with a SoftMax function so that every feature point receives a weight between 0 and 1, giving the attention weight attention21, which is used to weight all Value vectors of the query frame F_q' and obtain the weighted memory frame feature F_q,m. Finally, the weighted feature is spliced with the memory frame branch to obtain the space-time interactive memory frame feature F_m'', expressed as:
F_m'' = concat(F_q,m, F_m') = F_q,m + F_m'
S33, three convolution layers apply the Query and Key transformations to the query frame branch F_q' and the Value transformation to the memory frame branch F_m'; the Query and the Key undergo a transposed multiplication to obtain a matrix describing their correlation. The similarity matrix is then normalized with a SoftMax function so that every feature point receives a weight between 0 and 1, giving the attention weight attention22, which is used to weight all Value vectors of the memory frame F_m' and obtain the weighted query feature F_m,q. Finally, the weighted feature is spliced with the query frame branch to obtain the space-time interactive query frame feature F_q'', expressed as:
F_q'' = concat(F_m,q, F_q') = F_m,q + F_q'
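A minimal sketch of the bidirectional interaction in S31-S33: each direction takes Query/Key from one branch and Value from the other, and the attended result is fused back into the originating branch. Projection widths and the residual-style fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossBranchAttention(nn.Module):
    """One direction of the interaction: Query/Key from branch a, Value from branch b."""
    def __init__(self, channels=512, reduced=64):
        super().__init__()
        self.to_q = nn.Conv2d(channels, reduced, 1)
        self.to_k = nn.Conv2d(channels, reduced, 1)
        self.to_v = nn.Conv2d(channels, channels, 1)

    def forward(self, feat_a, feat_b):
        b, c, h, w = feat_a.shape
        q = self.to_q(feat_a).flatten(2).transpose(1, 2)  # (B, HW, C')
        k = self.to_k(feat_a).flatten(2)                  # (B, C', HW)
        v = self.to_v(feat_b).flatten(2)                  # (B, C, HW) -- Values from the other branch
        attn = torch.softmax(torch.bmm(q, k), dim=-1)     # correlation of Query and Key, weights in (0, 1)
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        return feat_a + out                               # fuse the attended feature with the original branch

class SpaceTimeInteraction(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        self.mem_branch = CrossBranchAttention(channels)    # S32: memory Q/K, query Values -> F_m''
        self.query_branch = CrossBranchAttention(channels)  # S33: query Q/K, memory Values -> F_q''

    def forward(self, f_m_enh, f_q_enh):
        # f_m_enh, f_q_enh: enhanced memory / query features, both (B, C, H, W)
        f_m2 = self.mem_branch(f_m_enh, f_q_enh)
        f_q2 = self.query_branch(f_q_enh, f_m_enh)
        return f_m2, f_q2
```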
the template updating mechanism in the step 4 of the invention uses a space-time memory network (memory reader), updates the target template mainly by using the history information of the memory frame, and firstly calculates the memory frame F output in the step 3 m "each pixel and query frame F q "similarity between the frames, obtain the similarity matrix, the similarity matrix is regarded as the mapping of soft weight and memory frame F m "multiplying, adaptively searching information stored in memory frame, searching the most useful information related to inquiry frame from the information stored in memory frame so as to implement updating of template, finally combining the read information with inquiry frame characteristic F q "splice along channel dimension, generate final composite feature map y.
In step 5 of the invention, the classification-regression network produces the response maps: a lightweight classification convolution network ω_cls encodes the feature map y obtained in step 4, and a linear convolution layer with a 1×1 kernel reduces the output of ω_cls to one channel, giving the final classification response map R_cls ∈ R^(1×H×W). The classification network also computes a centrality response R_ctr ∈ R^(1×H×W); in the inference phase, R_ctr is multiplied with R_cls to suppress the classification confidence scores of pixels far from the target center. In the regression task, a lightweight convolution network ω_reg encodes the feature map y, the output feature dimension is reduced to 4, and a regression response map R_reg ∈ R^(4×H×W) is generated for bounding-box estimation.
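A minimal sketch of such a head: lightweight convolutional towers encode y, then 1×1 convolutions produce R_cls, R_ctr and the 4-channel R_reg, with the centrality multiplication applied only at inference. Tower depth, channel widths and the sigmoid placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

def tower(in_ch, mid_ch=256, depth=2):
    layers = []
    for i in range(depth):
        layers += [nn.Conv2d(in_ch if i == 0 else mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

class ClsRegHead(nn.Module):
    def __init__(self, in_ch=1024):  # in_ch assumes y = concat(read, F_q'') with 512-channel branches
        super().__init__()
        self.cls_tower = tower(in_ch)         # stands in for omega_cls
        self.reg_tower = tower(in_ch)         # stands in for omega_reg
        self.cls_out = nn.Conv2d(256, 1, 1)   # R_cls: 1 x H x W
        self.ctr_out = nn.Conv2d(256, 1, 1)   # R_ctr: 1 x H x W
        self.reg_out = nn.Conv2d(256, 4, 1)   # R_reg: 4 x H x W (bounding-box offsets)

    def forward(self, y, inference=False):
        c = self.cls_tower(y)
        r_cls, r_ctr = self.cls_out(c), self.ctr_out(c)
        r_reg = self.reg_out(self.reg_tower(y))
        if inference:
            # suppress confidences of pixels far from the target centre, as described above
            r_cls = torch.sigmoid(r_cls) * torch.sigmoid(r_ctr)
        return r_cls, r_ctr, r_reg
```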
The loss function on which the whole network model is trained comprises the classification loss, the regression loss and the centrality loss: the classification loss L_cls uses a cross entropy loss function, the regression loss L_reg uses an IoU loss function, and the centrality loss L_ctr uses a Sigmoid cross entropy loss function. The total loss function is expressed as:
L = L_cls + λ_1·L_reg + λ_2·L_ctr
where λ_1 and λ_2 are both hyperparameters.
In the field of target tracking, when the tracked target encounters complex conditions such as occlusion, deformation and background interference, the performance of the tracker degrades. To improve this situation, the invention applies the space-time feature enhancement module to the template branch, combining spatio-temporal context information from multiple memory frames to enhance the original features, so that the expressive power of the memory frames and the query frame is strengthened, the features become richer, and the robustness and accuracy of target tracking are improved. Meanwhile, Siamese-based tracking algorithms pay no attention to the inter-frame and intra-frame correlations of the video sequence, so the target cannot form the corresponding associations in time and space, and some trackers use only the appearance information of the memory frames without fully exploiting the rich temporal context information in the historical frame sequence. For this problem, the invention builds a space-time interaction model with an attention mechanism so that the feature information of the memory frame branch and of the query frame branch interact, and the resulting interaction weights are used for a second screening of the enhanced feature information to obtain features more beneficial to tracking. The space-time interaction model can use the information in the memory frames and dynamically combine it with the information in the query frame to realize spatio-temporal feature interaction, thereby improving the robustness and accuracy of target tracking.
The target tracking method based on the space-time interaction attention mechanism mainly uses the official GOT-10k, OTB-100 and LaSOT datasets to train the network model, and uses the GOT-10k evaluation tool to test the training effect of the method. Comparing the data in Table 1 shows that the target tracking algorithm proposed by the invention performs better on the test set than the weights trained by other algorithms.
Comparing the evaluation results in FIG. 5, it can be clearly observed that, for the target tracking method based on the space-time interaction attention mechanism, the attention-based feature enhancement and the enhancement and screening of features by the space-time interaction module effectively improve the tracking performance of the tracker.
Finally, it should be noted that the foregoing is only a preferred embodiment of the present invention and the present invention is not limited thereto; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or replace some of their technical features with equivalents. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A target tracking method based on a space-time interaction attention mechanism, characterized by comprising the following steps:
Step 1: feature extraction; acquiring query image data and template image data of the target to be tracked and extracting the corresponding feature information;
Step 2: space-time feature enhancement; enhancing the memory frame image features and the query frame image features with an attention mechanism;
Step 3: constructing a space-time interaction model; performing space-time interaction between the memory frame branch and the query frame branch to obtain the corresponding interaction weights, and using the interaction weights for a second screening of the enhanced memory-frame and query-frame feature information to obtain features more beneficial to tracking;
Step 4: updating the template; using a space-time memory network to update the template;
Step 5: obtaining the response maps through the classification-regression network, and training the whole network model according to the loss function to realize tracking of the target in the video.
2. The target tracking method based on a space-time interaction attention mechanism according to claim 1, wherein: the feature extraction in step 1 is performed as follows: the query frame image, the memory frame images and the label map are first obtained through data preprocessing; in the memory frame branch, GoogleNet is used as the backbone network to extract the feature F_m, the first convolution layer of the backbone and an additional convolution layer g process the memory frames and the label map respectively, their outputs are added and passed through the remaining backbone layers to generate T memory frame feature maps, and a final nonlinear convolution layer reduces the feature dimension of F_m to 512; in the query branch, the feature map F_q is obtained after the query image is input, its overall structure is the same as that of the memory frame branch but with different parameters, and a nonlinear convolution layer finally yields the feature F_q with dimension 512.
3. The target tracking method based on a space-time interaction attention mechanism according to claim 1, wherein: the space-time feature enhancement module for the memory frame images in step 2 mainly includes a temporal attention module and a spatial attention module; the temporal attention module is mainly used to enhance the temporal features in the sequence, weighting and emphasizing important temporal information while filtering out irrelevant information, and the spatial attention module is mainly used to enhance the spatial features, weighting the target region and the background region in the image so that attention is focused on the region where the target is located.
4. The target tracking method based on a space-time interaction attention mechanism according to claim 1, wherein: at different moments of the sequence, the temporal attention module in step 2 performs a weighted average of the sequence features with different weights to obtain a temporal feature representation; first, the memory frame image feature F_m extracted by the backbone network is split into T feature vectors {f_1, f_2, ..., f_T}, where T denotes the sequence length, then the feature f_t of each time step undergoes one linear transformation that maps it into a new feature space, attention coefficients are computed with a SoftMax function, and the features are weighted and fused according to these coefficients to obtain the weighted feature vector F_m1.
5. The target tracking method based on a space-time interaction attention mechanism according to claim 1, wherein: the spatial attention module in step 2 first performs feature compression on the template frame feature F_m obtained from the feature extraction network, where a global average pooling layer and a max pooling layer in the network model compress the image feature while keeping its first dimension B unchanged, each producing a feature vector of size 1×C, C being the number of channels; the two pooled results are then spliced and fed into a 3×3 convolution layer for a convolution with kernel size 3, the compressed feature vector is raised back in dimension, and finally a Sigmoid activation constrains the output to between 0 and 1, generating the weight corresponding to each spatial channel; the feature output by the spatial attention model is denoted F_m2; finally, the temporal attention and the spatial attention in the space-time feature enhancement module weight the temporal features and the spatial features of the memory frame images respectively, and the memory frame feature mapping is obtained as:
F_m' = concat(F_m1, F_m2).
6. The target tracking method based on a space-time interaction attention mechanism according to claim 1, wherein: in step 2 the query frame image features are enhanced, each element vector in the input sequence of query image features is automatically weighted and summed with the other vectors, the similarity between features is then computed, key features are extracted, and the weight of each feature is recomputed according to these key features to obtain a stronger feature representation, with the following specific operations:
S21, the feature vector matrix F_q produced by the backbone network first undergoes three linear-transformation convolution operations that project it into three different feature spaces, giving three feature representations: Query, Key and Value;
S22, the Query and the Key undergo a transposed multiplication to obtain a matrix, energy, that describes their similarity;
S23, the energy matrix is normalized with a SoftMax function so that every feature point is assigned a weight between 0 and 1, producing the attention weight coefficient matrix attention1, which mainly highlights the feature points that are more important to the model;
S24, finally, all Value vectors are weighted and summed with the weight coefficients attention1 to obtain the final weight vectors, all final weight vectors are reassembled into a new feature matrix according to the shape of the original feature matrix, and the feature output by the query image feature enhancement module is denoted F_q'.
7. The target tracking method based on a space-time interaction attention mechanism according to claim 1, wherein the space-time interaction model in step 3 mainly performs space-time interaction between the feature information of the memory frame branch and of the query frame branch processed by the feature enhancement module, so as to obtain a feature representation more favorable to the tracking task, with the following specific operation steps:
S31, the memory frame feature F_m' and the query frame feature F_q' obtained in step 2 are taken as the input of the space-time interaction attention module and are reshaped into feature matrices of the same spatial size, B×C×H×W.
S32, three convolution layers apply the Query and Key transformations to the memory frame branch F_m' and the Value transformation to the query frame branch F_q'; the Query and the Key undergo a transposed multiplication to obtain a matrix describing their correlation; the similarity matrix is then normalized with a SoftMax function so that every feature point receives a weight between 0 and 1, giving the attention weight attention21, which is used to weight all Value vectors of the query frame F_q' and obtain the weighted memory frame feature F_q,m; finally, the weighted feature is spliced with the memory frame branch to obtain the space-time interactive memory frame feature F_m'', expressed as:
F_m'' = concat(F_q,m, F_m') = F_q,m + F_m'
S33, three convolution layers apply the Query and Key transformations to the query frame branch F_q' and the Value transformation to the memory frame branch F_m'; the Query and the Key undergo a transposed multiplication to obtain a matrix describing their correlation; the similarity matrix is then normalized with a SoftMax function so that every feature point receives a weight between 0 and 1, giving the attention weight attention22, which is used to weight all Value vectors of the memory frame F_m' and obtain the weighted query feature F_m,q; finally, the weighted feature is spliced with the query frame branch to obtain the space-time interactive query frame feature F_q'', expressed as:
F_q'' = concat(F_m,q, F_q') = F_m,q + F_q'.
8. the method of claim 1, wherein the template updating mechanism in step 4 uses a space-time memory (memory reader) to target mainly using the history information of the memory frameUpdating the template, firstly calculating the memory frame F output in the step 3 m "AND INQUIRY Frames F q Similarity between each pixel in the' to obtain a similarity matrix, wherein the similarity matrix is used as a soft weight mapping and memory frame F m "multiplying, adaptively searching information stored in memory frame, searching the most useful information related to inquiry frame from the information stored in memory frame so as to implement updating of template, finally combining the read information with inquiry frame characteristic F q "splice along channel dimension, generate final composite feature map y.
9. The method of claim 1, wherein in step 5 the classification-regression network produces the response maps: a lightweight classification convolution network ω_cls encodes the feature map y obtained in step 4, and a linear convolution layer with a 1×1 kernel reduces the output of ω_cls to one channel, giving the final classification response map R_cls ∈ R^(1×H×W); the classification network also computes a centrality response R_ctr ∈ R^(1×H×W), and in the inference phase R_ctr is multiplied with R_cls to suppress the classification confidence scores of pixels far from the target center; in the regression task, a lightweight convolution network ω_reg encodes the feature map y, the output feature dimension is reduced to 4, and a regression response map R_reg ∈ R^(4×H×W) is generated for bounding-box estimation.
10. The method of claim 1, wherein the loss function in step 5 comprises the classification loss L_cls, which uses a cross entropy loss function, the regression loss L_reg, which uses an IoU loss function, and the centrality loss L_ctr, which uses a Sigmoid cross entropy loss function; the total loss function is expressed as:
L = L_cls + λ_1·L_reg + λ_2·L_ctr
where λ_1 and λ_2 are both hyperparameters.
CN202310523575.0A 2023-05-10 2023-05-10 Target tracking method based on space-time interaction attention mechanism Pending CN116563355A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310523575.0A CN116563355A (en) 2023-05-10 2023-05-10 Target tracking method based on space-time interaction attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310523575.0A CN116563355A (en) 2023-05-10 2023-05-10 Target tracking method based on space-time interaction attention mechanism

Publications (1)

Publication Number Publication Date
CN116563355A true CN116563355A (en) 2023-08-08

Family

ID=87503039

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310523575.0A Pending CN116563355A (en) 2023-05-10 2023-05-10 Target tracking method based on space-time interaction attention mechanism

Country Status (1)

Country Link
CN (1) CN116563355A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116953653A (en) * 2023-09-19 2023-10-27 成都远望科技有限责任公司 Networking echo extrapolation method based on multiband weather radar
CN116953653B (en) * 2023-09-19 2023-12-26 成都远望科技有限责任公司 Networking echo extrapolation method based on multiband weather radar
CN117522925A (en) * 2024-01-05 2024-02-06 成都合能创越软件有限公司 Method and system for judging object motion state in mobile camera under attention mechanism
CN117522925B (en) * 2024-01-05 2024-04-16 成都合能创越软件有限公司 Method and system for judging object motion state in mobile camera under attention mechanism
CN118172390A (en) * 2024-05-15 2024-06-11 南京新越阳科技有限公司 Target tracking method based on deep learning

Similar Documents

Publication Publication Date Title
Hou et al. Cross attention network for few-shot classification
CN112288011B (en) Image matching method based on self-attention deep neural network
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
CN111627052A (en) Action identification method based on double-flow space-time attention mechanism
CN116563355A (en) Target tracking method based on space-time interaction attention mechanism
CN113673307A (en) Light-weight video motion recognition method
CN111639692A (en) Shadow detection method based on attention mechanism
CN112991350B (en) RGB-T image semantic segmentation method based on modal difference reduction
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
CN113011329A (en) Pyramid network based on multi-scale features and dense crowd counting method
CN113706581B (en) Target tracking method based on residual channel attention and multi-level classification regression
CN115147456B (en) Target tracking method based on time sequence self-adaptive convolution and attention mechanism
CN113095254A (en) Method and system for positioning key points of human body part
CN114913379B (en) Remote sensing image small sample scene classification method based on multitasking dynamic contrast learning
Yang et al. BANDT: A border-aware network with deformable transformers for visual tracking
CN116453025A (en) Volleyball match group behavior identification method integrating space-time information in frame-missing environment
CN115375737A (en) Target tracking method and system based on adaptive time and serialized space-time characteristics
Huang et al. Region-based non-local operation for video classification
CN116844004A (en) Point cloud automatic semantic modeling method for digital twin scene
CN111914809B (en) Target object positioning method, image processing method, device and computer equipment
CN114240811A (en) Method for generating new image based on multiple images
CN117649582A (en) Single-flow single-stage network target tracking method and system based on cascade attention
CN112528077A (en) Video face retrieval method and system based on video embedding
CN113780305B (en) Significance target detection method based on interaction of two clues
CN113706650A (en) Image generation method based on attention mechanism and flow model

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination