CN117036417A - Multi-scale Transformer target tracking method based on spatio-temporal template updating - Google Patents
Multi-scale Transformer target tracking method based on spatio-temporal template updating
- Publication number
- CN117036417A CN117036417A CN202311171271.9A CN202311171271A CN117036417A CN 117036417 A CN117036417 A CN 117036417A CN 202311171271 A CN202311171271 A CN 202311171271A CN 117036417 A CN117036417 A CN 117036417A
- Authority
- CN
- China
- Prior art keywords
- layer
- token
- template
- image
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/042—Knowledge-based neural networks; Logical representations of neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Abstract
The invention discloses a multi-scale Transformer target tracking method based on spatio-temporal template updating, which comprises the following steps: extracting features of the initial template image, the dynamic template image and the search area image with a Shunted Transformer to obtain three corresponding tokens; concatenating the three tokens and feeding them sequentially into a Transformer-based encoder and decoder for feature fusion to obtain a fused feature sequence; performing bounding box prediction on the fused feature sequence through a classification branch and a regression branch, and outputting the tracking result; and, after the number of frames processed by the tracker reaches the update interval, updating the dynamic template from the fused feature sequence through the confidence branch. By using the Shunted Transformer as the feature extraction backbone, the method can learn multi-scale features and improve the representation of the target; meanwhile, adding the dynamic template allows the latest state of the target to be captured, effectively meeting challenges such as occlusion and deformation of the target.
Description
Technical Field
The invention belongs to the field of image recognition and target tracking, and in particular relates to a multi-scale Transformer target tracking method based on spatio-temporal template updating.
Background
Visual target tracking is a hot research area of computer vision with broad application prospects in fields such as the military, autonomous driving and medical treatment. Visual object tracking refers to automatically giving the position and shape of an object of interest in the subsequent frames of a video, given a bounding box marking that object in the first frame. However, in complex real scenes the tracking process is affected by environmental factors such as background clutter, occlusion and deformation, so designing a tracking algorithm that operates efficiently in real scenes is a difficult task.
Based on their working modes, current deep-learning target tracking algorithms are mainly divided into tracking algorithms based on convolutional neural networks and tracking algorithms based on Transformers. Siamese-network trackers describe the target with convolution operations; although they perform well when modeling local relations of image content, they cannot effectively process global context information. Transformer-based algorithms can process both local and global information with the attention mechanism, which strengthens the characterization capability and improves the robustness of the tracker. However, most of these algorithms handle video frames in isolation and do not fully mine the temporal and spatial information in the video. In real-world scenarios the tracker also needs to overcome challenges such as occlusion, object deformation and scale change, and these challenges are further amplified as the time span grows, so developing a high-precision, robust and real-time tracker remains extremely challenging.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a multi-scale Transformer target tracking method based on spatio-temporal template updating, which better meets challenges such as occlusion and target deformation and enhances the robustness of the tracker.
In order to solve the above technical problems, the invention provides the following technical scheme: a multi-scale Transformer target tracking method based on spatio-temporal template updating, comprising the following steps:
s1, extracting features of an initial template image, a dynamic template image and a search area image to obtain an initial template token, a dynamic template token and a search area token;
S2, concatenating the initial template token, the dynamic template token and the search area token, and then sequentially inputting them into the encoder and decoder of a Transformer for feature fusion to obtain a fused feature sequence;
S3, performing bounding box prediction on the fused feature sequence through a classification branch and a regression branch, and outputting the tracking result;
S4, after the number of processed frames reaches the update interval, updating the dynamic template from the fused feature sequence through the confidence branch.
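The S1–S4 pipeline above can be summarized as a per-frame tracking loop. The following is a hypothetical sketch, not the patent's implementation: the `extract`, `fuse`, `predict_box`, `confidence` and `crop_template` callables stand in for the Shunted Transformer backbone, the Transformer encoder-decoder, the prediction heads and the template-cropping step described in the text.

```python
# Hypothetical sketch of the S1-S4 tracking loop; all callables are
# placeholders for the modules described in the patent text.

def track(frames, init_template, update_interval=200, tau=0.5,
          extract=None, fuse=None, predict_box=None, confidence=None,
          crop_template=None):
    """Run the tracker over a list of frames and return one box per frame."""
    dynamic_template = init_template          # starts as the initial template
    f_init = extract(init_template)           # S1: initial template token (computed once)
    results = []
    for idx, frame in enumerate(frames):
        f_dyn = extract(dynamic_template)     # S1: dynamic template token
        f_search = extract(frame)             # S1: search area token
        fused = fuse(f_init, f_dyn, f_search) # S2: encoder-decoder fusion
        box = predict_box(fused)              # S3: classification + regression heads
        results.append(box)
        # S4: confidence-gated template update at fixed intervals
        if idx > 0 and idx % update_interval == 0 and confidence(fused) > tau:
            dynamic_template = crop_template(frame, box)
    return results
```

With stub functions in place of the networks, the loop produces one box per frame and only refreshes the dynamic template when the interval and confidence conditions both hold.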
Further, the step S1 includes the following sub-steps:
S101, judging from the video whether the current frame is the first frame; if so, extracting features of the initial template image and proceeding to step S103; otherwise proceeding to step S102;
S102, extracting features of the search area image and, when the dynamic template image has been updated, extracting features of the dynamic template image; then proceeding to step S103;
S103, applying a new patch embedding to an input image of size H×W: the input image is passed sequentially through a first convolution layer, a second convolution layer and a non-overlapping projection layer to obtain a feature sequence of size H/4×W/4; the first convolution layer serves as the first layer of the patch embedding;
S104, feature extraction comprises three stages, each containing a linear embedding layer and a Shunted Transformer module; each Shunted Transformer module comprises a shunted self-attention layer and a detail-specific feed-forward layer. The output feature sequence obtained in step S103 serves as the input sequence F and is first projected into query Q, key K and value V tensors. Multi-scale token aggregation (MTA) is used: for the different heads indexed by i, the key K and the value V are downsampled to different sizes so as to preserve both fine-grained and coarser details:
Q_i = X W_i^Q,
K_i, V_i = MTA(X; r_i) W_i^K, MTA(X; r_i) W_i^V,
V_i = V_i + LE(V_i),
wherein MTA(·; r_i) is the multi-scale token aggregation layer in the i-th head, r_i is its downsampling rate, and W_i^Q, W_i^K, W_i^V are the parameters of the linear projections of the i-th head; LE(·) is the local enhancement component applied by the MTA to the value V through depthwise convolution; Q_i, K_i, V_i respectively denote the query, key and value of the i-th head, and X is the feature representation of the input image. Different heads i have different (not exactly identical) rates r_i, so their computational costs differ.
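The per-head downsampling of K and V can be sketched in NumPy. This is an illustrative simplification, not the patent's implementation: MTA is approximated by average pooling over r×r token windows, the local enhancement LE(·) is omitted, and the projection matrices are plain arrays.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mta(x, r):
    """Multi-scale token aggregation (simplified): average-pool r x r token
    windows of an (H, W, C) feature map and flatten to (H*W/r^2, C)."""
    H, W, C = x.shape
    pooled = x.reshape(H // r, r, W // r, r, C).mean(axis=(1, 3))
    return pooled.reshape(-1, C)

def shunted_head(x, wq, wk, wv, r):
    """One shunted self-attention head: K and V come from tokens
    downsampled by rate r, while Q keeps the full token resolution."""
    H, W, C = x.shape
    q = x.reshape(-1, C) @ wq                 # (H*W, d): full resolution
    pooled = mta(x, r)                        # (H*W/r^2, C): coarser tokens
    k, v = pooled @ wk, pooled @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v                           # (H*W, d)
```

Heads with a large r attend over fewer, coarser key/value tokens (cheaper, coarse context), while heads with r = 1 keep full-resolution keys and values, which is the multi-granularity behavior the text describes.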
S105, the multi-head self-attention then uses H independent attention heads to compute the shunted self-attention in parallel:
h_i = Softmax(Q_i K_i^T / √d_h) V_i,
wherein h_i denotes the i-th self-attention head and d_h is the head dimension;
S106, local details are supplemented in the feed-forward layer by adding a detail-specific layer between its two fully connected layers:
x′ = FC(x; θ_1),
x″ = FC(σ(x′ + DS(x′; θ_1)); θ_2),
wherein DS(·; θ) is the detail-specific layer with parameters θ, realized by depthwise convolution; FC(·; θ) is a fully connected layer with parameters θ; σ(·) denotes the activation function that applies a nonlinear transformation to its input; x denotes the feature sequence input to the feed-forward layer, x′ the output of x after the first fully connected layer, and x″ the output of x′ at the second fully connected layer after passing through the detail-specific layer and the activation, i.e. the output feature sequence of the whole feed-forward layer;
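The detail-specific feed-forward layer can be sketched as follows. This is a minimal NumPy illustration under assumed shapes (tokens kept on their 2D grid), with DS(·) as a per-channel 3×3 depthwise convolution and a plain ReLU standing in for σ(·); the actual kernel size and activation are implementation choices not fixed by the text.

```python
import numpy as np

def depthwise_conv3x3(x, kernel):
    """Per-channel 3x3 convolution (the DS layer), zero-padded.
    x: (H, W, C) feature map, kernel: (3, 3, C) per-channel weights."""
    H, W, C = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += padded[i:i + H, j:j + W] * kernel[i, j]
    return out

def detail_specific_ffn(x, w1, w2, kernel):
    """x'' = FC(sigma(x' + DS(x'; theta1)); theta2) with x' = FC(x; theta1);
    sigma is a ReLU here for simplicity (an assumption)."""
    x1 = x @ w1                                        # first fully connected layer
    x1 = np.maximum(0.0, x1 + depthwise_conv3x3(x1, kernel))
    return x1 @ w2                                     # second fully connected layer
```

The depthwise branch injects local spatial detail between the two channel-mixing layers while leaving the input/output token shape unchanged.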
S107, steps S104 to S106 are executed iteratively twice more to obtain the output feature sequence of the input sequence F.
Further, in step S103, the first convolution layer has a 7×7 kernel with stride 2, the second convolution layer has a 3×3 kernel, and the non-overlapping projection layer has stride 1.
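The H/4×W/4 output size of the patch embedding can be checked with the standard convolution arithmetic. Note an assumption in this sketch: the text gives strides only for the first convolution (2) and the projection (1), so the second 3×3 convolution is assumed to have stride 2 and padding 1 for the overall downsampling to reach 1/4.

```python
def conv_out(n, k, s, p):
    """Output length of a 1D convolution axis: floor((n + 2p - k)/s) + 1."""
    return (n + 2 * p - k) // s + 1

def patch_embed_size(h, w):
    # First conv: 7x7 kernel, stride 2 (padding 3 assumed) -> H/2 x W/2.
    h, w = conv_out(h, 7, 2, 3), conv_out(w, 7, 2, 3)
    # Second conv: 3x3 kernel; stride 2 and padding 1 are assumed here so
    # the overall downsampling reaches H/4 x W/4 as the text states.
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    # Non-overlapping projection with stride 1 keeps the spatial size.
    return h, w
```

For a 224×224 input this yields a 56×56 token grid, i.e. exactly H/4×W/4.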
Further, the step S2 includes the following sub-steps:
S201, the initial template token, the dynamic template token and the search area token are concatenated along the spatial dimension to obtain a mixed representation:
f^0 = Concat(f_z, f_dz, f_x),
wherein f_z, f_dz and f_x respectively denote the initial template token, the dynamic template token and the search area token obtained through feature extraction.
S202, the mixed representation f^{l-1} is first layer-normalized and fed into the multi-head self-attention (MSA) module, and the result is added to f^{l-1} through a residual connection to obtain f′^l; f′^l is then layer-normalized and fed into the feed-forward network (FFN), and the output is added to f′^l through a residual connection to obtain the output f^l of one Transformer block:
f′^l = f^{l-1} + MSA(LN(f^{l-1})),
f^l = f′^l + FFN(LN(f′^l)).
S203, when the mixed representation leaves the encoder, a decoupling operation is used to separate the initial template token, the dynamic template token and the search area token:
(f_z^L, f_dz^L, f_x^L) = Decouple(f^L).
S204, the decoder takes the outputs of the encoder as input, namely the decoupled search area token f_x^L and the undecoupled mixed representation f^L. Both inputs are first layer-normalized, and the multi-head cross-attention between f_x^L and f^L is then computed to obtain the final representation:
f′_vm = f_x^L + MCA(LN(f_x^L), LN(f^L)),
f_vm = f′_vm + FFN(LN(f′_vm)),
wherein MCA(·) denotes the multi-head cross-attention module, FFN(·) the feed-forward network and LN(·) layer normalization.
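The pre-norm encoder block (S202) and the decoder's cross-attention (S204) can be sketched together in NumPy. This is a single-head illustration without learned Q/K/V projections, an assumption made purely to keep the residual structure visible; it is not the full multi-head module.

```python
import numpy as np

def ln(x, eps=1e-5):
    """Layer normalization over the channel dimension."""
    m, v = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - m) / np.sqrt(v + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attn(q_in, kv_in, d):
    """Single-head attention: q_in attends to kv_in
    (self-attention when both are the same sequence)."""
    a = softmax(q_in @ kv_in.T / np.sqrt(d))
    return a @ kv_in

def encoder_block(f, ffn):
    f1 = f + attn(ln(f), ln(f), f.shape[-1])        # pre-norm MSA + residual
    return f1 + ffn(ln(f1))                         # pre-norm FFN + residual

def decoder_block(fx, f_all, ffn):
    """Search tokens fx query the full mixed representation f_all (MCA)."""
    f1 = fx + attn(ln(fx), ln(f_all), fx.shape[-1])
    return f1 + ffn(ln(f1))
```

The decoder output keeps the search-token length, which is what the prediction heads in S3 consume.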
Further, the step S3 includes the following sub-steps:
S301, the output sequence of the decoder is sent to the classification branch, which uses the IoU-aware classification score as the training target and VariFocal Loss as the training loss function; the classification loss is expressed as
L_cls = L_VariFocal(p, IoU(b, b̂)),
wherein p denotes the predicted IoU-aware classification score, b the predicted bounding box and b̂ the real bounding box;
S302, the output sequence of the decoder is sent to the regression branch, which uses the generalized IoU (GIoU) loss; the regression loss is expressed as
L_reg = p · L_GIoU(b, b̂),
wherein the GIoU loss is weighted by p to emphasize samples with a high classification score.
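The score-weighted GIoU regression loss can be sketched directly from its definition. The averaging over samples below is an assumption; the original does not state the reduction.

```python
def giou(a, b):
    """Generalized IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # smallest enclosing box of the two
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (c - union) / c

def weighted_giou_loss(preds, gts, scores):
    """Regression loss: the GIoU loss (1 - GIoU) weighted by the IoU-aware
    classification score p, averaged over samples."""
    total = sum(p * (1.0 - giou(b, g)) for b, g, p in zip(preds, gts, scores))
    return total / len(preds)
```

GIoU equals 1 for identical boxes and decreases toward -1 as boxes separate, so the weighted loss vanishes for perfect, high-confidence predictions.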
Further, the step S4 includes the following sub-steps:
S401, when the number of video frames reaches the specified interval t, the output sequence of the decoder is input into the confidence branch, which then determines whether the target state in the current frame is reliable. Specifically, it is judged whether the confidence score s is higher than a set threshold τ: if so, the target state of the current frame is considered reliable and the dynamic template is updated; otherwise the dynamic template is not updated. The target state of the current frame is thus judged reliable under the condition s > τ.
S402, when the dynamic template is updated, an image of the dynamic template size is cropped, centred on the target predicted in the search area of the current frame, to serve as the new dynamic template.
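Steps S401 and S402 reduce to a gating test plus a clamped crop window. A minimal sketch, assuming the crop is clamped to the frame bounds (the original does not specify how out-of-frame regions are handled):

```python
def maybe_update(frame_idx, interval, conf_score, tau):
    """S401: update only at the fixed interval and when the confidence
    branch judges the current target state reliable (score > tau)."""
    return frame_idx % interval == 0 and conf_score > tau

def crop_dynamic_template(frame_w, frame_h, box, tmpl_size):
    """S402: return the (x1, y1, x2, y2) crop window of side tmpl_size
    centred on the predicted box, clamped to stay inside the frame."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    half = tmpl_size / 2.0
    x1 = min(max(cx - half, 0.0), frame_w - tmpl_size)
    y1 = min(max(cy - half, 0.0), frame_h - tmpl_size)
    return (x1, y1, x1 + tmpl_size, y1 + tmpl_size)
```

The gate keeps unreliable frames (occlusion, drift) from contaminating the dynamic template, which is the point of the confidence branch.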
In another aspect, the invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of any method of the invention when executing the computer program.
The invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the method of any of the invention.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. multi-scale feature information is retained during feature extraction, and the obtained feature tokens can learn the relations between objects of different scales, improving the characterization capability of the tracker;
2. the latest state of the target can be captured through the dynamic template, and by combining temporal and spatial information, challenges such as occlusion and target deformation can be effectively met, enhancing the robustness of the tracker.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow chart of the operation of the feature extraction module of the present invention.
Detailed Description
For a better understanding of the technical content of the present invention, specific examples are set forth below, along with the accompanying drawings.
Aspects of the invention are described herein with reference to the drawings, in which many illustrative embodiments are shown. The embodiments of the invention are not limited to those shown in the drawings. It is to be understood that the invention can be carried out through any of the various concepts and embodiments described above and below, since the disclosed concepts and embodiments are not limited to any particular implementation. Additionally, some aspects of the disclosure may be used alone or in any suitable combination with other aspects of the disclosure.
As shown in fig. 1, the multi-scale Transformer target tracking method based on spatio-temporal template updating of the invention comprises the following steps:
s1, extracting features of an initial template image, a dynamic template image and a search area image to obtain an initial template token, a dynamic template token and a search area token;
S2, concatenating the initial template token, the dynamic template token and the search area token, and then sequentially inputting them into the encoder and decoder of a Transformer for feature fusion to obtain a fused feature sequence;
S3, performing bounding box prediction on the fused feature sequence through a classification branch and a regression branch, and outputting the tracking result;
S4, after the number of processed frames reaches the update interval, updating the dynamic template from the fused feature sequence through the confidence branch.
Referring to fig. 2, further, as a preferred technical solution of the multi-scale Transformer target tracking method based on spatio-temporal template updating of the invention, step S1 comprises the following sub-steps:
S101, judging from the video whether the current frame is the first frame; if so, extracting features of the initial template image and proceeding to step S103; otherwise proceeding to step S102;
S102, extracting features of the search area image and, when the dynamic template image has been updated, extracting features of the dynamic template image; then proceeding to step S103;
S103, applying a new patch embedding to an input image of size H×W: the input image is passed sequentially through a first convolution layer, a second convolution layer and a non-overlapping projection layer to obtain a feature sequence of size H/4×W/4; the first convolution layer serves as the first layer of the patch embedding. The first convolution layer has a 7×7 kernel with stride 2, the second convolution layer has a 3×3 kernel, and the non-overlapping projection layer has stride 1.
S104, feature extraction comprises three stages, each containing a linear embedding layer and a Shunted Transformer module; each Shunted Transformer module comprises a shunted self-attention layer and a detail-specific feed-forward layer. The output feature sequence obtained in step S103 serves as the input sequence F and is first projected into query Q, key K and value V tensors. Multi-scale token aggregation (MTA) is used: for the different heads indexed by i, the key K and the value V are downsampled to different sizes so as to preserve both fine-grained and coarser details:
Q_i = X W_i^Q,
K_i, V_i = MTA(X; r_i) W_i^K, MTA(X; r_i) W_i^V,
V_i = V_i + LE(V_i),
wherein MTA(·; r_i) is the multi-scale token aggregation layer in the i-th head, r_i is its downsampling rate, and W_i^Q, W_i^K, W_i^V are the parameters of the linear projections of the i-th head; LE(·) is the local enhancement component applied by the MTA to the value V through depthwise convolution; Q_i, K_i, V_i respectively denote the query, key and value of the i-th head, and X is the feature representation of the input image. Different heads i have different rates r_i, so their computational costs differ.
S105, the multi-head self-attention then uses H independent attention heads to compute the shunted self-attention in parallel:
h_i = Softmax(Q_i K_i^T / √d_h) V_i,
wherein h_i denotes the i-th self-attention head and d_h is the head dimension;
S106, local details are supplemented in the feed-forward layer by adding a detail-specific layer between its two fully connected layers:
x′ = FC(x; θ_1),
x″ = FC(σ(x′ + DS(x′; θ_1)); θ_2),
wherein DS(·; θ) is the detail-specific layer with parameters θ, realized by depthwise convolution; FC(·; θ) is a fully connected layer with parameters θ; σ(·) denotes the activation function that applies a nonlinear transformation to its input; x denotes the feature sequence input to the feed-forward layer, x′ the output of x after the first fully connected layer, and x″ the output of x′ at the second fully connected layer after passing through the detail-specific layer and the activation, i.e. the output feature sequence of the whole feed-forward layer;
S107, steps S104 to S106 are executed iteratively twice more to obtain the output feature sequence of the input sequence F.
Further, as a preferred technical solution of the multi-scale Transformer target tracking method based on spatio-temporal template updating of the invention, step S2 comprises the following sub-steps:
S201, the initial template token, the dynamic template token and the search area token are concatenated along the spatial dimension to obtain a mixed representation:
f^0 = Concat(f_z, f_dz, f_x),
wherein f_z, f_dz and f_x respectively denote the initial template token, the dynamic template token and the search area token obtained through feature extraction.
S202, the encoder comprises a series of Transformer blocks. The mixed representation f^{l-1} is first layer-normalized and fed into the multi-head self-attention (MSA) module, and the result is added to f^{l-1} through a residual connection to obtain f′^l; f′^l is then layer-normalized and fed into the feed-forward network (FFN), and the output is added to f′^l through a residual connection to obtain the output f^l of one Transformer block:
f′^l = f^{l-1} + MSA(LN(f^{l-1})),
f^l = f′^l + FFN(LN(f′^l)).
S203, when the mixed representation leaves the encoder, a decoupling operation is used to separate the initial template token, the dynamic template token and the search area token:
(f_z^L, f_dz^L, f_x^L) = Decouple(f^L).
S204, the decoder takes the outputs of the encoder as input, namely the decoupled search area token f_x^L and the undecoupled mixed representation f^L. Both inputs are first layer-normalized, and the multi-head cross-attention between f_x^L and f^L is then computed to obtain the final representation:
f′_vm = f_x^L + MCA(LN(f_x^L), LN(f^L)),
f_vm = f′_vm + FFN(LN(f′_vm)),
wherein MCA(·) denotes the multi-head cross-attention module, FFN(·) the feed-forward network and LN(·) layer normalization.
Further, as a preferred technical solution of the multi-scale Transformer target tracking method based on spatio-temporal template updating of the invention, step S3 comprises the following sub-steps:
S301, the output sequence of the decoder is sent to the classification branch, which uses the IoU-aware classification score as the training target and VariFocal Loss as the training loss function; the classification loss is expressed as
L_cls = L_VariFocal(p, IoU(b, b̂)),
wherein p denotes the predicted IoU-aware classification score, b the predicted bounding box and b̂ the real bounding box;
S302, the output sequence of the decoder is sent to the regression branch, which uses the generalized IoU (GIoU) loss; the regression loss is expressed as
L_reg = p · L_GIoU(b, b̂),
wherein the GIoU loss is weighted by p to emphasize samples with a high classification score.
Further, as a preferred technical solution of the multi-scale Transformer target tracking method based on spatio-temporal template updating of the invention, step S4 comprises the following sub-steps:
S401, when the number of video frames reaches the specified interval t, the output sequence of the decoder is input into the confidence branch, which then determines whether the target state in the current frame is reliable. Specifically, it is judged whether the confidence score s is higher than a set threshold τ: if so, the target state of the current frame is considered reliable and the dynamic template is updated; otherwise the dynamic template is not updated. The target state of the current frame is thus judged reliable under the condition s > τ.
S402, when the dynamic template is updated, an image of the dynamic template size is cropped, centred on the target predicted in the search area of the current frame, to serve as the new dynamic template.
In another aspect, the invention proposes an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, said processor implementing the steps of the method of the invention when executing said computer program.
The invention also proposes a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method according to the invention.
While the invention has been described in terms of preferred embodiments, it is not intended to be limiting. Those skilled in the art will appreciate that various modifications and adaptations can be made without departing from the spirit and scope of the present invention. Accordingly, the scope of the invention is defined by the appended claims.
Claims (8)
1. A multi-scale Transformer target tracking method based on spatio-temporal template updating, characterized by comprising the following steps:
s1, extracting features of an initial template image, a dynamic template image and a search area image to obtain an initial template token, a dynamic template token and a search area token;
S2, concatenating the initial template token, the dynamic template token and the search area token, and then sequentially inputting them into the encoder and decoder of a Transformer for feature fusion to obtain a fused feature sequence;
S3, performing bounding box prediction on the fused feature sequence through a classification branch and a regression branch, and outputting the tracking result;
S4, after the number of processed frames reaches the update interval, updating the dynamic template from the fused feature sequence through the confidence branch.
2. The multi-scale Transformer target tracking method based on spatio-temporal template updating according to claim 1, characterized in that step S1 comprises the following sub-steps:
S101, judging from the video whether the current frame is the first frame; if so, extracting features of the initial template image and proceeding to step S103; otherwise proceeding to step S102;
S102, extracting features of the search area image and, when the dynamic template image has been updated, extracting features of the dynamic template image; then proceeding to step S103;
S103, applying a new patch embedding to an input image of size H×W: the input image is passed sequentially through a first convolution layer, a second convolution layer and a non-overlapping projection layer to obtain a feature sequence of size H/4×W/4; the first convolution layer serves as the first layer of the patch embedding;
S104, feature extraction comprises three stages, each containing a linear embedding layer and a Shunted Transformer module; each Shunted Transformer module comprises a shunted self-attention layer and a detail-specific feed-forward layer. The output feature sequence obtained in step S103 serves as the input sequence F and is first projected into query Q, key K and value V tensors. Multi-scale token aggregation (MTA) is used: for the different heads indexed by i, the key K and the value V are downsampled to different sizes so as to preserve both fine-grained and coarser details:
Q_i = X W_i^Q,
K_i, V_i = MTA(X; r_i) W_i^K, MTA(X; r_i) W_i^V,
V_i = V_i + LE(V_i),
wherein MTA(·; r_i) is the multi-scale token aggregation layer in the i-th head, r_i is its downsampling rate, and W_i^Q, W_i^K, W_i^V are the parameters of the linear projections of the i-th head; LE(·) is the local enhancement component applied by the MTA to the value V through depthwise convolution; Q_i, K_i, V_i respectively denote the query, key and value of the i-th head, and X is the feature representation of the input image. Different heads i have different (not exactly identical) rates r_i, so their computational costs differ.
S105, the multi-head self-attention then uses H independent attention heads to compute the shunted self-attention in parallel:
h_i = Softmax(Q_i K_i^T / √d_h) V_i,
wherein h_i denotes the i-th self-attention head and d_h is the head dimension;
s106, supplementing local details in the feedforward layer by adding a data specific layer between two fully connected layers of the feedforward layer, wherein the formula is as follows:
x′ = FC(x; θ_1),
x″ = FC(σ(x′ + DS(x′; θ)); θ_2).
wherein DS(·; θ) is the detail-specific layer with parameter θ, implemented by depth-wise convolution; FC(·; θ) is a fully connected layer with parameter θ; σ(·) denotes the activation function applying a nonlinear transformation to its input; x denotes the feature sequence input to the feed-forward layer, x′ the feature sequence output by the first fully connected layer, and x″ the feature sequence output by the second fully connected layer after the detail-specific layer and activation, i.e., the output of the whole feed-forward layer;
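A minimal NumPy sketch of the detail-specific feed-forward layer, assuming σ is ReLU and DS is a 3×3 depth-wise convolution with zero padding (the kernel shape and dimensions are illustrative assumptions, not stated in the claim):

```python
import numpy as np

def detail_specific_ffn(x, W1, b1, W2, b2, dw):
    # x'' = FC(sigma(x' + DS(x'; theta)); theta_2), with DS a 3x3 depth-wise
    # convolution over the token grid and sigma taken here to be ReLU.
    n, _ = x.shape
    h = w = int(np.sqrt(n))
    x1 = x @ W1 + b1                              # x' = FC(x; theta_1)
    c = x1.shape[-1]
    grid = np.pad(x1.reshape(h, w, c), ((1, 1), (1, 1), (0, 0)))
    ds = np.zeros((h, w, c))
    for i in range(3):                            # depth-wise 3x3 conv:
        for j in range(3):                        # one kernel per channel
            ds += grid[i:i + h, j:j + w, :] * dw[i, j, :]
    act = np.maximum(x1 + ds.reshape(n, c), 0.0)  # sigma(x' + DS(x'))
    return act @ W2 + b2                          # x'' = FC(...; theta_2)
```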
S107, iteratively executing steps S104 to S106 twice to obtain the feature sequence of the input sequence F.
3. The method according to claim 2, wherein in step S103 the first convolution layer has a 7×7 kernel with a stride of 2, the second convolution layer has a 3×3 kernel, and the non-overlapping projection layer has a stride of 1.
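The H/4 × W/4 output size of the patch-embedding stem can be verified with a small size calculation; note the claim does not state the stride or padding of the second convolution, so a stride of 2 with 'same'-style padding is assumed here so that the stem reaches H/4:

```python
def conv_out(size, kernel, stride, padding):
    # Spatial output size of a convolution layer.
    return (size + 2 * padding - kernel) // stride + 1

H = 256                      # example input height (width is analogous)
h1 = conv_out(H, 7, 2, 3)    # first conv: 7x7 kernel, stride 2 -> H/2
h2 = conv_out(h1, 3, 2, 1)   # second conv: 3x3, stride 2 assumed -> H/4
# the non-overlapping projection with stride 1 then keeps the H/4 grid
assert h2 == H // 4
```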
4. The multi-scale Transformer target tracking method based on spatio-temporal template updating according to claim 2, wherein step S2 comprises the following sub-steps:
S201, splicing the initial template token, the dynamic template token and the search region token, each obtained by feature extraction, along the spatial dimension to obtain a mixed representation f^0;
S202, for the l-th Transformer block, the mixed representation f^(l-1) is first layer-normalized and input into the multi-head self-attention MSA module, and the result is added to f^(l-1) to obtain f̂^l; f̂^l is then layer-normalized and input into the feed-forward network FFN, and the output is added to f̂^l to obtain the output f^l of the Transformer block, as follows:
f̂^l = MSA(LN(f^(l-1))) + f^(l-1),
f^l = FFN(LN(f̂^l)) + f̂^l;
S203, when the mixed representation f^L is output by the encoder, the initial template token, the dynamic template token and the search region token are separated again by a decoupling operation;
s204, the decoder takes the output of the encoder as input, namely the decoupled search area tokenAnd an uncoupled mixed representation f L All inputs are first layer normalized and then +.>And f L To obtain a final characterization, the formula is as follows:
f vm =f v ′ m +FFN(LN(f v ′ m ));
wherein MCA(·) represents the multi-head cross-attention module, FFN(·) the feed-forward network, and LN(·) layer normalization.
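Steps S201 to S204 can be sketched end-to-end as below; this is a toy single-head NumPy illustration with identity attention projections, a ReLU stand-in for the FFN, and arbitrary token counts, not the claimed network:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ln(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attend(q, kv):
    # single-head (cross-)attention with identity projections, for brevity
    return softmax(q @ kv.T / np.sqrt(q.shape[-1])) @ kv

n_z0, n_zd, n_x, c = 4, 4, 8, 16
rng = np.random.default_rng(3)
f0 = rng.normal(size=(n_z0 + n_zd + n_x, c))     # S201: spliced mixed tokens
fh = f0 + attend(ln(f0), ln(f0))                 # MSA(LN(f)) + f
fL = fh + np.maximum(ln(fh), 0.0)                # FFN (sketched) + residual
fx = fL[n_z0 + n_zd:]                            # S203: decouple search tokens
fvm1 = fx + attend(ln(fx), ln(fL))               # MCA(LN(f_x), LN(f^L)) + f_x
fvm = fvm1 + np.maximum(ln(fvm1), 0.0)           # f_vm = f'_vm + FFN(LN(f'_vm))
```

The point of the cross-attention step is that the search-region tokens can still query the full mixed representation, including both templates, after being decoupled.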
5. The multi-scale Transformer target tracking method based on spatio-temporal template updating according to claim 3, wherein step S3 comprises the following sub-steps:
S301, sending the output sequence of the decoder into the classification branch, using the IoU-aware classification score as the training target and the VariFocal Loss as the training loss function; the classification loss is expressed as
L_cls = L_VFL(p, q),
wherein p represents the predicted IoU-aware classification score, b represents the predicted bounding box, b̂ the ground-truth bounding box, and q is the target score, namely the IoU between b and b̂ for positive samples and 0 otherwise;
s302, the output sequence of the decoder is sent to a regression branch, and the generalized IoU loss is used, and the regression loss is expressed as
Wherein the GIoU penalty is weighted by p to emphasize high class score samples.
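A hedged sketch of the two losses, assuming the common VariFocal formulation with the IoU of the predicted and ground-truth boxes as the target q, and the p-weighted GIoU term described above (the α, γ defaults are illustrative assumptions):

```python
import numpy as np

def iou_giou(b, g):
    # boxes as (x1, y1, x2, y2); returns (IoU, GIoU)
    ix = max(0.0, min(b[2], g[2]) - max(b[0], g[0]))
    iy = max(0.0, min(b[3], g[3]) - max(b[1], g[1]))
    inter = ix * iy
    union = (b[2] - b[0]) * (b[3] - b[1]) + (g[2] - g[0]) * (g[3] - g[1]) - inter
    iou = inter / union
    hull = (max(b[2], g[2]) - min(b[0], g[0])) * (max(b[3], g[3]) - min(b[1], g[1]))
    return iou, iou - (hull - union) / hull

def varifocal(p, q, alpha=0.75, gamma=2.0):
    # VariFocal loss with IoU-aware target q (q = 0 for negative samples)
    if q > 0:
        return -(q * np.log(p) + (1 - q) * np.log(1 - p)) * q
    return -alpha * p ** gamma * np.log(1 - p)

def tracking_loss(p, b_pred, b_gt):
    iou, giou = iou_giou(b_pred, b_gt)
    return varifocal(p, iou) + p * (1.0 - giou)   # GIoU term weighted by p
```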
6. The multi-scale Transformer target tracking method based on spatio-temporal template updating according to claim 3, wherein step S4 comprises the following sub-steps:
S401, when the number of video frames reaches a specified interval t, inputting the output sequence of the decoder into the confidence branch and judging whether the target state of the current frame is reliable, specifically: if the confidence score s is higher than a set threshold τ, the target state of the current frame is deemed reliable and the dynamic template is updated; otherwise, the dynamic template is not updated. That is, the template is updated if and only if s > τ;
S402, when the dynamic template is updated, an image of the dynamic template size, centred on the target predicted in the search region of the current frame, is cropped as the new dynamic template.
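The update rule of S401 and S402 might be sketched as follows; the function name, argument layout and the clipping of the crop at the frame border are illustrative assumptions:

```python
import numpy as np

def maybe_update_template(frame, cx, cy, tpl_size, score, frame_idx, t, tau, old_tpl):
    # Every t frames, crop a new dynamic template of size tpl_size x tpl_size
    # centred on the predicted target if the confidence score exceeds tau.
    if frame_idx % t != 0 or score <= tau:
        return old_tpl                        # keep the previous template
    h, w = frame.shape[:2]
    half = tpl_size // 2
    x0 = int(np.clip(cx - half, 0, w - tpl_size))  # clip crop to frame bounds
    y0 = int(np.clip(cy - half, 0, h - tpl_size))
    return frame[y0:y0 + tpl_size, x0:x0 + tpl_size]
```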
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 6.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311171271.9A CN117036417A (en) | 2023-09-12 | 2023-09-12 | Multi-scale transducer target tracking method based on space-time template updating |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117036417A true CN117036417A (en) | 2023-11-10 |
Family
ID=88622920
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311171271.9A Pending CN117036417A (en) | 2023-09-12 | 2023-09-12 | Multi-scale transducer target tracking method based on space-time template updating |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117036417A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117333514A (en) * | 2023-12-01 | 2024-01-02 | 科大讯飞股份有限公司 | Single-target video tracking method, device, storage medium and equipment |
CN117333514B (en) * | 2023-12-01 | 2024-04-16 | 科大讯飞股份有限公司 | Single-target video tracking method, device, storage medium and equipment |
CN117522925A (en) * | 2024-01-05 | 2024-02-06 | 成都合能创越软件有限公司 | Method and system for judging object motion state in mobile camera under attention mechanism |
CN117522925B (en) * | 2024-01-05 | 2024-04-16 | 成都合能创越软件有限公司 | Method and system for judging object motion state in mobile camera under attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||