CN117036417A - Multi-scale Transformer target tracking method based on spatio-temporal template updating - Google Patents

Multi-scale Transformer target tracking method based on spatio-temporal template updating

Info

Publication number
CN117036417A
CN117036417A (application No. CN202311171271.9A)
Authority
CN
China
Prior art keywords
layer
token
template
image
sequence
Prior art date
Legal status
Pending
Application number
CN202311171271.9A
Other languages
Chinese (zh)
Inventor
强旭艳
郑钰辉
Current Assignee
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202311171271.9A
Publication of CN117036417A
Pending legal-status Critical Current


Classifications

    • G06T 7/246: Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N 3/042: Computing arrangements based on biological models; neural networks; knowledge-based neural networks; logical representations of neural networks
    • G06N 3/0464: Computing arrangements based on biological models; neural networks; convolutional networks [CNN, ConvNet]
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning; using classification, e.g. of video objects
    • G06V 10/774: Image or video recognition or understanding using pattern recognition or machine learning; generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/806: Image or video recognition or understanding using pattern recognition or machine learning; fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning; using neural networks


Abstract

The invention discloses a multi-scale Transformer target tracking method based on spatio-temporal template updating, which comprises the following steps: extracting features of the initial template image, the dynamic template image and the search-area image with a Shunted Transformer to obtain three corresponding tokens; concatenating the three tokens and feeding them sequentially into a Transformer-based encoder and decoder for feature fusion to obtain a fused feature sequence; performing bounding-box prediction on the fused feature sequence through a classification branch and a regression branch, and outputting the tracking result; and, once the number of frames processed by the tracker reaches the update interval, updating the dynamic template from the fused feature sequence through a confidence branch. By using the Shunted Transformer as the feature-extraction backbone, the invention can learn multi-scale features and improve the representation of the target; meanwhile, the added dynamic template captures the latest state of the target, so that challenges such as occlusion and deformation of the target are handled effectively.

Description

Multi-scale Transformer target tracking method based on spatio-temporal template updating
Technical Field
The invention belongs to the field of image recognition and target tracking, and in particular relates to a multi-scale Transformer target tracking method based on spatio-temporal template updating.
Background
Visual target tracking is an active research area of computer vision with broad application prospects in fields such as the military, autonomous driving and medical treatment. Visual object tracking refers to automatically estimating the position and shape of an object of interest in subsequent video frames, given only a bounding box marking that object in the first frame. However, in complex real-world scenes the tracking process is affected by environmental factors such as background clutter, occlusion and deformation, so designing a tracking algorithm that operates efficiently in real scenes remains a difficult task.
According to their working principles, current deep-learning-based target tracking algorithms are mainly divided into tracking algorithms based on convolutional neural networks and tracking algorithms based on Transformers. Siamese-network-based trackers describe the target with convolution operations; although they model local relations of the image content well, they cannot effectively process global context information. Transformer-based algorithms can handle both local and global information with an attention mechanism, which strengthens the representation capability of the algorithm and improves the robustness of the tracker. However, most of these algorithms process video frames in isolation and do not fully exploit the temporal and spatial information in the video. Moreover, in real-world scenarios the tracker must overcome challenges such as occlusion, object deformation and scale change, and these challenges are further amplified as the time span grows, so developing a high-precision, robust and real-time tracker remains extremely challenging.
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a multi-scale Transformer target tracking method based on spatio-temporal template updating, which better handles challenges such as occlusion and target deformation and enhances the robustness of the tracker.
In order to solve the above technical problems, the invention provides the following technical scheme: a multi-scale Transformer target tracking method based on spatio-temporal template updating, comprising the following steps:
S1, extracting features of an initial template image, a dynamic template image and a search-area image to obtain an initial template token, a dynamic template token and a search-area token;
S2, concatenating the initial template token, the dynamic template token and the search-area token, and then feeding them sequentially into a Transformer encoder and decoder for feature fusion to obtain a fused feature sequence;
S3, performing bounding-box prediction on the fused feature sequence through a classification branch and a regression branch, and outputting the tracking result;
S4, once the number of processed frames reaches the update interval, updating the dynamic template from the fused feature sequence through a confidence branch. An illustrative sketch of this overall pipeline follows.
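As an illustration of how steps S1 to S4 interact at inference time, the following Python/PyTorch sketch shows one possible tracking loop. The module names (backbone, fusion, head, confidence_head), the helpers crop_template and crop_search_region, and the default interval and threshold values are hypothetical placeholders introduced for illustration only; they are not part of the disclosure.

```python
import torch

def track_sequence(frames, init_box, backbone, fusion, head, confidence_head,
                   update_interval=200, tau=0.5):
    """Illustrative S1-S4 tracking loop; all module/helper names are placeholders."""
    z = crop_template(frames[0], init_box)        # initial template crop (assumed helper)
    zd = z.clone()                                # dynamic template starts as the initial one
    f_z, f_zd = backbone(z), backbone(zd)         # S1: template tokens
    box = init_box
    for t, frame in enumerate(frames[1:], start=1):
        x = crop_search_region(frame, box)        # assumed helper: crop around the last box
        f_x = backbone(x)                         # S1: search-area token
        fused = fusion(torch.cat([f_z, f_zd, f_x], dim=1))  # S2: concatenation + encoder/decoder
        box, score = head(fused)                  # S3: classification + regression branches
        if t % update_interval == 0:              # S4: confidence-gated template update
            if confidence_head(fused) > tau:
                zd = crop_template(frame, box)
                f_zd = backbone(zd)
    return box
```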
Further, the step S1 includes the following sub-steps:
S101, judging from the video whether the current frame is the first frame; if so, extracting features of the initial template image and executing step S103; otherwise executing step S102;
S102, extracting features of the search-area image, and extracting features of the dynamic template image whenever the dynamic template image has been updated; then executing step S103;
S103, applying a new patch embedding to an input image of size H × W: the input image is fed sequentially through a first convolution layer, a second convolution layer and a non-overlapping projection layer to obtain a feature sequence of size H/4 × W/4; the first convolution layer serves as the first layer of the patch embedding.
S104, feature extraction, comprising three stages, each of which contains a linear embedding layer and a Shunted Transformer block; each Shunted Transformer block contains a shunted self-attention layer and a detail-specific feed-forward layer. The output feature sequence obtained in step S103 is taken as the input sequence F and first projected into query Q, key K and value V tensors. Multi-scale token aggregation is used: for the different heads indexed by i, the key K and the value V are downsampled to different sizes so as to preserve finer granularity and lower-level detail:
Q_i = X W_i^Q,
K_i, V_i = MTA(X, r_i) W_i^K, MTA(X, r_i) W_i^V,
V_i = V_i + LE(V_i),
where MTA(·, r_i) is the multi-scale token aggregation layer in the i-th head, r_i is its downsampling rate, and W_i^Q, W_i^K, W_i^V are the linear-projection parameters of the i-th head; LE(·) is the local enhancement component applied by the MTA to the value V through depthwise convolution; Q_i, K_i, V_i denote the query, key and value of the i-th head, and X is the feature representation of the input image. The rates r_i are not all identical across the heads i, so the corresponding computational costs differ.
S105, the multi-head self-attention then computes the shunted self-attention in parallel with H independent attention heads:
h_i = Softmax(Q_i K_i^T / √d_h) V_i,
where h_i denotes the i-th self-attention head and d_h is the head dimension;
S106, local details are supplemented in the feed-forward layer by inserting a detail-specific layer between its two fully connected layers:
x′ = FC(x; θ_1),
x″ = FC(σ(x′ + DS(x′; θ_1)); θ_2),
where DS(·; θ) is the detail-specific layer with parameter θ, implemented by depthwise convolution; FC(·; θ) is a fully connected layer with parameter θ; σ(·) denotes the activation function applying a nonlinear transformation to its input; x denotes the feature sequence fed into the feed-forward layer; x′ denotes the output of x after the first fully connected layer; and x″ denotes the output of the second fully connected layer after x′ has passed through the detail-specific layer and the activation, i.e. the output feature sequence of the whole feed-forward layer;
S107, executing steps S104 to S106 iteratively twice to obtain the final feature sequence of the input sequence F.
Further, in step S103, the first convolution layer has a 7×7 kernel with stride 2, the second convolution layer has a 3×3 kernel, and the non-overlapping projection layer has stride 1.
Further, the step S2 includes the following sub-steps:
S201, concatenating the initial template token, the dynamic template token and the search-area token along the spatial dimension to obtain a mixed representation:
f^0 = [f_z; f_zd; f_x],
where f_z, f_zd and f_x respectively denote the initial template token, the dynamic template token and the search-area token obtained by feature extraction.
S202, the encoder consists of a stack of Transformer blocks. The mixed representation f^{l-1} is first layer-normalized and fed into the multi-head self-attention (MSA) module, and the result is added to f^{l-1} through a residual connection to obtain f′^l; f′^l is then layer-normalized and fed into the feed-forward network (FFN), and that output is added to f′^l through a residual connection to give the output f^l of one Transformer block:
f′^l = f^{l-1} + MSA(LN(f^{l-1})),
f^l = f′^l + FFN(LN(f′^l)).
S203, when the mixed representation f^L is output by the encoder, the initial template token, the dynamic template token and the search-area token are separated again by a decoupling operation, i.e. the concatenated sequence is split back along the spatial dimension into f_z^L, f_zd^L and f_x^L.
S204, the decoder takes the outputs of the encoder as input, namely the decoupled search-area token f_x^L and the undecoupled mixed representation f^L. All inputs are first layer-normalized, and the multi-head cross-attention between f_x^L and f^L is then computed to obtain the final representation:
f′_vm = f_x^L + MCA(LN(f_x^L), LN(f^L)),
f_vm = f′_vm + FFN(LN(f′_vm)),
where MCA(·) denotes the multi-head cross-attention module, FFN(·) the feed-forward network, and LN(·) layer normalization.
Further, the step S3 includes the following sub-steps:
S301, the output sequence of the decoder is fed into the classification branch, with the IoU-aware classification score as the training target and VariFocal Loss as the training loss function; the classification loss is expressed as
L_cls = L_VFL(p, IoU(b, b*)),
where p denotes the predicted IoU-aware classification score, b the predicted bounding box and b* the ground-truth bounding box;
S302, the output sequence of the decoder is fed into the regression branch, which uses the generalized IoU loss; the regression loss is expressed as
L_reg = p · L_GIoU(b, b*),
where the GIoU loss is weighted by p to emphasize samples with high classification scores.
Further, the step S4 includes the following sub-steps:
S401, when the number of processed video frames reaches the specified interval t, the output sequence of the decoder is fed into the confidence branch to decide whether the target state of the current frame is reliable; specifically, the target state is judged reliable, and the dynamic template is updated, only if the confidence score is higher than a set threshold τ; otherwise the dynamic template is not updated;
S402, when the dynamic template is updated, an image of the dynamic-template size is cropped, centred on the target predicted in the search area of the current frame, and used as the new dynamic template.
In another aspect the invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods of the invention when the computer program is executed.
The invention also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the method of any of the invention.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. multi-scale feature information is preserved during feature extraction, and the resulting feature tokens can learn relations between objects of different scales, which improves the representation capability of the tracker;
2. the dynamic template captures the latest state of the target, and by combining temporal and spatial information the tracker handles challenges such as occlusion and target deformation more effectively, which enhances its robustness.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 is a flow chart of the operation of the feature extraction module of the present invention.
Detailed Description
For a better understanding of the technical content of the present invention, specific examples are set forth below, along with the accompanying drawings.
Aspects of the invention are described herein with reference to the drawings, in which a number of illustrative embodiments are shown. The embodiments of the invention are not limited to those shown in the drawings; the invention may be carried out with any of the concepts and embodiments described above and detailed below, since the disclosed concepts and embodiments are not restricted to any particular implementation. In addition, some aspects of the disclosure may be used alone or in any suitable combination with other aspects of the disclosure.
As shown in FIG. 1, the multi-scale Transformer target tracking method based on spatio-temporal template updating of the invention comprises the following steps:
S1, extracting features of an initial template image, a dynamic template image and a search-area image to obtain an initial template token, a dynamic template token and a search-area token;
S2, concatenating the initial template token, the dynamic template token and the search-area token, and then feeding them sequentially into a Transformer encoder and decoder for feature fusion to obtain a fused feature sequence;
S3, performing bounding-box prediction on the fused feature sequence through a classification branch and a regression branch, and outputting the tracking result;
S4, once the number of processed frames reaches the update interval, updating the dynamic template from the fused feature sequence through a confidence branch.
Referring to FIG. 2, further, as a preferred technical solution of the multi-scale Transformer target tracking method based on spatio-temporal template updating of the invention, step S1 comprises the following sub-steps:
S101, judging from the video whether the current frame is the first frame; if so, extracting features of the initial template image and executing step S103; otherwise executing step S102;
S102, extracting features of the search-area image, and extracting features of the dynamic template image whenever the dynamic template image has been updated; then executing step S103;
S103, applying a new patch embedding to an input image of size H × W: the input image is fed sequentially through a first convolution layer, a second convolution layer and a non-overlapping projection layer to obtain a feature sequence of size H/4 × W/4; the first convolution layer serves as the first layer of the patch embedding. The first convolution layer has a 7×7 kernel with stride 2, the second convolution layer has a 3×3 kernel, and the non-overlapping projection layer has stride 1. A sketch of this stem is given below.
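A minimal PyTorch sketch of the S103 patch embedding stem follows. The kernel sizes and strides of the first convolution and the non-overlapping projection follow the text; the stride of the second 3×3 convolution and the channel widths are assumptions made so that the output resolution is H/4 × W/4 as stated.

```python
import torch.nn as nn

class PatchEmbedStem(nn.Module):
    """S103 patch embedding sketch: 7x7 conv (stride 2) -> 3x3 conv -> non-overlapping
    projection (stride 1); the 3x3 conv is assumed to use stride 2 so that the output
    token map is H/4 x W/4, and the channel widths are illustrative."""
    def __init__(self, in_chans=3, embed_dim=64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_chans, embed_dim // 2, 7, stride=2, padding=3)
        self.conv2 = nn.Conv2d(embed_dim // 2, embed_dim // 2, 3, stride=2, padding=1)
        self.proj = nn.Conv2d(embed_dim // 2, embed_dim, 1, stride=1)  # non-overlapping projection
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.act(self.conv1(x))             # (B, C/2, H/2, W/2)
        x = self.act(self.conv2(x))             # (B, C/2, H/4, W/4)
        x = self.proj(x)                        # (B, C, H/4, W/4)
        return x.flatten(2).transpose(1, 2)     # token sequence of length (H/4)*(W/4)
```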
S104, feature extraction, comprising three stages, each of which contains a linear embedding layer and a Shunted Transformer block; each Shunted Transformer block contains a shunted self-attention layer and a detail-specific feed-forward layer. The output feature sequence obtained in step S103 is taken as the input sequence F and first projected into query Q, key K and value V tensors. Multi-scale token aggregation is used: for the different heads indexed by i, the key K and the value V are downsampled to different sizes so as to preserve finer granularity and lower-level detail:
Q_i = X W_i^Q,
K_i, V_i = MTA(X, r_i) W_i^K, MTA(X, r_i) W_i^V,
V_i = V_i + LE(V_i),
where MTA(·, r_i) is the multi-scale token aggregation layer in the i-th head, r_i is its downsampling rate, and W_i^Q, W_i^K, W_i^V are the linear-projection parameters of the i-th head; LE(·) is the local enhancement component applied by the MTA to the value V through depthwise convolution; Q_i, K_i, V_i denote the query, key and value of the i-th head, and X is the feature representation of the input image. The rates r_i are not all identical across the heads i, so the corresponding computational costs differ.
S105, the multi-head self-attention then computes the shunted self-attention in parallel with H independent attention heads:
h_i = Softmax(Q_i K_i^T / √d_h) V_i,
where h_i denotes the i-th self-attention head and d_h is the head dimension. An illustrative sketch of steps S104 to S105 follows.
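The following PyTorch sketch illustrates the shunted self-attention of steps S104 to S105 with multi-scale token aggregation. The number of heads, the two downsampling rates, the even split of heads between the rates, and the 3×3 depthwise kernel used for the local enhancement LE(·) are illustrative assumptions, not the patent's exact configuration.

```python
import math
import torch
import torch.nn as nn

class ShuntedSelfAttention(nn.Module):
    """Sketch of shunted self-attention (S104-S105): heads are split into groups with
    different K/V downsampling rates r_i (multi-scale token aggregation, MTA), and the
    value V is locally enhanced by a depthwise convolution LE(.)."""
    def __init__(self, dim, num_heads=4, rates=(1, 2)):
        super().__init__()
        assert dim % num_heads == 0 and num_heads % len(rates) == 0
        self.d_h = dim // num_heads
        self.hpg = num_heads // len(rates)            # heads per rate group (assumed even split)
        self.rates = rates
        self.q = nn.Linear(dim, dim)
        self.kv = nn.ModuleList([nn.Linear(dim, 2 * self.hpg * self.d_h) for _ in rates])
        # MTA(X, r_i): merge r_i x r_i token neighbourhoods with a strided convolution
        self.mta = nn.ModuleList([nn.Conv2d(dim, dim, r, stride=r) if r > 1 else nn.Identity()
                                  for r in rates])
        # LE(.): depthwise convolution on the (downsampled) value map
        self.le = nn.ModuleList([nn.Conv2d(self.hpg * self.d_h, self.hpg * self.d_h, 3,
                                           padding=1, groups=self.hpg * self.d_h)
                                 for _ in rates])
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):                       # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, -1, self.d_h).transpose(1, 2)   # (B, heads, N, d_h)
        outs = []
        for g, r in enumerate(self.rates):
            x2d = x.transpose(1, 2).reshape(B, C, H, W)
            x_ds = self.mta[g](x2d).flatten(2).transpose(1, 2)      # MTA(X, r_i)
            k, v = self.kv[g](x_ds).chunk(2, dim=-1)                # K_i, V_i projections
            v2d = v.transpose(1, 2).reshape(B, -1, H // r, W // r)
            v = v + self.le[g](v2d).flatten(2).transpose(1, 2)      # V_i = V_i + LE(V_i)
            k = k.reshape(B, -1, self.hpg, self.d_h).transpose(1, 2)
            v = v.reshape(B, -1, self.hpg, self.d_h).transpose(1, 2)
            qg = q[:, g * self.hpg:(g + 1) * self.hpg]
            attn = (qg @ k.transpose(-2, -1)) / math.sqrt(self.d_h)
            outs.append(attn.softmax(dim=-1) @ v)                   # h_i = Softmax(QK^T/sqrt(d_h))V
        out = torch.cat(outs, dim=1).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```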
S106, local details are supplemented in the feed-forward layer by inserting a detail-specific layer between its two fully connected layers:
x′ = FC(x; θ_1),
x″ = FC(σ(x′ + DS(x′; θ_1)); θ_2),
where DS(·; θ) is the detail-specific layer with parameter θ, implemented by depthwise convolution; FC(·; θ) is a fully connected layer with parameter θ; σ(·) denotes the activation function applying a nonlinear transformation to its input; x denotes the feature sequence fed into the feed-forward layer; x′ denotes the output of x after the first fully connected layer; and x″ denotes the output of the second fully connected layer after x′ has passed through the detail-specific layer and the activation, i.e. the output feature sequence of the whole feed-forward layer. An illustrative implementation of this layer follows.
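A sketch of the detail-specific feed-forward layer of step S106 is given below. The expansion ratio, the GELU activation and the 3×3 depthwise kernel of DS(·) are illustrative assumptions.

```python
import torch.nn as nn

class DetailSpecificFFN(nn.Module):
    """Sketch of the S106 detail-specific feed-forward layer:
    x'  = FC(x; theta_1),  x'' = FC(sigma(x' + DS(x'; theta_1)); theta_2),
    where DS(.) is a depthwise convolution inserted between the two fully connected
    layers; the expansion ratio and GELU activation are illustrative assumptions."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)                                   # FC(.; theta_1)
        self.ds = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)    # DS(.)
        self.act = nn.GELU()                                                # sigma(.)
        self.fc2 = nn.Linear(hidden, dim)                                   # FC(.; theta_2)

    def forward(self, x, H, W):                       # x: (B, N, dim) with N = H * W
        x1 = self.fc1(x)                              # x'
        B, N, Ch = x1.shape
        ds = self.ds(x1.transpose(1, 2).reshape(B, Ch, H, W))
        ds = ds.flatten(2).transpose(1, 2)            # DS(x')
        return self.fc2(self.act(x1 + ds))            # x''
```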
S107, executing steps S104 to S106 iteratively twice to obtain the final feature sequence of the input sequence F.
Further, as a preferred technical solution of the multi-scale Transformer target tracking method based on spatio-temporal template updating of the invention, step S2 comprises the following sub-steps:
S201, concatenating the initial template token, the dynamic template token and the search-area token along the spatial dimension to obtain a mixed representation:
f^0 = [f_z; f_zd; f_x],
where f_z, f_zd and f_x respectively denote the initial template token, the dynamic template token and the search-area token obtained by feature extraction.
S202, the encoder consists of a series of Transformer blocks. The mixed representation f^{l-1} is first layer-normalized and fed into the multi-head self-attention (MSA) module, and the result is added to f^{l-1} through a residual connection to obtain f′^l; f′^l is then layer-normalized and fed into the feed-forward network (FFN), and that output is added to f′^l through a residual connection to give the output f^l of one Transformer block:
f′^l = f^{l-1} + MSA(LN(f^{l-1})),
f^l = f′^l + FFN(LN(f′^l)).
S203, when the mixed representation f^L is output by the encoder, the initial template token, the dynamic template token and the search-area token are separated again by a decoupling operation, i.e. the concatenated sequence is split back along the spatial dimension into f_z^L, f_zd^L and f_x^L.
S204, the decoder takes the outputs of the encoder as input, namely the decoupled search-area token f_x^L and the undecoupled mixed representation f^L. All inputs are first layer-normalized, and the multi-head cross-attention between f_x^L and f^L is then computed to obtain the final representation:
f′_vm = f_x^L + MCA(LN(f_x^L), LN(f^L)),
f_vm = f′_vm + FFN(LN(f′_vm)),
where MCA(·) denotes the multi-head cross-attention module, FFN(·) the feed-forward network, and LN(·) layer normalization. A sketch of the encoder and decoder blocks follows.
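The following sketch shows one possible realisation of the S202 encoder block and the S203-S204 decoder block using PyTorch's nn.MultiheadAttention. The number of heads and the FFN expansion ratio are illustrative assumptions.

```python
import torch.nn as nn

class FusionEncoderLayer(nn.Module):
    """S202 encoder block sketch: pre-norm multi-head self-attention and FFN with
    residual connections, applied to the concatenated template/search tokens."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, f):                              # f: (B, Nz + Nzd + Nx, C)
        n = self.ln1(f)
        f = f + self.msa(n, n, n)[0]                   # f'^l = f^{l-1} + MSA(LN(f^{l-1}))
        return f + self.ffn(self.ln2(f))               # f^l  = f'^l + FFN(LN(f'^l))

class FusionDecoderLayer(nn.Module):
    """S203-S204 decoder sketch: the decoupled search-area tokens attend to the full
    mixed representation output by the encoder through multi-head cross-attention."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.ln_q, self.ln_kv, self.ln_o = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mca = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, f_x, f_mix):                     # decoupled search tokens, mixed tokens
        kv = self.ln_kv(f_mix)
        f = f_x + self.mca(self.ln_q(f_x), kv, kv)[0]  # f'_vm = f_x^L + MCA(LN(f_x^L), LN(f^L))
        return f + self.ffn(self.ln_o(f))              # f_vm  = f'_vm + FFN(LN(f'_vm))
```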
Further, as a preferred technical solution of the multi-scale Transformer target tracking method based on spatio-temporal template updating of the invention, step S3 comprises the following sub-steps:
S301, the output sequence of the decoder is fed into the classification branch, with the IoU-aware classification score as the training target and VariFocal Loss as the training loss function; the classification loss is expressed as
L_cls = L_VFL(p, IoU(b, b*)),
where p denotes the predicted IoU-aware classification score, b the predicted bounding box and b* the ground-truth bounding box;
S302, the output sequence of the decoder is fed into the regression branch, which uses the generalized IoU loss; the regression loss is expressed as
L_reg = p · L_GIoU(b, b*),
where the GIoU loss is weighted by p to emphasize samples with high classification scores. A sketch of both losses follows.
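A sketch of the two training losses of step S3 is given below: a VariFocal classification loss with the IoU between the predicted and ground-truth boxes as the continuous target, and an element-wise generalized IoU regression loss weighted by the predicted score p. The α and γ values and the detaching of p in the regression weight are assumptions.

```python
import torch

def varifocal_loss(pred_logit, iou_target, alpha=0.75, gamma=2.0):
    """S301 sketch: VariFocal loss with the IoU of the predicted box against the
    ground-truth box as the continuous target q (q = 0 for negatives)."""
    p = pred_logit.sigmoid()
    q = iou_target
    pos = q > 0
    loss = torch.zeros_like(p)
    loss[pos] = -q[pos] * (q[pos] * torch.log(p[pos] + 1e-8)
                           + (1 - q[pos]) * torch.log(1 - p[pos] + 1e-8))
    loss[~pos] = -alpha * p[~pos].pow(gamma) * torch.log(1 - p[~pos] + 1e-8)
    return loss.mean()

def weighted_giou_loss(pred_box, gt_box, pred_score):
    """S302 sketch: element-wise generalized IoU loss between matched (x1, y1, x2, y2)
    boxes, weighted by the predicted classification score p to emphasise samples
    with high classification scores."""
    x1 = torch.max(pred_box[:, 0], gt_box[:, 0]); y1 = torch.max(pred_box[:, 1], gt_box[:, 1])
    x2 = torch.min(pred_box[:, 2], gt_box[:, 2]); y2 = torch.min(pred_box[:, 3], gt_box[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred_box[:, 2] - pred_box[:, 0]) * (pred_box[:, 3] - pred_box[:, 1])
    area_g = (gt_box[:, 2] - gt_box[:, 0]) * (gt_box[:, 3] - gt_box[:, 1])
    union = area_p + area_g - inter
    iou = inter / union.clamp(min=1e-8)
    ex1 = torch.min(pred_box[:, 0], gt_box[:, 0]); ey1 = torch.min(pred_box[:, 1], gt_box[:, 1])
    ex2 = torch.max(pred_box[:, 2], gt_box[:, 2]); ey2 = torch.max(pred_box[:, 3], gt_box[:, 3])
    enclose = ((ex2 - ex1) * (ey2 - ey1)).clamp(min=1e-8)
    giou = iou - (enclose - union) / enclose
    return (pred_score.detach() * (1.0 - giou)).mean()
```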
Further, as a preferred technical solution of the multi-scale Transformer target tracking method based on spatio-temporal template updating of the invention, step S4 comprises the following sub-steps:
S401, when the number of processed video frames reaches the specified interval t, the output sequence of the decoder is fed into the confidence branch to decide whether the target state of the current frame is reliable; specifically, the target state is judged reliable, and the dynamic template is updated, only if the confidence score is higher than a set threshold τ; otherwise the dynamic template is not updated;
S402, when the dynamic template is updated, an image of the dynamic-template size is cropped, centred on the target predicted in the search area of the current frame, and used as the new dynamic template. A sketch of this update rule follows.
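A sketch of the confidence-gated template update of steps S401 to S402 is given below; crop_template and the state dictionary are hypothetical placeholders.

```python
def maybe_update_dynamic_template(frame, pred_box, conf_score, state, tau=0.5,
                                  update_interval=200):
    """S401-S402 sketch: every `update_interval` frames, crop a new dynamic template
    centred on the predicted target, but only if the confidence-branch score exceeds
    the threshold tau; `crop_template` and `state` are illustrative placeholders."""
    state["frame_idx"] += 1
    if state["frame_idx"] % update_interval == 0 and conf_score > tau:   # S401
        state["dynamic_template"] = crop_template(frame, pred_box,       # S402
                                                  size=state["template_size"])
    return state
```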
In another aspect, the invention proposes an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, said processor implementing the steps of the method of the invention when executing said computer program.
The invention also proposes a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method according to the invention.
While the invention has been described in terms of preferred embodiments, it is not intended to be limiting. Those skilled in the art will appreciate that various modifications and adaptations can be made without departing from the spirit and scope of the present invention. Accordingly, the scope of the invention is defined by the appended claims.

Claims (8)

1. A multi-scale Transformer target tracking method based on spatio-temporal template updating, characterized by comprising the following steps:
S1, extracting features of an initial template image, a dynamic template image and a search-area image to obtain an initial template token, a dynamic template token and a search-area token;
S2, concatenating the initial template token, the dynamic template token and the search-area token, and then feeding them sequentially into a Transformer encoder and decoder for feature fusion to obtain a fused feature sequence;
S3, performing bounding-box prediction on the fused feature sequence through a classification branch and a regression branch, and outputting the tracking result;
S4, once the number of processed frames reaches the update interval, updating the dynamic template from the fused feature sequence through a confidence branch.
2. The multi-scale Transformer target tracking method based on spatio-temporal template updating according to claim 1, wherein step S1 comprises the following sub-steps:
S101, judging from the video whether the current frame is the first frame; if so, extracting features of the initial template image and executing step S103; otherwise executing step S102;
S102, extracting features of the search-area image, and extracting features of the dynamic template image whenever the dynamic template image has been updated; then executing step S103;
S103, applying a new patch embedding to an input image of size H × W: the input image is fed sequentially through a first convolution layer, a second convolution layer and a non-overlapping projection layer to obtain a feature sequence of size H/4 × W/4; the first convolution layer serves as the first layer of the patch embedding;
S104, feature extraction, comprising three stages, each of which contains a linear embedding layer and a Shunted Transformer block; each Shunted Transformer block contains a shunted self-attention layer and a detail-specific feed-forward layer. The output feature sequence obtained in step S103 is taken as the input sequence F and first projected into query Q, key K and value V tensors. Multi-scale token aggregation is used: for the different heads indexed by i, the key K and the value V are downsampled to different sizes so as to preserve finer granularity and lower-level detail:
Q_i = X W_i^Q,
K_i, V_i = MTA(X, r_i) W_i^K, MTA(X, r_i) W_i^V,
V_i = V_i + LE(V_i),
where MTA(·, r_i) is the multi-scale token aggregation layer in the i-th head, r_i is its downsampling rate, and W_i^Q, W_i^K, W_i^V are the linear-projection parameters of the i-th head; LE(·) is the local enhancement component applied by the MTA to the value V through depthwise convolution; Q_i, K_i, V_i denote the query, key and value of the i-th head, and X is the feature representation of the input image. The rates r_i are not all identical across the heads i, so the corresponding computational costs differ.
S105, the multi-head self-attention then computes the shunted self-attention in parallel with H independent attention heads:
h_i = Softmax(Q_i K_i^T / √d_h) V_i,
where h_i denotes the i-th self-attention head and d_h is the head dimension;
S106, local details are supplemented in the feed-forward layer by inserting a detail-specific layer between its two fully connected layers:
x′ = FC(x; θ_1),
x″ = FC(σ(x′ + DS(x′; θ_1)); θ_2),
where DS(·; θ) is the detail-specific layer with parameter θ, implemented by depthwise convolution; FC(·; θ) is a fully connected layer with parameter θ; σ(·) denotes the activation function applying a nonlinear transformation to its input; x denotes the feature sequence fed into the feed-forward layer; x′ denotes the output of x after the first fully connected layer; and x″ denotes the output of the second fully connected layer after x′ has passed through the detail-specific layer and the activation, i.e. the output feature sequence of the whole feed-forward layer;
S107, executing steps S104 to S106 iteratively twice to obtain the final feature sequence of the input sequence F.
3. The method according to claim 2, wherein in step S103 the first convolution layer has a 7×7 kernel with stride 2, the second convolution layer has a 3×3 kernel, and the non-overlapping projection layer has stride 1.
4. The multi-scale Transformer target tracking method based on spatio-temporal template updating according to claim 2, wherein step S2 comprises the following sub-steps:
S201, concatenating the initial template token, the dynamic template token and the search-area token along the spatial dimension to obtain a mixed representation:
f^0 = [f_z; f_zd; f_x],
where f_z, f_zd and f_x respectively denote the initial template token, the dynamic template token and the search-area token obtained by feature extraction;
S202, the encoder consists of a stack of Transformer blocks. The mixed representation f^{l-1} is first layer-normalized and fed into the multi-head self-attention (MSA) module, and the result is added to f^{l-1} through a residual connection to obtain f′^l; f′^l is then layer-normalized and fed into the feed-forward network (FFN), and that output is added to f′^l through a residual connection to give the output f^l of one Transformer block:
f′^l = f^{l-1} + MSA(LN(f^{l-1})),
f^l = f′^l + FFN(LN(f′^l));
S203, when the mixed representation f^L is output by the encoder, the initial template token, the dynamic template token and the search-area token are separated again by a decoupling operation, i.e. the concatenated sequence is split back along the spatial dimension into f_z^L, f_zd^L and f_x^L;
S204, the decoder takes the outputs of the encoder as input, namely the decoupled search-area token f_x^L and the undecoupled mixed representation f^L. All inputs are first layer-normalized, and the multi-head cross-attention between f_x^L and f^L is then computed to obtain the final representation:
f′_vm = f_x^L + MCA(LN(f_x^L), LN(f^L)),
f_vm = f′_vm + FFN(LN(f′_vm)),
where MCA(·) denotes the multi-head cross-attention module, FFN(·) the feed-forward network, and LN(·) layer normalization.
5. The multi-scale Transformer target tracking method based on spatio-temporal template updating according to claim 3, wherein step S3 comprises the following sub-steps:
S301, the output sequence of the decoder is fed into the classification branch, with the IoU-aware classification score as the training target and VariFocal Loss as the training loss function; the classification loss is expressed as
L_cls = L_VFL(p, IoU(b, b*)),
where p denotes the predicted IoU-aware classification score, b the predicted bounding box and b* the ground-truth bounding box;
S302, the output sequence of the decoder is fed into the regression branch, which uses the generalized IoU loss; the regression loss is expressed as
L_reg = p · L_GIoU(b, b*),
where the GIoU loss is weighted by p to emphasize samples with high classification scores.
6. The multi-scale Transformer target tracking method based on spatio-temporal template updating according to claim 3, characterized in that step S4 comprises the following sub-steps:
S401, when the number of processed video frames reaches the specified interval t, the output sequence of the decoder is fed into the confidence branch to decide whether the target state of the current frame is reliable; specifically, the target state is judged reliable, and the dynamic template is updated, only if the confidence score is higher than a set threshold τ; otherwise the dynamic template is not updated;
S402, when the dynamic template is updated, an image of the dynamic-template size is cropped, centred on the target predicted in the search area of the current frame, and used as the new dynamic template.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1 to 6 when the computer program is executed by the processor.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
CN202311171271.9A 2023-09-12 2023-09-12 Multi-scale Transformer target tracking method based on spatio-temporal template updating Pending CN117036417A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311171271.9A CN117036417A (en) 2023-09-12 2023-09-12 Multi-scale transducer target tracking method based on space-time template updating

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311171271.9A CN117036417A (en) 2023-09-12 2023-09-12 Multi-scale transducer target tracking method based on space-time template updating

Publications (1)

Publication Number Publication Date
CN117036417A true CN117036417A (en) 2023-11-10

Family

ID=88622920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311171271.9A Pending CN117036417A (en) 2023-09-12 2023-09-12 Multi-scale transducer target tracking method based on space-time template updating

Country Status (1)

Country Link
CN (1) CN117036417A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333514A (en) * 2023-12-01 2024-01-02 科大讯飞股份有限公司 Single-target video tracking method, device, storage medium and equipment
CN117333514B (en) * 2023-12-01 2024-04-16 科大讯飞股份有限公司 Single-target video tracking method, device, storage medium and equipment
CN117522925A (en) * 2024-01-05 2024-02-06 成都合能创越软件有限公司 Method and system for judging object motion state in mobile camera under attention mechanism
CN117522925B (en) * 2024-01-05 2024-04-16 成都合能创越软件有限公司 Method and system for judging object motion state in mobile camera under attention mechanism


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination