CN114140495A - Single target tracking method based on multi-scale Transformer

Info

Publication number: CN114140495A
Application number: CN202111340646.0A
Authority: CN (China)
Priority/filing date: 2021-11-12
Publication date: 2022-03-04
Legal status: Pending
Original language: Chinese (zh)
Prior art keywords: target, feature, candidate, features, template
Inventors: 何志伟, 聂佳浩, 伍瀚, 高明煜, 董哲康
Applicant and assignee: Hangzhou Dianzi University

Classifications

    • G06T7/246: Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/084: Neural network learning methods; backpropagation, e.g. using gradient descent
    • G06T2207/20081: Special algorithmic details; training, learning
    • G06T2207/20084: Special algorithmic details; artificial neural networks [ANN]


Abstract

The invention discloses a single-target tracking method based on a multi-scale Transformer. The method crops features of different spatial sizes from the template features, obtains target feature information in multi-scale semantic spaces through convolution kernels of different sizes, and then uses this information as supervision to enhance the template features so that they become aware of the target's appearance. An IoU-Net is then trained offline to evaluate the accuracy of candidate boxes: a feature modulation vector learned from the target feature is applied to the candidate-box features, and the confidence score of each candidate box is obtained from the modulated features through generalization learning. Finally, after several rounds of iterative optimization, the candidate box with the highest confidence is taken as the tracking result. With the multi-scale Transformer module provided by the invention, the accuracy of the ATOM tracking method is improved to a certain extent, and the bounding box of the target can be estimated more accurately in complex scenes.

Description

Single target tracking method based on multi-scale Transformer
Technical Field
The invention belongs to the technical field of single target tracking, and particularly relates to a multi-scale Transformer feature-guided single target tracking method in a complex environment.
Background
Single-target tracking is a basic and challenging task in computer vision. Given an arbitrary object in the first frame as prior knowledge, the tracker aims to locate this target and estimate its bounding box in subsequent frames. In recent years, single-target tracking has been widely applied in fields such as unmanned aerial vehicles and intelligent video surveillance, and great progress has been made; however, continuously accumulated tracking errors can leave the tracker unable to cope with complex scenes such as deformation and occlusion. Therefore, how to accurately estimate the bounding box of the target remains an open problem.
Early single-target trackers performed bounding-box estimation with a conventional multi-scale search: the tracking result of the previous frame was taken as the reference bounding box of the current frame and evaluated at several scales. This conventional approach limits tracking accuracy when the target deforms severely in the video stream. With the development of deep learning, many high-precision tracking methods have emerged. The bounding-box estimation methods adopted by mainstream trackers today can be broadly divided into two categories: template-matching-based methods and candidate-box-evaluation-based methods. A tracker of the first kind crops an image patch containing context information, centered on the target in the first frame, as a template, extracts features of the template and of subsequent frames with a Siamese network, and learns, in a fully convolutional manner, the region most similar to the template as the tracking result. This way of estimating the bounding box greatly improves tracker accuracy and can effectively estimate the state of a deforming object. However, using the context together with the target as the template has a drawback: the position, posture, and other attributes of the target are obscured by the large amount of context information. Candidate-box-based evaluation methods were subsequently proposed to solve this problem. They also use a Siamese network to extract features, but the features of the given target in the template are explicitly modeled as prior knowledge; this prior is then propagated, through an offline-trained IoU-Net, to guide the confidence evaluation of candidate boxes, and the candidate box with the highest confidence is taken as the tracking result. Owing to the specific characterization ability of the target features, candidate-box-evaluation methods can effectively handle scenes with background interference. However, when similar distractor objects appear in the image, the tracker may still drift, because the receptive field of the convolutional neural network is far larger than the target area, so the target features are mixed with redundant information and their characterization ability is insufficient. To further improve tracking accuracy, the present method optimizes the characterization ability of the target during single-target tracking on the basis of candidate-box evaluation.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a single-target tracking method based on a multi-scale Transformer, in which the characterization ability of the target is enhanced by a multi-scale Transformer feature-enhancement technique. The invention uses ATOM as the baseline tracking method and thereby achieves more accurate tracking results.
The single target tracking method based on the multi-scale Transformer specifically comprises the following steps:
Step 1: A multi-scale Transformer module is applied to the template features extracted by the Siamese network, and target features of different scales are used as supervision information to guide template-feature enhancement, yielding the enhanced template features T'.
The specific steps are as follows:
1) Three features of different spatial sizes are cropped from the template feature map around its center; their scales are a × a, 2a × 2a and 3a × 3a respectively;
2) The features of the different spatial sizes are embedded into semantic spaces of different scales through three convolution layers that keep the channel number unchanged, and are finally flattened into 2-dimensional form; the overall flow of the multi-scale Transformer is expressed as:
K_i = V_i = Flatten(Conv_i(Crop_i(T))), i = 1, 2, 3   (1)
3) In the multi-head target-attention module, a linear convolution layer with a 1 × 1 kernel reduces the number of feature channels C of V and K to C/4, so as to accelerate the fitting of the model;
4) Taking the template features as Q, the similarity matrix A between Q and K is calculated:
A = softmax(Q·K^T / √d_k)   (2)
where the output A is the similarity matrix and d_k is the dimension of the feature K;
5) After the similarity matrix is obtained, the output feature O of a single target-attention block is calculated by the following matrix operation:
O = A*V   (3)
where * denotes matrix multiplication and O is the output feature;
6) The target-attention block is extended to multiple heads, and the template features T' enhanced by the multi-scale Transformer are obtained through summation and normalization:
T' = MultiHead(Q, K, V) = Norm(Concat(O_1, O_2, O_3)·W_o + Q)   (4)
where Norm denotes l2 normalization applied to the whole template features, and W_o is a learnable parameter matrix that adjusts the concatenated channel number 3C to the input channel number C;
Step 2: The tracking result of the previous frame is taken as the reference box of the current frame, and candidate boxes of several scales are randomly generated, with the random scaling factors of their width and height drawn from the interval [1 − α, 1 + α]; the features of the candidate boxes and the target feature in the enhanced template features T' are extracted through PrRoI Pooling.
Step 3: An IoU-Net is trained offline on public data sets; the target information is propagated to the candidate boxes, the candidate-box features are adjusted by a vector modulation method, and a confidence score is evaluated for each candidate box.
Step 4: In the online testing stage, the candidate-box positions are repeatedly fine-tuned using the gradients with respect to the candidate-box position information, iterating towards a more accurate bounding box that serves as the tracking result.
Preferably, the backbone of the Siamese network in step 1 is a ResNet-18 pre-trained on the ImageNet dataset, and its parameters are shared by the template branch and the subsequent-image branch; to suit the tracking task, the ResNet-18 with the fully connected layer removed is used as the feature extraction module, with a down-sampling rate of 16.
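For illustration only, a minimal PyTorch sketch of such a shared feature extractor is given below; taking the output of layer3 so that the overall stride is 16 is an assumption made to match the stated down-sampling rate, and the class name is hypothetical.

```python
import torch
import torchvision

class SiameseBackbone(torch.nn.Module):
    """Shared ResNet-18 feature extractor (illustrative sketch).

    Assumption: features are taken from layer3, giving an overall
    stride of 16, which matches the down-sampling rate stated above.
    """
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        # Keep everything up to and including layer3; drop layer4,
        # global average pooling and the fully connected layer.
        self.features = torch.nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3,
        )

    def forward(self, template, search):
        # The same parameters are applied to both branches.
        return self.features(template), self.features(search)

# Example: a 288x288 crop maps to an 18x18 feature map with 256 channels.
backbone = SiameseBackbone()
t_feat, s_feat = backbone(torch.randn(1, 3, 288, 288),
                          torch.randn(1, 3, 288, 288))
print(t_feat.shape)  # torch.Size([1, 256, 18, 18])
```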
Preferably, the PrRoI Pooling feature extraction process in step 2 is as follows: first, the quantization problem of region pooling is solved by interpolation, expressed by the following equations:
f(x, y) = Σ_{i,j} IC(x, y, i, j) × F_{i,j}   (5)
IC(x, y, i, j) = max(0, 1 − |x − i|) × max(0, 1 − |y − j|)   (6)
where (i, j) is a coordinate position on the feature map, F_{i,j} is the feature value at (i, j), and IC(x, y, i, j) is the interpolation coefficient; finally, the interpolated feature region is extracted by double integration:
PrPool(R, F) = (∫_{y1}^{y2} ∫_{x1}^{x2} f(x, y) dx dy) / ((x2 − x1) × (y2 − y1))   (7)
where F is the feature map and the top-left corner (x1, y1) and bottom-right corner (x2, y2) define the region R from which features are extracted.
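As a rough numerical illustration of equations (5)-(7), the following sketch evaluates the interpolation coefficient, the continuous feature map, and the bin integral on a toy feature map; the Riemann-sum grid resolution is an arbitrary choice for demonstration.

```python
import numpy as np

def interp_coeff(x, y, i, j):
    # IC(x, y, i, j) = max(0, 1 - |x - i|) * max(0, 1 - |y - j|)   (Eq. 6)
    return max(0.0, 1.0 - abs(x - i)) * max(0.0, 1.0 - abs(y - j))

def continuous_feature(F, x, y):
    # f(x, y) = sum over (i, j) of IC(x, y, i, j) * F[j, i]        (Eq. 5)
    h, w = F.shape
    return sum(interp_coeff(x, y, i, j) * F[j, i]
               for j in range(h) for i in range(w))

def prpool(F, x1, y1, x2, y2, steps=50):
    # Approximate the normalized double integral in Eq. (7) with a
    # Riemann sum: the mean of f over the bin equals integral / area.
    xs = np.linspace(x1, x2, steps)
    ys = np.linspace(y1, y2, steps)
    return float(np.mean([continuous_feature(F, x, y)
                          for y in ys for x in xs]))

F = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 feature map
print(prpool(F, 0.3, 0.7, 2.6, 3.1))           # pooled value for one bin
```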
Preferably, the public data sets in step 3 comprise TrackingNet, LaSOT and COCO; during training, two frames of the same video sequence are sampled as an input image pair of the model, and each image is a 288 × 288 region cropped around the target.
Preferably, the confidence score of each candidate box in step 3 is evaluated as follows:
1) A fully connected layer is applied to the target feature to obtain a modulation vector x:
x = φ(Flatten(F_target))   (8)
where the input is the target feature F_target ∈ R^(C×5×5), and the fully connected layer φ(·) adjusts the dimension to output x ∈ R^(C×1×1);
2) The feature of each candidate box is modulated with the vector learned from the target feature, and the adjusted feature is passed through another fully connected layer θ(·) to output the final confidence score:
s = θ(F_candidate ⊙ x)   (9)
where ⊙ denotes broadcast multiplication: the 25 feature values of each channel of the candidate-box feature F_candidate ∈ R^(C×5×5) are given the same weight x_i (i = 0, 1, 2, …, C − 1), and s is the confidence score of each candidate box.
Preferably, the iteration in step 4 proceeds as follows:
1) In the testing stage, the coordinate position of each candidate box, with top-left corner (x1, y1) and bottom-right corner (x2, y2), is adjusted online by back-propagating the gradient of the candidate-box position information; thanks to the continuity of the PrRoI Pooling feature extraction operation, the gradient obtained by back-propagation is accurate, and the derivative with respect to a coordinate, e.g. x1, is:
∂PrPool(R, F)/∂x1 = PrPool(R, F)/(x2 − x1) − (∫_{y1}^{y2} f(x1, y) dy) / ((x2 − x1) × (y2 − y1))   (10)
2) Each candidate box obtains its corresponding gradient values and updates the position and size of its bounding box according to:
(x1, y1, x2, y2) ← (x1, y1, x2, y2) + λ · ∂s/∂(x1, y1, x2, y2)   (11)
where λ is the step size;
3) Steps 1) and 2) are repeated several times, and the candidate box with the highest confidence score in the output s is selected as the tracking result.
The invention has the following beneficial effects:
1. A multi-scale Transformer module is bridged behind the Siamese network of the ATOM-based tracking framework, and targets of different scales are used as supervision information to enhance the template features; this effectively suppresses the interference of large amounts of background information, guides more accurate bounding-box evaluation, reduces tracking errors, and enables single-target tracking in complex scenes.
2. The multi-scale Transformer feature-enhancement strategy is highly general and is applicable to most trackers based on candidate-box evaluation.
Drawings
FIG. 1: a flow diagram of a single target tracking method based on candidate box evaluation;
FIG. 2: a structure diagram of the multi-scale Transformer module;
FIG. 3: a diagram of the improved ATOM tracking framework;
FIG. 4: a comparison of tracking examples.
Detailed Description
The invention is further explained below with reference to the drawings.
As shown in FIG. 1 and FIG. 3, the multi-scale Transformer-based single-target tracking method takes the ATOM tracking framework as its baseline and improves upon it; it specifically includes the following steps:
Step 1: First, a Siamese network is used to extract the image features of the template (the first frame) and of subsequent images. The backbone of the Siamese network is a ResNet-18 pre-trained on the ImageNet dataset, and its parameters are shared by the template branch and the subsequent-image branch. To suit the tracking task, the ResNet-18 with the fully connected layer removed is used as the feature extraction module.
After the template branch of the Siamese network, the template features are enhanced by the multi-scale Transformer module provided by the invention, as shown in FIG. 2. The module takes the ResNet-18 template feature T ∈ R^(C×H×W), whose spatial size is H × W and whose number of channels is C. To follow the input convention of the self-attention mechanism, the 3-dimensional template T is reshaped to Q ∈ R^(HW×C) and used as the input Q of the Multi-Head Target-Attention. The template T is also fed into a Pyramid Transformer sub-module to extract multi-scale features of the target, which supervise the enhancement of the template features.
The sub-module crops three features of different spatial sizes (4 × 4, 8 × 8 and 12 × 12) around the center of the template feature map, embeds them into semantic spaces of different scales through three convolution layers that keep the channel number unchanged, and finally flattens them into 2-dimensional form. The overall flow of the Pyramid Transformer can be expressed as:
K_i = V_i = Flatten(Conv_i(Crop_i(T))), i = 1, 2, 3   (1)
in the Multi-Head Target-orientation submodule, first 1 convolution kernel is passed through a linear convolution layer of 1 × 1 size
Figure BDA0003351788790000062
And the number C of all the characteristic channels of V and K is reduced to C/4, so that the fitting effect of the acceleration model is achieved. Then, the similarity matrix a between Q and K is calculated again using the following equation.
Figure BDA0003351788790000063
Wherein the output is
Figure BDA0003351788790000064
dkIs characterized in that
Figure BDA0003351788790000065
Of (c) is calculated. After the similarity matrix is obtained, the output characteristic O of a single target attention block is calculated by the following matrix operation.
O=A*V (3)
Wherein, multiplication of the table matrix and output
Figure BDA0003351788790000066
Finally, the Target-Attention is extended to multiple heads, and summation and normalization are applied to obtain the template feature T' enhanced by the multi-scale Transformer:
T' = MultiHead(Q, K, V) = Norm(Concat(O_1, O_2, O_3)·W_o + Q)   (4)
where Norm denotes l2 normalization applied to the whole template feature, and W_o is a learnable parameter matrix that adjusts the concatenated channel number 3C to the input channel number C.
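For illustration, the following PyTorch sketch implements one plausible reading of equations (1)-(4). The crop handling, the use of one attention head per scale, the extra projection of Q to C/4 channels (needed for dimensional compatibility with the reduced K), and the resulting W_o mapping from 3·C/4 rather than 3C channels are assumptions, not details confirmed by the text or figures.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTransformer(nn.Module):
    """Sketch of the multi-scale Transformer template enhancement."""

    def __init__(self, channels=256, crop_sizes=(4, 8, 12)):
        super().__init__()
        self.crop_sizes = crop_sizes
        reduced = channels // 4
        # Scale-specific embeddings keeping the channel number unchanged (Eq. 1).
        self.embed = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1)
             for _ in crop_sizes])
        # 1x1 convolutions reducing K and V from C to C/4 channels.
        self.k_proj = nn.ModuleList(
            [nn.Conv1d(channels, reduced, kernel_size=1) for _ in crop_sizes])
        self.v_proj = nn.ModuleList(
            [nn.Conv1d(channels, reduced, kernel_size=1) for _ in crop_sizes])
        # Assumption: Q is also projected to C/4 so Q.K^T is well defined.
        self.q_proj = nn.Linear(channels, reduced)
        # W_o maps the concatenated heads (3 * C/4 here) back to C.
        self.w_o = nn.Linear(len(crop_sizes) * reduced, channels)

    @staticmethod
    def center_crop(t, size):
        _, _, h, w = t.shape
        top, left = (h - size) // 2, (w - size) // 2
        return t[:, :, top:top + size, left:left + size]

    def forward(self, template):                     # template: (B, C, H, W)
        b, c, h, w = template.shape
        q = template.flatten(2).transpose(1, 2)      # (B, HW, C), the query
        q_red = self.q_proj(q)                       # (B, HW, C/4)
        d_k = q_red.shape[-1]
        heads = []
        for i, size in enumerate(self.crop_sizes):
            crop = self.center_crop(template, size)          # (B, C, s, s)
            feat = self.embed[i](crop).flatten(2)            # (B, C, s*s)
            k = self.k_proj[i](feat).transpose(1, 2)         # (B, s*s, C/4)
            v = self.v_proj[i](feat).transpose(1, 2)         # (B, s*s, C/4)
            attn = torch.softmax(                            # Eq. (2)
                q_red @ k.transpose(1, 2) / d_k ** 0.5, dim=-1)
            heads.append(attn @ v)                           # O_i, Eq. (3)
        out = self.w_o(torch.cat(heads, dim=-1))             # Concat(...) W_o
        out = F.normalize(out + q, dim=-1)                   # Norm(... + Q), Eq. (4)
        return out.transpose(1, 2).reshape(b, c, h, w)

# Example: enhance an 18x18 template feature map with 256 channels.
enhanced = MultiScaleTransformer()(torch.randn(1, 256, 18, 18))
```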
Step 2: This step relies on the observation that the target changes little between consecutive frames of a video sequence. The tracking result of the previous frame is used as the reference bounding box; keeping its center position unchanged, 10 candidate boxes of different sizes are randomly generated as evaluation objects. Compared with the reference bounding box, the scaling factors of the width and height of these candidate boxes are restricted to the interval [0.75, 1.25]. The PrRoI Pooling operator is then used to extract the target feature F_target from the enhanced template of step 1 and the features F_candidate of the candidate boxes.
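A minimal sketch of this candidate-generation step is given below; the uniform sampling of the scale factors and the fixed, noise-free center are assumptions consistent with the description above, and the function name is hypothetical.

```python
import torch

def generate_candidates(ref_box, num=10, low=0.75, high=1.25):
    """Randomly rescale the reference box around its fixed center (sketch).

    ref_box: (x1, y1, x2, y2) taken from the previous frame's result.
    Returns a (num, 4) tensor of candidate boxes; the uniform
    distribution of the scale factors is an assumption.
    """
    x1, y1, x2, y2 = ref_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0       # center stays fixed
    w, h = x2 - x1, y2 - y1
    sw = torch.empty(num).uniform_(low, high)        # width scale factors
    sh = torch.empty(num).uniform_(low, high)        # height scale factors
    nw, nh = w * sw, h * sh
    return torch.stack(
        [cx - nw / 2, cy - nh / 2, cx + nw / 2, cy + nh / 2], dim=1)

candidates = generate_candidates((100.0, 80.0, 180.0, 200.0))
```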
Step 3: The features F_target and F_candidate obtained in step 2 are fed into IoU-Net, which is trained offline. The invention uses the same public data sets as ATOM for training, namely TrackingNet, LaSOT and COCO. During training, two frames of the same video sequence are sampled as an input image pair of the model, and each image is a 288 × 288 region cropped around the target. After training is complete, IoU-Net evaluates the confidence scores of the candidate boxes with its vector modulation method. The specific process is as follows: a fully connected layer is applied to the target feature to obtain a modulation vector x:
x = φ(Flatten(F_target))   (5)
where the input is the target feature F_target ∈ R^(C×5×5), and the fully connected layer φ(·) adjusts the dimension to output x ∈ R^(C×1×1).
The vector learned from the target feature is used to modulate the feature of each candidate box, and the adjusted feature is passed through another fully connected layer θ(·) to output the final confidence score:
s = θ(F_candidate ⊙ x)   (6)
where ⊙ denotes broadcast multiplication: the 25 feature values of each channel of the candidate-box feature F_candidate ∈ R^(C×5×5) are given the same weight x_i (i = 0, 1, 2, …, C − 1), and s is the confidence score of each candidate box.
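The following sketch shows one way such a modulation-based scoring head could look; the single-layer forms of φ and θ, the layer sizes, and the class name are illustrative assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn

class ModulatedIoUHead(nn.Module):
    """Sketch of the vector-modulation confidence head (Eqs. 5 and 6).

    phi maps the flattened C x 5 x 5 target feature to one weight per
    channel; theta maps the modulated candidate feature to a scalar
    confidence score. The single-layer form and sizes are assumptions.
    """
    def __init__(self, channels=256, pooled=5):
        super().__init__()
        self.channels = channels
        self.phi = nn.Linear(channels * pooled * pooled, channels)
        self.theta = nn.Linear(channels * pooled * pooled, 1)

    def forward(self, f_target, f_candidate):
        # f_target: (1, C, 5, 5); f_candidate: (N, C, 5, 5)
        x = self.phi(f_target.flatten(1))            # modulation vector, Eq. (5)
        x = x.view(1, self.channels, 1, 1)           # one weight per channel
        modulated = f_candidate * x                  # broadcast multiplication
        return self.theta(modulated.flatten(1)).squeeze(-1)   # scores s, Eq. (6)

head = ModulatedIoUHead()
scores = head(torch.randn(1, 256, 5, 5), torch.randn(10, 256, 5, 5))
```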
Step 4: In the testing stage, the coordinate position of each candidate box, with top-left corner (x1, y1) and bottom-right corner (x2, y2), is adjusted online by back-propagating the gradient of the candidate-box position information. Thanks to the continuity of the PrRoI Pooling feature extraction operation, the gradient obtained by back-propagation is accurate, and the derivative with respect to a coordinate, e.g. x1, is:
∂PrPool(R, F)/∂x1 = PrPool(R, F)/(x2 − x1) − (∫_{y1}^{y2} f(x1, y) dy) / ((x2 − x1) × (y2 − y1))   (7)
Each candidate box obtains its corresponding gradient values and updates the position and size of its bounding box. After 5 iterations, the candidate box with the highest confidence score in the output s is selected as the tracking result.
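A minimal refinement loop in this spirit is sketched below; it uses autograd in place of the analytical derivative of equation (7), treats the scoring network as a black box, and picks an illustrative step size and iteration count.

```python
import torch

def refine_boxes(score_fn, boxes, steps=5, lr=1.0):
    """Gradient-ascent refinement of candidate boxes (sketch).

    score_fn maps an (N, 4) tensor of boxes to (N,) confidence scores,
    e.g. PrRoI Pooling of the search features followed by the
    modulation head; it is treated here as a differentiable black box.
    The plain gradient-ascent update and the step size are assumptions.
    """
    boxes = boxes.detach().clone().requires_grad_(True)
    for _ in range(steps):
        score_fn(boxes).sum().backward()        # per-box coordinate gradients
        with torch.no_grad():
            boxes += lr * boxes.grad            # move towards higher confidence
        boxes.grad.zero_()
    with torch.no_grad():
        scores = score_fn(boxes)
    best = scores.argmax()                      # highest-confidence candidate
    return boxes[best].detach(), float(scores[best])
```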
Step 5: The experimental environment of the invention is as follows: CPU: Intel Core i5-7300HQ @ 2.50 GHz; GPU: GTX 1050Ti with 4 GB of video memory; operating system: Ubuntu 18.04.5 LTS (Linux 5.4.0-81-generic); CUDA version 10.2; deep learning framework PyTorch 1.6.0. The test results on the public data set OTB100 are as follows:
table 1: performance comparison before and after ATOM tracking method improvement
Accuracy FPS
ATOM 0.655 24.35
ATOM + Multi-Scale Transgormer (Ours) 0.664 22.10
As can be seen from Table 1, the multi-scale Transformer module provided by the invention effectively improves the accuracy of the tracker at a small cost in speed. To compare the improvement visually, FIG. 4 shows an example comparison on two video sequences (Basketball and Diving) from OTB100.

Claims (6)

1. The single-target tracking method based on the multi-scale Transformer is characterized by comprising the following steps:
step 1, a multi-scale Transformer module is applied to the template features extracted by the Siamese network, and target features of different scales are used as supervision information to guide template-feature enhancement, yielding the enhanced template features T';
the specific steps are as follows:
1) three features of different spatial sizes are cropped from the template feature map around its center, their scales being a × a, 2a × 2a and 3a × 3a respectively;
2) the features of the different spatial sizes are embedded into semantic spaces of different scales through three convolution layers that keep the channel number unchanged, and are finally flattened into 2-dimensional form; the overall flow of the multi-scale Transformer is expressed as:
K_i = V_i = Flatten(Conv_i(Crop_i(T))), i = 1, 2, 3   (1)
3) in the multi-head target-attention module, a linear convolution layer with a 1 × 1 kernel reduces the number of feature channels C of V and K to C/4, so as to accelerate the fitting of the model;
4) taking the template features as Q, the similarity matrix A between Q and K is calculated:
A = softmax(Q·K^T / √d_k)   (2)
where the output A is the similarity matrix and d_k is the dimension of the feature K;
5) after the similarity matrix is obtained, the output feature O of a single target-attention block is calculated by the following matrix operation:
O = A*V   (3)
where * denotes matrix multiplication and O is the output feature;
6) the target-attention block is extended to multiple heads, and the template features T' enhanced by the multi-scale Transformer are obtained through summation and normalization:
T' = MultiHead(Q, K, V) = Norm(Concat(O_1, O_2, O_3)·W_o + Q)   (4)
where Norm denotes l2 normalization applied to the whole template features, and W_o is a learnable parameter matrix that adjusts the concatenated channel number 3C to the input channel number C;
step 2, the tracking result of the previous frame is taken as the reference box of the current frame, and candidate boxes of several scales are randomly generated, with the random scaling factors of their width and height drawn from the interval [1 − α, 1 + α]; the features of the candidate boxes and the target feature in the enhanced template features T' are extracted through PrRoI Pooling;
step 3, an IoU-Net is trained offline on public data sets; the target information is propagated to the candidate boxes, the candidate-box features are adjusted by a vector modulation method, and a confidence score is evaluated for each candidate box;
step 4, in the online testing stage, the candidate-box positions are repeatedly fine-tuned using the gradients with respect to the candidate-box position information, iterating towards a more accurate bounding box that serves as the tracking result.
2. The multi-scale Transformer-based single-target tracking method of claim 1, wherein: the backbone of the Siamese network in step 1 is a ResNet-18 pre-trained on the ImageNet dataset, and its parameters are shared by the template branch and the subsequent-image branch; to suit the tracking task, the ResNet-18 with the fully connected layer removed is used as the feature extraction module, with a down-sampling rate of 16.
3. The multi-scale Transformer-based single-target tracking method of claim 1, wherein the PrRoI Pooling feature extraction process in step 2 is: first, the quantization problem of region pooling is solved by interpolation, expressed by the following equations:
f(x, y) = Σ_{i,j} IC(x, y, i, j) × F_{i,j}   (5)
IC(x, y, i, j) = max(0, 1 − |x − i|) × max(0, 1 − |y − j|)   (6)
where (i, j) is a coordinate position on the feature map, F_{i,j} is the feature value at (i, j), and IC(x, y, i, j) is the interpolation coefficient; finally, the interpolated feature region is extracted by double integration:
PrPool(R, F) = (∫_{y1}^{y2} ∫_{x1}^{x2} f(x, y) dx dy) / ((x2 − x1) × (y2 − y1))   (7)
where F is the feature map and the top-left corner (x1, y1) and bottom-right corner (x2, y2) define the region R from which features are extracted.
4. The multi-scale Transformer-based single-target tracking method of claim 1, wherein the public data sets in step 3 comprise TrackingNet, LaSOT and COCO; during training, two frames of the same video sequence are sampled as an input image pair of the model, and each image is a 288 × 288 region cropped around the target.
5. The multi-scale Transformer-based single-target tracking method of claim 1, wherein the confidence score of each candidate box in step 3 is evaluated as follows:
1) a fully connected layer is applied to the target feature to obtain a modulation vector x:
x = φ(Flatten(F_target))   (8)
where the input is the target feature F_target ∈ R^(C×5×5), and the fully connected layer φ(·) adjusts the dimension to output x ∈ R^(C×1×1);
2) the feature of each candidate box is modulated with the vector learned from the target feature, and the adjusted feature is passed through another fully connected layer θ(·) to output the final confidence score:
s = θ(F_candidate ⊙ x)   (9)
where ⊙ denotes broadcast multiplication: the 25 feature values of each channel of the candidate-box feature F_candidate ∈ R^(C×5×5) are given the same weight x_i (i = 0, 1, 2, …, C − 1), and s is the confidence score of each candidate box.
6. The multi-scale Transformer-based single-target tracking method of claim 1, wherein the iteration in step 4 proceeds as follows:
1) in the testing stage, the coordinate position of each candidate box, with top-left corner (x1, y1) and bottom-right corner (x2, y2), is adjusted online by back-propagating the gradient of the candidate-box position information; thanks to the continuity of the PrRoI Pooling feature extraction operation, the gradient obtained by back-propagation is accurate, and the derivative with respect to a coordinate, e.g. x1, is:
∂PrPool(R, F)/∂x1 = PrPool(R, F)/(x2 − x1) − (∫_{y1}^{y2} f(x1, y) dy) / ((x2 − x1) × (y2 − y1))   (10)
2) each candidate box obtains its corresponding gradient values and updates the position and size of its bounding box according to:
(x1, y1, x2, y2) ← (x1, y1, x2, y2) + λ · ∂s/∂(x1, y1, x2, y2)   (11)
where λ is the step size;
3) steps 1) and 2) are repeated several times, and the candidate box with the highest confidence score in the output s is selected as the tracking result.
Priority and family data

Application CN202111340646.0A was filed on 2021-11-12 by Hangzhou Dianzi University, with priority date 2021-11-12, and was published as CN114140495A (status: pending) on 2022-03-04. Patent family ID: 80393732.


Cited By (2)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN117853664A * | 2024-03-04 | 2024-04-09 | 云南大学 | Three-dimensional face reconstruction method based on double-branch feature fusion
CN117853664B * | 2024-03-04 | 2024-05-14 | 云南大学 | Three-dimensional face reconstruction method based on double-branch feature fusion


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination