CN114897941A - Target tracking method based on Transformer and CNN - Google Patents

Target tracking method based on Transformer and CNN

Info

Publication number
CN114897941A
CN114897941A
Authority
CN
China
Prior art keywords
target
network
tracking
transformer
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210819539.4A
Other languages
Chinese (zh)
Other versions
CN114897941B (en)
Inventor
余知音
向北海
邹融
陈远益
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha Chaochuang Electronic Technology Co ltd
Original Assignee
Changsha Chaochuang Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha Chaochuang Electronic Technology Co ltd filed Critical Changsha Chaochuang Electronic Technology Co ltd
Priority to CN202210819539.4A priority Critical patent/CN114897941B/en
Publication of CN114897941A publication Critical patent/CN114897941A/en
Application granted granted Critical
Publication of CN114897941B publication Critical patent/CN114897941B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/248Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Abstract

The invention discloses a target tracking method based on a Transformer and a CNN, which comprises the following steps: cropping the target and the search region according to the initial target position; generating an online template library from the known target by image augmentation; extracting target features through a CNN network; analyzing the target features to obtain corresponding frame score maps; performing similarity calculation on the two predicted targets and, if the similarity is higher than a certain threshold, outputting the result directly after simple processing; if the similarity is lower than the threshold, determining the cause of the missed detection and, if only a single network has missed the target, updating the template library with the current correct target and correcting the corresponding network; if both networks miss the target at the same time, stopping tracking, expanding the search range and attempting to recover the target. The invention ensures robust tracking of deformed and blurred targets, adaptability to changes of target scale during tracking, stability of long-term target tracking, and robust tracking of occluded and deformed targets.

Description

Target tracking method based on Transformer and CNN
Technical Field
The invention belongs to the technical field of single target tracking in computer vision, and particularly relates to a target tracking method based on a Transformer and a CNN.
Background
As one of the important tasks of computer vision, single-target tracking has wide application prospects in real life, such as pedestrian detection and tracking against complex backgrounds, human-computer interaction, autonomous driving, and the like. Existing single-target tracking algorithms achieve high tracking precision and good short-term tracking performance when the target is against a simple background with little deformation or occlusion.
However, the existing mainstream tracking algorithms all have considerable shortcomings. The conventional tracking algorithms based on correlation filtering update the scale of the tracked target with low precision, so missed detections and false detections occur easily when the target size changes. Siamese (twin) network tracking algorithms can track stably when the target deforms, but because they lack an effective template update mechanism they are insufficiently robust when tracking occluded or blurred targets. Tracking algorithms based on the Transformer structure have strong advantages in dealing with target occlusion, but a Transformer that only linearly partitions and maps the target features extracts local target features poorly and has low discrimination of target deformation.
Disclosure of Invention
In view of the above, the invention integrates deep learning tracking algorithms with complementary performance to solve the long-term single-target tracking problem when the target is occluded and deformed.
In order to solve the above problems, the invention provides a dual-network deep learning tracking method based on ensemble learning, which realizes stable tracking of morphologically variable targets against complex backgrounds. The method possesses the translation invariance and the strong local-information extraction capability of the CNN tracking network, and also effectively fuses the strong anti-occlusion capability of the Transformer tracking network, thereby obtaining stable tracking under target deformation and occlusion. Meanwhile, in order to realize long-term tracking, the invention provides an online learning method that realizes online optimization of the network weights.
In addition, the invention adopts a search-range self-adaptive strategy according to the size of the tracked target, so as to improve the accuracy of the tracking result.
Specifically, the invention discloses a target tracking method based on a Transformer and a CNN, which comprises the following steps:
initial target loading: cutting the target and the search area according to the initial target position;
online target template library generation: generating an online template library from the known target by image augmentation;
feature extraction: extracting target characteristics through a CNN network;
target prediction: two deep learning networks with different structures are used simultaneously to obtain corresponding frame score maps by analyzing the target features, and the score maps are converted into the relative position of the target in the image frame through a corner regression network;
similarity judgment: in order to reduce the computed parameters and improve the real-time performance of the tracking algorithm, similarity calculation is performed on the two targets obtained in the target prediction step; if the similarity is higher than a threshold $\tau_{s}$, the two algorithms are considered to be tracking stably and the result is output directly after simple processing;
missed-detection correction: if the similarity is lower than the threshold $\tau_{s}$, the cause of the missed detection is determined; if only a single network has missed the target, the template library is updated with the current correct target and the corresponding network is corrected;
target recovery: if both networks miss the target simultaneously and the target is lost, tracking is stopped, the search range is expanded, and an attempt is made to recover the target.
Further, the cropping is performed as follows:
$$W=\lambda\,w_{0},\qquad H=\lambda\,h_{0}$$
wherein $W$ and $H$ are the length and width of the final cropped target template, $w_{0}$ and $h_{0}$ are the length and width of the initial target, and $\lambda$ is an adaptive amplification factor of the search range that varies according to the size of the target.
Further, the image augmentation means includes image rotation, image size transformation and motion blur; the image size transformation comprises scaling the template image through a Gaussian pyramid and through bilinear interpolation respectively, to obtain features of the current target at different scales, and the motion blur blurs the image using mean filtering.
Further, the Gaussian pyramid formula is as follows:
$$G_{1}(i,j)=\sum_{m=-2}^{2}\sum_{n=-2}^{2} w(m,n)\,G_{0}(2i+m,\,2j+n)$$
wherein $w(m,n)$ is the Gaussian convolution kernel, $G_{0}$ is the original template image, and $G_{1}$ is the template image reduced to one quarter of its area;
the bilinear interpolation formula is as follows:
$$f(x,y)\approx\frac{(x_{2}-x)(y_{2}-y)f(Q_{11})+(x-x_{1})(y_{2}-y)f(Q_{21})+(x_{2}-x)(y-y_{1})f(Q_{12})+(x-x_{1})(y-y_{1})f(Q_{22})}{(x_{2}-x_{1})(y_{2}-y_{1})}$$
wherein $Q_{11}$, $Q_{21}$, $Q_{12}$ and $Q_{22}$ are the four grid points neighbouring the interpolation position $(x,y)$.
further, a residual error network ResNet based on CNN is used as a backbone network to realize feature extraction of a target template library and a search area.
Further, the target prediction uses a convolution tracking network and a Transformer tracking network;
the convolution tracking network consists of a convolution layer and a linear full-connection layer, and updates the convolution classifier in a mode of learning the characteristics of the target template library on line; in order to accelerate the convergence rate of the classification model, the weight of the model is optimized by adopting a Gaussian-Newton iteration method in the updating process, and the updated classifier is used for positioning the target of the current frame in the search area to obtain a corresponding score map;
the Transformer tracking network consists of an Attention module and a linear full-connection layer, and in order to further strengthen the local information perception capability of the Transformer network, a convolution layer is used for flattening picture features extracted by ResNet and mapping the picture features into features required by Attention calculation
Figure 353525DEST_PATH_IMAGE013
A component;
after the current search area and the target template characteristics are subjected to attention calculation, obtaining corresponding scores of the search area characteristics F through a linear full link layer (MLP);
the Transformer network score calculation formula is represented by the following formula:
Figure 309980DEST_PATH_IMAGE014
wherein
Figure 378299DEST_PATH_IMAGE013
The components are derived from mapping the ResNet extracted features,
Figure 671877DEST_PATH_IMAGE015
is a data dimension;
to ensure long-term tracking capability, a weighted average of the cross-entropy loss $L_{ce}$ and the triplet loss $L_{tri}$ is used, as shown in the following formulas:
$$L=\lambda_{1}L_{ce}+\lambda_{2}L_{tri}$$
$$L_{tri}=\max\!\left(D_{p}-D_{n}+m,\;\varepsilon\right)$$
wherein $\lambda_{1}$ and $\lambda_{2}$ are the weights of the corresponding loss functions, $m$ is a training threshold constant, $D_{p}$ and $D_{n}$ are the Mahalanobis distances between the current result and the positive and negative samples respectively, and $\varepsilon$ is a constant.
Furthermore, in order to obtain a more accurate target scale estimate, a corner regression network consisting of multiple convolutional layers plus a corner pooling layer is used to convert the score maps of the two prediction networks into corresponding tracking boxes and network confidences.
Further, the similarity of the two predicted targets is expressed by the image structural similarity (SSIM) index, calculated as follows:
$$\mathrm{SSIM}(x,y)=\left[l(x,y)\right]^{\alpha}\left[c(x,y)\right]^{\beta}\left[s(x,y)\right]^{\gamma}$$
wherein $l(x,y)$ denotes the luminance similarity of the targets predicted by the convolution network and the Transformer network, $c(x,y)$ denotes the contrast similarity of the two predicted targets, $s(x,y)$ denotes their structural similarity, $\alpha$, $\beta$ and $\gamma$ are the corresponding similarity weights, and $C$ is a constant.
Further, the network correction method includes:
for the convolution tracking network, a temporary template library is reconstructed from the current target, and the classifier weights are optimized with the online update method of the target prediction step;
for the Transformer network, the current correct target position is taken as a positive sample and the Transformer missed-detection result as a negative sample, and the contrastive loss $L_{con}$ is calculated as
$$L_{con}=\max\!\left(m-D,\;0\right)$$
wherein $m$ is a training threshold constant and $D$ is the Mahalanobis distance between the input samples.
Compared with the prior art, the invention has the following beneficial effects:
two deep learning tracking networks are integrated to track the target synchronously, and the accuracy of long-term target tracking is improved by integrating complementary networks;
a template library update strategy based on confidence and similarity is provided, ensuring robust tracking of deformed and blurred targets;
a corner regression network is provided, ensuring adaptability to changes of target scale during tracking;
a target missed-detection correction strategy based on online learning and complementary network integration is provided, ensuring the stability of long-term target tracking;
a convolution-fused Transformer tracking network is provided, ensuring robust tracking of occluded and deformed targets.
Drawings
FIG. 1 is a flowchart of the process of the present invention;
FIG. 2 is a flow diagram of the convolution tracking network of the present invention;
FIG. 3 is a flow chart of the Transformer tracking network of the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.
The invention provides a dual-network deep learning tracking method based on ensemble learning, which realizes stable tracking of morphologically variable targets against complex backgrounds. The method possesses the translation invariance and the strong local-information extraction capability of the CNN tracking network, and also effectively fuses the strong anti-occlusion capability of the Transformer tracking network, thereby obtaining stable tracking under target deformation and occlusion. Meanwhile, in order to realize long-term tracking, the invention provides an online learning method that realizes online optimization of the network weights.
In addition, the invention adopts a search-range self-adaptive strategy according to the size of the tracked target, so as to improve the accuracy of the tracking result. Referring to the flow chart of FIG. 1, the steps of the invention include:
S1 initial target loading: the target and the search region are cropped according to the initial target position.
To reduce the amount of computation, the invention crops the target. The crop size is determined by formula (1), where $W$ and $H$ are the length and width of the final cropped target template, $w_{0}$ and $h_{0}$ are the length and width of the initial target, and $\lambda$ is an adaptive amplification factor of the search range that varies according to the size of the target:
$$W=\lambda\,w_{0},\qquad H=\lambda\,h_{0}$$
(1)
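As an illustration of step S1, the following Python sketch crops a search region whose side lengths are the initial target size scaled by the amplification factor λ. It is a minimal sketch assuming the simple scaling of formula (1) and mean-value padding at the frame border; the exact crop rule in the patent is given only as an image equation.

```python
import numpy as np

def crop_search_region(frame, cx, cy, w0, h0, lam=2.0):
    """Crop a search region centered on the target.

    frame    : H x W x C image array
    (cx, cy) : target center; (w0, h0) : initial target size
    lam      : adaptive amplification factor of the search range (assumed value)
    """
    W, H = int(round(lam * w0)), int(round(lam * h0))      # assumed form of formula (1)
    x1, y1 = int(round(cx - W / 2)), int(round(cy - H / 2))
    x2, y2 = x1 + W, y1 + H
    # pad with the mean pixel value when the crop exceeds the frame border
    out = np.full((H, W, frame.shape[2]), frame.mean(axis=(0, 1)), dtype=frame.dtype)
    fx1, fy1 = max(x1, 0), max(y1, 0)
    fx2, fy2 = min(x2, frame.shape[1]), min(y2, frame.shape[0])
    out[fy1 - y1:fy2 - y1, fx1 - x1:fx2 - x1] = frame[fy1:fy2, fx1:fx2]
    return out
```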
S2 online target template library generation: an online template library is generated from the known target by image augmentation.
The target from step S1 is augmented; the image augmentation methods mainly include the following (a sketch is given after this list):
a. image rotation;
b. image size transformation: the template image is scaled by a Gaussian pyramid and by bilinear interpolation respectively, to obtain features of the current target at different scales.
Formula (2) is the general formula of the Gaussian pyramid, where $w(m,n)$ is the Gaussian convolution kernel, $G_{0}$ is the original template image, and $G_{1}$ is the template image reduced to one quarter of its area:
$$G_{1}(i,j)=\sum_{m=-2}^{2}\sum_{n=-2}^{2} w(m,n)\,G_{0}(2i+m,\,2j+n)$$
(2)
Formula (3) is the general formula of the bilinear interpolation adopted by the invention, where $Q_{11}$, $Q_{21}$, $Q_{12}$ and $Q_{22}$ are the four grid points neighbouring the interpolation position $(x,y)$:
$$f(x,y)\approx\frac{(x_{2}-x)(y_{2}-y)f(Q_{11})+(x-x_{1})(y_{2}-y)f(Q_{21})+(x_{2}-x)(y-y_{1})f(Q_{12})+(x-x_{1})(y-y_{1})f(Q_{22})}{(x_{2}-x_{1})(y_{2}-y_{1})}$$
(3)
c. motion blur: the image is blurred using mean filtering.
d. feature enhancement: the invention selectively enhances weak information according to the brightness and the amount of structural information of the current target.
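The following OpenCV sketch illustrates step S2. The rotation angles, scale factor and mean-filter kernel size are illustrative assumptions, and the feature-enhancement step (d) is omitted because the patent does not give its formula.

```python
import cv2

def augment_template(template):
    """Generate augmented views of the target template for the online template library."""
    aug = []
    h, w = template.shape[:2]
    # a. image rotation (angles are assumptions)
    for angle in (-15, 15):
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        aug.append(cv2.warpAffine(template, M, (w, h)))
    # b. image size transformation: Gaussian pyramid (quarter-area reduction, formula (2))
    aug.append(cv2.pyrDown(template))
    #    and bilinear interpolation (formula (3)) for an additional scale
    aug.append(cv2.resize(template, (int(w * 1.2), int(h * 1.2)),
                          interpolation=cv2.INTER_LINEAR))
    # c. motion blur approximated with a mean filter
    aug.append(cv2.blur(template, (5, 5)))
    return aug
```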
S3 feature extraction: target features are extracted through a CNN.
The method uses a CNN-based residual network (ResNet) as the backbone network to extract features of the target template library and the search region; ResNet is prior art in the field and is not described in detail here.
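A minimal PyTorch sketch of this backbone follows; the patent does not specify the ResNet depth or cut-off stage, so ResNet-50 truncated after layer3 is assumed here.

```python
import torch
import torchvision

class ResNetBackbone(torch.nn.Module):
    """Feature extractor for template-library and search-region images (assumed ResNet-50)."""
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # keep layers up to layer3; discard the classification head
        self.body = torch.nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
            resnet.layer1, resnet.layer2, resnet.layer3,
        )

    def forward(self, x):          # x: (B, 3, H, W)
        return self.body(x)        # (B, 1024, H/16, W/16) feature map
```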
S4 target prediction: two deep learning networks with different structures are used simultaneously to obtain corresponding frame score maps by analyzing the target features, and the score maps are converted into the relative position of the target in the image frame through a corner regression network.
The two tracking networks used in the invention are:
a. Convolution tracking network:
The convolution tracking network mainly consists of convolutional layers and a linear fully connected layer, and updates the convolution classifier by learning the features of the target template library online. In order to accelerate the convergence of the classification model, the invention optimizes the model weights with the Gauss-Newton iteration method. The updated classifier locates the current-frame target in the search region to obtain the corresponding score map. In order to reduce the computation time of the convolution network, depthwise separable convolutions are adopted for the convolution computation. FIG. 2 is a flow chart of the convolution tracking network of the invention.
b. Transformer tracking network:
The Transformer tracking network of the invention mainly consists of an attention module and a linear fully connected layer. To further strengthen the local information perception capability of the Transformer network, the invention uses a convolutional layer in place of the linear mapping and position encoding of the general Transformer structure: it flattens the image features extracted by ResNet and maps them into the query, key and value (Q, K, V) components required by the attention calculation. After attention is computed between the current search-region features and the target-template features, the corresponding score of the search-region feature F is obtained through a linear fully connected layer (MLP).
The Transformer network score is calculated by formula (4), where the Q, K and V components are obtained by mapping the ResNet-extracted features and $d_{k}$ is the data dimension:
$$\mathrm{Score}(F)=\mathrm{MLP}\!\left(\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V\right)$$
(4)
FIG. 3 is a flow chart of the Transformer network of the present invention.
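The following sketch illustrates a score head in the spirit of formula (4): 1×1 convolutions stand in for the convolutional mapping described above, producing Q from the search-region features and K, V from the template features, and an MLP maps the attended features to a score map. The feature dimension and the single-head attention are assumptions.

```python
import torch

class ConvAttentionScore(torch.nn.Module):
    """Cross-attention score between search-region and template features (formula (4) style)."""
    def __init__(self, channels=1024, dim=256):
        super().__init__()
        # 1x1 convolutions replace the linear mapping + positional encoding of a plain Transformer
        self.to_q = torch.nn.Conv2d(channels, dim, 1)
        self.to_k = torch.nn.Conv2d(channels, dim, 1)
        self.to_v = torch.nn.Conv2d(channels, dim, 1)
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, 1))

    def forward(self, search_feat, template_feat):
        b, _, h, w = search_feat.shape
        q = self.to_q(search_feat).flatten(2).transpose(1, 2)      # (B, h*w, d)
        k = self.to_k(template_feat).flatten(2).transpose(1, 2)    # (B, n, d)
        v = self.to_v(template_feat).flatten(2).transpose(1, 2)    # (B, n, d)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        fused = attn @ v                                           # (B, h*w, d)
        return self.mlp(fused).transpose(1, 2).reshape(b, 1, h, w) # (B, 1, h, w) score map
```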
In terms of the loss function, to ensure the long-term tracking capability of the invention, a weighted average of the cross-entropy loss $L_{ce}$ and the triplet loss $L_{tri}$ commonly used in re-identification problems is adopted, as in formulas (5) and (6), where $\lambda_{1}$ and $\lambda_{2}$ are the weights of the corresponding loss functions, $m$ is a training threshold constant, $D_{p}$ and $D_{n}$ are the Mahalanobis distances between the current result and the positive and negative samples respectively, and $\varepsilon$ is a constant:
$$L=\lambda_{1}L_{ce}+\lambda_{2}L_{tri}$$
(5)
$$L_{tri}=\max\!\left(D_{p}-D_{n}+m,\;\varepsilon\right)$$
(6)
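A sketch of the weighted loss of formulas (5)-(6) follows; plain distances are used in place of the Mahalanobis distances, and the weights, margin and floor constant are illustrative values.

```python
import torch

def tracking_loss(logits, labels, d_pos, d_neg,
                  lam_ce=1.0, lam_tri=1.0, margin=0.3, eps=0.0):
    """Weighted cross-entropy + triplet loss in the style of formulas (5)-(6).

    logits, labels : classification output and ground-truth labels
    d_pos, d_neg   : distances from the current result to the positive / negative samples
    """
    l_ce = torch.nn.functional.cross_entropy(logits, labels)          # formula (5), first term
    l_tri = torch.clamp(d_pos - d_neg + margin, min=eps).mean()       # formula (6)
    return lam_ce * l_ce + lam_tri * l_tri
```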
In order to obtain a more accurate target scale estimate, the invention uses a corner regression network, a structure of multiple convolutional layers plus a corner pooling layer, which converts the score maps of the two prediction networks into corresponding tracking boxes and network confidences.
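A sketch of such a corner regression head follows: convolutional layers followed by corner pooling produce top-left and bottom-right corner maps, from which a box and a confidence are read out. The channel width, the CornerNet-style cumulative-max corner pooling and the argmax decoding are assumptions.

```python
import torch

class CornerRegressionHead(torch.nn.Module):
    """Multi-layer convolution + corner pooling, converting a score map into a box and a confidence."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(1, channels, 3, padding=1), torch.nn.ReLU(),
            torch.nn.Conv2d(channels, channels, 3, padding=1), torch.nn.ReLU())
        self.tl_head = torch.nn.Conv2d(channels, 1, 1)   # top-left corner heat map
        self.br_head = torch.nn.Conv2d(channels, 1, 1)   # bottom-right corner heat map

    @staticmethod
    def corner_pool(feat, top_left):
        """Corner pooling: sum of a directional running max along width and along height."""
        def dir_max(t, dim, from_far_side):
            if from_far_side:
                t = torch.flip(t, [dim])
            t = torch.cummax(t, dim=dim).values
            return torch.flip(t, [dim]) if from_far_side else t
        # top-left corners look right/down (max taken from the far side); bottom-right look left/up
        return dir_max(feat, -1, top_left) + dir_max(feat, -2, top_left)

    def forward(self, score_map):                         # (B, 1, h, w) score map from a branch
        f = self.conv(score_map)
        tl = self.tl_head(self.corner_pool(f, top_left=True))
        br = self.br_head(self.corner_pool(f, top_left=False))
        b, _, h, w = tl.shape
        tl_idx = tl.flatten(1).argmax(dim=1)
        br_idx = br.flatten(1).argmax(dim=1)
        box = torch.stack([tl_idx % w, torch.div(tl_idx, w, rounding_mode="floor"),
                           br_idx % w, torch.div(br_idx, w, rounding_mode="floor")], dim=1).float()
        conf = 0.5 * (tl.flatten(1).max(dim=1).values + br.flatten(1).max(dim=1).values)
        return box, conf   # (x1, y1, x2, y2) in feature-map coordinates, plus a confidence score
```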
S5 similarity judgment: in order to reduce the computed parameters and improve the real-time performance of the tracking algorithm, the similarity between the two targets obtained in step S4 is calculated; if the similarity is higher than a threshold $\tau_{s}$, the two algorithms are considered to be tracking stably and the result can be output directly after simple processing.
Meanwhile, in order to ensure the long-term tracking capability of the invention, the confidences of the two tracking networks are also consulted when outputting the tracking result: if the similarity of the two networks is higher than the threshold $\tau_{s}$ but a confidence is below its corresponding threshold $\tau_{c}$, the current target is considered to have deformed significantly and is selected and added to the online target template library; if the size of the online template library exceeds a set threshold $N_{\max}$, the oldest template is deleted according to the template addition time.
In the invention, the similarity of the two predicted targets is expressed by the image structural similarity (SSIM) index, calculated by formula (7):
$$\mathrm{SSIM}(x,y)=\left[l(x,y)\right]^{\alpha}\left[c(x,y)\right]^{\beta}\left[s(x,y)\right]^{\gamma}$$
(7)
where $l(x,y)$ denotes the luminance similarity of the targets predicted by the convolution network and the Transformer network, $c(x,y)$ denotes the contrast similarity of the two predicted targets, $s(x,y)$ denotes their structural similarity, $\alpha$, $\beta$ and $\gamma$ are the corresponding similarity weights, and $C$ is a constant.
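A sketch of the SSIM comparison of formula (7) for two equally sized grayscale target patches follows, using the common single-scale form with the three component weights set to 1 and the usual stabilizing constants for 8-bit images.

```python
import cv2
import numpy as np

def ssim(pred_a, pred_b, C1=6.5025, C2=58.5225):
    """Structural similarity between two predicted target patches (formula (7) style).

    pred_a, pred_b : grayscale patches of identical size (e.g. the two branches' predictions).
    """
    a, b = pred_a.astype(np.float64), pred_b.astype(np.float64)
    mu_a = cv2.GaussianBlur(a, (11, 11), 1.5)
    mu_b = cv2.GaussianBlur(b, (11, 11), 1.5)
    var_a = cv2.GaussianBlur(a * a, (11, 11), 1.5) - mu_a ** 2
    var_b = cv2.GaussianBlur(b * b, (11, 11), 1.5) - mu_b ** 2
    cov = cv2.GaussianBlur(a * b, (11, 11), 1.5) - mu_a * mu_b
    ssim_map = ((2 * mu_a * mu_b + C1) * (2 * cov + C2)) / (
        (mu_a ** 2 + mu_b ** 2 + C1) * (var_a + var_b + C2))
    return float(ssim_map.mean())
```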
S6 missed-detection correction: if the similarity is lower than the threshold $\tau_{s}$, a missed detection has occurred; the template library is updated with the current correct target and the corresponding network is corrected.
If the SSIM is below the threshold $\tau_{s}$, a missed detection has occurred and its cause needs to be determined. The invention uses the confidence of each network in its currently predicted target as the missed-detection criterion: if only one network's confidence is below the corresponding threshold $\tau_{c}$, that network is judged to have missed the target; the output of the other network is taken as the correct output, the missed network is corrected, and target tracking continues.
The network correction methods adopted by the invention are:
a. For the convolution tracking network, the invention reconstructs a temporary template library from the current target and optimizes the classifier weights with the online update method of step S4, thereby correcting the missed network.
b. For the Transformer network, the invention also uses online learning to correct the missed network. The current correct target position is taken as a positive sample and the Transformer missed-detection result as a negative sample, and the contrastive loss $L_{con}$ of formula (8) is calculated, where $m$ is a training threshold constant and $D$ is the Mahalanobis distance between the input samples; the Gauss-Newton iteration method is then used to correct the network weights.
$$L_{con}=\max\!\left(m-D,\;0\right)$$
(8)
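A sketch of the contrastive correction loss of formula (8) follows; the Euclidean distance stands in for the Mahalanobis distance, the margin value is illustrative, and the subsequent Gauss-Newton weight update is not shown.

```python
import torch

def contrastive_correction_loss(feat_correct, feat_missed, margin=1.0):
    """Contrastive loss used to correct the Transformer branch (formula (8) style).

    feat_correct : feature of the currently correct target (positive sample)
    feat_missed  : feature of the Transformer missed-detection result (negative sample)
    """
    d = torch.norm(feat_correct - feat_missed, p=2, dim=-1)   # Euclidean stand-in for Mahalanobis
    return torch.clamp(margin - d, min=0.0).mean()
```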
S7 target retrieval: if both networks miss the target simultaneously and the target is lost, tracking is stopped, the search range is expanded, and an attempt is made to recover the target.
If the confidences of both networks are below the corresponding threshold $\tau_{c}$, both networks are judged to have missed the target and the target is lost. In this case the invention stops tracking, expands the target search range, and attempts to recover the target using the Kuhn-Munkres bipartite matching algorithm; the weights used in the matching algorithm are also expressed by the contrastive loss of formula (8).
Comparative experiment:
the hardware environment for the experiment of the invention is i7-8700 CPU and Yingwei GTX 2080Ti GPU. The software environments are Python3.6 and CUDA 11.0. The VOT 2019 public data set is adopted in the experimental data set, and the algorithm is compared with a current leading edge single-target tracking algorithm. The results of the experiments are shown in the following table:
Figure DEST_PATH_IMAGE048AAAA
Compared with the prior art, the invention has the following beneficial effects:
two deep learning tracking networks are integrated to track the target synchronously, and the accuracy of long-term target tracking is improved by integrating complementary networks;
a template library update strategy based on confidence and similarity is provided, ensuring robust tracking of deformed and blurred targets;
a corner regression network is provided, ensuring adaptability to changes of target scale during tracking;
a target missed-detection correction strategy based on online learning and complementary network integration is provided, ensuring the stability of long-term target tracking;
a convolution-fused Transformer tracking network is provided, ensuring robust tracking of occluded and deformed targets.
The word "preferred" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "preferred" is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word "preferred" is intended to present concepts in a concrete fashion. The term "or" as used in this application is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X employs A or B" is intended to mean any of the natural inclusive permutations: if X employs A, X employs B, or X employs both A and B, then "X employs A or B" is satisfied in any of the foregoing instances.
Also, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The present disclosure includes all such modifications and alterations, and is limited only by the scope of the appended claims. In particular regard to the various functions performed by the above described components (e.g., elements, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (i.e., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary implementations of the disclosure. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for a given or particular application. Furthermore, to the extent that the terms "includes", "has", "contains", or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising".
Each functional unit in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or a plurality of or more than one unit are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium. The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Each apparatus or system described above may execute the storage method in the corresponding method embodiment.
In summary, the above-mentioned embodiment is an implementation manner of the present invention, but the implementation manner of the present invention is not limited by the above-mentioned embodiment, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be regarded as equivalent replacements within the protection scope of the present invention.

Claims (9)

1. The target tracking method based on the Transformer and the CNN is characterized by comprising the following steps of:
initial target loading: cutting the target and the search area according to the initial target position;
online target template library generation: generating an online template library from the known target by image augmentation;
feature extraction: extracting target characteristics through a CNN network;
target prediction: two deep learning networks with different structures are used simultaneously to obtain corresponding frame score maps by analyzing the target features, and the score maps are converted into the relative position of the target in the image frame through a corner regression network;
similarity judgment: in order to reduce the computed parameters and improve the real-time performance of the tracking algorithm, similarity calculation is performed on the two targets obtained in the target prediction step; if the similarity is higher than a threshold $\tau_{s}$, the two algorithms are considered to be tracking stably and the result is output directly after simple processing;
missed-detection correction: if the similarity is lower than the threshold $\tau_{s}$, the cause of the missed detection is determined; if only a single network has missed the target, the template library is updated with the current correct target and the corresponding network is corrected;
target recovery: if both networks miss the target simultaneously and the target is lost, tracking is stopped, the search range is expanded, and an attempt is made to recover the target.
2. The Transformer and CNN-based target tracking method according to claim 1, wherein the cropping is performed as follows:
$$W=\lambda\,w_{0},\qquad H=\lambda\,h_{0}$$
wherein $W$ and $H$ are the length and width of the final cropped target template, $w_{0}$ and $h_{0}$ are the length and width of the initial target, and $\lambda$ is an adaptive amplification factor of the search range that varies according to the size of the target.
3. The Transformer and CNN based target tracking method of claim 1, wherein the image augmentation means comprises image rotation, image size transformation and motion blur; the image size transformation comprises scaling the template image through a Gaussian pyramid and through bilinear interpolation respectively, to obtain features of the current target at different scales, and the motion blur blurs the image using mean filtering.
4. The Transformer and CNN based target tracking method of claim 3, wherein the Gaussian pyramid formula is as follows:
$$G_{1}(i,j)=\sum_{m=-2}^{2}\sum_{n=-2}^{2} w(m,n)\,G_{0}(2i+m,\,2j+n)$$
wherein $w(m,n)$ is the Gaussian convolution kernel, $G_{0}$ is the original template image, and $G_{1}$ is the template image reduced to one quarter of its area.
5. The method for tracking targets based on Transformer and CNN as claimed in claim 1, wherein residual ResNet based on CNN is used as backbone network to extract features of target template library and search area.
6. The Transformer and CNN based target tracking method of claim 1, wherein the target prediction uses a convolution tracking network and a Transformer tracking network;
the convolution tracking network consists of convolutional layers and a linear fully connected layer, and updates the convolution classifier by learning the features of the target template library online; in order to accelerate the convergence of the classification model, the model weights are optimized with the Gauss-Newton iteration method during the update, and the updated classifier locates the current-frame target in the search region to obtain the corresponding score map;
the Transformer tracking network consists of an attention module and a linear fully connected layer; in order to further strengthen the local information perception capability of the Transformer network, a convolutional layer is used to flatten the image features extracted by ResNet and map them into the query, key and value (Q, K, V) components required by the attention calculation;
after attention is computed between the current search-region features and the target-template features, the corresponding score of the search-region feature F is obtained through a linear fully connected layer MLP;
the Transformer network score is calculated by the following formula, wherein the Q, K and V components are obtained by mapping the ResNet-extracted features and $d_{k}$ is the data dimension:
$$\mathrm{Score}(F)=\mathrm{MLP}\!\left(\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V\right)$$
to ensure long-term tracking capability, a weighted average of the cross-entropy loss $L_{ce}$ and the triplet loss $L_{tri}$ is used, as shown in the following formulas, wherein $\lambda_{1}$ and $\lambda_{2}$ are the weights of the corresponding loss functions, $m$ is a training threshold constant, $D_{p}$ and $D_{n}$ are the Mahalanobis distances between the current result and the positive and negative samples respectively, and $\varepsilon$ is a constant:
$$L=\lambda_{1}L_{ce}+\lambda_{2}L_{tri}$$
$$L_{tri}=\max\!\left(D_{p}-D_{n}+m,\;\varepsilon\right)$$
7. The Transformer and CNN-based target tracking method according to claim 1, wherein, to obtain a more accurate target scale estimate, a corner regression network consisting of multiple convolutional layers plus a corner pooling layer is used to convert the score maps of the two prediction networks into corresponding tracking boxes and network confidences.
8. The Transformer and CNN-based target tracking method according to claim 1, wherein the similarity of the two predicted targets is expressed by the image structural similarity (SSIM) index, calculated as follows:
$$\mathrm{SSIM}(x,y)=\left[l(x,y)\right]^{\alpha}\left[c(x,y)\right]^{\beta}\left[s(x,y)\right]^{\gamma}$$
wherein $l(x,y)$ denotes the luminance similarity of the targets predicted by the convolution network and the Transformer network, $c(x,y)$ denotes the contrast similarity of the two predicted targets, $s(x,y)$ denotes their structural similarity, $\alpha$, $\beta$ and $\gamma$ are the corresponding similarity weights, and $C$ is a constant.
9. The Transformer and CNN-based target tracking method according to claim 1, wherein the network correction method comprises:
for the convolution tracking network, a temporary template library is reconstructed from the current target, and the classifier weights are optimized with the online update method of the target prediction step;
for the Transformer network, the current correct target position is taken as a positive sample and the Transformer missed-detection result as a negative sample, and the contrastive loss $L_{con}$ is calculated as
$$L_{con}=\max\!\left(m-D,\;0\right)$$
wherein $m$ is a training threshold constant and $D$ is the Mahalanobis distance between the input samples.
CN202210819539.4A 2022-07-13 2022-07-13 Target tracking method based on Transformer and CNN Active CN114897941B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210819539.4A CN114897941B (en) 2022-07-13 2022-07-13 Target tracking method based on Transformer and CNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210819539.4A CN114897941B (en) 2022-07-13 2022-07-13 Target tracking method based on Transformer and CNN

Publications (2)

Publication Number Publication Date
CN114897941A true CN114897941A (en) 2022-08-12
CN114897941B CN114897941B (en) 2022-09-30

Family

ID=82729589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210819539.4A Active CN114897941B (en) 2022-07-13 2022-07-13 Target tracking method based on Transformer and CNN

Country Status (1)

Country Link
CN (1) CN114897941B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190147602A1 (en) * 2017-11-13 2019-05-16 Qualcomm Technologies, Inc. Hybrid and self-aware long-term object tracking
CN110533691A (en) * 2019-08-15 2019-12-03 合肥工业大学 Method for tracking target, equipment and storage medium based on multi-categorizer
CN110660082A (en) * 2019-09-25 2020-01-07 西南交通大学 Target tracking method based on graph convolution and trajectory convolution network learning
CN112561907A (en) * 2020-12-24 2021-03-26 南开大学 Video tampering operation detection method and device based on double-current network
CN113256637A (en) * 2021-07-15 2021-08-13 北京小蝇科技有限责任公司 Urine visible component detection method based on deep learning and context correlation
CN113628249A (en) * 2021-08-16 2021-11-09 电子科技大学 RGBT target tracking method based on cross-modal attention mechanism and twin structure
CN113902773A (en) * 2021-09-24 2022-01-07 南京信息工程大学 Long-term target tracking method using double detectors
CN114529581A (en) * 2022-01-28 2022-05-24 西安电子科技大学 Multi-target tracking method based on deep learning and multi-task joint training

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
QIANGYU LI ET AL.: "Visual Object Tracking: Method and Comparison", 《ICETCI》 *
XIN LI ET AL.: "Dual-regression model for visual tracking", 《NEURAL NETWORKS》 *
YABIN ZHU ET AL.: "RGBT tracking by trident fusion network", 《IEEE》 *
YIHONG ZHANG ET AL.: "Parallel three-branch correlation filters for complex marine environmental object tracking based on a confidence mechanism", 《SENSORS》 *
MA YONG ET AL.: "Research Progress on Autonomous Navigation and Cooperative Control of Waterborne Unmanned System Platforms", 《UNMANNED SYSTEMS TECHNOLOGY》 *

Also Published As

Publication number Publication date
CN114897941B (en) 2022-09-30

Similar Documents

Publication Publication Date Title
CN110335290B (en) Twin candidate region generation network target tracking method based on attention mechanism
CN111354017B (en) Target tracking method based on twin neural network and parallel attention module
CN108090919B (en) Improved kernel correlation filtering tracking method based on super-pixel optical flow and adaptive learning factor
CN106599836B (en) Multi-face tracking method and tracking system
CN108647694B (en) Context-aware and adaptive response-based related filtering target tracking method
CN112733822B (en) End-to-end text detection and identification method
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
CN111260688A (en) Twin double-path target tracking method
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN112364931B (en) Few-sample target detection method and network system based on meta-feature and weight adjustment
CN110942471B (en) Long-term target tracking method based on space-time constraint
CN112232134B (en) Human body posture estimation method based on hourglass network and attention mechanism
CN113221925B (en) Target detection method and device based on multi-scale image
CN110889865A (en) Video target tracking method based on local weighted sparse feature selection
CN112329784A (en) Correlation filtering tracking method based on space-time perception and multimodal response
CN110706256A (en) Detection tracking algorithm optimization method based on multi-core heterogeneous platform
CN114445715A (en) Crop disease identification method based on convolutional neural network
CN116030396A (en) Accurate segmentation method for video structured extraction
CN113393385B (en) Multi-scale fusion-based unsupervised rain removing method, system, device and medium
CN113627481A (en) Multi-model combined unmanned aerial vehicle garbage classification method for smart gardens
CN114897941B (en) Target tracking method based on Transformer and CNN
CN114882076B (en) Lightweight video object segmentation method based on big data memory storage
CN116385281A (en) Remote sensing image denoising method based on real noise model and generated countermeasure network
CN115661860A (en) Method, device and system for dog behavior and action recognition technology and storage medium
CN114202694A (en) Small sample remote sensing scene image classification method based on manifold mixed interpolation and contrast learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant