CN114897941A

CN114897941A - Target tracking method based on Transformer and CNN

Info

Publication number: CN114897941A
Application number: CN202210819539.4A
Authority: CN
Inventors: 余知音; 向北海; 邹融; 陈远益
Original assignee: Changsha Chaochuang Electronic Technology Co ltd
Current assignee: Changsha Chaochuang Electronic Technology Co ltd
Priority date: 2022-07-13
Filing date: 2022-07-13
Publication date: 2022-08-12
Anticipated expiration: 2042-07-13
Also published as: CN114897941B

Abstract

The invention discloses a target tracking method based on Transformer and CNN, which comprises the following steps: cutting the target and the search area according to the initial target position; generating an online template library by using an image augmentation means according to a known target; extracting target characteristics through a CNN network; analyzing the target characteristics to obtain a corresponding frame score map; similarity calculation is carried out on the two targets, and if the similarity is higher than a certain threshold value, simple processing is carried out to output the similarity; if the similarity is lower than the threshold value, the cause of missed detection is judged, if only a single network has missed detection, the template library is updated by using the current correct target, and the corresponding network is corrected; if both networks miss the target at the same time. The invention ensures the robustness of tracking deformation and fuzzy targets; ensuring the self-adaptive force to the target scale change during tracking; the stability of long-term tracking of the target is ensured; and the tracking robustness of the shielded and deformed target is ensured.

Description

Target tracking method based on Transformer and CNN

Technical Field

The invention belongs to the technical field of single target tracking in computer vision, and particularly relates to a target tracking method based on a Transformer and a CNN.

Background

As one of important tasks of computer vision, single-target tracking has wide application prospects in real life, such as pedestrian detection and tracking under complex backgrounds, human-computer interaction, unmanned driving technology and the like. The existing single-target tracking algorithm obtains higher tracking precision and better short-term tracking effect under the condition that the target is in a single background and has less deformation and shielding.

However, the existing mainstream tracking algorithms all have considerable defects, for example, the conventional tracking algorithm based on the relevant filtering has the problem of low updating precision of the scale of the tracked target, and when the size of the target changes, the problems of missing detection and error detection easily occur. For a twin network tracking algorithm capable of stably tracking when a target deforms, the problems of shading and insufficient robustness of fuzzy target tracking caused by lack of an effective template updating mechanism are solved. The tracking algorithm based on the transform structure has strong advantages in the aspect of dealing with the problem of target occlusion, but the transform structure which only performs linear segmentation and mapping on target features has poor capability of extracting local features of a target and has the problem of low recognition force on target deformation.

Disclosure of Invention

In view of the above, the invention adopts a mode of integrating a deep learning tracking algorithm with complementary performance to solve the long-term single-target tracking problem under the situation that the target is shielded and deformed.

In order to solve the problems, the invention provides a dual-network deep learning tracking method based on ensemble learning, which realizes the stable tracking of morphologically variable targets under a complex background. The method has the translation invariance of the CNN tracking network and the strong extraction capability of local information, and also effectively fuses the strong anti-shielding capability of the Transformer tracking network, thereby obtaining stable tracking effect on target deformation and shielding. Meanwhile, in order to realize long-time tracking, the invention provides an online learning method, so that online optimization of network weight is realized.

In addition, the invention adopts a self-adaptive strategy of the search range according to the different sizes of the tracking targets so as to improve the accuracy of the tracking result,

specifically, the invention discloses a target tracking method based on a Transformer and a CNN, which comprises the following steps:

initial target loading: cutting the target and the search area according to the initial target position;

and (3) generating an online target template library: generating an online template library by using an image augmentation means according to a known target;

feature extraction: extracting target characteristics through a CNN network;

target prediction: simultaneously, two deep learning networks with different structures are used for obtaining corresponding frame score maps through analyzing the target characteristics, and the score maps are converted into the relative positions of the target in the map frame through an angular point regression network;

and (3) judging the similarity: in order to reduce calculation parameters and improve the real-time performance of a tracking algorithm, similarity calculation is carried out on two targets obtained in the target prediction step, and if the similarity is higher than a certain threshold value

If so, the two algorithms are considered to be stably tracked and directly output after simple processing;

and (4) missing detection correction: if the similarity is lower than the threshold

If the single network has the missed detection, updating the template base by using the current correct target, and correcting the corresponding network;

target recovery: if the two networks are missed to be detected simultaneously and the target is lost, the tracking is stopped, the searching range is expanded, and the target is tried to be found back.

Further, the cutting method comprises the following steps:

wherein

，

The length and the width of the final cutting target template are shown,

，

the length and width of the initial target are shown,

the adaptive amplification factor for the search range varies according to the size of the target.

Further, the image augmentation means includes image rotation, image size conversion, and motion blur, the image size conversion including: and respectively carrying out zooming processing on the template image through a Gaussian pyramid and bilinear interpolation to obtain different scale characteristics of different current targets, wherein the motion blur carries out blur processing on the image by using mean filtering.

Further, the gaussian pyramid formula is as follows:

wherein

Is a kernel of a gaussian convolution with the result that,

is the original template image and is a template image,

the template image after the quarter reduction is obtained;

the bilinear interpolation algorithm formula is as follows:

。

further, a residual error network ResNet based on CNN is used as a backbone network to realize feature extraction of a target template library and a search area.

Further, the target prediction uses a convolution tracking network and a Transformer tracking network;

the convolution tracking network consists of a convolution layer and a linear full-connection layer, and updates the convolution classifier in a mode of learning the characteristics of the target template library on line; in order to accelerate the convergence rate of the classification model, the weight of the model is optimized by adopting a Gaussian-Newton iteration method in the updating process, and the updated classifier is used for positioning the target of the current frame in the search area to obtain a corresponding score map;

the Transformer tracking network consists of an Attention module and a linear full-connection layer, and in order to further strengthen the local information perception capability of the Transformer network, a convolution layer is used for flattening picture features extracted by ResNet and mapping the picture features into features required by Attention calculation

A component;

after the current search area and the target template characteristics are subjected to attention calculation, obtaining corresponding scores of the search area characteristics F through a linear full link layer (MLP);

the Transformer network score calculation formula is represented by the following formula:

wherein

The components are derived from mapping the ResNet extracted features,

is a data dimension;

to ensure long-term tracking capability, cross-entropy loss function is used

And a ternary loss function

A weighted average is performed as shown in the following equation:

wherein

And

respectively, represent the weight of the corresponding loss function,

is a training threshold constant that is constant for one,

and

representing the mahalanobis distance between the current result and the positive and negative samples respectively,

is a constant.

Furthermore, in order to obtain more accurate target scale estimation, a corner regression network is used, and the score maps of the two prediction networks are converted into corresponding tracking frames and network confidence degrees through a structure of a multilayer convolution plus a corner pooling layer.

Further, the similarity of the two predicted targets is represented by an image structure similarity SSIM, and an SSIM index calculation formula is shown as the following formula:

wherein

Representing the luminance similarity of the predicted object by the convolutional network and the Transformer network,

respectively representing the contrast of the target predicted by the convolutional network and the Transformer network,

representing the structural similarity of the two predicted objects,

，

and

respectively, represent the corresponding similarity classification weights,

is a constant.

Further, the network correction method includes:

for the convolution tracking network, a temporary template base is reconstructed by using a current target, and the classifier weight is optimized by using an online updating method in the target prediction step;

for the Transformer network, the current correct target position is used as a positive sample, the Transformer missing detection result is used as a negative sample, and the contrast loss is calculated

Wherein

In order to train the threshold constant for the training,

the mahalanobis distance between the input samples.

Compared with the prior art, the invention has the following beneficial effects:

two deep learning tracking networks are integrated to synchronously track the target, and the accuracy of long-term tracking of the target is improved by a method of integrating a complementary network.

A template base updating strategy based on confidence coefficient and similarity is provided, and the tracking robustness of deformation and fuzzy targets is ensured;

an angular point regression network is provided to ensure the self-adaptive capacity to the target scale change during tracking;

the target missing detection and correction strategy based on online learning and complementary network integration is provided, and the stability of tracking the target for a long time is ensured;

a convolution-fused Transformer tracking network is provided, and robustness of tracking of occluded and deformed targets is guaranteed.

Drawings

FIG. 1 is a flowchart of the process of the present invention;

FIG. 2 is a flow diagram of the convolution tracking network of the present invention;

FIG. 3 is a flow chart of the Transformer tracking network of the present invention.

Detailed Description

The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.

The invention provides a dual-network deep learning tracking method based on ensemble learning, which realizes stable tracking of morphologically variable targets under a complex background. The method has the translation invariance of the CNN tracking network and the strong extraction capability of local information, and also effectively fuses the strong anti-shielding capability of the Transformer tracking network, thereby obtaining stable tracking effect on target deformation and shielding. Meanwhile, in order to realize long-time tracking, the invention provides an online learning method, so that online optimization of network weight is realized.

In addition, the present invention also adopts a search range adaptive strategy according to the size of the tracking target to improve the accuracy of the tracking result, referring to the flow chart of the present invention program of fig. 1, the steps of the present invention include:

s1 initial target load: cutting the target and the search area according to the initial target position;

to reduce the amount of computation, the present invention will crop the target. The clipping size is determined by the formula (1), wherein

，

The length and the width of the final cutting target template are shown,

，

the length and width of the initial target are shown,

the adaptive amplification factor for the search range will vary according to the size of the target;

（1）

s2 online target template library generation: generating an online template library by using an image augmentation means according to a known target;

by amplifying the target in the step 1), the image amplifying method mainly comprises the following steps:

a. rotating the image;

b. image size transformation: and respectively carrying out scaling processing on the template image through a Gaussian pyramid and bilinear interpolation to obtain different scale characteristics of different current targets.

The formula (2) is a general formula of a Gaussian pyramid, wherein

Is a kernel of a gaussian convolution with the result that,

is the original template image and is a template image,

to reduce the template image by one quarter.

（2）

Formula (3) is a general formula of the bilinear interpolation algorithm adopted by the invention, and the general formula is as follows:

（3）

c. motion blur: the image is blurred using mean filtering.

d. And (3) feature enhancement: the invention selectively enhances the weak information according to the brightness information and the structure information quantity of the current target.

S3 feature extraction: extracting target characteristics through a CNN convolutional network;

the method uses a residual error network ResNet based on CNN as a backbone network to realize the feature extraction of a target template library and a search area; ResNet is the prior art in the field, and the present invention is not described in detail.

S4 target prediction: and simultaneously, obtaining corresponding frame score maps by analyzing the target features by using two deep learning networks with different structures, and converting the score maps into the relative position of the target in the map frame by using an angular point regression network.

The two tracking networks used in the invention are respectively

a. Convolution tracking network:

the convolution tracking network mainly comprises a convolution layer and a linear full-connection layer, and updates a convolution classifier in a mode of learning the characteristics of a target template library on line. In order to accelerate the convergence rate of the classification model, the invention optimizes the model weight by adopting a Gauss-Newton iteration method. And the updated classifier is used for positioning the current frame target in the search area to obtain a corresponding score map. In order to reduce the computation time of the convolution network, the invention is characterized in that the depth separable convolution network is adopted to carry out convolution computation. Fig. 2 is a flow chart of the convolutional network of the present invention.

b. Transformer tracking network:

the invention relates to a Transformer tracking network which mainly comprises an Attention (Attention) module and a linear full-connection layer, and simultaneously, in order to further enhance the local information perception capability of the Transformer network, the invention particularly uses a convolution layer to replace a mode of linear mapping and position coding in a general Transformer structure, and flattens the picture characteristics extracted by ResNet and maps the picture characteristics into the image characteristics required by Attention calculation

And (4) components. And after the current search area and the target template feature are subjected to attention calculation, obtaining a corresponding score of a search area feature F through a linear full connecting layer (MLP).

The transform network score calculation formula is represented by formula (4), wherein

The components are derived from mapping the ResNet extracted features,

is the data dimension.

（4）

FIG. 3 is a flow chart of the Transformer network of the present invention.

In the aspect of loss function, in order to ensure the long-term tracking capability of the invention, a cross entropy loss function is specially selected

And ternary loss functions commonly used in re-identification problems

Weighted average of the following equations (5) and (6) was carried out. Wherein

And

respectively, represent the weight of the corresponding loss function,

is a training threshold constant that is constant for one,

and

is a constant.

（5）

（6）

In order to obtain more accurate target scale estimation, the invention uses a corner regression network which is a structure of multilayer convolution and a corner pooling layer and converts the score maps of two prediction networks into corresponding tracking frames and network confidence coefficients.

And S5 similarity judgment: in order to reduce calculation parameters and improve the real-time performance of the tracking algorithm, the similarity of the two targets obtained in the step 4) is calculated, and if the similarity is higher than a certain threshold value

If the two algorithms are stable, the two algorithms can be processed simply and output directly;

in order to reduce calculation parameters and improve the real-time performance of the tracking algorithm, the similarity calculation is firstly carried out on the two predicted targets obtained in the step 4), if the similarity is higher than a certain threshold value, the two algorithms are considered to be stably tracked, and the two algorithms can be directly output after simple processing. Meanwhile, in order to ensure the long-term tracking capability of the invention. When outputting the tracking result, the invention refers to the confidence of two tracking networks at the same time, if the similarity of the two networks is higher than the threshold value

But with confidence levels below their corresponding threshold

Then it is considered asIf the current target is greatly deformed, the current target needs to be selected and added into an online target template library, and if the size of the online template library is larger than a set threshold value

Then the oldest added template is deleted according to the template addition time.

In the invention, the Similarity of two prediction targets is represented by an image Structure Similarity (SSIM) index, and an SSIM index calculation formula is shown as a formula (7).

（7）

Wherein

representing the structural similarity of the two predicted objects,

，

and

respectively, represent the corresponding similarity classification weights,

is a constant.

S6 missing detection correction: if the similarity is lower than the threshold

If the single network has the missed detection, the template library is updated by using the current correct target, and the corresponding network is corrected.

If SSIM is below a threshold

If the condition of missing inspection appears, the cause of the missing inspection needs to be judged. The invention adopts the missing detection judgment index as the confidence coefficient of the corresponding network to the current predicted tracking target, if only one network confidence coefficient is lower than the corresponding threshold value

If the network is judged to have missed detection, the target tracking can be continuously tracked by taking the output of the other network as correct output and correcting the missed detection network.

The network correction method adopted by the invention respectively comprises the following steps:

a. for the convolution tracking network, the invention uses the current target to reconstruct a temporary template base, and uses the online updating method as in the step 4) to optimize the classifier weight, thereby realizing the correction of the undetected network.

b. For the Transformer network, the invention also selects an online learning mode to realize the correction of the false detection network. The process comprises the steps of using the current correct target position as a positive sample, using a Transformer missing detection result as a negative sample, and calculating a contrast loss

(formula 8), wherein

In order to train the threshold constant for the training,

mahalanobis distance between input samples). Meanwhile, Gauss is adoptedAnd Newton iteration method to realize network weight correction.

（8）

S7 target retrieval: if the two networks are missed to be detected simultaneously and the target is lost, the tracking is stopped, the searching range is expanded, and the target is tried to be found back.

If the confidence of both networks is lower than the corresponding threshold

If the two networks are missed, the target is lost. At this time, the invention will stop tracking, expand the target search range, try to retrieve the target by using kuen-man clais bipartite matching algorithm, and the weight used in the matching algorithm is also represented by the contrast loss of equation (8).

Comparative experiment:

the hardware environment for the experiment of the invention is i7-8700 CPU and Yingwei GTX 2080Ti GPU. The software environments are Python3.6 and CUDA 11.0. The VOT 2019 public data set is adopted in the experimental data set, and the algorithm is compared with a current leading edge single-target tracking algorithm. The results of the experiments are shown in the following table:

The word "preferred" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "preferred" is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word "preferred" is intended to present concepts in a concrete fashion. The term "or" as used in this application is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X employs A or B" is intended to include either of the permutations as a matter of course. That is, if X employs A; b is used as X; or X employs both A and B, then "X employs A or B" is satisfied in any of the foregoing examples.

Also, although the disclosure has been shown and described with respect to one or an implementation, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The present disclosure includes all such modifications and alterations, and is limited only by the scope of the appended claims. In particular regard to the various functions performed by the above described components (e.g., elements, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary implementations of the disclosure. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such feature may be combined with one or other features of the other implementations as may be desired and advantageous for a given or particular application. Furthermore, to the extent that the terms "includes," has, "" contains, "or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term" comprising.

Each functional unit in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or a plurality of or more than one unit are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium. The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Each apparatus or system described above may execute the storage method in the corresponding method embodiment.

In summary, the above-mentioned embodiment is an implementation manner of the present invention, but the implementation manner of the present invention is not limited by the above-mentioned embodiment, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be regarded as equivalent replacements within the protection scope of the present invention.

Claims

1. The target tracking method based on the Transformer and the CNN is characterized by comprising the following steps of:

feature extraction: extracting target characteristics through a CNN network;

Then, consider both algorithms to track stablyDirectly outputting after simple processing;

2. The Transformer and CNN-based target tracking method according to claim 1, wherein the clipping method is as follows:

wherein

，

The length and the width of the final cutting target template are shown,

，

the length and width of the initial target are shown,

3. The Transformer and CNN based target tracking method of claim 1, wherein the image augmentation means comprises image rotation, image size transformation and motion blur, the image size transformation comprises: and respectively carrying out zooming processing on the template image through a Gaussian pyramid and bilinear interpolation to obtain different scale characteristics of different current targets, wherein the motion blur carries out blur processing on the image by using mean filtering.

4. The Transformer and CNN based target tracking method of claim 3, wherein the gaussian pyramid formula is as follows:

wherein

Is a kernel of a gaussian convolution with the result that,

is the original template image and is a template image,

to reduce the template image by one quarter.

5. The method for tracking targets based on Transformer and CNN as claimed in claim 1, wherein residual ResNet based on CNN is used as backbone network to extract features of target template library and search area.

6. The Transformer and CNN based target tracking method of claim 1, wherein the target prediction uses a convolution tracking network and a Transformer tracking network;

A component;

after the current search area and the target template characteristics are subjected to attention calculation, obtaining corresponding scores of the search area characteristics F through a linear full-link layer MLP;

wherein

The components are derived from mapping the ResNet extracted features,

is a data dimension;

to ensure long-term tracking capability, cross-entropy loss function is used

And a ternary loss function

A weighted average is performed as shown in the following equation:

wherein

And

respectively, represent the weight of the corresponding loss function,

is a training threshold constant that is constant for one,

and

is a constant.

7. The Transformer and CNN-based target tracking method according to claim 1, wherein to obtain a more accurate target scale estimate, a corner regression network is used, which is a structure of multiple layers of convolution plus corner pooling layers, and the score maps of two prediction networks are converted into corresponding tracking boxes and network confidences.

8. The Transformer and CNN-based target tracking method according to claim 1, wherein the similarity between two predicted targets is represented by image structure similarity SSIM, and the SSIM index calculation formula is shown as follows:

wherein

representing the structural similarity of the two predicted objects,

，

and

respectively, represent the corresponding similarity classification weights,

is a constant.

9. The Transformer and CNN-based target tracking method according to claim 1, wherein the network correction method comprises:

for Transformer networksCalculating the contrast loss by using the current correct target position as a positive sample and the Transformer missing detection result as a negative sample

Wherein

In order to train the threshold constant for the training,

the mahalanobis distance between the input samples.