CN116485839A - Visual tracking method based on an attention-adaptive selection Transformer - Google Patents
Visual tracking method based on an attention-adaptive selection Transformer
- Publication number: CN116485839A (application CN202310358214.5A)
- Authority: CN (China)
- Prior art keywords: attention, network, target, Transformer, head
- Prior art date
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Abstract
The invention relates to the technical field of visual tracking, in particular to a visual tracking method based on an attention-adaptive selection Transformer. It addresses the problem that existing visual tracking techniques, which use a conventional Transformer encoder and decoder, simply concatenate multiple single-head attention maps and therefore describe the correct correlations only weakly, so the accuracy and success-rate performance indices of the visual tracking method are low. The proposed scheme comprises the following steps: S1, constructing a network model; S2, training; S3, inference. Through an adaptive attention-selection mechanism in the multi-head attention calculation module, the conventional multi-head concatenation is replaced by a multi-head linear combination, so that the characterization of the correlation between the target template image and the target search image most beneficial to tracking is selected, correct correlations are enhanced, noise correlations are suppressed, and the accuracy and success-rate performance indices of the visual tracking method are improved.
Description
Technical Field
The invention relates to the technical field of visual tracking, in particular to a visual tracking method based on an attention-adaptive selection Transformer.
Background
Visual tracking continuously infers, in subsequent video frames, the spatial position of a target object whose appearance is specified in the first frame of a video. It is widely applied in video surveillance, robot navigation, product surface-quality inspection, and other fields. Benefiting from the development of deep learning, visual tracking has become much stronger at characterizing target appearance, so current tracking methods mainly realize tracking by template matching: the appearance image of the target object in the first frame serves as the template, the template is slid over each subsequent video-frame image to compute the similarity between every local image region and the template, and the image region with the maximum similarity is finally selected as the target image region.
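As an illustrative sketch only (not part of the patent), the sliding template matching described above can be written as a normalized cross-correlation search over all windows; the function name and the exhaustive window loop are assumptions made for clarity:

```python
import numpy as np

def match_template(frame: np.ndarray, template: np.ndarray):
    """Slide `template` over `frame` (both 2-D grayscale arrays) and return
    the top-left corner of the most similar window plus its similarity."""
    th, tw = template.shape
    fh, fw = frame.shape
    t = template - template.mean()
    best_score, best_pos = -np.inf, (0, 0)
    for y in range(fh - th + 1):
        for x in range(fw - tw + 1):
            win = frame[y:y + th, x:x + tw]
            w = win - win.mean()
            denom = np.sqrt((w * w).sum() * (t * t).sum()) + 1e-12
            score = (w * t).sum() / denom  # normalized cross-correlation
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score
```

The exhaustive loop is quadratic in the frame size; real trackers replace it with learned features and attention, which is exactly the motivation of the method below.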
Depending on the type of core network module used for target matching, current deep-learning-based visual tracking methods can be broadly divided into two categories: methods based on Siamese convolutional neural networks and methods based on Transformers. For feature extraction, both categories mostly adopt classical convolutional neural networks such as ResNet and VGG as the backbone. For computing the correlation between the target template features and the search-image features, Siamese-network-based trackers still estimate the similarity between the template and each local region of the search image by convolution, whereas Transformer-based trackers decompose the video-frame images into local image feature embeddings and compute the similarity between the template and the local image features, in embedded form, through an encoder and a decoder. In the decision part, both categories generally construct a classification head, a target-box coordinate regression head, a target-box intersection-over-union estimation head, and the like, to output the estimated target boxes on the search image together with the confidence of each box, providing the basis for the final target-box decision.
However, existing visual tracking techniques that use a conventional Transformer encoder and decoder simply concatenate multiple single-head attention maps, and their description of the correct correlations is weak, so the accuracy and success-rate performance indices of these visual tracking methods are low. We therefore propose a visual tracking method based on an attention-adaptive selection Transformer to solve these problems.
Disclosure of Invention
The invention aims to solve the problem that, in existing visual tracking techniques, a conventional Transformer encoder and decoder simply concatenate multiple single-head attention maps, so the correct correlations are only weakly described and the accuracy and success-rate performance indices of the visual tracking method are low. To this end it provides a visual tracking method based on an attention-adaptive selection Transformer, improving the accuracy of target tracking and the success-rate performance index of the visual tracking method.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A visual tracking method based on an attention-adaptive selection Transformer, comprising the following steps:
S1: constructing the network model;
S2: training the constructed Transformer network;
S3: inference: inferring the target box through the Transformer network.
preferably, in the step S1, a professional builds a network model, wherein the adaptive attention selection transducer network model is performed during the network model building, and the network structure of the network model includes a main network for extracting the object and the background appearance convolution characteristics, a characteristic fusion network, a classification output head, and a bounding box regression output head, and a residual network res net50 is introduced into the main network to extract the object and the backgroundWherein the network structure of the ResNet50 is modified when the residual network ResNet50 is introduced, the last stage of the ResNet50 network is removed by modification, the first 4 stages are reserved, the convolution operation is set as hole convolution in the 4 th stage, the convolution expansion rate is set as 2, the convolution step length is set as 1, and the convolution result of the 4 th stage is taken as the output characteristic of the main network to be sent into the characteristic fusion network, and the input template image is assumed to beThe input search image is +.>H z Is the height of the template image, W z Is the width of the template image, H x Is the height of the search image, W x Is the width of the search image, and the convolution characteristic of the template image obtained after passing through the residual network ResNet50 is +.>And the convolution characteristic of the search image is +.>C is the channel number of the convolution feature, which is set as 1024, the feature fusion network part comprises a self-attention module, a cross-attention fusion module of the template image feature and the target search image, and a characterization method for providing more accurate correlation of the template image feature and the target search image for the self-attention module and the cross-attention fusion module by proposing a selective multi-head attention model, wherein the attention 
calculation model is designed by calculating an attention force diagram by using an attention function of a transducer, wherein the query vector of the transducer is assumed to be Q, the key vector is K, the value vector is V, and the single-head attention function in the transducer is assumed to be obtainedWhere dk is the dimension of query vector Q and key vector K, the information about the features is obtained by introducing a multi-head attention modelMore correlation characterization between vectors, where setting the number of heads to 8 assumes that the attention of the single head output in the attention model strives to be H i =attention (Q, K, V), where i= {0,1, …,7}, an adaptive Attention map result H after Attention selection is obtained by fusing the multi-head Attention results together in a linear combination when fusing the multi-head Attention output results O The expression of (2) is +.>Wherein W is O Is a weight matrix, w i For the i-th head, the linear combination parameter w i The calculation step of (1) is that firstly, calculation is carried outWherein AP (·) represents the mean pooling layer, FC (·) represents the fully connected layer, s i Weight vector representing output result of each head, and weight result for describing attention diagram of output of each head, and obtaining linear combination parameter w of each head through a softmax layer i The expression of (2) is w i =softmax(s i ) The self-attention module part comprises self-attention calculation of the target template image feature and self-attention calculation of the target search image feature, and for the self-attention calculation of the template image feature, the query vector Q, the key vector K and the value vector V are all target template image feature F z And (2) andwherein->And->Representing the projection matrix of the ith head corresponding to the query vector Q, key vector K and value vector V, respectively, with the purpose of projecting the input Q, K and V 
into a linear subspace for expression, P z Spatial position coding in the form of a sinusoidal function in a transducerIs a constant matrix, finally according to the expression +.>Multi-head linear combination of outputs can be used for multi-head attention try to result +.>Enhanced convolution features with better characterizability for templates are obtained by attempting to enhance the original convolution features in residual form with multiple head attention, wherein the expression +. >In the expression +.>For enhanced convolution characteristics of template images, convolution characteristics F of target search images x Obtaining enhanced convolution feature ++in residual form by multi-head attention seeking>And adopts the expression +.>Wherein P is x Spatial position coding in the form of a sinusoidal function in a transducer, the cross-attention fusion module uses an expressionThe multi-head attention model of (1) calculates the cross attention of the target template image enhanced convolution feature and the target search image enhanced convolution feature, calculates the attention force diagram by setting 2 cross attention calculation branches>and And further introducing 2 layers of full links into both branchesThe cross attention is subjected to nonlinear transformation by a feedforward neural network (FFN) to improve the characterization capability of the feature, wherein the expression +.>And->Wherein->And->Is a weight matrix, < >>And->The method is an offset vector, and comprises the steps of carrying out summation calculation on FFN output of two branches, sending the FFN output into a classification branch and a boundary frame regression branch which are formed by 2 multi-layer perceptrons, and respectively estimating a label value of which each point position is the center of a target frame on a classification chart and the distance from each point to 4 edges of the target frame where the label value is located through the two branches;
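The selective multi-head attention described above can be sketched numerically as follows. This is a hypothetical simplification, not the patented implementation: the token count, feature dimension, shared fully connected scoring layer, and random initialization are all assumptions. What it demonstrates is the fusion of 8 heads by an adaptive linear combination ($s_i = \mathrm{FC}(\mathrm{AP}(H_i))$, $w_i = \mathrm{softmax}(s)_i$, $H_O = W_O \sum_i w_i H_i$) followed by residual enhancement, instead of the conventional head concatenation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention."""
    dk = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(dk)) @ V

def selective_multihead(F, P, Wq, Wk, Wv, w_fc, Wo):
    """Fuse heads by an adaptive linear combination rather than concatenation.

    F : (n, d) token features,  P : (n, d) position encoding,
    Wq/Wk/Wv : (heads, d, d) per-head projections,
    w_fc : (d,) fully connected layer producing one score per head,
    Wo : (d, d) output weight matrix.
    """
    heads = [attention((F + P) @ Wq[i], (F + P) @ Wk[i], F @ Wv[i])
             for i in range(Wq.shape[0])]
    # s_i = FC(AP(H_i)): mean-pool each head's map, then a shared FC score
    s = np.array([head.mean(axis=0) @ w_fc for head in heads])
    w = softmax(s)                          # adaptive head-selection weights
    H_O = sum(wi * head for wi, head in zip(w, heads)) @ Wo
    return F + H_O                          # residual feature enhancement

rng = np.random.default_rng(0)
n, d, num_heads = 5, 16, 8
F, P = rng.standard_normal((n, d)), rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((num_heads, d, d)) * 0.1 for _ in range(3))
out = selective_multihead(F, P, Wq, Wk, Wv, rng.standard_normal(d), np.eye(d))
```

Because the heads are averaged with softmax weights instead of concatenated, the output dimension stays $d$ and heads with noisy correlations receive small weights, which is the stated aim of the adaptive selection mechanism.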
Preferably, in step S2 the constructed Transformer network is trained. During training, ResNet-50 adopts publicly available pre-trained model parameters, and the ResNet-50 network parameters are frozen and not adjusted. The classification loss function $L_{cls}$ of the classification branch is the binary cross-entropy
$$L_{cls} = -\sum_{j} \big[\, y_j \log p_j + (1 - y_j) \log (1 - p_j) \,\big],$$
where $j$ is the index of each point on the classification map output by the classification branch, $y_j$ is the true class label of the $j$-th point (a point falling inside the target box is a positive sample with class label $y_j = 1$; a point outside the target box is a negative sample with class label $y_j = 0$), and $p_j$ is the predicted label value that the $j$-th point is a point inside the target box. The regression loss function $L_{reg}$ of the bounding-box regression branch is calculated from the intersection-over-union (IoU) between the predicted and real target boxes,
$$L_{reg} = \sum_{i} \big(1 - \mathrm{IoU}(B_i, B)\big),$$
where $B_i$ is the target box estimated from the predicted boundary distances of the $i$-th point and $B$ is the real target box. Through this process the total loss function $L$ is obtained as
$$L = L_{cls} + \lambda \cdot L_{reg},$$
where $\lambda$ is a positive balancing parameter. Error back-propagation is performed on the training data set according to the loss function $L$ to obtain the parameter-adjusted Transformer network.
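A minimal sketch of the training objective, under the assumption that the regression term is the common $1 - \mathrm{IoU}$ form and boxes are axis-aligned $(x_1, y_1, x_2, y_2)$ tuples (the patent describes these losses, but its formula images are not recoverable, so the exact forms here are reconstructions):

```python
import numpy as np

def bce_loss(p, y, eps=1e-9):
    """Binary cross-entropy over the flattened classification map."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def total_loss(p, y, pred_boxes, gt_box, lam=1.0):
    """L = L_cls + lambda * L_reg with an IoU-based regression term."""
    l_reg = sum(1.0 - iou(b, gt_box) for b in pred_boxes)
    return bce_loss(p, y) + lam * l_reg
```

The balance parameter `lam` plays the role of $\lambda$ in the total loss; the backbone parameters would simply be excluded from the optimizer to realize the freezing described above.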
Preferably, in step S3 the target box is inferred through the Transformer network. For each subsequently input video frame, a target search image is cropped from the frame at twice the length and width of the target box in the previous frame, and its length and width are scaled to twice those of the target template image. The target search image is fed into the Transformer network; the point with the maximum label value on the output classification map is taken as a point inside the target box, and the rectangular target box is derived from the predicted distances from that point to its four sides.
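The inference step can be sketched with two hypothetical helpers, assuming the previous target box is given as $(c_x, c_y, w, h)$ and the regression branch outputs per-point distances (left, top, right, bottom) on a map aligned with the search image:

```python
import numpy as np

def crop_region(prev_box, frame_w, frame_h):
    """Search region: twice the previous target's width and height,
    centered on the previous target and clipped to the frame."""
    cx, cy, w, h = prev_box
    sw, sh = 2 * w, 2 * h
    x1 = max(0.0, cx - sw / 2)
    y1 = max(0.0, cy - sh / 2)
    x2 = min(float(frame_w), cx + sw / 2)
    y2 = min(float(frame_h), cy + sh / 2)
    return x1, y1, x2, y2

def decode_box(cls_map, dist_map):
    """Take the point with the maximum label value on the classification map
    and derive the box from its predicted distances (l, t, r, b) to the sides."""
    py, px = np.unravel_index(np.argmax(cls_map), cls_map.shape)
    l, t, r, b = dist_map[:, py, px]
    return px - l, py - t, px + r, py + b
```

In a full tracker the decoded box would be mapped back from search-image coordinates to frame coordinates and become `prev_box` for the next frame.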
Compared with the prior art, the invention has the beneficial effects that:
1. The adaptive attention-selection mechanism in the multi-head attention calculation module replaces the conventional multi-head concatenation with a multi-head linear combination, so that the characterization of the correlation between the target template image and the target search image most beneficial to tracking is selected, correct correlations are enhanced, noise correlations are suppressed, and the accuracy and success-rate performance indices of the visual tracking method are improved.
Drawings
FIG. 1 is a flow chart of the visual tracking method based on an attention-adaptive selection Transformer according to the present invention;
FIG. 2 is a structural diagram of the visual tracking method based on an attention-adaptive selection Transformer according to the present invention;
FIG. 3 is a schematic diagram of the multi-head attention calculation module of the visual tracking method based on an attention-adaptive selection Transformer according to the present invention;
FIG. 4 is a comparison chart of the success-rate performance index of the visual tracking method based on an attention-adaptive selection Transformer according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely; obviously, the described embodiments are only some, not all, of the embodiments of the present invention.
Example 1
Referring to FIGS. 1-4, a visual tracking method based on an attention-adaptive selection Transformer comprises the following steps:
S1: constructing the network model; S2: training; S3: inference. These steps are carried out as detailed in the Disclosure section above: the adaptive attention-selection Transformer network is built on a modified ResNet-50 backbone with a selective multi-head attention feature-fusion network, a classification output head, and a bounding-box regression output head; the network is trained with the combined classification and IoU regression loss while the ResNet-50 parameters remain frozen; and at inference time the target search image cropped from each subsequent frame is fed into the network, the point with the maximum label value on the output classification map is taken as a point inside the target box, and the rectangular target box is derived from the distances from that point to its four sides.
Example two
Referring to fig. 1-4, a visual tracking method based on attention-adaptive selection of a transducer, comprising the steps of:
s1: constructing a network model: constructing a network model by professionals, wherein the network model is constructed by self-adaptive attention selection Transformer network model, the network structure of the network model comprises a main network for extracting target and background appearance convolution characteristics, a characteristic fusion network, a classification output head and a bounding box regression output head, a residual network ResNet50 is introduced into a main network part to extract the target and background characteristics, the network structure of the ResNet50 needs to be modified when the residual network ResNet50 is introduced, the last stage of removing the ResNet50 network is modified, the former 4 stages are reserved, the convolution operation is set as cavity convolution in the 4 th stage, the convolution expansion rate is set as 2, the convolution step size is set as 1, and the convolution result in the 4 th stage is taken as the output characteristic of the main network to be fed into the characteristic fusion network, and the input template image is assumed to be The input search image is +.>H z Is the height of the template image, W z Is the width of the template image, H x Is the height of the search image, W x Is the width of the search image, and the convolution characteristic of the template image obtained after passing through the residual network ResNet50 is +.>And the convolution characteristic of the search image is +.>C is the channel number of the convolution feature, which is set as 1024, the feature fusion network part comprises a self-attention module, a cross-attention fusion module of the template image feature and the target search image, and a characterization method for providing more accurate correlation of the template image feature and the target search image for the self-attention module and the cross-attention fusion module by proposing a selective multi-head attention model, wherein the attention 
The attention calculation model is designed by computing an attention map with the attention function of the Transformer. Let the query vector of the Transformer be Q, the key vector be K, and the value vector be V; the single-head attention function in the Transformer is then Attention(Q, K, V) = softmax(QK^T/√d_k)·V, where d_k is the dimension of the query vector Q and the key vector K. A multi-head attention model is introduced to obtain a richer correlation characterization of the feature vectors, with the number of heads set to 8. Denoting the attention map output by a single head in the attention model as H_i = Attention(Q, K, V), where i = {0, 1, …, 7}, the multi-head attention results are fused together by a linear combination when fusing the multi-head attention outputs, and the adaptively selected attention map result is H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O, where W_O is a weight matrix and w_i is the linear combination parameter of the i-th head. The linear combination parameter w_i is computed as follows: first compute s_i = FC(AP(H_i)), where AP(·) denotes a mean-pooling layer and FC(·) denotes a fully connected layer, and s_i is the weight vector of each head's output, describing the weighting of each head's attention map; the linear combination parameter of each head is then obtained through a softmax layer as w_i = softmax(s_i). The self-attention module comprises the self-attention calculation of the target template image features and the self-attention calculation of the target search image features. For the self-attention calculation of the template image features, the query vector Q, the key vector K, and the value vector V are all the target template image feature F_z, and H_i = Attention((F_z + P_z)W_i^Q, (F_z + P_z)W_i^K, F_z W_i^V), where W_i^Q, W_i^K, and W_i^V denote the projection matrices of the i-th head corresponding to the query vector Q, the key vector K, and the value vector V, respectively; their purpose is to project the inputs Q, K, and V into linear subspaces for expression. P_z is the spatial position encoding in sinusoidal form in the Transformer, which is a constant matrix. Finally, the linear combination of the multi-head outputs according to the expression H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O yields the multi-head attention map result, and an enhanced convolution feature with better characterization capability for the template is obtained by using the multi-head attention map to enhance the original convolution feature in residual form, i.e. F̃_z = F_z + H_O, where F̃_z is the enhanced convolution feature of the template image. Likewise, the convolution feature F_x of the target search image is enhanced in residual form through its multi-head attention map to obtain the enhanced convolution feature F̃_x = F_x + H_O, where P_x, the spatial position encoding in sinusoidal form in the Transformer, plays the role of P_z. The cross-attention fusion module uses the multi-head attention model of the expression H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O to calculate the cross-attention between the enhanced convolution features of the target template image and of the target search image, with 2 cross-attention calculation branches set up to compute the attention maps. A 2-layer fully connected feedforward neural network (FFN) is further introduced in both branches to apply a nonlinear transformation to the cross-attention maps so as to improve the characterization capability of the features, i.e. FFN(x) = W_2·max(0, W_1·x + b_1) + b_2, where W_1 and W_2 are weight matrices and b_1 and b_2 are offset vectors. The FFN outputs of the two branches are summed and fed into a classification branch and a bounding-box regression branch formed by 2 multi-layer perceptrons; through these two branches, the label value that each point position on the classification map is the center of the target box and the distances from each point to the 4 edges of the target box containing it are estimated respectively;
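The adaptive head-selection scheme above (s_i = FC(AP(H_i)), w_i = softmax(s_i), H_O = (Σ_i w_i·H_i)·W_O) can be sketched in plain Python. This is a minimal illustrative sketch, not the patent's implementation: the toy 4-token feature map, the per-head query scaling standing in for the learned projections W_i^Q, and taking W_O as the identity are all assumptions made for brevity.

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def adaptive_fuse(heads, fc_weight):
    """H_O = sum_i w_i * H_i with w = softmax(s), s_i = FC(AP(H_i)); W_O omitted."""
    # AP: mean-pool each head's attention output over the token axis
    pooled = [[sum(col) / len(h) for col in zip(*h)] for h in heads]
    # FC: project each pooled vector to a scalar head score s_i
    s = [sum(pj * wj for pj, wj in zip(p, fc_weight)) for p in pooled]
    w = softmax(s)
    rows, cols = len(heads[0]), len(heads[0][0])
    return [[sum(w[i] * heads[i][r][c] for i in range(len(heads)))
             for c in range(cols)] for r in range(rows)]

# 8 heads over a toy 4-token, 2-dimensional feature map
feats = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
heads = []
for i in range(8):
    # scaling the queries stands in for the learned per-head projection W_i^Q
    Qi = [[v * (1 + 0.1 * i) for v in row] for row in feats]
    heads.append(attention(Qi, feats, feats))
H_O = adaptive_fuse(heads, fc_weight=[0.5, -0.3])
```

Because the fusion weights come from a softmax, H_O is always a convex combination of the head outputs, so no single head can be silently amplified beyond the others.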
S2: Training: the constructed Transformer network is trained by a professional. During training, ResNet50 adopts publicly available pre-trained model parameters, and the ResNet50 network parameters are frozen and not adjusted. The classification loss function L_cls associated with the classification branch is L_cls = -Σ_j [y_j·log p_j + (1 - y_j)·log(1 - p_j)], where j denotes the index of each point on the classification map output by the classification branch, and y_j denotes the true class label of the j-th point: a point falling inside the target box is a positive sample, corresponding to class label y_j = 1, and a point outside the target box is a negative sample, corresponding to class label y_j = 0; p_j denotes the predicted label value that the j-th point is a point inside the target box. The regression loss function L_reg associated with the bounding-box regression branch is computed by binary cross-entropy using the intersection-over-union (IoU) between the predicted target box and the real target box, where B_i is the target box estimated from the predicted boundary distances of the i-th point and B is the real target box. The total loss function L obtained by this process is L = L_cls + λ·L_reg, where λ is a positive balance parameter, and error back-propagation is performed on the training dataset according to the loss function L to obtain the parameter-adjusted Transformer network.
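A minimal sketch of the total loss L = L_cls + λ·L_reg described above. The binary cross-entropy term follows the formula given; the IoU-based regression term is written here in the common 1 - IoU form, which is an assumption about the exact variant the patent uses, and the helper names and box format are illustrative.

```python
import math

def bce_loss(labels, preds, eps=1e-7):
    """L_cls = -sum_j [y_j*log p_j + (1-y_j)*log(1-p_j)] over classification-map points."""
    return -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
                for y, p in zip(labels, preds))

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def total_loss(labels, preds, pred_boxes, gt_box, lam=1.0):
    """L = L_cls + lambda * L_reg; L_reg averaged over positive-sample boxes (assumed 1 - IoU)."""
    l_cls = bce_loss(labels, preds)
    pos = [b for y, b in zip(labels, pred_boxes) if y == 1]
    l_reg = sum(1.0 - iou(b, gt_box) for b in pos) / max(len(pos), 1)
    return l_cls + lam * l_reg
```

The ResNet50 parameters being frozen means only the fusion network and the two output heads receive gradients from this loss during back-propagation.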
Example III
Referring to figs. 1-4, a visual tracking method based on an attention-adaptive selection Transformer comprises the following steps:
S1: Constructing a network model: a network model is constructed by professionals. The network model constructed is an adaptive attention-selection Transformer network model, whose network structure comprises a backbone network for extracting the appearance convolution features of the target and the background, a feature fusion network, a classification output head, and a bounding-box regression output head. The target and background features are extracted by introducing a residual network ResNet50 in the backbone network part; when ResNet50 is introduced, its network structure needs to be modified: the last stage of the ResNet50 network is removed, the first 4 stages are retained, the convolution operation of the 4th stage is set to dilated convolution with a dilation rate of 2 and a convolution stride of 1, and the convolution result of the 4th stage is fed into the feature fusion network as the output feature of the backbone network. Assume the input template image is z and the input search image is x, where H_z is the height of the template image, W_z is the width of the template image, H_x is the height of the search image, and W_x is the width of the search image; the convolution feature of the template image obtained after passing through the residual network ResNet50 is F_z, and the convolution feature of the search image is F_x, where C is the channel number of the convolution features, set to 1024. The feature fusion network part contains a self-attention module and a cross-attention fusion module for the template image features and the target search image; by proposing a selective multi-head attention model, a more accurate characterization of the correlation between the template image features and the target search image is provided for the self-attention module and the cross-attention fusion module. The attention calculation model is designed by computing an attention map with the attention function of the Transformer. Let the query vector of the Transformer be Q, the key vector be K, and the value vector be V; the single-head attention function in the Transformer is then Attention(Q, K, V) = softmax(QK^T/√d_k)·V, where d_k is the dimension of the query vector Q and the key vector K. A multi-head attention model is introduced to obtain a richer correlation characterization of the feature vectors, with the number of heads set to 8. Denoting the attention map output by a single head in the attention model as H_i = Attention(Q, K, V), where i = {0, 1, …, 7}, the multi-head attention results are fused together by a linear combination when fusing the multi-head attention outputs, and the adaptively selected attention map result is H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O, where W_O is a weight matrix and w_i is the linear combination parameter of the i-th head. The linear combination parameter w_i is computed as follows: first compute s_i = FC(AP(H_i)), where AP(·) denotes a mean-pooling layer and FC(·) denotes a fully connected layer, and s_i is the weight vector of each head's output, describing the weighting of each head's attention map; the linear combination parameter of each head is then obtained through a softmax layer as w_i = softmax(s_i). The self-attention module comprises the self-attention calculation of the target template image features and the self-attention calculation of the target search image features. For the self-attention calculation of the template image features, the query vector Q, the key vector K, and the value vector V are all the target template image feature F_z, and H_i = Attention((F_z + P_z)W_i^Q, (F_z + P_z)W_i^K, F_z W_i^V), where W_i^Q, W_i^K, and W_i^V denote the projection matrices of the i-th head corresponding to the query vector Q, the key vector K, and the value vector V, respectively; their purpose is to project the inputs Q, K, and V into linear subspaces for expression. P_z is the spatial position encoding in sinusoidal form in the Transformer, which is a constant matrix. Finally, the linear combination of the multi-head outputs according to the expression H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O yields the multi-head attention map result, and an enhanced convolution feature with better characterization capability for the template is obtained by using the multi-head attention map to enhance the original convolution feature in residual form, i.e. F̃_z = F_z + H_O, where F̃_z is the enhanced convolution feature of the template image. Likewise, the convolution feature F_x of the target search image is enhanced in residual form through its multi-head attention map to obtain the enhanced convolution feature F̃_x = F_x + H_O, where P_x, the spatial position encoding in sinusoidal form in the Transformer, plays the role of P_z. The cross-attention fusion module uses the multi-head attention model of the expression H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O to calculate the cross-attention between the enhanced convolution features of the target template image and of the target search image, with 2 cross-attention calculation branches set up to compute the attention maps. A 2-layer fully connected feedforward neural network (FFN) is further introduced in both branches to apply a nonlinear transformation to the cross-attention maps so as to improve the characterization capability of the features, i.e. FFN(x) = W_2·max(0, W_1·x + b_1) + b_2, where W_1 and W_2 are weight matrices and b_1 and b_2 are offset vectors. The FFN outputs of the two branches are summed and fed into a classification branch and a bounding-box regression branch formed by 2 multi-layer perceptrons; through these two branches, the label value that each point position on the classification map is the center of the target box and the distances from each point to the 4 edges of the target box containing it are estimated respectively;
S2: Reasoning: the target box is inferred by a professional through the Transformer. When inferring the target box, for each subsequently input video frame, the target search image is cropped from the input video image according to 2 times the length and width of the target image in the previous frame, its length and width are scaled to 2 times the length and width of the target template image, and it is fed into the Transformer network; the point with the maximum label value on the output classification map is taken as the point inside the target box, and the rectangular target box is deduced from the distances from this point to the four edges of the box.
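The decoding step described above (pick the classification-map point with the largest label value, then recover the rectangle from that point's predicted distances to the four box edges) can be sketched as follows; the nested-list layout, the (left, top, right, bottom) ordering of the distances, and the stride parameter are illustrative assumptions, not the patent's data format.

```python
def decode_box(cls_map, dist_map, stride=1):
    """Return (x1, y1, x2, y2) decoded at the argmax point of the classification map.

    cls_map: rows of label values; dist_map[r][c] = (left, top, right, bottom)
    distances from grid point (c, r) to the four edges of its target box.
    """
    best_r, best_c, best = 0, 0, float("-inf")
    for r, row in enumerate(cls_map):
        for c, score in enumerate(row):
            if score > best:
                best_r, best_c, best = r, c, score
    left, top, right, bottom = dist_map[best_r][best_c]
    x, y = best_c * stride, best_r * stride
    return (x - left, y - top, x + right, y + bottom)

# toy 2x2 classification map: the argmax is at row 1, column 0
cls_map = [[0.1, 0.2], [0.9, 0.3]]
dist_map = [[(1, 1, 1, 1)] * 2, [(2, 1, 3, 2)] * 2]
box = decode_box(cls_map, dist_map)
```

Cropping the next frame's search region at 2x the previous target size keeps the target near the center of the search image as long as inter-frame motion is moderate.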
Example IV
Referring to figs. 1-4, a visual tracking method based on an attention-adaptive selection Transformer comprises the following steps:
S1: Constructing a network model: a network model is constructed by professionals. The network model constructed is an adaptive attention-selection Transformer network model, whose network structure comprises a backbone network for extracting the appearance convolution features of the target and the background, a feature fusion network, a classification output head, and a bounding-box regression output head. The target and background features are extracted by introducing a residual network ResNet50 in the backbone network part; when ResNet50 is introduced, its network structure needs to be modified: the last stage of the ResNet50 network is removed, the first 4 stages are retained, the convolution operation of the 4th stage is set to dilated convolution with a dilation rate of 2 and a convolution stride of 1, and the convolution result of the 4th stage is fed into the feature fusion network as the output feature of the backbone network. Assume the input template image is z and the input search image is x, where H_z is the height of the template image, W_z is the width of the template image, H_x is the height of the search image, and W_x is the width of the search image; the convolution feature of the template image obtained after passing through the residual network ResNet50 is F_z, and the convolution feature of the search image is F_x, where C is the channel number of the convolution features, set to 1024. The feature fusion network part contains a self-attention module and a cross-attention fusion module for the template image features and the target search image; by proposing a selective multi-head attention model, a more accurate characterization of the correlation between the template image features and the target search image is provided for the self-attention module and the cross-attention fusion module;
S2: Training: the constructed Transformer network is trained by a professional. During training, ResNet50 adopts publicly available pre-trained model parameters, and the ResNet50 network parameters are frozen and not adjusted. The classification loss function L_cls associated with the classification branch is L_cls = -Σ_j [y_j·log p_j + (1 - y_j)·log(1 - p_j)], where j denotes the index of each point on the classification map output by the classification branch, and y_j denotes the true class label of the j-th point: a point falling inside the target box is a positive sample, corresponding to class label y_j = 1, and a point outside the target box is a negative sample, corresponding to class label y_j = 0; p_j denotes the predicted label value that the j-th point is a point inside the target box. The regression loss function L_reg associated with the bounding-box regression branch is computed by binary cross-entropy using the intersection-over-union (IoU) between the predicted target box and the real target box, where B_i is the target box estimated from the predicted boundary distances of the i-th point and B is the real target box. The total loss function L obtained by this process is L = L_cls + λ·L_reg, where λ is a positive balance parameter, and error back-propagation is performed on the training dataset according to the loss function L to obtain the parameter-adjusted Transformer network;
S3: Reasoning: the target box is inferred by a professional through the Transformer. When inferring the target box, for each subsequently input video frame, the target search image is cropped from the input video image according to 2 times the length and width of the target image in the previous frame, its length and width are scaled to 2 times the length and width of the target template image, and it is fed into the Transformer network; the point with the maximum label value on the output classification map is taken as the point inside the target box, and the rectangular target box is deduced from the distances from this point to the four edges of the box.
Example V
Referring to figs. 1-4, a visual tracking method based on an attention-adaptive selection Transformer comprises the following steps:
S1: Constructing a network model: a network model is constructed by professionals. The network model constructed is an adaptive attention-selection Transformer network model, whose network structure comprises a backbone network for extracting the appearance convolution features of the target and the background, a feature fusion network, a classification output head, and a bounding-box regression output head. The target and background features are extracted by introducing a residual network ResNet50 in the backbone network part; when ResNet50 is introduced, its network structure needs to be modified: the last stage of the ResNet50 network is removed, the first 4 stages are retained, the convolution operation of the 4th stage is set to dilated convolution with a dilation rate of 2 and a convolution stride of 1, and the convolution result of the 4th stage is fed into the feature fusion network as the output feature of the backbone network. Assume the input template image is z and the input search image is x, where H_z is the height of the template image, W_z is the width of the template image, H_x is the height of the search image, and W_x is the width of the search image; the convolution feature of the template image obtained after passing through the residual network ResNet50 is F_z, and the convolution feature of the search image is F_x, where C is the channel number of the convolution features, set to 1024. The feature fusion network part contains a self-attention module and a cross-attention fusion module for the template image features and the target search image; by proposing a selective multi-head attention model, a more accurate characterization of the correlation between the template image features and the target search image is provided for the self-attention module and the cross-attention fusion module. The attention calculation model is designed by computing an attention map with the attention function of the Transformer. Let the query vector of the Transformer be Q, the key vector be K, and the value vector be V; the single-head attention function in the Transformer is then Attention(Q, K, V) = softmax(QK^T/√d_k)·V, where d_k is the dimension of the query vector Q and the key vector K. A multi-head attention model is introduced to obtain a richer correlation characterization of the feature vectors, with the number of heads set to 8. Denoting the attention map output by a single head in the attention model as H_i = Attention(Q, K, V), where i = {0, 1, …, 7}, the multi-head attention results are fused together by a linear combination when fusing the multi-head attention outputs, and the adaptively selected attention map result is H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O, where W_O is a weight matrix and w_i is the linear combination parameter of the i-th head. The linear combination parameter w_i is computed as follows: first compute s_i = FC(AP(H_i)), where AP(·) denotes a mean-pooling layer and FC(·) denotes a fully connected layer, and s_i is the weight vector of each head's output, describing the weighting of each head's attention map; the linear combination parameter of each head is then obtained through a softmax layer as w_i = softmax(s_i). The self-attention module comprises the self-attention calculation of the target template image features and the self-attention calculation of the target search image features. For the self-attention calculation of the template image features, the query vector Q, the key vector K, and the value vector V are all the target template image feature F_z, and H_i = Attention((F_z + P_z)W_i^Q, (F_z + P_z)W_i^K, F_z W_i^V), where W_i^Q, W_i^K, and W_i^V denote the projection matrices of the i-th head corresponding to the query vector Q, the key vector K, and the value vector V, respectively; their purpose is to project the inputs Q, K, and V into linear subspaces for expression. P_z is the spatial position encoding in sinusoidal form in the Transformer, which is a constant matrix. Finally, the linear combination of the multi-head outputs according to the expression H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O yields the multi-head attention map result, and an enhanced convolution feature with better characterization capability for the template is obtained by using the multi-head attention map to enhance the original convolution feature in residual form, i.e. F̃_z = F_z + H_O, where F̃_z is the enhanced convolution feature of the template image. Likewise, the convolution feature F_x of the target search image is enhanced in residual form through its multi-head attention map to obtain the enhanced convolution feature F̃_x = F_x + H_O, where P_x, the spatial position encoding in sinusoidal form in the Transformer, plays the role of P_z. The cross-attention fusion module uses the multi-head attention model of the expression H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O to calculate the cross-attention between the enhanced convolution features of the target template image and of the target search image, with 2 cross-attention calculation branches set up to compute the attention maps. A 2-layer fully connected feedforward neural network (FFN) is further introduced in both branches to apply a nonlinear transformation to the cross-attention maps so as to improve the characterization capability of the features, i.e. FFN(x) = W_2·max(0, W_1·x + b_1) + b_2, where W_1 and W_2 are weight matrices and b_1 and b_2 are offset vectors. The FFN outputs of the two branches are summed and fed into a classification branch and a bounding-box regression branch formed by 2 multi-layer perceptrons; through these two branches, the label value that each point position on the classification map is the center of the target box and the distances from each point to the 4 edges of the target box containing it are estimated respectively;
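The 2-layer FFN applied in each cross-attention branch, of the form FFN(x) = W_2·max(0, W_1·x + b_1) + b_2, and the summation of the two branch outputs before the output heads can be sketched as follows; the toy dimensions, the concrete weight values, and the branch names are assumptions for illustration only.

```python
def ffn(x, W1, b1, W2, b2):
    """2-layer fully connected FFN with ReLU: W2 @ relu(W1 @ x + b1) + b2."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

# toy weights: 2 -> 3 hidden units -> 2 outputs
W1, b1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, -1.0]
W2, b2 = [[1.0, 1.0, 1.0], [0.5, -0.5, 0.0]], [0.0, 0.0]

# the two cross-attention branches share the FFN structure; their outputs
# are summed before the classification / regression heads
branch_zx = ffn([0.5, 0.25], W1, b1, W2, b2)
branch_xz = ffn([0.1, 0.2], W1, b1, W2, b2)
fused = [a + b for a, b in zip(branch_zx, branch_xz)]
```

The ReLU between the two linear layers is what makes the transformation nonlinear; without it the FFN would collapse to a single linear map and add no characterization capability.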
S2: Training: the constructed Transformer network is trained by a professional. During training, ResNet50 adopts publicly available pre-trained model parameters, and the ResNet50 network parameters are frozen and not adjusted. The classification loss function L_cls associated with the classification branch is L_cls = -Σ_j [y_j·log p_j + (1 - y_j)·log(1 - p_j)]; the regression loss function L_reg associated with the bounding-box regression branch is computed by binary cross-entropy using the intersection-over-union (IoU) between the predicted target box and the real target box. The total loss function L obtained by this process is L = L_cls + λ·L_reg, and error back-propagation is performed on the training dataset according to the loss function L to obtain the parameter-adjusted Transformer network.
The visual tracking methods based on the attention-adaptive selection Transformer in Embodiments I to V were tested, with the following results:
Compared with existing visual tracking methods, the visual tracking method based on the attention-adaptive selection Transformer of Embodiments I to V shows significantly improved accuracy and success-rate performance indices, with Embodiment I being the best embodiment.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art according to the technical scheme of the present invention and its inventive concept, within the technical scope disclosed by the present invention, shall be covered by the protection scope of the present invention.
Claims (8)
1. A visual tracking method based on an attention-adaptive selection Transformer, comprising the following steps:
s1: constructing a network model: constructing a network model by professionals;
s2: training: training the constructed Transformer network by a professional;
s3: reasoning: the target box is inferred by a professional through the Transformer.
2. The visual tracking method based on an attention-adaptive selection Transformer according to claim 1, wherein in step S1 a network model is constructed by a professional; the network model constructed is an adaptive attention-selection Transformer network model, whose network structure comprises a backbone network for extracting the convolution features of the target and the background, a feature fusion network, a classification output head, and a bounding-box regression output head; the target and background features are extracted by introducing a residual network ResNet50 in the backbone network part, wherein the network structure of ResNet50 needs to be modified when it is introduced: the last stage of the ResNet50 network is removed, the first 4 stages are retained, the convolution operation of the 4th stage is set to dilated convolution with a dilation rate of 2 and a convolution stride of 1, and the convolution result of the 4th stage is fed into the feature fusion network as the output feature of the backbone network; assuming the input template image is z and the input search image is x, where H_z is the height of the template image, W_z is the width of the template image, H_x is the height of the search image, and W_x is the width of the search image, the convolution feature of the template image obtained after passing through the residual network ResNet50 is F_z and the convolution feature of the search image is F_x, where C is the channel number of the convolution features, set to 1024.
3. The visual tracking method based on an attention-adaptive selection Transformer according to claim 2, wherein the feature fusion network part contains a self-attention module and a cross-attention fusion module for the template image features and the target search image; by proposing a selective multi-head attention model, a more accurate characterization of the correlation between the template image features and the target search image is provided for the self-attention module and the cross-attention fusion module, wherein the attention calculation model is designed by computing an attention map with the attention function of the Transformer: letting the query vector of the Transformer be Q, the key vector be K, and the value vector be V, the single-head attention function in the Transformer is Attention(Q, K, V) = softmax(QK^T/√d_k)·V, where d_k is the dimension of the query vector Q and the key vector K.
4. The visual tracking method based on an attention-adaptive selection Transformer according to claim 3, wherein a richer correlation characterization between feature vectors is obtained by introducing a multi-head attention model, with the number of heads set to 8; denoting the attention map of the single-head output in the attention model as H_i = Attention(Q, K, V), where i = {0, 1, …, 7}, the multi-head attention results are fused together by a linear combination when fusing the multi-head attention output results, obtaining the adaptively selected attention map result H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O, where W_O is a weight matrix and w_i is the linear combination parameter of the i-th head.
5. The visual tracking method based on an attention-adaptive selection Transformer according to claim 4, wherein the calculation of the linear combination parameter w_i proceeds as follows: first compute s_i = FC(AP(H_i)), where AP(·) denotes a mean-pooling layer and FC(·) denotes a fully connected layer, and s_i is the weight vector of each head's output, describing the weighting of each head's attention map; the linear combination parameter of each head is then obtained through a softmax layer as w_i = softmax(s_i); the self-attention module part comprises the self-attention calculation of the target template image features and the self-attention calculation of the target search image features; for the self-attention calculation of the template image features, the query vector Q, the key vector K, and the value vector V are all the target template image feature F_z, and H_i = Attention((F_z + P_z)W_i^Q, (F_z + P_z)W_i^K, F_z W_i^V), where W_i^Q, W_i^K, and W_i^V represent the projection matrices of the i-th head corresponding to the query vector Q, the key vector K, and the value vector V, respectively, and P_z is the spatial position encoding in sinusoidal form in the Transformer; finally, the linear combination of the multi-head outputs according to the expression H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O yields the multi-head attention map result, and an enhanced convolution feature with better template characterization is obtained by using the multi-head attention map to enhance the original convolution feature in residual form, i.e. F̃_z = F_z + H_O, where F̃_z is the enhanced convolution feature of the template image.
6. The visual tracking method based on an attention-adaptive selection Transformer according to claim 5, wherein for the convolution feature F_x of the target search image, the enhanced convolution feature F̃_x = F_x + H_O is obtained in residual form through the multi-head attention map, where P_x is the spatial position encoding in sinusoidal form in the Transformer; the cross-attention fusion module uses the multi-head attention model of the expression H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O to calculate the cross-attention between the enhanced convolution features of the target template image and of the target search image, with 2 cross-attention calculation branches set up to compute the attention maps; a 2-layer fully connected feedforward neural network (FFN) is further introduced in both branches to apply a nonlinear transformation to the cross-attention maps, i.e. FFN(x) = W_2·max(0, W_1·x + b_1) + b_2, where W_1 and W_2 are weight matrices and b_1 and b_2 are offset vectors; the FFN outputs of the two branches are summed and fed into a classification branch and a bounding-box regression branch formed by 2 multi-layer perceptrons, and through these two branches, the label value that each point position on the classification map is the center of the target box and the distances from each point to the 4 edges of the target box containing it are estimated respectively.
7. The visual tracking method based on an attention-adaptive selection Transformer according to claim 1, wherein in S2 a professional trains the constructed Transformer network; during training, ResNet50 adopts publicly available pre-trained model parameters, and the ResNet50 network parameters are frozen and not adjusted; the classification loss function L_cls associated with the classification branch is L_cls = -Σ_j [y_j·log p_j + (1 - y_j)·log(1 - p_j)], where j denotes the index of each point on the classification map output by the classification branch and y_j denotes the true class label of the j-th point: a point falling inside the target box is a positive sample, corresponding to class label y_j = 1, and a point outside the target box is a negative sample, corresponding to class label y_j = 0; p_j denotes the predicted label value that the j-th point is a point inside the target box; the regression loss function L_reg associated with the bounding-box regression branch is computed by binary cross-entropy using the intersection-over-union (IoU) between the predicted target box and the real target box, where B_i is the target box estimated from the predicted boundary distances of the i-th point and B is the real target box; the total loss function L obtained by this process is L = L_cls + λ·L_reg, where λ is a positive balance parameter, and error back-propagation is performed on the training dataset according to the loss function L to obtain the parameter-adjusted Transformer network.
8. The visual tracking method based on an attention-adaptive selection Transformer according to claim 1, wherein in step S3 a professional infers the target box through the Transformer; when inferring the target box, for each subsequently input video frame, the target search image is cropped from the input video image according to 2 times the length and width of the target image in the previous frame, its length and width are scaled to 2 times the length and width of the target template image, and it is fed into the Transformer network; the point with the maximum label value on the output classification map is taken as the point inside the target box, and the rectangular target box is inferred from the distances from this point to the four edges of the box.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310358214.5A CN116485839A (en) | 2023-04-06 | 2023-04-06 | Visual tracking method based on attention self-adaptive selection of Transformer
Publications (1)
Publication Number | Publication Date |
---|---|
CN116485839A true CN116485839A (en) | 2023-07-25 |
Family
ID=87226090
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310358214.5A Pending CN116485839A (en) | Visual tracking method based on attention-adaptive selection of a Transformer | 2023-04-06 | 2023-04-06 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116485839A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117292338A (en) * | 2023-11-27 | 2023-12-26 | 山东远东保险公估有限公司 | Vehicle accident identification and analysis method based on video stream analysis |
CN117292338B (en) * | 2023-11-27 | 2024-02-13 | 山东远东保险公估有限公司 | Vehicle accident identification and analysis method based on video stream analysis |
CN117649582A (en) * | 2024-01-25 | 2024-03-05 | 南昌工程学院 | Single-flow single-stage network target tracking method and system based on cascade attention |
CN117649582B (en) * | 2024-01-25 | 2024-04-19 | 南昌工程学院 | Single-flow single-stage network target tracking method and system based on cascade attention |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116485839A (en) | Visual tracking method based on attention-adaptive selection of a Transformer | |
CN109443382A (en) | Vision SLAM closed loop detection method based on feature extraction Yu dimensionality reduction neural network | |
WO2024021394A1 (en) | Person re-identification method and apparatus for fusing global features with ladder-shaped local features | |
CN111626764A (en) | Commodity sales volume prediction method and device based on Transformer + LSTM neural network model | |
CN112699682A (en) | Named entity identification method and device based on combinable weak authenticator | |
CN113297972B (en) | Transformer substation equipment defect intelligent analysis method based on data fusion deep learning | |
CN115223082A (en) | Aerial video classification method based on space-time multi-scale transform | |
CN114913403B (en) | Visual question-answering method based on metric learning | |
CN113313232A (en) | Functional brain network classification method based on pre-training and graph neural network | |
CN114241191A (en) | Cross-modal self-attention-based non-candidate-box expression understanding method | |
CN115423847B (en) | Twin multi-modal target tracking method based on Transformer | |
CN113609326B (en) | Image description generation method based on relationship between external knowledge and target | |
CN115761240B (en) | Image semantic segmentation method and device for chaotic back propagation graph neural network | |
CN111652021B (en) | BP neural network-based face recognition method and system | |
CN117036760A (en) | Multi-view clustering model implementation method based on graph comparison learning | |
CN117058882A (en) | Traffic data compensation method based on multi-feature double-discriminant | |
CN116844004A (en) | Point cloud automatic semantic modeling method for digital twin scene | |
CN115861713A (en) | Carotid plaque ultrasonic image processing method based on multitask learning | |
CN113450313B (en) | Image significance visualization method based on regional contrast learning | |
CN113283393B (en) | Deepfake video detection method based on image group and two-stream network | |
CN115170888A (en) | Electronic component zero sample identification model and method based on visual information and semantic attributes | |
CN115018134A (en) | Pedestrian trajectory prediction method based on three-scale spatiotemporal information | |
CN114187569A (en) | Real-time target detection method integrating Pearson coefficient matrix and attention | |
CN112598115A (en) | Deep neural network hierarchical analysis method based on non-local neighbor relation learning | |
CN117251583B (en) | Text enhanced knowledge graph representation learning method and system based on local graph structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||