CN116485839A - Visual tracking method based on an attention-adaptive selection Transformer - Google Patents
Visual tracking method based on an attention-adaptive selection Transformer
- Publication number: CN116485839A (application CN202310358214.5A)
- Authority: CN (China)
- Prior art keywords: attention, network, target, Transformer, head
- Prior art date
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Abstract
The invention relates to the technical field of visual tracking, in particular to a visual tracking method based on an attention-adaptive selection Transformer. It addresses the problem that existing visual tracking techniques, which use a conventional Transformer encoder and decoder, simply concatenate multiple single-head attention maps and therefore describe the correct correlations only weakly, so the accuracy and success-rate performance indices of the visual tracking method are low. The proposed scheme comprises the following steps: S1, constructing a network model; S2, training; S3, inference. Through an adaptive attention-selection mechanism in the multi-head attention calculation module, the conventional multi-head concatenation is replaced by a multi-head linear combination, so that the characterization of the correlation between the target template image and the target search image most beneficial to tracking is selected, correct correlations are enhanced, noise correlations are suppressed, and the accuracy and success-rate performance indices of the visual tracking method are improved.
Description
Technical Field
The invention relates to the technical field of visual tracking, in particular to a visual tracking method based on an attention-adaptive selection Transformer.
Background
Visual tracking continuously infers, in subsequent video frames, the spatial position of a target object whose appearance is specified in the first frame of a video. It is widely applied in video surveillance, robot navigation, product surface-quality inspection, and other fields. Benefiting from the development of deep learning, visual tracking has become much stronger at characterizing target appearance, so current tracking methods mainly realize tracking by template matching: the appearance image of the target object in the first frame serves as the template, the template is slid over each subsequent video-frame image to compute the similarity between every local image region and the template, and the image region with the maximum similarity is finally selected as the target image region.
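As an illustrative sketch only (not part of the patent), the sliding template matching described above can be written as a normalized cross-correlation search over all windows; the function name and the exhaustive window loop are assumptions made for clarity:

```python
import numpy as np

def match_template(frame: np.ndarray, template: np.ndarray):
    """Slide `template` over `frame` (both 2-D grayscale arrays) and return
    the top-left corner of the most similar window plus its similarity."""
    th, tw = template.shape
    fh, fw = frame.shape
    t = template - template.mean()
    best_score, best_pos = -np.inf, (0, 0)
    for y in range(fh - th + 1):
        for x in range(fw - tw + 1):
            win = frame[y:y + th, x:x + tw]
            w = win - win.mean()
            denom = np.sqrt((w * w).sum() * (t * t).sum()) + 1e-12
            score = (w * t).sum() / denom  # normalized cross-correlation
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score
```

The exhaustive loop is quadratic in the frame size; real trackers replace it with learned features and attention, which is exactly the motivation of the method below.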
Depending on the type of core network module used for target matching, current deep-learning-based visual tracking methods can be broadly divided into two categories: methods based on Siamese convolutional neural networks and methods based on Transformers. For feature extraction, both categories mostly adopt classical convolutional neural networks such as ResNet and VGG as the backbone. For computing the correlation between the target template features and the search-image features, Siamese-network-based trackers still estimate the similarity between the template and each local region of the search image by convolution, whereas Transformer-based trackers decompose the video-frame images into local image feature embeddings and compute the similarity between the template and the local image features, in embedded form, through an encoder and a decoder. In the decision part, both categories generally construct a classification head, a target-box coordinate regression head, a target-box intersection-over-union estimation head, and the like, to output the estimated target boxes on the search image together with the confidence of each box, providing the basis for the final target-box decision.
However, existing visual tracking techniques that use a conventional Transformer encoder and decoder simply concatenate multiple single-head attention maps, and their description of the correct correlations is weak, so the accuracy and success-rate performance indices of these visual tracking methods are low. We therefore propose a visual tracking method based on an attention-adaptive selection Transformer to solve these problems.
Disclosure of Invention
The invention aims to solve the problem that, in existing visual tracking techniques, a conventional Transformer encoder and decoder simply concatenate multiple single-head attention maps, so the correct correlations are only weakly described and the accuracy and success-rate performance indices of the visual tracking method are low. To this end it provides a visual tracking method based on an attention-adaptive selection Transformer, improving the accuracy of target tracking and the success-rate performance index of the visual tracking method.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A visual tracking method based on an attention-adaptive selection Transformer, comprising the following steps:
S1: constructing the network model;
S2: training the constructed Transformer network;
S3: inference: inferring the target box through the Transformer network.
preferably, in the step S1, a professional builds a network model, wherein the adaptive attention selection transducer network model is performed during the network model building, and the network structure of the network model includes a main network for extracting the object and the background appearance convolution characteristics, a characteristic fusion network, a classification output head, and a bounding box regression output head, and a residual network res net50 is introduced into the main network to extract the object and the backgroundWherein the network structure of the ResNet50 is modified when the residual network ResNet50 is introduced, the last stage of the ResNet50 network is removed by modification, the first 4 stages are reserved, the convolution operation is set as hole convolution in the 4 th stage, the convolution expansion rate is set as 2, the convolution step length is set as 1, and the convolution result of the 4 th stage is taken as the output characteristic of the main network to be sent into the characteristic fusion network, and the input template image is assumed to beThe input search image is +.>H z Is the height of the template image, W z Is the width of the template image, H x Is the height of the search image, W x Is the width of the search image, and the convolution characteristic of the template image obtained after passing through the residual network ResNet50 is +.>And the convolution characteristic of the search image is +.>C is the channel number of the convolution feature, which is set as 1024, the feature fusion network part comprises a self-attention module, a cross-attention fusion module of the template image feature and the target search image, and a characterization method for providing more accurate correlation of the template image feature and the target search image for the self-attention module and the cross-attention fusion module by proposing a selective multi-head attention model, wherein the attention 
calculation model is designed by calculating an attention force diagram by using an attention function of a transducer, wherein the query vector of the transducer is assumed to be Q, the key vector is K, the value vector is V, and the single-head attention function in the transducer is assumed to be obtainedWhere dk is the dimension of query vector Q and key vector K, the information about the features is obtained by introducing a multi-head attention modelMore correlation characterization between vectors, where setting the number of heads to 8 assumes that the attention of the single head output in the attention model strives to be H i =attention (Q, K, V), where i= {0,1, …,7}, an adaptive Attention map result H after Attention selection is obtained by fusing the multi-head Attention results together in a linear combination when fusing the multi-head Attention output results O The expression of (2) is +.>Wherein W is O Is a weight matrix, w i For the i-th head, the linear combination parameter w i The calculation step of (1) is that firstly, calculation is carried outWherein AP (·) represents the mean pooling layer, FC (·) represents the fully connected layer, s i Weight vector representing output result of each head, and weight result for describing attention diagram of output of each head, and obtaining linear combination parameter w of each head through a softmax layer i The expression of (2) is w i =softmax(s i ) The self-attention module part comprises self-attention calculation of the target template image feature and self-attention calculation of the target search image feature, and for the self-attention calculation of the template image feature, the query vector Q, the key vector K and the value vector V are all target template image feature F z And (2) andwherein->And->Representing the projection matrix of the ith head corresponding to the query vector Q, key vector K and value vector V, respectively, with the purpose of projecting the input Q, K and V 
into a linear subspace for expression, P z Spatial position coding in the form of a sinusoidal function in a transducerIs a constant matrix, finally according to the expression +.>Multi-head linear combination of outputs can be used for multi-head attention try to result +.>Enhanced convolution features with better characterizability for templates are obtained by attempting to enhance the original convolution features in residual form with multiple head attention, wherein the expression +. >In the expression +.>For enhanced convolution characteristics of template images, convolution characteristics F of target search images x Obtaining enhanced convolution feature ++in residual form by multi-head attention seeking>And adopts the expression +.>Wherein P is x Spatial position coding in the form of a sinusoidal function in a transducer, the cross-attention fusion module uses an expressionThe multi-head attention model of (1) calculates the cross attention of the target template image enhanced convolution feature and the target search image enhanced convolution feature, calculates the attention force diagram by setting 2 cross attention calculation branches>and And further introducing 2 layers of full links into both branchesThe cross attention is subjected to nonlinear transformation by a feedforward neural network (FFN) to improve the characterization capability of the feature, wherein the expression +.>And->Wherein->And->Is a weight matrix, < >>And->The method is an offset vector, and comprises the steps of carrying out summation calculation on FFN output of two branches, sending the FFN output into a classification branch and a boundary frame regression branch which are formed by 2 multi-layer perceptrons, and respectively estimating a label value of which each point position is the center of a target frame on a classification chart and the distance from each point to 4 edges of the target frame where the label value is located through the two branches;
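The selective multi-head attention described above can be sketched numerically as follows. This is a hypothetical simplification, not the patented implementation: the token count, feature dimension, shared fully connected scoring layer, and random initialization are all assumptions. What it demonstrates is the fusion of 8 heads by an adaptive linear combination ($s_i = \mathrm{FC}(\mathrm{AP}(H_i))$, $w_i = \mathrm{softmax}(s)_i$, $H_O = W_O \sum_i w_i H_i$) followed by residual enhancement, instead of the conventional head concatenation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention."""
    dk = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(dk)) @ V

def selective_multihead(F, P, Wq, Wk, Wv, w_fc, Wo):
    """Fuse heads by an adaptive linear combination rather than concatenation.

    F : (n, d) token features,  P : (n, d) position encoding,
    Wq/Wk/Wv : (heads, d, d) per-head projections,
    w_fc : (d,) fully connected layer producing one score per head,
    Wo : (d, d) output weight matrix.
    """
    heads = [attention((F + P) @ Wq[i], (F + P) @ Wk[i], F @ Wv[i])
             for i in range(Wq.shape[0])]
    # s_i = FC(AP(H_i)): mean-pool each head's map, then a shared FC score
    s = np.array([head.mean(axis=0) @ w_fc for head in heads])
    w = softmax(s)                          # adaptive head-selection weights
    H_O = sum(wi * head for wi, head in zip(w, heads)) @ Wo
    return F + H_O                          # residual feature enhancement

rng = np.random.default_rng(0)
n, d, num_heads = 5, 16, 8
F, P = rng.standard_normal((n, d)), rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((num_heads, d, d)) * 0.1 for _ in range(3))
out = selective_multihead(F, P, Wq, Wk, Wv, rng.standard_normal(d), np.eye(d))
```

Because the heads are averaged with softmax weights instead of concatenated, the output dimension stays $d$ and heads with noisy correlations receive small weights, which is the stated aim of the adaptive selection mechanism.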
Preferably, in step S2 the constructed Transformer network is trained. During training, ResNet-50 adopts publicly available pre-trained model parameters, and the ResNet-50 network parameters are frozen and not adjusted. The classification loss function $L_{cls}$ of the classification branch is the binary cross-entropy
$$L_{cls} = -\sum_{j} \big[\, y_j \log p_j + (1 - y_j) \log (1 - p_j) \,\big],$$
where $j$ is the index of each point on the classification map output by the classification branch, $y_j$ is the true class label of the $j$-th point (a point falling inside the target box is a positive sample with class label $y_j = 1$; a point outside the target box is a negative sample with class label $y_j = 0$), and $p_j$ is the predicted label value that the $j$-th point is a point inside the target box. The regression loss function $L_{reg}$ of the bounding-box regression branch is calculated from the intersection-over-union (IoU) between the predicted and real target boxes,
$$L_{reg} = \sum_{i} \big(1 - \mathrm{IoU}(B_i, B)\big),$$
where $B_i$ is the target box estimated from the predicted boundary distances of the $i$-th point and $B$ is the real target box. Through this process the total loss function $L$ is obtained as
$$L = L_{cls} + \lambda \cdot L_{reg},$$
where $\lambda$ is a positive balancing parameter. Error back-propagation is performed on the training data set according to the loss function $L$ to obtain the parameter-adjusted Transformer network.
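A minimal sketch of the training objective, under the assumption that the regression term is the common $1 - \mathrm{IoU}$ form and boxes are axis-aligned $(x_1, y_1, x_2, y_2)$ tuples (the patent describes these losses, but its formula images are not recoverable, so the exact forms here are reconstructions):

```python
import numpy as np

def bce_loss(p, y, eps=1e-9):
    """Binary cross-entropy over the flattened classification map."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def total_loss(p, y, pred_boxes, gt_box, lam=1.0):
    """L = L_cls + lambda * L_reg with an IoU-based regression term."""
    l_reg = sum(1.0 - iou(b, gt_box) for b in pred_boxes)
    return bce_loss(p, y) + lam * l_reg
```

The balance parameter `lam` plays the role of $\lambda$ in the total loss; the backbone parameters would simply be excluded from the optimizer to realize the freezing described above.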
Preferably, in step S3 the target box is inferred through the Transformer network. For each subsequently input video frame, a target search image is cropped from the frame at twice the length and width of the target box in the previous frame, and its length and width are scaled to twice those of the target template image. The target search image is fed into the Transformer network; the point with the maximum label value on the output classification map is taken as a point inside the target box, and the rectangular target box is derived from the predicted distances from that point to its four sides.
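The inference step can be sketched with two hypothetical helpers, assuming the previous target box is given as $(c_x, c_y, w, h)$ and the regression branch outputs per-point distances (left, top, right, bottom) on a map aligned with the search image:

```python
import numpy as np

def crop_region(prev_box, frame_w, frame_h):
    """Search region: twice the previous target's width and height,
    centered on the previous target and clipped to the frame."""
    cx, cy, w, h = prev_box
    sw, sh = 2 * w, 2 * h
    x1 = max(0.0, cx - sw / 2)
    y1 = max(0.0, cy - sh / 2)
    x2 = min(float(frame_w), cx + sw / 2)
    y2 = min(float(frame_h), cy + sh / 2)
    return x1, y1, x2, y2

def decode_box(cls_map, dist_map):
    """Take the point with the maximum label value on the classification map
    and derive the box from its predicted distances (l, t, r, b) to the sides."""
    py, px = np.unravel_index(np.argmax(cls_map), cls_map.shape)
    l, t, r, b = dist_map[:, py, px]
    return px - l, py - t, px + r, py + b
```

In a full tracker the decoded box would be mapped back from search-image coordinates to frame coordinates and become `prev_box` for the next frame.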
Compared with the prior art, the invention has the beneficial effects that:
1. The adaptive attention-selection mechanism in the multi-head attention calculation module replaces the conventional multi-head concatenation with a multi-head linear combination, so that the characterization of the correlation between the target template image and the target search image most beneficial to tracking is selected, correct correlations are enhanced, noise correlations are suppressed, and the accuracy and success-rate performance indices of the visual tracking method are improved.
Drawings
FIG. 1 is a flow chart of the visual tracking method based on an attention-adaptive selection Transformer according to the present invention;
FIG. 2 is a structural diagram of the visual tracking method based on an attention-adaptive selection Transformer according to the present invention;
FIG. 3 is a schematic diagram of the multi-head attention calculation module of the visual tracking method based on an attention-adaptive selection Transformer according to the present invention;
FIG. 4 is a comparison chart of the success-rate performance index of the visual tracking method based on an attention-adaptive selection Transformer according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely; obviously, the described embodiments are only some, not all, of the embodiments of the present invention.
Example 1
Referring to FIGS. 1-4, a visual tracking method based on an attention-adaptive selection Transformer comprises the following steps:
S1: constructing the network model; S2: training; S3: inference. These steps are carried out as detailed in the Disclosure section above: the adaptive attention-selection Transformer network is built on a modified ResNet-50 backbone with a selective multi-head attention feature-fusion network, a classification output head, and a bounding-box regression output head; the network is trained with the combined classification and IoU regression loss while the ResNet-50 parameters remain frozen; and at inference time the target search image cropped from each subsequent frame is fed into the network, the point with the maximum label value on the output classification map is taken as a point inside the target box, and the rectangular target box is derived from the distances from that point to its four sides.
Example two
Referring to fig. 1-4, a visual tracking method based on attention-adaptive selection of a transducer, comprising the steps of:
s1: constructing a network model: constructing a network model by professionals, wherein the network model is constructed by self-adaptive attention selection Transformer network model, the network structure of the network model comprises a main network for extracting target and background appearance convolution characteristics, a characteristic fusion network, a classification output head and a bounding box regression output head, a residual network ResNet50 is introduced into a main network part to extract the target and background characteristics, the network structure of the ResNet50 needs to be modified when the residual network ResNet50 is introduced, the last stage of removing the ResNet50 network is modified, the former 4 stages are reserved, the convolution operation is set as cavity convolution in the 4 th stage, the convolution expansion rate is set as 2, the convolution step size is set as 1, and the convolution result in the 4 th stage is taken as the output characteristic of the main network to be fed into the characteristic fusion network, and the input template image is assumed to be The input search image is +.>H z Is the height of the template image, W z Is the width of the template image, H x Is the height of the search image, W x Is the width of the search image, and the convolution characteristic of the template image obtained after passing through the residual network ResNet50 is +.>And the convolution characteristic of the search image is +.>C is the channel number of the convolution feature, which is set as 1024, the feature fusion network part comprises a self-attention module, a cross-attention fusion module of the template image feature and the target search image, and a characterization method for providing more accurate correlation of the template image feature and the target search image for the self-attention module and the cross-attention fusion module by proposing a selective multi-head attention model, wherein the attention 
The attention calculation model is designed by computing an attention map with the attention function of the Transformer. Let the query vector of the Transformer be Q, the key vector be K, and the value vector be V; the single-head attention function in the Transformer is then Attention(Q, K, V) = softmax(QK^T/√d_k)·V, where d_k is the dimension of the query vector Q and the key vector K. A multi-head attention model is introduced to obtain a richer correlation characterization of the feature vectors, with the number of heads set to 8. Denoting the attention map output by a single head in the attention model as H_i = Attention(Q, K, V), where i = {0, 1, …, 7}, the multi-head attention results are fused together by a linear combination when fusing the multi-head attention outputs, and the adaptively selected attention map result is H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O, where W_O is a weight matrix and w_i is the linear combination parameter of the i-th head. The linear combination parameter w_i is computed as follows: first compute s_i = FC(AP(H_i)), where AP(·) denotes a mean-pooling layer and FC(·) denotes a fully connected layer, and s_i is the weight vector of each head's output, describing the weighting of each head's attention map; the linear combination parameter of each head is then obtained through a softmax layer as w_i = softmax(s_i). The self-attention module comprises the self-attention calculation of the target template image features and the self-attention calculation of the target search image features. For the self-attention calculation of the template image features, the query vector Q, the key vector K, and the value vector V are all the target template image feature F_z, and H_i = Attention((F_z + P_z)W_i^Q, (F_z + P_z)W_i^K, F_z W_i^V), where W_i^Q, W_i^K, and W_i^V denote the projection matrices of the i-th head corresponding to the query vector Q, the key vector K, and the value vector V, respectively; their purpose is to project the inputs Q, K, and V into linear subspaces for expression. P_z is the spatial position encoding in sinusoidal form in the Transformer, which is a constant matrix. Finally, the linear combination of the multi-head outputs according to the expression H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O yields the multi-head attention map result, and an enhanced convolution feature with better characterization capability for the template is obtained by using the multi-head attention map to enhance the original convolution feature in residual form, i.e. F̃_z = F_z + H_O, where F̃_z is the enhanced convolution feature of the template image. Likewise, the convolution feature F_x of the target search image is enhanced in residual form through its multi-head attention map to obtain the enhanced convolution feature F̃_x = F_x + H_O, where P_x, the spatial position encoding in sinusoidal form in the Transformer, plays the role of P_z. The cross-attention fusion module uses the multi-head attention model of the expression H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O to calculate the cross-attention between the enhanced convolution features of the target template image and of the target search image, with 2 cross-attention calculation branches set up to compute the attention maps. A 2-layer fully connected feedforward neural network (FFN) is further introduced in both branches to apply a nonlinear transformation to the cross-attention maps so as to improve the characterization capability of the features, i.e. FFN(x) = W_2·max(0, W_1·x + b_1) + b_2, where W_1 and W_2 are weight matrices and b_1 and b_2 are offset vectors. The FFN outputs of the two branches are summed and fed into a classification branch and a bounding-box regression branch formed by 2 multi-layer perceptrons; through these two branches, the label value that each point position on the classification map is the center of the target box and the distances from each point to the 4 edges of the target box containing it are estimated respectively;
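The adaptive head-selection scheme above (s_i = FC(AP(H_i)), w_i = softmax(s_i), H_O = (Σ_i w_i·H_i)·W_O) can be sketched in plain Python. This is a minimal illustrative sketch, not the patent's implementation: the toy 4-token feature map, the per-head query scaling standing in for the learned projections W_i^Q, and taking W_O as the identity are all assumptions made for brevity.

```python
import math
import random

random.seed(0)

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def adaptive_fuse(heads, fc_weight):
    """H_O = sum_i w_i * H_i with w = softmax(s), s_i = FC(AP(H_i)); W_O omitted."""
    # AP: mean-pool each head's attention output over the token axis
    pooled = [[sum(col) / len(h) for col in zip(*h)] for h in heads]
    # FC: project each pooled vector to a scalar head score s_i
    s = [sum(pj * wj for pj, wj in zip(p, fc_weight)) for p in pooled]
    w = softmax(s)
    rows, cols = len(heads[0]), len(heads[0][0])
    return [[sum(w[i] * heads[i][r][c] for i in range(len(heads)))
             for c in range(cols)] for r in range(rows)]

# 8 heads over a toy 4-token, 2-dimensional feature map
feats = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
heads = []
for i in range(8):
    # scaling the queries stands in for the learned per-head projection W_i^Q
    Qi = [[v * (1 + 0.1 * i) for v in row] for row in feats]
    heads.append(attention(Qi, feats, feats))
H_O = adaptive_fuse(heads, fc_weight=[0.5, -0.3])
```

Because the fusion weights come from a softmax, H_O is always a convex combination of the head outputs, so no single head can be silently amplified beyond the others.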
S2: Training: the constructed Transformer network is trained by a professional. During training, ResNet50 adopts publicly available pre-trained model parameters, and the ResNet50 network parameters are frozen and not adjusted. The classification loss function L_cls associated with the classification branch is L_cls = -Σ_j [y_j·log p_j + (1 - y_j)·log(1 - p_j)], where j denotes the index of each point on the classification map output by the classification branch, and y_j denotes the true class label of the j-th point: a point falling inside the target box is a positive sample, corresponding to class label y_j = 1, and a point outside the target box is a negative sample, corresponding to class label y_j = 0; p_j denotes the predicted label value that the j-th point is a point inside the target box. The regression loss function L_reg associated with the bounding-box regression branch is computed by binary cross-entropy using the intersection-over-union (IoU) between the predicted target box and the real target box, where B_i is the target box estimated from the predicted boundary distances of the i-th point and B is the real target box. The total loss function L obtained by this process is L = L_cls + λ·L_reg, where λ is a positive balance parameter, and error back-propagation is performed on the training dataset according to the loss function L to obtain the parameter-adjusted Transformer network.
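A minimal sketch of the total loss L = L_cls + λ·L_reg described above. The binary cross-entropy term follows the formula given; the IoU-based regression term is written here in the common 1 - IoU form, which is an assumption about the exact variant the patent uses, and the helper names and box format are illustrative.

```python
import math

def bce_loss(labels, preds, eps=1e-7):
    """L_cls = -sum_j [y_j*log p_j + (1-y_j)*log(1-p_j)] over classification-map points."""
    return -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
                for y, p in zip(labels, preds))

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

def total_loss(labels, preds, pred_boxes, gt_box, lam=1.0):
    """L = L_cls + lambda * L_reg; L_reg averaged over positive-sample boxes (assumed 1 - IoU)."""
    l_cls = bce_loss(labels, preds)
    pos = [b for y, b in zip(labels, pred_boxes) if y == 1]
    l_reg = sum(1.0 - iou(b, gt_box) for b in pos) / max(len(pos), 1)
    return l_cls + lam * l_reg
```

The ResNet50 parameters being frozen means only the fusion network and the two output heads receive gradients from this loss during back-propagation.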
Example III
Referring to figs. 1-4, a visual tracking method based on an attention-adaptive selection Transformer comprises the following steps:
S1: Constructing a network model: a network model is constructed by professionals. The network model constructed is an adaptive attention-selection Transformer network model, whose network structure comprises a backbone network for extracting the appearance convolution features of the target and the background, a feature fusion network, a classification output head, and a bounding-box regression output head. The target and background features are extracted by introducing a residual network ResNet50 in the backbone network part; when ResNet50 is introduced, its network structure needs to be modified: the last stage of the ResNet50 network is removed, the first 4 stages are retained, the convolution operation of the 4th stage is set to dilated convolution with a dilation rate of 2 and a convolution stride of 1, and the convolution result of the 4th stage is fed into the feature fusion network as the output feature of the backbone network. Assume the input template image is z and the input search image is x, where H_z is the height of the template image, W_z is the width of the template image, H_x is the height of the search image, and W_x is the width of the search image; the convolution feature of the template image obtained after passing through the residual network ResNet50 is F_z, and the convolution feature of the search image is F_x, where C is the channel number of the convolution features, set to 1024. The feature fusion network part contains a self-attention module and a cross-attention fusion module for the template image features and the target search image; by proposing a selective multi-head attention model, a more accurate characterization of the correlation between the template image features and the target search image is provided for the self-attention module and the cross-attention fusion module. The attention calculation model is designed by computing an attention map with the attention function of the Transformer. Let the query vector of the Transformer be Q, the key vector be K, and the value vector be V; the single-head attention function in the Transformer is then Attention(Q, K, V) = softmax(QK^T/√d_k)·V, where d_k is the dimension of the query vector Q and the key vector K. A multi-head attention model is introduced to obtain a richer correlation characterization of the feature vectors, with the number of heads set to 8. Denoting the attention map output by a single head in the attention model as H_i = Attention(Q, K, V), where i = {0, 1, …, 7}, the multi-head attention results are fused together by a linear combination when fusing the multi-head attention outputs, and the adaptively selected attention map result is H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O, where W_O is a weight matrix and w_i is the linear combination parameter of the i-th head. The linear combination parameter w_i is computed as follows: first compute s_i = FC(AP(H_i)), where AP(·) denotes a mean-pooling layer and FC(·) denotes a fully connected layer, and s_i is the weight vector of each head's output, describing the weighting of each head's attention map; the linear combination parameter of each head is then obtained through a softmax layer as w_i = softmax(s_i). The self-attention module comprises the self-attention calculation of the target template image features and the self-attention calculation of the target search image features. For the self-attention calculation of the template image features, the query vector Q, the key vector K, and the value vector V are all the target template image feature F_z, and H_i = Attention((F_z + P_z)W_i^Q, (F_z + P_z)W_i^K, F_z W_i^V), where W_i^Q, W_i^K, and W_i^V denote the projection matrices of the i-th head corresponding to the query vector Q, the key vector K, and the value vector V, respectively; their purpose is to project the inputs Q, K, and V into linear subspaces for expression. P_z is the spatial position encoding in sinusoidal form in the Transformer, which is a constant matrix. Finally, the linear combination of the multi-head outputs according to the expression H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O yields the multi-head attention map result, and an enhanced convolution feature with better characterization capability for the template is obtained by using the multi-head attention map to enhance the original convolution feature in residual form, i.e. F̃_z = F_z + H_O, where F̃_z is the enhanced convolution feature of the template image. Likewise, the convolution feature F_x of the target search image is enhanced in residual form through its multi-head attention map to obtain the enhanced convolution feature F̃_x = F_x + H_O, where P_x, the spatial position encoding in sinusoidal form in the Transformer, plays the role of P_z. The cross-attention fusion module uses the multi-head attention model of the expression H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O to calculate the cross-attention between the enhanced convolution features of the target template image and of the target search image, with 2 cross-attention calculation branches set up to compute the attention maps. A 2-layer fully connected feedforward neural network (FFN) is further introduced in both branches to apply a nonlinear transformation to the cross-attention maps so as to improve the characterization capability of the features, i.e. FFN(x) = W_2·max(0, W_1·x + b_1) + b_2, where W_1 and W_2 are weight matrices and b_1 and b_2 are offset vectors. The FFN outputs of the two branches are summed and fed into a classification branch and a bounding-box regression branch formed by 2 multi-layer perceptrons; through these two branches, the label value that each point position on the classification map is the center of the target box and the distances from each point to the 4 edges of the target box containing it are estimated respectively;
S2: Reasoning: the target box is inferred by a professional through the Transformer. When inferring the target box, for each subsequently input video frame, the target search image is cropped from the input video image according to 2 times the length and width of the target image in the previous frame, its length and width are scaled to 2 times the length and width of the target template image, and it is fed into the Transformer network; the point with the maximum label value on the output classification map is taken as the point inside the target box, and the rectangular target box is deduced from the distances from this point to the four edges of the box.
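The decoding step described above (pick the classification-map point with the largest label value, then recover the rectangle from that point's predicted distances to the four box edges) can be sketched as follows; the nested-list layout, the (left, top, right, bottom) ordering of the distances, and the stride parameter are illustrative assumptions, not the patent's data format.

```python
def decode_box(cls_map, dist_map, stride=1):
    """Return (x1, y1, x2, y2) decoded at the argmax point of the classification map.

    cls_map: rows of label values; dist_map[r][c] = (left, top, right, bottom)
    distances from grid point (c, r) to the four edges of its target box.
    """
    best_r, best_c, best = 0, 0, float("-inf")
    for r, row in enumerate(cls_map):
        for c, score in enumerate(row):
            if score > best:
                best_r, best_c, best = r, c, score
    left, top, right, bottom = dist_map[best_r][best_c]
    x, y = best_c * stride, best_r * stride
    return (x - left, y - top, x + right, y + bottom)

# toy 2x2 classification map: the argmax is at row 1, column 0
cls_map = [[0.1, 0.2], [0.9, 0.3]]
dist_map = [[(1, 1, 1, 1)] * 2, [(2, 1, 3, 2)] * 2]
box = decode_box(cls_map, dist_map)
```

Cropping the next frame's search region at 2x the previous target size keeps the target near the center of the search image as long as inter-frame motion is moderate.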
Example IV
Referring to figs. 1-4, a visual tracking method based on an attention-adaptive selection Transformer comprises the following steps:
S1: Constructing a network model: a network model is constructed by professionals. The network model constructed is an adaptive attention-selection Transformer network model, whose network structure comprises a backbone network for extracting the appearance convolution features of the target and the background, a feature fusion network, a classification output head, and a bounding-box regression output head. The target and background features are extracted by introducing a residual network ResNet50 in the backbone network part; when ResNet50 is introduced, its network structure needs to be modified: the last stage of the ResNet50 network is removed, the first 4 stages are retained, the convolution operation of the 4th stage is set to dilated convolution with a dilation rate of 2 and a convolution stride of 1, and the convolution result of the 4th stage is fed into the feature fusion network as the output feature of the backbone network. Assume the input template image is z and the input search image is x, where H_z is the height of the template image, W_z is the width of the template image, H_x is the height of the search image, and W_x is the width of the search image; the convolution feature of the template image obtained after passing through the residual network ResNet50 is F_z, and the convolution feature of the search image is F_x, where C is the channel number of the convolution features, set to 1024. The feature fusion network part contains a self-attention module and a cross-attention fusion module for the template image features and the target search image; by proposing a selective multi-head attention model, a more accurate characterization of the correlation between the template image features and the target search image is provided for the self-attention module and the cross-attention fusion module;
S2: Training: the constructed Transformer network is trained by a professional. During training, ResNet50 adopts publicly available pre-trained model parameters, and the ResNet50 network parameters are frozen and not adjusted. The classification loss function L_cls associated with the classification branch is L_cls = -Σ_j [y_j·log p_j + (1 - y_j)·log(1 - p_j)], where j denotes the index of each point on the classification map output by the classification branch, and y_j denotes the true class label of the j-th point: a point falling inside the target box is a positive sample, corresponding to class label y_j = 1, and a point outside the target box is a negative sample, corresponding to class label y_j = 0; p_j denotes the predicted label value that the j-th point is a point inside the target box. The regression loss function L_reg associated with the bounding-box regression branch is computed by binary cross-entropy using the intersection-over-union (IoU) between the predicted target box and the real target box, where B_i is the target box estimated from the predicted boundary distances of the i-th point and B is the real target box. The total loss function L obtained by this process is L = L_cls + λ·L_reg, where λ is a positive balance parameter, and error back-propagation is performed on the training dataset according to the loss function L to obtain the parameter-adjusted Transformer network;
S3: Reasoning: the target box is inferred by a professional through the Transformer. When inferring the target box, for each subsequently input video frame, the target search image is cropped from the input video image according to 2 times the length and width of the target image in the previous frame, its length and width are scaled to 2 times the length and width of the target template image, and it is fed into the Transformer network; the point with the maximum label value on the output classification map is taken as the point inside the target box, and the rectangular target box is deduced from the distances from this point to the four edges of the box.
Example V
Referring to figs. 1-4, a visual tracking method based on an attention-adaptive selection Transformer comprises the following steps:
S1: Constructing a network model: a network model is constructed by professionals. The network model constructed is an adaptive attention-selection Transformer network model, whose network structure comprises a backbone network for extracting the appearance convolution features of the target and the background, a feature fusion network, a classification output head, and a bounding-box regression output head. The target and background features are extracted by introducing a residual network ResNet50 in the backbone network part; when ResNet50 is introduced, its network structure needs to be modified: the last stage of the ResNet50 network is removed, the first 4 stages are retained, the convolution operation of the 4th stage is set to dilated convolution with a dilation rate of 2 and a convolution stride of 1, and the convolution result of the 4th stage is fed into the feature fusion network as the output feature of the backbone network. Assume the input template image is z and the input search image is x, where H_z is the height of the template image, W_z is the width of the template image, H_x is the height of the search image, and W_x is the width of the search image; the convolution feature of the template image obtained after passing through the residual network ResNet50 is F_z, and the convolution feature of the search image is F_x, where C is the channel number of the convolution features, set to 1024. The feature fusion network part contains a self-attention module and a cross-attention fusion module for the template image features and the target search image; by proposing a selective multi-head attention model, a more accurate characterization of the correlation between the template image features and the target search image is provided for the self-attention module and the cross-attention fusion module. The attention calculation model is designed by computing an attention map with the attention function of the Transformer. Let the query vector of the Transformer be Q, the key vector be K, and the value vector be V; the single-head attention function in the Transformer is then Attention(Q, K, V) = softmax(QK^T/√d_k)·V, where d_k is the dimension of the query vector Q and the key vector K. A multi-head attention model is introduced to obtain a richer correlation characterization of the feature vectors, with the number of heads set to 8. Denoting the attention map output by a single head in the attention model as H_i = Attention(Q, K, V), where i = {0, 1, …, 7}, the multi-head attention results are fused together by a linear combination when fusing the multi-head attention outputs, and the adaptively selected attention map result is H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O, where W_O is a weight matrix and w_i is the linear combination parameter of the i-th head. The linear combination parameter w_i is computed as follows: first compute s_i = FC(AP(H_i)), where AP(·) denotes a mean-pooling layer and FC(·) denotes a fully connected layer, and s_i is the weight vector of each head's output, describing the weighting of each head's attention map; the linear combination parameter of each head is then obtained through a softmax layer as w_i = softmax(s_i). The self-attention module comprises the self-attention calculation of the target template image features and the self-attention calculation of the target search image features. For the self-attention calculation of the template image features, the query vector Q, the key vector K, and the value vector V are all the target template image feature F_z, and H_i = Attention((F_z + P_z)W_i^Q, (F_z + P_z)W_i^K, F_z W_i^V), where W_i^Q, W_i^K, and W_i^V denote the projection matrices of the i-th head corresponding to the query vector Q, the key vector K, and the value vector V, respectively; their purpose is to project the inputs Q, K, and V into linear subspaces for expression. P_z is the spatial position encoding in sinusoidal form in the Transformer, which is a constant matrix. Finally, the linear combination of the multi-head outputs according to the expression H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O yields the multi-head attention map result, and an enhanced convolution feature with better characterization capability for the template is obtained by using the multi-head attention map to enhance the original convolution feature in residual form, i.e. F̃_z = F_z + H_O, where F̃_z is the enhanced convolution feature of the template image. Likewise, the convolution feature F_x of the target search image is enhanced in residual form through its multi-head attention map to obtain the enhanced convolution feature F̃_x = F_x + H_O, where P_x, the spatial position encoding in sinusoidal form in the Transformer, plays the role of P_z. The cross-attention fusion module uses the multi-head attention model of the expression H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O to calculate the cross-attention between the enhanced convolution features of the target template image and of the target search image, with 2 cross-attention calculation branches set up to compute the attention maps. A 2-layer fully connected feedforward neural network (FFN) is further introduced in both branches to apply a nonlinear transformation to the cross-attention maps so as to improve the characterization capability of the features, i.e. FFN(x) = W_2·max(0, W_1·x + b_1) + b_2, where W_1 and W_2 are weight matrices and b_1 and b_2 are offset vectors. The FFN outputs of the two branches are summed and fed into a classification branch and a bounding-box regression branch formed by 2 multi-layer perceptrons; through these two branches, the label value that each point position on the classification map is the center of the target box and the distances from each point to the 4 edges of the target box containing it are estimated respectively;
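The 2-layer FFN applied in each cross-attention branch, of the form FFN(x) = W_2·max(0, W_1·x + b_1) + b_2, and the summation of the two branch outputs before the output heads can be sketched as follows; the toy dimensions, the concrete weight values, and the branch names are assumptions for illustration only.

```python
def ffn(x, W1, b1, W2, b2):
    """2-layer fully connected FFN with ReLU: W2 @ relu(W1 @ x + b1) + b2."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

# toy weights: 2 -> 3 hidden units -> 2 outputs
W1, b1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, -1.0]
W2, b2 = [[1.0, 1.0, 1.0], [0.5, -0.5, 0.0]], [0.0, 0.0]

# the two cross-attention branches share the FFN structure; their outputs
# are summed before the classification / regression heads
branch_zx = ffn([0.5, 0.25], W1, b1, W2, b2)
branch_xz = ffn([0.1, 0.2], W1, b1, W2, b2)
fused = [a + b for a, b in zip(branch_zx, branch_xz)]
```

The ReLU between the two linear layers is what makes the transformation nonlinear; without it the FFN would collapse to a single linear map and add no characterization capability.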
S2: Training: the constructed Transformer network is trained by a professional. During training, ResNet50 adopts publicly available pre-trained model parameters, and the ResNet50 network parameters are frozen and not adjusted. The classification loss function L_cls associated with the classification branch is L_cls = -Σ_j [y_j·log p_j + (1 - y_j)·log(1 - p_j)]; the regression loss function L_reg associated with the bounding-box regression branch is computed by binary cross-entropy using the intersection-over-union (IoU) between the predicted target box and the real target box. The total loss function L obtained by this process is L = L_cls + λ·L_reg, and error back-propagation is performed on the training dataset according to the loss function L to obtain the parameter-adjusted Transformer network.
The visual tracking methods based on the attention-adaptive selection Transformer in Embodiments I to V were tested, with the following results:
Compared with existing visual tracking methods, the visual tracking method based on the attention-adaptive selection Transformer of Embodiments I to V shows significantly improved accuracy and success-rate performance indices, with Embodiment I being the best embodiment.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art according to the technical scheme of the present invention and its inventive concept, within the technical scope disclosed by the present invention, shall be covered by the protection scope of the present invention.
Claims (8)
1. A visual tracking method based on an attention-adaptive selection Transformer, comprising the following steps:
s1: constructing a network model: constructing a network model by professionals;
s2: training: training the constructed Transformer network by a professional;
s3: reasoning: the target box is inferred by a professional through the Transformer.
2. The visual tracking method based on an attention-adaptive selection Transformer according to claim 1, wherein in step S1 a network model is constructed by a professional; the network model constructed is an adaptive attention-selection Transformer network model, whose network structure comprises a backbone network for extracting the convolution features of the target and the background, a feature fusion network, a classification output head, and a bounding-box regression output head; the target and background features are extracted by introducing a residual network ResNet50 in the backbone network part, wherein the network structure of ResNet50 needs to be modified when it is introduced: the last stage of the ResNet50 network is removed, the first 4 stages are retained, the convolution operation of the 4th stage is set to dilated convolution with a dilation rate of 2 and a convolution stride of 1, and the convolution result of the 4th stage is fed into the feature fusion network as the output feature of the backbone network; assuming the input template image is z and the input search image is x, where H_z is the height of the template image, W_z is the width of the template image, H_x is the height of the search image, and W_x is the width of the search image, the convolution feature of the template image obtained after passing through the residual network ResNet50 is F_z and the convolution feature of the search image is F_x, where C is the channel number of the convolution features, set to 1024.
3. The visual tracking method based on an attention-adaptive selection Transformer according to claim 2, wherein the feature fusion network part contains a self-attention module and a cross-attention fusion module for the template image features and the target search image; by proposing a selective multi-head attention model, a more accurate characterization of the correlation between the template image features and the target search image is provided for the self-attention module and the cross-attention fusion module, wherein the attention calculation model is designed by computing an attention map with the attention function of the Transformer: letting the query vector of the Transformer be Q, the key vector be K, and the value vector be V, the single-head attention function in the Transformer is Attention(Q, K, V) = softmax(QK^T/√d_k)·V, where d_k is the dimension of the query vector Q and the key vector K.
4. The visual tracking method based on an attention-adaptive selection Transformer according to claim 3, wherein a richer correlation characterization between feature vectors is obtained by introducing a multi-head attention model, with the number of heads set to 8; denoting the attention map of the single-head output in the attention model as H_i = Attention(Q, K, V), where i = {0, 1, …, 7}, the multi-head attention results are fused together by a linear combination when fusing the multi-head attention output results, obtaining the adaptively selected attention map result H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O, where W_O is a weight matrix and w_i is the linear combination parameter of the i-th head.
5. The visual tracking method based on an attention-adaptive selection Transformer according to claim 4, wherein the calculation of the linear combination parameter w_i proceeds as follows: first compute s_i = FC(AP(H_i)), where AP(·) denotes a mean-pooling layer and FC(·) denotes a fully connected layer, and s_i is the weight vector of each head's output, describing the weighting of each head's attention map; the linear combination parameter of each head is then obtained through a softmax layer as w_i = softmax(s_i); the self-attention module part comprises the self-attention calculation of the target template image features and the self-attention calculation of the target search image features; for the self-attention calculation of the template image features, the query vector Q, the key vector K, and the value vector V are all the target template image feature F_z, and H_i = Attention((F_z + P_z)W_i^Q, (F_z + P_z)W_i^K, F_z W_i^V), where W_i^Q, W_i^K, and W_i^V represent the projection matrices of the i-th head corresponding to the query vector Q, the key vector K, and the value vector V, respectively, and P_z is the spatial position encoding in sinusoidal form in the Transformer; finally, the linear combination of the multi-head outputs according to the expression H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O yields the multi-head attention map result, and an enhanced convolution feature with better template characterization is obtained by using the multi-head attention map to enhance the original convolution feature in residual form, i.e. F̃_z = F_z + H_O, where F̃_z is the enhanced convolution feature of the template image.
6. The visual tracking method based on an attention-adaptive selection Transformer according to claim 5, wherein for the convolution feature F_x of the target search image, the enhanced convolution feature F̃_x = F_x + H_O is obtained in residual form through the multi-head attention map, where P_x is the spatial position encoding in sinusoidal form in the Transformer; the cross-attention fusion module uses the multi-head attention model of the expression H_O = (Σ_{i=0}^{7} w_i·H_i)·W_O to calculate the cross-attention between the enhanced convolution features of the target template image and of the target search image, with 2 cross-attention calculation branches set up to compute the attention maps; a 2-layer fully connected feedforward neural network (FFN) is further introduced in both branches to apply a nonlinear transformation to the cross-attention maps, i.e. FFN(x) = W_2·max(0, W_1·x + b_1) + b_2, where W_1 and W_2 are weight matrices and b_1 and b_2 are offset vectors; the FFN outputs of the two branches are summed and fed into a classification branch and a bounding-box regression branch formed by 2 multi-layer perceptrons, and through these two branches, the label value that each point position on the classification map is the center of the target box and the distances from each point to the 4 edges of the target box containing it are estimated respectively.
7. The visual tracking method based on an attention-adaptive selection Transformer according to claim 1, wherein in S2 a professional trains the constructed Transformer network; during training, ResNet50 adopts publicly available pre-trained model parameters, and the ResNet50 network parameters are frozen and not adjusted; the classification loss function L_cls associated with the classification branch is L_cls = -Σ_j [y_j·log p_j + (1 - y_j)·log(1 - p_j)], where j denotes the index of each point on the classification map output by the classification branch and y_j denotes the true class label of the j-th point: a point falling inside the target box is a positive sample, corresponding to class label y_j = 1, and a point outside the target box is a negative sample, corresponding to class label y_j = 0; p_j denotes the predicted label value that the j-th point is a point inside the target box; the regression loss function L_reg associated with the bounding-box regression branch is computed by binary cross-entropy using the intersection-over-union (IoU) between the predicted target box and the real target box, where B_i is the target box estimated from the predicted boundary distances of the i-th point and B is the real target box; the total loss function L obtained by this process is L = L_cls + λ·L_reg, where λ is a positive balance parameter, and error back-propagation is performed on the training dataset according to the loss function L to obtain the parameter-adjusted Transformer network.
8. The visual tracking method based on an attention-adaptive selection Transformer according to claim 1, wherein in step S3 a professional infers the target box through the Transformer; when inferring the target box, for each subsequently input video frame, the target search image is cropped from the input video image according to 2 times the length and width of the target image in the previous frame, its length and width are scaled to 2 times the length and width of the target template image, and it is fed into the Transformer network; the point with the maximum label value on the output classification map is taken as the point inside the target box, and the rectangular target box is inferred from the distances from this point to the four edges of the box.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310358214.5A CN116485839A (en) | 2023-04-06 | 2023-04-06 | Visual tracking method based on attention self-adaptive selection of Transformer
Publications (1)
Publication Number | Publication Date |
---|---|
CN116485839A true CN116485839A (en) | 2023-07-25 |
Family
ID=87226090
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310358214.5A Pending CN116485839A (en) | Visual tracking method based on attention-adaptive selection of a Transformer | 2023-04-06 | 2023-04-06 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116485839A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117292338A (en) * | 2023-11-27 | 2023-12-26 | 山东远东保险公估有限公司 | Vehicle accident identification and analysis method based on video stream analysis |
CN117292338B (en) * | 2023-11-27 | 2024-02-13 | 山东远东保险公估有限公司 | Vehicle accident identification and analysis method based on video stream analysis |
CN117649582A (en) * | 2024-01-25 | 2024-03-05 | 南昌工程学院 | Single-flow single-stage network target tracking method and system based on cascade attention |
CN117649582B (en) * | 2024-01-25 | 2024-04-19 | 南昌工程学院 | Single-flow single-stage network target tracking method and system based on cascade attention |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116485839A (en) | Visual tracking method based on attention-adaptive selection of a Transformer | |
CN109443382A (en) | Vision SLAM closed loop detection method based on feature extraction Yu dimensionality reduction neural network | |
WO2024021394A1 (en) | Person re-identification method and apparatus for fusing global features with ladder-shaped local features | |
CN111626764A (en) | Commodity sales volume prediction method and device based on Transformer + LSTM neural network model | |
CN112699682A (en) | Named entity identification method and device based on combinable weak authenticator | |
CN113297972B (en) | Transformer substation equipment defect intelligent analysis method based on data fusion deep learning | |
CN115223082A (en) | Aerial video classification method based on space-time multi-scale transform | |
CN114913403B (en) | Visual question-answering method based on metric learning | |
CN113313232A (en) | Functional brain network classification method based on pre-training and graph neural network | |
CN114241191A (en) | Cross-modal self-attention-based non-candidate-box expression understanding method | |
CN115423847B (en) | Twin multi-modal target tracking method based on Transformer | |
CN113609326B (en) | Image description generation method based on relationship between external knowledge and target | |
CN115761240B (en) | Image semantic segmentation method and device for chaotic back propagation graph neural network | |
CN111652021B (en) | BP neural network-based face recognition method and system | |
CN117036760A (en) | Multi-view clustering model implementation method based on graph comparison learning | |
CN117058882A (en) | Traffic data compensation method based on multi-feature double-discriminant | |
CN116844004A (en) | Point cloud automatic semantic modeling method for digital twin scene | |
CN115861713A (en) | Carotid plaque ultrasonic image processing method based on multitask learning | |
CN113450313B (en) | Image significance visualization method based on regional contrast learning | |
CN113283393B (en) | Deepfake video detection method based on image group and two-stream network | |
CN115170888A (en) | Electronic component zero sample identification model and method based on visual information and semantic attributes | |
CN115018134A (en) | Pedestrian trajectory prediction method based on three-scale spatiotemporal information | |
CN114187569A (en) | Real-time target detection method integrating Pearson coefficient matrix and attention | |
CN112598115A (en) | Deep neural network hierarchical analysis method based on non-local neighbor relation learning | |
CN117251583B (en) | Text enhanced knowledge graph representation learning method and system based on local graph structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||