CN116485839A - Visual tracking method based on an attention-adaptive selection Transformer - Google Patents

Visual tracking method based on an attention-adaptive selection Transformer

Info

Publication number
CN116485839A
Authority
CN
China
Prior art keywords
attention
network
target
transformer
head
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310358214.5A
Other languages
Chinese (zh)
Inventor
钱诚
何越江
李腾娇
曹国锋
殷文慧
罗丰瑞
许家豪
杜雨菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changzhou Institute of Technology
Original Assignee
Changzhou Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changzhou Institute of Technology filed Critical Changzhou Institute of Technology
Priority to CN202310358214.5A priority Critical patent/CN116485839A/en
Publication of CN116485839A publication Critical patent/CN116485839A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of visual tracking, and in particular to a visual tracking method based on an attention-adaptive selection Transformer. It addresses the problem that existing visual tracking techniques, which use a conventional Transformer encoder and decoder, simply concatenate the attention maps of several single heads, so that the description of the correct correlations is weak and the precision and success-rate performance indices of the tracker are low. The proposed scheme comprises the following steps: S1: constructing a network model; S2: training; S3: inference. Through an adaptive attention-selection mechanism in the multi-head attention computation module, the conventional multi-head concatenation is replaced by a linear combination of the heads, so that the characterization of the correlation between the target template image and the target search image that is most useful for tracking is selected, correct correlations are enhanced, noisy correlations are suppressed, and the precision and success-rate performance indices of the visual tracking method are improved.

Description

Visual tracking method based on an attention-adaptive selection Transformer
Technical Field
The invention relates to the technical field of visual tracking, and in particular to a visual tracking method based on an attention-adaptive selection Transformer.
Background
Visual tracking continuously infers, in the subsequent frames of a video, the spatial position of a target object whose appearance is specified in the first frame. It is widely used in video surveillance, robot navigation, product surface-quality inspection and other fields. Thanks to the development of deep learning, the ability to characterize target appearance has become much stronger, so current trackers mainly realize tracking by template matching: the appearance image of the target object in the first frame is used as a template, the template is slid over each subsequent video frame to compute the similarity between every local image region and the template, and the image region with the maximum similarity is finally selected as the target region.
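For illustration only (not part of the claimed method), the sliding-template matching described above can be sketched in PyTorch as follows; the tensor shapes and names are assumptions. The template feature acts as a correlation kernel over the search-region feature, and the peak of the response map is taken as the most similar location.

```python
# Minimal sketch of sliding-window template matching (illustrative, not the patented method).
import torch
import torch.nn.functional as F

def correlation_response(template_feat: torch.Tensor, search_feat: torch.Tensor) -> torch.Tensor:
    """template_feat: (C, Hz, Wz); search_feat: (C, Hx, Wx) -> response map (Hx-Hz+1, Wx-Wz+1)."""
    kernel = template_feat.unsqueeze(0)   # (1, C, Hz, Wz): template used as a correlation kernel
    search = search_feat.unsqueeze(0)     # (1, C, Hx, Wx)
    response = F.conv2d(search, kernel)   # similarity of every local window to the template
    return response.squeeze(0).squeeze(0)

resp = correlation_response(torch.randn(256, 7, 7), torch.randn(256, 31, 31))
peak = torch.nonzero(resp == resp.max())[0]   # row/col of the most similar local region
```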
According to the type of core network module used for target matching, current deep-learning-based visual tracking methods can be broadly divided into two categories: methods based on Siamese convolutional neural networks and methods based on Transformers. For feature extraction, both categories mostly adopt classical convolutional neural networks such as ResNet or VGG as the backbone. For computing the correlation between the target template features and the search-image features, Siamese trackers still estimate the similarity between the template and each local region of the search image by convolution, whereas Transformer-based trackers decompose the video frame into local image-feature embeddings and compute the similarity between the template and the local features in embedding form through an encoder and a decoder. In the decision part, both kinds of methods usually construct a classification head, a target-box coordinate-regression head or a target-box IoU-estimation head, so as to output the estimated target boxes on the search image together with the confidence of each box, which provides the basis for the final target-box decision.
However, existing visual tracking techniques that use a conventional Transformer encoder and decoder simply concatenate the attention maps of several single heads; the description of the correct correlations is therefore weak, and the precision and success-rate performance indices of the tracker are low. The present invention proposes a visual tracking method based on an attention-adaptive selection Transformer to solve this problem.
Disclosure of Invention
The invention aims to solve the problem that, in existing visual tracking techniques, a conventional Transformer encoder and decoder simply concatenate the attention maps of several single heads, so that the description of the correct correlations is weak and the precision and success-rate performance indices of the tracker are low. It provides a visual tracking method based on an attention-adaptive selection Transformer that improves the tracking precision and the success-rate performance index of the visual tracking method.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a visual tracking method based on an attention-adaptive selection Transformer, comprising the following steps:
S1: constructing a network model: a professional constructs the network model;
S2: training: a professional trains the constructed Transformer network;
S3: inference: a professional infers the target box with the Transformer network;
Preferably, in step S1 a professional builds the network model, namely the adaptive attention-selection Transformer network model. Its structure comprises a backbone network that extracts convolutional features of the target and background appearance, a feature-fusion network, a classification output head and a bounding-box regression output head. A residual network ResNet50 is introduced in the backbone part to extract the target and background features; the structure of ResNet50 is modified when it is introduced: the last stage of the ResNet50 network is removed and the first four stages are kept, the convolutions of the fourth stage are set to dilated (hole) convolutions with a dilation rate of 2 and a stride of 1, and the convolution result of the fourth stage is fed into the feature-fusion network as the backbone output feature. Let the input template image be $z \in \mathbb{R}^{H_z \times W_z \times 3}$ and the input search image be $x \in \mathbb{R}^{H_x \times W_x \times 3}$, where $H_z$ and $W_z$ are the height and width of the template image and $H_x$ and $W_x$ are the height and width of the search image. After the residual network ResNet50, the convolutional feature of the template image is $F_z$ and the convolutional feature of the search image is $F_x$, where $C$, the number of feature channels, is set to 1024.

The feature-fusion network contains a self-attention module and a cross-attention fusion module for the template-image features and the target search image. A selective multi-head attention model is proposed to give the self-attention module and the cross-attention fusion module a more accurate characterization of the correlation between the template-image features and the target search image. The attention computation model uses the Transformer attention function to compute attention maps: with query vector $Q$, key vector $K$ and value vector $V$, the single-head attention function of the Transformer is $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$, where $d_k$ is the dimension of the query vector $Q$ and the key vector $K$. More correlation characterizations between feature vectors are obtained by introducing a multi-head attention model with the number of heads set to 8; the attention map output by a single head is $H_i=\mathrm{Attention}(Q,K,V)$, $i=\{0,1,\dots,7\}$. When the multi-head attention outputs are fused, the head results are combined linearly, and the adaptively selected attention-map result is $H_O=\big(\sum_{i=0}^{7} w_i H_i\big)W_O$, where $W_O$ is a weight matrix and $w_i$ is the linear-combination parameter of the $i$-th head. The parameter $w_i$ is computed in two steps: first $s_i=\mathrm{FC}\big(\mathrm{AP}(H_i)\big)$, where $\mathrm{AP}(\cdot)$ denotes the mean-pooling layer and $\mathrm{FC}(\cdot)$ the fully connected layer, so that $s_i$ is a weight score describing the attention map output by each head; the scores are then passed through a softmax layer to obtain the linear-combination parameter of each head, $w_i=\mathrm{softmax}(s_i)$.

The self-attention module comprises self-attention over the target template-image features and self-attention over the target search-image features. For the self-attention of the template-image features, the query vector $Q$, key vector $K$ and value vector $V$ are all the target template-image feature $F_z$, and $H_i=\mathrm{Attention}\big((F_z+P_z)W_i^Q,\,(F_z+P_z)W_i^K,\,F_z W_i^V\big)$, where $W_i^Q$, $W_i^K$ and $W_i^V$ are the projection matrices of the $i$-th head for $Q$, $K$ and $V$, whose purpose is to project the inputs into linear subspaces, and $P_z$ is the sinusoidal spatial position encoding of the Transformer, a constant matrix. The multi-head linear combination of the outputs gives the multi-head attention-map result, and the original convolutional feature is enhanced in residual form with the multi-head attention map to obtain an enhanced template feature with better characterization ability, $\hat{F}_z=F_z+H_O$, where $\hat{F}_z$ is the enhanced convolutional feature of the template image. Likewise, the convolutional feature $F_x$ of the target search image is enhanced in residual form with its multi-head attention map to give $\hat{F}_x=F_x+H_O$, using $H_i=\mathrm{Attention}\big((F_x+P_x)W_i^Q,\,(F_x+P_x)W_i^K,\,F_x W_i^V\big)$, where $P_x$ is the sinusoidal spatial position encoding. The cross-attention fusion module uses the multi-head attention model $H_O=\big(\sum_i w_i H_i\big)W_O$ to compute the cross-attention between the enhanced convolutional feature of the target template image and that of the target search image, setting up two cross-attention computation branches in which the two enhanced features serve alternately as query and as key/value. A two-layer fully connected feed-forward network (FFN) is further introduced in both branches to apply a nonlinear transformation to the cross-attention map and improve the characterization ability of the features, $\mathrm{FFN}(x)=\max(0,\,xW_1+b_1)W_2+b_2$, where $W_1$ and $W_2$ are weight matrices and $b_1$ and $b_2$ are offset vectors, each branch having its own FFN parameters. The FFN outputs of the two branches are summed and fed into a classification branch and a bounding-box regression branch formed by two multi-layer perceptrons; the two branches respectively estimate, for every position on the classification map, a label value indicating that the position is the centre of the target box, and the distances from each point to the four sides of the target box containing it;
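For illustration only, the selective multi-head attention described above (average-pool each head's output, score it with a fully connected layer, normalize the scores with softmax across the heads, and replace concatenation with a weighted linear combination) can be sketched in PyTorch as follows; the module name, dimensions and placement of the projections are assumptions, not the patent's implementation.

```python
# Sketch of adaptive attention selection: heads are linearly combined instead of concatenated.
import math
import torch
import torch.nn as nn

class SelectiveMultiHeadAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)
        self.score_fc = nn.Linear(self.head_dim, 1)   # s_i = FC(AP(H_i)): per-head score
        self.w_o = nn.Linear(self.head_dim, dim)      # output weight matrix W_O

    def forward(self, q, k, v):
        B, Nq, _ = q.shape
        Nk = k.shape[1]
        def split(x, proj, n):
            return proj(x).view(B, n, self.num_heads, self.head_dim).transpose(1, 2)
        Q, K, V = split(q, self.w_q, Nq), split(k, self.w_k, Nk), split(v, self.w_v, Nk)
        attn = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(self.head_dim), dim=-1)
        heads = attn @ V                                 # (B, heads, Nq, head_dim): per-head outputs H_i
        scores = self.score_fc(heads.mean(dim=2))        # average-pool over tokens, then FC
        w = torch.softmax(scores, dim=1)                 # w_i = softmax(s_i) across the heads
        combined = (w.unsqueeze(-1) * heads).sum(dim=1)  # weighted linear combination replaces concat
        return self.w_o(combined)                        # H_O = (sum_i w_i H_i) W_O

mha = SelectiveMultiHeadAttention()
out = mha(torch.randn(2, 49, 256), torch.randn(2, 196, 256), torch.randn(2, 196, 256))
```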
Preferably, in step S2 a professional trains the constructed Transformer network. During training, ResNet50 uses publicly available pre-trained model parameters, and the ResNet50 network parameters are frozen and not adjusted. The classification loss associated with the classification branch is the binary cross-entropy $L_{cls}=-\sum_{j}\big[y_j\log p_j+(1-y_j)\log(1-p_j)\big]$, where $j$ is the index of each point on the classification map output by the classification branch, $y_j$ is the true class label of the $j$-th point (a point falling inside the target box is a positive sample with label $y_j=1$, a point outside the target box is a negative sample with label $y_j=0$), and $p_j$ is the predicted label value that the $j$-th point lies inside the target box. The regression loss $L_{reg}$ associated with the bounding-box regression branch is computed from the intersection-over-union (IoU) between the predicted and the real target box, where $B_i$ is the target box estimated from the predicted distances of the $i$-th point to the box boundary and $B$ is the real target box. The total loss function is $L=L_{cls}+\lambda\,L_{reg}$, where $\lambda$ is a positive balance parameter. Error back-propagation is performed on the training data set according to the loss function $L$ to obtain the parameter-tuned Transformer network;
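A hedged PyTorch sketch of this training objective is given below. The explicit binary-cross-entropy form of $L_{cls}$ and the $1-\mathrm{IoU}$ form of $L_{reg}$ over positive points are assumptions consistent with, but not guaranteed identical to, the patent's formulas; the names and shapes are illustrative.

```python
# Sketch of the combined tracking loss L = L_cls + lambda * L_reg (assumed loss variants).
import torch
import torch.nn.functional as F

def iou(boxes_a: torch.Tensor, boxes_b: torch.Tensor) -> torch.Tensor:
    """Boxes as (x1, y1, x2, y2); returns element-wise IoU."""
    lt = torch.max(boxes_a[:, :2], boxes_b[:, :2])
    rb = torch.min(boxes_a[:, 2:], boxes_b[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_a = (boxes_a[:, 2:] - boxes_a[:, :2]).clamp(min=0).prod(dim=1)
    area_b = (boxes_b[:, 2:] - boxes_b[:, :2]).clamp(min=0).prod(dim=1)
    return inter / (area_a + area_b - inter + 1e-6)

def tracking_loss(cls_pred, cls_gt, boxes_pred, boxes_gt, lam: float = 1.0):
    """cls_pred/cls_gt: (N,) sigmoid scores and 0/1 labels; boxes_*: (N, 4) decoded boxes."""
    l_cls = F.binary_cross_entropy(cls_pred, cls_gt)        # classification-map loss
    pos = cls_gt > 0.5                                       # regression only on positive points
    l_reg = (1.0 - iou(boxes_pred[pos], boxes_gt[pos])).mean() if pos.any() else cls_pred.sum() * 0
    return l_cls + lam * l_reg

loss = tracking_loss(torch.rand(64), (torch.rand(64) > 0.8).float(),
                     torch.tensor([[10., 10., 50., 50.]]).repeat(64, 1),
                     torch.tensor([[12., 12., 48., 48.]]).repeat(64, 1))
```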
Preferably, in step S3 a professional infers the target box with the Transformer network. For each subsequently input video frame, a target search image is cropped from the frame at twice the length and width of the target in the previous frame, and its length and width are scaled to twice the length and width of the target template image. The search image is fed into the Transformer network; the point with the maximum label value on the output classification map is taken as a point inside the target box, and the rectangular target box is deduced from the distances of this point to the four sides of the box.
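The box decoding at inference time can be illustrated with the following self-contained sketch; the feature stride, map sizes and function names are assumptions, not taken from the patent.

```python
# Sketch of decoding the target box from the classification map and the 4-side distances.
import torch

def decode_box(cls_map: torch.Tensor, ltrb: torch.Tensor, stride: int = 16):
    """cls_map: (H, W) label values; ltrb: (H, W, 4) left/top/right/bottom distances in pixels.
    Returns (x1, y1, x2, y2) in search-image coordinates for the maximum-score point."""
    idx = int(torch.argmax(cls_map))                 # point with the largest label value
    y, x = divmod(idx, cls_map.shape[1])
    px, py = (x + 0.5) * stride, (y + 0.5) * stride  # centre of the winning feature-map cell
    l, t, r, b = ltrb[y, x].tolist()                 # distances to the four sides of the box
    return (px - l, py - t, px + r, py + b)

# The search region itself is cropped at twice the previous box size and resized to twice the
# template's height and width before being fed to the network, as described in the step above.
box = decode_box(torch.rand(16, 16), torch.rand(16, 16, 4) * 32)
```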
Compared with the prior art, the invention has the following beneficial effects:
1. Through the adaptive attention-selection mechanism in the multi-head attention computation module, the conventional multi-head concatenation is replaced by a linear combination of the heads, so that the characterization of the correlation between the target template image and the target search image that is most useful for tracking is selected, correct correlations are enhanced, noisy correlations are suppressed, and the precision and success-rate performance indices of the visual tracking method are improved.
In summary, the invention uses the adaptive attention-selection mechanism in the multi-head attention computation module to select the correlation characterization most useful for tracking, thereby improving the precision and success-rate performance indices of the visual tracking method.
Drawings
FIG. 1 is a flow chart of the visual tracking method based on an attention-adaptive selection Transformer according to the present invention;
FIG. 2 is a schematic diagram of the visual tracking method based on an attention-adaptive selection Transformer according to the present invention;
FIG. 3 is a schematic diagram of the multi-head attention computation module of the visual tracking method based on an attention-adaptive selection Transformer according to the present invention;
FIG. 4 is a comparison chart of the success-rate performance index of the visual tracking method based on an attention-adaptive selection Transformer according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely; obviously, the described embodiments are only some, and not all, of the embodiments of the present invention.
Example 1
Referring to FIGS. 1-4, a visual tracking method based on an attention-adaptive selection Transformer comprises the following steps:
S1: constructing a network model: a professional constructs the network model, namely the adaptive attention-selection Transformer network model. Its structure comprises a backbone network that extracts convolutional features of the target and background appearance, a feature-fusion network, a classification output head and a bounding-box regression output head. A residual network ResNet50 is introduced in the backbone part to extract the target and background features; its structure is modified when it is introduced: the last stage of the ResNet50 network is removed and the first four stages are kept, the convolutions of the fourth stage are set to dilated (hole) convolutions with a dilation rate of 2 and a stride of 1, and the convolution result of the fourth stage is fed into the feature-fusion network as the backbone output feature. Let the input template image be $z \in \mathbb{R}^{H_z \times W_z \times 3}$ and the input search image be $x \in \mathbb{R}^{H_x \times W_x \times 3}$, where $H_z$, $W_z$, $H_x$ and $W_x$ are the heights and widths of the template and search images; after the residual network ResNet50, the convolutional features of the template image and of the search image are $F_z$ and $F_x$, with the number of channels $C$ set to 1024. The feature-fusion network contains a self-attention module and a cross-attention fusion module for the template-image features and the target search image, and a selective multi-head attention model is proposed to give both modules a more accurate characterization of the correlation between the template-image features and the target search image. With query vector $Q$, key vector $K$ and value vector $V$, the single-head attention function of the Transformer is $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$, where $d_k$ is the dimension of $Q$ and $K$. More correlation characterizations between feature vectors are obtained with a multi-head attention model whose number of heads is set to 8; the attention map output by a single head is $H_i=\mathrm{Attention}(Q,K,V)$, $i=\{0,1,\dots,7\}$. When fusing the multi-head outputs, the head results are combined linearly, and the adaptively selected attention-map result is $H_O=\big(\sum_{i=0}^{7} w_i H_i\big)W_O$, where $W_O$ is a weight matrix and $w_i$ is the linear-combination parameter of the $i$-th head, computed by first evaluating $s_i=\mathrm{FC}\big(\mathrm{AP}(H_i)\big)$, with $\mathrm{AP}(\cdot)$ the mean-pooling layer and $\mathrm{FC}(\cdot)$ the fully connected layer, and then passing the scores through a softmax layer, $w_i=\mathrm{softmax}(s_i)$. The self-attention module comprises self-attention over the target template-image features and over the target search-image features. For the template-image features, $Q$, $K$ and $V$ are all $F_z$ and $H_i=\mathrm{Attention}\big((F_z+P_z)W_i^Q,\,(F_z+P_z)W_i^K,\,F_z W_i^V\big)$, where $W_i^Q$, $W_i^K$ and $W_i^V$ are the projection matrices of the $i$-th head, whose purpose is to project the inputs into linear subspaces, and $P_z$ is the constant sinusoidal spatial position encoding of the Transformer; the multi-head linear combination of the outputs gives the attention-map result, and the original feature is enhanced in residual form to obtain the enhanced template feature $\hat{F}_z=F_z+H_O$. Likewise, the search-image feature $F_x$ is enhanced in residual form with its multi-head attention map, $\hat{F}_x=F_x+H_O$, using $H_i=\mathrm{Attention}\big((F_x+P_x)W_i^Q,\,(F_x+P_x)W_i^K,\,F_x W_i^V\big)$, where $P_x$ is the sinusoidal position encoding. The cross-attention fusion module uses the multi-head attention model $H_O=\big(\sum_i w_i H_i\big)W_O$ to compute the cross-attention between the enhanced template feature and the enhanced search feature, setting up two cross-attention computation branches in which the two enhanced features serve alternately as query and as key/value; a two-layer fully connected feed-forward network (FFN), $\mathrm{FFN}(x)=\max(0,\,xW_1+b_1)W_2+b_2$ with weight matrices $W_1$, $W_2$ and offset vectors $b_1$, $b_2$, is further introduced in both branches to apply a nonlinear transformation to the cross-attention map and improve the characterization ability of the features. The FFN outputs of the two branches are summed and fed into a classification branch and a bounding-box regression branch formed by two multi-layer perceptrons, which respectively estimate, for every position on the classification map, a label value indicating that the position is the centre of the target box, and the distances from each point to the four sides of the target box containing it;
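For illustration, the two prediction branches just described (a classification map plus a four-side distance regression, each realized as a small multi-layer perceptron) can be sketched as follows; all dimensions and names are assumptions, not the patent's implementation.

```python
# Sketch of the classification head and bounding-box regression head on the fused features.
import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class TrackingHeads(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.cls_head = mlp(dim, dim, 1)   # label value per position (target confidence)
        self.reg_head = mlp(dim, dim, 4)   # distances from the point to the 4 sides of its box

    def forward(self, fused_tokens):       # (B, N, dim), N = positions on the search feature map
        cls_map = torch.sigmoid(self.cls_head(fused_tokens)).squeeze(-1)  # (B, N)
        ltrb = torch.relu(self.reg_head(fused_tokens))                    # (B, N, 4) left/top/right/bottom
        return cls_map, ltrb

heads = TrackingHeads()
cls_map, ltrb = heads(torch.randn(1, 1024, 256))
```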
S2: training: a professional trains the constructed Transformer network. During training, ResNet50 uses publicly available pre-trained model parameters, and the ResNet50 network parameters are frozen and not adjusted. The classification loss associated with the classification branch is the binary cross-entropy $L_{cls}=-\sum_{j}\big[y_j\log p_j+(1-y_j)\log(1-p_j)\big]$, where $j$ is the index of each point on the classification map output by the classification branch, $y_j$ is the true class label of the $j$-th point (a point falling inside the target box is a positive sample with label $y_j=1$, a point outside the target box is a negative sample with label $y_j=0$), and $p_j$ is the predicted label value that the $j$-th point lies inside the target box. The regression loss $L_{reg}$ associated with the bounding-box regression branch is computed from the intersection-over-union (IoU) between the predicted and the real target box, where $B_i$ is the target box estimated from the predicted distances of the $i$-th point to the box boundary and $B$ is the real target box. The total loss function is $L=L_{cls}+\lambda\,L_{reg}$, where $\lambda$ is a positive balance parameter. Error back-propagation is performed on the training data set according to the loss function $L$ to obtain the parameter-tuned Transformer network;
S3: inference: a professional infers the target box with the Transformer network. For each subsequently input video frame, a target search image is cropped from the frame at twice the length and width of the target in the previous frame, and its length and width are scaled to twice the length and width of the target template image. The search image is fed into the Transformer network; the point with the maximum label value on the output classification map is taken as a point inside the target box, and the rectangular target box is deduced from the distances of this point to the four sides of the box.
Example 2
Referring to FIGS. 1-4, a visual tracking method based on an attention-adaptive selection Transformer comprises the following steps:
S1: constructing a network model: a professional constructs the network model, namely the adaptive attention-selection Transformer network model. Its structure comprises a backbone network that extracts convolutional features of the target and background appearance, a feature-fusion network, a classification output head and a bounding-box regression output head. A residual network ResNet50 is introduced in the backbone part to extract the target and background features; its structure is modified when it is introduced: the last stage of the ResNet50 network is removed and the first four stages are kept, the convolutions of the fourth stage are set to dilated (hole) convolutions with a dilation rate of 2 and a stride of 1, and the convolution result of the fourth stage is fed into the feature-fusion network as the backbone output feature. Let the input template image be $z \in \mathbb{R}^{H_z \times W_z \times 3}$ and the input search image be $x \in \mathbb{R}^{H_x \times W_x \times 3}$, where $H_z$, $W_z$, $H_x$ and $W_x$ are the heights and widths of the template and search images; after the residual network ResNet50, the convolutional features of the template image and of the search image are $F_z$ and $F_x$, with the number of channels $C$ set to 1024. The feature-fusion network contains a self-attention module and a cross-attention fusion module for the template-image features and the target search image, and a selective multi-head attention model is proposed to give both modules a more accurate characterization of the correlation between the template-image features and the target search image. With query vector $Q$, key vector $K$ and value vector $V$, the single-head attention function of the Transformer is $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$, where $d_k$ is the dimension of $Q$ and $K$. More correlation characterizations between feature vectors are obtained with a multi-head attention model whose number of heads is set to 8; the attention map output by a single head is $H_i=\mathrm{Attention}(Q,K,V)$, $i=\{0,1,\dots,7\}$. When fusing the multi-head outputs, the head results are combined linearly, and the adaptively selected attention-map result is $H_O=\big(\sum_{i=0}^{7} w_i H_i\big)W_O$, where $W_O$ is a weight matrix and $w_i$ is the linear-combination parameter of the $i$-th head, computed by first evaluating $s_i=\mathrm{FC}\big(\mathrm{AP}(H_i)\big)$, with $\mathrm{AP}(\cdot)$ the mean-pooling layer and $\mathrm{FC}(\cdot)$ the fully connected layer, and then passing the scores through a softmax layer, $w_i=\mathrm{softmax}(s_i)$. The self-attention module comprises self-attention over the target template-image features and over the target search-image features. For the template-image features, $Q$, $K$ and $V$ are all $F_z$ and $H_i=\mathrm{Attention}\big((F_z+P_z)W_i^Q,\,(F_z+P_z)W_i^K,\,F_z W_i^V\big)$, where $W_i^Q$, $W_i^K$ and $W_i^V$ are the projection matrices of the $i$-th head, whose purpose is to project the inputs into linear subspaces, and $P_z$ is the constant sinusoidal spatial position encoding of the Transformer; the multi-head linear combination of the outputs gives the attention-map result, and the original feature is enhanced in residual form to obtain the enhanced template feature $\hat{F}_z=F_z+H_O$. Likewise, the search-image feature $F_x$ is enhanced in residual form with its multi-head attention map, $\hat{F}_x=F_x+H_O$, using $H_i=\mathrm{Attention}\big((F_x+P_x)W_i^Q,\,(F_x+P_x)W_i^K,\,F_x W_i^V\big)$, where $P_x$ is the sinusoidal position encoding. The cross-attention fusion module uses the multi-head attention model $H_O=\big(\sum_i w_i H_i\big)W_O$ to compute the cross-attention between the enhanced template feature and the enhanced search feature, setting up two cross-attention computation branches in which the two enhanced features serve alternately as query and as key/value; a two-layer fully connected feed-forward network (FFN), $\mathrm{FFN}(x)=\max(0,\,xW_1+b_1)W_2+b_2$ with weight matrices $W_1$, $W_2$ and offset vectors $b_1$, $b_2$, is further introduced in both branches to apply a nonlinear transformation to the cross-attention map and improve the characterization ability of the features. The FFN outputs of the two branches are summed and fed into a classification branch and a bounding-box regression branch formed by two multi-layer perceptrons, which respectively estimate, for every position on the classification map, a label value indicating that the position is the centre of the target box, and the distances from each point to the four sides of the target box containing it;
S2: training: a professional trains the constructed Transformer network. During training, ResNet50 uses publicly available pre-trained model parameters, and the ResNet50 network parameters are frozen and not adjusted. The classification loss associated with the classification branch is the binary cross-entropy $L_{cls}=-\sum_{j}\big[y_j\log p_j+(1-y_j)\log(1-p_j)\big]$, where $j$ is the index of each point on the classification map output by the classification branch, $y_j$ is the true class label of the $j$-th point (a point falling inside the target box is a positive sample with label $y_j=1$, a point outside the target box is a negative sample with label $y_j=0$), and $p_j$ is the predicted label value that the $j$-th point lies inside the target box. The regression loss $L_{reg}$ associated with the bounding-box regression branch is computed from the intersection-over-union (IoU) between the predicted and the real target box, where $B_i$ is the target box estimated from the predicted distances of the $i$-th point to the box boundary and $B$ is the real target box. The total loss function is $L=L_{cls}+\lambda\,L_{reg}$, where $\lambda$ is a positive balance parameter, and error back-propagation is performed on the training data set according to the loss function $L$ to obtain the parameter-tuned Transformer network.
Example 3
Referring to FIGS. 1-4, a visual tracking method based on an attention-adaptive selection Transformer comprises the following steps:
S1: constructing a network model: a professional constructs the network model, namely the adaptive attention-selection Transformer network model. Its structure comprises a backbone network that extracts convolutional features of the target and background appearance, a feature-fusion network, a classification output head and a bounding-box regression output head. A residual network ResNet50 is introduced in the backbone part to extract the target and background features; its structure is modified when it is introduced: the last stage of the ResNet50 network is removed and the first four stages are kept, the convolutions of the fourth stage are set to dilated (hole) convolutions with a dilation rate of 2 and a stride of 1, and the convolution result of the fourth stage is fed into the feature-fusion network as the backbone output feature. Let the input template image be $z \in \mathbb{R}^{H_z \times W_z \times 3}$ and the input search image be $x \in \mathbb{R}^{H_x \times W_x \times 3}$, where $H_z$, $W_z$, $H_x$ and $W_x$ are the heights and widths of the template and search images; after the residual network ResNet50, the convolutional features of the template image and of the search image are $F_z$ and $F_x$, with the number of channels $C$ set to 1024. The feature-fusion network contains a self-attention module and a cross-attention fusion module for the template-image features and the target search image, and a selective multi-head attention model is proposed to give both modules a more accurate characterization of the correlation between the template-image features and the target search image. With query vector $Q$, key vector $K$ and value vector $V$, the single-head attention function of the Transformer is $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$, where $d_k$ is the dimension of $Q$ and $K$. More correlation characterizations between feature vectors are obtained with a multi-head attention model whose number of heads is set to 8; the attention map output by a single head is $H_i=\mathrm{Attention}(Q,K,V)$, $i=\{0,1,\dots,7\}$. When fusing the multi-head outputs, the head results are combined linearly, and the adaptively selected attention-map result is $H_O=\big(\sum_{i=0}^{7} w_i H_i\big)W_O$, where $W_O$ is a weight matrix and $w_i$ is the linear-combination parameter of the $i$-th head, computed by first evaluating $s_i=\mathrm{FC}\big(\mathrm{AP}(H_i)\big)$, with $\mathrm{AP}(\cdot)$ the mean-pooling layer and $\mathrm{FC}(\cdot)$ the fully connected layer, and then passing the scores through a softmax layer, $w_i=\mathrm{softmax}(s_i)$. The self-attention module comprises self-attention over the target template-image features and over the target search-image features. For the template-image features, $Q$, $K$ and $V$ are all $F_z$ and $H_i=\mathrm{Attention}\big((F_z+P_z)W_i^Q,\,(F_z+P_z)W_i^K,\,F_z W_i^V\big)$, where $W_i^Q$, $W_i^K$ and $W_i^V$ are the projection matrices of the $i$-th head, whose purpose is to project the inputs into linear subspaces, and $P_z$ is the constant sinusoidal spatial position encoding of the Transformer; the multi-head linear combination of the outputs gives the attention-map result, and the original feature is enhanced in residual form to obtain the enhanced template feature $\hat{F}_z=F_z+H_O$. Likewise, the search-image feature $F_x$ is enhanced in residual form with its multi-head attention map, $\hat{F}_x=F_x+H_O$, using $H_i=\mathrm{Attention}\big((F_x+P_x)W_i^Q,\,(F_x+P_x)W_i^K,\,F_x W_i^V\big)$, where $P_x$ is the sinusoidal position encoding. The cross-attention fusion module uses the multi-head attention model $H_O=\big(\sum_i w_i H_i\big)W_O$ to compute the cross-attention between the enhanced template feature and the enhanced search feature, setting up two cross-attention computation branches in which the two enhanced features serve alternately as query and as key/value; a two-layer fully connected feed-forward network (FFN), $\mathrm{FFN}(x)=\max(0,\,xW_1+b_1)W_2+b_2$ with weight matrices $W_1$, $W_2$ and offset vectors $b_1$, $b_2$, is further introduced in both branches to apply a nonlinear transformation to the cross-attention map and improve the characterization ability of the features. The FFN outputs of the two branches are summed and fed into a classification branch and a bounding-box regression branch formed by two multi-layer perceptrons, which respectively estimate, for every position on the classification map, a label value indicating that the position is the centre of the target box, and the distances from each point to the four sides of the target box containing it;
S2: inference: a professional infers the target box with the Transformer network. For each subsequently input video frame, a target search image is cropped from the frame at twice the length and width of the target in the previous frame, and its length and width are scaled to twice the length and width of the target template image. The search image is fed into the Transformer network; the point with the maximum label value on the output classification map is taken as a point inside the target box, and the rectangular target box is deduced from the distances of this point to the four sides of the box.
Example 4
Referring to FIGS. 1-4, a visual tracking method based on an attention-adaptive selection Transformer comprises the following steps:
S1: constructing a network model: a professional constructs the network model, namely the adaptive attention-selection Transformer network model. Its structure comprises a backbone network that extracts convolutional features of the target and background appearance, a feature-fusion network, a classification output head and a bounding-box regression output head. A residual network ResNet50 is introduced in the backbone part to extract the target and background features; its structure is modified when it is introduced: the last stage of the ResNet50 network is removed and the first four stages are kept, the convolutions of the fourth stage are set to dilated (hole) convolutions with a dilation rate of 2 and a stride of 1, and the convolution result of the fourth stage is fed into the feature-fusion network as the backbone output feature. Let the input template image be $z \in \mathbb{R}^{H_z \times W_z \times 3}$ and the input search image be $x \in \mathbb{R}^{H_x \times W_x \times 3}$, where $H_z$, $W_z$, $H_x$ and $W_x$ are the heights and widths of the template and search images; after the residual network ResNet50, the convolutional features of the template image and of the search image are $F_z$ and $F_x$, with the number of channels $C$ set to 1024. The feature-fusion network contains a self-attention module and a cross-attention fusion module for the template-image features and the target search image, and a selective multi-head attention model is proposed to give both modules a more accurate characterization of the correlation between the template-image features and the target search image;
S2: training: a professional trains the constructed Transformer network. During training, ResNet50 uses publicly available pre-trained model parameters, and the ResNet50 network parameters are frozen and not adjusted. The classification loss associated with the classification branch is the binary cross-entropy $L_{cls}=-\sum_{j}\big[y_j\log p_j+(1-y_j)\log(1-p_j)\big]$, where $j$ is the index of each point on the classification map output by the classification branch, $y_j$ is the true class label of the $j$-th point (a point falling inside the target box is a positive sample with label $y_j=1$, a point outside the target box is a negative sample with label $y_j=0$), and $p_j$ is the predicted label value that the $j$-th point lies inside the target box. The regression loss $L_{reg}$ associated with the bounding-box regression branch is computed from the intersection-over-union (IoU) between the predicted and the real target box, where $B_i$ is the target box estimated from the predicted distances of the $i$-th point to the box boundary and $B$ is the real target box. The total loss function is $L=L_{cls}+\lambda\,L_{reg}$, where $\lambda$ is a positive balance parameter. Error back-propagation is performed on the training data set according to the loss function $L$ to obtain the parameter-tuned Transformer network;
S3: inference: a professional infers the target box with the Transformer network. For each subsequently input video frame, a target search image is cropped from the frame at twice the length and width of the target in the previous frame, and its length and width are scaled to twice the length and width of the target template image. The search image is fed into the Transformer network; the point with the maximum label value on the output classification map is taken as a point inside the target box, and the rectangular target box is deduced from the distances of this point to the four sides of the box.
Example 5
Referring to FIGS. 1-4, a visual tracking method based on an attention-adaptive selection Transformer comprises the following steps:
S1: constructing a network model: a professional constructs the network model, namely the adaptive attention-selection Transformer network model. Its structure comprises a backbone network that extracts convolutional features of the target and background appearance, a feature-fusion network, a classification output head and a bounding-box regression output head. A residual network ResNet50 is introduced in the backbone part to extract the target and background features; its structure is modified when it is introduced: the last stage of the ResNet50 network is removed and the first four stages are kept, the convolutions of the fourth stage are set to dilated (hole) convolutions with a dilation rate of 2 and a stride of 1, and the convolution result of the fourth stage is fed into the feature-fusion network as the backbone output feature. Let the input template image be $z \in \mathbb{R}^{H_z \times W_z \times 3}$ and the input search image be $x \in \mathbb{R}^{H_x \times W_x \times 3}$, where $H_z$, $W_z$, $H_x$ and $W_x$ are the heights and widths of the template and search images; after the residual network ResNet50, the convolutional features of the template image and of the search image are $F_z$ and $F_x$, with the number of channels $C$ set to 1024. The feature-fusion network contains a self-attention module and a cross-attention fusion module for the template-image features and the target search image, and a selective multi-head attention model is proposed to give both modules a more accurate characterization of the correlation between the template-image features and the target search image. With query vector $Q$, key vector $K$ and value vector $V$, the single-head attention function of the Transformer is $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$, where $d_k$ is the dimension of $Q$ and $K$. More correlation characterizations between feature vectors are obtained with a multi-head attention model whose number of heads is set to 8; the attention map output by a single head is $H_i=\mathrm{Attention}(Q,K,V)$, $i=\{0,1,\dots,7\}$. When fusing the multi-head outputs, the head results are combined linearly, and the adaptively selected attention-map result is $H_O=\big(\sum_{i=0}^{7} w_i H_i\big)W_O$, where $W_O$ is a weight matrix and $w_i$ is the linear-combination parameter of the $i$-th head, computed by first evaluating $s_i=\mathrm{FC}\big(\mathrm{AP}(H_i)\big)$, with $\mathrm{AP}(\cdot)$ the mean-pooling layer and $\mathrm{FC}(\cdot)$ the fully connected layer, and then passing the scores through a softmax layer, $w_i=\mathrm{softmax}(s_i)$. The self-attention module comprises self-attention over the target template-image features and over the target search-image features. For the template-image features, $Q$, $K$ and $V$ are all $F_z$ and $H_i=\mathrm{Attention}\big((F_z+P_z)W_i^Q,\,(F_z+P_z)W_i^K,\,F_z W_i^V\big)$, where $W_i^Q$, $W_i^K$ and $W_i^V$ are the projection matrices of the $i$-th head, whose purpose is to project the inputs into linear subspaces, and $P_z$ is the constant sinusoidal spatial position encoding of the Transformer; the multi-head linear combination of the outputs gives the attention-map result, and the original feature is enhanced in residual form to obtain the enhanced template feature $\hat{F}_z=F_z+H_O$. Likewise, the search-image feature $F_x$ is enhanced in residual form with its multi-head attention map, $\hat{F}_x=F_x+H_O$, using $H_i=\mathrm{Attention}\big((F_x+P_x)W_i^Q,\,(F_x+P_x)W_i^K,\,F_x W_i^V\big)$, where $P_x$ is the sinusoidal position encoding. The cross-attention fusion module uses the multi-head attention model $H_O=\big(\sum_i w_i H_i\big)W_O$ to compute the cross-attention between the enhanced template feature and the enhanced search feature, setting up two cross-attention computation branches in which the two enhanced features serve alternately as query and as key/value; a two-layer fully connected feed-forward network (FFN), $\mathrm{FFN}(x)=\max(0,\,xW_1+b_1)W_2+b_2$ with weight matrices $W_1$, $W_2$ and offset vectors $b_1$, $b_2$, is further introduced in both branches to apply a nonlinear transformation to the cross-attention map and improve the characterization ability of the features. The FFN outputs of the two branches are summed and fed into a classification branch and a bounding-box regression branch formed by two multi-layer perceptrons, which respectively estimate, for every position on the classification map, a label value indicating that the position is the centre of the target box, and the distances from each point to the four sides of the target box containing it;
S2: training: a professional trains the constructed Transformer network. During training, ResNet50 uses publicly available pre-trained model parameters, and the ResNet50 network parameters are frozen and not adjusted. The classification loss associated with the classification branch is $L_{cls}$; the regression loss $L_{reg}$ associated with the bounding-box regression branch is computed from the intersection-over-union (IoU) between the predicted target box and the real target box; the total loss function is $L=L_{cls}+\lambda\,L_{reg}$, and error back-propagation is performed on the training data set according to the loss function $L$ to obtain the parameter-tuned Transformer network.
The visual tracking methods based on an attention-adaptive selection Transformer of Examples 1 to 5 were tested, with the following results:
Compared with existing visual tracking methods, the visual tracking methods based on an attention-adaptive selection Transformer prepared in Examples 1 to 5 show significantly improved precision and success-rate performance indices, and Example 1 is the best embodiment.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art according to the technical scheme of the present invention and its inventive concept, within the scope disclosed by the present invention, shall be covered by the scope of protection of the present invention.

Claims (8)

1. A visual tracking method based on an attention-adaptive selection Transformer, comprising the following steps:
S1: constructing a network model: a professional constructs the network model;
S2: training: a professional trains the constructed Transformer network;
S3: inference: a professional infers the target box with the Transformer network.
2. The visual tracking method based on an attention-adaptive selection Transformer according to claim 1, wherein in step S1 a professional constructs the network model, namely the adaptive attention-selection Transformer network model, whose structure comprises a backbone network for extracting convolutional features of the target and background appearance, a feature-fusion network, a classification output head and a bounding-box regression output head; the target and background features are extracted by introducing a residual network ResNet50 in the backbone part, the structure of ResNet50 being modified when it is introduced: the last stage of the ResNet50 network is removed and the first four stages are kept, the convolutions of the fourth stage are set to dilated (hole) convolutions with a dilation rate of 2 and a stride of 1, and the convolution result of the fourth stage is fed into the feature-fusion network as the backbone output feature; the input template image is $z \in \mathbb{R}^{H_z \times W_z \times 3}$ and the input search image is $x \in \mathbb{R}^{H_x \times W_x \times 3}$, where $H_z$ and $W_z$ are the height and width of the template image and $H_x$ and $W_x$ are the height and width of the search image; after the residual network ResNet50 the convolutional feature of the template image is $F_z$ and the convolutional feature of the search image is $F_x$, where $C$, the number of channels of the convolutional features, is set to 1024.
3. The visual tracking method based on an attention-adaptively-selected Transformer according to claim 2, wherein the feature fusion network part comprises self-attention modules and a cross-attention fusion module for the template image features and the target search image features, and a selective multi-head attention model is proposed to provide, for the self-attention modules and the cross-attention fusion module, a more accurate characterization of the correlation between the template image features and the target search image features; in the attention calculation model, an attention map is calculated using the attention function of the Transformer: assuming that the query vector of the Transformer is Q, the key vector is K and the value vector is V, the single-head attention function in the Transformer is Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V, where d_k is the dimension of the query vector Q and the key vector K.
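As an illustration of this single-head attention function, a short PyTorch-style sketch (the function name and tensor layout are assumptions for the sketch):

```python
import math
import torch

def single_head_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (L_q, d_k), K: (L_k, d_k), V: (L_k, d_v)  ->  (L_q, d_v)
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return scores.softmax(dim=-1) @ V
```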
4. The visual tracking method based on an attention-adaptively-selected Transformer according to claim 3, wherein a richer correlation characterization between feature vectors is obtained by introducing a multi-head attention model, the number of heads being set to 8; in the attention model, the attention map output by the i-th head is denoted H_i = Attention(Q, K, V), where i ∈ {0, 1, …, 7}; when fusing the multi-head attention outputs, the attention results of the multiple heads are fused by a linear combination, and the attention-map result H_O after adaptive attention selection is expressed as H_O = Concat(w_0·H_0, w_1·H_1, …, w_7·H_7)·W_O, where W_O is a weight matrix and w_i is the linear combination parameter of the i-th head.
5. The visual tracking method based on an attention-adaptively-selected Transformer according to claim 4, wherein the linear combination parameter w_i is calculated as follows: first, s_i = FC(AP(H_i)) is computed, where AP(·) denotes a mean (average) pooling layer, FC(·) denotes a fully connected layer, and s_i is a weight score describing the attention map output by the i-th head; the linear combination parameter w_i of each head is then obtained through a softmax layer, i.e. w_i = softmax(s_i), normalized over the 8 heads. The self-attention module part comprises self-attention calculation of the target template image feature and self-attention calculation of the target search image feature; for the self-attention calculation of the template image feature, the query vector Q, the key vector K and the value vector V are all derived from the target template image feature F_z, with H_i = Attention((F_z+P_z)·W_i^Q, (F_z+P_z)·W_i^K, F_z·W_i^V), where W_i^Q, W_i^K and W_i^V denote the projection matrices of the i-th head corresponding to the query vector Q, the key vector K and the value vector V respectively, and P_z is the sinusoidal spatial positional encoding used in the Transformer; finally, according to the expression H_O^z = Concat(w_0·H_0, …, w_7·H_7)·W_O, the multi-head outputs are linearly combined into the multi-head attention-map result H_O^z, and the original convolution feature is enhanced in residual form with this attention map, using the expression F̃_z = F_z + H_O^z, where F̃_z is the enhanced convolution feature of the template image with better template characterization.
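The selective multi-head attention of claims 4 and 5 could be sketched as follows (illustration only). The exact placement of the average pooling, taking the softmax across the 8 heads, and concatenating the weighted heads before W_O are interpretations of the garbled expressions, not verbatim from the patent.

```python
import torch
import torch.nn as nn

class SelectiveMultiHeadAttention(nn.Module):
    """8-head attention whose head outputs are re-weighted by learned, input-dependent scores."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.heads, self.dh = heads, dim // heads
        self.q_proj = nn.Linear(dim, dim)       # stacks W_i^Q for all heads
        self.k_proj = nn.Linear(dim, dim)       # stacks W_i^K
        self.v_proj = nn.Linear(dim, dim)       # stacks W_i^V
        self.score_fc = nn.Linear(self.dh, 1)   # FC(.) applied to the pooled head output
        self.out_proj = nn.Linear(dim, dim)     # W_O

    def forward(self, q_in, k_in, v_in):
        # q_in, k_in, v_in: (L, dim); split projections into 8 heads of width dh
        L = q_in.size(0)
        Q = self.q_proj(q_in).view(L, self.heads, self.dh).transpose(0, 1)
        K = self.k_proj(k_in).view(-1, self.heads, self.dh).transpose(0, 1)
        V = self.v_proj(v_in).view(-1, self.heads, self.dh).transpose(0, 1)

        att = (Q @ K.transpose(-2, -1)) / self.dh ** 0.5
        H = att.softmax(dim=-1) @ V                    # (heads, L, dh): per-head maps H_i

        s = self.score_fc(H.mean(dim=1)).squeeze(-1)   # s_i = FC(AP(H_i)), one score per head
        w = s.softmax(dim=0)                           # w_i = softmax over the 8 heads
        H = H * w.view(self.heads, 1, 1)               # re-weight each head's output

        return self.out_proj(H.transpose(0, 1).reshape(L, -1))   # Concat(w_i H_i) W_O

# residual enhancement of the template feature (claim 5), with F_z flattened to (L, dim)
# and P_z a sinusoidal positional encoding of the same shape:
#   H_Oz = attn(F_z + P_z, F_z + P_z, F_z)
#   F_z_enhanced = F_z + H_Oz
```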
6. The visual tracking method based on an attention-adaptively-selected Transformer according to claim 5, wherein, for the convolution feature F_x of the target search image, the enhanced convolution feature F̃_x is likewise obtained in residual form from the multi-head attention map, using the expression F̃_x = F_x + H_O^x, where H_O^x is the multi-head attention-map result computed analogously to claim 5 with H_i = Attention((F_x+P_x)·W_i^Q, (F_x+P_x)·W_i^K, F_x·W_i^V), and P_x is the sinusoidal spatial positional encoding used in the Transformer; the cross-attention fusion module uses the multi-head attention model H_O = Concat(w_0·H_0, …, w_7·H_7)·W_O to calculate the cross attention between the enhanced convolution feature of the target template image and the enhanced convolution feature of the target search image, computing the attention maps through 2 cross-attention calculation branches; a 2-layer fully connected feed-forward neural network (FFN) is further introduced in each of the two branches to nonlinearly transform the cross-attention maps, using the expressions FFN_1(u) = W_2^(1)·ReLU(W_1^(1)·u + b_1^(1)) + b_2^(1) and FFN_2(u) = W_2^(2)·ReLU(W_1^(2)·u + b_1^(2)) + b_2^(2), where W_1^(1), W_2^(1), W_1^(2) and W_2^(2) are weight matrices and b_1^(1), b_2^(1), b_1^(2) and b_2^(2) are bias vectors; the FFN outputs of the two branches are summed and fed into a classification branch and a bounding-box regression branch formed by 2 multi-layer perceptrons, the two branches respectively estimating the label value of each point position on the classification map (indicating that the point lies inside the target box) and the distances from each point to the 4 sides of the target box in which it lies.
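A rough, illustrative sketch of how the two FFN-transformed cross-attention branches could feed the classification and regression heads, reusing the SelectiveMultiHeadAttention sketch above. How the two branches split queries, keys and values is not recoverable from the text, so the assignment below (search features as query in both branches), the feature dimension dim=256, and the 2-layer MLP heads are assumptions.

```python
import torch
import torch.nn as nn

class FFN(nn.Module):
    """2-layer fully connected feed-forward network: W2 ReLU(W1 u + b1) + b2."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, u):
        return self.net(u)

class FusionAndHeads(nn.Module):
    def __init__(self, dim=256, hidden=1024):
        super().__init__()
        self.cross1 = SelectiveMultiHeadAttention(dim)   # cross-attention branch 1
        self.cross2 = SelectiveMultiHeadAttention(dim)   # cross-attention branch 2
        self.ffn1, self.ffn2 = FFN(dim, hidden), FFN(dim, hidden)
        # classification head: per-point label value; regression head: 4 side distances
        self.cls_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.reg_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, fz, fx):   # fz: (Lz, dim) enhanced template, fx: (Lx, dim) enhanced search
        h1 = self.ffn1(self.cross1(fx, fz, fz))   # search queries attend to the template
        h2 = self.ffn2(self.cross2(fx, fz, fz))   # second branch (assumed assignment)
        h = h1 + h2                               # sum of the two branches' FFN outputs
        return self.cls_head(h).squeeze(-1), self.reg_head(h)   # (Lx,), (Lx, 4)
```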
7. The visual tracking method based on an attention-adaptively-selected Transformer according to claim 1, wherein in step S2 a professional trains the constructed Transformer network, the ResNet50 adopting publicly available pre-trained model parameters, and the ResNet50 network parameters being frozen and not adjusted during training; the classification loss function L_cls associated with the classification branch is the binary cross entropy L_cls = −Σ_j [ y_j·log(p_j) + (1 − y_j)·log(1 − p_j) ], where j denotes the index of each point on the classification map output by the classification branch; if the j-th point falls inside the target box it is a positive sample with class label y_j = 1, if the j-th point lies outside the target box it is a negative sample with class label y_j = 0, and p_j denotes the predicted label value that the j-th point is a point inside the target box; the regression loss function L_reg associated with the bounding-box regression branch is calculated from the intersection-over-union (IoU) between the predicted target box and the real target box, L_reg = Σ_i (1 − IoU(B_i, B)), where B_i is the target box estimated from the predicted distances from the i-th point to the boundaries of the target box and B is the real target box; the overall loss function obtained through this process is L = L_cls + λ·L_reg, where λ is a positive balance parameter, and error back-propagation is performed on the training dataset according to the loss function L to obtain a parameter-adjusted Transformer network.
8. The visual tracking method based on an attention-adaptively-selected Transformer according to claim 1, wherein in step S3 a professional infers the target box through the Transformer; during inference, for each subsequently input video frame, a target search image is cropped from the input video image at 2 times the length and width of the target in the previous frame, rescaled so that its length and width are 2 times the length and width of the target template image, and fed into the Transformer network; the point with the maximum label value on the output classification map is taken as a point inside the target box, and the rectangular target box is inferred from the predicted distances from that point to the four sides of the box.
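An illustrative sketch of this inference step: cropping the search region at twice the previous target size, running the network, and decoding the box from the arg-max point of the classification map. The function name, the use of OpenCV for resizing, the feature stride of 8, and the assumption that the crop lies fully inside the frame are all simplifications for the sketch.

```python
import cv2
import numpy as np

def track_one_frame(frame, prev_box, template_size, net, stride=8):
    """frame: (H, W, 3) image; prev_box: (cx, cy, w, h) of the target in the previous frame;
    template_size: (h_z, w_z) of the target template image; net: the trained tracker."""
    cx, cy, w, h = prev_box

    # crop the search region at 2x the previous target's width and height, centred on it
    x1, y1 = int(cx - w), int(cy - h)
    crop = frame[y1: y1 + int(2 * h), x1: x1 + int(2 * w)]

    # rescale so its height and width are 2x the template's height and width
    out_h, out_w = 2 * template_size[0], 2 * template_size[1]
    sy, sx = crop.shape[0] / out_h, crop.shape[1] / out_w
    search = cv2.resize(crop, (out_w, out_h))

    cls_map, dists = net(search)              # (Hc, Wc) label map, (Hc, Wc, 4) distances l, t, r, b
    j = np.unravel_index(np.argmax(cls_map), cls_map.shape)
    px, py = j[1] * stride, j[0] * stride     # arg-max point, in search-image coordinates
    l, t, r, b = dists[j]

    # rectangle from the point's four side distances, mapped back to frame coordinates
    fx1, fy1 = x1 + (px - l) * sx, y1 + (py - t) * sy
    fx2, fy2 = x1 + (px + r) * sx, y1 + (py + b) * sy
    return (fx1 + fx2) / 2, (fy1 + fy2) / 2, fx2 - fx1, fy2 - fy1
```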
CN202310358214.5A 2023-04-06 2023-04-06 Visual tracking method based on attention self-adaptive selection of transducer Pending CN116485839A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310358214.5A CN116485839A (en) 2023-04-06 2023-04-06 Visual tracking method based on attention self-adaptive selection of transducer


Publications (1)

Publication Number Publication Date
CN116485839A true CN116485839A (en) 2023-07-25

Family

ID=87226090




Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117292338A (en) * 2023-11-27 2023-12-26 山东远东保险公估有限公司 Vehicle accident identification and analysis method based on video stream analysis
CN117292338B (en) * 2023-11-27 2024-02-13 山东远东保险公估有限公司 Vehicle accident identification and analysis method based on video stream analysis
CN117649582A (en) * 2024-01-25 2024-03-05 南昌工程学院 Single-flow single-stage network target tracking method and system based on cascade attention
CN117649582B (en) * 2024-01-25 2024-04-19 南昌工程学院 Single-flow single-stage network target tracking method and system based on cascade attention


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination